
DeepSparse: Enabling GPU-Level Inference on Your CPU

[Figure: MPT chat model inference comparison]

What’s New
DeepSparse accelerates inference for deep learning models on CPUs using neural network sparsity. Through sparse kernels, 8-bit quantization, pruning, and intelligent caching of attention keys/values, the library enables GPU-like inference speeds for LLMs on commodity CPUs.
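As a rough illustration, here is a minimal sketch of running a sparse, quantized LLM through DeepSparse's Python text-generation pipeline. The SparseZoo stub and generation settings below are placeholders, not identifiers taken from the article; substitute a real stub or a local path to a sparse ONNX export.

```python
# Minimal sketch: running a sparse, quantized LLM on CPU with DeepSparse.
# The model stub below is a placeholder; replace it with a real SparseZoo
# stub or a local path to a sparse ONNX model.
from deepsparse import TextGeneration

MODEL_STUB = "zoo:mpt-7b-chat-pruned60-quantized"  # placeholder stub

pipeline = TextGeneration(model=MODEL_STUB)

prompt = "Explain neural network sparsity in one sentence."
result = pipeline(prompt=prompt, max_new_tokens=64)

# The pipeline returns generated completions; print the first one.
print(result.generations[0].text)
```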

Why It Matters
Traditionally, deploying deep learning models on CPUs has meant trading away speed in exchange for scalability and flexibility. DeepSparse's efficient CPU inference lets you deploy performant models without being tied to dedicated accelerators, reducing both cost and the barrier to entry for building ML applications.

How It Works
The framework uses a method from a recent study, Sparse Fine-Tuning, to prune and quantize the model. For MosaicML's MPT-7B, this technique enables pruning to 60% sparsity without losing accuracy. With DeepSparse, the sparse model runs 7 times faster than the original dense model.
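To sanity-check a speedup claim like this on your own hardware, one rough approach is to time generation for a dense export and a sparse-quantized export of the same model. This is only a sketch, not a rigorous benchmark; both stubs are placeholders, and the tokens-per-second figure assumes the full token budget is generated.

```python
# Rough timing sketch (not a rigorous benchmark): compare approximate
# generation throughput for a dense vs. a sparse-quantized export.
# Both stubs are placeholders; point them at your own models.
import time
from deepsparse import TextGeneration

DENSE_STUB = "zoo:mpt-7b-chat-base"                 # placeholder
SPARSE_STUB = "zoo:mpt-7b-chat-pruned60-quantized"  # placeholder

def tokens_per_second(model_stub: str, prompt: str, new_tokens: int = 64) -> float:
    pipeline = TextGeneration(model=model_stub)
    pipeline(prompt=prompt, max_new_tokens=8)  # warm-up / compile pass
    start = time.perf_counter()
    pipeline(prompt=prompt, max_new_tokens=new_tokens)
    elapsed = time.perf_counter() - start
    # Approximate: assumes the model generated the full token budget.
    return new_tokens / elapsed

prompt = "Summarize the benefits of sparse inference on CPUs."
print("dense :", tokens_per_second(DENSE_STUB, prompt))
print("sparse:", tokens_per_second(SPARSE_STUB, prompt))
```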

Features

Broad Model Compatibility: Supports LLMs, BERT, ViT, ResNet, and other popular architectures (see the sketch after this list).

Versatile Deployment: Adaptable across various hardware, from cloud to edge.

Seamless Integration: Reduces deployment complexities and compatibility issues.
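As a small illustration of the compatibility point above, the same Pipeline interface covers non-LLM architectures as well. The task names follow DeepSparse's pipeline API, but the zoo stubs and the image file are placeholders for your own models and inputs.

```python
# Sketch: the same DeepSparse Pipeline interface can drive a sparse BERT
# text classifier and a sparse ResNet image classifier.
# Stubs and the image path are placeholders.
from deepsparse import Pipeline

# Sparse BERT for sentiment analysis (placeholder stub)
text_pipeline = Pipeline.create(
    task="sentiment-analysis",
    model_path="zoo:bert-base-sentiment-pruned90-quantized",
)
print(text_pipeline(["DeepSparse makes CPU inference practical."]))

# Sparse ResNet-50 for image classification (placeholder stub and image)
image_pipeline = Pipeline.create(
    task="image_classification",
    model_path="zoo:resnet50-imagenet-pruned90-quantized",
)
print(image_pipeline(images=["sample.jpg"]))
```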

