Model Compression Techniques: An Introductory and Comparative Guide
Learn the fundamentals of popular model compression techniques and the trade-offs of each approach
As an on-device machine learning engineer, you often face the challenge of deploying models on resource-constrained devices. After going through the typical ML pipeline — data collection, preprocessing, and designing a high-performance model — you may find that the trained model is too large or computationally expensive for devices like mobile phones, IoT systems, or edge devices. Understanding the resource constraints of your target hardware and optimizing your model to meet those requirements is key. This process is called model compression, and in this blog post, we’ll delve into what model compression is and walk through key techniques such as quantization, pruning, and knowledge distillation, along with the trade-offs associated with each.
Table of Contents
- Overview and Comparison of Pruning Techniques
- Overview and Comparison of Quantization Techniques
- Overview and Comparison of Knowledge Distillation Techniques
What is Model Compression?
Model compression refers to techniques used to reduce the size and complexity of machine learning models without significantly compromising their accuracy. As models grow in size — especially deep learning models with millions or even billions of parameters — they demand more computational power, memory, and storage, which can be impractical for deployment in resource-constrained environments like mobile devices or edge devices. Model compression helps address this challenge by making models more efficient and easier to deploy.
Why is Model Compression Important?
Model compression is especially crucial in the context of TinyML and edge AI, where hardware is constrained by limited power, memory, and storage. It is a key enabler for making on-device ML feasible.
Here are the main benefits of model compression:
- Reduced storage requirements: Compressed models occupy less space on devices. For example, a smaller model bundled with an iPhone app reduces the storage the app needs on the phone.
- Faster downloads: Smaller models require less time and bandwidth to download, enhancing user experience, particularly in low-bandwidth environments.
- Lower memory consumption: Compressed models use less RAM during execution, freeing up memory for other application components, which can lead to improved performance and stability.
Common Methods of Model Compression
There are several approaches to compressing a model; pruning, quantization, and knowledge distillation are among the most popular, but they’re not the only ones available. These techniques reduce the model’s size, computational load, and energy consumption, enabling faster inference times and more affordable deployments while maintaining an acceptable level of performance.
Let’s dive into these techniques:
Pruning
Pruning achieves model compression by removing unnecessary or less important parameters (weights or connections) of a neural network. Model pruning is analogous to how the brain prioritizes key connections between neurons, minimizing the use of less essential pathways to focus on the most important ones.
After pruning, it is important to evaluate the model on a relevant dataset to ensure that its accuracy and efficiency meet the required standards.
Here are the main pruning approaches:
Training-time pruning
Training-time pruning incorporates the pruning process directly into the neural network’s training phase. As the model trains, sparsity is encouraged and less important connections and neurons are progressively removed as part of the optimization. This means pruning decisions are made alongside weight updates during each training iteration, resulting in better accuracy retention compared to post-training pruning. The other advantage is that training-time pruning can act as a regularization technique, reducing overfitting by forcing the model to rely only on the most relevant parameters.
One main drawback is that pruning during training can alter the model’s optimization landscape, which may sometimes lead to convergence difficulties. Careful tuning of pruning rates and retraining techniques is often necessary to avoid disrupting the model’s learning process.
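To make this concrete, below is a minimal sketch of training-time magnitude pruning using the TensorFlow Model Optimization Toolkit. The architecture, the dummy data, and the schedule values are illustrative placeholders rather than recommendations; in practice you would tune the target sparsity and let the schedule span your full training run.

```python
import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder model and data; substitute your own architecture and dataset.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
x_train = np.random.rand(512, 784).astype("float32")
y_train = np.random.randint(0, 10, size=(512,))

# Gradually increase sparsity from 0% to 50% over the first 100 optimizer steps
# (in a real run, end_step should cover the full training schedule).
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=100)

# Wrap the model so low-magnitude weights are zeroed out as training proceeds.
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)

pruned_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])

# UpdatePruningStep keeps the pruning masks in sync with the optimizer steps.
pruned_model.fit(
    x_train, y_train, epochs=5, batch_size=32,
    callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers before exporting the compressed model.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
```

Note that the zeroed weights remain stored in the stripped model; the size savings are realized when the model is exported with weight compression or run on a sparsity-aware runtime.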
Post-training pruning
Post-training pruning is applied after the model has been fully trained. In this approach, pruning techniques identify and remove redundant or less important weights, nodes, or larger structures within the network. While post-training pruning reduces the model size, fine-tuning the pruned network may be necessary to recover any lost accuracy and maintain performance. Post-training pruning can be categorized into two types: unstructured pruning, which removes individual weights regardless of where they sit in the network, and structured pruning, which removes entire structures such as neurons, channels, or filters.
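As an illustration, here is a minimal, framework-agnostic sketch of unstructured post-training pruning by weight magnitude; the weight matrix is a toy stand-in for a trained layer, and the 50% sparsity target is arbitrary. A structured variant would instead drop whole rows, columns, or filters based on a group norm.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Unstructured post-training pruning: zero the smallest-magnitude
    fraction of entries in a trained weight matrix."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # Threshold = magnitude of the k-th smallest weight.
    threshold = np.partition(flat, k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

# Toy stand-in for a trained 4x4 weight matrix.
w = np.array([[0.80, -0.05, 0.30, 0.01],
              [-0.60, 0.02, -0.90, 0.40],
              [0.07, 0.50, -0.03, 0.20],
              [0.10, -0.70, 0.04, -0.02]])
print(magnitude_prune(w, sparsity=0.5))
```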
Comparison of Pruning Techniques
Pruning techniques for model compression vary in terms of compression rate, accuracy retention, and computational trade-offs. Post-training unstructured pruning provides a high compression rate with flexibility in pruning individual weights, allowing for fine-grained control. However, it can lead to slower inference speeds on most hardware due to sparse matrix operations. In contrast, post-training structured pruning removes entire structures like nodes or filters, which achieves better hardware efficiency with dense matrix computations, though it may result in greater accuracy loss and a slightly lower compression rate. Training-time pruning incorporates pruning directly within the training process, allowing for enhanced accuracy retention and improved regularization, as the model adapts to pruning gradually. However, this approach is computationally intensive and may lead to convergence issues, as the model undergoes continuous structural changes during training. Each method has its advantages, making the choice dependent on the specific requirements for compression rate, accuracy, and hardware compatibility.
Recommended by LinkedIn
Quantization
Quantization is a way to make models faster and lighter by representing data with smaller, lower-precision numbers. In addition, full quantization (moving all calculations to int8) opens the door to many embedded systems that can’t handle floating-point processing, enabling a wide range of applications across the IoT landscape.
There are many variations of quantization. Depending on the type, quantization converts the inputs, outputs, weights, and/or activations of a model from high-precision representations (e.g., float32) to lower-precision representations (e.g., float16, int32, int16, int8, and even int2). This reduces the memory and computation costs of running the model, making it much more efficient without losing too much accuracy.
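To ground the idea, here is a minimal sketch of the affine (asymmetric) min-max scheme commonly used to map float32 values to int8; it follows the generic q = round(x / scale) + zero_point mapping rather than any particular framework's internals.

```python
import numpy as np

def quantize_int8(x):
    """Affine (asymmetric) quantization of a float32 tensor to int8."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    scale = max(scale, 1e-8)  # guard against constant tensors
    zero_point = int(np.clip(round(qmin - x.min() / scale), qmin, qmax))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 values back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(5).astype(np.float32)
q, scale, zp = quantize_int8(x)
print(x)
print(dequantize(q, scale, zp))  # close to x, within quantization error
```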
Here are some of the well-known quantization techniques:
Post-training quantization: This covers a family of techniques that reduce CPU and hardware-accelerator latency, processing, power, and model size with little degradation in model accuracy. These techniques are applied to an already-trained model. Here are some examples of post-training quantization supported by the LiteRT (the new name of TFLite) framework, with a conversion sketch after the list:
- Dynamic range quantization: weights are stored as 8-bit integers and activations are quantized dynamically at inference, with no calibration data required.
- Full integer quantization: weights and activations (and optionally inputs and outputs) are quantized to integers using a small representative dataset for calibration, enabling integer-only hardware.
- Float16 quantization: weights are converted to 16-bit floats, halving the model size on hardware with floating-point support.
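Here is a minimal conversion sketch for the three options above using the LiteRT/TensorFlow Lite converter API. The saved-model path, input shape, and calibration generator are placeholders for your own exported model and data.

```python
import tensorflow as tf

saved_model_dir = "path/to/saved_model"  # placeholder: your exported model

# Dynamic range quantization: weights become int8, no calibration data needed.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
dynamic_range_model = converter.convert()

# Full integer quantization: a representative dataset calibrates activation
# ranges so the whole graph (including inputs/outputs) can run in int8.
def representative_data_gen():
    for _ in range(100):
        # placeholder calibration samples shaped like the model's input
        yield [tf.random.uniform([1, 224, 224, 3])]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
full_int8_model = converter.convert()

# Float16 quantization: halves the model size on float-capable hardware.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
float16_model = converter.convert()
```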
Quantization-aware training: This technique simulates the effects of quantization during training, allowing models to be fully quantized by downstream tools after training. Essentially, the model is trained with an awareness that it will later be converted to a lower precision format. As a result, these quantized models operate with lower precision — such as 8-bit integers instead of 32-bit floats. Quantization-aware training typically provides higher accuracy than post-training quantization.
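Below is a minimal quantization-aware training sketch using the TensorFlow Model Optimization Toolkit; the model, data, and epoch count are placeholders. The wrapper inserts fake-quantization ops so the network learns weights that survive the later int8 conversion.

```python
import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder model and data; substitute your own architecture and dataset.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])

# Insert fake-quantization ops so training simulates int8 precision.
qat_model = tfmot.quantization.keras.quantize_model(model)
qat_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])

x_train = np.random.rand(512, 784).astype("float32")
y_train = np.random.randint(0, 10, size=(512,))
qat_model.fit(x_train, y_train, epochs=3, batch_size=32)

# Convert the quantization-aware model to a fully quantized LiteRT model.
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_tflite_model = converter.convert()
```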
Comparison of Quantization Techniques
The table below provides an overview of these quantization techniques, summarizing their key pros and cons.

| Technique | Size reduction | Speedup | Accuracy impact | Hardware compatibility |
| --- | --- | --- | --- | --- |
| Post-training float16 quantization | 2x smaller | Faster on compatible CPUs/GPUs | Little degradation | Incompatible with integer-only devices |
| Post-training dynamic range quantization | 4x smaller | 2x-3x | Little degradation | Requires floating-point support |
| Post-training full integer quantization | 4x smaller | 3x+ | Higher accuracy impact | Broad, including integer-only microcontrollers and ML accelerators |
| Quantization-aware training | Similar to full integer quantization after conversion | Similar to full integer quantization | Marginal loss (best retention) | Highly constrained hardware such as FPGAs and ASICs |

Quantization techniques offer a range of trade-offs between compression rate, accuracy retention, and hardware compatibility. Post-training float16 quantization reduces the model size by half and provides faster computation on compatible CPU and GPU hardware, but it has the smallest compression rate and is incompatible with integer-only devices. Post-training dynamic range quantization offers a significant compression rate, making models four times smaller with a 2x-3x speedup, though it also requires floating-point support. Post-training full integer quantization achieves the highest compression rate and speedup (4x smaller, 3x+ speedup) and is compatible with a broad range of hardware, including integer-only devices like microcontrollers and ML accelerators, but it tends to have a higher accuracy impact. Quantization-aware training provides the best accuracy retention among these methods, with only a marginal accuracy loss, making it suitable for highly resource-constrained environments like FPGAs and ASICs, though it is more computationally intensive due to the training requirements. Each method addresses different needs, from maximizing speed and compression on limited hardware to minimizing accuracy loss in edge deployments.
Knowledge Distillation
Knowledge distillation, also known as model distillation, aims to transfer what a large, complex model (the teacher) has learned to a smaller, deployable one (the student) while preserving performance. The distillation process entails training the small student network to mimic the behavior of the large teacher network by learning from its predictions or internal features. This approach is a form of supervised learning in which the student is trained to minimize the difference between its own predictions and those of the teacher model.
Knowledge distillation can be categorized into three types: response-based, feature-based, and relation-based distillation, each focusing on different aspects of knowledge transfer between the teacher and student models.
There are three main training methods in knowledge distillation: offline distillation, where a pre-trained, frozen teacher guides the student; online distillation, where teacher and student are trained simultaneously; and self-distillation, where a single network acts as its own teacher. A minimal sketch of response-based, offline distillation follows.
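In the sketch below, the student is trained on a weighted mix of the usual hard-label loss and a temperature-softened divergence from the teacher's predictions. The toy logits, temperature, and mixing weight alpha are illustrative placeholders.

```python
import numpy as np
import tensorflow as tf

def distillation_loss(labels, student_logits, teacher_logits,
                      temperature=4.0, alpha=0.5):
    """Response-based distillation: soft-target KL term + hard-label CE term."""
    soft_teacher = tf.nn.softmax(teacher_logits / temperature)
    soft_student = tf.nn.softmax(student_logits / temperature)
    # Scale the soft term by T^2 so its gradient magnitude stays comparable.
    soft_loss = tf.keras.losses.KLDivergence()(soft_teacher, soft_student)
    soft_loss *= temperature ** 2
    hard_loss = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            labels, student_logits, from_logits=True))
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Toy logits standing in for a trained teacher and a smaller student.
labels = np.array([0, 2, 1])
teacher_logits = tf.constant([[5.0, 1.0, 0.5], [0.2, 0.3, 4.0], [0.1, 3.0, 0.2]])
student_logits = tf.constant([[2.0, 0.5, 0.3], [0.1, 0.2, 1.5], [0.2, 1.0, 0.1]])
print(distillation_loss(labels, student_logits, teacher_logits).numpy())
```

In a full training loop, this loss would replace the standard cross-entropy for the student while the teacher's weights stay frozen.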
Comparison of Knowledge Distillation Techniques
Each approach trades implementation complexity against the richness of the transferred knowledge. Response-based distillation is the simplest to set up, since it only needs the teacher’s output predictions, while feature-based and relation-based distillation can transfer more information from intermediate layers and inter-sample relationships at the cost of a more involved training setup. Similarly, offline distillation is straightforward when a strong pre-trained teacher is available, whereas online and self-distillation remove the need for a separate pre-trained teacher but complicate the training loop.
Other Model Compression Techniques
There are many other model compression techniques beyond the ones above such as low-rank factorization, weight clustering, and parameter sharing. We will delve into some of these model compression techniques in future posts.
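As a quick taste of one of them, here is a minimal sketch of low-rank factorization: a trained dense layer’s weight matrix is approximated by the product of two thinner matrices via truncated SVD, shrinking the parameter count when the chosen rank is small. The matrix sizes and rank are arbitrary, and real trained weights are usually closer to low rank than the random matrix used in this toy example.

```python
import numpy as np

def low_rank_factorize(weights, rank):
    """Approximate W (m x n) as U @ V with U (m x rank) and V (rank x n)."""
    u, s, vt = np.linalg.svd(weights, full_matrices=False)
    U = u[:, :rank] * s[:rank]  # fold the singular values into U
    V = vt[:rank, :]
    return U, V

# Toy "trained" weight matrix: 256 x 512 = 131,072 parameters.
w = np.random.randn(256, 512)
U, V = low_rank_factorize(w, rank=32)
# Factorized parameters: 256*32 + 32*512 = 24,576 (roughly 5x fewer).
print("relative error:", np.linalg.norm(w - U @ V) / np.linalg.norm(w))
```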
Conclusion
In an era where machine learning models are becoming increasingly sophisticated, deploying them on resource-constrained devices can be challenging. Model compression techniques such as pruning, quantization, and knowledge distillation enable these models to operate efficiently without sacrificing too much performance.
As more industries rely on edge computing, mastering these techniques becomes crucial for any on-device ML engineer. By applying the right compression methods, you can balance model performance with hardware limitations, paving the way for faster, lighter, and more scalable AI solutions.
Ready to dive deeper into model compression techniques? Stay tuned for my next article, where I’ll showcase practical examples of these model compression techniques. If you want to learn more or have specific questions, feel free to reach out in the comments or connect with me on LinkedIn.