Model Compression Techniques: An Introductory and Comparative Guide

Learn the fundamentals of popular model compression techniques and the trade-offs of each approach

As an on-device machine learning engineer, you often face the challenge of deploying models on resource-constrained devices. After going through the typical ML pipeline — data collection, preprocessing, and designing a high-performance model — you may find that the trained model is too large or computationally expensive for devices like mobile phones, IoT systems, or edge devices. Understanding the resource constraints of your target hardware and optimizing your model to meet those requirements is key. This process is called model compression. In this blog post, we’ll look at what model compression is, walk through key techniques such as pruning, quantization, and knowledge distillation, and examine the trade-offs associated with each category of techniques.

Table of Contents

  • What is Model Compression?
  • Why is Model Compression Important?
  • Common Methods of Model Compression
    - Overview and Comparison of Pruning Techniques
    - Overview and Comparison of Quantization Techniques
    - Overview and Comparison of Knowledge Distillation Techniques
  • Other Model Compression Techniques
  • Conclusion

Glossary

  • TinyML: A field of machine learning focused on designing small, efficient models specifically for deployment on microcontrollers and resource-limited hardware.
  • IoT (Internet of Things): A network of physical devices embedded with sensors, software, and other technologies that enable them to collect and exchange data over the Internet.
  • TFLite (TensorFlow Lite): A suite of tools designed to convert and optimize TensorFlow models for execution on mobile and edge devices. Google recently renamed this framework to LiteRT (short for Lite Runtime).

What is Model Compression?

Model compression refers to techniques used to reduce the size and complexity of machine learning models without significantly compromising their accuracy. As models grow in size — especially deep learning models with millions or even billions of parameters — they demand more computational power, memory, and storage, which can be impractical for deployment in resource-constrained environments like mobile devices or edge devices. Model compression helps address this challenge by making models more efficient and easier to deploy.

Why is Model Compression Important?

Model compression is especially crucial in the context of TinyML and edge AI, where hardware is constrained by limited power, memory, and storage. It is a key enabler for making on-device ML feasible.

Here are the main benefits of model compression:

Reduced storage requirements: Compressed models occupy less space on devices. For example, a smaller model bundled into an iPhone app reduces the storage the app needs on the phone.

Faster downloads: Smaller models require less time and bandwidth to download, enhancing user experience, particularly in low-bandwidth environments.

Lower memory consumption: Compressed models use less RAM during execution, freeing up memory for other application components, which can lead to improved performance and stability.

Common Methods of Model Compression

There are several approaches to compressing a model; pruning, quantization, and knowledge distillation are among the most popular, though they are not the only ones available. These techniques reduce the model’s size, computational load, and energy consumption, enabling faster inference times and more affordable deployments while maintaining an acceptable level of performance.

Let’s dive into these techniques:

Pruning

Pruning achieves model compression by removing unnecessary or less important parameters (weights or connections) of a neural network. It is analogous to how the brain prioritizes key connections between neurons, de-emphasizing less essential pathways to focus on the most important ones.

After pruning, it is important to evaluate the model on a relevant dataset to ensure that its accuracy and efficiency meet the required standards. 

Structured vs. Unstructured Pruning (Image by the author)

Here are the main pruning approaches:

Training-time pruning

Training-time pruning incorporates the pruning process directly into the neural network’s training phase. As the model is trained, sparsity is encouraged and less important connections and neurons are removed as part of the optimization. This means pruning decisions are made alongside weight updates during each training iteration, resulting in better accuracy retention compared to post-training pruning. The other advantage is that training-time pruning can act as a regularization technique, reducing overfitting by forcing the model to rely only on the most relevant parameters.

One main drawback is that pruning during training can alter the model’s optimization landscape, which may sometimes lead to convergence difficulties. Careful tuning of pruning rates and retraining techniques is often necessary to avoid disrupting the model’s learning process.
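
To make this concrete, here is a minimal sketch of training-time magnitude pruning using the TensorFlow Model Optimization Toolkit (tfmot). The sparsity schedule, epoch count, and the `model`, `x_train`, and `y_train` names are illustrative placeholders, not a prescribed recipe.

```python
# Minimal sketch of training-time magnitude pruning with tfmot.
# `model`, `x_train`, and `y_train` are placeholders for your own
# Keras model and training data.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Gradually raise sparsity from 0% to 80% over the course of training.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.8,
    begin_step=0, end_step=2000)

pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)

pruned_model.compile(optimizer="adam",
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])

# The UpdatePruningStep callback applies the pruning masks at each step.
pruned_model.fit(x_train, y_train, epochs=5,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers before export so the final model stays small.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
```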

Post-training pruning

Post-training pruning is applied after the model has been fully trained. In this approach, pruning techniques identify and remove redundant or less important weights, nodes, and larger structures within the network. While post-training pruning reduces the model size, fine-tuning the pruned network may be necessary to recover any lost accuracy and maintain performance. Post-training pruning can be categorized into two types:

  • Unstructured pruning removes individual weights without any structural constraint, operating at a fine-grained level that allows for very precise pruning. This flexibility enables the removal of a large number of parameters while maintaining model performance. The main drawback is that pruning results in sparse matrix operations, a type of computation that tends to be slow on most hardware. As a result, despite being smaller, pruned models can be nearly as slow as their full-sized counterparts (a minimal sketch of this approach follows this list).
  • Structured pruning involves cutting out entire weight structures such as neurons, layers, filters, and blocks. Because structured pruning removes larger portions of the model, only a limited amount of pruning can be done before significantly affecting accuracy. However, by preserving the model’s structure, the pruned models retain dense matrix operations, allowing them to remain fast on most hardware.
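
As referenced above, here is a minimal sketch of post-training unstructured (magnitude) pruning applied directly to a trained Keras model’s weight tensors. The 50% sparsity level and the choice to prune only kernels (leaving biases untouched) are illustrative assumptions.

```python
# Minimal sketch of post-training unstructured magnitude pruning: weights whose
# absolute value falls below a per-tensor percentile are zeroed out.
# `model` is assumed to be an already-trained tf.keras model.
import numpy as np
import tensorflow as tf

def magnitude_prune(model: tf.keras.Model, sparsity: float = 0.5) -> tf.keras.Model:
    for layer in model.layers:
        weights = layer.get_weights()
        if not weights:
            continue
        pruned = []
        for w in weights:
            if w.ndim < 2:              # skip biases and other 1-D parameters
                pruned.append(w)
                continue
            threshold = np.percentile(np.abs(w), sparsity * 100)
            mask = np.abs(w) >= threshold   # keep only the largest-magnitude weights
            pruned.append(w * mask)         # zero out the rest (same shape, now sparse)
        layer.set_weights(pruned)
    return model

# After pruning, re-evaluate (and optionally fine-tune) to check accuracy, e.g.:
# pruned = magnitude_prune(model, sparsity=0.7)
# pruned.evaluate(x_val, y_val)
```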

Comparison of Pruning Techniques 

Pruning techniques for model compression vary in terms of compression rate, accuracy retention, and computational trade-offs. Post-training unstructured pruning provides a high compression rate with flexibility in pruning individual weights, allowing for fine-grained control. However, it can lead to slower inference speeds on most hardware due to sparse matrix operations. In contrast, post-training structured pruning removes entire structures like nodes or filters, which achieves better hardware efficiency with dense matrix computations, though it may result in greater accuracy loss and a slightly lower compression rate. Training-time pruning incorporates pruning directly within the training process, allowing for enhanced accuracy retention and improved regularization, as the model adapts to pruning gradually. However, this approach is computationally intensive and may lead to convergence issues, as the model undergoes continuous structural changes during training. Each method has its advantages, making the choice dependent on the specific requirements for compression rate, accuracy, and hardware compatibility.

Comparison of pruning techniques (Image by the author)

Quantization

Quantization is a way to make models faster and lighter by using smaller, low-precision numbers to represent data. In addition, full quantization (moving all calculations to int8) opens the door to many embedded systems that can’t handle floating-point processing, enabling a wide range of applications across the IoT landscape.

There are many variations of quantization. Depending on the variant, quantization converts the inputs, outputs, weights, and/or activations of a model from high-precision representations (e.g., float32) to lower-precision representations (e.g., float16, int16, int8, and even int2). This reduces the memory and computation costs of running the model, making it much more efficient without losing too much accuracy.

Quantization from Float32 to Int8 (Image by author)
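
To make the float32-to-int8 mapping concrete, here is a minimal numpy sketch of asymmetric (affine) quantization. The scale and zero-point derivation shown is the standard textbook form and is illustrative rather than any particular framework’s implementation.

```python
# Minimal sketch of asymmetric (affine) quantization from float32 to int8:
# scale and zero_point map the tensor's observed [min, max] range onto [-128, 127].
import numpy as np

def quantize_int8(x: np.ndarray):
    qmin, qmax = -128, 127
    x_min, x_max = float(x.min()), float(x.max())
    scale = max((x_max - x_min) / (qmax - qmin), 1e-8)   # avoid division by zero
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(5).astype(np.float32)
q, scale, zp = quantize_int8(x)
print(x, dequantize(q, scale, zp))   # the round trip shows a small quantization error
```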

Here are some of the well-known quantization techniques:

Post-training quantization: This category includes general techniques that reduce CPU and hardware accelerator latency, processing, power, and model size with little degradation in model accuracy, all applied to an already-trained model. Here are some examples of post-training quantization supported by the LiteRT (the new name of TFLite) framework; a converter sketch covering all three follows the list:

  • Post-training float16 quantization: This method converts weights to 16-bit floating-point values during model conversion, resulting in a 2x reduction in model size since all weights become half of their original size. Float16 quantization is useful in GPU deployment scenarios: GPUs can compute natively in this reduced-precision arithmetic, achieving faster computation than traditional 32-bit floating-point execution.
  • Post-training dynamic range quantization: This quantization method statically converts only the weights from floating-point to integer at conversion time, offering 8-bit precision. To further decrease inference latency, dynamic-range operators dynamically quantize activations to 8 bits based on their value ranges, allowing computations to be performed with 8-bit weights and activations. This optimization achieves latencies approaching those of fully fixed-point inferences. However, outputs are still stored in floating-point format, meaning that while dynamic-range operations provide significant speed-ups, they fall slightly short of the performance gains of full fixed-point computation. Dynamic range quantization achieves a 4x reduction in the model size. 
  • Post-training integer quantization: Integer quantization converts floating point values such as weights and activation outputs to the nearest 8-bit fixed-point values. This approach reduces model size and boosts inference speed, making it ideal for edge devices. Additionally, full integer quantization is necessary for integer-only devices. You can choose the level of quantization for your model. In “full integer quantization,” all weights, activations, inputs, and outputs are converted into 8-bit integer data. In contrast, “integer with float fallback quantization” leaves the input and output values as 32-bit floating points, making this model incompatible with integer-only devices.
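
As mentioned above, the three flavors map onto a few converter settings in LiteRT/TFLite. Here is a minimal sketch assuming an already-trained Keras model `model` and a placeholder calibration generator you would supply for full integer quantization.

```python
# Minimal sketch of LiteRT/TFLite post-training quantization. `model` is an
# already-trained tf.keras model; `calibration_samples` is a placeholder for a
# few hundred typical input examples used to calibrate full integer quantization.
import tensorflow as tf

def representative_data_gen():
    for sample in calibration_samples:   # placeholder calibration data
        yield [sample]

# 1) Float16 quantization: weights stored as float16 (~2x smaller).
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
fp16_model = converter.convert()

# 2) Dynamic range quantization: int8 weights, activations quantized at runtime (~4x smaller).
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
dynamic_model = converter.convert()

# 3) Full integer quantization: int8 weights, activations, inputs, and outputs,
#    calibrated with the representative dataset (required for integer-only devices).
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
int8_model = converter.convert()
```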

Quantization-aware training: This technique simulates the effects of quantization during training, allowing models to be fully quantized by downstream tools after training. Essentially, the model is trained with an awareness that it will later be converted to a lower precision format. As a result, these quantized models operate with lower precision — such as 8-bit integers instead of 32-bit floats. Quantization-aware training typically provides higher accuracy than post-training quantization.
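
As a rough sketch of what this looks like in practice with the TensorFlow Model Optimization Toolkit (again treating `model`, `x_train`, and `y_train` as placeholders):

```python
# Minimal sketch of quantization-aware training with tfmot: fake-quantization
# ops are inserted so the model learns to tolerate int8 precision, then the
# fine-tuned model is converted to a fully quantized LiteRT/TFLite model.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

qat_model = tfmot.quantization.keras.quantize_model(model)   # wrap with fake-quant ops
qat_model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
qat_model.fit(x_train, y_train, epochs=3)   # fine-tune with quantization simulated

converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_tflite_model = converter.convert()
```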

Comparison of Quantization Techniques

The table below provides an overview of various quantization techniques, summarizing their key pros and cons. Quantization techniques offer a range of trade-offs between compression rate, accuracy retention, and hardware compatibility. Post-training float16 quantization reduces the model size by half and provides faster computation on compatible CPU and GPU hardware, but it has the smallest compression rate and is incompatible with integer-only devices. Post-training dynamic range quantization offers a significant compression rate, making models four times smaller with a 2x-3x speedup, though it also requires floating-point support. Post-training full integer quantization achieves the highest compression rate and speedup (4x smaller, 3x+ speedup) and is compatible with a broad range of hardware, including integer-only devices like microcontrollers and ML accelerators, but it tends to have a higher accuracy impact. Quantization-aware training provides the best accuracy retention among these methods, with only a marginal accuracy loss, making it suitable for highly resource-constrained environments like FPGAs and ASICs, though it is more computationally intensive due to the training requirements. Each method addresses different needs, from maximizing speed and compression on limited hardware to minimizing accuracy loss in edge deployments.

Comparison of different quantization techniques (Image by the author)

Knowledge Distillation

Knowledge distillation, also known as model distillation, aims to transfer the learnings of a large, complex model (the teacher model) to a smaller, deployable one (the student model) while preserving performance. The distillation process entails training the small student network to mimic the behavior of the large and complex teacher network by learning from its predictions or internal features. This approach is a form of supervised learning in which the student learns to minimize the difference between its own predictions and those of the teacher model.
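
For concreteness, here is a minimal Keras sketch of this objective in its classic response-based form: the student matches the teacher’s temperature-softened predictions while also fitting the ground-truth labels. The temperature `T` and mixing weight `alpha` are illustrative hyperparameters, and `teacher` and `student` are assumed to be models that output logits.

```python
# Minimal sketch of a response-based distillation training step in Keras.
# `teacher` and `student` are assumed tf.keras models producing logits;
# T softens the teacher's distribution, alpha balances hard vs. soft losses.
import tensorflow as tf

def distillation_step(teacher, student, optimizer, x, y, T=4.0, alpha=0.1):
    teacher_logits = teacher(x, training=False)       # frozen teacher predictions
    with tf.GradientTape() as tape:
        student_logits = student(x, training=True)
        # Hard loss: student vs. ground-truth labels.
        hard = tf.keras.losses.sparse_categorical_crossentropy(
            y, student_logits, from_logits=True)
        # Soft loss: student vs. teacher's softened distribution (scaled by T^2).
        soft = tf.keras.losses.kl_divergence(
            tf.nn.softmax(teacher_logits / T),
            tf.nn.softmax(student_logits / T)) * (T ** 2)
        loss = tf.reduce_mean(alpha * hard + (1.0 - alpha) * soft)
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss
```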

Knowledge distillation can be categorized into three types: response-based, feature-based, and relation-based distillation, each focusing on different aspects of knowledge transfer between the teacher and student models.

Three types of Knowledge Distillation (Image by the author)

  • Response-based Distillation is the most straightforward and widely used approach in knowledge distillation. In this method, the student model is trained to replicate the output predictions of the teacher model. By focusing only on the teacher’s final layer predictions, response-based distillation simplifies the training process, making it less computationally demanding. This technique is particularly useful when deploying lightweight models for tasks where overall prediction accuracy is more critical than internal representations. However, it may not capture the deeper, nuanced understanding of the teacher’s internal features, potentially limiting its performance in highly complex tasks.
  • Feature-based Distillation involves transferring the teacher model’s internal representations, such as hidden layer activations, to the student. Instead of only matching final outputs, the student is encouraged to learn intermediate features that the teacher has acquired during training. This approach helps the student to understand the task in a more nuanced way, potentially leading to higher accuracy and better generalization. However, because it requires alignment at multiple layers, feature-based distillation can be computationally intensive and memory-demanding, making it more suitable for applications where accuracy is a higher priority than resource constraints.
  • Relation-based Distillation focuses on teaching the student not just individual outputs or features, but the relationships between different data samples or layers within the teacher model. This method allows the student to learn the structural and relational information that the teacher has developed across different layers or groups of data points, fostering a stronger sense of generalization. Relation-based distillation tends to yield models with enhanced robustness and adaptability, though it requires more computational resources and greater expertise to implement effectively. This technique is ideal for complex tasks that benefit from understanding correlations and interactions within the data.

There are three main training methods in knowledge distillation:

  • Offline Distillation: This is the most common training method; it uses a pre-trained, frozen teacher model to train the student. The teacher is first trained on a dataset, and its knowledge is then distilled into the student model. This setup is particularly prevalent with large language models (LLMs), where the teacher is often a proprietary model that cannot be modified. The student learns from a stable, unchanging teacher, which makes training easier to manage but dependent on having an effective pre-trained teacher model.
  • Online Distillation: In this method, both teacher and student models are trained simultaneously. This approach is useful when a suitable pre-trained teacher is unavailable, or when the teacher model needs to adapt to specific conditions in real time. For instance, in a semantic segmentation model for live sports broadcasts, the teacher network is continuously trained on the evolving visual conditions of a match. It then distills its updated knowledge into a smaller student model that can quickly process and produce outputs in real-time. Online distillation allows the teacher to be dynamic and adaptive, though it is generally more complex to implement and manage than offline distillation. 
  • Self-Distillation: Self-distillation is a unique form of knowledge distillation where a single model serves as both the teacher and the student, transferring knowledge within itself rather than from an external teacher model. In self-distillation, a network’s deeper layers act as the “teacher” for its shallower layers, helping these shallower layers learn more effectively. This is achieved by adding intermediate “shallow classifiers” at various depths, which act as points where the model can learn from itself. However, once the model is fully trained and ready for deployment, these intermediate classifiers are removed, resulting in a smaller, more efficient model that retains the performance gains from training. 

Comparison of Knowledge Distillation Techniques

The table below provides an overview of various knowledge distillation techniques, summarizing their key pros and cons. 

Comparison of different distillation techniques in terms of knowledge transfer and training scheme (Image by the author)

Other Model Compression Techniques

There are many other model compression techniques beyond the ones above, such as low-rank factorization, weight clustering, and parameter sharing. We will delve into some of these techniques in future posts.

Conclusion

In an era where machine learning models are becoming increasingly sophisticated, deploying them on resource-constrained devices can be challenging. Model compression techniques such as pruning, quantization, and knowledge distillation enable these models to operate efficiently without sacrificing too much performance.

As more industries rely on edge computing, mastering these techniques becomes crucial for any on-device ML engineer. By applying the right compression methods, you can balance model performance with hardware limitations, paving the way for faster, lighter, and more scalable AI solutions.

Ready to dive deeper into model compression techniques? Stay tuned for my next article, where I’ll showcase practical examples of these model compression techniques. If you want to learn more or have specific questions, feel free to reach out in the comments or connect with me on LinkedIn.


