Azizi Othman’s Post

Reducing the Size of AI Models

Running large AI models on edge devices

Image created using Pixlr

AI models, particularly Large Language Models (LLMs), need large amounts of GPU memory. For example, the LLaMA 3.1 model, released in July 2024, has the following memory requirements:

- The 8 billion parameter model needs 16 GB of memory with 16-bit floating point weights.
- The larger 405 billion parameter model needs 810 GB using 16-bit floats.

(A short arithmetic sketch at the end of this post works through these figures.)

In a full-sized machine learning model, the weights are represented as 32-bit floating point numbers. Modern models have hundreds of millions to tens (or even hundreds) of billions of weights. Training and running such large models is very resource-intensive:

- It takes lots of compute (processing power).
- It requires large amounts of GPU memory.
- It consumes large amounts of energy.

The biggest contributors to this energy consumption are:

- Performing a large number of computations (matrix multiplications) using 32-bit floats.
- Data transfer: copying the model data from memory to the processing units.

Being highly resource-intensive has two main drawbacks:

- Training: Models with large GPU requirements are expensive and slow to train. This limits new research and development to groups with big budgets.
- Inference: Large models need specialized (and expensive) hardware, such as dedicated GPU servers, to run. They cannot be run on consumer devices like regular laptops and mobile phones.

As a result, end users on personal devices must access AI models through a paid API service. This leads to a suboptimal experience for both consumer apps and their developers:

- It introduces latency due to network access and server load.
- It imposes budget constraints on developers building AI-based software.

Being able to run AI models locally, on consumer devices, would mitigate these problems. Reducing the size of AI models is therefore an active area of research and development.

This is the first of a series of articles discussing ways of reducing model size, in particular by a method called quantization. These articles are based on studying the original research papers. Throughout the series, you will find links to the PDFs of the reference papers.

- The current introductory article gives an overview of different approaches to reducing model size. It introduces quantization as the most promising method and as a subject of current research.
- Quantizing the Weights of AI Models illustrates the arithmetic of quantization using numerical examples.
- Quantizing Neural Network Models discusses the architecture and process of applying quantization to neural network models, including the basic mathematical principles. In particular, it focuses on how to train models to perform well during inference with quantized weights.
- Different Approaches to Quantization explains different types of quantization, such as quantizing to different precisions, the granularity of quantization, deterministic...
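The memory figures quoted above follow from a simple rule of thumb: weight storage ≈ parameter count × bytes per weight. Below is a minimal back-of-the-envelope sketch of that arithmetic (my own illustration, not from the article), assuming decimal gigabytes (1 GB = 10^9 bytes) and counting only the weights, not activations, KV cache, or optimizer state. The int8 and int4 columns hint at why quantization is attractive:

```python
# Back-of-the-envelope weight-memory estimates (illustration only).
BYTES_PER_WEIGHT = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 1e9

for name, params in [("LLaMA 3.1 8B", 8e9), ("LLaMA 3.1 405B", 405e9)]:
    sizes = ", ".join(f"{dtype}: {weight_memory_gb(params, dtype):,.0f} GB"
                      for dtype in BYTES_PER_WEIGHT)
    print(f"{name} -> {sizes}")

# LLaMA 3.1 8B -> fp32: 32 GB, fp16: 16 GB, int8: 8 GB, int4: 4 GB
# LLaMA 3.1 405B -> fp32: 1,620 GB, fp16: 810 GB, int8: 405 GB, int4: 202 GB
```

The series introduces quantization as the most promising of these size-reduction techniques. As a preview of the idea (again a toy example of my own, assuming simple symmetric int8 quantization; the follow-up articles cover the actual schemes and numerical details), here is how a handful of 32-bit weights can be stored as one-byte integers plus a single scale factor:

```python
import numpy as np

np.random.seed(0)
weights_fp32 = np.random.randn(6).astype(np.float32)   # toy "layer" of weights

# Symmetric int8 quantization: map the largest |weight| to 127.
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

# Dequantize to approximate the original values at inference time.
weights_dequant = weights_int8.astype(np.float32) * scale

print("fp32:", np.round(weights_fp32, 3))
print("int8:", weights_int8)                   # 1 byte per weight instead of 4
print("deq :", np.round(weights_dequant, 3))   # close to the originals
print("max quantization error:", np.abs(weights_fp32 - weights_dequant).max())
```

Storing int8 weights plus a per-tensor scale cuts weight memory roughly 4x relative to fp32, at the cost of a small rounding error that the later articles in the series discuss how to manage.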


towardsdatascience.com
