Quantization: Post Training Quantization, Quantization Error, and Quantization Aware Training

Last Updated on July 17, 2024 by Editorial Team

Author(s): JAIGANESAN

Originally published on Towards AI.

Photo by Jason Leung on Unsplash

Most of us have used open-source Large Language Models, VLMs, and multimodal models on our own systems or in Colab or Kaggle notebooks. You might have noticed that most of the time we use them in reduced-precision or quantized versions such as fp16, int8, or int4. Even though the model has been quantized, the generated output is still quite good.

This article gives you a comprehensive overview of why we need to quantize models, what quantization is, post-training quantization, quantization error, and quantization-aware training.

Why Do We Need to Quantize a Model? 🧐

In recent times, AI models have grown significantly in terms of their parameters. For example, consider the Mistral 7B model, which has approximately 7.2 billion parameters. If we were to store these parameters in float32 format, the model would require around 28–29 GB of HBM to load onto a GPU (1 billion parameters in float32 take approximately 4 GB). This is a large amount of GPU memory, which is not always available to average users.

To overcome this limitation, we often load models in lower precision like fp16, int8, and int4. By doing so, we can reduce the memory requirements. For example, loading the Mistral 7B model in fp16 would require only 14.5 GB of HBM in the GPU.

If we use an even lower precision, such as int4, the total memory required to load the Mistral 7B model is around 4 GB. The more aggressively we quantize the model, the less space we need to load it. At the same time we give up some accuracy, although the model can still perform many tasks well. This is why quantizing a model is essential in today’s AI landscape. Quantized models can also be used on mobile and edge devices.
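The memory figures above follow from simple arithmetic: number of parameters times bytes per parameter. Here is a minimal sketch of that calculation, assuming 7.2 billion parameters and 1 GB = 10^9 bytes. The helper name weight_memory_gb is made up for illustration, and the estimate ignores activations, the KV cache, and any quantization metadata.

```python
# Back-of-the-envelope weight memory at different precisions.
# Assumptions: 7.2e9 parameters, 1 GB = 1e9 bytes, weights only.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the memory needed to hold the weights, in GB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 7.2e9
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(num_params, dtype):.1f} GB")
```

The loop prints one line per precision, so you can compare the estimates with the sizes reported later in the article.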

Quantization 🦸‍♂️

Quantization means the conversion from higher precision to lower precision of parameters or weights.

In models, the parameters are typically float32 (single precision), 32-bit (4-byte) floating-point numbers. This 32-bit binary number has three components: sign, exponent, and mantissa (fraction). The high precision gives the model higher accuracy and greater expressive power.

Image 1: 32-bit IEEE format. Image by author

The first bit, the sign bit, indicates the sign of the number: 0 means a positive number, and 1 represents a negative number. The next 8 bits are the exponent bits. The exponent is stored in a biased format; for single-precision floating point, the bias (zero point) is 127. The exponent in fp32 ranges from -126 to 127. The next 23 bits (effectively 24 bits: 23 stored + 1 implicit bit) are called the mantissa, which is simply the fractional part of the floating-point number.
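To see the three fields for yourself, the sketch below splits a number into sign, biased exponent, and mantissa using Python's struct module. The function names decode_float32 and rebuild are my own, and rebuild only handles normal numbers (not zero, subnormals, infinity, or NaN).

```python
import struct


def decode_float32(x: float):
    """Split x, stored as IEEE 754 float32, into its three bit fields."""
    # Pack as a big-endian float32, then reinterpret the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # 1 bit
    biased_exponent = (bits >> 23) & 0xFF  # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF             # 23 stored fraction bits
    return sign, biased_exponent, mantissa


def rebuild(sign: int, biased_exponent: int, mantissa: int) -> float:
    """Rebuild a normal number: (-1)^sign * 1.fraction * 2^(exponent - 127)."""
    fraction = 1 + mantissa / 2**23  # the leading 1 is the implicit bit
    return (-1) ** sign * fraction * 2.0 ** (biased_exponent - 127)


sign, biased, mantissa = decode_float32(-6.25)
print("sign:", sign, "exponent:", biased - 127, "mantissa bits:", f"{mantissa:023b}")
print("rebuilt value:", rebuild(sign, biased, mantissa))
```

For -6.25, which is -1.5625 × 2^2, the sign bit is 1, the stored exponent is 129 (2 + 127), and the mantissa encodes the .5625 fraction.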

Image 2: FP32, FP16, BFLOAT16. Image by author

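Image 2 shows the bit allocation for fp16 and bfloat16. In fp16, the exponent has only 5 bits and the mantissa 10 bits; bfloat16 keeps the 8 exponent bits of fp32 but shortens the mantissa to 7 bits.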
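A quick way to feel the effect of fewer bits is to round the same number to fp32 and to fp16 and compare. The sketch below uses the struct module's "f" (float32) and "e" (float16) formats; the helper names to_fp32 and to_fp16 are made up for this example. bfloat16 is left out because the standard library has no format code for it.

```python
import struct


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


for value in (0.1, 1 / 3, 1000.1):
    print(f"{value!r}: fp32={to_fp32(value)!r} fp16={to_fp16(value)!r}")
```

The fp16 column drifts further from the original value than the fp32 column, because only 10 mantissa bits are stored.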

There are two types of quantization, Symmetric quantization and Asymmetric quantization.

Asymmetric Quantization: The input range and output range are asymmetric. For example, quantize from fp32 with an input range of -126 to 127 to fp16 (unsigned) with an output range of 0 to 31 [exponent range]. For this quantization, the scaling factor is approximately 8.16 (253 / 31) and the zero point is 15.

Image 3: Asymmetric Quantization. Image by author

Let’s say we have trained the model in fp32 format and want to quantize it asymmetrically to fp16. The formula in Image 3 helps us quantize the model. max_fp32 is the largest value among the parameters, and min_fp32 is the smallest. The (-min_fp32 / scaling factor) part calculates the zero point, which is the quantized value that the fp32 value zero maps to after quantization.
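Since the formula itself lives in the image, here is a minimal sketch of one common way to write asymmetric quantization. It assumes scale = (max - min) / (2^n - 1) and zero point = round(-min / scale), which matches the description above but may differ in detail from Image 3. All function names are made up for illustration.

```python
def asymmetric_params(min_val: float, max_val: float, n_bits: int):
    """Scale and zero point that map [min_val, max_val] onto [0, 2**n_bits - 1]."""
    q_max = 2**n_bits - 1
    scale = (max_val - min_val) / q_max
    zero_point = round(-min_val / scale)
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: int, n_bits: int) -> int:
    """Map a real value to an unsigned integer level, clipping to the valid range."""
    q = round(x / scale) + zero_point
    return max(0, min(2**n_bits - 1, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an integer level back to an approximate real value."""
    return (q - zero_point) * scale


# The article's example: input range -126..127, output range 0..31 (n = 5 bits).
scale, zero_point = asymmetric_params(-126.0, 127.0, n_bits=5)
print("scale:", round(scale, 2), "zero point:", zero_point)
for x in (-126.0, 0.0, 127.0):
    q = quantize(x, scale, zero_point, n_bits=5)
    print(x, "->", q, "->", round(dequantize(q, scale, zero_point), 2))
```

With these assumptions, -126 lands on level 0, 127 on level 31, and the real value 0 on the zero point.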

Symmetric Quantization: Quantize from symmetric input range into symmetric output range. For example, Quantize from fp32 with an input range of -126 to 127, to fp16 with an output range of -14 to 15 [Exponent Range].

Image 4: Symmetric Quantization. Image by author

The absolute maximum value in fp32 is used to find the scaling factor in symmetric quantization. n is the number of exponent bits in the target format. The mantissa (fraction) is truncated: the most significant bits are kept and the least significant bits are discarded (for example, keeping the first 10 bits).
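As a rough illustration of the symmetric case, the sketch below assumes scale = max(|x|) / (2^(n-1) - 1) and no zero point. This is a common textbook form and may not match Image 4 exactly. Note that it clips to ±15 for n = 5, a slightly different range from the -14 to 15 used in the text. The function names are made up.

```python
def symmetric_scale(values, n_bits: int) -> float:
    """Scale derived from the largest absolute value in the data."""
    q_max = 2 ** (n_bits - 1) - 1
    return max(abs(v) for v in values) / q_max


def quantize_symmetric(x: float, scale: float, n_bits: int) -> int:
    """Map a real value to a signed integer level; zero always maps to zero."""
    q_max = 2 ** (n_bits - 1) - 1
    q = round(x / scale)
    return max(-q_max, min(q_max, q))


weights = [-120.0, -30.0, 0.0, 45.0, 90.0]
scale = symmetric_scale(weights, n_bits=5)  # 120 / 15
levels = [quantize_symmetric(w, scale, n_bits=5) for w in weights]
print("scale:", scale)
print("levels:", levels)
print("dequantized:", [q * scale for q in levels])
```

Because the grid is centred on zero, no zero point is needed, at the cost of wasting part of the range when the weights are not centred.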

Post Training Quantization

Post-training quantization is applied after the model has been fully trained. When we load the model, the observers (scaling factor and zero point) help quantize the model to our desired low precision, such as fp16, int8, or int4. This process of squeezing values from full precision (high precision) into half precision (low precision) is called calibration.

To make things clearer, let’s look at the code examples below. I’ll show you how the Mistral 7B model is loaded in float16, int8, and int4 formats. By understanding these examples, you’ll get a better grasp of how quantization works in real-world scenarios and how it can benefit us in practice.

Note: Try this code alongside the article to get a clearer understanding.
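To make the idea of calibration concrete, here is a toy sketch, not how any particular library implements it. A hypothetical RangeObserver class records the smallest and largest values seen across a few calibration batches and then derives an asymmetric scale and zero point for unsigned 8-bit levels (0 to 255). The class name, its methods, and the sample batches are all invented for illustration.

```python
class RangeObserver:
    """Hypothetical helper that tracks the running min and max of observed values."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, values):
        """Update the running range with one batch of values."""
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))

    def uint8_params(self):
        """Asymmetric scale and zero point for the unsigned range 0..255."""
        scale = (self.max_val - self.min_val) / 255
        zero_point = round(-self.min_val / scale)
        return scale, zero_point


observer = RangeObserver()
calibration_batches = ([0.1, -0.4, 0.9], [1.5, -1.0, 0.3], [0.0, 0.7, -0.2])
for batch in calibration_batches:
    observer.observe(batch)

scale, zero_point = observer.uint8_params()
print("observed range:", observer.min_val, "to", observer.max_val)
print("scale:", scale, "zero point:", zero_point)
```

Real toolkits use more robust statistics than a plain min/max, but the flow is the same: observe, derive parameters, then quantize.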

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.3", device_map='cuda')

Image 5: Mistral-7B-Loading Model. Image by author

Take a closer look at the code snippet, which shows how to load the Mistral 7B model from Hugging Face. As we know, the model’s size in fp32 format is 28–29 GB: it has 7.2 billion parameters, each taking up 4 bytes. However, if you look closely at Image 5, you’ll notice that three model shards are downloaded, with a total size of 14.5 GB. How is this possible? The answer is that the checkpoint we downloaded is already stored in reduced precision. In this scenario, each parameter takes only 2 bytes of memory (fp16, half precision, 16 bits).

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# BitsAndBytes configuration for int8
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,  # load the weights in int8
)

model_name = "mistralai/Mistral-7B-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model with the quantization configuration
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Sum the bytes held by every parameter tensor
model_size_in_bytes = sum(param.nelement() * param.element_size() for param in model.parameters())
model_size_in_mb = model_size_in_bytes / (1024 * 1024)
print(f"Model size: {model_size_in_mb:.2f} MB")

# Output:
# Model size: 7168.51 MB

Let’s also take a closer look at the code snippet above, which shows 8-bit quantization of the Mistral 7B model. In this scenario, each parameter occupies only 1 byte, which significantly reduces the memory requirement. However, the model is still able to largely maintain its performance.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# BitsAndBytes configuration for 4-bit (NF4) with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model_name = "mistralai/Mistral-7B-v0.3"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Sum the bytes held by every parameter tensor
model_size_in_bytes = sum(param.nelement() * param.element_size() for param in model.parameters())
model_size_in_mb = model_size_in_bytes / (1024 * 1024)
print(f"Model size: {model_size_in_mb:.2f} MB")

# Output:
# Model size: 3840.51 MB

Similarly, take a closer look at this code snippet. Here we store the model in 4-bit precision, that is, we apply 4-bit quantization. Each parameter takes only half a byte.

We have seen three scenarios of how model quantization happens in practice. Depending on the available hardware resources, we can pick a precision that fits and still get good results. But we also lose some accuracy: quantization reduces the model’s expressive power.

Imagine precision in data representation like a mailing address. FP32 is like having your full address, including the door number, street name, city, state, and postal code. It’s extremely precise and detailed. FP16 is like having the street name, city, state, and postal code, but without the door number. It’s still pretty specific, but not as exact as FP32. And int8 is like having just the city, state, and postal code: it gives you a general idea of where something is, but not the exact location.

Quantization Error 😤

This part is very important for understanding quantization-aware training. Before getting into quantization error, you need to understand a term called dequantization. So far, we’ve explored quantization, which converts high-precision data to low-precision data. Dequantization does the opposite: it takes low-precision data and converts it back to high-precision data, for example from half precision (fp16) to full precision (fp32).

Take a closer look at this code snippet, which highlights the concept of Quantization Error.

import numpy as np

def quantize_and_dequantize_with_scale(weights, max_abs_value):
    # Calculate the scale factor
    scale_factor = max_abs_value / 15.0  # 15 is the upper end of the fp16 exponent range (-14 to 15)

    # Quantize: scale the weights, clip them to the exponent range, and store them as fp16
    quantized_weights_fp16 = np.clip(weights / scale_factor, -14, 15).astype(np.float16)

    # Dequantize back to fp32
    dequantized_weights_fp32 = quantized_weights_fp16.astype(np.float32) * scale_factor

    return dequantized_weights_fp32

# Sample set of weights in fp32
original_weights = np.random.uniform(-126, 127, 10).astype(np.float32)

# Maximum absolute value of the weights
max_abs_value = np.max(np.abs(original_weights))

# Quantization and dequantization
quantized_and_dequantized_weights = quantize_and_dequantize_with_scale(original_weights, max_abs_value)

# Quantization error
quantization_error = original_weights - quantized_and_dequantized_weights

print("Original weights :", original_weights)
print("Quantized and dequantized weights :", quantized_and_dequantized_weights)
print("Quantization error :", quantization_error)

# Mean absolute quantization error
mean_abs_error = np.mean(np.abs(quantization_error))
print("Mean absolute quantization error:", mean_abs_error)

# Output:
# Original weights : [ -20.410507 -19.901762 -70.0985 -13.243117 12.347162 -100.66862
#  -41.767776 10.851324 32.425034 -96.281494]
# Quantized and dequantized weights : [-20.408989 -19.897781 -70.10101 -13.245526 12.347635 -93.957375
#  -41.761745 10.853335 32.42893 -93.957375]
# Quantization error : [-1.5182495e-03 -3.9806366e-03 2.5100708e-03 2.4089813e-03
#  -4.7302246e-04 -6.7112427e+00 -6.0310364e-03 -2.0112991e-03
#  -3.8948059e-03 -2.3241196e+00]
# Mean absolute quantization error: 0.90581906

What does this output tell us? It shows that when we quantize the parameters, we lose some information. This error occurs whenever we reduce the precision of a model’s weights and biases. In this run, most of the error comes from the two most negative weights, which fall outside the clipping range and are both mapped to -93.957375. Simply quantizing a pre-trained model therefore leads to some loss of accuracy. In most scenarios we use a quantized version of the model, because average users don’t have access to large computational resources. This is where quantization-aware training comes into play.😃
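For comparison, here is a small standard-library sketch of the same round trip on an integer grid, assuming symmetric int8 quantization with scale = max(|w|) / 127. The function name roundtrip_int8 and the sample weights are invented for this example. Because the scale is derived from the largest weight, nothing gets clipped, and the error of each weight stays within half a quantization step.

```python
def roundtrip_int8(weights):
    """Symmetric int8 quantization followed by dequantization."""
    scale = max(abs(w) for w in weights) / 127
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    dequantized = [q * scale for q in quantized]
    return quantized, dequantized, scale


weights = [-1.27, -0.5, 0.0, 0.333, 1.0]
quantized, dequantized, scale = roundtrip_int8(weights)
errors = [w - d for w, d in zip(weights, dequantized)]

print("int8 levels:", quantized)
print("dequantized:", dequantized)
print("errors:", errors)
print("half a step:", scale / 2)
```

The error never disappears, but it is bounded by the grid spacing, which is why the choice of scale and clipping range matters so much.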

Quantization Aware Training 🤥

This approach involves training models with the intention of eventually deploying them in quantized form. In other words, we train our models knowing that they’ll be converted to a lower-precision format later on. If you look closely, you’ll notice that some of the most popular Large Language Models (LLMs) are also available in reduced-precision versions (fp16) on the Hugging Face platform. They may have gone through quantization-aware training.

This approach makes our model more resilient to the effects of quantization. We do this by exposing the model’s weights to the errors that occur during quantization. To achieve this, we insert quantization and dequantization steps (which simulate the effects of quantization without actually quantizing the model parameters) into the neural network’s computation.

This allows the network to experience the effects of quantization error while it learns, and as a result, the gradient updates driven by the loss adjust the weights to account for these errors. Over time, the model becomes more robust to quantization.
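The mechanics can be sketched with a toy example: a single weight fitted to y = 1.3 × x, where the forward pass always sees a quantized-then-dequantized ("fake quantized") copy of the weight. This is only an illustration under my own assumptions: a 4-bit symmetric grid with step 0.25, plain gradient descent, and the straight-through estimator (the gradient is passed through the rounding step as if it were the identity). Real QAT frameworks differ in many details, and the names fake_quantize and train are invented.

```python
def fake_quantize(w: float, scale: float, q_max: int = 7) -> float:
    """Quantize then immediately dequantize (simulated 4-bit symmetric grid)."""
    q = max(-q_max, min(q_max, round(w / scale)))
    return q * scale


def train(xs, ys, scale=0.25, lr=0.05, steps=200):
    """Fit y = w * x while the forward pass only ever sees the fake-quantized weight."""
    w = 0.0  # full-precision weight that gradient descent updates
    for _ in range(steps):
        w_q = fake_quantize(w, scale)  # forward pass uses the quantized weight
        # Gradient of the mean squared error with respect to w_q
        grad = sum(2 * (w_q * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Straight-through estimator: treat d(w_q)/d(w) as 1 and apply
        # the same gradient to the full-precision weight.
        w -= lr * grad
    return w, fake_quantize(w, scale)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.3 * x for x in xs]  # the target slope 1.3 is not on the 0.25-spaced grid
w, w_q = train(xs, ys)
print("full-precision weight:", w)
print("weight seen after quantization:", w_q)
```

The full-precision weight hovers around the target, while the quantized weight settles on a neighbouring grid point, so the loss during training already includes the quantization error.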

Image 6: Mistral FFN-Quantization Aware Training. Image by author

To illustrate QAT (quantization-aware training), I took the feed-forward network (FFN) of Mistral 7B. The brown parts in Image 6 denote the quantization and dequantization steps in the FFN. These layers simulate quantization and dequantization of the parameters during training.

This introduces some quantization error in the FFN. By training in this way, we make the FFN aware of quantization. So when we quantize the parameters after training (post-training quantization), we typically don’t see a significant drop in accuracy, because the model has already learned to adapt to the effects of quantization during training.

And we come to the end of this article. I hope this article has provided you with a clear understanding of why model quantization is necessary, what quantization actually is, the concept of post-training quantization, the impact of quantization error, and the importance of quantization-aware training.


Let’s stay connected and explore the exciting world of AI together!

Join me on LinkedIn: linkedin.com/in/jaiganesan-n/ 🌍❤️

Check out my other articles on Medium: https://medium.com/@jaiganesan 🤩 ❤️

