A Brief Practical Guide to LLM Quantization
Last Updated on September 18, 2024 by Editorial Team
Author(s): Raghunaathan
Originally published on Towards AI.
In recent years, the field of language models has undergone a significant transformation since the introduction of ChatGPT. We've witnessed a notable surge in the availability of large language models (LLMs), both commercial and open-source. Despite this abundance, many commercial applications rely on a small selection of popular foundation models, which are then fine-tuned for specific tasks.
The rapid expansion of LLMs has introduced challenges, particularly when it comes to deploying these models in environments with limited GPU resources, such as on mobile devices or within budget constraints. Cutting-edge LLMs are generally trained on hundreds or even thousands of NVIDIA H100 GPUs (Meta, for example, reportedly owns around 340,000 H100s), making it impractical to manage and fine-tune these models on local hardware. To overcome this, techniques like quantization, pruning, and knowledge distillation have been developed, with quantization standing out as a particularly promising approach. In this article, we'll explore the most widely used quantization strategies.
What is Quantization?
Quantization is a technique designed to decrease the memory requirements of high-precision weights by converting them into a lower-precision format, aiming to retain their distribution and performance as much as possible. For example, a 32-bit floating-point format can represent values ranging from approximately -3.4×10³⁸ to 3.4×10³⁸. This format includes 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa (or fraction).
When using a 16-bit half-precision format, the range of representable values shrinks to between -65504 and 65504, with 1 bit for the sign, 5 bits for the exponent, and 10 bits for the mantissa. To illustrate the memory savings, consider a model with 100 million parameters stored in 32-bit precision: it would require approximately 0.372 GiB of memory (1×10⁸ parameters × 4 bytes). With 16-bit precision, the requirement drops to about 0.186 GiB (1×10⁸ parameters × 2 bytes).
This reduction in precision facilitates more efficient computation, particularly on hardware with limited resources. However, it may also lead to some loss of information. Therefore, selecting the appropriate quantization strategy and metrics is crucial to balancing performance and accuracy.
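To sanity-check this arithmetic, here is a minimal PyTorch sketch (purely illustrative, not tied to any particular model) that measures the footprint of the same 100-million-element tensor in float32 and float16.
import torch

# 100 million parameters as a single tensor (purely illustrative)
params_fp32 = torch.zeros(100_000_000, dtype=torch.float32)
params_fp16 = params_fp32.to(torch.float16)

bytes_fp32 = params_fp32.element_size() * params_fp32.nelement()  # 4 bytes per value
bytes_fp16 = params_fp16.element_size() * params_fp16.nelement()  # 2 bytes per value

print(f"float32: {bytes_fp32 / 1024**3:.3f} GiB")  # ~0.373 GiB
print(f"float16: {bytes_fp16 / 1024**3:.3f} GiB")  # ~0.186 GiB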
Types of Quantization
Quantization is typically categorized based on weight distribution, the elements subject to quantization (whether weights, activations, or both), and the approach to quantization (such as individual weights, entire layers, or blocks). The choice of these mechanisms significantly influences the effectiveness of various quantization methods.
Linear / Symmetric Quantization
This method is similar to applying a min-max scaler: the actual weight range is squeezed into the quantization range. It works best for symmetric weight ranges, i.e., when the distribution of the weights mirrors the quantization scale around zero. Quantization is done using the following functions:
q = round(r/s + z), where s = (r_max - r_min)/(q_max - q_min) and z = round(q_min - r_min/s)
Here, r is the weight value to be quantized, s is the scaling factor, and z is the zero offset.
Consider a weight range [0, ..., 1000] and a uint8 quantization range, i.e., 0 to 255. Here, s = (1000 - 0)/(255 - 0) = 1000/255 ≈ 3.92; z = round(0 - 0/3.92) = 0; therefore q = round(r/3.92 + 0) = round(r/3.92). However, linear quantization can struggle with asymmetric ranges, because it assumes that the input values and the quantization levels are symmetrically distributed around a central point. In the image below, the top scale is the actual weights and the bottom one is the quantization scale.
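To make the worked example concrete, here is a minimal NumPy sketch of this min-max quantization, using the same [0, 1000] weight range and a uint8 target (the weight values are illustrative, not taken from a real model).
import numpy as np

def quantize(r, r_min, r_max, q_min=0, q_max=255):
    # scaling factor and zero offset, as in the formulas above
    s = (r_max - r_min) / (q_max - q_min)
    z = round(q_min - r_min / s)
    q = np.clip(np.round(r / s + z), q_min, q_max)
    return q.astype(np.uint8), s, z

weights = np.array([0.0, 250.0, 500.0, 1000.0])
q, s, z = quantize(weights, r_min=0.0, r_max=1000.0)
print(q, s, z)                    # [  0  64 128 255], s ≈ 3.92, z = 0
print(q.astype(np.float32) * s)   # dequantized values roughly recover 0, 250, 500, 1000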
Asymmetric Quantization
This approach applies an affine transformation that includes both a scaling factor and a zero-point shift, and each channel (or weight matrix) can be scaled by a different factor. Consider a weight range of [-20, ..., 1000] and a uint8 quantization range of [0, 255]. Applying the min-max scaler yields the result below.
If we apply plain linear quantization, the scaling factor is s = (1000 - (-20))/(255 - 0) = 4, so the weight -20 maps to -20/4 = -5, which is not in the range 0 to 255.
This is where the factor z comes in: it is the negative of the scaled representation of the minimum floating-point value (the minimum is always negative or zero). The zero-point acts as a bias that shifts the scaled floating-point value, and it corresponds to the value in the quantized range that represents the floating-point value 0.0. Here the scaled minimum is -5, so the zero-point is z = 5, and the weight -20 quantizes to -5 + 5 = 0. Whenever we quantize a value, we add the zero-point to the scaled value to obtain the actual quantized value in the valid quantization range; any result below 0 is clamped to 0, and any result above 2^n - 1 is clamped to 2^n - 1. You can check the importance of having a zero-centered quantization scale in the first link in the resources section.
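The same idea can be sketched for the [-20, 1000] example; the NumPy round trip below shows how the zero-point keeps negative weights inside the uint8 range (again with illustrative values).
import numpy as np

def asymmetric_quantize(r, r_min, r_max, q_min=0, q_max=255):
    s = (r_max - r_min) / (q_max - q_min)       # scaling factor
    z = round(q_min - r_min / s)                # zero-point: the quantized value of 0.0
    q = np.clip(np.round(r / s + z), q_min, q_max).astype(np.uint8)
    return q, s, z

def dequantize(q, s, z):
    return (q.astype(np.float32) - z) * s       # approximate reconstruction of the weights

weights = np.array([-20.0, 0.0, 500.0, 1000.0])
q, s, z = asymmetric_quantize(weights, r_min=-20.0, r_max=1000.0)
print(q, s, z)               # [  0   5 130 255], s = 4.0, z = 5
print(dequantize(q, s, z))   # [-20.   0. 500. 1000.]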
There are four main range-selection strategies for deciding the extreme limits of the weight scale before quantization: min-max, percentile, mean-squared error, and cross-entropy.
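To see why the range-selection strategy matters, the toy sketch below compares the clipping range chosen by a min-max scaler with a percentile-based one on a synthetic weight tensor containing a few outliers; MSE- and cross-entropy-based selection instead search for the range that minimizes those objectives and are omitted here for brevity.
import numpy as np

np.random.seed(0)
weights = np.random.randn(10_000)
weights[:5] = np.array([40.0, -35.0, 30.0, -28.0, 25.0])   # a few artificial outliers

# Min-max uses the absolute extremes, so outliers stretch the scale
minmax_range = (weights.min(), weights.max())

# Percentile clipping ignores the extreme tails (here the top/bottom 0.1%)
percentile_range = (np.percentile(weights, 0.1), np.percentile(weights, 99.9))

print("min-max range:   ", minmax_range)
print("percentile range:", percentile_range)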
Methods of Quantization
Post Training Quantization (PTQ)
In post-training quantization, we start by making a copy of the base model that retains the original weights. We then attach observers, such as the PyTorch MinMaxObserver, to the intermediate layers of the model. These observers collect statistics during calibration, which involves running the model on a sample training dataset. The calibration data is used to determine the range of activations and weights, which are then applied to quantize the model effectively. This process allows for the adjustment of the model's parameters to better fit the quantized representation.
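As a minimal, self-contained illustration of this flow (a toy model and random calibration batches, not anything from a real LLM), PyTorch's eager-mode API attaches observers during prepare(), fills them during calibration, and swaps in int8 modules at convert():
import torch
import torch.nn as nn
from torch.ao.quantization import QuantStub, DeQuantStub, get_default_qconfig, prepare, convert

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()       # float -> int8 boundary
        self.fc = nn.Linear(16, 4)
        self.dequant = DeQuantStub()   # int8 -> float boundary

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyNet().eval()
model.qconfig = get_default_qconfig("fbgemm")   # default observers for weights and activations
prepared = prepare(model)                       # attaches observers to collect statistics

calibration_batches = [torch.randn(8, 16) for _ in range(10)]  # stand-in calibration data
with torch.no_grad():
    for batch in calibration_batches:
        prepared(batch)                         # observers record activation ranges

quantized = convert(prepared)                   # replaces modules with int8 equivalents
print(quantized)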
Quantization Aware Training (QAT)
In this technique, fake quantization modules are introduced during the training of the model to improve its resilience to quantization effects. This method, known as Quantization Aware Training (QAT), is developed to counteract the accuracy losses that may arise with Post-Training Quantization (PTQ). Unlike PTQ, which applies quantization after the model has been trained, QAT integrates quantization into the training process itself. This allows the model to adjust and optimize for quantization effects from the outset, enhancing its performance and accuracy when deployed in a quantized form.
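Here is what "fake quantization during training" can look like with PyTorch's eager-mode QAT API, again on a placeholder model with random data; a real setup would use your actual architecture and training loop.
import torch
import torch.nn as nn
from torch.ao.quantization import QuantStub, DeQuantStub, get_default_qat_qconfig, prepare_qat, convert

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.fc = nn.Linear(16, 4)
        self.dequant = DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyNet().train()
model.qconfig = get_default_qat_qconfig("fbgemm")
model = prepare_qat(model)                  # inserts fake-quant modules into the forward pass

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for _ in range(100):                        # tiny training loop on random data
    x, y = torch.randn(8, 16), torch.randn(8, 4)
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

model.eval()
quantized = convert(model)                  # produces the final int8 model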
As shown above, the primary distinction between the two methods lies in when quantization enters the pipeline. In PTQ (Post-Training Quantization), the model is trained first and quantization is applied afterwards, using only a calibration pass to set the ranges. In QAT (Quantization-Aware Training), quantization is simulated during training, so the model adapts to quantization effects as it learns, which results in a more robust and potentially more accurate quantized model. With this in mind, let us look at some popular quantization methods.
Group-wise Precision Tuning Quantization (GPTQ)
This method takes inspiration from the Optimal Brain Quantization (OBQ) methodology and scales it to quantize very large LLMs. OBQ quantizes weights one by one, targeting them in ascending order of quantization error, and it handles outliers as soon as they are found, excluding already-quantized weights from further computation to keep both the error and the cost down. This works well up to models with a few hundred million parameters but becomes inefficient at larger scales. The original OBQ method quantizes the rows of W independently, each in its own error-defined order; GPTQ instead quantizes the weights of all rows in the same column order and still yields a final squared error similar to the original solution. This speeds up the process because the computation is done per column instead of per weight, and updates are applied in batches of columns, exploiting the observation that column updates are not interdependent: a block of columns is updated first and then used to update the global matrix. Together, these two changes lower the memory and time complexity of the process. Finally, to tackle the numerical inaccuracies caused by repeatedly applying the same operations to very large models, a Cholesky decomposition is used along with a dampening factor (λ). GPTQ was able to quantize BLOOM and OPT models on a single A100 GPU and was evaluated on generative tasks with those quantized models. It performs well for creating 4-bit precision models but is not CPU-friendly.
# install the necessary packages
!BUILD_CUDA_EXT=0 pip install -q auto-gptq transformers optimum
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quant_config = GPTQConfig(bits=4, dataset = "c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=quant_config)
model.save_pretrained('gptq-opt125') # replace with your directory path
You can find the implementation here: https://huggingface.co/docs/transformers/main/en/quantization/gptq
Activation-aware Weight Quantization (AWQ)
The intuition is that the activations (e.g., a 1×4K matrix) are thousands of times smaller than the weights (e.g., a 4K×4K matrix), and that keeping just 1% of the salient weights unquantized is enough to significantly reduce the quantization error. With this in mind, the method adopts a low-bit, weight-only quantization approach.
How do we determine the salient weights (the yellow channel in the second matrix in the image)? We use the activations: when an activation channel is consistently large across different inputs, the corresponding weight channel is considered important, which makes the weight selection an activation-aware mechanism. This, however, introduces a mixed-precision problem, as you can see from the second matrix in the image above. To solve it, the method multiplies the salient weight channel by a number greater than 1 and divides the corresponding activation channel by the same number; this multiplier is identified using a fast grid search for the optimal perplexity. AWQ shows a significant computational advantage and needs a much smaller calibration set (about 10x less) than GPTQ, with a speed-up in runtime, making it more suitable for multi-modal tasks.
!pip install autoawq transformers nvidia-ml-py3
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
quant_path = 'mistral-instruct-v0.2-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
GPT-Generated Unified Format (GGUF/GGML)
The GGML tensor library was developed by Georgi Gerganov and underpins llama.cpp, enabling LLM inference on consumer-grade computer hardware. This format allows billion-parameter models to be loaded and executed even on laptops without dedicated GPUs. GGUF was later introduced to overcome GGML's inability to accommodate newer models. The significance of this format is its ability to distribute layers between CPU and GPU efficiently (depending on the model and hardware used) to speed up the process. Because it is CPU-centered it is generally slower, but it solves the problem of computational requirements: on resource-constrained hardware it shows lower perplexity than the alternatives for various natural language tasks and has become the standard for running LLMs without dedicated GPUs.
! git clone https://github.com/ggerganov/llama.cpp
! cd llama.cpp && make
! pip install -r llama.cpp/requirements.txt
from huggingface_hub import snapshot_download

model_name = 'Qwen/Qwen2-1.5B'
methods = ["q4_k_m"]
base_model = './original_model/'
quantized_path = './quantized_model/'
snapshot_download(repo_id=model_name, local_dir=base_model)

# in the next shell
!mkdir ./quantized_model/
original_model = quantized_path+'/FP16.gguf'
# converting the model to gguf f16 or bf16 format
!python llama.cpp/convert_hf_to_gguf.py ./original_model/ --outtype f16 --outfile ./quantized_model/FP16.gguf
# you can quantize the f16 version to a lower bit precision
! ./llama.cpp/quantize ./quantized_model/FP16.gguf ./quantized_model/q4_k_m.gguf Q4_K_M
# you can test the quantized model (here 256 is the number of output tokens; change accordingly)
! ./llama.cpp/main -m ./quantized_model/q4_k_m.gguf -n 256 -p "your query to test"
Half-Quadratic Quantization (HQQ/HQQ+)
This is a data-independent dynamic quantization method; that is, it does not depend on calibration data, which avoids calibration-data bias and lowers the quantization time (roughly 50x lower compared to GPTQ). You can read the process in detail here. It is also a weight-only quantization method that focuses on minimizing errors on the weights rather than on the layer activations. It uses a robust mechanism to detect outliers and find the quantization factors: the problem is split into multiple sub-problems with a half-quadratic solver, and the quantization is simplified by fixing the scaling factor and optimizing only the zero-point/offset. One noteworthy result is that Llama-2-70B at 2-bit quantization via HQQ achieves a lower perplexity than the full-precision Llama-2-13B. This still requires the use of dedicated GPUs.
!pip install hqq
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

# In this approach, all linear layers will use the same quantization config;
# another approach is for each linear layer with the same tag to use a dedicated
# quantization config
quant_config = HqqConfig(nbits=8, group_size=64, quant_zero=False, quant_scale=False, axis=0) #axis=0 is used by default
model_id = "Qwen/Qwen2-1.5B"
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="cuda",
quantization_config=quant_config
)
model.save_pretrained("qwen-hqq") # replace with your path
Additive Quantization of Language Models (AQLM)
This is a data-dependent, extreme quantization technique designed to generate highly compressed language models, often down to 2-bit precision, without sacrificing significant generative performance. AQLM has demonstrated superior results compared to popular methods like GPTQ and even more recent approaches such as QuIP and QuIP#. Its unique approach involves quantizing multiple weights together, leveraging the interdependencies between them. It works as follows:
- Weight Grouping: AQLM groups together 8-16 weights from the LLM. This grouping strategy leverages the interdependencies between these weights, improving the efficiency of quantization.
- Codebook Learning: A codebook is created, containing a set of vector codes. These codes are learned from the grouped weights to represent them effectively.
- Nearest Neighbor Search: For each group of weights, AQLM finds the nearest neighbor in the codebook using a nearest neighbor search algorithm.
- Additive Representation: The grouped weights are then represented as a sum of multiple vector codes from the codebook. This additive representation allows for flexible and efficient quantization.
This method is computationally expensive, so it is preferable to start from a compatible pretrained (already AQLM-quantized) model where possible. A more technical brief can be seen here and the method studied here. The official notebook for pretrained models can be found here, and you can follow the GitHub README to quantize any base model.
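Since quantizing a model with AQLM from scratch can take many GPU-hours, a practical starting point is loading one of the already-quantized checkpoints through transformers. The snippet below is a sketch that assumes the aqlm package is installed and uses ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf as an example checkpoint name; swap in whichever prequantized model you intend to use.
!pip install "aqlm[gpu,cpu]" transformers accelerate

from transformers import AutoModelForCausalLM, AutoTokenizer

# An example 2-bit AQLM checkpoint; replace with the prequantized model of your choice
model_id = "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Quantization is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))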
Easy and Efficient Quantization for Transformers (EETQ)
This is a GPU-dependent, int8, weight-only, channel-wise quantization technique that requires neither a calibration dataset nor a pre-quantized model. It is inspired by this research. According to the contributors, the int8 weight-only quantization method is fundamentally straightforward, involving per-channel, symmetric quantization without any accuracy-restoration operations. Within Cutlass, the dequantization step is fused with the FP16 matrix multiplication operator. Experimental results indicate that for w8a16 the loss in LLM generation accuracy is minimal, whereas the accuracy loss for w4a16 is more significant; consequently, improving accuracy for w4a16 or w3a16 requires algorithmic adjustments using techniques like AWQ and GPTQ. While algorithmic restoration enhances accuracy, EETQ aims to deliver a universal, user-friendly, and efficient weight-only GEMM inference backend plugin.
!git clone https://github.com/NetEase-FuXi/EETQ.git
!cd EETQ/ && pip install .
from transformers import AutoModelForCausalLM, EetqConfig
path = "Qwen/Qwen2-1.5B"
quantization_config = EetqConfig("int8")
model = AutoModelForCausalLM.from_pretrained(path, device_map="auto", quantization_config=quantization_config)
quant_path = "eetq_qwen"
model.save_pretrained(quant_path)
Low-Rank Adaptation of Large Language Models (LoRA)
This approach revolves around the hypothesis that updates to the weights have a "low intrinsic rank" during adaptation. It reduces the number of trainable parameters by learning pairs of rank-decomposition matrices while freezing the original weights. For a given input, you first apply the frozen weight matrix W₀ to get an initial output. Then you apply the update ΔW = BA (where B is a d×r matrix, A is an r×k matrix, and the rank r ≪ min(d, k)) to the same input, and the final result is the sum of the outputs from W₀ and ΔW. This combined output is what the model uses for making predictions or further processing. The method shows that an LLM can perform accurately even when the update lives in a much lower-dimensional space. This vastly reduces the storage requirement for large language models adapted to specific tasks and enables efficient task-switching during deployment, all without introducing inference latency. LoRA also outperforms several other adaptation methods, including adapters, prefix-tuning, and fine-tuning.
# Install required libraries with pip
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/peft.git git+https://github.com/huggingface/transformers.git
import torch
import os

# Set environment variable to specify which GPU to use
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
import transformers

# Load a pre-trained model for causal language modeling
model = AutoModelForCausalLM.from_pretrained(
"bigscience/bloom-560m", # Model identifier
torch_dtype=torch.float16, # Use 16-bit floating point for reduced memory usage
device_map='auto', # Automatically assign model layers to available devices
)

# Load the tokenizer for the model
tokenizer = AutoTokenizer.from_pretrained("bigscience/tokenizer")

# Freeze all parameters in the model to prevent their modification during training
for param in model.parameters():
    param.requires_grad = False  # Freeze the model - train adapters later
    if param.ndim == 1:
        # Convert small parameters (like layernorm weights) to float32 for numerical stability
        param.data = param.data.to(torch.float32)

# Enable gradient checkpointing to reduce memory usage
model.gradient_checkpointing_enable()
# Ensure that input embeddings are required to compute gradients
model.enable_input_require_grads()

# Define a custom output layer to cast outputs to float32
class CastOutputToFloat(nn.Sequential):
    def forward(self, x):
        return super().forward(x).to(torch.float32)

# Replace the model's output head with the custom layer
model.lm_head = CastOutputToFloat(model.lm_head)

# Configure LoRA (Low-Rank Adaptation) for the model
config = LoraConfig(
    r=8,                                 # Rank of the adaptation
    lora_alpha=16,                       # Scaling factor for LoRA
    target_modules=["query_key_value"],  # Target specific modules in the model for LoRA
    lora_dropout=0.05,                   # Dropout rate for LoRA
    bias="none",                         # No bias in LoRA layers
    task_type="CAUSAL_LM"                # Task type: Causal Language Modeling
)

# Apply LoRA configuration to the model
model = get_peft_model(model, config)

# Load the training dataset for question answering
qa_dataset = load_dataset("squad_v2")

# Define a function to create training prompts from context, question, and answer
def create_prompt(context, question, answer):
    if len(answer["text"]) < 1:
        answer = "Cannot Find Answer"
    else:
        answer = answer["text"][0]
    prompt_template = f"### CONTEXT\n{context}\n\n### QUESTION\n{question}\n\n### ANSWER\n{answer}</s>"
    return prompt_template

# Map the dataset to the tokenizer to prepare prompts
mapped_qa_dataset = qa_dataset.map(lambda samples: tokenizer(create_prompt(samples['context'], samples['question'], samples['answers'])))

# Initialize the Trainer for training the model with LoRA
trainer = transformers.Trainer(
    model=model,                                   # Model to be trained
    train_dataset=mapped_qa_dataset["train"],      # Training dataset
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,             # Batch size per device
        gradient_accumulation_steps=4,             # Accumulate gradients over multiple steps
        warmup_steps=100,                          # Number of warmup steps
        max_steps=100,                             # Total number of training steps
        learning_rate=1e-3,                        # Learning rate
        fp16=True,                                 # Use 16-bit floating point for training
        logging_steps=1,                           # Frequency of logging
        output_dir='outputs',                      # Directory to save the model outputs
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)  # Data collator for language modeling
)

# Disable cache to prevent warnings during training
model.config.use_cache = False
# Train the model with the specified arguments
trainer.train()
You can have a look at the various LoRA strategies here.
Conclusion
In conclusion, LLM quantization is a powerful technique for optimizing large language models, making them more efficient and accessible without compromising too much on performance. By reducing the precision of the model's weights, you can achieve significant improvements in speed and memory usage, which is crucial for deploying models in resource-constrained environments.
As you embark on implementing these quantization strategies, remember that they are just one piece of the puzzle. In the next article, we'll dive deeper into the fine-tuning and deployment aspects of LLMs. We'll explore how to tailor these models to specific tasks and how to deploy them effectively, ensuring they perform optimally in real-world applications. Hope you found this article useful; keep exploring and happy learning!
Resources
- A Visual Guide to Quantization, Maarten Grootendorst (blog)
- Quantization explained in PyTorch, Umar Jamil (YouTube)
- TensorFlow Model Optimization, LiteRT (documentation)
- Fitting AI models in your pocket with quantization, Felix Baum (Stack Overflow Blog)
- Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment, Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, Fu Lee Wang (arXiv)
- A Comprehensive Evaluation of Quantization Strategies for Large Language Models, Renren Jin, Jiangcun Du, et al. (arXiv)