

LLM Quantization Techniques- GPTQ

Last Updated on February 21, 2024 by Editorial Team

Author(s): Rajesh K

Originally published on Towards AI.

Recent advances in neural network technology have dramatically increased model scale, resulting in greater sophistication and capability. Large Language Models (LLMs) have received high praise for their ability to understand code and answer complex questions. However, this increase in size demands resource-intensive and often expensive specialized hardware such as GPUs and custom chips. This hardware enables the fast, efficient computation that running large-scale networks requires.

Quantization is a method of reducing the computational and memory cost of running a model by representing weights and activations with low-precision data types, such as 8-bit integers, rather than the usual 32-bit floating point. This allows a much smaller model representation, consumes less power, and enables faster matrix multiplication through integer arithmetic. It is particularly useful at inference time because it saves a large share of the computation cost without sacrificing too much numerical accuracy.

During quantization, the weights of a model are converted from a higher-precision floating-point representation to a lower-precision floating-point or integer representation. The reduced precision speeds up inference when the hardware offers high-performance vectorized operations for the smaller data type. For example, PyTorch supports INT8 quantization, which reduces model size by 4x compared with FP32 and, with hardware support, can make INT8 computations up to about 4x faster than FP32 computations.
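To make the float-to-integer mapping concrete, here is a minimal sketch of affine (scale plus zero-point) INT8 quantization in plain Python. The function names and the toy weight values are illustrative assumptions, not PyTorch's implementation.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers with an affine (scale + zero-point) scheme."""
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # keep 0.0 inside the representable range
    scale = (hi - lo) / 255 or 1.0        # fall back to 1.0 for an all-zero input
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize_int8(q, scale, zero_point):
    """Recover approximate float values from the integer codes."""
    return [(qi - zero_point) * scale for qi in q]


weights = [-1.2, -0.4, 0.0, 0.37, 0.9, 2.1]   # toy values
q, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("codes:", q)
print("scale:", scale, "zero_point:", zero_point)
print("largest round-trip error:", max_error)
```

Each weight is stored in one byte instead of four, and the round-trip error for values inside the range stays within about half a quantization step (`scale / 2`).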

Floating point representations

To delve into quantized training, we first need to understand how deep learning frameworks like PyTorch represent floating-point numbers, as this representation forms the basis for neural network training. These frameworks typically employ a 32-bit floating-point representation, which is a binary approximation of the actual, full-precision numbers. To improve computational efficiency and reduce memory usage, quantized training reduces the number of bits used to represent activations and gradients throughout training, resulting in low-precision representations that are better suited to inference on resource-constrained devices.

Figure: Float representation (source)
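As a small illustration of the 32-bit layout, the sketch below uses Python's standard `struct` module to split a number's IEEE 754 single-precision encoding into its sign, exponent, and mantissa fields. The helper name `float32_fields` is made up for this example.

```python
import struct


def float32_fields(x):
    """Split a number's IEEE 754 single-precision encoding into its three bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF        # 23 bits of fraction
    return sign, exponent, mantissa


sign, exponent, mantissa = float32_fields(-6.25)
# For normal numbers: value = (-1)^sign * (1 + mantissa / 2^23) * 2^(exponent - 127)
value = (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)
print("sign:", sign, "unbiased exponent:", exponent - 127, "reconstructed:", value)
```

Lower-precision formats keep the same three-field structure but spend fewer bits on the exponent, the mantissa, or both.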

Quantized training introduces a trade-off between accuracy and computational efficiency, as lower-precision representations introduce some error. Common low-precision representations include 16-bit floating-point (FP16) and 8-bit integer (INT8), with INT8 often used colloquially as shorthand for quantized representations in general.

In deep learning, various floating-point representations are employed to optimize performance and memory usage. Common formats include 32-bit (FP32) and 16-bit (FP16) floats, which are widely used due to their compatibility with standard hardware and software. To further enhance performance, particularly in high-performance computing, hardware-specific floating-point formats have emerged. Notable examples include NVIDIA’s TensorFloat-32 (TF32), Google’s BrainFloat-16 (bfloat16), and AMD’s FP24. Smaller formats, such as FP8, are also being developed for microcontrollers, embedded devices, and other resource-constrained environments, and are often announced alongside new generations of GPUs, such as NVIDIA’s H100. Double-precision formats like FP64 are rarely used in deep learning because of their large memory overhead and lack of significant benefit for deep neural networks. However, they can be beneficial for statistical modeling, where single precision is not sufficient. In summary:

  • Common floating-point formats for deep learning: FP32, FP16
  • Hardware-specific formats: TF32, bfloat16, FP24
  • Smaller formats: FP8
  • Double-precision formats: FP64 (rarely used in deep learning)
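To see what these widths mean in practice, the following sketch estimates the memory needed just to store a model's weights in several of the formats listed above. The 1-billion-parameter count is a hypothetical round number, and only formats whose storage width is given by their name are included.

```python
# Storage width in bits for formats whose size is given by their name.
BITS_PER_VALUE = {"FP64": 64, "FP32": 32, "FP16": 16, "bfloat16": 16, "FP8": 8}


def weight_memory_gib(num_params, bits):
    """Memory needed to store num_params values of the given bit width, in GiB."""
    return num_params * bits / 8 / 2**30


num_params = 1_000_000_000  # hypothetical 1B-parameter model
for name, bits in BITS_PER_VALUE.items():
    print(f"{name:9s} {weight_memory_gib(num_params, bits):6.2f} GiB")
```

This counts weights only; activations and other runtime state add to the total.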

Types of Quantization

  • Post-training Quantization (PTQ): This is the simplest form of quantization, in which the weights of an already trained model are converted to a lower precision without any retraining. It is a straightforward and easy-to-implement method, but it can cause a mild degradation in the model's performance because of the precision lost in the weight values.
  • Quantization-Aware Training (QAT): Unlike PTQ, QAT integrates the weight conversion process into the training stage itself. This often yields better model performance, but it is also more computationally expensive. QLoRA (Quantized Low-Rank Adaptation) is a related but distinct technique: it fine-tunes low-rank adapters on top of a frozen 4-bit quantized base model rather than training the quantized weights themselves.

Running inference for large transformer-based language models (LLMs) is a challenging task due to two primary factors:

  1. Substantial memory requirements. During inference, both the model parameters and intermediate states must be stored in memory, leading to a significant memory footprint. In addition, the cost of the attention mechanism in transformers scales quadratically with the input sequence length, further exacerbating the memory demands.
  2. Limited parallelizability. The inference generation process in transformer models is executed in an autoregressive manner, making it inherently challenging to parallelize the decoding process across multiple computational units. This sequential nature of the decoding step hinders the ability to leverage parallel computing resources effectively, thereby restricting potential performance optimizations.

Specifically, the ever-increasing size of state-of-the-art transformer-based LLMs, their large memory footprint from intermediate state storage, the quadratic scaling of the attention mechanism, and the sequential nature of autoregressive decoding collectively pose formidable challenges for memory management and computational parallelization, making inference at this scale a daunting task.
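The quadratic scaling can be checked with simple arithmetic. The sketch below estimates the memory for one layer's attention score matrices when they are fully materialized, one `seq_len x seq_len` matrix per head. The head count (32) and the 2-byte FP16 element size are assumptions chosen for illustration.

```python
def attention_scores_bytes(seq_len, num_heads, bytes_per_value=2):
    """Memory for one layer's attention scores: one seq_len x seq_len matrix per head."""
    return num_heads * seq_len * seq_len * bytes_per_value


for seq_len in (512, 1024, 2048, 4096):
    mib = attention_scores_bytes(seq_len, num_heads=32) / 2**20
    print(f"seq_len={seq_len:5d}  scores per layer: {mib:8.1f} MiB")
```

Doubling the sequence length quadruples this term, which is why long inputs put so much pressure on memory.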

How can a pre-quantized model version be obtained?

The Hugging Face Hub hosts many readily available quantized model versions. These models have been quantized with methods such as GPTQ, NF4, or GGML. A quick search of the Hub shows that a significant portion of them were contributed by TheBloke, a well-known and well-respected figure in the LLM community. This contributor has tested and published several models, each quantized with a variety of methods, so that users can select the quantization method that best suits the requirements of their specific use case.

How can a model be quantized?

Various quantization techniques, including NF4, GPTQ, and AWQ, are available to reduce the computational and memory demands of language models. In this context, we will walk through quantizing the Falcon-RW-1B small language model (SLM) using the GPTQ quantization method.

Comparison of GPTQ, NF4, and GGML Quantization Techniques


GPTQ stands for “Generative Pre-trained Transformer Quantization”. It is a post-training quantization technique that helps make large language models more efficient without significantly affecting their performance.

The main features of the GPTQ algorithm are:

  1. Arbitrary weight quantization order: Unlike earlier methods that quantize weights in a carefully chosen order, GPTQ found that quantizing weights in an arbitrary fixed order works almost as well for large transformer layers, which simplifies the process.
  2. Quantizing all rows in the same order: Instead of choosing a different order for each row, GPTQ quantizes the weights of every row in the same order, which reduces runtime complexity.
  3. Lazy batch updates for efficiency: GPTQ processes columns in groups and applies the updates in batches, improving GPU utilization and providing greater speed, especially for larger models.
  4. Cholesky reformulation for numerical stability: To handle numerical inaccuracies at large scales, GPTQ uses the Cholesky decomposition and applies light dampening to the diagonal elements of the Hessian matrix.

The main advantage of GPTQ is that it can significantly reduce the computational and memory requirements of large language models while largely maintaining their performance, making them more efficient to use in resource-constrained environments.
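The core idea behind these features, quantizing weights in a fixed order and pushing each rounding error onto the weights not yet quantized, can be sketched for a single weight row in plain Python. This is a simplified illustration rather than the GPTQ implementation: it inverts the dampened Hessian directly instead of using the Cholesky form, it does no batching, and the function names, the 4-bit grid, and the toy calibration data are all assumptions.

```python
def invert(m):
    """Invert a small symmetric positive-definite matrix by Gauss-Jordan elimination."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def quantize_to_grid(x, scale, qmax=7):
    """Round x to the nearest point of a signed 4-bit grid (-8..7) with the given step."""
    return max(-qmax - 1, min(qmax, round(x / scale))) * scale


def hessian_from_samples(samples):
    """H = 2 * X X^T for calibration inputs given as a list of feature vectors."""
    d = len(samples[0])
    return [[2.0 * sum(s[i] * s[j] for s in samples) for j in range(d)] for i in range(d)]


def gptq_row(weights, hessian, scale, damp=0.01):
    """Quantize one row left to right, compensating each rounding error on later weights."""
    d = len(weights)
    mean_diag = sum(hessian[i][i] for i in range(d)) / d
    # Light dampening of the diagonal keeps the inversion well conditioned.
    h = [[hessian[i][j] + (damp * mean_diag if i == j else 0.0) for j in range(d)]
         for i in range(d)]
    hinv = invert(h)
    w = list(weights)
    quantized = []
    for i in range(d):
        q = quantize_to_grid(w[i], scale)
        quantized.append(q)
        err = (w[i] - q) / hinv[i][i]
        for j in range(i + 1, d):
            w[j] -= err * hinv[i][j]           # shift remaining weights to absorb the error
        for j in range(i + 1, d):              # drop weight i from the inverse Hessian
            for k in range(i + 1, d):
                hinv[j][k] -= hinv[j][i] * hinv[i][k] / hinv[i][i]
    return quantized


def output_error(weights, quantized, samples):
    """Squared difference between original and quantized layer outputs on the samples."""
    return sum(sum((w - q) * x for w, q, x in zip(weights, quantized, s)) ** 2
               for s in samples)


samples = [[1.0, 0.9], [0.5, 0.6], [-0.3, -0.2]]   # toy calibration inputs
weights = [0.34, -0.36]                            # toy weight row
scale = 0.1
rtn = [quantize_to_grid(w, scale) for w in weights]
gptq = gptq_row(weights, hessian_from_samples(samples), scale)
print("round-to-nearest error:", output_error(weights, rtn, samples))
print("compensated error:     ", output_error(weights, gptq, samples))
```

On this toy input the compensated row produces a smaller output error than plain round-to-nearest, because the second weight is rounded after being shifted to offset the first weight's rounding error.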

Now let’s explore how to quantize a model on Kaggle.

Getting the Notebook Ready

For this demonstration, we’ll be using a Kaggle notebook. However, you can use any Jupyter notebook environment that suits you. Kaggle provides a generous 30 hours of free GPU usage per week, which should be more than enough for our experimentation.

Here are the steps to get started:

  1. Open a new notebook on Kaggle (or your preferred environment).
  2. Create some headings to organize your work.
  3. Select the P100 GPU accelerator to start using the computational resources.

With the notebook set up and the runtime connected, we’ll be ready to dive into our experimentation.

Load required libraries

!pip install transformers optimum accelerate peft trl auto-gptq bitsandbytes datasets==2.17.0
  1. transformers: This is the popular Hugging Face library for building and training transformer models, including language models like BERT, GPT, and many others.
  2. optimum: A library for efficient model optimization and quantization, often used in conjunction with the Hugging Face libraries.
  3. accelerate: A lightweight library for distributed training and multi-GPU/TPU acceleration of machine learning models.
  4. peft: The “Parameter Efficient Fine-Tuning” library, which enables fine-tuning large language models with a fraction of the parameters.
  5. trl: The “Transformer Reinforcement Learning” library, designed for applying reinforcement learning to transformer models.
  6. auto-gptq: A library for automatic quantization of Hugging Face transformer models using the GPTQ technique described earlier.
  7. bitsandbytes: A library for optimized numerical operations and data types, often used for efficient quantization and compression of neural networks.
  8. datasets==2.17.0: This specifies the version 2.17.0 of the Hugging Face datasets library, which provides a convenient way to load and preprocess data for machine learning tasks.

Quantizing model

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
import torch

model_id = "tiiuae/falcon-rw-1b"

# GPTQ settings; bits=4 matches the 4-bit target described below
quantization_config = GPTQConfig(
    bits=4,           # Number of bits to quantize the weights to
    group_size=128,   # The group size to use for quantization; default value
    dataset="ptb",    # The dataset used for calibration
    desc_act=False,   # Whether to quantize columns in order of decreasing activation size.
                      # False can significantly speed up inference, but perplexity may become slightly worse.
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Loading with a GPTQConfig runs calibration and quantizes the weights
quant_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",        # Split the model across the available GPUs/CPUs
    trust_remote_code=True,
)
  1. It specifies the pre-trained model ID (“tiiuae/falcon-rw-1b”) that will be loaded and quantized.
  2. It creates a GPTQConfig object, which configures the GPTQ quantization process:
  • Quantizes the model weights to 4 bits (instead of 32-bit floats) for compression.
  • Sets the group size for quantization to 128 (a hyperparameter).
  • Uses the Penn Treebank dataset for calibration during quantization.
  • Disables quantizing columns in order of decreasing activation size for faster inference.
  3. It loads the tokenizer associated with the pre-trained model for text encoding/decoding.
  4. Finally, it loads the pre-trained language model itself using AutoModelForCausalLM and applies the configured GPTQ quantization.
  • The quantized model is set to automatically split across available GPUs/CPUs for efficient inference.
  • The trust_remote_code=True flag allows custom model code from the model’s Hugging Face Hub repository to be executed.

After running this code, quant_model will be a quantized 4-bit version of the original “tiiuae/falcon-rw-1b” model, ready for efficient inference or fine-tuning tasks while significantly reducing the model’s size and computational requirements.
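To show what the group size setting controls, here is a minimal sketch of group-wise 4-bit quantization in plain Python. It uses simple round-to-nearest rather than GPTQ, and it stores each group's minimum as a float offset instead of an integer zero-point. The function names and the synthetic weights are assumptions made for illustration.

```python
import math


def quantize_groups(weights, group_size=128, bits=4):
    """Quantize a flat weight list group by group; each group gets its own scale and offset."""
    levels = 2**bits - 1
    groups = []
    for start in range(0, len(weights), group_size):
        chunk = weights[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / levels or 1.0   # fall back to 1.0 for a constant group
        codes = [min(levels, max(0, round((w - lo) / scale))) for w in chunk]
        groups.append((codes, scale, lo))
    return groups


def dequantize_groups(groups):
    """Rebuild approximate weights from the per-group codes, scales, and offsets."""
    return [lo + c * scale for codes, scale, lo in groups for c in codes]


# Synthetic weights whose spread grows along the list, so the two groups get different scales.
weights = [math.sin(i / 7) * (1 + i / 64) for i in range(256)]
groups = quantize_groups(weights, group_size=128, bits=4)
restored = dequantize_groups(groups)
for index, (codes, scale, lo) in enumerate(groups):
    print(f"group {index}: scale={scale:.4f} offset={lo:.4f}")
print("largest error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

A smaller group size means more scales and offsets to store, but each one covers a narrower range of values.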

Publishing to Hugging Face


  • Upload (push) the quantized language model to the Hugging Face Hub with a descriptive repository name (e.g., “falcon-rw-1bt-gptq-4bit-ptb”).
  • Upload (push) the associated tokenizer to the same repository on the Hub.
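A minimal sketch of these two steps is shown below. It assumes the model and tokenizer objects expose the `push_to_hub` method that Transformers models and tokenizers provide, and that you have already authenticated with the Hub. The helper name `publish_quantized_model` is hypothetical.

```python
def publish_quantized_model(model, tokenizer, repo_id):
    """Push the quantized model and its tokenizer to the same Hub repository."""
    model.push_to_hub(repo_id)
    tokenizer.push_to_hub(repo_id)
    return repo_id


# Example call (requires a logged-in Hugging Face account), using the objects created earlier:
# publish_quantized_model(quant_model, tokenizer, "falcon-rw-1bt-gptq-4bit-ptb")
```

Keeping both uploads in one repository lets the model and tokenizer be loaded later from a single ID.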

Inspect the Keys

Inspect the available keys (or settings) in the quantization configuration dictionary by printing them out.

quant_dict = quant_model.config.quantization_config.to_dict()
print(quant_dict.keys())

dict_keys(['quant_method', 'bits', 'tokenizer', 'dataset', 'group_size', 'damp_percent', 'desc_act', 'sym', 'true_sequential', 'use_cuda_fp16', 'model_seqlen', 'block_name_to_quantize', 'module_name_preceding_first_block', 'batch_size', 'pad_token_id', 'use_exllama', 'max_input_length', 'exllama_config', 'cache_block_outputs', 'modules_in_block_to_quantize'])

Running Inference on Quantized Model

Load the 4-bit quantized model and check its output.

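import torch
from transformers import AutoModelForCausalLM, GPTQConfig, AutoTokenizer

# Hub repository holding the quantized model (placeholder: replace <hub-username>
# with the account the model was pushed to)
quantized_model_id = "<hub-username>/falcon-rw-1bt-gptq-4bit-ptb"

falcon_4_bit_model = AutoModelForCausalLM.from_pretrained(
    quantized_model_id,
    device_map="auto",        # assumed here; places the model on the available GPU
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    quantized_model_id,
    trust_remote_code=True,
)

text = "I want to travel to"
inputs = tokenizer(text, return_tensors="pt").to(0)  # move the input tensors to GPU 0

out = falcon_4_bit_model.generate(**inputs)
print(tokenizer.decode(out[0], skip_special_tokens=True))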
  • Load pre-trained and quantized “falcon-rw-1bt-gptq-4bit-ptb” language model
  • Load associated tokenizer
  • Define input text: “I want to travel to”
  • Tokenize input text into PyTorch tensor
  • Use model’s generate() method to generate text continuation from input
  • Decode generated output tensor back into text
  • Print decoded text

Similarly, I generated the 2-bit quantized model and pushed it to Hugging Face. The Jupyter notebook is available here.

Evaluating the Quantized Models

This study uses the evaluation modules provided by LlamaIndex, inspired by Wenqi Glantz’s work, and benefits from the streamlined process offered by the recently introduced LlamaPack RagEvaluatorPack. To assess the performance of the pipeline, we invoke the following two modules from LlamaHub:

  • RagEvaluatorPack: This LlamaPack is specifically designed for evaluating RAG pipelines. It accepts the query_engine, rag_dataset, and judge_llm as inputs, executes a comprehensive suite of evaluation metrics, and obtains benchmark scores for the pipeline’s performance. I skipped specifying judge_llm and examined the resulting evaluation parameters.
  • LabelledRagDataset: We employ the Paul Graham Essay Dataset, a labeled RAG dataset derived from an essay by Paul Graham.

We evaluate along the following parameters:

  1. Correctness: This parameter assesses the relevance and accuracy of the generated answer by comparing it against a reference answer, ensuring that the output aligns with the expected response.
  2. Relevancy: This metric measures the degree to which the generated response and the selected source nodes are pertinent to the given question, ensuring that the system retrieves and uses relevant information to address the query effectively.
  3. Faithfulness: This parameter evaluates whether the response generated by the query engine accurately reflects the information contained in the source nodes, ensuring that the system does not introduce extraneous or unsubstantiated information.
  4. Context Similarity: This criterion evaluates the quality of the question-answering system by assessing the semantic similarity between the question, the generated answer, and the relevant context, ensuring that the system comprehends and uses the contextual information effectively.
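Semantic similarity metrics of this kind are commonly computed as the cosine similarity between embedding vectors. The sketch below illustrates that calculation with made-up 3-dimensional vectors standing in for real embeddings; it is not necessarily how RagEvaluatorPack computes its score.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up vectors standing in for embeddings of a reference context and two answers.
reference_context = [0.9, 0.1, 0.3]
close_answer = [0.8, 0.2, 0.4]
unrelated_answer = [-0.1, 0.9, -0.2]

print("close answer:    ", cosine_similarity(reference_context, close_answer))
print("unrelated answer:", cosine_similarity(reference_context, unrelated_answer))
```

An answer whose embedding points in nearly the same direction as the reference scores close to 1, while an unrelated one scores near or below 0.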

We evaluated the 4-bit, 2-bit, and base models. The evaluation takeaways are below.

Overall, the scores suggest that the model performed well in terms of correctness and context similarity, but not so well in terms of relevance and faithfulness.

Let’s analyze two of these metrics in detail:

  1. mean_correctness_score:
  • All three versions (4-bit, 2-bit, and base) have a mean score of 1.0, indicating that the responses are considered to be completely correct compared to the reference responses.
  • This indicates that the quantization process, even at aggressive levels such as 2-bit, had little effect on the accuracy or precision of the model’s output.
  2. mean_relevancy_score:
  • The 4-bit quantized model has a score of 0.022727, which is low but nonzero.
  • Both the 2-bit quantized model and the base model have a score of 0.0. This metric measures whether the response generated with the selected source node matches the query.
  • A low score indicates that the ability of the model to capture and use relevant information from the source nodes to solve the query is very limited, regardless of quantization level.

Based on these scores, the following observations and conclusions can be drawn.

  1. Quantization effect: The quantization process, even at levels as extreme as 2-bit, does not significantly degrade the precision or accuracy of the model’s outputs, as indicated by the identical mean_correctness_score across all versions.
  2. Relevancy problem: All versions of the model (quantized and base) seem to struggle to retrieve and use relevant information from source nodes to solve the query, as reflected in the low mean_relevancy_score values.
  3. Potential Causes: The low relevancy scores could be attributed to various factors, such as:
  • Limitations in the model’s ability to understand and process the queries effectively.
  • Insufficient or inadequate training data for the specific task or domain.
  • Challenges in retrieving and integrating relevant information from the source nodes.
  • Potential issues with the evaluation dataset or the relevancy metric itself.

The most probable cause seems to be the p2b dataset or the lack of a stronger judge LLM such as GPT-4. I plan to explore these two areas in the future.

The Jupyter notebooks for the quantization and inference can be found below.

Wishing you joyous endeavors in coding and continual learning!

