# LLM Quantization intuition & simple explanation

Last Updated on October 19, 2024 by Editorial Team

**Author(s): Allohvk**

Originally published on Towards AI.

## Quantization explained *in plain English*

When BERT was released around 5 years ago, it triggered a wave of **L**arge **L**anguage **M**odels with ever-increasing sizes. As the model size got bigger, the performance improved with no apparent upper limit to the improvement! In other words, the *larger the model, the better the output*. The law of diminishing returns is Mother Nature's way of ensuring a balance, which is vital to continuity. But (*at least for a while*) it seemed that large models were an exception to this rule. With strong innovations in computing power and the availability of open data sets, models kept getting bigger & better, finally reaching a point where nearly all (readily) available data was consumed. But the juggernaut barely paused! *Models can now train on data generated by models: a perpetual source of fuel!*

Is there going to be a deterioration in the quality of fuel as it is recycled? Will computing power continue to improve with innovations, or will it run into the quantum barrier (*transistor gates are now barely a dozen atoms wide; any more compression and the innate quantum behavior of these materials can no longer be overlooked*)? We can only speculate. But until then, we (the users) have a problem in training and inferencing (i.e., running) these large models. They are huge & bulky and have been shown to carry sparse information. Is there a way to *zip 'em up* and make them easier to handle? Can we MP3'ize these models like Brandenburg's group did in 1995, revolutionizing the way we listened to music (aided along the way by a certain Mr Sean Parker)? **Enter Quantization!**

If you were to dare open an LLM in the Notepad app, you would notice that it is nothing but a set of numbers. These numbers are the **Weights & Biases** of the model's internal layers. If you have ever tuned a complex stereo system, you may have manually rotated the various *knobs & dials* of the amplifier while listening intently to the outcoming music, and may have narrowed down on a particular combination which you felt was optimal. You may even have different sets of such combinations based on the music: a *ghazal* may not sound as nice when using the amplifier settings optimized for a *death-metal* band. Based on the input music, there are different settings of *dials & knobs*, and as the input signal flows thru' these layers, it is **transformed** before the output finally comes out of the speaker.

An LLM is not too different. It has several layers, each with its own dials & knobs. We call them **Weights & Biases** (*or sometimes just **Weights** or just **parameters**. We also interchangeably use the terms LLM and Large model, though the latter is more appropriate for this article*). The input signal is transformed by these **Weights** as it flows thru' the layers, and we finally get the output we desire. These **Weights** are stored as *32-bit floating point* numbers or variants thereof. So a **3-B parameter** model has 3 billion such **weights**, & if we want to load it on our laptop, it will need 3 billion * 32 bits, or ~12GB of memory. That is an awful lot of memory! Imagine a **70-B parameter** model! One option to compress it is to simply convert these **weights** to *half-precision*… as simple as rounding from *float-32* to *float-16*, and this halves the model size without impacting its performance too much!

Of course, no such compression is lossless. But that need not be a deal-breaker: there is a niche market of high-end music system users who might prefer a 60 MB lossless file to a 5 MB MP3, but most users don't care, because we get up to 90–95% of the experience with the 5 MB MP3. Same is the case with LLMs. Quantization (*we haven't really defined it yet, but for now the **zip** analogy will do*) can be really useful to run (*i.e. inference*) large models on edge devices like your mobile or a low-end laptop, or to *fine-tune* them on a modest-sized GPU.

## Quantization: The bare essence

The simplest form of quantization could be the rounding down from 32-bit to 16-bit float. Of course, since we have narrowed the range that can be represented, we would lose some really, really low values (*like the epsilon value in the layernorm layer of the LLM, which is usually pretty low*) or some really high values, so there are some minor ramifications that need to be considered based on the downstream task you want to do. But most of the time this arrangement works just fine! *Why not take a good thing all the way further?* Can we round the values to 8 or even 4 bits? In theory, YES, but this goes a little beyond rounding & *involves mapping continuous infinite values to a smaller set of discrete finite values…*

The range of **32-bit float** numbers is from -3.4E+38 to 3.4E+38, with around 4 billion representable values! On the other hand, a **4-bit** representation can store 16 possible values: binary 0000 to binary 1111. So we have the delicate task of taking a number in 32-bit (*with such a wide range*) and representing it in **one of 16 destination buckets**. We need to do this for all the *(billions of)* weights in the model. Woooo! If we do this, will the model even work? We can deduce that this is going to be a messy job & hence we need to be careful how we do it. It is definitely not going to be as easy as rounding, but it is surprisingly not that difficult either!

We can now take a look at the (slightly simplified) Wikipedia definition of **Quantization**: *It is the process of mapping input values from a large set to output values in a smaller set. Rounding & truncation are (typical) examples of quantization processes.* Let us also take a peek a little ahead and notice a curious mathematical property mentioned in the same Wiki page: *Because quantization is a many-to-few mapping, it is an inherently non-linear and **irreversible process** (i.e., because the same output value is shared by multiple input values, it is impossible, in general, to recover the exact input value when given only the output value).* For now, just file away this information and we will revisit it at a later stage.

So how do we go about converting a number from a large range and representing it in 16 buckets without losing too much information? There are a few methods, but usually quantization is about finding the best way to **project** the input range *[min_inp_value, max_inp_value]* of float-32 values into the 4-bit output space. Why only the *[min_inp_value, max_inp_value]* range, and why not the entire range that can be represented in float-32? Well, if the input signal values are spread across a smaller range, then it makes sense to utilize the 16 destination buckets for only this smaller range. For e.g., if all weights in the non-quantized model were between (say) -80.0 and +96.0, then we don't have to waste destination buckets for values outside this range. So one scheme would be to represent -80.0 by *bucket-0*, +96.0 by *bucket-15*, and map all other weights to the 14 remaining buckets in between. This is the core idea, though the actual implementation varies based on the method.

Here is a simple recipe for quantizing 4 weights [3, 1, -2, 3]:

- Find the absolute maximum value of the input: [3, 1, -2, 3] -> 3
- Divide by that value: [3, 1, -2, 3] becomes [1.0, 0.33, -0.67, 1.0]
- Multiply by the abs-max range of the target data type. We have 16 buckets in 4-bit. What would be the 0? One way is to sacrifice one of these buckets & consider only 15 buckets, from -7 thru 0 to +7. We have sacrificed one bucket just to keep things simple, and the abs-max range of the target data type is 7. Multiplying gives: [7.0, 2.33, -4.67, 7.0]
- Lastly, round to get: [7, 2, -5, 7]

We have done a simple **scaling** op. The **scale** is just the ratio: *absolute max value in destination range* / *absolute max value in input range*. The **Quantized Value** is simply the **scaled-up** original value, i.e. *Round(Scale * Original Value)*. For simplicity, we have assumed a symmetry around 0 for both ranges. This approach is called a *symmetric quantization scheme*, & since we leveraged the *absolute maximums* to calculate the quantization scale, it is called the *absmax* technique or something to that effect (I get lost in semantics sometimes).
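The recipe above is only a few lines of code. Here is a minimal sketch (function names like `absmax_quantize` are mine, not from any particular library) using the same 4 weights:

```python
import numpy as np

def absmax_quantize(weights, bits=4):
    # Symmetric scheme: sacrifice one bucket so the 4-bit range is -7..+7
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit
    scale = qmax / np.max(np.abs(weights))     # abs-max ratio
    q = np.round(weights * scale).astype(int)  # Round(Scale * Original Value)
    return q, scale

def absmax_dequantize(q, scale):
    return q / scale                           # back to (roughly) the original

w = np.array([3.0, 1.0, -2.0, 3.0])
q, scale = absmax_quantize(w)
print(q)   # [ 7  2 -5  7]
```

Dequantizing gives back roughly [3.0, 0.86, -2.14, 3.0]: close to, but not exactly, the originals, which is the irreversibility the Wiki definition warned about.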

We could also have *non-symmetric* input distributions for special cases (which we discuss later). Such weights are not symmetric around zero, and we may need a *scale-and-shift* operation for efficient quantization, much like how we convert degrees Celsius to Fahrenheit. In addition to the **scale S**, we need a *zero point offset*, which we call **Z**. We thus have the *affine quantization scheme*, where *output = Round(Scale * Original_Value + Z)*. There is a *shifting* by the zero point offset, in addition to the *scaling*.

**Scale** continues to be the ratio of the o/p & the i/p range, but instead of considering the *absmax* values, it is now the difference between the max and the min values in the range (*hence called the min-max technique, or words to that effect*). An e.g. might make things clear.

**Scale** is the *o/p range* divided by the *i/p range*. *Range* now is not *absmax* but the **diff between max & min values**. In case of 4-bit, the *o/p range* is from **-8 to +7** (*note how we haven't sacrificed any buckets here*) and the denominator is simply *max(inp weight) - min(inp weight)*. So in case of the e.g. above, i.e. `[3, 1, -2, 3]`, the **scale** is *(7-(-8)) / (3-(-2)) = 3*. This is also called a *zero-point quantization* technique (so many names!). This is because *zero* is no longer *zero* after quantization but **Z**. It is calculated as *-Round(Scale * min_input_value) - 8*. In other words, it is the **quantized representation of the lowest-valued input** *shifted by* **the lowest possible value in the destination range**. For our e.g., **Z** comes to -2. Applying the formula from the prev para, i.e. *Round(Scale * Original_Value + Z)*, we get `[7, 1, -8, 7]` (*with Scale=3 & Z=-2*).
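The same toy example can be pushed through the affine scheme. Here is a minimal sketch (function names are mine), assuming the -8..+7 destination range and the formulas above:

```python
import numpy as np

def zeropoint_quantize(weights, bits=4):
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1     # -8 .. +7 for 4-bit
    scale = (qmax - qmin) / (weights.max() - weights.min())  # min-max scale
    zero_point = -round(scale * weights.min()) + qmin        # Z: where float 0.0 lands
    q = np.clip(np.round(scale * weights) + zero_point, qmin, qmax).astype(int)
    return q, scale, zero_point

def zeropoint_dequantize(q, scale, zero_point):
    return (q - zero_point) / scale

w = np.array([3.0, 1.0, -2.0, 3.0])
q, scale, z = zeropoint_quantize(w)
print(q, scale, z)   # [ 7  1 -8  7] 3.0 -2
```

For this particular input, dequantizing happens to recover [3, 1, -2, 3] exactly, because the scale worked out to a whole number; in general it will not.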

Note that in the *symmetric quantization* universe, the quantized value of *zero* in the float-32 space is exactly *zero* in the quantized space too, & hence it is simply not factored into the calculations. So, as it turns out, it was just a special case of *affine quantization*. Anyway, to sum it up, **S** and **Z** are the quantization parameters, and we need to store them in case we want to dequantize and get back the (near-original) input. You can implement quantization yourself & play around with the weights as done nicely here.

This kind of compression works surprisingly well. This is because LLM weights are such that they get *reasonably distributed across all the 16 output buckets.* In other words, given an input weight range (*not the entire float32 range but the actual range of the input LLM weights*), we can notice that the weights are not concentrated in a narrow band inside this range, but reasonably well distributed across the entire range. This, in turn, ensures that the output values are well spread across the 16 o/p buckets after quantization. The way neural nets are trained & made to converge usually involves ensuring that the LLM weights take the form of a normal distribution. *If this were not the case, most of the weights would end up going into 1 or 2 o/p buckets only & make quantization un-viable.*

We need to account for **outliers** though… Imagine a crazily high value in the input range & all the rest being normal values. Our *abs(max(inp_weight))* will be this crazily high number, and it will mess up the entire quantization operation. *This high number will drive all the remaining normal weights to 0 during the quantization process.* There are a few ways to deal with this. We could clip outliers by capping the max/min values. We could also quantize in small blocks instead of across the entire set, thereby limiting the damage an outlier can do.
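Block-wise quantization is easy to sketch. Here is an illustrative (not library-specific) version of absmax applied per block, showing how a single outlier only flattens its own block:

```python
import numpy as np

def blockwise_absmax_quantize(weights, block_size=4, bits=4):
    # One scale per block, so an outlier only distorts the block it lives in
    qmax = 2 ** (bits - 1) - 1
    blocks = weights.reshape(-1, block_size)
    scales = qmax / np.max(np.abs(blocks), axis=1, keepdims=True)
    return np.round(blocks * scales).astype(int), scales

# 100.0 is an outlier: it drives its own block to zeros, but the second
# block's normal weights stay nicely spread across the buckets
w = np.array([0.5, -0.3, 0.2, 100.0,   0.4, -0.6, 0.1, 0.2])
q, scales = blockwise_absmax_quantize(w)
print(q)   # [[ 0  0  0  7]
           #  [ 5 -7  1  2]]
```

Real implementations use block sizes like 32 or 64, and the scales (one per block) are stored alongside the quantized weights.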

There is a last scenario that needs to be taken care of. Can you guess what it is? *Clue*: some blogs, when talking of quantizing model parameters, also talk about storing **activations**! Now, typically a model's parameters are its layers' weights and biases. When we download a pretrained model, this is mostly what we download: a set of weights and biases for every layer in the model. So what is this whole business about storing **activations**? Why do we need to do this during quantization? Let us spend 2 mins talking about **activations** themselves.

## The curious case of a neuron in the medial part of the temporal lobe

Inside the brain of a particular patient undergoing treatment for epilepsy, scientists discovered a neuron with a very peculiar characteristic. This particular neuron always got activated when shown a picture of Jennifer Aniston. The patient was shown 7 different pics of Aniston along with 80 other pics of animals, buildings or other famous people. The neuron steadfastly ignored all other pics, but *always fired* (i.e. got *activated*) each time Aniston appeared on screen. This is a bizarre discovery if we think about it: a neuron in the brain dedicated to Jennifer Aniston.

The concept itself is decades old, & over time other studies have pointed to similar dedicated neurons in other subjects for other famous personalities. But for a moment, think of the neuron as a switch that gets activated based on the inputs received. In the case of Aniston's pic, the neurons in the preceding layers of the patient's brain would have processed and sorted out the various features that make **Rachel**: the hair, the eyes, the strong character and (*I am sure*) many other things. These features in turn would have been created by many prev layers from the stream of light signals received by the eyes. Thus, the neuron in question is **activated** by the final set of features it receives as input. Since many other actresses too have similar features, we can only speculate that there may be some sort of weighted-feature summation and some final threshold beyond which the neuron has learnt (*during its lifetime*) **to fire!**

**Activation** in neural nets is similar. Wiki defines it as *a function that calculates the output of the node (neuron) based on its individual inputs & their weights*. This **activation** function is almost always an integral part of a neural net. The next few lines in the Wiki give an idea why: *Non-trivial problems can be solved using only a few nodes if the activation function is nonlinear.* **In other words, we need an activation function to solve complex real-life problems.** These functions literally 'bend' the input signal in ways that make it more amenable to producing the output we need. If we didn't have them, the entire model would effectively behave like a single-layer model & lose all its power *(a typical layer in a model is a linear function, and since a combination of two linear functions is also a linear function, even if we have 1000 layers, mathematically it is equivalent to having just 1 layer)*.

A ReLU is a classic example of an activation function. If the input to the ReLU function is negative, the output is a 0. If the input is positive, it just passes the signal as-is. So it is like a gate that opens only when a particular threshold is hit (the *activation threshold*). This *activation threshold* (in case of the ReLU above) is 0. When the input signal is negative, the gate remains closed and the signal is not passed to the next layer, *& this particular path plays no role in determining the final output*. So here is the key question: *after we quantize the model and start running inference on it, what should be the value of this activation threshold? Will it continue to be **zero**? No!*
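To make this concrete, recall the affine example earlier, where Scale=3 and Z=-2: float 0.0 maps to Round(3 * 0.0) + (-2) = -2 in the quantized space. A ReLU operating directly on quantized values would therefore have to clamp at Z, not at 0. A toy sketch of the idea (the function name is mine):

```python
def quantized_relu(q, zero_point):
    # In affine-quantized space, 'zero' is the zero point Z, so that becomes
    # the ReLU threshold; everything below Z represents a negative float value
    return max(q, zero_point)

# With Scale=3, Z=-2 from the earlier example: quantized -8 (float -2.0)
# is clamped to -2, which dequantizes back to float 0.0
print(quantized_relu(-8, -2))   # -2
print(quantized_relu(7, -2))    # 7
```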

All that shifting and scaling business we did during quantization would have messed up our definition of **zero**. We are dealing with a new numerical system here. What is the new **zero**? The model does not know. We could also have other activation functions, or other *special* types of functions like batch normalization, in the model. These *special* functions have associated numerical values that could get messed up when we quantize the rest of the model. It makes sense to quantize these as well, and for that we need their **range** to apply our quantization formulas. *But since these values are available **only during inference**, as the i/p data passes through the model*, we will not have the **range** at quantization time. Hmm, what can be done here?

## Handling activations and other special layers

A trivial solution to the above is to just **de-quantize** the quantized input back to the original scale, let it pass thru' these (activation) layers, let the layers use the default float-32 computations to decide whether to open the gate or not, and then **re-quantize** the output and pass it on to the next layer. Of course, this slows inference (*due to the additional dequantizing & quantizing business*) but is still faster than an E2E float-32 inference, and of course the memory savings are huge.

So we start with a **Weight-Only Quantization** and then do a **Dynamic Quantization, i.e. quantize activation outputs on the fly**! In other words, *during quantization*, the model weights are statically quantized, while activations etc. are left in float-32. *During inference*, activations are quantized on the fly! For that, we need a range of values for the activations to determine the scales etc. We can do it live: the forward pass can be done with float-32 tensors, the activation values for a batch can be obtained, & then we get an idea of the range. We can now apply our quantization formulas, get the scale & the zero point, and quantize to 4-bit dynamically during runtime for the onward layers.

Another alternative is to make an educated guess on the activation range during quantization itself. If we have a good, small calibration dataset with enough representative values, we could insert some probes into the model, pass the representative data samples through it, check when the neurons fire, record those activation values, calculate the activation ranges, and then apply the formula & quantize the whole paraphernalia: weights, biases, activations, special layers and what-not. This is a **Full-Integer Quantization** (it is also a fully **Static Quantization** method).
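The 'probe' idea can be sketched in a few lines. This is an illustrative observer (the class name and structure are mine, loosely mirroring what real toolkits do): it watches activations during calibration and then emits the affine quantization parameters:

```python
import numpy as np

class RangeObserver:
    """Records the min/max of every activation tensor seen during calibration."""
    def __init__(self):
        self.lo, self.hi = float("inf"), float("-inf")

    def observe(self, x):
        self.lo = min(self.lo, float(x.min()))
        self.hi = max(self.hi, float(x.max()))

    def quant_params(self, bits=4):
        # Same min-max / zero-point formulas we used for the weights
        qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
        scale = (qmax - qmin) / (self.hi - self.lo)
        zero_point = -round(scale * self.lo) + qmin
        return scale, zero_point

obs = RangeObserver()
for activations in (np.array([0.1, 2.0]), np.array([-1.0, 0.5])):  # calibration batches
    obs.observe(activations)
scale, z = obs.quant_params()
print(scale, z)   # 5.0 -3
```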

All the above are categorized as **Post-Training Quantization (PTQ)** techniques. As the name suggests, here quantization is done after an LLM is trained. We also have **Quantization-Aware Training (QAT)**: here we do some extra stuff (for quantization) *during the model training stage itself*. Part of the process is similar to the last item we saw above. But we do something additional here. We create *fake quantize operators*, which *simulate the error induced by quantization, to let the model adapt to it during the training process!* It is an interesting technique & worth spending an extra para exploring.

We start by adding fake quantization to the model in the forward pass, to make it experience the effects of quantization. A *FakeQuant* node is basically a combination of *Quantize* and *Dequantize* operations. These are added to every operation involving computations. We have now added a quantization error to the forward pass, which also adds to the (overall) error in the loss function, which is then tuned by the optimizer in the backward pass. The gradients themselves are calculated without loss of precision, making the model robust to quantization. In this way, the training not only adjusts the weights as per the original training objective (*whatever that may be*) but also **tunes the model to minimize the impact of the errors introduced by quantization!**
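Since a *FakeQuant* node is just quantize-then-dequantize, the forward pass sees the rounding error while the values stay in float. A minimal sketch (the function name is mine; real frameworks also handle the gradient through this rounding with a straight-through estimator, which is skipped here):

```python
import numpy as np

def fake_quant(w, bits=4):
    # Quantize and immediately dequantize: the result is w plus exactly the
    # error that real quantization would introduce at inference time
    qmax = 2 ** (bits - 1) - 1
    scale = qmax / np.max(np.abs(w))
    return np.round(w * scale) / scale

w = np.array([3.0, 1.0, -2.0, 3.0])
print(fake_quant(w))   # roughly [3.0, 0.857, -2.143, 3.0]
```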

Usually, the **PTQ** techniques outlined above are used. As we saw, some of them involve the *art of* **de-quantizing**. We can't just add preceding zeros to convert from 4-bit to 32-bit float. The de-quantization process should take the value approximately back to the original range (*before it was quantized*). This means we will have to store the **scaling values** during quantization. These are needed to **scale** the 4-bit values back to the original de-quantized range. *Since we had quantized in small blocks to avoid outlier effects, we will have individual scaling values for each of these blocks.* All these scales need to be stored during the quantization process and do take up additional space. These scaling numbers could themselves be quantized while storing, leading to double-quantization effects, but you get the core story by now.

*We are pretty much done with Quantization!* Let us now look at a few actual **implementations of quantization** in the industry. Let us also tie some misc loose ends together & get some semantics right:

- Quantization parameters can be computed at different granular levels, from a *per-tensor* basis to a *per-channel* basis. Considering a tensor of shape `[N, C, H, W]`, a *per-channel* (channel-wise) quantization along the second dimension would result in having a scaling factor for each of the `C` channels.
- While we discussed 4-bit quantization exclusively, the same techniques can be used for 8-bit or any other reduced size. Moreover, we assumed that the non-quantized model was in float-32. This need not be the case. There is an efficient bfloat representation in 16-bit, so quantization then involves converting 16-bit numbers to 8/4/2-bit etc. We could also have a different destination type instead of the 16 buckets of a 4-bit int. For e.g. **QLoRA** introduces a new 4-bit data type called NormalFloat (**NF4**) that is theoretically optimal for normally distributed weights. Each **NF4** value lies between -1 and 1 & is a more accurate representation compared to a conventional 4-bit float or a 4-bit int. I will cover LoRA in the next blog, so for now, just think of QLoRA as *quantization to 4-bit NF4*.
- **GPTQ** is a **PTQ** method that uses a **layer-wise** quantization approach. Basically it works one layer at a time, finding the best quantized weights that minimize output error (mean squared error between the original signal & the quantized one) using a *calibration* batch and some back & forth *adjustments*. It does this layer-by-layer till a fully quantized model is produced. It also retains activations in higher precision. Precision levels like INT8, INT4, INT3, INT2 etc. are used. I have deliberately skipped details concerning how the *adjustments* are done. The seeds are from the OBQ paper: quantize one weight at a time while *updating all not-yet-quantized weights, in order to compensate for the error incurred by quantizing that single weight*, and repetitively iterate!
- **GGML** from the Llama family uses different bit widths (from 2 to 8) depending on the chosen quant method. In fact, it also ensures that the *important weights* are quantized to a higher-precision data type and the rest to a lower precision. Note that **GGUF** is the new version of GGML and is more flexible and extensible. I initially felt that experience and empirical observations were used to decide which are the *important weights* (various possible combinations are given to the user). It appears that this choice is also driven by the magnitude of the weights' gradients on a given training dataset. The lower the gradient, the less this weight matters in the final scheme of things, and hence the lower its precision can be set. *Note: I couldn't find any GGML/GGUF paper. It is best to directly look at the code for more specific details.*
- **AWQ**, or **A**ctivation-aware **W**eight **Q**uantization, is based on a simple premise: *that ~1% of weights (called **salient weights**) contribute significantly to model performance and should therefore be treated with special care*. While it would have been ideal to not quantize these **salient weights** at all, they believe that this kind of *mixed-precision quantization* (*some weights in 32-bit, some in 4-bit*) is hardware-inefficient & suggest an alternative. They show that **scaling up** the salient weight channels before quantizing preserves most of the information. Interestingly, the salient weights & the scales are determined by collecting *activation* statistics. Yes, to find the salient *weight* channels, they refer to the *activation* distribution instead of the *weight* distribution. This is simply because *weight channels corresponding to larger activation magnitudes* are more relevant to the output. *We already saw in the prev section how even large weights can be rendered ineffective if the activation gates are closed.* Also, since they don't use the **adjustment** process of GPTQ, this preserves the generalist abilities of the model, ensuring it does not over-fit to the calibration set (*AWQ uses the calibration set only to measure the average activation magnitude per channel*).
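The *per-channel* granularity from the first bullet is a small change to the earlier absmax sketch: compute one scale per channel (row) instead of one for the whole tensor. An illustrative version (names mine):

```python
import numpy as np

def per_channel_absmax_quantize(w, bits=4):
    # One scale per output channel (row), instead of one per tensor
    qmax = 2 ** (bits - 1) - 1
    absmax = np.max(np.abs(w), axis=1, keepdims=True)   # shape [C, 1]
    scales = qmax / absmax
    return np.round(w * scales).astype(int), scales

w = np.array([[0.2, -0.7, 0.1],      # a channel of small weights
              [10.0, -35.0, 5.0]])   # a channel of large weights
q, scales = per_channel_absmax_quantize(w)
print(q)   # [[ 2 -7  1]
           #  [ 2 -7  1]]
```

Both channels land on the same quantized codes but carry different scales; with a single per-tensor scale, the small-weight channel would have collapsed to zeros.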

There are no ground rules for choosing one over the other. GGML runs nicely on a CPU or an M-series Mac machine. QLoRA's NF4 and the INT8 implementation are easily available in the *bitsandbytes* library and closely coupled to Huggingface. GPTQ is popular, while AWQ claims to retain the generalist capabilities of the LLM… In general, it is best to play around with different quantization schemes using the model and dataset of choice before freezing on one. If you want to avoid quantizing an LLM yourself, a bloke named TheBloke on Huggingface has several thousand popular models in various quantized formats ready for download. The decision then eliminates the quantization process itself & reduces the focus to a tradeoff between inference speed, memory & accuracy; the former two can be instantaneously tested.

If you have been with me on this journey this far, we have covered most of the key concepts recorded as of 2024 in the field of quantization. I have tried to explain the concepts in plain English *without diluting any of the essence*. For a more technical treatment, please refer to Mao's blog. This is the first of a 10-series article titled **My LLM diaries**. I originally intended to write one article a weekend, inspired by Vaidehi's DSA series, but this first one took me 4 weekends & a weekday holiday in between: *an intense 40-hour effort*! This is because I completely avoid contaminating my writings with any form of Gen AI output, not only due to hallucination issues but for a more selfish reason: *I can understand the **why**'s & the **how**'s better only if I write the stuff myself in plain English*. The **what**, of course, is easily google-able or available from Gen AI. In any case, I hope I am able to churn out future articles at a faster pace. Others planned in the LLM diaries are:

- LoRA in plain English
- A detailed inspection of LLM.int8() and QLoRA
- RAG in plain English β Basic
- RAG in plain English β Advanced
- LLMs on the laptop β A practical primer
- Taming LLM β Fine-tuning & other techniques
- Agents in plain English
- LLMops in plain English β Operationalizing trained models
- Taking a step back β On model sentience, conscientiousness & other philosophical aspects

*Alice: Would you tell me, please, which way I ought to go from here?*
*The Cheshire Cat: That depends a good deal on where you want to get to.*
*Alice: I don't much care where.*
*The Cheshire Cat: Then it doesn't much matter which way you go.*
*Alice: …So long as I get somewhere.*
*The Cheshire Cat: Oh, you're sure to do that, if only you walk long enough.*

- **Lewis Carroll, *Alice in Wonderland***


Published via Towards AI