
LLM Quantization: intuition & a simple explanation

Last Updated on October 19, 2024 by Editorial Team

Author(s): Allohvk

Originally published on Towards AI.

Quantization explained in plain English

When BERT was released around five years ago, it triggered a wave of Large Language Models of ever-increasing size. As models got bigger, performance improved with no apparent upper limit! In other words: the larger the model, the better the output. The law of diminishing returns is Mother Nature's way of ensuring a balance, which is vital to continuity, but (at least for a while) large models seemed to be an exception to this rule. With strong innovations in computing power and the availability of open datasets, models kept getting bigger & better, finally reaching a point where nearly all (readily) available data had been consumed. But the juggernaut barely paused! Models can now train on data generated by models, a perpetual source of fuel!

Is there going to be a deterioration in the quality of this fuel as it is recycled? Will computing power continue to improve, or will it run into the quantum barrier (transistor gates are now barely a dozen atoms wide; compress any further and the innate quantum behavior of these materials can no longer be overlooked)? We can only speculate. But until then, we (the users) have a problem training and inferencing (i.e., running) these large models. They are huge and bulky and have been shown to carry sparse information. Is there a way to zip 'em up and make them easier to handle? Can we MP3'ize these models the way Brandenburg's group did in 1995, revolutionizing the way we listened to music (aided along the way by a certain Mr Sean Parker)? Enter Quantization!

If you were to dare open an LLM in the Notepad app, you would notice that it is nothing but a set of numbers. These numbers are the Weights & Biases of the model's internal layers. If you have ever tuned a complex stereo system, you may have manually rotated the various knobs & dials of the amplifier while listening intently to the music coming out, and narrowed down on a particular combination that felt optimal. You may even have different sets of such combinations based on the music: a ghazal may not sound as nice with the amplifier settings optimized for a death-metal band. Based on the input music, there are different settings of dials & knobs, and as the input signal flows through these layers, it is transformed before the output finally comes out of the speaker.

Photo by Jeremy Lanfranchi on Unsplash

An LLM is not too different. It has several layers, each with its own dials & knobs. We call them Weights & Biases (or sometimes just Weights, or just parameters; we also use the terms LLM and Large model interchangeably, though the latter is more appropriate for this article). The input signal is transformed by these Weights as it flows through the layers, and we finally get the output we desire. These Weights are stored as 32-bit floating-point numbers or their variants. So a 3B-parameter model has 3 billion such weights, and if we want to load it on our laptop, it will need 3 billion × 32 bits, or ~12 GB of memory. That is an awful lot of memory! Imagine a 70B-parameter model! One option to compress it is to simply convert these weights to half precision... as simple as rounding from float-32 to float-16, and this halves the model size without impacting its performance too much!
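To make the arithmetic concrete, here is a minimal NumPy sketch (using the 3-billion-parameter figure from above) of the memory footprint at different precisions, and of the one-line cast to half precision:

```python
import numpy as np

# Rough memory footprint of a 3-billion-parameter model at different precisions
n_params = 3_000_000_000
for dtype in (np.float32, np.float16):
    size_gb = n_params * np.dtype(dtype).itemsize / 1e9
    print(f"{np.dtype(dtype).name}: ~{size_gb:.0f} GB")       # ~12 GB vs ~6 GB

# Casting weights to half precision is a one-liner per tensor
weights_fp32 = np.random.randn(4, 4).astype(np.float32)
weights_fp16 = weights_fp32.astype(np.float16)                # half the memory, small rounding error
print(np.abs(weights_fp32 - weights_fp16.astype(np.float32)).max())
```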

Of course, such compression is not lossless. The very existence of lossless music streaming shows that something is lost along the way: there is a niche market of high-end audio users who might prefer a 60 MB lossless track to a 5 MB MP3. But most users don't care, because the 5 MB MP3 gets us 90-95% of the experience. The same is true of LLMs. Quantization (we haven't really defined it yet, but for now the zip analogy will do) can be really useful for running (i.e., inferencing) large models on edge devices, like your mobile or a low-end laptop, or for fine-tuning them on a modest-sized GPU.

Quantization: The bare essence

The simplest form of quantization could be the rounding down from 32-bit to 16-bit float. Of course, since we have narrowed the range that can be represented, we lose some really, really low values (like the epsilon value in the LayerNorm layer of the LLM, which is usually pretty small) or some really high values, so there are minor ramifications to consider based on the downstream task you want to do. But most of the time this arrangement works just fine! Why not take a good thing further? Can we round the values to 8 or even 4 bits? In theory, YES, but this goes a little beyond rounding & involves mapping continuous, infinite values to a smaller set of discrete, finite values...

The range of 32-bit float numbers is from -3.4E+38 to 3.4E+38, and a 32-bit representation can encode around 4 billion distinct values. On the other hand, a 4-bit representation can store 16 possible values: binary 0000 to binary 1111. So we have the delicate task of taking a number in 32-bit (with such a wide range) and representing it in one of 16 destination buckets. We need to do this for all the (billions of) weights in the model. Woooo! If we do this, will the model even work? We can deduce that this is going to be a messy job & hence we need to be careful about how we do it. It is definitely not going to be as easy as rounding, but it is surprisingly not that difficult either!

We can now take a look at the (slightly simplified) Wikipedia definition of quantization: it is the process of mapping input values from a large set to output values in a smaller set. Rounding & truncation are typical examples of quantization processes. Let us also peek a little ahead and notice a curious mathematical property mentioned on the same Wiki page: because quantization is a many-to-few mapping, it is an inherently non-linear and irreversible process (i.e., because the same output value is shared by multiple input values, it is impossible, in general, to recover the exact input value when given only the output value). For now, just file away this information; we will revisit it at a later stage.

So how do we go about taking a number from a large range and representing it in 16 buckets without losing too much information? There are a few methods, but usually quantization is about finding the best way to project the input range [min_inp_value, max_inp_value] of the float-32 values into the 4-bit output space. Why only the [min_inp_value, max_inp_value] range and not the entire range that can be represented in float-32? Well, if the input values are spread across a smaller range, it makes sense to utilize the 16 destination buckets for only that smaller range. For example, if all weights in the non-quantized model were between (say) -80.0 and +96.0, then we don't have to waste destination buckets on values outside this range. So one scheme would be to represent -80.0 by bucket 0, +96.0 by bucket 15, and map all other weights to the 14 remaining buckets in between. This is the core idea, though the actual implementation varies by method.

Here is a simple recipe for quantizing 4 weights [3, 1, -2, 3]:

  1. Find the absolute maximum value of the input: [3, 1, -2, 3] -> 3
  2. Divide by that value: [3, 1, -2, 3] becomes [1, 0.33, -0.66, 1.0]
  3. Multiply by the abs-max range of the target data type. We have 16 buckets in 4-bit. What should represent 0? One way is to sacrifice one of these buckets & consider only 15 buckets, from -7 through 0 to +7. We have sacrificed one bucket just to keep things simple, and the abs-max range of the target data type is now 7. Multiplying gives: [7.0, 2.31, -4.62, 7.0]
  4. Lastly, round to get: [7, 2, -5, 7]

We have done a simple scaling op. The scale is just the ratio: absolute max value in the destination range / absolute max value in the input range. The quantized value is simply the scaled-up original value, i.e., Round(Scale * Original Value). For simplicity, we have assumed symmetry around 0 for both ranges. This approach is called a symmetric quantization scheme &, since we leveraged the absolute maximums to calculate the quantization scale, it is called the absmax technique or something to that effect (I get lost in semantics sometimes).
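A minimal NumPy sketch of this absmax recipe (the helper name is mine, not a library API) reproduces the numbers above:

```python
import numpy as np

def absmax_quantize(x, n_bits=4):
    """Symmetric (absmax) quantization to signed n-bit integers."""
    q_max = 2 ** (n_bits - 1) - 1          # 7 for 4-bit (one bucket sacrificed)
    scale = q_max / np.abs(x).max()        # abs-max of target range / abs-max of input range
    x_q = np.round(scale * x).astype(np.int8)
    return x_q, scale

weights = np.array([3.0, 1.0, -2.0, 3.0])
x_q, scale = absmax_quantize(weights)
print(x_q)      # [ 7  2 -5  7]
print(scale)    # 2.333...
```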

We could also have non-symmetric input distributions in special cases (which we discuss later). Such weights are not symmetric around zero, and we may need a scale-and-shift operation for efficient quantization, much like how we convert Celsius to Fahrenheit. In addition to the scale S, we need a zero-point offset, which we call Z. We thus have the affine quantization scheme, where output = Round(Scale * Original_Value + Z). There is a shift by the zero-point offset in addition to the scaling. Scale continues to be the ratio of the output & input ranges, but instead of considering the absmax values, it is now the difference between the max and the min values in each range (hence called the min-max technique, or words to that effect). An example might make things clear.

Scale is the output range divided by the input range. Range here is not absmax but the difference between the max & min values. In the case of 4-bit, the output range runs from -8 to +7 (note how we haven't sacrificed any buckets here) and the denominator is simply max(inp weight) - min(inp weight). So for the example above, i.e., [3, 1, -2, 3], the scale is (7 - (-8)) / (3 - (-2)) = 3. This is also called a zero-point quantization technique (so many names!). This is because zero is no longer zero after quantization but Z. It is calculated as -Round(Scale * min_input_value) - 8. In other words, it is the quantized representation of the lowest-valued input shifted by the lowest possible value in the destination range. For our example, Z comes to -2. Applying the so-many-names-scheme formula from the previous paragraph, i.e., Round(Scale * Original_Value + Z), we get [7, 1, -8, 7] (with Scale = 3 & Z = -2).
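The same worked example, as a small sketch (again, the function name is mine, not a standard API):

```python
import numpy as np

def zeropoint_quantize(x, n_bits=4):
    """Asymmetric (zero-point / min-max) quantization to signed n-bit integers."""
    q_min, q_max = -2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1    # -8 and +7 for 4-bit
    scale = (q_max - q_min) / (x.max() - x.min())               # output range / input range
    zero_point = -np.round(scale * x.min()) + q_min             # real 0.0 now maps to Z, not 0
    x_q = np.clip(np.round(scale * x + zero_point), q_min, q_max).astype(np.int8)
    return x_q, scale, zero_point

weights = np.array([3.0, 1.0, -2.0, 3.0])
x_q, scale, zp = zeropoint_quantize(weights)
print(x_q, scale, zp)    # [ 7  1 -8  7]  3.0  -2.0
```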

Note that in the symmetric quantization universe, the quantized value for zero in float-32 space is exactly zero in the quantized space as well, & hence it is simply not factored into the calculations. So, as it turns out, symmetric quantization is just a special case of affine quantization (with Z = 0). Anyway, to sum it up, S and Z are the quantization parameters, and we need to store them if we ever want to dequantize and get back the (near-original) input. You can implement quantization yourself & play around with the weights, as done nicely here.
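And to round off the toy example, dequantization is just the inverse transform using the stored parameters. Note that the recovered values are only approximate, which is exactly the irreversibility flagged in the Wikipedia definition earlier:

```python
import numpy as np

# De-quantizing the absmax example: divide by the stored scale S.
x_q = np.array([7, 2, -5, 7], dtype=np.int8)    # quantized version of [3, 1, -2, 3]
scale = 7 / 3.0                                 # the scale we stored at quantization time

x_dq = x_q / scale                              # back to (approximately) the original values
print(x_dq)                                     # [ 3.     0.857 -2.143  3.   ]
print(np.array([3.0, 1.0, -2.0, 3.0]) - x_dq)   # residual error: the information lost for good
```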

This kind of compression works surprisingly well. This is because LLM weights tend to be reasonably distributed across all 16 output buckets. In other words, given an input weight range (not the entire float-32 range but the actual range of the input LLM weights), the weights are not concentrated in a narrow band inside this range but reasonably well spread across it. This, in turn, ensures that the output values are well spread across the 16 output buckets after quantization. The way neural nets are trained & made to converge usually results in LLM weights that are roughly normally distributed. If this were not the case, most of the weights would end up in just 1 or 2 output buckets & make quantization unviable.

We need to account for outliers, though... Imagine a crazily high value in the input range while all the rest are normal values. Our abs(max(inp_weight)) would be this crazily high number, and it would mess up the entire quantization operation: it would drive all the remaining normal weights to 0 during quantization. There are a few ways to deal with this. We could clip outliers by setting max/min values. We could also quantize in small blocks rather than across the entire set, thereby limiting the damage an outlier can do.
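Here is a small sketch of the block-wise idea (my own illustration, with a made-up block size): each block gets its own absmax scale, so an outlier only ruins the resolution of its own block.

```python
import numpy as np

def blockwise_absmax_quantize(x, block_size=64, n_bits=4):
    """Quantize in independent blocks so that one outlier only distorts its own block.
    Returns the quantized values plus one scale per block (all of which must be stored)."""
    q_max = 2 ** (n_bits - 1) - 1
    blocks = x.reshape(-1, block_size)                   # assumes len(x) % block_size == 0
    scales = q_max / np.abs(blocks).max(axis=1, keepdims=True)
    x_q = np.round(blocks * scales).astype(np.int8)
    return x_q, scales

weights = np.random.randn(256).astype(np.float32)
weights[7] = 100.0                                       # one crazily high outlier
x_q, scales = blockwise_absmax_quantize(weights)
# Only the outlier's block gets squashed towards zero; the other blocks keep their resolution.
print(np.count_nonzero(x_q, axis=1))                     # first count is much smaller than the rest
```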

There is one last scenario to take care of. Can you guess what it is? Clue: some blogs, when talking about quantizing model parameters, also talk about storing activations! Now, typically a model's parameters are its layers' weights and biases. When we download a pretrained model, this is mostly what we download: a set of weights and biases for every layer in the model. So what is this whole business about storing activations? Why do we need to do this during quantization? Let us spend two minutes on activation itself.

The curious case of a neuron in the medial part of the temporal lobe

Inside the brain of a patient undergoing treatment for epilepsy, scientists discovered a neuron with a very peculiar characteristic. This particular neuron always got activated when shown a picture of Jennifer Aniston. The patient was shown seven different pictures of Aniston along with 80 other pictures of animals, buildings, and other famous people. The neuron steadfastly ignored all the other pictures but always fired (i.e., got activated) each time Aniston appeared on screen. This is a bizarre discovery if we think about it: a neuron in the brain dedicated to Jennifer Aniston.

The concept itself is decades old, & over time other studies have pointed to similar dedicated neurons in other subjects for other famous personalities. But for a moment, think of the neuron as a switch that gets activated based on the inputs received. In the case of Aniston's picture, the neurons in the preceding layers of the patient's brain would have processed and sorted out the various features that make up Rachel: the hair, the eyes, the strong character, and (I am sure) many other things. These features in turn would have been created by many previous layers from the stream of light signals received by the eyes. Thus, the neuron in question is activated by the final set of features it receives as input. Since many other actresses have similar features, we can only speculate that there may be some sort of weighted-feature summation and some final threshold beyond which the neuron has learnt (during its lifetime) to fire!

Activation in neural nets is similar. Wiki defines it as a function that calculates the output of the node (neuron) based on its individual inputs & their weights. This activation function is almost always an integral part of a neural net. The next few lines in the Wiki give an idea why: non-trivial problems can be solved using only a few nodes if the activation function is nonlinear. In other words, we need an activation function to solve complex real-life problems. These functions literally "bend" the input signal in ways that make it more amenable to producing the output we need. If we didn't have them, the entire model would effectively behave like a single-layer model & lose all its power (a typical layer in a model is a linear function, and since a combination of two linear functions is also a linear function, even if we have 1000 layers, mathematically it is equivalent to having just 1 layer).

A ReLU is a classic example of an activation function. If the input to the ReLU function is negative, the output is 0. If the input is positive, it just passes the signal as-is. So it is like a gate that opens only when a particular threshold is hit (the activation threshold). This activation threshold (in the case of the ReLU above) is 0. When the input signal is negative, the gate remains closed, the signal is not passed to the next layer, & this particular path plays no role in determining the final output. So here is the key question: after we quantize the model and start running inference on it, what should the value of this activation threshold be? Will it continue to be zero? No!

All that shifting and scaling business we did during quantization will have messed up our definition of zero. We are dealing with a new numerical system here. What is the new zero? The model does not know. We could also have other activation functions or other special types of functions, like batch normalization, in the model. These special functions have associated numerical values that could get messed up when we quantize the rest of the model. It makes sense to quantize these as well, and for that we need their range in order to apply our quantization formulas. But since these values are only available during inference, as the input data passes through the model, we will not have the range at quantization time. Hmm, what can be done here?
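To make the "zero is no longer zero" point concrete, here is a tiny sketch of my own (not taken from any particular library): with the affine scheme from earlier, the real value 0.0 lands on the zero point Z, so a ReLU applied directly in the quantized domain would have to clamp at Z rather than at the integer 0.

```python
import numpy as np

# With the affine scheme from earlier (Scale = 3, Z = -2), the real value 0.0
# is represented by the integer -2, not by 0.
scale, zero_point = 3.0, -2

x = np.array([-1.0, 0.0, 2.0])                            # real-valued activations
x_q = np.round(scale * x + zero_point).astype(np.int8)    # [-5, -2, 4]

relu_q = np.maximum(x_q, zero_point)                      # clamp at Z, not at 0
print((relu_q - zero_point) / scale)                      # [0. 0. 2.]  == ReLU of the real values
```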

Handling activations and other special layers

A trivial solution to the above is to just dequantize the quantized input back to the original scale, let it pass through these (activation) layers, let them use the default float-32 computations to decide whether to open the gate or not, and then re-quantize the output and pass it on to the next layer. Of course, this slows inference (due to the additional dequantizing & re-quantizing business) but is still faster than end-to-end float-32 inference, and of course the memory savings are huge.

So we start with a weight-only quantization and then do a dynamic quantization, i.e., quantize the activation outputs on the fly! In other words, during quantization, the model weights are statically quantized, while activations etc. are left in float-32. During inference, activations are quantized on the fly! For that, we need a range of values for the activations to determine the scales etc. We can do it live: the forward pass can be done with float-32 tensors, the activation values for a batch can be obtained, & then we get an idea of the range. We can now apply our quantization formulas, get the scale & the zero point, and quantize to 4-bit dynamically at runtime for the onward layers.
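For reference, here is roughly what this looks like with PyTorch's built-in dynamic quantization helper (int8 rather than 4-bit, since that is what the built-in API supports; the toy model below is just a placeholder):

```python
import torch
import torch.nn as nn

# Weights of the listed layer types are quantized to int8 up front,
# while activations are quantized on the fly during inference.
model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768))

quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
print(quantized_model(x).shape)    # torch.Size([1, 768])
```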

Another alternative is to make an educated guess about the activation range during quantization itself. If we have a good, small calibration dataset with enough representative values, we can insert some probes into the model, pass the representative data samples through it, check when each neuron fires, record those activation values, calculate the activation ranges, and then apply the formula & quantize the whole paraphernalia: weights, biases, activations, special layers, and what-not. This is Full-Integer Quantization (it is also a fully Static Quantization method).
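The calibration step itself is conceptually simple. Here is a toy sketch (the layer, weight matrix, and function names are made up for illustration) of collecting an activation range from representative batches and turning it into a quantization scale:

```python
import numpy as np

def calibrate_activation_range(layer_fn, calibration_batches):
    """Pass representative data through a layer and record the observed
    activation range - the job of the 'probes' (a.k.a. observers)."""
    lo, hi = np.inf, -np.inf
    for batch in calibration_batches:
        acts = layer_fn(batch)
        lo, hi = min(lo, acts.min()), max(hi, acts.max())
    return lo, hi

# Toy stand-in for one linear layer followed by a ReLU
W = np.random.randn(16, 16)
layer_fn = lambda x: np.maximum(x @ W, 0)
calib = [np.random.randn(8, 16) for _ in range(10)]

act_min, act_max = calibrate_activation_range(layer_fn, calib)
scale = (2 ** 4 - 1) / (act_max - act_min)     # min-max scale for a 4-bit destination
print(act_min, act_max, scale)
```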

All of the above are categorized as Post-Training Quantization (PTQ) techniques. As the name suggests, quantization is done after the LLM is trained. We also have Quantization-Aware Training (QAT), where we do some extra work (for quantization) during the model training stage itself. Part of the process is similar to the last item above, but we do something additional: we create fake quantize operators, which simulate the error induced by quantization, to let the model adapt to it during training! It is an interesting technique & worth spending an extra para exploring.

We start by adding fake quantization to the model in the forward pass to make it experience the effects of quantization. A FakeQuant node is basically a combination of Quantize and Dequantize operations. These are added to every operation involving computation. We have now added a quantization error in the forward pass, which also adds to the (overall) error in the loss function, which is then reduced by the optimizer in the backward pass. The gradients are calculated without loss of precision, making the model robust to quantization. In this way, training not only adjusts the weights as per the original training objective (whatever that may be) but also tunes the model to minimize the impact of the errors introduced by quantization!
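A FakeQuant node can be sketched in a couple of lines (this is just the idea; real QAT implementations also need a straight-through estimator so gradients can flow through the rounding op):

```python
import numpy as np

def fake_quantize(x, n_bits=4):
    """Quantize-then-dequantize: values stay in float-32 but now carry exactly
    the error that a real n-bit quantization would introduce."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = q_max / np.abs(x).max()
    return np.round(x * scale) / scale     # still float, but 'snapped' to the 4-bit grid

w = np.random.randn(5).astype(np.float32)
print(w)                                   # the real weights
print(fake_quantize(w))                    # what the QAT forward pass sees
print(w - fake_quantize(w))                # the simulated quantization error
```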

Usually, the PTQ techniques outlined above are used. As we saw, some of them involve the art of dequantizing. We can't just add leading zeros to convert from 4-bit to 32-bit float. The dequantization process should take each value approximately back to its original range (before it was quantized). This means we have to store the scaling values during quantization; they are needed to scale the 4-bit values back to the original, dequantized scale. Since we quantized in small blocks to avoid outlier effects, we have an individual scaling value for each of these blocks. All these scales need to be stored during the quantization process and do take up additional space. The scaling numbers themselves could be quantized while storing, leading to double quantization, but you get the core story by now.

We are pretty much done with Quantization! Let us now look at a few actual implementations of quantization in the industry. Let us also tie some misc loose ends together & get some semantics right:

  • Quantization parameters can be computed at different levels of granularity, from a per-tensor basis to a per-channel basis. Considering a tensor of shape [N, C, H, W], a per-channel (channel-wise) quantization along the second dimension would result in one scaling factor for each of the C channels.
  • While we discussed 4-bit quantization exclusively, the same techniques can be used for 8-bit or any other reduced size. Moreover, we assumed that the non-quantized model was in float-32. This need not be the case: there is an efficient 16-bit bfloat representation, in which case quantization involves converting 16-bit numbers to 8/4/2-bit, etc. We could also have a different destination type instead of the 16 integer buckets of plain 4-bit. For example, QLoRA introduces a new 4-bit data type called NormalFloat (NF4) that is theoretically optimal for normally distributed weights. Each NF4 value lies between -1 and 1 & is a more accurate representation than a conventional 4-bit float or 4-bit int. I will cover LoRA in the next blog, so for now, just think of QLoRA as quantization to 4-bit NF4.
  • GPTQ is a PTQ technique that uses a layer-wise quantization approach. Basically, it works one layer at a time, finding the quantized weights that minimize the output error (the mean squared error between the original signal & the quantized one) using a calibration batch and some back-&-forth adjustments. It does this layer by layer till a fully quantized model is produced. It also retains activations in higher precision. Precision levels like INT8, INT4, INT3, INT2, etc. are used. I have deliberately skipped the details of how the adjustments are done. The seeds are from the OBQ paper: quantize one weight at a time while updating all not-yet-quantized weights, in order to compensate for the error incurred by quantizing that single weight, and iterate!
  • GGML, from the llama.cpp family, uses different bit widths (from 2 to 8) depending on the chosen quant method. In fact, it also ensures that the important weights are quantized to a higher-precision data type and the rest to a lower precision. Note that GGUF is the newer version of GGML and is more flexible and extensible. I initially felt that experience and empirical observation were used to decide which weights are important (various possible combinations are offered to the user). It appears that this choice is also driven by the magnitude of the weights' gradients on a given training dataset: the lower the gradient, the less this weight matters in the final scheme of things, and hence its precision can be set lower. Note: I couldn't find a GGML/GGUF paper; it is best to look directly at the code for specifics.
  • AWQ (Activation-Aware Weight Quantization) is based on a simple premise: that ~1% of weights (called salient weights) contribute significantly to model performance and should therefore be treated with special care. While it would have been ideal not to quantize these salient weights at all, the authors argue that this kind of mixed-precision quantization (some weights in 32-bit, some in 4-bit) is hardware-inefficient & suggest an alternative. They show that scaling up the salient weight channels before quantizing preserves most of the information. Interestingly, the salient weights & the scales are determined by collecting activation statistics. Yes, to find the salient weight channels, they refer to the activation distribution instead of the weight distribution. This is simply because weight channels corresponding to larger activation magnitudes are more relevant to the output. We already saw in the previous section how even large weights can be rendered ineffective if the activation gates are closed. Also, since AWQ does not use GPTQ's adjustment process, it preserves the generalist abilities of the model, ensuring it does not overfit to the calibration set (AWQ uses the calibration set only to measure the average activation magnitude per channel).

There are no ground rules for choosing one over the other. GGML runs nicely on a CPU or an M-series Mac. QLoRA's NF4 and the INT8 implementation are easily available in the bitsandbytes library and closely coupled to Hugging Face. GPTQ is popular, while AWQ claims to retain the generalist capabilities of the LLM... In general, it is best to play around with different quantization schemes using the model and dataset of your choice before settling on one. If you want to avoid quantizing an LLM yourself, a bloke named TheBloke on Hugging Face has thousands of popular models in various quantized formats ready for download. The decision then eliminates the quantization process itself & reduces to a tradeoff between inference speed, memory & accuracy, of which the former two can be tested instantly.
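If you go the bitsandbytes/Hugging Face route mentioned above, loading a checkpoint in 4-bit NF4 is essentially a configuration flag. A hedged sketch (the checkpoint name is only a placeholder, and this assumes the transformers + bitsandbytes integration is installed):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load a causal LM with weights quantized to 4-bit NF4 at load time.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # QLoRA's NormalFloat-4 data type
    bnb_4bit_use_double_quant=True,          # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,   # run the matmuls in bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```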

If you have been with me on this journey so far, we have covered most of the key concepts in the field of quantization as of 2024. I have tried to explain the concepts in plain English without diluting any of the essence. For a more technical treatment, please refer to Mao's blog. This is the first of a ten-part series titled My LLM diaries. I originally intended to write one article per weekend, inspired by Vaidehi's DSA series, but this first one took me four weekends & a weekday holiday in between, an intense 40-hour effort! This is because I completely avoid contaminating my writing with any form of Gen AI output, not only due to hallucination issues but for a more selfish reason: I understand the whys & the hows better only if I write the stuff myself in plain English. The what, of course, is easily googleable or available from Gen AI. In any case, I hope to churn out future articles at a faster pace. Other articles planned in the LLM diaries are:

  • LoRA in plain English
  • A detailed inspection of LLM.int8() and QLoRA
  • RAG in plain English β€” Basic
  • RAG in plain English β€” Advanced
  • LLMs on the laptop β€” A practical primer
  • Taming LLM β€” Fine-tuning & other techniques
  • Agents in plain English
  • LLMops in plain English β€” Operationalizing trained models
  • Taking a step back: on model sentience, consciousness & other philosophical aspects
Image by Tobias Brunner from Pixabay

"Alice: Would you tell me, please, which way I ought to go from here?
The Cheshire Cat: That depends a good deal on where you want to get to.
Alice: I don't much care where.
The Cheshire Cat: Then it doesn't much matter which way you go.
Alice: ...So long as I get somewhere.
The Cheshire Cat: Oh, you're sure to do that, if only you walk long enough."

- Lewis Carroll, Alice in Wonderland


Published via Towards AI
