8-Bit LLM Quantization with Lightning Fabric
Last Updated on April 13, 2024 by Editorial Team

Author(s): Tim Cvetko

2024: The Easiest Way to Int8-Quantize Any LLM with Lightning Fabric

LLMs are called “large” for a reason. Models like GPT-4 are reported, unofficially, to have over 220B weights and over 1.4T total parameters. For us mortals, fine-tuning LLMs that already perform well on general tasks requires some form of optimization:

Model distillation: training a comparatively smaller LLM to mimic a larger one
PEFT: freezing most of the weights and training only a small subset during fine-tuning
Pruning: reducing model size after training
Quantization: using fewer bits to store weight information

💡 8-bit quantization (int8) lets you load larger models that you normally wouldn’t be able to fit into memory, and speeds up inference.
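To make the idea of storing weights in 8 bits concrete, here is a minimal sketch of absmax quantization in plain Python. It is one common textbook scheme, shown here as an assumption for illustration; it is not necessarily what Lightning Fabric or its backend does internally. The names `quantize_absmax` and `dequantize` and the sample weights are hypothetical.

```python
def quantize_absmax(weights):
    """Map floats to integers in [-127, 127] using one shared scale (absmax scheme)."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.5, 0.33, 1.27, -1.0]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("int8 codes:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each weight now needs one byte plus a shared scale, at the cost of a small rounding error bounded by half the scale.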

8-bit Quantization: Image by Author

P1: Introduction to Model Quantization
P2: Why 8-Bit Quantization
P3: How YOU can Fine-Tune any LLM with Lightning AI’s Fabric Module

Quantization is a must for most production systems, given that edge devices and consumer-grade hardware typically require models with a much smaller memory footprint than more powerful hardware such as NVIDIA’s A100 80GB. Learning about this technique will give you a better understanding of how models like Llama 2 and SDXL are deployed, and of the requirements for edge devices in robotics, vehicles, and other systems.

The size of a model is determined by the number of its parameters, and their precision, typically one of float32, float16 or bfloat16.

Float Precision: Image by Author
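As a rough illustration of how parameter count and precision together set the memory footprint, the sketch below multiplies the number of parameters by the bytes each one occupies. The 7B parameter count and the helper name `model_size_gb` are assumptions chosen for the example, and the estimate covers weights only (no activations, optimizer state, or quantization metadata).

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def model_size_gb(num_params, dtype):
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Hypothetical 7-billion-parameter model.
num_params = 7_000_000_000
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {model_size_gb(num_params, dtype):.1f} GB")
# float32: 28.0 GB
# float16: 14.0 GB
# bfloat16: 14.0 GB
# int8: 7.0 GB
```

Under these assumptions, going from float32 to int8 cuts weight storage by a factor of four.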

To calculate the model size in bytes,… Read the full blog for free on Medium.

