3-Part Series: LLM Latency in Production (Part 1)
Last Updated on June 3, 2026 by Editorial Team
Author(s): Mehedi Hasan
Originally published on Towards AI.
Originally published at https://mhabir.substack.com.
Part 1 — Model-Level Speed: Make the Model Fast on the GPU
If you’re shipping LLMs to production, your first performance bottleneck isn’t serving logic or network overhead; it’s the raw arithmetic happening inside the GPU. Most teams waste weeks tuning their batching logic before realizing their model baseline is 3–4x slower than it should be. This part is about fixing that baseline.
Why LLM Inference Is Memory-Bandwidth Bound (Especially in Decode)
The fundamental misconception: LLMs are not always compute-bound. Decode is typically memory-bandwidth bound, while prefill is mixed (compute + memory) and becomes kernel-sensitive, especially with long contexts. Here’s the intuition that proves it.
A 7B-parameter model in FP16 needs 14 GB just for weights. For a single decode step (one generated token), you move those 14 GB through GPU memory bandwidth (TB/s-class HBM) to do roughly 14 GFLOP of computation, about 2 FLOPs per parameter. That’s an arithmetic intensity of around 1 FLOP/byte, far below the roofline ridge point where compute becomes the limit. On modern data-center GPUs, you’d need well over 100 FLOP/byte to saturate tensor cores (the exact figure depends on the GPU). In practice, during decode, you’re waiting on HBM reads, not matrix multiplications.
This has two consequences:
- Batching helps because amortizing weight loads across multiple sequences improves effective memory bandwidth utilization.
- Quantization is a bandwidth win: INT4 weights are 4x smaller, so you move 4x less data per token. That directly translates to lower latency.
Caveat: This is an upper-bound mental model. Effective traffic depends on batching, caching, and parallelization, so real workloads see less per-token traffic than this theoretical maximum.
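The arithmetic above can be sketched in a few lines. This is a back-of-the-envelope model, not a profiler: it counts only weight traffic, and the 2000 GB/s bandwidth figure and the function name are assumptions chosen for illustration.

```python
def decode_step_estimate(n_params, bytes_per_param, hbm_gb_per_s, batch_size=1):
    """Estimate arithmetic intensity and a bandwidth-bound time for one decode step.

    Only weight traffic is counted (no KV cache, no activations), so the time
    is an optimistic lower bound under the stated assumptions.
    """
    weight_bytes = n_params * bytes_per_param
    flops = 2 * n_params * batch_size          # ~2 FLOPs per weight per sequence
    intensity = flops / weight_bytes           # FLOP per byte of weights read
    step_ms = weight_bytes / (hbm_gb_per_s * 1e9) * 1e3
    return intensity, step_ms


# Assumed values: 7B parameters and 2000 GB/s of HBM bandwidth.
fp16 = decode_step_estimate(7e9, 2.0, 2000)
int4 = decode_step_estimate(7e9, 0.5, 2000)
fp16_batch32 = decode_step_estimate(7e9, 2.0, 2000, batch_size=32)

for name, (intensity, step_ms) in [
    ("fp16", fp16),
    ("int4", int4),
    ("fp16, batch=32", fp16_batch32),
]:
    print(f"{name}: {intensity:.0f} FLOP/byte, >= {step_ms:.2f} ms per step")
```

Under these assumptions, FP16 at batch size 1 sits at 1 FLOP/byte with a floor of about 7 ms per step, INT4 weights cut that floor by 4x, and a batch of 32 raises the intensity to 32 FLOP/byte for the same weight read.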
The Prefill-Decode Asymmetry
Every LLM request has two phases with completely different performance characteristics.
Prefill processes the entire prompt in one forward pass, but it’s compute-intensive and memory-heavy because you’re building the KV cache. For a 4K-token prompt, you’re doing attention over a 4K sequence in parallel (not 4K autoregressive steps), creating an O(n²) attention matrix and storing 4K × hidden_dim × num_layers × 2 (K and V) values. This can be multiple GB per request on large models.
Decode generates tokens autoregressively. Each step processes one token, but reuses the KV cache. It’s memory-bandwidth dominated because you’re streaming the model weights and the entire KV cache through HBM on every step.
This asymmetry means your optimization strategy must be phase-aware. Faster prefill requires better attention kernels (Flash Attention). Faster decode requires better cache management (paged KV, quantization).
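To put a number on the KV cache formula above, here is a small sketch. It assumes standard multi-head attention (no grouped-query or multi-query attention), an FP16 cache, and a Llama-2-7B-like shape (hidden_dim=4096, 32 layers); the function name is illustrative.

```python
def kv_cache_bytes(seq_len, hidden_dim, num_layers, bytes_per_value=2):
    """KV cache size for one request: seq_len x hidden_dim x num_layers x 2 (K and V)."""
    return seq_len * hidden_dim * num_layers * 2 * bytes_per_value


# Assumed shape: hidden_dim=4096, 32 layers, FP16 cache, 4K-token prompt.
per_request = kv_cache_bytes(seq_len=4096, hidden_dim=4096, num_layers=32)
per_token = kv_cache_bytes(seq_len=1, hidden_dim=4096, num_layers=32)

print(f"per request: {per_request / 2**30:.1f} GiB")
print(f"per token:   {per_token / 2**20:.1f} MiB")
```

With these assumptions a single 4K-token request holds 2 GiB of cache, or 0.5 MiB per token, which is why the cache rather than the weights often decides how many requests fit on one GPU.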
Quantization is the single most effective model-level optimization. It reduces memory footprint, improves bandwidth efficiency, and often comes with minimal quality loss.
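INT8 (LLM.int8()) uses vector-wise quantization with outlier preservation. It’s the safest starting point: most models show <0.1% perplexity degradation. Implementation is straightforward:
# bitsandbytes INT8 inference
pip install bitsandbytes
In your model loading code:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # activations above this magnitude are treated as outliers
    llm_int8_has_fp16_weight=False,
)
# Pass the config when loading the model
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=quantization_config,
    device_map="auto",
)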
Both TGI and vLLM expose bitsandbytes quantization through a launch flag (support and exact behavior vary by version, so check the documentation for the release you run):
# TGI with bitsandbytes INT8
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes
# vLLM with bitsandbytes
python -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-Instruct-v0.2 --quantization bitsandbytes
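To make "vector-wise quantization with outlier preservation" concrete, here is a simplified NumPy sketch of the idea. It is not the bitsandbytes implementation: the function names, the test data, and the injected outlier feature are all assumptions for illustration.

```python
import numpy as np


def absmax_quantize(m, axis):
    """Symmetric INT8 quantization with one scale per vector along `axis`."""
    scale = np.abs(m).max(axis=axis, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)     # avoid division by zero
    return np.round(m / scale).astype(np.int8), scale


def int8_matmul_with_outliers(x, w, threshold=6.0):
    """Compute x @ w with INT8 for regular features and full precision for outliers."""
    outlier = np.abs(x).max(axis=0) >= threshold  # feature dims with any large activation
    regular = ~outlier
    x_q, x_scale = absmax_quantize(x[:, regular], axis=1)   # one scale per row of x
    w_q, w_scale = absmax_quantize(w[regular, :], axis=0)   # one scale per column of w
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)       # integer accumulation
    out = acc * x_scale * w_scale                            # dequantize
    return out + x[:, outlier] @ w[outlier, :]               # add the outlier part


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))
x[:, 3] = 40.0                                   # hypothetical outlier feature
w = rng.normal(size=(64, 8))
approx = int8_matmul_with_outliers(x, w)
exact = x @ w
print("max abs error:", np.abs(approx - exact).max())
```

Keeping the outlier feature out of the INT8 path matters because a single large value would otherwise stretch the row scale and waste most of the 8-bit range on the remaining values.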
INT4 is where the real speedup lives. Relative to FP16, you get roughly 4x smaller weights and often a 2–3x latency improvement, but with measurable quality degradation. Always validate with your actual prompt distribution.
AWQ: Activation-Aware Weight Quantization
AWQ’s key insight: not all weights are equally important. Activation magnitudes reveal which weights matter most. By scaling weights based on activation statistics, AWQ achieves better 4-bit accuracy than naive quantization.
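The scaling idea can be illustrated with a toy NumPy sketch. This is a simplified stand-in, not the llm-awq code: it uses per-column round-to-nearest 4-bit quantization, a plain grid search over an exponent alpha, and synthetic data with a few artificially large activation channels. All names here are hypothetical.

```python
import numpy as np


def fake_quant_4bit(w):
    """Round-to-nearest symmetric 4-bit quantization per output column, then dequantize."""
    scale = np.abs(w).max(axis=0, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(w / scale), -8, 7) * scale


def search_activation_aware_scale(x, w, n_grid=11):
    """Grid-search alpha in s = mean(|x|)**alpha to minimize output error after quantization."""
    act_mag = np.abs(x).mean(axis=0)            # one magnitude per input channel
    reference = x @ w
    results = []
    for alpha in np.linspace(0.0, 1.0, n_grid):
        s = act_mag ** alpha                     # alpha = 0 gives s = 1 (plain quantization)
        w_q = fake_quant_4bit(w * s[:, None])    # scale up weights of salient channels
        out = (x / s) @ w_q                      # fold the inverse scale into activations
        results.append((float(np.mean((reference - out) ** 2)), float(alpha)))
    naive_err = results[0][0]
    best_err, best_alpha = min(results)
    return best_alpha, best_err, naive_err


rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))
x[:, :4] *= 30.0                                 # a few input channels with large activations
w = rng.normal(size=(64, 32))
best_alpha, best_err, naive_err = search_activation_aware_scale(x, w)
print(f"alpha={best_alpha:.1f}  error={best_err:.4f}  (plain 4-bit: {naive_err:.4f})")
```

Before quantization the transform is exact, since (x / s) @ (w * s) equals x @ w; the scale only changes which weights absorb the rounding error. Because alpha = 0 is part of the grid, the search can never do worse than plain quantization on the calibration data.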
Installation & Usage:
git clone https://github.com/mit-han-lab/llm-awq
cd llm-awq
pip install -e .
cd awq/kernels && python setup.py install # Build efficient CUDA kernels
Quantize a model:
# Step 1: AWQ search (calibration)
python -m awq.entry --model_path meta-llama/Llama-2-7b-hf \
    --w_bit 4 --q_group_size 128 \
    --run_awq --dump_awq llama-2-7b-w4-g128.pt
# Step 2: Generate quantized weights
python -m awq.entry --model_path meta-llama/Llama-2-7b-hf \
    --w_bit 4 --q_group_size 128 \
    --load_awq llama-2-7b-w4-g128.pt \
    --q_backend real --dump_quant llama-2-7b-w4-g128-awq.pt
In vLLM/TGI, use pre-quantized models:
# vLLM with AWQ (supported in many recent versions)
python -m vllm.entrypoints.api_server --model TheBloke/Llama-2-7B-AWQ --quantization awq
# TGI with AWQ
text-generation-launcher --model-id TheBloke/Llama-2-7B-AWQ
AWQ Configuration Details:
- q_group_size=128: Weights are quantized in groups of 128 along the input-channel dimension, each group with its own scale. Smaller groups improve accuracy but increase the storage and compute overhead of the extra scales.
- w_bit=4: 4-bit quantization. AWQ also supports 3-bit for extreme compression.
- version="GEMM" (an AutoAWQ option): Choose between GEMM (general matrix multiply) or GEMV (matrix-vector) kernels. GEMM is faster for batch sizes > 1.
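The group-size trade-off can be seen in a minimal sketch of group-wise round-to-nearest quantization. This is plain asymmetric quantization, not AWQ's kernel format; the function names and the assumption of a 16-bit scale plus a 16-bit offset per group are illustrative.

```python
import numpy as np


def quantize_groupwise(w, w_bit=4, q_group_size=128):
    """Asymmetric round-to-nearest quantization over groups of input weights."""
    out_features, in_features = w.shape
    assert in_features % q_group_size == 0, "in_features must be a multiple of the group size"
    groups = w.reshape(out_features, in_features // q_group_size, q_group_size)
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    q_max = 2 ** w_bit - 1
    scale = np.maximum(w_max - w_min, 1e-8) / q_max          # one scale per group
    q = np.clip(np.round((groups - w_min) / scale), 0, q_max).astype(np.uint8)
    dequant = (q * scale + w_min).reshape(w.shape)
    return q, scale, dequant


def bits_per_weight(w_bit, q_group_size, metadata_bits=32):
    """Effective storage, assuming a 16-bit scale and a 16-bit offset per group."""
    return w_bit + metadata_bits / q_group_size


rng = np.random.default_rng(0)
w = rng.normal(size=(8, 256))
q, scale, dequant = quantize_groupwise(w, w_bit=4, q_group_size=128)
print("max abs error:", np.abs(w - dequant).max())
print("bits/weight at g=128:", bits_per_weight(4, 128))
print("bits/weight at g=32: ", bits_per_weight(4, 32))
```

Each group's rounding error is bounded by half its scale, so smaller groups (with tighter min/max ranges) give smaller errors, while the per-group metadata grows from 4.25 to 5 bits per weight when the group size drops from 128 to 32 under the stated assumption.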
GPTQ: Second-Order Post-Training Quantization
GPTQ uses second-order information (Hessian) to minimize quantization error. It’s slightly more computationally expensive to quantize but produces excellent 4-bit models.
Installation:
pip install auto-gptq --no-build-isolation
Quantization:
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quantize_config=BaseQuantizeConfig(
        bits=4,
        group_size=128,
        desc_act=False,  # False for speed, True for slight quality improvement
    ),
)
# Calibrate with ~128-256 samples from your domain
calibration_texts = [...]  # List of raw text samples (fill in with your own data)
examples = [tokenizer(text) for text in calibration_texts]  # List of tokenized samples
model.quantize(examples)
model.save_quantized("llama-2-7b-gptq")
GPTQ in Serving:
# vLLM with GPTQ
python -m vllm.entrypoints.api_server --model TheBloke/Llama-2-7B-GPTQ --quantization gptq
Key GPTQ Configs:
- desc_act=False: Disables activation reordering. This can be 2-3x faster in inference with minimal quality loss. Set to True only if perplexity degradation is > 2%.
- use_marlin=True: On Ampere and newer GPUs (A100, RTX 30xx/40xx), Marlin kernels can be 30-50% faster than the default exllamav2 kernels.
bitsandbytes NF4/FP4: The No-Precompute Option
bitsandbytes 4-bit (used in QLoRA) quantizes on-the-fly during model loading. No calibration needed, but inference is often slower than AWQ/GPTQ because the weights are dequantized at runtime on every forward pass.
Config:
import torch
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # or "fp4"
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,  # Compresses quantization constants
)
Performance note: NF4 inference is often slower than AWQ/GPTQ for pure inference because of runtime dequantization overhead. Use it for development, not max-throughput serving.
GPU Acceleration: Kernels That Actually Matter
Quantization reduces memory traffic. These kernels make the traffic you do have more efficient.
Flash Attention: The Prefill King
Flash Attention eliminates the need to materialize the full N×N attention matrix in HBM. Instead, it tiles the computation, keeps each tile in on-chip SRAM, and computes the softmax incrementally, which reduces HBM reads/writes by 10–20x in theory, with typical speedups of 30–50% in practice on long sequences.
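The tiling trick rests on a running (online) softmax. The NumPy sketch below shows only that numerical idea, under simplifying assumptions: a single head, no masking, and tiling over keys only, whereas the real CUDA kernels also tile over queries and keep tiles in SRAM. Function names are illustrative.

```python
import numpy as np


def naive_attention(q, k, v):
    """Reference attention that materializes the full score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    p = np.exp(scores)
    return (p / p.sum(axis=-1, keepdims=True)) @ v


def tiled_attention(q, k, v, block=64):
    """Same result, but only one (n_queries x block) score tile exists at a time."""
    n, d = q.shape
    out = np.zeros((n, v.shape[-1]))
    row_max = np.full((n, 1), -np.inf)       # running max of scores per query
    row_sum = np.zeros((n, 1))               # running softmax denominator
    for start in range(0, k.shape[0], block):
        k_blk, v_blk = k[start:start + block], v[start:start + block]
        s = q @ k_blk.T / np.sqrt(d)
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        correction = np.exp(row_max - new_max)   # rescale earlier partial results
        p = np.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        row_max = new_max
    return out / row_sum


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(256, 32)) for _ in range(3))
print("matches reference:", np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v)))
```

The output is exact attention, not an approximation: the correction factor rescales earlier partial sums whenever a later tile raises the running maximum.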
Installation:
pip install flash-attn --no-build-isolation
In Practice: FlashAttention-2 is integrated into the major inference engines covered here (vLLM and TGI). In most setups you only need it available in the environment where the engine is built or installed. For custom PyTorch code, use the flash_attn interface.
Performance Impact: On long prompts (>2K tokens), Flash Attention can significantly reduce prefill time, often by 30–50% depending on hardware and sequence length. The improvement is most dramatic on memory-bound configs.
Fused Kernels: The Decode Accelerator
In autoregressive generation, every token passes through attention, layernorm, and MLP blocks. Each operation launches a separate kernel, incurring overhead. Fused kernels merge these into a single launch.
In vLLM/TGI: These are automatically used when available. For custom implementations, look at Triton’s fused operations.
Performance Impact: Fused kernels improve decode tokens/sec by 15–25% by reducing kernel launch overhead and memory roundtrips.
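A rough way to see where the gain comes from is to count kernel launches and HBM traffic for a chain of elementwise operations. The sketch below is a simplified counting model, not a measurement; the three-op chain (for example residual add, normalization scale, activation) and the tensor size are hypothetical.

```python
def elementwise_chain_traffic(num_elements, num_ops, bytes_per_value=2, fused=False):
    """Return (kernel launches, bytes moved through HBM) for a chain of elementwise ops."""
    if fused:
        # One kernel: read the input once, write the final output once.
        launches, passes = 1, 2
    else:
        # Each op is its own kernel: read its input, write its output.
        launches, passes = num_ops, 2 * num_ops
    return launches, passes * num_elements * bytes_per_value


# Hypothetical decode step: one token with hidden size 4096, FP16, three chained ops.
unfused = elementwise_chain_traffic(4096, num_ops=3)
fused = elementwise_chain_traffic(4096, num_ops=3, fused=True)
print("unfused (launches, bytes):", unfused)
print("fused   (launches, bytes):", fused)
```

In this model fusion cuts both the launch count and the memory roundtrips by the length of the chain, which is why the benefit shows up most clearly in decode, where each step does little work per launch.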
Paged KV Cache: The Memory Fragmentation Fix
vLLM’s breakthrough innovation treats the KV cache like virtual memory. Instead of pre-allocating one contiguous, maximum-length cache region per request, paged attention allocates small fixed-size blocks on demand (typically 16–64 tokens per block). This nearly eliminates fragmentation and allows 2–3x higher batch sizes.
How it works:
- Allocate KV cache in fixed-size blocks (e.g., 16 tokens × hidden_dim)
- Maintain a block table per request (like page tables)
- On each decode step, gather scattered blocks into a contiguous attention computation
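The steps above can be sketched as a toy block allocator. This is a bookkeeping illustration only, not vLLM's implementation: the class and method names are hypothetical, and no actual key/value tensors are stored.

```python
class PagedKVAllocator:
    """Toy block-table bookkeeping for a paged KV cache."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # physical block ids
        self.block_tables = {}                       # request id -> list of block ids
        self.lengths = {}                            # request id -> tokens stored

    def append_token(self, request_id):
        """Reserve a slot for one new token and return (physical block, offset)."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % self.block_size == 0:            # first token, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, request_id):
        """Return all blocks of a finished request to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


allocator = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    allocator.append_token("request-a")              # 20 tokens -> 2 blocks
for _ in range(5):
    allocator.append_token("request-b")              # 5 tokens -> 1 block
print(allocator.block_tables, "free:", len(allocator.free_blocks))
allocator.free("request-a")
print("free after release:", len(allocator.free_blocks))
```

Each request wastes at most one partially filled block, and blocks from a finished request are immediately reusable by any other request regardless of where they sit in memory.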
In vLLM: This is automatic and transparent. Configure block size:
python -m vllm.entrypoints.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --block-size 16  # 16 tokens per block
Performance Impact: On shared inference services, paged attention can dramatically improve GPU utilization by eliminating memory fragmentation, often from ~45% to 85%. This directly improves throughput and reduces P99 latency.
KV Cache Quantization: The Memory Saver
The KV cache dominates memory usage in long-context scenarios. KV cache quantization (often to FP8 or INT8) cuts this memory in half, enabling longer sequences or larger batches.
Implementation: NVIDIA’s FP8 format is emerging as the sweet spot. In vLLM (experimental):
python -m vllm.entrypoints.api_server \
    --model your-model \
    --kv-cache-dtype fp8_e4m3
Tradeoffs: KV cache quantization adds a dequantization step to each attention computation, costing ~5–10% throughput. But it doubles your effective batch size capacity, which often yields net positive system throughput.
Decision Ladder: What to Try First
Based on production deployments, this is the empirical order that yields fastest time-to-value:
Level 1: Baseline (Zero Effort)
- Use BF16/FP16: Ensure you’re not in FP32 mode
- Install Flash Attention: pip install flash-attn --no-build-isolation
- Verify GPU behavior: Watch memory bandwidth % and SM active cycles; low utilization is common at batch=1
Impact: 30–50% prefill speedup, minimal effort.
Level 2: Safe Quantization (1–2 Hours)
- Apply INT8 quantization via bitsandbytes:
  text-generation-launcher --model-id your-model --quantize bitsandbytes
- Validate quality: Run 100 representative prompts, check output fidelity
- Measure memory: Should see ~40% reduction in GPU memory usage
Impact: 1.5–2x batch size capacity, minimal quality loss.
Level 3: Aggressive Quantization (Half Day)
- Choose AWQ or GPTQ based on availability:
  - Use AWQ if pre-quantized models exist for your model
  - Use GPTQ if you need to quantize custom models
- Quantize with conservative settings:
  - AWQ: w_bit=4, q_group_size=128
  - GPTQ: bits=4, group_size=128, desc_act=False
- Run quality evaluation: Check perplexity on your validation set, target <2% degradation
- Deploy in vLLM/TGI: Use --quantization awq or --quantization gptq in vLLM (the TGI flag is --quantize)
Impact: 3–4x memory reduction, 2–3x throughput improvement.
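The quality check in this level can be sketched as a small helper that compares perplexity between the baseline and the quantized model. It assumes you have already collected per-token log-probabilities (natural log) from both models on the same validation text; the function names and the sample values are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def degradation_pct(baseline_logprobs, quantized_logprobs):
    """Relative perplexity increase of the quantized model, in percent."""
    base = perplexity(baseline_logprobs)
    quant = perplexity(quantized_logprobs)
    return 100.0 * (quant - base) / base


# Hypothetical log-probs for the same four tokens under each model.
baseline = [-1.20, -0.80, -2.10, -0.50]
quantized = [-1.22, -0.81, -2.13, -0.51]

pct = degradation_pct(baseline, quantized)
print(f"perplexity degradation: {pct:.2f}%")
print("within 2% target:", pct < 2.0)
```

Run this on text drawn from your own prompt distribution rather than a generic benchmark, since quantization error is not uniform across domains.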
Level 4: Kernel Optimization (Full Day)
- Switch to vLLM: If not already using it
- Tune block size: Start with --block-size 16, measure fragmentation
- Enable KV cache quantization (if supported)
- Profile with PyTorch Profiler: Identify remaining bottlenecks
Impact: 2–3x higher concurrency, 30–50% P99 latency reduction.
Level 5: Advanced (When All Else Fails)
- Speculative decoding: For very long outputs
- Custom fused kernels: For specialized architectures
- Tensor parallelism: When a single GPU is insufficient
Impact: Variable, but can unlock 70B+ models on commodity hardware.
The Bottom Line for Tech Leads
Before you redesign your serving architecture, make sure you’re getting every ounce of performance from the model itself. In 90% of production deployments, the optimization ladder above yields 2–3x throughput improvements at zero serving-level changes.
Your order of operations:
- Measure prefill vs decode latency — know which phase hurts you
- Apply INT8 quantization — low risk, immediate memory win
- Switch to vLLM with Flash Attention — best-in-class kernel performance
- Evaluate AWQ/GPTQ INT4 — when you need another 2x
- Enable KV cache quantization — when context length is your limit
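For the first step in this list, one minimal way to separate prefill from decode is to time a streaming response: time to first token (TTFT) approximates prefill plus queueing, and the token rate after the first token approximates decode speed. The helper below is a sketch; the function name is hypothetical, and the fake stream and clock exist only to make the example deterministic.

```python
import time


def measure_stream(token_stream, clock=time.perf_counter):
    """Return (time to first token in seconds, decode tokens per second)."""
    start = clock()
    first = last = None
    count = 0
    for _ in token_stream:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_time = last - first
    decode_tps = (count - 1) / decode_time if decode_time > 0 else float("nan")
    return first - start, decode_tps


# Deterministic demo: a fake clock stands in for real wall-clock time.
timestamps = iter([0.0, 0.25, 0.30, 0.35, 0.40, 0.45])
ttft, tps = measure_stream(iter(["tok"] * 5), clock=lambda: next(timestamps))
print(f"TTFT: {ttft * 1000:.0f} ms, decode: {tps:.0f} tokens/sec")
```

In real use, pass the token iterator returned by your client's streaming API and leave the default clock in place. A high TTFT with a healthy token rate points at prefill; the reverse points at decode.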
Most teams stop at Level 3 and see production latency drop from 800ms TTFT to 250ms, and tokens/sec increase from 30 to 80. That’s the difference between a usable product and a frustrating demo.
In Part 2, we’ll cover how to take this optimized model and build a serving system that doesn’t squander these gains through poor queueing, batching, and resource management.
📚 References & Further Reading
🔥 Core Papers
LLM Serving & Memory Management
- Efficient Memory Management for LLM Serving with PagedAttention (vLLM)
https://arxiv.org/abs/2309.06180
Introduces paged attention, KV cache virtualization, and high-throughput batching used by vLLM.
Attention Optimization
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
https://arxiv.org/abs/2205.14135
Foundational paper explaining why attention is memory-bound and how IO-aware tiling reduces HBM traffic.
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
https://arxiv.org/abs/2307.08691
Improves kernel parallelism and throughput, especially for long-context prefill.
Quantization for LLM Inference
- AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration
https://arxiv.org/abs/2306.00978
Shows why activation statistics matter for INT4 quantization accuracy and latency.
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
https://arxiv.org/abs/2211.10438
Explains activation smoothing to make INT8 quantization more robust.
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
https://arxiv.org/abs/2210.17323
Second-order (Hessian-based) quantization that minimizes layer-wise error; widely used in production.
🧠 Official GitHub Repositories (Production Code)
Inference Engines
- vLLM (PagedAttention, batching, scheduling)
https://github.com/vllm-project/vllm
High-throughput LLM inference engine used in real production systems.
- Text Generation Inference (TGI)
https://github.com/huggingface/text-generation-inference
Production-grade LLM serving stack with batching, streaming, and quantization support.
Attention & Kernels
- FlashAttention (official implementation)
https://github.com/Dao-AILab/flash-attention
CUDA kernels implementing FlashAttention and FlashAttention-2.
Quantization Libraries
- bitsandbytes (INT8 / NF4 / FP4)
https://github.com/TimDettmers/bitsandbytes
On-the-fly quantization widely used in inference and QLoRA.
- AutoGPTQ
https://github.com/PanQiWei/AutoGPTQ
Reference implementation for GPTQ-based 4-bit quantization.
- AWQ (MIT Han Lab)
https://github.com/mit-han-lab/llm-awq
Official AWQ implementation with calibration and CUDA kernels.
Performance Modeling & Systems Thinking
- Roofline Models for LLM Inference
https://mlforsystems.org/assets/papers/neurips2024/paper28.pdf
Explains compute vs memory-bound behavior and latency prediction.
- Awesome LLM Inference (Curated Research List)
https://github.com/xlite-dev/Awesome-LLM-Inference
Continuously updated collection of papers and implementations.
Note: Article content contains the views of the contributing authors and not Towards AI.