
3-Part Series: LLM Latency in Production (Part 1)

Last Updated on June 3, 2026 by Editorial Team

Author(s): Mehedi Hasan

Originally published on Towards AI.

Originally published at https://mhabir.substack.com.

Part 1 — Model-Level Speed: Make the Model Fast on the GPU

If you’re shipping LLMs to production, your first performance bottleneck isn’t serving logic or network overhead; it’s the raw work happening inside the GPU. Most teams waste weeks tuning their batching logic before realizing their model baseline is 3–4x slower than it should be. This part is about fixing that baseline.

Why LLM Inference Is Memory-Bandwidth Bound (Especially in Decode)

The fundamental misconception: LLMs are not always compute-bound. Decode is typically memory-bandwidth bound, while prefill is mixed (compute + memory) and becomes kernel-sensitive, especially with long contexts. Here’s the intuition that proves it.

A 7B parameter model in FP16 needs 14 GB just for weights. For a single token generation step (decode), you’re reading those 14 GB from TB/s-class HBM to do roughly 14 GFLOPs of computation (about 2 FLOPs per parameter). That’s an arithmetic intensity of around 1 FLOP/byte, well below the roofline point where compute becomes the limit. On modern GPUs, you’d need >200 FLOP/byte to saturate tensor cores. In practice, during decode, you’re waiting on HBM reads, not matrix multiplications.
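The arithmetic above can be reproduced in a few lines. This is a back-of-the-envelope sketch, not a measurement: the 2 FLOPs per parameter rule is an approximation, and the GPU figures (400 TFLOP/s, 2 TB/s) are hypothetical round numbers chosen only to illustrate where the roofline crossover sits.

```python
def decode_arithmetic_intensity(num_params: float, bytes_per_param: float) -> float:
    """FLOPs performed per byte of weights read in one decode step (batch size 1)."""
    flops = 2.0 * num_params                    # one multiply + one add per weight (approximation)
    bytes_moved = num_params * bytes_per_param  # every weight is read once per step
    return flops / bytes_moved


def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """Arithmetic intensity above which a GPU is compute-bound rather than bandwidth-bound."""
    return peak_flops / peak_bandwidth


intensity = decode_arithmetic_intensity(7e9, 2.0)  # 7B parameters stored in FP16
ridge = ridge_point(400e12, 2.0e12)                # hypothetical GPU, not a real spec sheet

print(f"decode intensity: {intensity:.1f} FLOP/byte")
print(f"ridge point:      {ridge:.0f} FLOP/byte")
```

With these assumed numbers, decode sits about two orders of magnitude below the ridge point, which is why the GPU spends the step waiting on memory.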

This has two consequences:

  • Batching helps because amortizing weight loads across multiple sequences improves effective memory bandwidth utilization.
  • Quantization is a bandwidth win: INT4 weights are 4x smaller, so you move 4x less data per token. That directly translates to lower latency.

Caveat: This is an upper-bound mental model. Effective traffic depends on batching, caching, and parallelization, so real workloads see less than this theoretical maximum.
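Both consequences fall out of the same simple model. The sketch below assumes the only memory traffic is one read of the weights per decode step (it ignores KV cache reads and dequantization cost), and uses a hypothetical 2 TB/s bandwidth figure.

```python
def decode_step_floor_ms(num_params, bits_per_weight, bandwidth_bytes_per_s):
    """Lower bound on one decode step: the time to read every weight once."""
    weight_bytes = num_params * bits_per_weight / 8
    return weight_bytes / bandwidth_bytes_per_s * 1e3


def flops_per_weight_byte(batch_size, bits_per_weight):
    """Each weight read is reused by every sequence in the batch (2 FLOPs each)."""
    return 2.0 * batch_size / (bits_per_weight / 8)


BANDWIDTH = 2.0e12  # hypothetical 2 TB/s HBM

fp16_ms = decode_step_floor_ms(7e9, 16, BANDWIDTH)
int4_ms = decode_step_floor_ms(7e9, 4, BANDWIDTH)
print(f"FP16 step floor: {fp16_ms:.2f} ms, INT4 step floor: {int4_ms:.2f} ms")

for batch in (1, 8, 32):
    print(f"batch={batch}: {flops_per_weight_byte(batch, 16):.0f} FLOP per weight byte (FP16)")
```

Under these assumptions, batching raises the work done per byte read, while INT4 lowers the number of bytes that must be read in the first place.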

Every LLM request has two phases with completely different performance characteristics.

The Prefill-Decode Asymmetry

Prefill processes the entire prompt in one forward pass, but it’s compute-intensive and memory-heavy because you’re building the KV cache. For a 4K-token prompt, you’re doing attention over a 4K sequence in parallel, not 4K autoregressive steps, which creates an O(n²) attention matrix and stores 4K × hidden_dim × num_layers × 2 (K and V) values. This can be multiple GB per request on large models.
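To make the KV cache formula concrete, here is a small sizing sketch. It assumes standard multi-head attention (no grouped-query attention), FP16 cache entries, and a Llama-2-7B-like shape (hidden_dim=4096, num_layers=32); models with fewer KV heads will come out smaller.

```python
def kv_cache_bytes(seq_len, hidden_dim, num_layers, bytes_per_value=2):
    """KV cache size for one request: seq_len x hidden_dim x num_layers x 2 (K and V)."""
    return seq_len * hidden_dim * num_layers * 2 * bytes_per_value


size = kv_cache_bytes(seq_len=4096, hidden_dim=4096, num_layers=32)
print(f"{size / 1024**3:.1f} GiB per 4K-token request")  # 2.0 GiB
```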

Decode generates tokens autoregressively. Each step processes one token, but reuses the KV cache. It’s memory-bandwidth dominated because you’re streaming the entire KV cache through HBM on every step.

This asymmetry means your optimization strategy must be phase-aware. Faster prefill requires better attention kernels (Flash Attention). Faster decode requires better cache management (paged KV, quantization).

Quantization is the single most effective model-level optimization. It reduces memory footprint, improves bandwidth efficiency, and often comes with minimal quality loss.

INT8 (LLM.int8()) uses vector-wise quantization and keeps outlier features in FP16. It’s the safest starting point: most models show <0.1% perplexity degradation. Implementation is straightforward:

# bitsandbytes INT8 inference
pip install bitsandbytes

In your model loading code:

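from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # outlier threshold
    llm_int8_has_fp16_weight=False,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=quantization_config,
)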

Both TGI and vLLM expose a bitsandbytes option, though flag support varies by version:

# TGI with bitsandbytes INT8
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes

# vLLM with INT8 (via config)
python -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-Instruct-v0.2 --quantization bitsandbytes

INT4 is where the real speedup lives. Relative to FP16, you get roughly 4x weight-memory reduction and 2–3x latency improvement, but with measurable quality degradation. Always validate with your actual prompt distribution.

AWQ: Activation-Aware Weight Quantization

AWQ’s key insight: not all weights are equally important. Activation magnitudes reveal which weights matter most. By scaling weights based on activation statistics, AWQ achieves better 4-bit accuracy than naive quantization.

Installation & Usage:

git clone https://github.com/mit-han-lab/llm-awq
cd llm-awq
pip install -e .
cd awq/kernels && python setup.py install # Build efficient CUDA kernels

Quantize a model:

# Step 1: AWQ search (calibration)
python -m awq.entry --model_path meta-llama/Llama-2-7b-hf \
--w_bit 4 --q_group_size 128 \
--run_awq --dump_awq llama-2-7b-w4-g128.pt

# Step 2: Generate quantized weights
python -m awq.entry --model_path meta-llama/Llama-2-7b-hf \
--w_bit 4 --q_group_size 128 \
--load_awq llama-2-7b-w4-g128.pt \
--q_backend real --dump_quant llama-2-7b-w4-g128-awq.pt

In vLLM/TGI, use pre-quantized models:

# vLLM with AWQ (supported in many recent versions)
python -m vllm.entrypoints.api_server --model TheBloke/Llama-2-7B-AWQ --quantization awq

# TGI with AWQ
text-generation-launcher --model-id TheBloke/Llama-2-7B-AWQ

AWQ Configuration Details:

  • q_group_size=128: Weights are quantized in groups of 128 consecutive values that share one scale and zero point. Smaller groups improve accuracy but store more scales and zero points, which increases memory and dequantization overhead.
  • w_bit=4: 4-bit quantization. AWQ also supports 3-bit for extreme compression.
  • version="GEMM": A kernel choice exposed by the AutoAWQ library, either GEMM (general matrix multiply) or GEMV (matrix-vector). GEMM is faster for batch sizes > 1.
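To show what the group size controls, here is a minimal sketch of group-wise 4-bit quantization. It uses plain round-to-nearest with one scale and zero point per group, and deliberately leaves out AWQ's activation-aware scaling; the helper names and the toy weight row are made up for illustration.

```python
def quantize_group(weights, w_bit=4):
    """Map a group of floats to integer codes in [0, 2**w_bit - 1] with a shared scale."""
    qmax = 2 ** w_bit - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 when all values are equal
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    return [c * scale + lo for c in codes]


def max_abs_error(row, q_group_size, w_bit=4):
    """Worst-case reconstruction error when `row` is quantized group by group."""
    worst = 0.0
    for start in range(0, len(row), q_group_size):
        group = row[start:start + q_group_size]
        restored = dequantize_group(*quantize_group(group, w_bit))
        worst = max(worst, max(abs(a - b) for a, b in zip(group, restored)))
    return worst


# Toy row: small weights plus one large outlier at index 0.
row = [0.001 * i for i in range(256)]
row[0] = 8.0

for group_size in (256, 128, 64):
    print(group_size, round(max_abs_error(row, group_size), 4))
```

In this toy case the outlier stretches the scale of whichever group contains it, so smaller groups confine the damage to fewer weights. The cost is one extra scale and zero point per group.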

GPTQ: Second-Order Post-Training Quantization

GPTQ uses second-order information (Hessian) to minimize quantization error. It’s slightly more computationally expensive to quantize but produces excellent 4-bit models.

Installation:

pip install auto-gptq --no-build-isolation

Quantization:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quantize_config=BaseQuantizeConfig(
        bits=4,
        group_size=128,
        desc_act=False,  # False for speed, True for slight quality improvement
    ),
)

# Calibrate with ~128-256 samples from your domain
calibration_texts = [...]  # placeholder: raw text samples from your domain
examples = [tokenizer(text) for text in calibration_texts]  # list of tokenized samples
model.quantize(examples)
model.save_quantized("llama-2-7b-gptq")

GPTQ in Serving:

# vLLM with GPTQ
python -m vllm.entrypoints.api_server --model TheBloke/Llama-2-7B-GPTQ --quantization gptq

Key GPTQ Configs:

  • desc_act=False: Disables activation reordering. This is 2-3x faster in inference with minimal quality loss. Set to True only if perplexity degradation is > 2%.
  • use_marlin=True: On Ampere and newer GPUs (A100, RTX 30xx, and Ada-generation RTX 40xx), Marlin kernels are 30-50% faster than default exllamav2.

bitsandbytes NF4/FP4: The No-Precompute Option

bitsandbytes 4-bit (used in QLoRA) quantizes on-the-fly during model loading. No calibration needed, but inference is often slower than AWQ/GPTQ because the weights are dequantized on every forward pass.

Config:

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # or "fp4"
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,  # Compresses quantization constants
)

Performance note: NF4 inference is often slower than AWQ/GPTQ for pure inference because of runtime dequantization overhead. Use it for development, not max-throughput serving.

GPU Acceleration: Kernels That Actually Matter

Quantization reduces memory traffic. These kernels make the traffic you do have more efficient.

Flash Attention: The Prefill King

Flash Attention eliminates the need to materialize the full N×N attention matrix. Instead, it tiles the computation and uses smart memory management to reduce HBM reads/writes by 10–20x in theory, with typical speedups of 30–50% in practice on long sequences.
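The tiling idea rests on an "online softmax": scores are processed block by block while a running maximum and running sum are kept, so the full row of attention weights never has to exist at once. The sketch below shows that idea in plain Python for a single query vector. It is an illustration of the math only, not the CUDA kernel, and it omits the usual 1/sqrt(d) scaling for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference: materialize every score, then softmax, then weighted sum."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, but only one block of scores is alive at any time."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        scores = [dot(q, k) for k in keys[start:start + block_size]]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # 0.0 on the first block
        running_sum *= correction                     # rescale earlier partial results
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values[start:start + block_size]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values))
```

The two functions agree up to floating-point rounding. The real kernel applies the same rescaling trick to tiles held in on-chip SRAM, which is where the HBM savings come from.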

Installation:

pip install flash-attn --no-build-isolation

In Practice: FlashAttention-2 is integrated into most major inference engines. You just need to install it before building your engine. For custom PyTorch code, use the functions exposed by the flash_attn package.

Performance Impact: On long prompts (>2K tokens), Flash Attention can significantly reduce prefill time, often by 30–50% depending on hardware and sequence length. The improvement is most dramatic on memory-bound configs.

Fused Kernels: The Decode Accelerator

In autoregressive generation, every token passes through attention, layernorm, and MLP blocks. Each operation launches a separate kernel, incurring overhead. Fused kernels merge these into a single launch.

In vLLM/TGI: These are automatically used when available. For custom implementations, look at Triton’s fused operations.

Performance Impact: Fused kernels improve decode tokens/sec by 15–25% by reducing kernel launch overhead and memory roundtrips.

Paged KV Cache: The Memory Fragmentation Fix

vLLM’s breakthrough innovation treats the KV cache like virtual memory. Instead of pre-allocating one contiguous, maximum-length cache region per request, paged attention allocates small fixed-size blocks on demand (typically 16–64 tokens per block). This eliminates most fragmentation and allows 2–3x higher batch sizes.

How it works:

  1. Allocate KV cache in fixed-size blocks (e.g., 16 tokens × hidden_dim)
  2. Maintain a block table per request (like page tables)
  3. On each decode step, gather scattered blocks into a contiguous attention computation
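The bookkeeping behind these steps can be sketched with a toy allocator. This is a simplified illustration of the block-table idea, not vLLM's implementation; the class and method names are hypothetical, and the actual K/V tensors are left out.

```python
class PagedKVAllocator:
    """Toy block-table bookkeeping for a paged KV cache (no tensors involved)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # request id -> [physical block ids]
        self.num_tokens = {}                        # request id -> tokens stored so far

    def append_token(self, request_id):
        """Reserve a slot for one more token; returns (physical block, offset in block)."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count % self.block_size == 0:            # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("KV cache is full")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = count + 1
        return table[count // self.block_size], count % self.block_size

    def free(self, request_id):
        """Return all blocks of a finished request to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


allocator = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):                  # 17 tokens need two 16-token blocks
    slot = allocator.append_token("req-a")
allocator.append_token("req-b")      # a second request takes one more block
print(allocator.block_tables, len(allocator.free_blocks))
allocator.free("req-a")
print(len(allocator.free_blocks))
```

Because a request only ever holds the blocks it has actually filled, memory wasted per request is bounded by one partially filled block.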

In vLLM: This is automatic and transparent. Configure block size:

python -m vllm.entrypoints.api_server \
  --model meta-llama/Llama-2-7b-hf \
  --block-size 16  # 16 tokens per block

Performance Impact: On shared inference services, paged attention can dramatically improve GPU utilization by eliminating memory fragmentation, often from ~45% to 85%. This directly improves throughput and reduces P99 latency.

KV Cache Quantization: The Memory Saver

The KV cache dominates memory usage in long-context scenarios. KV cache quantization (often to FP8 or INT8) cuts this memory in half, enabling longer sequences or larger batches.

Implementation: NVIDIA’s FP8 format is emerging as the sweet spot. In vLLM (experimental):

python -m vllm.entrypoints.api_server \
  --model your-model \
  --kv-cache-dtype fp8_e4m3

Tradeoffs: KV cache quantization adds a dequantization step to each attention computation, costing ~5–10% throughput. But it doubles your effective batch size capacity, which often yields net positive system throughput.
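The capacity side of this tradeoff is simple arithmetic. The sketch below assumes a Llama-2-7B-like shape (hidden_dim=4096, num_layers=32, standard multi-head attention), 4K-token sequences, and a hypothetical 20 GiB of GPU memory left over for the KV cache after weights.

```python
GIB = 1024 ** 3


def kv_bytes_per_sequence(seq_len, hidden_dim, num_layers, bytes_per_value):
    # K and V each store seq_len x hidden_dim values in every layer.
    return seq_len * hidden_dim * num_layers * 2 * bytes_per_value


def max_concurrent_sequences(kv_budget_bytes, per_sequence_bytes):
    return kv_budget_bytes // per_sequence_bytes


budget = 20 * GIB  # hypothetical KV cache budget
fp16_seq = kv_bytes_per_sequence(4096, 4096, 32, bytes_per_value=2)
fp8_seq = kv_bytes_per_sequence(4096, 4096, 32, bytes_per_value=1)

print(max_concurrent_sequences(budget, fp16_seq))  # 10
print(max_concurrent_sequences(budget, fp8_seq))   # 20
```

Whether the doubled capacity outweighs the per-step dequantization cost depends on how much of that extra batch room your traffic can actually fill.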

Decision Ladder: What to Try First

Based on production deployments, this is the order that tends to yield the fastest time-to-value:

Level 1: Baseline (Zero Effort)

  • Use BF16/FP16: Ensure you’re not in FP32 mode
  • Install Flash Attention: pip install flash-attn --no-build-isolation
  • Verify GPU behavior: Watch memory bandwidth % and SM active cycles; low utilization is common at batch=1

Impact: 30–50% prefill speedup, minimal effort.

Level 2: Safe Quantization (1–2 Hours)

  • Apply INT8 quantization via bitsandbytes:

text-generation-launcher --model-id your-model --quantize bitsandbytes

  • Validate quality: Run 100 representative prompts, check output fidelity
  • Measure memory: Should see ~40% reduction in GPU memory usage

Impact: 1.5–2x batch size capacity, minimal quality loss.

Level 3: Aggressive Quantization (Half Day)

  • Choose AWQ or GPTQ based on availability:
      • Use AWQ if pre-quantized models exist for your model
      • Use GPTQ if you need to quantize custom models
  • Quantize with conservative settings:
      • AWQ: w_bit=4, q_group_size=128
      • GPTQ: bits=4, group_size=128, desc_act=False
  • Run quality evaluation: Check perplexity on your validation set, target <2% degradation
  • Deploy in vLLM/TGI: Use --quantization awq or --quantization gptq

Impact: 3–4x memory reduction, 2–3x throughput improvement.
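The quality gate in Level 3 (perplexity degradation under 2%) can be checked with a small helper. This is a sketch that assumes you already have per-token log-probabilities (natural log) for the same validation text from both the baseline and the quantized model; how you collect them depends on your serving stack, and the sample values below are made up.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_degradation(baseline_ppl, quantized_ppl):
    """Fractional increase in perplexity caused by quantization."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl


def passes_quality_gate(baseline_logprobs, quantized_logprobs, max_degradation=0.02):
    base = perplexity(baseline_logprobs)
    quant = perplexity(quantized_logprobs)
    return relative_degradation(base, quant) < max_degradation


# Made-up log-probabilities for illustration only.
baseline = [-1.20, -0.80, -2.10, -0.50]
quantized = [-1.21, -0.81, -2.12, -0.50]
print(passes_quality_gate(baseline, quantized))
```

Perplexity on a validation set is a coarse signal, so it is worth pairing this check with a review of outputs on representative prompts.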

Level 4: Kernel Optimization (Full Day)

  • Switch to vLLM: If not already using it
  • Tune block size: Start with --block-size 16, measure fragmentation
  • Enable KV cache quantization (if supported)
  • Profile with PyTorch Profiler: Identify remaining bottlenecks

Impact: 2–3x higher concurrency, 30–50% P99 latency reduction.

Level 5: Advanced (When All Else Fails)

  • Speculative decoding: For very long outputs
  • Custom fused kernels: For specialized architectures
  • Tensor parallelism: When a single GPU is insufficient

Impact: Variable, but can unlock 70B+ models on commodity hardware.

The Bottom Line for Tech Leads

Before you redesign your serving architecture, make sure you’re getting every ounce of performance from the model itself. In most production deployments, the optimization ladder above yields 2–3x throughput improvements with no serving-level changes.

Your order of operations:

  1. Measure prefill vs decode latency — know which phase hurts you
  2. Apply INT8 quantization — low risk, immediate memory win
  3. Switch to vLLM with Flash Attention — best-in-class kernel performance
  4. Evaluate AWQ/GPTQ INT4 — when you need another 2x
  5. Enable KV cache quantization — when context length is your limit
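Step 1 depends on separating the two phases in your measurements. A minimal way to do that, assuming you can record a timestamp when the request is sent and one per streamed token, is sketched below; the function name and the sample timestamps are illustrative.

```python
def summarize_latency(request_start, token_times):
    """Split one streamed response into prefill cost (TTFT) and decode rate.

    request_start: time the request was sent, in seconds.
    token_times:   arrival time of each generated token, in seconds.
    """
    ttft = token_times[0] - request_start          # dominated by prefill
    if len(token_times) < 2:
        return ttft, None
    decode_span = token_times[-1] - token_times[0]
    tokens_per_sec = (len(token_times) - 1) / decode_span
    return ttft, tokens_per_sec


ttft, rate = summarize_latency(10.0, [10.25, 10.5, 10.75, 11.0])
print(f"TTFT: {ttft * 1000:.0f} ms, decode: {rate:.1f} tokens/sec")
```

A high TTFT with a healthy decode rate points at prefill (attention kernels, prompt length). A low decode rate points at weight and KV cache bandwidth.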

Most teams stop at Level 3 and see production latency drop from 800ms TTFT to 250ms, and tokens/sec increase from 30 to 80. That’s the difference between a usable product and a frustrating demo.

In Part 2, we’ll cover how to take this optimized model and build a serving system that doesn’t squander these gains through poor queueing, batching, and resource management.

📚 References & Further Reading

🔥 Core Papers

LLM Serving & Memory Management

  • Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM)
    https://arxiv.org/abs/2309.06180
    Introduces paged attention, KV cache virtualization, and high-throughput batching used by vLLM.

Attention Optimization

  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
    https://arxiv.org/abs/2205.14135
    Foundational paper explaining why attention is memory-bound and how IO-aware tiling reduces HBM traffic.
  • FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
    https://arxiv.org/abs/2307.08691
    Improves kernel parallelism and throughput, especially for long-context prefill.

Quantization for LLM Inference

  • AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration
    https://arxiv.org/abs/2306.00978
    Shows why activation statistics matter for INT4 quantization accuracy and latency.
  • SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
    https://arxiv.org/abs/2211.10438
    Explains activation smoothing to make INT8 quantization more robust.
  • GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
    https://arxiv.org/abs/2210.17323
    One-shot weight quantization that uses approximate second-order (Hessian) information to minimize layer-wise error; widely used in production.
