Part 3 — Implementation/Engine-Level: Choosing the Runtime That Gives You These for Free

Last Updated on June 3, 2026 by Editorial Team

Author(s): Mehedi Hasan

Originally published on Towards AI.

You now know how to make the model fast (Part 1) and how to build a stable serving layer around it (Part 2). The final question is: which engine actually implements all of this without forcing you to write a custom scheduler from scratch?

The theme of this part: inference engines are not neutral wrappers. They bake in specific opinions about batching, KV cache memory layout, prefix caching, and kernel selection. Pick the engine that aligns with your pain points, and you get chunked prefill, continuous batching, and paged KV cache for free. Pick the wrong one, and you’ll spend sprints reimplementing features the right engine already has.

Here is how the major runtimes compare in 2026, with exact configs and the tradeoffs that matter for production.

vLLM: The Production Default

vLLM is the safest starting point for most teams. Its core innovation — PagedAttention — treats the KV cache like virtual memory with fixed-size blocks, reducing fragmentation from 60–80% in naive systems to under 4%. This directly translates to 2–4x higher concurrency on the same GPU.
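
To make the fragmentation claim concrete, here is a minimal sketch that compares two reservation policies on a handful of hypothetical request lengths. It is not vLLM's allocator: the function names, the request lengths, and the block size of 16 are assumptions chosen for illustration, and the exact percentages depend entirely on the workload.

```python
import math


def contiguous_waste(seq_lens, max_model_len):
    """Fraction of reserved KV slots left unused when every request
    reserves max_model_len slots up front (the naive policy)."""
    reserved = len(seq_lens) * max_model_len
    used = sum(seq_lens)
    return 1 - used / reserved


def paged_waste(seq_lens, block_size=16):
    """Fraction of reserved KV slots left unused when slots are handed
    out on demand in fixed-size blocks (the paged policy)."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    used = sum(seq_lens)
    return 1 - used / reserved


# Hypothetical request lengths (prompt plus tokens generated so far).
seq_lens = [120, 900, 2048, 350, 64, 1500]
naive = contiguous_waste(seq_lens, max_model_len=8192)
paged = paged_waste(seq_lens, block_size=16)
print(f"contiguous: {naive:.1%} wasted, paged: {paged:.1%} wasted")
```

With paging, the only waste is the partially filled last block of each sequence, which is why the figure stays small no matter how far requests fall short of the maximum length.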

What you get out of the box:

  • Continuous batching (iteration-level scheduling): requests enter and leave the GPU every token step, not every batch
  • Chunked prefill (v0.4+): long prompts are broken into chunks and interleaved with decode steps, so a 3K-token prefill doesn’t starve short chat requests
  • Automatic prefix caching (APC): the engine detects shared prompt prefixes and reuses KV cache automatically
  • Speculative decoding (EAGLE, Medusa, n-gram): 2–3x latency reduction for memory-bound decode
  • Multi-LoRA serving: serve hundreds of fine-tuned adapters on one base model
  • Broad quantization support: GPTQ, AWQ, FP8, INT8, INT4, AutoRound
  • 200+ model architectures: Llama, Qwen, DeepSeek, Mixtral, MoE, VLMs, embedding models
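
The chunked prefill idea from the list above can be sketched as a per-step token budget. This is a simplification under stated assumptions, not vLLM's scheduler: `plan_steps` and its parameters are hypothetical names, every decoding request is assumed to consume exactly one token per step, and decodes are assumed to be scheduled before any prefill chunk.

```python
def plan_steps(prefill_tokens, num_decoding, token_budget):
    """Return a list of (decode_tokens, prefill_chunk) pairs, one per
    engine step, until the long prompt is fully prefilled."""
    if token_budget <= num_decoding:
        raise ValueError("token_budget must exceed the number of decoding requests")
    remaining = prefill_tokens
    steps = []
    while remaining > 0:
        # Decodes take one token each; the prefill gets whatever is left.
        chunk = min(remaining, token_budget - num_decoding)
        steps.append((num_decoding, chunk))
        remaining -= chunk
    return steps


# A 3,000-token prompt arriving while 24 chat requests are decoding.
steps = plan_steps(prefill_tokens=3000, num_decoding=24, token_budget=512)
for i, (decodes, chunk) in enumerate(steps, start=1):
    print(f"step {i}: {decodes} decode tokens + {chunk} prefill tokens")
```

The 24 decoding requests emit a token on every step while the long prompt is absorbed in the background, instead of stalling for one monolithic prefill pass.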

The config that matters:

python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.85 \
--max-model-len 8192 \
--max-num-seqs 64 \
--enable-prefix-caching \
--enable-chunked-prefill \
--quantization fp8

Key flags explained:

  • --enable-prefix-caching: turns on automatic prefix caching for shared system prompts (massive for RAG)
  • --enable-chunked-prefill: prevents long prefill monopolization; interleaves prefill chunks with decode
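  • --gpu-memory-utilization 0.85: caps vLLM’s own allocation (weights, activations, and the preallocated KV cache) at 85% of each GPU, leaving 15% headroom for memory outside that budget, such as CUDA graph capture; going to 0.95 often causes OOM during graph capture
  • --max-num-seqs 64: caps concurrent sequences. Higher isn’t always better: if the KV cache fills up, the scheduler preempts running sequences and later recomputes (or swaps) their KV blocks, and throughput thrashes.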
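
A rough back-of-the-envelope check shows why --max-num-seqs and --max-model-len interact with memory. The sketch below uses the published Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128). Everything else is an assumption: two 80 GiB GPUs, about 66 GiB of FP8 weights, 6 GiB of activation and runtime overhead, and a 16-bit KV cache. The helper names are hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_token_capacity(total_gpu_gib, utilization, weight_gib, overhead_gib, per_token_bytes):
    """Tokens of KV cache that fit once weights and overhead are subtracted."""
    budget_bytes = (total_gpu_gib * utilization - weight_gib - overhead_gib) * 1024**3
    return int(budget_bytes // per_token_bytes)


per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128, dtype_bytes=2)
capacity = kv_token_capacity(
    total_gpu_gib=160,   # assumed: 2 x 80 GiB GPUs
    utilization=0.85,    # --gpu-memory-utilization
    weight_gib=66,       # assumed: ~70B parameters in FP8
    overhead_gib=6,      # assumed: activations and runtime buffers
    per_token_bytes=per_token,
)
worst_case = 64 * 8192   # --max-num-seqs x --max-model-len
print(f"{per_token} bytes per token, capacity {capacity} tokens, worst case {worst_case} tokens")
```

Under these assumptions the cache holds far fewer tokens than 64 full-length sequences would need, so the cap only stays comfortable when most requests are much shorter than the maximum length.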

MRV2 (Model Runner V2): In v0.17.0+, enable VLLM_USE_V2_MODEL_RUNNER=1 for a rewritten backend that delivers significant throughput gains, especially on newer architectures like GB200.

When to choose vLLM:

  • You support many different models and need one engine to handle them all
  • You run on heterogeneous hardware (NVIDIA, AMD, Intel Gaudi, AWS Trainium)
  • Your team wants the largest community, best documentation, and fastest debugging
  • You need to be online in under 90 seconds from a cold start

Limitation: Peak throughput on dedicated H100 clusters is ~29% lower than SGLang or LMDeploy in some benchmarks, primarily due to Python orchestration overhead. If you have a fixed model and a specialized team, you can squeeze more out of other engines. But for most teams, vLLM’s breadth outweighs that gap.

SGLang: The Throughput Challenger with Automatic Prefix Caching

SGLang, developed by LMSYS (the team behind Chatbot Arena), is no longer a niche alternative. It powers xAI’s Grok 3 and Microsoft Azure’s DeepSeek R1 deployments, running on over 400,000 GPUs worldwide.

What differentiates it: RadixAttention. Instead of manually configuring prefix caches, SGLang builds a radix tree from request prefixes and automatically reuses KV cache across any requests that share token sequences. This is transformative for multi-turn chat, agent loops, and RAG pipelines where system prompts and retrieved contexts repeat.
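
The prefix-reuse idea can be illustrated with a toy token-level trie. This is a sketch of the concept only: SGLang's real structure is a compressed radix tree over KV blocks with eviction, and the `PrefixCache` class, `serve` helper, and token IDs below are hypothetical.

```python
class PrefixCache:
    """Token-level trie: each node stands for one cached token position."""

    def __init__(self):
        self.root = {}

    def match_len(self, tokens):
        """Length of the longest already-cached prefix of tokens."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens):
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


def serve(cache, tokens):
    """Return (tokens reused from cache, tokens that still need prefill)."""
    hit = cache.match_len(tokens)
    cache.insert(tokens)
    return hit, len(tokens) - hit


cache = PrefixCache()
system_prompt = [1, 2, 3, 4]              # shared system prompt (token IDs)
turn_a = system_prompt + [10, 11]         # first user message
turn_b = system_prompt + [20]             # a different user, same system prompt
turn_a2 = turn_a + [30, 31]               # follow-up turn in conversation A

first = serve(cache, turn_a)
second = serve(cache, turn_b)
third = serve(cache, turn_a2)
print(first, second, third)
```

No caller registers a cache key here: the second user reuses the system prompt and the follow-up turn reuses the whole earlier conversation, purely because the token sequences overlap.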

What you get out of the box:

  • RadixAttention: automatic, dynamic prefix caching without manual key management
  • Chunked prefill: same interleaving benefit as vLLM
  • EAGLE/EAGLE3 speculative decoding: state-of-the-art draft-model speculation
  • Prefill-decode disaggregation: separate prefill and decode across different GPU pools for independent scaling
  • MLA-optimized kernels: specifically tuned for DeepSeek models
  • Zero-overhead CPU scheduler: moves scheduling logic off the GPU thread

The config that matters:

python -m sglang.launch_server \
--model-path meta-llama/Llama-3.3-70B-Instruct \
--tp 2 \
--quantization fp8 \
--context-length 8192 \
--mem-fraction-static 0.92 \
--enable-flashinfer-mla \
--host 0.0.0.0 \
--port 8000

Key flags explained:

  • --tp 2: tensor parallelism across 2 GPUs
  • --mem-fraction-static 0.92: SGLang’s memory allocator is more aggressive than vLLM’s; 0.92 is typically stable on H100
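  • --enable-flashinfer-mla: enables optimized Multi-head Latent Attention (MLA) kernels for DeepSeek-class models; it has no benefit for Llama checkpoints like the one above, which use grouped-query attention rather than MLA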

Performance reality check: In H100 benchmarks with unique prompts (no prefix sharing), SGLang achieves roughly 29% higher throughput than vLLM. However, the gap narrows or reverses on workloads with high memory pressure where vLLM’s PagedAttention is more mature. The real win is in shared-prefix workloads — multi-turn conversations, agent loops, and RAG with fixed retrievers — where RadixAttention provides gains no other engine matches automatically.

When to choose SGLang:

  • Your workload is dominated by multi-turn conversations or shared system prompts
  • You are serving DeepSeek models (MLA kernels are best-in-class)
  • You have a dedicated inference team that can manage dependencies (FlashInfer can be finicky to install)
  • You need prefill-decode disaggregation at scale

Limitation: Model coverage is narrower than vLLM. If you serve exotic architectures or need to swap models frequently, vLLM is safer.

TensorRT-LLM: The NVIDIA Optimizer (With a Catch)

TensorRT-LLM is NVIDIA’s official inference SDK. It delivers the highest raw throughput and lowest TTFT on NVIDIA hardware when fully tuned. But it makes very specific tradeoffs.

The compiled engine tradeoff: Traditionally, TensorRT-LLM required compiling a model into a serialized engine — a process that takes ~28 minutes for a 70B model. This is a one-time cost per model version, but it breaks auto-scaling and blue-green deploys unless you precompile and cache engines.

The PyTorch backend (v1.0+): This changed the game. TensorRT-LLM now defaults to a PyTorch backend that loads HuggingFace weights directly, cutting cold start to ~60–90 seconds (comparable to vLLM). You lose some peak throughput compared to the compiled engine, but you gain deployment flexibility.

What you get out of the box:

  • Fused kernels: aggressive kernel fusion for attention and MLP layers
  • FP8 quantization: native, optimized FP8 on Hopper and Blackwell (H100/H200/GB200)
  • Tensor parallelism + pipeline parallelism: mature multi-GPU orchestration
  • Speculative decoding: supported via draft models
  • CUDA graph capture: minimal CPU launch overhead

The config that matters (compiled engine):

# Step 1: Quantize and compile (one-time, ~28 min for 70B)
python quantize.py --model_dir ./llama-3.3-70b \
--output_dir ./quantized \
--qformat fp8
trtllm-build --checkpoint_dir ./quantized \
--output_dir ./engine \
--gemm_plugin fp8
# Step 2: Serve
trtllm-serve --engine_dir ./engine \
--max_batch_size 32 \
--max_input_len 4096 \
--max_output_len 1024

The config that matters (PyTorch backend, no compile):

trtllm-serve --model ./llama-3.3-70b \
--quantization fp8 \
--tp 2 \
--max_batch_size 32

When to choose TensorRT-LLM:

  • You have a single model that won’t change for months
  • You are on NVIDIA-only infrastructure (Hopper or newer)
  • Your team can invest 1–2 weeks in tuning and compilation pipelines
  • You need the absolute highest throughput at 100+ concurrent requests

When to avoid it:

  • You auto-scale from zero (unless you use the PyTorch backend)
  • You serve multiple models and need to swap them daily
  • You are not on NVIDIA hardware

NVIDIA NIM: If you want TensorRT-LLM performance without the compilation headache, NVIDIA NIM bundles precompiled engines, weights, and an API server into a single container. It is essentially TensorRT-LLM with DevOps handled for you.

TGI (Text Generation Inference): The Maintenance Mode Legacy

TGI was Hugging Face’s production serving engine, powering HuggingChat and the Inference API. It introduced continuous batching and Flash Attention to a wide audience. But as of 2026, TGI is officially in maintenance mode.

Hugging Face’s own guidance: the project accepts pull requests for minor bug fixes only, and new deployments should use vLLM or SGLang instead.

What this means for you:

  • If you are already running TGI in production, plan a migration path
  • If you are starting a new project, do not choose TGI
  • TGI’s ecosystem contributions (quantization support, model architectures) have been upstreamed into vLLM and SGLang

TGI remains a respectable piece of engineering, but it is no longer the future.

llama.cpp: The Edge and Local Workhorse

llama.cpp is not a datacenter serving engine. It is optimized for running quantized models (GGUF format) on consumer hardware, CPUs, and edge devices.

What it does well:

  • GGUF quantization: runs 70B models on 24GB consumer GPUs via aggressive quantization
  • CPU inference: AVX/AVX2 optimized paths for machines without GPUs
  • Metal backend: runs on Apple Silicon (M-series chips, including the Ultra variants)
  • Local server mode: exposes an HTTP API for local development
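
How aggressive the quantization has to be for a 70B model on a 24GB card follows from simple arithmetic. The sketch below estimates weight memory only. The bits-per-weight values are illustrative rather than specific GGUF quant types, the 24 GB card is treated as 24 GiB, and the KV cache and runtime buffers need room on top.

```python
def weight_gib(num_params, bits_per_weight):
    """Approximate weight memory in GiB at a given average bits per weight."""
    return num_params * bits_per_weight / 8 / 1024**3


NUM_PARAMS = 70e9   # assumed round parameter count for a "70B" model
GPU_GIB = 24        # consumer card from the bullet above

for bpw in (16, 8, 4.5, 2.5):
    size = weight_gib(NUM_PARAMS, bpw)
    verdict = "fits" if size < GPU_GIB else "does not fit"
    print(f"{bpw:>4} bits/weight: {size:6.1f} GiB of weights, {verdict} in {GPU_GIB} GiB")
```

Under these assumptions, only the roughly 2.5-bit setting fits entirely on the card, and quality degrades at that level. A common alternative is to keep a 4-bit quant and offload some layers to CPU memory.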

When to choose it:

  • You need inference on a laptop, edge device, or embedded system
  • You are building a local AI assistant (e.g., Ollama, which wraps llama.cpp)
  • You want to avoid cloud costs entirely for personal use

When to avoid it:

  • Multi-tenant GPU serving
  • High-throughput API backends
  • Workloads requiring continuous batching across hundreds of concurrent users

LMDeploy: The Dark Horse (C++ Native, Minimal Friction)

LMDeploy pairs a thin Python front end with TurboMind, a C++/CUDA inference engine, and achieves near-SGLang throughput with trivial installation (pip install lmdeploy). It is the practical choice if you want maximum performance without dependency hell.

What you get:

  • Native C++ backend: zero Python orchestration overhead
  • First-class quantization: AWQ, GPTQ, FP8, INT4
  • TurboMind engine: optimized CUDA kernels for decode
  • One-line deployment: simpler setup than SGLang or TensorRT-LLM

The config:

lmdeploy serve api_server \
meta-llama/Llama-3.3-70B-Instruct \
--model-format hf \
--quant-config "dict(type='fp8')" \
--tp 2

When to choose LMDeploy:

  • You want 99% of SGLang’s throughput with 10% of the setup complexity
  • You are on NVIDIA hardware and don’t need vLLM’s broad hardware portability
  • Your team values installation simplicity over ecosystem size

Decision Guide: Map Your Pain Point to the Engine

Here is how to choose based on the problems you identified in Parts 1 and 2.

[Figure: Comprehensive comparison of major LLM serving engines]
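
As a rough textual stand-in for the mapping, the "When to choose" lists from the sections above can be condensed into a small function. The function name, the flag names, and the order in which the rules are checked are assumptions made for this sketch. It is a starting heuristic, not a benchmark result.

```python
def recommend_engine(*, edge_or_local=False, shared_prefix_heavy=False,
                     nvidia_only=False, fixed_model=False,
                     dedicated_inference_team=False):
    """Map workload traits to an engine, following the 'When to choose' lists."""
    if edge_or_local:
        # Laptops, edge devices, personal assistants.
        return "llama.cpp"
    if shared_prefix_heavy and dedicated_inference_team:
        # Multi-turn chat, agent loops, RAG with repeated prefixes.
        return "SGLang"
    if nvidia_only and fixed_model and dedicated_inference_team:
        # One stable model, NVIDIA-only fleet, time to tune.
        return "TensorRT-LLM"
    # Breadth of models and hardware, fastest path to production.
    return "vLLM"


print(recommend_engine(shared_prefix_heavy=True, dedicated_inference_team=True))
print(recommend_engine(nvidia_only=True, fixed_model=True, dedicated_inference_team=True))
print(recommend_engine())
```

The default branch returns vLLM, so anything that does not clearly match a specialist engine falls back to it.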

The Boring Choice Is Usually the Right Choice

If you are a tech lead making this decision for a team, here is the practical advice:

Start with vLLM. It is not the fastest engine on any single benchmark, but it is the fastest to deploy, the easiest to debug, and the most forgiving when your requirements change. You can switch to SGLang later if you measure that RadixAttention would materially improve your workload. You can switch to TensorRT-LLM later if you have a fixed model and a dedicated team to manage compilation.

Do not build your own engine. The gap between a naive FastAPI wrapper around model.generate() and vLLM is 10-24x in throughput. The gap between vLLM and a custom C++ scheduler you wrote in a month is that your custom scheduler has bugs vLLM already fixed.
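
One slice of that gap is easy to reproduce on paper. The toy simulation below counts decode steps for static batching, where a batch runs until its longest request finishes, and for iteration-level scheduling, where a freed slot is refilled on the next step. It assumes every step costs the same and ignores prefill and memory management, so it understates the real difference. The function names and output lengths are hypothetical.

```python
def static_batching_steps(output_lens, batch_size):
    """Each batch occupies the GPU until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """A finished request frees its slot for the next queued request."""
    pending = list(output_lens)[::-1]   # pop() from the end keeps FIFO order
    running = []                        # remaining tokens per active request
    steps = 0
    while pending or running:
        while pending and len(running) < batch_size:
            running.append(pending.pop())
        steps += 1                      # one decode step for every active request
        running = [r - 1 for r in running if r > 1]
    return steps


output_lens = [10, 200, 15, 20, 180, 12, 25, 30]   # tokens to generate per request
static_steps = static_batching_steps(output_lens, batch_size=4)
continuous_steps = continuous_batching_steps(output_lens, batch_size=4)
print(f"static: {static_steps} steps, continuous: {continuous_steps} steps")
```

Even in this simplified model, short requests stop waiting on the longest one in their batch, which is the behavior a naive wrapper around model.generate() cannot provide.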

Measure before you optimize. Run your actual workload (your actual prompts, your actual concurrency patterns) through vLLM first. Log time to first token (TTFT), time per output token (TPOT), and queue depth. If your P99 latency is dominated by KV cache exhaustion, tune --max-model-len and --gpu-memory-utilization. If it is dominated by long prompts blocking short ones, enable chunked prefill. Only after you have exhausted the engine’s built-in optimizations should you consider switching engines.
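
A minimal sketch of the measurement side, assuming you record a request start time and the arrival time of each streamed token on the client. The function names are hypothetical, and the percentile uses the simple nearest-rank definition rather than any particular monitoring tool's method.

```python
import math


def ttft(request_start, token_times):
    """Time to first token: delay between sending the request and token 1."""
    return token_times[0] - request_start


def tpot(token_times):
    """Time per output token: mean gap between consecutive streamed tokens."""
    if len(token_times) < 2:
        return 0.0
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical timestamps in seconds for one streamed response.
token_times = [0.25, 0.30, 0.35, 0.40]
print(f"TTFT = {ttft(0.0, token_times):.3f}s, TPOT = {tpot(token_times):.3f}s")

# Hypothetical per-request TTFT samples collected during a load test.
ttft_samples = [0.2, 0.3, 0.25, 0.9, 0.22, 0.28, 0.31, 0.27, 0.26, 2.4]
print(f"P99 TTFT = {percentile(ttft_samples, 99):.2f}s")
```

Tracking the P99 alongside the mean matters because a few requests stuck behind a long prefill or a full KV cache barely move the average but dominate the tail.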

The engine is the last 10% of the optimization stack. Parts 1 and 2 gave you the 90%: quantization, kernel selection, traffic lanes, batching discipline, and backpressure. Get those right with any modern engine, and your users will see sub-second TTFT and stable streaming. Get those wrong, and the fastest engine in the world will still feel broken.

Series Summary:

  • Part 1 made the model fast: quantization, Flash Attention, paged KV cache, and GPU kernel tuning.
  • Part 2 made the serving stable: traffic lanes, continuous batching, backpressure, cold-start avoidance, and output control.
  • Part 3 gave you the engine map: vLLM for breadth, SGLang for shared-prefix workloads, TensorRT-LLM for peak NVIDIA throughput, and llama.cpp for the edge.

Pick the engine, deploy the configs, and ship.
