Part 2 — Serve-Level Speed: System Design That Stabilizes P95/P99
Last Updated on June 3, 2026 by Editorial Team
Author(s): Mehedi Hasan
Originally published on Towards AI.
You’ve quantized the model, switched to Flash Attention, and maybe even dropped to INT4. Your GPU kernels are now efficient. But users still complain that the app is “sometimes slow.” Welcome to serving hell, where the bottleneck is rarely the model and almost always the system around it.
The theme of this part: once the model is efficient, most production wins come from queueing discipline, traffic routing, and stability controls. P95 and P99 latency are not driven by tensor core utilization. They’re driven by queueing, noisy neighbors, long prompts stuck behind short ones, and slow clients holding onto GPU memory.

The Real Enemy Is Queueing, Not Compute
Here is the counterintuitive truth of production LLM serving: most latency is waiting time, not compute time. A request that takes 50ms of actual GPU work can easily spend 800ms in a queue because the batcher decided to wait for one more request, or because a 4K-token RAG prompt monopolized the prefill slot.
P95 and P99 latency are almost always caused by:
- Queueing: Requests pile up behind a large batch, or wait because the KV cache pool is exhausted and no slots are free
- Noisy neighbors: One tenant submits a 10K-token prompt and stalls everyone else
- Long prompts: Prefill dominates the GPU, starving decode steps
- Slow clients: A streaming client on a 3G connection reads slowly, so the server buffers its tokens while the request keeps GPU memory pinned
- Cold starts: A freshly scaled replica that hasn’t loaded weights or allocated KV cache
If you only optimize median latency, you miss the real user experience. Users remember the one time they waited three seconds for the first token. Your product metrics will look fine while your user trust erodes.
Measure the Right Things
Before you fix anything, you need metrics that split the problem correctly. Most teams log “total request time” and call it a day. That is useless.
Log these on every request:
- Time-to-first-token (TTFT): the user’s perception of responsiveness
- Time Per Output Token (TPOT): the standard industry metric for decode speed (e.g., 20ms/token). It is the inverse of tokens-per-second, making it easier to calculate SLAs
- Prompt tokens and output tokens: separates prefill cost from decode cost
- Queue wait time (KV Cache Starvation): time spent before the GPU starts work. Note: requests usually queue not because the GPU is at 100% compute, but because it has run out of PagedAttention blocks in the KV cache
- Prefill time and decode time separately: tells you which phase to optimize
- P95 and P99 per lane: not global P99, but per traffic lane (interactive vs. batch, short vs. long)
Why per-lane matters: If you mix a 50-token chat query with a 4K-token legal document summary, your global P99 will be dominated by the long prompt. You’ll optimize the wrong thing. Split your metrics by lane and optimize each lane independently.
Practical implementation: Most teams pipe these into Prometheus or Datadog. The key is tagging every metric with lane, model, quantization, and gpu_type. If you can’t segment, you can’t diagnose.
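As a rough illustration of the metrics above, here is a minimal sketch that derives queue wait, TTFT, and TPOT from per-request timestamps and then computes a percentile per lane. The `RequestTiming` fields, the lane labels, and the nearest-rank percentile are assumptions made for this example; in production you would more likely emit these as histograms to Prometheus or Datadog.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequestTiming:
    lane: str                # hypothetical lane label, e.g. "interactive-short"
    arrival_s: float         # request received by the server
    prefill_start_s: float   # GPU begins prefill (end of queue wait)
    first_token_s: float     # first output token emitted
    last_token_s: float      # final output token emitted
    output_tokens: int

    @property
    def queue_wait_s(self) -> float:
        return self.prefill_start_s - self.arrival_s

    @property
    def ttft_s(self) -> float:
        # TTFT as the user sees it: queue wait plus prefill
        return self.first_token_s - self.arrival_s

    @property
    def tpot_s(self) -> float:
        # Decode time averaged over the tokens after the first one
        if self.output_tokens <= 1:
            return 0.0
        return (self.last_token_s - self.first_token_s) / (self.output_tokens - 1)


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def ttft_by_lane(timings, pct):
    """Group TTFT by lane, then take the percentile inside each lane."""
    lanes = defaultdict(list)
    for t in timings:
        lanes[t.lane].append(t.ttft_s)
    return {lane: percentile(vals, pct) for lane, vals in lanes.items()}
```

Calling `ttft_by_lane(timings, 99)` returns one P99 per lane, which is the number to alert on instead of a single global P99.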
Traffic Shaping: Separate Your Lanes
The single most effective serving optimization is also the simplest: don’t let different workloads fight over the same GPU.
Interactive vs. Batch
Interactive traffic (chat, streaming UI) needs low TTFT. Batch traffic (background summarization, embedding generation) needs high throughput. They want opposite things from the scheduler.
The rule: Run them on separate replicas, or at minimum, separate queues with different scheduling policies.
In vLLM, you can approximate this with separate engine instances:
# Interactive lane: small max batch, prioritize TTFT
python -m vllm.entrypoints.api_server \
    --model your-model \
    --max-num-seqs 4 \
    --max-model-len 4096 \
    --port 8000

# Batch lane: larger batch, tolerate higher latency
python -m vllm.entrypoints.api_server \
    --model your-model \
    --max-num-seqs 16 \
    --max-model-len 8192 \
    --port 8001
In TGI, the scheduler is less configurable per instance, so the cleanest approach is separate deployments behind a router.
Prompt-Length Lanes
Even within interactive traffic, a 100-token prompt and a 3K-token RAG prompt should not share a queue. The long prompt will stall the short one during prefill.
Fast lane: Prompts under ~512 tokens. Strict max wait time (5–10ms), small batch cap.
Slow lane: Prompts over ~512 tokens. Longer wait time allowed (50–100ms), larger batch cap acceptable.
Router logic (pseudocode concept):
if prompt_tokens < 512 and streaming == true:
    route_to_fast_lane()
else:
    route_to_slow_lane()
The threshold depends on your model and GPU. Measure where your prefill time starts to dominate TTFT and draw the line there.
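One way to turn "measure and draw the line" into something concrete is sketched below. It assumes you have already benchmarked prefill time at a handful of prompt lengths on your own hardware; the function name `pick_lane_threshold`, the sample format, and the budget value are all hypothetical.

```python
def pick_lane_threshold(samples, prefill_budget_ms):
    """Pick the fast-lane cutoff from measured prefill times.

    samples: iterable of (prompt_tokens, prefill_ms) pairs, ideally one
    aggregated value (for example a P95) per prompt length.
    Returns the largest measured prompt length whose prefill, and that of
    every shorter measured length, fits in the budget. Returns None if
    even the shortest sample is over budget.
    """
    threshold = None
    for prompt_tokens, prefill_ms in sorted(samples):
        if prefill_ms > prefill_budget_ms:
            break
        threshold = prompt_tokens
    return threshold
```

For example, with measurements of 6ms at 128 tokens, 11ms at 256, 19ms at 512, and 41ms at 1,024, a 20ms prefill budget puts the cutoff at 512 tokens. The numbers here are made up for illustration; only your own benchmark can supply real ones.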
The Modern Alternative: Chunked Prefill
While routing by prompt length is a great architectural defense, modern inference engines (like vLLM 0.4+ and TGI) now solve this at the scheduler level via Chunked Prefill. Instead of computing a 3,000-token prefill in one giant block (which starves all other requests of decode steps), the engine breaks the prefill into smaller chunks (e.g., 512 tokens). It computes one chunk, runs a decode step for active streams, computes the next chunk, and so on. (We’ll cover which engines support this in Part 3.)
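The interleaving described above can be pictured with a toy scheduler. This is only a sketch of the idea, not how any particular engine implements it: real schedulers typically fuse a prefill chunk and the pending decode tokens into the same batch under a token budget, whereas this example alternates them as separate steps for readability. The function name and step format are assumptions.

```python
def schedule_steps(prefill_tokens, chunk_size, active_decodes):
    """Return the ordered scheduler steps for one long prefill.

    Each step is a (kind, amount) tuple: ("prefill", tokens_in_chunk) or
    ("decode", number_of_streams_advanced_by_one_token).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    steps = []
    remaining = prefill_tokens
    while remaining > 0:
        chunk = min(chunk_size, remaining)
        steps.append(("prefill", chunk))
        remaining -= chunk
        # After every chunk, active streams get one decode step,
        # so they are never starved for the whole prefill.
        if active_decodes > 0:
            steps.append(("decode", active_decodes))
    return steps
```

With a 3,000-token prompt and 512-token chunks, the four active streams in this toy model get six decode steps during the prefill instead of one after it.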
Continuous Batching and SLA Caps
Static batching is dead. Today, engines use Continuous Batching (or iteration-level scheduling). Instead of waiting for a batch to fill, the scheduler re-forms the batch at every decode iteration, injecting new requests as soon as a sequence finishes and its KV cache blocks free up.
The danger of continuous batching is that greedy schedulers can ruin TTFT. Set your maximum wait times to your TTFT SLA. If your interactive SLA is 100ms, configure the router so no request waits more than 80ms in the queue, leaving 20ms for the actual prefill compute. (Note: this 80/20 math only applies to the fast-lane with short prompts; long prompts will need dedicated SLA tracking.)
In vLLM, this is handled internally, but you control the tradeoff via --max-num-seqs (max concurrent sequences), --max-num-batched-tokens (the token budget per scheduler step), and --max-model-len. For more explicit control, some teams run custom dispatchers:
# Dispatcher pseudocode
MAX_WAIT_MS = 10
MAX_BATCH = 8
while True:
    batch = queue.collect_until(
        max_items=MAX_BATCH,
        timeout_ms=MAX_WAIT_MS
    )
    if batch:
        submit_to_gpu(batch)
In TGI, the --max-waiting-tokens and --max-batch-total-tokens flags serve a similar purpose:
text-generation-launcher \
    --model-id your-model \
    --max-waiting-tokens 20 \
    --max-batch-total-tokens 8192
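For teams that do write their own dispatcher, the `collect_until` step from the pseudocode earlier in this section could look roughly like the following with `asyncio`. This is a sketch under one assumption that differs slightly from the pseudocode: the wait timer starts when the first request arrives, so an idle dispatcher does not spin. The function is hypothetical and not part of vLLM or TGI.

```python
import asyncio
import time


async def collect_until(queue: asyncio.Queue, max_items: int, timeout_ms: float) -> list:
    """Wait for the first item, then gather more until the batch is full
    or timeout_ms has elapsed since that first item arrived."""
    batch = [await queue.get()]
    deadline = time.monotonic() + timeout_ms / 1000
    while len(batch) < max_items:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            # Max wait reached: ship the partial batch instead of waiting longer
            break
    return batch
```

The design choice that matters is the deadline: a partial batch ships once the wait cap is hit, which is what protects TTFT when traffic is light.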
Fairness and Admission Control
Without fairness controls, one user can submit 100 requests and fill the entire batch, spiking everyone else’s latency.
Rate limiting (HTTP 429): Per-tenant or per-API-key token limits. Not just requests per minute, but tokens per minute, because a 4K prompt carries roughly 80x the tokens of a 50-token prompt and costs correspondingly more to prefill. If a single tenant spikes, return 429 Too Many Requests, which most client SDKs treat as a signal to back off and retry.
Admission control (HTTP 503): If the server’s global queue depth exceeds a threshold, reject new requests with a 503 Service Unavailable + Retry-After header. A fast 503 is always better than a request that hangs for 30 seconds and times out.
# Admission control pseudocode
if queue_depth > MAX_QUEUE_DEPTH:
    return HTTP_503(retry_after=5)
Per-tenant fairness: Some engines support token-bucket or round-robin scheduling across tenants. If yours doesn’t, shard by tenant ID at the router level.
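The two controls above can be combined in a small gate in front of the engine. The sketch below is one possible shape, assuming a token bucket per tenant that is measured in model tokens rather than requests; the class and function names, the refill rate, and the burst size are all hypothetical.

```python
import time


class TokenBucket:
    """Per-tenant budget measured in model tokens, refilled continuously."""

    def __init__(self, tokens_per_minute: float, burst: float, clock=time.monotonic):
        self.rate_per_s = tokens_per_minute / 60.0
        self.capacity = burst
        self.level = burst
        self.clock = clock
        self.updated_at = clock()

    def try_consume(self, tokens: int) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never above the burst capacity
        self.level = min(self.capacity, self.level + (now - self.updated_at) * self.rate_per_s)
        self.updated_at = now
        if tokens > self.level:
            return False
        self.level -= tokens
        return True


def admit(bucket: TokenBucket, request_tokens: int, queue_depth: int, max_queue_depth: int) -> int:
    """Return the HTTP status the gateway should answer with."""
    if queue_depth > max_queue_depth:
        return 503  # server-wide overload: shed load and send Retry-After
    if not bucket.try_consume(request_tokens):
        return 429  # this tenant exceeded its token budget
    return 200
```

The overload check runs first so that a request rejected with 503 does not also drain the tenant's budget.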
Streaming Stability
Streaming is non-negotiable for UX, but it introduces a stability risk: slow clients create backpressure.
If a client on a slow connection reads 1 token per second while your model generates 50 tokens per second, the server's buffer grows by 49 unacknowledged tokens every second. These tokens sit in cheap CPU RAM, but there is a hidden, fatal cost: the active connection pins the user's KV cache state in precious GPU VRAM, preventing new users from taking that GPU slot.
Backpressure
Implement a max unacknowledged token limit using a bounded queue. If the client falls behind, the queue fills up, and you cancel the generation to free the KV cache:
# Sketch: bounded backpressure using asyncio.Queue
import asyncio

MAX_UNACKNOWLEDGED = 32

async def stream_tokens(generator, client):
    # The queue acts as our bounded backpressure buffer
    queue = asyncio.Queue(maxsize=MAX_UNACKNOWLEDGED)
    # Background task drains the queue to the client (network I/O).
    # Keep a reference so the task is not garbage-collected mid-flight.
    drain_task = asyncio.create_task(drain_to_client(queue, client))
    async for token in generator:
        try:
            # If the client is slow, the queue fills and put_nowait raises QueueFull
            queue.put_nowait(token)
        except asyncio.QueueFull:
            # Backpressure limit reached: stop generating so the KV cache can be freed.
            # Assumes `generator` is an async generator; otherwise call your engine's abort API.
            await generator.aclose()
            drain_task.cancel()
            break
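To see the bounded-queue pattern trip in practice, here is a self-contained simulation with a fake token stream and a fake client. Everything in it is a stand-in invented for the demo (`fake_generator`, the `sent` list, the delay values), and the limit is set to 4 so the slow case trips quickly. It only illustrates the control flow, not a real engine integration.

```python
import asyncio

MAX_UNACKNOWLEDGED = 4  # small on purpose so the demo trips quickly


async def fake_generator(n, cancelled):
    """Stand-in for an engine stream; records whether it was closed early."""
    try:
        for i in range(n):
            await asyncio.sleep(0)   # yield control, like awaiting the GPU
            yield f"tok{i}"
    except GeneratorExit:
        cancelled.append(True)       # a real engine would free the KV cache here
        raise


async def drain_to_client(queue, sent, delay_s):
    """Stand-in for the network writer: one simulated write per token."""
    while True:
        token = await queue.get()
        await asyncio.sleep(delay_s)
        sent.append(token)


async def run_demo(n_tokens, client_delay_s):
    queue = asyncio.Queue(maxsize=MAX_UNACKNOWLEDGED)
    sent, cancelled = [], []
    gen = fake_generator(n_tokens, cancelled)
    drain = asyncio.create_task(drain_to_client(queue, sent, client_delay_s))
    async for token in gen:
        try:
            queue.put_nowait(token)
        except asyncio.QueueFull:
            await gen.aclose()       # client fell too far behind: stop generating
            break
    drain.cancel()
    return len(sent), bool(cancelled)
```

Running `asyncio.run(run_demo(100, 0.05))` models a client that needs 50ms per token and ends with the generator closed early, while a client with zero delay lets the stream run to completion.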
Cancel on Disconnect
If a user closes their browser tab, you must stop generating immediately. Otherwise, you burn GPU cycles and memory for a ghost request.
In FastAPI/Starlette:
async def generate(request):
    async for token in model.stream():
        if await request.is_disconnected():
            model.cancel()
            break
        yield token
The API Gateway Trap: Your FastAPI or Python backend will never know the user disconnected unless your API Gateway (Nginx, AWS API Gateway, Cloudflare) is explicitly configured to eagerly propagate connection closures. If the gateway holds the connection open, your GPU will happily burn cycles generating a ghost response.
AWS Specific Caveat: AWS API Gateway enforces a 29-second integration timeout by default on REST APIs, and HTTP APIs cap at 30 seconds. Since June 2024 the REST limit can be raised for Regional and private APIs through a quota increase, but the default applies unless you request one. If your LLM stream takes longer than the timeout, AWS severs the connection with a 504, killing the generation even if the user is still waiting. For LLMs, it is often simpler to bypass API Gateway entirely (e.g., routing directly through an Application Load Balancer or using WebSockets).
Critical point: Cancel-on-disconnect is not a UX nicety. It is a multi-tenant stability requirement. One leaked generation can occupy a GPU slot for 30 seconds.
Cold-Start Avoidance
Cold starts are the silent killers of P99. When a new replica spins up — whether from autoscaling or a deployment — you pay four penalties:
- Model weight loading: Reading multi-GB from disk to GPU
- GPU memory allocation: CUDA context initialization and PagedAttention KV cache pool allocation
- Kernel warmup: First-run kernel compilation and CUDA graph capture
- Empty caches: No prefix KV cache, no attention plan cache
Warm Replicas
For interactive traffic, set your autoscaler minimum to at least 1 warm replica per GPU type. Never scale to zero for real-time workloads.
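In Kubernetes, an HPA driven by GPU utilization is a trap. (Note that nvidia.com/gpu is only the resource name; the utilization metric itself usually comes from an exporter such as NVIDIA DCGM.) LLM servers often report near-100% utilization even when underloaded. Instead, scale on queue depth or concurrent requests using KEDA: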
# KEDA ScaledObject scaling on vLLM queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm-deployment  # hypothetical Deployment name; replace with yours
  minReplicaCount: 2  # Never scale to zero!
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server:9090
        # vLLM exports this gauge for queued requests; sum it across replicas
        query: sum(vllm:num_requests_waiting)
        threshold: '10'  # Scale up when more than ~10 requests per replica are waiting
Warmup Prompts
After loading weights, run a few representative prompts through the model before marking the replica healthy. This triggers kernel compilation, CUDA graph capture, and attention plan caching. You also want to force the engine to allocate its maximum PagedAttention KV cache pool upfront, preventing Out-Of-Memory (OOM) crashes under real traffic.
def warmup(model):
    shapes = [
        ("Hi", 1),  # short chat: 1 token forces prefill only
        ("Summarize: " + "x"*2000, 1)  # long RAG: prefill without burning decode time
    ]
    for text, max_new in shapes:
        # max_new_tokens=1 forces the prefill without burning decode time
        model.generate(text, max_new_tokens=max_new)
Health check trick: Only return HTTP 200 from /health after warmup completes. Your load balancer will not route traffic to a cold replica.
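The health check trick comes down to a single flag that flips only after warmup succeeds. Here is a minimal, framework-agnostic sketch; the `ReadinessGate` class is hypothetical, and in a real service `health_status()` would back whatever /health handler your web framework exposes.

```python
import threading


class ReadinessGate:
    """Reports 503 until warmup has completed successfully."""

    def __init__(self):
        self._ready = threading.Event()

    def run_warmup(self, warmup_fn):
        # If warmup_fn raises, the flag is never set and the replica stays out of rotation
        warmup_fn()
        self._ready.set()

    def health_status(self) -> int:
        return 200 if self._ready.is_set() else 503
```

Using `threading.Event` keeps the flag safe to read from a health-check thread while warmup runs on another.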
Warm Prefix Caches
If you use prefix caching (from Part 1), precompute your top system prompts at boot time. This ensures the first real user gets a cache hit and forces the KV cache pool to be fully allocated.
# Precompute common prefixes and allocate KV pool at startup
for system_prompt in COMMON_SYSTEM_PROMPTS:
# max_new_tokens=1 forces the prefill without burning decode time
model.generate(system_prompt, max_new_tokens=1)
Output Control
Unbounded decode is the fastest way to ruin throughput. If every request decodes 1,024 tokens “just in case,” your GPU spends most of its time on low-value tail tokens.
Adaptive Max Output Tokens
Set max_new_tokens based on task type and prompt length:
- Extraction / classification: 64–128 tokens
- Chat / Q&A: 256–512 tokens
- Reasoning / coding: 512–1,024 tokens
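If the prompt is already huge, tighten the cap further. A 6K-token RAG prompt with a 1K-token output is 7K total tokens. The same prompt with a 256-token output is 6.25K total: about 11% fewer tokens overall and roughly 75% fewer sequential decode steps, which is typically where most of the wall-clock time goes.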
def choose_max_tokens(task: str, prompt_tokens: int) -> int:
    base = {"extract": 96, "chat": 384, "reason": 768}.get(task, 256)
    if prompt_tokens > 4000:
        base = min(base, 256)
    return base
Stop Sequences
Use stop sequences to cut generation short when the model is clearly done. Common patterns:
- Tool call boundaries: "<<function_calls>", "</invoke>"
- Structured output delimiters: "}", "```"
- Reasoning markers: "Final Answer:", "###"
This is especially effective with JSON mode or tool use. The model often starts generating padding or repetition after the valid output ends.
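Most inference engines accept stop sequences as a request parameter, so you rarely need to implement this yourself. Still, the behavior is easy to picture with a small helper that cuts text at the earliest match. The function below is a hypothetical illustration, and it assumes the stop sequence itself should be dropped from the output.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest stop sequence.

    Returns (kept_text, stopped). The stop sequence itself is excluded.
    """
    cut = None
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1 and (cut is None or idx < cut):
            cut = idx
    if cut is None:
        return text, False
    return text[:cut], True
```

Two caveats if you apply this at a gateway: a delimiter such as "}" usually needs to be kept in the output rather than dropped, and when streaming you have to hold back any trailing characters that could be the start of a stop sequence.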
Early Stopping Heuristics
For non-structured tasks, consider stopping when:
- The model repeats the same token 3+ times
- Entropy drops below a threshold (model is stuck in a loop)
- A confidence score on the EOS token exceeds a threshold
These are advanced and require careful validation, but they can cut 20–30% of wasted decode time.
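The three heuristics above can be combined into a single check. The sketch below is one possible shape, not a validated recipe: every threshold is a placeholder, the function names are hypothetical, and the entropy rule requires a sustained run of low-entropy steps because a single confident step is normal behavior rather than a loop.

```python
import math


def repeated_tail(token_ids, min_repeats=3):
    """True if the last min_repeats tokens are identical."""
    if len(token_ids) < min_repeats:
        return False
    tail = token_ids[-min_repeats:]
    return all(t == tail[0] for t in tail)


def entropy_nats(probs):
    """Shannon entropy of a next-token distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_stop(token_ids, recent_entropies, eos_prob, *,
                min_repeats=3, entropy_floor=0.05,
                entropy_window=8, eos_threshold=0.9):
    # Heuristic 1: the same token repeated several times in a row
    if repeated_tail(token_ids, min_repeats):
        return True
    # Heuristic 2: entropy stayed under the floor for a full window of steps
    window = recent_entropies[-entropy_window:]
    if len(window) == entropy_window and max(window) < entropy_floor:
        return True
    # Heuristic 3: the model is already confident it should emit EOS
    return eos_prob >= eos_threshold
```

Here `recent_entropies` would be filled by calling `entropy_nats` on each step's next-token distribution. Legitimate outputs such as "000" or long deterministic code can trip these rules, which is why they need validation on your own traffic before going live.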
How to Stop P99 from Ruining Your Product
Here is the practical order that actually works in production:
- Split your metrics by lane. You cannot optimize what you cannot see. Separate interactive vs. batch, short vs. long prompt.
- Route traffic into separate lanes. Interactive gets its own replicas or queue. Long prompts never block short ones.
- Add continuous batching with a max wait. Protect TTFT. A small batch that ships fast is better than a full batch that waits.
- Implement admission control. Reject requests when queue depth is high. A fast 503 is better than a slow timeout.
- Add cancel-on-disconnect and backpressure. Stop burning GPU on ghost requests and slow clients. Fix your API gateway to propagate disconnects.
- Keep warm replicas and run warmup prompts. Cold starts are P99 spikes you can eliminate entirely.
- Cap output length by task type. Most requests don’t need 1,024 tokens. Save decode for the ones that do.
- Add prefix caching and warm it at boot. If you have long system prompts, this is the highest-ROI TTFT win.
The mindset shift: Stop thinking about “how fast is my model?” Start thinking about “how predictable is my latency distribution?” A median of 100ms with a P99 of 3 seconds is worse than a median of 200ms with a P99 of 400ms.
But here is the catch: building all of this custom routing and queueing logic from scratch is exhausting. In Part 3, we’ll map all of these needs to specific inference engines — vLLM, TGI, TensorRT-LLM, and llama.cpp. We’ll look at which engines give you chunked prefill, continuous batching, and prefix caching out-of-the-box, and which ones force you to build the architecture yourself.
Published via Towards AI