
Moving from Ollama to vLLM: Finding Stability for High-Throughput LLM Serving

Author(s): Daniel Voyce

Originally published on Towards AI.

Photo by Fabrizio Chiagano on Unsplash

If you have read any of my previous articles, you will know that more often than not I try to self-host my infrastructure (because, as a perpetual startup CTO, I am cheap by nature).

I have been pretty heavily utilising GraphRAG (both Microsoft's version and my own home-grown version) for the past year, and I am always amazed at how much a small increase in document complexity can blow out budgets.

Back when I was using gpt-4.1-mini from OpenAI — one set of documents alone cost me over $200 (!!).

Even with gpt-4.1-nano (the cheapest frontier model right now), my budget was ridiculous. 215 million tokens for a few (admittedly large) documents is absurd, and the fact that it was taking several days to process made things worse.

When I first started deploying local Large Language Models (LLMs) on my trusty NAS, Ollama stood out immediately as an option. It was simple, quick to set up with Docker, and offered support for huge context windows, which is perfect for GraphRAG's demanding use cases. The ability to handle prompts up to 128K tokens was particularly attractive with the newest Gemma 3 open-source models from Google.

However, as I began pushing Ollama to its limits, issues quickly surfaced. Its context window calculations were oddly inconsistent, often settling on random sizes like 35,567 tokens instead of the configured 128K. This regularly caused the model to stall or freeze under heavier workloads.

Ollama logs showing 27-minute hangs

Things got worse when the traffic increased. Ollama frequently locked up, dropping contexts and needing constant manual restarts, and it quickly became clear that it wouldn't hold up in production. I tried a custom adapter to directly manage Ollama's context windows and set safe defaults. While this patch helped a bit, it felt more like a band-aid than a genuine fix. The final straw was realising that a locked-up Ollama would just sit there for its full timeout, and that timeout was set at 48 hours, which is obviously ridiculous!
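For anyone in the same spot, the adapter boiled down to something like the sketch below: pin num_ctx explicitly on every request and wrap the call in a client-side timeout so a hang fails fast. This is a simplified sketch rather than my actual adapter code; the endpoint, model tag and timeout values are illustrative, but num_ctx and keep_alive are standard Ollama request options.

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def generate(prompt: str, num_ctx: int = 32768, timeout_s: int = 300) -> str:
    payload = {
        "model": "gemma3:4b",                 # illustrative model tag
        "prompt": prompt,
        "stream": False,
        "keep_alive": "10m",                  # unload the model after 10 minutes idle
        "options": {"num_ctx": num_ctx},      # pin the context window instead of trusting Ollama's guess
    }
    # Client-side timeout so a hung request fails fast instead of waiting out Ollama's own timeout
    resp = requests.post(OLLAMA_URL, json=payload, timeout=timeout_s)
    resp.raise_for_status()
    return resp.json()["response"]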

Eventually, reliable performance and throughput became essential, not optional. vLLM came onto my radar as a more robust solution, promising efficient batching, better memory handling, and consistent performance at scale.

The differences between vLLM and Ollama

There are a million other articles on this so I won’t regurgitate those — I particularly like this one:

VLLM vs. Ollama: Choosing the Right Lightweight LLM Framework for Your AI Applications (Why the Right LLM Framework Matters, blog.stackademic.com)

In a nutshell, vLLM offers a more production-ready approach with higher throughput, at the expense of some convenience (no multi-model serving, no GGUF, quantisation limitations).

It basically means that you deploy 1 container that loads 1 model into the GPU for inference.
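Because that one container exposes an OpenAI-compatible API, anything that can talk to OpenAI can talk to vLLM by swapping the base URL. A minimal sketch (the port and model name match the Docker Compose file in the next section; the prompt is just an example):

from openai import OpenAI

# Point the standard OpenAI client at the vLLM container (host port 28888 is mapped in the Compose file)
client = OpenAI(base_url="http://localhost:28888/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="unsloth/gemma-3-4b-it",
    messages=[{"role": "user", "content": "Summarise GraphRAG in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)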

Getting vLLM Up and Running with Docker Compose

Switching over to vLLM involved replicating my existing Ollama setup using Docker Compose. Using the official vllm/vllm-openai:latest Docker image, I set up my Gemma-3 model quickly. I chose a smaller-than-normal model because the 4B one is smart enough for my requirements and gives reasonable performance on my 2 x RTX A2000 12GB GPUs.
The setup involved mounting my Hugging Face cache, injecting my Hugging Face token securely through Docker secrets, and configuring the essential GPU and context parameters.

To give you the TL;DR (for those who have come here looking for a simple way to do the same), here's what the final Docker Compose file looked like:

services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    ipc: host                      # share the host's /dev/shm (see the NCCL notes below)
    ports:
      - "28888:8000"               # vLLM's OpenAI-compatible API, exposed on host port 28888
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      NVIDIA_VISIBLE_DEVICES: "0,1"
      HUGGING_FACE_HUB_TOKEN: "hf_***"   # supply your own token (injected via Docker secrets in my setup)
      PYTORCH_CUDA_ALLOC_CONF: "max_split_size_mb:64,expandable_segments:True"
    command: >
      --model unsloth/gemma-3-4b-it
      --max-model-len 56000
      --max-num-seqs 2
      --tensor-parallel-size 2
      --gpu-memory-utilization 0.88
      --swap-space 16
      --enable-chunked-prefill
      --trust-remote-code
      --quantization bitsandbytes
      --enforce-eager
    networks:
      - ragflow                    # must match the network key defined at the bottom
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

networks:
  ragflow:
    external: true
    name: docker_rag

I hate posting stuff like this because it always looks so forehead-slappingly simple, but it took many hours to get to this point (why NCCL_P2P_DISABLE? Why do I have to use that specific model? etc.).

Failures and Adjustments

While the setup appeared easy enough, real-world deployment quickly brought some challenges to the surface. The first hurdle was shared-memory limits on multi-GPU setups. Specifically, running vLLM across two RTX A2000 GPUs led to NCCL errors about insufficient shared memory (/dev/shm). Docker's default 64MB limit wasn't nearly enough: each GPU required around 33MB, causing crashes at startup. I've not really had to mess with IPC settings in Docker before, but after some back and forth with ChatGPT, I added ipc: host to let the container access the host's larger shared memory, solving the NCCL initialisation issue.
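If you want to confirm what a container actually has before NCCL falls over, a quick check like the sketch below (run inside the container) reports the size of /dev/shm: with Docker's defaults it comes back at roughly 64MB, and with ipc: host it matches the host.

import os

# Report the size of /dev/shm inside the container; Docker defaults to 64MB
# unless ipc: host (or --shm-size) is used.
st = os.statvfs("/dev/shm")
total_mb = st.f_blocks * st.f_frsize / (1024 * 1024)
free_mb = st.f_bavail * st.f_frsize / (1024 * 1024)
print(f"/dev/shm: {total_mb:.0f} MB total, {free_mb:.0f} MB free")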

Another problem emerged with GPU memory allocation. Default settings quickly triggered out-of-memory (OOM) errors during model loading. The solution involved tweaking vLLM’s memory utilisation parameter down from the default 0.98 to a safer range around 0.85. This adjustment provided a buffer against transient memory spikes which allowed it to start up successfully.
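A useful sanity check before settling on a value is to look at how much VRAM is genuinely free on each card. Something like this rough sketch using nvidia-ml-py (not part of my pipeline, just a diagnostic) makes it obvious when the display or another process is already eating into a GPU:

import pynvml  # pip install nvidia-ml-py

# Print used/total VRAM per GPU to help choose a safe --gpu-memory-utilization
pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(name, bytes):
        name = name.decode()
    print(f"GPU{i} {name}: {mem.used / 1e9:.1f} GB used of {mem.total / 1e9:.1f} GB")
pynvml.nvmlShutdown()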

Memory sharing across 2 GPUs vs load balancing the GPUs with NGINX

Originally I was planning on copying how Ollama did things (i.e. start it and forget it), but vLLM is a different beast. As I had two GPUs that were supposedly identical, it made sense to try and load balance them by using NGINX to round-robin the requests and increase parallelism. Unfortunately, this wasn't as easy as it seemed.

Running two separate containers, one per GPU, brought unexpected GPU memory contention, especially with GPU0 also handling display tasks. Initially, GPU0 failed to load properly because the VRAM occupied by the system's graphics display threw off the balance of memory available between the two GPUs. Adjusting gpu-memory-utilization slightly lower for GPU0, while allowing GPU1 to fully utilise its available memory, resolved the issue and allowed both GPUs to function correctly.

Sounds good? When testing with a heavy GraphRAG pipeline, it became clear that while NGINX-based load balancing was easy to get going, it fell short on stable performance, especially under heavier, parallel inference loads.

In the end, adopting memory sharing across both GPUs via vLLM's tensor parallelism, rather than NGINX-based load balancing, made more sense: not merely for the sake of raw performance, but also for long-term reliability and operational simplicity.

The final form

Wrapping up, the journey to our final vLLM setup came down to some practical and clear-cut decisions. Choosing the unsloth/gemma-3-4b-it model gave us the flexibility we needed. Rather than using a pre-quantised model, we opted for dynamic quantisation with --quantization bitsandbytes. This approach let us better manage GPU memory without locking us into any rigid limitations.

We set PYTORCH_CUDA_ALLOC_CONF to max_split_size_mb:64,expandable_segments:True specifically to tackle CUDA memory fragmentation. This change allowed PyTorch to handle GPU memory more effectively, cutting down on annoying runtime memory errors and making sure things ran smoothly over long sessions.

Using --enforce-eager turned out to be pretty helpful, too. It made debugging easier and GPU behaviour more predictable. Sure, it added a bit of overhead, but the simpler troubleshooting was well worth it.

The decision to cap the model's maximum length at 56,000 tokens was carefully considered. Even though our KV cache still had room to spare, this length felt like the right balance: long enough for our needs without slowing down performance, crashing when a particularly large prompt gets pushed through, or adding unnecessary complexity. It also left us some room to increase the limit in the future if we needed to, but right now I needed stability so I could leave some tasks running overnight!
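On the GraphRAG side, the belt-and-braces version of this is to count prompt tokens before a request goes out. A minimal sketch, assuming the tokenizer for unsloth/gemma-3-4b-it can be pulled from Hugging Face (the headroom value is illustrative):

from transformers import AutoTokenizer

MAX_MODEL_LEN = 56_000        # matches --max-model-len in the Compose file
OUTPUT_HEADROOM = 2_000       # illustrative allowance for the completion

# Same tokenizer as the served model, so counts line up with what vLLM sees
tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-3-4b-it")

def fits_in_context(prompt: str) -> bool:
    return len(tokenizer.encode(prompt)) + OUTPUT_HEADROOM <= MAX_MODEL_LEN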

Results

So, did it work?

Knowledge Graph from GraphRAG

YES! I have now been running GraphRAG over some pretty heavy sets of documents for basically zero cost (Electricity not included). The knowledge graph builds cleanly and the retrieval is excellent!

And compared to using OpenAI, it's actually faster despite a token rate of only around 16 t/s:

Token throughput
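If you want to reproduce that number for your own setup, a rough sketch like this times a single non-streaming request and divides the completion tokens vLLM reports by the elapsed time (prompt and max_tokens are arbitrary):

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:28888/v1", api_key="not-needed")

start = time.perf_counter()
response = client.chat.completions.create(
    model="unsloth/gemma-3-4b-it",
    messages=[{"role": "user", "content": "Write a 300-word overview of graph-based RAG."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start

# vLLM returns OpenAI-style usage stats, so tokens/second falls straight out
generated = response.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.1f}s = {generated / elapsed:.1f} t/s")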

About the author

Dan is the founder of MindLattice, dedicated to modernising customer data landscapes to enable machine learning and AI.

He is a start-up veteran with over 20 years of experience delivering solutions for some of the largest companies in Australia and the UK.


Published via Towards AI


