
Moving from Ollama to vLLM: Finding Stability for High-Throughput LLM Serving

Author(s): Daniel Voyce

Originally published on Towards AI.

Photo by Fabrizio Chiagano on Unsplash

If you have read any of my previous articles, you will know that more often than not I try to self-host my infrastructure (because, as a perpetual startup CTO, I am cheap by nature).

I have been pretty heavily utilising GraphRAG (both Microsoft's version and my own home-grown version) for the past year, and I am always amazed at how much a small increase in document complexity can blow out budgets.

Back when I was using gpt-4.1-mini from OpenAI — one set of documents alone cost me over $200 (!!).

Even with gpt-4.1-nano (the cheapest frontier model right now), my budget was ridiculous. 215 million tokens for a few (admittedly large) documents is absurd, and the fact that it was taking several days to process made things worse.

When I first started deploying local Large Language Models (LLMs) on my trusty NAS, Ollama stood out immediately as an option. It was simple, quick to set up with Docker, and offered support for huge context windows, which is perfect for GraphRAG's demanding use cases. The ability to handle prompts up to 128K tokens was particularly attractive with the newest Gemma 3 open-source models from Google.

However, as I began pushing Ollama to its limits, issues quickly surfaced. Its context window calculations were oddly inconsistent, often settling on random sizes like 35,567 tokens instead of the configured 128K. This regularly caused the model to stall or freeze under heavier workloads.

Ollama logs showing 27-minute hangs

Things got worse when the traffic increased. Ollama frequently locked up, dropping contexts and needing constant manual restarts, and it quickly became clear that it wouldn't hold up in production. I tried a custom adapter to directly manage Ollama's context windows and set safe defaults. While this patch helped a bit, it felt more like a band-aid than a genuine fix. The final straw was realising that a locked-up Ollama would just sit there for its full timeout, and that timeout was set at 48 hours, which is obviously ridiculous!
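For anyone in the same spot, the adapter boiled down to something like the sketch below: pin num_ctx explicitly on every request and wrap the call in a client-side timeout so a hang fails fast. This is a simplified sketch rather than my actual adapter code; the endpoint, model tag and timeout values are illustrative, but num_ctx and keep_alive are standard Ollama request options.

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def generate(prompt: str, num_ctx: int = 32768, timeout_s: int = 300) -> str:
    payload = {
        "model": "gemma3:4b",                 # illustrative model tag
        "prompt": prompt,
        "stream": False,
        "keep_alive": "10m",                  # unload the model after 10 minutes idle
        "options": {"num_ctx": num_ctx},      # pin the context window instead of trusting Ollama's guess
    }
    # Client-side timeout so a hung request fails fast instead of waiting out Ollama's own timeout
    resp = requests.post(OLLAMA_URL, json=payload, timeout=timeout_s)
    resp.raise_for_status()
    return resp.json()["response"]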

Eventually, reliable performance and throughput became essential, not optional. vLLM came onto my radar as a more robust solution, promising efficient batching, better memory handling, and consistent performance at scale.

The differences between vLLM and Ollama

There are a million other articles on this so I won’t regurgitate those — I particularly like this one:

VLLM vs. Ollama: Choosing the Right Lightweight LLM Framework for Your AI Applications (Why the Right LLM Framework Matters, blog.stackademic.com)

In a nutshell, vLLM offers a more production-ready approach with higher throughput, at the expense of some convenience (no multi-model serving, no GGUF, quantisation limitations).

It basically means that you deploy 1 container that loads 1 model into the GPU for inference.
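Because that one container exposes an OpenAI-compatible API, anything that can talk to OpenAI can talk to vLLM by swapping the base URL. A minimal sketch (the port and model name match the Docker Compose file in the next section; the prompt is just an example):

from openai import OpenAI

# Point the standard OpenAI client at the vLLM container (host port 28888 is mapped in the Compose file)
client = OpenAI(base_url="http://localhost:28888/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="unsloth/gemma-3-4b-it",
    messages=[{"role": "user", "content": "Summarise GraphRAG in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)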

Getting vLLM Up and Running with Docker Compose

Switching over to vLLM involved replicating my existing Ollama setup using Docker Compose. Using the official vllm/vllm-openai:latest Docker image, I set up my Gemma-3 model quickly. I chose a smaller-than-normal model because the 4B one is smart enough for my requirements and gives reasonable performance on my 2 x RTX A2000 12GB GPUs.
The setup involved mounting my Hugging Face cache, injecting my Hugging Face token securely through Docker secrets, and configuring the essential GPU and context parameters.

To give you the TL;DR (for those who have come here looking for a simple way to do the same), here's what the final Docker Compose file looked like:

services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    ipc: host                      # share the host's /dev/shm (see the NCCL notes below)
    ports:
      - "28888:8000"               # vLLM's OpenAI-compatible API, exposed on host port 28888
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      NVIDIA_VISIBLE_DEVICES: "0,1"
      HUGGING_FACE_HUB_TOKEN: "hf_***"   # supply your own token (injected via Docker secrets in my setup)
      PYTORCH_CUDA_ALLOC_CONF: "max_split_size_mb:64,expandable_segments:True"
    command: >
      --model unsloth/gemma-3-4b-it
      --max-model-len 56000
      --max-num-seqs 2
      --tensor-parallel-size 2
      --gpu-memory-utilization 0.88
      --swap-space 16
      --enable-chunked-prefill
      --trust-remote-code
      --quantization bitsandbytes
      --enforce-eager
    networks:
      - ragflow                    # must match the network key defined at the bottom
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

networks:
  ragflow:
    external: true
    name: docker_rag

I hate posting stuff like this because it always looks so forehead-slappingly simple, but it took many hours to get to this point (why NCCL_P2P_DISABLE? Why do I have to use that specific model? etc.).

Failures and Adjustments

While the setup appeared easy enough, real-world deployment quickly brought some challenges to the surface. The first hurdle was shared-memory limits on multi-GPU setups. Specifically, running vLLM across two RTX A2000 GPUs led to NCCL errors about insufficient shared memory (/dev/shm). Docker's default 64MB limit wasn't nearly enough: each GPU required around 33MB, causing crashes at startup. I've not really had to mess with IPC settings in Docker before, but after some back and forth with ChatGPT, I added ipc: host to let the container access the host's larger shared memory, solving the NCCL initialisation issue.
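If you want to confirm what a container actually has before NCCL falls over, a quick check like the sketch below (run inside the container) reports the size of /dev/shm: with Docker's defaults it comes back at roughly 64MB, and with ipc: host it matches the host.

import os

# Report the size of /dev/shm inside the container; Docker defaults to 64MB
# unless ipc: host (or --shm-size) is used.
st = os.statvfs("/dev/shm")
total_mb = st.f_blocks * st.f_frsize / (1024 * 1024)
free_mb = st.f_bavail * st.f_frsize / (1024 * 1024)
print(f"/dev/shm: {total_mb:.0f} MB total, {free_mb:.0f} MB free")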

Another problem emerged with GPU memory allocation. Default settings quickly triggered out-of-memory (OOM) errors during model loading. The solution involved tweaking vLLM’s memory utilisation parameter down from the default 0.98 to a safer range around 0.85. This adjustment provided a buffer against transient memory spikes which allowed it to start up successfully.
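A useful sanity check before settling on a value is to look at how much VRAM is genuinely free on each card. Something like this rough sketch using nvidia-ml-py (not part of my pipeline, just a diagnostic) makes it obvious when the display or another process is already eating into a GPU:

import pynvml  # pip install nvidia-ml-py

# Print used/total VRAM per GPU to help choose a safe --gpu-memory-utilization
pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(name, bytes):
        name = name.decode()
    print(f"GPU{i} {name}: {mem.used / 1e9:.1f} GB used of {mem.total / 1e9:.1f} GB")
pynvml.nvmlShutdown()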

Memory sharing across 2 GPUs vs load balancing the GPUs with NGINX

Originally I was planning on copying how Ollama did things (i.e. start it and forget it), but vLLM is a different beast. As I had two GPUs that were supposedly identical, it made sense to try and load balance them by using NGINX to round-robin the requests and increase parallelism. Unfortunately, this wasn't as easy as it seemed.

Running two separate containers, one per GPU, brought unexpected GPU memory contention, especially with GPU0 also handling display tasks. Initially, GPU0 failed to load properly because the VRAM occupied by the system's graphics display threw off the balance of memory available between the two GPUs. Adjusting gpu-memory-utilization slightly lower for GPU0, while allowing GPU1 to fully utilise its available memory, resolved the issue and allowed both GPUs to function correctly.

Sounds good? When testing with a heavy GraphRAG pipeline, it became clear that while NGINX-based load balancing was easy to get going, it fell short on stable performance, especially under heavier, parallel inference loads.

In the end, adopting memory sharing across both GPUs via vLLM's tensor parallelism, rather than NGINX-based load balancing, made more sense: not merely for the sake of raw performance, but also for long-term reliability and operational simplicity.

The final form

Wrapping up, the journey to our final vLLM setup came down to some practical and clear-cut decisions. Choosing the unsloth/gemma-3-4b-it model gave us the flexibility we needed. Rather than using a pre-quantised model, we opted for dynamic quantisation with --quantization bitsandbytes. This approach let us better manage GPU memory without locking us into any rigid limitations.

We set PYTORCH_CUDA_ALLOC_CONF to max_split_size_mb:64,expandable_segments:True specifically to tackle CUDA memory fragmentation. This change allowed PyTorch to handle GPU memory more effectively, cutting down on annoying runtime memory errors and making sure things ran smoothly over long sessions.

Using --enforce-eager turned out to be pretty helpful, too. It made debugging easier and GPU behaviour more predictable. Sure, it added a bit of overhead, but the simpler troubleshooting was well worth it.

The decision to cap the model's maximum length at 56,000 tokens was carefully considered. Even though our KV cache still had room to spare, this length felt like the right balance: long enough for our needs without slowing down performance, crashing when a particularly large prompt gets pushed through, or adding unnecessary complexity. It also left us some room to increase the limit in the future if we needed to, but right now I needed stability so I could leave some tasks running overnight!
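On the GraphRAG side, the belt-and-braces version of this is to count prompt tokens before a request goes out. A minimal sketch, assuming the tokenizer for unsloth/gemma-3-4b-it can be pulled from Hugging Face (the headroom value is illustrative):

from transformers import AutoTokenizer

MAX_MODEL_LEN = 56_000        # matches --max-model-len in the Compose file
OUTPUT_HEADROOM = 2_000       # illustrative allowance for the completion

# Same tokenizer as the served model, so counts line up with what vLLM sees
tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-3-4b-it")

def fits_in_context(prompt: str) -> bool:
    return len(tokenizer.encode(prompt)) + OUTPUT_HEADROOM <= MAX_MODEL_LEN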

Results

So, did it work?

Knowledge Graph from GraphRAG

YES! I have now been running GraphRAG over some pretty heavy sets of documents for basically zero cost (Electricity not included). The knowledge graph builds cleanly and the retrieval is excellent!

And compared to using OpenAI, it's actually faster despite a token rate of only around 16 t/s:

Token throughput
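If you want to reproduce that number for your own setup, a rough sketch like this times a single non-streaming request and divides the completion tokens vLLM reports by the elapsed time (prompt and max_tokens are arbitrary):

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:28888/v1", api_key="not-needed")

start = time.perf_counter()
response = client.chat.completions.create(
    model="unsloth/gemma-3-4b-it",
    messages=[{"role": "user", "content": "Write a 300-word overview of graph-based RAG."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start

# vLLM returns OpenAI-style usage stats, so tokens/second falls straight out
generated = response.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.1f}s = {generated / elapsed:.1f} t/s")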

About the author

Dan is the founder of MindLattice, dedicated to modernising customer data landscapes to enable machine learning and AI.

He is a start-up veteran with over 20 years of experience delivering solutions for some of the largest companies in Australia and the UK.


Published via Towards AI


