
RAG in Practice: Exploring Versioning, Observability, and Evaluation in Production Systems

Author(s): Adil Said

Originally published on Towards AI.

I’ve seen a few posts on LinkedIn recently declaring RAG systems are dead.

The core argument? “Context windows are getting bigger, so who needs retrieval anymore?”

It got me thinking. RAG only really entered the mainstream about a year or two ago. So why the premature funeral?

Sure, context windows have expanded, but does that eliminate the need for retrieval?

I don’t think so. Here’s why:

  • Cost & latency scale linearly with context size. Dumping your entire knowledge base into a prompt isn’t efficient.
  • Relevance filtering still matters. RAG gives you control over what context is passed.
  • Traceability and source attribution are key in regulated or high-stakes domains.
  • Most enterprise knowledge bases exceed even long context windows.

So no, I don’t think RAG is dead. But it is maturing alongside LLM-powered applications.

Which is why I wanted to do this project: to explore what LLMOps really means in practice, essentially MLOps adapted for large language models, using a RAG system as the test case.

Here are the questions I wanted to answer while doing this project:

  • How do we track and evaluate quality without labelled data?

Most traditional machine learning (ML) systems rely on ground-truth labels. RAG systems might not have that luxury.

  • What does observability mean in a retrieval-based or LLM-powered system?

It’s not just about latency and uptime: how do we get visibility into retrieval quality, context relevance, and user satisfaction?

  • How do you design a stack that’s modular, monitorable, and scalable?

Microservices? Monolith? Where do these services live, and how do they talk to each other?

  • What does model versioning mean if you’re not training the model yourself?

You might be using hosted or open-weight models, so how do you track upgrades, API changes, or fine-tuning?

  • What does data versioning look like when your “data” is a knowledge base, not training sets?

In RAG systems, your documents are your data. So, how do you track changes across documents and embeddings? How do you track prompt templates?

Before diving in, it’s worth saying that this is a rapidly evolving space.

What’s considered best practice today might shift tomorrow.

What does hold up is the set of core principles: observability, evaluation and testing, reproducibility, modularity, and versioning.

System Architecture Overview

To explore these questions hands-on, I built a RAG system.

It’s fully containerised and orchestrated with Docker Compose, so you can spin it up yourself locally. GitHub link at the bottom of the page.

Source: Image by the author — System Architecture Overview — Arrows showcase the flow of data and which services talk to which

I went for a microservice architecture, with each core part of the RAG pipeline handled by its own containerised service. At the centre is a FastAPI orchestrator, which exposes three endpoints — /metrics, /embed, and /query — and acts as the glue that coordinates ingestion, embedding, vector storage, retrieval, generation, and logging.

This approach gives a few benefits:

  • Easier to swap components (e.g. different embedding or generation models)
  • Clear separation of concerns
  • Scalable services that can evolve independently
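
To make the orchestrator’s role concrete, here is a minimal sketch of what those three endpoints could look like in FastAPI. The request model and the handler bodies are placeholders for illustration, not the actual code from the repo.

```python
# Minimal sketch of the orchestrator's API surface; handler bodies are placeholders.
from fastapi import FastAPI, Response
from pydantic import BaseModel
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

app = FastAPI()

class QueryRequest(BaseModel):
    question: str
    top_k: int = 5

@app.get("/metrics")
def metrics() -> Response:
    # Prometheus scrapes this endpoint on a regular interval.
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

@app.post("/embed")
def embed(doc: dict):
    # Version the raw document, chunk and embed it, then upsert the vectors.
    ...

@app.post("/query")
def query(request: QueryRequest):
    # Embed the question, retrieve relevant chunks, generate an answer, log the interaction.
    ...
```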

In the next sections, I explore how I approached the core principles I mentioned earlier — versioning, observability, evaluation, and model deployment — and what they look like in a real RAG system.

Data Versioning

In traditional ML workflows, data versioning usually refers to tracking training, validation, and test datasets. It’s mostly about ensuring reproducibility and understanding how model performance changes as the data evolves. It’s also just good practice — it gives you a safety net, so you can roll back if something breaks or goes sideways.

But in the context of RAG systems, you’re most likely not training your own model, so you don’t have training data in the traditional sense. So, what does data versioning mean in this context?

Instead, your “data” is the knowledge base, the raw documents, articles, PDFs, and internal content that you embed and retrieve from. That means versioning here is about ensuring that your embeddings and responses can be traced back to the specific version of a document they came from.

This becomes important because:

  • Documents change over time, and so do the embeddings.
  • Once embedded, content is detached from its source version.
  • Without tracking, it’s impossible to know which version of a document influenced a model’s response.

To handle this, I used lakeFS to version the document store. When a document is ingested, it’s committed to a lakeFS repo. The resulting commit hash is stored alongside the document’s embedding in Qdrant. That way, every embedding is tied to a specific snapshot of the document state at the time of ingestion.

Source: Image by the author — How this system is preserving data lineage between embeddings and documents

One way to improve traceability even further is by tracking how documents were split and tagged, and which embedding model was used.

Small changes in chunking logic or model versions can lead to very different results. You could do this by recording the Git commit/tag or Docker image hash that performed the ingestion, and storing the embedding model version as metadata. This would make it easier to trace back exactly how a particular answer was generated.
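
Putting the last few paragraphs together, here is a rough sketch of what attaching that lineage metadata to an embedding could look like with the Qdrant client. The collection name, payload fields, and the way the lakeFS commit hash is passed in are assumptions for illustration, not the exact schema used in the repo.

```python
# Rough sketch: tie each embedded chunk back to the snapshot of the document
# store it came from. Field names and the lakeFS commit plumbing are assumptions.
import uuid
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

def store_chunk_embedding(
    qdrant: QdrantClient,
    vector: list[float],
    doc_path: str,
    lakefs_commit: str,      # returned by the ingestion step that commits the doc to lakeFS
    embedding_model: str,    # e.g. "all-MiniLM-L6-v2"
    ingestion_git_sha: str,  # Git commit of the chunking/ingestion code
) -> None:
    qdrant.upsert(
        collection_name="documents",
        points=[
            PointStruct(
                id=str(uuid.uuid4()),
                vector=vector,
                payload={
                    "source_path": doc_path,
                    "lakefs_commit": lakefs_commit,
                    "embedding_model": embedding_model,
                    "ingestion_git_sha": ingestion_git_sha,
                },
            )
        ],
    )
```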

Observability

One of the things I wanted to understand while building this system was: what does observability even look like in a RAG pipeline?

In traditional ML, you’re usually monitoring things like model accuracy or drift over time. But with RAG, there’s more going on. You’re not just calling a model — you’re ingesting documents, embedding them, storing vectors, retrieving chunks, generating text, and logging results. I wanted some visibility into that whole process.

To get started, I added monitoring using Prometheus and Grafana. I exposed a /metrics endpoint in the FastAPI orchestrator, which Prometheus scrapes on a regular interval. Grafana is then used to visualise that data.

Source: Image by the author — Grafana Dashboard

Here’s what I decided to monitor:

  • Ingestion latency: how long it takes to process and embed a document.
  • Query latency: total time to return an answer to the user. I wanted to understand the end-to-end performance.
  • Context retrieval stats: how many chunks were retrieved and what their relevance scores were. This gave me some idea of how well the retrieval step was working.
  • Generation health: counts of failures, timeouts, or any exceptions during the model response. Basically, is the model behaving or not?
  • Basic usage metrics: counters for how many ingestion and query requests were made — useful just to get a pulse on what’s happening.
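
As a rough sketch, a few of the metrics above could be defined with prometheus_client along these lines. The metric names, label-free setup, and buckets are illustrative, not the ones used in the repo.

```python
# Illustrative metric definitions; names and buckets are assumptions.
from prometheus_client import Counter, Histogram

QUERY_LATENCY = Histogram(
    "rag_query_latency_seconds",
    "End-to-end time to answer a user query",
)
RETRIEVED_CHUNKS = Histogram(
    "rag_retrieved_chunks",
    "Number of chunks returned by the retriever per query",
    buckets=(1, 3, 5, 10, 20),
)
GENERATION_FAILURES = Counter(
    "rag_generation_failures_total",
    "Generation calls that raised an exception or timed out",
)
QUERY_REQUESTS = Counter(
    "rag_query_requests_total",
    "Total /query requests received",
)

# Inside the orchestrator's /query handler, roughly:
# QUERY_REQUESTS.inc()
# with QUERY_LATENCY.time():
#     answer, chunks = handle_query(question)
# RETRIEVED_CHUNKS.observe(len(chunks))
```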

There’s definitely more I could be tracking — things like user feedback loops, context drift over time, or retrieval quality across different document sets. But even this basic setup gave me a clearer picture of what’s happening under the hood, and it puts the structure in place to add more metrics in the future.

Evaluating and Testing RAG

One of the biggest challenges I ran into was figuring out how you actually evaluate a RAG system.

In traditional ML, you usually have a clear ground truth — a label, a category, a number — and you can compare your model’s output against that. With LLMs, especially in open-ended question-answering, it’s rarely that clear-cut. There isn’t always one “correct” answer.

So, how do you measure quality?

First, we need a way to track model outputs. I decided to log user queries, the retrieved documents, and the generated response in MongoDB. This gave me a structured dataset I could evaluate offline. One way to do this is with RAGAS, a tool that uses LLMs themselves to score responses across a few key dimensions:

  • Retrieval Precision — Were the retrieved documents actually relevant to the query?
  • Groundedness — Is the generated answer supported by the retrieved content?
  • Factual Consistency — Does the answer reflect what the documents actually say?
  • Completeness — Does the answer fully address the user’s question?

Each evaluation run gets logged in MLflow, so I can compare results across different model versions, prompt templates, or retrieval strategies.
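
As a hedged sketch of this loop, assuming interactions are logged to a MongoDB collection with query, retrieved_chunks, and response fields (illustrative names), and using the 0.1-style RAGAS evaluate() interface, which changes between versions:

```python
# Sketch of an offline evaluation run: read logged interactions, score them with
# RAGAS, and record the aggregate scores in MLflow. Collection and field names
# are assumptions; the RAGAS API differs across versions.
import mlflow
from datasets import Dataset
from pymongo import MongoClient
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

interactions = list(
    MongoClient("mongodb://localhost:27017")["rag"]["interactions"].find()
)

dataset = Dataset.from_dict({
    "question": [i["query"] for i in interactions],
    "contexts": [i["retrieved_chunks"] for i in interactions],
    "answer": [i["response"] for i in interactions],
})

with mlflow.start_run(run_name="ragas-eval"):
    mlflow.log_param("retrieval_top_k", 5)  # whatever configuration the run used
    scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
    # Aggregate score per metric, logged so runs can be compared in the MLflow UI.
    for metric_name, value in dict(scores).items():
        mlflow.log_metric(metric_name, value)
```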

Of course, using an LLM to evaluate another LLM has its own challenges — you’re introducing another layer of subjectivity. But in the absence of labelled data, it’s a helpful starting point.

I’ve also thought about incorporating lightweight human-user feedback. Something like a thumbs up/down on each answer, to collect signals over time. Even simple binary ratings can be valuable for identifying patterns or spotting failure cases.

There’s more to explore here. Things like long-term answer consistency, hallucination detection, or task-specific scoring would be great next steps. But this setup gave me enough structure to start reasoning about quality.

Model Deployment

For this project, I deployed two separate services using BentoML:

  • An embedding service using all-MiniLM-L6-v2
  • A generation service using EleutherAI/gpt-neo-1.3B

These models were chosen arbitrarily. The goal wasn’t performance tuning, just to get a working setup to explore system design and deployment patterns.

Source: Image by the author — Query flow

Keeping them separate gave me more control during experimentation, and BentoML made it easy to spin up standalone model servers with clear API contracts. But realistically, you probably wouldn’t self-host a generation model in production — not unless you have serious infrastructure and ops resources. Hosting your own LLM is expensive, operationally complex, and often less efficient than using managed APIs.
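
For reference, a minimal embedding service along these lines might look like the following, using BentoML’s newer class-based API (BentoML 1.2+); the actual services in the repo may be structured differently.

```python
# Hedged sketch of a standalone embedding service; service name and endpoint
# signature are illustrative.
import bentoml
from sentence_transformers import SentenceTransformer

@bentoml.service(resources={"cpu": "2"})
class EmbeddingService:
    def __init__(self) -> None:
        # Loaded once per worker, not per request.
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    @bentoml.api
    def embed(self, texts: list[str]) -> list[list[float]]:
        # Returns one vector per input text.
        return self.model.encode(texts).tolist()
```

Served with the bentoml CLI, this gives the orchestrator a plain HTTP contract to call, which is what makes swapping the model out later straightforward.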

So what if you’re using a hosted LLM, like from OpenAI, Anthropic, or OpenRouter?

In that case, you’re not training or fine-tuning anything, and the model can silently change behind the scenes. That makes model versioning tricky, but still important. Here’s what you can do:

  • Log the model name and version string (if available) with every request
  • Store the prompt and output pairs to compare behaviour over time
  • Periodically re-run fixed evaluation sets (same prompts, same documents) to catch shifts in answer quality or consistency

Even when using a managed API, I think versioning still plays an important role. It’s less about checkpoints or model weights and more about keeping track of how the model behaves at a given time, so you can spot any unexpected changes or regressions that might affect users later on.
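
As a sketch of the first two points, assuming a hosted model behind the OpenAI Python client and a MongoDB collection for the logs (both illustrative choices, not what the repo uses):

```python
# Sketch: record which hosted model actually served each request, plus the
# prompt/output pair, so behaviour can be compared over time.
from datetime import datetime, timezone

from openai import OpenAI
from pymongo import MongoClient

client = OpenAI()
log = MongoClient("mongodb://localhost:27017")["rag"]["generations"]

def generate_and_log(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content
    log.insert_one({
        "timestamp": datetime.now(timezone.utc),
        "requested_model": "gpt-4o-mini",
        "served_model": response.model,  # exact model string reported by the API
        "prompt": prompt,
        "output": answer,
    })
    return answer
```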

What I Would Change

Right now, the system follows a synchronous orchestration pattern, using FastAPI and HTTP calls between services. That works fine, but I’d like to explore an event-driven architecture in the future, introducing a message broker (like Kafka) so services can publish and consume events asynchronously. This would also help decouple the services from one another.
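
To make that concrete, here is a hypothetical sketch of what publishing an ingestion event could look like with kafka-python; the topic name and payload schema are made up for illustration.

```python
# Hypothetical sketch of event-driven ingestion: instead of calling the embedding
# service over HTTP, the orchestrator publishes an event that downstream services
# consume at their own pace. Topic name and payload schema are illustrative.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_document_ingested(doc_path: str, lakefs_commit: str) -> None:
    producer.send("documents.ingested", {
        "doc_path": doc_path,
        "lakefs_commit": lakefs_commit,
    })
    producer.flush()  # make sure the event is actually sent before returning
```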

As mentioned earlier, I’d also like to invest more in evaluation infrastructure. While RAGAS gave me a great starting point, combining it with live user feedback or lightweight human review would give a better picture of model performance in the real world. Having an interface where users can upvote/downvote answers or flag bad results could feed into evaluation pipelines and model iteration.

On the database side, MongoDB (via Atlas Vector Search) and Postgres (via pgvector) both support vector search, so the system could be simplified by reducing the number of services and sticking to more general-purpose infrastructure. I went with Qdrant here to explore a purpose-built vector DB — and it’s been great — but in a production setting, consolidating services could reduce operational complexity.

Wrapping Up

This project was my way of getting hands-on with the real-world challenges of deploying RAG systems — not just building something that works, but trying to understand what “production-ready” actually means in this space.

There’s still plenty I’d like to explore: event-driven orchestration, tighter evaluation loops, smarter feedback collection, and maybe even experimenting with different model serving setups. As mentioned earlier, evaluation especially feels like a space that deserves more thought — if you’ve tackled this before, I’d love to learn from your experience.

And if you’re reading this thinking I’ve missed something important, I probably have. This project was a learning exercise, and there’s always more to uncover. I’m always open to feedback or ideas, so don’t hesitate to connect or drop me a message.

GitHub – adilsaid64/open-rag-stack: A playground for building and serving Retrieval-Augmented…

A playground for building and serving Retrieval-Augmented Generation (RAG) systems using best practices in MLOps and…


Let's connect: LinkedIn.


Published via Towards AI

