
Stop Your AI From Lying. Build RAG.

Last Updated on May 29, 2026 by Editorial Team

Author(s): Priyanka Mali

Originally published on Towards AI.

Day 9 — No vector database. No API. Just Ollama and a JSON file.

A few weeks ago, a colleague stopped me mid-conversation.

“You keep learning all this AI stuff,” she said. “But tell me — if I ask your chatbot what your favourite colour is, will it know?”

I went quiet.

I knew how to build a chatbot. I had built one. But her question exposed a gap I hadn’t fully thought through. The chatbot I built could talk. It could remember our conversation. But it only knew what the AI model already knew — which is everything on the internet up to a certain date, and absolutely nothing about me.

That question sent me digging. And what I found was RAG.

First — Why Do LLMs Fail at Personal Questions?

LLMs know everything on the internet — and nothing about you. | Photo: Unsplash

Large language models are trained on massive amounts of public text. Books, websites, Wikipedia, research papers. They are incredibly good at answering general questions.

Ask an LLM why the sky is blue — perfect answer.

Ask it what you ate for breakfast — it will either say “I don’t know” or worse, confidently make something up. That confident wrong answer is what we call hallucination. And it happens because the model has never seen your personal data. It is just guessing based on patterns.

This is the core problem:

  • LLMs are frozen in time — trained up to a cutoff date
  • They know nothing about your private documents
  • They cannot access your company data
  • They make up answers when they don’t know — and sound confident doing it

This is exactly what my colleague was pointing at. And for anyone building real AI products — this is the biggest wall you hit.

RAG is the solution.

What is RAG?

RAG stands for Retrieval Augmented Generation.

I know — the name sounds intimidating. But once you break it down, it is one of the most elegant ideas in AI.

Let me start with just the G — Generation.

This is what LLMs already do. You ask a question, the model generates an answer. That part you already understand.

Now add R and A — Retrieval Augmented.

Before the model generates an answer, you first retrieve the most relevant information from your own data. Then you augment the model’s input with that information. The model now generates an answer based on what you gave it — not just what it was trained on.

Think of it like this. Imagine you are a new employee on your first day. A customer calls and asks about the refund policy. You don’t know it yet — you just started. So you quickly search the company handbook, find the relevant section, read it, and answer the customer based on what you just read.

That is RAG. You are the LLM. The handbook is your document. The search is retrieval. The answer you give is generation — augmented by what you retrieved.

How RAG Actually Works — Step by Step

The RAG pipeline — from document to answer. | Image: AI Generated

Here is the full flow — broken into simple steps.

Step 1 — Ingest your document

You paste your document into the system. This could be a PDF, a text file, a company policy — anything with text.

Step 2 — Chunk it

The document gets split into smaller pieces called chunks. Not the whole document at once — just paragraphs or sections. Why? Because you don’t want to send the entire document to the LLM every time. You only want to send the relevant part.

Step 3 — Convert chunks to vectors (embeddings)

Each chunk gets converted into a list of numbers called a vector or embedding. These numbers capture the meaning of the text — not just the words.

A key thing to know: in our app, each chunk becomes exactly 768 numbers regardless of how long or short the text is. That fixed size is what makes comparison possible.

Step 4 — Store in a vector database

These vectors get saved — in our case to a simple JSON file. This is your vector database. Each entry stores the original chunk text and its corresponding vector.
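As a rough illustration of what such a JSON "database" could hold, here is a minimal sketch. The field names (`chunks`, `text`, `vector`), the sample sentences, and the three-number vectors are all made up for the example; the real db.json in the repo may be laid out differently, and real vectors would have 768 numbers.

```javascript
// Sketch of a JSON "vector database": an array of { text, vector } entries.
// Field names, sample text and the tiny 3-number vectors are assumptions.
function addEntry(db, text, vector) {
  return { chunks: [...db.chunks, { text, vector }] };
}

let vecDb = { chunks: [] };
vecDb = addEntry(vecDb, 'Sample chunk about refunds.', [0.12, -0.03, 0.88]);
vecDb = addEntry(vecDb, 'Sample chunk about shipping.', [0.4, 0.27, -0.15]);

// This string is what would be written to data/db.json, for example with
// fs.writeFileSync('data/db.json', serialized).
const serialized = JSON.stringify(vecDb, null, 2);

// Loading it back yields the same entries.
const restored = JSON.parse(serialized);
console.log(restored.chunks.length); // 2
```

Keeping the original text next to its vector matters: the vector is only used for searching, while the text is what eventually gets sent to the LLM.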

Step 5 — User asks a question

When someone asks a question, the same embedding process runs on the question. The question becomes a vector too.

Step 6 — Find the closest match (cosine similarity)

The system compares the question vector against all the stored chunk vectors and finds the chunks whose meaning is closest to the question. The measure it uses is cosine similarity, a mathematical way of measuring how similar two vectors are.
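For readers who like to see the formula, cosine similarity between a question vector A and a chunk vector B is usually written as below. Here n is the number of dimensions (768 in this app); the symbols A, B and n are just notation chosen for this sketch.

```latex
\[
\operatorname{similarity}(A, B) = \cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
\]
```

The numerator is the dot product of the two vectors and the denominator is the product of their lengths, so only the direction of the vectors matters, not their size.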

Step 7 — Send to the LLM

The most relevant chunks get sent to the LLM along with the question. The LLM reads those chunks and generates an answer based on them — not from its general training.

Step 8 — You get your answer

Grounded. Specific. From your data.
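To make the flow concrete, here is a toy version of steps 1 to 6 that runs without Ollama. It is a sketch, not the app's code: the "embedding" is just word counts over the document's vocabulary (a stand-in for a real embedding model), chunks are single sentences, and the sample handbook text is invented.

```javascript
// Toy retrieval pipeline. Word counts stand in for real embeddings.
const tokenize = (text) => text.toLowerCase().match(/[a-z0-9]+/g) || [];

function makeEmbedder(vocab) {
  return (text) => {
    const vec = new Array(vocab.length).fill(0);
    for (const word of tokenize(text)) {
      const idx = vocab.indexOf(word);
      if (idx !== -1) vec[idx] += 1;
    }
    return vec; // always vocab.length numbers, whatever the text length
  };
}

function cosine(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  const denom = Math.sqrt(magA) * Math.sqrt(magB);
  return denom === 0 ? 0 : dot / denom;
}

// Steps 1-4: ingest, chunk (one sentence per chunk), embed, store.
function ingest(documentText) {
  const chunks = documentText.split('.').map((s) => s.trim()).filter(Boolean);
  const vocab = [...new Set(tokenize(documentText))];
  const embedFn = makeEmbedder(vocab);
  const store = chunks.map((text) => ({ text, vector: embedFn(text) }));
  return { store, embedFn };
}

// Steps 5-6: embed the question and return the closest chunk.
function retrieve(question, { store, embedFn }) {
  const qVec = embedFn(question);
  return store
    .map((entry) => ({ ...entry, score: cosine(qVec, entry.vector) }))
    .sort((a, b) => b.score - a.score)[0];
}

const index = ingest(
  'Refunds are accepted within 30 days. Shipping takes five days. Support is open on weekdays.'
);
const best = retrieve('How long does shipping take?', index);
console.log(best.text); // Shipping takes five days
```

In a real pipeline, step 7 would pass `best.text` (or the top few chunks) to the LLM together with the question. Word counts only match exact words, which is exactly the limitation that real embeddings are meant to overcome.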

The App I Built

Asked my RAG app ‘What is my favourite colour?’ — it found the answer from my document. | Screenshot: Author’s own

After understanding RAG conceptually I sat down and built one.

No paid APIs. No external vector database. No cloud. Runs completely on my machine using Ollama — same tool I used in Day 7.

Here is what it does:

  • You paste a document via POST /ingest
  • It chunks the document, converts to vectors, saves to db.json
  • You ask a question via POST /ask
  • It finds the most relevant chunks using cosine similarity
  • Sends them to Llama 3.2 with a custom prompt
  • Returns an answer strictly from your document

The project structure:

rag-app/
├── server.js ← all API routes
├── src/
│   ├── chunker.js ← splits doc into chunks
│   ├── embedder.js ← converts text to vectors via Ollama
│   ├── vectorDB.js ← stores vectors, does cosine similarity
│   ├── retriever.js ← combines chunker + embedder + vectorDB
│   └── llm.js ← prompt engineering + gets answer from llama3.2
└── data/
    └── db.json ← your vector database

Five files. One JSON file as the database. That is a complete RAG system.

The Code — What Each File Does

chunker.js — Split the document


Takes your raw text and splits it into smaller chunks. Each chunk is a paragraph or section — small enough to be meaningful, large enough to have context.

Here is the most interesting line in the whole file:

i += chunkSize - overlap

This one line is why the chunker works well. Each chunk overlaps with the previous one by 40 words. Why? Because if a sentence starts at the end of chunk 1 and finishes at the start of chunk 2 — without overlap, that sentence gets cut in half and loses its meaning. The overlap makes sure context never gets lost at chunk boundaries.

Most tutorials skip this detail. It is one of those small decisions that makes a real difference in answer quality.
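Here is a minimal sketch of how a word-based chunker built around that line might look. The function name `chunkText` and the default `chunkSize` of 200 words are assumptions for illustration; only the 40-word overlap and the `i += chunkSize - overlap` step come from the text above, and the real chunker.js may differ.

```javascript
// Sketch of a word-based chunker with overlap.
// chunkSize = 200 is an assumed default; overlap = 40 is the value from the article.
function chunkText(text, chunkSize = 200, overlap = 40) {
  if (overlap >= chunkSize) {
    throw new Error('overlap must be smaller than chunkSize');
  }
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < words.length; i += chunkSize - overlap) {
    chunks.push(words.slice(i, i + chunkSize).join(' '));
    // Stop once this chunk reaches the end of the text; otherwise the loop
    // would emit a tail that is already fully contained in this chunk.
    if (i + chunkSize >= words.length) break;
  }
  return chunks;
}

// Tiny demo: 10 words, chunks of 4 words, 2 words of overlap.
const demoChunks = chunkText('a b c d e f g h i j', 4, 2);
console.log(demoChunks);
// [ 'a b c d', 'c d e f', 'e f g h', 'g h i j' ]
```

Each chunk starts `chunkSize - overlap` words after the previous one, so the last `overlap` words of one chunk are repeated at the start of the next. The guard on `overlap >= chunkSize` is there because a step of zero or less would never advance.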

embedder.js — Convert text to vectors

Calls Ollama’s embedding model to convert any text into a list of 768 numbers. Here is the actual call:

body: JSON.stringify({
  model: 'nomic-embed-text',
  prompt: text
})

One thing worth noticing — this uses nomic-embed-text, not llama3.2. Two completely different models working together. The embedding model converts text to numbers. The LLM generates the final answer. Neither does the other's job.

Most people don’t realise RAG needs two models. This is why.
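For context, the request body shown above could sit inside a function like the following. This is a sketch under assumptions: it targets Ollama's `/api/embeddings` endpoint on the default port 11434, it relies on the global `fetch` in Node 18+, and the function name `embed` and the injectable `fetchImpl` parameter are choices made for this example, not necessarily what embedder.js does.

```javascript
// Sketch of an embedder that calls a local Ollama server.
// Assumptions: default Ollama port 11434 and the /api/embeddings endpoint.
const OLLAMA_EMBED_URL = 'http://localhost:11434/api/embeddings';

// fetchImpl is injectable so the function can be tried without a running server.
async function embed(text, fetchImpl = globalThis.fetch) {
  const res = await fetchImpl(OLLAMA_EMBED_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'nomic-embed-text', prompt: text }),
  });
  if (!res.ok) {
    throw new Error(`Embedding request failed with status ${res.status}`);
  }
  const data = await res.json();
  return data.embedding; // array of numbers, one per dimension
}

// Usage, with Ollama running locally:
// embed('What is my favourite colour?').then((vec) => console.log(vec.length));
```

The same function is used for both document chunks and questions, which is what makes their vectors comparable.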

vectorDB.js — The cosine similarity

This is the mathematical heart of RAG. When you ask a question, the system compares your question vector against every stored chunk vector. The function that does this comparison is cosine similarity:

function cosineSimilarity(vecA, vecB) {
  let dot = 0, magA = 0, magB = 0
  for (let i = 0; i < vecA.length; i++) {
    dot += vecA[i] * vecB[i]   // running dot product
    magA += vecA[i] * vecA[i]  // squared length of vecA
    magB += vecB[i] * vecB[i]  // squared length of vecB
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB))
}

It returns a number between -1 and 1. The closer to 1, the more similar the meaning. The top 3 scoring chunks get sent to the LLM.

You don’t need to understand the maths to use it. But it helps to know this is what “finding the closest match” actually means under the hood.
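To show how the scores turn into a ranking, here is a small sketch of a top-k search. The `{ text, vector }` entry shape, the `topK` name and the two-number toy vectors are assumptions for illustration; the cosine function is repeated so the snippet runs on its own.

```javascript
// Same cosine similarity idea as above, repeated so this sketch is self-contained.
function cosineSim(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Score every stored entry against the question vector and keep the best k.
function topK(questionVec, entries, k = 3) {
  return entries
    .map((e) => ({ text: e.text, score: cosineSim(questionVec, e.vector) }))
    .sort((x, y) => y.score - x.score) // highest score first
    .slice(0, k);
}

const toyEntries = [
  { text: 'same direction', vector: [2, 4] },       // score close to 1
  { text: 'perpendicular', vector: [-2, 1] },       // score 0
  { text: 'opposite direction', vector: [-1, -2] }, // score close to -1
];
const ranked = topK([1, 2], toyEntries, 2);
console.log(ranked.map((r) => r.text)); // [ 'same direction', 'perpendicular' ]
```

Scanning every entry like this is fine for a JSON file with a few hundred chunks. Dedicated vector databases exist mainly to avoid this full scan when there are millions of vectors.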

llm.js — The prompt engineering that prevents hallucination

This is my favourite part of the whole app. The prompt we send to the LLM:

const prompt = `You are a helpful assistant that answers 
questions strictly from the provided document context.
Rules:
1. Answer ONLY from the context provided below
2. If the answer is not in the context, say
"I could not find that in the document"
3. Be concise and clear
4. Mention which excerpt your answer came from`

Rule 2 is everything.

Instead of letting the model guess when it doesn’t know, we explicitly tell it to say “I could not find that in the document.” That single instruction does most of the work of stopping hallucination: the model is told not to go outside the provided context, and it is given a safe answer to fall back on.

This is prompt engineering doing real, practical work. Not fancy — just precise.
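The rules are only half of the prompt; the retrieved chunks and the question still have to be attached. Here is one way that could be done, as a sketch. The "Excerpt N" labels, the layout of the final string, the function names and the use of Ollama's `/api/generate` endpoint on the default port 11434 are all assumptions, and llm.js in the repo may do it differently.

```javascript
// Rules text taken from the article's prompt.
const RULES = `You are a helpful assistant that answers
questions strictly from the provided document context.
Rules:
1. Answer ONLY from the context provided below
2. If the answer is not in the context, say
"I could not find that in the document"
3. Be concise and clear
4. Mention which excerpt your answer came from`;

// Number each chunk so the model can cite "Excerpt 1", "Excerpt 2", ...
function buildPrompt(question, chunks) {
  const context = chunks
    .map((chunk, i) => `Excerpt ${i + 1}:\n${chunk}`)
    .join('\n\n');
  return `${RULES}\n\nContext:\n${context}\n\nQuestion: ${question}\nAnswer:`;
}

// Send the prompt to a local Ollama server (default port assumed).
// stream: false asks for one JSON reply instead of a token stream.
async function askLLM(question, chunks, fetchImpl = globalThis.fetch) {
  const res = await fetchImpl('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llama3.2',
      prompt: buildPrompt(question, chunks),
      stream: false,
    }),
  });
  if (!res.ok) {
    throw new Error(`Generate request failed with status ${res.status}`);
  }
  const data = await res.json();
  return data.response; // the generated answer text
}
```

Numbering the excerpts is what makes rule 4 workable: the model has a label to point at when it says where the answer came from.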

How to Run It

Step 1 — Install Ollama

Download from ollama.com

Step 2 — Pull both models

ollama pull nomic-embed-text
ollama pull llama3.2

Step 3 — Clone and install

git clone https://github.com/PriyankaMali-13/AI
cd AI/rag-app
npm install

Step 4 — Start the server

node server.js

Step 5 — Ingest your document

POST /ingest
{ "text": "your document text here..." }

Step 6 — Ask your question

POST /ask
{ "question": "What is deep learning?" }
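If you would rather call the two endpoints from a script than from an API client, a sketch like this could work in Node 18+. The base URL is an assumption (check server.js for the port the app actually listens on), and the helper names are made up for the example. The article does not describe the shape of the /ask response, so the sketch simply returns whatever JSON comes back.

```javascript
// BASE_URL is an assumption: check server.js for the real port.
const BASE_URL = 'http://localhost:3000';

// Build the URL and fetch options for a JSON POST request.
function jsonPost(path, payload) {
  return {
    url: `${BASE_URL}${path}`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
    },
  };
}

// Ingest first, then ask, in that order.
async function ingestThenAsk(text, question, fetchImpl = globalThis.fetch) {
  const ingestReq = jsonPost('/ingest', { text });
  const ingestRes = await fetchImpl(ingestReq.url, ingestReq.options);
  if (!ingestRes.ok) {
    throw new Error(`Ingest failed with status ${ingestRes.status}`);
  }

  const askReq = jsonPost('/ask', { question });
  const askRes = await fetchImpl(askReq.url, askReq.options);
  return askRes.json();
}

// Usage, with the server running:
// ingestThenAsk('your document text here...', 'What is deep learning?')
//   .then((result) => console.log(result));
```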

Important — always call /ingest before /ask. Otherwise:

{ "error": "No document ingested yet. Please call /ingest first." }
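A guard like the one behind that error can be very small. The sketch below is an assumption about how it might look: the `{ chunks: [] }` database shape, the `handleAsk` name and the 400 status code are invented for illustration, and only the error message is taken from the response above.

```javascript
// Sketch of an "ingest before ask" guard.
// The db shape and the 400 status are assumptions; the message is from the article.
function handleAsk(db, question) {
  const isEmpty = !db || !Array.isArray(db.chunks) || db.chunks.length === 0;
  if (isEmpty) {
    return {
      status: 400,
      body: { error: 'No document ingested yet. Please call /ingest first.' },
    };
  }
  // Retrieval and generation would happen here.
  return { status: 200, body: { question, chunksSearched: db.chunks.length } };
}

console.log(handleAsk({ chunks: [] }, 'What is deep learning?').body.error);
// No document ingested yet. Please call /ingest first.
```

Failing early with a clear message is friendlier than letting the similarity search run over an empty list and return nothing.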

What Makes This Different From a Regular Chatbot

Going back to my colleague’s question — “will it know my favourite colour?”

With a regular chatbot — no. The model has never seen that information.

With RAG — yes. If you give it a document that contains your preferences, it will find the relevant section and answer from it. The model isn’t guessing. It is reading.

The other big difference: no hallucination.

When the LLM is told to answer strictly from the provided context, it says “I could not find that in the document” instead of making something up. That is a fundamental shift in reliability.

This is why so many enterprise and production AI systems use RAG. They need answers from verified sources, meaning their own data, not from whatever the model learned from the internet.

One Thing I Found Surprising

I expected building RAG to be complicated. It’s not.

The hard part is understanding the concepts — chunking, embeddings, cosine similarity. Once you understand those, the code is straightforward. Five files. A JSON file. Two Ollama models.

The most powerful AI technique in production today can be built by a beginner in an afternoon.

That genuinely surprised me.

What I Learned Today

  • RAG = Retrieval Augmented Generation — give the LLM your data at query time
  • LLMs hallucinate because they don’t know your data — RAG fixes this
  • The pipeline: ingest → chunk → embed → store → retrieve → generate
  • Each chunk becomes a fixed 768-dimensional vector regardless of text length
  • Cosine similarity finds the closest matching chunk to your question
  • You don’t need a paid vector database — a JSON file works for learning
  • Two Ollama models work together: nomic-embed-text for embeddings, llama3.2 for answers
  • Many enterprise AI systems use RAG to get grounded, reliable answers

The Code

Full project on GitHub: github.com/PriyankaMali-13/AI/tree/master/rag-app

Clone it, paste a document, ask questions. See RAG working with your own eyes.

I’m Priyanka — backend engineer, chatbot builder, and someone learning AI from first principles. Writing everything down as I go. Follow along if that sounds useful.

#365DaysOfAI #RAG #RetrievalAugmentedGeneration #AI #Ollama #NodeJS #LLM #LearningInPublic


Published via Towards AI



Note: Article content contains the views of the contributing authors and not Towards AI.