Build RAG With Llamaindex To Make an LLM Answer About Yourself, Like in an Interview or About General Information
Last Updated on June 3, 2024 by Editorial Team
Author(s): Lakshmi Narayana Santha
Originally published on Towards AI.
From the day ChatGPT was introduced, the whole NLP/AI ecosystem changed, and numerous new techniques emerged to integrate LLMs into various fields and use cases. One of those gems that evolved along with LLMs is RAG (Retrieval-Augmented Generation). Even with LLMs like Gemini supporting context windows of millions of tokens, RAG is still relevant and is used to build applications like chatting with documents, assisting research in specific domains, feeding domain-specific data to an LLM at inference time, and, most importantly, letting companies integrate AI capabilities with their sensitive customer data.
In this blog, we will see how to build one such use case with RAG: making an LLM answer questions about yourself. The input data could be your resume or even general information about yourself. I have used some general information like my interests in movies and TV, and a brief technical summary of my professional career.
Check out my GitHub repo for a full-stack chatbot application that I have built with Docker, Next.js, and Python (FastAPI, Llamaindex). Refer to the sub-repo doppalf-ai for the Python application.
GitHub – santhalakshminarayana/doppalf: Doppalf is RAG powered AI chat bot application
After building the full RAG pipeline with Llamaindex, we can see a response from the LLM like the following:
I have used Cohere as the LLM and Qdrant for storing vector embeddings. You can create free API keys for both of these.
Create a Cohere API trial key and a Qdrant Cloud API key, which offers a free 1 GB cluster for storing vectors.
You can use any other LLM and vector store (or even in-memory storage).
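If you go with Qdrant Cloud, a quick way to verify the credentials before wiring them into Llamaindex is a minimal connectivity check like the following. This is optional and not part of the pipeline; the placeholder URL and key are the values you created above.

from qdrant_client import QdrantClient

client = QdrantClient(url="<qdrant-cloud-url>", api_key="<qdrant-api-key>")
# A fresh cluster simply returns an empty list of collections
print(client.get_collections())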
With Llamaindex, we can build a full chat engine with the following steps (a minimal end-to-end sketch follows this list):
- Load documents from the directory
- Parse the text into sentences (as nodes) with a window size of 1 (configurable)
- Get vector embeddings for each node (sentence) with Cohere embeddings
- Index the nodes and store the vector embeddings (Qdrant Cloud)
- Persist the index for reuse across future runs
- Build a chat engine from the index with a "Small-to-Big" retrieval strategy and some buffered chat memory history
- Provide the retrieved context and use Cohere Rerank for re-ranking the retrieved nodes
- Synthesize the response using the LLM (Cohere) with the retrieved context
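Before diving into the full application, here is a minimal, self-contained sketch of the same flow using in-memory storage and Llamaindex defaults (no Qdrant, no persistence). The directory name and question are just placeholders; the Cohere models match the ones we configure later.

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.llms.cohere import Cohere

# Configure Cohere as both the LLM and the embedding model
Settings.llm = Cohere(api_key="<cohere-api-key>", model="command-r-plus")
Settings.embed_model = CohereEmbedding(
    cohere_api_key="<cohere-api-key>",
    model_name="embed-english-v3.0",
    input_type="search_document",
)

# Load the documents, index them in memory, and chat over them
documents = SimpleDirectoryReader(input_dir="documents").load_data()
index = VectorStoreIndex.from_documents(documents)
chat_engine = index.as_chat_engine(chat_mode="condense_plus_context")
print(chat_engine.chat("What are your core skills?"))

The full version below adds the sentence-window parsing, Qdrant storage, persistence, memory, and re-ranking on top of this skeleton.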
The following is the whole RAG pipeline we will build with Llamaindex
Install the following Python packages first
python-dotenv
fastapi
uvicorn
llama-index
llama-index-embeddings-cohere
llama-index-llms-cohere
llama-index-postprocessor-cohere-rerank
llama-index-vector-stores-qdrant
cohere
qdrant-client
The above dependencies install FastAPI and the core Llamaindex packages. As I am using Cohere and Qdrant with Llamaindex, the list also includes the corresponding Llamaindex integration packages.
Now, getting into the real action. First, we will read the required configuration (API keys, documents location, etc.) from a .env file and load it into the runtime using the python-dotenv package:
DOCS_DIR="documents"
INDEX_STORAGE_DIR="pstorage"
COLLECTION_NAME="ps_rag"
MAX_BUFFER_MEMORY_TOKENS=4096
COHERE_API_KEY=<cohere-api-key>
QDRANT_API_KEY=<qdrant-api-key>
QDRANT_CLOUD_URL=<qdrant-cloud-url>
Load the above .env file into the program runtime as follows:
from typing import Any, Self
import os
from threading import Lock

from dotenv import load_dotenv

load_dotenv()

env_keys = {
    "DOCS_DIR": "DOCS_DIR",
    "INDEX_STORAGE_DIR": "INDEX_STORAGE_DIR",
    "COLLECTION_NAME": "COLLECTION_NAME",
    "MAX_BUFFER_MEMORY_TOKENS": "MAX_BUFFER_MEMORY_TOKENS",
    "COHERE_API_KEY": "COHERE_API_KEY",
    "QDRANT_API_KEY": "QDRANT_API_KEY",
    "QDRANT_CLOUD_URL": "QDRANT_CLOUD_URL",
}


def check_all_dict_keys_not_none(o: dict) -> bool:
    for v in o.values():
        if v is None:
            return False
    return True


class ENV():
    _env_instance = None
    _env_config = {}
    _lock = Lock()

    def __new__(cls) -> Self:
        # Thread-safe singleton: load the environment variables only once
        if cls._env_instance is None:
            with cls._lock:
                if cls._env_instance is None:
                    cls._env_instance = super(ENV, cls).__new__(cls)
                    cls._env_instance._load_env()
        return cls._env_instance

    def _load_env(self):
        config = {}
        for v in env_keys.values():
            config[v] = os.getenv(v)
        if not check_all_dict_keys_not_none(config):
            raise ValueError("env has some values missing")
        self._env_config = config

    def get(self, key: str) -> Any:
        return self._env_config.get(key)
We have loaded the environment variables into a singleton class, and anywhere in the application we can get the loaded object that has the environment variables stored.
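For example, assuming the class above lives in src/config/env.py (the module path the RAG code imports later), it can be used anywhere in the application like this:

from src.config.env import ENV, env_keys

envk = ENV()  # repeated calls return the same instance
docs_dir = envk.get(env_keys.get("DOCS_DIR"))
cohere_api_key = envk.get(env_keys.get("COHERE_API_KEY"))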
I have used the following brief information about my professional career:
Santha Lakshmi Narayana holds the role of Senior Software Engineer at Nouveau Labs in Bengaluru, India. His expertise lies in AI, Machine Learning, and Backend technologies, with a deep understanding of Advanced Image Processing, Computer Vision, NLP, and System Design & Architecture. Throughout his career, he has contributed to various projects, including Contact-center solutions (both call and chat), AutoML, Image enhancement, Search information extraction, and Name matching & mapping.
With a strong belief in prioritizing performance-optimized code quality over quantity, Lakshmi Narayana is dedicated to delivering robust software solutions that remain resilient even with new additions or modifications.
His core proficiencies encompass Python, Go, OpenCV, Keras, Pytorch, Tensorflow, Redis, and MySQL. Additionally, he has experience in JavaScript, TypeScript, React, React Native, Next.js, Flutter, and Dart.
For effective service management, he relies on tools such as Git, Nginx, Docker, and Kubernetes. He actively shares his insights, project developments, comprehensive research, and other tech-related content through his blog hosted at https://santhalakshminarayana.github.io.
He also maintains an active presence on GitHub (https://github.com/santhalakshminarayana), where he oversees repositories for side projects like AutoML and Image enhancement.
Along with this, I have also provided some personal interests in movies and TV. All these documents are stored inside the documents directory, and Llamaindex loads them from there.
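As a quick sanity check (not part of the pipeline itself), you can inspect what Llamaindex turns these files into; each file becomes one or more Document objects with an id, metadata, and text:

from llama_index.core import SimpleDirectoryReader

docs = SimpleDirectoryReader(input_dir="documents").load_data()
for doc in docs:
    # file_name comes from the default metadata SimpleDirectoryReader attaches
    print(doc.doc_id, doc.metadata.get("file_name"), doc.text[:80])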
And finally, the whole code for the RAG pipeline is
import os
import json

from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.llms.cohere import Cohere
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

from src.config.env import ENV, env_keys
from src.config.logger import get_logger
from .constants import CHAT_PROMPT

envk = ENV()
logger = get_logger()

index = None
chat_engine = None


def load_rag() -> None:
    global index
    global chat_engine

    cdir = os.getcwd()
    docs_dir = envk.get(env_keys.get("DOCS_DIR"))
    docs_path = os.path.join(cdir, docs_dir)

    # check if any documents are provided for indexing
    if not os.path.exists(docs_path):
        raise FileNotFoundError(f"Documents dir at path: {docs_path} does not exist.")
    if not os.listdir(docs_path):
        raise FileNotFoundError(f"Provide documents inside directory: {docs_path} for indexing.")

    storage_dir = envk.get(env_keys.get("INDEX_STORAGE_DIR"))
    storage_path = os.path.join(cdir, storage_dir)

    cohere_api_key = envk.get(env_keys.get("COHERE_API_KEY"))
    qdrant_api_key = envk.get(env_keys.get("QDRANT_API_KEY"))

    # Configure Cohere as the global LLM and embedding model
    Settings.llm = Cohere(
        api_key=cohere_api_key,
        model="command-r-plus",
    )
    Settings.embed_model = CohereEmbedding(
        cohere_api_key=cohere_api_key,
        model_name="embed-english-v3.0",
        input_type="search_document",
    )

    qd_client = QdrantClient(
        envk.get(env_keys.get("QDRANT_CLOUD_URL")),
        api_key=qdrant_api_key,
    )

    # Parse each sentence into a node, keeping one neighbouring sentence as its window
    sentence_node_parser = SentenceWindowNodeParser.from_defaults(
        window_size=1,
        window_metadata_key="window",
        original_text_metadata_key="original_text",
    )

    vector_store = QdrantVectorStore(
        client=qd_client,
        collection_name=envk.get(env_keys.get("COLLECTION_NAME")),
    )

    # index was previously persisted
    if os.path.exists(storage_path) and os.listdir(storage_path):
        logger.debug("Using existing index.")
        storage_context = StorageContext.from_defaults(
            vector_store=vector_store, persist_dir=storage_path
        )
        index = load_index_from_storage(storage_context)
    else:
        logger.debug("Creating new index for documents.")
        reader = SimpleDirectoryReader(input_dir=docs_path)
        all_docs = []
        for docs in reader.iter_data():
            all_docs.extend(docs)

        for doc in all_docs:
            logger.debug(f"id: {doc.doc_id}\nmetadata: {doc.metadata}")

        nodes = sentence_node_parser.get_nodes_from_documents(all_docs)

        storage_context = StorageContext.from_defaults(vector_store=vector_store)
        index = VectorStoreIndex(nodes, storage_context=storage_context)
        index.storage_context.persist(persist_dir=storage_path)

    chat_engine = index.as_chat_engine(
        chat_mode="condense_plus_context",
        memory=ChatMemoryBuffer.from_defaults(
            token_limit=int(envk.get(env_keys.get("MAX_BUFFER_MEMORY_TOKENS")))
        ),
        context_prompt=CHAT_PROMPT,
        similarity_top_k=3,
        node_postprocessors=[
            MetadataReplacementPostProcessor(target_metadata_key="window"),
            CohereRerank(api_key=cohere_api_key, top_n=3),
        ],
        verbose=False,
    )


def chat(query: str):
    global chat_engine

    # Stream the answer token by token as Server-Sent Events
    response = chat_engine.stream_chat(query)
    for res in response.response_gen:
        yield f"data: {json.dumps({'message': res})}\n\n"
The load_rag() function will first check whether a previously persisted index exists so it can be reused; otherwise, it will build one. If no index is stored in the given storage_path directory, it builds the index by loading the documents from the docs_dir directory. SentenceWindowNodeParser is used for parsing the sentences into nodes with a window size of 1. This window size makes the post-retrieval context include the surrounding sentences as well, for better answer synthesis.
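To see what this parsing produces, here is a tiny illustrative sketch (the sample text is made up):

from llama_index.core import Document
from llama_index.core.node_parser import SentenceWindowNodeParser

parser = SentenceWindowNodeParser.from_defaults(
    window_size=1,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
doc = Document(text="I work at Nouveau Labs. I build RAG applications. I also like movies.")
nodes = parser.get_nodes_from_documents([doc])

# Each node's text is a single sentence, while its "window" metadata holds that
# sentence plus one neighbour on each side; MetadataReplacementPostProcessor
# swaps the window in at retrieval time.
print(nodes[1].text)                 # the middle sentence
print(nodes[1].metadata["window"])   # the middle sentence with its neighbours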
For the created nodes, vector embeddings are fetched from the Cohere embedding model and stored in Qdrant Cloud. Finally, the index is persisted so this whole process can be skipped when the application restarts.
From the index, build the chat engine with some memory (4,096 tokens here) for memorising past conversations with the user, and provide the Cohere Reranker as a post-retrieval node processor for re-ranking the nodes based on query relevancy. This chat engine retrieves the surrounding context for each retrieved node for more information, and finally sends the whole retrieved context and the query to the LLM as a prompt for answer generation. I have used a custom prompt to make the LLM answer as me.
CHAT_PROMPT = (
    "You are impersonating the human 'Lakshmi Narayana', and that is your name. "
    "So you are Lakshmi Narayana and answer in the first person. When asked any question about you, you will answer as if Lakshmi Narayana is answering. "
    "You will answer politely and take the help of the following context for more relevant answers. "
    "If you don't have sufficient information from the context, use your own knowledge to answer. "
    "Don't hallucinate if you are sure you cannot answer. "
    "Here are the relevant documents for the context:\n{context_str}\n"
    "Instruction: Use the previous chat history, or the context above, to interact with and help the user, and answer as if you are Lakshmi Narayana. "
    "Don't add any additional data if the answer can be derived from the context. "
    "Generate the response in markdown format."
)
Llamaindex uses this prompt for injecting the retrieved context and sends it to the LLM for answer generation.
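Before wiring the engine to an API, you can sanity-check it directly. A minimal sketch, assuming the pipeline code above sits in a module importable as src.rag (the exact path depends on your project layout) and the question is just an example:

from src import rag  # hypothetical module path for the pipeline code above

rag.load_rag()  # builds or loads the index and creates the chat engine
response = rag.chat_engine.chat("What are your core skills?")
print(response.response)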
Finally, the chat generation API is exposed for streaming the response using FastAPI as follows:
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel
from starlette.responses import StreamingResponse

from .rag import chat


class GenerateModel(BaseModel):
    message: str


grouter = APIRouter(tags=["generate"])


@grouter.post("")
async def generate(data: GenerateModel):
    try:
        # Stream the generated answer back to the client as Server-Sent Events
        return StreamingResponse(
            chat(data.message),
            media_type='text/event-stream',
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
The above generate API takes the user query as part of the request body with the key message and calls the chat function to generate the answer.
This will generate the response and stream it to the client as SSE (Server-Sent Events).
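On the client side, the stream can be consumed as it arrives. Here is a minimal sketch using httpx (an assumed HTTP client; any other works), assuming the router is mounted at /generate and the app is served on localhost:8000:

import json
import httpx

payload = {"message": "Tell me about yourself"}
with httpx.stream("POST", "http://localhost:8000/generate", json=payload, timeout=None) as r:
    for line in r.iter_lines():
        # each SSE event looks like: data: {"message": "<token>"}
        if line.startswith("data: "):
            chunk = json.loads(line[len("data: "):])
            print(chunk["message"], end="", flush=True)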
With all of the above done, if the API is requested with a user query like the following, the LLM will answer about me as:
With the help of Llamaindex and a small RAG pipeline, we could build an AI chatbot that answers questions about ourselves. I hope this small article provides guidance on how to build simple RAG-powered chatbot applications for real-world scenarios.
I have written a comprehensive post about this whole full-stack project, Doppalf, on my personal blog. Read it for more details:
Doppalf: RAG powered full-stack AI chatbot like ChatGPT- Santha Lakshmi Narayana
You can get the whole project code for the end-to-end chatbot application with UI (streaming answers like ChatGPT) and backend in my GitHub repo:
GitHub – santhalakshminarayana/doppalf: Doppalf is RAG powered AI chat bot application
Published via Towards AI