Build RAG With Llamaindex To Make an LLM Answer About Yourself, Like in an Interview or About General Information
Last Updated on June 3, 2024 by Editorial Team
Author(s): Lakshmi Narayana Santha
Originally published on Towards AI.
From the day ChatGPT was introduced, the whole NLP/AI ecosystem changed, and numerous new techniques emerged to integrate LLMs into various fields and use cases. One of those gems that evolved along with LLMs is RAG (Retrieval-Augmented Generation). Even with LLMs like Gemini supporting context windows of millions of tokens, RAG is still relevant and is used to build applications like chatting with documents, assisting research in specific domains, feeding domain-specific data to an LLM at inference time, and, most importantly, letting companies integrate AI capabilities with their sensitive customer data.
In this blog, we will see how to build one such use case with RAG: making an LLM answer questions about yourself. The input data could be your resume or even general information about yourself. I have used some general information like my interests in movies and TV, and a brief technical summary of my professional career.
Check out my GitHub repo for a full-stack chatbot application that I have built with Docker, Next.js, and Python (FastAPI, Llamaindex). Refer to the sub-repo doppalf-ai for the Python application.
GitHub – santhalakshminarayana/doppalf: Doppalf is RAG powered AI chat bot application
After building the full RAG pipeline with Llamaindex, we can see a response from the LLM like the following:
I have used Cohere as the LLM and Qdrant for storing vector embeddings. You can create free API keys for both of these.
Create a Cohere API trial key and a Qdrant Cloud API key, which offers a free 1 GB cluster for storing vectors.
You can use any other LLM and vector store (or even in-memory storage).
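If you go with Qdrant Cloud, a quick way to verify the credentials before wiring them into Llamaindex is a minimal connectivity check like the following. This is optional and not part of the pipeline; the placeholder URL and key are the values you created above.

from qdrant_client import QdrantClient

client = QdrantClient(url="<qdrant-cloud-url>", api_key="<qdrant-api-key>")
# A fresh cluster simply returns an empty list of collections
print(client.get_collections())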
With Llamaindex, we can build a full chat engine with the following steps (a minimal end-to-end sketch follows this list):
- Load documents from the directory
- Parse the text into sentences (as nodes) with a window size of 1 (configurable)
- Get vector embeddings for each node (sentence) with Cohere embeddings
- Index the nodes and store the vector embeddings (Qdrant Cloud)
- Persist the index for reuse across future runs
- Build a chat engine from the index with a "Small-to-Big" retrieval strategy and some buffered chat memory history
- Provide the retrieved context and use Cohere Rerank for re-ranking the retrieved nodes
- Synthesize the response using the LLM (Cohere) with the retrieved context
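Before diving into the full application, here is a minimal, self-contained sketch of the same flow using in-memory storage and Llamaindex defaults (no Qdrant, no persistence). The directory name and question are just placeholders; the Cohere models match the ones we configure later.

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.llms.cohere import Cohere

# Configure Cohere as both the LLM and the embedding model
Settings.llm = Cohere(api_key="<cohere-api-key>", model="command-r-plus")
Settings.embed_model = CohereEmbedding(
    cohere_api_key="<cohere-api-key>",
    model_name="embed-english-v3.0",
    input_type="search_document",
)

# Load the documents, index them in memory, and chat over them
documents = SimpleDirectoryReader(input_dir="documents").load_data()
index = VectorStoreIndex.from_documents(documents)
chat_engine = index.as_chat_engine(chat_mode="condense_plus_context")
print(chat_engine.chat("What are your core skills?"))

The full version below adds the sentence-window parsing, Qdrant storage, persistence, memory, and re-ranking on top of this skeleton.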
The following is the whole RAG pipeline we will build with Llamaindex
Install the following Python packages first
python-dotenv
fastapi
uvicorn
llama-index
llama-index-embeddings-cohere
llama-index-llms-cohere
llama-index-postprocessor-cohere-rerank
llama-index-vector-stores-qdrant
cohere
qdrant-client
The above dependencies install FastAPI and the core Llamaindex packages. As I am using Cohere and Qdrant with Llamaindex, the list also includes the corresponding Llamaindex integration packages.
Now, getting into the real action. First, we will read the required configuration (API keys, documents location, etc.) from a .env file and load it into the runtime using the python-dotenv package:
DOCS_DIR="documents"
INDEX_STORAGE_DIR="pstorage"
COLLECTION_NAME="ps_rag"
MAX_BUFFER_MEMORY_TOKENS=4096
COHERE_API_KEY=<cohere-api-key>
QDRANT_API_KEY=<qdrant-api-key>
QDRANT_CLOUD_URL=<qdrant-cloud-url>
Load the above .env file into the program runtime as follows:
from typing import Any, Self
import os
from threading import Lock

from dotenv import load_dotenv

load_dotenv()

env_keys = {
    "DOCS_DIR": "DOCS_DIR",
    "INDEX_STORAGE_DIR": "INDEX_STORAGE_DIR",
    "COLLECTION_NAME": "COLLECTION_NAME",
    "MAX_BUFFER_MEMORY_TOKENS": "MAX_BUFFER_MEMORY_TOKENS",
    "COHERE_API_KEY": "COHERE_API_KEY",
    "QDRANT_API_KEY": "QDRANT_API_KEY",
    "QDRANT_CLOUD_URL": "QDRANT_CLOUD_URL",
}


def check_all_dict_keys_not_none(o: dict) -> bool:
    for v in o.values():
        if v is None:
            return False
    return True


class ENV():
    _env_instance = None
    _env_config = {}
    _lock = Lock()

    def __new__(cls) -> Self:
        # Thread-safe singleton: load the environment variables only once
        if cls._env_instance is None:
            with cls._lock:
                if cls._env_instance is None:
                    cls._env_instance = super(ENV, cls).__new__(cls)
                    cls._env_instance._load_env()
        return cls._env_instance

    def _load_env(self):
        config = {}
        for v in env_keys.values():
            config[v] = os.getenv(v)
        if not check_all_dict_keys_not_none(config):
            raise ValueError("env has some values missing")
        self._env_config = config

    def get(self, key: str) -> Any:
        return self._env_config.get(key)
We have loaded the environment variables into a singleton class, and anywhere in the application we can get the loaded object that has the environment variables stored.
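For example, assuming the class above lives in src/config/env.py (the module path the RAG code imports later), it can be used anywhere in the application like this:

from src.config.env import ENV, env_keys

envk = ENV()  # repeated calls return the same instance
docs_dir = envk.get(env_keys.get("DOCS_DIR"))
cohere_api_key = envk.get(env_keys.get("COHERE_API_KEY"))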
I have used the following brief information about my professional career:
Santha Lakshmi Narayana holds the role of Senior Software Engineer at Nouveau Labs in Bengaluru, India. His expertise lies in AI, Machine Learning, and Backend technologies, with a deep understanding of Advanced Image Processing, Computer Vision, NLP, and System Design & Architecture. Throughout his career, he has contributed to various projects, including Contact-center solutions (both call and chat), AutoML, Image enhancement, Search information extraction, and Name matching & mapping.
With a strong belief in prioritizing performance-optimized code quality over quantity, Lakshmi Narayana is dedicated to delivering robust software solutions that remain resilient even with new additions or modifications.
His core proficiencies encompass Python, Go, OpenCV, Keras, Pytorch, Tensorflow, Redis, and MySQL. Additionally, he has experience in JavaScript, TypeScript, React, React Native, Next.js, Flutter, and Dart.
For effective service management, he relies on tools such as Git, Nginx, Docker, and Kubernetes. He actively shares his insights, project developments, comprehensive research, and other tech-related content through his blog hosted at https://santhalakshminarayana.github.io.
He also maintains an active presence on GitHub (https://github.com/santhalakshminarayana), where he oversees repositories for side projects like AutoML and Image enhancement.
Along with this, I have also provided some personal interests in movies and TV. All these documents are stored inside the documents directory, and Llamaindex loads them from there.
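As a quick sanity check (not part of the pipeline itself), you can inspect what Llamaindex turns these files into; each file becomes one or more Document objects with an id, metadata, and text:

from llama_index.core import SimpleDirectoryReader

docs = SimpleDirectoryReader(input_dir="documents").load_data()
for doc in docs:
    # file_name comes from the default metadata SimpleDirectoryReader attaches
    print(doc.doc_id, doc.metadata.get("file_name"), doc.text[:80])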
And finally, the whole code for the RAG pipeline is
import os
import json

from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.llms.cohere import Cohere
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

from src.config.env import ENV, env_keys
from src.config.logger import get_logger
from .constants import CHAT_PROMPT

envk = ENV()
logger = get_logger()

index = None
chat_engine = None


def load_rag() -> None:
    global index
    global chat_engine

    cdir = os.getcwd()
    docs_dir = envk.get(env_keys.get("DOCS_DIR"))
    docs_path = os.path.join(cdir, docs_dir)

    # check if any documents are provided for indexing
    if not os.path.exists(docs_path):
        raise FileNotFoundError(f"Documents dir at path: {docs_path} does not exist.")
    if not os.listdir(docs_path):
        raise FileNotFoundError(f"Provide documents inside directory: {docs_path} for indexing.")

    storage_dir = envk.get(env_keys.get("INDEX_STORAGE_DIR"))
    storage_path = os.path.join(cdir, storage_dir)

    cohere_api_key = envk.get(env_keys.get("COHERE_API_KEY"))
    qdrant_api_key = envk.get(env_keys.get("QDRANT_API_KEY"))

    # Configure Cohere as the global LLM and embedding model
    Settings.llm = Cohere(
        api_key=cohere_api_key,
        model="command-r-plus",
    )
    Settings.embed_model = CohereEmbedding(
        cohere_api_key=cohere_api_key,
        model_name="embed-english-v3.0",
        input_type="search_document",
    )

    qd_client = QdrantClient(
        envk.get(env_keys.get("QDRANT_CLOUD_URL")),
        api_key=qdrant_api_key,
    )

    # Parse each sentence into a node, keeping one neighbouring sentence as its window
    sentence_node_parser = SentenceWindowNodeParser.from_defaults(
        window_size=1,
        window_metadata_key="window",
        original_text_metadata_key="original_text",
    )

    vector_store = QdrantVectorStore(
        client=qd_client,
        collection_name=envk.get(env_keys.get("COLLECTION_NAME")),
    )

    # index was previously persisted
    if os.path.exists(storage_path) and os.listdir(storage_path):
        logger.debug("Using existing index.")
        storage_context = StorageContext.from_defaults(
            vector_store=vector_store, persist_dir=storage_path
        )
        index = load_index_from_storage(storage_context)
    else:
        logger.debug("Creating new index for documents.")
        reader = SimpleDirectoryReader(input_dir=docs_path)
        all_docs = []
        for docs in reader.iter_data():
            all_docs.extend(docs)

        for doc in all_docs:
            logger.debug(f"id: {doc.doc_id}\nmetadata: {doc.metadata}")

        nodes = sentence_node_parser.get_nodes_from_documents(all_docs)

        storage_context = StorageContext.from_defaults(vector_store=vector_store)
        index = VectorStoreIndex(nodes, storage_context=storage_context)
        index.storage_context.persist(persist_dir=storage_path)

    chat_engine = index.as_chat_engine(
        chat_mode="condense_plus_context",
        memory=ChatMemoryBuffer.from_defaults(
            token_limit=int(envk.get(env_keys.get("MAX_BUFFER_MEMORY_TOKENS")))
        ),
        context_prompt=CHAT_PROMPT,
        similarity_top_k=3,
        node_postprocessors=[
            MetadataReplacementPostProcessor(target_metadata_key="window"),
            CohereRerank(api_key=cohere_api_key, top_n=3),
        ],
        verbose=False,
    )


def chat(query: str):
    global chat_engine

    # Stream the answer token by token as Server-Sent Events
    response = chat_engine.stream_chat(query)
    for res in response.response_gen:
        yield f"data: {json.dumps({'message': res})}\n\n"
The load_rag() function will first check whether a previously persisted index exists so it can be reused; otherwise, it will build one. If no index is stored in the given storage_path directory, it builds the index by loading the documents from the docs_dir directory. SentenceWindowNodeParser is used for parsing the sentences into nodes with a window size of 1. This window size makes the post-retrieval context include the surrounding sentences as well, for better answer synthesis.
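To see what this parsing produces, here is a tiny illustrative sketch (the sample text is made up):

from llama_index.core import Document
from llama_index.core.node_parser import SentenceWindowNodeParser

parser = SentenceWindowNodeParser.from_defaults(
    window_size=1,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
doc = Document(text="I work at Nouveau Labs. I build RAG applications. I also like movies.")
nodes = parser.get_nodes_from_documents([doc])

# Each node's text is a single sentence, while its "window" metadata holds that
# sentence plus one neighbour on each side; MetadataReplacementPostProcessor
# swaps the window in at retrieval time.
print(nodes[1].text)                 # the middle sentence
print(nodes[1].metadata["window"])   # the middle sentence with its neighbours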
For the created nodes, vector embeddings are fetched from the Cohere embedding model and stored in Qdrant Cloud. Finally, the index is persisted so this whole process can be skipped when the application restarts.
From the index, build the chat engine with some memory (4,096 tokens here) for memorising past conversations with the user, and provide the Cohere Reranker as a post-retrieval node processor for re-ranking the nodes based on query relevancy. This chat engine retrieves the surrounding context for each retrieved node for more information, and finally sends the whole retrieved context and the query to the LLM as a prompt for answer generation. I have used a custom prompt to make the LLM answer as me.
CHAT_PROMPT = (
    "You are impersonating the human 'Lakshmi Narayana', and that is your name. "
    "So you are Lakshmi Narayana and answer in the first person. When asked any question about you, you will answer as if Lakshmi Narayana is answering. "
    "You will answer politely and take the help of the following context for more relevant answers. "
    "If you don't have sufficient information from the context, use your own knowledge to answer. "
    "Don't hallucinate if you are sure you cannot answer. "
    "Here are the relevant documents for the context:\n{context_str}\n"
    "Instruction: Use the previous chat history, or the context above, to interact with and help the user, and answer as if you are Lakshmi Narayana. "
    "Don't add any additional data if the answer can be derived from the context. "
    "Generate the response in markdown format."
)
Llamaindex uses this prompt for injecting the retrieved context and sends it to the LLM for answer generation.
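Before wiring the engine to an API, you can sanity-check it directly. A minimal sketch, assuming the pipeline code above sits in a module importable as src.rag (the exact path depends on your project layout) and the question is just an example:

from src import rag  # hypothetical module path for the pipeline code above

rag.load_rag()  # builds or loads the index and creates the chat engine
response = rag.chat_engine.chat("What are your core skills?")
print(response.response)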
Finally, the chat generation API is exposed for streaming the response using FastAPI as follows:
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel
from starlette.responses import StreamingResponse

from .rag import chat


class GenerateModel(BaseModel):
    message: str


grouter = APIRouter(tags=["generate"])


@grouter.post("")
async def generate(data: GenerateModel):
    try:
        # Stream the generated answer back to the client as Server-Sent Events
        return StreamingResponse(
            chat(data.message),
            media_type='text/event-stream',
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
The above generate API takes the user query as part of the request body with the key message and calls the chat function to generate the answer.
This will generate the response and stream it to the client as SSE (Server-Sent Events).
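On the client side, the stream can be consumed as it arrives. Here is a minimal sketch using httpx (an assumed HTTP client; any other works), assuming the router is mounted at /generate and the app is served on localhost:8000:

import json
import httpx

payload = {"message": "Tell me about yourself"}
with httpx.stream("POST", "http://localhost:8000/generate", json=payload, timeout=None) as r:
    for line in r.iter_lines():
        # each SSE event looks like: data: {"message": "<token>"}
        if line.startswith("data: "):
            chunk = json.loads(line[len("data: "):])
            print(chunk["message"], end="", flush=True)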
With all of the above done, if the API is requested with a user query like the following, the LLM will answer about me as:
With the help of Llamaindex and a small RAG pipeline, we could build an AI chatbot that answers questions about ourselves. I hope this small article provides guidance on how to build simple RAG-powered chatbot applications for real-world scenarios.
I have written a comprehensive post about this whole full-stack project, Doppalf, on my personal blog. Read it for more details:
Doppalf: RAG powered full-stack AI chatbot like ChatGPT- Santha Lakshmi Narayana
You can get the whole project code for the end-to-end chatbot application with UI (streaming answers like ChatGPT) and backend in my GitHub repo:
GitHub – santhalakshminarayana/doppalf: Doppalf is RAG powered AI chat bot application
Published via Towards AI