
Build RAG With LlamaIndex to Make an LLM Answer About Yourself, Like in an Interview or About General Information

Last Updated on June 3, 2024 by Editorial Team

Author(s): Lakshmi Narayana Santha

Originally published on Towards AI.

Advanced RAG pipeline with Llamaindex for chatting with yourself

From the day ChatGPT was introduced, the whole NLP/AI ecosystem changed, and numerous new techniques emerged to integrate LLMs into various fields and use cases. One of the gems that evolved along with LLMs is RAG (Retrieval-Augmented Generation). Even with LLMs like Gemini supporting context lengths of up to millions of tokens, RAG is still relevant and is used to build applications such as chatting with documents, assisting research in specific domains, supplying domain-specific data to an LLM at inference time, and, most importantly, letting companies integrate AI capabilities with their sensitive customer data.

In this blog, we will see how to build one such use case with RAG: making an LLM answer questions about yourself. The input data could be your resume or even general information about yourself. I have used some general information, like my interests in movies and TV, and brief technical information about my professional career.

Check out my GitHub repo for a full-stack chatbot application that I have built with Docker, Next.js, and Python (FastAPI, LlamaIndex). Refer to the sub-repo doppalf-ai for the Python application.

GitHub – santhalakshminarayana/doppalf: Doppalf is RAG powered AI chat bot application

Doppalf is RAG powered AI chat bot application. Contribute to santhalakshminarayana/doppalf development by creating an…

github.com

After building the full RAG pipeline with LlamaIndex, we can see a response from the LLM like the following:

A simple response from LLM that assumes your character and answers about you

I have used Cohere as the LLM and Qdrant for storing vector embeddings. You can create free API keys for both of these.

Create a Cohere trial API key and a Qdrant Cloud API key, which offers a free 1 GB cluster for storing vectors.

You can use any other LLM and vector store (or even in-memory storage).
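
Before wiring these into LlamaIndex, a quick sanity check that both keys work can save debugging time later. This is just an illustrative sketch; it assumes the keys are already available as environment variables, matching the .env file used later:

import os

import cohere
from qdrant_client import QdrantClient

# Rough connectivity check (assumes COHERE_API_KEY, QDRANT_CLOUD_URL, and
# QDRANT_API_KEY are exported as environment variables).
co = cohere.Client(os.environ["COHERE_API_KEY"])
co.embed(texts=["hello"], model="embed-english-v3.0", input_type="search_document")
print("Cohere key works")

qd = QdrantClient(url=os.environ["QDRANT_CLOUD_URL"], api_key=os.environ["QDRANT_API_KEY"])
print(qd.get_collections())  # lists existing collections (possibly empty)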

With LlamaIndex, we can build a full chat engine with the following steps:

  1. Load documents from the directory
  2. Parse the text into sentences (as nodes) with a window size of 1 (configurable)
  3. Get vector embeddings for each node (sentence) using Cohere embeddings
  4. Index the nodes and store the vector embeddings (Qdrant Cloud)
  5. Persist the index for reuse in future runtimes
  6. Build a chat engine from the index with a "Small-to-Big" retrieval strategy and some buffered chat memory history
  7. Provide the retrieved context and use Cohere Rerank to re-rank the retrieved nodes
  8. Synthesize the response using the LLM (Cohere) with the retrieved context

The following is the whole RAG pipeline we will build with LlamaIndex:

The whole RAG pipeline described above

Install the following Python packages first

python-dotenv
fastapi
uvicorn
llama-index
llama-index-embeddings-cohere
llama-index-llms-cohere
llama-index-postprocessor-cohere-rerank
llama-index-vector-stores-qdrant
cohere
qdrant-client

The above dependencies install FastAPI and the core LlamaIndex packages. Since I am using Cohere and Qdrant with LlamaIndex, the list also contains the corresponding LlamaIndex integration packages.

Now, getting into the real action. First, we will read the required configuration (API keys, documents location, etc.) from a .env file and load it into the runtime using the python-dotenv package:

DOCS_DIR="documents"
INDEX_STORAGE_DIR="pstorage"
COLLECTION_NAME="ps_rag"

MAX_BUFFER_MEMORY_TOKENS=4096

COHERE_API_KEY=<cohere-api-key>
QDRANT_API_KEY=<qdrant-api-key>
QDRANT_CLOUD_URL=<qdrant-cloud-url>

Load the above .env file into the program runtime as

from typing import Any, Self
import os
from threading import Lock

from dotenv import load_dotenv


load_dotenv()

env_keys = {
    "DOCS_DIR": "DOCS_DIR",
    "INDEX_STORAGE_DIR": "INDEX_STORAGE_DIR",
    "COLLECTION_NAME": "COLLECTION_NAME",
    "MAX_BUFFER_MEMORY_TOKENS": "MAX_BUFFER_MEMORY_TOKENS",
    "COHERE_API_KEY": "COHERE_API_KEY",
    "QDRANT_API_KEY": "QDRANT_API_KEY",
    "QDRANT_CLOUD_URL": "QDRANT_CLOUD_URL",
}


def check_all_dict_keys_not_none(o: dict) -> bool:
    for v in o.values():
        if v is None:
            return False

    return True


class ENV:
    _env_instance = None
    _env_config = {}
    _lock = Lock()

    def __new__(cls) -> Self:
        # Thread-safe singleton: only the first instantiation loads the .env values
        if cls._env_instance is None:
            with cls._lock:
                if cls._env_instance is None:
                    cls._env_instance = super(ENV, cls).__new__(cls)
                    cls._env_instance._load_env()

        return cls._env_instance

    def _load_env(self):
        config = {}
        for v in env_keys.values():
            config[v] = os.getenv(v)

        if not check_all_dict_keys_not_none(config):
            raise ValueError("env has some values missing")

        self._env_config = config

    def get(self, key: str) -> Any:
        return self._env_config.get(key)

We have loaded the environment variables into a singleton class, so anywhere in the application we can get the same loaded object that holds the environment variables.
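
For example, here is a minimal sketch of how any module can grab the same configuration instance and read values through get():

from src.config.env import ENV, env_keys

envk = ENV()                  # every call returns the same singleton instance
docs_dir = envk.get(env_keys.get("DOCS_DIR"))
assert envk is ENV()          # the lock ensures only one instance is ever created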

I have used the following brief information about my professional career:

Santha Lakshmi Narayana holds the role of Senior Software Engineer at Nouveau Labs in Bengaluru, India. His expertise lies in AI, Machine Learning, and Backend technologies, with a deep understanding of Advanced Image Processing, Computer Vision, NLP, and System Design & Architecture. Throughout his career, he has contributed to various projects, including Contact-center solutions (both call and chat), AutoML, Image enhancement, Search information extraction, and Name matching & mapping.

With a strong belief in prioritizing performance-optimized code quality over quantity, Lakshmi Narayana is dedicated to delivering robust software solutions that remain resilient even with new additions or modifications.

His core proficiencies encompass Python, Go, OpenCV, Keras, Pytorch, Tensorflow, Redis, and MySQL. Additionally, he has experience in JavaScript, TypeScript, React, React Native, Next.js, Flutter, and Dart.

For effective service management, he relies on tools such as Git, Nginx, Docker, and Kubernetes. He actively shares his insights, project developments, comprehensive research, and other tech-related content through his blog hosted at https://santhalakshminarayana.github.io.

He also maintains an active presence on GitHub (https://github.com/santhalakshminarayana), where he oversees repositories for side projects like AutoML and Image enhancement.

Along with this, I have also provided some personal interests in movies and TV. All these documents are stored inside the documents directory, and LlamaIndex loads the documents from this directory, as sketched below.
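
As a minimal sketch of what that loading looks like (the full pipeline below does the same thing via reader.iter_data()):

from llama_index.core import SimpleDirectoryReader

# Load every file under the "documents" directory into LlamaIndex Document objects
reader = SimpleDirectoryReader(input_dir="documents")
docs = reader.load_data()
print(f"Loaded {len(docs)} documents")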

And finally, the whole code for the RAG pipeline is

import json
import os

from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.llms.cohere import Cohere
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

from src.config.env import ENV, env_keys
from src.config.logger import get_logger

from .constants import CHAT_PROMPT

envk = ENV()
logger = get_logger()

index = None
chat_engine = None


def load_rag() -> None:
    global index
    global chat_engine

    cdir = os.getcwd()
    docs_dir = envk.get(env_keys.get("DOCS_DIR"))
    docs_path = os.path.join(cdir, docs_dir)

    # check if any documents are provided for indexing
    if not os.path.exists(docs_path):
        raise FileNotFoundError(f"Documents dir at path: {docs_path} does not exist.")
    if not os.listdir(docs_path):
        raise FileNotFoundError(f"Provide documents inside directory: {docs_path} for indexing.")

    storage_dir = envk.get(env_keys.get("INDEX_STORAGE_DIR"))
    storage_path = os.path.join(cdir, storage_dir)

    cohere_api_key = envk.get(env_keys.get("COHERE_API_KEY"))
    qdrant_api_key = envk.get(env_keys.get("QDRANT_API_KEY"))

    # Global LLM and embedding model used by LlamaIndex
    Settings.llm = Cohere(
        api_key=cohere_api_key,
        model="command-r-plus",
    )
    Settings.embed_model = CohereEmbedding(
        cohere_api_key=cohere_api_key,
        model_name="embed-english-v3.0",
        input_type="search_document",
    )

    qd_client = QdrantClient(
        envk.get(env_keys.get("QDRANT_CLOUD_URL")),
        api_key=qdrant_api_key,
    )

    # Parse each sentence into a node, keeping surrounding sentences as "window" metadata
    sentence_node_parser = SentenceWindowNodeParser.from_defaults(
        window_size=1,
        window_metadata_key="window",
        original_text_metadata_key="original_text",
    )

    vector_store = QdrantVectorStore(
        client=qd_client,
        collection_name=envk.get(env_keys.get("COLLECTION_NAME")),
    )

    # index was previously persisted
    if os.path.exists(storage_path) and os.listdir(storage_path):
        logger.debug("Using existing index.")
        storage_context = StorageContext.from_defaults(
            vector_store=vector_store, persist_dir=storage_path
        )

        index = load_index_from_storage(storage_context)
    else:
        logger.debug("Creating new index for documents.")
        reader = SimpleDirectoryReader(input_dir=docs_path)

        all_docs = []
        for docs in reader.iter_data():
            all_docs.extend(docs)

        for doc in all_docs:
            logger.debug(f"id: {doc.doc_id}\nmetadata: {doc.metadata}")

        nodes = sentence_node_parser.get_nodes_from_documents(all_docs)

        storage_context = StorageContext.from_defaults(vector_store=vector_store)

        index = VectorStoreIndex(nodes, storage_context=storage_context)

        # persist the index so the next run can skip parsing and embedding
        index.storage_context.persist(persist_dir=storage_path)

    chat_engine = index.as_chat_engine(
        chat_mode="condense_plus_context",
        memory=ChatMemoryBuffer.from_defaults(
            token_limit=int(envk.get(env_keys.get("MAX_BUFFER_MEMORY_TOKENS")))
        ),
        context_prompt=CHAT_PROMPT,
        similarity_top_k=3,
        node_postprocessors=[
            # replace each node's text with its surrounding "window" before re-ranking
            MetadataReplacementPostProcessor(target_metadata_key="window"),
            CohereRerank(api_key=cohere_api_key, top_n=3),
        ],
        verbose=False,
    )


def chat(query: str):
    global chat_engine

    response = chat_engine.stream_chat(query)
    for res in response.response_gen:
        yield f"data: {json.dumps({'message': res})}\n\n"

The load_rag() function first checks whether a previously stored index exists for reuse; otherwise, it builds one. If no index is stored in the given storage_path directory, it builds the index by loading the documents from the docs_dir directory. SentenceWindowNodeParser is used to parse the sentences into nodes with a window size of 1. This window size makes the surrounding sentences part of the post-retrieval context for better answer synthesis.
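
To make the window behaviour concrete, here is a small sketch with a made-up three-sentence document; each node keeps its own sentence, plus one neighbouring sentence on each side, in the window metadata:

from llama_index.core import Document
from llama_index.core.node_parser import SentenceWindowNodeParser

parser = SentenceWindowNodeParser.from_defaults(
    window_size=1,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
doc = Document(text="I work at Nouveau Labs. I like movies. I blog about AI.")
nodes = parser.get_nodes_from_documents([doc])

for node in nodes:
    # each node is one sentence; "window" holds it plus its neighbours
    print(node.metadata["original_text"], "->", node.metadata["window"])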

For the created nodes, compute the vector embeddings with the Cohere embedding model and store them in Qdrant Cloud. Finally, persist the index so this whole process can be skipped when the application restarts.

From the index, build the chat engine with some memory (4096 tokens here) for remembering past conversations with the user, and provide the Cohere Re-ranker as a post-retrieval node processor to re-rank the nodes based on query relevance. This chat engine retrieves the surrounding context for each retrieved node for more information and finally sends the whole retrieved context and the query to the LLM as a prompt for answer generation. I have used a custom prompt to make the LLM answer as me.
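
As a quick sanity check (a sketch only; the module path is assumed, and it requires valid API keys plus the documents directory), you can call the engine directly before exposing it over HTTP:

# assumed layout: the code above lives in src/rag.py
from src import rag

rag.load_rag()                       # builds or loads the index and the chat engine
print(rag.chat_engine.chat("What are your core skills?").response)

# or stream it, exactly as the API endpoint will
for chunk in rag.chat("Which movies do you like?"):
    print(chunk, end="")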

CHAT_PROMPT = (
    "You are impersonating the human 'Lakshmi Narayana', and that is your name. "
    "So you are Lakshmi Narayana and you answer in the first person. When asked any question about you, you will answer as if Lakshmi Narayana is answering. "
    "You will answer politely and take the help of the following context for more relevant answers. "
    "If you don't have sufficient information from the context, use your knowledge to answer. "
    "Or, if you are sure you cannot answer, don't hallucinate. "
    "Here are the relevant documents for the context:\n{context_str}\n"
    "Instruction: Use the previous chat history, or the context above, to interact with and help the user, and answer as if you are Lakshmi Narayana. "
    "Don't add any additional data if the answer can be derived from the context. "
    "Generate the response in markdown format."
)

LlamaIndex uses this prompt for context ingestion and sends it to the LLM for answer generation.

Finally, the chat generation API is exposed for streaming the response using FastAPI as follows

from fastapi import APIRouter, HTTPException
from pydantic import BaseModel
from starlette.responses import StreamingResponse

from .rag import chat


class GenerateModel(BaseModel):
    message: str


grouter = APIRouter(tags=["generate"])


@grouter.post("")
async def generate(data: GenerateModel):
    try:
        # stream the SSE chunks produced by chat() back to the client
        return StreamingResponse(
            chat(data.message),
            media_type="text/event-stream",
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

The generate API above takes the user query as part of the request body with the key message and calls the chat function to generate the answer.

This generates the response and streams it to the client as Server-Sent Events (SSE).
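
A minimal client-side sketch for consuming that stream (the URL and port are assumptions; they depend on where the router is mounted and how the app is served):

import requests

# assumed: the app is served locally and grouter is mounted at /generate
with requests.post(
    "http://localhost:8000/generate",
    json={"message": "Tell me about yourself"},
    stream=True,
) as r:
    for line in r.iter_lines(decode_unicode=True):
        if line.startswith("data: "):
            print(line[len("data: "):])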

With all of the above done, when the API is called with a user query, the LLM will answer about me as shown in the response screenshot earlier.

With the help of LlamaIndex and a small RAG pipeline, we could build an AI chatbot that answers questions about ourselves. I hope this small article provides guidance on how to build simple RAG-powered chatbot applications for real-world scenarios.

I have written a comprehensive post about this whole full-stack project, Doppalf, on my personal blog. Read it for more details:

Doppalf: RAG powered full-stack AI chatbot like ChatGPT- Santha Lakshmi Narayana

Build a full-stack RAG-powered AI chatbot like ChatGPT to give LLM your personality with Python, FastAPI, Llamaindex…

santhalakshminarayana.github.io

Chatbot with a UI like ChatGPT

You can get the whole project code for the end-to-end chatbot application with UI (streaming answers like ChatGPT) and backend in my GitHub repo:

GitHub – santhalakshminarayana/doppalf: Doppalf is RAG powered AI chat bot application

Doppalf is RAG powered AI chat bot application. Contribute to santhalakshminarayana/doppalf development by creating an…

github.com


Published via Towards AI
