

The LLM Series #5: Simplifying RAG for Every Learner

Last Updated on June 3, 2024 by Editorial Team

Author(s): Muhammad Saad Uddin

Originally published on Towards AI.

Welcome to the fifth edition of the LLM Series, where I continue to unravel the applications of large language models (LLMs). In this article, I aim to simplify the concept of Retrieval Augmented Generation (RAG) for every learner. In the last edition of this series, we explored how to use Azure architecture to build a RAG solution that combines semantic and vector search. To be honest, it was a bit too technical for many readers, and I kept asking myself, “What can I do for those who don’t have access to an Azure environment?” That question led me to this idea: how can anyone create a RAG application without any closed-source software or platform? This article is my answer.

In this article, I am breaking down those barriers. I’ll guide you through creating a RAG application with minimal need for proprietary software. By the end, you’ll see how easily and quickly you can develop sophisticated RAG solutions. It’s about democratizing AI and ensuring that everyone, regardless of their technical background or resources, can harness the power of RAG. So get ready to simplify and innovate as I bring RAG solutions to everyone effortlessly ✨💡.

We are using the same dataset from our previous article, “LLM Series #4: Mastering Vector and Semantic Search via Azure.” Since we previously stored the JSON data with embeddings, we can simply read and utilize it in this article. If you are interested in understanding the steps we took to reach this point, I encourage you to refer to “LLM Series #4.” In that article, I detailed the process of creating and storing embeddings, which laid the groundwork for the current discussion. By building on our earlier work, we can streamline our approach and focus on new insights and applications.

import json

with open(r'C:\wiki_docs\wiki_vector.json', 'r') as file:
    # Load the previously stored Wikipedia documents along with their embeddings
    wiki_data = json.load(file)

Next, we will use FAISS, a powerful open-source library developed by Facebook AI Research for efficient similarity search and clustering of dense vectors.

import faiss
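If FAISS isn’t installed yet, the CPU build is typically available from PyPI (a GPU build ships separately as faiss-gpu):

pip install faiss-cpu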

The reason for using FAISS here is that it is designed to efficiently handle large-scale datasets with millions or even billions of vectors.

(I haven't tested billions yet, but for a few million vectors it is quite efficient.)

Plus, it provides fast nearest-neighbor search capabilities, essential for real-time or near-real-time applications. Its efficient indexing structures, such as the Inverted File Index (IVF) and Hierarchical Navigable Small World (HNSW), enable rapid retrieval of relevant information, which is critical for maintaining the responsiveness of a RAG-based solution.

Before moving forward, let’s have a small overview of its index types; a short sketch of how to construct each one in FAISS follows the list. I can bet this will be quite helpful for many of you in future RAG-based development.

1 – Flat (IndexFlat)

  • Description: The simplest type of index that stores all vectors without any compression or approximation.
  • Use Case: Best for small datasets where an exact nearest neighbor search is required.
  • Pros: Provides exact results.
  • Cons: Every query scans all vectors, so search becomes slow and memory-hungry for large datasets.

2 – IVF (Inverted File Index)

  • Description: Divides the dataset into clusters using k-means clustering and then performs search within these clusters.
  • Use Case: Suitable for large datasets where approximate nearest neighbor search is acceptable.
  • Pros: Faster search compared to Flat index, especially with large datasets.
  • Cons: Approximate results, requires training.

3 – HNSW (Hierarchical Navigable Small World)

  • We already discussed this approach in a previous article, see it here.

4 – PQ (Product Quantization)

  • Description: Compresses the vectors into smaller codes, which significantly reduces memory usage while allowing fast approximate nearest neighbor search.
  • Use Case: Ideal for very large datasets where memory efficiency is crucial.
  • Pros: Highly memory efficient.
  • Cons: Requires careful parameter tuning and can be complex to set up.

5 – LSH (Locality Sensitive Hashing)

  • Description: Maps vectors into hash buckets such that similar vectors fall into the same bucket with high probability.
  • Use Case: Useful for approximate nearest neighbor search in high-dimensional spaces.
  • Pros: Simple and effective for certain types of data distributions.
  • Cons: Generally less accurate compared to other methods.

You can find more details on them here.
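As promised above, here is a minimal sketch of how each of these index types is typically constructed in FAISS. The dimension and the parameter values (nlist, the HNSW neighbor count, the PQ sub-quantizer settings, and the LSH bit count) are illustrative assumptions, not tuned recommendations:

import faiss
import numpy as np

d = 1536                                        # embedding dimension (e.g., text-embedding-ada-002)
xb = np.random.rand(1000, d).astype('float32')  # placeholder vectors, for illustration only

# 1 - Flat: exact search, no training needed
flat = faiss.IndexFlatL2(d)
flat.add(xb)

# 2 - IVF: clusters vectors into nlist cells; must be trained before adding
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 100)     # nlist = 100 (assumption)
ivf.train(xb)
ivf.add(xb)

# 3 - HNSW: graph-based approximate search, no training needed
hnsw = faiss.IndexHNSWFlat(d, 32)               # 32 neighbors per node (assumption)
hnsw.add(xb)

# 4 - PQ: compresses vectors into m sub-quantizers of nbits each; must be trained
pq = faiss.IndexPQ(d, 16, 8)                    # m = 16, nbits = 8 (assumption)
pq.train(xb)
pq.add(xb)

# 5 - LSH: hashes vectors into binary codes of nbits
lsh = faiss.IndexLSH(d, 256)                    # 256 hash bits (assumption)
lsh.add(xb)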

For our use case, we have chosen the IndexFlat method. Given that our dataset consists of only 1,000 documents, IndexFlat is a suitable choice for several reasons: no training overhead, exact results, and simplicity.

import numpy as np

# Gather the stored embedding vectors and convert them into a float32 matrix for FAISS
embeddings = [item['textVector'] for item in wiki_data if 'textVector' in item]
print(len(embeddings))
embeddings = np.array(embeddings).astype('float32')
print(embeddings.shape[1])

# Build an exact (flat) L2 index sized to the embedding dimension and add all vectors
index_wiki = faiss.IndexFlatL2(embeddings.shape[1])
print(index_wiki.is_trained)
index_wiki.add(embeddings)
print(index_wiki.ntotal)

Output:
1000
1536
True
1000

We will continue to use the same generate_embeddings function from our previous article. This function has been designed to efficiently create embeddings by calling the Azure OpenAI API with the "text-embedding-ada-002" model.

The reason for this is that the embeddings stored in our JSON file were created using the same model, “text-embedding-ada-002.” By using the same generate_embeddings function, we ensure that any new embeddings generated will be consistent with the existing ones. This consistency is crucial because it allows us to seamlessly integrate new data with the old, maintaining uniformity across our entire dataset.

Other reasons are dimensionality and tokenization. Different embedding models can produce vectors of varying dimensions. If we were to use a different model for generating new embeddings during inference, the dimensionality of these vectors might not match that of the existing vectors. This mismatch would make it impossible to perform accurate similarity searches or comparisons, as the vectors would not align properly in the vector space. Similarly, embedding models often employ distinct tokenization methods to convert text into numerical vectors. Tokenization involves breaking text down into smaller units (tokens), which are then mapped to vectors. Different models may tokenize the same text in different ways, leading to variations in the resulting embeddings.

from openai import AzureOpenAI
from tenacity import retry, wait_random_exponential, stop_after_attempt

@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def generate_embeddings(text: str):
    client = AzureOpenAI(
        api_key="# Azure OpenAI api key",
        azure_endpoint="# Azure OpenAI endpoint",
        api_version="# Azure OpenAI version"
    )
    response = client.embeddings.create(
        input=text, model="text-embedding-ada-002")
    embeddings = response.data[0].embedding
    return embeddings
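To make the dimensionality argument above concrete, here is a minimal sanity check (an illustrative addition of my own, assuming the index_wiki and generate_embeddings objects defined above): the dimension reported by the FAISS index must match the length of any embedding we generate at query time.

query_vec = generate_embeddings("a quick test query")
print(index_wiki.d, len(query_vec))  # both should be 1536 for text-embedding-ada-002
assert index_wiki.d == len(query_vec), "Embedding model mismatch: query vectors won't align with the index"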

We also create a function to extract top indices from the FAISS index and get relevant data from our JSON file.

def find_top_match(text, wiki_data, generate_embeddings, index_wiki):
    # Embed the query and shape it as a single-row float32 matrix for FAISS
    query_embedding = generate_embeddings(text)
    query_embedding = np.array(query_embedding).astype('float32')
    query_embedding = query_embedding.reshape(1, -1)
    # Retrieve the three nearest documents and return their key fields
    distances, indices = index_wiki.search(query_embedding, 3)
    keys = ['title', 'text', 'URL']
    top_match = [{k: wiki_data[i][k] for k in keys} for i in indices[0]]
    return top_match

The find_top_match function identifies the top three relevant Wikipedia articles for a given input text. It generates an embedding for the input text using the generate_embeddings function, converts this embedding into a NumPy array of type float32, and reshapes it for compatibility with the FAISS index. The function then performs a similarity search using the FAISS index (index_wiki), retrieving the indices and distances of the three closest matches. It extracts the title, text, and URL for each of these matches from wiki_data and returns them as a list of dictionaries, each containing the relevant information of one of the top three Wikipedia articles.

It is time to test this approach.

text = 'What is Data Science, Explain in detail'
answer = find_top_match(text, wiki_data, generate_embeddings, index_wiki)
print(answer)
[
{'title': 'Data science',
'text': 'Data science is an interdisciplinary academic field that uses......',
'URL': 'https://en.wikipedia.org/wiki/Data_science'},
{'title': 'Master in Data Science',
'text': 'A Master of Science in Data Science is an interdisciplinary degree program designed......',
'URL': 'https://en.wikipedia.org/wiki/Master_in_Data_Science'},
{'title': 'Scientific Data (journal)',
'text': 'Scientific Data is a peer-reviewed open access scientific journal published by...........',
'URL': 'https://en.wikipedia.org/wiki/Scientific_Data_(journal)'}
]

The results so far look promising, but our work is far from complete. Our goal is not just to find the right articles but to provide precise answers to user questions based on those articles. To deliver a comprehensive, human-like response, we will now integrate GPT-4. This model will read the retrieved information as context and answer user questions in clear, simple English, moving beyond dictionary-like entries or raw JSON. That output might look like art to us, but not to everyone 😀

We will now introduce a function similar to our approach in the previous article. We will utilize the GPT-4 model from Azure OpenAI because of its advanced capabilities in understanding and generating human-like text.

system_prompt = '''You are a Wiki Doc Assistant. You will be given a user question \
and contextual data. Only answer the user question from the given data and don't use your \
existing knowledge for it. If the given data does not provide an answer to the user question, \
simply answer: this is out of my scope as a wiki doc assistant.
Also, if the context data contains a relevant URL, provide it with your answer to the user.'''


def get_LLM_response(system_prompt, context, ques, engine="gpt-4"):
    client = AzureOpenAI(
        api_key="Key",
        api_version="version",
        azure_endpoint="Endpoint"
    )
    response = client.chat.completions.create(
        model=engine,  # for Azure OpenAI this is the deployment name of your GPT-4 model
        messages=[
            {"role": "system", "content": f"{system_prompt}"},
            {"role": "user", "content": f"data: here's the data for context: {context} "
                                        f"and here is the user question: {ques}"}
        ]
    )
    resp = response.choices[0].message.content
    return resp

I would recommend creating a system prompt here that is as detailed as possible, including several examples (within your model’s context window) to illustrate how the model should behave in various scenarios. This few-shot approach has proven to be remarkably effective in numerous use cases. By refining the prompt from the previous article, we can enhance its effectiveness, particularly in handling edge cases; a small sketch of what such a few-shot prompt might look like follows. Let’s explore how these improvements make the model more robust and capable of delivering precise and contextually appropriate responses.
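Here is a minimal sketch of what a few-shot version of the system prompt might look like. The example questions and answers are purely illustrative assumptions, not taken from the article’s dataset:

few_shot_system_prompt = '''You are a Wiki Doc Assistant. Answer the user question only from the given
contextual data, and include any relevant URL from that data in your answer. If the data does not
contain the answer, reply: this is out of my scope as a wiki doc assistant.

Example 1 (answerable):
Context: "Python is a high-level programming language... URL: https://en.wikipedia.org/wiki/Python_(programming_language)"
Question: "What is Python?"
Answer: "Python is a high-level programming language. More details: https://en.wikipedia.org/wiki/Python_(programming_language)"

Example 2 (out of scope):
Context: "Python is a high-level programming language..."
Question: "Who won the 2022 World Cup?"
Answer: "This is out of my scope as a wiki doc assistant."
'''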

ques = 'Simply explain what is AI (artificial intelligence)'
context = find_top_match(ques, wiki_data, generate_embeddings, index_wiki)  # retrieve the top articles as context
print(get_LLM_response(system_prompt, context, ques, engine="gpt-4"))
Artificial Intelligence (AI) refers to the simulation of human intelligence processes by machines, especially computer systems. 
In a broader sense, its the intelligence exhibited by machines or software.
Its a field of research within computer science that is devoted to creating and studying methods
and software that enable machines to understand their environment.
You can find more details on A.I. here: "https://en.wikipedia.org/wiki/Artificial_intelligence"

For those who read my previous article, you will notice the difference in results here. For others, let me explain: by simply improving our prompt, we ensure that the responses include the appropriate Wikipedia article URLs. This feature is incredibly useful for building user trust in your application. Not only do users receive accurate answers, but they also get a link to the source of the information. This transparency allows users to verify the information easily, especially in cases where the answer might seem incomplete or when users have low confidence in the response. This approach also helps in cases where users might need more context or want to explore related information beyond the immediate answer. Additionally, this method aids in content verification and fact-checking, which is crucial in maintaining the integrity of the information provided. By empowering users to cross-check the sources, we not only improve the reliability of our application but also foster a more engaging and educational experience.

Let’s try an edge case to test the effectiveness of our instructions in the prompt. Edge cases often present unique challenges that can reveal the strengths and weaknesses of our approach. By evaluating how well the model handles these scenarios, we can assess the robustness of our prompt improvements.

ques = 'What is Mixture of Experts (MoE) in AI'
context = find_top_match(ques, wiki_data, generate_embeddings, index_wiki)
print(get_LLM_response(system_prompt, context, ques, engine="gpt-4"))
As a Wiki Doc Assitant, I couldn't find anything related to this topic and 
this is out of my scope of knowledge based on given data.

As expected, adding more detailed instructions as context in the system prompt helps control the behavior of GPT and minimizes hallucinations, ensuring the model adheres to the RAG solution approach. Few-shot learning has proven to be quite effective in reducing hallucinations by providing clear examples of desired responses. But the model can still occasionally hallucinate. To address this, we can implement guardrails to further mitigate these instances; a tiny illustration follows. (I will discuss effective strategies for controlling hallucinations in more detail in upcoming articles of this LLM series.)
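As a flavor of what such a guardrail could look like (a minimal, illustrative post-check of my own, not the strategies the author will cover later), we can verify that any URL the model cites actually appears in the retrieved context before showing the answer to the user:

import re

def url_guardrail(answer: str, context) -> str:
    # Collect every URL mentioned in the model's answer
    cited_urls = re.findall(r'https?://\S+', answer)
    # Flag the answer if it cites a URL that is not present in the retrieved context
    context_text = str(context)
    if any(url.rstrip('".,)') not in context_text for url in cited_urls):
        return "Warning: the answer cites a source that was not in the retrieved context."
    return answer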

Let’s try one more interesting case to further evaluate the effectiveness of our system prompt.

ques = 'Create me a table with three differences between AI and AGI'
context = find_top_match(ques, wiki_data, generate_embeddings, index_wiki)
print(get_LLM_response(system_prompt, context, ques, engine="gpt-4"))
Here's this information in a table form:

| Artificial Intelligence (AI) | Artificial General Intelligence (AGI) |
|-----------------------------------|------------------------------------------------|
| AI refers to systems able to perform tasks that normally require human intelligence | AGI refers to a type of artificial intelligence that has the capacity to understand, learn, and apply its intelligence to any intellectual task |
| AI is designed to excel in one specific task | AGI, on the other hand, can theoretically perform any intellectual task that a human being can. |
| AI is usually not capable of doing tasks they are not specifically programmed for. | AGI systems have the ability to understand, learn, adapt, and implement |

As demonstrated, we can also create tables using GPT based on the type of data we have and whether it can be effectively presented in this format. It’s quite fascinating, isn’t it? This capability adds another dimension to how we can structure and present information, making it more accessible and understandable for users. With this, I conclude this article.

To summarize our work here: we used the Wikipedia data from Hugging Face, similar to our previous article. This time, instead of using Azure Cognitive Search (now Azure AI Search), we opted for FAISS, an open-source alternative for vector storage and search. We integrated GPT-4 with improved prompt instructions to mitigate hallucinations, saw the results, and I believe we can be happy with this solution (for now, at least).

That’s it for today, but rest assured, our journey is far from over! In the upcoming chapters of the LLM series, we will develop the RAG application with a multimodal approach so we can share images, tables, and other data from documents as answers to user queries. If this guide has sparked your curiosity and you are keen on exploring more intriguing projects within this LLM series, make sure to follow me. With each new project, I promise a journey filled with learning, creativity, and fun. Furthermore:

  • 👏 Clap for the story (50 claps) to help this article be featured
  • 🔔 Follow Me: LinkedIn |Medium | Website
  • 🌟 Need help in converting these prototypes to products? Contact Me!


Published via Towards AI
