Will Long Context Language Models Replace RAG?
Last Updated on December 10, 2024 by Editorial Team
Author(s): Claudio Giorgio Giancaterino
Originally published on Towards AI.
Kaggle has launched a competition around the Gemini 1.5 model, introduced by Google on August 8th, 2024, to showcase the model's capabilities, in particular its ability to process up to 2 million tokens in a single context window.
The Gemini 1.5 model is a family of multimodal models known for efficiently processing content across multiple modalities, including text, images, audio, and video. It consists of Gemini 1.5 Pro, which boasts enhanced features, and Gemini 1.5 Flash, optimised for speed and efficiency, maintaining high performance with less computational demand.
This topic was already addressed in the paper “Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?”. This paper introduces LOFT (Long-Context Frontiers), a benchmark designed to evaluate the performance of long-context language models (LCLMs) on tasks that typically rely on external tools such as retrieval systems or databases. LOFT comprises six tasks across 35 datasets spanning text, visual, and audio modalities. The researchers argue that LCLMs have the potential to replace these traditional tools by directly ingesting and processing vast amounts of information within their context windows. They evaluated three state-of-the-art LCLMs (Gemini 1.5 Pro, GPT-4o, and Claude 3 Opus) on LOFT and compared their performance with specialised models built specifically for each task.
As part of the competition, I stress-tested the capabilities of Gemini 1.5 Flash on the following paper: “A Comprehensive Review of Generative AI in Healthcare”. The paper provides a valuable overview of the current state of generative AI in healthcare, specifically diffusion models and transformer-based models, and highlights its potential to improve patient care and advance biomedical research.
You can follow my analysis in this notebook.
To evaluate the performance of the model, I’ve selected five questions:
# Define a set of questions to ask regarding the paper's content.
question_1 = "Which types of encoder-decoder transformer architectures are discussed in the review ?"
question_2 = "Could you say who developed MERGIS, and what is it ?"
question_3 = "Could you say who developed AdaDiff, and what is it ?"
question_4 = "What process represents the equation 1 used in the review, could you explain it ?"
question_5 = "Could you say who developed ProteinBERT, and what is it ?"
with the following answers retrieved from the paper:
right_answer_1 = """
Encoder-only models:
These models are advantageous for text classification tasks such as sentiment analysis. An example of an LLM utilizing an encoder-only model is BERT.
Decoder-only or autoregressive models:
These models are suitable for text generation tasks, akin to the predictive text functionality in a smartphone chat application. For instance, as you input text, the AI predicts the subsequent word or phrase. An example of this model is GPT-3.
Encoder-decoder models:
These models facilitate generative AI tasks such as language translation and summarization. Notable LLMs employing this methodology include Facebook’s BART and Google’s T5.
"""
right_answer_2 = """
MERGIS, was proposed by (Nimalsiri et al. 2023). It uses image segmentation and a modern
transformer-based encoder-decoder model to enhance the accuracy of automated report generation.
"""
right_answer_3 = """
(Özbey et al. 2023) introduced an innovative technique named Adaptive Diffusion
Priors (AdaDiff) for the reconstruction of MRI. This approach involves a series of diffusion processes
that enhance the authenticity of the generated images. AdaDiff dynamically adjusts its priors during
the inference stage to align more closely with the distribution of the test data.
"""
right_answer_4 = """
The forward diffusion process is depicted as a Markov Chain, characterized by the inclusion of
Gaussian noise in a series of stages, culminating in generating noisy samples. The uncorrupted
or original data distribution is represented as q(x_0). With a data sample x_0 drawn from this
distribution, q(x_0), a forward noising operation, denoted as q, is employed.
This operation introduces Gaussian noise iteratively at various time points, represented by t,
resulting in a series of latent states x_1 through x_T. The process can be mathematically defined as
follows:
q(x_t | x_{t-1}) = 𝒩(x_t; √(1 − β_t)·x_{t-1}, β_t·I), ∀t ∈ {1, …, T}
T denotes the number of diffusion steps, while β_1, …, β_T, each within the interval of [0, 1), signify
the variance schedule spread throughout the diffusion steps.
The identity matrix is symbolized by I, and 𝒩(x; μ, σ) characterizes the normal distribution
possessing a mean of μ and a covariance of σ.
"""
right_answer_5 = """
ProteinBERT was developed by (Brandes et al. 2022). It's a specialized deep language model for protein
sequences that amalgamates local and global representations for comprehensive end-to-end processing.
"""
…and I’ve used two metrics:
- ROUGE is a statistical scoring method. It primarily focuses on the surface-level overlap of words and phrases without deeply considering the semantic meaning or reasoning behind the text. Although ROUGE can be a helpful tool for a quick and basic evaluation, it might not be the most accurate measure for complex language tasks where deeper understanding is crucial.
- BERTScore is a hybrid method that combines statistical scoring and a model-based approach. It uses contextual embeddings generated by pre-trained models like BERT to evaluate the similarity between generated text and reference text. While this method is more semantically aware than purely statistical methods, it can be influenced by biases present in the training data of the pre-trained models.
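To make the idea of "surface-level overlap" concrete, here is a minimal sketch of how ROUGE-N and ROUGE-L can be computed by hand. It is a simplified illustration and not the implementation behind the scores reported in this article: the helper names (`rouge_n`, `rouge_l`, `lcs_length`) are hypothetical, tokenisation is plain whitespace splitting, and no stemming is applied.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(overlap, n_candidate, n_reference):
    """Turn an overlap count into (precision, recall, F-measure)."""
    p = overlap / n_candidate if n_candidate else 0.0
    r = overlap / n_reference if n_reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def rouge_n(reference, candidate, n=1):
    """ROUGE-N: clipped n-gram overlap between reference and candidate."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # Counter intersection clips counts
    return prf(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """ROUGE-L: based on the longest common subsequence of tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    return prf(lcs_length(ref, cand), len(cand), len(ref))


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print("rouge1:", rouge_n(reference, candidate, n=1))  # 5 of 6 unigrams overlap
print("rouge2:", rouge_n(reference, candidate, n=2))  # 3 of 5 bigrams overlap
print("rougeL:", rouge_l(reference, candidate))       # LCS has 5 tokens
```

The two sentences differ in a single word, yet the bigram score drops to 0.6. This sensitivity to exact wording is why a paraphrased but correct answer can score low on ROUGE.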
–First solution: knowledge base Gemini 1.5 Flash
With this solution, the Gemini model interacts directly with the paper, which allows the conversation to be set up with context and driven dynamically by user prompts.
For this purpose, I have implemented two functions to start and interact with a chat session using a language model. The first function, start_chat_session, initiates a chat by combining some contextual information with the text from the document. The second function, chatAI, interacts with a previously started chat session: it takes the chat session, some contextual information, and a prompt, and returns the model's response as text. At the end, a chat session is initialised for queries.
# start_chat_session: Starts a chat by combining a context and the text of the document
def start_chat_session(context_info, papers_text):
    if papers_text:
        # Combine context with the text content of the first document
        combined_content = context_info + "\n\n" + papers_text[0]
        # Use the combined text in the model history
        chat_session = llm.start_chat(
            history=[
                {
                    'role': 'user',
                    'parts': [combined_content]
                }
            ]
        )
        return chat_session
    return None

# chatAI: Sends prompts to the chat session and returns the model's response
def chatAI(chat_session, context_info, prompt):
    if chat_session:
        # Combine the prompt with context to provide more depth
        full_prompt = f"{context_info}\n\n{prompt}"
        response = chat_session.send_message(full_prompt)
        return response.text
    return "No chat session initialised."

# The chat session is initiated using the extracted paper text and context.
chat_session = start_chat_session(context_info, papers_text)
Here are the responses and performances from the first model:
display(Markdown(response_1_base))
“The review mentions three types of encoder-decoder transformer architectures:
- Encoder-only models (e.g., BERT) suitable for text classification tasks.
- Decoder-only or autoregressive models (e.g., GPT-3) suitable for text generation tasks.
- Encoder-decoder models (e.g., Facebook’s BART and Google’s T5) suitable for tasks like language translation and summarization.”
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_1, response_1_base)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_1, response_1_base)
Calculating scores:
ROUGE Scores:
rouge1: precision=0.6724, recall=0.3861, fmeasure=0.4906
rouge2: precision=0.4561, recall=0.2600, fmeasure=0.3312
rougeL: precision=0.4828, recall=0.2772, fmeasure=0.3522
BERTScore: Precision=0.7877, Recall=0.7473, F1=0.7670
display(Markdown(response_2_base))
“MERGIS was developed by Nimalsiri et al. It is a transformer-based encoder-decoder model that uses image segmentation to enhance the accuracy of automated report generation.”
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_2, response_2_base)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_2, response_2_base)
Calculating scores:
ROUGE Scores:
rouge1: precision=0.8889, recall=0.8571, fmeasure=0.8727
rouge2: precision=0.6538, recall=0.6296, fmeasure=0.6415
rougeL: precision=0.7778, recall=0.7500, fmeasure=0.7636
BERTScore: Precision=0.9364, Recall=0.9032, F1=0.9195
display(Markdown(response_3_base))
“AdaDiff was developed by Özbey et al. It is an Adaptive Diffusion Priors method for MRI reconstruction that dynamically adjusts its priors during inference to better match the test data distribution, leading to improved reconstruction quality and speed.”
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_3, response_3_base)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_3, response_3_base)
Calculating scores:
ROUGE Scores:
rouge1: precision=0.6053, recall=0.4259, fmeasure=0.5000
rouge2: precision=0.2703, recall=0.1887, fmeasure=0.2222
rougeL: precision=0.5263, recall=0.3704, fmeasure=0.4348
BERTScore: Precision=0.8303, Recall=0.7937, F1=0.8116
display(Markdown(response_4_base))
“Equation (1), q(x_t | x_{t-1}) = 𝒩(x_t; √(1 − β_t)·x_{t-1}, β_t·I), ∀t ∈ {1, …, T}, represents the forward diffusion process in Denoising Diffusion Probabilistic Models (DDPMs). This process is a Markov chain where Gaussian noise is iteratively added to the input data (x_{t-1}) at each time step (t). The amount of noise added is controlled by the variance schedule (β_t), resulting in a series of increasingly noisy latent states (x_1 through x_T). The equation shows that each noisy state x_t is drawn from a normal distribution (𝒩) with a mean dependent on the previous state (scaled by √(1 − β_t)) and a variance (β_t·I), where I is the identity matrix.”
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_4, response_4_base)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_4, response_4_base)
Calculating scores:
ROUGE Scores:
rouge1: precision=0.5517, recall=0.4848, fmeasure=0.5161
rouge2: precision=0.2174, recall=0.1908, fmeasure=0.2033
rougeL: precision=0.2845, recall=0.2500, fmeasure=0.2661
BERTScore: Precision=0.7437, Recall=0.7865, F1=0.7645
display(Markdown(response_5_base))
“ProteinBERT was developed by Brandes et al. It is a specialized deep language model for protein sequences that combines local and global representations for comprehensive end-to-end processing.”
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_5, response_5_base)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_5, response_5_base)
Calculating scores:
ROUGE Scores:
rouge1: precision=0.9310, recall=0.9000, fmeasure=0.9153
rouge2: precision=0.8214, recall=0.7931, fmeasure=0.8070
rougeL: precision=0.9310, recall=0.9000, fmeasure=0.9153
BERTScore: Precision=0.9656, Recall=0.9331, F1=0.9491
The answers are correct, and BERTScore reflects this better than ROUGE: its F1 values range from about 0.76 to 0.95, which is notable given that one of the questions concerns a mathematical equation.
Then I compared this solution with two Retrieval-Augmented Generation (RAG) architectures: Naive RAG and Advanced RAG.
…but first, what is Retrieval-Augmented Generation?
Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by retrieving relevant information from external databases, enhancing accuracy and credibility. RAG tackles limitations of LLMs, such as “hallucinations” (the generation of incorrect content), by referencing external knowledge to supplement the LLMs’ internal knowledge.
The process starts with a user question to the RAG system, which transforms the user’s query into a format optimised for retrieval. This query is then used to search an external knowledge base, consisting of text documents, databases, knowledge graphs, or even previously generated LLM content. The system retrieves the most relevant chunks of information based on semantic similarity calculations. The retrieved information is combined with the user’s original query to create a comprehensive prompt for the LLM. This approach enables the LLM to generate a response enriched by both its internal knowledge and the relevant external information.
–Second solution: knowledge base RAG with Gemini 1.5 Flash
Naive RAG represents the earliest methodology for Retrieval-Augmented Generation. This approach follows a traditional process of:
- Indexing: Involves converting raw data into a uniform text format, segmenting this text into smaller chunks, encoding these chunks into vector representations, and storing them in a vector database for efficient similarity searches.
- Retrieval: This stage transforms the user query into a vector representation and retrieves the most similar chunks from the indexed corpus, using these as expanded context.
- Generation: The query and retrieved chunks create a prompt for a large language model, which generates a response based on the provided information.
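The three stages above can be sketched in a few lines of dependency-free Python. This is only a toy illustration under stated assumptions: the "embedding" is a bag-of-words term-frequency vector rather than a learned one, the function names (`build_index`, `retrieve`, `build_prompt`) are hypothetical, the sample text paraphrases the reference answers above, and the final call to an LLM is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy 'embedding': a bag-of-words term-frequency vector."""
    return Counter(tokenize(text))


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# 1) Indexing: split the text into chunks and store one vector per chunk.
def build_index(document):
    chunks = [c.strip() for c in document.split("\n") if c.strip()]
    return [(chunk, embed(chunk)) for chunk in chunks]


# 2) Retrieval: embed the query and return the most similar chunks.
def retrieve(index, query, top_k=2):
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


# 3) Generation: assemble the prompt that would be sent to the LLM.
def build_prompt(query, chunks):
    context = "\n".join(chunks)
    return f"Relevant Information:\n{context}\n\nQuestion:\n{query}"


document = """MERGIS uses image segmentation for automated report generation.
AdaDiff is a diffusion-based method for MRI reconstruction.
ProteinBERT is a deep language model for protein sequences."""

index = build_index(document)
query = "What is ProteinBERT?"
top_chunks = retrieve(index, query, top_k=1)
prompt = build_prompt(query, top_chunks)
print(prompt)
```

Only the retrieved chunk reaches the prompt, which is the key difference from the long-context approach, where the whole document is placed in the model's context.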
The second model implements a simple information retrieval system combined with a Retrieval-Augmented Generation (RAG) model for responding to queries based on a paper’s content.
The paper’s text is split into sections. Then a TF-IDF (Term Frequency-Inverse Document Frequency) vectoriser converts the text sections into numerical vectors. This vectorisation is essential for comparing the text sections with the user’s query. FAISS (Facebook AI Similarity Search) is used to create an index that allows fast retrieval of the most similar sections. The vectors corresponding to each text section are added to the FAISS index, enabling efficient nearest-neighbour searches. For each user query, FAISS retrieves the most relevant sections of the document, and these sections are fed as context, along with the query, to the Gemini model. The chatAI_RAG function formulates a prompt using context information (context_info), the relevant sections retrieved, and the user query. This prompt is sent to a chat session (chat_session.send_message), which interacts with the language model to generate and return a text response.
import faiss
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Split paper text into sections for better vectorization (e.g., paragraphs or sentences)
sections = papers_text[0].split('\n')  # Split by newline or any other delimiter as per requirement

# Compute TF-IDF matrix
np.random.seed(0)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sections)

# Create FAISS index
dimension = tfidf_matrix.shape[1]  # Dimension of TF-IDF vectors
index = faiss.IndexFlatL2(dimension)  # Using L2 (euclidean) distance

# Convert the sparse matrix to a dense float32 array, the dtype FAISS works with
dense_vectors = tfidf_matrix.toarray().astype(np.float32)

# Add vectors to the index
index.add(dense_vectors)

def retrieve_relevant_section(query, top_k=2):
    # Compute query vector (float32, to match the index)
    query_vector = vectorizer.transform([query]).toarray().astype(np.float32)
    # Search FAISS index
    _, indices = index.search(query_vector, top_k)
    # Retrieve relevant text sections
    relevant_sections = [sections[i] for i in indices[0]]
    return " ".join(relevant_sections)

# Function to generate a response using retrieved information and context
def chatAI_RAG(query):
    # Retrieve relevant sections
    relevant_text = retrieve_relevant_section(query)
    # Combine context with relevant sections and user query
    full_prompt = f"{context_info}\n\nRelevant Information:\n{relevant_text}\n\nQuestion:\n{query}"
    # Note: as written, this reuses the chat_session created earlier,
    # whose history already contains the full paper text.
    response = chat_session.send_message(full_prompt)
    return response.text
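One detail about the index is worth spelling out. scikit-learn's `TfidfVectorizer` L2-normalises every vector by default (`norm='l2'`), and `IndexFlatL2` ranks results by squared Euclidean distance. For unit-length vectors a and b, ‖a − b‖² = ‖a‖² + ‖b‖² − 2·a·b = 2 − 2·cos(a, b), so ranking by L2 distance gives the same order as ranking by cosine similarity. The sketch below checks this numerically in plain Python; the toy vectors are made up for illustration and do not come from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    # A plain dot product equals the cosine because inputs are unit vectors.
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query and section vectors, normalised to unit length.
query = normalize([1.0, 2.0, 0.0])
sections = [normalize(v) for v in ([2.0, 4.0, 0.1], [0.0, 1.0, 3.0], [1.0, 0.0, 0.0])]

# The identity ||a - b||^2 = 2 - 2*cos(a, b) holds for every pair.
for s in sections:
    assert math.isclose(squared_l2(query, s), 2 - 2 * cosine(query, s), abs_tol=1e-12)

# Smallest distance first versus largest cosine first: same ordering.
by_distance = sorted(range(len(sections)), key=lambda i: squared_l2(query, sections[i]))
by_cosine = sorted(range(len(sections)), key=lambda i: cosine(query, sections[i]), reverse=True)
print(by_distance == by_cosine)
```

The equivalence only holds for non-zero, normalised vectors. A section with no terms, such as an empty line, maps to the zero vector and sits at squared distance 1 from any unit-length query, so it can outrank sections that share few terms with the query.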
Here are the responses and performances from the second model:
display(Markdown(response_1_rag))
“The review discusses three types of encoder-decoder transformer architectures: encoder-only models (e.g., BERT), decoder-only or autoregressive models (e.g., GPT-3), and encoder-decoder models (e.g., Facebook’s BART and Google’s T5).”
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_1, response_1_rag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_1, response_1_rag)
Calculating scores:
ROUGE Scores:
rouge1: precision=0.6579, recall=0.2475, fmeasure=0.3597
rouge2: precision=0.4324, recall=0.1600, fmeasure=0.2336
rougeL: precision=0.5789, recall=0.2178, fmeasure=0.3165
BERTScore: Precision=0.7573, Recall=0.6871, F1=0.7205
display(Markdown(response_2_rag))
“Nimalsiri et al. developed MERGIS. It’s a transformer-based encoder-decoder model that uses image segmentation to improve the accuracy of automated report generation.”
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_2, response_2_rag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_2, response_2_rag)
Calculating scores:
ROUGE Scores:
rouge1: precision=0.8400, recall=0.7500, fmeasure=0.7925
rouge2: precision=0.5417, recall=0.4815, fmeasure=0.5098
rougeL: precision=0.6800, recall=0.6071, fmeasure=0.6415
BERTScore: Precision=0.9198, Recall=0.8871, F1=0.9032
display(Markdown(response_3_rag))
“Özbey et al. developed AdaDiff. It’s an Adaptive Diffusion Priors method for MRI reconstruction that dynamically adjusts its priors during inference to better match the test data distribution, resulting in superior reconstruction quality and speed.”
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_3, response_3_rag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_3, response_3_rag)
Calculating scores:
ROUGE Scores:
rouge1: precision=0.6389, recall=0.4259, fmeasure=0.5111
rouge2: precision=0.2857, recall=0.1887, fmeasure=0.2273
rougeL: precision=0.5556, recall=0.3704, fmeasure=0.4444
BERTScore: Precision=0.8316, Recall=0.7973, F1=0.8141
display(Markdown(response_4_rag))
“Equation (1), q(x_t | x_{t-1}) = 𝒩(x_t; √(1 − β_t)·x_{t-1}, β_t·I), ∀t ∈ {1, …, T}, represents the forward diffusion process in Denoising Diffusion Probabilistic Models (DDPMs). This is a Markov chain where Gaussian noise is iteratively added to the input data at each time step. The amount of noise added is controlled by the variance schedule (β_t), resulting in a series of increasingly noisy latent states. The equation shows that each noisy state x_t is drawn from a normal distribution (𝒩) with a mean dependent on the previous state (scaled by √(1 − β_t)) and a variance (β_t·I), where I is the identity matrix.”
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_4, response_4_rag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_4, response_4_rag)
Calculating scores:
ROUGE Scores:
rouge1: precision=0.6224, recall=0.4621, fmeasure=0.5304
rouge2: precision=0.2577, recall=0.1908, fmeasure=0.2193
rougeL: precision=0.3163, recall=0.2348, fmeasure=0.2696
BERTScore: Precision=0.7739, Recall=0.7870, F1=0.7804
display(Markdown(response_5_rag))
“Brandes et al. developed ProteinBERT. It’s a specialized deep language model for protein sequences that uses both local and global representations for complete end-to-end processing.”
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_5, response_5_rag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_5, response_5_rag)
Calculating scores:
ROUGE Scores:
rouge1: precision=0.8929, recall=0.8333, fmeasure=0.8621
rouge2: precision=0.7037, recall=0.6552, fmeasure=0.6786
rougeL: precision=0.8214, recall=0.7667, fmeasure=0.7931
BERTScore: Precision=0.9408, Recall=0.9086, F1=0.9245
Judging by the BERTScore F1, the second model performs slightly better than the first on questions 3 and 4, and slightly worse on questions 1, 2, and 5.
–Third solution: knowledge base Advanced RAG with Gemini 1.5 Flash
Advanced RAG builds upon the foundation of Naive RAG by introducing specific improvements to address its limitations. This paradigm focuses primarily on enhancing retrieval quality through:
- Pre-Retrieval Strategies:
  - Refining indexing techniques: using a sliding window approach for chunking, fine-grained segmentation, and incorporating metadata to improve the quality of indexed content.
  - Optimising the original query: employing methods like query rewriting, transformation, and expansion to make the user’s question clearer and more suitable for retrieval.
- Post-Retrieval Strategies:
  - Reranking chunks: rearranging retrieved chunks to prioritise the most relevant content.
  - Context compression: selecting essential information from retrieved documents, emphasising critical sections, and shortening the context to prevent information overload and focus the LLM on key details.
By implementing these strategies, Advanced RAG aims to overcome the retrieval challenges and augmentation hurdles faced in Naive RAG, leading to a more efficient and effective retrieval process.
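As an illustration of one pre-retrieval strategy from the list above, here is a minimal sketch of sliding-window chunking. It is not part of the notebook used in this article, which splits the text on newlines: the function name `sliding_window_chunks` and the `window` and `overlap` parameters (both measured in words) are assumptions made for this example.

```python
def sliding_window_chunks(text, window=50, overlap=10):
    """Split text into chunks of `window` words; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    words = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        # Stop once the current window has reached the end of the text.
        if start + window >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in sliding_window_chunks(sample, window=4, overlap=2):
    print(chunk)
```

Because neighbouring chunks overlap, a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of a larger index.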
The third model improves on the Naive RAG with a re-ranking strategy. As before, the sections are vectorised using TF-IDF, which converts text into numeric form based on term frequency and inverse document frequency and provides a basic relevance estimate. These vectors are indexed using FAISS, a library for efficient vector similarity search that supports quick retrieval of similar vectors in high-dimensional spaces. The additional step loads a pre-trained Sentence Transformer model (all-MiniLM-L6-v2) to generate contextual embeddings, which are used for semantic similarity computation. The re_rank_documents function encodes a query and the document sections into embeddings and computes the cosine similarity between the query and each section. The sections are then sorted by their similarity to the query, with the most similar ranked highest. The retrieve_relevant_section function transforms a query into a TF-IDF vector and searches the FAISS index to retrieve the top-k most relevant sections. These sections are then re-ranked using the re_rank_documents function for a more precise ordering based on semantic similarity. The chatAI_ARAG function combines context information with the top re-ranked sections and the user's query to form a prompt. This prompt is sent to a chat session (chat_session.send_message), which interacts with the language model to generate and return a text response.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer, util
from sklearn.feature_extraction.text import TfidfVectorizer

# Load a sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

def re_rank_documents(query, documents):
    # Encode the query and documents using the transformer model
    query_embedding = model.encode(query, convert_to_tensor=True)
    doc_embeddings = model.encode(documents, convert_to_tensor=True)
    # Compute cosine similarities
    cosine_scores = util.pytorch_cos_sim(query_embedding, doc_embeddings)[0]
    # Sort documents based on descending cosine similarity scores
    scores_and_docs = sorted(zip(cosine_scores.tolist(), documents), key=lambda x: x[0], reverse=True)
    # Keep the documents in ranked order and drop the scores
    top_ranked_docs = [doc for _, doc in scores_and_docs]
    return top_ranked_docs

# Pre-process and index documents for initial retrieval
# Split the paper text into sections using newline characters as delimiters.
sections = papers_text[0].split('\n')

# Create a TF-IDF vectorizer and use it to transform the sections into a TF-IDF matrix representation.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sections)

# Convert the sparse TF-IDF matrix to a dense float32 array, the dtype FAISS works with.
dense_vectors = tfidf_matrix.toarray().astype(np.float32)

# Create FAISS index
dimension = tfidf_matrix.shape[1]  # The number of features (or terms) in the TF-IDF matrix.
index = faiss.IndexFlatL2(dimension)  # A FAISS index using the L2 (Euclidean) distance metric.
index.add(dense_vectors)  # Add the dense TF-IDF vectors to the index.

def retrieve_relevant_section(query, top_k=10):
    # Compute query vector (float32, to match the index)
    query_vector = vectorizer.transform([query]).toarray().astype(np.float32)
    # Search FAISS index for initial retrieval
    _, indices = index.search(query_vector, top_k)
    initial_relevant_sections = [sections[i] for i in indices[0]]
    # Re-rank the initially retrieved sections
    top_ranked_docs = re_rank_documents(query, initial_relevant_sections)
    # Select a subset for the final response, if desired
    return " ".join(top_ranked_docs[:2])  # Take top 2 after re-ranking

def chatAI_ARAG(query):
    relevant_text = retrieve_relevant_section(query)
    full_prompt = f"{context_info}\n\nRelevant Information:\n{relevant_text}\n\nQuestion:\n{query}"
    response = chat_session.send_message(full_prompt)
    return response.text
Here are the responses and performances from the third model:
display(Markdown(response_1_arag))
“The review mentions three types of encoder-decoder transformer architectures: encoder-only models (like BERT), decoder-only or autoregressive models (like GPT-3), and encoder-decoder models (like BART and T5).”
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_1, response_1_arag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_1, response_1_arag)
Calculating scores:
ROUGE Scores:
rouge1: precision=0.6774, recall=0.2079, fmeasure=0.3182
rouge2: precision=0.3667, recall=0.1100, fmeasure=0.1692
rougeL: precision=0.5806, recall=0.1782, fmeasure=0.2727
BERTScore: Precision=0.7824, Recall=0.6746, F1=0.7245
display(Markdown(response_2_arag))
“Nimalsiri et al. developed MERGIS. It is a transformer-based encoder-decoder model that uses image segmentation to improve the accuracy of automated report generation.”
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_2, response_2_arag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_2, response_2_arag)
Calculating scores:
ROUGE Scores:
rouge1: precision=0.8400, recall=0.7500, fmeasure=0.7925
rouge2: precision=0.5417, recall=0.4815, fmeasure=0.5098
rougeL: precision=0.6800, recall=0.6071, fmeasure=0.6415
BERTScore: Precision=0.9269, Recall=0.8885, F1=0.9073
display(Markdown(response_3_arag))
“Özbey et al. developed AdaDiff. It is an Adaptive Diffusion Priors method for MRI reconstruction that dynamically adjusts its priors during inference to better match the test data distribution, leading to superior reconstruction quality and speed.”
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_3, response_3_arag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_3, response_3_arag)
Calculating scores:
ROUGE Scores:
rouge1: precision=0.6389, recall=0.4259, fmeasure=0.5111
rouge2: precision=0.2857, recall=0.1887, fmeasure=0.2273
rougeL: precision=0.5556, recall=0.3704, fmeasure=0.4444
BERTScore: Precision=0.8348, Recall=0.7984, F1=0.8162
display(Markdown(response_4_arag))
“Equation (1), q(x_t | x_{t-1}) = 𝒩(x_t; √(1 − β_t)·x_{t-1}, β_t·I), ∀t ∈ {1, …, T}, describes the forward diffusion process within Denoising Diffusion Probabilistic Models (DDPMs). It’s a Markov chain where Gaussian noise is iteratively added to the input data (x_{t-1}) at each timestep (t). The noise amount is controlled by the variance schedule (β_t), creating a sequence of increasingly noisy latent states (x_1 to x_T). The equation states that each noisy state x_t is sampled from a normal distribution (𝒩) with a mean dependent on the previous state (scaled by √(1 − β_t)) and a variance (β_t·I), where I is the identity matrix.”
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_4, response_4_arag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_4, response_4_arag)
Calculating scores:
ROUGE Scores:
rouge1: precision=0.4865, recall=0.4091, fmeasure=0.4444
rouge2: precision=0.1727, recall=0.1450, fmeasure=0.1577
rougeL: precision=0.2523, recall=0.2121, fmeasure=0.2305
BERTScore: Precision=0.7366, Recall=0.7771, F1=0.7563
display(Markdown(response_5_arag))
“Brandes et al. developed ProteinBERT. It is a specialized deep language model for protein sequences that combines local and global representations for comprehensive end-to-end processing.”
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_5, response_5_arag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_5, response_5_arag)
Calculating scores:
ROUGE Scores:
rouge1: precision=0.9259, recall=0.8333, fmeasure=0.8772
rouge2: precision=0.7308, recall=0.6552, fmeasure=0.6909
rougeL: precision=0.8519, recall=0.7667, fmeasure=0.8070
BERTScore: Precision=0.9472, Recall=0.9127, F1=0.9296
Judging by the BERTScore F1, the third model performs slightly better than the second on questions 1, 2, 3, and 5. Compared with the first model, it performs better only on question 3.
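To compare the three setups at a glance, the BERTScore F1 values reported above can be collected and summarised in a few lines. The numbers are copied from the outputs in this article; the labels `long_context`, `naive_rag`, and `advanced_rag` are just names chosen for this sketch.

```python
# BERTScore F1 per question (1 to 5), copied from the outputs above.
bert_f1 = {
    "long_context": [0.7670, 0.9195, 0.8116, 0.7645, 0.9491],
    "naive_rag":    [0.7205, 0.9032, 0.8141, 0.7804, 0.9245],
    "advanced_rag": [0.7245, 0.9073, 0.8162, 0.7563, 0.9296],
}

# Mean F1 per setup.
means = {name: sum(scores) / len(scores) for name, scores in bert_f1.items()}

# Best setup for each question.
best_per_question = [
    max(bert_f1, key=lambda name: bert_f1[name][q]) for q in range(5)
]

for name, mean in means.items():
    print(f"{name}: mean F1 = {mean:.4f}")
for q, winner in enumerate(best_per_question, start=1):
    print(f"question {q}: best = {winner}")
```

On these numbers, the long-context setup has the highest mean F1 (about 0.842, against about 0.829 for Naive RAG and 0.827 for Advanced RAG) and is best on questions 1, 2, and 5, while Advanced RAG is best on question 3 and Naive RAG on question 4. With only five questions and differences this small, the comparison is indicative rather than conclusive.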
Conclusions
Naive RAG suffers from limitations in its retrieval, generation, and augmentation stages. Retrieval struggles with precision and recall, which leads to irrelevant chunks being selected and crucial information being missed. Generation is prone to hallucination and to irrelevant, toxic, or biased outputs. Augmentation often fails to integrate retrieved information smoothly, resulting in incoherent outputs, redundancy from repeated information, and over-reliance on the retrieved material without insightful synthesis. Advanced RAG addresses these limitations with several strategies, leading to more relevant and accurate information retrieval and improved performance compared to Naive RAG.
In this study, I used re-ranking as a post-retrieval strategy. It performed slightly better than the Naive RAG on four of the five questions, but better than the native long-context LLM on only one. Overall, all three models deliver very similar results. The Advanced RAG developed here may need better tuning, perhaps a different embedding model and a pre-retrieval strategy. Even so, Gemini's long context window is a valuable alternative to Retrieval-Augmented Generation: it saves the time needed to build a retrieval architecture for a specific use case while delivering comparable overall performance and, for this use case, slightly better results.
Long Context Language Models show promising capabilities that could enable them to replace certain functions within RAG systems, with gains in scalability. I think it is too early to replace RAG architectures, although that may change in the future.
References:
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
A Comprehensive Review of Generative AI in Healthcare
Retrieval-Augmented Generation for Large Language Models: A Survey