Will Long Context Language Models Replace RAG?

Last Updated on December 10, 2024 by Editorial Team

Author(s): Claudio Giorgio Giancaterino

Originally published on Towards AI.

Updated production-ready Gemini models, reduced 1.5 Pro pricing, increased rate limits, and more — Google Developers Blog

Kaggle has launched a competition around the Gemini 1.5 model, introduced on August 8th, 2024, by Google, which aims to showcase its innovative capabilities, particularly its ability to process up to 2 million tokens in a single context window.

The Gemini 1.5 model is a family of multimodal models known for efficiently processing content across multiple modalities, including text, images, audio, and video. It consists of Gemini 1.5 Pro, which boasts enhanced features, and Gemini 1.5 Flash, optimised for speed and efficiency, maintaining high performance with less computational demand.

This topic was already addressed in the paper “Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?”. That paper introduces LOFT (Long-Context Frontiers), a benchmark designed to evaluate the performance of long-context language models (LCLMs) on tasks that typically rely on external tools like retrieval systems or databases. LOFT is built upon six tasks spanning 35 datasets across text, visual, and audio modalities. The researchers argue that LCLMs have the potential to replace these traditional tools by directly ingesting and processing vast amounts of information within their context windows. They evaluated three state-of-the-art LCLMs, Gemini 1.5 Pro, GPT-4o, and Claude 3 Opus, on LOFT and compared their performance to custom models specifically built for each task.

Chasing the competition, I stress-tested Gemini 1.5 Flash’s capabilities on the following paper: “A Comprehensive Review of Generative AI in Healthcare”. The paper provides a valuable overview of the current state of generative AI in healthcare, specifically diffusion models and transformer-based models, and highlights its potential to improve patient care and advance biomedical research.

[2310.00795] A Comprehensive Review of Generative AI in Healthcare

You can follow my analysis in this notebook.

To evaluate the performance of the model, I’ve selected 5 questions:

# Define a set of questions to ask regarding the paper's content.
question_1 = "Which types of encoder-decoder transformer architectures are discussed in the review ?"
question_2 = "Could you say who developed MERGIS, and what is it ?"
question_3 = "Could you say who developed AdaDiff, and what is it ?"
question_4 = "What process represents the equation 1 used in the review, could you explain it ?"
question_5 = "Could you say who developed ProteinBERT, and what is it ?"

with the following answers retrieved from the paper:

right_answer_1 = """
Encoder-only models:
These models are advantageous for text classification tasks such as sentiment analysis. An example of an LLM utilizing an encoder-only model is BERT.

Decoder-only or autoregressive models:
These models are suitable for text generation tasks, akin to the predictive text functionality in a smartphone chat application. For instance, as you input text, the AI predicts the subsequent word or phrase. An example of this model is GPT-3.

Encoder-decoder models:
These models facilitate generative AI tasks such as language translation and summarization. Notable LLMs employing this methodology include Facebook’s BART and Google’s T5.
"""
right_answer_2 = """
MERGIS, was proposed by (Nimalsiri et al. 2023). It uses image segmentation and a modern
transformer-based encoder-decoder model to enhance the accuracy of automated report generation.
"""
right_answer_3 = """
(Özbey et al. 2023) introduced an innovative technique named Adaptive Diffusion
Priors (AdaDiff) for the reconstruction of MRI. This approach involves a series of diffusion processes
that enhance the authenticity of the generated images. AdaDiff dynamically adjusts its priors during
the inference stage to align more closely with the distribution of the test data.
"""
right_answer_4 = """
The forward diffusion process is depicted as a Markov Chain, characterized by the inclusion of
Gaussian noise in a series of stages, culminating in generating noisy samples. The uncorrupted
or original data distribution is represented as q(x0). With a data sample x0 drawn from this
distribution, q(x0), a forward noising operation, denoted as p, is employed.
This operation introduces Gaussian noise iteratively at various time points, represented by t,
resulting in a series of latent states x1 through xT. The process can be mathematically defined as
follows:
q(xt | xt-1) = 𝒩(xt; √(1 - βt)·xt-1, βt·𝐈), ∀ t ∈ {1, …, T}
T denotes the number of diffusion steps, while β1, …, βT, each within the interval of [0, 1), signify
the variance schedule spread throughout the diffusion steps.
The identity matrix is symbolized by 𝐈, and 𝒩(x; μ, σ), which characterizes the normal distribution
possessing a mean of μ and a covariance of σ.
"""
right_answer_5 ="""
ProteinBERT was developed by (Brandes et al. 2022). It's a specialized deep language model for protein
sequences that amalgamates local and global representations for comprehensive end-to-end processing.
"""

…and I’ve used two metrics:

  • ROUGE is a statistical scoring method. It primarily focuses on the surface-level overlap of words and phrases without deeply considering the semantic meaning or reasoning behind the text. Although ROUGE can be a helpful tool for a quick and basic evaluation, it might not be the most accurate measure for complex language tasks where deeper understanding is crucial.
  • BERTScore is a hybrid method that combines statistical scoring and a model-based approach. It uses contextual embeddings generated by pre-trained models like BERT to evaluate the similarity between generated text and reference text. While this method is more semantically aware than purely statistical methods, it can be influenced by biases present in the training data of the pre-trained models.
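The helper functions calculate_rouge_scores and calculate_bert_score used throughout this analysis are defined in the notebook; a minimal sketch of how they could be implemented, assuming the rouge-score and bert-score packages, looks like this:

# Minimal sketch of the scoring helpers (assumed implementation based on the
# rouge-score and bert-score packages; the exact versions live in the notebook).
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def calculate_rouge_scores(reference, candidate):
    # Compare candidate against reference with unigram, bigram, and longest-common-subsequence ROUGE
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, candidate)
    print("ROUGE Scores:")
    for name, s in scores.items():
        print(f"{name}: precision={s.precision:.4f}, recall={s.recall:.4f}, fmeasure={s.fmeasure:.4f}")
    return scores

def calculate_bert_score(reference, candidate):
    # BERTScore compares contextual embeddings of candidate and reference
    precision, recall, f1 = bert_score([candidate], [reference], lang="en")
    print(f"BERTScore: Precision={precision.item():.4f}, Recall={recall.item():.4f}, F1={f1.item():.4f}")
    return precision.item(), recall.item(), f1.item()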

First solution: knowledge base with Gemini 1.5 Flash

With this solution there is a direct interaction between the Gemini model and the paper, allowing for the contextual setup of conversations and dynamic interaction based on user prompts.

For this purpose, I have implemented functions to start and interact with a chat session using a language model. The first function, start_chat_session, initiates a chat by combining some contextual information with the text from the document. The second function, chatAI, interacts with a previously started chat session. It takes the chat session, some contextual information, and a prompt. It then returns the model's response as text. At the end, a chat session is initialised for queries.
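The llm object, together with the papers_text and context_info variables used below, comes from earlier cells of the notebook and is not shown here. A plausible setup, assuming the google-generativeai SDK and a placeholder API key variable, would be:

# Assumed setup (not shown in the article): configure the Gemini API and create
# the model object referred to as `llm` below. GOOGLE_API_KEY is a placeholder.
import google.generativeai as genai

genai.configure(api_key=GOOGLE_API_KEY)
llm = genai.GenerativeModel("gemini-1.5-flash")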

# start_chat_session: Starts a chat by combining a context and the text of the document
def start_chat_session(context_info, papers_text):
    if papers_text:
        # Combine context with the text content of the first document
        combined_content = context_info + "\n\n" + papers_text[0]

        # Use the combined text in the model history
        chat_session = llm.start_chat(
            history=[
                {
                    'role': 'user',
                    'parts': [combined_content]
                }
            ]
        )
        return chat_session
    return None

# chatAI: Sends prompts to the chat session and returns the model's response
def chatAI(chat_session, context_info, prompt):
    if chat_session:
        # Combine the prompt with context to provide more depth
        full_prompt = f"{context_info}\n\n{prompt}"
        response = chat_session.send_message(full_prompt)
        return response.text
    return "No chat session initialised."

# The chat session is initiated using the extracted paper text and context.
chat_session = start_chat_session(context_info, papers_text)
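The response_*_base variables shown next are presumably obtained by calling chatAI once per question, along these lines:

# Assumed usage: one call per question with the plain long-context chat session.
response_1_base = chatAI(chat_session, context_info, question_1)
response_2_base = chatAI(chat_session, context_info, question_2)
response_3_base = chatAI(chat_session, context_info, question_3)
response_4_base = chatAI(chat_session, context_info, question_4)
response_5_base = chatAI(chat_session, context_info, question_5)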

Here are the responses and performances from the first model:

display(Markdown(response_1_base))

“The review mentions three types of encoder-decoder transformer architectures:

  1. Encoder-only models (e.g., BERT) suitable for text classification tasks.
  2. Decoder-only or autoregressive models (e.g., GPT-3) suitable for text generation tasks.
  3. Encoder-decoder models (e.g., Facebook’s BART and Google’s T5) suitable for tasks like language translation and summarization.”
print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_1, response_1_base)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_1, response_1_base)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.6724, recall=0.3861, fmeasure=0.4906
rouge2: precision=0.4561, recall=0.2600, fmeasure=0.3312
rougeL: precision=0.4828, recall=0.2772, fmeasure=0.3522
BERTScore: Precision=0.7877, Recall=0.7473, F1=0.7670
display(Markdown(response_2_base))

“MERGIS was developed by Nimalsiri et al. It is a transformer-based encoder-decoder model that uses image segmentation to enhance the accuracy of automated report generation.”

print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_2, response_2_base)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_2, response_2_base)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.8889, recall=0.8571, fmeasure=0.8727
rouge2: precision=0.6538, recall=0.6296, fmeasure=0.6415
rougeL: precision=0.7778, recall=0.7500, fmeasure=0.7636
BERTScore: Precision=0.9364, Recall=0.9032, F1=0.9195
display(Markdown(response_3_base))

“AdaDiff was developed by Özbey et al. It is an Adaptive Diffusion Priors method for MRI reconstruction that dynamically adjusts its priors during inference to better match the test data distribution, leading to improved reconstruction quality and speed.”

print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_3, response_3_base)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_3, response_3_base)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.6053, recall=0.4259, fmeasure=0.5000
rouge2: precision=0.2703, recall=0.1887, fmeasure=0.2222
rougeL: precision=0.5263, recall=0.3704, fmeasure=0.4348
BERTScore: Precision=0.8303, Recall=0.7937, F1=0.8116
display(Markdown(response_4_base))

“Equation (1), q(xt | xt-1) = 𝒩(xt; √(1 - βt)·xt-1, βt·I), ∀ t ∈ {1, …, T}, represents the forward diffusion process in Denoising Diffusion Probabilistic Models (DDPMs). This process is a Markov chain where Gaussian noise is iteratively added to the input data (xt-1) at each time step (t). The amount of noise added is controlled by the variance schedule (βt), resulting in a series of increasingly noisy latent states (x1 through xT). The equation shows that each noisy state xt is drawn from a normal distribution (𝒩) with a mean dependent on the previous state (scaled by √(1 - βt)) and a variance (βtI), where I is the identity matrix.”

print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_4, response_4_base)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_4, response_4_base)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.5517, recall=0.4848, fmeasure=0.5161
rouge2: precision=0.2174, recall=0.1908, fmeasure=0.2033
rougeL: precision=0.2845, recall=0.2500, fmeasure=0.2661
BERTScore: Precision=0.7437, Recall=0.7865, F1=0.7645
display(Markdown(response_5_base))

“ProteinBERT was developed by Brandes et al. It is a specialized deep language model for protein sequences that combines local and global representations for comprehensive end-to-end processing.”

print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_5, response_5_base)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_5, response_5_base)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.9310, recall=0.9000, fmeasure=0.9153
rouge2: precision=0.8214, recall=0.7931, fmeasure=0.8070
rougeL: precision=0.9310, recall=0.9000, fmeasure=0.9153
BERTScore: Precision=0.9656, Recall=0.9331, F1=0.9491

The answers are correct, and BERTScore reflects this better than ROUGE, with results close to 80% and even above 90%, which is remarkable considering that one of the questions is mathematical.

Then I compared this solution with two Retrieval-Augmented Generation (RAG) architectures: Naive RAG and Advanced RAG.

…but first, what is Retrieval-Augmented Generation?

[2312.10997] Retrieval-Augmented Generation for Large Language Models: A Survey

Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by retrieving relevant information from external databases, enhancing accuracy and credibility. RAG tackles limitations of LLMs, such as ‘hallucinations’ or generating incorrect content, by referencing external knowledge to supplement the LLMs’ internal knowledge.

The process starts with a user question to the Retrieval-Augmented Generation (RAG) system, which then transforms the user’s query into a format optimised for retrieval. This query is then used to search an external knowledge base, consisting of text documents, databases, knowledge graphs, or even previously generated LLM content. The system retrieves the most relevant chunks of information based on semantic similarity calculations. This retrieved information is combined with the user’s original query to create a comprehensive prompt for the LLM. This approach enables the LLM to generate a response enriched by its internal knowledge and the relevant external information.

Second solution: knowledge base RAG with Gemini 1.5 Flash

[2312.10997] Retrieval-Augmented Generation for Large Language Models: A Survey

Naive RAG represents the earliest methodology for Retrieval-Augmented Generation. This approach follows a traditional process of:

  • Indexing: Involves converting raw data into a uniform text format, segmenting this text into smaller chunks, encoding these chunks into vector representations, and storing them in a vector database for efficient similarity searches.
  • Retrieval: This stage transforms the user query into a vector representation and retrieves the most similar chunks from the indexed corpus, using these as expanded context.
  • Generation: The query and retrieved chunks create a prompt for a large language model, which generates a response based on the provided information.

The second model implements a simple information retrieval system combined with a Retrieval-Augmented Generation (RAG) model for responding to queries based on a paper’s content.

The paper’s text is split into sections. Then a TF-IDF (Term Frequency-Inverse Document Frequency) vectoriser converts the text sections into numerical vectors. This vectorisation is essential for comparing the text sections with the user’s query. FAISS (Facebook AI Similarity Search) is employed here to create an index that allows fast retrieval of the most similar sections. The vectors corresponding to each text section are added to the FAISS index, enabling efficient nearest neighbour searches. For each user query, FAISS retrieves the most relevant sections of the document, and these sections are fed as context, along with the query, to the Gemini model. The chatAI_RAG function formulates a prompt using context information (context_info), the relevant sections retrieved, and the user query. This prompt is sent to a chat session (chat_session.send_message), which interacts with the language model to generate and return a text response.

import numpy as np
import faiss
from sklearn.feature_extraction.text import TfidfVectorizer

# Split paper text into sections for better vectorization (e.g., paragraphs or sentences)
sections = papers_text[0].split('\n')  # Split by newline or any other delimiter as per requirement

# Compute TF-IDF matrix
np.random.seed(0)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sections)

# Create FAISS index
dimension = tfidf_matrix.shape[1]  # Dimension of TF-IDF vectors
index = faiss.IndexFlatL2(dimension)  # Using L2 (Euclidean) distance

# Convert the sparse matrix to dense for FAISS
dense_vectors = tfidf_matrix.toarray()

# Add vectors to the index
index.add(dense_vectors)

def retrieve_relevant_section(query, top_k=2):
    # Compute query vector
    query_vector = vectorizer.transform([query]).toarray()

    # Search FAISS index
    _, indices = index.search(query_vector, top_k)

    # Retrieve relevant text sections
    relevant_sections = [sections[i] for i in indices[0]]

    return " ".join(relevant_sections)

# Function to generate a response using retrieved information and context
def chatAI_RAG(query):
    # Retrieve relevant sections
    relevant_text = retrieve_relevant_section(query)

    # Combine context with relevant sections and user query
    full_prompt = f"{context_info}\n\nRelevant Information:\n{relevant_text}\n\nQuestion:\n{query}"
    response = chat_session.send_message(full_prompt)

    return response.text
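As before, the response_*_rag variables shown next are presumably generated with one call per question, for example:

# Assumed usage: one call per question through the Naive RAG pipeline.
response_1_rag = chatAI_RAG(question_1)
response_2_rag = chatAI_RAG(question_2)
response_3_rag = chatAI_RAG(question_3)
response_4_rag = chatAI_RAG(question_4)
response_5_rag = chatAI_RAG(question_5)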

Here are the responses and performances from the second model:

display(Markdown(response_1_rag))

“The review discusses three types of encoder-decoder transformer architectures: encoder-only models (e.g., BERT), decoder-only or autoregressive models (e.g., GPT-3), and encoder-decoder models (e.g., Facebook’s BART and Google’s T5).”

print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_1, response_1_rag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_1, response_1_rag)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.6579, recall=0.2475, fmeasure=0.3597
rouge2: precision=0.4324, recall=0.1600, fmeasure=0.2336
rougeL: precision=0.5789, recall=0.2178, fmeasure=0.3165
BERTScore: Precision=0.7573, Recall=0.6871, F1=0.7205
display(Markdown(response_2_rag))

“Nimalsiri et al. developed MERGIS. It’s a transformer-based encoder-decoder model that uses image segmentation to improve the accuracy of automated report generation.”

print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_2, response_2_rag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_2, response_2_rag)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.8400, recall=0.7500, fmeasure=0.7925
rouge2: precision=0.5417, recall=0.4815, fmeasure=0.5098
rougeL: precision=0.6800, recall=0.6071, fmeasure=0.6415
BERTScore: Precision=0.9198, Recall=0.8871, F1=0.9032
display(Markdown(response_3_rag))

“Özbey et al. developed AdaDiff. It’s an Adaptive Diffusion Priors method for MRI reconstruction that dynamically adjusts its priors during inference to better match the test data distribution, resulting in superior reconstruction quality and speed.”

print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_3, response_3_rag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_3, response_3_rag)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.6389, recall=0.4259, fmeasure=0.5111
rouge2: precision=0.2857, recall=0.1887, fmeasure=0.2273
rougeL: precision=0.5556, recall=0.3704, fmeasure=0.4444
BERTScore: Precision=0.8316, Recall=0.7973, F1=0.8141
display(Markdown(response_4_rag))

“Equation (1), q(xt | xt-1) = 𝒩(xt; √(1 - βt)·xt-1, βt·I), ∀ t ∈ {1, …, T}, represents the forward diffusion process in Denoising Diffusion Probabilistic Models (DDPMs). This is a Markov chain where Gaussian noise is iteratively added to the input data at each time step. The amount of noise added is controlled by the variance schedule (βt), resulting in a series of increasingly noisy latent states. The equation shows that each noisy state xt is drawn from a normal distribution (𝒩) with a mean dependent on the previous state (scaled by √(1 - βt)) and a variance (βtI), where I is the identity matrix.”

print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_4, response_4_rag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_4, response_4_rag)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.6224, recall=0.4621, fmeasure=0.5304
rouge2: precision=0.2577, recall=0.1908, fmeasure=0.2193
rougeL: precision=0.3163, recall=0.2348, fmeasure=0.2696
BERTScore: Precision=0.7739, Recall=0.7870, F1=0.7804
display(Markdown(response_5_rag))

“Brandes et al. developed ProteinBERT. It’s a specialized deep language model for protein sequences that uses both local and global representations for complete end-to-end processing.”

print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_5, response_5_rag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_5, response_5_rag)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.8929, recall=0.8333, fmeasure=0.8621
rouge2: precision=0.7037, recall=0.6552, fmeasure=0.6786
rougeL: precision=0.8214, recall=0.7667, fmeasure=0.7931
BERTScore: Precision=0.9408, Recall=0.9086, F1=0.9245

Looking at the results and the F1 score from BERTScore, the second model performs slightly better on questions 3 and 4.

Third solution: knowledge base advanced RAG with Gemini 1.5 Flash

[2312.10997] Retrieval-Augmented Generation for Large Language Models: A Survey

Advanced RAG builds upon the foundation of Naive RAG by introducing specific improvements to address its limitations. This paradigm focuses primarily on enhancing retrieval quality through:

  • Pre-Retrieval Strategies:

Refining indexing techniques: using a sliding window approach for chunking, fine-grained segmentation, and incorporating metadata to improve the quality of indexed content.

Optimising the original query: employing methods like query rewriting, transformation, and expansion to make the user’s question clearer and more suitable for retrieval.

  • Post-Retrieval Strategies:

Reranking chunks: rearranging retrieved chunks to prioritise the most relevant content.

Context compression: selecting essential information from retrieved documents, emphasising critical sections, and shortening the context to prevent information overload and focus the LLM on key details.

By implementing these strategies, Advanced RAG aims to overcome the retrieval challenges and augmentation hurdles faced in Naive RAG, leading to a more efficient and effective retrieval process.

The third model improves the Naive RAG with the re-ranking strategy. It initially retrieves a set of relevant sections vectorised using TF-IDF, converting text into numeric form based on term frequency and inverse document frequency, providing a basic relevance estimation. Then, these vectors are indexed using FAISS, a library for efficient vector similarity search, which supports quick retrieval of similar vectors in high-dimensional spaces. The additional step loads a pre-trained Sentence Transformer model (all-MiniLM-L6-v2) to generate contextual embeddings. This model is used for semantic similarity computation. The re_rank_documents function encodes a query and document sections into embeddings and computes cosine similarities between the query and each document section. The documents are then sorted based on their similarity scores to the query, with the most similar ones ranked highest. The retrieve_relevant_section function transforms a query into a TF-IDF vector and searches the FAISS index to retrieve the top-k most relevant sections. These sections are then re-ranked using the re_rank_documents function for more precise ordering based on semantic similarity. The chatAI_ARAG function combines context information with the top re-ranked document sections and the user's query to form a prompt. This prompt is sent to a chat session (chat_session.send_message) which interacts with a language model to generate and return a text response.

# Load a sentence transformer model for semantic re-ranking
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def re_rank_documents(query, documents):
    # Encode the query and documents using the transformer model
    query_embedding = model.encode(query, convert_to_tensor=True)
    doc_embeddings = model.encode(documents, convert_to_tensor=True)

    # Compute cosine similarities
    cosine_scores = util.pytorch_cos_sim(query_embedding, doc_embeddings)[0]

    # Sort documents based on descending cosine similarity scores
    scores_and_docs = sorted(zip(cosine_scores.tolist(), documents), key=lambda x: x[0], reverse=True)

    # Retrieve the documents in ranked order
    top_ranked_docs = [doc for _, doc in scores_and_docs]
    return top_ranked_docs

# Pre-process and index documents for initial retrieval
# Split the paper text into sections using newline characters as delimiters.
sections = papers_text[0].split('\n')

# Create a TF-IDF vectorizer and use it to transform the sections into a TF-IDF matrix representation.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sections)

# Convert the sparse TF-IDF matrix to a dense representation (array format).
dense_vectors = tfidf_matrix.toarray()

# Create FAISS index
dimension = tfidf_matrix.shape[1]  # The number of features (terms) in the TF-IDF matrix
index = faiss.IndexFlatL2(dimension)  # A FAISS index using the L2 (Euclidean) distance metric
index.add(dense_vectors)  # Add the dense TF-IDF vectors to the index

def retrieve_relevant_section(query, top_k=10):
    # Compute query vector
    query_vector = vectorizer.transform([query]).toarray()

    # Search FAISS index for initial retrieval
    _, indices = index.search(query_vector, top_k)
    initial_relevant_sections = [sections[i] for i in indices[0]]

    # Re-rank the initially retrieved sections
    top_ranked_docs = re_rank_documents(query, initial_relevant_sections)

    # Select a subset for the final response
    return " ".join(top_ranked_docs[:2])  # Take top 2 after re-ranking

def chatAI_ARAG(query):
    relevant_text = retrieve_relevant_section(query)
    full_prompt = f"{context_info}\n\nRelevant Information:\n{relevant_text}\n\nQuestion:\n{query}"
    response = chat_session.send_message(full_prompt)
    return response.text
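Likewise, the response_*_arag variables shown next are presumably produced by querying the advanced RAG pipeline once per question:

# Assumed usage: one call per question through the advanced RAG pipeline.
response_1_arag = chatAI_ARAG(question_1)
response_2_arag = chatAI_ARAG(question_2)
response_3_arag = chatAI_ARAG(question_3)
response_4_arag = chatAI_ARAG(question_4)
response_5_arag = chatAI_ARAG(question_5)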

Here are the responses and performances from the third model:

display(Markdown(response_1_arag))

“The review mentions three types of encoder-decoder transformer architectures: encoder-only models (like BERT), decoder-only or autoregressive models (like GPT-3), and encoder-decoder models (like BART and T5).”

print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_1, response_1_arag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_1, response_1_arag)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.6774, recall=0.2079, fmeasure=0.3182
rouge2: precision=0.3667, recall=0.1100, fmeasure=0.1692
rougeL: precision=0.5806, recall=0.1782, fmeasure=0.2727
BERTScore: Precision=0.7824, Recall=0.6746, F1=0.7245
display(Markdown(response_2_arag))

“Nimalsiri et al. developed MERGIS. It is a transformer-based encoder-decoder model that uses image segmentation to improve the accuracy of automated report generation.”

print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_2, response_2_arag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_2, response_2_arag)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.8400, recall=0.7500, fmeasure=0.7925
rouge2: precision=0.5417, recall=0.4815, fmeasure=0.5098
rougeL: precision=0.6800, recall=0.6071, fmeasure=0.6415
BERTScore: Precision=0.9269, Recall=0.8885, F1=0.9073
display(Markdown(response_3_arag))

“Özbey et al. developed AdaDiff. It is an Adaptive Diffusion Priors method for MRI reconstruction that dynamically adjusts its priors during inference to better match the test data distribution, leading to superior reconstruction quality and speed.”

print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_3, response_3_arag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_3, response_3_arag)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.6389, recall=0.4259, fmeasure=0.5111
rouge2: precision=0.2857, recall=0.1887, fmeasure=0.2273
rougeL: precision=0.5556, recall=0.3704, fmeasure=0.4444
BERTScore: Precision=0.8348, Recall=0.7984, F1=0.8162
display(Markdown(response_4_arag))

“Equation (1), q(xt | xt-1) = 𝒩(xt; √(1 - βt)·xt-1, βt·I), ∀ t ∈ {1, …, T}, describes the forward diffusion process within Denoising Diffusion Probabilistic Models (DDPMs). It’s a Markov chain where Gaussian noise is iteratively added to the input data (xt-1) at each timestep (t). The noise amount is controlled by the variance schedule (βt), creating a sequence of increasingly noisy latent states (x1 to xT). The equation states that each noisy state xt is sampled from a normal distribution (𝒩) with a mean dependent on the previous state (scaled by √(1 - βt)) and a variance (βtI), where I is the identity matrix.”

print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_4, response_4_arag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_4, response_4_arag)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.4865, recall=0.4091, fmeasure=0.4444
rouge2: precision=0.1727, recall=0.1450, fmeasure=0.1577
rougeL: precision=0.2523, recall=0.2121, fmeasure=0.2305
BERTScore: Precision=0.7366, Recall=0.7771, F1=0.7563
display(Markdown(response_5_arag))

“Brandes et al. developed ProteinBERT. It is a specialized deep language model for protein sequences that combines local and global representations for comprehensive end-to-end processing.”

print("Calculating scores:")
rouge_scores = calculate_rouge_scores(right_answer_5, response_5_arag)
bert_precision, bert_recall, bert_f1 = calculate_bert_score(right_answer_5, response_5_arag)

Calculating scores:
ROUGE Scores:
rouge1: precision=0.9259, recall=0.8333, fmeasure=0.8772
rouge2: precision=0.7308, recall=0.6552, fmeasure=0.6909
rougeL: precision=0.8519, recall=0.7667, fmeasure=0.8070
BERTScore: Precision=0.9472, Recall=0.9127, F1=0.9296

Looking at the results and the F1 score from BERTScore, the third model performs slightly better than the second model on questions 1, 2, 3, and 5. Compared with the first model, the third model performs better only on question 3.

Conclusions

Naive RAG suffers from limitations in its retrieval, generation, and augmentation processes. Retrieval challenges arise from issues with precision and recall, leading to the selection of irrelevant chunks and missing crucial information. Generation faces difficulties with hallucination and potentially irrelevant, toxic, or biased outputs. Augmentation doesn’t integrate retrieved information smoothly, often resulting in incoherent outputs, redundancy from repeated information, and an over-reliance on the retrieved material without adding insightful synthesis. Advanced RAG addresses these limitations with several strategies, leading to more relevant and accurate retrieval and improved performance compared to Naive RAG.

In this study, I used re-ranking as a post-retrieval strategy and obtained slightly better performance than Naive RAG on almost all questions, but not on all of them when compared with the native LLM. Overall, the three models deliver quite similar results. The Advanced RAG implementation may require better tuning, perhaps a different embedding model and a pre-retrieval strategy, but Gemini’s long context window is certainly a strong competitor to Retrieval-Augmented Generation: it saves the time of building an architecture for a specific use case while delivering roughly the same performance, and for this use case slightly better results. Long Context Language Models exhibit promising capabilities that could enable them to replace certain functions within RAG systems, with gains in scalability. Still, I think it is too early to replace RAG architectures; perhaps in the future.

References:

Kaggle Competition

Notebook

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

A Comprehensive Review of Generative AI in Healthcare

Retrieval-Augmented Generation for Large Language Models: A Survey


Published via Towards AI
