
Dense X Retrieval Technique in Langchain and LlamaIndex

Last Updated on December 21, 2023 by Editorial Team

Author(s): Eduardo Muñoz

Originally published on Towards AI.

Picture by nadi borodina from Unsplash

Introduction

On December 12th, 2023, the research paper “Dense X Retrieval: What Retrieval Granularity Should We Use?” [1], by Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Hongming Zhang, and Dong Yu from the University of Washington, Tencent AI Lab, University of Pennsylvania, and Carnegie Mellon University, was released. It explores the impact of retrieval unit choice on the performance of dense retrieval in open-domain Natural Language Processing (NLP) tasks.

Dense retrieval has emerged as a crucial method for obtaining relevant context or knowledge in open-domain NLP tasks. However, the choice of the retrieval unit, i.e., the pieces of text in which the corpus is indexed, such as a document, passage, or sentence, is often overlooked when a learned dense retriever is applied to a retrieval corpus at inference time. The researchers found that the choice of retrieval unit significantly influences the performance of both retrieval and downstream tasks.

“In this paper, we investigate an overlooked research question with dense retrieval inference — at what retrieval granularity should we segment and index the retrieval corpus? We discover that selecting the proper retrieval granularity at inference time can be a simple yet effective strategy for improving dense retrievers’ retrieval and downstream task performance”.[1]

Propositions as the retrieval unit

In contrast to the conventional approach of using passages or sentences as retrieval units, the researchers propose a novel retrieval unit called a ‘proposition’ for dense retrieval. A proposition is defined as an atomic expression within the text, encapsulating a distinct factoid and presented in a concise, self-contained natural language. This innovative approach aims to enhance the efficiency and effectiveness of dense retrieval in open-domain NLP tasks by optimizing the retrieval unit choice.

The authors define propositions as:

1. Each proposition should correspond to a distinct piece of meaning in the text, where the composition of all propositions would represent the semantics of the entire text.

2. A proposition should be minimal, i.e., it cannot be further split into separate propositions.

3. A proposition should be contextualized and self-contained, including all the necessary context from the text.
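As a purely illustrative example (mine, not taken from the paper), here is a short passage and one possible proposition-level decomposition of it:

passage = (
    "Marie Curie won the Nobel Prize in Physics in 1903, which she shared "
    "with her husband Pierre and with Henri Becquerel."
)

# Each proposition is minimal, self-contained, and replaces pronouns and
# elliptical references with the entities they refer to.
propositions = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "Marie Curie shared the 1903 Nobel Prize in Physics with Pierre Curie.",
    "Marie Curie shared the 1903 Nobel Prize in Physics with Henri Becquerel.",
    "Pierre Curie was Marie Curie's husband.",
]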

The paper suggests that the choice of retrieval unit can significantly impact retrieval performance, highlighting the potential of propositions as a novel retrieval unit. This research contributes to ongoing efforts to improve dense retrieval in open-domain NLP tasks and offers valuable insights for researchers and practitioners in the field.

Picture by Taras Hrytsak from Unsplash

Evaluation

The authors compare the inference performance of six dense retriever models, either supervised (trained with human-labeled query-passage pairs) or unsupervised, in the context of open-domain question answering (QA). They processed and indexed an English Wikipedia dump, referred to as FACTOIDWIKI, with documents segmented into propositions. The experiments were conducted on five different open-domain QA datasets, comparing the performance of the six dual-encoder retrievers when Wikipedia is indexed by passage, by sentence, and by the proposed proposition unit.

To segment the corpus into propositions, the authors introduce the Propositionizer, a text generation model fine-tuned through a two-step distillation process involving GPT-4 and Flan-T5-large. The approach combines the capabilities of pre-trained language models with task-specific fine-tuning, demonstrating a comprehensive strategy for parsing passages into propositions. The use of 1-shot demonstrations and distillation adds depth to the training methodology, showcasing a nuanced and effective approach to this natural language processing task.
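As a hedged sketch of how such a distilled sequence-to-sequence propositionizer could be run with Hugging Face transformers (the checkpoint name and input format below are placeholders, not necessarily the authors' released artifacts):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder checkpoint name: substitute the actual fine-tuned Propositionizer.
MODEL_NAME = "your-org/propositionizer-flan-t5-large"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Illustrative input; the exact "Title ... Content ..." format depends on how
# the model was fine-tuned.
passage = (
    "Title: Leaning Tower of Pisa. Content: Prior to restoration work performed "
    "between 1990 and 2001, the tower leaned at an angle of 5.5 degrees."
)

inputs = tokenizer(passage, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=512)

# The prompt quoted later in this article asks for the propositions as a
# JSON-formatted list of strings, so the decoded text should look like one.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))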

The evaluation is two-fold, focusing on both retrieval performance and the impact on downstream QA tasks. The key finding is that proposition-based retrieval outperforms sentence and passage-based methods, particularly in terms of generalization. This suggests that propositions, due to their compact nature and rich context, enable dense retrievers to access precise information while maintaining adequate context.

The results indicate an average improvement over passage-based retrieval, with Recall@20 showing a significant increase of +10.1 on unsupervised dense retrievers and +2.2 on supervised retrievers. Additionally, the study observes a distinct advantage in downstream QA performance when employing proposition-based retrieval.[1]

One notable implication highlighted in the findings is the suitability of propositions for overcoming the often limited input token length in language models. Propositions inherently provide a higher density of question-relevant information, enhancing the effectiveness of dense retrievers in accessing pertinent data for QA tasks.
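To make the token-budget point concrete, here is a small, purely illustrative sketch (neither the paper nor the templates require tiktoken) that counts how many retrieved units fit into a fixed reader budget:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def units_that_fit(units, budget=1000):
    """Count how many retrieved text units fit into a fixed prompt token budget."""
    used, count = 0, 0
    for text in units:
        n_tokens = len(enc.encode(text))
        if used + n_tokens > budget:
            break
        used += n_tokens
        count += 1
    return count

# A ~100-word passage is roughly 130-150 tokens, while a single proposition is
# often only 15-30 tokens, so far more propositions (i.e., distinct facts) can
# be packed into the same reader budget.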

Conclusion

In conclusion, the study underscores the potential of proposition-based retrieval as a superior approach, offering improved performance in both retrieval tasks and downstream QA applications. The compact yet context-rich nature of propositions appears to be a valuable asset in addressing the challenges posed by limited token length in language models.

Picture by Markus Spiske from Unsplash

Implementation in Langchain

You can try this new approach using the template that Langchain has provided in their repo [2]. The prompt they use is the following:

“SYSTEM

Decompose the “Content” into clear and simple propositions, ensuring they are interpretable out of context.

1. Split compound sentence into simple sentences. Maintain the original phrasing from the input whenever possible.

2. For any named entity that is accompanied by additional descriptive information, separate this information into its own distinct proposition.

3. Decontextualize the proposition by adding necessary modifier to nouns or entire sentences and replacing pronouns (e.g., “it”, “he”, “she”, “they”, “this”, “that”) with the full name of the entities they refer to.

4. Present the results as a list of strings, formatted in JSON.

Example: {A complete example here …}

HUMAN

Decompose the following:

{input}”
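In code, this prompt corresponds to a ChatPromptTemplate along the following lines (a simplified sketch; the template's actual PROMPT object also embeds the complete worked example in the system message):

from langchain.prompts import ChatPromptTemplate

# Simplified sketch of the template's PROMPT; the real system message also
# contains the full few-shot example referenced above.
PROMPT = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Decompose the \"Content\" into clear and simple propositions, "
            "ensuring they are interpretable out of context. "
            "1. Split compound sentence into simple sentences. ... "
            "4. Present the results as a list of strings, formatted in JSON.",
        ),
        ("user", "Decompose the following:\n{input}"),
    ]
)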

The Langchain template consists of a few .py files with the different sections of code. I’ve developed a Jupyter notebook that gathers all the code so you can get the full picture and easily modify it for your own tests.

The main components in the notebook are:

Propositional Chain: A chain that creates the decomposed propositions from the text; it will be used to populate the index.

# PROMPT (sketched above), get_propositions and empty_proposals are defined
# in the template's source files.
proposition_chain = (
    PROMPT
    | ChatOpenAI(model="gpt-3.5-turbo-16k").bind(
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "decompose_content",
                    "description": "Return the decomposed propositions",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "propositions": {
                                "type": "array",
                                "items": {"type": "string"},
                            }
                        },
                        "required": ["propositions"],
                    },
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "decompose_content"}},
    )
    | JsonOutputToolsParser()
    | get_propositions
).with_fallbacks([RunnableLambda(empty_proposals)])
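As a quick usage sketch (assuming the `input` key matches the `{input}` placeholder in the prompt shown earlier, and that the template's `get_propositions` helper returns the list of strings):

sample_text = (
    "Dense retrieval obtains relevant context for open-domain NLP tasks by "
    "indexing a corpus into retrieval units such as passages or sentences."
)

# Should return the decomposed propositions for this chunk, or the fallback's
# empty result if the tool call fails.
propositions = proposition_chain.invoke({"input": sample_text})
print(propositions)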

Main RAG Chain: Creates a multi-vector retriever and builds the RAG chain that answers the user’s questions.

 """
The RAG chain

:param retriever: A function that retrieves the necessary context for the model.
:return: A chain of functions representing the multi-modal RAG process.
"""

model = ChatOpenAI(temperature=0, model="gpt-4-1106-preview", max_tokens=1024)
prompt = ChatPromptTemplate.from_messages(
[
(
"system",
"You are an AI assistant. Answer based on the retrieved documents:"
"\n<Documents>\n{context}\n</Documents>",
),
("user", "{question}?"),
]
)

# Define the RAG pipeline
chain = (
{
"context": retriever U+007C format_docs,
"question": RunnablePassthrough(),
}
U+007C prompt
U+007C model
U+007C StrOutputParser()
)

return chain

Ingest the data: Read the data and, using the propositional chain defined above, create the vector index with the propositions.

# Could add more parsing here, as it's very raw.
loader = RecursiveUrlLoader(
    "https://ar5iv.labs.arxiv.org/html/1706.03762",
    max_depth=2,
    extractor=lambda x: Soup(x, "html.parser").text,
)
data = loader.load()
print(f"Loaded {len(data)} documents")

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=8000, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
print(f"Split into {len(all_splits)} documents")

# Create retriever
retriever_multi_vector_img = create_index(
    all_splits,
    proposition_chain,
    DOCSTORE_ID_KEY,
    "llama2-paper",
)
Invoke the RAG chain: Use the chain to answer the user’s questions, as in the sketch below.
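For instance, assuming the notebook wraps the chain construction in a helper like the `rag_chain` function above and reuses the retriever built during ingestion:

# Build the RAG chain on top of the multi-vector retriever created above.
chain = rag_chain(retriever_multi_vector_img)

# Ask a question about the ingested paper; the retriever matches propositions,
# maps them back to their parent chunks, and the LLM answers from those chunks.
print(chain.invoke("What is multi-head attention and why is it used?"))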

You can check the notebook in my repo to explore the code and make the changes for your own problem.

Implementation in LlamaIndex

LlamaIndex provides a LlamaPack that “creates a query engine that uses a RecursiveRetriever in llama-index to fetch nodes based on propositions extracted from each node. We use the provided OpenAI prompt from their paper to generate propositions, which are then embedded and used to retrieve their parent node chunks” (from the LlamaPack description [3]).

It is very easy to use:

from llama_index import SimpleDirectoryReader
from llama_index.llama_pack import download_llama_pack
from llama_index.llms import OpenAI
from llama_index.text_splitter import SentenceSplitter

# download and install dependencies
DenseXRetrievalPack = download_llama_pack(
    "DenseXRetrievalPack", "./dense_pack"
)

documents = SimpleDirectoryReader("./data").load_data()

# uses the LLM to extract propositions from every document/node!
dense_pack = DenseXRetrievalPack(documents)

# or customize the LLMs and the text splitter
dense_pack = DenseXRetrievalPack(
    documents,
    proposition_llm=OpenAI(model="gpt-3.5-turbo", ...),
    query_llm=OpenAI(model="gpt-3.5-turbo", ...),
    text_splitter=SentenceSplitter(chunk_size=1024),
)
dense_query_engine = dense_pack.query_engine

response = dense_query_engine.query("How was Llama2 pretrained?")

You can find a notebook with an example in my repo; it is adapted from the original one by LlamaIndex.

I hope you find this article relevant for your future RAG implementations; as the authors suggest, this technique might boost your application’s performance.

References

[1] Paper “Dense X Retrieval: What Retrieval Granularity Should We Use?” by Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Hongming Zhang, and Dong Yu (University of Washington, Tencent AI Lab, University of Pennsylvania, and Carnegie Mellon University).

[2] Langchain GitHub repo

[3] LlamaIndex implementation
