Improving Document Comprehension Using LangChain and OpenAI
Last Updated on November 5, 2023 by Editorial Team
Author(s): Abhishek Chaudhary
Originally published on Towards AI.
With the tremendous growth of generative AI and language models that can comprehend and extract information from documents, we are witnessing a new era in which models like GPT help humans extract and interpret knowledge more effectively.
In this blog post, we'll walk through one such use case: using the OpenAI APIs together with LangChain, we'll build a system that can extract information from a document and answer questions about it. Let's get started.
Step 0: Set up your environment
For this article, we'll be using Jupyter Notebook and the OpenAI API.
Here are the modules we'll need for this walkthrough.
pip install openai tiktoken chromadb langchain BeautifulSoup4
We'll also need to set the OpenAI API credentials:
import os
os.environ["OPENAI_API_KEY"] = "sk-xxxx"
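Hardcoding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key has already been exported in the shell that launches Jupyter; the helper name `require_api_key` is hypothetical and not part of the OpenAI or LangChain APIs:

```python
import os

def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error if it is missing."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting Jupyter")
    return key
```

Calling `require_api_key()` once at the top of the notebook fails early with a readable message instead of a later authentication error.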
Once this is out of the way, we can move on to the document itself.
Step 1: Getting the document content
We'll be using an article from the Berkeley Artificial Intelligence Research blog: https://bair.berkeley.edu/blog/2023/07/14/ddpo/
LangChain provides a handy way to read data from a web URL and convert it into its Document format. You can read more about the document loaders here.
We'll make use of a WebBaseLoader as shown below.
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://bair.berkeley.edu/blog/2023/07/14/ddpo/")
data = loader.load()
print(data[0].page_content)
> '\n\n\n\n\nTraining Diffusion Models with Reinforcement Learning β The Berkeley Artificial Intelligence Research Blog\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSubscribe\nAbout\nArchive\nBAIR\n\n\n\n\n\n\n\n\nTraining Diffusion Models with Reinforcement Learning\n\nKevin Black \xa0\xa0\n \n \n Jul 14, 2023\n \n \n\n\n\n\n\n\n\n\n\n\nTraining Diffusion Models with Reinforcement Learning\n\n\n\n\n\n\nreplay\n\n\nDiffusion models have recently emerged as the de facto standard for generating complex, high-dimensional outputs. You may know them for their ability to produce stunning AI art and hyper-realistic synthetic images, but they have also found success in other applications such as drug design\xa0and continuous control. The key idea behind diffusion models is to iteratively transform random noise into a sample, such as an image or protein structure. This is typically motivated as a maximum likelihood estimation\xa0problem, where the model is trained to generate samples that match the training data as closely as possible.\nHowever, most use cases of diffusion models are not directly concerned with matching the training data, but instead with a downstream objective. We donβt just want an image that looks like existing images, but one that has a specific type of appearance; we donβt just want a drug molecule that is physically plausible, but one that is as effective as possible. In this post, we show how diffusion models can be trained on these downstream objectives directly using reinforcement learning (RL). To do this, we finetune Stable Diffusion\xa0on a variety of objectives, including image compressibility, human-perceived aesthetic quality, and prompt-image alignment. 
The last of these objectives uses feedback from a large vision-language model\xa0to improve the modelβs performance on unusual prompts, demonstrating how powerful AI models can be used to improve each other\xa0without any humans in the loop.\n\n\n\n\n\n\n A diagram illustrating the prompt-image alignment objective. It uses LLaVA, a large vision-language model, to evaluate generated images.\n \n\n\nDenoising Diffusion Policy Optimization\nWhen turning diffusion into an RL problem, we make only the most basic assumption: given a sample (e.g. an image), we have access to a reward function\xa0that we can\xa0evaluate to tell us how βgoodβ that sample is. Our goal is for the diffusion model to generate samples that maximize this reward function.\nDiffusion models are typically trained using a loss function derived from maximum likelihood estimation (MLE), meaning they are encouraged to generate samples that make the training data look more likely. In the RL setting, we no longer have training data, only samples from the diffusion model and their associated rewards. One way we can still use the same MLE-motivated loss function is by treating the samples as training data and incorporating the rewards by weighting the loss for each sample by its reward. This gives us an algorithm that we call reward-weighted regression (RWR), after existing algorithms\xa0from RL literature.\nHowever, there are a few problems with this approach. One is that RWR is not a particularly exact algorithm β it maximizes the reward only approximately\xa0(see Nair et. al., Appendix A).\xa0The MLE-inspired loss for diffusion is also not exact and is instead derived using a variational bound\xa0on the true likelihood of each sample. 
This means that RWR maximizes the reward through two levels of approximation, which we find significantly hurts its performance.\n\n\n\n\n\n We evaluate two variants of DDPO and two variants of RWR on three reward functions and find that DDPO consistently achieves the best performance.\n \n\n\nThe key insight of our algorithm, which we call denoising diffusion policy optimization (DDPO), is that we can better maximize the reward of the final sample if we pay attention to the entire sequence of denoising steps that got us there. To do this, we reframe the diffusion process as a multi-step Markov decision process (MDP). In MDP terminology: each denoising step is an action, and the agent\xa0only gets a reward on the final step of each denoising trajectory\xa0when the final sample is produced. This framework allows us to apply many powerful algorithms from RL literature that are designed specifically for multi-step MDPs. Instead of using the approximate likelihood of the final sample, these algorithms use the exact likelihood of each denoising step, which is extremely easy to compute.\nWe chose to apply policy gradient algorithms due to their ease of implementation and past success in language model finetuning. This led to two variants of DDPO: DDPOSF, which uses the simple score function estimator of the policy gradient also known as REINFORCE; and DDPOIS, which uses a more powerful importance sampled estimator. DDPOIS\xa0is our best-performing algorithm and its implementation closely follows that of proximal policy optimization (PPO).\nFinetuning Stable Diffusion Using DDPO\nFor our main results, we finetune Stable Diffusion v1-4\xa0using DDPOIS. We have four tasks, each defined by a different reward function:\n\nCompressibility: How easy is the image to compress using the JPEG algorithm? The reward is the negative file size of the image (in kB) when saved as a JPEG.\nIncompressibility:\xa0How hard\xa0is the image to compress using the JPEG algorithm? 
The reward is the positive\xa0file size of the image (in kB) when saved as a JPEG.\nAesthetic Quality: How aesthetically appealing is the image to the human eye? The reward is the output of the LAION aesthetic predictor, which is a neural network trained on human preferences.\nPrompt-Image Alignment: How well does the image represent what was asked for in the prompt? This one is a bit more complicated: we feed the image into LLaVA, ask it to describe the image, and then compute the similarity between that description and the original prompt using BERTScore.\n\nSince Stable Diffusion is a text-to-image model, we also need to pick a set of prompts to give it during finetuning. For the first three tasks, we use simple prompts of the form βa(n) [animal]β. For prompt-image alignment, we use prompts of the form βa(n) [animal] [activity]β, where the activities are βwashing dishesβ, βplaying chessβ, and βriding a bikeβ. We found that Stable Diffusion often struggled to produce images that matched the prompt for these unusual scenarios, leaving plenty of room for improvement with RL finetuning.\nFirst, we illustrate the performance of DDPO on the simple rewards (compressibility, incompressibility, and aesthetic quality). All of the images are generated with the same random seed. In the top left quadrant, we illustrate what βvanillaβ Stable Diffusion generates for nine different animals; all of the RL-finetuned models show a clear qualitative difference. Interestingly, the aesthetic quality model (top right) tends towards minimalist black-and-white line drawings, revealing the kinds of images that the LAION aesthetic predictor considers βmore aestheticβ.1\n\n\n\nNext, we demonstrate DDPO on the more complex prompt-image alignment task. Here, we show several snapshots from the training process: each series of three images shows samples for the same prompt and random seed over time, with the first sample coming from vanilla Stable Diffusion. 
Interestingly, the model shifts towards a more cartoon-like style, which was not intentional. We hypothesize that this is because animals doing human-like activities are more likely to appear in a cartoon-like style in the pretraining data, so the model shifts towards this style to more easily align with the prompt by leveraging what it already knows.\n\n\n\nUnexpected Generalization\nSurprising generalization has been found to arise when finetuning large language models with RL: for example, models finetuned on instruction-following only in English often improve in other languages. We find that the same phenomenon occurs with text-to-image diffusion models. For example, our aesthetic quality model was finetuned using prompts that were selected from a list of 45 common animals. We find that it generalizes not only to unseen animals but also to everyday objects.\n\n\n\nOur prompt-image alignment model used the same list of 45 common animals during training, and only three activities. We find that it generalizes not only to unseen animals but also to unseen activities, and even novel combinations of the two.\n\n\n\nOveroptimization\nIt is well-known that finetuning on a reward function, especially a learned one, can lead to reward overoptimization\xa0where\xa0the model exploits the reward function to achieve a high reward in a non-useful way. 
Our setting is no exception: in all the tasks, the model eventually destroys any meaningful image content to maximize reward.\n\n\n\nWe also discovered that LLaVA is susceptible to typographic attacks: when optimizing for alignment with respect to prompts of the form β[n]\xa0animalsβ, DDPO was able to successfully fool LLaVA by instead generating text loosely resembling the correct number.\n\n\n\nThere is currently no general-purpose method for preventing overoptimization, and we highlight this problem as an important area for future work.\nConclusion\nDiffusion models are hard to beat when it comes to producing complex, high-dimensional outputs. However, so far theyβve mostly been successful in applications where the goal is to learn patterns from lots and lots of data (for example, image-caption pairs). What weβve found is a way to effectively train diffusion models in a way that goes beyond pattern-matching β and without necessarily requiring any training data. The possibilities are limited only by the quality and creativity of your reward function.\nThe way we used DDPO in this work is inspired by the recent successes of language model finetuning. OpenAIβs GPT models, like Stable Diffusion, are first trained on huge amounts of Internet data; they are then finetuned with RL to produce useful tools like ChatGPT. Typically, their reward function is learned from human preferences, but others\xa0have more recently figured out how to produce powerful chatbots using reward functions based on AI feedback instead. Compared to the chatbot regime,\xa0our experiments are small-scale and limited in scope. But considering the enormous success of this βpretrain + finetuneβ paradigm in language modeling, it certainly seems like itβs worth pursuing further in the world of diffusion models. 
We hope that others can build on our work to improve large diffusion models, not just for text-to-image generation, but for many exciting applications such as video generation, music generation, \xa0image editing, protein synthesis, robotics, and more.\nFurthermore, the βpretrain + finetuneβ paradigm is not the only way to use DDPO. As long as you have a good reward function, thereβs nothing stopping you from training with RL from the start. While this setting is as-yet unexplored, this is a place where the strengths of DDPO could really shine. Pure RL has long been applied to a wide variety of domains ranging from playing games\xa0to robotic manipulation\xa0to nuclear fusion\xa0to chip design. Adding the powerful expressivity of diffusion models to the mix has the potential to take existing applications of RL to the next level β or even to discover new ones.\n\nThis post is based on the following paper:\n\n\nTraining Diffusion Models with Reinforcement Learning\n\nKevin\xa0Black*,\n Michael\xa0Janner*,\n Yilun\xa0Du,\n Ilya\xa0Kostrikov,\n and Sergey\xa0Levine\n\narXiv Preprint.\n\n\n\nIf you want to learn more about DDPO, you can check out the paper, website, original code, or get the model weights on Hugging Face. If you want to use DDPO in your own project, check out my PyTorch + LoRA implementation where you can finetune Stable Diffusion with less than 10GB of GPU memory!\nIf DDPO inspires your work, please cite it with:\n@misc{black2023ddpo,\n title={Training Diffusion Models with Reinforcement Learning}, \n author={Kevin Black and Michael Janner and Yilun Du and Ilya Kostrikov and Sergey Levine},\n year={2023},\n eprint={2305.13301},\n archivePrefix={arXiv},\n primaryClass={cs.LG}\n}\n\n\n\n\n\n\n So, it turns out that the aesthetic score model we used was not exactly... correct. 
Check out this GitHub issue for the riveting details involving Google Cloud TPUs, floating point formats, and the CLIP image encoder.\n β©\n\n\n\n\n\n\n\nSubscribe to our RSS feed.\n\n \n\n Spread the word: \n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
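The raw `page_content` above contains long runs of newlines and non-breaking spaces (`\xa0`) left over from the HTML layout. A minimal sketch of one way to tidy it before chunking, using only the standard library; `squeeze_whitespace` is a hypothetical helper, not something LangChain provides:

```python
import re

def squeeze_whitespace(text):
    """Collapse whitespace noise left over from HTML extraction."""
    text = text.replace("\xa0", " ")        # non-breaking spaces -> plain spaces
    text = re.sub(r"[ \t]+", " ", text)     # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)    # drop spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # 3+ newlines -> a single blank line
    return text.strip()
```

Applied to the loaded page text, this would keep paragraph breaks while removing most of the empty space, so fewer characters of each chunk are spent on layout residue.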
Step 2: Transform the documents into fixed-length chunks
Once we have the document content, the next step is to split it into fixed-size chunks so that the text fits into the context window of our chosen model. We'll use RecursiveCharacterTextSplitter with a chunk size of 500 characters.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 0)
all_splits = text_splitter.split_documents(data)
print(all_splits[0])
print(all_splits[1])
> page_content='Training Diffusion Models with Reinforcement Learning – The Berkeley Artificial Intelligence Research Blog\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSubscribe\nAbout\nArchive\nBAIR\n\n\n\n\n\n\n\n\nTraining Diffusion Models with Reinforcement Learning\n\nKevin Black \xa0\xa0\n \n \n Jul 14, 2023\n \n \n\n\n\n\n\n\n\n\n\n\nTraining Diffusion Models with Reinforcement Learning\n\n\n\n\n\n\nreplay' metadata={'source': 'https://bair.berkeley.edu/blog/2023/07/14/ddpo/', 'title': 'Training Diffusion Models with Reinforcement Learning – The Berkeley Artificial Intelligence Research Blog', 'description': 'The BAIR Blog', 'language': 'No language found.'}
page_content='Diffusion models have recently emerged as the de facto standard for generating complex, high-dimensional outputs. You may know them for their ability to produce stunning AI art and hyper-realistic synthetic images, but they have also found success in other applications such as drug design\xa0and continuous control. The key idea behind diffusion models is to iteratively transform random noise into a sample, such as an image or protein structure. This is typically motivated as a maximum likelihood' metadata={'source': 'https://bair.berkeley.edu/blog/2023/07/14/ddpo/', 'title': 'Training Diffusion Models with Reinforcement Learning – The Berkeley Artificial Intelligence Research Blog', 'description': 'The BAIR Blog', 'language': 'No language found.'}
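To build intuition for what the splitter is doing, here is a simplified sketch of the recursive idea: try a coarse separator first (blank lines), fall back to finer ones (newlines, then spaces), and only hard-cut the text when nothing else works. This is an illustration under those assumptions, not LangChain's implementation; `split_recursive` is a hypothetical name, and sizes are measured in characters rather than tokens.

```python
def split_recursive(text, chunk_size=500, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters,
    preferring to break at the coarsest separator that works."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # The piece alone is too long: split it with finer separators.
            sub = split_recursive(piece, chunk_size, finer)
            chunks.extend(sub[:-1])
            current = sub[-1] if sub else ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

For example, `split_recursive("word " * 300)` has no blank lines or newlines to break on, so it falls through to splitting on spaces and packs as many whole words as fit into each 500-character chunk.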
Step 3: Storing document chunks using vector store
Once we have broken the document down into chunks, the next step is to create embeddings for the text and store them in a vector store. We can do it as shown below.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()
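An embedding is just a list of numbers, and "similar" chunks are those whose vectors point in similar directions. A minimal sketch of cosine similarity, one common way to compare two embeddings; which metric Chroma actually applies depends on how the collection is configured, so treat this as an illustration of the idea rather than a description of its internals:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)
```

Vectors pointing the same way score close to 1 regardless of their length, while unrelated (orthogonal) vectors score 0.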
Step 4: Retrieve similar documents based on query
After generating embeddings for the text, we can obtain relevant document chunks for a query using a vector similarity search.
question = "What steps are used for the DDPO algorithm?"
docs = vectorstore.similarity_search(question)
print(f"Retrieved {len(docs)} documents")
print(docs[0].page_content)
> Retrieved 4 documents
The key insight of our algorithm, which we call denoising diffusion policy optimization (DDPO), is that we can better maximize the reward of the final sample if we pay attention to the entire sequence of denoising steps that got us there. To do this, we reframe the diffusion process as a multi-step Markov decision process (MDP). In MDP terminology: each denoising step is an action, and the agent only gets a reward on the final step of each denoising trajectory when the final sample is produced.
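Four documents come back because `similarity_search` takes a parameter `k` that defaults to 4. The ranking step itself can be sketched in a few lines. The snippet below assumes unit-length vectors, so a plain dot product equals cosine similarity; `top_k`, `toy_store`, and the 2-dimensional vectors are invented for illustration and are far smaller than real embeddings.

```python
import heapq

def top_k(query_vec, store, k=4):
    """Return the k texts whose vectors score highest against query_vec.

    `store` is a list of (text, vector) pairs; the score is a dot product,
    which equals cosine similarity when all vectors have unit length.
    """
    def score(item):
        _, vec = item
        return sum(q * v for q, v in zip(query_vec, vec))
    return [text for text, _ in heapq.nlargest(k, store, key=score)]

# Toy 2-dimensional "embeddings", made up for this example only.
toy_store = [
    ("DDPO treats denoising as an MDP", (0.6, 0.8)),
    ("Subscribe to our RSS feed", (1.0, 0.0)),
    ("Each denoising step is an action", (0.0, 1.0)),
]
```

With the query vector `(0.0, 1.0)` and `k=2`, the two denoising-related texts outrank the unrelated footer text. In the real pipeline, passing a smaller or larger `k` to `similarity_search` trades context size against the chance of missing a relevant chunk.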
Step 5: Generating response using the distilled documents
Now that we have the distilled documents for our query, we can create an LLM chain to formulate a response to our query based on the context that's provided by the vectorstore retriever. We'll generate a prompt that'll instruct the LLM to answer the query and specify a fixed message in case of uncertainty.
from langchain.prompts import PromptTemplate
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Explain the answer in 3 sentences at max. Be concise.
Always say "Done!" at the end of the answer.
{context}
Question: {question}
Answer:"""
prompt = PromptTemplate.from_template(template)
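The `{context}` and `{question}` placeholders are filled in when the chain runs. A minimal sketch of that substitution using plain `str.format` as a stand-in for the template object; the shortened `TEMPLATE` and the `build_prompt` helper are hypothetical, and joining chunks with a blank line is just one reasonable choice:

```python
# Shortened version of the template above, for illustration only.
TEMPLATE = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Answer:"""

def build_prompt(chunks, question, template=TEMPLATE):
    """Join the retrieved chunk texts and substitute them into the template."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return template.format(context=context, question=question)
```

Printing `build_prompt(["chunk one", "chunk two"], "What is DDPO?")` shows exactly the text the model would receive, which is a handy way to debug a prompt before spending API calls on it.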
With this prompt, we'll create an LLM chain as follows:
from langchain.schema.runnable import RunnablePassthrough
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
)
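In the code above, there is one new term: RunnablePassthrough. It passes its input through unchanged, so the user's question reaches the prompt as-is while the retriever receives the same question to fetch the context. Both are runnables, the protocol that makes it easier to create custom LLM chains by letting us pipe components together with the | operator, as we have done above.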
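To see why the `|` syntax works, here is a toy model of the idea: an object that overloads Python's `__or__` so that `a | b` means "feed a's output into b". This is a sketch for intuition only, not LangChain's implementation; `Step`, `fan_out`, `fake_retriever`, and `fill_prompt` are all hypothetical names.

```python
class Step:
    """Toy stand-in for a runnable: wraps a function and supports `|` chaining."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `a | b` builds a new step that feeds a's output into b.
        return Step(lambda value: other.invoke(self.invoke(value)))

def fan_out(**steps):
    """Run every named step on the same input and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in steps.items()})

passthrough = Step(lambda value: value)  # forwards its input unchanged
fake_retriever = Step(lambda q: f"[chunks relevant to: {q}]")  # pretend retrieval
fill_prompt = Step(lambda d: f"{d['context']}\nQuestion: {d['question']}")

chain = fan_out(context=fake_retriever, question=passthrough) | fill_prompt
```

Calling `chain.invoke("What is DDPO?")` sends the same question to both branches of the dict step, then hands the resulting dict to the prompt step, mirroring the shape of `qa_chain` above.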
Now we can use our LLM chain to answer our query.
qa_chain.invoke("What steps are used for the DDPO algorithm?").content
> 'The DDPO algorithm uses a sequence of denoising steps to maximize the reward of the final sample. Each denoising step is considered an action in a multi-step Markov decision process (MDP). The agent only receives a reward on the final step of each denoising trajectory when the final sample is produced. Done!'
qa_chain.invoke("What's the main idea discussed in the article?").content
> 'The main idea discussed in the article is the problem of overoptimization and the need for a general-purpose method to prevent it. Done!'
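Both answers end with the "Done!" marker requested in the prompt, which gives a cheap way to check that the model followed the instructions before showing the answer to a user. A minimal sketch, assuming the marker always appears at the very end; `strip_marker` is a hypothetical helper:

```python
def strip_marker(answer, marker="Done!"):
    """Return (text, ok): the answer without the trailing marker,
    and whether the marker was actually present."""
    text = answer.strip()
    if text.endswith(marker):
        return text[: -len(marker)].rstrip(), True
    return text, False
```

If `ok` comes back `False`, the response may have been truncated or the model may have ignored the prompt, and it could be worth retrying or logging the case.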
The complete code for the Python notebook can be found here.
With this, we have a working system that can answer questions about a document and aid our interpretation and comprehension of it.
If you are intrigued by what else can be done with generative AI and LangChain, stay tuned for upcoming articles where we'll dive into several other use cases.