How To Build Your Own Question Answering Chatbot for YouTube Videos
Last Updated on July 25, 2023 by Editorial Team
Author(s): Ratul Ghosh
Originally published on Towards AI.
Unlocking the Power of LangChain: Tips and Tricks for Creating a ChatGPT-Type Bot with Custom Data Sources.
AI is undergoing a paradigm shift with the rise of foundation models: large pre-trained language models trained on massive amounts of text data using unsupervised learning techniques. These models can be adapted to a wide range of downstream tasks, which often removes the need to train a neural network from scratch for a target task such as question answering.
With ChatGPT being one of the most popular Large Language Models (LLMs) in the market, there has been a lot of buzz surrounding its capabilities since its beta release at the end of 2022. However, there are limitations, such as a knowledge cutoff date and token limits, that can hinder its effectiveness.
This blog post will discuss how to use LLMs with external sources of data and handle the token limit for large documents, with an example of a Question Answering Chatbot for YouTube videos. We'll see the steps required to build an application that lets users paste a YouTube link and ask questions, with responses generated solely from the video content.
tl;dr
- Extract the transcript from the videos.
- For longer videos, split the document with enough overlap (to ensure continuity), send the smaller parts to the LLM, get the summary for each part, and combine them for the final summary.
- Send each split along with the full summary to get an answer from each part. Choose the best answer.
Prerequisites
To build our own question-answering chatbot for YouTube videos, we'll need to have a few things in place before getting started. First, we'll need to have Python installed on our computer, along with some key packages like LangChain and Gradio. LangChain is a framework built around LLMs that lets us chain together different components to create more advanced use cases around LLMs. Gradio is a Python library that makes it easy to build simple web interfaces.
Additionally, we'll need to create an API key for the OpenAI platform, which is necessary to make use of ChatGPT. An API key can be created by visiting this link and clicking on the + Create new secret key button.
It's worth noting that LangChain isn't limited to just OpenAI's models: we can use any LLM we'd like, or even use our own model. In the later sections, we'll discuss how to make use of multiple LLMs.
Getting the transcript
The first step is creating an external knowledge base. In our case, it's the transcript of the YouTube video. We will use YouTubeTranscriptApi, which allows us to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, supports translating subtitles, and does not require a headless browser as Selenium-based solutions do. The code below looks for an English transcript and fetches it.
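from youtube_transcript_api import YouTubeTranscriptApi

def getTranscript(link):
    # Only links of the form https://www.youtube.com/watch?v=<id> are handled
    if "youtube" not in link or "=" not in link:
        return "Paste proper link"
    # The video ID follows the first "=" and ends at the next "&"
    videoId = link.split("=")[1].split("&")[0]
    try:
        transcript_list = YouTubeTranscriptApi.list_transcripts(videoId)
        lang_list = [t.language_code for t in transcript_list]
        if len(lang_list) == 1:
            lang = lang_list[0]
        else:
            # Prefer an English variant such as "en" or "en-US"
            lang = [t for t in lang_list if t.startswith("en")][0]
        transcript = transcript_list.find_transcript([lang])
        return transcript.fetch()
    except Exception:
        # Covers videos without transcripts and videos with no English transcript
        return "Transcript not available"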
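Two details in the step above are easy to trip over: links that do not use the watch?v= form, and the fact that the text splitter used later expects one string. The following is a minimal sketch of how both could be handled with the standard library only. The helper names extract_video_id and transcript_to_text are hypothetical (they are not part of the original code), and the second helper assumes that the fetched transcript is a list of dictionaries with a "text" key, which may differ between versions of the transcript library.

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(link):
    """Return the video ID from a YouTube URL, or None if none is found."""
    parsed = urlparse(link)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Standard links carry the ID in the "v" query parameter
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None

def transcript_to_text(entries):
    """Join transcript entries into one newline-separated string.

    Assumption: each entry is a dict with a "text" key.
    """
    return "\n".join(entry["text"].strip() for entry in entries)
```

Joining with "\n" keeps the text compatible with a splitter that uses "\n" as its separator.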
Defining the language model and Text Splitter
LangChain provides many modules that can be used to build applications powered by LLMs. Modules can be combined to create more complex applications or be used individually for simple applications.
The most basic building block of LangChain is calling an LLM on some input. Here is an example of three different LLMs.
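import os
from langchain.llms import OpenAI, Cohere, HuggingFaceHub

# Pick ONE of the three options; api_key must hold that provider's own key
api_key = "..."  # placeholder: replace with your key
# Option 1: OpenAI
os.environ["OPENAI_API_KEY"] = api_key
llm = OpenAI()
# Option 2: Cohere
os.environ["COHERE_API_KEY"] = api_key
llm = Cohere()
# Option 3: a model hosted on the Hugging Face Hub
os.environ["HUGGINGFACEHUB_API_TOKEN"] = api_key
llm = HuggingFaceHub(
    repo_id="google/flan-t5-xl"
)

text = "What would be a good company name for a company that makes colorful socks?"
print(llm(text))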
To deal with long pieces of text, we need to split the text into chunks. There are many techniques for this, and there is room to get creative. In this example, CharacterTextSplitter is used, which splits on a separator string ("\n" in this example) and measures chunk length by the number of characters.
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=8000,
    chunk_overlap=500,
    length_function=len,
)
Note: If chunk_size is small, the document is broken into many small chunks, and each chunk triggers an LLM call, either concurrently or sequentially depending on the combination logic used. Since most free tiers limit the number of API calls per minute, this might fail.
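To make the trade-off in the note concrete, here is a simplified sketch of overlapping character windows together with an estimate of how many chunks (and therefore LLM calls per pass) a transcript produces. This is not LangChain's implementation: CharacterTextSplitter splits on the separator and then merges pieces, whereas the hypothetical split_with_overlap below cuts fixed-size windows. It is meant only to illustrate how chunk_size and chunk_overlap interact.

```python
import math

def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks

def estimate_chunk_count(text_length, chunk_size, chunk_overlap):
    """Number of windows split_with_overlap produces for a given length."""
    if text_length <= chunk_size:
        return 1 if text_length > 0 else 0
    return math.ceil((text_length - chunk_overlap) / (chunk_size - chunk_overlap))
```

Under this simplified model, a 60,000-character transcript with chunk_size=8000 and chunk_overlap=500 yields 8 chunks, so a single summarization pass would need about 8 calls plus the combine step.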
Summarizing the video
The next step is to summarize the content of the video. This is relatively easy for shorter videos, but for longer videos the extracted transcript will often exceed the model's context limit, which was around 4,000 tokens for many language models at the time of writing.
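A quick way to decide whether a transcript needs splitting is to estimate its token count. The sketch below relies on the rough rule of thumb that one token is about four characters of English text. That ratio is an assumption, not an exact figure, and the helper names, the default limit, and the budget reserved for the answer are all hypothetical. For accurate counts, a real tokenizer for the chosen model should be used instead.

```python
import math

def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; chars_per_token=4 is only a rule of thumb."""
    return math.ceil(len(text) / chars_per_token)

def needs_splitting(text, token_limit=4000, reserved_for_answer=500):
    """True if the text likely exceeds the budget left for the prompt."""
    return estimate_tokens(text) > token_limit - reserved_for_answer
```

Under the same heuristic, a chunk of 8,000 characters corresponds to roughly 2,000 tokens, which leaves room for the prompt template and the generated summary.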
To counter this, we need to split up the document with enough overlap to ensure continuity, send the smaller parts to the LLM, get the summary for each part, and finally combine the summaries to get the final result. LangChain provides various techniques for doing this, and each has its pros and cons. In this blog, we will explore two methods: "Map Reduce" and "Map Rerank".
For the prompt, we can use a placeholder like "{text}" which will be replaced by the transcript.
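from langchain.prompts import PromptTemplate

_prompt_summary = """Write a detailed summary given the following \
transcript of a long video. Make sure the summary is succinct \
and complete:
{text}
DETAILED SUMMARY:"""
# {text} is filled with a transcript chunk (map step) or the partial summaries (combine step)
prompt_summary = PromptTemplate(template=_prompt_summary, input_variables=["text"])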
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document

def getSummary(transcript, llm):
    if llm is None:
        return "Please paste your OpenAI key to use"
    # transcript is expected to be a single string
    texts = text_splitter.split_text(transcript)
    # Wrap each chunk in a Document; no summary exists yet at this stage
    docs = [Document(page_content=t) for t in texts]
    chain_summary = load_summarize_chain(llm, chain_type="map_reduce",
                                         map_prompt=prompt_summary,
                                         combine_prompt=prompt_summary)
    summary = chain_summary({"input_documents": docs}, return_only_outputs=True)
    return summary["output_text"]
The code above uses the map-reduce approach: an initial prompt is run on each chunk of data in parallel (for summarization tasks, this produces a summary of that chunk). Then, a combine prompt is run to merge all the initial outputs; in this example the same template is reused for both steps.
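The control flow of map-reduce summarization can be sketched without any LLM at all. The function below is an illustration, not LangChain's implementation: map_reduce_summarize and fake_summarize are hypothetical names, the summarize argument stands in for a call to a language model, and the map step runs sequentially here rather than in parallel.

```python
def map_reduce_summarize(chunks, summarize):
    """Summarize each chunk, then summarize the concatenated partial summaries."""
    # Map step: one call per chunk
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # Reduce step: one more call over all partial summaries
    return summarize("\n".join(partial_summaries))

def fake_summarize(text):
    """Stand-in for an LLM call: keeps only the first three words."""
    return " ".join(text.split()[:3])

chunks = ["one two three four five", "six seven eight nine ten"]
final_summary = map_reduce_summarize(chunks, fake_summarize)
```

With a real model, summarize would format the prompt template with the chunk and send it to the LLM, so a transcript split into N chunks costs N + 1 calls in this simple form.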
Question Answering
For question answering, we will pass each split along with the full summary to get an answer from each part. One of the biggest challenges with LLMs is hallucinated content, where the generated responses are unrelated to the question or the context.
To overcome this, we can tune the prompt by adding something like this: "If you don't know the answer, just say that you don't know, don't try to make up an answer." The example shown below is one such prompt for question answering.
_prompt_qa = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
In addition to giving an answer, also return a score of how fully it answered the user's question. This should be in the following format:
Question: [question here]
Helpful Answer: [answer here]
Score: [score between 0 and 100]
How to determine the score:
- Higher is a better answer
- Better responds fully to the asked question, with sufficient level of detail
- If you do not know the answer based on the context, that should be a score of 0
- Don't be overconfident!
Example #1
Context:
---------
Apples are red
---------
Question: what color are apples?
Helpful Answer: red
Score: 100
Example #2
Context:
---------
it was night and the witness forgot his glasses. he was not sure if it was a sports car or an suv
---------
Question: what type was the car?
Helpful Answer: a sports car or an suv
Score: 60
Example #3
Context:
---------
Pears are either red or orange
---------
Question: what color are apples?
Helpful Answer: This document does not answer the question
Score: 0
Begin!
Context:
---------
{context}
---------
Question: {question}
Helpful Answer:"""
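from langchain.output_parsers import RegexParser
from langchain.prompts import PromptTemplate

# Splits the model output into the answer text and the score that follows it
output_parser = RegexParser(
    regex=r"(.*?)\nScore: (.*)",
    output_keys=["answer", "score"],
)
prompt_qa = PromptTemplate(
    template=_prompt_qa,
    input_variables=["context", "question"],
    output_parser=output_parser,
)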
from langchain.chains.question_answering import load_qa_chain
from langchain.docstore.document import Document

# summary is passed in explicitly here (for example, the output of getSummary)
def getAnswer(transcript, summary, question, llm):
    if llm is None:
        return "Please paste your OpenAI key to use"
    texts = text_splitter.split_text(transcript)
    # Each chunk is paired with the full summary so every call sees the overall context
    docs = [Document(page_content=t + " " + summary) for t in texts]
    chain_qa = load_qa_chain(llm, chain_type="map_rerank", prompt=prompt_qa)
    llm_results = chain_qa({"input_documents": docs, "question": question},
                           return_only_outputs=True)
    return llm_results["output_text"]
Here we have used Map-Rerank, which runs on each split, trying to both (a) answer a question and (b) assign a score to how good the answer is. In the final step, it picks the answer with the highest score.
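The reranking step itself is simple enough to sketch with the standard library. The code below is an illustration, not LangChain's implementation: it parses raw model outputs in the "answer, then Score: N" format requested by the prompt, using the same regular expression as the output parser, and keeps the highest-scoring answer. The function names and the sample outputs are hypothetical.

```python
import re

# Same pattern as the output parser: answer text, newline, then the score
SCORE_PATTERN = re.compile(r"(.*?)\nScore: (.*)")

def parse_scored_answer(raw_output):
    """Return (answer, score) or None if the output is not in the expected format."""
    match = SCORE_PATTERN.search(raw_output)
    if match is None:
        return None
    answer = match.group(1).strip()
    try:
        score = int(match.group(2).strip())
    except ValueError:
        return None  # the model returned a non-numeric score
    return answer, score

def pick_best_answer(raw_outputs):
    """Return the (answer, score) pair with the highest score, or None."""
    parsed = [p for p in map(parse_scored_answer, raw_outputs) if p is not None]
    if not parsed:
        return None
    return max(parsed, key=lambda pair: pair[1])

sample_outputs = [
    " This document does not answer the question\nScore: 0",
    " a sports car or an suv\nScore: 60",
    " red\nScore: 100",
]
best = pick_best_answer(sample_outputs)
```

Outputs that do not follow the format are skipped rather than raising an error, which is one possible design choice when a model occasionally ignores the requested layout.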
Note: While the examples above can be used as prompts, they may not be the most suitable ones for the given task. To find more appropriate prompts for specific tasks, you can refer to OpenAI's InstructGPT paper.
Conclusion
In this blog, we discussed how to use LangChain with external data. We demonstrated how to use various modules of LangChain to interact with different language models, split long pieces of text, and combine their outputs to achieve more advanced use cases. We also discussed some of the challenges in natural language processing, such as hallucination, and showed how prompt engineering can be used to overcome these challenges. Overall, LangChain provides a powerful and flexible platform for building natural language processing applications powered by LLMs.
I hope you like this blog and find it useful. If you have any thoughts, comments, or questions, please leave a comment below or contact me on LinkedIn, and don't forget to click on 👏🏻 if you like the post.
Disclaimer: It is important to note that using YouTube's content for any commercial purpose or in violation of their copyright or terms of service is illegal and can result in penalties. This blog post does not encourage or condone the unauthorized use of copyrighted material and any third-party content.