How To Build Your Own Question Answering Chatbot for Youtube Videos

Last Updated on July 25, 2023 by Editorial Team

Author(s): Ratul Ghosh

Originally published on Towards AI.

Unlocking the Power of LangChain: Tips and Tricks for Creating a ChatGPT-Style Bot with Custom Data Sources.

AI is undergoing a paradigm shift with the rise of foundation models: large language models pre-trained on massive amounts of text data using unsupervised learning techniques. These models can be adapted to a wide range of downstream tasks, which often makes it unnecessary to train a neural network from scratch for a target task such as question answering.

Photo by Andrea De Santis on Unsplash

With ChatGPT being one of the most popular Large Language Models (LLMs) in the market, there has been a lot of buzz surrounding its capabilities since its beta release at the end of 2022. However, there are limitations, such as a knowledge cutoff date and token limits, that can hinder its effectiveness.

This blog post will discuss how to use LLMs with external sources of data and handle the token limit for large documents, with an example of a Question Answering Chatbot for YouTube videos. We’ll see the steps required to build an application that allows users to paste a YouTube link and ask questions, and the response is generated solely based on the video content.

tl;dr

  • Extract the transcript from the videos.
  • For longer videos, split the document with enough overlap (to ensure continuity), send the smaller parts to the LLM, get the summary for each part, and combine them for the final summary.
  • Send each split along with the full summary to get an answer from each part. Choose the best answer.

Prerequisites

To build our own question-answering chatbot for YouTube videos, we’ll need to have a few things in place before getting started. First, we’ll need to have Python installed on our computer, along with some key packages like LangChain and Gradio. LangChain is a framework built around LLMs that lets us chain together different components to create more advanced use cases around LLMs. Gradio is a Python library that makes it easy to build simple web interfaces.

Additionally, we’ll need to create an API key for the OpenAI platform, which is necessary to make use of ChatGPT. An API key can be created from the API keys page of an OpenAI account by clicking the + Create new secret key button.

It’s worth noting that LangChain isn’t limited to just OpenAI’s models — we can use any LLMs we’d like, or even use our own model. In the later sections, we’ll discuss how to make use of multiple LLMs.

Getting the transcript

The first step is creating an external knowledge base. In our case, it’s the transcript of the YouTube video. We will use YouTubeTranscriptApi (from the youtube_transcript_api package), which fetches the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, supports translating subtitles, and does not require a headless browser, unlike Selenium-based solutions. The code below uses the only transcript if just one is listed; otherwise, it picks the first English one.

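from youtube_transcript_api import YouTubeTranscriptApi

def getTranscript(link):
    # Only standard "youtube.com/watch?v=..." links are handled here.
    if "youtube" not in link or "=" not in link:
        return "Paste proper link"
    # The video ID is the value after "v=", up to the next "&".
    videoId = link.split("=")[1].split("&")[0]
    try:
        # Static API of youtube_transcript_api as available when this post was written.
        transcript_list = YouTubeTranscriptApi.list_transcripts(videoId)
        lang_list = [t.language_code for t in transcript_list]
        if len(lang_list) == 1:
            lang = lang_list[0]
        else:
            # Prefer the first English variant (e.g. "en", "en-US").
            lang = [code for code in lang_list if code.startswith("en")][0]
        transcript = transcript_list.find_transcript([lang])
        # Assumes fetch() returns a list of dicts with a "text" field;
        # join them into one string, which the text splitter expects.
        return "\n".join(entry["text"] for entry in transcript.fetch())
    except Exception:
        return "Transcript not available"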
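The `link.split("=")` parsing above assumes a standard `watch?v=...` URL. As a hedged alternative, the sketch below uses Python's standard `urllib.parse` module to pull out the video ID and also accepts short `youtu.be` links. The helper name `extract_video_id` is hypothetical; it is not part of the original code or of youtube_transcript_api, and it expects links that include the `https://` scheme.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(link):
    """Return the YouTube video ID from a URL, or None if none is found.

    Hypothetical helper: covers watch URLs and youtu.be short links only.
    """
    parsed = urlparse(link)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        video_id = parsed.path.lstrip("/").split("/")[0]
        return video_id or None
    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Standard watch URLs carry the ID in the "v" query parameter.
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None
```

Returning `None` instead of an error string lets the caller decide how to report a bad link to the user.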

Defining the language model and Text Splitter

LangChain provides many modules that can be used to build applications powered by LLMs. Modules can be combined to create more complex applications or be used individually for simple applications.

The most basic building block of LangChain is calling an LLM on some input. Here is an example of three different LLMs.

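import os
from langchain.llms import OpenAI, Cohere, HuggingFaceHub

# api_key is assumed to hold the key for whichever provider you use.
# Each assignment below replaces the previous llm, so keep only the one you need.

# Option 1: OpenAI
os.environ["OPENAI_API_KEY"] = api_key
llm = OpenAI()

# Option 2: Cohere
os.environ["COHERE_API_KEY"] = api_key
llm = Cohere()

# Option 3: a model hosted on the Hugging Face Hub
os.environ["HUGGINGFACEHUB_API_TOKEN"] = api_key
llm = HuggingFaceHub(repo_id="google/flan-t5-xl")

text = "What would be a good company name for a company that makes colorful socks?"
print(llm(text))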

Long pieces of text must be split into chunks before they are sent to the model. There are many ways to do this. In this example, CharacterTextSplitter is used, which splits on a separator character (“\n” here) and measures chunk length by the number of characters.

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=8000,
    chunk_overlap=500,
    length_function=len,
)

Note: A small chunk_size breaks the document into many small chunks, and each chunk triggers its own LLM call, made either concurrently or sequentially depending on the combination logic used. Since most free tiers limit the number of API calls per minute, a run with many chunks might fail.
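To make the effect of chunk_size and chunk_overlap concrete, here is a minimal pure-Python sketch of overlap-based splitting. It is a simplified illustration, not LangChain's implementation, and the function name `split_with_overlap` is hypothetical. It packs separator-delimited pieces into chunks of at most chunk_size characters and starts each new chunk with the trailing pieces of the previous one, up to chunk_overlap characters.

```python
def split_with_overlap(text, separator="\n", chunk_size=8000, chunk_overlap=500):
    """Greedily pack separator-delimited pieces into overlapping chunks.

    Simplified illustration only. A single piece longer than chunk_size
    is kept whole, so that chunk can exceed the limit.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            chunks.append(separator.join(current))
            # Carry trailing pieces forward while they fit in chunk_overlap.
            overlap = []
            for prev in reversed(current):
                if len(separator.join([prev] + overlap)) > chunk_overlap:
                    break
                overlap.insert(0, prev)
            # Trim the overlap if it would push the new chunk over chunk_size.
            while overlap and len(separator.join(overlap + [piece])) > chunk_size:
                overlap.pop(0)
            current = overlap
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Toy transcript: 100 short lines, split with deliberately small limits.
transcript = "\n".join(f"line {i:03d}" for i in range(100))
chunks = split_with_overlap(transcript, chunk_size=50, chunk_overlap=20)
print(len(chunks), "chunks, so at least that many LLM calls in a per-chunk step")
```

Lowering chunk_size or raising chunk_overlap both increase the number of chunks, and therefore the number of API calls.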

Summarizing the video

The next step is to summarize the content of the video. This is relatively easy for shorter videos, but for longer videos the extracted transcript can exceed the model’s context limit, which is around 4,000 tokens for many language models.

To counter this, we need to split up the document with enough overlap to ensure continuity, send the smaller parts to the LLM, get the summary for each part, and finally combine the summaries to get the final result. LangChain provides various techniques for doing this, and each has its pros and cons. In this blog, we will explore two methods: “Map Reduce” and “Map Rerank”.
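Before choosing a method, a rough check can tell whether a transcript needs splitting at all. The sketch below is an approximation under stated assumptions: it uses about 4 characters per token, a common rule of thumb for English text, and reserves part of the limit for the model's output. The function names and the `reserved_for_output` value are hypothetical; exact counts depend on the model's tokenizer.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; chars_per_token=4 is an assumed rule of thumb."""
    return -(-len(text) // chars_per_token)  # ceiling division


def needs_splitting(text, token_limit=4000, reserved_for_output=500):
    """True if the text likely will not fit alongside the model's answer."""
    return estimate_tokens(text) > token_limit - reserved_for_output
```

If `needs_splitting` returns False, the whole transcript can likely go to the model in a single call.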

For the prompt, we can use a placeholder like “{text}” which will be replaced by the transcript.

from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate

_prompt_summary = """Write a detailed summary given the following \
transcript of a long video. Make sure the summary is succinct \
and complete:

{text}

DETAILED SUMMARY:"""

prompt_summary = PromptTemplate(template=_prompt_summary, input_variables=["text"])

def getSummary(transcript, llm):
    if llm is None:
        return "Please paste your OpenAI key to use"
    # Split the transcript into overlapping chunks and wrap each one as a Document.
    texts = text_splitter.split_text(transcript)
    docs = [Document(page_content=t) for t in texts]
    # The same prompt summarizes each chunk and combines the partial summaries.
    chain_summary = load_summarize_chain(llm, chain_type="map_reduce",
                                         map_prompt=prompt_summary,
                                         combine_prompt=prompt_summary)
    summary = chain_summary({"input_documents": docs}, return_only_outputs=True)
    return summary["output_text"]

The code above uses the “map_reduce” chain type, which runs an initial prompt on each chunk of data (for summarization tasks, this produces a summary of that chunk). Then, a combine prompt is run over all the initial outputs to produce the final result. In this example, the same prompt is used for both steps.
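The map-reduce idea itself is independent of LangChain. The following minimal sketch shows the two steps with a stand-in model so it runs offline. The names `map_reduce_summarize` and `fake_llm` are hypothetical, `llm` is assumed to be any callable that takes a prompt string and returns text, and the map step runs sequentially here rather than in parallel.

```python
def map_reduce_summarize(chunks, llm, map_template, combine_template):
    """Summarize each chunk, then summarize the joined partial summaries."""
    # Map step: one independent LLM call per chunk.
    partial = [llm(map_template.format(text=chunk)) for chunk in chunks]
    # Reduce step: a single call over the concatenated partial summaries.
    return llm(combine_template.format(text="\n\n".join(partial)))


calls = []


def fake_llm(prompt):
    """Stand-in for a real model: records the prompt and returns a label."""
    calls.append(prompt)
    return f"summary {len(calls)}"


final = map_reduce_summarize(
    ["first part", "second part"],
    fake_llm,
    map_template="Summarize: {text}",
    combine_template="Combine these summaries: {text}",
)
print(final)
```

With N chunks this makes N + 1 model calls, which is why the chunk count matters for rate limits.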

Question Answering

For question answering, we will pass each split along with the full summary to get an answer from each part. One of the biggest challenges with LLMs is hallucinated content, where the generated response is not supported by the given context or is unrelated to the question.

To reduce this, we can tune the prompt by adding an instruction such as: “If you don’t know the answer, just say that you don’t know, don’t try to make up an answer.” The example shown below is one such prompt for question answering.

_prompt_qa = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
In addition to giving an answer, also return a score of how fully it answered the user's question. This should be in the following format:
Question: [question here]
Helpful Answer: [answer here]
Score: [score between 0 and 100]
How to determine the score:
- Higher is a better answer
- Better responds fully to the asked question, with sufficient level of detail
- If you do not know the answer based on the context, that should be a score of 0
- Don't be overconfident!
Example #1
Context:
---------
Apples are red
---------
Question: what color are apples?
Helpful Answer: red
Score: 100
Example #2
Context:
---------
it was night and the witness forgot his glasses. he was not sure if it was a sports car or an suv
---------
Question: what type was the car?
Helpful Answer: a sports car or an suv
Score: 60
Example #3
Context:
---------
Pears are either red or orange
---------
Question: what color are apples?
Helpful Answer: This document does not answer the question
Score: 0
Begin!
Context:
---------
{context}
---------
Question: {question}
Helpful Answer:"""


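from langchain.chains.question_answering import load_qa_chain
from langchain.docstore.document import Document
from langchain.output_parsers import RegexParser
from langchain.prompts import PromptTemplate

# Splits the model output into the answer text and its score.
output_parser = RegexParser(
    regex=r"(.*?)\nScore: (.*)",
    output_keys=["answer", "score"],
)

prompt_qa = PromptTemplate(
    template=_prompt_qa,
    input_variables=["context", "question"],
    output_parser=output_parser,
)

def getAnswer(transcript, question, llm, summary):
    # summary: the full-video summary returned by getSummary.
    if llm is None:
        return "Please paste your OpenAI key to use"
    texts = text_splitter.split_text(transcript)
    # Append the full summary to every chunk so each call sees the overall context.
    docs = [Document(page_content=t + " " + summary) for t in texts]
    chain_qa = load_qa_chain(llm, chain_type="map_rerank", prompt=prompt_qa)
    llm_results = chain_qa({"input_documents": docs, "question": question},
                           return_only_outputs=True)
    return llm_results["output_text"]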

Here we have used Map-Rerank, which runs on each split, trying to both (a) answer a question and (b) assign a score to how good the answer is. In the final step, it picks the answer with the highest score.
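The rerank step can be sketched in plain Python. This is an illustration under assumptions, not LangChain's internal code: it reuses the regex from the parser above, treats output it cannot parse as a score of 0, and the names `parse_scored_answer` and `pick_best_answer` are hypothetical.

```python
import re

# Same pattern as the RegexParser in the article.
SCORE_PATTERN = re.compile(r"(.*?)\nScore: (.*)")


def parse_scored_answer(raw):
    """Split raw LLM output into (answer, score); unparseable output scores 0."""
    match = SCORE_PATTERN.search(raw)
    if match is None:
        return raw.strip(), 0
    answer, score_text = match.groups()
    try:
        score = int(score_text.strip())
    except ValueError:
        # The model returned something other than an integer score.
        score = 0
    return answer.strip(), score


def pick_best_answer(raw_outputs):
    """Return the answer with the highest score across all chunks."""
    parsed = [parse_scored_answer(raw) for raw in raw_outputs]
    return max(parsed, key=lambda pair: pair[1])[0]
```

For example, given one raw output per chunk, `pick_best_answer` returns the answer text from the chunk whose score is highest; it expects at least one output.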

Note: While the examples above can be used as prompts, they may not be the most suitable ones for the given task. To find more appropriate prompts for specific tasks, you can refer to OpenAI’s InstructGPT paper.

Conclusion

In this blog, we discussed how to use LangChain with external data. We demonstrated how to use various modules of LangChain to interact with different language models, split long pieces of text, and combine their outputs to achieve more advanced use cases. We also discussed some of the challenges in natural language processing, such as hallucination, and showed how prompt engineering can be used to overcome these challenges. Overall, LangChain provides a powerful and flexible platform for building natural language processing applications powered by LLMs.

I hope you like this blog and find it useful. If you have any thoughts, comments, or questions, please leave a comment below or contact me on LinkedIn, and don’t forget to click on 👏🏻 if you like the post.

Disclaimer: It is important to note that using YouTube’s content for any commercial purpose or in violation of their copyright or terms of service is illegal and can result in penalties. This blog post does not encourage or condone the unauthorized use of copyrighted material and any third-party content.



Published via Towards AI
