Extract the text from long videos with Python
Last Updated on January 6, 2023 by Editorial Team
Last Updated on December 13, 2020 by Editorial Team
Author(s): Eugenia Anello
A simple guide to building a speech recognizer using Google's API
Speech recognition is an interesting task that can improve your quality of life. In this never-ending Covid period, I need to watch many recorded lessons, and it's so easy to lose concentration. At the same time, having all the recordings available on my university's website has turned me into a perfectionist: I would like to capture every word in my notes. But doing this by hand is costly, because it takes a lot of work and time.
Luckily, several providers, such as Google, Amazon, and IBM, already offer APIs that convert audio into text. In this article, I'll focus only on the Google Speech-to-Text API, which I think is the most efficient option for transcribing many videos. I'm going to build a speech recognition program in Python that converts a video file into text.
Google Speech-to-Text API
Google Speech-to-Text has three types of API requests, based on the audio content:
- Synchronous requests: the audio should be approximately 1 minute long at most. In this type of request, the user does not have to upload the data to Google Cloud. I'm going to focus on this type of request.
- Asynchronous requests: the audio can be up to approximately 480 minutes long. In this type of request, the user has to upload the data to Google Cloud.
- Streaming requests: these are suitable for streaming data, where the user is talking into the microphone directly and needs the speech transcribed live. This type of request is apt for chatbots.
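As a rough illustration of the limits in the list above, the helper below picks a request type for a pre-recorded file from its duration. The function name and the exact thresholds (60 seconds and 480 minutes, taken from the approximate figures quoted here) are assumptions made for this sketch; they are not part of any Google client library.

```python
def pick_request_type(duration_seconds: float) -> str:
    """Suggest a request type for pre-recorded audio of the given length."""
    sync_limit = 60          # about 1 minute, per the list above
    async_limit = 480 * 60   # about 480 minutes, per the list above

    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    if duration_seconds <= sync_limit:
        return "synchronous"
    if duration_seconds <= async_limit:
        return "asynchronous"
    raise ValueError("audio exceeds the asynchronous limit; split it first")


print(pick_request_type(45))        # synchronous
print(pick_request_type(52 * 60))   # asynchronous
```

A 52-minute lesson does not fit in a single synchronous request, which is why the rest of this guide cuts the video into 1-minute chunks.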
The current API usage limits you need to know for Speech-to-Text are:
The table shows that there is a limit of 480 hours of audio per day, while the maximum number of "StreamingRecognize" requests per 60 seconds is 900. Isn't it amazing to have so many hours to convert audio into text per day? Especially when it's free! That is not a given if you try other APIs or standard methods without Python.
Step 1: Download the video from the website
I downloaded a video from my university's website with a Chrome extension called Video DownloadHelper. It's free and very easy to use. Some operations required by Video DownloadHelper cannot be performed from within the browser, so to make the extension work I also installed an external app called Companion Application.
Note: Without Premium status, a video download can only be performed 120 minutes after the previous one.
Step 2: Import libraries into Jupyter Notebook
Let's install the libraries that we'll use in this program.
SpeechRecognition is a Python library for performing speech recognition with support for Google's API, while moviepy lets you cut, read, and write the most common audio and video formats. Moreover, moviepy supports various file formats: .ogv, .mp4, .mpeg, .avi, .mov.
Once the libraries are installed, we can import them:
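The import cell could look like the sketch below. The module paths are assumptions: moviepy 1.x exposes VideoFileClip through moviepy.editor, while moviepy 2.x exposes it at the top level, so the sketch tries both. LIBS_READY is a hypothetical flag used here only to report whether the imports succeeded.

```python
# Install first if needed, for example: pip install SpeechRecognition moviepy
try:
    import speech_recognition as sr
    from moviepy.video.io.ffmpeg_tools import ffmpeg_extract_subclip

    try:
        from moviepy.editor import VideoFileClip   # moviepy 1.x layout
    except ImportError:
        from moviepy import VideoFileClip          # moviepy 2.x layout

    LIBS_READY = True
except ImportError as err:
    # Report the missing package instead of failing the whole notebook cell
    LIBS_READY = False
    print(f"Missing dependency: {err}")

print("Libraries ready:", LIBS_READY)
```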
Step 3: Cut the video file into chunks of 1 minute and convert each chunk into text
In my case, the video was in .mp4 format and was 52 minutes long. The variable num_seconds_video contains the video's length in seconds. I then created a list of times, one every 60 seconds, which supplies the start and end times used to cut the video into chunks. More details about this are explained later.
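A minimal sketch of this setup, assuming a 52-minute video. The variable num_seconds_video comes from the text; the name boundaries is a hypothetical choice for the list of cut points.

```python
minutes = 52
num_seconds_video = minutes * 60   # 3120 seconds

# One cut point every 60 seconds, including the end of the video
boundaries = list(range(0, num_seconds_video + 1, 60))

print(num_seconds_video)   # 3120
print(boundaries[:4])      # [0, 60, 120, 180]
print(boundaries[-1])      # 3120
```

Each pair of consecutive values in the list marks the nominal start and end of one chunk, so 53 cut points give 52 chunks.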
Moreover, I created an empty dictionary, diz, where the key is the string "chunk#" and the value is the text extracted from that chunk. In the for loop, I convert each slice of video into text.
Note: before running the for loop, I created a folder "chunks" that contains all the slices of the video and a folder "converted" that contains all the slices converted into wav format. I suggest you do the same if you don't want your working directory cluttered with files.
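The two folders can also be created from the notebook itself. The sketch below is one way to do it; the helper name make_output_dirs is hypothetical, while the folder names and the dictionary name diz follow the text.

```python
from pathlib import Path


def make_output_dirs(base="."):
    """Create the 'chunks' and 'converted' folders if they do not exist yet."""
    base = Path(base)
    chunks_dir = base / "chunks"        # will hold the 1-minute video slices
    converted_dir = base / "converted"  # will hold the slices converted to wav
    for folder in (chunks_dir, converted_dir):
        folder.mkdir(parents=True, exist_ok=True)
    return chunks_dir, converted_dir


chunks_dir, converted_dir = make_output_dirs()

# Empty dictionary that will map "chunk1", "chunk2", ... to the extracted text
diz = {}
```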
- Create a new video file, based on the initial file "videorl.mp4", cut between a start time and an end time (in seconds). For example, the first chunk is between 0 and 60 seconds, the second between 58 and 120 seconds, the third between 118 and 180 seconds, and so on until the last chunk, between 3058 and 3120 seconds. The chunks overlap by 2 seconds in order not to lose important words. The function used is ffmpeg_extract_subclip(filename, t1, t2, targetname)
- Load the new video chunk created in the previous step with the function VideoFileClip(filename)
- Convert the mp4 file into wav format, which works better with Google's API
- Create the Recognizer instance
- Load the audio file in wav format
- Use Google's Cloud Speech-to-Text API to extract the text from the wav audio file.
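The steps above can be sketched as follows. This is an illustration under stated assumptions, not the original notebook code: the helper names (chunk_windows, transcribe_chunk, transcribe_video) and the output file names are hypothetical, and the "chunks" and "converted" folders are assumed to exist. The sketch calls recognize_google, the SpeechRecognition method that works without credentials; the library also offers recognize_google_cloud for the Cloud Speech-to-Text API, which requires credentials.

```python
from pathlib import Path

VIDEO_FILE = "videorl.mp4"   # file name used in this guide
CHUNK_SECONDS = 60
OVERLAP_SECONDS = 2


def chunk_windows(total_seconds, chunk=CHUNK_SECONDS, overlap=OVERLAP_SECONDS):
    """Return (start, end) pairs; every chunk after the first starts `overlap` seconds early."""
    windows = []
    for start in range(0, total_seconds, chunk):
        end = min(start + chunk, total_seconds)
        windows.append((max(start - overlap, 0), end))
    return windows


def transcribe_chunk(video_file, start, end, index):
    """Cut one slice, convert it to wav, and return the recognized text."""
    # Imported lazily so chunk_windows can be tried without the libraries installed
    import speech_recognition as sr
    from moviepy.video.io.ffmpeg_tools import ffmpeg_extract_subclip
    try:
        from moviepy.editor import VideoFileClip   # moviepy 1.x layout
    except ImportError:
        from moviepy import VideoFileClip          # moviepy 2.x layout

    chunk_path = str(Path("chunks") / f"cut{index}.mp4")
    wav_path = str(Path("converted") / f"converted{index}.wav")

    # Cut the slice (positional arguments: input, start, end, output)
    ffmpeg_extract_subclip(video_file, start, end, chunk_path)

    # Load the slice and write its audio track as wav
    clip = VideoFileClip(chunk_path)
    clip.audio.write_audiofile(wav_path)
    clip.close()

    # Create the recognizer and read the wav file
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)

    # Send the audio to Google; an unintelligible chunk yields an empty string
    try:
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        return ""


def transcribe_video(video_file, total_seconds):
    """Build the dictionary {"chunk1": text, "chunk2": text, ...}."""
    diz = {}
    for i, (start, end) in enumerate(chunk_windows(total_seconds), start=1):
        diz[f"chunk{i}"] = transcribe_chunk(video_file, start, end, i)
    return diz


windows = chunk_windows(52 * 60)
print(windows[:3])   # [(0, 60), (58, 120), (118, 180)]
print(windows[-1])   # (3058, 3120)

# Needs the video file, ffmpeg, and network access:
# diz = transcribe_video(VIDEO_FILE, 52 * 60)
```

Keeping the window arithmetic in its own function makes it easy to check the 2-second overlap before spending time on the actual transcription.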
Step 4: Export the results into a text document
As the last task, we'll create a single text file, which will contain all the chunks' texts.
I create a list that contains only the text extracted from each slice of video. Then I join the elements of the list with the string separator "\n", the newline character.
In the end, I create the file, which contains all the video's text.
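A minimal sketch of this export step, assuming the dictionary diz has been filled as described earlier. The function name export_transcript, the output file name, and the sample texts are hypothetical.

```python
from pathlib import Path


def export_transcript(diz, output_file="transcript.txt"):
    """Join the chunk texts with newlines and write them to one text file."""
    # Dictionaries keep insertion order, so the chunks stay in sequence
    texts = list(diz.values())
    full_text = "\n".join(texts)
    Path(output_file).write_text(full_text, encoding="utf-8")
    return full_text


# Made-up sample data standing in for the real transcription results
example = {
    "chunk1": "welcome to the lesson",
    "chunk2": "the lesson today is about",
}
print(export_transcript(example, "transcript_example.txt"))
```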
Congratulations! Either you have obtained the text of your video, or the code is still running. The latter is normal if the file is big. It's not too fast, but at least you can watch Netflix in the meantime. In the end, you will obtain your transcription. It won't be perfect: there will be some redundant words because of the 2-second overlap between consecutive chunks, but I think that's better than losing information. I hope you enjoyed this guide and found it useful. The entire code is on GitHub.
Extract the text from long videos with Python was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
Published via Towards AI