Extract the text from long videos with Python

Last Updated on January 6, 2023 by Editorial Team

Last Updated on December 13, 2020 by Editorial Team

Author(s): Eugenia Anello

A simple guide to build a speech recognizer using Google’s API

Speech recognition is an interesting task that can make everyday life easier. In this never-ending Covid period, I need to watch many recorded lessons, and it's easy to lose concentration. At the same time, having all the recordings available on my university's website has turned me into a perfectionist: I would like to capture every word in my notes. But doing that by hand is costly, because it takes a lot of work and time.

Luckily, providers such as Google, Amazon, IBM, and many others already offer APIs that convert audio into text. In this article, I'll focus only on the Google Speech-to-Text API, which I think is the most efficient option for transcribing many videos. I'm going to write a speech recognition program in Python that converts a video file into text.

Google Speech-to-Text API

Google Speech to text has three types of API requests based on audio content:

Figure 2: Credit: Google Speech-to-text’s limits

In Synchronous Requests, the audio file should be at most approximately 1 minute long. In this type of request, the user does not have to upload the data to Google Cloud. I'm going to focus on this type of request.
In Asynchronous Requests, the audio file should be at most approximately 480 minutes long. In this type of request, the user has to upload their data to Google Cloud.
Streaming Requests are suitable for streaming data, where the user talks directly into the microphone and needs the speech transcribed as it arrives. This type of request is a good fit for chatbots.
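As a rough illustration of the three request types, the sketch below picks one from the length of the audio. The function name `request_type` and the constants are hypothetical helpers written for this example, not part of any Google library, and the thresholds are simply the approximate limits quoted above.

```python
# Approximate limits quoted in the text (assumed here, check Google's current docs).
SYNC_LIMIT_SECONDS = 60          # synchronous: about 1 minute
ASYNC_LIMIT_SECONDS = 480 * 60   # asynchronous: about 480 minutes


def request_type(duration_seconds, live=False):
    """Suggest a request type for audio of the given length (hypothetical helper)."""
    if live:
        # Audio arriving from a microphone in real time.
        return "streaming"
    if duration_seconds <= SYNC_LIMIT_SECONDS:
        return "synchronous"
    if duration_seconds <= ASYNC_LIMIT_SECONDS:
        return "asynchronous"
    raise ValueError("audio is longer than the asynchronous limit; split it first")


# A 52-minute lecture is too long for a single synchronous request,
# which is why the rest of this guide cuts it into 1-minute chunks.
print(request_type(52 * 60))
print(request_type(60))
```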

The current API usage limits you need to know for Speech-to-Text are:

Figure 3: Credit: Google Speech-to-text’s limits

The table shows that there is a limit of 480 hours of audio per day, while the maximum number of “StreamingRecognize” requests per 60 seconds is 900. Isn't it amazing to have so many hours per day to convert audio into text? Especially when it's free! That is not a given if you try other APIs or standard methods that don't use Python.

Step 1: Download video from the website

I downloaded a video from my university’s website with a Chrome extension called Video DownloadHelper. It’s free and very easy to use. Some operations required by Video DownloadHelper cannot be performed from within the browser. In order to make the extension work, I also installed an external app called Companion Application.

Note: Without the Premium status, the video’s download can only be performed 120 minutes after the previous one.

Step 2: Import libraries into Jupyter Notebook

Let’s install the libraries that we’ll use in this program.

SpeechRecognition is a Python library for performing speech recognition with support for Google's API, while moviepy lets you cut, read, and write all the most common audio and video formats. Moreover, moviepy supports various video file formats: .ogv, .mp4, .mpeg, .avi, .mov.

Once the libraries are installed, we can import them:
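The original install and import cells did not survive in this copy of the article, so the following is a minimal sketch of what they could look like. It assumes the PyPI package names `SpeechRecognition` and `moviepy`; the helper `missing_packages` is a hypothetical convenience added for this example, and the `try`/`except` covers the fact that moviepy 1.x exposes `VideoFileClip` through `moviepy.editor` while moviepy 2.x exposes it at the top level.

```python
# Install once from a terminal or a notebook cell (assumed PyPI names):
#   pip install SpeechRecognition moviepy
import importlib.util

REQUIRED = {"speech_recognition": "SpeechRecognition", "moviepy": "moviepy"}


def missing_packages(required=REQUIRED):
    """Return the pip names of the required modules that cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Install first: pip install " + " ".join(missing))
else:
    import speech_recognition as sr
    from moviepy.video.io.ffmpeg_tools import ffmpeg_extract_subclip
    try:
        from moviepy.editor import VideoFileClip  # moviepy 1.x layout
    except ImportError:
        from moviepy import VideoFileClip         # moviepy 2.x layout
```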

Step 3: Cut video file into chunks of 1 minute and convert each chunk into text format

In my case, the video was in .mp4 format and was 52 minutes long. The variable num_seconds_video contains the length of my video in seconds. Next, I created a list of cut points, one every 60 seconds, which provides the start and end times used to slice the video file into chunks. More details about this concept will be explained later.
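A minimal sketch of these two variables, assuming a 52-minute video and 60-second chunks. Only `num_seconds_video` is a name from the article; the list name `cut_points` is made up for this example.

```python
# Length of the video in seconds: 52 minutes in the author's case.
num_seconds_video = 52 * 60

# One cut point every 60 seconds, including the final second of the video.
cut_points = list(range(0, num_seconds_video + 1, 60))

print(cut_points[:4])    # first few cut points
print(cut_points[-1])    # last cut point, the end of the video
print(len(cut_points) - 1, "chunks")
```

Each pair of neighbouring cut points gives the nominal start and end of one chunk.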

Moreover, I created an empty dictionary, diz, where the key will be the string “chunk#” and the value will be the text extracted from that chunk. In the for loop, I am going to convert each slice of video into text format.

Note: before running the for loop, I created a folder “chunks” to hold all the slices of the video and a folder “converted” to hold the same slices converted into wav format. I suggest you do the same if you don't want files scattered all over your working directory.

Create a new video file, based on the initial file “videorl.mp4”, that is cut between a start time and an end time (in seconds). For example, the first chunk is between 0 and 60 seconds, the second chunk is between 58 and 120 seconds, the third chunk is between 118 and 180 seconds, and so on until the last chunk, between 3058 and 3120 seconds. The chunks overlap by 2 seconds so that words spoken at a cut are not lost. The function used is ffmpeg_extract_subclip(filename, t1, t2, targetname)
Load the new video chunk created in the previous step with the function VideoFileClip(filename)
Convert the mp4 file into wav format, which works better with Google's API
Create the Recognizer instance
Import the audio file with format wav
Use Google’s Cloud Speech-to-text API to extract the text from the audio file in format wav.
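The steps above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the author's original code: the function names `chunk_bounds` and `transcribe_video` and the output file names are hypothetical, and the recognition call assumes `Recognizer.recognize_google` from the SpeechRecognition library, which may differ from the call used in the original notebook. The moviepy function is called with positional arguments because its keyword names differ between moviepy 1.x and 2.x.

```python
import os


def chunk_bounds(num_seconds, chunk_len=60, overlap=2):
    """Return (start, end) pairs in seconds.

    Every chunk after the first starts `overlap` seconds early,
    so words spoken at a cut appear in both neighbouring chunks.
    """
    bounds = []
    for start in range(0, num_seconds, chunk_len):
        end = min(start + chunk_len, num_seconds)
        bounds.append((max(start - overlap, 0), end))
    return bounds


def transcribe_video(video_path, num_seconds,
                     chunks_dir="chunks", wav_dir="converted"):
    """Cut the video, convert each chunk to wav, and return {"chunk#": text}."""
    # Third-party imports stay local so chunk_bounds works without them.
    import speech_recognition as sr
    from moviepy.video.io.ffmpeg_tools import ffmpeg_extract_subclip
    try:
        from moviepy.editor import VideoFileClip  # moviepy 1.x layout
    except ImportError:
        from moviepy import VideoFileClip         # moviepy 2.x layout

    os.makedirs(chunks_dir, exist_ok=True)
    os.makedirs(wav_dir, exist_ok=True)
    recognizer = sr.Recognizer()
    diz = {}
    for i, (start, end) in enumerate(chunk_bounds(num_seconds), start=1):
        chunk_path = os.path.join(chunks_dir, f"chunk{i}.mp4")
        wav_path = os.path.join(wav_dir, f"converted{i}.wav")
        # 1. Cut the slice between start and end out of the full video.
        ffmpeg_extract_subclip(video_path, start, end, chunk_path)
        # 2-3. Load the slice and write its audio track as a wav file.
        clip = VideoFileClip(chunk_path)
        clip.audio.write_audiofile(wav_path)
        clip.close()
        # 4-5. Read the wav file into the recognizer.
        with sr.AudioFile(wav_path) as source:
            audio = recognizer.record(source)
        # 6. Send the audio to Google and store the returned text.
        try:
            diz[f"chunk{i}"] = recognizer.recognize_google(audio)
        except sr.UnknownValueError:
            diz[f"chunk{i}"] = ""  # nothing intelligible in this chunk
    return diz


# The boundaries match the example in the text for a 52-minute video.
bounds = chunk_bounds(52 * 60)
print(bounds[:3], bounds[-1])
```

A call such as `diz = transcribe_video("videorl.mp4", 52 * 60)` would then run the whole loop; it needs ffmpeg, both libraries, and an internet connection.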

Step 4: Export results into a Text document

As the last task, we'll create a single text file, which will contain the text of all the chunks.

I create a list that contains only the extracted text from each slice of video. Then I join the elements of the list with the string separator “\n”, which is the newline character.

In the end, I create the file, which holds all the video's text.
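A minimal sketch of this export step, assuming `diz` is the dictionary of chunk texts described earlier. The function names and the default file name `transcript.txt` are hypothetical choices for this example.

```python
def join_chunks(diz):
    """Join the chunk texts, one per line, in the order they were added."""
    texts = [diz[key] for key in diz]  # dicts keep insertion order
    return "\n".join(texts)


def export_transcript(diz, out_path="transcript.txt"):
    """Write the joined text of all chunks to a single text file."""
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(join_chunks(diz))
    return out_path


# Small made-up dictionary, only to show the joined result.
example = {"chunk1": "hello everyone", "chunk2": "everyone today we"}
print(join_chunks(example))
```

Calling `export_transcript(diz)` on the real dictionary writes the full transcription to disk.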

Congratulations! Either you now have the text of your video, or the code is still running, which is normal if the file is big. It's not very fast, but at least you can watch Netflix in the meantime. In the end, you will obtain your text transcription. It won't be perfect: there will be some repeated words because of the 2-second overlap between consecutive chunks, but I think that is better than losing information. I hope you enjoyed this guide and found it useful. The entire code is on GitHub.

Extract the text from long videos with Python was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI
