
Building YouTube Recommender System with Video Thumbnails and Titles

Author(s): Ransaka Ravihara

Originally published on Towards AI.

Photo by Szabo Viktor on Unsplash

Introduction

This project started as a weekend experiment to learn about autoencoders and how they are used in practice. After it made more progress than I expected, I compiled it into a blog post so that anyone else interested can benefit from it. It covers more than autoencoders, though: it also includes vector search and text embeddings. Before diving into the implementation details, you can get an idea of what we are building by visiting the Hugging Face Space linked below.

Youtube Recommender – a Hugging Face Space by Ransaka


Before settling on a final architecture, let's break down the requirements. Essentially, we want an engine that makes suggestions by looking at video thumbnails and titles. To do that, we need an extensive knowledge base containing thumbnails and titles; once we have it, we can search user-given inputs against that knowledge base and return the best candidates. But there are a couple of problems. First, we need to convert both images and text into vectors. Second, those vector representations should be efficient and meaningful. While working on the project, I had a good grasp of how to handle the video titles, but I was uncertain how to generate vectors for the thumbnail data. I knew autoencoders could be used to reduce dimensionality and extract features, so I decided to give them a try. After some brainstorming, I came up with the following architecture.

I used a pre-trained embedding model to convert text into fixed-length vectors, and I trained an autoencoder from scratch to convert each image into a fixed-length vector. Once these models are ready, all text and image vectors are pre-computed and stored in a FAISS index for later similarity search.

As a first step, we need a proper dataset. Here, I will use the dataset below (Apache 2.0 licensed); you can follow along with it or use your own dataset in the same format.

Ransaka/youtube_recommendation_data · Datasets at Hugging Face


Perfect, now we have a dataset. After an initial inspection, I noticed two things (you can verify both with the quick check after the list).

  1. The dataset contains English, Sinhala, and Tamil text, so our embedding model should be multilingual.
  2. Every image is the same size (480, 360).
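
Both points are easy to verify directly from the dataset. Here is a minimal check, assuming the column names (title, image) and the train/test splits used throughout the rest of the post:

from datasets import load_dataset

dataset = load_dataset("Ransaka/youtube_recommendation_data")

print(dataset)                            # available splits and their columns
print(dataset['train'][0]['title'])       # a sample title (English, Sinhala, or Tamil)
print(dataset['train'][0]['image'].size)  # every thumbnail is (480, 360)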

Video Title Processing

As a first step, let's extract vector embeddings for the video titles and store them in a FAISS index; later, we can use that index for similarity search. We'll do the vectorization with Sentence Transformers.

from sentence_transformers import SentenceTransformer
from datasets import load_dataset

dataset = load_dataset("Ransaka/youtube_recommendation_data")
model = SentenceTransformer('intfloat/multilingual-e5-base')

input_texts = dataset['train']['title'] + dataset['test']['title']
embeddings = model.encode(input_texts, normalize_embeddings=True)

Now, let's define a small wrapper class that loads our embeddings into a FAISS index.

import faiss


class Indexer:
    def __init__(self, embed_vec):
        self.embeddings_vec = embed_vec
        self.build_index()

    def build_index(self):
        """
        Build the index for the embeddings.

        Calculates the dimension (self.d) of the embeddings vector, creates an
        IndexFlatL2 object (self.index) for that dimension, and adds the
        embeddings vector (self.embeddings_vec) to the index.
        """
        self.d = self.embeddings_vec.shape[1]
        self.index = faiss.IndexFlatL2(self.d)
        self.index.add(self.embeddings_vec)

    def topk(self, vector, k=4):
        """
        Return the indices of the k nearest neighbours of `vector` in the index.

        Parameters:
            vector: A numpy array of shape (n_queries, d) holding the query vector(s).
            k: The number of nearest neighbours to retrieve. Defaults to 4.

        Returns:
            I: A numpy array containing the indices of the k nearest neighbours.
        """
        _, I = self.index.search(vector, k)
        return I

Let's load our embeddings into the FAISS index. The indexer will return the k most similar candidates when we call its topk method, which is what we'll use when serving the model to end users.

text_embedding_index = Indexer(embeddings)
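
As a quick sanity check (a minimal sketch; the query string is just an example), we can embed an arbitrary title and look up its nearest neighbours in the index:

# Embed a query title and fetch the four most similar titles from the index.
query = "cricket highlights"
query_vector = model.encode([query], normalize_embeddings=True)  # shape (1, embedding_dim)

neighbour_ids = text_embedding_index.topk(query_vector, k=4)[0]
for idx in neighbour_ids:
    print(input_texts[idx])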

Video Thumbnail Processing

For thumbnail processing, I plan to use denoising autoencoders. This is a slightly different approach from the text-embedding extraction above. The idea is to train a CNN to reconstruct the original image through a bottleneck layer. If we think of the encoder and decoder as individual components, the encoder receives the original input we provide to the model, but the decoder does not; it receives its input from the encoder. Hence, the decoder's performance depends heavily on the encoder's latent layer, which means that, for the model to perform well, the latent layer has to learn an abstract representation of the original input. Once the model is fully trained, we can therefore unplug the decoder and use the encoder on its own as an image-to-vector model, i.e., a feature extractor. This is exactly what we will do in this tutorial.

denoising auto-encoder general model architecture

Building a Denoising Autoencoder

Building an autoencoder is straightforward. But with rectangular images, keeping track of the output size of each layer is challenging; this Wikipedia article did an excellent job of explaining it to me.
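
For reference, with dilation 1 a Conv2d layer produces floor((in + 2*padding - kernel) / stride) + 1 output pixels per spatial dimension, and a ConvTranspose2d layer produces (in - 1)*stride - 2*padding + kernel + output_padding. A tiny helper (a sketch, not part of the repository) makes it easy to verify the feature-map sizes annotated in the model code below:

# Standard PyTorch output-size formulas (dilation = 1).
def conv2d_out(size, kernel=3, stride=1, padding=1):
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose2d_out(size, kernel=3, stride=1, padding=1, output_padding=0):
    return (size - 1) * stride - 2 * padding + kernel + output_padding

# The stride-2 convolutions in the encoder halve each spatial dimension...
print(conv2d_out(480, stride=2), conv2d_out(360, stride=2))  # 240 180
# ...and the stride-2 transposed convolutions in the decoder undo that.
print(conv_transpose2d_out(240, stride=2, output_padding=1),
      conv_transpose2d_out(180, stride=2, output_padding=1))  # 480 360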

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class Encoder(nn.Module):
    def __init__(self, in_channels=1, out_channels=16, latent_dim=64, act_fn=nn.ReLU()):
        super().__init__()

        self.net = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),                     # (480, 360)
            act_fn,
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            act_fn,
            nn.Conv2d(out_channels, 2 * out_channels, 3, padding=1, stride=2),      # (240, 180)
            act_fn,
            nn.Conv2d(2 * out_channels, 2 * out_channels, 3, padding=1),
            act_fn,
            nn.Conv2d(2 * out_channels, 4 * out_channels, 3, padding=1, stride=2),  # (120, 90)
            act_fn,
            nn.Conv2d(4 * out_channels, 4 * out_channels, 3, padding=1),
            act_fn,
            nn.Flatten(),
            nn.Linear(4 * out_channels * 120 * 90, latent_dim),
            act_fn
        )

    def forward(self, x):
        x = x.view(-1, 1, 480, 360)
        output = self.net(x)
        return output


class Decoder(nn.Module):
    def __init__(self, in_channels=1, out_channels=16, latent_dim=64, act_fn=nn.ReLU()):
        super().__init__()

        self.out_channels = out_channels

        self.linear = nn.Sequential(
            nn.Linear(latent_dim, 4 * out_channels * 120 * 90),
            act_fn
        )

        self.conv = nn.Sequential(
            nn.ConvTranspose2d(4 * out_channels, 4 * out_channels, 3, padding=1),   # (120, 90)
            act_fn,
            nn.ConvTranspose2d(4 * out_channels, 2 * out_channels, 3, padding=1,
                               stride=2, output_padding=1),                         # (240, 180)
            act_fn,
            nn.ConvTranspose2d(2 * out_channels, 2 * out_channels, 3, padding=1),
            act_fn,
            nn.ConvTranspose2d(2 * out_channels, out_channels, 3, padding=1,
                               stride=2, output_padding=1),                         # (480, 360)
            act_fn,
            nn.ConvTranspose2d(out_channels, out_channels, 3, padding=1),
            act_fn,
            nn.ConvTranspose2d(out_channels, in_channels, 3, padding=1)
        )

    def forward(self, x):
        output = self.linear(x)
        output = output.view(-1, 4 * self.out_channels, 120, 90)
        output = self.conv(output)
        return output


class Autoencoder(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.encoder.to(device)

        self.decoder = decoder
        self.decoder.to(device)

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded
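
Before training, it's worth a quick sanity check that a thumbnail makes the round trip through the bottleneck with the expected shapes. A minimal sketch using a random dummy image:

# Shape check with a single random "thumbnail" tensor.
encoder = Encoder()
decoder = Decoder()
autoencoder = Autoencoder(encoder, decoder)

dummy = torch.randn(1, 1, 480, 360).to(device)
print(encoder(dummy).shape)      # torch.Size([1, 64]) -> the latent vector
print(autoencoder(dummy).shape)  # torch.Size([1, 1, 480, 360]) -> the reconstruction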

I won't show the training loop and evaluation steps here; you can check the code at any time by visiting the GitHub repository below. Once the autoencoder is fully trained, we can take its encoder part and use it as a feature extractor.
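
For reference, a single denoising training step looks roughly like the sketch below; the noise level, optimizer, and loss here are illustrative assumptions, and the actual loop and hyperparameters live in the repository linked below.

# Illustrative denoising training step (not the exact loop from the repository).
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

def training_step(batch, noise_std=0.1):
    clean = batch.to(device)
    noisy = clean + noise_std * torch.randn_like(clean)  # corrupt the input
    reconstruction = autoencoder(noisy)                  # encode -> decode
    loss = criterion(reconstruction, clean)              # compare against the clean image
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()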

GitHub – Ransaka/yt-recommender


import numpy as np
import torchvision.transforms as transforms
from torch.utils.data import Dataset

# Load the trained encoder weights and put the encoder in inference mode.
encoder = Encoder()
encoder.load_state_dict(
    torch.load('encoder.bin', map_location=torch.device('cpu'))
)
encoder = encoder.to(device).eval()


class ThumbnailDataset(Dataset):
    def __init__(self, data, transforms=None):
        self.data = data
        self.transforms = transforms

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        image = self.data[idx]['image']
        image = image.convert('L')  # grayscale, to match the single-channel encoder

        if self.transforms is not None:
            image = self.transforms(image)
        return image


training_data = ThumbnailDataset(dataset['train'], transforms=transforms.Compose([transforms.ToTensor()]))
validation_data = ThumbnailDataset(dataset['test'], transforms=transforms.Compose([transforms.ToTensor()]))

# Run every thumbnail through the encoder and keep the flattened latent vectors.
latent_data = []
for data in training_data:
    data = data.to(device)
    latent_data.append(encoder(data).detach().cpu().numpy().flatten())

latent_data_test = []
for data in validation_data:
    data = data.to(device)
    latent_data_test.append(encoder(data).detach().cpu().numpy().flatten())

latent_data_final = np.concatenate([latent_data, latent_data_test])

Similar to text embeddings, we can now build an index for image latent data.

image_embedding_index = Indexer(latent_data_final)

Now everything is in place for our application. Without further delay, let's build the Streamlit app. First, we select a few random images to show as the initial feed. Below each image there is an expander; when the user clicks it, the recommended items become visible.

import streamlit as st
import random
import os
from datasets import load_dataset, concatenate_datasets
from utils.recommendation import get_recommendations


# Pick a random window of 10 thumbnails to show as the initial feed.
START = random.randint(a=0, b=1000)
END = START + 10

dataset = load_dataset("Ransaka/youtube_recommendation_data", token=os.environ.get('HF'))
dataset = concatenate_datasets([dataset['train'], dataset['test']])

pil_images = dataset['image'][START:END]


def show_image_metadata_and_related_info(image_index):
    selected_image = pil_images[image_index]
    image_title = dataset['title'][image_index]
    st.image(selected_image, caption=f"{image_title}", use_column_width=True)

    with st.expander("You May Also Like.."):
        # `method` (text- or image-based retrieval) is set elsewhere in the full app;
        # see the repository for how it is wired up.
        recommendations = get_recommendations(selected_image, image_title, 8, method)

        # First row
        col1_row1, col2_row1 = st.columns(2)
        with col1_row1:
            st.image(image=recommendations['image'][0], caption=recommendations['title'][0], width=200)
        with col2_row1:
            st.image(image=recommendations['image'][1], caption=recommendations['title'][1], width=200)

        # Second row
        col1_row2, col2_row2 = st.columns(2)
        with col1_row2:
            st.image(image=recommendations['image'][2], caption=recommendations['title'][2], width=200)
        with col2_row2:
            st.image(image=recommendations['image'][3], caption=recommendations['title'][3], width=200)

        # Third row
        col1_row3, col2_row3 = st.columns(2)
        with col1_row3:
            st.image(image=recommendations['image'][4], caption=recommendations['title'][4], width=200)
        with col2_row3:
            st.image(image=recommendations['image'][5], caption=recommendations['title'][5], width=200)

        # Fourth row
        col1_row4, col2_row4 = st.columns(2)
        with col1_row4:
            st.image(image=recommendations['image'][6], caption=recommendations['title'][6], width=200)
        with col2_row4:
            st.image(image=recommendations['image'][7], caption=recommendations['title'][7], width=200)


def main():
    st.title("Youtube Recommendation Engine")

    for i, image in enumerate(pil_images):
        show_image_metadata_and_related_info(i)


if __name__ == '__main__':
    main()
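
The get_recommendations helper imported from utils.recommendation is defined in the repository. As a rough, hypothetical sketch (not the actual implementation), it could tie the earlier pieces together like this:

# Hypothetical sketch of a get_recommendations helper built from the earlier pieces.
# Assumes the objects built above are available: model, encoder, device,
# text_embedding_index, image_embedding_index, and the concatenated train+test dataset
# (whose row order matches the order of the index entries).
import torchvision.transforms as transforms

to_tensor = transforms.ToTensor()

def get_recommendations(image, title, k, method="text"):
    """Return the k items closest to the given thumbnail/title as a dict of columns."""
    if method == "text":
        query = model.encode([title], normalize_embeddings=True)
        ids = text_embedding_index.topk(query, k=k)[0]
    else:
        # Image-based lookup: grayscale thumbnail -> encoder latent -> FAISS search.
        tensor = to_tensor(image.convert('L')).to(device)
        query = encoder(tensor).detach().cpu().numpy().reshape(1, -1)
        ids = image_embedding_index.topk(query, k=k)[0]
    return dataset[[int(i) for i in ids]]  # e.g. {'image': [...], 'title': [...]}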

Awesome! You have successfully built a YouTube recommendation system with open-source tools and libraries.

Thanks for reading! Connect with me on LinkedIn.

Conclusion

As with all other AI/ML projects, this method may not be suited to other projects you are working on. The purpose of this article was to show how to build a hybrid (image-based and text-based) recommendation engine for the provided YouTube dataset. On this specific dataset, the results seem sensible; however, doing your own research before any real-world application is highly advised.

In this article, all images, unless otherwise noted, are by the author.


Published via Towards AI
