Building YouTube Recommender System with Video Thumbnails and Titles
Author(s): Ransaka Ravihara
Originally published on Towards AI.
Introduction
This started out as a weekend project in which I learned about autoencoders and how they are used in practice. After seeing how far it progressed, I compiled it into a blog post so everyone else interested can benefit from it. That said, this is about more than autoencoders; it covers vector search and text embeddings, too. Before diving into the implementation details, you can get an idea of what we are building by visiting the Hugging Face Space linked below.
Youtube Recommender – a Hugging Face Space by Ransaka
Before thinking about the final architecture, let's break down our requirements. Essentially, we want an engine that gives us suggestions by looking at video thumbnails and titles. To do that, we need an extensive knowledge base containing thumbnails and titles. Once we have it, we can search user-given inputs against this knowledge space and return the best candidates. But we have a couple of problems. First, we need to convert both images and text into vectors. Second, those vector representations should be efficient and meaningful. While working on the project, I had a good grasp of how to proceed with the video titles, but I was uncertain how to generate vectors for the thumbnail data. I knew, however, that autoencoders can be used for dimensionality reduction and feature extraction, so I decided to give them a try. After some brainstorming, I came up with the following architecture.
I used a pre-trained embedding model to convert text to fixed-length vectors and trained an autoencoder from scratch to convert images to fixed-length vectors. Once both models are ready, all text and image vectors are pre-calculated and stored in FAISS indexes for later similarity search.
As a first step, we need a proper dataset. Here, I will use the dataset below (Apache 2.0 licensed). You can follow along with it or use your own dataset in the same format.
Ransaka/youtube_recommendation_data · Datasets at Hugging Face
Perfect, now we have a dataset. After an initial inspection (a quick way to reproduce it is sketched after this list), I noticed the following points:
- The dataset contains text in English, Sinhala, and Tamil. Therefore, our embedding model should be multilingual.
- Every image is the same size (480, 360).
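For reference, a quick inspection along these lines can be reproduced as follows. This is a small check I added for illustration, using the same title and image fields the rest of the post relies on:

from datasets import load_dataset

dataset = load_dataset("Ransaka/youtube_recommendation_data")
print(dataset)                             # splits and row counts
print(dataset['train'][0]['title'])        # a sample title (English, Sinhala, or Tamil)
print(dataset['train'][0]['image'].size)   # expected to be (480, 360)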
Video Title Processing
As a first step, let's extract vector embeddings for our video titles and store them in a FAISS index; later, we can use that index for similarity search. Let's do the vectorization with Sentence Transformers.
from sentence_transformers import SentenceTransformer
from datasets import load_dataset

dataset = load_dataset("Ransaka/youtube_recommendation_data")

# Multilingual model, since the titles are in English, Sinhala, and Tamil.
model = SentenceTransformer('intfloat/multilingual-e5-base')

# Embed the titles of both splits in one go.
input_texts = dataset['train']['title'] + dataset['test']['title']
embeddings = model.encode(input_texts, normalize_embeddings=True)
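It is worth checking what encode returns before indexing it, since FAISS expects a 2-D float32 array. A quick, illustrative check (the printed dimension assumes multilingual-e5-base, whose embeddings are 768-dimensional):

print(type(embeddings), embeddings.dtype, embeddings.shape)
# e.g. <class 'numpy.ndarray'> float32 (N, 768), where N is the number of titles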
Now we can define an Indexer class that wraps a FAISS index around our embeddings.
import faiss


class Indexer:
    def __init__(self, embed_vec):
        self.embeddings_vec = embed_vec
        self.build_index()

    def build_index(self):
        """
        Build the FAISS index for the embeddings.

        Calculates the dimension (self.d) of the embedding vectors, creates an
        IndexFlatL2 index for that dimension, and adds the embeddings to it.
        """
        self.d = self.embeddings_vec.shape[1]
        self.index = faiss.IndexFlatL2(self.d)
        self.index.add(self.embeddings_vec)

    def topk(self, vector, k=4):
        """
        Return the indices of the k nearest neighbours of the query vectors.

        Parameters:
            vector: A 2-D numpy array of query vectors, shape (n_queries, d).
            k: Number of nearest neighbours to retrieve. Defaults to 4.

        Returns:
            A numpy array of shape (n_queries, k) with the neighbour indices.
        """
        _, I = self.index.search(vector, k)
        return I
Let's load our embeddings into a FAISS index. The indexer returns the most similar k candidates when we call its topk method, which is what we will use when serving the model to end users.
text_embedding_index = Indexer(embeddings)
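As a quick sanity check (not part of the original post), we can query the index with one of the stored embeddings. FAISS expects a 2-D float32 array, so we slice rather than index:

query_vector = embeddings[:1]                      # shape (1, d), still float32
neighbor_ids = text_embedding_index.topk(query_vector, k=4)
print([input_texts[i] for i in neighbor_ids[0]])   # the first hit is the query title itself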
Video Thumbnail Processing
For thumbnail processing, I plan to use denoising autoencoders, a slightly different approach from the text-embedding extraction above. The idea is to train a CNN to reconstruct the original image through a bottleneck layer. If we think of the encoder and decoder as individual components, the encoder receives the original input we provide to the model, but the decoder does not; it receives its input from the encoder. The decoder's performance therefore depends heavily on how good the encoder's latent layer is, which means that to improve the model, the latent layer must learn an abstract representation of the original input. Once the model is fully trained, we can unplug the decoder and use the encoder on its own as an image-to-vector model, i.e., a feature extractor. This is exactly what we will do in this tutorial.
Building the Denoising Autoencoder
Building an autoencoder is straightforward, but with rectangular images, keeping track of each layer's output size is challenging. However, this Wikipedia link did an excellent job of explaining it to me.
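To make the shapes in the code below easier to follow, here is the standard convolution output-size formula applied to our thumbnails. This is a small helper I added for illustration; it is not part of the original code:

def conv_out(size, kernel=3, padding=1, stride=1):
    """floor((size + 2*padding - kernel) / stride) + 1"""
    return (size + 2 * padding - kernel) // stride + 1

print(conv_out(480, stride=2), conv_out(360, stride=2))  # 240 180 -- after the first stride-2 conv
print(conv_out(240, stride=2), conv_out(180, stride=2))  # 120 90  -- after the second stride-2 conv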
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class Encoder(nn.Module):
    def __init__(self, in_channels=1, out_channels=16, latent_dim=64, act_fn=nn.ReLU()):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),                     # (480, 360)
            act_fn,
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            act_fn,
            nn.Conv2d(out_channels, 2 * out_channels, 3, padding=1, stride=2),      # (240, 180)
            act_fn,
            nn.Conv2d(2 * out_channels, 2 * out_channels, 3, padding=1),
            act_fn,
            nn.Conv2d(2 * out_channels, 4 * out_channels, 3, padding=1, stride=2),  # (120, 90)
            act_fn,
            nn.Conv2d(4 * out_channels, 4 * out_channels, 3, padding=1),
            act_fn,
            nn.Flatten(),
            nn.Linear(4 * out_channels * 120 * 90, latent_dim),
            act_fn
        )

    def forward(self, x):
        x = x.view(-1, 1, 480, 360)
        output = self.net(x)
        return output


class Decoder(nn.Module):
    def __init__(self, in_channels=1, out_channels=16, latent_dim=64, act_fn=nn.ReLU()):
        super().__init__()
        self.out_channels = out_channels
        self.linear = nn.Sequential(
            nn.Linear(latent_dim, 4 * out_channels * 120 * 90),
            act_fn
        )
        self.conv = nn.Sequential(
            nn.ConvTranspose2d(4 * out_channels, 4 * out_channels, 3, padding=1),   # (120, 90)
            act_fn,
            nn.ConvTranspose2d(4 * out_channels, 2 * out_channels, 3, padding=1,
                               stride=2, output_padding=1),                         # (240, 180)
            act_fn,
            nn.ConvTranspose2d(2 * out_channels, 2 * out_channels, 3, padding=1),
            act_fn,
            nn.ConvTranspose2d(2 * out_channels, out_channels, 3, padding=1,
                               stride=2, output_padding=1),                         # (480, 360)
            act_fn,
            nn.ConvTranspose2d(out_channels, out_channels, 3, padding=1),
            act_fn,
            nn.ConvTranspose2d(out_channels, in_channels, 3, padding=1)
        )

    def forward(self, x):
        output = self.linear(x)
        output = output.view(-1, 4 * self.out_channels, 120, 90)
        output = self.conv(output)
        return output


class Autoencoder(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.encoder.to(device)
        self.decoder = decoder
        self.decoder.to(device)

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded
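Before training, it is worth pushing a dummy batch through the model to confirm that the decoder really reproduces the input shape. A throwaway sanity check (not from the original post):

encoder, decoder = Encoder(), Decoder()
model = Autoencoder(encoder, decoder)

dummy = torch.randn(2, 1, 480, 360).to(device)
print(model(dummy).shape)  # expected: torch.Size([2, 1, 480, 360])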
Here, I won't show the training loop and evaluation steps; you can check the code at any time by visiting the GitHub repository below. Once the autoencoder is successfully trained, we can take its encoder part and use it as a feature extractor.
GitHub – Ransaka/yt-recommender
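For orientation, here is a minimal sketch of what a denoising training setup for this model could look like. It assumes MSE reconstruction loss, Gaussian input noise, and a DataLoader over the ThumbnailDataset created in the next snippet; the exact loop and hyperparameters live in the repository.

from torch.utils.data import DataLoader

model = Autoencoder(Encoder(), Decoder())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# `training_data` is the ThumbnailDataset instance created in the next snippet.
train_loader = DataLoader(training_data, batch_size=16, shuffle=True)

for epoch in range(10):
    for clean in train_loader:
        clean = clean.to(device).view(-1, 1, 480, 360)   # match the shape the encoder expects
        noisy = clean + 0.1 * torch.randn_like(clean)    # corrupt the input (the "denoising" part)
        reconstruction = model(noisy)
        loss = criterion(reconstruction, clean)          # reconstruct the clean image
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

torch.save(model.encoder.state_dict(), 'encoder.bin')    # keep only the encoder for inference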
import numpy as np
import torchvision.transforms as transforms
from torch.utils.data import Dataset

encoder = Encoder()
encoder.load_state_dict(
    torch.load('encoder.bin', map_location=torch.device('cpu'))
)
encoder.to(device)
encoder.eval()  # inference only -- we use the encoder as a feature extractor
class ThumbnailDataset(Dataset):
    def __init__(self, data, transforms=None):
        self.data = data
        self.transforms = transforms

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        image = self.data[idx]['image']
        image = image.convert('L')  # grayscale, since the encoder expects a single channel
        if self.transforms is not None:
            image = self.transforms(image)
        return image
training_data = ThumbnailDataset(dataset['train'], transforms=transforms.Compose([transforms.ToTensor()]))
validation_data = ThumbnailDataset(dataset['test'], transforms=transforms.Compose([transforms.ToTensor()]))

# Pre-compute the latent vector of every thumbnail with the trained encoder.
latent_data = []
with torch.no_grad():
    for data in training_data:
        data = data.to(device)
        latent_data.append(encoder(data).detach().cpu().numpy().flatten())

latent_data_test = []
with torch.no_grad():
    for data in validation_data:
        data = data.to(device)
        latent_data_test.append(encoder(data).detach().cpu().numpy().flatten())

latent_data_final = np.concatenate([latent_data, latent_data_test])
Just as with the text embeddings, we can now build an index for the image latent data. Note that we index the combined train-and-test matrix (latent_data_final), which keeps the row order consistent with the text index.
image_embedding_index = Indexer(latent_data_final)
Now everything is in place for our application. Without further delay, let's jump into the Streamlit app. First, we select random images to show as the initial images. Below each image, there is an expander; once the user clicks on it, the recommended items become visible.
import streamlit as st
import random
import os
from datasets import load_dataset, concatenate_datasets
from utils.recommendation import get_recommendations
START = random.randint(a=0,b=1000)
END = START + 10
dataset = load_dataset("Ransaka/youtube_recommendation_data", token=os.environ.get('HF'))
dataset = concatenate_datasets([dataset['train'], dataset['test']])
pil_images = dataset['image'][START:END]
def show_image_metadata_and_related_info(image_index):
    selected_image = pil_images[image_index]
    image_title = dataset['title'][START + image_index]  # offset by START so the title matches the image
    st.image(selected_image, caption=f"{image_title}", use_column_width=True)
    with st.expander("You May Also Like.."):
        # `method` (the retrieval mode) is defined elsewhere in the full app -- see the repository.
        recommendations = get_recommendations(selected_image, image_title, 8, method)
        # Show the eight recommendations as four rows of two columns.
        for row in range(4):
            left_col, right_col = st.columns(2)
            for col, idx in zip((left_col, right_col), (2 * row, 2 * row + 1)):
                with col:
                    st.image(image=recommendations['image'][idx], caption=recommendations['title'][idx], width=200)
def main():
    st.title("Youtube Recommendation Engine")
    for i, image in enumerate(pil_images):
        show_image_metadata_and_related_info(i)


if __name__ == '__main__':
    main()
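The get_recommendations helper imported above lives in utils/recommendation.py in the repository. Conceptually, it does something like the following simplified sketch, reusing the model, encoder, and indexes built earlier in the post; the method option names and the exact merging logic here are my assumptions, not the repo's code:

def get_recommendations_sketch(image, title, k, method):
    """Return the k most similar videos, retrieved by title or by thumbnail."""
    if method == "Text":                                  # assumed option name
        query = model.encode([title], normalize_embeddings=True)
        ids = text_embedding_index.topk(query, k=k)[0]
    else:                                                 # thumbnail-based retrieval
        tensor = transforms.ToTensor()(image.convert('L')).to(device)
        latent = encoder(tensor).detach().cpu().numpy()   # shape (1, latent_dim)
        ids = image_embedding_index.topk(latent, k=k)[0]
    return dataset.select(ids)                            # a datasets.Dataset with 'image' and 'title' columns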
Awesome! You have successfully built a YouTube recommendation system with open-source tools and libraries.
Thanks for reading! Connect with me on LinkedIn.
Conclusion
As with all other AI/ML projects, this method may not be suited to every project you are working on. The purpose of this article was to show how to build a hybrid (image-based and text-based) recommendation engine for the provided YouTube dataset. On this specific dataset, the results look intuitive; however, doing your own research before any real-world application is highly advised.
In this article, all images, unless otherwise noted, are by the author.
Published via Towards AI