Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Read by thought-leaders and decision-makers around the world. Phone Number: +1-650-246-9381 Email: pub@towardsai.net
228 Park Avenue South New York, NY 10003 United States
Website: Publisher: https://towardsai.net/#publisher Diversity Policy: https://towardsai.net/about Ethics Policy: https://towardsai.net/about Masthead: https://towardsai.net/about
Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Founders: Roberto Iriondo, , Job Title: Co-founder and Advisor Works for: Towards AI, Inc. Follow Roberto: X, LinkedIn, GitHub, Google Scholar, Towards AI Profile, Medium, ML@CMU, FreeCodeCamp, Crunchbase, Bloomberg, Roberto Iriondo, Generative AI Lab, Generative AI Lab Denis Piffaretti, Job Title: Co-founder Works for: Towards AI, Inc. Louie Peters, Job Title: Co-founder Works for: Towards AI, Inc. Louis-François Bouchard, Job Title: Co-founder Works for: Towards AI, Inc. Cover:
Towards AI Cover
Logo:
Towards AI Logo
Areas Served: Worldwide Alternate Name: Towards AI, Inc. Alternate Name: Towards AI Co. Alternate Name: towards ai Alternate Name: towardsai Alternate Name: towards.ai Alternate Name: tai Alternate Name: toward ai Alternate Name: toward.ai Alternate Name: Towards AI, Inc. Alternate Name: towardsai.net Alternate Name: pub.towardsai.net
5 stars – based on 497 reviews

Frequently Used, Contextual References

TODO: Remember to copy unique IDs whenever it needs used. i.e., URL: 304b2e42315e

Resources

Take our 85+ lesson From Beginner to Advanced LLM Developer Certification: From choosing a project to deploying a working product this is the most comprehensive and practical LLM course out there!

Publication

Recommendation System Tutorial with Python using Collaborative Filtering 
Editorial   Machine Learning   Programming   Tutorials

Recommendation System Tutorial with Python using Collaborative Filtering 

Last Updated on November 15, 2020 by Editorial Team

Author(s): Saniya Parveez, Roberto Iriondo

Building a recommendation system using Python and collaborative filtering for a Netflix use case.

Introduction

A recommendation system generates a compiled list of items in which a user might be interested, in the reciprocity of their current selection of item(s). It expands users’ suggestions without any disturbance or monotony, and it does not recommend items that the user already knows. Similarly, the Netflix recommendation system offers recommendations by matching and searching similar users’ habits and suggesting movies that share characteristics with films that users have rated highly.

This tutorial’s code is available on Github and its full implementation as well on Google Colab

Figure 1: Netflix recommendation workflow. | recommendation system tutorial with Python using collaborative filtering — Netfl
Figure 1: Netflix recommendation workflow.

The recommendation system workflow shown in the diagram above shows the user’s collaboration regarding the ratings of different movies or shows. New users get their recommendations based on the recommendations of existing users.

According to McKinsey:

75% of what people are watching on Netflix comes from recommendations [1].

Netflix Real-time data cases:

  • More than 20,000 movies and shows.
  • 2 million users.

Complications

Recommender systems are machine learning-based systems that scan through all possible options and provides a prediction or recommendation. However, building a recommendation system has below complications:

  • Users’ data is interchangeable.
  • The data volume is large and includes a significant list of movies, shows, customers’ profiles and interests, ratings, and other data points.
  • New registered customers use to have very limited information.
  • Real-time prediction for users.
  • Old users can have an overabundance of information.
  • It should not show items that are very different or too similar.
  • Users can change the rating of items on change of his/her mind.

Types of Recommendation Systems

There are two types of recommendation systems:

  • Content filtering recommender systems.
  • Collaborative filtering based recommender systems.

Fun fact: Netflix‘s recommender system filtering architecture bases on collaborative filtering [2] [3].

Content Filtering

Content filtering expects the side information such as the properties of a song (song name, singer name, movie name, language, and others.). Recommender systems perform well, even if new items are added to the library. A recommender system’s algorithm expects to include all side properties of its library’s items.

An essential aspect of content filtering:

  • Expects item information.
  • Item information should be in a text document.

Collaborative Filtering

The idea behind collaborative filtering is to consider users’ opinions on different videos and recommend the best video to each user based on the user’s previous rankings and the opinion of other similar types of users.

Figure 2: Collaborative filtering. | recommendation system tutorial with Python using collaborative filtering — Netflix
Figure 2: Collaborative filtering.

Pros:

  • It does not need a movie’s side knowledge like genres.
  • It uses information collected from other users to recommend new items to the current user.

Cons:

  • It does not achieve recommendation on a new movie or shows that have no ratings.
  • It requires the user community and can have a sparsity problem.

Different techniques of Collaborative filtering:

Non-probabilistic algorithm

  • User-based nearest neighbor.
  • Item-based nearest neighbor.
  • Reducing dimensionality.

Probabilistic algorithm

  • Bayesian-network model.
  • EM algorithm.

Issues in Collaborative Filtering

There are several challenges for collaborative filtering, as mentioned below:

Sparseness

The Netflix recommendation system’s dataset is extensive, and the user-item matrix used for the algorithm could be vast and sparse, so this encounters the problem of performance.

Figure 3: Sparseness. | recommendation system tutorial with Python using collaborative filtering — Netflix
Figure 3: Sparseness.

The sparsity of data derives from the ratio of the empty and total records in the user-item matrix.

Sparsity = 1 — |R|/|I|*|U|

Where,

R = Rating

I = Items

U = Users

Cold Start

This problem encounters when the system has no information to make recommendations for the new users. As a result, the matrix factorization techniques cannot apply.

This problem brings two observations:

  • How to recommend a new video for users?
  • What video to recommend to new users?

Solutions:

  • Suggest or ask users to rate videos.
  • Default voting for videos.
  • Use other techniques like content-based or demographic for the initial phase.

User-based Nearest Neighbor

The basic technique of user-based Nearest Neighbor for the user John:

John is an active Netflix user and has not seen a video “v” yet. Here, the user-based nearest neighbor algorithm will work like below:

  • The technique finds a set of users or nearest neighbors who have liked the same items as John in the past and have rated video “v.”
  • Algorithm predicts.
  • Performs for all the items John has not seen and recommends.

Essentially, the user-based nearest neighbor algorithm generates a prediction for item i by analyzing the rating for i from users in u’s neighborhood.

Let’s calculate user similarity for the prediction:

Figure 4: Calculation of similarity. | recommendation system tutorial with Python using collaborative filtering — Netflix
Figure 4: Calculation of similarity.

Where:

a, b = Users

r(a, p)= Rating of user a for item p

P = Set of items. Rated by both users a and b

Figure 5: Similarly calculation. | recommendation system tutorial with Python using collaborative filtering — Netflix
Figure 5: Similarity calculation.

Prediction based on the similarity function:

Figure 6: Prediction. | recommendation system tutorial with Python using collaborative filtering — Netflix
Figure 6: Prediction.

Here, similar users are defined by those that like similar movies or videos.

Challenges

  • For a considerable amount of data, the algorithm encounters severe performance and scaling issues.
  • Computationally expansiveness O(MN) can encounter in the worst case. Where M is the number of customers and N is the number of items.
  • Performance can be increase by applying the methodology of dimensionality reduction. However, it can reduce the quality of the recommendation system.

Item-based Nearest Neighbor

This technique generates predictions based on similarities between different videos or movies or items.

Prediction for a user u and item i is composed of a weighted sum of the user u’s ratings for items most similar to i.

Figure 7: Item-based nearest neighbor prediction. | recommendation system tutorial with Python using collaborative filtering
Figure 7: Item-based nearest neighbor prediction.
Figure 8: Item-based prediction. | recommendation system tutorial with Python using collaborative filtering — Netflix
Figure 8: Item-based prediction.

As shown in figure 8, look for the videos that are similar to video5. Hence, the recommendation is very similar to video4.

Role of Cosine Similarity in building Recommenders

The cosine similarity is a metric used to find the similarity between the items/products irrespective of their size. We calculate the cosine of an angle by measuring between any two vectors in a multidimensional space. It is applicable for supporting documents of a considerable size due to the dimensions.

Figure 9: Calculation of similarity. | recommendation system tutorial with Python using collaborative filtering — Netflix
Figure 9: Calculation of similarity.

Where:

cosine is an angle calculated between -1 to 1 where -1 denotes dissimilar items, and 1 shows items which are a correct match.

cos p. q — gives the dot product between the vectors.

||p|| ||q|| — represents the product of vector’s magnitude

Why do Baseline Predictors for Recommenders matter?

Baseline Predictors are independent of the user’s rating, but they provide predictions to the new user’s

General Baseline form

bu,i = µ + bu + bi

Where,

bu and bi are users and item baseline predictors.

Motivation for Baseline

  • Imputation of missing values with baseline values.
  • compare accuracy with advanced model

Netflix Movie Recommendation System

Problem Statement

Netflix is a platform that provides online movie and video streaming. Netflix wants to build a recommendation system to predict a list of movies for users based on other movies’ likes or dislikes. This recommendation will be for every user based on his/her unique interest.

Netflix Dataset

  • combine_data_2.txt: This text file contains movie_id, customer_id, rating, date
  • movie_title.csv: This CSV file contains movie_id and movie_title

Load Dataset

from datetime import datetime
import pandas as pd
import numpy as np
import seaborn as sns
import os
import random
import matplotlib
import matplotlib.pyplot as plt
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_errorimport xgboost as xgb
from surprise import Reader, Dataset
from surprise import BaselineOnly
from surprise import KNNBaseline
from surprise import SVD
from surprise import SVDpp
from surprise.model_selection import GridSearchCVdef load_data():
    netflix_csv_file = open("netflix_rating.csv", mode = "w")
    rating_files = ['combined_data_1.txt']
    for file in rating_files:
        with open(file) as f:
            for line in f:
                line = line.strip()
                if line.endswith(":"):
                    movie_id = line.replace(":", "")
                else:
                    row_data = []
                    row_data = [item for item in line.split(",")]
                    row_data.insert(0, movie_id)
                    netflix_csv_file.write(",".join(row_data))  
                    netflix_csv_file.write('\n')
                    
    netflix_csv_file.close()
    df = pd.read_csv('netflix_rating.csv', sep=",", names = ["movie_id","customer_id", "rating", "date"])
    return dfnetflix_rating_df = load_data()
netflix_rating_df.head()
Figure 10: Netflix rating data. | recommendation system tutorial with Python using collaborative filtering — Netflix
Figure 10: Netflix rating data.

Analysis of Dataset

Find duplicate ratings:

netflix_rating_df.duplicated(["movie_id","customer_id", "rating", "date"]).sum()

Split train and test data:

split_value = int(len(netflix_rating_df) * 0.80)
train_data = netflix_rating_df[:split_value]
test_data = netflix_rating_df[split_value:]

Count number of ratings in the training data set:

plt.figure(figsize = (12, 8))
ax = sns.countplot(x="rating", data=train_data)ax.set_yticklabels([num for num in ax.get_yticks()])plt.tick_params(labelsize = 15)
plt.title("Count Ratings in train data", fontsize = 20)
plt.xlabel("Ratings", fontsize = 20)
plt.ylabel("Number of Ratings", fontsize = 20)
plt.show()
Figure 11: Count number of ratings in the training data. | recommendation system tutorial with Python using collaborative fil
Figure 11: Count the number of ratings in the training data.

Find the number of rated movies per user:

no_rated_movies_per_user = train_data.groupby(by = "customer_id")["rating"].count().sort_values(ascending = False)
no_rated_movies_per_user.head()
Figure 12: The number of rated movies per user. | recommendation system tutorial with Python using collaborative filtering —
Figure 12: The number of rated movies per user.

Find the Rating number per Movie:

no_ratings_per_movie = train_data.groupby(by = "movie_id")["rating"].count().sort_values(ascending = False)
no_ratings_per_movie.head()
Figure 13: Rating number per movie. | recommendation system tutorial with Python using collaborative filtering — Netflix
Figure 13: Rating number per movie.

Create User-Item Sparse Matrix

In a user-item sparse matrix, items’ values are present in the column, and users’ values are present in the rows. The rating of the user is present in the cell. Such is a sparse matrix because there can be the possibility that the user cannot rate every movie items, and many items can be empty or zero.

def get_user_item_sparse_matrix(df):
    sparse_data = sparse.csr_matrix((df.rating, (df.customer_id, df.movie_id)))
    return sparse_data

User-item Train Sparse matrix

train_sparse_data = get_user_item_sparse_matrix(train_data)

User-item test sparse matrix

test_sparse_data = get_user_item_sparse_matrix(test_data)

Global Average Rating

global_average_rating = train_sparse_data.sum()/train_sparse_data.count_nonzero()
print("Global Average Rating: {}".format(global_average_rating))
Figure 14: Global average rating. | recommendation system tutorial with Python using collaborative filtering — Netflix
Figure 14: Global average rating.

Check the Cold Start Problem

Source: Photo by Erik Mclean on Unsplash | recommendation system tutorial with Python using collaborative filtering — Netflix
Source: Photo by Erik Mclean on Unsplash

Calculate the average rating

def get_average_rating(sparse_matrix, is_user):
    ax = 1 if is_user else 0
    sum_of_ratings = sparse_matrix.sum(axis = ax).A1  
    no_of_ratings = (sparse_matrix != 0).sum(axis = ax).A1 
    rows, cols = sparse_matrix.shape
    average_ratings = {i: sum_of_ratings[i]/no_of_ratings[i] for i in range(rows if is_user else cols) if no_of_ratings[i] != 0}
    return average_ratings

Average Rating User

average_rating_user = get_average_rating(train_sparse_data, True)

Average Rating Movie

avg_rating_movie = get_average_rating(train_sparse_data, False)

Check Cold Start Problem: User

total_users = len(np.unique(netflix_rating_df["customer_id"]))
train_users = len(average_rating_user)
uncommonUsers = total_users - train_users
                  
print("Total no. of Users = {}".format(total_users))
print("No. of Users in train data= {}".format(train_users))
print("No. of Users not present in train data = {}({}%)".format(uncommonUsers, np.round((uncommonUsers/total_users)*100), 2))
Figure 15: Cold start problem-user. | recommendation system tutorial with Python using collaborative filtering — Netflix
Figure 15: Cold start problem-user.

Here, 1% of total users are new, and they will have no proper rating available. Therefore, this can bring the issue of the cold start problem.

Check Cold Start Problem: Movie

total_movies = len(np.unique(netflix_rating_df["movie_id"]))
train_movies = len(avg_rating_movie)
uncommonMovies = total_movies - train_movies
                  
print("Total no. of Movies = {}".format(total_movies))
print("No. of Movies in train data= {}".format(train_movies))
print("No. of Movies not present in train data = {}({}%)".format(uncommonMovies, np.round((uncommonMovies/total_movies)*100), 2))

Figure 16: Cold start problem-movie. | recommendation system tutorial with Python using collaborative filtering — Netflix
Figure 16: Cold start problem-movie.

Here, 20% of total movies are new, and their rating might not be available in the dataset. Consequently, this can bring the issue of the cold start problem.

Similarity Matrix

A similarity matrix is critical to measure and calculate the similarity between user-profiles and movies to generate recommendations. Fundamentally, this kind of matrix calculates the similarity between two data points.

Figure 17: Similarity matrix. | recommendation system tutorial with Python using collaborative filtering — Netflix
Figure 17: Similarity matrix.

In the matrix shown in figure 17, video2 and video5 are very similar. The computation of the similarity matrix is a very tedious job because it requires a powerful computational system.

Compute User Similarity Matrix

Computation of user similarity to find similarities of the top 100 users:

def compute_user_similarity(sparse_matrix, limit=100):
    row_index, col_index = sparse_matrix.nonzero()
    rows = np.unique(row_index)
    similar_arr = np.zeros(61700).reshape(617,100)
    
    for row in rows[:limit]:
        sim = cosine_similarity(sparse_matrix.getrow(row), train_sparse_data).ravel()
        similar_indices = sim.argsort()[-limit:]
        similar = sim[similar_indices]
        similar_arr[row] = similar
    
    return similar_arrsimilar_user_matrix = compute_user_similarity(train_sparse_data, 100)

Compute Movie Similarity Matrix

Load movies title data set

movie_titles_df = pd.read_csv("movie_titles.csv",sep = ",", header = None, names=['movie_id', 'year_of_release', 'movie_title'],index_col = "movie_id", encoding = "iso8859_2")movie_titles_df.head()
Figure 18: Movie title list.
Figure 18: Movie title list.

Compute similar movies:

def compute_movie_similarity_count(sparse_matrix, movie_titles_df, movie_id):
    similarity = cosine_similarity(sparse_matrix.T, dense_output = False)
    no_of_similar_movies = movie_titles_df.loc[movie_id][1], similarity[movie_id].count_nonzero()
    return no_of_similar_movies

Get a similar movies list:

similar_movies = compute_movie_similarity_count(train_sparse_data, movie_titles_df, 1775)
print("Similar Movies = {}".format(similar_movies))
Figure 19: Similar movie list. | recommendation system tutorial with Python using collaborative filtering — Netflix
Figure 19: Similar movie list.

Building the Machine Learning Model

Create a Sample Sparse Matrix

def get_sample_sparse_matrix(sparseMatrix, n_users, n_movies):
    users, movies, ratings = sparse.find(sparseMatrix)
    uniq_users = np.unique(users)
    uniq_movies = np.unique(movies)
    np.random.seed(15) 
    userS = np.random.choice(uniq_users, n_users, replace = False)
    movieS = np.random.choice(uniq_movies, n_movies, replace = False)
    mask = np.logical_and(np.isin(users, userS), np.isin(movies, movieS))
    sparse_sample = sparse.csr_matrix((ratings[mask], (users[mask], movies[mask])), 
                                                     shape = (max(userS)+1, max(movieS)+1))
    return sparse_sample

Sample Sparse Matrix for the training data:

train_sample_sparse_matrix = get_sample_sparse_matrix(train_sparse_data, 400, 40)

Sample Sparse Matrix for the test data:

test_sparse_matrix_matrix = get_sample_sparse_matrix(test_sparse_data, 200, 20)

Featuring the Data

Featuring is a process to create new features by adding different aspects of variables. Here, five similar profile users and similar types of movies features will be created. These new features help relate the similarities between different movies and users. Below new features will be added in the data set after featuring of data:

Figure 20: New features. | recommendation system tutorial with Python using collaborative filtering — Netflix
Figure 20: New features.
def create_new_similar_features(sample_sparse_matrix):
    global_avg_rating = get_average_rating(sample_sparse_matrix, False)
    global_avg_users = get_average_rating(sample_sparse_matrix, True)
    global_avg_movies = get_average_rating(sample_sparse_matrix, False)
    sample_train_users, sample_train_movies, sample_train_ratings = sparse.find(sample_sparse_matrix)
    new_features_csv_file = open("/content/netflix_dataset/new_features.csv", mode = "w")
    
    for user, movie, rating in zip(sample_train_users, sample_train_movies, sample_train_ratings):
        similar_arr = list()
        similar_arr.append(user)
        similar_arr.append(movie)
        similar_arr.append(sample_sparse_matrix.sum()/sample_sparse_matrix.count_nonzero())
        
        similar_users = cosine_similarity(sample_sparse_matrix[user], sample_sparse_matrix).ravel()
        indices = np.argsort(-similar_users)[1:]
        ratings = sample_sparse_matrix[indices, movie].toarray().ravel()
        top_similar_user_ratings = list(ratings[ratings != 0][:5])
        top_similar_user_ratings.extend([global_avg_rating[movie]] * (5 - len(ratings)))
        similar_arr.extend(top_similar_user_ratings)
        
        similar_movies = cosine_similarity(sample_sparse_matrix[:,movie].T, sample_sparse_matrix.T).ravel()
        similar_movies_indices = np.argsort(-similar_movies)[1:]
        similar_movies_ratings = sample_sparse_matrix[user, similar_movies_indices].toarray().ravel()
        top_similar_movie_ratings = list(similar_movies_ratings[similar_movies_ratings != 0][:5])
        top_similar_movie_ratings.extend([global_avg_users[user]] * (5-len(top_similar_movie_ratings)))
        similar_arr.extend(top_similar_movie_ratings)
        
        similar_arr.append(global_avg_users[user])
        similar_arr.append(global_avg_movies[movie])
        similar_arr.append(rating)
        
        new_features_csv_file.write(",".join(map(str, similar_arr)))
        new_features_csv_file.write("\n")
        
    new_features_csv_file.close()
    new_features_df = pd.read_csv('/content/netflix_dataset/new_features.csv', names = ["user_id", "movie_id", "gloabl_average", "similar_user_rating1", 
                                                               "similar_user_rating2", "similar_user_rating3", 
                                                               "similar_user_rating4", "similar_user_rating5", 
                                                               "similar_movie_rating1", "similar_movie_rating2", 
                                                               "similar_movie_rating3", "similar_movie_rating4", 
                                                               "similar_movie_rating5", "user_average", 
                                                               "movie_average", "rating"]) return new_features_df

Featuring (adding new similar features) for the training data:

train_new_similar_features = create_new_similar_features(train_sample_sparse_matrix)train_new_similar_features.head()
Figure 21: Featuring dataset for the training dataset. | recommendation system tutorial with Python using collaborative filte
Figure 21: Featuring dataset for the training dataset.

Featuring (adding new similar features) for the test data:

test_new_similar_features = create_new_similar_features(test_sparse_matrix_matrix)test_new_similar_features.head()
Figure 22: Featuring dataset for the test dataset. | recommendation system tutorial with Python using collaborative filtering
Figure 22: Featuring dataset for the test dataset.

Training and Prediction of the Model

Divide the train and test data from the similar_features dataset:

x_train = train_new_similar_features.drop(["user_id", "movie_id", "rating"], axis = 1)x_test = test_new_similar_features.drop(["user_id", "movie_id", "rating"], axis = 1)y_train = train_new_similar_features["rating"]y_test = test_new_similar_features["rating"]

Utility method to check accuracy:

def error_metrics(y_true, y_pred):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return rmse

Fit to XGBRegressor algorithm with 100 estimators:

clf = xgb.XGBRegressor(n_estimators = 100, silent = False, n_jobs  = 10)clf.fit(x_train, y_train)
Figure 23: XGB Regressor Algorithm | recommendation system tutorial with Python using collaborative filtering — Netflix
Figure 23: XGB Regressor Algorithm

Predict the result of the test data set:

y_pred_test = clf.predict(x_test)

Check accuracy of predicted data:

rmse_test = error_metrics(y_test, y_pred_test)print("RMSE = {}".format(rmse_test))
Figure 24: Accuracy. | recommendation system tutorial with Python using collaborative filtering — Netflix
Figure 24: Accuracy.

As shown in figure 24, the RMSE (Root mean squared error) for the predicted model dataset is 99%. If the accuracy is lower than our expectations, we would need to continue to train our model until the accuracy meets a high standard.

Plot Feature Importance

Feature importance is an important technique that selects a score to input features based on how valuable they are at predicting a target variable.

def plot_importance(model, clf):
    fig = plt.figure(figsize = (8, 6))
    ax = fig.add_axes([0,0,1,1])
    model.plot_importance(clf, ax = ax, height = 0.3)
    plt.xlabel("F Score", fontsize = 20)
    plt.ylabel("Features", fontsize = 20)
    plt.title("Feature Importance", fontsize = 20)
    plt.tick_params(labelsize = 15)
    
    plt.show()plot_importance(xgb, clf)
Figure 25: Feature importance plot. | recommendation system tutorial with Python using collaborative filtering — Netflix
Figure 25: Feature importance plot.

The plot shown in figure 25 displays the feature importance of each feature. Here, the user_average rating is a critical feature. Its score is higher than the other features. Other features like similar user ratings and similar movie ratings have been created to relate the similarity between different users and movies.

Conclusion

Over the years, Machine learning has solved several challenges for companies like Netflix, Amazon, Google, Facebook, and others. The recommender system for Netflix helps the user filter through information in a massive list of movies and shows based on his/her choice. A recommender system must interact with the users to learn their preferences to provide recommendations.

Collaborative filtering (CF) is a very popular recommendation system algorithm for the prediction and recommendation based on other users’ ratings and collaboration. User-based collaborative filtering was the first automated collaborative filtering mechanism. It is also called k-NN collaborative filtering. The problem of collaborative filtering is to predict how well a user will like an item that he has not rated given a set of existing choice judgments for a population of users [4].


DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University, nor other companies (directly or indirectly) associated with the author(s). These writings do not intend to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

Published via Towards AI

Resources:

Google Colab implementation.

Github repository.

References:

[1] How retailers can keep up with consumers, McKinsey & Company, https://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-with-consumers

[2] How Netflix’s Recommendation System Works, Netflix Research, https://help.netflix.com/en/node/100639

[3] Recommendations, Figuring out how to bring unique joy to each member, Netflix Research, https://research.netflix.com/research-area/recommendations

[4] Collaborative Filtering, University of Pittsburgh, Peter Brusilovsky, Sue Yeon and Danielle Lee, https://pitt.edu/~peterb/2480-122/CollaborativeFiltering.pdf

Feedback ↓

Sign Up for the Course
`; } else { console.error('Element with id="subscribe" not found within the page with class "home".'); } } }); // Remove duplicate text from articles /* Backup: 09/11/24 function removeDuplicateText() { const elements = document.querySelectorAll('h1, h2, h3, h4, h5, strong'); // Select the desired elements const seenTexts = new Set(); // A set to keep track of seen texts const tagCounters = {}; // Object to track instances of each tag elements.forEach(el => { const tagName = el.tagName.toLowerCase(); // Get the tag name (e.g., 'h1', 'h2', etc.) // Initialize a counter for each tag if not already done if (!tagCounters[tagName]) { tagCounters[tagName] = 0; } // Only process the first 10 elements of each tag type if (tagCounters[tagName] >= 2) { return; // Skip if the number of elements exceeds 10 } const text = el.textContent.trim(); // Get the text content const words = text.split(/\s+/); // Split the text into words if (words.length >= 4) { // Ensure at least 4 words const significantPart = words.slice(0, 5).join(' '); // Get first 5 words for matching // Check if the text (not the tag) has been seen before if (seenTexts.has(significantPart)) { // console.log('Duplicate found, removing:', el); // Log duplicate el.remove(); // Remove duplicate element } else { seenTexts.add(significantPart); // Add the text to the set } } tagCounters[tagName]++; // Increment the counter for this tag }); } removeDuplicateText(); */ // Remove duplicate text from articles function removeDuplicateText() { const elements = document.querySelectorAll('h1, h2, h3, h4, h5, strong'); // Select the desired elements const seenTexts = new Set(); // A set to keep track of seen texts const tagCounters = {}; // Object to track instances of each tag // List of classes to be excluded const excludedClasses = ['medium-author', 'post-widget-title']; elements.forEach(el => { // Skip elements with any of the excluded classes if (excludedClasses.some(cls => el.classList.contains(cls))) { return; // Skip this element if it has any of the excluded classes } const tagName = el.tagName.toLowerCase(); // Get the tag name (e.g., 'h1', 'h2', etc.) // Initialize a counter for each tag if not already done if (!tagCounters[tagName]) { tagCounters[tagName] = 0; } // Only process the first 10 elements of each tag type if (tagCounters[tagName] >= 10) { return; // Skip if the number of elements exceeds 10 } const text = el.textContent.trim(); // Get the text content const words = text.split(/\s+/); // Split the text into words if (words.length >= 4) { // Ensure at least 4 words const significantPart = words.slice(0, 5).join(' '); // Get first 5 words for matching // Check if the text (not the tag) has been seen before if (seenTexts.has(significantPart)) { // console.log('Duplicate found, removing:', el); // Log duplicate el.remove(); // Remove duplicate element } else { seenTexts.add(significantPart); // Add the text to the set } } tagCounters[tagName]++; // Increment the counter for this tag }); } removeDuplicateText(); //Remove unnecessary text in blog excerpts document.querySelectorAll('.blog p').forEach(function(paragraph) { // Replace the unwanted text pattern for each paragraph paragraph.innerHTML = paragraph.innerHTML .replace(/Author\(s\): [\w\s]+ Originally published on Towards AI\.?/g, '') // Removes 'Author(s): XYZ Originally published on Towards AI' .replace(/This member-only story is on us\. Upgrade to access all of Medium\./g, ''); // Removes 'This member-only story...' }); //Load ionic icons and cache them if ('localStorage' in window && window['localStorage'] !== null) { const cssLink = 'https://code.ionicframework.com/ionicons/2.0.1/css/ionicons.min.css'; const storedCss = localStorage.getItem('ionicons'); if (storedCss) { loadCSS(storedCss); } else { fetch(cssLink).then(response => response.text()).then(css => { localStorage.setItem('ionicons', css); loadCSS(css); }); } } function loadCSS(css) { const style = document.createElement('style'); style.innerHTML = css; document.head.appendChild(style); } //Remove elements from imported content automatically function removeStrongFromHeadings() { const elements = document.querySelectorAll('h1, h2, h3, h4, h5, h6, span'); elements.forEach(el => { const strongTags = el.querySelectorAll('strong'); strongTags.forEach(strongTag => { while (strongTag.firstChild) { strongTag.parentNode.insertBefore(strongTag.firstChild, strongTag); } strongTag.remove(); }); }); } removeStrongFromHeadings(); "use strict"; window.onload = () => { /* //This is an object for each category of subjects and in that there are kewords and link to the keywods let keywordsAndLinks = { //you can add more categories and define their keywords and add a link ds: { keywords: [ //you can add more keywords here they are detected and replaced with achor tag automatically 'data science', 'Data science', 'Data Science', 'data Science', 'DATA SCIENCE', ], //we will replace the linktext with the keyword later on in the code //you can easily change links for each category here //(include class="ml-link" and linktext) link: 'linktext', }, ml: { keywords: [ //Add more keywords 'machine learning', 'Machine learning', 'Machine Learning', 'machine Learning', 'MACHINE LEARNING', ], //Change your article link (include class="ml-link" and linktext) link: 'linktext', }, ai: { keywords: [ 'artificial intelligence', 'Artificial intelligence', 'Artificial Intelligence', 'artificial Intelligence', 'ARTIFICIAL INTELLIGENCE', ], //Change your article link (include class="ml-link" and linktext) link: 'linktext', }, nl: { keywords: [ 'NLP', 'nlp', 'natural language processing', 'Natural Language Processing', 'NATURAL LANGUAGE PROCESSING', ], //Change your article link (include class="ml-link" and linktext) link: 'linktext', }, des: { keywords: [ 'data engineering services', 'Data Engineering Services', 'DATA ENGINEERING SERVICES', ], //Change your article link (include class="ml-link" and linktext) link: 'linktext', }, td: { keywords: [ 'training data', 'Training Data', 'training Data', 'TRAINING DATA', ], //Change your article link (include class="ml-link" and linktext) link: 'linktext', }, ias: { keywords: [ 'image annotation services', 'Image annotation services', 'image Annotation services', 'image annotation Services', 'Image Annotation Services', 'IMAGE ANNOTATION SERVICES', ], //Change your article link (include class="ml-link" and linktext) link: 'linktext', }, l: { keywords: [ 'labeling', 'labelling', ], //Change your article link (include class="ml-link" and linktext) link: 'linktext', }, pbp: { keywords: [ 'previous blog posts', 'previous blog post', 'latest', ], //Change your article link (include class="ml-link" and linktext) link: 'linktext', }, mlc: { keywords: [ 'machine learning course', 'machine learning class', ], //Change your article link (include class="ml-link" and linktext) link: 'linktext', }, }; //Articles to skip let articleIdsToSkip = ['post-2651', 'post-3414', 'post-3540']; //keyword with its related achortag is recieved here along with article id function searchAndReplace(keyword, anchorTag, articleId) { //selects the h3 h4 and p tags that are inside of the article let content = document.querySelector(`#${articleId} .entry-content`); //replaces the "linktext" in achor tag with the keyword that will be searched and replaced let newLink = anchorTag.replace('linktext', keyword); //regular expression to search keyword var re = new RegExp('(' + keyword + ')', 'g'); //this replaces the keywords in h3 h4 and p tags content with achor tag content.innerHTML = content.innerHTML.replace(re, newLink); } function articleFilter(keyword, anchorTag) { //gets all the articles var articles = document.querySelectorAll('article'); //if its zero or less then there are no articles if (articles.length > 0) { for (let x = 0; x < articles.length; x++) { //articles to skip is an array in which there are ids of articles which should not get effected //if the current article's id is also in that array then do not call search and replace with its data if (!articleIdsToSkip.includes(articles[x].id)) { //search and replace is called on articles which should get effected searchAndReplace(keyword, anchorTag, articles[x].id, key); } else { console.log( `Cannot replace the keywords in article with id ${articles[x].id}` ); } } } else { console.log('No articles found.'); } } let key; //not part of script, added for (key in keywordsAndLinks) { //key is the object in keywords and links object i.e ds, ml, ai for (let i = 0; i < keywordsAndLinks[key].keywords.length; i++) { //keywordsAndLinks[key].keywords is the array of keywords for key (ds, ml, ai) //keywordsAndLinks[key].keywords[i] is the keyword and keywordsAndLinks[key].link is the link //keyword and link is sent to searchreplace where it is then replaced using regular expression and replace function articleFilter( keywordsAndLinks[key].keywords[i], keywordsAndLinks[key].link ); } } function cleanLinks() { // (making smal functions is for DRY) this function gets the links and only keeps the first 2 and from the rest removes the anchor tag and replaces it with its text function removeLinks(links) { if (links.length > 1) { for (let i = 2; i < links.length; i++) { links[i].outerHTML = links[i].textContent; } } } //arrays which will contain all the achor tags found with the class (ds-link, ml-link, ailink) in each article inserted using search and replace let dslinks; let mllinks; let ailinks; let nllinks; let deslinks; let tdlinks; let iaslinks; let llinks; let pbplinks; let mlclinks; const content = document.querySelectorAll('article'); //all articles content.forEach((c) => { //to skip the articles with specific ids if (!articleIdsToSkip.includes(c.id)) { //getting all the anchor tags in each article one by one dslinks = document.querySelectorAll(`#${c.id} .entry-content a.ds-link`); mllinks = document.querySelectorAll(`#${c.id} .entry-content a.ml-link`); ailinks = document.querySelectorAll(`#${c.id} .entry-content a.ai-link`); nllinks = document.querySelectorAll(`#${c.id} .entry-content a.ntrl-link`); deslinks = document.querySelectorAll(`#${c.id} .entry-content a.des-link`); tdlinks = document.querySelectorAll(`#${c.id} .entry-content a.td-link`); iaslinks = document.querySelectorAll(`#${c.id} .entry-content a.ias-link`); mlclinks = document.querySelectorAll(`#${c.id} .entry-content a.mlc-link`); llinks = document.querySelectorAll(`#${c.id} .entry-content a.l-link`); pbplinks = document.querySelectorAll(`#${c.id} .entry-content a.pbp-link`); //sending the anchor tags list of each article one by one to remove extra anchor tags removeLinks(dslinks); removeLinks(mllinks); removeLinks(ailinks); removeLinks(nllinks); removeLinks(deslinks); removeLinks(tdlinks); removeLinks(iaslinks); removeLinks(mlclinks); removeLinks(llinks); removeLinks(pbplinks); } }); } //To remove extra achor tags of each category (ds, ml, ai) and only have 2 of each category per article cleanLinks(); */ //Recommended Articles var ctaLinks = [ /* ' ' + '

Subscribe to our AI newsletter!

' + */ '

Take our 85+ lesson From Beginner to Advanced LLM Developer Certification: From choosing a project to deploying a working product this is the most comprehensive and practical LLM course out there!

'+ '

Towards AI has published Building LLMs for Production—our 470+ page guide to mastering LLMs with practical projects and expert insights!

' + '
' + '' + '' + '

Note: Content contains the views of the contributing authors and not Towards AI.
Disclosure: This website may contain sponsored content and affiliate links.

' + 'Discover Your Dream AI Career at Towards AI Jobs' + '

Towards AI has built a jobs board tailored specifically to Machine Learning and Data Science Jobs and Skills. Our software searches for live AI jobs each hour, labels and categorises them and makes them easily searchable. Explore over 10,000 live jobs today with Towards AI Jobs!

' + '
' + '

🔥 Recommended Articles 🔥

' + 'Why Become an LLM Developer? Launching Towards AI’s New One-Stop Conversion Course'+ 'Testing Launchpad.sh: A Container-based GPU Cloud for Inference and Fine-tuning'+ 'The Top 13 AI-Powered CRM Platforms
' + 'Top 11 AI Call Center Software for 2024
' + 'Learn Prompting 101—Prompt Engineering Course
' + 'Explore Leading Cloud Providers for GPU-Powered LLM Training
' + 'Best AI Communities for Artificial Intelligence Enthusiasts
' + 'Best Workstations for Deep Learning
' + 'Best Laptops for Deep Learning
' + 'Best Machine Learning Books
' + 'Machine Learning Algorithms
' + 'Neural Networks Tutorial
' + 'Best Public Datasets for Machine Learning
' + 'Neural Network Types
' + 'NLP Tutorial
' + 'Best Data Science Books
' + 'Monte Carlo Simulation Tutorial
' + 'Recommender System Tutorial
' + 'Linear Algebra for Deep Learning Tutorial
' + 'Google Colab Introduction
' + 'Decision Trees in Machine Learning
' + 'Principal Component Analysis (PCA) Tutorial
' + 'Linear Regression from Zero to Hero
'+ '

', /* + '

Join thousands of data leaders on the AI newsletter. It’s free, we don’t spam, and we never share your email address. Keep up to date with the latest work in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

',*/ ]; var replaceText = { '': '', '': '', '
': '
' + ctaLinks + '
', }; Object.keys(replaceText).forEach((txtorig) => { //txtorig is the key in replacetext object const txtnew = replaceText[txtorig]; //txtnew is the value of the key in replacetext object let entryFooter = document.querySelector('article .entry-footer'); if (document.querySelectorAll('.single-post').length > 0) { //console.log('Article found.'); const text = entryFooter.innerHTML; entryFooter.innerHTML = text.replace(txtorig, txtnew); } else { // console.log('Article not found.'); //removing comment 09/04/24 } }); var css = document.createElement('style'); css.type = 'text/css'; css.innerHTML = '.post-tags { display:none !important } .article-cta a { font-size: 18px; }'; document.body.appendChild(css); //Extra //This function adds some accessibility needs to the site. function addAlly() { // In this function JQuery is replaced with vanilla javascript functions const imgCont = document.querySelector('.uw-imgcont'); imgCont.setAttribute('aria-label', 'AI news, latest developments'); imgCont.title = 'AI news, latest developments'; imgCont.rel = 'noopener'; document.querySelector('.page-mobile-menu-logo a').title = 'Towards AI Home'; document.querySelector('a.social-link').rel = 'noopener'; document.querySelector('a.uw-text').rel = 'noopener'; document.querySelector('a.uw-w-branding').rel = 'noopener'; document.querySelector('.blog h2.heading').innerHTML = 'Publication'; const popupSearch = document.querySelector$('a.btn-open-popup-search'); popupSearch.setAttribute('role', 'button'); popupSearch.title = 'Search'; const searchClose = document.querySelector('a.popup-search-close'); searchClose.setAttribute('role', 'button'); searchClose.title = 'Close search page'; // document // .querySelector('a.btn-open-popup-search') // .setAttribute( // 'href', // 'https://medium.com/towards-artificial-intelligence/search' // ); } // Add external attributes to 302 sticky and editorial links function extLink() { // Sticky 302 links, this fuction opens the link we send to Medium on a new tab and adds a "noopener" rel to them var stickyLinks = document.querySelectorAll('.grid-item.sticky a'); for (var i = 0; i < stickyLinks.length; i++) { /* stickyLinks[i].setAttribute('target', '_blank'); stickyLinks[i].setAttribute('rel', 'noopener'); */ } // Editorial 302 links, same here var editLinks = document.querySelectorAll( '.grid-item.category-editorial a' ); for (var i = 0; i < editLinks.length; i++) { editLinks[i].setAttribute('target', '_blank'); editLinks[i].setAttribute('rel', 'noopener'); } } // Add current year to copyright notices document.getElementById( 'js-current-year' ).textContent = new Date().getFullYear(); // Call functions after page load extLink(); //addAlly(); setTimeout(function() { //addAlly(); //ideally we should only need to run it once ↑ }, 5000); }; function closeCookieDialog (){ document.getElementById("cookie-consent").style.display = "none"; return false; } setTimeout ( function () { closeCookieDialog(); }, 15000); console.log(`%c 🚀🚀🚀 ███ █████ ███████ █████████ ███████████ █████████████ ███████████████ ███████ ███████ ███████ ┌───────────────────────────────────────────────────────────────────┐ │ │ │ Towards AI is looking for contributors! │ │ Join us in creating awesome AI content. │ │ Let's build the future of AI together → │ │ https://towardsai.net/contribute │ │ │ └───────────────────────────────────────────────────────────────────┘ `, `background: ; color: #00adff; font-size: large`); //Remove latest category across site document.querySelectorAll('a[rel="category tag"]').forEach(function(el) { if (el.textContent.trim() === 'Latest') { // Remove the two consecutive spaces (  ) if (el.nextSibling && el.nextSibling.nodeValue.includes('\u00A0\u00A0')) { el.nextSibling.nodeValue = ''; // Remove the spaces } el.style.display = 'none'; // Hide the element } }); // Add cross-domain measurement, anonymize IPs 'use strict'; //var ga = gtag; ga('config', 'G-9D3HKKFV1Q', 'auto', { /*'allowLinker': true,*/ 'anonymize_ip': true/*, 'linker': { 'domains': [ 'medium.com/towards-artificial-intelligence', 'datasets.towardsai.net', 'rss.towardsai.net', 'feed.towardsai.net', 'contribute.towardsai.net', 'members.towardsai.net', 'pub.towardsai.net', 'news.towardsai.net' ] } */ }); ga('send', 'pageview'); -->