
YouTube Dislikes Prediction in Real-time — Working With a Combination of Data; A Practical Guide

Last Updated on July 17, 2023 by Editorial Team

Author(s): Nafiu

Originally published on Towards AI.

Hi everyone, this is a practical guide to a fascinating topic: working with a combination of mixed data. We have all been there — you open a dataset, find features with different data types, and wonder how to combine them to train a single machine learning model. Today you will get a simple, guided answer to that. To make things interesting, we will train a machine learning model that predicts YouTube dislikes in real time. It's no surprise anymore that YouTube removed its public dislike count about a year ago — maybe I'm even a little late to this problem — but the dataset we are using today will definitely serve our learning goals. Remember that since the dislike count is a number, we are solving a regression problem. 🙂

Table of contents

  1. Loading data
  2. Data pre-processing
  3. Modeling and training
  4. Evaluation
  5. Real-time prediction

What you will learn

  • Working with mixed data
  • Cleaning text data
  • Processing both text and numeric data
  • The Keras/TensorFlow functional API
  • LSTMs (RNNs)
  • Creating a model that processes different types of data at once
  • How to check the accuracy of a regression model
  • The YouTube API

Reality: on YouTube, the number of views, likes, and dislikes a video gets depends on many things, such as the popularity of the creator, the quality of the video, SEO, user shares, and plenty of other factors that go beyond the dataset available to us. Anyway, let's try to get the best out of what we have 😁

Note: throughout this tutorial, I will point out what you can do to improve this model and anything I may have missed while creating it. Also, when it comes to libraries, we will import them as we need them instead of importing everything at once.

Let’s get started…

You can get the dataset at https://www.kaggle.com/datasets/dmitrynikolaev/youtube-dislikes-dataset

Please download the dataset and extract it into your working directory. It is a mixed-language dataset of 37,422 unique rows; we will shorten it to English-only videos using the video IDs provided with the dataset, which leaves 15,835 unique rows before processing. This is actually a very small amount of data for a problem like this, so try increasing it by processing other languages too, which will surely help improve model accuracy.

Note: This dataset was last updated on 13/12/2021, so we will assume that is the date the data was extracted, since we will need that date when dealing with time periods.

Load the dataset and collect the features we need

Assuming you have the .csv file in your working directory, let's load it with pandas, shorten it to English-only videos, and look at the details of the available features.

Import some libraries to get started

import numpy as np
import pandas as pd
import tensorflow as tf

Loading the dataset

dataset = pd.read_csv('youtube_dislike_dataset.csv')
dataset.head()

Collecting only English data

with open("video_IDs/unique_ids_US.txt", "r") as file_US_ids:
    US_ids = file_US_ids.read().splitlines()
dataset = dataset[dataset['video_id'].isin(US_ids)]
dataset.shape

Checking more details of the dataset, such as the data types of the features

dataset.info()
image by author

The dataset contains 12 features; we will use only 7 of them. All of the available features are described below:

Video_id: unique video id
Title: title of the youtube video
Channel_id: channel id of the publisher
Channel_title: name of the channel
Published_at: the date that the video was published
View_count: Number of views that video got in a period of time
Likes: Number of likes that video has
Dislikes: Number of dislikes that video got in a period of time
Comment_count: Number of comments that the video has
Tags: video tags as strings
Description: Video description
Comments: list of video comments as a string

Well, those are all the available features. Now let's select the ones we will use in this practical guide.

We will be using the following 7 features:
view_count, published_at, likes, comment_count, tags, description, dislikes

dataset = dataset[['view_count', 'published_at','likes', 'comment_count','tags','description','dislikes']]

We will create one more feature and get rid of another during the pre-processing stage. Honestly, adding the title and channel name would be a great improvement for the model — you can try it. 🙂

Data pre-processing

Let's get started with processing the data. In this stage we do several important things to the dataset, such as dealing with missing values, creating related time fields, and cleaning and tokenizing the text data. I'll explain more as we go.

Dealing with missing values

The dataset is quite tricky: if you just look for missing values (NaN), you will not find any, because technically the dataset has no missing values — the empty fields are filled with empty strings. To handle this, we will look for fields that contain an empty string rather than actual words, convert them to NaN, and check again; now we will find a good number of null values. And guess what? We will drop them all, which shrinks our dataset to 13,536 rows in total — pretty small 😂 (a sketch of this step is shown after the figure below).

image by author
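
A minimal sketch of one way to do this is shown below; the regex-based replace is an assumption about how the original notebook handled the empty strings, not the author's exact code.

# Treat empty / whitespace-only strings as missing values, then drop those rows
dataset = dataset.replace(r'^\s*$', np.nan, regex=True)
print(dataset.isnull().sum())          # empty fields now show up as NaN
dataset = dataset.dropna().reset_index(drop=True)
dataset.shape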

Creating a new feature — dealing with time

Above, I mentioned that we will assume the data was extracted on 13/12/2021 — this is where that date comes into play.

We will create a new feature named timesec by calculating the amount of time, in minutes, between the date the video was published and the date the data was extracted.

So basically: timesec = extraction date – published_at date

First, we will write a function that calculates the time between two dates in minutes, and then use the pandas apply method to create the new feature.

Function to calculate the time between:

from datetime import datetime
def calTime(time):
    start = datetime.strptime(time, '%Y-%m-%d %H:%M:%S')
    end = datetime.strptime('13/12/2021 00:00:00', '%d/%m/%Y %H:%M:%S')  # assuming this is the date that this dataset was extracted
    return np.round((end - start).total_seconds() / 60, 2)

Using pandas apply method to run the function and create the new feature

dataset['timesec'] = dataset['published_at'].apply(calTime)

Cleaning the text features

We will create a function that cleans our text data through a series of steps designed to keep the meaningful parts of it.

Basically, the function will remove unnecessary URLs, punctuation marks, numbers, stopwords, and random words that are not in the dictionary. It will also expand contractions, lemmatize words, and finally convert everything to lowercase and return the processed string. We will do this with the help of a few libraries, so let's start by importing them.

Importing the required libraries for this

import re
from string import punctuation
from bs4 import BeautifulSoup
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('words')

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

The contraction dictionary helps expand contracted word pairs back into their separate root words. (Only a few entries are shown here; the full dictionary contains more.)

contraction_map = {
    "ain't": "is not",
    "aren't": "are not",
    "can't": "cannot",
    ...
}

The function we created to clean text


lemmatizer = WordNetLemmatizer()
in_words = set(nltk.corpus.words.words())

def clean_text(text):
    text = str(text)
    text = BeautifulSoup(text, "lxml").text
    text = re.sub(r'\([^)]*\)', '', text)
    text = re.sub('"', '', text)
    text = ' '.join([contraction_map[t] if t in contraction_map else t for t in text.split(" ")])
    text = re.sub(r"'s\b", "", text)
    text = re.sub("[^a-zA-Z]", " ", text)
    text = " ".join(w for w in nltk.wordpunct_tokenize(text) if w.lower() in in_words or not w.isalpha())
    text = [word for word in text.split() if word not in stopwords.words('english')]
    text = [lemmatizer.lemmatize(word) for word in text]
    text = " ".join(text)
    text = text.lower()
    return text

Finally, we create two new columns of cleaned text by applying the function:

%time dataset['clean_description'] = dataset['description'].apply(clean_text)
%time dataset['clean_tags'] = dataset['tags'].apply(clean_text)

Splitting data

Since we are working with different data types, the usual train_test_split function is not ideal here, so we will split the data into train and test sets by slicing. After that we will separate X and Y for both train and test data, then separate the text and numeric features: tags and description are kept apart, while all the numeric features stay together. This might sound confusing, so let's look at the code, which will make things clear.

Note: Y is the dislikes feature, our target variable, and X is view_count, likes, comment_count, timesec, clean_tags, and clean_description. We no longer need the published_at feature.

Slicing to train and test

# Slicing to train and test
dataset_train = dataset.iloc[:12500, :]
dataset_test = dataset.iloc[12500:, :]  # rows 12500 onward form the test set

# Creating X and Y variables of train dataset
X_train = dataset_train.loc[:, dataset.columns != 'dislikes']
Y_train = dataset_train['dislikes'].values

#Creating X and Y variables of the test dataset
X_test = dataset_test.loc[:, dataset.columns != 'dislikes']
Y_test = dataset_test['dislikes'].values

# Splitting and organizing the features of train data
X_train_numaric = X_train[['view_count', 'likes', 'comment_count', 'timesec']].values
X_train_tags = X_train['clean_tags'].values
X_train_desc = X_train['clean_description'].values

# Splitting and organizing the features of the test data
X_test_numaric = X_test[['view_count', 'likes', 'comment_count', 'timesec']].values
X_test_tags = X_test['clean_tags'].values
X_test_desc = X_test['clean_description'].values

Text tokenization — processing text

Here we will create a tokenization function to process our text features. The function builds a tokenizer, tokenizes the words, pads the sequences to a length of 100, and returns a set of useful objects: the processed train and test data, the maximum number of words, the vocabulary size, and the tokenizer itself, which we will use later to make real-time predictions.
We will apply this to both the tags and description features.

Let's get to the code. 🙂

Importing required libraries for this

from tensorflow.keras.preprocessing.text import Tokenizer 
from tensorflow.keras.preprocessing.sequence import pad_sequences

The function to do the processing

def Tokenizer_func(train, test, max_words_length=0, max_seq_len=100):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(train)

    max_words = 0
    if max_words_length > 0:
        max_words = max_words_length
    else:
        max_words = len(tokenizer.word_counts.items())

    tokenizer = Tokenizer(num_words=max_words)
    tokenizer.fit_on_texts(train)

    train_sequences = tokenizer.texts_to_sequences(train)
    test_sequences = tokenizer.texts_to_sequences(test)

    train = pad_sequences(train_sequences, maxlen=max_seq_len, padding='post')
    test = pad_sequences(test_sequences, maxlen=max_seq_len, padding='post')

    voc = tokenizer.num_words + 1
    return {'train': train, 'test': test, 'voc': voc, 'max_words': max_words, 'tokenizer': tokenizer}

Calling the function for tags

X_tags_processed = Tokenizer_func(X_train_tags,X_test_tags)
# Extracting data from it…
X_train_tags,X_test_tags,x_tags_voc,x_tags_max_words,x_tag_tok = X_tags_processed['train'], X_tags_processed['test'],X_tags_processed['voc'],X_tags_processed['max_words'],X_tags_processed['tokenizer']

Calling the function for description

X_desc_processed = Tokenizer_func(X_train_desc,X_test_desc)
# Extracting data from it…
X_train_desc, X_test_desc,x_desc_voc,x_desc_max_words,x_desc_tok = X_desc_processed['train'], X_desc_processed['test'],X_desc_processed['voc'],X_desc_processed['max_words'],X_desc_processed['tokenizer']

Well, we have one more thing to do before ending the processing step of this tutorial.

Numeric data normalization

For this, we will use scikit-learn's StandardScaler, fitting it on the training data and applying the same transformation to the test data. Let's see the code.
Importing the library, creating the scaler, and processing the numeric data with it:

from sklearn.preprocessing import StandardScaler
Sc = StandardScaler()
X_train_numaric = Sc.fit_transform(X_train_numaric)
X_test_numaric = Sc.transform(X_test_numaric)

Yay… we are halfway done — this is the end of data preprocessing. 🎉🎉🎉

Creating the Model and Training

Here we will use the Keras functional API to create the model. We will build a type of RNN known as an LSTM to process the text data and then connect it to an MLP that handles the numeric data. I'm not going to explain LSTMs here since I already covered them in a previous post, linked below…

Stock market prediction using LSTM; will the price go up tomorrow. Practical guide

The goal of this tutorial is to create a machine learning model to predict the future value of a stock traded on a…

medium.com

Allow me to explain our model here:

The model has 3 inputs: a tags input and a description input, both with an input shape of (None,), and a numeric input with a size of 4, since we have 4 numeric features. Both the tags and description branches start with an embedding layer followed by two LSTM layers with 100 units each, and a layer normalization layer after each LSTM. In both branches, the first LSTM layer has a dropout of 0.2 and return_sequences set to True, while the second has a dropout of 0.4 and return_sequences set to False.

After passing through those layers, the tags and description branches are concatenated with the numeric input and passed through 4 dense layers with 256, 128, 32, and 1 units; the first 3 layers use the relu activation function and the last uses a linear activation (because this is a regression problem). Next, the whole model is compiled with the model.compile() method, using mean_squared_error as the loss function, mean_absolute_error as a metric, and Adam as the optimizer with a learning rate of 0.001 and a small decay. Finally, we can view a summary of the model with the model.summary() method.

Let's view the code…

Import required libraries for this.

from keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, concatenate,LayerNormalization
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping

Creating the model

tagsInput = Input(shape=(None,), name='tags')
descInput = Input(shape=(None,), name='desc')
numaricInput = Input(shape=(4,), name='numaric')

tags = Embedding(input_dim=x_tags_voc,output_dim=8,input_length=x_tags_max_words)(tagsInput)
tags = LSTM(100,dropout=0.2, return_sequences=True)(tags)
tags = LayerNormalization()(tags)
tags = LSTM(100,dropout=0.4, return_sequences=False)(tags)
tags = LayerNormalization()(tags)

desc = Embedding(input_dim=x_desc_voc,output_dim=8,input_length=x_desc_max_words)(descInput)
desc = LSTM(100,dropout=0.2, return_sequences=True)(desc)
desc = LayerNormalization()(desc)
desc = LSTM(100,dropout=0.4, return_sequences=False)(desc)
desc = LayerNormalization()(desc)

combined = concatenate([tags, desc,numaricInput])
x = Dense(256,activation='relu')(combined)
x = Dense(128,activation='relu')(x)
x = Dense(32,activation='relu')(x)
x = Dense(1,use_bias=True,activation='linear')(x)
model = Model([tagsInput, descInput,numaricInput], x)

Compiling the model

model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.001, decay=0.001 / 20), metrics=['mae'])

Now let's see the summary of the model using the model.summary() method

image by author

Now it's time to train our model. We will train for up to 500 epochs with a batch size of 25 and a validation split of 20 percent; since early stopping is enabled, training will probably finish well before it gets anywhere near 500 epochs.

Well, there is one more thing we must do before training: the LSTM layers expect 3-dimensional input, but our data is 2-dimensional, so we must reshape it before we start training.

Initializing EarlyStopping to avoid overfitting while training the model

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=20)

Reshaping the data to 3-dimensional data

X_train_tags = np.reshape(X_train_tags,(X_train_tags.shape[0],X_train_tags.shape[1],1))
X_train_desc = np.reshape(X_train_desc,(X_train_desc.shape[0],X_train_desc.shape[1],1))
X_train_tags.shape, X_train_desc.shape, X_train_numaric.shape
image by author

Training the model

history = model.fit(
x=[X_train_tags, X_train_desc,X_train_numaric],
y=Y_train,
epochs=500,
batch_size=25,
validation_split=0.2,
verbose=1,
callbacks=[es]
)
image by author

Here we can see that our model stopped training after 26 epochs.
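
If you want to see what early stopping did, a quick way is to plot the training and validation loss from the history object returned by model.fit(); here is a minimal sketch, assuming matplotlib is available in your environment.

import matplotlib.pyplot as plt

# Plot training vs. validation loss recorded during model.fit()
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('mean squared error')
plt.legend()
plt.show()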

Evaluating our model

Now that our model has finished training, it's time to check the results. But before that, let's answer a simple question and understand a couple of important things.
How do we check the accuracy of a regression model?

Residuals — The difference between the predicted value and the actual value.

Mean Absolute Error — the mean value of the residuals between predicted values and actual values.

  • The lower the MAE the better the model is.
  • MAE is calculated by adding up the absolute values of the residuals and dividing by the length of the dataset (test data)
  • Describes how close the predictions are to the actual values on average.

Mean Squared Error — the mean of the squared residuals; it shows how close the regression predictions are to the data points and penalizes large errors more heavily

Root Mean Squared Error — is the square root of the mean of all the squared values of errors

  • indicates the absolute fit of the model relative to the given data
  • RMSE has a relationship with MAE
  • If the RMSE is much higher than the MAE, it indicates that some predictions have large errors

R-squared — shows how well the model fits the data (the proportion of variance it explains)

  • Outputs a value typically between 0 and 1 (it can be negative for very poor fits)
  • The higher the R2, the better the model is
  • Multiplying the R2 score by 100 gives a percentage-style score, loosely comparable to a classification accuracy

Key points:

  • R2 must be high
  • The lower MSE the better the model is
  • The lower MAE the better the model is
  • The lower RMSE the better the model is
  • MAE should be lower than RMSE
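
To make these definitions concrete, here is a small sketch that computes the same quantities by hand with NumPy; y_true and y_pred are hypothetical arrays standing in for the test labels and the model predictions, and the sklearn calls used further below should give matching values.

y_true = np.array([120.0, 300.0, 45.0])   # hypothetical actual dislike counts
y_pred = np.array([100.0, 330.0, 60.0])   # hypothetical model predictions

residuals = y_true - y_pred               # residuals: actual minus predicted
mae = np.mean(np.abs(residuals))          # mean absolute error
mse = np.mean(residuals ** 2)             # mean squared error
rmse = np.sqrt(mse)                       # root mean squared error
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # R-squared
print(mae, mse, rmse, r2)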

Now let’s implement these in our project and see the model results. We can do this by using the following code

pprydd = model.predict([X_test_tags, X_test_desc, X_test_numaric])
from sklearn import metrics
print("Mean Absolute Error (MAE) - Test data : ", metrics.mean_absolute_error(Y_test, pprydd))
print("Mean Squared Error (MSE) - Test data : ", metrics.mean_squared_error(Y_test, pprydd))
print("Root Mean Squared Error (RMSE) - Test data : ", np.sqrt(metrics.mean_squared_error(Y_test, pprydd)))
print("Co-efficient of determination (R2 Score): ", metrics.r2_score(Y_test, pprydd))

We can see that our model has an R2 score of 0.82 (82%), but we also have a very high MSE, which is not a good sign 😥. Both MAE and RMSE are fairly low compared to the MSE, which is fine, but the RMSE is noticeably higher than the MAE, which indicates that some predictions have large errors 🤕… not the best model in the world, but in my opinion it's not bad given the usable data we had.

Note: you can try improving the model by researching this field more — for example, which factors drive more dislikes on videos — and focusing on those. Also, try using the features we didn't use in this tutorial… anyway, let's continue.

Real-time prediction

Now, the part we have all been waiting for: predicting YouTube dislikes in real time. For this, we will use the YouTube API to fetch a target video's data by ID, collect the required fields, process them, and run them through the model. To make things easy, we will write a single function that does all of this at once.
Let's see the code. 🙂

Import the required library

import googleapiclient.discovery

Create youtube client

DEVELOPER_KEY = 'YOUR_API_KEY' 
youtube_client = googleapiclient.discovery.build('youtube', 'v3', developerKey=DEVELOPER_KEY)

Time to write the function. Let me explain how it works. Once the API response is received, we extract the required fields from it.

Both text fields are cleaned with the cleaning function we created earlier, processed through the corresponding tokenizer, and finally padded to a length of 100.

The timesec field is calculated by a small nested helper function, which computes the time, in minutes, between the date the target video was published and today.

Next, all the numeric features are passed through the scaler for normalization. Once processing is done, we predict the video's dislikes by passing everything through the model, and the function returns the predicted dislike count along with some other useful information.

def realtime(youtube, video_id):
    def calTimesss(time):
        start = datetime.strptime(time, '%Y-%m-%dT%H:%M:%S%z')
        end = datetime.now()
        return np.round((end - start.replace(tzinfo=end.tzinfo)).total_seconds() / 60, 2)

    request = youtube.videos().list(part="snippet, statistics", id=video_id)
    response = request.execute()

    desc = response['items'][0]['snippet']['description']
    desc = [clean_text(desc)]
    desc = x_desc_tok.texts_to_sequences(desc)
    desc = pad_sequences(desc, maxlen=100, padding='post')

    tags = response['items'][0]['snippet']['tags']
    tags = (" ".join(tags))
    tags = [clean_text(tags)]
    tags = x_tag_tok.texts_to_sequences(tags)
    tags = pad_sequences(tags, maxlen=100, padding='post')

    publishedAt = response['items'][0]['snippet']['publishedAt']
    timesec = calTimesss(publishedAt)
    viewcount = response['items'][0]['statistics']['viewCount']
    likeCount = response['items'][0]['statistics']['likeCount']
    commentCount = response['items'][0]['statistics']['commentCount']
    numaricdata = [[viewcount, likeCount, commentCount, timesec]]
    numaricdata = Sc.transform(numaricdata)

    pryd = model.predict([tags, desc, numaricdata])

    return {"predicted": int(pryd[0][0]), "info": {
        "video_id": video_id,
        "likes": likeCount,
        "commentCount": commentCount,
        "viewCount": viewcount,
        "publishedAt": publishedAt,
        "dislike": int(pryd[0][0]),
    }}

Now let's run it, and let’s see the results…

video_id = "videoid"
realtime(youtube_client, video_id)
image by author

Haha, it worked 😂 🎉. According to the model, this video has 1231 dislikes. Ouch, a lot of haters 😟

Well, this is the end of this tutorial on working with mixed data. It was fun, so I hope you enjoyed it, and I hope you will try to improve this model… Keep learning 🙂

link to GitHub repo: https://github.com/nafiu-dev/youtube_dislike_prediction_practice

you can connect with me here:

https://www.instagram.com/nafiu.dev

Linkedin: https://www.linkedin.com/in/nafiu-nizar-93a16720b

My other posts:

End to End full stack project from backend, frontend and machine learning to ethical hacking…

Hi, I welcome you all to this series of building an end to end project starting from backend development, front-end…

medium.com

Create a Boolean Image Classifier Fast With Any Data Set, With a Brief Explanation of the…

Hi everyone, in this post, we are going to look at a type of neural network called the Convolutional Neural Network…

pub.towardsai.net

Highlighting the most popular machine learning algorithms; Implementing it

Overview of the most popular machine learning models, implementing and comparing the accuracy using iris dataset

medium.com


Published via Towards AI
