Basic Linear Algebra for Deep Learning and Machine Learning Python Tutorial
https://towardsai.net/p/machine-learning/basic-linear-algebra-for-deep-learning-and-machine-learning-ml-python-tutorial-444e23db3e9e
Towards AI — The Best of Tech, Science, and Engineering | Mon, 19 Oct 2020

An introductory tutorial to linear algebra for machine learning (ML) and deep learning with sample code implementations in Python

Author(s): Saniya Parveez, Roberto Iriondo

This tutorial’s code is available on Github and its full implementation as well on Google Colab.

The foundations of machine learning and deep learning systems are built entirely upon mathematical principles and concepts, so it is imperative to understand those fundamentals. While establishing a baseline and building a model, many mathematical concepts, such as the curse of dimensionality, regularization, and binary, multi-class, and ordinal regression, must be kept in mind.

The basic unit of deep learning, commonly called a neuron, is wholly based on a mathematical concept: the sum of the products of inputs and weights. Its activation functions, like Sigmoid, ReLU, and others, are also built on mathematical theorems.

These are the essential mathematical areas to understand the basic concepts of machine learning and deep learning appropriately:

Linear Algebra.

Vector Calculus.

Matrix Decomposition.

Probability and Distributions.

Analytic Geometry.

Linear Algebra in Machine Learning and Deep Learning

Linear algebra plays a requisite role in machine learning because most of its objects are vectors, and linear algebra provides the rules to handle them. In machine learning, we mostly tackle classification or regression problems, where error-minimization techniques compare predicted values against actual values, and we use linear algebra to handle those computations. Linear algebra operates on large amounts of data at once; in other words, "linear algebra is the basic mathematics of data."

These are some of the areas in linear algebra that we use in machine learning (ML) and deep learning:

Vector and Matrix.

System of Linear Equations.

Vector Space.

Basis.

Also, these are the areas of machine learning (ML) and deep learning, where we apply linear algebra’s methods:

Derivation of Regression Line.

Linear Equation to predict the target value.

Support Vector Machine Classification (SVM).

Dimensionality Reduction.

Mean Square Error or Loss function.

Regularization.

Covariance Matrix.

Convolution.

Matrix

A matrix is an essential part of linear algebra. It is an m×n tuple of real-valued elements, storing m·n elements of data, and we use it for the computation of linear equation systems and linear mappings.

The number of rows and columns is called the dimension of the matrix.

Vector

In linear algebra, a vector is an n*1 matrix. It has only one column.

Matrix Multiplication

Matrix multiplication is performed row by column: each entry of the product is the dot product of a row of the first matrix with the corresponding column of the second.
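As a small illustration with toy matrices (not from the article):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# Entry (i, j) of the product is the dot product of row i of A
# with column j of B, e.g. C[0, 0] = 1*5 + 2*7 = 19
C = A @ B
```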

In Linear Regression

There are multiple features to predict the price of a house. Below is the table of different houses with their features and the target value (price).

Therefore, to calculate the hypothesis:

Transpose Matrix

For A ∈ R^(m×n), the matrix B ∈ R^(n×m) with b_ij = a_ji is called the transpose of A. It is represented as B = A^T.

Example:

Inverse Matrix

Consider a square matrix A ∈ R^(n×n). If a matrix B ∈ R^(n×n) has the property that AB = I_n = BA, then B is called the inverse of A and is denoted by A^-1.

Orthogonal Matrix

A square matrix A ∈ R^(n×n) is an orthogonal matrix if and only if its columns are orthonormal (mutually orthogonal and of unit length), so that:

Example:

Consequently,

Diagonal Matrix

A square matrix A ∈ R^(n×n) is a diagonal matrix when all the elements off the main diagonal are zero:

A_ij = 0 for all i ≠ j

(the diagonal entries A_ii may take any value, including zero)

Example:

Transpose Matrix and Inverse Matrix in Normal Equation

The normal equation method minimizes J by explicitly taking its derivatives with respect to each θ_j and setting them to zero. It lets us find the value of θ directly, without using gradient descent [4].

An implementation using the data from "Table 1" above is shown in figure 5.

Create matrices of features x and target y:

import numpy as np

# Features
x = np.array([[2, 1834, 1], [3, 1534, 2], [2, 962, 3]])

# Target or Price
y = [8500, 9600, 258800]

Transpose of matrix x:

# Transpose of x
transpose_x = x.transpose()

transpose_x

Multiplication of transposed matrix with original matrix x:
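The remaining normal-equation steps can be sketched as follows; this is a minimal illustration reusing the small feature matrix above, and the intermediate variable names (`xtx`, `theta`) are assumptions, not from the article:

```python
import numpy as np

# Features (from Table 1)
x = np.array([[2, 1834, 1], [3, 1534, 2], [2, 962, 3]])
# Target or Price
y = np.array([8500, 9600, 258800])

transpose_x = x.transpose()

# Multiplication of the transposed matrix with the original matrix x
xtx = transpose_x.dot(x)

# Normal equation: theta = (X^T X)^-1 X^T y
theta = np.linalg.inv(xtx).dot(transpose_x).dot(y)
```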

Linear Equation

The linear equation is the central part of linear algebra, by which many problems are formulated and solved. It is the equation of a straight line.

We represent the linear equation in figure 24:

Example: x = 2

Linear Equations in Linear Regression

Regression is a process that produces the equation of a straight line: it tries to find the best-fitting line for a specific set of data. The equation of that straight line is based on the linear equation:

Y = bX + a

Where,

a = the Y-intercept; it determines the point where the line crosses the Y-axis.

b = the slope; it determines the direction and degree to which the line is tilted.

Implementation

Predict the price of the house where the variables are square feet and price.

Reading the housing price data:

import pandas as pd

df = pd.read_csv('house_price.csv')
df.head()

Calculating the mean:

def get_mean(value):
    total = sum(value)
    length = len(value)
    mean = total / length
    return mean

Calculating the variance:

def get_variance(value):
    mean = get_mean(value)
    mean_difference_square = [pow((item - mean), 2) for item in value]
    variance = sum(mean_difference_square) / float(len(value) - 1)
    return variance

def get_covariance(x_values, y_values):
    # Sample covariance (this helper is called below, but its definition is
    # missing from the original text; this is the standard formula)
    x_mean = get_mean(x_values)
    y_mean = get_mean(y_values)
    products = [(x - x_mean) * (y - y_mean) for x, y in zip(x_values, y_values)]
    return sum(products) / float(len(x_values) - 1)

def linear_regression(df):
    X = df['square_feet']
    Y = df['price']
    square_feet_mean = get_mean(X)
    price_mean = get_mean(Y)
    # variance of X
    square_feet_variance = get_variance(X)
    price_variance = get_variance(Y)
    covariance_of_price_and_square_feet = get_covariance(X, Y)
    # slope and intercept of the best-fit line
    w1 = covariance_of_price_and_square_feet / float(square_feet_variance)
    w0 = price_mean - w1 * square_feet_mean
    # prediction --> Linear Equation
    prediction = w0 + w1 * X
    df['price (prediction)'] = prediction
    return df['price (prediction)']

Calling the ‘linear_regression’ method:

linear_regression(df)

The linear equation used in the method “linear_regression”:

Vector Norms

Vector norms measure the magnitude of a vector [5]. Fundamentally, the size of a given variable x can be represented by its norm ||x||, and the distance between two variables x and y is represented by ||x - y||.

The general equation of the vector norm:

These are the general classes of p-norms:

L1 norm or Manhattan Norm.

L2 norm or Euclidean Norm.

L1 and L2 norms are used in Regularization.

L1 norm or Manhattan Norm

The L1 norm on R^n is defined for x ∈ R^n as ||x||_1 = |x_1| + … + |x_n|, as shown in figure 31:

As shown in figure 32, the red lines symbolize the set of vectors for the L1 norm equation.

L2 norm or Euclidean Norm

The L2 norm of x ∈ R^n is defined as ||x||_2 = sqrt(x_1^2 + … + x_n^2):

As shown in figure 34, the red lines symbolize the set of vectors for the L2 norm equation.
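Both norms can be computed with NumPy; a small sketch with a toy vector (not from the article):

```python
import numpy as np

x = np.array([3.0, -4.0])

# L1 (Manhattan) norm: sum of absolute values, |3| + |-4|
l1 = np.linalg.norm(x, ord=1)

# L2 (Euclidean) norm: square root of the sum of squares, sqrt(9 + 16)
l2 = np.linalg.norm(x, ord=2)
```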

Regularization in Machine Learning

Regularization is a process of modifying the loss function to penalize specific values of the weight on learning. Regularization helps us avoid overfitting.

It is an excellent addition in machine learning for the operations below:

To handle collinearity.

To filter out noise from data.

To prevent overfitting.

To get good performance.

These are the standard regularization techniques:

L1 Regularization (Lasso)

L2 Regularization (Ridge)

Regularization is the application of the norm.

L1 Regularization (Lasso)

Lasso is a widespread regularization technique. Its formula is shown in figure 35:

L2 Regularization (Ridge)

The equation of L2 regularization (Ridge):

Where, λ = Controls the tradeoff of complexity by adjusting the weight of the penalty term.
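As an illustrative sketch with scikit-learn (the toy data and alpha values are assumptions, not from the article), `Lasso` applies the L1 penalty and `Ridge` the L2 penalty:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Toy data: y is roughly 2 * x
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.1, 6.0, 8.1])

lasso = Lasso(alpha=0.1).fit(X, y)  # L1 regularization
ridge = Ridge(alpha=0.1).fit(X, y)  # L2 regularization
```

Larger values of `alpha` (the λ above) shrink the coefficients more; with the L1 penalty, some coefficients can be driven exactly to zero.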

Feature Extraction and Feature Selection

The main intention of feature extraction and feature selection is to pick an optimal set of lower-dimensionality features to enhance classification efficiency. Both techniques address the curse-of-dimensionality problem and are performed on the data matrix.

Feature Extraction

In feature extraction, we find a set of features from the existing features through some function mapping.

Feature Selection

In feature selection, we select a subset of the original features.

Main feature extraction methods are:

Principal Component Analysis (PCA)

Linear Discriminant Analysis (LDA)

PCA is a critical feature extraction method, and it is vital to know the concepts of the covariance matrix, eigenvalues, or eigenvectors to understand the concept of PCA.

Covariance Matrix

The covariance matrix is an integral part of PCA derivation. The below concepts are important to compute a covariance matrix:

Variance.

Covariance.

Variance


The limitation of variance is that it does not explore the relationship between variables.

Covariance

Covariance is used to measure the joint variability of two random variables.

Covariance Matrix

A covariance matrix is a square matrix that gives the covariance between each pair of elements of a given random vector.

The equation of the covariance matrix:

Eigenvalues and Eigenvectors

The definition of Eigenvalues:

Let m be an n×n matrix. A scalar λ is called an eigenvalue of m if there exists a non-zero vector x in R^n such that mx = λx.

and for Eigenvector:

The vector x is called an eigenvector corresponding to λ.

Calculation of Eigenvalues & Eigenvectors

Let m be an n×n matrix with eigenvalue λ and corresponding eigenvector x, so that mx = λx. This equation can be written as below:

mx - λx = 0, i.e., (m - λI)x = 0

So, the equation:

Example:

Calculate eigenvalues and eigenvectors of given matrix m:

Solution:

Here, the size of the matrix is 2. So:

Here, the Eigenvalues of m are 2 and -1.

There are multiple eigenvectors for each eigenvalue: any non-zero scalar multiple of an eigenvector is also an eigenvector.
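These calculations can be checked with NumPy. The matrix below is a hypothetical example chosen so that its eigenvalues are 2 and -1 (the article's matrix m appears only in a figure):

```python
import numpy as np

# Hypothetical 2x2 matrix with eigenvalues 2 and -1
m = np.array([[0.0, 2.0],
              [1.0, 1.0]])

eigenvalues, eigenvectors = np.linalg.eig(m)

# Each column of `eigenvectors` is an eigenvector: m @ x == lambda * x
for lam, x in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(m @ x, lam * x)
```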

Orthogonality

Two vectors v and w are called orthogonal if their dot product is zero [7].

v.w = 0

Example:

Orthonormal Set

If all vectors in the set are mutually orthogonal and all of the unit lengths, then it is called an Orthonormal set [8]. An orthonormal set that forms a basis is called an orthonormal basis.

Span

Let V be a vector space with elements v1, v2, …, vn ∈ V.

The set of all linear combinations of these elements, i.e., all sums of these elements multiplied by scalars (as in the equation shown in figure 53), is called the span.

Example:

Hence:

Span (v1, v2, v3) = av1 + bv2 + cv3

Basis

A basis for a vector space is a sequence of vectors that form a set that is linearly independent and that spans the space [9].

Example:

The vector’s sequence below is a basis:

It is linearly independent as given below:

Principal Component Analysis (PCA)

PCA seeks a projection that preserves as much of the information (variance) in the data as possible. It is a dimensionality reduction technique: it finds the directions of the highest variance and projects the data onto them to decrease the dimensions.

Calculation steps of PCA:

Let there be a set of data points x1, x2, …, xm, each an N×1 vector.

Calculate the sample mean:

Subtract the sample mean from each vector:

Calculate the sample covariance matrix:

Calculate the eigenvalues and eigenvectors of the covariance matrix.

Dimensionality reduction: approximate x using only the first k eigenvectors (k < N).

Python Implementation of the Principal Component Analysis (PCA)

The main goal of Python’s implementation of the PCA:

Implement the covariance matrix.

Derive eigenvalues and eigenvectors.

Understand the concept of dimensionality reduction from the PCA.

Loading the Iris data

import numpy as np
import pylab as pl
import pandas as pd
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

load_iris = datasets.load_iris()
iris_df = pd.DataFrame(load_iris.data, columns=load_iris.feature_names)
iris_df.head()

Standardization

It is always good to standardize the data so that all of its features are on the same scale.
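The standardization and eigendecomposition steps can be sketched as follows; the exact code is not shown in the excerpt above, so this version assumes the Iris data loaded earlier and scikit-learn's `StandardScaler`:

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

load_iris = datasets.load_iris()

# Standardize: zero mean and unit variance for each feature
standardized_x = StandardScaler().fit_transform(load_iris.data)

# Covariance matrix of the standardized features (4x4 for Iris)
covariance_matrix = np.cov(standardized_x.T)

# Eigenvalues and eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)
```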

total_of_eigenvalues = sum(eigenvalues)
variance = [(i / total_of_eigenvalues) * 100 for i in sorted(eigenvalues, reverse=True)]
variance

The values shown in figure 68 indicate the variance giving the analysis below:

1st Component = 72.96%

2nd Component = 22.85%

3rd Component = 3.5%

4th Component = 0.5%

So, the third and fourth components have very low variance; they can be dropped, since they add little value.

Taking 1st and 2nd Components only and Reshaping

eigenpairs = [(np.abs(eigenvalues[i]), eigenvectors[:, i]) for i in range(len(eigenvalues))]

# Sorting from higher values to lower values
eigenpairs.sort(key=lambda x: x[0], reverse=True)

eigenpairs

Multiply the standardized matrix with matrix weighing:

Y = standardized_x.dot(matrix_weighing)
Y

Plotting

plt.figure()
target_names = load_iris.target_names
y = load_iris.target

for c, i, target_name in zip("rgb", [0, 1, 2], target_names):
    plt.scatter(Y[y == i, 0], Y[y == i, 1], c=c, label=target_name)

plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.legend()
plt.title('PCA')
plt.show()

Matrix Decomposition or Matrix Factorization

Matrix Decomposition or factorization is also an important part of linear algebra used in machine learning. Basically, it is a factorization of the matrix into a product of matrices.

There are several techniques of matrix decomposition like LU decomposition, Singular value decomposition (SVD), etc.

Singular Value Decomposition (SVD)

Singular value decomposition is also a technique for dimensionality reduction. As per the singular value decomposition of a matrix:

Let M be a rectangular matrix. It can be broken down into the product of three matrices: an orthogonal matrix U, a diagonal matrix S, and the transpose of an orthogonal matrix V, so that M = U S V^T.
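A minimal NumPy sketch with a toy matrix (not from the article):

```python
import numpy as np

# Toy 3x2 rectangular matrix
M = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# M = U @ diag(S) @ Vt, with U and Vt orthogonal and S the singular values
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# The three factors reconstruct the original matrix
reconstructed = U @ np.diag(S) @ Vt
```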

Conclusion

Machine learning and deep learning have been built upon mathematical concepts. A vast area of mathematics is used to build algorithms and to perform computations on data.

Linear algebra is the study of vectors [10] and of the rules to manipulate them. It is key infrastructure that covers many areas of machine learning, such as linear regression, one-hot encoding of categorical variables, principal component analysis (PCA) for dimensionality reduction, and matrix factorization for recommender systems.

Deep learning relies on linear algebra and calculus, which also underpin several optimization techniques like gradient descent, stochastic gradient descent, and others.

Matrices are an essential part of linear algebra that we use to compactly represent systems of linear equations, linear mapping, and others. Also, vectors are unique objects that can be added together and multiplied by scalars that produce another object of similar kinds. Any suggestions or feedback is crucial to continue to improve. Please let us know in the comments if you have any.

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University, nor other companies (directly or indirectly) associated with the author(s). These writings do not intend to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

Recommendation System Tutorial with Python using Collaborative Filtering
https://towardsai.net/p/machine-learning/recommendation-system-in-depth-tutorial-with-python-for-netflix-using-collaborative-filtering-533ff8a0e444
Mon, 12 Oct 2020

Author(s): Saniya Parveez, Roberto Iriondo

Building a recommendation system using Python and collaborative filtering for a Netflix use case.

Introduction

A recommendation system generates a compiled list of items in which a user might be interested, based on the user's current selection of item(s). It broadens users' suggestions without disturbance or monotony, and it does not recommend items that the user already knows. Similarly, the Netflix recommendation system offers recommendations by matching and searching similar users' habits and suggesting movies that share characteristics with films that users have rated highly.

This tutorial’s code is available on Github and its full implementation as well on Google Colab.

The recommendation system workflow shown in the diagram above shows the user’s collaboration regarding the ratings of different movies or shows. New users get their recommendations based on the recommendations of existing users.

According to McKinsey:

75% of what people are watching on Netflix comes from recommendations [1].

Netflix Real-time data cases:

More than 20,000 movies and shows.

2 million users.

Complications

Recommender systems are machine learning-based systems that scan through all possible options and provide a prediction or recommendation. However, building a recommendation system has the following complications:

Users’ data is interchangeable.

The data volume is large and includes a significant list of movies, shows, customers’ profiles and interests, ratings, and other data points.

Newly registered customers tend to have very limited information.

Real-time prediction for users.

Old users can have an overabundance of information.

It should not show items that are very different or too similar.

Users can change the rating of items if they change their minds.

Types of Recommendation Systems

There are two types of recommendation systems:

Content filtering recommender systems.

Collaborative filtering based recommender systems.

Fun fact: Netflix's recommender system filtering architecture is based on collaborative filtering [2] [3].

Content Filtering

Content filtering expects side information, such as the properties of a song (song name, singer name, movie name, language, and others). Such recommender systems perform well even when new items are added to the library, as long as the algorithm has access to all of the side properties of the library's items.

An essential aspect of content filtering:

Expects item information.

Item information should be in a text document.

Collaborative Filtering

The idea behind collaborative filtering is to consider users’ opinions on different videos and recommend the best video to each user based on the user’s previous rankings and the opinion of other similar types of users.

Pros:

It does not need a movie’s side knowledge like genres.

It uses information collected from other users to recommend new items to the current user.

Cons:

It cannot make recommendations for a new movie or show that has no ratings.

It requires the user community and can have a sparsity problem.

Different techniques of Collaborative filtering:

Non-probabilistic algorithm

User-based nearest neighbor.

Item-based nearest neighbor.

Reducing dimensionality.

Probabilistic algorithm

Bayesian-network model.

EM algorithm.

Issues in Collaborative Filtering

There are several challenges for collaborative filtering, as mentioned below:

Sparseness

The Netflix recommendation system’s dataset is extensive, and the user-item matrix used for the algorithm could be vast and sparse, so this encounters the problem of performance.

The sparsity of data derives from the ratio of the empty and total records in the user-item matrix.

Sparsity = 1 - |R| / (|I| * |U|)

Where,

R = Rating

I = Items

U = Users
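With the Netflix figures mentioned above (and a hypothetical count of observed ratings, which the article does not give), the sparsity works out as:

```python
# |I| and |U| from the article; |R| is assumed for illustration
num_items = 20_000         # movies and shows
num_users = 2_000_000      # users
num_ratings = 100_000_000  # observed ratings (hypothetical)

# Sparsity = 1 - |R| / (|I| * |U|)
sparsity = 1 - num_ratings / (num_items * num_users)
```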

Cold Start

This problem occurs when the system has no information to make recommendations for new users. As a result, matrix factorization techniques cannot be applied.

This problem brings two observations:

How to recommend a new video for users?

What video to recommend to new users?

Solutions:

Suggest or ask users to rate videos.

Default voting for videos.

Use other techniques like content-based or demographic for the initial phase.

User-based Nearest Neighbor

The basic technique of user-based Nearest Neighbor for the user John:

John is an active Netflix user who has not seen a video "v" yet. Here, the user-based nearest-neighbor algorithm works as follows:

The technique finds a set of users or nearest neighbors who have liked the same items as John in the past and have rated video “v.”

The algorithm predicts John's rating for "v" from those neighbors' ratings.

It repeats this for all the items John has not seen and recommends the best candidates.

Essentially, the user-based nearest neighbor algorithm generates a prediction for item i by analyzing the rating for i from users in u’s neighborhood.

Let’s calculate user similarity for the prediction:

Where:

a, b = Users

r(a, p)= Rating of user a for item p

P = set of items rated by both users a and b

Prediction based on the similarity function:

Here, similar users are defined by those that like similar movies or videos.

Challenges

For a considerable amount of data, the algorithm encounters severe performance and scaling issues.

Computational expense can reach O(MN) in the worst case, where M is the number of customers and N is the number of items.

Performance can be increased by applying dimensionality reduction; however, this can reduce the quality of the recommendation system.

Item-based Nearest Neighbor

This technique generates predictions based on similarities between different videos or movies or items.

Prediction for a user u and item i is composed of a weighted sum of the user u’s ratings for items most similar to i.

As shown in figure 8, look for the videos that are similar to video5. Hence, the recommendation is very similar to video4.

Role of Cosine Similarity in building Recommenders

The cosine similarity is a metric used to find the similarity between items/products irrespective of their size. We calculate the cosine of the angle between any two vectors in a multidimensional space. Because only the angle between the vectors matters, it remains applicable for documents of considerable size and dimension.

Where:

The cosine value lies between -1 and 1, where -1 denotes dissimilar items and 1 denotes items that are a perfect match.

p · q gives the dot product between the vectors.

||p|| ||q|| represents the product of the vectors' magnitudes.
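The formula can be written directly in NumPy (the rating vectors below are toy values, not from the article):

```python
import numpy as np

def cosine_similarity(p, q):
    # cos(p, q) = (p . q) / (||p|| * ||q||)
    return np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

# Two rating vectors that point in the same direction
p = np.array([4.0, 5.0, 1.0])
q = np.array([8.0, 10.0, 2.0])

similarity = cosine_similarity(p, q)
```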

Why do Baseline Predictors for Recommenders matter?

Baseline predictors are independent of the user's rating, but they provide predictions for new users.

General Baseline form

bu,i = µ + bu + bi

Where,

b_u and b_i are the user and item baseline predictors, respectively, and µ is the overall average rating.
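For instance, with hypothetical bias values (not from the article):

```python
# Baseline prediction b_{u,i} = mu + b_u + b_i
mu = 3.6    # overall average rating across all users and items (assumed)
b_u = 0.3   # user bias: this user rates 0.3 above the average (assumed)
b_i = -0.2  # item bias: this movie is rated 0.2 below the average (assumed)

baseline_prediction = mu + b_u + b_i
```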

Motivation for Baseline

Imputation of missing values with baseline values.

To compare accuracy with advanced models.

Netflix Movie Recommendation System

Problem Statement

Netflix is a platform that provides online movie and video streaming. Netflix wants to build a recommendation system to predict a list of movies for users based on other movies’ likes or dislikes. This recommendation will be for every user based on his/her unique interest.

Netflix Dataset

combine_data_2.txt: This text file contains movie_id, customer_id, rating, date

movie_title.csv: This CSV file contains movie_id and movie_title

Load Dataset

from datetime import datetime
import pandas as pd
import numpy as np
import seaborn as sns
import os
import random
import matplotlib
import matplotlib.pyplot as plt
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error
import xgboost as xgb
from surprise import Reader, Dataset
from surprise import BaselineOnly
from surprise import KNNBaseline
from surprise import SVD
from surprise import SVDpp
from surprise.model_selection import GridSearchCV

def load_data():
    netflix_csv_file = open("netflix_rating.csv", mode="w")
    rating_files = ['combined_data_1.txt']
    for file in rating_files:
        with open(file) as f:
            for line in f:
                line = line.strip()
                if line.endswith(":"):
                    movie_id = line.replace(":", "")
                else:
                    row_data = [item for item in line.split(",")]
                    row_data.insert(0, movie_id)
                    netflix_csv_file.write(",".join(row_data))
                    netflix_csv_file.write('\n')
    netflix_csv_file.close()
    df = pd.read_csv('netflix_rating.csv', sep=",", names=["movie_id", "customer_id", "rating", "date"])
    return df

netflix_rating_df = load_data()
netflix_rating_df.head()

In a user-item sparse matrix, items are the columns and users are the rows, with each user's rating in the corresponding cell. It is a sparse matrix because a user cannot rate every movie item, so many cells are empty or zero.

total_users = len(np.unique(netflix_rating_df["customer_id"]))
train_users = len(average_rating_user)
uncommonUsers = total_users - train_users
print("Total no. of Users = {}".format(total_users))
print("No. of Users in train data= {}".format(train_users))
print("No. of Users not present in train data = {} ({}%)".format(uncommonUsers, np.round((uncommonUsers / total_users) * 100, 2)))

Here, 1% of total users are new, and they will have no proper rating available. Therefore, this can bring the issue of the cold start problem.

Check Cold Start Problem: Movie

total_movies = len(np.unique(netflix_rating_df["movie_id"]))
train_movies = len(avg_rating_movie)
uncommonMovies = total_movies - train_movies
print("Total no. of Movies = {}".format(total_movies))
print("No. of Movies in train data= {}".format(train_movies))
print("No. of Movies not present in train data = {} ({}%)".format(uncommonMovies, np.round((uncommonMovies / total_movies) * 100, 2)))

Here, 20% of total movies are new, and their rating might not be available in the dataset. Consequently, this can bring the issue of the cold start problem.

Similarity Matrix

A similarity matrix is critical to measure and calculate the similarity between user-profiles and movies to generate recommendations. Fundamentally, this kind of matrix calculates the similarity between two data points.

In the matrix shown in figure 17, video2 and video5 are very similar. The computation of the similarity matrix is a very tedious job because it requires a powerful computational system.

Compute User Similarity Matrix

Computation of user similarity to find similarities of the top 100 users:

def compute_user_similarity(sparse_matrix, limit=100):
    row_index, col_index = sparse_matrix.nonzero()
    rows = np.unique(row_index)
    similar_arr = np.zeros(61700).reshape(617, 100)
    for row in rows[:limit]:
        sim = cosine_similarity(sparse_matrix.getrow(row), train_sparse_data).ravel()
        similar_indices = sim.argsort()[-limit:]
        similar = sim[similar_indices]
        similar_arr[row] = similar
    return similar_arr

similar_user_matrix = compute_user_similarity(train_sparse_data, 100)

Feature engineering ("featuring") is a process of creating new features from different aspects of the variables. Here, features based on the five most similar user profiles and on similar types of movies will be created. These new features help relate the similarities between different movies and users. The new features below will be added to the dataset:

As shown in figure 24, the RMSE (Root mean squared error) for the predicted model dataset is 99%. If the accuracy is lower than our expectations, we would need to continue to train our model until the accuracy meets a high standard.

Plot Feature Importance

Feature importance is an important technique that selects a score to input features based on how valuable they are at predicting a target variable.

The plot shown in figure 25 displays the feature importance of each feature. Here, the user_average rating is a critical feature. Its score is higher than the other features. Other features like similar user ratings and similar movie ratings have been created to relate the similarity between different users and movies.

Conclusion

Over the years, Machine learning has solved several challenges for companies like Netflix, Amazon, Google, Facebook, and others. The recommender system for Netflix helps the user filter through information in a massive list of movies and shows based on his/her choice. A recommender system must interact with the users to learn their preferences to provide recommendations.

Collaborative filtering (CF) is a very popular recommendation system algorithm for the prediction and recommendation based on other users’ ratings and collaboration. User-based collaborative filtering was the first automated collaborative filtering mechanism. It is also called k-NN collaborative filtering. The problem of collaborative filtering is to predict how well a user will like an item that he has not rated given a set of existing choice judgments for a population of users [4].

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University, nor other companies (directly or indirectly) associated with the author(s). These writings do not intend to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

Calculating Linear Regression and Linear Best Fit an In-depth Tutorial with Math and Python
https://towardsai.net/p/machine-learning/calculating-simple-linear-regression-and-linear-best-fit-an-in-depth-tutorial-with-math-and-python-804a0cb23660
Wed, 07 Oct 2020

Author(s): Pratik Shukla, Roberto Iriondo

Diving into calculating a simple linear regression and linear best fit with code examples in Python and math in detail

This tutorial’s code is available on Github and its full implementation as well on Google Colab.

Simple linear regression is a statistical approach that allows us to study and summarize the relationship between two continuous quantitative variables. Simple linear regression is used in machine learning models, mathematics, statistical modeling, forecasting epidemics, and other quantitative fields.

Out of the two variables, one variable is called the dependent variable, and the other variable is called the independent variable. Our goal is to predict the dependent variable’s value based on the value of the independent variable. A simple linear regression aims to find the best relationship between X (independent variable) and Y (dependent variable).

There are three types of relationships. A relationship where we can predict the output variable exactly as a function of the input is called a deterministic relationship. In a random relationship, there is no relationship between the variables. In the statistical world, a perfectly deterministic relationship is unlikely; we generally have an imperfect relationship, called a statistical relationship, which is a mixture of deterministic and random relationships [4].

Examples:

1. Deterministic Relationship:

a. Circumference = 2*pi*radius

b. Fahrenheit = 1.8*celsius+32

2. Statistical Relationship:

a. Number of chocolates vs. cost

b. Income vs. expenditure

Understanding Simple Linear Regression:

The simplest type of regression model in machine learning is a simple linear regression. First of all, we need to know why we are going to study it. To understand it better, why don’t we start with a story of some friends that lived in “Bikini Bottom” (referencing SpongeBob) [3].

SpongeBob, Patrick, Squidward, and Gary lived in the “Bikini Bottom!”. One day Squidward went to SpongeBob, and they had this conversation. Let’s check it out.

Squidward: “Hey, SpongeBob, I have heard that you are so smart!”

SpongeBob: “Yes, sir! There is no doubt in that.”

Squidward: “Is that so?”

SpongeBob: “Umm…Yes!”

Squidward: “So here is the thing. I want to sell my house as I am going to move to my new lavish house downtown. But I cannot figure out at which price I should sell my house! If I keep the price too high, then no one will buy it, and if I set the price low, I might face a large financial loss! So you have to help me find the best price for my house. But, please keep in mind that you have only one day!”

SpongeBob, stressed as always but optimistic about finding a solution, went to his wise friend Patrick’s house to discuss the problem. Patrick was in his living room watching TV with a big bowl of popcorn in his hands, and SpongeBob described the whole situation to him:

Patrick: “That is a piece of cake. Follow me!”

(They decided to go to Squidward’s neighborhood, where his two neighbors recently sold their houses. After making some discreet inquiries, they obtained the following details from Squidward’s new neighbors. Now Patrick explained the whole plan to SpongeBob.)

Patrick: Once we have some essential data on how previous houses in Squidward’s neighborhood sold, I think we can make some logical deductions to predict the price of Squidward’s house. So let us get some data.

From the collected data, Patrick was able to plot the data on a scatter plot:

If we closely observe the graph above, we can notice that we can connect our two data points with a line, and as we know, each line has its equation. From figure 3, we can quickly get the house price if we have the house’s area. It will be easier for us if we get the house price by using some formula. Please note that we can get the house price by plotting a horizontal and vertical line on the graph, but to generalize it, we use the line equation. First, we need to see some basics of geometry and dive into the equation of the line.

Basics of Coordinate Geometry:

We always look from left to right in the coordinate plane to name the points.

After looking from left-to-right, the first point we get must be named (x1,y1), and the second point will be (x2,y2).

Horizontal lines have a slope of 0.

Vertical lines have an “infinite” slope.

If the second point’s Y-coordinate is greater than the first point’s Y-coordinate, the line has a positive (+) slope; otherwise, the line has a negative (−) slope.

Points at the same vertical distance from X-axis have the same Y-coordinate.

Points at the same horizontal distance from the Y-axis have the same X-coordinate.

Now let’s get back to our graph.

We all know that the equation of the line:

From the definition of the slope of a straight line:

From the rules mentioned above, we can infer that in our graph:

(X1 , Y1) = ( 1500 , 150000)

(X2 , Y2) = (2500 , 300000)

Next, we can easily find the slope of the two points.

Taking our example into consideration, in our equation, Y represents the house’s price, and X represents the area of the house.

Now, since we have all the other values, we can calculate the value of the intercept b.

Notice that we can use either point to calculate the intercept; the result will always be the same for the same straight line.
Next, since we have all our parameters, we can write the equation of the line as:

To find the price of Squidward’s house, we need to plug-in X=1800 in the above equation.

Now, we can say that Squidward should sell his house for $ 195,000.00. That was easy.
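The arithmetic above can be sketched in a few lines of Python; the areas and prices are the ones from the example, while the variable names are our own:

```python
# Sketch of the two-point calculation above (values from the example).
x1, y1 = 1500, 150000   # first neighbor's house: area (sq. ft.), price ($)
x2, y2 = 2500, 300000   # second neighbor's house: area (sq. ft.), price ($)

m = (y2 - y1) / (x2 - x1)   # slope: price change per square foot
b = y1 - m * x1             # y-intercept, from plugging point (x1, y1) into Y = mX + b

# Predict the price of Squidward's 1800-square-foot house.
price = m * 1800 + b
print(m, b, price)  # 150.0 -75000.0 195000.0
```

Plugging in the second point instead of the first gives the same intercept, which is the point made above.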

Please note that we only had two data points, so we could quickly plot a single straight line through them and get our equation of a line. The critical thing to notice is that our prediction depends entirely on those two data points: if either value changes, our prediction will likely change as well. To cope with this problem, we use datasets in larger quantities; real-world datasets may contain millions of data points.

Now let us get back to our example. When we have more than two data points in our dataset(the usual case), we cannot draw a single straight line that passes through all points, right? That is why we will use a line that best fits our data set. This line is called the best fit line or the regression line. By using this line’s equation, we will make predictions about our dataset.

Please note that the central concept remains the same. We will find the equation of the line and plug-in X’s value (independent variable) to find Y’s value (dependent variable). We need to find the best fit line for our dataset.

Calculating the Linear Best Fit

As we can see in figure 11, we cannot plot a single straight line that passes through all the points. What we can do instead is minimize the error: we fit a line and then measure the prediction error. Since we have the actual values, we can easily find the error in prediction. Our ultimate goal is to find the line with the minimal error. That line is called the linear best fit.

As discussed above, our goal is to find the linear best fit for our dataset, or in other words, we can say that our goal should be to reduce the error in prediction. Now the question is, how do we calculate the error? One way to measure the distance between the scattered points and the line is to find the distance between their Y values.

To understand it better, let us get back to our actual house price prediction example. We know that the actual selling price of a house with an area of 1800 square feet is $220,000. If we predict the house price based on the line equation, which is Y = 150X-75000, we will get the house price at $ 195,000. Now here we can see that there is a prediction error.

Therefore, we can use the sum of squared errors to measure the prediction error across all of the data points. We start by choosing the line’s parameters randomly and calculating the error; afterward, we adjust the parameters and recalculate the error.

We will repeat this until we get the minimum possible error. This process is a part of the gradient descent algorithm, which we will cover in later tutorials. We think now it is clear that we will recalculate the line’s parameters until we get the best fit line, or we get a minimum error in our prediction.

1. Positive error:

Actual selling price: $ 220,000

Predicted selling price: $ 195,000

Error in prediction: $220,000 − $195,000 = $25,000

2. Negative error:

Actual selling price: $ 160,000

Predicted selling price: $ 195,000

Error in prediction: $160,000 − $195,000 = −$35,000

As we can see, it is also possible to get a negative error. To account for negative errors, we square each error before summing.
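As a small sketch of this idea, we can square the two example errors above and add them up (the values are taken from the example; the variable names are our own):

```python
# Squaring the prediction errors so that positive and negative errors
# both contribute positively to the total.
actual = [220000, 160000]      # actual selling prices from the example
predicted = [195000, 195000]   # prices predicted by Y = 150X - 75000

errors = [a - p for a, p in zip(actual, predicted)]   # [25000, -35000]
squared_errors = [e ** 2 for e in errors]
sum_of_squared_errors = sum(squared_errors)
print(errors, sum_of_squared_errors)
```

Without squaring, the two errors would partially cancel; squaring makes every deviation count toward the total.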

Next, we have to find the parameters of a line that has the least error. Once we have that, we can form an equation of a line and predict the dataset’s data values. We will go through this part later in this tutorial.

Guidelines for regression line:

Use regression lines to predict values only when there is a significant correlation.

Stay within the range of the data, and make sure not to extrapolate. For example, if the data is from 10 to 60, do not try to predict a value for 500.

Do not make predictions for a population based on another population’s regression line.

Use-cases for linear regression:

Height and weight.

Alcohol consumption and blood alcohol content.

Vital lung capacity and pack-years of smoking.

The driving speed and gas mileage.

Finding the Equation for the Linear Best Fit

Before we dive deeper into the derivation of the simple linear regression formula, we will try to find the best fit line parameters without using any formulas. Consider the following table with data points X and Y. In the table, Y’ is the predicted value, and Y − Y’ gives us the prediction error.

Next, we are going to use the sum of squares method to calculate the error. For such, we will have to find (Y-Y’)². Please note that we have three terms in each row of (Y — Y’). First, we will dive into the formula to find the square with three terms.

In our case, the value of (Y — Y’)² for each row will be:

Next, notice that we need to add all the squared terms in our formula of the error sum of squares.

Next, our goal is to determine the values of the slope (m) and the y-intercept (b). To find these values, we will use the formula for the vertex of a second-degree polynomial: a parabola ax² + bx + c reaches its extremum at x = −b/(2a).

Next, we need to rearrange our central equation to bring it in a second-degree polynomial form. As we know that if we have two linear equations, we can quickly solve them and get the required values. Hence, our ultimate goal will be to find two linear equations and solve them.

Now that we have two equations, we can solve them to find the slope and intercept values.

Now we have all the required values for our line of best fit. So we can write our line of best fit as:

We can also plot the data on a scatter plot with the line of best fit.

So this is how we can find the best fit line for a specific dataset. We can notice that for a larger dataset, this task can be cumbersome. As a solution to that, we will use a formula that will give us the required parameter values.

However, we will not dive into the formula. Instead, we will first see how the formula is derived, and then we will use it in a code example with Python to understand the math behind it.

In conclusion, a simple linear regression is a technique in which we find a line that best fits our dataset, and once we have that line, we can predict the value of the dependent variable based on the value of the independent variable using the equation of a line and its optimal parameters.

Derivation of Simple Linear Regression Formula:

1. We have a total of n data points (Xi, Yi), ranging from i=1 to i=n.

2. We define the linear best fit as:

3. We can write the error function as following:

4. We can substitute the value of equation 2 in equation 3:

Next, our ultimate goal is to find the best fit line, which means the error function S should be at its minimum. To minimize the error function S, we must find where the first derivatives of S with respect to a and b are equal to 0.

Finding a (Intercept):

1. Finding the partial derivative of S with respect to a:

2. Simplifying the calculations:

3. Using the chain rule of partial derivatives:

4. Finding partial derivatives:

5. Putting it together:

6. To find the extreme values, we take the derivative=0:

7. Simplifying:

8. Further simplifying:

9. Finding the summation of a:

10. Substituting the values in the main equation:

11. Simplifying the equation:

12. Further simplifying the equation:

13. Simplifying the equation for the value of a:

Finding B (Slope):

1. Finding the partial derivative of S with respect to B:

2. Simplifying the calculations:

3. Using the chain rule of partial derivatives:

4. Finding partial derivatives:

5. Putting it together:

6. Distributing Xi:

7. To find the extreme values, we take the derivative=0:

8. Simplifying:

9. Substituting the value of a in our equation:

10. Further simplifying:

11. Splitting up the sum:

12. Simplifying:

13. Finding B from the above equation:

14. Further simplifying the equation:

Finding a (Intercept) in a generalized form:

1. Get the value of a:

2. Simplifying the formula:

Simple Linear Regression Formulas:
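As a hedged sketch of these closed-form formulas, the following Python function computes the slope B and intercept a directly from the sums over the data; the function name and the small check dataset are our own illustration, not part of the original code:

```python
# A minimal implementation of the closed-form simple linear regression
# formulas derived above (a = intercept, b = slope, matching the derivation).
def best_fit(X, Y):
    n = len(X)
    sum_x, sum_y = sum(X), sum(Y)
    sum_xy = sum(x * y for x, y in zip(X, Y))
    sum_x2 = sum(x * x for x in X)
    # Slope: b = (n*Σxy - Σx*Σy) / (n*Σx² - (Σx)²)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: a = (Σy - b*Σx) / n, i.e. a = mean(Y) - b*mean(X)
    a = (sum_y - b * sum_x) / n
    return a, b

# Check on data that lies exactly on Y = 2X + 1.
a, b = best_fit([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 1.0 2.0
```

Because the check data lies exactly on a line, the formulas must recover that line with zero error, which makes this a convenient sanity test before running the code on a real dataset.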

Simple Linear Regression Python Implementation from Scratch:

In the following Python code for simple linear regression, we will not use a Python library to find the optimal parameters for the regression line; instead, we will use the formulas derived earlier to find the regression (best fit) line for our dataset.

1. Import the required libraries:

2. Read the CSV file:

3. Get the list of columns in our dataset:

4. Checking for null values:

5. Selecting columns to build our model:

6. Plot the data on the scatterplot:

7. Divide the data into training and testing dataset:

8. Main function to calculate the coefficients of the linear best fit:

The formulas used in the following code are:

9. Check the working of the function with dummy data:

10. Plot the dummy data with the regression line:

11. Finding the coefficients for our actual dataset:

12. Plot the regression line with actual data:

13. Define the prediction function:

14. Predicting the values based on the prediction function:

15. Predicting values for the whole dataset:

16. Plotting the test data with the regression line:

17. Plot the training data with the regression line:

18. Plot the complete data with regression line:

19. Create a data frame for actual and predicted values:

20. Plot the bar graph for actual and predicted values:

21. Residual Sum of Squares:

22. Calculating the error:

So, that is how we can perform Simple Linear Regression from scratch with Python. Although Python libraries can perform all these calculations without diving in-depth, it is always good practice to know how these libraries perform such mathematical calculations.

Next, we will use the Scikit-learn library in Python to find the linear-best-fit regression line on the same data set. In the following code, we will see a straightforward way to calculate a simple linear regression using Scikit-learn.

Simple Linear Regression Using Scikit-learn:

1. Import the required libraries:

2. Read the CSV file:

3. Feature selection for regression model:

4. Plotting the data points on a scatter plot:

5. Dividing data into testing and training dataset:

6. Training the model:

7. Predicting values for a complete dataset:

8. Predicting values for training data:

9. Predicting values for testing data:

10. Plotting regression line for complete data:

11. Plotting regression line with training data:

12. Plotting regression line with testing data:

13. Create dataframe for actual and predicted data points:

14. Plotting the bar graph for actual and predicted values:

15. Calculating error in prediction:

Here we can see that we get the same output when using the Scikit-learn library. Therefore, we can be confident that the calculations we performed and the derivations we walked through are accurate.

Please note that there are other methods to calculate the prediction error, and we will try to cover them in our future tutorials.

That is all for this tutorial. We hope you enjoyed it and learned something new from it. If you have any feedback, please leave us a comment or send us an email directly. Thank you for reading!

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University. These writings do not intend to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

Bernoulli Distribution — Probability Tutorial with Python
https://towardsai.net/p/statistics/bernoulli-distribution-probability-tutorial-with-python-90061ee078a
Fri, 25 Sep 2020

Bernoulli distribution tutorial — diving into the discrete probability distribution of a random variable with examples in Python

In this series of tutorials, we will dive into probability distributions in detail. We will not just showcase formulas; instead, we will see how each formula derives from its basic definition (as it is essential to understand the math behind the derivations), and we will illustrate each one with examples in Python.

This tutorial’s code is available on Github and its full implementation as well on Google Colab.

Table of Contents:

What is a Random Variable?

Discrete Random Variable.

Continuous Random Variable.

Probability Distributions.

Bernoulli Distribution.

Probability Mass Function (PMF).

Mean of Bernoulli Distribution.

The variance of a Bernoulli Distribution.

Standard Deviation of Bernoulli Distribution.

Mean Deviation of Bernoulli Distribution.

Moment Generating Function for a Bernoulli Distribution.

Cumulative Distribution Function (CDF) for a Bernoulli Distribution.

Before diving deep into probability distributions, let’s first understand some basic terminology about a random variable.

What is a Random Variable?

A variable is called a random variable if its value is not known in advance. In other words, a variable is a random variable if its value cannot be determined by a deterministic function of known quantities.

A random variable is a variable whose possible values are numerical outcomes of a random phenomenon.

Properties of a random variable:

We denote random variables with a capital letter.

Random variables can be discrete or continuous.

Examples:

Tossing a fair coin:

In figure 1, we show that the outcome is not dependent on any other variables. So the output of tossing a coin will be random.

2. Rolling a fair die:

In figure 2, we can notice that the output of a die cannot be predicted in advance, and it is not dependent on any other variables. So we can say that the output will be random.

Now let’s have a brief look at non-random variables.

In the examples above, we can see that in example 1, we can quickly get the value of variable x by subtracting one from both sides. Therefore, the value of x is not random but fixed. In the second example, the value of variable y depends on the value of variable x: y changes according to x, and we get the same output y whenever we plug in the same value of x. So variable y is not random at all. In probability distributions, we will work with random variables.

Discrete Random Variable:

A random variable is called a discrete random variable if its values can be obtained by counting. Discrete variables can be counted in a finite amount of time. The critical thing to note here is that discrete variables need not be integers: we can have discrete random variables that take finite float values.

Examples:

The number of students present on a school bus.

The number of cookies on a plate.

The number of heads while flipping a coin.

The number of planets around a star.

The net income of family members.

Continuous Random Variable:

A random variable is called a continuous random variable if its values can be obtained by measuring. We cannot count continuous variables in a finite amount of time. In other words, it would take an infinite amount of time to count all the values of a continuous variable.

Examples:

The exact weight of a random animal in the universe.

The exact height of a randomly selected student.

The exact distance traveled in an hour.

The exact amount of food eaten yesterday.

The exact winning time of an athlete.

The vital thing to notice is that we are mentioning the word “Exact” here. It means that all the measurements we take are up to absolute precision.

For example, if we measure the completion time of a race for an athlete, we can say that he completed the race in 9.5 seconds. To be more precise, we can say that he completed the race in 9.52 seconds. To be more precise, we can say that the athlete completed the race in 9.523 seconds. To add more precision to the time taken, we can also say that he completed the race in 9.5238 seconds. If we keep on doing this, we can take this thing to an infinite level of precision, and it will take us an infinite amount of time to measure it. That is why it is called a continuous variable.

Main Difference Between Discrete and Continuous Variable:

Example: What is your current age?

What do you think about this? Is it a continuous variable or discrete variable? Please take a moment to think about it.

The example is classified into the group of continuous variables. As discussed above, we can say the following about your age:

Notice that we can continue writing age with more and more precision. Therefore, we can not count the exact age of a person in a finite amount of time. That is why it is a continuous variable.

On the other hand, if the question were, “What is your current age in years?”, then the variable would be classified in the group of discrete variables, since we already know that a person’s age at this point is an “X amount of years.”

Next, let’s discuss probability distributions. Probability distributions are based on data types, and they can be either discrete or continuous.

Probability Distribution:

A probability distribution is a mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment.[1]

Bernoulli Distribution:

Conditions for the Bernoulli Distribution

There must be only one trial.

There must be only two possible outcomes of the trial; one is called a success, and the other is called a failure.

P(Success) = p

P(Failure) = 1 − p = q

Conventionally, we assign the value of 1 to the event with probability p and the value of 0 to the event with probability 1 − p.

Conventionally, we also have p > 1 − p. In other words, we take the probability of success (1) as p and the probability of failure (0) as 1 − p, so that P(Success) > P(Failure).

We must know the probability of one of the events (success or failure), or have some past data that gives an experimental probability.

If our data satisfies the conditions above, then:

A discrete random variable X follows a Bernoulli distribution with the probability of success=p.
Visual representation of Bernoulli distribution:

Examples:

For instance:

There are only two candidates in an election: Patrick and Gary, and we can either vote for Patrick or Gary.

P(Success) = P(1) = Vote for Patrick = 0.7

P(Failure) = P(0) = Vote for Gary = 0.3

Here we have only one trial and only two possible outcomes. So we can say that the data follows a Bernoulli distribution. To visualize it:
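To complement the visualization, here is a small numerical sketch of the election example in Python; the `bernoulli_pmf` helper and the voter simulation are our own illustration:

```python
# Sketch of the election example: a Bernoulli(p = 0.7) random variable,
# where 1 = vote for Patrick and 0 = vote for Gary.
import random

p = 0.7

def bernoulli_pmf(x, p):
    """P(X = x) for a Bernoulli distribution with success probability p."""
    return p if x == 1 else 1 - p

# Simulate 10,000 voters and compare the observed frequency of 1s with p.
random.seed(42)
votes = [1 if random.random() < p else 0 for _ in range(10000)]
print(bernoulli_pmf(1, p), bernoulli_pmf(0, p), sum(votes) / len(votes))
```

With a large number of simulated voters, the observed fraction of votes for Patrick lands close to p = 0.7, as the distribution predicts.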

Probability Mass Function (PMF):

A probability mass function of a discrete random variable X assigns probabilities to each of the possible values of the random variable. By using PMF, we can get the probabilities of each random variable.

Let X be a discrete random variable with its possible values denoted by x1, x2, x3, …, xn. The probability mass function(PMF) must satisfy the following conditions:

Properties of PMF:

The sum of all the probabilities in a given PMF must be 1.

2. All the possible probability values must be greater than or equal to 0.

Probability Mass Function (PMF) for Bernoulli Distribution:

Let’s visualize the function:

Mean for Bernoulli Distribution:

The mean of a discrete random variable X is a weighted average of its possible values, where each value is weighted by its probability. In the Bernoulli distribution, the random variable X can take only two values, 0 and 1, and we get the weights from the Probability Mass Function (PMF).

Mean: The mean of a probability distribution is the long-run arithmetic average value of a random variable having that distribution.

The expected value E[X] expresses the likelihood of the favored event.

The expected value or the mean of Bernoulli Distribution is given by:

Mean of Bernoulli Distribution:

Variance for Bernoulli Distribution:

Variance (σ²) is a measure of how far each number in a set of random numbers is from the mean. The square root of the variance is called the standard deviation.

Based on its definition:

The variance of a discrete probability distribution:

In our case, variable x can take only two values: 0 and 1.

The variance of Bernoulli Distribution:

There is a more popular form to find variance in statistics:

Let’s see how this came into existence.

Basically, the variance is the expected value of the squared difference between each value and the mean of the distribution.

From the definition of variance, we can then:

Finding the variance using this formula:

In figure 25, we can see that the Bernoulli distribution variance is the same regardless of which formula we use.
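We can mirror that check in a few lines of Python, evaluating both variance formulas for p = 0.7 (the variable names are our own):

```python
# Checking numerically that the two variance formulas agree for a
# Bernoulli distribution: Var(X) = Σ(x - μ)²P(x) = E[X²] - E[X]² = p(1 - p).
p = 0.7
q = 1 - p

mean = 0 * q + 1 * p            # E[X] = p
e_x2 = 0**2 * q + 1**2 * p      # E[X²] = p
var_definition = (0 - mean) ** 2 * q + (1 - mean) ** 2 * p  # definition form
var_shortcut = e_x2 - mean ** 2                             # E[X²] - E[X]² form

print(var_definition, var_shortcut, p * q)  # all three are 0.21
```

Both forms reduce to p(1 − p) = pq, here 0.7 × 0.3 = 0.21.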

Standard Deviation for Bernoulli Distribution:

A standard deviation is a number used to tell how measurements for a group are spread out from the average (mean or expected value).

A low standard deviation means that most of the numbers are close to the average, while a high standard deviation means that the numbers are more spread out.

Mean Deviation for Bernoulli Distribution:

The mean deviation is the mean of the absolute deviations of a data set about the data’s mean.

Based on the definition:

For Discrete probability Distribution:

Finding the mean deviation for the Bernoulli distribution:
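As an illustrative sketch (assuming p = 0.7, the value used throughout this tutorial), the mean deviation E|X − μ| of a Bernoulli variable works out to 2p(1 − p):

```python
# Mean deviation of a Bernoulli(p) variable: E|X - μ| with μ = p.
# |0 - p|*(1 - p) + |1 - p|*p = p(1 - p) + (1 - p)p = 2p(1 - p).
p = 0.7
q = 1 - p

mean = p
mean_deviation = abs(0 - mean) * q + abs(1 - mean) * p  # Σ|x - μ|P(x)
print(mean_deviation, 2 * p * q)  # both 0.42
```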

Moment Generating Function For Bernoulli Distribution:

For the following derivations, we will use the formulas we derived in our previous tutorial, so we recommend checking out our tutorial on the Moment Generating Function.

Moment Generating Function:

Finding Raw Moments:

1. First Moment:

a. First Raw Moment:

2. Second Moment:

a. Second Raw Moment:

b. Second Central Moment (Variance):

3. Third Moment:

a. Third Raw Moment:

b. Third Central Moment:

c. Third Standardized Moment (Skewness):

4. Fourth Moment:

a. Fourth Raw Moment:

b. Fourth Central Moment:

c. Fourth Standardized Moment (Kurtosis):

Cumulative Distribution Function(CDF):

Based on the Probability Mass Function (PMF), we can write the Cumulative Distribution Function (CDF) for the Bernoulli distribution as follows:

Now, on to the fun part: let’s move to its implementation in Python.

Python Implementation:

1. Import the required libraries:

2. Find the moments:

3. Get the mean value:

4. Get median value:

5. Get variance value:

6. Get standard Deviation value:

7. Probability Mass Function (PMF):

8. Plotting the PMF:

9. Cumulative Distribution Function (CDF):

10. Plot the CDF:

11. Plot the bar graph for PMF:

12. Plot the bar graph for CDF:

13. Output for different experiments:

Summary of the Bernoulli Distribution:

That is it for the Bernoulli distribution tutorial. We hope you enjoyed reading it and learned something new. We will try to cover more probability distributions in-depth in the future. Any suggestions or feedback is crucial to continue to improve. Please let us know in the comments if you have any.

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University, nor other companies (directly or indirectly) associated with the author(s). These writings do not intend to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

Moment Generating Function for Probability Distribution with Python
https://towardsai.net/p/data-science/moment-generating-function-for-probability-distribution-with-python-tutorial-34857e93d8f6
Sat, 19 Sep 2020

Diving into the Moment Generation Function for probability distribution with a complete derivation and code examples in Python

Author(s): Pratik Shukla, Roberto Iriondo

This tutorial’s code is available on Github and its full implementation as well on Google Colab.

Table of Contents:

Moments in Statistics.

Raw Moments.

Centered Moments.

Standardized Moments.

Moment Generating Function.

Proof of Moment Generating Function.

Derivation of Relationship between Raw and Central Moments.

Python Implementation.

What is a Moment in Statistics?

We generally use moments in statistics, machine learning, mathematics, and other fields to describe the characteristics of a distribution.

Let’s say the variable of our interest is X; then the moments are the expected values of powers of X: for example, E(X), E(X²), E(X³), E(X⁴), and so on.

Moments in statistics:

1) First Moment: Measure of the central location.

2) Second Moment: Measure of dispersion/spread.

3) Third Moment: Measure of asymmetry.

4) Fourth Moment: Measure of outliers/tailedness.

Now, we are all very familiar with the first moment (the mean) and the second moment (the variance). The third moment is called skewness, and the fourth moment is known as kurtosis. The third moment measures the asymmetry of a distribution, while the fourth moment measures how heavy its tails are. Physicists generally use higher-order moments in applications of physics. Let’s have a look at visualizations of the third and fourth moments.

Third Moment(Skewness):

1) No Skew:

2) Positive Skew:

3) Negative Skew:

Fourth Moment(Kurtosis):

We will study each of these moments in detail in our next tutorial on Descriptive Statistics. In this tutorial, we will learn about the Moment Generating Function(MGF). Before getting into that, let us have a look at the formulas for the moments.

Raw Moments:

In the following formulas, “A” is an arbitrary variable. Usually, while calculating raw moments, we take A=0.

Centered Moments:

Standardized Moments:

What is the Moment Generating Function(MGF)?

As the name implies, Moment Generating Function is a function that generates moments — E(X), E(X²), E(X³), E(X⁴), … , E(X^n).

Let’s have a look at the definition of MGF:

Now notice that there is E[e^tX] in the formula of the Moment Generating Function, while we are interested in finding the value of E[X^n].

Taking the nth derivative of E[e^tX] with respect to t and plugging in t=0 gives us E[X^n].

Now let us prove that the nth derivative of E[e^tX] is the nth moment.

a) Finding the first derivative:

Here we can see that it gives us the first moment.

b) Finding the second derivative:

Here we can see that it gives us the second moment.

From these two derivations, we can confidently say that the nth-derivative of Moment Generating Function is the nth-moment.
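As a numerical illustration (not a proof), we can approximate the derivatives of an MGF at t = 0 with finite differences. Here we assume the MGF of a Bernoulli distribution, M(t) = (1 − p) + p·e^t with p = 0.7, whose first and second raw moments both equal p:

```python
# Numerically checking that derivatives of the MGF at t = 0 give raw moments.
import math

p = 0.7

def M(t):
    """Assumed MGF of a Bernoulli(p) random variable: E[e^tX] = (1-p) + p*e^t."""
    return (1 - p) + p * math.exp(t)

h = 1e-5
first_moment = (M(h) - M(-h)) / (2 * h)             # central difference ≈ M'(0) = E[X]
second_moment = (M(h) - 2 * M(0) + M(-h)) / h ** 2  # ≈ M''(0) = E[X²]

print(first_moment, second_moment)  # both close to p = 0.7
```

For a Bernoulli variable, X² = X, so E[X] and E[X²] are both p, which is exactly what the two approximated derivatives return.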

What is the role of “t” in Moment Generating Function?

From the above derivations, we can see that the variable “t” works as a helper variable: differentiating with respect to “t” and then evaluating at t = 0 picks out the different moments from the Moment Generating Function.

Why do we need MGF?

In the case of a continuous probability distribution, we have to integrate the Probability Density Function (PDF) to find the moments of a distribution. Performing that integration adds complexity to an algorithm and increases a program’s run time. As an alternative, we use Moment Generating Functions and their derivatives to find the moments. Please note that we can get the moments without using the Moment Generating Function, but it gets complicated as we move on to higher-order moments.

Relationship between Raw and Central moments:

At this point, we know that,

Now we will find the relationship between the central moment and raw moment.

e) Simplifying the terms using the definition of the raw moment:

f) Write the formula in a simple form:

Voila! We have derived the general formula relating raw moments and central moments. Now let’s write out the first four central moments in terms of raw moments.

1) First Central Moment in terms of Raw Moments:

2) Second Central Moment in terms of Raw Moments:

3) Third Central Moment in terms of Raw Moments:

4) Fourth Central Moment in terms of Raw Moments:

In summary,

Please note that we get the raw moments while finding the moments by Moment Generating Function(MGF). We can find out the central moments from the raw moments using the above-derived formulas. We can easily find the standardized moments using the central moments. We will use these formulas in our future tutorials on probability distributions.
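The conversion formulas above can be verified numerically. A minimal sketch (the five-point dataset is an arbitrary illustration):

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Raw moments about the origin: mu'_n = E[X^n]
m1, m2, m3, m4 = (np.mean(data**n) for n in range(1, 5))

# Central moments from the raw moments, using the derived formulas
central_2 = m2 - m1**2
central_3 = m3 - 3 * m1 * m2 + 2 * m1**3
central_4 = m4 - 4 * m1 * m3 + 6 * m1**2 * m2 - 3 * m1**4

print(central_2, central_3, central_4)

# Cross-check against the direct definition E[(X - mu)^n]
mu = np.mean(data)
print(np.mean((data - mu)**2), np.mean((data - mu)**3), np.mean((data - mu)**4))
```

Both routes produce identical values, confirming the raw-to-central conversion.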

Python Implementation:

Using Python, we can find the central moments for a dataset. Let’s have a look at a few examples.

1) 1-Dimensional Data:

2) 2-Dimensional Data:

3) 2-Dimensional Data with axis=1:

4) Multi-Dimensional Data:

5) Higher-Order Moments:
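The examples above can be reproduced with a small NumPy-only helper (a sketch of ours mirroring the behavior of `scipy.stats.moment`, a common choice for this task):

```python
import numpy as np

def central_moment(a, order, axis=None):
    # nth central moment E[(X - mean)^n] along the given axis
    a = np.asarray(a, dtype=float)
    mean = np.mean(a, axis=axis, keepdims=True)
    return np.mean((a - mean) ** order, axis=axis)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(central_moment(x, 2))          # second central moment (population variance)

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
print(central_moment(X, 2, axis=1))  # one second central moment per row

print(central_moment(x, 4))          # a higher-order (fourth) central moment
```

The `axis` argument handles the 1-D, 2-D, and `axis=1` cases listed above with the same function.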

Key Points:

For any valid Moment Generating Function, we can say that the 0th moment will be equal to 1.

Finding the derivatives using the Moment Generating Function gives us the Raw moments.

Once we have the MGF for a probability distribution, we can easily find the n-th moment.

Each probability distribution has a unique Moment Generating Function.

We can find moments without using Moment Generating Function, but using MGF reduces the time and space complexity.

In future articles, we will cover each probability distribution in detail, along with its Moment Generating Function, using the formulas derived in this piece. Any suggestions or feedback is crucial to continue to improve. Please let us know in the comments if you have any.

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University. These writings do not intend to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

Monte Carlo Simulation: An In-depth Tutorial with Python
https://towardsai.net/p/machine-learning/monte-carlo-simulation-an-in-depth-tutorial-with-python-bcf6eb7856c8

An in-depth tutorial on the Monte Carlo Simulation methods and applications with Python

Author(s): Pratik Shukla, Roberto Iriondo

What is the Monte Carlo Simulation?

A Monte Carlo method is a technique that uses random numbers and probability to solve complex problems. The Monte Carlo simulation, or probability simulation, is a technique used to understand the impact of risk and uncertainty in financial sectors, project management, costs, and other forecasting machine learning models.

Risk analysis is part of almost every decision we make, as we constantly face uncertainty, ambiguity, and variability in our lives. Moreover, even though we have unprecedented access to information, we cannot accurately predict the future.

The Monte Carlo simulation allows us to see all the possible outcomes of our decisions and assess risk impact, in consequence allowing better decision making under uncertainty.

In this article, we will go through five different examples to understand the Monte Carlo Simulation methods.

Next, we are going to verify this formula experimentally using the Monte Carlo method.

Python Implementation:

1. Import required libraries:

2. Coin flip function:

3. Checking the output of the function:

4. Main function:

5. Calling the main function:

As shown in figure 8, we show that after 5000 iterations, the probability of getting a tail is 0.502. Consequently, this is how we can use the Monte Carlo Simulation to find probabilities experimentally.
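The steps above can be sketched as follows (our own minimal version, with a fixed seed for reproducibility):

```python
import random

def coin_flip():
    # Returns 0 for heads or 1 for tails with equal probability
    return random.randint(0, 1)

def monte_carlo(n):
    # Estimate P(tails) as the fraction of tails over n flips
    tails = sum(coin_flip() for _ in range(n))
    return tails / n

random.seed(0)
estimate = monte_carlo(5000)
print(estimate)  # hovers around the theoretical 0.5
```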

b. Estimating PI using a circle and a square:

To estimate the value of PI, we need the area of the square and the area of the circle. To find these areas, we will randomly place dots on the surface and count how many fall inside the circle and how many fall inside the square. The ratio of these counts approximates the ratio of the areas, so instead of computing the actual areas, we use the dot counts in their place.

In the following code, we used the turtle module of Python to see the random placement of dots.

Python Implementation:

1. Import required libraries:

2. To visualize the dots:

3. Initialize some required data:

4. Main function:

5. Plot the data:

6. Output:

As shown in figure 17, after 5000 iterations we get a good approximation of PI. Also, notice that the estimation error decreases as the number of iterations increases (for Monte Carlo methods, roughly in proportion to 1/√N).
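The dot-counting idea, stripped of the turtle visualization, can be sketched as (our own minimal version):

```python
import random

def estimate_pi(n):
    # Drop n random dots in the unit square and count those that
    # also fall inside the quarter circle of radius 1.
    inside = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # dots_in_circle / dots_in_square ~= pi / 4, so PI ~= 4 * ratio
    return 4 * inside / n

random.seed(42)
pi_estimate = estimate_pi(100_000)
print(pi_estimate)  # close to 3.14
```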

3. The Monty Hall Problem:

Suppose you are on a game show, and you have the choice of picking one of three doors: behind one door is a car; behind the others, goats. You pick a door, say door 1, and the host, who knows what’s behind the doors, opens another door, say door 3, which has a goat. The host then asks you: do you want to stick with your choice or switch to another door? [1]

Is it to your advantage to switch your choice of door?

Based on probability, it turns out it is to our advantage to switch the doors. Let’s find out how:

Initially, for all three doors, the probability (P) of having the car is the same (P = 1/3).

Now assume that the contestant chooses door 1. The host then opens the third door, which has a goat, and asks the contestant whether he/she wants to switch doors.

We will see why it is more advantageous to switch the door:

In figure 19, we can see that before the host opens a door, the two unchosen doors together have a 2/3 probability of hiding the car. Once we know that the third door has a goat, that entire 2/3 probability shifts to the second door. Hence, it is more advantageous to switch doors.

Now we are going to use the Monte Carlo method to run this scenario many times and find the probabilities experimentally.

Python Implementation:

1. Import required libraries:

2. Initialize some data:

3. Main function:

4. Calling the main function:

5. Output:

In figure 24, we show that after 1000 iterations, the winning probability if we switch the door is 0.669. Therefore, we are confident that it works to our advantage to switch the door in this example.
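The simulation can be sketched as follows (our own minimal version; the function name is ours):

```python
import random

def monty_hall(switch, trials):
    wins = 0
    for _ in range(trials):
        doors = [0, 0, 1]              # 1 marks the door with the car
        random.shuffle(doors)
        choice = random.randrange(3)
        # The host opens a goat door that the contestant did not pick
        opened = next(i for i in range(3) if i != choice and doors[i] == 0)
        if switch:
            choice = next(i for i in range(3) if i not in (choice, opened))
        wins += doors[choice]
    return wins / trials

random.seed(1)
p_switch = monty_hall(switch=True, trials=10_000)
p_stay = monty_hall(switch=False, trials=10_000)
print(p_switch, p_stay)  # switching wins about 2/3 of the time
```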

4. Buffon’s Needle Problem:

The French nobleman Georges-Louis Leclerc, Comte de Buffon, posed the following problem in 1777 [2] [3].

Suppose that we drop a short needle on a ruled paper — what would be the probability that the needle comes to lie in a position where it crosses one of the lines?

The probability depends on the distance (d) between the lines of the ruled paper, and it depends on the length (l) of the needle that we drop — or rather, it depends on the ratio l/d. For this example, we assume a short needle, l ≤ d, so that the needle cannot cross two different lines at the same time. Surprisingly, the answer to Buffon’s needle problem involves PI.

Here we are going to use the solution of Buffon’s needle problem to estimate the value of PI experimentally using the Monte Carlo method. However, before doing that, we are going to show how the solution is derived, which makes the experiment more interesting.

Theorem:

If a short needle, of length l, is dropped on a paper that is ruled with equally spaced lines of distance d ≥ l, then the probability that the needle comes to lie in a position where it crosses one of the lines is:

Proof:

Next, we need to count the number of needles that crosses any of the vertical lines. For a needle to intersect with one of the lines, for a specific value of theta, the following are the maximum and minimum possible values for which a needle can intersect with a vertical line.

1. Maximum Possible Value:

2. Minimum Possible Value:

Therefore, for a specific value of theta, the probability for a needle to lie on a vertical line is:

The above probability formula is limited to a single value of theta; in our experiment, theta ranges from 0 to π/2. Next, we find the overall probability by integrating over all values of theta.

Estimating PI using Buffon’s needle problem:

Next, we are going to use the above formula to find out the value of PI experimentally.

Now, notice that we have the values for l and d. Our goal is to find the value of P first so that we can get the value of PI. To find the probability P, we need the count of hit needles and the total number of needles. Since we already have the total count, the only thing we require now is the count of hit needles.

Below is the visual representation of how we are going to calculate the count of hit needles.

Python Implementation:

1. Import required libraries:

2. Main function:

3. Calling the main function:

4. Output:

As shown in figure 37, after 100 iterations we are able to get a very close value of PI using the Monte Carlo Method.
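The experiment can be sketched as follows (our own minimal version, using the derived probability P = 2l/(πd)):

```python
import math
import random

def estimate_pi_buffon(l, d, n):
    # Drop n needles of length l on paper ruled with lines spaced d apart (l <= d)
    hits = 0
    for _ in range(n):
        # Distance from the needle's midpoint to the nearest line
        x = random.uniform(0, d / 2)
        # Acute angle between the needle and the lines
        theta = random.uniform(0, math.pi / 2)
        # The needle crosses a line when x <= (l/2) * sin(theta)
        if x <= (l / 2) * math.sin(theta):
            hits += 1
    # P = 2l / (pi d)  =>  PI ~= 2 * l * n / (d * hits)
    return (2 * l * n) / (d * hits)

random.seed(7)
pi_estimate = estimate_pi_buffon(l=1.0, d=2.0, n=100_000)
print(pi_estimate)  # close to 3.14
```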

5. Why Does the House Always Win?

How do casinos earn money? The trick is straightforward — “The more you play, the more they earn.” Let us take a look at how this works with a simple Monte Carlo Simulation example.

Consider an imaginary game in which a player has to choose a chip from a bag of chips.

Rules:

The bag contains chips numbered from 1 to 100.

Users can bet on even or odd chips.

In this game, 10 and 11 are special numbers. If we bet on evens, then 10 will be counted as an odd number, and if we bet on odds, then 11 will be counted as an even number.

If we bet on even numbers and we get 10 then we lose.

If we bet on odd numbers and we get 11 then we lose.

If we bet on odds, the probability that we win is 49/100, and the probability that the house wins is 51/100. Therefore, for an odd bet, the house edge is 51/100 − 49/100 = 2/100 = 0.02 = 2%.

If we bet on evens, the probability that we win is 49/100, and the probability that the house wins is 51/100. Hence, for an even bet, the house edge is also 51/100 − 49/100 = 2/100 = 0.02 = 2%.

In summary, for every $1 bet, $0.02 goes to the house on average. In comparison, the house edge on single-zero roulette is about 2.7%. Consequently, you would have a slightly better chance of winning at our imaginary game than at roulette.

Python Implementation:

1. Import required libraries:

2. Player’s bet:

3. Main function:

4. Final output:

5. Running it for 1000 iterations:

6. Number of bets = 5:

7. Number of bets = 10:

8. Number of bets = 1000:

9. Number of bets = 5000:

10. Number of bets = 10000:

From the above experiment, we can see that the player has a better chance of walking away with a profit if they place fewer bets. In some scenarios, we get negative numbers, which means that the player lost all of their money and accumulated debt instead of making a profit.
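A minimal sketch of the game under the rules above (the player bets $1 on odd each round, so drawing 11 counts as a loss; function names are ours):

```python
import random

def play(total_bets, bet_money=1):
    # Simulate a session of `total_bets` $1 bets on odd numbers
    balance = 0
    for _ in range(total_bets):
        chip = random.randint(1, 100)
        if chip % 2 == 1 and chip != 11:   # 49 winning chips out of 100
            balance += bet_money
        else:                               # 51 losing chips (evens plus 11)
            balance -= bet_money
    return balance

random.seed(3)
results = [play(10_000) for _ in range(100)]
average = sum(results) / len(results)
print(average)  # the 2% house edge shows up as an average loss near -200
```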

Please keep in mind that these percentages are for our figurative game and they can be modified.

Conclusion:

Like with any forecasting model, the simulation will only be as good as the estimates we make. It is important to remember that the Monte Carlo Simulation only represents probabilities and not certainty. Nevertheless, the Monte Carlo simulation can be a valuable tool when forecasting an unknown future.

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University. These writings do not intend to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

For attribution in academic contexts, please cite this work as:

Shukla, et al., “Monte Carlo Simulation An In-depth Tutorial with Python”, Towards AI, 2020

BibTex citation:

@article{pratik_iriondo_2020,
  title={Monte Carlo Simulation: An In-depth Tutorial with Python},
  url={https://towardsai.net/monte-carlo-simulation},
  journal={Towards AI},
  publisher={Towards AI Co.},
  author={Shukla, Pratik and Iriondo, Roberto},
  year={2020},
  month={Aug}
}

Tutorial on the basics of natural language processing (NLP) with sample code implementation in Python

In this article, we explore the basics of natural language processing (NLP) with code examples. We dive into the natural language toolkit (NLTK) library to present how it can be useful for natural language processing related-tasks. Afterward, we will discuss the basics of other Natural Language Processing libraries and other essential methods for NLP, along with their respective coding sample implementations in Python.

Computers and machines are great at working with tabular data or spreadsheets. However, human beings generally communicate in words and sentences, not in tables, and much of the information humans speak or write is unstructured, which makes it hard for computers to interpret. In natural language processing (NLP), the goal is to make computers understand unstructured text and retrieve meaningful pieces of information from it. Natural language processing is a subfield of artificial intelligence concerned with the interactions between computers and human language.

Applications of NLP:

Machine Translation.

Speech Recognition.

Sentiment Analysis.

Question Answering.

Summarization of Text.

Chatbot.

Intelligent Systems.

Text Classifications.

Character Recognition.

Spell Checking.

Spam Detection.

Autocomplete.

Named Entity Recognition.

Predictive Typing.

Understanding Natural Language Processing (NLP):

We, as humans, perform natural language processing (NLP) considerably well, but even then, we are not perfect. We often misunderstand one thing for another, and we often interpret the same sentences or words differently.

For instance, consider the following sentence; we will try to understand its interpretation in many different ways:

Example 1: “I saw a man on the hill with a telescope.”

These are some interpretations of the sentence shown above.

There is a man on the hill, and I watched him with my telescope.

There is a man on the hill, and he has a telescope.

I’m on a hill, and I saw a man using my telescope.

I’m on a hill, and I saw a man who has a telescope.

There is a man on a hill, and I saw him with my telescope.

Example 2:

In the sentence above, there are two “can” words, but each has a different meaning. The first “can” is used for question formation. The second “can,” at the end of the sentence, refers to a container that holds food or liquid.

Hence, from the examples above, we can see that language is not “deterministic” (the same sentence does not carry the same interpretation for everyone), and what is suitable for one person might not be suitable for another. Therefore, Natural Language Processing (NLP) follows a non-deterministic approach. In other words, NLP can be used to create intelligent systems that model how humans understand and interpret language in different situations.

Rule-based NLP vs. Statistical NLP:

Natural Language Processing is separated into two different approaches:

Rule-based Natural Language Processing:

It uses common sense reasoning for processing tasks. For instance, the freezing temperature can lead to death, or hot coffee can burn people’s skin, along with other common sense reasoning tasks. However, this process can take much time, and it requires manual effort.

Statistical Natural Language Processing:

It uses large amounts of data and tries to derive conclusions from it. Statistical NLP uses machine learning algorithms to train NLP models. After successful training on large amounts of data, the trained model will have positive outcomes with deduction.

Comparison:

Components of Natural Language Processing (NLP):

a. Lexical Analysis:

With lexical analysis, we divide a whole chunk of text into paragraphs, sentences, and words. It involves identifying and analyzing words’ structure.

b. Syntactic Analysis:

Syntactic analysis involves the analysis of words in a sentence for grammar and arranging words in a manner that shows the relationship among the words. For instance, the sentence “The shop goes to the house” does not pass.

c. Semantic Analysis:

Semantic analysis draws the exact meaning for the words, and it analyzes the text meaningfulness. Sentences such as “hot ice-cream” do not pass.

d. Discourse Integration:

Discourse integration takes into account the context of the text. The meaning of a sentence can depend on the sentences that come before it. For example, in “He works at Google,” the pronoun “he” must refer to someone mentioned in an earlier sentence.

e. Pragmatic Analysis:

Pragmatic analysis deals with overall communication and interpretation of language. It deals with deriving meaningful use of language in various situations.

NLP Libraries:

The NLTK Python framework is generally used as an education and research tool. It’s not usually used in production applications. However, it can be used to build exciting programs due to its ease of use.

spaCy is an open-source natural language processing Python library designed to be fast and production-ready. spaCy focuses on providing software for production usage.

Gensim is an NLP Python framework generally used in topic modeling and similarity detection. It is not a general-purpose NLP library, but it handles tasks assigned to it very well.

Pattern is an NLP Python framework with straightforward syntax. It’s a powerful tool for scientific and non-scientific tasks. It is highly valuable to students.

TextBlob is a Python library designed for processing textual data.

Features:

Part-of-Speech tagging.

Noun phrase extraction.

Sentiment analysis.

Classification.

Language translation.

Parsing.

Wordnet integration.

Use-cases:

Sentiment Analysis.

Spelling Correction.

Translation and Language Detection.

For this tutorial, we are going to focus more on the NLTK library. Let’s dig deeper into natural language processing by making some examples.

Exploring Features of NLTK:

a. Open the text file for processing:

First, we are going to open and read the file which we want to analyze.

Next, notice that the data type of the text file read is a String. The number of characters in our text file is 675.

b. Import required libraries:

For various data processing cases in NLP, we need to import some libraries. In this case, we are going to use NLTK for Natural Language Processing. We will use it to perform various operations on the text.

c. Sentence tokenizing:

By tokenizing the text with sent_tokenize( ), we can get the text as sentences.

In the example above, we can see the entire text of our data is represented as sentences and also notice that the total number of sentences here is 9.

d. Word tokenizing:

By tokenizing the text with word_tokenize( ), we can get the text as words.

Next, we can see the entire text of our data is represented as words and also notice that the total number of words here is 144.

e. Find the frequency distribution:

Let’s find out the frequency of words in our text.

Notice that the most used words are punctuation marks and stopwords. We will have to remove such words to analyze the actual text.

f. Plot the frequency graph:

Let’s plot a graph to visualize the word distribution in our text.

In the graph above, notice that a period “.” is used nine times in our text. Analytically speaking, punctuation marks are not that important for natural language processing. Therefore, in the next step, we will be removing such punctuation marks.

g. Remove punctuation marks:

Next, we are going to remove the punctuation marks as they are not very useful for us. We are going to use isalpha( ) method to separate the punctuation marks from the actual text. Also, we are going to make a new list called words_no_punc, which will store the words in lower case but exclude the punctuation marks.

As shown above, all the punctuation marks are excluded from our text. We can also cross-check this with the word count.

h. Plotting graph without punctuation marks:

Notice that we still have many words that are not very useful in the analysis of our text file sample, such as “and,” “but,” “so,” and others. Next, we need to remove these stopwords.

i. List of stopwords:

j. Removing stopwords:

k. Final frequency distribution:

As shown above, the final graph has many useful words that help us understand what our sample data is about, showing how essential it is to perform data cleaning on NLP.
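The cleaning-and-counting pipeline above can be sketched without any corpus downloads, using `collections.Counter` and a small hand-written stopword list (an illustrative subset, not NLTK’s full list; the sample text is ours):

```python
import re
from collections import Counter

text = """Natural language processing helps computers understand text.
Computers do not understand unstructured text, but NLP helps."""

# A small illustrative stopword list (NLTK ships a much larger one)
stopwords = {"a", "an", "the", "and", "but", "or", "do", "not", "is"}

# Keep lowercase alphabetic tokens only, which also drops punctuation
words = re.findall(r"[a-z]+", text.lower())
words_no_stop = [w for w in words if w not in stopwords]

freq = Counter(words_no_stop)
print(freq.most_common(5))
```

After cleaning, the most frequent tokens are content words rather than punctuation or stopwords.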

Next, we will cover various topics in NLP with coding examples.

Word Cloud:

Word Cloud is a data visualization technique in which the words from a given text are displayed on a chart. More frequent or essential words appear in a larger, bolder font, while less frequent or essential words appear in smaller or thinner fonts. It is a beneficial technique in NLP that gives us a glance at what the text to be analyzed is about.

Properties:

font_path: It specifies the path for the fonts we want to use.

width: It specifies the width of the canvas.

height: It specifies the height of the canvas.

min_font_size: It specifies the smallest font size to use.

max_font_size: It specifies the largest font size to use.

font_step: It specifies the step size for the font.

max_words: It specifies the maximum number of words on the word cloud.

stopwords: Our program will eliminate these words.

background_color: It specifies the background color for canvas.

normalize_plurals: It removes the trailing “s” from words.

As shown in the graph above, the most frequent words display in larger fonts. The word cloud can be displayed in any shape or image.

For instance: In this case, we are going to use the following circle image, but we can use any shape or any image.

Word Cloud Python Implementation:

As shown above, the word cloud is in the shape of a circle. As we mentioned before, we can use any shape or image to form a word cloud.

Word Cloud Advantages:

They are fast.

They are engaging.

They are simple to understand.

They are casual and visually appealing.

Word Cloud Disadvantages:

They perform poorly on unclean data.

They lack the context of words.

Stemming:

We use stemming to normalize words. In English and many other languages, a single word can take multiple forms depending on the context. For instance, the verb “study” can take many forms, like “studies,” “studying,” and “studied,” depending on its context. When we tokenize text, an interpreter considers these input words as different words even though their underlying meaning is the same. Since NLP is about analyzing the meaning of content, we use stemming to resolve this problem.

Stemming normalizes the word by truncating the word to its stem word. For example, the words “studies,” “studied,” “studying” will be reduced to “studi,” making all these word forms to refer to only one token. Notice that stemming may not give us a dictionary, grammatical word for a particular set of words.

Let’s take an example:

a. Porter’s Stemmer Example 1:

In the code snippet below, we show that all the words truncate to their stem words. However, notice that the stemmed word is not a dictionary word.

b. Porter’s Stemmer Example 2:

In the code snippet below, many of the words after stemming did not end up being a recognizable dictionary word.

c. SnowballStemmer:

SnowballStemmer generates the same output as porter stemmer, but it supports many more languages.
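Both stemmers can be sketched side by side (the word list is ours; no corpus downloads are required for stemming):

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # same idea, but supports more languages

words = ["studies", "studying", "studied", "cries", "easily"]

# Note that stems such as "studi" are not dictionary words
print([porter.stem(w) for w in words])
print([snowball.stem(w) for w in words])
```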

Lemmatization:

Lemmatization tries to achieve a similar base “stem” for a word. However, what makes it different is that it finds the dictionary word instead of truncating the original word. Stemming does not consider the context of the word, which is why it generates results faster but is less accurate than lemmatization.

If accuracy is not the project’s final goal, then stemming is an appropriate approach. If higher accuracy is crucial and the project is not on a tight deadline, then the best option is lemmatization (lemmatization has a lower processing speed compared to stemming).

Lemmatization takes into account Part of Speech (PoS) values, and it may generate different outputs for different values of PoS. We generally have four choices for PoS: noun (“n”), verb (“v”), adjective (“a”), and adverb (“r”).

Difference between Stemmer and Lemmatizer:

a. Stemming:

Notice how on stemming, the word “studies” gets truncated to “studi.”

b. Lemmatizing:

During lemmatization, the word “studies” displays its dictionary word “study.”

Python Implementation:

a. A basic example demonstrating how a lemmatizer works

In the following example, we are taking the PoS tag as “verb,” and when we apply the lemmatization rules, it gives us dictionary words instead of truncating the original word:

b. Lemmatizer with default PoS value

The default value of PoS in lemmatization is a noun(n). In the following example, we can see that it’s generating dictionary words:

c. Another example demonstrating the power of lemmatizer

d. Lemmatizer with different POS values

Part of Speech Tagging (PoS tagging):

Why do we need Part of Speech (POS)?

Parts of speech (PoS) tagging is crucial for syntactic and semantic analysis. Consider the earlier sentence with two “can”s: the word “can” has several semantic meanings. The first “can” is used for question formation; the second “can,” at the end of the sentence, represents a container. The first “can” is a verb, and the second “can” is a noun. Tagging each word with its part of speech allows the program to handle it correctly in both semantic and syntactic analysis.

Below, please find a list of Part of Speech (PoS) tags with their respective examples:

1. CC: Coordinating Conjunction

2. CD: Cardinal Digit

3. DT: Determiner

4. EX: Existential There

5. FW: Foreign Word

6. IN: Preposition / Subordinating Conjunction

7. JJ: Adjective

8. JJR: Adjective, Comparative

9. JJS: Adjective, Superlative

10. LS: List Marker

11. MD: Modal

12. NN: Noun, Singular

13. NNS: Noun, Plural

14. NNP: Proper Noun, Singular

15. NNPS: Proper Noun, Plural

16. PDT: Predeterminer

17. POS: Possessive Endings

18. PRP: Personal Pronoun

19. PRP$: Possessive Pronoun

20. RB: Adverb

21. RBR: Adverb, Comparative

22. RBS: Adverb, Superlative

23. RP: Particle

24. TO: To

25. UH: Interjection

26. VB: Verb, Base Form

27. VBD: Verb, Past Tense

28. VBG: Verb, Present Participle

29. VBN: Verb, Past Participle

30. VBP: Verb, Present Tense, Not Third Person Singular

31. VBZ: Verb, Present Tense, Third Person Singular

32. WDT: Wh — Determiner

33. WP: Wh — Pronoun

34. WP$ : Possessive Wh — Pronoun

35. WRB: Wh — Adverb

Python Implementation:

a. A simple example demonstrating PoS tagging.

b. A full example demonstrating the use of PoS tagging.

Chunking:

Chunking means to extract meaningful phrases from unstructured text. By tokenizing a book into words, it’s sometimes hard to infer meaningful information. It works on top of Part of Speech(PoS) tagging. Chunking takes PoS tags as input and provides chunks as output. Chunking literally means a group of words, which breaks simple text into phrases that are more meaningful than individual words.

Before working with an example, we need to know what phrases are. Meaningful groups of words are called phrases, and there are five significant categories of phrases.

Noun Phrases (NP).

Verb Phrases (VP).

Adjective Phrases (ADJP).

Adverb Phrases (ADVP).

Prepositional Phrases (PP).

Phrase structure rules:

S(Sentence) → NP VP.

NP → {Determiner, Noun, Pronoun, Proper name}.

VP → V (NP)(PP)(Adverb).

PP → Preposition (NP).

AP → Adjective (PP).

Example:

Python Implementation:

In the following example, we will extract a noun phrase from the text. Before extracting it, we need to define what kind of noun phrase we are looking for, or in other words, we have to set the grammar for a noun phrase. In this case, we define a noun phrase by an optional determiner followed by adjectives and nouns. Then we can define other rules to extract some other phrases. Next, we are going to use RegexpParser( ) to parse the grammar. Notice that we can also visualize the text with the .draw( ) function.

In this example, we can see that we have successfully extracted the noun phrase from the text.

Chinking:

Chinking excludes a part from our chunk. There are certain situations where we need to exclude a part of the text from the whole text or chunk. In complex extractions, chunking can output unhelpful data; in such scenarios, we can use chinking to exclude some parts from the chunked text.
In the following example, we are going to take the whole string as a chunk, and then we are going to exclude adjectives from it by using chinking. We generally use chinking when we still have a lot of unhelpful data after chunking. To write chinking grammar, we have to use inverted curly braces, i.e.:

} write chinking grammar here {

Python Implementation:

From the example above, we can see that the adjectives are separated from the rest of the text.

Named Entity Recognition (NER):

Named entity recognition can automatically scan entire articles and pull out some fundamental entities like people, organizations, places, date, time, money, and GPE discussed in them.

Use-Cases:

Content classification for news channels.

Summarizing resumes.

Optimizing search engine algorithms.

Recommendation systems.

Customer support.

Commonly used types of named entity:

Python Implementation:

There are two options :

1. binary = True

When the binary value is True, it will only show whether a particular entity is a named entity or not; it will not show any further details.

Our graph does not show what type of named entity it is; it only shows whether a particular word is a named entity or not.

2. binary = False

When the binary value equals False, it shows in detail the type of named entities.

Our graph now shows what type of named entity it is.

WordNet:

Wordnet is a lexical database for the English language. Wordnet is a part of the NLTK corpus. We can use Wordnet to find meanings of words, synonyms, antonyms, and many other words.

a. We can check how many different definitions of a word are available in Wordnet.

b. We can also check the meaning of those different definitions.

c. All details for a word.

d. All details for all meanings of a word.

e. Hypernyms: a hypernym is a more abstract term for a word.

f. Hyponyms: a hyponym is a more specific term for a word.

g. Get a name only.

h. Synonyms.

i. Antonyms.

j. Synonyms and antonyms.

k. Finding the similarity between words.

Bag of Words:

What is the Bag-of-Words method?

It is a method of extracting essential features from raw text so that we can use them for machine learning models. We call it a “bag” of words because we discard the order of occurrences of words. A bag-of-words model converts the raw text into words and counts the frequency of each word in the text. In summary, a bag of words is a collection of words that represents a sentence along with the word counts, where the order of occurrences is not relevant.

Raw Text: This is the original text on which we want to perform analysis.

Clean Text: Since our raw text contains some unnecessary data, like punctuation marks and stopwords, we need to clean it up. Clean text is the text after removing such words.

Tokenize: Tokenization represents the sentence as a group of tokens or words.

Building Vocab: It contains total words used in the text after removing unnecessary data.

Generate Vocab: It contains the words along with their frequencies in the sentences.

For instance:

Sentences:

Jim and Pam traveled by bus.

The train was late.

The flight was full. Traveling by flight is expensive.

a. Creating a basic structure:

b. Words with frequencies:

c. Combining all the words:

d. Final model:

Python Implementation:
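The clean → tokenize → build-vocab → count pipeline can be sketched with the three example sentences above (a plain-Python sketch of ours; libraries such as scikit-learn offer the same via `CountVectorizer`):

```python
import re
from collections import Counter

sentences = [
    "Jim and Pam traveled by bus.",
    "The train was late.",
    "The flight was full. Traveling by flight is expensive.",
]

# Tokenize: lowercase alphabetic tokens only (drops punctuation)
tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]

# Build the vocabulary across all sentences (sorted for a stable column order)
vocab = sorted({w for tokens in tokenized for w in tokens})

# Bag-of-words vector: per-sentence word counts; order of occurrence is ignored
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[w] for w in vocab])

print(vocab)
for v in vectors:
    print(v)
```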

Applications:

Natural language processing.

Information retrieval from documents.

Classifications of documents.

Limitations:

Semantic meaning: It does not consider the semantic meaning of a word. It ignores the context in which the word is used.

Vector size: For large documents, the vector size increases, which may result in higher computational time.

Preprocessing: In preprocessing, we need to perform data cleansing before using it.

TF-IDF

TF-IDF stands for Term Frequency — Inverse Document Frequency, which is a scoring measure generally used in information retrieval (IR) and summarization. The TF-IDF score shows how important or relevant a term is in a given document.

The intuition behind TF and IDF:

If a particular word appears multiple times in a document, then it might have higher importance than the other words that appear fewer times (TF). At the same time, if a particular word appears many times in a document but is also present many times in other documents, then maybe that word is simply frequent, so we cannot assign much importance to it (IDF).

For instance, suppose we have a database of thousands of dog descriptions, and a user wants to search for “a cute dog.” The job of our search engine would be to display the closest response to the user's query. How would a search engine do that? The search engine will possibly use TF-IDF to calculate the score for all of our descriptions, and the result with the highest score will be displayed as a response to the user. This is the case when there is no exact match for the user's query; if there is an exact match, that result will be displayed first. Let's suppose there are four descriptions available in our database:

The furry dog.

A cute doggo.

A big dog.

The lovely doggo.

Notice that the first description contains 2 out of 3 words from our user query, the second description contains 1 word from the query, the third description also contains 1 word, and the fourth description contains no words from the user query. As we can sense, the closest answer to our query will be description number two, as it contains the essential word “cute” from the user's query. This is how TF-IDF calculates these values.

Notice that the term frequency values are the same for all of the sentences, since none of the words repeats within the same sentence, so in this case, the value of TF will not be instrumental. Next, we are going to use the IDF values to get the closest answer to the query. Notice that the words dog or doggo can appear in many documents, so their IDF values are going to be very low, and eventually, their TF-IDF values will also be low. However, the word “cute” comes up relatively fewer times across the descriptions, which increases its TF-IDF value. So the word “cute” has more discriminative power than “dog” or “doggo.” Our search engine will then find the descriptions that contain the word “cute,” which is what the user was looking for.

Simply put, the higher the TF*IDF score, the rarer and more valuable the term, and vice versa.

Now we are going to take a straightforward example and understand TF-IDF in more detail.

Example:

Sentence 1: This is the first document.

Sentence 2: This document is the second document.

TF: Term Frequency

a. Represent the words of the sentences in the table.

b. Displaying the frequency of words.

c. Calculating TF using a formula.

IDF: Inverse Document Frequency

d. Calculating IDF values from the formula.

e. Calculating TF-IDF.

TF-IDF is the product of TF and IDF.

In this case, notice that the important words that discriminate between the two sentences are “first” in sentence 1 and “second” in sentence 2; as we can see, those words have a relatively higher value than the other words.

However, there are many variations for smoothing out the values for large documents. The most common variation is to use the logarithm in the IDF formula. Let's calculate the TF-IDF value again by using the new IDF value.

f. Calculating IDF value using log.

g. Calculating TF-IDF.

As seen above, “first” and “second” are important words that help us distinguish between the two sentences.

Now that we have seen the basics of TF-IDF, we are going to use the sklearn library to implement it in Python. Note that sklearn calculates the actual output with a slightly different formula. First, we will see an overview of the calculations and formulas, and then we will implement them in Python.

Actual Calculations:

a. Term Frequency (TF):

b. Inverse Document Frequency (IDF):

c. Calculating final TF-IDF values:

Python Implementation:
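The code listing appears as an image in the original article, which uses sklearn's TfidfVectorizer. The sketch below reimplements sklearn's default formula in plain Python so the arithmetic is visible: idf(t) = ln((1 + n) / (1 + df(t))) + 1 with smoothing, followed by L2 normalization of each document vector.

```python
# A pure-Python sketch of TF-IDF using scikit-learn's default formula.
import math
import re

docs = [
    "This is the first document.",
    "This document is the second document.",
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(d) for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n = len(docs)

# Smoothed inverse document frequency per term.
idf = {}
for w in vocab:
    df = sum(w in doc for doc in tokenized)
    idf[w] = math.log((1 + n) / (1 + df)) + 1

def tfidf(doc):
    # Raw term counts times idf, then L2-normalized (sklearn's default).
    raw = [doc.count(w) * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

print(vocab)
for doc in tokenized:
    print([round(v, 3) for v in tfidf(doc)])
```

As in the hand calculation above, “first” and “second” receive higher weights than the words shared by both sentences.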

Conclusion:

These are some of the basics for the exciting field of natural language processing (NLP). We hope you enjoyed reading this article and learned something new. Any suggestions or feedback is crucial to continue to improve. Please let us know in the comments if you have any.

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University, nor other companies (directly or indirectly) associated with the author(s). These writings do not intend to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

Building Neural Networks with Python Code and Math in Detail — II
https://towardsai.net/p/machine-learning/building-neural-networks-with-python-code-and-math-in-detail-ii-bbe8accbf3d1
Published Tue, 30 Jun 2020

Author(s): Pratik Shukla, Roberto Iriondo

The second part of our tutorial on neural networks from scratch. From the math behind them to step-by-step implementation case studies in Python. Launch the samples on Google Colab.

In the first part of our tutorial on neural networks, we explained the basic concepts about neural networks, from the math behind them to implementing neural networks in Python without any hidden layers. We showed how to make satisfactory predictions even in case scenarios where we did not use any hidden layers. However, there are several limitations to single-layer neural networks.

In this tutorial, we will dive in-depth on the limitations and advantages of using neural networks in machine learning. We will show how to implement neural nets with hidden layers and how these lead to a higher accuracy rate on our predictions, along with implementation samples in Python on Google Colab.

1. Limitations of single-layer neural networks:

A single-layer neural network can only represent a limited set of functions. If we have been training a model that uses complicated functions (which is the general case), then using a single-layer neural network can lead to low prediction accuracy.

It can only predict linearly separable data. If we have non-linear data, then training our single-layer neural network will lead to low prediction accuracy.

Decision boundaries of single-layer neural networks must be hyperplanes, which means that if our data is distributed in 3 dimensions, then the decision boundary must be a 2-dimensional plane.

To overcome such limitations, we use hidden layers in our neural networks.

Advantages of single-layer neural networks:

Single-layer neural networks are easy to set up.

Single-layer neural networks take less time to train compared to a multi-layer neural network.

Single-layer neural networks have explicit links to statistical models.

The outputs in single-layer neural networks are weighted sums of inputs, which means we can interpret the output of a single-layer neural network easily.

Advantages of multilayer neural networks:

We can construct more extensive networks by adding layers of processing units.

They can be used to classify non-linearly separable data.

Multilayer neural networks are more reliable compared to single-layer neural networks.

2. How to select the number of neurons in a hidden layer?

There are many methods for determining the correct number of neurons to use in the hidden layer. We will see a few of them here.

The number of hidden nodes should be less than twice the size of the nodes in the input layer.

For example: If we have 2 input nodes, then our hidden nodes should be less than 4.

a. 2 inputs, 4 hidden nodes:

b. 2 inputs, 3 hidden nodes:

c. 2 inputs, 2 hidden nodes:

d. 2 inputs, 1 hidden node:

The number of hidden nodes should be 2/3 the size of input nodes, plus the size of the output node.

For example: If we have 2 input nodes and 1 output node, then the number of hidden nodes should be floor(2*2/3 + 1) = 2.

a. 2 inputs, 2 hidden nodes:

The number of hidden nodes should be between the size of input nodes and output nodes.

For example: If we have 3 input nodes and 2 output nodes, then the hidden nodes should be between 2 and 3.

a. 3 inputs, 2 hidden nodes, 2 outputs:

b. 3 inputs, 3 hidden nodes, 2 outputs:

How many weight values do we need?

For a hidden layer: Number of inputs * No. of hidden layer nodes

For an output layer: Number of hidden layer nodes * No. of outputs
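The two products above can be checked with a quick computation for the 2-3-1 network used later in this tutorial:

```python
# Weight counts for a fully connected network with one hidden layer.
n_inputs, n_hidden, n_outputs = 2, 3, 1

hidden_weights = n_inputs * n_hidden    # weights feeding the hidden layer
output_weights = n_hidden * n_outputs   # weights feeding the output layer

print(hidden_weights, output_weights)   # 6 weights, then 3 weights
```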

3. The General Structure of an Artificial Neural Network (ANN):

Summarization of an artificial neural network:

Take inputs.

Add bias (if required).

Assign random weights in the hidden layer and the output layer.

Run the code for training.

Find the error in prediction.

Update the weight values of the hidden layer and output layer by gradient descent algorithm.

Repeat the training phase with updated weights.

Make predictions.

Execution of multilayer neural networks:

In the first article, we only had one phase of execution: we found the updated weight values and reran the code to achieve minimum error. However, things are a little spicier here. The execution in a multilayer neural network takes place in two phases. In phase 1, we update the values of weight_output (weight values for the output layer), and in phase 2, we update the values of weight_hidden (weight values for the hidden layer). Phase 1 is similar to that of a neural network without any hidden layers.

Execution in phase-1:

Here we find the derivatives we are going to use in the gradient descent algorithm to update the weight values. We are not going to re-derive the derivatives for the functions we already covered in part 1 of this series.
In this phase, our goal is to find the weight values for the output layer, so we calculate the change in error with respect to the change in the output weights.

We first define some terms we are going to use in these derivatives:

In phase 1, we find the updated weights for the output layer. In the second phase, we need to find the updated weights for the hidden layer, so we find how a change in the hidden weights affects the change in the error value.

Represented as:

a. Finding the first derivative:

Here we are going to use the chain rule to find the derivative.

4. Implementation of a multilayer neural network in Python

Multilayer neural network: a neural network with a hidden layer. For more definitions, check out our article on terminology in machine learning.

Below, we are going to implement the “OR” gate without a bias value. As we will see, adding hidden layers to a neural network helps us achieve higher accuracy in our models.

Representation:

Truth-Table:

Neural Network:

Notice that here we have 2 input features and 1 output feature. In this neural network, we are going to use 1 hidden layer with 3 nodes.

Graphical representation:

Implementation in Python:

Below, we are going to implement our neural net with hidden layers step by step in Python, let’s code:

a. Import required libraries:

b. Define input features:

Next, we take the input values for which we want to train our neural network. We can see that we have taken two input features. On real data sets, the number of input features is usually much higher.

c. Define target output values:

For the input features, we want to have a specific output for specific input features. It is called the target output. We are going to train the model that gives us the target output for our input features.

d. Assign random weights:

Next, we are going to assign random weights to the input features. Note that our model is going to modify these weight values to be optimal. At this point, we are taking these values randomly. Here we have two layers, so we have to assign weights for them separately.

The other variable is the learning rate. We are going to use the learning rate (LR) in a gradient descent algorithm to update the weight values. Generally, we keep LR as low as possible so that we can achieve a minimal error rate.

e. Sigmoid function:

Once we have our weight values and input features, we are going to send it to the main function that predicts the output. Notice that our input features and weight values can be anything, but here we want to classify data, so we need the output between 0 and 1. For such output, we are going to use a sigmoid function.

f. Sigmoid function derivative:

In a gradient descent algorithm, we need the derivative of the sigmoid function.

g. The main logic for predicting output and updating the weight values:

We are going to understand the following code step-by-step.
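The listing itself appears as an image in the original article. Below is a minimal reconstruction of the whole training loop, using the variable names from the explanation that follows; the fixed random seed and the 10,000-epoch count are illustrative choices (the article runs 200,000 iterations):

```python
import numpy as np

# OR-gate inputs (4 samples, 2 features) and target outputs.
input_features = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
target_output = np.array([[0, 1, 1, 1]]).reshape(4, 1)

# Random starting weights: (2x3) for the hidden layer, (3x1) for the output layer.
np.random.seed(42)
weight_hidden = np.random.rand(2, 3)
weight_output = np.random.rand(3, 1)
lr = 0.05  # learning rate

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_der(x):
    return sigmoid(x) * (1 - sigmoid(x))

for epoch in range(10000):
    # Feedforward: hidden layer, then output layer.
    input_hidden = np.dot(input_features, weight_hidden)  # (4x3)
    output_hidden = sigmoid(input_hidden)                 # (4x3)
    input_op = np.dot(output_hidden, weight_output)       # (4x1)
    output_op = sigmoid(input_op)                         # (4x1)

    # Phase 1: derivatives for the output-layer weights.
    derror_douto = output_op - target_output              # (4x1)
    douto_dino = sigmoid_der(input_op)                    # (4x1)
    dino_dwo = output_hidden                              # (4x3)
    derror_dwo = np.dot(dino_dwo.T, derror_douto * douto_dino)  # (3x1)

    # Phase 2: derivatives for the hidden-layer weights.
    derror_dino = derror_douto * douto_dino               # (4x1)
    dino_douth = weight_output                            # (3x1)
    derror_douth = np.dot(derror_dino, dino_douth.T)      # (4x3)
    douth_dinh = sigmoid_der(input_hidden)                # (4x3)
    dinh_dwh = input_features                             # (4x2)
    derror_dwh = np.dot(dinh_dwh.T, derror_douth * douth_dinh)  # (2x3)

    # Gradient descent updates for both layers.
    weight_output -= lr * derror_dwo
    weight_hidden -= lr * derror_dwh

# Prediction for (1, 1): the output should move toward the target of 1.
single_point = np.array([1, 1])
result = sigmoid(np.dot(sigmoid(np.dot(single_point, weight_hidden)), weight_output))
print(result)
```

Each step of this loop is explained below, with the matrix sizes noted in the comments matching the derivations.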

How does it work?

a. First of all, we run the above code 200,000 times. Keep in mind that if we only run this code a few times, then it is probable that we will have a higher error rate. Therefore, we update the weight values 200,000 times to reach a value as close to optimal as possible.

b. Next, we find the input for the hidden layer. Defined by the following formula:

We can also represent it as matrices to understand in a better way.

The first matrix here is input features with size (4*2), and the second matrix is weight values for a hidden layer with size (2*3). So the resultant matrix will be of size (4*3).

The intuition behind the final matrix size:

The row size of the final matrix is the same as the row size of the first matrix, and the column size of the final matrix is the same as the column size of the second matrix in multiplication (dot product).

In the representation below, each of those boxes represents a value.

c. Afterward, we have an input for the hidden layer, and it is going to calculate the output by applying a sigmoid function. Below is the output of the hidden layer:

d. Next, we multiply the output of the hidden layer with the weight of the output layer:

The first matrix shows the output of the hidden layer, which has a size of (4*3). The second matrix represents the weight values of the output layer, which has a size of (3*1).

e. Afterward, we calculate the output of the output layer by applying a sigmoid function. It can also be represented in matrix form as follows.

f. Now that we have our predicted output, we find the mean squared between target output and predicted output.

g. Next, we begin the first phase of training. In this step, we update the weight values for the output layer. We need to find out how much the output weights affect the error value. To update the weights, we use a gradient descent algorithm. Notice that we have already found the derivatives we will use during the training phase.

g.a. Matrix representation of the first derivative. Matrix size (4*1).

derror_douto = output_op - target_output

g.b. Matrix representation of the second derivative. Matrix size (4*1).

douto_dino = sigmoid_der(input_op)

g.c. Matrix representation of the third derivative. Matrix size (4*3).

dino_dwo = output_hidden

g.d. Matrix representation of transpose of dino_dwo. Matrix size (3*4).

g.e. Now, we are going to find the final matrix of output weight. For a detailed explanation of this step, please check out our previous tutorial. The matrix size will be (3*1), which is the same as the output_weight matrix.

Hence, we have successfully found the derivative values. Next, we update the weight values accordingly with the help of a gradient descent algorithm.

Nonetheless, we also have to find the derivative for phase-2. Let’s first find that, and then we will update the weights for both layers in the end.

h. Phase -2. Updating the weights in the hidden layer.

Since we have already discussed how we derived the derivative values, we are just going to see matrix representation for each of them to understand it better. Our goal here is to find the weight matrix for the hidden layer, which is of size (2*3).

h.a. Matrix representation for the first derivative.

derror_dino = derror_douto * douto_dino

h.b. Matrix representation for the second derivative.

dino_douth = weight_output

h.c. Matrix representation for the third derivative.

derror_douth = np.dot(derror_dino , dino_douth.T)

h.d. Matrix representation for the fourth derivative.

douth_dinh = sigmoid_der(input_hidden)

h.e. Matrix representation for the fifth derivative.

dinh_dwh = input_features

h.f. Matrix representation for the sixth derivative.

Notice that our goal was to find a hidden weight matrix with the size of (2*3). Furthermore, we have successfully managed to find it.

h.g. Updating the weight values :

We will use the gradient descent algorithm to update the values. It takes three parameters.

The original weight: we already have it.

The learning rate (LR): we assigned it the value of 0.05.

The derivative: Found on the previous step.

Gradient descent algorithm:

Since we have all of our parameter values, this will be a straightforward operation. First, we are updating the weight values for the output layer, and then we are updating the weight values for the hidden layer.
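In symbols, the update is simply new_weight = old_weight - learning_rate * derivative for each layer. A tiny illustration with hypothetical numbers (the arrays below are not the article's actual values):

```python
import numpy as np

lr = 0.05  # learning rate, as assigned above

weight_output = np.array([[0.4], [0.7], [0.2]])   # original weights (3x1), illustrative
derror_dwo = np.array([[0.1], [-0.2], [0.05]])    # derivative from the previous step, illustrative

# Gradient descent update: subtract the scaled derivative from the weights.
weight_output = weight_output - lr * derror_dwo
print(weight_output)
```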

i. Final weight values:

Below, we show the updated weight values for both layers — our prediction bases on these values.

j. Making predictions:

j.a. Prediction for (1,1).

Target output = 1

Explanation:

First of all, we take the input values for which we want to predict the output. The “result1” variable stores the dot product of the input values and the hidden layer weights. We obtain the hidden layer's output by applying a sigmoid function and store it in the “result2” variable; this is the input feature for the output layer. We calculate the input for the output layer by multiplying it with the output layer weights, and to find the final output value, we take the sigmoid of that.

Notice that the predicted output is very close to 1. So we have managed to make accurate predictions.

j.b. Prediction for (0,0).

Target output = 0

Note that the predicted output is very close to 0, which indicates the success rate of our model.

k. Final error value :

After 200,000 iterations, we have our final error value — the lower the error, the higher the accuracy of the model.

As shown above, we can see that the error value is 0.0000000189. This value is the final error value in prediction after 200,000 iterations.

Below, notice that the data we used in this example was linearly separable, which means that by a single line, we can classify outputs with 1 value and outputs with 0 values.

Notice that we did not use a bias value here. Now let's have a quick look at the neural network without hidden layers for the same input features and target values. What we are going to do is find the final error rate and compare the two. Since we have already implemented the code in our previous tutorial, for this purpose, we are going to analyze it quickly. [2]

The final error value for the following code is:

As we can see, the error value is way too high compared to the error we found in our neural network implementation with hidden layers, making it one of the main reasons to use hidden layers in a neural network.

# Import required libraries:
import numpy as np

# Define input features:
input_features = np.array([[0,0],[0,1],[1,0],[1,1]])
print(input_features.shape)
print(input_features)

# Define target output:
target_output = np.array([[0,1,1,1]])

# Reshaping our target output into a vector:
target_output = target_output.reshape(4,1)
print(target_output.shape)
print(target_output)

# Define weights:
weights = np.array([[0.1],[0.2]])
print(weights.shape)
print(weights)

# Define learning rate:
lr = 0.05

# Sigmoid function:
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Derivative of sigmoid function:
def sigmoid_der(x):
    return sigmoid(x)*(1-sigmoid(x))

# Main logic for the neural network:
# Running our code 10000 times:
for epoch in range(10000):
    # Feedforward input:
    pred_in = np.dot(input_features, weights)

    # Feedforward output:
    pred_out = sigmoid(pred_in)

    # Backpropagation: calculating the error
    error = pred_out - target_output
    print(error.sum())

    # Calculating the individual derivatives:
    dcost_dpred = error
    # Derivative of the sigmoid evaluated at the pre-activation input:
    dpred_dz = sigmoid_der(pred_in)

    # Multiplying the individual derivatives:
    z_delta = dcost_dpred * dpred_dz

    # Multiplying with the 3rd individual derivative and updating the weights:
    weights -= lr * np.dot(input_features.T, z_delta)

# Predictions:

# Taking inputs:
single_point = np.array([1,0])
# 1st step:
result1 = np.dot(single_point, weights)
# 2nd step:
result2 = sigmoid(result1)
# Print final result:
print(result2)

# Taking inputs:
single_point = np.array([0,0])
result1 = np.dot(single_point, weights)
result2 = sigmoid(result1)
print(result2)

# Taking inputs:
single_point = np.array([1,1])
result1 = np.dot(single_point, weights)
result2 = sigmoid(result1)
print(result2)

6. Non-linearly separable data with a neural network

In this example, we are going to take a dataset that cannot be separated by a single straight line. If we try to separate it by a single line, then one or many outputs may be misclassified, and we will have a very high error. Therefore we use a hidden layer to resolve this issue.

Input Table:

Graphical Representation Of Data Points :

As shown below, we represent the data on the coordinate plane. Here notice that we have 2 colored dots (black and red). If we try to draw a single line, then the output is going to be misclassified.

As figure 59 shows, we have 2 inputs and 1 output. In this example, we are going to use 4 hidden perceptrons. The red dots have an output value of 0, and the black dots have an output value of 1. Therefore, we cannot simply classify them using a single straight line.

Neural Network:

Implementation in Python:

a. Import required libraries:

b. Define input features:

c. Define the target output:

d. Assign random weight values:

On figure 64, notice that we are using NumPy’s library random function to generate random values.

numpy.random.rand(x,y): Here x is the number of rows, and y is the number of columns. It generates output values over [0,1). It means 0 is included, but 1 is not included in the value generation.
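For instance, a quick check of numpy.random.rand for the 2-input, 4-hidden-node, 1-output network used in this example:

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is reproducible (an illustrative choice)

weight_hidden = np.random.rand(2, 4)   # 2 inputs -> 4 hidden nodes
weight_output = np.random.rand(4, 1)   # 4 hidden nodes -> 1 output

print(weight_hidden.shape, weight_output.shape)
# Every generated value lies in [0, 1): 0 is included, 1 is excluded.
print(weight_hidden)
```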

e. Sigmoid function:

f. Finding the derivative with a sigmoid function:

g. Training our neural network:

h. Weight values of hidden layer:

i. Weight values of output layer:

j. Final error value :

After training our model for 200,000 iterations, we finally achieved a low error value.

k. Making predictions from the trained model :

k.a. Predicting output for (0.5, 2).

The predicted output is closer to 1.

k.b. Predicting output for (0, -1)

The predicted output is very near to 0.

k.c. Predicting output for (0, 5)

The predicted output is close to 1.

k.d. Predicting output for (1, 1.2)

The predicted output is close to 0.

Based on the output values, our model has done a high-grade job of predicting values.

We can separate our data in the following way as shown in Figure 76. Note that this is not the only possible way to separate these values.

Therefore, to conclude, using a hidden layer in our neural networks helps us reduce the error rate when we have non-linearly separable data. Even though the training time is longer, we have to remember that our goal is to make high-accuracy predictions, and that goal is satisfied.

Some advantages of neural networks:

Neural networks can learn from their mistakes, and they can produce output that is not limited to the inputs provided to them.

Inputs are stored in the network itself instead of a database.

These networks can learn from examples, and we can predict the output for similar events.

In case of failure of one neuron, the network can detect the fault and still produce output.

Neural networks can perform multiple tasks in parallel processes.

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University, nor other companies (directly or indirectly) associated with the author(s). These writings do not intend to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

For attribution in academic contexts, please cite this work as:

Shukla, et al., “Building Neural Networks with Python Code and Math in Detail — II”, Towards AI, 2020

BibTex citation:

@article{pratik_iriondo_2020,
title={Building Neural Networks with Python Code and Math in Detail — II},
url={https://towardsai.net/building-neural-nets-with-python},
journal={Towards AI},
publisher={Towards AI Co.},
author={Pratik, Shukla and Iriondo,
Roberto},
year={2020},
month={Jun}
}

Are you new to machine learning? Check out an overview of machine learning algorithms for beginners with code examples in Python

Neural Networks from Scratch with Python Code and Math in Detail- I
https://towardsai.net/p/machine-learning/building-neural-networks-from-scratch-with-python-code-and-math-in-detail-i-536fae5d7bbf
Published Sat, 20 Jun 2020

Author(s): Pratik Shukla, Roberto Iriondo

Learn all about neural networks from scratch. From the math behind it to step-by-step implementation case studies in Python. Launch them live on Google Colab

Note: In our upcoming second tutorial on neural networks, we will show how we can add hidden layers to our neural nets.

What is a neural network?

Neural networks form the base of deep learning, which is a subfield of machine learning, where the structure of the human brain inspires the algorithms. Neural networks take input data, train themselves to recognize patterns found in the data, and then predict the output for a new set of similar data. Therefore, a neural network is the functional unit of deep learning, which mimics the behavior of the human brain to solve complex data-driven problems.

The first thing that comes to our mind when we think of “neural networks” is biology, and indeed, neural nets are inspired by our brains. Let’s try to understand them.

In machine learning, the dendrites refer to the inputs, and the nucleus processes the data and forwards the calculated output through the axon. In a biological neural network, the width (thickness) of a dendrite defines the weight associated with it.

Simply put, an ANN represents interconnected input and output units in which each connection has an associated weight. During the learning phase, the network learns by adjusting these weights in order to be able to predict the correct class for input data.

For instance:

Imagine we are in a deep sleep, and suddenly our environment starts to tremble. Immediately afterward, our brain recognizes that it is an earthquake. At once, we think of what is most valuable to us:

Our beloved ones.

Essential documents.

Jewelry.

Laptop.

A pencil.

Now we only have a few minutes to get out of the house, and we can only save a few things. What will our priorities be in this case?

Perhaps we are going to save our beloved ones first, and then, if time permits, we can think of other things. What we did here is assign a weight to each of our valuables. Each of the valuables at that moment is an input, and the priorities are the weights we assign to them.

The same is the case with neural networks. We assign weights to different values and predict the output from them. However, in this case, we do not know the associated weight with each input, so we make an algorithm that will calculate the weights associated with them by processing lots of input data.

2. Applications of Artificial Neural Networks:

a. Classification of data:

Based on a set of data, our trained neural network predicts whether it is a dog or a cat?

b. Anomaly detection:

Given the details about a person's transactions, it can say whether a transaction is fraudulent or not.

c. Speech recognition:

We can train our neural network to recognize speech patterns. Example: Siri, Alexa, Google assistant.

d. Audio generation:

Given the inputs as audio files, it can generate new music based on various factors like genre, singer, and others.

e. Time series analysis:

A well trained neural network can predict the stock price.

f. Spell checking:

We can train a neural network that detects misspelled words and can also suggest similar, correctly spelled words. Example: Grammarly.

g. Character recognition:

A well trained neural network can detect handwritten characters.

h. Machine translation:

We can develop a neural network that translates one language into another language.

i. Image processing:

We can train a neural network to process an image and extract pieces of information from it.

3. General Structure of an Artificial Neural Network (ANN):

4. What is a Perceptron?

A perceptron is a neural network without any hidden layer. A perceptron only has an input layer and an output layer.

Where we can use perceptrons?

Perceptrons are useful in many scenarios. While a perceptron is mostly used for simple decision making, perceptrons can also come together in larger computer programs to solve more complex problems.

For instance:

Give access if a person is a faculty member and deny access if a person is a student.

Steps involved in the implementation of a neural network:

A neural network executes in 2 steps :

1. Feedforward:

In a feedforward neural network, we have a set of input features and some random weights. Notice that in this case, we are taking random weights that we will optimize using backpropagation.

2. Backpropagation:

During backpropagation, we calculate the error between predicted output and target output and then use an algorithm (gradient descent) to update the weight values.

Why do we need backpropagation?

While designing a neural network, we first need to train a model and assign specific weights to each of the inputs. Each weight decides how vital that feature is for our prediction: the higher the weight, the greater the importance. However, initially, we do not know the specific weights required by those inputs. So we assign some random weights to our inputs, and our model calculates the error in prediction. Thereafter, we update our weight values and rerun the code (backpropagation). After several iterations, we can get lower error values and higher accuracy.

Summarizing an Artificial Neural Network:

Take inputs

Add bias (if required)

Assign random weights to input features

Run the code for training.

Find the error in prediction.

Update the weight by gradient descent algorithm.

Repeat the training phase with updated weights.

Make predictions.

Flow chart for a simple neural network:

The training phase of a neural network:

5. Perceptron Example:

Below is a simple perceptron model with four inputs and one output.

What we have here are the input values and their corresponding target output values. We are going to assign some weight to the inputs and then calculate the predicted output values.

In this example we are going to calculate the output by the following formula:

For the sake of this example, we are going to take the bias value as 0 for simplicity of calculation.

a. Let’s take W = 3 and check the predicted output.

b. After we have found the value of predicted output for W=3, we are going to compare it with our target output, and by doing that, we can find the error in the prediction model. Keep in mind that our goal is to achieve minimum error and maximum accuracy for our model.

c. Notice that in the above calculation, there is an error in 3 out of 4 predictions. So we have to change our weight value to bring the error down. Now we have two options:

Increase weight

Decrease weight

First, we are going to increase the value of the weight and check whether it leads to a higher error rate or lower error rate. Here we increased the weight value by 1 and changed it to W = 4.

d. As we can see in the figure above, the error in prediction is increasing. So we can conclude that increasing the weight value does not help us reduce the error in prediction.

e. Since increasing the weight value failed, we are going to decrease it instead and see whether that helps.

f. Calculate the error in prediction. Here we can see that we have achieved the global minimum.

In figure 17, we can see that there is no error in prediction.

Now what we did here:

First, we have our input values and target output.

Then we initialized some random value to W, and then we proceed further.

Last, we calculated the error in prediction for that weight value. Afterward, we updated the weight and predicted the output again. After several trial-and-error epochs, we reduced the error in prediction.

So, we are trying to find the value of the weight such that the error becomes minimal. We need to figure out whether we should increase or decrease the weight value. Once we know that, we keep updating the weight value in that direction until the error reaches a minimum. We might reach a point where further updates to the weight increase the error. At that point, we stop: that is our final weight value.

In real-life data, the situation can be a bit more complex. In the example above, we saw that we could try different weight values and get the minimum error manually. However, in real-life data, weight values are often decimal (non-integer). Therefore, we are going to use a gradient descent algorithm with a low learning rate so that we can try different weight values and obtain the best predictions from our model.
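As a sketch of this trial-and-error search, the snippet below scans a range of candidate weight values on hypothetical data (the numbers are illustrative, not the ones from the figures) and keeps the weight with the smallest squared error:

```python
import numpy as np

# Hypothetical inputs and targets (illustration only).
x = np.array([1.0, 2.0, 3.0, 4.0])
target = np.array([0.5, 1.0, 1.5, 2.0])  # generated with w = 0.5

best_w, best_err = None, float("inf")
for w in np.arange(0.0, 1.001, 0.05):    # candidate weight values
    pred = w * x
    err = ((target - pred) ** 2).sum()   # squared error for this weight
    if err < best_err:
        best_w, best_err = w, err

print(best_w, best_err)  # the scan recovers w = 0.5 with (near-)zero error
```

In practice, gradient descent replaces this brute-force scan: it follows the slope of the error toward the minimum, which also handles non-integer weights efficiently.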

6. Sigmoid Function:

A sigmoid function serves as an activation function in our neural network training. We generally use neural networks for classification. In binary classification, we have two classes, yet the output of the equation we used can be any number. To solve that problem, we use a sigmoid function, which converts our output values to lie between 0 and 1.

Let’s have a look at it:

Let’s visualize our sigmoid function with Python:

Output:

Explanation:

In figures 21 and 22, for any input value, the value of the sigmoid function always lies between 0 and 1. Notice that for negative numbers, the output of the sigmoid function is ≤0.5, or closer to zero, and for positive numbers, the output is >0.5, or closer to 1.
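The behavior described above can be checked numerically; this is a minimal verification, not the article's original plotting listing:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Sample the sigmoid at a few points across the real line.
for x in np.linspace(-10, 10, 9):
    print(f"{x:6.1f} -> {sigmoid(x):.4f}")

# Negative inputs map below 0.5, zero maps to exactly 0.5,
# and positive inputs map above 0.5 -- always strictly inside (0, 1).
```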

7. Neural Network Implementation from Scratch:

What we are going to do is implement the “OR” logic gate using a perceptron. Keep in mind that we are not going to use any hidden layers here.

What is logical OR Gate?

Straightforwardly, when at least one of the inputs is 1, the output of the OR gate is 1. The output is 0 only when both of the inputs are 0.
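This truth table is easy to confirm in Python; a quick check, separate from the perceptron code that follows:

```python
# OR gate: the output is 1 when at least one input is 1.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", a | b)

# Only the input pair (0, 0) yields 0; the other three pairs yield 1.
```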

Representation:

Truth-Table for OR gate:

Perceptron for the OR gate:

Next, we are going to assign some weights to each of the input values and calculate it.

Example: (Calculating Manually)

a. Calculate the input for o1:

b. Calculate the output value:

Notice that from our truth table, we can see that we wanted the output of 1, but what we get here is 0.68997. Now we need to calculate the error and then backpropagate and then update the weight values.
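The manual calculation itself appears only as an image in the original, so the snippet below is a reconstruction: it assumes inputs (1, 1) with weights (0.4, 0.4) and zero bias, which reproduce the quoted pre-activation of 0.8 and output of 0.68997 (the exact figures in the image may differ):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

inputs = np.array([1, 1])        # assumed input values
weights = np.array([0.4, 0.4])   # assumed weight values

in_o1 = np.dot(inputs, weights)  # a. input for o1
out_o1 = sigmoid(in_o1)          # b. output value
print(in_o1, round(out_o1, 5))   # prints: 0.8 0.68997
```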

c. Error Calculation:

Next, we are going to use the Mean Squared Error (MSE) for calculating the error:

The summation sign (the sigma symbol, Σ) means that we have to add the error over all of our input sets. Here we are going to see how that works for only one input set.
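For a single input set, the error term can be computed directly. The snippet assumes the conventional 1/2 factor in front of the squared difference, with the predicted output and target quoted earlier:

```python
target = 1.0
output = 0.68997  # predicted output from the forward pass

error = 0.5 * (target - output) ** 2
print(round(error, 5))  # prints: 0.04806
```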

We have to do the same for all the remaining inputs. Now that we have found the error, we have to update the values of weight to make the error minimum. For updating weight values, we are going to use a gradient descent algorithm.

8. What is Gradient Descent?

Gradient descent is an optimization algorithm that operates iteratively to find the optimal values for its parameters. It takes into account a user-defined learning rate and initial parameter values.

Working: (Iterative)

1. Start with initial values.

2. Calculate cost.

3. Update values using the update function.

4. Return the minimized cost for our cost function.
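The four steps can be illustrated on a toy cost function; this example uses f(w) = (w - 3)^2 rather than the network's cost, purely to show the mechanics:

```python
w = 0.0      # 1. start with an initial value
lr = 0.1     # user-defined learning rate

for step in range(100):
    cost = (w - 3) ** 2   # 2. calculate cost
    grad = 2 * (w - 3)    # derivative of the cost with respect to w
    w -= lr * grad        # 3. update the value using the update rule

print(w, (w - 3) ** 2)    # 4. w is now very close to 3, the cost close to 0
```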

Why do we need it?

Generally, we derive a closed-form formula that gives us the optimal values for our parameters. With this algorithm, however, the machine finds the values by itself!

Interesting, isn’t it?

We are going to update our weight with this algorithm. First of all, we need to find the derivative of f(x).

9. Derivation of the formula used in a neural network

Next, what we want to find is how a particular weight value affects the error. To find that we are going to apply the chain rule.

Afterward, what we have to do is we have to find values for these three derivatives.

In the following images, we have tried to show the derivation of each of these derivatives to showcase the math behind gradient descent.

d. Calculating derivatives:

In our case:

Output = 0.68997
Target = 1

e. Finding the second part of the derivative:

Figure 36: Calculating the second part

To understand it step-by-step:

e.a. Value of outo1:

e.b. Finding the derivative with respect to ino1:

e.c. Simplifying it a bit to find the derivative easily:

e.d. Applying chain rule and power rule:

e.e. Applying sum rule:

e.f. The derivative of constant is zero:

e.g. Applying exponential rule and chain rule:

e.h. Simplifying it a bit:

e.i. Multiplying both negative signs:

e.j. Put the negative power in the denominator:

That is it. However, we need to simplify it as it is a little complex for our machine learning algorithm to process for a large number of inputs.

e.k. Simplifying it:

e.l. Further simplification:

e.m. Adding +1–1:

e.n. Separate the parts:

e.o. Simplify:

e.p. Now we know the value of outo1 from equation 1:

e.q. From that, we can derive the following final derivative:

e.r. Calculating the value of our input:
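The closed-form result of this derivation, sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)), can be sanity-checked against a numerical (central-difference) derivative:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_der(x):
    # Analytic derivative obtained from the derivation above.
    return sigmoid(x) * (1 - sigmoid(x))

h = 1e-6
for x in (-2.0, 0.0, 0.8, 3.0):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    print(x, sigmoid_der(x), numeric)  # the two columns agree closely
```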

f. Finding the third part of the derivative:

f.a. Value of ino1:

f.b. Finding the derivative:

All the other values except w2 will be considered constant here.

f.c. Calculating both values for our input:

f.d. Putting it all together:

f.e. Putting it in our main equation:

f.f. We can calculate:

Notice that the value of the weight has increased here. We can calculate all the values in this way, but as we can see, it is going to be a lengthy process. So now we are going to implement all the steps in Python.
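To see why the weight increases, here is one explicit update step with illustrative numbers: a learning rate of 0.05, an input value of 1, and an assumed current weight of 0.4 (the figures may use different values):

```python
output = 0.68997  # predicted output from the forward pass
target = 1.0
lr = 0.05
w = 0.4           # assumed current weight
x_in = 1.0        # assumed input feeding this weight

derror_douto = output - target       # 1st derivative: error w.r.t. output
douto_dino = output * (1 - output)   # 2nd derivative: sigmoid derivative
dino_dw = x_in                       # 3rd derivative: input value

grad = derror_douto * douto_dino * dino_dw
w_new = w - lr * grad
print(w_new)  # larger than 0.4: the negative gradient pushes the weight up
```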

Summary of The Manual Implementation of a Neural Network:

a. Input for perceptron:

b. Applying the sigmoid function for the predicted output:

c. Calculate the error:

d. Changing the weight value based on gradient descent formula:

e. Calculating the derivative:

f. Individual derivatives:

Source: Image created by the author.

g. After that, we run the same code with updated weight values.

Let’s code:

10. Implementation of a Neural Network In Python:

10.1 Import Required libraries:

First, we are going to import Python libraries. We are using NumPy for the calculations:

10.2 Assign Input values:

Next, we are going to take input values for which we want to train our neural network. Here we can see that we have taken two input features. In real datasets, the number of input features is usually much higher.

10.3 Target Output:

For these input features, we want a specific output, called the target output. We are going to train the model so that it gives us the target output for our input features.

10.3 Assign the Weights :

Next, we are going to assign random weights to the input features. Note that our model is going to modify these weight values to be optimum. At this point, we are taking these values randomly. Here we have two input features, so we are going to take two weight values.

10.4 Adding Bias Values and Assigning a Learning Rate :

Now here we are going to add the bias value. The bias input value is 1; however, the weight assigned to it is random at first, and our model will optimize it for our target output.

The other parameter is called the learning rate (LR). We are going to use the learning rate in the gradient descent algorithm to update the weight values. Generally, we keep the learning rate as low as possible so that we can achieve a minimal error rate.

10.5 Applying a Sigmoid Function:

Once we have our weight values and input features, we are going to send them to the main function that predicts the output. Notice that our input features and weight values can be anything, but here we want to classify data, so we need the output between 0 and 1. For that, we are going to use a sigmoid function.

10.6 Derivative of sigmoid function:

In the gradient descent algorithm, we are going to need the derivative of the sigmoid function.

10.7 The main logic for predicting output and updating the weight values:

We are going to explain the following code step-by-step.

How does it work?

First of all, the code above runs 10,000 times. Keep in mind that if we only run this code a few times, then we will probably have a higher error rate. In short, we update the weight values 10,000 times to reach the most optimal value possible.

Next, we need to multiply the input features with their corresponding weight values. The values we are going to feed to the perceptron can be represented in the form of a matrix.

in_o represents the dot product of input_features and weight. Notice that the first matrix (input features) is of size (4*2), and the second matrix (weights) is of size (2*1). After multiplication, the resultant matrix is of size (4*1).
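The shapes involved can be verified directly, using the same OR-gate inputs and a generic (2*1) weight matrix:

```python
import numpy as np

input_features = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # shape (4, 2)
weights = np.array([[0.1], [0.2]])                           # shape (2, 1)

in_o = np.dot(input_features, weights)  # matrix product
print(input_features.shape, weights.shape, in_o.shape)  # (4, 2) (2, 1) (4, 1)
```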

In the above representation, each of those boxes represents a value.

Now in our formula, we also have the bias value. Let’s understand it with simple matrix representation.

Next, we are going to add the bias value. Addition operation in the matrix is easy to understand. Such is the input for the sigmoid function. Afterward, we are going to apply the sigmoid function to our input value, which will give us the predicted output value between 0 and 1.

Next, we have to calculate the error in prediction. We generally use the Mean Squared Error (MSE) for this, but here we are just going to use a simple error function for simplicity of calculation. Last, we add up the error for all four of our inputs.

Our ultimate goal is to minimize the error. To minimize the error, we can update the value of our weights. To update the weight value, we are going to use a gradient descent algorithm.

To find the derivative, we are going to need the values of some derivatives for our gradient descent algorithm. As we have already discussed, we are going to find 3 individual values for derivatives and then multiply it.

The first derivative is:

The second derivative is:

The third derivative is:

Notice that we can easily find the values of the first two derivatives, as they do not depend on the inputs. Next, we store the product of the first two derivatives in the deriv variable. The final derivative we compute must have the same size as the weights, which is (2*1).

To find the final derivative, we need to find the transpose of our input_features and then we are going to multiply it with our deriv variable that is basically the multiplication of the other two derivatives.

Let’s have a look at the matrix representation of the operation.

On figure 83, the first matrix is the transposed input_features matrix. The second matrix stores the product of the other two derivatives. We store the result in a matrix called deriv_final. Notice that the size of deriv_final is (2*1), the same as the size of our weight matrix (2*1).

Afterward, we update the weight value, notice that we have all the values needed for updating our weight. We are going to use the following formula to update the weight values.

Last, we need to update the bias value. Recall from the diagram that the bias weight does not depend on the input, so we have to update it separately using the deriv values. To update the bias value, we loop over the derivative values, updating the bias for each input on every iteration.

10.8 Check the Values of Weight and Bias:

On figure 85, notice that our weight and bias values have changed from our randomly assigned values.

10.9 Predicting values :

Since we have trained our model, we can start to make predictions from it.

10.9.1 Prediction for (1,0):

Target value = 1

On figure 86, we can see the predicted output is very near to 1.

10.9.2 Prediction for (1,1):

Target output = 1

On figure 87, we can see that the predicted output is very close to 1.

10.9.3 Prediction for (0,0):

Target output = 0

On figure 88, we can see that the predicted output is very close to 0.

# Multiplying individual derivatives:
deriv = derror_douto * douto_dino

# Multiplying with the 3rd individual derivative:
# Finding the transpose of input_features:
inputs = input_features.T
deriv_final = np.dot(inputs, deriv)

# Updating the weights values:
weights -= lr * deriv_final

# Updating the bias weight value:
for i in deriv:
    bias -= lr * i

# Check the final values for weight and bias:
print(weights)
print(bias)

Suppose we have the input values (0,0). The sum of the products of the input nodes and weights is always going to be zero, so without a bias the output is stuck at sigmoid(0) = 0.5, no matter how much we train our model. To resolve this issue and make reliable predictions, we use the bias term. In short, the bias term is necessary to make a robust neural network.

So how does the value of the bias affect the shape of our sigmoid function? Let’s visualize it with some examples.

To change the steepness of the sigmoid curve, we can adjust the weight accordingly.

For instance:

From the output, we can quickly notice that for negative values, the output of the sigmoid function is going to be ≤0.5. Moreover, for positive values, the output is going to be >0.5.

From the figure (red curve), we can see that if we decrease the value of the weight, the steepness decreases, and if we increase the weight (green curve), the steepness increases. However, for all three curves, if the input is negative, the output is always ≤0.5, and for positive numbers, the output is always >0.5.

What if we want to change this pattern?

For such case scenarios, we use bias values.

From the output, we can notice that the bias shifts the curves along the x-axis, which helps us change the pattern we saw in the previous example.
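Numerically, the shift works like this: for an output sigmoid(w*x + b), the point where the curve crosses 0.5 moves from x = 0 to x = -b/w, so a nonzero bias slides the whole curve along the x-axis (the weight and bias values below are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

w = 1.0
for b in (-2.0, 0.0, 2.0):
    crossing = -b / w  # where sigmoid(w*x + b) equals 0.5
    print(b, crossing, sigmoid(w * crossing + b))  # always 0.5 at the crossing
```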

Summary:

In neural networks:

We can view bias as a threshold value for activation.

Bias increases the flexibility of the model

The bias value allows us to shift the activation function to the right or left.

The bias value is most useful when we have all zeros (0,0) as input.

Let’s try to understand it with the same example we saw earlier, but this time without adding the bias value. After the model has trained, we will try to predict the output for (0,0). Ideally, it should be close to zero. Let’s check out the following example.

An Implementation Without Bias Value:

a. Import required libraries:

b. Input features:

c. Target output:

d. Define Input weights:

e. Define the learning rate:

f. Activation function:

g. A derivative of the sigmoid function:

h. The main logic for training our model:

Here notice that we are not going to use bias values anywhere.

i. Making predictions:

i.a. Prediction for (1,0) :

Target output = 1

From the predicted output we can see that it’s close to 1.

i.b. Prediction for (0,0) :

Target output = 0

Here we can see that it’s nowhere near 0, so our model failed to predict it. This is the reason for adding the bias value.
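This failure is structural rather than a training problem: with no bias, the pre-activation for the input (0, 0) is always zero, and sigmoid(0) = 0.5 no matter what the weights are:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

zero_input = np.array([0, 0])
for weights in (np.array([0.1, 0.2]), np.array([5.0, -3.0]), np.array([100.0, 100.0])):
    pred = sigmoid(np.dot(zero_input, weights))
    print(weights, "->", pred)  # always 0.5, regardless of the weights
```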

i.c. Prediction for (1,1):

Target output = 1

We can see that it’s close to 1.

Putting it all together:

# Import required libraries:
import numpy as np

# Define input features:
input_features = np.array([[0,0],[0,1],[1,0],[1,1]])
print(input_features.shape)
print(input_features)

# Define target output:
target_output = np.array([[0,1,1,1]])

# Reshaping our target output into a vector:
target_output = target_output.reshape(4,1)
print(target_output.shape)
print(target_output)

# Define weights:
weights = np.array([[0.1],[0.2]])
print(weights.shape)
print(weights)

# Define learning rate:
lr = 0.05

# Sigmoid function:
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Derivative of sigmoid function:
def sigmoid_der(x):
    return sigmoid(x)*(1-sigmoid(x))

# Main logic for neural network:
# Running our code 10000 times:
for epoch in range(10000):
    inputs = input_features

    # Feedforward input:
    pred_in = np.dot(inputs, weights)

    # Feedforward output:
    pred_out = sigmoid(pred_in)

    # Backpropagation
    # Calculating error:
    error = pred_out - target_output
    x = error.sum()

    # Going with the formula:
    print(x)

    # Calculating derivative:
    dcost_dpred = error
    dpred_dz = sigmoid_der(pred_out)

    # Multiplying individual derivatives:
    z_delta = dcost_dpred * dpred_dz

    # Multiplying with the 3rd individual derivative:
    inputs = input_features.T
    weights -= lr * np.dot(inputs, z_delta)

# Taking inputs:
single_point = np.array([1,0])

# 1st step:
result1 = np.dot(single_point, weights)

# 2nd step:
result2 = sigmoid(result1)

# Print final result:
print(result2)

# Taking inputs:
single_point = np.array([0,0])

# 1st step:
result1 = np.dot(single_point, weights)

# 2nd step:
result2 = sigmoid(result1)

# Print final result:
print(result2)

# Taking inputs:
single_point = np.array([1,1])

# 1st step:
result1 = np.dot(single_point, weights)

# 2nd step:
result2 = sigmoid(result1)

# Print final result:
print(result2)

Now a real-life example of a prediction case study with a neural network ↓

Case Study: Predicting Whether a Person will be Positive for a Virus with a Neural Net

Dataset:

For this example, our goal is to predict whether a person is positive for a virus or not based on the given input features. Here 1 represents “Yes” and 0 represents “No”.

Let’s code:

a. Import required libraries:

b. Input features:

c. Target output:

d. Define weights:

e. Bias value and learning rate:

f. Sigmoid function:

g. Derivative of sigmoid function:

h. The main logic for training model:

i. Making predictions:

i.a. A tested person is positive for the virus.

i.b. A tested person is negative for the virus.

i.c. A tested person is positive for the virus.

j. Final weight and bias values:

In this example, we can notice that the input feature “loss of smell” influences the output the most. If it is true, then in most cases, the person tests positive for the virus. We can also derive this conclusion from the weight values. Keep in mind that the higher the value of the weight, the more influence it has on the output. The input feature “Weight loss” does not affect the output much, so we can rule it out when making predictions on a larger dataset.

Putting it all together:

# Import required libraries:
import numpy as np

# Define input features:
input_features = np.array([[1,0,0,1],[1,0,0,0],[0,0,1,1],
                           [0,1,0,0],[1,1,0,0],[0,0,1,1],
                           [0,0,0,1],[0,0,1,0]])
print(input_features.shape)
print(input_features)

# Define target output:
target_output = np.array([[1,1,0,0,1,1,0,0]])

# Reshaping our target output into a vector:
target_output = target_output.reshape(8,1)
print(target_output.shape)
print(target_output)

# Define weights:
weights = np.array([[0.1],[0.2],[0.3],[0.4]])
print(weights.shape)
print(weights)

# Bias weight:
bias = 0.3

# Learning rate:
lr = 0.05

# Sigmoid function:
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Derivative of sigmoid function:
def sigmoid_der(x):
    return sigmoid(x)*(1-sigmoid(x))

# Main logic for neural network:
# Running our code 10000 times:
for epoch in range(10000):
    inputs = input_features

    # Feedforward input:
    pred_in = np.dot(inputs, weights) + bias

    # Feedforward output:
    pred_out = sigmoid(pred_in)

    # Backpropagation
    # Calculating error:
    error = pred_out - target_output

    # Going with the formula:
    x = error.sum()
    print(x)

    # Calculating derivative:
    dcost_dpred = error
    dpred_dz = sigmoid_der(pred_out)

    # Multiplying individual derivatives:
    z_delta = dcost_dpred * dpred_dz

    # Multiplying with the 3rd individual derivative:
    inputs = input_features.T
    weights -= lr * np.dot(inputs, z_delta)

    # Updating the bias weight value:
    for i in z_delta:
        bias -= lr * i

# Printing final weights and bias:
print(weights)
print(bias)

In the examples above, we did not use any hidden layers for calculations. Notice that in the above examples, our data were linearly separable. For instance:

We can see that the red line can separate the yellow dots (value = 1) and the green dot (value = 0).

Limitations of a Perceptron Model (Without Hidden Layers):

1. Single-layer perceptrons cannot classify non-linearly separable data points.

2. Single-layer perceptrons cannot solve complex problems that involve many parameters.

However, in many cases, the data is not linearly separable. In that case, our perceptron model (without hidden layers) fails to make accurate predictions. To make accurate predictions, we need to add one or more hidden layers.

Visual representation of non-linearly separable data:
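The classic example of non-linearly separable data is the XOR gate. As a sketch (reusing the article-style training loop), training the same single-layer model on XOR targets always leaves at least one input badly mispredicted, because no single line can separate XOR's classes:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])  # XOR targets: not linearly separable

weights = np.array([[0.1], [0.2]])
bias = 0.3
lr = 0.05

for epoch in range(10000):
    pred = sigmoid(np.dot(X, weights) + bias)
    error = pred - y
    z_delta = error * pred * (1 - pred)
    weights -= lr * np.dot(X.T, z_delta)
    bias -= lr * z_delta.sum()

pred = sigmoid(np.dot(X, weights) + bias)
print(pred.round(3))
print("worst error:", float(np.abs(pred - y).max()))  # stays large: XOR needs a hidden layer
```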

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University, nor other companies (directly or indirectly) associated with the author(s). These writings do not intend to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

Citation

For attribution in academic contexts, please cite this work as:

Shukla, et al., “Neural Networks from Scratch with Python Code and Math in Detail — I”, Towards AI, 2020

BibTex citation:

@article{pratik_iriondo_2020,
title={Neural Networks from Scratch with Python Code and Math in Detail — I},
url={https://towardsai.net/neural-networks-with-python},
journal={Towards AI},
publisher={Towards AI Co.},
author={Shukla, Pratik and Iriondo, Roberto},
year={2020},
month={Jun}
}

Gather AI, Revolutionizing Inventory Management One Drone at a Time

Author(s): Roberto Iriondo

Gather AI’s video showcases the world’s first dedicated autonomous software-only inventory management platform for warehouses.

Machine Learning (ML) Algorithms For Beginners with Code Examples in Python

Overview of the major machine learning algorithms for beginners with coding samples

Machine learning (ML) is rapidly changing the world through the diverse applications and research pursued in industry and academia. Machine learning affects every part of our daily lives: from voice assistants that use NLP and machine learning to make appointments, check our calendar, and play music, to programmatic advertisements so accurate that they can predict what we will need before we even think of it.

More often than not, the complexity of the scientific field of machine learning can be overwhelming, making keeping up with “what is important” a very challenging task. However, to provide a learning path for those who seek to learn machine learning but are new to these concepts, in this article we look at the most important basic algorithms, which will hopefully make your machine learning journey less challenging.

Any suggestions or feedback are crucial for us to continue to improve. Please let us know in the comments if you have any.

Index

Introduction to Machine Learning.

Major Machine Learning Algorithms.

Supervised vs. Unsupervised Learning.

Linear Regression.

Multivariable Linear Regression.

Polynomial Regression.

Exponential Regression.

Sinusoidal Regression.

Logarithmic Regression.

What is machine learning?

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. ~ Tom M. Mitchell [1]

Machine learning behaves similarly to the growth of a child. As a child grows, her experience E in performing task T increases, which results in a higher performance measure (P).

For instance, we give a “shape sorting block” toy to a child. (We all know that in this toy, we have different shapes and shape holes.) In this case, our task T is to find the appropriate shape hole for each shape. The child observes a shape and tries to fit it in a shaped hole. Let us say that this toy has three shapes: a circle, a triangle, and a square. In her first attempt at finding the shape holes, her performance measure (P) is 1/3, which means that the child found 1 out of 3 correct shape holes.

Second, the child tries another time and notices that she is a little more experienced at this task. Considering the experience gained (E), the child tries the task again, and when measuring her performance (P), it turns out to be 2/3. After repeating this task (T) 100 times, the child has figured out which shape goes into which shape hole.

So as her experience (E) increased, her performance (P) also increased. We notice that as the number of attempts at this toy increases, the performance also increases, resulting in higher accuracy.

Such execution is similar to machine learning. What a machine does is take a task (T), execute it, and measure its performance (P). Now a machine has a large amount of data, so as it processes that data, its experience (E) increases over time, resulting in a higher performance measure (P). So after going through all the data, our machine learning model’s accuracy increases, which means that the predictions made by our model will be very accurate.

Another definition of machine learning by Arthur Samuel:

Machine Learning is the subfield of computer science that gives “computers the ability to learn without being explicitly programmed.” ~ Arthur Samuel [2]

Let us try to understand this definition: it states “learn without being explicitly programmed,” which means that we are not going to teach the computer a specific set of rules; instead, we are going to feed the computer enough data and give it time to learn from it, making its own mistakes and improving upon those. For example, we did not teach the child how to fit in the shapes, but by performing the same task several times, the child learned to fit the shapes in the toy by herself.

Therefore, we can say that we did not explicitly teach the child how to fit the shapes. We do the same thing with machines: we give the machine enough data to work on and specify the information we want from it. It then processes the data and makes accurate predictions.

Why do we need machine learning?

For instance, suppose we have a set of images of cats and dogs, and we want to classify them into a group of cats and a group of dogs. To do that, we need to identify different animal features, such as:

How many eyes does each animal have?

What is the eye color of each animal?

What is the height of each animal?

What is the weight of each animal?

What does each animal generally eat?

We form a vector on each of these questions’ answers. Next, we apply a set of rules such as:

If height > 1 foot and weight > 15 lbs, then it could be a cat.

Now, we have to make such a set of rules for every data point, placing them in a decision tree of if, else if, else statements and checking whether a sample falls into one of the categories.
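Such a hand-written rule set might look like the sketch below; the thresholds and features are invented for illustration, which is exactly why this approach scales poorly:

```python
def classify(height_ft, weight_lbs):
    # Hand-crafted decision rules: brittle and hard to maintain.
    if height_ft > 1 and weight_lbs > 15:
        return "cat"
    elif weight_lbs > 30:
        return "dog"
    else:
        return "unknown"

print(classify(1.2, 18))  # prints: cat
print(classify(0.8, 40))  # prints: dog
```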

Let us assume that the result of this experiment was not fruitful as it misclassified many of the animals, which gives us an excellent opportunity to use machine learning.

What machine learning does is process the data with different kinds of algorithms and tell us which features are most important for determining whether an animal is a cat or a dog. So instead of applying many sets of rules, we can simplify the decision based on two or three features, and as a result, we get higher accuracy. The previous method was not generalized enough to make good predictions.

Machine learning models help us with many tasks, such as:

Object Recognition

Summarization

Prediction

Classification

Clustering

Recommender systems

And others

What is a machine learning model?

A machine learning model is a question-answering system that takes care of processing machine-learning-related tasks. Think of it as an algorithmic system that represents data when solving problems. The methods we tackle below are beneficial for industry-related purposes when tackling business problems.

For instance, let us imagine that we are working on Google Adwords’ ML system, and our task is to implement an ML algorithm that reaches a particular demographic or area using data. Such a task aims to go from using data to gathering valuable insights that improve business outcomes.

Major Machine Learning Algorithms:

1. Regression (Prediction)

We use regression algorithms for predicting continuous values.

Regression algorithms:

Linear Regression

Polynomial Regression

Exponential Regression

Logistic Regression

Logarithmic Regression

2. Classification

We use classification algorithms for predicting a set of items’ class or category.

Classification algorithms:

K-Nearest Neighbors

Decision Trees

Random Forest

Support Vector Machine

Naive Bayes

3. Clustering

We use clustering algorithms for summarization or to structure data.

Clustering algorithms:

K-means

DBSCAN

Mean Shift

Hierarchical

4. Association

We use association algorithms for associating co-occurring items or events.

Association algorithms:

Apriori

5. Anomaly Detection

We use anomaly detection for discovering abnormal activities and unusual cases like fraud detection.

6. Sequence Pattern Mining

We use sequential pattern mining for predicting the next data events between data examples in a sequence.

7. Dimensionality Reduction

We use dimensionality reduction for reducing the size of data to extract only useful features from a dataset.

8. Recommendation Systems

We use recommendation algorithms to build recommendation engines.

Examples:

Netflix recommendation system.

A book recommendation system.

A product recommendation system on Amazon.

Nowadays, we hear many buzz words like artificial intelligence, machine learning, deep learning, and others.

What are the fundamental differences between Artificial Intelligence, Machine Learning, and Deep Learning?

Artificial Intelligence (AI):

Artificial intelligence (AI), as defined by Professor Andrew Moore, is the science and engineering of making computers behave in ways that, until recently, we thought required human intelligence [4].

These include:

Computer Vision

Language Processing

Creativity

Summarization

Machine Learning (ML):

As defined by Professor Tom Mitchell, machine learning refers to a scientific branch of AI, which focuses on the study of computer algorithms that allow computer programs to automatically improve through experience [3].

These include:

Classification

Neural Network

Clustering

Deep Learning:

Deep learning is a subset of machine learning in which layered neural networks, combined with high computing power and large datasets, can create powerful machine learning models. [3]

Why do we prefer Python to implement machine learning algorithms?

Python is a popular, general-purpose programming language. We can write machine learning algorithms in Python, and it works well. The reason Python is so popular among data scientists is that it has a diverse variety of already-implemented modules and libraries that make our lives easier.

Let us have a brief look at some exciting Python libraries.

NumPy: It is a math library for working with n-dimensional arrays in Python. It enables us to do computations effectively and efficiently.

SciPy: It is a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics, and much more. SciPy is a functional library for scientific and high-performance computations.

Matplotlib: It is a popular plotting package that provides 2D plotting as well as 3D plotting.

Scikit-learn: It is a free machine learning library for the Python programming language. It includes most classification, regression, and clustering algorithms, and it works with Python numerical libraries such as NumPy and SciPy.
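As a brief taste of these libraries, here is a small sketch (not part of the original tutorial; the numbers are invented for illustration, and Matplotlib is omitted since it draws figures rather than printing values):

```python
# One tiny operation from each library mentioned above:
import numpy as np
from scipy import optimize
from sklearn.linear_model import LinearRegression

# NumPy: n-dimensional arrays and vectorized math.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
print(a.mean())  # 2.5

# SciPy: numerical algorithms, e.g. root finding for x^2 - 2 = 0 on [0, 2].
root = optimize.brentq(lambda x: x ** 2 - 2, 0, 2)
print(round(root, 4))  # 1.4142

# Scikit-learn: fit a line through the points (0, 0), (1, 2), (2, 4).
model = LinearRegression().fit([[0], [1], [2]], [0, 2, 4])
print(model.coef_)  # slope of 2
```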

Machine learning algorithms fall into two groups:

Supervised Learning algorithms

Unsupervised Learning algorithms

I. Supervised Learning Algorithms:

Goal: Predict class or value label.

Supervised learning is a branch of machine learning (perhaps the mainstream of machine/deep learning for now) concerned with inferring a function from labeled training data. The training data consists of a set of *(input, target)* pairs, where the input could be a vector of features, and the target instructs what we desire the function to output. Depending on the type of the *target*, we can roughly divide supervised learning into two categories: classification and regression. Classification involves categorical targets; examples range from simple cases, such as image classification, to advanced topics, such as machine translation and image captioning. Regression involves continuous targets; its applications, including stock prediction, image masking, and others, all fall in this category.

To understand what supervised learning is, we will use an example. Suppose we give a child 100 stuffed animals, with ten animals of each kind: ten lions, ten monkeys, ten elephants, and so on. Next, we teach the kid to recognize the different types of animals based on their characteristics (features): if its color is orange, then it might be a lion; if it is a big animal with a trunk, then it may be an elephant.

Teaching the kid how to differentiate animals is an example of supervised learning. Now, when we show the kid different animals, he should be able to classify them into the appropriate animal groups.

For the sake of this example, suppose 8 out of 10 of his classifications were correct; we can say that the kid has done a pretty good job. The same applies to computers. We provide them with thousands of data points along with their actual labeled values (labeled data is data classified into different groups along with its feature values). The model then learns from these different characteristics during its training period. After the training period is over, we can use our trained model to make predictions. Keep in mind that we already fed the machine labeled data, so its prediction algorithm is based on supervised learning. In short, the predictions in this example are based on labeled data.
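The stuffed-animal analogy maps directly onto code. Below is a minimal sketch using a k-nearest-neighbors classifier from scikit-learn; the features (size, has-a-trunk) and labels are invented purely for illustration:

```python
# Supervised learning in miniature: train on labeled examples,
# then classify an unseen one.
from sklearn.neighbors import KNeighborsClassifier

# Each animal is described by two made-up features: (size, has_trunk).
X_train = [[1, 0], [1, 0], [9, 1], [9, 1], [5, 0], [5, 0]]
y_train = ["lion", "lion", "elephant", "elephant", "monkey", "monkey"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Classify a new, unseen animal: big, with a trunk.
print(model.predict([[8, 1]]))  # ['elephant']
```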

Example of supervised learning algorithms :

Linear Regression

Logistic Regression

K-Nearest Neighbors

Decision Tree

Random Forest

Support Vector Machine

II. Unsupervised Learning:

Goal: Determine data patterns/groupings.

In contrast to supervised learning, unsupervised learning infers, from unlabeled data, a function that describes hidden structures in the data.

Perhaps the most basic type of unsupervised learning comprises dimensionality-reduction methods, such as PCA and t-SNE; PCA is generally used in data preprocessing, while t-SNE is usually used in data visualization.
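As a small sketch of dimensionality reduction (on synthetic data, not from the tutorial), PCA can compress correlated 3-D points down to 2 components while retaining almost all of the variance:

```python
# Dimensionality reduction with PCA on synthetic correlated data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
t = rng.normal(size=(100, 1))
# Three features that all vary with t, plus a little noise:
X = np.hstack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(100, 3))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # first component carries nearly all variance
```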

A more advanced branch is clustering, which explores the hidden patterns in data and then makes predictions on them; examples include K-means clustering, Gaussian mixture models, hidden Markov models, and others.

Along with the renaissance of deep learning, unsupervised learning gains more and more attention because it frees us from manually labeling data. In light of deep learning, we consider two kinds of unsupervised learning: representation learning and generative models.

Representation learning aims to distill a high-level representative feature that is useful for some downstream tasks, while generative models intend to reproduce the input data from some hidden parameters.

Unsupervised learning works as it sounds: in this type of algorithm, we do not have labeled data, so the machine has to process the input data and try to draw conclusions about the output. For example, remember the kid to whom we gave a shape toy? In this case, he would learn from his own mistakes to find the perfect shape hole for the different shapes.

The catch is that we are not teaching the child the methods to fit the shapes (in machine learning terms, we are not providing labeled data). Instead, the child learns from the toy's different characteristics and tries to draw conclusions about them. In short, the predictions are based on unlabeled data.
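The shape-toy idea can be sketched with K-means: the algorithm receives only unlabeled points and finds the groups on its own (the two blobs below are synthetic, for illustration):

```python
# Unsupervised clustering: K-means sees points, never labels.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Two blobs of unlabeled points, around (0, 0) and (10, 10):
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# The discovered centers sit near the true blob centers (in some order):
print(kmeans.cluster_centers_.round())
```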

Examples of unsupervised learning algorithms:

Dimension Reduction

Density Estimation

Generative adversarial networks (GANs)

Market Basket Analysis

Clustering

For this article, we will use a few types of regression algorithms with coding samples.

1. Linear Regression:

Linear regression is a statistical approach that models the relationship between input features and an output. The input features are called the independent variables, and the output is called the dependent variable. Our goal here is to predict the value of the output based on the input features by multiplying them by their optimal coefficients.

Some real-life examples of linear regression :

(1) To predict sales of products.

(2) To predict economic growth.

(3) To predict petroleum prices.

(4) To predict the emission of a new car.

(5) Impact of GPA on college admissions.

There are two types of linear regression :

Simple Linear Regression

Multivariable Linear Regression

1.1 Simple Linear Regression:

In simple linear regression, we predict the output/dependent variable based on only one input feature. The simple linear regression model is given by:

Y = β₀ + β₁·X

where β₀ is the intercept and β₁ is the coefficient of the input feature X.

Below we are going to implement simple linear regression using the sklearn library in Python.

Step by step implementation in Python:

a. Import required libraries:

Since we are going to use various libraries for calculations, we need to import them.

b. Read the CSV file:

We check the first five rows of our dataset. In this case, we are using a vehicle model dataset — please check out the dataset on Github.

c. Select the features we want to consider in predicting values:

Here our goal is to predict the value of “co2 emissions” from the value of “engine size” in our dataset.

d. Plot the data:

We can visualize our data on a scatter plot.

e. Divide the data into training and testing data:

To check the accuracy of a model, we are going to divide our data into training and testing datasets. We will use training data to train our model, and then we will check the accuracy of our model using the testing dataset.
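As an aside, scikit-learn also ships a helper for this split. The sketch below uses stand-in arrays rather than the Fuel dataset, and `shuffle=False` reproduces the tutorial's "first 80% for training" behavior:

```python
# An alternative to manual slicing: scikit-learn's train_test_split.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)  # stand-in for the ENGINESIZE column
y = 2 * np.arange(10)             # stand-in for CO2EMISSIONS

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)

print(len(X_train), len(X_test))  # 8 2
```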

f. Training our model:

Here is how we can train our model and find the coefficients for our best-fit regression line.

g. Plot the best fit line:

Based on the coefficients, we can plot the best fit line for our dataset.

h. Prediction function:

We are going to use a prediction function for our testing dataset.

i. Predicting co2 emissions:

Predicting the values of co2 emissions based on the regression line.

j. Checking accuracy for test data :

We can check the accuracy of a model by comparing the actual values with the predicted values in our dataset.

Putting it all together:

# Import required libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import r2_score

# Read the CSV file:
data = pd.read_csv("Fuel.csv")
data.head()

# Let's select some features to explore more:
data = data[["ENGINESIZE", "CO2EMISSIONS"]]

# Generating training and testing data from our data:
# We are using 80% of the data for training.
train = data[:(int(len(data) * 0.8))]
test = data[(int(len(data) * 0.8)):]

# Modeling:
# Using the sklearn package to model the data:
regr = linear_model.LinearRegression()
train_x = np.array(train[["ENGINESIZE"]])
train_y = np.array(train[["CO2EMISSIONS"]])
regr.fit(train_x, train_y)

# The coefficients of the best-fit regression line:
print("Coefficients:", regr.coef_)
print("Intercept:", regr.intercept_)

# Predicting CO2 emissions for the testing data:
test_x = np.array(test[["ENGINESIZE"]])
test_y = np.array(test[["CO2EMISSIONS"]])
test_y_pred = regr.predict(test_x)

# Checking the accuracy of the model on the test data:
print("R²:", r2_score(test_y, test_y_pred))

1.2 Multivariable Linear Regression:

In simple linear regression, we could only consider one input feature for predicting the value of the output feature. In multivariable linear regression, however, we can predict the output based on more than one input feature. The formula for multivariable linear regression is:

Y = β₀ + β₁·X₁ + β₂·X₂ + … + βₙ·Xₙ

Step by step implementation in Python:

a. Import the required libraries:

b. Read the CSV file :

c. Define X and Y:

X stores the input features we want to consider, and Y stores the value of output.

d. Divide data into a testing and training dataset:

Here we are going to use 80% data in training and 20% data in testing.

e. Train our model :

Here we are going to train our model with 80% of the data.

f. Find the coefficients of input features :

Now we want to know which features have the most significant effect on the output variable. For that, we are going to print the coefficient values. Note that a negative coefficient means the feature has an inverse effect on the output: if the value of that feature increases, then the output value decreases.

g. Predict the values:

h. Accuracy of the model:

Notice that we used the same dataset for simple and multivariable linear regression, and that the accuracy of multivariable linear regression is far better than that of simple linear regression.

Putting it all together:

# Import the required libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import r2_score

# Read the CSV file:
data = pd.read_csv("Fuel.csv")
data.head()

# Consider the features we want to work on:
X = data[["ENGINESIZE", "CYLINDERS", "FUELCONSUMPTION_CITY", "FUELCONSUMPTION_HWY",
          "FUELCONSUMPTION_COMB", "FUELCONSUMPTION_COMB_MPG"]]
Y = data["CO2EMISSIONS"]

# Generating training and testing data from our data:
# We are using 80% of the data for training.
split = int(len(data) * 0.8)
train_x, test_x = X[:split], X[split:]
train_y, test_y = Y[:split], Y[split:]

# Modeling:
# Using the sklearn package to model the data:
regr = linear_model.LinearRegression()
regr.fit(train_x, train_y)

# Print the coefficient values; a negative coefficient means that
# feature has an inverse effect on the output:
print("Coefficients:", regr.coef_)

# Now let's do a prediction on the test data:
Y_pred = regr.predict(test_x)

# Check accuracy:
R = r2_score(test_y, Y_pred)
print("R²:", R)
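The claim that the multivariable model outperforms the simple one can be checked with a self-contained sketch. The data below is synthetic (generated from three features), standing in for Fuel.csv so the comparison runs without the CSV:

```python
# Simple vs. multivariable linear regression on the same synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, (200, 3))  # three informative input features
y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(0, 1, 200)

X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# Simple model uses only the first feature; multivariable uses all three.
simple = LinearRegression().fit(X_train[:, :1], y_train)
multi = LinearRegression().fit(X_train, y_train)

r2_simple = r2_score(y_test, simple.predict(X_test[:, :1]))
r2_multi = r2_score(y_test, multi.predict(X_test))
print(round(r2_simple, 2), round(r2_multi, 2))  # the multivariable R² is higher
```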

1.3 Polynomial Regression:

Sometimes we have data that does not merely follow a linear trend but instead follows a polynomial one. In that case, we use polynomial regression.

Before digging into its implementation, we need to know how the graphs of some primary polynomial data look.

Polynomial Functions and Their Graphs:

a. Graph for Y=X:

b. Graph for Y = X²:

c. Graph for Y = X³:

d. Graph with more than one polynomial term: Y = X³+X²+X:

In the graph above, we can see that the red dots show the graph for Y = X³+X²+X and the blue dots show the graph for Y = X³. Here we can see that the highest power dominates the shape of the graph.

Below is the formula for polynomial regression (here up to degree n):

Y = β₀ + β₁·X + β₂·X² + … + βₙ·Xⁿ

In the previous regression models, we used the scikit-learn library for the implementation. Here, we are going to use the Normal Equation instead. Note that we could also use scikit-learn to implement polynomial regression, but this method gives us insight into how it works under the hood.

The Normal Equation goes as follows:

θ = (XᵀX)⁻¹ · XᵀY

In the equation above:

θ: the hypothesis parameters that define the best fit.

X: the input feature values of each instance.

Y: the output value of each instance.

1.3.1 Hypothesis Function for Polynomial Regression

The main matrix in the standard equation:
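Before the step-by-step walkthrough, here is a compact sanity check: the sketch below builds the [1, x, x², x³] matrix on synthetic data and confirms that θ = (XᵀX)⁻¹XᵀY matches NumPy's independent `np.polyfit` least-squares solver:

```python
# Normal Equation vs. np.polyfit on synthetic cubic data.
import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(-3, 3, 50)
y = 2 + 0.5 * x - 1.5 * x ** 2 + 0.25 * x ** 3 + 0.1 * rng.normal(size=50)

# Design matrix with columns [1, x, x^2, x^3]:
X = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])

# theta = (X^T X)^-1 X^T y
theta = np.linalg.inv(X.T @ X) @ (X.T @ y)

# polyfit solves the same least-squares problem (highest power first):
reference = np.polyfit(x, y, 3)[::-1]
print(np.allclose(theta, reference))  # True
```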

Step by step implementation in Python:

a. Import the required libraries:

b. Generate the data points:

We are going to generate a dataset for implementing our polynomial regression.

c. Initialize x,x²,x³ vectors:

We are taking the maximum power of x as 3. So our X matrix will have X, X², X³.

d. Column-1 of X matrix:

The 1st column of the main matrix X will always be 1 because it holds the coefficient of beta_0.

e. Form the complete x matrix:

Look at the matrix X at the start of this implementation. We are going to create it by appending vectors.

f. Transpose of the matrix:

We are going to calculate the value of theta step-by-step. First, we need to find the transpose of the matrix.

g. Matrix multiplication:

After finding the transpose, we need to multiply it with the original matrix. Keep in mind that we are going to implement it with a normal equation, so we have to follow its rules.

h. The inverse of a matrix:

Finding the inverse of the matrix and storing it in temp1.

i. Matrix multiplication:

Finding the multiplication of transposed X and the Y vector and storing it in the temp2 variable.

j. Coefficient values:

To find the coefficient values, we need to multiply temp1 and temp2. See the Normal Equation formula.

k. Store the coefficients in variables:

Storing those coefficient values in different variables.

l. Plot the data with curve:

Plotting the data with the regression curve.

m. Prediction function:

Now we are going to predict the output using the regression curve.

n. Error function:

Calculate the error using mean squared error function.

o. Calculate the error:

Putting it all together:

# Import required libraries:
import numpy as np
import matplotlib.pyplot as plt

# Generate the data points:
np.random.seed(0)
x = np.arange(-5, 5, 0.1)
y = x ** 3 + x ** 2 + x + 20 * np.random.normal(size=len(x))

# Build the X matrix with columns [1, x, x^2, x^3]:
n = len(x)
x1, x2, x3 = x, x ** 2, x ** 3
x_new = np.column_stack([np.ones(n), x1, x2, x3])

# Normal Equation: theta = (X^T X)^-1 X^T y
x_new_transpose = np.transpose(x_new)
temp1 = np.linalg.inv(x_new_transpose.dot(x_new))
temp2 = x_new_transpose.dot(y)
theta = temp1.dot(temp2)

# Store the coefficients in variables:
beta_0, beta_1, beta_2, beta_3 = theta

# Prediction function:
def prediction(x1, x2, x3, beta_0, beta_1, beta_2, beta_3):
    y_pred = beta_0 + beta_1 * x1 + beta_2 * x2 + beta_3 * x3
    return y_pred

# Making predictions:
pred = prediction(x1, x2, x3, beta_0, beta_1, beta_2, beta_3)

# Calculate the accuracy of the model (mean squared error):
def err(y_pred, y):
    var = (y - y_pred) ** 2
    MSE = var.sum() / len(var)
    return MSE

# Calculating the error:
error = err(pred, y)
print(error)

1.4 Exponential Regression:

Sometimes our dataset follows a trend that increases slowly at first, but as time goes on, the rate of increase grows exponentially. That is when we can use exponential regression.

Some real-life examples of exponential growth:

1. Microorganisms in cultures.

2. Spoilage of food.

3. Human Population.

4. Compound Interest.

5. Pandemics (Such as Covid-19).

6. Ebola Epidemic.

7. Invasive Species.

8. Fire.

9. Cancer Cells.

10. Smartphone Uptake and Sale.

The formula for exponential regression is as follows:

Y = a·bˣ

In this case, we are going to use SciPy's curve_fit to find the coefficient values a and b.

Step by step implementation in Python

a. Import the required libraries:

b. Insert the data points:

c. Implement the exponential function algorithm:

d. Apply optimal parameters and covariance:

Here we use curve_fit to find the optimal parameter values. It returns two variables, popt and pcov.

popt stores the optimal parameter values, and pcov stores their covariance. We can see that the popt variable has two values; those are our optimal parameters, which we use to plot our best-fit curve, as shown below.

e. Plot the data:

Plotting the data with the coefficients found.

f. Check the accuracy of the model:

Check the accuracy of the model with r2_score.

Putting it all together:

# Import required libraries:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score

# Dataset values:
day = np.arange(0, 8)
weight = np.array([251, 209, 157, 129, 103, 81, 66, 49])

# Exponential function:
def expo_func(x, a, b):
    return a * b ** x

# popt: optimal values for the parameters
# pcov: the estimated covariance of popt
popt, pcov = curve_fit(expo_func, day, weight)
weight_pred = expo_func(day, popt[0], popt[1])

# Plotting the data:
plt.plot(day, weight_pred, 'r-')
plt.scatter(day, weight, label='Day vs Weight')
plt.title("Day vs Weight a*b^x")
plt.xlabel('Day')
plt.ylabel('Weight')
plt.legend()
plt.show()

# Check the accuracy of the model:
print("R²:", r2_score(weight, weight_pred))

# Equation:
a = popt[0].round(4)
b = popt[1].round(4)
print(f'The equation of the regression line is y={a}*{b}^x')
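Since the tutorial's weight data decays rather than grows, it may help to see curve_fit recover known parameters. In this sketch the data is generated from a = 250 and b = 0.8, and the initial guess `p0` is an added assumption that helps the optimizer converge:

```python
# curve_fit recovering known parameters of y = a * b**x.
import numpy as np
from scipy.optimize import curve_fit

def expo_func(x, a, b):
    return a * b ** x

x = np.arange(0, 8)
y = 250.0 * 0.8 ** x  # synthetic data with known a=250, b=0.8

# p0 is a rough initial guess; curve_fit refines it to the true values:
popt, pcov = curve_fit(expo_func, x, y, p0=[100.0, 0.9])
print(popt.round(3))  # approximately [250. 0.8]
```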

1.5 Sinusoidal Regression:

Some real-life examples of sinusoidal regression:

Generation of music waves.

Sound travels in waves.

Trigonometric functions in constructions.

Used in space flights.

GPS location calculations.

Architecture.

Electrical current.

Radio broadcasting.

Low and high tides of the ocean.

Buildings.

Sometimes we have data that shows patterns like a sine wave. In such scenarios, we use sinusoidal regression. The model takes the form:

Y = A·sin(B(X + C)) + D

where A is the amplitude, 2π/B is the period (the length of one cycle), C is the phase shift, and D is the vertical shift.

Step by step implementation in Python:

a. Generating the dataset:

b. Applying a sine function:

Here we have created a function called “calc_sine” to calculate the value of output based on optimal coefficients. Here we will use the scikit-learn library to find the optimal parameters.

c. Why does a sinusoidal regression perform better than linear regression?

If we check the accuracy of a model after fitting our data with a straight line, we can see that its prediction accuracy is lower than that of the sine-wave regression. That is why we use sinusoidal regression for data with this shape.

Putting it all together:

# Import required libraries:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score

# Generating the dataset:
# Y = A*sin(B(X + C)) + D
# A = Amplitude
# Period = 2*pi/B (the length of one cycle)
# C = Phase shift (in radians)
# D = Vertical shift

X = np.linspace(0, 1, 100)  # (start, end, points)

# Here: A = 1, B = 2*pi (so the period is 1), C = 0, D = 0
Y = 1 * np.sin(2 * np.pi * X)

# Adding some noise:
Noise = 0.4 * np.random.normal(size=100)
Y_data = Y + Noise
plt.scatter(X, Y_data, c="r")

# Function to calculate the value:
def calc_sine(x, a, b, c, d):
    return a * np.sin(b * (x + np.radians(c))) + d

# Find the optimal parameters for the sine curve; an initial guess
# near the generating parameters helps curve_fit find the frequency:
popt, pcov = curve_fit(calc_sine, X, Y_data, p0=[1, 2 * np.pi, 0, 0])

# Plot the main data:
plt.scatter(X, Y_data)
# Plot the best-fit curve:
plt.plot(X, calc_sine(X, *popt), c="r")

# Check the accuracy:
Accuracy = r2_score(Y_data, calc_sine(X, *popt))
print(Accuracy)

# Now fit a straight line for comparison.
# Function to calculate the value:
def calc_line(X, m, b):
    return b + X * m

# curve_fit returns the optimized parameters for our function:
# popt stores the optimal parameters
# pcov stores the covariance between the parameters
popt, pcov = curve_fit(calc_line, X, Y_data)

# Plot the main data:
plt.scatter(X, Y_data)
# Plot the best-fit line:
plt.plot(X, calc_line(X, *popt), c="r")

# Check the accuracy of the model:
Accuracy = r2_score(Y_data, calc_line(X, *popt))
print("Accuracy of Linear Model:", Accuracy)
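As a compact, self-contained recap of the sine-versus-line comparison (with a fixed random seed so it is reproducible; the `p0` guess is an added assumption to help the optimizer):

```python
# Fit a sine curve and a straight line to noisy sine data; compare R².
import numpy as np
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score

def calc_sine(x, a, b, c, d):
    return a * np.sin(b * x + c) + d

def calc_line(x, m, b):
    return m * x + b

rng = np.random.RandomState(0)
X = np.linspace(0, 1, 100)
Y_data = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=100)

# Initial guess near one full cycle helps curve_fit find the frequency:
popt_sine, _ = curve_fit(calc_sine, X, Y_data, p0=[1, 2 * np.pi, 0, 0])
popt_line, _ = curve_fit(calc_line, X, Y_data)

r2_sine = r2_score(Y_data, calc_sine(X, *popt_sine))
r2_line = r2_score(Y_data, calc_line(X, *popt_line))
print(round(r2_sine, 2), round(r2_line, 2))  # the sine model's R² is much higher
```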

1.6 Logarithmic Regression:

Some real-life examples of logarithmic growth:

The magnitude of earthquakes.

The intensity of sound.

The acidity of a solution.

The pH level of solutions.

Yields of chemical reactions.

Production of goods.

Growth of infants.

A COVID-19 graph.

Sometimes we have data that grows quickly at first but flattens out after a certain point. In such a case, we can use logarithmic regression, which models the output as Y = a + b·ln(X).
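Before the manual normal-equation walkthrough below, note that SciPy's curve_fit can also recover a and b in Y = a + b·ln(X) directly; here is a quick sketch on synthetic data with known coefficients:

```python
# curve_fit on the logarithmic model Y = a + b*ln(X).
import numpy as np
from scipy.optimize import curve_fit

def log_func(x, a, b):
    return a + b * np.log(x)

X = np.arange(1, 50, 0.5)
Y = 10 + 2 * np.log(X)  # synthetic data with known a=10, b=2

popt, _ = curve_fit(log_func, X, Y)
print(popt.round(2))  # approximately [10. 2.]
```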

Step by step implementation in Python:

a. Import required libraries:

b. Generating the dataset:

c. The first column of our matrix X :

Here we will use our normal equation to find the coefficient values.

d. Reshaping X:

e. Going with the Normal Equation formula:

f. Forming the main matrix X:

g. Finding the transpose matrix:

h. Performing matrix multiplication:

i. Finding the inverse:

j. Matrix multiplication:

k. Finding the coefficient values:

l. Plot the data with the regression curve:

m. Accuracy:

Putting it all together:

# Import required libraries:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

# Dataset:
# Y = a + b*ln(X)
X = np.arange(1,50,0.5)
Y = 10 + 2*np.log(X)

#Adding some noise to calculate error!
Y_noise = np.random.rand(len(Y))
Y = Y +Y_noise
plt.scatter(X,Y)

# 1st column of our X matrix should be 1:
n = len(X)
x_bias = np.ones((n,1))

print (X.shape)
print (x_bias.shape)

# Reshaping X :
X = np.reshape(X,(n,1))
print (X.shape)

# Going with the formula:
# Y = a + b*ln(X)
X_log = np.log(X)

# Append the X_log to X_bias:
x_new = np.append(x_bias,X_log,axis=1)

# Transpose of a matrix:
x_new_transpose = np.transpose(x_new)

# Matrix multiplication:
x_new_transpose_dot_x_new = x_new_transpose.dot(x_new)

# Finding the inverse:
temp1 = np.linalg.inv(x_new_transpose_dot_x_new)

# Matrix multiplication:
temp2 = x_new_transpose.dot(Y)

# Finding the coefficient values:
theta = temp1.dot(temp2)
a, b = theta

# Plot the data with the regression curve:
Y_pred = a + b * np.log(X.flatten())
plt.scatter(X, Y)
plt.plot(X, Y_pred, c="r")
plt.show()

# Accuracy:
print("R²:", r2_score(Y, Y_pred))

Ensuring Success Starting a Career in Machine Learning (ML)
https://towardsai.net/p/machine-learning/moocs-vs-academia-ensuring-success-starting-in-a-machine-learning-ml-career-304b2e42315e
Published: Wed, 27 May 2020

Author(s): Roberto Iriondo

Machine learning (ML) careers in industry and academia are in such high demand, how do you assure you can succeed in such a competitive…

Best Machine Learning Blogs to Follow in 2020
https://towardsai.net/p/machine-learning/best-machine-learning-blogs-6730ea2df3bd
Published: Wed, 06 May 2020

Keep up with the best and the latest machine learning (ML) research blogs through reliable sources.

From researchers to students, industry experts, and machine learning (ML) enthusiasts, keeping up with the best and the latest machine learning research is a matter of finding reliable sources of scientific work. While blogs usually publish in a more informal and conversational style, we have found that the sources in this list are accurate, resourceful, and reliable sources of machine learning research, fit for anyone interested in learning more about the scientific field of ML.

Please know that the blogs listed below are by no means ranked or in a particular order. They are all incredible sources of machine learning research. Please let us know in the comments if you know of any other reliable blog sources in machine learning.

The machine learning blog at Carnegie Mellon University, ML@CMU, provides an accessible, general-audience medium for researchers to communicate research findings, perspectives on the field of machine learning, and various updates, both to experts and the general audience. Posts are from students, postdocs, and faculty at Carnegie Mellon [1].

Distill is an academic journal in the area of machine learning. The distinguishing trait of a Distill article is outstanding communication and a dedication to human understanding. Distill articles often, but not always, use interactive media. Most articles (if not all) published at Distill often take 100+ hours for publishing [2].

Google AI conducts research that advances the state-of-the-art in the field. Google AI (or Google.ai) is a division of Google dedicated solely to artificial intelligence. It was announced at Google’s conference I/O 2017 by CEO Sundar Pichai [3]. The Google AI blog has a section specifically for machine learning research [4].

The BAIR blog provides an accessible, general-audience medium for researchers to communicate research findings, perspectives on the field, and various updates. Posts are from students, postdocs, and faculty in BAIR, and intends to provide a relevant and timely discussion of research findings and results, both to experts and the general audience [5].

OpenAI is a research laboratory based in San Francisco, California. Their mission is to ensure that artificial general intelligence benefits all of humanity [8]. The OpenAI blog brings state-of-the-art research in the field. Their mission is to discover and enact the path to safe artificial general intelligence (AGI) [8].

The Machine Learning (Theory) blog is an experiment in the application of a blog to academic research in machine learning and learning theory by machine learning researcher John Langford [6]. He has emphasized that the field of machine learning “is shifting from an academic discipline to an industrial tool” [7].

DeepMind works on some of the most complex and exciting challenges in AI. Their world-class research has resulted in hundreds of peer-reviewed papers, including in Nature and Science [9].

MIT often produces state-of-the-art research in the field of machine learning. This filtered news stream provides the latest news and research on what’s happening in the field of machine learning at MIT.

Christopher Olah describes himself as a wandering machine learning researcher, looking to understand things clearly and explain them well [10]. Olah is a researcher at OpenAI, and formerly at Google AI. His blog has very complete and exciting articles for the machine learning researcher and enthusiast: a gold mine of free, open machine learning research.

Facebook AI is known for working on state-of-the-art research in the field. Their research areas focus on computer vision, conversational AI, integrity, NLP, ranking, and recommendations, systems research, machine learning theory, speech, and audio, along with human and machine intelligence. The Facebook AI Blog encompasses excellent content, from blog posts to research publications [12].

Amazon Web Services (AWS) is one of the most widely used cloud services around the world. They offer reliable, scalable, and accessible cloud computing services. Their research team publishes blog posts on state-of-the-art machine learning research and ML applications on the AWS blog [11].

If you happen to know of any other reliable machine learning blogs, please let me know in the comments. Thank you for reading!

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University, nor other companies (directly or indirectly) associated with the author(s). These writings do not intend to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

What is Machine Learning?
https://towardsai.net/p/machine-learning/what-is-machine-learning-ml-b58162f97ec7
Published: Tue, 30 Apr 2019

Demystifying Machine Learning | Part I

Learn what is machine learning, how it works and its importance in five minutes

April 30, 2019, by Roberto Iriondo — Last updated: May 15, 2019

Who should read this article?

Anyone who is curious and wants a truly simple, yet accurate, overview of what machine learning is, how it works, and why it matters. We will go through each of the pertinent questions raised above, distilling technical definitions from machine learning pioneers and industry leaders to present a simple introduction to the amazing scientific field of machine learning.

Glossary of terms can be found at the bottom of the article, along with a small set of resources for further learning, references, and disclosures.

If the above applies to you, read on!

What is machine learning?

The scientific field of machine learning (ML) is a branch of artificial intelligence, as defined by Computer Scientist and machine learning pioneer [1] Tom M. Mitchell: “Machine learning is the study of computer algorithms that allow computer programs to automatically improve through experience [2].”

An algorithm can be thought of as a set of rules/instructions that a computer programmer specifies, which a computer is able to process. Simply put, machine learning algorithms learn by experience, similar to how humans do. For example, after having seen multiple examples of an object, a computer employing a machine learning algorithm can recognize that object in new, previously unseen scenarios.

How does machine learning work?

In the video above [3], Head of Facebook AI Research, Yann LeCun simply explains how machine learning works with easy to follow examples. Machine learning utilizes a variety of techniques to intelligently handle large and complex amounts of information to make decisions and/or predictions.

In practice, the patterns that a computer (machine learning system) learns can be very complicated and difficult to explain. Consider searching for dog images on Google search: as seen in the image below, Google is incredibly good at returning relevant results, but how does Google search achieve this task? In simple terms, Google search first gets a large number of examples (an image dataset) of photos labeled “dog”; then the computer (machine learning system) looks for patterns of pixels and colors that help it guess (predict) whether the queried image is indeed a dog.

At first, Google’s computer makes a random guess about which patterns are good for identifying an image of a dog. If it makes a mistake, a set of adjustments is made so that the computer gets it right. In the end, such a collection of patterns is learned by a large computer system modeled after the human brain (a deep neural network) that, once trained, can correctly identify and return accurate results for dog images on Google search, along with anything else you could possibly think of; such a process is called the training phase of a machine learning system.
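The "guess, check, adjust" loop described above can be sketched in a few lines. This toy example (one weight and made-up numbers, not Google's actual system, which uses deep networks with millions of weights) nudges a single parameter until the model learns y = 2x:

```python
# A toy training loop: guess, measure the error, adjust, repeat.
inputs = [1.0, 2.0, 3.0, 4.0]
targets = [2.0, 4.0, 6.0, 8.0]   # the "right answers" (here: y = 2x)

w = 0.5                           # start with a bad initial guess
learning_rate = 0.01

for _ in range(1000):             # the training phase
    for x, y in zip(inputs, targets):
        error = w * x - y                # how wrong is the current guess?
        w -= learning_rate * error * x   # adjust to reduce the mistake

print(round(w, 3))  # close to 2.0 after training
```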

Imagine that you were in charge of building a machine learning prediction system to identify images of dogs and cats. The first step, as explained above, would be to gather a large quantity of images labeled “dog” for dogs and “cat” for cats. Second, we would train the computer to look for patterns in the images in order to identify dogs and cats, respectively.

Once the machine learning model has been trained [7], we can give it (as input) different images to see if it can correctly identify dogs and cats. As seen in the image above, a trained machine learning model can (most of the time) correctly identify such queries.

Why is machine learning important?

Machine learning is incredibly important nowadays. First, because it can solve complicated real-world problems in a scalable way. Second, because it has disrupted a variety of industries within the past decade [9] and will continue to do so in the future, as more industry leaders and researchers specialize in machine learning and take what they have learned to continue their research and/or develop tools that positively impact their own fields. Third, artificial intelligence has the potential to add around 16%, or about $13 trillion, to the US economy by 2030 [18]. The rate at which machine learning is causing positive impact is already impressive [10] [11] [12] [13] [14] [15] [16], made possible by dramatic changes in data storage and computing processing power [17]. As more people become involved, we can only expect this progress to continue across different fields [6].

Future work: In an upcoming article we will discuss the types of machine learning in simple terms, how they are currently being used by academia and industry alike with real-world examples of such.

Acknowledgments:

The author would like to thank Anthony Platanios, Doctoral Researcher with the Machine Learning Department at Carnegie Mellon University for constructive criticism, along with editorial comments in preparation of this article.

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University, nor other companies (directly or indirectly) associated with the author(s). These writings are not intended to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.