Monte Carlo Simulation An In-depth Tutorial with Python
Towards AI — The Best of Tech, Science, and Engineering, https://towardsai.net
Published Thu, 06 Aug 2020 23:51:23 +0000
https://towardsai.net/p/machine-learning/monte-carlo-simulation-an-in-depth-tutorial-with-python-bcf6eb7856c8

An in-depth tutorial on the Monte Carlo Simulation methods and applications with Python

Author(s): Pratik Shukla, Roberto Iriondo

What is the Monte Carlo Simulation?

A Monte Carlo method is a technique that uses random numbers and probability to solve complex problems. The Monte Carlo simulation, or probability simulation, is used to understand the impact of risk and uncertainty in finance, project management, cost estimation, and other forecasting models.

Risk analysis is part of almost every decision we make, as we constantly face uncertainty, ambiguity, and variability in our lives. Moreover, even though we have unprecedented access to information, we cannot accurately predict the future.

The Monte Carlo simulation allows us to see the range of possible outcomes of our decisions and to assess the impact of risk, allowing for better decision making under uncertainty.

In this article, we will go through five different examples to understand the Monte Carlo Simulation methods.

a. Flipping a coin:

For a fair coin, the theoretical probability of getting a head or a tail is 1/2. Next, we are going to verify this result experimentally using the Monte Carlo method.

Python Implementation:

1. Import required libraries:

2. Coin flip function:

3. Checking the output of the function:

4. Main function:

5. Calling the main function:

As shown in figure 8, after 5000 iterations the estimated probability of getting a tail converges to 0.502. Consequently, this is how we can use the Monte Carlo simulation to find probabilities experimentally.
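
The coin-flip experiment described in the steps above can be condensed into a short, self-contained sketch (the function names and the 5000-iteration count follow the article; the seed is an arbitrary choice for reproducibility):

```python
import random

def coin_flip():
    # 0 represents heads, 1 represents tails
    return random.randint(0, 1)

def monte_carlo(n):
    # track the running estimate of P(tails) after each flip
    tails = 0
    probs = []
    for i in range(1, n + 1):
        tails += coin_flip()
        probs.append(tails / i)
    return probs

random.seed(0)  # arbitrary seed, for reproducibility
probs = monte_carlo(5000)
print(probs[-1])  # final estimate, close to 0.5
```

Plotting `probs` against the iteration number reproduces the convergence curve shown in the figures.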

b. Estimating PI using a circle and a square:

To estimate the value of PI, we need the areas of the square and of the circle inscribed in it. To approximate these areas, we randomly place dots on the surface and count how many fall inside the circle and how many fall inside the square. The ratio of the dot counts approximates the ratio of the areas, so we can use the counts in place of the actual areas.

In the following code, we used the turtle module of Python to see the random placement of dots.

Python Implementation:

1. Import required libraries:

2. To visualize the dots:

3. Initialize some required data:

4. Main function:

5. Plot the data:

6. Output:

As shown in figure 17, after 5000 iterations we get a good approximation of the value of PI. Also, notice that the estimation error decreases as the number of iterations increases (roughly in proportion to 1/√N).
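
The same estimate can be computed without the turtle visualization. This is a minimal sketch, sampling points in the square [-1, 1] × [-1, 1] with the unit circle inscribed in it (function name and sample count are illustrative):

```python
import random

def estimate_pi(n):
    inside = 0
    for _ in range(n):
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        if x * x + y * y <= 1:  # dot falls inside the circle
            inside += 1
    # circle area / square area = (pi * r^2) / (2r)^2 = pi / 4,
    # so pi is approximately 4 * (dots in circle) / (total dots)
    return 4 * inside / n

random.seed(1)  # arbitrary seed, for reproducibility
print(estimate_pi(100_000))  # close to 3.14
```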

3. The Monty Hall Problem:

Suppose you are on a game show, and you have the choice of picking one of three doors: behind one door is a car; behind the other doors, goats. You pick a door, let’s say door 1, and the host, who knows what’s behind the doors, opens another door, say door 3, which has a goat. The host then asks you: do you want to stick with your choice or choose another door? [1]

Is it to your advantage to switch your choice of door?

Based on probability, it turns out it is to our advantage to switch the doors. Let’s find out how:

Initially, for all three doors, the probability (P) of getting the car is the same (P = 1/3).

Now assume that the contestant chooses door 1. The host then opens door 3, which has a goat, and asks the contestant whether he/she wants to switch doors.

We will see why it is more advantageous to switch the door:

In figure 19, we can see that after the host opens door 3, the combined probability that the car is behind door 2 or door 3 is still 2/3. Since we now know that door 3 has a goat, the entire 2/3 probability shifts to door 2. Hence, it is more advantageous to switch doors.

Now we are going to use the Monte Carlo Method to perform this test case many times and find out its probabilities in an experimental way.

Python Implementation:

1. Import required libraries:

2. Initialize some data:

3. Main function:

4. Calling the main function:

5. Output:

In figure 24, we show that after 1000 iterations, the winning probability if we switch doors is 0.669. Therefore, we can be confident that switching doors works to our advantage in this example.
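
The simulation can be sketched as follows; it relies on the observation that a player who switches wins exactly when the initial pick was wrong (seed and trial count are arbitrary choices):

```python
import random

def monty_hall(trials, switch):
    wins = 0
    for _ in range(trials):
        car = random.randint(0, 2)   # door hiding the car
        pick = random.randint(0, 2)  # contestant's initial choice
        # The host always opens a goat door that is neither the pick
        # nor the car, so switching wins exactly when pick != car.
        if switch:
            wins += (pick != car)
        else:
            wins += (pick == car)
    return wins / trials

random.seed(2)  # arbitrary seed, for reproducibility
print(monty_hall(10_000, switch=True))   # close to 2/3
print(monty_hall(10_000, switch=False))  # close to 1/3
```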

4. Buffon’s Needle Problem:

The French nobleman Georges-Louis Leclerc, Comte de Buffon, posed the following problem in 1777 [2] [3].

Suppose that we drop a short needle on a ruled paper — what would be the probability that the needle comes to lie in a position where it crosses one of the lines?

The probability depends on the distance (d) between the lines of the ruled paper and on the length (l) of the needle that we drop, or rather, on the ratio l/d. For this example, we consider a short needle, with l ≤ d, so that the needle cannot cross two different lines at the same time. Surprisingly, the answer to Buffon’s needle problem involves PI.

Here we are going to use the solution of Buffon’s needle problem to estimate the value of PI experimentally using the Monte Carlo Method. However, before going into that, we are going to show how the solution derives, making it more interesting.

Theorem:

If a short needle, of length l, is dropped on a paper that is ruled with equally spaced lines of distance d ≥ l, then the probability that the needle comes to lie in a position where it crosses one of the lines is:

P = (2/π) · (l/d)

Proof:

Next, we need to count the number of needles that cross any of the vertical lines. For a specific value of theta, the following are the maximum and minimum positions for which a needle can intersect with a vertical line.

1. Maximum Possible Value:

2. Minimum Possible Value:

Therefore, for a specific value of theta, the probability for a needle to lie on a vertical line is:

The above probability formula is limited to a single value of theta; in our experiment, theta ranges from 0 to pi/2. Next, we find the actual probability by integrating over all values of theta.

Estimating PI using Buffon’s needle problem:

Next, we are going to use the above formula to find out the value of PI experimentally.

Now, notice that we have the values for l and d. Our goal is to find the value of P first so that we can get the value of PI. To find the probability P, we need the count of hit needles and the count of total needles. Since we already have the total count, the only thing we require now is the count of hit needles.

Below is the visual representation of how we are going to calculate the count of hit needles.

Python Implementation:

1. Import required libraries:

2. Main function:

3. Calling the main function:

4. Output:

As shown in figure 37, after 100 iterations we obtain a close approximation of PI using the Monte Carlo method.
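
The needle-drop simulation itself fits in a few lines. This sketch samples the distance from the needle’s center to the nearest line and the acute angle the needle makes with the lines (l = 1 and d = 2 are arbitrary choices satisfying l ≤ d):

```python
import math
import random

def buffon_pi(drops, l=1.0, d=2.0):
    hits = 0
    for _ in range(drops):
        y = random.uniform(0, d / 2)            # center-to-nearest-line distance
        theta = random.uniform(0, math.pi / 2)  # acute angle with the lines
        if y <= (l / 2) * math.sin(theta):      # the needle crosses a line
            hits += 1
    # P(hit) = 2l / (pi * d), so pi is approximately 2l * drops / (d * hits)
    return (2 * l * drops) / (d * hits)

random.seed(3)  # arbitrary seed, for reproducibility
print(buffon_pi(100_000))  # close to 3.14
```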

5. Why Does the House Always Win?

How do casinos earn money? The trick is straightforward — “The more you play, the more they earn.” Let us take a look at how this works with a simple Monte Carlo Simulation example.

Consider an imaginary game in which a player has to choose a chip from a bag of chips.

Rules:

The bag contains chips numbered from 1 to 100.

Users can bet on even or odd chips.

In this game, 10 and 11 are special numbers. If we bet on evens, then 10 will be counted as an odd number, and if we bet on odds, then 11 will be counted as an even number.

If we bet on even numbers and we get 10 then we lose.

If we bet on odd numbers and we get 11 then we lose.

If we bet on odds, the probability that we win is 49/100, and the probability that the house wins is 51/100. Therefore, for an odd bet, the house edge is 51/100 − 49/100 = 2/100 = 0.02 = 2%.

If we bet on evens, the probability that we win is likewise 49/100, and the probability that the house wins is 51/100. Hence, for an even bet, the house edge is also 51/100 − 49/100 = 2/100 = 0.02 = 2%.

In summary, for every $1 bet, $0.02 goes to the house. In comparison, the house edge on roulette with a single zero is about 2.7%. Consequently, you would have a slightly better chance of winning at our imaginary game than at roulette.

Python Implementation:

1. Import required libraries:

2. Player’s bet:

3. Main function:

4. Final output:

5. Running it for 1000 iterations:

6. Number of bets = 5:

7. Number of bets = 10:

8. Number of bets = 1000:

9. Number of bets = 5000:

10. Number of bets = 10000:

From the above experiment, we can see that the player has a better chance of ending up with a profit if they place fewer bets in this game. In some scenarios, we get negative numbers, which means that the player lost all of their money and accumulated debt instead of making a profit.
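
The chip game can be sketched as follows; averaging the profit per bet over many simulated players recovers the 2% house edge (function names, seed, and counts are illustrative):

```python
import random

def play(bet_on_even, total_bets, bet_size=1):
    # returns the player's net profit (negative = loss)
    funds = 0
    for _ in range(total_bets):
        chip = random.randint(1, 100)
        if bet_on_even:
            win = chip % 2 == 0 and chip != 10  # 10 counts against even bets
        else:
            win = chip % 2 == 1 and chip != 11  # 11 counts against odd bets
        funds += bet_size if win else -bet_size
    return funds

random.seed(4)  # arbitrary seed, for reproducibility
results = [play(True, 1000) for _ in range(500)]
print(sum(results) / (500 * 1000))  # average profit per bet, about -0.02
```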

Please keep in mind that these percentages are for our figurative game and they can be modified.

Conclusion:

Like with any forecasting model, the simulation will only be as good as the estimates we make. It is important to remember that the Monte Carlo Simulation only represents probabilities and not certainty. Nevertheless, the Monte Carlo simulation can be a valuable tool when forecasting an unknown future.

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University. These writings do not intend to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

For attribution in academic contexts, please cite this work as:

Shukla, et al., “Monte Carlo Simulation An In-depth Tutorial with Python”, Towards AI, 2020

BibTex citation:

@article{pratik_iriondo_2020,
  title={Monte Carlo Simulation An In-depth Tutorial with Python},
  url={https://towardsai.net/monte-carlo-simulation},
  journal={Towards AI},
  publisher={Towards AI Co.},
  author={Shukla, Pratik and Iriondo, Roberto},
  year={2020},
  month={Aug}
}

Tutorial on the basics of natural language processing (NLP) with sample code implementation in Python

In this article, we explore the basics of natural language processing (NLP) with code examples. We dive into the natural language toolkit (NLTK) library to present how it can be useful for natural language processing related-tasks. Afterward, we will discuss the basics of other Natural Language Processing libraries and other essential methods for NLP, along with their respective coding sample implementations in Python.

Computers and machines are great at working with tabular data or spreadsheets. However, human beings generally communicate in words and sentences, not in the form of tables, and much of the information humans speak or write is unstructured, which makes it hard for computers to interpret. In natural language processing (NLP), the goal is to make computers understand unstructured text and retrieve meaningful pieces of information from it. Natural language processing is a subfield of artificial intelligence concerned with the interactions between computers and human language.

Applications of NLP:

Machine Translation.

Speech Recognition.

Sentiment Analysis.

Question Answering.

Summarization of Text.

Chatbot.

Intelligent Systems.

Text Classifications.

Character Recognition.

Spell Checking.

Spam Detection.

Autocomplete.

Named Entity Recognition.

Predictive Typing.

Understanding Natural Language Processing (NLP):

We, as humans, perform natural language processing (NLP) considerably well, but even then, we are not perfect. We often misunderstand one thing for another, and we often interpret the same sentences or words differently.

For instance, consider a sentence such as “I saw a man on the hill with a telescope,” and the many different ways it can be interpreted:

Example 1:

These are some interpretations of the sentence shown above.

There is a man on the hill, and I watched him with my telescope.

There is a man on the hill, and he has a telescope.

I’m on a hill, and I saw a man using my telescope.

I’m on a hill, and I saw a man who has a telescope.

There is a man on a hill, and I saw him with my telescope.

Example 2:

In the sentence above, there are two “can” words, but they have different meanings. The first “can” is used for question formation, while the second “can,” at the end of the sentence, refers to a container that holds food or liquid.

Hence, from the examples above, we can see that language processing is not “deterministic” (the same sentence does not have a single fixed interpretation), and an interpretation suitable to one person might not be suitable to another. Therefore, natural language processing follows a non-deterministic approach. In other words, NLP can be used to create intelligent systems that understand how humans interpret language in different situations.

Rule-based NLP vs. Statistical NLP:

Natural Language Processing is divided into two different approaches:

Rule-based Natural Language Processing:

It uses common sense reasoning for processing tasks. For instance, the freezing temperature can lead to death, or hot coffee can burn people’s skin, along with other common sense reasoning tasks. However, this process can take much time, and it requires manual effort.

Statistical Natural Language Processing:

It uses large amounts of data and tries to derive conclusions from it. Statistical NLP uses machine learning algorithms to train NLP models. After successful training on large amounts of data, the trained model will have positive outcomes with deduction.

Comparison:

Components of Natural Language Processing (NLP):

a. Lexical Analysis:

With lexical analysis, we divide a whole chunk of text into paragraphs, sentences, and words. It involves identifying and analyzing words’ structure.

b. Syntactic Analysis:

Syntactic analysis involves the analysis of words in a sentence for grammar and arranging words in a manner that shows the relationship among the words. For instance, the sentence “The shop goes to the house” does not pass.

c. Semantic Analysis:

Semantic analysis draws the exact meaning for the words, and it analyzes the text meaningfulness. Sentences such as “hot ice-cream” do not pass.

d. Discourse Integration:

Discourse integration takes into account the context of the text. The meaning of a sentence may depend on the sentences that come before it. For example, in “He works at Google,” the pronoun “he” must be resolved from a preceding sentence.

e. Pragmatic Analysis:

Pragmatic analysis deals with overall communication and interpretation of language. It deals with deriving meaningful use of language in various situations.

Natural Language Processing Libraries:

The NLTK Python framework is generally used as an education and research tool. It is not usually used in production applications. However, it can be used to build exciting programs due to its ease of use.

spaCy is an open-source natural language processing Python library designed to be fast and production-ready. spaCy focuses on providing software for production usage.

Gensim is an NLP Python framework generally used in topic modeling and similarity detection. It is not a general-purpose NLP library, but it handles tasks assigned to it very well.

Pattern is an NLP Python framework with straightforward syntax. It’s a powerful tool for scientific and non-scientific tasks. It is highly valuable to students.

TextBlob is a Python library designed for processing textual data.

Features:

Part-of-Speech tagging.

Noun phrase extraction.

Sentiment analysis.

Classification.

Language translation.

Parsing.

Wordnet integration.

Use-cases:

Sentiment Analysis.

Spelling Correction.

Translation and Language Detection.

For this tutorial, we are going to focus more on the NLTK library. Let’s dig deeper into natural language processing by making some examples.

Exploring Features of NLTK:

a. Open the text file for processing:

First, we are going to open and read the file which we want to analyze.

Next, notice that the data type of the text file read is a String. The number of characters in our text file is 675.

b. Import required libraries:

For various data processing cases in NLP, we need to import some libraries. In this case, we are going to use NLTK for Natural Language Processing. We will use it to perform various operations on the text.

c. Sentence tokenizing:

By tokenizing the text with sent_tokenize( ), we can get the text as sentences.

In the example above, we can see the entire text of our data is represented as sentences and also notice that the total number of sentences here is 9.

d. Word tokenizing:

By tokenizing the text with word_tokenize( ), we can get the text as words.

Next, we can see the entire text of our data is represented as words and also notice that the total number of words here is 144.

e. Find the frequency distribution:

Let’s find out the frequency of words in our text.

Notice that the most used words are punctuation marks and stopwords. We will have to remove such words to analyze the actual text.

f. Plot the frequency graph:

Let’s plot a graph to visualize the word distribution in our text.

In the graph above, notice that a period “.” is used nine times in our text. Analytically speaking, punctuation marks are not that important for natural language processing. Therefore, in the next step, we will be removing such punctuation marks.

g. Remove punctuation marks:

Next, we are going to remove the punctuation marks as they are not very useful for us. We are going to use isalpha( ) method to separate the punctuation marks from the actual text. Also, we are going to make a new list called words_no_punc, which will store the words in lower case but exclude the punctuation marks.

As shown above, all the punctuation marks are excluded from our text. We can also cross-check this with the word count.

h. Plotting graph without punctuation marks:

Notice that we still have many words that are not very useful for analyzing our sample text, such as “and,” “but,” “so,” and others. Next, we need to remove such stopwords.

i. List of stopwords:

j. Removing stopwords:

k. Final frequency distribution:

As shown above, the final graph has many useful words that help us understand what our sample data is about, showing how essential it is to perform data cleaning on NLP.
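
The whole cleaning pipeline (tokenize, drop punctuation, drop stopwords, count frequencies) can be sketched with the standard library alone; the sample text and the tiny stopword list below are illustrative stand-ins for the article’s text file and NLTK’s stopword corpus:

```python
import string
from collections import Counter

text = "The quick brown fox jumps over the lazy dog. The dog barks, and the fox runs away."

# crude tokenizer: split on whitespace, strip punctuation, lowercase
words = [w.strip(string.punctuation).lower() for w in text.split()]
words_no_punc = [w for w in words if w.isalpha()]

# tiny hand-picked stopword list (NLTK ships a much larger one)
stopwords = {"the", "and", "over", "a", "an", "is"}
clean_words = [w for w in words_no_punc if w not in stopwords]

fdist = Counter(clean_words)
print(fdist.most_common(3))  # "fox" and "dog" lead with 2 occurrences each
```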

Next, we will cover various topics in NLP with coding examples.

Word Cloud:

Word Cloud is a data visualization technique in which the words from a given text are displayed on a chart. More frequent or essential words appear in a larger, bolder font, while less frequent or essential words appear in smaller or thinner fonts. It is a helpful NLP technique that gives us a glance at which terms dominate the text to be analyzed.

Properties:

font_path: It specifies the path for the fonts we want to use.

width: It specifies the width of the canvas.

height: It specifies the height of the canvas.

min_font_size: It specifies the smallest font size to use.

max_font_size: It specifies the largest font size to use.

font_step: It specifies the step size for the font.

max_words: It specifies the maximum number of words on the word cloud.

stopwords: Our program will eliminate these words.

background_color: It specifies the background color for canvas.

normalize_plurals: It removes the trailing “s” from words.

As shown in the graph above, the most frequent words display in larger fonts. The word cloud can be displayed in any shape or image.

For instance: In this case, we are going to use the following circle image, but we can use any shape or any image.

Word Cloud Python Implementation:

As shown above, the word cloud is in the shape of a circle. As we mentioned before, we can use any shape or image to form a word cloud.

Word Cloud Advantages:

They are fast.

They are engaging.

They are simple to understand.

They are casual and visually appealing.

Word Cloud Disadvantages:

They perform poorly on unclean data.

They lack the context of words.

Stemming:

We use stemming to normalize words. In English and many other languages, a single word can take multiple forms depending on the context in which it is used. For instance, the verb “study” can take many forms, like “studies,” “studying,” and “studied,” depending on its context. When we tokenize words, an interpreter considers these input words as different words even though their underlying meaning is the same. Since NLP is concerned with the meaning of content, stemming helps resolve this problem.

Stemming normalizes a word by truncating it to its stem. For example, the words “studies,” “studied,” and “studying” are all reduced to “studi,” so that all these word forms refer to a single token. Notice that stemming may not give us a grammatical dictionary word for a particular set of words.

Let’s take an example:

a. Porter’s Stemmer Example 1:

In the code snippet below, we show that all the words are truncated to their stem words. However, notice that a stemmed word is not necessarily a dictionary word.

b. Porter’s Stemmer Example 2:

In the code snippet below, many of the words after stemming did not end up being a recognizable dictionary word.

c. SnowballStemmer:

SnowballStemmer generates the same output as the Porter stemmer, but it supports many more languages.
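
The idea behind suffix stripping can be sketched with a toy stemmer. This is NOT Porter’s algorithm, just an illustration of truncating words to a shared stem; a real stemmer like NLTK’s PorterStemmer applies many more rules (for instance, it also reduces “studying” to “studi,” where this toy version stops at “study”):

```python
def naive_stem(word):
    # toy suffix-stripping stemmer: strip a known suffix if at least
    # 3 characters remain as the stem
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            if suffix == "ies":
                return word[:-3] + "i"  # studies -> studi
            return word[:-len(suffix)]
    return word

for w in ("studies", "studied", "studying", "study"):
    print(w, "->", naive_stem(w))
```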

Lemmatization:

Lemmatization tries to achieve a similar base “stem” for a word. However, what makes it different is that it finds the dictionary word instead of truncating the original word. Stemming does not consider the context of the word, which is why it generates results faster but is less accurate than lemmatization.

If accuracy is not the project’s final goal, then stemming is an appropriate approach. If higher accuracy is crucial and the project is not on a tight deadline, then the best option is lemmatization (lemmatization has a lower processing speed compared to stemming).

Lemmatization takes into account Part of Speech (PoS) values, and it may generate different outputs for different values of PoS. We generally have four choices for PoS: noun (n), verb (v), adjective (a), and adverb (r).

Difference between Stemmer and Lemmatizer:

a. Stemming:

Notice how, with stemming, the word “studies” gets truncated to “studi.”

b. Lemmatizing:

During lemmatization, the word “studies” is mapped to its dictionary form, “study.”

Python Implementation:

a. A basic example demonstrating how a lemmatizer works

In the following example, we are taking the PoS tag as “verb,” and when we apply the lemmatization rules, it gives us dictionary words instead of truncating the original word:

b. Lemmatizer with default PoS value

The default value of PoS in lemmatization is a noun(n). In the following example, we can see that it’s generating dictionary words:

c. Another example demonstrating the power of lemmatizer

d. Lemmatizer with different POS values
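
Conceptually, a lemmatizer is a PoS-aware dictionary lookup. The sketch below uses a tiny hand-made table (the entries are illustrative); NLTK’s WordNetLemmatizer performs the same kind of lookup against the full WordNet database:

```python
# tiny hand-made lemma table: (word, pos) -> dictionary form
LEMMAS = {
    ("studies", "v"): "study",
    ("studies", "n"): "study",
    ("better", "a"): "good",
    ("feet", "n"): "foot",
}

def lemmatize(word, pos="n"):
    # default PoS is noun (n), matching WordNetLemmatizer's default;
    # unknown (word, pos) pairs fall through unchanged
    return LEMMAS.get((word, pos), word)

print(lemmatize("studies", pos="v"))  # study
print(lemmatize("feet"))              # foot
```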

Part of Speech Tagging (PoS tagging):

Why do we need Part of Speech (POS)?

Parts of speech (PoS) tagging is crucial for syntactic and semantic analysis. For something like the sentence above, the word “can” has several semantic meanings: the first “can” is used for question formation, while the second “can,” at the end of the sentence, refers to a container. The first “can” is a verb, and the second “can” is a noun. Assigning each word a specific tag allows the program to handle it correctly in both semantic and syntactic analysis.

Below, please find a list of Part of Speech (PoS) tags with their respective examples:

1. CC: Coordinating Conjunction

2. CD: Cardinal Digit

3. DT: Determiner

4. EX: Existential There

5. FW: Foreign Word

6. IN: Preposition / Subordinating Conjunction

7. JJ: Adjective

8. JJR: Adjective, Comparative

9. JJS: Adjective, Superlative

10. LS: List Marker

11. MD: Modal

12. NN: Noun, Singular

13. NNS: Noun, Plural

14. NNP: Proper Noun, Singular

15. NNPS: Proper Noun, Plural

16. PDT: Predeterminer

17. POS: Possessive Endings

18. PRP: Personal Pronoun

19. PRP$: Possessive Pronoun

20. RB: Adverb

21. RBR: Adverb, Comparative

22. RBS: Adverb, Superlative

23. RP: Particle

24. TO: To

25. UH: Interjection

26. VB: Verb, Base Form

27. VBD: Verb, Past Tense

28. VBG: Verb, Present Participle

29. VBN: Verb, Past Participle

30. VBP: Verb, Present Tense, Not Third Person Singular

31. VBZ: Verb, Present Tense, Third Person Singular

32. WDT: Wh — Determiner

33. WP: Wh — Pronoun

34. WP$ : Possessive Wh — Pronoun

35. WRB: Wh — Adverb

Python Implementation:

a. A simple example demonstrating PoS tagging.

b. A full example demonstrating the use of PoS tagging.
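
To illustrate the data structure a tagger produces, here is a toy lookup-based tagger with a hand-made tag table (illustrative entries only). Note that a fixed table cannot disambiguate the two senses of “can,” which is exactly why NLTK’s pos_tag( ) uses a trained, context-aware model instead:

```python
# toy word -> tag table; real taggers infer tags from context
TAGS = {"can": "MD", "you": "PRP", "open": "VB",
        "this": "DT", "of": "IN", "food": "NN"}

def simple_pos_tag(tokens):
    # unknown words default to NN (noun, singular)
    return [(t, TAGS.get(t.lower(), "NN")) for t in tokens]

print(simple_pos_tag(["Can", "you", "open", "this", "can", "of", "food"]))
# both occurrences of "can" get MD here, exposing the lookup approach's limit
```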

Chunking:

Chunking means extracting meaningful phrases from unstructured text. When a text is tokenized into individual words, it is sometimes hard to infer meaningful information. Chunking works on top of Part of Speech (PoS) tagging: it takes PoS tags as input and provides chunks as output. Chunking groups words together, breaking simple text into phrases that are more meaningful than individual words.

Before working through an example, we need to know what phrases are. Meaningful groups of words are called phrases, and there are five significant categories of phrases:

Noun Phrases (NP).

Verb Phrases (VP).

Adjective Phrases (ADJP).

Adverb Phrases (ADVP).

Prepositional Phrases (PP).

Phrase structure rules:

S(Sentence) → NP VP.

NP → {Determiner, Noun, Pronoun, Proper name}.

VP → V (NP)(PP)(Adverb).

PP → Preposition (NP).

AP → Adjective (PP).

Example:

Python Implementation:

In the following example, we will extract a noun phrase from the text. Before extracting it, we need to define what kind of noun phrase we are looking for, or in other words, we have to set the grammar for a noun phrase. In this case, we define a noun phrase by an optional determiner followed by adjectives and nouns. Then we can define other rules to extract some other phrases. Next, we are going to use RegexpParser( ) to parse the grammar. Notice that we can also visualize the text with the .draw( ) function.

In this example, we can see that we have successfully extracted the noun phrase from the text.
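
The same grammar idea can be sketched without NLTK by matching a regular expression over the PoS tag sequence; this mimics the noun-phrase rule “NP: {&lt;DT&gt;?&lt;JJ&gt;*&lt;NN&gt;}” (the tagged sentence is illustrative):

```python
import re

tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# encode each tag as one character so regex match positions
# map directly back to token indices
code = "".join({"DT": "D", "JJ": "J", "NN": "N"}.get(tag, "O") for _, tag in tagged)

# optional determiner, any adjectives, then one or more nouns
chunks = [[tagged[i][0] for i in range(m.start(), m.end())]
          for m in re.finditer(r"D?J*N+", code)]
print(chunks)  # [['the', 'little', 'yellow', 'dog'], ['the', 'cat']]
```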

Chinking:

Chinking excludes a part from our chunk. There are certain situations where we need to exclude a part of the text from the whole text or chunk; in complex extractions, chunking can output data that is not useful. In such scenarios, we can use chinking to exclude some parts from the chunked text.

In the following example, we take the whole string as a chunk and then exclude the adjectives from it by using chinking. We generally use chinking when we still have a lot of unuseful data even after chunking. Note that chinking grammar is written with inverted curly braces, i.e.:

} write chinking grammar here {

Python Implementation:

From the example above, we can see that the adjectives are separated from the rest of the text.

Named Entity Recognition (NER):

Named entity recognition can automatically scan entire articles and pull out fundamental entities discussed in them, such as people, organizations, places, dates, times, monetary values, and geopolitical entities (GPE).

Use-Cases:

Content classification for news channels.

Summarizing resumes.

Optimizing search engine algorithms.

Recommendation systems.

Customer support.

Commonly used types of named entity:

Python Implementation:

There are two options:

1. binary = True

When the binary value is True, the output only shows whether a particular word is a named entity or not; it does not show any further details about it.

Our graph does not show what type of named entity each word is. It only shows whether a particular word is a named entity or not.

2. binary = False

When the binary value equals False, it shows the types of the named entities in detail.

Our graph now shows what type of named entity it is.

WordNet:

WordNet is a lexical database for the English language, and it is part of the NLTK corpus. We can use WordNet to find the meanings of words, synonyms, antonyms, and more.

a. We can check how many different definitions of a word are available in Wordnet.

b. We can also check the meaning of those different definitions.

c. All details for a word.

d. All details for all meanings of a word.

e. Hypernyms: a hypernym gives us a more abstract term for a word.

f. Hyponyms: a hyponym gives us a more specific term for a word.

g. Get a name only.

h. Synonyms.

i. Antonyms.

j. Synonyms and antonyms.

k. Finding the similarity between words.

Bag of Words:

What is the Bag-of-Words method?

It is a method of extracting essential features from raw text so that we can use them in machine learning models. We call it a “bag” of words because we discard the order in which words occur. A bag-of-words model converts the raw text into words and counts the frequency of each word in the text. In summary, a bag of words is a collection of words that represents a sentence along with the word counts, where the order of occurrence is not relevant.

Raw Text: This is the original text on which we want to perform analysis.

Clean Text: Since our raw text contains unnecessary data like punctuation marks and stopwords, we need to clean it up. Clean text is the text after removing such words.

Tokenize: Tokenization represents the sentence as a group of tokens or words.

Building Vocab: The vocabulary contains all the unique words used in the text after removing unnecessary data.

Generate Vocab: It contains the words along with their frequencies in the sentences.

For instance:

Sentences:

Jim and Pam traveled by bus.

The train was late.

The flight was full. Traveling by flight is expensive.

a. Creating a basic structure:

b. Words with frequencies:

c. Combining all the words:

d. Final model:

Python Implementation:
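
Using the three example sentences above, the bag-of-words model can be sketched with the standard library (scikit-learn’s CountVectorizer automates these same steps):

```python
from collections import Counter

sentences = ["Jim and Pam traveled by bus.",
             "The train was late.",
             "The flight was full. Traveling by flight is expensive."]

def tokenize(text):
    # crude tokenizer: lowercase and strip trailing periods
    return [w.strip(".").lower() for w in text.split()]

# vocabulary: every unique word across all sentences
vocab = sorted({w for s in sentences for w in tokenize(s)})

# each sentence becomes a vector of word counts over the vocabulary
vectors = [[Counter(tokenize(s))[w] for w in vocab] for s in sentences]
print(vocab)
print(vectors)  # e.g., "flight" has count 2 in the third vector
```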

Applications:

Natural language processing.

Information retrieval from documents.

Classifications of documents.

Limitations:

Semantic meaning: It does not consider the semantic meaning of a word. It ignores the context in which the word is used.

Vector size: For large documents, the vector size increases, which may result in higher computational time.

Preprocessing: We need to perform data cleansing on the text before using it.

TF-IDF

TF-IDF stands for Term Frequency — Inverse Document Frequency, which is a scoring measure generally used in information retrieval (IR) and summarization. The TF-IDF score shows how important or relevant a term is in a given document.

The intuition behind TF and IDF:

If a particular word appears multiple times in a document, it might be more important than words that appear fewer times (TF). At the same time, if a word appears many times in a document but is also present in many other documents, it may simply be a frequent word, so we cannot assign much importance to it (IDF). For instance, suppose we have a database of thousands of dog descriptions, and a user wants to search for “a cute dog.” The job of our search engine is to display the closest response to the user’s query. The search engine can use TF-IDF to calculate a score for each description and display the highest-scoring result as the response; this applies when there is no exact match for the query, since an exact match would be displayed first. Let’s suppose there are four descriptions available in our database.

The furry dog.

A cute doggo.

A big dog.

The lovely doggo.

Notice that the first description contains 2 out of 3 words from our user query, the second description contains 1 word from the query, the third description also contains 1 word, and the fourth description contains no words from the user query. As we can sense, the closest answer to our query is description number two, as it contains the essential word “cute” from the user’s query. This is how TF-IDF calculates the value.

Notice that the term frequency values are the same for all of the sentences, since no word repeats within the same sentence. So, in this case, the value of TF will not be instrumental. Next, we are going to use IDF values to get the closest answer to the query. Notice that the word “dog” or “doggo” can appear in many documents, so its IDF value will be very low, and eventually the TF-IDF value will also be low. However, the word “cute” appears relatively few times across the descriptions, which increases its TF-IDF value. So the word “cute” has more discriminative power than “dog” or “doggo.” Our search engine will then find the descriptions that contain the word “cute,” which is what the user was looking for.

Simply put, the higher the TF*IDF score, the rarer and more valuable the term, and vice versa.

Now we are going to take a straightforward example and understand TF-IDF in more detail.

Example:

Sentence 1: This is the first document.

Sentence 2: This document is the second document.

TF: Term Frequency

a. Represent the words of the sentences in the table.

b. Displaying the frequency of words.

c. Calculating TF using a formula.

IDF: Inverse Document Frequency

d. Calculating IDF values from the formula.

e. Calculating TF-IDF.

TF-IDF is the multiplication of TF*IDF.

In this case, notice that the important words that discriminate between the two sentences are “first” in sentence 1 and “second” in sentence 2; as we can see, those words have a relatively higher value than the other words.

However, there are many variations for smoothing out the values for large documents. The most common variation is to use a log value in the IDF. Let’s calculate the TF-IDF value again by using the new IDF value.

f. Calculating IDF value using log.

g. Calculating TF-IDF.

As seen above, “first” and “second” values are important words that help us to distinguish between those two sentences.
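The calculations in steps a–g can be sketched in Python (an illustrative reconstruction using the log-smoothed variant idf(t) = log(N / df(t)) + 1, which may differ slightly from the values in the figures):

```python
# Hand-rolled TF-IDF for the two example sentences.
import math

docs = [
    "this is the first document",
    "this document is the second document",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
N = len(docs)

def tf(word, doc):
    # Term frequency: word count divided by document length.
    return doc.count(word) / len(doc)

def idf(word):
    # Log-smoothed inverse document frequency (one common variant).
    df = sum(1 for doc in tokenized if word in doc)
    return math.log(N / df) + 1

for doc in tokenized:
    print({w: round(tf(w, doc) * idf(w), 3) for w in vocab})
```

Running this, “first” and “second” get higher scores than the shared words, since each of them appears in only one document.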

Now that we have seen the basics of TF-IDF, we are going to use the sklearn library to implement it in Python. Note that sklearn calculates the actual output with a slightly different formula. First, we will see an overview of the calculations and formulas, and then we will implement them in Python.

Actual Calculations:

a. Term Frequency (TF):

b. Inverse Document Frequency (IDF):

c. Calculating final TF-IDF values:

Python Implementation:
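A minimal sketch with scikit-learn’s TfidfVectorizer (assuming its default settings: smoothed IDF, idf(t) = ln((1 + N) / (1 + df(t))) + 1, with each row L2-normalized; the article’s snippet may differ):

```python
# TF-IDF with scikit-learn on the two example sentences.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "This is the first document.",
    "This document is the second document.",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# vocabulary_ maps each term to its column index in the matrix.
print(sorted(vectorizer.vocabulary_))
print(tfidf.toarray().round(3))
```

As in the manual calculation, “first” and “second” end up with higher weights than the words shared by both sentences.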

Conclusion:

These are some of the basics of the exciting field of natural language processing (NLP). We hope you enjoyed reading this article and learned something new. Any suggestions or feedback are crucial for us to continue improving. Please let us know in the comments if you have any.

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University, nor other companies (directly or indirectly) associated with the author(s). These writings do not intend to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

Building Neural Networks with Python Code and Math in Detail — II

https://towardsai.net/p/machine-learning/building-neural-networks-with-python-code-and-math-in-detail-ii-bbe8accbf3d1

Tue, 30 Jun 2020 00:45:24 +0000

Author(s): Pratik Shukla, Roberto Iriondo

The second part of our tutorial on neural networks from scratch. From the math behind them to step-by-step implementation case studies in Python. Launch the samples on Google Colab.

In the first part of our tutorial on neural networks, we explained the basic concepts about neural networks, from the math behind them to implementing neural networks in Python without any hidden layers. We showed how to make satisfactory predictions even in case scenarios where we did not use any hidden layers. However, there are several limitations to single-layer neural networks.

In this tutorial, we will dive in-depth on the limitations and advantages of using neural networks in machine learning. We will show how to implement neural nets with hidden layers and how these lead to a higher accuracy rate on our predictions, along with implementation samples in Python on Google Colab.

Limitations of single-layer neural networks:

A single-layer neural network can only represent a limited set of functions. If we are training a model on a complicated function (which is the general case), then using a single-layer neural network can lead to low accuracy in our prediction rate.

It can only predict linearly separable data. If we have non-linear data, then training our single-layer neural network will lead to low accuracy in our prediction rate.

Decision boundaries for single-layer neural networks must be hyperplanes, which means that if our data is distributed in 3 dimensions, then the decision boundary must be in 2 dimensions.

To overcome such limitations, we use hidden layers in our neural networks.

Advantages of single-layer neural networks:

Single-layer neural networks are easy to set up.

Single-layer neural networks take less time to train compared to a multi-layer neural network.

Single-layer neural networks have explicit links to statistical models.

The outputs in single-layer neural networks are weighted sums of inputs, which means we can interpret the output of a single-layer neural network readily.

Advantages of multilayer neural networks:

They can form more extensive networks by stacking layers of processing units.

They can be used to classify non-linearly separable data.

Multilayer neural networks are more reliable compared to single-layer neural networks.

2. How to select the number of neurons in a hidden layer?

There are many methods for determining the correct number of neurons to use in the hidden layer. We will see a few of them here.

The number of hidden nodes should be less than twice the size of the nodes in the input layer.

For example: If we have 2 input nodes, then our hidden nodes should be less than 4.

a. 2 inputs, 4 hidden nodes:

b. 2 inputs, 3 hidden nodes:

c. 2 inputs, 2 hidden nodes:

d. 2 inputs, 1 hidden node:

The number of hidden nodes should be 2/3 the size of input nodes, plus the size of the output node.

For example: If we have 2 input nodes and 1 output node then the hidden nodes should be = floor(2*2/3 + 1) = 2

a. 2 inputs, 2 hidden nodes:

The number of hidden nodes should be between the size of input nodes and output nodes.

For example: If we have 3 input nodes and 2 output nodes, then the hidden nodes should be between 2 and 3.

a. 3 inputs, 2 hidden nodes, 2 outputs:

b. 3 inputs, 3 hidden nodes, 2 outputs:

How many weight values do we need?

For a hidden layer: Number of inputs * No. of hidden layer nodes

For an output layer: Number of hidden layer nodes * No. of outputs
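The two formulas above can be checked with NumPy matrix shapes (hypothetical layer sizes of 2, 3, and 1 for illustration):

```python
# Counting the weights in a 2-input, 3-hidden-node, 1-output network.
import numpy as np

n_inputs, n_hidden, n_outputs = 2, 3, 1

weight_hidden = np.random.rand(n_inputs, n_hidden)   # 2 * 3 = 6 weights
weight_output = np.random.rand(n_hidden, n_outputs)  # 3 * 1 = 3 weights

print(weight_hidden.size, weight_output.size)
```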

3. The General Structure of an Artificial Neural Network (ANN):

Summarization of an artificial neural network:

Take inputs.

Add bias (if required).

Assign random weights in the hidden layer and the output layer.

Run the code for training.

Find the error in prediction.

Update the weight values of the hidden layer and output layer by gradient descent algorithm.

Repeat the training phase with updated weights.

Make predictions.

Execution of multilayer neural networks:

In the first article, we had only one phase of execution: we found the updated weight values and reran the code to achieve minimum error. However, things are a little spicier here. The execution in a multilayer neural network takes place in two phases. In phase 1, we update the values of weight_output (the weight values for the output layer), and in phase 2, we update the values of weight_hidden (the weight values for the hidden layer). Phase 1 is similar to that of a neural network without any hidden layers.

Execution in phase-1:

To update the weight values, we need the derivatives used in the gradient descent algorithm. We are not going to re-derive the derivatives we already covered in part 1 of this neural network series.
In this phase, our goal is to find the weight values for the output layer, so we calculate the change in error with respect to the change in the output weights.

We first define some terms we are going to use in these derivatives:

In phase 1, we find the updated weights for the output layer. In the second phase, we need to find the updated weights for the hidden layer; hence, we find how a change in the hidden weights affects the change in the error value.

Represented as:

a. Finding the first derivative:

Here we are going to use the chain rule to find the derivative.
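In symbols, the chain-rule decomposition for the output-layer weights is (reconstructed from the derivative names used in the implementation steps below):

derror_dwo = derror_douto * douto_dino * dino_dwo

That is, the change in error with respect to an output weight is the product of the error’s derivative with respect to the predicted output, the predicted output’s derivative with respect to the output-layer input, and the output-layer input’s derivative with respect to the weight.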

4. Implementation of a multilayer neural network in Python

Multilayer neural network: a neural network with a hidden layer. For more definitions, check out our article on terminology in machine learning.

Below, we are going to implement the “OR” gate without a bias value. As we will see, adding hidden layers to a neural network helps us achieve higher accuracy in our models.

Representation:

Truth-Table:

Neural Network:

Notice that here we have 2 input features and 1 output feature. In this neural network, we are going to use 1 hidden layer with 3 nodes.

Graphical representation:

Implementation in Python:

Below, we are going to implement our neural net with hidden layers step by step in Python, let’s code:

a. Import required libraries:

b. Define input features:

Next, we take the input values for which we want to train our neural network. We can see that we have taken two input features. In real data sets, the number of input features is usually much higher.

c. Define target output values:

For each set of input features, we want a specific output, called the target output. We are going to train the model so that it gives us the target output for our input features.

d. Assign random weights:

Next, we are going to assign random weights to the input features. Note that our model is going to modify these weight values to be optimal. At this point, we are taking these values randomly. Here we have two layers, so we have to assign weights for them separately.

The other variable is the learning rate. We are going to use the learning rate (LR) in a gradient descent algorithm to update the weight values. Generally, we keep LR as low as possible so that we can achieve a minimal error rate.

e. Sigmoid function:

Once we have our weight values and input features, we send them to the main function that predicts the output. Notice that our input features and weight values can be anything, but here we want to classify data, so we need the output between 0 and 1. For such an output, we are going to use a sigmoid function.

f. Sigmoid function derivative:

In a gradient descent algorithm, we need the derivative of the sigmoid function.

g. The main logic for predicting output and updating the weight values:

We are going to understand the following code step-by-step.

How does it work?

a. First of all, we run the above code 200,000 times. Keep in mind that if we only run this code a few times, then it is probable that we will have a higher error rate. Therefore, we update the weight values 200,000 times to reach the optimal values possible.

b. Next, we find the input for the hidden layer, defined by the following formula:

We can also represent it as matrices to understand in a better way.

The first matrix here is input features with size (4*2), and the second matrix is weight values for a hidden layer with size (2*3). So the resultant matrix will be of size (4*3).

The intuition behind the final matrix size:

The row size of the final matrix is the same as the row size of the first matrix, and the column size of the final matrix is the same as the column size of the second matrix in multiplication (dot product).
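This shape rule is easy to verify with NumPy (a small check, not part of the article’s code):

```python
# Dot product shape rule: (4*2) . (2*3) -> (4*3).
import numpy as np

A = np.ones((4, 2))  # input features
B = np.ones((2, 3))  # hidden-layer weights

print(np.dot(A, B).shape)
```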

In the representation below, each of those boxes represents a value.

c. Afterward, we have an input for the hidden layer, and it is going to calculate the output by applying a sigmoid function. Below is the output of the hidden layer:

d. Next, we multiply the output of the hidden layer with the weight of the output layer:

The first matrix shows the output of the hidden layer, which has a size of (4*3). The second matrix represents the weight values of the output layer, which has a size of (3*1).

e. Afterward, we calculate the output of the output layer by applying a sigmoid function. It can also be represented in matrix form as follows.

f. Now that we have our predicted output, we find the mean squared between target output and predicted output.

g. Next, we begin the first phase of training. In this step, we update the weight values for the output layer. We need to find out how much the output weights affect the error value. To update the weights, we use a gradient descent algorithm. Notice that we have already found the derivatives we will use during the training phase.

g.a. Matrix representation of the first derivative. Matrix size (4*1).

derror_douto = output_op - target_output

g.b. Matrix representation of the second derivative. Matrix size (4*1).

douto_dino = sigmoid_der(input_op)

g.c. Matrix representation of the third derivative. Matrix size (4*3).

dino_dwo = output_hidden

g.d. Matrix representation of transpose of dino_dwo. Matrix size (3*4).

g.e. Now, we are going to find the final matrix of output weight. For a detailed explanation of this step, please check out our previous tutorial. The matrix size will be (3*1), which is the same as the output_weight matrix.

Hence, we have successfully found the derivative values. Next, we update the weight values accordingly with the help of the gradient descent algorithm.

Nonetheless, we also have to find the derivative for phase-2. Let’s first find that, and then we will update the weights for both layers in the end.

h. Phase -2. Updating the weights in the hidden layer.

Since we have already discussed how we derived the derivative values, we are just going to see matrix representation for each of them to understand it better. Our goal here is to find the weight matrix for the hidden layer, which is of size (2*3).
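Putting the phase-2 derivatives together with the chain rule (using the variable names from the steps below):

derror_dwh = derror_douth * douth_dinh * dinh_dwh, where derror_douth = derror_dino * dino_douth

Applying the matrix transposes shown in the steps below yields the target hidden-weight matrix of size (2*3).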

h.a. Matrix representation for the first derivative.

derror_dino = derror_douto * douto_dino

h.b. Matrix representation for the second derivative.

dino_douth = weight_output

h.c. Matrix representation for the third derivative.

derror_douth = np.dot(derror_dino, dino_douth.T)

h.d. Matrix representation for the fourth derivative.

douth_dinh = sigmoid_der(input_hidden)

h.e. Matrix representation for the fifth derivative.

dinh_dwh = input_features

h.f. Matrix representation for the sixth derivative.

Notice that our goal was to find a hidden weight matrix with the size of (2*3). Furthermore, we have successfully managed to find it.

h.g. Updating the weight values :

We will use the gradient descent algorithm to update the values. It takes three parameters.

The original weight: we already have it.

The learning rate (LR): we assigned it the value of 0.05.

The derivative: Found on the previous step.

Gradient descent algorithm:

Since we have all of our parameter values, this will be a straightforward operation. First, we are updating the weight values for the output layer, and then we are updating the weight values for the hidden layer.
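Steps a–h can be consolidated into one runnable sketch, reconstructed from the matrix shapes described above (the seed and variable grouping are additions for reproducibility; the article’s exact listing may differ):

```python
# OR gate with one hidden layer (3 nodes), no bias, trained by gradient descent.
import numpy as np

np.random.seed(42)  # for reproducibility (an addition, not in the article)

# Input features (4 samples, 2 features) and target output for OR:
input_features = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
target_output = np.array([[0, 1, 1, 1]]).reshape(4, 1)

# Random weights: hidden layer (2*3) and output layer (3*1):
weight_hidden = np.random.rand(2, 3)
weight_output = np.random.rand(3, 1)
lr = 0.05

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_der(x):
    return sigmoid(x) * (1 - sigmoid(x))

for epoch in range(200000):
    # Feedforward:
    input_hidden = np.dot(input_features, weight_hidden)   # (4*3)
    output_hidden = sigmoid(input_hidden)                  # (4*3)
    input_op = np.dot(output_hidden, weight_output)        # (4*1)
    output_op = sigmoid(input_op)                          # (4*1)

    # Phase 1: derivatives for the output-layer weights:
    derror_douto = output_op - target_output               # (4*1)
    douto_dino = sigmoid_der(input_op)                     # (4*1)
    deriv_output = np.dot(output_hidden.T,
                          derror_douto * douto_dino)       # (3*1)

    # Phase 2: derivatives for the hidden-layer weights:
    derror_dino = derror_douto * douto_dino                # (4*1)
    derror_douth = np.dot(derror_dino, weight_output.T)    # (4*3)
    douth_dinh = sigmoid_der(input_hidden)                 # (4*3)
    deriv_hidden = np.dot(input_features.T,
                          derror_douth * douth_dinh)       # (2*3)

    # Gradient descent updates for both layers:
    weight_output -= lr * deriv_output
    weight_hidden -= lr * deriv_hidden

# Prediction for (1, 1) -- target output 1:
result2 = sigmoid(np.dot(np.array([1, 1]), weight_hidden))
print(sigmoid(np.dot(result2, weight_output)))
```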

i. Final weight values:

Below, we show the updated weight values for both layers — our predictions are based on these values.

j. Making predictions:

j.a. Prediction for (1,1).

Target output = 1

Explanation:

First, we take the input values for which we want to predict the output. The result1 variable stores the dot product of the input values and the hidden layer weights. We obtain the hidden layer’s output by applying a sigmoid function and store it in the result2 variable. This serves as the input for the output layer: we calculate the output layer’s input by multiplying result2 with the output layer weights, and to find the final output value, we take the sigmoid of that.

Notice that the predicted output is very close to 1. So we have managed to make accurate predictions.

j.b. Prediction for (0,0).

Target output = 0

Note that the predicted output is very close to 0, which indicates the success rate of our model.

k. Final error value :

After 200,000 iterations, we have our final error value — the lower the error, the higher the accuracy of the model.

As shown above, we can see that the error value is 0.0000000189. This value is the final error value in prediction after 200,000 iterations.

Below, notice that the data we used in this example was linearly separable, which means that by a single line, we can classify outputs with 1 value and outputs with 0 values.

Notice that we did not use a bias value here. Now let’s take a quick look at a neural network without hidden layers for the same input features and target values, find its final error rate, and compare. Since we already implemented this code in our previous tutorial, we are only going to analyze it quickly. [2]

The final error value for the following code is:

As we can see, the error value is way too high compared to the error we found in our neural network implementation with hidden layers, making it one of the main reasons to use hidden layers in a neural network.

# Import required libraries:
import numpy as np

# Define input features:
input_features = np.array([[0,0],[0,1],[1,0],[1,1]])
print(input_features.shape)
print(input_features)

# Define target output:
target_output = np.array([[0,1,1,1]])

# Reshaping our target output into a vector:
target_output = target_output.reshape(4,1)
print(target_output.shape)
print(target_output)

# Define weights:
weights = np.array([[0.1],[0.2]])
print(weights.shape)
print(weights)

# Define learning rate:
lr = 0.05

# Sigmoid function:
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Derivative of sigmoid function:
def sigmoid_der(x):
    return sigmoid(x)*(1-sigmoid(x))

# Main logic for the neural network:
# Running our code 10000 times:
for epoch in range(10000):
    inputs = input_features

    # Feedforward input:
    pred_in = np.dot(inputs, weights)

    # Feedforward output:
    pred_out = sigmoid(pred_in)

    # Backpropagation: calculating the error:
    error = pred_out - target_output
    x = error.sum()
    print(x)

    # Calculating derivatives:
    dcost_dpred = error
    dpred_dz = sigmoid_der(pred_out)

    # Multiplying individual derivatives:
    z_delta = dcost_dpred * dpred_dz

    # Multiplying with the 3rd individual derivative:
    inputs = input_features.T
    weights -= lr * np.dot(inputs, z_delta)

# Predictions:

# Taking inputs:
single_point = np.array([1,0])
# 1st step:
result1 = np.dot(single_point, weights)
# 2nd step:
result2 = sigmoid(result1)
# Print final result:
print(result2)

# Taking inputs:
single_point = np.array([0,0])
# 1st step:
result1 = np.dot(single_point, weights)
# 2nd step:
result2 = sigmoid(result1)
# Print final result:
print(result2)

# Taking inputs:
single_point = np.array([1,1])
# 1st step:
result1 = np.dot(single_point, weights)
# 2nd step:
result2 = sigmoid(result1)
# Print final result:
print(result2)

6. Non-linearly separable data with a neural network

In this example, we are going to take a dataset that cannot be separated by a single straight line. If we try to separate it by a single line, then one or many outputs may be misclassified, and we will have a very high error. Therefore we use a hidden layer to resolve this issue.

Input Table:

Graphical Representation Of Data Points :

As shown below, we represent the data on the coordinate plane. Here notice that we have 2 colored dots (black and red). If we try to draw a single line, then the output is going to be misclassified.

As figure 59 shows, we have 2 inputs and 1 output. In this example, we are going to use 4 hidden perceptrons. The red dots have an output value of 0, and the black dots have an output value of 1. Therefore, we cannot simply classify them using a single straight line.

Neural Network:

Implementation in Python:

a. Import required libraries:

b. Define input features:

c. Define the target output:

d. Assign random weight values:

In figure 64, notice that we are using NumPy’s random function to generate the random values.

numpy.random.rand(x, y): Here, x is the number of rows, and y is the number of columns. It generates values over [0, 1), meaning 0 is included, but 1 is not.
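A quick check of this behavior (not part of the article's code):

```python
# numpy.random.rand(x, y): x rows, y columns, values drawn from [0, 1).
import numpy as np

a = np.random.rand(2, 4)
print(a.shape)
print(a.min() >= 0, a.max() < 1)
```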

e. Sigmoid function:

f. Finding the derivative with a sigmoid function:

g. Training our neural network:

h. Weight values of hidden layer:

i. Weight values of output layer:

j. Final error value :

After training our model for 200,000 iterations, we finally achieved a low error value.

k. Making predictions from the trained model :

k.a. Predicting output for (0.5, 2).

The predicted output is closer to 1.

k.b. Predicting output for (0, -1)

The predicted output is very near to 0.

k.c. Predicting output for (0, 5)

The predicted output is close to 1.

k.d. Predicting output for (1, 1.2)

The predicted output is close to 0.

Based on the output values, our model has done a high-grade job of predicting values.

We can separate our data in the following way as shown in Figure 76. Note that this is not the only possible way to separate these values.

Therefore, to conclude, using a hidden layer in our neural networks helps us reduce the error rate when we have non-linearly separable data. Even though the training time increases, we have to remember that our goal is to make high-accuracy predictions, and hidden layers help us achieve that.

Neural networks can learn from their mistakes, and they can produce output that is not limited to the inputs provided to them.

Inputs are stored in the network itself instead of an external database.

These networks can learn from examples, and we can predict the output for similar events.

In case of failure of one neuron, the network can detect the fault and still produce output.

Neural networks can perform multiple tasks in parallel processes.

For attribution in academic contexts, please cite this work as:

Shukla, et al., “Building Neural Networks with Python Code and Math in Detail — II”, Towards AI, 2020

BibTex citation:

@article{pratik_iriondo_2020,
title={Building Neural Networks with Python Code and Math in Detail — II},
url={https://towardsai.net/building-neural-nets-with-python},
journal={Towards AI},
publisher={Towards AI Co.},
author={Pratik, Shukla and Iriondo,
Roberto},
year={2020},
month={Jun}
}

Are you new to machine learning? Check out an overview of machine learning algorithms for beginners with code examples in Python

Neural Networks from Scratch with Python Code and Math in Detail — I

https://towardsai.net/p/machine-learning/building-neural-networks-from-scratch-with-python-code-and-math-in-detail-i-536fae5d7bbf

Sat, 20 Jun 2020 01:59:17 +0000

Author(s): Pratik Shukla, Roberto Iriondo

Learn all about neural networks from scratch. From the math behind it to step-by-step implementation case studies in Python. Launch them live on Google Colab

Note: In our upcoming second tutorial on neural networks, we will show how we can add hidden layers to our neural nets.

What is a neural network?

Neural networks form the base of deep learning, which is a subfield of machine learning, where the structure of the human brain inspires the algorithms. Neural networks take input data, train themselves to recognize patterns found in the data, and then predict the output for a new set of similar data. Therefore, a neural network is the functional unit of deep learning, which mimics the behavior of the human brain to solve complex data-driven problems.

The first thing that comes to our mind when we think of “neural networks” is biology, and indeed, neural nets are inspired by our brains. Let’s try to understand them.

In machine learning terms, the dendrites act as the inputs, and the nucleus processes the data and forwards the calculated output through the axon. In a biological neural network, the width (thickness) of a dendrite defines the weight associated with it.

Simply put, an ANN represents interconnected input and output units in which each connection has an associated weight. During the learning phase, the network learns by adjusting these weights in order to be able to predict the correct class for input data.

For instance:

We encounter ourselves in a deep sleep state, and suddenly our environment starts to tremble. Immediately afterward, our brain recognizes that it is an earthquake. At once, we think of what is most valuable to us:

Our beloved ones.

Essential documents.

Jewelry.

Laptop.

A pencil.

Now we only have a few minutes to get out of the house, and we can only save a few things. What will our priorities be in this case?

Perhaps we are going to save our beloved ones first, and then, if time permits, we can think of other things. What we did here is assign a weight to our valuables. Each of the valuables at that moment is an input, and the priorities are the weights we assigned to them.

The same is the case with neural networks. We assign weights to different values and predict the output from them. However, in this case, we do not know the associated weight with each input, so we make an algorithm that will calculate the weights associated with them by processing lots of input data.

2. Applications of Artificial Neural Networks:

a. Classification of data:

Based on a set of data, our trained neural network predicts whether it is a dog or a cat?

b. Anomaly detection:

Given the details of a person’s transactions, it can determine whether a transaction is fraudulent or not.

c. Speech recognition:

We can train our neural network to recognize speech patterns. Example: Siri, Alexa, Google assistant.

d. Audio generation:

Given the inputs as audio files, it can generate new music based on various factors like genre, singer, and others.

e. Time series analysis:

A well-trained neural network can help forecast stock prices.

f. Spell checking:

We can train a neural network that detects misspelled words and suggests corrections. Example: Grammarly.

g. Character recognition:

A well-trained neural network can detect handwritten characters.

h. Machine translation:

We can develop a neural network that translates one language into another language.

i. Image processing:

We can train a neural network to process an image and extract pieces of information from it.

3. General Structure of an Artificial Neural Network (ANN):

4. What is a Perceptron?

A perceptron is a neural network without any hidden layer. A perceptron only has an input layer and an output layer.

Where we can use perceptrons?

Perceptrons are useful in many scenarios. While a perceptron is mostly used for simple decision making, perceptrons can also come together in larger computer programs to solve more complex problems.

For instance:

Give access if a person is a faculty member and deny access if a person is a student.

Steps involved in the implementation of a neural network:

A neural network executes in 2 steps :

1. Feedforward:

In a feedforward pass, we have a set of input features and some random weights. Notice that, in this case, we start with random weights that we will optimize using backpropagation.

2. Backpropagation:

During backpropagation, we calculate the error between predicted output and target output and then use an algorithm (gradient descent) to update the weight values.

Why do we need backpropagation?

While designing a neural network, we first train a model and assign specific weights to each input. Each weight decides how vital that feature is for our prediction: the higher the weight, the greater the importance. However, initially, we do not know the specific weights required by those inputs. So we assign some random weights to our inputs, and our model calculates the error in prediction. Thereafter, we update our weight values and rerun the code (backpropagation). After several iterations, we can get lower error values and higher accuracy.

Summarizing an Artificial Neural Network:

Take inputs

Add bias (if required)

Assign random weights to input features

Run the code for training.

Find the error in prediction.

Update the weight by gradient descent algorithm.

Repeat the training phase with updated weights.

Make predictions.

Flow chart for a simple neural network:

The training phase of a neural network:

5. Perceptron Example:

Below is a simple perceptron model with four inputs and one output.

What we have here is the input values and their corresponding target output values. So what we are going to do, is assign some weight to the inputs and then calculate their predicted output values.

In this example we are going to calculate the output by the following formula:

For the sake of this example, we are going to take the bias value = 0 for simplicity of calculation.

a. Let’s take W = 3 and check the predicted output.

b. After we have found the value of predicted output for W=3, we are going to compare it with our target output, and by doing that, we can find the error in the prediction model. Keep in mind that our goal is to achieve minimum error and maximum accuracy for our model.

c. Notice that in the above calculation, there is an error in 3 out of 4 predictions. So we have to change the weight value to reduce the error. Now we have two options:

Increase weight

Decrease weight

First, we are going to increase the value of the weight and check whether it leads to a higher or lower error rate. Here we increased the weight value by 1, changing it to W = 4.

d. As we can see in the figure above, the error in prediction is increasing. So now we can conclude that increasing the weight value does not help us reduce the error in prediction.

e. Since increasing the weight value failed, we are going to decrease it and see whether that helps.

f. Calculate the error in prediction. Here we can see that we have achieved the global minimum.

In figure 17, we can see that there is no error in prediction.

Now what we did here:

First, we have our input values and target output.

Then we initialized some random value to W, and then we proceed further.

Last, we calculated the error in prediction for that weight value. Afterward, we updated the weight and predicted the output again. After several trial-and-error epochs, we reduced the error in prediction.

So, we are trying to get the value of weight such that the error becomes minimum. We need to figure out whether we need to increase or decrease the weight value. Once we know that, we keep on updating the weight value in that direction until error becomes minimum. We might reach a point where if further updates occur to the weight, the error will increase. At that time, we need to stop, and that is our final weight value.

In real-life data, the situation can be a bit more complex. In the example above, we saw that we could try different weight values and get the minimum error manually. However, in real-life data, weight values are often decimal (non-integer). Therefore, we are going to use a gradient descent algorithm with a low learning rate so that we can try different weight values and obtain the best predictions from our model.

6. Sigmoid Function:

A sigmoid function serves as the activation function in our neural network training. We generally use neural networks for classification; in binary classification there are two classes, so we want the output values to be 0 or 1. However, the equation we used can produce any real number as output. To solve that problem, we apply a sigmoid function, which squashes the output into the range between 0 and 1.

Let’s have a look at it:

Let’s visualize our sigmoid function with Python:
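The plotting code appeared as an image in the original; a minimal sketch that reproduces it with NumPy and Matplotlib might look like this (the Agg backend is an assumption here so the figure is written to a file rather than displayed; drop that line to show it interactively):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen to a file (illustrative choice)
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Plot the sigmoid over a symmetric range around zero
x = np.linspace(-10, 10, 200)
plt.plot(x, sigmoid(x))
plt.xlabel("x")
plt.ylabel("sigmoid(x)")
plt.title("Sigmoid function")
plt.savefig("sigmoid.png")
```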

Output:

Explanation:

In figures 21 and 22, we can see that for any input value, the value of the sigmoid function always lies between 0 and 1. Notice that for negative numbers, the output of the sigmoid function is ≤0.5, that is, closer to zero, and for positive numbers, the output is >0.5, that is, closer to 1.

7. Neural Network Implementation from Scratch:

What we are going to do is implement the “OR” logic gate using a perceptron. Keep in mind that we are not going to use any hidden layers here.

What is a logical OR gate?

Simply put, when at least one of the inputs is 1, the output of the OR gate is 1. The output is 0 only when both of the inputs are 0.

Representation:

Truth-Table for OR gate:

Perceptron for the OR gate:

Next, we are going to assign some weights to each of the input values and calculate it.

Example: (Calculating Manually)

a. Calculate the input for o1:

b. Calculate the output value:

Notice that from our truth table, we can see that we wanted the output of 1, but what we get here is 0.68997. Now we need to calculate the error and then backpropagate and then update the weight values.

c. Error Calculation:

Next, we are going to use the Mean Squared Error (MSE) to calculate the error:

The summation sign (Sigma symbol) means that we have to add our error for all our input sets. Here we are going to see how that works for only one input set.

We have to do the same for all the remaining inputs. Now that we have found the error, we have to update the values of weight to make the error minimum. For updating weight values, we are going to use a gradient descent algorithm.
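As a sketch of that calculation, the MSE over all four input sets could be computed like this (the prediction 0.68997 for input (1,1) comes from the manual calculation above; the other three predicted values are made-up placeholders for illustration):

```python
import numpy as np

# OR-gate targets and illustrative predicted outputs
# (only 0.68997 is taken from the manual calculation above)
target = np.array([0.0, 1.0, 1.0, 1.0])
predicted = np.array([0.5, 0.7, 0.7, 0.68997])

# Mean Squared Error: average of the squared differences
mse = np.mean((target - predicted) ** 2)
```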

8. What is Gradient Descent?

Gradient descent is an optimization algorithm that operates iteratively to find the optimal values for its parameters. It takes into account a user-defined learning rate and initial parameter values.

Working: (Iterative)

1. Start with initial values.

2. Calculate cost.

3. Update values using the update function.

4. Return the minimized cost for our cost function.
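These four steps can be sketched for a one-variable cost function, say f(w) = (w − 3)², whose minimum we know is at w = 3 (the cost function, starting value, and learning rate here are illustrative choices, not from the article):

```python
# Gradient descent on f(w) = (w - 3)**2, whose derivative is 2*(w - 3)
w = 0.0    # 1. start with an initial value
lr = 0.1   # user-defined learning rate

for _ in range(100):
    cost = (w - 3) ** 2   # 2. calculate cost
    grad = 2 * (w - 3)    # derivative of the cost
    w -= lr * grad        # 3. update the value

# 4. after the loop, w has converged very close to the minimizer w = 3
```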

Why do we need it?

Generally, we would derive a closed-form formula that gives the optimal values for our parameters. This algorithm, however, finds those values iteratively by itself!

Interesting, isn’t it?

We are going to update our weight with this algorithm. First of all, we need to find the derivative of f(X).

9. Derivation of the formula used in a neural network

Next, what we want to find is how a particular weight value affects the error. To find that we are going to apply the chain rule.

Afterward, we have to find the values of these three derivatives.

In the following images, we have tried to show the derivation of each of these derivatives to showcase the math behind gradient descent.

d. Calculating derivatives:

In our case:

Output = 0.68997
Target = 1

e. Finding the second part of the derivative:

Figure 36: Calculating the second part

To understand it step-by-step:

e.a. Value of outo1:

e.b. Finding the derivative with respect to ino1:

e.c. Simplifying it a bit to find the derivative easily:

e.d. Applying chain rule and power rule:

e.e. Applying sum rule:

e.f. The derivative of constant is zero:

e.g. Applying exponential rule and chain rule:

e.h. Simplifying it a bit:

e.i. Multiplying both negative signs:

e.j. Put the negative power in the denominator:

That is it. However, we need to simplify it as it is a little complex for our machine learning algorithm to process for a large number of inputs.

e.k. Simplifying it:

e.l. Further simplification:

e.m. Adding +1–1:

e.n. Separate the parts:

e.o. Simplify:

e.p. Now, we know the value of outo1 from equation 1:

e.q. From that, we can derive the following final derivative:

e.r. Calculating the value for our input:

f. Finding the third part of the derivative :

f.a. Value of ino1:

f.b. Finding derivative:

All the other values except w2 will be considered constant here.

f.c. Calculating both values for our input:

f.d. Putting it all together:

f.e. Putting it in our main equation:

f.f. We can calculate:

Notice that the value of the weight has increased here. We can calculate all the values in this way, but as we can see, it is going to be a lengthy process. So now we are going to implement all the steps in Python.

Summary of The Manual Implementation of a Neural Network:

a. Input for perceptron:

b. Applying sigmoid function for predicted output :

c. Calculate the error:

d. Changing the weight value based on gradient descent formula:

e. Calculating the derivative:

f. Individual derivatives:

Source: Image created by the author.

Source: Image created by the author.

g. After that, we run the same code with the updated weight values.

Let’s code:

10. Implementation of a Neural Network In Python:

10.1 Import Required libraries:

First, we are going to import Python libraries. We are using NumPy for the calculations:

10.2 Assign Input values:

Next, we are going to take the input values for which we want to train our neural network. Here we can see that we have taken two input features. In real data sets, the number of input features is usually much higher.

10.3 Target Output:

For the input features, we want a specific output, called the target output. We are going to train the model so that it gives us the target output for our input features.

10.3 Assign the Weights:

Next, we are going to assign random weights to the input features. Note that our model is going to modify these weight values to be optimum. At this point, we are taking these values randomly. Here we have two input features, so we are going to take two weight values.

10.4 Adding Bias Values and Assigning a Learning Rate :

Now here we are going to add the bias value. The value of bias = 1. However, the weight assigned to it is random at first, and our model will optimize it for our target output.

The other parameter is called the learning rate (LR). We are going to use the learning rate in the gradient descent algorithm to update the weight values. Generally, we keep the learning rate small so that the updates converge to a low error rate.

10.5 Applying a Sigmoid Function:

Once we have our weight values and input features, we are going to send them to the main function that predicts the output. Notice that our input features and weight values can be anything, but here we want to classify data, so we need the output between 0 and 1. For that, we are going to use a sigmoid function.

10.6 Derivative of sigmoid function:

In the gradient descent algorithm, we are going to need the derivative of the sigmoid function.
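In code, this derivative can be expressed in terms of the sigmoid itself, since σ′(x) = σ(x)(1 − σ(x)); a quick numerical check against a finite difference confirms the formula:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Derivative of the sigmoid: sigma'(x) = sigma(x) * (1 - sigma(x))
def sigmoid_der(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Numerical sanity check against a central finite difference at x = 1
h = 1e-6
numeric = (sigmoid(1 + h) - sigmoid(1 - h)) / (2 * h)
```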

10.7 The main logic for predicting output and updating the weight values:

We are going to explain the following code step-by-step.

How does it work?

First of all, the code above runs 10,000 times. Keep in mind that if we only ran this code a few times, we would probably end up with a higher error rate. In short, we update the weight values 10,000 times to reach the best values possible.

Next, we need to multiply the input features with their corresponding weight values; the values we feed to the perceptron can be represented in the form of a matrix.

in_o represents the dot product of input_features and weight. Notice that the first matrix (input features) is of size (4*2), and the second matrix (weights) is of size (2*1). After multiplication, the resultant matrix is of size (4*1).
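The shape bookkeeping can be checked directly with NumPy (the OR-gate inputs and starting weights here match the full listing later in the article):

```python
import numpy as np

# OR-gate inputs and starting weights, as in the full listing
input_features = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # shape (4, 2)
weights = np.array([[0.1], [0.2]])                           # shape (2, 1)

# (4, 2) dot (2, 1) -> (4, 1): one weighted sum per input row
in_o = np.dot(input_features, weights)
```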

In the above representation, each of those boxes represents a value.

Now in our formula, we also have the bias value. Let’s understand it with simple matrix representation.

Next, we are going to add the bias value. Matrix addition is easy to understand. The result is the input for the sigmoid function. Afterward, we apply the sigmoid function to this input, which gives us a predicted output value between 0 and 1.

Next, we have to calculate the error in prediction. We generally use Mean Squared Error (MSE) for this, but here we are just going to use a simple error function for simplicity of calculation. Last, we are going to sum the error over all four of our inputs.

Our ultimate goal is to minimize the error. To minimize the error, we can update the value of our weights. To update the weight value, we are going to use a gradient descent algorithm.

To find the derivative, we are going to need the values of some derivatives for our gradient descent algorithm. As we have already discussed, we are going to find 3 individual values for derivatives and then multiply it.

The first derivative is:

The second derivative is:

The third derivative is:

Notice that we can easily find the values of the first two derivatives, as they do not depend directly on the inputs. Next, we store the product of the first two derivatives in the deriv variable. The final derivative must be of the same size as the weight matrix, which is (2*1).

To find the final derivative, we need to find the transpose of our input_features and then we are going to multiply it with our deriv variable that is basically the multiplication of the other two derivatives.

Let’s have a look at the matrix representation of the operation.

On figure 83, the first matrix is the transposed matrix of input_features. The second matrix stores the values of the multiplication of the other two derivatives. We store the result in a matrix called deriv_final. Notice that the size of deriv_final is (2*1), which is the same as the size of our weight matrix (2*1).

Afterward, we update the weight value, notice that we have all the values needed for updating our weight. We are going to use the following formula to update the weight values.

Last, we need to update the bias value. Recall from the diagram that the bias weight is not connected to any input, so we have to update it separately, using only the deriv values. To update the bias, we loop over deriv and adjust the bias at each element on every iteration.

10.8 Check the Values of Weight and Bias:

On figure 85, notice that our weight and bias values have changed from our randomly assigned values.

10.9 Predicting values:

Since we have trained our model, we can start to make predictions from it.

10.9.1 Prediction for (1,0):

Target value = 1

On figure 86, we can see the predicted output is very near to 1.

10.9.2 Prediction for (1,1):

Target output = 1

On figure 87, we can see that the predicted output is very close to 1.

10.9.3 Prediction for (0,0):

Target output = 0

On figure 88, we can see that the predicted output is very close to 0.

#Multiplying the first two individual derivatives:
deriv = derror_douto * douto_dino

#Multiplying with the 3rd individual derivative,
#finding the transpose of input_features:
inputs = input_features.T
deriv_final = np.dot(inputs, deriv)

#Updating the weight values:
weights -= lr * deriv_final

#Updating the bias weight value:
for i in deriv:
    bias -= lr * i

#Check the final values for weight and bias:
print(weights)
print(bias)

Suppose we have input values (0,0). The sum of the products of the input nodes and weights is then always zero, so without a bias the input to the sigmoid is fixed at zero and the prediction is stuck, no matter how much we train our model. To resolve this issue and make reliable predictions, we use the bias term. In short, the bias term is necessary to build a robust neural network.

So how does the value of the bias affect the shape of our sigmoid function? Let's visualize it with some examples.

To change the steepness of the sigmoid curve, we can adjust the weight accordingly.

For instance:

From the output, we can quickly notice that for negative values, the output of the sigmoid function is going to be ≤0.5. Moreover, for positive values, the output is going to be >0.5.

From the figure, we can see that decreasing the value of the weight (red curve) decreases the steepness, and increasing the value of the weight (green curve) increases the steepness. However, for all three curves, if the input is negative, the output is always going to be ≤0.5, and for positive numbers, the output is always going to be >0.5.

What if we want to change this pattern?

For such case scenarios, we use bias values.

From the output, we can notice that the bias shifts the curves along the x-axis, which helps us change the pattern we showed in the previous example.
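A sketch of both effects, varying the weight to change the steepness and varying the bias to shift the curve along the x-axis (the specific weight and bias values, and the off-screen Agg backend, are illustrative assumptions):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen to a file (illustrative choice)
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-10, 10, 200)

# Varying the weight changes the steepness of the curve
for w in (0.5, 1.0, 2.0):
    plt.plot(x, sigmoid(w * x), label=f"w={w}, b=0")

# Varying the bias shifts the curve left or right
for b in (-4, 4):
    plt.plot(x, sigmoid(x + b), label=f"w=1, b={b}")

plt.legend()
plt.savefig("sigmoid_weight_bias.png")
```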

Summary:

In neural networks:

We can view bias as a threshold value for activation.

Bias increases the flexibility of the model

The bias value allows us to shift the activation function to the right or left.

The bias value is most useful when we have all zeros (0,0) as input.

Let’s try to understand it with the same example we saw earlier. Nevertheless, here we are not going to add the bias value. After the model has trained, we will try to predict the value of (0,0). Ideally, it should be close to zero. Now let’s check out the following example.

An Implementation Without Bias Value:

a. Import required libraries:

b. Input features:

c. Target output:

d. Define Input weights:

e. Define the learning rate:

f. Activation function:

g. A derivative of the sigmoid function:

h. The main logic for training our model:

Here notice that we are not going to use bias values anywhere.

i. Making predictions:

i.a. Prediction for (1,0) :

Target output = 1

From the predicted output we can see that it’s close to 1.

i.b. Prediction for (0,0) :

Target output = 0

Here we can see that it’s nowhere near 0. So we can say that our model failed to predict it. This is the reason for adding the bias value.

i.c. Prediction for (1,1):

Target output = 1

We can see that it’s close to 1.

Putting it all together:

# Import required libraries:
import numpy as np

# Define input features:
input_features = np.array([[0,0],[0,1],[1,0],[1,1]])
print(input_features.shape)
print(input_features)

# Define target output:
target_output = np.array([[0,1,1,1]])
# Reshaping our target output into a vector:
target_output = target_output.reshape(4,1)
print(target_output.shape)
print(target_output)

# Define weights:
weights = np.array([[0.1],[0.2]])
print(weights.shape)
print(weights)

# Define learning rate:
lr = 0.05

# Sigmoid function:
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Derivative of the sigmoid function:
def sigmoid_der(x):
    return sigmoid(x)*(1-sigmoid(x))

# Main logic for the neural network,
# running our code 10,000 times:
for epoch in range(10000):
    inputs = input_features
    # Feedforward input:
    pred_in = np.dot(inputs, weights)
    # Feedforward output:
    pred_out = sigmoid(pred_in)
    # Backpropagation: calculating the error
    error = pred_out - target_output
    x = error.sum()
    print(x)
    # First derivative:
    dcost_dpred = error
    # Second derivative, taken with respect to the
    # sigmoid's input, pred_in:
    dpred_dz = sigmoid_der(pred_in)
    # Multiplying individual derivatives:
    z_delta = dcost_dpred * dpred_dz
    # Multiplying with the 3rd individual derivative:
    inputs = input_features.T
    weights -= lr * np.dot(inputs, z_delta)

# Taking inputs:
single_point = np.array([1,0])
# 1st step:
result1 = np.dot(single_point, weights)
# 2nd step:
result2 = sigmoid(result1)
# Print the final result:
print(result2)

# Taking inputs:
single_point = np.array([0,0])
result1 = np.dot(single_point, weights)
result2 = sigmoid(result1)
print(result2)

# Taking inputs:
single_point = np.array([1,1])
result1 = np.dot(single_point, weights)
result2 = sigmoid(result1)
print(result2)

Now a real-life example of a prediction case study with a neural network ↓

Case Study: Predicting Whether a Person will be Positive for a Virus with a Neural Net

Dataset:

For this example, our goal is to predict whether a person is positive for a virus or not based on the given input features. Here 1 represents “Yes” and 0 represents “No”.

Let’s code:

a. Import required libraries:

Source: Image created by the author.

b. Input features:

c. Target output:

d. Define weights:

e. Bias value and learning rate:

f. Sigmoid function:

g. Derivative of sigmoid function:

h. The main logic for training model:

i. Making predictions:

i.a. A tested person is positive for the virus.

i.b. A tested person is negative for the virus.

i.c. A tested person is positive for the virus.

j. Final weight and bias values:

In this example, we can notice that the input feature “loss of smell” influences the output the most. If it is true, then in most cases, the person tests positive for the virus. We can also derive this conclusion from the weight values. Keep in mind that the higher the value of the weight, the more the influence on the output. The input feature “Weight loss” does not affect the output much, so we can rule it out while making predictions for a larger dataset.

Putting it all together:

# Import required libraries:
import numpy as np

# Define input features:
input_features = np.array([[1,0,0,1],[1,0,0,0],[0,0,1,1],
                           [0,1,0,0],[1,1,0,0],[0,0,1,1],
                           [0,0,0,1],[0,0,1,0]])
print(input_features.shape)
print(input_features)

# Define target output:
target_output = np.array([[1,1,0,0,1,1,0,0]])
# Reshaping our target output into a vector:
target_output = target_output.reshape(8,1)
print(target_output.shape)
print(target_output)

# Define weights:
weights = np.array([[0.1],[0.2],[0.3],[0.4]])
print(weights.shape)
print(weights)

# Bias weight:
bias = 0.3
# Learning rate:
lr = 0.05

# Sigmoid function:
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Derivative of the sigmoid function:
def sigmoid_der(x):
    return sigmoid(x)*(1-sigmoid(x))

# Main logic for the neural network,
# running our code 10,000 times:
for epoch in range(10000):
    inputs = input_features
    # Feedforward input:
    pred_in = np.dot(inputs, weights) + bias
    # Feedforward output:
    pred_out = sigmoid(pred_in)
    # Backpropagation: calculating the error
    error = pred_out - target_output
    x = error.sum()
    print(x)
    # First derivative:
    dcost_dpred = error
    # Second derivative, taken with respect to the
    # sigmoid's input, pred_in:
    dpred_dz = sigmoid_der(pred_in)
    # Multiplying individual derivatives:
    z_delta = dcost_dpred * dpred_dz
    # Multiplying with the 3rd individual derivative:
    inputs = input_features.T
    weights -= lr * np.dot(inputs, z_delta)
    # Updating the bias weight value:
    for i in z_delta:
        bias -= lr * i

# Printing final weights and bias:
print(weights)
print(bias)

In the examples above, we did not use any hidden layers for the calculations, and our data were linearly separable. For instance:

We can see that the red line can separate the yellow dots (value = 1) and the green dots (value = 0).

Limitations of a Perceptron Model (Without Hidden Layers):

1. Single-layer perceptrons cannot classify non-linearly separable data points.

2. Complex problems involving many parameters cannot be solved by single-layer perceptrons.

However, in several cases, the data is not linearly separable. In that case, our perceptron model (without hidden layers) fails to make accurate predictions. To make accurate predictions, we need to add one or more hidden layers.
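The XOR gate is the classic example: because its truth table is not linearly separable, the single-layer training loop from this article can never fit it, no matter how long it runs. This can be checked with a sketch that reuses the article's loop on XOR targets:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_der(x):
    return sigmoid(x) * (1 - sigmoid(x))

# XOR truth table: not linearly separable
input_features = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
target_output = np.array([[0, 1, 1, 0]]).reshape(4, 1)

weights = np.array([[0.1], [0.2]])
bias = 0.3
lr = 0.05

# Same training loop as the article's single-layer perceptron
for epoch in range(10000):
    pred_in = np.dot(input_features, weights) + bias
    pred_out = sigmoid(pred_in)
    error = pred_out - target_output
    z_delta = error * sigmoid_der(pred_in)
    weights -= lr * np.dot(input_features.T, z_delta)
    bias -= lr * z_delta.sum()

# At least one XOR input stays badly mispredicted, however long we train
worst_error = float(np.max(np.abs(pred_out - target_output)))
```

Adding a hidden layer, as discussed above, is what makes XOR learnable.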
Visual representation of non-linearly separable data:

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University, nor other companies (directly or indirectly) associated with the author(s). These writings do not intend to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

Citation

For attribution in academic contexts, please cite this work as:

Shukla, et al., “Neural Networks from Scratch with Python Code and Math in Detail — I”, Towards AI, 2020

BibTex citation:

@article{pratik_iriondo_2020,
title={Neural Networks from Scratch with Python Code and Math in Detail — I},
url={https://towardsai.net/neural-networks-with-python},
journal={Towards AI},
publisher={Towards AI Co.},
author={Shukla, Pratik and Iriondo, Roberto},
year={2020},
month={Jun}
}

Gather AI, Revolutionizing Inventory Management One Drone at a Time
https://towardsai.net/p/news/gather-ai-revolutionizing-inventory-management-one-drone-at-a-time-461aedfe8759
Sat, 06 Jun 2020 13:55:29 +0000
Author(s): Roberto Iriondo

Gather AI’s video showcases the world’s first dedicated autonomous software-only inventory management platform for warehouses.

Machine Learning Algorithms For Beginners with Code Examples in Python
https://towardsai.net/p/machine-learning/machine-learning-algorithms-for-beginners-with-python-code-examples-ml-19c6afd60daa
Wed, 03 Jun 2020 21:00:35 +0000

Overview of the major machine learning algorithms for beginners with coding samples

Machine learning (ML) is rapidly changing the world through the diverse applications and research pursued in industry and academia. Machine learning is affecting every part of our daily lives: from voice assistants using NLP and machine learning to make appointments, check our calendar, and play music, to programmatic advertisements so accurate that they can predict what we will need before we even think of it.

More often than not, the complexity of the scientific field of machine learning can be overwhelming, making keeping up with “what is important” a very challenging task. However, we want to provide a learning path for those who seek to learn machine learning but are new to these concepts. In this article, we look at the most critical basic algorithms that will hopefully make your machine learning journey less challenging.

Any suggestions or feedback are crucial to continuing to improve; please let us know in the comments if you have any.

Index

Introduction to Machine Learning.

Major Machine Learning Algorithms.

Supervised vs. Unsupervised Learning.

Linear Regression.

Multivariable Linear Regression.

Polynomial Regression.

Exponential Regression.

Sinusoidal Regression.

Logarithmic Regression.

What is machine learning?

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. ~ Tom M. Mitchell [1]

Machine learning behaves similarly to the growth of a child. As a child grows, her experience (E) in performing a task (T) increases, which results in a higher performance measure (P).

For instance, we give a “shape sorting block” toy to a child. (We all know that in this toy, we have different shapes and shape holes.) In this case, our task (T) is to find an appropriate shape hole for a shape. Afterward, the child observes the shape and tries to fit it in a shape hole. Let us say that this toy has three shapes: a circle, a triangle, and a square. In her first attempt at finding a shape hole, her performance measure (P) is 1/3, which means that the child found 1 out of 3 correct shape holes.

Second, the child tries this task another time and notices that she is a little more experienced at it. Considering the experience gained (E), the child performs the task again, and when we measure the performance (P), it turns out to be 2/3. After repeating this task (T) 100 times, the child has figured out which shape goes into which shape hole.

So as her experience (E) increased, her performance (P) also increased: as the number of attempts at this toy increases, the performance also increases, which results in higher accuracy.

Such execution is similar to machine learning. What a machine does is take a task (T), execute it, and measure its performance (P). Now a machine has a large amount of data, so as it processes that data, its experience (E) increases over time, resulting in a higher performance measure (P). So after going through all the data, our machine learning model's accuracy increases, which means that the predictions made by our model will be very accurate.

Another definition of machine learning by Arthur Samuel:

Machine Learning is the subfield of computer science that gives “computers the ability to learn without being explicitly programmed.” ~ Arthur Samuel [2]

Let us try to understand this definition: it states “learn without being explicitly programmed,” which means that we are not going to teach the computer with a specific set of rules. Instead, we are going to feed the computer with enough data and give it time to learn from that data by making its own mistakes and improving upon them. For example, we did not teach the child how to fit in the shapes, but by performing the same task several times, the child learned to fit the shapes in the toy by herself.

Therefore, we can say that we did not explicitly teach the child how to fit the shapes. We do the same thing with machines. We give it enough data to work on and feed it with the information we want from it. So it processes the data and predicts the data accurately.

Why do we need machine learning?

For instance, we have a set of images of cats and dogs. What we want to do is classify them into a group of cats and dogs. To do that we need to find out different animal features, such as:

How many eyes does each animal have?

What is the eye color of each animal?

What is the height of each animal?

What is the weight of each animal?

What does each animal generally eat?

We form a feature vector from the answers to these questions. Next, we apply a set of rules such as:

If height > 1 feet and weight > 15 lbs, then it could be a cat.

Now, we have to make such a set of rules for every data point. Furthermore, we build a decision tree of if, else-if, and else statements and check whether each data point falls into one of the categories.

Let us assume that the result of this experiment was not fruitful as it misclassified many of the animals, which gives us an excellent opportunity to use machine learning.

What machine learning does is process the data with different kinds of algorithms and tells us which feature is more important to determine whether it is a cat or a dog. So instead of applying many sets of rules, we can simplify it based on two or three features, and as a result, it gives us a higher accuracy. The previous method was not generalized enough to make predictions.

Machine learning models help us in many tasks, such as:

Object Recognition

Summarization

Prediction

Classification

Clustering

Recommender systems

And others

What is a machine learning model?

A machine learning model is a question/answering system that takes care of processing machine-learning related tasks. Think of it as an algorithm system that represents data when solving problems. The methods we will tackle below are beneficial for industry-related purposes to tackle business problems.

For instance, let us imagine that we are working on Google Adwords’ ML system, and our task is to implement an ML algorithm that targets a particular demographic or area using data. The aim of such a task is to use data to gather valuable insights that improve business outcomes.

Major Machine Learning Algorithms:

1. Regression (Prediction)

We use regression algorithms for predicting continuous values.

Regression algorithms:

Linear Regression

Polynomial Regression

Exponential Regression

Logistic Regression

Logarithmic Regression
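As a minimal sketch of the first algorithm on this list, here is a least-squares linear fit using NumPy (the data points are illustrative, generated from y = 2x + 1):

```python
import numpy as np

# Illustrative data following the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Least-squares fit of a degree-1 polynomial (a straight line)
slope, intercept = np.polyfit(x, y, 1)

# Predict a continuous value for an unseen input, x = 6
prediction = slope * 6 + intercept
```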

2. Classification

We use classification algorithms for predicting a set of items’ class or category.

Classification algorithms:

K-Nearest Neighbors

Decision Trees

Random Forest

Support Vector Machine

Naive Bayes

3. Clustering

We use clustering algorithms for summarization or to structure data.

Clustering algorithms:

K-means

DBSCAN

Mean Shift

Hierarchical

4. Association

We use association algorithms for associating co-occurring items or events.

Association algorithms:

Apriori

5. Anomaly Detection

We use anomaly detection for discovering abnormal activities and unusual cases like fraud detection.

6. Sequence Pattern Mining

We use sequential pattern mining to predict the next data event in a sequence of data examples.

7. Dimensionality Reduction

We use dimensionality reduction for reducing the size of data to extract only useful features from a dataset.

8. Recommendation Systems

We use recommender algorithms to build recommendation engines.

Examples:

Netflix recommendation system.

A book recommendation system.

A product recommendation system on Amazon.

Nowadays, we hear many buzz words like artificial intelligence, machine learning, deep learning, and others.

What are the fundamental differences between Artificial Intelligence, Machine Learning, and Deep Learning?

Artificial Intelligence (AI):

Artificial intelligence (AI), as defined by Professor Andrew Moore, is the science and engineering of making computers behave in ways that, until recently, we thought required human intelligence [4].

These include:

Computer Vision

Language Processing

Creativity

Summarization

Machine Learning (ML):

As defined by Professor Tom Mitchell, machine learning refers to a scientific branch of AI, which focuses on the study of computer algorithms that allow computer programs to automatically improve through experience [3].

These include:

Classification

Neural Network

Clustering

Deep Learning:

Deep learning is a subset of machine learning in which layered neural networks, combined with high computing power and large datasets, can create powerful machine learning models. [3]

Why do we prefer Python to implement machine learning algorithms?

Python is a popular and general-purpose programming language. We can write machine learning algorithms using Python, and it works well. The reason Python is so popular among data scientists is its diverse variety of already-implemented modules and libraries that make our lives easier.

Let us have a brief look at some exciting Python libraries.

Numpy: It is a math library to work with n-dimensional arrays in Python. It enables us to do computations effectively and efficiently.

Scipy: It is a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics, and much more. SciPy is a functional library for scientific and high-performance computing.

Matplotlib: It is a popular plotting package that provides 2D as well as 3D plotting.

Scikit-learn: It is a free machine learning library for the Python programming language. It offers most classification, regression, and clustering algorithms, and it interoperates with Python numerical libraries such as NumPy and SciPy.
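A minimal sketch touching each of these libraries (the data here is synthetic, generated purely for illustration):

```python
import numpy as np
from scipy import optimize
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# NumPy: efficient n-dimensional arrays and vectorized math.
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 3 * x.ravel() + 7 + np.random.RandomState(0).normal(0, 1, 50)

# SciPy: numerical algorithms, e.g. minimizing a function of one variable.
result = optimize.minimize_scalar(lambda t: (t - 2) ** 2)

# Scikit-learn: fit a linear model in two lines.
model = LinearRegression().fit(x, y)

# Matplotlib: 2D plotting of the data and the fitted line.
plt.scatter(x, y)
plt.plot(x, model.predict(x), c="r")
plt.close("all")  # close the figure; in a notebook you would call plt.show()

print(round(result.x, 2), round(float(model.coef_[0]), 1))
```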

Machine learning algorithms fall into two groups:

Supervised Learning algorithms

Unsupervised Learning algorithms

I. Supervised Learning Algorithms:

Goal: Predict class or value label.

Supervised learning is a branch of machine learning (perhaps the mainstream of machine/deep learning for now) concerned with inferring a function from labeled training data. The training data consists of a set of (input, target) pairs, where the input could be a vector of features, and the target specifies what we want the function to output. Depending on the type of target, we can roughly divide supervised learning into two categories: classification and regression. Classification involves categorical targets; examples range from simple cases, such as image classification, to advanced topics, such as machine translation and image captioning. Regression involves continuous targets; its applications, such as stock prediction and image masking, all fall into this category.

To understand what supervised learning is, consider an example. Suppose we give a child 100 stuffed animals, ten of each kind: ten lions, ten monkeys, ten elephants, and so on. Next, we teach the kid to recognize the different types of animals based on their characteristics (features). For instance, if its color is orange, it might be a lion; if it is a big animal with a trunk, it may be an elephant.

Teaching the kid how to differentiate animals is an example of supervised learning. Afterward, when we give the kid new animals, he should be able to classify them into the appropriate animal group.

For the sake of this example, suppose 8 out of 10 of his classifications were correct. We can say that the kid has done a pretty good job. The same applies to computers: we provide them with thousands of data points along with their actual labels (labeled data is data classified into different groups together with its feature values). The machine then learns from these characteristics during its training period. After the training period is over, we can use the trained model to make predictions. Keep in mind that we fed the machine labeled data, so its prediction algorithm is based on supervised learning. In short, the predictions in this example are based on labeled data.
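The kid-and-animals analogy maps directly onto code. A minimal sketch using scikit-learn's bundled iris dataset (our choice of example, not from the article): the model trains on labeled examples, then classifies examples it has never seen, and we check what fraction it got right:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labeled data: each flower comes with its correct species (the label).
X, y = load_iris(return_X_y=True)

# Training period: the model learns from (input, target) pairs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# After training, the model classifies examples it has never seen:
predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))
```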

Examples of supervised learning algorithms:

Linear Regression

Logistic Regression

K-Nearest Neighbors

Decision Tree

Random Forest

Support Vector Machine

II. Unsupervised Learning:

Goal: Determine data patterns/groupings.

In contrast to supervised learning, unsupervised learning infers from unlabeled data a function that describes hidden structures in the data.

Perhaps the most basic type of unsupervised learning comprises dimensionality reduction methods such as PCA and t-SNE: PCA is generally used in data preprocessing, while t-SNE is usually used in data visualization.
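A minimal PCA sketch (the synthetic data is of our own making): five observed features that really live in two underlying directions are reduced to two components:

```python
import numpy as np
from sklearn.decomposition import PCA

# 100 samples with 5 correlated features, built from only 2 underlying directions:
rng = np.random.RandomState(0)
base = rng.normal(size=(100, 2))
data = base @ rng.normal(size=(2, 5))

# PCA reduces the 5 features to the 2 directions that explain the variance:
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)
print(reduced.shape)             # (100, 2)
print(pca.explained_variance_ratio_.sum())  # near 1.0 for this data
```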

A more advanced branch is clustering, which explores the hidden patterns in data and then makes predictions on them; examples include K-means clustering, Gaussian mixture models, hidden Markov models, and others.

Along with the renaissance of deep learning, unsupervised learning gains more and more attention because it frees us from manually labeling data. In light of deep learning, we consider two kinds of unsupervised learning: representation learning and generative models.

Representation learning aims to distill a high-level representative feature that is useful for some downstream tasks, while generative models intend to reproduce the input data from some hidden parameters.

Unsupervised learning works as it sounds: in this type of algorithm, we do not have labeled data, so the machine has to process the input data and draw conclusions about the output on its own. For example, remember the kid to whom we gave a shape toy? In this case, he would learn from his own mistakes to find the right hole for each shape.

The catch is that we are not teaching the child how to fit the shapes (what in machine learning would be called labeled data). Instead, the child learns from the toy's different characteristics and tries to draw conclusions about them. In short, the predictions are based on unlabeled data.
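A minimal clustering sketch (synthetic, unlabeled data of our own making): K-means recovers the two groups without ever seeing labels:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two blobs of points with no group labels attached.
rng = np.random.RandomState(0)
data = np.vstack([rng.normal(0, 0.5, (50, 2)),
                  rng.normal(5, 0.5, (50, 2))])

# The algorithm must infer the grouping on its own:
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(kmeans.cluster_centers_)  # close to (0, 0) and (5, 5)
```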

Examples of unsupervised learning algorithms:

Dimension Reduction

Density Estimation

Generative adversarial networks (GANs)

Market Basket Analysis

Clustering

For this article, we will use a few types of regression algorithms with coding samples.

1. Linear Regression:

Linear regression is a statistical approach that models the relationship between input features and an output. The input features are called the independent variables, and the output is called the dependent variable. Our goal is to predict the value of the output from the input features by multiplying each feature by its optimal coefficient.
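The "multiply inputs by their coefficients" idea can be shown with plain arithmetic (the coefficient values below are made up purely for illustration):

```python
# Hypothetical optimal coefficients for a model y = b0 + b1*x1 + b2*x2:
b0, b1, b2 = 2.0, 3.0, -1.0

# One instance with two input feature values:
x1, x2 = 4.0, 1.5

# The prediction is the intercept plus each feature times its coefficient:
y = b0 + b1*x1 + b2*x2
print(y)  # 2 + 12 - 1.5 = 12.5
```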

Some real-life examples of linear regression:

(1) To predict sales of products.

(2) To predict economic growth.

(3) To predict petroleum prices.

(4) To predict the emission of a new car.

(5) Impact of GPA on college admissions.

There are two types of linear regression:

Simple Linear Regression

Multivariable Linear Regression

1.1 Simple Linear Regression:

In simple linear regression, we predict the output (dependent variable) based on only one input feature. The simple linear regression model is given by:

Y = β₀ + β₁X

where β₀ is the intercept and β₁ is the coefficient of the input feature X.

Below we are going to implement simple linear regression using the sklearn library in Python.

Step by step implementation in Python:

a. Import required libraries:

Since we are going to use various libraries for calculations, we need to import them.

b. Read the CSV file:

We check the first five rows of our dataset. In this case, we are using a vehicle model dataset — please check out the dataset on Github.

c. Select the features we want to consider in predicting values:

Here our goal is to predict the value of “co2 emissions” from the value of “engine size” in our dataset.

d. Plot the data:

We can visualize our data on a scatter plot.

e. Divide the data into training and testing data:

To check the accuracy of a model, we are going to divide our data into training and testing datasets. We will use training data to train our model, and then we will check the accuracy of our model using the testing dataset.

f. Training our model:

Here is how we can train our model and find the coefficients for our best-fit regression line.

g. Plot the best fit line:

Based on the coefficients, we can plot the best fit line for our dataset.

h. Prediction function:

We are going to use a prediction function for our testing dataset.

i. Predicting co2 emissions:

Predicting the values of co2 emissions based on the regression line.

j. Checking accuracy for test data :

We can check the accuracy of a model by comparing the actual values with the predicted values in our dataset.

Putting it all together:

# Import required libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import r2_score

# Read the CSV file:
data = pd.read_csv("Fuel.csv")
data.head()

# Let's select some features to explore more:
data = data[["ENGINESIZE","CO2EMISSIONS"]]

# Generating training and testing data from our data:
# We are using 80% data for training.
train = data[:(int((len(data)*0.8)))]
test = data[(int((len(data)*0.8))):]

# Modeling:
# Using sklearn package to model data:
regr = linear_model.LinearRegression()
train_x = np.array(train[["ENGINESIZE"]])
train_y = np.array(train[["CO2EMISSIONS"]])
regr.fit(train_x, train_y)

# Plot the best fit line over the training data:
plt.scatter(train_x, train_y)
plt.plot(train_x, regr.predict(train_x), c="r")
plt.show()

# Predicting CO2 emissions for the testing data:
test_x = np.array(test[["ENGINESIZE"]])
test_y = np.array(test[["CO2EMISSIONS"]])
Y_pred = regr.predict(test_x)

# Checking accuracy (R² score) for the test data:
print("R²:", r2_score(test_y, Y_pred))

1.2 Multivariable Linear Regression:

In simple linear regression, we could only consider one input feature for predicting the value of the output. In multivariable linear regression, however, we can predict the output based on more than one input feature. The formula for multivariable linear regression is:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ

Step by step implementation in Python:

a. Import the required libraries:

b. Read the CSV file :

c. Define X and Y:

X stores the input features we want to consider, and Y stores the value of output.

d. Divide data into a testing and training dataset:

Here we are going to use 80% data in training and 20% data in testing.

e. Train our model :

Here we are going to train our model with 80% of the data.

f. Find the coefficients of input features :

Now we need to know which feature has a more significant effect on the output variable. For that, we are going to print the coefficient values. Note that a negative coefficient means the feature has an inverse effect on the output, i.e., if the value of that feature increases, then the output value decreases.

g. Predict the values:

h. Accuracy of the model:

Notice that we used the same dataset for simple and multivariable linear regression, and that the accuracy of multivariable linear regression is far better than that of simple linear regression.

Putting it all together:

# Import the required libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import r2_score

# Read the CSV file:
data = pd.read_csv("Fuel.csv")
data.head()

# Consider features we want to work on:
features = ['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_CITY', 'FUELCONSUMPTION_HWY',
            'FUELCONSUMPTION_COMB', 'FUELCONSUMPTION_COMB_MPG']
X = data[features]
Y = data["CO2EMISSIONS"]

# Generating training and testing data from our data:
# We are using 80% data for training.
train = data[:(int((len(data)*0.8)))]
test = data[(int((len(data)*0.8))):]

train_x = np.array(train[features])
train_y = np.array(train["CO2EMISSIONS"])
test_x = np.array(test[features])
test_y = np.array(test["CO2EMISSIONS"])

# Modeling:
# Using sklearn package to model data:
regr = linear_model.LinearRegression()
regr.fit(train_x, train_y)

# Print the coefficient of each input feature:
print(regr.coef_)

# Now let's do prediction of data:
Y_pred = regr.predict(test_x)

# Check accuracy:
R = r2_score(test_y, Y_pred)
print("R²:", R)

1.3 Polynomial Regression:

Sometimes we have data that does not merely follow a linear trend but instead follows a polynomial trend. In such cases, we use polynomial regression.

Before digging into its implementation, we need to know how the graphs of some primary polynomial data look.

Polynomial Functions and Their Graphs:

a. Graph for Y=X:

b. Graph for Y = X²:

c. Graph for Y = X³:

d. Graph with more than one polynomial term: Y = X³+X²+X:

In the graph above, the red dots show the graph for Y = X³+X²+X, and the blue dots show the graph for Y = X³. Here we can see that the most prominent power dominates the shape of our graph.
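The four graphs above can be reproduced with a few lines of Matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.linspace(-2, 2, 100)

# The highest power dominates the shape of the curve:
for label, Y in [("Y = X", X), ("Y = X^2", X**2),
                 ("Y = X^3", X**3), ("Y = X^3 + X^2 + X", X**3 + X**2 + X)]:
    plt.plot(X, Y, label=label)
plt.legend()
plt.show()
```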

Below is the formula for polynomial regression:

In the previous regression models, we used the scikit-learn library for the implementation. This time, we are going to use the Normal Equation instead. Note that we could use scikit-learn to implement polynomial regression as well, but this method will give us insight into how it works.

The equation goes as follows:

θ = (XᵀX)⁻¹ XᵀY

In the equation above:

θ: the hypothesis parameters that best fit the data.

X: input feature value of each instance.

Y: Output value of each instance.

1.3.1 Hypothesis Function for Polynomial Regression

The main matrix in the Normal Equation has one row per instance, containing 1 followed by the powers of x: [1, x, x², x³].

Step by step implementation in Python:

a. Import the required libraries:

b. Generate the data points:

We are going to generate a dataset for implementing our polynomial regression.

c. Initialize x,x²,x³ vectors:

We are taking the maximum power of x as 3. So our X matrix will have X, X², X³.

d. Column-1 of X matrix:

The 1st column of the main matrix X will always be 1 because it holds the coefficient of beta_0.

e. Form the complete x matrix:

Look at the matrix X at the start of this implementation. We are going to create it by appending vectors.

f. Transpose of the matrix:

We are going to calculate the value of theta step-by-step. First, we need to find the transpose of the matrix.

g. Matrix multiplication:

After finding the transpose, we need to multiply it with the original matrix. Keep in mind that we are going to implement it with a normal equation, so we have to follow its rules.

h. The inverse of a matrix:

Finding the inverse of the matrix and storing it in temp1.

i. Matrix multiplication:

Finding the multiplication of transposed X and the Y vector and storing it in the temp2 variable.

j. Coefficient values:

To find the coefficient values, we need to multiply temp1 and temp2. See the Normal Equation formula.

k. Store the coefficients in variables:

Storing those coefficient values in different variables.

l. Plot the data with curve:

Plotting the data with the regression curve.

m. Prediction function:

Now we are going to predict the output using the regression curve.

n. Error function:

Calculate the error using mean squared error function.

o. Calculate the error:

Putting it all together:

# Import required libraries:
import numpy as np
import matplotlib.pyplot as plt

# Generate sample data points that follow a cubic trend:
x = np.arange(-5, 5, 0.1)
y = x**3 + x**2 + x + np.random.normal(0, 5, len(x))
n = len(x)

# Initialize the x, x², x³ vectors:
x1, x2, x3 = x, x**2, x**3

# Form the complete X matrix: a column of 1s (for beta_0), then x, x², x³:
X = np.column_stack((np.ones(n), x1, x2, x3))

# Normal Equation: theta = inverse(Xᵀ.X) . Xᵀ.y
x_transpose = np.transpose(X)               # transpose of the matrix
temp1 = np.linalg.inv(x_transpose.dot(X))   # inverse of (Xᵀ X)
temp2 = x_transpose.dot(y)                  # Xᵀ . y
theta = temp1.dot(temp2)

# Store the coefficients in variables:
beta_0, beta_1, beta_2, beta_3 = theta

# Prediction function:
def prediction(x1, x2, x3, beta_0, beta_1, beta_2, beta_3):
    y_pred = beta_0 + beta_1*x1 + beta_2*x2 + beta_3*x3
    return y_pred

# Making predictions:
pred = prediction(x1, x2, x3, beta_0, beta_1, beta_2, beta_3)

# Plot the data with the regression curve:
plt.scatter(x, y)
plt.plot(x, pred, c="r")
plt.show()

# Calculate accuracy of model (mean squared error):
def err(y_pred, y):
    var = (y - y_pred)
    var = var*var
    MSE = var.sum()/len(var)
    return MSE

# Calculating the error:
error = err(pred, y)
print(error)

1.4 Exponential Regression:

Sometimes our dataset follows a trend that increases slowly at first, but as time goes on, the rate of increase grows exponentially. That is when we can use exponential regression.

Some real-life examples of exponential growth:

1. Microorganisms in cultures.

2. Spoilage of food.

3. Human Population.

4. Compound Interest.

5. Pandemics (Such as Covid-19).

6. Ebola Epidemic.

7. Invasive Species.

8. Fire.

9. Cancer Cells.

10. Smartphone Uptake and Sale.

The formula for exponential regression is as follows:

y = a·bˣ

In this case, we are going to use SciPy's curve_fit to find the optimal coefficient values a and b.

Step by step implementation in Python

a. Import the required libraries:

b. Insert the data points:

c. Implement the exponential function algorithm:

d. Apply optimal parameters and covariance:

Here we use curve_fit to find the optimal parameter values. It returns two variables, popt and pcov.

popt stores the optimal parameter values, and pcov stores their covariances. We can see that the popt variable has two values: our optimal parameters. We are going to use those parameters to plot our best-fit curve, as shown below.

e. Plot the data:

Plotting the data with the coefficients found.

f. Check the accuracy of the model:

Check the accuracy of the model with r2_score.

Putting it all together:

# Import required libraries:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# Dataset values:
day = np.arange(0,8)
weight = np.array([251,209,157,129,103,81,66,49])

# Exponential Function:
def expo_func(x, a, b):
    return a * b ** x

# popt: Optimal values for the parameters
# pcov: The estimated covariance of popt
popt, pcov = curve_fit(expo_func, day, weight)
weight_pred = expo_func(day, *popt)

# Plotting the data
plt.plot(day, weight_pred, 'r-')
plt.scatter(day, weight, label='Day vs Weight')
plt.title("Day vs Weight a*b^x")
plt.xlabel('Day')
plt.ylabel('Weight')
plt.legend()
plt.show()

# Equation
a = popt[0].round(4)
b = popt[1].round(4)
print(f'The equation of regression line is y={a}*{b}^x')

1.5 Sinusoidal Regression:

Some real-life examples of sinusoidal regression:

Generation of music waves.

Sound travels in waves.

Trigonometric functions in constructions.

Used in space flights.

GPS location calculations.

Architecture.

Electrical current.

Radio broadcasting.

Low and high tides of the ocean.

Buildings.

Sometimes we have data that shows patterns like a sine wave; in such scenarios, we use sinusoidal regression. The formula for the algorithm is:

Y = A·sin(B(X + C)) + D

where A is the amplitude, 2π/B is the period (the length of one cycle), C is the phase shift, and D is the vertical shift.

Step by step implementation in Python:

a. Generating the dataset:

b. Applying a sine function:

Here we have created a function called "calc_sine" to calculate the value of the output based on the optimal coefficients. We will use SciPy's curve_fit to find those optimal parameters.

c. Why does a sinusoidal regression perform better than linear regression?

If we check the accuracy of the model after fitting our data with a straight line, we can see that the accuracy in prediction is less than that of sine wave regression. That is why we use sinusoidal regression.

Putting it all together:

# Import required libraries:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score

# Generating dataset:

# Y = A*sin(B(X + C)) + D
# A = Amplitude
# Period = 2*pi/B
# Period = Length of One Cycle
# C = Phase Shift (In Radian)
# D = Vertical Shift

X = np.linspace(0,1,100) # (Start, End, Points)

# Here...
# A = 1
# B = 2*pi (B = 2*pi/Period, Period = 1)
# C = 0
# D = 0
Y = 1*np.sin(2*np.pi*X)

# Adding some Noise:
Noise = 0.4*np.random.normal(size=100)
Y_data = Y + Noise

plt.scatter(X, Y_data, c="r")

# Function to calculate the value:
def calc_sine(x, a, b, c, d):
    return a * np.sin(b*(x + np.radians(c))) + d

# curve_fit returns optimized parameters for our function:
# popt stores the optimal parameters
# pcov stores the covariance between the parameters
# (an initial guess p0 near the expected parameters helps it converge)
popt, pcov = curve_fit(calc_sine, X, Y_data, p0=[1, 2*np.pi, 0, 0])

# Plot the main data:
plt.scatter(X, Y_data)
# Plot the best fit curve:
plt.plot(X, calc_sine(X, *popt), c="r")
plt.show()

# Check the accuracy:
Accuracy = r2_score(Y_data, calc_sine(X, *popt))
print("Accuracy of Sinusoidal Model:", Accuracy)

# For comparison, fit a straight line to the same data:
def calc_line(X, m, b):
    return b + X*m

popt_line, pcov_line = curve_fit(calc_line, X, Y_data)

# Plot the main data:
plt.scatter(X, Y_data)
# Plot the best fit line:
plt.plot(X, calc_line(X, *popt_line), c="r")
plt.show()

# Check the accuracy of the linear model:
Accuracy = r2_score(Y_data, calc_line(X, *popt_line))
print("Accuracy of Linear Model:", Accuracy)

1.6 Logarithmic Regression:

Some real-life examples of logarithmic growth:

The magnitude of earthquakes.

The intensity of sound.

The acidity of a solution.

The pH level of solutions.

Yields of chemical reactions.

Production of goods.

Growth of infants.

A COVID-19 graph.

Sometimes we have data that initially grows rapidly, but after a certain point it flattens out. In such a case, we can use logarithmic regression, which follows the formula:

Y = a + b·ln(X)

Step by step implementation in Python:

a. Import required libraries:

b. Generating the dataset:

c. The first column of our matrix X :

Here we will use our normal equation to find the coefficient values.

d. Reshaping X:

e. Going with the Normal Equation formula:

f. Forming the main matrix X:

g. Finding the transpose matrix:

h. Performing matrix multiplication:

i. Finding the inverse:

j. Matrix multiplication:

k. Finding the coefficient values:

l. Plot the data with the regression curve:

m. Accuracy:

Putting it all together:

# Import required libraries:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

# Dataset:
# Y = a + b*ln(X)
X = np.arange(1,50,0.5)
Y = 10 + 2*np.log(X)

# Adding some noise to calculate error!
Y_noise = np.random.rand(len(Y))
Y = Y + Y_noise
plt.scatter(X, Y)

# 1st column of our X matrix should be 1:
n = len(X)
x_bias = np.ones((n,1))

print(X.shape)
print(x_bias.shape)

# Reshaping X:
X = np.reshape(X,(n,1))
print(X.shape)

# Going with the formula:
# Y = a + b*ln(X)
X_log = np.log(X)

# Append the X_log to x_bias:
x_new = np.append(x_bias, X_log, axis=1)

# Transpose of a matrix:
x_new_transpose = np.transpose(x_new)

# Matrix multiplication:
x_new_transpose_dot_x_new = x_new_transpose.dot(x_new)

# Find the inverse:
temp_1 = np.linalg.inv(x_new_transpose_dot_x_new)

# Matrix multiplication with the Y vector:
temp_2 = x_new_transpose.dot(Y)

# Coefficient values (Normal Equation: theta = temp_1 . temp_2):
theta = temp_1.dot(temp_2)
a, b = theta

# Plot the data with the regression curve:
Y_plot = a + b*np.log(X)
plt.scatter(X, Y)
plt.plot(X, Y_plot, c="r")
plt.show()

# Accuracy:
Accuracy = r2_score(Y, Y_plot)
print(Accuracy)

Ensuring Success Starting a Career in Machine Learning (ML)

Published May 27, 2020. Author(s): Roberto Iriondo

Machine learning (ML) careers in industry and academia are in high demand. How do you ensure you can succeed in such a competitive…

]]>https://towardsai.net/p/machine-learning/key-machine-learning-ml-definitions-43e837ec6add/feed0Best Machine Learning Blogs to Follow in 2020
https://towardsai.net/p/machine-learning/best-machine-learning-blogs-6730ea2df3bd
https://towardsai.net/p/machine-learning/best-machine-learning-blogs-6730ea2df3bd#respondWed, 06 May 2020 08:19:32 +0000https://towardsai.net/?p=3593Keep up with the best and the latest machine learning (ML) research blogs through reliable sources

From researchers to students, industry experts, and machine learning (ML) enthusiasts, keeping up with the best and the latest machine learning research is a matter of finding reliable sources of scientific work. While blogs usually update in a more informal and conversational style, we have found that the sources in this list are accurate, resourceful, and reliable outlets of machine learning research, fit for anyone interested in learning more about the scientific field of ML.

Please know that the blogs listed below are by no means ranked or in a particular order. They are all incredible sources of machine learning research. Please let us know in the comments if you know of any other reliable blog sources in machine learning.

The machine learning blog at Carnegie Mellon University, ML@CMU, provides an accessible, general-audience medium for researchers to communicate research findings, perspectives on the field of machine learning, and various updates, both to experts and the general audience. Posts are from students, postdocs, and faculty at Carnegie Mellon [1].

Distill is an academic journal in the area of machine learning. The distinguishing trait of a Distill article is outstanding communication and a dedication to human understanding. Distill articles often, but not always, use interactive media. Most, if not all, articles published at Distill take 100+ hours to prepare for publication [2].

Google AI conducts research that advances the state-of-the-art in the field. Google AI (or Google.ai) is a division of Google dedicated solely to artificial intelligence. It was announced at Google’s conference I/O 2017 by CEO Sundar Pichai [3]. The Google AI blog has a section specifically for machine learning research [4].

The BAIR blog provides an accessible, general-audience medium for researchers to communicate research findings, perspectives on the field, and various updates. Posts are from students, postdocs, and faculty in BAIR, and the blog intends to provide a relevant and timely discussion of research findings and results, both to experts and the general audience [5].

OpenAI is a research laboratory based in San Francisco, California. Their mission is to discover and enact the path to safe artificial general intelligence (AGI) that benefits all of humanity [8]. The OpenAI blog brings state-of-the-art research in the field.

The Machine Learning (Theory) blog is an experiment in the application of a blog to academic research in machine learning and learning theory by machine learning researcher John Langford [6]. He has emphasized that the field of machine learning "is shifting from an academic discipline to an industrial tool" [7].

DeepMind works on some of the most complex and exciting challenges in AI. Their world-class research has resulted in hundreds of peer-reviewed papers, including in Nature and Science [9].

MIT often produces state-of-the-art research in the field of machine learning. This filtered news stream provides the latest news and research on what's happening in the field of machine learning at MIT.

Christopher Olah describes himself as a wandering machine learning researcher, looking to understand things clearly and explain them well [10]. Olah is a researcher at OpenAI, and formerly at Google AI. His blog features thorough and exciting articles for the machine learning researcher and enthusiast — a gold mine of free, open machine learning research.

Facebook AI is known for working on state-of-the-art research in the field. Their research areas focus on computer vision, conversational AI, integrity, NLP, ranking, and recommendations, systems research, machine learning theory, speech, and audio, along with human and machine intelligence. The Facebook AI Blog encompasses excellent content, from blog posts to research publications [12].

Amazon Web Services (AWS) is one of the most used cloud services around the world, offering reliable, scalable, and accessible cloud computing services. Their research team publishes blog posts on state-of-the-art machine learning research and ML applications on the AWS blog [11].

If you happen to know of any other reliable machine learning blogs, please let me know in the comments. Thank you for reading!

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University, nor other companies (directly or indirectly) associated with the author(s). These writings do not intend to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

What is Machine Learning?

Demystifying Machine Learning | Part I

Learn what is machine learning, how it works and its importance in five minutes

April 30, 2019, by Roberto Iriondo — Last updated: May 15, 2019

Who should read this article?

Anyone who is curious and wants a truly simple, yet accurate, overview of the definition of machine learning, how it works, and its importance. We will go through each of the pertinent questions raised above by distilling technical definitions from machine learning pioneers and industry leaders to present you with a truly simple introduction to the amazing scientific field of machine learning.

A glossary of terms can be found at the bottom of the article, along with a small set of resources for further learning, references, and disclosures.

If the above applies to you, read on!

What is machine learning?

The scientific field of machine learning (ML) is a branch of artificial intelligence, as defined by Computer Scientist and machine learning pioneer [1] Tom M. Mitchell: “Machine learning is the study of computer algorithms that allow computer programs to automatically improve through experience [2].”

An algorithm can be thought of as a set of rules/instructions that a computer programmer specifies, which a computer is able to process. Simply put, machine learning algorithms learn by experience, similar to how humans do. For example, after having seen multiple examples of an object, a computer employing a machine learning algorithm can become able to recognize that object in new, previously unseen scenarios.

How does machine learning work?

In the video above [3], Head of Facebook AI Research, Yann LeCun simply explains how machine learning works with easy to follow examples. Machine learning utilizes a variety of techniques to intelligently handle large and complex amounts of information to make decisions and/or predictions.

In practice, the patterns that a computer (machine learning system) learns can be very complicated and difficult to explain. Consider searching for dog images on Google search — as seen in the image below, Google is incredibly good at bringing relevant results, yet how does Google search achieve this task? In simple terms, Google search first gets a large number of examples (an image dataset) of photos labeled "dog" — then the computer (machine learning system) looks for patterns of pixels and colors that help it guess (predict) whether the queried image is indeed a dog.

At first, Google's computer guesses randomly at which patterns are good for identifying an image of a dog. If it makes a mistake, a set of adjustments is made so that the computer gets it right. In the end, this collection of patterns is learned by a large computer system modeled after the human brain (a deep neural network) that, once trained, can correctly identify and return accurate results for dog images on Google search, along with anything else you could possibly think of. This process is called the training phase of a machine learning system.

Imagine that you were in charge of building a machine learning prediction system to distinguish images of dogs from images of cats. The first step, as we explained above, would be to gather a large quantity of images labeled "dog" for dogs and "cat" for cats. Second, we would train the computer to look for patterns in the images in order to identify dogs and cats, respectively.

Once the machine learning model has been trained [7], we can feed it (input) different images to see if it can correctly identify dogs and cats. As seen in the image above, a trained machine learning model can (most of the time) correctly identify such queries.

Why is machine learning important?

Machine learning is incredibly important nowadays. First, because it can solve complicated real-world problems in a scalable way. Second, because it has disrupted a variety of industries within the past decade [9] and will continue to do so in the future, as more and more industry leaders and researchers specialize in machine learning and take what they have learned to continue their research and/or develop machine learning tools that positively impact their own fields. Third, artificial intelligence has the potential to add around 16%, or roughly $13 trillion, to the US economy by 2030 [18]. The rate at which machine learning is already causing positive impact is impressive [10] [11] [12] [13] [14] [15] [16], thanks largely to the dramatic change in data storage and computing processing power [17]. As more people become involved, we can only expect this progress across different fields to continue [6].

Future work: In an upcoming article we will discuss the types of machine learning in simple terms, how they are currently being used by academia and industry alike with real-world examples of such.

Acknowledgments:

The author would like to thank Anthony Platanios, Doctoral Researcher with the Machine Learning Department at Carnegie Mellon University for constructive criticism, along with editorial comments in preparation of this article.

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University, nor other companies (directly or indirectly) associated with the author(s). These writings are not intended to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

]]>https://towardsai.net/p/machine-learning/what-is-machine-learning-ml-b58162f97ec7/feed0Breaking CAPTCHA Using Machine Learning in 0.05 Seconds
https://towardsai.net/p/machine-learning/breaking-captcha-using-machine-learning-in-0-05-seconds-9feefb997694
Published: Wed, 19 Dec 2018 09:00:15 +0000

A machine learning model breaks the CAPTCHA systems of 33 highly visited websites. The approach is based on GANs.

Machine Learning vs. AI, Important Differences Between Them
https://towardsai.net/p/machine-learning/differences-between-ai-and-machine-learning-1255b182fc6
Published: Mon, 15 Oct 2018 21:45:48 +0000

Unfortunately, some tech organizations are deceiving customers by proclaiming to use AI in their technologies while not being clear about their products’ limits.

October 15, 2018, by Roberto Iriondo — Last updated: August 23, 2019

Recently, a report was released regarding the misuse by companies claiming to use artificial intelligence [29] [30] in their products and services. According to The Verge [29], 40% of European startups that claimed to use AI do not actually use the technology. Last year, TechTalks also stumbled upon such misuse by companies claiming to use machine learning and advanced artificial intelligence to gather and examine thousands of users’ data to enhance the user experience of their products and services [2] [33].

Unfortunately, there is still a lot of confusion among the public and the media regarding what artificial intelligence truly is [44] and what machine learning truly is [18]. Often, the terms are used as synonyms; in other cases, they are treated as discrete, parallel advancements; and others take advantage of the trend to create hype and excitement to increase sales and revenue [2] [31] [32] [45].

Below we will go through some main differences between AI and machine learning.

What is machine learning?

Quoting Tom M. Mitchell, Interim Dean of the School of Computer Science, Professor, and Former Chair of the Machine Learning Department at Carnegie Mellon University:

A scientific field is best defined by the central question it studies. The field of Machine Learning seeks to answer the question:

“How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?” [1]

Machine learning (ML) is a branch of artificial intelligence. As defined by Computer Scientist and machine learning pioneer [19] Tom M. Mitchell: “Machine learning is the study of computer algorithms that allow computer programs to automatically improve through experience.” [18] ML is one of the ways we expect to achieve AI. Machine learning relies on working with small to large data-sets, examining and comparing the data to find common patterns and explore nuances.

For instance, if you provide a machine learning model with many songs that you enjoy, along with their corresponding audio statistics (danceability, instrumentality, tempo, or genre), it will be able (depending on the supervised machine learning model used) to generate a recommender system [43] that suggests music you will, with high probability, enjoy in the future, similar to what Netflix, Spotify, and other companies do [20] [21] [22].

In a simple example, if you load a machine learning program with a considerably large data-set of x-ray pictures along with their descriptions (symptoms, items to consider, and so on), it will have the capacity to assist (or perhaps automate) the data analysis of x-ray pictures later on. The machine learning model looks at each one of the pictures in the diverse data-set and finds common patterns among pictures that have been labeled with comparable indications. Furthermore (assuming that we use a good ML algorithm for images), when you load the model with new pictures, it compares their features with the examples it has gathered before in order to tell you how likely it is that the pictures contain any of the indications it has analyzed previously.

The type of machine learning in our previous example is called “supervised learning,” where supervised learning algorithms try to model relationships and dependencies between the target prediction output and the input features, such that we can predict the output values for new data based on the relationships learned from the data-sets the model was previously fed [15].
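To make the supervised setup concrete, here is a minimal sketch (hypothetical data and function names, not the pipeline any particular service uses): a k-nearest-neighbors classifier that labels a new song by majority vote among the most similar labeled songs it has already seen.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict a label for `query` from the k nearest labeled examples.

    `train` is a list of (features, label) pairs; similarity is the
    Euclidean distance between feature vectors.
    """
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    top_labels = [label for _, label in by_distance[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Toy data: (tempo, danceability) -> whether the listener enjoyed the song
songs = [
    ((120, 0.80), "liked"), ((125, 0.90), "liked"), ((118, 0.75), "liked"),
    ((60, 0.20), "disliked"), ((70, 0.30), "disliked"), ((65, 0.25), "disliked"),
]
print(knn_predict(songs, (122, 0.85)))  # a fast, danceable song -> liked
```

Production recommenders are far more elaborate, but the core supervised idea is the same: generalize from labeled examples to new inputs.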

Unsupervised learning, another type of machine learning, covers the family of machine learning algorithms mainly used in pattern detection and descriptive modeling. These algorithms have no output categories or labels on the data (the model is trained with unlabeled data).
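Unsupervised pattern detection can be sketched with a naive k-means clustering routine (illustrative data; initializing from the first k points is a simplification of what production implementations do):

```python
import math

def kmeans(points, k, iters=20):
    """Naive k-means: start from the first k points, then alternate between
    assigning each point to its nearest centroid and recomputing centroids."""
    centroids = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious groups of points; no labels are ever given to the algorithm.
points = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(points, k=2)
```

Despite receiving no labels, the algorithm recovers the two groups on its own, which is exactly the descriptive-modeling use case described above.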

Reinforcement learning, the third popular type of machine learning, aims at using observations gathered from interaction with the environment to take actions that maximize the reward or minimize the risk. In this case, the reinforcement learning algorithm (called the agent) continuously learns from its environment through iteration. A great example of reinforcement learning is computers reaching a super-human state and beating humans at computer games [3].
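A toy reinforcement-learning sketch (a hypothetical corridor environment, far simpler than the game-playing agents cited above): a tabular Q-learning agent that learns, purely from rewards, to walk toward the goal state.

```python
import random

def q_learn(n_states=5, episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=1):
    """Tabular Q-learning on a corridor of states 0..n_states-1.
    The agent starts at 0; stepping into the last state yields reward +1."""
    random.seed(seed)
    actions = (-1, +1)  # step left or step right
    q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy: mostly exploit the current Q-table, sometimes explore
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: q[(s, act)])
            s2 = min(max(s + a, 0), n_states - 1)
            r = 1.0 if s2 == n_states - 1 else 0.0
            best_next = max(q[(s2, b)] for b in actions)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s2
    return q

q = q_learn()
# Greedy policy for each non-terminal state; after training the agent
# should have learned to step right (+1) everywhere.
policy = [max((-1, +1), key=lambda a: q[(s, a)]) for s in range(4)]
```

The reward is only observed at the goal, yet the Q-updates propagate it backward until every state prefers moving toward it, which is the core of the trial-and-error learning described above.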

Machine learning is mesmerizing, particularly its advanced sub-branches, i.e., deep learning and the various types of neural networks. In any case, it is not “magic” but grounded in Computational Learning Theory [16], regardless of whether the public at times has issues observing its internal workings. In fact, while some tend to compare deep learning and neural networks to the way the human brain works, there are essential differences between the two [2] [4] [46].

What is Artificial Intelligence (AI)?

Artificial intelligence, on the other hand, is vast in scope. According to Andrew Moore [6] [36] [47], former Dean of the School of Computer Science at Carnegie Mellon University, “Artificial intelligence is the science and engineering of making computers behave in ways that, until recently, we thought required human intelligence.”

That is a great way to define AI in a single sentence; however, it still shows how broad and vague the field is. Fifty years ago, a chess-playing program was considered a form of AI [34], since game theory, along with game strategies, were capabilities that only a human brain could perform. Nowadays, a chess game would be considered dull and antiquated, since one can be found on almost every computer’s OS [35]; therefore, “until recently” is something that progresses with time [36].

Assistant Professor and Researcher at CMU Zachary Lipton clarifies on Approximately Correct [7] that the term AI “is aspirational, a moving target based on those capabilities that humans possess but which machines do not.” AI also includes a considerable measure of technology advances that we already know; machine learning is only one of them. Prior works of AI utilized different techniques: for instance, Deep Blue, the AI that defeated the world’s chess champion in 1997, used a method called tree search [8] to evaluate millions of moves at every turn [2] [37] [52] [53].
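The tree search idea behind programs like Deep Blue can be sketched with a plain minimax routine (a toy illustration only; Deep Blue's actual engine added alpha-beta pruning and specialized evaluation hardware):

```python
def minimax(node, maximizing):
    """Plain minimax on an explicit game tree: a leaf is a numeric score,
    an internal node is a list of child positions; players alternate turns."""
    if isinstance(node, (int, float)):  # leaf: static evaluation of the position
        return node
    scores = [minimax(child, not maximizing) for child in node]
    return max(scores) if maximizing else min(scores)

# A tiny two-ply game: the maximizer picks a branch, then the minimizer replies.
tree = [[3, 5], [2, 9], [0, 7]]
print(minimax(tree, maximizing=True))  # → 3
```

The maximizer picks the first branch because the minimizer's best replies there (min(3, 5) = 3) beat those in the other branches; a chess engine does the same over millions of positions per turn.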

AI, as we know it today, is symbolized by human-AI interaction gadgets such as Google Home, Siri, and Alexa, and by the machine-learning-powered video prediction systems that power Netflix, Amazon, and YouTube. These technological advancements are progressively becoming essential in our daily lives. In fact, they are intelligent assistants that enhance our abilities as humans and professionals, making us more and more productive.

In contrast to machine learning, AI is a moving target [51], and its definition changes as its related technological advancements become further developed [7]. Possibly, within a few decades, today’s innovative AI advancements will be considered as dull as flip phones are to us right now.

Why do tech companies tend to use AI and ML interchangeably?

The term “artificial intelligence” was coined in 1956 by a group of researchers including Allen Newell and Herbert A. Simon [9]. Since then, the AI industry has gone through many fluctuations. In the early decades, there was a lot of hype surrounding the field, and many scientists concurred that human-level AI was just around the corner. However, undelivered assertions caused a general disenchantment with the industry and among the public, and led to the AI winter, a period when funding and interest in the field subsided considerably [2] [38] [39] [48].

Afterward, organizations attempted to distance themselves from the term AI, which had become synonymous with unsubstantiated hype, and utilized different terms to refer to their work. For instance, IBM described Deep Blue as a supercomputer and explicitly stated that it did not use artificial intelligence [10], while it actually did [23].

During this period, a variety of other terms, such as big data, predictive analytics, and machine learning, started gaining traction and popularity [40]. In 2012, machine learning, deep learning, and neural networks made great strides and started being utilized in a growing number of fields. Organizations suddenly started to use the terms machine learning and deep learning to advertise their products [41].

Deep learning started to perform tasks that were impossible with classic rule-based programming. Fields such as speech and face recognition, image classification, and natural language processing, which were at early stages, suddenly took great leaps [2] [24] [49]. In March 2019, three of the most recognized deep learning pioneers won the Turing Award thanks to the contributions and breakthroughs that have made deep neural networks a critical component of today’s computing [42].

Hence the momentum: we see a gearshift back to AI. For those who had been used to the limits of old-fashioned software, the effects of deep learning almost seemed like “magic” [16], especially since a fraction of the fields that neural networks and deep learning are entering were considered off-limits for computers. Machine learning and deep learning engineers are earning skyward salaries, even when working at non-profit organizations, which speaks to how hot the field is [50] [11].

Sadly, this is something media companies often report without profound examination, frequently accompanying AI articles with pictures of crystal balls and other supernatural portrayals. Such deception helps those companies generate hype around their offerings [27]. Yet, down the road, as they fail to meet expectations, these organizations are forced to hire humans to make up for the shortcomings of their so-called AI [12]. In the end, they might end up causing mistrust in the field and triggering another AI winter for the sake of short-term gains [2] [28].

I am always open to feedback; please share in the comments if you see something that may need revisiting. Thank you for reading!

Acknowledgments:

The author would like to extensively thank Ben Dickson, Software Engineer and Tech Blogger, for his kindness in allowing me to rely on his expertise and storytelling, along with several members of the AI community for the immense support and constructive criticism in preparation of this article.

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University, nor other companies (directly or indirectly) associated with the author(s). These writings are not intended to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

The Best Public Datasets for Machine Learning and Data Science
https://towardsai.net/p/machine-learning/best-datasets-for-machine-learning-and-data-science-d80e9f030279
Published: Tue, 02 Oct 2018 08:00:14 +0000

Best open-access datasets for machine learning, data science, sentiment analysis, computer vision, natural language processing (NLP), clinical data, and others.

This resource is continuously updated. If you know of any other suitable, open dataset, please let us know by emailing us at pub@towardsai.net or by dropping a comment below.

Google Dataset Search: Similar to how Google Scholar works, Dataset Search lets you find datasets wherever they are hosted, whether it’s a publisher’s site, a digital library, or an author’s web page. It’s a phenomenal dataset finder, and it contains over 25 million datasets.

Kaggle: Kaggle provides a vast container of datasets, sufficient for the enthusiast to the expert.

UCI Machine Learning Repository: The Machine Learning Repository at UCI provides an up-to-date resource for open-source datasets.

VisualData: Discover computer vision datasets by category; it allows searchable queries.

CMU Libraries: Discover high-quality datasets thanks to the collection of Huajin Wang, at CMU.

General Datasets

Housing Datasets

Boston Housing Dataset: Contains information collected by the US Census Service concerning housing in the area of Boston Mass. It was obtained from the StatLib archive and has been used extensively throughout the literature to benchmark algorithms.

Geographic Datasets

Google-Landmarks-v2: An improved dataset for landmark recognition and retrieval. This dataset contains 5M+ images of 200k+ landmarks from across the world, sourced and annotated by the Wiki Commons community.

Machine Learning Datasets:

Mall Customers Dataset: The Mall customers dataset contains information about people visiting the mall in a particular city. The dataset consists of various columns like gender, customer id, age, annual income, and spending score. It’s generally used to segment customers based on their age, income, and interest.

IRIS Dataset: The iris dataset is a simple and beginner-friendly dataset that contains information about the flower petal and sepal width. The data is divided into three classes, with 50 rows in each class. It’s generally used for classification and regression modeling.

MNIST Dataset: This is a database of handwritten digits. It contains 60,000 training images and 10,000 testing images. This is a perfect dataset to start implementing image classification where you can classify a digit from 0 to 9.

Fake News Detection Dataset: A CSV file with 7796 rows and four columns: news, title, news text, and result.

Wine quality dataset: The dataset contains different chemical information about the wine. The dataset is suitable for classification and regression tasks.

SOCR data — Heights and Weights Dataset: This is a basic dataset for beginners. It contains only the heights and weights of 25,000 humans, all 18 years of age. This dataset can be used to build a model that predicts the height or weight of a human.
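As a sketch of what such a model could look like (with made-up, exactly linear numbers rather than the actual SOCR data), ordinary least squares fits a line predicting weight from height:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (closed form)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical height (inches) / weight (pounds) pairs, not the real dataset
heights = [63, 65, 67, 69, 71, 73]
weights = [120, 127, 134, 141, 148, 155]
slope, intercept = fit_line(heights, weights)
print(slope, intercept)  # → 3.5 -100.5
```

With the real 25,000-row dataset the fit is noisy rather than exact, but the closed-form recipe is the same.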

Titanic Dataset: The dataset contains information like name, age, sex, number of siblings aboard, and other information about 891 passengers in the training set and 418 passengers in the testing set.

Credit Card Fraud Detection Dataset: The dataset contains transactions made by credit cards; they are labeled as fraudulent or genuine. This is important for companies that have transaction systems to build a model for detecting fraudulent activities.

Computer Vision Datasets

xView: xView is one of the most massive publicly available datasets of overhead imagery. It contains images from complex scenes around the world, annotated using bounding boxes.

ImageNet: The largest image dataset for computer vision. It provides an accessible image database that is organized hierarchically, according to WordNet.

Kinetics-700: A large-scale dataset of video URLs from YouTube, including human-centered actions. It contains over 700,000 videos.

Google’s Open Images: A vast dataset from Google AI containing over 10 million images.

Cityscapes Dataset: This is an open-source dataset for Computer Vision projects. It contains high-quality pixel-level annotations of video sequences taken in 50 different city streets. The dataset is useful in semantic segmentation and training deep neural networks to understand the urban scene.

IMDB-Wiki dataset: The IMDB-Wiki dataset is one of the most extensive open-source datasets for face images with labeled gender and age. The images are collected from IMDB and Wikipedia. It has five million-plus labeled images.

Color Detection Dataset: The dataset contains a CSV file that has 865 color names with their corresponding RGB(red, green, and blue) values of the color. It also has the hexadecimal value of the color.
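A minimal sketch of how such a color lookup works (illustrative rows, not the actual CSV contents): find the named color with the smallest Euclidean distance in RGB space.

```python
import math

# A few rows in the shape of the color-detection data (name, R, G, B)
COLORS = [
    ("black", 0, 0, 0), ("white", 255, 255, 255),
    ("red", 255, 0, 0), ("green", 0, 128, 0), ("blue", 0, 0, 255),
]

def nearest_color(r, g, b):
    """Return the named color closest to (r, g, b) in RGB space."""
    return min(COLORS, key=lambda row: math.dist((r, g, b), row[1:]))[0]

print(nearest_color(250, 10, 10))  # → red
```

Loading the full 865-row CSV instead of the inline list gives a complete color detector with the same nearest-neighbor lookup.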

Lexicoder Sentiment Dictionary: This dataset is specific for sentiment analysis. The dataset contains over 3000 negative words and over 2000 positive sentiment words.

IMDB reviews: An interesting dataset with over 50,000 movie reviews from Kaggle.

Enron Email Dataset: It contains around 0.5 million emails of over 150 users.

Recommender Systems Dataset: It contains various datasets from popular websites like Goodreads book reviews, Amazon product reviews, bartending data, data from social media, and others that are used in building a recommender system.

UCI Spambase Dataset: Classifying emails as spam or non-spam is a prevalent and useful task. The dataset contains 4601 emails and 57 meta-information about the emails. You can build models to filter out the spam.

IMDB reviews: The Large Movie Review Dataset consists of movie reviews from the IMDB website, with 25,000 reviews for the training set and 25,000 for the testing set.

Self-driving (Autonomous Driving) Datasets

Waymo Open Dataset: A fantastic dataset resource from the folks at Waymo, including a vast collection of autonomous driving data, enough to train deep nets from scratch.

Berkeley DeepDrive BDD100k: One of the largest datasets for self-driving cars, containing over 2000 hours of driving experiences across New York and California.

Cityscape Dataset: This is an extensive dataset that has street scenes in 50 different cities.

Clinical Datasets

COVID-19 Dataset: The Allen Institute for AI has released a vast research dataset of over 45,000 scholarly articles about COVID-19.

MIMIC-III: Openly available dataset developed by the MIT Lab for Computational Physiology, comprising de-identified health data associated with ~40,000 critical care patients. It includes demographics, vital signs, laboratory tests, medications, and more.

Datasets for Recommender Systems

MovieLens: It contains rating data sets from the MovieLens web site.

Jester: It contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users. It’s mostly used for collaborative filtering.

Million Song Dataset: It can be used for both collaborative and content-based filtering.

Note:

If you are aware of other high-quality, free datasets that you would recommend for research and application of machine learning, deep learning, or data science, please feel free to suggest them in the comments below or by emailing us directly at pub@towardsai.net.

If the resource is reliable, we will analyze it and include it in this list. Also, please let us know about your experience using any of these datasets in the comments section.

Happy learning!

Acknowledgments:

The authors would like to thank the members of Lionbridge and the largest AI community for the immense support, along with constructive criticism, in preparation of this resource.

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University. These writings do not intend to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.