Univariate Linear Regression From Scratch

Last Updated on January 6, 2023 by Editorial Team

Author(s): Erkan Hatipoğlu


With Code in Python

Univariate Linear Regression - Image by Author

Introduction

Linear Regression is usually one of the first subjects in a Machine Learning course: it is not very complex and is easy to understand. Moreover, it introduces many Machine Learning notions that reappear later in more complex concepts. So it is a reasonable starting point for a new Machine Learning basics series.

In this blog post, we will learn about Univariate Linear Regression, that is, Linear Regression with just one input variable. By the end of this article, we will understand fundamental Machine Learning concepts such as the hypothesis, the cost function, and gradient descent. In addition, we will write the code for those functions and train/test a linear model on a Kaggle dataset using Python [1].

Before diving into Univariate Linear Regression, let's briefly talk about some basic concepts:

Basic Concepts

Machine Learning is the subfield of Artificial Intelligence that uses data to program computers for certain tasks. A typical Machine Learning application uses a dataset and a learning algorithm to build a model that can then predict the outputs for new inputs to the application.

Machine Learning applications are generally divided into three categories:

In supervised learning [2], each example in the data has features that can be mapped to some label. Supervised learning has two types.

It is called classification if the set of labels is limited and the examples can be sorted into categories. Identifying animals or handwritten digits in images are examples of classification.

If the set of labels is not restricted to a small number of elements but can take on effectively infinitely many values, it is called regression. Predicting the price of houses in a region or the fuel consumption per kilometer of cars are regression examples.

In unsupervised learning [3], the dataset has no labels. During training, the model discovers patterns in the data on its own. Categorizing the customers of an e-commerce website is an example of unsupervised learning.

Reinforcement learning [4] uses intelligent agents that act according to reward and punishment mechanisms [5]. It is mainly used in computer gaming.

Linear Regression

As we have said above, regression maps inputs to infinitely many possible outputs. In addition, Linear Regression assumes a linear relationship between the inputs and the outcome [6]. This is challenging to visualize when we have more than one input variable: with one input variable we draw a line in the XY plane, with two input variables we construct a plane in a three-dimensional coordinate system, and with more than two it becomes extremely tough to imagine. As a result, we will demonstrate Linear Regression using univariate data. Below is the housing prices from Portland, Oregon dataset [7], which we will use as an example.

Housing Prices from Portland, Oregon - Image by Author

In the graph above, the y-axis is the price (K$) of some houses in Portland, and the x-axis is the house's size (feet²). We want to draw a straight line on this graph (such as the one in the image below) so that when we know the size of a new home not in this dataset, we can predict its price by using that line.

Housing Prices from Portland, Oregon with Hypothesis - Image by Author

For example, we can predict that the price of a 3000 feet² house is approximately 500 K$, as seen in the graph below.

Calculating New Inputs for Housing Prices from Portland, Oregon - Image by Author

So, without further ado, let's dive into the deeper concepts needed to understand how to draw that line.

Note that this article uses the Linear Regression dataset [1] and the Univariate Linear Regression from Scratch notebook [8] from Kaggle. As described on the dataset's content page, it has a training set of 700 elements and a test set of 300 elements. The data consists of pairs (x, y), where the x's are numbers between 0 and 100 and the y's have been generated using the Excel function NORMINV(RAND(), x, 3). It is also stated that the best estimate for y should be x.
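As a rough sketch, the data can be loaded as follows (the file names train.csv and test.csv and the column names x and y are assumptions based on the dataset description; rows with missing values are dropped):

```python
import pandas as pd

# Load the Kaggle "Linear Regression" dataset.
# File and column names are assumptions; adjust them to the actual dataset.
train = pd.read_csv("train.csv").dropna()
test = pd.read_csv("test.csv").dropna()

# Convert to NumPy arrays for the functions written below.
X_train, y_train = train["x"].to_numpy(), train["y"].to_numpy()
X_test, y_test = test["x"].to_numpy(), test["y"].to_numpy()
```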

First, we must write three functions to find a univariate linear regression solution for a dataset: one each for the hypothesis, the cost, and the gradient descent calculations. We will explain them one by one.

The Hypothesis

We aim to find a model we can use to make predictions for new inputs. A hypothesis is a candidate model for a machine learning system [9].

As seen in the figure below, we can find the hypothesis for a dataset by using a learning algorithm [10]. Later, we will talk about how well this hypothesis fits our data, but for now let's just define the hypothesis formula.

The Hypothesis - Image by Author

For a univariate linear regression task, the hypothesis is of the form:

The Hypothesis Formula - Image by Author

The point where the hypothesis line crosses the y-axis is called the intercept. This is θ₀ in our formula, and the slope of the hypothesis line is θ₁.

If we know the x, θ₀, and θ₁ values, we can predict the target value for a machine learning system. For example, in the housing prices example above, θ₀ = 0.00008 and θ₁ = 0.16538 for the red line. So for a house of 3000 feet² (x = 3000), the expected price is:
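h(3000) = 0.00008 + 0.16538 × 3000 ≈ 496.1 K$, which agrees with the roughly 500 K$ we read off the graph earlier.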

As can be seen, we need three parameters to write the hypothesis function. These are x (input), θ₀, and θ₁. So, let's write the Python function for the hypothesis:
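A minimal sketch of such a function (the name and exact signature are illustrative, not necessarily those of the original notebook):

```python
def hypothesis(theta_0, theta_1, x):
    """Return the prediction h(x) = theta_0 + theta_1 * x.

    Works for a single number as well as a NumPy array of inputs.
    """
    return theta_0 + theta_1 * x
```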

The process is pretty straightforward. We just need to return (θ₀ + θ₁ * x).

The Cost Function

When we have a hypothesis (i.e., we know the values of θ₀ and θ₁), we must decide how well this hypothesis fits our data. Turning back to the housing prices example, let's take θ₀ as 500 and θ₁ as -0.1 and draw the line on our data.

Housing Prices from Portland, Oregon with Unfitting Hypothesis - Image by Author

As seen from the image above, a hypothesis with θ₀ = 500 and θ₁ = -0.1 is not a good match for our dataset. We do not want such a hypothesis to be our model since it gives faulty results except for a few points. Let's draw some more lines to get a better understanding.

Housing Prices from Portland, Oregon with Different Hypothesis Lines - Image by Author

We have three hypotheses in the above image with different θ values. By intuition, we can easily pick the blue line (θ₀ = 0, θ₁ = 0.165) as the best match for our data. The question is, how can we select the best fit mathematically? In other words, how can we choose the θ₀ and θ₁ values that give the optimum solution to our goal of finding a model we can use to predict new inputs?

To do that, we can choose θ₀ and θ₁ such that h(x) is close to the actual value for all training examples [10]. For a specific hypothesis (with known θ₀ and θ₁ values), we make an error at each point of the dataset. If we add up all of those errors, we can choose the best-fitting line by selecting the one with the minimum sum (which means we make the smallest overall error).

So we define a cost function to measure the total error we make. There are several options to choose from; one is called the squared error function:

The Cost Function - Image by Author

J(θ₀, θ₁) is called the cost function. It is a function of θ₀ and θ₁, which are the intercept and slope of the hypothesis. h(xᵢ) - yᵢ is the error we make on the i-th training example. Squaring it gives two significant benefits: first, we punish big mistakes, and second, we get rid of the negative sign when h(x) < y. By dividing the total sum of squared errors by the number of samples, we find the average error we made. As described above, smaller cost function values suggest a better match, so we try to minimize the cost function for a better fit.
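Written out for m training examples (xᵢ, yᵢ), the squared error cost described above is:

```latex
J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h(x_i) - y_i \right)^2
```

Some references scale this by 1/(2m) instead of 1/m; the constant factor does not change which θ₀ and θ₁ minimize the cost.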

To write the cost function, we must import NumPy, a popular numerical computing library used for linear algebra and array operations. NumPy will speed up our calculations. We also need four parameters: θ₀, θ₁, X (input), and y (actual output).
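A sketch of such a cost function (the original notebook's version may differ in details):

```python
import numpy as np

def cost(theta_0, theta_1, X, y):
    """Average squared error of the hypothesis over the dataset.

    X and y are NumPy arrays holding the inputs and the actual outputs.
    """
    errors = np.subtract(hypothesis(theta_0, theta_1, X), y)  # h(x_i) - y_i
    squared_errors = np.square(errors)                        # (h(x_i) - y_i)^2
    return np.sum(squared_errors) / len(X)                    # average over the examples
```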

Inside the function, the first thing to do is to calculate the error for each data point using NumPy's subtract function. Then we can calculate the squared error for each data point using NumPy's square function. Finally, we can compute the total squared error using sum() and divide by the number of examples to get the average error. Note that all the training data is passed to the function as a NumPy array. In addition, since we use the hypothesis function inside the cost function, we must define it first. Remember also that we are not trying to minimize the cost function yet; we will see how to do that later in the Gradient Descent section.

Gradient Descent

Now we can calculate the cost function for a given set of θ₀ and θ₁ on a dataset. We also know that the θ₀ and θ₁ values that minimize the cost function are the best match for our data. The next question is how to find those θ₀ and θ₁ values.

The gradient descent algorithm can be used for this purpose, and a climber analogy helps explain it. Suppose a climber is at the top of a mountain and wants to descend as quickly as possible. He observes his surroundings, finds the steepest path downward, and takes one tiny step in that direction. Then he observes his surroundings again to find the new steepest direction and takes another step. He keeps repeating this procedure until he reaches the bottom.

The proof is beyond the scope of this blog post, but the squared error cost function has one global minimum and no other local minima. Furthermore, its negative gradient at a point is the direction of steepest descent. So, just as the climber finds his way down the mountain, we can use gradient descent to find the θ values that minimize our cost function.

To do that, we first need to select some initial θ₀ and θ₁ to start with. Then, we calculate the gradient at this point and change the θ₀ and θ₁ values in proportion to the negative gradient. We then repeat these steps until convergence.

To better understand the concept of the gradient, we will use an example squared error cost function with θ₀ = 0, as shown below. We set θ₀ to zero; otherwise, we would have to struggle with a 3D graph instead of a 2D one.

J(θ₁) - Image by Author

In the image above, suppose we are at the point θ₁ = 6 (that is, we have selected 6 as our initial θ₁) and are trying to reach the global minimum. The negative gradient at this point points downhill, as shown by the arrow. So if we take one step in that direction, we will end up at a point smaller than 6. Then, as shown below, we can repeat the same procedure until convergence at θ₁ = 2.

Convergence of J(θ₁) at the Local Minimum - Image by Author

Let's try another point, as shown below. If we had selected another θ₁ value, such as θ₁ = -2, the negative gradient at this point would point downhill in the opposite direction, as shown by the arrow. So if we take one step in that direction, we will end up at a value larger than -2.

J(θ₁) - Image by Author

Then, as shown below, we can repeat the same procedure until convergence at θ₁ = 2.

Convergence of J(θ₁) at the Local Minimum - Image by Author

One good question to ask is how to decide when we have converged; that is, what is our stop condition in the code? We have two choices: we can either check the value of the cost function and stop when it is smaller than or equal to a predefined value, or we can check the difference between consecutive cost function values and stop when it is smaller than or equal to a threshold. We will choose the latter since, depending on the initial θ values we select, we might never reach the predefined cost value.

Taking the gradient of a function is a freshman-level calculus exercise, and we will not dive into the details; instead, we will only give the result. The gradient of the squared error cost function for a univariate dataset is as follows:

The Gradient Calculation for the Squared Error Cost Function - Image by Author

Before writing the code for gradient descent, we must introduce one final concept: the learning rate. As we have discussed above, the gradient descent algorithm requires tiny steps, and as shown below, we risk overshooting the minimum if the steps are too big. So, while calculating the new θ values, we multiply the gradient result by the learning rate coefficient. The learning rate is a hyperparameter that needs to be tuned for training. While a higher learning rate may result in overshooting, a smaller learning rate may result in a longer training time [10].

Overshooting - Image by Author

As a result, for a univariate dataset with a hypothesis of the form h(x) = θ₀ + θ₁ * x and a squared error cost function J(θ₀, θ₁), the θ values for each step can be calculated as follows:

The Algorithm for Calculating θ Values Using Gradient Descent - Image by Author

As discussed, α is the learning rate, and m is the number of training examples.
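For reference, a standard way to write these simultaneous updates (constant factors that come out of differentiating the square are absorbed into α) is:

```latex
\theta_0 := \theta_0 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h(x_i) - y_i \right), \qquad
\theta_1 := \theta_1 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h(x_i) - y_i \right) x_i
```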

To write the gradient descent function, we must import NumPy, as in the cost function. We also need five parameters: θ₀, θ₁, α (learning rate), X (input), and y (actual output). We must also be sure not to update θ₀ before calculating θ₁ [10]; otherwise, we will end up with a faulty result. The code for the gradient function is as follows:
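A sketch of one gradient descent step along these lines (the function name and return convention are illustrative):

```python
import numpy as np

def gradient_step(theta_0, theta_1, alpha, X, y):
    """Perform one gradient descent step and return the new (theta_0, theta_1).

    Both updates use the *current* theta values, so theta_0 is not
    overwritten before theta_1 is calculated.
    """
    m = len(X)
    errors = np.subtract(hypothesis(theta_0, theta_1, X), y)  # h(x_i) - y_i
    errors_times_x = errors * X                               # (h(x_i) - y_i) * x_i

    new_theta_0 = theta_0 - alpha * np.sum(errors) / m
    new_theta_1 = theta_1 - alpha * np.sum(errors_times_x) / m
    return new_theta_0, new_theta_1
```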

Inside the function, the first thing to do is to calculate the errors for each data point using NumPy's subtract function and store them in a variable. Then we multiply that result by x and store it in another variable. Note that all the training data is passed to the function as NumPy arrays, so those variables are also arrays, and the sum() function can be used to get the totals, which are then multiplied by the learning rate and divided by the number of examples. As a final step, we calculate and return the new θ values.

Model Training

Since we have written all the functions we need, it is time to train the model. As discussed before, we need to decide when gradient descent has converged. To do that, we will subtract consecutive cost values; if the difference is smaller than a certain threshold, we will conclude that gradient descent has converged.

We also draw the hypothesis at each iteration to see how it changes, but this slows down the code. The choice of initial values can significantly change the number of iterations, so readers reproducing this may want to change this part of the code. It is also possible to copy the Univariate Linear Regression from Scratch notebook for this purpose, which includes a section that, because of high iteration counts, draws only the final trained model.
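A sketch of such a training loop (the initial values and the threshold below are illustrative, and the per-iteration plotting mentioned above is omitted to keep the sketch short):

```python
# Illustrative hyperparameters and starting point, not necessarily the
# values used in the original notebook.
alpha = 0.0001                # learning rate
theta_0, theta_1 = 0.0, 0.0   # initial parameters
threshold = 0.0001            # convergence threshold on the change in cost

previous_cost = cost(theta_0, theta_1, X_train, y_train)
difference = threshold + 1    # force at least one iteration
iteration = 0

while difference > threshold:
    # One gradient descent step using the current theta values.
    theta_0, theta_1 = gradient_step(theta_0, theta_1, alpha, X_train, y_train)

    # Recompute the cost and check how much it changed.
    current_cost = cost(theta_0, theta_1, X_train, y_train)
    difference = abs(previous_cost - current_cost)
    previous_cost = current_cost

    iteration += 1
    print(f"Iteration {iteration}: cost={current_cost:.3f}, "
          f"theta_0={theta_0:.5f}, theta_1={theta_1:.5f}")
```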

Inside the code block, we first initialize α, θ₀, θ₁, and the convergence threshold. We also calculate the initial cost value by calling the cost function.

As previously discussed, our stop condition for the while loop is that the difference between consecutive cost function values be smaller than a threshold. In each iteration of the while loop, we first draw the training data and the current hypothesis for demonstration purposes. We then calculate the cost function value by calling the cost function. Next, we calculate the new θ₀ and θ₁ values by calling the gradient function and use the new θ values to recalculate the cost, so we can find the difference and check the stop condition. At the end of each iteration, we print some useful information.

The Result

The result below is for the initial conditions in the model training code block. The training took 29 iterations. θ₀ is calculated as approximately 0.015, and θ₁ is around 0.999. The final cost value is about 15.742. The result is very close to the y = x line discussed in the Linear Regression dataset description. We can also see how the hypothesis changes at each iteration.

The Training Result Using the Training Dataset - Image by Author

Validation

We can also validate our result with the test dataset, which was not used in the training process. Below you can find the performance of our model on the test dataset. The cost value of the model is approximately 18.915, slightly worse than the training cost value of 15.742, which is expected.
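With the cost function written earlier, evaluating the trained model on the test set is a one-liner (a sketch, assuming X_test and y_test were loaded as in the data-loading snippet above):

```python
test_cost = cost(theta_0, theta_1, X_test, y_test)
print(f"Cost on the test dataset: {test_cost:.3f}")
```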

The Validation Result Using the Test Dataset - Image by Author

Conclusion

In this blog post, we learned the basics of linear regression, namely the hypothesis, the cost function, and gradient descent, using univariate data. We also wrote those functions in Python and, using them, trained a univariate linear model and tested it on the Linear Regression dataset.

Interested readers can find the complete code used in this blog post in the Kaggle notebook Univariate Linear Regression from Scratch.

Thank you for reading!

References

[1] V. Andonians, "Linear Regression," Kaggle.com, 2017. [Online]. Available: https://www.kaggle.com/datasets/andonians/random-linear-regression. [Accessed: 12-Sep-2022].

[2] "Supervised learning - Wikipedia," En.wikipedia.org, 2022. [Online]. Available: https://en.wikipedia.org/wiki/Supervised_learning. [Accessed: 13-Sep-2022].

[3] "Unsupervised learning - Wikipedia," En.wikipedia.org, 2022. [Online]. Available: https://en.wikipedia.org/wiki/Unsupervised_learning. [Accessed: 13-Sep-2022].

[4] "Reinforcement learning - Wikipedia," En.wikipedia.org, 2022. [Online]. Available: https://en.wikipedia.org/wiki/Reinforcement_learning. [Accessed: 13-Sep-2022].

[5] J. Carew, "What is Reinforcement Learning? A Comprehensive Overview," SearchEnterpriseAI, 2021. [Online]. Available: https://www.techtarget.com/searchenterpriseai/definition/reinforcement-learning. [Accessed: 12-Sep-2022].

[6] J. Brownlee, "Linear Regression for Machine Learning," machinelearningmastery.com, 2022. [Online]. Available: https://machinelearningmastery.com/linear-regression-for-machine-learning/. [Accessed: 11-Sep-2022].

[7] JohnX, "Housing Prices, Portland, OR," Kaggle.com, 2017. [Online]. Available: https://www.kaggle.com/datasets/kennethjohn/housingprice. [Accessed: 11-Sep-2022].

[8] E. Hatipoglu, "Univariate Linear Regression From Scratch," Kaggle.com, 2022. [Online]. Available: https://www.kaggle.com/code/erkanhatipoglu/univariate-linear-regression-from-scratch. [Accessed: 12-Sep-2022].

[9] J. Brownlee, "What is a Hypothesis in Machine Learning?," machinelearningmastery.com, 2019. [Online]. Available: https://machinelearningmastery.com/what-is-a-hypothesis-in-machine-learning/. [Accessed: 12-Sep-2022].

[10] A. Ng, "Stanford CS229: Machine Learning - Linear Regression and Gradient Descent | Lecture 2 (Autumn 2018)," Youtube.com, 2018. [Online]. Available: https://www.youtube.com/watch?v=4b4MUYve_U8&list=PLoROMvodv4rMiGQp3WXShtMGgzqpfVfbU&index=2. [Accessed: 12-Sep-2022].



Published via Towards AI
