Stock Price Prediction Model for Netflix
Last Updated on May 24, 2022 by Editorial Team
A Comparative Study of Linear Regression, K-Nearest Neighbor (KNN) and Support Vector Machine (SVM)
Author(s):Β Vivek Chaudhary
The objective of this article is to design a stock prediction linear model to predict the closing price of Netflix. This will be a comparative study of various machine learning models such as linear regression, K-nearest neighbor, and support vector machines.
Before designing the model I was going through some of the investment-related blogs and articles on the internet to get a better understanding of the matter and I found out some key indicators that determine the opening and closing price of a stock. Therefore, this tutorial on the training of the model will not be limited to a single feature instead this will be a multi-feature trained model.
Stock Market Basic Influencing Factor
As a data scientist or an ml pro it eases out our work when we have a business or functional clarity on what we are trying to build, so I thought I must share some business insight as well, however, I am not a stock market pro but I thought it will build an understanding. so basically the closing price is the last price a buyer paid for that particular stock during business hours and the opening price is the price of the first transaction of that stock, and it is a very common phenomenon in stock trading that an opening price can be different from the closing price, the reason being the fluctuation between demand and supply that determines the attractiveness of a stock. this phenomenon is termed as AHT (after-hours trading) which plays a pivotal role in shifting the opening price of a stock. there are other factors as well as I am not an expert on such matters so I am not going to dig deep into it. this small snippet of information was to give a basic insight into the real-world problem.
so without wasting a further ado lets jump to the solution βcodingβ
1. Import prerequisites:
import pandas as pd from sklearn.linear_model import LinearRegression from sklearn.model_selection import train_test_split import matplotlib.pyplot as plt import numpy as np import pandas_datareader as web
pandas_datareader library allows us to connect to the website and extract data directly from internet sources in our case we are extracting data from Yahoo Finance API.
2. Read and Display
df = web.DataReader(βNFLXβ, data_source=βyahooβ, start=β2012β01β01', end=β2020β04β28') df.tail(10)
df.reset_index(inplace=True) df.describe()
3. Check for Correlation
Before we check for correlation lets understand what exactly it is?
Correlation is a measure of association or dependency between two features i.e. how much Y will vary with a variation in X. The correlation method that we will use is the Pearson Correlation.
Pearson Correlation Coefficient is the most popular way to measure correlation, the range of values varies from -1 to 1. In mathematics/physics terms it can be understood as if two features are positively correlated then they are directly proportional and if they share negative correlation then they are inversely proportional.
corr = df.corr(method=βpearsonβ) corr
Not all text is understandable, lets visualize the correlation coefficient.
import seaborn as sb sb.heatmap(corr,xticklabels=corr.columns, yticklabels=corr.columns, cmap='RdBu_r', annot=True, linewidth=0.5)
The Dark Maroon zone is not Maroon5, haha joking, but it denotes the highly correlated features.
4. Visualize the Dependent variable with Independent Features
#prepare dataset to work with nflx_df=df[[βDateβ,βHighβ,βOpenβ,βLowβ,βCloseβ]] nflx_df.head(10)
plt.figure(figsize=(16,8)) plt.title('Netflix Stocks Closing Price History 2012-2020') plt.plot(nflx_df['Date'],nflx_df['Close']) plt.xlabel('Date',fontsize=18) plt.ylabel('Close Price US($)',fontsize=18) plt.style.use('fivethirtyeight') plt.show()
#Plot Open vs Close nflx_df[['Open','Close']].head(20).plot(kind='bar',figsize=(16,8)) plt.grid(which='major', linestyle='-', linewidth='0.5', color='green') plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black') plt.show()
#Plot High vs Close nflx_df[['High','Close']].head(20).plot(kind='bar',figsize=(16,8)) plt.grid(which='major', linestyle='-', linewidth='0.5', color='green') plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black') plt.show()
#Plot Low vs Close nflx_df[[βLowβ,βCloseβ]].head(20).plot(kind=βbarβ,figsize=(16,8)) plt.grid(which=βmajorβ, linestyle=β-β, linewidth=β0.5', color=βgreenβ) plt.grid(which=βminorβ, linestyle=β:β, linewidth=β0.5', color=βblackβ) plt.show()
5. Model Training and Testing
#Date format is DateTime and it will throw error while training so I have created seperate month, year and date entities
nflx_df[βYearβ]=df[βDateβ].dt.year nflx_df[βMonthβ]=df[βDateβ].dt.month nflx_df[βDayβ]=df[βDateβ].dt.day
Create the final dataset for Model training
nfx_df=nflx_df[[βDayβ,βMonthβ,βYearβ,βHighβ,βOpenβ,βLowβ,βCloseβ]] nfx_df.head(10)
#separate Independent and dependent variable X = nfx_df.iloc[:,nfx_df.columns !=βCloseβ] Y= nfx_df.iloc[:, 5]
print(X.shape) #output: (2093, 6) print(Y.shape) #output: (2093,)
Splitting the dataset into train and test
from sklearn.model_selection import train_test_split x_train,x_test,y_train,y_test= train_test_split(X,Y,test_size=.25)
print(x_train.shape) #output: (1569, 6) print(x_test.shape) #output: (524, 6) print(y_train.shape) #output: (1569,) print(y_test.shape) #output: (524,) #y_test to be evaluated with y_pred for Diff models
Model 1: Linear Regression
Linear Regression Model Training and Testing
lr_model=LinearRegression() lr_model.fit(x_train,y_train)
y_pred=lr_model.predict(x_test)
Linear Model Cross-Validation
What is exactly Cross-Validation?
Basically Cross Validation is a technique using which Model is evaluated on the dataset on which it is not trained i.e. it can be a test data or can be another set as per availability or feasibility.
from sklearn import model_selection from sklearn.model_selection import KFold kfold = model_selection.KFold(n_splits=20, random_state=100) results_kfold = model_selection.cross_val_score(lr_model, x_test, y_test.astype('int'), cv=kfold) print("Accuracy: ", results_kfold.mean()*100)
#output: Accuracy: 99.999366595175
Plot Actual vs Predicted Value
plot_df=pd.DataFrame({βActualβ:y_test,βPredβ:y_pred}) plot_df.head(20).plot(kind='bar',figsize=(16,8)) plt.grid(which='major', linestyle='-', linewidth='0.5', color='green') plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black') plt.show()
Model 2: KNN K-nearest neighbor Regression Model
KNN Model Training and Testing
from sklearn.neighbors import KNeighborsRegressor knn_regressor=KNeighborsRegressor(n_neighbors = 5) knn_model=knn_regressor.fit(x_train,y_train) y_knn_pred=knn_model.predict(x_test)
KNN Cross-Validation
knn_kfold = model_selection.KFold(n_splits=20, random_state=100) results_kfold = model_selection.cross_val_score(knn_model, x_test, y_test.astype(βintβ), cv=knn_kfold) print(βAccuracy: β, results_kfold.mean()*100)
#output: Accuracy: 99.93813335235933
Plot Actual vs Predicted
plot_knn_df=pd.DataFrame({βActualβ:y_test,βPredβ:y_knn_pred}) plot_knn_df.head(20).plot(kind=βbarβ,figsize=(16,8)) plt.grid(which=βmajorβ, linestyle=β-β, linewidth=β0.5', color=βgreenβ) plt.grid(which=βminorβ, linestyle=β:β, linewidth=β0.5', color=βblackβ) plt.show()
Model 3: SVM Support Vector Machine Regression Model
SVM Model Training and Testing
from sklearn.svm import SVR svm_regressor = SVR(kernel=βlinearβ) svm_model=svm_regressor.fit(x_train,y_train) y_svm_pred=svm_model.predict(x_test)
Plot Actual vs Predicted
plot_svm_df=pd.DataFrame({βActualβ:y_test,βPredβ:y_svm_pred}) plot_svm_df.head(20).plot(kind=βbarβ,figsize=(16,8)) plt.grid(which=βmajorβ, linestyle=β-β, linewidth=β0.5', color=βgreenβ) plt.grid(which=βminorβ, linestyle=β:β, linewidth=β0.5', color=βblackβ) plt.show()
RMSE (Root Mean Square Error)
Root Mean Square Error is the Standard Deviation of residuals, which are a measure of how far data points are from the regression. Or in simple terms how concentrated the data points are around the best fit line.
from sklearn.metrics import mean_squared_error , r2_score import math
lr_mse=math.sqrt(mean_squared_error(y_test,y_pred)) print(βLinear Model Root mean square errorβ,lr_mse)
knn_mse=math.sqrt(mean_squared_error(y_test,y_knn_pred)) print(βKNN Model Root mean square errorβ,mse)
svm_mse=math.sqrt(mean_squared_error(y_test,y_svm_pred)) print(βSVM Model Root mean square error SVMβ,svm_mse)
#outputs as below: Linear Model Root mean square error 1.7539775065782694e-14 KNN Model Root mean square error 1.7539775065782694e-14 SVM Model Root mean square error SVM 0.0696764093622963
R2 or r-squared error
R2 or r2 score varies between 0 to 100%.
Mathematical Formula for r2 score :(y_test[i]βββy_pred[i]) **2
print(βLinear R2: β, r2_score(y_test, y_pred)) print(βKNN R2: β, r2_score(y_test, y_knn_pred)) print(βSVM R2: β, r2_score(y_test, y_svm_pred))
#output as below: Linear R2: 1.0 KNN R2: 0.9997974105145412 SVM R2: 0.9999996581785765
For basic mathematical understanding on RMSE and R2 error refer my blog: https://www.geeksforgeeks.org/ml-mathematical-explanation-of-rmse-and-r-squared-error/
Our Model looks very much good with phenomenal stats.
Summary:
Β· Use of prerequisite libraries
Β· Data extraction from the web using pandas_datareader
Β· Basic Stocks understanding
Β· Model training and testing
Β· ML algorithms such as Linear Regression, K-nearest neighbor and Support Vector Machines
Β· Kfold Cross-Validation of Models
Β· Python seaborn library to visualize correlation heatmap
Β· Pearson correlation
Β· Feature plotting against dependent output using Matplotlib.
Β· Plotting of Actual value with the Predicted value.
Β· Root Mean Square Error (RMSE).
Β· R-squared error
Thanks to all for reading my blog and If you like my content and explanation please follow me on medium and share your feedback, which will always help all of us to enhance our knowledge.
Thanks for reading!
Vivek Chaudhary
Stock Price Prediction Model for Netflix was originally published in Towards AIβββMultidisciplinary Science Journal on Medium, where people are continuing the conversation by highlighting and responding to this story.
Ajan K K
Very good article. Learned a lot. Please check the following:
#separate Independent and dependent variable
X = nfx_df.iloc[:,nfx_df.columns !=βCloseβ]
Y= nfx_df.iloc[:, 5]
Y value is expected to be the βCloseβ. But, here the βLowβ will be retained. I think you have to use 6 instead of 5.
pd
Sir, Can we predict the closing value for the next day using KNN method. Then we can consider that value as trend reversal value. I hope you reply
prags
the output for rmse for linear model is coming out to be 3.something, which is different from that is the model!!!why is it so??
V. Karunakaran
Sir I need mathematical calculation
waren
Sir, I haven’t seen such a naΓ―ve approach in my whole life. Try predicting 5 days in advance if you want to get a bit of understanding of the complexity of the problem. R2 of 0.99 for stock market prediction??? … common