

Let’s see the sales in a Big Mart Store

Last Updated on July 24, 2023 by Editorial Team

Author(s): Saikat Biswas

Originally published on Towards AI.

Image by Anastasia Dulgier on Unsplash

Predicting the sales of each product in a particular store

As data analysts and data science practitioners, we are often asked to predict an outcome from the factors that influence it. The task can range from forecasting sales in a store, to predicting tomorrow's weather, to estimating a stock's price for the next day.

Analytics Vidhya is a well-known platform for learning about almost everything related to analytics. Many seasoned data scientists, analysts, and developers across the globe, myself included, keep up with the latest developments in data science and machine learning through its information-rich articles.

They also host hackathons on their site to help us along the data science journey. In fact, one of the best ways to learn data science is to code consistently, participate in hackathons, and check the results for ourselves. One such hackathon on their platform deals with predicting the sales of each item in a store.

The evaluation metric used for this competition is Root Mean Square Error (RMSE).

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals measure how far the data points are from the regression line; RMSE measures how spread out these residuals are. In other words, it tells us how concentrated the data is around the line of best fit.
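As a minimal sketch of the definition above, RMSE can be computed directly with NumPy. The function name `rmse` and the sample values are illustrative assumptions and are not part of the competition code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred  # prediction errors
    return float(np.sqrt(np.mean(residuals ** 2)))

# illustrative values only
observed = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
print(rmse(observed, predicted))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.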

For this competition, we are given a set of features and need to predict the sales of each product in the store.

To see the complete code, feel free to check my GitHub code here.

Can't predict without this one, can we…..

We will divide this article into three sections.

  1. EDA (Exploratory Data Analysis): We will analyze the training data.
  2. Feature Engineering: This is where we build the features that will help us with the prediction.
  3. Modelling: This is where the magic happens, and we watch it unfold using the power of Machine Learning.

So, without further ado, let’s see the code.

# importing all the important libraries for analysis

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
pd.set_option('display.max_columns', None)
import pandas_profiling as pp

import matplotlib.pyplot as plt
import seaborn as sns
import warnings
%matplotlib inline
plt.style.use('fivethirtyeight')
warnings.filterwarnings('ignore')
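# loading the train and the test data
train = pd.read_csv('train_bigmart.csv')
test = pd.read_csv('test_bigmart.csv')

# combining train and test so both get the same preprocessing
# (assumed step: `data` is used below but its creation is not shown)
data = pd.concat([train, test], ignore_index=True)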

There are null values in the Item_Weight and Outlet_Size variables. We will handle them by replacing missing Item_Weight values with the column mean and missing Outlet_Size values with the column mode.

# filling the missing values in Item_Weight with the mean and in Outlet_Size with the mode
data['Item_Weight'] = data['Item_Weight'].fillna(data['Item_Weight'].mean())

data['Outlet_Size'] = data['Outlet_Size'].fillna(data['Outlet_Size'].mode()[0])

# treating zero sales as missing, then filling missing sales with the mode
data['Item_Outlet_Sales'] = data['Item_Outlet_Sales'].replace(0, np.nan)
data['Item_Outlet_Sales'] = data['Item_Outlet_Sales'].fillna(data['Item_Outlet_Sales'].mode()[0])
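The same imputation pattern can be tried on a small toy frame before running it on the full data. This is only a sketch: the values below are made up for illustration, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# toy frame with one missing value per column (made-up values)
toy = pd.DataFrame({
    'Item_Weight': [9.3, np.nan, 17.5, 8.2],
    'Outlet_Size': ['Medium', None, 'Medium', 'High'],
})

# count missing values per column before imputing
print(toy.isnull().sum())

# numeric column: fill with the mean; categorical column: fill with the mode
toy['Item_Weight'] = toy['Item_Weight'].fillna(toy['Item_Weight'].mean())
toy['Outlet_Size'] = toy['Outlet_Size'].fillna(toy['Outlet_Size'].mode()[0])

print(toy)
```

Checking `isnull().sum()` before and after is a quick way to confirm that no missing values remain.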

Another EDA library worth having in our Jupyter notebooks is Pandas Profiling, which gives a quick overview of the data, from the variables’ correlation coefficients to their distributions. Installation is simple: a pip install does the job, and the library takes care of the rest.

Next, we will create some of the features that we will use for modelling.

# Getting the first two characters of ID to separate them into different categories

data['Item_Identifier'] = data['Item_Identifier'].apply(lambda x: x[0:2])

data['Item_Identifier'] = data['Item_Identifier'].map({'FD':'Food', 'NC':'Non_Consumable', 'DR':'Drinks'})

data['Item_Identifier'].value_counts()
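To see what this mapping does, here is a small sketch on a handful of illustrative identifiers. The column name `Item_Category` and the dictionary name `prefix_to_category` are assumptions introduced for this example; the code above overwrites `Item_Identifier` instead.

```python
import pandas as pd

# illustrative identifiers following the FD / NC / DR prefix convention
toy = pd.DataFrame({'Item_Identifier': ['FDA15', 'DRC01', 'NCD19', 'FDX07']})

prefix_to_category = {'FD': 'Food', 'NC': 'Non_Consumable', 'DR': 'Drinks'}

# .str[:2] is a vectorised equivalent of apply(lambda x: x[0:2])
toy['Item_Category'] = toy['Item_Identifier'].str[:2].map(prefix_to_category)

print(toy['Item_Category'].value_counts())
```

Writing the result to a separate column keeps the original identifier intact, so the cell can be re-run without mapping already-mapped values to NaN.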

Next, we create a new feature, Outlet_Years, for modelling.

# number of years each outlet had been operating as of 2013

data['Outlet_Years'] = 2013 - data['Outlet_Establishment_Year']
data['Outlet_Years'].value_counts()

The next step is label encoding, which converts categorical labels into numeric form so that they are machine-readable. Machine learning algorithms can then work with those labels directly. It is an important preprocessing step for structured datasets in supervised learning.

Once label encoding is done, we one-hot encode the data; in simple words, we convert the categorical variables into dummy (0/1) columns that our models can work with.

# one hot encoding the data to get dummy variables

data = pd.get_dummies(data)

print(data.shape)
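The difference between the two encodings is easiest to see on a toy column. The sketch below uses scikit-learn's `LabelEncoder` and `pd.get_dummies`; the sample values and the `Outlet_Size_Label` column name are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({'Outlet_Size': ['Small', 'Medium', 'High', 'Medium']})

# label encoding: each category becomes an integer
# (LabelEncoder orders the classes alphabetically)
encoder = LabelEncoder()
toy['Outlet_Size_Label'] = encoder.fit_transform(toy['Outlet_Size'])
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))

# one-hot encoding: each category becomes its own indicator column
dummies = pd.get_dummies(toy['Outlet_Size'], prefix='Outlet_Size')
print(dummies)
```

Note that `pd.get_dummies` only expands object or categorical columns by default, so columns that were already label encoded to integers are left as they are.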

The next step is splitting the data into train and test sets.

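# splitting back into the original train and test sets

train = data.iloc[:8523,:] # first 8523 rows (the original train set), all columns
test = data.iloc[8523:,:] # remaining rows (the original test set), all columns

# separating the features from the target
# (assumed step: x and y are used below but their creation is not shown)
x = train.drop('Item_Outlet_Sales', axis = 1)
y = train['Item_Outlet_Sales']

# making x_train, x_test, y_train, y_test

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)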

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
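A small sketch of the same split on a toy frame is shown here, with two additions that are assumptions of this example rather than part of the code above: a check that the target column is not among the features, and a fixed `random_state` so the split is reproducible. The feature values are random.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# toy stand-in for the processed training frame (random values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Item_MRP': rng.uniform(30, 270, size=10),
    'Outlet_Years': rng.integers(4, 29, size=10),
    'Item_Outlet_Sales': rng.uniform(30, 13000, size=10),
})

target = 'Item_Outlet_Sales'
x = toy.drop(columns=[target])
y = toy[target]

# guard against target leakage: the target must not be a feature
assert target not in x.columns

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42
)
print(x_train.shape, x_test.shape)
```

If the target stays inside `x`, every model can reproduce it almost exactly, which shows up as an RMSE near zero and a variance score of 1.00.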

3. Modelling.

A. Linear Regression

# Linear Regression

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

model = LinearRegression()
model.fit(x_train, y_train)

# predicting the test set results
y_pred = model.predict(x_test)
print(y_pred)

# finding the mean squared error and variance
mse = mean_squared_error(y_test, y_pred)
print('RMSE :', np.sqrt(mse))
print('Variance score: %.2f' % r2_score(y_test, y_pred))

OUTPUT

RMSE : 9.186108167282568e-13
Variance score: 1.00

B. Random Forest Regressor

# Random Forest Regressor

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators = 100 , n_jobs = -1)
model.fit(x_train, y_train)

# predicting the test set results
y_pred = model.predict(x_test)
print(y_pred)

# finding the mean squared error and variance
mse = mean_squared_error(y_test, y_pred)
print("RMSE :",np.sqrt(mse))
print('Variance score: %.2f' % r2_score(y_test, y_pred))

print("Result :",model.score(x_train, y_train))

OUTPUT

RMSE : 39.62801138375659
Variance score: 1.00
Result : 0.9999916265613055

C. Support Vector Machine

# Support Vector Machine

from sklearn.svm import SVR

model = SVR()
model.fit(x_train, y_train)

# predicting the x test results
y_pred = model.predict(x_test)

# Calculating the RMSE Score
mse = mean_squared_error(y_test, y_pred)
print("RMSE :", np.sqrt(mse))
print('Variance score: %.2f' % r2_score(y_test, y_pred))

print("Result :",model.score(x_train, y_train))

OUTPUT

Variance score: 0.74
Result : 0.7490125535919516

D. Gradient Boosting Algorithm

# Gradient Boosting Algorithm
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor()
model.fit(x_train, y_train)

# predicting the test set results
y_pred = model.predict(x_test)
print(y_pred)

# Calculating the root mean squared error
print("RMSE :", np.sqrt(((y_test - y_pred)**2).sum()/len(y_test)))
print('Variance score: %.2f' % r2_score(y_test, y_pred))

print("Result :",model.score(x_train, y_train))

OUTPUT

RMSE : 30.42838358308164
Variance score: 1.00
Result : 0.9999400663817338

E. Decision Tree Regressor

# Decision Tree Regressor 

from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor()
model.fit(x_train, y_train)

# predicting the test set results
y_pred = model.predict(x_test)
print(y_pred)

print(" RMSE : " , np.sqrt(((y_test - y_pred)**2).sum()/len(y_test)))
print('Variance score: %.2f' % r2_score(y_test, y_pred))

print("Result :",model.score(x_train, y_train))

OUTPUT

RMSE : 30.028006703468485
Variance score: 1.00
Result : 1.0

F. AdaBoost Regressor

# Adaboost
from sklearn.ensemble import AdaBoostRegressor

model = AdaBoostRegressor(random_state=0, n_estimators=100)
model.fit(x_train, y_train)

# predicting the test set results
y_pred = model.predict(x_test)
print(y_pred)

print(" RMSE : " , np.sqrt(((y_test - y_pred)**2).sum()/len(y_test)))
print('Variance score: %.2f' % r2_score(y_test, y_pred))

print("Result :",model.score(x_train, y_train))

OUTPUT

RMSE : 121.48048104773004
Variance score: 0.99
Result : 0.9939541181004856

After training six algorithms on the training split and evaluating them on the held-out split, we can see that Linear Regression yielded the lowest RMSE, about 9.19e-13, which is effectively zero. RMSE measures how close the predicted values are to the observed data points, so lower values indicate a better fit. Hence, the Linear Regression result would be selected.

That said, an RMSE this close to zero, together with variance scores of 1.00 for most of the models, usually means the target column is still present among the features, so the feature set is worth checking before relying on these scores.
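The selection step described above can be sketched as a loop over candidate models that records each test RMSE and picks the smallest. The data here is synthetic (generated with `make_regression`) and only two of the six models are included, so the ranking it produces says nothing about the Big Mart data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic regression data, used only to make the sketch runnable
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=0),
}

# fit each model and record its RMSE on the held-out split
scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    scores[name] = float(np.sqrt(mean_squared_error(y_test, y_pred)))

best = min(scores, key=scores.get)
print(scores)
print('Lowest RMSE:', best)
```

Keeping the models in a dictionary makes it easy to add the remaining regressors without repeating the fit, predict, and score steps.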

Image Source Google

It’s kind of true, isn’t it? The more we torture the data and try to extract insights from it, the more it confesses, leading to better results at prediction time, just as we saw here when we worked through six algorithms and selected the one with the lowest RMSE.

You can check out more of my articles on Machine Learning:

Gradient Descent: In Layman Language

Introducing the most popular and most used machine learning optimization technique in 5 minutes.

medium.com

Importance of K-Fold Cross-Validation in Machine Learning

One of the most important steps before feeding the data to our machine learning model

medium.com

You can further connect with me on Linkedin here

Or connect with me on my Twitter account here

That's all in this one. Till next time Ciao..!!!


Published via Towards AI
