Predicting Sales using R programming

Last Updated on July 24, 2023 by Editorial Team

Author(s): Suyash Maheshwari

Originally published on Towards AI.

In this article, I will forecast the sales of a multinational retail corporation. The dataset for this can be found on Kaggle. We have been provided with weekly sales data on which we train our models and use the best model to predict the weekly sales value for future dates.

Libraries required:

library(dplyr)
library(forecast)
library(reshape)
library(ggplot2)
library(tidyverse)

The original dataset contains data for around 45 stores and 99 departments. To keep the calculation simple, I consider only the 1st department of the 1st store. I split its data into two parts, sample_train and sample_test, based on dates. The datasets contain the following fields:

Store — the store number
Dept — the department number
Date — the week
Weekly_Sales — sales for the given department in the given store
IsHoliday — whether the week is a special holiday week

dept1_train <- train %>% filter(Store == "1" & Dept == "1")
dept1_test <- test %>% filter(Store == "1" & Dept == "1")
dept1_test$Weekly_Sales <- 0
dept1_train$Date <- as.Date(dept1_train$Date, format = "%Y-%m-%d")
sample_train <- dept1_train %>% filter(Date < as.Date("2012-02-06"))
sample_test <- dept1_train %>% filter(Date >= as.Date("2012-02-06"))

summary(sample_train)
     Store        Dept        Date             Weekly_Sales
 Min.   :1   Min.   :1   Min.   :2010-02-05   Min.   :14537
 1st Qu.:1   1st Qu.:1   1st Qu.:2010-08-06   1st Qu.:16329
 Median :1   Median :1   Median :2011-02-04   Median :18820
 Mean   :1   Mean   :1   Mean   :2011-02-04   Mean   :22777
 3rd Qu.:1   3rd Qu.:1   3rd Qu.:2011-08-05   3rd Qu.:23388
 Max.   :1   Max.   :1   Max.   :2012-02-03   Max.   :57258
 IsHoliday
 Mode :logical
 FALSE:97
 TRUE :8

summary(sample_test)
     Store        Dept        Date             Weekly_Sales
 Min.   :1   Min.   :1   Min.   :2012-02-10   Min.   :15723
 1st Qu.:1   1st Qu.:1   1st Qu.:2012-04-14   1st Qu.:16645
 Median :1   Median :1   Median :2012-06-18   Median :18243
 Mean   :1   Mean   :1   Mean   :2012-06-18   Mean   :21784
 3rd Qu.:1   3rd Qu.:1   3rd Qu.:2012-08-22   3rd Qu.:22057
 Max.   :1   Max.   :1   Max.   :2012-10-26   Max.   :57592
 IsHoliday
 Mode :logical
 FALSE:36
 TRUE :2
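The date-based split can be checked on a small stand-in before running it on the real data. The sketch below is only an illustration: the data frame `toy` and its sales values are invented, and it uses base R subsetting rather than the dplyr `filter` calls above. It builds one row per week over the same date range and confirms that the two pieces do not overlap and together cover every row.

```r
# Synthetic stand-in for dept1_train: one row per week, invented sales values.
set.seed(1)
weeks <- seq(as.Date("2010-02-05"), as.Date("2012-10-26"), by = "week")
toy <- data.frame(Date = weeks,
                  Weekly_Sales = runif(length(weeks), 15000, 25000))

cutoff <- as.Date("2012-02-06")
toy_train <- toy[toy$Date <  cutoff, ]   # weeks strictly before the cutoff
toy_test  <- toy[toy$Date >= cutoff, ]   # the cutoff week and everything after

# The two pieces must not overlap and must cover every row.
stopifnot(max(toy_train$Date) < min(toy_test$Date))
stopifnot(nrow(toy_train) + nrow(toy_test) == nrow(toy))

c(train = nrow(toy_train), test = nrow(toy_test))
```

With weekly dates over this range, the counts come out as 105 and 38, which agrees with the FALSE and TRUE totals of IsHoliday in the two summaries (97 + 8 and 36 + 2).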

sample_train covers 5th Feb 2010 to 3rd Feb 2012 (105 weeks), and sample_test covers 10th Feb 2012 to 26th Oct 2012 (38 weeks). I then create a time series from the Weekly_Sales column of sample_train.

Time series: A sequence of observations taken at successive, equally spaced points in time.

ts_train_uni <- ts(sample_train$Weekly_Sales , start = c(2010,5) , frequency = 52)

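start = c(2010, 5) places the first observation in the fifth weekly slot of 2010, which corresponds to the first week of February 2010. frequency = 52 indicates weekly data, i.e. 52 observations per year.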
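To see what the start and frequency arguments do to the time index, here is a minimal sketch using base R only. The values in `x` are invented; only the index matters.

```r
# Invented values: 60 consecutive "weeks" starting in slot 5 of 2010.
x <- ts(seq_len(60), start = c(2010, 5), frequency = 52)

frequency(x)     # number of observations per year
start(x)         # year and slot of the first observation
end(x)           # year and slot of the last observation
time(x)[1]       # first time stamp as a decimal year: 2010 + 4/52
cycle(x)[1:3]    # slot numbers of the first three observations
```

Because slot 1 sits at exactly 2010.0, the fifth slot sits at 2010 + 4/52. Sixty weekly observations starting in slot 5 run past slot 52 and end in slot 12 of 2011.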

The three models that I fit to the training series are ARIMA, HoltWinters, and nnetar.

ARIMA: Auto-Regressive Integrated Moving Average

ARIMA models the correlation between a value and its own past values (the auto-regressive part) and past forecast errors (the moving-average part), after differencing the series (the integrated part). It assumes a stationary series, meaning one whose statistical properties, such as mean and variance, stay constant over time. Most economic and market data show trends, so differencing is used to remove trends and seasonal structure. The seasonal difference in this example is 1.

arima_model <- auto.arima(ts_train_uni, seasonal.test = "seas")
arima_pred <- forecast(arima_model, h = 38)   # 38 weeks, one per row of sample_test
arima_pred <- as.data.frame(arima_pred$mean)  # point forecasts end up in column x
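A seasonal difference of 1 means each value is compared with the value one full season earlier, here 52 weeks. The sketch below illustrates this on an invented series (a linear trend plus a yearly sine pattern plus noise, none of it from the Kaggle data) using base R's `diff`. It is not what `auto.arima` runs internally, only the differencing step in isolation.

```r
set.seed(42)
n <- 52 * 3                                        # three years of weekly data
season <- 5000 * sin(2 * pi * seq_len(n) / 52)     # pattern repeating every 52 weeks
trend  <- 20 * seq_len(n)                          # steady upward drift
y <- ts(20000 + trend + season + rnorm(n, sd = 300), frequency = 52)

# Seasonal difference: y[t] - y[t - 52]
d_seasonal <- diff(y, lag = 52)

length(d_seasonal)        # 52 observations are lost to the lag
sd(y)                     # spread of the raw series
sd(d_seasonal)            # spread after removing the seasonal pattern
mean(d_seasonal)          # roughly one year of trend: 20 * 52
```

The seasonal pattern cancels out in the difference, so the differenced series varies far less than the original and is centred near the yearly trend increment.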

HoltWinters: The Holt-Winters algorithm forecasts a time series by exponentially smoothing its level, trend, and seasonal components. The unknown smoothing parameters are determined by minimizing the squared prediction error.

holt_model <- HoltWinters(ts_train_uni)
p <- predict(holt_model , 38 , prediction.interval = TRUE)
p <- as.data.frame(p)
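To inspect what a fitted HoltWinters object contains, the following sketch fits the same function to `co2`, a monthly series that ships with R. It is used here purely as a stand-in because it needs no download; its results say nothing about the sales data.

```r
fit <- HoltWinters(co2)   # co2: built-in monthly series, stand-in for ts_train_uni

# Smoothing parameters chosen by minimizing the squared one-step prediction error
params <- c(alpha = unname(fit$alpha),   # level
            beta  = unname(fit$beta),    # trend
            gamma = unname(fit$gamma))   # seasonal component
params
fit$SSE                                  # the minimized sum of squared errors

# Forecast 12 periods ahead with prediction bounds
p_demo <- predict(fit, n.ahead = 12, prediction.interval = TRUE)
colnames(p_demo)
```

`predict` returns a matrix with the columns fit, upr, and lwr, which is why the article can read the point forecasts from `p$fit` after converting the result to a data frame.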

nnetar: Feed-forward neural networks with a single hidden layer and lagged inputs for forecasting univariate time series.

neural <- nnetar(ts_train_uni)
neural_pred <- forecast(neural , h=38)
neural_pred <- as.data.frame(neural_pred)
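"Lagged inputs" means that each value is predicted from the values that came just before it. The sketch below shows how such an input table can be laid out with base R's `embed`. The numbers and the choice of two lags are invented for illustration, and this is not nnetar's internal code.

```r
x <- c(10, 12, 15, 11, 14, 18, 13)   # invented series
p <- 2                               # number of lags used as inputs (assumption)

# Each row holds x[t], x[t-1], x[t-2]
lagged <- embed(x, p + 1)
colnames(lagged) <- c("target", "lag1", "lag2")
lagged
```

The first row pairs the target 15 with its two predecessors 12 and 10. A network trained on rows like these learns to map the lag columns to the target column.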

After predicting the values with the various models, I use the add_column function from the tibble package (loaded as part of the tidyverse) to append the predicted values to sample_test.

pred_data <- sample_test %>% add_column(arima_pred = arima_pred$x, holt_pred = p$fit, neural_pred = neural_pred$`Point Forecast`)

Then, I use ggplot to plot the actual Weekly_Sales values of sample_test against the values predicted by each model for the same period. This helps in analyzing the results of our models.

pred_data %>% gather(key = "predictions" , value = "value" , -c(Store , Dept , IsHoliday ,Date))%>% 
 ggplot(aes(x = Date ,y = value , colour = predictions)) + geom_line() + scale_x_date(date_breaks = "4 week")

The actual Weekly_Sales data, represented by the purple line, has two peaks at the beginning, and sales increase again at the end. All the models capture the first peak, but none captures the second peak accurately. Outside the peaks, the ARIMA and HoltWinters forecasts are in line with the actual Weekly_Sales data, whereas the neural network forecast is far from it and can safely be rejected. To decide which model is the right choice, we compute the Root Mean Square Error (RMSE).

RMSE is a frequently used measure of the differences between the values predicted by a model or an estimator and the values observed. It is the square root of the mean squared residual (prediction error), which equals the standard deviation of the residuals when their mean is zero. It is calculated in the following manner:

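rmse <- numeric(ncol(pred_data))  # initialise so that rmse[j] can be assigned

for (j in 6:ncol(pred_data)) {
  error <- pred_data[4] - pred_data[j]  # column 4 is Weekly_Sales, columns 6 to 8 are predictions
  y <- error^2
  z <- colMeans(y)
  rmse[j] <- sqrt(z)
}

> rmse[6]  # arima
9097.075
> rmse[7]  # holtwinters
9106.965
> rmse[8]  # nnetar
10123.48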
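The same calculation can be written as a small reusable function. This is a sketch, not the article's code: the name `rmse_of`, the data frame `toy_pred`, and its values are all invented for illustration.

```r
# Root mean square error between two equally long numeric vectors.
rmse_of <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Invented data: actual values and two hypothetical sets of predictions.
toy_pred <- data.frame(Weekly_Sales = c(10, 20, 30),
                       model_a      = c(12, 18, 33),
                       model_b      = c(10, 25, 30))

# Score every prediction column against the actual values in one call.
scores <- sapply(toy_pred[c("model_a", "model_b")], rmse_of,
                 actual = toy_pred$Weekly_Sales)
scores
names(which.min(scores))   # model with the lowest RMSE
```

Selecting the prediction columns by name avoids relying on column positions such as 4 and 6 to 8, which silently change if a column is added to the data frame.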

As expected, the RMSE of ARIMA and HoltWinters is lower than that of nnetar. Since the RMSE of ARIMA is the lowest, I use this model to predict future values.

Thus, the forecast for the dates ranging from 2nd Nov 2012 to 26th July 2013 is:

ts_train_uni <- ts(dept1_train$Weekly_Sales, start = c(2010, 5), frequency = 52)

arima_model <- auto.arima(ts_train_uni, seasonal.test = "seas")
arima_pred1 <- forecast(arima_model, h = 39)
arima_pred1 <- as.data.frame(arima_pred1$mean)
plot(forecast(arima_model, h = 39))  # arima plot
dept1_test$Weekly_Sales <- arima_pred1$x
dept1_test$arima <- NULL  # removes a leftover arima column if one exists; harmless otherwise
dept1_test$Date <- as.Date(dept1_test$Date)

ggplot(dept1_test, aes(x = Date, y = Weekly_Sales)) + geom_line(color = "blue") + theme_classic() +
  scale_x_date(date_breaks = "4 weeks")

Sales are higher in December, February, and April for this particular department. No information is given about the category of products this department sells, but we do have information about various holidays. The weeks that include these holidays are weighted five times higher in the evaluation than non-holiday weeks, and discounts are massive during these periods. This would explain the spikes in these months.

The table below shows the head of the range of values at 80% and 95% confidence intervals for the ARIMA model.

Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
      36424.86 27138.911 45710.81 22223.227 50626.49
      18689.54  7514.632 29864.45  1598.993 35780.09
      19050.66  7875.752 30225.57  1960.113 36141.21
      20911.25  9736.342 32086.16  3820.703 38001.80
      25293.49 14118.582 36468.40  8202.943 42384.04
      33305.92 22131.012 44480.83 16215.373 50396.47
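The Lo and Hi columns above are symmetric around the point forecast. Assuming normally distributed forecast errors (an assumption on my part, though the numbers in the table are consistent with it), the 80% bounds can be recovered from the 95% bounds. The sketch below does this for the first row, using only values copied from the table.

```r
# First row of the table above
point <- 36424.86
lo95  <- 22223.227
hi95  <- 50626.49

# Under normality, the 95% half-width is qnorm(0.975) standard deviations.
sigma <- (hi95 - lo95) / (2 * qnorm(0.975))

# An 80% interval leaves 10% in each tail, hence qnorm(0.90).
lo80 <- point - qnorm(0.90) * sigma
hi80 <- point + qnorm(0.90) * sigma

round(c(sigma = sigma, lo80 = lo80, hi80 = hi80), 2)
```

The recovered bounds land within rounding distance of the table's Lo 80 and Hi 80 values, 27138.911 and 45710.81. A wider confidence level widens the range by a fixed multiple of the same forecast standard deviation.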

Thus, we were able to predict sales values using a machine learning model and also find the range at various confidence levels. A 95% interval (more precisely, a prediction interval) means that, if the model's assumptions hold, there is a 95% chance that the actual value falls between Lo 95 and Hi 95. In the next article, I develop an interactive tool using RShiny, which forecasts values at the click of a button. Do clap and comment if you liked the article. Thank you 🙂


Published via Towards AI
