Predicting Sales using R programming
Last Updated on July 24, 2023 by Editorial Team
Author(s): Suyash Maheshwari
Originally published on Towards AI.
In this article, I will forecast the sales of a multinational retail corporation. The dataset for this can be found on Kaggle. We have been provided with weekly sales data on which we train our models and use the best model to predict the weekly sales value for future dates.
Libraries required:
library(dplyr)
library(forecast)
library(reshape)
library(ggplot2)
library(tidyverse)
The original dataset contains data for around 45 stores and 99 departments. To simplify the calculation, I have considered only the data of the 1st store's 1st department. I have split the data into two parts, sample_train and sample_test, based on dates. The datasets contain the following fields:
- Store – the store number
- Dept – the department number
- Date – the week
- Weekly_Sales – sales for the given department in the given store
- IsHoliday – whether the week is a special holiday week
dept1_train <- train %>% filter(Store == "1" & Dept == "1")
dept1_test <- test %>% filter(Store == "1" & Dept == "1")
dept1_test$Weekly_Sales <- 0
dept1_train$Date <- as.Date(dept1_train$Date, format = "%Y-%m-%d")
sample_train <- dept1_train %>% filter(Date < as.Date("2012-02-06"))
sample_test <- dept1_train %>% filter(Date >= as.Date("2012-02-06"))

summary(sample_train)
     Store        Dept        Date                Weekly_Sales
 Min.   :1   Min.   :1   Min.   :2010-02-05   Min.   :14537
 1st Qu.:1   1st Qu.:1   1st Qu.:2010-08-06   1st Qu.:16329
 Median :1   Median :1   Median :2011-02-04   Median :18820
 Mean   :1   Mean   :1   Mean   :2011-02-04   Mean   :22777
 3rd Qu.:1   3rd Qu.:1   3rd Qu.:2011-08-05   3rd Qu.:23388
 Max.   :1   Max.   :1   Max.   :2012-02-03   Max.   :57258
  IsHoliday
 Mode :logical
 FALSE:97
 TRUE :8

summary(sample_test)
     Store        Dept        Date                Weekly_Sales
 Min.   :1   Min.   :1   Min.   :2012-02-10   Min.   :15723
 1st Qu.:1   1st Qu.:1   1st Qu.:2012-04-14   1st Qu.:16645
 Median :1   Median :1   Median :2012-06-18   Median :18243
 Mean   :1   Mean   :1   Mean   :2012-06-18   Mean   :21784
 3rd Qu.:1   3rd Qu.:1   3rd Qu.:2012-08-22   3rd Qu.:22057
 Max.   :1   Max.   :1   Max.   :2012-10-26   Max.   :57592
  IsHoliday
 Mode :logical
 FALSE:36
 TRUE :2
For sample_train, the dates range from 5th Feb 2010 to 3rd Feb 2012; for sample_test, from 10th Feb 2012 to 26th Oct 2012. I then create a time series from the Weekly_Sales column of sample_train.
Time series: a sequence of data points taken at successive, equally spaced points in time.
ts_train_uni <- ts(sample_train$Weekly_Sales , start = c(2010,5) , frequency = 52)
The starting point, c(2010, 5), is the fifth week of 2010, i.e. the first week of February 2010. frequency = 52 indicates that it is weekly data.
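As a quick sanity check on how start and frequency map to calendar time, here is a minimal sketch with a toy series (the numbers are illustrative, not the sales data):

```r
# A toy weekly series starting at week 5 of 2010, like ts_train_uni
x <- ts(1:104, start = c(2010, 5), frequency = 52)

# time() returns fractional years: week 5 corresponds to 2010 + 4/52
time(x)[1]   # 2010.077

# 104 points at 52 per year span exactly two "years" of weeks
end(x)       # year 2012, week 4
```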
The three models that I use to train my dataset are ARIMA, HoltWinters, and nnetar.
ARIMA: Auto-Regressive Integrated Moving Average
It describes the autocorrelation between data points and takes into account the differencing of the values. A stationary series is one whose statistical properties are constant over time. Most economic and market data show trends, so differencing is used to remove any trends or seasonal structure. The seasonal difference in this example is 1.
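To illustrate why differencing removes a trend, consider a toy series with a linear trend (made-up numbers, not the sales data): its first difference is a constant, i.e. a stationary series.

```r
# A trending series: y_t = 100 + 3*t
t <- 1:20
y <- 100 + 3 * t

# diff() computes y_t - y_{t-1}; the linear trend collapses to a constant
dy <- diff(y)
unique(dy)   # 3
```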
arima_model <- auto.arima(ts_train_uni, seasonal.test = "seas")
arima_pred <- forecast(arima_model, h = 38)
arima_pred <- as.data.frame(arima_pred$mean)
HoltWinters: The Holt-Winters forecasting algorithm allows users to smooth a time series and use that data to forecast areas of interest. Unknown parameters are determined by minimizing the squared prediction error.
holt_model <- HoltWinters(ts_train_uni)
p <- predict(holt_model , 38 , prediction.interval = TRUE)
p <- as.data.frame(p)
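HoltWinters() ships with base R's stats package, so its behaviour can be sketched on a built-in seasonal dataset such as AirPassengers, used here only as a stand-in for the sales series:

```r
# Fit Holt-Winters with level, trend, and seasonal components
hw <- HoltWinters(AirPassengers)

# Forecast 12 months ahead with 95% prediction intervals
p <- predict(hw, n.ahead = 12, prediction.interval = TRUE)
head(p, 2)   # columns: fit, upr, lwr
```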
nnetar: Feed-forward neural networks with a single hidden layer and lagged inputs for forecasting univariate time series.
neural <- nnetar(ts_train_uni)
neural_pred <- forecast(neural , h=38)
neural_pred <- as.data.frame(neural_pred)
After predicting the values with the help of various models, I use the add_column function from the tidyverse package to append predicted values to sample_test.
pred_data <- sample_test %>%
  add_column(arima_pred = arima_pred$x,
             holt_pred = p$fit,
             neural_pred = neural_pred$`Point Forecast`)
Then, I use the ggplot function to plot the values of weekly_sales data of sample_test and compare it against the values predicted by our models for the same period. This would help us in analyzing the results of our models.
pred_data %>%
  gather(key = "predictions", value = "value", -c(Store, Dept, IsHoliday, Date)) %>%
  ggplot(aes(x = Date, y = value, colour = predictions)) +
  geom_line() +
  scale_x_date(date_breaks = "4 week")
The actual weekly_sales data, represented by the purple line, has two peaks at the beginning, and the sales increase again at the end. The first peak has been captured by all the models. However, no model has captured the second peak accurately. The ARIMA and HoltWinters models follow the rest of the weekly_sales distribution, whereas the neural network model is far from the actual weekly data and can safely be rejected. To decide which model is the right choice, we compute the Root Mean Square Error (RMSE).
RMSE is a frequently used measure of the differences between values predicted by a model or an estimator and the values observed. It is the standard deviation of the residuals (prediction errors). It is calculated in the following manner:
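In formula terms, RMSE = sqrt(mean((actual - predicted)^2)). A minimal sketch with made-up numbers:

```r
# Hypothetical actual and predicted values
actual    <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 40)

# Root mean square error: square the residuals, average, take the root
rmse <- sqrt(mean((actual - predicted)^2))
rmse   # sqrt((4 + 4 + 9 + 0) / 4) = sqrt(4.25), about 2.06
```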
rmse <- numeric(ncol(pred_data))
for (j in 6:ncol(pred_data)) {
  error <- pred_data[4] - pred_data[j]   # actual Weekly_Sales minus prediction
  y <- error^2
  z <- colMeans(y)
  rmse[j] <- sqrt(z)
}

> rmse[6] # arima
9097.075
> rmse[7] # holtwinters
9106.965
> rmse[8] # nnetar
10123.48
As expected, the RMSE of ARIMA and HoltWinters is lower than that of nnetar. Since ARIMA has the lowest RMSE, I use this model to predict future values.
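Picking the winner programmatically, using the RMSE values reported above:

```r
# RMSE values from the loop above, named by model
rmse <- c(arima = 9097.075, holtwinters = 9106.965, nnetar = 10123.48)

# which.min() returns the position (and name) of the smallest error
names(which.min(rmse))   # "arima"
```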
Thus, the forecast for the dates ranging from 2nd Nov 2012 to 26th July 2013 is:
ts_train_uni <- ts(dept1_train$Weekly_Sales, start = c(2010, 5), frequency = 52)
arima_model <- auto.arima(ts_train_uni, seasonal.test = "seas")
arima_pred1 <- forecast(arima_model, h = 39)
arima_pred1 <- as.data.frame(arima_pred1$mean)
plot(forecast(arima_model, h = 39)) # arima plot
dept1_test$Weekly_Sales <- arima_pred1$x
dept1_test$Date <- as.Date(dept1_test$Date)
ggplot(dept1_test, aes(x = Date, y = Weekly_Sales)) +
  geom_line(color = "blue") +
  theme_classic() +
  scale_x_date(breaks = "4 weeks")
The sales are higher in the months of December, February, and April for this particular department. No information has been given about the category of products this department sells, but we have been given information about various holidays. The weeks that include these holidays are weighted five times higher in the evaluation than non-holiday weeks, and the discounts are massive during this period. Therefore, we see those spikes during these months.
The table below shows the head of the forecast, with 80% and 95% prediction intervals, for the ARIMA model.

Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
      36424.86 27138.911  45710.81 22223.227  50626.49
      18689.54  7514.632  29864.45  1598.993  35780.09
      19050.66  7875.752  30225.57  1960.113  36141.21
      20911.25  9736.342  32086.16  3820.703  38001.80
      25293.49 14118.582  36468.40  8202.943  42384.04
      33305.92 22131.012  44480.83 16215.373  50396.47
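The interval columns come from forecast(), but the same arithmetic can be reproduced in base R: predict() on an ARIMA fit returns standard errors, and each bound is the point forecast plus or minus a normal quantile times the standard error. A sketch on a small built-in series (lh, a stand-in for the sales data):

```r
# Fit a simple AR(1) model on a built-in series and forecast 6 steps ahead
fit <- arima(lh, order = c(1, 0, 0))
fc  <- predict(fit, n.ahead = 6)

# 95% bounds: point forecast +/- qnorm(0.975) standard errors
lo95 <- fc$pred - qnorm(0.975) * fc$se
hi95 <- fc$pred + qnorm(0.975) * fc$se
cbind(point = fc$pred, lo95, hi95)
```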
Thus, we were able to predict sales values using a machine learning model and also find the range at various confidence levels. A 95% interval indicates a 95% chance that the actual value falls between Lo 95 and Hi 95. In the next article, I develop an interactive tool using RShiny to forecast values at the click of a button. Do clap and comment if you liked the article. Thank you!