Predicting Sales using R programming

Last Updated on July 24, 2023 by Editorial Team

Author(s): Suyash Maheshwari

Originally published on Towards AI.

In this article, I will forecast the sales of a multinational retail corporation. The dataset can be found on Kaggle. We are provided with weekly sales data, on which we train our models, and we then use the best model to predict weekly sales for future dates.

Libraries required:

library(dplyr)      # data manipulation: filter() and the %>% pipe
library(forecast)   # time-series modelling: auto.arima(), nnetar(), forecast()
library(reshape)    # data reshaping
library(ggplot2)    # plotting
library(tidyverse)  # loads tibble (add_column()) and tidyr (gather()), among others

The original dataset contains data for around 45 stores and 99 departments. To simplify the calculation, I have considered only the data of the 1st store's 1st department. I have split the data into two parts, sample_train and sample_test, based on dates. The datasets contain the following fields:

  • Store – the store number
  • Dept – the department number
  • Date – the week
  • Weekly_Sales – sales for the given department in the given store
  • IsHoliday – whether the week is a special holiday week
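The snippets below assume the Kaggle CSVs have already been read into data frames named train and test; a minimal sketch (file names assumed to match the Kaggle download) might look like this:

train <- read.csv("train.csv", stringsAsFactors = FALSE)  # weekly sales with the Weekly_Sales column
test <- read.csv("test.csv", stringsAsFactors = FALSE)    # same fields, without Weekly_Sales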
dept1_train <- train %>% filter(Store == "1" & Dept == "1")
dept1_test <- test %>% filter(Store == "1" & Dept == "1")
dept1_test$Weekly_Sales <- 0  # placeholder, to be filled with predictions later
dept1_train$Date <- as.Date(dept1_train$Date, format = "%Y-%m-%d")
sample_train <- dept1_train %>% filter(Date < as.Date("2012-02-06"))
sample_test <- dept1_train %>% filter(Date >= as.Date("2012-02-06"))
summary(sample_train)
     Store        Dept         Date              Weekly_Sales     IsHoliday
 Min.   :1    Min.   :1    Min.   :2010-02-05    Min.   :14537    Mode :logical
 1st Qu.:1    1st Qu.:1    1st Qu.:2010-08-06    1st Qu.:16329    FALSE:97
 Median :1    Median :1    Median :2011-02-04    Median :18820    TRUE :8
 Mean   :1    Mean   :1    Mean   :2011-02-04    Mean   :22777
 3rd Qu.:1    3rd Qu.:1    3rd Qu.:2011-08-05    3rd Qu.:23388
 Max.   :1    Max.   :1    Max.   :2012-02-03    Max.   :57258

summary(sample_test)
     Store        Dept         Date              Weekly_Sales     IsHoliday
 Min.   :1    Min.   :1    Min.   :2012-02-10    Min.   :15723    Mode :logical
 1st Qu.:1    1st Qu.:1    1st Qu.:2012-04-14    1st Qu.:16645    FALSE:36
 Median :1    Median :1    Median :2012-06-18    Median :18243    TRUE :2
 Mean   :1    Mean   :1    Mean   :2012-06-18    Mean   :21784
 3rd Qu.:1    3rd Qu.:1    3rd Qu.:2012-08-22    3rd Qu.:22057
 Max.   :1    Max.   :1    Max.   :2012-10-26    Max.   :57592

For sample_train, the dates range from 5 Feb 2010 to 3 Feb 2012; for sample_test, from 10 Feb 2012 to 26 Oct 2012. I have created a time series from the weekly_sales data of sample_train.

Time series: a sequence of data points taken at successive, equally spaced points in time.

ts_train_uni <- ts(sample_train$Weekly_Sales, start = c(2010, 5), frequency = 52)

The starting point is the fifth week of 2010, i.e., the first week of February 2010. frequency = 52 indicates that it is weekly data.
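As a quick sanity check before modelling (assuming the forecast and ggplot2 packages loaded above), the series can be plotted directly:

autoplot(ts_train_uni)  # visualize the weekly sales series and its seasonality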

The three models I fit to this dataset are ARIMA, Holt-Winters, and nnetar.

ARIMA: Auto-Regressive Integrated Moving Average

It models the correlation between data points and takes the differencing of the values into account. A stationary series is one whose statistical properties stay constant over time. Most economic and market data show trends, so differencing is used to remove any trend or seasonal structure. The seasonal difference in this example is 1.
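The forecast package can suggest how many regular and seasonal differences are needed to make the series stationary; a quick check on the series built above might look like this:

ndiffs(ts_train_uni)   # suggested number of first differences
nsdiffs(ts_train_uni)  # suggested number of seasonal differences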

arima_model <- auto.arima(ts_train_uni, seasonal.test = "seas")
arima_pred <- forecast(arima_model, h = 38)  # 38 weeks, the length of sample_test
arima_pred <- as.data.frame(arima_pred$mean)

HoltWinters: The Holt-Winters forecasting algorithm smooths a time series and uses the smoothed components to forecast the values of interest. The unknown smoothing parameters are determined by minimizing the squared prediction error.

holt_model <- HoltWinters(ts_train_uni)
p <- predict(holt_model, 38, prediction.interval = TRUE)
p <- as.data.frame(p)
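The fitted object also exposes the estimated smoothing parameters, which show how much weight the model gives to recent observations versus the trend and seasonal components:

holt_model$alpha  # level smoothing parameter
holt_model$beta   # trend smoothing parameter
holt_model$gamma  # seasonal smoothing parameter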

nnetar: Feed-forward neural networks with a single hidden layer and lagged inputs for forecasting univariate time series.

neural <- nnetar(ts_train_uni)
neural_pred <- forecast(neural, h = 38)
neural_pred <- as.data.frame(neural_pred)
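Printing the fitted object shows which lags and how many hidden nodes nnetar selected, reported as an NNAR specification:

print(neural)  # e.g. an NNAR(p, P, k)[52] model summary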

After predicting the values with the various models, I use the add_column function from the tibble package (loaded as part of the tidyverse) to append the predicted values to sample_test.

pred_data <- sample_test %>% add_column(arima_pred = arima_pred$x, holt_pred = p$fit, neural_pred = neural_pred$`Point Forecast`)

Then, I use the ggplot function to plot the weekly_sales values of sample_test and compare them against the values predicted by our models for the same period. This helps us analyze the results of our models.

pred_data %>% gather(key = "predictions", value = "value", -c(Store, Dept, IsHoliday, Date)) %>%
  ggplot(aes(x = Date, y = value, colour = predictions)) + geom_line() + scale_x_date(date_breaks = "4 week")

The actual weekly_sales data, represented by the purple line, has two peaks at the beginning, and sales increase again at the end. The first peak has been captured by all the models, but no model has captured the second peak accurately. The ARIMA and Holt-Winters models track the rest of the weekly_sales distribution, whereas the neural network model is far from the actual weekly data and can safely be rejected. To decide which model is the right choice, we compute the Root Mean Square Error (RMSE).

RMSE is a frequently used measure of the difference between the values predicted by a model or an estimator and the values observed. It is the standard deviation of the residuals (prediction errors). It is calculated as follows:

rmse <- numeric(ncol(pred_data))          # storage for one RMSE per prediction column
for (j in 6:ncol(pred_data)) {            # columns 6-8 hold the model predictions
  error <- pred_data[4] - pred_data[j]    # actual Weekly_Sales (column 4) minus prediction
  y <- error^2
  z <- colMeans(y)
  rmse[j] <- sqrt(z)
}
> rmse[6]  # ARIMA
9097.075
> rmse[7]  # Holt-Winters
9106.965
> rmse[8]  # nnetar
10123.48
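As a cross-check, the accuracy() function from the forecast package reports RMSE (among other error measures) directly; one way to verify the ARIMA figure would be:

accuracy(forecast(arima_model, h = 38), sample_test$Weekly_Sales)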

As expected, the RMSE of ARIMA and Holt-Winters is lower than that of nnetar. Since ARIMA has the lowest RMSE, I use this model to predict the future values.

Thus, to forecast the dates ranging from 2 Nov 2012 to 26 Jul 2013, I rebuild the time series from the full dept1_train data and refit ARIMA:

ts_train_uni <- ts(dept1_train$Weekly_Sales, start = c(2010, 5), frequency = 52)

arima_model <- auto.arima(ts_train_uni, seasonal.test = "seas")
arima_pred1 <- forecast(arima_model, h = 39)   # 39 weeks ahead: 2 Nov 2012 to 26 Jul 2013
arima_pred1 <- as.data.frame(arima_pred1$mean)
plot(forecast(arima_model, h = 39))            # ARIMA forecast plot
dept1_test$Weekly_Sales <- arima_pred1$x       # fill the test set with the predicted values
dept1_test$Date <- as.Date(dept1_test$Date)
ggplot(dept1_test, aes(x = Date, y = Weekly_Sales)) + geom_line(color = "blue") + theme_classic() +
  scale_x_date(breaks = "4 weeks")
Figure: predicted weekly sales values for the forecast period

Sales are higher in December, February, and April for this particular department. No information is given about the category of products this department sells, but we do know which weeks contain holidays. Holiday weeks are weighted five times higher in the evaluation than non-holiday weeks, and discounts are at their largest during these periods, which is why we see spikes in those months.
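For reference, that weighting corresponds to a weighted mean absolute error in which holiday weeks get a weight of 5 and all other weeks a weight of 1; a minimal sketch (the helper name wmae is my own) could be:

wmae <- function(actual, predicted, is_holiday) {
  w <- ifelse(is_holiday, 5, 1)               # holiday weeks count five times as much
  sum(w * abs(actual - predicted)) / sum(w)   # weighted mean absolute error
}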

The table below shows the head of the forecast ranges at the 80% and 95% confidence levels for the ARIMA model.

Point Forecast      Lo 80     Hi 80      Lo 95     Hi 95
      36424.86  27138.911  45710.81  22223.227  50626.49
      18689.54   7514.632  29864.45   1598.993  35780.09
      19050.66   7875.752  30225.57   1960.113  36141.21
      20911.25   9736.342  32086.16   3820.703  38001.80
      25293.49  14118.582  36468.40   8202.943  42384.04
      33305.92  22131.012  44480.83  16215.373  50396.47
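These columns come straight from the forecast object; converting it to a data frame (using the refitted arima_model above) should reproduce them:

head(as.data.frame(forecast(arima_model, h = 39)))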

Thus, we were able to predict sales values using a machine learning model and also obtain the forecast ranges at various confidence levels. A 95% confidence interval indicates that there is a 95% chance the actual value will lie between Lo 95 and Hi 95. In the next article, I develop an interactive tool using R Shiny, which can be used to forecast values at the click of a button. Do clap and comment if you liked the article. Thank you 🙂
