

Regression, Personalisation, and the Kaggle Syndrome

Last Updated on November 9, 2023 by Editorial Team

Author(s): Kelvin Lu

Originally published on Towards AI.

Photo by Artem Beliaikin on Unsplash

Recently, I worked on a prediction case study using the Kaggle Black Friday Prediction dataset, which was created six years ago and has been downloaded over 32,000 times. While there are over 100 publicly available notebooks on Kaggle for this dataset, and even more available elsewhere, I found that most of these solutions were poorly implemented.

It is not helpful to simply point out the flaws in others’ work. However, when many people make the same mistakes, it is worth investigating the underlying patterns. In this post, I will discuss the common problems with existing solutions, explain why I am no longer a fan of Kaggle, propose a better solution, and outline a personalized prediction approach.


· The Missed Goal
· The Kaggle Syndrome
· Regression That Works
· Personalisation
· Parting Words
· References


Although the Kaggle Black Friday Prediction dataset is popular, its purpose is unclear, and there is no data dictionary to explain the data in detail. Before we can do any further analysis, we need to understand the dataset’s goal, how it was prepared, and why it was designed in a particular way. This information is essential for feature engineering, model selection, and evaluation downstream. In real-world machine learning projects, this preliminary analysis is also important because the best machine learning solutions can only be built on a deep understanding of the data.

Let’s skip over the EDA. The sample data looks like the following:

The dataset has 537577 rows and 12 columns divided into the user profile feature group and product feature group, as described below:

User feature group:

  • User_ID: Unique ID of the user. There are a total of 5891 users in the dataset.
  • Gender: indicates the gender of the person making the transaction in the format of M/F.
  • Age: indicates the age group of the person making the transaction.
  • Occupation: shows the occupation of the user, labelled with numbers 0 to 20.
  • City_Category: The user’s living city category. Cities are categorised into three different categories: ‘A’, ‘B’, and ‘C’.
  • Stay_In_Current_City_Years: Indicates how long the user has lived in this city.
  • Marital_Status: is 0 if the user is not married and 1 otherwise.

Product feature group:

  • Product_ID: Unique ID of the product. There are a total of 3623 products in the dataset.
  • Product_Category_1 to _3: Category of the product. All three are labelled with numbers.


Target:

  • Purchase: Purchase amount.

As shown in the profile of the dataset, there are both integer and categorical features. 31.6% of the Product_Category_2 values are missing, and the missing value rate of Product_Category_3 is 69.7%.

There are some interesting details in the data. For example, Product_Category_1 is always greater than Product_Category_2, and Product_Category_2 is always greater than Product_Category_3. This implies logical connections between these three features, but we cannot tell what they are because the dataset does not explain them. The three categories are probably time-bound, for example shipping time vs. shelf time vs. off-shelf time. However, that is only a guess, so we can’t fully utilise this pattern.

Because the goal of the dataset was not described, let’s put on our detective hats and ask, “Where does the dataset designer hide their true intentions?” The answer is in the test set. After all, the test set is how the dataset designer evaluates model performance. So let’s compare the train and test sets and see what we can find.

As it turns out, by comparing the training and test sets, it is clear that the test set contains all users and products from the training set, but no user-product pairs from the test set are shared with the training set. This means that the prediction is actually a personalized recommendation system, not just a regular regression. In other words, it asks the analysts to predict every user's purchase of other products based on the user’s previous purchase information.
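The comparison described above can be sketched with pandas. This is a minimal illustration on toy data, not the author's actual analysis: the two small frames and their values are hypothetical stand-ins for the real train and test files, and only the column names User_ID and Product_ID come from the dataset.

```python
import pandas as pd

# Toy stand-ins for the real train and test files (hypothetical values).
train = pd.DataFrame({
    "User_ID": [1, 1, 2, 3],
    "Product_ID": ["P1", "P2", "P1", "P3"],
})
test = pd.DataFrame({
    "User_ID": [1, 2, 3],
    "Product_ID": ["P3", "P2", "P1"],
})

# Share of test users and test products that also appear in the training set.
users_seen = test["User_ID"].isin(train["User_ID"]).mean()
products_seen = test["Product_ID"].isin(train["Product_ID"]).mean()

# (user, product) pairs that appear in both sets.
shared_pairs = test.merge(train, on=["User_ID", "Product_ID"], how="inner")

print(f"test users seen in train:    {users_seen:.0%}")
print(f"test products seen in train: {products_seen:.0%}")
print(f"shared user-product pairs:   {len(shared_pairs)}")
```

Full overlap on users and products combined with zero shared pairs is the signature of a task that asks for known users' purchases of products they have not bought yet.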

The Missed Goal

The difference between regression and personalization is that while regression models learn global patterns, personalization models learn the interactions between users and products. Regular regression models can still learn the interactions when the data size is small. The real problem becomes significant when the data size gets larger.

In order for the model to learn about personalized interactions, both the user and the product features must be treated as categorical features. In most cases, both of them have large numbers of levels. In our case of Black Friday Prediction, there are 5891 users and 3623 products. This is a tiny dataset by recommendation standards, yet it is already outside the comfort zone of regular regression models.

The standard method for regression models to deal with categorical features is one-hot encoding, which turns each categorical level into a new column. That technique doesn’t work for the personalization task. One-hot encoding the two ID columns adds 5891 + 3623 = 9514 very sparse columns, and the user-product interactions we actually care about span 5891 x 3623, or about 21.3 million, possible pairs, far more than the 537577 rows in the dataset. At most about 2.5% of those pairs are observed, so the majority of the cells would be empty. This makes the computation very challenging and, more importantly, general regression models can learn very little from such an array because of the curse of dimensionality.
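The size of the problem can be checked with a few lines of arithmetic. The counts below are the ones quoted in this post; the density figure is an upper bound that assumes every row is a distinct user-product pair.

```python
n_users = 5891        # users in the dataset
n_products = 3623     # products in the dataset
n_rows = 537_577      # observed purchases (rows)

one_hot_columns = n_users + n_products      # one column per ID level
interaction_cells = n_users * n_products    # one cell per user-product pair
max_density = n_rows / interaction_cells    # upper bound on the share of filled cells

print(one_hot_columns)        # 9514
print(interaction_cells)      # 21343093
print(f"{max_density:.2%}")   # 2.52%
```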

To avoid the high-cardinality dimension problem, all the open analyses either dropped the product column, the user column, or both. To drop the product column is equivalent to saying I want to predict the users’ behavior in the specific product category and to drop the user column is equivalent to saying I want to predict the particular product’s sale in the user group that shares certain demographic features. Dropping both user and product features would mean that I want to predict the purchase of a certain product category in a certain user group. You can see that they are not the kind of single-user-to-single-product prediction at all. Without a sharp understanding of the goal, all the work will fail.

The Kaggle Syndrome

I noticed that all the open analyses failed at their starting point: they didn’t try hard enough to find out the purpose of the dataset. They limited their EDA to the training data without extending the analysis to the test set, and they treated the task as a regular regression problem without fully understanding it. I have found this to be a common problem on Kaggle.

I was a Kaggle enthusiast at the start of my machine-learning journey, but my interest faded quickly. I still browse Kaggle for technical ideas from time to time, but I’m no longer a big fan of it. One reason is that I couldn’t rank highly on the leaderboard. Another reason is that I found the style of Kaggle competitions to be incompatible with real-world machine learning practices.

Every Kaggle competitor is hyper-focused on squeezing the last drop of performance out of their models. In the real world, the priority is reversed: delivering business value matters far more. In many cases, model performance is a less important requirement. If your model fits the business well, no one will question why the accuracy is 0.95 rather than 0.96 unless the performance is pivotal to the project. Projects that are very sensitive to model performance are rare.

Another issue with the competition is the lack of interaction. Once a dataset is given, that’s it. The participants have little chance to ask questions or adjust the requirements. In the Black Friday dataset, for instance, we can see some interesting patterns; however, we can’t see why they exist, let alone utilize them. In real-world projects, asking questions is the simplest strategy to prevent your project from failing.

More importantly, Kaggle competitions build the habit of working ‘out of touch’ with the business context.

I know that some employers hire Kaggle competition winners. That is one of the stimuli for the participants. However, I am afraid that Kaggle-styled machine learning may be harmful in real business. One such example is Zillow. Zillow sponsored a Kaggle house price prediction competition and offered work opportunities to the winners. Zillow benefited a lot from machine learning when the property market continuously went up. Zillow nearly crashed when the property market changed direction.

What went wrong? It’s fair to say that the problem was that Zillow data scientists weren’t business-aware enough. Otherwise, they would have spent some time researching topics like the turning point of the market and how much risk the business would carry if the market turned. They would not have been as unprepared for the market downturn as we have seen.

I would encourage beginning machine learning learners who are spending days and nights on Kaggle to spare some time thinking of business scenarios and develop the acumen to apply machine learning in the real world. It is a soft skill that no one can teach. But just start by asking more questions, and we will get there.

Regression That Works

Let’s see how we can implement a regular regression model to get closer to what we need. As we discussed, one major problem with doing personalization in a general regression way is the high-cardinality dimensions. The traditional one-hot encoding doesn’t work. Label encoding doesn’t look good because it implies the label of the feature has an ordinal relationship with the target. We don’t like binning or hashing either. Let’s try a different encoding: target encoding.

Instead of replacing the category levels with integers, target encoding replaces them with the mean target values. It solves the high-dimension problem nicely and carries information about the target into the encoded values.
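The idea can be shown in a few lines of pandas. This is a minimal sketch on hypothetical toy data: it uses the plain per-level mean, whereas the category_encoders TargetEncoder used in this post additionally smooths each level mean toward the global mean.

```python
import pandas as pd

# Toy training data (hypothetical values).
train = pd.DataFrame({
    "Product_ID": ["P1", "P1", "P2", "P2", "P2"],
    "Purchase":   [100.0, 300.0, 50.0, 70.0, 90.0],
})
test = pd.DataFrame({"Product_ID": ["P1", "P2", "P9"]})  # P9 is unseen in training

# Learn one number per category level: the mean target over the training rows.
level_means = train.groupby("Product_ID")["Purchase"].mean()
global_mean = train["Purchase"].mean()

# Replace each level by its mean; unseen levels fall back to the global mean.
test["Product_ID_enc"] = test["Product_ID"].map(level_means).fillna(global_mean)
print(test)
```

The encoding is learned from the training rows only and then applied to the test rows, which keeps the test targets out of the features.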

Before we jump into the model training, we cast all 11 features as categorical and target-encode all of them. We also apply sqrt() to the target to make its distribution less skewed, then shift and scale it into roughly the range [0.0, 10.0], because some of the algorithms I experimented with only accept target values in that range.

The code is as follows:

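# Imports
from math import sqrt

from category_encoders import TargetEncoder
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

random_seed = 42  # assumed value; the seed used in the original run is not shown

# df_blackfriday_data is assumed to hold the raw training data.
# Cast all 11 features to strings so that TargetEncoder treats every column as categorical
X = df_blackfriday_data.drop('Purchase', axis=1).astype(str)
# Target transformation: sqrt, then shift and scale into roughly [0, 10]
y = (df_blackfriday_data['Purchase'].pow(1/2) - 3.464) / 15

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_seed, test_size=0.25)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

# Target encoding, fitted on the training split only
encoder = TargetEncoder()
enc = encoder.fit(X=X_train, y=y_train)

df_train_X = enc.transform(X_train)
df_test_X = enc.transform(X_test)

# Model training
xgb_reg = XGBRegressor(random_state=random_seed)
xgb_reg.fit(df_train_X, y_train)
xgb_y_pred = xgb_reg.predict(df_test_X)

# Map scaled values back to purchase amounts (inverse of the target transformation)
def to_purchase(y_scaled):
    return (y_scaled * 15 + 3.464) ** 2

print('Scaled RMSE:', sqrt(mean_squared_error(y_test, xgb_y_pred)))
print('RMSE on the original test data is', sqrt(mean_squared_error(to_purchase(y_test), to_purchase(xgb_y_pred))))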

The above code produces two RMSE values: the scaled RMSE is computed against the transformed target, and the other against the original target. As it turns out, the performance was as follows:

Scaled RMSE: 0.8845836109788747

RMSE on the original test data is 2522.1269855375003
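The link between the two numbers is the target transformation and its inverse. The sketch below checks that round trip on its own; the constants 3.464 and 15 are the ones used in this post's code, while the function names and the sample purchase amounts are hypothetical.

```python
import numpy as np

OFFSET, SCALE = 3.464, 15.0   # constants taken from the code in this post

def to_scaled(purchase):
    """sqrt-transform, then shift and scale."""
    return (np.sqrt(purchase) - OFFSET) / SCALE

def to_purchase(scaled):
    """Inverse of to_scaled."""
    return (scaled * SCALE + OFFSET) ** 2

purchase = np.array([12.0, 5000.0, 20000.0])   # hypothetical purchase amounts
scaled = to_scaled(purchase)
restored = to_purchase(scaled)

print(scaled.round(3))
print(np.allclose(restored, purchase))
```

Because the transformation is non-linear, the scaled RMSE cannot be converted into the original-scale RMSE directly. The predictions have to be mapped back first and the error recomputed, which is what the second print statement in the code above does.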

Its performance tops all the open analysis results. From the feature importance plot, we can clearly understand the reason:

The strongest features turned out to be the user_id and product_id features, which most analyses dismiss. Target encoding is not a secret weapon; it was the insight leading to the right decision that made the difference. Let’s see how to improve even further.


Personalisation

The above regression solution employs the target encoding trick to deal with the high-cardinality dimensions. Target encoding may not be ideal because it blurs the difference between users or products that have similar target mean values. Another family of technologies, known as recommendation systems, tackles the huge user-product array directly, and we can use it to provide personalized predictions. In a typical recommendation system, the input data structure is a huge 2D array, with the user as one dimension, the item as the other, and the purchase as the cell value.

The most classical approach for recommendation and personalization analysis is collaborative filtering. It scores the similarity between users or between items and uses those scores to predict users’ preferences. You will see charts like the following quite often if you google recommendation or collaborative filtering because the ideas were popularised by a movie recommendation competition. In that scenario, the concepts are user, movie, and rating, which are equivalent to user, product, and purchase in our case.
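A user-based variant of this idea can be sketched with NumPy. The rating matrix is hypothetical, and the sketch makes two simplifying assumptions: unrated cells are stored as 0 and take part in the cosine similarity as zeros, and the prediction is a plain similarity-weighted average. Production systems typically centre the ratings and restrict the average to the nearest neighbours.

```python
import numpy as np

# Toy user x item rating matrix; 0 marks "not rated" (hypothetical values).
R = np.array([
    [5.0, 4.0, 0.0],   # user 0 has not rated item 2
    [5.0, 5.0, 4.0],   # user 1
    [1.0, 2.0, 5.0],   # user 2
])

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(R, user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    num, den = 0.0, 0.0
    for other in range(R.shape[0]):
        if other == user or R[other, item] == 0:
            continue
        sim = cosine(R[user], R[other])
        num += sim * R[other, item]
        den += abs(sim)
    return num / den if den else float("nan")

pred = predict(R, user=0, item=2)
print(round(pred, 2))
```

User 0 rates the first two items much like user 1 does, so the prediction for item 2 lands closer to user 1's rating of 4 than to user 2's rating of 5.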

Quite a few other technologies have emerged since then. One of them is matrix factorization, which decomposes the huge matrix into the product of a few low-rank matrices. Another prominent solution is Deep Learning.
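Matrix factorization can be illustrated on a toy matrix. The sketch below is one common variant, not the method used in this post: it learns rank-2 user and item factors by stochastic gradient descent on the observed cells only. The ratings, the rank, the learning rate, and the regularisation strength are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy user x item matrix; np.nan marks unobserved cells (hypothetical values).
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 1.0, 1.0],
    [1.0, 1.0, 5.0, np.nan],
    [np.nan, 1.0, 4.0, 5.0],
])
observed = [(u, i) for u, i in zip(*np.where(~np.isnan(R)))]

k = 2                                              # rank of the factorization
P = rng.normal(scale=0.1, size=(R.shape[0], k))    # user factors
Q = rng.normal(scale=0.1, size=(R.shape[1], k))    # item factors

def rmse():
    """RMSE over the observed cells only."""
    return np.sqrt(np.mean([(R[u, i] - P[u] @ Q[i]) ** 2 for u, i in observed]))

rmse_before = rmse()
lr, reg = 0.05, 0.01
for _ in range(500):
    for u, i in observed:
        err = R[u, i] - P[u] @ Q[i]
        p_old = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_old - reg * Q[i])
rmse_after = rmse()

print(f"RMSE on observed cells: {rmse_before:.3f} -> {rmse_after:.3f}")
print(np.round(P @ Q.T, 1))   # reconstructed matrix, including the unobserved cells
```

The product P @ Q.T fills in every cell, so the unobserved user-item pairs receive predictions as a by-product of fitting the observed ones.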

Let’s try a baseline DNN model using FastAI. Please note that the model training uses only the user and product features; all other features are ignored.

import warnings
from math import sqrt

import pandas as pd
from fastai.collab import CollabDataLoaders, collab_learner
from sklearn.metrics import mean_squared_error

# trainset and testset are assumed to be the train/test splits of the raw dataframe.

def to_scaled(purchase):
    # Same target transformation as in the regression experiment
    return (purchase.pow(1/2) - 3.464) / 15

# Construct the user-item-rating dataframes
ratings = pd.DataFrame({'item': list(trainset.Product_ID),
                        'user': list(trainset.User_ID),
                        'rating': list(to_scaled(trainset.Purchase))})

ratings_test = pd.DataFrame({'item': list(testset.Product_ID),
                             'user': list(testset.User_ID),
                             'rating': list(to_scaled(testset.Purchase))})

# Data loaders. This step is not shown in the original post; it is an assumed
# reconstruction that relies on fastai's default batch size and validation split.
dls = CollabDataLoaders.from_df(ratings, user_name='user', item_name='item', rating_name='rating')

# Model training
learn = collab_learner(dls, n_factors=160, use_nn=True, y_range=(0, 10))

with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # suppress warnings during training
    learn.fit_one_cycle(5, 5e-3, wd=0.1)

# Evaluation
dl = learn.dls.test_dl(ratings_test, with_labels=True)
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    preds, _ = learn.get_preds(dl=dl)

testset['y'] = to_scaled(testset['Purchase'])
testset['y_pred'] = [x.tolist()[0] for x in preds]

print('Scaled RMSE', sqrt(mean_squared_error(testset['y'], testset['y_pred'])))

# Map the predictions back to purchase amounts before computing the second RMSE
testset['Purchase_pred'] = (testset.y_pred*15 + 3.464) ** 2
print('RMSE on the original test set', sqrt(mean_squared_error(testset['Purchase'], testset['Purchase_pred'])))

The outcome was the following:

Scaled RMSE 0.8624311160502426

RMSE on the original test set 2460.1524061340824

This result is better than the previous XGBoost model based on target-encoded user_id and product_id: the RMSE on the original scale drops from about 2522 to about 2460, roughly 2.5% lower. The trained DNN model is incredibly simple:

The model uses embedding to transform the user_id and product_id and uses only one linear layer. You may wonder why we have to drop all the user properties and product information. Can we include all those features in the prediction?

This is a valid question. In real business scenarios, one downside of the typical recommendation system is that we don’t have control over the pattern learned. It’s more like a black box: no transparency, no handle to manipulate the result. It also cannot deal with a new user or new product that shares nothing with other records. This is called the cold-start problem. It requires more effort to solve in a collaborative filtering or matrix factorization structure because those technologies can’t deal with multivariate data. It is much easier to address with the DNN model: we add the embeddings of other features as inputs, modify the model parameters, and we are done.
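The extension described above can be sketched in plain PyTorch. This is a hypothetical, minimal model, not the network that FastAI builds: the class name PurchaseNet, the shared embedding size, and the choice of Gender and City_Category as extra inputs are assumptions made for illustration. The level counts (5891 users, 3623 products, 2 genders, 3 city categories) come from the dataset description.

```python
import torch
from torch import nn

class PurchaseNet(nn.Module):
    """Embeds every categorical input, concatenates the embeddings, and maps
    them through one linear layer to a value squashed into y_range."""

    def __init__(self, cardinalities, emb_dim=16, y_range=(0.0, 10.0)):
        super().__init__()
        # One embedding table per categorical feature.
        self.embeddings = nn.ModuleList(
            [nn.Embedding(n_levels, emb_dim) for n_levels in cardinalities]
        )
        self.head = nn.Linear(emb_dim * len(cardinalities), 1)
        self.y_min, self.y_max = y_range

    def forward(self, x):
        # x: integer tensor of shape (batch, n_features), one column per feature.
        embedded = [emb(x[:, j]) for j, emb in enumerate(self.embeddings)]
        out = self.head(torch.cat(embedded, dim=1))
        # Sigmoid rescaled to the target range.
        return self.y_min + (self.y_max - self.y_min) * torch.sigmoid(out)

# Users and products, plus two extra features: gender (2 levels), city category (3 levels).
model = PurchaseNet(cardinalities=[5891, 3623, 2, 3])
batch = torch.tensor([[0, 10, 1, 2], [5890, 3622, 0, 0]])   # integer-coded rows
output = model(batch)
print(output.shape)   # torch.Size([2, 1])
```

Adding another feature only means appending its number of levels to the cardinalities list. A new user would still need some fallback index for User_ID, but the demographic embeddings would then give the model something to work with.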

Parting Words

In our experiment, we found that there's rich information in the seemingly barren user_id and product_id features. We can uncover very useful patterns in users’ behavior based only on their own purchase history and nothing else. We didn’t extend our effort to hyperparameter tuning or any performance enhancement technique. Once we get on the right track, achieving good results becomes much easier.

In this case study, the most important thing is to understand the context and select the right technology. Business acumen played an important part in this analysis. In fact, business problems serve as the inspiration for the majority of machine learning solutions, such as the collaborative filtering systems developed to address problems with movie recommendations. Likewise, business understanding is a good accelerator for learning machine learning.


References

Black Friday (Kaggle dataset)

