What is K-Fold Cross Validation?

Last Updated on July 25, 2023 by Editorial Team

Author(s): Saikat Biswas

Originally published on Towards AI.

Image Source: Unsplash

Importance of K-Fold Cross Validation in Machine Learning

One of the most important steps before feeding the data to our model

Right before we proceed with training a model on our data, we often carry out cross-validation, and this is a very important step in any machine learning pipeline.

Here, in this article, we will look in more detail at where K-Fold comes to the fore and why it is such an important procedure for evaluating a model on various random samples of the data.

Cross-validation is a statistical method used to estimate the skill of machine learning models: a resampling procedure for evaluating a model on a limited data sample.

It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.

KFold provides train/test indices to split data into train and test sets. It splits the dataset into k consecutive folds (without shuffling by default). Each fold is then used as a validation set once while the k-1 remaining folds form the training set.
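
For instance, here is a minimal sketch of how KFold yields those indices; the toy array X is purely illustrative:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)  # 6 samples, 2 features

kf = KFold(n_splits=3)  # no shuffling by default
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
# Fold 0: train=[2 3 4 5], test=[0 1]
# Fold 1: train=[0 1 4 5], test=[2 3]
# Fold 2: train=[0 1 2 3], test=[4 5]
```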

The general procedure is as follows:

  1. Shuffle the dataset randomly.
  2. Split the dataset into k groups.
  3. For each unique group:
     • Take the group as a holdout or test data set.
     • Take the remaining groups as a training data set.
     • Fit a model on the training set and evaluate it on the test set.
     • Retain the evaluation score and discard the model.
  4. Summarize the skill of the model using the sample of model evaluation scores.
General Working of K Fold
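
As a sketch of the procedure above written out by hand, assuming an arbitrary toy dataset (make_classification) and model (LogisticRegression), chosen only for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=100, random_state=0)

k = 5
rng = np.random.default_rng(0)
indices = rng.permutation(len(X))      # step 1: shuffle the dataset
folds = np.array_split(indices, k)     # step 2: split into k groups

scores = []
for i in range(k):                     # step 3: each group is the holdout once
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(f"mean accuracy: {np.mean(scores):.3f}")  # step 4: summarize the scores
```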

Configuration of K Fold

The k value must be chosen carefully for our data sample.

A poorly chosen value for k may give a misrepresentative idea of the skill of the model, such as a score with a high variance (one that changes a lot based on the data used to fit the model) or a high bias (such as an overestimate of the skill of the model).

Now, there are three common tactics for choosing a value for k, which are as follows:

  • Representative: The value for k is chosen such that each train/test group of data samples is large enough to be statistically representative of the broader dataset.
  • k=10: The value for k is fixed to 10, a value that has been found through experimentation to generally result in a model skill estimate with low bias and a modest variance.
  • k=n: The value for k is fixed to n, where n is the size of the dataset to give each test sample an opportunity to be used in the holdout dataset. This approach is called leave-one-out cross-validation.
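
As a hedged sketch of the last two tactics, using scikit-learn's cross_val_score on a toy dataset (load_iris and LogisticRegression are arbitrary choices here):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# k=10: the common default
k10 = cross_val_score(model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))

# k=n: leave-one-out, one test sample per fold (n model fits in total)
loo = cross_val_score(model, X, y, cv=LeaveOneOut())

print(f"k=10 mean accuracy: {k10.mean():.3f}")
print(f"k=n  mean accuracy: {loo.mean():.3f}")
```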

Types Of K Fold

Stratified K Fold:

  • StratifiedKFold is a variation of KFold. First, StratifiedKFold shuffles the data (when asked to), then splits it into n_splits parts, and it is done. It then uses each part as a test set exactly once. Note that it only ever shuffles the data once, before splitting.

With shuffle=True, the shuffling is controlled by our random_state; if none is given, NumPy's global random state is used by default (and with shuffle=False, the data is not shuffled at all). For example, with n_splits=4 and 3 classes (labels) in y (the dependent variable), the 4 test sets cover all the data without any overlap.

Stratified K Fold in action
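
A small sketch along the lines of that example, assuming a toy 3-class label array so each of the 4 test sets keeps the class proportions:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((12, 1))                      # features are irrelevant to the split
y = np.array([0] * 4 + [1] * 4 + [2] * 4)  # 3 classes, 4 samples each

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    test_idx = np.sort(test_idx)
    print(f"Fold {fold}: test={test_idx}, labels={y[test_idx]}")
# Each test set holds one sample per class, and together the 4 test
# sets cover all 12 samples exactly once, with no overlap.
```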

Stratified Shuffle Split:

  • StratifiedShuffleSplit is a variation of ShuffleSplit. First, StratifiedShuffleSplit shuffles our data and then splits it, picking one part to use as a test set. However, it's not done yet: it repeats the same shuffle-and-split process n_splits-1 more times to get n_splits-1 other test sets. If we look at the picture below, with the same data, the 4 test sets do not cover all the data, i.e., there are overlaps among the test sets.
StratifiedShuffleSplit in action

So, the difference here is that StratifiedKFold just shuffles and splits once, therefore the test sets do not overlap, while StratifiedShuffleSplit shuffles each time before splitting, and it splits n_splits times so that the test sets can overlap.
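
A short sketch of this difference on toy data (the sizes and random_state values are illustrative only):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

X = np.zeros((12, 1))
y = np.array([0] * 6 + [1] * 6)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
sss = StratifiedShuffleSplit(n_splits=4, test_size=0.25, random_state=0)

kfold_tests = [set(test) for _, test in skf.split(X, y)]
shuffle_tests = [set(test) for _, test in sss.split(X, y)]

# StratifiedKFold: test sets are disjoint and cover every sample once.
print(set.union(*kfold_tests) == set(range(12)))                    # True
print(all(a.isdisjoint(b) for i, a in enumerate(kfold_tests)
          for b in kfold_tests[i + 1:]))                            # True

# StratifiedShuffleSplit: each test set is drawn fresh, so test sets
# may overlap and need not cover every sample.
print(set.union(*shuffle_tests) == set(range(12)))  # not guaranteed to be True
```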

Conclusion

There is a bias-variance trade-off associated with the choice of k in k-fold cross-validation. We typically perform k-fold cross-validation with k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

If a value for k is chosen that does not evenly divide the data sample, then one group will contain the remainder of the examples. It is preferable to split the data sample into k groups with the same number of samples, so that the model skill scores are all comparable.
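
For instance, in a quick sketch with a toy array, 10 samples split into k=3 folds cannot divide evenly, so one fold receives the remainder:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)  # 10 samples do not split evenly into 3 folds
for train_idx, test_idx in KFold(n_splits=3).split(X):
    print(len(test_idx))  # prints 4, 3, 3: one fold gets the extra sample
```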

Further Reading

A Gentle Introduction to k-fold Cross-Validation (machinelearningmastery.com)

Improve Your Model Performance using Cross Validation (in Python and R) (www.analyticsvidhya.com)

That’s all in this one.

Until next time…!!


Published via Towards AI
