What is K-Fold Cross Validation?
Last Updated on July 25, 2023 by Editorial Team
Author(s): Saikat Biswas
Originally published on Towards AI.
Importance of K-Fold Cross Validation in Machine Learning
One of the most important steps before feeding the data to our model
Right before we train a model on our data, we often run a cross-validation step, and it is an important part of any machine learning pipeline.
In this article, we will look in more detail at where K-Fold comes into play and why it is an important procedure for evaluating a model on varied, random samples of the data.
Cross-validation is a statistical method used to estimate the skill of machine learning models. Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.
It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.
KFold provides train/test indices to split data into train and test sets. It splits the dataset into k consecutive folds (without shuffling by default). Each fold is then used as a validation set once, while the k-1 remaining folds form the training set.
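As a quick sketch (the toy array and n_splits=3 below are only for illustration), scikit-learn's KFold yields exactly these train/test indices:

```python
import numpy as np
from sklearn.model_selection import KFold

# A small illustrative dataset: 6 samples, 1 feature each
X = np.arange(6).reshape(6, 1)

# 3 consecutive folds, no shuffling (the default behaviour)
kf = KFold(n_splits=3)

for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # Each fold serves as the validation set exactly once
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
```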
The general procedure is as follows:
- Shuffle the dataset randomly.
- Split the dataset into k groups
- For each unique group:
- Take the group as a holdout or test data set
- Take the remaining groups as a training data set
- Fit a model on the training set and evaluate it on the test set
- Retain the evaluation score and discard the model
- Summarize the skill of the model using the sample of model evaluation scores
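The steps above map almost directly onto a manual loop with scikit-learn. The snippet below is only a sketch; the dataset, model, and metric (load_iris, LogisticRegression, accuracy) are placeholders for whatever you are actually working with:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# Shuffle once, then split into k groups
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in kf.split(X):
    # Hold out one group, train on the remaining k-1 groups
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # Retain the evaluation score and discard the model
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# Summarize the skill of the model using the sample of scores
print(f"Accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

In practice, sklearn.model_selection.cross_val_score wraps this whole loop in a single call.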
Configuration of K Fold
The k value must be chosen carefully for our data sample.
A poorly chosen value for k may give a misleading idea of the skill of the model, such as a score with high variance (one that changes a lot depending on the data used to fit the model) or high bias (such as an overestimate of the skill of the model).
There are three common tactics for choosing a value for k:
- Representative: The value for k is chosen such that each train/test group of data samples is large enough to be statistically representative of the broader dataset.
- k=10: The value for k is fixed to 10, a value that has been found through experimentation to generally result in a model skill estimate with low bias and modest variance.
- k=n: The value for k is fixed to n, where n is the size of the dataset, so that every sample gets its turn as the holdout set. This approach is called leave-one-out cross-validation.
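To illustrate the k=n case, scikit-learn's LeaveOneOut is equivalent to KFold with n_splits equal to the number of samples (the 4-sample array here is purely illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(4).reshape(4, 1)  # 4 samples, for illustration only

# k = n: every sample is the test set exactly once
loo = LeaveOneOut()
print(loo.get_n_splits(X))  # 4, i.e. one split per sample

# Equivalent formulation with KFold
kf = KFold(n_splits=len(X))
for train_idx, test_idx in kf.split(X):
    print(f"train={train_idx}, test={test_idx}")
```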
Types Of K Fold
Stratified K-Fold:
- StratifiedKFold is a variation of KFold. First, StratifiedKFold shuffles our data, then splits it into n_splits parts, and that is it. Each part is then used as a test set exactly once. Note that it shuffles the data only once, before splitting.
With shuffle=True, the data is shuffled using our random_state; otherwise it falls back to NumPy's default random state. For example, with n_splits=4 and data that has 3 classes (labels) in y (the dependent variable), the 4 test sets cover all the data without any overlap.
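A sketch of that setup with n_splits=4 on a small three-class y (the labels below are made up purely to show the per-fold class balance):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((12, 1))                                # features are irrelevant here
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])  # 3 classes, 4 samples each

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Each test fold keeps roughly the same class proportions as y,
    # and together the 4 test folds cover all samples with no overlap
    print(f"Fold {fold}: test={test_idx}, labels={y[test_idx]}")
```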
Stratified Shuffle Split:
- StratifiedShuffleSplit is a variation of ShuffleSplit. First, StratifiedShuffleSplit shuffles our data and splits it, giving one part to use as a test set. It then repeats the same process n_splits-1 more times to get the remaining test sets. With the same data, the n_splits test sets do not necessarily cover all the data, i.e., there can be overlaps among the test sets.
So, the difference is that StratifiedKFold shuffles and splits just once, so the test sets do not overlap, while StratifiedShuffleSplit shuffles before every split and splits n_splits times, so the test sets can overlap.
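This contrast can be seen directly in code. Reusing the same made-up three-class y as above (a sketch, not a definitive recipe), the test sets drawn by StratifiedShuffleSplit typically repeat some indices:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.zeros((12, 1))
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])

# Re-shuffles before every split, so test sets can overlap
sss = StratifiedShuffleSplit(n_splits=4, test_size=3, random_state=0)

all_test = []
for train_idx, test_idx in sss.split(X, y):
    all_test.extend(test_idx)
    print(f"test={sorted(test_idx)}, labels={y[test_idx]}")

# Fewer unique indices than draws means the test sets overlapped
print(f"unique test indices: {len(set(all_test))} of {len(all_test)} drawn")
```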
Conclusion
There is a bias-variance trade-off associated with the choice of k in k-fold cross-validation. In practice, we typically perform k-fold cross-validation with k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.
If the chosen value of k does not evenly divide the data sample, then one group will contain the remaining examples. It is preferable to split the data sample into k groups with the same number of samples, so that the model skill scores are all comparable.
Further Reading
A Gentle Introduction to k-fold Cross-Validation
Cross-validation is a statistical method used to estimate the skill of machine learning models. It is commonly used in…
machinelearningmastery.com
Improve Your Model Performance using Cross Validation (in Python and R)
One of the most interesting and challenging things about data science hackathons is getting a high score on both public…
www.analyticsvidhya.com
That's all in this one.
Until next time…!!
Published via Towards AI