What is K-Fold Cross Validation?
Last Updated on July 25, 2023 by Editorial Team
Author(s): Saikat Biswas
Originally published on Towards AI.
Importance of K-Fold Cross Validation in Machine Learning
One of the most important steps before feeding the data to our model
Right before we train a model on our data, we often run a cross-validation step, and it is an important part of any machine learning pipeline.
In this article, we will look in more detail at where K-Fold comes into play and why it is an important procedure for evaluating a model on varied, random samples of the data.
Cross-validation is a statistical method used to estimate the skill of machine learning models. Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.
It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.
KFold provides train/test indices to split data into train and test sets. It splits the dataset into k consecutive folds (without shuffling by default). Each fold is then used as a validation set once, while the k-1 remaining folds form the training set.
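As a quick sketch (the toy array and n_splits=3 below are only for illustration), scikit-learn's KFold yields exactly these train/test indices:

```python
import numpy as np
from sklearn.model_selection import KFold

# A small illustrative dataset: 6 samples, 1 feature each
X = np.arange(6).reshape(6, 1)

# 3 consecutive folds, no shuffling (the default behaviour)
kf = KFold(n_splits=3)

for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # Each fold serves as the validation set exactly once
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
```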
The general procedure is as follows:
- Shuffle the dataset randomly.
- Split the dataset into k groups
- For each unique group:
- Take the group as a holdout or test data set
- Take the remaining groups as a training data set
- Fit a model on the training set and evaluate it on the test set
- Retain the evaluation score and discard the model
- Summarize the skill of the model using the sample of model evaluation scores
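The steps above map almost directly onto a manual loop with scikit-learn. The snippet below is only a sketch; the dataset, model, and metric (load_iris, LogisticRegression, accuracy) are placeholders for whatever you are actually working with:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# Shuffle once, then split into k groups
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in kf.split(X):
    # Hold out one group, train on the remaining k-1 groups
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # Retain the evaluation score and discard the model
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# Summarize the skill of the model using the sample of scores
print(f"Accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

In practice, sklearn.model_selection.cross_val_score wraps this whole loop in a single call.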
Configuration of K Fold
The k value must be chosen carefully for our data sample.
A poorly chosen value for k may give a misleading idea of the skill of the model, such as a score with high variance (one that changes a lot depending on the data used to fit the model) or high bias (such as an overestimate of the skill of the model).
There are three common tactics for choosing a value for k:
- Representative: The value for k is chosen such that each train/test group of data samples is large enough to be statistically representative of the broader dataset.
- k=10: The value for k is fixed to 10, a value that has been found through experimentation to generally result in a model skill estimate with low bias and modest variance.
- k=n: The value for k is fixed to n, where n is the size of the dataset, so that every sample gets its turn as the holdout set. This approach is called leave-one-out cross-validation.
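To illustrate the k=n case, scikit-learn's LeaveOneOut is equivalent to KFold with n_splits equal to the number of samples (the 4-sample array here is purely illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(4).reshape(4, 1)  # 4 samples, for illustration only

# k = n: every sample is the test set exactly once
loo = LeaveOneOut()
print(loo.get_n_splits(X))  # 4, i.e. one split per sample

# Equivalent formulation with KFold
kf = KFold(n_splits=len(X))
for train_idx, test_idx in kf.split(X):
    print(f"train={train_idx}, test={test_idx}")
```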
Types Of K Fold
Stratified K-Fold:
- StratifiedKFold is a variation of KFold. First, StratifiedKFold shuffles our data, then splits it into n_splits parts, and that is it. Each part is then used as a test set exactly once. Note that it shuffles the data only once, before splitting.
With shuffle=True, the data is shuffled using our random_state; otherwise it falls back to NumPy's default random state. For example, with n_splits=4 and data that has 3 classes (labels) in y (the dependent variable), the 4 test sets cover all the data without any overlap.
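A sketch of that setup with n_splits=4 on a small three-class y (the labels below are made up purely to show the per-fold class balance):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((12, 1))                                # features are irrelevant here
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])  # 3 classes, 4 samples each

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Each test fold keeps roughly the same class proportions as y,
    # and together the 4 test folds cover all samples with no overlap
    print(f"Fold {fold}: test={test_idx}, labels={y[test_idx]}")
```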
Stratified Shuffle Split:
- StratifiedShuffleSplit is a variation of ShuffleSplit. First, StratifiedShuffleSplit shuffles our data and splits it, giving one part to use as a test set. It then repeats the same process n_splits-1 more times to get the remaining test sets. With the same data, the n_splits test sets do not necessarily cover all the data, i.e., there can be overlaps among the test sets.
So, the difference is that StratifiedKFold shuffles and splits just once, so the test sets do not overlap, while StratifiedShuffleSplit shuffles before every split and splits n_splits times, so the test sets can overlap.
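This contrast can be seen directly in code. Reusing the same made-up three-class y as above (a sketch, not a definitive recipe), the test sets drawn by StratifiedShuffleSplit typically repeat some indices:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.zeros((12, 1))
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])

# Re-shuffles before every split, so test sets can overlap
sss = StratifiedShuffleSplit(n_splits=4, test_size=3, random_state=0)

all_test = []
for train_idx, test_idx in sss.split(X, y):
    all_test.extend(test_idx)
    print(f"test={sorted(test_idx)}, labels={y[test_idx]}")

# Fewer unique indices than draws means the test sets overlapped
print(f"unique test indices: {len(set(all_test))} of {len(all_test)} drawn")
```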
Conclusion
There is a bias-variance trade-off associated with the choice of k in k-fold cross-validation. In practice, we typically perform k-fold cross-validation with k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.
If the chosen value of k does not evenly divide the data sample, then one group will contain the remaining examples. It is preferable to split the data sample into k groups with the same number of samples, so that the model skill scores are all comparable.
Further Reading
A Gentle Introduction to k-fold Cross-Validation
Cross-validation is a statistical method used to estimate the skill of machine learning models. It is commonly used in…
machinelearningmastery.com
Improve Your Model Performance using Cross Validation (in Python and R)
One of the most interesting and challenging things about data science hackathons is getting a high score on both public…
www.analyticsvidhya.com
That's all in this one.
Until next time…!!
Published via Towards AI