Split Your Dataset With scikit-learn’s train_test_split()

Last Updated on January 7, 2023 by Editorial Team

Author(s): YashNagare

Originally published on Towards AI.

Machine Learning

Model evaluation and validation are important parts of supervised machine learning. They help us select the model that best represents our data and estimate how well that model will perform on data it has not seen.

To evaluate a model fairly, we need to split the dataset into training and testing data, so that the model is tested on samples it was not trained on. Splitting the data by hand is tedious because datasets are large and the samples usually need to be shuffled first.

To make this task easier, we will use scikit-learn’s train_test_split() function, which splits our data into training and testing sets.

Installation of Scikit-Learn

Installing scikit-learn using pip:

$ pip install -U scikit-learn

Installing scikit-learn using conda:

$ conda install -c anaconda scikit-learn=0.24

Importing Library —

>>> from sklearn.model_selection import train_test_split

Getting Started —

Syntax –

train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)

Parameters –

arrays – The data you want to split, passed as one or more lists, NumPy arrays, pandas DataFrames, or other array-like objects. Together these objects make up the dataset, and they must all have the same length.

train_size – Sets the size of the training set. It accepts three kinds of values: None (the default), in which case it is set to the complement of test_size; an int, the absolute number of training samples; or a float between 0.0 and 1.0, the proportion of the dataset to use for training.

test_size – Sets the size of the testing set and accepts the same kinds of values as train_size. If None, it is set to the complement of train_size; if train_size is also None, test_size defaults to 0.25.

random_state – Controls the shuffling applied to the data before the split. It accepts an int, a NumPy RandomState instance, or None (the default). Pass an int to get the same split on every call.

shuffle – A Boolean (default=True) that determines whether the data is shuffled before splitting. If shuffle=False, stratify must be None.

stratify – An array-like object (default=None). If it is not None, the data is split in a stratified fashion, using this array as the class labels so that each class appears in the same proportion in the training and testing sets.
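As a rough illustration of how these parameters combine, the sketch below splits a small made-up dataset. The data, the shapes, and the value 42 are assumptions chosen for the example, not taken from the original article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 16 samples, 2 features, two balanced classes.
X = np.arange(32).reshape(16, 2)
y = np.array([0, 1] * 8)

# test_size=0.25 keeps a quarter of the samples for testing,
# random_state=42 makes the shuffle repeatable, and
# stratify=y preserves the 50/50 class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (12, 2) (4, 2)
print(np.bincount(y_train))         # [6 6]
print(np.bincount(y_test))          # [2 2]
```

Which rows end up in each part depends on random_state, but the sizes and the per-class counts are fixed by test_size and stratify.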

Now it’s time to put your data splitting skills to the test! To begin, you’ll need to create a simple dataset to work with. The inputs will be in a two-dimensional array X, and the outputs will be in a one-dimensional array y.

We use the NumPy library to generate the dataset. arange() returns evenly spaced values within a specified interval, and .reshape() changes the shape of an array without changing its data.
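The original dataset code is not shown here, so the following is only a sketch of what such a dataset could look like. The sizes (10 samples, 2 features) are an assumption.

```python
import numpy as np

# Assumed toy dataset; the exact sizes are illustrative.
X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features each
y = np.arange(10)                 # one output value per sample

print(X.shape)  # (10, 2)
print(y.shape)  # (10,)
print(X[:3])    # first three rows: [[0 1] [2 3] [4 5]]
```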

With a single function call, you can split both the input and output datasets.

train_test_split() splits the data and, when given X and y, returns four sequences of the same type as the inputs, in this order:

X_train – The training part of the X sequence
X_test – The testing part of the X sequence
y_train – The training part of the y sequence
y_test – The testing part of the y sequence
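A minimal sketch of the full call, assuming the same kind of toy dataset (10 samples, 2 features). For two inputs, scikit-learn returns the train and test parts of X first, then the train and test parts of y. The value random_state=0 is an arbitrary choice for repeatability.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed toy dataset: row i of X is [2*i, 2*i + 1] and y[i] == i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Unpack in the order the function returns the parts.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default test_size of 0.25 on 10 samples gives 3 test and 7 train samples.
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
print(y_train.shape, y_test.shape)  # (7,) (3,)

# Inputs and outputs stay aligned: each X row still matches its y value.
print((X_train[:, 0] == 2 * y_train).all())  # True
```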

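>>> y = list(range(10))  # assumed here: ten output values, 0 through 9
>>> train_test_split(y, shuffle=True)  # no random_state, so the result varies between runs
[[4, 0, 6, 2, 5, 9, 3], [1, 7, 8]]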

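The samples are first shuffled randomly and then split into training and testing sets. Here the default test_size of 0.25 is applied to 10 samples, so 3 go to the testing set and the remaining 7 to the training set.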
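Because a shuffled split changes from run to run, it can help to see the two ways of making it deterministic. The sketch below uses an assumed list of ten values: shuffle=False keeps the original order, while a fixed random_state (42 is an arbitrary choice) repeats the same shuffle.

```python
from sklearn.model_selection import train_test_split

y = list(range(10))  # assumed toy outputs, 0 through 9

# Without shuffling, the first 7 samples train and the last 3 test.
train, test = train_test_split(y, shuffle=False)
print(train)  # [0, 1, 2, 3, 4, 5, 6]
print(test)   # [7, 8, 9]

# With a fixed random_state, repeated calls give the same shuffled split.
first = train_test_split(y, random_state=42)
second = train_test_split(y, random_state=42)
print(first == second)  # True
```

An unshuffled split is mainly useful when the order of the samples matters, for example with time-ordered data.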

Conclusion

You now understand why and how to use scikit-learn’s train_test_split() function and what each of its parameters does. You can now use your data to train, validate, and test as many machine learning models as you want.

Split Your Dataset With scikit-learn’s train_test_split() was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
