Split Your Dataset With scikit-learn’s train_test_split()
Originally published on Towards AI.
Model evaluation and validation are important parts of supervised machine learning. They help us select the model that best represents our data and estimate how well that model will perform on data it has not seen.
To estimate that performance, we need to split the dataset into training and testing sets. Splitting the data by hand is tedious because datasets are often large and the data usually needs to be shuffled first.
To make this task easier, we will use scikit-learn’s train_test_split() function, which splits our data into training and testing sets.
Installation of Scikit-Learn
Installing scikit-learn using pip:
$ pip install -U scikit-learn
Installing scikit-learn using conda:
$ conda install -c anaconda scikit-learn=0.24
Importing the Library
>>> from sklearn.model_selection import train_test_split
Getting Started
train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
arrays – One or more indexable sequences to split: lists, NumPy arrays, pandas DataFrames, or other array-like objects. Together they make up the dataset, and they must all have the same length.
train_size – Sets the size of the training set. It can be None (the default, in which case it becomes the complement of the test size), an int giving the exact number of training samples, or a float between 0.0 and 1.0 giving the proportion of the dataset to use for training.
test_size – Sets the size of the testing set and accepts the same kinds of values as train_size. If both test_size and train_size are left at None, test_size is set to 0.25.
random_state – Controls the shuffling applied to the data before the split. It can be None, an int, or a NumPy RandomState instance. Passing an int makes the split reproducible across calls.
shuffle – A Boolean (default=True) that determines whether the data is shuffled before it is split.
stratify – An array-like object (default=None). If it is not None, the data is split in a stratified fashion using this array as the class labels, so each subset keeps roughly the same class proportions. Stratifying requires shuffle=True.
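As a minimal sketch of how these parameters interact, the snippet below splits a small toy dataset. The arrays X and y and the particular values passed for test_size and random_state are assumptions chosen for illustration; they do not come from the original article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features, balanced binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# A float test_size is a fraction of the samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)

# An int test_size is an absolute number of samples.
_, X_test_int, _, _ = train_test_split(X, y, test_size=4, random_state=42)
print(X_test_int.shape)  # (4, 2)

# Reusing the same integer random_state reproduces the same split.
repeat = train_test_split(X, y, test_size=0.3, random_state=42)
print(np.array_equal(X_test, repeat[1]))  # True

# stratify=y keeps the class proportions equal in both subsets.
_, _, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
print(np.bincount(y_train_s), np.bincount(y_test_s))  # [3 3] [2 2]
```

Without stratify, a small or imbalanced dataset can end up with a test set that contains very few samples of one class, which is why stratified splitting is often worth considering for classification problems.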
Now it’s time to put your data splitting skills to the test! To begin, you’ll need to create a simple dataset to work with. The inputs will be in a two-dimensional array X, and the outputs will be in a one-dimensional array y.
We use the NumPy library to generate the dataset: arange() returns evenly spaced values within a specified interval, and .reshape() changes the shape of an array without changing its data.
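The dataset-creation code does not appear in the text, so the following is only a sketch of what it might look like. The sizes (10 samples, 2 features) are an assumption, picked so that y holds the ten values 0 through 9 that appear in the sample output further down.

```python
import numpy as np

# Assumed sizes: 10 samples with 2 features each, plus 10 matching outputs.
X = np.arange(20).reshape(10, 2)  # inputs, two-dimensional
y = np.arange(10)                 # outputs, one-dimensional

print(X.shape)  # (10, 2)
print(y.shape)  # (10,)
```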
With a single function call, you can split both the input and output datasets.
train_test_split() splits the data and returns a list of four subsets, of the same types as the inputs, in this order:
- X_train – The training part of the X sequence
- X_test – The testing part of the X sequence
- y_train – The training part of the y sequence
- y_test – The testing part of the y sequence
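A single call that splits both arrays might look like the sketch below. The toy X and y and the random_state value are assumptions for illustration. Note that scikit-learn returns the two parts of X first, followed by the two parts of y, so the result is unpacked as X_train, X_test, y_train, y_test.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: row i of X is [2*i, 2*i + 1] and y[i] is i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Both arrays are split with the same shuffled indices.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
print(y_train.shape, y_test.shape)  # (7,) (3,)

# Rows stay aligned: each X row still matches its original y value.
print(np.array_equal(X_train[:, 0], 2 * y_train))  # True
```

With the default test_size of 0.25 and 10 samples, the test set is rounded up to 3 samples and the remaining 7 go to the training set.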
>>> train_test_split(y, shuffle=True)
[[4, 0, 6, 2, 5, 9, 3], [1, 7, 8]]
The samples of the dataset are shuffled randomly and then split into training and testing sets. Because random_state is not set here, the exact output will differ from run to run.
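For contrast, here is a small sketch of the same call with shuffling turned off. It assumes y is a plain Python list holding the values 0 through 9, which matches the list-style output shown above.

```python
from sklearn.model_selection import train_test_split

# Assumed input: a plain list of ten values.
y = list(range(10))

# With shuffle=False the original order is kept: the first samples
# form the training set and the last ones form the test set.
y_train, y_test = train_test_split(y, shuffle=False)

print(y_train)  # [0, 1, 2, 3, 4, 5, 6]
print(y_test)   # [7, 8, 9]
```

Turning shuffling off can be useful when the order of the samples carries meaning, for example in time-ordered data, but for most datasets the default shuffle=True is the safer choice.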
You now understand why and how to use scikit-learn’s train_test_split() function and the main parameters it accepts. You can now split your own data to train, validate, and test as many machine learning models as you want.