
Beginner's Guide to Machine Learning: Data Pre-processing Using Python

Last Updated on July 20, 2023 by Editorial Team

Author(s): Anushkad

Originally published on Towards AI.

The journey of learning machine learning is a long one, but it is also very exciting!

If you are a beginner, this blog is just what you need to get a head start! Data preprocessing is one of the first skills you learn as a machine learning practitioner. Let's get started!

Table of contents:

  1. Importing Libraries
  2. Importing Datasets
  3. Handling Missing Data
  4. Encoding Categorical Data
  5. Feature Scaling
  6. Splitting the dataset into train and test sets

Data preprocessing is an essential part of this journey. As you move ahead, you will realize the critical role it plays in building, training, and testing your models accurately.

It is necessary to preprocess your data in the right way so that the machine learning model you build can be trained properly on it. This may not seem like a crucial step at the beginning. However, once you learn to do it efficiently, you can quickly get a hold of the various branches of Machine Learning (ML).

Step 1] Importing Libraries

What are libraries?

Libraries in Python are collections of tools, functions, and modules that make the task at hand easier.

Pandas Library: Allows us to read datasets and work with them as data frames.

Numpy Library: Provides fast mathematical operations on multi-dimensional arrays.

Matplotlib Library: Helps us visualize data in the form of bar charts, line charts, pie charts, etc.

To call these library functions with ease, we import the libraries under short aliases such as "np", "pd", and "plt".

These libraries can be imported as shown in the code snippet below:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Step 2] Importing Datasets

What are Datasets?

A dataset is a collection of all the data we use to design and train our machine learning model.

Dataset from Udemy: Data.csv

The above dataset has 3 independent variables (features): Country, Age, and Salary, and 1 dependent (target) variable: Purchased Product, which takes a binary value, Yes or No, depending on the independent variables.
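The original table image is not reproduced here, but a dataset of this shape might look like the following few rows (illustrative values only, not the actual contents of Data.csv; note the blanks standing in for missing entries):

Country,Age,Salary,Purchased Product
France,44,72000,No
Spain,27,48000,Yes
Germany,30,,No
Spain,38,61000,No
Germany,,52000,Yes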

To import the dataset, we use the pandas library.

The first thing to do is to create a variable that will hold the dataset as a data frame.

Then declare two variables to hold the features and the target, respectively. The code snippet below shows how to import a dataset.

# read_csv is a pandas function that loads the CSV file into a data frame.
dataset = pd.read_csv("Path of your csv file location")

# iloc is a pandas indexer that selects rows and columns by position.
X = dataset.iloc[:, :-1].values  # X stores the features (all columns except the last)
y = dataset.iloc[:, -1].values   # y stores the target (the last column)

Step 3] Handling Missing Data

As you can observe, the above dataset has some missing data in the Age and Salary columns, which can cause errors while training the model if not handled. Thus, it is necessary to take care of missing values in a dataset.

In the case of large datasets, the entries with missing data can simply be removed, as sketched below. However, if your dataset is small, discarding rows wastes valuable information, and the missing data needs to be filled in instead.
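For the removal approach, a minimal sketch using pandas (assuming dataset is the data frame loaded in Step 2) could look like this:

# Drop any row that contains at least one missing value.
# Only advisable when the dataset is large enough that losing a few rows does not matter.
dataset_clean = dataset.dropna()
print(dataset_clean.shape)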

One method to handle missing data is to replace each missing value with the average of all the entries in that column. To do this, we import a well-known ML library called scikit-learn, which contains data preprocessing tools for handling missing values. We will use the SimpleImputer class.

Code snippet:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])  # fit only on the columns that contain real numbers (Age, Salary)
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)

Step 4] Encoding Categorical Data

Almost all datasets contain a categorical column. This column of strings needs to be processed and converted into real numbers. Simply mapping the categories to consecutive integers may give the model a false impression of an ordering or relationship between them, causing errors.

Thus, we will use one-hot encoding to convert the categorical feature into numerical data. For the target variable, which contains only Yes and No, we will simply convert the values into 1 and 0, respectively; since it is binary, this does not harm the future accuracy of the model.

# Encoding the independent variable (Country)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)

# Encoding the dependent variable (Purchased Product)
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)
print(y)

As shown in the above code snippet, OneHotEncoder is used to encode the independent categorical data, and remainder is set to 'passthrough' so that the remaining feature columns are kept rather than dropped. The result is converted into a NumPy array.

For encoding the dependent variable, LabelEncoder from scikit-learn is used.
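If you want to see the two encoders in isolation, here is a small self-contained sketch with made-up values (not the actual Data.csv contents); the exact column order depends on the categories the encoder sees:

import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

countries = np.array([['France'], ['Spain'], ['Germany'], ['France']])
purchased = np.array(['No', 'Yes', 'No', 'Yes'])

# Each country becomes its own 0/1 column (categories are ordered alphabetically).
print(OneHotEncoder().fit_transform(countries).toarray())

# No becomes 0, Yes becomes 1.
print(LabelEncoder().fit_transform(purchased))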

Step 5] Feature Scaling

Feature scaling puts the values of the independent variables (features) into the same range. It is not always necessary for every model: some models compensate for differences in scale on their own, while models based on distances or gradient-based optimization train better when the features share a similar range.

Two commonly used techniques are Standardization and Normalization.

I will be using Standardization. (Using either technique does not cause a significant change in the output.)

# Formula for standardization:
# X_stand = (x - mean(x)) / standard_deviation(x)
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X = sc.fit_transform(X)
print(X)

Feature scaling returns the features scaled into the same range.
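If you prefer normalization (min-max scaling) instead, scikit-learn's MinMaxScaler works the same way; a quick sketch, assuming the same X as above:

# Formula for normalization (min-max scaling):
# X_norm = (x - min(x)) / (max(x) - min(x))
from sklearn.preprocessing import MinMaxScaler

mm = MinMaxScaler()           # scales every feature into the [0, 1] range by default
X_norm = mm.fit_transform(X)
print(X_norm)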

Step 6] Splitting dataset into train and test set

This is an important step. We always need to split the data into a training set and a test set, used to train and evaluate the model, respectively. Usually, 80% of the data is used for training and 20% is held out to test how the model performs on unseen data.

To avoid overfitting, the model should never be trained on the test data. At the same time, if the training set is too small, the model will not learn enough and will make more errors on the test set. Hence, the split should be chosen appropriately.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
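As a quick sanity check (purely illustrative), you can print the shapes to confirm the 80/20 split:

# For example, if the dataset had 10 rows, an 80/20 split would give 8 training rows and 2 test rows.
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)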

This sums up all the tools required to preprocess the data. For almost all models, we will need to import libraries, import the dataset, and split the data into training and test sets. The other steps mentioned are needed only sometimes.


Published via Towards AI
