Data Preprocessing in R Markdown

Last Updated on January 28, 2023 by Editorial Team

Author(s): Mohammed Fayiz Parappan

Originally published on Towards AI.

for Machine Learning

Photo by Scott Graham on Unsplash

Data preprocessing consists of cleaning, sampling, analyzing, transforming, and encoding data so that it can be easily interpreted to provide insights or fed into a machine learning model.

Data is the new oil. It is crucial to have data in an interpretable form.

In this article, I will discuss the implementation of data preprocessing methods in R. I will be using the Heart Attack Analysis and Prediction Dataset from Kaggle.

Steps in Data Preprocessing

  1. Import the designated data file and Explore
  2. Handle Missing Values, Remove duplicates and irrelevant observations
  3. Fix structural errors
  4. Filter unwanted outliers
  5. Measures of central tendency (calculate mean, median, mode, and frequencies)
  6. Measures of dispersion (calculate variance, standard deviation, range, inter-quartile range, coefficient of variation)
  7. Calculate the correlation coefficient and correlation plot
  8. Check the distribution of features using histograms and a Normal Probability Plot
  9. Data Splitting

1. Import the designated data file and Explore

You can find more details on the dataset here: https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset/

Unlike in many other programming languages, datasets stored as CSV and TXT files can be imported in R directly, without loading any additional library.

Top rows of the dataset
Structure of the dataset
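
The import-and-explore step can be sketched as follows. This is a minimal illustration, not the article's original code: the Kaggle file is replaced by a small temporary CSV, and the column names (age, trtbps, chol, output) and values are assumptions made for the example.

```r
# Hypothetical stand-in for the Kaggle file: a few rows written to a temp CSV.
# Column names and values are assumed for illustration only.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "age,trtbps,chol,output",
  "63,145,233,1",
  "37,130,250,1",
  "41,130,204,1",
  "56,120,236,0",
  "57,140,192,0"
), csv_path)

# read.csv() ships with base R, so no library() call is needed.
heart <- read.csv(csv_path)

head(heart, 3)  # top rows of the dataset
str(heart)      # structure: column names, types, and first values
dim(heart)      # number of rows and columns
```

With the real dataset, only the path passed to read.csv() would change.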

2. Handle Missing Values, Remove duplicates and irrelevant observations

In R, missing values are represented by NA (not available).

Number of missing values

As the dataset contains no missing values, no missing-value techniques are needed here. When missing values are found, the affected observations are either removed or the missing entries are replaced by the mean or another approximation.
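
As a rough sketch of how missing values could be counted and handled, the snippet below uses a toy data frame with one NA planted on purpose (the heart dataset itself has none). The column names and values are hypothetical.

```r
# Toy data frame with one deliberately missing value (hypothetical data).
heart <- data.frame(
  age    = c(63, 37, 41, 56, 57),
  chol   = c(233, NA, 204, 236, 192),
  output = c(1, 1, 1, 0, 0)
)

sum(is.na(heart))      # total number of missing values
colSums(is.na(heart))  # missing values per column

# Option 1: drop every row that contains at least one NA.
heart_dropped <- na.omit(heart)

# Option 2: replace NAs in a numeric column with that column's mean.
heart_imputed <- heart
heart_imputed$chol[is.na(heart_imputed$chol)] <- mean(heart$chol, na.rm = TRUE)
```

Dropping rows is simpler but discards data; mean imputation keeps every row at the cost of slightly shrinking the column's variance.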

Duplicate rows can distort the interpretation of the dataset and may also lead machine learning models to learn patterns that do not exist in reality.

The index of the only duplicate row is found, and that row is removed from the dataset.
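
One way to locate and drop duplicate rows in base R is sketched below. The data frame is a hypothetical toy example in which row 4 repeats row 2.

```r
# Toy data frame in which row 4 is an exact copy of row 2 (hypothetical data).
heart <- data.frame(
  age    = c(63, 37, 41, 37, 57),
  chol   = c(233, 250, 204, 250, 192),
  output = c(1, 1, 1, 1, 0)
)

# duplicated() flags rows that repeat an earlier row.
dup_idx <- which(duplicated(heart))
dup_idx

# Keep only the first occurrence of each row. Indexing with !duplicated()
# is also safe when there are no duplicates at all, unlike heart[-dup_idx, ].
heart_unique <- heart[!duplicated(heart), ]
```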

3. Fix structural errors

Now that missing values and duplicates have been dealt with, let's check whether the distribution of the dataset with respect to the output is balanced. The output is labeled with 0's and 1's.

  • 0 = No Heart Attack Occurs
  • 1 = Heart Attack Occurs

As there are a similar number of observations in both classes, the dataset is balanced enough.
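
A class-balance check of this kind can be sketched with table() and prop.table(). The label vector below is made up for illustration; with the real data it would be the output column.

```r
# Hypothetical vector of class labels (0 = no heart attack, 1 = heart attack).
output <- c(1, 1, 1, 0, 0, 1, 0, 0, 1, 0)

class_counts <- table(output)             # observations per class
class_share  <- prop.table(class_counts)  # share of each class

class_counts
class_share
```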

4. Filter unwanted outliers

Outliers are extreme data points that do not match the general trends seen in the other points of the dataset. They can have a crucial impact on the interpretations and results given by ML models. It is important to note that the mere appearance of outliers doesn't mean they should be removed. Only those outliers that are irrelevant to the data analysis should be removed.

Outlier data points in a dataset can be detected with the help of Cook's distance, a metric that measures the influence of each data point on the model into which the dataset is fed (here, a linear regression). Cook's distances can be easily calculated in R using the olsrr library, which can be installed from Tools -> Install Packages.

Number of Outliers

Note that the conditions for treating data points as outliers are subjective. Here, I have treated data points whose Cook's distance is more than five times the mean Cook's distance as outliers. There are 9 such points, and they were filtered from the dataset.
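
The thresholding rule described above can be sketched as follows. Note the swap: this example uses cooks.distance() from base R's stats package rather than olsrr, and it fits a simple regression on synthetic data with one planted extreme point. The variables, model formula, and data are assumptions for illustration.

```r
# Synthetic data with one planted extreme point (hypothetical, not the heart data).
set.seed(42)
n <- 50
heart <- data.frame(age = round(runif(n, 30, 75)))
heart$chol <- 150 + 1.5 * heart$age + rnorm(n, sd = 10)
heart$chol[10] <- 480  # planted outlier

# Fit a linear regression and compute Cook's distance for every observation.
fit     <- lm(chol ~ age, data = heart)
cooks_d <- cooks.distance(fit)

# Rule from the text: flag points above five times the mean Cook's distance.
threshold   <- 5 * mean(cooks_d)
outlier_idx <- which(cooks_d > threshold)
length(outlier_idx)  # number of flagged points

# Keep only the observations at or below the threshold.
heart_filtered <- heart[cooks_d <= threshold, ]
```

The multiplier of five is the subjective part; a stricter or looser cut-off only requires changing that one line.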

5. Measures of central tendency (mean, median, mode, and frequencies)

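The mean, median, minimum, maximum, and quartiles of each feature in the dataset can be extracted from the summary of the dataset. The summary does not report the mode of numeric features, so the mode and frequencies are obtained from a frequency table instead.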
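
A short sketch of these measures on toy data is given below. The helper stat_mode() is hypothetical (base R has no built-in statistical mode function), and the column names and values are assumed for the example.

```r
# Toy data frame (hypothetical values).
heart <- data.frame(
  age = c(63, 37, 41, 56, 57, 57, 44),
  cp  = c(3, 2, 1, 1, 0, 0, 1)
)

# Min, quartiles, median, mean, and max of every column.
summary(heart)

# Hypothetical helper: most frequent value of a numeric vector.
# If several values tie, the smallest of them is returned.
stat_mode <- function(x) {
  freq <- table(x)
  as.numeric(names(freq)[which.max(freq)])
}

table(heart$cp)       # frequencies of each value
stat_mode(heart$cp)   # mode of cp
stat_mode(heart$age)  # mode of age
median(heart$age)
mean(heart$age)
```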

6. Measures of dispersion (variance, standard deviation, range, inter-quartile range, coefficient of variation)

I used the sapply() function, which takes a list, vector, or data frame as input and returns a vector or matrix, to compute each measure of dispersion for every feature.

Standard Deviation of each feature
Variance of each feature
IQR of each feature
Coefficient of Variation of each feature
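
The sapply() pattern can be sketched like this on a small, hypothetical data frame. The coefficient of variation is computed here as the standard deviation divided by the mean, which is the usual definition.

```r
# Toy data frame (hypothetical values).
heart <- data.frame(
  age  = c(63, 37, 41, 56, 57),
  chol = c(233, 250, 204, 236, 192)
)

sd_each  <- sapply(heart, sd)    # standard deviation of each feature
var_each <- sapply(heart, var)   # variance of each feature
iqr_each <- sapply(heart, IQR)   # inter-quartile range of each feature

# range() returns c(min, max); the width is the difference between the two.
range_width <- sapply(heart, function(x) diff(range(x)))

# Coefficient of variation: standard deviation relative to the mean.
coef_var <- sapply(heart, function(x) sd(x) / mean(x))

sd_each
var_each
iqr_each
range_width
coef_var
```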

7. Calculate the correlation coefficient and correlation plot

A correlation coefficient is a number between -1 and 1 that tells the strength and direction of the relationship between two features of the dataset. It is useful for detecting multicollinearity, which undermines the independence between features of the dataset and can lead to inaccurate parameter estimates by ML models.

A correlation plot helps in visualizing the correlation coefficients between features of the dataset. It is plotted in R using the corrplot library, which can be installed from Tools -> Install Packages.

Correlation coefficients of each pair of features
Correlation Plot on Dataset

Notice that the intensity of the blue color shows the strength of positive correlation, while the intensity of the red color shows the strength of negative correlation.
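
A minimal sketch of the correlation step follows. The data frame is hypothetical, cor() computes Pearson coefficients by default, and the plot is only drawn if the corrplot package happens to be installed.

```r
# Toy data frame (hypothetical values).
heart <- data.frame(
  age    = c(63, 37, 41, 56, 57, 44, 52),
  trtbps = c(145, 130, 130, 120, 140, 120, 172),
  chol   = c(233, 250, 204, 236, 192, 263, 199)
)

# Matrix of pairwise correlation coefficients (Pearson by default).
cor_mat <- cor(heart)
round(cor_mat, 2)

# Draw the correlation plot only when corrplot is available.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_mat, method = "color")
}
```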

8. Check the distribution of features using Histograms and Normal Probability Plot

Histograms show how the values of each feature are distributed, which can give interesting insights into the dataset. A normal probability plot (NPP) tells us how close the distribution of a feature is to the normal distribution. I used the ggplot2 and qqplotr libraries to plot the NPPs.

Histogram Plot on the Age of Patients
Histogram Plot on the blood pressure of Patients
Histogram plot on the cholesterol level of Patients
Normal Probability plot on the age of Patients
Normal Probability plot on blood pressure of Patients
Normal Probability Plot on cholesterol level of Patients
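
The same two kinds of plots can be sketched with base R graphics. This is a swapped-in alternative to the ggplot2 and qqplotr approach mentioned above, and the age values are synthetic.

```r
# Synthetic ages (hypothetical, roughly bell-shaped).
set.seed(1)
age <- round(rnorm(200, mean = 54, sd = 9))

# Histogram: hist() draws the plot and also returns the bin counts.
h <- hist(age, main = "Age of patients", xlab = "Age (years)")

# Normal probability plot: sample quantiles against theoretical normal
# quantiles. Points close to the reference line suggest approximate normality.
qq <- qqnorm(age, main = "Normal probability plot: age")
qqline(age)
```

Repeating the two calls for the blood pressure and cholesterol columns would give the remaining plots.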

9. Data Splitting

I have used the caTools library to split the dataset into train and test sets with a ratio of 80:20.
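
An 80:20 split can also be sketched in base R, as below. This is a swapped-in alternative to caTools, whose sample.split() additionally aims to preserve the class ratio of the label. The data frame and the seed are hypothetical.

```r
# Toy data frame with 100 rows (hypothetical values).
set.seed(123)
heart <- data.frame(
  age    = round(runif(100, 30, 75)),
  output = rep(c(0, 1), times = 50)
)

# Randomly pick 80% of the row indices for training.
n         <- nrow(heart)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

train_set <- heart[train_idx, ]   # 80% of the rows
test_set  <- heart[-train_idx, ]  # remaining 20%

dim(train_set)
dim(test_set)
```

Fixing the seed makes the split reproducible, so the same rows end up in the test set on every run.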

All these techniques will help you gain better insights from your data and prepare your dataset for feeding into a machine learning model. If you know any other techniques, share them in the comments for everyone!

Thanks For Reading, Follow Me For More


Data Preprocessing in R Markdown was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
