Heart Attack Prediction: Unveiling Insights through Predictive Modeling with Python

Last Updated on July 15, 2023 by Editorial Team

Author(s): Muttineni Sai Rohith

Originally published on Towards AI.

Picture this: a crystal ball that can predict heart attacks before they strike. Sounds like something out of a sci-fi movie, right? Well, thanks to the wonders of Machine Learning and the wizardry of Python programming, we’re not far from turning that imagination into reality.

Welcome to the realm of predictive modeling, where the power of data and cutting-edge technology converge to tackle one of the most critical challenges in cardiovascular health: predicting heart attacks. In this article, we delve into the fascinating world of heart attack prediction using the prowess of predictive modeling with Python.

Predictive modeling is the process of building mathematical or statistical models to forecast future outcomes or events. It involves analyzing historical data and identifying patterns and relationships between variables, then using them to predict the behavior or outcome of a specific target variable.

The predictive modeling process typically involves several steps: data collection, data preprocessing, feature engineering, model training, model evaluation, and prediction. We will follow the same process in this article to predict heart attacks.

Data Collection

While searching for suitable data to use for our risk stratification project aimed at predicting heart attacks, I came across this dataset on the UCI repository. Despite its smaller size, it contains all the essential features we need, making it an ideal starting point for our analysis and model development.

import pandas as pd

data = pd.read_csv("/content/heart.csv")
data.head()
Snapshot of Data

It’s a clean, easy-to-understand dataset with shape 1025 × 14. However, the meaning of some column headers is not obvious. Here’s what they mean:

  • age: The person’s age in years
  • sex: The person’s sex (1 = male, 0 = female)
  • cp: The chest pain experienced (Value 0: typical angina, Value 1: atypical angina, Value 2: non-anginal pain, Value 3: asymptomatic)
  • trestbps: The person’s resting blood pressure (mm Hg on admission to the hospital)
  • chol: The person’s cholesterol measurement in mg/dl
  • fbs: The person’s fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
  • restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria)
  • thalach: The person’s maximum heart rate achieved
  • exang: Exercise-induced angina (1 = yes; 0 = no)
  • oldpeak: ST depression induced by exercise relative to rest (‘ST’ relates to positions on the ECG plot)
  • slope: the slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)
  • ca: The number of major vessels (0–3)
  • thal: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
  • target: Heart disease (0 = no, 1 = yes)

Let’s dive into preprocessing this data.

Preprocessing

Given recent advances in tooling, I would generally prefer PandasAI or pandas_profiling (since renamed ydata-profiling) for preprocessing the data.

Example —

import pandas_profiling as df_report
df_report.ProfileReport(data)

Try this and see the wonders: it summarizes the entire dataset and reports all the related information for you. For this article, though, let’s follow the manual process.

Let’s start with missing values detection —

data.isnull().sum()
Missing Values Detection

Fortunately, there are no missing values in the data. Let’s check the data redundancy —

data.duplicated().sum()
Data Duplication

Oops: we already had little data, and 723 of these records are duplicates. Still, I think the remaining 302 records will be sufficient for our purpose, because sensitive data like this is really hard to get. So let’s remove the duplicates and move on.

data.drop_duplicates(inplace=True)
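
To make the two calls above concrete, here is a minimal sketch on a tiny made-up frame (the values are invented for illustration and do not come from heart.csv). It shows that duplicated() flags only the repeats of an earlier row, and that drop_duplicates() keeps the first occurrence.

```python
import pandas as pd

# Toy frame with made-up values; the third row repeats the first one exactly.
toy = pd.DataFrame({
    "age": [52, 53, 52, 61],
    "chol": [212, 203, 212, 330],
    "target": [0, 0, 0, 1],
})

# duplicated() marks every repeat of an earlier row (the first occurrence is not flagged).
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() returns a new frame; reset_index tidies the row labels afterwards.
deduped = toy.drop_duplicates().reset_index(drop=True)

print("duplicate rows:", n_dupes)
print("shape after removal:", deduped.shape)
```

Resetting the index is optional, but it avoids gaps in the row labels after rows are removed.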

So now, let’s analyze the data in order to get a view of how well the fields are correlated and derive some insights from it.

from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Work on a copy so the label replacements do not touch (or warn about) the original frame
df=data[['sex','age','target']].copy()
df['sex']=df['sex'].replace({1:'Male',0:'Female'})
df['target']=df['target'].replace({1:'Heart Patient',0:'Healthy'})

fig = make_subplots(rows=1, cols=2,specs=[[{"type": "histogram"}, {"type": "histogram"}]])
fig.add_trace(
go.Histogram(
x=df['age'].where(df['target']=='Heart Patient'),
name='Heart Patient',
nbinsx=20,
showlegend=False,
marker={"color": '#f84242'}
),
row=1,col=1
)
fig.add_trace(
go.Histogram(
x=df['age'].where(df['target']=='Healthy'),
name='Healthy',
nbinsx=20,
showlegend=False,
marker={"color": 'white'}
),
row=1,col=1
)
fig.add_trace(
go.Histogram(
x=df['sex'].where(df['target']=='Heart Patient'),
name='Heart Patient',
nbinsx=20,
marker={"color": '#f84242'}
),
row=1,col=2
)
fig.add_trace(
go.Histogram(
x=df['sex'].where(df['target']=='Healthy'),
name='Healthy',
nbinsx=20,
marker={"color": 'white'}
),
row=1,col=2
)

fig.update_layout(height=500,
title_text="<b>Age & Gender Distribution</b>",
title_font_size=30,
bargap=0.1,
template='plotly_dark',
)
fig.update_xaxes(title_text="Age", row=1, col=1)
fig.update_yaxes(title_text="Count", row=1, col=1)

fig.update_xaxes(title_text="Gender", row=1, col=2)
fig.update_yaxes(title_text="Count", row=1, col=2)

fig.show()
Age and Gender Distribution

The first subplot built using Plotly displays the age distribution, with separate histograms for heart patients and healthy individuals. The second subplot shows the gender distribution.

As we can see, a larger share of females than males are heart patients in this data: almost 75% of the females are heart patients. People in the 35–55 age group also appear more likely to be heart patients.

Let’s repeat the same analysis for Chest pain and Max Heart Rate by tweaking the above code —

# cp0: counts per chest pain type for healthy people (target == 0)
# cp1: counts per chest pain type for heart patients (target == 1)
cp0=data[data['target']==0].groupby(by=["cp"]).size().reset_index(name="Count")
cp1=data[data['target']==1].groupby(by=["cp"]).size().reset_index(name="Count")

cp0['cp']=cp0['cp'].replace({0:'Type 1',1:'Type 2',2:'Type 3',3:'Type 4'})
cp1['cp']=cp1['cp'].replace({0:'Type 1',1:'Type 2',2:'Type 3',3:'Type 4'})

# Copy the slice before adding a column to avoid SettingWithCopyWarning
df1=data[['thalach','chol','target','age','trestbps']].copy()
df1['targetname']=df1['target'].replace({1:'Heart Patient',0:'Healthy'})

fig = make_subplots(rows=1, cols=2,specs=[[{"type": "histogram"}, {"type": "scatter"}]])
fig.add_trace(
go.Bar(
x=cp0['cp'],y=cp0.Count,marker={"color": 'white'},name='Healthy'
),
row=1,col=1
)
fig.add_trace(
go.Bar(
x=cp1['cp'],y=cp1.Count,marker={"color": '#f84242'},name='Heart Patient'
),
row=1,col=1
)
fig.update_layout(height=500,
title_text="<b>Chest Pain & Max Heart Rate</b>",
title_font_size=30,
bargap=0.1,
template='plotly_dark',
)
fig.add_trace(
go.Scatter(x=df1.thalach, y=df1.age, mode='markers', text=df1['targetname'],showlegend=False,
marker=dict(
color=df1.target,
colorscale=['white','#f84242'],
line_width=1)
),
row=1,col=2
)
fig.update_xaxes(title_text="Chest Pain Type", row=1, col=1)
fig.update_yaxes(title_text="Count", row=1, col=1)

fig.update_xaxes(title_text="Max. Heart Rate", row=1, col=2)
fig.update_yaxes(title_text="Age", row=1, col=2)

fig.show()
Chest Pain and Maximum Heart Rate Distribution

As we can observe, people with Type 1 chest pain have a higher risk of heart disease than those with other chest pain types. A higher maximum heart rate among younger people also appears to be a major indicator of heart disease.

We could perform the same analysis for the remaining fields, but the article would become lengthy, so let’s look directly at the correlation among fields.

import seaborn as sb
import matplotlib.pyplot as plt

sb.set(style="white")
plt.rcParams['figure.figsize']=(15,15)
sb.heatmap(data.corr(),annot=True,linewidths=0.5)
plt.title("Correlation between variables")
plt.show()
Correlation
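
As a complement to the heatmap, the target column of the correlation matrix can be read off directly and ranked. The sketch below is only illustrative: it uses a synthetic frame with two hypothetical columns, signal and noise, rather than the heart data, so the numbers it prints say nothing about heart disease.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption: not the heart dataset).
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=200)
toy = pd.DataFrame({
    "signal": target * 2 + rng.normal(0, 0.5, size=200),  # built to track the target
    "noise": rng.normal(size=200),                          # unrelated to the target
    "target": target,
})

# Correlation of every feature with the target, strongest (by absolute value) first.
ranked = toy.corr()["target"].drop("target").sort_values(key=abs, ascending=False)
print(ranked)
```

On the real frame, the same expression with data in place of toy would list the fields in order of their linear association with target.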

Here we correlate the fields over the entire dataset. From this correlation, we can see that chest pain and maximum heart rate are correlated with our target variable (heart patient or not). I also like comparing the data by intervals. We will not cover that in this article, but here is a code snippet if anyone wants to try it:

# Mean target rate within 5 equal-width bins of each continuous column
for seq_columns in ["age","trestbps","thalach","oldpeak","chol"]:
    print(data.groupby([pd.cut(data[seq_columns],5)],observed=False)['target'].mean())
# Mean target rate per category of each categorical column
for categ_columns in list(set(data.columns) - set(["age","trestbps","thalach","oldpeak","chol","target"])):
    print(data.groupby(categ_columns)['target'].mean())

Depending on the analysis of the output, we can derive the following insights for this dataset:

  • The higher the maximum heart rate, the higher the chance of being a heart patient.
  • Type 1 chest pain carries a higher risk of heart disease than the other chest pain types.
  • If cholesterol exceeds 475, the chance of a heart attack is much higher.
  • Females are more prone to heart disease than males.
  • Exercise-induced angina appears to be a major factor in heart attacks among young people.

We could derive more insights by closely following each field, but for this article’s purpose, let’s move on to the modeling part, where we predict heart attacks using Machine Learning.

from sklearn.preprocessing import StandardScaler

data=pd.get_dummies(data,columns=['sex','cp','restecg','exang','slope','ca','thal','fbs'])
scaler=StandardScaler()
scaledColumns=['age','trestbps','thalach','oldpeak','chol']
data[scaledColumns]=scaler.fit_transform(data[scaledColumns])
data.head()

Because some fields are categorical, we convert them into binary (one-hot) columns with get_dummies. We then use StandardScaler to standardize the continuous fields so that they are on a scale comparable to the rest of the data.
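
One caveat with scaling the whole frame up front is that the scaler also sees the rows that later become the test set. A possible alternative, sketched below under stated assumptions, is to put the encoding and scaling inside a scikit-learn Pipeline so they are fitted on the training rows only. The frame here is synthetic (random values in a few columns named after the real ones, with an invented label rule), so treat it as an illustration of the wiring, not as the article's method.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for a few heart.csv columns (values are random, not real).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(30, 78, n),
    "thalach": rng.integers(90, 200, n),
    "cp": rng.integers(0, 4, n),
    "sex": rng.integers(0, 2, n),
})
# Invented label rule, for demonstration only.
toy["target"] = ((toy["thalach"] > 150) | (toy["cp"] > 1)).astype(int)

numeric_cols = ["age", "thalach"]
categorical_cols = ["cp", "sex"]

# Scale the numeric columns and one-hot encode the categorical ones.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), numeric_cols),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
clf = Pipeline([("prep", preprocess), ("model", LogisticRegression(max_iter=1000))])

X_train, X_test, y_train, y_test = train_test_split(
    toy.drop(columns="target"), toy["target"],
    test_size=0.2, stratify=toy["target"], random_state=0,
)

# fit() learns the scaling statistics and categories from the training rows only.
clf.fit(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(f"Test accuracy on toy data: {test_acc:.2f}")
```

The same pipeline object can be passed to cross_val_score, in which case the preprocessing is refitted inside every fold.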

Model Development

Let’s split the data for training and testing —

from sklearn.model_selection import train_test_split

x = data.drop(columns='target')
y = data['target']
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,stratify=y)
print(x_train.shape,x_test.shape)

We hold out 20% of the data (61 rows) for testing and keep 80% (241 rows) for training. I chose this ratio to reduce the risk of overfitting. We will also compute a 10-fold cross-validation score as our main scoring parameter, to check that the model fits the data well.
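
To see where the 61 and 241 come from, and what stratify does, here is a small sketch on placeholder data. The 164/138 class balance is an assumption made for the example, and random_state is fixed only so the sketch is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder labels: 302 rows with an assumed 164/138 class balance.
y_toy = np.array([1] * 164 + [0] * 138)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# The test size is rounded up: 0.2 * 302 = 60.4, so 61 rows go to the test set.
print(X_tr.shape, X_te.shape)

# stratify=y keeps the share of positives nearly identical in both parts.
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Without stratify, a small test set can end up with a noticeably different class mix from the training set, which makes the test accuracy harder to interpret.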

Let’s start with the Logistic Regression —

from sklearn.metrics import confusion_matrix,accuracy_score,roc_curve,classification_report
from sklearn.model_selection import cross_val_score,GridSearchCV
from sklearn.linear_model import LogisticRegression

model=LogisticRegression()
model.fit(x_train,y_train)
train_pred=model.predict(x_train)
score=accuracy_score(y_train,train_pred)

print("Train Accuracy Score : ",score*100)
test_pred=model.predict(x_test)
test_score=accuracy_score(y_test,test_pred)
print("Test Accuracy Score : ",test_score*100,'\n\n')

lrScore=cross_val_score(model,x,y,cv=10).mean()*100
print("10-Fold CV Score : ",lrScore,'\n')

print("Confusion Matrix : \n",confusion_matrix(y_test,test_pred),'\n\n')
print(classification_report(y_test,test_pred))

We achieved a 10-fold cross-validation score of about 85%.

10-fold cross-validation is a technique that divides the data into 10 subsets (folds). The model is trained 10 times, each time on nine folds and tested on the remaining one, and the average of the 10 scores is used as an estimate of the model’s generalization ability.
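
The procedure can also be written out by hand. The sketch below uses synthetic data from make_classification (an assumption, standing in for the heart data) and compares a manual loop over StratifiedKFold with the cross_val_score helper; for a classifier and an integer cv, the helper uses the same stratified, unshuffled folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data, not the heart dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Manual 10-fold loop: train on nine folds, score on the held-out fold.
skf = StratifiedKFold(n_splits=10)
fold_scores = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    fold_model = LogisticRegression(max_iter=1000)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

manual_mean = float(np.mean(fold_scores))

# The helper does the same thing in one call.
helper_mean = float(
    cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=10).mean()
)

print(f"manual: {manual_mean:.4f}  helper: {helper_mean:.4f}")
```

Looking at the individual fold_scores, not only their mean, gives a feel for how much the estimate varies on a dataset this small.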

Let’s try different algorithms, calculate their scores, and use the one that predicts most accurately:

# ---------------------------------------------------------------------------
# KNN

from sklearn.neighbors import KNeighborsClassifier

knn=KNeighborsClassifier(n_neighbors=1,n_jobs=-1)
knn.fit(x_train,y_train)
train_pred=knn.predict(x_train)
score=accuracy_score(y_train,train_pred)

# ---------------------------------------------------------------------------
# SVC

from sklearn.svm import SVC

svm = SVC(C=4, degree=8, kernel='poly',max_iter=1000)
svm.fit(x_train,y_train)

train_pred=svm.predict(x_train)
score=accuracy_score(y_train,train_pred)

# ---------------------------------------------------------------------------
# Naive Bayes (Gaussian)

from sklearn.naive_bayes import GaussianNB

nb=GaussianNB()
nb.fit(x_train,y_train)
train_pred=nb.predict(x_train)
score=accuracy_score(y_train,train_pred)

# ---------------------------------------------------------------------------
# Decision Tree

from sklearn.tree import DecisionTreeClassifier

dt=DecisionTreeClassifier(criterion = 'entropy',random_state=0,max_depth = 12)
dt.fit(x_train,y_train)
train_pred=dt.predict(x_train)
score=accuracy_score(y_train,train_pred)
# ---------------------------------------------------------------------------
# Random Forest

from sklearn.ensemble import RandomForestClassifier
randFor = RandomForestClassifier(n_estimators=1000, random_state = 35)
randFor.fit(x_train, y_train)
train_pred=randFor.predict(x_train)
score=accuracy_score(y_train,train_pred)
# ---------------------------------------------------------------------------
# XGBoost

from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=200)
xgb.fit(x_train, y_train)
train_pred=xgb.predict(x_train)
score=accuracy_score(y_train,train_pred)
# ---------------------------------------------------------------------------

Rather than walking through each algorithm one by one, I have given the code for all of them above. You can tweak the code below, changing the variable names, to get the accuracies for each algorithm.

print("Train Accuracy Score : ",score*100)
test_pred=xgb.predict(x_test)
test_score=accuracy_score(y_test,test_pred)
print("Test Accuracy Score : ",test_score*100,'\n\n')

xgbScore = cross_val_score(xgb,x,y,cv=10).mean()*100
print("10-Fold CV Score : ",xgbScore,'\n')

print("Confusion Matrix : \n",confusion_matrix(y_test,test_pred),'\n\n')
print(classification_report(y_test,test_pred))
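
One way to avoid renaming variables by hand is to loop over the models. The sketch below is an illustration under assumptions: evaluate is a hypothetical helper, the data is synthetic (make_classification stands in for the heart data), and the model settings are illustrative rather than the ones used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: any numeric X with a binary y will do).
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

def evaluate(model):
    """Hypothetical helper: fit one model, return train, test and 10-fold CV accuracy in %."""
    model.fit(X_tr, y_tr)
    return {
        "train": accuracy_score(y_tr, model.predict(X_tr)) * 100,
        "test": accuracy_score(y_te, model.predict(X_te)) * 100,
        # cross_val_score clones the model, so the fit above is not reused here.
        "cv10": cross_val_score(model, X_demo, y_demo, cv=10).mean() * 100,
    }

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {name: evaluate(model) for name, model in models.items()}
for name, scores in results.items():
    print(f"{name:20s} train={scores['train']:.1f} "
          f"test={scores['test']:.1f} cv10={scores['cv10']:.1f}")
```

Collecting the scores in one dictionary also makes it straightforward to build a comparison table or chart afterwards.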

Model Comparison

Because the amount of data is very small, Logistic Regression gave us good accuracy compared with the rest. Also, I haven’t tuned the hyperparameters of the other algorithms much. This modeling is for educational purposes, and I think you can carry on with the rest.

Remember, although we took the 10-fold CV score as our main scoring parameter, in healthcare, precision is also an essential metric to focus on. High precision means that the positive predictions (patients classified as high risk) are more likely to be correct. In this case, you want to ensure that the individuals identified as high risk do indeed have a higher probability of experiencing a heart attack.

However, it is also important to consider the impact of false negatives (missed high-risk patients). If your goal is to identify as many high-risk patients as possible, with the aim of early intervention or close monitoring, then recall becomes crucial. High recall indicates that you are capturing a larger proportion of the actual positive cases.
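
Both metrics follow directly from the confusion matrix. The worked example below uses ten made-up labels and predictions (not model output from this article) and checks the hand-computed values against scikit-learn's precision_score and recall_score.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up ground truth and predictions; 1 = high risk, 0 = not high risk.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of those flagged high risk, how many really are
recall = tp / (tp + fn)     # of the real high-risk patients, how many were flagged

print(f"tp={tp} fp={fp} fn={fn} tn={tn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two high-risk patients are missed, which lowers recall, while only one healthy person is flagged, which keeps precision higher; which of the two matters more depends on the cost of each kind of error.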

Hope you find this article helpful.

Happy Learning…..

Published via Towards AI
