Heart Attack Prediction: Unveiling Insights through Predictive Modeling with Python
Last Updated on July 15, 2023 by Editorial Team
Author(s): Muttineni Sai Rohith
Originally published on Towards AI.
Picture this: a crystal ball that can predict heart attacks before they strike. Sounds like something out of a sci-fi movie, right? Well, thanks to the wonders of Machine Learning and the wizardry of Python programming, we're not far from turning that imagination into reality.
Welcome to the realm of predictive modeling, where the power of data and cutting-edge technology converge to tackle one of the most critical challenges in cardiovascular health: predicting heart attacks. In this article, we delve into the fascinating world of heart attack prediction using the prowess of predictive modeling with Python.
Predictive modeling refers to the process of creating and using mathematical or statistical models to make predictions or forecasts about future outcomes or events. It involves analyzing historical data and identifying patterns and relationships between variables to develop a model that can predict the behavior or outcome of a specific target variable.
The predictive modeling process typically involves several steps, including data collection, data preprocessing, feature engineering, model training, model evaluation, and prediction. We will be following the same process in this article to predict Heart Attacks.
Data Collection
While searching for suitable data to use for our risk stratification project aimed at predicting heart attacks, I came across this dataset on the UCI repository. Despite its smaller size, it contains all the essential features we need, making it an ideal starting point for our analysis and model development.
import pandas as pd
data = pd.read_csv("/content/heart.csv")
data.head()
It's a clean and easy-to-understand dataset with a shape of 1025 x 14. However, the meaning of some of the column headers is not obvious. Here's what they mean:
- age: The person's age in years
- sex: The person's sex (1 = male, 0 = female)
- cp: The chest pain experienced (Value 0: typical angina, Value 1: atypical angina, Value 2: non-anginal pain, Value 3: asymptomatic)
- trestbps: The person's resting blood pressure (mm Hg on admission to the hospital)
- chol: The person's cholesterol measurement in mg/dl
- fbs: The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
- restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
- thalach: The person's maximum heart rate achieved
- exang: Exercise-induced angina (1 = yes; 0 = no)
- oldpeak: ST depression induced by exercise relative to rest ("ST" relates to positions on the ECG plot. See more here)
- slope: the slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)
- ca: The number of major vessels (0-3)
- thal: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
- target: Heart disease (0 = no, 1 = yes)
Let's dive into preprocessing this data.
Preprocessing
Given the recent advances in tooling, I would generally prefer using PandasAI or pandas_profiling for preprocessing the data.
Example:
import pandas_profiling as df_report
df_report.ProfileReport(data)
Try this and see the wonders: it summarizes the entire dataset and gives you all the related information. But for this article, let's follow the manual process.
Let's start with missing values detection:
data.isnull().sum()
Fortunately, there are no missing values in the data. Let's check for redundant records:
data.duplicated().sum()
Oops, we already had little data, and 723 of these records are duplicates. That leaves 302 unique records, which I think will be sufficient to serve our purpose, because it is really hard to get this kind of sensitive data. So let's remove the duplicates and move on.
data.drop_duplicates(inplace=True)
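To make the duplicate count concrete, here is a minimal pure-Python sketch of what the two calls above measure: a row counts as a duplicate when it exactly repeats an earlier row, and the first occurrence is kept. The toy `rows` list and the helper names `count_duplicates` and `drop_duplicate_rows` are assumptions made up for illustration; they are not part of pandas or of the dataset.

```python
def count_duplicates(rows):
    """Count rows that exactly repeat an earlier row (first occurrence is not counted)."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


def drop_duplicate_rows(rows):
    """Keep only the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Toy (age, sex, cp) rows, invented for illustration
rows = [(52, 1, 0), (53, 1, 0), (52, 1, 0), (70, 1, 0), (52, 1, 0)]
n_dup = count_duplicates(rows)
unique_rows = drop_duplicate_rows(rows)
print(n_dup, len(unique_rows))

# The same arithmetic applied to the article's numbers
remaining = 1025 - 723
print(remaining)
```

Applied to our data, removing 723 duplicates from 1025 rows leaves 302 unique records.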
So now, let's analyze the data to get a view of how well the fields are correlated and derive some insights from it.
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Work on a copy so that relabeling does not touch (or warn about) the original frame
df = data[['sex', 'age', 'target']].copy()
df['sex'] = df['sex'].map({1: 'Male', 0: 'Female'})
df['target'] = df['target'].map({1: 'Heart Patient', 0: 'Healthy'})

fig = make_subplots(rows=1, cols=2, specs=[[{"type": "histogram"}, {"type": "histogram"}]])

# Left panel: age distribution for each group
fig.add_trace(
    go.Histogram(
        x=df['age'].where(df['target'] == 'Heart Patient'),
        name='Heart Patient',
        nbinsx=20,
        showlegend=False,
        marker={"color": '#f84242'}
    ),
    row=1, col=1
)
fig.add_trace(
    go.Histogram(
        x=df['age'].where(df['target'] == 'Healthy'),
        name='Healthy',
        nbinsx=20,
        showlegend=False,
        marker={"color": 'white'}
    ),
    row=1, col=1
)

# Right panel: gender distribution for each group
fig.add_trace(
    go.Histogram(
        x=df['sex'].where(df['target'] == 'Heart Patient'),
        name='Heart Patient',
        nbinsx=20,
        marker={"color": '#f84242'}
    ),
    row=1, col=2
)
fig.add_trace(
    go.Histogram(
        x=df['sex'].where(df['target'] == 'Healthy'),
        name='Healthy',
        nbinsx=20,
        marker={"color": 'white'}
    ),
    row=1, col=2
)

fig.update_layout(
    height=500,
    title_text="<b>Age & Gender Distribution</b>",
    title_font_size=30,
    bargap=0.1,
    template='plotly_dark',
)
fig.update_xaxes(title_text="Age", row=1, col=1)
fig.update_yaxes(title_text="Count", row=1, col=1)
fig.update_xaxes(title_text="Gender", row=1, col=2)
fig.update_yaxes(title_text="Count", row=1, col=2)
fig.show()
The first subplot built using Plotly displays the age distribution, with separate histograms for heart patients and healthy individuals. The second subplot shows the gender distribution.
As we can see, females in this dataset are more prone to heart disease than males: almost 75% of the females in the data are heart patients. People in the 35-55 age group are also more likely to be heart patients.
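A percentage like the one quoted above is simply the share of positive targets within each group. The sketch below shows that calculation in plain Python on a handful of toy records invented for illustration; the helper name `positive_rate_by_group` is hypothetical. On the real frame, `data.groupby('sex')['target'].mean()` should give the equivalent per-group rates.

```python
from collections import defaultdict


def positive_rate_by_group(records):
    """Return the fraction of records with target == 1 for each group label."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, target in records:
        totals[group] += 1
        positives[group] += target
    return {group: positives[group] / totals[group] for group in totals}


# Toy (sex, target) pairs; these values are made up, not taken from heart.csv
records = [
    ("Female", 1), ("Female", 1), ("Female", 1), ("Female", 0),
    ("Male", 1), ("Male", 0), ("Male", 0), ("Male", 1),
]
rates = positive_rate_by_group(records)
for group, rate in rates.items():
    print(f"{group}: {rate:.0%} heart patients")
```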
Let's repeat the same analysis for chest pain and maximum heart rate by tweaking the above code:
# Count chest pain types separately for healthy people (target 0) and heart patients (target 1)
cp_labels = {0: 'Type 1', 1: 'Type 2', 2: 'Type 3', 3: 'Type 4'}
cp0 = data[data['target'] == 0].groupby(by=["cp"]).size().reset_index(name="Count")
cp1 = data[data['target'] == 1].groupby(by=["cp"]).size().reset_index(name="Count")
cp0['cp'] = cp0['cp'].map(cp_labels)
cp1['cp'] = cp1['cp'].map(cp_labels)

# Copy before adding a column so the original frame is left untouched
df1 = data[['thalach', 'chol', 'target', 'age', 'trestbps']].copy()
df1['targetname'] = df1['target'].map({1: 'Heart Patient', 0: 'Healthy'})

fig = make_subplots(rows=1, cols=2, specs=[[{"type": "bar"}, {"type": "scatter"}]])

# Left panel: chest pain type counts per group
fig.add_trace(
    go.Bar(
        x=cp0['cp'], y=cp0.Count, marker={"color": 'white'}, name='Healthy'
    ),
    row=1, col=1
)
fig.add_trace(
    go.Bar(
        x=cp1['cp'], y=cp1.Count, marker={"color": '#f84242'}, name='Heart Patient'
    ),
    row=1, col=1
)
fig.update_layout(
    height=500,
    title_text="<b>Chest Pain & Max Heart Rate</b>",
    title_font_size=30,
    bargap=0.1,
    template='plotly_dark',
)

# Right panel: maximum heart rate against age, colored by target
fig.add_trace(
    go.Scatter(
        x=df1.thalach, y=df1.age, mode='markers', text=df1['targetname'], showlegend=False,
        marker=dict(
            color=df1.target,
            colorscale=['white', '#f84242'],
            line_width=1)
    ),
    row=1, col=2
)
fig.update_xaxes(title_text="Chest Pain Type", row=1, col=1)
fig.update_yaxes(title_text="Count", row=1, col=1)
fig.update_xaxes(title_text="Max. Heart Rate", row=1, col=2)
fig.update_yaxes(title_text="Age", row=1, col=2)
fig.show()
As we can observe, the risk of heart disease differs markedly between chest pain types. Also, a higher maximum heart rate among younger people appears to be a major indicator of heart disease.
We can perform the same analysis for the remaining fields in the data as well in this process. But the article will become lengthy, so let's directly see the correlation among fields.
import seaborn as sb
import matplotlib.pyplot as plt
sb.set(style="white")
plt.rcParams['figure.figsize']=(15,15)
sb.heatmap(data.corr(),annot= True, linewidth=0.5)
plt.title("Correlation between variables")
Here we are correlating the fields over the entire dataset. Based on this correlation, we can see that chest pain and maximum heart rate are correlated with our target variable (heart patient or not). I also like to compare the data across value intervals. We will not cover that in this article, but here is a code snippet for anyone who wants to try:
# Numeric columns: mean target within each of 5 equal-width intervals
for seq_columns in ["age","trestbps","thalach","oldpeak","chol"]:
    print(data.groupby(pd.cut(data[seq_columns],5), observed=False)['target'].mean())
# Categorical columns: mean target within each category
for categ_columns in list(set(data.columns) - set(["age","trestbps","thalach","oldpeak","chol","target"])):
    print(data.groupby(categ_columns)['target'].mean())
Depending on the analysis of the output, we can derive the following insights β
- The higher the maximum heart rate, the higher the chance of being a heart patient.
- The risk of heart disease differs markedly between chest pain types.
- If cholesterol exceeds 475, the chance of a heart attack is much higher.
- Females are more prone to heart disease than males in this data.
- Exercise-induced angina is strongly associated with heart disease in younger people.
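The interval-based comparison behind these insights boils down to splitting a numeric column into equal-width bins and averaging the target inside each bin. Here is a rough pure-Python sketch of that idea, with toy values invented for illustration and a hypothetical helper name `binned_target_mean`. Note one simplification: this sketch uses left-closed bins, whereas `pd.cut` uses right-closed intervals by default, so values on a bin edge can land in a different bin.

```python
def binned_target_mean(values, targets, bins=5):
    """Split values into equal-width bins and return (low, high, mean target) per bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins  # assumes the values are not all identical
    sums = [0] * bins
    counts = [0] * bins
    for value, target in zip(values, targets):
        # The maximum value would fall just past the last bin, so clamp it
        idx = min(int((value - lo) / width), bins - 1)
        sums[idx] += target
        counts[idx] += 1
    return [
        (lo + i * width, lo + (i + 1) * width,
         sums[i] / counts[i] if counts[i] else None)
        for i in range(bins)
    ]


# Toy ages and targets; made-up numbers, not taken from heart.csv
ages = [30, 35, 40, 45, 50, 55, 60, 65, 70, 80]
targets = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
summary = binned_target_mean(ages, targets, bins=5)
bin_means = [mean for _, _, mean in summary]
for low, high, mean in summary:
    print(f"[{low:.0f}, {high:.0f}): mean target = {mean}")
```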
We can derive more insights by closely following each field, but for the purposes of this article, let's dive into the modeling part, where we predict heart attacks using Machine Learning.
from sklearn.preprocessing import StandardScaler
data=pd.get_dummies(data,columns=['sex','cp','restecg','exang','slope','ca','thal','fbs'])
scaler=StandardScaler()
scaledColumns=['age','trestbps','thalach','oldpeak','chol']
data[scaledColumns]=scaler.fit_transform(data[scaledColumns])
data.head()
As we have some categorical fields, we convert them into binary (one-hot) columns with get_dummies so that the models can use them. We also use StandardScaler to standardize the numeric fields, which puts them on a scale comparable with the rest of the data.
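To show what these two transformations do to the values, here is a small from-scratch sketch in plain Python. The helper names `standardize` and `one_hot` and the toy inputs are assumptions for illustration only. The standardization uses the population standard deviation, which is what StandardScaler uses; the one-hot helper emits 0/1 integers, while the exact dtype produced by `get_dummies` depends on the pandas version.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values, prefix):
    """Turn each categorical value into a dict of binary indicator columns."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]


# Toy inputs, invented for illustration
ages = [40, 50, 60]
scaled = standardize(ages)
cp_encoded = one_hot([0, 2, 1, 0], "cp")

print(scaled)
print(cp_encoded[1])
```

After standardization, the middle value sits at 0 and the other two are symmetric around it, while each chest pain code becomes one column with a single 1 per row.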
Model Development
Let's split the data for training and testing:
from sklearn.model_selection import train_test_split
x = data.drop(columns='target',axis=1)
y = data['target']
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,stratify=y)
print(x_train.shape,x_test.shape)
We are reserving 20% of the data (61 rows) for testing and 80% of the data (241 rows) for training. I chose this ratio to help guard against overfitting. We will also use the 10-fold cross-validation score as our main scoring metric to check how well the model generalizes.
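The row counts follow from rounding the test share up: 20% of 302 is 60.4, which becomes 61 test rows and leaves 241 for training. The sketch below checks that arithmetic and illustrates the idea behind `stratify=y` with a simplified pure-Python split that samples the same fraction from each class. The helper `stratified_split` and the toy labels are assumptions for illustration; scikit-learn's own allocation logic is more involved than this.

```python
import math
import random

# Row counts for a 20% test split of the 302 de-duplicated records
n_test_rows = math.ceil(0.2 * 302)
n_train_rows = 302 - n_test_rows
print(n_test_rows, n_train_rows)


def stratified_split(labels, test_size=0.2, seed=0):
    """Split indices so that each class contributes about test_size of its rows to the test set."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 60 positives and 40 negatives (made-up proportions)
labels = [1] * 60 + [0] * 40
train_idx, test_idx = stratified_split(labels, test_size=0.2)
print(len(train_idx), len(test_idx))
```

Because each class is sampled separately, the test set keeps the same 60/40 class balance as the full toy data.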
Let's start with Logistic Regression:
from sklearn.metrics import confusion_matrix,accuracy_score,roc_curve,classification_report
from sklearn.model_selection import cross_val_score,GridSearchCV
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(x_train,y_train)
train_pred=model.predict(x_train)
score=accuracy_score(y_train,train_pred)
print("Train Accuracy Score : ",score*100)
test_pred=model.predict(x_test)
test_score=accuracy_score(y_test,test_pred)
print("Test Accuracy Score : ",test_score*100,'\n\n')
lrScore=cross_val_score(model,x,y,cv=10).mean()*100
print("10-Fold CV Score : ",lrScore,'\n')
print("Confusion Matrix : \n",confusion_matrix(y_test,test_pred),'\n\n')
print(classification_report(y_test,test_pred))
We have achieved a 10-fold cross-validation score of 85%.
10-fold cross-validation is a technique that divides the data into 10 subsets, trains and tests the model 10 times, each time holding out a different subset for testing, and averages the performance scores as an estimate of the model's generalization ability.
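To make the fold mechanics concrete, here is a minimal sketch that generates the train and test indices for each of the 10 rounds. The helper name `k_fold_indices` is hypothetical, and this is plain, unshuffled k-fold; as far as I know, `cross_val_score` with an integer `cv` and a classifier uses a stratified variant, so treat this as an illustration of the idea rather than a replica of scikit-learn's behavior.

```python
def k_fold_indices(n_samples, k=10):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for i in range(k):
        # The first `extra` folds take one additional sample each
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


# 302 rows, matching the size of the de-duplicated dataset
folds = k_fold_indices(302, k=10)
for fold_number, (train, test) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: train={len(train)} rows, test={len(test)} rows")
```

Every row ends up in the test set exactly once, and the final score is the average of the 10 per-fold scores.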
Let's try different algorithms, calculate their scores, and use the one that predicts most accurately:
---------------------------------------------------------------------------
#KNN
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=1,n_jobs=-1)
knn.fit(x_train,y_train)
train_pred=knn.predict(x_train)
score=accuracy_score(y_train,train_pred)
---------------------------------------------------------------------------
#SVC
from sklearn.svm import SVC
svm = SVC(C=4, degree=8, kernel='poly',max_iter=1000)
svm.fit(x_train,y_train)
train_pred=svm.predict(x_train)
score=accuracy_score(y_train,train_pred)
---------------------------------------------------------------------------
# Naive Bayes (Gaussian)
from sklearn.naive_bayes import GaussianNB
nb=GaussianNB()
nb.fit(x_train,y_train)
train_pred=nb.predict(x_train)
score=accuracy_score(y_train,train_pred)
---------------------------------------------------------------------------
# Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt=DecisionTreeClassifier(criterion = 'entropy',random_state=0,max_depth = 12)
dt.fit(x_train,y_train)
train_pred=dt.predict(x_train)
score=accuracy_score(y_train,train_pred)
---------------------------------------------------------------------------
# Random Forest
from sklearn.ensemble import RandomForestClassifier
randFor = RandomForestClassifier(n_estimators=1000, random_state = 35)
randFor.fit(x_train, y_train)
train_pred=randFor.predict(x_train)
score=accuracy_score(y_train,train_pred)
---------------------------------------------------------------------------
# XGBoost
from xgboost import XGBClassifier
xgb = XGBClassifier(n_estimators=200)
xgb.fit(x_train, y_train)
train_pred=xgb.predict(x_train)
score=accuracy_score(y_train,train_pred)
---------------------------------------------------------------------------
Instead of walking through the algorithms one by one, I have given the code above; you can adapt the code below by changing the variable names to get the accuracies for each algorithm.
print("Train Accuracy Score : ",score*100)
test_pred=xgb.predict(x_test)
test_score=accuracy_score(y_test,test_pred)
print("Test Accuracy Score : ",test_score*100,'\n\n')
xgbScore = cross_val_score(xgb,x,y,cv=10).mean()*100
print("10-Fold CV Score : ",xgbScore,'\n')
print("Confusion Matrix : \n",confusion_matrix(y_test,test_pred),'\n\n')
print(classification_report(y_test,test_pred))
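Rather than editing variable names by hand for each algorithm, one option is to loop over a dictionary of models. The sketch below shows that pattern using two tiny stand-in classifiers written in plain Python so that it runs on its own; `MajorityClassifier`, `ThresholdClassifier`, `evaluate_models` and the toy data are all hypothetical. Since the scikit-learn estimators above expose the same `fit` and `predict` methods, a dictionary such as `{"knn": knn, "svm": svm}` could presumably be passed in the same way.

```python
class MajorityClassifier:
    """Stand-in model: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ThresholdClassifier:
    """Stand-in model: predicts 1 when the first feature exceeds the training mean."""

    def fit(self, X, y):
        self.threshold_ = sum(row[0] for row in X) / len(X)
        return self

    def predict(self, X):
        return [int(row[0] > self.threshold_) for row in X]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluate_models(models, x_train, y_train, x_test, y_test):
    """Fit every model and collect its train and test accuracy."""
    results = {}
    for name, model in models.items():
        model.fit(x_train, y_train)
        results[name] = {
            "train": accuracy(y_train, model.predict(x_train)),
            "test": accuracy(y_test, model.predict(x_test)),
        }
    return results


# Toy single-feature data (think of it as a maximum heart rate); values are made up
x_train = [[150], [160], [170], [140], [110], [100]]
y_train = [1, 1, 1, 1, 0, 0]
x_test = [[155], [105], [130]]
y_test = [1, 0, 0]

models = {"majority": MajorityClassifier(), "threshold": ThresholdClassifier()}
results = evaluate_models(models, x_train, y_train, x_test, y_test)
for name, scores in results.items():
    print(f"{name}: train={scores['train']:.2f}, test={scores['test']:.2f}")
```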
Model Comparison
As the amount of data is very small, Logistic Regression gave us good accuracy compared with the rest. Also, I haven't experimented much with the hyperparameters of the other algorithms. But this modeling is for educational purposes, and I think you can carry on with the rest.
Remember, although we took the 10-fold CV score as our main scoring metric, in healthcare, precision also becomes an essential metric to focus on. High precision means that the positive predictions (patients classified as high risk) are more likely to be accurate. In this case, you want to ensure that the individuals identified as high risk indeed have a higher probability of experiencing a heart attack.
However, it is also important to consider the impact of false negatives (missed high-risk patients). If your goal is to identify as many high-risk patients as possible, with the aim of early intervention or close monitoring, then recall becomes crucial. High recall indicates that you are capturing a larger proportion of true positive cases.
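Both metrics can be read straight off the confusion matrix printed earlier, which scikit-learn lays out as [[TN, FP], [FN, TP]] for a binary target. The sketch below computes them from four counts; the counts are hypothetical values chosen only to add up to a 61-row test set, not results from our models, and the helper name `precision_recall` is made up.

```python
def precision_recall(tn, fp, fn, tp):
    """Compute precision and recall for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, NOT actual model output
tn, fp, fn, tp = 24, 4, 3, 30
precision, recall = precision_recall(tn, fp, fn, tp)

print(f"Precision: {precision:.3f}")  # share of predicted patients who really are patients
print(f"Recall:    {recall:.3f}")     # share of real patients that the model caught
```

Lowering false positives raises precision, while lowering false negatives raises recall, so the two usually have to be traded off against each other.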
Hope you find this article helpful.
Happy Learning...
Published via Towards AI