Heart Disease Prediction Using Python


Hey fellow coder! Today in this tutorial, we will try to predict the presence of a very common illness in people, heart disease.

Heart disease is one of the biggest causes of morbidity and mortality among the population of the world. Heart disease refers to a group of disorders that affect the heart. According to WHO, cardiovascular illnesses are now the leading cause of mortality globally, accounting for 17.9 million deaths per year.

Detecting and preventing the factors that have the greatest impact on heart disease is very important in healthcare.

Understanding the Heart Disease Dataset

The dataset chosen for this tutorial is the 2020 annual CDC survey data. You can download the dataset here.

The original survey data consists of 401,958 rows and 279 columns. The cleaned version used in this tutorial reduces the nearly 300 original variables to about 20. We treat the variable “HeartDisease” as binary (“Yes” – respondent had the disease; “No” – respondent had no disease).

Code for predicting Heart Disease

Our aim is to predict whether a person has heart disease using the dataset. The code implementation includes the following steps:

  1. Importing necessary libraries/modules
  2. Data loading and pre-processing
  3. Generation of the train set and test set of data
  4. Defining functions for the training on the train dataset
  5. Performing training by using defined methods
  6. Performing the training by using Sklearn Libraries

Importing Dependencies

The very first thing is importing the required libraries such as pandas, NumPy, and pyplot into our program.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

Loading and Pre-processing Dataset

data = pd.read_csv('heart_2020_cleaned.csv')
print("Number of Datapoints : ",data.shape[0])
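Before converting anything, it helps to see which columns actually hold text. The sketch below uses a tiny hand-made DataFrame as a stand-in for the real file (its column names and values are illustrative assumptions) and lists the non-numeric columns together with their unique values.

```python
import pandas as pd

# Hypothetical miniature stand-in for the real dataset
sample = pd.DataFrame({
    "HeartDisease": ["No", "Yes", "No"],
    "BMI": [16.6, 20.3, 26.6],
    "Smoking": ["Yes", "No", "Yes"],
})

# Columns that hold text and therefore need to be encoded as integers
text_columns = [
    col for col in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[col])
]

for col in text_columns:
    print(col, "->", sorted(sample[col].unique()))
```

Running the same loop on `data` would show every value that needs an integer code before training.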

A lot of features in the dataset contain string data, which a logistic regression model cannot use directly. Therefore we convert these values into integers: the unique values in each such column are assigned integer codes starting from either 0 or 1. For example, in columns with only “yes” or “no” options, 1 is assigned to “yes” and 0 to “no”.

# Assign each result back to the column; inplace=True on a slice
# may not propagate to `data` in newer pandas versions

# Converting Gender type to Integers

# Categorizing Age values
data.iloc[:, 9] = data.iloc[:, 9].replace("80 or older", 13)

# Categorize Race of the person
data.iloc[:, 10] = data.iloc[:, 10].replace("American Indian/Alaskan Native", 4)

# Categorize if the person is diabetic or not
data.iloc[:, 11] = data.iloc[:, 11].replace("Yes (during pregnancy)", 3)
data.iloc[:, 11] = data.iloc[:, 11].replace("No, borderline diabetes", 2)

# Categorize the Health of the person into integer values
data.iloc[:, 13] = data.iloc[:, 13].replace("Very good", 3)

# Set final label of having heart disease or not into integers
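The same kind of conversion can also be written with one mapping dictionary per column, which keeps all codes for a column in one place. This is only a sketch on a made-up DataFrame: the column names and most of the codes are assumptions chosen for illustration (only “Very good” mapping to 3 follows the snippet above).

```python
import pandas as pd

# Hypothetical sample; the real dataset has many more rows and columns
sample = pd.DataFrame({
    "Smoking": ["Yes", "No", "Yes"],
    "GenHealth": ["Very good", "Poor", "Excellent"],
})

# Illustrative codes: any consistent assignment works
yes_no = {"No": 0, "Yes": 1}
health = {"Poor": 0, "Fair": 1, "Good": 2, "Very good": 3, "Excellent": 4}

encoded = sample.copy()
encoded["Smoking"] = sample["Smoking"].map(yes_no)
encoded["GenHealth"] = sample["GenHealth"].map(health)

# map() leaves NaN for any value missing from the dictionary,
# so this check catches typos or forgotten categories
assert not encoded.isna().any().any()
print(encoded)
```

A dictionary makes it easy to spot a forgotten category, because unmapped values show up as NaN instead of silently staying as strings.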

As the final step, we apply min-max normalization so that every feature lies between 0 and 1. This keeps features with large values from dominating the training.

y = data.HeartDisease.values
x_d = data.drop(["HeartDisease"], axis=1)
# Min-max scale each column separately to the range [0, 1]
x = (x_d - x_d.min()) / (x_d.max() - x_d.min())
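To see what the normalization does, here is a small worked example on made-up numbers (the column names and values are assumptions for illustration). Each column is rescaled on its own, so its smallest value becomes 0 and its largest becomes 1.

```python
import pandas as pd

# Hypothetical feature values
features = pd.DataFrame({
    "BMI": [20.0, 30.0, 40.0],
    "SleepTime": [4.0, 8.0, 6.0],
})

col_min = features.min()   # per-column minimum
col_max = features.max()   # per-column maximum

# Min-max scaling: (value - min) / (max - min), column by column
scaled = (features - col_min) / (col_max - col_min)
print(scaled)
```

Note that a column whose values are all identical would lead to a division by zero here, so such columns are best dropped or handled separately.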

Training and Testing Split of Dataset

We use an 80-20 split: 80% of the data forms the training set and the remaining 20% forms the test set. To get the split we use the train_test_split function from scikit-learn.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=42)

x_train = x_train.T
x_test = x_test.T
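The effect of the split and of the transpose is easiest to see on a toy array. The sketch below uses made-up data (10 samples with 2 features, an assumption for illustration) and prints the resulting shapes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features each
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)   # rows are samples
print(X_train.T.shape)               # after .T, rows are features
```

scikit-learn estimators expect one row per sample, so data that has been transposed to one row per feature needs to be transposed back before it is passed to them.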

Applying Logistic Regression to predict Heart Disease

In this final section, we train a logistic regression model on the training data using the LogisticRegression class from scikit-learn as follows.

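from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
# x_train was transposed above, so transpose it back to (samples, features)
lr.fit(x_train.T, y_train)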
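For intuition about what a logistic regression model learns, here is a minimal from-scratch sketch in NumPy using plain gradient descent. It is not what scikit-learn runs internally (scikit-learn uses more refined solvers and regularization by default); the function names, learning rate, iteration count, and toy data are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, learning_rate=0.5, n_iter=2000):
    """Fit weights and bias by gradient descent on the log loss.

    X has shape (n_samples, n_features); y holds 0/1 labels.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)              # predicted probabilities
        grad_w = X.T @ (p - y) / n_samples  # gradient w.r.t. weights
        grad_b = np.mean(p - y)             # gradient w.r.t. bias
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def predict(X, w, b):
    """Label a sample 1 when its predicted probability is at least 0.5."""
    return (sigmoid(X @ w + b) >= 0.5).astype(int)

# Hypothetical one-feature data: small values -> 0, large values -> 1
X_toy = np.array([[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

w, b = train_logistic(X_toy, y_toy)
print(predict(X_toy, w, b))
```

The model computes a weighted sum of the features, passes it through the sigmoid to get a probability, and nudges the weights in the direction that reduces the prediction error.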

For the testing dataset, we predict whether each person has the disease or not. We also compute the accuracy score of these predictions using the code below.

all_pred = list(lr.predict(x_test.T))
score = lr.score(x_test.T,y_test.T)
print("Score of Logistic Regression : ",score)
Score of Logistic Regression :  0.9137572507387545

You can see the score is pretty decent: the model classifies more than 91% of the test samples correctly.
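Accuracy can be hard to interpret on its own when one class is much more common than the other, which is often the case for disease labels. The sketch below uses made-up labels (not the tutorial's data) to compare accuracy against the score of always predicting the most common class, and to compute recall for the positive class.

```python
import numpy as np

# Hypothetical labels: 1 = disease, 0 = no disease (imbalanced on purpose)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])

# Share of samples classified correctly
accuracy = np.mean(y_true == y_pred)

# Accuracy obtained by always predicting the most common class
majority_class = np.bincount(y_true).argmax()
baseline = np.mean(y_true == majority_class)

# Recall: share of actual positives that the model found
true_pos = np.sum((y_true == 1) & (y_pred == 1))
recall = true_pos / np.sum(y_true == 1)

print(accuracy, baseline, recall)
```

Running the same comparison on `y_test` and `all_pred` would show how much of the score comes from the model rather than from the class balance.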

Heart disease is one of society’s key worries nowadays. Manually calculating the chances of developing heart disease based on risk factors is tough. Machine learning techniques, on the other hand, can help predict the outcome from existing data.

Thank you for reading!

I hope you liked the tutorial!

I would recommend you to read the following tutorials and learn a lot more:

  1. K-Nearest Neighbors (KNN) in Python
  2. Logistic Regression in R programming
  3. Python Faker Module – All you need to know!
