Original link: http://tecdat.cn/?p=23061

Data set information:

This data set dates back to 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach (VA). The "goal" field indicates whether the patient has heart disease. Its value is an integer: 0 = no disease, 1 = disease.

Goal:

The main goal is to predict whether a given person has heart disease, with the help of several factors such as age, cholesterol level, and type of chest pain.

The algorithms we used for this problem are:

  • Binary logistic regression
  • Naive Bayes
  • Decision tree
  • Random forest

Description of the data set:

The data has 303 observations and 14 variables. Each observation contains the following information about the individual.

  • Age – age of the individual, in years
  • Sex – sex (1 = male; 0 = female)
  • CP – type of chest pain (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
  • TRESTBPS – resting blood pressure (in mm Hg)
  • CHOL – serum cholesterol in mg/dL
  • FBS – fasting blood glucose > 120 mg/dL (1 = true; 0 = false)
  • RESTECG – resting ECG results (0 = normal; 1 = ST-T wave abnormality; 2 = left ventricular hypertrophy)
  • THALACH – maximum heart rate achieved
  • EXANG – exercise-induced angina (1 = yes; 0 = no)
  • OLDPEAK – ST depression induced by exercise relative to rest
  • SLOPE – slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
  • CA – number of major blood vessels (0-4) colored by fluoroscopy
  • THAL – thalassaemia, an inherited blood disorder that affects the body's ability to produce hemoglobin and red blood cells (1 = normal; 2 = fixed defect; 3 = reversible defect)
  • TARGET – the attribute to be predicted – diagnosis of heart disease (angiographic disease status): 0 = < 50% diameter narrowing; 1 = > 50% diameter narrowing

Load the data in RStudio

heart<-read.csv("heart.csv",header = T)

header = T means that the data file has its own header row; in other words, the first row is read as column names rather than as an observation.

head(heart)

When we want to view and examine the first six observation points of the data, we use the head function.

tail(heart)

Shown are the last six observation points in our data

colSums(is.na(heart))

This function is used to check whether our data contains any NA values. If no NA is found, we can move on; otherwise, we have to deal with the NA values first.
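
If any NA values were found, one minimal way to handle them (assuming we are happy to simply drop incomplete rows) would be:

heart <- na.omit(heart)  # keep only complete rows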

Examine our data structures

str(heart)

Check out our data summary

summary(heart)

From the above summary, we can note the following points:

  • Sex is not a continuous variable; according to our description it can only be male or female, so we convert it from an integer to a factor.
  • CP cannot be a continuous variable because it encodes the type of chest pain, so we convert it to a factor.
  • FBS cannot be a continuous variable or an integer because it only indicates whether the fasting blood glucose level is above 120 mg/dL, so we convert it to a factor.
  • RESTECG is the type of ECG result, not an integer, so we convert it to a factor and label it.
  • Based on the description of the data set, Exang should be a factor: angina either occurs during exercise or it does not. Therefore the variable is converted to a factor.
  • Slope cannot be an integer because it is the type of slope observed on the ECG, so we convert it to a factor.
  • According to the description of the data set, CA is not an integer, so we convert it to a factor.
  • Thal is not an integer because it is the type of thalassaemia, so we convert it to a factor.
  • Target is the predicted variable that tells us whether the person has heart disease or not, so we convert it to a factor and label it.

Based on the above considerations, we made some changes to the variables

# for example:
heart$sex <- as.factor(heart$sex)
levels(heart$sex) <- c("Female", "Male")
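
The remaining conversions follow the same pattern; a sketch, assuming the lower-case column names shown in the data description, and assuming the target is relabelled "No"/"Yes" (these labels are an assumption, chosen to match how predictions are labelled later):

heart$cp      <- as.factor(heart$cp)
heart$fbs     <- as.factor(heart$fbs)
heart$restecg <- as.factor(heart$restecg)
heart$exang   <- as.factor(heart$exang)
heart$slope   <- as.factor(heart$slope)
heart$ca      <- as.factor(heart$ca)
heart$thal    <- as.factor(heart$thal)
heart$target  <- factor(heart$target, levels = c(0, 1), labels = c("No", "Yes"))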

Verify that the above changes were executed successfully

str(heart)

summary(heart)

  

EDA

EDA stands for Exploratory Data Analysis, a method/philosophy of data analysis that uses a variety of techniques (primarily graphical) to gain insight into a data set.

For the graphical representation, we need the library “ggplot2”

library(ggplot2)
ggplot(heart, aes(x = age, fill = target, color = target)) +
  geom_histogram(binwidth = 1, color = "black") +
  labs(x = "Age", y = "Frequency", title = "Heart Disease w.r.t. Age")

We can see that people between the ages of 40 and 60 have a higher incidence of heart disease than people over 60.

cp_table <- table(heart$cp)

pie(cp_table)

We can conclude that, of all the types of chest pain observed in these individuals, the majority is typical angina, followed by non-anginal pain.

Perform machine learning algorithms

Logistic regression

First, we divided the data set into training data (75%) and test data (25%).

set.seed(100)  # fix the random seed to 100 so the sampling can be reproduced
index <- sample(nrow(heart), 0.75 * nrow(heart))
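
The index is then used to split the data; a minimal sketch (the names train and test are assumed here, matching how they are referenced later):

train <- heart[index, ]
test  <- heart[-index, ]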

The model is generated on the training data, and then validated with the test data.

blr <- glm(target ~ ., data = train, family = "binomial")  # family = "binomial" means the outcome has only two possible classes

To examine how our model performs, we calculate prediction scores and build a confusion matrix to understand the accuracy of the model.

train$pred <- fitted(blr)  # fitted() can only be used to obtain the predicted scores for the data the model was built on

As you can see, the predicted score is the probability of having heart disease. But we still need to find a suitable cut-off point, above which it is easy to distinguish between heart disease and no heart disease.

For this, we need the ROC curve, a graph that shows the performance of the classification model at all classification thresholds. It allows us to pick an appropriate threshold.

library(ROCR)
pred <- prediction(train$pred, train$target)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = T, print.cutoffs.at = seq(0.1, 1, by = 0.1))

From the ROC curve, we observe that a cut-off of 0.6 gives a good balance of sensitivity and specificity, so we choose 0.6 as the cut-off point for differentiation.

train$pred1 <- ifelse(train$pred < 0.6, "No", "Yes")

# ACC_TR: accuracy on the training data

From the confusion matrix of the training data, we know that the accuracy of the model is 88.55%.
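
The confusion-matrix call itself is not shown in the original; a minimal sketch with caret's confusionMatrix, assuming train$target carries the same "No"/"Yes" labels as train$pred1:

library(caret)
ACC_TR <- confusionMatrix(factor(train$pred1), train$target)
ACC_TR  # overall accuracy of roughly 88.55% on the training data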

Now validate the model against the test data

test$pred <- predict(blr, newdata = test, type = "response")  # type = "response" returns predicted probabilities
head(test)

We know that the cut-off point for the training data is 0.6; the same cut-off point is applied to the test data.
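
A sketch of that conversion (the column name pred1 mirrors the one created for the training data):

test$pred1 <- ifelse(test$pred < 0.6, "No", "Yes")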

confusionMatrix(factor(test$pred1), test$target)  # accuracy on the test data

Check how much of the area lies under the ROC curve (AUC)

auc <- performance(pred, "auc")
auc@y.values

We can conclude that our accuracy on the test data is 81.58% and that the area under the curve (AUC) is 90.26%, while our misclassification rate is 18.42%.

Naive Bayes algorithm

Before executing Naive Bayes algorithms, we need to remove the additional predictive columns that we added when performing BLR.
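
A sketch of that clean-up (exactly which columns exist depends on the earlier steps; here we assume the pred and pred1 columns added during the logistic-regression step are the ones to drop):

train$pred <- NULL; train$pred1 <- NULL
test$pred <- NULL; test$pred1 <- NULL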

library(naivebayes)
nb <- naive_bayes(target ~ ., data = train)  # naive Bayes model (target ~ .), using the naivebayes package

Examine the model with training data and create its confusion matrix to understand how accurate the model is.

train$pred <- predict(nb, train)
confusionMatrix(train$pred, train$target)

We can say that the accuracy of the naive Bayes model on the training data is 85.46%.

Now, validate the model on the test data by predicting and creating a confusion matrix.

test$pred <- predict(nb, test)
confusionMatrix(test$pred, test$target)

 

We can conclude that the accuracy of the model generated with the naive Bayes algorithm is 78.95%; equivalently, its misclassification rate is 21.05%.

The decision tree

Before implementing the decision tree, we need to remove the extra columns that we added when executing Naive Bayes algorithms.

train$pred<-NULL

rpart stands for Recursive Partitioning and Regression Trees.

rpart can be used whether the independent and dependent variables are continuous or categorical.

rpart automatically detects whether regression or classification is required, based on the dependent variable.

Implementing the decision tree
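
The model-fitting call is not shown in the original; a minimal sketch with rpart (the object name tree matches the plot() call below, and rpart picks classification automatically because target is a factor):

library(rpart)
tree <- rpart(target ~ ., data = train)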

plot(tree)

With the help of the decision tree, we can say that the most important variables of all are Cp, Ca, Thal, Oldpeak.
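
The same ranking can also be checked numerically; variable.importance is a standard component of an rpart fit:

tree$variable.importance  # named vector of variable importances, largest first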

Let’s validate the model with test data and find out how accurate the model is.

test$pred <- predict(tree, test, type = "class")
confusionMatrix(test$pred, test$target)

We can say that the decision tree has an accuracy of 76.32%, or equivalently a misclassification rate of 23.68%.

Random forests

Before executing the random forest, we need to remove the extra prediction columns that we added when executing the decision tree.

test$pred<-NULL

With random forest, we do not need to divide the data into training and test sets; we generate the model directly on the whole data set, since the out-of-bag (OOB) samples provide an internal error estimate. To generate the model, we need the randomForest library.

# set.seed controls the randomness so that the result can be reproduced
library(randomForest)
set.seed(100)
model_rf <- randomForest(target ~ ., data = heart)
model_rf

The relationship between the number of trees in the random forest and the error is plotted on the graph.

plot(model_rf)

The red line represents the misclassification rate (MCR) for the no-heart-disease class, the green line represents the MCR for the heart-disease class, and the black line represents the overall MCR, i.e. the OOB error. The overall error rate is what we are interested in, and it stabilises at an acceptably low level as more trees are added.
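
The overall OOB error can also be read off directly from the fitted object (a sketch; err.rate is a standard component of a classification randomForest fit):

tail(model_rf$err.rate[, "OOB"], 1)  # OOB misclassification rate after the final tree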

Conclusion

After running the various classification techniques and taking their accuracy into account, we can conclude that all the models are between 76% and 84% accurate. Among them, the random forest is slightly more accurate, at 83.5%.

