Original link: tecdat.cn/?p=23061

Original source: Tuoduan Data Tribe public account

Dataset information:

This dataset dates back to 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach VA. The "target" field indicates whether the patient has heart disease. Its value is an integer: 0 = no disease, 1 = disease.

Goal:

The main aim is to predict whether a given person has heart disease, with the help of several factors such as age, cholesterol levels, type of chest pain and so on.

The algorithms we use for this problem are:

  • Binary logistic regression
  • Naive Bayes
  • Decision tree
  • Random forest

Description of data set:

The data has 303 observations and 14 variables. Each observation contains the following information about an individual.

  • Age – age of the individual, in years
  • Sex – sex (1 = male; 0 = female)
  • Cp – type of chest pain (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
  • Trestbps – resting blood pressure
  • Chol – serum cholesterol, mg/dL
  • Fbs – fasting blood glucose > 120 mg/dL (1 = true; 0 = false)
  • Restecg – resting ECG results (0 = normal; 1 = ST–T wave abnormality; 2 = left ventricular hypertrophy)
  • Thalach – maximum heart rate achieved
  • Exang – exercise-induced angina (1 = yes; 0 = no)
  • Oldpeak – exercise-induced ST depression relative to rest
  • Slope – slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
  • Ca – number of major vessels (0–4) colored by fluoroscopy
  • Thal – thalassemia, an inherited blood disorder that affects the body's ability to produce hemoglobin and red blood cells (1 = normal; 2 = fixed defect; 3 = reversible defect)
  • Target – the attribute to predict: diagnosis of heart disease (angiographic disease status; 0 = < 50% diameter stenosis; 1 = > 50% diameter stenosis)

Load the data in RStudio

heart <- read.csv("heart.csv", header = T)

header = T tells read.csv that the first row of the file contains the column names, so it is read as a header rather than being treated as a data observation.

head(heart)

We use the head() function to view and examine the first six observations of the data.

tail(heart)

These are the last six observations in our data.

colSums(is.na(heart))

This checks whether our data contains any NA values. If none are found we can move on; otherwise, the missing values must be handled first.
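If NAs were present, one simple option is to drop the affected rows before moving on; a minimal sketch (this particular data set usually has none):

heart <- na.omit(heart)  # drop every row that contains an NA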

Check our data structure

str(heart)

Check our data summary

summary(heart)

By observing the above summary, we can say the following:

  • Sex is not a continuous variable, since per our description it can only be male or female. Therefore, we must convert sex from an integer to a factor.
  • Cp cannot be a continuous variable, because it is the type of chest pain. We must convert cp to a factor.
  • Fbs cannot be a continuous variable or an integer, because it only indicates whether the fasting blood glucose level is above 120 mg/dL. It becomes a factor as well.
  • Restecg is the type of ECG result, not an integer. So we convert it to a factor and label it.
  • According to the data description, exang should be a factor: angina either occurs or it does not. Therefore, this variable is converted to a factor.
  • Slope cannot be an integer, because it is the type of slope observed on the ECG. Therefore, we convert it to a factor.
  • According to the data description, ca is not an integer. Therefore, we convert this variable to a factor.
  • Thal is not an integer, because it is a type of thalassemia. Therefore, we convert it to a factor.
  • Target is the predicted variable that tells us whether the person has heart disease. Therefore, we convert this variable to a factor and label it.

Based on the above considerations, we made some changes to the variables

# For example:
heart$sex <- as.factor(heart$sex)
levels(heart$sex) <- c("Female", "Male")
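The remaining conversions follow the same pattern; a minimal sketch, assuming the column names match the data description above (the target labels are illustrative):

heart$cp      <- as.factor(heart$cp)
heart$fbs     <- as.factor(heart$fbs)
heart$restecg <- as.factor(heart$restecg)
heart$exang   <- as.factor(heart$exang)
heart$slope   <- as.factor(heart$slope)
heart$ca      <- as.factor(heart$ca)
heart$thal    <- as.factor(heart$thal)
heart$target  <- factor(heart$target, levels = c(0, 1), labels = c("No", "Yes"))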

Check whether the preceding changes were applied successfully

str(heart)

summary(heart)

  

EDA

EDA is short for Exploratory Data Analysis, a method/philosophy of data analysis that employs a variety of (mostly graphical) techniques to gain an in-depth understanding of a data set.

For the graphical representation, we need the library “ggplot2”

library(ggplot2)
ggplot(heart, aes(x = age, fill = target, color = target)) +
  geom_histogram(binwidth = 1, color = "black") +
  labs(x = "Age", y = "Frequency", title = "Heart Disease w.r.t. Age")

We can conclude that people between the ages of 40 and 60 show the highest frequency of heart disease, higher than that of people over 60.

# Use a different name to avoid masking the base function table()
cp_table <- table(heart$cp)
pie(cp_table)

We can conclude that, of all the chest pain types, the one most often observed in individuals is typical angina, followed by non-anginal pain.
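To make the pie chart easier to read, percentage labels can be added; a sketch reusing cp_table from above:

pct <- round(100 * cp_table / sum(cp_table), 1)
pie(cp_table, labels = paste0(names(cp_table), " (", pct, "%)"))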

Execute machine learning algorithms

Logistic regression

First, we split the data set into training data (75%) and test data (25%).

set.seed(100)  # fix the random seed so the split is reproducible
index <- sample(nrow(heart), 0.75 * nrow(heart))
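The sampled row indices then define the two subsets; a small sketch (the names train and test are the ones the later code uses):

train <- heart[index, ]   # 75% of the rows, for model building
test  <- heart[-index, ]  # the remaining 25%, for validation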

Models are generated on training data and then validated with test data.

blr <- glm(target ~ ., data = train, family = "binomial")
# family = "binomial" means the outcome has only two possible values

To check how well our model performs, we calculate prediction scores and build a confusion matrix to understand the model's accuracy.

train$pred <- fitted(blr)
# fitted() can only return predicted scores for the data on which the model was built

We can see that the predicted score is the probability of heart disease. But we have to find an appropriate cutoff point at which we can cleanly separate disease from no disease.

For this, we need the ROC curve, which is a graph showing the performance of the classification model at all classification thresholds. It will enable us to take appropriate thresholds.

library(ROCR)
pred <- prediction(train$pred, train$target)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = T, print.cutoffs.at = seq(0.1, 1, by = 0.1))

By using the ROC curve, we can observe that 0.6 offers a good trade-off between sensitivity and specificity, so we choose 0.6 as the cutoff point.

train$pred1 <- ifelse(train$pred < 0.6, "No", "Yes")
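The post does not show how the accuracy is computed; a minimal sketch using a base-R confusion matrix (it assumes the levels of train$target are ordered No, Yes, as in the conversion sketch above):

# Cross-tabulate predictions against the actual labels
conf_tr <- table(Predicted = train$pred1, Actual = train$target)
conf_tr
# Accuracy = correctly classified / total observations
acc_tr <- sum(diag(conf_tr)) / sum(conf_tr)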

# Accuracy of the training data
acc_tr

From the confusion matrix on the training data, we see that the model has an accuracy of 88.55%.

Now validate the model on the test data

test$pred <- predict(blr, newdata = test, type = "response")
# type = "response" gives the probability of having heart disease
head(test)

We know that for the training data the cutoff point is 0.6. The test data uses the same cutoff.
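Applying the same 0.6 cutoff to the test scores; a one-line sketch mirroring the training step:

test$pred1 <- ifelse(test$pred < 0.6, "No", "Yes")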

library(caret)  # provides confusionMatrix(); both factors must share the same levels
confusionMatrix(as.factor(test$pred1), test$target)

# Accuracy of the test data (reported in the confusionMatrix() output)

Check the area under the ROC curve for our predictions

auc <- performance(pred, "auc")
auc@y.values

We can conclude that our accuracy is 81.58%, with an area under the curve of 90.26%. Our misclassification rate is therefore 18.42%.

Naive Bayes algorithm

Before running the Naive Bayes algorithm, we need to remove the additional prediction columns that we added during the BLR run.
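The cleanup itself is not shown in the original; a sketch, assuming the column names used in the steps above:

train$pred  <- NULL
train$pred1 <- NULL
test$pred   <- NULL
test$pred1  <- NULL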

library(e1071)  # provides naiveBayes(); an assumption, the original post does not name the package
nb <- naiveBayes(target ~ ., data = train)

The model is examined with training data and its confusion matrix is created to understand the accuracy of the model.

train$pred <- predict(nb, train)
confusionMatrix(train$pred, train$target)

We can say that the accuracy of the Naive Bayes model on the training data is 85.46%.

Now validate the model on the test data by predicting and creating a confusion matrix.

test$pred <- predict(nb, test)
confusionMatrix(test$pred, test$target)

 

We can conclude that the model generated with the Naive Bayes algorithm has an accuracy of 78.95%, or equivalently a misclassification rate of 21.05%.

Decision tree

Before implementing the decision tree, we need to remove the additional columns that we added when implementing the Naive Bayes algorithm.

train$pred <- NULL

rpart stands for recursive partitioning and regression trees.

rpart can be used when both the independent and dependent variables are continuous or categorical.

rpart automatically decides whether to perform regression or classification based on the dependent variable.

Implementing the decision tree
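The fitting step is not shown in the original; a minimal sketch using rpart with its default parameters:

library(rpart)
tree <- rpart(target ~ ., data = train)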

plot(tree)
text(tree)  # add split labels to the plotted tree

With the help of the decision tree, we can say that the most important variables are cp, ca, thal and oldpeak.
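One way to check this numerically is the importance scores stored on the rpart object; a one-line sketch:

tree$variable.importance  # higher values indicate more important splits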

Let’s validate the model with test data and find out how accurate the model is.

test$pred <- predict(tree, test, type = "class")
confusionMatrix(test$pred, test$target)

We can say that the accuracy of the decision tree is 76.32%, or that its misclassification rate is 23.68%.

Random forests

Before executing the random forest, we need to remove the additional prediction columns that we added when executing the decision tree.

test$pred <- NULL

In a random forest, we do not need to split the data into training and test sets; we generate the model directly on the whole data. To generate the model, we need the randomForest library.

library(randomForest)
set.seed(100)  # set.seed controls the randomness, making the result reproducible
model_rf <- randomForest(target ~ ., data = heart)
model_rf

The error of the random forest is plotted against the number of trees.

plot(model_rf)

The red line represents the misclassification rate (MCR) for "no heart disease", the green line represents the MCR for "heart disease", and the black line represents the overall MCR, i.e. the OOB error. The overall error rate is what interests us, and it looks good.
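To label the three lines, a legend can be built from the error-rate matrix stored in the model; a sketch:

legend("topright", legend = colnames(model_rf$err.rate),
       col = 1:3, lty = 1:3)  # matches the colors/line types used by plot(model_rf)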

Conclusion

After running the various classification techniques and considering their accuracy, we can conclude that all the models have an accuracy between 76% and 84%. Among them, random forest achieves a slightly higher accuracy of 83.5%.
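For reference, a small recap of the accuracies reported above (values copied from the text, not recomputed; note the random forest figure is the OOB estimate on the full data):

results <- data.frame(
  model    = c("Logistic regression", "Naive Bayes", "Decision tree", "Random forest"),
  accuracy = c(0.8158, 0.7895, 0.7632, 0.835)
)
results[order(-results$accuracy), ]  # sort from most to least accurate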

