Original link: http://tecdat.cn/?p=2652

Logistic regression is a method for fitting a regression curve, y = f(x), when y is a categorical variable. A typical use of this model is predicting y given a set of predictors x. The predictors can be continuous, categorical, or a mix of both.

Logistic regression implementation in R

Logistic regression models can be fitted easily in R. The function to call is glm(), and the fitting process is not very different from the one used in linear regression. In this article, I will fit a binary logistic regression model and explain each step.

The data set

We’re going to look at the Titanic data set.

The goal is to predict survival (1 if the passenger survives, 0 otherwise) based on certain characteristics such as class, sex, age, and so on. We will use categorical variables and continuous variables.

Data cleansing process

When dealing with a real dataset, we need to take into account the possibility of missing data, so we need to prepare the dataset for our analysis. As a first step, we load the CSV data using the read.csv() function, making sure that each empty string is encoded as a missing value (NA).

training.data.raw <- read.csv('train.csv', header = T, na.strings = c(""))

Now we need to check for missing values and see how many unique values each variable has, using the sapply() function to apply a function to each column of the data frame.
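The call below is a reconstruction (only a fragment of the original code survived); it applies sum(is.na(x)) to each column and yields the counts that follow:

# Count the missing values in each column of the data frame
sapply(training.data.raw, function(x) sum(is.na(x)))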

PassengerId    Survived      Pclass        Name         Sex 
          0           0           0           0           0 
        Age       SibSp       Parch      Ticket        Fare 
        177           0           0           0           0 
      Cabin    Embarked 
        687           2 

A similar call returns the number of unique values in each column:

sapply(training.data.raw, function(x) length(unique(x)))

PassengerId    Survived      Pclass        Name         Sex 
        891           2           3         891           2 
        Age       SibSp       Parch      Ticket        Fare 
         89           7           7         681         248 
      Cabin    Embarked 
        148           4

Visualizing missing values: we can plot the dataset and highlight the missing values:
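A minimal sketch of such a plot, assuming the Amelia package is available (its missmap() function draws a map of missing versus observed values; the original figure is not reproduced here):

library(Amelia)
# Draw a missingness map: one column per variable, missing cells highlighted
missmap(training.data.raw, main = "Missing values vs observed")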

The variable Cabin has too many missing values, so we will not use it.

Using the subset() function we subset the original data set, selecting only the relevant columns.

data <- subset(training.data.raw, select = c(2,3,5,6,7,8,10,12))

Now we need to account for the other missing values. R can handle them when fitting a generalized linear model by setting a parameter inside the fitting function. There are different ways to do this; a typical approach is to replace the missing values with the mean, the median, or the mode of the existing values. I will use the mean.

data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = T)

When it comes to categorical variables, using read.table() or read.csv() encodes them as factors by default.

To get a better understanding of how R handles categorical variables, we can use the contrasts() function.
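For example (a sketch, using the cleaned data frame from above), we can inspect the dummy coding of the Sex factor:

# Show the dummy coding R uses for the Sex factor (female is the reference level)
contrasts(data$Sex)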

Before the fitting process, the data is cleaned and formatted. This pre-processing step is often critical to obtain a good fit and better predictive power of the model.

The model fitting

We divide the data into two parts: a training set and a test set. The training set will be used to fit our model, which we will then evaluate on the test set.
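A minimal sketch of this step. The exact split is an assumption: the few rows that still contain missing values are dropped, and the first 800 rows are used for training, which is consistent with the 799 residual degrees of freedom in the deviance table further below.

# Drop the few rows that still contain missing values, then split
data <- na.omit(data)
train <- data[1:800, ]
test <- data[801:nrow(data), ]

# Fit a binomial GLM with a logit link, i.e. a logistic regression
model <- glm(Survived ~ ., family = binomial(link = 'logit'), data = train)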

By using the function summary() we get the results of our model:
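Assuming the fitted object from the sketch above is named model:

summary(model)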

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6064  -0.5954  -0.4254   0.6220   2.4165  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  5.137627   0.594998   8.635  < 2e-16 ***
Pclass      -1.087156   0.151168  -7.192 6.40e-13 ***
Sexmale     -2.756819   0.212026 -13.002  < 2e-16 ***
Age         -0.037267   0.008195  -4.547 5.43e-06 ***
SibSp       -0.292920   0.114642  -2.555   0.0106 *  
Parch       -0.116576   0.128127  -0.910   0.3629    
Fare         0.001528   0.002353   0.649   0.5160    
EmbarkedQ   -0.002656   0.400882  -0.007   0.9947    
EmbarkedS   -0.318786   0.252960  -1.260   0.2076    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpreting the results of our logistic regression model

Now we can analyze the fit and interpret what the model is telling us.

First of all, we can see that Parch, Fare and Embarked are not statistically significant. As for the statistically significant variables, Sex has the lowest p-value, suggesting a strong association between the sex of the passenger and the probability of survival. The negative coefficient for this predictor suggests that, all other variables being equal, a male passenger is less likely to survive. Since male is the dummy level of the Sex factor, being male reduces the log odds by 2.75, while a one-unit increase in age reduces the log odds by 0.037.

Now we can run the anova() function on the model to analyze the table of deviance.
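Assuming the same model object as above, the call below requests chi-squared tests and produces the table that follows:

anova(model, test = 'Chisq')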

Analysis of Deviance Table

Model: binomial, link: logit

Response: Survived

Terms added sequentially (first to last)

         Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                       799    1065.39              
Pclass    1   83.607       798     981.79 < 2.2e-16 ***
Sex       1  240.014       797     741.77 < 2.2e-16 ***
Age       1   17.495       796     724.28 2.881e-05 ***
SibSp     1   10.842       795     713.43  0.000992 ***
Parch     1    0.863       794     712.57  0.352873    
Fare      1    0.994       793     711.58  0.318717    
Embarked  2    2.187       791     709.39  0.334990    

The bigger the difference between the null deviance and the residual deviance, the better. Reading the table, we can see the drop in deviance as each variable is added one at a time. Adding Pclass, Sex and Age significantly reduces the residual deviance. A large p-value here indicates that the model without the variable explains more or less the same amount of variation. Ultimately, what you would like to see is a significant drop in deviance and in the AIC.

Assess the predictive power of the model

In the steps above, we briefly evaluated the fit of the model; now we would like to see how the model does when predicting y on new data. By setting the parameter type = 'response', R will output probabilities of the form P(y = 1 | X). Our decision boundary will be 0.5: if P(y = 1 | X) > 0.5 then y = 1, otherwise y = 0. Note that for some application scenarios a different threshold may be a better choice.
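A sketch of the prediction step, assuming the model and test objects from the earlier sketch:

# Predicted probabilities of survival for the test set
fitted.results <- predict(model, newdata = test, type = 'response')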

fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
misclass.error <- mean(fitted.results != test$Survived)

An accuracy of 0.84 on the test set is a pretty good result. However, this score depends somewhat on how the data was split; if you want a more reliable estimate, it is better to run cross-validation, such as K-fold cross-validation.
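A minimal sketch of K-fold cross-validation in base R (the seed, the number of folds, and the fold assignment are assumptions for illustration):

# 10-fold cross-validation of the logistic regression accuracy
set.seed(123)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(data)))
accuracies <- sapply(1:k, function(i) {
  # Fit on all folds except fold i, then score fold i
  fit <- glm(Survived ~ ., family = binomial(link = 'logit'), data = data[folds != i, ])
  probs <- predict(fit, newdata = data[folds == i, ], type = 'response')
  mean(ifelse(probs > 0.5, 1, 0) == data$Survived[folds == i])
})
mean(accuracies)  # average accuracy over the 10 folds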

As a final step, we will plot the ROC curve and compute the AUC (area under the curve), two typical performance measures for a binary classifier.

The ROC curve is generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings, while the AUC is the area under the ROC curve. As a rule of thumb, a model with good predictive power should have an AUC much closer to 1 than to 0.5.
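A sketch of this step using the ROCR package (an assumption; any ROC utility would do), with the model and test objects from the earlier sketches:

library(ROCR)
# Score the test set and build a ROCR prediction object
p <- predict(model, newdata = test, type = 'response')
pr <- prediction(p, test$Survived)
# Plot TPR against FPR, i.e. the ROC curve
prf <- performance(pr, measure = 'tpr', x.measure = 'fpr')
plot(prf)
# Compute the area under the curve
auc <- performance(pr, measure = 'auc')
auc@y.values[[1]]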

