Original link: http://tecdat.cn/?p=22966

Logistic regression is a method of fitting a regression curve, y = f(x), when y is a categorical variable. A typical use of this model is to predict y given a set of predictors x, which may be continuous, categorical, or a mix of both.

In general, the categorical variable y can take several different values. In the simplest case, y is binary, meaning it takes the value 1 or 0. A classic example used in machine learning is email classification: given a set of attributes for each email, such as the number of words, links, and images, the algorithm should decide whether the email is spam (1) or not (0).

In this article, we refer to this model as “binomial logistic regression” because the variable to be predicted is binary; however, logistic regression can also be used to predict a dependent variable that can take more than two values. In this second case, we call the model “multinomial logistic regression”. A typical example is the classification of films as “comedy”, “documentary”, or “drama”.

Implementing logistic regression in R

R makes fitting a logistic regression model very easy. The function to call is glm(), and the fitting procedure is not much different from the one used for linear regression. In this article, I will fit a binary logistic regression model and explain each step.

The data set

We will work with the Titanic data set. Different versions of this data set are available online, but I recommend the version provided by Kaggle, as it is almost ready to use (you need to register with Kaggle to download it). The training data set describes a subset of the passengers (889 after the cleaning steps below; the raw Kaggle file has 891 rows), and the goal of the competition is to predict survival (1 if the passenger survived, 0 if not) from characteristics such as passenger class, sex, and age. As you can see, we will use both categorical and continuous variables.

Data cleansing process

When working with real datasets, we need to take into account that some values may be missing, so we need to prepare the dataset for our analysis. As a first step, we use the read.csv() function to load the CSV data. Make sure that the parameter na.strings is set to c(""), so that each empty field is encoded as NA.
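
As a minimal sketch of this loading step, the snippet below writes a tiny CSV to a temporary file and reads it back with na.strings = c(""). The three rows are made up for illustration and only stand in for Kaggle's file; the name training.data.raw is likewise an assumption of this sketch.

```r
# Toy CSV standing in for Kaggle's train.csv (rows are made up for illustration)
csv_lines <- c(
  "PassengerId,Survived,Sex,Age,Cabin",
  "1,0,male,22,",
  "2,1,female,,C85",
  "3,1,female,26,"
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# na.strings = c("") turns empty fields into NA, including in text columns
training.data.raw <- read.csv(tmp, header = TRUE, na.strings = c(""))

print(training.data.raw)
print(is.na(training.data.raw$Cabin))
```

Without na.strings = c(""), the empty Cabin fields would be read as empty strings rather than NA, and later is.na() checks would miss them.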

Loading and preprocessing data

Now we need to check for missing values and see how many unique values each variable has, using the sapply() function, which applies the function passed as an argument to each column of the data frame (called training.data.raw here; the name is illustrative).

sapply(training.data.raw, function(x) sum(is.na(x)))  # missing values per column

sapply(training.data.raw, function(x) length(unique(x)))  # unique values per column

Plot the data set to highlight the missing values.


Handling missing values

The variable Cabin has too many missing values, so we will not use it. We also exclude PassengerId because it is just an index. The subset() function is used to subset the original data set and select only the relevant columns.

Now we need to deal with the other missing values. When fitting generalized linear models, R can handle them through a parameter of the fitting function (the na.action argument of glm()).

However, I personally prefer to replace missing values “manually”. There are different ways to do this; a typical one is to replace the missing value with the mean, median, or mode of the observed values. I use the mean.
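
The column selection described above could look like the following sketch. The toy data frame only borrows the Kaggle column names; its values are made up, and the names training.data.raw and data are assumptions of this example.

```r
# Toy data frame with Kaggle-style column names (values are made up)
training.data.raw <- data.frame(
  PassengerId = 1:3,
  Survived = c(0, 1, 1),
  Pclass = c(3, 1, 3),
  Sex = c("male", "female", "female"),
  Age = c(22, NA, 26),
  Cabin = c(NA, "C85", NA)
)

# Keep only the columns used for modelling; Cabin and PassengerId are dropped
data <- subset(training.data.raw, select = c(Survived, Pclass, Sex, Age))
print(names(data))
```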

data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)  # replace missing ages with the mean (data: the subsetted data frame)

As for categorical variables, in R versions before 4.0.0 read.table() and read.csv() encode character columns as factors by default; from R 4.0.0 onward, pass stringsAsFactors = TRUE to get the same behavior. A factor is how R represents a categorical variable. We can check the encoding with is.factor().
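
To make the mean imputation concrete, here is a small self-contained sketch on made-up ages. The data frame name data and the values are assumptions for illustration only.

```r
# Toy ages with missing entries (values are made up)
data <- data.frame(Age = c(22, NA, 26, 40, NA))

# Mean of the observed ages: (22 + 26 + 40) / 3
age_mean <- mean(data$Age, na.rm = TRUE)

# Replace each NA with that mean
data$Age[is.na(data$Age)] <- age_mean
print(data$Age)
```

Swapping mean() for median() gives median imputation with the same indexing pattern.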


To get a better idea of how R handles categorical variables, we can use the contrasts() function. This function shows us how the variables are dummy-coded and how they should be interpreted in the model.
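
A minimal sketch of these checks on a made-up column follows. The factor is built explicitly with factor(), so the example does not depend on how read.csv() encodes text in a given R version.

```r
# Toy categorical column (values are made up)
sex <- factor(c("male", "female", "female", "male"))

# is.factor() confirms the encoding; levels are sorted alphabetically by default
print(is.factor(sex))
print(levels(sex))

# contrasts() shows the dummy coding: the row of zeros is the reference level
print(contrasts(sex))
```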


For example, you can see that for the variable Sex, female is used as the reference level. As for the missing values in Embarked, since there are only two, we will drop those two rows (we could also replace the missing values and keep the data points).


Before fitting, it is important to clean and format the data. This preprocessing step is essential for obtaining a good model fit and better predictive power.

Model fitting

We divide the data into two parts: the training set and the test set. The training set will be used to fit our model and we will test on the test set.
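
One simple way to do this is a positional split, sketched below on simulated data. The 80/20 proportion, the toy data frame, and the names train and test are assumptions of this example, not values taken from the article.

```r
# Toy data frame standing in for the cleaned data (values are simulated)
set.seed(1)
data <- data.frame(
  Survived = rbinom(100, 1, 0.4),
  Age = round(runif(100, 1, 70))
)

# Positional split: first 80 rows for training, the rest for testing
# (the 80/20 proportion is an assumption of this sketch)
train <- data[1:80, ]
test  <- data[81:nrow(data), ]

print(c(train = nrow(train), test = nrow(test)))
```

A positional split is only safe when the rows are in no meaningful order; otherwise, shuffle the rows with sample() first.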

Now let's fit the model. Be sure to specify the parameter family = binomial in the glm() function.

model <- glm(Survived ~ ., family = binomial(link = 'logit'), data = train)  # train: the training set

By using the function summary(), we get the results of our model.

Interpreting the results of our logistic regression model
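
The following sketch runs the same call end to end on simulated data, since the Kaggle file is not bundled here. The coefficients used to generate Survived are assumptions chosen only so that the fit has something to recover; they are not Titanic estimates.

```r
# Simulated stand-in for the training set; the generating coefficients
# are assumptions of this sketch, not Titanic estimates
set.seed(42)
n <- 500
train <- data.frame(
  Sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age = round(runif(n, 1, 70))
)
p <- plogis(1.5 - 2.5 * (train$Sex == "male") - 0.03 * train$Age)
train$Survived <- rbinom(n, 1, p)

# Fit a binomial GLM with a logit link on all remaining columns
model <- glm(Survived ~ ., family = binomial(link = "logit"), data = train)
print(summary(model))
```

The coefficient table printed by summary() reports estimates on the log-odds scale, with one row per dummy-coded level (here Sexmale) and per numeric predictor.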

First, we can see that neither SibSp nor Fare is statistically significant. Among the statistically significant variables, Sex has the lowest p-value, indicating a strong association between a passenger's sex and the probability of survival. The negative coefficient on this predictor suggests that, all other variables being equal, male passengers were less likely to survive. Remember that in the logit model, the response variable is the log odds: ln(odds) = ln(p/(1-p)) = a*x1 + b*x2 + ... + z*xn.

Since male is a dummy variable, being male reduces the log odds by 2.75, while a one-unit increase in age reduces the log odds by 0.037.

Now we can analyze the deviance table for the model.

The difference between the null deviance and the residual deviance shows how our model compares to the null model (the intercept-only model). The bigger the gap, the better. Reading the table, we can see the drop in residual deviance as each variable is added one at a time. Again, adding Pclass, Sex, and Age significantly reduces the residual deviance. The other variables seem to improve the model less, even though SibSp has a low p-value. A large p-value here indicates that the model without that variable explains more or less the same amount of variation. Ultimately, what we want to see is a significant drop in residual deviance and AIC.
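
Log-odds changes are easier to read as odds ratios. The short calculation below exponentiates the two coefficients quoted above; the variable names are illustrative.

```r
# Coefficients quoted in the text (log-odds scale)
coef_male <- -2.75
coef_age  <- -0.037

# Exponentiating a log-odds coefficient gives an odds ratio
or_male <- exp(coef_male)  # about 0.064: male odds are roughly 6% of female odds
or_age  <- exp(coef_age)   # about 0.964: each extra year multiplies the odds by ~0.96

print(round(c(male = or_male, age = or_age), 3))
```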
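
A deviance table of this kind can be produced with anova() on a fitted glm object. The sketch below uses simulated data with an extra column, Noise, that is unrelated to the outcome, to show what an uninformative term looks like; the data and coefficients are assumptions of this example.

```r
# Simulated data; Noise is generated independently of the outcome
set.seed(42)
n <- 500
train <- data.frame(
  Sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age = round(runif(n, 1, 70)),
  Noise = rnorm(n)
)
p <- plogis(1.5 - 2.5 * (train$Sex == "male") - 0.03 * train$Age)
train$Survived <- rbinom(n, 1, p)

model <- glm(Survived ~ Sex + Age + Noise, family = binomial, data = train)

# Sequential analysis of deviance: each row shows the drop in residual
# deviance when that term is added to the terms above it
dev_table <- anova(model, test = "Chisq")
print(dev_table)
```

Because the table is sequential, the order of the terms in the formula affects the deviance attributed to each one.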

Although there is no exact equivalent of the R² of linear regression, the McFadden R² index can be used to assess the goodness of fit of the model.
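
McFadden's index is one minus the ratio of the fitted model's log-likelihood to that of the intercept-only model. The sketch below computes it by hand with base R on simulated data; the data, coefficients, and variable names are assumptions of this example.

```r
# Simulated data (generating coefficients are assumptions of this sketch)
set.seed(42)
n <- 500
train <- data.frame(
  Sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age = round(runif(n, 1, 70))
)
p <- plogis(1.5 - 2.5 * (train$Sex == "male") - 0.03 * train$Age)
train$Survived <- rbinom(n, 1, p)

model      <- glm(Survived ~ ., family = binomial, data = train)
null_model <- glm(Survived ~ 1, family = binomial, data = train)

# McFadden R^2 = 1 - logLik(fitted model) / logLik(intercept-only model)
mcfadden <- 1 - as.numeric(logLik(model)) / as.numeric(logLik(null_model))
print(mcfadden)
```

For a 0/1 response the same number can be read off the fitted object as 1 - model$deviance / model$null.deviance.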

Assess the predictive power of the model

In the steps above, we briefly evaluated the fit of the model; now we want to see how the model performs when predicting y on a new data set. By setting the parameter type = 'response', R will output probabilities in the form P(y = 1 | X). Our decision boundary will be 0.5: if P(y = 1 | X) > 0.5 then y = 1, otherwise y = 0.

error <- mean(fitted.results != test$Survived)  # fitted.results: 0/1 predictions on the test set
print(paste('Accuracy', 1 - error))

An accuracy of 0.84 on the test set is a pretty good result. However, keep in mind that this result depends to some extent on my earlier manual split of the data, so if you want a more reliable estimate, it is best to run some kind of cross-validation, such as k-fold cross-validation.
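
Putting the prediction steps together, a self-contained sketch on simulated data might look like this. The data, the 400/100 split, and the names fitted.results, train, and test are assumptions of this example; the accuracy it prints says nothing about the Titanic data.

```r
# Simulated data (generating coefficients are assumptions of this sketch)
set.seed(42)
n <- 500
data <- data.frame(
  Sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age = round(runif(n, 1, 70))
)
p <- plogis(1.5 - 2.5 * (data$Sex == "male") - 0.03 * data$Age)
data$Survived <- rbinom(n, 1, p)

train <- data[1:400, ]
test  <- data[401:n, ]
model <- glm(Survived ~ ., family = binomial(link = "logit"), data = train)

# Predicted probabilities P(y = 1 | X) for the held-out rows
fitted.results <- predict(model, newdata = test, type = "response")

# Apply the 0.5 decision boundary
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)

# Misclassification rate and accuracy on the test set
error <- mean(fitted.results != test$Survived)
print(paste("Accuracy", 1 - error))
```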
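
A k-fold cross-validation of the same accuracy measure can be sketched in base R as follows. The choice of k = 5, the simulated data, and all variable names are assumptions of this example.

```r
# Simulated data (generating coefficients are assumptions of this sketch)
set.seed(42)
n <- 500
data <- data.frame(
  Sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age = round(runif(n, 1, 70))
)
p <- plogis(1.5 - 2.5 * (data$Sex == "male") - 0.03 * data$Age)
data$Survived <- rbinom(n, 1, p)

k <- 5
# Randomly assign each row to one of k folds of equal size
folds <- sample(rep(1:k, length.out = n))

acc <- numeric(k)
for (i in 1:k) {
  train <- data[folds != i, ]   # fit on k - 1 folds
  test  <- data[folds == i, ]   # evaluate on the held-out fold
  fit   <- glm(Survived ~ ., family = binomial(link = "logit"), data = train)
  pred  <- ifelse(predict(fit, newdata = test, type = "response") > 0.5, 1, 0)
  acc[i] <- mean(pred == test$Survived)
}

print(acc)        # accuracy per fold
print(mean(acc))  # cross-validated accuracy
```

Averaging over folds makes the estimate less dependent on any single split than the one-off manual split used above.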

As a final step, we will plot the ROC curve and calculate the AUC (area under the curve), which is a typical performance measure for binary classifiers.

The ROC is the curve generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at different threshold settings, while the AUC is the area under the ROC curve. As a rule of thumb, a model with good predictive power should have an AUC closer to 1 than to 0.5 (1 is ideal).

prf <- performance(pr, measure = "tpr", x.measure = "fpr")  # pr: a ROCR prediction object, e.g. prediction(p, test$Survived)
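
For readers who want to see what the curve and the area are made of, the sketch below computes TPR, FPR, and AUC by hand in base R instead of using a plotting package. It is an illustration on simulated data, not the article's own code; the data, the split, and all names are assumptions.

```r
# Simulated data (generating coefficients are assumptions of this sketch)
set.seed(42)
n <- 500
data <- data.frame(
  Sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age = round(runif(n, 1, 70))
)
prob <- plogis(1.5 - 2.5 * (data$Sex == "male") - 0.03 * data$Age)
data$Survived <- rbinom(n, 1, prob)

train <- data[1:400, ]
test  <- data[401:n, ]
model <- glm(Survived ~ ., family = binomial(link = "logit"), data = train)

p <- predict(model, newdata = test, type = "response")
y <- test$Survived

# Sweep thresholds from high to low and record TPR / FPR at each one
thresholds <- c(Inf, sort(unique(p), decreasing = TRUE))
tpr <- sapply(thresholds, function(t) sum(p >= t & y == 1) / sum(y == 1))
fpr <- sapply(thresholds, function(t) sum(p >= t & y == 0) / sum(y == 0))

# AUC by the trapezoidal rule over the ROC points
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # diagonal: a classifier no better than chance
print(auc)
```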

