Original link: tecdat.cn/?p=22966

Original source: Tuoduan Data Tribe WeChat official account

Logistic regression is a method for fitting a regression curve, y = f(x), when y is a categorical variable. A typical use of this model is to predict y given a set of predictors x, which can be continuous, categorical, or a mix of both.

In general, the categorical variable y can take different values. In the simplest case, y is binary, meaning it can take the value 1 or 0. A classic example used in machine learning is email classification: given a set of attributes for each email, such as word counts, links, and pictures, the algorithm should decide whether the email is spam (1) or not (0).

In this article, we refer to this model as “binomial logistic regression” because the variable to be predicted is binary. However, logistic regression can also be used to predict a dependent variable that can take more than two values. In this second case, we call the model “multinomial logistic regression”. A typical example would be classifying movies as “funny”, “documentary”, or “drama”, etc.

Logistic regression in R

R makes fitting a logistic regression model very easy. The function to call is glm(), and the fitting process is not much different from the one used in linear regression. In this article, I will fit a binary logistic regression model and explain each step.

The data set

We’ll be working with the Titanic dataset. There are different versions of this dataset available online, but I recommend the one provided by Kaggle, because it’s almost ready to use (to download it, you need to register with Kaggle). The (training) dataset contains a number of passengers (889, to be exact), and the goal of the competition is to predict survival (1 if the passenger survived, 0 if not) based on characteristics such as passenger class, gender, age, and so on. As you can see, we will use both categorical and continuous variables.

Data cleansing process

When working with real data sets, we need to consider the possibility that some data could be lost, so we need to prepare the data sets for our analysis. As a first step, we load the CSV data using the read.csv() function. Make sure that the parameter na.strings is equal to c(“”), so that each missing value is encoded as NA.

Load and preprocess data
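As a minimal sketch of this step (assuming the Kaggle training file is saved as train.csv in the working directory):

# Read the Kaggle training set; empty strings become NA
training <- read.csv("train.csv", header = TRUE, na.strings = c(""))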

Now we need to check for missing values and see how many unique values each variable has, using the sapply() function, which applies the function passed as an argument to each column of the data frame.

sapply(training, function(x) sum(is.na(x)))

sapply(training, function(x) length(unique(x)))

Plot the dataset and highlight the missing values.

library(Amelia)  # missmap() draws a missingness map
missmap(training)

Handling missing values

The variable Cabin has too many missing values to be usable. We also exclude PassengerId, because it is just an index. The subset() function is used to subset the original dataset, selecting only the relevant columns.
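A sketch of this step, assuming the standard Kaggle column order (so that Survived, Pclass, Sex, Age, SibSp, Parch, Fare and Embarked are kept):

# Keep only the relevant columns; drop PassengerId, Name, Ticket and Cabin
data <- subset(training, select = c(2, 3, 5, 6, 7, 8, 10, 12))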

Now we need to deal with the other missing values. When fitting generalized linear models, R can handle them by setting a parameter (na.action) in the fitting function.
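For instance, a hedged one-liner (na.action is a standard argument of glm(); na.omit simply drops incomplete rows, and fit_na is a hypothetical name):

fit_na <- glm(Survived ~ ., family = binomial(link = "logit"), data = data, na.action = na.omit)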

However, I personally prefer to replace missing values “manually”. There are different ways to do this; a typical approach is to replace a missing value with the mean, the median, or the mode of the existing values. I use the mean.

data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)

As for the categorical variables, read.table() and read.csv() encode them as factors by default. Factors are the way R handles categorical variables. We can check the encoding with the following lines of code.
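A minimal check, assuming the subset above is stored in data:

is.factor(data$Sex)
is.factor(data$Embarked)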

 

To get a better idea of how R handles categorical variables, we can use the contrasts() function. It shows how the variables are dummy-coded and how to interpret them in the model.
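Along the same lines, for the two factors above:

contrasts(data$Sex)
contrasts(data$Embarked)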

 

For example, you can see that in the Sex variable, female is used as the reference level. As for the missing values in Embarked, since there are only two, we simply discard those two rows (we could also have replaced the missing values and kept the data points).

data <- data[!is.na(data$Embarked), ]

It is important to clean and format the data before fitting the model. This preprocessing step matters for obtaining a good fit and better predictive ability.

The model fitting

We split the data into two parts: a training set and a test set. The training set will be used to fit the model, which we will then evaluate on the test set.
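One simple way to do this (an assumption consistent with the 889 rows mentioned above: the first 800 rows for training, the remaining 89 for testing):

train <- data[1:800, ]
test <- data[801:889, ]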

# Now, let's fit the model. Be sure to specify family = binomial in glm().
model <- glm(Survived ~ ., family = binomial(link = "logit"), data = train)
# We get the results of our model by using the function summary().
summary(model)

Interpreting the results of our logistic regression model

First, we can see that SibSp and Fare are not statistically significant. As for the statistically significant variables, Sex has the lowest p-value, indicating that the passenger’s sex is strongly associated with the probability of survival. The negative coefficient of this predictor indicates that, all other variables being equal, male passengers are less likely to survive. Remember that in the logit model the response variable is the log odds: ln(odds) = ln(p/(1 - p)) = a*x1 + b*x2 + ... + z*xn.

Since male is a dummy variable, being male reduces the log odds by 2.75, while a one-unit increase in age reduces the log odds by 0.037.

Now we can run the anova() function on the model to analyze the table of deviance.
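A sketch of the call (anova() with test = "Chisq" on a fitted glm produces the sequential deviance table discussed below):

anova(model, test = "Chisq")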

The difference between the null deviance and the residual deviance shows how our model compares to the null model (an intercept-only model). The wider this gap, the better. Analyzing the table, we can see the drop in residual deviance as each variable is added one at a time. Again, adding Pclass, Sex and Age significantly reduces the residual deviance. Despite SibSp having a low p-value, the other variables seem to improve the model less. A large p-value here indicates that the model without the variable explains more or less the same amount of variation. Ultimately, what you would like to see is a significant drop in deviance and in the AIC.

Although there is no exact equivalent of the R² of linear regression, the McFadden R² index can be used to assess the fit of the model.
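One common way to obtain it is the pR2() function from the pscl package (assuming the package is installed):

library(pscl)
pR2(model)  # reports McFadden's pseudo-R^2, among others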

Evaluate the predictive power of the model

In the previous steps, we briefly evaluated how well the model fits; now we want to see how the model performs when predicting y on a new set of data. By setting the parameter type='response', R will output probabilities in the form P(y=1|X). Our decision boundary will be 0.5: if P(y=1|X) > 0.5, then y = 1, otherwise y = 0.
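A minimal sketch of this step on the held-out test set:

# Predicted probabilities, then thresholding at 0.5
fitted <- predict(model, newdata = test, type = "response")
fitted <- ifelse(fitted > 0.5, 1, 0)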

error <- mean(fitted != test$Survived)
print(paste("Accuracy", 1 - error))

An accuracy of 0.84 on the test set is quite a good result. However, keep in mind that this result depends in part on the manual split of the data that I made earlier, so if you want a more precise score, you would be better off running some kind of cross-validation, such as k-fold cross-validation.
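As one possible sketch (not from the original post), cv.glm() from the boot package can estimate a 10-fold cross-validated misclassification rate:

library(boot)
# Refit on the full cleaned data, then estimate the 10-fold CV error
model_all <- glm(Survived ~ ., family = binomial(link = "logit"), data = data)
cost <- function(y, pi) mean(abs(y - pi) > 0.5)  # misclassification rate
cv.glm(data, model_all, cost, K = 10)$delta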

As a final step, we will plot the ROC curve and calculate the AUC (area under the curve), a typical performance measure for binary classifiers.

The ROC is the curve generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings, while the AUC is the area under the ROC curve. As a rule of thumb, a model with good predictive ability should have an AUC closer to 1 (1 is ideal) than to 0.5.

library(ROCR)  # prediction() and performance() come from the ROCR package
p <- predict(model, newdata = test, type = "response")
pr <- prediction(p, test$Survived)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)


auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc


Most popular insights

1. Application case of multinomial logistic regression in R

2. Implementation of a panel smooth transition regression (PSTR) analysis case

3. Partial least squares regression (PLSR) and principal component regression (PCR) in MATLAB

4. Analysis cases of the Poisson regression model in R

5. Hosmer-Lemeshow goodness-of-fit test for regression in R

6. Implementation of LASSO, Ridge and Elastic Net regression models in R

7. Implementing logistic regression in R

8. Predicting stock prices with linear regression in Python

9. How to calculate the IDI and NRI indices in survival analysis and Cox regression in R