Original link: tecdat.cn/?p=3863

Original source: Open end data tribes

 

 

For a new user, an enterprise can use big data to analyze the user's information, determine whether the user is likely to become a paying user, and clarify the user's attributes, so that marketing can be targeted and the efficiency of operations staff improved.

In paying-customer forecasting, the task is to think about which factors drive revenue, make a prediction for each factor, and combine them into a payment forecast. It is not really a financial question but a business one.

Churn (loss) prediction. Here the focus tends to be on high-paying users: we extract their feature vectors and apply them to churn-prediction scenarios.

 

Methods

Regression is an easy-to-understand model, equivalent to y = f(x), expressing the relationship between the independent variables X and the dependent variable Y. A common analogy is a doctor treating a patient: looking, listening, asking, and examining yield the independent variables X, i.e. the feature data, while the diagnosis of whether the patient is ill, and with what disease, corresponds to the dependent variable Y, i.e. the predicted class.

 

Problem description

 

We try logistic regression to predict whether a user's payments will exceed 50K, based on the demographic variables available in the data.

 

In this process, we will:

1. Import the data
2. Check for class bias
3. Build the logit model and predict on the test data
4. Model diagnostics
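The steps above can be sketched end to end in base R. The data frame and columns below are simulated stand-ins for the article's census data (whose file path is not given), so only the workflow is illustrative:

```r
set.seed(100)
n <- 1000

# Simulated stand-in for the article's census data
inputData <- data.frame(
  AGE   = sample(18:70, n, replace = TRUE),
  HOURS = sample(20:60, n, replace = TRUE)
)
# Outcome loosely driven by age and hours worked
inputData$ABOVE50K <- rbinom(n, 1,
  plogis(-8 + 0.08 * inputData$AGE + 0.08 * inputData$HOURS))

# 2. Check class bias
table(inputData$ABOVE50K)

# 3. Split the data and fit a logit model on the training part
idx       <- sample(seq_len(n), 0.7 * n)
trainData <- inputData[idx, ]
testData  <- inputData[-idx, ]
logitMod  <- glm(ABOVE50K ~ AGE + HOURS, data = trainData,
                 family = binomial(link = "logit"))

# 4. Predict probabilities on the test data
predicted <- predict(logitMod, testData, type = "response")

# 5. Diagnose with a simple 0.5 cutoff: misclassification rate
mean(ifelse(predicted > 0.5, 1, 0) != testData$ABOVE50K)
```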

 

 

Check for class bias

 

Ideally, the proportions of events and non-events in the Y variable should be roughly equal. So we first check the class proportions in the dependent variable ABOVE50K.

 

     0     1 
 24720  7841

Clearly, the class proportions are skewed. So we should sample the observations in roughly equal proportions to get a better model.
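A common way to do this is to down-sample the majority class so the training data holds equal numbers of 1s and 0s. A sketch, using a simulated skewed data frame in place of the article's data:

```r
set.seed(100)
# Simulated stand-in: a skewed 0/1 outcome like ABOVE50K
inputData <- data.frame(AGE = sample(18:70, 2000, replace = TRUE),
                        ABOVE50K = rbinom(2000, 1, 0.24))

input_ones  <- inputData[inputData$ABOVE50K == 1, ]
input_zeros <- inputData[inputData$ABOVE50K == 0, ]

# Take 70% of the minority class for training, and the same
# number of rows from the majority class
n1        <- floor(0.7 * nrow(input_ones))
ones_idx  <- sample(seq_len(nrow(input_ones)), n1)
zeros_idx <- sample(seq_len(nrow(input_zeros)), n1)

trainData <- rbind(input_ones[ones_idx, ], input_zeros[zeros_idx, ])
# Everything not sampled into training becomes test data
testData  <- rbind(input_ones[-ones_idx, ], input_zeros[-zeros_idx, ])

table(trainData$ABOVE50K)  # equal counts of 0s and 1s
```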

 

 

Build Logit models and forecasts

 

Determine the optimal prediction-probability cutoff for the model. The default cutoff is 0.5, or the ratio of 1s to 0s in the training data. Sometimes, however, tuning the cutoff improves accuracy on both the development and validation samples. The InformationValue::optimalCutoff function finds the cutoff that, by default, minimizes the misclassification error.

optCutOff <- optimalCutoff(testData$ABOVE50K, predicted)[1]
#> [1] 0.71
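The default criterion of optimalCutoff (minimizing misclassification error) can be approximated by hand with a simple grid search over cutoffs. The vectors below are simulated stand-ins for the article's testData$ABOVE50K and predicted:

```r
set.seed(3)
actuals   <- rbinom(200, 1, 0.5)                        # stand-in labels
predicted <- plogis(rnorm(200, mean = 2 * actuals - 1)) # stand-in scores

# Try a grid of cutoffs; keep the one with the lowest misclassification error
cutoffs <- seq(0.01, 0.99, by = 0.01)
errs    <- sapply(cutoffs, function(co)
  mean(ifelse(predicted > co, 1, 0) != actuals))
cutoffs[which.min(errs)]
```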

 

Model diagnosis

  

Classification error

The misclassification error is the percentage of predictions that mismatch the actual values. The lower the misclassification error, the better the model.

misClassError(testData$ABOVE50K, predicted, threshold = optCutOff)
#> [1] 0.0892
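misClassError comes from the InformationValue package; the quantity itself is just the share of cutoff-thresholded predictions that disagree with the actual labels, sketched here with stand-in vectors:

```r
set.seed(1)
actuals   <- rbinom(50, 1, 0.3)   # stand-in for testData$ABOVE50K
predicted <- runif(50)            # stand-in for the predicted probabilities
optCutOff <- 0.71                 # the cutoff found above

# Threshold the probabilities, then count disagreements
misclass <- mean(ifelse(predicted > optCutOff, 1, 0) != actuals)
misclass
```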

 

 

 

The ROC curve

The ROC curve (receiver operating characteristic curve) is a composite index of sensitivity and specificity, used to reveal the relationship between the two. A series of sensitivity and specificity values is computed by setting different cutoffs on the continuous predictor, and a curve is then drawn with sensitivity on the ordinate and (1 - specificity) on the abscissa. The larger the area under the curve, the higher the diagnostic accuracy. The point on the ROC curve closest to the top-left corner of the plot is the cutoff with both high sensitivity and high specificity.

 

 

The area under the ROC curve for the above model is 89%, which is quite good.
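The AUC figure has a useful pairwise interpretation: it equals the probability that a randomly chosen 1 receives a higher predicted score than a randomly chosen 0, so it can be checked by hand:

```r
# AUC as the probability that a random positive outranks a random negative
auc_by_hand <- function(actuals, scores) {
  pos <- scores[actuals == 1]
  neg <- scores[actuals == 0]
  # Compare every (positive, negative) pair; ties count as half
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

auc_by_hand(c(0, 0, 1, 1), c(0.1, 0.2, 0.8, 0.9))  # perfectly separated: 1
```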

Concordance

Simply put, across all pairs of one 1 and one 0, concordance is the percentage of pairs in which the predicted score of the 1 is higher than that of the 0. The higher the concordance, the better the quality of the model.

 

 $Concordance
 [1] 0.8915107
 
 $Discordance
 [1] 0.1084893
 
 $Tied
 [1] -2.775558e-17
 
 $Pairs
 [1] 45252896

The 89.2% concordance of the above model indicates a good model.
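The concordance, discordance, and tied figures above (from InformationValue's Concordance) can be reproduced over all 1-0 pairs with a short function, shown here on toy vectors:

```r
concordance_stats <- function(actuals, scores) {
  pos <- scores[actuals == 1]
  neg <- scores[actuals == 0]
  d   <- outer(pos, neg, "-")   # score(1) minus score(0) for every pair
  list(Concordance = mean(d > 0),
       Discordance = mean(d < 0),
       Tied        = mean(d == 0),
       Pairs       = length(d))
}

# 2 ones x 3 zeros = 6 pairs; 5 of them are concordant
concordance_stats(c(1, 1, 0, 0, 0), c(0.9, 0.4, 0.3, 0.5, 0.1))
```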


Confusion matrix

In artificial intelligence, the confusion matrix is a visualization tool, used especially in supervised learning; in unsupervised learning it is usually called a matching matrix. In the output below, each row represents a predicted class and each column an actual class. The name comes from how easily the matrix shows whether classes are being confused (i.e., one class predicted as another).

confusionMatrix(testData$ABOVE50K, predicted, threshold = optCutOff)

       0    1
 0 18849 1543
 1   383  810
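The same cross-tabulation can be built with base R's table(), using stand-in vectors for the actual labels and predicted probabilities:

```r
set.seed(2)
actuals   <- rbinom(100, 1, 0.3)  # stand-in for testData$ABOVE50K
predicted <- runif(100)           # stand-in for the predicted probabilities
optCutOff <- 0.71

pred_class <- ifelse(predicted > optCutOff, 1, 0)
# Rows: predicted class; columns: actual class
table(predicted = pred_class, actual = actuals)
```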

 

Conclusion

Only model building and evaluation are covered here. From the model's conclusions we can confirm things already widely accepted: payment correlates with education level, intelligence, age, and gender. Combining the user-scale prediction model with user demographic information, we can roughly estimate a product's overall revenue and obtain a paying-user prediction model. If the income classification is replaced with churned versus active users, we obtain a churn prediction model.

