0 Related source code

1 Overview of regression analysis

1.1 Introduction to regression analysis

◆ Regression is similar to classification, except that regression predicts continuous values while classification predicts discrete ones

◆ Because of this, many models can be adapted to serve both regression and classification

◆ Therefore, models whose basic principles are the same or similar for regression and classification are not covered again here

1.2 Regression algorithm integrated in Spark

◆ Spark implements a rich set of regression algorithms, and many of the models can also be used for classification

  • See the official documentation for the list of regression algorithms

1.3 Differences and connections between regression and classification

2 Overview of linear regression algorithm

2.1 Introduction to linear regression

◆ In regression analysis, when the relationship between the independent variables and the dependent variable is linear or approximately linear, a linear model can be used for the fit

◆ Regression with a single independent variable is simple (unary) linear regression; the relationship between the independent variable and the dependent variable can be approximated by a straight line

◆ Likewise, regression with multiple independent variables is called multiple linear regression, which can be represented by a plane or a hyperplane

2.2 Preconditions for using linear regression

◆ There is a linear trend between the independent variables and the dependent variable, which can be checked with the correlation coefficient introduced earlier

◆ The observations of the dependent variable are mutually independent, with no correlation between them

2.3 Examples of linear regression

◆ For example: exploring the relationship between boiling point and air pressure, studying the relationship between buoyancy and surface area, or exploring the relationship between force and acceleration in physics

3 Principle of the linear regression algorithm

3.1 Review of machine learning models

◆ From the viewpoint of statistical learning, a machine learning model is a function expression; training it means continually updating the parameters of that function so that it produces the best possible predictions on unknown data

◆ This process works on the same principle as human learning, learn first and then apply, which is why machine learning belongs to the field of artificial intelligence

3.2 What counts as a good prediction?

◆ We said the goal is “the best possible predictions”, but how do we quantify a “good prediction”?

◆ The function used to measure the quality of predictions is called the cost function or loss function

◆ For example, suppose a model predicts whether it will rain: each day the model predicts incorrectly, the loss function increases by 1. The direct goal of a machine learning algorithm is to adjust the parameters of the function so that the number of mispredicted days falls; as the loss function decreases, the prediction accuracy rises at the same time
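As a minimal illustration (plain Scala; the function and data are hypothetical, not from the article's code), such a 0-1 loss could look like this:

```scala
// 0-1 loss for the rain example: add 1 for every day the
// prediction disagrees with what actually happened.
def zeroOneLoss(predicted: Seq[Boolean], actual: Seq[Boolean]): Int =
  predicted.zip(actual).count { case (p, a) => p != a }

// Two of the four days are mispredicted, so the loss is 2
val loss = zeroOneLoss(Seq(true, false, true, true),
                       Seq(true, true, false, true))
```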

3.3 Linear regression

◆ Linear regression is one of the simplest mathematical models

◆ The linear regression procedure uses existing data to explore the relationship between the independent variable X and the dependent variable Y; that relationship is captured by the parameters of the linear regression model. With the fitted parameters, the model can make predictions on unknown data

◆ This is also the basic training process of machine learning, and it belongs to supervised learning

3.4 Linear regression model

◆ The mathematical expression of linear regression is

$$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$

$$Y = X w$$

◆ The two equations above are the linear regression model in ordinary (scalar) form and in matrix form, respectively
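A tiny sketch of what the scalar form computes (illustrative Scala; `w(0)` plays the role of the intercept $w_0$):

```scala
// y = w0 + w1*x1 + ... + wn*xn, with w(0) as the intercept
// and w(1..n) paired with the features x(0..n-1)
def predict(w: Array[Double], x: Array[Double]): Double =
  w(0) + w.tail.zip(x).map { case (wi, xi) => wi * xi }.sum

// e.g. w = (1, 2), x = (3) => y = 1 + 2*3 = 7
val y = predict(Array(1.0, 2.0), Array(3.0))
```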

4 The least squares method

4.1 What is the least squares method

◆ Also known as the method of least squares, it finds the best-fitting function by minimizing the sum of squared residuals

◆ In other words, the least squares method uses the sum of squared residuals as the loss function to measure the quality of the model

◆ The least squares method can also be used to fit curves

4.2 Principle of the least squares method

◆ Taking simple (unary) linear regression as an example, the derivation process is demonstrated; the key steps are sketched below
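For the model $y = a + bx$, the derivation (a standard sketch) minimizes the residual sum of squares

$$Q(a, b) = \sum_{i=1}^{n} \bigl( y_i - (a + b x_i) \bigr)^2$$

Setting $\partial Q/\partial a = 0$ and $\partial Q/\partial b = 0$ yields the closed form

$$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x}$$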

4.3 Example of the least squares method
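The article's example is shown as a screenshot; as a stand-in, here is a hedged, self-contained Scala sketch of the closed form derived above (all names are illustrative):

```scala
// Closed-form least squares fit for y = a + b*x
def leastSquaresFit(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
  val n = xs.length
  val xMean = xs.sum / n
  val yMean = ys.sum / n
  // b = sum((x - xMean)(y - yMean)) / sum((x - xMean)^2)
  val b = xs.zip(ys).map { case (x, y) => (x - xMean) * (y - yMean) }.sum /
    xs.map(x => math.pow(x - xMean, 2)).sum
  val a = yMean - b * xMean
  (a, b) // (intercept, slope)
}

// Perfectly linear data y = 1 + 2x recovers (a, b) = (1.0, 2.0)
val (a, b) = leastSquaresFit(Seq(1.0, 2.0, 3.0), Seq(3.0, 5.0, 7.0))
```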

5 Stochastic gradient descent

5.1 What is stochastic gradient descent

◆ Stochastic gradient descent (SGD) is a common optimization method in machine learning

◆ It searches for the optimal solution of a function through continuous iteration and updating (the global optimum is guaranteed only for convex functions)

◆ Like the least squares method, it is an optimization algorithm. Stochastic gradient descent is especially suitable for models with many variables and complex structure, and is particularly common in deep learning

5.2 Start with the gradient

◆ The gradient is an operator in calculus that gives the direction in which a function changes fastest at a point. Intuitively, it points along the “steepest” path on the surface.

◆ Its mathematical expression is (taking a function of two variables as an example):

$$\nabla f(x, y) = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right)$$

5.3 Principle of stochastic gradient descent

◆ The derivation of gradient descent for the linear model is sketched below
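The essential step, as a standard sketch: for the squared loss of a single sample $(x_i, y_i)$,

$$L(w) = \tfrac{1}{2}\bigl(y_i - w^{T}x_i\bigr)^2, \qquad \frac{\partial L}{\partial w} = -\bigl(y_i - w^{T}x_i\bigr)\,x_i$$

and each step moves against the gradient with learning rate $\eta$:

$$w \leftarrow w - \eta\,\frac{\partial L}{\partial w} = w + \eta\,\bigl(y_i - w^{T}x_i\bigr)\,x_i$$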

5.4 Advantages of stochastic gradient descent

◆ The “stochastic” part of stochastic gradient descent refers to randomly selecting a small batch of samples for each gradient calculation, which requires far less computation than using all samples directly

◆ Stochastic gradient descent is therefore well suited to problems with very large numbers of training samples

◆ The learning rate determines the step size of gradient descent. On top of SGD, the concept of “momentum” has been introduced, and further optimization algorithms that accelerate convergence have been proposed over time; a minimal SGD loop is sketched below
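A minimal SGD loop for the simple linear model, as a plain-Scala sketch (all names illustrative):

```scala
import scala.util.Random

// SGD for y = w*x + b; lr is the learning rate
def sgdFit(data: Array[(Double, Double)],
           lr: Double = 0.01, steps: Int = 10000): (Double, Double) = {
  var w = 0.0
  var b = 0.0
  val rng = new Random(42)
  for (_ <- 1 to steps) {
    // "stochastic": one randomly chosen sample per update, not the full set
    val (x, y) = data(rng.nextInt(data.length))
    val err = (w * x + b) - y
    w -= lr * err * x // gradient of the squared loss w.r.t. w
    b -= lr * err     // gradient of the squared loss w.r.t. b
  }
  (w, b)
}
```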

6 Spark house price forecast – project presentation and code overview

  • code
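The article's code is shown via a link; what follows is a hedged sketch of what such a Spark house-price regression might look like (file path and column names are assumptions):

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("HousePrice").getOrCreate()

// Hypothetical CSV with an "area" feature and a "price" label
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/house_prices.csv")

// Assemble feature columns into the single "features" vector Spark ML expects
val data = new VectorAssembler()
  .setInputCols(Array("area"))
  .setOutputCol("features")
  .transform(raw)
  .withColumnRenamed("price", "label")

val model = new LinearRegression().setMaxIter(100).fit(data)
println(s"w = ${model.coefficients}, b = ${model.intercept}")
```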

Data loading and conversion

  • Data set file – Price in descending order

Because the training set is ordered (sorted by price), it must be shuffled to improve accuracy; one way to do this is sketched below
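A sketch of the shuffle (variable names carried over from the sketch above; `rand` gives each row a random sort key):

```scala
import org.apache.spark.sql.functions.rand

// The file is sorted by price, so randomize the row order before splitting
val shuffled = data.orderBy(rand(seed = 42))
val Array(train, test) = shuffled.randomSplit(Array(0.8, 0.2))
```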

  • Predicted results

7 Logistic regression algorithm and principle overview

7.1 Linear vs. Non-linear

◆ Linearity simply means a first-degree (straight-line) functional relationship between two variables

◆ In nature, most relationships between variables are nonlinear; strictly linear ones are relatively rare

◆ Therefore, when choosing a mathematical model for fitting, a model built from nonlinear functions is in many cases a better fit than a linear one

7.2 Logistic regression

◆ Logistic regression is a generalized linear model; unlike the linear regression model, it introduces a nonlinear function

◆ As a result, logistic regression can fit nonlinear relationships, which plain linear regression cannot

7.3 Principle of logistic regression algorithm

The Sigmoid function

◆ The logistic function (Logistic function), or logistic curve, is a common S-shaped function. It was named by Pierre François Verhulst in 1844 or 1845 while he was studying its relation to population growth.

  • A simple logistic function can be expressed as:

$$P(t) = \frac{1}{1 + e^{-t}}$$

The generalized logistic curve can mimic the S-shaped curve of population growth P in some cases: at first growth is roughly exponential; then, as saturation sets in, it slows down; finally, growth stops once maturity is reached.

  • The standard logistic function: $\sigma(x) = \frac{1}{1 + e^{-x}}$

Logistic regression principle

◆ The improved linear regression model feeds the linear combination through the sigmoid:

$$h_w(x) = \sigma(w^{T} x) = \frac{1}{1 + e^{-w^{T} x}}$$
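In code form (plain Scala, illustrative names), the improvement amounts to one extra function call:

```scala
// Squash the linear output into (0, 1), then threshold for the class label
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

def predictProb(w: Array[Double], x: Array[Double]): Double =
  sigmoid(w.zip(x).map { case (wi, xi) => wi * xi }.sum)

def predictLabel(w: Array[Double], x: Array[Double]): Int =
  if (predictProb(w, x) >= 0.5) 1 else 0
```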

8 The regularization principle

8.1 Is more training always better?

◆ We usually assume that something “thoroughly tempered” must be of excellent quality. Does the same hold for machine learning?

8.2 Over-fitting, under-fitting and just right

◆ A person who learns too rigidly, inflexibly, and dogmatically becomes the proverbial “bookworm”; the same thing happens in machine learning

◆ A model trained to the point of becoming “dogmatic” is what we call overfitting

◆ Conversely, a model whose predictive ability is too weak, as if it has barely learned anything, is called underfitting

◆ The following illustrates three states produced by fitting sample points with three different mathematical models

8.3 How to get just right?

◆ Underfitting can usually be fixed by increasing the number of training rounds, adding more features, or switching to a nonlinear model

◆ Overfitting, on the other hand, is often trickier

◆ Commonly used methods to reduce overfitting include cross-validation and regularization

8.3.1 Cross validation method

◆ The so-called cross-validation method splits the training data into a training set and a validation set during training

  • The training set is used to actually train the model
  • The validation set is used only to test the model's predictive ability

When both reach their best performance at the same time, the model is optimal; a minimal hold-out split of this kind is sketched below
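A minimal Spark sketch of this train/validation split (Spark also offers CrossValidator for full k-fold cross-validation; `data` is assumed to have the usual label/features columns):

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression

// Train on `train`, measure predictive power only on `validation`
val Array(train, validation) = data.randomSplit(Array(0.8, 0.2), seed = 42)
val model = new LinearRegression().fit(train)
val rmse = new RegressionEvaluator()
  .setMetricName("rmse")
  .evaluate(model.transform(validation))
println(s"validation RMSE = $rmse")
```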

8.4 Regularization principle

◆ As the previous examples show, overfitting is often caused by a model that is more complex than the problem actually requires

◆ So, can the model's complexity be quantified inside the loss function, so that the more complex the model is, the more it is “penalized”, keeping the model suitably “moderate”?

◆ This idea is regularization: prevent the model from becoming too complex by dynamically adjusting the degree of the penalty

◆ Let the regularized loss function be

$$J(w) = L(w) + \lambda\,\Omega(w)$$

◆ The optimized parameter is

$$w^{*} = \arg\min_{w} \bigl( L(w) + \lambda\,\Omega(w) \bigr)$$

where $\lambda$ is the regularization coefficient and $\Omega(w)$ is the regularization term, reflecting the complexity of the model; it varies among different algorithms. For example, $\Omega(w)$ can be the L1 norm $\lVert w \rVert_1$ or the squared L2 norm $\lVert w \rVert_2^2$
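In Spark ML this maps onto two estimator parameters (a sketch; the values are arbitrary): `regParam` is the overall strength $\lambda$, and `elasticNetParam` blends the penalty type.

```scala
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setRegParam(0.1)        // lambda: overall penalty strength
  .setElasticNetParam(0.5) // 0.0 = pure L2, 1.0 = pure L1
```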

9 Spark logistic regression in practice

  • The algorithm is officially classified as a classification algorithm

  • Logistic regression algorithm (a hedged sketch of the walkthrough follows below)

  • Classification results (since this is classification, the displayed predictions are discrete class labels)
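A hedged reconstruction of that walkthrough (the LIBSVM sample path is the one shipped with Spark and may differ from the article's data):

```scala
import org.apache.spark.ml.classification.LogisticRegression

val training = spark.read.format("libsvm")
  .load("data/mllib/sample_libsvm_data.txt")

val lrModel = new LogisticRegression()
  .setMaxIter(100)
  .fit(training)

// The prediction column holds discrete labels, as expected for classification
lrModel.transform(training)
  .select("label", "probability", "prediction")
  .show(5)
```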

10 Introduction to the order-preserving regression algorithm

10.1 What is order-preserving regression?

◆ Order-preserving regression (isotonic regression) is a kind of regression analysis used to fit non-decreasing data (non-increasing works the same way) while minimizing the fitting error. In numerical analysis, isotonic regression searches, under the order-preserving constraint, for the weighted least-squares fit $\hat{y}$ to the variable $y$, which is a quadratic programming problem:

$$\min_{\hat{y}} \sum_{i=1}^{n} w_i \bigl( y_i - \hat{y}_i \bigr)^2 \quad \text{subject to} \quad \hat{y}_1 \le \hat{y}_2 \le \dots \le \hat{y}_n$$

◆ Order-preserving regression is used in statistical inference and multidimensional scaling

◆ Comparing order-preserving regression and linear regression

10.2 Applications of order-preserving regression

◆ Order-preserving regression is used to fit non-decreasing data. There is no need to judge in advance whether the relationship is linear, as long as the overall trend of the data is non-decreasing; for example, studying the relationship between the dosage of a drug and its efficacy

11 Principle of the order-preserving regression algorithm

11.1 Principle of order-preserving regression

◆ The premise for applying order-preserving regression is that the response data is non-decreasing overall; the calculation can then be triggered by checking where the data decreases

◆ Algorithm Description

◆ Spark implements the Pool Adjacent Violators algorithm (PAVA) to solve the model.

◆ For example, after order-preserving regression the original sequence {1, 3, 2, 4, 6} becomes {1, 2.5, 2.5, 4, 6}: the violating pair (3, 2) is pooled into its average 2.5. A small sketch of PAVA follows
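A compact plain-Scala sketch of PAVA for equal weights (illustrative, not Spark's actual implementation):

```scala
// Repeatedly merge adjacent blocks whose means violate the
// non-decreasing constraint, replacing each merged block by its mean.
def pava(ys: Array[Double]): Array[Double] = {
  // each block is (sum, count), so its mean is sum / count
  val blocks = scala.collection.mutable.ArrayBuffer.empty[(Double, Int)]
  for (y <- ys) {
    blocks += ((y, 1))
    while (blocks.length > 1 && {
             val (s2, c2) = blocks(blocks.length - 1)
             val (s1, c1) = blocks(blocks.length - 2)
             s2 / c2 < s1 / c1 // violation: later mean below earlier mean
           }) {
      val (s2, c2) = blocks.remove(blocks.length - 1)
      val (s1, c1) = blocks.remove(blocks.length - 1)
      blocks += ((s1 + s2, c1 + c2))
    }
  }
  blocks.toArray.flatMap { case (s, c) => Array.fill(c)(s / c) }
}

// pava(Array(1.0, 3.0, 2.0, 4.0, 6.0)) == Array(1.0, 2.5, 2.5, 4.0, 6.0)
```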

12 Order-preserving regression in practice – data analysis

  • Official Website Documentation

Order-preserving (isotonic) regression belongs to the family of regression algorithms. Standard isotonic regression is the following problem: given a finite set of real numbers $Y = y_1, y_2, \dots, y_n$ representing the observed responses, and $X = x_1, x_2, \dots, x_n$ the unknown response values to be fitted, find a function that minimizes

$$f(x) = \sum_{i=1}^{n} w_i \bigl( y_i - x_i \bigr)^2$$

with respect to the complete order $x_1 \le x_2 \le \dots \le x_n$, where the $w_i$ are positive weights. The resulting function is called isotonic (order-preserving) regression. It can be viewed as a least-squares problem under an order constraint; essentially, isotonic regression is the monotone function that best fits the original data points. Spark implements a pool adjacent violators algorithm that parallelizes isotonic regression. The training input is a DataFrame containing three columns: label, features, and weight. In addition, the IsotonicRegression algorithm has an optional parameter called isotonic, which defaults to true; it specifies whether the regression is isotonic (monotonically increasing) or antitonic (monotonically decreasing).

Training returns an IsotonicRegressionModel, which can be used to predict for both known and unknown features. The result of isotonic regression is treated as a piecewise linear function, so the prediction rules are:

1. If the prediction input exactly matches a training feature, the associated prediction is returned. If there are multiple predictions with the same feature, one of them is returned; which one is undefined (as with java.util.Arrays.binarySearch).
2. If the prediction input is lower or higher than all training features, the prediction with the lowest or highest feature, respectively, is returned. If there are multiple predictions with the same feature, the lowest or highest is returned, respectively.
3. If the prediction input falls between two training features, the prediction is interpolated linearly from the predictions of the two closest features; ties on the same feature follow the rule above.

  • Code (a hedged sketch follows below)
  • Calculation results: the prediction turns out to be impressively accurate!
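A sketch based on Spark's official isotonic regression example (the file path is the sample shipped with Spark; adjust to the article's data):

```scala
import org.apache.spark.ml.regression.IsotonicRegression

val dataset = spark.read.format("libsvm")
  .load("data/mllib/sample_isotonic_regression_libsvm_data.txt")

// isotonic = true by default, i.e. a non-decreasing (monotone) fit
val model = new IsotonicRegression().fit(dataset)

println(s"boundaries: ${model.boundaries}")
println(s"predictions associated with boundaries: ${model.predictions}")

model.transform(dataset).show(5)
```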

Spark machine learning practice series

  • Spark-based machine learning practice (I) – Introduction to machine learning
  • Spark-based machine learning practice (II) – Introduction to MLlib
  • Spark-based machine learning practice (III) – Hands-on environment setup
  • Spark-based machine learning practice (IV) – Data visualization
  • Spark-based machine learning practice (VI) – Basic statistics module
  • Spark-based machine learning practice (VII) – Regression algorithms