Logistic regression and maximum entropy model: theoretical derivation

Four optimization algorithms are covered:

  1. Gradient descent algorithm
  2. Quasi-Newton method
  3. GIS: Generalized Iterative Scaling.
  4. IIS: Improved Iterative Scaling.

The expressions and objective functions of the logistic regression and maximum entropy models are:

Logistic regression model

  • Expression

    $$P(Y=1\mid x)=\frac{\exp(w\cdot x)}{1+\exp(w\cdot x)},\qquad P(Y=0\mid x)=\frac{1}{1+\exp(w\cdot x)}$$

  • Objective function (the log-likelihood, to be maximized)

    $$L(w)=\sum_{i=1}^{N}\bigl[y_i\,(w\cdot x_i)-\log\bigl(1+\exp(w\cdot x_i)\bigr)\bigr]$$

Maximum entropy model

  • Expression

    $$P_w(y\mid x)=\frac{1}{Z_w(x)}\exp\Bigl(\sum_{i=1}^{n}w_i f_i(x,y)\Bigr)$$

  • Objective function (the log-likelihood, to be maximized)

    $$L(w)=\sum_{x,y}\tilde P(x,y)\sum_{i=1}^{n}w_i f_i(x,y)-\sum_{x}\tilde P(x)\log Z_w(x)$$

where $Z_w(x)=\sum_{y}\exp\bigl(\sum_{i=1}^{n}w_i f_i(x,y)\bigr)$ is the normalization factor.
Major premise

For the data set $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$, $y_i$ denotes the class label of the $i$-th sample, $x_i$ denotes the feature vector of the $i$-th sample, and $x_{ij}$ denotes the $j$-th feature of the $i$-th sample.

Assume the objective function $f(x)$ is convex and twice continuously differentiable, and that its minimum is attained at $x^*$.


1. Gradient descent algorithm

1.1 The gradient

By its mathematical definition, the gradient of a function points in the direction along which the function increases fastest. Moving along the gradient is therefore the quickest way toward a maximum of the function; conversely, moving against the gradient gives the fastest decrease, which makes it easier to find the minimum.
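As a minimal illustration (an assumed example, not from the original article): on the convex function $f(x)=x^2$, whose gradient is $f'(x)=2x$, stepping against the gradient drives the iterate toward the minimizer $x^*=0$.

    # Gradient descent on f(x) = x^2 (illustrative sketch; f'(x) = 2x)
    def gradient_descent(x, lr=0.1, nums=100):
        for _ in range(nums):
            x -= lr * 2 * x   # step against the gradient
        return x

    print(gradient_descent(5.0))  # close to the minimizer 0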

1.2 Application

In the logistic regression model, the partial derivative of the objective function with respect to the weight of each feature is

$$\frac{\partial L(w)}{\partial w_j}=\sum_{i=1}^{N}\bigl(y_i-\sigma(w\cdot x_i)\bigr)\,x_{ij},\qquad \sigma(z)=\frac{1}{1+e^{-z}} .$$

Then the weight is updated (gradient ascent on the log-likelihood, with learning rate $\alpha$) as:

$$w_j \leftarrow w_j+\alpha\sum_{i=1}^{N}\bigl(y_i-\sigma(w\cdot x_i)\bigr)\,x_{ij}$$
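In vector form this gradient and update can be sketched with NumPy as follows (a sketch assuming `X` has shape `(N, d)`, `y` has shape `(N,)`, and `w` has shape `(d,)`; the function names are illustrative, not the article's):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient_ascent_step(w, X, y, lr):
        # gradient of the log-likelihood: X^T (y - sigmoid(X w))
        grad = X.T @ (y - sigmoid(X @ w))
        return w + lr * grad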
1.3 Shortcomings

The size of each step is very important near the extreme point. If the step is too large, the iterates easily oscillate around the extremum and fail to converge.

The remedy: make the learning rate $\alpha$ a quantity that decreases with the number of iterations, but never let it decay all the way to zero.
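This is exactly what the `_StochasticGradientAscent1` method in section 1.5 below does; its schedule, isolated as a small sketch:

    def decayed_lr(iteration, inum, floor=0.01):
        # decreases as the iteration counters grow, but the additive floor keeps it from reaching zero
        return 4 / (iteration + inum + 1) + floor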

1.4 Variants of gradient descent

1.4.1 Batch Gradient Descent

Batch gradient descent is the most commonly used form of gradient descent. Each parameter update uses all of the samples.

1.4.2 Stochastic Gradient Descent

Stochastic gradient descent follows the same principle as batch gradient descent; the difference is that it does not use all of the sample data when computing the gradient, but selects a single sample instead. Stochastic and batch gradient descent are two extremes: one uses all of the data for each update, the other uses a single sample, and the pros and cons of each follow naturally. In terms of training speed, stochastic gradient descent is fast because each iteration touches only one sample, whereas batch gradient descent becomes unacceptably slow when the sample size is large. In terms of accuracy, determining the gradient direction from a single sample means there is a high probability that the solution is not optimal. In terms of convergence, because each iteration uses a different single sample, the update direction changes greatly, and the method cannot quickly converge to a local optimum.

1.4.3 Mini-Batch Gradient Descent

Mini-batch gradient descent is a compromise between batch and stochastic gradient descent: given $m$ samples, each update uses a batch of $x$ of them, with $1 < x < m$; the batch size can be adjusted according to the sample data.

1.5 Code Implementation

    #============== Gradient ascent optimization algorithm =======================#
    def _batchGradientAscent(self, nums, lr):
        '''Batch gradient ascent optimization algorithm
        :param nums: <int> number of iterations
        :param lr: <float> learning rate
        :return:
        '''
        for k in range(nums):
            print('%d th iterations' % k)
            output = self.predict(self.input_vecs)
            delta = lr * (self.labels - output.T)
            delta_weight = np.dot(self.input_vecs.T, delta)
            self.weight += delta_weight.T

    #============== Stochastic gradient ascent optimization algorithm =======================#
    def _StochasticGradientAscent0(self, lr):
        '''Stochastic gradient ascent optimization algorithm
        :param lr: <float> learning rate
        :return:
        '''
        for inum in range(self.n_nums):
            output = self.predict(self.input_vecs[inum])
            delta = lr * (self.labels[inum] - output.T)
            delta_weight = self.input_vecs[inum] * delta
            self.weight += delta_weight.T

    #============== Stochastic gradient ascent optimization algorithm =======================#
    def _StochasticGradientAscent1(self, nums):
        '''Stochastic gradient ascent optimization algorithm (with decaying learning rate)
        :param nums: <int> number of iterations
        :return:
        '''
        for iteration in range(nums):
            data_index = list(range(self.n_nums))
            for inum in range(self.n_nums):
                lr = 4 / (iteration + inum + 1) + 0.01
                # pick a random index from the remaining samples, then remove it
                # so that every sample is used exactly once per pass
                rand_index = int(random.uniform(0, len(data_index)))
                sample_index = data_index[rand_index]
                output = self.predict(self.input_vecs[sample_index])
                delta = lr * (self.labels[sample_index] - output.T)
                delta_weight = self.input_vecs[sample_index] * delta
                self.weight += delta_weight.T
                del(data_index[rand_index])

    #============== Mini-batch gradient ascent optimization algorithm =======================#
    def _MiniBatchGradientAscent(self, nums, lr, batch_size=16):
        '''Mini-batch gradient ascent optimization algorithm
        :param nums: <int> number of iterations
        :param lr: <float> learning rate
        :param batch_size: <int> batch size
        :return:
        '''
        for iteration in range(nums):
            for ibatch in range(1, self.n_nums // batch_size + 1):
                start_index = (ibatch-1) * batch_size
                end_index = ibatch * batch_size

                mini_train_data = self.input_vecs[start_index: end_index, ]
                mini_label = self.labels[start_index: end_index, ]

                output = self.predict(mini_train_data)
                delta = lr * (mini_label - output.T)
                delta_weight = np.dot(mini_train_data.T, delta)
                self.weight += delta_weight.T

    def train(self, nums, optimization='gradAscent', lr=0.001):
        '''Train the logistic regression model
        :param nums: <int> number of iterations
        :param optimization: <str> which optimization algorithm to use
        :param lr: <float> learning rate
        :return:
        '''
        if optimization == 'gradAscent':
            self._batchGradientAscent(nums, lr)
        elif optimization == 'SGA0':
            self._StochasticGradientAscent0(lr)
        elif optimization == 'SGA1':
            self._StochasticGradientAscent1(nums)
        elif optimization == 'MGA':
            self._MiniBatchGradientAscent(nums, lr, batch_size=16)
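The methods above belong to a class that is not shown in this excerpt. A minimal scaffold consistent with the shapes they use might look like the sketch below; the class name, constructor, and `predict` implementation are assumptions for illustration, not the article's code (a bias term is presumed to be folded into the features).

    import random
    import numpy as np

    class LogisticRegression:
        def __init__(self, input_vecs, labels):
            # input_vecs: (n_samples, n_features); labels: one 0/1 label per sample
            self.input_vecs = np.asarray(input_vecs, dtype=float)
            self.labels = np.asarray(labels, dtype=float).reshape(-1, 1)
            self.n_nums = self.input_vecs.shape[0]
            self.weight = np.zeros((1, self.input_vecs.shape[1]))

        def predict(self, input_vecs):
            # sigmoid(w . x); returns shape (1, n) for a batch of n samples
            x = np.asarray(input_vecs, dtype=float)
            return 1.0 / (1.0 + np.exp(-np.dot(self.weight, x.T)))

        # (the _batchGradientAscent, _StochasticGradientAscent0/1,
        #  _MiniBatchGradientAscent and train methods shown above go here)

    # Example usage with hypothetical data X, y:
    # model = LogisticRegression(X, y)
    # model.train(nums=500, optimization='MGA', lr=0.01)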

2. Newton’s method

2.1 Basic idea (first consider the single-variable case)

Newton's method is used here to find the minimum of $f(x)$. Let $x_k$ be the current estimate of the minimizer. By the Taylor formula,

$$\varphi(x)=f(x_k)+f'(x_k)(x-x_k)+\tfrac{1}{2}f''(x_k)(x-x_k)^2 .$$

Setting $\varphi'(x)=0$ gives $f'(x_k)+f''(x_k)(x-x_k)=0$. If $f'(x_k)=0$ the stopping condition is already satisfied; if it is not satisfied, the iteration formula is

$$x_{k+1}=x_k-\frac{f'(x_k)}{f''(x_k)} .$$

At this point $f'(x_{k+1})$ can be regarded as closer to $0$ than $f'(x_k)$, and the iteration has converged once $|f'(x_k)|$ falls below a small threshold.
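A minimal sketch of this scalar iteration on an assumed example, $f(x)=e^x-2x$ with $f'(x)=e^x-2$, $f''(x)=e^x$ and minimizer $x^*=\ln 2$ (not the article's code):

    import math

    def newton_1d(x, eps=1e-8, max_iter=100):
        for _ in range(max_iter):
            g = math.exp(x) - 2          # f'(x_k)
            h = math.exp(x)              # f''(x_k)
            if abs(g) < eps:             # convergence: f'(x_k) close to 0
                break
            x = x - g / h                # x_{k+1} = x_k - f'(x_k) / f''(x_k)
        return x

    print(newton_1d(0.0), math.log(2))   # both close to 0.6931...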

2.2 Basic idea (the matrix case)

For a vector variable $x$, the second-order Taylor expansion of $f$ around the current estimate $x_k$ is

$$\varphi(x)=f(x_k)+\nabla f(x_k)^{T}(x-x_k)+\tfrac{1}{2}(x-x_k)^{T}H(x_k)(x-x_k),$$

where $\nabla f(x_k)$ is the gradient and $H(x_k)$ the Hessian matrix of $f$ at $x_k$. For $\varphi(x)$ to have an extreme point, its gradient must vanish:

$$\nabla\varphi(x)=\nabla f(x_k)+H(x_k)(x-x_k)=0 .$$

When the matrix $H(x_k)$ is nonsingular, this yields the iteration

$$x_{k+1}=x_k-H(x_k)^{-1}\nabla f(x_k) .$$
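A sketch of this multivariate step with NumPy; the quadratic test function is an assumed example, and the linear system is solved instead of explicitly inverting $H$:

    import numpy as np

    def newton_step(x, grad, hess):
        # x_{k+1} = x_k - H(x_k)^{-1} grad f(x_k)
        return x - np.linalg.solve(hess(x), grad(x))

    # Example: f(x) = 0.5 x^T A x - b^T x with A positive definite,
    # so a single Newton step reaches the minimizer A^{-1} b
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])
    grad = lambda x: A @ x - b
    hess = lambda x: A
    x = newton_step(np.zeros(2), grad, hess)
    print(x, np.linalg.solve(A, b))   # identical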

3. Quasi-Newton method

Newton's method and quasi-Newton method study notes (2): the quasi-Newton condition

Newton's method and quasi-Newton method study notes (3): the DFP algorithm
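Since this section only links to the derivations, a hedged sketch of what those notes cover: quasi-Newton methods replace the exact inverse Hessian with an approximation $H_k$ satisfying the quasi-Newton (secant) condition $H_{k+1}y_k=s_k$, where $s_k=x_{k+1}-x_k$ and $y_k=\nabla f(x_{k+1})-\nabla f(x_k)$. The DFP update of that approximation, written in NumPy for illustration:

    import numpy as np

    def dfp_update(H, s, y):
        # DFP update of the inverse-Hessian approximation H_k:
        # H_{k+1} = H_k + s s^T / (s^T y) - (H y)(H y)^T / (y^T H y)
        s = np.asarray(s, dtype=float)
        y = np.asarray(y, dtype=float)
        Hy = H @ y
        return H + np.outer(s, s) / (s @ y) - np.outer(Hy, Hy) / (y @ Hy)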


4. Iterative scaling (maximum entropy model)

The complete code
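As a hedged sketch of the GIS update listed in the introduction (the linked code is the authoritative version): each GIS iteration moves every weight by the log-ratio of the empirical and model expectations of its feature, divided by a constant $C\ge\max_{x,y}\sum_i f_i(x,y)$.

    import numpy as np

    def gis_step(w, ep_empirical, ep_model, C):
        # GIS: w_i <- w_i + (1/C) * log( E_~p[f_i] / E_p[f_i] )
        return w + np.log(ep_empirical / ep_model) / C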
