This is the fifth day of my participation in the August More Text Challenge. For details, see: August More Text Challenge.

This article explains how to alleviate overfitting. Let's start with the three figures below. Underfitting means the predicted function model has fewer parameters and lower complexity than the true underlying model, but this has become increasingly rare, because today's networks are deep enough.

Overfitting means the number of parameters and the complexity of the predicted function model are much higher than those of the true underlying model.

In this context, some people put forward "Occam's Razor": more things should not be used than are necessary. For a neural network, that means not using unnecessary network parameters and trying to choose the smallest number of parameters that can still model the data.

At present, there are several mainstream methods to prevent overfitting:

  • More data
  • Constrain model complexity
    • shallow (fewer/smaller layers)
    • regularization
  • Dropout (see the sketch after this list)
  • Data augmentation
  • Early Stopping
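
Most of these are one-liners in PyTorch. Below is a minimal sketch of adding Dropout between fully-connected layers; the layer sizes are arbitrary choices for illustration, not taken from this article:

import torch.nn as nn

# A small fully-connected network with Dropout between layers
# (layer sizes are illustrative only)
net = nn.Sequential(
    nn.Linear(784, 200),
    nn.ReLU(inplace=True),
    nn.Dropout(0.5),  # randomly zero 50% of activations during training
    nn.Linear(200, 10),
)

# net.train() enables Dropout; net.eval() disables it for evaluation/testing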

Here we focus on Regularization. For a binary classification problem, the Cross Entropy loss is


$$J_1(\theta)=-\frac{1}{m}\sum_{i=1}^m\left[y_i\ln\hat y_i+(1-y_i)\ln(1-\hat y_i)\right]$$
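
As a quick sanity check of this formula, the manual computation below matches PyTorch's built-in binary cross entropy; the predictions and labels are made-up values for illustration:

import torch
import torch.nn.functional as F

y_hat = torch.tensor([0.9, 0.2, 0.7])  # predicted probabilities (made up)
y = torch.tensor([1.0, 0.0, 1.0])      # ground-truth labels (made up)

# Manual computation of J_1 from the formula above
j1_manual = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).mean()

# PyTorch's built-in binary cross entropy (mean reduction by default)
j1_builtin = F.binary_cross_entropy(y_hat, y)

print(j1_manual.item(), j1_builtin.item())  # both ≈ 0.2284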

If we now add the parameters $\theta$ to the loss, where $\theta$ stands for the network parameters $(w_1, b_1, w_2, \dots)$, and multiply some norm of $\theta$ (the L1-norm in the formula below) by a factor $\lambda>0$, the formula becomes


$$J_2(\theta)=J_1(\theta)+\lambda\sum_{i=1}^n|\theta_i|$$

Think about it: we originally intended to optimize the Loss, i.e. the value of $J_1(\theta)$, and drive it toward 0, but now we optimize $J_2(\theta)$. So while forcing the Loss toward 0, we also push the L1-norm of the parameters, $\sum_i|\theta_i|$, toward 0.

So why does the complexity of the model decrease when the norm of the parameters approaches zero? Imagine we have a model $y=\beta_0+\beta_1x+\dots+\beta_7x^7$. After regularization, the norm of the parameters is optimized to a value very close to 0; it is then quite possible that $\beta_3,\dots,\beta_7$ become very small, say $0.01$, and the model is approximately reduced to a quadratic equation, which is nowhere near as complex as the original degree-7 one.
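
Here is a quick numeric check of that intuition; the coefficient values are made up for illustration:

import numpy as np

# Hypothetical coefficients after regularization: beta_3..beta_7 shrunk to 0.01
beta = np.array([1.0, 2.0, -1.5, 0.01, 0.01, 0.01, 0.01, 0.01])

x = np.linspace(-1, 1, 101)
full = sum(b * x**i for i, b in enumerate(beta))      # degree-7 model
quad = sum(b * x**i for i, b in enumerate(beta[:3]))  # keep only beta_0..beta_2

print(np.abs(full - quad).max())  # ≈ 0.05: the model behaves almost like a quadratic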

This method is also called Weight Decay
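
The name comes from the L2 case. Assuming (for this derivation) a penalty of $\frac{\lambda}{2}\lVert\theta\rVert_2^2$ and learning rate $\eta$, one gradient descent step on $J_2$ is

$$\theta \leftarrow \theta - \eta\nabla J_2(\theta) = \theta - \eta\nabla J_1(\theta) - \eta\lambda\theta = (1-\eta\lambda)\,\theta - \eta\nabla J_1(\theta)$$

so each update first scales the weights down by the factor $(1-\eta\lambda)$, i.e. the weights "decay" a little before the usual gradient step is applied.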

Note that the decision boundary on the left of the figure above is complex and not a smooth curve, which shows the function model has strong segmentation and expressive ability, but it has also learned some noisy samples. The figure on the right shows the result with regularization added: the model has not learned the noisy samples, so its expressive ability is not as strong, and it produces a better (smoother) split, which is what we want.

There are two common ways to apply Regularization: one adds the L1-norm, the other adds the L2-norm. The most commonly used is L2-regularization; the code is as follows

import torch.nn as nn
import torch.optim as optim

net = MLP()  # MLP and learning_rate are defined earlier in the article
# SGD takes all the network parameters; weight_decay=0.01 adds an L2 penalty
# that forces the L2-norm of the weights to gradually approach 0
optimizer = optim.SGD(net.parameters(), lr=learning_rate, weight_decay=0.01)
# Note: setting weight_decay when there is no overfitting can cause a sharp drop in performance
criteon = nn.CrossEntropyLoss()

PyTorch does not have good built-in support for L1-regularization at the moment, so it needs to be implemented manually in the code:

# Manually accumulate the L1-norm of all network parameters
regularization_loss = 0
for param in model.parameters():
    regularization_loss += torch.sum(torch.abs(param))

# Total loss = classification loss + lambda * L1 penalty (lambda = 0.01 here)
classify_loss = criteon(logits, target)
loss = classify_loss + 0.01 * regularization_loss

optimizer.zero_grad()
loss.backward()
optimizer.step()