Model Statistics

Count the total number of model parameters

num_params = sum(param.numel() for param in model.parameters())
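A slightly longer sketch (purely illustrative) that also prints a per-tensor breakdown using named_parameters():

# Per-tensor breakdown of parameter counts
for name, param in model.named_parameters():
    print(f"{name:40s} {tuple(param.shape)}  {param.numel()}")
print("total:", sum(p.numel() for p in model.parameters()))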

Parameter Regularization (Weight Regularization)

Old method

L2/L1 Regularization

In machine learning, almost all loss functions are followed by an extra term. The commonly used extra terms come in two kinds, called **L1 regularization** and **L2 regularization**, or the **L1 norm** and **L2 norm**.

L1 regularization and L2 regularization can be regarded as penalty terms of the loss function. The so-called “penalty” refers to some restrictions on the parameters of the loss function.

  • L1 regularization refers to the **sum of the absolute values of the elements of the weight vector $w$**, usually written as $\|w\|_1$
  • L2 regularization refers to the **sum of the squares of the elements of the weight vector $w$, followed by the square root**, usually written as $\|w\|_2$
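Putting the two together with the original loss $L_0$ and a coefficient $\lambda$, the regularized objectives are commonly written as (the $\frac{1}{2}$ factor on the L2 term is one common convention and is sometimes omitted):

$$L = L_0 + \lambda \|w\|_1 = L_0 + \lambda \sum_i |w_i|$$

$$L = L_0 + \frac{\lambda}{2} \|w\|_2^2 = L_0 + \frac{\lambda}{2} \sum_i w_i^2$$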

The roles of L1 and L2 regularization are summarized below; these statements can be found in many articles.

  • L1 regularization can produce a sparse weight matrix, i.e. a sparse model, which can be used for feature selection
  • L2 regularization can prevent the model from overfitting; L1 can also prevent overfitting to some extent

A manual implementation of L2 regularization:

reg = 1e-6
l2_loss = Variable(torch.zeros(1), requires_grad=True)  # Variable is the old-style wrapper from torch.autograd
for name, param in model.named_parameters():
    if 'bias' not in name:
        l2_loss = l2_loss + 0.5 * reg * torch.sum(torch.pow(param, 2))

A manual implementation of L1 regularization:

reg = 1e-6
l1_loss = Variable(torch.zeros(1), requires_grad=True)
for name, param in model.named_parameters():
    if 'bias' not in name:
        l1_loss = l1_loss + reg * torch.sum(torch.abs(param))

Orthogonal Regularization

reg = 1e-6
orth_loss = Variable(torch.zeros(1), requires_grad=True)
for name, param in model.named_parameters():
    if 'bias' not in name:
        param_flat = param.view(param.shape[0], -1)
        sym = torch.mm(param_flat, torch.t(param_flat))
        sym -= Variable(torch.eye(param_flat.shape[0]))
        orth_loss = orth_loss + reg * sym.sum()

Max Norm Constraint

Simply put, the norm of W is constrained directly, by clamping it to a maximum value.

def max_norm(model, max_val=3, eps=1e-8):
    for name, param in model.named_parameters():
        if 'bias' not in name:
            norm = param.data.norm(2, dim=0, keepdim=True)
            desired = torch.clamp(norm, 0, max_val)
            # rescale in place; a plain 'param = param * ...' would only rebind the local name
            param.data *= desired / (eps + norm)
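A typical place to apply such a constraint is right after each optimizer step; a minimal sketch (loader, criterion and optimizer here are assumed to exist elsewhere):

# Hypothetical training loop applying the max-norm constraint after every update
for data, target in loader:
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()
    max_norm(model, max_val=3)  # re-clamp the weight norms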

L2 regularization

The most direct way to apply L2 regularization in PyTorch is the optimizer's weight_decay option, which corresponds to the λ coefficient in the L2 penalty.

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-5)
# Or add the L2 penalty term manually:
l2_lambda = torch.tensor(1.)   # 'lambda' is a reserved word in Python, so use another name
l2_reg = torch.tensor(0.)
for param in model.parameters():
    l2_reg += torch.norm(param)
loss += l2_lambda * l2_reg

In addition, the optimizer supports per-parameter options, which let you specify finer-grained settings for each parameter group. Instead of passing an iterable of tensors, we pass an iterable of dictionaries. Each dictionary must contain a params key specifying the tensors to optimize, while the other keys must match the optimizer's own keyword arguments and override the defaults for that group.

optim.SGD([
                {'params': model.base.parameters()},
                {'params': model.classifier.parameters(), 'lr': 1e-3}
            ], lr=1e-2, momentum=0.9)
weight_p, bias_p = [], []
for name, p in model.named_parameters():
    if 'bias' in name:
        bias_p += [p]
    else:
        weight_p += [p]
# The parameter names are generated automatically: weight tensors contain 'weight' and bias tensors contain 'bias'.
# This is different from TensorFlow, where the user can define the names and the system can also generate them.
optim.SGD([
          {'params': weight_p, 'weight_decay': 1e-5},
          {'params': bias_p, 'weight_decay': 0}
          ], lr=1e-2, momentum=0.9)

L1 regularization

criterion = nn.CrossEntropyLoss()

classify_loss = criterion(input=out, target=batch_train_label)

l1_lambda = torch.tensor(1.)   # 'lambda' is a reserved word in Python, so use another name
l1_reg = torch.tensor(0.)
for param in model.parameters():
    l1_reg += torch.sum(torch.abs(param))

loss = classify_loss + l1_lambda * l1_reg

Define the regularization class

# Check whether the GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# device='cuda'
print("-----device:{}".format(device))
print("-----Pytorch version:{}".format(torch.__version__))
 
 
class Regularization(torch.nn.Module):
    def __init__(self,model,weight_decay,p=2):
        """
        :param model: the model whose weights are regularized
        :param weight_decay: regularization coefficient
        :param p: order of the norm; p=2 gives L2 regularization, p=1 gives L1 regularization
        """
        super(Regularization, self).__init__()
        if weight_decay <= 0:
            print("param weight_decay can not <=0")
            exit(0)
        self.model=model
        self.weight_decay=weight_decay
        self.p=p
        self.weight_list=self.get_weight(model)
        self.weight_info(self.weight_list)
 
    def to(self,device):
        """
        :param device: cuda or cpu
        :return: self
        """
        self.device=device
        super().to(device)
        return self
 
    def forward(self, model):
        self.weight_list = self.get_weight(model)  # get the latest weights
        reg_loss = self.regularization_loss(self.weight_list, self.weight_decay, p=self.p)
        return reg_loss
 
    def get_weight(self,model):
        """
        Get the list of weight parameters of the model
        :param model:
        :return: list of (name, parameter) pairs
        """
        weight_list = []
        for name, param in model.named_parameters():
            if 'weight' in name:
                weight = (name, param)
                weight_list.append(weight)
        return weight_list
 
    def regularization_loss(self,weight_list, weight_decay, p=2):
        """
        Compute the regularization loss over the weight list
        :param weight_list:
        :param p: order of the norm, defaults to the 2-norm
        :param weight_decay:
        :return: weighted regularization loss
        """
        # weight_decay=Variable(torch.FloatTensor([weight_decay]).to(self.device),requires_grad=True)
        # reg_loss=Variable(torch.FloatTensor([0.]).to(self.device),requires_grad=True)
        # weight_decay=torch.FloatTensor([weight_decay]).to(self.device)
        # reg_loss=torch.FloatTensor([0.]).to(self.device)
        reg_loss=0
        for name, w in weight_list:
            l2_reg = torch.norm(w, p=p)
            reg_loss = reg_loss + l2_reg
 
        reg_loss=weight_decay*reg_loss
        return reg_loss
 
    def weight_info(self,weight_list):
        "" Prints weight list information :param weight_list: :return:"
        print("---------------regularization weight---------------")
        for name ,w in weight_list:
            print(name)
        print("-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -")

Using the regularization class

# Check whether the GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 
print("-----device:{}".format(device))
print("-----Pytorch version:{}".format(torch.__version__))
 
weight_decay = 100.0  # regularization coefficient
 
model = my_net().to(device)
# Initialize the regularization term
if weight_decay>0:
   reg_loss=Regularization(model, weight_decay, p=2).to(device)
else:
   print("no regularization")
 
 
criterion= nn.CrossEntropyLoss().to(device) # CrossEntropyLoss=softmax+cross entropy
optimizer = optim.Adam(model.parameters(), lr=learning_rate)  # weight_decay is not needed here
 
# train
batch_train_data=...
batch_train_label=...
 
out = model(batch_train_data)
 
# loss and regularization
loss = criterion(input=out, target=batch_train_label)
if weight_decay > 0:
   loss = loss + reg_loss(model)
total_loss = loss.item()  # scalar value, e.g. for logging
 
# backprop
optimizer.zero_grad()  # clear all currently accumulated gradients
loss.backward()
optimizer.step()

Learning rate decay

torch.optim.lr_scheduler

Decay based on the number of epochs

Every step_size epochs, the learning rate is multiplied by gamma.

optimizer = optim.SGD(params=model.parameters(), lr=0.05)

# lr_scheduler.StepLR()
# Assuming the optimizer uses lr = 0.05 for all groups
# lr = 0.05 if epoch < 30
# lr = 0.005 if 30 <= epoch < 60
# lr = 0.0005 if 60 <= epoch < 90

scheduler = lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
plt.figure()
x = list(range(100))
y = []
for epoch in range(100):
    scheduler.step()
    lr = scheduler.get_lr()[0]
    print(epoch, lr)
    y.append(lr)
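# Plot the collected learning rates (assumes matplotlib.pyplot was imported as plt, matching plt.figure() above)
plt.plot(x, y)
plt.xlabel('epoch')
plt.ylabel('learning rate')
plt.show()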

Based on a validation metric

class torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
mode='min', factor=0.1, patience=10, verbose=False, 
threshold=0.0001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08)
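A minimal usage sketch, assuming a validation loss is computed every epoch (train_one_epoch and validate below are placeholders, not functions from the code above):

optimizer = optim.SGD(model.parameters(), lr=0.05)
scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10)

for epoch in range(100):
    train_one_epoch(model, optimizer)   # placeholder training step
    val_loss = validate(model)          # placeholder validation metric
    scheduler.step(val_loss)            # pass the monitored metric to the scheduler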

View Pytorch network layer output (feature map, weight, bias)

weight and bias

# Method 1: there are several ways to inspect the parameters directly
model = alexnet(pretrained=True).to(device)
conv1_weight = model.features[0].weight

# Method 2 
# This approach also works when you re-implement a network based on a pretrained model: the parameters of each layer stay the same, but the network structure is expressed differently.
# You can then iterate over param and assign it to the corresponding layer of your own network, avoiding mismatches when loading the state dict directly!
for layer, param in model.state_dict().items():  # param is a weight or bias (Tensor)
    print(layer, param)

feature map

Since PyTorch builds its graph dynamically and does not keep intermediate computation results, it is not very convenient to inspect the feature map output by each layer. Two cases are discussed below:

1. The layer you want to inspect is a standalone module: you can simply capture its output in a variable inside forward() and return it.

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 1, 3)
        self.conv2 = nn.Conv2d(1, 1, 3)
        self.conv3 = nn.Conv2d(1, 1, 3)

    def forward(self, x):
        out1 = F.relu(self.conv1(x))
        out2 = F.relu(self.conv2(out1))
        out3 = F.relu(self.conv3(out2))
        return out1, out2, out3
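A quick usage sketch (the single-channel 28x28 input shape is only an assumption for illustration):

net = Net()
x = torch.randn(1, 1, 28, 28)       # hypothetical input batch
out1, out2, out3 = net(x)           # each out_i is the feature map after the corresponding conv + ReLU
print(out1.shape, out2.shape, out3.shape)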

2. The layer you want to inspect is inside an nn.Sequential() container, which is a bit more involved.

# Method 1: nn.Module.children()
# After the model is instantiated, use nn.Module.children() to drop the layers that come after the one you want to inspect
import torch
import torch.nn as nn
from torchvision import models

model = models.alexnet(pretrained=True)

# remove last fully-connected layer
new_classifier = nn.Sequential(*list(model.classifier.children())[:-1])
model.classifier = new_classifier
# Keep only the first few feature layers (indices 0-4, up to the second convolution and its ReLU)
new_features = nn.Sequential(*list(model.features.children())[:5])
model.features = new_features

# Method 2: use hooks. This method is recommended because it does not change the original model
# torch.nn.Module.register_forward_hook(hook)
# hook(module, input, output) -> None

model = models.alexnet(pretrained=True)
# define the hook
def hook(module, input, output):
    print(output.size())
# register the hook on the first feature layer
handle = model.features[0].register_forward_hook(hook)
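# (illustrative) run a forward pass so the hook actually fires; 224x224 is the usual AlexNet input size
_ = model(torch.randn(1, 3, 224, 224))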
# remove the hook when it is no longer needed
handle.remove()

# torch.nn.Module.register_backward_hook(hook)
# hook(module, grad_input, grad_output) -> Tensor or None
model = alexnet(pretrained=True).to(device)
outputs = []
def hook(module, input, output):
    # for a backward hook, 'input' and 'output' are actually grad_input and grad_output
    outputs.append(output)
    print(len(outputs))

handle = model.features[0].register_backward_hook(hook)

Note: you can also reduce the second case to the first by writing a small class that extracts the features you need, or even by rebuilding an identical model in which every layer is defined independently.
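A minimal sketch of such a feature-extractor class, using forward hooks under the hood (the class name FeatureExtractor and the layer names passed in are purely illustrative):

import torch
import torch.nn as nn
from torchvision import models

class FeatureExtractor(nn.Module):
    def __init__(self, model, layer_names):
        super().__init__()
        self.model = model
        self.outputs = {}
        # register a forward hook on every requested submodule
        for name, module in model.named_modules():
            if name in layer_names:
                module.register_forward_hook(self._make_hook(name))

    def _make_hook(self, name):
        def hook(module, inp, out):
            self.outputs[name] = out.detach()
        return hook

    def forward(self, x):
        self.outputs.clear()
        out = self.model(x)
        return out, self.outputs

# Example: grab the first two conv outputs of AlexNet
extractor = FeatureExtractor(models.alexnet(pretrained=True), {'features.0', 'features.3'})
_, feats = extractor(torch.randn(1, 3, 224, 224))
print({k: v.shape for k, v in feats.items()})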

Calculate the number of model parameters

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

Custom Operations (Function)

torch.autograd.Function defines a formula for a differentiable operation and records the history of operations. Every operation performed on a Tensor creates a new Function object, which performs the computation and records that it happened. The history is retained as a DAG of Functions, with edges representing the data dependencies (input <- output). When backward is called, the graph is processed in topological order by calling the backward() method of each Function object and passing the returned gradients on to the next Function.

In general, the only way users interact with Function is by creating a subclass and defining new operations. This is the recommended way to extend torch.autograd.

Considerations for creating subclasses

  • Subclasses must override forward() and backward(), and both must be static methods, defined with the @staticmethod decorator.
  • forward() must accept a context ctx as its first argument; the context can be used to store tensors that will be retrieved during backpropagation. It may be followed by any number of arguments (tensors or other types).
  • backward() must accept a context ctx as its first argument; the context is used to retrieve the tensors saved during the forward pass.
  • backward() receives as many gradient arguments as forward() returned outputs, and must return as many gradients as forward() had inputs. The operation is invoked with ClassName.apply(args).

Example 1: A custom ReLU activation function

class MyReLU(torch.autograd.Function):
"""
We can implement our own custom autograd Functions by subclassing
torch.autograd.Function and implementing the forward and backward passes
which operate on Tensors.
"""

    @staticmethod
    def forward(ctx, input):
        """ In the forward pass we receive a Tensor containing the input and return a Tensor containing the output. ctx is a context object that can be used to stash information for backward computation. You can cache arbitrary objects for use in the backward pass using the ctx.save_for_backward method. """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """ In the backward pass we receive a Tensor containing the gradient of the loss with respect to the output, and we need to compute the gradient of the loss with respect to the input. """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input
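Usage sketch: as noted above, a custom Function is invoked through its apply method.

x = torch.randn(5, requires_grad=True)
y = MyReLU.apply(x)
y.sum().backward()
print(x.grad)   # 1 where the input was >= 0, 0 where it was negative (per the backward above)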

Example 2: Custom OHEMHingeLoss loss function

# from the https://github.com/yjxiong/action-detection
class OHEMHingeLoss(torch.autograd.Function):
    """ This class is the core implementation for the completeness loss in paper. It compute class-wise hinge loss and performs online hard negative mining (OHEM). """

    @staticmethod
    def forward(ctx, pred, labels, is_positive, ohem_ratio, group_size):
        n_sample = pred.size()[0]
        assert n_sample == len(labels), "mismatch between sample size and label size"
        losses = torch.zeros(n_sample)
        slopes = torch.zeros(n_sample)
        for i in range(n_sample):
            losses[i] = max(0, 1 - is_positive * pred[i, labels[i] - 1])
            slopes[i] = -is_positive if losses[i] != 0 else 0

        losses = losses.view(-1, group_size).contiguous()
        sorted_losses, indices = torch.sort(losses, dim=1, descending=True)
        keep_num = int(group_size * ohem_ratio)
        loss = torch.zeros(1).cuda()
        for i in range(losses.size(0)):
            loss += sorted_losses[i, :keep_num].sum()
        ctx.loss_ind = indices[:, :keep_num]
        ctx.labels = labels
        ctx.slopes = slopes
        ctx.shape = pred.size()
        ctx.group_size = group_size
        ctx.num_group = losses.size(0)
        return loss

    @staticmethod
    def backward(ctx, grad_output):
        labels = ctx.labels
        slopes = ctx.slopes

        grad_in = torch.zeros(ctx.shape)
        for group in range(ctx.num_group):
            for idx in ctx.loss_ind[group]:
                loc = idx + group * ctx.group_size
                grad_in[loc, labels[loc] - 1] = slopes[loc] * grad_output.data[0]
        return torch.autograd.Variable(grad_in.cuda()), None, None, None, None