
1. The Xavier method

1.1 Basic theory of the Xavier parameter initialization method

Under the Glorot condition, the variance of the activations should stay the same before and after the data flows through each layer in forward propagation, and the variance of the gradients should stay the same before and after they flow through each layer in back propagation. The former is called the forward propagation condition and the latter the back propagation condition. When data flows into a layer, the value received by a neuron in that layer can be written as:

z_j = Σ_i w_i · x_i
Here z_j is the value received by a neuron in the current layer, x_i is the data passed from a neuron in the previous layer, and w_i is the weight on the connection between those two neurons. If we treat z_j, x_i and w_i as random variables, the formula above describes the general case of this computation. The variance of the product w_i · x_i can be computed as:

Var(w_i · x_i) = E(w_i)² · Var(x_i) + E(x_i)² · Var(w_i) + Var(w_i) · Var(x_i)
Here Var(·) denotes the variance and E(·) the mean. Since we assume the parameters follow either a uniform distribution with mean 0 or a normal distribution with mean 0, E(w_i) = 0; and since, as introduced earlier, the input data is assumed to be zero-centered, E(x_i) = 0. The formula above therefore simplifies to:

Var(w_i · x_i) = Var(w_i) · Var(x_i)
The x's and w's can be considered independently and identically distributed (one is collected and preprocessed data, the other is randomly generated parameters), so each product Var(w_i) · Var(x_i) is also identically distributed. All the w_i can therefore be represented by a single random variable w, and all the x_i by a single random variable x, which gives:

Var(z) = n · Var(w) · Var(x)
where n is the number of neurons in the previous layer. Note that this only covers forward propagation; in back propagation the process runs in the opposite direction: z represents the value received by a neuron in the previous layer, and x the data passed back from the current layer. The formula itself is unchanged, but the meaning of n changes. To distinguish the two, the n in forward propagation (the number of neurons in the previous layer) is written n_in, and the n in back propagation (the number of neurons in the current layer) is written n_out. The forward condition Var(z) = Var(x) would require Var(w) = 1/n_in, while the back propagation condition would require Var(w) = 1/n_out; to take both into account, the final variance of w is taken as the compromise based on the average of the two neuron counts:

Var(w) = 2 / (n_in + n_out)
More rigorously, the number of neurons feeding into a layer and the number of neurons in the next layer are referred to as fan in and fan out, so the Xavier initialization variance can also be written as Var(w) = 2 / (fan_in + fan_out). In addition, Xavier points out in the paper that the variance of the activations and of the gradients should stay consistent across layers during propagation, which is known as the Glorot condition.
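As a quick numerical check (a small sketch added here, not code from the original article), PyTorch's nn.init.xavier_uniform_ draws weights from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)), whose variance is exactly 2 / (fan_in + fan_out); the layer sizes below are arbitrary:

import torch
from torch import nn

fan_in, fan_out = 100, 50
linear = nn.Linear(fan_in, fan_out)

# Xavier uniform: U(-a, a) with a = sqrt(6 / (fan_in + fan_out)),
# so Var(w) = a^2 / 3 = 2 / (fan_in + fan_out)
nn.init.xavier_uniform_(linear.weight)

theoretical_var = 2 / (fan_in + fan_out)
empirical_var = linear.weight.var().item()
print(theoretical_var, empirical_var)   # the two values should be close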

2. Modeling with the sigmoid activation function using Xavier initialization

import matplotlib as mpl
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

# numpy
import numpy as np

# pytorch
import torch
from torch import nn, optim
import torch.nn.functional as F
from torch.utils.data import Dataset, TensorDataset, DataLoader
from torch.utils.data import random_split
from torch.utils.tensorboard import SummaryWriter

The self-built helper functions used in this article are listed below.

# Regression dataset generator
def tensorGenReg(num_examples=1000,w=[2,-1,1],bias=True,delta=0.01,deg=1):
    
    """
    Regression dataset generator
    param num_examples: number of samples in the dataset
    param w: coefficient tensor of the features, including the intercept (if present)
    param bias: whether to include an intercept
    param delta: scale of the noise term
    param deg: degree of the equation
    return: the generated feature tensor and label tensor
    """
    if bias==True:
        
        num_inputs=len(w)-1   # number of features
        features_true=torch.randn(num_examples,num_inputs)   # feature tensor without the all-ones column
        w_true=torch.tensor(w[:-1]).reshape(-1,1).float()   # feature coefficients
        b_true=torch.tensor(w[-1]).float()   # intercept
        if num_inputs==1:
            labels_true=torch.pow(features_true,deg)*w_true+b_true
        else:
            labels_true=torch.mm(torch.pow(features_true,deg),w_true)+b_true
        features=torch.cat((features_true,torch.ones(len(features_true),1)),1)   # append the all-ones column
        labels=labels_true+torch.randn(size=labels_true.shape)*delta
    else:
        num_inputs=len(w)
        features=torch.randn(num_examples,num_inputs)
        w_true=torch.tensor(w).reshape(-1,1).float()
        if num_inputs==1:
            labels_true=torch.pow(features,deg)*w_true
        else:
            labels_true=torch.mm(torch.pow(features,deg),w_true)
        labels=labels_true+torch.randn(size=labels_true.shape)*delta
    return features,labels
class GenData(Dataset):
    def __init__(self,features,labels):
        self.features=features
        self.labels=labels
        self.lens=len(features)
    def __getitem__(self,index):
        return self.features[index,:],  self.labels[index]
    
    def __len__(self):
        return self.lens
    
def split_loader(features,labels,batch_size=10,rate=0.7):
    """
    Dataset wrapping, splitting and loading function
    param features: input feature tensor
    param labels: label tensor of the dataset
    param batch_size: size of each mini-batch when loading the data
    param rate: proportion of data used for the training set
    return: the loaded training set and test set
    """
    data=GenData(features,labels)
    num_train=int(data.lens*rate)   # use the rate parameter rather than a hard-coded 0.7
    num_test=data.lens-num_train
    data_train,data_test=random_split(data,[num_train,num_test])
    train_loader=DataLoader(data_train,batch_size=batch_size,shuffle=True)
    test_loader=DataLoader(data_test,batch_size=batch_size,shuffle=False)
    return (train_loader,test_loader)

class Sigmoid_class3(nn.Module):
    def __init__(self,in_features=2,n_hidden1=4,n_hidden2=4,n_hidden3=4,out_features=1,BN_model=None):
        super(Sigmoid_class3,self).__init__()
        self.linear1=nn.Linear(in_features,n_hidden1)
        self.normalize1=nn.BatchNorm1d(n_hidden1)
        self.linear2=nn.Linear(n_hidden1,n_hidden2)
        self.normalize2=nn.BatchNorm1d(n_hidden2)
        self.linear3=nn.Linear(n_hidden2,n_hidden3)
        self.normalize3=nn.BatchNorm1d(n_hidden3)
        self.linear4=nn.Linear(n_hidden3,out_features)
        self.BN_model=BN_model
        
    def forward(self,x):
        if self.BN_model==None:
            z1=self.linear1(x)
            p1=torch.sigmoid(z1)
            z2=self.linear2(p1)
            p2=torch.sigmoid(z2)
            z3=self.linear3(p2)
            p3=torch.sigmoid(z3)
            out=self.linear4(p3)
        elif self.BN_model=='pre':
            z1=self.normalize1(self.linear1(x))
            p1=torch.sigmoid(z1)
            z2=self.normalize2(self.linear2(p1))
            p2=torch.sigmoid(z2)
            z3=self.normalize3(self.linear3(p2))
            p3=torch.sigmoid(z3)
            out=self.linear4(p3)
        elif self.BN_model=='post':
            z1=self.linear1(x)
            p1=torch.sigmoid(z1)
            z2=self.linear2(self.normalize1(p1))
            p2=torch.sigmoid(z2)
            z3=self.linear3(self.normalize2(p2))
            p3=torch.sigmoid(z3)
            out=self.linear4(self.normalize3(p3))
        
        return out

class Sigmoid_class4(nn.Module):
    def __init__(self,in_features=2,n_hidden1=4,n_hidden2=4,n_hidden3=4,n_hidden4=4,out_features=1,BN_model=None):
        super(Sigmoid_class4,self).__init__()
        self.linear1=nn.Linear(in_features,n_hidden1)
        self.normalize1=nn.BatchNorm1d(n_hidden1)
        self.linear2=nn.Linear(n_hidden1,n_hidden2)
        self.normalize2=nn.BatchNorm1d(n_hidden2)
        self.linear3=nn.Linear(n_hidden2,n_hidden3)
        self.normalize3=nn.BatchNorm1d(n_hidden3)
        self.linear4=nn.Linear(n_hidden3,n_hidden4)
        self.normalize4=nn.BatchNorm1d(n_hidden4)
        self.linear5=nn.Linear(n_hidden4,out_features)
        self.BN_model=BN_model
        
    def forward(self,x):
        if self.BN_model==None:
            z1=self.linear1(x)
            p1=torch.sigmoid(z1)
            z2=self.linear2(p1)
            p2=torch.sigmoid(z2)
            z3=self.linear3(p2)
            p3=torch.sigmoid(z3)
            z4=self.linear4(p3)
            p4=torch.sigmoid(z4)
            out=self.linear5(p4)
            
        elif self.BN_model=='pre':
            z1=self.normalize1(self.linear1(x))
            p1=torch.sigmoid(z1)
            z2=self.normalize2(self.linear2(p1))
            p2=torch.sigmoid(z2)
            z3=self.normalize3(self.linear3(p2))
            p3=torch.sigmoid(z3)
            z4=self.normalize4(self.linear4(p3))
            p4=torch.sigmoid(z4)
            out=self.linear5(p4)
            
        elif self.BN_model=='post':
            z1=self.linear1(x)
            p1=torch.sigmoid(z1)
            z2=self.linear2(self.normalize1(p1))
            p2=torch.sigmoid(z2)
            z3=self.linear3(self.normalize2(p2))
            p3=torch.sigmoid(z3)
            z4=self.linear4(self.normalize3(p3))
            p4=torch.sigmoid(z4)
            out=self.linear5(self.normalize4(p4))
        return out
            
            
            
def fit(net,criterion,optimizer,batchdata,epochs=3,cla=False):
    """
    Model training function
    param net: model to be trained
    param criterion: loss function
    param optimizer: optimization algorithm
    param batchdata: training data loader
    param cla: whether this is a classification problem
    param epochs: number of passes over the data
    """
    for epoch in range(epochs):
        for X,y in batchdata:
            if cla==True:
                y=y.flatten().long()
            yhat=net.forward(X)
            loss=criterion(yhat,y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
def mse_cal(data_loader,net):
    """
    MSE computation function
    param data_loader: loaded data
    param net: model
    return: the MSE of the model on the input data
    """
    data=data_loader.dataset  # recover the underlying Dataset object
    X=data[:][0]
    y=data[:][1]
    yhat=net(X)
    return F.mse_loss(yhat,y)
    
def model_comparison(model_l
                    ,name_l
                     ,train_data
                     ,test_data
                     ,num_epochs=20
                     ,criterion=nn.MSELoss()
                     ,optimizer=optim.SGD
                     ,lr=0.03
                     ,cla=False
                     ,eva=mse_cal
                    ):
    
    
    """
    模型对比函数
    param  model_l:模型序列
    param name_l:模型名称序列
    param train_data:训练数据
    param test_data:测试数据
    param num_epochs:迭代轮数
    param criterion:损失函数
    param lr:学习率
    param cla:是否是分类模型
    param eva:模型评估指标
    return:评估指标张量矩阵    
    """
    train_l=torch.zeros(len(model_l),num_epochs)
    test_l=torch.zeros(len(model_l),num_epochs)
    # model training
    for epochs in range(num_epochs):
        for i ,model in enumerate(model_l):
            model.train()
            fit(net=model
               ,criterion=criterion
                ,optimizer=optimizer(model.parameters(),lr=lr)
                ,batchdata=train_data
                ,epochs=epochs
                ,cla=cla
               )
            model.eval()
            train_l[i][epochs]=eva(train_data,model).detach()
            test_l[i][epochs]=eva(test_data,model).detach()
    return train_l,test_l         

class tanh_class3(nn.Module):                                   
    def __init__(self, in_features=2, n_hidden1=4, n_hidden2=4, n_hidden3=4, out_features=1, BN_model=None):       
        super(tanh_class3, self).__init__()
        self.linear1 = nn.Linear(in_features, n_hidden1)
        self.normalize1 = nn.BatchNorm1d(n_hidden1)
        self.linear2 = nn.Linear(n_hidden1, n_hidden2)
        self.normalize2 = nn.BatchNorm1d(n_hidden2)
        self.linear3 = nn.Linear(n_hidden2, n_hidden3)
        self.normalize3 = nn.BatchNorm1d(n_hidden3)
        self.linear4 = nn.Linear(n_hidden3, out_features) 
        self.BN_model = BN_model
        
    def forward(self, x):
        if self.BN_model == None:
            z1 = self.linear1(x)
            p1 = torch.tanh(z1)
            z2 = self.linear2(p1)
            p2 = torch.tanh(z2)
            z3 = self.linear3(p2)
            p3 = torch.tanh(z3)
            out = self.linear4(p3)
        elif self.BN_model == 'pre':
            z1 = self.normalize1(self.linear1(x))
            p1 = torch.tanh(z1)
            z2 = self.normalize2(self.linear2(p1))
            p2 = torch.tanh(z2)
            z3 = self.normalize3(self.linear3(p2))
            p3 = torch.tanh(z3)
            out = self.linear4(p3)
        elif self.BN_model == 'post':
            z1 = self.linear1(x)
            p1 = torch.tanh(z1)
            z2 = self.linear2(self.normalize1(p1))
            p2 = torch.tanh(z2)
            z3 = self.linear3(self.normalize2(p2))
            p3 = torch.tanh(z3)
            out = self.linear4(self.normalize3(p3))
        return out
class Sigmoid_class2(nn.Module):                                   
    def __init__(self, in_features=2, n_hidden1=4, n_hidden2=4, out_features=1, BN_model=None):       
        super(Sigmoid_class2, self).__init__()
        self.linear1 = nn.Linear(in_features, n_hidden1)
        self.normalize1 = nn.BatchNorm1d(n_hidden1)
        self.linear2 = nn.Linear(n_hidden1, n_hidden2)
        self.normalize2 = nn.BatchNorm1d(n_hidden2)
        self.linear3 = nn.Linear(n_hidden2, out_features) 
        self.BN_model = BN_model
        
    def forward(self, x):
        if self.BN_model == None:
            z1 = self.linear1(x)
            p1 = torch.sigmoid(z1)
            z2 = self.linear2(p1)
            p2 = torch.sigmoid(z2)
            out = self.linear3(p2)
        elif self.BN_model == 'pre':
            z1 = self.normalize1(self.linear1(x))
            p1 = torch.sigmoid(z1)
            z2 = self.normalize2(self.linear2(p1))
            p2 = torch.sigmoid(z2)
            out = self.linear3(p2)
        elif self.BN_model == 'post':
            z1 = self.linear1(x)
            p1 = torch.sigmoid(z1)
            z2 = self.linear2(self.normalize1(p1))
            p2 = torch.sigmoid(z2)
            out = self.linear3(self.normalize2(p2))
        return out
    
class ReLU_class3(nn.Module):                                   
    def __init__(self, in_features=2, n_hidden1=4, n_hidden2=4, n_hidden3=4, out_features=1, bias=True, BN_model=None):       
        super(ReLU_class3, self).__init__()
        self.linear1 = nn.Linear(in_features, n_hidden1, bias=bias)
        self.normalize1 = nn.BatchNorm1d(n_hidden1)
        self.linear2 = nn.Linear(n_hidden1, n_hidden2, bias=bias)
        self.normalize2 = nn.BatchNorm1d(n_hidden2)
        self.linear3 = nn.Linear(n_hidden2, n_hidden3, bias=bias)
        self.normalize3 = nn.BatchNorm1d(n_hidden3)
        self.linear4 = nn.Linear(n_hidden3, out_features, bias=bias)
        self.BN_model = BN_model
        
    def forward(self, x):  
        if self.BN_model == None:
            z1 = self.linear1(x)
            p1 = torch.relu(z1)
            z2 = self.linear2(p1)
            p2 = torch.relu(z2)
            z3 = self.linear3(p2)
            p3 = torch.relu(z3)
            out = self.linear4(p3)
        elif self.BN_model == 'pre':
            z1 = self.normalize1(self.linear1(x))
            p1 = torch.relu(z1)
            z2 = self.normalize2(self.linear2(p1))
            p2 = torch.relu(z2)
            z3 = self.normalize3(self.linear3(p2))
            p3 = torch.relu(z3)
            out = self.linear4(p3)
        elif self.BN_model == 'post':
            z1 = self.linear1(x)
            p1 = torch.relu(z1)
            z2 = self.linear2(self.normalize1(p1))
            p2 = torch.relu(z2)
            z3 = self.linear3(self.normalize2(p2))
            p3 = torch.relu(z3)
            out = self.linear4(self.normalize3(p3))
        return out

# Build the dataset
features, labels = tensorGenReg(w=[2, -1], bias=False, deg=2)
train_loader, test_loader = split_loader(features, labels)

# Initial core parameters
lr = 0.03
num_epochs = 20
torch.manual_seed(420)

# Instantiate the models
sigmoid_model3 = Sigmoid_class3()            # keeps the original default parameters
sigmoid_model3_init = Sigmoid_class3()

# Initialize the parameters with Xavier
for m in sigmoid_model3_init.modules():
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)

# Model containers
model_l = [sigmoid_model3, sigmoid_model3_init]
name_l = ['sigmoid_model3', 'sigmoid_model3_init']

def weights_vp(model, att='grad'):
    """
    Violin plot of the parameters or gradients of each layer
    param model: the model to observe
    param att: choose gradients ('grad') or weights ('weights')
    return: the violin plot
    """
    vp = []
    for i, m in enumerate(model.modules()):
        if isinstance(m, nn.Linear):
            if att == 'grad':
                vp_x = m.weight.grad.detach().reshape(-1, 1).numpy()
            else:
                vp_x = m.weight.detach().reshape(-1, 1).numpy()
            vp_y = np.full_like(vp_x, i)
            vp_a = np.concatenate((vp_x, vp_y), 1)
            vp.append(vp_a)
    vp_r = np.concatenate((vp), 0)
    ax = sns.violinplot(y=vp_r[:, 0], x=vp_r[:, 1])
    ax.set(xlabel='num_hidden', title=att)

train_l, test_l = model_comparison(model_l=model_l
                                   , name_l=name_l
                                   , train_data=train_loader
                                   , test_data=test_loader
                                   , num_epochs=2
                                   , criterion=nn.MSELoss()
                                   , optimizer=optim.SGD
                                   , lr=lr
                                   , cla=False
                                   , eva=mse_cal)

weights_vp(sigmoid_model3, att='grad')

weights_vp(sigmoid_model3_init,att='grad')

With num_epochs set to 2 (after only a couple of iterations), the gradients of the Xavier-initialized model are noticeably more stable and show no sign of vanishing, whereas in the original sigmoid_model3 the gradients of the first layer are already very small and tending to vanish. As we know, the gradients of each layer reflect the learning state of the model: all layers of the initialized model are clearly in a stable learning state, so the model converges quickly at this stage. We can also verify this with the MSE curves.

train_l, test_l = model_comparison(model_l=model_l
                                   , name_l=name_l
                                   , train_data=train_loader
                                   , test_data=test_loader
                                   , num_epochs=num_epochs
                                   , criterion=nn.MSELoss()
                                   , optimizer=optim.SGD
                                   , lr=lr
                                   , cla=False
                                   , eva=mse_cal)

# Training error
for i, name in enumerate(name_l):
    plt.plot(list(range(num_epochs)), train_l[i], label=name)
plt.legend(loc=1)
plt.title('mse_train')

# Test error
for i, name in enumerate(name_l):
    plt.plot(list(range(num_epochs)), test_l[i], label=name)
plt.legend(loc=1)
plt.title('mse_test')

The core role of Xavier initialization is to keep the distribution of gradient values smooth across layers, so that every layer keeps learning effectively. In terms of final model performance, this translates into higher learning efficiency and faster convergence after the parameters are initialized with Xavier.
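As a simpler numerical complement to the violin plots (a small sketch added here; grad_std_report is a hypothetical helper, and it assumes a backward pass has already been run, as in the training above), the spread of each linear layer's gradients can also be printed directly:

# Print the standard deviation of each linear layer's weight gradient;
# after Xavier initialization these values should stay on a similar scale across layers
def grad_std_report(model):
    for name, param in model.named_parameters():
        if param.grad is not None and param.dim() > 1:   # weight matrices only
            print(f'{name}: grad std = {param.grad.std().item():.6f}')

grad_std_report(sigmoid_model3_init)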

2.1 Xavier initialization effects in extreme cases

In some extreme cases the effect of Xavier initialization is even more pronounced. Using a neural network with four sigmoid hidden layers as an example, we observe how Xavier initialization helps avoid the vanishing gradient problem.

torch.manual_seed(420)

sigmoid_model4 = Sigmoid_class4()
sigmoid_model4_init = Sigmoid_class4()

# Use Xavier to initialize the parameters
for m in sigmoid_model4_init.modules():
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)

# Create the model containers
model_l = [sigmoid_model4, sigmoid_model4_init]
name_l = ['sigmoid_model4', 'sigmoid_model4_init']

# Core parameters
lr = 0.03
num_epochs = 40

# Model training
train_l, test_l = model_comparison(model_l=model_l
                                   , name_l=name_l
                                   , train_data=train_loader
                                   , test_data=test_loader
                                   , num_epochs=num_epochs
                                   , criterion=nn.MSELoss()
                                   , optimizer=optim.SGD
                                   , lr=lr
                                   , cla=False
                                   , eva=mse_cal)
# Training error
for i, name in enumerate(name_l):
    plt.plot(list(range(num_epochs)), train_l[i], label=name)
plt.legend(loc=1)
plt.title('mse_train')

# Test error
for i, name in enumerate(name_l):
    plt.plot(list(range(num_epochs)), test_l[i], label=name)
plt.legend(loc=1)
plt.title('mse_test')

In earlier experiments sigmoid_model4 suffered from severe vanishing gradients: its first few layers had essentially lost the ability to learn, so the model itself performed poorly. With Xavier initialization, however, the initialized model avoids the vanishing gradient problem far more effectively.

3. Modeling with the tanh activation function using Xavier initialization

Compared with the sigmoid activation function, the Xavier initialization method is better suited to the tanh activation function. The core reason is that tanh itself produces zero-centered outputs, so together with the parameters generated by Xavier initialization it is easier to keep the gradients smooth and the learning stable at every layer.
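To illustrate this point (a small sketch added here, using randomly generated symmetric data rather than the article's dataset): on zero-centered inputs, tanh keeps the output roughly zero-centered, while sigmoid shifts it toward 0.5:

import torch

torch.manual_seed(420)
x = torch.randn(10000)                  # symmetric, zero-centered input

print(torch.tanh(x).mean().item())      # close to 0
print(torch.sigmoid(x).mean().item())   # close to 0.5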

  • A neural network with three tanh hidden layers is used to test the effect of Xavier initialization.
# Rebuild the dataset
features, labels = tensorGenReg(w=[2, -1], bias=False, deg=2)
train_loader, test_loader = split_loader(features, labels)
torch.manual_seed(420)

# Instantiate the models
tanh_model3 = tanh_class3()            # keeps the original default parameters
tanh_model3_init = tanh_class3()

# Initialize the parameters with Xavier
for m in tanh_model3_init.modules():
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)

# Create the model containers
model_l = [tanh_model3, tanh_model3_init]
name_l = ['tanh_model3', 'tanh_model3_init']

# Model training
train_l, test_l = model_comparison(model_l = model_l
                                   , name_l = name_l
                                   , train_data = train_loader
                                   , test_data = test_loader
                                   , num_epochs = 2
                                   , criterion = nn.MSELoss()
                                   , optimizer = optim.SGD
                                   , lr = lr
                                   , cla = False
                                   , eva = mse_cal)

weights_vp(tanh_model3, att="grad")

weights_vp(tanh_model3_init, att="grad")

It can be seen that the gradients of the Xavier-initialized model are more stable, so we expect the initialized model to converge faster during the first iterations.

torch.manual_seed(420)

# Instantiate the models
tanh_model3 = tanh_class3()            # keeps the original default parameters
tanh_model3_init = tanh_class3()

# Initialize the parameters with Xavier
for m in tanh_model3_init.modules():
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)

# Create the model containers
model_l = [tanh_model3, tanh_model3_init]
name_l = ['tanh_model3', 'tanh_model3_init']

# Model training
train_l, test_l = model_comparison(model_l = model_l
                                   , name_l = name_l
                                   , train_data = train_loader
                                   , test_data = test_loader
                                   , num_epochs = num_epochs
                                   , criterion = nn.MSELoss()
                                   , optimizer = optim.SGD
                                   , lr = lr
                                   , cla = False
                                   , eva = mse_cal)

# Training error
for i, name in enumerate(name_l):
    plt.plot(list(range(num_epochs)), train_l[i], label=name)
plt.legend(loc = 1)
plt.title('mse_train')

# Test error
for i, name in enumerate(name_l):
    plt.plot(list(range(num_epochs)), test_l[i], label=name)
plt.legend(loc = 1)
plt.title('mse_test')

After a few iterations, the Xavier-initialized model converges faster and is more stable.

4. The Kaiming method (He initialization)

4.1 Basics of He initialization

  • Although Xavier initialization works to some extent for networks built from sigmoid and tanh activation functions, ReLU is an unsaturated activation function and does not suffer from the vanishing or exploding gradients that sigmoid and tanh can cause. Instead, precisely because ReLU is unsaturated, stacking ReLU layers can easily cause neurons to die (lose their activity), a problem that Xavier initialization obviously cannot solve.
  • Setting reasonable initial parameter values is still an effective way to keep the model valid, and it can also alleviate the dying-neuron problem of ReLU to some extent. The standard initialization method for ReLU activation functions today comes from He Kaiming's 2015 paper "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification"; it is called He initialization, or the Kaiming method. A short numerical sketch of its variance rule follows this list.
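For reference, a minimal numerical sketch of the rule behind He initialization (added here, not part of the original article): for ReLU, the weight variance is set to 2 / fan_in (in fan_in mode), i.e. the Xavier-style 1/fan term scaled by a gain of sqrt(2), so the empirical standard deviation of weights drawn with nn.init.kaiming_normal_ should match sqrt(2 / fan_in). The layer sizes below are arbitrary:

import math
import torch
from torch import nn

fan_in, fan_out = 200, 100
linear = nn.Linear(fan_in, fan_out)

# He (Kaiming) normal initialization for ReLU: std = sqrt(2 / fan_in)
nn.init.kaiming_normal_(linear.weight, mode='fan_in', nonlinearity='relu')

theoretical_std = math.sqrt(2 / fan_in)
empirical_std = linear.weight.std().item()
print(theoretical_std, empirical_std)   # the two values should be close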

4.2 Parameters of the Kaiming method

  • mode: selects whether fan_in or fan_out is used in the variance calculation. As discussed above, in theory the two choices have no significant effect on modeling and either can be used; in practice, because models differ, the results can differ slightly, so choose according to the actual effect.
  • a: the correction coefficient (negative slope) used when a ReLU variant is the activation function;
  • nonlinearity: the name of the (variant) ReLU activation function, used together with the a parameter; see the sketch after this list.
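A hedged sketch of how these parameters map onto PyTorch's Kaiming initialization functions (the layer and the parameter values below are arbitrary examples):

import torch
from torch import nn

layer = nn.Linear(4, 4)

# Plain ReLU, using the fan_in of the weight in the variance calculation
nn.init.kaiming_uniform_(layer.weight, mode='fan_in', nonlinearity='relu')

# Leaky ReLU with negative slope 0.1: pass the slope through a
# and name the variant through nonlinearity
nn.init.kaiming_uniform_(layer.weight, a=0.1, mode='fan_out', nonlinearity='leaky_relu')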
torch.manual_seed(420)

# Create a polynomial regression dataset whose highest-degree term is 2
features, labels = tensorGenReg(w=[2, -1], bias=False, deg=2)
train_loader, test_loader = split_loader(features, labels)

# Initial core parameters
lr = 0.001
num_epochs = 20
torch.manual_seed(420)

# Instantiate the models
relu_model3 = ReLU_class3()            # keeps the original default parameters
relu_model3_init = ReLU_class3()

# Initialize the parameters with the Kaiming (He) method
for m in relu_model3_init.modules():
    if isinstance(m, nn.Linear):
        nn.init.kaiming_uniform_(m.weight)

# Create the model containers
model_l = [relu_model3, relu_model3_init]
name_l = ['relu_model3', 'relu_model3_init']
train_l, test_l = model_comparison(model_l = model_l
                                   , name_l = name_l
                                   , train_data = train_loader
                                   , test_data = test_loader
                                   , num_epochs = num_epochs
                                   , criterion = nn.MSELoss()
                                   , optimizer = optim.SGD
                                   , lr = lr
                                   , cla = False
                                   , eva = mse_cal)

# Training error
for i, name in enumerate(name_l):
    plt.plot(list(range(num_epochs)), train_l[i], label=name)
plt.legend(loc = 1)
plt.title('mse_train')

# Test error
for i, name in enumerate(name_l):
    plt.plot(list(range(num_epochs)), test_l[i], label=name)
plt.legend(loc = 1)
plt.title('mse_test')

After He initialization, the model's convergence rate improves significantly.