Abstract: This article shares the practical application of THOR. Part of the THOR algorithm is currently open source in MindSpore.

This article is shared from the Huawei Cloud community article “MindSpore self-developed advanced optimizer source code analysis and practical application”, original author: HWCloudAI.

This article shares the practical application of THOR with you. Part of the THOR algorithm has been open-sourced in MindSpore:

Gitee.com/mindspore/m…

Using THOR to train a network in MindSpore is very simple. The following four lines of code show how to train a network with THOR.

net = Net()                                                      # create the network
opt = THOR(net, lr, Tensor(damping), config.momentum, config.weight_decay,
           config.loss_scale, config.batch_size)                 # define the THOR optimizer
model = ConvertModelUtils().convert_to_thor_model(model=model, network=net, loss_fn=loss,
                                                  optimizer=opt, loss_scale_manager=loss_scale,
                                                  metrics={'acc'}, amp_level="O2",
                                                  keep_batchnorm_fp32=False)   # add the extra computation graph
model.train(config.epoch_size, dataset, callbacks=cb,
            sink_size=dataset.get_dataset_size(), dataset_sink_mode=True)      # train the network
  • Import the packages required by THOR, the second-order optimizer
  • The first line of code creates the network normally
  • The second line defines the optimizer we use, THOR
  • The third line of code adds an extra computation graph so that THOR can achieve better performance
  • The fourth line trains the network

Let’s expand on that a little. First, import the second-order optimizer package required by MindSpore, which is located in mindspore.nn.optim.

Then create the network you need. Next, define the THOR optimizer, passing in the network and the hyperparameters THOR requires (such as the learning rate and the regularization coefficient).

Then call the convert_to_thor_model function, which lets THOR achieve better performance by adding an extra computation graph. What does this mean? The network itself is a computation graph when it runs, and THOR reuses outdated (stale) second-order information: the steps that update the second-order matrices and the steps that do not are executed on two separate graphs, which yields better performance. (PS: MindSpore supports both dynamic and static graphs; static graph mode is used here for better performance, see mindspore-website.obs.cn-north-4.myhuaweicloud.com/white_pape…)
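To make this concrete, here is a toy sketch of what the two graphs amount to. It is plain Python for illustration only; the helper names and the frequency value are made up, not MindSpore APIs. The expensive second-order matrices are refreshed every frequency steps, and the cached, slightly stale matrices are reused on all other steps.

import numpy as np

rng = np.random.default_rng(0)

def refresh_second_order():
    # stands in for recomputing A, G and their inverses (the costly part of THOR)
    return rng.standard_normal((4, 4))

def train_step_with(preconditioner):
    # stands in for an ordinary training step that only reuses the cached matrices
    return float(np.sum(preconditioner))

frequency = 100                            # illustrative value only
cached = None
for step in range(1000):
    if step % frequency == 0:
        cached = refresh_second_order()    # "graph 1": update the second-order matrices
    train_step_with(cached)                # "graph 2": reuse the stale second-order matrices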

Finally, call model.train to start the training. That was a brief introduction to how THOR is used; next, let's look at its source code.

Source code analysis

The init function performs THOR's initialization; it takes the hyperparameters and the network structure THOR requires. THOR supports both GPU and Ascend, implemented as **class THOR_GPU(Optimizer)** and **class THOR_Ascend(Optimizer)**; the main difference between the two classes lies in the operators they use. Below we take **class THOR_Ascend(Optimizer)** as the example.

class THOR_Ascend(Optimizer):
    def __init__(self, net, learning_rate, damping, momentum, weight_decay=0.0, loss_scale=1.0, batch_size=32,
                 decay_filter=lambda x: x.name not in [], split_indices=None):
        params = filter(lambda x: x.requires_grad, net.get_parameters())
        super(THOR_Ascend, self).__init__(learning_rate, params, weight_decay, loss_scale)
        if isinstance(momentum, float) and momentum < 0.0:
            raise ValueError("momentum should be at least 0.0, but got momentum {}".format(momentum))
        self.momentum = Parameter(Tensor(momentum, mstype.float32), name="momentum")
        self.params = self.parameters
        self.moments = self.params.clone(prefix="moments", init='zeros')
        self.hyper_map = C.HyperMap()
        self.opt = P.ApplyMomentum()
        self.net = net
        self.matrix_A_cov = ParameterTuple(filter(lambda x: 'matrix_A' in x.name, net.get_parameters()))
        self.matrix_G_cov = ParameterTuple(filter(lambda x: 'matrix_G' in x.name, net.get_parameters()))
        ...

All optimizers in MindSpore inherit from the base class Optimizer, which defines some basic functions (such as getting the learning rate, gradient scaling, etc.). During initialization, THOR stores the hyperparameters passed in as class attributes for convenient access, and defines the operators to be used in subsequent calculations.

In other words, the job of the initialization function is to define the operators and variables (Parameter, Tensor, etc.) needed for THOR's computation.

Let's focus on self.matrix_A_cov and self.matrix_G_cov. These two variables hold the information needed to compute the second-order matrices: the covariance matrix of each layer's input and the covariance matrix of the first-order derivative of each layer's output, respectively, which are saved during the forward and backward passes at run time.
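For intuition, here is a small NumPy sketch of what these per-layer statistics correspond to for a fully connected layer y = x @ W. The shapes and the exact normalization are illustrative; in MindSpore these matrices are accumulated by the THOR-aware layers during the forward and backward passes rather than computed like this.

import numpy as np

batch, n_in, n_out = 32, 128, 64
x = np.random.randn(batch, n_in)      # layer input, saved during the forward pass
dy = np.random.randn(batch, n_out)    # gradient of the loss w.r.t. the layer output, saved during the backward pass

matrix_A = x.T @ x / batch            # (n_in, n_in)   covariance of the layer input
matrix_G = dy.T @ dy / batch          # (n_out, n_out) covariance of the output gradient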

Now let’s look at the parameters used to create THOR:

net: the network (model) to be trained;
learning_rate: the learning-rate hyperparameter;
damping: hyperparameter of the regularization term added to the second-order matrices;
momentum: the momentum hyperparameter;
weight_decay: weight decay, used to prevent overfitting; defaults to 0.0, i.e. no weight decay is applied;
loss_scale: used to scale the loss during training to prevent gradient overflow; defaults to 1.0, i.e. no scaling;
batch_size: the amount of data used in one training step; defaults to 32;
decay_filter: selects the layers to which weight_decay is applied; takes effect when weight_decay > 0;
split_indices: used to speed up the AllReduce communication process.
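Putting these parameters together, a typical construction looks like the following sketch. The concrete values are hypothetical, and net, lr and damping are assumed to have been created as in the snippet at the beginning of the article.

from mindspore import Tensor
from mindspore.nn.optim import THOR

opt = THOR(net, lr, Tensor(damping),          # network, learning-rate schedule, damping schedule
           0.9,                               # momentum (hypothetical value)
           weight_decay=5e-4,                 # enable weight decay (hypothetical value)
           loss_scale=128.0,
           batch_size=32,
           decay_filter=lambda x: 'bias' not in x.name.lower(),  # e.g. exclude bias parameters from weight decay
           split_indices=None)                # None is the default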

The _get_Ainv_Ginv_Amax_Gmax_list function computes the inverses of the covariance matrices A and G and returns them. The specific process is to traverse all layers of the model and handle them layer by layer: add the regularization (damping) term to each layer's covariance matrix, then apply Cholesky decomposition to obtain the inverse. The currently open-sourced THOR code supports fully connected layers and convolutional layers.

def _get_Ainv_Ginv_Amax_Gmax_list(self, gradients, damping_step, matrix_a_allreduce, matrix_g_allreduce,
                                      matrix_a_max_allreduce, matrix_g_max_allreduce):
        """get matrixA inverse list, matrixG inverse list, matrixA_max list, matrixG_max list"""
        for i in range(len(self.params)):
            thor_layer_count = self.weight_fim_idx_map[i]
            conv_layer_count = self.weight_conv_idx_map[i]
            layer_type = self.weight_layerType_idx_map[i]
            if layer_type in [Conv, FC, Embedding]:
                g = gradients[i]
                matrix_A = self.matrix_A_cov[thor_layer_count]
                matrix_G = self.matrix_G_cov[thor_layer_count]
                matrix_A = F.depend(matrix_A, g)
                matrix_G = F.depend(matrix_G, g)
                A_shape = self.shape(matrix_A)
                A_eye = self.eye(A_shape[0], A_shape[0], mstype.float32)
                G_shape = self.shape(matrix_G)
                G_eye = self.eye(G_shape[0], G_shape[0], mstype.float32)
                if layer_type == Conv:
                    ...
                elif layer_type == FC:
                    matrix_A = matrix_A + damping * A_eye
                    matrix_A_inv = self.cholesky(matrix_A)
                    matrix_A_inv = self.vector_matmul(matrix_A_inv, matrix_A_inv)
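Numerically, the FC branch above amounts to damping the covariance and inverting it through its Cholesky factor, since A + damping·I = L·Lᵀ implies (A + damping·I)⁻¹ = L⁻ᵀ·L⁻¹. The following NumPy sketch mirrors that math only; the Ascend operators behind self.cholesky and self.vector_matmul are custom kernels.

import numpy as np

n, damping = 8, 0.03
rng = np.random.default_rng(0)
x = rng.standard_normal((32, n))
matrix_A = x.T @ x / 32                        # a symmetric positive semi-definite covariance

A_damped = matrix_A + damping * np.eye(n)      # matrix_A + damping * A_eye
L = np.linalg.cholesky(A_damped)               # lower-triangular Cholesky factor
L_inv = np.linalg.inv(L)
matrix_A_inv = L_inv.T @ L_inv                 # equals inv(A_damped)

assert np.allclose(matrix_A_inv, np.linalg.inv(A_damped))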

The _get_second_gradients function computes the final update direction of the parameters. The update direction formula in the paper is

w ← w − α·F⁻¹·∇L,  with the layer-wise approximation F ≈ A ⊗ G,

so what the code actually implements for each layer is

∇W ← G⁻¹ · ∇W · A⁻¹,

and the code is as follows:

def _get_second_gradients(self, new_grads, damping_step, gradients):
        """get second gradients for thor"""
        params_len = len(self.params)
        for i in range(params_len):
            ...
            else:
                ...
                elif layer_type == FC:
                    temp_a = self.matrix_A_cov[thor_layer_count]
                    temp_g = self.matrix_G_cov[thor_layer_count]
                    temp_a = self.cast(temp_a, mstype.float16)
                    temp_g = self.cast(temp_g, mstype.float16)
                    g = self.cast(g, mstype.float16)
                    g = self.matmul(temp_g, g)
                    g = self.matmul(g, temp_a)
                    g = self.cast(g, mstype.float32)
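For an FC layer, the two matmuls above replace the first-order gradient ∇W with the preconditioned direction G⁻¹·∇W·A⁻¹, which is the Kronecker-factored form of (A ⊗ G)⁻¹·vec(∇W). The following NumPy check of that identity uses illustrative shapes.

import numpy as np

rng = np.random.default_rng(0)
n_out, n_in = 4, 6
dW = rng.standard_normal((n_out, n_in))        # first-order gradient of an FC weight

def random_spd_inverse(n):
    # a symmetric positive-definite inverse, standing in for the damped covariance inverse
    m = rng.standard_normal((n, n))
    return np.linalg.inv(m @ m.T + 0.1 * np.eye(n))

A_inv, G_inv = random_spd_inverse(n_in), random_spd_inverse(n_out)

two_matmuls = G_inv @ dW @ A_inv               # what the code does layer by layer
kron_form = (np.kron(A_inv, G_inv) @ dW.flatten(order='F')).reshape((n_out, n_in), order='F')

assert np.allclose(two_matmuls, kron_form)     # both give the same update direction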

The construct function is what is actually executed during network training. It contains the calls to **_get_Ainv_Ginv_Amax_Gmax_list** and **_get_second_gradients**; together they complete the computation of the second-order matrices and the adjustment of the gradient update direction.

def construct(self, gradients):
    params = self.params
    moments = self.moments
    damping_step = self.gather(self.damping, self.cov_step, self.axis)
    damping_step = self.cast(damping_step, mstype.float32)
    if self.thor:
        matrix_A_allreduce = ()
        matrix_G_allreduce = ()
        matrix_A_max_allreduce = ()
        matrix_G_max_allreduce = ()
        matrix_A_allreduce, matrix_G_allreduce, matrix_A_max_allreduce, matrix_G_max_allreduce = \
            self._get_Ainv_Ginv_Amax_Gmax_list(gradients, damping_step, matrix_A_allreduce,
                                               matrix_G_allreduce, matrix_A_max_allreduce,
                                               matrix_G_max_allreduce)
        new_grads = ()
        for i in range(len(self.params)):
            ...
            if self.conv_layer_count > 0:
                ...
            else:
                if layer_type == Embedding:
                    ...
                elif layer_type == FC:
                    temp_a = matrix_A_allreduce[thor_layer_count]
                    temp_g = matrix_G_allreduce[thor_layer_count]
                    fake_A = self.assign(self.matrix_A_cov[thor_layer_count], temp_a)
                    fake_G = self.assign(self.matrix_G_cov[thor_layer_count], temp_g)
                    g = F.depend(g, fake_A)
                    g = F.depend(g, fake_G)
                    temp_a = self.cast(temp_a, mstype.float16)
                    temp_g = self.cast(temp_g, mstype.float16)
                    g = self.cast(g, mstype.float16)
                    g = self.matmul(temp_g, g)
                    g = self.matmul(g, temp_a)    # change the first-order direction to the second-order direction
                    g = self.cast(g, mstype.float32)
                elif layer_type == LayerNorm:
                    g = self._process_layernorm(damping_step, g)
            new_grads = new_grads + (g,)
        gradients = new_grads
    else:
        new_grads = ()
        ...
        gradients = self._get_second_gradients(new_grads, damping_step, gradients)  # call _get_second_gradients to compute the update direction
        ...

Practical application of THOR

In this section we share the practical application of THOR through two examples, ResNet50 and BERT. The code for both examples is open source; the links are as follows: ResNet50: gitee.com/mindspore/m…

BERT:gitee.com/mindspore/m…

ResNet50[1]

The optimizer is invoked in the same way as described at the beginning of this article; in this example the full training process is expanded.

First, the training set required for network training is created and the network is defined as ResNet50. Then the hyperparameter schedules needed by THOR are set (the values of the other hyperparameters can be modified in src/config.py). The THOR optimizer is then created and the configured hyperparameter values are passed in. Next the model is converted so that the second-order information is saved. Finally, the network can be trained.

from mindspore.nn.optim import Momentum, THOR
from src.resnet import resnet50 as resnet
from mindspore.train.model import Model
...
if __name__ == '__main__':
    ...
    # create the training dataset
    dataset = create_dataset(dataset_path=args_opt.dataset_path, do_train=True, repeat_num=1,
                             batch_size=config.batch_size, target=target,
                             distribute=args_opt.run_distribute)
    step_size = dataset.get_dataset_size()
    # create the ResNet50 model
    net = resnet(class_num=config.class_num)
    ...
    # init lr
    if cfg.optimizer == "Thor":
        from src.lr_generator import get_thor_lr
        lr = get_thor_lr(0, config.lr_init, config.lr_decay, config.lr_end_epoch, step_size, decay_epochs=39)
    # define loss and model
    if target == "Ascend":
        if args_opt.dataset == "imagenet2012":
            if not config.use_label_smooth:
                config.label_smooth_factor = 0.0
            loss = CrossEntropySmooth(sparse=True, reduction="mean",
                                      smooth_factor=config.label_smooth_factor,
                                      num_classes=config.class_num)
        else:
            loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
        loss_scale = FixedLossScaleManager(config.loss_scale, drop_overflow_update=False)
        model = Model(net, loss_fn=loss, optimizer=opt, loss_scale_manager=loss_scale, metrics={'acc'},
                      amp_level="O2", keep_batchnorm_fp32=False)
    if cfg.optimizer == "Thor" and args_opt.dataset == "imagenet2012":
        from src.lr_generator import get_thor_damping
        # set the damping schedule
        damping = get_thor_damping(0, config.damping_init, config.damping_decay, 70, step_size)
        # create the THOR optimizer
        opt = THOR(net, lr, Tensor(damping), config.momentum, config.weight_decay,
                   config.loss_scale, config.batch_size)
        # convert the model to save the second-order information
        model = ConvertModelUtils().convert_to_thor_model(model=model, network=net, loss_fn=loss, optimizer=opt,
                                                          loss_scale_manager=loss_scale, metrics={'acc'},
                                                          amp_level="O2", keep_batchnorm_fp32=False,
                                                          frequency=config.frequency)
    ...
    # train the network
    model.train(config.epoch_size - config.pretrain_epoch_size, dataset, callbacks=cb,
                sink_size=dataset.get_dataset_size(), dataset_sink_mode=dataset_sink_mode)
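Note that lr and damping here are per-step schedules: get_thor_lr and get_thor_damping return one value per training step, the damping array is wrapped in a Tensor, and construct picks out the value for the current step with self.gather(self.damping, self.cov_step, ...). A toy sketch of what such a decaying schedule could look like is shown below; the real generators live in src/lr_generator.py and may use different decay rules and values.

import numpy as np

def toy_decay_schedule(init_value, decay_rate, total_steps, steps_per_epoch):
    # one value per step, decayed smoothly per epoch; purely illustrative
    epochs = np.arange(total_steps) / steps_per_epoch
    return (init_value * decay_rate ** epochs).astype(np.float32)

steps_per_epoch = 5004                      # e.g. ImageNet with batch size 256
lr = toy_decay_schedule(0.05, 0.9, 40 * steps_per_epoch, steps_per_epoch)
damping = toy_decay_schedule(0.03, 0.9, 40 * steps_per_epoch, steps_per_epoch)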

Finally, enter the launch command and the script is ready to run.

BERT[2]

The steps for BERT are similar to ResNet50. First, the training set required for network training is created and the network is defined as BERT; then the hyperparameter schedules needed by THOR are set (the values of the other hyperparameters can be modified in src/config.py). The THOR optimizer is then created with the configured hyperparameter values; in this example, the parameters of the LayerNorm layers and the bias parameters (e.g. of the FC layers) are excluded from weight decay via decay_filter. Next the model is converted so that the second-order information is saved. Finally, the network can be trained.

from mindspore.nn.optim import Lamb, Momentum, AdamWeightDecay, THOR
from src import BertNetworkWithLoss
...
def _get_optimizer(args_opt, network):
    """get bert optimizer, support Lamb, Momentum, AdamWeightDecay."""
    if cfg.optimizer == 'Lamb':
        ...
    elif cfg.optimizer == "Thor":
        from src.utils import get_bert_thor_lr, get_bert_thor_damping
        # set the lr and damping schedules
        lr = get_bert_thor_lr(cfg.Thor.lr_max, cfg.Thor.lr_min, cfg.Thor.lr_power, cfg.Thor.lr_total_steps)
        damping = get_bert_thor_damping(cfg.Thor.damping_max, cfg.Thor.damping_min, cfg.Thor.damping_power,
                                        cfg.Thor.damping_total_steps)
        split_indices = None
        if bert_net_cfg.num_hidden_layers == 12:
            if bert_net_cfg.use_relative_positions:
                split_indices = [29, 58, 87, 116, 145, 174, 203, 217]
            else:
                split_indices = [28, 55, 82, 109, 136, 163, 190, 205]
        elif bert_net_cfg.num_hidden_layers == 24:
            if bert_net_cfg.use_relative_positions:
                split_indices = [30, 90, 150, 210, 270, 330, 390, 421]
            else:
                split_indices = [32, 398, 398, 398, 398, 397]
        # create the THOR optimizer; LayerNorm and bias parameters are excluded from weight decay
        optimizer = THOR(network, lr, damping, cfg.Thor.momentum,
                         cfg.Thor.weight_decay, cfg.Thor.loss_scale, cfg.batch_size,
                         decay_filter=lambda x: 'layernorm' not in x.name.lower() and 'bias' not in x.name.lower(),
                         split_indices=split_indices)
    ...
    return optimizer

def run_pretrain():
    ...
    # create the training dataset
    ds = create_bert_dataset(device_num, rank, args_opt.do_shuffle, args_opt.data_dir, args_opt.schema_dir)
    # create the network with loss
    net_with_loss = BertNetworkWithLoss(bert_net_cfg, True)
    ...
    if args_opt.load_checkpoint_path:
        param_dict = load_checkpoint(args_opt.load_checkpoint_path)
        load_param_into_net(net_with_loss, param_dict)
    # dynamic loss scaling
    if args_opt.enable_lossscale == "true":
        ...
    # fixed loss scaling
    else:
        net_with_grads = BertTrainOneStepCell(net_with_loss, optimizer=optimizer)
    # create the model
    model = Model(net_with_grads)
    # add the extra computation graph to enhance performance
    model = ConvertModelUtils().convert_to_thor_model(model, network=net_with_grads, optimizer=optimizer)
    # train the network
    model.train(new_repeat_count, ds, callbacks=callback,
                dataset_sink_mode=(args_opt.enable_data_sink == "true"),
                sink_size=args_opt.data_sink_steps)

if __name__ == '__main__':
    set_seed(0)

Finally, enter the launch command and the script is ready to run.

This concludes the advanced optimizer series. The three articles in this series covered, respectively, the background of optimizers, an introduction to MindSpore's self-developed optimizer, and the source code analysis and practical application of MindSpore's advanced optimizer THOR. If there are any shortcomings, your criticism and corrections are welcome. You are also welcome to join the MindSpore open source community.

References:

[1] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.

[2] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.
