Abstract: Increasing model size has recently become the main way to improve model performance. In particular, self-supervised pre-trained language models in NLP keep growing, from 175 billion parameters for GPT-3 to 1.6 trillion parameters for Switch Transformer, roughly another order of magnitude.

This article is shared from the Huawei Cloud Community post “A paper to bring you to understand the key technology of the trillion-level parameter super large model supported by MindSpore!”, by HWCloudai.

## Preface

Increasing model size has recently become the main way to improve model performance. In particular, self-supervised pre-trained language models in NLP keep growing, from 175 billion parameters for GPT-3 to 1.6 trillion parameters for Switch Transformer, roughly another order of magnitude.

Growing model size by orders of magnitude has brought measurable performance gains and even some unexpected “magic” effects (as in GPT-3), but the computing overhead behind it has become the biggest problem: GPT-3 training, for example, used on the order of ten thousand GPUs and several weeks of training time. One of the most important challenges is how to use super-large-scale parameters to improve model expressiveness and performance while keeping the increase in computation small. This article focuses on dynamic neural network technology, represented by MOE. The brain is a typical computing model with low energy consumption and high efficiency, and sparse activation is its most important characteristic. Besides the computational efficiency of training and inference (especially training), an even greater challenge for current very large models lies in the training optimization algorithm (not discussed here): backpropagation (BP) is currently the most practical optimization algorithm for deep networks, but a more ideal algorithm would offer high parallelism, an asymmetric optimization process, and the ability to reach global optimization through continuous local optimization in the spatial and temporal dimensions.

1. In a traditional neural network, during the forward pass every sample in the input batch activates every parameter in the network.

2. Conditional computation, in its loosest definition, refers to a class of algorithms that activate only some parts of a network. In a specific implementation, the selection may activate different parts of the network independently for each sample in the input batch, for different parts of the input data space (such as different regions or channels of an image), for different positions in time (such as different sliding windows of a time series or different frames of a video), or for different target tasks; different subnets may also be assigned by a non-learnable, fixed random assignment.

3. For different inputs (the original input or the output of a previous layer), the subsequent part of the network is executed selectively according to certain conditions. Similar or related terms for this technique include dynamic neural network(s), conditional computing, conditional activation, sparse activation, selective execution, Mixture of Experts (MOE), and dynamic routing; Switch Transformer is a strongly related model.

## Classification of Conditional Computing (Generalized)

1. Depending on whether routing is learnable, it can be divided into learnable-routing conditional computation and unlearnable-routing conditional computation.

2. Depending on whether the non-activated part is actually executed, it can be divided into hard conditional computation and soft conditional computation. In hard mode, whatever selection method is used, tensor selection and slicing ensure that data that does not need a given part of the network never takes part in the computation of that inactive part. In soft mode, the relevant data may only be set to zero (or handled in a similar way) to cancel its effect on the result, while the parts of the network that did not need to be activated still actually execute their computation.

## The main advantages of conditional computation

1. Efficient computation and reduced energy consumption: through partial activation and partial computation. Taking per-sample conditional activation as an example, a single sample only needs to pass through part of the entire SuperNet.

2. Larger network, stronger expression: because inputs are routed from one place to many, the input at each layer is routed to different subnets for independent computation, so the representations of different inputs at each layer are relatively independent and do not affect each other. This gives stronger expressive power and allows a larger network, but the expression efficiency is reduced.

## The network and form of conditional computation

The networks used for conditional computation are flexible in form. Some construction patterns are listed below (specific models are omitted; see: http://intellabs.github.io/dis)

1. Following the characteristics of tasks such as CV, several independent CNNs are used as expert networks, samples are routed independently according to the task, and the tails are combined into one large network.

2. Different expert networks are combined at different levels using more complex forms such as cascading.

3. Routing is realized through data transformation with methods such as decision trees.

4. Routing is realized through learnable networks. The loss for learning the routing strategy can be constructed in several ways: directly using the main task loss (such as the classification loss), or building auxiliary losses from the importance and load of the different experts.

## Routing strategy for conditional computation

1. Non-learnable/hard mode: routing is computed with a deterministic strategy, such as LSH.

2. Learnable mode: routing is computed by a learnable network. The network can be large or small; the simplest learnable router is a single weight layer, G(x) = P(XW), where G(x) is the routing gate function, X is the input, W is the learnable routing weight trained through the loss function, and P is a selection function (such as topk or sort). In practice, the product XW may also be passed on as part of the input information of the subsequent network, rather than being used only by G(x) to select routes; in that case the result of XW needs to be normalized. The more typical form is G(x) = P(N(XW)), where N is a normalization function such as softmax.
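The gate G(x) = P(N(XW)) can be sketched in a few lines. This is a minimal, framework-free illustration in plain Python rather than MindSpore code; the helper names (`matmul`, `softmax`, `topk`, `gate`) and the toy values of `x` and `w` are assumptions made for the example, with N = softmax and P = top-k.

```python
import math

def matmul(x, w):
    # (batch, dim) x (dim, experts) -> (batch, experts)
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def softmax(row):
    # Numerically stable softmax over one row of routing scores (the N function)
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def topk(row, k):
    # Indices and values of the k largest entries (the P function);
    # ties are broken in favour of the lower index
    idx = sorted(range(len(row)), key=lambda i: (-row[i], i))[:k]
    return idx, [row[i] for i in idx]

def gate(x, w, k=1):
    # G(x) = P(N(XW)): one (expert indices, gate weights) pair per sample
    return [topk(softmax(r), k) for r in matmul(x, w)]

# Hypothetical toy data: 3 samples, 2 features, 3 experts
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[2.0, 0.0, 1.0], [0.0, 3.0, 1.0]]
routes = gate(x, w, k=1)
selected = [indices[0] for indices, _ in routes]
```

With k=1 each sample is sent to a single expert; passing k=2 or more gives the redundant variant, in which a sample is handled by several experts.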

## Redundancy strategy for conditional computation

By redundancy strategy, conditional computation can be divided into non-redundant and redundant conditional computation:

1. Non-redundant conditional computation can be implemented with a P(.) function such as topk(k=1).

2. Redundant conditional computation can be realized in several ways: with a P(.) function such as topk(k=n), n>=2, or through a hard redundancy mode, in which the inputs are replicated and reused throughout the network.

## Conditional computing challenges

1. The effect of the routing algorithm on model quality. Whether the product of the input and the routing weight (X*W) is used only for routing, or is also passed directly as part of the input of the subsequent network units, the routing algorithm determines how input information flows through the model and has a great impact on overall model quality.

2. Stability of the routing/gate. The weights of the routing/gate are randomly initialized and are themselves constantly trained and adjusted, so the same sample is assigned to different subsequent network units at different stages of training. If this dynamic change is too drastic, it seriously affects the stability and convergence speed of the whole training process.

3. Expert importance and load balance across samples. During training, two indicators matter: the importance of each expert with respect to the samples in a batch, and the load balance, that is, whether the samples of each batch are evenly distributed among the experts. The two are related but also conflict with each other, so loss functions need to be constructed as auxiliary losses to optimize both. This is discussed in arXiv:1701.06538, "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer".

## About Conditional Computing/Dynamic Neural Networks

More information about conditional computation and dynamic neural networks is available in "Dynamic Neural Networks: A Survey", arXiv:2102.04906 (http://arxiv.org/abs/2102.0490). Its authors take a generalized view of dynamic neural networks and classify the various dynamic network techniques at the instance level, the spatial level, and the temporal level.

- Instance-wise Dynamic NN: dynamic per instance; each sample independently activates different networks and parameters (MOE belongs to this direction). Dynamic Architecture: Dynamic Depth, Dynamic Width, Dynamic Routing/MOE. Dynamic Parameter: Parameter Adjustment, Parameter Prediction, Dynamic Feature(s).
- Spatial-wise Dynamic NN: different spatial locations, such as regions of an image, activate different networks and parameters (CNN, etc.): Pixel Level, Region Level, Resolution Level.
- Temporal-wise Dynamic NN: sequential data is split along the time dimension, and each part activates different subsequent networks and parameters (video frames, text sequences, time series, streams, …).

The above is the general classification of Dynamic NN in this survey paper.

From the perspective of supporting ultra-large-scale networks with dynamic network technology, high expressive power and low computing cost are the main criteria, and dynamic network technologies are classified along two dimensions:

### 1. According to whether it is partially activated during feedforward calculation:

Hard-dynamic: during the forward pass, part of the network is definitely not activated and does not take part in the computation.

Soft-dynamic: during the forward pass, after a gate/route such as softmax, part of the network loses its expressive effect (for example through zeroing of tensor elements) but still takes part in the computation.

### 2. According to the input of the dynamic activation determination algorithm:

- Sample level (at the input layer): the subsequent activation of the dynamic network is determined per sample.
- Subsample level (at the input layer): different subsequent network units are activated at the temporal/spatial level within a sample. A typical deep network is selectively activated not only at the input layer but also at intermediate layers.

Among them, a hard-dynamic, sample-level dynamic neural network supported by the intelligent platform naturally obtains coarse-grained sparse activation of the network structure and can achieve high energy efficiency for training and inference of large models.

Compared with neural networks with a static structure, related studies have made many comparisons in terms of efficiency, representation, generalization, robustness, and interpretability. From the perspective of an intelligent platform that aims to improve model performance by supporting very large networks at the lowest possible computing cost, efficiency and representation are the most important:

1. Efficiency: a static network treats every input the same way. Each input sample has to pass through the whole network and all parameters, which is a major challenge for very large models that aim for leading results.

2. Representation: a larger number of parameters gives a larger representation capacity. However, in terms of the features expressed at each layer of a deep network, MOE and similar structures have lower reuse and a lower expression efficiency per parameter.

## Implementation strategy

Building a very-large-parameter version of a model with dynamic routing and sparse activation has to be studied and implemented separately for each model.

In the case of Switch Transformer, the parameters are expanded in the FFN part of the Transformer. Its MOE extension is shown below:

(Photo: Switch Transformer Paper)

It can be seen that the main change for MOE is to add MOE-related logic before and after the expert subnetworks. This article mainly introduces the implementation on the platform. Conditional computation with dynamic routing consists of four steps: routing calculation, data dispatch, independent calculation, and result merging.

1. Routing calculation (Gate): based on the input (the input of the whole network, or the output of the previous network unit/layer), the routing unit computes, sample by sample across the batch, which subsequent experts (Mixture of Experts/MOE) each sample is to be dispatched to.

2. Data dispatch (Dispatch): according to the sample-expert relationship computed by routing, collect from the whole input tensor the tensor that each expert needs to process. In a design with a fixed per-expert batch size, the number of samples assigned to each expert in each batch must be balanced against the maximum capacity of each expert per training step; because sample input is random, a reasonably uniform distribution is hard to guarantee. For experts that receive fewer samples than the maximum capacity, the sub-batch should be padded to the fixed batch size; for samples beyond the maximum capacity, methods such as delayed resampling can be used. To keep the input-output relationship (input x, label y) correct and to keep the gradient relationship for backpropagation during training, the implementation must maintain the index mapping from the original batch to each expert's sub-batch and use it later for differentiation and for merging.

3. Independent computation (Expert): each expert is called concurrently (logically, the calls may also run sequentially) to process its sub-batch. This is one of the concurrency APIs that the intelligent platform needs to support.

4. Result merging (Combine): merge the result tensors of all experts into a tensor for the whole batch and restore the original input order according to the data dispatch index.
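The four steps above can be sketched end to end. This is a minimal hard-mode, top-1 illustration in plain Python, not the article's MindSpore code; the function names (`route_top1`, `dispatch`, `combine`, `make_expert`), the identity gate weight, and the scaling "experts" are all hypothetical choices made so that the result is easy to check by hand.

```python
def matmul(x, w):
    # (batch, dim) x (dim, experts) -> (batch, experts)
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def route_top1(x, gate_w):
    # Step 1 (Gate): index of the highest-scoring expert for every sample
    scores = matmul(x, gate_w)
    return [max(range(len(r)), key=lambda e: r[e]) for r in scores]

def dispatch(x, assignment, num_experts):
    # Step 2 (Dispatch): build one sub-batch per expert and remember,
    # for every row of a sub-batch, its position in the original batch
    index = [[] for _ in range(num_experts)]
    for i, e in enumerate(assignment):
        index[e].append(i)
    sub_batches = [[x[i] for i in idx] for idx in index]
    return sub_batches, index

def combine(expert_outputs, index, batch_size):
    # Step 4 (Combine): write each expert's rows back to their original positions
    merged = [None] * batch_size
    for outs, idx in zip(expert_outputs, index):
        for out, i in zip(outs, idx):
            merged[i] = out
    return merged

def make_expert(scale):
    # Hypothetical expert: multiplies its sub-batch by a constant
    return lambda rows: [[scale * v for v in r] for r in rows]

experts = [make_expert(s) for s in (1.0, 10.0, 100.0)]

# 5 samples, 3 experts; the gate weight is the identity for readability
x = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
gate_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

assignment = route_top1(x, gate_w)
sub_batches, index = dispatch(x, assignment, num_experts=3)
outputs = [expert(sb) for expert, sb in zip(experts, sub_batches)]  # Step 3 (Expert)
y = combine(outputs, index, batch_size=len(x))
```

Each expert only sees the rows routed to it, and the saved index is what lets the merged output line up with the original labels. Capacity limits and padding are left out of this sketch.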

On mainstream deep learning platforms, two main implementation strategies can be adopted:

Tensor zeroing: for each subsequent network unit (expert subnet, etc.) that data must be dispatched to, make a copy of the tensor for that expert and set to zero the data dimensions that should not be processed by the current expert. This method is simple to implement, uses only whole-tensor operations, and places no special requirements on the platform, provided that the computation on zeroed data is logically correct. It is suitable for algorithm research, where it only demonstrates that upstream data is dynamically routed to different subsequent network units so that the effect of the algorithm can be analyzed. With zeroing, the batch dimension of the tensor processed by each expert is still the full batch, so it saves neither computation nor memory.

Tensor collation: for each subsequent network unit (expert subnet, etc.) that data must be dispatched to, make a copy of the tensor for that expert that does not keep the data dimensions the current expert should not process, and maintain the correspondence between sample-level indices before and after the transformation. In a distribution-friendly implementation, if expert subnets are placed on different compute nodes as units, the expert network is best derived from the platform's subnet-level object, such as mindspore.nn.Cell in MindSpore. See the technical implementation section below for details.
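The difference between the two strategies shows up in the shape of what each expert receives. The sketch below is an illustrative plain-Python comparison; `zeroing`, `collation`, and the toy batch are hypothetical names and values, not part of any platform API.

```python
def zeroing(x, assignment, expert_id):
    # Tensor zeroing: keep the full batch, zero out rows meant for other experts
    return [row if a == expert_id else [0.0] * len(row)
            for row, a in zip(x, assignment)]

def collation(x, assignment, expert_id):
    # Tensor collation: keep only this expert's rows, plus their original indices
    index = [i for i, a in enumerate(assignment) if a == expert_id]
    return [x[i] for i in index], index

# Hypothetical batch of 4 samples routed to 2 experts
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
assignment = [0, 1, 0, 1]

zeroed = zeroing(x, assignment, expert_id=0)        # still 4 rows
rows, index = collation(x, assignment, expert_id=0)  # only 2 rows
```

With zeroing, expert 0 still processes four rows, two of them all zeros; with collation it processes two rows, and `index` records where its outputs belong in the full batch.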

## The core code

**Core code: routing calculation, data dispatch, independent calculation, results merge**

The reference code is an illustrative implementation in MindSpore. (Note: import mindspore as ms)

The core logic of Mixture of Experts: the input I goes through routing_network (in the simplest case just a multiplication by W) and then topk (apply softmax if the algorithm variant needs gate weights, otherwise not); the tensor for each subnetwork/expert is then selected along the batch dimension with tensor operations.

To make debugging easier, the input and the routing weight are constructed from very small, non-random, deterministic values, and the routing network is a simple X*W.

### 1. Routing calculation

With 5 input samples (only 3 categories, expected to be assigned to 3 experts), the expert for each sample can be computed directly from the matrix multiplication of the input and the gate weight, using matmul or something like gates_weighted = einsum('bd,de->be', [data_inputs, gate_weights]). This is the first matrix multiplication.

The input and the weight can be multiplied in Python with @, with matmul, or with einsum, the Einstein-summation function whose notation is easy to remember. For a plain matrix multiplication, einsum is actually split into several operators when the computation graph is compiled, and its performance is not good. However, when the input and weight have more than 2 dimensions and the routing must be computed with the batch dimension fixed, einsum makes the implementation easy to write.
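To make the einsum subscripts concrete, the contraction 'bd,de->be' can be written out as explicit loops. This is a plain-Python sketch for illustration only; the values of `data_inputs` and `gate_weights` are hypothetical (the article's original values are not shown), and the second function shows an assumed higher-dimensional variant 'bsd,de->bse' in which the leading batch dimension is kept fixed.

```python
def einsum_bd_de_be(x, w):
    # out[b][e] = sum over d of x[b][d] * w[d][e]
    b_size, d_size, e_size = len(x), len(w), len(w[0])
    out = [[0.0] * e_size for _ in range(b_size)]
    for b in range(b_size):
        for d in range(d_size):
            for e in range(e_size):
                out[b][e] += x[b][d] * w[d][e]
    return out

def einsum_bsd_de_bse(x, w):
    # Same contraction applied independently to every element of the batch dimension
    return [einsum_bd_de_be(seq, w) for seq in x]

# Hypothetical deterministic inputs: 5 samples, 3 features, 3 experts
data_inputs = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0],
               [1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
gate_weights = [[1.0, 2.0, 0.0], [0.0, 1.0, 2.0], [2.0, 0.0, 1.0]]

gates_weighted = einsum_bd_de_be(data_inputs, gate_weights)
batched = einsum_bsd_de_bse([data_inputs, data_inputs], gate_weights)
```

The index d appears in both inputs but not in the output, so it is summed over; b and e survive and become the (sample, expert) axes of the routing scores.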

### 2. Dispatch

The main logic of dispatch in conditional computation is to compute the top-k experts for each sample from the output of the routing network, which can be done with the topk function. Since the selected top scores can be used as input information for subsequent network units (including the routing information), the routing output is generally normalized with softmax.

Calculation 1, as needed: the normalized weight among all N experts (see item 2 of the technical points below).

Select the top-k experts for each sample in the batch. The result holds the expert weights for each sample in the batch; top-k can be taken from the softmax output or directly from gates_weighted. Since softmax may be skipped or deferred, gates_weighted can be used directly, and the returned indices are the expert numbers for each sample in the batch.

Calculation 2, as needed: the normalized weights among the selected top-k experts.

How is the tensor that belongs to each expert extracted from the original input according to the dispatch index? Current mainstream platforms have no dedicated operator for this, but a similar effect can be achieved by combining other operators. In MindSpore, the operator can be implemented in the underlying C++, or in Python by inheriting from Cell and implementing bprop, and then using the index to organize the original gate tensor into the target output. Here this is implemented as a Dispatch class.

### 3. Independent calculation

The subsequent expert networks are called directly in parallel. The parallel part can be supported by the platform: it can be marked with special functions or annotations, or it can be optimized into parallel execution when the platform compiles the graph. (In network models without dynamic routing conditional computation, similar optimizations generally do not exist.)

### 4. Merging

The merging logic is relatively simple. First, the expert results are concatenated along the batch dimension with cat; then a zeros tensor of the correct shape is constructed, and index_add is used to write the results of each expert network back in input order according to the index. The result is the output of the MOE module.
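The cat-then-index_add pattern can be illustrated without any framework. In the sketch below, `index_add` is a hypothetical plain-Python helper that mimics the accumulate-by-index behaviour described above (it is not a platform function), and the expert outputs are made-up values in which sample 2 was handled by two experts, as in a redundant top-2 setup.

```python
def index_add(target, index, source):
    # target[index[i]] += source[i]; repeated indices accumulate
    for i, row in zip(index, source):
        target[i] = [t + s for t, s in zip(target[i], row)]
    return target

# Hypothetical expert results (already scaled by their gate weights).
# Expert 0 processed samples 0 and 2; expert 1 processed samples 1 and 2.
expert_outputs = [
    [[1.0, 1.0], [2.0, 2.0]],
    [[10.0, 10.0], [20.0, 20.0]],
]
index = [[0, 2], [1, 2]]

# "cat" along the batch dimension, for both the outputs and their indices
cat_out = [row for outs in expert_outputs for row in outs]
cat_idx = [i for idx in index for i in idx]

zeros = [[0.0, 0.0] for _ in range(3)]  # batch of 3 samples, feature size 2
merged = index_add(zeros, cat_idx, cat_out)
```

Because the addition accumulates, a sample that went to several experts receives the sum of their contributions, while a sample that went to one expert simply receives that expert's row.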

The above completes the entire calculation process of MOE.

### The code framework

We extend the tensor-operation logic of the basic dynamic routing conditional computation above into a complete training code framework:

- class Dispatch(ms.nn.Cell): implements the dispatch logic in routing
- class Combine(ms.nn.Cell): implements the combine (assembly) logic in routing
- class Route(ms.nn.Cell): completes the entire dynamic routing logic; can be implemented as a relatively generic class
- class Expert(ms.nn.Cell): an expert network customized by platform users
- class Network(ms.nn.Cell): a large network defined by platform users
- class MSELoss(ms.nn.Cell): implements the MSE loss and the auxiliary loss logic
- class OutputLossGraph(ms.nn.Cell): outputs the inference result and the loss; one step in PyNative mode
- def train(…): training entry point

## Technical points for implementing conditional computation

### 1. Dynamic routing

- Unlearnable routing

For example, LSH (Locality Sensitive Hashing) can be used for routing. If LSH is placed at the front of the entire learnable network to dispatch samples, the problem of differentiating through the LSH part is avoided. If an LSH module is added in the middle of the network, the gradient through the deterministic algorithm has to be obtained by gradient estimation.

- Learnable routing

In the simple case, gate_weights is defined as a learnable Parameter. For a two-dimensional tensor, the routing weights are computed with Python @ or matmul. For a tensor of higher dimensions with the batch dimension fixed, einsum('bd,de->b*e') completes the calculation.
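One common LSH family that could serve as such a non-learnable router is sign-based random projection. The sketch below is only an illustration of that idea in plain Python: the function name `lsh_route`, the fixed `hyperplanes` (which would normally be drawn at random once and then frozen), and the modulo mapping from hash code to expert are all assumptions, not the article's implementation.

```python
def lsh_route(x, hyperplanes, num_experts):
    # One bit per hyperplane: which side of the plane the sample lies on.
    # The resulting bit pattern is the hash code, mapped to an expert id.
    routes = []
    for row in x:
        code = 0
        for h in hyperplanes:
            bit = 1 if sum(a * b for a, b in zip(row, h)) >= 0 else 0
            code = (code << 1) | bit
        routes.append(code % num_experts)
    return routes

# Fixed hyperplanes for reproducibility (hypothetical values)
hyperplanes = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 1.0], [1.1, 0.9], [-1.0, 1.0], [-1.0, -1.0]]
routes = lsh_route(x, hyperplanes, num_experts=4)
```

The first two samples are close to each other and land on the same expert, which is the locality property that makes LSH usable as a deterministic router. Nothing here is trained, so no gradient has to flow through the routing decision when it sits in front of the network.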

### 2. The relationship between TOPK and Softmax

Consider G_1(x) = softmax(topk(XW)) and G_2(x) = topk(softmax(XW)).

Whether softmax is placed before or after topk, the choice of the top-k experts remains unchanged, because softmax preserves the ordering of the scores. When G_* is needed as part of the input of the subsequent network, that is, when the routing weight information is used as input information downstream, the placement matters: if the normalized weight among all N experts is needed, softmax is placed before top-k; otherwise softmax is placed after top-k to compute the normalized weight among the selected top-k experts.
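The effect of the two orderings can be checked on a single row of routing scores. This is a plain-Python sketch with hypothetical helper names and a made-up `logits` row; it only illustrates that the selected experts are identical while the resulting weights differ.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def topk(row, k):
    # Indices and values of the k largest entries, lower index wins ties
    idx = sorted(range(len(row)), key=lambda i: (-row[i], i))[:k]
    return idx, [row[i] for i in idx]

logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical routing scores for 4 experts
k = 2

# G_1: softmax after top-k, weights normalized among the selected experts
idx_1, top_vals = topk(logits, k)
g1 = softmax(top_vals)

# G_2: softmax before top-k, weights normalized among all experts
idx_2, g2 = topk(softmax(logits), k)
```

Here `g1` sums to 1 over the two selected experts, whereas `g2` sums to less than 1 because part of the probability mass stays with the experts that were not selected.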

### 3. How to balance each expert in batch processing

Importance is calculated by summing, for each expert, the routing weights of the samples in the batch that are assigned to it (a single sample may be assigned to one or more experts). Load is obtained by counting, for each expert, the samples with a non-zero routing weight. coefficient_of_variation(importance) + coefficient_of_variation(load) is used as auxiliary_loss and optimized to balance importance and load. The coefficient of variation is a dimensionless measure of the dispersion of data: the more dispersed the data, the worse the balance, so it needs to be optimized toward a smaller value.

In a multi-layer (multi-point) MOE model such as Transformer, the multiple groups of auxiliary_loss are combined into a single auxiliary_loss, which is then added to dominated_loss.
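A small numeric sketch of this balancing loss follows, in plain Python. The function names, the toy `gates` matrices, and `aux_weight` are hypothetical; the loss follows the sum of coefficients of variation described above. Note that arXiv:1701.06538 uses the squared coefficient of variation and a smooth estimator for the load, because a plain count like the one below is not differentiable.

```python
import math

def coefficient_of_variation(values):
    # Standard deviation divided by mean: a dimensionless measure of dispersion
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(var) / mean

def importance_and_load(gates):
    # gates[b][e]: routing weight of sample b for expert e (0 if not selected)
    num_experts = len(gates[0])
    importance = [sum(row[e] for row in gates) for e in range(num_experts)]
    load = [float(sum(1 for row in gates if row[e] > 0)) for e in range(num_experts)]
    return importance, load

def auxiliary_loss(gates):
    importance, load = importance_and_load(gates)
    return coefficient_of_variation(importance) + coefficient_of_variation(load)

# Hypothetical top-1 gate outputs for 4 samples and 2 experts
balanced = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
skewed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

loss_balanced = auxiliary_loss(balanced)
loss_skewed = auxiliary_loss(skewed)

# Multi-layer model: sum the per-layer auxiliary losses and add them to the
# main loss; main_loss and aux_weight are made-up numbers for illustration
main_loss, aux_weight = 0.5, 0.01
total_loss = main_loss + aux_weight * sum([loss_balanced, loss_skewed])
```

The balanced routing gives an auxiliary loss of zero, while routing three of four samples to one expert is penalized, which is the pressure that pushes the gate toward an even distribution.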
