Abstract: During model development, accuracy often falls short of expectations. To help users debug and tune their models, we have built MindInsight, a visual debugging and tuning component tailored to MindSpore.

This article is shared from the Huawei Cloud community post "Tech Insights | Model optimization: accuracy and speed, I want them both! MindSpore Model Accuracy Tuning in Practice (2)", original author: HWCloudAI.

Introduction:

During model development, accuracy often falls short of expectations. To help users debug and tune their models, we have built MindInsight, a visual debugging and tuning component tailored to MindSpore. We have also compiled debugging and tuning guidelines for common accuracy problems and are sharing them as a series of articles called "MindSpore Model Accuracy Tuning in Practice", in the hope of helping users locate accuracy problems easily and optimize model accuracy quickly.

For the previous installment in this series, see: "Locate Accuracy Problems Faster! MindSpore Model Accuracy Tuning in Practice (1)".

This article is the second in the series and presents common ideas for accuracy debugging and tuning. The series assumes that your script already runs and produces loss values; if the script does not run, please refer to the relevant error messages and fix it first.

When accuracy problems are encountered, common debugging and tuning ideas are as follows:

  1. Check the code and hyperparameters
  2. Check the model structure
  3. Check the input data
  4. Check the loss curve
  5. Check whether the accuracy meets expectations

Code is an important source of accuracy problems; checking the script and code means finding problems at the source (Section 2). The model structure reflects MindSpore's understanding of the code, and checking it verifies that this understanding is consistent with the algorithm engineer's design (Section 3). Some problems can only be found during dynamic training: checking the input data (Section 4) and the loss curve (Section 5) combines code inspection with the phenomena observed during training. Checking whether the accuracy meets expectations (Section 6) reviews the overall tuning process and considers means such as adjusting hyperparameters, explaining the model, and optimizing the algorithm. In addition, it is important to be familiar with the model and the tools (Section 1). Each of these ideas is described below.

01 Precision tuning preparation

1.1 Review the algorithm design and be fully familiar with the model

Before accuracy tuning, review the algorithm design and make sure it is clear. If the model is implemented from a paper, review all design details and hyperparameter choices in the paper. If the model is implemented with reference to a script from another framework, make sure you have a benchmark script whose accuracy meets the standard. If the algorithm is newly developed, its important design details and hyperparameter choices should likewise be made explicit. This information is an important basis for the script-review steps that follow.

Before accuracy tuning, you also need to be thoroughly familiar with the model. Only when you are familiar with the model can you accurately understand the information provided by MindInsight, judge whether there is a problem, and find its source. Therefore, it is important to take the time to understand elements such as the model's algorithm and structure, the role of the operators in the model and the meaning of their parameters, and the characteristics of the optimizer the model uses. Before analyzing the details of an accuracy problem, it is recommended to deepen your understanding of these model elements with concrete questions in mind.


1.2 Get familiar with the tools

MindInsight is rich in features. It is recommended that users briefly read the MindInsight tutorial (https://www.mindspore.cn/tuto…) to understand its main functions. When locating accuracy issues, it is recommended to enable the Summary training information collection feature by adding a SummaryCollector to the script, and to use the training dashboard to view the training process data, as shown in the figure below. Instructions for using the Summary function can be found here (https://www.mindspore.cn/tuto…); for more details, please refer to (https://www.mindspore.cn/tuto…).
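As a concrete reference, here is a minimal sketch of enabling Summary collection (the import paths follow the MindSpore 1.x API; net, loss_fn, optimizer, and train_dataset are assumed to be defined elsewhere in your script):

```python
from mindspore import Model
from mindspore.train.callback import SummaryCollector

# net, loss_fn, optimizer and train_dataset are assumed to be defined elsewhere.
model = Model(net, loss_fn, optimizer, metrics={"accuracy"})

# SummaryCollector writes scalars, the computational graph, hyperparameters, etc.
# to summary_dir, which the MindInsight training dashboard then visualizes.
summary_collector = SummaryCollector(summary_dir="./summary_dir")
model.train(10, train_dataset, callbacks=[summary_collector])
```

After training, MindInsight can typically be launched with `mindinsight start --summary-base-dir ./summary_dir` to browse the collected data.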



When you need to debug the model during training, refer to the following link to enable the debugger function:

https://www.mindspore.cn/tuto…

02 Check the code and hyperparameters

Code is an important source of accuracy problems; hyperparameter problems, model structure problems, data problems, and algorithm design and implementation problems are all reflected in the script, so reviewing the script is a very efficient way to locate accuracy problems. Reviewing the code mainly relies on careful code reading, and the rubber duck debugging method is recommended: during the code walkthrough, patiently explain the function of each line of code to an inexperienced "rubber duck", which often exposes problems in the code. When checking the script, pay attention to whether the implementation (including data processing, model structure, loss function, optimizer, etc.) is consistent with the design. If the script is based on another script, focus on whether the implementation is consistent with that script; any inconsistency should have a sufficient and reasonable justification, otherwise it should be fixed.

When checking the script, you should also pay attention to the hyperparameters. Hyperparameter problems mainly manifest as unreasonable values, for example:

  1. The learning rate is set unreasonably;
  2. The loss_scale parameter is unreasonable;
  3. The weight initialization parameters are unreasonable.

MindInsight helps you check hyperparameters. In most cases, the SummaryCollector automatically records common hyperparameters, and you can view them using MindInsight's training parameter details page (shown below) and its traceability analysis function. Combined with the code in the script, the MindInsight model traceability analysis module lets you confirm the hyperparameter values and identify obviously unreasonable ones. If there is a benchmark script, it is recommended to compare the hyperparameter values with the benchmark script one by one; default values should also be compared, to avoid accuracy degradation or training errors caused by different default parameter values across frameworks.
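For example, a simple, framework-agnostic way to do the one-by-one comparison is to list both sets of hyperparameters (including values left at framework defaults) and diff them; the names and values below are purely illustrative assumptions:

```python
# Illustrative sketch: compare the current script's hyperparameters against
# the benchmark script's, including values that are left at framework defaults.
benchmark_hparams = {"learning_rate": 0.01, "loss_scale": 1024,
                     "momentum": 0.9, "weight_decay": 1e-4, "batch_size": 32}
current_hparams = {"learning_rate": 0.1, "loss_scale": 1024,
                   "momentum": 0.9, "weight_decay": 0.0, "batch_size": 32}

for name in sorted(set(benchmark_hparams) | set(current_hparams)):
    bench, cur = benchmark_hparams.get(name), current_hparams.get(name)
    if bench != cur:
        print(f"[MISMATCH] {name}: benchmark={bench}, current={cur}")
```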

03 Check the model structure

In terms of model structure, common problems are:

  1. The wrong operator is used (an operator that is not applicable to the target scenario, e.g. integer division used where floating-point division is needed);
  2. Weight sharing errors (weights that should not be shared are shared);
  3. Weight freezing errors (weights that should not be frozen are frozen);
  4. Node connection errors (a block that should be connected to the computational graph is not connected);
  5. Loss function errors;
  6. Optimizer algorithm errors (if you implement the optimizer yourself), etc.

It is recommended to review the model structure by examining the model code. In addition, MindInsight can assist the user in checking the model structure: in most cases the SummaryCollector automatically records the computational graph, which the user can conveniently view in MindInsight. After the model script runs, it is recommended to use MindInsight's computational graph visualization module to inspect the model structure, deepen your understanding of the graph, and confirm that the structure meets expectations. If you have a benchmark script, you can also compare the computational graph against the one produced by the benchmark script and check for important differences.

Given that model structures are generally complex, it is unrealistic to expect to find all model structure problems in this step. The goal is to deepen your understanding of the computational graph through visualization and find the obvious structural problems. In later steps, we will return to this step and recheck once more specific accuracy problems are found.

Note 1: MindInsight supports viewing computational graphs recorded by the SummaryCollector, as well as graphs exported to PB files via the save_graphs parameter of the MindSpore context. Please refer to the "Computational Graph Visualization" section of our tutorial for more information (https://www.mindspore.cn/tuto…).
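For reference, a minimal sketch of exporting the graph via the context, as mentioned in Note 1 (assuming the MindSpore 1.x context API):

```python
import mindspore.context as context

# save_graphs exports the compiled computational graph files to save_graphs_path,
# from which MindInsight can display the model structure.
context.set_context(mode=context.GRAPH_MODE,
                    save_graphs=True,
                    save_graphs_path="./graphs")
```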

Note 2: The script migration tool can convert models written in PyTorch or TensorFlow into MindSpore scripts. Please visit the tutorial (https://www.mindspore.cn/tuto…) for more information.

04 Check input data

By examining the data fed into the model and the script, you can determine whether there is a problem with the data processing pipeline or the dataset. Common problems with input data are:

  1. Too many missing values in the data;
  2. The number of samples in each class is unbalanced;
  3. There are outliers in the data;
  4. Data labeling errors;
  5. Insufficient training samples;
  6. The data is not standardized, so the data fed into the model is not in the correct range;
  7. Fine-tuning and pre-training process the data differently;
  8. The training stage and the inference stage process the data differently;
  9. Incorrect data processing parameters, etc.

MindInsight can assist users in checking the input data and the data processing pipeline. In most cases, the SummaryCollector automatically records the data fed into the model (i.e. the data after processing) and the data processing pipeline parameters. The input data is displayed in the "Data Sampling" module, and the pipeline parameters are displayed in the "Data Graph" and "Data Traceability" modules. Through MindInsight's data sampling module, you can check the data fed into the model (after the data processing pipeline); if it obviously does not meet expectations (for example, data is cropped too aggressively or rotated by too large an angle), you can conclude that there is a problem with the input data. Through MindInsight's data graph and data traceability modules, you can check the processing steps of the pipeline and the values of specific parameters, and thus discover unreasonable data processing methods.



If you have a benchmark script, you can also check whether the data output by the data processing pipeline is the same as that of the benchmark script. For example, save the data output by the pipeline to an npy file, and then use the numpy.allclose() method to compare it with the data from the benchmark script. If differences are found, the data processing stage may contain a problem that affects accuracy.
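A hedged sketch of this comparison is shown below. It assumes both pipelines have shuffling disabled (so the batches line up), that the dataset yields a column named "image", and that the benchmark batch has already been saved to benchmark.npy; adjust the names to your script.

```python
import numpy as np

# Assumed helper: grab the first batch from a MindSpore dataset as a NumPy array.
def first_batch(dataset, column="image"):
    return next(dataset.create_dict_iterator(output_numpy=True))[column]

# np.save("benchmark.npy", ...)  # dumped once from the benchmark script
benchmark = np.load("benchmark.npy")
current = first_batch(current_dataset)  # current_dataset: the pipeline under test (assumed defined)
if not np.allclose(benchmark, current, rtol=1e-5, atol=1e-6):
    print("Pipeline outputs differ; inspect the data processing steps one by one.")
```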

If no problem is found in the data processing pipeline, you can manually inspect the dataset itself for problems such as class imbalance, mismatched labels, too many missing values, or insufficient training samples.

05 Check the loss curve

Many accuracy problems can be found in the process of network training. Common problems or phenomena include:

  1. Unreasonable weight initialization (e.g. the initial values are all 0, or the initialization range is unreasonable);
  2. Excessively large or small values in the weights;
  3. Weights change too much;
  4. Incorrect weight freezing;
  5. Incorrect weight sharing;
  6. Activation values are saturated or too weak (for example, the Sigmoid output is close to 1, or the ReLU output is all 0);
  7. Gradient explosion or vanishing;
  8. Insufficient training epochs;
  9. NaN or INF values in operator computation results;
  10. Overflow during operator computation (overflow during computation is not necessarily harmful), and so on.

Some of the problems or phenomena above show up in the loss curve, while others are difficult to observe. MindInsight provides targeted features that allow you to observe these phenomena and automatically check for problems, helping you locate the root cause faster. For example:

  • The parameter distribution histogram module of MindInsight can show how the model weights change over the course of training.
  • The tensor visualization module of MindInsight can display the specific values of tensors and compare different tensors.
  • The MindInsight debugger has a rich set of powerful built-in checks. It can check weight problems (e.g. weights not updating, weights updating too much, weight values too large or too small), gradient problems (e.g. gradient vanishing, gradient explosion), activation value problems (e.g. saturated or too-weak activations), tensors that are all 0, NaN/INF values, overflow during operator computation, and so on.

Debugger tutorial:

https://www.mindspore.cn/tuto…

In most cases, the SummaryCollector automatically records the loss curve of the model, which can be viewed through MindInsight's scalar visualization module. The loss curve reflects the dynamic trend of network training; by observing it, you can obtain information about the convergence and overfitting of the model.

In most cases, the SummaryCollector automatically records changes in model parameters (five parameters by default), which can be viewed through the parameter distribution histogram module in MindInsight. If you want to record parameter distributions for more parameters, refer to the histogram_regular parameter of the SummaryCollector (https://www.mindspore.cn/doc/…) or the HistogramSummary operator (https://www.mindspore.cn/tuto…).
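For instance, a hedged sketch of the histogram_regular option (the regular expression here is an illustrative assumption; consult the linked documentation for the exact semantics):

```python
from mindspore.train.callback import SummaryCollector

# Record parameter distribution histograms for every parameter whose name starts
# with "conv" or "fc", in addition to the default set of recorded parameters.
summary_collector = SummaryCollector(
    summary_dir="./summary_dir",
    collect_specified_data={"histogram_regular": "^conv.*|^fc.*"},
)
```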

Tensors are not recorded automatically. If you want to view the specific values of a tensor in MindInsight, please use the TensorSummary operator (https://www.mindspore.cn/tuto…).
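A minimal sketch of recording a tensor with the TensorSummary operator inside a network is shown below; the layer and names are illustrative, and a Summary recording mechanism such as the SummaryCollector still needs to be enabled during training for the data to be written.

```python
import mindspore.nn as nn
import mindspore.ops as ops

class NetWithTensorSummary(nn.Cell):
    def __init__(self):
        super().__init__()
        self.dense = nn.Dense(16, 4)             # illustrative layer
        self.tensor_summary = ops.TensorSummary()

    def construct(self, x):
        x = self.dense(x)
        # The recorded tensor appears under this name in MindInsight's tensor view.
        self.tensor_summary("dense_output", x)
        return x
```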

The following sections describe how to use MindInsight to locate accuracy problems, based on common loss curve phenomena.

5.1 The loss runs away (diverges)

A runaway loss means that NaN, +/-INF, or extremely large values appear in the loss. It usually indicates a problem in the algorithm design or implementation. The localization approach is as follows:

  1. Review the script, model structure, and data:

1) Check whether any hyperparameter has an unreasonably large or small value;

2) Check whether the model structure is implemented correctly, especially whether the loss function is implemented correctly;

3) Check whether the input data contains missing values or particularly large/small values.

  2. Observe the parameter distribution histograms in the training dashboard to check whether parameter updates show obvious anomalies. If an anomaly is found, use the debugger to locate its cause.
  3. Use the debugger module to inspect the training session:

1) If NaN or +/-INF values appear in the loss, use the "check tensor overflow" condition to add global watchpoints, locate the operator node where NaN/+/-INF first appears, and check whether the operator's input data can cause computation anomalies (such as division by zero). If the input data is the problem, a small epsilon can be added in a targeted way to avoid the anomaly (see the sketch after this list).

2) If the loss value is extremely large, use the "check tensor too large" condition to add global watchpoints, locate the operator node where large values first appear, and check whether the operator's input data can cause computation anomalies. If the input data itself is abnormal, continue tracing upward to the operator that produced it, until the specific cause is identified.

3) If abnormal parameter updates or gradients are suspected, use conditions such as "check weight change too large", "check gradient vanishing", and "check gradient too large" to set up watchpoints, locate the abnormal weights or gradients, and then use the tensor check view to trace back to the suspicious forward operators, backward operators, optimizer operators, and so on.
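The sketch below illustrates the epsilon fix mentioned in sub-item 1) above: guard a division whose denominator may reach zero so the forward computation no longer produces INF/NaN. The value of epsilon is an illustrative assumption and should match the precision in use.

```python
EPS = 1e-7  # illustrative value; pick according to the dtype (e.g. larger for float16)

def safe_ratio(numerator, denominator):
    # Adding a small epsilon keeps the denominator away from exact zero,
    # preventing INF/NaN from propagating into the loss.
    return numerator / (denominator + EPS)
```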

5.2 Loss convergence is slow

Slow loss convergence means that the loss oscillates and converges slowly, taking a long time to reach the expected value or never converging to it at all. Compared with a runaway loss, slow convergence has less obvious numerical characteristics and is harder to locate. The localization approach is as follows:

  1. Review the script, model structure, and data:

1) Check whether any hyperparameter has an unreasonably large or small value. In particular, check the learning rate: a learning rate that is too small slows down convergence, while one that is too large causes the loss to oscillate and fail to decrease;

2) Check whether the model structure is implemented correctly, especially whether the loss function and the optimizer are implemented correctly;

3) Check whether the range of the input data is normal, especially whether the input data values are too small.

  2. Observe the parameter distribution histograms in the training dashboard to check whether parameter updates show obvious anomalies. If an anomaly is found, use the debugger to locate its cause.
  3. Use the debugger module to inspect the training session:

1) Use the "check weight change too small" and "check weight unchanged" conditions to monitor the trainable (unfrozen) weights and check whether they change too little. If the weight changes are found to be too small, further check whether the learning rate is too small, whether the optimizer algorithm is implemented correctly, and whether the gradient vanishes, and fix accordingly.

2) Use the "check gradient vanishing" condition to monitor the gradients and check whether gradient vanishing occurs. If so, trace the cause further upstream; for example, the "check activation value range" condition can be used to check whether problems such as saturated activations or all-zero ReLU outputs have occurred (see the sketch below).
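As a complement to the debugger's activation checks, the following offline sketch estimates how "dead" a ReLU activation is from a dumped tensor. The file name is an illustrative assumption; the tensor could come, for example, from a TensorSummary record or a debugger export.

```python
import numpy as np

activation = np.load("relu_output.npy")       # illustrative dump of a ReLU output
zero_ratio = float(np.mean(activation == 0))  # fraction of dead ReLU outputs
print(f"fraction of zero activations: {zero_ratio:.2%}")
if zero_ratio > 0.9:
    print("Most ReLU outputs are zero; check weight init, learning rate and gradients.")
```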

5.3 Other Loss Phenomena

If the loss on the training set is 0, it generally indicates that the model has overfitted; please try increasing the size of the training set.

06 Check whether the accuracy meets expectations

MindInsight can record the accuracy results of each training run for the user. When the same SummaryCollector instance is used in model.train and model.eval, the model evaluation metrics are recorded automatically. After training, MindInsight's model traceability module can be used to check whether the accuracy of the training results reaches the standard.
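A minimal sketch of this pattern is shown below (assuming MindSpore 1.x imports; net, loss_fn, optimizer, train_ds, and eval_ds are assumed to exist). Reusing the same SummaryCollector instance is what allows the evaluation metrics to appear alongside the training information in the traceability pages.

```python
from mindspore import Model
from mindspore.train.callback import SummaryCollector

model = Model(net, loss_fn, optimizer, metrics={"accuracy"})
summary_collector = SummaryCollector(summary_dir="./summary_dir")

# Use the same SummaryCollector instance for training and evaluation so that the
# evaluation metrics are recorded together with the training information.
model.train(10, train_ds, callbacks=[summary_collector])
metrics = model.eval(eval_ds, callbacks=[summary_collector])
print(metrics)  # e.g. a dict such as {'accuracy': ...}
```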

6.1 Check the accuracy on the training set

If the model's loss value and metric values on the training set do not meet expectations, the following ideas can be used for localization and optimization:

  1. Review the code, model structure, input data, and loss curve:

1) Check the script for hyperparameters with unreasonable values;

2) Check whether the model structure is implemented correctly;

3) Check whether the input data is correct;

4) Check whether the convergence result and convergence trend of the loss curve are abnormal.

  2. Try using MindInsight's traceability analysis feature to optimize the hyperparameters. The traceability analysis page analyzes the importance of each hyperparameter; users should give priority to adjusting the hyperparameters with high importance. The relationship between a hyperparameter and the optimization target can be observed in the scatter plot, so that its value can be adjusted accordingly.
  3. Try using the MindInsight hyperparameter tuner to optimize the hyperparameters. Note that the tuner searches for hyperparameters by performing multiple complete training runs, so it takes several times as long as a single training run; if training takes a long time, the hyperparameter search will also take a long time. Tuner tutorial: https://www.mindspore.cn/tuto…
  4. Try using the MindInsight model explanation feature to optimize the model and the dataset. The model explanation feature can visually show, via saliency maps, which regions matter most to the classification result, and can also indicate through its scoring system which classes and labels should be optimized.

Model explanation tutorial:

https://www.mindspore.cn/tuto…

  5. Try to optimize the model structure and algorithm.

6.2 Check the accuracy on the validation set

If neither the training set accuracy nor the validation set accuracy meets expectations, check the training set accuracy first, as described in the previous section. If the training set accuracy meets expectations but the validation set accuracy does not, the model has most likely overfitted. The handling approach is as follows:

  1. Check the evaluation logic of the validation set evaluation script for errors, in particular whether the data processing is consistent with the training set, whether the inference algorithm is wrong, and whether the correct model checkpoint is loaded.
  2. Increase the amount of data, including increasing the sample size, data augmentation, and perturbation.
  3. Apply regularization. Common techniques include parameter norm penalties (such as adding a regularization term to the objective function), parameter sharing (forcing two components of the model to share the same parameter values), early stopping of training, etc. (see the sketch below).
  4. Appropriately reduce the size of the model, for example by reducing the number of convolutional layers.
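As an illustration of item 3 above, here is a hedged sketch using common MindSpore APIs: an L2-style parameter norm penalty via the optimizer's weight_decay, plus periodic checkpointing so an earlier (pre-overfitting) model can be restored, which approximates early stopping. The specific values are illustrative assumptions.

```python
import mindspore.nn as nn
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig

# net is assumed to be defined elsewhere.
# weight_decay adds an L2 penalty on the parameters (a parameter norm penalty).
optimizer = nn.Momentum(net.trainable_params(), learning_rate=0.01,
                        momentum=0.9, weight_decay=1e-4)

# Periodically save checkpoints; if validation accuracy starts to drop, training can
# be stopped and an earlier checkpoint restored (a simple form of early stopping).
ckpt_cb = ModelCheckpoint(prefix="model", directory="./ckpt",
                          config=CheckpointConfig(save_checkpoint_steps=500,
                                                  keep_checkpoint_max=10))
```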

6.3 Check the accuracy on the test set

If neither the validation set accuracy nor the test set accuracy meets expectations, check the validation set accuracy first, as described in the previous section. If the validation set accuracy meets expectations but the test set accuracy does not, then, given that the test set contains new data the model has never seen, the cause is usually that the data distribution of the test set is inconsistent with that of the training set. The handling approach is as follows:

  1. Check the evaluation logic of the test set evaluation script for errors, in particular whether the data processing is consistent with the training set, whether the inference algorithm is wrong, and whether the correct model checkpoint is loaded.
  2. Check the quality of the test set data, for example whether its distribution range is significantly different from that of the training set, and whether it contains a large amount of noise, missing values, or outliers.

07 Summary

Since the same phenomenon can have multiple possible causes, locating accuracy problems relies heavily on expert experience. We hope the localization methods and features above provide good guidance, help you accumulate successful experience, and make you a master of accuracy tuning.

Follow us to be the first to learn about Huawei Cloud's latest technologies.