Abstract: In this article, we summarize debugging and tuning guidelines for common precision problems and share them as a series of articles called "MindSpore Model Precision Tuning in Practice", to help you locate precision problems easily and optimize model accuracy quickly.

This article is shared from the Huawei Cloud community post "Tech Deep Dive | Locate Accuracy Problems Faster! MindSpore Model Accuracy Tuning in Practice (Part 1)", original author: HWCloudAI.

Introduction:

During model development, accuracy often falls short of expectations. To help you debug and tune your models, we have created a visual debugging and tuning component for MindSpore: MindInsight.

In addition, we have also compiled debugging and tuning guidelines for common precision problems and shared them as a series of articles called "MindSpore Model Precision Tuning in Practice", to help you locate precision problems easily and optimize model accuracy quickly.

This article is the first in the series. It briefly introduces common accuracy problems, analyzes their typical phenomena and causes, and gives an overall tuning approach. The series assumes that your script already runs and computes loss values; if the script does not run, please fix it first based on the relevant error messages. In precision tuning practice, spotting abnormal phenomena is relatively easy, but if we are not sensitive enough when interpreting those anomalies, we may miss the root cause. This article explains common accuracy problems to increase your sensitivity to anomalies and help you locate accuracy problems faster.

01 Common phenomena and causes of accuracy problems

Model precision problems differ from general software problems, and the localization cycle is usually longer. In an ordinary program, when the output does not match the expected output, there is a bug (code error). For a deep learning model, however, the reasons why accuracy fails to meet expectations are more complex and more numerous. Because a long training run is needed to see the final accuracy, locating accuracy problems usually takes even longer.

1.1 Common phenomena

The direct phenomena of accuracy problems generally show up in the loss (model loss value) and the metrics (model evaluation metrics). Loss phenomena typically manifest as: (1) the loss runs away, producing NaN, +/-INF, or extreme values; (2) the loss does not converge, or converges slowly; (3) the loss is 0, etc. Abnormal metrics generally mean that a metric of the model, such as accuracy or precision, fails to meet expectations.

The direct phenomena of precision problems are relatively easy to observe. With the help of visualization tools such as MindInsight, more phenomena can be observed on tensors such as gradients, weights, and activation values. Common phenomena include: (1) gradient vanishing; (2) gradient explosion; (3) weights not updating; (4) weight changes that are too small; (5) weight changes that are too large; (6) activation value saturation, etc.

1.2 Common Causes

Every effect has a cause. Behind each phenomenon lies the cause of the precision problem, which can be roughly divided into hyperparameter problems, model structure problems, data problems, algorithm design problems, and other categories:

1.2.1 Hyperparameter problems

Hyperparameters are the lubricant between the model and the data; their choice directly affects how well the model fits the data. Common hyperparameter problems are as follows:

1) The learning rate is set unreasonably (too large or too small)

2) The loss_scale parameter is set unreasonably

3) The weight initialization parameters are set unreasonably, etc.

4) The number of epochs is too large or too small

5) The batch size is too large

The learning rate is too large or too small. The learning rate is arguably the most important hyperparameter in model training. If the learning rate is too large, the loss oscillates and cannot converge to the expected value; if it is too small, the loss converges slowly. The learning rate strategy should be chosen reasonably based on theory and experience.
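
As an illustrative sketch (API names follow MindSpore 1.x; the network, step counts, and decay values are placeholders rather than recommendations), a decaying learning-rate schedule can be passed to the optimizer instead of a single fixed value:

```python
import mindspore.nn as nn

# A toy network; in practice `net` is your own model (placeholder here).
net = nn.Dense(32, 10)

# Build a per-step learning-rate list that decays exponentially.
lr_list = nn.exponential_decay_lr(
    learning_rate=0.1,   # initial learning rate (illustrative)
    decay_rate=0.9,      # decay factor applied every `decay_epoch` epochs
    total_step=1000,     # total number of training steps
    step_per_epoch=100,  # steps per epoch
    decay_epoch=1,       # decay once per epoch
)

# The optimizer accepts the per-step schedule in place of a scalar learning rate.
optimizer = nn.Momentum(net.trainable_params(), learning_rate=lr_list, momentum=0.9)
```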

The number of epochs is too large or too small. The number of epochs directly affects whether the model underfits or overfits. If the number of epochs is too small, the model stops training before it reaches a good solution, leading to underfitting. If the number of epochs is too large, the model trains for too long, easily overfits the training set, and fails to achieve the best effect on the test set. The number of epochs should be chosen reasonably based on how the model performs on the validation set during training.

The batch size is too large. When the batch size is too large, the model may fail to converge to a good minimum, which reduces its generalization ability.

1.2.2 Data problems

A. Dataset problems

The quality of the dataset determines the upper limit of what the algorithm can achieve. If data quality is poor, even the best algorithm can hardly achieve good results. Common dataset problems are as follows:

1) Too many missing values in the data

2) The number of samples in each category is unbalanced

3) Outliers exist in the data

4) Insufficient training samples

5) Data labeling errors

Missing values and outliers in the dataset cause the model to learn wrong data relationships. In general, data with missing values or outliers should be removed from the training set, or reasonable default values should be set. Data labeling errors are a special case of outliers, but they are particularly destructive to training, so such problems should be identified in advance, for example by sampling and inspecting the data fed into the model.
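
As a minimal sketch of this kind of cleaning (plain NumPy; the synthetic data and the 3-sigma rule are illustrative assumptions, not a MindSpore API), rows with missing values or obvious outliers can be filtered before they enter the training set:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical raw feature matrix: 1000 samples, 2 feature dimensions.
features = rng.normal(size=(1000, 2))
features[10, 0] = np.nan    # inject a missing value
features[20, 1] = 100.0     # inject an obvious outlier

# Drop rows containing missing values.
clean = features[~np.isnan(features).any(axis=1)]

# Drop rows outside 3 standard deviations in any dimension (simple 3-sigma rule).
mean, std = clean.mean(axis=0), clean.std(axis=0)
clean = clean[(np.abs(clean - mean) <= 3 * std).all(axis=1)]
```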

An unbalanced number of samples across categories means that the sample counts of the dataset's categories differ greatly. For example, if in an image classification dataset (training set) most categories have 1000 samples each but the "cat" category has only 100, the dataset can be considered unbalanced. An unbalanced sample size leads to poor prediction performance for the categories with few samples. If there is an imbalance, the sample size of the under-represented categories should be increased as appropriate. As a rule of thumb, supervised deep learning algorithms generally achieve acceptable performance with about 5,000 labeled samples per category, and when a dataset contains more than 10 million labeled samples, model performance can exceed that of humans.
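
A quick way to detect such an imbalance before training is simply to count the labels; the sketch below (plain Python, with a hypothetical labels list) prints per-class sample counts so under-represented categories can be augmented or re-sampled:

```python
from collections import Counter

# Hypothetical label list extracted from the training set.
labels = ["dog"] * 1000 + ["car"] * 950 + ["cat"] * 100

counts = Counter(labels)
total = sum(counts.values())
for cls, n in counts.most_common():
    print(f"{cls}: {n} samples ({n / total:.1%})")
# A category whose share is far below the others (here "cat") likely needs
# more samples, oversampling, or a class-weighted loss.
```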

Insufficient training samples means the training set is too small relative to the model capacity. Insufficient training samples make training unstable and prone to overfitting. If the number of model parameters is out of proportion to the number of training samples, consider adding training samples or reducing the complexity of the model.

B. Data processing problems

Common data processing problems are as follows:

1) Problems in the data processing algorithm itself

2) Incorrect data processing parameters, etc.

3) The data is not normalized or standardized

4) The data processing method at inference time is inconsistent with that of the training set

5) The dataset is not shuffled

Data not being normalized or standardized means that the data fed into the model is not on the same scale across dimensions. In general, the model expects the data in each dimension to lie between -1 and 1, with a mean of 0. If the scales of two dimensions differ by orders of magnitude, the training effect of the model may suffer, so the data should be normalized or standardized. Inconsistency between the data processing method and the training set means that the processing used when running inference with the model differs from that used during training. For example, using scaling, cropping, or normalization parameters that differ from those of the training set causes the data distribution at inference time to deviate from the training distribution, which may reduce the model's inference accuracy. Note: some data augmentation operations (such as random rotation and random cropping) are generally applied only to the training set; data augmentation is not needed for inference.
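
The sketch below illustrates this with MindSpore dataset transforms (module paths follow MindSpore 1.x and may differ in other versions; the mean/std values and dataset paths are placeholders): normalization parameters must be identical for training and inference, while random augmentation appears only in the training pipeline.

```python
import mindspore.dataset as ds
import mindspore.dataset.vision.c_transforms as C

# Identical normalization parameters for training and inference (placeholder values).
mean = [0.485 * 255, 0.456 * 255, 0.406 * 255]
std = [0.229 * 255, 0.224 * 255, 0.225 * 255]

train_trans = [
    C.Decode(),
    C.Resize(256),
    C.RandomCrop(224),            # random augmentation: training only
    C.RandomHorizontalFlip(),     # random augmentation: training only
    C.Normalize(mean=mean, std=std),
    C.HWC2CHW(),
]
eval_trans = [
    C.Decode(),
    C.Resize(256),
    C.CenterCrop(224),            # deterministic preprocessing for inference
    C.Normalize(mean=mean, std=std),  # same normalization as training
    C.HWC2CHW(),
]

train_set = ds.ImageFolderDataset("path/to/train").map(operations=train_trans, input_columns="image")
eval_set = ds.ImageFolderDataset("path/to/val").map(operations=eval_trans, input_columns="image")
```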

The dataset not being shuffled means that the data are not shuffled during training. If shuffling is not performed, or is insufficient, the model is always updated in the same data order, which severely limits the choice of gradient optimization directions, narrows the space of possible convergence points, and makes overfitting more likely.
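
For example (a minimal sketch; the dataset path, buffer size, and batch size are illustrative values), shuffling can be enabled explicitly when building the MindSpore dataset pipeline:

```python
import mindspore.dataset as ds

# Placeholder dataset path; ImageFolderDataset is just one example source.
dataset = ds.ImageFolderDataset("path/to/train", shuffle=True)

# A global shuffle can also be applied explicitly; the buffer should be large
# enough (ideally close to the full dataset size) for the shuffle to be thorough.
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32, drop_remainder=True)
```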

1.2.3 Algorithm problems

The algorithm itself has defects, causing the accuracy to fall short of expectations.

A. API usage issues

Common API usage problems are as follows:

1. An API is used without following the MindSpore API constraints

2. The MindSpore construct constraints are not followed when building the network in Graph mode

Using an API without following the MindSpore constraints means that the API used does not match the real application scenario. For example, in scenarios where the divisor may contain zeros, consider using DivNoNan instead of Div to avoid division-by-zero problems. Another example: in MindSpore, the first parameter of Dropout is the keep probability, which is the opposite of other frameworks (where it is the drop probability), so be careful when using it.
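
The sketch below illustrates both points (behaviour as in MindSpore 1.x; in later versions the Dropout argument was renamed, so check your version's documentation):

```python
import numpy as np
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore import Tensor

x = Tensor(np.array([2.0, 4.0, 6.0], dtype=np.float32))
y = Tensor(np.array([1.0, 0.0, 2.0], dtype=np.float32))  # divisor contains a zero

# Div would produce inf/nan at the zero divisor; DivNoNan returns 0 there instead.
print(ops.DivNoNan()(x, y))   # [2. 0. 3.]

# In MindSpore 1.x, the first argument of nn.Dropout is the KEEP probability,
# the opposite of frameworks whose argument is the DROP probability.
dropout = nn.Dropout(keep_prob=0.9)   # keeps about 90% of activations during training
```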

Building a graph without following the MindSpore construct constraints means that the network in Graph mode does not follow the constraints declared in the MindSpore static graph syntax support documentation. For example, MindSpore does not currently support computing gradients of functions with key-value-pair (keyword) arguments. For the full constraints, see: https://mindspore.cn/doc/note…

B. Computational graph structure problems

The computational graph structure is the carrier of the model's computation; errors in the graph structure are usually caused by coding mistakes when implementing the algorithm. Common problems in the computational graph structure are:

1. The wrong operator is used (an operator that is not suitable for the target scenario)

2. Weight sharing errors (sharing weights that should not be shared)

3. Node connection errors (a block that should be connected to the computational graph is not connected)

4. The node mode is incorrect

5. Weight freezing errors (freezing weights that should not be frozen)

6. Loss function errors

7. Optimizer algorithm errors (if you implemented the optimizer yourself), etc.

Weight sharing errors mean that weights that should be shared are not shared, or weights that should not be shared are shared. You can check for this type of problem with MindInsight.

Weight freezing errors mean that weights that should be frozen are not frozen, or weights that should not be frozen are frozen. In MindSpore, freezing weights can be achieved by controlling the params argument passed to the optimizer: parameters that are not passed to the optimizer will not be updated. You can verify weight freezing by checking the script, or by looking at the parameter distribution in MindInsight.
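
A minimal sketch of this pattern (the toy network, the "backbone" naming convention, and the optimizer settings are placeholders):

```python
import mindspore.nn as nn

class TinyNet(nn.Cell):
    """Toy model with a 'backbone' part and a 'head' part (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Dense(32, 16)
        self.head = nn.Dense(16, 10)
    def construct(self, x):
        return self.head(self.backbone(x))

net = TinyNet()

# Freeze every parameter whose name contains "backbone" by clearing requires_grad.
for param in net.trainable_params():
    if "backbone" in param.name:
        param.requires_grad = False

# trainable_params() now returns only the unfrozen parameters, so only the head
# is passed to the optimizer and updated; the backbone weights stay fixed.
optimizer = nn.Momentum(net.trainable_params(), learning_rate=0.01, momentum=0.9)
```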

Node connection errors mean that the connections between blocks in the computational graph do not match the design. If you find a node connection error, carefully check whether the script was written incorrectly.

An incorrect node mode refers to operators that distinguish between training and inference modes and whose mode must be set according to the actual situation. Typical cases include: (1) BatchNorm: the training mode of BatchNorm should be turned on during training; this switch is turned on automatically when net.set_train(True) is called; (2) Dropout: the Dropout operator should not take effect during inference.
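
A minimal sketch of switching the mode (the toy network is a placeholder, and the keep_prob argument follows MindSpore 1.x):

```python
import mindspore.nn as nn

# A toy network containing mode-sensitive layers (BatchNorm, Dropout).
net = nn.SequentialCell([
    nn.Dense(32, 16),
    nn.BatchNorm1d(16),
    nn.Dropout(keep_prob=0.9),
    nn.Dense(16, 10),
])

net.set_train(True)    # training: BatchNorm updates batch statistics, Dropout is active
# ... training loop ...

net.set_train(False)   # inference: BatchNorm uses moving statistics, Dropout is a no-op
# ... evaluation / inference ...
```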

An incorrect loss function means that the loss function is implemented incorrectly or that a reasonable loss function is not selected. For example, BCELoss and BCEWithLogitsLoss are different and should be chosen based on whether a Sigmoid function is needed.
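
The sketch below shows the pairing rule (shapes and values are illustrative; default reduction arguments may differ across MindSpore versions): BCEWithLogitsLoss takes raw logits, while BCELoss expects probabilities that have already passed through Sigmoid:

```python
import numpy as np
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore import Tensor

logits = Tensor(np.array([[1.2, -0.8]], dtype=np.float32))   # raw network outputs
labels = Tensor(np.array([[1.0, 0.0]], dtype=np.float32))

# Case 1: the network's last layer has NO Sigmoid -> use BCEWithLogitsLoss.
loss_fn_logits = nn.BCEWithLogitsLoss()
print(loss_fn_logits(logits, labels))

# Case 2: the network already applies Sigmoid -> use BCELoss on the probabilities.
probs = ops.Sigmoid()(logits)
loss_fn_probs = nn.BCELoss(reduction='mean')
print(loss_fn_probs(probs, labels))
```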

C. Weight initialization problem

The initial weight values are the starting point of model training, and unreasonable initial values affect the speed and quality of model training. Common weight initialization problems are as follows:

1. The initial weight values are all 0

2. The initial weight values of different nodes in distributed scenarios are different

Initial weight values that are all 0 means that, after initialization, the weights are all 0. This generally causes weight update problems; the weights should be initialized with random values.
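
For instance, a minimal sketch (the layer sizes and the Normal sigma are illustrative choices, not recommendations) of initializing a layer with small random values in MindSpore:

```python
import mindspore.nn as nn
from mindspore.common.initializer import Normal

# Weights drawn from N(0, 0.02^2); biases may start at zero without harm.
dense = nn.Dense(128, 10, weight_init=Normal(sigma=0.02), bias_init='zeros')
```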

Different initial weight values on different nodes in a distributed scenario means that, after initialization, weights with the same name have different values on different nodes. Normally, MindSpore performs a global AllReduce on the gradients, which ensures that the weight update amount is the same at the end of each step, so that the weights on every node stay identical at every step. If the weights of the nodes differ at initialization, they will remain in different states throughout subsequent training, which directly affects model accuracy. In distributed scenarios, the same random seed should be fixed to ensure identical initial weight values.
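
A minimal sketch (the seed value is arbitrary; mindspore.set_seed is assumed to be available in your MindSpore version): fix the global random seeds before the network and its weights are created, so that every node initializes identical weights:

```python
import random
import numpy as np
import mindspore

# Fix all relevant random seeds BEFORE the network (and its weights) is created,
# so every node in the distributed job initializes identical weights.
mindspore.set_seed(1)
np.random.seed(1)
random.seed(1)
```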

1.3 The same phenomenon can have multiple possible causes, which makes accuracy problems difficult to locate

Taking loss non-convergence as an example (as shown in the figure below), any problem that can cause activation saturation, gradient vanishing, or incorrect weight updates may lead to a non-converging loss. For example, some weights are wrongly frozen, the activation function does not match the data (e.g., the ReLU activation function is used while all input values are less than 0), or the learning rate is too small; all of these are possible reasons for a non-converging loss.

02 Overview of tuning ideas

In view of the phenomena and causes of the precision problems above, several commonly used tuning approaches are: checking the code and hyperparameters, checking the model structure, checking the input data, and checking the loss curve. If none of these reveals a problem, we can let training run to the end and check whether the accuracy (mainly the model metrics) meets expectations.

Among them, checking the model structure and hyperparameters examines the static characteristics of the model, while checking the input data and the loss curve combines static characteristics with the dynamic phenomena of training. Checking whether the accuracy meets expectations means re-examining the overall precision tuning process and considering tuning means such as adjusting hyperparameters, interpreting the model, and optimizing the algorithm.

In order to help users effectively implement the above precision tuning ideas, MindInsight provides supporting capabilities, as shown in the figure below. Stay tuned for future articles in this series that will cover precision tuning preparations, the details of each tuning idea, and how to practice these tuning ideas using the capabilities of MindInsight.

03 Precision problem checklist

Finally, we summarize the common accuracy problems in a checklist for your convenience:

