Abstract: This series, "MindSpore Model Accuracy Tuning in Practice", collects debugging and tuning guidance for common accuracy problems, to help you locate accuracy problems easily and optimize model accuracy quickly.
This article is shared from the Huawei Cloud Community post "Technical Insights | Locate Accuracy Problems Faster! MindSpore Model Accuracy Tuning in Practice (1)", original author: HWCloudAI.
During model development, accuracy that falls short of expectations is a frequent headache. To help you debug and tune models, we built a visual debugging and tuning component for MindSpore: MindInsight.
We have also compiled a debugging and tuning guide for common accuracy problems, which we will share as the article series "MindSpore Model Accuracy Tuning in Practice", to help you locate accuracy problems easily and optimize model accuracy quickly.
This article is the first in the series. It briefly introduces common accuracy problems, analyzes their typical phenomena and causes, and outlines an overall tuning approach. The series assumes that your script already runs and produces a loss value; if the script still fails to run, fix it first by following the relevant error message. In accuracy tuning practice, anomalies are relatively easy to spot. However, if we are not sensitive enough to an anomaly and cannot explain it, we will still miss the root cause. By explaining common accuracy problems, this article aims to sharpen your sensitivity to abnormal phenomena and help you locate accuracy problems faster.
01 Common phenomena and causes of accuracy problems
Model accuracy problems differ from ordinary software problems, and they generally take longer to locate. In an ordinary program, output that does not match expectations means there is a bug (a coding error). For a deep learning model, however, accuracy that falls short of expectations can have more complicated causes and many more possible explanations. In addition, because a model must train for a long time before its final accuracy is visible, locating an accuracy problem usually takes even longer.
1.1 Common phenomena
The direct symptoms of accuracy problems generally show up in the loss (the model's loss value) and the metrics (the model's evaluation metrics). Typical loss symptoms are: (1) the loss runs away, with NaN, +/-INF, or extremely large values appearing; (2) the loss does not converge or converges slowly; (3) the loss is 0. Typical metric symptoms are that accuracy, precision, or other metrics fall short of expectations.
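The loss symptoms listed above can be checked automatically on a recorded loss curve. The following is a minimal sketch, not a MindSpore or MindInsight API: the function name `diagnose_loss`, the symptom labels, and the thresholds (a factor of 10 for "runaway", 1% relative improvement for "not decreasing") are assumptions chosen only for illustration.

```python
import math

def diagnose_loss(losses, rel_tol=0.01):
    """Return symptom labels for a list of per-step loss values."""
    symptoms = []
    if any(math.isnan(v) for v in losses):
        symptoms.append("nan")
    if any(math.isinf(v) for v in losses):
        symptoms.append("inf")

    finite = [v for v in losses if math.isfinite(v)]
    if finite and all(v == 0.0 for v in finite):
        symptoms.append("zero")
    elif len(finite) >= 8:
        # Compare the mean of the first and last quarter of the curve.
        q = len(finite) // 4
        head = sum(finite[:q]) / q
        tail = sum(finite[-q:]) / q
        if tail > 10 * abs(head):
            symptoms.append("runaway")         # hypothetical threshold
        elif head - tail <= rel_tol * abs(head):
            symptoms.append("not_decreasing")  # hypothetical threshold
    return symptoms
```

A healthy, steadily decreasing curve returns an empty list. Any returned label is only a hint about where to look next, not a diagnosis of the cause.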
These direct symptoms are easy to observe. With the help of visualization tools such as MindInsight, you can also observe further symptoms on tensors such as gradients, weights, and activation values. Common ones include: (1) vanishing gradients; (2) exploding gradients; (3) weights that are not updated; (4) weight changes that are too small; (5) weight changes that are too large; (6) saturated activation values.
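Several of the tensor-level symptoms above boil down to comparing norms. As a rough sketch, assuming you can dump one layer's weights before and after a step together with its gradients as flat lists, the check might look like this. The function `diagnose_layer` and all thresholds are hypothetical values for illustration, not recommendations from MindInsight, and activation saturation is not covered.

```python
import math

def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))

def diagnose_layer(weights_before, weights_after, grads,
                   vanish_below=1e-7, explode_above=1e3,
                   ratio_low=1e-5, ratio_high=1e-1):
    """Flag gradient and weight-update symptoms for a single layer."""
    symptoms = []
    grad_norm = l2_norm(grads)
    if grad_norm < vanish_below:
        symptoms.append("gradient_vanishing")
    elif grad_norm > explode_above:
        symptoms.append("gradient_exploding")

    update_norm = l2_norm([a - b for a, b in zip(weights_after, weights_before)])
    weight_norm = l2_norm(weights_before)
    if update_norm == 0.0:
        symptoms.append("weight_not_updated")
    elif weight_norm > 0.0:
        # Size of the update relative to the size of the weights.
        ratio = update_norm / weight_norm
        if ratio < ratio_low:
            symptoms.append("update_too_small")
        elif ratio > ratio_high:
            symptoms.append("update_too_large")
    return symptoms
```

Looking at the update relative to the weight norm, rather than at the raw update, makes the check less dependent on the scale of each layer.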
1.2 Common causes
Every symptom has a cause. The causes behind accuracy problems can be roughly divided into hyperparameter problems, model structure problems, data problems, and algorithm design problems:
1.2.1 Hyperparameter problem
Hyperparameters are the lubricant between the model and the data. The choice of hyperparameters directly affects how well the model fits the data. Common hyperparameter problems are as follows:
1) The learning rate is set unreasonably (too large or too small)
2) The loss_scale parameter is unreasonable
3) The weight initialization parameters are unreasonable
4) The number of epochs is too large or too small
5) The batch size is too large
The learning rate is too large or too small. The learning rate is arguably the most important hyperparameter in model training. If it is too large, the loss oscillates and fails to converge to the expected value. If it is too small, the loss converges slowly. Choose the learning rate strategy based on theory and experience.
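The effect of the learning rate can be seen even on a toy problem. The sketch below minimizes f(x) = x**2 with plain gradient descent. The function and the three learning rates are assumptions picked for illustration, and real models behave less cleanly.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Minimize f(x) = x**2 with plain gradient descent; return the loss per step."""
    x = x0
    losses = []
    for _ in range(steps):
        losses.append(x * x)
        x -= lr * 2 * x  # the gradient of x**2 is 2x
    return losses

too_large = gradient_descent(lr=1.1)     # x flips sign and grows every step
too_small = gradient_descent(lr=0.001)   # the loss barely moves in 20 steps
reasonable = gradient_descent(lr=0.3)    # the loss shrinks quickly toward 0
```

With the largest learning rate the loss oscillates with growing amplitude, and with the smallest it is still close to its initial value after 20 steps, which mirrors the two failure modes described above.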
The number of epochs is too large or too small. The number of epochs directly affects whether the model underfits or overfits. If there are too few epochs, training stops before the model reaches the optimal solution, and the model tends to underfit. If there are too many, training takes too long and the model tends to overfit the training set, so it does not achieve the best results on the test set. Choose the number of epochs according to how the model's performance on the validation set changes during training.
The batch size is too large. When the batch size is too large, the model may fail to converge to a good minimum, which reduces its generalization ability.
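One common way to choose the number of epochs from validation results is early stopping. This is a minimal sketch under the assumption that you record one validation loss per epoch. The name `best_epoch` and the `patience` value are hypothetical, and this is not a MindSpore callback.

```python
def best_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops once the validation loss has not improved
    for `patience` consecutive epochs.
    """
    best = float("inf")
    best_idx = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_idx = loss, epoch
        elif epoch - best_idx >= patience:
            return epoch, best_idx
    return len(val_losses) - 1, best_idx

val_losses = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7, 0.9]
stop_at, keep = best_epoch(val_losses)
```

In this example the validation loss bottoms out at epoch 3 and then rises, so training would stop at epoch 6 and the checkpoint from epoch 3 would be kept.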
1.2.2 Data issues
a. Data set problem
The quality of the data set determines the upper limit of the algorithm's effect. If the data quality is poor, no matter how good the algorithm is, it is difficult to get good results. Common data set problems are as follows:
1) Too many missing values in the data
2) The number of samples in each category is not balanced
3) There are outliers in the data
4) Insufficient training samples
5) The label of the data is wrong
Missing values and outliers in the data set cause the model to learn wrong relationships in the data. Generally speaking, samples with missing values or outliers should be removed from the training set, or the affected fields should be given a reasonable default value. Wrong labels are a special case of outliers, but they are more destructive to training. Identify such problems in advance by spot-checking the data that is fed to the model.
An unbalanced number of samples per category means that the categories in the data set differ greatly in size. For example, if most categories in an image classification training set have 1000 samples but the category "cat" has only 100, the data set can be considered imbalanced. Imbalance makes the model predict poorly on the categories with few samples. If the data set is imbalanced, add samples to the small categories as appropriate. As a rough rule of thumb, a supervised deep learning algorithm generally reaches acceptable performance with around 5000 labeled samples per category, and can match or exceed human performance when the data set contains more than 10 million labeled samples.
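The imbalance in the example above can be measured and, as a stopgap, reduced by duplicating samples of the small categories. The sketch below assumes random duplication, which is the simplest form of oversampling. The helper names are hypothetical, and collecting more real samples or augmenting the data is usually preferable when it is possible.

```python
import random
from collections import Counter

def imbalance_ratio(labels):
    """Size of the largest category divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def oversample(samples, labels, seed=0):
    """Duplicate random samples until every category matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels

labels = ["dog"] * 1000 + ["cat"] * 100
samples = list(range(len(labels)))  # stand-ins for real images
ratio = imbalance_ratio(labels)
new_samples, new_labels = oversample(samples, labels)
```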
Insufficient training samples means that the training set is too small relative to the model capacity. It leads to unstable training and makes the model prone to overfitting. If the number of model parameters is out of proportion to the number of training samples, consider adding training samples or reducing the model complexity.
b. Data processing problems
Common data processing problems are as follows:
1) Errors in the data processing algorithm
2) Incorrect data processing parameters
3) The data is not normalized or standardized
4) Data processing at inference is inconsistent with that used for the training set
5) The data set is not shuffled
Data that is not normalized or standardized means that the data fed to the model is not on the same scale in every dimension. Generally speaking, the model expects the data in each dimension to lie between -1 and 1 with a mean of 0. If the scales of two dimensions differ by orders of magnitude, the training result may suffer, and the data needs to be normalized or standardized.
Inconsistent data processing means that the processing applied when using the model for inference differs from the processing applied to the training set. For example, if the resize, crop, and normalization parameters of the images differ from those used for the training set, the data distribution at inference differs from the distribution during training, which may reduce the inference accuracy of the model. Note: some data augmentation operations (such as random rotation and random cropping) are generally applied only to the training set and are not needed during inference.
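Both points above, scaling the data and keeping inference processing consistent with training, can be illustrated with one feature column. This is a minimal sketch assuming standardization (zero mean and unit variance, which does not strictly bound values to the range -1 to 1). The function names are hypothetical. The key design choice is that the statistics are computed once on the training set and then reused unchanged at inference.

```python
import statistics

def fit_standardizer(train_column):
    """Compute mean and standard deviation from training data only."""
    mean = statistics.mean(train_column)
    std = statistics.pstdev(train_column)
    if std == 0:
        std = 1.0  # avoid dividing by zero for a constant feature
    return mean, std

def standardize(column, mean, std):
    return [(v - mean) / std for v in column]

train_feature = [100.0, 200.0, 300.0]
mean, std = fit_standardizer(train_feature)
train_scaled = standardize(train_feature, mean, std)

# At inference, reuse the training statistics instead of recomputing them.
infer_scaled = standardize([200.0], mean, std)
```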
An unshuffled data set means that the samples are not shuffled during training. Without shuffling, or with insufficient shuffling, the model is always updated with the data in the same order. This severely limits the directions available to gradient optimization, leaves fewer possible convergence points, and makes overfitting more likely.
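As a small illustration of shuffling, the sketch below draws a different sample order for every epoch while staying reproducible. The function `epoch_order` and the way the seed is derived from the epoch number are assumptions for illustration. In practice you would normally rely on the shuffle option of your data loading pipeline.

```python
import random

def epoch_order(num_samples, epoch, seed=42):
    """Return a reproducible, epoch-specific permutation of sample indices."""
    rng = random.Random(seed + epoch)
    order = list(range(num_samples))
    rng.shuffle(order)
    return order

first_epoch = epoch_order(100, epoch=0)
second_epoch = epoch_order(100, epoch=1)
```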
1.2.3 Algorithm problems
The algorithm itself is flawed and the accuracy cannot reach expectations.
a. API usage problem
Common API usage issues are as follows:
1. The API is used without following MindSpore constraints
2. The MindSpore construct constraints are not followed when building the graph
API usage that does not follow MindSpore constraints means that the API is used in a way that does not match the actual application scenario. For example, when the divisor may contain zeros, consider using DivNoNan instead of Div to avoid division by zero. As another example, in MindSpore the first parameter of DropOut is the probability of keeping an element, which is the opposite of other frameworks (where it is the probability of dropping an element), so take care when using it.
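The two examples above can be made concrete without any framework. The sketch below is plain Python that only imitates the behavior described in the text. `div_no_nan` and `keep_prob_from_drop_rate` are hypothetical helpers, not the MindSpore operators themselves.

```python
def div_no_nan(x, y):
    """Elementwise x / y that yields 0.0 wherever the divisor is 0."""
    return [a / b if b != 0 else 0.0 for a, b in zip(x, y)]

def keep_prob_from_drop_rate(drop_rate):
    """Convert a drop probability into the corresponding keep probability."""
    return 1.0 - drop_rate

safe = div_no_nan([1.0, 2.0, 3.0], [2.0, 0.0, 4.0])

# A script ported from a framework that drops 25% of the elements
# needs a keep probability of 0.75 here, not 0.25.
keep_prob = keep_prob_from_drop_rate(0.25)
```

Passing a drop rate where a keep probability is expected silently changes how much of the activation is discarded, which is why this kind of mistake tends to show up as an accuracy problem rather than as an error message.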
Graph construction that does not follow the MindSpore construct constraints means that a network running in graph mode violates the constraints declared in the MindSpore static graph syntax support documentation. For example, MindSpore currently does not support computing the backward pass of functions that take key-value pair parameters. For the complete constraints, see: https://mindspore.cn/doc/note/zh-CN/master/static_graph_syntax_support.html
b. Computational graph structure problem
The computational graph structure is the carrier of the model's computation. Errors in the graph structure generally come from mistakes made in the code when implementing the algorithm. Common computational graph structure problems are:
1. Wrong operator (operator used is not suitable for the target scene)
2. Weight sharing error (weights that should not be shared are shared)
3. Node connection error (a block that should be connected in the computational graph is not connected)
4. The node mode is incorrect
5. Weight freeze error (weights that should not be frozen are frozen)
6. The loss function is wrong
7. Optimizer algorithm error (if you implement the optimizer yourself), etc.
A weight sharing error means that weights that should be shared are not shared, or weights that should not be shared are shared. You can check for this type of problem with the computational graph visualization in MindInsight.
A weight freeze error means that weights that should be frozen are not frozen, or weights that should not be frozen are frozen. In MindSpore, weights can be frozen by controlling the params argument passed to the optimizer: parameters that are not passed to the optimizer are not updated. To confirm which weights are frozen, check the script or view the parameter distribution graph in MindInsight.
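The idea that only parameters handed to the optimizer get updated can be shown with a toy optimizer. The sketch below is framework-free: the `Param` and `SGD` classes and the parameter names are hypothetical stand-ins, not MindSpore classes.

```python
class Param:
    def __init__(self, name, value):
        self.name = name
        self.value = value
        self.grad = 0.0

class SGD:
    """Toy optimizer: it only ever touches the parameters it was given."""
    def __init__(self, params, lr):
        self.params = list(params)
        self.lr = lr

    def step(self):
        for p in self.params:
            p.value -= self.lr * p.grad

all_params = [Param("backbone.weight", 1.0), Param("head.weight", 1.0)]

# Freeze the backbone by leaving its parameters out of the optimizer.
trainable = [p for p in all_params if not p.name.startswith("backbone.")]
optimizer = SGD(trainable, lr=0.1)

for p in all_params:
    p.grad = 1.0  # pretend every parameter received a gradient
optimizer.step()
```

After the step the backbone weight is unchanged while the head weight has moved. A filter that is too broad or too narrow produces exactly the freeze errors described above, so it is worth printing the names of the parameters that are actually passed to the optimizer.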
A node connection error means that the way the blocks are connected in the computational graph is inconsistent with the design. If you find a node connection error, carefully check whether the script contains a mistake.
An incorrect node mode concerns operators that behave differently in training and inference; their mode must be set to match the actual situation. Typical examples: (1) the BatchNorm operator: its training mode should be turned on during training, and this switch is turned on automatically when net.set_train(True) is called; (2) the DropOut operator: it should not be used during inference.
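To see why the mode matters, consider a framework-free sketch of dropout with a `training` flag. It assumes the common "inverted dropout" formulation, in which kept elements are scaled by 1 / keep_prob. The function is a hypothetical illustration, not the MindSpore operator.

```python
import random

def dropout(values, keep_prob, training, rng=None):
    """Randomly zero elements in training mode; pass through in inference mode."""
    if not training:
        return list(values)
    rng = rng or random.Random(0)
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

train_out = dropout([1.0] * 1000, keep_prob=0.5, training=True)
infer_out = dropout([1.0, 2.0], keep_prob=0.5, training=False)
```

If the flag is left in training mode during evaluation, the outputs become random and the measured accuracy drops even though the weights are fine.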
A loss function error means that the loss function is implemented incorrectly, or that an unsuitable loss function was chosen. For example, BCELoss and BCEWithLogitsLoss are different: choose between them according to whether the loss still needs to apply the sigmoid function to its input.
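The difference between the two losses can be checked numerically. The sketch below reimplements both for a single sample in plain Python, based on their standard definitions. The function names are ours, and the logits version uses a common numerically stable formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(prob, target):
    """Binary cross entropy on a probability (sigmoid already applied)."""
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))

def bce_with_logits(logit, target):
    """Binary cross entropy on a raw logit (sigmoid applied internally)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

logit, target = 2.0, 1.0
correct_a = bce(sigmoid(logit), target)
correct_b = bce_with_logits(logit, target)

# Mistake: feeding an already-sigmoided value to the logits version
# applies the sigmoid twice and silently changes the loss.
double_sigmoid = bce_with_logits(sigmoid(logit), target)
```

The two correct variants agree up to floating-point error, while the double-sigmoid variant gives a clearly different value without raising any error.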
c. Weight initialization problem
The initial value of weight is the starting point of model training, and an unreasonable initial value will affect the speed and effect of model training. Common problems in weight initialization are as follows:
1. The initial value of the weight is all 0
2. The initial values of the weights of different nodes in distributed scenarios are different
All-zero initial weights means that every weight value is 0 after initialization. This generally causes weight update problems; weights should be initialized with random values.
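A tiny hand-differentiated network shows how all-zero weights can block updates. The sketch assumes a hypothetical two-weight model y = w2 * tanh(w1 * x) with a squared-error loss, chosen only because its gradients are easy to write down.

```python
import math

def gradients(w1, w2, x, target):
    """Gradients of 0.5 * (y - target)**2 for y = w2 * tanh(w1 * x)."""
    h = math.tanh(w1 * x)
    y = w2 * h
    dy = y - target
    grad_w2 = dy * h
    grad_w1 = dy * w2 * (1.0 - h * h) * x
    return grad_w1, grad_w2

zero_init = gradients(0.0, 0.0, x=1.5, target=1.0)
random_init = gradients(0.5, -0.3, x=1.5, target=1.0)
```

With both weights at 0, the hidden activation and the backpropagated signal are both 0, so both gradients vanish and the weights never move. With non-zero initial values both gradients are non-zero.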
Different initial weights on different nodes in a distributed scenario means that, after initialization, weights with the same name have different values on different nodes. Normally, MindSpore performs a global AllReduce on the gradients, which ensures that the weight update is the same on every node at the end of each step, so that the weights stay identical across nodes at every step. If the weights differ between nodes at initialization, they remain in different states throughout the subsequent training, which directly affects the accuracy of the model. In a distributed scenario, fix the same random seed on every node to ensure that the initial weights are consistent.
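The effect of the seed can be demonstrated with Python's own random number generator. In this sketch, `init_weights` is a hypothetical stand-in for a framework's random initializer and each call plays the role of one node.

```python
import random

def init_weights(size, seed):
    """Draw `size` initial weights from a seeded Gaussian initializer."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.01) for _ in range(size)]

node0 = init_weights(4, seed=1)
node1 = init_weights(4, seed=1)       # same seed: identical initial weights
node2_bad = init_weights(4, seed=2)   # different seed: weights diverge from step 0
```

A quick sanity check in a real job is to compare a checksum of the initial weights across nodes before training starts.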
1.3 The same phenomenon can have multiple causes, which makes accuracy problems difficult to locate
Take a loss that does not converge as an example (see the figure below). Anything that causes activation saturation, vanishing gradients, or incorrect weight updates can prevent the loss from converging. For example, some weights may be frozen by mistake, the activation function may not match the data (the ReLU activation function is used but the input values are all less than 0), or the learning rate may be too small. Any of these can be the reason the loss does not converge.
02 Overview of tuning ideas
Given the phenomena and causes described above, several commonly used tuning ideas are: check the code and hyperparameters, check the model structure, check the input data, and check the loss curve. If none of these checks reveals a problem, let the training run to the end and check whether the accuracy (mainly the model metrics) meets expectations.
Checking the model structure and hyperparameters examines the static characteristics of the model. Checking the input data and the loss curve combines static characteristics with dynamic training phenomena. Checking whether the accuracy meets expectations is a re-examination of the whole accuracy tuning process, at which point you can consider tuning methods such as adjusting hyperparameters, explaining the model, and optimizing the algorithm.
To help users apply these accuracy tuning ideas efficiently, MindInsight provides supporting capabilities, as shown in the figure below. In subsequent articles in this series, we will introduce the preparations for accuracy tuning, the details of each tuning idea, and how to use the functions of MindInsight to put these ideas into practice, so stay tuned.
03 Accuracy checklist
Finally, we put together the common accuracy problems for easy reference: