Abstract: During model development, it is often a headache when the accuracy does not meet expectations. To help users debug and tune their models, we built a visual debugging and tuning component for MindSpore: MindInsight.
This article is shared from the Huawei Cloud Community post "Technical Insights | I want both accuracy and speed in model optimization! MindSpore Model Accuracy Tuning in Practice (2)", original author: HWCloudAI.
Introduction
During model development, it is often a headache when the accuracy does not meet expectations. To help users debug and tune their models, we built a visual debugging and tuning component for MindSpore: MindInsight. We have also compiled debugging and tuning guides for common accuracy problems and will share them as the "MindSpore Model Accuracy Tuning in Practice" series of articles, hoping to help users locate accuracy problems easily and optimize model accuracy quickly.
To review earlier articles in this series, follow this link → Technical Insights | Locate accuracy problems faster! MindSpore Model Accuracy Tuning in Practice (1).
This article is the second in the series and presents commonly used ideas for debugging and tuning accuracy. The series assumes that your script already runs and produces loss values. If the script still fails to run, please fix it first based on the relevant error messages.
When encountering accuracy problems, the common debugging and tuning ideas are as follows:
- Check code and hyperparameters
- Check the model structure
- Check input data
- Check the loss curve
- Check whether the accuracy is as expected
Code is an important source of accuracy problems. Checking the code means walking through the scripts themselves, striving to find problems at the source (Section 2). The model structure reflects how MindSpore understands your code, and checking it focuses on whether MindSpore's understanding is consistent with the algorithm engineer's design (Section 3). Some problems only surface during training: checking the input data (Section 4) and the loss curve (Section 5) combines code inspection with what actually happens during training. Checking whether the accuracy meets expectations means re-examining the overall tuning process and considering techniques such as hyperparameter tuning, model explanation, and algorithm optimization (Section 6). In addition, it is important to be familiar with both the model and the tools (Section 1). These ideas are introduced one by one below.
01 Preparation for accuracy tuning
1.1 Review the algorithm design and become fully familiar with the model
Before accuracy tuning, first review the algorithm design and make sure it is clearly defined. If the model is implemented from a paper, review all the design details and hyperparameter choices in the paper; if the model is implemented by referring to another framework's script, make sure there is a single benchmark script whose accuracy meets the target; if the algorithm is newly developed, the important design details and hyperparameter choices should also be made explicit. This information is an important reference for the script-checking steps that follow.
Before accuracy tuning, you must also be fully familiar with the model. Only then can you accurately interpret the information provided by MindInsight, judge whether there is a problem, and find its source. It is therefore worth spending time to understand the model's algorithm and structure, the role of each operator and the meaning of its parameters, and the characteristics of the optimizer used. Before digging into the details of an accuracy problem, it is recommended to deepen your understanding of these model elements with the problem in mind.
1.2 Familiarize yourself with tools
MindInsight is rich in features. It is recommended that users briefly read the MindInsight tutorial (https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/visualization_tutorials.html) to understand its main capabilities. When locating accuracy problems, it is recommended to enable summary collection: add a SummaryCollector to the script and use the training dashboard to view the training data, as shown in the figure below. See https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/summary_record.html for the user guide of the Summary function, and https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/dashboard.html for the training dashboard.
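As a minimal sketch of what enabling this collection can look like (assuming a MindSpore 1.x-style API, and that a Model instance `model` and a training dataset `ds_train` are already defined elsewhere):

```python
from mindspore.train.callback import SummaryCollector

# Collect loss, hyperparameters, the computational graph, etc. into ./summary_dir
# every 10 training steps.
summary_collector = SummaryCollector(summary_dir="./summary_dir", collect_freq=10)

# Register the collector as a callback so the data is recorded during training.
model.train(epoch=10, train_dataset=ds_train, callbacks=[summary_collector])

# Then browse the collected data in the training dashboard, e.g.:
#   mindinsight start --summary-base-dir ./summary_dir
```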
When you need to debug the model online, please refer to this link to enable the debugger function:
https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/debugger.html
02 Check code and hyperparameters
Code is an important source of accuracy problems. Hyperparameter issues, model structure issues, data issues, and algorithm design or implementation issues all show up in the script, so checking the script is an efficient way to locate accuracy problems. Checking the code mainly relies on code walkthroughs; the rubber duck debugging method is recommended: during the walkthrough, patiently explain the role of each line of code to an inexperienced "rubber duck", which often sparks insight and exposes code problems. When checking the script, pay attention to whether the implementation (including data processing, model structure, loss function, optimizer, and so on) is consistent with the design. If the script is based on another script, focus on whether the two are consistent; every inconsistency should have a sufficient and reasonable justification, otherwise it should be fixed.
When checking the script, you should also pay attention to the hyperparameters. Hyperparameter problems mainly show up as unreasonable values, for example:
- Unreasonable learning rate setting;
- The loss_scale parameter is unreasonable;
- Unreasonable weight initialization parameters, etc.
MindInsight can assist in checking hyperparameters. In most cases, SummaryCollector automatically records common hyperparameters, which you can view through MindInsight's training parameter details page (shown below) and the lineage analysis page. Combining the MindInsight model lineage analysis module with the code in the script, you can confirm the hyperparameter values and identify obviously unreasonable ones. If there is a benchmark script, it is recommended to compare the hyperparameter values with the benchmark script one by one. Default parameter values should also be compared, because differences in default values across frameworks can cause accuracy drops or training errors.
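A simple way to do such a one-by-one comparison is to write down both sets of values (including any framework defaults) and diff them; the dictionaries below are purely hypothetical examples:

```python
# Hypothetical hyperparameter values collected manually from the two scripts,
# including values that are left at their framework defaults.
benchmark_hparams = {"learning_rate": 0.01, "momentum": 0.9, "weight_decay": 1e-4,
                     "batch_size": 32, "loss_scale": 1024}
current_hparams = {"learning_rate": 0.01, "momentum": 0.9, "weight_decay": 0.0,
                   "batch_size": 32, "loss_scale": 1024}

# Print every hyperparameter whose value differs between the two scripts.
for name in sorted(set(benchmark_hparams) | set(current_hparams)):
    bench, cur = benchmark_hparams.get(name), current_hparams.get(name)
    if bench != cur:
        print(f"Mismatch in {name}: benchmark={bench}, current={cur}")
```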
03 Check the model structure
In terms of model structure, common problems are:
- Incorrect operator usage (the operator used does not fit the target scenario, e.g., integer division is used where floating-point division is needed);
- Weight sharing error (weights that should not be shared are shared);
- Weight freeze error (weights that should not be frozen are frozen);
- Node connection error (a block that should be connected into the computational graph is not connected);
- loss function error;
- Optimizer algorithm error (if you implement the optimizer yourself), etc.
It is recommended to check the model structure by reviewing the model code. MindInsight can also assist: in most cases, SummaryCollector automatically records the computational graph, which can then be viewed conveniently in MindInsight. After the model script runs, use the MindInsight computational graph visualization module to view the model structure, deepen your understanding of the graph, and confirm that the structure meets expectations. If there is a benchmark script, you can also compare the computational graph against the one produced by the benchmark script and check for important differences.
Because model structures are generally complicated, it is unrealistic to expect to find all structural problems at this step. It is enough if the visualized structure deepens your understanding of the computational graph and lets you spot obvious issues. In later steps, we will return to this step for re-checking once a more specific accuracy problem has been identified.
Note 1: MindInsight supports viewing the computational graph recorded by SummaryCollector as well as the pb graph files produced by the save_graphs parameter of the MindSpore context. Please refer to the "Computational Graph Visualization" part of our tutorial for more information (https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/dashboard.html).
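For reference, a minimal sketch of enabling graph dumping via the context (assuming a MindSpore 1.x-style API) looks roughly like this:

```python
from mindspore import context

# Dump the compiled computational graphs (including pb files that MindInsight
# can display) into ./graphs; set this before building and running the network.
context.set_context(save_graphs=True, save_graphs_path="./graphs")
```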
Note 2: The script migration tool can convert models written in PyTorch or TensorFlow into MindSpore scripts. Please visit the tutorial (https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/migrate_3rd_scripts_mindconverter.html) for more information.
04 Check input data
By checking the data fed into the model, together with the script, you can determine whether there is a problem with the data processing pipeline or the dataset. Common problems with input data are:
- Too many missing values in the data;
- The number of samples in each category is not balanced;
- There are outliers in the data;
- Data label error;
- Insufficient training samples;
- The data is not standardized, and the data entered into the model is not in the correct range;
- The data processing methods of finetune and pretrain are different;
- The data processing methods in the training phase and the inference phase are different;
- Incorrect data processing parameters, etc.
MindInsight can assist in checking the input data and the data processing pipeline. In most cases, SummaryCollector automatically records the data fed into the model (i.e., data after processing) and the data processing pipeline parameters. The model input data is displayed in the "data sampling" module, and the pipeline parameters are displayed in the "data graph" and "data lineage" modules. Through the data sampling module you can inspect the (processed) data that goes into the model; if it obviously does not meet expectations (for example, the cropped region is too large or the rotation angle is too large), you can conclude that there is a problem with the input data. Through the data graph and data lineage modules you can check the processing steps and the concrete parameter values of the pipeline and find unreasonable data processing methods.
If there is a benchmark script, you can check whether the data produced by the data processing pipeline matches the data of the current script. For example, save the output of the data processing pipeline as npy files and then use numpy.allclose() to compare the benchmark script's data with the current script's data. If they differ, there may be an accuracy problem in the data processing stage.
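A rough sketch of this comparison is shown below; it assumes the pipeline output `dataset` exists, that shuffling is disabled or seeded identically in both scripts, and that the column is named "image" (all of these are assumptions to adapt to your script):

```python
import numpy as np

# Dump a few processed batches from the current script's pipeline.
for i, batch in enumerate(dataset.create_dict_iterator(output_numpy=True)):
    np.save(f"current_batch_{i}.npy", batch["image"])
    if i >= 4:  # a few batches are usually enough for a comparison
        break

# After dumping the same batches from the benchmark script, compare them:
benchmark = np.load("benchmark_batch_0.npy")
current = np.load("current_batch_0.npy")
print("pipelines match:", np.allclose(benchmark, current, rtol=1e-5, atol=1e-8))
```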
If no problems are found in the data processing pipeline, you can manually check the data set for problems such as imbalanced classification, incorrect label matching, too many missing values, and insufficient training samples.
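A quick manual check of class balance and missing values might look like the sketch below (the column names "image" and "label" are assumptions; adjust them to your dataset):

```python
from collections import Counter
import numpy as np

label_counter = Counter()
nan_batches = 0

# Iterate over the processed dataset and accumulate simple statistics.
for batch in dataset.create_dict_iterator(output_numpy=True):
    label_counter.update(batch["label"].flatten().tolist())
    nan_batches += int(np.isnan(batch["image"]).any())

print("samples per class:", dict(label_counter))
print("batches containing NaN:", nan_batches)
```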
05 Check the loss curve
Many accuracy problems will be discovered during network training. Common problems or phenomena are:
- The weight initialization is unreasonable (for example, the initial value is 0, the initial value range is unreasonable, etc.);
- There are too large or too small values in the weight;
- The weight changes too much;
- The weight freeze is incorrect;
- Incorrect weight sharing;
- The activation value is saturated or too weak (for example, the output of Sigmoid is close to 1, and the output of Relu is all 0);
- The gradient explodes or vanishes;
- Insufficient training epochs;
- NAN and INF exist in the calculation result of the operator;
- Operator calculation process overflow (overflow in the calculation process is not necessarily harmful), etc.
Some of the above problems or phenomena can be observed through the loss, while others are hard to observe directly. MindInsight provides targeted features that make these phenomena observable, automatically check for problems, and help you locate the root cause faster. For example:
- The parameter distribution graph module of MindInsight can display the trend of model weights with the training process;
- MindInsight's tensor visualization module can display the specific values of tensors and compare different tensors;
- The MindInsight debugger has a rich set of powerful built-in checks that can detect weight problems (such as weights not updating, weights updating too much, weight values too large or too small), gradient problems (such as vanishing or exploding gradients), activation value problems (such as saturated or too-weak activations), all-zero tensors, NAN/INF, overflow during operator computation, and so on.
Debugger tutorial:
https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/debugger.html
In most cases, SummaryCollector automatically records the loss curve of the model, which can be viewed in MindInsight's scalar visualization module. The loss curve reflects the dynamics of network training; by observing it you can tell, for example, whether the model has converged or overfitted.
In most cases, SummaryCollector automatically records parameter changes of the model (5 parameters are recorded by default), which can be viewed in MindInsight's parameter distribution module. If you want to record distribution histograms for more parameters, refer to the histogram_regular parameter of SummaryCollector (https://www.mindspore.cn/doc/api_python/zh-CN/master/mindspore/mindspore.train.html#mindspore.train.callback.SummaryCollector) or to the HistogramSummary operator (https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/summary_record.html#summarysummarycollector).
Tensors are not recorded automatically. If you want to view the concrete values of a tensor through MindInsight, please use the TensorSummary operator (https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/summary_record.html#summarysummarycollector).
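The sketch below illustrates both ideas: passing a histogram_regular pattern to SummaryCollector (assuming it is supplied through collect_specified_data, as in recent MindSpore versions; the pattern itself is an example) and recording an intermediate activation with the TensorSummary operator inside a hypothetical network fragment:

```python
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore.train.callback import SummaryCollector

# Record distribution histograms for all parameters whose names match the
# regular expression, instead of only the default five parameters.
summary_collector = SummaryCollector(
    summary_dir="./summary_dir",
    collect_specified_data={"histogram_regular": "^conv|^fc"})

class NetWithTensorSummary(nn.Cell):
    """Hypothetical network fragment that records an intermediate tensor."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, 5, pad_mode="valid")
        self.relu = nn.ReLU()
        self.tensor_summary = ops.TensorSummary()

    def construct(self, x):
        x = self.relu(self.conv1(x))
        # Record the activation so it can be inspected in MindInsight's
        # tensor visualization module.
        self.tensor_summary("conv1_activation", x)
        return x
```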
The following sections describe how to use MindInsight to locate accuracy problems based on common loss curve phenomena.
5.1 The loss diverges
A divergent loss means that the loss contains NAN, +/-INF, or extremely large values. It generally indicates a problem in the algorithm design or implementation. The approach to locating it is as follows:
- Review the script, model structure and data:
1) Check whether any hyperparameter has an unreasonably large or small value;
2) Check whether the model structure is implemented correctly, in particular whether the loss function is implemented correctly;
3) Check whether the input data contains missing values or particularly large/small values.
- Observe the parameter distribution charts in the training dashboard and check whether parameter updates show obvious anomalies. If an abnormal parameter update is found, the debugger can be used to locate its cause.
- Use the debugger module to inspect the training session:
1) If the loss value becomes NAN or +/-INF, use the "check tensor overflow" condition to add a global watchpoint, locate the first operator node that produces NAN or +/-INF, and check whether the operator's input data can cause a computation anomaly (such as division by zero). If the problem is in the operator's input data, a small epsilon value can be added to avoid the anomaly (see the sketch after this list).
2) If the loss value contains extremely large values, use the "check tensor too large" condition to add a global watchpoint, locate the first operator node that produces large values, and check whether the operator's input data can cause a computation anomaly. If the input data itself is abnormal, keep tracing upward to the operator that produced that data until the specific cause is located.
3) If you suspect anomalies in parameter updates or gradients, use conditions such as "check weight change too large", "check gradient vanishing", and "check gradient too large" to set watchpoints, locate the abnormal weight or gradient, and then, combined with the tensor check view, inspect the suspicious forward operators, backward operators, optimizer operators, and so on, layer by layer.
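As a minimal illustration of the epsilon fix mentioned in item 1), the hypothetical fragment below keeps a division finite when the denominator can reach zero (the epsilon value is an assumed choice, not a recommendation):

```python
import numpy as np
from mindspore import Tensor

eps = 1e-7  # assumed small constant; pick a value appropriate for the dtype in use

def safe_normalize(x, norm):
    # Adding eps to the denominator keeps the division finite even when norm == 0,
    # which would otherwise produce INF/NAN that propagates into the loss.
    return x / (norm + eps)

x = Tensor(np.array([1.0, 2.0, 3.0], np.float32))
norm = Tensor(np.array([0.0, 2.0, 4.0], np.float32))
print(safe_normalize(x, norm))
```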
5.2 The loss converges slowly
Slow loss convergence means that the loss oscillates and converges slowly, taking a long time to reach the expected value or never reaching it at all. Compared with a divergent loss, its numerical characteristics are less obvious, which makes it harder to locate. The approach is as follows:
- Review the script, model structure and data:
1) Check whether any hyperparameter has an unreasonably large or small value; in particular, check whether the learning rate is too small or too large. A learning rate that is too small leads to slow convergence, while one that is too large makes the loss oscillate and fail to decrease;
2) Check whether the model structure is implemented correctly, especially whether the loss function and optimizer are implemented correctly;
3) Check whether the range of the input data is normal, especially whether the input values are too small.
- Observe the parameter distribution charts in the training dashboard and check whether parameter updates show obvious anomalies. If an abnormal parameter update is found, the debugger can be used to locate its cause.
- Use the debugger module to inspect the training session:
1) Use the "check weight change too small" and "check weight unchanged" conditions to monitor trainable (non-frozen) weights and check whether the weights change too little. If so, further check whether the learning rate is too small, whether the optimizer algorithm is implemented correctly, and whether the gradient vanishes, and fix accordingly.
2) Use the "check gradient vanishing" condition to monitor the gradients and check whether they vanish. If they do, investigate the cause further; for example, use the "check activation value range" condition to check whether activations are saturated or the Relu output is 0.
5.3 Other loss phenomena
If the loss on the training set is 0, it generally means that the model has overfitted. Please try to increase the size of the training set.
06 Check whether the accuracy meets expectations
MindInsight can record the accuracy results of each training run. When the same SummaryCollector instance is used in both model.train and model.eval, the model evaluation (metrics) information is recorded automatically. After training, you can check whether the accuracy meets the target through MindInsight's model lineage module.
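A minimal sketch of sharing one SummaryCollector instance across training and evaluation (assuming `net`, `ds_train`, and `ds_eval` are defined elsewhere and a MindSpore 1.x-style API) might look like this:

```python
import mindspore.nn as nn
from mindspore import Model
from mindspore.train.callback import SummaryCollector

summary_collector = SummaryCollector(summary_dir="./summary_dir")

model = Model(net,
              loss_fn=nn.SoftmaxCrossEntropyWithLogits(sparse=True),
              optimizer=nn.Momentum(net.trainable_params(), 0.01, 0.9),
              metrics={"accuracy"})

# Using the same SummaryCollector instance in both calls lets MindInsight
# associate the evaluation metrics with this training run in the lineage module.
model.train(epoch=10, train_dataset=ds_train, callbacks=[summary_collector])
model.eval(ds_eval, callbacks=[summary_collector])
```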
6.1 Check the accuracy on the training set
If the loss value and metric value of the model on the training set do not meet expectations, you can refer to the following ideas for positioning and optimization:
- Review the code, model structure, input data, and loss curve:
1) Check the script for hyperparameters with unreasonable values;
2) Check whether the model structure is implemented correctly;
3) Check whether the input data is correct;
4) Check whether the convergence result and trend of the loss curve are abnormal.
- Try MindInsight's lineage analysis function to optimize hyperparameters. The lineage analysis page analyzes the importance of each hyperparameter; prioritize adjusting the hyperparameters with high importance, and use the scatter plots to observe the relationship between hyperparameters and the optimization target so that their values can be adjusted in a targeted way.
- Try MindInsight's hyperparameter tuner. Please note that the tuner performs a hyperparameter search by running multiple complete training jobs, so it consumes several times the time of one training run; if a single training run takes a long time, the search will take correspondingly long. Tuner tutorial: https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/hyper_parameters_auto_tuning.html
- Try the MindInsight model explanation function to optimize the model and dataset. The model explanation function can visually show, via saliency maps, which regions matter most for the classification result, and its scoring system can suggest which classes should be optimized.
Model explanation and usage tutorial:
https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/model_explaination.html
- Try to optimize the model structure/algorithm.
6.2 Check the accuracy on the validation set
If the accuracy on both the training set and the validation set fails to meet expectations, first check the training-set accuracy following the previous section. If the training-set accuracy meets expectations but the validation-set accuracy does not, the model has most likely overfitted. The approach is as follows:
- Check the evaluation logic of the validation-set evaluation script for errors, in particular whether the data processing method is consistent with the training set, whether the inference algorithm is wrong, and whether the correct model checkpoint is loaded.
- Increase the amount of data, including increasing the sample size and using data augmentation and perturbation.
- Apply regularization. Common techniques include parameter norm penalties (for example, adding a regularization term to the objective function; see the weight-decay sketch after this list), parameter sharing (forcing two components of the model to share the same parameter values), and early stopping.
- Appropriately reduce the model size, for example by reducing the number of convolutional layers.
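For example, a parameter norm penalty can usually be added through the optimizer's weight_decay argument; the sketch below is a hypothetical configuration (the penalty strength 1e-4 is an assumed value, and `net` is assumed to be the overfitting network), not a recommended setting:

```python
import mindspore.nn as nn

# The weight_decay argument adds an L2 parameter-norm penalty during optimization.
optimizer = nn.Momentum(net.trainable_params(),
                        learning_rate=0.01,
                        momentum=0.9,
                        weight_decay=1e-4)
```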
6.3 Check the accuracy on the test set
If the accuracy on both the validation set and the test set fails to meet expectations, first check the validation-set accuracy following the previous section. If the validation-set accuracy meets expectations but the test-set accuracy does not, then, considering that the test data is new data the model has never seen, the reason is usually that the test-set data distribution is inconsistent with the training-set distribution. The approach is as follows:
- Check the evaluation logic of the test-set evaluation script for errors, in particular whether the data processing method is consistent with the training set, whether the inference algorithm is wrong, and whether the correct model checkpoint is loaded.
- Check the quality of the test-set data, such as whether its distribution range differs obviously from the training set and whether the data contains a lot of noise, missing values, or outliers (a quick statistical check is sketched below).
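A rough way to compare the two distributions is to compute simple statistics on the processed datasets, as in the hypothetical sketch below (the dataset names `ds_train`/`ds_test` and the column name "image" are assumptions):

```python
import numpy as np

def dataset_stats(dataset, column="image", max_batches=50):
    """Collect rough statistics for one column of a processed dataset."""
    arrays = []
    for i, batch in enumerate(dataset.create_dict_iterator(output_numpy=True)):
        arrays.append(batch[column].astype(np.float64).reshape(-1))
        if i + 1 >= max_batches:
            break
    data = np.concatenate(arrays)
    return data.mean(), data.std(), float(np.isnan(data).mean())

print("train mean/std/NaN-ratio:", dataset_stats(ds_train))
print("test  mean/std/NaN-ratio:", dataset_stats(ds_test))
```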
07 Summary
Since the same phenomenon can have multiple possible causes, locating accuracy problems relies heavily on expert experience. We hope the positioning methods and features above can serve as a good guide, help you accumulate successful experience, and make you a master of accuracy tuning.