TorchVision v0.9 adds a series of mobile-friendly models that can be used for classification, object detection, semantic segmentation, and other tasks.
This article explores the code of these models in depth, shares noteworthy implementation details, explains how the models were configured and trained, and interprets the important trade-offs made during model optimization.
The goal of this article is to document technical details of the models that are not recorded in the original papers and repositories.
Network Architecture
The implementation of the MobileNetV3 architecture faithfully follows the settings in the original paper. It supports user customization and provides different configurations for building classification, object detection, and semantic segmentation backbones. Its structural design is similar to MobileNetV2, and the two share the same building blocks.
The model is ready to use out of the box. Two variants are officially provided: Large and Small. Both are built with the same code; the only difference is their configuration (number of blocks, sizes, activation functions, etc.).
Configuration parameters
Although users can customize the InvertedResidual settings and pass them directly to the MobileNetV3 class, for most applications it is enough to adjust the existing configurations by passing parameters to the model building methods. The key configuration parameters are:
width_mult
The width_mult parameter is a multiplier that determines the number of channels in the model. The default value is 1. Adjusting it changes the number of convolutional filters, including those of the first and last layers. The implementation ensures that the number of filters is always a multiple of 8. This is a hardware optimization trick that speeds up vectorization of the operations.
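A minimal sketch of that rounding rule, modeled on torchvision's internal _make_divisible helper (illustrative, not the library's exact code):

def make_divisible(value, divisor=8, min_value=None):
    # Snap a (possibly width_mult-scaled) channel count to the nearest
    # multiple of `divisor`, never dropping more than ~10% of the value.
    if min_value is None:
        min_value = divisor
    new_value = max(min_value, int(value + divisor / 2) // divisor * divisor)
    if new_value < 0.9 * value:
        new_value += divisor
    return new_value

make_divisible(16 * 0.75)  # -> 16: 12 channels snapped up to a multiple of 8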
reduced_tail
The reduced_tail parameter is mainly a speed optimization: it halves the number of channels in the last blocks of the network. This version is typically used in object detection and semantic segmentation models. According to the MobileNetV3 paper, using the reduced_tail parameter reduces latency by 15% without hurting accuracy.
dilated
The dilated parameter affects the last three InvertedResidual blocks of the model, converting their depthwise convolutions into atrous (dilated) convolutions. It is used to control the output stride of these blocks and improves the accuracy of semantic segmentation models.
Implementation details
The MobileNetV3 class is responsible for building the network from the provided configuration. Notable implementation details:
- The last convolution block expands the output of the last InvertedResidual block by a factor of 6. This keeps the implementation compatible with different width multiplier values.
- As in the MobileNetV2 model, a Dropout layer is placed before the final Linear layer of the classifier.
The InvertedResidual class is the main building block of the network. Noteworthy details (see the sketch after this list):
- The expansion step is skipped when the input channels and the expanded channels are the same. This happens in the first convolution block of the network.
- A projection step is always needed, even when the expanded channels are the same as the output channels.
- The activation of the depthwise block is applied before the Squeeze-and-Excite layer, as this improves accuracy slightly.
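To make these points concrete, here is a minimal sketch of the block pattern. It is not torchvision's actual class (which is driven by an InvertedResidualConfig object); the SqueezeExcite module and all channel arguments are simplified for illustration:

from torch import nn
from torch.nn import functional as F

class SqueezeExcite(nn.Module):
    # simplified Squeeze-and-Excite: global pool -> bottleneck -> channel scale
    def __init__(self, channels, squeeze_factor=4):
        super().__init__()
        squeeze = channels // squeeze_factor
        self.fc1 = nn.Conv2d(channels, squeeze, 1)
        self.fc2 = nn.Conv2d(squeeze, channels, 1)

    def forward(self, x):
        s = F.adaptive_avg_pool2d(x, 1)
        s = F.relu(self.fc1(s))
        s = F.hardsigmoid(self.fc2(s))
        return x * s

class InvertedResidualSketch(nn.Module):
    def __init__(self, in_ch, exp_ch, out_ch, stride=1, use_se=True, use_hs=True):
        super().__init__()
        act = nn.Hardswish if use_hs else nn.ReLU
        layers = []
        if exp_ch != in_ch:
            # expansion is skipped when input and expanded channels match
            layers += [nn.Conv2d(in_ch, exp_ch, 1, bias=False),
                       nn.BatchNorm2d(exp_ch), act()]
        # depthwise conv; note its activation comes *before* Squeeze-and-Excite
        layers += [nn.Conv2d(exp_ch, exp_ch, 3, stride, 1, groups=exp_ch, bias=False),
                   nn.BatchNorm2d(exp_ch), act()]
        if use_se:
            layers.append(SqueezeExcite(exp_ch))
        # projection is always applied, even when exp_ch == out_ch
        layers += [nn.Conv2d(exp_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch)]
        self.block = nn.Sequential(*layers)
        self.use_res = stride == 1 and in_ch == out_ch

    def forward(self, x):
        result = self.block(x)
        return x + result if self.use_res else result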
MobileNetV3 block architecture diagram
Classification
This section covers the benchmarks and configuration of the pre-trained models, along with their training and quantization details.
Benchmarks
Initialize the pre-trained models:
large = torchvision.models.mobilenet_v3_large(pretrained=True, width_mult=1.0, reduced_tail=False, dilated=False)
small = torchvision.models.mobilenet_v3_small(pretrained=True)
quantized = torchvision.models.quantization.mobilenet_v3_large(pretrained=True)
Detailed benchmark comparison between the new and old models
As the figure shows, MobileNetV3-Large can serve as a substitute for ResNet50 for users willing to trade a little accuracy for a speed-up of roughly 6x.
Note that the inference time here is measured on the CPU.
Training process
All pre-trained models are configured as non-dilated, with a width multiplier of 1 and full tails, and are fitted on ImageNet. Both the Large and Small variants were trained with the same hyperparameters and scripts.
Fast and stable model training
The correct configuration of RMSProp is essential for speeding up training and ensuring numerical stability. The authors of the paper used TensorFlow in their experiments, and their runs used a rmsprop_epsilon value that is quite high compared to the default.
Normally, this hyperparameter only serves to avoid zero denominators, so its value is very small, but in this particular model choosing the right value is important for avoiding numerical instabilities in the loss.
Another important detail is that although the RMSProp implementations of PyTorch and TensorFlow usually behave similarly, in this setting it is necessary to account for how the two frameworks handle the epsilon hyperparameter differently. Specifically, PyTorch adds epsilon outside the square root calculation, while TensorFlow adds it inside. As a result, users need to adjust the epsilon value when porting the hyperparameters from this article; a reasonable approximation can be obtained with the formula PyTorch_eps = sqrt(TF_eps).
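For example, applying that conversion when building the optimizer might look like the sketch below. The tf_eps, lr, alpha, momentum, and weight decay values are illustrative, not the official training recipe:

import math
import torch
import torchvision

tf_eps = 0.001  # illustrative TensorFlow epsilon, not necessarily the paper's value
pytorch_eps = math.sqrt(tf_eps)  # ~0.0316, per PyTorch_eps = sqrt(TF_eps)

model = torchvision.models.mobilenet_v3_large()
optimizer = torch.optim.RMSprop(
    model.parameters(), lr=0.064, alpha=0.9, eps=pytorch_eps,
    momentum=0.9, weight_decay=1e-5,  # illustrative values only
)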
Improving model accuracy by tuning hyperparameters and the training process
After configuring the optimizer for fast and stable training, you can start optimizing the accuracy of the model. A few techniques help achieve this.
First, to avoid overfitting, the data can be augmented with AutoAugment and RandomErasing. In addition, tuning parameters such as weight decay via cross-validation and averaging the weights of checkpoints from different epochs after training are also significant. Finally, methods such as label smoothing, stochastic depth, and LR noise injection can raise the overall accuracy by at least 1.5 points.
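A possible augmentation pipeline along those lines (a sketch: transforms.AutoAugment requires a newer torchvision than v0.9, and the erasing probability shown is illustrative):

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.AutoAugment(transforms.AutoAugmentPolicy.IMAGENET),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.2),  # RandomErasing operates on tensors, so it comes last
])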
Key iterations for improving the accuracy of MobileNetV3-Large, starting from a baseline with MobileNetV2-style hyperparameters
Note that once the target accuracy was reached, model performance was verified on the hold-out validation set. This process helps detect overfitting.
Quantization
Quantized weights are provided for the QNNPACK backend of the MobileNetV3-Large variant, making it 2.5x faster. The model was quantized with Quantization-Aware Training (QAT).
QAT models the effects of quantization and adjusts the weights accordingly, improving the accuracy of the model. Compared with simple post-training quantization of the trained model, accuracy increases by 1.8 points:
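A rough sketch of the QAT workflow using PyTorch's eager-mode quantization API; the exact recipe behind the released weights lives in torchvision's reference scripts:

import torch
import torchvision

model = torchvision.models.quantization.mobilenet_v3_large(pretrained=True, quantize=False)
model.fuse_model()  # fuse Conv+BN+ReLU modules before QAT
model.qconfig = torch.quantization.get_default_qat_qconfig('qnnpack')
model.train()
torch.quantization.prepare_qat(model, inplace=True)

# ... fine-tune for a few epochs so the weights adapt to fake-quantized ops ...

model.eval()
quantized_model = torch.quantization.convert(model)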
Object Detection
This section first provides benchmarks of the published models, then discusses how the MobileNetV3-Large backbone is used in a Feature Pyramid Network together with the FasterRCNN detector to perform object detection.
It also explains how the networks were trained and tuned, and where trade-offs had to be made (details of use with SSDlite are not covered here).
Benchmarks
Initialize the models:
high_res = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_fpn(pretrained=True)
low_res = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_320_fpn(pretrained=True)
Benchmark comparison between the old and new models
As you can see, the high-resolution Faster R-CNN with a MobileNetV3-Large FPN backbone can substitute for the equivalent ResNet50 model for users willing to sacrifice a little accuracy for a roughly 5x speed-up.
Implementation details
The detector uses an FPN-style backbone that extracts features from different convolutions of the MobileNetV3 model. By default, the pre-trained model uses the output of the 13th InvertedResidual block and the output of the convolution just before the pooling layer. The implementation also supports using the outputs of more stages.
All feature maps extracted from the network are projected down to 256 channels by the FPN module, which greatly increases the speed of the network. These feature maps from the FPN backbone are used by the FasterRCNN detector to produce box and class predictions at different scales.
Training and tuning process
Currently, two pre-trained models are officially provided, performing object detection at different resolutions. Both models were trained on the COCO dataset with the same hyperparameters and scripts.
The high-resolution detector was trained with 800-1333px images, while the mobile-friendly low-resolution detector was trained with 320-640px images.
The reason for providing two sets of independent pre-trained weights is that training the detector directly on the smaller images increases accuracy by 5 mAP, compared with passing small images to the pre-trained high-resolution model.
Both backbones are initialized with ImageNet weights, and the last three stages of the backbone are fine-tuned during training.
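The number of fine-tuned stages is exposed through the trainable_backbone_layers argument of the model builders; for example (3 matches the behavior described above):

from torchvision.models.detection import fasterrcnn_mobilenet_v3_large_fpn

# explicitly request that only the last 3 backbone stages receive gradients
model = fasterrcnn_mobilenet_v3_large_fpn(pretrained=True, trainable_backbone_layers=3)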
Additional speed optimizations can be applied to the mobile-friendly model by tuning the RPN NMS thresholds, sacrificing 0.2 mAP of accuracy to increase the model's CPU speed by about 45%.
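Such tuning can be reproduced by overriding the RPN keyword arguments when building the model. A sketch, with values that, to the best of my knowledge, mirror the shipped defaults of the 320px variant:

from torchvision.models.detection import fasterrcnn_mobilenet_v3_large_320_fpn

model = fasterrcnn_mobilenet_v3_large_320_fpn(
    pretrained=True,
    rpn_score_thresh=0.05,       # discard low-confidence proposals early
    rpn_pre_nms_top_n_test=150,  # fewer proposals entering NMS at inference
    rpn_post_nms_top_n_test=150, # fewer proposals kept after NMS
)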
Faster R-CNN MobileNetV3-Large FPN model prediction diagram
Semantic segmentation
This section first provides benchmarks of the published pre-trained models, then discusses how the MobileNetV3-Large backbone is combined with segmentation heads such as LR-ASPP, DeepLabV3, and FCN to perform semantic segmentation.
It also explains how the networks were trained and proposes a few optional optimizations for speed-critical applications.
Benchmarks
Initialize the pre-trained models:
lraspp = torchvision.models.segmentation.lraspp_mobilenet_v3_large(pretrained=True)
deeplabv3 = torchvision.models.segmentation.deeplabv3_mobilenet_v3_large(pretrained=True)
Detailed benchmark comparison between the old and new models
As the figure shows, DeepLabV3 with a MobileNetV3-Large backbone is a viable alternative to FCN with ResNet50 for most applications: it runs 8.5x faster while achieving similar accuracy. Moreover, the LR-ASPP network outperforms the equivalent FCN on all metrics.
Implementation details
This section discusses important implementation details of the tested segmentation heads. Note that all models described in this section use a dilated MobileNetV3-Large backbone.
LR-ASPP
LR-ASPP is a Lite variant of the Reduced Atrous Spatial Pyramid Pooling model proposed by the authors of the MobileNetV3 paper. Unlike the other segmentation models in TorchVision, it does not use an auxiliary loss; instead, it uses low-level and high-level features with output strides of 8 and 16, respectively.
Unlike the paper, which uses a 49x49 AveragePooling layer with variable strides, this implementation uses an AdaptiveAvgPool2d layer to process the global features.
This provides a generic implementation that works across multiple datasets. Finally, before returning the output, a bilinear interpolation is always applied to ensure that the sizes of the input and output images match exactly.
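A minimal sketch of this head, close in spirit to torchvision's LRASPPHead but simplified; the channel arguments are illustrative:

from torch import nn
from torch.nn import functional as F

class LRASPPHeadSketch(nn.Module):
    # a 1x1-conv branch modulated by a global attention branch,
    # plus separate classifiers for the low- and high-level features
    def __init__(self, low_ch, high_ch, num_classes, inter_ch=128):
        super().__init__()
        self.cbr = nn.Sequential(
            nn.Conv2d(high_ch, inter_ch, 1, bias=False),
            nn.BatchNorm2d(inter_ch),
            nn.ReLU(inplace=True),
        )
        self.scale = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # replaces the paper's 49x49 pooling
            nn.Conv2d(high_ch, inter_ch, 1, bias=False),
            nn.Sigmoid(),
        )
        self.low_classifier = nn.Conv2d(low_ch, num_classes, 1)
        self.high_classifier = nn.Conv2d(inter_ch, num_classes, 1)

    def forward(self, low, high):
        x = self.cbr(high) * self.scale(high)
        # bilinear upsampling so the two feature levels can be summed
        x = F.interpolate(x, size=low.shape[-2:], mode="bilinear", align_corners=False)
        return self.low_classifier(low) + self.high_classifier(x)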
DeepLabV3 & FCN
The combination of MobileNetV3 with DeepLabV3 and FCN closely follows that of other models, and the stage estimation for these methods is identical to LR-ASPP.
Note that high-level and low-level features are not used here; instead, a normal loss is attached to the feature maps with output stride 16, and an auxiliary loss is attached to the feature maps with output stride 8.
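A sketch of training with the auxiliary head enabled; the 0.5 weight on the auxiliary loss is a common choice, not necessarily the one used for the released weights:

import torch
import torchvision
from torch.nn import functional as F

model = torchvision.models.segmentation.deeplabv3_mobilenet_v3_large(aux_loss=True)
images = torch.rand(2, 3, 520, 520)
targets = torch.randint(0, 21, (2, 520, 520))  # 21 classes, matching the default

outputs = model(images)  # dict with the main "out" and auxiliary "aux" logits
loss = F.cross_entropy(outputs["out"], targets) + \
       0.5 * F.cross_entropy(outputs["aux"], targets)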
FCN is inferior to LR-ASPP in both speed and accuracy, so it is not considered here. Its pre-trained weights are still usable with only minor changes to the code.
Training and tuning process
Two MobileNetV3 pre-trained models are available for semantic segmentation: LR-ASPP and DeepLabV3. The backbones of these models were initialized with ImageNet weights and trained end-to-end.
Both architectures are trained on the COCO dataset using the same script and similar hyperparameters.
Normally, images are resized to 520 pixels during inference. An optional speed optimization is to build a low-resolution model configuration from the high-resolution pre-trained weights and reduce the inference size to 320 pixels. This improves CPU execution times by roughly 60% while sacrificing a few mIoU points.
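A sketch of the low-resolution inference trick: reuse the high-resolution pre-trained weights but shrink the input to 320px (the file name below is a hypothetical placeholder):

import torch
import torchvision
from torchvision import transforms
from PIL import Image

model = torchvision.models.segmentation.lraspp_mobilenet_v3_large(pretrained=True).eval()

preprocess = transforms.Compose([
    transforms.Resize(320),  # lower than the usual 520px inference size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("example.jpg")  # hypothetical input image
with torch.no_grad():
    out = model(preprocess(img).unsqueeze(0))["out"]
pred = out.argmax(1)  # per-pixel class indices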
Optimization details
LR-ASPP MobileNetV3-Large model prediction example
These are the implementation details of MobileNetV3 summarized in this post. I hope they give you a deeper understanding of the model.