TorchVision v0.9 adds a series of mobile-friendly models that can be used for classification, object detection, semantic segmentation, and other tasks.
This article explores the code of these models in depth, shares noteworthy implementation details, explains how the models were configured and trained, and interprets the important trade-offs made during model optimization.
The goal of this article is to document technical details of the models that are not recorded in the original papers and repositories.
Network Architecture
The implementation of the MobileNetV3 architecture faithfully follows the settings in the original paper. It supports user customization and provides different configurations for building classification, object detection, and semantic segmentation backbones. Its structural design is similar to MobileNetV2, and the two share the same building blocks.
The model is ready to use out of the box. Two variants are officially provided: Large and Small. Both are built with the same code; the only difference is their configuration (number of blocks, sizes, activation functions, etc.).
Configuration parameters
Although users can customize the InvertedResidual settings and pass them directly to the MobileNetV3 class, for most applications it is enough to adjust the existing configurations by passing parameters to the model building methods. The key configuration parameters are:
width_mult
The width_mult parameter is a multiplier that determines the number of channels in the model. The default value is 1. Adjusting it changes the number of convolutional filters, including those of the first and last layers. The implementation ensures that the number of filters is always a multiple of 8. This is a hardware optimization trick that speeds up vectorization of the operations.
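A minimal sketch of that rounding rule, modeled on torchvision's internal _make_divisible helper (illustrative, not the library's exact code):

def make_divisible(value, divisor=8, min_value=None):
    # Snap a (possibly width_mult-scaled) channel count to the nearest
    # multiple of `divisor`, never dropping more than ~10% of the value.
    if min_value is None:
        min_value = divisor
    new_value = max(min_value, int(value + divisor / 2) // divisor * divisor)
    if new_value < 0.9 * value:
        new_value += divisor
    return new_value

make_divisible(16 * 0.75)  # -> 16: 12 channels snapped up to a multiple of 8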
reduced_tail
The reduced_tail parameter is mainly a speed optimization: it halves the number of channels in the last blocks of the network. This version is typically used in object detection and semantic segmentation models. According to the MobileNetV3 paper, using the reduced_tail parameter reduces latency by 15% without hurting accuracy.
dilated
The dilated parameter affects the last three InvertedResidual blocks of the model, converting their depthwise convolutions into atrous (dilated) convolutions. It is used to control the output stride of these blocks and improves the accuracy of semantic segmentation models.
Implementation details
The MobileNetV3 class is responsible for building the network from the provided configuration. Notable implementation details:
- The last convolution block expands the output of the last InvertedResidual block by a factor of 6. This keeps the implementation compatible with different width multiplier values.
- As in the MobileNetV2 model, a Dropout layer is placed before the final Linear layer of the classifier.
The InvertedResidual class is the main building block of the network. Noteworthy details (see the sketch after this list):
- The expansion step is skipped when the input channels and the expanded channels are the same. This happens in the first convolution block of the network.
- A projection step is always needed, even when the expanded channels are the same as the output channels.
- The activation of the depthwise block is applied before the Squeeze-and-Excite layer, as this improves accuracy slightly.
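To make these points concrete, here is a minimal sketch of the block pattern. It is not torchvision's actual class (which is driven by an InvertedResidualConfig object); the SqueezeExcite module and all channel arguments are simplified for illustration:

from torch import nn
from torch.nn import functional as F

class SqueezeExcite(nn.Module):
    # simplified Squeeze-and-Excite: global pool -> bottleneck -> channel scale
    def __init__(self, channels, squeeze_factor=4):
        super().__init__()
        squeeze = channels // squeeze_factor
        self.fc1 = nn.Conv2d(channels, squeeze, 1)
        self.fc2 = nn.Conv2d(squeeze, channels, 1)

    def forward(self, x):
        s = F.adaptive_avg_pool2d(x, 1)
        s = F.relu(self.fc1(s))
        s = F.hardsigmoid(self.fc2(s))
        return x * s

class InvertedResidualSketch(nn.Module):
    def __init__(self, in_ch, exp_ch, out_ch, stride=1, use_se=True, use_hs=True):
        super().__init__()
        act = nn.Hardswish if use_hs else nn.ReLU
        layers = []
        if exp_ch != in_ch:
            # expansion is skipped when input and expanded channels match
            layers += [nn.Conv2d(in_ch, exp_ch, 1, bias=False),
                       nn.BatchNorm2d(exp_ch), act()]
        # depthwise conv; note its activation comes *before* Squeeze-and-Excite
        layers += [nn.Conv2d(exp_ch, exp_ch, 3, stride, 1, groups=exp_ch, bias=False),
                   nn.BatchNorm2d(exp_ch), act()]
        if use_se:
            layers.append(SqueezeExcite(exp_ch))
        # projection is always applied, even when exp_ch == out_ch
        layers += [nn.Conv2d(exp_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch)]
        self.block = nn.Sequential(*layers)
        self.use_res = stride == 1 and in_ch == out_ch

    def forward(self, x):
        result = self.block(x)
        return x + result if self.use_res else result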
MobileNetV3 block architecture diagram
Classification
This section covers the benchmarks and configuration of the pre-trained models, along with their training and quantization details.
Benchmarks
Initialize the pre-trained models:
large = torchvision.models.mobilenet_v3_large(pretrained=True, width_mult=1.0, reduced_tail=False, dilated=False)
small = torchvision.models.mobilenet_v3_small(pretrained=True)
quantized = torchvision.models.quantization.mobilenet_v3_large(pretrained=True)
Detailed benchmark comparison between the new and old models
As the figure shows, MobileNetV3-Large can serve as a substitute for ResNet50 for users willing to trade a little accuracy for a speed-up of roughly 6x.
Note that the inference time here is measured on the CPU.
Training process
All pre-trained models are configured as non-dilated, with a width multiplier of 1 and full tails, and are fitted on ImageNet. Both the Large and Small variants were trained with the same hyperparameters and scripts.
Fast and stable model training
The correct configuration of RMSProp is essential for speeding up training and ensuring numerical stability. The authors of the paper used TensorFlow in their experiments, and their runs used a rmsprop_epsilon value that is quite high compared to the default.
Normally, this hyperparameter only serves to avoid zero denominators, so its value is very small, but in this particular model choosing the right value is important for avoiding numerical instabilities in the loss.
Another important detail is that although the RMSProp implementations of PyTorch and TensorFlow usually behave similarly, in this setting it is necessary to account for how the two frameworks handle the epsilon hyperparameter differently. Specifically, PyTorch adds epsilon outside the square root calculation, while TensorFlow adds it inside. As a result, users need to adjust the epsilon value when porting the hyperparameters from this article; a reasonable approximation can be obtained with the formula PyTorch_eps = sqrt(TF_eps).
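For example, applying that conversion when building the optimizer might look like the sketch below. The tf_eps, lr, alpha, momentum, and weight decay values are illustrative, not the official training recipe:

import math
import torch
import torchvision

tf_eps = 0.001  # illustrative TensorFlow epsilon, not necessarily the paper's value
pytorch_eps = math.sqrt(tf_eps)  # ~0.0316, per PyTorch_eps = sqrt(TF_eps)

model = torchvision.models.mobilenet_v3_large()
optimizer = torch.optim.RMSprop(
    model.parameters(), lr=0.064, alpha=0.9, eps=pytorch_eps,
    momentum=0.9, weight_decay=1e-5,  # illustrative values only
)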
Improving model accuracy by tuning hyperparameters and the training process
After configuring the optimizer for fast and stable training, you can start optimizing the accuracy of the model. A few techniques help achieve this.
First, to avoid overfitting, the data can be augmented with AutoAugment and RandomErasing. In addition, tuning parameters such as weight decay via cross-validation and averaging the weights of checkpoints from different epochs after training are also significant. Finally, methods such as label smoothing, stochastic depth, and LR noise injection can raise the overall accuracy by at least 1.5 points.
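A possible augmentation pipeline along those lines (a sketch: transforms.AutoAugment requires a newer torchvision than v0.9, and the erasing probability shown is illustrative):

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.AutoAugment(transforms.AutoAugmentPolicy.IMAGENET),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.2),  # RandomErasing operates on tensors, so it comes last
])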
Key iterations for improving the accuracy of MobileNetV3-Large, starting from a baseline with MobileNetV2-style hyperparameters
Note that once the target accuracy was reached, model performance was verified on the hold-out validation set. This process helps detect overfitting.
Quantization
Quantized weights are provided for the QNNPACK backend of the MobileNetV3-Large variant, making it 2.5x faster. The model was quantized with Quantization-Aware Training (QAT).
QAT models the effects of quantization and adjusts the weights accordingly, improving the accuracy of the model. Compared with simple post-training quantization of the trained model, accuracy increases by 1.8 points:
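A rough sketch of the QAT workflow using PyTorch's eager-mode quantization API; the exact recipe behind the released weights lives in torchvision's reference scripts:

import torch
import torchvision

model = torchvision.models.quantization.mobilenet_v3_large(pretrained=True, quantize=False)
model.fuse_model()  # fuse Conv+BN+ReLU modules before QAT
model.qconfig = torch.quantization.get_default_qat_qconfig('qnnpack')
model.train()
torch.quantization.prepare_qat(model, inplace=True)

# ... fine-tune for a few epochs so the weights adapt to fake-quantized ops ...

model.eval()
quantized_model = torch.quantization.convert(model)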
Object Detection
This section first provides benchmarks of the published models, then discusses how the MobileNetV3-Large backbone is used in a Feature Pyramid Network together with the FasterRCNN detector to perform object detection.
It also explains how the networks were trained and tuned, and where trade-offs had to be made (details of use with SSDlite are not covered here).
Benchmarks
Initialize the models:
high_res = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_fpn(pretrained=True)
low_res = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_320_fpn(pretrained=True)
Benchmark comparison between the old and new models
As you can see, the high-resolution Faster R-CNN with a MobileNetV3-Large FPN backbone can substitute for the equivalent ResNet50 model for users willing to sacrifice a little accuracy for a roughly 5x speed-up.
Implementation details
The detector uses an FPN-style backbone that extracts features from different convolutions of the MobileNetV3 model. By default, the pre-trained model uses the output of the 13th InvertedResidual block and the output of the convolution just before the pooling layer. The implementation also supports using the outputs of more stages.
All feature maps extracted from the network are projected down to 256 channels by the FPN module, which greatly increases the speed of the network. These feature maps from the FPN backbone are used by the FasterRCNN detector to produce box and class predictions at different scales.
Training and tuning process
Currently, two pre-trained models are officially provided, performing object detection at different resolutions. Both models were trained on the COCO dataset with the same hyperparameters and scripts.
The high-resolution detector was trained with 800-1333px images, while the mobile-friendly low-resolution detector was trained with 320-640px images.
The reason for providing two sets of independent pre-trained weights is that training the detector directly on the smaller images increases accuracy by 5 mAP, compared with passing small images to the pre-trained high-resolution model.
Both backbones are initialized with ImageNet weights, and the last three stages of the backbone are fine-tuned during training.
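The number of fine-tuned stages is exposed through the trainable_backbone_layers argument of the model builders; for example (3 matches the behavior described above):

from torchvision.models.detection import fasterrcnn_mobilenet_v3_large_fpn

# explicitly request that only the last 3 backbone stages receive gradients
model = fasterrcnn_mobilenet_v3_large_fpn(pretrained=True, trainable_backbone_layers=3)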
Additional speed optimizations can be applied to the mobile-friendly model by tuning the RPN NMS thresholds, sacrificing 0.2 mAP of accuracy to increase the model's CPU speed by about 45%.
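Such tuning can be reproduced by overriding the RPN keyword arguments when building the model. A sketch, with values that, to the best of my knowledge, mirror the shipped defaults of the 320px variant:

from torchvision.models.detection import fasterrcnn_mobilenet_v3_large_320_fpn

model = fasterrcnn_mobilenet_v3_large_320_fpn(
    pretrained=True,
    rpn_score_thresh=0.05,       # discard low-confidence proposals early
    rpn_pre_nms_top_n_test=150,  # fewer proposals entering NMS at inference
    rpn_post_nms_top_n_test=150, # fewer proposals kept after NMS
)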
Faster R-CNN MobileNetV3-Large FPN model prediction diagram
Semantic segmentation
This section first provides benchmarks of the published pre-trained models, then discusses how the MobileNetV3-Large backbone is combined with segmentation heads such as LR-ASPP, DeepLabV3, and FCN to perform semantic segmentation.
It also explains how the networks were trained and proposes a few optional optimizations for speed-critical applications.
Benchmarks
Initialize the pre-trained models:
lraspp = torchvision.models.segmentation.lraspp_mobilenet_v3_large(pretrained=True)
deeplabv3 = torchvision.models.segmentation.deeplabv3_mobilenet_v3_large(pretrained=True)
Detailed benchmark comparison between the old and new models
As the figure shows, DeepLabV3 with a MobileNetV3-Large backbone is a viable alternative to FCN with ResNet50 for most applications: it runs 8.5x faster while achieving similar accuracy. Moreover, the LR-ASPP network outperforms the equivalent FCN on all metrics.
Implementation details
This section discusses important implementation details of the tested segmentation heads. Note that all models described in this section use a dilated MobileNetV3-Large backbone.
LR-ASPP
LR-ASPP is a Lite variant of the Reduced Atrous Spatial Pyramid Pooling model proposed by the authors of the MobileNetV3 paper. Unlike the other segmentation models in TorchVision, it does not use an auxiliary loss; instead, it uses low-level and high-level features with output strides of 8 and 16, respectively.
Unlike the paper, which uses a 49x49 AveragePooling layer with variable strides, this implementation uses an AdaptiveAvgPool2d layer to process the global features.
This provides a generic implementation that works across multiple datasets. Finally, before returning the output, a bilinear interpolation is always applied to ensure that the sizes of the input and output images match exactly.
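A minimal sketch of this head, close in spirit to torchvision's LRASPPHead but simplified; the channel arguments are illustrative:

from torch import nn
from torch.nn import functional as F

class LRASPPHeadSketch(nn.Module):
    # a 1x1-conv branch modulated by a global attention branch,
    # plus separate classifiers for the low- and high-level features
    def __init__(self, low_ch, high_ch, num_classes, inter_ch=128):
        super().__init__()
        self.cbr = nn.Sequential(
            nn.Conv2d(high_ch, inter_ch, 1, bias=False),
            nn.BatchNorm2d(inter_ch),
            nn.ReLU(inplace=True),
        )
        self.scale = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # replaces the paper's 49x49 pooling
            nn.Conv2d(high_ch, inter_ch, 1, bias=False),
            nn.Sigmoid(),
        )
        self.low_classifier = nn.Conv2d(low_ch, num_classes, 1)
        self.high_classifier = nn.Conv2d(inter_ch, num_classes, 1)

    def forward(self, low, high):
        x = self.cbr(high) * self.scale(high)
        # bilinear upsampling so the two feature levels can be summed
        x = F.interpolate(x, size=low.shape[-2:], mode="bilinear", align_corners=False)
        return self.low_classifier(low) + self.high_classifier(x)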
DeepLabV3 & FCN
The combination of MobileNetV3 with DeepLabV3 and FCN closely follows that of other models, and the stage estimation for these methods is identical to LR-ASPP.
Note that high-level and low-level features are not used here; instead, a normal loss is attached to the feature maps with output stride 16, and an auxiliary loss is attached to the feature maps with output stride 8.
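A sketch of training with the auxiliary head enabled; the 0.5 weight on the auxiliary loss is a common choice, not necessarily the one used for the released weights:

import torch
import torchvision
from torch.nn import functional as F

model = torchvision.models.segmentation.deeplabv3_mobilenet_v3_large(aux_loss=True)
images = torch.rand(2, 3, 520, 520)
targets = torch.randint(0, 21, (2, 520, 520))  # 21 classes, matching the default

outputs = model(images)  # dict with the main "out" and auxiliary "aux" logits
loss = F.cross_entropy(outputs["out"], targets) + \
       0.5 * F.cross_entropy(outputs["aux"], targets)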
FCN is inferior to LR-ASPP in both speed and accuracy, so it is not considered here. Its pre-trained weights are still usable with only minor changes to the code.
Training and tuning process
Two MobileNetV3 pre-trained models are available for semantic segmentation: LR-ASPP and DeepLabV3. The backbones of these models were initialized with ImageNet weights and trained end-to-end.
Both architectures are trained on the COCO dataset using the same script and similar hyperparameters.
Normally, images are resized to 520 pixels during inference. An optional speed optimization is to build a low-resolution model configuration from the high-resolution pre-trained weights and reduce the inference size to 320 pixels. This improves CPU execution times by roughly 60% while sacrificing a few mIoU points.
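A sketch of the low-resolution inference trick: reuse the high-resolution pre-trained weights but shrink the input to 320px (the file name below is a hypothetical placeholder):

import torch
import torchvision
from torchvision import transforms
from PIL import Image

model = torchvision.models.segmentation.lraspp_mobilenet_v3_large(pretrained=True).eval()

preprocess = transforms.Compose([
    transforms.Resize(320),  # lower than the usual 520px inference size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("example.jpg")  # hypothetical input image
with torch.no_grad():
    out = model(preprocess(img).unsqueeze(0))["out"]
pred = out.argmax(1)  # per-pixel class indices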
Optimization details
LR-ASPP MobileNetV3-Large model prediction example
These are the implementation details of MobileNetV3 summarized in this post. I hope they give you a deeper understanding of the model.