Content guide

Several libraries were updated alongside the PyTorch 1.9 release, including TorchVision, which now ships new SSD and SSDlite models. Compared with SSD, SSDlite is better suited to mobile app development.

SSD stands for Single Shot MultiBox Detector, a single-shot object detection algorithm that can detect multiple objects in one pass. This article focuses on SSDlite, the mobile-friendly variant of SSD.

The article is organized as follows:

First, we walk through the main components of the algorithm, emphasizing how they differ from the original SSD;

Then we discuss how the released model was trained;

Finally, we provide a detailed benchmark of all the newly released object detection models.

SSDlite network architecture

SSDlite is an adaptation of SSD, first described in the MobileNetV2 paper and later reused in the MobileNetV3 paper. Because both papers focus on introducing new CNN architectures, they leave out most of the SSDlite implementation details.

Our code follows all the details given in these two papers and, where necessary, fills the gaps from the official implementation.

SSD is a family of models: users can configure it with different backbones (such as VGG, MobileNetV3, etc.) and different heads (such as regular convolutions, separable convolutions, etc.). Many SSD components are therefore identical in SSDlite, and below we discuss only the parts that differ.

Classification and regression heads

Following the MobileNetV2 paper, SSDlite replaces the regular convolutions used in the original heads with separable convolutions. Our implementation therefore introduces new heads that use 3x3 depthwise convolutions followed by 1x1 projections.
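To give a sense of why this swap suits mobile models, the sketch below counts the weights of a regular 3x3 convolution and of a depthwise 3x3 convolution followed by a 1x1 projection. The channel count, anchors per location, and number of classes are illustrative assumptions, not values from the article, and biases and normalization layers are ignored.

```python
def conv_params(c_in: int, c_out: int, kernel: int = 3) -> int:
    """Weights of a regular kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(c_in: int, c_out: int, kernel: int = 3) -> int:
    """Weights of a depthwise kernel x kernel conv followed by a 1x1 projection."""
    depthwise = kernel * kernel * c_in  # one kernel x kernel filter per input channel
    pointwise = c_in * c_out            # 1x1 projection to the output channels
    return depthwise + pointwise


# Illustrative values only (assumptions, not taken from the article).
C_IN = 672          # channels of the feature map entering the head
NUM_ANCHORS = 6     # anchors per spatial location
NUM_CLASSES = 91    # classification outputs per anchor
C_OUT = NUM_ANCHORS * NUM_CLASSES

regular = conv_params(C_IN, C_OUT)
separable = separable_conv_params(C_IN, C_OUT)
print(f"regular:   {regular:,} weights")
print(f"separable: {separable:,} weights")
print(f"reduction: {regular / separable:.1f}x")
```

Under these assumptions, the separable head needs close to an order of magnitude fewer weights for the same input and output shapes.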

Since all other components of SSD remain unchanged, creating an SSDlite model only requires initializing the SSDlite head and passing it directly to the SSD constructor.

Backbone feature extractor

Our implementation introduces a new class for building MobileNet feature extractors. Following the MobileNetV3 paper, the backbone returns the output of the expansion layer of the Inverted Bottleneck block, which has an output stride of 16, and the output of the layer just before pooling, which has an output stride of 32.

In addition, all the extra blocks of the backbone are replaced with a lightweight equivalent that uses a 1x1 compression, a separable 3x3 convolution with a stride of 2, and a 1x1 expansion.
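To make the strides concrete, here is a small sketch of the feature-map sizes that result for a 320x320 input. It assumes four extra blocks after the stride-32 output and that each stride-2 3x3 convolution uses a padding of 1; both are assumptions made for illustration rather than details stated in the article.

```python
from typing import List


def conv_out_size(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_map_sizes(input_size: int = 320, num_extra_blocks: int = 4) -> List[int]:
    """Side lengths of the feature maps handed to the detection heads."""
    # The two backbone outputs have output strides of 16 and 32.
    sizes = [input_size // 16, input_size // 32]
    # Each extra block halves the resolution with a stride-2 3x3 convolution.
    for _ in range(num_extra_blocks):
        sizes.append(conv_out_size(sizes[-1]))
    return sizes


print(feature_map_sizes())  # [20, 10, 5, 3, 2, 1]
```

The coarse maps at the end of the list are the ones responsible for large objects, while the 20x20 map handles the smallest ones.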

Finally, to ensure that the heads keep enough predictive power even when a small width multiplier is used, the minimum depth of all convolutions is controlled by the min_depth hyperparameter.
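One way to picture the role of min_depth is as a floor on the scaled channel count. The sketch below assumes the floor is applied as max(min_depth, int(depth * width_mult)); the base depths and the default of 16 are illustrative assumptions, not values given in the article.

```python
from typing import List

# Hypothetical base depths of the extra blocks at width multiplier 1.0.
BASE_DEPTHS = [512, 256, 256, 128]


def get_depth(depth: int, width_mult: float, min_depth: int = 16) -> int:
    """Scale a channel count by the width multiplier, never going below min_depth."""
    return max(min_depth, int(depth * width_mult))


def extra_block_depths(width_mult: float, min_depth: int = 16) -> List[int]:
    """Channel counts of the extra blocks for a given width multiplier."""
    return [get_depth(d, width_mult, min_depth) for d in BASE_DEPTHS]


for width_mult in (1.0, 0.5, 0.1):
    print(width_mult, extra_block_depths(width_mult))
```

With a multiplier of 0.1 the last block would shrink to 12 channels, so the floor of 16 takes over.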

SSDlite320 MobileNetV3-Large model

This section discusses the configuration of the pre-trained SSDlite model and the training process used to replicate the paper's results as closely as possible.

Training process

The most noteworthy details of the training process are discussed here.

1. Adjusted hyperparameters

The paper does not provide any of the hyperparameters (such as regularization, learning rate, batch size, etc.) used to train the model. We therefore took the parameters listed in the official repo's configuration files as a starting point and used cross-validation to tune them to their best values. This gave a significant improvement over the baseline SSD configuration.

2. Data enhancement

The main difference between SSDlite and SSD is that the SSDlite backbone has only a fraction of the weights of the SSD backbone. For this reason, data augmentation in SSDlite focuses more on making the model robust to objects of different sizes than on avoiding overfitting.

SSDlite uses only a subset of the SSD transformations, which avoids over-regularizing the model.

3. LR Scheme 

Because we rely on data augmentation to make the model robust to small and medium-sized objects, we found that training for a large number of epochs is very beneficial. Specifically, using roughly 3x as many epochs as SSD improves accuracy by 4.2 mAP points, and a 6x multiplier improves it by 4.9 mAP.

Increasing the number of epochs much further backfires: training becomes slow and impractical without a matching gain in accuracy. Nevertheless, judging by the model configuration, the paper's authors appear to have used the equivalent of a 16x multiplier.

4. Weight initialization & Input Scaling & ReLU6

A final set of optimizations brought our implementation very close to the official one and narrowed the accuracy gap: training the backbone from scratch instead of initializing it from ImageNet, adapting our weight initialization scheme, changing the input scaling, and replacing all standard ReLUs added in the SSDlite heads with ReLU6.
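For readers unfamiliar with ReLU6, the sketch below contrasts it with the standard ReLU on scalar inputs. It is a plain-Python illustration of the two functions, not the code used for training.

```python
def relu(x: float) -> float:
    """Standard ReLU: negative inputs become 0, positive inputs pass through."""
    return max(0.0, x)


def relu6(x: float) -> float:
    """ReLU6: like ReLU, but the output is additionally capped at 6."""
    return min(max(0.0, x), 6.0)


for value in (-2.0, 3.0, 10.0):
    print(f"x={value:5.1f}  relu={relu(value):5.1f}  relu6={relu6(value):5.1f}")
```

The only difference is the upper bound, which keeps activations in a fixed range.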

Note that since we train the model from random weights, we also applied the speed optimization described in the paper, namely using a reduced tail in the backbone.

5. Differences in implementation methods

Comparing our implementation with the one in the official repo, we found some differences.

Most of the differences relate to how the weights are initialized (e.g. Gaussian vs. truncated normal distribution) and how the LR scheduling is parameterized (e.g. smaller vs. larger warmup rate, shorter vs. longer training).
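The difference between the two initialization schemes can be sketched in plain Python. The example assumes that the truncated normal rejects samples more than two standard deviations from the mean, and the standard deviation of 0.03 is a hypothetical value chosen only for illustration.

```python
import random
from typing import List


def gaussian_init(n: int, std: float, rng: random.Random) -> List[float]:
    """Draw n weights from an unbounded normal distribution with mean 0."""
    return [rng.gauss(0.0, std) for _ in range(n)]


def truncated_normal_init(n: int, std: float, rng: random.Random,
                          bound: float = 2.0) -> List[float]:
    """Draw n weights from a normal distribution, redrawing any sample
    that falls more than `bound` standard deviations from the mean."""
    weights: List[float] = []
    while len(weights) < n:
        value = rng.gauss(0.0, std)
        if abs(value) <= bound * std:
            weights.append(value)
    return weights


rng = random.Random(0)
STD = 0.03  # hypothetical standard deviation
gaussian = gaussian_init(10_000, STD, rng)
truncated = truncated_normal_init(10_000, STD, rng)
print("max |w| gaussian: ", max(abs(w) for w in gaussian))
print("max |w| truncated:", max(abs(w) for w in truncated))
```

The truncated variant never produces the occasional large weight that the plain Gaussian can.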

The most significant known difference is in how the classification loss is computed. The official repo's implementation of SSDlite with the MobileNetV3 backbone does not use SSD's Multibox loss; it uses RetinaNet's focal loss instead.

Since TorchVision already provides a complete implementation of RetinaNet, we decided to implement SSDlite with the normal Multibox SSD loss.
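To illustrate how the two classification terms behave, the sketch below compares plain cross-entropy with the focal loss on a single prediction. It is a simplification: it looks only at the probability assigned to the correct label, and the gamma and alpha values are commonly used defaults taken as assumptions here, not settings from the article.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Cross-entropy term for the probability assigned to the correct label."""
    return -math.log(p_true)


def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss term: cross-entropy scaled by alpha * (1 - p_true) ** gamma,
    which shrinks the contribution of well-classified examples."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


for p in (0.1, 0.5, 0.9):
    print(f"p={p:.1f}  cross-entropy={cross_entropy(p):.4f}  focal={focal_loss(p):.4f}")
```

With gamma set to 0 and alpha set to 1, the focal term reduces to plain cross-entropy, which makes the relationship between the two losses easy to check.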

Breakdown of key accuracy improvements

Porting a paper to code does not guarantee matching its accuracy, especially when the complete training process and implementation details are not known. The process usually involves a lot of backtracking, because we need to identify which implementation details and parameters have a significant impact on accuracy.

Below we visualize the most important iterations that improved accuracy over the baseline:

image.png

image.png

The order of optimizations shown above is accurate, although a bit idealized in some cases. For example, different schedulers were tested during the hyperparameter tuning phase, but none of them brought a significant improvement in accuracy, so we kept the MultiStepLR used in the baseline.

Later, when testing different LR schemes, we found that switching to CosineAnnealingLR required less configuration and gave better results.
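The configuration difference between the two schedulers can be seen in a plain-Python sketch of their closed-form learning rates. The base learning rate, milestones, decay factor, and epoch count below are hypothetical values chosen for illustration.

```python
import math
from bisect import bisect_right
from typing import Sequence


def multistep_lr(epoch: int, base_lr: float, milestones: Sequence[int],
                 gamma: float = 0.1) -> float:
    """Step schedule: multiply the LR by gamma at every milestone reached."""
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)


def cosine_annealing_lr(epoch: int, base_lr: float, total_epochs: int,
                        eta_min: float = 0.0) -> float:
    """Cosine schedule: decay smoothly from base_lr to eta_min over total_epochs."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


BASE_LR = 0.1          # hypothetical
TOTAL_EPOCHS = 100     # hypothetical
MILESTONES = [60, 80]  # hypothetical, must be chosen by hand

for epoch in (0, 50, 60, 80, 100):
    step = multistep_lr(epoch, BASE_LR, MILESTONES)
    cosine = cosine_annealing_lr(epoch, BASE_LR, TOTAL_EPOCHS)
    print(f"epoch {epoch:3d}  multistep={step:.4f}  cosine={cosine:.4f}")
```

The step schedule needs hand-picked milestones and a decay factor, whereas the cosine schedule is fully determined by the training length.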

In summary, even when we start from a correct implementation and the best hyperparameters of a model from the same family, accuracy can always be improved to some extent by optimizing the training process and tuning the implementation.

Admittedly, the above is a rather extreme case in which accuracy doubled, but in many cases there is still plenty of room for optimizations that can improve accuracy substantially.

Benchmark

Initialize two pre-trained models:

`import torchvision
ssdlite = torchvision.models.detection.ssdlite320_mobilenet_v3_large(pretrained=True)
ssd = torchvision.models.detection.ssd300_vgg16(pretrained=True)`

Benchmark comparison between the old and new models:

image.png

The SSDlite320 MobileNetV3-Large model is by far the fastest and smallest of these models, which makes it well suited to mobile app development.

Although its accuracy is lower than that of the pre-trained low-resolution Faster R-CNN model, the SSDlite framework is highly adaptable, and users can increase accuracy by introducing heavier heads with more convolutions.

The SSD300 VGG16 model, on the other hand, is quite slow and less accurate, mainly because of its VGG16 backbone. Although the VGG architecture has been very influential, it is now somewhat outdated.

This particular model was included in TorchVision because of its historical significance and research value. If you need a high-resolution detector, we still recommend either combining SSD with a different backbone or using one of the pre-trained Faster R-CNN models.

Reference: PyTorch Blog


超神经HyperAI