Twins is a visual attention model jointly proposed by Meituan and the University of Adelaide, and the related paper has been accepted by NeurIPS 2021. This article describes the problems Twins solves, its design and implementation ideas, and its exploration and application in Meituan scenarios, hoping to help and inspire readers working on vision algorithms.
Summary
Twins [1] is a visual attention model jointly proposed by Meituan and the University of Adelaide. The related paper has been accepted by NeurIPS 2021, and the code has been open sourced on GitHub. NeurIPS (Conference on Neural Information Processing Systems) is a top international conference on machine learning, computational neuroscience, and artificial intelligence.
Twins proposes two types of structures, Twins-PCPVT and Twins-SVT:
- Twins-PCPVT replaces the fixed positional encoding in the pyramid Transformer model PVT [2] with the Conditional Positional Encoding (CPE) proposed by the team in CPVT [3]. This gives the model translation equivariance (when the input image is translated, the output shifts correspondingly) and lets it flexibly handle features from different spatial scales, so it can be widely used in variable-length input scenarios such as image segmentation and detection.
- Twins-SVT proposes a Spatially Separable Self-Attention mechanism (SSSA), which groups the spatial dimensions of image features, computes self-attention within each local group, and then fuses the groups with a global self-attention mechanism. This mechanism is more computationally efficient and performs better.
The Twins series of models is simple to implement and deployment friendly, and achieves industry-leading results on many classic vision tasks such as ImageNet classification, ADE20K semantic segmentation, and COCO object detection.
Background
In September 2020, Google's Vision Transformer (ViT) [4] successfully applied the Transformer [5], originally designed for natural language processing, to visual classification. ViT splits an input image into several image patches, treats each patch like a word, and feeds the resulting sequence into a Transformer encoder (as shown in Figure 1). After L encoder layers, an ordinary multilayer perceptron (MLP) maps the result to the category space. ViT's performance greatly exceeds that of convolutional neural networks, and it has since rapidly become a main focus of research in the vision field.
![Figure 1 The visual attention model (ViT) applies the Transformer from natural language processing to vision tasks (source: ViT [4])](https://p1.meituan.net/travelcube/799ca7205dad0fd1ad5e8c8604480625136165.png@750w_80q)
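To make the patch-splitting step concrete, the sketch below turns an image into a sequence of patch tokens with a strided convolution, which is a common way to implement ViT-style patch embedding; the sizes are illustrative rather than taken from the paper:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each patch to a token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # a convolution with kernel = stride = patch_size is equivalent to
        # flattening each patch and applying a shared linear projection
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, 3, H, W)
        x = self.proj(x)                        # (B, embed_dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)     # (B, N, embed_dim), N = (H/16) * (W/16)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```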
The core multi-head self-attention in the Transformer encoder is built on the scaled dot-product attention given by the following formula, where Q, K, and V denote the Query, Key, and Value matrices, d is the encoding dimension, and softmax is a normalization function. The attention mechanism can be understood as re-weighting the input according to pairwise relevance:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$
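To make the formula concrete, here is a minimal PyTorch sketch of scaled dot-product attention for a single head without masking; the tensor names follow the formula above, and the function is illustrative rather than the exact ViT implementation:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (B, N, d) query/key/value tensors, d is the encoding dimension
    d = q.size(-1)
    # relevance scores between every pair of tokens, scaled by sqrt(d)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (B, N, N)
    # softmax normalizes the scores into attention weights
    weights = F.softmax(scores, dim=-1)
    # each output token is a relevance-weighted mixture of the value tokens
    return weights @ v                            # (B, N, d)

# toy usage: 2 images, 196 tokens (14x14 patches), 64-dim encoding
x = torch.randn(2, 196, 64)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # torch.Size([2, 196, 64])
```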
As a backbone network, the native visual attention model is not well suited to common dense prediction tasks such as object detection and semantic segmentation. In addition, compared with convolutional neural networks, ViT usually requires more computation and runs slower at inference time, which hinders practical applications. Designing a more efficient visual attention model that better adapts to downstream tasks has therefore become a focus of current research. The pyramid visual attention model PVT [2], jointly proposed by the University of Hong Kong and SenseTime, borrows the image pyramid paradigm from convolutional neural networks to generate multi-scale features, which can be combined end to end with existing post-processing for dense tasks and thus supports a variety of downstream tasks, as shown in Figure 2(c). However, because PVT uses a static, fixed-length positional encoding that is adapted to variable-length inputs by interpolation and cannot be conditioned on the input, its performance is limited. In addition, PVT keeps ViT's global self-attention mechanism, so its computational cost is still large.
![Figure 2 PVT transfers the pyramid paradigm of convolutional neural networks (a) to the visual attention model (b), obtaining (c), which adapts to classification, detection, and segmentation tasks (source: PVT [2])](https://p0.meituan.net/travelcube/eebbcf9f58abb9763977218a43101c9f295988.png@750w_80q)
Swin [6], proposed by Microsoft Research Asia, reuses PVT's pyramid structure. When computing self-attention, it partitions the features into windows (as shown in Figure 3), restricts the attention mechanism to each small window (red grid), and then shifts the windows so that information in different groups can interact. This avoids global self-attention and reduces the amount of computation. The drawback is that global attention is lost, and the information exchange produced by window shifting is relatively weak, which limits performance to some extent.
![Figure 3 Swin computes local self-attention within each red grid and lets local attentions interact through window shifting between layers (source: Swin [6])](https://p0.meituan.net/travelcube/35daedb7476c63158e6ec880d1e483fc303574.png@750w_80q)
Difficulties in Visual Attention Model Design
To briefly summarize, the difficulties that need to be solved in the design of the current visual attention model are:
- Efficient computation: narrow the gap in computational efficiency with convolutional neural networks and promote practical business applications;
- A flexible attention mechanism: one that combines the local receptive field of convolution with the global receptive field of self-attention, taking advantage of both;
- Friendliness to downstream tasks: support downstream tasks such as detection and segmentation, especially in scenarios where the input scale changes.
Twins model design
Starting from these difficulties and based on a detailed analysis of current visual attention models, the Meituan Vision Intelligence Department rethought the design of the self-attention mechanism and proposed targeted solutions. First, PVT [2] and CPVT [3] are combined to form Twins-PCPVT, which supports downstream tasks in scale-changing scenarios. Then, from the perspective of the efficiency and receptive field of the self-attention mechanism, a new kind of self-attention compatible with both local and global receptive fields, called Spatially Separable Self-Attention (SSSA), is designed, forming Twins-SVT.
Twins-PCPVT
Twins-PCPVT replaces the positional encodings in PVT (the same fixed-length, learnable positional encodings as in DeiT [7]) with the Conditional Positional Encoding (CPE) from CPVT [3]. The module that generates the CPE is called the Positional Encoding Generator (PEG). In the Twins model, the PEG is placed after the first Transformer encoder of each stage, as shown in Figure 4 below.
Conditional Positional Encoding
Figure 5 below shows the encoding process of the conditional positional encoder proposed by the team in CPVT [3]. First, the input sequence of shape $N \times d$ is reshaped into an input feature of shape $H \times W \times d$; then a conditional encoding function $F$ computes the positional encoding from the input. Its output has the same size as the input feature, so it can be reshaped back into an $N \times d$ sequence and fused with the input feature by element-wise addition.
The encoding function $F$ can be implemented by a simple depthwise separable convolution or by other modules. Simplified code for the PEG part is given below. The input feat_token is a tensor of shape $B \times N \times C$, where $B$ is the batch size, $N$ is the number of tokens, and $C$ is the encoding dimension (the same as $d$ in Figure 5). feat_token is first reshaped into a tensor cnn_feat of shape $B \times C \times H \times W$, and a depthwise separable convolution (the PEG) then produces a tensor of the same shape as feat_token, i.e., the conditional positional encoding.
```python
import torch.nn as nn

class PEG(nn.Module):
    def __init__(self, in_chans, embed_dim):
        super(PEG, self).__init__()
        # depthwise convolution (groups == channels) that generates the conditional positional encoding
        self.peg = nn.Conv2d(in_chans, embed_dim, 3, 1, 1, bias=True, groups=embed_dim)

    def forward(self, feat_token, H, W):
        B, N, C = feat_token.shape
        # reshape the token sequence B x N x C back into a feature map B x C x H x W
        cnn_feat = feat_token.transpose(1, 2).view(B, C, H, W)
        # conditional positional encoding, fused with the features by element-wise addition
        x = self.peg(cnn_feat) + cnn_feat
        # flatten back into a token sequence B x N x C
        x = x.flatten(2).transpose(1, 2)
        return x
```
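As a quick illustration (the sizes below are made up for the example, and in Twins the module sits after the first encoder block of each stage), the PEG module above can be applied to a token sequence like this:

```python
import torch

B, H, W, C = 2, 14, 14, 64          # illustrative batch size, feature-map size, and channel count
tokens = torch.randn(B, H * W, C)   # token sequence of shape B x N x C

peg = PEG(in_chans=C, embed_dim=C)  # depthwise conv, groups == channels
out = peg(tokens, H, W)             # conditional positional encoding fused with the tokens
print(out.shape)                    # torch.Size([2, 196, 64])
```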
Since the conditional positional encoding CPE is generated from the input, it supports variable-length inputs, which enables Twins to flexibly handle features from different spatial scales. In addition, because the PEG is implemented with convolution, Twins retains translation equivariance. This property is very important for image tasks: if a target is shifted in a detection task, the detection box should shift accordingly. Experiments show that the Twins-PCPVT family of models directly improves performance in classification and downstream tasks, especially on dense tasks. This architecture shows that PVT can achieve very good performance when enhanced only with CPVT's conditional positional encoding, which indicates that the positional encoding used by PVT limits its performance.
Twins-SVT
Twins-SVT (shown in Figure 6 below) optimizes and improves the global attention strategy. The computation of global attention grows quadratically with the image resolution, so how to reduce the computation without a significant loss in performance is also a research hotspot. Twins-SVT proposes a new mechanism that fuses local and global attention, which can be seen as an analogue of depthwise separable convolution in convolutional neural networks and is therefore named Spatially Separable Self-Attention (SSSA). Unlike depthwise separable convolution, the spatially separable self-attention proposed by Twins-SVT (shown in Figure 7 below) groups the spatial dimensions of the features, computes self-attention within each group, and then fuses the grouped attention results globally.
Spatially separable self-attention adopts an alternating local-global self-attention (LSA-GSA) mechanism, in which the locally grouped attention can be efficiently propagated to the global scope. LSA greatly reduces the computational cost: the complexity drops from $O(H^2W^2d)$, quadratic in the input size, to the linear $O(mnHWd)$. The key implementation of the grouped local attention LSA (with the initialization function omitted) is as follows:
```python
class LSA(nn.Module):
    def forward(self, x, H, W):
        B, N, C = x.shape
        # number of groups along the height (H) and width (W) dimensions, given the window size ws
        h_group, w_group = H // self.ws, W // self.ws
        total_groups = h_group * w_group
        # partition the input by windows: reshape to B x h_group x ws x w_group x ws x C,
        # then transpose to B x h_group x w_group x ws x ws x C
        x = x.reshape(B, h_group, self.ws, w_group, self.ws, C).transpose(2, 3)
        # compute q, k, v for each group
        qkv = self.qkv(x).reshape(B, total_groups, -1, 3, self.num_heads, C // self.num_heads).permute(3, 0, 1, 4, 2, 5)
        q, k, v = qkv[0], qkv[1], qkv[2]
        # attention scores within each group
        attn = (q @ k.transpose(-2, -1)) * self.scale
        # normalize the attention weights
        attn = attn.softmax(dim=-1)
        # attention dropout
        attn = self.attn_drop(attn)
        # weight v by the local self-attention within each group
        attn = (attn @ v).transpose(2, 3).reshape(B, h_group, w_group, self.ws, self.ws, C)
        # merge the windows back into a token sequence of shape B x N x C
        x = attn.transpose(2, 3).reshape(B, N, C)
        # output projection (MLP layer)
        x = self.proj(x)
        # dropout layer
        x = self.proj_drop(x)
        return x
```
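To see where the savings come from, the following snippet (shapes chosen only for illustration) compares the number of attention scores computed by global self-attention and by the windowed grouping used in LSA on a 56x56 feature map with window size 7; it only manipulates shapes and does not depend on the class above:

```python
import torch

B, H, W, C, ws = 1, 56, 56, 64, 7
x = torch.randn(B, H * W, C)

# global self-attention relates every token to every other token
global_attn_size = (H * W) ** 2                       # 3136 x 3136 ≈ 9.8M scores

# LSA groups tokens into (H/ws) x (W/ws) windows of ws*ws tokens each
h_group, w_group = H // ws, W // ws
windows = x.reshape(B, h_group, ws, w_group, ws, C).transpose(2, 3)
local_attn_size = h_group * w_group * (ws * ws) ** 2  # 64 windows x 49 x 49 ≈ 154K scores

print(windows.shape)                    # torch.Size([1, 8, 8, 7, 7, 64])
print(global_attn_size, local_attn_size)
```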
The key implementation of the GSA that efficiently fuses the LSA attention (with the initialization function omitted) is given below. Compared with ViT's original global self-attention, GSA computes K and V on a spatially shrunken feature map, while Q remains global, so the attention can still be restored to the global scope. This practice significantly reduces the amount of computation.
```python
class GSA(nn.Module):
    def forward(self, x, H, W):
        B, N, C = x.shape
        # compute the query tensor q from the full-resolution input feature x
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).permute(0, 2, 1, 3)
        # reshape the token sequence into a feature map B x C x H x W
        x_ = x.permute(0, 2, 1).reshape(B, C, H, W)
        # shrink the spatial size of the input feature to obtain x_
        x_ = self.sr(x_).reshape(B, C, -1).permute(0, 2, 1)
        # layer normalization (LayerNorm)
        x_ = self.norm(x_)
        # compute k and v from the down-sampled feature x_
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]
        # global self-attention between the full-resolution queries and the shrunken keys
        attn = (q @ k.transpose(-2, -1)) * self.scale
        # normalize the attention weights
        attn = attn.softmax(dim=-1)
        # attention dropout
        attn = self.attn_drop(attn)
        # weight v by the global self-attention
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        # output projection
        x = self.proj(x)
        # dropout layer
        x = self.proj_drop(x)
        return x
```
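Inside a Twins-SVT stage, the two kinds of attention are used alternately (the LSA-GSA mechanism described above). The sketch below shows one plausible way to wrap either module in a standard pre-norm Transformer block; the block layout (LayerNorm plus MLP with residual connections) is a common convention and an assumption here, not a copy of the released implementation:

```python
import torch.nn as nn

class AttnBlock(nn.Module):
    """Generic pre-norm Transformer block wrapping a given attention module (LSA or GSA)."""
    def __init__(self, dim, attn, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = attn                      # an already-constructed LSA or GSA instance
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(             # standard feed-forward sub-layer
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x, H, W):
        x = x + self.attn(self.norm1(x), H, W)   # attention sub-layer with residual
        x = x + self.mlp(self.norm2(x))          # MLP sub-layer with residual
        return x

# within a stage, local and global blocks alternate:
#   blocks = [AttnBlock(dim, LSA(...)), AttnBlock(dim, GSA(...)), ...]
#   for blk in blocks: x = blk(x, H, W)
```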
As the code above shows, the SVT series is implemented entirely with operations that already exist in mainstream deep learning frameworks and requires no extra low-level adaptation, so it is convenient to deploy.
Experiments
ImageNet-1k classification
Compared with models of the same scale, Twins-PCPVT and Twins-SVT achieve state-of-the-art results on the ImageNet-1k classification task with superior throughput. In addition, Twins supports TensorRT deployment: with NVIDIA TensorRT 7.0 inference, the Twins-SVT-S model is accelerated by 1.6x, and its throughput increases from 1059 images/s with the PyTorch implementation to 1732 images/s.
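As an illustration of that deployment path, a common route to TensorRT is to export the trained model to ONNX first and then build an engine from the ONNX file (for example with trtexec). The sketch below covers only the export step; loading the model through timm under the name twins_svt_small is an assumption about the packaging, and the opset version is illustrative:

```python
import torch
import timm  # assumption: the Twins models are available via timm as 'twins_svt_small'

model = timm.create_model('twins_svt_small', pretrained=False).eval()
dummy = torch.randn(1, 3, 224, 224)                      # classification input size

# export a static-shape ONNX graph; the resulting file can then be passed to
# TensorRT (e.g. trtexec --onnx=twins_svt_s.onnx) to build an inference engine
torch.onnx.export(
    model, dummy, 'twins_svt_s.onnx',
    input_names=['input'], output_names=['logits'],
    opset_version=13,
)
```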
ADE20K Semantic Segmentation
On the ADE20K semantic segmentation task, the Twins models, used as backbone networks with FPN and UperNet heads respectively, achieve better results than PVT and Swin, as shown in Table 2 below.
COCO Object Detection (RetinaNet Framework)
On the classic COCO object detection task with the RetinaNet framework, the Twins models outperform PVT by a large margin. Moreover, the Twins-PCPVT series shows that PVT, once enhanced with the CPVT encoding, becomes comparable to Swin models of the same scale, as shown in Table 3 below.
COCO Object Detection (Mask R-CNN Framework)
Under the Mask R-CNN framework, the Twins models also show a clear performance advantage on COCO, which is maintained under longer (3x) training schedules, as shown in Table 4 below.
Application to Multi-Element Semantic Segmentation for High-Precision Maps
High-precision maps are a key component of autonomous driving and play a very important role in Meituan's autonomous delivery, ride-hailing, and other businesses. Semantic extraction of key road-scene elements, as a pre-processing step of high-precision mapping, directly affects mapping quality. Multi-element semantic segmentation is an important part of this semantic extraction, and the industry generally implements it with classic semantic segmentation algorithms.
Take the DeepLab series [8] as a representative example. A segmentation model is usually divided into an encoding stage and a decoding stage: a convolutional neural network extracts features, and spatial pyramid pooling together with atrous convolutions at different scales (as shown in Figure 8 below) is used to enlarge the global receptive field. This design is limited on the one hand by the feature extraction ability of convolutional neural networks and on the other hand by their limited capacity to model global relationships, so segmentation results often lack attention to detail and the edges are frequently not sharp enough.
![Figure 8 Classic Semantic Segmentation Model Architecture (DeepLabV3+ [8])]( https://p0.meituan.net/travelcube/9b88dd87d5dbb64c208d62c3716a70f7469291.png@750w_80q )
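For reference, the atrous spatial pyramid pooling shown in Figure 8 boils down to running several dilated convolutions with different rates in parallel and fusing their outputs. The following stripped-down sketch (channel counts and dilation rates chosen for illustration) shows the idea:

```python
import torch
import torch.nn as nn

class MiniASPP(nn.Module):
    """Parallel atrous (dilated) convolutions with different rates, then fuse."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        # with padding == dilation, each 3x3 branch keeps the spatial size
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        ])
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)  # 1x1 conv after concatenation

    def forward(self, x):
        feats = [b(x) for b in self.branches]   # same spatial size, increasingly large receptive fields
        return self.fuse(torch.cat(feats, dim=1))

y = MiniASPP(256, 64)(torch.randn(1, 256, 32, 32))
print(y.shape)  # torch.Size([1, 64, 32, 32])
```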
Although Twins greatly improves the efficiency and performance of the visual attention model, to keep inference efficiency close to that of a convolutional neural network we still need to further optimize the model's decoder head. Unlike the heavier FPN [9] or UperNet [10] heads used in the paper for fair comparison with other methods, we designed the simple, lightweight head shown in Figure 9 below, which strikes a good balance between performance and inference speed on the business dataset. This head is designed around the characteristics of Twins: since Twins already combines global and local attention, the head does not need a complex design to enlarge the receptive field. The features at each scale only go through a linear transform and rescaling, are restored to the same size and concatenated (Concat), and the segmentation result is obtained after a simple dimension transformation.
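Based on that description, a hypothetical sketch of such a lightweight head is given below: each stage's feature map is linearly projected by a 1x1 convolution, rescaled to a common resolution, concatenated, and mapped to class logits. The channel widths, class count, and module names are illustrative, not the exact business implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightSegHead(nn.Module):
    """Project each scale, upsample to the finest scale, concatenate, predict classes."""
    def __init__(self, in_dims=(64, 128, 256, 512), embed_dim=256, num_classes=19):
        super().__init__()
        # per-scale linear transform: 1x1 convs mapping every stage to the same width
        self.proj = nn.ModuleList([nn.Conv2d(d, embed_dim, 1) for d in in_dims])
        self.pred = nn.Conv2d(embed_dim * len(in_dims), num_classes, 1)

    def forward(self, feats):                 # feats: list of (B, C_i, H_i, W_i), fine to coarse
        size = feats[0].shape[-2:]            # restore everything to the finest resolution
        outs = [F.interpolate(p(f), size=size, mode='bilinear', align_corners=False)
                for p, f in zip(self.proj, feats)]
        return self.pred(torch.cat(outs, dim=1))   # (B, num_classes, H_0, W_0)

# toy multi-scale features from a 4-stage backbone
feats = [torch.randn(1, c, s, s) for c, s in zip((64, 128, 256, 512), (56, 28, 14, 7))]
print(LightSegHead()(feats).shape)  # torch.Size([1, 19, 56, 56])
```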
The comparison of segmentation results in Figure 10 below shows that the model with Twins as the backbone extracts finer image edges: for key road elements such as median strips, road signs, and street light poles, the difference from the ground truth is smaller.
Summary
The visual attention model is a current research focus in the vision field and has demonstrated advantages over classic convolutional neural networks on a variety of vision tasks, but it still needs careful optimization in terms of efficiency, and its accuracy still has room to improve. Exploring more efficient attention model designs and pushing cutting-edge vision research toward industrial deployment is also of great significance to Meituan's business.
The Twins series of model architectures, jointly designed by Meituan and the University of Adelaide, effectively reduces computational cost, improves model performance, and better supports dense tasks such as detection and segmentation. In addition, we applied Twins to the element semantic segmentation scenario of Meituan's high-precision maps, which produced more refined segmentation results and improved the quality of map construction. In the future, the vision team will continue to explore efficient visual attention model designs and expects to apply them in a wider range of Meituan business scenarios.
References
- [1] Twins: Revisiting the Design of Spatial Attention in Vision Transformers
- [2] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
- [3] Conditional Positional Encodings for Vision Transformers
- [4] An image is worth 16x16 words: Transformers for image recognition at scale
- [5] Attention Is All You Need
- [6] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- [7] Training data-efficient image transformers & distillation through attention
- [8] Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
- [9] Panoptic Feature Pyramid Networks
- [10] Unified Perceptual Parsing for Scene Understanding
About the Author
Xiangxiang, Tian Zhi, Zhang Bo, and Xiaolin are from the Vision Intelligence Department; Haibing and Huaxia are from the Autonomous Delivery Department.
Team Profile and Recruitment Information
The AutoML algorithm team of Meituan's Vision Intelligence Department aims to empower the company's businesses and accelerate the deployment of algorithms through AutoML and cutting-edge vision technologies, covering directions such as AutoML, segmentation, detection (2D and 3D), and self-training. Students interested in campus or experienced positions are welcome to send resumes to chuxiangxiang@meituan.com; both internship and full-time applications are accepted.
The high-precision map team of Meituan's Autonomous Delivery Department is responsible for providing high-precision, high-quality, large-scale map services for Meituan's autonomous driving. High-precision mapping is a comprehensive technology spanning multiple disciplines: it relies on algorithms such as SLAM, geographic surveying and mapping, deep learning, and multi-sensor localization to build the maps, and it also requires big data technology, high-performance computing, and high-concurrency services for large-scale map processing, storage, and query. The team has long-term openings for experts in computer vision, SLAM, and system development. Interested students can send resumes to tech@meituan.com (email subject: Meituan HD Map).