Abstract: This article looks at the current state of ViT, analyzes its remaining problems and the corresponding solutions, and summarizes the ideas of related papers.

This article is shared from the Huawei Cloud community article "[ViT] Current Vision Transformer's Problems and Overcoming Methods: A Summary of Related Papers", author: Su Dao.

First, let's look at the seminal ViT paper:

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Paper address: https://arxiv.org/abs/2010.11929
[Figure: ViT overall architecture (left) and the structure of each Transformer Encoder block (right)]

ViT uses the full Transformer structure and divides the image into small square regions (patches) that serve as input. The figure on the left shows the overall architecture of ViT, and the figure on the right shows the structure of each block in the Transformer Encoder. We can see that it is essentially the original Transformer structure, except that the norm layer is placed in front (pre-norm), and some papers have shown that the pre-norm variant is easier to train.
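Below is a minimal sketch of such a pre-norm encoder block in PyTorch. The class name and the hyperparameters (`dim`, `heads`, `mlp_ratio`) are illustrative rather than ViT's exact configuration:

```python
# Minimal sketch of a pre-norm Transformer encoder block (illustrative, not ViT's exact code).
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        # LayerNorm is applied *before* attention and the MLP (pre-norm),
        # which, as noted above, tends to make training easier.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```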

Using the Transformer lets the model capture global information about the image at every layer, but it is not perfect. It has the following shortcomings:

1. Large data demand: Self-Attention has a weaker inductive bias than CNNs. Inductive bias refers to the assumptions a model makes about data it has not yet encountered. CNNs assume spatial invariance, so they can slide a single set of weights across the whole feature map, and RNNs assume time invariance. Self-Attention makes none of these assumptions, so it needs more data to learn them automatically; the advantage is that whatever assumptions it does learn can be more flexible.

For this problem, we can use a CNN as the teacher network and add a distillation loss to help the ViT student learn, as sketched below.
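A hedged sketch of such a distillation loss, roughly in the spirit of DeiT. The function name, the temperature `T`, and the weight `alpha` are placeholders, not values from any specific paper:

```python
# Sketch of distillation with a CNN teacher; all names and weights are illustrative.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.5):
    # Standard cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label KL term: the ViT student mimics the CNN teacher's output distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1 - alpha) * kd
```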

Patch Embedding is essentially one large convolution whose kernel size and stride both equal the patch size. In ViT this is a 16x16 kernel with stride 16, which is not very stable to train, so later studies replace it with a stack of several convolutions combined with pooling, or simply use residual blocks for the first few stages (see the sketch below).
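A minimal sketch of the patch embedding as a single large-stride convolution, assuming the standard 16x16 patch size and a ViT-Base-like embedding dimension of 768:

```python
# Patch embedding as a convolution with kernel size = stride = patch size (illustrative dims).
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(in_channels=3, out_channels=768,
                        kernel_size=16, stride=16)

x = torch.randn(1, 3, 224, 224)             # one 224x224 RGB image
tokens = patch_embed(x)                      # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 196, 768): 196 patch tokens
```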

2. Large amount of computation: the computational complexity of Self-Attention is quadratic in the number of tokens. If the input is a 56*56 feature map, the attention involves matrix operations over 3000+ tokens, and in the original ViT the number of tokens and the hidden size stay constant throughout the network. Later researchers adopted several methods to reduce this cost: borrow the pyramid structure from ResNet, so that deeper stages have fewer tokens; use local-window self-attention, computing attention only within parts of the feature map and then finding a way for these local windows to exchange information; use convolutions instead of fully connected layers to reduce parameters; or, when generating Q, K, and V, pool the feature maps or tokens used for K and V to lower the complexity (a sketch of this last idea follows).
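A hedged sketch of the last trick, pooling the K/V tokens before attention (in the spirit of PVT's spatial-reduction attention). The class name, reduction ratio `sr`, and dimensions are illustrative assumptions:

```python
# Sketch: queries keep full resolution, keys/values are pooled, so cost drops by sr^2.
import torch
import torch.nn as nn

class PooledKVAttention(nn.Module):
    def __init__(self, dim=768, heads=12, sr=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pool = nn.AvgPool2d(kernel_size=sr, stride=sr)

    def forward(self, x, H, W):
        # x: (B, H*W, C) patch tokens laid out on an H x W grid.
        B, N, C = x.shape
        kv = x.transpose(1, 2).reshape(B, C, H, W)
        kv = self.pool(kv).flatten(2).transpose(1, 2)   # (B, H*W/sr^2, C)
        out, _ = self.attn(x, kv, kv, need_weights=False)
        return out
```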

3. The number of stacked layers is limited: ViT suffers from an over-smoothing problem. The similarity between different blocks increases as the model deepens, and so does the similarity between different tokens. The main remedies are: increase the hidden size, although this adds a large number of parameters; apply a linear transformation along the head dimension before or after the attention-map softmax to increase information exchange and the diversity of the attention maps (sketched below); add dropout in the deeper layers to increase feature diversity; or add a similarity-penalty loss term.
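A hedged sketch of the head-mixing idea, roughly in the spirit of DeepViT's Re-Attention: a learnable (heads x heads) matrix re-mixes the per-head attention maps. The class name, initialization, and the omission of any re-normalization are simplifying assumptions:

```python
# Sketch: mix information across heads of the attention map with a learnable matrix.
import torch
import torch.nn as nn

class HeadMixing(nn.Module):
    def __init__(self, heads=12):
        super().__init__()
        # Learnable (heads x heads) matrix, initialized near the identity.
        self.theta = nn.Parameter(torch.eye(heads) + 0.01 * torch.randn(heads, heads))

    def forward(self, attn):
        # attn: (B, heads, N, N) attention maps (e.g. after softmax).
        # Each output head becomes a learned combination of all input heads.
        return torch.einsum('hg,bgnm->bhnm', self.theta, attn)
```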

4. The model itself cannot encode position: a variety of position encodings are needed. Some position encodings are listed below: fixed and learnable, absolute and relative, and convolutional ones that apply a convolution over the features to act as the position encoding (see the sketch below).
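A hedged sketch of the convolutional variant: a depth-wise convolution over the token grid serves as a conditional position encoding, as in CPVT-style designs. The class name and dimensions are illustrative:

```python
# Sketch: a depth-wise 3x3 conv over the token grid acts as a position encoding.
import torch
import torch.nn as nn

class ConvPosEnc(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        # Depth-wise conv; zero-padding at the borders lets the model infer position.
        self.proj = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, H, W):
        # x: (B, H*W, C) patch tokens (class token excluded for simplicity).
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        return x + self.proj(feat).flatten(2).transpose(1, 2)
```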

See the table below for details
[Table: overview of position encoding variants]

You can check the following table for related papers on the above improvements:
[Tables: related papers for the improvements discussed above]
