Abstract: This article looks at the current state of ViT, analyzes its remaining problems and the corresponding solutions, and summarizes the ideas of related papers.

This article is shared from the Huawei Cloud community article "[ViT] Current Vision Transformer's Problems and Overcoming Methods: A Summary of Related Papers", author: Su Dao.

First, let's look at the seminal ViT paper:

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Paper address: https://arxiv.org/abs/2010.11929
[Figure: ViT overall architecture (left) and the structure of each Transformer Encoder block (right)]

ViT uses the full Transformer structure and divides the image into small square regions (patches) that serve as input. The figure on the left shows the overall architecture of ViT, and the figure on the right shows the structure of each block in the Transformer Encoder. We can see that it is essentially the original Transformer structure, except that the norm layer is placed in front (pre-norm), and some papers have shown that the pre-norm variant is easier to train.
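Below is a minimal sketch of such a pre-norm encoder block in PyTorch. The class name and the hyperparameters (`dim`, `heads`, `mlp_ratio`) are illustrative rather than ViT's exact configuration:

```python
# Minimal sketch of a pre-norm Transformer encoder block (illustrative, not ViT's exact code).
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        # LayerNorm is applied *before* attention and the MLP (pre-norm),
        # which, as noted above, tends to make training easier.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```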

Using the Transformer lets the model capture global information about the image at every layer, but it is not perfect. It has the following shortcomings:

1. Large data demand: Self-Attention has a weaker inductive bias than CNNs. Inductive bias refers to the assumptions a model makes about data it has not yet encountered. CNNs assume spatial invariance, so they can slide a single set of weights across the whole feature map, and RNNs assume time invariance. Self-Attention makes none of these assumptions, so it needs more data to learn them automatically; the advantage is that whatever assumptions it does learn can be more flexible.

For this problem, we can use a CNN as the teacher network and add a distillation loss to help the ViT student learn, as sketched below.
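A hedged sketch of such a distillation loss, roughly in the spirit of DeiT. The function name, the temperature `T`, and the weight `alpha` are placeholders, not values from any specific paper:

```python
# Sketch of distillation with a CNN teacher; all names and weights are illustrative.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.5):
    # Standard cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label KL term: the ViT student mimics the CNN teacher's output distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1 - alpha) * kd
```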

Patch Embedding is essentially one large convolution whose kernel size and stride both equal the patch size. In ViT this is a 16x16 kernel with stride 16, which is not very stable to train, so later studies replace it with a stack of several convolutions combined with pooling, or simply use residual blocks for the first few stages (see the sketch below).
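A minimal sketch of the patch embedding as a single large-stride convolution, assuming the standard 16x16 patch size and a ViT-Base-like embedding dimension of 768:

```python
# Patch embedding as a convolution with kernel size = stride = patch size (illustrative dims).
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(in_channels=3, out_channels=768,
                        kernel_size=16, stride=16)

x = torch.randn(1, 3, 224, 224)             # one 224x224 RGB image
tokens = patch_embed(x)                      # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 196, 768): 196 patch tokens
```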

2. Large amount of computation: the computational complexity of Self-Attention is quadratic in the number of tokens. If the input is a 56*56 feature map, the attention involves matrix operations over 3000+ tokens, and in the original ViT the number of tokens and the hidden size stay constant throughout the network. Later researchers adopted several methods to reduce this cost: borrow the pyramid structure from ResNet, so that deeper stages have fewer tokens; use local-window self-attention, computing attention only within parts of the feature map and then finding a way for these local windows to exchange information; use convolutions instead of fully connected layers to reduce parameters; or, when generating Q, K, and V, pool the feature maps or tokens used for K and V to lower the complexity (a sketch of this last idea follows).
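A hedged sketch of the last trick, pooling the K/V tokens before attention (in the spirit of PVT's spatial-reduction attention). The class name, reduction ratio `sr`, and dimensions are illustrative assumptions:

```python
# Sketch: queries keep full resolution, keys/values are pooled, so cost drops by sr^2.
import torch
import torch.nn as nn

class PooledKVAttention(nn.Module):
    def __init__(self, dim=768, heads=12, sr=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pool = nn.AvgPool2d(kernel_size=sr, stride=sr)

    def forward(self, x, H, W):
        # x: (B, H*W, C) patch tokens laid out on an H x W grid.
        B, N, C = x.shape
        kv = x.transpose(1, 2).reshape(B, C, H, W)
        kv = self.pool(kv).flatten(2).transpose(1, 2)   # (B, H*W/sr^2, C)
        out, _ = self.attn(x, kv, kv, need_weights=False)
        return out
```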

3. The number of stacked layers is limited: ViT suffers from an over-smoothing problem. The similarity between different blocks increases as the model deepens, and so does the similarity between different tokens. The main remedies are: increase the hidden size, although this adds a large number of parameters; apply a linear transformation along the head dimension before or after the attention-map softmax to increase information exchange and the diversity of the attention maps (sketched below); add dropout in the deeper layers to increase feature diversity; or add a similarity-penalty loss term.
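A hedged sketch of the head-mixing idea, roughly in the spirit of DeepViT's Re-Attention: a learnable (heads x heads) matrix re-mixes the per-head attention maps. The class name, initialization, and the omission of any re-normalization are simplifying assumptions:

```python
# Sketch: mix information across heads of the attention map with a learnable matrix.
import torch
import torch.nn as nn

class HeadMixing(nn.Module):
    def __init__(self, heads=12):
        super().__init__()
        # Learnable (heads x heads) matrix, initialized near the identity.
        self.theta = nn.Parameter(torch.eye(heads) + 0.01 * torch.randn(heads, heads))

    def forward(self, attn):
        # attn: (B, heads, N, N) attention maps (e.g. after softmax).
        # Each output head becomes a learned combination of all input heads.
        return torch.einsum('hg,bgnm->bhnm', self.theta, attn)
```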

4. The model itself cannot encode position: a variety of position encodings are needed. Some position encodings are listed below: fixed and learnable, absolute and relative, and convolutional ones that apply a convolution over the features to act as the position encoding (see the sketch below).
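A hedged sketch of the convolutional variant: a depth-wise convolution over the token grid serves as a conditional position encoding, as in CPVT-style designs. The class name and dimensions are illustrative:

```python
# Sketch: a depth-wise 3x3 conv over the token grid acts as a position encoding.
import torch
import torch.nn as nn

class ConvPosEnc(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        # Depth-wise conv; zero-padding at the borders lets the model infer position.
        self.proj = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, H, W):
        # x: (B, H*W, C) patch tokens (class token excluded for simplicity).
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        return x + self.proj(feat).flatten(2).transpose(1, 2)
```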

See the table below for details
[Table: overview of position encoding variants]

You can check the following table for related papers on the above improvements:
[Tables: related papers for the improvements discussed above]
