Abstract: This article interprets the paper "TransFG: A Transformer Architecture for Fine-grained Recognition", which proposes TransFG, a Transformer-based architecture for fine-grained classification tasks.
This article is shared from the HUAWEI Cloud community article "Paper Interpretation Series 20: Transformer Structure for Fine-Grained Classification—TransFG", author: BigDragon.
Paper address: https://arxiv.org/abs/2103.07976
GitHub address: https://github.com/TACJu/TransFG
Recent research on fine-grained classification has focused on how to locate discriminative image regions in order to improve the network's ability to capture subtle differences. Most existing work uses different backbone models to extract features from specific regions, but this complicates the pipeline and extracts a large number of redundant features from those regions. Therefore, this paper integrates all of the raw attention weights into a single attention map to guide the model to select discriminative image regions efficiently, and proposes TransFG, a Transformer structure for fine-grained classification.
Figure 1 TransFG structure
1 Problem definition
Fine-grained classification methods are mainly divided into localization-based methods and feature-encoding methods. Localization-based methods classify by locating discriminative local regions, while feature-encoding methods learn more informative representations through higher-order features or by modelling the relationships between contrastive pairs. TransFG integrates the attention weights to locate discriminative local regions and computes a contrastive loss over the resulting classification tokens to perform fine-grained classification.
2 TransFG
2.1 Image serialization
The original Vision Transformer splits the image into non-overlapping patches, but this damages the local neighboring structure and may split a discriminative region across patches. To solve this problem, this paper uses a sliding window to generate overlapping patches; the number of generated patches N is computed by formula (1), where H and W are the image height and width, P is the patch size, and S is the sliding-window step size.
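Formula (1) is not reproduced in this article; following the definitions above (H, W, P, S), it can be reconstructed roughly as follows:

```latex
N = N_H \times N_W
  = \left\lfloor \frac{H - P + S}{S} \right\rfloor \times \left\lfloor \frac{W - P + S}{S} \right\rfloor
```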
2.2 Patch Embedding and Transformer Encoder
TransFG follows the original ViT design in the Patch Embedding and Transformer Encoder modules without any changes.
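A minimal PyTorch sketch of the overlapping patch split from Section 2.1 combined with the standard ViT patch embedding is given below; the class name and hyper-parameter values are illustrative assumptions and are not taken from the authors' code.

```python
import torch
import torch.nn as nn

class OverlapPatchEmbedding(nn.Module):
    """Split an image into overlapping patches and linearly project them.

    A Conv2d with kernel_size = P and stride = S < P is equivalent to
    extracting P x P patches with a sliding window of step S and applying
    the same linear projection to every patch.
    """
    def __init__(self, img_size=448, patch_size=16, stride=12,
                 in_chans=3, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=stride)
        # patches per side; consistent with formula (1)
        n_side = (img_size - patch_size) // stride + 1
        self.num_patches = n_side * n_side

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.proj(x)                         # (B, embed_dim, N_H, N_W)
        return x.flatten(2).transpose(1, 2)      # (B, N, embed_dim)

# quick shape check: (448 - 16) // 12 + 1 = 37, so N = 37 * 37 = 1369
emb = OverlapPatchEmbedding()
print(emb(torch.randn(1, 3, 448, 448)).shape)    # torch.Size([1, 1369, 768])
```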
2.3 Part Selection Module (PSM)
Figure 2 Attention maps of TransFG and the selected tokens
First, suppose the model has K self-attention heads. The attention weights of each layer are given by formula (2), where a_l denotes the attention weights of the K heads in the l-th layer.
As shown in formula (3), the attention weights of all layers are matrix-multiplied together. The resulting a_final captures how image information propagates from the input to the deeper layers; compared with the raw attention weights of a single layer in the original ViT, it contains more information and is more helpful for selecting discriminative regions.
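Formulas (2) and (3) are not reproduced in this article; in the paper's notation (L Transformer layers, K heads), they can be reconstructed roughly as follows:

```latex
a_l = \left[ a_l^1, a_l^2, \dots, a_l^K \right],
\quad l \in \{0, 1, \dots, L-1\}
\tag{2}

a_{\mathrm{final}} = \prod_{l=0}^{L-1} a_l
\tag{3}
```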
For each of the K attention heads, the position of the maximum value A_1, A_2, ..., A_K in a_final is selected, and the corresponding tokens are concatenated with the classification token, as shown in formula (4). This step not only preserves global information but also makes the model pay more attention to the subtle differences between categories.
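A minimal PyTorch sketch of this part selection step is shown below; the function name, tensor shapes, and the choice to aggregate over all layers are assumptions made for illustration, not the authors' implementation.

```python
import torch

def select_part_tokens(attn_per_layer, hidden_states):
    """Aggregate attention across layers and pick one token per head.

    attn_per_layer: list of L tensors of shape (B, K, N+1, N+1),
                    the softmax attention weights of each Transformer layer
    hidden_states:  (B, N+1, D) tokens entering the last layer,
                    index 0 being the classification token
    returns:        (B, K+1, D) = [cls token; K selected part tokens]
    """
    # formula (3): matrix-multiply the attention maps of all layers
    a_final = attn_per_layer[0]
    for a_l in attn_per_layer[1:]:
        a_final = torch.matmul(a_l, a_final)

    # attention of the cls token towards each patch, per head: (B, K, N)
    cls_attn = a_final[:, :, 0, 1:]
    # positions A_1 ... A_K of the maximum value for each head
    top_idx = cls_attn.argmax(dim=-1) + 1            # +1 skips the cls token

    # gather the selected tokens and prepend the cls token (formula (4))
    batch_idx = torch.arange(hidden_states.size(0)).unsqueeze(-1)
    part_tokens = hidden_states[batch_idx, top_idx]  # (B, K, D)
    return torch.cat([hidden_states[:, :1], part_tokens], dim=1)
```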
2.4 Contrastive loss
As shown in formula (5), the goal of the contrastive loss is to minimize the similarity of classification tokens belonging to different categories and to maximize the similarity of classification tokens belonging to the same category. To prevent the loss from being dominated by easy negative samples, a margin α is used so that only negative pairs whose similarity exceeds α contribute to the loss.
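A minimal PyTorch sketch of a contrastive loss matching this description is given below; it is written from the text above rather than the paper's exact formula, and the function name and default margin value are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(cls_tokens, labels, alpha=0.4):
    """Pull same-class classification tokens together, push different-class ones apart.

    cls_tokens: (B, D) classification tokens of a batch
    labels:     (B,) ground-truth class indices
    alpha:      margin; negative pairs with similarity below alpha are ignored,
                so easy negatives do not dominate the loss
    """
    z = F.normalize(cls_tokens, dim=-1)
    sim = z @ z.t()                                   # (B, B) cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)

    pos_loss = (1.0 - sim)[same].sum()                # same-class pairs: raise similarity
    neg_loss = torch.clamp(sim - alpha, min=0.0)[~same].sum()
    return (pos_loss + neg_loss) / (labels.numel() ** 2)
```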
3 Experimental results
TransFG has been validated on five datasets: CUB-200-2011, Stanford Cars, Stanford Dogs, NABirds, and iNat2017, and achieved SOTA results on the CUB-200-2011, Stanford Dogs, and NABirds datasets.
4 Summary
- In the image serialization step, splitting the image into overlapping patches instead of non-overlapping ones improves accuracy by 0.2%.
- PSM integrates all attention weights, retains global information, makes the model pay more attention to subtle differences between categories, and improves model accuracy by 0.7%.
- The contrastive loss reduces the similarity between different categories, increases the similarity within the same category, and improves model accuracy by 0.4%-0.5%.
References
[1] He, Ju, et al. "TransFG: A Transformer Architecture for Fine-grained Recognition." arXiv preprint arXiv:2103.07976 (2021).