Background

For CV students, image recognition is the simplest and most fundamental introductory model. Even after years of development, many different CV tasks still keep the practice of using weights trained on an image recognition task as a backbone to accelerate training convergence. But when facing a concrete image recognition task, how to adapt an already fairly mature recognition pipeline to business needs is still quite an interesting topic.

Business needs

The task is to filter out, from the sneaker pictures uploaded by users, those with a clean and uniform background. We call this project the background complexity detection task. It reduces the difficulty of the downstream valuation and sales algorithms, and keeps the image quality of the whole app at a consistent level.

Project requirements

  1. If accuracy on the test set exceeds 80%, the model can be used to show users a prompt, without forcing them; above 90%, users can be required to re-upload pictures that meet the quality requirements.
  2. The model must be deployable on the device (mobile) side.

Model design

MobileNet backbone + FPN + modified SAM

Decomposed into its modules, the final model is actually very simple, and nothing in it is particularly hard to understand.

The overall effort boils down to the following points:

  1. Analyzing the business, this is a typical spatially-localized recognition task: the goal is achieved from the content of one or several regions of the image. For background complexity, we want to mask out the subject and judge whether the remaining region is "complex", so the spatial family of attention mechanisms came to mind.
  2. Users cannot be expected to strictly control what proportion of the frame the subject occupies. Some users habitually fill the frame with the subject, while others prefer to leave more white space. To classify finely across such different scales, we need to work on higher-resolution feature maps, so an FPN is used here.
  3. To run on the device side, choosing the MobileNet family is the natural idea.

All of the design choices above are made entirely around the current business scenario.

Finally, the designed CNN model achieved 96% accuracy on the test set, which is good enough to serve as a basis for requiring users to upload high-quality pictures.

If your goal is just to understand the model used in this project, the description above is already enough. The design of each module is not difficult, there are no new concepts, the code is concise, and the implementation is convenient.

But the most interesting part of the project is the path to the final solution. Why was this model chosen in the end? How were the other ideas eliminated? What did the trial and error look like?

Project process

Traditional CV

The project needs to run on the device side. Although there is no explicit real-time requirement on the phone, the algorithm file and the referenced libraries must occupy little enough memory. So the first question is whether the problem can be solved without a deep network at all.

Wrong ideas

Analyzing background complexity, I first thought of the following:

  1. For the subject content, which is irrelevant to the business but seriously affects the result, filter it out with a fixed-size Gaussian kernel, then apply further processing.
  2. Use edge detection and gradient information to analyze the background complexity of a picture.
  3. Use Fourier analysis of the picture's high- and low-frequency content: lots of high frequency suggests more background information and higher complexity; mostly low frequency suggests a clean background.
  4. Build a template from a simple pixel-wise weighted average of many clean-background pictures, then use anomaly detection to filter out pictures with complex backgrounds.
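For reference, ideas 2 and 3 can be sketched as simple NumPy metrics. This is a hedged illustration: the frequency cutoff and the score definitions are my own choices, not the project's actual code.

```python
import numpy as np

def gradient_complexity(gray: np.ndarray) -> float:
    """Idea 2: mean gradient magnitude as a rough complexity score."""
    gy, gx = np.gradient(gray.astype(np.float64))
    return float(np.mean(np.hypot(gx, gy)))

def highfreq_ratio(gray: np.ndarray, cutoff: float = 0.1) -> float:
    """Idea 3: fraction of spectral energy outside a low-frequency disc."""
    f = np.fft.fftshift(np.fft.fft2(gray.astype(np.float64)))
    power = np.abs(f) ** 2
    h, w = gray.shape
    yy, xx = np.ogrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    low = power[r <= cutoff * min(h, w)].sum()
    return float(1.0 - low / power.sum())

# A flat background scores low on both metrics; a textured one scores high.
flat = np.full((64, 64), 128.0)
noisy = np.random.default_rng(0).normal(128, 40, (64, 64))
```

Note that a textured but "clean" surface such as a carpet would also score high on both metrics, which is precisely why these ideas fail below.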

The above are the simple ideas I could think of at the beginning of the project. But after examining the picture samples, only part of the first idea (using spatial information to remove the subject) turned out to be correct; the other three are all wrong. For example, consider a pair of shoes placed on a carpet: the carpet itself is clean, and apart from the subject and the carpet the picture contains no other objects. But the carpet has its own texture, so it carries a lot of high-frequency and gradient information. By the principles of ideas 2, 3, and 4, such a background would be scored as highly complex, yet in the samples such a background is considered clean.

Through analyzing the data samples and reflecting on the business logic, the following conclusions are drawn:

  1. The concept of background complexity is not about the background differences between different pictures.
  2. The concept of background complexity is about whether a single picture contains several different backgrounds, or foreground objects other than the subject.

So the overall thinking needs to change: judge background complexity by the self-similarity of the background within a single picture. This was also a strategic adjustment made after becoming more familiar with the business.

Template matching

So what parts need to be used to judge the self-similarity?

  1. Four corners

As mentioned above, one part of the wrong ideas was correct: thinking about the business along the spatial dimension and removing the subject information. That is, when judging the self-similarity of the background, we must avoid the subject. Following this idea, and based on experience, the four corners of a picture can represent its background information to a certain extent. If the content of the four corners is similar, most of the business goal is met. Computing pairwise self-similarity among the four corners gives 6 values; the higher these 6 values, the more similar the corners, and the more likely the background is clean.

  2. Two corners

Observing actual business samples shows that users tend to leave more blank space at the top of the photo than at the bottom; in other words, the two bottom corners often contain subject information. So the final solution uses only the two top corners for similarity matching.

  3. Not over yet

However, matching only the two top corners against each other yields a single value, which is too unstable and too risky for the business. So I also used each of the two top corners as a template, ran sliding-window matching over the whole picture, and counted the positions whose match score exceeds a threshold as additional similarity indicators. In the end there are 3 indicators measuring the self-similarity of a picture's background: the similarity of the two top corners to each other, plus, for each corner used alone as a template, the count of matching locations elsewhere in the picture. These are combined with preset weights into a final score, which decides whether the picture has a complex background.

Finally, the input image is resized to a fixed size so that the template matching speed stays within a reasonable range. With this approach, the test set reaches 80% accuracy. The algorithm can be used on the product side to guide users: when it judges the background to be complex, it shows a reminder asking the user to retake the photo.
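The three-indicator scheme above can be sketched roughly as follows. This is a minimal pure-NumPy illustration (in practice something like OpenCV's `cv2.matchTemplate` would be much faster); the patch size, stride, threshold, and weights are hypothetical placeholders, not the project's tuned values.

```python
import numpy as np

def ncc(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized cross-correlation between two equal-sized patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    # Two constant patches are trivially identical.
    return float((a * b).sum() / denom) if denom > 0 else 1.0

def complexity_score(gray, patch=16, stride=8, thresh=0.8,
                     weights=(0.5, 0.25, 0.25)):
    """3 indicators: top-corner similarity + two per-corner match counts.
    Higher score = more self-similar, i.e. cleaner background."""
    tl = gray[:patch, :patch]        # top-left corner template
    tr = gray[:patch, -patch:]       # top-right corner template
    corner_sim = ncc(tl, tr)         # indicator 1
    h, w = gray.shape
    counts = []
    for tmpl in (tl, tr):            # indicators 2 and 3: sliding-window hits
        n = 0
        for y in range(0, h - patch + 1, stride):
            for x in range(0, w - patch + 1, stride):
                if ncc(tmpl, gray[y:y + patch, x:x + patch]) > thresh:
                    n += 1
        counts.append(n)
    total = ((h - patch) // stride + 1) * ((w - patch) // stride + 1)
    return (weights[0] * corner_sim
            + weights[1] * counts[0] / total
            + weights[2] * counts[1] / total)
```

A perfectly uniform image scores 1.0 (every window matches both corner templates), while a cluttered image scores much lower.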

CNN

It doesn't matter if the traditional CV part above wasn't clear. It only describes how analyzing the business produced some valuable ideas for the project. The method below is the focus of this article.

Baseline

To iterate and optimize quickly, the project starts from the most familiar formulation: a very common image recognition task. Since the model must run on a phone, MobileNet v1, which I know best, is the natural baseline.
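As a baseline sketch, a MobileNet-v1-style classifier can be assembled from depthwise-separable blocks. This toy version is my own illustration of the building block, not the production model; layer counts and widths are assumptions.

```python
import torch
import torch.nn as nn

def dw_sep(c_in, c_out, stride=1):
    """MobileNet v1 building block: depthwise 3x3 + pointwise 1x1."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, stride, 1, groups=c_in, bias=False),
        nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
        nn.Conv2d(c_in, c_out, 1, bias=False),
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

class TinyMobileNet(nn.Module):
    """Toy MobileNet-v1-style binary classifier: clean vs. complex background."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, 2, 1, bias=False),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            dw_sep(32, 64), dw_sep(64, 128, 2), dw_sep(128, 128),
        )
        self.head = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.mean(dim=(2, 3))   # global average pooling
        return self.head(x)
```

The depthwise-separable factorization is what keeps the parameter count and memory footprint small enough for on-device use.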

Optimization model

Target Detection

At the beginning of the article, I mentioned that this is a standard spatially-localized recognition problem. For such recognition problems, my usual approach is to use a detection model to solve the recognition task.

This way of thinking is very common. For example, in a community feed we may want to identify, from a photo post, what outfit the person in the picture is wearing, what brand of bag they carry, and what brand of shoes they have on. Setting aside the actual algorithmic difficulty, end to end this is a recognition problem: input a picture, output several labels. But in practice it is hard to apply a recognition model directly, because there is a lot of redundant background information; we still need detection models for such problems. In the end, we simply don't output the bounding boxes, and we may need some deduplication strategy.

Back to this business, one difference is that the region we ultimately care about is precisely the region left after removing the subject. (The traditional-CV idea of using a fixed-size Gaussian kernel tried to do something similar.) Precisely because of this, we cannot, as in the community example above, use a detection model plus some strategies to get the output directly.

First, a target detection model is needed to find the subject. Then a mask-filtering step removes it, after which either the traditional CV algorithm or the CNN baseline can produce the final answer.

Using target detection can clearly solve this business problem, and it is also the most intuitive choice.

Hidden target detection

Although the method of target detection is intuitive, there are two huge flaws:

  1. It cannot be end-to-end. At least two steps or two models are required, which is a burden on both speed and memory.
  2. Detection labels cost far more to produce than recognition labels.

So, can we optimize the above two defects? The answer is yes.

Like the detection step, we can let an intermediate layer of the model predict a region as a bounding box. The output is 4-dimensional: typically (top-left, bottom-right), (top-left, width, height), or (center coordinates, width, height). Then the original image and these 4 values are combined into a mask that filters out the subject, and the remaining content goes into the next recognition step.

Doing so avoids both defects of using a detection task directly. It merges the previous two-stage scheme into a single model, and the intermediate "box" is learned by the model itself rather than manually annotated, which saves a lot of preparation time before training.
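One way to make such a hidden box learnable without box labels is to build the mask from sigmoid ramps over the predicted (cx, cy, w, h), so the whole thing stays differentiable. This is a sketch of the general technique under my own assumptions (normalized coordinates, a hypothetical `sharpness` constant); the article does not specify its exact masking implementation.

```python
import torch

def soft_box_mask(box, height, width, sharpness=50.0):
    """box: (N, 4) tensor of (cx, cy, w, h), all in [0, 1].
    Returns (N, 1, H, W) soft masks, ~1 inside the box and ~0 outside.
    Built from sigmoid ramps so gradients flow back to the box parameters."""
    cx, cy, w, h = box.unbind(dim=1)
    xs = torch.linspace(0, 1, width).view(1, 1, width)
    ys = torch.linspace(0, 1, height).view(1, height, 1)
    left  = torch.sigmoid(sharpness * (xs - (cx - w / 2).view(-1, 1, 1)))
    right = torch.sigmoid(sharpness * ((cx + w / 2).view(-1, 1, 1) - xs))
    top   = torch.sigmoid(sharpness * (ys - (cy - h / 2).view(-1, 1, 1)))
    bot   = torch.sigmoid(sharpness * ((cy + h / 2).view(-1, 1, 1) - ys))
    return (left * right * top * bot).unsqueeze(1)

# To keep only the background, multiply the image or feature map
# by (1 - subject_mask).
```

Because the mask is smooth, the box head can be trained end to end from the final classification loss alone.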

In fact, the well-known Pailitao (Alibaba's visual search) algorithm exploits exactly this idea of folding detection into recognition. General image search is likewise split into target detection + recognition, but Pailitao uses the approach above to remove the explicit detection pre-task.

Implicit segmentation

Since target detection can be folded in this way, semantic segmentation or instance segmentation can be used for this purpose as well. After all, hidden target detection still has its limitations:

  1. Because of the rectangular shape of the bounding box itself, it still includes some unnecessary information or cuts away some necessary information.
  2. For multiple objects, this method is very inflexible. (Of course, this is not unsolvable: as in target detection, one could design some convolutions on the feature map to handle multiple targets, but the parameter count would also grow.)

Similarly, borrowing the typical segmentation idea, at the mask-generation stage we abandon the regular rectangle of the bounding box and directly predict the outer contour of the subject. This boundary is likewise implicit and needs no manual labels; the input image is then mask-filtered as before.
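A minimal sketch of such an implicit, free-form mask: a small head predicts a single-channel soft mask from the features themselves and suppresses the predicted subject region. The module name and layer sizes here are illustrative assumptions, not the article's code.

```python
import torch
import torch.nn as nn

class ImplicitMask(nn.Module):
    """Predicts a free-form soft mask from the feature map itself,
    with no segmentation labels. Unlike a box mask, it can follow
    an arbitrary contour around the subject."""
    def __init__(self, channels):
        super().__init__()
        self.mask_head = nn.Sequential(
            nn.Conv2d(channels, channels // 2, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 2, 1, 1),
        )

    def forward(self, feat):
        m = torch.sigmoid(self.mask_head(feat))   # (N, 1, H, W) in (0, 1)
        return feat * (1.0 - m)                   # suppress the predicted subject
```

Since the mask is never supervised directly, it is shaped only by the final classification loss, exactly the "implicit" property the text describes.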

Spatial attention mechanism

Implicit segmentation is certainly reasonable, but the parameter count and the amount of computation both increase. In fact, whether detection or segmentation, both are in a broad sense strongly supervised forms of spatial attention: attention regions delineated by humans. But the problem remains that our ultimate goal is neither to learn bounding box coordinates nor to learn an accurate contour mask; these intermediate results are not critical to the final goal. In other words, it is enough for the model to learn its own set of attended regions, and they do not need to be strongly interpretable. So for this scene we do not need the top, highest-resolution layer of a segmentation model; achieving the effect at an intermediate layer is enough, and there is no need to waste more parameters.

For a "U"-shaped image segmentation model, we therefore only need to implement the "J"-shaped half. The "J"-shaped structure is a standard backbone + FPN. The FPN retains a certain amount of high-resolution information while merging in semantic information from the deeper layers, making it a perfect "fusion". Applying a mask on this layer is equivalent to doing semantic segmentation on this layer, and this is in fact a spatial attention mechanism.
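A minimal top-down FPN over three backbone stages might look like this. The channel widths are illustrative; the article's actual MobileNet + FPN configuration is not given.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniFPN(nn.Module):
    """Minimal top-down FPN: project each backbone stage to a common
    width, then add in upsampled deeper (more semantic) features."""
    def __init__(self, in_channels=(64, 128, 256), out_channels=64):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in in_channels)

    def forward(self, feats):   # feats ordered shallow -> deep
        lats = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down pathway: upsample deeper maps and add them in.
        for i in range(len(lats) - 2, -1, -1):
            lats[i] = lats[i] + F.interpolate(
                lats[i + 1], size=lats[i].shape[-2:], mode="nearest")
        return [s(l) for s, l in zip(self.smooth, lats)]
```

The shallowest output level keeps the resolution needed for fine classification across subject scales, which is the reason the article gives for using an FPN.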

Space + channel attention mechanism

Now that there is spatial attention, while we are at it we can also add attention along the channel dimension. The familiar SE module works on the channel dimension; BAM/CBAM combine the spatial and channel dimensions, but as two separate modules, whether arranged serially or in parallel. This project uses the modified SAM module from YOLOv4, which produces a joint space + channel weight in a single step. Simple and easy to understand. That said, the channel dimension really is just along for the ride here; for this business, I think it is optional.
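A common reading of YOLOv4's modified SAM is: drop CBAM's channel pooling and let a convolution produce a full C x H x W gate, so space and channel are weighted jointly in one step. A minimal sketch follows; the kernel size is my assumption.

```python
import torch
import torch.nn as nn

class ModifiedSAM(nn.Module):
    """YOLOv4-style modified SAM: point-wise attention. Instead of
    pooling over channels (as in CBAM's spatial module), a conv
    produces a full C x H x W sigmoid gate, weighting space and
    channel jointly in one step."""
    def __init__(self, channels, kernel_size=1):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))
```

Because the gate lies in (0, 1), the module can only attenuate features, never amplify them, which keeps it a cheap drop-in after any backbone stage.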

Let's take a look at the Grad-CAM result of the intermediate layer.

Comparative Results

The improved model is compared with the same model without the aforementioned attention module and FPN. The original model reaches 93% accuracy on the validation set, while the improved model reaches 96%. More importantly, the improved model's interpretability increases significantly, and better interpretability opens up clear directions for further optimization.

Summary

The above is one way to innovate in image recognition. But the purpose of innovation is not innovation itself; the most important core of the whole project is solving the business problem.
