Author: Shunda
The Quark End Intelligence team has recently been working on real-time, on-device document detection: the input is an RGB image and the output is the coordinates of the key points at the four corners of the document. The whole pipeline is a key point detection problem, so we read recent papers in the field and ran experiments based on them.
A key point detection algorithm can be divided into the following modules, each of which has methods that can be optimized:
- Image processing: photometric augmentation, geometric transformation, resize, crop and other operations that increase the diversity of the images;
- Encoding: how the coordinates are converted into the labels required for training, which supervise the output of the model;
- Network model: the network structure, which can be composed of a backbone, an FPN, a detection head and other parts;
- Decoding: how the result of model inference is converted into the required coordinate form, such as coordinates in the Cartesian coordinate system.
Related Works
There are two main technical approaches to key point detection:
- As in face landmark detection, the tensor output by the model is passed through an fc layer to produce a one-dimensional vector, which is usually the normalized coordinates of the key points;
- As in human pose estimation, the coordinates of the maximum response in the heatmap output by the model are obtained through argmax or a similar method, and are then mapped back to the coordinates of the original image.
In recent years most key point detection schemes have been heatmap-based, mainly because heatmaps perform better than regression with fully connected layers. The scheme we adopted is therefore also heatmap-based. The following are some related papers from recent years.
DSNT
[1] Nibali A , He Z , Morgan S , et al. Numerical Coordinate Regression with Convolutional Neural Networks[J]. 2018.
Ideas
Currently, there are two ways to convert the heatmap output by the model into numerical coordinates:
- Take the argmax of the heatmap to obtain the maximum response point, which is then converted into numerical coordinates. This method has good spatial generalization, but it has three problems. First, because argmax is not differentiable, training usually makes the heatmap approximate an encoded Gaussian heatmap instead, so the loss function is inconsistent with the final evaluation metric. Second, in the inference phase only the point with the maximum response is used to compute the numerical coordinates, while in the training phase all points contribute to the loss. Third, converting a heatmap into numerical coordinates has a theoretical lower bound on the error;
- Connect an fc layer after the heatmap to convert it into numerical coordinates. This method allows the gradient to be back-propagated from the numerical coordinates to the input, but the result depends heavily on the data distribution (for example, if an object always appears on the left in the training set but appears on the right in the test set, the prediction will be wrong). In addition, the fc conversion loses the spatial information of the heatmap.
To address both schemes, the authors combine their advantages (end-to-end optimization and preserved spatial generalization) and propose a differentiable method for obtaining numerical coordinates.
Specific steps
- The model outputs a 1*K*H*W tensor of heatmaps, where K is the number of key points;
- Normalize the heatmap of each channel so that its values are all non-negative and sum to 1, which gives norm_heatmap. Normalization ensures that the predicted coordinates fall within the spatial range of the heatmap. norm_heatmap can also be understood as a two-dimensional discrete probability distribution;
- Generate the X and Y matrices, \( X_{(i,j)} = \frac{2j-(w+1)}{w} \), \( Y_{(i,j)} = \frac{2i- (h+1)}{h} \), which hold the x-axis and y-axis index of each position, with i and j counted from 1. This can be understood as scaling the image so that its top-left corner is at (-1, -1) and its bottom-right corner is at (1, 1);
- Multiply the X and Y matrices element-wise with norm_heatmap and sum the result to get the final numerical coordinates. The reason is that norm_heatmap represents a probability distribution and the X matrix represents the index, so their inner product is the expected value of the predicted x (and likewise for y). This expected value is used as the final predicted coordinate. The advantages are that a) it is differentiable; b) the theoretical lower bound of the error is small.
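The steps above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and softmax is assumed here as one possible way to normalize the heatmap.

```python
import math

def softmax2d(logits):
    """Normalize an H x W map so all values are non-negative and sum to 1.
    Softmax is an assumption; any normalization with these properties works."""
    peak = max(max(row) for row in logits)
    exps = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]

def dsnt(norm_heatmap):
    """Return the expected (x, y) in [-1, 1] for a normalized H x W heatmap."""
    h = len(norm_heatmap)
    w = len(norm_heatmap[0])
    x = y = 0.0
    for i in range(1, h + 1):        # 1-based row index, as in the formula
        for j in range(1, w + 1):    # 1-based column index
            p = norm_heatmap[i - 1][j - 1]
            x += p * (2 * j - (w + 1)) / w
            y += p * (2 * i - (h + 1)) / h
    return x, y

def to_pixels(x, y, w, h):
    """Invert the X/Y formulas to get 0-based heatmap pixel coordinates."""
    return (x * w + w + 1) / 2 - 1, (y * h + h + 1) / 2 - 1

# All probability mass on row 1, column 2 (0-based) of a 4 x 4 heatmap.
heatmap = [[0.0] * 4 for _ in range(4)]
heatmap[1][2] = 1.0
x, y = dsnt(heatmap)
print(x, y, to_pixels(x, y, 4, 4))
```

Because the output is a weighted average over all positions, a heatmap whose mass is split between two neighboring pixels yields a sub-pixel coordinate, which argmax cannot do.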
Loss function
The loss function of the DSNT module is composed of a Euclidean loss and a JS (Jensen-Shannon divergence) regularization term. The former is used to regress the coordinates, and the latter constrains the generated heatmap to be closer to a Gaussian distribution.
$$ L_{euc}(u,p) = ||p-u||_2 $$
$$ L_D(Z,p) = JS(p(c) \,\|\, N(p,I)) $$
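A minimal sketch of the two loss terms, assuming the target distribution is a discrete Gaussian rendered on the heatmap grid and that the two terms are simply added with a weight. `gaussian_target`, `reg_weight` and `sigma` are illustrative names and defaults, not values taken from the paper.

```python
import math

def euclidean_loss(pred, target):
    """L_euc: Euclidean distance between predicted and true coordinates."""
    return math.hypot(pred[0] - target[0], pred[1] - target[1])

def gaussian_target(h, w, center, sigma=1.0):
    """Discrete Gaussian on an h x w grid, normalized to sum to 1.
    center is (x, y) in heatmap pixels."""
    cx, cy = center
    g = [[math.exp(-((j - cx) ** 2 + (i - cy) ** 2) / (2 * sigma ** 2))
          for j in range(w)] for i in range(h)]
    total = sum(sum(row) for row in g)
    return [[v / total for v in row] for row in g]

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence between two normalized maps."""
    return sum(pv * math.log((pv + eps) / (qv + eps))
               for prow, qrow in zip(p, q) for pv, qv in zip(prow, qrow))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [[0.5 * (pv + qv) for pv, qv in zip(prow, qrow)]
         for prow, qrow in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def dsnt_loss(pred_xy, target_xy, norm_heatmap, target_px, sigma=1.0, reg_weight=1.0):
    """Coordinate loss plus JS regularization (the weighting is an assumption)."""
    h, w = len(norm_heatmap), len(norm_heatmap[0])
    target_map = gaussian_target(h, w, target_px, sigma)
    return (euclidean_loss(pred_xy, target_xy)
            + reg_weight * js_divergence(norm_heatmap, target_map))
```

The regularization term is zero when the predicted heatmap already matches the Gaussian target, so it only penalizes the shape of the heatmap, not the coordinate itself.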
Advantages
- The entire model is trained end-to-end, and the loss function corresponds directly to the evaluation metric;
- The theoretical lower bound of the error is very small;
- Introducing the X and Y matrices can be understood as introducing a prior that makes the model easier to train;
- It still performs well at low resolution.
Shortcomings
In our experiments, we found that the prediction is poor when a key point is located at the edge of the image.
DARK
[1] Zhang F , Zhu X , Dai H , et al. Distribution-Aware Coordinate Representation for Human Pose Estimation[J]. 2019.
Ideas
The authors found that the way heatmaps are decoded has a large impact on the final numerical coordinates. They therefore study the shortcomings of the standard coordinate decoding method, and propose a distribution-aware decoding method and a matching encoding method to improve the final accuracy of the model.
The standard decoding process takes the heatmap output by the model, finds the maximum response point m and the second largest response point s through argmax, and computes the final response point p:
$$ p=m+0.25\frac{s-m}{\left \| s-m \right \|_2} $$
This formula shifts the maximum response point by 0.25 pixels toward the second largest response point in order to compensate for the quantization error. The response point is then mapped back to the original image:
$$ \hat{p} = \lambda p $$
Here \( \lambda \) is the downsampling ratio between the original image and the heatmap. This also shows that the maximum response point in the heatmap does not exactly correspond to the key point in the original image; it only gives its approximate location.
To address these pain points, the authors propose a new decoding method based on the premise that the distribution is known (a Gaussian distribution), in order to recover the precise position from the heatmap and minimize the quantization error. A matching encoding method is proposed at the same time.
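The standard decoding rule can be sketched as follows. This is a literal reading of the formula above, using the global second-largest response; `standard_decode` and `stride` (playing the role of \( \lambda \)) are illustrative names, and real implementations often compare only the neighbors of the maximum instead.

```python
import math

def standard_decode(heatmap, stride):
    """argmax, a 0.25-pixel shift toward the second-largest response,
    then rescaling to original-image coordinates."""
    h, w = len(heatmap), len(heatmap[0])
    # Sort all cells by response, largest first; each entry is (value, x, y).
    cells = sorted(((heatmap[i][j], j, i) for i in range(h) for j in range(w)),
                   reverse=True)
    (_, mx, my), (_, sx, sy) = cells[0], cells[1]
    dx, dy = sx - mx, sy - my
    norm = math.hypot(dx, dy)          # > 0 because the two cells differ
    px = mx + 0.25 * dx / norm
    py = my + 0.25 * dy / norm
    return px * stride, py * stride

heatmap = [[0.0, 0.1, 0.0],
           [0.2, 1.0, 0.8],
           [0.0, 0.3, 0.0]]
print(standard_decode(heatmap, stride=4))
```

The shift is always exactly 0.25 pixels regardless of how strong the second response is, which is the empirical constant that DARK replaces with a derived offset.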
Specific steps
Decoding
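Assuming that the output heatmap follows a two-dimensional Gaussian distribution, the heatmap can be expressed by the following function:
$$ G(x;\mu,\Sigma) = \frac{1}{2\pi\left|\Sigma\right|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right) $$
Here \( x \) is a pixel location in the heatmap, \( \mu \) is the location of the key point mapped onto the heatmap, and \( \Sigma \) is the covariance. We need to solve for the position of \( \mu \), so the function G is converted into a log-likelihood function:
$$ P(x;\mu,\Sigma) = \ln G = -\ln(2\pi) - \frac{1}{2}\ln\left|\Sigma\right| - \frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu) $$
Taylor expansion of \( P(\mu) \) around m, up to the second-order term:
$$ P(\mu) = P(m) + D'(m)^T(\mu-m) + \frac{1}{2}(\mu-m)^T D''(m)(\mu-m) $$
Here m is the position of the maximum response in the heatmap, and \( D' \) and \( D'' \) are the first derivative (gradient) and second derivative (Hessian) of P. \( \mu \) corresponds to the extreme point of the heatmap, which has the following properties:
$$ D'(x)\big|_{x=\mu} = -\Sigma^{-1}(x-\mu)\big|_{x=\mu} = 0, \qquad D''(m) = D''(x)\big|_{x=\mu} = -\Sigma^{-1} $$
Combining the above formulas, we get:
$$ \mu = m - \left(D''(m)\right)^{-1} D'(m) $$
Therefore, the position of \( \mu \) in the heatmap can be obtained from the first and second derivatives of the heatmap at m. This step gives a mathematical justification for how far to move away from the maximum response point.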
As mentioned earlier, the output heatmap is assumed to follow a Gaussian distribution, but in practice this does not always hold: the heatmap may have multiple peaks. The heatmap therefore needs to be modulated so that it satisfies the assumption as closely as possible. Specifically, the heatmap is smoothed with a Gaussian kernel and then normalized so that its amplitude stays the same.
$$ {h}'=K\circledast h $$
$$ {h}'=\frac{{h}'-min({h}')}{max({h}')-min({h}')}*max(h) $$
In summary, the steps are:
- Modulate the heatmap with a Gaussian kernel and rescale it;
- Find the first derivative and the second derivative to get \( \mu \);
- Map \( \mu \) back to the original image.
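The decoding steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' code: derivatives are approximated with central finite differences on the log-heatmap, the border is zero-padded during smoothing, and the names `modulate` and `dark_refine` are hypothetical.

```python
import math

def gaussian_kernel(radius, sigma):
    """Normalized (2*radius+1) x (2*radius+1) Gaussian kernel."""
    k = [[math.exp(-(dx * dx + dy * dy) / (2 * sigma ** 2))
          for dx in range(-radius, radius + 1)]
         for dy in range(-radius, radius + 1)]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]

def modulate(heatmap, radius=1, sigma=1.0):
    """Smooth with a Gaussian kernel, then rescale to the original maximum."""
    h, w = len(heatmap), len(heatmap[0])
    kernel = gaussian_kernel(radius, sigma)
    smooth = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ii, jj = i + dy, j + dx
                    if 0 <= ii < h and 0 <= jj < w:   # zero padding at the border
                        smooth[i][j] += kernel[dy + radius][dx + radius] * heatmap[ii][jj]
    lo, hi = min(map(min, smooth)), max(map(max, smooth))
    if hi == lo:                                       # flat map: nothing to rescale
        return [row[:] for row in heatmap]
    peak = max(map(max, heatmap))
    return [[(v - lo) / (hi - lo) * peak for v in row] for row in smooth]

def dark_refine(heatmap, eps=1e-10):
    """Refine the argmax m with mu = m - H^-1 g on the log-heatmap."""
    h, w = len(heatmap), len(heatmap[0])
    _, my, mx = max((heatmap[i][j], i, j) for i in range(h) for j in range(w))
    if not (1 <= mx < w - 1 and 1 <= my < h - 1):
        return float(mx), float(my)                    # no room for differences
    def P(x, y):
        return math.log(max(heatmap[y][x], eps))
    gx = 0.5 * (P(mx + 1, my) - P(mx - 1, my))
    gy = 0.5 * (P(mx, my + 1) - P(mx, my - 1))
    hxx = P(mx + 1, my) - 2 * P(mx, my) + P(mx - 1, my)
    hyy = P(mx, my + 1) - 2 * P(mx, my) + P(mx, my - 1)
    hxy = 0.25 * (P(mx + 1, my + 1) - P(mx - 1, my + 1)
                  - P(mx + 1, my - 1) + P(mx - 1, my - 1))
    det = hxx * hyy - hxy * hxy
    if abs(det) < 1e-12:
        return float(mx), float(my)                    # Hessian not invertible
    return (mx - (hyy * gx - hxy * gy) / det,
            my - (hxx * gy - hxy * gx) / det)
```

A full decode would call `dark_refine(modulate(heatmap))` and multiply the result by the downsampling ratio. On an exactly Gaussian heatmap the log is quadratic, so the finite differences are exact and the sub-pixel center is recovered.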
Encoding
Encoding refers to mapping the key points onto the heatmap and generating a Gaussian distribution around them.
Previous work downsamples the coordinates, quantizes them (floor, ceil or round), and finally uses the quantized coordinates as the center of the Gaussian function.
Because quantization discards the sub-pixel position and introduces quantization error, the authors propose to generate the Gaussian function directly from the floating-point coordinates without quantization, so that an unbiased heatmap is generated.
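The difference between the two encodings can be illustrated with a short sketch. `encode_gaussian` and `centroid` are hypothetical helper names, and `sigma` and the grid size are arbitrary choices for the example.

```python
import math

def encode_gaussian(keypoint, stride, h, w, sigma=1.0, quantize=None):
    """Render a Gaussian heatmap for one key point given in image pixels.
    quantize=None keeps the floating-point center (unbiased encoding);
    quantize=math.floor, math.ceil or round gives the standard encoding."""
    cx, cy = keypoint[0] / stride, keypoint[1] / stride
    if quantize is not None:
        cx, cy = quantize(cx), quantize(cy)
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(w)] for y in range(h)]

def centroid(heatmap):
    """Center of mass of the heatmap, used here to show where it is centered."""
    total = sum(map(sum, heatmap))
    x = sum(v * j for row in heatmap for j, v in enumerate(row)) / total
    y = sum(v * i for i, row in enumerate(heatmap) for v in row) / total
    return x, y

# A key point at (18, 18) with stride 4 lies at (4.5, 4.5) on the heatmap.
unbiased = encode_gaussian((18.0, 18.0), 4, 10, 10)
quantized = encode_gaussian((18.0, 18.0), 4, 10, 10, quantize=math.floor)
print(centroid(unbiased), centroid(quantized))
```

With floor quantization the heatmap is centered half a heatmap pixel away from the true position, which corresponds to two pixels in the original image at stride 4.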
UDP
[1] Huang J , Zhu Z , Guo F , et al. The Devil is in the Details: Delving into Unbiased Data Processing for Human Pose Estimation[J]. 2019.
Ideas
The authors start from data processing and coordinate representation to improve performance. They found that current data processing methods are biased: in particular, a flipped result is not aligned with the original data. In addition, the coordinate representation also carries a systematic error. Together, these two issues lead to biased results. An unbiased data processing method is therefore proposed to remove the errors caused by image transformation and coordinate transformation.
Specific steps
Unbiased Coordinate System Transformation
At test time, the flipped result \( {k}'_{o,flip} \) is usually combined with the original result \( {k}'_o \) to get the final prediction. However, \( {k}'_o \) and \( {\hat{k}}'_o \) are not consistent: there is a deviation between them. In other words, the flipped heatmap is not aligned with the original heatmap, which causes an error whose size depends on the resolution.
Therefore, the authors suggest measuring the image size in unit lengths (the distance between adjacent pixels) instead of in number of pixels: \( w=w^p-1 \), where \( w^p \) is the width in pixels. In this way, the flipped heatmap is aligned.
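A small numeric sketch can show where the misalignment comes from. It assumes 0-based pixel coordinates and a heatmap that is a downsampled version of the input; the function names are illustrative, not taken from the paper.

```python
def to_heatmap_biased(x, img_w, hm_w):
    """Scale by the ratio of pixel counts."""
    return x * hm_w / img_w

def to_heatmap_unbiased(x, img_w, hm_w):
    """Scale by the ratio of unit lengths (pixel count minus one)."""
    return x * (hm_w - 1) / (img_w - 1)

def flip(x, width):
    """Mirror a 0-based coordinate inside an array that is `width` pixels wide."""
    return (width - 1) - x

def flip_misalignment(x, img_w, hm_w, to_heatmap):
    """Heatmap position obtained through flip, transform, flip back,
    minus the position obtained by transforming directly."""
    direct = to_heatmap(x, img_w, hm_w)
    via_flip = flip(to_heatmap(flip(x, img_w), img_w, hm_w), hm_w)
    return via_flip - direct

# 256-pixel-wide image, 64-pixel-wide heatmap (stride 4), key point at x = 100.
print(flip_misalignment(100, 256, 64, to_heatmap_biased))
print(flip_misalignment(100, 256, 64, to_heatmap_unbiased))
```

With pixel counts, the two paths differ by a constant offset that depends on the stride, so averaging the original and flipped heatmaps blurs the peak. With unit lengths the two paths agree.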
Unbiased Keypoint Format Transformation
An unbiased key point format transformation should be reversible, namely \( k=Decoding(Encoding(k)) \). The authors therefore propose two formats:
- Combined classification and regression format
Borrowing the anchor idea from object detection, a key point to be predicted, \( k=(m,n) \), is encoded into three maps C, X and Y. C represents the region in which the key point is located, and X and Y represent the offsets to be predicted. To decode, take the argmax on the heatmap C, read the offsets at that position from the X and Y maps, and add them to the position to get the numerical coordinates.
- Classification format
This is consistent with the DARK method: a Taylor expansion is used to approximate the true position.
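The combined classification and regression format can be sketched as a round trip. The exact encoding is an assumption here: C is taken to be 1 inside a disc of hypothetical radius `radius` around the key point, and X and Y hold unnormalized offsets from each cell to the key point.

```python
def encode(k, h, w, radius=2.0):
    """Encode key point k = (m, n) into a region map C and offset maps X, Y."""
    m, n = k
    C = [[1.0 if (x - m) ** 2 + (y - n) ** 2 < radius ** 2 else 0.0
          for x in range(w)] for y in range(h)]
    X = [[m - x for x in range(w)] for _ in range(h)]   # horizontal offsets
    Y = [[n - y for _ in range(w)] for y in range(h)]   # vertical offsets
    return C, X, Y

def decode(C, X, Y):
    """argmax on C, then add the offsets stored at that position."""
    h, w = len(C), len(C[0])
    _, y, x = max((C[i][j], i, j) for i in range(h) for j in range(w))
    return x + X[y][x], y + Y[y][x]

C, X, Y = encode((3.3, 2.7), h=8, w=8)
print(decode(C, X, Y))
```

Any cell inside the region decodes to the same point, because each cell stores its own offset to the key point. This is what makes the format reversible even though the argmax itself is quantized.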
AID
[1] Huang J , Zhu Z , Huang G , et al. AID: Pushing the Performance Boundary of Human Pose Estimation with Information Dropping Augmentation[J]. 2020.
Contributions
For key point detection, appearance information and constraint information are equally important. Previous work usually overfits the appearance information and ignores the constraint information. This paper therefore uses information dropping, which can be understood as masking part of the image, to emphasize the constraint information. Constraint information helps predict the accurate position of a key point when it is occluded.
Information dropping was not used in previous work because the metrics dropped after applying this data augmentation. Through experiments, the authors found that information dropping does help improve the accuracy of the model, but the corresponding training strategy needs to be modified:
- Double the length of the training schedule;
- First train without the mask, and after a good model is obtained, continue training with the mask added.
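A rough sketch of information dropping and of the second training strategy. The square, zero-filled mask is a Cutout-style stand-in chosen for illustration, and `use_dropping` and `base_epochs` are hypothetical names; the paper's exact masks and schedule are not reproduced here.

```python
import random

def drop_information(image, size, rng):
    """Return a copy of a 2-D image with one random size x size square zeroed."""
    h, w = len(image), len(image[0])
    top = rng.randrange(0, h - size + 1)
    left = rng.randrange(0, w - size + 1)
    out = [row[:] for row in image]          # leave the input untouched
    for i in range(top, top + size):
        for j in range(left, left + size):
            out[i][j] = 0
    return out

def use_dropping(epoch, base_epochs):
    """Two-stage schedule: plain training first, dropping in later epochs."""
    return epoch >= base_epochs

rng = random.Random(0)
image = [[1] * 8 for _ in range(8)]
for epoch in range(4):
    sample = drop_information(image, 3, rng) if use_dropping(epoch, 2) else image
    print(epoch, sum(map(sum, sample)))
```

In a real pipeline the key point labels stay unchanged when part of the image is masked, which is what pushes the model to rely on the surrounding structure.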
RSN
[1] Cai Y , Wang Z , Luo Z , et al. Learning Delicate Local Representations for Multi-Person Pose Estimation[J]. 2020.
Contributions
This paper describes the winning solution of the COCO 2019 key point detection challenge. Its main idea is to aggregate features of the same spatial size as fully as possible in order to obtain rich local information, which helps produce more accurate locations. The RSN network is proposed for this purpose, as shown in Figure 1 below. As the figure shows, it fuses features from different receptive fields. The output of RSN contains both low-level, spatially accurate information and high-level semantic information. Spatial information helps localization, and semantic information helps classification. However, these two kinds of information do not contribute equally to the final prediction, so they need to be balanced by the PRM module. The PRM module is essentially a channel attention and spatial attention module.
Lite-HRNet
[1] Yu C , Xiao B , Gao C , et al. Lite-HRNet: A Lightweight High-Resolution Network[J]. 2021.
Contributions
This paper presents an efficient high-resolution network, a lightweight version of HRNet obtained by introducing the shuffle block of ShuffleNet into HRNet. The authors also found that the pointwise convolution (1*1 convolution) used heavily in ShuffleNet is the computational bottleneck, so conditional channel weighting is introduced to replace the 1*1 convolution in the shuffle block. The overall structure of the network is shown in the figure below. High-resolution features are retained throughout the model and are continuously fused with high-level features.
The conditional channel weighting mentioned above is as follows. On the left is the shuffle block of ShuffleNet, and on the right is the conditional channel weighting block. The new module replaces the 1*1 convolution to exchange information across resolutions and locally. It consists of cross-resolution weight computation and spatial weight computation. Both are essentially attention mechanisms.
Experimental optimization results
Model structure
This model draws on related work in CenterNet/RetinaFace/DBFace. The DSNT scheme is used this time. The main reason is that the model needs to run on-device, so real-time performance is the primary consideration, and DSNT has a clear advantage at low resolution.
The backbone is the small version of MobileNet v3, and the FPN uses nearest-neighbor upsampling + conv + BN + ReLU for upsampling. The keypoints, mask and center branches are all used in training; only the keypoints branch is used at inference.
Optimization Strategy
The following optimization strategies were used in this experiment:
- Use the mask and center branches to assist learning, where mask is the mask of the document and center is the center point of the document;
- Use deep supervision: train on both the 4x and the 8x downsampled feature maps, and supervise these two layers with the same loss function;
- DSNT performs poorly on points at the edge of the image, so the image is padded so that the points are no longer at the edge;
- Data augmentation: in addition to conventional photometric perturbation, random crop, random erase and random flip are applied to the images;
- Try different loss functions. For the key point branch we tried Euclidean loss, L1 loss, L2 loss and smooth L1 loss. Smooth L1 loss gave the best results.
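For reference, the smooth L1 loss that performed best on the key point branch can be written as a short sketch. The transition point `beta` and the averaging over coordinates are common conventions assumed here, not settings reported in this article.

```python
def smooth_l1(pred, target, beta=1.0):
    """Quadratic for small errors, linear for large ones, averaged over values."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        if d < beta:
            total += 0.5 * d * d / beta     # L2-like region near zero
        else:
            total += d - 0.5 * beta         # L1-like region for large errors
    return total / len(pred)

# Flattened (x, y) coordinates of two predicted corners against their targets.
print(smooth_l1([0.5, 2.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]))
```

The loss behaves like L2 near the target, which keeps gradients small for nearly correct points, and like L1 for large errors, which limits the influence of outliers.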
Evaluation index
- MSE
Used to evaluate the mean square error of the validation set during training.
$$ mse = \frac{1}{N}\sum_{i} \left\| d_i - \hat{d}_i \right\|_2^2 $$
- oks-mAP
oks is used to evaluate the similarity between the predicted and the ground-truth key points. mAP is computed in a similar way to COCO, but instead of the COCO thresholds [0.5:0.05:0.95] the thresholds here are [0.99:0.001:0.999]. The oks is also modified slightly: \( d_{p,i} \) is the Euclidean distance between the predicted and ground-truth point, and \( S_p \) is the area of the quadrilateral.
$$ oks_{p,i} =e^{-\frac{d_{p,i}^2}{2S_p}} $$
- Latency
Latency refers to the average time taken to run the model with the MNN inference framework on a Redmi 8.
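The MSE and oks-mAP metrics above can be sketched as follows. Several details are assumptions made for the example: the per-point oks values are averaged over the four corners, the quadrilateral area is computed with the shoelace formula, and mAP is simplified to the fraction of samples whose oks reaches each threshold, averaged over the thresholds.

```python
import math

def mse(pred, gt):
    """Mean squared Euclidean distance over N key points."""
    return sum((px - gx) ** 2 + (py - gy) ** 2
               for (px, py), (gx, gy) in zip(pred, gt)) / len(gt)

def quad_area(corners):
    """Shoelace formula; corners must be listed in order around the shape."""
    s = 0.0
    for (x1, y1), (x2, y2) in zip(corners, corners[1:] + corners[:1]):
        s += x1 * y2 - x2 * y1
    return abs(s) / 2

def oks(pred, gt):
    """Average of exp(-d^2 / (2 * S)) over the key points of one document."""
    area = quad_area(gt)
    return sum(math.exp(-((px - gx) ** 2 + (py - gy) ** 2) / (2 * area))
               for (px, py), (gx, gy) in zip(pred, gt)) / len(gt)

def oks_map(oks_values, thresholds=None):
    """Simplified mAP: accuracy at each oks threshold, averaged over thresholds."""
    if thresholds is None:
        thresholds = [round(0.990 + 0.001 * i, 3) for i in range(10)]
    per_threshold = [sum(v >= t for v in oks_values) / len(oks_values)
                     for t in thresholds]
    return sum(per_threshold) / len(per_threshold)

gt = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
pred = [(x + 1.0, y) for x, y in gt]          # every corner off by one pixel
print(mse(pred, gt), oks(pred, gt), oks_map([oks(gt, gt), oks(pred, gt)]))
```

Because the thresholds start at 0.99, even a one-pixel error on a small document lowers the score noticeably, which is why this metric is much stricter than the COCO setting.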
Experimental results
First, a baseline was built. The baseline model is MobileNet v3 + FPN + SSH module + keypoints branch + DSNT. None of the above optimization strategies are used, and the 4x downsampled feature map is used as the output.
Different loss functions were then compared on top of the v2 version.
In addition, some other tricks were tried but did not help:
- Auxiliary tasks generally help improve the metrics of the model, so an edge branch was also added to assist learning. In the experiment, adding this branch hurt the metrics. A possible reason is that the edges are generated from the ground-truth key points, so some of them may not be the true edges of the corresponding document;
- The current task is to predict the 4 corners of the document, so 4 extra points, the midpoints of the 4 sides, were added, and the model predicted 8 key points in total. The experimental results show that the metrics also dropped.
Demo
Please click the end of this article to view the
Summary
To sum up, for on-device document key point detection, our experiments so far show that the heatmap + DSNT solution works better, and the oks-mAP metric still has room for improvement. However, compared with regressing coordinates through an fc layer, the heatmap-based solution has one deficiency: it cannot use the constraint information of the document to predict key points that lie outside the image. This leads to missing document content and poor rectification results, so follow-up work needs to address this shortcoming.