Abstract: In convolutional neural networks, different features are extracted by using filters. The weights of these filters are automatically learned during training, and then all these extracted features are "combined" to make a decision.
This article is shared from the Huawei Cloud community post "Neural Network Common Convolution Summary" by the original author fdafad.
The purpose of convolution is to extract useful features from the input. In image processing, you can choose a variety of filters. Each type of filter helps to extract different features from the input image, such as horizontal/vertical/diagonal edge features. In a convolutional neural network, different features are extracted by using filters. The weights of these filters are automatically learned during training, and then all these extracted features are "combined" to make a decision.
Table of contents:
- 2D convolution
- 3D convolution
- 1×1 convolution
- Spatially separable convolution
- Depthwise separable convolution
- Grouped convolution
- Dilated convolution
- Deconvolution
- Involution
2D convolution
Single channel: In deep learning, convolution is essentially an element-wise multiply-accumulate over the signal. For an image with a single channel, the following figure demonstrates how the convolution operates:
The filter here is a 3 x 3 matrix with elements [[0, 1, 2], [2, 2, 0], [0, 1, 2]]. The filter slides over the input data; at each position it performs an element-wise multiplication and addition, producing a single number, and the final output is a 3 x 3 matrix.
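A minimal NumPy sketch of this single-channel case; the 5 x 5 input values are placeholders, since the figure is not reproduced here. Note that deep-learning "convolution" does not flip the kernel, i.e. it is really cross-correlation.

```python
import numpy as np

def conv2d_single_channel(image, kernel):
    """Valid (no padding) 2D convolution: slide the kernel over the image
    and do an element-wise multiply-accumulate at every position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

kernel = np.array([[0, 1, 2],
                   [2, 2, 0],
                   [0, 1, 2]])
image = np.arange(25).reshape(5, 5)                # placeholder 5 x 5 input
print(conv2d_single_channel(image, kernel).shape)  # (3, 3)
```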
Multi-channel: Images usually have 3 RGB channels, so convolution is generally applied to multi-channel inputs. The following figure demonstrates the calculation for a multi-channel input:
Here the input layer is a 5 x 5 x 3 matrix with 3 channels, and the filter is a 3 x 3 x 3 matrix. First, each kernel in the filter is applied to one of the three channels of the input layer; these three convolutions produce 3 channels of size 3 x 3:
Then these three channels are added element by element to form a single 3 x 3 x 1 channel, which is the result of convolving the input layer (a 5 x 5 x 3 matrix) with the filter (a 3 x 3 x 3 matrix):
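The same multi-channel case, sketched with PyTorch; random values stand in for the figure's 5 x 5 x 3 input.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 5, 5)      # (batch, channels, height, width): a 5 x 5 x 3 input
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3, bias=False)  # one 3 x 3 x 3 filter
y = conv(x)                      # per-channel convolutions are summed internally
print(y.shape)                   # torch.Size([1, 1, 3, 3]): the 3 x 3 x 1 result
```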
3D convolution
From the previous illustration it can be seen that the operation is really three-dimensional, yet in deep learning it is conventionally still called 2D convolution: because the depth of the filter equals the depth of the input layer, the 3D filter only moves in 2 dimensions (the height and width of the image), and the result is a single channel.

Generalizing 2D convolution, 3D convolution is defined by filters whose depth is smaller than the depth of the input layer (that is, the kernel depth is smaller than the number of input channels), so the 3D filter must slide in three dimensions (the length, width and height of the input layer). At each sliding position the filter performs a convolution operation and produces one value; once the filter has slid across the entire 3D space, the output is also 3D.

The main difference between 2D and 3D convolution is therefore the spatial dimensionality in which the filter slides. The advantage of 3D convolution is that it describes the relationships between objects in 3D space, which is important in applications such as 3D object segmentation and medical image reconstruction.
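A minimal PyTorch sketch of 3D convolution; the volume size and channel counts are illustrative only.

```python
import torch
import torch.nn as nn

# A 3D filter slides along depth, height and width, so the output keeps a depth axis.
x = torch.randn(1, 1, 16, 64, 64)   # (batch, channels, depth, height, width), e.g. a volumetric scan
conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
y = conv3d(x)
print(y.shape)                      # torch.Size([1, 8, 16, 64, 64])
```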
1×1 convolution
A 1×1 convolution may look as if it simply multiplies each value in the feature maps by a number, but it does more than that. First, because it is followed by an activation layer, it is actually a non-linear mapping. Second, it can change the number of channels of the feature maps.
The figure above shows the operation on an input layer of dimension H x W x D: after a 1 x 1 convolution with a filter of size 1 x 1 x D, the output has dimension H x W x 1. If we perform this 1 x 1 convolution N times and stack the results, we obtain an output layer of dimension H x W x N.
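A short PyTorch sketch of this H x W x D to H x W x N mapping, with illustrative sizes (D = 64, N = 32, a 28 x 28 feature map) and an activation applied afterwards.

```python
import torch
import torch.nn as nn

D, N = 64, 32                                # illustrative channel counts
x = torch.randn(1, D, 28, 28)                # an H x W x D input (here 28 x 28 x 64)
pointwise = nn.Conv2d(D, N, kernel_size=1)   # N filters of size 1 x 1 x D
y = torch.relu(pointwise(x))                 # non-linearity after the 1 x 1 convolution
print(y.shape)                               # torch.Size([1, 32, 28, 28]): H x W x N
```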
Spatially separable convolution
In a separable convolution, the kernel operation can be split into multiple steps. Denote the convolution by y = conv(x, k), where y is the output image, x is the input image and k is the kernel. Now assume that k can be computed as k = k1.dot(k2). This makes it a separable convolution, because the same result can be obtained by performing two one-dimensional convolutions with k1 and k2 instead of one two-dimensional convolution with k.
Take the Sobel kernel, commonly used in image processing, as an example: the same 3 x 3 kernel is obtained as the outer product of the column vector [1, 2, 1].T and the row vector [1, 0, -1]. To perform the same operation, only 6 parameters are needed instead of 9.
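A small NumPy/SciPy check of this claim: the outer product of the two vectors reconstructs the Sobel kernel, and two 1D passes give the same result as one 2D pass (cross-correlation is used, as in deep-learning frameworks).

```python
import numpy as np
from scipy.signal import correlate2d        # cross-correlation, as used in deep learning

col = np.array([[1], [2], [1]])              # 3 x 1 column vector
row = np.array([[1, 0, -1]])                 # 1 x 3 row vector
sobel = col @ row                            # outer product reconstructs the 3 x 3 Sobel kernel

img = np.random.rand(7, 7)
full = correlate2d(img, sobel, mode='valid')
separable = correlate2d(correlate2d(img, col, mode='valid'), row, mode='valid')
print(np.allclose(full, separable))          # True: two 1D passes equal one 2D pass
```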
Depthwise separable convolution
The previous section covered spatially separable convolution. In deep learning, depthwise separable convolution instead performs a spatial convolution while keeping the channels independent, followed by a pointwise (1 x 1) convolution across channels.

Suppose we have a 3 x 3 convolutional layer with 16 input channels and 32 output channels. In the standard case, each of the 16 input channels is convolved with 32 different 3 x 3 kernels, giving 512 (16 x 32) intermediate maps; for each of the 32 output channels, the 16 maps coming from the input channels are added together, which yields the expected 32 output channels at a cost of 4608 (16 x 32 x 3 x 3) parameters.

How does depthwise separable convolution perform on the same example? We traverse the 16 channels with one 3 x 3 kernel each, giving 16 feature maps. Then, before any merging, we traverse these 16 maps with 32 different 1 x 1 convolutions and add the results channel by channel. This costs only 656 (16 x 3 x 3 + 16 x 32 x 1 x 1) parameters, as opposed to the 4608 (16 x 32 x 3 x 3) above; the sketch after the detailed walkthrough below verifies both counts.

Standard 2D convolution and 1 x 1 convolution were described in the previous sections; let's quickly go through standard 2D convolution again with a concrete case. Suppose the input layer is 7 x 7 x 3 (height x width x channels) and the filter is 3 x 3 x 3; after a 2D convolution with one filter, the output layer has size 5 x 5 x 1 (only 1 channel). As shown below:
Generally speaking, multiple filters are applied between two neural network layers. Now assume the number of filters is 128. These 128 2D convolutions produce 128 output maps of size 5 x 5 x 1, which are then stacked into a single layer of size 5 x 5 x 128: the spatial dimensions (height and width) shrink while the depth grows. As shown below:
Let's see how to achieve the same transformation with a depthwise separable convolution. First, we apply a depthwise convolution to the input layer: instead of a single 3 x 3 x 3 filter, we use 3 convolution kernels (each of size 3 x 3 x 1), and each kernel convolves only 1 channel of the input layer. Each such convolution yields a map of size 5 x 5 x 1; stacking these maps gives a 5 x 5 x 3 image, so the output keeps the original depth of 3.
Depthwise separable convolution, step 1: as described above, 3 kernels of size 3 x 3 x 1 are used instead of a single filter of size 3 x 3 x 3, and each kernel convolves only 1 channel of the input, producing a stacked 5 x 5 x 3 output. Step 2 expands the depth: we perform 1 x 1 convolutions with kernels of size 1 x 1 x 3. Each 1 x 1 x 3 kernel, convolved over the 5 x 5 x 3 intermediate image, produces a map of size 5 x 5 x 1.
In this case, after doing 128 1x1 convolutions, a layer of size 5 x 5 x 128 can be obtained.
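A hedged PyTorch sketch of the walkthrough above: a depthwise convolution (groups equal to the channel count) followed by a pointwise 1 x 1 convolution reproduces the 7 x 7 x 3 to 5 x 5 x 128 transformation, and the same parameter counting confirms the 4608-vs-656 arithmetic for the earlier 16-in/32-out example.

```python
import torch
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())

# The 7 x 7 x 3 -> 5 x 5 x 128 walkthrough above.
x = torch.randn(1, 3, 7, 7)                                        # 7 x 7 x 3 input layer
depthwise = nn.Conv2d(3, 3, kernel_size=3, groups=3, bias=False)   # three 3 x 3 x 1 kernels, one per channel
pointwise = nn.Conv2d(3, 128, kernel_size=1, bias=False)           # 128 kernels of size 1 x 1 x 3
print(depthwise(x).shape)             # torch.Size([1, 3, 5, 5]):   the 5 x 5 x 3 intermediate
print(pointwise(depthwise(x)).shape)  # torch.Size([1, 128, 5, 5]): the 5 x 5 x 128 output

# Parameter counts for the earlier 16-in / 32-out example.
standard = nn.Conv2d(16, 32, kernel_size=3, bias=False)
dw = nn.Conv2d(16, 16, kernel_size=3, groups=16, bias=False)
pw = nn.Conv2d(16, 32, kernel_size=1, bias=False)
print(count(standard), count(dw) + count(pw))   # 4608 656
```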
Grouped convolution
Grouped convolution first appeared in AlexNet. Because hardware resources were limited at the time, all convolution operations could not be processed on the same GPU while training AlexNet, so the authors split the feature maps across multiple GPUs, processed them separately, and finally merged the results from the GPUs.
The following describes how grouped convolution is implemented. First, traditional 2D convolution follows the steps shown in the figure below: by applying 128 filters (each of size 3 x 3 x 3), an input layer of size 7 x 7 x 3 is converted into an output layer of size 5 x 5 x 128. In the general case, by applying Dout convolution kernels (each of size h x w x Din), an input layer of size Hin x Win x Din is converted into an output layer of size Hout x Wout x Dout. In grouped convolution, the filters are split into different groups, and each group performs a traditional 2D convolution over a certain slice of the depth, as the figure below makes clear.
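A minimal PyTorch sketch of grouped convolution with illustrative sizes (Din = 64, Dout = 128, 2 groups) rather than the figure's exact numbers; the groups argument of nn.Conv2d splits the filters exactly as described, so each filter only sees Din / groups input channels.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)   # illustrative input: Din = 64 channels
grouped = nn.Conv2d(64, 128, kernel_size=3, padding=1, groups=2, bias=False)
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)

print(grouped(x).shape)                               # torch.Size([1, 128, 56, 56])
print(sum(p.numel() for p in grouped.parameters()))   # 36864: each filter sees only 64 / 2 = 32 channels
print(sum(p.numel() for p in standard.parameters()))  # 73728: each filter sees all 64 channels
```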
Dilated convolution
Dilated convolution introduces an additional parameter to the convolutional layer, the dilation rate, which defines the spacing between the values in the kernel. A 3x3 kernel with a dilation rate of 2 has the same field of view as a 5x5 kernel while using only 9 parameters: imagine taking a 5x5 kernel and deleting every second row and column (as shown in the figure below). The network therefore gets a larger receptive field at the same computational cost. Dilated convolution is particularly popular in real-time segmentation; use it when a larger receptive field is needed and multiple convolutions or larger kernels cannot be afforded.
Intuitively, dilated convolution "inflates" the convolution kernel by inserting spaces between the kernel elements. The added parameter l (the dilation rate) indicates how much we want to widen the kernel. The figure below shows the kernel for l = 1, 2, 4 (when l = 1, dilated convolution reduces to standard convolution).
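A hedged PyTorch sketch comparing a 3x3 kernel with dilation rate 2 against a plain 5x5 kernel on an illustrative 7x7 input: both cover the same 5x5 field of view (and so give the same output size), but the dilated one has only 9 parameters.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 7, 7)
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2, bias=False)   # 3 x 3 kernel, dilation rate 2
normal5 = nn.Conv2d(1, 1, kernel_size=5, bias=False)               # 5 x 5 kernel for comparison

print(dilated(x).shape)   # torch.Size([1, 1, 3, 3]): same 5 x 5 field of view as the 5 x 5 kernel
print(normal5(x).shape)   # torch.Size([1, 1, 3, 3])
print(sum(p.numel() for p in dilated.parameters()),   # 9 parameters
      sum(p.numel() for p in normal5.parameters()))   # 25 parameters
```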
Deconvolution
The deconvolution discussed here is quite different from deconvolution in one-dimensional signal processing. The FCN authors call it backwards convolution; others point out that "deconvolution layer" is an unfortunate name and that it should rather be called a transposed convolutional layer.

A CNN contains conv layers and pooling layers: the conv layers convolve the image to extract features, while the pooling layers shrink the image (for example by half) to keep the important features. In a classic image-classification CNN, such as one trained on ImageNet, the final output is 1 x 1 x 1000, where 1000 is the number of classes and 1 x 1 is the spatial size.

The FCN authors, and later end-to-end research, apply deconvolution to this final result (in fact the FCN output is not 1 x 1 but 1/32 of the image size, which does not affect the use of deconvolution). The deconvolution of the image here follows the same principle as full convolution: it enlarges the image. The method used by the FCN authors is a variant of the deconvolution described here, so that per-pixel values can be recovered and the network can work end to end.
There are currently two types of deconvolution that are most commonly used:
Method 1: full convolution; full convolution enlarges the original domain.
Method 2: record the pooling indices, expand the space accordingly, and then fill it in with convolution. The image deconvolution process is as follows:
Input: 2x2, convolution kernel: 4x4, stride: 3, output: 7x7
That is, a 2x2 input image is deconvolved with a 4x4 kernel at a stride of 3.
1. Perform a full convolution for each pixel of the input image. From the full-convolution size calculation, each pixel yields a feature map of size 1 + 4 - 1 = 4, i.e. a 4x4 feature map; since the input has 4 pixels, there are 4 such 4x4 feature maps.
2. Fuse (i.e. add) the 4 feature maps with a stride of 3. For example, the red feature map stays at its original input position (top left) and the green one at its original position (top right); a stride of 3 means the maps are placed 3 pixels apart and the overlapping parts are added. For instance, the element in row 1, column 4 of the output is the sum of row 1, column 4 of the red feature map and row 1, column 1 of the green feature map, and so on.
It can be seen that the output size of the deconvolution is determined by the kernel size and the stride. With in the input size, k the kernel size, s the stride and out the output size:
out = (in - 1) * s + k
The process in the figure above is (2-1) * 3 + 4 = 7
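For reference, the same 2x2 to 7x7 enlargement can be reproduced with PyTorch's transposed convolution; this is only a shape check (the weights here are random, unlike the figure's hand-crafted kernel).

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 2, 2)      # 2 x 2 input
deconv = nn.ConvTranspose2d(1, 1, kernel_size=4, stride=3, bias=False)
y = deconv(x)
print(y.shape)   # torch.Size([1, 1, 7, 7]): out = (in - 1) * s + k = (2 - 1) * 3 + 4 = 7
```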
Involution
Paper: Involution: Inverting the Inherence of Convolution for Visual Recognition (CVPR'21)
Code open source address: https://github.com/d-li14/involution
Despite the rapid development of neural network architectures, convolution is still the main building block of deep networks. Inspired by classic image filtering, the convolution kernel has two notable properties: it is spatial-agnostic and channel-specific. In the spatial domain, the former property ensures that convolution kernels are shared across positions and achieves translation invariance. In the channel domain, a spectrum of convolution kernels collects the different information encoded in different channels, satisfying the latter property. In addition, since the emergence of VGGNet, modern neural networks have satisfied the compactness of the convolution kernel by limiting its spatial span to no more than 3*3.
On the one hand, although the spatial-agnostic and spatially compact properties are meaningful for improving efficiency and for explaining translation invariance, they deprive the convolution kernel of the ability to adapt to different visual patterns at different spatial positions. In addition, locality limits the receptive field of convolution and poses a challenge for small objects and blurred images. On the other hand, it is well known that inter-channel redundancy within convolution kernels is prominent in many classic deep neural networks, which limits the flexibility of the kernels across channels.
To overcome these limitations, the authors propose an operation called involution. Compared with standard convolution, involution has the symmetrically inverse properties: it is spatial-specific and channel-agnostic. Concretely, involution kernels differ across spatial positions but are shared across channels. Because of this spatial specificity, if the involution kernel were parameterized as a fixed-size matrix like a convolution kernel and updated with back-propagation, it would hinder transferring the learned kernels between input images of different resolutions. To handle variable feature resolutions, the involution kernel belonging to a specific spatial position is instead generated conditioned only on the incoming feature vector at that position. In addition, the authors reduce kernel redundancy by sharing the involution kernel along the channel dimension.
Combining these two factors, the computational complexity of involution grows linearly with the number of feature channels, and the dynamically parameterized involution kernels have broad coverage in the spatial dimension. Thanks to this inverted design, the involution proposed in the paper enjoys two advantages over convolution:
1: Involution can aggregate context over a wider spatial range, overcoming the difficulty of modeling long-range interactions;
2: Involution can adaptively assign weights at different positions, thereby prioritizing the most informative visual elements in the spatial domain.
Recent research on Self-Attention shows that many tasks adopt Transformers to capture long-range dependencies of features, and that pure Self-Attention can be used to build standalone models with good performance. This paper reveals that Self-Attention, which models the relationship between neighboring pixels through a complicated formulation of kernel construction, is in fact a special case of involution. In contrast, the kernel used in this paper is generated from a single pixel rather than from its relationship with neighboring pixels. Moreover, the authors show experimentally that even this simple version can match the accuracy of Self-Attention.
The calculation process of involution is shown in the figure below:
For the feature vector at a given coordinate of the input feature map, it is first expanded into the shape of a kernel through φ (FC-BN-ReLU-FC) and a reshape (channel-to-space) transformation, yielding the involution kernel for that coordinate. This kernel is then multiply-added with the feature vectors in the neighborhood of that coordinate on the input feature map to produce the output feature map. The specific operations and tensor shape changes are as follows:
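The official implementation is linked above; the following is only a simplified PyTorch sketch of the data flow just described (kernel generation via two 1 x 1 convolutions, a channel-to-space reshape, then a multiply-add over each K x K neighborhood). The class name Involution2d, the reduction ratio and the group count are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Involution2d(nn.Module):
    """Simplified involution: the kernel is generated per spatial location
    and shared across the channels within each group."""
    def __init__(self, channels, kernel_size=3, groups=1, reduction=4):
        super().__init__()
        self.k, self.groups = kernel_size, groups
        # phi: FC-BN-ReLU-FC, implemented as 1 x 1 convolutions
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.BatchNorm2d(channels // reduction),
            nn.ReLU(inplace=True))
        self.span = nn.Conv2d(channels // reduction, kernel_size * kernel_size * groups, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # 1) generate a K*K kernel for every spatial position (shared over channels in a group)
        kernel = self.span(self.reduce(x))                          # (B, K*K*G, H, W)
        kernel = kernel.view(b, self.groups, self.k * self.k, h, w)
        # 2) gather each pixel's K x K neighborhood
        patches = self.unfold(x)                                    # (B, C*K*K, H*W)
        patches = patches.view(b, self.groups, c // self.groups, self.k * self.k, h, w)
        # 3) multiply-add: weight the neighborhood with the generated kernel
        out = (kernel.unsqueeze(2) * patches).sum(dim=3)            # (B, G, C/G, H, W)
        return out.reshape(b, c, h, w)

y = Involution2d(channels=16, kernel_size=3, groups=4)(torch.randn(1, 16, 8, 8))
print(y.shape)   # torch.Size([1, 16, 8, 8]): same spatial size and channel count as the input
```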
In addition, the authors implemented several models in mmclassification, mmsegmentation and mmdetection, based on the MM-series codebases.