
Five Implementation Strategies of Spatial-Shift-Operation in Pytorch

This article has been authorized by and first published on the Jishi platform's official WeChat account. It may not be reprinted without permission.

Original document (may be further updated): https://www.yuque.com/lart/ugkv9f/nnor5p

Foreword

I have read several papers that use spatial shift operations to replace local convolution operations (such as the S2-MLP series, AS-MLP, and CycleFC discussed below).

After reading these papers and the core code they provide (mainly the later MLP methods), I had some ideas on how to implement the spatial shift.
Combining this with existing knowledge, I summarize five implementation strategies here.
Since I personally use PyTorch, the demonstrations also rely on some handy functions that PyTorch itself provides.

Problem Description

Before presenting the implementations, we should clarify the goal so that the later code is easier to follow.
These existing works can all be reduced to the following problem:

Given a tensor $X \in \mathbb{R}^{1 \times 8 \times 5 \times 5}$, stored in PyTorch's default B, C, H, W layout.

Transform $X$ into $\tilde{X}$ via a mapping $\mathcal{T}: x \rightarrow \tilde{x}$.

Here $\tilde{X} \in \mathbb{R}^{1 \times 8 \times 5 \times 5}$. To allow a consistent comparison, the result produced by the "slice index" strategy in the next section is used throughout as the reference value of $\tilde{X}$.

import torch

xs = torch.meshgrid(torch.arange(5), torch.arange(5))
x = torch.stack(xs, dim=0)
x = x.unsqueeze(0).repeat(1, 4, 1, 1).float()
print(x)

'''
tensor([[[[0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4.]],

         [[0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.]],

         [[0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4.]],

         [[0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.]],

         [[0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4.]],

         [[0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.]],

         [[0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4.]],

         [[0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.]]]])
'''

Method 1: Slice Index

This is the most straightforward and simplest strategy, and it is also the one used in the S2-MLP series.
We use its result as the reference for all the other strategies; every later implementation reproduces this output.

direct_shift = torch.clone(x)
direct_shift[:, 0:2, :, 1:] = torch.clone(direct_shift[:, 0:2, :, :4])
direct_shift[:, 2:4, :, :4] = torch.clone(direct_shift[:, 2:4, :, 1:])
direct_shift[:, 4:6, 1:, :] = torch.clone(direct_shift[:, 4:6, :4, :])
direct_shift[:, 6:8, :4, :] = torch.clone(direct_shift[:, 6:8, 1:, :])
print(direct_shift)

'''
tensor([[[[0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4.]],

         [[0., 0., 1., 2., 3.],
          [0., 0., 1., 2., 3.],
          [0., 0., 1., 2., 3.],
          [0., 0., 1., 2., 3.],
          [0., 0., 1., 2., 3.]],

         [[0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4.]],

         [[1., 2., 3., 4., 4.],
          [1., 2., 3., 4., 4.],
          [1., 2., 3., 4., 4.],
          [1., 2., 3., 4., 4.],
          [1., 2., 3., 4., 4.]],

         [[0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.]],

         [[0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.]],

         [[1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4.],
          [4., 4., 4., 4., 4.]],

         [[0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.]]]])
'''
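
For convenience, the slice-index shift can be wrapped into a small helper. The following is a minimal sketch (the function name slice_shift and the hard-coded grouping into four directions are my own choices, not code from the S2-MLP papers):

def slice_shift(x: torch.Tensor) -> torch.Tensor:
    """Shift four channel groups of a B, C, H, W tensor by one pixel in four directions.

    The boundary rows/columns keep their old values, so boundary values appear repeated,
    matching the behavior of the slice-index snippet above.
    """
    out = torch.clone(x)
    c = x.shape[1] // 4
    out[:, 0 * c:1 * c, :, 1:] = x[:, 0 * c:1 * c, :, :-1]  # shift right along W
    out[:, 1 * c:2 * c, :, :-1] = x[:, 1 * c:2 * c, :, 1:]  # shift left along W
    out[:, 2 * c:3 * c, 1:, :] = x[:, 2 * c:3 * c, :-1, :]  # shift down along H
    out[:, 3 * c:4 * c, :-1, :] = x[:, 3 * c:4 * c, 1:, :]  # shift up along H
    return out

print(torch.equal(slice_shift(x), direct_shift))  # expected: True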

Method 2: Feature Map Offset - torch.roll

PyTorch provides a function for shifting feature maps directly, namely torch.roll. This operation has been used in recent Transformer and MLP papers such as Swin Transformer and AS-MLP.

The AS-MLP paper provides pseudocode for this operation (the figure is omitted here).

Its main role is to shift the feature map along a given axis, and it supports shifting along several axes at once, which allows more diverse shift directions to be constructed.
To obtain the same result as before, we first need to pad the input.
This is because the slice-index approach repeats the boundary values, whereas a plain roll moves all values cyclically as a whole.
So, to achieve a similar effect, we first pad one ring of data around the input.
Note that replicate mode is chosen here so that the boundary values end up repeated, as in the reference result.

import torch.nn.functional as F

pad_x = F.pad(x, pad=[1, 1, 1, 1], mode="replicate")  # padding is used here to preserve the boundary values

Then each group of channels is shifted by one unit along one of the four directions:

roll_shift = torch.cat(
    [
        torch.roll(pad_x[:, c * 2 : (c + 1) * 2, ...], shifts=(shift_h, shift_w), dims=(2, 3))
        for c, (shift_h, shift_w) in enumerate([(0, 1), (0, -1), (1, 0), (-1, 0)])
    ],
    dim=1,
)

'''
tensor([[[[0., 0., 0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4., 4., 4.],
          [4., 4., 4., 4., 4., 4., 4.]],

         [[4., 0., 0., 1., 2., 3., 4.],
          [4., 0., 0., 1., 2., 3., 4.],
          [4., 0., 0., 1., 2., 3., 4.],
          [4., 0., 0., 1., 2., 3., 4.],
          [4., 0., 0., 1., 2., 3., 4.],
          [4., 0., 0., 1., 2., 3., 4.],
          [4., 0., 0., 1., 2., 3., 4.]],

         [[0., 0., 0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4., 4., 4.],
          [4., 4., 4., 4., 4., 4., 4.]],

         [[0., 1., 2., 3., 4., 4., 0.],
          [0., 1., 2., 3., 4., 4., 0.],
          [0., 1., 2., 3., 4., 4., 0.],
          [0., 1., 2., 3., 4., 4., 0.],
          [0., 1., 2., 3., 4., 4., 0.],
          [0., 1., 2., 3., 4., 4., 0.],
          [0., 1., 2., 3., 4., 4., 0.]],

         [[4., 4., 4., 4., 4., 4., 4.],
          [0., 0., 0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4., 4., 4.]],

         [[0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.]],

         [[0., 0., 0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4., 4., 4.],
          [4., 4., 4., 4., 4., 4., 4.],
          [0., 0., 0., 0., 0., 0., 0.]],

         [[0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.]]]])
'''

Then simply crop out the central region:

roll_shift = roll_shift[..., 1:6, 1:6]
print(roll_shift)

'''
tensor([[[[0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4.]],

         [[0., 0., 1., 2., 3.],
          [0., 0., 1., 2., 3.],
          [0., 0., 1., 2., 3.],
          [0., 0., 1., 2., 3.],
          [0., 0., 1., 2., 3.]],

         [[0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4.]],

         [[1., 2., 3., 4., 4.],
          [1., 2., 3., 4., 4.],
          [1., 2., 3., 4., 4.],
          [1., 2., 3., 4., 4.],
          [1., 2., 3., 4., 4.]],

         [[0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.]],

         [[0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.]],

         [[1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4.],
          [4., 4., 4., 4., 4.]],

         [[0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.]]]])
'''
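
The pad/roll/crop steps can likewise be collected into a single helper. Below is a minimal sketch under the same assumptions (the name roll_shift_fn and the hard-coded shift list are mine):

def roll_shift_fn(x: torch.Tensor) -> torch.Tensor:
    """Replicate-pad by one pixel, roll each channel group, then crop back to H x W."""
    h, w = x.shape[2:]
    pad_x = F.pad(x, pad=[1, 1, 1, 1], mode="replicate")
    shifted = torch.cat(
        [
            torch.roll(chunk, shifts=shift_hw, dims=(2, 3))
            for chunk, shift_hw in zip(pad_x.chunk(4, dim=1), [(0, 1), (0, -1), (1, 0), (-1, 0)])
        ],
        dim=1,
    )
    return shifted[..., 1:1 + h, 1:1 + w]

print(torch.equal(roll_shift_fn(x), direct_shift))  # expected: True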

Method 3: 1x1 Deformable Convolution - ops.deform_conv2d

While reading CycleFC, I learned about the clever use of deformable convolution for implementing spatial shift operations.
Since recent versions of torchvision integrate this operation, we only need to import the function:

from torchvision.ops import deform_conv2d

In order to use it for spatial shifting, I added some explanatory comments to the relevant code while interpreting CycleFC.

To follow that code, you first need to understand the exact behavior of deform_conv2d (imported there as deform_conv2d_tv).

For details, see: https://pytorch.org/vision/0.10/ops.html#torchvision.ops.deform_conv2d
The requirements for the offset parameter here are:

offset (Tensor[batch_size, 2 * offset_groups * kernel_height * kernel_width, out_height, out_width])

offsets to be applied for each position in the convolution kernel.

That is, for position (x, y) of output channel c of sample s, this function takes from offset the shift parameters corresponding to the kernel_height*kernel_width kernel positions, namely offset[s, 0:2*offset_groups*kernel_height*kernel_width, x, y]. In other words, this whole set of parameters corresponds only to the single output position (x, y) of sample s.

Different output locations may use different offsets, or all locations may share the same offsets (the implementation below does the latter).

As for the dimension 2*offset_groups*kernel_height*kernel_width, it reflects the grouping of the input feature channels.

The input channels are divided into offset_groups groups, and each group has its own set of relative offsets for the kernel positions, i.e. 2*kernel_height*kernel_width numbers per group.

Each kernel position uses two numbers to describe its offset, namely the shifts along the h and w directions relative to that position, which corresponds to the subtraction of kernel_height//2 or kernel_width//2 in the CycleFC code.

Note that when the shifted sampling position falls outside the tensor, the sampling grid is treated as zero-padded; for positions near the boundary, bilinear interpolation is computed from the boundary values together with the zero-padded grid vertices.

This strategy requires us to construct specific relative offsets, stored in offset, to adjust the sampling positions of the 1x1 convolution kernel on different channels.

We first construct the offset tensor we need, $\Delta \in \mathbb{R}^{1 \times 2C_iK_hK_w \times 1 \times 1}$. The out_height and out_width dimensions are set to 1 because the offsets are identical at every spatial position, so the values can simply be repeated afterwards.

offset = torch.empty(1, 2 * 8 * 1 * 1, 1, 1)
for c, (rel_offset_h, rel_offset_w) in enumerate([(0, -1), (0, -1), (0, 1), (0, 1), (-1, 0), (-1, 0), (1, 0), (1, 0)]):
    offset[0, c * 2 + 0, 0, 0] = rel_offset_h
    offset[0, c * 2 + 1, 0, 0] = rel_offset_w
offset = offset.repeat(1, 1, 7, 7).float()  # repeat the per-channel offsets over the (padded) 7x7 spatial grid

When constructing the offsets, keep in mind that they come in pairs, each pair containing the relative offsets along the H and W axes. This relative offset should be centered on the position where the corresponding convolution weight originally acts. I have not verified this conclusion; it is just my own reasoning, since it would be the more convenient choice in the source code, where the coordinates of the weight's original position can be applied directly. If you want to understand the function without reading its source, you need to construct data to verify your understanding.
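
As a quick sanity check of this understanding (a toy example of my own, not from any of the papers), we can apply a single $(\delta_h, \delta_w) = (0, 1)$ offset to a 1x1 all-ones kernel on a one-channel input and confirm that every output value is taken from the position one step to the right:

# toy verification: a (0, +1) offset with a 1x1 kernel should sample the right-hand neighbor
toy = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)
toy_weight = torch.ones(1, 1, 1, 1)
toy_offset = torch.zeros(1, 2, 5, 5)
toy_offset[:, 1] = 1.0  # channel 0 holds the h-offset, channel 1 the w-offset (same order as the code below)
print(deform_conv2d(toy, offset=toy_offset, weight=toy_weight))
# Each row should read like [1., 2., 3., 4., 0.]: the last column samples outside the
# (zero-padded) input, which is exactly why the replicate-padded input is used below.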

To better understand how the offset takes effect, imagine that for a sampling position $(h, w)$, after applying the relative offset $(\delta_h, \delta_w)$, the sampling position becomes $(h+\delta_h, w+\delta_w)$. That is, the weight originally applied at $(h, w)$ is now applied directly at the shifted position $(h+\delta_h, w+\delta_w)$.

The one-unit shift along the four directions described earlier can therefore be achieved by assigning values from $\{-1, 0, 1\}$ to $\delta_h$ and $\delta_w$.

Since we only need the channel-specific spatial shift here, and not the actual convolution of deformable convolution, we set the convolution kernel to an identity matrix and reshape it into the weight layout expected by the convolution:

weight = torch.eye(8).reshape(8, 8, 1, 1).float()
# 8 input channels, 8 output channels; each input channel maps with weight 1 to exactly one output channel

Next we feed the weights and offsets into the imported function.
For offsets that cross the boundary, we again need the replicate-padded input in order to reproduce the boundary-repeating effect from before.
Finally the result is cropped back to the original size:

deconv_shift = deform_conv2d(pad_x, offset=offset, weight=weight)
deconv_shift = deconv_shift[..., 1:6, 1:6]
print(deconv_shift)

'''
tensor([[[[0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4.]],

         [[0., 0., 1., 2., 3.],
          [0., 0., 1., 2., 3.],
          [0., 0., 1., 2., 3.],
          [0., 0., 1., 2., 3.],
          [0., 0., 1., 2., 3.]],

         [[0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4.]],

         [[1., 2., 3., 4., 4.],
          [1., 2., 3., 4., 4.],
          [1., 2., 3., 4., 4.],
          [1., 2., 3., 4., 4.],
          [1., 2., 3., 4., 4.]],

         [[0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.]],

         [[0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.]],

         [[1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4.],
          [4., 4., 4., 4., 4.]],

         [[0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.]]]])
'''

Method 4: 3x3 Depthwise Convolution - F.conv2d

S2-MLP mentions that the spatial shift operation can be implemented with a specially constructed 3x3 depthwise convolution.
Since this is based on a 3x3 convolution, the replicate-padded input is again needed to obtain the repeated boundary values.
First, construct the convolution kernels corresponding to the four directions:

k1 = torch.FloatTensor([[0, 0, 0], [1, 0, 0], [0, 0, 0]]).reshape(1, 1, 3, 3)
k2 = torch.FloatTensor([[0, 0, 0], [0, 0, 1], [0, 0, 0]]).reshape(1, 1, 3, 3)
k3 = torch.FloatTensor([[0, 1, 0], [0, 0, 0], [0, 0, 0]]).reshape(1, 1, 3, 3)
k4 = torch.FloatTensor([[0, 0, 0], [0, 0, 0], [0, 1, 0]]).reshape(1, 1, 3, 3)
weight = torch.cat([k1, k1, k2, k2, k3, k3, k4, k4], dim=0)  # each output channel corresponds to exactly one input channel

Next, the kernels and the data are fed to F.conv2d. The input has already been padded by one unit on each side, so the output shape matches the original input:

conv_shift = F.conv2d(pad_x, weight=weight, groups=8)
print(conv_shift)

'''
tensor([[[[0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4.]],

         [[0., 0., 1., 2., 3.],
          [0., 0., 1., 2., 3.],
          [0., 0., 1., 2., 3.],
          [0., 0., 1., 2., 3.],
          [0., 0., 1., 2., 3.]],

         [[0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4.]],

         [[1., 2., 3., 4., 4.],
          [1., 2., 3., 4., 4.],
          [1., 2., 3., 4., 4.],
          [1., 2., 3., 4., 4.],
          [1., 2., 3., 4., 4.]],

         [[0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.]],

         [[0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.]],

         [[1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4.],
          [4., 4., 4., 4., 4.]],

         [[0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.]]]])
'''
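
In practice this variant is convenient to package as a module whose weights are built once and never trained. The following sketch is my own (the class name FixedShiftConv and its details are not from the S2-MLP code); it keeps the kernels in a buffer and applies the replicate padding inside forward:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FixedShiftConv(nn.Module):
    """Channel-group spatial shift realized as a fixed 3x3 depthwise convolution."""

    def __init__(self, channels: int):
        super().__init__()
        assert channels % 4 == 0, "channels must split into four shift groups"
        g = channels // 4
        kernel = torch.zeros(channels, 1, 3, 3)
        kernel[0 * g:1 * g, 0, 1, 0] = 1.0  # take the left neighbor  -> content shifts right
        kernel[1 * g:2 * g, 0, 1, 2] = 1.0  # take the right neighbor -> content shifts left
        kernel[2 * g:3 * g, 0, 0, 1] = 1.0  # take the upper neighbor -> content shifts down
        kernel[3 * g:4 * g, 0, 2, 1] = 1.0  # take the lower neighbor -> content shifts up
        self.register_buffer("kernel", kernel)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.pad(x, pad=[1, 1, 1, 1], mode="replicate")
        return F.conv2d(x, weight=self.kernel, groups=x.shape[1])

print(torch.equal(FixedShiftConv(8)(x), direct_shift))  # expected: True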

Method 5: Grid Sampling - F.grid_sample

Finally, this strategy is based on F.grid_sample, a function PyTorch provides for building STNs (Spatial Transformer Networks). It has also begun to appear in optical-flow estimation and in some recent segmentation works:

  • AlignSeg: Feature-Aligned Segmentation Networks
  • Semantic Flow for Fast and Accurate Scene Parsing

For a 4D tensor, its main function is, according to a given sampling grid $\Gamma \in \mathbb{R}^{B \times H_o \times W_o \times 2}$, to sample the input at the data point $(\gamma_h, \gamma_w)$ and place the result at the output position $(h, w)$.
Note that this function requires the values of the sampling grid to be normalized with respect to the input size, and that the last dimension of $\Gamma$ indexes the W axis first and then the H axis. In other words, the four dimensions of the input layout B, C, H, W are indexed from back to front. This convention is followed widely by other PyTorch functions; for example, F.pad uses the same rule.
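Concretely, with align_corners=True an integer pixel coordinate $x \in \{0, \dots, W-1\}$ is normalized as $\hat{x} = \frac{2x}{W-1} - 1$, so $x = 0$ maps to $-1$ and $x = W-1$ maps to $1$ (and analogously along the H axis); this is exactly the 2 * ... / (5 - 1) - 1 expression used in the code below.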
First, construct the original coordinate arrays for the input data as required (the upper-left corner is $(h_{coord}[0, 0], w_{coord}[0, 0])$ and the upper-right corner is $(h_{coord}[0, 4], w_{coord}[0, 4])$):

h_coord, w_coord = torch.meshgrid(torch.arange(5), torch.arange(5))
print(h_coord)
print(w_coord)
h_coord = h_coord.reshape(1, 5, 5, 1)
w_coord = w_coord.reshape(1, 5, 5, 1)

'''
tensor([[0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1],
        [2, 2, 2, 2, 2],
        [3, 3, 3, 3, 3],
        [4, 4, 4, 4, 4]])
tensor([[0, 1, 2, 3, 4],
        [0, 1, 2, 3, 4],
        [0, 1, 2, 3, 4],
        [0, 1, 2, 3, 4],
        [0, 1, 2, 3, 4]])
'''

For each output position of $\tilde{x}$, compute the coordinates of the corresponding input $x$ (i.e. the sampling position):

            torch.cat(
                [  # note the stacking order: the coordinate of the later axis (W) comes first
                    2 * torch.clamp(w_coord + w, 0, 4) / (5 - 1) - 1,
                    2 * torch.clamp(h_coord + h, 0, 4) / (5 - 1) - 1,
                ],
                dim=-1,
            )

The parameters $h$ and $w$ here represent the shift expressed in the original coordinate system.
Since clamp is used directly to restrict the sampling positions, the values near the boundary are reused, so the original (unpadded) input can be fed to the function directly.
When the new coordinates are fed into the function, they need to be converted to values in $[-1, 1]$, i.e. normalized with respect to the input width W and height H.

        F.grid_sample(
            x,
            torch.cat(
                [
                    2 * torch.clamp(w_coord + w, 0, 4) / (5 - 1) - 1,
                    2 * torch.clamp(h_coord + h, 0, 4) / (5 - 1) - 1,
                ],
                dim=-1,
            ),
            mode="bilinear",
            align_corners=True,
        )

It should be noted that align_corners=True is used here. An introduction to this parameter can be found at https://www.yuque.com/lart/idh721/ugwn46 (the figures there illustrate the True and False settings; they are omitted here).

The True setting is the one that matches our needs, because the bilinear-interpolation-based algorithms discussed here (such as the deformable convolution above) place pixels on the vertices of the sampling grid. At least, understanding them this way agrees better with the experimental behavior, which is why I describe it like this.
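
For reference, with align_corners=False the same coordinate would instead be normalized as $\hat{x} = \frac{2x + 1}{W} - 1$, i.e. pixels are treated as cells rather than as grid vertices, which is another way to see why the True setting matches the vertex-based picture assumed by deform_conv2d.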

grid_sampled_shift = torch.cat(
    [
        F.grid_sample(
            x,
            torch.cat(
                [
                    2 * torch.clamp(w_coord + w, 0, 4) / (5 - 1) - 1,
                    2 * torch.clamp(h_coord + h, 0, 4) / (5 - 1) - 1,
                ],
                dim=-1,
            ),
            mode="bilinear",
            align_corners=True,
        )
        for x, (h, w) in zip(x.chunk(4, dim=1), [(0, -1), (0, 1), (-1, 0), (1, 0)])
    ],
    dim=1,
)
print(grid_sampled_shift)

'''
tensor([[[[0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4.]],

         [[0., 0., 1., 2., 3.],
          [0., 0., 1., 2., 3.],
          [0., 0., 1., 2., 3.],
          [0., 0., 1., 2., 3.],
          [0., 0., 1., 2., 3.]],

         [[0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4.]],

         [[1., 2., 3., 4., 4.],
          [1., 2., 3., 4., 4.],
          [1., 2., 3., 4., 4.],
          [1., 2., 3., 4., 4.],
          [1., 2., 3., 4., 4.]],

         [[0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.]],

         [[0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.]],

         [[1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4.],
          [4., 4., 4., 4., 4.]],

         [[0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.],
          [0., 1., 2., 3., 4.]]]])
'''
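
Assuming the snippets above were all run in the same session, the equivalence of the five strategies can be checked in a couple of lines (allclose is used with a small tolerance to absorb any tiny floating-point error from grid_sample):

# every alternative implementation should match the slice-index reference
for name, result in [("roll", roll_shift), ("deform_conv", deconv_shift),
                     ("conv", conv_shift), ("grid_sample", grid_sampled_shift)]:
    print(name, torch.allclose(result, direct_shift, atol=1e-6))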

Some other thoughts

On the precision of F.grid_sample

Since F.grid_sample involves normalization, some loss of precision is unavoidable.
So if you need exact control, this method is not recommended.
If the sampling locations happen to fall exactly on grid vertices, you can use nearest-neighbor interpolation mode to get a cleaner result.
Below is an example:

h_coord, w_coord = torch.meshgrid(torch.arange(7), torch.arange(7))
h_coord = h_coord.reshape(1, 7, 7, 1)
w_coord = w_coord.reshape(1, 7, 7, 1)
grid = torch.cat(
    [
        2 * torch.clamp(w_coord, 0, 6) / (7 - 1) - 1,
        2 * torch.clamp(h_coord, 0, 6) / (7 - 1) - 1,
    ],
    dim=-1,
)
print(grid)
print(pad_x[:, :2])

print("mode=bilinear\n", F.grid_sample(pad_x[:, :2], grid, mode="bilinear", align_corners=True))
print("mode=nearest\n", F.grid_sample(pad_x[:, :2], grid, mode="nearest", align_corners=True))

'''
tensor([[[[-1.0000, -1.0000],
          [-0.6667, -1.0000],
          [-0.3333, -1.0000],
          [ 0.0000, -1.0000],
          [ 0.3333, -1.0000],
          [ 0.6667, -1.0000],
          [ 1.0000, -1.0000]],

         [[-1.0000, -0.6667],
          [-0.6667, -0.6667],
          [-0.3333, -0.6667],
          [ 0.0000, -0.6667],
          [ 0.3333, -0.6667],
          [ 0.6667, -0.6667],
          [ 1.0000, -0.6667]],

         [[-1.0000, -0.3333],
          [-0.6667, -0.3333],
          [-0.3333, -0.3333],
          [ 0.0000, -0.3333],
          [ 0.3333, -0.3333],
          [ 0.6667, -0.3333],
          [ 1.0000, -0.3333]],

         [[-1.0000,  0.0000],
          [-0.6667,  0.0000],
          [-0.3333,  0.0000],
          [ 0.0000,  0.0000],
          [ 0.3333,  0.0000],
          [ 0.6667,  0.0000],
          [ 1.0000,  0.0000]],

         [[-1.0000,  0.3333],
          [-0.6667,  0.3333],
          [-0.3333,  0.3333],
          [ 0.0000,  0.3333],
          [ 0.3333,  0.3333],
          [ 0.6667,  0.3333],
          [ 1.0000,  0.3333]],

         [[-1.0000,  0.6667],
          [-0.6667,  0.6667],
          [-0.3333,  0.6667],
          [ 0.0000,  0.6667],
          [ 0.3333,  0.6667],
          [ 0.6667,  0.6667],
          [ 1.0000,  0.6667]],

         [[-1.0000,  1.0000],
          [-0.6667,  1.0000],
          [-0.3333,  1.0000],
          [ 0.0000,  1.0000],
          [ 0.3333,  1.0000],
          [ 0.6667,  1.0000],
          [ 1.0000,  1.0000]]]])
tensor([[[[0., 0., 0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4., 4., 4.],
          [4., 4., 4., 4., 4., 4., 4.]],

         [[0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.]]]])
mode=bilinear
 tensor([[[[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
           0.0000e+00, 0.0000e+00],
          [1.1921e-07, 1.1921e-07, 1.1921e-07, 1.1921e-07, 1.1921e-07,
           1.1921e-07, 1.1921e-07],
          [1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00, 1.0000e+00,
           1.0000e+00, 1.0000e+00],
          [2.0000e+00, 2.0000e+00, 2.0000e+00, 2.0000e+00, 2.0000e+00,
           2.0000e+00, 2.0000e+00],
          [3.0000e+00, 3.0000e+00, 3.0000e+00, 3.0000e+00, 3.0000e+00,
           3.0000e+00, 3.0000e+00],
          [4.0000e+00, 4.0000e+00, 4.0000e+00, 4.0000e+00, 4.0000e+00,
           4.0000e+00, 4.0000e+00],
          [4.0000e+00, 4.0000e+00, 4.0000e+00, 4.0000e+00, 4.0000e+00,
           4.0000e+00, 4.0000e+00]],

         [[0.0000e+00, 1.1921e-07, 1.0000e+00, 2.0000e+00, 3.0000e+00,
           4.0000e+00, 4.0000e+00],
          [0.0000e+00, 1.1921e-07, 1.0000e+00, 2.0000e+00, 3.0000e+00,
           4.0000e+00, 4.0000e+00],
          [0.0000e+00, 1.1921e-07, 1.0000e+00, 2.0000e+00, 3.0000e+00,
           4.0000e+00, 4.0000e+00],
          [0.0000e+00, 1.1921e-07, 1.0000e+00, 2.0000e+00, 3.0000e+00,
           4.0000e+00, 4.0000e+00],
          [0.0000e+00, 1.1921e-07, 1.0000e+00, 2.0000e+00, 3.0000e+00,
           4.0000e+00, 4.0000e+00],
          [0.0000e+00, 1.1921e-07, 1.0000e+00, 2.0000e+00, 3.0000e+00,
           4.0000e+00, 4.0000e+00],
          [0.0000e+00, 1.1921e-07, 1.0000e+00, 2.0000e+00, 3.0000e+00,
           4.0000e+00, 4.0000e+00]]]])
mode=nearest
 tensor([[[[0., 0., 0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0.],
          [1., 1., 1., 1., 1., 1., 1.],
          [2., 2., 2., 2., 2., 2., 2.],
          [3., 3., 3., 3., 3., 3., 3.],
          [4., 4., 4., 4., 4., 4., 4.],
          [4., 4., 4., 4., 4., 4., 4.]],

         [[0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.],
          [0., 0., 1., 2., 3., 4., 4.]]]])
'''

F.grid_sample and Deformable Convolution

Although both adjust the mapping between input and output positions, the way they do so differs markedly.

  • Different reference coordinate systems

    • The former uses a normalized coordinate system defined on the whole input: the origin is the center of the input H-W plane, and the H and W axes point downward and to the right. In this coordinate system the upper-left corner of the input is $(-1, -1)$ and the upper-right corner is $(1, -1)$.
    • The latter uses a relative coordinate system centered on the position where the weight originally acts; in fact it is easier to understand it simply as shifts along the H and W axes. For example, to move the weight's acting position one unit to the left, the offset pair $(\delta_h, \delta_w)$ can be set to $(0, -1)$, i.e. $-1$ is added to the $w$ coordinate of the original position.
  • Different effects

    • The former directly adjusts the sampling coordinates of the whole input and applies the same adjustment to all channels.
    • The latter is built on top of the convolution operation, so it can more conveniently handle different channel groups (offset_groups) and different, possibly overlapping, local regions (kernel_height * kernel_width). It is therefore more flexible and adjustable in practice.

A second spring for the shift operation

Although various forms of spatial shift operations have been explored in previous work, they have not attracted much attention.

  • (CVPR 2018) [Grouped Shift] Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions:
  • (ICCV 2019) 4-Connected Shift Residual Networks
  • (NIPS 2018) [Active Shift] Constructing Fast Network through Deconstruction of Convolution
  • (CVPR 2019) [Sparse Shift] All You Need Is a Few Shifts: Designing Efficient Convolutional Neural Networks for Image Classification

Most of these works focus on the design of lightweight networks, but now, riding on the MLP wave, shift-based methods seem to have stirred up some new splashes.
These recent methods often use more effective training settings, and such strategies, which lie outside the model itself, also improve performance considerably. This is actually a bit confusing: if the earlier shift operations were migrated directly into today's MLP framework, perhaps their performance would not be bad either?

The same thought applies to traditional CNN methods: if the earlier architectures were trained with the same strategies, how much worse would they really be than today's models? Presumably only those with plenty of GPUs, time and patience can find out.

In fact, these existing spatial-shift-based MLP methods can all be regarded as specializations of [(NIPS 2018) [Active Shift] Constructing Fast Network through Deconstruction of Convolution](https://www.yuque.com/lart/architecture/conv#tjP7f).

That is, the adaptively learned offset parameters of the original work are replaced with fixed offsets.
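
To make this relationship concrete, here is a rough sketch of my own (not the Active Shift implementation) showing how the fixed offsets of Method 3 could be turned back into learnable per-channel offsets, i.e. roughly the adaptive setting of Active Shift expressed with deform_conv2d:

import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class LearnableShift(nn.Module):
    """1x1 deform_conv2d whose per-channel (dh, dw) offsets are trainable parameters."""

    def __init__(self, channels: int):
        super().__init__()
        # one (dh, dw) pair per channel, shared by all spatial positions
        self.offset = nn.Parameter(torch.zeros(1, 2 * channels, 1, 1))
        self.register_buffer("weight", torch.eye(channels).reshape(channels, channels, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        offset = self.offset.repeat(b, 1, h, w)  # gradients flow to the offsets via bilinear interpolation
        return deform_conv2d(x, offset=offset, weight=self.weight)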

