CVPR 2022, a top international conference on computer vision, was recently held in New Orleans, USA. This year, several papers from the Meituan technical team were accepted to CVPR 2022, covering research areas such as model compression, video object segmentation, 3D visual grounding, image captioning, model security, and cross-modal video content retrieval. This article gives a brief introduction to the six accepted papers (with download links), which we hope will be helpful or inspiring to readers engaged in related research.
CVPR stands for the IEEE Conference on Computer Vision and Pattern Recognition. According to the 2021 edition of Google Scholar's ranking of publication influence, CVPR ranks 4th among all academic venues, behind only Nature, NEJM, and Science. This year CVPR received more than 8,100 submissions from around the world, of which 2,067 were accepted, an acceptance rate of about 25%.
Paper 01 | Compressing Models with Few Samples: Mimicking then Replacing
| Paper Download
| Paper authors: Wang Huanyu (Meituan intern & Nanjing University), Liu Junjie (Meituan), Ma Xin (Meituan), Yong Yang (Meituan intern & Xi'an Jiaotong University), Chai Zhenhua (Meituan), Wu Jianxin (Nanjing University)
| Note: The institutions in parentheses are the authors' affiliations at the time of publication.
| Paper Type: CVPR Main Conference (Long Paper)
Model pruning is a relatively mature direction in model compression, but the time cost of pruning and then fine-tuning on datasets with millions of samples remains a major pain point that limits its adoption. In recent years, few-sample model pruning has attracted attention from the academic community, because it allows compression and optimization to be completed quickly on large-scale datasets or in scenarios where data sources are sensitive. However, the layer-by-layer channel alignment adopted by existing work greatly limits which regions of a network can be pruned when the architecture is complex. Moreover, when the sample distribution is unbalanced, over-emphasizing the consistency of feature distributions between layers leads to optimization errors.
Somewhat counter-intuitively, this paper proposes a method called MiR (Mimicking then Replacing), which discards the posterior-distribution alignment that traditional knowledge distillation relies on and transfers knowledge only at the penultimate layer. By then grafting the classification/detection head of the original model onto the compressed model, re-tuning of the compressed model can be completed quickly with only a few samples. Experiments show that the proposed algorithm significantly outperforms various baselines (as well as concurrent TPAMI work), and it has also been further validated in production scenarios such as Meituan's image security auditing.
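To make the recipe above concrete, here is a minimal PyTorch-style sketch of penultimate-layer mimicking followed by head grafting. All function and variable names are illustrative assumptions for exposition, not the paper's released code.

```python
# Minimal sketch of the MiR idea: mimic the teacher's penultimate-layer
# features with a pruned student, then reuse (graft) the teacher's head.
import torch
import torch.nn as nn

def mimic_then_replace(teacher, pruned_student, head, loader, epochs=10, lr=1e-3):
    """teacher / pruned_student map images -> penultimate-layer features;
    head is the original model's frozen classifier, grafted onto the student."""
    teacher.eval()
    opt = torch.optim.SGD(pruned_student.parameters(), lr=lr, momentum=0.9)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for images, _ in loader:               # labels are not needed: few-sample, label-free tuning
            with torch.no_grad():
                t_feat = teacher(images)       # penultimate-layer features of the teacher
            s_feat = pruned_student(images)
            loss = mse(s_feat, t_feat)         # feature mimicking only, no KL on logits
            opt.zero_grad()
            loss.backward()
            opt.step()
    # "Replacing": the compressed model = tuned student backbone + original head
    return nn.Sequential(pruned_student, head)
```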
Paper 02 | Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation
| Paper Download
| Paper authors: Ding Zihan (Meituan), Hui Tianrui (University of Chinese Academy of Sciences), Huang Junshi (Meituan), Wei Xiaoming (Meituan), Han Jizhong (University of Chinese Academy of Sciences), Liu Xi (Beihang University)
| Paper Type: CVPR 2022 Main Conference Long Paper (Poster)
Referring video object segmentation aims to segment the foreground pixels of objects referred to by a natural language description in a video. Previous methods either rely on 3D convolutional networks or add extra 2D convolutional networks as encoders to extract mixed spatiotemporal features. However, these methods suffer from spatial misalignment or false distractors because the spatiotemporal interaction is delayed and implicit, occurring only at the decoding stage.
To address these limitations, we propose a Language-Bridged Bidirectional Transfer (LBDT) module that uses language as an intermediate bridge to perform explicit and adaptive spatiotemporal interaction early in the encoding stage. Specifically, between the temporal encoder, the referring words, and the spatial encoder, language-relevant motion and appearance information is aggregated and transferred through cross-modal attention. In addition, we propose a Bilateral Channel Activation (BCA) module in the decoding stage to further denoise and highlight spatiotemporally consistent features via channel activation. Extensive experiments show that our method achieves state-of-the-art performance on four commonly used public datasets without requiring pre-training on referring image segmentation, while significantly improving model efficiency. Related code: LBDT.
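As a rough illustration of the language-bridging idea, the sketch below uses standard cross-attention to let the words gather motion cues from the temporal branch and then inject them into the spatial branch. The module name, tensor shapes, and the single-direction transfer shown here are assumptions for illustration, not the official LBDT implementation.

```python
# Sketch of one language-bridged transfer step: temporal features are summarized
# through the words, and the language-filtered summary is read back by the
# spatial branch. The reverse (spatial -> language -> temporal) direction
# would be symmetric.
import torch
import torch.nn as nn

class LanguageBridgedTransfer(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.lang_from_temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_from_lang = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, spatial_feat, temporal_feat, lang_feat):
        # 1) words gather language-relevant motion cues from the temporal branch
        lang_motion, _ = self.lang_from_temporal(lang_feat, temporal_feat, temporal_feat)
        # 2) spatial tokens read those cues back through the language bridge
        enhanced_spatial, _ = self.spatial_from_lang(spatial_feat, lang_motion, lang_motion)
        return spatial_feat + enhanced_spatial   # residual injection into the spatial encoder

# toy usage: batch=2, 196 spatial tokens, 8 frames, 10 words, dim=256
m = LanguageBridgedTransfer()
out = m(torch.randn(2, 196, 256), torch.randn(2, 8, 256), torch.randn(2, 10, 256))
```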
Paper 03 | 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection
| Paper Download
| Paper authors: Luo Junyu (Meituan intern & Beihang University), Fu Jiahui (Meituan intern & Beihang University), Kong Xianghao (Meituan intern & Beihang University), Gao Chen (Beihang University), Ren Haibing (Meituan), Shen Hao (Meituan), Xia Huaxia (Meituan), Liu Xi (Beihang University)
| Paper Type: CVPR 2022 Main Conference (Oral)
The 3D visual grounding task aims to localize the target object described by a natural language sentence in a point cloud scene. Previous methods mostly follow a two-stage paradigm: language-independent object detection followed by cross-modal object matching. In this decoupled paradigm, because point clouds are irregular and large-scale compared with images, the detector has to sample keypoints from the raw point cloud and generate a proposal box for each keypoint. Sparse proposals may miss potential objects at the detection stage, while dense proposals increase the difficulty of the subsequent matching stage. In addition, language-independent sampling places only a small fraction of the keypoints on the target to be grounded, which further degrades target prediction.
In this paper, we propose a single-stage method, 3D-SPS (Referred Point Progressive Selection), which progressively selects keypoints under the guidance of language and localizes the target directly. Specifically, we propose a Description-aware Keypoint Sampling (DKS) module to coarsely focus the point cloud on language-relevant objects. Furthermore, we design a Target-oriented Progressive Mining (TPM) module to finely focus on the target object through multi-layer intra-modal relation modeling and inter-modal target mining. 3D-SPS avoids the separation between detection and matching in 3D visual grounding and localizes the target directly in a single stage.
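The snippet below is an illustrative sketch of what description-aware keypoint sampling could look like: every point is scored against the sentence embedding and only the top-k language-relevant points are kept. The scoring head and all names are hypothetical, not taken from the released 3D-SPS code.

```python
# Sketch of language-guided keypoint sampling: score each point by its
# relevance to the sentence, then keep the top-k points as keypoints.
import torch
import torch.nn as nn

class DescriptionAwareSampling(nn.Module):
    def __init__(self, dim=288):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim * 2, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, point_feat, lang_feat, k=512):
        # point_feat: (B, N, C) per-point features; lang_feat: (B, C) sentence embedding
        lang = lang_feat.unsqueeze(1).expand(-1, point_feat.size(1), -1)
        rel = self.score(torch.cat([point_feat, lang], dim=-1)).squeeze(-1)   # (B, N) relevance
        idx = rel.topk(k, dim=1).indices                                      # language-guided keypoints
        return torch.gather(point_feat, 1, idx.unsqueeze(-1).expand(-1, -1, point_feat.size(-1)))
```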
Paper 04 | DeeCap: Dynamic Early Exiting for Efficient Image Captioning
| Paper Download
| Paper authors: Fei Zhengcong (Meituan), Yan Xu (Institute of Computing Technology, Chinese Academy of Sciences), Wang Shuhui (Institute of Computing Technology, Chinese Academy of Sciences), Tian Qi (Huawei)
| Paper Type: CVPR 2022 Main Conference Long Paper (Poster)
Accurate captions and efficient generation are both important for applying image captioning in real-world scenarios. Transformer-based models bring significant performance gains, but their computational cost is very high. One possible way to reduce the time complexity is early exiting: making a prediction from a shallow decoding layer without running the full model. However, we found two problems in practice: first, the representations learned in shallow layers lack the high-level semantics and sufficient cross-modal fusion needed for accurate prediction; second, the exit decisions made by internal classifiers are sometimes unreliable.
To address this, we propose DeeCap, a framework for efficient image captioning that dynamically decides, from a global perspective, how many decoding layers to use before exiting early. The key to accurate early exiting is an imitation learning mechanism that predicts deep features from shallow ones. By incorporating imitation learning into the whole captioning model, the imitated deep representations compensate for the missing deep layers at early exit, effectively reducing the computational cost with only a small loss of accuracy. Experiments on the MS COCO and Flickr30K datasets show that DeeCap remains highly competitive while achieving a 4x speedup. Related code: DeeCap.
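The following toy sketch shows how early exiting and feature imitation can be combined: each shallow layer is paired with an imitation head that predicts a deep-like representation, and decoding stops once the internal classifier is confident. The threshold, layer sizes, and module names are assumptions for illustration, not the DeeCap release.

```python
# Toy early-exit decoder with feature imitation: a small head predicts what the
# remaining deep layers would have produced, so the classifier attached to a
# shallow layer sees deep-like features.
import torch
import torch.nn as nn

class EarlyExitDecoder(nn.Module):
    def __init__(self, dim=512, n_layers=6, vocab=10000, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(dim, 8, batch_first=True) for _ in range(n_layers))
        self.imitators = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))
        self.classifier = nn.Linear(dim, vocab)
        self.threshold = threshold

    def forward(self, tgt, memory):
        h = tgt
        for layer, imitate in zip(self.layers, self.imitators):
            h = layer(h, memory)
            deep_like = imitate(h)                        # imitate the (skipped) deeper representation
            probs = self.classifier(deep_like).softmax(-1)
            if probs.max(-1).values.mean() > self.threshold:
                return probs                              # confident enough: exit early
        return probs                                      # otherwise use the full depth
```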
Paper 05 | Boosting Black-Box Attack with Partially Transferred Conditional Adversarial Distribution
| Paper Download
| Paper authors: Feng Yan (Meituan), Wu Baoyuan (Chinese University of Hong Kong), Fan Yanbo (Tencent), Liu Li (Chinese University of Hong Kong), Li Zhifeng (Tencent), Xia Shutao (Tsinghua University)
| Paper Type: CVPR 2022 Main Conference Long Paper (Poster)
This paper studies model security in the black-box setting, where the attacker can only attack the target model through the query feedback it returns. The current mainstream approach improves the attack by exploiting adversarial transferability between white-box surrogate models and the target model (i.e., the attacked model). However, because the surrogate and target models may differ in architecture and training data, a problem known as "surrogate bias", the contribution of adversarial transferability to attack performance can be weakened. To address this, the paper proposes a transfer mechanism that is robust to surrogate bias. The key idea is to transfer only part of the parameters of the surrogate model's conditional adversarial distribution, while learning the untransferred parameters from queries to the target model, which preserves the flexibility to adjust the conditional adversarial distribution of the target model on any new clean sample. Extensive experiments on large-scale datasets and real-world APIs demonstrate the effectiveness of the proposed method.
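As a very rough, simplified illustration of the "partially transferred" idea, the sketch below takes the perturbation direction from a white-box surrogate (the transferred part) and adapts a per-sample step size from the target model's query feedback (the untransferred part). This is a deliberately simplified stand-in for the paper's conditional adversarial distribution; all function names and hyperparameters are hypothetical.

```python
# Simplified sketch: transferred direction from the surrogate + query-adapted
# scale for the target model. Not the paper's actual parameterization.
import torch

def partially_transferred_attack(target_query, surrogate_grad, x, label,
                                 queries=100, eps=8 / 255):
    """target_query(x) -> loss value from the black-box model (query feedback);
    surrogate_grad(x, label) -> gradient from the white-box surrogate."""
    direction = surrogate_grad(x, label).sign()        # transferred knowledge
    scale = torch.tensor(0.5)                          # untransferred, query-tuned parameter
    best_x, best_loss = x, target_query(x)
    for _ in range(queries):
        noise = torch.randn_like(x) * 0.05             # exploration around the surrogate direction
        cand = (x + eps * (scale * direction + noise)).clamp(0, 1)
        loss = target_query(cand)
        if loss > best_loss:                           # keep candidates that hurt the target more
            best_x, best_loss = cand, loss
            scale = scale * 1.1                        # adapt the untransferred parameter
        else:
            scale = scale * 0.95
    return best_x
```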
Paper 06 | Semi-supervised Video Paragraph Grounding with Contrastive Encoder
| Paper Download
| Paper authors: Jiang Xun (University of Electronic Science and Technology of China), Xu Xing (University of Electronic Science and Technology of China), Zhang Jingran (University of Electronic Science and Technology of China), Shen Fumin (University of Electronic Science and Technology of China), Cao Zuo (Meituan), Shen Hengtao (University of Electronic Science and Technology of China)
| Paper Type: CVPR Main Conference, Long Paper (Poster)
Video event localization is a cross-modal video content retrieval task: given an input query, it retrieves the video clip corresponding to that query from an untrimmed video. The retrieved clip can then be used to generate an animated GIF for the query, so that search scenarios can return moving images directly. Unlike the coarse-grained mechanism of Video-Text Retrieval (VTR), which returns whole video files as results, this task emphasizes event-level, fine-grained cross-modal retrieval within a video, achieving temporal alignment across modalities through a joint understanding of the video content and the natural language description.
This paper proposes, for the first time, a semi-supervised framework for Video Paragraph Grounding (VPG), which significantly reduces the dependence on timestamp-annotated data while making more effective use of the contextual information among events in a paragraph. Specifically, it consists of two key parts: (1) a basic Transformer-based model that learns coarse-grained video-paragraph alignment through a contrastive encoder, while guiding the interaction between the sentences of the paragraph to learn contextual information across events; (2) a semi-supervised learning framework built around (1), which reduces the reliance on labeled data through a mean teacher model. Experiments show that our method achieves state-of-the-art performance when all annotations are used, and still delivers quite competitive results when the proportion of annotated data is greatly reduced.
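The mean teacher component mentioned in (2) can be sketched roughly as follows: the teacher is an exponential-moving-average (EMA) copy of the student, labeled videos contribute a supervised grounding loss, and unlabeled videos contribute a consistency loss against the teacher's predictions. The function names, the L1 losses, and the span format are assumptions for illustration, not the paper's exact training code.

```python
# Minimal mean-teacher sketch for semi-supervised grounding.
# Initialize once with: teacher = copy.deepcopy(student)
import torch
import torch.nn.functional as F

def ema_update(teacher, student, momentum=0.999):
    # teacher weights track an exponential moving average of the student
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(momentum).add_(s, alpha=1 - momentum)

def train_step(student, teacher, labeled_batch, unlabeled_batch, optimizer):
    video, paragraph, gt_spans = labeled_batch
    pred = student(video, paragraph)
    sup_loss = F.l1_loss(pred, gt_spans)                  # supervised grounding loss on labeled data

    u_video, u_paragraph = unlabeled_batch
    with torch.no_grad():
        pseudo = teacher(u_video, u_paragraph)            # teacher prediction as a soft target
    cons_loss = F.l1_loss(student(u_video, u_paragraph), pseudo)

    loss = sup_loss + cons_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)
    return loss.item()
```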
At CVPR 2022, the Visual Intelligence Department of Meituan's technical team won first place in the Herbarium recognition track of the 9th Fine-Grained Visual Categorization workshop (FGVC9), and the Dianping division won first place in the large-scale cross-modal product image retrieval competition. Meituan's ride-hailing division won second place in the lightweight NAS international competition. The Visual Intelligence Department also took third place in the deepfake detection competition, third place in the SoccerNet 2022 person re-identification competition, and fifth place in the large-scale video object segmentation competition (YouTube-VOS).
Related technical write-ups will be published on the Meituan technical team's official account in the future, so stay tuned.
Afterword
The papers above are the results of collaboration between the Meituan technical team and various universities and research institutions. This article has mainly introduced some of our research work in model compression, video object segmentation, image captioning, model security, cross-modal video content retrieval, 3D visual grounding, and other fields.
In addition, the Meituan technical team actively participates in international challenges, hoping to bring more research results into practice and thus generate more business and social value. The problems we encounter in real work scenarios, and our solutions to them, are reflected in these papers and competitions. We hope they are helpful or inspiring to you, and you are welcome to reach out and exchange ideas with us.
Meituan scientific research cooperation
Meituan's research collaboration program is committed to building a bridge and platform for cooperation between Meituan's departments and universities, research institutions, and think tanks. Relying on Meituan's rich business scenarios, data resources, and real industrial problems, and in a spirit of open innovation, it brings partners together around artificial intelligence, big data, the Internet of Things, autonomous driving, operations optimization, the digital economy, public affairs, and other fields to jointly explore cutting-edge technologies and macro issues of industry focus, promote industry-university-research cooperation, exchange, and the transfer of research results, and foster outstanding talent. Looking to the future, we look forward to working with teachers and students from more universities and research institutes. You are welcome to email us at: meituan.oi@meituan.com.
| This article was produced by the Meituan technical team, and the copyright belongs to Meituan. You are welcome to reprint or use this article for non-commercial purposes such as sharing and communication; please credit it as "reproduced from the Meituan technical team". This article may not be reproduced or used commercially without permission. For any commercial use, please email tech@meituan.com to apply for authorization.