
A total of five video-related models are covered this week: the "omnivorous" Omnivore, the "king of cost-effectiveness" TSM, the "pure-blood Transformer on the attack" TimeSformer, the "returning champion" Video Swin Transformer, and the "pride of homegrown research" UniFormer. Whether you prefer the rising new generation or the veterans that keep pace with the times, there is a model here for you to love.

Produced by: the Towhee technical team


Images, videos, and 3D data in one bite: Omnivore, the AI model that is not a picky eater!


Tired of using a different model for each kind of data? Ever wished one model could handle data from multiple modalities? In early 2022, Meta AI finally delivered the "omnivorous" Omnivore, a single model that handles different visual modalities and can classify images, videos, and 3D data. Omnivore is not only compatible with many types of data, it also ranks among the best on the datasets of each task: 86.0% accuracy on the ImageNet image classification benchmark, 84.1% on the Kinetics action recognition dataset, and 67.1% on the SUN RGB-D single-view 3D scene classification dataset.

Figure: Omnivore handles multiple visual modalities with a single model

Omnivore converts data from different visual modalities into a common embedding format, then exploits the flexibility of the Transformer architecture to jointly train classifiers for the different modalities. Whether training from scratch or fine-tuning a pre-trained model, Omnivore with off-the-shelf standard datasets matches or exceeds the performance of the corresponding single-modality models.
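To make the idea concrete, here is a minimal PyTorch sketch of the Omnivore recipe (not Meta's official code; the layer sizes, patch sizes, and class count are illustrative assumptions): each modality gets its own patch embedding that projects inputs into a shared token space, and a single Transformer encoder plus classification head serves all of them.

```python
# A minimal sketch (not Meta's official code) of the Omnivore idea: map images,
# videos, and RGB-D inputs into one shared token space, then classify all of them
# with a single Transformer encoder. Dimensions here are illustrative assumptions.
import torch
import torch.nn as nn


class SharedTransformerClassifier(nn.Module):
    def __init__(self, dim=256, num_classes=400, depth=4, heads=8):
        super().__init__()
        # Modality-specific "patchifiers" project every input to (tokens, dim).
        self.image_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)          # RGB image
        self.video_embed = nn.Conv3d(3, dim, kernel_size=(2, 16, 16),
                                     stride=(2, 16, 16))                          # RGB video
        self.rgbd_embed = nn.Conv2d(4, dim, kernel_size=16, stride=16)            # RGB + depth
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                   batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)     # shared trunk
        self.head = nn.Linear(dim, num_classes)                                   # shared classifier

    def forward(self, x, modality):
        if modality == "image":          # (B, 3, H, W)
            tokens = self.image_embed(x).flatten(2).transpose(1, 2)
        elif modality == "video":        # (B, 3, T, H, W)
            tokens = self.video_embed(x).flatten(2).transpose(1, 2)
        else:                            # "rgbd": (B, 4, H, W)
            tokens = self.rgbd_embed(x).flatten(2).transpose(1, 2)
        feats = self.encoder(tokens).mean(dim=1)   # average-pool the tokens
        return self.head(feats)


model = SharedTransformerClassifier()
logits_img = model(torch.randn(2, 3, 224, 224), "image")
logits_vid = model(torch.randn(2, 3, 8, 224, 224), "video")
logits_rgbd = model(torch.randn(2, 4, 224, 224), "rgbd")
print(logits_img.shape, logits_vid.shape, logits_rgbd.shape)  # all (2, 400)
```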

References:
Model use case: action-classification/omnivore
Paper: OMNIVORE: A Single Model for Many Visual Modalities
More information: Facebook AI launches a "super model" that handles image, video, and 3D data classification with performance on par with dedicated single-modality models


TSM, the cost-effective model: 3D results at 2D cost

MIT and the MIT-IBM Watson AI Lab jointly proposed TSM (Temporal Shift Module), an efficient video understanding model that approximates 3D modeling through temporal shifts while retaining the efficiency of 2D convolutions. Earlier models had to spend a great deal of extra computation on top of conventional image analysis to capture temporal information; TSM makes high-performance video understanding affordable.
Figure: TSM shifts part of the feature channels along the temporal dimension
2D CNNs and 3D CNNs are the two most common approaches to video understanding: a 2D CNN is cheap to compute but discards temporal information, while a 3D CNN is effective but computationally expensive. TSM embeds a temporal shift module into a 2D CNN so that, without adding any extra computation or parameters, it achieves video understanding capability comparable to a 3D CNN.
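The shift itself is simple enough to show in a few lines. Below is a minimal PyTorch sketch of the temporal shift described in the paper (not the official implementation): a fraction of the channels is shifted one frame backward, another fraction one frame forward, and the rest stay in place; the fold_div=8 split follows the paper's default.

```python
# A minimal sketch of TSM's temporal shift: zero parameters and zero FLOPs,
# yet it lets a 2D CNN exchange information across neighbouring frames.
import torch


def temporal_shift(x, fold_div=8):
    """x: (N, T, C, H, W) features of a video clip with T frames."""
    n, t, c, h, w = x.size()
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # pull these channels from the next frame
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # pull these channels from the previous frame
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # keep the remaining channels in place
    return out


clip = torch.randn(2, 8, 64, 56, 56)   # batch of 2 clips, 8 frames, 64 channels
shifted = temporal_shift(clip)
print(shifted.shape)                    # torch.Size([2, 8, 64, 56, 56])
```

In the full model, this shift is inserted inside the residual branch of each 2D-CNN block, so the per-frame backbone computation is unchanged.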

References:
Model use case: action-classification/tsm
Paper: TSM: Temporal Shift Module for Efficient Video Understanding
More information:
Video Classification | Paper 2019: TSM: Temporal Shift Module for Efficient Video Understanding
TSM: Temporal Shift Module for Video Understanding


TimeSformer: understanding video with Transformer alone? The attention mechanism strikes again!

Facebook AI proposed TimeSformer (Time-Space Transformer), a new video understanding architecture built entirely on Transformer that dispenses with CNNs altogether. With roughly one-third of the training time, TimeSformer runs inference up to ten times faster and achieves superior results on multiple action recognition datasets. The paper evaluates it on Kinetics-400, Kinetics-600, Something-Something-v2, Diving-48, and HowTo100M, all of which confirm TimeSformer's strong performance.
Figure: Visualization of the five space-time self-attention schemes
TimeSformer captures the temporal and spatial dependencies of an entire video. It treats the input video as a spatiotemporal sequence of image patches extracted from each frame, similar to how Transformers treat word sequences in NLP. Compared with state-of-the-art 3D convolutional neural networks, TimeSformer not only speeds up training but also drastically reduces inference time. In addition, its scalability gives it more potential for processing longer video clips and training larger models.
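The "divided space-time attention" scheme, which the paper finds works best, can be sketched roughly as follows (illustrative PyTorch, not Facebook's code; the dimensions are assumptions): temporal attention runs over tokens at the same spatial position across frames, then spatial attention runs within each frame.

```python
# A rough sketch of divided space-time attention: temporal attention first,
# spatial attention second, each applied with a residual connection.
import torch
import torch.nn as nn


class DividedSpaceTimeAttention(nn.Module):
    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, S, D) -- T frames, S patch tokens per frame, D channels
        b, t, s, d = x.shape

        # Temporal attention: sequence length T, one sequence per spatial location.
        xt = x.permute(0, 2, 1, 3).reshape(b * s, t, d)
        xt, _ = self.temporal_attn(xt, xt, xt)
        x = x + xt.reshape(b, s, t, d).permute(0, 2, 1, 3)

        # Spatial attention: sequence length S, one sequence per frame.
        xs = x.reshape(b * t, s, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        return x + xs.reshape(b, t, s, d)


block = DividedSpaceTimeAttention()
tokens = torch.randn(2, 8, 196, 192)    # 8 frames x 14x14 patches
print(block(tokens).shape)               # torch.Size([2, 8, 196, 192])
```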

References:
Model use case: action-classification/timesformer
Paper: Is Space-Time Attention All You Need for Video Understanding?
More information:
Facebook AI proposes TimeSformer: a video understanding framework built entirely on Transformer
TimeSformer analysis: Transformer in video understanding
TimeSformer: does video understanding only require spatiotemporal attention?


Swin Transformer, the ICCV 2021 best paper model, finally takes on video!

After Swin Transformer won the ICCV 2021 best paper award last year, Microsoft Research Asia followed up this year with Video Swin Transformer, its flagship work for the video domain. Video Swin Transformer tops the leaderboards at CVPR 2022 and outperforms ViViT, TimeSformer, and other networks on both action recognition and temporal modeling tasks, reaching 84.9% top-1 accuracy on Kinetics-400 and 69.6% top-1 accuracy on Something-Something v2.
Figure: An illustration of 3D shifted windows
By extending Swin Transformer from the image domain to the video domain, Video Swin Transformer introduces a locality inductive bias and makes effective use of pre-trained image models. Unlike previous approaches that compute self-attention globally even with spatiotemporal factorization, its local windowed attention achieves a better trade-off between speed and accuracy.
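A simplified sketch of the 3D shifted-window mechanism is shown below (illustrative code, not Microsoft's implementation; the window and shift sizes are assumptions): the feature map is cyclically rolled along time, height, and width, then partitioned into non-overlapping 3D windows, and self-attention is computed only within each window.

```python
# A simplified sketch of 3D shifted windows: roll the features so window
# boundaries move between layers, then split them into local 3D windows.
import torch


def shifted_window_partition(x, window_size=(2, 7, 7), shift=(1, 3, 3)):
    """x: (B, T, H, W, C); returns (num_windows * B, wT * wH * wW, C)."""
    # Cyclic shift lets neighbouring windows exchange information across layers.
    x = torch.roll(x, shifts=(-shift[0], -shift[1], -shift[2]), dims=(1, 2, 3))
    b, t, h, w, c = x.shape
    wt, wh, ww = window_size
    x = x.view(b, t // wt, wt, h // wh, wh, w // ww, ww, c)
    windows = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, c)
    return windows


feat = torch.randn(1, 8, 56, 56, 96)    # (B, T, H, W, C) patch features
windows = shifted_window_partition(feat)
print(windows.shape)                     # (256, 98, 96): each window holds 2*7*7 = 98 tokens
# Local self-attention would then run independently inside every window.
```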

References:
Model use case: action-classification/video-swin-transformer
Paper: Video Swin Transformer
Official description: Dominating the leaderboards across major video understanding tasks! Microsoft proposes Video Swin Transformer
More information: Video Swin Transformer, a video classification tool


The pride of homegrown research! UniFormer, the high-scoring spatiotemporal representation learning model

Jointly produced by the Chinese Academy of Sciences, the University of Chinese Academy of Sciences, Shanghai AI Laboratory, SenseTime, and the Chinese University of Hong Kong, the SoTA model UniFormer (Unified Transformer) achieves excellent results on mainstream datasets: 82.9% / 84.8% top-1 accuracy on Kinetics-400 / Kinetics-600, and 60.9% / 71.2% top-1 accuracy on Something-Something V1 / V2. Once published, the paper received high review scores and was accepted to ICLR 2022 (an average of 7.5 in the initial review: 8 / 8 / 6 / 8).
Figure: The UniFormer architecture
UniFormer proposes a Transformer structure that unifies 3D convolution and spatiotemporal self-attention, striking a balance between computational cost and accuracy. Unlike the conventional Transformer, which applies self-attention in every layer, the relation aggregator proposed in the paper handles the video's redundant local information and its long-range dependencies separately. In shallow layers, the aggregator uses a small learnable matrix to learn local relations, greatly reducing computation by aggregating token information from small 3D neighbourhoods. In deep layers, the aggregator learns global relations through similarity comparison, flexibly building long-range dependencies between tokens from distant video frames.
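The two kinds of relation aggregator can be sketched as follows (a hedged illustration, not the authors' code; layer sizes and kernel sizes are assumptions): a shallow-layer aggregator built from a depthwise 3D convolution over a small neighbourhood, and a deep-layer aggregator that applies global self-attention to all spatiotemporal tokens.

```python
# An illustrative sketch of UniFormer's local vs. global relation aggregators.
import torch
import torch.nn as nn


class LocalAggregator(nn.Module):
    """Shallow-layer aggregator: a small learnable kernel over a 3D neighbourhood."""
    def __init__(self, dim, kernel=5):
        super().__init__()
        self.dwconv = nn.Conv3d(dim, dim, kernel_size=kernel,
                                padding=kernel // 2, groups=dim)

    def forward(self, x):          # x: (B, C, T, H, W)
        return x + self.dwconv(x)


class GlobalAggregator(nn.Module):
    """Deep-layer aggregator: self-attention over every spatiotemporal token."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):          # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)      # (B, T*H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return x + out.transpose(1, 2).reshape(b, c, t, h, w)


feat = torch.randn(1, 64, 8, 14, 14)
print(LocalAggregator(64)(feat).shape, GlobalAggregator(64)(feat).shape)
```
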
References:
Model use case: action-classification/uniformer
Paper: UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning
More information:
High-scoring paper! UniFormer: a unified Transformer for efficient spatiotemporal representation learning
ICLR 2022 | UniFormer: seamlessly integrating Transformer for a more efficient spatiotemporal representation learning framework


For more project updates and details, please follow our project ( https://github.com/towhee-io/towhee/blob/main/towhee/models/README_CN.md ). Your attention is what keeps this labor of love going, and stars, forks, and joining our Slack are all welcome :)


Zilliz

Vector database for Enterprise-grade AI