AtomoVideo: High Fidelity Image-to-Video Generation

Litong Gong, Yiran Zhu, Weijie Li, Xiaoyang Kang, Biao Wang, Tiezheng Ge, Bo Zheng

Alibaba Inc. *Equal Contribution

One-minute Video

AtomoVideo is a novel high-fidelity image-to-video (I2V) generation framework that generates high-fidelity videos from input images, achieves stronger motion intensity and better consistency than existing works, and is compatible with various personalized T2I models without specific tuning.

Image-to-Video Examples

https://www.bilibili.com/video/BV1cZ421Y7gB/?aid=1151545974&c...

Comparisons with Other Methods

Abstract

Recently, video generation has achieved rapid development built on superior text-to-image generation techniques. In this work, we propose a high-fidelity framework for image-to-video generation, named AtomoVideo. Based on multi-granularity image injection, we achieve higher fidelity of the generated video to the given image. In addition, thanks to high-quality datasets and training strategies, we achieve greater motion intensity while maintaining superior temporal consistency and stability. Our architecture extends flexibly to the video frame prediction task, enabling long-sequence prediction through iterative generation. Furthermore, owing to the adapter-style training design, our approach combines well with existing personalized models and controllable modules. In quantitative and qualitative evaluations, AtomoVideo achieves superior results compared to popular methods.
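
To make the iterative long-sequence prediction mentioned above concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the released implementation: generate_clip is a hypothetical stand-in for one frame-prediction pass of the model, and each iteration conditions on the last frame generated so far.

def generate_long_video(model, first_frame, num_clips, frames_per_clip=16):
    # Iterative long-sequence prediction: each pass generates one clip,
    # conditioned on the most recently generated frame.
    # NOTE: `generate_clip` and its signature are assumptions for
    # illustration, not the authors' actual API.
    frames = [first_frame]
    for _ in range(num_clips):
        clip = model.generate_clip(cond_frame=frames[-1],
                                   num_frames=frames_per_clip)
        frames.extend(clip[1:])  # skip the repeated conditioning frame
    return frames

Each iteration reuses the model's fixed clip length, so arbitrarily long videos can be produced, at the cost of possible drift accumulating across iterations.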

Framework

The framework of our image-to-video method. We build on a pre-trained T2I model, newly adding 1D temporal convolution and temporal attention modules after every spatial convolution and attention layer; the T2I model parameters are kept fixed and only the added temporal layers are trained. Meanwhile, in order to inject the image information, we expand the input to 9 channels by adding the image condition latent and a binary mask. Since the concatenated image information is encoded only by the VAE, it represents low-level information, which enhances the fidelity of the video to the given image. We also inject high-level image semantics in the form of cross-attention to achieve more semantic image controllability.
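
A minimal PyTorch sketch may help make the conditioning concrete. It assumes the usual 4-channel VAE latent layout and hypothetical names (build_unet_input, set_trainable); the text above specifies only that the input is expanded to 9 channels (noisy latent + image condition latent + binary mask) and that solely the added temporal layers are trained.

import torch

def build_unet_input(noisy_latent, image_latent):
    # noisy_latent: (B, 4, T, H, W) noisy video latents being denoised.
    # image_latent: (B, 4, H, W) VAE latent of the given image.
    b, c, t, h, w = noisy_latent.shape
    # Image condition latent placed on the conditioning (first) frame,
    # zeros elsewhere -- an assumption about the exact layout.
    cond = torch.zeros(b, 4, t, h, w, device=noisy_latent.device)
    cond[:, :, 0] = image_latent
    # Binary mask: 1 marks frames where a condition image is provided.
    mask = torch.zeros(b, 1, t, h, w, device=noisy_latent.device)
    mask[:, :, 0] = 1.0
    # 4 + 4 + 1 = 9 input channels, matching the modified first conv layer.
    return torch.cat([noisy_latent, cond, mask], dim=1)

def set_trainable(unet):
    # Freeze the pre-trained T2I weights; train only the newly added
    # temporal layers (assumed to contain "temporal" in their names).
    for name, param in unet.named_parameters():
        param.requires_grad = "temporal" in name

Freezing the spatial T2I weights in this way is what allows the trained temporal layers to be combined with different personalized T2I backbones without retuning.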

BibTeX
@misc{atomovideo,
      title={AtomoVideo: High Fidelity Image-to-Video Generation},
      author={Gong, Litong and Zhu, Yiran and Li, Weijie and Kang, Xiaoyang and Wang, Biao and Ge, Tiezheng and Zheng, Bo},
      year={2024},
      eprint={2403.01800},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
