We asked a question before: how many steps does it take to generate a creative video from a text script? Microsoft Research Asia's open-domain video generation pre-trained model has the answer: just one step. Now we ask: beyond text-to-video generation, in what other ways can we generate video? Can we edit visual content using natural language? NÜWA, the latest multimodal model from Microsoft Research Asia, not only provides a new way to create visual content, but also offers more ways to "open up" the classic Windows desktop.
Humans perceive information through five senses: vision, hearing, smell, touch, and taste. Vision is the most important channel for receiving information and a source of creativity. As artificial intelligence has advanced, computer vision has become an important research field, and in recent years visual creation applications have appeared one after another, making creation ever more convenient and letting more and more users capture and share the life around them. In turn, the widespread use of vision-based applications has driven further research in computer vision.
However powerful, these tools still have shortcomings. First, they require creators to collect and process visual material manually, so the visual knowledge contained in existing large-scale visual data cannot be exploited automatically. Second, these tools usually interact with creators through graphical interfaces rather than natural language instructions, which creates a technical barrier for some users and demands considerable experience. In Microsoft Research Asia's view, the next generation of visual content creation tools should use big data and AI models to help users create content more easily, with natural language as a friendlier interface.
With this in mind, Microsoft Research Asia built on its video generation pre-training model to develop the multimodal NÜWA (Neural visUal World creAtion) model. Through natural language instructions, NÜWA can generate, convert, and edit text, images, and videos, lowering the technical barrier for visual content creators and boosting their creativity. Developers can also use NÜWA to build AI-based visual content creation platforms.
Supports eight major visual generation and editing tasks
NÜWA currently supports eight major visual generation and editing tasks. The four image tasks are text-to-image, sketch-to-image, image completion, and image editing; the four video tasks are text-to-video, video-sketch-to-video, video prediction, and video editing.
Next, let's take the classic Windows desktop as an example and try out several of NÜWA's functions. (See the appendix for more examples of NÜWA on all eight tasks.)
Given an original image:
Ask NÜWA to extend the image to 256×256 (image completion):
Ask NÜWA to add "a horse walking on grass" inside the red box of the image (image editing):
Ask NÜWA to turn this image into a video that "moves" (video prediction):
Completing multiple visual content creation tasks "single-handedly"
The NÜWA model proposes a novel 3D encoder-decoder framework. The encoder supports a variety of input conditions, including text, images, videos, and sketches, and even partial images or partial videos, allowing the model to complete the remaining video frames; the decoder converts these conditions into discrete visual tokens and outputs image and video content learned from the training data.
In the pre-training phase, the researchers trained NÜWA with an autoregressive objective, using a VQ-GAN encoder to convert images and videos into the corresponding visual tokens as part of the pre-training data. At inference time, the VQ-GAN decoder reconstructs the image or video from the predicted discrete visual tokens.
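To make the token pipeline concrete, here is a minimal sketch of the codebook lookup at the heart of VQ-GAN-style quantization: continuous encoder features are snapped to their nearest codebook entries (the discrete visual tokens), and the decoder side maps tokens back to feature vectors. The shapes, codebook, and random features are toy assumptions for illustration, not NÜWA's actual components.

```python
# Toy tokenize -> detokenize round trip, in the spirit of a VQ-GAN codebook.
# All shapes and values here are illustrative assumptions.
import torch

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    dists = torch.cdist(features, codebook)   # (N, K) pairwise distances
    return dists.argmin(dim=-1)               # (N,) discrete visual tokens

def dequantize(token_ids, codebook):
    """Reconstruct approximate feature vectors from discrete tokens."""
    return codebook[token_ids]

torch.manual_seed(0)
codebook = torch.randn(32, 8)                 # 32 entries of dimension 8
patch_features = torch.randn(16, 8)           # stand-in for encoder output

tokens = quantize(patch_features, codebook)   # discrete visual tokens
recon = dequantize(tokens, codebook)          # decoder-side reconstruction
print(tokens.shape, recon.shape)              # torch.Size([16]) torch.Size([16, 8])
```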
NÜWA also introduces a 3D Nearby Attention (3DNA) mechanism tailored to the characteristics of 3D data, supporting sparse attention in both the encoder and the decoder. That is, when generating a particular part of an image or a video frame, NÜWA attends not only to the history it has already generated, but also to the corresponding positions in its input conditions. For example, when generating the second frame of a video from a video sketch, the model considers the corresponding positions in the second frame of the sketch, so the generated video follows the sketch's changes; this is sparsity applied simultaneously in the encoder and the decoder. Previous work usually used only 1D or 2D sparse attention, and applied it only in the encoder or only in the decoder. The 3DNA mechanism reduces NÜWA's computational complexity and improves its computational efficiency.
Figure 1: NÜWA's 3D encoder-decoder architecture
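The sketch below illustrates the "nearby" sparse-attention idea on a 3D (time, height, width) token grid: each query position attends only to keys within a local window along every axis, instead of to all positions. The grid shape and window sizes are illustrative assumptions, not values from the NÜWA paper.

```python
# Local ("nearby") sparse attention over a flattened 3D token grid.
import torch

def nearby_mask(T, H, W, wt=1, wh=1, ww=1):
    """Boolean (T*H*W, T*H*W) mask: True where the key lies in the query's
    local window along all three axes."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij"
    ), dim=-1).reshape(-1, 3)                          # (N, 3) grid coordinates
    diff = (coords[:, None, :] - coords[None, :, :]).abs()
    return (diff <= torch.tensor([wt, wh, ww])).all(dim=-1)

T, H, W, D = 4, 8, 8, 16
x = torch.randn(T * H * W, D)                          # flattened 3D tokens
mask = nearby_mask(T, H, W)

# Standard attention with out-of-window positions masked out.
scores = (x @ x.T) / D ** 0.5
scores = scores.masked_fill(~mask, float("-inf"))
attn = scores.softmax(dim=-1) @ x
print(attn.shape, mask.float().mean())                 # dense fraction << 1
```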
To support multimodal tasks spanning text, images, and videos, and to bridge the gaps between data from different domains, the researchers adopted a staged training strategy that uses different types of training data during pre-training. The text-image and image-video tasks are trained first; once they stabilize, text-video data is added for joint training. The researchers also trained a video completion task, generating subsequent frames from a given partial video as input. This gives NÜWA powerful zero-shot visual content generation and editing capabilities, enabling addition, deletion, and modification of image and video content, and even controllable adjustment of future video frames.
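As a rough illustration of that schedule, the sketch below alternates batches from the two bridging tasks first, then adds text-video data for joint training. The loader names and the train_step function are hypothetical placeholders, not NÜWA's actual training code.

```python
# Staged multi-task training over several infinite batch iterators.
import itertools

def staged_training(model, text_image, image_video, text_video, train_step,
                    stage1_steps=10_000, stage2_steps=10_000):
    """text_image / image_video / text_video are infinite batch iterators."""
    # Stage 1: alternate the two bridging tasks (text->image, image->video)
    # until training stabilizes.
    for loader in itertools.islice(itertools.cycle([text_image, image_video]),
                                   stage1_steps):
        train_step(model, next(loader))
    # Stage 2: add text->video data and train all three tasks jointly.
    loaders = [text_image, image_video, text_video]
    for loader in itertools.islice(itertools.cycle(loaders), stage2_steps):
        train_step(model, next(loader))
```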
Duan Nan, a senior researcher at Microsoft Research Asia, said: "NÜWA is the first multimodal pre-trained model of its kind. We hope NÜWA can achieve realistic video generation, but the model produces a large number of 'intermediate variables' during training, consuming enormous GPU memory and compute. The NÜWA team therefore worked with colleagues in the systems group to set up multiple parallelism mechanisms for NÜWA at the system-architecture level, such as tensor parallelism, pipeline parallelism, and data parallelism, making dynamic training possible."
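Of the three strategies mentioned, data parallelism is the simplest to show in a few lines. Below is a generic PyTorch DistributedDataParallel sketch, not NÜWA's actual training setup; tensor and pipeline parallelism need heavier machinery (e.g., Megatron-LM-style sharding).

```python
# Minimal data-parallel training step with PyTorch DDP (one process per GPU).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")           # env vars provided by torchrun
    rank = dist.get_rank()
    model = torch.nn.Linear(512, 512).to(rank)
    model = DDP(model, device_ids=[rank])     # gradients all-reduced across ranks
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)

    x = torch.randn(8, 512, device=rank)      # each rank sees a different shard
    loss = model(x).pow(2).mean()
    loss.backward()                           # DDP synchronizes gradients here
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=NUM_GPUS this_file.py
```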
NÜWA was evaluated on 11 datasets with 11 evaluation metrics. On the Fréchet Inception Distance (FID) metric for text-to-image generation, NÜWA outperformed DALL-E and CogView; on the FVD metric for video generation, it surpassed CCVS, achieving state-of-the-art (SOTA) results. Selected results are shown below (see the paper for NÜWA's results on more datasets and metrics):
Table 1: Text-to-image task test results
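For reference, FID measures how close the feature statistics of generated images are to those of real images; lower is better. Below is a minimal sketch of the computation, with random arrays standing in for real Inception features.

```python
# Fréchet Inception Distance from feature means and covariances:
# FID = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^(1/2))
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_fake):
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    c_r = np.cov(feats_real, rowvar=False)
    c_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(c_r @ c_f)
    if np.iscomplexobj(covmean):              # numerical noise can introduce
        covmean = covmean.real                # tiny imaginary parts; drop them
    return float(np.sum((mu_r - mu_f) ** 2)
                 + np.trace(c_r + c_f - 2 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 64))            # stand-ins for Inception features
fake = rng.normal(loc=0.1, size=(1000, 64))
print(f"FID: {fid(real, fake):.3f}")          # lower = closer distributions
```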
NÜWA-LIP: Make visual editing more refined
The NÜWA model already covers the core workflow of visual creation and can help creators work more efficiently, but in practice creators still have many diverse, demanding needs. To address them, researchers at Microsoft Research Asia iterated on NÜWA and recently proposed the NÜWA-LIP model, achieving a new breakthrough on a classic task in the visual field: repairing defective images.
Earlier methods could perform similar image restoration, but their output is fairly random and does not follow the creator's intent, whereas NÜWA-LIP can repair and complete images to a level acceptable to the naked eye according to given natural language instructions. Next, let us see NÜWA-LIP's striking image restoration results for ourselves.
Figure 2: NÜWA-LIP exhibits excellent performance on image editing tasks
Figure 2 shows two examples. In the first, the model is asked to fill in the black region according to "Racers riding four wheelers while a crowd watches." The existing GLIDE model can complete the region, but obvious white lines appear at the border and the completed area is relatively blurred. The standard NÜWA model generates by scanning from left to right autoregressively, so its boundaries are more natural than GLIDE's; however, because it cannot see the wheel on the right while filling in the black region, its completion does not connect properly at the boundary. NÜWA-LIP fixes this shortcoming: it previews the entire image first, using a novel lossless encoding technique, and then generates autoregressively, so the borders of the black region connect naturally and the completed area is also sharp.
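The toy below contrasts the two strategies just described: filling a masked span using only left context versus bridging context from both sides. It uses simple linear interpolation on a 1D signal purely to illustrate why seeing the right-hand context makes the filled region connect at both borders; it is not NÜWA-LIP's actual method.

```python
# Left-only fill vs. both-sides fill of a masked span in a 1D signal.
import numpy as np

signal = np.sin(np.linspace(0, 2 * np.pi, 50))
mask = slice(20, 30)                        # the "black region" to fill

left_only = signal.copy()
left_only[mask] = signal[19]                # extend left context only:
                                            # the right border won't connect

both_sides = signal.copy()
both_sides[mask] = np.linspace(signal[19], signal[30], 10)  # bridge both sides

print(abs(left_only[29] - signal[30]))      # large jump at the right border
print(abs(both_sides[29] - signal[30]))     # ~0: the boundary connects smoothly
```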
In FID testing, NÜWA-LIP achieved the best score on the task of language-guided image inpainting, measured by comparing the inpainted image with the original. (Note: a lower FID score indicates a higher-quality inpainted image.)
Table 2: NÜWA-LIP achieves an FID of 10.5 on the image editing task
NÜWA-Infinity: Pushing visual creation toward "infinite flow"
Beyond image restoration, Microsoft Research Asia has also pursued research on horizontally extending high-resolution large images, proposing the NÜWA-Infinity model. As the name suggests, NÜWA-Infinity can generate infinitely extendable high-definition "blockbusters" from a given image. "At first, the images and videos NÜWA could generate and edit were relatively low-resolution, generally small 256×256 images. We hope the model can generate larger, higher-definition images with greater visual impact, meeting the real needs of different creators. Simply put, NÜWA-Infinity scans window by window across the different levels of the image and renders continuously, forming a large, high-pixel, continuous image," said Wu Chenfei, a researcher at Microsoft Research Asia.
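The sketch below illustrates sliding-window outpainting in that spirit: the canvas is extended to the right window by window, with each new window conditioned on an overlapping strip of already-generated pixels. The generate_window function is a hypothetical stand-in for the actual generative model, and all sizes are illustrative assumptions.

```python
# Window-by-window outpainting with overlap-based conditioning.
import numpy as np

def generate_window(context_strip, height, width, rng):
    """Hypothetical model call: returns a (height, width) patch continuing
    context_strip; here it's just smoothed noise for illustration."""
    base = context_strip.mean() if context_strip.size else 0.0
    return base + 0.1 * rng.standard_normal((height, width))

def outpaint_right(image, n_windows=4, window=64, overlap=16, rng=None):
    rng = rng or np.random.default_rng(0)
    canvas = image
    for _ in range(n_windows):
        context = canvas[:, -overlap:]                    # strip the model sees
        patch = generate_window(context, canvas.shape[0], window, rng)
        canvas = np.concatenate([canvas, patch], axis=1)  # extend rightward
    return canvas

start = np.zeros((64, 64))
wide = outpaint_right(start)
print(wide.shape)  # (64, 320): the canvas keeps growing window by window
```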
Wondering what the right side of the Windows classic desktop looks like? Click the image below, and NÜWA-Infinity will "uncover" the mystery for you.
Duan Nan added: "On the surface, NÜWA-Infinity solves NÜWA's problems of low-definition images and a limited number of video frames. In fact, NÜWA-Infinity establishes a generation mechanism at the foundation that can not only extend image generation but also be applied to video prediction and creation, which is the research topic we will tackle next."
At this point, NÜWA-LIP has made it possible for machines to retouch images automatically from language instructions, while NÜWA-Infinity has taken a big step in image generation quality toward the high-definition, boundless real world. At this pace of iterative innovation, a set of "infinite flow" visual creation aids for future creators is just around the corner.
The NÜWA multimodal model's chain reaction: more "killer" apps may follow
In the future, as artificial intelligence develops, immersive human-computer interfaces such as augmented and virtual reality will see wider use, and the digital and physical worlds will grow ever more intertwined. Multimodal content of different types is the glue that binds virtual spaces to the real world, so creating, editing, and interacting with virtual content will be crucial. The visual content generation and editing technology in NÜWA offers unlimited imaginative room for these applications. As multimodal technology becomes a major direction for future AI applications, multimodal models will bring more next-generation "killer" apps to fields such as learning, advertising, news, conferencing, entertainment, social networking, digital humans, and brain-computer interfaces.
Related papers:
NÜWA: https://arxiv.org/abs/2111.12417
NÜWA-LIP: https://arxiv.org/abs/2202.05009
Appendix:
NÜWA's results on the eight tasks
Figure 3: Text-to-image task. For example, given the text "A wooden house sitting in a field," NÜWA created four cabins from different camera angles that are not only varied in style but also highly realistic.
Figure 4: Sketch-to-image task. For example, given a sketch of a bus (row 1, column 1), NÜWA created three images matching the shape and position of the sketch, down to clearly visible reflections in the windows.
Figure 5: Image completion task. For example, in row 1, given the top of a tower (50% of the original image), NÜWA completes the bottom of the tower, including columns and even roofs. In row 2, NÜWA can still complete the image when only 5% of the image area is given.
Figure 6: Image editing task. For example, given the image to be edited (image 1), the region to edit (red box), and the text "Beach and sky," image 2 shows the edited result.
Figure 7: Text-to-video task. NÜWA can generate videos not only from ordinary text such as "Play golf on grass," but also from text describing scenes impossible in reality, such as "Play golf on the swimming pool."
Figure 8: Video-sketch-to-video task. Given a video sketch as input, NÜWA can generate a video that follows the sketch frame by frame.
Figure 9: Video prediction task. Given a still image as input, NÜWA can output a video that "animates" it.
Figure 10: Video editing task. Given a video and editing text, NÜWA outputs the edited video. For example, from an original video of a diver swimming horizontally, guided by the text "The diver is swimming to the surface," NÜWA produces a video of the diver swimming upward.