MUGE (Multimodal Understanding and Generation Evaluation Benchmark) is a large-scale Chinese multimodal evaluation benchmark launched by the Cognitive Intelligence team of the Intelligent Computing Laboratory of Alibaba DAMO Academy. It offers the largest Chinese multimodal evaluation dataset to date, covering multiple task types including image captioning, text-to-image generation, and cross-modal retrieval. MUGE aims to address the lack of downstream-task datasets in the Chinese multimodal field and to give researchers a platform and evaluation benchmark for measuring the effectiveness of their models.
Background
In recent years, the success of large-scale neural network models and pre-training techniques has driven rapid progress in computer vision and natural language processing, as well as in multimodal representation learning. In 2020, Jeff Dean noted that multimodal research would be a major trend in future research. In China, the Intelligent Computing Laboratory of Alibaba DAMO Academy has been deeply exploring Chinese multimodal pre-training at ultra-large scale. It recently released M6 models [1] with tens of billions, hundreds of billions, and a trillion parameters, currently the largest multimodal pre-training models, and has applied them to various downstream tasks and to real-world scenarios such as search, recommendation, clothing design, and smart copywriting.
However, most existing evaluation benchmarks and datasets in the multimodal field are in English, such as MSCOCO Image Captioning [2], VQA [3][4], TextVQA, and VCR, and there is no unified benchmark that lets researchers comprehensively evaluate their models across different scenarios and task types. Public multimodal datasets and leaderboards for Chinese are even scarcer. Given the vigorous development of the Chinese multimodal field, the Cognitive Intelligence team of the DAMO Academy Intelligent Computing Laboratory has launched MUGE, a large-scale Chinese multimodal evaluation benchmark. It offers the largest Chinese multimodal evaluation dataset, covering a variety of task types including image captioning, text-to-image generation, and cross-modal retrieval, so that models can be evaluated comprehensively and researchers can better understand their strengths and weaknesses.
MUGE Introduction
MUGE stands for Multimodal Understanding and Generation Evaluation Benchmark. Its first phase opens Chinese multimodal downstream-task datasets and an evaluation leaderboard, aiming to help Chinese multimodal researchers evaluate their models comprehensively. MUGE covers multiple scenarios and task types, including understanding tasks such as cross-modal retrieval and cross-modal classification, as well as generation tasks such as image captioning and text-to-image generation, so that researchers can assess a model from both the understanding and the generation perspective. The first phase opens the following three tasks:
E-Commerce IC (Image Captioning)
Image caption generation is a classic multimodal task: given an image, generate a corresponding text description that faithfully reflects the objects and key details in the image. The e-commerce field contains a huge number of product images, and applying image captioning there to generate an attractive description for each product is of great value for attracting user clicks and increasing conversion rates.
The ECommerce-IC dataset released this time covers many product categories such as clothing, food, cosmetics, and 3C digital accessories, and all data come from real Taobao e-commerce scenarios. The text description for each product was written by the merchant according to the product's characteristics, and the copywriting styles vary widely, which makes caption generation challenging. ECommerce-IC contains 50,000 training samples and 5,000 validation samples, and additionally provides 10,000 images for online evaluation. It is currently the largest Chinese e-commerce captioning dataset in the industry.
Here are two examples:
Example 1:
- Input (product image):
- Output (product copywriting): In an original Nordic style that advocates nature, with wood tones, black, and white as the overall palette, it gives a comfortable and peaceful feeling, makes dining relaxed, and preserves good moments with food. Enjoy your meal in a minimalist Nordic dining room.
Example 2:
- Input (product image):
- Output (product copywriting): A two-piece printed-dress-and-suit set, elegant yet free and easy with an intellectual air. A graceful printed dress meets a suit jacket, easily creating a refined workplace-goddess look; it stays beautiful and elegant even without the jacket, making it a smart outfit. The V-neck design further shows off feminine charm. Exquisite, tasteful, and elegant, like something out of a fashion album.
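To make the setup concrete, below is a minimal sketch of how the captioning data could be wrapped for model training. The JSONL annotation file, the `image_id` and `caption` field names, and the `<image_id>.jpg` layout are illustrative guesses, not the official MUGE release format; check the Tianchi task page for the actual layout.

```python
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class ECommerceICDataset(Dataset):
    """(Product image, copywriting caption) pairs for captioning models.

    Assumes a JSONL annotation file with `image_id` and `caption` fields
    and images stored as `<image_id>.jpg` -- illustrative guesses, not
    the official MUGE file layout.
    """

    def __init__(self, annotation_file, image_dir, transform=None):
        with open(annotation_file, encoding="utf-8") as f:
            self.samples = [json.loads(line) for line in f]
        self.image_dir = Path(image_dir)
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        record = self.samples[idx]
        image = Image.open(self.image_dir / f"{record['image_id']}.jpg").convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, record["caption"]
```

From here, a standard encoder-decoder captioner can be trained with an ordinary torch DataLoader; generated captions are typically scored against the reference copywriting with metrics such as BLEU or CIDEr.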
E-Commerce T2I (Text-to-Image)
Text-to-image generation is a challenging task that requires both image generation capability and cross-modal understanding: given a text description, generate an image that matches the description while remaining clear and realistic. The e-commerce field contains a huge number of product images, and applying text-to-image generation there is of great value for new-product launches, design, and distribution, reducing merchants' operating costs and improving user experience.
The ECommerce-T2I dataset released this time covers multiple product categories in clothing, accessories, and cosmetics, with all data coming from real Taobao e-commerce scenarios. The dataset consists of a training set, a validation set, and a test set: the training set has 90,000 images, while the validation and test sets have 5,000 images each. All images in this dataset are on a white background, so participants do not need to focus on background generation; the main purpose is to examine a model's understanding of product text and the quality of the objects it generates.
Here are two examples:
Example 1:
- Input (text): Wool business casual suit
- Output (generated image):
Example 2:
- Input (text): Shock-absorbing, breathable running shoes
- Output (generated image):
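As a rough illustration of the task's input/output contract, the toy module below maps a tokenized product description to a small RGB image tensor. Everything here (the vocabulary size, the bag-of-tokens encoder, the single linear decoder) is a deliberately simplified stand-in for illustration only; a competitive entry would use a trained generative model such as an autoregressive or diffusion-style generator.

```python
import torch
import torch.nn as nn


class ToyTextToImage(nn.Module):
    """Illustrative stand-in for a text-to-image generator.

    Maps a bag-of-tokens text encoding to a 64x64 RGB image tensor in
    [0, 1]; a real submission would replace this with a trained model.
    """

    def __init__(self, vocab_size=8000, embed_dim=256, image_size=64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim)  # mean-pools token embeddings
        self.decode = nn.Linear(embed_dim, 3 * image_size * image_size)
        self.image_size = image_size

    def forward(self, token_ids):
        text_feature = self.embed(token_ids)                # (batch, embed_dim)
        pixels = torch.sigmoid(self.decode(text_feature))   # pixel values in [0, 1]
        return pixels.view(-1, 3, self.image_size, self.image_size)


# Hypothetical usage: token ids would come from a real Chinese tokenizer.
model = ToyTextToImage()
token_ids = torch.randint(0, 8000, (2, 16))  # two queries, 16 tokens each
images = model(token_ids)
print(images.shape)  # torch.Size([2, 3, 64, 64])
```

The point is only the contract: text tokens in, a fixed-size image tensor out. What the benchmark actually measures is how clear and realistic the generated product images are and how faithfully they match the text.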
Multimodal Retrieval Dataset
Multimodal retrieval evaluates a model's ability to understand and match images and text, an indispensable capability for meeting user needs and facilitating click-through transactions in e-commerce scenarios. For this task we prepared real search queries and product images from the Taobao e-commerce platform, and models must retrieve the products matching each search query from a given product pool (see the examples below). To better evaluate cross-modal understanding, product titles and other metadata are not disclosed this time; the model must retrieve based on the product image alone, which is quite challenging.
The e-commerce image-text retrieval dataset released this time consists of a training set, a validation set, and a test set. The training set contains 250,000 search-query-to-product-image pairs, covering roughly 120,000 product images. For the validation and test sets, we prepared 5,000 search queries and 30,000 candidate product images each. The dataset spans a wide range of categories, including clothing, home furnishing, electronics, and cosmetics. It is currently the largest Chinese all-category e-commerce image-text retrieval dataset, and it tests the generalization ability of models.
Here are two examples:
Example 1:
- Input (Query): Pure cotton floral slip dress
- Output: product image
Example 2:
- Input (Query): Nordic light luxury side table
- Output: product image
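Cross-modal retrieval of this kind is commonly scored with Recall@K over embedding similarity. The sketch below shows the mechanics, with random embeddings standing in for real encoder outputs and a single ground-truth image per query; the official MUGE metric, and whether a query can have multiple matching products, should be confirmed on the task page.

```python
import torch
import torch.nn.functional as F


def recall_at_k(query_emb, image_emb, gt_index, k=10):
    """Fraction of queries whose ground-truth image ranks in the top k.

    query_emb: (Q, D) text-query embeddings; image_emb: (N, D) candidate
    product-image embeddings; gt_index: (Q,) index of each query's
    ground-truth image in the candidate pool (single-positive version).
    """
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(image_emb, dim=-1).T
    topk = sims.topk(k, dim=-1).indices               # (Q, k) best candidates
    hits = (topk == gt_index.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()


# Random stand-ins; a real system would embed the 5,000 queries and
# 30,000 candidate images with trained text and image encoders.
torch.manual_seed(0)
queries = torch.randn(500, 128)
candidates = torch.randn(3000, 128)
gt = torch.randint(0, 3000, (500,))
print(f"Recall@10: {recall_at_k(queries, candidates, gt):.4f}")
```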
MUGE Challenge Leaderboard
MUGE aims to address the lack of downstream-task datasets in the Chinese multimodal field and to provide researchers with a platform and evaluation benchmark for measuring the effectiveness of their models. Compared with traditional leaderboards, MUGE has more comprehensive coverage, spanning both understanding and generation tasks, and it pioneers the inclusion of text-to-image generation. Going forward, MUGE will continue to add more multimodal tasks and larger datasets, further supporting researchers and developers in improving their models.
MUGE is now open on the Alibaba Cloud Tianchi platform. Interested researchers can visit the link below to enter the MUGE leaderboard and participate in the challenge. At the end of each month, the platform will feature the top 8 contestants and give them Tianchi customized gifts!
MUGE challenge leaderboard address: https://tianchi.aliyun.com/specials/promotion/mugemultimodalunderstandingandgenerationevaluation?spm=a2c41.24125772.0.0
About M6
Previously, the Cognitive Intelligence team of the DAMO Academy Intelligent Computing Laboratory vigorously advanced research on ultra-large-scale Chinese multimodal pre-training and successively released M6, a large-scale pre-training model with tens of billions, hundreds of billions, and a trillion parameters. M6 has achieved outstanding results on downstream tasks, and the team has also explored foundational technologies for large-scale pre-training in depth, including how to train ultra-large models and how to design MoE model architectures. The M6 work has been accepted at KDD 2021.
The Cognitive Intelligence team of the DAMO Academy Intelligent Computing Laboratory is committed to advancing cognitive intelligence research. It has deployed its work at scale in many real business scenarios and has achieved world-leading results in fields such as multimodal pre-training and large-scale graph neural networks. The cognitive intelligent computing platform developed by the team won the SAIL award, the highest honor of the 2019 World Artificial Intelligence Innovation Competition, and was selected for the National Development and Reform Commission's national major construction project library. The team also won a second prize of the 2020 National Science and Technology Progress Award and was named a leading innovation team in Hangzhou. It has strong talent and technical strength and has published more than 100 papers at CCF-A conferences and in journals.
References:
[1] Junyang Lin, Rui Men, An Yang, Chang Zhou, Ming Ding, Yichang Zhang, Peng Wang, Ang Wang, Le Jiang, Xianyan Jia, Jie Zhang, Jianwei Zhang, Xu Zou, Zhikang Li, Xiaodong Deng, Jie Liu, Jinbao Xue, Huiling Zhou, Jianxin Ma, Jin Yu, Yong Li, Wei Lin, Jingren Zhou, Jie Tang, and Hongxia Yang. 2021. M6: A Chinese Multimodal Pretrainer. CoRR, abs/2103.00823.
[2] Chen, X., Fang, H., Lin, T., Vedantam, R., Gupta, S., Dollár, P., & Zitnick, C.L. (2015). Microsoft COCO Captions: Data Collection and Evaluation Server. ArXiv, abs/1504.00325.
[3] Agrawal, A., Lu, J., Antol, S., Mitchell, M., Zitnick, C.L., Parikh, D., & Batra, D. (2015). VQA: Visual Question Answering. International Journal of Computer Vision, 123, 4-31.
[4] Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., & Parikh, D. (2017). Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6325-6334.