Abstract: Multimodal machine learning aims to build models that can process and correlate information from multiple modalities. Given the heterogeneity of the data, the field of MMML (Multimodal Machine Learning) poses many unique challenges, broadly of five types: representation, translation, alignment, fusion, and co-learning.
This article is shared from the Huawei Cloud Community article "Multimodal Learning Overview", original author: Finetune expert.
Preface
A modality refers to the way in which something happens or is experienced; a research problem is multimodal when it involves several such modalities.
Multimodal machine learning aims to build models that can process and correlate information from multiple modalities.
Given the heterogeneity of the data, the field of MMML (Multimodal Machine Learning) poses many unique challenges, broadly of five types:
- Representation: learn how to exploit the complementarity and redundancy of multiple modalities to represent and summarize multimodal data. The heterogeneity of the modalities makes this difficult; for example, language is usually symbolic, while speech is usually represented as a signal.
- Translation: how to translate (map) data from one modality to another. Not only is multimodal data heterogeneous, but the relationship between modalities is often open-ended or subjective. For example, there are many correct ways to describe a picture, and a single best translation may not exist.
- Alignment: identify the direct correspondences between the elements (sub-components) of multiple modalities, for example matching each step of a recipe to the corresponding segment of a cooking video. Solving this requires measuring similarity across modalities and handling possible long-range dependencies and ambiguities.
- Fusion: join information from multiple modalities to perform inference. For example, in audio-visual speech recognition, the visual description of lip motion is fused with the audio signal to predict the spoken word. The modalities may have different predictive power and noise topology, and at least one modality may have missing data.
- Co-learning: transfer knowledge between modalities, their representations, and their predictive models. Typical applications include co-training, conceptual grounding, and zero-shot learning. It is especially valuable when one modality has limited resources (e.g., scarce labeled data).
Applications are numerous, including audio-visual speech recognition (AVSR), multimedia indexing and retrieval, understanding of social interaction behavior, video description, and more.
1. Multi-modal representation
Multimodal representation needs to solve: how to combine heterogeneous data, how to handle different levels of noise, and how to handle missing data.
Bengio et al. pointed out that a good feature representation should exhibit:
- Smoothness
- Temporal and spatial coherence
- Sparsity
- Natural clustering
Srivastava et al. add three more:
- The representation space should reflect the similarity of the corresponding concepts
- The representation should be easy to obtain even when some modalities are missing
- It should be possible to fill in missing modalities
In earlier work (before 2019), most multimodal representations simply concatenated single-modal features. There are two kinds of multimodal representation: joint representation and coordinated representation.
Joint representation
Each modality is $x_i$, and the joint representation is $x_m = f(x_1, \dots, x_n)$.
Joint representation is typically used in tasks where both training and inference involve multimodal data. The simplest method is feature concatenation.
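To make the joint formulation concrete, here is a minimal PyTorch sketch of $x_m = f(x_1, \dots, x_n)$ for two modalities, realized as concatenation followed by a small fusion network; the feature dimensions and architecture are illustrative assumptions, not from the survey.

```python
import torch
import torch.nn as nn

class JointRepresentation(nn.Module):
    """Joint representation x_m = f(x_1, x_2) via concatenation + MLP."""
    def __init__(self, dim_text=300, dim_image=2048, dim_joint=512):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(dim_text + dim_image, dim_joint),  # f(.) acts on the concatenation
            nn.ReLU(),
            nn.Linear(dim_joint, dim_joint),
        )

    def forward(self, x_text, x_image):
        return self.fusion(torch.cat([x_text, x_image], dim=-1))

# usage with dummy features (batch of 8)
x_m = JointRepresentation()(torch.randn(8, 300), torch.randn(8, 2048))
```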
Deep learning methods: deep features naturally carry high-level semantic information; the features of the last or penultimate layer are commonly used.
Since deep networks require large amounts of labeled data, unsupervised methods such as autoencoders are often used to pre-train the feature representation. Deep learning does not naturally handle missing data.
Probabilistic graphical models: use latent random variables to construct the feature representation.
The most common graphical-model-based representations are built from deep Boltzmann machines (DBM) and restricted Boltzmann machines (RBM) as modules; like deep learning they layer features, and they are unsupervised. A deep belief network (DBN) can also be used to represent each modality before forming the joint representation.
A multimodal deep Boltzmann machine can be used to learn multimodal feature representations. Thanks to its generative nature it handles missing data gracefully, even when an entire modality is missing, and it can generate samples of one modality conditioned on another. The drawback of the DBM is that it is hard to train and computationally expensive, requiring variational approximate training methods.
Sequential representation
When the data is a variable-length sequence, such as a sentence, video, or audio stream, a sequential representation is used.
RNNs and LSTMs are currently the main tools for representing single-modal sequences; the hidden state of an RNN at a given time step can be regarded as a summary of all the sequence elements up to that point. In AVSR, Cosi et al. used an RNN to represent multimodal features.
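As a rough illustration of sequential representation, the sketch below uses an LSTM whose final hidden state summarizes a feature sequence from one modality; the input and hidden dimensions are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    """Summarize a variable-length feature sequence with an LSTM."""
    def __init__(self, input_dim=40, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)

    def forward(self, x):              # x: (batch, time, input_dim)
        _, (h_n, _) = self.lstm(x)     # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)          # one vector per sequence

# e.g. 4 audio streams of 100 frames with 40-dim features each
feats = SequenceEncoder()(torch.randn(4, 100, 40))
```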
Coordinated representation
Each modality is $x_i$, with $f(x_1) \sim g(x_2)$: each modality has its own mapping function into the multimodal space. The projections are independent, but the resulting coordinated space is subject to certain constraints.
Two coordinated representation methods: similarity models and structured models. The former enforce similarity between the feature representations, while the latter impose structure on the resulting representation space.
Similarity models: a similarity model minimizes the distance between modalities in the coordinated space, for example the distance between the word "dog" and an image of a dog should be smaller than the distance between the word "dog" and an image of a car. The advantage of deep neural networks for coordinated representation is that the representation can be learned jointly, end to end.
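A minimal sketch of the similarity idea, assuming a max-margin (triplet-style) ranking loss over matched text/image embeddings; the margin value and the use of cosine similarity are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def ranking_loss(text_emb, image_emb, margin=0.2):
    """text_emb[i] and image_emb[i] are a matching pair; other rows act as negatives."""
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    sim = t @ v.t()                        # cosine similarity matrix (batch x batch)
    pos = sim.diag().unsqueeze(1)          # similarity of the true pairs
    loss = F.relu(margin + sim - pos)      # hinge on every negative pair
    loss = loss - torch.diag(loss.diag())  # ignore the diagonal (positive pairs)
    return loss.mean()

loss = ranking_loss(torch.randn(16, 128), torch.randn(16, 128))
```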
Structured coordinated space models: structured coordinated representations impose additional constraints between the modal representations; the specific structural constraints vary with the application.
Structured coordinated spaces are commonly used in cross-modal hashing: high-dimensional data is compressed into compact binary codes such that similar objects have similar codes, which is widely used for cross-modal retrieval. Hashing forces the resulting multimodal space to satisfy the following constraints: (1) it is an N-dimensional Hamming space, a binary representation with a controllable number of bits; (2) the same object in different modalities has similar hash codes; (3) the space preserves the similarity of the data.
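A toy sketch of cross-modal hashing under these constraints: each modality is projected and binarized, giving codes in a Hamming space. The random projections are a placeholder assumption; real methods learn them so that constraints (2) and (3) hold.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bits = 32
W_img = rng.normal(size=(2048, n_bits))   # hypothetical image projection
W_txt = rng.normal(size=(300, n_bits))    # hypothetical text projection

def hash_code(x, W):
    return (x @ W > 0).astype(np.uint8)   # one bit per projection direction

def hamming_distance(a, b):
    return int(np.sum(a != b))            # distance in the Hamming space

img_code = hash_code(rng.normal(size=2048), W_img)
txt_code = hash_code(rng.normal(size=300), W_txt)
print(hamming_distance(img_code, txt_code))
```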
Another kind of structured coordinated representation comes from order embeddings of images and language.
For example, Vendrov et al. enforce a dissimilarity metric in the multimodal space that captures an asymmetric partial order. The main idea is to impose a partial order, a hierarchy, on the representations of language and images; for an image, the order looks like: image of a woman walking her dog > "woman walking her dog" > "woman walking".
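A minimal sketch of an order-violation penalty in the spirit of order embeddings: the penalty is near zero when one embedding dominates the other coordinate-wise and grows with the violation, giving an asymmetric measure. The direction convention chosen here is an assumption for illustration.

```python
import torch

def order_violation(general, specific):
    """~0 when `specific` dominates `general` coordinate-wise, i.e. the
    partial order general > specific holds; grows with the violation."""
    return torch.clamp(general - specific, min=0).pow(2).sum(dim=-1)

penalty = order_violation(torch.rand(5, 64), torch.rand(5, 64))
```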
A special structured coordinated space is based on canonical correlation analysis (CCA). CCA learns linear projections that maximize the correlation between two random variables while requiring the new space to be orthogonal. CCA models are mostly used for cross-modal retrieval and for audio-visual signal analysis.
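A short sketch of linear CCA between two modalities using scikit-learn; the feature dimensions and random data are placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_text = rng.normal(size=(500, 50))       # hypothetical text features
X_image = rng.normal(size=(500, 80))      # hypothetical image features

cca = CCA(n_components=10)
cca.fit(X_text, X_image)
T_c, V_c = cca.transform(X_text, X_image)  # coordinates in the shared space
print(T_c.shape, V_c.shape)                # (500, 10) (500, 10)
```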
With the kernel trick, CCA can be extended to KCCA, but as a non-parametric method it scales poorly with the size of the training data. Deep canonical correlation analysis (DCCA) was proposed as a substitute for KCCA; it solves the scalability problem and yields a better correlated representation space. Deep correlated RBMs can also be used for cross-modal retrieval.
KCCA, CCA, and DCCA are all unsupervised methods: they only optimize the correlation of the feature representations, and thus capture features shared across modalities.
Other methods, such as deep canonically correlated autoencoders and semantic relevance maximization, are also used for structured coordinated representation.
summary:
Joint and coordinated representations are the two main approaches to multimodal feature representation.
- The joint representation method maps multimodal data into a common representation space; it is best suited to scenarios where all modalities are present at inference time.
- The coordinated representation method maps each modality into a separate but related (constrained) space; it suits scenarios where only one modality is present at inference time.
Joint representations have been used to build representations of more than two modalities, while coordinated spaces are usually limited to two modalities.
2. Multi-modal translation
Translating from one modality to another is the focus of much multimodal machine learning research.
The task of multimodal translation is: given an entity in one modality, generate the same entity in another modality. For example, given an image we can generate a sentence that describes it, or, given a text description, we can generate a matching image. Multimodal translation has been studied for a long time, with early work on speech synthesis, audio-visual speech generation, video description, and cross-modal retrieval. Recently, the convergence of the NLP and CV fields, together with large-scale multimodal datasets, has accelerated progress.
A popular application is visual scene description (image and video captioning), which requires not only recognizing the salient parts and understanding the visual scene, but also generating grammatically correct and accurate descriptive sentences.
Multimodal translation falls into two categories: example-based methods and generative methods. The former use a dictionary to perform the translation, while the latter train a model that generates the translation.
Since a generative model must produce a signal or a sequence of symbols (e.g., a sentence), the generative approach is more challenging, so many early methods favored example-based translation. With the development of deep learning, however, generative models have become capable of generating images, sounds, and text.
Example-based approach
Example-based methods are restricted to their training data, the dictionary (pairs of instances from the source and target modalities).
Two algorithms: retrieval-based methods and combination-based methods. The former use the retrieved translation directly without modification, while the latter rely on more complex rules to build the translation from a large number of retrieved examples.
Retrieval-based methods: retrieval is the simplest form of multimodal translation. It relies on finding the nearest sample in the dictionary and using it as the translation. Retrieval can be performed either in a unimodal space or in an intermediate semantic space.
Given a source-modality instance to translate, unimodal retrieval looks up the nearest source-modality instance in the dictionary and returns its paired target, essentially finding the mapping from source to target modality through KNN. Typical applications include TTS and image description. The advantage of this method is that only a single-modality representation is needed and translation reduces to retrieval; however, precisely because it is retrieval, the results usually need to be re-ranked. The deeper problem is that instances that are similar in the unimodal space are not necessarily good translations.
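A minimal sketch of retrieval-based translation with unimodal KNN lookup; the feature dimensions, dictionary, and caption strings are placeholder assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
dict_image_feats = rng.normal(size=(1000, 512))        # source-modality index
dict_captions = [f"caption {i}" for i in range(1000)]  # paired target-modality instances

index = NearestNeighbors(n_neighbors=1).fit(dict_image_feats)

def translate(query_feat):
    """Return the caption paired with the nearest dictionary image."""
    _, idx = index.kneighbors(query_feat.reshape(1, -1))
    return dict_captions[int(idx[0, 0])]

print(translate(rng.normal(size=512)))
```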
Another approach compares similarity in an intermediate semantic space. It is generally used together with a coordinated representation, since the coordinated space itself constrains the vector representations to be similar. Retrieval in the semantic space is more effective than unimodal retrieval because the search space reflects both modalities at once, which is more meaningful; it also supports translation in both directions, which is not straightforward with unimodal retrieval. However, the intermediate semantic space has to be learned, which requires a large training dictionary (pairs of source and target samples).
Combination-based methods: obtain better translations by meaningfully combining retrieval results. Combination-based media description mainly builds on the observation that descriptive sentences for images share a simple common structure. The combination rules are usually hand-crafted or heuristic.
The biggest problem with example-based methods is that the model is the entire dictionary: it keeps growing as the dataset grows and inference slows down; another problem is that unless the dictionary is very large, it cannot cover every possible source query. This can be mitigated by combining multiple models. Example-based translation is unidirectional, whereas semantic-space-based methods can translate in both directions between the source and target modalities.
Generative method
A generative method builds a model that can translate any given single-modality instance. The challenge is to understand the source modality well enough to generate the target sequence or signal, and since many translations may be correct, these methods are also harder to evaluate.
Three kinds of generative methods: grammar-based, encoder-decoder, and continuous generation models. The first uses a grammar to constrain the target domain, for example generating sentences from a <subject, object, verb> template; the encoder-decoder model first encodes the source modality into a latent representation, from which a decoder generates the target modality; the third generates the target modality continuously from a streaming source input and is especially suited to sequence translation such as TTS.
Grammar-based methods: rely on a predefined grammar to generate a specific modality. They first detect high-level concepts from the source modality, such as objects in images or actions in videos, and then feed these detections into a generation procedure based on the predefined grammar to obtain the target modality.
Some grammar-based methods rely on graphical models to generate the target modality.
Grammar-based methods tend to generate syntactically or logically correct outputs, since they rely on predefined templates and a restricted grammar. The disadvantages are that they produce formulaic rather than creative translations and cannot generate new content, and that they depend on complex concept-detection pipelines, where each concept may need its own model and its own training set.
Encoder-decoder models: currently the most popular multimodal translation technique. The core idea is to encode the source modality into a vector representation and then use a decoder module to generate the target modality. Originally used for machine translation, it has been applied successfully to image captioning and video description; it is mainly used to generate text, but can also generate images and continuous speech and sound.
Encoding: the source instance is first encoded in a modality-specific way. Popular encoders for audio signals are RNNs and DBNs; distributional semantics and RNN variants are commonly used for words and sentences; CNNs are used for images; hand-crafted features are still common for video. A coordinated representation can also be used as the unimodal encoding and tends to give better results.
Decoding: usually an RNN or LSTM, with the encoded feature representation as the initial hidden state. Venugopalan et al. showed that an LSTM decoder pre-trained on image captioning is beneficial for video description. The problem with a plain RNN is that it must generate the description from a single vector representation of the image, sentence, or video, so when a long sequence must be generated the model forgets the original input. This can be addressed with an attention mechanism, which lets the network focus on parts of the image, sentence, or video during generation. Attention-based generative RNNs have also been used to generate images from sentences; the results are far from realistic but show potential.
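A compact sketch of an encoder-decoder translation model: an image feature vector initializes an LSTM decoder that greedily emits word ids. Vocabulary size, dimensions, and the omission of attention are simplifying assumptions.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, feat_dim=2048, vocab=10000, emb=256, hidden=512):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden)   # image feature -> initial hidden state
        self.embed = nn.Embedding(vocab, emb)
        self.cell = nn.LSTMCell(emb, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, img_feat, max_len=20, bos=1):
        h = torch.tanh(self.init_h(img_feat))
        c = torch.zeros_like(h)
        token = torch.full((img_feat.size(0),), bos, dtype=torch.long)
        words = []
        for _ in range(max_len):
            h, c = self.cell(self.embed(token), (h, c))
            token = self.out(h).argmax(dim=-1)      # greedy choice of the next word
            words.append(token)
        return torch.stack(words, dim=1)            # (batch, max_len) word ids

ids = CaptionDecoder()(torch.randn(2, 2048))
```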
Although encoder-decoder networks are successful, they still face problems. Devlin et al. pointed out that the network may be memorizing the training data rather than learning to understand and generate visual scenes, observing that the results produced by a kNN model are very similar to those of the encoder-decoder network. Encoder-decoder models also require very large amounts of training data.
Continuous generation models: these are used for sequence translation, producing output online at every timestamp. They are very effective for sequence-to-sequence conversion such as text-to-speech, speech-to-text, and video-to-text.
Many other methods have been proposed for this kind of modeling: graphical models, continuous encoder-decoder approaches, and various other regression and classification methods. The additional problem these models must solve is temporal consistency between the modalities. Recently, encoder-decoder models are most often used for sequence translation.
Summary and discussion
One of the major challenges of multimodal translation is that it is hard to evaluate. Some tasks, such as speech recognition, have a single correct translation, but speech synthesis and media description do not. As in language translation, several answers may be correct, and which translation is better is often subjective. A number of automated metrics have been proposed to assist the evaluation of translation results.
Human evaluation is the ideal standard. Automated metrics such as BLEU, ROUGE, Meteor, and CIDEr, commonly used in media description, have also been proposed, but to mixed reviews.
Solving the evaluation problem is important, not only to compare different methods but also to provide better optimization objectives.
3. Multi-modal alignment
Multimodal alignment means finding correspondences between sub-components of instances from two or more modalities. For example: given a picture and a caption, find the image regions corresponding to each word or phrase; or, given a movie, align it with the subtitles or with the chapters of the book.
Multimodal alignment is divided into two categories: explicit and implicit alignment. Explicit alignment is directly concerned with the correspondence between sub-components of the modalities, such as aligning recipe steps with the corresponding segments of a cooking video; implicit alignment is used as an intermediate step of another task, such as aligning keywords with image regions in text-based image retrieval.
Explicit alignment
Measuring the similarity between sub-components is the basis of explicit alignment. The two families of algorithms are unsupervised methods and (weakly) supervised methods.
Unsupervised methods: these require no alignment annotation between the modalities. Dynamic time warping (DTW) measures the similarity of two sequences and finds an optimal match via dynamic programming. Since DTW requires a predefined similarity measure, CCA (canonical correlation analysis) can be used to map the modalities into a coordinated space first. DTW and CCA are both linear transformations and cannot capture nonlinear relationships between the modalities. Graphical models can also be used for unsupervised multimodal sequence alignment.
DTW and graphical-model approaches to multimodal alignment must respect constraints such as temporal consistency, no large jumps in time, and monotonicity. DTW can learn the similarity measure and the alignment at the same time, whereas graphical-model methods require expert knowledge during modeling.
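For concreteness, a minimal dynamic programming sketch of DTW with a predefined (Euclidean) local distance; in practice this would be combined with a learned similarity such as a CCA space, as noted above.

```python
import numpy as np

def dtw(a, b):
    """a: (n, d), b: (m, d) feature sequences -> cumulative alignment cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # predefined local distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

print(dtw(np.random.rand(50, 8), np.random.rand(60, 8)))
```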
(Weakly) supervised methods: supervised methods require annotated alignment examples to train the similarity measure used for alignment. Many supervised sequence-alignment methods are inspired by unsupervised ones, and deep learning approaches are now the most common for modal alignment.
Implicit alignment
Implicit alignment is used as an intermediate step in other tasks, such as speech recognition, machine translation, multimedia description, and visual question answering, to achieve better performance. Early work was based on graphical models; current work is mostly based on neural networks.
Graphical models: require manually constructing the mapping between the modalities.
Neural networks: modal translation is a task that can be improved when alignment is performed as an intermediate (latent) step.
Using only an encoder summarizes the entire image, sentence, or video into a single vector representation through its weights; introducing an attention mechanism lets the decoder focus on sub-components, attending to specific regions of an image, words of a sentence, or segments of a video during generation.
The attention mechanism can be regarded as the standard way of performing modal alignment in deep learning.
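A minimal sketch of attention as implicit alignment: a decoder state scores each image region, and the softmax weights indicate which regions the model aligns to while generating the next word. The dot-product scoring and the dimensions are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, region_feats):
    """decoder_state: (batch, d); region_feats: (batch, n_regions, d)."""
    scores = torch.bmm(region_feats, decoder_state.unsqueeze(-1)).squeeze(-1)
    alpha = F.softmax(scores, dim=-1)                     # alignment weights over regions
    context = torch.bmm(alpha.unsqueeze(1), region_feats).squeeze(1)
    return context, alpha                                 # attended summary + weights

ctx, weights = attend(torch.randn(2, 512), torch.randn(2, 49, 512))
```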
summary
Modal alignment faces several difficulties: there are few datasets with explicitly annotated alignments; it is hard to design a similarity measure between modalities; and there may be many possible alignments, with elements in one modality having no counterpart in the other.
4. Multi-modal fusion
Multimodal fusion integrates information from multiple modalities for classification or regression tasks. Research on multimodal fusion goes back 25 years. Its benefits are: (1) different modal views of the same phenomenon yield more robust predictions; (2) complementary information can be captured that is not visible in any single modality; (3) a multimodal system can still operate when one modality is missing.
The boundary between multimodal representation and fusion is becoming increasingly blurred, because in deep learning, representation learning and the classification/regression task are intertwined.
Two families of multimodal fusion methods: model-agnostic and model-based. The former do not depend on a specific machine learning method, while the latter perform fusion as part of model construction (kernel methods, graphical models, neural networks).
Model-agnostic approach
There are three model-agnostic methods: early fusion, late fusion, and hybrid fusion. Early fusion operates at the feature level, late fusion combines inference results, and hybrid fusion uses both.
The advantage of model-agnostic fusion is that it is compatible with any classifier or regressor.
Early fusion can be regarded as an early attempt at multimodal representation.
Late fusion combines the predictions of the individual modalities through a voting mechanism, weighting, signal variance, or a learned model. Late fusion ignores the interactions between the low-level features of the modalities.
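A small sketch contrasting early and late fusion in a model-agnostic pipeline; the classifiers and the random features are placeholder assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_audio = rng.normal(size=(200, 20))
X_video = rng.normal(size=(200, 50))
y = rng.integers(0, 2, size=200)

# Early fusion: concatenate features, train a single classifier.
early = LogisticRegression(max_iter=1000).fit(np.hstack([X_audio, X_video]), y)

# Late fusion: one classifier per modality, then average their predicted scores.
clf_a = LogisticRegression(max_iter=1000).fit(X_audio, y)
clf_v = LogisticRegression(max_iter=1000).fit(X_video, y)
late_scores = (clf_a.predict_proba(X_audio) + clf_v.predict_proba(X_video)) / 2
late_pred = late_scores.argmax(axis=1)
```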
Model-based approach
Multiple kernel learning (MKL): an extension of kernel SVMs that uses a different kernel for each modality.
MKL was the most commonly used method before deep learning. Its advantage is a convex loss function, so standard optimization packages and global optimization methods can be used for training; its disadvantages are dependence on the training data at test time and slow inference.
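A rough sketch in the spirit of MKL: one kernel per modality, combined and fed to an SVM with a precomputed kernel. Real MKL learns the kernel weights jointly; the fixed 0.5/0.5 weights here are a simplifying assumption.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_text = rng.normal(size=(200, 300))
X_image = rng.normal(size=(200, 512))
y = rng.integers(0, 2, size=200)

# One kernel per modality, combined with fixed weights into a single Gram matrix.
K = 0.5 * linear_kernel(X_text) + 0.5 * rbf_kernel(X_image)
svm = SVC(kernel="precomputed").fit(K, y)
print(svm.score(K, y))
```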
Graphical models
This review considers only shallow graphical models; for deep graphical models such as the DBN, see the earlier sections. Most graphical models fall into two categories: generative (modeling the joint probability) and discriminative (modeling the conditional probability).
Graphical models can easily exploit the spatial and temporal structure of the data, allow expert knowledge to be embedded in the model, and yield interpretable models.
Neural Networks
The modalities and optimization methods used by neural networks for fusion may differ, but the idea of fusing information through joint hidden layers is the same. Neural networks are also used for temporal multimodal fusion, usually with RNNs and LSTMs. Typical applications are audio-visual emotion classification and image captioning.
Advantages of deep neural networks for fusion: (1) they can learn from large amounts of data; (2) multimodal feature representation and fusion can be learned end to end; (3) compared with non-deep methods they perform better and can learn more complex decision boundaries.
Disadvantages: poor interpretability (it is hard to know what the prediction is based on or what role each modality plays), and a lot of training data is needed to get good results.
summary
Multimodal fusion faces the following challenges: (1) the signals may not be temporally aligned, e.g., a dense continuous signal vs. sparse events; (2) it is difficult to build models that exploit supplementary rather than merely complementary information; (3) each modality may exhibit different types and levels of noise at different points in time.
5. Multi-modal co-learning
Multimodal co-learning aims to aid the modeling of one modality by exploiting information from another modality.
Relevant scenarios: a modality with limited resources, a lack of annotated data, noisy inputs, or unreliable labels.
Three co-learning settings based on the data: parallel, non-parallel, and hybrid. The first requires a direct link between observations in one modality and observations in another, e.g., audio-visual speech data in which the video and speech samples come from the same speaker. Non-parallel data does not require such a direct link and usually relies on overlap between categories; for example, in zero-shot learning, Wikipedia text is used to extend a conventional visual object recognition dataset and improve recognition performance. Hybrid data connects the modalities through a shared modality or dataset.
Parallel data
The modalities share a set of instances. Two methods: co-training and representation learning.
Co-training: when a modality has very little labeled data, co-training can be used to generate more labeled training data, or the disagreement between modalities can be used to filter out unreliable labeled samples.
Co-training can generate more labeled data, but it may also lead to overfitting.
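A minimal sketch of one co-training round on parallel data, assuming two views (audio and video features of the same instances), logistic regression base classifiers, and a fixed number of confident pseudo-labels per round; all of these choices are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_training_round(Xa_l, Xv_l, y_l, Xa_u, Xv_u, k=10):
    """One round over two views (audio Xa, video Xv) of the same instances."""
    clf_a = LogisticRegression(max_iter=1000).fit(Xa_l, y_l)
    clf_v = LogisticRegression(max_iter=1000).fit(Xv_l, y_l)

    # Each classifier selects the k unlabeled examples it is most confident about.
    pick_a = np.argsort(-clf_a.predict_proba(Xa_u).max(axis=1))[:k]
    pick_v = np.argsort(-clf_v.predict_proba(Xv_u).max(axis=1))[:k]

    # Its predictions become pseudo-labels added to the shared labeled pool.
    new_y = np.concatenate([clf_a.predict(Xa_u[pick_a]), clf_v.predict(Xv_u[pick_v])])
    new_idx = np.concatenate([pick_a, pick_v])
    return (np.vstack([Xa_l, Xa_u[new_idx]]),
            np.vstack([Xv_l, Xv_u[new_idx]]),
            np.concatenate([y_l, new_y]))

# usage: Xa_l, Xv_l, y_l = co_training_round(Xa_l, Xv_l, y_l, Xa_u, Xv_u)
```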
Transfer learning: a multimodal Boltzmann machine or multimodal autoencoder can transfer the feature representation of one modality to another, which not only yields multimodal representations but also improves performance when only a single modality is available during inference.
Non-parallel data
No shared instances between the modalities are needed; shared categories or concepts are enough.
Transfer learning: representations learned from a modality with plentiful, clean data can be transferred to a modality with scarce, noisy data. This kind of transfer is usually implemented through a coordinated multimodal representation.
Conceptual grounding: learns semantic meaning through language plus additional modalities such as vision, sound, or even taste. Semantic meaning cannot be learned well from text alone; when people learn a concept, they use all of their perceptual input rather than mere symbols.
Grounding usually works by finding a common latent space between the feature representations, or by learning each modality's representation separately and then concatenating them. Conceptual grounding overlaps heavily with multimodal alignment, since aligning a visual scene with its description itself yields better textual or visual representations.
Note that grounding does not improve performance in every situation; it helps only when grounding is relevant to the task, such as grounding with images in visually related tasks.
Zero-shot learning: the ZSL task is to recognize a concept without having seen any labeled examples of it, for example classifying cats in images without being given any cat pictures during training.
Two methods: unimodal and multimodal.
Unimodal method: relies on the components or attributes of the category to be recognized, for example visually predicting an unseen class from previously seen attributes such as color, size, and shape.
Multimodal method: uses information from another modality in which the category has already been observed.
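A toy sketch of the multimodal zero-shot idea: an image embedding is compared against class embeddings taken from another modality (e.g., word vectors), so a class never seen in images can still be predicted. The random embeddings are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
class_names = ["cat", "dog", "zebra"]                        # "zebra" has no training images
class_emb = {c: rng.normal(size=300) for c in class_names}   # e.g. word vectors of the names

def zero_shot_predict(image_emb):
    """Pick the class whose text embedding is closest (cosine) to the image embedding."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(class_names, key=lambda c: cos(image_emb, class_emb[c]))

print(zero_shot_predict(rng.normal(size=300)))
```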
Hybrid data: two non-parallel modalities are connected through a shared modality or dataset. A typical task is image description in multiple languages: the image is linked to at least one language, and a machine translation task can establish the link between languages.
If the target task has only a small amount of annotated data, similar or related tasks can also be used to improve performance, for example using a large text corpus to guide an image segmentation task.
summary
Multimodal co-learning exploits complementary information between modalities, so that one modality influences the training of another.
Multimodal co-learning is task-independent and can be used to improve multimodal fusion, translation, and alignment.