
(Reproduced from Microsoft Research AI headlines)

Editor's note: In recent years, causal machine learning has had a significant impact on artificial intelligence and many intersecting fields, and has attracted growing attention. With the help of causal reasoning, the robustness, generalization ability, and interpretability of machine learning models can be effectively improved. Today we introduce three NeurIPS 2021 papers on causal machine learning from Microsoft Research Asia that present the latest research progress in this field. The papers cover methods and theory for learning causality in three types of tasks: out-of-distribution prediction from a single training domain, domain generalization across multiple training domains, and imitation learning, and demonstrate how causality improves a model's robustness when the environment and data distribution change. In the future, Microsoft Research Asia will further promote the application of these machine learning methods to more demanding real-world tasks.

In recent years, as the performance of machine learning models has continued to improve, people are no longer satisfied with their performance on standard datasets, and instead hope that they will perform stably and reliably in real application scenarios. An important challenge in achieving this goal is that the real-world environment usually differs from the clean, standard training dataset: the data distribution shifts and out-of-distribution examples are encountered, and the model may not give reasonable results in the new environment.

This places a new requirement on machine learning models: a model needs to learn the essential reasons and rules behind its predictions or judgments, instead of relying on superficial correlations, because the latter may only appear in a specific environment, while only the former remain valid after the environment changes and can give reasonable results for out-of-distribution examples. This leads to the new research direction of causal machine learning. At NeurIPS 2021, researchers from Microsoft Research Asia published a series of research results in this field.

Learning causal semantic representations for out-of-distribution prediction

d71d48879de6236e48f4260685e16fd9.png

People have found that standard supervised learning methods, especially deep learning methods, perform poorly when predicting out-of-distribution examples. In the example in Figure 1 [Ribeiro'16], if most of the "husky" pictures in the training set have dark backgrounds and most of the "wolf" pictures have snow backgrounds, then for a test sample of a "husky" in the snow, the model will predict "wolf". Visualizing the model shows that it pays more attention to the background: in such a dataset, the background is strongly correlated with the foreground object and is an even more discriminative feature, yet only the foreground object determines the label of the picture.

df0cdbde2b6abcd7f466902555035835.png
(Figure 1: Challenges of out-of-distribution prediction tasks)

Therefore, researchers at Microsoft Research Asia hope that the model can learn features like the foreground object to make predictions. This goal can be formally described with the theory of causality, which defines causal relations through the behavior of a system under intervention: if changing the value of variable A through an intervention changes the value of variable B, but intervening on B does not change A, then A is the cause of B and B is the effect of A, denoted A→B. For example, cities at higher altitudes usually have lower average temperatures, but from "altitude-temperature" pair data alone it is impossible to tell which is the cause and which is the effect. People know that altitude is a cause of temperature because if a giant lifting machine raised a city's altitude, its temperature would drop, whereas if a huge heater raised the city's temperature, the city would not sink by itself. In the same way, if the background of a picture x is forcibly changed while the foreground object is kept unchanged, the label y of the picture should not change, but changing the foreground object will change y. Therefore, the researchers want the model to learn the factor in x that causes y, called the "semantic factor" s, such as the foreground object, as distinguished from the "variation factor" v, such as the background. Only by identifying s can out-of-distribution prediction be done well.

Based on this causal perspective, the researchers proposed the Causal Semantic Generative model (CSG), as shown in Figure 2(a) (note that, following the considerations above, the edge v→y is removed). In addition, as in the example above, s and v are often correlated in certain environments; for example, "huskies"/"wolves" often appear together with dark/snow backgrounds. But this correlation does not come from a causal relationship between the two: putting a "husky" in the snow will not turn it into a "wolf", nor will it darken the background. So the researchers use an undirected edge to connect them. This is different from most existing works, which assume that the latent factors are independent.
827296384eb66171d32706bcf68bc32e.png
(Figure 2: Causal semantic generative model (a) and its variants (b, c) used in the test domain)
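To make the structure in Figure 2(a) concrete, here is a minimal sketch of how a CSG could be organized as code. The Gaussian prior, the MLP decoder and classifier, and all sizes are illustrative assumptions rather than the authors' implementation; the point is only which variables feed which mechanisms.

```python
# Minimal sketch of the Causal Semantic Generative model (CSG) in Figure 2(a).
# Parameterizations (Gaussians, MLP sizes) are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.distributions as D

class CSG(nn.Module):
    def __init__(self, s_dim=8, v_dim=8, x_dim=784, n_classes=2):
        super().__init__()
        self.s_dim, self.v_dim = s_dim, v_dim
        z_dim = s_dim + v_dim
        # Prior p(s, v): one correlated Gaussian, so s and v can be dependent
        # (the undirected edge s -- v) without any causal arrow between them.
        self.prior_mean = nn.Parameter(torch.zeros(z_dim))
        self.prior_scale = nn.Parameter(torch.eye(z_dim))
        # Causal mechanism p(x | s, v): both factors generate the image.
        self.decoder = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                     nn.Linear(256, x_dim))
        # Causal mechanism p(y | s): only the semantic factor determines the label.
        self.classifier = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(),
                                        nn.Linear(64, n_classes))

    def prior(self):
        # Lower-triangular scale with a positive diagonal gives a valid covariance.
        diag = torch.diag_embed(nn.functional.softplus(torch.diagonal(self.prior_scale)))
        tril = torch.tril(self.prior_scale, diagonal=-1) + diag
        return D.MultivariateNormal(self.prior_mean, scale_tril=tril)

    def sample(self, n=1):
        sv = self.prior().sample((n,))
        s = sv[:, :self.s_dim]
        x_mean = self.decoder(sv)          # mean of p(x | s, v)
        y_logits = self.classifier(s)      # logits of p(y | s)
        return x_mean, y_logits
```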

Causal invariance and out-of-distribution prediction

This model, which embodies the causal structure, can help make out-of-distribution predictions. The starting point is "causal invariance": causal relationships do not change with the environment or domain. This is because causal mechanisms reflect basic laws of nature, such as the process by which objects and backgrounds in a scene are imaged by a camera, i.e. p(x|s,v), and the process of annotating based on the essential characteristics of the object, i.e. p(y|s). The domain change comes from a change of the prior distribution p(s,v). For example, p(s,v) in the training environment gives larger values to ("husky", dark background) and ("wolf", snow background), while in the test environment it is the opposite.

In contrast, current mainstream domain adaptation and domain generalization methods use the same encoder in different domains to infer the latent factors, which implicitly assumes "inference invariance". The researchers argue that inference invariance is a special case of causal invariance. In the examples that support inference invariance, such as inferring the position of an object from a picture, the causal generation mechanism p(x|s,v) is almost deterministic and invertible, meaning that for a given x there is only one value of "object position" (a component of s) that makes p(x|s,v) non-zero. Since p(x|s,v) is causally invariant, this inference is also invariant. But when p(x|s,v) is noisy or degenerate, making inferences based on p(x|s,v) alone becomes ambiguous. For example, the digit in the left picture of Figure 3 may be generated from a "5" or a "3", and in the right picture, whether face A or face B is closer to us, the same picture is obtained. In this case, the inference result given by Bayes' rule p(s,v|x) ∝ p(s,v) p(x|s,v) is clearly affected by the prior. Since priors change with the environment (the preference over possible inference results varies from person to person), inference invariance no longer holds, while causal invariance is still reliable.

4684552b6c9af6cb619113854f8e425f.png

(Figure 3: When the generation mechanism p(x|s,v) is noisy (left) or degenerate (right), the inference result is ambiguous, so inference invariance is no longer reliable)

Based on causal invariance, the researchers give principles for making predictions on the test domain. This paper considers two out-of-distribution prediction tasks: "out-of-distribution generalization" (OOD generalization) and "domain adaptation". Both have only one training domain (so out-of-distribution generalization here differs from domain generalization; the next work addresses the domain generalization task), but domain adaptation has unsupervised data on the test domain, while in out-of-distribution generalization nothing is known about the test domain.

From causal invariance, the causal data generation mechanisms p(x|s,v) and p(y|s) still apply in the test domain, but the prior distribution changes. For out-of-distribution generalization, all possible test-domain priors need to be considered, so the researchers proposed to use an independent prior distribution p^⊥(s,v) := p(s)p(v), where p(s) and p(v) are the marginal distributions of the training-domain prior p(s,v). This choice removes the spurious correlation between s and v in the training domain, and since p^⊥(s,v) has a larger entropy than p(s,v), it discards information exclusive to the training domain, which makes the model rely more on the causally invariant generation mechanisms for prediction. This prediction method is called CSG-ind. For domain adaptation, the unsupervised data can be used to learn the test-domain prior p̃(s,v) for prediction, and the corresponding method is called CSG-DA. These two models are shown in Figure 2(b,c). It is worth noting that because CSG uses a different prior distribution on the test domain than on the training domain, the prediction rule p(y|x) on the test domain differs from that on the training domain; therefore, this method is strictly different from methods based on inference invariance.
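A minimal sketch of this construction, assuming the joint prior p(s,v) is the Gaussian from the earlier sketch: the independent prior keeps the two marginal blocks of the covariance and zeroes the cross-covariance, which is exactly p(s)p(v) in the Gaussian case.

```python
# Sketch: building the independent prior p^⊥(s, v) := p(s) p(v) from the
# marginals of a fitted joint Gaussian prior p(s, v). The Gaussian form is an
# illustrative assumption; the key point is dropping the s-v correlation.
import torch
import torch.distributions as D

def independent_prior(joint: D.MultivariateNormal, s_dim: int) -> D.MultivariateNormal:
    mean = joint.mean
    cov = joint.covariance_matrix
    cov_ind = torch.zeros_like(cov)
    cov_ind[:s_dim, :s_dim] = cov[:s_dim, :s_dim]      # covariance of the marginal p(s)
    cov_ind[s_dim:, s_dim:] = cov[s_dim:, s_dim:]      # covariance of the marginal p(v)
    return D.MultivariateNormal(mean, covariance_matrix=cov_ind)
```

In the Gaussian case, dropping the cross-covariance can only increase the determinant of the covariance, which matches the remark above that p^⊥(s,v) has a larger entropy than p(s,v).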

Method

In fact, whichever method is used, it first needs to fit the training data well, since this is the source of all supervision information. Because CSG involves latent variables, it is hard to directly compute the log-likelihood log p(x,y) for training, so the researchers used variational Bayes to optimize a lower bound that can be adaptively tightened, known as the Evidence Lower BOund (ELBO). Although it is standard practice to introduce an inference model of the form q(s,v|x,y), such a model does not help make predictions. For this reason, the researchers consider a model of the form q(s,v,y|x) to represent the required inference model q(s,v|x,y) = q(s,v,y|x) / ∫ q(s,v,y|x) ds dv. Furthermore, substituting it into the ELBO shows that the target of this new q(s,v,y|x) model is exactly the corresponding distribution p(s,v,y|x) defined by the CSG model, and this distribution can be decomposed as p(s,v,y|x) = p(s,v|x) p(y|s), where p(y|s) is already explicitly given by the CSG model and only p(s,v|x) is hard to compute. Therefore, the researchers finally adopted an inference model of the form q(s,v|x) to approximate the smallest intractable part p(s,v|x), and substituted it into the ELBO to obtain the training objective.
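Here is a minimal sketch of such a training objective, assuming a diagonal-Gaussian q(s,v|x) produced by a hypothetical `encoder` network and a unit-variance Gaussian p(x|s,v). It uses q(s,v|x) directly as the variational distribution, which by Jensen's inequality gives a valid, if looser, lower bound on log p(x,y); the exact objective the paper derives from the q(s,v,y|x) substitution is not reproduced here.

```python
# Sketch of an ELBO-style training objective for CSG with an inference model q(s, v | x).
import torch
import torch.distributions as D

def csg_elbo(model, encoder, x, y, n_samples=4):
    """model: a CSG as sketched earlier; encoder(x) -> (mean, log_std) of q(s, v | x)."""
    mean, log_std = encoder(x)                              # [B, z] each
    std = log_std.exp()
    eps = torch.randn(n_samples, *mean.shape)
    sv = mean + std * eps                                   # reparameterized samples, [K, B, z]
    s = sv[..., :model.s_dim]

    log_q = D.Normal(mean, std).log_prob(sv).sum(-1)        # log q(s, v | x)
    log_prior = model.prior().log_prob(sv)                  # log p(s, v)
    x_mean = model.decoder(sv)
    log_px = D.Normal(x_mean, 1.0).log_prob(x).sum(-1)      # log p(x | s, v), unit variance assumed
    log_py = D.Categorical(logits=model.classifier(s)).log_prob(y)  # log p(y | s)

    # Jensen: log p(x, y) >= E_q[log p(s,v) + log p(x|s,v) + log p(y|s) - log q(s,v|x)]
    return (log_prior + log_px + log_py - log_q).mean()
```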

For CSG-ind, on the one hand an inference model q^⊥(s,v|x) for the independent prior p^⊥(s,v) is needed for prediction, and on the other hand an inference model q(s,v|x) on the training domain is needed for training. To avoid the trouble of maintaining two inference models, the researchers found that q^⊥(s,v|x) can be used to represent q(s,v|x). This is because the two models respectively target p(s,v|x) defined by CSG and p^⊥(s,v|x) defined by CSG-ind, and according to the relationship between these targets, q(s,v|x) = (p(s,v) / p^⊥(s,v)) (p^⊥(x) / p(x)) q^⊥(s,v|x), so when q^⊥(s,v|x) reaches its target, the corresponding q(s,v|x) also reaches its target. Substituting this formula into the ELBO gives the training objective of CSG-ind as:

f6b9f01cf053b035ec01c327afda76c3.png

where π(y|x) := E_{q^⊥(s,v|x)}[ (p(s,v)/p^⊥(s,v)) p(y|s) ]. The expectations in the formula can be estimated by Monte Carlo after reparameterizing the inference model. The prediction is given by p^⊥(y|x) = E_{p^⊥(s,v|x)}[p(y|s)] ≈ E_{q^⊥(s,v|x)}[p(y|s)].
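A minimal sketch of this Monte Carlo prediction rule, assuming a diagonal-Gaussian q^⊥(s,v|x) produced by a hypothetical `encoder_ind` network; the sample count is an arbitrary choice.

```python
# Sketch of the CSG-ind prediction rule: p^⊥(y | x) ≈ E_{q^⊥(s,v|x)}[p(y | s)],
# estimated by Monte Carlo with reparameterized samples from q^⊥(s, v | x).
import torch

@torch.no_grad()
def csg_ind_predict(model, encoder_ind, x, n_samples=16):
    """encoder_ind(x) -> (mean, log_std) of the inference model q^⊥(s, v | x)."""
    mean, log_std = encoder_ind(x)
    eps = torch.randn(n_samples, *mean.shape)
    sv = mean + log_std.exp() * eps                 # samples (s, v) ~ q^⊥(s, v | x)
    s = sv[..., :model.s_dim]
    probs = model.classifier(s).softmax(-1)         # p(y | s) for each sample
    return probs.mean(0)                            # ≈ p^⊥(y | x)
```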

CSG-DA is similar to CSG-ind, so the researchers likewise use the test-domain inference model q̃(s,v|x) to represent q(s,v|x) and write the objective function on the training domain in the same way. On the test domain, CSG-DA also needs to learn the test-domain prior p̃(s,v) by fitting the unsupervised data, which can be achieved with a standard ELBO:

b0ef2a20af0eb4e6dcb7aae869db3925.png
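A minimal sketch of this unsupervised ELBO on the test domain, assuming a diagonal-Gaussian q̃(s,v|x) from a hypothetical `encoder_test` network and a unit-variance Gaussian p(x|s,v) shared with the training domain; `prior_test` stands for the learnable test-domain prior p̃(s,v).

```python
# Sketch of the standard unsupervised ELBO used by CSG-DA to fit the test-domain
# prior p̃(s, v) on unlabeled test inputs, while p(x | s, v) is shared across domains.
import torch
import torch.distributions as D

def csg_da_unsup_elbo(model, prior_test, encoder_test, x, n_samples=4):
    """encoder_test(x) -> (mean, log_std) of q̃(s, v | x); prior_test() -> a torch Distribution."""
    mean, log_std = encoder_test(x)
    std = log_std.exp()
    eps = torch.randn(n_samples, *mean.shape)
    sv = mean + std * eps                                        # reparameterized samples
    log_q = D.Normal(mean, std).log_prob(sv).sum(-1)             # log q̃(s, v | x)
    log_prior = prior_test().log_prob(sv)                        # log p̃(s, v)
    x_mean = model.decoder(sv)                                   # shared mechanism p(x | s, v)
    log_px = D.Normal(x_mean, 1.0).log_prob(x).sum(-1)
    return (log_prior + log_px - log_q).mean()                   # lower bound on log p̃(x)
```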

Theory

It can be proved that, under certain conditions, the CSG model can identify the semantic factor from a single training domain, and this semantic identifiability guarantees the performance of CSG in out-of-distribution prediction (see the original paper for a detailed description). The researchers define "a CSG identifies the semantics" as the existence of a reparameterization that transforms the ground-truth CSG into the learned CSG without mixing the real v into the learned s.

Theorem (semantic identifiability on a single training domain): Assume that p(x|s,v) and p(y|s) take the additive-noise form p_noise(variable − function(conditioning variable)), that the function is bijective, and that log p(s,v) is bounded. Then, when the noise variance σ_μ^2 tends to 0, or when the noise has an almost-everywhere non-zero characteristic function, a well-learned CSG (i.e. one with p(x,y) = p^*(x,y)) identifies the semantics.

Interpretation: Identifiability is hard to obtain from a single training domain, so some requirements are necessary. Otherwise, if all the "huskies" in the training domain are on dark backgrounds and all the "wolves" are in the snow, then no one, however clever, could tell whether the label is determined by the foreground object or by the background. The condition in the theorem that log p(s,v) is bounded is precisely for this point, because in the above case p(s,v) is concentrated on a curve (s, v(s)) and the density function is unbounded. And if this boundedness condition is satisfied, then whenever the learned CSG mixes the real v into its s, the randomness between the real s and v brings extra noise to the prediction on the training set, so that this CSG is not "well-learned". This is the intuition behind the theorem.

Theorem (semantic identifiability guarantees out-of-distribution generalization): The prediction error of a semantically identified CSG on an unknown test domain is bounded: E_{p̃^*(x)} ‖E[y|x] − Ẽ^*[y|x]‖_2^2 ≤ C σ_μ^4 E_{p̃(s,v)} ‖∇ log( p̃(s,v) / p(s,v) )‖_2^2, where C is a specific constant.

In this theorem, the term E_{p̃(s,v)} ‖∇ log( p̃(s,v) / p(s,v) )‖_2^2 is exactly the Fisher divergence D_F(p̃(s,v), p(s,v)) between the priors of the two domains, which measures how different the two domains are in the sense relevant to the prediction error. In addition, a smaller Fisher divergence D_F(p̃(s,v), ·) requires the second distribution to have a larger support than p̃(s,v), and p^⊥(s,v) has exactly a larger support than p(s,v), which shows that CSG-ind has a smaller prediction error bound than CSG!

Theorem (semantic identifiability guarantees domain adaptation): For a semantically identified CSG that fits the test-domain data (i.e. p̃(x) = p̃^*(x)), the learned test-domain prior p̃(s,v) is the reparameterization of the real test-domain prior p̃^*(s,v), and the prediction rule it gives is accurate, i.e. Ẽ[y|x] = Ẽ^*[y|x].

Experiments

The researchers designed a "shifted MNIST" dataset containing only the digits 0 and 1, in which the 0s in the training data are shifted to the left by 5 pixels with noise and the 1s are shifted to the right. Besides the original test set, the researchers also consider a test set where the digits are shifted with zero-mean noise. More realistic tasks include ImageCLEF-DA, PACS, and VLCS (in the appendix). The results in Table 1 show that for out-of-distribution generalization, CSG outperforms standard supervised learning (cross entropy, CE) and the discriminative causal method CNBB, and CSG-ind in turn outperforms CSG, indicating the benefit of using the independent prior for prediction. For domain adaptation, CSG-DA also outperforms currently popular methods. The visual analysis in Figure 4 shows that the proposed methods pay more attention to the regions and shapes in the image that carry semantic information.

bcfe00e0edec023648987a5ae44f9f78.png

(Table 1: Performance (prediction accuracy %) of each method on shifted MNIST (first two rows), ImageCLEF-DA (middle four rows), and PACS (last four rows) for the out-of-distribution generalization (left four columns) and domain adaptation (right five columns) tasks; the proposed methods are in bold)

23152cab27b18aa4d31dfd8c251e1ba3.png

(Figure 4: Visualization results of each method on the out-of-distribution generalization (upper two rows) and domain adaptation (lower two rows) tasks (based on LIME [Ribeiro'16]))

Finding latent causal factors for generalization under distribution shift

3c97acd2954ee8c2a52573220bad520e.png

This paper extends the CSG model to the case of multiple training domains, i.e. to domain generalization tasks, and gives the corresponding algorithms and theory. To model the relationship with the domain index d, the prior distribution is now denoted p^d(s,v). To avoid the graphical model implying that s and v are independent given d, both in the model and in the algorithms and theory, the researchers introduce a confounder c. It explains the spurious correlation between s and v: although there is no causal relationship between s and v, they appear correlated once c is ignored, since p^d(s,v) = ∫ p^d(c) p^d(s|c) p^d(v|c) dc. The extended model is shown in Figure 5 and is called the Latent Causal Invariant Model (LaCIM).
bd895d5118a4635f32894f8fc123b057.png
(Figure 5: The Latent Causal Invariant Model (LaCIM))
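To make the role of the confounder concrete, here is a minimal sketch of such a per-domain prior as a mixture model, assuming a finite confounder and Gaussian components (both illustrative choices): s and v are conditionally independent given c, but become correlated once c is summed out.

```python
# Sketch of the LaCIM domain prior p^d(s, v) = Σ_c p^d(c) p^d(s|c) p^d(v|c).
import torch
import torch.distributions as D

def lacim_domain_prior(weights, means_s, means_v, std=1.0):
    """weights: [C] mixing probabilities p^d(c); means_s: [C, s_dim]; means_v: [C, v_dim]."""
    mix = D.Categorical(probs=weights)                           # p^d(c)
    comp = D.Independent(
        D.Normal(torch.cat([means_s, means_v], dim=1), std), 1)  # p^d(s | c) p^d(v | c)
    return D.MixtureSameFamily(mix, comp)                        # marginal p^d(s, v)
```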

LaCIM's training method is similar to that of CSG, except that the objective functions over all training domains are summed, using each domain's own prior model p^d(s,v) and inference model q^d(s,v|x). The prediction method is similar to CSG-ind, except that (s,v) is not inferred with an inference model but directly via maximum a posteriori (MAP) estimation: p^{d'}(y|x) = p(y|s(x)), where (s(x), v(x)) := argmax_{s,v} p(x|s,v) p^⊥(s,v)^λ.
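A minimal sketch of this MAP prediction rule, assuming Gaussian observation noise for p(x|s,v) and plain gradient ascent with Adam; the optimizer, step count, and learning rate are illustrative choices rather than the authors' settings, and `prior_ind` stands for the independent prior p^⊥(s,v).

```python
# Sketch of LaCIM test-time MAP inference: find (s, v) maximizing
# p(x | s, v) · p^⊥(s, v)^λ by gradient ascent, then predict with p(y | s).
import torch

def lacim_map_predict(model, prior_ind, x, lam=1.0, steps=100, lr=0.1):
    """model: generative networks (decoder/classifier) as before; prior_ind: a torch Distribution."""
    sv = torch.zeros(x.shape[0], model.s_dim + model.v_dim, requires_grad=True)
    opt = torch.optim.Adam([sv], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        x_mean = model.decoder(sv)
        log_px = torch.distributions.Normal(x_mean, 1.0).log_prob(x).sum(-1)
        log_prior = prior_ind.log_prob(sv)
        loss = -(log_px + lam * log_prior).sum()    # maximize p(x|s,v) p^⊥(s,v)^λ
        loss.backward()
        opt.step()
    s = sv.detach()[:, :model.s_dim]
    return model.classifier(s).softmax(-1)          # p(y | s(x))
```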

Theory

Since the relationship between each domain's distribution and the domain index d needs to be modeled, more structure is required in the theoretical analysis. The researchers therefore assume c ∈ [C] := {1,...,C} and that p^d(s|c) and p^d(v|c) belong to exponential families, and then define the corresponding notion of identifiability, called exponential identifiability: there exists a reparameterization transforming the real LaCIM into the learned LaCIM that recovers the sufficient statistics of the real p^d(s|c) and p^d(v|c).

Theorem (exponential identifiability over multiple training domains): Assume that p(x|s,v) and p(y|s) take specific additive-noise forms, and that the sufficient statistics of p^d(s|c) and p^d(v|c) are linearly independent. Then, when the training domains are sufficiently diverse in a specific sense, a well-learned LaCIM achieves exponential identifiability.

The conclusion of this theorem (exponential identifiability) is stronger than that of the identifiability theorem on a single training domain (semantic identifiability). This is reflected in the fact that the former not only requires, as the latter does, that the learned s is not mixed with the real v, but also requires that the learned v is not mixed with the real s, i.e. the learned s and v are disentangled. A stronger conclusion can be obtained because multiple sufficiently diverse training domains bring more information to the model, and the exponential family also gives the model a more specific structure. In addition, this conclusion is also stronger than that of identifiable-VAE [Khemakhem'20], because it requires that the permutation of the components of the sufficient statistics cannot mix components across s and v.

Experiments

In the experiments, the researchers selected several recent domain generalization datasets, including the NICO natural image dataset, colored MNIST, and the ADNI dataset for predicting Alzheimer's disease. The results in Table 2 show that LaCIM achieves the best performance. Notably, LaCIM also performs better than LaCIMz, a variant that does not distinguish s from v, which illustrates the benefit of modeling s and v separately. The visual analysis in Figure 6 shows that LaCIM distinguishes the semantic factor from the variation factor well and focuses on regions of the image that carry semantic information.

6676415d6fb2ed470ac7f64b30fae7ce.png

(Table 2: The performance of each method on each domain generalization dataset (prediction accuracy %))

a3742f3049b3b0fd2610327c38aab57d.png
(Figure 6: Visualization results of each method in the domain generalization task)

An object-aware regularization method for addressing causal confusion in imitation learning

9888e1a314ae26d748f37fee5883e0f1.png

This paper on causal machine learning focuses on the problem of causal confusion in imitation learning. Imitation learning learns a policy model from expert demonstrations; it can use existing data to avoid or reduce dangerous or costly interaction with the environment. Behavioral cloning (BC) is a simple and effective method that treats imitating expert demonstrations as a supervised learning task, i.e. using the state s to predict the action a. However, this method often suffers from causal confusion: the learned policy focuses on the obvious effects of the expert's action rather than its causes (i.e. the objects that the expert's policy attends to). De Haan et al. (2019) gave a classic example. Consider a driver giving a driving demonstration in a car whose dashboard has a brake indicator. When a pedestrian appears in the field of view, the driver brakes and the brake light turns on. Since "a = brake" and "s = brake light on" always appear at the same time, the policy model is likely to decide whether to brake based on the brake light alone, which fits the demonstration data well; but at deployment, when a pedestrian appears in the field of view, it will still not brake because the brake light is not on, which is obviously not what people want.

The researchers found that the causal confusion problem is widespread in general scenarios. As shown in Figure 7, the performance of policies learned in the original environment is far worse than when the score display is masked during training. In the original environment, the policy model relies only on the score shown on the screen to choose actions, because the score has a close and sensitive relationship with the expert's actions; but the model does not know that the score is merely a result of the expert's actions, so it cannot take effective actions at deployment. In the environment where the score is masked, the policy model has to look for other clues to predict the expert's actions, and only then can it discover the true underlying rule.

fe3574b1db7c75243fa9ae76878d232a.png

Method

From the above analysis, the researchers found that the causal confusion problem arises mainly because the policy model relies on only a few individual objects in the image to choose its action, and such an object is often the obvious effect of the expert's action. This inspired the researchers to deal with the problem by making the policy model attend to all objects in the image in a balanced manner, so that it can attend to the real causes.

Realizing this idea requires solving two tasks: (1) extracting objects from the image, and (2) making the policy model attend to all objects. For the first task, the researchers used a vector-quantized variational auto-encoder (VQ-VAE) [v.d. Oord'17] to extract object features. As shown in Figure 8, the researchers found that the discrete codes learned by VQ-VAE with similar values (similar colors) represent the same (or semantically similar) object, so the VQ-VAE finds and distinguishes the objects in the image.

fa88c4955edde4a71d131afe74e9fdae.png

(Figure 8: The discrete codes learned by VQ-VAE can find and distinguish objects in an image)
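For concreteness, here is a minimal sketch of the VQ-VAE quantization step that produces the grid of discrete codes discussed above; the encoder that produces `z_e` and the codebook size are assumptions, and only the nearest-neighbour assignment is shown.

```python
# Sketch of VQ-VAE quantization: each spatial feature of the encoder output is
# replaced by the index of its nearest codebook vector, yielding a grid of
# discrete codes. Grid points sharing a code tend to belong to the same object.
import torch

def quantize(z_e: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """z_e: [B, C, H, W] encoder features; codebook: [K, C]. Returns [B, H, W] code indices."""
    B, C, H, W = z_e.shape
    flat = z_e.permute(0, 2, 3, 1).reshape(-1, C)          # [B*H*W, C]
    dists = torch.cdist(flat, codebook)                    # distances to all K code vectors
    codes = dists.argmin(dim=1)                            # nearest code index per grid point
    return codes.reshape(B, H, W)
```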

For the second task, the researchers randomly decide, for each discrete code value, whether it is selected, and then mask all grid points in the image's VQ-VAE code that carry a selected value. This operation hides some objects in the encoding, forcing the policy model to attend to the unmasked objects and preventing it from focusing on a few individual objects. This is the biggest difference from existing methods, which mask spatially contiguous regions that do not correspond to semantic objects. The method is therefore called Object-aware REgularizatiOn (OREO). Figure 9 shows the pipeline of OREO: the first stage trains a VQ-VAE to extract object representations, and the second stage learns the policy model on top of the VQ-VAE codes, regularized by randomly masking objects as described above.

0ed056a26e4b859d47e313b5cf4363ed.png

(Figure 9: Pipeline of the object-aware regularization (OREO) method)
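A minimal sketch of this masking step, assuming the policy consumes the quantized feature map `z_q` and that masked positions are simply zeroed out (the drop probability and the zeroing choice are illustrative assumptions): a whole code value, and hence typically a whole object, is dropped at once, rather than an arbitrary spatial patch.

```python
# Sketch of OREO-style masking: per image, each discrete code value is dropped
# with probability p, and every grid point carrying a dropped value is masked.
import torch

def oreo_mask(z_q: torch.Tensor, codes: torch.Tensor, n_codes: int, p: float = 0.5):
    """z_q: [B, C, H, W] quantized features; codes: [B, H, W] code indices; n_codes: codebook size."""
    B, H, W = codes.shape
    drop_value = (torch.rand(B, n_codes) < p).float()            # [B, K], 1 = drop this code value
    drop_point = drop_value.gather(1, codes.reshape(B, -1))      # [B, H*W], 1 where the point is dropped
    keep = (1.0 - drop_point).reshape(B, 1, H, W)                # broadcast over channels
    return z_q * keep                                            # masked features fed to the policy
```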

Experiments

First, consider the confounded Atari environment, proposed by de Haan et al. (2019) to study the causal confusion problem, in which each game frame additionally shows the player's previous action. As shown in Table 3, OREO achieves the best performance in most games. In particular, OREO outperforms methods that randomly mask spatial regions (Dropout, DropBlock), data augmentation methods (Cutout, RandomShift), and the method that spatially masks the coding learned by a beta-VAE at random (CCIL) [de Haan'19], which illustrates the advantage of regularizing by perceiving objects. OREO also outperforms the causal prediction method CRLR, indicating that naively applying causal methods is not necessarily effective, because their assumptions do not hold in imitation learning tasks; for example, there is no clear causal relationship among the dimensions of image data, and the relationships among the variables are nonlinear. The visualization results in Figure 10 show that the policy learned by behavioral cloning indeed attends to only a few individual objects, while the policy learned by OREO attends to more of the relevant objects in the frame. For a more realistic task, the researchers also examined performance in the CARLA driving simulation environment. The results in Table 4 show that OREO again achieves the best performance. More experimental results are provided in the original paper and its appendix.

b9620df46a4510e011bcb0acb109d58b.png

(Table 3: Performance comparison of imitation learning algorithms in the confounded Atari environment)

e3cd2af1b6dae27f984dc6cc3ed5ab44.png

(Figure 10: Visualization of the policy models learned by behavioral cloning (first row) and OREO (second row) in the confounded Atari environment (left column) and the original Atari environment (right column))

54aa6594452ee8ea68eadd5acf943951.png

(Table 4: Success rates of each imitation learning algorithm on each task in the CARLA driving simulation environment)


References:

  1. [Ribeiro’16] M. T. Ribeiro, S. Singh, and C. Guestrin. “Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1135–1144, 2016.
  2. [v.d. Oord’17] van den Oord, A., Vinyals, O., & Kavukcuoglu, K. Neural discrete representation learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (pp. 6309-6318), 2017.
  3. [de Haan’19] de Haan, Pim, Jayaraman, Dinesh, and Levine, Sergey. Causal confusion in imitation learning. In Advances in Neural Information Processing Systems, 2019.
  4. [Khemakhem’20] I. Khemakhem, D. P. Kingma, R. P. Monti, and A. Hyvärinen. Variational autoencoders and nonlinear ICA: A unifying framework. In the 23rd International Conference on Artificial Intelligence and Statistics, 26-28 August 2020, Online [Palermo, Sicily, Italy], volume 108 of Proceedings of Machine Learning Research, pages 2207–2217, 2020.

Welcome to follow the Microsoft China MSDN subscription account for more latest releases!
54f48003ad44029cb57a5d3771a44a47.jpg

