As artificial intelligence continues to evolve, security and compliance issues are becoming increasingly important. A major limitation of current machine learning is that its models are built on an association framework, which suffers from sample selection bias and poor stability. The emergence of causal inference models has opened up a new direction for machine learning. The Meituan technical team invited Cui Peng, a tenured associate professor in the Department of Computer Science at Tsinghua University, to share with the team the latest developments in causal inference technology and some of the results achieved so far.
| Guest speaker: Cui Peng, tenured associate professor and doctoral supervisor of the Department of Computer Science, Tsinghua University
| His research focuses on big-data-driven causal inference and stable prediction, large-scale network representation learning, and related topics. He has published more than 100 papers at top international conferences in data mining and artificial intelligence, has won five best paper awards from top international conferences or journals, and has twice been selected for the best paper special issue of KDD, a top international conference in data mining. He serves on the editorial boards of top international journals such as IEEE TKDE, ACM TOMM, ACM TIST, and IEEE TBD. He has received the second prize of the National Natural Science Award, the first prize of the Ministry of Education Natural Science Award, the first prize of the Natural Science Award of the Institute of Electronics, the first prize of the Beijing Municipal Science and Technology Progress Award, and the Young Scientist Award of the China Computer Federation, and has been recognized as a Distinguished Scientist of the Association for Computing Machinery (ACM).
Background
Artificial intelligence is expected to be applied much more widely over the next ten to twenty years in many risk-sensitive fields, including healthcare, justice, manufacturing, and financial technology. Until now, artificial intelligence has mostly been applied on the Internet, which is a risk-insensitive field. However, with the various laws and regulations introduced over the past two years, major Internet platforms have been thrust into the spotlight: more and more people are beginning to see the potential risks on the Internet, and the platforms also face the risk of being constrained by macro-level policy. From this perspective, the risks brought by artificial intelligence technology deserve attention.
The prevention and control of artificial intelligence risks can be described as "knowing what works without knowing why." Everyone knows how to make predictions, but it is hard to answer "why": why does the model make a particular decision? When can the system's judgment be trusted? For many such questions we cannot give a reasonably precise answer, and this leads to a series of problems. The first is the lack of interpretability, which makes the "human-machine collaboration" model hard to implement in the real world. For example, artificial intelligence is difficult to apply in medicine because doctors cannot tell what the system's judgment is based on, so current AI technology has significant limitations when it comes to deployment. Second, mainstream artificial intelligence methods rest on the assumption of independent and identically distributed (IID) data, which requires the training data and test data to come from the same distribution; in practice it is hard to guarantee what data the model will be applied to, and the final performance of the model depends on how well the training and test distributions match. Third, artificial intelligence introduces fairness risks when applied to social issues. For example, in the United States, for two people with the same income and educational background, a system may judge the crime risk of a Black person to be as much as ten times that of a white person. Finally, there is no traceability: one cannot obtain a desired output by adjusting the input, because the reasoning and prediction process cannot be traced back.
The main source of the problems above is that current artificial intelligence is built on an association framework. Under an association-based framework, one can conclude that both income and crime rate, and skin color and crime rate, are strongly correlated. Under a causality-based framework, when we want to judge whether a variable T has a causal effect on an output Y, we do not directly measure the association between T and Y; instead, we look at the association between T and Y while controlling for X. For example, we make the distribution of X (income level) the same in two control groups (both groups contain people with and without money), then vary T (skin color) and see whether Y (crime rate) differs significantly between the two groups. Doing so, we find no significant difference in crime rates between Black and white people. So why is there a strong association between skin color and crime rate under an association-based framework? It is because most Black people have lower incomes, which drives up the overall crime rate of that group, but this is not caused by skin color.
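To make this concrete, here is a minimal simulation sketch (my own illustration with entirely made-up numbers, not real statistics) of how a confounder X can create a strong association between T and Y that disappears once we control for X:

```python
# Minimal simulation sketch: a confounder X induces a spurious association
# between T and Y even though T has no causal effect on Y.
# All numbers are made up purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

x = rng.binomial(1, 0.5, n)                 # X: income (1 = high, 0 = low)
p_t = np.where(x == 1, 0.2, 0.8)            # X influences group membership T
t = rng.binomial(1, p_t)
p_y = np.where(x == 1, 0.02, 0.10)          # X influences outcome Y; T does not
y = rng.binomial(1, p_y)

# Association-based view: compare Y rates by T without controlling for X.
print("P(Y|T=1) - P(Y|T=0) =", y[t == 1].mean() - y[t == 0].mean())  # clearly nonzero

# Causality-inspired view: control for X by comparing within each stratum of X.
for xv in (0, 1):
    m = x == xv
    diff = y[m & (t == 1)].mean() - y[m & (t == 0)].mean()
    print(f"within X={xv}: P(Y|T=1) - P(Y|T=0) = {diff:.4f}")        # close to zero
```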
Fundamentally, the problem lies not with association models themselves, but with the way machine learning uses associations. In general, associations arise in three ways. The first is the causal mechanism: causal relationships are stable, interpretable, and traceable. The second is the confounding effect: if X causes both T and Y, a spurious association between T and Y is created. The third is sample selection bias. Take the example of dogs and grass: when the background is changed to a beach, the model can no longer recognize the dog. This is because we selected a large number of samples of dogs on grass, so the model learns an association between dogs and grass, which is also a spurious association.
Of these three mechanisms, only the causal one is reliable; the other two are not. However, current machine learning does not distinguish among these three sources of association, and the learned models contain many spurious associations, which leads to problems with interpretability, stability, fairness, and traceability. To fundamentally break through the limitations of current machine learning, we need a more rigorous statistical logic, for example replacing the original correlation-based statistics with causal statistics.
Applying causal inference to machine learning faces many challenges, because causal inference was originally studied mainly in statistics (and in philosophy), where the setting is a controlled environment with small data and the entire data-generating process is controllable. For example, in a trial testing whether a vaccine is effective, we can control who gets vaccinated and who does not. In machine learning, however, the data-generating process is uncontrollable. In observational studies on big data, we have to deal with high dimensionality, high noise, and weak priors, and the data-generating process is unknowable, which poses a great challenge to the traditional causal inference framework. In addition, the goals of causal inference and machine learning are very different: causal inference aims to understand the mechanism by which the data are generated, while machine learning (including many applications in the Internet industry) mainly aims to predict what will happen in the future.
So how do we bridge the gap between causal inference and machine learning? We propose a causally inspired methodology for learning, inference, and decision evaluation. The first problem to solve is how to identify causal structure in large-scale data. The second is how to integrate that causal structure with machine learning; the current causally inspired stable learning and fair, unbiased learning models are both aimed at this. The third is how to use these causal structures to move from prediction problems to the design of decision mechanisms, that is, counterfactual reasoning and decision optimization.
Two basic paradigms of causal inference
Structural Causal Model
There are two basic paradigms for causal inference. The first is the Structural Causal Model. The core of this framework is how to reason over a known causal graph, for example how to identify any one of the variables and how much influence it has on another variable. There are relatively mature criteria such as the back-door and front-door criteria for removing confounding, and causal estimation can be done via do-calculus. The core problem this approach currently faces is that in observational studies we cannot specify the causal graph in advance. In some fields (such as archaeology) the causal graph can be defined from expert knowledge, but that takes us back down the old road of "expert systems." In general, the core question is how to discover the causal structure.
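As a concrete instance of this kind of reasoning, the standard back-door adjustment formula expresses the effect of an intervention on T purely in terms of observational quantities, under the assumption that the observed covariates X block all back-door paths from T to Y:

$$P\big(Y \mid do(T=t)\big) \;=\; \sum_{x} P\big(Y \mid T=t,\, X=x\big)\, P(X=x)$$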
A related technique is Causal Discovery, which infers the causal graph from observed data using conditional independence tests: a series of (conditional) independence judgments over the existing variables is used to recover the graph. This is an NP-hard problem and suffers from combinatorial explosion, which is the bottleneck when structural causal models are applied to large-scale data; recent studies, such as differentiable causal discovery, try to address this problem.
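To illustrate the basic building block of such constraint-based discovery, here is a minimal sketch (my own illustration, not code from the talk or the cited papers; the function name is hypothetical) of a single conditional independence test via partial correlation: regress X and Y on the conditioning set Z, then apply a Fisher z-test to the correlation of the residuals.

```python
# Minimal sketch of one conditional-independence test (X ⟂ Y | Z) via partial
# correlation, the basic building block of constraint-based causal discovery.
import numpy as np
from scipy import stats

def partial_corr_ci_test(x, y, z, alpha=0.05):
    """Test X ⟂ Y | Z using partial correlation and a Fisher z-test."""
    Z = np.column_stack([np.ones(len(x)), z])            # conditioning set + intercept
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]     # residual of X given Z
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]     # residual of Y given Z
    r = np.corrcoef(rx, ry)[0, 1]
    n, k = len(x), Z.shape[1] - 1                         # k = size of conditioning set
    z_stat = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - k - 3)
    p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
    return p_value > alpha, p_value                       # True => cannot reject independence

# Toy data: Z causes both X and Y, so X ⟂ Y | Z should hold.
rng = np.random.default_rng(0)
z = rng.normal(size=5000)
x = z + 0.5 * rng.normal(size=5000)
y = 2 * z + 0.5 * rng.normal(size=5000)
print(partial_corr_ci_test(x, y, z.reshape(-1, 1)))
```

Constraint-based methods must run many such tests over combinatorially many conditioning sets, which is exactly where the scalability bottleneck comes from.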
Potential Outcomes Framework
The second paradigm is the Potential Outcome Framework. Its core idea is that you do not need to know the causal structure among all variables; you only care whether one particular variable has a causal effect on the output. What you do need to know is which confounders exist between this variable and the output, and the framework assumes that all of them have been observed.
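As a concrete example of estimation under this paradigm, a common estimator for the average treatment effect is inverse propensity weighting (IPW). The sketch below (an illustration under the stated assumptions, not code from the talk) simulates data in which all confounders X are observed and recovers the effect of a binary treatment T on the outcome Y:

```python
# Minimal sketch of estimating the average treatment effect (ATE) of a binary
# treatment T on outcome Y with inverse propensity weighting (IPW), assuming
# all confounders X are observed. Illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 50_000
X = rng.normal(size=(n, 3))                       # observed confounders
p_t = 1 / (1 + np.exp(-(X[:, 0] - X[:, 1])))      # confounders drive treatment assignment
T = rng.binomial(1, p_t)
Y = 2.0 * T + X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)  # true ATE = 2

# 1) Fit a propensity model e(X) = P(T=1 | X).
e = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
e = np.clip(e, 0.01, 0.99)                        # clip to stabilize the weights

# 2) IPW estimate of the ATE: E[T*Y/e(X)] - E[(1-T)*Y/(1-e(X))].
ate = np.mean(T * Y / e) - np.mean((1 - T) * Y / (1 - e))
print("IPW ATE estimate:", ate)                   # should be close to 2
```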
That concludes the background and theory. Next, I will mainly talk about some of our recent thinking and attempts, and how to bring these two paradigms into concrete problems.
Differentiable causal discovery and its application in recommender systems
Causal discovery and problem definition
Causal discovery is defined as follows: given a set of samples, each described by a number of variables, we hope to find the causal structure among these variables from the observed data. The discovered causal graph can be viewed as a graphical model. From the perspective of generative models, we hope to find the causal graph under which generating this set of samples according to its causal structure has the highest likelihood.
A concept called the Functional Causal Model (FCM) is introduced here. In an FCM, since the causal graph is a directed acyclic graph (DAG), every variable has a set of parent nodes, and its value is generated from those parents through some function plus a noise term. In the linear case, the problem becomes: how to find a weight matrix W such that the reconstruction of X is optimal.
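In symbols (standard FCM notation, not necessarily the talk's exact formulation), each variable is a function of its parents plus noise, and the linear case reduces to finding a weighted adjacency matrix W of a DAG that best reconstructs the data matrix $X \in \mathbb{R}^{n \times d}$:

$$X_j = f_j\big(\mathrm{PA}(X_j)\big) + \varepsilon_j, \qquad \min_{W \,\text{acyclic}} \ \frac{1}{2n}\,\lVert X - XW \rVert_F^2$$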
Optimization over directed acyclic graphs has long been an open problem. A 2018 paper [1] proposed a continuous optimization method: gradient-based optimization over the full space of weighted graphs, adding a differentiable DAG constraint and a sparsity penalty (l1 or l2 regularization) while minimizing the reconstruction error of X.
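The key device in [1] is a smooth acyclicity function $h(W) = \mathrm{tr}(e^{W \circ W}) - d$, which equals zero exactly when W encodes a DAG, so the constrained problem can be attacked with continuous optimization. Below is a minimal sketch of the resulting objective (a simplified illustration that folds the constraint into a fixed penalty, rather than the augmented-Lagrangian scheme used in the paper):

```python
# Minimal sketch of the NOTEARS-style objective from [1]: least-squares
# reconstruction + l1 sparsity + a smooth acyclicity penalty
# h(W) = tr(exp(W * W)) - d, which is 0 iff W encodes a DAG.
import numpy as np
from scipy.linalg import expm

def notears_objective(W, X, lambda_l1=0.1, rho=10.0):
    n, d = X.shape
    recon = 0.5 / n * np.sum((X - X @ W) ** 2)      # reconstruction error
    sparsity = lambda_l1 * np.sum(np.abs(W))        # l1 sparsity penalty
    h = np.trace(expm(W * W)) - d                   # acyclicity measure (0 for a DAG)
    return recon + sparsity + rho * h

# Example: evaluate the objective for a random candidate W on toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
W = rng.normal(scale=0.1, size=(5, 5))
np.fill_diagonal(W, 0.0)                            # no self-loops
print(notears_objective(W, X))
```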
We found some problems when implementing this framework. Its basic assumption is that the noise of every variable follows a Gaussian distribution and that the noise scales are similar. If this assumption is not satisfied, problems arise: the structure with the minimum reconstruction error may not be the ground truth, which is a limitation of differentiable causal discovery methods. We solve this problem by imposing an independence constraint on the residuals and transforming the independence criterion into an optimizable form. The implementation details are omitted here; interested readers can refer to the paper [2].
Application of Differentiable Causal Discovery in Recommender Systems
The entire recommender system rests on the IID (independent and identically distributed) assumption, i.e., the users and items in the training set and the test set need to come from the same distribution, but in practice there are various OOD (Out-of-Distribution) problems. The first is natural shift: for example, a model trained on data from Beijing and Shanghai may not work well for users in Chongqing. The second is the artificial shift introduced by the recommendation mechanism itself.
We hope to propose a more general recommendation approach that resists the various OOD or bias problems in recommender systems, and we have done some research on this problem [3]. There is an invariance assumption in OOD recommendation: whether a person buys an item after seeing it does not change with the environment. Therefore, as long as the user's preference for items is guaranteed to remain unchanged, this invariance assumption holds and more reasonable recommendations can be given; this is the core of solving the OOD problem.
How do we ensure that user preferences remain unchanged? There is a basic consensus that invariance and causality are equivalent in a certain sense: if a structure can be guaranteed to have the same predictive effect across environments, then it must be a causal structure, and a causal structure performs relatively stably across environments. Finding invariant user preferences therefore turns into a causal preference learning problem. Recommender systems have a special structure, the user-item bipartite graph, so we need to design a causal discovery method tailored to it. In the final learned model, it suffices to input a user's representation to know which items the user will like.
Clearly, this method benefits the interpretability, transparency, and stability of the recommender system. We have also compared it with many baselines, and it shows a fairly clear performance improvement.
Some Thoughts on OOD Generalization and Stable Learning
The OOD problem is a fundamental problem in machine learning. Previous work was essentially based on the IID assumption; although transfer learning does adaptation, it assumes that the test distribution is known, so its main body still sits within the IID theoretical framework. We have been working on OOD since 2018. First, the definition: OOD means that the training set and the test set do not come from the same distribution; if they do, it is IID. OOD can be divided into two cases. If the test distribution is known or partially known, it is OOD adaptation, i.e., transfer learning / domain adaptation. If the test distribution is unknown, it is the real OOD generalization problem.
The concept of "generalization" here is different from the concept of "generalization" in machine learning. The "generalization" in machine learning is more about the interpolation problem. The interpolation problem within the training data is an "interpolation" problem. If you want to predict X beyond the interpolation domain, it is an "extrapolation" problem. "Extrapolation" is a relatively dangerous thing. Under what circumstances can "extrapolation" be done? If you can find the invariance in it, you can do "extrapolation".
When doing machine learning under the IID assumption, we were essentially doing data fitting and only needed to guard against overfitting and underfitting. To solve the OOD problem, we need to find the invariance in the data, and there are two paths to it. The first path is causal inference: causality and invariance are equivalent, that is, once the causal structure is found, invariance is guaranteed; in fact, causal inference itself is a science of invariance. Stable learning, in part, hopes that the model learns and predicts on a causal basis. We found that by re-weighting the samples so that all variables become statistically independent of each other, an association-based model can be turned into a causality-based model. Interested readers can refer to the relevant papers.
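As a rough illustration of this re-weighting idea (my own simplified sketch under Gaussian assumptions, not the exact algorithm from the stable learning papers), one can re-weight samples toward the product of the feature marginals, after which the features are approximately independent under the weighted distribution:

```python
# Rough sketch of the sample re-weighting idea behind stable learning:
# re-weight samples toward the product of the feature marginals so that the
# features become (approximately) independent under the weighted distribution.
# Gaussian density estimates keep the sketch simple; illustration only.
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)
n = 20_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)          # x2 spuriously correlated with x1
X = np.column_stack([x1, x2])

# Weights = product of marginal densities / joint density (all fitted as Gaussians).
joint = multivariate_normal(mean=X.mean(axis=0), cov=np.cov(X.T))
w = norm.pdf(x1, x1.mean(), x1.std()) * norm.pdf(x2, x2.mean(), x2.std()) / joint.pdf(X)

def weighted_corr(a, b, w):
    cov = np.cov(np.vstack([a, b]), aweights=w)
    return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

print("correlation before re-weighting:", np.corrcoef(x1, x2)[0, 1])   # about 0.8
print("correlation after  re-weighting:", weighted_corr(x1, x2, w))    # near 0
# These weights can then be plugged into any weighted learner (e.g. a weighted
# regression), nudging an association-based model toward a causality-based one.
```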
The second path is to find invariance from heterogeneity. In statistics there is a concept of heterogeneity: for example, the distribution of "dog" images may have two modes, one for dogs on the beach and one for dogs on the grass. Since both modes represent dogs, there must be an invariant part between them, and this invariant part has OOD generalization ability. The heterogeneity of data cannot be predefined; we hope to find the latent heterogeneity, and the invariance within it, in a data-driven way, with the two learning processes reinforcing each other.
So-called stable learning uses a training set from one distribution and test sets from various unknown distributions, and the optimization goal is to minimize the variance of accuracy across them. That is, we assume a single training distribution with some internal heterogeneity, without manually partitioning that heterogeneity, and we hope to learn a model that performs relatively well across distributions. We wrote a survey on OOD generalization last year [4] that analyzes this problem systematically; interested readers can refer to it.
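One way to write down this goal (my own paraphrase of the talk, not a formula it gave explicitly) is to ask for both low error and low variance of error across a set $\mathcal{E}$ of unseen test environments, while training only on the single training distribution:

$$\min_{f} \ \mathbb{E}_{e \in \mathcal{E}}\big[\mathcal{L}_e(f)\big] \;+\; \lambda\, \mathrm{Var}_{e \in \mathcal{E}}\big[\mathcal{L}_e(f)\big]$$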
References
- [1] Xun Zheng, Bryon Aragam, Pradeep K. Ravikumar, Eric P. Xing. DAGs with NO TEARS: Continuous Optimization for Structure Learning. Advances in Neural Information Processing Systems (NeurIPS), 2018.
- [2] Yue He, Peng Cui, et al. DARING: Differentiable Causal Discovery with Residual Independence. KDD, 2021.
- [3] Yue He, Zimu Wang, Peng Cui, Hao Zou, Yafeng Zhang, Qiang Cui, Yong Jiang. CausPref: Causal Preference Learning for Out-of-Distribution Recommendation. The WebConf, 2022.
- [4] Zheyan Shen, Jiashuo Liu, Yue He, Xingxuan Zhang, Renzhe Xu, Han Yu, Peng Cui. Towards Out-Of-Distribution Generalization: A Survey. arXiv, 2021.