As artificial intelligence continues to evolve, security and compliance issues are becoming increasingly important. A major limitation of current machine learning is that its models are built on an association framework, which suffers from sample selection bias and poor stability. The emergence of causal inference models opens up a new direction for machine learning. The Meituan technical team invited Professor Cui Peng, associate professor in the Department of Computer Science at Tsinghua University, to share the latest developments in causal inference and the results achieved so far.

| Speaker: Cui Peng, tenured associate professor and doctoral supervisor, Department of Computer Science, Tsinghua University.

| His research interests focus on causal inference with big data, stable prediction, and large-scale network representation learning. He has published more than 100 papers at top international conferences in data mining and artificial intelligence, has won five best-paper awards from top international conferences or journals, and has twice been selected for the KDD Best Paper special issue of the top international conference in data mining. He serves on the editorial boards of IEEE TKDE, ACM TOMM, ACM TIST, IEEE TBD and other leading international journals. He has received the Second Prize of the National Natural Science Award, the First Prize of the Natural Science Award of the Ministry of Education, the First Prize of the Natural Science Award of the Chinese Institute of Electronics, the First Prize for Science and Technology Progress of Beijing, the Young Scientist Award of the China Computer Federation, and the ACM Distinguished Scientist award.

Background

Over the next 10 to 20 years, AI is expected to be applied much more widely in risk-sensitive fields such as healthcare, justice, manufacturing, and fintech. Previously, most AI applications ran on the Internet, which is not a risk-sensitive domain. However, with the various laws and regulations introduced over the past two years, the major Internet platforms have been pushed into the spotlight, and more and more people have begun to see the potential risks hidden in the Internet, as well as the risks at the level of macro policy. In this sense, the risks brought by artificial intelligence technology deserve serious attention.

The current state of AI risk prevention and control can be described as "knowing how, but not knowing why". We know how to make predictions, but it is hard to answer "why": why does the system make this particular decision? When can we trust its judgment? We cannot give reasonably precise answers to many such questions, and this brings a series of problems. First, the lack of explainability makes it difficult to implement the "human-machine collaboration" model in the real world. For example, it is hard to apply AI in the medical industry, because doctors do not know what a system's judgment is based on, which greatly limits its adoption. Second, current mainstream AI methods rest on the assumption of independent and identically distributed data, which requires the training set and the test set to come from the same distribution. In practice, however, it is hard to guarantee what data the model will be applied to, because the model's final performance depends on how well the training and test distributions match. Third, when AI is applied to social issues, fairness risks are introduced. For example, in the United States, given two people with identical income, education, and other background characteristics, the system may judge a Black person to be ten times more likely to commit a crime than a white person. Finally, there is the lack of traceability: we cannot adjust the input to obtain a desired output, because the reasoning and prediction process cannot be traced back.

The main source of these problems is that current artificial intelligence is built on an association framework. Within that framework, both income-crime rate and skin color-crime rate show strong correlations. In a causality-based framework, however, when we want to judge whether a variable T has a causal effect on an output Y, we do not directly measure the correlation between T and Y; instead, we look at the correlation between T and Y while controlling for X. For example, we make the distribution of X (income level) the same in the two comparison groups (either everyone has money or no one does), then vary T (skin color) and observe whether there is a significant difference in Y (crime rate) between the groups; we find there is no significant difference in crime rate between Black and white people. So why is skin color strongly associated with crime in an association-based framework? Because most Black people have lower incomes, which leads to a higher crime rate overall, but this is not caused by skin color.
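
To make this control-then-compare logic concrete, here is a small, hypothetical simulation (the variable names and numbers are purely illustrative): a third variable X drives both T and Y, so T and Y are strongly associated marginally, but the association disappears once we compare groups with the same X.

```python
# Toy simulation: a spurious marginal T-Y association that vanishes after controlling for X.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.binomial(1, 0.5, n)                # background variable (e.g. income level)
T = rng.binomial(1, 0.2 + 0.6 * X)         # T depends on X only
Y = rng.binomial(1, 0.1 + 0.3 * X)         # Y depends on X only, not on T

print("marginal:  P(Y|T=1) - P(Y|T=0) =",
      round(Y[T == 1].mean() - Y[T == 0].mean(), 3))   # clearly positive
for x in (0, 1):
    m = X == x
    diff = Y[m & (T == 1)].mean() - Y[m & (T == 0)].mean()
    print(f"stratum X={x}: P(Y|T=1,X) - P(Y|T=0,X) = {diff:.3f}")   # approximately 0
```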

Ultimately, the problem does not lie in correlation models themselves, but in how machine learning uses correlations. In general, correlations arise in three ways. The first is the causal mechanism, which is stable, explainable, and traceable. The second is the confounding effect: if X causes both T and Y, a spurious association appears between T and Y. The third is sample selection bias. In the dog-and-grass example, when the environment is changed to a beach, the model can no longer recognize the dog: because we selected a large number of dogs photographed on grass as samples, the model learned an association between dogs and grass, which is also a spurious association.

Of the above three ways, only the correlation produced by causality is reliable; the other two are not. However, the current machine learning field does not distinguish between these three ways of generating correlations, and many spurious correlations creep in, which causes problems in the interpretability, stability, fairness, and traceability of models. If we want to fundamentally break through the limitations of current machine learning, we need to adopt stricter statistical logic, for example by replacing correlation statistics with causal statistics.

Applying causal inference to machine learning faces many challenges, because causal inference was originally studied mainly in statistics (and in philosophy), which is oriented toward controlled environments with small data, where the entire data-generating process is controllable. Think of a controlled experiment to test whether a vaccine works, where we can decide who gets vaccinated and who does not. In machine learning, however, the data-generating process is not controllable. In observational studies on big data, we have to deal with high dimensionality, high noise, and weak priors, and the data-generating process is unknowable, which poses great challenges to the traditional causal inference framework. In addition, the goal of causal inference is very different from that of machine learning: causal inference seeks to understand how data are generated, whereas machine learning (including many Internet applications) seeks to predict what will happen in the future.

So how can we bridge the gap between causal inference and machine learning? We propose a methodological framework for causally inspired learning, reasoning, and decision evaluation. The first problem to solve is how to identify causal structure in large-scale data. The second is how to integrate machine learning with that causal structure; the current causally inspired stable learning models and fair, unbiased learning models are aimed at this. The third is to move from prediction to the design of decision-making mechanisms: how to use these causal structures to help us optimize decisions, namely counterfactual reasoning and decision-optimization mechanisms.

Two basic paradigms of causal reasoning

Structural causal model

There are two basic paradigms of causal inference. The first is the Structural Causal Model, whose core question is how to reason on a known causal graph, for example how to identify any one of the variables and estimate how much it affects another variable. Relatively mature criteria, such as the back-door and front-door criteria, have been developed to remove confounding, and causal estimation can be carried out with do-calculus. The core problem with this approach is that in observational studies we generally cannot specify the causal graph. In some fields (such as archaeology) the causal graph can be defined from expert knowledge, but that leads back to "expert systems". In general, the core problem is how to discover the causal structure.
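
For reference, when a set of variables X satisfies the back-door criterion for the pair (T, Y), the interventional distribution can be written purely in terms of observational quantities:

```latex
P\bigl(Y \mid do(T=t)\bigr) \;=\; \sum_{x} P\bigl(Y \mid T=t,\, X=x\bigr)\, P(X=x).
```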

A derived technique is causal discovery, in which the causal graph is identified from existing data based on conditional independence tests: a series of conditional independence judgments is made repeatedly over the existing variables to determine the causal graph. This is an NP-hard problem and can suffer from combinatorial explosion. It is one of the bottlenecks of structural causal models when applied to large-scale data, and some recent studies have addressed it using differentiable causal discovery.

Potential outcome framework

The second paradigm is the Potential Outcome Framework. Its core idea is that we do not need to know the causal structure among all variables; we only need to know whether one variable has a causal effect on the output, ignoring the influence of other variables. But we do need to know which confounders exist between this variable and the output, and we assume that all of them have been observed.
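
Under this framework's standard assumptions (unconfoundedness given the observed confounders X, plus overlap), the average treatment effect of a binary T can be expressed in terms of observable quantities:

```latex
\mathrm{ATE} \;=\; \mathbb{E}\bigl[Y(1) - Y(0)\bigr]
            \;=\; \mathbb{E}_{X}\Bigl[\,\mathbb{E}[Y \mid T{=}1, X] \;-\; \mathbb{E}[Y \mid T{=}0, X]\,\Bigr].
```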

That covers the background and some of the theory. Next, we focus on some of our recent thinking and attempts, and on how to apply these two paradigms to specific problems.

Differentiable causal discovery and its application in recommendation systems

Causal discovery and problem definition

Causal discovery is defined as follows: given a group of samples, each represented by a set of variables, we want to recover the causal structure among these variables from the observed data. The discovered causal graph can be regarded as a graphical model. From the perspective of generative models, we want to find the causal graph under which generating this group of samples according to the causal structure has the highest likelihood.

A concept called Functional Causal Models (FCMs) is introduced here. In an FCM, the causal graph over the variables X is a directed acyclic graph (DAG), and each variable is generated from its parent nodes through a function of the parents plus a noise term. In a linear setting, for example, the problem becomes: how to find a weight matrix W such that the reconstruction of X is optimal.
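
Written out, the FCM assumption and its linear special case look roughly as follows (the notation here is mine, with $\mathrm{PA}_i$ the parents of $X_i$ in the DAG and $E$ the noise matrix):

```latex
X_i = f_i(\mathrm{PA}_i) + \varepsilon_i, \qquad
\text{linear case: } X = XW + E, \quad
\min_{W}\; \frac{1}{2n}\lVert X - XW \rVert_F^2 \;\; \text{s.t. } W \text{ encodes a DAG}.
```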

Optimizing over directed acyclic graphs has long been an open problem. In 2018, a paper [1] proposed an optimization method: gradient-based optimization can be performed over the whole space of directed acyclic graphs, minimizing the reconstruction error of X subject to a differentiable DAG constraint and a sparsity constraint (L1 or L2 regularization).
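
A minimal sketch of this idea, assuming the linear FCM above: the combinatorial DAG constraint is replaced by the smooth condition $h(W)=\operatorname{tr}(e^{W\circ W})-d=0$ from [1], so the whole objective can be attacked with plain gradient steps. Function names, the learning rate, and the simple penalty schedule below are illustrative, not the exact algorithm of the paper.

```python
# Sketch of a differentiable structure-learning step in the spirit of NOTEARS [1].
import numpy as np
from scipy.linalg import expm

def notears_step(X, W, lr=1e-2, lam=0.1, rho=10.0):
    """One gradient step on reconstruction loss + acyclicity penalty + L1 sparsity."""
    n, d = X.shape
    R = X - X @ W
    loss_grad = -X.T @ R / n            # gradient of (1/2n) * ||X - XW||_F^2
    E = expm(W * W)                     # matrix exponential of the Hadamard square
    h = np.trace(E) - d                 # h(W) = 0  iff  W encodes a DAG
    h_grad = E.T * (2 * W)              # gradient of h(W)
    grad = loss_grad + rho * h * h_grad + lam * np.sign(W)
    return W - lr * grad, h

# Usage sketch: start from W = 0, iterate steps while gradually increasing rho,
# then threshold small entries of W to read off the estimated causal graph.
```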

We found some problems in the concrete implementation of this framework. Its basic assumption is that all variables carry Gaussian noise and that the noise scales are the same; if this assumption is not satisfied, problems arise, for example the structure with the minimum reconstruction error may not be the ground truth. This is a limitation of the differentiable causal discovery approach. We can solve this problem by imposing an independence constraint, turning the independence criterion into an optimizable form. The implementation details are omitted here; interested readers can refer to the paper [2].
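
As a rough illustration of how an independence criterion can be made optimizable, one option is to penalize the dependence between residual columns with a kernel measure such as HSIC. The sketch below is my own simplified reading of that idea (a biased HSIC estimator with fixed-bandwidth Gaussian kernels), not the exact formulation in [2].

```python
# Hedged sketch: a penalty that is small only when the noise residuals R = X - XW
# of different variables are (approximately) mutually independent.
import numpy as np

def rbf_gram(x, sigma=1.0):
    """Gaussian kernel Gram matrix of a 1-D sample."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y):
    """Biased HSIC estimate between two 1-D samples (zero iff independent, in the limit)."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(rbf_gram(x) @ H @ rbf_gram(y) @ H) / (n - 1) ** 2

def independence_penalty(residuals):
    """Sum of pairwise HSIC values over the residual columns."""
    d = residuals.shape[1]
    return sum(hsic(residuals[:, i], residuals[:, j])
               for i in range(d) for j in range(i + 1, d))
```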

Application of differentiable causal discovery in recommendation systems

The whole recommendation system rests on the I.I.D. (independent and identically distributed) assumption, which requires the training set and test set of users and items to come from the same distribution. In practice, however, recommendation systems face various OOD (out-of-distribution) problems. The first is natural shift: a model trained on data from Beijing and Shanghai, for example, may not work for users in Chongqing. The second is artificial shift caused by the mechanism of the recommendation system itself.

We hope to propose a more general way to resist all kinds of OOD or bias problems in recommendation systems, and we have done some research on this issue [3]. OOD recommendation rests on an invariance assumption: whether a person buys a product after seeing it does not change with the environment. As long as users' preferences for items remain unchanged, this invariance assumption holds and reasonable recommendations can be given; this is the core of the OOD solution.

How can we ensure that user preferences are invariant? There is a basic consensus that invariance and causality are, in a sense, equivalent: if a structure is guaranteed to have the same predictive effect across environments, then it must be a causal structure, and the performance of a causal structure across environments is relatively stable. Therefore, finding invariant user preferences becomes a problem of causal preference learning. Recommendation systems have a special structure, the bipartite graph, and we need to design causal discovery methods for it. In the model that is eventually learned, it is possible to know what a user will like simply from the user's representation.

Clearly, this method helps improve the interpretability, transparency, and stability of the recommendation system. We also compared it with many existing methods and observed a clear performance improvement.

Some thoughts on OOD generalization and stable learning

The OOD problem is a very basic problem in machine learning. Previous work is essentially built on the I.I.D. hypothesis; transfer learning does assume adaptation, but because it assumes the test distribution is known, it still largely falls within the I.I.D. theoretical framework. We have been working in the OOD direction since 2018. First, the definition: in an OOD problem the training set and the test set do not come from the same distribution (if they do, it is the I.I.D. setting). OOD problems can be divided into two types: if the distribution of the test set is known or partially known, we call it OOD adaptation, i.e. transfer learning / domain adaptation; if the test distribution is unknown, it is the real OOD generalization problem.

OOD generalization is not the same as the "generalization" commonly discussed in machine learning, which is mostly about interpolation: predictions within the range covered by the training data are "interpolation", while predictions of X beyond that range are "extrapolation". Extrapolation is a relatively dangerous thing, so under what circumstances can we extrapolate? If we can find invariance in the data, we can extrapolate.

When we do ordinary machine learning, we are fitting I.I.D. data, and we only have to prevent overfitting or underfitting. If we want to solve the OOD problem, we need to find invariance. There are two paths to invariance. The first is causal inference: there is an equivalence between causality and invariance, that is, invariance is guaranteed as long as the causal structure is found. Stable learning is, in part, the expectation that the model performs its learning and prediction on a causal basis. We found that by reweighting the samples we can make all the variables statistically independent, turning an association-based model into a causality-based model; interested readers can consult the papers.
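
A minimal sketch of that reweighting idea, under my own simplified objective: learn positive sample weights that shrink the weighted covariance between every pair of covariates, then train any downstream model with those weights. The loss, optimizer, and function names are illustrative rather than the exact algorithm from the stable learning papers.

```python
# Sketch: learn sample weights that decorrelate covariates, then reuse them downstream.
import numpy as np
from scipy.optimize import minimize

def decorrelation_loss(v, X):
    """Sum of squared weighted covariances over all distinct feature pairs."""
    w = np.exp(v); w /= w.sum()            # positive weights summing to 1
    mu = X.T @ w                           # weighted feature means
    Xc = X - mu
    cov = (Xc * w[:, None]).T @ Xc         # weighted covariance matrix
    off_diag = cov - np.diag(np.diag(cov))
    return (off_diag ** 2).sum()

def learn_sample_weights(X):
    res = minimize(decorrelation_loss, np.zeros(X.shape[0]), args=(X,), method="L-BFGS-B")
    w = np.exp(res.x)
    return w / w.sum()

# Usage: w = learn_sample_weights(X_train); then fit e.g. a weighted regression with w.
```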

The second path is to find invariance within heterogeneity. Statistics has the concept of heterogeneity: for example, the distribution of "dog" images may have two modes, one for dogs on the beach and one for dogs on the grass. Since both modes represent dogs, there must be something invariant between them, and that invariant part has OOD generalization ability. The heterogeneity of data cannot be predefined, so we hope to discover implicit heterogeneity, and the invariance within it, in a data-driven way; learning the two is mutually reinforcing.

So-called stable learning uses one training distribution together with multiple test sets of unknown distributions, and its optimization goal is to minimize the variance of accuracy across them. In other words, we assume a training distribution with some inherent heterogeneity, but without any manual partition of that heterogeneity, and we hope to learn a model that performs well under various unknown distributions. Last year we wrote a survey [4] on OOD generalization that analyzes this problem systematically; interested readers can refer to it.
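
One way to write that goal down (my own notation, not a formula from the talk): with a collection of unseen test environments $e \in \mathcal{E}$, stable learning asks the error to be both low on average and nearly constant across environments:

```latex
\min_{f}\; \mathbb{E}_{e \in \mathcal{E}}\bigl[\mathcal{L}(f; D_e)\bigr]
\;+\; \lambda\, \mathrm{Var}_{e \in \mathcal{E}}\bigl[\mathcal{L}(f; D_e)\bigr].
```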

References

  • [1] Xun Zheng, Bryon Aragam, Pradeep K. Ravikumar, Eric P. Xing. DAGs with NO TEARS: Continuous Optimization for Structure Learning. Advances in Neural Information Processing Systems 31 (2018).
  • [2] Yue He, Peng Cui, et al. DARING: Differentiable Causal Discovery with Residual Independence. KDD, 2021.
  • [3] Yue He, Zimu Wang, Peng Cui, Hao Zou, Yafeng Zhang, Qiang Cui, Yong Jiang. CausPref: Causal Preference Learning for Out-of-Distribution Recommendation. TheWebConf (WWW), 2022.
  • [4] Zheyan Shen, Jiashuo Liu, Yue He, Xingxuan Zhang, Renzhe Xu, Han Yu, Peng Cui. Towards Out-Of-Distribution Generalization: A Survey. arXiv, 2021.


| This article was produced by the Meituan technical team, and the copyright belongs to Meituan. You are welcome to reprint or use the content of this article for non-commercial purposes such as sharing and communication, provided you note "Content reproduced from the Meituan technical team". This article may not be reproduced or used commercially without permission. For any commercial use, please email [email protected] to request authorization.