Theoretical research on deep learning has been attracting more and more attention, yet machine learning also faces fundamental theoretical limitations.

So far, however, this issue does not seem to have drawn much attention.

Turing Award winner Judea Pearl, the father of Bayesian networks, recently uploaded his latest paper to arXiv, discussing the theoretical limitations of current machine learning and presenting seven "sparks" from causal reasoning.

You may remember the lonely presence of Judea Pearl at NIPS 2017, when his presentation on the limits of machine learning theory was thinly attended.



CMU Professor Eric Xing noted that Judea Pearl's talk was sparsely attended; photo courtesy of Zhou Zhihua.

The presentation, titled Theoretical Impediments to Machine Learning, presents Judea Pearl's reflections on the theory behind machine learning, especially deep learning.

Although we could not attend the talk in person, we can read Pearl's paper on the "Causal Revolution".

Judea Pearl: A lonely figure and the seven sparks of causal reasoning

Judea Pearl won the Turing Award in 2011 for his fundamental contributions to artificial intelligence: the development of probabilistic and causal reasoning calculi that transformed the field from its earlier rule-based and logic-based orientation. His main research areas are probabilistic graphical models and causal reasoning, both of which are foundational to machine learning. The Turing Award is more commonly given to scholars of pure computing or to pioneers of early computer architectures and frameworks.

Judea Pearl, a professor of computer science at UCLA, has twice stood at the center of a scientific revolution. The first was in the 1980s, when he introduced a new set of tools for artificial intelligence called Bayesian networks. The second came when, building on the computational advantages of Bayesian networks, Pearl realized that simple graphical models and probability theory (as in Bayesian networks) could also be used to reason about cause and effect. This finding laid another foundation for the development of artificial intelligence: a systematic mathematical method for establishing causal relationships, one that has since been adopted by almost every scientific and social-science field.

Judea Pearl is also a member of the National Academy of Engineering, an AAAI and IEEE Fellow, and president of the Daniel Pearl Foundation, named after his son, a Wall Street Journal reporter who was kidnapped and murdered by terrorists in Pakistan in 2002 (the events were later depicted in the film "A Mighty Heart").

Theoretical Impediments to Machine Learning, with Seven Sparks from the Causal Revolution



Abstract

Current machine learning systems operate almost entirely in a statistical, or model-blind, mode, which places strict theoretical limits on their power and performance. Such systems cannot reason about interventions or retrospection and therefore cannot serve as the basis for strong AI. To reach human-level intelligence, learning machines need to be guided by a model of reality, similar to the models used in causal reasoning tasks. To demonstrate the importance of such models, I present a summary of seven tasks that are beyond the reach of current machine learning systems and that have been accomplished using the tools of causal modeling.

Scientific background

If we examine the systems that drive machine learning today, we find that they operate almost entirely in a statistical mode. In other words, a learning machine improves its performance by optimizing parameters over a stream of sensory inputs from the environment. This is a slow process, similar in many ways to natural selection in Darwinian evolution.

That process explains how species such as eagles and snakes developed superb vision systems over millions of years. It does not, however, explain the super-evolutionary process by which humans built eyeglasses and telescopes over a mere few thousand years.

What humans possess, and other species lack, is a mental representation of their environment: the ability to interrogate that representation through acts of will, imagination, hypothesis, and planning, and to manipulate this mental blueprint of reality. Anthropologists such as Y. N. Harari and S. Mithen generally agree on this point.

The decisive factor that enabled our Homo sapiens ancestors to achieve global dominance some 40,000 years ago was their ability to build mental representations of their environment, interrogate those representations, distort them through acts of imagination, and finally answer questions of the form "What if?" These include interventional questions ("What if I take this action?") and retrospective, explanatory questions ("What if I had acted differently?", "What if we had banned smoking?"). The vast majority of machine learning systems today are not equipped to answer such questions.

I believe the key to answering such questions is to equip machines with tools for causal reasoning, thereby accelerating their learning toward the level of human cognition. This postulate would have been speculative twenty years ago, before counterfactuals were given a mathematical formulation, but it is not speculative today.

Advances in graphical and structural models have made counterfactual computation manageable, making model-driven reasoning a more promising direction on the path to strong AI. In the next section, I describe the obstacles facing machine learning systems in terms of a three-level hierarchy that governs inferences in causal reasoning. The final section summarizes how these obstacles can be circumvented using modern tools of causal inference.

Three levels of causation



Figure 1: The causal hierarchy. Questions at level i can only be answered if information from level i or higher is available.

A very useful insight revealed by the logic of causal reasoning is that there exists a clear classification of causal information in terms of the types of questions each category can answer.

This classification forms a three-level hierarchy, in the sense that questions at level i (i = 1, 2, 3) can only be answered if information from level j (j ≥ i) is available.

Figure 1 shows the three-level hierarchy, together with typical questions that can be answered at each level. The levels are: ① Association, ② Intervention, and ③ Counterfactuals. The names were chosen to emphasize their usage.

We call the first level Association because it invokes purely statistical relationships, defined by the bare data. For example, customers who buy toothpaste are more likely to also buy dental floss; this association can be inferred directly from observed data using conditional expectation. Questions at this level sit at the bottom of the hierarchy because they require no causal information.

The second level, Intervention, ranks higher than Association because it involves not just seeing, but changing what we see. A typical question at this level is: what will happen if we double the price? Such questions cannot be answered from sales data alone, because they involve a change in customers' behaviour in reaction to the new pricing, a reaction that may differ significantly from what was observed during previous price increases (unless we exactly replicate the market conditions that existed when the price was twice its current value).

Finally, the top level is called Counterfactuals, a term that goes back to the philosophers David Hume and John Stuart Mill and that has been given computer-friendly semantics over the past two decades. A typical question in the counterfactual category is "What if I had acted differently?", which requires retrospective reasoning.

Counterfactuals are placed at the top of the hierarchy because they subsume interventional and associational questions. If we have a model that can answer counterfactual questions, we can also use it to answer questions about interventions and associations. For example, the interventional question "What will happen if we double the price?" can be answered by asking the counterfactual question "What would happen had the price been twice its current value?" Likewise, once we can answer interventional questions, associational questions can be answered as well.

The converse does not hold: a model built for associational questions cannot answer the higher-level ones. For instance, we cannot re-run an observational study on patients who took a medication to see how they would have behaved had they not taken it. The hierarchy is therefore directional, with the top level being the most powerful.

Counterfactuals are a cornerstone of scientific thinking, as well as of legal and moral reasoning. In civil court, for example, a defendant is considered responsible for an injury if, but for the defendant's action, the injury most likely would not have occurred. The computational meaning of "but for" requires comparing the real world with an alternative world in which the defendant's action did not take place.

Each level in the hierarchy has a syntactic signature that characterizes the sentences admitted at that level. The association level, for example, is characterized by conditional-probability sentences such as P(y | x) = p, which asserts that, given that we observed event X = x, the probability of event Y = y equals p. In large systems, such evidential sentences can be computed efficiently using Bayesian networks, or by any of the neural networks that underlie deep learning systems.
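As a rough illustration of what level-1 reasoning amounts to, here is a minimal sketch that estimates a conditional probability directly from simulated purchase data. The variable names and probabilities (toothpaste, floss, a latent "dental care" tendency) are illustrative assumptions, not taken from Pearl's paper.

```python
import numpy as np

# Minimal sketch of level-1 (association) reasoning: estimate P(floss | toothpaste)
# purely from observed data. All names and probabilities are illustrative assumptions.
rng = np.random.default_rng(0)
n = 100_000

# Hypothetical shopping data: a latent "dental care" tendency drives both purchases.
care = rng.random(n) < 0.4
toothpaste = rng.random(n) < np.where(care, 0.9, 0.3)
floss = rng.random(n) < np.where(care, 0.7, 0.1)

# Conditional probability (a conditional expectation over binary outcomes):
p_floss_given_toothpaste = floss[toothpaste].mean()
p_floss = floss.mean()

print(f"P(floss)              ≈ {p_floss:.3f}")
print(f"P(floss | toothpaste) ≈ {p_floss_given_toothpaste:.3f}")
# The association (the two probabilities differ) is read directly off the data;
# no causal assumption about *why* they co-occur is needed at this level.
```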

At the intervention level we find sentences of the type P(y | do(x), z), which denotes "the probability of event Y = y given that we intervene to set the value of X to x and subsequently observe event Z = z." Such expressions can be estimated experimentally from randomized controlled trials, or analytically using causal Bayesian networks (Pearl, 2000, Chapter 3). A child learns the effects of interventions by playfully manipulating its environment (usually in a deterministic playground), and AI planners acquire interventional knowledge by exercising the actions assigned to them. Interventional expressions cannot be inferred from passive observation alone, no matter how large the data.
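The gap between "seeing" and "doing" can be made concrete with a toy structural causal model. The sketch below, with an assumed confounder Z and made-up coefficients (not from the paper), simulates both the observational world and the "mutilated" model in which X is set by intervention.

```python
import numpy as np

# Minimal sketch of level-2 (intervention) reasoning on a toy structural causal
# model Z -> X, Z -> Y, X -> Y. Equations and coefficients are illustrative assumptions.
rng = np.random.default_rng(1)
n = 200_000

def simulate(do_x=None):
    z = rng.normal(size=n)                      # confounder
    x = 0.8 * z + rng.normal(size=n) if do_x is None else np.full(n, do_x)
    y = 1.5 * x + 2.0 * z + rng.normal(size=n)  # outcome depends on both
    return z, x, y

# Observational world: "seeing" X near 1 mixes in the effect of the confounder Z.
z, x, y = simulate()
seeing = y[np.abs(x - 1.0) < 0.05].mean()

# Interventional world: "doing" X = 1 cuts the Z -> X arrow (a mutilated model).
_, _, y_do = simulate(do_x=1.0)
doing = y_do.mean()

print(f"E[Y | X ≈ 1]      ≈ {seeing:.2f}   (association)")
print(f"E[Y | do(X = 1)]  ≈ {doing:.2f}   (intervention; true value is 1.5)")
```

The two quantities differ precisely because the observational estimate absorbs the confounder's influence; no amount of passive data closes that gap without causal assumptions.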

Finally, at the counterfactual level we have expressions of the type P(y_x | x', y'), which stands for "the probability that event Y = y would have been observed had X been x, given that we actually observed X to be x' and Y to be y'." For example, the probability that Joe's salary would be y had he finished college, given that his actual salary is y' and that he had only two years of college. Such sentences can be computed only when we possess functional or structural equation models, or properties of such models (Pearl, 2000, Chapter 7).
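When a structural equation model is fully specified, counterfactuals of this kind can be computed by the standard three steps of abduction, action, and prediction. The sketch below uses an assumed linear salary model as a stand-in for the "Joe" example; the equation and numbers are illustrative, not Pearl's.

```python
import numpy as np

# Minimal sketch of level-3 (counterfactual) reasoning via abduction-action-prediction,
# on a toy linear structural model: salary = 10 + 3 * education + u.
# The model and numbers are illustrative assumptions.

def salary(education, u):
    return 10.0 + 3.0 * education + u        # structural equation f(X, U)

# Observed facts about one individual: 2 years of college, salary 18.5.
x_obs, y_obs = 2.0, 18.5

# Step 1 (abduction): infer this individual's background factor U from the evidence.
u_joe = y_obs - (10.0 + 3.0 * x_obs)         # here U is exactly identified: u = 2.5

# Step 2 (action): modify the model, setting education to the hypothetical value.
x_counterfactual = 4.0                        # "had he finished college"

# Step 3 (prediction): evaluate the modified model with the same U.
y_counterfactual = salary(x_counterfactual, u_joe)

print(f"Observed:       X = {x_obs}, Y = {y_obs}")
print(f"Counterfactual: had X been {x_counterfactual}, Y would have been {y_counterfactual}")
```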

This hierarchy, and the formal restrictions it entails, explains why machine learning systems based only on statistics cannot reason about actions, experiments, or explanations. It also tells us what extra-statistical information is needed, and in what format, to support these modes of reasoning.

Researchers are often surprised that this hierarchy places the achievements of deep learning at the level of association, alongside textbook curve-fitting exercises. One objection to this comparison is that deep learning tries to minimize "overfitting" while curve fitting aims to maximize "fit". Unfortunately, the theoretical barriers separating the three levels of the hierarchy tell us that the nature of the objective function does not matter: as long as a system optimizes some property of the observed data, without reference to the world beyond the data, it remains at the first level of the hierarchy, with all the limitations that level entails.

The 7 pillars of causal models: what can you do with a causal reasoning model?

Consider these five questions:

  • How effective is a given therapy in treating a particular disease?
  • Is the new tax credit causing sales to rise?
  • Is the annual rise in medical costs due to an increase in obesity?
  • Can hiring records prove sex discrimination?
  • Should I give up my job?

The common feature of these questions is that they concern cause-and-effect relationships, recognizable through words such as "cause", "due to", "prove", and "should". Such words are common in everyday language, and society has always demanded answers to such questions. Until very recently, however, science gave us no adequate means even to articulate these questions, let alone answer them. Unlike the rules of geometry, mechanics, optics, or probability theory, the rules of cause and effect were long considered unsuitable for mathematical analysis.

But things have changed dramatically over the past 30 years. A powerful and transparent mathematical language has been developed for handling causality, along with a set of tools that turn causal analysis into a mathematical game. These tools allow us to express causal questions formally and then use data to estimate the answers.

This is what I call the "Causal Revolution" (Pearl and Mackenzie, 2018, forthcoming), and the mathematical framework underlying it is what I call "Structural Causal Models" (SCM).

The SCM framework consists of three parts: graphical models, structural equations, and counterfactual and interventional logic.

Graphical models serve as the language for representing what we know, counterfactual logic helps to articulate what we want to know, and structural equations tie the two together with unambiguous semantics.




I’ll take a look at the seven most important features of the SCM framework and discuss how each makes a unique contribution to automated reasoning.

1. Encoding causal assumptions: transparency and testability

Once we get serious about the requirements of transparency and testability, the task of coding assumptions in a compact, usable form is no simple matter. Transparency enables the analyst to discern whether the encoded assumptions are sound (based on scientific evidence) or whether additional assumptions are necessary. Testability allows us (whether analysts or machines) to determine whether the coded assumptions are compatible with the available data and, if not, identify those assumptions that need to be fixed.

Advances in graphical models have made compact encoding possible. Their transparency stems from the fact that all assumptions are encoded graphically, mirroring the way researchers perceive cause-and-effect relationships in their domain; judgments about counterfactuals or statistical dependencies are not required, since these can be read off the structure of the graph. Testability is facilitated by a graphical criterion called d-separation, which provides the fundamental connection between causes and probabilities: it tells us, for any given pattern of paths in the model, which pattern of dependencies should hold in the data (Pearl, 1988).
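To give a feel for the kind of testable implication d-separation exposes, here is a minimal simulation on an assumed chain X → Z → Y (my own toy model, not one from the paper). The graph implies that X and Y become independent once Z is held fixed, and that implication can be checked against data.

```python
import numpy as np

# Minimal sketch of a testable implication exposed by d-separation.
# In the assumed toy chain X -> Z -> Y, the graph says X is independent of Y given Z.
rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(size=n)
z = 0.9 * x + rng.normal(size=n)
y = 1.2 * z + rng.normal(size=n)

def partial_corr(a, b, c):
    """Correlation of a and b after linearly removing the conditioning variable c."""
    ra = a - np.polyval(np.polyfit(c, a, 1), c)
    rb = b - np.polyval(np.polyfit(c, b, 1), c)
    return np.corrcoef(ra, rb)[0, 1]

print(f"corr(X, Y)     ≈ {np.corrcoef(x, y)[0, 1]:.3f}   (dependent: path X->Z->Y is open)")
print(f"corr(X, Y | Z) ≈ {partial_corr(x, y, z):.3f}   (≈ 0: Z blocks the path)")
# If the data violated this vanishing partial correlation, the chain model would be refuted.
```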

2. Do-calculus and the control of confounding

Confounding, the presence of unobserved factors that affect two or more observed variables, has long been regarded as the main obstacle to drawing causal inferences from data. Deconfounding can be achieved using a graphical criterion called the "back-door" criterion: the task of selecting an appropriate set of covariates to control for confounding reduces to a simple "roadblocks" puzzle that can be solved with a simple algorithm (Pearl, 1993).

For models in which the back-door criterion does not hold, there is a symbolic engine called do-calculus, which predicts the effect of policy interventions whenever such predictions are feasible, and exits with failure whenever the effect cannot be determined from the stated assumptions (Pearl, 1995; Tian and Pearl, 2002; Shpitser and Pearl, 2008).
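Do-calculus itself is a symbolic engine, but the adjustment formula licensed by the back-door criterion can be checked numerically. The sketch below uses an assumed binary model with an observed confounder Z (all probabilities are my own illustrative choices); adjusting for Z recovers the causal effect that the naive contrast misses.

```python
import numpy as np

# Minimal sketch of back-door adjustment on an assumed toy model
# Z -> X, Z -> Y, X -> Y with binary variables. The true effect of
# do(X=1) vs do(X=0) is 0.2 by construction; numbers are illustrative.
rng = np.random.default_rng(3)
n = 500_000

z = rng.random(n) < 0.5                              # observed confounder
x = rng.random(n) < np.where(z, 0.8, 0.2)            # treatment depends on Z
y = rng.random(n) < 0.1 + 0.2 * x + 0.5 * z          # outcome depends on X and Z

# Naive associational contrast (biased by Z):
naive = y[x].mean() - y[~x].mean()

# Back-door adjustment: P(y | do(x)) = sum_z P(y | x, z) P(z)
def adjusted(x_val):
    total = 0.0
    for z_val in (False, True):
        mask = (x == x_val) & (z == z_val)
        total += y[mask].mean() * (z == z_val).mean()
    return total

print(f"naive difference      ≈ {naive:.3f}")
print(f"back-door adjustment  ≈ {adjusted(True) - adjusted(False):.3f}  (true effect 0.2)")
```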

3. The algorithmization of counterfactuals

Counterfactual analysis deals with the behavior of specific individuals, identified by a distinct set of characteristics. For example, given that Joe's salary is Y = y and that he spent X = x years in college, what would his salary have been had he stayed in college for one more year?

One of the crowning achievements of the Causal Revolution has been to formalize counterfactual reasoning within the graphical representation, the very representation researchers use to encode scientific knowledge. Every structural equation model determines the truth value of every counterfactual sentence. Hence, we can determine algorithmically whether the probability of a given sentence is estimable from experimental or observational studies, or from a combination of the two (Balke and Pearl, 1994; Pearl, 2000, Chapter 7).

Of special interest in causal discourse are counterfactual questions concerning "causes of effects" (as opposed to "effects of causes"): for example, what is the probability that Joe's swimming exercise was a necessary (or sufficient) cause of Joe's death (Pearl, 2015a; Halpern and Pearl, 2005)?

4. Mediation analysis and evaluation of direct and indirect effects

Mediation analysis concerns the mechanisms that transmit changes from a cause to its effect. Identifying such intermediate mechanisms is essential for generating explanations, and counterfactual logic must be invoked to facilitate this identification. The graphical representation of counterfactuals enables us to define direct and indirect effects and to determine the conditions under which these effects can be estimated from data or from experiments (Robins and Greenland, 1992; Pearl, 2001; VanderWeele, 2015). A typical question answerable by this analysis is: what fraction of the effect of X on Y is mediated by a given variable Z?
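In the special case of a linear model the decomposition is particularly transparent: the indirect effect is the product of the path coefficients through the mediator, and the direct effect is what remains. The sketch below uses assumed coefficients (a = 0.7, b = 2.0, direct effect 1.0), purely for illustration.

```python
import numpy as np

# Minimal sketch of mediation analysis in an assumed linear model
# X -> M -> Y and X -> Y. Coefficients are illustrative assumptions.
rng = np.random.default_rng(4)
n = 200_000
x = rng.normal(size=n)
m = 0.7 * x + rng.normal(size=n)            # mediator
y = 1.0 * x + 2.0 * m + rng.normal(size=n)  # outcome

# Total effect: regress Y on X alone.
total = np.polyfit(x, y, 1)[0]

# Direct effect: coefficient of X when M is also included (least squares).
X_design = np.column_stack([x, m, np.ones(n)])
direct = np.linalg.lstsq(X_design, y, rcond=None)[0][0]

indirect = total - direct
print(f"total effect      ≈ {total:.2f}   (expected 1.0 + 0.7*2.0 = 2.4)")
print(f"direct effect     ≈ {direct:.2f}   (expected 1.0)")
print(f"indirect effect   ≈ {indirect:.2f}   (mediated by M; expected 1.4)")
print(f"fraction mediated ≈ {indirect / total:.2f}")
```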

5. External validity and sample selection bias

The validity of every experimental study is threatened by disparities between the experimental setting and the setting in which the results are deployed. A machine trained in one environment cannot be expected to perform well when the environmental conditions change, unless the changes are localized and identifiable. This problem, in its various manifestations, is well recognized by machine learning researchers under headings such as domain adaptation, transfer learning, lifelong learning, and explainable AI. These, however, are only subtasks that researchers and funding agencies pursue in an attempt to chip away at the broader problem of robustness.

Unfortunately, the problem of robustness requires a causal model of the environment and cannot be handled at the level of association, where most remedies have so far been attempted. Associations alone cannot identify the mechanisms affected by the changes that occurred. The do-calculus discussed earlier offers a complete methodology for overcoming the biases introduced by environmental change, both by re-calibrating learned policies to circumvent environmental changes and by controlling for the bias of non-representative samples (Bareinboim and Pearl, 2016).
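A very simplified flavor of such "transport" can be given by re-weighting. In the assumed setting below, the effect of X on Y varies with an observed covariate Z, and only Z's distribution differs between the study population and the deployment population; the general case requires the full graphical transportability analysis of Bareinboim and Pearl (2016), which this sketch does not attempt.

```python
import numpy as np

# Minimal sketch of re-weighting a source-population effect estimate to a target
# population whose covariate distribution differs. Setup is an illustrative assumption.
rng = np.random.default_rng(5)
n = 200_000

def trial(p_z):
    """Simulated randomized trial: the effect of X on Y is 1.0 when Z=0 and 3.0 when Z=1."""
    z = rng.random(n) < p_z
    x = rng.random(n) < 0.5                          # randomized treatment
    y = np.where(z, 3.0, 1.0) * x + rng.normal(size=n)
    return z, x, y

z_src, x_src, y_src = trial(p_z=0.2)                 # source: Z=1 is rare
effect_by_z = {zv: y_src[(z_src == zv) & x_src].mean() - y_src[(z_src == zv) & ~x_src].mean()
               for zv in (False, True)}

p_z_target = 0.7                                     # target population: Z=1 is common
naive = y_src[x_src].mean() - y_src[~x_src].mean()
transported = (1 - p_z_target) * effect_by_z[False] + p_z_target * effect_by_z[True]

print(f"source-population effect ≈ {naive:.2f}")
print(f"transported to target    ≈ {transported:.2f}  (true target effect = 0.3*1 + 0.7*3 = 2.4)")
```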

6. Missing data

The problem of missing data plagues every branch of experimental science. Respondents do not answer every item on a questionnaire, sensors fail as environmental conditions change, and patients often drop out of clinical studies for unknown reasons. The rich literature on this problem is wedded to a model-blind paradigm of statistical analysis and is therefore severely limited to situations in which the data are missing at random, that is, independently of the values taken by other variables in the model. Using a causal model of the missingness process, we can now recover causal and probabilistic relationships from incomplete data and obtain consistent estimates of the desired relationships whenever certain conditions on the model are met (Mohan and Pearl, 2017).
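Here is a minimal sketch of why a model of the missingness process matters. In the assumed setup, the outcome Y goes missing more often when an observed variable Z is high; a naive average over the observed cases is then biased, while weighting by the (modeled) probability of being observed recovers the true mean. This is only an inverse-probability-weighting illustration, not the general graphical treatment of Mohan and Pearl (2017).

```python
import numpy as np

# Minimal sketch: missingness depends on an observed Z, so the naive mean is biased,
# but a model of the missingness process corrects it. Setup is an illustrative assumption.
rng = np.random.default_rng(6)
n = 200_000

z = rng.random(n) < 0.5
y = 1.0 + 2.0 * z + rng.normal(size=n)                # true mean of Y is 2.0
p_observe = np.where(z, 0.3, 0.9)                     # high-Z cases are often missing
observed = rng.random(n) < p_observe

naive = y[observed].mean()                            # biased toward low-Z cases
weights = 1.0 / p_observe[observed]                   # inverse-probability weights
corrected = np.average(y[observed], weights=weights)

print(f"naive mean over observed cases ≈ {naive:.2f}")
print(f"weighted (model-based) mean    ≈ {corrected:.2f}   (true mean 2.0)")
```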

7. Causal discovery

The d-separation criterion described earlier enables us to detect and enumerate the testable implications of a given causal model. This makes it possible to reason with hypotheses that are only partially specified, and to represent compactly the set of models compatible with the data. Systematic searches have been developed that, in certain circumstances, prune the set of compatible models down to a point where causal questions can be answered directly from that set (Spirtes et al., 2000; Pearl, 2000; Peters et al., 2017).
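The flavor of constraint-based discovery can be shown on three variables. In the sketch below the data are generated from an assumed collider X → Z ← Y; the observed pattern of (conditional) independencies is compatible only with that structure among three-node models, so both edge orientations can be read off the data.

```python
import numpy as np

# Minimal sketch of constraint-based causal discovery on three variables,
# in the spirit of PC-style algorithms. The generating structure (a collider
# X -> Z <- Y) is an illustrative assumption.
rng = np.random.default_rng(7)
n = 100_000
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + rng.normal(size=n)

def partial_corr(a, b, c):
    ra = a - np.polyval(np.polyfit(c, a, 1), c)
    rb = b - np.polyval(np.polyfit(c, b, 1), c)
    return np.corrcoef(ra, rb)[0, 1]

print(f"corr(X, Y)     ≈ {np.corrcoef(x, y)[0, 1]:.3f}")
print(f"corr(X, Y | Z) ≈ {partial_corr(x, y, z):.3f}")
# X and Y are independent marginally but become dependent once Z is conditioned on:
# this pattern singles out the collider X -> Z <- Y among three-node structures.
```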

Conclusion

The philosopher Stephen Toulmin saw the model-based versus model-blind dichotomy as the key to understanding the rivalry between Babylonian and ancient Greek science. According to Toulmin, Babylonian astronomers were masters of black-box prediction, far surpassing their Greek rivals in accuracy and consistency (Toulmin, 1961, pp. 27-30). Yet science favored the creative, speculative strategy of the Greek astronomers, with all its wild metaphysical imagery: circular tubes full of fire, small holes through which the fire could be seen, and a hemispherical Earth riding on the backs of turtles. It was this wild modeling strategy, however, that made possible one of the most creative experiments of the ancient world, Eratosthenes' (276-194 BC) measurement of the radius of the Earth. That would never have happened in Babylon.

Returning to strong AI, we have seen that model-blind approaches suffer from intrinsic limits on the cognitive tasks they can perform. We have described some of these tasks, shown how they are accomplished in the SCM framework, and explained why a model-based approach is essential for performing them. The overall conclusion is that human-level AI cannot emerge solely from model-blind learning machines; it requires a symbiotic collaboration of data and models.

Data science is a science only to the extent that it facilitates the interpretation of data: it is a two-body problem connecting data to reality. Data alone, however big and however skillfully manipulated, is not a science.


The original post was published on January 16, 2018

This article is from xinzhiyuan, a partner of the cloud community. For more information, follow the WeChat public account "AI_era".

Turing Award winner Judea Pearl: machine learning cannot be the foundation of strong AI; the breakthrough lies in the "Causal Revolution"