How can you go beyond manipulation when talking about related issues?

Photo by Zakaria Zayane on Unsplash

Correlation: a handicap in analytics

“Correlation is not causation” is a phrase you hear a lot in analytics (I’m going to abbreviate it to CINC from now on, and pronounce it “kink”). Many times in my career, I have seen a business analyst or data scientist put up a scatter plot showing the correlation between two variables A and B and issue this ritualized warning. Unfortunately, 90% of the time, they then go on to do one of two things.

  • Either they proceed as if the correlation were causal anyway: “We can see that the number of marketing emails received correlates with customer lifetime value. Of course, correlation is not causation. With that said, let’s now talk about how we can enhance our marketing efforts to increase customer lifetime value.” In this case, the CINC is little more than a thin disclaimer to cover the analyst’s back in case you foolishly believe their conclusions.
  • Or they say that unless you run a randomized experiment, you can’t draw any further conclusions. This approach, which is common among analysts with statistical training, has the advantage of being intellectually honest. In practice, however, business partners often just nod along, and once the speaker leaves the room, they make plans based on variable A causing variable B.

However, this sorry state of affairs doesn’t have to be the norm. Whenever we observe a correlation in the data, there are actually only a limited number of possible explanations besides variable A causing variable B.

  1. The observed correlation does not reflect a real correlation in the population of interest.
  2. Variable B causes variable A.
  3. Variables A and B have a common cause.
  4. There is a more complex causal structure at work.

1. No real correlation

In the simplest case, there is actually no correlation in the population of interest. This can happen in two ways: noise (also known as sampling variation) and bias.

Noise. First, if your sample is “too small,” or if you run too many analyses in a row (aka a fishing expedition), the observed correlation may just be a random fluke. This is a real problem, especially if you rely on p-values as a measure of significance rather than on confidence intervals to assess economic significance, but I won’t dwell on it. I think most people have a pretty good grasp of this trap, and in most business situations the sample isn’t that small. If you have a million rows, sampling variation should be low on your list of potential problems. If your sample is small, use a more robust metric, such as the median rather than the mean. People often underestimate the robustness of the median, even in very small samples (the math is in the appendix).
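To make the “random fluke” point concrete, here is a minimal simulation (entirely made-up data, not from the article) showing how often two truly independent variables look correlated when the sample is tiny:

```python
# Sketch of sampling noise: two independent variables still show
# |r| > 0.5 surprisingly often when each sample has only 5 points.
import random
import statistics

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(42)
n, trials = 5, 2_000
big_r = 0
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [random.gauss(0, 1) for _ in range(n)]  # independent of xs
    if abs(pearson_r(xs, ys)) > 0.5:
        big_r += 1

print(f"Share of 5-point samples with |r| > 0.5: {big_r / trials:.0%}")
```

With n = 5, roughly two fifths of the samples cross the |r| > 0.5 mark purely by chance; with a million rows, that share would be essentially zero.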

Bias. Bias occurs when your sample is not a good representation of the population you are interested in. For example, “all customers who had an active account last year” is usually a reasonable stand-in for “all customers who will have an active account next year.” On the other hand, “all customers who had an active account and provided an email address last year” is not. Bias is a more insidious problem than noise, because even very large samples can fall victim to it, as a recent study on COVID showed [1].

But avoiding bias, or at least recognizing it, doesn’t have to be complicated. Just write down the definition of your sample and the definition of your population of interest as precisely as possible. If your sample is truly a random sample from your population, you’re fine. In any other case, there may be bias: for example, if you randomly contact people in your population but your sample only includes those who answered, or who provided complete answers. Try to identify subcategories that belong to your population of interest but may be missing or underrepresented in your sample. To push it to the limit: if poor older women with disabilities and no Internet connection are part of your population, are you adequately reaching them?
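The “write both definitions down” check can be mechanized. This is a minimal sketch with invented numbers (the email-address example from above), flagging subgroups whose share in the sample deviates from their share in the population:

```python
# Hypothetical shares: suppose 60% of active customers gave an email
# address, but an email-based sample is 95% email-givers.
population_share = {"has_email": 0.60, "no_email": 0.40}
sample_share = {"has_email": 0.95, "no_email": 0.05}

def representation_gaps(pop, sample, tolerance=0.05):
    """Return subgroups whose sample share deviates from the
    population share by more than `tolerance` (absolute difference)."""
    return {group: round(sample[group] - pop[group], 2)
            for group in pop
            if abs(sample[group] - pop[group]) > tolerance}

gaps = representation_gaps(population_share, sample_share)
print(gaps)  # customers without an email address are badly underrepresented
```

The point is not the code but the discipline: any subgroup with a large gap is a candidate source of bias.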

If you’re thinking, “But that’s just a tiny fraction of my population!”, I ask you to think again. Subcategories can add up to a large share of your population, even if each of them is small; they may just seem small from your vantage point. I currently live in West Africa and recently had trouble updating my iPhone: it requires 1) downloading several gigabytes of data, 2) over WiFi (another phone’s hotspot doesn’t work), and 3) while charging. But in developing countries, the typical smartphone owner may not have WiFi at home (their smartphone is their only access to the Internet), and WiFi bandwidth in stores is often limited, assuming a store will even let you use an electrical outlet. This may be a “fringe case” if you live on the West Coast of the US, but it probably covers hundreds of millions, if not billions, of smartphone users!

2. Reverse causality (B causes A)

The next possibility is that the correlation between variables A and B results from variable B causing variable A, rather than the other way around. For example, the correlation between the number of marketing emails received and customer lifetime value could be due to the marketing department targeting its emails at high-LTV customers. Once you entertain that possibility, it’s generally pretty obvious whether it’s happening in your data.
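A quick simulation (all numbers invented) makes this concrete: below, LTV drives the number of emails through a hypothetical targeting rule, and the emails have zero causal effect on LTV, yet the two end up strongly correlated.

```python
# Reverse causality sketch: B (LTV) causes A (emails received),
# because marketing targets high-LTV customers.
import random
import statistics

random.seed(0)
ltv = [max(0.0, random.gauss(100, 30)) for _ in range(5_000)]
# Hypothetical targeting rule: roughly one email per $20 of LTV,
# plus a little noise. Emails have NO effect on LTV here.
emails = [int(v / 20) + random.randint(0, 2) for v in ltv]

mx, my = statistics.mean(emails), statistics.mean(ltv)
cov = sum((x - mx) * (y - my) for x, y in zip(emails, ltv))
r = cov / (sum((x - mx) ** 2 for x in emails) ** 0.5
           * sum((y - my) ** 2 for y in ltv) ** 0.5)
print(f"Correlation between emails and LTV: {r:.2f}")  # strongly positive
```

An analyst who sees only the scatter plot cannot distinguish this world from one where emails genuinely increase LTV; knowing how the emails are targeted can.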

3. Confounding factors (A and B share a common cause)

The final “easy” case is when A and B have a common cause. For example, perhaps marketing budgets are allocated at the state level in the United States, and at the national level internationally. A customer in California (resp. in the US) is then likely to both have a higher LTV and receive more marketing emails than a customer in Tennessee (resp. in Nigeria). Again, once you entertain that possibility, it’s generally pretty obvious whether it’s happening in your data.
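Here is a sketch of that situation with made-up numbers: region drives both the number of emails (via region-level budgets) and LTV, creating a pooled correlation that vanishes once you look within each region.

```python
# Confounding sketch: region is a common cause of emails and LTV.
import random
import statistics

def corr(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sum((x - mx) ** 2 for x in xs) ** 0.5
                  * sum((y - my) ** 2 for y in ys) ** 0.5)

random.seed(1)
# Hypothetical region effects: (mean emails, mean LTV) per region.
regions = {"CA": (10, 150), "TN": (4, 80)}
data = []
for region, (mean_emails, mean_ltv) in regions.items():
    for _ in range(2_000):
        # Within a region, emails and LTV are generated independently.
        data.append((region,
                     random.gauss(mean_emails, 1),
                     random.gauss(mean_ltv, 20)))

emails = [e for _, e, _ in data]
ltv = [v for _, _, v in data]
print(f"Pooled correlation: {corr(emails, ltv):.2f}")  # clearly positive
for region in regions:
    es, vs = zip(*[(e, v) for r, e, v in data if r == region])
    print(f"Within {region}: {corr(list(es), list(vs)):.2f}")  # near zero
```

Splitting the data by the suspected common cause, as in the last loop, is often all it takes to spot this case.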

4. Other situations (more complex causal structure)

The first three cases probably represent 90% of what you’ll encounter in practice, but technically they don’t cover all the possibilities. For the sake of completeness, I’ll briefly mention what else can be going on.

A more complex kind of causal structure arises when you explicitly or implicitly control for a variable that you shouldn’t control for. For example, an army doctor found that the use of a tourniquet on the battlefield was negatively associated with survival. The problem is that his analysis was based on soldiers arriving at field hospitals. But the main benefit of a tourniquet is that it keeps soldiers with severe wounds alive until they reach the hospital, instead of losing too much blood. That means more soldiers survive overall, but a smaller percentage of those who make it to the hospital survive, because we’re adding more severe cases to that group [2]. As a side note, this example can also be interpreted as a bias in data collection (i.e., the observed negative correlation is not representative of the population of interest), which suggests that data collection and data analysis are not as independent as is often assumed.
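A toy simulation of the tourniquet example (all probabilities invented) shows both effects at once: tourniquets genuinely help severe wounds, yet among hospital arrivals they look harmful, because conditioning on reaching the hospital selects the worst wounds into the tourniquet group.

```python
# Selection-bias sketch: conditioning on hospital arrival reverses
# the apparent effect of tourniquets.
import random

random.seed(7)
rows = []
for _ in range(100_000):
    severe = random.random() < 0.5
    # Medics apply tourniquets mostly to severe wounds (hypothetical rates).
    tourniquet = random.random() < (0.8 if severe else 0.1)
    # Tourniquets help severe cases reach the hospital alive.
    p_reach = (0.7 if tourniquet else 0.3) if severe else 0.95
    reached = random.random() < p_reach
    # Survival after arrival is lower for severe wounds.
    alive = reached and random.random() < (0.6 if severe else 0.95)
    rows.append((severe, tourniquet, reached, alive))

def survival(subset):
    return sum(alive for *_, alive in subset) / len(subset)

severe_t = [r for r in rows if r[0] and r[1]]
severe_no = [r for r in rows if r[0] and not r[1]]
print(f"Severe wounds, overall: {survival(severe_t):.0%} survive with a "
      f"tourniquet vs {survival(severe_no):.0%} without")  # tourniquet helps

reached_t = [r for r in rows if r[2] and r[1]]
reached_no = [r for r in rows if r[2] and not r[1]]
print(f"Hospital arrivals only: {survival(reached_t):.0%} with vs "
      f"{survival(reached_no):.0%} without")  # tourniquet looks harmful
```

The doctor saw only the second comparison, which is exactly the implicit “control” the paragraph warns about.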

Finally, we have situations that seem designed by nature to trip up and confuse scientists. Autism, for example, has been known for some time to be associated with a simpler gut microbiome (that is, a less diverse group of bacteria in the gut). Does this mean the microbiome causes autism? A recent study suggests that no, it’s the other way around: children with autism often have restrictive diets because sensory experiences overwhelm them, and a limited range of foods leads to a limited range of gut bacteria. But then, what explains why fecal transplants improve behavior in children with autism? An emerging hypothesis is that fecal transplantation improves the behavior of children with autism by alleviating discomfort directly caused by an unbalanced microbiome, without affecting the neurological basis of the condition [3]. The corresponding causal diagram therefore runs from autism through restrictive diet to the microbiome, with a separate arrow from the microbiome to behavior via discomfort.

Ultimately, science advances by developing ever more accurate and complete models that account for all the facts at hand. The same applies to business: achieving a deep understanding of customer (or employee) behavior requires building accurate cause-and-effect diagrams, as I explain in my book Behavioral Data Analysis with R and Python [4].

Recap and conclusion

Whenever you observe a correlation between variables A and B in your data, there are exactly four possibilities other than A causing B.

  1. The observed correlation does not reflect a true correlation in the population of interest, because of sampling noise or bias.
  2. Variable B causes variable A.
  3. Variables A and B have a common cause.
  4. There is a more complex causal structure at work.

That means you don’t have to limit yourself to “correlation is not causation.” By carefully considering the other possibilities and ruling them out one by one, you can conclude, “this correlation probably reflects causation, which we will confirm by running an A/B test once we have identified a course of action.” If things get too complicated, you can build cause-and-effect diagrams to work out what’s happening.

References

[1] news.harvard.edu/gazette/sto…

[2] This example comes from Judea Pearl and Dana Mackenzie, The Book of Why: The New Science of Cause and Effect.

[3] The Economist, “How intestinal microbiome disorders link to autism.”

[4] Florent Buisson, Behavioral Data Analysis with R and Python: Customer-Driven Data for Real Business Results.

You can also check out my previous posts on Medium.

  • Is your behavioral data real behavior?
  • Discard p values. Use the Bootstrap confidence interval instead
  • Is Zillow “cursed”? Behavioral economics perspective
  • What does a behavioral science manager at a Fortune 100 company do?

Appendix: robustness of the median estimator

Remember that, by definition, the population median is the value such that half the population is below it and half is above it. This holds regardless of the shape of the distribution, the number of peaks, and so on.

This means that if you randomly pick two values x and y from the population, there are three possibilities.

  • Both are below the population median, with probability 0.5 * 0.5 = 0.25.
  • Both are above the population median, with probability 0.25.
  • One is below the population median and one is above it, with probability 0.5.

More generally, if you have N values:

  • They are all below the median, with probability 0.5^N.
  • They are all above the median, with probability 0.5^N.
  • The median is between the lowest and highest of the N values, with probability 1 - 2 * (0.5^N).

That means that even with a sample of five values, there’s a 94% chance that the population median falls between the smallest and largest values in your sample. With 10 values, the probability is 99.8%. Now, I can’t guarantee that you’ll be happy with the width of this confidence interval, but at least you’ll have a very clear idea of the impact of sampling variation in the situation at hand.
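You can check both the closed form and the distribution-free claim directly; this sketch compares 1 - 2 * 0.5^N against a brute-force simulation (standard normal data, whose population median is 0, but any continuous distribution would give the same answer):

```python
# Verifying the appendix's arithmetic by simulation.
import random

def coverage_closed_form(n):
    """P(population median lies between the sample min and max)."""
    return 1 - 2 * 0.5 ** n

def coverage_simulated(n, trials=100_000, seed=123):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        # The population median of a standard normal is 0.
        if min(sample) < 0 < max(sample):
            hits += 1
    return hits / trials

for n in (5, 10):
    print(f"n={n}: closed form {coverage_closed_form(n):.3f}, "
          f"simulated {coverage_simulated(n):.3f}")
```

For n = 5 both approaches give about 0.938, and for n = 10 about 0.998, matching the figures above.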


“Correlation Is Not Causation… Or Is It?” was published in Towards Data Science on Medium, where people continue the conversation by highlighting and responding to the story.