Let me start with a resolution I posted at the beginning of this year:



My New Year’s resolution: Every chart I draw during 2018 will include an estimate of uncertainty

Why this resolution? Because I’m tired of hearing people argue at conferences about whether the number of widgets is going up or down every month, or whether widget method X is more efficient than widget method Y.

I also suspected that quantifying uncertainty would be useful for almost any chart, so I started trying.

I soon found, however, that I had dug myself a deep hole. A few months later:



Now that we’re four months into the year, I can tell you: the waters of uncertainty run deep.

I never formally studied statistics – I picked it up sort of backwards through machine learning – so I’m a bit of a hacker, slowly teaching myself statistics as I go. At the start of this year I knew only the basics of bootstrapping and confidence intervals, but over time I’ve worked through the full gamut, from Monte Carlo methods to inverse Hessians.

These methods have served me well, and I want to share the practical lessons I’ve learned this year.

Start with the data

I believe you can’t really learn anything without concrete examples, so let’s make some data. We’ll generate a fake time series with dates ranging from 2017-07-01 to 2018-07-31; let’s say the series is the observed weight of an elephant.

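Here is a minimal sketch of one way to generate such a series, assuming numpy and pandas. The generating process (a linear trend plus Gaussian noise, with k = 200, m = 1000, σ = 100) is chosen to match the parameter values we recover at the end of the post; the exact code is an assumption.

import numpy as np
import pandas as pd

np.random.seed(0)
dates = pd.date_range('2017-07-01', '2018-07-31')
x = np.arange(len(dates)) / 365.0  # time in years since the start
# linear trend plus Gaussian noise: k = 200, m = 1000, sigma = 100
ys = 200.0 * x + 1000.0 + np.random.normal(0, 100, size=len(dates))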




Before we do anything else, let’s plot the data to see what’s going on!

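Something like this, assuming matplotlib:

import matplotlib.pyplot as plt

plt.scatter(dates, ys, alpha=0.5, s=10)
plt.xlabel('Date')
plt.ylabel('Weight (kg)')
plt.show()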

To start, we won’t use any fancy model; we’ll just break the series into buckets and compute the average of each bucket. But first, let’s pause for a moment to talk about uncertainty.

Data distribution and uncertainty

I’ve always been confused about what “uncertainty” means, but I think it’s important to figure this out. There are several different quantities whose distributions we can estimate:

1. The data itself. Given a time interval (t, t′), what is the distribution of the elephant’s weight during that interval?

2. The uncertainty of some parameter, such as the parameter k in the linear relation y = kt + m, or of some estimator, like the mean of a number of observations.

3. The uncertainty of predicted values. If we predict that the elephant’s weight at some date t (possibly in the future) will be y kilograms, we want to know the uncertainty of that prediction y.

Let’s start with the most basic model – just breaking the problem into buckets. If you want to learn some basic concepts about distributions and uncertainty estimation, I recommend the Seaborn package. Seaborn usually operates on data frames, so we need to convert:

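A sketch of the conversion; the column names are assumptions, chosen to match the barplot call later in the post:

import seaborn

d = pd.DataFrame({'Date': dates, 'Weight (kg)': ys})
d['Month'] = d['Date'].dt.strftime('%Y-%m')  # bucket by calendar month
seaborn.boxplot(data=d, x='Month', y='Weight (kg)')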




The final chart shows the distribution of the data set. Now let’s try to figure out the uncertainty of a very common estimator: the mean!

Calculating uncertainty – the normal distribution of the mean

Under some mild assumptions (which I’ll come back to in a moment), we can compute a confidence interval for the mean estimator:

$$\bar{x} \pm 1.96 \frac{\sigma}{\sqrt{n}}$$

Here, $\bar{x}$ is the mean, $\sigma$ is the standard deviation (the square root of the variance), and $n$ is the number of samples. I don’t think it’s very important to memorize this formula, but it is useful to remember that the width of the confidence interval is inversely proportional to the square root of the number of samples. For example, this is useful when you are running A/B tests – if you want to detect a 1% difference, you need roughly $0.01^{-2} = 10{,}000$ samples. (This is just a rule of thumb; don’t use it for your medical device software.)

By the way – where did the number 1.96 come from? It sets the width of the uncertainty estimate: ±1.96 standard deviations covers about 95% of a normal probability distribution.

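A minimal sketch applying the formula to each monthly bucket, continuing with the frame d from above:

# 95% confidence interval for the mean of each monthly bucket
stats = d.groupby('Month')['Weight (kg)'].agg(['mean', 'std', 'count'])
stderr = stats['std'] / stats['count'] ** 0.5
plt.errorbar(stats.index, stats['mean'], yerr=1.96 * stderr,
             fmt='o', capsize=4)
plt.ylabel('Weight (kg)')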




Note that this is the uncertainty of the mean, which is not the same thing as the distribution of the data itself. That’s why far fewer than 95% of the blue points fall inside the red shaded area. If we added more points, the red shaded area would get narrower and narrower, while the blue points would keep roughly the same spread. In theory, though, the true mean should fall inside the red shaded area 95% of the time.

I mentioned earlier that the confidence interval formula relies on some mild assumptions. What are they? The assumption of normality. Thanks to the central limit theorem, it also holds approximately for a large number of observations from almost any distribution.

Confidence intervals when all outcomes are 0 or 1

Let’s look at a kind of data set I use a lot: conversions. For the sake of argument, let’s assume we’re running an A/B test that has some impact, and that we’re trying to understand how conversion rates differ by state. Each conversion outcome is 0 or 1. The code that generates this dataset is not very important, so don’t focus on it too much:

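For completeness, here is one hypothetical way such a dataset could be built – the state list, rates, and group effect below are all made up:

np.random.seed(1)
states = ['CA', 'NY', 'TX', 'FL', 'WA', 'IL']  # made-up subset of states
rows = []
for state in states:
    base_rate = np.random.uniform(0.05, 0.15)  # per-state conversion rate
    for group in ('control', 'test'):
        n = np.random.randint(50, 500)         # users in this bucket
        p = base_rate * (1.2 if group == 'test' else 1.0)
        k = np.random.binomial(n, p)           # how many converted
        rows.append({'state': state, 'group': group, 'n': n, 'k': k})
conv = pd.DataFrame(rows)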

For each state and each “group” (test and control), we generated n users, k of whom converted. Let’s plot the conversion rate for each state and see what’s going on!

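A quick sketch of that plot, one point per (state, group) bucket:

for group, marker in (('control', 'o'), ('test', 's')):
    sub = conv[conv['group'] == group]
    plt.scatter(sub['state'], sub['k'] / sub['n'], marker=marker, label=group)
plt.ylabel('Conversion rate')
plt.legend()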




Since all outcomes are 0 or 1, and are drawn with the same (unknown) probability, we know that the number of ones and zeros follows a binomial distribution. This means that the uncertainty of the conversion rate in the “k out of n users converted” situation follows a Beta distribution.

Memorizing the confidence interval formula here has served me well, and I think I’m more likely to reach for it than for the (normal-based) formula from before. In particular, remember:

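Given the numbers quoted just below (3 conversions out of 100 visitors giving 0.6%–7.1%), this is the Beta quantile one-liner from scipy:

import scipy.stats

n, k = 100, 3
lo, hi = scipy.stats.beta.ppf([0.025, 0.975], k, n - k)
print(f'95% CI: {lo:.1%} - {hi:.1%}')  # roughly 0.6% - 7.1%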

You plug in the values of n and k and you get a 95% confidence interval. In this example, we see that if we have 100 site visitors and 3 of them buy the product, the confidence interval is 0.6%-7.1%.

Let’s try it with our data set:

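Something like this, continuing with the hypothetical conv frame from above:

for _, row in conv.iterrows():
    rate = row['k'] / row['n']
    lo, hi = scipy.stats.beta.ppf([0.025, 0.975], row['k'], row['n'] - row['k'])
    plt.errorbar(row['state'], rate, yerr=[[rate - lo], [hi - rate]],
                 fmt='o' if row['group'] == 'test' else 's', capsize=4)
plt.ylabel('Conversion rate')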




Oh, not bad

Bootstrapping algorithm

Another useful method is bootstrapping, which lets you compute the same statistics without memorizing any formulas. The core idea: compute the mean over n bootstrap samples, where each bootstrap sample is a random sample drawn with replacement from our observations. For each bootstrap sample we compute the mean, and then we take the 2.5th and 97.5th percentiles of those means as the confidence interval:

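A sketch of that procedure for the monthly elephant weights:

months = sorted(d['Month'].unique())
for month in months:
    w = d[d['Month'] == month]['Weight (kg)'].values
    # resample with replacement and recompute the mean, many times
    boot_means = [np.random.choice(w, len(w)).mean() for _ in range(1000)]
    lo = np.percentile(boot_means, 2.5)
    hi = np.percentile(boot_means, 97.5)
    print(f'{month}: 95% CI {lo:.0f}-{hi:.0f} kg')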




Amazingly, these graphs are very similar to the previous ones! (As we should have expected)

Bootstrapping is great because it lets you sidestep any questions about what probability distribution generated the data. It may be a bit slow, but it’s basically plug and play, and it works for almost every problem.

Note, however, that bootstrapping has its limits. My understanding is that the bootstrap converges to the correct estimates as the number of samples goes to infinity, but with small sample sets you can get unreliable results. I generally don’t trust bootstrapping for problems with fewer than roughly 50 samples, and you probably shouldn’t either.

As a side note, Seaborn’s barplot actually uses bootstrapping to plot confidence intervals:


seaborn.barplot(data=d, x='Month', y='Weight (kg)')



Again, Seaborn is great for exploratory analysis, and some of its charts can do basic statistics for you.

Regression

Let’s take it up a notch. We’re going to do a linear regression over a bunch of points.

I’ll do it in what I think is the most common way: define a model (in this case a straight line), a loss function (the squared deviations from that line), and then optimize it with a general-purpose solver (scipy.optimize.minimize).

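A minimal sketch of that approach, reusing the x and ys arrays from earlier:

import scipy.optimize

def squared_loss(params):
    k, m = params
    return np.sum((ys - (k * x + m)) ** 2)

res = scipy.optimize.minimize(squared_loss, x0=(0.0, 0.0))
k_hat, m_hat = res.x
plt.scatter(dates, ys, alpha=0.3, s=10)
plt.plot(dates, k_hat * x + m_hat, color='red')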




Linear regression with uncertainty, using maximum likelihood

We just fit k and m, but there’s no uncertainty estimate here. There are several things whose uncertainty we could estimate, but let’s start with the uncertainty of the predicted values.

We can do this by fitting a normal distribution around the line while fitting k and m, and I’ll use maximum likelihood to do it. If you’re not familiar with this method, fear not! If there’s one statistical method that is both easy to implement (it’s basic probability theory) and useful, this is it.

In fact, minimizing the squared loss (which we just did in the previous snippet) is a special case of maximum likelihood! Minimizing the squared loss is the same thing as maximizing the logarithm of the probability of all the data, usually called the “log-likelihood”.

So we already have an expression for the squared loss. If we treat the variance $\sigma^2$ as another unknown variable, we can fit it at the same time! The quantity we are now trying to minimize is

$$-\log L = n \log\left(\sqrt{2\pi}\,\sigma\right) + \sum_i \frac{(y_i - \hat{y}_i)^2}{2\sigma^2}$$

where $\hat{y}_i = k x_i + m$ is the predicted value from our model. Let’s try to fit it!

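A sketch of the fit; I parameterize with log σ so the optimizer can’t wander into negative values (a detail the original may have handled differently):

def neg_log_likelihood(params):
    k, m, log_sigma = params
    sigma = np.exp(log_sigma)
    y_hat = k * x + m
    return (len(x) * np.log(np.sqrt(2 * np.pi) * sigma)
            + np.sum((ys - y_hat) ** 2 / (2 * sigma ** 2)))

res = scipy.optimize.minimize(neg_log_likelihood, x0=(0.0, 0.0, 0.0))
k_hat, m_hat, sigma_hat = res.x[0], res.x[1], np.exp(res.x[2])
# the fitted sigma gives a +/- 1.96 sigma prediction band around the line
plt.scatter(dates, ys, alpha=0.3, s=10)
plt.fill_between(dates, k_hat * x + m_hat - 1.96 * sigma_hat,
                 k_hat * x + m_hat + 1.96 * sigma_hat,
                 color='red', alpha=0.2)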






The uncertainty estimate here isn’t actually complete, because it doesn’t take into account the uncertainty of k, m, and σ themselves. It’s a decent approximation, but to get it right we need to estimate all of these things together. So let’s do that.

Bootstrapping again!

So let’s take it to the next level and try to estimate the uncertainty of k, m, and σ! I think this shows how universally bootstrapping applies – you can plug it into almost anything to estimate uncertainty.

For each bootstrap estimate, I’ll draw a line. We can then take all of those lines and compute confidence intervals:

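A sketch: refit the line on each resampled dataset and collect the lines:

lines = []
for _ in range(200):
    idx = np.random.randint(0, len(x), len(x))  # resample with replacement
    x_b, y_b = x[idx], ys[idx]
    res_b = scipy.optimize.minimize(
        lambda p: np.sum((y_b - (p[0] * x_b + p[1])) ** 2), x0=(0.0, 0.0))
    lines.append(res_b.x[0] * x + res_b.x[1])
lo, hi = np.percentile(lines, [2.5, 97.5], axis=0)
plt.scatter(dates, ys, alpha=0.3, s=10)
plt.fill_between(dates, lo, hi, color='red', alpha=0.2)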




Wow, what’s going on here? This uncertainty band is very different from the previous one. It seems confusing until you realize the two plots show two very different things:

1. The maximum-likelihood plot shows the uncertainty of the predicted values, which includes the noise σ around the line.

2. The bootstrapped lines show the uncertainty of the inferred line itself, i.e. of the parameters k and m.
It turns out we can combine the two approaches and make it more sophisticated: for each bootstrap sample, fit k, m, and σ simultaneously, and then for each of those estimates draw a random predicted value y. Let’s do that:

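A sketch combining the two: refit (k, m, σ) by maximum likelihood on each bootstrap sample, then draw a noisy prediction from each fit:

preds = []
for _ in range(200):
    idx = np.random.randint(0, len(x), len(x))
    x_b, y_b = x[idx], ys[idx]
    res_b = scipy.optimize.minimize(
        lambda p: (len(x_b) * np.log(np.sqrt(2 * np.pi) * np.exp(p[2]))
                   + np.sum((y_b - (p[0] * x_b + p[1])) ** 2
                            / (2 * np.exp(p[2]) ** 2))),
        x0=(0.0, 0.0, 0.0))
    k_b, m_b, sigma_b = res_b.x[0], res_b.x[1], np.exp(res_b.x[2])
    # draw a random prediction around the refitted line
    preds.append(k_b * x + m_b + np.random.normal(0, sigma_b, len(x)))
lo, hi = np.percentile(preds, [2.5, 97.5], axis=0)
plt.scatter(dates, ys, alpha=0.3, s=10)
plt.fill_between(dates, lo, hi, color='red', alpha=0.2)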

Oh, not bad again! It now looks very similar to the maximum-likelihood plot – and if you look closely, you can see a subtle hyperbolic shape!

The trick here is that for each bootstrap estimate of (k, m, σ), we also need to draw a random prediction. As you can see in the code, we add a normally distributed random variable to the predicted value of y. That’s also why the band ends up so much wider and noisier.

Unfortunately, bootstrapping is rather slow for this problem – for every bootstrap sample we need to fit a model from scratch. Let’s look at another option:

Markov chain Monte Carlo method

Now it’s going to get a little wild

I’ll switch to some Bayesian methods, where we estimate k, m, and σ by drawing samples. It’s similar to bootstrapping, but MCMC has a better theoretical basis (we use Bayes’ rule to sample from a “posterior distribution”), and it’s usually orders of magnitude faster.

To do this, we’ll use a library called emcee, which I find quite handy. All it needs is a log-likelihood function, and we’ve already defined one! We just take the negative of the negative log-likelihood.

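A sketch using the emcee 3.x API and the neg_log_likelihood function from above:

import emcee

def log_prob(params):
    return -neg_log_likelihood(params)

ndim, nwalkers = 3, 10
p0 = np.random.randn(nwalkers, ndim) * 0.1  # random starting points
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, 2000)
samples = sampler.get_chain(discard=500, flat=True)  # drop burn-in steps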

Let’s plot the sample values of k and m!

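A sketch that draws one faint line per posterior sample (remembering that the third parameter is log σ in this parameterization):

plt.scatter(dates, ys, alpha=0.3, s=10)
for k_s, m_s, _ in samples[np.random.randint(len(samples), size=200)]:
    plt.plot(dates, k_s * x + m_s, color='green', alpha=0.05)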




There’s a bit of art to these methods – sampling is a bit finicky and needs some hand-holding to work properly. I won’t go into all the details here, and I’m a layman myself. But it’s usually orders of magnitude faster than bootstrapping, and it handles situations with less data much better.

Finally, we have posterior samples of k, m, and σ. We can look at the probability distributions of these unknowns:

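For example with simple histograms:

for i, name in enumerate(['k', 'm', 'sigma']):
    vals = np.exp(samples[:, 2]) if name == 'sigma' else samples[:, i]
    plt.figure()
    plt.hist(vals, bins=50)
    plt.title(f'Posterior of {name}')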




You can see that the distributions center around k = 200, m = 1000, and σ = 100 – exactly the values we used to generate the data! That’s reassuring.

Finally, we can use the same trick as with the bootstrap to plot the full uncertainty of the predictions:

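Same idea as before: for each posterior sample, draw a noisy prediction around the sampled line:

preds = []
for k_s, m_s, log_sigma_s in samples[np.random.randint(len(samples), size=400)]:
    preds.append(k_s * x + m_s
                 + np.random.normal(0, np.exp(log_sigma_s), len(x)))
lo, hi = np.percentile(preds, [2.5, 97.5], axis=0)
plt.scatter(dates, ys, alpha=0.3, s=10)
plt.fill_between(dates, lo, hi, color='red', alpha=0.2)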




These Bayesian methods don’t end here. In particular, there are several libraries built to solve exactly these problems. It turns out that if you express the problem in a more structured way (rather than just as a negative log-likelihood function), you can scale the sampling to really large problems (with thousands of unknown parameters). For Python, there are PyMC3 and PyStan and, to a slightly lesser extent, Edward and Pyro.

Conclusion

It may seem like I’ve led you into a deep rabbit hole – but these methods can be taken much further. In fact, forcing myself to estimate the uncertainty of anything I do has been a great experience, because it forced me to learn a lot about statistics.

Making decisions based on data is hard! But if we were more rigorous about quantifying uncertainty, we might make better decisions. It’s not exactly easy to do right now, but I do hope that more convenient tooling will help these methods take off.


