The magic of conjugate priors (for online learning)

In Bayesian reasoning, the fundamental problem is the following. Given a prior distribution $@p(x)$@, and some set of evidence $@E$@, compute a posterior distribution on $@x$@ namely $@p(x | E)$@. For example, $@x$@ might be the conversion rate of some email. Before you have any evidence you might expect the conversion rate to be somewhere in the range of perhaps $@5\%$@ and $@50\%$@. After you have evidence, you update your belief - if you sent out thousands of emails and observed an empirical $@16.5\%$@ conversion rate, you are now reasonably confident that the true conversion rate lies roughly in the range of $@16\%-17\%$@.

In mathematics, a conjugate prior consists of the following. Consider a family of probability distributions characterized by some parameter $@\theta$@ (possibly a single number, possibly a tuple). A prior is a conjugate prior if it is a member of this family and if all possible posterior distributions are also members of this family.

The important object here is not the prior, but the family of distributions chosen to model the problem. This family is closed under evidence - regardless of what you observe, you will continue to believe that the posterior lives somewhere in this family. This fact tends to promote a large amount of algebraic simplicity in the computation of the posterior.

The problem

Consider a random process generating some evidence, $@E$@. As all good bayesians do, we develop a mathematical model of this random process, ideally characterized by some smallish number of parameters. I.e., we assume that:

$$ E \sim D(\theta)$$

where $@A \sim B$@ means that the evidence $@A$@ is generated by the probability distribution $@B$@. The parameter $@\theta$@ (which is likely multidimensional) is unknown, and it is our goal to estimate it. To begin with, all we know about $@\theta$@ is a prior, i.e. we believe that $@\theta \sim P$@. At the end of the day we want to compute the probability distribution on $@\theta$@ via Bayes rule, i.e.:

$$ p(\theta | E) = \frac{P(E | \theta) P(\theta) }{ P(E) } $$

Unfortunately, computing $@P(E)$@ is a somewhat complicated integral:

$$ P(E) = \int P(E | \theta) P(\theta) d\theta = \int D(\theta) P(\theta) d\theta $$

To recognize how difficult this can be in practice, suppose $@\theta$@ is a three dimensional parameter. Suppose we wanted to numerically approximate this integral via quadratures, taking perhaps 1000 grid points in each direction. In that case, the computational grid representing the function to be integrated has $@1000^3=10^9$@ data points. This is tractable, but just barely - a fourth dimension will be crippling.

After this integration is complete, we can approximate $@p(\theta | E)$@. To actually use it for any useful calculation, we will likely need to compute a second three-dimensional numerical integration.

Dimensions are easy to come by

Consider the problem of A/B testing two webpages. Each webpage generates user engagement which we model as a continuous quantity which is normally distributed. I.e., a visit to page $@A$@ generates $@N(\mu_A, \sigma_A)$@ units of value, and similarly for page $@B$@. In any bayesian method, the parameter governing the problem is $@\theta = (\mu_A, \sigma_A, \mu_B, \sigma_B)$@. Thus, to compute $@P(E)$@, we will need to compute:

$$ \int P(E | \mu_A, \sigma_A, \mu_B, \sigma_B) P(\mu_A, \sigma_A, \mu_B, \sigma_b) d\mu_A d\sigma_A d\mu_B d\sigma_B $$

With this simple model, we already have 4-dimensions to contend with..

Using a 1000-point quadrature would involve $@10^12$@ data points which is beyond the reach of most modern computers. This puts us firmly in the territory of monte carlo methods, which are notoriously less accurate and more complicated than simple quadratures. After this is completed, you still will likely need to compute some integral of the posterior, which again involves a monte carlo integral of an inaccurate result.

A problem for online learning

When you are trying to compute statistics in realtime, computing a $@10^9$@ data point integral is death. This rules out the use of large numerical quadratures in bandit algorithms, high frequency trading, realtime recommendation, etc.

Conjugate priors are pot brownies seasoned with unicorn farts

The magic of choosing a conjugate prior is that it will usually reduce the computation of the posterior from a tricky numerical integral to some simple algebra. For a variety of such families, a bayesian update consists of the following. You compute the initial parameters of your prior, $@\theta_0$@. Whenever you need to compute an update, you do some simple algebra of the form $@\theta | E = f(E, \theta_0)$@, where the function $@f$@ does not require computing any integrals.

Rather than speaking in abstractions, I'll provide some simple examples.

Example - the normal distribution in stock pricing

Consider the problem of stock pricing. A simple model of stock price movements is to treat them as lognormal - i.e., stock prices obey the formula:

$$ p(t+1) = p(t) \cdot e^{z} $$

where the variable $@z$@ is drawn from a normal distribution:

$$ z \sim \frac{1}{\sqrt{2\pi}} e^{-(z-\mu)^2/2 \sigma^2} = N(\mu, \sigma^2)$$

This is called the lognormal distribution because the log of the percentage change, $@\ln(p(t+1)/p(t))$@, is a normally distributed random variable.

If we are interested in understanding the movements of this security, we want to know the parameters of the normal distribution, namely $@\mu$@ and $@\sigma$@. So we observe the security over many days, and come up evidence of the form

$$z_1=\ln(p(1)/p(0)), ..., z_n=\ln(p(n)/p(n-1))$$

If we are Bayesians (and these days, who isn't?), we want to come up with an estimate on the parameter $@\mu$@ and $@\sigma$@ based on a prior and the available evidence. This is where we choose the conjugate prior - we assume that $@\mu$@ itself is normally distributed:

$$\mu \sim N( m, s^2 )$$

Now, given this prior, we want to compute:

$$ P(\mu | z_1+z_2+ ...+z_n) = \frac{ P(z_1+ ...+ z_n | \mu) P(\mu) } { P(z_1+ ...+ z_n) }$$

Letting $@\bar{z}=(z_1+z_2+...+z_n)/n$@ and doing a bunch of algebra, we find that:

$$ \mu \sim N\left( \frac{\sigma^2 m/n + s^2 \bar{x}}{\sigma^2/n + s^2}, \left(\frac{n}{\sigma^2} + \frac{1}{s^2} \right)^{-1} \right)$$

I.e., if the prior on $@\mu$@ is a normal distribution, then the posterior is as well. Thus, to compute a bayesian update, all we need to do is some simple algebra on a few parameters (namely $@\bar{z}, \sigma, m$@ and $@s^2$@).

There is no integration required to compute the posterior!

Furthermore, since the posterior is both explicitly known and a standard probability distribution, your favorite mathematical library almost certainly has well optimized methods for performing all the standard operations (e.g. drawing a sample, computing inverse probabilities, etc).

The Beta distribution for multi-armed bandits

Now let us turn to the classical problem in bandit algorithms. You have a set of $@K$@ slot machines ("one armed bandits"), which for simplicity we will assume have a binary payout (i.e., their payout is either 1 or 0). The model parameter governing the behavior of each machine is simply a payout probability, $@\theta_i \in [0,1]$@. The goal of the bandit algorithm is to maximize the expected payout as one plays the slot machines.

In practice, this consists of playing all machines for a while until an estimate of $@\theta_i$@ can be constructed, and eventually playing the machine for which $@\theta_i$@ is maximal. If we wish to do this in a Bayesian framework, it is particularly helpful to choose a Beta distribution as the prior:

$$ f(\theta; \alpha, \beta) = \frac{ \theta^{\alpha-1}(1-\theta)^{\beta-1} } { B(\alpha, \beta) } $$

where

$$ B(\alpha, \beta) = \int_0^1 t^{\alpha-1}(1-t)^{\beta-1} dt $$

is the usual beta function.

The reason is that the beta distribution is a conjugate prior to the binomial distribution. So now suppose we take as our prior

$$ \theta_i \sim B( \alpha_p^i, \beta_p^i) $$

Some straightforward algebra (see wikipedia for details) shows that if we have played the $@i$@-th slot machine $@n_i$@ times, and received a payout on $@s_i$@ of those plays, then:

$$ p(\theta_i | n_i, s_i) = Const \theta_i^{s_i} (1-\theta_i)^{n_i-s_i} f(\theta_i; \alpha_p^i, \beta_p^i) = \ldots \sim B(\alpha_p^i + s_i, \beta_p^i+n_i-s_i)$$

So gathering new evidence does little more (algebraically) than change the parameters of your beta distribution. After this simple step, you can perform any calculation you like (e.g., compute a Gittins index, or draw samples from the distribution).

Conclusion

Conjugate priors are pretty handy for a lot of statistical problems. Whenever you find yourself puzzling over some online learning problem, it's strongly worth going through a list of all the standard distributions to figure out if one of them fits you well. Even if it doesn't, a convex combination of them might - in that case, you need merely estimate your prior with conjugate priors, and then take convex combinations of the solution at the end of the day.

I'm generally a huge fan of numerical analysis, but I am coming around to the point of view of the classical analysts with their pen and paper calculations. If you can make your problem fit their methodology, you'll save a lot of computation.