### Why Multi-armed Bandit algorithms are superior to A/B testing

In a recent post, a company selling A/B testing services made the claim that A/B testing is superior to bandit algorithms. They do make a compelling case that A/B testing is superior to one particular not very good bandit algorithm, because that particular algorithm does not take into account statistical significance.

However, there are bandit algorithms that account for statistical significance.

# Simple algorithms

To begin, let me discuss the simple algorithms, namely epsilon-greedy (“20 lines of code that will beat A/B testing every time”) and epsilon-first (A/B testing).

An important disclaimer: *I’m glossing over a lot of math to simplify this post*. The target audience of this post is web programmers, not mathematicians.

## Epsilon-greedy algorithms

The original post described an epsilon-greedy algorithm:

```
inputs:
machines :: [ GamblingMachine ]
num_plays :: Map(GamblingMachine -> Int)
total_reward :: Map(GamblingMachine -> Float)
select e // In the original post, e = 0.1
while True:
x <- UNIFORM([0,1]) // A uniformly distributed random variable
if x < e: //exploration phase
let m = random_choice(machines)
reward <- play_machine( m )
num_plays[m] += 1
total_reward[m] += reward
else: //exploitation phase
average_rewards = Map( (m, total_reward[m] / num_plays[m]) for m in machines )
best_machine = argmax( average_rewards ) //Find the machine with the highest reward
reward <- play_machine( best_machine )
num_plays[best_machine] += 1
total_reward[best_machine] += reward
```

Basically, exploration with probability `e`

and explotation with probability `1-e`

.

This is not a good algorithm. Consider the scenario of two machines - machine A has reward 1, machine B has reward 0. Even after the epsilon-greedy algorithm has converged, 10% of the time a random machine will be chosen. So given “convergence”, the odds of choosing machine B are still:

```
P( exploration) * P(machine B | exploration) = e * (1/2)
```

So your total reward can be at most:

```
num_plays * (1 - e/2)
```

I.e., if `e=0.1`

(as advocated in the original blog post), you are leaving $5 on the table for every $100 you could potentially earn.

## A/B testing or Epsilon-first algorithms

The approach that Visual Website Optimizer advocates is A/B testing, which is called epsilon-first in the formal literature. The way it works is that you explore for a finite number of plays (the A/B test) and exploit forever after:

```
for i in 1...N:
m <- choose_machine // Can be done at random, i % num_machines, etc
reward <- play_machine( m )
save_results(m, reward, i)
best_machine_estimate = statistical_test( N, reward_history )
while True:
play_machine( best_machine_estimate )
```

I won’t get into the specifics of how Visual Website Optimizer set up their A/B test, since *they did it wrong* (see their code). In particular, what they did is A/B testing with repeated peeks - they measured statistical significance repeatedly throughout the A/B test. Rather than explaining why this is wrong, I’ll just refer to the excellent post How Not to Run an A/B Test, by Evan Miller. Seriously, go read that article.

The key point in understanding epsilon-first algorithms is recognizing that `statistical_test`

has an error rate (which is dependent on `N`

as well as the prior distribution of machine rewards). Rather than getting into the details of statistical testing, I’ll just make up a symbol for it, `Err`

.

Let us now suppose the epsilon-first algorithm has finished the exploration phase. In that case:

```
P( best_machine_estimate == true_best_machine ) = 1 - Err
```

Suppose again that `true_best_machine`

has reward 1, and the other worse machines all have reward 0. Then in the exploitation phase, your reward is:

```
num_plays x P( best_machine_estimate == true_best_machine) = num_plays x (1 - Err)
```

If you run your A/B tests with 5% confidence levels (as most people do), you are again leaving $5 on the table for every $100 you could earn.

In the epsilon-greedy case, the most likely scenario is that you collect $95/$100. In the epsilon-first case, the most likely scenario is that you collect $100, and there is a 5% chance you collect $0. But in both cases, you are leaving money on the table.

# UCB1 - A superior bandit algorithm

I’ll now discuss a somewhat more complicated algorithm. It’s taken from the paper Finite-time Analysis of the Multiarmed Bandit Problem, by Auer, Cesa-Bianchi and Fischer, Machine Learning, 47, 235-256, 2002.

The algorithm is called UCB1, and it’s based on the following idea. Figure out which machine *could* be the best possible one (based on your statistical level of confidence), and play that machine as much as possible. Note that I didn’t say which machine *is* the best, but which one *could* be the best.

There are two ways a machine *could* be the best:

- You have measured it’s mean payout to be higher than all the others, and you are certain of this with a high degree of statistical significance.
- You have not measured it’s mean payout very well, and the confidence interval is large.

The algorithm is based on the Chernoff-Hoeffding bound. Suppose we have a random generator `rand`

. Repeated calls to `rand`

are assumed to be independent of each other, and it’s range is bounded:

```
x <- rand
assert(0 <= x && x <= 1)
```

Now suppose we run `N`

trials and compute the average:

```
avg = 0
for i in 1..N:
avg += rand
avg /= N
```

Define `m`

to be the theoretical statistical mean of `rand`

(i.e., if `rand`

were a fair coin flip, `m=0.5`

). Then `avg`

is an approximation to `m`

.

Chernoff-Hoeffding says that for any `a > 0`

, we have the following bound:

```
P( avg > m + a) <= 1 / exp(2 * a^2 N)
```

I.e., the probability that `avg`

overestimates `m`

by at least `a`

decreases exponentially with `N`

. This is just a way of quantifying something we already know intuitively - the more trials we run, the better our estimate of the mean of a random variable.

If we reverse Chernoff-Hoeffding and do some arithmetic (which is already done in the paper, so I won’t repeat it here), we can get the following *stylized fact* (i.e., it isn’t true, but is intuitively useful):

```
true_mean(m) <= avg[m] + sqrt( 2 ln(N) / num_plays[m] )
```

Here `m`

is a machine, `avg[m]`

the average reward observed so far from machine `m`

, `num_plays[m]`

the number of times machine `m`

has been played, and `N`

the total number of plays across *all machines*. From here on out, I’ll refer to `sqrt( 2 ln(N) / num_plays[m] )`

as the *confidence bound*.

Finally, the algorithm `UCB1`

is as follows:

```
Play each machine once.
while True:
best_possible_true_mean = { m -> avg[m] + sqrt( 2 ln(N) / num_plays[m] ) }
// This is pseudocode for a "dict comprehension"
// I.e. best_possible_true_mean[m] = avg[m] + ...
to_play = argmax(best_possible_true_mean)
reward <- play_machine(to_play)
update avg[to_play], num_plays[m], N
```

This algorithm starts off in a pure exploration phase, since `avg[m]`

will be small compared to the confidence bound. For example, with `N=10`

and `num_plays[m]=5`

, we find that the confidence bound is 0.9597. For comparison, `avg[m]`

is some number bounded by 1.

However, for larger values of `N`

, `ln(N)`

grows very slowly, and eventually `ln(N) / num_plays[m]`

will become small. With `N=400`

and `num_plays[m]=200`

, we find the confidence bound has dropped to 0.2447. If `avg[m_best] = 0.50`

and `avg[m_worst] = 0.20`

, the algorithm will start playing `m_best`

most of the time.

At this point, `N`

and `num_plays[m_best]`

are increasing, but `num_plays[m_worst]`

is remaining constant. This means the confidence bound for `m_worst`

is slowly increasing. At `N=8000`

, the confidence bound on the worst machine is 0.2997, while the confidence bound on the best is only 0.04800. We will continue playing machine `m_best`

for a while. For simplicity, I’ll assume `avg[m_best]=0.5`

and `avg[m_worst]=0.2`

on *every play*. (In reality this won’t be true, I’m just trying to illustrate a point here.)

```
N = 5000
avg[m_best] + sqrt(2*ln(N)/num_plays[m_best]) = 0.5596
avg[m_worst] + sqrt(2*ln(N)/num_plays[m_worst]) = 0.4918
N = 9000
avg[m_best] + sqrt(2*ln(N)/num_plays[m_best]) = 0.5455
avg[m_worst] + sqrt(2*ln(N)/num_plays[m_worst]) = 0.5017
N=25000
avg[m_best] + sqrt(2*ln(N)/num_plays[m_best]) = 0.5286
avg[m_worst] + sqrt(2*ln(N)/num_plays[m_worst]) = 0.5182
N = 36422
avg[m_best] + sqrt(2*ln(N)/num_plays[m_best]) = 0.52408152492
avg[m_worst] + sqrt(2*ln(N)/num_plays[m_worst]) = 0.524082215906
```

At N=36422, we will play machine `m_worst`

once. Now `num_plays[m_worst]=201`

. And at this point, we go back to playing machine `m_best`

:

```
N = 36423
avg[m_best] + sqrt(2*ln(N)/num_plays[m_best]) = 0.5240815563951478,
avg[m_worst] + sqrt(2*ln(N)/num_plays[m_worst]) = 0.5232754585667818
```

The key point here is that `ln(N)`

grows very slowly relative to `num_plays[m_worst]`

. So while the exploration process continues forever, the fraction of time we spend exploring decreases exponentially.

Another important fact is that the exploration time is adaptive - if `avg[m_best]`

and `avg[m_worst]`

are very close together (i.e., 0.5 and 0.51), then we will spend more time exploring until we are finally confident of which machine is the best one.

There is a theorem proven in the paper which says that the regret of this algorithm grows only logarithmically. This means that your reward will be proportional to:

```
num_plays - log(num_plays)
```

This value will *always* be larger than either `num_plays * (1-e/2)`

or `num_plays * (1 - Err)`

provided `num_plays`

is sufficiently large.

This means that UCB1 will always beat both A/B testing and “20 lines of code…”. (Though actually, UCB1 will take only marginally more than 20 lines of code.)

# Conclusion

If you don’t do any of this stuff (A/B, epsilon-greedy, MAB), pick whichever one is easiest and do it. Any of these fine optimization methods are *vastly* better than nothing. If you really want to be optimal, try UCB1 or any of the more modern adaptive bandit methods. They are marginally better than A/B tests.

I’ve also written quite a bit more on bandit algorithms since this post was published. I’ve described my favorite bandit algorithm, the Bayesian Bandit, which achieves similar accuracy to UCB1. I wrote this blog post about Bayesian A/B testing. And I wrote a post about measuring a changing conversion rate, which (when combined with the Bayesian Bandit) can handle the case of time varying conversion rates.