In a previous job, I built a machine learning system to detect financial fraud. Fraud was a big problem at the time - for the sake of nice round numbers, suppose 10% of attempted transactions were fraudulent. My machine learning system worked great - as a further set of made-up round numbers, let's describe it as having a precision and recall of 50% each. All this took a fantastic bite out of the fraud problem.

It worked so well that fraud dropped by well over 50% - because of the effort involved in getting past the system, fraudsters just gave up and stopped trying to scam us.

Suddenly the system's performance tanked - recall stayed at 50% but precision dropped to 8%! After some diagnosis, I discovered the cause was the following - all the fraudsters had gone away. For every fraud attempt, the system had a 50% chance of flagging it. For every non-fraudulent transaction, the system had a 5.5% chance of flagging it.

Early on, fraud attempts made up 10% of our transactions. Thus, for every 1000 transactions, we would flag 50 of the 100 fraudulent transactions and 50 of the 900 good transactions. This means that for every 10 flags, 5 are correct - hence a precision of 50%.

Once the fraudsters fucked off, fraud attempts dropped to perhaps 1% of our transactions. For every 1000 transactions, only 10 were fraudulent. We would flag 5 of them, along with 5.5% x 990 legitimate transactions = 54 transactions. The net result is that only 5 of the 59 transactions we flagged as fraudulent actually were, for a precision of 8%.
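The arithmetic above fits in a few lines (a toy sketch; `fpr` is the 5.5% false positive rate from the story):

```python
# Toy precision calculation: recall and the false positive rate stay fixed,
# but precision collapses as the fraud base rate falls.
def precision(base_rate, recall=0.5, fpr=0.055):
    true_positives = base_rate * recall
    false_positives = (1 - base_rate) * fpr
    return true_positives / (true_positives + false_positives)

print(precision(0.10))  # about 0.50 at a 10% base rate
print(precision(0.01))  # about 0.08 at a 1% base rate
```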

This phenomenon is called **label shift**. The problem with label shift is that the base rate for the target class changes with time and this significantly affects the precision of the classifier.

In general, the problems I'm interested in have the following characteristics:

- data sets that are not too large - potentially under 100k samples.
- base rates for the positive class in the ballpark of 0.1% to 5%.

These kinds of problems are typical in security, fraud prevention, medicine, and other situations of attempting to detect harmful anomalous behavior.

For most classifiers the ultimate goal is to make a decision. The decision is taken in order to minimize some loss function which represents the real world cost of making a mistake.

Consider as an example a classifier used to predict a disease. Let us define $x$ to be our feature vector, $z = z(x)$ to be our risk score, and $y \in \{0, 1\}$ to be whether or not the patient actually has the disease.

A loss function might represent the loss in QALYs (quality-adjusted life years) from making an error. Concretely, suppose that a failure to diagnose the disease results in the immediate death of the patient - this is a loss of `78 - patient's age` QALYs (see footnote 1). On the flip side, treatment is also risky - perhaps 5% of patients are allergic and also die instantly. This is an expected loss of `5% x (78 - patient's age)` QALYs.

Represented mathematically, our loss function (for decision $d \in \{0, 1\}$, i.e. non-treatment or treatment) is:

$$ \ell(d, y) = \begin{cases} (78 - \text{age}) & d = 0,\ y = 1 \\ 0.05 \times (78 - \text{age}) & d = 1 \\ 0 & d = 0,\ y = 0 \end{cases} $$

Let us also suppose that we have a calibrated risk score, i.e. a monotonically increasing function $c(z)$ with the property that $c(z) = P(y = 1 \mid z)$. For a given patient, the expected loss from treatment is therefore:

$$ 0.05 \times (78 - \text{age}) $$

while the expected loss from non-treatment is:

$$ c(z) \times 0.95 \times (78 - \text{age}) $$

(the factor of $0.95$ reflecting that a sick patient would have faced the same 5% allergy risk under treatment anyway). The expected loss from treatment exceeds the expected loss from non-treatment when $c(z) < 0.05 / 0.95 \approx 0.0526$, so the optimal decision rule is to treat every patient with a (calibrated) risk score larger than 0.0526 while letting the others go untreated.

Let's study this from the perspective of score distributions. Suppose that $f_0(z)$ is the pdf of the score distribution $P(z \mid y = 0)$ and $f_1(z)$ is the pdf of the distribution $P(z \mid y = 1)$. For simplicity, assume these densities are monotonic.

Suppose now that the base rate is $\pi = P(y = 1)$. In this framework, a label shift can be represented simply as a change in $\pi$.

It is straightforward to calculate the calibration curve (as a function of $z$) via Bayes' rule:

$$ c(z) = \frac{\pi f_1(z)}{\pi f_1(z) + (1 - \pi) f_0(z)} $$
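As an illustration, here's that formula evaluated with hypothetical Beta-distributed score densities (the choice of Beta distributions is mine, purely for this sketch):

```python
from scipy.stats import beta

f0 = beta(1, 8).pdf  # hypothetical scores of the negative class, piled near 0
f1 = beta(8, 1).pdf  # hypothetical scores of the positive class, piled near 1

def calibration(z, pi):
    # c(z) = pi*f1(z) / (pi*f1(z) + (1-pi)*f0(z))
    return pi * f1(z) / (pi * f1(z) + (1 - pi) * f0(z))

# The same raw score maps to very different probabilities as pi changes.
print(calibration(0.65, pi=0.10))
print(calibration(0.65, pi=0.01))
```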

As is apparent from this formula, a change in $\pi$ will result in a change in calibration. The following graph provides an example:

Let's consider the effect of this on decision-making. Going back to our disease example above, suppose that at model training/calibration time, $\pi$ takes some baseline value. Then a disease outbreak occurs and $\pi$ increases. The decision rule based on the training data (with the old $\pi$) says to treat any patient with a raw score of 0.65 or greater.

But once $\pi$ increases, the actual infection probability of a person with a raw score of 0.65 is nearly 40%. As per the loss function calculation earlier, we want to treat any patient with a 5.26% or greater chance of being sick!

In the literature, when making batch predictions, there's a known technique for solving this (see discussion 2). The basic idea is the following. For a set of raw risk scores $z_1, \ldots, z_N$, we know they are drawn from the mixture distribution:

$$ f(z) = \pi f_1(z) + (1 - \pi) f_0(z) $$

Thus, one can estimate $\pi$ via the maximum likelihood principle (although the literature describes a slightly different approach 3):

$$ \hat{\pi} = \operatorname*{argmax}_{\pi} \prod_{i=1}^{N} \left[ \pi f_1(z_i) + (1 - \pi) f_0(z_i) \right] $$

Maximizing this is straightforward - take logs, compute the derivative with respect to $\pi$, and hand both to scipy.optimize.minimize.
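A minimal sketch of that estimate on synthetic data (the Beta score densities and all constants here are my own stand-ins):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import beta

f0, f1 = beta(1, 8).pdf, beta(8, 1).pdf  # stand-in class-conditional pdfs

# Draw scores from the mixture with a known base rate, then recover it.
rng = np.random.default_rng(0)
pi_true = 0.03
is_positive = rng.random(20000) < pi_true
z = np.where(is_positive,
             beta(8, 1).rvs(20000, random_state=rng),
             beta(1, 8).rvs(20000, random_state=rng))

def neg_log_likelihood(pi):
    return -np.log(pi * f1(z) + (1 - pi) * f0(z)).sum()

result = minimize(neg_log_likelihood, x0=[0.5], bounds=[(1e-6, 1 - 1e-6)])
pi_hat = result.x[0]  # should land close to pi_true
```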

The method described above is strongly sensitive to the assumption that the *shape* of the distribution $f_1$ of the positive class does not change, only its amplitude $\pi$.

However in practice, we often discover that $f_1$ changes with time as well. For example, consider again the example of disease prediction - a new strain of a known disease may have a somewhat different symptom distribution in the future than in the past. However, it is a reasonable assumption that the shape of $f_0$ remains the same; healthy people do not change their health profile until they become infected.

Thus, the more general situation I'm considering is a mix of label shift/base rate changes, together with small to moderate changes in the distribution of the *exceptional class only*. By "exceptional class", I mean "sick" (in disease prediction), "fraud" (in fraud prevention), essentially the uncommon label which corresponds to something anomalous.

In general, it is impossible to solve this problem 5. However, if we stay away from this degenerate case (see footnote 5), it's actually quite possible to solve this problem and estimate both the new shape of $f_1$ and $\pi$. The main restriction is that the new $f_1$ is not too different from the old one, but right now I don't have a good characterization of what "not too different" actually means.

In the training phase, we have a labelled data set on which we can train any sort of model that generates risk scores $z$. We will assume that in this data set, the risk scores are drawn from $f_0$ if $y = 0$ and from $f_1$ if $y = 1$.

In the prediction phase we will consider batch predictions. We receive a new set of feature vectors and we can of course use the already trained classifier to generate risk scores $z_1, \ldots, z_M$. Our goal is to generate, for each data point, a calibrated risk score $c(z_i) = P(y_i = 1 \mid z_i)$.

Without label shift there is a standard approach to this, implemented in sklearn as sklearn.calibration.CalibratedClassifierCV. Typically this involves running isotonic regression on a subset of the training data, and the mapping $z \mapsto c(z)$ is the result of this.
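For reference, the static sklearn approach looks like this (synthetic data; this is the baseline that breaks under label shift):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced synthetic data: roughly 3% positives.
X, y = make_classification(n_samples=5000, weights=[0.97], random_state=0)

# Isotonic calibration learned on cross-validation folds of the training set.
clf = CalibratedClassifierCV(LogisticRegression(max_iter=1000),
                             method="isotonic", cv=5)
clf.fit(X, y)
proba = clf.predict_proba(X)[:, 1]  # calibrated for the TRAINING base rate only
```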

That does not work in this case because the curve $c(z)$ computed in the training phase will be for the *wrong* distribution. The figure "Illustration of calibration curves changing with base rate" illustrates this - isotonic calibration may correctly fit the curve in the training phase, but if the correct curve in the prediction phase is different, that fit is no longer valid. This blog post aims to address that problem.

The approach I'm taking is upgrading the maximum likelihood estimation to a maximum a posteriori (MAP) estimation.

I first parameterize the shape of the exceptional label's score distribution as $p_q(z)$, with parameter vector $q$. I then construct a Bayesian prior on it which is clustered near the training-time $f_1$. It is straightforwardly derived from Bayes' rule that:

$$ P(q, \pi \mid z_1, \ldots, z_M) \propto P(z_1, \ldots, z_M \mid q, \pi)\, p(q)\, p(\pi) $$

For simplicity I'm taking $p(\pi) = 1$, a uniform prior on $[0, 1]$.

Once the posterior is computed, we can replace *maximum likelihood* with *maximum a posteriori* estimation. This provides a plausible point estimate for $(q, \pi)$ which we can then use for calibration.

The first step is doing kernel density estimation in one dimension in a manner that respects the domain of the function. Gaussian KDE does NOT fit the bill here because the support of a Gaussian kernel is $(-\infty, \infty)$, not $[0, 1]$. One approach (which is somewhat technical and which I couldn't make performant) is using beta-kernel KDE instead 4. An additional technical challenge with traditional KDE approaches is that whatever approach is taken, it also needs to fit into a max-likelihood/max-a-posteriori type method.

I took a simpler approach and simply used linear splines in a manner that's easy to work with in scipy. Suppose we have node points $0 = \zeta_0 < \zeta_1 < \ldots < \zeta_n = 1$. Then let us define the distribution $p_q(z)$ as a normalized piecewise linear function:

$$ p_q(z) = h_i + (h_{i+1} - h_i)\frac{z - \zeta_i}{\zeta_{i+1} - \zeta_i} $$

for $z \in [\zeta_i, \zeta_{i+1}]$, with $h_i$ defined as

$$ h_i = \frac{e^{q_i}}{Z} $$

and

$$ Z = \sum_{i=0}^{n-1} \frac{\left(e^{q_i} + e^{q_{i+1}}\right)\left(\zeta_{i+1} - \zeta_i\right)}{2} $$

so that $\int_0^1 p_q(z)\,dz = 1$.

I chose this parameterization because `scipy.optimize.minimize` doesn't do constrained optimization very well. With this parameterization, all values of $q$ yield a valid probability distribution on $[0, 1]$.

Python code implementing this is available in the linked notebook, implemented as `PiecewiseLinearKDE`. Calculations of the gradient $\nabla_q p_q(z)$ - used in numerical optimization - can also be found in that notebook. Most of it is straightforward.

Fitting a piecewise linear distribution to data is only a few lines of code:

```python
from scipy.optimize import minimize

def objective(q):
    p = PiecewiseLinearKDE(zeta, q)
    return -1 * np.log(p(z) + reg).sum() / len(z)

def deriv(q):
    p = PiecewiseLinearKDE(zeta, q)
    return -1 * p.grad_q(z) @ (1.0 / (p(z) + reg)) / len(z)

result = minimize(objective, jac=deriv,
                  x0=np.zeros(shape=(len(zeta) - 1,)),
                  method='tnc', tol=1e-6, options={'maxiter': 10000})
fitted = PiecewiseLinearKDE(zeta, result.x)
```

Here `zeta` is the array of node points, `z` the array of observed scores, and `reg` a small regularization constant that keeps the log finite.

The result is approximately what one might expect.

One useful coding trick to take away from this is the use of `np.interp` inside a number of methods of `PiecewiseLinearKDE`. Since the curve itself is computed as `np.interp(x, self.nodes, self.h())`, gradients of this w.r.t. `q` can then be computed by applying `np.interp(x, self.nodes, grad_h)`, where `grad_h` is the gradient of the node heights w.r.t. `q`. This allows the efficient calculation of gradients of likelihood functions, as seen in `deriv` above, simplifying what might otherwise be index-heavy code.
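To make the trick concrete, here's a self-contained check (toy nodes and parameters of my own choosing) that the interpolated gradient matches finite differences:

```python
import numpy as np

nodes = np.array([0.0, 0.5, 1.0])
q = np.array([0.1, -0.3, 0.7])
x = np.array([0.2, 0.6, 0.9])

# Curve: linear interpolation of node heights h(q) = exp(q).
curve = np.interp(x, nodes, np.exp(q))

# Because np.interp is linear in the heights, d(curve)/dq[1] is just the
# interpolation of dh/dq[1] = (0, exp(q[1]), 0).
grad_h = np.array([0.0, np.exp(q[1]), 0.0])
analytic = np.interp(x, nodes, grad_h)

# Finite-difference check.
eps = 1e-6
q_bumped = q.copy()
q_bumped[1] += eps
numeric = (np.interp(x, nodes, np.exp(q_bumped)) - curve) / eps
assert np.allclose(analytic, numeric, atol=1e-4)
```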

Defining a prior on a function space - e.g. the space of all probability distributions on [0,1] - is not a simple matter. However, once we've chosen a parameterization for $p_q$, it becomes straightforward. Since $q$ lives in a finite-dimensional Euclidean space, the restriction of any reasonable prior onto this space is absolutely continuous w.r.t. Lebesgue measure, thereby eliminating any theoretical concerns.

The situation we are attempting to model is a small to moderate *change* in the distribution of the positive class, particularly in regions where $f_1$ is small. So we will define the (unnormalized) prior to be:

$$ p(q) \propto \exp\left( -a \int_0^1 \left| p_q(z) - g(z) \right|^2 \, d\mu(z) \right) $$

where $g$ is basically just a smoothed out (differentiable) version of $f_1$. We need a smooth version of $f_1$ simply because when we do max-a-posteriori later, a smooth curve makes numerical minimization easier.

This prior should not be thought of as a principled Bayesian prior, but merely one chosen for convenience and because it regularizes the method. If we ignore the smoothing, this is analogous to a prior that penalizes deviation from $f_1$ in the $L^2(\mu)$ metric. The measure $\mu$ is used to weight the penalty, penalizing deviation more in some regions of the domain than in others. The parameter $a$ represents the strength of the prior - larger $a$ means that $p_q$ will remain closer to $f_1$.

One important note about the power in the exponent: choosing a power of 1 does NOT actually generate any kind of sparsity penalty here, in contrast to the usual $L^1$ regularization setting.

The likelihood is (as per the above):

$$ P(z_1, \ldots, z_M \mid q, \pi) = \prod_{i=1}^{M} \left[ \pi p_q(z_i) + (1 - \pi) f_0(z_i) \right] $$

Computing the log of likelihood times prior (neglecting the normalization term from Bayes' rule), we obtain:

$$ \log P(q, \pi \mid z) = \sum_{i=1}^{M} \log\left[ \pi p_q(z_i) + (1 - \pi) f_0(z_i) \right] - a \int_0^1 \left| p_q(z) - g(z) \right|^2 d\mu(z) + \text{const} $$

The gradient of this with respect to $(\pi, q)$ is:

$$ \frac{\partial}{\partial \pi} = \sum_{i=1}^{M} \frac{p_q(z_i) - f_0(z_i)}{\pi p_q(z_i) + (1 - \pi) f_0(z_i)}, \qquad \nabla_q = \sum_{i=1}^{M} \frac{\pi \nabla_q p_q(z_i)}{\pi p_q(z_i) + (1 - \pi) f_0(z_i)} - 2a \int_0^1 \left( p_q(z) - g(z) \right) \nabla_q p_q(z)\, d\mu(z) $$

Using this objective function and gradient, it is straightforward to use scipy.optimize.minimize to simultaneously find both $q$ and $\pi$.
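A compact sketch of the joint fit. This is not the notebook's `PiecewiseLinearKDE`; to stay short I substitute a softmax-normalized piecewise-constant density, and the stand-in densities, prior strength `a`, and bin layout are all assumptions of mine:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import beta

f0 = beta(1, 8).pdf          # training-time negative-class pdf (stand-in)
g = beta(8, 1).pdf           # prior center: the training-time f1 (stand-in)
edges = np.linspace(0, 1, 11)
mids = 0.5 * (edges[:-1] + edges[1:])
a = 5.0                      # prior strength

def density(q, z):
    # Piecewise-constant pdf on 10 equal bins, normalized to integrate to 1.
    h = np.exp(q)
    h = h / (h.sum() * 0.1)
    return h[np.clip(np.digitize(z, edges) - 1, 0, 9)]

def neg_log_posterior(params, z):
    q, pi = params[:-1], 1 / (1 + np.exp(-params[-1]))  # pi via sigmoid
    log_lik = np.log(pi * density(q, z) + (1 - pi) * f0(z)).sum()
    log_prior = -a * 0.1 * np.sum((density(q, mids) - g(mids)) ** 2)
    return -(log_lik + log_prior)

# Simulated prediction batch: shifted positive class, small base rate.
rng = np.random.default_rng(1)
z = np.where(rng.random(5000) < 0.02,
             beta(6, 2).rvs(5000, random_state=rng),
             beta(1, 8).rvs(5000, random_state=rng))

res = minimize(neg_log_posterior, x0=np.zeros(11), args=(z,), method="BFGS")
pi_hat = 1 / (1 + np.exp(-res.x[-1]))
```

The sigmoid reparameterization of $\pi$ serves the same purpose as the softmax over `q`: it keeps the optimization unconstrained.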

**Note:** All of the examples here are computed in this Jupyter notebook. For more specific details on how they were performed, the notebook is the place to look.

Here's an example. I took a distribution of 97.7% negative samples, with a relatively simple prior distribution. I simulated a significant change of shape in the distribution of scores of the positive class, which is illustrated in red in the graph below. As can be seen, the approximation (the orange line) is reasonably good. Moreover, we recover $\pi$ with reasonable accuracy - the measured $\pi$ was 0.0225 while the true $\pi$ was 0.0234.

(The histograms in the graph illustrate the actual samples drawn.)

Using the fitted curve to compute calibration seems to work reasonably well, although simple isotonic regression is another way to do it.

The advantage of using this method shows on out-of-sample data with a significantly different distribution of positive cases. I repeated this experiment, but with a different $\pi$ and a marginally different distribution of positive cases.

The dynamically calculated calibration curve (the green) still behaves well, while the isotonic fit calculated *for a different* $\pi$ (unsurprisingly) does not provide good calibration.

Note that recalculating the isotonic fit is not possible, since that requires outcome data which is not yet available.

The major use case for this method of calibration is reducing the loss of a decision rule due to model miscalibration. Consider a loss function which penalizes false positives and false negatives. Without loss of generality 6, such a loss function takes this form:

$$ \ell(d, y) = \begin{cases} t & d = 1,\ y = 0 \\ 1 & d = 0,\ y = 1 \\ 0 & \text{otherwise} \end{cases} $$

With this loss function, the optimal decision rule is to choose 1 (positive) whenever $c(z) > t / (1 + t)$, otherwise choose 0 (negative).

Using the same example as above, we can compute the result of applying this decision rule using either isotonic calibration (static) or our dynamic rule to the test set. For almost every choice of the threshold parameter $t$, the loss is significantly lower when using the dynamic calibration.
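The decision rule and its realized loss are only a few lines; the arrays here are hypothetical placeholders:

```python
import numpy as np

def decision_loss(p_calibrated, y_true, t):
    # Predict positive when the calibrated probability exceeds t/(1+t),
    # then tally cost t per false positive and 1 per false negative.
    predicted = p_calibrated > t / (1 + t)
    false_positives = np.sum(predicted & (y_true == 0))
    false_negatives = np.sum(~predicted & (y_true == 1))
    return t * false_positives + 1.0 * false_negatives

# Tiny illustration with made-up numbers.
p = np.array([0.90, 0.20, 0.05])
y = np.array([1, 0, 1])
print(decision_loss(p, y, t=0.1))  # one FP (cost 0.1) + one FN (cost 1.0)
```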

A method such as this should NOT be expected to improve ROC_AUC, and in fact in empirical tests this method does not. This is because ROC_AUC is based primarily on ordering of risk scores, and our calibration rule does not change the ordering.

The Brier score - an explicit metric of calibration - does tend to increase (i.e. worsen) under this method. This is of course completely expected - in my experiments, this method is less effective at producing a low Brier score than isotonic calibration, at least until either $f_1$ or $\pi$ changes.

The average precision score also tends to increase over *multiple batches* with different $\pi$.

Another approach (the approach of the papers linked in footnote 2) is to fit only $\pi$ and not allow $f_1$ to change.

In experiments, I've noticed that fitting $\pi$ without allowing $f_1$ to change generally produces a more accurate estimate of $\pi$, even in situations where the true distribution differs significantly from the training-time $f_1$.

However, in spite of a more accurate estimate of $\pi$, the resulting calibration curves from fitting $\pi$ alone do not tend to be as accurate. The curve that comes from fitting $(q, \pi)$ jointly is more accurate than the fit of $\pi$ alone:

At this stage I do not consider this method in any sense "production ready". I do not have a great grasp on the conditions under which this method works or fails. I've also observed that very frequently, `scipy.optimize.minimize` fails to converge, yet returns a useful result anyway. Most likely I'm demanding too tight a tolerance.

I've also tried a couple of other ways to parameterize the probability distributions, and the method seems quite sensitive to them. For example, I included an unnecessary parameter in an earlier variation, and this caused the method to completely fail to converge. I'm not entirely sure why.

There is a corresponding Jupyter notebook which has the code to do this. If anyone finds this useful and is able to move it forward, please let me know! As a warning, playing around with the code in the notebook will make the warts of the method fairly visible - e.g. once in a while, a cell will fail to converge, or just converge to something a bit weird.

However, overall I am encouraged by this. I believe it's a promising approach to dynamically adjusting calibration curves and better using prediction models in a context when the distribution of the positive class is highly variable.

As one additional note, I'll mention that I have some work (which I'll write about soon) suggesting that if we can request labels for a subset of the data points, we can do reasonably efficient active learning of calibration curves. This appears to significantly improve accuracy and reduce the number of samples needed.

**Notes**

- 1 - In reality 78 should be replaced with life expectancy *at the time of diagnosis*, which is typically larger than the mean population life expectancy. This is a technical detail irrelevant for this post.
- 2 - Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure, by Marco Saerens, Patrice Latinne & Christine Decaestecker. Another useful paper is EM with Bias-Corrected Calibration is Hard-To-Beat at Label Shift Adaptation, which compares the maximum likelihood method with other more complex methods and finds it generally competitive. This paper also suggests max likelihood type methods are usually the best.
- 3 - The approach taken in the papers cited in 2 is a bit different - they do expectation maximization and actually generate parameters representing outcome variables. The approach I'm describing just represents likelihoods of risk scores and ignores outcomes. But in principle these approaches are quite similar, and in testing the version I use tends to be a bit simpler and still works.
- 4 - Adaptive Estimation of a Density Function Using Beta Kernels, by Karine Bertin and Nicolas Klutchnikoff.
- 5 - Suppose that the distribution changes so that $f_1 = f_0$. Then $\pi f_1(z) + (1 - \pi) f_0(z) = f_0(z)$ for all $\pi$, and therefore it is impossible to distinguish between different values of $\pi$ from the distribution of $z$ alone.
- 6 - Suppose we had an arbitrary loss function with a false positive cost of $a$ and a false negative cost of $b$. Then define $t = a/b$. This is equivalent to a loss function with penalties $t$ for false positives and $1$ for false negatives, which differs from our choice of loss function only by a multiplicative constant $b$.

It's become a popular meme that "shareholders only care about the next quarter". Lots of people make arguments like this - for example, Jamie Dimon and Warren Buffett. As the meme goes, shareholders only care about the next quarter of earnings, and CEOs make decisions accordingly - sacrificing long term profitability to meet quarterly expectations.

But is this meme true?

Coronavirus gives us a great empirical test of this theory.

The first step in answering this question is to formalize the theory. The most straightforward way I can think of to do this is through the lens of net present value, albeit with a modified discount rate.

This framework says that the value of any cash generating asset is given by:

$$ V = \sum_{t} d_t R_t $$

In this sum, $R_t$ is the cash flow in time period $t$ and $d_t$ is the *discount factor* of time $t$.

Here's a fairly simple example - a US Treasury bond guaranteed to pay a $100 coupon for 3 periods and then to pay a final $10,000 in the 4th. In tabular form:

date | R |
---|---|
2020-06-30 | 100 |
2020-09-30 | 100 |
2020-12-31 | 100 |
2021-03-31 | 10000 |

To complete the calculation, we need to time discount each cash payment. This is typically done by taking the risk free interest rate - say 5% - and applying that to each time period. For example:

date | R | d | R*d |
---|---|---|---|
2020-06-30 | 100 | 1.00 | 100.00 |
2020-09-30 | 100 | 0.99 | 98.75 |
2020-12-31 | 100 | 0.98 | 97.51 |
2021-03-31 | 10000 | 0.96 | 9631.85 |

Finally, the value of the bond is the sum of the `R*d` column, which is $9928.11 in this example.
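The table's arithmetic as code (values copied from the table above):

```python
# Discount each cash flow and sum: V = sum(R_t * d_t).
cashflows = [100, 100, 100, 10000]
discounts = [1.0000, 0.9875, 0.9751, 0.963185]
value = sum(r * d for r, d in zip(cashflows, discounts))
print(round(value, 2))  # about $9928
```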

In this framework, short-termism can be straightforwardly represented by the `d` column - specifically, `d` will rapidly decrease over time. For instance, a very short term valuation of the same bond (a 25% discount rate) might be described as:

date | R | d | R*d |
---|---|---|---|
2020-06-30 | 100 | 1.00 | 100.00 |
2020-09-30 | 100 | 0.94 | 93.90 |
2020-12-31 | 100 | 0.88 | 88.17 |
2021-03-31 | 10000 | 0.83 | 8289.90 |

which yields a valuation of $8571.97.

Given that Treasury valuations do not look anything like this, we can certainly see that *bond* investors are not vulnerable to the short-termism that *stock* investors purportedly suffer from.

The straw man version of "shareholders only care about the next quarter" would mean that `d=0` for all quarters past the next one.

I will examine this model for mathematical understanding, though I don't think it's a particularly fair thing to do.

Now let us consider a stock rather than a bond - specifically, a pharma company with a single drug in the final phase of clinical trials, which ends in 1 year.

The cashflow is quite certain for the next year - `R[0:4] == 0`, i.e. the company loses money to run the clinical trial and pays nothing to shareholders. After 1 year, there are two possible outcomes:

- The good outcome: `R_good = +1000`, the drug works, everyone buys it for 17 years, the company is valuable.
- The bad outcome: `R_bad = 0`, the drug does not work, the company is worthless.

date | R_good | R_bad | d |
---|---|---|---|
2020-06-30 | 0 | 0 | 1.0000 |
2020-09-30 | 0 | 0 | 0.9987 |
2020-12-31 | 0 | 0 | 0.9975 |
2021-03-31 | 0 | 0 | 0.9963 |
2021-06-30 | 1000 | 0 | 0.9950 |
2021-09-30 | 1000 | 0 | 0.9938 |
. | . | . | . |
2036-06-30 | 1000 | 0 | 0.9231 |
2036-09-30 | 1000 | 0 | 0.9220 |
2036-12-31 | 1000 | 0 | 0.9208 |
2037-03-31 | 1000 | 0 | 0.9197 |

The company has two eventual valuations (at a long-termist 0.5% discount rate), depending on whether we believe the `R_good` or `R_bad` column represents the future - $61,238 in the first case and $0 in the second.

If we assume a 60% chance of the drug getting through clinical trials, then the value of the company would be `0.6 * $61238 + 0.4 * $0 ≈ $36743`.
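Or, as a one-liner over the two scenarios:

```python
# Probability-weighted valuation across the two clinical-trial outcomes.
p_good = 0.6
value_good, value_bad = 61238.0, 0.0  # discounted sums from the table above
expected_value = p_good * value_good + (1 - p_good) * value_bad
print(round(expected_value, 2))  # about $36743
```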

Note that in the straw man case of *literally only the next quarter matters*, this company is worth $0 in all possible scenarios - its first actual profit is 1 year out.

Let's now consider an investor who is evaluating a blue chip, highly stable stock. This stock regularly has earnings of $100 per quarter. Then one quarter, it misses earnings and only reports $75!

An investor infected by short-termism will significantly cut their valuation of the company - since `d=0` for all future periods, the value drops from $100 to $75, a 25% decrease.

Let us now consider a long term investor.

date | R | d |
---|---|---|
2020-06-30 | 75 | 1.0000 |
2020-09-30 | 100 | 0.9987 |
2020-12-31 | 100 | 0.9975 |
2021-03-31 | 100 | 0.9963 |
2021-06-30 | 100 | 0.9950 |
. | . | . |

Over 18 years, the value of this revenue stream works out to be $6498. In contrast, had earnings for one quarter not been missed, it would be $6523, a difference of 0.4%. Thus, if there is a drop in share price of significantly more than 0.4%, one might hypothesize that this is due to the market taking a short termist view.

Let us now consider a long term investor who actively tries to think through cause and effect. Earnings decreased, and there must be some reason for it! The question to ask is therefore whether this reduction in a single quarter's earnings will continue into the future. We encounter a situation similar to the pharma stock discussed earlier:

date | R_good | R_bad | d |
---|---|---|---|
2020-06-30 | 75 | 75 | 1.0000 |
2020-09-30 | 100 | 75 | 0.9987 |
. | . | . | . |
2036-09-30 | 100 | 75 | 0.9220 |
2036-12-31 | 100 | 75 | 0.9208 |
2037-03-31 | 100 | 75 | 0.9197 |

In the `R_bad` scenario, the company will only be worth $4892 (a 25% decrease from its previous value).

If the long term investor believes that there is a 40% chance of this occurring, then the value of the stock decreases to $5855.75, a 10% drop!

Even though the long term investor doesn't care much about a single quarter's earnings, he cares a lot about whether this predicts many more quarters of reduced earnings. This means that even long term investors behave in the manner that others describe as "short-termist".

As a result, both the short-termism theory and the long-termism theory *make very similar predictions*. The fact that stock prices move significantly in response to missed earnings estimates is insufficient to distinguish between these two theories.

Coronavirus is a great natural experiment for a lot of things.

One of the most important things we can take away from it is the conclusion that equity markets are fundamentally focused on the long term value of the companies being traded. There are fast responses to problems with next quarter earnings, but these are primarily driven by the fact that problems in the short run tend to be indicative of more fundamental issues.

Now that we have a systematic example where we know that short run problems are strictly short run, we can safely disambiguate between short termism and long termism. The result is very clear; the market is predominantly focused on the long term.

**Disclosure:** Long $SBUX, $CCL.

I am posting full screenshots of the censored Hydroxychloroquine paper here (captured by @aetherczar). I encourage others to repost this elsewhere. I make no endorsement of its contents. **DO NOT DRINK FISH TANK CLEANER, YOU WILL PROBABLY DIE.**

I am posting this here simply because it's important for us to resist Google's censorship, as well as the media spreading misleading information (writing stories implying the dead guy ate malaria medication instead of fish tank cleaner).

Again - **DO NOT DRINK FISH TANK CLEANER IT WILL KILL YOU**. Anti-malarial medication might help.

A lot of suspicious behavior can be detected simply by looking at a histogram. Here's a nice example. There's a paper Distributions of p-values smaller than .05 in Psychology: What is going on? which attempts to characterize the level of data manipulation performed in academic psychology. Now under normal circumstances, one would expect a nice smooth distribution of p-values resulting from honest statistical analysis.

What actually shows up when they measure it is something else entirely:

Another example happened to me when I was doing credit underwriting. A front-line team came to me with concerns that some of our customers might not be genuine, and in fact some of them might be committing fraud! Curious, I started digging into the data and made a histogram to get an idea of spending per customer. The graph looked something like this:

The value `x=10` corresponded to the credit limit we were giving out to many of our customers. For some reason, a certain cohort of users were spending as much as possible on the credit lines we gave them. Further investigation determined that most of those customers were not repaying the money we lent them.

In contrast, under normal circumstances, a graph of the same quantity would typically look like this:

A third example - with graphs very similar to the previous example - happened to me when debugging some DB performance issues. We had a database in US-East which was replicated to US-West. Read performance in US-West was weirdly slow, and when we made a histogram of request times, it turned out that the slowness was driven primarily by a spike at around 90ms. Coincidentally, 90ms was the ping time between our US-East and US-West servers. It turned out that a misconfiguration resulted in the US-West servers occasionally querying the US-East read replica instead of the US-West one, adding 90ms to the latency.

A fourth example comes from the paper Under Pressure? Performance Evaluation of Police Officers as an Incentive to Cheat: Evidence from Drug Crimes in Russia which discovers odd spikes in the amount of drugs found in police searches.

It sure is very strange that so many criminals all choose to carry the exact amount of heroin needed to trigger harsher sentencing thresholds, and never a few grams less.

In short, many histograms should be relatively smooth and decreasing. When such histograms display a spike, that spike is a warning sign that something is wrong and we should give it further attention.

In all the cases above, I made these histograms as part of a post-hoc analysis. Once the existence of a problem was suspected, further evidence was gathered and the spike in the histogram was one piece of evidence. I've always been interested in the question - can we instead automatically scan histograms for spikes like the above and alert humans to a possible problem when they arise?

This blog post answers the question in the affirmative, at least theoretically.
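As a sketch of what such a scanner could look like (my own construction, not a standard method): fit a monotonically decreasing envelope to the histogram with isotonic regression, and flag bins that exceed it by a large ratio.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def find_spikes(counts, ratio=2.0):
    # Fit the best monotonically decreasing curve through the bin counts,
    # then flag bins that exceed that envelope by the given ratio.
    x = np.arange(len(counts))
    envelope = IsotonicRegression(increasing=False).fit_transform(x, counts)
    return np.where(counts > ratio * np.maximum(envelope, 1))[0]

# A smoothly decreasing histogram with one suspicious spike at bin 8.
counts = np.array([900, 700, 500, 350, 240, 160, 100, 60, 400, 20])
print(find_spikes(counts))  # flags bin 8
```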

To model this problem in the frequentist hypothesis testing framework, let us assume we have a continuous probability distribution supported on $[0, \infty)$. As our null hypothesis - i.e. nothing unusual to report - we'll assume this distribution is absolutely continuous with respect to Lebesgue measure and that it has a pdf which is monotonically decreasing, i.e.