I am posting full screenshots of the censored Hydroxychloroquine paper here (captured by @aetherczar). I encourage others to repost this elsewhere. I make no endorsement of its contents. **DO NOT DRINK FISH TANK CLEANER, YOU WILL PROBABLY DIE.**

I am posting this here simply because it's important for us to resist Google's censorship, as well as the media spreading misleading information (writing stories implying the dead guy ate malaria medication instead of fish tank cleaner).

Again - **DO NOT DRINK FISH TANK CLEANER IT WILL KILL YOU**. Anti-malarial medication might help.

A lot of suspicious behavior can be detected simply by looking at a histogram. Here's a nice example. The paper *Distributions of p-values smaller than .05 in Psychology: What is going on?* attempts to characterize the level of data manipulation performed in academic psychology. Under normal circumstances, one would expect a nice smooth distribution of p-values resulting from honest statistical analysis.

What actually shows up when they measure it is something else entirely:

Another example happened to me when I was doing credit underwriting. A front-line team came to me with concerns that some of our customers might not be genuine, and in fact some of them might be committing fraud! Curious, I started digging into the data and made a histogram to get an idea of spending per customer. The graph looked something like this:

The value `x=10` corresponded to the credit limit we were giving out to many of our customers. For some reason, a certain cohort of users were spending as much as possible on the credit lines we gave them. Further investigation determined that most of those customers were not repaying the money we lent them.

In contrast, under normal circumstances, a graph of the same quantity would typically look like this:

A third example - with graphs very similar to the previous example - happened to me when debugging some DB performance issues. We had a database in US-East which was replicated to US-West. Read performance in US-West was weirdly slow, and when we made a histogram of request times, it turned out that the slowness was driven primarily by a spike at around 90ms. Coincidentally, 90ms was the ping time between our US-East and US-West servers. It turned out that a misconfiguration resulted in the US-West servers occasionally querying the US-East read replica instead of the US-West one, adding 90ms to the latency.

A fourth example comes from the paper *Under Pressure? Performance Evaluation of Police Officers as an Incentive to Cheat: Evidence from Drug Crimes in Russia*, which discovers odd spikes in the amount of drugs found in police searches.

It sure is very strange that so many criminals all choose to carry the exact amount of heroin needed to trigger harsher sentencing thresholds, and never a few grams less.

In short, many histograms should be relatively smooth and decreasing. When such histograms display a spike, that spike is a warning sign that something is wrong and we should give it further attention.

In all the cases above, I made these histograms as part of a post-hoc analysis. Once the existence of a problem was suspected, further evidence was gathered and the spike in the histogram was one piece of evidence. I've always been interested in the question - can we instead automatically scan histograms for spikes like the above and alert humans to a possible problem when they arise?

This blog post answers the question in the affirmative, at least theoretically.

To model this problem in the frequentist hypothesis testing framework, let us assume we have a continuous probability distribution which is supported on an interval such as $[0, \infty)$. As our null hypothesis - i.e. nothing unusual to report - we'll assume this distribution is absolutely continuous with respect to Lebesgue measure and that it has pdf $f(x)$ which is monotonically decreasing, i.e. $f(y) \leq f(x)$ for $y \geq x$ (almost everywhere).

In contrast, for the alternative hypothesis - something worth flagging as potentially bad - I'll assume that the distribution is a mixture distribution with pdf $(1 - \beta) f(x) + \beta s(x)$. Here $s(x)$ is monotonically increasing, or more typically $s(x) = \delta(x - x_0)$, a point mass at $x_0$.

**Observation:** Consider a probability distribution with pdf $f(x)$ that is monotonically decreasing. Then the cumulative distribution function $F(x) = \int_{-\infty}^{x} f(t) \, dt$ is concave. This can be proven by noting that its derivative, $F'(x) = f(x)$, is monotonically decreasing.

Our hypothesis test for distinguishing between the null and alternative hypothesis will be based on concavity. Specifically, if there are spikes in a histogram of the pdf of a distribution, then its CDF may cease to be concave at the point of the spike. Here's an illustration. First, consider the empirical CDF of a distribution whose pdf is monotonically decreasing:

This graph is clearly concave. The red line illustrates a chord which must, by concavity, remain below the actual curve.

In contrast, a pdf with a spike in it will fail to be concave near the spike. Here's an illustration:

Near the spike, the chord (the red line) is above the graph of the CDF (the green line).

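In mathematical terms, concavity of the true CDF can be expressed as the relation, valid for all $x$, $y$ and all $\alpha \in [0, 1]$:

$$F(x + \alpha (y - x)) \geq F(x) + \alpha (F(y) - F(x))$$

or equivalently:

$$F(x + \alpha (y - x)) - F(x) - \alpha (F(y) - F(x)) \geq 0$$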

Since we do not know $F$ exactly, we of course cannot measure this directly. But given a sample $x_1, \ldots, x_n$, we can construct the empirical CDF, which is nearly as good:

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} 1(x_i \leq x)$$

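Using the empirical CDF and the definition of concavity suggests a test statistic which we can use:

$$q = \min_{x, \, y, \, \alpha \in [0, 1]} \left[ F_n(x + \alpha (y - x)) - F_n(x) - \alpha (F_n(y) - F_n(x)) \right]$$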

Our goal is to show that if this test statistic is sufficiently negative, then a spike must exist.
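To make the statistic concrete, here is a minimal sketch that evaluates the concavity gap of the empirical CDF on an evenly spaced grid. The function names, the grid search and the grid size are assumptions made for illustration. A grid can only miss violations, so the value returned is an upper bound on the true minimum.

```python
import bisect
import random

def empirical_cdf(sorted_sample, x):
    """Fraction of sample points that are <= x."""
    return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

def concavity_statistic(sample, grid_size=40):
    """Most negative value of F_n(z) - F_n(x) - alpha * (F_n(y) - F_n(x)),
    searched over grid points x < z < y (hypothetical helper, brute force)."""
    s = sorted(sample)
    lo, hi = s[0], s[-1]
    grid = [lo + (hi - lo) * k / (grid_size - 1) for k in range(grid_size)]
    F = [empirical_cdf(s, g) for g in grid]
    worst = 0.0
    for i in range(grid_size - 2):
        for j in range(i + 2, grid_size):
            for k in range(i + 1, j):
                # z = grid[k] lies a fraction alpha of the way from x to y.
                alpha = (grid[k] - grid[i]) / (grid[j] - grid[i])
                gap = F[k] - F[i] - alpha * (F[j] - F[i])
                worst = min(worst, gap)
    return worst

rng = random.Random(0)
# Monotone decreasing pdf: the exponential distribution.
clean = [rng.expovariate(1.0) for _ in range(20000)]
# Same distribution, but 20% of the mass is moved to a spike at x = 1.0.
spiked = [1.0 if rng.random() < 0.2 else rng.expovariate(1.0)
          for _ in range(20000)]

q_clean = concavity_statistic(clean)
q_spiked = concavity_statistic(spiked)
print(q_clean, q_spiked)
```

The clean sample should give a value close to zero, while the spiked sample should give a clearly negative one. The triple loop is cubic in the grid size, so this is only a demonstration, not something to run on a fine grid.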

When $q$ becomes negative, this shows that $F_n$ is non-concave. However, the empirical distribution function is by definition non-concave, as can be seen clearly when we zoom in:

Mathematically we can also see this simply by noting that the step function $1(x_i \leq x)$ is not concave. However, this non-concavity has order of magnitude $1/n$, so to deal with this we can simply demand that $q < -1/n$.

There is a larger problem caused - potentially - by deviation between the empirical distribution $F_n$ and the true, continuous and concave cdf $F$. This however can also be controlled, and will be controlled in the next section.

To control false positives, there is a useful mathematical tool we can use: the DKW inequality (abbreviating Dvoretzky–Kiefer–Wolfowitz). This is a stronger version of the Glivenko-Cantelli Theorem: it bounds how fast the empirical CDF converges to the true cdf, uniformly over the whole range of the cdf.

We use it as follows.

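Recall that $q$ is defined as a minimum of $F_n(x + \alpha (y - x)) - F_n(x) - \alpha (F_n(y) - F_n(x))$. Let us now choose $(x, y, \alpha)$ to be the value at which that minimum is achieved. Note that this requires that $x, y$ are two points in the domain of $F$ and $\alpha \in [0, 1]$. Let us also define $z = x + \alpha (y - x)$ in order to simplify the calculation.

Now let's do some arithmetic, starting from the definition of concavity of the CDF:

$$
\begin{aligned}
0 &\leq F(z) - F(x) - \alpha (F(y) - F(x)) \\
&= \left[ F_n(z) - F_n(x) - \alpha (F_n(y) - F_n(x)) \right] + (F(z) - F_n(z)) - (1 - \alpha)(F(x) - F_n(x)) - \alpha (F(y) - F_n(y)) \\
&= q + (F(z) - F_n(z)) - (1 - \alpha)(F(x) - F_n(x)) - \alpha (F(y) - F_n(y))
\end{aligned}
$$

(This line follows since $F_n(z) - F_n(x) - \alpha (F_n(y) - F_n(x)) = q$ due to our choice of $x, y, \alpha$ above.)

The DKW inequality tells us that for any $\epsilon > 0$,

$$P\left( \sup_x \left| F_n(x) - F(x) \right| > \epsilon \right) \leq 2 e^{-2 n \epsilon^2}$$

Substituting this into the above, we can therefore say that with probability at least $1 - 2 e^{-2 n \epsilon^2}$,

$$0 \leq q + \epsilon + (1 - \alpha) \epsilon + \alpha \epsilon = q + 2 \epsilon$$

If $q < -2\epsilon$, this lets us reject the null hypothesis that $F$ is concave, or equivalently, that $f$ is monotonically decreasing. Conversely, given a value of $q$, we can invert to gain a p-value. We summarize this as a theorem: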

**Theorem 1:** Assume the null hypothesis of concavity is true. Let $q$ be defined as above. Then if $q < -2\epsilon$, we can reject the null hypothesis (that $f$ is decreasing monotonically) with p-value $p = 2 e^{-2 n \epsilon^2}$.

This convergence is exponential but at a slow rate. Much like the Kolmogorov-Smirnov test, the statistical power is relatively low compared to parametric tests (such as Anderson-Darling) that are not based on the DKW inequality.
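As a rough illustration of inverting the bound, the sketch below converts an observed value of the test statistic into a p-value. It assumes the standard two-sided DKW inequality, `P(sup |F_n - F| > eps) <= 2 * exp(-2 * n * eps**2)`, and assumes that the three CDF errors in the concavity gap add up to at most `2 * eps`. The function name and the symbol `q` are hypothetical.

```python
import math

def dkw_p_value(q, n):
    """Upper bound on the p-value for an observed concavity statistic q.

    Assumption: under the null, q >= -2 * eps whenever sup |F_n - F| <= eps,
    so observing q < 0 requires a deviation of at least eps = -q / 2.
    """
    if q >= 0:
        return 1.0  # no evidence against concavity
    eps = -q / 2.0
    return min(1.0, 2.0 * math.exp(-2.0 * n * eps * eps))

print(dkw_p_value(-0.2, 1000))   # a deep violation on 1000 samples
print(dkw_p_value(-0.01, 100))   # a shallow violation on few samples
```

The bound is capped at 1 because the DKW right-hand side exceeds 1 for small deviations, which is one way to see why shallow violations on small samples carry no evidence.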

Let us now examine the true positive rate and attempt to compute statistical power. As a simple alternative hypothesis, let us take a mixture model:

$$g(x) = (1 - \beta) f(x) + \beta \, \delta(x - x_0)$$

Here $f$ is monotone decreasing and $\delta(x - x_0)$ is the point mass at $x_0$. Let us attempt to compute

Let , and . Then:

Now substituting this in, we discover:

Letting , we observe that . Since is absolutely continuous, is of course a continuous function.

Let us now take the limit as :

This implies that

since the minima is of course smaller than any limit.

By the same argument as in the previous section - using the DKW inequality to relate the empirical CDF to the true CDF - we can therefore conclude that:

with probability .

We can combine these results into a hypothesis test which is capable of distinguishing between the null and alternative hypothesis with any desired statistical power.

**Theorem 2:** Let
be a specified p-value threshold and let
be a desired statistical power. Let us reject the null hypothesis whenever

Suppose now that

Then with probability at least , we will reject the null hypothesis.

Due to the slowness of the convergence implied by the DKW inequality, we unfortunately need a fairly large sample size (or a large spike) for this test to be useful.

n | Minimum detectable spike size |
---|---|
1000 | 0.155 |
2000 | 0.109 |
5000 | 0.0692 |
10000 | 0.0490 |
25000 | 0.0310 |
100000 | 0.0155 |

Thus, this method is really only suitable for detecting either large anomalies or in situations with large sample sizes.
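The values in the table are numerically consistent with `4 * sqrt(ln(1 / p) / (2 * n))` at `p = 0.05`. That formula is an inference from the numbers rather than something stated explicitly here, so treat the sketch below as a way to explore the scaling, not as the definitive threshold. The function name is hypothetical.

```python
import math

def detectable_spike(n, p=0.05):
    """Assumed form of the detection threshold: four DKW-style
    deviations of size sqrt(ln(1/p) / (2n))."""
    return 4.0 * math.sqrt(math.log(1.0 / p) / (2.0 * n))

table = {n: detectable_spike(n)
         for n in (1000, 2000, 5000, 10000, 25000, 100000)}
for n, size in table.items():
    print(f"{n:>6d}  {size:.4f}")
```

Because the threshold scales like `1 / sqrt(n)`, halving the detectable spike size requires four times as much data.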

Somewhat importantly, this method is not particularly sensitive to the p-value cutoff. For example, with a 1% cutoff rather than a 5%, we can detect spikes of size at .

This makes the method reasonably suitable for surveillance purposes. By setting the p-value cutoff reasonably low (e.g. 1% or 0.1%), we sacrifice very little measurement power on a per-test basis. This allows us to run many versions of this test in parallel and then use either the Sidak correction to control the group-wise false positive rate or Benjamini-Hochberg to control the false discovery rate.
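For reference, both corrections are short enough to sketch directly. The function names are hypothetical. The Sidak threshold assumes independent tests, and the Benjamini-Hochberg step-up procedure shown here is the standard one for independent or positively dependent p-values.

```python
def sidak_threshold(alpha, m):
    """Per-test cutoff so that m independent tests have
    group-wise false positive rate alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)

def benjamini_hochberg(p_values, fdr):
    """Indices of the hypotheses rejected by the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under the BH line.
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
            0.060, 0.074, 0.205, 0.212, 0.216]
print(sidak_threshold(0.05, len(p_values)))
print(benjamini_hochberg(p_values, fdr=0.05))
```

With a few hundred histograms under surveillance, the Sidak cutoff lands near the 0.1% level mentioned above.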

At the moment this test is not all I was hoping for. It's quite versatile, in the sense of being fully nonparametric and assuming little beyond the underlying distribution being monotone decreasing. But while theoretically the convergence is what one would expect, in practice the constants involved are large. I can only detect spikes in histograms after they've become significantly larger than I'd otherwise like.

However, it's still certainly better than nothing. This method would have worked in several of the practical examples I described at the beginning and would have flagged issues earlier than I detected them via manual processes. I do believe this method is worth adding to suites of automated anomaly detection. But if anyone can think of ways to improve this method, I'd love to hear about them.

I've searched, but haven't found a lot of papers on this. One of the closest related ones is Multiscale Testing of Qualitative Hypotheses.

Frequently in data science, we have a relationship between `X` and `y` where (probabilistically) `y` increases as `X` does. The relationship is often not linear, but rather reflects something more complex. Here's an example of a relationship like this:

In this plot of synthetic data we have a non-linear but increasing relationship between `X` and `Y`. The orange line represents the true mean of this data. Note the large amount of noise present.

There is a classical algorithm for solving this problem nonparametrically, specifically Isotonic regression. This simple algorithm is also implemented in sklearn.isotonic. The classic algorithm is based on a piecewise constant approximation - with nodes at every data point - as well as minimizing (possibly weighted) l^2 error.
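The classical least-squares fit is usually computed with the pool-adjacent-violators algorithm. Here is a minimal pure-Python sketch of it, assuming the `y` values are already ordered by increasing `x`. The function name is hypothetical, and this is not the sklearn implementation.

```python
def pava(y, w=None):
    """Weighted least-squares nondecreasing fit to y (pool adjacent violators)."""
    if w is None:
        w = [1.0] * len(y)
    blocks = []  # each block holds [mean, total weight, number of points]
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit

print(pava([1, 3, 2, 4]))
```

Each block of pooled points becomes one constant piece of the fitted curve, which is where the piecewise constant shape and the potentially large number of nodes come from.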

The standard isotonic package works reasonably well, but there are a number of things I don't like about it. My data is often noisy with fatter than normal tails, which means that minimizing l^2 error overweights outliers. Additionally, at the endpoints, sklearn's isotonic regression tends to be quite noisy.

The curves output by sklearn's isotonic model are piecewise constant with a large number of discontinuities (O(N) of them).

The size of the isotonic model can be very large - O(N), in fact (with N the size of the training data). This is because in principle, the classical version of isotonic regression allows every single value of `x` to be a node.

The isotonic package I've written provides some modest improvements on this. It uses piecewise linear curves with a bounded (controllable) number of nodes - in this example, 30:

It also allows for non-`l^2` penalties in order to handle noise better.

Another issue facing the standard isotonic regression model is binary data - where `y in {0, 1}`. Using RMS on binary data sometimes works (when there's lots of data and its mean is far from `0` and `1`), but it's far from optimal.

For this reason I wrote a class `isotonic.BinomialIsotonicRegression` which handles isotonic regression for the case of a binomial loss function.

As is apparent from the figure, this generates more plausible results for binary isotonic regression (in a case with relatively few samples) than the standard sklearn package. The result is most pronounced at the endpoints where data is scarcest.

You can find the code on my GitHub. It's pretty alpha at this time, so don't expect it to be perfect. Nevertheless, I'm currently using it in production code, in particular a trading strategy where the noise sensitivity of `sklearn.isotonic.IsotonicRegression` was causing me problems. So while I don't guarantee it as being fit for any particular purpose, I'm gambling `O($25,000)` on it every week or two.

This appendix explains the mathematical details of the methods, as well as technical details of the parameterization. It is mainly intended to be used as a reference when understanding the code.

The package uses maximum likelihood for curve estimation, and uses the Conjugate Gradient method (as implemented in `scipy.optimize.minimize`) to actually compute this maximum.

The first part of this is parameterizing the curves. The curves are parameterized by a set of nodes $x_i$ and a corresponding set of values $y_i$, with $y_i \leq y_{i+1}$ for all $i$. (I'm using zero-indexing to match the code.)

Since conjugate gradient doesn't deal with constraints, we must come up with a parameterization where the domain is unconstrained and the range satisfies the monotonicity constraint.

There are two cases to consider.

For real-valued isotonic regression, there are no constraints on beyond the monotonicity constraint. Thus, we can use the parameterization:

Since , this trivially satisfies the monotonicity constraint.

In this case, the Jacobian can be computed to be:

Here the function $1(\cdot)$ is equal to $1$ if its argument is true and $0$ otherwise.

This parameterization is implemented here.
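One common way to get an unconstrained parameterization with a monotone range is a cumulative sum of exponentials. The sketch below assumes that form, with `y_0 = alpha_0` and `y_i = y_{i-1} + exp(alpha_i)`. It is an illustration with hypothetical function names and may differ in detail from the library's own code.

```python
import math

def values_from_params(alpha):
    """Map unconstrained reals alpha to strictly increasing values y."""
    y = [alpha[0]]
    for a in alpha[1:]:
        y.append(y[-1] + math.exp(a))  # exp(a) > 0 forces y to increase
    return y

def jacobian(alpha):
    """J[i][j] = d y_i / d alpha_j for the mapping above."""
    n = len(alpha)
    J = [[0.0] * n for _ in range(n)]
    for i in range(n):
        J[i][0] = 1.0                    # every y_i contains alpha_0 once
        for j in range(1, i + 1):
            J[i][j] = math.exp(alpha[j])  # alpha_j only affects y_i for i >= j
    return J

alpha = [0.5, -1.0, 2.0, 0.0]
print(values_from_params(alpha))
```

The Jacobian is lower triangular, since changing `alpha_j` shifts `y_j` and every later value but none of the earlier ones.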

In the case of binomial isotonic regression, we have the additional constraint that $0 \leq y_0$ and $y_{N-1} \leq 1$ (since the curve represents a probability). We can parameterize this via:

It is trivially easy to verify that this satisfies both the monotonicity constraint as well as the constraint that $0 \leq y_i \leq 1$. Note that in this case, there are $N+1$ parameters for an $N$-dimensional vector $\vec{y}$.

The Jacobian can be calculated to be:

This parameterization is implemented here.
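A sketch of one parameterization with these properties is a normalized cumulative sum of exponentials, which uses `N+1` unconstrained parameters to produce `N` increasing values inside `(0, 1)`. This specific form and the function name are assumptions made for illustration, not a transcription of the library's code.

```python
import math

def probabilities_from_params(alpha):
    """Map N+1 unconstrained reals to N increasing values in (0, 1)."""
    shift = max(alpha)  # subtract the max for numerical stability
    weights = [math.exp(a - shift) for a in alpha]
    total = sum(weights)
    y, running = [], 0.0
    for w in weights[:-1]:  # the last weight only enters the normalizer
        running += w
        y.append(running / total)
    return y

print(probabilities_from_params([0.0, 1.0, -2.0, 0.5]))
```

The extra parameter is what keeps the last value strictly below 1: its weight appears in the denominator but never in a numerator.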

One parameterization for is piecewise constant, i.e.:

In this case, simple calculus shows that

with as above.

This is implemented as the PiecewiseConstantIsotonicCurve in the library.

Another parameterization is piecewise linear:

This has derivative:

This is implemented as the PiecewiseLinearIsotonicCurve.
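As an illustration of the piecewise linear case, the sketch below evaluates a curve through nodes `(x_i, y_i)` and returns the derivative of the curve value with respect to each `y_i`. The function names and the flat extrapolation outside the outermost nodes are assumptions, not necessarily what `PiecewiseLinearIsotonicCurve` does.

```python
import bisect

def piecewise_linear(x_nodes, y_nodes, x):
    """Evaluate the curve at x, with flat extrapolation outside the nodes."""
    if x <= x_nodes[0]:
        return y_nodes[0]
    if x >= x_nodes[-1]:
        return y_nodes[-1]
    i = bisect.bisect_right(x_nodes, x) - 1
    t = (x - x_nodes[i]) / (x_nodes[i + 1] - x_nodes[i])
    return (1.0 - t) * y_nodes[i] + t * y_nodes[i + 1]

def node_weights(x_nodes, x):
    """d f(x) / d y_i for every node i; at most two entries are non-zero."""
    w = [0.0] * len(x_nodes)
    if x <= x_nodes[0]:
        w[0] = 1.0
    elif x >= x_nodes[-1]:
        w[-1] = 1.0
    else:
        i = bisect.bisect_right(x_nodes, x) - 1
        t = (x - x_nodes[i]) / (x_nodes[i + 1] - x_nodes[i])
        w[i], w[i + 1] = 1.0 - t, t
    return w

x_nodes, y_nodes = [0.0, 1.0, 3.0], [0.0, 2.0, 3.0]
print(piecewise_linear(x_nodes, y_nodes, 2.0))
print(node_weights(x_nodes, 2.0))
```

The curve value is linear in the node values, so these weights are all the gradient computation needs from the curve.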

Some notation first. Let us consider a data set $\{(X_k, Y_k)\}$. We will define a curve $f(x)$, taking values $y_i$ at the points $x_i$, i.e. $f(x_i) = y_i$, and being parametrically related to $\vec{y}$ elsewhere. Current implementations include piecewise linear and piecewise constant.

Supposing now that the nodes are given, it remains to find the values that minimize a loss function.

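In this case, our goal is to minimize the $L^p$ error:

$$\sum_k \left| f(X_k) - Y_k \right|^p$$

Note that this corresponds to maximum likelihood under the model:

$$Y_k = f(X_k) + \epsilon_k$$

with $\epsilon_k$ drawn from the distribution having pdf $C e^{-|z|^p}$, where $C$ is a normalizing constant.

Computing the gradient w.r.t. $y_i$ yields:

$$\frac{\partial}{\partial y_i} \sum_k \left| f(X_k) - Y_k \right|^p = \sum_k p \left| f(X_k) - Y_k \right|^{p-1} \operatorname{sign}(f(X_k) - Y_k) \, \frac{\partial f(X_k)}{\partial y_i}$$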

This is implemented in the library as LpIsotonicRegression.
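The loss and its derivative with respect to each predicted value can be sketched in a few lines. This assumes an error of the form `sum(|f(X_k) - Y_k| ** p)` with `p >= 1`. The function names are hypothetical, and chaining the result through the curve's derivative with respect to the node values is left out.

```python
def lp_loss(predictions, targets, p):
    """Sum of |f_k - y_k|^p over the data set."""
    return sum(abs(f - y) ** p for f, y in zip(predictions, targets))

def lp_loss_gradient(predictions, targets, p):
    """d loss / d f_k = p * |f_k - y_k|^(p-1) * sign(f_k - y_k), for p >= 1."""
    grad = []
    for f, y in zip(predictions, targets):
        r = f - y
        sign = (r > 0) - (r < 0)  # -1, 0 or +1
        grad.append(p * abs(r) ** (p - 1) * sign)
    return grad

predictions = [1.0, 2.0, 4.0]
targets = [1.5, 2.0, 3.0]
print(lp_loss(predictions, targets, 1.5))
print(lp_loss_gradient(predictions, targets, 1.5))
```

Choosing `p` below 2 makes the gradient grow more slowly in the residual, which is what reduces the influence of fat-tailed outliers compared with the `l^2` case.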

Then given the data set, we can do max likelihood: