Selected blog posts

Posts listed on the homepage are the more popular or interesting ones. Click here for the chronological blog listing.

Boosting as a scheme for transfer learning

Here's a scenario that I believe to be common. I've got a dataset I've been collecting over time, with features $$x_1, \ldots, x_m$$. This dataset will generally represent decisions I want to make at a certain time. This data is not a time series; it's just data I happen to have …

Calibrating a classifier when the base rate changes

In a previous job, I built a machine learning system to detect financial fraud. Fraud was a big problem at the time - to keep the numbers nice and round, suppose 10% of attempted transactions were fraudulent. My machine learning system worked great - as a further set of made-up round numbers …

Shareholder Short-Termism Theory has Died of COVID-19

It's become a popular meme that "shareholders only care about the next quarter". Lots of people make arguments like this - for example, Jamie Dimon and Warren Buffett. As the meme goes, shareholders only care about the next quarter of earnings, and CEOs make decisions accordingly - sacrificing long-term profitability to …

Isotonic: A Python package for doing fancier versions of isotonic regression

Frequently in data science, we have a relationship between X and y where (probabilistically) y increases as X does. The relationship is often not linear, but rather reflects something more complex. Here's an example of a relationship like this:

In this plot of synthetic data we have a non-linear but increasing …

The Final Stage of Grief (about bad data) is Acceptance

I recently gave a talk at the Fifth Elephant 2019. The talk was a discussion about how to use math to handle unfixably bad data. The slides are available here. Go check them out.

AI Ethics, Impossibility Theorems and Tradeoffs

I recently gave a talk at CrunchConf 2018. The talk was about the various impossibility theorems that a person concerned with AI ethics must contend with. The slides are available here. Go check them out.

Bayesian Linear Regression (in PyMC) - a different way to think about regression

Consider a data set, a sequence of points $$(x_1, y_1), (x_2, y_2), \ldots, (x_k, y_k)$$. We are interested in discovering the relationship between x and y. Linear regression, at its simplest, assumes a relationship between x and y of the form $$y = \alpha x + \beta + e$$. Here, the variable $$e$$ …

Bayesian A/B Testing - my talk at Gilt

I gave a talk on Friday about Bayesian A/B testing at Gilt's engineering seminar. You can view the slides here.

Wingify releases Bayesian A/B tester

I've written a number of posts here about A/B testing, and readers have probably observed that I favor the Bayesian approach. I'm very happy to announce that Wingify (my employer) has released SmartStats - a fully Bayesian A/B testing engine. I've always maintained that you should A/B test …

Don't use Hadoop - your data isn't that big

"So, how much experience do you have with Big Data and Hadoop?" they asked me. I told them that I use Hadoop all the time, but rarely for jobs larger than a few TB. I'm basically a big data neophite - I know the concepts, I've written code, but never at …

A High Frequency Trader's Apology, Pt 1

I'm a former high frequency trader. And following the tradition of G.H. Hardy, I feel the need to make an apology for my former profession. Not an apology in the sense of a request for forgiveness of wrongs performed, but merely an intellectual justification of a field which is …
