Don't use Hadoop - your data isn't that big

image possibly inspired by this post

"So, how much experience do you have with Big Data and Hadoop?" they asked me. I told them that I use Hadoop all the time, but rarely for jobs larger than a few TB. I'm basically a big data neophite - I know the concepts, I've written code, but never at …

more ...

java.lang.OutOfMemoryError, GC overhead limit exceeded

One annoying error which I often see when running Hadoop jobs is this:

java.lang.OutOfMemoryError: GC overhead limit exceeded

The cause of this error is that Java is spending a lot of time inside the garbage collector, and is not freeing up large chunks of memory. When this error …

more ...

Mechanical Turk and Error Correcting Codes

In a recent blog post, Panos Ipeirotis asks the following question (loosely paraphrased).

Consider a set of n oracles or mechanical turks, each with a probability q of returning an erroneous response. The capacity of a channel with error probability q is C(q,n), and information theoretic channel capacity …

more ...

Bayesian Bandits - optimizing click throughs with statistics

Great news! A murder victim has been found. No slow news day today! The story is already written, now a title needs to be selected. The clever reporter who wrote the story has come up with two potential titles - "Murder victim found in adult entertainment venue" and "Headless Body found …

more ...

Why Not Python - the GIL hinders concurrency

Python is not ready for the big leagues, at least if you need to deal with concurrency. I know this is a fairly controversial statement, and I plan to back it up with very specific critiques. The general gist of my critique is that python fails to properly utilize modern …

more ...


Caching the Identity for Fun and Profit

One of the wonderful features of Scala and other high level languages is that they are very expressive. Very often, one can represent business objects as a simple map, e.g.:

val headers = Map("User-Agent" -> "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26 …
more ...


The magic of conjugate priors (for online learning)

In Bayesian reasoning, the fundamental problem is the following. Given a prior distribution $@p(x)$@, and some set of evidence $@E$@, compute a posterior distribution on $@x$@ namely $@p(x | E)$@. For example, $@x$@ might be the conversion rate of some email. Before you have any evidence you might expect …

more ...

The Metrics Manifesto - Why you need an objective function

Hypst is an early stage startup - the elevator pitch is "like Facebook, before it became mainstream". The founders of Hypst have read a lot about A/B testing and statistics, and they decide to use the techniques they learned to improve engagement.

The first thing they notice is that not …

more ...