Postgres NOTIFY for cache busting and more

"There are only two hard things in Computer Science: cache invalidation and naming things."

Phil Karlton

For those of using Postgres as a data store, cache invalidation has become significantly easier. Postgres has introduced the command NOTIFY which can be used to inform the cache of necessary invalidation.

The old …

more ...

Compound Aggregates in Hadoop/Scalding

Consider the following problem. I have an extremely large number of servers, each of which uploads their logs to a Hadoop cluster. Each line of the log file contains a server IP address, and represents a single message in Hadoop. I'm investigating a network intrusion. One of my network admins …

more ...

Don't use Hadoop - your data isn't that big

image possibly inspired by this post

"So, how much experience do you have with Big Data and Hadoop?" they asked me. I told them that I use Hadoop all the time, but rarely for jobs larger than a few TB. I'm basically a big data neophite - I know the concepts, I've written code, but never at …

more ...

java.lang.OutOfMemoryError, GC overhead limit exceeded

One annoying error which I often see when running Hadoop jobs is this:

java.lang.OutOfMemoryError: GC overhead limit exceeded

The cause of this error is that Java is spending a lot of time inside the garbage collector, and is not freeing up large chunks of memory. When this error …

more ...

Mechanical Turk and Error Correcting Codes

In a recent blog post, Panos Ipeirotis asks the following question (loosely paraphrased).

Consider a set of n oracles or mechanical turks, each with a probability q of returning an erroneous response. The capacity of a channel with error probability q is C(q,n), and information theoretic channel capacity …

more ...

Bayesian Bandits - optimizing click throughs with statistics

Great news! A murder victim has been found. No slow news day today! The story is already written, now a title needs to be selected. The clever reporter who wrote the story has come up with two potential titles - "Murder victim found in adult entertainment venue" and "Headless Body found …

more ...

Why Not Python - the GIL hinders concurrency

Python is not ready for the big leagues, at least if you need to deal with concurrency. I know this is a fairly controversial statement, and I plan to back it up with very specific critiques. The general gist of my critique is that python fails to properly utilize modern …

more ...