Postgres NOTIFY for cache busting and more

"There are only two hard things in Computer Science: cache invalidation and naming things."

Phil Karlton

For those of using Postgres as a data store, cache invalidation has become significantly easier. Postgres has introduced the command NOTIFY which can be used to inform the cache of necessary invalidation.

The old ...

more ...

Compound Aggregates in Hadoop/Scalding

Consider the following problem. I have an extremely large number of servers, each of which uploads their logs to a Hadoop cluster. Each line of the log file contains a server IP address, and represents a single message in Hadoop. I'm investigating a network intrusion. One of my network ...

more ...

Don't use Hadoop - your data isn't that big

image possibly inspired by this post

"So, how much experience do you have with Big Data and Hadoop?" they asked me. I told them that I use Hadoop all the time, but rarely for jobs larger than a few TB. I'm basically a big data neophite - I know the concepts, I've written code, but ...

more ...

java.lang.OutOfMemoryError, GC overhead limit exceeded

One annoying error which I often see when running Hadoop jobs is this:

java.lang.OutOfMemoryError: GC overhead limit exceeded

The cause of this error is that Java is spending a lot of time inside the garbage collector, and is not freeing up large chunks of memory. When this error ...

more ...

Mechanical Turk and Error Correcting Codes

In a recent blog post, Panos Ipeirotis asks the following question (loosely paraphrased).

Consider a set of n oracles or mechanical turks, each with a probability q of returning an erroneous response. The capacity of a channel with error probability q is C(q,n), and information theoretic channel capacity ...

more ...

Bayesian Bandits - optimizing click throughs with statistics

Great news! A murder victim has been found. No slow news day today! The story is already written, now a title needs to be selected. The clever reporter who wrote the story has come up with two potential titles - "Murder victim found in adult entertainment venue" and "Headless Body found ...

more ...

Why Not Python - the GIL hinders concurrency

Python is not ready for the big leagues, at least if you need to deal with concurrency. I know this is a fairly controversial statement, and I plan to back it up with very specific critiques. The general gist of my critique is that python fails to properly utilize modern ...

more ...


Caching the Identity for Fun and Profit

One of the wonderful features of Scala and other high level languages is that they are very expressive. Very often, one can represent business objects as a simple map, e.g.:

val headers = Map("User-Agent" -> "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0 ...
more ...