One of the wonderful features of Scala and other high-level languages is that they are very expressive. Very often, one can represent business objects as a simple map, e.g.:

val headers = Map("User-Agent" -> "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.63 Safari/537.31",
"Accept" -> "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-encoding" -> "gzip,deflate,sdch"
)


Very often one might need to process a large number of such objects. An actual page of heap used by our program might look something like this (the - character represents binary data we don’t care about for this discussion):

User-Agent----Accept----gzip,deflate,sdch---Accept-----User-Agent-----gzip,deflate,sdch---
-----Accept-encoding-----User-Agent-------gzip,deflate,sdch---Mozilla/5.0 (X11; Linux x86_
-------User-Agent-------gzip,deflate,sdch---Internet Explorer-----------------------------


Fugly. We are storing the same common strings many times.

So I’ve just launched my new startup, BeerBnB. It’s a hip little site matching beer drinkers with specialty microbreweries - AirBnB for drinkers, or maybe eBay for brewers. My marketer growth hacker has gotten some early publicity by advertising in the bathroom of a few bars - the result was 794 unique visitors of whom 12 created an account. Doing some division I’ve computed an empirical conversion rate of 12/794=1.5%.

To begin with, this seems promising. A 1.5% conversion rate isn’t great, but it’s certainly enough to get started. Investors have suggested that they will probably invest if the conversion rate exceeds 1%.

Now, suppose the marketer has the ability to get a lot more publicity. He can expose the BeerBnB site to approximately 10,000 visitors via toilet ads at bars around the city. Suppose we make the assumption that these 10,000 visitors will convert at the same rate as the 794 early visitors. How many people can I reasonably expect to sign up? This isn’t a trick question - the expectation is about 150 signups. But how confident are we that we will really see 150 signups? How confident are we that the conversion rate is higher than 1%?
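One way to put rough numbers on these questions is sketched below. It assumes a uniform Beta(1, 1) prior on the conversion rate (an assumption made here for illustration, not something stated above), so the posterior after 12 signups from 794 visitors is Beta(13, 783). The helper names `beta_pdf` and `prob_rate_exceeds` are hypothetical, and the integral is a plain midpoint rule rather than a library call.

```python
import math

VISITORS = 794
SIGNUPS = 12
FUTURE_VISITORS = 10_000


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x, computed in log space."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def prob_rate_exceeds(threshold, a, b, steps=200_000):
    """P(rate > threshold) under Beta(a, b), via a midpoint-rule integral."""
    width = (1.0 - threshold) / steps
    total = 0.0
    for i in range(steps):
        total += beta_pdf(threshold + (i + 0.5) * width, a, b)
    return total * width


empirical_rate = SIGNUPS / VISITORS
expected_signups = FUTURE_VISITORS * empirical_rate

# Assumed uniform prior Beta(1, 1); posterior is Beta(1 + successes, 1 + failures).
a = 1 + SIGNUPS
b = 1 + (VISITORS - SIGNUPS)
p_above_1pct = prob_rate_exceeds(0.01, a, b)

print(f"empirical rate:   {empirical_rate:.4f}")
print(f"expected signups: {expected_signups:.0f}")
print(f"P(rate > 1%):     {p_above_1pct:.3f}")
```

The expectation uses only the empirical rate. The last number depends on the prior, so a different prior would shift it somewhat.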

In Bayesian reasoning, the fundamental problem is the following. Given a prior distribution $p(x)$ and some set of evidence $E$, compute a posterior distribution on $x$, namely $p(x | E)$. For example, $x$ might be the conversion rate of some email. Before you have any evidence you might expect the conversion rate to be somewhere between perhaps $5\%$ and $50\%$. After you have evidence, you update your belief - if you sent out thousands of emails and observed an empirical $16.5\%$ conversion rate, you are now reasonably confident that the true conversion rate lies roughly in the range of $16\%-17\%$.

In mathematics, a conjugate prior consists of the following. Consider a family of probability distributions characterized by some parameter $\theta$ (possibly a single number, possibly a tuple). A prior is a conjugate prior if it is a member of this family and if all possible posterior distributions are also members of this family.
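As a concrete illustration (a standard textbook example, not taken from the text above), take the family to be the Beta distributions with $\theta = (\alpha, \beta)$, and suppose the evidence $E$ is $k$ successes in $n$ independent trials, each succeeding with probability $x$. Then

$$
p(x) \propto x^{\alpha-1}(1-x)^{\beta-1}, \qquad p(E | x) \propto x^{k}(1-x)^{n-k}
$$

$$
p(x | E) \propto p(E | x)\, p(x) \propto x^{\alpha+k-1}(1-x)^{\beta+n-k-1}
$$

so the posterior is again a Beta distribution, now with parameter $(\alpha + k, \beta + n - k)$. Updating on evidence only changes $\theta$. It never leaves the family.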

Hypst is an early stage startup - the elevator pitch is “like Facebook, before it became mainstream”. The founders of Hypst have read a lot about A/B testing and statistics, and they decide to use the techniques they learned to improve engagement.

The first thing they notice is that not enough people are inviting their cool friends. So they come up with alternate captions, and discover that “invite only your cool friends” achieves 20% more invitations than “invite your buddies”. They commit this version to master and continue. Then they notice that no one is clicking on their “liked it before it was cool” button. They come up with alternate designs, run an A/B test, and discover that a green button achieves 20% more clicks than a blue one. The third thing they notice is that people don’t engage with the Irony Feed. They tweak their algorithm, run another A/B test, and discover that a wider Irony Feed gets 20% more clicks than the original design.

All their tests were run with proper statistical methods and clean experiment design. All the aforementioned test results were statistically valid. Yet somehow, after running these three tests and implementing the best version, they observe that clicks are down 2.8% across all categories. WTF just happened?

By now, it is a fairly uncontroversial opinion that ORMs create a large number of difficulties when developing larger systems. They have been famously called the Vietnam of Computer Science. The main alternative to ORMs is constructing SQL by hand, but unfortunately that is a rather dangerous thing to do in the present day.

With the collapse of Knight Capital recently, there has been a lot of scrutiny of High Frequency Trading (HFT). Breathless reporters have been bombarding us with articles suggesting that there is danger out there.

That’s all nonsense.

Scott Aaronson has a great article explaining how quantum mechanics should be taught. The basis of his idea is to describe quantum mechanics as being the mathematical structure you get when you attempt to generalize probability theory to include negative numbers. Once you do this, the L^2 norm replaces the L^1 norm, an inner product can be derived, and quantum theory falls out. Quite elegant.

It’s also a vastly better way to learn than the historically-based approach through which I was taught [1].

[1] Another flaw in the historical approach is that it overemphasizes the photoelectric effect - specifically, it claims that the photoelectric effect proves light is quantized. But that’s incorrect - you can derive the photoelectric effect directly from a Schrodinger equation with a classical electric field.
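To make the norm swap described above concrete, here is a minimal sketch. The symbols $p_i$ and $\psi_i$ are notation assumed here, not necessarily the article's. A classical probability vector has unit L^1 norm, while a vector of amplitudes, whose entries may be negative (and, in the full theory, complex), has unit L^2 norm:

$$
\text{probability theory:} \quad p_i \ge 0, \qquad \sum_i p_i = 1
$$

$$
\text{quantum theory:} \quad \psi_i \in \mathbb{C}, \qquad \sum_i |\psi_i|^2 = 1, \qquad P(i) = |\psi_i|^2
$$

The probability of observing outcome $i$ is recovered as $|\psi_i|^2$, so the measured probabilities still sum to 1.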

Ron Unz over at American Conservative has written an excellent blog post called Race, IQ and Wealth. It’s well worth reading, gets into data, and is a fairly dispassionate analysis of the topic. It also (politely and subtly) calls out the supposed “experts” in the field, who ignore all sorts of valuable data (quite a bit of which actually supports their views) merely because their ideological opponents present it.

One of the more important tools in my computational toolbox is linear programming. It’s a great way to solve a lot of otherwise difficult problems in a straightforward, nearly black-box manner. I’ve discovered that a lot of programmers aren’t too familiar with it, so I’m writing this tutorial that explains how to use it for practical purposes.

Previously I argued that some chunk of HFT is rent seeking due to the minimal tick sizes induced by the subpenny rule. It appears now that there is a proposal to the SEC to increase the tick size in order to increase market making activity. I.e., if you raise the price floor on liquidity, you’ll induce an oversupply of it.

In a recent post, a company selling A/B testing services made the claim that A/B testing is superior to bandit algorithms. They do make a compelling case that A/B testing is superior to one particular not very good bandit algorithm, because that particular algorithm does not take into account statistical significance.

However, there are bandit algorithms that account for statistical significance.
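One example of such an algorithm is Thompson sampling with Beta posteriors. It is named here as an illustration, not necessarily as the algorithm the post goes on to discuss. Each arm keeps a posterior over its conversion rate, and the width of that posterior plays the role of statistical uncertainty: an arm is only abandoned once the data make it unlikely to be the best. The true rates, the uniform Beta(1, 1) priors, and the function name `thompson_sampling` are all assumptions made for this sketch.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling; return pulls per arm."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        # Draw one plausible rate per arm from its Beta posterior
        # (uniform Beta(1, 1) prior assumed), then play the best draw.
        draws = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
        arm = max(range(len(draws)), key=draws.__getitem__)
        pulls[arm] += 1
        # Simulated visitor: converts with the arm's (hidden) true rate.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


pulls = thompson_sampling(true_rates=[0.05, 0.15], rounds=5000)
print(pulls)
```

Early on, both posteriors are wide and traffic is split. As evidence accumulates, the worse arm's draws rarely win and most traffic flows to the better arm.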

Machines are increasingly replacing humans across many walks of life. Huge numbers of jobs are being made obsolete. Historically this trend has been a positive force in terms of living standards. Additionally, as old jobs have been replaced, humans have found new work to do.

In a series of blog posts, Gary Rubinstein attempts to show that the Value Added Modelling scores recently released by the NYC Department of Education prove that VAM (Value Added Modelling) is not accurate.

Just thought it’s worth pointing out a couple of responses to my post proposing the elimination of the subpenny rule.

In previous posts, I discussed the basic mechanics and social utility of high frequency trading. Of particular import is that I characterized the latency arms race as socially wasteful. Now I’ll discuss a policy proposal which might mitigate the harmful effects of the race for latency, while giving better prices to speculators.

In a recent Hacker News thread, a great deal of criticism was heaped on Fermi problems. “How many golf balls fit on a double decker bus?” “How many piano tuners live in Seattle?” I think much of the criticism is unfounded, since Fermi problems come up all the time in computing.

In a recent blog post, Karl Smith defends U3 from its critics. The majority of his defense is spot on - U3 is a consistent, well-understood measurement which we can use to measure changes over time, and we also have a pretty good understanding of its relationship to other relevant quantities.

But I’ll dispute him on one point - U3 by itself does not measure whether or not the labor market is clearing.

In a previous post I discussed the mechanics of HFT. If you haven’t read it, go read it now. Now I’ll discuss its social utility and cost.

A while back, Hacker News was deluged with assorted command line bug trackers. It’s pretty clear that many people want a command line interface to their bug tracking system. But most of us are forced to use an existing system and can’t switch to such a new one. So I wrote Idli, which is a command line interface to existing bug trackers.

I’m a former high frequency trader. And following the tradition of G.H. Hardy, I feel the need to make an apology for my former profession. Not an apology in the sense of a request for forgiveness of wrongs performed, but merely an intellectual justification of a field which is often misunderstood.

In this blog post, I’ll attempt to explain the basics of how high frequency trading works and why traders attempt to improve their latency. In future blog posts, I’ll attempt to justify the social value of HFT (under some circumstances), and describe other circumstances under which it is not very useful. Eventually I’ll even put forward a policy prescription which I believe could cause HFT to focus primarily on socially valuable activities.

HTTP is a stateless protocol. For this reason, storing data in the memory of your webserver is often considered a bad practice. In general, this is correct - your webserver’s RAM is a bad place to store permanent data. It’s volatile, and if you have multiple servers behind a load balancer, it’s likely to give inconsistent results between requests.

But it’s a great place to cache certain pieces of data.
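As a minimal sketch of what such a cache might look like, the class below keeps values in process memory and recomputes them once they are older than a time-to-live. `TTLCache`, `get_or_compute`, and `load_settings` are hypothetical names chosen for illustration, and the sketch is not thread-safe, which a real multi-threaded webserver would need to address.

```python
import time


class TTLCache:
    """Tiny in-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # key -> (expiry_time, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute() if missing or stale."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value


calls = []


def load_settings():
    # Stand-in for an expensive lookup, e.g. a database query.
    calls.append(1)
    return {"theme": "dark"}


cache = TTLCache(ttl_seconds=60)
first = cache.get_or_compute("settings", load_settings)
second = cache.get_or_compute("settings", load_settings)
print(len(calls))  # the loader ran once; the second call was served from memory
```

Because every entry can be recomputed from the permanent store, losing it on restart or seeing slightly different ages on different servers only costs a recomputation.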

In a fairly popular blog post, Raganwald advocates against using various heuristics while filtering out resumes. For the most part, I agree with him. Since it’s hard to find good programmers, false negatives are very costly compared to the cost of a false positive (a wasted interview).

But I have found one filter which actually works well. Not only did it filter out bad candidates, it actually increased the number of applications we received (both the total and the number of good candidates).

So you’ve decided to leave academia, or are perhaps just thinking about doing so. Welcome to the dark side. I made the transition a few years ago, and since then I’ve gotten a number of questions about how to do it. Hence, this article.

In this post I’d like to introduce Hobo, a fast in-memory index for your data. Hobo is not a database, it’s an external index for data you have stored elsewhere. Hobo is used for two primary purposes - the first is to answer queries of the form “find me all items with features X and Y”, and the second is to find items with a specified color.
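To illustrate the first kind of query, here is a minimal sketch of an inverted index answered by set intersection. This is not Hobo's implementation or API. `FeatureIndex`, `add`, and `query` are hypothetical names, and the item ids and features are made up.

```python
from collections import defaultdict


class FeatureIndex:
    """Maps each feature to the set of item ids that have it."""

    def __init__(self):
        self._items_by_feature = defaultdict(set)

    def add(self, item_id, features):
        for feature in features:
            self._items_by_feature[feature].add(item_id)

    def query(self, *features):
        """Return ids of items that have every one of the given features."""
        if not features:
            return set()
        # Intersect the smallest sets first to keep intermediate results small.
        posting_sets = sorted(
            (self._items_by_feature.get(f, set()) for f in features), key=len
        )
        result = set(posting_sets[0])
        for ids in posting_sets[1:]:
            result &= ids
        return result


index = FeatureIndex()
index.add(1, ["red", "leather"])
index.add(2, ["red", "canvas"])
index.add(3, ["blue", "leather"])
print(index.query("red", "leather"))  # {1}
```

The index stores only ids, so the items themselves can stay in whatever database already holds them.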

A major problem with running Postgres on EC2 is that EBS performance often sucks. In addition to performing poorly, EBS also uses the network connection, which can be undesirable. Ephemeral storage is provided, and tends to have better performance characteristics, but unfortunately it lacks durability.

I’ve run into a curious error while cloning Ubuntu VM hard drives. After cloning, the network card no longer works. However, if I clone the VM completely (including the mac address), there is no problem.

A fairly recent trend in discussions of obesity is to focus on weight stability. Weight stability is a phenomenon by which a human maintains roughly the same bodyweight over a long period of time. Karl Smith recently brought it up, for example:

Eric Ries over at TechCrunch recently wrote an article discussing racism/sexism in Silicon Valley and the technology industry. The article discusses differences in aptitude between men and women, and attempts to downplay them as the cause of a lack of women in technology (and at YC-funded companies, specifically):

Many people have commented on the fact that, after adjusting for chained CPI, median income has not risen significantly since the 1970s. Tyler Cowen points to this as evidence for his theory of the “Great Stagnation”, which purports that the economy has grown more slowly during the latter parts of the 20th century than during the former.

High heels were invented in the near east around the year 900 with the very practical purpose of making it easier for a man riding a horse to keep his foot in the stirrup. This was crucial for the Persian cavalry, all of whom wore heels. This is the same reason why contemporary cowboy boots often have a heel. This trend eventually made its way west, and upper class men began wearing boots with a heel in the late 1500s. But a curious thing happened - men who did not ride began wearing boots with heels in order to emulate men who were rich enough (and thus high status enough) to own a horse. High heels stopped being a practical measure to stay in the saddle and became a social signal. To indicate wealth, fashionability and social status, men and women walked around town in stylized riding shoes.

I recently ran into the following error when trying to copy data from a local Hadoop cluster into Amazon S3:

#!bash
$ hadoop distcp -i / s3n://USERID:SECRETKEY@BUCKETNAME/
11/04/13 07:15:31 INFO tools.DistCp: srcPaths=[/]
11/04/13 07:15:31 INFO tools.DistCp: destPath=s3n://USERID:SECRETKEY@BUCKETNAME/
With failures, global counters are inaccurate; consider running with -i
Copy failed: java.lang.NullPointerException
    at org.apache.hadoop.tools.DistCp.makeRelative(DistCp.java:901)
    at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1059)
    at org.apache.hadoop.tools.DistCp.copy(DistCp.java:650)
    at org.apache.hadoop.tools.DistCp.run(DistCp.java:857)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.tools.DistCp.main(DistCp.java:884)

### Investment, employment, and the role of women

I recently saw a post on the Freakonomics blog in which Justin Wolfers harshly criticized John Taylor’s blog post which reported a strong correlation between investment and unemployment:

### Hadoop's MapWritable sometimes a performance hog

I’ve been using Hadoop a lot lately for a stealth mode project I’m working on. One of the big lessons I’m learning is that where medium to big data is concerned, data formats matter a lot. Where small filesizes are concerned, there is little harm in slinging around JSON objects and text representations. But once you reach the point of several GB, it’s in your best interest to think carefully about efficient representations.

### Structural Shift in the Economy

### Hadoop error - HTTP Response Code 503

I recently had a power failure, which resulted in my hadoop cluster shutting down. No matter, hadoop came back after a little while. However, I ran into problems immediately after restarting it:

#!bash
$ hadoop fsck /
