-
Monads are like a dildo factory, staffed by midgets
At their most basic, monads are a monoid in the category of endofunctors. But that’s an explanation that only appeals to mathematicians. They are also a design pattern, but that’s an explanation that only appeals to computer geeks.
You’ve read many monad tutorials. For instance, monads are like monsters. No wait, monads are like space suits, and functions of type a -> M b are like space brothels where you take off your suit, get space herpes and then put your suit back on. But this post is the ultimate in monad tutorials - this is the one that will finally cause them to make sense in your mind.
So consider a program being used to run a dildo factory. The most basic underlying type is the Dildo:
    data Dildo = NormalDildo | Rabbit | StrapOn | ...
We also have a data type representing the box:
    data Box a = Box a
Now consider one of the midgets, whose job it is to do work to a dildo and put it into a box. In the abstract, the type signature of the midget is Dildo -> Box Dildo. But sometimes the midgets need to take a dildo out of the box, do some work on it, and put it back into the box. Monads to the rescue.
-
Caching the Identity for Fun and Profit
One of the wonderful features of Scala and other high level languages is that they are very expressive. Very often, one can represent business objects as a simple map, e.g.:
    val headers = Map("User-Agent" -> "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.63 Safari/537.31",
                      "Accept" -> "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                      "Accept-encoding" -> "gzip,deflate,sdch" )
Very often one might need to process a large number of such objects. An actual page of heap used by our program might look something like this (the - character represents binary data we don’t care about for this discussion):
    User-Agent----Accept----gzip,deflate,sdch---Accept-----User-Agent-----gzip,deflate,sdch---
    -----Accept-encoding-----User-Agent-------gzip,deflate,sdch---Mozilla/5.0 (X11; Linux x86_
    -------User-Agent-------gzip,deflate,sdch---Internet Explorer-----------------------------
Fugly. We are storing the same common strings many times.
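One way to avoid the duplication is to keep a single canonical copy of each distinct string and hand out references to it. Here is a minimal sketch of that idea, written in Python rather than Scala. The `StringPool` class, its `canonical` method and the sample header values are hypothetical names made up for illustration, not code from this post:

```python
class StringPool:
    """Hypothetical pool that keeps one canonical object per distinct string."""

    def __init__(self):
        self._pool = {}

    def canonical(self, s):
        # setdefault stores s the first time it is seen and returns the stored
        # object on every later call, so equal strings end up sharing one object.
        return self._pool.setdefault(s, s)

    def __len__(self):
        return len(self._pool)


pool = StringPool()

# Build the keys at runtime so each request starts with its own string objects.
requests = [
    {"".join(["User", "-Agent"]): "Mozilla/5.0", "".join(["Acc", "ept"]): "text/html"}
    for _ in range(1000)
]

# Replace every key and value with its canonical copy from the pool.
deduplicated = [
    {pool.canonical(k): pool.canonical(v) for k, v in headers.items()}
    for headers in requests
]
```

After this pass the thousand maps all point at the same four string objects. Python's standard library also provides `sys.intern`, which serves a similar purpose for strings.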
-
Analyzing conversion rates with Bayes Rule (Bayesian statistics tutorial)
So I’ve just launched my new startup, BeerBnB. It’s a hip little site matching beer drinkers with specialty microbreweries - AirBnB for drinkers, or maybe eBay for brewers. My marketer (er, growth hacker) has gotten some early publicity by advertising in the bathroom of a few bars - the result was 794 unique visitors of whom 12 created an account. Doing some division I’ve computed an empirical conversion rate of 12/794=1.5%.
To begin with, this seems promising. A 1.5% conversion rate isn’t great, but it’s certainly enough to get started. Investors have suggested that they will probably invest if the conversion rate exceeds 1%.
Now, suppose the marketer has the ability to get a lot more publicity. He can expose the BeerBnB site to approximately 10,000 visitors via toilet ads at bars around the city. Suppose we make the assumption that these 10,000 visitors will convert at the same rate as the 794 early visitors. How many people can I reasonably expect to sign up? This isn’t a trick question - the expectation is about 150 signups. But how confident are we that we will really see 150 signups? How confident are we that the conversion rate is higher than 1%?
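To make both questions concrete, here is a rough Python sketch. It assumes a uniform Beta(1, 1) prior on the conversion rate, so the posterior after 12 signups out of 794 visitors is Beta(13, 783). The prior choice and the helper names `beta_pdf` and `prob_rate_exceeds` are my assumptions for illustration, not necessarily the approach taken in the full post:

```python
import math


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, computed in log space for stability."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def prob_rate_exceeds(threshold, a, b, steps=100_000):
    """P(rate > threshold) under Beta(a, b), via the midpoint rule on [threshold, 1]."""
    width = (1.0 - threshold) / steps
    total = sum(beta_pdf(threshold + (i + 0.5) * width, a, b) for i in range(steps))
    return total * width


visitors, signups = 794, 12

# Uniform Beta(1, 1) prior (an assumption) plus the observed data.
a = 1 + signups
b = 1 + (visitors - signups)

# Point estimate: scale the empirical rate up to 10,000 visitors.
expected_signups = 10_000 * signups / visitors

# Posterior probability that the true rate clears the investors' 1% bar.
p_above_1pct = prob_rate_exceeds(0.01, a, b)

print(round(expected_signups), p_above_1pct)
```

The point estimate comes out at roughly 150 signups, as stated above. The second number is the interesting one: it says how much of the posterior sits above the 1% threshold, given only 794 visitors' worth of evidence.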
-
The magic of conjugate priors (for online learning)
In Bayesian reasoning, the fundamental problem is the following. Given a prior distribution $@p(x)$@, and some set of evidence $@E$@, compute a posterior distribution on $@x$@, namely $@p(x | E)$@. For example, $@x$@ might be the conversion rate of some email. Before you have any evidence you might expect the conversion rate to be somewhere between $@5\%$@ and $@50\%$@. After you have evidence, you update your belief - if you sent out thousands of emails and observed an empirical $@16.5\%$@ conversion rate, you are now reasonably confident that the true conversion rate lies roughly in the range of $@16\%-17\%$@.
In mathematics, a conjugate prior consists of the following. Consider a family of probability distributions characterized by some parameter $@\theta$@ (possibly a single number, possibly a tuple). A prior is a conjugate prior if it is a member of this family and if all possible posterior distributions are also members of this family.
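As a small illustration of why this matters for online learning, take the standard Beta/Bernoulli pair: a Beta(a, b) prior on a conversion rate, updated with s conversions and f non-conversions, gives a Beta(a + s, b + f) posterior. The Python sketch below uses made-up counts (165 conversions out of 1000 emails, chosen to match the 16.5% above) and a uniform prior, both of which are assumptions. The `update` and `mean` helpers are hypothetical names:

```python
def update(prior, successes, failures):
    """Beta(a, b) prior plus binomial evidence gives Beta(a + successes, b + failures)."""
    a, b = prior
    return (a + successes, b + failures)


def mean(dist):
    """Mean of a Beta(a, b) distribution."""
    a, b = dist
    return a / (a + b)


prior = (1, 1)  # uniform prior: an assumption for this example

# Batch update: process all 1000 emails at once.
batch = update(prior, successes=165, failures=835)

# Online update: process one email at a time, keeping only the two parameters.
online = prior
for converted in [True] * 165 + [False] * 835:
    online = update(online, int(converted), int(not converted))

print(batch, online, mean(batch))
```

Both routes land on the same posterior, and at no point does the online version need to store anything beyond the pair of parameters.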
-
The Metrics Manifesto - Why you need an objective function
Hypst is an early stage startup - the elevator pitch is “like Facebook, before it became mainstream”. The founders of Hypst have read a lot about A/B testing and statistics, and they decide to use the techniques they learned to improve engagement.
The first thing they notice is that not enough people are inviting their cool friends. So they come up with alternate captions, and discover that “invite only your cool friends” achieves 20% more invitations than “invite your buddies”. They commit this version to master and continue. Then they notice that no one is clicking on their “liked it before it was cool” button. They come up with alternate designs, run an A/B test, and discover that a green button achieves 20% more clicks than a blue one. The third thing they notice is that people don’t engage with the Irony Feed. They tweak their algorithm, run another A/B test, and discover that a wider Irony Feed gets 20% more clicks than the original design.
All their tests were run with proper statistical methods and clean experiment design. All the aforementioned test results were statistically valid. Yet somehow, after running these three tests and implementing the best version, they observe that clicks are down 2.8% across all categories. WTF just happened?
-
Write Queries with Tiramisu
By now, it is a fairly uncontroversial opinion that ORMs create a large number of difficulties when developing larger systems. They have been famously called the Vietnam of Computer Science. The main alternative to ORMs is constructing SQL by hand, but unfortunately that is a rather dangerous thing to do in the present day.
-
Flash Crash? Or Flash in the Pan?
With the collapse of Knight Capital recently, there has been a lot of scrutiny of High Frequency Trading (HFT). Breathless reporters have been bombarding us with articles suggesting that there is danger out there.
That’s all nonsense.
-
What I'm reading - How Quantum Mechanics Should be Taught
Scott Aaronson has a great article explaining how quantum mechanics should be taught. The basis of his idea is to describe quantum mechanics as being the mathematical structure you get when you attempt to generalize probability theory to include negative numbers. Once you do this, the L^2 norm replaces the L^1 norm, an inner product can be derived, and quantum theory falls out. Quite elegant.
It’s also a vastly better way to learn than the historically-based approach through which I was taught [1].
Go read it.
[1] Another flaw in the historical approach is that it overemphasizes the photoelectric effect - specifically, it claims that the photoelectric effect proves light is quantized. But that’s incorrect - you can derive the photoelectric effect directly from a Schrodinger equation with a classical electric field.
-
What I'm reading - Race, IQ and Wealth
Ron Unz over at American Conservative has written an excellent blog post called Race, IQ and Wealth. It’s well worth reading, gets into data, and is a fairly dispassionate analysis of the topic. It also (politely and subtly) calls out the supposed “experts” in the field, who ignore all sorts of valuable data (quite a bit of which actually supports their views) merely because their ideological opponents present it.
I strongly recommend reading it.
-
Minimize your cloud costs with GLPK and Haskell
One of the more important tools in my computational toolbox is linear programming. It’s a great way to solve a lot of otherwise difficult problems in a straightforward, nearly black-box manner. I’ve discovered that a lot of programmers aren’t too familiar with it, so I’m writing this tutorial to explain how to use it for practical purposes.
-
Proposal - bigger ticks, more rent seeking
Previously I argued that some chunk of HFT is rent seeking due to the minimal tick sizes induced by the subpenny rule. It appears now that there is a proposal to the SEC to increase the tick size in order to increase market making activity. I.e., if you raise the price floor on liquidity, you’ll induce an oversupply of it.
-
Why Multi-armed Bandit algorithms are superior to A/B testing
In a recent post, a company selling A/B testing services made the claim that A/B testing is superior to bandit algorithms. They do make a compelling case that A/B testing is superior to one particular not very good bandit algorithm, because that particular algorithm does not take into account statistical significance.
However, there are bandit algorithms that account for statistical significance.
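As one example of what such an algorithm can look like, here is a minimal Python sketch of UCB1, which adds a confidence bonus to each arm's observed mean so that arms with little data keep getting tried. UCB1 is my choice for illustration and is not necessarily the algorithm discussed in the full post. The conversion rates, round count and function names are all made up:

```python
import math
import random


def ucb1_choose(counts, rewards, t):
    """Pick an arm: any untried arm first, else the highest upper confidence bound."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        rewards[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
        for i in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_ucb1(true_rates, rounds, seed=0):
    """Simulate UCB1 against Bernoulli arms; returns how often each arm was shown."""
    rng = random.Random(seed)
    counts = [0] * len(true_rates)
    rewards = [0.0] * len(true_rates)
    for t in range(1, rounds + 1):
        arm = ucb1_choose(counts, rewards, t)
        counts[arm] += 1
        # Simulated visitor: converts with the arm's (hidden) true rate.
        rewards[arm] += 1.0 if rng.random() < true_rates[arm] else 0.0
    return counts


# Two hypothetical page variants with made-up conversion rates.
counts = run_ucb1([0.05, 0.5], rounds=2000)
print(counts)
```

The bonus term shrinks as an arm accumulates observations, so the algorithm only abandons a variant once there is enough data to be fairly sure it is worse.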
-
Human vs Machine Progress
Machines are increasingly replacing humans across many walks of life. Huge numbers of jobs are being made obsolete. Historically this trend has been a positive force in terms of living standards. Additionally, as old jobs have been replaced, humans have found new work to do.
-
Don't use Scatterplots
In a series of blog posts, Gary Rubinstein attempts to show that the Value Added Modelling scores recently released by the NYC Department of Education prove that VAM (Value Added Modelling) is not accurate.
-
Subpenny rule elimination - roundup
Just thought it’s worth pointing out a couple of responses to my post proposing the elimination of the subpenny rule.
-
High Frequency Trading - What's broken and how to fix it
In previous posts, I discussed the basic mechanics and social utility of high frequency trading. Of particular import is that I characterized the latency arms race as socially wasteful. Now I’ll discuss a policy proposal which might mitigate the harmful effects of the race for latency, while giving better prices to speculators.
-
Why I ask 'how many golf balls fit on a bus?' on job interviews
In a recent Hacker News thread, a great deal of criticism was heaped on Fermi problems. “How many golf balls fit on a double decker bus?” “How many piano tuners live in Seattle?” I think much of the criticism is unfounded, since Fermi problems come up all the time in computing.
-
Unemployment and market clearing
In a recent blog post, Karl Smith defends U3 from its critics. The majority of his defense is spot on - U3 is a consistent, well-understood measurement which we can use to measure changes over time, and we also have a pretty good understanding of its relationship to other relevant quantities.
But I’ll dispute him on one point - U3 by itself does not measure whether or not the labor market is clearing.
-
A High Frequency Trader's Apology, Pt 2
In a previous post I discussed the mechanics of HFT. If you haven’t read it, go read it now. Now I’ll discuss its social utility and cost.
-
Idli - a command line interface to your bugtracker
A while back, Hacker News was deluged with assorted command line bug trackers. It’s pretty clear that many people want a command line interface to their bug tracking system. But most of us are forced to use an existing system and can’t switch to such a new one. So I wrote Idli, which is a command line interface to existing bug trackers.
-
A High Frequency Trader's Apology, Pt 1
I’m a former high frequency trader. And following the tradition of G.H. Hardy, I feel the need to make an apology for my former profession. Not an apology in the sense of a request for forgiveness of wrongs performed, but merely an intellectual justification of a field which is often misunderstood.
In this blog post, I’ll attempt to explain the basics of how high frequency trading works and why traders attempt to improve their latency. In future blog posts, I’ll attempt to justify the social value of HFT (under some circumstances), and describe other circumstances under which it is not very useful. Eventually I’ll even put forward a policy prescription which I believe could cause HFT to focus primarily on socially valuable activities.
-
In-app caching - spend a little RAM to speed up your site
HTTP is a stateless protocol. For this reason, it’s often considered bad practice to store data in the memory of your webserver. In general, this is correct - your webserver’s RAM is a bad place to store permanent data. It’s volatile, and if you have multiple servers behind a load balancer, it’s likely to give inconsistent results between requests.
But it’s a great place to cache certain pieces of data.
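To show the general shape of the idea, here is a minimal Python sketch of an in-process cache with a time-to-live. The `TTLCache` class, the 60 second expiry and the `load_front_page` function are all hypothetical, and the full post may take a different approach:

```python
import time


class TTLCache:
    """Hypothetical in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be exercised without sleeping
        self._entries = {}   # key -> (expiry time, value)

    def get(self, key, compute):
        """Return the cached value for key, calling compute() on a miss or after expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value


calls = []


def load_front_page():
    # Stand-in for an expensive database query or template render.
    calls.append(1)
    return "<html>front page</html>"


fake_now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])

cache.get("front_page", load_front_page)  # miss: computes and stores the value
cache.get("front_page", load_front_page)  # hit: served from RAM
fake_now[0] = 61.0
cache.get("front_page", load_front_page)  # expired: computed again
```

Because every entry expires, losing the cache on a restart costs nothing but a few recomputations, which is what makes volatile RAM acceptable here. Servers behind a load balancer can still disagree for up to one TTL.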
-
One Hiring Filter that Works
In a fairly popular blog post, Raganwald advocates against using various heuristics while filtering out resumes. For the most part, I agree with him. Since it’s hard to find good programmers, false negatives are very costly compared to the cost of a false positive (a wasted interview).
But I have found one filter which actually works well. Not only did it filter out bad candidates, it actually increased the number of applications (both in total, and the # of good candidates) we received.
-
How to leave academia
So you’ve decided to leave academia, or are perhaps just thinking about doing so. Welcome to the dark side. I made the transition a few years ago, and since then I’ve gotten a number of questions about how to do it. Hence, this article.
-
Introducing Hobo
In this post I’d like to introduce Hobo, a fast in-memory index for your data. Hobo is not a database, it’s an external index for data you have stored elsewhere. Hobo is used for two primary purposes - the first is to answer queries of the form “find me all items with features X and Y”, and the second is to find items with a specified color.
-
A simple trick to speed up complex Postgres queries on EC2
A major problem with running Postgres on EC2 is that EBS performance often sucks. In addition to performing poorly, EBS also uses the network connection, which can be undesirable. Ephemeral storage is provided, and tends to have better performance characteristics, but unfortunately it lacks durability.
-
Networking problems while cloning Ubuntu VM
I’ve run into a curious error while cloning Ubuntu VM hard drives. After cloning, the network card no longer works. However, if I clone the VM completely (including the MAC address), there is no problem.
-
The Calories In/Calories Out model explains weight stability
A fairly recent trend in discussions of obesity is to focus on weight stability. Weight stability is a phenomenon by which a human maintains roughly the same bodyweight over a long period of time. Karl Smith recently brought it up, for example:
-
TechCrunch messes up the math on sexism
Eric Ries over at TechCrunch recently wrote an article discussing racism/sexism in Silicon Valley and the technology industry. The article discusses differences in aptitude between men and women, and attempts to downplay them as the cause of a lack of women in technology (and at YC-funded companies, specifically):
-
Why Americans don't hire servants
-
Did immigrants (and Simpson's Paradox) cause the Great Stagnation?
Many people have commented on the fact that, after adjusting for chained CPI, median income has not risen significantly since the 1970’s. Tyler Cowen points to this as evidence for his theory of the “Great Stagnation”, which purports that the economy has grown more slowly during the latter parts of the 20th century than during the former.
-
MapReduce explained in 41 words
-
The poor don't work because they are economically rational
-
The High-Heel Bubble never popped, and the Education Bubble may not either
High heels were invented in the near east around the year 900 with the very practical purpose of making it easier for a man riding a horse to keep his foot in the stirrup. This was crucial for the Persian cavalry, all of whom wore heels. This is the same reason why contemporary cowboy boots often have a heel. This trend eventually made its way west, and upper class men began wearing boots with a heel in the late 1500’s. But a curious thing happened - men who did not ride began wearing boots with heels in order to emulate men who were rich enough (and thus high status enough) to own a horse. High heels stopped being a practical measure to stay in the saddle and became a social signal. To indicate wealth, fashionability and social status, men and women walked around town in stylized riding shoes.
-
NullPointerException when running distcp to Amazon s3 filesystem
I recently ran into the following error when trying to copy data from a local Hadoop cluster into Amazon S3:
    #!bash
    $ hadoop distcp -i / s3n://USERID:SECRETKEY@BUCKETNAME/
    11/04/13 07:15:31 INFO tools.DistCp: srcPaths=[/]
    11/04/13 07:15:31 INFO tools.DistCp: destPath=s3n://USERID:SECRETKEY@BUCKETNAME/
    With failures, global counters are inaccurate; consider running with -i
    Copy failed: java.lang.NullPointerException
        at org.apache.hadoop.tools.DistCp.makeRelative(DistCp.java:901)
        at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1059)
        at org.apache.hadoop.tools.DistCp.copy(DistCp.java:650)
        at org.apache.hadoop.tools.DistCp.run(DistCp.java:857)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.tools.DistCp.main(DistCp.java:884)
-
Investment, employment, and the role of women
I recently saw a post on the Freakonomics blog in which Justin Wolfers harshly criticized John Taylor’s blog post which reported a strong correlation between investment and unemployment:
-
Hadoop's MapWritable sometimes a performance hog
I’ve been using Hadoop a lot lately for a stealth mode project I’m working on. One of the big lessons I’m learning is that where medium to big data is concerned, data formats matter a lot. Where small filesizes are concerned, there is little harm in slinging around JSON objects and text representations. But once you reach the point of several GB, it’s in your best interest to think carefully about efficient representations.
-
Structural Shift in the Economy
-
Hadoop error - HTTP Response Code 503
I recently had a power failure, which resulted in my Hadoop cluster shutting down. No matter, Hadoop came back after a little while.
However, I ran into problems immediately after restarting it:
    #!bash
    $ hadoop fsck /
    Exception in thread "main" java.io.IOException: \
        Server returned HTTP response code: 503 for URL: http://0.0.0.0:50070/fsck?path=%2F
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1441)
        at org.apache.hadoop.hdfs.tools.DFSck.run(DFSck.java:123)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.hdfs.tools.DFSck.main(DFSck.java:159)
-
Inanity of overeating - don't ignore the bacon in the room
-
Older Posts
Older posts are available at my old blog.