I recently gave a talk at the Fifth Elephant 2019. The talk was a discussion about how to use math to handle unfixably bad data. The slides are available here. Go check it out.

With Andrew Yang's presidential candidacy moving forward, people are discussing basic income again. One common meme about a Basic Income is that by removing the implicit high marginal tax rates that arise from the withdrawal of welfare benefits, disincentives for labor would be reduced and therefore a Basic Income would not reduce labor supply. In this blog post I provide both empirical and theoretical evidence why this conclusion is false.

In particular, I review the experimental literature, which suggests a Basic Income will result in approximately a 10% drop in labor supply. I also review standard economic theory regarding diminishing marginal utility, which provides a clear theoretical reason why a Basic Income would reduce labor supply.

Finally, I extrapolate the data from the 1970's Basic Income experiments to the contemporary era. In particular, I consider a counterfactual history, taking the employment effects from past experiments and applying them to contemporary employment rates.

Let me begin with the empirical evidence. There have been 5 experiments on either Basic Income or Negative Income Tax in North America. I'll discuss each of them in turn, focusing on the labor force effects.

The Seattle/Denver Income Maintenance Experiment (SIME/DIME) measured the effect of a Negative Income Tax as compared to the welfare programs (AFDC, Food Stamps) available at the time. The experiment ran from 1970-1975 in Seattle and 1972-1977 in Denver. The benefit of the plan was an unconditional cash transfer, assigned randomly at various levels, with the maximum transfer being equal to 115% of the poverty line.

The sample was focused on lower income Americans, and aimed to collect representative data on Whites, Blacks and Chicanos (in Denver only) - this resulted in the minority ethnic groups being significantly overrepresented. There were also two variations - a 3 year treatment group and a 5 year treatment group.

The net result on labor supply is the following:

- Husbands reduced their labor supply by about 7% in the 3-year treatment group and by 12-13% in the 5-year treatment group.
- Wives reduced their labor supply by about 15% in the 3-year treatment group and by 21-27% in the 5-year treatment group.
- Single mothers reduced their labor supply between 15 and 30%.
- The labor supply reduction typically took 1 year to kick in, suggesting a shorter experiment might have missed it.
- The fact that the reduction is larger in the 5 year group than the 3 year group suggests anticipation effects - people know that they will have guaranteed income for several years so they plan for long term work reductions.

The experiment tracked both treatment groups for 5 years, and in the final 2 years the 3 year treatment group recovered most of their labor market losses.

An additional interesting effect observed is a higher rate of divorce in the treatment groups.

The Mincome experiment is the one most commonly cited by fawning journalists. For example, here's how Vice describes it:

> The feared labor market fallout—that people would stop working—didn't materialize... "If you work another hour, you get to keep 50 percent of the benefit you would have gotten anyway, so you are better off working than not."

This framing is typical in mainstream journalism:

> 'Politicians feared that people would stop working, and that they would have lots of children to increase their income,' professor Forget says. Yet the opposite happened: the average marital age went up while the birth rate went down. The Mincome cohort had better school completion records. The total amount of work hours decreased by only 13%.

This framing is weird, because a 13% decrease in labor supply is actually pretty big. A differences-in-differences analysis of Mincome suggests a treatment effect of about 11.3%. There are also subgroup analyses suggesting that this effect might be driven more by women than men, but these are prone to small sample effects as well as the Garden of Forking Paths.
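For readers unfamiliar with the method: differences-in-differences compares the change over time in the treatment group against the change in the control group. A minimal sketch of the computation - the 11.3% figure above is the study's; the hours figures below are made up purely to illustrate the mechanics:

```python
# Differences-in-differences: the treatment effect is the change in the
# treatment group minus the change in the control group over the same period.
def diff_in_diff(treat_before, treat_after, control_before, control_after):
    return (treat_after - treat_before) - (control_after - control_before)

# Hypothetical weekly work hours, chosen only to illustrate the arithmetic.
effect = diff_in_diff(treat_before=40.0, treat_after=34.0,
                      control_before=40.0, control_after=38.5)
print(effect)  # -4.5 hours/week relative to the control group's trend
```

The point of the subtraction is to net out whatever was happening to everyone (seasonality, local economy) and isolate the effect of the treatment itself.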

One interesting result from Mincome is that about 30% of the effect of Mincome is likely to be socially driven rather than pure economics. The claim made is that if a Basic Income is given to everyone, reducing labor market participation may become socially normalized and this will drive further reductions in labor supply. This effect would likely not be measurable in a randomized control trial.

Unlike the other experiments, the Rural Income Maintenance Experiment was designed to measure the effect of Basic Income on rural people, in particular self-employed farmers. Rural communities in Iowa and North Carolina were chosen and a basic income equivalent to 50%-100% of the poverty level was given to people in the treatment group.

This experiment was on the small side; only 809 families entered and 729 remained in the program for all three years.

After controlling statistically for differences in observable characteristics of participants, the Rural Income Experiment showed an overall labor supply reduction of 13%. In this experiment the labor market effect was somewhat smaller for husbands (-8% to +3%) while being quite large for wives (-22% to -31%) and dependents (-16% to -66%).

The widely disparate results reported across subgroups also suggest that the subgroup analysis is noisy and suffering from insufficiently large samples - not surprising given 809 sample families split across 3 subgroups (NC Blacks, NC Whites and Iowa) plus 3 subgroup analyses per group (Husbands, Wives and Dependents).

The Gary Experiment was focused on mitigating urban Black poverty. It was run from 1971-1974 and had a sample of 1800 families (43% of which were the control group). 60% of participating families were female headed. Dual earning families were generally excluded from the experiment because their income was too high.

The size of the BI was pretty similar to those of the other experiments - 70% and 100% of the poverty level.

In this experiment, the work reduction was 7% for husbands, 17% for wives, and 5% for female heads of household. The reason the drop in female heads of household is low may simply be due to the fact that prior to the experiment, female heads of household only worked an average of 6 hours/week.

An additional effect of the BI was an increase in wealth inequality - higher earning married couples tended to save money and pay down debt while much poorer single mothers merely increased consumption.

The New Jersey Experiment ran from 1967-1974. In this experiment the income levels ranged from 50% to 125% of the poverty line, and the experiment included 1350 randomly selected low income families in NJ and PA. Each family in the experiment received 3 years of Basic Income. As with the Rural Income Experiment, many subgroup analyses were performed (on a relatively low number of families per subgroup) and inconsistent results were obtained across subgroups.

The overall results were a reduction in labor supply (hours worked) by 13.9% for white families, 6.1% for black families and 1.5% for Spanish speaking families. The labor force participation rate reduction was 9.8%, 5% and +/- 6.7% for White, Black and Spanish speaking families respectively. (Due to the poor quality of the scan, I can't make out the digits after the decimal for black families or whether the effect is positive or negative in table 3.)

I do not endorse the level of excessive subgroup analysis they performed. In such a small sample they should have just done an overall analysis. But the experiment was designed in 1967 so I'll be forgiving of the authors - my viewpoint of their methodology is, of course, heavily informed by living through the modern replication crisis.

The studies I've surveyed were all social experiments performed in the 1970's. As such, the treatment effects compare a Basic Income pegged to a roughly 1970s-era poverty line against the welfare programs of that era. These experiments were also performed in an era with significantly lower female workforce participation and higher marriage rates.

The experiments were also all pre-Replication Crisis, and as a result they feature excessive subgroup analysis and experimenter degrees of freedom; for this reason I don't fully believe most of the fine grained effects these studies purport to measure.

However, there is one very clear and significant top line effect that is consistent across every experiment: a roughly 10% reduction in labor supply.

The common justification for why a Basic Income would not reduce labor supply is the following. Because a BI is given regardless of work, a person receiving a BI gains the same amount of money from working as they would gain if they did not work. This is often contrasted to means-tested welfare, which often has high implicit marginal tax rates due to the withdrawal of welfare benefits.

However, this verbal analysis ignores something very important: diminishing marginal utility.

In economics, people are modeled as making decisions based on *utility* - roughly speaking, the happiness you get from something - not on *cash*. And an important stylized fact, accepted by pretty much everyone, is that utility as a function of income is strictly concave down. In mathematical terms, that means that for any incomes $x \neq y$ and any $t \in (0,1)$:

$$U(tx + (1-t)y) > tU(x) + (1-t)U(y)$$

Since more income is always better, we can also assume that $U$ is a strictly increasing function of income.

In pictures, this means that a person's utility function looks like this:

Now the choice to work is made by balancing the utility gained from income against the disutility from working:

Since the net utility is positive, this person will choose to work.

However, because the utility function is concave, if we start from a point further out (namely $b$, the Basic Income), the utility gain from labor decreases. This can be illustrated in the following graph:

In the Basic Income regime, a person's utility gain from working is only $U(w + b) - U(b)$, which is lower than $U(w) - U(0)$.

In some cases, this decrease will result in the net utility gain from work being negative:

These are the people who are deterred from working.

Now the graphs I've given above are just an example. The clever reader might ask whether this holds for every utility function. I will prove in the appendix that a Basic Income always reduces the marginal utility from work whenever diminishing marginal utility holds.

How big is this effect? Journalists favorable to a Basic Income tend to talk about "only" a 10% drop in labor supply. Let me make an invidious comparison.

In 2008, the United States (and the world) suffered the Great Recession. To make a comparison, I've plotted the male employment to population ratio (approximated by taking the number of employed civilian men and dividing it by half of the US population) at the time of the Great Recession.

What would have happened if the Great Recession didn't occur, but we instead instituted a Basic Income in 2008?

To speculate about this, I assumed a baseline employment to population rate of 52% for men (the peak employment rate just before the recession). I then plotted for comparison the results of several Basic Income experiments focusing on the effects on men (though in a couple of cases that was not well disambiguated).

In the case of Seattle/Denver, I plotted the effect observed in each year. In the other cases, where yearly effects were not reported, I merely assumed a drop equal to the average reported drop.

The result can be seen above. The typical effects of a Basic Income are in the same ballpark as those of the Great Recession.
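The arithmetic behind those plotted levels is simple; a sketch applying some of the reported drops for men (the percentages are the ones quoted earlier in this post, and the 12-13% Seattle/Denver range is taken at its midpoint) to the 52% baseline:

```python
baseline = 0.52  # male employment-to-population peak just before the recession

# Labor supply reductions reported for men in the experiments surveyed above.
drops = {
    'Seattle/Denver (3-year, husbands)': 0.07,
    'Seattle/Denver (5-year, husbands)': 0.125,
    'Gary (husbands)': 0.07,
    'New Jersey (white families, hours worked)': 0.139,
}

for name, d in drops.items():
    print(f"{name}: {baseline * (1 - d):.1%}")
```

Even the smallest of these drops takes male employment from 52% down to roughly 48.4%, a decline comparable to what the Great Recession produced.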

The conclusion we can draw from this is that all the available evidence suggests that a Basic Income will have a very large and negative effect on the economy.

We can also anticipate that the effect will be worse if people believe that a Basic Income is likely to be permanent. As can be seen by comparing the 3 and 5 year groups in the Seattle/Denver experiment, people assigned to a longer term BI reduced their work effort significantly more than those assigned to the short term BI.

A few commenters on reddit suggest that unemployment due to "low aggregate demand" is somehow different from reduced labor force participation due to Basic Income. However, this idea is based on either MMT or some weird newspaper columnist pop-Keynesianism; it is not in any way based on the economic mainstream.

The mainstream economist version of Keynesian theory says that in a recession, people do not work because they have sticky nominal wages but a shock has resulted in their real output dropping. Concretely, a worker has a nominal wage demand of $W$ dollars. He used to produce $q$ widgets at a price of $p$ each (with $qp = W$), but for whatever reason he can now only produce $q' < q$ widgets. His real output is now $q'$ widgets, which has nominal value $q'p < W$.

In order to productively employ him, the nominal wage must be reduced by a factor of $q'/q$. However, the worker refuses to work unless he is paid $W$ dollars.

The Keynesian prescription of stimulating aggregate demand solves this problem by inflation; if the price of a single widget can be increased by a factor of $q/q'$, then the worker can again be paid a wage of $W$ dollars.
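A toy numeric version of this story - the numbers are mine and purely illustrative: a worker demands 120 dollars, and his widget output falls from 12 to 9:

```python
W = 120.0   # worker's nominal wage demand, dollars
p = 10.0    # price per widget
q_old = 12  # widgets produced per period before the shock
q_new = 9   # widgets produced per period after the shock

assert q_old * p == W  # before: output value exactly covers the wage
assert q_new * p < W   # after: at wage W he is unemployable

# Deflation route: cut the nominal wage by the factor q_new/q_old (refused).
wage_needed = W * q_new / q_old   # 90.0 dollars

# Inflation route: raise the widget price by the factor q_old/q_new.
p_inflated = p * q_old / q_new
print(q_new * p_inflated)  # output value is back to 120, so W can be paid again
```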

In essence, stimulating aggregate demand is about tricking prideful workers into reducing their real wage demands so that they stop being lazy and go back to work.

In spite of the difference in mood affiliation, Keynesian economics claims that recessions reduce labor supply for the exact same reason a Basic Income does: workers refusing to work.

The choice to work can be framed as a question of utility maximization. Assuming one receives an income of $w$ from work and an income of $b$ from Basic Income, the utility of working is:

$$U(w + b) - c$$

While the utility of not working is

$$U(b)$$

Here $c$ is the utility penalty that describes the unpleasantness of work. Let us define $\Delta_{BI}$ as the marginal utility gained by making the choice to work in a Basic Income regime:

$$\Delta_{BI} = U(w + b) - c - U(b)$$

In contrast, the marginal utility gained or lost from work in a non-Basic Income regime is:

$$\Delta_0 = U(w) - c - U(0)$$

The concavity relation (setting $x = 0$ and $y = w + b$, applied with $t = \frac{b}{w+b}$ and then with $t = \frac{w}{w+b}$, and adding the two inequalities) tells us that for any $w, b > 0$, we have:

$$U(w) + U(b) > U(0) + U(w + b)$$

Now if we compute the difference $\Delta_0 - \Delta_{BI}$, we discover:

$$\Delta_0 - \Delta_{BI} = U(w) - U(0) - U(w + b) + U(b)$$

If we substitute this into the concavity relation above, we discover:

$$\Delta_0 - \Delta_{BI} > 0$$

Therefore:

$$\Delta_{BI} < \Delta_0$$

This completes the proof that if Diminishing Marginal Utility is true, a Basic Income reduces the incentive to work.
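The result is easy to check numerically. Below, $U(x) = \sqrt{x}$ stands in for any strictly concave, strictly increasing utility function (the choice of square root is mine; the disutility of work cancels out of the comparison, so it is omitted):

```python
import math
import random

U = math.sqrt  # strictly concave, strictly increasing on [0, infinity)

random.seed(0)
for _ in range(1000):
    w = random.uniform(1, 100)   # income from work
    b = random.uniform(1, 100)   # Basic Income
    gain_without_bi = U(w) - U(0)    # utility gain from working, no BI
    gain_with_bi = U(w + b) - U(b)   # utility gain from working, with BI
    assert gain_with_bi < gain_without_bi
print("Delta_BI < Delta_0 held in 1000 random cases")
```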

I recently saw a silly twitter exchange between two of the lyingest politicians in American politics. Given that they have both explicitly expressed the viewpoint that morals matter more than numbers and being "technically correct", I figured that I should just check for myself. On twitter, Trump says white nationalism is a small group of people with serious problems while Alexandria Ocasio-Cortez claims "White supremacists committed the largest # of extremist killings in 2017". This question is easily answerable...*with Python*.

So actually no, this blog isn't about politics. But I recently discovered pandas.read_html, and two idiot politicians tweeting at each other is as good a reason as any to write a blog post about it. The real audience for this post is python developers who want to see a couple of cool pydata tricks I've learned recently.

This is one of the coolest tricks I've learned in 2019. The pandas library has a method, read_html, which takes a webpage as input and returns a list of dataframes containing the tables on that webpage.

So to answer the question about terrorism in 2017, I'm going to browse Wikipedia's List of Terrorist Incidents in 2017.

Sadly, there's a lot of terror attacks, so they have separate pages for each month. Each page looks like this: https://en.wikipedia.org/wiki/List_of_terrorist_incidents_in_January_2017.

Therefore, to extract the data I'll do this:

```python
from datetime import datetime

import pandas

def load_month(m):
    url = 'https://en.wikipedia.org/wiki/List_of_terrorist_incidents_in_' + m + '_2017'
    results = pandas.read_html(url)
    df = results[0]
    df.columns = df[0:1].values[0]  # the first row holds the column names
    return df[1:].copy()

data = []
# The year below is arbitrary; we only use it to generate the month names.
for month in [datetime(2008, i, 1).strftime('%B') for i in range(1, 13)]:
    data.append(load_month(month))
data = pandas.concat(data)
```

The function read_html is doing all the heavy lifting here.

The result of this is a dataframe listing a location, a perpetrator, a number of deaths/injuries, and a few more columns. It's not super clean, but at least it's pretty structured.

This read_html function is awesome because I needed to do literally no work parsing.

In this data, there were 230 separate perpetrators listed *after* cleaning up some of the obvious data issues (e.g. some rows containing Al Shabaab and others containing Al-Shabaab). That's far too much for me to manually classify everything.
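The cleanup itself was simple string normalization. A minimal sketch of the kind of rule I mean - the exact rules here are illustrative, not the ones I actually used:

```python
import re

def normalize_perp(name):
    # Lowercase, unify hyphen/space variants ("Al Shabaab" vs "Al-Shabaab"),
    # and collapse runs of whitespace into a single space.
    name = name.strip().lower()
    name = re.sub(r'[-\s]+', ' ', name)
    return name

assert normalize_perp('Al Shabaab') == normalize_perp('Al-Shabaab')
print(normalize_perp('Al-Shabaab'))  # 'al shabaab'
```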

So instead I used the wikipedia module:

```python
import wikipedia

def get_summary(x):
    if x == 'Unknown':
        return None
    try:
        return wikipedia.page(x).summary
    except Exception:
        return None

count['perp_summary'] = count['perpetrator_cleaned'].apply(get_summary)
```

This gets me a summary of each terrorist group, assuming wikipedia can easily find it. For example, here's the result of get_summary('Al-shabaab'):

'Harakat al-Shabaab al-Mujahideen, more commonly known as al-Shabaab (lit. '"The Youth" or "The Youngsters", but can be translated as "The Guys"'), is a jihadist fundamentalist group based in East Africa. In 2012, it pledged allegiance to the militant Islamist organization Al-Qaeda.[...a bunch more...]

With a little bit of string matching (e.g. if the summary contains "Communist" or "Marxist", classify as "Communist"), I was able to classify assorted terrorist attacks into a few broad causes:

```
cause                        dead
Islam                      8170.0
Central African Republic    432.0
Communism                   310.0
Myanmar                     105.0
Congo                        85.0
Anarchy                       3.0
Far-right                     3.0
Far-left                      1.0
```

Some of these are broad catch-all terms that simply reflect my ignorance. For example, one group is called Anti-Balaka, and wikipedia explains them as a predominantly Christian rebel group "anti-balakas are therefore the bearers of grigris meant to stop Kalashnikov bullets". I lumped a bunch of similar groups into "Central African Republic", similarly for "Congo" and "Myanmar".
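The string matching itself was nothing sophisticated; a sketch of the kind of rule involved (the keyword lists here are illustrative, not my exact ones):

```python
# Hypothetical keyword rules in the spirit described above.
RULES = [
    ('Islam', ['jihadist', 'islamist', 'islamic']),
    ('Communism', ['communist', 'marxist', 'maoist']),
    ('Far-right', ['neo-nazi', 'white supremacist', 'far-right']),
]

def classify(summary):
    if not summary:
        return 'Unknown'
    s = summary.lower()
    for cause, keywords in RULES:
        if any(k in s for k in keywords):
            return cause
    return 'Unknown'

print(classify('... is a jihadist fundamentalist group based in East Africa.'))
# 'Islam'
```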

This classification scheme let me classify 92% of the 9933 deaths due to terrorism. Note that Islam alone accounted for at least 82%, and eyeballing the groups I didn't match it's probably higher.
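The coverage figures are simple arithmetic on the table above:

```python
total_dead = 9933
by_cause = {
    'Islam': 8170.0, 'Central African Republic': 432.0, 'Communism': 310.0,
    'Myanmar': 105.0, 'Congo': 85.0, 'Anarchy': 3.0, 'Far-right': 3.0,
    'Far-left': 1.0,
}
classified = sum(by_cause.values())
print(f"classified: {classified / total_dead:.0%}")           # 92%
print(f"Islam alone: {by_cause['Islam'] / total_dead:.0%}")   # 82%
```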

There were also a number of attacks that I found very hard to classify, e.g. Patani independence or Fulani pastoralism. Key summary of the Fulani Pastoralism conflict: the Fulani people of Nigeria are mostly nomadic cow herders and they are getting into violent land disputes with non-Fulani farmers who don't want Fulani cows eating/trampling their crops. The world is a big place and it's full of all sorts of bad shit most folks have never heard of.

It looks like Donald Trump is right and AOC is wrong. Even if we take high end estimates of the number of people killed by white supremacists in 2017 (34 in the US according to the SPLC), it seems like a small problem compared to things like Anti-Balaka, Communism or Balochistan independence.

There are many individual terrorist groups that I imagine most readers have never heard of, such as Indian Naxalites (communists), which kill far more people than white supremacists.

Also, far more importantly for most of my readers, you can easily extract data from Wikipedia into a dataframe using pandas.read_html and the wikipedia module.

You can find my python notebook here.

**Correction:** A previous version of this post described an "Independent Nasserite Movement (a Socialist pan-Arab nationalist movement)", which was a reference to Al Mourabitoun. However that might have been me getting confused by wikipedia results - I think the actual attack in 2017 was done by a different Al Mourabitoun which is just ordinary boring Islamist violence. So we probably need to add another 77 or so to the Islam row.

**Also**, at least one commenter noted that the SPLC counts 34 dead due to white nationalists, which is higher than I get from Wikipedia. I don't particularly trust the SPLC, but I do reference it above. It still doesn't really change the results. Fulani Pastoralism killed more people.

Anyone following Nassim Taleb's dissembling on IQ lately has likely stumbled across an argument, originally created by Cosma Shalizi, which purports to show that the psychometric *g* is a statistical myth. But I realized that based on this argument, not only is psychometrics a deeply flawed science, but so is thermodynamics!

Let us examine pressure.

In particular, we will study a particular experiment in mechanical engineering. Consider a steel vessel impervious to air. This steel vessel has one or more pistons attached, each piston of a different area. For those having difficulty visualizing, a piston works like this:

The pipe at the top of the diagram is connected to the steel vessel full of gas, and the blue is a visualization of the gas expanding into the piston. The force can be determined by measuring the compression of the spring (red in the diagram) - more compression means more force.

If we measure the force on the different pistons, we might make a curious observation - the force on each piston is equal to a constant P times the area of the piston. If we make a graph of these measurements, it looks something like this:

We can repeat this experiment for different steel vessels, containing different quantities of gas, or perhaps using the same one and increasing the temperature. If we do so (and I did so in freshman physics class), we will discover that for each vessel we can make a similar graph. However, the graph of each vessel will have a different slope.

We can call the slope of these lines P, the pressure, which has units of force divided by area (newtons/meter^2).
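The "statistical technique" here is nothing more than fitting a line through the origin. A sketch with made-up measurements - the pressure, piston areas, and noise level are all arbitrary choices of mine:

```python
import random

random.seed(1)
P_true = 101325.0  # pascals; ordinary atmospheric pressure

# Simulated measurements: force on pistons of different areas, with 1% noise.
areas = [0.001 * i for i in range(1, 11)]                      # m^2
forces = [P_true * a * random.gauss(1, 0.01) for a in areas]   # newtons

# Least-squares slope of a line through the origin: P = sum(F*A) / sum(A^2)
P_est = sum(f * a for f, a in zip(forces, areas)) / sum(a * a for a in areas)
print(P_est)  # close to 101325
```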

To summarize, the case for P rests on a statistical technique, making a plot of force vs area and finding the slope of the line, which works solely on correlations between measurements. This technique can't tell us where the correlations came from; it always says that there is a general factor whenever there are only positive correlations. The appearance of P is a trivial reflection of that correlation structure. A clear example, known since 1871, shows that making a plot of force vs area and finding the slope of the line can give the appearance of a general factor when there are actually a vast number of completely independent and equally strong causes at work.

These purely methodological points don't, themselves, give reason to doubt the reality and importance of pressure, but do show that a certain line of argument is invalid and some supposed evidence is irrelevant. Since that's about the only case which anyone does advance for P, however, it is very hard for me to find any reason to believe in the importance of P, and many to reject it. These are all pretty elementary points, and the persistence of the debates, and in particular the fossilized invocation of ancient statistical methods, is really pretty damn depressing.

If I take an arbitrary set of particles obeying Newtonian mechanics, and choose a sufficiently large number of them, then the apparent factor "Pressure" will typically explain the behavior of pistons. To support that statement, I want to show you some evidence from what happens with random, artificial patterns of particles, where we know where the data came from (my copy of Landau-Lifshitz). So that you don't have to just take my word for this, I describe my procedure and link to a textbook on statistical mechanics where you can explore these arguments in detail.

Suppose that the gas inside the vessel is not a gas in the continuous sense having some intrinsic quantity pressure, but is actually a collection of a huge number of non-interacting particles obeying Newton's laws of motion. It can be shown that the vast majority of the time, provided the vessel has been at rest for a while, the distribution of particle velocities is approximately the same in any particular cube of volume. Furthermore, the density of particle positions will be uniformly distributed throughout the vessel.

For simplicity, let us suppose the volume is a cube of side length $L$, and one side of the volume is the piston. Consider now a single particle in the cube, moving with an x-velocity $v_x$. This particle will cross the cube once every $2L/v_x$ units of time, and each time it hits the piston it will transfer a momentum of $2 m v_x$. Thus, on average the force it exerts will be $\frac{2 m v_x}{2L/v_x} = \frac{m v_x^2}{L}$.

The total force on the piston will be the sum of this quantity over all the particles in the vessel, namely $F = \frac{N m \langle v_x^2 \rangle}{L}$. Here $\langle v_x^2 \rangle$ denotes the average squared velocity (in the x-direction) of a particle. If we divide this by the area $L^2$ of the piston, we obtain $P = \frac{N m \langle v_x^2 \rangle}{L^3} = \rho m \langle v_x^2 \rangle$. Here $\rho = N/L^3$ is the density of particles per unit volume. I.e., we have derived that $P = \rho m \langle v_x^2 \rangle$!
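This derivation can be sanity-checked numerically: sample x-velocities, count piston collisions directly over a long interval, and compare the time-averaged impulse to the formula $N m \langle v_x^2 \rangle / L$. A minimal sketch (the box size, mass, particle count, and velocity distribution are all arbitrary choices of mine):

```python
import random

random.seed(2)
L = 1.0      # box side length, meters
m = 1.0      # particle mass, kg
N = 10000    # number of particles
T = 1000.0   # observation time, seconds

vxs = [abs(random.gauss(0, 1.0)) for _ in range(N)]  # x-speeds, m/s

# Count collisions directly: each particle hits the piston every 2L/vx
# seconds, transferring momentum 2*m*vx per hit.
impulse = sum(int(vx * T / (2 * L)) * 2 * m * vx for vx in vxs)
F_counted = impulse / T

# The formula derived above: F = N * m * <vx^2> / L
F_formula = m * sum(vx * vx for vx in vxs) / L

print(F_counted / F_formula)  # close to 1
```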

Thus, we have determined that under these simple assumptions, pressure is nothing fundamental at all! Rather, pressure is merely a property derived from the number density and velocity of the individual atoms comprising the gas.

But - and I can hear people preparing this answer already - doesn't the fact that there are these correlations in forces on pistons mean that there must be a single common factor somewhere? To which question a definite and unambiguous answer can be given: No. It looks like there's one factor, but in reality all the real causes are about equal in importance and completely independent of one another.

(As a tangential note, several folks I've spoken to about this article vaguely recollect that temperature is when the particles in a gas move faster. This is true; increasing temperature makes $\langle v_x^2 \rangle$ increase. If we note that $T = C m \langle v_x^2 \rangle$ - ignoring what the constant $C$ is - and multiply our equation of pressure above by the volume of the vessel $V$, we obtain $PV = N m \langle v_x^2 \rangle$. Since $m \langle v_x^2 \rangle = T/C$, this becomes $PV = NkT$, with $k = 1/C$. We have re-derived the fundamental gas law from high school chemistry from the kinetic theory of gases.)

The end result of the self-confirming circle of test construction is a peculiar beast. If we want to understand the mechanisms of how gases in a vessel work, how we can use it to power a locomotive, I cannot see how this helps at all.

Of course, if P was the only way of accounting for the phenomena observed in physical tests, then, despite all these problems, it would have some claim on us. But of course it isn't. My playing around with Boltzmann's kinetic theory of gases has taken, all told, about a day, and gotten me at least into back-of-the-envelope, Fermi-problem range.

All of this, of course, is completely compatible with P having some ability, when plugged into a linear regression, to predict things like the force on a piston or whether a boiler is likely to explode. I could even extend my model, allowing the particles in the gas to interact with one another, or allowing them to have shape (such as the cylindrical shape of a nitrogen molecule) and angular momentum which can also contain energy. By that point, however, I'd be doing something so obviously dumb that I'd be accused of unfair parody and arguing against caricatures and straw-men.

I'll now stop paraphrasing Shalizi's article, and get to the point.

In physics, we call quantities like pressure and temperature mean field models, thermodynamic limits, and similar things. A large amount of the work in theoretical physics consists of deriving simple macroscale equations such as thermodynamics from microscopic fundamentals such as Newton's law of motion.

The argument made by Shalizi (and repeated by Taleb) is fundamentally the following. If a macroscopic quantity (like pressure) is actually generated by a statistical ensemble of microscopic quantities (like particle momenta), then it is a "statistical myth". Let's understand what "statistical myth" means.

The most important fact to note is that "statistical myth" does *not* mean that the quantity cannot be used for practical purposes. The vast majority of mechanical engineers, chemists, meteorologists and others can safely use the theory of pressure without ever worrying about the fact that air is actually made up of individual particles. (One major exception is mechanical engineers doing microfluidics, where the volumes are small enough that individual atoms become important.) If the theory of pressure says that your boiler may explode, your best bet is to move away from it.

Rather, "statistical myth" merely means that the macroscale quantity is not some intrinsic property of the gas but can instead be explained in terms of microscopic quantities. This is important to scientists and others doing fundamental research. Understanding how the macroscale is derived from the microscale is useful in predicting behaviors when the standard micro-to-macro assumptions fail (e.g., in our pressure example above, what happens when N is small).

As this applies to IQ, Shalizi and Taleb are mostly just saying, "the theory of *g* is wrong because the brain is made out of neurons, and neurons are made of atoms!" The latter claim is absolutely true. A neuron is made out of atoms and its behavior can potentially be understood purely by modeling the individual atoms it's made out of. Similarly, the brain is made out of neurons, and its behavior can potentially be predicted simply by modeling the neurons that comprise it.

It would surprise me greatly if any proponent of psychometrics disagrees.

One important prediction made by Shalizi's argument is that in fact, the psychometric *g* could very likely be an ensemble of a large number of independent factors; a high IQ person is a person who has lots of these factors and a low IQ person is one with few. Insofar as psychometric *g* has a genetic basis, it may very well be highly polygenic (i.e. the result of many independent genetic loci).

However, none of this eliminates the fact that the macroscale exists and the macroscale quantities are highly effective for making macroscale predictions. A high IQ population is more likely to graduate college and less likely to engage in crime. Shalizi's argument proves nothing at all about any of the highly contentious claims about IQ.

I recently gave a talk at CrunchConf 2018. The talk was about the various impossibility theorems that a person concerned with AI Ethics must contend with. The slides are available here. Go check it out.

I recently attended a discussion at Fifth Elephant on privacy. During the panel, one of the panelists asked the audience: "how many of you are concerned about your privacy online, and take steps to protect it?"

At this point, most of the hands in the room shot up.

After that, I decided to ask the naughty question: "how many of you pay at least 500rs/month for services that give you privacy?"

Very few hands shot up.

Let me emphasize that this was a self selected group, a set of people at a technology conference who were so interested in privacy that they chose to attend a panel discussion on it (instead of concurrent talks on object detection and explainable algorithms). Besides me and perhaps 2 or 3 others, no one was willing to pay for privacy.

Instead of paying for it, many of the people at the panel wanted the government to mandate it. Moreover, many people seemed to think it would somehow be free to provide.

If you are not paying for it, you're not the customer; you're the product being sold.

Every online service costs money to provide. To get an idea of the numbers, here are some leaked revenues at a company I worked for. Content isn't free. Engineers aren't free. Ad revenues aren't very high. If the site is storing lots of personal data (e.g. email, pictures/videos, etc), even the cost of computing infrastructure can become significant.

Since most people are unwilling to pay for online services, the way to cover these costs is by advertising to the users.

**Ad revenue per user varies by several orders of magnitude depending on how well targeted it is.**

Here's a calculation, which was originally done by Patrick McKenzie to answer the question

> I just bought a refrigerator yesterday. Why, why, why do you show me refrigerator ads?

- Assume a typical person buys a refrigerator once every 10 years.
- Assume 2% of refrigerator purchases go wrong (e.g. your wife hates it, it breaks), and you need to buy a new refrigerator within a week.

Subject to these assumptions, a person who's bought a refrigerator is 10x more likely to buy another refrigerator in the next week than someone who hasn't.
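A quick sanity check of that arithmetic, as a minimal sketch (the 10-year replacement cycle and 2% failure rate are the assumptions stated above):

```python
# Baseline buyer: one refrigerator per 10 years, i.e. per ~520 weeks.
baseline_weekly_prob = 1 / (10 * 52)

# Recent buyer: 2% chance the purchase goes wrong and needs
# replacing within a week.
recent_buyer_weekly_prob = 0.02

ratio = recent_buyer_weekly_prob / baseline_weekly_prob
print(round(ratio, 1))  # → 10.4
```

So, perhaps counterintuitively, showing refrigerator ads to recent refrigerator buyers is roughly a 10x better bet than showing them to a random person.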

The fundamental problem of advertising is sparsity - the fact that most advertisements are worthless to most people. An ad for "faster than FFTW" might be useful to me, but it's pointless for most people who've never heard of FFTW. If you haven't spied on me well enough to know that I do fast fourier transforms, your odds of making money by advertising to me are essentially zero.

Advertising generates negligible revenue without personalization.

Without advertising, people will need to pay for their online services. Email services tend to cost around $5-10/month. The NY Times costs about $10/month, and the Wall St. Journal costs 2-4x that. It's hard to guesstimate the cost of social networks, but my best guesstimate for Facebook is several dollars per user per month.

**Will you pay $20-50 a month to replace your free online services with privacy preserving ones?**

Another major fact is that service providers use data to improve their service. User tracking enables product managers/UI designers to figure out exactly what customers want, and give it to them. Google cannot index your email and make it searchable without also reading it. **Would you use a free email product with a much worse UI than Gmail?**

Consider your payment provider - PayPal, PayTM, Simpl (disclaimer: I work there), etc. One of the most invisible and pervasive concerns at a company like this is preventing fraud.

The economics of a payment provider are as follows:

- A customer books a 100rs movie ticket on BookMyShow.
- The customer pays 100rs to the payment provider.
- The payment provider transfers 97-99.5rs to BookMyShow and pays for their expenses with the remaining 0.5-3rs.

That's a pretty tight margin. For concreteness and simplicity of exposition, let's suppose the Merchant Discount Rate is 1%.

Now let's consider the impact of fraud. If fraud levels ever get as high as 1 transaction out of every 100, the payment provider will have zero revenue and will go broke. If fraud is not carefully controlled, it can reach levels far higher than this.
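To see why a 1-in-100 fraud rate is fatal, note that the provider earns only the MDR on each transaction but loses the full ticket price on a fraudulent one. A minimal sketch with the 1% MDR assumed above (all figures are illustrative):

```python
mdr = 0.01         # merchant discount rate: the provider keeps 1rs per 100rs
ticket = 100       # transaction size in rupees
fraud_rate = 0.01  # 1 fraudulent transaction per 100

revenue_per_txn = mdr * ticket            # 1rs earned per transaction
fraud_loss_per_txn = fraud_rate * ticket  # 1rs lost per transaction, in expectation
net_per_txn = revenue_per_txn - fraud_loss_per_txn
print(net_per_txn)  # → 0.0: at 1% fraud, gross revenue is wiped out
```

At any fraud rate above the MDR, the provider loses money on every transaction, before even paying for engineers or servers.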

In mid-2000, we had survived the dot-com crash and we were growing fast, but we faced one huge problem: we were losing upwards of $10 million to credit card fraud every month.

Peter Thiel notes that reducing fraud was the difference between loss and profitability.

In the long run, the cost of fraud must be passed on to the consumer. Either the payment provider or the merchant will eat the cost of fraud, and will in turn raise prices on consumers to compensate.

**Will you pay 120rs for a 100rs movie ticket in order to protect your privacy from your payment provider?** It's important to note that while the extra 20rs may seem to go to the payment network, in reality it will go to the smartest scammers.

There is plenty of fraud that occurs beyond payment networks. On Uber, there are fake drivers that take fake passengers on trips and demand to be paid even though the fake passengers have paid with stolen credit cards. Many fraud rings attempt to misuse incentive systems (e.g. "refer a friend, get 100rs off your next order") in order to generate credits with which they can purchase saleable goods. A merchant aggregator is also at risk from the submerchants; movie theatres/restaurants/etc will attempt to exploit BookMyShow/Seamless/etc, in general, submerchants will attempt to make fraudulent transactions on the aggregator and demand payment for them.

A special case of fraud which also relates to the problem of paying for services with advertising is display network fraud. Here's how it works. I run "My Cool Awesome Website About Celebrities", and engage in all the trappings of a legitimate website - creating content, hiring editors, etc. Then I pay some kids in Ukraine to build bots that browse the site and click the ads. Instant money, at the expense of the advertisers. To prevent this, the ad network demands the ability to spy on users in order to distinguish between bots and humans.

**Question**: What does the government call a payment platform that provides privacy to its users?

**Answer**: Money laundering.

Here in India, the bulk of the privacy intrusions I run into are coming from the government. It is government regulations which require me to submit passport photocopies/personal references/etc in order to get a SIM card. Tracking my wifi use by connecting my Starbucks WiFi to a phone number via OTP is another government regulation. Prohibitions against the use of encryption are generally pushed by national governments. Things were pretty similar in the US.

It is, of course, impossible for a service provider to satisfy the government's desire to spy on users without doing so itself.

The desire for the government to spy on users extends far beyond preventing money laundering. In the United States, Congress has demanded information and action from technology companies in order to prevent Russians from posting Pepe memes on Twitter or attempting to organize "Blacktivism" on Facebook. The Kingdom bans most encrypted communication, and many democratic nations (the US, India, UK, France) have politicians pushing in the same direction.

In the intermediary stages, there is a large amount of information that the government requires service providers to keep. This typically includes accounting details (required by tax departments), as well as purchase history and KYC information used by tax authorities to track down tax evaders (e.g., Amazon is required to keep, and provide to the IRS, tax-related information about vendors using Amazon as a platform).

In many cases, censorship authorities require social networks and others to track and notify them about people posting illegal content (Nazi imagery, child pornography, Savita Bhabhi, anti-Islamic content).

Fundamentally, it is government regulations that shut down cryptocurrency exchanges in India. It is government regulations that ban encrypted communication in the Kingdom (at least partially), and it is politicians in the US, UK, and India who want to move in the same direction.

Insofar as privacy preserving platforms might exist, it is far from clear whether governments will allow them to continue existing should they become popular.

. . .if you're against witch-hunts, and you promise to found your own little utopian community where witch-hunts will never happen, your new society will end up consisting of approximately three principled civil libertarians and seven zillion witches. It will be a terrible place to live even if witch-hunts are genuinely wrong.

Unfortunately, this Scott Alexander quote explains very nicely what will happen when someone builds a moderately successful privacy preserving network.

If we built a privacy preserving payment network, it would be used for money laundering, drug sales and ransomware. If the Brave private browser/micropayment system ever approaches viability, it will be overrun by criminals laundering money through blogs about Ukrainian food.

If an ad network vowed to protect privacy, fraud would shoot up and good advertisers would leave. The few remaining customers would be selling penis enlargement pills, accepting the click fraud as the cost of doing business because no one else will work with them.

There are privacy preserving/censorship resistant social networks. They're full of Nazis.

This is a fundamental collective action problem, and no player in the game seems to have the ability to change things. There are bad actors out there - fraudsters/scammers, terrorists laundering money, legal gun manufacturers moving money around, child pornographers, people who believe in evolution (even among humans), people advocating abandoning Islam, Russians posting Pepe memes, and journalists/revenge pornographers revealing truthful information that people want kept hidden. Any privacy preserving network, at its core, allows these people to engage in these actions without interference.

And as any network approaches viability, its early adopters will be these sorts of undesirables.

Make no mistake; I want this privacy preserving network to exist. I have no problem with teaching evolution and exploring its consequences, advocating atheism over Islam, laundering drug money, or teaching people how to manufacture firearms. But I'm very much in a minority on this.

And if, like me, you want this privacy preserving network, the first step in making that happen is recognizing and acknowledging the very real barriers to making it happen.

I recently gave a talk at the Fifth Elephant 2018. The talk was an introduction to linear regression and generalized linear models from the Bayesian perspective. The slides are available here. Go check it out.

]]>In principle A/B testing is really simple. To do it you need to define two separate user experiences, and then randomly allocate users between them:

```
def final_experience(user):
    if random.choice([0, 1]) == 0:
        return user_experience_A(user)
    else:
        return user_experience_B(user)
```

So far this seems pretty simple. But then you think about edge cases:

- Shouldn't the same user get the same experience if they do this twice?
- After the test is complete, how can I compare groups A and B?

It's not hard to track this data, but it certainly makes your code a bit uglier:

```
def final_experience(user):
    user_variation = db.run_query(
        "SELECT user_variation FROM users WHERE user_id = ?", user.id)
    if user_variation == 0:
        # If the user already saw a variation, show them the same one
        return user_experience_A(user)
    elif user_variation == 1:
        return user_experience_B(user)
    else:
        # No record in the DB
        user_variation = random.choice([0, 1])
        db.run_query(
            "INSERT INTO user_variation (user_id, variation) VALUES (?, ?)",
            user.id, user_variation)
        if user_variation == 0:
            return user_experience_A(user)
        else:
            return user_experience_B(user)
```

This is doable, but the code is a lot longer and more annoying. Are there race conditions? Should everything live in a single transaction, potentially skewing things?

Fortunately there's a better way: the hashing trick:

```
import hashlib

def deterministic_random_choice(user_id, test_name, num_variations):
    """Returns a 'random'-ish number in [0, num_variations), based on the
    user_id and the test name. The number will not change if the user_id
    and test name remain the same.

    Note: a cryptographic hash is used rather than Python's built-in
    hash(), which is salted per-process and so isn't stable across runs.
    """
    digest = hashlib.md5((str(user_id) + test_name).encode()).hexdigest()
    return int(digest, 16) % num_variations

def final_experience(user):
    if deterministic_random_choice(user.id, "experience_test", 2) == 0:
        return user_experience_A(user)
    else:
        return user_experience_B(user)
```

Using `deterministic_random_choice` instead of `random.choice` will ensure that the same user is always assigned to the same variation. This is done without any database access.

It also makes it very easy to run analytics and compare the two groups, even though we never stored group membership in any database table:

SELECT SUM(user.revenue), COUNT(user.id), deterministic_random_choice(user.id, "experience_test", 2) FROM users WHERE user.signup_date > test_start_date GROUP BY deterministic_random_choice(user.id, "experience_test", 2)

(This is not SQL that any real DB will actually run, but it's illustrative.)

Whatever you currently do for analytics, you can take the exact same queries and either GROUP BY the deterministic_random_choice or else run the query once for each variation and put deterministic_random_choice(user.id, "experience_test", 2) = 0,1 into the WHERE clause.
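If your analytics run in Python rather than SQL, the same bucketing can be reproduced there. A minimal sketch, assuming hypothetical in-memory user records and an md5-based hash (md5 is used because Python's built-in `hash()` is salted per-process and so isn't stable across runs):

```python
import hashlib
from collections import defaultdict

def deterministic_random_choice(user_id, test_name, num_variations):
    # Stable across processes and restarts, unlike the built-in hash().
    digest = hashlib.md5((str(user_id) + test_name).encode()).hexdigest()
    return int(digest, 16) % num_variations

# Hypothetical user records; in practice these would come from your DB.
users = [{"id": i, "revenue": 10 * i} for i in range(1000)]

totals = defaultdict(lambda: {"revenue": 0, "count": 0})
for user in users:
    variation = deterministic_random_choice(user["id"], "experience_test", 2)
    totals[variation]["revenue"] += user["revenue"]
    totals[variation]["count"] += 1
```

Each user lands in the same bucket on every run, so the per-variation revenue totals can be recomputed at any time without ever storing group membership.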

It's just a nice simple trick that makes it easy to start A/B testing today. No database migration in sight!

This post was first published on the Simpl company blog.

]]>Today I spoke about AI ethics at 50p 2018. Here are the slides from my talk.

The general topic was multiple ethical principles, and how it's mathematically impossible to satisfy all of them.

]]>Some time back, I was involved in a discussion with folks at an India-based software company. An important question was asked - why isn't this company as productive (defined as revenue/employee) as its western competitors, and what can be done to change this situation? In this discussion, I put forward an unexpected thesis: if this company were profit maximizing, then its productivity should *always* be far lower than any western company's. During the ensuing conversation, I came to realize that very few people were aware of the Cobb-Douglas model of production, on which I was basing my counterintuitive conclusions.

I've observed that the topic of Cobb-Douglas has come up quite a few times, and several folks have asked me to write up a description of it. Hence, this blog post. In my opinion, Cobb-Douglas is a very useful model to have in one's cognitive toolbox.

To lay out the basics of the problem, consider two competing companies - Bangalore Pvt. Ltd. and Cupertino Inc. For concreteness, let us say that these two companies are both software companies catering to the global market and they are direct competitors.

The question now arises; how should Bangalore and Cupertino allocate their capital?

For a software company, there are two primary uses toward which capital can be directed:

- Marketing. Both Bangalore and Cupertino can direct an extra $1 of spending towards adwords, facebook ads, attendance at conferences, and similar things. Both companies will receive the same amount of *exposure* on their marginal dollar.
- Employees. Bangalore and Cupertino can both spend money on employees, but in this case they receive *different* returns on investment. In Bangalore, a typical employee might cost 100,000 rupees/month, whereas in Cupertino an employee might cost $100,000/year. This is approximately a 5x cost difference if we round up 1 lac rupees/month to $20,000/year.

Let us now model what the effect of each resource is on revenue.

It's a simple arithmetic identity that revenue is equal to:

revenue = V · N

The value V is the probability of any individual prospect making a purchase multiplied by the value of that purchase, and N is the number of prospects who can be reached by marketing as a function of money spent on it.

We choose this decomposition because it helps us understand the impact of two separate resources:

- The value V is mainly increased by spending money on additional *labor*. Engineers can build features, which increase value for customers and allow the product to be sold for more money. Marketers may improve the brand value, increasing the probability of a sale.
- The value N is increased by spending money on additional *marketing*. It's a simple machine - money is spent on facebook ads, conferences, TV commercials, and more people become exposed to the product.

We also choose this decomposition since it helps us avoid the Cambridge Controversy, which can under other circumstances make the model less well founded.

To understand the relationship between resources and production, let us take the following exercise. Suppose we have a large set of projects, each with a certain cost and benefit. To begin with, let's discuss labor projects:

- Integrate the software with Salesforce, cost 100 hours, benefit $50/prospect.
- Come up with a more enterprisey-sounding brand, cost 40 hours, benefit $10/prospect.
- Slap some AI on top of the software, cost 2000 hours, benefit $60/prospect.
- etc...

Fundamentally, I'm making two important assumptions here:

- The projects have no interdependencies.
- The amount of labor required for each project is small compared to the overall amount of labor.

Let us assume the corporate strategy is to spend whatever amount of labor we have on this collection of projects in order of decreasing ROI. This means that if we sort the list of projects by ROI = benefit / cost, then the corporate strategy will be to take on the highest ROI projects first.

Here's a typical result. As noted above, the units on the y-axis are dollars per prospect.

Note that the xkcd plotting style is used to illustrate this is a schematic drawing, and should not be taken too literally.

The graph was made as follows:

```
import pandas
from numpy import cumsum
from scipy.stats import uniform
from matplotlib.pyplot import step

data = pandas.DataFrame({
    'cost': uniform(10, 100).rvs(50),
    'benefit': uniform(1, 100).rvs(50)
})
data['roi'] = data['benefit'] / data['cost']
data = data.sort_values(['roi'], ascending=False)
# Like `plot(...)`, except that it shows steps at each data point.
step(cumsum(data['cost']), cumsum(data['benefit']))
```

As can be seen, no particular correlation between cost and benefit was assumed in order to get diminishing returns. Diminishing returns follows solely from the sorting operation, i.e. the choice to take on the highest ROI projects first.

One can similarly construct a diminishing returns curve on marketing spend. Note also that on the marketing side, many marketing channels (for example adwords) have their own diminishing returns curves built in. However, there's one very important distinction between labor and marketing. For the labor graph the X-axis is *hours of labor*, while for marketing the X-axis is *amount of money spent*.

After observing the diminishing returns curve above, I thought it looked kind of like a·t^b for some b < 1. So I decided to do a least squares fit using the model y = a·t^b. This can be accomplished in a fairly straightforward way using the minimize function from scipy:

```
from numpy import cumsum
from scipy.optimize import minimize

def err(a):
    t = cumsum(data['cost'])
    y = cumsum(data['benefit'])
    return sum(pow(y - a[0] * pow(t, a[1]), 2))

x = minimize(err, [1.0, 1.0])
```

The result of this optimization yields best-fit values for a and b, as well as a reasonably accurate best fit curve:

This kind of a graph shape is not an accident. I repeated this experiment, but this time generating a different data set:

```
from scipy.stats import expon

data = pandas.DataFrame({
    'cost': expon(1).rvs(50),
    'benefit': expon(1).rvs(50)
})
```

With this different distribution of costs/benefits, the result was pretty similar, albeit with a different exponent:

I suspect that there is some more interesting law of probability which is causing this result to occur, but I'm not entirely sure what.

If we substitute these power-law fits back into the identity revenue = V · N, we arrive at the Cobb-Douglas model:

revenue = A · L^β · M^α

where L is the amount of labor and M is the money spent on marketing. In the Cobb-Douglas model, the term A represents Total Factor Productivity.

**Note:** Normally, the use of the Cobb-Douglas model is somewhat problematic due to the Cambridge Controversy which points out the difficulties in assigning a single value to capital. However in this case capital is literally dollars which can be spent on marketing, so we can avoid the issue.

Let us now suppose that both Bangalore Pvt. Ltd. and Cupertino Inc. have a fixed amount of capital available for spending in the current period. These firms can convert capital into labor at the rates:

- Bangalore Pvt. Ltd.: 1 unit of capital converts to 1 unit of labor.
- Cupertino Inc.: 5 units of capital convert to 1 unit of labor.

Now let m represent the fraction of the available capital C spent on marketing. Then we can rewrite our output (in Bangalore) as:

revenue = A · ((1 − m)C)^β · (mC)^α

Whereas in Cupertino our output is:

revenue = A · ((1 − m)C/5)^β · (mC)^α = 5^(−β) · A · ((1 − m)C)^β · (mC)^α

Note that these outputs differ *solely* due to the presence of the 5^(−β) sitting in front. The dependence on m is unchanged. We can maximize this with simple calculus:

d/dm [β ln((1 − m)C) + α ln(mC)] = α/m − β/(1 − m) = 0

Solving this for m yields m = α/(α + β).

In pictures, the following is what is happening:

As can be seen from the graph, the production function for both firms is the same, as is the capital allocation that maximizes production. All that differs is the *level* of production.

It's important to recognize what this means in business terms: the sole difference between Cupertino and Bangalore is that Bangalore has a higher total factor productivity. In terms of capital allocation, both firms should behave in the same way.

Secondly, this means that revenue at Bangalore Pvt. Ltd. will be higher than at Cupertino Inc. by a factor of 5^β.

The third conclusion is that *revenue per employee* will be significantly lower at Bangalore Pvt. Ltd. Bangalore Pvt. Ltd. is devoting the same amount of capital to labor as Cupertino Inc., but it has 5x lower cost per employee. As a result, it will have 5x as many employees as Cupertino Inc. Its revenue is higher by a factor of 5^β, but the number of employees is higher by a factor of 5. As a result, revenue per employee is *lower* by a factor of 5^(1−β) (recall that β < 1).

For example, assuming β = 0.5 (as it appeared to be in the synthetic examples I concocted above), this means Bangalore Pvt. Ltd. will have 5^0.5 ≈ 2.24x as much revenue as Cupertino Inc., but its revenue per employee will be only 1/5^0.5 ≈ 0.447 as large as that of Cupertino Inc.
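These ratios are easy to check numerically. A minimal sketch, assuming α = β = 0.5 and an arbitrary capital stock (all the specific numbers here are illustrative, not real company data):

```python
A = 1.0        # total factor productivity (same for both firms)
alpha = 0.5    # marketing exponent
beta = 0.5     # labor exponent
C = 1000.0     # capital available to each firm

m = alpha / (alpha + beta)  # optimal fraction of capital spent on marketing

def revenue(capital_per_labor_unit):
    # Cobb-Douglas output given the local price of labor.
    labor = (1 - m) * C / capital_per_labor_unit
    marketing = m * C
    return A * labor ** beta * marketing ** alpha

r_bangalore = revenue(1.0)  # 1 unit of capital buys 1 unit of labor
r_cupertino = revenue(5.0)  # 5 units of capital buy 1 unit of labor

revenue_ratio = r_bangalore / r_cupertino  # ≈ 5**0.5 ≈ 2.236
per_employee_ratio = revenue_ratio / 5.0   # ≈ 5**-0.5 ≈ 0.447
```

Note that changing C rescales both revenues identically, so the two ratios depend only on β and the 5x labor-cost difference.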

It's often a bit difficult to translate abstract economic results into practical business advice. In this case, what the economic result implies is the following.

Because the optimal fraction m = α/(α + β) is the same for both Bangalore Pvt. Ltd. and Cupertino Inc., both firms should spend approximately the same fraction of their capital on labor. This will result in Bangalore Pvt. Ltd. consuming more labor (i.e. having more employees and more labor hours), and moving further along the diminishing returns curve.

For example, if these competing firms are in the adtech business, then integrating with more ad networks might be a valuable way to increase their customer value. In this case, while Cupertino Inc. might integrate only with Adwords, Facebook and AppNexus, Bangalore Pvt. Ltd. might integrate with those networks as well as YouTube, Pornhub and other more niche sites. If these firms are in the business of selling an ecommerce widget, then Bangalore Pvt. Ltd. might provide a larger number of specialized themes than Cupertino Inc. In most software businesses there is value to be generated by repeating the same process for more data providers, more platforms, etc. Generally speaking, an Indian firm should make their product significantly broader than any corresponding western firm.

Similarly, on the marketing side, one might expect Bangalore Pvt. Ltd. to create a broader advertising surface. This might involve creating a larger number of landing pages, which would target smaller niches of customers. Similarly, one would expect more organic marketing as a fraction of total marketing.

At the micro level, the fundamental calculus is the following. For Cupertino Inc. to take on a project requiring 1 man-year of labor, the project must generate $100k in revenue to break even. In contrast, Bangalore Pvt. Ltd. can take on any project generating $20k in revenue or more. As a result, Bangalore Pvt. Ltd. should take on all the same projects as Cupertino Inc., in addition to projects generating between $20k-100k revenue.
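In code, the project-selection rule is just a revenue threshold set by the local cost of a man-year. A minimal sketch with hypothetical project data (the names and figures are made up for illustration):

```python
# Hypothetical one-man-year projects and the revenue each would generate.
projects = [
    ("integrate with salesforce", 150_000),
    ("enterprisey rebrand", 60_000),
    ("niche ad network integrations", 25_000),
    ("specialized theme pack", 15_000),
]

def viable_projects(projects, cost_per_man_year):
    # A one-man-year project breaks even iff its revenue covers the
    # fully loaded cost of one employee-year.
    return [name for name, revenue in projects if revenue >= cost_per_man_year]

cupertino = viable_projects(projects, 100_000)  # only the biggest project
bangalore = viable_projects(projects, 20_000)   # everything but the smallest
```

The projects in the gap between the two thresholds are exactly the moat described above.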

Projects in this revenue range form a natural core competency for the Indian firm; simple economics forms a moat that few western firms can cross.

So in terms of practical business advice, the takeaway (for Indian firms) is the following: hire more people, and have them work on more marginal projects. It will lower your revenue/employee, but it will increase profits and help you capture business that western competitors are economically incapable of capturing.

I recently analyzed a somewhat puzzling data set. I was sending HTTP POST requests to a system. The system would then acknowledge receipt of these requests (returning a 200 status code), and some time later (it was a slow asynchronous process) send a web hook to a specified URL *if the request was successful*. However, success was far from certain; most requests actually failed. My job was to measure the success rate.

Concretely, event `A` would trigger at some time `t0`. If `A` was successful, then event `B` might occur at time `t1`. `B` can only occur if `A` occurred.

Systems like this occur in a variety of contexts:

- Ad delivery. The ad must first be displayed (event `A`), and only after it's displayed can the viewer click a link (event `B`).
- Email. The email must first be opened (event `A`), and only after it's opened can the reader click a link (event `B`).
- Web forms. A user must first enter their credit card, and only after that can they click submit.

What I wanted to compute was `alpha = P(A)` and `beta = P(B | A)`.

When analyzing the data I had, I noticed a curious pattern.

```
request ID| time of A | time of B
----------+-----------+----------
abc | 12:00 | 1:00
def | 12:01 | null
ghi | null | null
jkl | null | 1:03 <--- WTF is this?
```

That last row (for request ID `jkl`) indicates something really weird happening. It suggests that event `B` has occurred even though event `A` has not!

According to my model, which I have a high degree of confidence in, this isn't possible. Yet it's in the data; the responding system could not post a web hook with ID `jkl` if they hadn't received the request; they couldn't possibly know this ID.

The conclusion I drew is that our measurements of `A` and `B` are unreliable. `A` and `B` may actually occur without being observed. So the real problem at hand is to infer the true rates at which `A` and `B` occur from the complete data set.

I'll begin with some simple calculations - using nothing but arithmetic - to give the flavor of this analysis. To make things concrete, suppose we have the following counts:

- 100k requests were made
- In 40k cases, event `A` was reported and `B` was not reported
- In 10k cases, event `A` was reported and then `B` was reported
- In 5k cases, event `B` was reported but `A` was never reported

The most naive possible approach is to simply treat the cases where `B` was reported without `A` as *bad data* and discard them. Then we can estimate:

```
alpha = 50k / 95k = 0.526
beta = 10k / 50k = 0.200
```

But we can do better than this. We can use logical inference to deduce that in every case where `B` was reported, `A` also occurred. So we actually know that `A` occurred 55k times, and `A` then `B` occurred 15k times. So we can then estimate:

```
alpha = 55k / 100k = 0.550
beta = 15k / 55k = 0.273
```

Finally, there's a third approach we can take. Let's define the parameters `gamma_A = P(A reported | A occurred)` and `gamma_B = P(B reported | B occurred)`. Let's assume that `gamma_A = gamma_B = gamma`; this is reasonable in the event that events `A` and `B` are measured by the same mechanism (e.g., a tracking pixel).

Then we can infer, based on the fact that `B` was reported 5k times without `A` being reported, that approximately 10% (5k unreported `A` occurrences / 50k `A` reports) of the time, data is lost. This suggests `gamma ~= 0.9`.

We can then estimate that there were 50k / 0.9 = 55.56k occurrences of `A` and 15k / 0.9 = 16.67k occurrences of `B`, yielding:

```
alpha = 55.56k / 100k = 0.556
beta = 16.67k / 55.56k = 0.300
```
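All three back-of-the-envelope estimates can be reproduced in a few lines of Python (the counts are the hypothetical ones assumed above):

```python
# Observed counts from the running example.
N = 100_000
a_only = 40_000   # A reported, B not reported
a_and_b = 10_000  # both A and B reported
b_only = 5_000    # B reported, A never reported

# Approach 1: discard the "impossible" B-without-A rows.
alpha_naive = (a_only + a_and_b) / (N - b_only)  # ≈ 0.526
beta_naive = a_and_b / (a_only + a_and_b)        # = 0.200

# Approach 2: logically infer that a B report implies A occurred.
alpha_logic = (a_only + a_and_b + b_only) / N                  # = 0.550
beta_logic = (a_and_b + b_only) / (a_only + a_and_b + b_only)  # ≈ 0.273

# Approach 3: assume a shared reporting rate gamma.
gamma = 1 - b_only / (a_only + a_and_b)  # = 0.9
a_occurred = (a_only + a_and_b) / gamma  # ≈ 55.56k
b_occurred = (a_and_b + b_only) / gamma  # ≈ 16.67k
alpha_gamma = a_occurred / N             # ≈ 0.556
beta_gamma = b_occurred / a_occurred     # = 0.300
```

Each refinement pushes both estimates upward, since each one accounts for more of the unobserved events.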

Based on the data we have, we've guesstimated that approximately 10% of the events which occur are not reported. However, this effect cascades and results in an overall success rate of `alpha * beta` being reported as 10.5% (= 10k / 95k) rather than 16.7% (= 16.67k / 100k). That's a huge difference!

These calculations are all great, but we also need to deal with uncertainty. It's possible that actually `gamma = 0.95` but we simply got unlucky, or `gamma = 0.85` and we got very lucky. How can we quantify this?

This can be done relatively straightforwardly with pymc3.

```
import pylab as pl
import pymc3 as pymc
import numpy

N = 100000
ao = 40000
bo_and_ao = 10000
bo_no_ao = 5000

model = pymc.Model()
with model:
    alpha = pymc.Uniform('alpha', lower=0, upper=1)
    beta = pymc.Uniform('beta', lower=0, upper=1)
    gamma = pymc.Uniform('gamma', lower=0, upper=1)
    a_occurred = pymc.Binomial('a_occurred', n=N, p=alpha)
    a_observed = pymc.Binomial('a_observed', n=a_occurred, p=gamma,
                               observed=ao + bo_and_ao)
    b_occurred = pymc.Binomial('b_occurred', n=a_occurred, p=beta)
    b_observed = pymc.Binomial('b_observed', n=b_occurred, p=gamma,
                               observed=bo_and_ao + bo_no_ao)
    # Draw posterior samples; `trace` is what gets plotted below.
    trace = pymc.sample(10000)
```

The results can then be plotted:

As is expected, we have sharp lower bounds; the true number of events could not be lower than our observed number of events.

These numbers are in rough accordance with our heuristic calculations above.

In the above data, we've done two important things.

*First*, we've built a nowcast of our underlying data. That is to say, while the number of times events `A` and `B` occur is nominally directly observable (albeit noisily), the actual counts are not. So we can construct better estimates (as well as credible intervals) of the event occurrence counts.

*Second*, we've built a direct probabilistic way of computing the fundamental parameters of the problem, namely `alpha` and `beta`. In our pymc code, just as we can plot a histogram of `a_occurred` (via `pl.hist(trace['a_occurred'][::20], bins=50)`), we can similarly plot a histogram of `alpha` itself. In many instances - e.g. A/B testing or bandit algorithms - the underlying probabilities are the parameter of direct interest. The actual counts are only incidental.

The conclusion here is that missing data is not a fundamentally limiting factor in running many analyses. Provided you have a more complete generative model of data collection, and adequate data to fit the model, you can actually correct for missing data when running such analyses.

]]>I wrote an article for Jacobite (with Lisa Mahapatra ) explaining why AI "bias" - as described by journalists - is mostly fictional. Go check it out!

]]>It's a widely cited fact that Uber has continually lost money since it started. Uber opponents widely cite this fact hoping that once the VCs stop investing, Uber will collapse and we can return to the era of yellow cabs. Another stylized fact about Uber is that it's profitable in many cities. In this post I'm going to do some basic financial modeling, and explain why these two facts make me very bullish on Uber.

The economics I'm going to describe are actually pretty well known in the field of SaaS, where they are commonly called the SaaS cash flow trough.

The SaaS trough, roughly speaking, looks like this:

This is an extremely common cash flow diagram across a wide variety of SaaS companies. The trough portion is the time when such companies are raising repeated VC rounds, while simultaneously witnessing revenues rising and costs rising even faster.

The key cause of this cash flow graph is a fairly universal feature of SaaS companies, one which such companies share with Uber: the high cost of acquiring a customer, the very low cost of servicing that customer, and the fact that profit accrues over time.

As can be demonstrated by the cumulative cash flow at the end of this graph, *we have a profitable business*. You can put a fixed amount of money in, and over time you'll pull a larger amount of money out. But early on losses are quite high.

In the example I've graphed, we have the following (stylized) data:

- customer acquisition cost = $200
- customer cost per time period = $1
- customer revenue per time period = $5

In terms of python code, I did the following:

```
from numpy import arange, zeros
from scipy.special import erf

t = arange(100)
cost = (1 + erf((10 - t) / 2.0)) * 10 + 1
revenue = zeros(t.shape, dtype=float)
revenue[:] = 5
profit = revenue - cost
```

Roughly speaking, I'm assuming that a customer costs approximately $200 for a short time at the beginning of their lifetime, and then their costs drop to $1 per period. They generate a constant $5 in revenue, i.e. $4 in profit per period.

This means that for an initial expense of $200, you will receive net revenues of $400 ($4 per period over 100 time units), and hence a profit of $200. Sounds awesome, right? Any VC would wish to invest in this.
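The single-customer trough can be sketched directly from those stylized numbers ($200 acquisition, $1/period cost, $5/period revenue; nothing here is real Uber or SaaS data):

```python
acquisition_cost = 200
cost_per_period = 1
revenue_per_period = 5
periods = 100

# Track cumulative cash flow, starting in the hole by the acquisition cost.
cumulative = -acquisition_cost
breakeven_period = None
for period in range(1, periods + 1):
    cumulative += revenue_per_period - cost_per_period
    if breakeven_period is None and cumulative >= 0:
        breakeven_period = period

print(breakeven_period)  # → 50: half the lifetime is spent in the trough
print(cumulative)        # → 200: cumulative profit after 100 periods
```

So even a single customer is cash-flow negative for fully half their lifetime, which is exactly the trough in the graph above.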

The trough I drew above is the cash flow for a single customer.

But suppose there is a wide open market - a huge number of customers to acquire. Lets do some more detailed modeling.

First, I'll define the following function which models a *single* customer's cash flow over time. This assumes the customer is acquired at time `customer_acquisition_time`:

```
from numpy import where, zeros
from scipy.special import erf

def customer_cash_flow(t, customer_acquisition_time):
    # acquisition-cost bump right after acquisition, then $1/period
    cost = (1+erf((customer_acquisition_time+10-t)/2.0))*10+1
    cost[where(t < customer_acquisition_time)] = 0.0
    revenue = zeros(t.shape, dtype=float)
    revenue[where(t >= customer_acquisition_time)] = 5  # $5/period once acquired
    profit = revenue - cost
    return (cost, revenue, profit)
```

Then I'll assume that at every unit of time, the business gains a new customer:

```
t = arange(100)
cost = zeros(t.shape, dtype=float)
revenue = zeros(t.shape, dtype=float)
profit = zeros(t.shape, dtype=float)
for i in range(0, 100):
    c, r, p = customer_cash_flow(t, i)
    cost += c
    revenue += r
    profit += p
```

Here's a graph of the result. A new customer is added at every unit of time.

The key point here is that the SaaS trough lasts a much longer time than before.

In the example above, I assumed the total number of customers grows *linearly*: a new customer is added at every unit of time.

But in reality, exponential growth is the ultimate goal of both Uber and many SaaS companies. So what happens if we assume exponential growth?

In this case we just change our model to include an exponential growth rate `alpha`:

```
from numpy import exp

alpha = 0.05
tmax = 100
for i in range(0, tmax):
    c, r, p = customer_cash_flow(t, i)
    # weight each cohort by the exponentially growing acquisition rate
    cost += c*exp(i*alpha)
    revenue += r*exp(i*alpha)
    profit += p*exp(i*alpha)
```

With a slow growth rate (1% per time period), our hypothetical company becomes profitable eventually:

But with a faster growth rate of 5% per time period, losses will continue for as long as exponential growth does. In fact, losses will grow exponentially!

This model applies to companies with a high customer acquisition cost and a very low customer maintenance cost. Typically this will be a SaaS company with a high-touch sales process; lots of marketing, inside sales, that kind of thing. I claim that Uber has similar economics; when Uber opens up a new city they need to spend a lot of money onboarding drivers and fighting corrupt politicians. (I'm told by an inside source that actually acquiring *customers* is almost "if you build it they will come" easy.)

So the microconditions of our model appear to apply to Uber.

Furthermore, if you graph Uber's revenue against its losses, with numbers taken from Business Insider, you see that the outputs of the model also appear accurate:

If we plot the same data on a logscale we can see a very clear rate of exponential growth:

Bloomberg provides similar information.

So all the data we can see concerning Uber suggests that Uber is, in fact, in the "lose exponential amounts of money" phase of the SaaS growth curve.

There's another test of the model that we can run.

Another implication of the model is the following. Every single customer acquired by the growing company is profitable over the long haul. The first customer breaks even after 50 units of time, and is profitable thereafter. But the business as a whole is losing money because the profits are spent on acquiring customers 2, 3 and 4 (who will eventually be profitable).

So in our model, the older cohorts (e.g. customers acquired 200 units of time ago) should be profitable, and losses should be concentrated among the newer (and much larger) cohorts.

In fact, we observe this exact behavior with Uber. In Uber's older markets, it is profitable, but it is losing money on newer markets where it's attempting to grow.

One common question asked about Uber is the following. What happens when growth stops? This could happen in multiple ways:

- VCs might get cold feet and pull out, because Travis Kalanick is such a big mean jerk.
- Uber may reach a natural saturation point when it's acquired all customers who could be acquired.

At this point, does the whole house of cards collapse?

The answer is that it doesn't have to. I've run the model again, but I'm assuming at a certain point our company *stops acquiring new customers*. This has two effects:

- Growth stops.
- The cost of customer acquisition drops.

In python, I've done this:

```
tmax = 200
t = arange(tmax)
cost = zeros(t.shape, dtype=float)
revenue = zeros(t.shape, dtype=float)
profit = zeros(t.shape, dtype=float)
alpha = 0.05
for i in range(0, 100):  # customer acquisition stops at t = 100
    c, r, p = customer_cash_flow(t, i)
    cost += c*exp(i*alpha)
    revenue += r*exp(i*alpha)
    profit += p*exp(i*alpha)
```

So while we're extrapolating the graph out to `t=200`, growth stops at `t=100`. The result is as follows:

Basically what happens is at the point when customer acquisition stops, the company becomes profitable. Customer acquisition was the sole reason for losing money in an otherwise profitable business.

At this point new investments should also stop. Sometime after this Uber can become a stable, boring, long term profitable business. At this point it can be valued according to ordinary Wall St. utility company metrics, and the VCs who've invested can take their profits.

I have no insider knowledge on whether Uber is actually long term profitable on individual cohorts, beyond what they've told the media. I also have no knowledge of whether Uber can successfully make the transition from growth to utility; if a company fails to cut their acquisition costs at the right time, they may crash and burn regardless of their per-customer profitability.

However, the point I want to make in this article is the following. Uber's current trajectory is perfectly consistent with the trajectory that many currently profitable SaaS business have taken. There is no fundamental reason to believe that Uber will crash and burn, either when growth stops or VCs stop funding it. Uber does not appear to be on VC-funded life support - by all indications it's on VC-funded hypergrowth, and can probably eventually become profitable once that growth stops.

Unless you like formulas, skip this section. I'm building a discrete time model to understand exactly when hypergrowth implies exponential unprofitability.

Consider a discrete time model. Assume that a customer generates a profit of $1 per unit time for $@ t > 0 $@ (measured from their acquisition) and costs $@ A $@ at $@ t = 0 $@. Assume further that the company has $@ e^{\alpha t } $@ customers at time $@ t $@; call this number $@ c $@. Based on hypergrowth, at time $@ t + 1 $@ the following happens:

- The company acquires $@c (e^{\alpha} - 1)$@ customers.
- The company gains $@ c $@ units of revenue from older customers.
- The company loses $@ c (e^{\alpha} - 1) A $@ due to customer acquisition costs.

The profit/loss $@ p $@ is therefore:

$@ p = c - c(e^{\alpha} - 1)A = c (1 - A(e^{\alpha}-1)) $@

If $@ A(e^{\alpha}-1) < 1 $@, then the company will have profit growing exponentially. On the flip side, if $@ A(e^{\alpha}-1) > 1 $@, then the company's losses will grow exponentially.

What's really important to note here is that **faster growth implies greater losses!**
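To make the breakeven condition concrete, here's a quick numerical check (the function name and the specific values of $@ A $@ and $@ \alpha $@ are mine, chosen for illustration):

```python
from math import exp

def per_period_profit(c, A, alpha):
    """Profit at time t+1 for a company with c customers,
    per-customer acquisition cost A, and growth rate alpha."""
    return c * (1 - A * (exp(alpha) - 1))

# A = 10 at 5% growth: A*(e^alpha - 1) ~ 0.51 < 1, so profit is positive.
assert per_period_profit(100, 10, 0.05) > 0
# A = 30 at the same growth rate: A*(e^alpha - 1) ~ 1.54 > 1, so losses grow with c.
assert per_period_profit(100, 30, 0.05) < 0
```

Note that at a fixed acquisition cost $@ A $@, raising $@ \alpha $@ is exactly what pushes $@ A(e^{\alpha}-1) $@ past 1.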

I recently gave a talk at PyDelhi 2017. The talk was an introduction to Bayesian statistics for the average Python programmer who knows no statistics. The slides are available here. Examples included my gamble with @lisamahapatra as to whether Richard Spencer ever hurt anyone, how to predict disease given diagnostic tests, and compass calibration.

]]>Consider a data set, a sequence of points $@ (x_1, y_1), (x_2, y_2), \ldots, (x_k, y_k)$@. We are interested in discovering the relationship between x and y. Linear regression, at its simplest, assumes a relationship between x and y of the form $@ y = \alpha x + \beta + e$@. Here, the variable $@ e $@ is a *noise* term - it's a random variable that is independent of $@ x $@, and varies from observation to observation. This assumed relationship is called the *model*.

(In the case where x is a vector, the relationship is assumed to take the form $@ y = \alpha \cdot x + \beta + e$@. But we won't get into that in this post.)

The problem of linear regression is then to estimate $@ \alpha, \beta $@ and possibly $@ e $@.

In this blog post, I'll approach this problem from a Bayesian point of view. Ordinary linear regression (as taught in introductory statistics textbooks) offers a recipe which works great under a few circumstances, but has a variety of weaknesses. These weaknesses include an extreme sensitivity to outliers, an inability to incorporate priors, and little ability to quantify uncertainty.

Bayesian linear regression (BLR) offers a very different way to think about things. Combined with some computation (and note - computationally it's a LOT harder than ordinary least squares), one can easily formulate and solve a very flexible model that addresses most of the problems with ordinary least squares.

To begin with, let's assume we have a one-dimensional dataset $@ (x_1, y_1), \ldots, (x_k, y_k) $@. The goal is to predict $@ y_i $@ as a function of $@ x_i $@. Our model describing $@ y_i $@ is

$$ y_i = \alpha x_i + \beta + e $$

where $@ \alpha $@ and $@ \beta $@ are unknown parameters, and $@ e $@ is the statistical noise. In the Bayesian approach, $@ \alpha $@ and $@ \beta $@ are unknown, and all we can do is form an opinion (compute a posterior) about what they might be.

To start off, we'll assume that our observations are independent and identically distributed. This means that for every $@ i $@, we have that:

$$ y_i = \alpha \cdot x_i + \beta + e_i $$

where each $@ e_i $@ is a random variable. Let's assume that $@ e_i $@ is an absolutely continuous random variable, which means that it has a probability density function given by $@ E(t) $@.

Our goal will be to compute a *posterior* on $@ (\alpha, \beta) $@, i.e. a probability distribution $@ p(\alpha,\beta) $@ that represents our degree of belief that any particular $@ (\alpha,\beta) $@ is the "correct" one.

At this point it's useful to compare and contrast standard linear regression to the bayesian variety.

In **standard linear regression**, your goal is to find a single estimator $@ \hat{\alpha} $@. Then for any unknown $@ x $@, you get a point predictor $@ y_{approx} = \hat{\alpha} \cdot x $@.

In **bayesian linear regression**, you get a probability distribution representing your degree of belief as to how likely $@ \alpha $@ is. Then for any unknown $@ x $@, you get a probability distribution on $@ y $@ representing how likely $@ y $@ is. Specifically:

$$ p(y = Y) = \int_{\alpha \cdot x + \beta = Y} \textrm{posterior}(\alpha,\beta) d\alpha $$

To compute the posteriors on $@ (\alpha, \beta) $@ in Python, we first import the PyMC library:

```
import pymc
```

We then generate our data set (since this is a simulation), or otherwise load it from an original data source:

```
from scipy.stats import norm
k = 100 #number of data points
x_data = norm(0,1).rvs(k)
y_data = x_data + norm(0,0.35).rvs(k) + 0.5
```

We then define priors on $@ (\alpha, \beta) $@. In this case, we'll choose uniform priors on [-5,5]:

```
alpha = pymc.Uniform('alpha', lower=-5, upper=5)
beta = pymc.Uniform('beta', lower=-5, upper=5)
```

Finally, we define our observations.

```
x = pymc.Normal('x', mu=0, tau=1, value=x_data, observed=True)

@pymc.deterministic(plot=False)
def linear_regress(x=x, alpha=alpha, beta=beta):
    return x*alpha+beta

y = pymc.Normal('output', mu=linear_regress, value=y_data, observed=True)
```

Note that for the values `x` and `y`, we've told PyMC that these are known quantities obtained from observation. Then we run some Markov Chain Monte Carlo:

```
model = pymc.Model([x, y, alpha, beta])
mcmc = pymc.MCMC(model)
mcmc.sample(iter=100000, burn=10000, thin=10)
```

We can then draw samples from the posteriors on alpha and beta:

Unsurprisingly (given how we generated the data) the posterior for $@ \alpha $@ is clustered near $@ \alpha=1 $@ and for $@ \beta $@ near $@ \beta=0.5 $@.

We can then draw a *sample* of regression lines:

Unlike in the ordinary linear regression case, we don't get a single regression line - we get a probability distribution on the space of all such lines. The width of this posterior represents the uncertainty in our estimate.

Imagine we were to change the variable `k` to `k=10` at the beginning of the python script above. Then we would have only 10 samples (rather than 100) and we'd expect much more uncertainty. Plotting a sample of regression lines reveals this uncertainty:

In contrast, if we had far more samples (say `k=10000`), we would have far less uncertainty in the best fit line:

In many classical statistics textbooks, the concept of an "outlier" is introduced. Textbooks often treat outliers as data points which need to be specially treated - often ignored - because they don't fit the model and can heavily skew results.

Consider the following example. 50 data points were generated. In the left graph, the y-values were chosen according to the rule $@ y = 1.0 x + 0.5 + e $@ where $@ e $@ is drawn from a normal distribution. In the right graph, the y-values were chosen according to the same rule, but with $@ e $@ drawn from a Cauchy distribution. I then did ordinary least squares regression to find the best fit line.

The red line in the graph is the best fit line, while the green line is the true relation.

The Cauchy distribution is actually pretty pathological - its variance is infinite. The result of this is that "outliers" are not actually uncommon at all - extremely large deviations from the mean are perfectly normal.

With a normal distribution, the probability of seeing a data point at $@ y=-20$@ or $@ -30 $@ (as described in the figure) is very small, particularly if the line is sloping upward. As a result, the fact that such a data point did occur is very strong evidence in favor of the line having a much smaller upward slope - even though only a few points slope this way.

In fact, a sufficiently large singleton "outlier" can actually shift the slope of the best fit line from positive to negative:
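This slope flip is easy to reproduce with ordinary least squares (a minimal hypothetical example; `np.polyfit` computes the OLS fit, and the outlier value is my own choice):

```python
import numpy as np

x = np.linspace(0, 1, 50)
y = 1.0 * x + 0.5               # noiseless data on the true line
slope_clean = np.polyfit(x, y, 1)[0]

y_outlier = y.copy()
y_outlier[-1] = -100.0          # a single cauchy-style outlier
slope_outlier = np.polyfit(x, y_outlier, 1)[0]

assert slope_clean > 0          # positive slope without the outlier
assert slope_outlier < 0        # one outlier flips the OLS slope negative
```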

However, if we use Bayesian linear regression and simply change the distribution on the error in Y to a cauchy distribution, Bayesian linear regression adapts perfectly:

In this picture, we have 50 data points with Cauchy-distributed errors. The black line represents the true line used to generate the data. The green line represents the best possible least squares fit, which is driven primarily by the data point at (1, 10). The red lines represent samples drawn from the Bayesian posterior on $@ (\alpha, \beta) $@.

In code, all I did to make this fix was:

```
-y = pymc.Normal('output', mu=linear_regress, value=y_data, observed=True)
+y = pymc.Cauchy('output', alpha=linear_regress, beta=0.35, value=y_data, observed=True)
```

This fairly simple fix helps the model to recover.

Rather than simply setting up a somewhat overcomplicated model in PyMC, one can also set up the MCMC directly. Suppose we have a data set $@D = \{ (x_i, y_i) \}$@. Then:

$$ \textrm{posterior}(\alpha,\beta | D) = \frac{ P(D | \alpha,\beta) \textrm{prior}(\alpha,\beta) } { \textrm{Const} } $$

If the samples are i.i.d, we can write this as:

$$ \textrm{posterior}(\alpha,\beta | D) = \textrm{Const} \times \textrm{prior}(\alpha,\beta) \prod_{i=1}^k P(y_i|x_i,\alpha,\beta) $$

Because we assumed the error term has a PDF and is additive, we can simplify this to:

$$ \textrm{posterior}(\alpha, \beta | D) = \textrm{Const} \times \textrm{prior}(\alpha, \beta) \prod_{i=1}^k E(y_i - \alpha \cdot x_i - \beta) $$

Given this formulation, we have now expressed the posterior as being proportional to a known function. This allows us to run any reasonable Markov Chain Monte Carlo algorithm directly and draw samples from the posterior.
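To illustrate, here is a minimal Metropolis sampler for this posterior, written from scratch rather than with PyMC (the function names, step size and seeds are my own choices; this is a sketch, not a production sampler):

```python
import numpy as np

def log_posterior(alpha, beta, x, y, log_E):
    # Uniform prior on [-5,5]^2 plus the sum of log error densities
    if not (-5 < alpha < 5 and -5 < beta < 5):
        return -np.inf
    return log_E(y - alpha * x - beta).sum()

def metropolis(x, y, log_E, n_samples=10000, step=0.03, seed=0):
    rng = np.random.default_rng(seed)
    a, b = 0.0, 0.0
    lp = log_posterior(a, b, x, y, log_E)
    samples = []
    for _ in range(n_samples):
        a_new = a + step * rng.standard_normal()
        b_new = b + step * rng.standard_normal()
        lp_new = log_posterior(a_new, b_new, x, y, log_E)
        if np.log(rng.uniform()) < lp_new - lp:  # Metropolis accept/reject
            a, b, lp = a_new, b_new, lp_new
        samples.append((a, b))
    return np.array(samples)

# Synthetic data as before: y = x + 0.5 + gaussian noise (sd 0.35)
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = x + 0.5 + 0.35 * rng.normal(size=100)
# Gaussian error term: log E(t) = -t^2 / (2 * 0.35^2), up to a constant
samples = metropolis(x, y, lambda t: -t**2 / (2 * 0.35**2))
```

Discarding the first half of the chain as burn-in, the sample means land near the true $@ \alpha = 1 $@ and $@ \beta = 0.5 $@.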

Suppose we now choose a very particular distribution - suppose we take $@ E(t) = C e^{-t^2/2} $@. Further, suppose we take an uninformative (and improper) prior, namely $@ \textrm{prior}(\alpha, \beta) = 1 $@ - this means we have literally no information on $@ (\alpha, \beta) $@ before we start. In this case:

$$ \textrm{posterior}(\alpha, \beta | D) = \textrm{Const} \prod_{i=1}^k \exp\left(-(y_i - \alpha x_i - \beta)^2/2\right) $$

If we attempt to maximize this quantity (or its log) in order to find the point of maximum likelihood, we obtain the problem:

$$ \textrm{argmax} \left[ \ln\left( \textrm{Const} \prod_{i=1}^k \exp\left(-(y_i - \alpha x_i - \beta)^2/2\right) \right) \right] = $$

$$ \textrm{argmax} \left[ \textrm{Const} + \sum_{i=1}^k -(y_i - \alpha x_i - \beta)^2/2 \right] = $$

$$ \textrm{argmax} \left[ \sum_{i=1}^k -(y_i - \alpha x_i - \beta)^2/2 \right]$$

Here the argmax is computed over $@ (\alpha, \beta) $@. This is precisely the problem of minimizing the squared error. So an uninformative prior combined with a Gaussian error term reduces Bayesian Linear Regression to ordinary least squares.

One very important fact is that because OLS computes the peak of the BLR posterior, one can think of OLS as a cheap approximation to BLR.

Bayesian Linear Regression is an important alternative to ordinary least squares. Even if you don't use it often due to its computationally intensive nature, it's worth thinking about as a conceptual aid.

Whenever I think of breaking out least squares, I work through the steps of BLR in my head.

- Are my errors normally distributed or close to it?
- Do I have enough data so that my posterior will be narrow?

In the event that these assumptions are true, it's reasonably safe to use OLS as a computationally cheap approximation to BLR.

Sometimes the approximations simply aren't true. For example, a stock trading strategy I ran for a while had a very strong requirement of non-normal errors. Once I put the non-normal errors in, and quantified uncertainty (i.e. error in the slope and intercept of the line), my strategy went from losing $1500/month to gaining $500/month.

For this reason I encourage thinking of regression in Bayesian terms. I've adopted this in my thinking and I find the benefits vastly exceed the costs.

]]>Problematic Presentation of Probabilities. About how difficult it is to explain probabilistic predictions to lay audiences.

...any bayesian algorithm is also an algorithm that can be implemented for streaming...

Inherent Trade-Offs in the Fair Determination of Risk Scores - explains how all algorithms have a tradeoff between fairness and accuracy.

A warning about value discarding in Scala. Specifically, Scala will sometimes implicitly turn objects of type `T` into `()` when you don't expect it.

Automated Inference on Criminality using Face Images - looks like Phrenology might have some empirical support. All American twitter can do is repeat "ethics in AI" like a mantra, pressure the arxiv should take it down (with no argument why it must be wrong) and how it must somehow be racist (the entire data set was Chinese people).

Relatedly, China is winning the nuclear race using technology the US has abandoned.

A Theory of Efficient Short-Termism. This article postulates that the owners of firms want managers to pursue short term goals in order to reduce the risk of managers engaging in rent seeking.

A meta analysis of student evaluations shows that student evaluations of professors are more or less uncorrelated with teaching effectiveness. I'm not particularly surprised - I always considered student evaluations to be a measure of attractiveness, sympathy and how easy the tests are.

A discussion of Trump's Carrier deal, by Larry Summers. What Summers fails to recognize is that the "deal based capitalism" which he decries is already here - it's just usually the bureaucracy rather than the president you need to make the deal with. In particular, in the world of big city real estate (e.g. NYC), there is nothing but Trump's "deal based capitalism". I love how the economic elites are suddenly remembering their principles now that Trump is in charge.

CDC is unable to reliably connect "food deserts" and obesity. Also, Yes there are grocery stores in Detroit.

Trump is considering an economically literate FDA chief, with very clear Peter Thiel influence. I'm starting to think Peter Thiel might have been right about this one.

The Chronicle of Higher Education laments the fact that economics is too capitalistic, and celebrates the AEA founder who denounced laissez-faire. Alex Tabarrok points out exactly what this means. (Hint: laissez-faire suggests that whites and negros should be allowed to breed, while the anti-capitalists the Chronicle celebrates did not think this.)

What if powerful political and ideological forces stood to benefit if the general public believed that small orange rocks dropped into swimming pools cause no increases in the water levels of swimming pools? ...there would be no shortage of physicists who conduct and publish studies allegedly offering evidence that, indeed, the dropping of small orange rocks into swimming pools does not tend to raise the water levels of swimming pools (and, indeed, might even lower pools water levels!).

Is Empathy Necessary for Morality?

What the hell is postmodernism? Who the hell were Frank Lloyd Wright and Le Corbusier? What the Hell is Modern Architecture? and Part 3.

Steve Bannon in his own words. Here's an interesting speech that Bannon gave which outlines his worldview. I think he's wrong on a lot, and I'm also not his target audience (I'm a white American who feels closer to the people in Bombay than in Kansas), but he's definitely not the person the media portrays him as.

Another useful description of class in America. This article definitely resonates with me. A woman I know with a penchant for feminism and social justice - a dark skinned Indian with the accent that sounds British to people who've never lived in India - recently remarked to me that she's concerned about police harassment. I asked how often the American police harassed her and it turns out the answer is "never", and she refused my bet that it will remain "never" in 2017. For me the answer is very far from "never". I also can't request a cortisone (anti-inflammatory) injection from doctors without them thinking I'm seeking narcotics.

Who gave us post truth, conspiracy culture?

Donna Zuckerberg, a classicist of some note (mostly due to her famous brother, here's his website) advocates refusing to share the classics with the alt-right. The alt-right response is significantly more intellectual.

]]>Sometimes in statistics, one knows certain facts with absolute certainty about a distribution. For example, let $@ t $@ represent the time delay between an event occurring and the same event being detected. I don't know very much about the distribution of times, but one thing I can say with certainty is that $@ t > 0 $@; an event can only be detected *after* it occurs.

In general, when we don't know the distribution of a variable at all, it's a problem for nonparametric statistics. One popular technique is Kernel Density Estimation. Gaussian KDE builds a smooth density function from a discrete set of points by dropping a Gaussian kernel at the location of each point; provided the width of the Gaussians shrinks appropriately as the data size increases, and the true PDF is appropriately smooth, this can be proven to approximate the true PDF.

However, the problem with Gaussian KDE is that it doesn't respect boundaries.

Although we know a priori that every data point in this data set is contained in [0,1] (because I generated the data that way), the Gaussian KDE approximation unfortunately does not respect this. By graphing the data together with the estimated PDF, we discover a non-zero probability of finding data points below zero or above one:

The problem is that although we know the data set is bounded below by $@ t = 0 $@, we are unable to communicate this information to the gaussian KDE.

Let's consider a very simple data set, one with two data points at $@ t = 0.1 $@ and $@ t = 0.3 $@ respectively. If we run a gaussian KDE on this, we obtain:

The gaussian near $@ t=0 $@ overspills into the region below $@ t < 0 $@.

The way to resolve this issue is to replace the Gaussian kernel with a kernel that does respect boundaries. For example, if we wish to respect boundaries of [0,1], we can use a beta distribution. Because the variance of a beta distribution is an important parameter, we'll try and choose a beta distribution with the same variance as a comparable gaussian kernel.

The beta distribution takes two parameters, $@ \alpha $@ and $@ \beta$@, has mean $@ \alpha/(\alpha+\beta) $@ and variance $@ \alpha\beta/((\alpha+\beta)^2(\alpha+\beta+1)) $@.

So what we'll do is choose a bandwidth parameter $@ K $@, set $@ \alpha = dK $@ and $@ \beta = (1-d)K $@ for a data point $@ d $@. Then the mean will be $@ \alpha/(\alpha+\beta) = dK/(dK+(1-d)K) = d $@. The value $@ K $@ can then be chosen so as to make the variance equal to a gaussian, i.e.:

$@ K = \frac{d(1-d)}{V} - 1 $@

(This calculation is done in the appendix.)
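As a quick sanity check of this choice of $@ K $@ (using scipy; the particular values of `d` and `V` are arbitrary):

```python
from scipy.stats import beta

V = 0.01 ** 2            # target variance of the comparable gaussian kernel
d = 0.3                  # data point / kernel center
K = d * (1 - d) / V - 1  # bandwidth parameter from the formula above
kernel = beta(d * K, (1 - d) * K)

assert abs(kernel.mean() - d) < 1e-9  # mean is exactly d
assert abs(kernel.var() - V) < 1e-9   # variance matches the gaussian's
```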

If we remain away from the boundary, a beta distribution approximates a gaussian very closely:

In this graph (and all the graphs to follow), the blue line represents a Gaussian KDE while the green line represents the Beta KDE.

But if we center a gaussian and a beta distribution at a point near the boundary, the beta distribution changes shape to avoid crossing the line $@ x=0 $@:

Now suppose we run a kernel density estimation using beta distributions, centered at the data points. The result is a lot like gaussian KDE, but it respects the boundary:

The code to generate this is the following (for a 1000-element data set, and a kernel bandwidth of 0.01):

```
from numpy import arange, zeros
from scipy.stats import beta, norm

xx = arange(-0.1, 1.1, 0.001)
sigma = 0.01
gaussian_kernel = zeros(shape=(xx.shape[0],), dtype=float)
beta_kernel = zeros(shape=(xx.shape[0],), dtype=float)
for i in range(1000):
    # bandwidth parameter matching the gaussian variance, bounded below
    K = max(data[i]*(1-data[i])/pow(sigma, 2) - 1, 200)
    beta_kernel += beta(data[i]*K + 1, (1-data[i])*K + 1).pdf(xx)/1000.0
    gaussian_kernel += norm(data[i], sigma).pdf(xx)/1000.0
```

A generalization of this idea can be used on the unit simplex, i.e. the set of vectors with $@ \vec{d}_i \geq 0 $@ and $@ \sum_i \vec{d}_i = 1 $@.

Consider a data point $@ \vec{d} $@ on the unit simplex. Given a large parameter $@ K $@, one can define $@ \vec{\alpha} = K\vec{d} $@ and use a $@ \textrm{Dirichlet}( K\vec{d}) $@ distribution in the exact same way we used a beta distribution above. Recall that a Dirichlet distribution has mean $@ \vec{\alpha} / | \vec{\alpha} |_{l^1} = \vec{d} $@ if $@ \alpha = K\vec{d} $@. By making $@ K $@ sufficiently large, the variance (which is $@O(K^{-1})$@) will be small, so this Dirichlet distribution behaves the same way as the beta distribution above.

This would allow us to run a KDE-like process on the unit simplex.
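As a sketch of what one such kernel looks like in code (using scipy's `dirichlet`; the values of `K` and `d` are arbitrary):

```python
import numpy as np
from scipy.stats import dirichlet

K = 500.0                      # large concentration => small variance
d = np.array([0.2, 0.3, 0.5])  # a data point on the unit simplex
kernel = dirichlet(K * d)      # mean of this kernel is exactly d

samples = kernel.rvs(100, random_state=0)
# Every sample stays on the simplex and concentrates near d.
assert np.allclose(samples.sum(axis=1), 1.0)
assert np.allclose(samples.mean(axis=0), d, atol=0.05)
```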

Another situation which can arise is in MCMC. In a Bayesian nonparametric regression situation that I ran into recently, I needed a proposal distribution for vectors which I knew lived on the unit simplex. Given the old value $@ \vec{d} $@, I drew a proposed value given the exact distribution described above.

When using kernel approximations, don't treat the process as a black box. One can often preserve valuable properties and get more accurate results simply by making model-based tweaks to the kernel. This is a very important fact I learned in my academic career as a (computational) harmonic analyst, but which I don't see the data science community adopting.

**Special thanks** to Lisa Mahapatra for massively improving my data visualizations.

Consider a gaussian with variance $@ V $@. We want to construct a beta distribution with the same variance, centered at the point $@ d $@. Note that for a beta distribution with parameters $@ \alpha, \beta $@, the variance is:

$@ \textrm{Var} = \frac{ \alpha \beta } { (\alpha + \beta)^2(\alpha + \beta + 1) } $@

(I'm taking these identities from le wik.)

Now let $@ \alpha = K d $@ and $@ \beta = K(1-d) $@. Then we obtain the variance:

$@ \textrm{Var} = \frac{ K^2 d (1-d) } { (K d + K(1-d))^2(Kd + K(1-d) + 1) } = \frac{ K^2 d (1-d) } { K^2(K + 1) } = \frac{ d(1-d) }{K+1} $@

Now suppose we wanted the variance of a beta distribution to be $@ V $@, then we would set this equal to $@ V $@ and solve for $@ K $@. The result is:

$@ K = \frac{d(1-d)}{V} - 1 $@

There is one important fact to note. When $@ d(1-d) < V $@ (i.e. for $@ d $@ sufficiently close to 0 or 1), the resulting $@ K $@ will actually become negative. This will result in a singular beta distribution. So to get a non-singular PDF, we need to bound $@ K $@ below; in practice I've found choosing $@ K = \max(\frac{d(1-d)}{V} - 1, 100) $@ works reasonably well.

If we do not impose regularization conditions, then if any data points exist which make $@ K $@ negative, the pdf becomes singular at $@ x=0 $@ and/or $@ x=1 $@.

However, the *CDF* of the distribution does NOT become singular, merely non-differentiable. The nature of the problem is that the PDF behaves like $@ x^K $@ for some $@ -1 < K < 0 $@ near $@ x=0 $@. This is an integrable singularity, and the CDF then behaves like $@ x^{K+1} $@ which is a continuous function.

Cinta Vidal is really interesting.

“Why Should I Trust You?” Explaining the Predictions of Any Classifier

Physics, Topology, Logic and Computation: A Rosetta Stone - a great paper on category theory, explaining connections between computation, mathematics and quantum physics.

A Simple Explanation for the Replication Crisis. Great piece by Andrew Gelman.

Science is not always self correcting. A great article by Cofnas on scientist's tendency to treat *positive* claims as *normative* and then reject positive conclusions for being evil.

Distributional Inequalities for the Binomial Law. The next time you approximate a binomial by a gaussian, pull this paper out. It'll give you some rigorous bounds.

A cool idea: Adiabatic Monte Carlo.

Easy Parsing with Parser Combinators (in Scala).

The Cybernetics Scare and the Origins of the Internet.

Passive Investing Is Worse for Society Than Marxism. An interesting article which makes the argument that while Marxism at least *attempts* to optimize capital flows, passive investing does not.

American Exceptionalism. Kind of contradicts my personal pessimism about the US.

It's a long way down. Scott Sumner (toward the end of the post) discusses how much worse poverty can be, comparing the US to Mexico, Mexico to China 2011, China 2011 to China 1997, etc.

Do Immigrants Import their Economic Destiny? The ideas contained herein are, for me, the only strong argument against open borders.

The Role of Headhunters in Wage Inequality: It's All About Matching. Paper argues that income inequality is rising because high skilled labor markets were previously inefficient and the rich were underpaid. Now headhunters have helped the markets to clear.

India's role in exporting health care. India's hub&spoke model of health care, where spokes focus on diagnosis/funneling to the hub, apparently saves a lot of costs. Relatedly, my post on medical tourism.

Globalisation ‘not to blame’ for income woes. See also financial times on the topic. Apparently US poor moved up the income distribution, the "trough" near the global middle class due primarily to Japan + Former Soviet Union.

Politics is Upstream of AI. An article which discusses the dangers of politically driven AI development. One important counterpoint to this article is that we don't need to *imagine* Soviet AI - it actually existed. One of the great results of Soviet AI is the concept of shadow prices; it's a theorem of Pontryagin (I think, might be remembering here) that the Lagrange multiplier of the socialist calculation problem is the market price that would appear *assuming* customers had the same utility function that socialist planners think they do.

I recently learned about IQ shredders. A truly interesting concept - cities like Singapore, Mumbai and NYC attract the smartest people, reduce their fertility to below replacement levels, and thereby reduce the IQ of future generations.

The Expressive Meaning of Democracy. Argues that our views on democracy are an artifact of treating social standing as synonymous with voting, and that this social convention should be dropped if it is harmful.

Socially Enforced Thought Boundaries - a nice article criticizing Randall Munroe's xkcd comic which comes out advocating the shunning of people who express skepticism of sacred beliefs.

What if the bottom 50% went Galt?

The Most Intolerant Wins: The Dictatorship of the Small Minority by Nassim Taleb.

The Social Crucifixion of IOError, part 2 and part 3. It's a good analysis of what I now believe to be a mobbing of Jacob Appelbaum - an attack by adversaries within his circles who for some reason have decided they dislike him.

Social Gentrification. Premise: nerd culture has been gentrified. "If it had turned out that nerd went mainstream, and suddenly...I was cool...that would be amazing. But what happened, [is] a bunch of people decided nerd chic is cool, then they said “ew what’s this loser doing here” before kicking me out so they could enjoy themselves." I think the analogy fails a bit, as Erik's comment points out; unlike land, it's always possible to create new subcultures. The folks currently gentrifying "nerd culture" could just create "nerd lite culture" and allow the nerds to keep their subculture.

Iconoclast author gives a speech saying she hopes the concept of cultural appropriation is a passing fad. The writers' conference then disavows her and removes the speech from its website.

]]>Very often one builds a statistical model in pieces. For example, imagine one has a binary event which may or may not occur - to work with my thematic example, a visitor arrives on a webpage and he may or may not convert. A reasonable question to ask is "if I have 100 visitors, how many of them can I expect to convert?" Assume now that I *know* the conversion rate `lmbda`; in this case the maximum likelihood point estimate for the number of conversions is `100*lmbda`, and the probability distribution of possible outcomes is `binom(100, lmbda)` (i.e. a binomial distribution). But what happens if `lmbda` is not known, but is instead a random variable?

In a previous post, I considered the problem of measuring the detection probability of an individual sensor in a sensor network with delays between detection and reporting. My solution to this problem involved *assuming* that I knew the detection probability as a function of the *current time* and the *time of detection*. I.e., I assumed that I knew exactly the cdf `r(t)` of the *delay*. When I showed the delayed reactions post to a critic, one of his immediate reactions was to ask how I'd find `r(t)`. My suggestion was to use a nonparametric Bayesian estimator, the output of which is a probability distribution over the space of possible functions `r(t)`, as opposed to an individual `r(t)`.

In both cases, I made assumptions that certain quantities were known exactly, and then I used those exact numbers to derive a probability distribution on the quantity of interest. But in reality, those quantities are not known exactly - merely probabilistically.

In this blog post I'll show why this is fundamentally not a problem. That's because probability is a monad, and this monadic structure allows me to combine various analyses in a natural and obvious way.

**Background**: I am assuming that the reader of this post has a moderate amount of knowledge of probability theory, and a moderate amount of knowledge of functional programming. I will be *assuming* that functors (objects with a `map` method) and monads (objects which also have `flatMap`, `bind` or `>>=` on them) are known to the reader.

Also, for a more mathematical look at this topic, I'm mostly taking this material from the papers A Categorical Approach to Probability Theory (by Giry) and A Categorical Foundation for Bayesian Probability (by Culbertson and Sturtz). This post is more intended for programmers than mathematicians.

In the language of type theory, probability is a type constructor `Prob[T]`. An object of this type should be interpreted as a probability distribution over objects of type `T`, or a probability measure on `T`. As the simplest possible example, let's take `T = Boolean`. Then an object in `Prob[Boolean]` can be thought of as a function `f: Boolean => Real` where `f(true) + f(false) = 1`, `f(true) >= 0` and `f(false) >= 0`.

In the language of computer science, there are several alternative ways to represent `Prob[T]`. The first is simply as a function mapping objects to their probabilities:

```
trait Prob[T] {
  // Here T is a finite type
  def prob(t: T): Real
}

// The laws, in ScalaCheck style: probabilities are nonnegative and sum to 1.
forAll( (p: Prob[T]) => {
  // allT is a sequence of every possible value of T.
  allT.map(p.prob).sum == 1.0
})
forAll( (p: Prob[T], t: T) => {
  require(p.prob(t) >= 0)
})
```

A second way to represent it is as a sequence of samples:

```
trait Rand[T] {
  def draw: T
}
```

In this case one can approximately recover a `Prob[T]` object:

```
class ApproxProb[T](rand: Rand[T]) extends Prob[T] {
  // It doesn't exactly satisfy the Prob[T] laws, but it comes close.
  val numAttempts = 1000000
  def prob(t: T): Real = {
    var numFound = 0
    var i: Int = 0
    while (i < numAttempts) {
      if (rand.draw == t) { numFound += 1 }
      i += 1
    }
    numFound.toDouble / numAttempts.toDouble
  }
}
```

This sampling representation will not exactly satisfy the laws that `Prob[T]` does, but it will come close if `numAttempts` is large enough.
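To make the approximation concrete, here is a minimal stdlib-only Python sketch of the same idea (the names `Rand` and `ApproxProb` mirror the Scala above and are not from any library):

```python
import random

class Rand:
    """Sampling representation: wraps a zero-argument sampler."""
    def __init__(self, draw):
        self.draw = draw

class ApproxProb:
    """Approximately recovers a mass function from repeated draws."""
    def __init__(self, rand, num_attempts=100000):
        self.rand = rand
        self.num_attempts = num_attempts

    def prob(self, t):
        # Estimate P(X == t) as the fraction of draws equal to t.
        hits = sum(1 for _ in range(self.num_attempts)
                   if self.rand.draw() == t)
        return hits / self.num_attempts

fair_coin = Rand(lambda: random.random() < 0.5)
p = ApproxProb(fair_coin)
```

Here `p.prob(True)` lands close to 0.5, with the error shrinking as `num_attempts` grows.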

Relating this to the examples above, let us first consider the problem of estimating a future number of conversions. Given a conversion rate `lmbda` and `N` visitors, we know that the number of conversions is binomially distributed. We can therefore represent our solution as a function of type `(Real, Integer) => Prob[Integer]`:

```
def numConversions(lmbda: Real, N: Integer): Prob[Integer] =
  new Prob[Integer] {
    def prob(t: Integer) = Binomial(lmbda, N).pmf(t)
  }
```

(Here `pmf` is the probability mass function of the binomial distribution.)
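As a sanity check, the binomial pmf can also be written out by hand. A Python sketch (the function name is my own, not any library's):

```python
import math

def num_conversions_prob(lmbda, N):
    """Return the binomial pmf: t -> P(t conversions out of N visitors)."""
    def pmf(t):
        # Choose which t visitors convert, times the probability of that outcome.
        return math.comb(N, t) * lmbda**t * (1.0 - lmbda)**(N - t)
    return pmf
```

For example, `num_conversions_prob(0.25, 100)` sums to 1 over `t = 0..100` and peaks at `t = 25`.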

We could also define this using the sampling representation:

```
def numConversions(lmbda: Real, N: Integer): Rand[Integer] =
  new Rand[Integer] {
    def draw = Binomial(lmbda, N).draw
  }
```

In either case, we are building a function of deterministic inputs and getting an object of type `Prob[T]` as an output.

The first important observation is that probability is a functor. Specifically, what this means is that if you have an object of type `Prob[T]` and a function `f: T => U`, you can get an object of type `Prob[U]` out of it. Let me start with a motivating example. Let `T` be the set `{a, b, c}`. Then define:

```
val prob = new Prob[T] {
  def prob(t: T) = 1.0 / 3.0
}
```

This probability distribution assigns equal weight (1/3 probability) to each element of `T`. Now let `U` be the set `{x, y}`, and `f: T => U` be the function:

```
def f(t: T) = t match {
  case a => x
  case b => x
  case c => y
}
```

The result of `prob.map(f)` should be the probability distribution mapping to `x` with 2/3 probability and to `y` with 1/3 probability.

The simpler way to define `map` is in the sampling representation:

```
object RandFunctor extends Functor[Rand] {
  def map[T, U](p: Rand[T])(f: T => U): Rand[U] = new Rand[U] {
    def draw: U = f(p.draw)
  }
}
```

When we have `map`ped a `Rand[T]`, we get a new object which provides random samples of type `U`.

If we apply this definition to our example above, we discover that 2/3 of the time the outcome of `p.draw` is either `a` or `b`. As a result, 2/3 of the time the outcome of the mapped distribution is `x`, as desired.

We can also provide a definition in the `Prob[T]` representation, but it's a bit more complicated:

```
object ProbFunctor extends Functor[Prob] {
  def map[T, U](p: Prob[T])(f: T => U): Prob[U] = new Prob[U] {
    def prob(u: U): Real = {
      val inverseImage: List[T] = allT.filter(t => f(t) == u)
      inverseImage.map(p.prob _).sum
    }
  }
}
```

In this case, we can do the calculations by hand. Suppose we compute `prob(x)`. Then the value of `inverseImage` is the set of all `t` for which `f(t) == x`, which happens to be `List(a, b)`. Next we compute `inverseImage.map(p.prob _)`, which works out to be `List(1/3, 1/3)`. Finally we sum that list, resulting in 2/3.

Woot! Both of our representations work out correctly.
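For the skeptical, both computations are easy to check mechanically. Here is a small Python sketch (names are my own) of the `{a, b, c} => {x, y}` example, computing the pushforward in both representations:

```python
import random

all_t = ["a", "b", "c"]
prob = {t: 1.0 / 3.0 for t in all_t}  # the uniform distribution on T

def f(t):
    # The function from the example: a and b map to x, c maps to y.
    return "x" if t in ("a", "b") else "y"

def mapped_prob(u):
    # Mass-function representation: sum probabilities over the inverse image.
    return sum(prob[t] for t in all_t if f(t) == u)

def mapped_draw():
    # Sampling representation: apply f to a draw from the original distribution.
    return f(random.choice(all_t))
```

`mapped_prob("x")` is exactly 2/3, and the empirical frequency of `"x"` under `mapped_draw` converges to the same value.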

Let's now consider the following situation. We run an experiment and measure our conversion rate as described here. The net result is that we form an opinion on the conversion rate:

```
val conversionRate = new Rand[Real] {
  def draw = BetaDistribution(numConversions + 1, numVisitors - numConversions + 1).draw
}
```

Given the previous discussion, we now have the following idea - let's take this and `map` it with our `numConversions` function above:

```
val expectedConversions =
  conversionRate.map(lmbda => numConversions(lmbda, 100))
```

Unfortunately, if we look at the type of `expectedConversions`, it works out to be `Rand[Rand[Int]]`. That's not what we wanted - we really wanted a `Rand[Int]`.

So what we need to do is somehow flatten a `Prob[Prob[Int]]` or a `Rand[Rand[Int]]` down to a `Prob[Int]` or `Rand[Int]`.

In the sampling approach, there is one pretty obvious way to do this. Recall how we defined `map` on a `Rand[T]` object - we applied the function to the result of drawing a random sample. What if the function itself returns a `Rand[U]`, and we then draw a new sample from *that*? I'll write an implementation of this and suggestively name it `bind`:

```
object RandMonad extends Monad[Rand] {
  def bind[T, U](p: Rand[T])(f: T => Rand[U]): Rand[U] = new Rand[U] {
    def draw: U = f(p.draw).draw
  }
}
```

Clearly the type signature of this matches. It also makes intuitive sense. In the probabilistic formulation we can do the same thing:

```
object ProbMonad extends Monad[Prob] {
  def bind[T, U](p: Prob[T])(f: T => Prob[U]): Prob[U] = new Prob[U] {
    def prob(u: U): Real = {
      val probSpace: List[(T, U)] = cartesianProduct(allT, allU)
      val slice = probSpace.filter( tu => tu._2 == u )
      slice.map( tu => p.prob(tu._1) * f(tu._1).prob(tu._2) ).sum
    }
  }
}
```

To understand what we are doing here, it helps to visualize. Let's represent the cartesian product `probSpace` above as a grid - suppose `allT = {1, 2, ..., 16}` and `allU = {1, 2, ..., 16}`. Then consider the function `density: (T, U) => Real` defined by `density(t, u) = p.prob(t) * f(t).prob(u)`. (One example of such a function is plotted below.)

Then the result of `bind` is a new probability distribution obtained by taking a vertical slice at the x-coordinate `u` and summing over that vertical line. This is, of course, purely a function of `u` now, since all the dependence on `t` has been averaged out. The result of summing is displayed in the graph via the black line, interpreted as a 1-dimensional plot over `u`.
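The same slicing can be sketched in a few lines of Python, on a 2x2 example rather than a 16x16 grid (the particular distributions here are made up purely for illustration):

```python
all_t = [0, 1]
all_u = [0, 1]
p = {0: 0.5, 1: 0.5}  # a uniform distribution on T

def f(t):
    # A model returning a mass function on U for each value of t.
    return {0: 0.9, 1: 0.1} if t == 0 else {0: 0.2, 1: 0.8}

def bound_prob(u):
    # Slice the product space at u, then sum p(t) * f(t)(u) over the slice.
    return sum(p[t] * f(t)[u] for t in all_t)
```

Here `bound_prob(0) = 0.5*0.9 + 0.5*0.2 = 0.55`, and the two slices sum to 1 as a probability distribution must.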

Let's now consider a concrete question. I have a Bernoulli event - for example, visitors converting on a webpage. I've now run an experiment and measured 50 conversions out of 200 visitors. Computing the posterior on the true conversion rate is a straightforward matter that I've discussed previously:

But the question arises - what *empirical* conversion rate can we expect over the next 100 visitors? This is now a question that's straightforwardly answered with the probability monad. We do this as follows:

```
val empiricalCR = for {
  l <- new Beta(51, 151)
  n <- Binomial(100, l)
} yield (n / 100.0)
```

(In case you are wondering, this is valid scala breeze code - just `import breeze.stats.distributions._`.)

The result of this is a probability distribution describing the empirical conversion rate. We can draw samples from it and plot a histogram:

The resulting distribution is a bit wider than the distribution of the true conversion rate. That makes intuitive sense - the empirical conversion rate has two sources of variance, uncertainty in the true conversion rate and uncertainty in the binomial distribution of 100 samples.
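That intuition can be checked by simulation. Here's a stdlib-only Python sketch comparing the spread of the posterior on the true rate against the spread of the empirical rate, using the same Beta(51, 151) posterior as above:

```python
import random
import statistics

random.seed(0)

def draw_true_rate():
    # A draw from the Beta(51, 151) posterior on the true conversion rate.
    return random.betavariate(51, 151)

def draw_empirical_cr():
    l = draw_true_rate()
    # Binomial(100, l) as a sum of 100 Bernoulli draws.
    n = sum(random.random() < l for _ in range(100))
    return n / 100.0

true_rates = [draw_true_rate() for _ in range(20000)]
empirical = [draw_empirical_cr() for _ in range(20000)]
```

The two samples have nearly the same mean, but `statistics.stdev(empirical)` comes out larger than `statistics.stdev(true_rates)` - the extra binomial variance of 100 visitors is added on top of the posterior uncertainty.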

In the realm of pure mathematics, what's happening here is pretty simple.

When you have a probability distribution on a space `T`, you have a measure `mu` mapping (some) subsets of `T` into `[0,1]`. I.e. you have a function `mu: Meas[T] => [0,1]`. Here `Meas[T]` represents the measurable subsets of `T` - for simplicity, if `T` were simply the integers, then `Meas[T]` could just be all sets of integers.

A nondeterministic model would be a function taking each point of `T` to a measure on `U`, i.e. `f: T => (Meas[U] => [0,1])`. Then the `map` operation `mu.map(f)` would have type `mu.map(f): Meas[T x U] => [0,1]` - i.e., it would be a measure on the product space `T x U` consisting of pairs of elements `(t, u)`.

Finally, the `flatMap` operation would consist of mapping, and then integrating over the `T` variable.

As noted earlier, this is described in much greater detail in A Categorical Approach to Probability Theory (by Giry) and A Categorical Foundation for Bayesian Probability (by Culbertson and Sturtz). So this approach is both practical and also on solid theoretical footing.

We can of course do the same calculations manually. In python, the following vectorized code seems to work in this particular case:

```
from scipy.stats import beta, binom

l = beta(51, 151).rvs(1024 * 1024)
n = binom(100, l).rvs()
```

But this would be trickier in more general cases. For example:

```
val funkyDistribution = for {
  l <- new Beta(51, 151)
  d <- if (l < 0.5) { Normal(5, 1) } else { Normal(-5, 1) }
} yield (d)
```
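In the sampling representation, though, even this case stays a one-liner per step. A stdlib-only Python sketch of `funkyDistribution`:

```python
import random

def funky_draw():
    # Draw the rate from the Beta posterior, then pick a normal based on it.
    l = random.betavariate(51, 151)
    mu = 5.0 if l < 0.5 else -5.0
    return random.gauss(mu, 1.0)

samples = [funky_draw() for _ in range(1000)]
```

(With Beta(51, 151) the rate is almost never above 0.5, so in practice the draws cluster around 5; the mixture structure only matters for posteriors with mass on both sides of the threshold.)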

It's always possible, of course, to simply hack something together. Because this approach is on solid theoretical footing, one can derive results solely within the language of probability theory, and then take the result and turn it into python code.

But I personally favor the programmatic approach - it's always easier for me when the theory maps directly onto the code.

Probability is a monad. This allows us to take probabilistic models with *deterministic* inputs and flatmap them together to build full-on probabilistic models. This can be done mathematically (in order to derive a model which is then implemented), and much of it can be done directly in code. It's a great way to make statistical models composable, which is a very important real-world consideration. Deriving probabilistic models from deterministic inputs is easy, and chaining easy steps together is usually a lot more straightforward than solving the full problem in one shot.