Pilot Programs: Don't Believe the Hype
They're exciting! They're innovative! And they're probably flukes.
Every so often, I read about a super exciting policy program that's having dramatic results with some tough-to-solve problem like child poverty. I am filled with enthusiasm about the potential for resolving some terrible harm. And then I remember that successful pilot programs rarely pan out:
Sometimes the "success" of the earlier project was simply a result of random chance, or of what researchers call the Hawthorne Effect. The effect is named after a factory outside of Chicago which ran tests to see whether workers were more productive at higher or lower levels of light. When researchers raised the lights, productivity went up. When researchers lowered the lights, productivity also went up. Obviously, it wasn't the light that boosted productivity, but something else--the change from the ordinary, or the mere act of being studied.
Sometimes the success was due to what you might call a "hidden parameter", something that researchers don't realize is affecting their test. Remember the New Coke debacle? That was not a hasty, ill-thought-out decision by managers who didn't care about their brand. They did the largest market research study in history, and repeated it several times, before they made the switch. People invariably told researchers they loved the stuff. And they did, in the taste test. But they didn't love the stuff when it cost them the option of drinking old Coke. More importantly, they were being offered a three-ounce cup of the stuff in a shopping mall lobby or supermarket parking lot, often after they'd spent an hour or so shopping. New Coke was sweeter, so (like Pepsi before it) it won the taste test. But that didn't mean that people wanted to drink a whole can of the stuff with a meal.
Sometimes the success was due to the high-quality, fully committed staff. Early childhood interventions show very solid success rates at doing things like reducing high school dropout and incarceration rates, and boosting employment in later life. Head Start does not show those same results--not unless you squint hard and kind of cock your head to the side so you can't see the whole study. Those pilot programs were staffed with highly trained specialists in early childhood education who had been recruited specially to do research. But when they went to roll out Head Start, it turned out the nation didn't have all these highly trained experts in early childhood education that you could recruit specially--and definitely not at the wages they were paying. Head Start ended up requiring a two-year associate's degree, and recruiting from a pool that included folks who were just looking for a job, not a life's mission to rescue poor children while adding to the sum of human knowledge.
Sometimes the program becomes unmanageable as it gets larger. You can think about all sorts of technical issues, where architectures that work for a few nodes completely break down when too many connections or users are added. Or you can think about a pilot mortgage modification program. In the pilot, you're dealing with a fixed group of people who are already in default, and in every case, both the bank and the individual are better off if you modify the mortgage. But if you roll the program out nationwide, people will find out that they can get their mortgages modified if they default . . . and then suddenly the bank isn't better off any more.
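The mortgage example is an adverse-selection problem, and you can see the arithmetic with a toy sketch. All the numbers below are made up for illustration; the point is only that a modification beats foreclosure on a loan that's already in default, while a standing nationwide offer can induce defaults by borrowers who would otherwise have paid in full.

```python
# Toy arithmetic (all numbers hypothetical) for the mortgage-modification
# example: modifying a defaulted loan beats foreclosure, but a nationwide
# program can induce strategic defaults by borrowers who would have paid.
FULL_VALUE = 100_000         # bank's recovery if the borrower just pays
MODIFIED_VALUE = 80_000      # bank's recovery on a modified loan
FORECLOSURE_VALUE = 60_000   # bank's recovery from foreclosing

# Pilot: 1,000 loans, all already in default.
pilot_without = 1_000 * FORECLOSURE_VALUE
pilot_with = 1_000 * MODIFIED_VALUE
print(f"pilot gain to the bank: {pilot_with - pilot_without:+,}")   # +20,000,000

# Nationwide: suppose that for every genuine default, two borrowers who
# would have paid in full default strategically to qualify.
genuine, strategic = 1_000, 2_000
national_without = genuine * FORECLOSURE_VALUE + strategic * FULL_VALUE
national_with = (genuine + strategic) * MODIFIED_VALUE
print(f"nationwide gain to the bank: {national_with - national_without:+,}")  # -20,000,000
```

The same modification terms flip from a clear win in the pilot to a clear loss at scale, purely because scale changes who shows up in default.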
Sometimes the results are survivor bias. This is an especially big problem with studies of health care and of the poor. Health care, because compliance rates are quite low (by one estimate I heard, something like three-quarters of prescribed blood pressure medication is no longer being taken nine months in), and the poor, because their lives are chaotic and they tend to move around a lot, so they may have to drop out, or may not be easy to find and re-enroll if they stop coming. In the end, you've got a study of unusually compliant and stable people (who may be different in all sorts of ways)--and oops! That's not what the general population looks like.
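A toy simulation makes the mechanism concrete. All the numbers here are invented for illustration: a drug that works only for patients who take it, and non-compliant patients who mostly vanish before follow-up. The study, which can only average the patients it can still find, overstates the drug's real-world effect.

```python
import random

random.seed(0)

N = 10_000
patients = []
for _ in range(N):
    compliant = random.random() < 0.25        # ~1/4 actually keep taking the drug
    baseline = random.gauss(150, 10)          # starting blood pressure
    effect = -10 if compliant else 0          # drug only works if taken (hypothetical)
    followup = baseline + effect + random.gauss(0, 5)
    # Non-compliant patients are far more likely to be lost to follow-up.
    dropped = (not compliant) and random.random() < 0.8
    patients.append((baseline, followup, dropped))

# True average change across everyone who was prescribed the drug.
true_effect = sum(f - b for b, f, _ in patients) / N

# What the study sees: only the patients who showed up for follow-up.
observed = [(b, f) for b, f, d in patients if not d]
observed_effect = sum(f - b for b, f in observed) / len(observed)

print(f"true average change:     {true_effect:.1f}")
print(f"observed average change: {observed_effect:.1f}")
```

The observed effect comes out roughly twice the true one, not because anyone cheated, but because the survivors are disproportionately the compliant patients.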
Here's a telling example from Mike Munger:
The paper illustrates the point by undertaking two different RCTs on cowpea seeds in Tanzania. One is a traditional study where the control group knows it is getting traditional seeds and the treatment group knows it is getting modern seeds. The second is double-blind; neither group knows what seed it is getting.
The traditional RCT shows a significant increase of over 20% in yields from the modern seed. But the double-blind RCT shows that virtually all of that improvement comes from changed behavior, not from any inherent effectiveness of the modern seed.
Specifically, the average treatment effect in the double-blind RCT was zero! And when the harvests in the control groups across the two RCTs were compared, the blind control group showed a significant increase of over 20% over the traditional RCT control group, which knew it was getting the traditional seeds. This is the "pseudo-placebo" effect, and it explains the entire average treatment effect in the traditional RCT.
In other words, the significant effect found in the traditional RCT was not due to better seeds; it was due to actions taken by the farmers who thought they were getting better seeds (they planted them in larger plots, with more space between the plants, on better quality land). These farmers' expectations were wrong (in post-experiment surveys, over 60% of them said they were disappointed in the yields), and the significant effect in the traditional RCT would not survive over time, because the farmers, having adjusted their expectations downward, would stop taking the actions that produced the "success".
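The decomposition Munger describes can be sketched as a toy simulation (all numbers hypothetical, not taken from the paper): assume the seed itself is inert, and a farmer who thinks she might have the modern seed works her plot harder for roughly a 20% gain. That one assumption reproduces all three findings at once--the big effect in the traditional RCT, the zero effect in the double-blind RCT, and the blind control group out-yielding the traditional control group.

```python
import random

random.seed(1)

def plot_yield(might_have_modern_seed: bool) -> float:
    """Toy model (hypothetical numbers): the modern seed is inert, but a
    farmer who thinks she *might* have it works the plot harder -- larger
    plots, wider spacing, better land -- for roughly a 20% gain."""
    base = 100.0
    effort = 20.0 if might_have_modern_seed else 0.0
    return base + effort + random.gauss(0, 5)

def avg(n: int, believes: bool) -> float:
    return sum(plot_yield(believes) for _ in range(n)) / n

n = 2_000
# Traditional RCT: assignments are known, so only the treatment group
# believes it has the modern seed.
trad_control, trad_treat = avg(n, False), avg(n, True)
# Double-blind RCT: neither group knows, so both groups behave as if
# they might have the modern seed.
blind_control, blind_treat = avg(n, True), avg(n, True)

print(f"traditional RCT effect:        {trad_treat - trad_control:+.1f}")    # ~ +20
print(f"double-blind RCT effect:       {blind_treat - blind_control:+.1f}")  # ~ 0
print(f"blind vs. traditional control: {blind_control - trad_control:+.1f}") # ~ +20
```

Nothing about the seed needs to work for the traditional RCT to show a large, "significant" effect; belief-driven effort does all of it.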
It's often quite hard to do double-blind studies in the field--it's hard to give people cash, or indoor plumbing, without letting them know about it. Frankly, you're often lucky to get an RCT. What Munger's post is telling us is that even a well-designed study often has hidden problems that there is no way to discern just by reading the paper, or even necessarily by talking to the participants.
Of course, the fact that pilot programs rarely scale is not an argument for never doing anything. Rather, it's an argument for never banking on their results. Don't count cost savings in your budget, or improved human-development indicators in your long-range plans.
Every time you see an account of some exciting new study, and are tempted to believe that we know how to fix developing world poverty, or health care cost inflation, or some other pressing problem, go back and read Munger's article. And remember that even a few good tests may not be telling you what you think.