How Not to Cherry-Pick the Results of the Oregon Study (Ultrawonkish)
A look at statistical significance, and what Oregon can really tell us
Last week, I asked Jim Manzi for his thoughts on the Oregon health care experiment. Manzi is a very smart guy who has founded a very successful company that helps other companies do experiments. He is also the author of the terrific Uncontrolled, a book about using randomized controlled trials to improve business, policy, and life in general. Jim was kind enough to send me his very long, very smart, very wonkish thoughts, which you'll see below. If you have any interest in Oregon, or just want to be smarter about issues in evaluating social science, you should read this all the way through.
Some Observations on the Oregon Health Experiment
As a vocal proponent of using randomized experiments to inform policy debates, I have followed the discussion surrounding the recent Oregon Experiment with great interest. I think the only thing I’ve previously written for publication on the topic of health care finance was a review of the RAND Health Insurance Experiment. This is the only other randomized experiment of which I am aware that tested the impacts of varying levels of generosity of health care coverage on physical health. The RAND experiment concluded that (1) lower levels of coverage “reduced the use of nearly all health services,” but that (2) this reduction in services “had no adverse effect on participants’ health.” As a casual observer of the topic, that struck me as a fairly important result.
The Oregon Experiment has replicated the first part of the RAND result: Providing free health care coverage increased the use of health care services. However, a debate has arisen among Austin Frakt, Kevin Drum, Avik Roy, Megan and others around the second part: Did this increase in use of health care services lead to measurable improvements in physical health?
This debate has been very informative, but here are a few key points that I don’t think have been very widely noted:
1. Almost half the people who were offered free health insurance coverage didn’t bother to send back the application to get it.
About 90,000 people applied for the health insurance lottery (though selection was done at the household level). About 35,000 people won the lottery, and thus had the right to submit an application, but only about 60% of these lottery winners actually sent the application back. This ought to tell any common sense person a lot about the revealed preference for how much the uninsured value the coverage on offer.
If your mental model of the uninsured is a poor family huddled outside of a hospital unable to find any way to pay for a doctor to give antibiotics to their coughing child, then this result doesn’t make a lot of sense. I’m not trying to trivialize the difficult struggles that poverty imposes that can make it very difficult to accomplish seemingly-straightforward tasks. But there are only two possibilities for why a given lottery winner failed to submit the application and get guaranteed health coverage. Either: (1) a rational analysis indicated that the expected gain from the coverage being offered didn’t justify the time and effort of filling out a form and submitting it; or (2) the winner acted irrationally about long-term benefits versus immediate inconvenience. Neither represents a good argument for the efficacy of the insurance. If it is the first case, it means the insurance isn’t worth much for lots of recipients. If it is the second case, it means that consistent compliance over months and years with many of the therapeutic regimens necessary to achieve improvement on the physical health outcomes measured in the experiment – blood pressure, blood sugar and cholesterol – is not likely to be very good.
Of course, one could argue that this only applies to the 40% who did not submit the application. But there are a couple of big problems with this.
First, it would create serious problems for the analysis. It would indicate that there are large, systematic differences in rationality plus conscientiousness (collectively, I’ll call this prudence) between 60% of the winners (the “compliers”) and 40% of the winners (the “non-compliers”). This would mean that you couldn’t just compare the people who won the lottery and submitted the forms to those who lost the lottery. We don’t know who among the lottery losers would have been the ones to submit the application if they had won, so we would have to compare those who got the coverage only to the prudent losers of the lottery. The researchers, of course, built econometric models to match compliers with members of the control group who “look like” them in terms of race, sex, prior health and so forth. But unless you want to argue that you can reliably tell me who is prudent or not – to a high degree of precision, because the purported physical health effects are extremely small – based on race, sex, prior health and so on, then the assumption of a hard break in prudence between compliers and non-compliers leads you to place much more emphasis on the “intent-to-treat” estimates of effects, which rely on comparing all of the lottery winners to the lottery losers. In this study, according to data in the Supplemental Appendix, intent-to-treat effects are only about one-fourth as large as the estimated effects for only the compliers – this would eliminate 75% of the estimated impact.
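The gap between the intent-to-treat estimate and the complier estimate can be sketched with the simplified textbook ratio (often called the Wald estimator). This is an illustration, not the study's actual models. The 0.25 "first stage" below is an assumption inferred from the "about one-fourth" figure quoted above, and the function name is hypothetical.

```python
def late_from_itt(itt_effect: float, first_stage: float) -> float:
    """Scale an intent-to-treat effect up to an effect on compliers.

    first_stage is the share of the lottery winners whose coverage status
    was actually changed by winning (an assumed input here).
    """
    if not 0 < first_stage <= 1:
        raise ValueError("first_stage must be in (0, 1]")
    return itt_effect / first_stage


# Assumption: ITT effects are about one-fourth of the complier effects.
first_stage = 0.25

# Normalise the complier effect to 1 unit and see what ITT reports.
complier_effect = 1.0
itt_effect = complier_effect * first_stage

# Share of the estimated impact that disappears if we rely on ITT alone.
share_removed = 1 - itt_effect / complier_effect

print(f"ITT effect: {itt_effect:.2f} of the complier effect")
print(f"Share of estimated impact removed: {share_removed:.0%}")
```

Under that assumed ratio, falling back on the intent-to-treat comparison keeps one quarter of the headline estimate and drops the rest.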
Second and more important, this is an unrealistic assumption. It is highly unlikely that there is some hard break in prudence between those who submitted the form and those who did not. Prudence is much more plausibly distributed in varying shades of grey in the relevant population – 60% happened to cross the threshold for getting that specific application submitted in that specific 45-day window. On average, you would expect the compliers to be more prudent than the non-compliers, but some degree of the problem of low prudence is then likely to be present in a significant fraction of those who received free coverage. That is, the average person who was granted free coverage very likely has a less prudent approach to health-related decision-making than the average reader of this blog post. As one obvious piece of evidence, about half (48.4%) of those who received free coverage in the Oregon Experiment smoked, as compared to fewer than 25% of American adults between 19 and 64 in the relevant time period. (And yes, I’ve read Orwell, and understand that in their shoes I might very well act the same way.) This is not a moral judgment on the recipients; it is an argument for why it is extremely plausible that the population granted coverage has less than educated, middle class American standards of prudence about compliance with therapeutic regimens, which in turn makes very small positive impacts on chronic health conditions as a result of coverage more understandable. We can therefore look with more confidence on the coverage results for the compliers in the experimental analysis. These results are fascinating.
2. If you accept that the statistically insignificant findings of this experiment are reliable estimates of effects, then providing health coverage caused sick people to have worse cardiovascular health …
The test group had a lower proportion of people diagnosed with hypertension, high cholesterol and elevated blood sugar than did the control group. None of these was statistically significant, meaning roughly speaking that we can’t reject with 95% probability the possibility that the measured difference between the two groups is the result of measurement noise.
This does not mean that we can say that coverage had no effect. In fact, I can tell you with certainty that it had some effect on every one of these outcomes. No non-trivial action has literally zero effect on anything in the universe. Further, if you asked me for the proverbial I-have-to-make-a-wild-guess-with-a-gun-to-my-head stab at the most likely true effects of providing coverage to the test subjects in Oregon, I would use exactly the estimates provided in the paper. But I would have very low confidence in these guesses, and would be very hesitant to act on them as a practical decision-maker.
What the experiment does tell us with some confidence is that these effects are very likely smaller than X (though we can’t be absolutely, philosophically certain about even that). The core of the debate has been about whether it is possible that the true effects of coverage are smaller than X, but still big enough to matter. This is frequently couched as asking: What was the “power” of the experiment?
We can describe the power of an experiment as being how small an effect it can reliably detect given the background variation in all physical health metrics. Think of power as being analogous to the magnification power of the microscopes you probably used in high school. If I try to use a child’s microscope to carefully observe a section of leaf looking for an insect a little smaller than an ant, and I do not observe it, I can reliably say that “I don’t see the insect, and therefore there is no bug there.” But if I use the same microscope to try to find a tiny microbe on that same section of the same leaf, all I can say is that “It’s all kind of fuzzy…I see a lot of little squiggly things…I guess that little black squiggle there might or might not be something.” I can’t reliably say that there is no microbe there, because as I try to zoom in closer and closer to look for something that small, all I see is a bunch of fuzz. My failure to see a microbe is a statement about the precision of my instrument, not about whether there is a microbe on the leaf.
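The microscope analogy can be made concrete with a standard back-of-the-envelope formula for the smallest difference in proportions a two-group comparison can reliably detect. The sketch below uses a normal approximation, and every number in the usage lines (baseline rate, group size, 80% power) is a hypothetical input rather than a figure from the Oregon study.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(p_baseline: float, n_per_group: int,
                              alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest true difference in proportions detectable with the given power.

    Normal approximation for two equal-sized groups; assumes the outcome
    rate is close to p_baseline in both groups.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_power = z.inv_cdf(power)           # extra margin needed for the power target
    std_error = sqrt(2 * p_baseline * (1 - p_baseline) / n_per_group)
    return (z_alpha + z_power) * std_error


# Hypothetical inputs: 16% baseline rate of some condition.
mde_small = minimum_detectable_effect(0.16, n_per_group=6_000)
mde_large = minimum_detectable_effect(0.16, n_per_group=24_000)

print(f"Detectable effect with  6,000 per group: {mde_small:.4f}")
print(f"Detectable effect with 24,000 per group: {mde_large:.4f}")
```

Quadrupling the sample only halves the smallest detectable effect, which is why looking for ever-smaller effects gets expensive so quickly.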
I think it is fair to characterize the relevant part of the argument that Frakt, Drum and others have made as: (1) the experiment has the power to detect physical health effects for blood pressure, blood sugar and cholesterol of some specific size X with 95% significance; (2) it has not detected effects of this size with 95% significance; (3) this doesn’t rule out effects smaller than X; (4) this also does not imply that the best available estimate for each of the physical health effects based on this experiment is zero, and (5) in fact, based on the experiment the best available estimate for each of these effects is positive. With all the caveats I provided above about the extreme unreliability of these estimates, I agree with this series of statements. However, this is very incomplete.
Start by noting that just about any change in probability of mortality, when multiplied by a huge population of uninsured in America is going to be a very big number in absolute terms. As an illustrative example, a reduction of 0.0001 in the probability of death over a ten-year horizon multiplied by 50 million uninsured people means giving 5,000 human beings a chunk of their lives back. I’d call that morally significant. I’m very skeptical that any real-world experiment or other analysis could ever reliably detect an effect that small. Therefore, the argument that “well, this effect has not been determined by your experiment, but could still be big enough to matter” is in practice non-falsifiable.
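The arithmetic in the illustrative example, and a rough sense of why an effect that small is so hard to detect, can be sketched as follows. The 5% baseline ten-year mortality, the 80% power target, and the function name are all assumptions made for illustration; only the 0.0001 reduction and the 50 million population come from the text above.

```python
from math import ceil
from statistics import NormalDist

# Figures from the text: a 1-in-10,000 reduction across 50 million people.
uninsured = 50_000_000
lives = uninsured // 10_000          # integer arithmetic avoids float noise
print(f"Lives affected: {lives:,}")


def n_per_group_needed(p_baseline: float, delta: float,
                       alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate group size needed to detect a difference delta in proportions.

    Normal approximation, two equal-sized groups, outcome rate near p_baseline.
    """
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    variance = 2 * p_baseline * (1 - p_baseline)
    return ceil(z_total ** 2 * variance / delta ** 2)


# Hypothetical baseline: 5% probability of death over ten years.
needed = n_per_group_needed(p_baseline=0.05, delta=0.0001)
print(f"Needed per group: {needed:,}")
```

Under these assumed inputs the required group size comes out larger than the entire uninsured population, which is the practical sense in which such a claim is non-falsifiable.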
So why even bother with any analysis? First, because there’s no free lunch. Typically any intervention will have some tangible costs, impose some risks and/or foreclose other options that might have better payouts. Second, because it is not certain that coverage will reduce rather than increase mortality. It is at least theoretically possible that applying coverage could have negative consequences.
So if we were to use the statistically insignificant estimated impacts in order to inform an evaluation of the costs and benefits of health care coverage, I think we would need to start building the benefits case by finding some way to combine the predicted long-run actual health (mortality, etc.) impacts of these estimated improvements in the predictive metrics of blood pressure, blood sugar and cholesterol. Something, in other words, a lot like the Framingham Risk Score.
The Framingham Risk Score predicts 10-year risk of cardiovascular disease based on age, cholesterol levels, blood pressure, blood sugar, use of medication for high blood pressure, and smoking. The researchers in the Oregon Experiment estimated the impact of coverage on Framingham Risk Score for those who started the period sick (defined as those with diabetes, hypertension, hypercholesterolemia, myocardial infarction, or congestive heart failure before the experiment started). By these estimates, coverage increased the Framingham Risk Score for those who were sick. That is, it made overall cardiovascular health of sick people worse. And this estimated effect was far closer to statistical significance (p = 0.24) than the estimated effects of coverage on any measurements of elevated blood pressure (p = 0.65), elevated blood sugar (p = 0.61) or high total cholesterol (p = 0.37).
What’s sauce for the goose is sauce for the gander. If we want to accept that the impacts on the components are positive, even though not statistically significant, then there is no principled reason to throw out the result that the overall impact on heart health for the sick is negative.
This negative (statistically insignificant) impact on cardiovascular risk is, however, a seeming mystery. How can overall risk get worse when blood pressure, blood sugar and cholesterol are all getting better? Because of smoking.
3. …in large part because it induced them to smoke more.
42.8% of those in the control group without coverage smoke. In the coverage group this increases to 48.4%. This difference is also not statistically significant, though it is closer to significance (p = 0.18) than the impact on Framingham Risk Score, and therefore far closer to significance than the positive impacts on blood pressure, blood sugar or cholesterol. This isn’t something esoteric or hard to follow: 48% of those who got coverage ended up smoking versus 43% of those who did not. That is really, unambiguously bad for the health of the coverage group.
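To see what p = 0.18 implies about the precision of the smoking estimate, one can back out an approximate standard error and 95% interval from the reported figures. This is a sketch under a normal approximation, which is an assumption added here; the paper's own standard errors are not reproduced.

```python
from statistics import NormalDist

control_rate = 42.8      # percent smoking, control group (from the text)
coverage_rate = 48.4     # percent smoking, coverage group (from the text)
p_value = 0.18           # two-sided p-value (from the text)

effect = coverage_rate - control_rate            # percentage points

# Under a normal approximation, a two-sided p-value maps to a z-score.
z = NormalDist().inv_cdf(1 - p_value / 2)
implied_se = effect / z

# Approximate 95% interval around the point estimate.
z95 = NormalDist().inv_cdf(0.975)
ci_low = effect - z95 * implied_se
ci_high = effect + z95 * implied_se

print(f"Effect: {effect:.1f} points, implied z: {z:.2f}, implied SE: {implied_se:.1f}")
print(f"Approximate 95% interval: {ci_low:.1f} to {ci_high:.1f} points")
```

The interval straddles zero, which is what "not statistically significant" means here, but most of it lies on the side of more smoking.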
It is quite plausible that the provision of coverage would cause people to undertake riskier behavior – this is the famous “mandatory seat belt laws will cause people to drive more recklessly” argument. In this context, it is intuitive that some of the people who get coverage feel less anticipated cost to the illnesses that may be created by smoking, and are therefore more likely to smoke (about 5.6% of them, apparently). If you are a person who sets much store by non-experimental analyses of the behavioral impacts of social interventions (I am not such a person), this confirms rather than contradicts several prior non-experimental studies that argue providing health care coverage does in fact increase risky behavior.
So, the (statistically insignificant) series of effects here is ironic and pernicious. Some people are granted health care coverage. This leads to greater utilization of health care services, which in turn leads to some very small improvements in some physical health indicators. But, the fact of coverage also leads some subset of these people to smoke who otherwise would not. The net effect is to make those who started the experiment sick have a worse cardiovascular prognosis than they would have absent the coverage.
To be fair, average Framingham Risk Score for the total population went down by a tiny and extremely statistically insignificant (p = 0.76) amount. This is a reduction in cardiovascular risk of 0.2%. So, if we push this non-statistically-significant logic to an extreme, we could argue that the best guess for the overall cardiovascular risk impact of coverage in this experiment was to make the total population very slightly healthier, basically by redistributing health from those who start off sick (who end up materially worse off) to the healthy (a much larger group, the average of whom ends up very slightly better off).
And it’s not at all clear, even without considering “regressive health redistribution” – never mind the costs of providing coverage, foreclosure of other reform options and everything else – that this would really represent an improvement for the population as a whole. Once we consider the set of effects of coverage (very small estimated impacts on physical health indicators traded off against the increases in smoking), the left side of the confidence interval starts to get very relevant: It becomes much more plausible, though slightly less than an even-money bet, that we could actually have reduced average cardiovascular health in this experiment. Therefore, even before considering any costs associated with the provision of this care, we have to risk-adjust the “bet” we would be taking with population health by expanding coverage.
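One rough way to put a number on "slightly less than an even-money bet" is to ask how likely it is that the true effect has the opposite sign from the estimate. The sketch below assumes a normal approximation and a flat prior, both of which are assumptions added here and not part of the study; under them, that probability is half the two-sided p-value.

```python
from statistics import NormalDist

p_value = 0.76   # two-sided p-value for the total-population Framingham result

# Convert the p-value to a z-score (distance of the estimate from zero,
# measured in standard errors).
z = NormalDist().inv_cdf(1 - p_value / 2)

# With a flat prior, uncertainty about the true effect is centred on the
# estimate, so the chance of the opposite sign is the tail beyond zero.
prob_wrong_sign = NormalDist().cdf(-z)

print(f"z-score: {z:.2f}")
print(f"Chance the true effect has the opposite sign: {prob_wrong_sign:.0%}")
```

This is a Bayesian reading of a frequentist number, so it should be taken as an intuition aid and not as something the p-value strictly licenses.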
Suppose you were offered the opportunity to participate in the following bet: You have a 60% chance of getting $1 million and a 40% chance of losing $1 million (and if your personal net assets are less than $1 million, you go directly into personal bankruptcy, lose all of your assets and have all of your wages above subsistence level garnished for the rest of your life until $1 million plus interest is repaid). Would you take this bet? I am confident that most people would not if playing for real money, in spite of the fact that it has a positive expected present value. This is because we have to be paid to take on risk. This kind of risk adjustment is not a practical issue if you have 95% significance or anything close to it, but with such a tiny average effect with big estimated negative effects for the unhealthy, and p = 0.76, it becomes central to a rational decision calculus.
In summary, based on statistically insignificant effects of coverage from the Oregon Experiment: (1) The effects that are closest to statistical significance are that coverage would increase the rate of smoking and damage the cardiovascular prognosis of sick people; (2) the best estimated net effect on total population cardiovascular health is extraordinarily tiny; (3) this effect would be achieved by making the sick sicker, while very slightly improving the health of already healthy people; and (4) this effect is almost certainly unattractive on a risk-adjusted basis. This is not a series of effects that makes a very attractive argument for an increase in health from the experiment.
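The bet can be worked through numerically. The sketch below compares its expected dollar value with its expected utility for a risk-averse person. The log utility function, the $300,000 starting wealth, and the $10,000 "subsistence" floor after bankruptcy are all hypothetical choices made for illustration; only the 60/40 odds and the $1 million stake come from the text.

```python
from math import exp, log

p_win, p_lose = 0.6, 0.4
stake = 1_000_000

# Hypothetical person: net assets below the stake, so losing means ruin.
wealth = 300_000
subsistence_floor = 10_000   # assumed value of what is left after bankruptcy

# Expected dollar value of the bet itself.
expected_value = p_win * stake - p_lose * stake

# Expected log utility of taking the bet versus declining it.
utility_take = p_win * log(wealth + stake) + p_lose * log(subsistence_floor)
utility_decline = log(wealth)

# Certain wealth that would feel as good as taking the bet.
certainty_equivalent = exp(utility_take)

print(f"Expected value of bet: ${expected_value:,.0f}")
print(f"Certainty equivalent of taking it: ${certainty_equivalent:,.0f}")
print("Take the bet?", utility_take > utility_decline)
```

With these assumed inputs the bet has a positive expected value yet leaves the person worse off in utility terms, which is the sense in which a risk has to be paid for.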
4. So What?
When interpreting the physical health results of the Oregon Experiment, we either apply a cut-off of 95% significance to identify those effects which we will treat as relevant for decision-making, or we do not. If we do apply this cut-off (as the authors did; as is consistent with accepted practice for medical RCTs; and as is what I believe to be a good way to make decisions based on experiments), then we should agree with the authors’ conclusion that the experiment “showed that Medicaid coverage generated no significant improvements in measured physical health outcomes in the first 2 years.” If, on the other hand, we wish to consider non-statistically-significant effects, then we ought to conclude that the net effects were unattractive, mostly because coverage induced smoking, which more than offset the risk-adjusted physical health benefits provided by the incremental utilization of health services.
This does not mean that the Oregon Experiment proved that coverage does not help health. This discussion considered only some effects only over a two-year period in one state in one time period. To step back even further, I have not made any mention of the findings of an impact on measured depression, nor of any potential effects on other health outcomes such as cancer, nor on the financial benefits to those covered, nor on the potential effects of greater long-term utilization of health services on health.
I don’t know how to reform American health care, and I don’t think this experiment holds any secret key to the debate. But I do think this experiment should make everybody who is confident that “roll out something like our current system via an insurance mandate” is a good answer more humble about that belief, if it is premised even in part on the idea that this reform will make more sick people well. I think we remain ignorant about what the real effects of proposed reforms would be. As Ezra Klein argued, one of the things that this study shows (to beat my own drum) is that we would likely make much better decisions on this subject if we did a lot more such experiments.