Asymmetrical Information - Megan McArdle

Why Don't More Social Scientists Share Their Data?

It's not malevolent. But the results of data secrecy can be.

04.18.13 5:28 PM ET

While we're on the subject of Reinhart and Rogoff, Tyler Cowen has a nice little post on "Who shares data". Basically, most economists don't share data, and the ones who do are more likely to be full professors with tenure and a clear personal commitment to sharing.

This seems to be pretty common in a lot of social sciences.  When I was reporting on the controversy over counting Iraqi civilian casualties, there was a strong wall around the data.  Only a few researchers were allowed to see it, and being a strong critic of the study seemed to be an exclusion criterion for access.  I was shocked--until public health researchers assured me that no, this was pretty much normal.

Which caused me to think about who doesn't share data--and why.  

Before I go on, let me be clear that I'm not talking about Rogoff and Reinhart, who are in the middle of this controversy because they did share their sources. And kudos to them for that.  No, this is a general meditation on a broad problem in the social sciences. 

Anyhoo.  As I was saying. There are a lot of private data sets out there these days, and a lot of work being produced off of them.  Why can't more of us see the data?

Mostly, I suspect, because of the economics of the thing.  Assembling a nice private data set is a huge amount of work.  You want to be able to mine that work for publishable insights.  Very little professional credit accrues to the guy who built a great dataset which everyone else uses to generate elegant new findings.  The credit goes to the authors of the elegant new findings.  Which means that once you've built a dataset, you want to keep the thing to yourself as long as possible.  

If it takes three or four years to assemble a dataset, and you only get one paper out of it before everyone else swoops in, no one will want to build a dataset.  The swoopers who never build their own data sets will end up with more published papers, and better careers, than the plodders who put the data together. In other words, the incentives are all wrong for data openness.  

Unfortunately, there's always the possibility that "I want to hold onto it as long as possible so I can publish" can be a cover for "I need to hold onto it as long as possible because if anyone else sees it, it will rapidly become obvious that my results aren't very robust."  Or, in some cases, for "There are no results because I made the whole thing up."  

If we're worried about the latter problems, however, we need to first address the former issue.  As long as holding onto your data for a long time is an acceptable professional norm, weak results and outright cheats will be able to hide behind it.

But encouraging people to release their data means a deep reassessment of what counts as valuable, high-status work.  As long as the social sciences prize analysis over data-building--and they do--then data-builders will be at a competitive disadvantage.  Which they will try to rectify by keeping their data secret as long as possible.