May 21, 2013
I Got 200 Million Problems, But Multicollinearity Ain't One
When even David Brooks, Herodotus of the Bobos, is waxing lyrical about data and empiricism, you know that data science has become mainstream. Drew Conway is right that the phrase is rather clumsy, but so are many other things in social science. If the mad awesome, state-of-the-art work Conway does is the data equivalent of the mouth-watering Chinese restaurants I go to during my summer jaunts back to LA, the now de rigueur pretty-looking bloggy data visualization is the bland but dependable PF Chang's. Both are great, but only King Hua is going to get you that great dim sum. 1
Look past my questionable Chinese food analogy and the nature of the problem becomes apparent. Pretty pictures that answer big questions are becoming hotter than Hairless Cats That Look Like Putin. In some ways, this is a good thing. It means fewer listicles and GIFs, less argument by analogy, and more evidence. And we certainly need more of that. I've spent the last week trying and failing to write a follow-up post to my Benghazi piece here from last December due to the sheer amount of derp on that subject, to say nothing of "Syria is Vietnam/Rwanda/Iraq/Sudetenland" analogy Mad-Libs.
So what's the problem? The blogospheric blow-up over a controversial map of global racial tolerance illustrates some larger tensions inherent in the move towards data journalism. Daniel Drezner has a good summary of what went down after The Washington Post created a visualization of a paper on the geography of racism. The basics are that the World Values Survey (a cross-national survey) data turned out to be fairly rough, the operationalization of the research question was dodgy, and there are underlying conceptual issues with varying perceptions of race in a large-N study.
A great hue and cry arose on both blogs and Twitter. Ultimately, Drezner and Jay Ulfelder are right that the harsh criticism of Max Fisher's visualization is overwrought. It's not hard to find questionable uses of cross-national data (especially in the academic literature) for a very simple reason: cross-national data is often very messy. As Nathan Jensen noted of an attempt to analyze Big(-ish) Data, "[d]ata quality is a serious issue. When using a cross-national dataset, I look at the individual observations to make sure nothing looks odd." This is true not just of Big Data--any large cross-national dataset will have holes.
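Jensen's advice is easy enough to put into practice. Here's a minimal sketch of that kind of eyeballing in pandas, with the caveat that the file name and column names are hypothetical stand-ins for whatever cross-national panel you happen to be wrestling with:

```python
# A quick, Jensen-style pass over a cross-national dataset: look at the
# individual observations before fitting anything. File and column names
# here are hypothetical.
import pandas as pd

df = pd.read_csv("cross_national_panel.csv")

# How big are the holes? Count missing values per column.
print(df.isnull().sum().sort_values(ascending=False))

# Scan the extremes: the largest and smallest values of a key variable
# often surface coding errors (e.g., -999 standing in for "missing").
print(df.sort_values("gdp_per_capita").head(10))
print(df.sort_values("gdp_per_capita").tail(10))

# Duplicated country-year rows are a classic artifact of sloppy merges.
print(df[df.duplicated(subset=["country", "year"], keep=False)])
```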
I painfully discovered just what Jensen meant this spring semester when I ran regressions on Correlates of War (COW) data. Even one of the oldest and most widely used datasets in international security has significant issues, issues that generations of scholars have documented. This isn't a knock on what J. David Singer built. COW, like Kenneth Waltz, made much of what we would consider modern international relations possible.
But collecting and coding systematic global data is inherently rough. Take one of the most important variables in the Interstate Wars codebook--battle deaths--and make it your dependent variable. Read a bibliographical essay on the latest Interstate Wars dataset and you'll discover that the COW data collectors had to deal with some fairly Herculean problems trying to construct said variable:
Many historical accounts of war contain only vague generalizations about battles that resulted in severe (or light) casualties. Authors frequently utilize the terms deaths and casualties interchangeably, for instance, noting in two different sentences that a specific war resulted in 1,000 casualties or 1,000 deaths, though generally the term casualties refers to the combination of the number of those who died and the number of wounded. Many sources report only total death figures, combining deaths of civilians and combatants.
There are also wide differences even within the death figures provided for a specific war. The death numbers from a variety of sources, each of which claims to be accurate, can vary widely, with one source reporting deaths that are two or three times as high as those reported in other sources. ... Although gathering fatality estimates was difficult in the past, especially in extra-state wars that were sometimes fought in remote areas, the process has not necessarily become easier in the present. Even though today there is an impressive array of nongovernmental agencies with resources devoted to gathering statistics on the costs of war (though many are primarily concerned with civilian deaths), governments have also displayed their ability to utilize technology as a means of concealing war fatality figures.
But let's say you aren't satisfied with just using the other variables in the Interstate Wars dataset to examine the dependent variable. Let's say you are a glutton for punishment and want to combine a count variable like battle-related deaths with nominal or ordinal variables you could pull from something like the POLITY Project or some development stats from the World Development Indicators.
Get ready to tear your hair out dealing with the data cleaning problems in two (or more) large cross-national datasets, standardizing different operationalizations and measures, and otherwise getting to know every possible flaw that might bias your results. By the time those little regression asterisks finally appear on your output table, you might as well be one of FP's famed bald ex-KGB doppelganger cats. Now, I'm intentionally over-dramatizing the process2--interesting research with cross-national data gets published all the time. Research with merged datasets gets published all the time. There are ways to deal with messy data that range from "anything goes" ad hoc fixes to highly sophisticated statistical techniques. But it's still hard work.
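For the curious (or the masochistic), here is a rough sketch of what that three-way merge looks like in pandas--R users can substitute their merge() calls of choice. Every file name and column name below is a hypothetical stand-in; the point is the identifier harmonization and the sanity checks, not the particular variables:

```python
# A sketch of merging COW-style war data with Polity-style regime scores
# and a WDI-style development indicator. All file and column names are
# hypothetical; the drudgery of reconciling country identifiers is not.
import pandas as pd

cow = pd.read_csv("interstate_wars.csv")      # battle deaths by country-year
polity = pd.read_csv("polity.csv")            # regime scores by country-year
wdi = pd.read_csv("wdi_gdp.csv")              # GDP per capita by country-year
crosswalk = pd.read_csv("country_codes.csv")  # maps ISO codes to COW codes

# Harmonize identifiers first: a naive merge on country names silently
# drops the "Burma"/"Myanmar"-style mismatches.
wdi = wdi.merge(crosswalk, on="iso3", how="left")

merged = (cow
          .merge(polity, on=["cow_code", "year"], how="left")
          .merge(wdi, on=["cow_code", "year"], how="left"))

# Sanity checks before any regression: how many observations failed to
# match, and does anything violate basic logic (see footnote 2's
# more-battle-deaths-than-soldiers problem)?
print(merged["polity2"].isnull().mean())
print(merged[merged["battle_deaths"] > merged["troops_deployed"]])
```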
So cross-national data is sketchy. But what about just highly granular data for one country, or even one region? GDELT has 200 million events at a very granular level of detail. You can take a look at just Syria. Cool, right? Sadly, as they say on the Internet, this is why we can't have nice things. Surprise! GDELT is messy too. Just as with my earlier comments on COW, this isn't a knock on GDELT. I have, after all, created my very own GDELT t-shirt ("200 Million Observations. Only One Boss") that I will gratuitously flaunt at an academic conference near you. 3
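To make the Syria slice concrete, here's roughly what pulling it out of a GDELT event file looks like in pandas. I'm assuming the tab-delimited export has already been given column names from the GDELT codebook, and the file name is invented; "SY" is the FIPS country code GDELT uses for Syria, and the checks at the end are the same eyeballing Jensen recommends:

```python
# A minimal sketch of slicing GDELT down to Syria events. The column
# names follow the GDELT codebook; the file name is hypothetical, and
# in practice the raw export is headerless and needs names assigned.
import pandas as pd

events = pd.read_csv("gdelt_events.tsv", sep="\t")

# ActionGeo_CountryCode uses FIPS codes; Syria is "SY".
syria = events[events["ActionGeo_CountryCode"] == "SY"]

# The single-country slice needs the same eyeballing as COW: duplicate
# event IDs, sparsely coded actors, and lopsided event-type counts all
# show up quickly.
print(syria["GLOBALEVENTID"].duplicated().sum())
print(syria["Actor1Code"].isnull().mean())
print(syria.groupby("EventRootCode").size().sort_values(ascending=False).head(10))
```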
And COW and GDELT, while messy, are at least maintained by objective and well-trained professionals. Many other sources of data--collected by governments, international organizations, advocacy groups, or sloppy military historians--will be an order of magnitude less reliable. So yes, there is no free lunch. All datasets come with big limitations, some vastly more so than others. That's part of why Sean J. Taylor wrote that making your own data ought to be the ideal. And it's also why former Abu M poster Erin Simpson tweeted "[s]ay it with me: model the data generating process."4 Or take it from Jensen: "[t]here is no way to let the data 'speak' to you. It is a confusing mess." Jensen goes on to caution that "you really need to have a plan on how to analyze it." Though he was writing about Big Data, Jensen's advice holds for most research in general, whether you are Clifford Geertz observing a Balinese cockfight or Ulfelder forecasting political unrest.
This doesn't mean that every data visualization ought to come with a caveat list longer than this already lengthy post. I've enjoyed reading Bad Hessian precisely because of its short, snappy, and often tentative posts on subjects ranging from NFL coverage to RuPaul's drag queen competition. The real fun in online data visualization lies not in the cool relationships it reveals but in the experimentation and feedback that goes on at places like Bad Hessian. Someone (like Trey Causey, who is responsible for clogging up my Instapaper and Twitter favorites beyond measure) posts a cool entry and then goes back to the drawing board after the Internet has its say. It's the social science equivalent of Kanye West or Radiohead road-testing a new song on tour, vs. releasing it on iTunes and shooting the music video.
But data has always been a tool for winning arguments on the Internet, as evidenced by the frequency of anguished invocations of the phrase "correlation is not causation" in comment threads. Some data-driven blogs, like the Post's Wonkblog, have become an integral part of the online policy conversation. As journalists and policy bloggers become much more fluent with nifty open-source tools like R, Python's pandas package, or BUGS, we'll see far more bar charts, histograms, and heat maps start to pepper our favorite blogs. And then they'll move on to greener pastures with Hadoop and Amazon EC2, natural language processing, or a network analysis library like NetworkX.
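The barrier to entry really is that low. Here's a toy example of the kind of chart that's about to pepper the policy blogosphere--a bar chart of monthly event counts, using pandas and matplotlib on a hypothetical, already-cleaned events file:

```python
# A toy bar chart of monthly event counts with pandas and matplotlib.
# The input file and its "date" column are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

events = pd.read_csv("events_clean.csv", parse_dates=["date"])

# Count events per month and plot.
monthly = events.set_index("date").resample("M").size()
monthly.plot(kind="bar", title="Recorded events per month")
plt.tight_layout()
plt.show()
```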
The problem with the coming wave of data-driven policy blogging is that the three-paragraph-max MSM blog post format doesn't mesh all that well with the complexities of either summarizing someone else's results or presenting your own. That, and as Drezner notes, it's difficult and time-consuming to dig out the holes in other people's work that might bias their results. Policy debates and journalism deadlines tend to fly by at light speed compared to the tinkering of academic-ish blogging. So expect to see more of the mistakes the Post made, and more of the overheated criticism that Drezner and Ulfelder rightly push back against in their blogs and tweets.
That being said, data blogging, and particularly the potential for Big Data blogging, is a positive thing (Abu M featured a data viz contribution by Daveed Gartenstein-Ross and Chris Albon). In the focus on detailing all the obvious ways data can go bad, critics often fail to note the equally numerous ways to BS with qualitative analysis--a tendency Causey dubs "qualsplaining." So bring (and blog!) your data, wonks of the world! You have nothing to lose but your p-values.
_____________________
1. I haven't been back to LA in a year and I still have dreams about that dim sum.
2. I did run out in the street at 4AM to shake my fists at the moon, seized by impotent rage over the unpleasant discovery that my attempt to merge three datasets resulted in a very peculiar yet nonetheless fatal data error: more battle deaths than deployed soldiers! "Curse you J. David Singer! Curse youuuuuuuu," I yelled until I realized that the night watchmen were all looking at me suspiciously. After returning to my senses, I went back and stared with guilt at my copy of Singer's Nations at War. It was surely not the illustrious political scientist's fault that I had so naively thought that I could merge three datasets without something going horribly wrong.
3. Mad that your girl (or panel discussant) loves my GDELT style? I love your passion, hater.
4. Also see Simpson's post here, which I plan to expand on in an article I am currently writing.