Finding Big Data’s Place in Conflict Analysis

Daniel Solomon recently posted a piece on how we conceptualize (and often misconceptualize) the role of big data in conflict event prediction. His post got me thinking about what role big data plays in conflict analysis. This comes on the heels of Chris Neu’s post on the TechChange blog about the limits of using crowdsourcing to track violence in South Sudan.

This is one of my favorite parts of Daniel’s post: “Acts of violence don’t create data, but rather destroy them. Both local and global information economies suffer during conflict, as warring rumors proliferate and trickle into the exchange of information–knowledge, data–beyond a community’s borders. Observers create complex categories to simplify events, and to (barely) fathom violence as it scales and fragments and coheres and collapses.”

The key question for me becomes: is there a role for Big Data in conflict analysis? Is it something that will empower communities to prevent violence locally, as Letouze, Meier and Vinck propose? Will it be used by the international community for real-time information to speed responses to crises? Could it be leveraged into huge datasets and used to predict outbreaks of violence, so that we can be better prepared to prevent conflict? All of these scenarios are possible, but I’ve yet to see them come to fruition (not to say that they won’t!). The first two are hampered by practicalities: limited local access to information, and the slow pace of bureaucratic decision making. For me the interesting one is the third, since it deals directly with an analytic process, and that is what I’ll focus on.

When we talk about prediction, we’re talking about using observed information to inform what will happen in the future. In American political science, there has been a trend toward using econometric methods to develop models of conflict risk. There are other methods, such as time-series analysis, that can be used as well. But the efficacy of these methods hinges on the quality and attributes of the data itself. Daniel’s post got me to think about a key issue that has to be dealt with if big data is going to generate valid statistical results. This key issue is the problem of endogeneity.

To start, what is endogeneity? Basically, it means that the data we’re using to predict an event is part of the event we’re trying to predict. As Daniel points out, the volume of data coming out of a region goes down as violence goes up; what we end up with is information that is shaped out of the conflict itself. If we rely on that data to be our predictor of conflict likelihood, we have a major logical problem – that data is endogenous to (part of) conflict. Does data collected during conflict predict conflict? Of course it does, because the only time we see that stream of data appear is when there’s already a conflict. Thus we don’t achieve our end goal, which is predicting what causes conflict to break out. Big Data doesn’t tell us anything useful if the underlying analytic logic that was used in the data collection is faulty.
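To make the logic concrete, here is a minimal simulation sketch of the problem described above. Everything in it is invented for illustration: the "reports" series stands in for a data stream that only appears once conflict is under way, the "tension" series is a hypothetical pre-conflict driver, and the probabilities are arbitrary. It is not a model of any real conflict dataset.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n_periods=20000, seed=0):
    """Simulate one region over time (all parameters are assumptions)."""
    rng = random.Random(seed)
    tension, conflict, reports = [], [], []
    in_conflict = False
    for _ in range(n_periods):
        if in_conflict:
            # Ongoing conflict persists with probability 0.7.
            in_conflict = rng.random() < 0.7
        else:
            # Onset depends on the PREVIOUS period's tension.
            prior_tension = tension[-1] if tension else 0.0
            in_conflict = rng.random() < 0.3 * prior_tension
        tension.append(rng.random())
        conflict.append(1 if in_conflict else 0)
        # The report stream is generated by the conflict itself, plus noise.
        reports.append(conflict[-1] + rng.gauss(0, 0.2))
    return tension, conflict, reports


def compare_predictors(tension, conflict, reports):
    # Endogenous view: reports measured while the conflict is happening.
    same_time = pearson(reports, conflict)
    # Onset view: only periods that were peaceful one step earlier.
    at_risk = [t for t in range(1, len(conflict)) if conflict[t - 1] == 0]
    onset = [conflict[t] for t in at_risk]
    lagged_reports = pearson([reports[t - 1] for t in at_risk], onset)
    lagged_tension = pearson([tension[t - 1] for t in at_risk], onset)
    return same_time, lagged_reports, lagged_tension


tension, conflict, reports = simulate()
same_time, lagged_reports, lagged_tension = compare_predictors(
    tension, conflict, reports
)
print("reports vs. conflict, same period:  ", round(same_time, 2))
print("earlier reports vs. conflict onset: ", round(lagged_reports, 2))
print("earlier tension vs. conflict onset: ", round(lagged_tension, 2))
```

In this toy setup the report stream correlates strongly with conflict in the same period, because conflict is what produces it, yet it carries essentially no information about whether a peaceful period will turn violent. The hypothetical tension measure, collected before violence starts, is the only one of the two that relates to onset.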

So what do we do? There’s all kinds of dirty, painful math that can be used to address problems in data, such as instrumental variables, robustness checks, etc. But these are post hoc methods, things you do when you’ve got data that’s not quite right. The first step to solving the problem of endogeneity is good first principles. We have to define what we are looking for, and state a falsifiable* hypothesis for how and when it happens. We’re trying to determine what causes violence to break out (this is what we’re looking for). We think that it breaks out because political tensions rise over concerns that public goods and revenues will not be democratically shared (I just made this up, but I think it’s probably a good starting place). Now we know what we’re looking for, and we have a hypothesis for what causes it and when.

If the violence has already started, real-time data probably won’t help us figure out what caused the violence to break out, so we should perhaps look elsewhere in the timeline. This relates to another point Daniel made: don’t think of a big event as a single event. Big events are the outcome of many sequential events over time. There was a time before violence, and that would be a good place to look for data about what led to the violence.

Using good first principles and well thought out data collection methods, Big Data might yet make conflict analysis as much science as art.

*This is so important that it deserves a separate blog post. Fortunately, if you’re feeling saucy and have some time on your hands, René Descartes does the topic far more justice than I could (just read the bit on Cartesian Doubt). Basically, if someone says “I used big data and found this statistical relationship” but they didn’t start from a falsifiable proposition, be very wary of the validity of their results.

Matrix Math and Paul Revere

This week has been a rather stats-oriented week of posts.  I blame this on the fact that political economy and peacekeeping have been dominating my official academic life in the form of a comprehensive exam.  The silver lining is that I will soon have political economy and peacekeeping content galore.

To keep everyone entertained this Friday, I wanted to revisit what has become one of my favorite little write-ups on social network analysis, matrix math, and why we should be concerned about how our metadata gets used.  Kieran Healy is a Duke professor of sociology, and wrote this in response to the revelations about the NSA’s PRISM program.  He does a great job demonstrating the math in an accessible way, while also elegantly demonstrating why we should be wary of talking points such as “…we only capture metadata, not the content of transmissions…”
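For a flavor of the matrix math, here is a minimal sketch of the core idea as I understand it: take a table of who belongs to which group, multiply it by its own transpose, and you get a table of how many groups each pair of people share. The people, groups, and memberships below are made up for illustration and are not the historical data Healy uses.

```python
# Hypothetical membership metadata: rows are people, columns are groups.
# A 1 means "this person appears on that group's membership list".
people = ["Person A", "Person B", "Person C", "Person D"]
groups = ["Club 1", "Club 2", "Club 3"]
membership = [
    [1, 1, 0],  # Person A
    [1, 0, 0],  # Person B
    [0, 1, 1],  # Person C
    [1, 1, 1],  # Person D
]


def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(x * y for x, y in zip(row, column)) for column in zip(*b)]
        for row in a
    ]


# people x groups times groups x people gives people x people:
# entry [i][j] counts the groups that person i and person j share.
shared = matmul(membership, transpose(membership))

# Total shared memberships with everyone else (the diagonal is skipped,
# since it only counts how many groups a person belongs to).
ties = {
    name: sum(shared[i][j] for j in range(len(people)) if j != i)
    for i, name in enumerate(people)
}
most_connected = max(ties, key=ties.get)

for name, row in zip(people, shared):
    print(name, row)
print("Most connected:", most_connected)
```

Nothing here touches the content of any conversation. Membership lists alone are enough to rank who sits at the center of the network, which is why the "only metadata" talking point deserves skepticism.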

Have a great weekend!

Rob Baker: Managing risk in the open data and crowdsourcing space

Rob Baker, currently a Presidential Innovation Fellow with USAID, was willing to sit down with me earlier this year to discuss risk management and ethics in crowdsourcing in disaster and conflict-affected regions.  He’s incredibly smart and insightful, and brings deep technical expertise to the practice of crisis mapping and crowdsourcing given his many years of work with Ushahidi.  This interview was used in my forthcoming paper in the Georgetown Journal of International Affairs on the ethics of using communication technology for crowdsourcing in conflict-affected settings.  This is also my first crack at a podcast, so hopefully it provides some good lunchtime or afternoon listening!