Daniel Solomon recently posted a piece on how we conceptualize (and often misconceptualize) the role of big data in conflict event prediction. His post got me thinking about what role big data plays in conflict analysis. This comes on the heels of Chris Neu’s post on the TechChange blog about the limits of using crowdsourcing to track violence in South Sudan.
This is one of my favorite parts of Daniel’s post: “Acts of violence don’t create data, but rather destroy them. Both local and global information economies suffer during conflict, as warring rumors proliferate and trickle into the exchange of information–knowledge, data–beyond a community’s borders. Observers create complex categories to simplify events, and to (barely) fathom violence as it scales and fragments and coheres and collapses.”
The key question for me becomes: is there a role for Big Data in conflict analysis? Is it something that will empower communities to prevent violence locally, as Letouze, Meier and Vinck propose? Will it be used by the international community for real-time information to speed responses to crises? Could it be leveraged into huge datasets and used to predict outbreaks of violence, so that we can be better prepared to prevent conflict? All of these scenarios are possible, but I've yet to see them come to fruition (not to say that they won't!). The first two are hampered by the practicalities of local access to information and the speed of bureaucratic decision making; the interesting one for me is the third, since it deals directly with an analytic process, so that's what I'll focus on.
When we talk about prediction, we're talking about using observed information to inform what will happen in the future. In American political science, there has been a trend toward using econometric methods to develop models of conflict risk. There are other methods, such as time-series analysis, that can be used as well. But the efficacy of these methods hinges on the quality and attributes of the data themselves. Daniel's post got me thinking about a key issue that has to be dealt with if big data is going to generate valid statistical results: the problem of endogeneity.
To start, what is endogeneity? Basically, it means that the data we're using to predict an event is part of the event we're trying to predict. As Daniel points out, the volume of data coming out of a region goes down as violence goes up; what we end up with is information that is shaped by the conflict itself. If we rely on that data to be our predictor of conflict likelihood, we have a major logical problem – that data is endogenous to (part of) the conflict. Does data collected during conflict predict conflict? Of course it does, because the only time we see that stream of data appear is when there's already a conflict. Thus we don't achieve our end goal, which is predicting what causes conflict to break out. Big Data doesn't tell us anything useful if the underlying analytic logic that was used in the data collection is faulty.
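To make the problem concrete, here's a toy simulation (every number in it is invented for illustration). Reporting volume collapses once violence begins, so a rule that flags conflict when reports drop looks impressively accurate in-sample – yet it can never fire before the violence it's supposed to predict:

```python
import random

random.seed(0)

# Toy timeline: conflict breaks out at t = 120 (an arbitrary choice).
T = 200
onset = 120
conflict = [0] * onset + [1] * (T - onset)

# Reporting volume is shaped BY the conflict: violence destroys data,
# so volume drops sharply once fighting starts.
reports = [random.gauss(100, 5) if c == 0 else random.gauss(30, 5)
           for c in conflict]

# An endogenous "prediction" rule: flag conflict when reports collapse.
predicted = [1 if r < 65 else 0 for r in reports]

# In-sample, this looks like a great predictor...
accuracy = sum(p == c for p, c in zip(predicted, conflict)) / T
print(f"in-sample accuracy: {accuracy:.2f}")

# ...but the signal only appears AT onset; it never leads it.
first_flag = predicted.index(1)
print(f"conflict starts at t={onset}, first flag at t={first_flag}")
```

The rule "predicts" conflict almost perfectly, but only because the drop in reporting *is* the conflict showing up in the data. It provides zero early warning, which is the whole point of prediction.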
So what do we do? There's all kinds of dirty, painful math that can be used to address problems in data, such as instrumental variables, robustness checks, etc. But these are post hoc methods, things you do when you've got data that's not quite right. The first step to solving the problem of endogeneity is good first principles. We have to define what we're looking for, and state a falsifiable* hypothesis for how and when it happens. We're trying to determine what causes violence to break out (this is what we're looking for). We think that it breaks out because political tensions rise over concerns that public goods and revenues will not be democratically shared (I just made this up, but I think it's probably a good starting place). Now we know what we're looking for, and have a hypothesis for what causes it and when.
If the violence has already started, real-time data probably won't help us figure out what caused it to break out, so we should perhaps look elsewhere in the timeline. This relates to another point Daniel made: don't think of a big event as one discrete event. Big events are the outcome of many sequential events over time. There was a time before the violence – that would be a good place to look for data about what led to it.
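As a sketch of that idea, suppose (purely hypothetically – the "tension" index and its ramp-up are assumptions, not findings) that some measurable indicator rises in the period before violence breaks out. A rule built on that pre-conflict signal can fire before onset, which endogenous conflict-time data never can:

```python
import random

random.seed(1)

# Same toy timeline as before: conflict breaks out at t = 120.
T = 200
onset = 120

# Hypothetical "tension" index: flat baseline, then a ramp
# in the 30 steps leading up to onset (a made-up mechanism).
tension = []
for t in range(T):
    level = 20 + random.gauss(0, 3)
    if t >= onset - 30:
        level += min(t - (onset - 30), 30) * 2  # pre-conflict ramp
    tension.append(level)

# Early-warning rule on the PRE-conflict signal.
threshold = 50
warnings = [t for t, x in enumerate(tension) if x > threshold]
lead_time = onset - warnings[0]
print(f"first warning at t={warnings[0]}, lead time = {lead_time} steps")
```

Unlike the reporting-volume rule, this one fires *before* the outbreak, giving genuine lead time. Whether any real-world indicator behaves this way is exactly the kind of falsifiable claim the first-principles step is supposed to pin down.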
Using good first principles and well thought out data collection methods, Big Data might yet make conflict analysis as much science as art.
*This is so important that it deserves a separate blog post. Fortunately, if you're feeling saucy and have some time on your hands, René Descartes does the topic far more justice than I could (just read the bit on Cartesian Doubt). Basically, if someone says "I used big data and found this statistical relationship" but they didn't start from a falsifiable proposition, be very wary of the validity of their results.