Where Are the Legislators (Who Ostensibly Pay for Data)?

I watched from a distance on Twitter as the World Bank hosted its annual data event. I would love to have attended – the participants were a pretty amazing collection of economists, data professionals and academics. This tweet resonated with a theme I’ve been focused on for the last week or so: there is a data shortage so severe that even the most advanced countries can’t measure the Sustainable Development Goals (SDGs).

The European Statistical System can only produce around 1/3 of #SDG indicators, according to Pieter Everaers of @EU_Eurostat #ABCDEwb — Neil Fantom (@neilfantom) June 21, 2016

I replied to this tweet with a query about whether there was evidence of political will among EU member states to actually collect this data. In keeping with the “data is political” line that I started on last week, political will is important because the European Statistical System relies heavily on EU member states’ statistics offices to provide data. The above tweet highlights two things for me – there needs to be a conversation about where the existing data comes from, and there need to be MPs or MEPs (legislative representatives) at meetings like the World Bank’s annual data event.

Since Eurostat and the European Statistical System were the topic of the tweet, I’ll focus on how they gather statistics. Most of my expertise is in their social and crime stats, so I’ll speak to those primarily, but it’s important to note that the quality and quantity of any statistic reflect its importance to the collector and end user. Eurostat got its start as a hub for data on the coal and steel industries in the 1950s, and while its mandate has grown significantly, the quality and density of the economic and business indicators hosted on its data site reflect its founding purpose. Member states provide good economic data because states have decided that trade is important – there is a compelling political reason to provide these statistics. Much of this data is available at high levels of granularity, down to the NUTS 3 level (the most detailed tier of the EU’s statistical geography). It’s mostly eye-wateringly boring agricultural, land use, and industrial data, but it’s the kind of stuff that’s important for keeping what is primarily an economic union running smoothly(-ish).

If we compare Eurostat’s economic data to its social and crime data, the quality and coverage decrease notably. This is when it’s important to ask where the data comes from and how it’s gathered – if 2/3 of the data necessary to measure the SDGs isn’t available for Europe (let alone, say, the Central African Republic), we need to think clearly about why we have the data we have, and about the values that undergird gathering good social data. Eurostat statistics that would be important to measuring the SDGs include the SILC surveys that measure social inclusion, and general data on crime and policing. The SILC surveys are designed by Eurostat and implemented by national statistics offices in EU member states. Their granularity and availability vary depending on the capacity of the national stats office and the domestic laws regarding personal data and privacy. For example, some countries run the SILC surveys at the NUTS 2 level while others administer them only at the national level. A handful of countries, such as France, run the surveys at the individual level and produce panel data. The problem is that the SILC data has mixed levels of availability due to national privacy laws – for example, if you want the French SILC panel data you have to apply for it and prove that your data storage standards meet France’s national laws for data security.

Crime and police data is even more of an issue. Eurostat generally doesn’t collect crime data directly from member states. Instead, it has an arrangement with the UN Office on Drugs and Crime under which crime and police data reported to the UN by EU member states gets passed to Eurostat and made available through its database. One exception is a dataset of homicide, robbery and burglary in the EU from 2008-2010 that is disaggregated down to the NUTS 3 level. When I spoke with the crime stats lead at Eurostat about this dataset, he explained that it was a one-off survey in which Eurostat worked with national statistics offices to gather the data; in the end it was so time-consuming and expensive that it was canceled. Why would such a rich data collection process get the axe? Because it’s an established fact that crime statistics can’t be compared across jurisdictions due to definitional and counting differences. So funders reasonably asked: what’s the point of spending a lot of money and time collecting data that isn’t comparable in the first place?

A key problem I see in the open data discussion is a heavy focus on data availability, with relatively little focus on why the data we have exists in the first place – and, by extension, on what it would take to gather new SDG-focused data (e.g. the missing 2/3 noted in the opening tweet). Some of this is driven, in my opinion, by an overconfidence in, and fetishization of, ‘big data’ and crowdsourced statistics. Software platforms are important if you think the data availability problem is just a shortage of capacity to mine social networks, geospatial satellite feeds and passive web-produced data. I’d argue, though, that the problem isn’t collection ability. The focus on collecting and validating ‘big data’ distracts from the important political discussion of whether societies value the SDGs enough to put money and resources into filling the 2/3 gap with purpose-designed surveys, instead of mining the internet’s exhaust hoping to find data that’s good enough to build policy on.

I’m not a Luddite crank – I’m all for using technology in innovative ways to gather good data and make it available to all citizens. Indeed, ‘big data’ can provide interesting insights into political and social processes, so finding technical solutions for managing reams and reams of it is important. But there is something socially and politically important about allocating public funds for gathering purpose-designed administrative statistics. When MPs, members of Congress, or MEPs allocate public funds they are making two statements. One is that they value data-driven policy making; the other, more important in my opinion, is that they value a policy area enough to use public resources to improve government performance in it. For this reason I’d argue that data events which don’t feature legislative representatives as speakers are missing a key chance to talk seriously about the politics of data gathering. Perhaps next year, instead of having a technical expert from Eurostat tell us that 2/3 of the data necessary for measuring the SDGs is missing, have Marianne Thyssen, the Commissioner for Employment, Social Affairs and Inclusion, whose portfolio covers Eurostat, come and take questions about EU and member state political will to actually measure the SDGs.

The World Bank’s data team, along with countless other technical experts at stats offices and research organizations, is doing great work making existing data available through better web services, APIs, and open databases. But we’re only having half of the necessary discussion if the representatives who set budgets and represent the interests of constituents aren’t participating in the conversation about what we value enough to measure, and what kind of public resources it will take to gather the necessary data.

 


How Is Public Data Produced?

The 2016 Global Peace Index (GPI) launched recently. Along with its usual ranking of most to least peaceful countries, it includes a section analyzing the capacity of the global community to effectively measure progress on the Sustainable Development Goals (SDGs), specifically Goal 16, the peace goal. The GPI’s analysis of statistical capacity (pp. 73-94) motivates a critical question: where does data come from, and why does it get produced? This matters because, while the GPI notes that some of the Goal 16 targets can be measured with existing data, many cannot. How will we get all this new data?

Some of the data necessary to measure the Targets for Goal 16 is available. I’d say the GPI’s findings can probably be extended to the other goals, so we’ll imagine for the sake of argument that we can measure 50-60% of the 169 Targets across all the SDGs with the data currently available globally. How will we get the other 40-50%? To deal with these questions it’s important to know who collects data. The primary answer is, of course, national statistics offices: the entities tasked by governments with managing statistics across a country’s ministries and agencies, as well as conducting population censuses. Other data organizations include international institutions and polling firms. NGOs and academic institutes gather data too, but I’d argue that the scale of the SDGs means that governments, international organizations and big polling firms are going to carry the primary load. Knowing the Who, we can now get to the How.

National statistics offices (NSOs) should be the place where all data used to demonstrate a nation’s progress toward goals is gathered and reported. In a perfect world NSOs would have the necessary resources for collecting data, and the flexibility to run new surveys using innovative technologies to meet the rapidly evolving data needs of public policy. This is, of course, not how NSOs work. Much of what happens in a statistics office is less about gathering new data and more about making sure what exists is accessible. In my experience NSOs have a core budget for census taking, but if new data has to be collected the funding comes from another government office. This last bit is important: NSOs do not generally have the authority to go out and get whatever data is necessary. If NSOs are going to be the primary source of data for measuring the SDGs, it is critical that legislatures provide funding to government offices for data gathering.

International organizations are the next place we might look for data. The World Bank, in my opinion, is the gold standard for international data. United Nations agencies also collect a fair amount. What sets the Bank apart is that it does some of its own data collection. Most international organizations’ data, though, is actually just NSO data from member states. For example, when you go to the UN Office on Drugs and Crime’s database, most of what you’ll find are statistics that were voluntarily reported by member states’ statistical offices. The UN, World Bank, OECD and myriad other organizations do relatively little of their own data gathering; much of their effort is spent making sure that the data they are given is accessible. Unless legislatures in member states fund government agencies to gather data, and the governments agree to share that data with international organizations, most international institutions won’t have much new data.

Polling firms such as Gallup gather international survey data that is timely, accurate, and broad in its coverage of topics relevant to the SDGs. Unfortunately their data is expensive to access. As for-profit entities they have a flexibility in gathering new data that statistics offices don’t, but that flexibility is very expensive to maintain. A problem also arises when Gallup (and similar firms) decide that the data necessary to measure the SDGs is not commercially viable to gather and sell access to. In that case legislatures would need to fund government agencies to hire Gallup to gather data relevant to measuring progress toward the SDGs.

There is a pattern in the preceding paragraphs. All of them end with the legislature or representative body of government having to provide funding for data gathering. How we gather data (the funding, budgeting, administration, and authority) is entirely political. This is a key issue that gets lost in a lot of discussion around ‘open data’ and demands for data-driven policy making. It is too easy to fall into a trap where data gets treated as a neutral, values-free thing, existing in a plane outside the messy politics of public administration. The Global Peace Index does a good service by highlighting where there are serious gaps in the necessary data for tracking the SDG Targets. This leads us to the political question of financing data collection.

If the UN and the various stakeholders who developed the Sustainable Development Goals can’t make the case to legislatures and parliaments that investments in data gathering and statistical capacity are politically worthwhile, it is entirely likely that the SDGs will go unmeasured and we’ll be back around a table in 2030 hacking away at the same development challenges while missing the harder conversation about the politics necessary to drive sustainable change.

Build Peace 2015

I was invited to speak on the panel on behavior change and technology in peacebuilding at Build Peace 2015. The panel was a lot of fun, with some fascinating presentations! You can find them on the Build Peace YouTube page. Here’s mine:

This was a particularly fun conference, pulling together practitioners, activists and academics in a setting that breaks away from the usual paper/panel/questions format of most conferences. Looking forward to next year!

Big News: The GDELT Global Dashboard

GDELT just released their new Global Visualization dashboard, and it’s pretty cool. It blinks and flashes, glows and pulses, and is really interesting to navigate. Naturally, as a social scientist who studies conflict, I have some thoughts.

1) This is really cool. The user interface is attractive, easy to navigate, and intuitive. I don’t need a raft of instructions on how to use it, and I don’t need any background in programming to make use of all its functionality. If the technology and data sectors are going to make inroads into the conflict analysis space, they should take note of how GDELT did this, since most conflict specialists don’t have programming backgrounds and will ignore tools that are too programming intensive. Basically, if it takes more than about 10 minutes for me to get a tool or data program functioning, I’m probably not going to use it, since I already have other analytic techniques at my disposal that can achieve the same outcome.

2) Beware the desire to forecast! As I dug through the data a bit, I realized something important. This is not a database that will be particularly useful for forecasting or predictive analysis – replicable predictive analysis, at least. You might be able to identify some trends, but since the data itself is news reports, there is going to be a lot of variation in tone, in the lag between event and publication, and in a whole host of other things that will make quasi-experiments difficult. The example I gave a friend was the challenge of predicting election results using Twitter: it worked when political scientists predicted the distribution of seats in the German Bundestag by party, but when they replicated the experiment in the 2010 U.S. midterm elections it didn’t work at all. Most of this stemmed from the socio-linguistics of political commentary in the two countries. Germans aren’t particularly snarky or sarcastic in their political tweeting (apparently), while Americans are. This caused a major problem for the algorithm that was tracking key words and phrases during the American campaign season. Consider: if we have trouble predicting relatively uniform events like elections using language-based data, how much harder will it be to predict something like violence, which is far more complex?
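To make the mechanics concrete, here is a minimal sketch of the kind of keyword-counting approach those studies used. The tweets and party names are invented, and this is my illustration of the logic, not the actual algorithm from either study:

    from collections import Counter

    # Toy version of keyword-based election forecasting: count party
    # mentions and read the shares as a proxy for vote share.
    # All tweets and party names below are made up for illustration.
    tweets = [
        "Voting PartyA tomorrow, they've earned it",
        "PartyB has my support this year",
        "Oh sure, PartyA will definitely fix everything...",  # sarcasm
        "PartyA PartyA PartyA",  # spam
    ]

    mentions = Counter()
    for tweet in tweets:
        for party in ("PartyA", "PartyB"):
            mentions[party] += tweet.count(party)

    total = sum(mentions.values())
    for party, count in mentions.items():
        print(party, f"{count / total:.0%}")

    # Naive counting scores the sarcastic and spam tweets as support for
    # PartyA - exactly the socio-linguistic noise that sank the U.S. replication.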

3) Do look for qualitative details in the data! A friend of mine pointed out that the data contained in this map is a treasure trove of sentiment, perception and narrative about how media at a very local level conceptualize violence. Understanding how media, especially local media, perceive things like risk or frame political issues is incredibly valuable for conflict analysts and peacebuilding professionals. I would argue that this is actually more valuable than forecasting or predictive modeling; if we’re honest with ourselves, I think we’d have to admit that ‘predicting’ conflict and then rushing to stop it before it starts has proven to be a largely futile endeavor. But if we understand at a deeper level why people would turn to violence, and how their context helps distill their perception of risk into something hard enough to fight over, then interventions such as negotiation, mediation and political settlements can be better tailored to the specific conflict. This is where the GDELT dashboard really shines as an analytic tool.

I’m excited to see how GDELT continues to make the dashboard better – there are already plans to provide more options for layering and filtering data, which will be helpful. Overall though, I’m excited to see what can be done with some creative qualitative research using this data, particularly for understanding sentiment and perception in the media during conflict.

MCIT/NUS ICTs in Emergency Survey: Replication data

I spent the last two months managing a research collaboration between Samoa’s Ministry of Communications and Information Technology (MCIT) and the National University of Samoa (NUS), collecting nationwide data on how people use information and information technology to respond to natural disasters. This data will feed into my dissertation, and will also be useful to the Ministry and the National University, who will use it for policy development and research. The research team wanted to make this data publicly available: funding for the research came from MCIT, so we see it as a public good. You can download the data here, and below is the suggested citation:

Martin-Shields, Charles, Ioana Chan Mow, Lealaolesau Fitu & Hobert Sasa. (2014) “ICTs and Information Use During Emergencies: Data from Samoa,” MCIT/NUS Data Project. Dataset available at: https://charlesmartinshields.files.wordpress.com/2014/06/mcitnus-survey.xlsx

The research design served several purposes. For MCIT, it’s important to know how people get their information, especially when allocating spectrum or regulating communication providers. The research team from NUS does quite a bit of work on ICT4D and the social aspects of access to communication technology, so having data on use preferences from around the country feeds their research agenda. My own research looks at technologies as proxies for socio-political behavior, aiming to understand how social and political context affects the way people use technology to manage collective action problems during crises.

The dataset takes inspiration from the work I’ve done with Elizabeth Stones, whose dataset on Kenyan information use and trust inspired my thinking on doing a tailored replication in Samoa. We welcome feedback on the data and its structure, and hope that it can be useful to others working on ICT policy.

Headed to Toronto soon…

I’ll be at the International Studies Association annual convention from March 26-30 presenting two papers (never again will I submit two abstracts for papers that have to be written from scratch…) on crowdsourcing methodology and technology in peacekeeping operations. It should be a lot of fun – feel free to give me feedback on the papers as I get them posted, and let me know if you’ll be in Toronto. I’m always up for a coffee, beer or lunch!

Finding Big Data’s Place in Conflict Analysis

Daniel Solomon recently posted a piece on how we conceptualize (and often misconceptualize) the role of big data in conflict event prediction. His post got me thinking about what role big data plays in conflict analysis. This comes on the heels of Chris Neu’s post on the TechChange blog about the limits of using crowdsourcing to track violence in South Sudan.

This is one of my favorite parts of Daniel’s post: “Acts of violence don’t create data, but rather destroy them. Both local and global information economies suffer during conflict, as warring rumors proliferate and trickle into the exchange of information–knowledge, data–beyond a community’s borders. Observers create complex categories to simplify events, and to (barely) fathom violence as it scales and fragments and coheres and collapses.”

The key question for me is: is there a role for big data in conflict analysis? Is it something that will empower communities to prevent violence locally, as Letouze, Meier and Vinck propose? Will it be used by the international community for real-time information to speed responses to crises? Could it be leveraged into huge datasets and used to predict outbreaks of violence, so that we can be better prepared to prevent conflict? All of these scenarios are possible, but I’ve yet to see them come to fruition (not to say that they won’t!). The first two are hampered by the practicalities of local access to information and the speed of bureaucratic decision making; for me the interesting one is the third, since it deals directly with an analytic process, which is what I’ll focus on.

When we talk about prediction, we’re talking about using observed information to make inferences about what will happen in the future. In American political science, there has been a trend toward using econometric methods to develop models of conflict risk. There are other methods, such as time-series analysis, that can be used as well. But the efficacy of these methods hinges on the quality and attributes of the data itself. Daniel’s post got me thinking about a key issue that has to be dealt with if big data is going to generate valid statistical results: the problem of endogeneity.

To start, what is endogeneity? Basically, it means that the data we’re using to predict an event is itself a product of the event we’re trying to predict. As Daniel points out, the volume of data coming out of a region goes down as violence goes up; what we end up with is information that is shaped by the conflict itself. If we rely on that data as our predictor of conflict likelihood, we have a major logical problem – the data is endogenous to (part of) the conflict. Does data collected during conflict predict conflict? Of course it does, because the only time we see that stream of data appear is when there’s already a conflict. Thus we don’t achieve our end goal, which is predicting what causes conflict to break out. Big data doesn’t tell us anything useful if the underlying analytic logic of the data collection is faulty.
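A toy simulation makes the logical problem visible. Everything here is synthetic – invented numbers, not real conflict data – but it shows how a variable generated by the conflict can look like a spectacular ‘predictor’ of it:

    import numpy as np

    rng = np.random.default_rng(42)
    n = 1000  # synthetic region-months

    # Unobserved true driver: political tension. Violence breaks out when
    # tension crosses a threshold.
    tension = rng.uniform(0, 1, n)
    conflict = (tension + rng.normal(0, 0.2, n)) > 0.7

    # Report volume is generated BY the conflict: quiet regions produce
    # routine coverage, violent regions produce a burst of reporting.
    reports = rng.poisson(5, n) + np.where(conflict, rng.poisson(30, n), 0)

    # Report volume correlates almost perfectly with conflict - but only
    # because it is a product of the violence, not a cause of it. It cannot
    # tell us where violence will break out next.
    print(np.corrcoef(reports, conflict.astype(float))[0, 1])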

So what do we do? There are all kinds of dirty, painful math techniques that can be used to address problems in data – instrumental variables, robustness checks, and so on. But these are post hoc methods, things you do when you’ve got data that’s not quite right. The first step to solving the problem of endogeneity is good first principles. We have to define what we are looking for, and state a falsifiable* hypothesis for how and when it happens. We’re trying to determine what causes violence to break out (this is what we’re looking for). We think that it breaks out because political tensions rise over concerns that public goods and revenues will not be democratically shared (I just made this up, but I think it’s probably a good starting place). Now we know what we’re looking for, and we have a hypothesis for what causes it and when.

If the violence has already started, real-time data probably won’t help us figure out what caused it to break out, so we should look elsewhere in the timeline. This relates to another point Daniel made: don’t think of a big event as one big event. Big events are the outcome of many sequential events over time. There was a time before the violence – that would be a good place to look for data about what led to it.
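Continuing in the same toy spirit (synthetic data, my own illustration, not a real conflict model): if cause has to precede effect, the predictor should be measured before the onset it is meant to explain. A lagged indicator does exactly that:

    import numpy as np

    rng = np.random.default_rng(7)
    T = 500  # synthetic months in one region

    # Tension accumulates over time; violence at month t depends on the
    # tension level at month t-1.
    tension = np.clip(np.cumsum(rng.normal(0, 0.1, T)), 0, None)
    conflict = np.zeros(T, dtype=bool)
    conflict[1:] = (tension[:-1] + rng.normal(0, 0.3, T - 1)) > 1.5

    # The predictor (tension at t-1) is measured before the outcome
    # (conflict at t), so the causal ordering is at least defensible.
    print(np.corrcoef(tension[:-1], conflict[1:].astype(float))[0, 1])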

With good first principles and well-thought-out data collection methods, big data might yet make conflict analysis as much science as art.

*This is so important that it deserves a separate blog post. Fortunately, if you’re feeling saucy and have some time on your hands, René Descartes does the topic far more justice than I could (just read the bit on Cartesian doubt). Basically, if someone says “I used big data and found this statistical relationship” but they didn’t start from a falsifiable proposition, be very wary of the validity of their results.

Disaggregating Peacekeeping Data: A new dataset on peacekeeping contributions

Jacob Kathman at the University at Buffalo has an article in the current issue of Conflict Management and Peace Science presenting his new dataset on the numbers and nationalities of all peacekeeper contributions by month since 1990. This is a pretty fantastic undertaking, since peacekeeping data is often difficult to find, and no small feat given how challenging it is not only to code a 100,000+ point dataset, but to do it in a way that complements other datasets like Correlates of War and Uppsala/PRIO. I’m particularly excited about this dataset because it highlights something I’ve been interested in, and will continue to work on throughout my career: gathering and coding historical data on peacekeeping missions so that social scientists and economists can start producing quantitative research to complement the existing case study-oriented research on peacekeeping operations and practice.
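To see why that compatibility matters in practice, here is a hedged sketch of how a researcher might line Kathman’s contribution data up against a conflict dataset. The file and column names below are hypothetical, not the actual variable names in either dataset:

    import pandas as pd

    # Hypothetical file and column names, for illustration only.
    pk = pd.read_csv("kathman_contributions.csv")  # country, year, month, troops
    ucdp = pd.read_csv("ucdp_prio_conflict.csv")   # country, year, month, active_conflict

    # Shared country-year-month identifiers are what make datasets
    # "complementary": once both carry the same keys, troop levels sit
    # next to conflict status in a single panel.
    panel = pk.merge(ucdp, on=["country", "year", "month"], how="left")
    print(panel.head())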

As Kathman points out, research on peacekeeping has usually focused on case studies. This makes sense: most of the research is geared toward identifying lessons learned from mission success and failure, and is meant to be easily integrated into operational behavior rather than addressing theoretical issues. It also reflects the ad hoc nature of peacekeeping: a mission gets a mandate to deal with a specific issue, and missions tend to be short (with some exceptions), so the data tends to be mission- and context-specific, which lends itself to case study approaches. As civil wars became the norm in the 1990s, though, missions expanded their roles to include war fighting, humanitarian aid delivery, medical provision, policing, and other aspects of civil society. This meant that peacekeeping missions became part of the political, economic and social fabric of the post-ceasefire environment, and over the last ten years social scientists have started studying the effects of peacekeeping missions on ceasefire duration and economic development, among other things.

One thing that has been lacking, and that Kathman’s dataset helps with, is data about the missions themselves. Studies such as Virginia Page Fortna’s excellent book on the effect of peacekeeping missions on ceasefire durability tend to rely on conflict start-stop data to make inferences about the impact of peacekeeping. Studies of peacekeeping and economics run into the same issues; researchers have used the baseline effect of peacekeeping missions on GDP, but this is a blunt instrument and suffers from problems of endogeneity. Caruso et al’s analysis of the UN mission in South Sudan’s positive effect on cereal production treats the UN mission as a single mass entity, but cannot show comparative impacts on food production across missions, since finer grained mission data isn’t readily available.

Given the need, I would suggest pushing forward with datasets that contain not only data on troop contributions, but also data on mission expenditures, since peacekeeping missions can have positive effects on the local economy. The problem is that those effects might not be visible without finer grained data on how missions spend their money in the countries where they operate. Do investments in durable infrastructure make a difference to the durability of peace and economic growth? What about focusing on local provision of goods and services where available? At the moment data on these things is hard to find, but it would be useful to conflict researchers.

Kathman’s paper is worth a read, since he gives us a road map for developing further datasets on peacekeeping missions. More datasets like this are important for theorists who do research in the abstract, but they can also help inform better processes for mission mandating, procurement and staffing. If you want to download the datasets, Kathman has them as zip files on his website.

Matrix Math and Paul Revere

This week has been a rather stats-oriented week of posts. I blame this on the fact that political economy and peacekeeping have been dominating my official academic life in the form of a comprehensive exam. The silver lining is that I will soon have political economy and peacekeeping content galore.

To keep everyone entertained this Friday, I wanted to revisit what has become one of my favorite little write-ups on social network analysis, matrix math, and why we should be concerned about how our metadata gets used. Kieran Healy is a Duke professor of sociology who wrote this in response to the revelations about the NSA’s PRISM program. He does a great job demonstrating the math in an accessible way, while also elegantly showing why we should be wary of talking points such as “…we only capture metadata, not the content of transmissions…”
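The core of Healy’s demonstration is a single matrix product: a person-by-group membership table, multiplied by its own transpose, yields a person-by-person network. Here is a minimal sketch of that trick – the names and memberships are invented for illustration, not Healy’s actual Boston data:

    import numpy as np

    # Rows are people, columns are organizations; a 1 means person i is a
    # member of group j. Memberships below are made up for illustration.
    people = ["Revere", "Adams", "Warren", "Church"]
    M = np.array([
        [1, 0, 1],   # Revere
        [0, 1, 1],   # Adams
        [1, 1, 1],   # Warren
        [0, 0, 1],   # Church
    ])

    # M @ M.T is a person-by-person matrix: entry (i, j) counts the groups
    # persons i and j share; the diagonal is each person's membership count.
    # This is "only metadata," yet it reconstructs the social network.
    shared = M @ M.T
    for i in range(len(people)):
        for j in range(i + 1, len(people)):
            if shared[i, j] > 0:
                print(f"{people[i]} and {people[j]} share {shared[i, j]} group(s)")

The payoff of the math is that no message content is needed anywhere: membership records alone are enough to flag the best-connected individual.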

Have a great weekend!