I watched from a distance on Twitter as the World Bank hosted its annual data event. I would love to have attended – the participants were a pretty amazing collection of economists, data professionals and academics. This tweet seemed to resonate with a theme I’ve been focused on the last week or so: There is a data shortage such that even the most advanced countries can’t measure the Sustainable Development Goals (SDGs).
The European Statistical System can only produce around 1/3 of #SDG indicators, according to Pieter Everaers of @EU_Eurostat #ABCDEwb — Neil Fantom (@neilfantom) June 21, 2016
I replied to this tweet with a query about whether there was evidence of political will among EU member states to actually collect this data. In keeping with the “data is political” line that I started on last week, political will is important because the European Statistical System relies heavily on EU member states’ statistics offices to provide data. The above tweet highlights two things for me – there needs to be a conversation about where the existing data comes from, and there need to be MPs or MEPs (legislative representatives) at meetings like the World Bank’s annual data event.
Since Eurostat and the European Statistical System were the topic of the tweet, I’ll focus on how they gather statistics. Most of my expertise is in their social and crime stats so I’ll speak to those primarily, but it’s important to note that the quality and quantity of any statistic is based on its importance to the collector and end user. Eurostat got its start as a hub for data on the coal and steel industries in the 1950s, and while its mandate has grown significantly the quality and density of the economic and business indicators hosted on its data site reflect its founding purpose. Member states provide good economic data because states have decided that trade is important – there is a compelling political reason to provide these statistics. Much of this data is available at high levels of granularity, down to the NUTS 3 level. It’s mostly eye-wateringly boring agricultural, land use, and industrial data, but it’s the kind of stuff that’s important for keeping what is primarily an economic union running smoothly(-ish).
If we compare Eurostat’s economic data to its social and crime data, the quality and coverage decrease notably. This is when it’s important to ask where the data comes from and how it’s gathered – if 2/3 of the data necessary to measure the SDGs isn’t available for Europe (let alone say, the Central African Republic) we need to be thinking clearly about why we have the data we have, and the values that undergird gathering good social data. Eurostat statistics that would be important to measuring the SDGs might include the SILC surveys that measure social inclusion, and general data on crime and policing. The SILC surveys are designed by Eurostat and implemented by national statistics offices in EU member states. The granularity and availability varies depending on the capacity of the national stats office and the domestic laws regarding personal data and privacy. For example, some countries run the SILC surveys at the NUTS 2 level while others administer them only at the national level. A handful of countries, such as France, do the surveys at the individual level and produce panel data. The problem is that the SILC data has mixed levels of availability due to national laws regarding privacy – for example, if you want the SILC panel data you have to apply for it and prove you have data storage standards that meet France’s national laws for data security.
Crime and police data is even more of an issue. Eurostat generally doesn’t collect crime data directly from member states. They have an arrangement with the UN Office on Drugs and Crime where crime and police data reported to the UN by EU member states gets passed to Eurostat and made available through their database. One exception is a dataset of homicide, robbery and burglary in the EU from 2008-2010 that is disaggregated down to the NUTS 3 level. When I spoke with the crime stats lead at Eurostat about this dataset he explained that it was a one-off survey in which Eurostat worked with national statistics offices to gather the data; in the end it was so time consuming and expensive that it was canceled. Why would such a rich data collection process get the axe? Because it’s an established fact that crime statistics can’t be compared across jurisdictions due to definitional and counting differences. So funders reasonably asked: What’s the point of spending a lot of money and time collecting data that isn’t comparable in the first place?
A key problem I see in the open data discussion is a heavy focus on data availability with relatively little focus on why the data we have exists in the first place, and by extension what would go into gathering new SDG-focused data (e.g. the missing 2/3 noted in the opening tweet). Some of this is driven by, in my opinion, an over confidence in/fetishization of ‘big data’ and crowdsourced statistics. Software platforms are important if you think the data availability problem is just a shortage of capacity to mine social networks, geospatial satellite feeds and passive web-produced data. I’d argue though that the problem isn’t collection ability, and that the focus on collection and validation of ‘big data’ distracts from the important political discussion of whether societies value the SDGs enough to put money and resources into filling the 2/3 gap with purpose-designed surveys instead of mining the internet’s exhaust hoping to find data that’s good enough to build policy on.
I’m not a Luddite crank – I’m all for using technology in innovative ways to gather good data and make it available to all citizens. Indeed, ‘big data’ can provide interesting insights into political and social processes, so finding technical solutions for managing reams and reams of it are important. But there is something socially and politically important about allocating public funds for gathering purpose-designed administrative statistics. When MPs, members of Congress, or MEPs allocate public funds they are making two statements. One is that they value data-driven policy making; the other, more important in my opinion, is that they value a policy area enough to use public resources to improve government performance in it. For this reason I’d argue that data events which don’t have legislative representatives featured as speakers are missing a key chance to talk seriously about the politics of data gathering. Perhaps next year instead of having a technical expert from Eurostat tell us that 2/3 of the necessary data for measuring the SDGs is missing, have Marianne Thyssen, the Commissioner for Employment, Social Affairs and Inclusion that covers Eurostat, come and take questions about EU and member state political will to actually measure the SDGs.
The World Bank’s data team, as well as countless other technical experts at stats offices and research organizations, are doing great work when it comes to making existing data available through better web services, APIs, and open databases. But we’re only having 50% of the necessary discussion if the representatives who set budgets and represent the interests of constituents aren’t participating in the discussion of what we value enough to measure, and what kind of public resources it will take to gather the necessary data.