Where Are the Legislators (Who Ostensibly Pay for Data)?

I watched from a distance on Twitter as the World Bank hosted its annual data event. I would love to have attended – the participants were a pretty amazing collection of economists, data professionals and academics. This tweet seemed to resonate with a theme I’ve been focused on the last week or so: There is a data shortage such that even the most advanced countries can’t measure the Sustainable Development Goals (SDGs).

The European Statistical System can only produce around 1/3 of #SDG indicators, according to Pieter Everaers of @EU_Eurostat #ABCDEwb — Neil Fantom (@neilfantom) June 21, 2016

I replied to this tweet with a query about whether there was evidence of political will among EU member states to actually collect this data. In keeping with the “data is political” line that I started on last week, political will is important because the European Statistical System relies heavily on EU member states’ statistics offices to provide data. The above tweet highlights two things for me – there needs to be a conversation about where the existing data comes from, and there need to be MPs or MEPs (legislative representatives) at meetings like the World Bank’s annual data event.

Since Eurostat and the European Statistical System were the topic of the tweet, I’ll focus on how they gather statistics. Most of my expertise is in their social and crime stats so I’ll speak to those primarily, but it’s important to note that the quality and quantity of any statistic depend on its importance to the collector and end user. Eurostat got its start as a hub for data on the coal and steel industries in the 1950s, and while its mandate has grown significantly, the quality and density of the economic and business indicators hosted on its data site reflect its founding purpose. Member states provide good economic data because states have decided that trade is important – there is a compelling political reason to provide these statistics. Much of this data is available at high levels of granularity, down to the NUTS 3 level. It’s mostly eye-wateringly boring agricultural, land use, and industrial data, but it’s the kind of stuff that’s important for keeping what is primarily an economic union running smoothly(-ish).

If we compare Eurostat’s economic data to its social and crime data, the quality and coverage decrease notably. This is when it’s important to ask where the data comes from and how it’s gathered – if 2/3 of the data necessary to measure the SDGs isn’t available for Europe (let alone, say, the Central African Republic) we need to be thinking clearly about why we have the data we have, and the values that undergird gathering good social data. Eurostat statistics that would be important to measuring the SDGs might include the SILC surveys that measure social inclusion, and general data on crime and policing. The SILC surveys are designed by Eurostat and implemented by national statistics offices in EU member states. The granularity and availability vary depending on the capacity of the national stats office and the domestic laws regarding personal data and privacy. For example, some countries run the SILC surveys at the NUTS 2 level while others administer them only at the national level. A handful of countries, such as France, do the surveys at the individual level and produce panel data. The problem is that the SILC data has mixed levels of availability due to national laws regarding privacy – for example, if you want the French SILC panel data you have to apply for it and prove you have data storage standards that meet France’s national laws for data security.

Crime and police data is even more of an issue. Eurostat generally doesn’t collect crime data directly from member states. They have an arrangement with the UN Office on Drugs and Crime where crime and police data reported to the UN by EU member states gets passed to Eurostat and made available through their database. One exception is a dataset of homicide, robbery and burglary in the EU from 2008-2010 that is disaggregated down to the NUTS 3 level. When I spoke with the crime stats lead at Eurostat about this dataset he explained that it was a one-off survey in which Eurostat worked with national statistics offices to gather the data; in the end it was so time consuming and expensive that it was canceled. Why would such a rich data collection process get the axe? Because it’s an established fact that crime statistics can’t be compared across jurisdictions due to definitional and counting differences. So funders reasonably asked: What’s the point of spending a lot of money and time collecting data that isn’t comparable in the first place?

A key problem I see in the open data discussion is a heavy focus on data availability with relatively little focus on why the data we have exists in the first place, and by extension what would go into gathering new SDG-focused data (e.g. the missing 2/3 noted in the opening tweet). Some of this is driven by, in my opinion, overconfidence in, or fetishization of, ‘big data’ and crowdsourced statistics. Software platforms are important if you think the data availability problem is just a shortage of capacity to mine social networks, geospatial satellite feeds and passive web-produced data. I’d argue though that the problem isn’t collection ability, and that the focus on collection and validation of ‘big data’ distracts from the important political discussion of whether societies value the SDGs enough to put money and resources into filling the 2/3 gap with purpose-designed surveys instead of mining the internet’s exhaust hoping to find data that’s good enough to build policy on.

I’m not a Luddite crank – I’m all for using technology in innovative ways to gather good data and make it available to all citizens. Indeed, ‘big data’ can provide interesting insights into political and social processes, so finding technical solutions for managing reams and reams of it is important. But there is something socially and politically important about allocating public funds for gathering purpose-designed administrative statistics. When MPs, members of Congress, or MEPs allocate public funds they are making two statements. One is that they value data-driven policy making; the other, more important in my opinion, is that they value a policy area enough to use public resources to improve government performance in it. For this reason I’d argue that data events which don’t have legislative representatives featured as speakers are missing a key chance to talk seriously about the politics of data gathering. Perhaps next year instead of having a technical expert from Eurostat tell us that 2/3 of the necessary data for measuring the SDGs is missing, have Marianne Thyssen, the Commissioner for Employment, Social Affairs and Inclusion, whose portfolio covers Eurostat, come and take questions about EU and member state political will to actually measure the SDGs.

The World Bank’s data team, as well as countless other technical experts at stats offices and research organizations, are doing great work when it comes to making existing data available through better web services, APIs, and open databases. But we’re only having 50% of the necessary discussion if the representatives who set budgets and represent the interests of constituents aren’t participating in the discussion of what we value enough to measure, and what kind of public resources it will take to gather the necessary data.


The Challenge of Conflict Data

The last two posts I wrote focused on the social and political structures that drive data collection and availability. In these posts I was primarily talking about statistics in wealthy countries, as well as developing countries that aren’t affected by conflict or violence. When it comes to countries that are beset by widespread conflict and violence, all the standard administrative structures that would normally gather, process and post data are either so compromised by the politics of conflict that the data can’t be trusted, or worse they just don’t exist. Without human security and reliable government structures, talking about data selection and collection is a futile exercise.

Conflict data, compared to other administrative data, is a bit of a mash up. There are long term data collection projects like the Correlates of War project and the UCDP data program, both of which measure macro issues in conflict and peace such as combatant types, conflict typologies, and fatalities. Because both projects have long timelines in their data they are considered the best resources for quantitatively studying violence and war. Newer data programs include the Armed Conflict Location and Event Data project and the Global Database of Events Language and Tone. These projects take advantage of geographic and internet-based data sources to examine the geographic elements of conflict. There are other conflict data projects that use communication technologies to gather local-level data on conflict and peace, including Voix des Kivus and the Everyday Peace Indicators project.

This is just a sample of projects and programs, but the main thing to note is that they are generally hosted by universities and the data they gather is oriented toward research as opposed to public administration. Administrative data is obviously a different animal than research data (though researchers often use administrative data and vice versa). To be useful it has to be consistent, statistically valid in terms of sampling and collection technique, and available through some sort of website or institutional application. If the aim of the international community is to measure the twelve Goal 16 Targets in the Sustainable Development Goals, particularly in countries affected by conflict, international organizations and donors need to focus on how to develop the structures that collect administrative data.

We can look to existing models of how to gather data, particularly sensitive data on things like violence. Household surveys are a core tool for gathering administrative data, but gathering representative samples takes a lot of work. It also requires a stable population and reliable census data. For example, if a statistical office gets tasked by a ministry of justice to run a survey on crime victimization, the stats office would need to interview as many victims as possible to develop sampling tranches. The U.S. Bureau of Justice Statistics National Crime Victimization Survey is an excellent example of a large-scale national survey. One only needs to read the methodology section to grasp how large an undertaking this is; the government needs the capacity to interview over 150,000 respondents twice a year, citizens need to be stable enough to have a household, and policing data needs to be good enough at the local level to identify victims of crime. Reliable administrative statistics, especially about sensitive topics like crime victimization and violence, require three things: functional government, stable populations, and effective local data collection capacity.

While many countries can measure the Goal 16 Targets, countries affected by conflict and violence (the ones that we should be most interested in from a peacebuilding perspective) fundamentally lack the political and social structures necessary to gather and provide reliable administrative data. Proposing a solution like “establish a functioning state with solid data collection and output processes at the local and national level” sounds comically simplistic, but for many conflict-affected states this is the level of discussion – talking about what kind of data to collect is an academic exercise unless issues of basic security, population stability and institutional capacity are dealt with first.

How is Public Data Produced (Part 2)

I published a post yesterday about how administrative data is produced. In the end I claimed that data gathering is an inherently political process. Far from being comparable, scientifically standardized representations of general behavior, public data and statistics are imbued with all the vagaries and unique socio-administrative preferences of the country or locality that collects them.

Administrative criminal statistics are an interesting starting point if someone wants to understand how data reflects the vagaries of administrative structures. If someone thought “I would really like to compare crime rates across European Union member states” they would probably be surprised to learn that unless they just compare homicide rates it’s impossible to compare crime rates between countries. This is not only because definitions of different crimes differ between countries (though the UNODC has done a lot of work to at least standardize definitions), but also because the actual events of crime are counted differently. For example, Germany uses what’s called “principal offense” counting – this means that in the event that multiple crimes are committed at the same time, the final statistics only count the most serious crime. Belgium doesn’t use this counting method, so its crime statistics look much higher than Germany’s on paper. The University of Lausanne’s Marcelo Aebi, arguably the expert on comparative criminal statistics, published an excellent paper on comparing criminal statistics and the problems posed by different counting procedures (pages 17-18 for those who just want the gist).
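To make the counting difference concrete, here is a minimal sketch in Python. The incidents and the severity ranking are invented for illustration and do not come from German or Belgian statistics; the point is only that the same set of events yields two different totals depending on the counting rule.

```python
# Hypothetical severity ranking, assumed for illustration only;
# real penal codes rank offenses differently.
SEVERITY = {"homicide": 3, "robbery": 2, "burglary": 1}

# Invented data: each inner list is one incident in which
# several offenses were committed at the same time.
incidents = [
    ["burglary", "robbery"],
    ["robbery"],
    ["burglary", "robbery", "homicide"],
]


def count_all_offenses(incidents):
    """Count every recorded offense in every incident."""
    return sum(len(offenses) for offenses in incidents)


def principal_offenses(incidents):
    """Keep one offense per incident: the most serious one."""
    return [max(offenses, key=SEVERITY.get) for offenses in incidents]


all_count = count_all_offenses(incidents)
principal = principal_offenses(incidents)
principal_count = len(principal)

print("Counting every offense:", all_count)
print("Principal offense counting:", principal_count, principal)
```

The underlying events are identical in both cases; only the administrative rule changes, which is why the two totals cannot be compared directly.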

Aebi makes a crucial point in the conclusion of his article: Statistics are social constructs and each society has a different way of constructing them. Statistics represent the things we have valued. The past-tense is important here; when we see data it’s showing us the past (the 2016 Global Peace Index uses numbers from 2015, for example), and thus represents what we valued at the time. Data can be used to build and test models of potential future events, but there is no such thing as ‘future data’. The value in data is that it can help citizens and policy makers understand what worked, or didn’t work, so that policies and behaviors can be adjusted going forward.

Of course institutional and administrative behavior is often resistant to trends in data (or very comfortable with data that supports the status quo). This can be for valid, or at least non-nefarious, reasons. For example the Sustainable Development Goals (SDGs) rely heavily on GDP as an economic indicator. The SDGs are supposed to represent sustainable growth and social development into the future, so it’s interesting that they use an economic indicator that many experts and organizations view as quite flawed.

Why would the SDGs rely so heavily on GDP then? For one, it’s a reliable indicator – everyone at least has some vague idea of what it represents. Two, it’s got a long history – we have tracked it for decades. Three, most of the people who created the SDGs come from backgrounds where GDP is a standard indicator – they pick targets and data based on their professional and institutional experience. They didn’t do this because they’re jerks. They did it because GDP represents the standard, if flawed, way that we measure economic performance. They probably also did it because gathering new data is an expensive, time consuming process that everyone says is important [for someone else to pay for].

This is all to say: If you want better public data, or to at least understand why the public data you have seems to reflect the status quo instead of telling you how to break out of it, it’s imperative to understand the qualitative political, social and administrative behaviors inherent to the place or people you’re researching. Once you’ve got that, you can start the political process of organizing the resources to get newer, better data to formulate newer, better public policy and/or smash the status quo.


How Is Public Data Produced?

The 2016 Global Peace Index (GPI) launched recently. Along with its usual ranking of most to least peaceful countries it included a section analyzing the capacity for the global community to effectively measure progress in the Sustainable Development Goals (SDGs), specifically Goal 16, the peace goal. The GPI’s analysis of statistical capacity (pp. 73-94) motivates a critical question: Where does data come from, and why does it get produced? This is important, because while the GPI notes that some of the Goal 16 targets can be measured with existing data, many cannot. How will we get all this new data?

Some of the data necessary to measure the Targets for Goal 16 is available. I’d say the GPI’s findings can probably be extended to the other goals, so we’ll imagine for the sake of argument that we can measure 50-60% of the 169 Targets across all the SDGs with the data currently available globally. How will we get the other 40-50%? To deal with these questions it’s important to know who collects data. The primary answer is of course national statistics offices. These are the entities tasked by governments with managing statistics across a country’s ministries and agencies, as well as doing population censuses. Other data organizations include international institutions and polling firms. NGOs and academic institutes gather data too, but I’d argue that the scale of the SDGs means that governments, international organizations and big polling firms are going to carry the primary load. Knowing the Who, we can now get to the How.

National statistics offices (NSOs) should be the place where all data that will be used for demonstrating a nation’s progress toward goals is gathered and reported. In a perfect world NSOs would have necessary resources for collecting data, and the flexibility to run new surveys using innovative technologies to meet the rapidly evolving data needs of public policy. This is of course not how NSOs work. Much of what happens in a statistics office is less about gathering new data, and more about making sure what exists is accessible. In my experience NSOs have a core budget for census taking, but if new data has to be collected the funding comes from another government office. This last bit is important: NSOs do not generally have the authority to go get whatever data is necessary. If NSOs are going to be the primary source for data that will be used to measure the SDGs, it is critical that legislatures provide funding to government offices for data gathering.

International organizations are the next place we might look to for data. The World Bank, in my opinion, is the gold standard for international data. United Nations agencies also collect a fair amount of data. What sets the Bank apart is that they do some of their own data collection. Most international organizations’ data though is actually just NSO data from member states. For example, when you go to the UN Office on Drugs and Crime’s database, most of what you’ll find are statistics that were voluntarily reported by member states’ statistical offices. The UN, World Bank, OECD and other myriad organizations do relatively little of their own data gathering; much of their effort is spent making sure that the data they are given is accessible. Unless legislatures in member states provide funding to government agencies to gather data, and the government agrees to share the data with international organizations, most international institutions won’t have much new data.

Polling firms such as Gallup gather international survey data that is timely and accurate, and that covers a wide range of topics relevant to the SDGs. Unfortunately their data is expensive to access. As for-profit entities they have a level of flexibility to gather new data that statistics offices don’t, but this level of flexibility is very expensive to maintain. A problem arises too when Gallup (and similar firms) decide that the data necessary to measure the SDGs is not commercially viable to gather and sell access to. In this case legislatures would need to provide funding to government agencies to hire Gallup to gather data that is relevant to measuring progress toward the SDGs.

There is a pattern in the preceding paragraphs. All of them end with the legislature or representative body of government having to provide funding for data gathering. How we gather data (the funding, budgeting, administration, and authority) is entirely political. This is a key issue that gets lost in a lot of discussion around ‘open data’ and demands for data-driven policy making. It is too easy to fall into a trap where data gets treated as a neutral, values-free thing, existing in a plane outside the messy politics of public administration. The Global Peace Index does a good service by highlighting where there are serious gaps in the necessary data for tracking the SDG Targets. This leads us to the political question of financing data collection.

If the UN and the various stakeholders who developed the Sustainable Development Goals can’t make the case to legislatures and parliaments that investments in data gathering and statistical capacity are politically worthwhile, it is entirely likely that the SDGs will go unmeasured and we’ll be back around a table in 2030 hacking away at the same development challenges while missing the harder conversation about the politics necessary to drive sustainable change.

National Interests, Overwork, and Statistics

I ended up jumping into a Twitter conversation started by international development journalist Tom Murphy about how Rwanda changed the methodology for its Integrated Household Living Conditions survey (EICV), and thus demonstrated that their poverty rate had decreased. The problem is that the new methodology essentially redefines ‘poverty’ to get the numbers to look good; using the previous EICV methodology, it appears that poverty hasn’t decreased but has in fact increased by 6%. While a number of people have already picked apart the methodological problems, is this really a methodological problem or part of a wider indictment of how donor agencies determine success and manage their human resources? Are the people in donor agencies dupes, cynics or both? Neither, I reckon. I think they’re just overworked and probably too undertrained in statistics to get to the root of the story, and have little incentive to do so anyway.
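A toy calculation shows how much a definitional change alone can move a headline poverty number. This is a minimal sketch with invented consumption figures and invented poverty lines; it does not reproduce the EICV methodology or Rwanda’s actual data.

```python
# Invented per-capita consumption values for ten households (arbitrary units).
consumption = [80, 95, 105, 110, 120, 150, 200, 260, 300, 400]


def headcount_rate(values, poverty_line):
    """Share of households whose consumption falls below the poverty line."""
    return sum(v < poverty_line for v in values) / len(values)


# Hypothetical poverty lines: the 'new' definition simply sets the bar lower.
old_rate = headcount_rate(consumption, poverty_line=125)
new_rate = headcount_rate(consumption, poverty_line=100)

print(f"Poverty rate, old definition: {old_rate:.0%}")
print(f"Poverty rate, new definition: {new_rate:.0%}")
```

Nobody in this toy population consumes any more under the second calculation than under the first; the measured rate falls only because the definition moved.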

Filip Reyntjens does a really nice job of breaking down the problems with Rwanda’s EICV. He makes some good points about the problems with changing the methodology, and in the Twitter discussion many other people highlighted technical problems with the new definitions of poverty used in the EICV. While these technical issues are important, the other problem is what the survey means to the stakeholders. This group includes the Rwandan government, donor agencies, and DAC governments. Reyntjens notes that the numbers in the updated EICV make the Rwandan government look good, and by extension make donor agencies look good. Everyone wins (except for the Rwandans who are still in poverty). Setting aside why the Rwandan government would want to modify a survey to make their baseline poverty statistics look better, what do we make of the donor community’s attitude? Are the various aid and development professionals that guide policy just cynical bureaucrats happy to tick the box marked “Rwanda got better”?

Some probably are, but in my experience most development professionals take their jobs seriously and want to see the lives of people improved. So what would lead otherwise upstanding development professionals to ignore potentially blatant number cooking by a beneficiary government? Overwork and a lack of statistical training, most likely. The workloads that staff at donor agencies deal with are immense. Combine that with a tendency within agencies to stovepipe the statisticians away from the policy makers and you end up with overburdened staff who may not have the training to quickly digest the vagaries of a survey’s methodology or analyze the reason certain changes happen in data from year to year.

It shouldn’t surprise anyone that Rwanda’s government took the opportunity to redefine the methodology that signals how they’re doing at reducing poverty. Their government stays in the good graces of allied governments and donor agencies by ‘hitting’ their poverty prevention targets. But if we’re going to demand that donor agencies be prepared to call out number cooking, the donor agencies need to bring on more staff to spread the workload and make sure that the statistics capacity isn’t stovepiped away from all the policy teams. Unfortunately the trend in donor agency funding right now is to focus on ‘efficiency’ above all else (read: too few people doing too much work), which means frayed policy staff will check the “hit the targets” box and the Rwandan government will continue cooking its data to keep donor money flowing.

Big News: The GDELT Global Dashboard

GDELT just released their new Global Visualization dashboard, and it’s pretty cool. It blinks and flashes, glows and pulses, and is really interesting to navigate. Naturally, as a social scientist who studies conflict, I have some thoughts.

1) This is really cool. The user interface is attractive, it’s easy to navigate, and it’s intuitive. I don’t need a raft of instructions on how to use it, and I don’t need to be a programmer or have any background in programming to make use of all its functionality. If the technology and data sectors are going to make inroads into the conflict analysis space, they should take note of how GDELT did this, since most conflict specialists don’t have programming backgrounds and will ignore tools that are too programming intensive. Basically, if it takes more than about 10 minutes for me to get a tool or data program functioning, I’m probably not going to use it since I have other analytic techniques at my disposal that can achieve the same outcome that I’ve already mastered.

2) Beware the desire to forecast! As I dug through the data a bit, I realized something important. This is not a database of information that will be particularly useful for forecasting or predictive analysis. Well, replicable predictive analysis at least. You might be able to identify some trends, but since the data itself is news reports there’s going to be a lot of variation across tone, lag between event and publication, and a whole host of other things that will make quasi-experiments difficult. The example I gave to a friend who I was discussing this with was the challenge of predicting election results using Twitter; it worked when political scientists tried to predict the distribution of seats in the German Bundestag by party, but then when they replicated the experiment in the 2010 U.S. midterm elections it didn’t work at all. Most of this stemmed from the socio-linguistics of political commentary in the two countries. Germans aren’t particularly snarky or sarcastic in their political tweeting (apparently), while Americans are. This caused a major problem for the algorithm that was tracking key words and phrases during the American campaign season. Consider, if we have trouble predicting relatively uniform events like elections using language-based data, how much harder will it be to predict something like violence, which is far more complex?

3) Do look for qualitative details in the data! A friend of mine pointed out that the data contained on this map is a treasure trove of sentiment, perception and narrative about how the media at a very local level conceptualizes violence. Understanding how media, especially local media, perceive things like risk or frame political issues is incredibly valuable for conflict analysts or peacebuilding professionals. I would argue that this is actually more valuable than forecasting or predictive modeling; if we’re honest with ourselves I think we’d have to admit that ‘predicting’ conflict and then rushing to stop it before it starts has proven to be a pretty lost endeavor. But if we understand at a deeper level why people would turn to violence, and how their context helps distill their perception of risk into something hard enough to fight over, then interventions such as negotiation, mediation and political settlements are going to be better tailored to the specific conflict. This is where the GDELT dashboard really shines as an analytic tool.

I’m excited to see how GDELT continues to make the dashboard better – there are already plans to provide more options for layering and filtering data, which will be helpful. Overall though, I’m excited to see what can be done with some creative qualitative research using this data, particularly for understanding sentiment and perception in the media during conflict.

MCIT/NUS ICTs in Emergency Survey: Replication data

I spent the last two months managing a research collaboration between Samoa’s Ministry of Communications and Information Technology (MCIT) and the National University of Samoa, collecting nationwide data on how people use information and information technology to respond to natural disasters. This data will feed into my dissertation, as well as be useful to the Ministry and the National University, who will be using it for policy development and research. The research team wanted to make this data publicly available, since funding for the research came from MCIT and thus we see it as a public good. You can download the data here, and below is the suggested citation:

Martin-Shields, Charles, Ioana Chan Mow, Lealaolesau Fitu & Hobert Sasa. (2014) “ICTs and Information Use During Emergencies: Data from Samoa,” MCIT/NUS Data Project. Dataset available at: https://charlesmartinshields.files.wordpress.com/2014/06/mcitnus-survey.xlsx

The thrust of the research design was multifold. For MCIT, it’s important to know how people get their information, especially when trying to allocate spectrum or regulate communication providers. The research team from NUS does quite a bit of work with ICT4D and the social aspects of access to communication technology, so having data on use preferences from around the country is helpful in their research agenda. My own research looks at technologies as proxies for socio-political behavior, aiming to understand how social and political context affects the way that people use technology to manage collective action problems during crisis.

The dataset takes inspiration from the work I’ve done with Elizabeth Stones, whose dataset on Kenyan information use and trust inspired my thinking on doing a tailored replication in Samoa. We welcome feedback on the data, our structure, and hope that it can be useful to others working on ICT policy.

Finding Big Data’s Place in Conflict Analysis

Daniel Solomon recently posted a piece on how we conceptualize (and often misconceptualize) the role of big data in conflict event prediction. His post got me thinking about what role big data plays in conflict analysis. This comes on the heels of Chris Neu’s post on the TechChange blog about the limits of using crowdsourcing to track violence in South Sudan.

This is one of my favorite parts of Daniel’s post: “Acts of violence don’t create data, but rather destroy them. Both local and global information economies suffer during conflict, as warring rumors proliferate and trickle into the exchange of information–knowledge, data–beyond a community’s borders. Observers create complex categories to simplify events, and to (barely) fathom violence as it scales and fragments and coheres and collapses.”

The key question for me becomes: is there a role for Big Data in conflict analysis? Is it something that will empower communities to prevent violence locally, as Letouze, Meier and Vinck propose? Will it be used by the international community for real-time information to speed responses to crises? Could it be leveraged into huge datasets and used to predict outbreaks of violence, so that we can be better prepared to prevent conflict? All of these scenarios are possible, but I’ve yet to see them come to fruition (not to say that they won’t!). The first two are hampered by practicalities of local access to information, and bureaucratic decision making speed; thus, for me the interesting one is the third since it deals directly with an analytic process, which is what I’ll focus on.

When we talk about prediction, we’re talking about using observed information to inform what will happen in the future. In American political science, there has been a trend toward using econometric methods to develop models of conflict risk. There are other methods, such as time-series analysis, that can be used as well. But the efficacy of these methods hinges on the quality and attributes of the data itself. Daniel’s post got me to think about a key issue that has to be dealt with if big data is going to generate valid statistical results. This key issue is the problem of endogeneity.

To start, what is endogeneity? Basically, it means that the data we’re using to predict an event is part of the event we’re trying to predict. As Daniel points out, the volume of data coming out of a region goes down as violence goes up; what we end up with is information that is shaped out of the conflict itself. If we rely on that data to be our predictor of conflict likelihood, we have a major logical problem – that data is endogenous to (part of) conflict. Does data collected during conflict predict conflict? Of course it does, because the only time we see that stream of data appear is when there’s already a conflict. Thus we don’t achieve our end goal, which is predicting what causes conflict to break out. Big Data doesn’t tell us anything useful if the underlying analytic logic that was used in the data collection is faulty.
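To make the logic concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration (the transition probabilities, the report rates, the function names); it is not drawn from any real conflict dataset. Conflict starts and persists for reasons that have nothing to do with the report stream, while the report stream is mostly generated by conflict that is already under way.

```python
import random


def simulate(n_periods=20000, seed=1):
    """Simulate one region over time (all probabilities are invented)."""
    rng = random.Random(seed)
    conflict, reports = [], []
    active = False
    for _ in range(n_periods):
        # Conflict starts (5%) or persists (80%) independently of any reports.
        active = rng.random() < (0.8 if active else 0.05)
        conflict.append(active)
        # The report stream is mostly produced by conflict that already exists,
        # with a little background noise during peaceful periods.
        reports.append(rng.random() < (0.9 if active else 0.1))
    return conflict, reports


def share(flags):
    """Fraction of True values in an iterable of booleans."""
    flags = list(flags)
    return sum(flags) / len(flags)


conflict, reports = simulate()

# How often is there conflict in a period where we see the stream?
p_conflict_given_report = share(c for c, r in zip(conflict, reports) if r)

# Restrict to peaceful periods and ask whether the stream says anything
# about an outbreak in the NEXT period.
peaceful = [
    (reports[t], conflict[t + 1])
    for t in range(len(conflict) - 1)
    if not conflict[t]
]
onset_after_report = share(nxt for r, nxt in peaceful if r)
onset_after_silence = share(nxt for r, nxt in peaceful if not r)

print(f"P(conflict now | report now):          {p_conflict_given_report:.2f}")
print(f"P(onset next | peaceful, report):      {onset_after_report:.3f}")
print(f"P(onset next | peaceful, no report):   {onset_after_silence:.3f}")
```

In this toy setup the stream looks like a strong signal of conflict, but only because conflict produces it. Once we condition on periods that are still peaceful, seeing the stream tells us essentially nothing about whether violence breaks out next.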

So what do we do? There are all kinds of dirty, painful math that can be used to address problems in data, such as instrumental variables, robustness checks, etc. But these are post hoc methods, things you do when you’ve got data that’s not quite right. The first step to solving the problem of endogeneity is good first principles. We have to define what we are looking for, and state a falsifiable* hypothesis for how and when it happens. We’re trying to determine what causes violence to break out (this is what we’re looking for). We think that it breaks out because political tensions rise over concerns that public goods and revenues will not be democratically shared (I just made this up, but I think it’s probably a good starting place). Now we know what we’re looking for, and we have a hypothesis for what causes it and when.

If the violence has already started, real-time data probably won’t help us figure out what caused the violence to break out, so we should perhaps look elsewhere in the timeline. This relates to another point Daniel made: don’t think of a big event as a single event. Big events are the outcome of many sequential events over time. There was a time before violence – this would be a good place to look for data about what led to the violence.
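As a rough sketch of what "looking elsewhere in the timeline" could mean in practice, the snippet below pulls out only the indicator values observed in the peaceful periods immediately before each outbreak. The function name `pre_onset_windows`, the `tension` series and the window length are all hypothetical, chosen for illustration rather than taken from any real dataset or library.

```python
def pre_onset_windows(conflict, indicator, window=3):
    """Collect indicator values from the `window` peaceful periods
    immediately before each conflict onset.

    conflict  : list of booleans, True while violence is ongoing
    indicator : list of numbers measured in the same periods
    """
    windows = []
    for t in range(window, len(conflict)):
        # An onset is a period of conflict that follows a peaceful period.
        is_onset = conflict[t] and not conflict[t - 1]
        # Keep only windows that are entirely peaceful, so nothing in them
        # was generated by the violence itself.
        if is_onset and not any(conflict[t - window:t]):
            windows.append(indicator[t - window:t])
    return windows


# Toy series (invented numbers): two outbreaks, at periods 4 and 9.
conflict = [False, False, False, False, True, True, False, False, False, True]
tension = [0.1, 0.2, 0.4, 0.7, 0.9, 0.8, 0.3, 0.4, 0.6, 0.9]

windows = pre_onset_windows(conflict, tension)
for w in windows:
    print(w)
```

The design choice that matters here is that nothing measured during the violence enters the windows, so whatever pattern shows up in them at least precedes the event we are trying to explain.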

Using good first principles and well thought out data collection methods, Big Data might yet make conflict analysis as much science as art.

*This is so important that it deserves a separate blog post. Fortunately, if you’re feeling saucy and have some time on your hands, René Descartes does the topic far more justice than I could (just read the bit on Cartesian Doubt). Basically, if someone says “I used big data and found this statistical relationship” but they didn’t start from a falsifiable proposition, be very wary of the validity of their results.

Causes of Effects…and Effects of Causes

Andrew Gelman and Guido Imbens recently posted a paper entitled “Why Ask Why? Forward Causal Inference and Reverse Causal Questions.” It completely made my day, primarily because it succinctly deals with the way people naturally arrive at research questions with the help of some statistical logic. While I liked the models and the logic, what I think is more important is the authors’ process for explaining the value of ‘why’ questions.
