Big data: correlation is no substitute for causality
A slightly shorter version of this blog post was originally published on the University of Amsterdam’s website here. It is re-published here in extended form with their permission.
‘Big data’ has become one of those buzzwords we can’t escape in the urban field these days. There is a whole new industry of big data analytics companies, selling their services to cities that want to be seen as ‘smart’, promising insights that will lead to more efficient or effective services. In academia, the terms ‘big data’ and ‘smart cities’ are dominating funding calls and research initiatives.
One of the claims of the ‘big data’ movement is that their techniques are based on a new paradigm for research. Using approaches which are unfamiliar to most social scientists, they apply automated analytical strategies (machine learning) to vast quantities of data from diverse sources. These approaches are said to succeed where traditional approaches would be overwhelmed by the volume and complexity of the data. And they do so by the application of brute computing power (or, rather, sophisticated programming), instead of relying on theory to target limited research resources: “With enough data, the numbers speak for themselves” as one well-known comment has put it (Anderson 2008).
The numbers ‘speak’ through the patterns of correlations which these automated analytics reveal. The new techniques, it is claimed, have moved beyond the need to worry about theory or causality. The sheer volume of data is sufficient guarantee of the importance of the relationships.
It is true that simple correlations are enough for some purposes. One commonly cited illustration is the work in New York to understand the links between building subdivisions and fire risks. The city authorities had limited resources and inspectors were frequently called out to premises where they found minimal risks. The data scientists built a model to predict which callouts were likely to be associated with a significant fire risk so they could target the inspection resources more effectively. And they seem to have succeeded (Mayer-Schönberger and Cukier 2013).
For many other purposes, however, correlations are not enough. Consider another widely-cited case study – a hospital seeking to reduce readmission rates, also discussed by Mayer-Schönberger and Cukier (2013). Data scientists identified an unexpected risk factor for readmission that clinicians had supposedly overlooked – depression. On the basis of this study of correlations, the hospital introduced a policy of screening patients for depression and offering additional counselling to those with symptoms, and readmission rates fell as a result.
Before we chalk this up as another success for the big data approach, however, let us unpack this a bit further. It is true that, at the stage of the data scientists’ analysis, the study is simply based on correlations – which factor or factors co-occur with readmission. Once the hospital moves to intervention, however, it shifts to being a claim about causation: counselling is introduced on the assumption that the relationship is causal – otherwise, it would be pointless. And that is what it turns out to be – celebrations all round.
But depression might equally have turned out not to be a causal factor. Suppose, for example, that the real cause of readmissions was poor housing. Poor housing could raise the incidence of readmission, because it places greater stress on patients’ physical health, and also raise the incidence of depression. Depression and readmission would then be correlated without one causing the other, and counselling would have had no benefit. The insight from big data was only useful because it was subsequently shown to be causal.
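The poor-housing scenario can be made concrete with a small simulation. The sketch below is purely illustrative: the probabilities are invented, the function names are hypothetical, and counselling is idealised as removing depression entirely. In this made-up world, poor housing raises the risk of both depression and readmission, while depression itself has no effect on readmission.

```python
import random


def simulate(n=100_000, treat_depression=False, seed=0):
    """Simulate patients as (poor_housing, depressed, readmitted) tuples.

    All probabilities are invented for illustration. Poor housing is the
    only cause of readmission here; depression has no causal effect on it.
    """
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        poor_housing = rng.random() < 0.3
        # Poor housing makes depression more likely.
        depressed = rng.random() < (0.5 if poor_housing else 0.1)
        if treat_depression:
            # Idealised counselling: depression is removed entirely.
            depressed = False
        # Readmission depends on housing only, never on depression.
        readmitted = rng.random() < (0.4 if poor_housing else 0.1)
        patients.append((poor_housing, depressed, readmitted))
    return patients


def readmission_rate(patients):
    """Overall share of patients who were readmitted."""
    return sum(r for _, _, r in patients) / len(patients)


def rate_by_depression(patients):
    """Readmission rates for depressed and non-depressed patients.

    Only meaningful when both groups are non-empty, so call it on the
    untreated population.
    """
    dep = [r for _, d, r in patients if d]
    non = [r for _, d, r in patients if not d]
    return sum(dep) / len(dep), sum(non) / len(non)


# The same seed gives the same patients with and without counselling.
before = simulate(treat_depression=False)
after = simulate(treat_depression=True)

dep_rate, non_rate = rate_by_depression(before)
print(f"Readmission rate, depressed patients:     {dep_rate:.3f}")
print(f"Readmission rate, non-depressed patients: {non_rate:.3f}")
print(f"Overall rate without counselling:         {readmission_rate(before):.3f}")
print(f"Overall rate with counselling:            {readmission_rate(after):.3f}")
```

Depressed patients are readmitted far more often than non-depressed patients, yet removing depression leaves the overall readmission rate exactly where it was. A data scientist looking only at the first two figures would have no way of telling this world apart from one in which counselling works.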
At this point, it is instructive to turn to another area of health – epidemiology. In this field, scientists have come to appreciate all too well the misleading results of studies based on correlations, which they term ‘observational’ studies (Davey Smith and Ebrahim 2001). For diverse health conditions, they have been able to compare the findings from observational studies with those from studies which directly test causal effects, notably randomised controlled trials. The correlations which emerge from observational studies are routinely found not to be an indication of causal effects. As a result, they are moving in precisely the opposite direction to that proposed by ‘big data’, turning their attention to more effective techniques for causal modelling, drawing on the approaches developed in economics and other areas.
Finally, consider another lesson from the medical world – that of publication bias. It has long been realised that the published results from medical trials can give a misleading impression of the efficacy of a particular drug because of the tendency for researchers (and the companies seeking to profit from the drugs) to cover up the results of studies with negative conclusions. Random error means that the results of trials for any given treatment will vary around the ‘true’ effect which should be revealed by averaging across these studies, as meta-analyses do. Suppress all the negative findings, however, and the observed average effect from published studies can look quite different. There is no doubt that the big data industry can offer up lots of positive examples where their approach provided insights that were useful. But how can we tell that there haven’t been as many occasions, or indeed more, where this approach failed?
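The effect of suppressing negative findings is easy to demonstrate with a toy simulation. The sketch below assumes a treatment with no true effect at all, a noise standard deviation of 1, and a publication rule under which only trials with a clearly positive estimate appear in print. All of these numbers, and the function names, are assumptions chosen for illustration rather than a description of any real set of trials.

```python
import math
import random
import statistics


def run_trials(n_trials=2000, n_per_trial=50, true_effect=0.0, seed=1):
    """Return one estimated effect per trial.

    Each trial estimates the effect as the mean of noisy measurements
    drawn around the (assumed) true effect with standard deviation 1.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n_per_trial)]
        estimates.append(statistics.fmean(sample))
    return estimates


def published_only(estimates, threshold):
    """Keep only the trials whose estimate exceeds the threshold."""
    return [e for e in estimates if e > threshold]


estimates = run_trials()

# Assumed publication rule: only estimates more than 1.645 standard
# errors above zero (one-sided 5% level) are published.
standard_error = 1.0 / math.sqrt(50)
published = published_only(estimates, threshold=1.645 * standard_error)

print(f"Trials run:                    {len(estimates)}")
print(f"Trials published:              {len(published)}")
print(f"Mean effect, all trials:       {statistics.fmean(estimates):.3f}")
print(f"Mean effect, published trials: {statistics.fmean(published):.3f}")
```

Averaging across every trial recovers an effect close to zero, as a meta-analysis of the full record should. Averaging across the published trials alone suggests a sizeable benefit from a treatment that, by construction, does nothing.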
We should not dismiss big data approaches. They have much to offer the social sciences. They offer us a real challenge to try to exploit types of data that we are not used to working with, and to learn from their sophisticated analytical techniques. But social scientists need to ensure that concerns about causality are kept at the centre of the picture, and that in turn requires the traditional social science expertise of good theory and good design.
Anderson, C. (2008) The end of theory: the data deluge makes the scientific method obsolete, Wired Magazine, 16 July. [http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory]
Davey Smith, G. and Ebrahim, S. (2001) Epidemiology: is it time to call it a day?, International Journal of Epidemiology 30 (1): 1-11.
Mayer-Schönberger, V. and Cukier, K. (2013) Big Data: a revolution that will transform how we live, work and think. Boston: Eamon Dolan.