blog | 22.06.2022 | Andrew McHugh

Answering the question, “What is ‘Research-ready’ data?”

In February this year, I was part of a diverse group of over 50 academics, information curators, research managers, statisticians and analysts assembled by ADR UK (Administrative Data Research UK) to consider the question of what makes administrative data ‘research-ready’.

Several themes, challenges and opportunities for further work were identified in the roundtable discussion that followed. The event has now been summarised in a report published by ADR UK.

Administrative data are data routinely produced by public bodies in the provision of services like healthcare or education. They can offer tremendous insights into the performance of these functions and the experiences of related stakeholders. While often an invaluable source of insights for researchers, administrative data are not created with research uses in mind, and a variety of processing and curation activities are often required to transform them into suitable ‘research-ready’ formats. Although administrative data are a relatively minor part of UBDC’s work, the challenges involved have much in common with those we face in our mission to enhance access to digital footprint data in general.

Although the term ‘research-ready data’ (RRD) is well established within ADR UK, there isn’t yet a consensus about what it means in practice. What is clear is the importance of minimising redundant effort, where multiple researchers perform the same data cleaning and preparation tasks, and of promoting parity in the quality and utility of data and associated services.

Event summary

Back to the February event. Introductory presentations from Louise McGrath-Lone (UCL) and Ben Gordon (HDR UK) respectively outlined the multi-dimensionality of ‘research readiness’ and introduced Health Data Research UK’s Data Utility Evaluation Framework. The variety of considerations and priorities, and the importance of context, were consistent themes throughout the event. HDR UK’s framework responds to this by providing a vocabulary for describing the usefulness of data, covering issues such as documentation, technical quality, coverage, access limitations and value (in terms of linkability and other enrichments). Each aspect is associated with several criteria, with examples of what bronze, silver, gold and platinum ratings look like. It’s a useful reference point, and we hope to use it in the near future as the basis of a simple maturity modelling exercise across UBDC data products.

The main part of the event was a breakout session where groups were tasked with exploring three questions:

  • How ‘good’ (clean, curated etc) does data need to be, to be useful?
  • ‘Transparency’ of data: What are the barriers to making broad, messy data available for research?
  • What mechanisms are needed to iteratively improve datasets through research which uses them?

What does it mean to be good?

My group had a wide-ranging discussion, reflecting first on researchers’ varying expectations around data quality, which are often informed by their area of interest, their technical abilities, or the tools and research environments available to them. Sometimes researchers will be most comfortable with - and most reassured by - a dataset that has been subject to minimal processing, whereas at other times a more curated and produced data product will appeal to those who favour greater accessibility and usability.

Irrespective of individual data specifications, there was widespread agreement on the value of formal documentation - making explicit not only the content of the data but also how it was created and any processing or curation activities that have been undertaken. Given potential users’ varied use cases, disciplinary associations, technical preferences and research questions, it’s also essential that documentation accounts for how a given dataset can be used, or has been designed to be used - not just the details of its origins.

Our group discussed the costs and complexities of making administrative data available, and the difficulties in justifying investment in data processing that may not directly support core service objectives. Even researchers - perhaps better placed to invest time and energy in producing data outputs that reflect the needs of peer communities - may not be particularly incentivised to do so compared with other academic priorities. We agreed that cultural change will be needed to elevate the prestige associated with code and data sharing to the same level as more established academic activities like research publishing. Because UBDC hosts both research and data service functions, the value of open science is well understood here, but that perspective is not necessarily widely shared.

More mess, less haste

We reflected too on the fact that messy data can mean greater risk exposure, which may discourage data owners from making them available. Even where more sensitive data are released, access to them may be limited to secure research environments, which some researchers find difficult to access, and which may be unable to scale to accommodate large volumes of data or to support complex or atypical types of analysis.

Speaking as part of an ESRC national data service, I was keen to make the point that intermediary bodies like UBDC are well equipped to work with data owners and producers, sparing them the costs of specifying, producing and curating research datasets. UBDC and similar organisations also facilitate data discovery and provide hubs where communities of researchers can articulate their needs and ensure datasets offer greater utility and usability. Finally, they provide a means of reinvesting researchers’ efforts to continuously enhance data products and specifications.

Next steps

The event was mainly a chance to articulate and assume shared responsibility for a problem - but also highlighted the existing organisations, relationships and opportunities that can have a big part in its solution. It’s perhaps no surprise that the event report includes more questions within its conclusion than its preamble. At UBDC, we’re excited about the role we can play to help achieve a common understanding of how data can be made more available, usable and valuable, which resonates equally with data owners, users and the beneficiaries of research.

Andrew McHugh

As Senior Data Science Manager, Andrew is responsible for the development, management and implementation of the data services, data collections and IT strategy of UBDC. He joined the Centre in March 2016.
