Assessing the value of data - a puzzle with many moving parts
As one may imagine, data value is a critical consideration at the Urban Big Data Centre.
That’s true in the administration of our national data service (which we provide on behalf of the Economic and Social Research Council) and within our research, where assessing data value is an essential starting point to ensure that foundational information can be relied upon.
In a recent blog, I covered the issue of 'Research-ready' data, summarising my experiences from a recent Administrative Data Research UK (ADR UK) event and exploring several factors that influence the utility of data sources for academic research. Here I expand on some of the concepts raised in my earlier blog and present details of how we recognise and determine value at UBDC. That often begins when assessing proposals for new data acquisitions, where potential value is of foremost importance but not always straightforward to determine.
UBDC’s research agenda and data service are primarily associated with digital footprints data, which, within the Centre’s urban research setting, take several forms. They include data produced from sensor systems, user-generated content from websites and mobile devices, administrative data produced as a result of government transactions, and private businesses’ operational or transactional data. In all cases, data value might be reduced to some product of information content, technical utility and permitted data usage - the latter usually a result of contractual, licensing, or other legal constraints. Other less tangible aspects may also be influential, such as relationships with data owners and publishers or broader trends in technology or regulation.
The collections development approach for UBDC data services is firmly research-led. That means being as responsive as possible to the needs of individual researchers and their wider communities. UBDC uses several mechanisms to maintain its understanding of researchers’ requirements. These include administering formal calls for expressions of interest, building thematic communities of researchers with shared interests, and liaising directly with existing and prospective service users. These engagements are intended to direct data acquisition efforts and to prompt advice and reaction to planned or proposed data investments. Whether expressed in terms of specific data products of interest or more primitive information requirements, an understanding of what research users themselves value is a critical starting point.
UBDC is responsible for investing in data for a wider UK community of researchers. To accomplish this as transparently and effectively as possible, we consider five main criteria when assessing the value of any prospective data acquisition. These are:
- Dataset content
- Dataset quality
- Terms and conditions of use
- Relationship with data owners
- Cost
Dataset content
This first category of criteria focuses primarily on the information, facts and knowledge encapsulated within a given dataset. Understandably, this will vary according to the theme, application or intended use of the dataset. Relevant considerations typically include:
- spatial and temporal coverage - which frames a given dataset in time and space
- information resolution (which may be manifested in terms of how spatially or temporally aggregated data is)
- the richness of available information (how many of the details we care about are included in or accessible from the dataset)
- elements that facilitate linking with other datasets.
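As a rough illustration of the coverage consideration above, the following minimal sketch summarises the temporal coverage of a dataset from its observation dates. The function name `temporal_coverage` and the sample dates are hypothetical and are not part of any UBDC tooling; the sketch assumes one date per record and daily granularity.

```python
from datetime import date


def temporal_coverage(observation_dates):
    """Summarise the time span of a dataset and how much of that span has data.

    Hypothetical helper for illustration only; assumes daily granularity.
    """
    if not observation_dates:
        raise ValueError("observation_dates must not be empty")
    unique_days = sorted(set(observation_dates))
    start, end = unique_days[0], unique_days[-1]
    span_days = (end - start).days + 1  # inclusive of both end points
    return {
        "start": start,
        "end": end,
        "days_observed": len(unique_days),
        "share_of_span_covered": len(unique_days) / span_days,
    }


# Made-up sample: three distinct days observed across a five-day span.
sample_dates = [date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 2), date(2023, 1, 5)]
summary = temporal_coverage(sample_dates)
print(summary["start"], summary["end"], summary["share_of_span_covered"])
```

A low share of the span covered suggests gaps that may matter for longitudinal work, although whether they do depends on the intended research use.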
A further content consideration is the nature of implicit biases. These can be challenging to determine and require some understanding of how data has been created, processed or curated - information that is not always available.
Content is often the first consideration of a prospective user - particularly if all other aspects are assumed in the first instance to be acceptable. If the information is lacking, insufficiently granular or has no means of interoperability with other datasets of interest, then it is likely to be of little use for research purposes.
Dataset quality
Data quality is a wide-ranging consideration, which is often approached as a set of technical or physical properties of data. A high-quality dataset is likely to offer:
- high levels of completeness (minimising null values and gaps in data) and correctness (data are authentic to what they purport to be, trustworthy and representative)
- minimal incidences of duplication (often a particular problem for datasets that pool data from other sources)
- uniformity or consistency (particularly where data is being captured periodically over time or from different sources)
- assurances regarding its integrity - minimising errors and bugs.
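Some of the properties listed above can be measured directly on sample data. The sketch below is a minimal illustration, assuming records arrive as a list of dictionaries; the function name `quality_profile`, the field names and the sample records are all hypothetical. It reports per-field completeness and the share of exact duplicate records, and says nothing about correctness or representativeness, which usually require external reference data.

```python
from collections import Counter


def quality_profile(records, fields):
    """Return per-field completeness and the exact-duplicate rate.

    Hypothetical helper for illustration only. A value counts as missing
    when it is None or an empty string.
    """
    total = len(records)
    if total == 0:
        return {"completeness": {}, "duplicate_rate": 0.0}

    completeness = {}
    for field in fields:
        present = sum(1 for r in records if r.get(field) not in (None, ""))
        completeness[field] = present / total

    # Records are duplicates when they match on every listed field.
    key_counts = Counter(tuple(r.get(f) for f in fields) for r in records)
    duplicates = sum(count - 1 for count in key_counts.values())

    return {"completeness": completeness, "duplicate_rate": duplicates / total}


# Made-up sensor records: one exact duplicate, one missing count, one missing timestamp.
sample_records = [
    {"sensor_id": "a1", "timestamp": "2023-01-01T00:00", "count": 12},
    {"sensor_id": "a1", "timestamp": "2023-01-01T00:00", "count": 12},
    {"sensor_id": "b2", "timestamp": "2023-01-01T00:00", "count": None},
    {"sensor_id": "b2", "timestamp": "", "count": 7},
]
print(quality_profile(sample_records, ["sensor_id", "timestamp", "count"]))
```

Running a check of this kind on a vendor's sample file gives an early, if partial, indication of what to expect from subsequent data supply.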
A critical consideration that straddles the line between content and quality is the availability of suitable documentation. In ideal cases, this should include: data schema and data dictionaries, manifests of supplied data points, details of how data was created and has been processed, and evidence of completed technical quality checks. Sample data can offer understanding and reassurance before committing to data purchasing or licensing, while also indicating the content, quality and data volume that should be maintained through subsequent data supply.
Poor data quality is often evident within datasets collated from multiple sources, with several contributors or based on poor-quality source data. Quality issues are harder to work with when no reasonable explanation for them can be found. Poor data quality need not be fatal. In some cases, it can be tolerated - particularly in larger data sets where an obvious strategy is simply to ignore any records or data points affected by quality concerns.
Terms and conditions of use
The usefulness of a given dataset or collection is rarely determined solely by intrinsic content and quality characteristics. Licensing terms and conditions are hugely influential determinants of data value and utility.
At UBDC, we prioritise several aspects when negotiating access to data. Open data - published and released under a recognised open data licence - is an attractive ideal but seldom available when dealing with commercial organisations that are often the owners or custodians of the most interesting digital footprints datasets. We recognise commercial pressures and concerns around widespread information disclosure. These require a balance to be struck between minimising friction to accessibility and maximising information completeness and utility.
Since UBDC operates a national data service primarily on behalf of the UK academic research community (although with other audiences too), a primary goal is to agree terms that permit as broad a range of purposes and potential users as possible. That typically means being able to support non-commercial academic research uses of licensed data.
Where licences restrict us to only being able to support a limited number of potential users, we must retain the discretion to manage data allocations and resist any terms that would purport to offer data vendors rights of veto over projects or proposed research outputs (other than based on limits that have been explicitly agreed up-front). It’s also vital that we agree to terms that provide sufficient time for research to be completed, including the publication of research outputs which may be subject to unpredictable timeframes. Where numbers or types of outputs produced by eligible users are restricted within terms and conditions, it’s critical that these are also not prohibitive to the process and completion of academic work. Our principles of data negotiation also require that any new intellectual property created in the course of related research projects is retained by a corresponding academic, project or organisation. Finally, it is vital to negotiate circumstances for data access that do not unreasonably restrict the completion of related research work. Where data have sensitivities related to, for example, privacy or commercial interests, it may be reasonable to require users to access and consume data only within secure, offline environments. But, if possible, we pursue agreements whereby end users may securely transfer data into their own - potentially certified - environments for research use.
Relationship with data owners
Beyond the legal formality of terms and conditions, there are other elements of relationship building with data vendors and providers which can contribute significantly to the value offered by a given dataset.
Where agreements with data owners can be framed as partnerships - going beyond strict commercial vendor/customer relationships - there are likely to be opportunities to add value to the data being supplied and consumed. This can yield collaboration and cooperation in the form of joint participation in data fora, events and activities, joint hackathons and innovation activities, student placement schemes and shared agenda setting.
To be successful, the benefits of such activities must be mutual; we commit lots of energy to making the case for why partnering with academic researchers can benefit commercial or public sector data owners. Researchers can validate the quality and representativeness of data, lending credibility to its role as evidence. They can generate interest in data from other organisations and communities prepared to license data themselves under commercial terms, stimulating and contributing to the sustainability of the market. And they can feed back the results of their efforts, enhancing the quality of original datasets and contributing results that are complementary to the aims of the data owner.
Collaborative relationships are also much more likely to offer continuity of supply - another aspect that influences the value of a given dataset. Within policy-facing research, an ongoing, contemporary reference point is critical. Without a means for comparison, data supplied in the past risks becoming little more than a historical curiosity. A good relationship that works for both data supplier and consumer is likely to have a better chance of being long-lasting.
More generally, relationships built on communication, collaboration and openness are much more likely to deliver data products suited to research goals.
Cost
The price payable for any given digital footprints dataset varies - often as a result of commercial considerations quite irrelevant to anticipated research or policy outcomes. Such outcomes do not always lend themselves to straightforward financial valuation, so affordability and price comparison with alternative data sources can offer a more meaningful representation of value.
Costs need not always be in terms of a price payable to the data owner; they may be associated with person-hours needed to implement a data collection platform or data infrastructure service costs (such as cloud storage costs). Even non-monetary costs - like the time taken to build a data collection from APIs with strict request limits - should be considered when assessing value.
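To make the non-monetary cost of rate-limited APIs concrete, the sketch below estimates how long it would take to assemble a collection under a fixed request limit. The function name `collection_time_hours` and all of the figures are hypothetical; the sketch assumes a constant page size and a limit that is fully used, with no retries or downtime.

```python
import math


def collection_time_hours(total_records, records_per_request, requests_per_hour):
    """Estimate the hours needed to collect a dataset from a rate-limited API.

    Hypothetical helper for illustration only; ignores retries, outages
    and any processing time between requests.
    """
    if total_records < 0 or records_per_request <= 0 or requests_per_hour <= 0:
        raise ValueError("inputs must be positive (total_records may be zero)")
    requests_needed = math.ceil(total_records / records_per_request)
    return requests_needed / requests_per_hour


# Made-up figures: 1,000,000 records, 100 per request, 500 requests per hour.
hours = collection_time_hours(1_000_000, 100, 500)
print(f"Estimated collection time: {hours:.1f} hours")
```

Under these assumed figures the collection needs 10,000 requests and about 20 hours of continuous running, a cost worth weighing against the price of a comparable licensed extract.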
Making sense of it all
While UBDC has established processes for assessing and approving its data investments, value remains a complicated thing to determine. Clarity of need and purpose provides an essential starting point, and a specification against which the usability and utility of data offerings can be evaluated. Beyond that, having the means to compare what’s out there is really useful - there’s seldom such a thing as a perfect dataset available under the perfect terms. Instead, we compare to see where the fewest compromises are required, or where there are the greatest opportunities to be as successful as possible. UBDC’s longevity - at the time of writing, we’ve been doing this for more than 8 years - provides us with a great vantage point to recognise those data products and services that offer good value.
If you know of any potential dataset acquisitions that could deliver value for your research community, please contact us at firstname.lastname@example.org to make a suggestion.