A fundamental principle of research data stewardship and sustainability is the curation and preservation of high-value datasets.
The term “high-value datasets” is tossed around quite a bit and often used in reference to the anticipated value of a research dataset presently being collected. It’s also often asserted that it’s really difficult, if not impossible, to reach agreement about which datasets are of high value and which are not, and the reasoning devolves into acquiescence that the true value of a dataset lies in the eye of the beholder. This just isn’t going to work as a basis for the strategy for DataScience@NIH. We have a responsibility to science and to the public to be good stewards of data and to ensure its sustainability.
The concept of high-value datasets is deeply linked to the doctrine of Open Government: that citizens exercise oversight of the government through access to documents and proceedings. High-value datasets include high-value information, which the Open Government Directive defines as "information that can be used to increase agency accountability and responsiveness; improve public knowledge of the agency and its operations; further the core mission of the agency; create economic opportunity or respond to need and demand as identified through public consultation."
The term “high-value datasets” isn’t new; it’s been discussed for almost a decade. In the US, high-value datasets are archived through data.gov and include a wide range of datasets, from the Nuclear Regulatory Commission’s Daily Power Reactor Status Report to the White House Tapes of the Nixon Administration, 1971-1973. (Talk about variety!) These datasets were created in the process of government operation and address at least two of the criteria enumerated by the Open Government Directive. While one can imagine that these datasets are of value to research, the way the term “high-value dataset” is bandied about in the research context is decidedly different.
From the NIH perspective, the “value” in “high-value dataset” refers to the value to the research endeavor.
The dataset may have been purposely created for a specific research project, or it may have emerged from processes related to biomedical phenomena or healthcare, as electronic health records do.
The value may be retrospective, in that the dataset includes the evidence upon which scientific discoveries or evidence for practice rest. The value may also be in anticipation of future benefit, such as affording secondary analysis or extended exploration to discover new knowledge.
There are many criteria that could be used to determine the value of a dataset. I will present three, and remind the reader that my perspective is grounded in stewardship and sustainability.
Datasets that are of high value first and foremost must be of high quality: collected in accord with appropriate practices, relevant to important research themes, and with sufficient internal integrity (indexing, provenance, etc.) to provide assurance of the technical quality of the data.
Second, datasets that are of high value are so in part because replicating or reproducing the dataset is infeasible. Perhaps the dataset is unique and could not be re-created, or perhaps re-creating it would be cost-prohibitive.
My last criterion for characterizing high-value datasets relates to their usefulness. Please note that this refers not simply to the sheer number of people who might make later use of a dataset, but also to the potential impact of the rare-but-important later use of a dataset.
Meeting our responsibility to science and to the public to be good stewards of data and to ensure its sustainability requires engagement with the “community” in the determination of the value of a dataset. To some, that community is bounded by the research domains of those who create or are likely to use the dataset. But because of our status as a public organization, the conversation about the value of a dataset must involve the public, who provide the funding for our research and for the preservation of the data, and who stand to benefit from discoveries enabled by effective use of high-value datasets.
What criteria do YOU think NIH should consider in evaluating the value of datasets and who should be involved in those discussions?