Conceptualized in 1813 and later formalized in 1962, the Federal Depository Library Program made US federal publications freely accessible by allowing agencies to send documents postage-free to libraries across the country for deposit. It was perhaps one of the government's earliest attempts at an "open" framework for providing information to the public. While the push for open data is not new, there remains a lack of consensus on how best to leverage secondary use of the data. To aid this endeavor, the research community has recently adopted "FAIR" data principles, pledging to make data "Findable, Accessible, Interoperable, and Reusable." But what does it mean to make data FAIR, and how can open data be used to its utmost potential?
To answer this question, one must understand where the data came from in the first place, which is precisely what many secondary researchers grapple with when they first encounter a dataset. It has been estimated that up to 90% of a researcher's time is spent cleaning and deciphering data. Despite this, data science in some areas of research seems more focused on analysis than on provenance, creating a false dichotomy between the two. Provenance lies at the heart of the scientific process: researchers carefully examine data (and any associated documentation) to understand the evolution and methodology behind it. Provenance also relates to other important concepts such as data flow, lineage, and traceability. All these elements elucidate data in different ways and are sometimes used interchangeably.
Unfortunately, there is no universal, formal definition of what constitutes "good" provenance or how to achieve it; it is easier to detect the glaring absence of provenance than it is to agree on a single definition. To complicate matters further, it is often unclear to what extent investigators have maintained provenance at all. Ideally, investigators should address data provenance early in a study; yet provenance is oftentimes an afterthought, rendering the exercise all the more difficult. In such cases, the value of the data is diminished, leading to a data paradox: vast amounts of data with very little insight.
Data provenance should and can be used to overcome the data paradox by distinguishing the novelty from the noise. Performing basic data elucidation such as saving and annotating code, documenting steps taken to manipulate data, and tracking data flow over time can promote a study’s repeatability and reproducibility. Investigators collecting data need platforms to facilitate provenance and data lineage upstream, so that researchers and analysts downstream can understand variable evolution, analyze and explain intricacies, as well as address potential data limitations. Above all else, a keen attention to detail and deep understanding of research data structures are essential for successful and sustained provenance.
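The practices above, saving and annotating code, documenting manipulation steps, and tracking data flow over time, can be made concrete with very little tooling. The following is a minimal sketch in Python (standard library only); the `ProvenanceLog` class, `fingerprint` helper, and the toy cleaning step are illustrative assumptions, not a prescribed standard.

```python
import hashlib
import json
from datetime import datetime, timezone

def fingerprint(rows):
    """Hash the dataset's serialized contents so any later change is detectable."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

class ProvenanceLog:
    """Append-only record of each transformation applied to a dataset."""

    def __init__(self):
        self.steps = []

    def record(self, description, rows):
        # Each entry captures what was done, when, and the resulting data state.
        self.steps.append({
            "step": len(self.steps) + 1,
            "description": description,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "sha256": fingerprint(rows),
        })

# Hypothetical example: track a two-step cleaning pipeline.
log = ProvenanceLog()
data = [{"id": 1, "age": 34}, {"id": 2, "age": None}, {"id": 3, "age": 51}]
log.record("raw intake", data)

data = [row for row in data if row["age"] is not None]  # drop records with missing age
log.record("removed records with missing age", data)

print(json.dumps(log.steps, indent=2))
```

Even a log this simple lets a downstream analyst see that a filtering step occurred, when it occurred, and that the data changed as a result (the hashes differ), which is exactly the kind of lineage the paragraph above calls for.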
As the research community continues to define the data science domain, data provenance will be key to producing FAIR, meaningful data. Much of this depends on study investigators not holding back their data, which is why provenance should be viewed as part of a larger data science infrastructure. Developing this infrastructure requires a holistic approach in which data standards, formats, interoperability, security, privacy, costs, and policy levers are all viewed through the lens of data provenance. In today's digital age, simply depositing data ad nauseam and hoping for the best is a disservice to the participants involved as well as to the time and money invested in the data. Open data are not enough: data must be properly tracked, managed, and explained. This not only unlocks the value of the data; it offers an opportunity for a better-informed participant, researcher, and citizen.
About the Author:
Dina Mikdadi is a Data Scientist at Booz Allen Hamilton, a consulting and technology firm. Dina possesses a wide range of experience in health and science research, data management, and analytics. She is part of the Health team at Booz Allen working at the intersection of science, analytics and technology. Her other areas of expertise include public health, program evaluation, policy analysis, and statistical programming. She currently works with the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD).