Data sharing and reuse is a concept that’s rapidly gaining ground. But researchers remain understandably cautious about the quality and trustworthiness of data collected by others.
To move forward with the grand visions for Big Data, we must overcome this suspicion and find ways to ensure the data we store and make available are trustworthy.
But what does “trustworthiness” mean in this context?
Trustworthiness encompasses both the quality of the data and sustainable, reliable access to it. Together, these enhance scientific reproducibility by ensuring the data are selected, collected, organized, and stored using agreed-upon, established criteria.
But what are those criteria?
The European Framework for Audit and Certification proposed three levels of certification for a Trusted Digital Repository (TDR). Each level has different requirements to address different needs. The three certification levels are Core, Extended, and Formal, also referred to as Bronze, Silver, and Gold.
The levels are meant less to convey a hierarchy of superiority or inferiority than to satisfy the minimum standards appropriate to different types of data, such as basic research data vs. human health data vs. financial transaction data.
The major assessment areas are the same for all three levels and include:
- Organization
- Management of intellectual entities and representations
- Infrastructure
- Security
The differences reside in how the audit is performed and the number of factors evaluated in that audit.
The following table summarizes the three levels and how their certifications differ:
DSA: Data Seal of Approval
Ultimately, the choice of certification depends on how much a repository is willing to invest in its perceived prestige and good operational practices. Certification costs include both the certification fees themselves—which range from free of charge to $10,000 per year—and the time or personnel costs spent on preparation. The latter can be substantial because certifying a repository can take months, depending on its maturity and audit readiness.
A certification’s lifespan also varies. The Extended level of certification is valid indefinitely, but will need to be updated to stay relevant, while the Core and Formal levels are valid for three years each. Changes in technology and user needs can also drive re-certification.
Currently, due to the time needed to train auditors, the ISO has yet to certify a repository at the Formal level.
A fourth option, outside the European framework, arose from the Center for Research Libraries (CRL), a consortium for academic libraries in the United States and Canada. The CRL’s approach to certifying data repositories uses the Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC). TRAC is based on the concepts of the Open Archival Information System (OAIS; ISO 14721:2002) and is the precursor checklist to ISO 16363:2012. While the CRL is not an accredited certifying body, this group has used TRAC to audit digital repositories since 2005.
Where does this leave us?
Two decades in, we are still early enough in the Big Data era that both what constitutes “trustworthiness” and how to identify and evaluate it, for example through the certification options above, are still evolving. The validity and helpfulness of certifications themselves are also under debate, as some experts highlight their potential to be misused or misunderstood.
As the conversations continue, NIH is committed to working with the community to determine the best ways to validate and certify data. Last year, NIH issued a Request for Information (RFI) on Metrics to Assess Value of Biomedical Digital Repositories, and the 2016 BD2K All-Hands Meeting focused on the topic during a session on sustainability. Based on stakeholder feedback, we know there is a clear and immediate need to evaluate the value and performance of data repositories. TDR certifications offer an interesting and potentially useful avenue to address that.
-------
Acknowledgments:
The author would like to thank Allen Dearry (NIEHS), Susan Gregurick (NIGMS), and Gabriel Rosenfeld (NIAID) for insightful feedback, edits, and stimulating discussions. David Giaretta, Mustapha Mokrane, Christian Keitel, and Marie Waltz helped review and edit the factual content of the certification descriptions for ISO, WDS/DSA, DIN, and TRAC, respectively.
About the Author:
Dr. Dawei Lin is the Associate Director for Bioinformatics and the Senior Advisor to the Director of the Division of Allergy, Immunology, and Transplantation (DAIT) at the National Institute of Allergy and Infectious Diseases (NIAID). In this capacity, Dr. Lin leads a Data Science Group in developing and administering scientific and infrastructure programs, including ImmPort, Informatics Methodology and Secondary Analyses for Immunology Data, and the Statistical and Clinical Coordinating Center (SACCC) for DAIT-sponsored clinical research. This Data Science Group also uses Big Data-based approaches in grant portfolio analysis and advises DAIT on data sharing issues and policy. In his trans-NIH efforts, Dr. Lin is the Program Officer for bioCADDIE, NIH's Big Data to Knowledge (BD2K) Data Discovery Index initiative, and actively participates in discussions regarding sustainability issues for data repositories.