A cost-benefit analysis of persistent identifiers across the NIH-funded biomedical research sector (OD/OER)

Project Point of Contact: Dr. Marianna Mertts, Director, and Ms. Claudette McBeth, Lead Administrative Officer, Strategic Management and Contracting Office, OER

Goals and Objectives: Understand the potential benefits of persistent identifiers (PIDs) for strengthening infrastructure that supports metadata re-use, automation of data systems, and improved quality of disambiguated data elements with clearer relationships between them. Estimate the cost savings from expected improvements in the productivity of researchers and research support staff, weighed against the costs incurred to integrate PIDs into funder and research organization data systems. Make recommendations for an NIH-wide PID strategy that is aligned with existing recommendations made by the Subcommittee on Open Science, Office of Science & Technology Policy (https://doi.org/10.5479/10088/113528).

Significance: NIH-wide adoption of PIDs could result in significant cost savings to the NIH-funded research sector through improved productivity of researchers and research support staff. A critical examination of the potential costs and benefits could help inform decisions regarding adoption of PIDs.

Description: A persistent identifier is a long-lasting and reliable reference to a digital entity. At NIH, PIDs could be used to identify researchers, research institutions, grants, research outputs (including publications and data sets), research instruments, and other resources of relevance to biomedical research. PIDs are connected to registries of metadata about those entities, which facilitate robust linking between the various entities. The use of this metadata establishes data provenance and attribution in a manner consistent with the FAIR (Findable, Accessible, Interoperable, Reusable) principles.

Recently released cost-benefit analyses of PID adoption for the UK PID Consortium (doi: 10.5281/ZENODO.4772627, July 2021) and in Australian research systems (doi: 10.5281/ZENODO.7100578, September 2022) have found significant savings. The UK study, which considered only adoption of PIDs for researchers and research articles, estimated direct savings of £5.67M per year based on PID adoption rates of 85% within 5 years, and projected a far greater benefit to the UK economy based on the impact of more efficient research. The Australian study estimated a direct financial benefit of $24M per year, due primarily to a reduced need for researchers and research support staff to conduct tedious manual data entry.

In conducting a cost-benefit analysis for NIH-funded research, consideration should be given to the cost to research institutions of integrating PIDs into their research data systems, as well as the cost to funding agencies of building infrastructure to support PID adoption.

As with the UK and Australian studies, informed assumptions should be made regarding the cost of performing data entry tasks, based on the size and cost of the workforce. The final report should recommend a strategy that identifies a target set of PIDs to adopt and describes how to integrate them into research organization systems and NIH grants management systems. Considering the January 2023 release of the NIH Data Management and Sharing Policy (https://sharing.nih.gov/data-management-and-sharing-policy), recommendations concerning the use of PIDs for shared data would be timely.
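
The arithmetic behind such an estimate can be illustrated with a minimal sketch. It assumes a simple model in which annual savings equal workforce size times hours saved per person times hourly labor cost times adoption rate, less an annual integration cost. The class name, field names, and all figures below are hypothetical placeholders, not values from the UK or Australian studies or from NIH data.

```python
from dataclasses import dataclass


@dataclass
class PidScenario:
    """Hypothetical inputs for a first-pass PID cost-benefit estimate."""
    staff_count: int                # people who currently re-key metadata by hand
    hours_saved_per_year: float     # hours saved per person at full adoption
    hourly_cost: float              # loaded labor cost per hour
    adoption_rate: float            # fraction of the workforce using PIDs, 0 to 1
    annual_integration_cost: float  # yearly cost of integrating and running PID systems

    def __post_init__(self) -> None:
        if not 0.0 <= self.adoption_rate <= 1.0:
            raise ValueError("adoption_rate must be between 0 and 1")


def annual_gross_savings(s: PidScenario) -> float:
    """Labor cost avoided per year, scaled by the adoption rate."""
    return s.staff_count * s.hours_saved_per_year * s.hourly_cost * s.adoption_rate


def annual_net_benefit(s: PidScenario) -> float:
    """Gross savings minus the annual integration cost."""
    return annual_gross_savings(s) - s.annual_integration_cost


# Placeholder figures chosen only to make the arithmetic easy to follow.
example = PidScenario(
    staff_count=1000,
    hours_saved_per_year=10.0,
    hourly_cost=50.0,
    adoption_rate=0.5,
    annual_integration_cost=100_000.0,
)
print(annual_gross_savings(example))  # 250000.0
print(annual_net_benefit(example))    # 150000.0
```

A fuller analysis would likely vary these inputs over ranges (for example, adoption rate by year) rather than rely on single point estimates.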

Data set(s) involved: Primary data sources would include the IMPAC II grants database, PubMed, Scientific Publication Information Retrieval and Evaluation System (SPIRES), and ORCID records available via an API. Additional data would likely need to be gathered from public data sources (e.g., Bureau of Labor Statistics) and targeted surveys of research organizations.
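
As an illustration of working with one of these sources, the following sketch validates an ORCID iD using its published check-digit scheme (ISO 7064 MOD 11-2) and builds a record URL for it. The base URL and path layout are assumptions about the ORCID public API and should be confirmed against current ORCID documentation; the function names are hypothetical, and no network request is made.

```python
import re

# ORCID iDs are 16 characters in four hyphen-separated groups; the last
# character is a check digit that may be "X".
ORCID_PATTERN = re.compile(r"^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$")

# Assumed public API base URL; verify against ORCID documentation before use.
ASSUMED_API_BASE = "https://pub.orcid.org/v3.0"


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check digit for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the iD is well-formed and its check digit matches."""
    if not ORCID_PATTERN.match(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]


def orcid_record_url(orcid: str, base: str = ASSUMED_API_BASE) -> str:
    """Build the (assumed) record URL for a validated ORCID iD."""
    if not is_valid_orcid(orcid):
        raise ValueError(f"not a valid ORCID iD: {orcid!r}")
    return f"{base}/{orcid}/record"


# 0000-0002-1825-0097 is the iD of ORCID's fictitious example researcher.
print(is_valid_orcid("0000-0002-1825-0097"))
print(orcid_record_url("0000-0002-1825-0097"))
```

Validating identifiers before linking them to grant or publication records is one way to keep malformed iDs out of downstream matching.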

Anticipated outcomes of the project: Cost-benefit analysis, recommendations, presentations to leadership and interest groups, final report, and publication in a peer-reviewed journal.

Required skills of the DATA Scholar: A successful Scholar will have completed doctoral-level training, or have equivalent experience, in the physical or life sciences, mathematics, statistics, economics, computer science, or data science. Additional required skills include the ability to wrangle complex data and perform SQL queries; skill in analyzing data in R, Python, or Stata; and highly developed skills in oral and written communication. Desirable but not required skills include knowledge of the NIH grants ecosystem; knowledge of PIDs; and familiarity with Shiny for deploying analysis results in R.

Expected/preferred length of DATA Scholar appointment: 1 year.

Expected/preferred time effort commitment of the DATA Scholar: Full time (100%)

Remote work preference: 100% remote allowable

ICO support: ORRA Division Directors and Associate Directors will be broadly available for guidance and support. In particular, Calvin Johnson, Ph.D., Associate Director, Data Quality, is a Staff Scientist who can serve as technical mentor. Resources for computing equipment, services, and software will be made available to the DATA Scholar commensurate with the needs of the project.

Additional activities: The Scholar will be given the opportunity to participate in planning and coordination around various strategic initiatives in ORRA, particularly those involving PIDs. The Scholar will also be given the opportunity to participate in the ORRA Data Lab environment.

Career or professional development opportunities: Opportunities for appropriate training, travel, and participation in technical and scientific conferences, NIH interest groups, and other relevant high-level meetings will be made available to the Scholar.

To apply to this or other DATA Scholar positions, please see instructions here: datascience.nih.gov/data-scholars.