Biomedical Data Repositories and Knowledgebases

About Biomedical Data Repositories and Knowledgebases

To better support a modern data resource ecosystem, NIH makes a distinction between data repositories and knowledgebases. While both are important for advancing biomedical research, data repositories and knowledgebases can have unique functions, metrics for success, and sustainability needs.

Sustaining a healthy and productive data resource ecosystem means that each component:

Delivers scientific impact to the communities that it serves
Employs and promotes good data management practices and operates efficiently while maintaining the quality of its data and services
Engages with the user community and continuously addresses their needs
Supports a process for data life-cycle analysis
Engages in exploration of the current landscape of biomedical data repository metrics to assist NIH in better understanding how datasets and repositories are used
Provides long-term preservation and trustworthy governance

Both data repositories and knowledgebases contribute to the NIH data resource ecosystem

Data Repositories

Biomedical data repositories accept the submission of relevant data from the research community to store, organize, validate, archive, preserve, and distribute data in compliance with the FAIR Data Principles.
Curation focuses on quality assurance and quality control.
Example: core data might include genome, transcriptome, and protein sequences or imaging or spectroscopic data

Knowledgebases

Biomedical knowledgebases extract, accumulate, organize, annotate, and link the growing body of information that is related to, and relies on, core datasets.
Significant levels of human curation are traditionally required.
Example: information about expression patterns, splicing variants, localization, protein-protein interaction, and pathway networks related to an organism or set of organisms; publication information

View Trans-NIH BioMedical Informatics Coordinating Committee (BMIC) Data Sharing Resources.

Metrics and Lifecycle

Data repositories and knowledgebases exist on a spectrum of ability and readiness to adopt the desirable characteristics aligned with FAIR and TRUST principles. Due to the critical nature of research data resources, repositories, and datasets, the development of metrics to evaluate the usage, utility, and impact of a given repository is essential. To that end, NIH conducted a survey and organized a workshop to better understand both existing and desired lifecycle metrics. NIH then issued a report presenting the findings, which describe the metrics currently used within the biomedical repository community and can inform future NIH efforts to develop this space and to understand patterns of use across datasets and repositories.

Open Funding Opportunities

(Open) Promoting Data Reuse for Health Research (NOT-OD-24-096), April 30, 2024
(Open) Enhancement and Management of Established Biomedical Data Repositories and Knowledgebases (PAR-23-237), August 31, 2023
(Open) Early-stage Biomedical Data Repositories and Knowledgebases (PAR-23-236), August 31, 2023
FAQs for PAR-23-237 and PAR-23-236
Notice of Pre-Application Webinar for the NIH Biomedical Data Repositories and Knowledgebases Program (DRKB) (NOT-OD-24-097), April 11, 2024. A recording of the webinar is available: [Recording]

Closed Funding Opportunities

(Closed) Support for existing data repositories to align with FAIR and TRUST principles and evaluate usage, utility, and impact (NOT-OD-23-044), FAQs, January 5, 2023
(Closed) Support for existing data repositories to align with FAIR and TRUST principles and evaluate usage, utility, and impact (NOT-OD-22-069), January 31, 2022
(Closed) Administrative Supplements Available to Strengthen NIH-Funded Biomedical Data Repositories (NOT-OD-21-089), April 6, 2021
(Closed) Biomedical Data Repository (PAR-23-079), May 9, 2023
(Closed) Biomedical Knowledgebase (PAR-23-078), May 9, 2023
(Closed) NIH released two funding opportunities to support biomedical data repositories and knowledgebases, January 17, 2020
- Biomedical Data Repository (PAR-20-089)
- Biomedical Knowledgebase (PAR-20-097)

Funded Awards

PAR Award Recipients
Grant Number	Contact Principal Investigator	Project Title	Announcement Number
EB037545	Thomas Heldt	PhysioNet: A FAIR and sustainable data repository	PAR-23-237
CA302705	Melissa Suzanne Cline	The BRCA Exchange Knowledgebase for Clinical Cancer Genetics	PAR-23-237
NS143946	Samden Lhatoo	International Seizure and SUDEP Research Repository (InSSURR)	PAR-23-236
CA295532	Jill P. Mesirov	The Molecular Signatures Database: A knowledgebase for gene set based analysis of genomic data	PAR-23-237
DE034163	Carl Kesselman	USC Facebase IV Craniofacial Development and Dysmorphology Data Management and Integration Hub (FaceBase IV)	PAR-23-237
DK141185	Shankar Subramaniam	Metabolomics Workbench - National Metabolomics Data Repository	PAR-23-237
HG003345	Elspeth Bruford	The Nomenclature of Human and Vertebrate Genes	PAR-23-237
HG010859	Paul Warren Sternberg	Alliance Central: A platform for sustainable development of next generation genome knowledgebases	PAR-23-237
AR084725	Peter Maye	Archiving and Sharing Skeletal Phenotyping Data	PAR-23-236
AR085003	Noel P. Burtt	The Musculoskeletal Knowledge Portal	PAR-23-236
AR085006	Jeffrey B. Driban	The OAI Collaborative Osteoarthritis Research Enterprise (OAI CORE) Knowledgebase	PAR-23-236
ES036917	Itai Kloog	GeoSpace - GeoSpatial Knowledgebase for Exposomics	PAR-23-236
AI179612	James E. Gern	Children's Allergy and Asthma Data Repository (CADRE)	PAR-23-079
AI181685	Mackenzie Cottrell	HIV Pharmacology Data Repository	PAR-23-079
CA285296	Allan C. Halpern	ISIC-REPO; ISIC Skin Imaging Repository Enhancements for Promoting Interoperability and Utilization	PAR-23-079
HL171672	Arlene A. Stecenko	The Georgia Cystic Fibrosis Data Warehouse	PAR-23-079
NS134536	Joost B. Wagenaar	Pennsieve: Impactful Multimodal Data Sharing for Epilepsy Research	PAR-23-079
HG013300	Norbert Perrimon	FlyBase: A Drosophila Genomic and Genetic Database	PAR-23-078
GM144308	Anita Elzbieta Bandrowski	From RRID to Resource Watch: A Knowledgebase of Biomedical Research Resources	PAR-20-097
ES035386	Dinesh Barupal	Exposome Correlation and Interpretation Database (ECID)	PAR-20-097
HG007822	Alex Bateman	UniProt: A Protein Sequence and Function Resource for Biomedical Science	PAR-20-097
AI177622	Lindsay G. Cowell	i-AKC: Integrated AIRR Knowledge Commons	PAR-20-097
GM144232	Michael K. Gilson	BindingDB: An Open Knowledgebase of Protein-Small Molecule Interactions	PAR-20-097
CA275783	Malachi Griffith	Creation of a knowledgebase of high quality assertions of the clinical actionability of somatic variants in cancer	PAR-20-097
GM142435	Marc S. Halfon	REDfly: The regulatory sequence resource for Drosophila and other insects	PAR-20-097
HG012556	Carol Marie Hamilton	Establishing the PhenX Toolkit as a Biomedical Knowledgebase	PAR-20-097
AI171008	Yongqun He	VIOLIN 2.0: Vaccine Information and Ontology LInked kNowledgebase	PAR-20-097
GM150703	Peter D. Karp	Knowledgebase of Escherichia coli Genome and Metabolism	PAR-20-097
HG010615	Teri Ellen Klein	PharmGKB	PAR-20-097
AI162625	Elliot J. Lefkowitz	Virus Taxonomy: A Community Knowledgebase Supporting Virus Research	PAR-20-097
ES033155	Carolyn J. Mattingly	Comparative Toxicogenomics Database (CTD)	PAR-20-097
HG012750	Nicola Mulder	African Genomics Data Hub Biomedical Knowledgebase	PAR-20-097
GM143402	Mark A. Musen	BioPortal: An Expansive Knowledgebase of Biomedical Entities and Relations	PAR-20-097
HG012542	Helen Parkinson	Strengthening community knowledge bases for genetic association studies and polygenic scores, the GWAS and PGS Catalogs	PAR-20-097
HG012557	Lynn Marie Schriml	The Human Disease Ontology: An integrated, mechanistic knowledge resource for biomedical research.	PAR-20-097
HG012198	Lincoln D. Stein	Reactome: An Open Knowledgebase of Human Pathways.	PAR-20-097
HG002223	Paul W. Sternberg	WormBase: a core data resource for C. elegans and other nematodes	PAR-20-097
HG012212	Paul D. Thomas	Gene Ontology Consortium and Knowledgebase	PAR-20-097
GM146616	Michael Tiemeyer	GlyGen growth and evolution into a central resource for glycans and glycoconjugates	PAR-20-097
ES035214	Alexander Tropsha	Supporting Biomedical Discovery with the ROBOKOP Graph Knowledgebase.	PAR-20-097
CA265879	Jeremy Lyle Warner	Enhancing the HemOnc Knowledgebase of Chemotherapy Drugs and Regimens	PAR-20-097
GM148372	Nuno Bandeira	Global proteomics mass spectrometry data sharing infrastructure	PAR-20-089
NS122732	Adam R. Ferguson	Pan-Neurotrauma Data Commons	PAR-20-089
GM150793	Jeffrey C. Hoch	Biological Magnetic Resonance Data Bank Base	PAR-20-089
NS132940	Jonathan Rosand	An Imaging Repository for the Cerebrovascular Disease Knowledge Portal (iCDKP)	PAR-20-089
AA029959	Samuel S. Wu	Southern HIV and Alcohol Research Consortium Biomedical Data Repository	PAR-20-089

NOT Award Recipients
Grant Number	Contact Principal Investigator	Project Title	Notice Number
DC014664-07	Julius Fridriksson	Improving usage of the Aphasia Research Cohort (ARC) repository	NOT-OD-23-044
HG012198-02S1	Lincoln D. Stein	Optimizing Reactome TRUST	NOT-OD-23-044
HD048404-18	John E. Marcotte	Improving DSDR's FAIRness Through Improved Infrastructure and Enriched Expanded Metadata Exports	NOT-OD-23-044
HG006370-11	Helen E. Parkinson	Mouse Phenotyping Informatics Infrastructure - Data acquisition, integration, analysis and translation of high throughput mammalian phenotyping data.	NOT-OD-23-044
AG066793-02	Antonella Zanobetti	National Cohort Studies of Alzheimer's Disease, Related Dementias and Air Pollution	NOT-OD-22-069
EB029173-04S1	Christian Haselgrove	Neuroimaging Informatics Tools and Resources Collaboratory: Outreach, Infrastructure and Maintenance	NOT-OD-22-069
OD011883-10A1	Melissa Haendel	The Monarch Initiative: Linking diseases to model organism resources	NOT-OD-22-069
HG007822-09S1	Alex Bateman	UniProt building community metrics for FAIR and TRUSTworthy resources	NOT-OD-22-069
ES026555-04	Susan Teitelbaum	Human Health Exposure Analysis Resource (HHEAR) Data Center	NOT-OD-22-069
GS-35F-0442V, 75N97021F00100	Alison Garcia	Modernizing the Federal Interagency Traumatic Brain Injury Research (FITBIR) repository to better support the NIH data ecosystem and advance TBI biomedical research	NOT-OD-22-069
GS-35F-0442V, 75N97021F00100	Alison Garcia	NEI BRICS – Harnessing the Power of Data in Vision Research	NOT-OD-22-069
HHSN26110071	Andrey Fedorov	UDash - a Usage Dashboard for the Imaging Data Commons	NOT-OD-22-069
DK072476-16S3	Eric Ravussin	Improving FAIR-ness and TRUST-worthiness of the Pennington/Louisiana NORC Biorepository. In 2018, the Pennington/Louisiana Nutrition Obesity Research Center (NORC; P30 DK072476-16) established a repository of human subjects’ data and biospecimens of nutrition and obesity research. Currently, the repository includes data and biospecimens from 213 studies and 13,787 unique participants (68% women and 38% with obesity) funded by the National Institutes of Health, Department of Defense, United States Department of Agriculture, American Heart Association, American Diabetes Association and other government and non-profit organizations. In September 2020, an online portal (https://my.pbrc.edu/NORC/NORCRepository/Landing) was opened to allow people to independently search the cadre of available data. As we transition to increase usage, it is imperative that we align with the FAIR and TRUST principles and ensure we can appropriately track usage, utility, and impact. In response to NOT-OD-21-089, we have developed a comprehensive but conservative one-year project to achieve these goals. The overarching objective of this administrative supplement is to improve upon the “FAIR”-ness and “TRUST”-worthiness of the Pennington/Louisiana NORC Biorepository and its online portal. In aim 1, we will improve “FAIR”-ness by adding existing data, increasing metadata, and establishing metrics for tracking and usage. In aim 2, we will improve “TRUST”-worthiness by promoting and demonstrating the methods used for data collection. Finally, aim 3 will explore the possibility for certification. This repository provides unique data on nutrition and obesity and seeks to benefit researchers across the country for years to come.	NOT-OD-21-089
DA028420-17S1	Molly A. Bogue	Mouse Phenome Database: Making it More FAIR-compliant and TRUST-worthy. The Mouse Phenome Database (MPD; https://phenome.jax.org) is a widely accessed NIH-supported Biomedical Data Repository focused on primary mouse phenotype data from genetic studies of complex traits in strains and populations. For over 20 years, MPD has been a community resource developed at The Jackson Laboratory, an independent non-profit research institute, that has disseminated mouse genetic data and resources to the biomedical community since its founding. MPD, listed in the Trans-NIH Biomedical Informatics Coordinating Committee registry, accepts submission of relevant data from the community to store, organize, validate, archive, preserve, and distribute the core data from phenotyping experiments to end users in an increasingly FAIR compliant manner. We have discovered challenges and opportunities in meeting evolving requirements in enhancing MPD’s FAIR capabilities and also meeting TRUST-worthy standards. Our overarching goal is to make MPD more FAIR-compliant and TRUST-worthy while providing better metrics to evaluate usage, utility, and impact of data in MPD. Our specific tasks for this Supplement are to: 1) Implement database changes to support emerging metadata standards for the description of primary experimental data, 2) Refine our API so that it exposes data to external systems using emerging metadata standards, 3) Refine the user experience and self-curation of data so that it meets emerging metadata standards and provides intuitive data submission, and 4) Develop traceability methods for data and document user’s workflow to enhance reproducibility, tracking, and reporting of data and analytic tool usage. By simultaneously addressing our challenges, we will improve data exposure and utilization globally through integration with a modernized informatics infrastructure. Funding provided by NIH DA028420	NOT-OD-21-089
DC016094-04S1	Nadine Martin	Translation and Clinical Implementation of a Test of Language and Short-term Memory (STM) in Aphasia: The CORE-APHASIA Collaboratory: Advancing Robust Data Science & Sharing (CARDS). Aphasia is an impairment of language, affecting the production or comprehension of speech and the ability to read or write, resulting from stroke, head trauma or other neurological condition. Research to improve health related quality of life for individuals with aphasia has resulted in rich datasets and knowledge but relies on inefficient data structures that do not fully leverage efficient research and all that could be learned from existing datasets. This proposed project will align our existing data platform for aphasia research, CORE-APHASIA, with the FAIR (Findable, Accessible, Interoperable, and Reusable) and TRUST (Transparency, Responsibility, User Focus, Sustainability, and Technology) principles to improve the standardization, interoperability, and shareability resulting in a modernized CORE-APHASIA data resource and platform. Our proposed methods and approach follow guidance and processes addressed in the NIH Data Science Strategic Plan.	NOT-OD-21-089
HD082736-18S1	Brian MacWhinney	Administrative supplement to align with FAIR and TRUST principles	NOT-OD-21-089
DE028729-03S1	Carl Kesselman	Enhancing FAIRness of the FaceBase Research Data Hub	NOT-OD-21-089
AG059624-04S1	Dalane Kitzman	Improving the usage and impact of the Integrated Aging Studies Databank and Registry	NOT-OD-21-089
HG010859-03S1	Paul Sternberg	Aligning the Alliance of Genome Resources with FAIR and TRUST principles	NOT-OD-21-089
MH068457-19	Linda Brzustowicz	Enhancing alignment of the NRGR with FAIR and TRUST principles	NOT-OD-21-089
NS106899-04S1	Adam Ferguson	FAIR VISION for TOP-NT	NOT-OD-21-089
DC018446-03S1	Vikash Gilja	Data Repository for “CRCNS: Avian Model for Neural Activity Driven Speech Prostheses”. Understanding the physical, computational, and theoretical bases of human vocal communication, speech, is crucial to improved comprehension of voice, speech and language diseases and disorders, and improving their diagnosis, treatment, and prevention. Meeting this challenge requires knowledge of the neural and sensorimotor mechanisms of vocal motor control. Our project directly investigates the neural and sensorimotor mechanisms involved in the production of complex, natural, vocal communication signals. Our results will directly enhance brain-computer interface technology for communication and will accelerate the development of prostheses and other assistive/augmentative technologies for individuals with communications deficits due to injury or disease. We will develop a vocal prosthetic that directly translates neural signals in cortical sensorimotor and vocal-motor control regions into vocal communication signals output in real-time. Building on success using non-human primates for brain computer interfaces for general motor control, the prosthetic will be developed in songbirds, whose acoustically rich, learned vocalizations share many features with human speech. Because the songbird vocal apparatus is functionally and anatomically similar to the human larynx, and the cortical regions that control it are closely analogous to speech motor-control areas of the human brain, songbirds offer an ideal model for the proposed studies. Beyond the application of our work to human voice and speech, development of the vocal prosthetic will enable novel speech-relevant studies in the songbird model that can reveal fundamental mechanisms of vocal learning and production. As a critical component of the project, we are collecting a large dataset of simultaneously recorded neural activity from implanted multielectrode arrays (e.g., Neuropixels) along with vocalizations and additional behavioral data. These multimodal data are collected over multi-hour sessions and behaviors are spontaneous and heterogeneous. To enable effective dissemination of these data to the research community our team will, in alignment with the principles of FAIR and TRUST, develop a comprehensive data schema that meets community standards, build software tools to enable broad data reuse, develop a queryable data repository, and will provide detailed tutorials. These efforts will contribute to existing active open-source projects utilized by the neuroscience community, including Neurodata Without Borders (NWB:N). We believe that by investing in the development of this data repository, the impact of the data produced by our studies will be significantly augmented. Additionally, the software engineering tools developed will have a broader impact on data-intensive neuroscience studies of complex behaviors including and beyond speech and vocalization.	NOT-OD-21-089
DC019370-01S1	Ronna Hertzano	Advancing FAIRness and TRUST in the gEAR portal	NOT-OD-21-089
DC014664-06A1S1	Julius Fridriksson	Public sharing of the Aphasia Recovery Cohort	NOT-OD-21-089
HHSN316201200036W	Atul Butte	FHIR’ing up ImmPort: Improving Interoperability of ImmPort Data	NOT-OD-21-089
HHSN316201300006W/ HHSN27200002	Nada Midani	Increased Interconnectivity for Database of Antimicrobial Activity and Structure of Peptides (DBAASP)	NOT-OD-21-089
HHSN316201300006W/ HHSN27200002	Nada Midani	Making 3D Data “FAIR” with NIH 3D	NOT-OD-21-089
75N94021D00001/ 75N94021F00001	Michael Keller	Applying FAIR and TRUST Principles for Enhanced Resource Sharing and Sustainable and Reliable Repository Operations	NOT-OD-21-089
HHSN316201200054W	Jennifer Fostel	Ensuring FAIR and TRUST for High-dimensional Environmental Study Data. In this project, we will work with an external contractor, BioTeam Inc., with the following objectives: BioTeam Inc. will assist NIEHS in developing a framework that will inform the desired future state of NIEHS databases by using the Chemical Effects in Biological Systems (CEBS) database / knowledgebase as a model for sustainable management of environmental health data in a distributed data ecosystem. BioTeam Inc. will provide advice about the planned future state of CEBS. CEBS contains data produced by the Division of the National Toxicology Program (DNTP) over the past 45 years, moving towards an integrated Data Warehouse and permitting cross-cutting questions to be asked. DNTP is also identifying high-value datasets to host in public CEBS Data Marts for direct user query. In addition, DNTP has collected high-dimensional expression data from some subjects and currently houses these in the SRA and GEO repositories and may apply the same model to microbiome and metabolomics data in a CEBS data ecosystem such as GEN3. BioTeam Inc. will advise NIEHS on the following: sustainability and governance considerations; FAIR data sharing and TRUSTworthy repositories; interoperability with NIH data systems; common standards for measuring data use and utility; suggestions to model storage and personnel cost, and to enhance and partially-automate curation; considerations of complexity of storage of high-dimensional data in government repositories while integrating the data for analysis. Finally, BioTeam Inc. will suggest how these solutions might differ if applied to a data system that accepts environmental health data from the public in addition to the data provided by the DNTP.	NOT-OD-21-089

Engage with the community by joining the [email protected] listserv. View instructions on how to join.