Biomedical research is increasingly driven by the acquisition and analysis of large digital datasets. Sharing of such datasets has the potential to accelerate research, since in many cases it is far more expensive to collect than to analyze data. As the number, size, and public availability of biomedical datasets grow, so do the opportunities to advance biomedical knowledge. However, this growth also presents new challenges to the biomedical researcher, as datasets relevant to a problem of interest may be scattered across multiple repositories and be difficult to find. The heterogeneous nature of biomedical data, the lack of data discovery infrastructure, and the fragmentation of data environments, data standards, and documentation present barriers to data sharing. Researchers need tools that enable them to discover and reuse data to facilitate biomedical science.
The biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) project, a part of the NIH Big Data to Knowledge (BD2K) initiative, seeks to provide a prototype platform for researchers to find biomedical data to enable reanalysis and the creation of derived products through data integration. One of the outcomes of bioCADDIE is the aggregation of metadata across a variety of biomedical data repositories into a prototype discovery system named DataMed, which provides researchers with a PubMed-like search engine. DataMed supports the findability and accessibility of datasets, two of the four FAIR principles (the other two being interoperability and reusability), to facilitate knowledge discovery in today's big data-intensive science landscape. A platform-independent common model (DATS) describes the metadata elements and the structure for datasets, and powers DataMed’s ingestion and indexing pipeline, as well as its search functionality. DataMed was publicly launched in 2016 and described in Nature Genetics; its latest release (v3.0) contains new features, improvements based on user feedback, and indexing of 74 major repositories hosting over 2.3 million datasets. Users do not need to know in advance which repositories may have data of interest, nor do they need a different query strategy for each repository portal. Once they find datasets of interest in DataMed, they can follow links to retrieve them from the various repositories.
DataMed, a completely open-source prototype search system, uses a modular architecture to support use cases from the general, biological and translational communities with a number of common needs for metadata searches. This makes DataMed broad in its search capabilities, allowing it to easily span a range of diverse domains and types of data. Thus, DataMed is designed to be a common data index infrastructure, connecting with existing biomedical data repositories (e.g., dbGaP, PDB), aggregators (e.g., BioProject, OmicsDI, dkNET), and other data sources. Both DataMed and the underlying DATS metadata model use schema.org (http://schema.org) annotation to expose the harvested metadata to general search engines such as Google, Microsoft, Yahoo and Yandex. Additionally, the latest version of DataMed incorporates a RESTful API to provide programmatic access to all of DataMed’s harvested metadata and additional user-interface features. Such sharing capabilities make DataMed easily accessible not only to the biomedical community but also to those outside of the biomedical sphere.
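To make the schema.org annotation mentioned above more concrete, the following is a minimal sketch of how a harvested dataset record might be serialized as schema.org "Dataset" markup in JSON-LD. It is an illustration under stated assumptions, not DataMed's or DATS's actual implementation: the function name `dataset_to_jsonld`, its parameters, and the sample record are all hypothetical, and only a small subset of standard schema.org properties is used.

```python
import json


def dataset_to_jsonld(title, description, landing_page, repository, keywords):
    """Build a schema.org Dataset description as a JSON-LD dictionary.

    Hypothetical helper for illustration; the field selection is an
    assumption and does not reflect the full DATS model.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": title,
        "description": description,
        "url": landing_page,
        "keywords": list(keywords),
        # The source repository is modeled as a schema.org DataCatalog.
        "includedInDataCatalog": {"@type": "DataCatalog", "name": repository},
    }


# Placeholder record: none of these values come from a real repository.
record = dataset_to_jsonld(
    title="Example expression profiling dataset",
    description="Placeholder description for illustration only.",
    landing_page="https://example.org/datasets/123",
    repository="Example Repository",
    keywords=["gene expression", "illustration"],
)

# Serialize to the JSON-LD text that would be embedded in a landing page.
markup = json.dumps(record, indent=2)
print(markup)
```

Markup of this kind is typically embedded in a dataset's landing page inside a script element of type application/ld+json, which is how general search engines read structured descriptions.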
About the Authors:
Executive Committee members of bioCADDIE:
- Lucila Ohno-Machado, MD, MBA, PhD: Dr. Ohno-Machado is the Principal Investigator for bioCADDIE and Professor of Medicine and founding chief of the Division of Biomedical Informatics at UCSD. She is associate dean for informatics and technology and has experience leading multidisciplinary projects at the intersections of biomedicine and quantitative sciences. Dr. Ohno-Machado is also director of the Biomedical Research Informatics for Global Health training program.
- George Alter, PhD: Dr. Alter is Director of the Inter-University Consortium for Political and Social Research (ICPSR), Research Professor at the Population Studies Center, and Professor of History at the University of Michigan. His research grows out of interests in the history of the family, demography, and economic history, and recent projects have examined the effects of early life conditions on health in old age and new ways of describing fertility transitions.
- Susanna-Assunta Sansone, PhD: Dr. Sansone is an Associate Director and Principal Investigator at the Oxford e-Research Centre, part of the Department of Engineering Science at the University of Oxford, and Consultant for Springer Nature Scientific Data. Her activities center on and support data curation, management and publication, and their pivotal roles in enabling reproducible research and knowledge discovery.
- Hua Xu, PhD: Dr. Xu is the Robert H. Graham Professor at the School of Biomedical Informatics and Director of the Center for Computational Biomedicine in The University of Texas Health Science Center at Houston (UTHealth). Dr. Xu is an expert in biomedical text processing and data mining.
- Jeffrey Grethe, PhD: Dr. Grethe is currently a co-investigator for the Neuroscience Information Framework (NIF) and Principal Investigator for the NIDDK Information Network (dkNET) in the Center for Research in Biological Systems (CRBS) at the University of California, San Diego. Throughout his career, he has been involved in enabling collaborative research, data sharing and discovery through the application of advanced informatics approaches.
- Ian Fore, D.Phil: Dr. Fore is the NIH Scientific Officer on bioCADDIE. He serves as the Senior Biomedical Informatics Program Manager at the National Cancer Institute (NCI).
Management team of bioCADDIE:
- Elizabeth Bell, MPH: Ms. Bell is the general bioCADDIE project manager. She has experience with patient-centered electronic consent studies. Her background is in environmental health.
- Anupama E Gururaj, PhD: Dr. Gururaj serves as the Project Manager for the Core Technology Development Group of bioCADDIE as well as all bioCADDIE-related activities at The University of Texas Health Science Center at Houston (UTHealth). She has experience in cancer research, focusing on cancer signaling pathways, as well as in biomedical informatics, specifically in the area of data mining and natural language processing.