Modernizing the NIDDK Biomedical Data Ecosystem to Enhance Translation of Big Data Science to Clinical Studies and Health Outcomes (NIDDK/DDEM)

Project Point of Contact: Xujing Wang, Ph.D., Program Director, Division of Diabetes, Endocrinology, and Metabolic Diseases (DDEM), NIDDK/William Cefalu, M.D., Director,

Goals and Objectives: The overarching goal of this project is to modernize the NIDDK biomedical data ecosystem by improving the findability, accessibility, interoperability, and reusability (FAIR) of clinical data resources, connecting basic and clinical research resources, and enabling cross-disciplinary collaboration to accelerate the translation of scientific advancement into improved health and quality of life. The specific objectives include:

  • Design and develop a centralized metadata and meta-standard catalogue of NIDDK data science resources, including those for clinical, behavioral, social, and basic biomedical research, with a dynamic, searchable web interface
  • Design and develop a platform to assist users in adopting common data elements and metadata standards for both clinical and observational data types
  • Design and develop a data science model to connect and harmonize clinical, observational, and basic science data types
  • Prepare for, and start if time permits, development of a centralized portal to assist researchers with integrating clinical and observational data types with basic science data types, and with conducting cross-disciplinary biomedical research

Significance: The proposed project is well aligned with the following scientific goals of the NIDDK Strategic Plan for Research, released Dec 2021:

  • Scientific Goal 1.3 Develop innovative technologies and resources and expand data science to advance scientific progress and enhance health.
  • Scientific Goal 2.4 Utilizing data science to improve clinical studies.

It is also well aligned with the following Overarching Goals and corresponding Strategic Objectives of the NIH Strategic Plan for Data Science, released in June 2018:

  • GOAL 1 Support a Highly Efficient and Effective Biomedical Research Data Infrastructure
      • Objective 1-2 | Connect NIH Data Systems
  • GOAL 2 Promote Modernization of the Data-Resources Ecosystem
      • Objective 2-3 | Leverage Ongoing Initiatives to Better Integrate Clinical and Observational Data into Biomedical Data Science
  • GOAL 3 Support the Development and Dissemination of Advanced Data Management, Analytics, and Visualization Tools
      • Objective 3-3 | Improve Discovery and Cataloging Resources


If successfully accomplished, the project will address a data science gap faced by the NIDDK community in clinical and observational research and will facilitate cross-disciplinary collaboration and translational innovations and discoveries. At the NIH level, there are currently relatively few efforts aimed at integrating clinical and observational data into biomedical data science (compared with basic science data types), or at translating advancements in Big Data science into improved health and quality of life. This project could help stimulate interest in this important direction.

Description: Recent decades have seen significant advancements in medicine and science for chronic conditions such as diabetes, many of which have been powered by modern molecular and data science technologies. However, these advances have not translated, at scale, into significant improvements in health outcomes or reductions in health disparities. One critical gap is the lack of integration of clinical and observational data types into basic biomedical research. While NIDDK has supported several data science programs, none thus far has focused on clinical or observational data types. This project will address this gap by designing and developing a centralized portal of NIDDK data science resources (e.g., large datasets, repositories, and knowledgebases) that links clinical, observational, and basic science data types and facilitates their interoperability, integration, and harmonization. The Scholar will join a multidisciplinary team and will lead the technical design and implementation.

The project will initially focus on diabetes, with the potential to expand to other diseases and conditions in NIDDK’s mission. Diabetes exemplifies the challenges presented by many chronic conditions. It is a major public health problem that disproportionately impacts communities that have been marginalized. Effective care and self-management of diabetes is extremely data-centric, relying heavily on digital health technologies, such as glucose meters and insulin pumps, and facilitated by a range of smart and connected devices, such as wearables and activity trackers. An enormous amount of clinical and observational data has been generated from such technologies and from electronic health records (EHR) and surveys of Social Determinants of Health (SDOH). However, these data resources mostly remain siloed in different NIH- or NIDDK-funded initiatives and even in different sectors of society. In addition, they follow varying data standards and show poor findability, accessibility, interoperability, and reusability (FAIRness). This disconnect hinders patients’ and care providers’ ability to make the best decisions in real time, and poses significant hurdles to using devices in combination and to combining information from disparate sources (e.g., devices from different manufacturers, patient surveys, EHR data) for artificial intelligence or machine learning (AI/ML) approaches. The gaps also hinder discoveries into the fundamental mechanisms of disease, such as how clinical inflammation may drive molecular events leading to diabetes, or how chronic stress resulting from adverse SDOH impacts an individual’s physiology and epigenome to increase risk for diabetes and diabetes-related complications.

This project will:

  • Develop a dynamic, searchable metadata and meta-standards catalogue of NIDDK data science resources, covering basic science, clinical, and observational data types, together with a data science model to link and harmonize the resources.
  • Organize a workshop to gather community input.
  • Design and start the development of a centralized portal, with a cloud-based, AI/ML-ready repository, that connects basic and clinical research, facilitates cross-disciplinary collaboration, and thus enables clinical data to inform basic research hypotheses and findings from basic research to translate to clinical care.

Data set(s) involved: Datasets will come from the following sources:

  • NIDDK’s Central Repository (NIDDK-CR) and information NETwork (dkNET)
  • NIDDK-funded basic science and clinical consortia, including but not limited to HIRN, TEDDY, and TrialNet
  • The AI Ready and Equitable Atlas for Diabetes Insights (AI-READI) data generation project of NIH’s Bridge2AI Common Fund program
  • Repositories in NIH’s Generalist Repository Ecosystem Initiative (GREI)
  • Community-built resources such as Tidepool, T1D Exchange, and Jaeb Center
  • Data collected by the diabetes technology industry through platforms such as CareLink, Clarity, Diasend, and Glooko

Anticipated outcomes of the project: The anticipated outcomes include a white paper; a metadata and meta-standard catalogue that links clinical and observational data to basic diabetes research; and a workshop with an accompanying executive summary.

Required skills of the DATA Scholar:

  • Expertise in Artificial Intelligence and Machine Learning (AI/ML) technologies, including Deep Learning (DL) and Natural Language Processing (NLP).
  • Expertise in cloud computing and programming (Python, R, etc.). 
  • Expertise in Jupyter notebooks, SQL databases, and graph databases.
  • Experience in AI-enabled knowledge extraction and representation methods (e.g., semantic networks, knowledge graphs).
  • Familiarity with large-scale human datasets (e.g., omics, electronic health records, smart and connected devices, wearables, social media apps, survey data), and knowledge of associated tools and standards.
  • Knowledge of data harmonization, aggregation, integration, and interoperability.
  • Strong communication skills for the role of technical liaison among multiple stakeholders.

Expected/preferred length of DATA Scholar appointment: 2 years.

Expected/preferred time effort commitment of the DATA Scholar: Full time (100%)

Remote work preference: Hybrid preferred.

ICO support: 50% salary and benefit support from NIDDK OD. Office space, computer, printer, and mentors.

Additional activities: The Scholar will participate in workshop development and hosting and in white paper writing, and can also participate in the AI-READI project of NIH’s Bridge2AI Common Fund program.

Career or professional development opportunities: There will be opportunities for the Scholar to develop skills and expertise in scientific administration, multi-disciplinary collaboration and teaming, initiative planning, and in providing summaries of topic areas of expertise to leadership.

To apply to this or other DATA Scholar positions, please see instructions here: datascience.nih.gov/data-scholars.

This page last reviewed on April 17, 2023