Administrative Supplements to Support Collaborations to Improve the AI/ML-Readiness of NIH-Supported Data
About the Administrative Supplements to Support Collaborations to Improve the AI/ML-Readiness of NIH-Supported Data
Artificial intelligence and machine learning (AI/ML) are a collection of data-driven technologies with the potential to significantly advance biomedical research. The National Institutes of Health (NIH) makes a wealth of biomedical data available and reusable to research communities; however, not all of these data can be used efficiently and effectively by AI/ML applications.
Thirty-four awards, supported by 15 NIH Institutes and Centers, were made in 2023 to principal investigators at institutions across the country. Awardee projects and their descriptions are available below.
Improving AI/ML-readiness of Synthetic Data in a Resource-Constrained Setting
FIC
ARNAOUT, RIMA
UNIVERSITY OF CALIFORNIA, SAN FRANCISCO
ENRICHing NIH Imaging Datasets to Prepare them for Machine Learning
NHLBI
AWAD, ISSAM A
UNIVERSITY OF CHICAGO
Biomarkers of Cerebral Cavernous Angioma with Symptomatic Hemorrhage (CASH) - Supplemental
NINDS
BELL, MICHELLE L
YALE UNIVERSITY
Containerizing tasks to ensure robust AI/ML data curation pipelines to estimate environmental disparities in the rural south
NIMHD
BLETZ, JULIE A
SAGE BIONETWORKS
Assuring AI/ML-readiness of digital pathology in diverse existing and emerging multi-omic datasets through quality control workflows
NCI
CHICHOM, ALAIN MEFIRE
UNIVERSITY OF BUEA
Harnessing Data Science to Promote Equity in Injury and Surgery for Africa
FIC
CHINCHILLI, VERNON M
PENNSYLVANIA STATE UNIV HERSHEY MED CTR
Data Coordinating Center for the Type 1 Diabetes in Acute Pancreatitis Consortium
NIDDK
CHIU, YU-CHIAO
UNIVERSITY OF PITTSBURGH AT PITTSBURGH
Enhancing AI-readiness of multi-omics data for cancer pharmacogenomics
NCI
CHOI, SUNG WON
UNIVERSITY OF MICHIGAN AT ANN ARBOR
Patient-Oriented Research and Mentoring in Hematopoietic Cell Transplantation Supplement
NHLBI
CHUNARA, RUMI
NEW YORK UNIVERSITY
NYU-Moi Data Science for Social Determinants Training Program
FIC
COOK, DIANE JOYCE
WASHINGTON STATE UNIVERSITY
Crowdsourcing Labels and Explanations to Build More Robust, Explainable AI/ML Activity Models
NIA
DING, MINGZHOU
UNIVERSITY OF FLORIDA
Acquisition, extinction, and recall of attention biases to threat: Computational modeling and multimodal brain imaging
NIMH
ERICKSON, LOREN D
UNIVERSITY OF VIRGINIA
IgE antibody responses to the oligosaccharide galactose-alpha-1,3-galactose (alpha-gal) in murine and human atherosclerosis
NIAID
FRIED-OKEN, MELANIE
OREGON HEALTH & SCIENCE UNIVERSITY
An AI/ML-ready closed loop BCI simulation framework
NIDCD
GUO, JINGCHUAN
UNIVERSITY OF FLORIDA
Supplement of NIDDK R01 newer GLDs and Clinical Outcomes
NIDDK
HIRSCH, KAREN G
STANFORD UNIVERSITY
PREcision Care In Cardiac ArrEst - ICECAP (PRECICECAP)
NINDS
HSU, WILLIAM
UNIVERSITY OF CALIFORNIA LOS ANGELES
An AI/ML-ready Dataset for Investigating the Effect of Variations in CT Acquisition and Reconstruction
NIBIB
IM, HYUNGSOON
MASSACHUSETTS GENERAL HOSPITAL
Development of plasmon-enhanced biosensing for multiplexed profiling of extracellular vesicles
NIGMS
LARSON, MARY JO
BRANDEIS UNIVERSITY
Trajectories of non-pharmacologic and opioid health services for pain management in association with military readiness and health status outcomes: SUPIC renewal
NCCIH
MACCARINI, PAOLO FRANCESCO
DUKE UNIVERSITY
Development of AI/ML-ready shared repository for parametric multiphysics modeling datasets: standardization for predictive modeling of selective brain cooling after traumatic injury
NINDS
MOKUAU, NOREEN
UNIVERSITY OF HAWAII AT MANOA
Processing Multiomic Datasets for Improved AI/ML-readiness in Congenital Heart Disease Research
NIMHD
NGUYEN, THU
UNIV OF MARYLAND, COLLEGE PARK
Risk and strength: determining the impact of area-level racial bias and protective factors on birth outcomes
NIMHD
ORDOVAS, JOSE M.
TUFTS UNIVERSITY BOSTON
Social Stressors, Epigenetics and Health Status in Underrepresented minorities
NIMHD
PANAGEAS, KATHERINE S
SLOAN-KETTERING INST CAN RESEARCH
MATCHES: Making Telehealth Delivery of Cancer Care at Home Effective and Safe - Addressing missing data in the MATCHES study to improve ML/AI readiness
NCI
PAYNE, SAMUEL H
BRIGHAM YOUNG UNIVERSITY
Creating AI/ML-ready data for single cell proteomics
NIGMS
REHM, HEIDI L
BROAD INSTITUTE, INC.
ClinGen AI Data Delivery Supplement
NHGRI
SETTE, ALESSANDRO
LA JOLLA INSTITUTE FOR IMMUNOLOGY
THE CANCER EPITOPE DATABASE AND ANALYSIS RESOURCE
NCI
SHEFFIELD, NATHAN
UNIVERSITY OF VIRGINIA
Novel methods for large-scale genomic interval comparison
NHGRI
TEMPANY, CLARE M
BRIGHAM AND WOMEN'S HOSPITAL
Generation and Dissemination of Enhanced AI/ML-ready Prostate Cancer Imaging Datasets for Public Use
NIBIB
ULRICH, CORNELIA M
UNIVERSITY OF UTAH
Harmonizing genomic, transcriptomic, and drug response data across pre-clinical models of cancer to support machine learning approaches for personalized cancer therapy selection
NCI
WOLMARK, NORMAN
NRG ONCOLOGY FOUNDATION, INC.
NRG Oncology Network Group Operations Center
NCI
ZHANG, WEI
WAKE FOREST UNIVERSITY HEALTH SCIENCES
Developing unbiased AI/Deep learning pipelines to strengthen lung cancer health disparities research
NCI
ZHAO, ZHONGMING
UNIVERSITY OF TEXAS HLTH SCI CTR HOUSTON
Transforming dbGaP genetic and genomic data to FAIR-ready by artificial intelligence and machine learning algorithms
NLM
Thirty-six awards were made in 2022 to principal investigators at 33 different institutions across the country. Awardee projects and their descriptions are available below.
The HEAL Data Ecosystem is working to collect data across its projects and networks to meet FAIR (Findable, Accessible, Interoperable, Reusable) data standards. This administrative supplement builds on the mission of the NIH HEAL IMPOWR network to blend existing and future chronic pain (CP) and opioid use disorder (OUD) data. The proposed work will significantly deepen and augment approaches to FAIR principles in CP and OUD data for both the HEAL network and larger NIH research community. The overall objective of this project is to move CP and OUD data one step closer to FAIR by leveraging existing datasets and developing tools for new projects. The general hypothesis of the project is that leveraging existing CP & OUD data and collecting new data using ML/AI data quality standards will accelerate the impact of the HEAL Data Ecosystem. The aims of the project are to transform existing datasets to be ML/AI ready, and to adapt tools to support ML/AI readiness for existing and prospectively collected HEAL common data elements (CDE). The expected outcome of this project is data optimization pipelines and tools to support the goal of ML/AI ready data. The results of this project will provide a strong basis for further development of the HEAL Data Ecosystem, helping to bring diverse data sources together and meet FAIR data standards.
NIDA
Alkalay, Ron N
Beth Israel Deaconess Medical Center
Curating musculoskeletal CT data to enable the development of AI/ML approaches for analysis of clinical CT in patients with metastatic spinal disease
Patients with metastatic spine disease are at high risk of pathologic vertebral fracture (PVF). Up to 50% of these patients suffer neurological deficits, with further complications that may be fatal. Predicting PVF risk is a critical clinical need for managing these patients. Segmentation of vertebral anatomy, bone properties, and individual spinal muscle cross-sectional areas from clinical CT imaging is fundamental for developing precise, patient-specific diagnostics of PVF risk. Such segmentation faces unique challenges due to the cancer-mediated alteration of skeletal tissues' radiological appearance. Deep learning (DL) methods will speed and standardize the critical segmentation step, permitting analysis of larger datasets and promoting new DL analyses for improved insight into the drivers of PVF risk in patients with metastatic spine disease. For this project, titled "Curating musculoskeletal CT data to enable the development of AI/ML approaches for analysis of clinical CT in patients with metastatic spinal disease," our work aimed to establish a curated, publicly accessible computed tomography imaging dataset from 140 metastatic spine disease patients treated with radiotherapy, imaged as part of our parent study, "Predicting Fracture Risk in Patients Treated with Radiotherapy for Spinal Metastatic Disease" (AR075964). For this purpose, we established manual segmentation of each vertebral level, including delineation of lesion type and fractures. Based on these data, we successfully developed a testbed deep learning model for segmentation of 1) the thoracic and lumbar vertebrae and 2) the complete set of thoracic and abdominal muscles, demonstrating the utility of the curated dataset for developing DL methods (a generic overlap-metric sketch follows this entry). Based on this effort, the curated images and associated delineation data from this cohort have been accepted and are undergoing final submission to The Cancer Imaging Archive (TCIA). Integrating DL systems within our approach forms an important step in changing the patient management paradigm from reactive to data-driven proactive management, to prevent PVF events and critically reduce bias in patient management.
NIAMS
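The overlap between a model's segmentation and a manual delineation, as in the Alkalay project above, is conventionally scored with the Dice coefficient. A minimal sketch, with toy masks standing in for real vertebra labels:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice overlap between a predicted and a manual (ground-truth) binary mask."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()
    return 2.0 * intersection / denom if denom else 1.0

# Hypothetical example: compare a model's vertebra mask against a manual delineation.
pred = np.zeros((64, 64, 64), dtype=bool); pred[20:40, 20:40, 20:40] = True
truth = np.zeros_like(pred); truth[22:42, 20:40, 20:40] = True
print(f"Dice: {dice_coefficient(pred, truth):.3f}")
```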
Bateman, Alex
European Molecular Biology Laboratory
UniProt - Protein sequence and function embeddings for AI/Machine Learning readiness
UniProt represents a wealth of protein-related information that is very diverse and yet highly structured, covering all living organisms from microbes to humans. It is an ideal source of information for AI/ML and indeed has already been used to help train essential tools such as AlphaFold, which uses deep learning to generate 3D structural models for nearly all proteins. This project further harnesses AI/ML techniques to enhance UniProt, particularly using sequence embeddings. Sequence embeddings provide a representation of a protein sequence that can be used for a broad range of AI/ML tasks. We are making these embeddings and AI/ML models available to researchers, saving community compute time and enabling data science. We are testing embeddings for critical tasks in UniProt, such as clustering sequences to enable users to search faster (a toy clustering sketch follows this entry). We are also exploring the use of embeddings for enzymatic reactions to enhance our ability to identify novel data in the literature and improve the diversity of catalytic reactions captured in UniProt. To better understand the needs of the AI/ML community, we have held a workshop to engage leaders in the field. We have learned about new directions and opportunities we can benefit from, as well as the challenges they face in their own work that we can help with by improving our data provision. Collectively, we will scale up protein functional annotation with AI/ML-assisted techniques, organize the growing sequence space with AI/ML-enabled sequence clustering to keep sequence computation sustainable, and collaborate with AI/ML research communities to develop new solutions that benefit the broad user community of the UniProt resources.
NHGRI
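As a rough illustration of how fixed-length sequence embeddings support the kind of clustering the UniProt project describes, here is a minimal sketch; the random matrix stands in for real per-protein embeddings, and the cluster count is arbitrary:

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# Stand-in for real per-protein embeddings (one row per UniProt accession);
# actual embedding vectors would be loaded from file here instead.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 1024)).astype(np.float32)

# L2-normalize so that Euclidean k-means approximates cosine-similarity clustering.
unit = normalize(embeddings)
labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(unit)
print(np.bincount(labels))  # cluster sizes
```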
Bertagnolli, Monica M.
Brigham And Women's Hospital
A-STOR Cancer Clinical Trial Artificial Intelligence & Machine Learning Readiness
Cancer clinical trials are facilitating an explosion of biomedical data, including complex clinical data, diverse genomic data, pathologic image data, high-dimensional molecular characterization, and clinical imaging data among others. Maximizing analyses of samples and data collected through clinical trials and the rationale is well understood – comprehensive molecular profiling should accelerate our goal of ‘precision cancer medicine’, especially when applied to the randomized clinical trials that incorporate current and emergently effective treatments. Among cancer clinical trials, many high impact trials are designed and conducted by National Cancer Institute’s (NCI) National Clinical Trial Network (NCT), including the focus of this study – Alliance for Clinical Trials in Oncology. However, present barriers impede cancer clinical trials from unlocking the full potential of these datasets. Currently, omics data generated from trials are largely decentralized: data are housed at a variety of sites, analyses take place locally, and other researchers do not have access until public deposition of data on repositories. Further, analyses vary widely in bioinformatics methods, including choice of tools, dependencies, file formats, parameterizations, data quality filtering thresholds, and other workflow elements, which makes integration across groups challenging. In this proposal, we are expanding the Alliance Standardized Translational Omics Resource (A-STOR) to realize the full potential of artificial intelligence (AI)/machine learning (ML) modeling for cancer clinical trials. Specifically, we will: 1) rapidly expand A-STOR to host data from over a dozen existing or ongoing Alliance clinical trials and optimize infrastructure for AI/ML analyses; 2) develop a unified clinical and adverse event (AE) data dictionary to facilitate clinical data harmonization; and 3) complete an already-approved pooled multi-modal ML-based predictor as a pilot study. Progress to date includes co-localization of digital pathology and genomic data, harmonization of clinical data dictionary, and initiation of a pilot project to interrogate the breast cancer tumor immune microenvironment through cutting edge AI-based approaches and best-in-class RNA-based immune signature approaches.
NCI
Bhatt, Tanvi
University Of Illinois At Chicago
Perturbation training for enhancing stability and limb support control for fall-risk reduction among stroke survivors
NICHD funds several clinical trials targeting novel balance and gait interventions, yet there is a gap in the field pertaining to data sharing, accessibility, and utilization. Machine learning models based on gait data could accurately identify pathological gait patterns, classify motor disorders, predict the need for an ankle-foot orthosis, and assess rehabilitation status for stroke survivors. However, there is a lack of publicly available data repositories for clinicians and researchers, and computational expertise is required to use the data. These barriers greatly limit the development of data-driven approaches for health. This project therefore aims to take a step towards the democratization of gait analysis to empower a broad range of stakeholders. To create the gait data repository (Aim 1), we will evaluate and enable metadata through data wrangling and harmonization capabilities, following FAIR data principles (findability, accessibility, interoperability, and reusability), before building the repository. To uncover data issues (e.g., data loss and data artifacts), Aim 2 will focus on data analytics, leveraging the harmonized datasets from Aim 1 to create scientific workflows for biomechanical data utilization (data visualization, cleaning, and analysis functionalities). To support cleaning and transformation tasks on gait data, a set of open-source libraries will be created for data cleaning, analysis, and visualization. Our computational libraries will be made available to researchers through an easy-to-use visual interface that allows them to query and visualize the data. Additionally, a centralized website (GaitPortal) will be designed to make the data and libraries publicly available. Lastly, we will demonstrate an initial use of the transformed data by developing a fall-risk predictive model based on the time-series gait data (Aim 3; a toy sketch follows this entry). Demonstration code containing all basic and advanced functions will be provided so that other researchers can customize it for their specific purposes. These user-friendly tools should improve data checking, cleaning, and analysis, cutting manual data analysis time by at least 25% and reducing overall financial cost for researchers and clinicians. Additionally, the association between clinical measures and gait data could guide the development of an objective function to evaluate balance status and training effects, helping researchers and clinicians identify individualized impairments in gait performance and balance control. This personalized insight can benefit the development of tailored rehabilitation strategies for people with hemiparetic stroke. Clinicians can also monitor rehabilitation progress based on real-time or post-processed feedback from biomechanical assessments, enhancing the precision and efficacy of interventions in stroke rehabilitation.
NICHD
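A minimal sketch of the kind of fall-risk modeling from time-series gait data that the Bhatt project describes; the stride-time features, cohort, and labels below are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def stride_features(stride_times: np.ndarray) -> dict:
    """Simple summary features from a series of stride times (seconds)."""
    return {
        "mean_stride": stride_times.mean(),
        "stride_cv": stride_times.std() / stride_times.mean(),        # gait variability
        "step_to_step_change": np.abs(np.diff(stride_times)).mean(),  # irregularity
    }

# Toy cohort: 100 subjects, one stride-time series each, with fall/no-fall labels.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, 100)
X = pd.DataFrame([stride_features(rng.normal(1.1, 0.05 + 0.1 * y, 50)) for y in labels])
print(cross_val_score(RandomForestClassifier(random_state=0), X, labels, cv=5).mean())
```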
Casey, Joan A
Columbia University Health Sciences
Approaches for AI/ML Readiness for Wildfire Exposures
"Artificial intelligence (AI) and machine learning (ML) models are subject to biases inherent in data used to train them. Efforts to mitigate bias have focused largely on study designs, and implementation of fair quality checks. However, as focus is placed on generalizability of these models, it is critical to contextualize representativeness of study data used for modeling with respect to the population on which insights are intended to be used. Currently, research emphasizes descriptions of study cohorts, highlighting on whom analyses were performed. However, summary statistics cannot provide the granularity needed to identify potential bias brought on by diverse populations and limited sample sizes.
This project focuses on improving artificial intelligence/machine learning (AI/ML)-readiness of a wide range of environmental data sources used to predict wildfire fine particulate matter (PM2.5) exposure. Developing models to predict wildfire PM2.5 exposure is crucial, as nearly 70% of the U.S. population is exposed to wildfire smoke each year, with 30% experiencing more severe levels of exposure. PM2.5 exposure has been associated with increased risk of respiratory care-related medical encounters and mortality among older individuals. As part of our parent R01, we are estimating the risk of incident and worsening mild cognitive impairment (MCI) and Alzheimer's disease and related dementias (ADRD) associated with wildfire PM2.5 exposure. To do so, we are developing models to predict daily exposure to wildfire-specific PM2.5 levels using a two-stage ML approach. Stage one relies on a Bayesian machine learning algorithm to integrate multiple existing PM2.5 prediction models to assign ambient PM2.5 exposures. Stage two uses NOAA's Hazard Mapping System to identify areas exposed to wildfire smoke plumes. We use the smoke plume information combined with statistical techniques to isolate daily estimates of wildfire PM2.5 from non-wildfire PM2.5 levels. The data sources needed to predict PM2.5 and wildfire PM2.5 are disparate, not very accessible, and unfriendly to AI/ML applications. These datasets include weather variables from multiple sources, satellite smoke plumes from multiple sources, air pollution monitor data, national land use variables, topographical data, and others. Although the data are rich and publicly available through US agencies, acquiring and preparing them for analysis represents a significant investment for any researcher. All datasets come in different spatial and temporal resolutions that must be reconciled before they can be merged. Additional processing is also needed to handle potentially spurious information due to the periodicity of satellites, monitors, and cloud cover. With this administrative supplement, our goals are to improve the data processing for this vast and wide range of data sources by developing reproducible pipelines. For example, one source of PM2.5 predictions is obtained from the Atmospheric Composition Analysis Group, and we provide reproducible code to process and aggregate this netCDF data by applying a downscaling rasterization strategy using TIGER/Line shapefiles (see https://github.com/NSAPH-Data-Processing/pm25_components_randall_martin for more details; a generic sketch of this kind of aggregation follows this entry). We will also annotate and document the data and ensure computational scalability. In addition, we will deposit the processed data in a public data repository, Harvard Dataverse, and include a data demonstration to further disseminate the work.
NIA
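The aggregation step the Casey project describes (resolving gridded netCDF fields onto census geographies) can be sketched with xarray and geopandas. This is not the project's pipeline (that lives in the linked repository); all file names, the variable name ("pm25"), and the zone column ("ZCTA5CE10") are placeholder assumptions:

```python
import xarray as xr
import geopandas as gpd

ds = xr.open_dataset("pm25_daily.nc")                 # gridded field, dims (lat, lon)
pts = ds["pm25"].to_dataframe().reset_index()         # one row per grid cell

points = gpd.GeoDataFrame(
    pts, geometry=gpd.points_from_xy(pts["lon"], pts["lat"]), crs="EPSG:4326"
)
zones = gpd.read_file("tl_zcta.shp").to_crs("EPSG:4326")

# Assign each grid-cell centroid to the polygon containing it, then average PM2.5.
joined = gpd.sjoin(points, zones[["ZCTA5CE10", "geometry"]], predicate="within")
zonal_mean = joined.groupby("ZCTA5CE10")["pm25"].mean()
print(zonal_mean.head())
```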
Chen, Shigang
University of Florida
Supplement: SCH: Enabling Data Outsourcing and Sharing for AI-powered Parkinson's Research
Artificial intelligence holds the promise of transforming data-driven biomedical research for more accurate diagnosis, better treatment, and lower cost. At the same time, modern digital technologies make it much easier to collect information from patients at large scale. While "big" medical data offers unprecedented opportunities for building deep-learning artificial neural network (ANN) models to advance research on complex diseases such as Parkinson's disease (PD), it also presents unique challenges to patient data privacy. The task of training and continuously refining ANN models with data from tens of thousands of patients, each with numerous attributes and images, is computation-intensive and time-consuming. Outsourcing computation and data to the cloud is a viable solution. However, the problem of performing ANN learning operations in the cloud without the risk of leaking any patient data from their sources remains open to date. We propose to develop novel data masking technologies based on randomized orthogonal transformation to enable AI-computation outsourcing and data sharing. The proposed research includes (1) experimental studies of training ANN models with data masking for PD prediction and Parkinsonism diagnosis, and (2) theoretical development on data privacy, inference accuracy, and model performance. This supplement project expands the research into a new dimension of differential privacy by incorporating randomized orthogonal transformation and noise addition into the data masking process (a toy sketch of these two ingredients follows this entry). Differential privacy is a rigorously defined and widely adopted model that provides a quantitative measure of privacy loss in data release. This project consists of a theoretical aim and an experimental aim. The theoretical aim is to develop a new method of achieving differential privacy for data outsourcing that minimizes noise addition with the help of randomized orthogonal transformation. The experimental aim is to use the new method to produce sharable PD datasets that are protected by differential privacy and ready for machine learning studies. The outcome of this research, with its new method of outsourcing data under differential privacy, is expected to have a broader impact beyond PD research by advancing the theory and implementation of cloud-based medical studies.
NLM
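A toy sketch of data masking by randomized orthogonal transformation with added Gaussian noise, the two ingredients the Chen project combines. This is not the project's method: a real differentially private release would calibrate the noise scale to a sensitivity bound and an (epsilon, delta) budget:

```python
import numpy as np

def random_orthogonal(d: int, seed: int = 0) -> np.ndarray:
    """Draw a random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))  # sign fix for a Haar-uniform distribution

def mask(X: np.ndarray, Q: np.ndarray, sigma: float) -> np.ndarray:
    """Rotate feature vectors with Q, then add Gaussian noise (toy release only)."""
    rng = np.random.default_rng(1)
    return X @ Q.T + rng.normal(scale=sigma, size=X.shape)

X = np.random.default_rng(2).normal(size=(500, 16))  # stand-in patient features
Q = random_orthogonal(16)
X_masked = mask(X, Q, sigma=0.1)
# Pairwise distances are approximately preserved, so distance-based learners still work.
```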
Devinsky, Orrin
New York University School of Medicine
Machine learning approaches for improving EEG data utility in SUDEP research
The proposed supplemental project builds on augmented datasets and new AI/ML techniques. Our research team consists of SUDEP and AI/ML experts with complementary expertise who are uniquely qualified to develop innovative analytic tools for EEG data AI/ML-readiness. First, we have explored new EEG feature extraction/engineering techniques and validated their efficacy with ML approaches (a generic sketch of this kind of workflow follows this entry). To date, our preliminary results have achieved a median AUC (area under the curve) of 0.87 in classification between SUDEP cases and living epilepsy patient controls, a significant improvement over our previously reported result (median AUC of 0.77; Frontiers in Neurology, 2022). Second, we are developing explainable ML models to enhance result interpretation. Third, we are developing and employing data augmentation techniques to improve the consistency of labeled EEG data from both SUDEP cases and living epilepsy patient controls. Finally, we will validate existing and newly developed ML methods on newly collected SUDEP and control samples at multiple sites. Overall, this project will complement and enrich the research aims of our parent grant and promote research rigor, transparency, and reproducibility. Accomplishing these research goals will maximize data utility and improve AI/ML-readiness in epilepsy research.
NINDS
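As a hedged illustration of an EEG feature-extraction-plus-classification workflow evaluated with AUC, as in the Devinsky project, a minimal sketch using band-power features; the signals, labels, and model are stand-ins:

```python
import numpy as np
from scipy.signal import welch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

FS = 256  # sampling rate in Hz; all values here are illustrative

def band_power(sig: np.ndarray, lo: float, hi: float) -> float:
    """Mean Welch power-spectral density in a frequency band."""
    f, pxx = welch(sig, fs=FS, nperseg=FS * 2)
    return pxx[(f >= lo) & (f < hi)].mean()

rng = np.random.default_rng(0)
signals = rng.normal(size=(200, FS * 30))   # 200 toy EEG segments, 30 s each
y = rng.integers(0, 2, 200)                 # case vs. control labels (toy)
bands = [(1, 4), (4, 8), (8, 13), (13, 30)] # delta, theta, alpha, beta
X = np.array([[band_power(s, lo, hi) for lo, hi in bands] for s in signals])

probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=5, method="predict_proba")[:, 1]
print("AUC:", roc_auc_score(y, probs))
```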
Ellisman, Mark H
University of California, San Diego
3D Reconstruction and Analysis of Alzheimers Patient Biopsy Samples to Map and Quantify Hallmarks of Pathogenesis and Vulnerability
This administrative supplement is advancing tools and methodology for the normalization of large-scale, 3D electron microscopic (EM) image volumes as a means to enhance the performance, reusability, and repeatability of high-throughput artificial intelligence and machine learning (AI/ML) algorithms for automatic volume segmentation of brain cellular and subcellular ultrastructure. This work is being conducted in the context of an active research project that is advancing the acquisition, processing/refinement, and dissemination of large-scale 3D EM reference data derived from a remarkable collection of legacy biopsy brain samples from patients suffering from Alzheimer's Disease (AD) (5R01AG065549). This active project is deeply rooted in the use of advanced AI/ML technologies for delineating key ultrastructural constituents of neurons and glia exhibiting hallmarks of the progression of AD. It is organized to comprehensively target areas associated with plaques, tangles, and brain vasculature, attending to locations where existing findings suggest cell and network vulnerability and that contain molecular interactions suspected by some to underlie the initiation and progression of AD. Through this work, we are advancing the development and dissemination of fully trained neural-network models for volume segmentation to simplify (and reduce the costs associated with) community efforts to extract their own 3D geometries and associated morphometrics from this collection of AD reference data and similar repositories of neuronal 3D EM data. With this supplemental effort, we will develop, refine, and disseminate a set of tools that allow for direct feedback and standardization of primary image quality (a generic intensity-normalization sketch follows this entry). With these tools, users will be able to optimize and normalize imaging parameters at the time of image acquisition. The outcome of this work will be to advance the use of transfer learning methods, facilitating repeatability and reuse of trained neural network models for scalable EM image segmentation.
NIA
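One common way to normalize image intensity across acquisitions, in the spirit of the standardization tools the Ellisman project describes, is histogram matching against a reference tile. A minimal sketch with synthetic EM-like tiles:

```python
import numpy as np
from skimage.exposure import match_histograms

# Normalize a newly acquired tile against a reference so that a model trained on
# the reference intensity profile transfers; both tiles here are synthetic.
rng = np.random.default_rng(0)
reference = rng.normal(128, 20, size=(512, 512)).clip(0, 255).astype(np.uint8)
new_tile = rng.normal(100, 35, size=(512, 512)).clip(0, 255).astype(np.uint8)

normalized = match_histograms(new_tile, reference)
print(new_tile.mean(), normalized.mean(), reference.mean())
```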
Friel, Kathleen Margaret
Winifred Masterson Burke Medical Research Institute
Targeted transcranial direct current stimulation combined with bimanual training for children with cerebral palsy
In this supplement, we aim to use Deep Learning (DL) pose estimation models along with 3D depth-sensing cameras to develop a cost-effective, easy-to-use, and compact Deep Learning-based markerless kinematic data acquisition (DL-KDA) system that can be applied to children with UCP. To achieve this overall goal, we must establish the accuracy and validity of the kinematic data obtained from the system. We have developed a modular software framework for building and testing DL-KDA systems against a very precise marker-based motion capture gold standard (VICON); a generic error-metric sketch follows this entry. To date, we have collected kinematic data from 8 additional children with UCP, 16 typically developing children, and 40 healthy adults performing the Box and Blocks Test. Using the kinematic data from healthy adults, we are studying the effects of 3D camera and DL parameters/architecture on the accuracy of the resulting kinematic data. In parallel, we have extracted 2D images from 4 years (2015-2018) of previous video recordings and will annotate them with body joint locations for transfer learning-based retraining of DL pose estimation models. The goal is to address the gap in the existing training datasets used for most DL pose estimation models, which are not inclusive of individuals with movement disorders. Without carefully addressing this gap, potential ethical and scientific biases may arise if such pose estimation models are applied to underrepresented groups such as children with UCP. Our images will be transformed into ML/AI-ready HDF5 datasets and published in public DL and NIH repositories, where they will be available to other researchers using or building DL pose estimation models for applications in UCP clinical research. Finally, we will collect data from an additional 12 children with UCP (in 2023) and use the retrained DL model for body pose estimation. The performance of the retrained DL model will be statistically compared to the original DL model to verify whether bias was indeed present. Validated kinematics for the UCP population will likewise be uploaded to public DL and NIH repositories for use in future UCP research.
NICHD
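Validating markerless pose estimates against a marker-based gold standard, as in the Friel project, is often summarized with the mean per-joint position error (MPJPE). A minimal sketch with synthetic trajectories:

```python
import numpy as np

def mpjpe(pred: np.ndarray, gold: np.ndarray) -> float:
    """Mean per-joint position error between estimated and marker-based 3D
    keypoints; arrays are (frames, joints, 3), units follow the inputs (mm here)."""
    return np.linalg.norm(pred - gold, axis=-1).mean()

rng = np.random.default_rng(0)
vicon = rng.normal(size=(1000, 17, 3)) * 100               # toy gold-standard trajectories
dl_kda = vicon + rng.normal(scale=5.0, size=vicon.shape)   # DL estimate with ~5 mm noise
print(f"MPJPE: {mpjpe(dl_kda, vicon):.1f} mm")
```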
Fuller, Clifton David
University of Texas MD Anderson Cancer Center
Administrative Supplement: Development of functional magnetic resonance imaging-guided adaptive radiotherapy for head and neck cancer patients using novel MR-Linac device
Radiotherapy (RT) treatment of head and neck cancer aims to deliver a therapeutic dose to cancer cells while minimizing the damage to surrounding healthy tissue. Identifying tumors that respond well to treatment and those that do not is essential to make RT more effective and reduce side effects. Multiparametric MRI, a technique that combines anatomical with functional imaging, has proven useful in identifying early responders and radiation-resistant disease in head and neck cancer patients. These techniques could be used to adapt radiation therapy during treatment. Our parent grant aimed to develop hardware, software, and infrastructure for multiparametric MRI-guided RT for head and neck cancer patients. In this supplement, the resulting imaging data will be curated, annotated, and made publicly available to facilitate community-driven artificial intelligence (AI) model building efforts. The proposed one-year supplement includes curation of high-quality anatomical and functional MRI sequences and corresponding clinical data for each patient. These anonymized datasets will be made FAIR (findable, accessible, interoperable, reusable) and available for public use and will support the development of robust AI projects. The project will also initiate a series of public AI data challenges to foster novel AI innovation and solve clinically relevant RT problems. The success of this project will enable a modernized and integrated biomedical data ecosystem for public use of RT data for AI model building. The proposed benchmark datasets will provide a foundation to achieve the long-term goal of personalized medicine for head and neck cancer patients using AI to reduce side effects while maintaining high cure rates. This supplement will positively impact patients by enabling the characterization of malignancy for improved therapeutic intervention and downstream translational application of AI technologies.
NIDCR
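Header-level DICOM de-identification, one ingredient of the anonymized, FAIR imaging datasets the Fuller project describes, can be sketched with pydicom. The file names and tag list are illustrative, and complete de-identification (private tags, UIDs, dates, burned-in pixel text) requires a vetted profile, not just this:

```python
import pydicom

# Blank a few identifying header fields; this is a sketch, not a full profile.
ds = pydicom.dcmread("mri_slice.dcm")
for keyword in ("PatientName", "PatientID", "PatientBirthDate"):
    if hasattr(ds, keyword):
        setattr(ds, keyword, "")
ds.remove_private_tags()
ds.save_as("mri_slice_anon.dcm")
```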
Grundberg, Elin
Children's Mercy Hospital
Contextualizing and Addressing Population-Level Bias in Social Epigenomics Study of Asthma in Childhood
"Artificial intelligence (AI) and machine learning (ML) models are subject to biases inherent in data used to train them. Efforts to mitigate bias have focused largely on study designs, and implementation of fair quality checks. However, as focus is placed on generalizability of these models, it is critical to contextualize representativeness of study data used for modeling with respect to the population on which insights are intended to be used. Currently, research emphasizes descriptions of study cohorts, highlighting on whom analyses were performed. However, summary statistics cannot provide the granularity needed to identify potential bias brought on by diverse populations and limited sample sizes.
Given impacts of recruitment and selection bias, we are developing metrics to benchmark alignment of study participants against a geographic reference (i.e., subjects in same geographic region who met inclusion criteria but not enrolled). These metrics capture both geographic factors (i.e., measuring if study recruitment was well-distributed across the set of eligible patients), and comparisons between distributions of sociodemographic factors (i.e., identifying recruited families have significantly higher household incomes when compared to a matching eligible population). In parallel we are developing metrics of data stability across subgroups of a study cohort and across recruitment periods. Given many AI/ML models focus on modeling subspaces, simply utilizing all data that meets inclusion criteria can obfuscate imbalances. For example, race may only be missing in 5% of patients, but understanding all instances come from those aged 10-15 is critical for reliable use of model results.
To address some of such bias, we are developing population-aware imputation techniques. It is common to perform imputation prior to utilizing AI/ML models. Yet, when study sampling and/or missingness is not balanced by geographic region, imputation estimates for data in subgroups with a high degree of missingness, are driven by relationships from subgroups with more complete information. Creating a potential for bias in scenarios where these cohorts differ in a fundamental way. We hypothesize novel techniques can be developed to better account for similarity of patients. In turn, providing more precise estimates of missing data across subgroups that better reflect exposures. "
NIMHD
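A toy sketch of the population-aware imputation idea in the Grundberg project: fill missing values within geographic subgroups rather than from a global estimate dominated by well-sampled regions. Column names and values are invented:

```python
import numpy as np
import pandas as pd

# Fill missing household income with the median within each region, falling back
# to the overall median only when a region has no observed values at all.
df = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B", "B"],
    "income": [50_000, 54_000, np.nan, 32_000, np.nan, 30_000],
})
region_median = df.groupby("region")["income"].transform("median")
df["income_imputed"] = df["income"].fillna(region_median).fillna(df["income"].median())
print(df)
```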
Hsieh, Evelyn
Yale University
Use of Optical Character Recognition (OCR) to Enable Al/ML-Readiness of Data from Dual-Energy X-ray Absorptiometry (DXA) Images.
Fragility fractures caused by osteoporosis are costly to individuals and healthcare systems. A key limitation of artificial intelligence/machine learning research focused on osteoporosis has been the difficulty of capturing bone mineral density (BMD) data, which do not exist as structured data fields. Furthermore, to use data from dual-energy x-ray absorptiometry (DXA), current methods rely on machine learning (ML) algorithms that draw information from text summaries in the patient medical record. In our preliminary work as part of our parent R01, we have found that, while valuable, restricting the ML algorithm to radiologist-generated DXA summaries limits our ability to capture accurate, reliable DXA-related data from the EHR: text summaries are vulnerable to inaccuracies and lack clarity. In addition, when patients are referred to external facilities for their DXA scans, the information is returned in PDF format and scanned into the EHR, and imaged documents are not accessible to the text-based algorithms often used in Natural Language Processing (NLP)/ML applications. A superior source of these data is the original reports generated by DXA machines. These reports use a consistent, tabular format to present BMD and T-score results at the three regions of interest in osteoporosis research (lumbar spine, total hip, and femoral neck). In collaboration with colleagues at Vanderbilt University Medical Center (VUMC) and the VA Tennessee Valley Healthcare System (VATVHS), we propose to extend the use of a novel optical character recognition (OCR) system developed by their team to extract tabular BMD data directly from machine-generated and PDF versions of DXA reports (a generic sketch follows below). This proof-of-concept study will leverage our team's existing work with EHR data from the 2019 VA National Cohort, a national cohort of all Veterans who receive care within the VA system. This work is an important first step towards having accurate, valid, and reliable BMD data to include in the parent R01 and will have important applications for future projects exploring BMD among Veterans, an important and often underserved population. In addition, it makes tabular data from images more easily available and accessible for researchers applying AI/ML approaches.
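A hedged sketch of OCR-based extraction of tabular T-scores from a scanned DXA report page, per the Hsieh project above. The file name and row pattern are assumptions; the project's actual system (developed at VUMC/VATVHS) is purpose-built for these reports:

```python
import re
from PIL import Image
import pytesseract

# Run generic OCR over one scanned page, then pull site/T-score pairs by regex.
text = pytesseract.image_to_string(Image.open("dxa_report_page1.png"))
pattern = re.compile(r"(Lumbar Spine|Total Hip|Femoral Neck)[^\d-]+(-?\d+\.\d+)")
tscores = {site: float(val) for site, val in pattern.findall(text)}
print(tscores)
```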
Our lab studies retinal connectomics. Connectomes are Rosetta Stones for discovering how neural systems are wired: they reveal how neural systems are structured and how they function, and they inform us how neural circuits are altered by disease. We use the retina as a window onto normal brain wiring as well as blinding diseases such as retinitis pigmentosa (RP), age-related macular degeneration (AMD), and glaucoma, and brain diseases such as Alzheimer's. To study these complex networks, we use electron microscopes to reveal the neurons, glia, and all of the connections between these cells, assembling massive databases that reveal normal circuit topology frameworks as well as disease frameworks. These databases allow comparisons of normal networks to the pathological networks that emerge in disease. Prior research efforts from our lab have unmasked unexpected, pervasive complexities in mammalian retinal networks, informing neuronal modeling and understanding of the problems behind vision rescue. These efforts have also revealed that neural systems, while complex, are extraordinarily precise. There are no wiring errors in normal, healthy tissue. This was a surprise, because we expected substantial "biological noise"; there is none. Connections and the partnerships of those connections are precise. The other surprising finding is that when neural systems fail, they fail in predictable, precise ways, making wiring errors that were unexpected in their predictability. This supplement funds tool development to enable sharing of specific aspects of our datasets that are in great demand from the artificial intelligence/machine learning (AI/ML) community. Our work has built not only connectomics infrastructure for datasets; the annotation work within the datasets themselves also provides valuable ground truth to feed AI/ML approaches as training data. While our annotated databases have been the highest-resolution connectomics databases yet available, allowing discrimination of synapses and gap junctions as well as organelle data, we have lacked good tools to subset these connectomes to feed AI/ML approaches, allowing feature detection of the synaptic or sub-cellular features desired by the AI/ML community. The tools deriving from this supplement will interface with our open-source datasets, providing the entire connectomics community access to rich, validated, ground-truth data for AI/ML training and mining.
NHGRI
Kane-Gill, Sandra L
University of Pittsburgh at Pittsburgh
(MEnD-AKI) Multicenter Implementation of an Electronic Decision Support System for Drug-associated AKI
The goal of the parent project, Multicenter Implementation of an Electronic Decision Support System for Drug-associated AKI (MEnD-AKI), is to assess the effectiveness of a clinical surveillance system augmented with real-time predictive analytics to support a pharmacist-led intervention aimed at reducing the progression and complications of drug-associated acute kidney injury (D-AKI). Social determinants of health (SDOH) are important drivers of health inequities and disparities and are responsible for between 30% and 50% of health outcomes. As recent literature shows, there is a wealth of information in routine clinical notes that improves the performance of prediction models for AKI. Our ongoing parent project does not yet utilize SDOH or clinical notes, which carry important information about patient health status and access to health care. The proposed supplement project will develop integration, standardization, and processing tools and pipelines to create AI/ML-ready data using SDOH and unstructured clinical text data. We will develop and assess tools for: a) extracting, cleaning, and representing SDOH and unstructured text data; b) integrating SDOH into databases from the University of Florida (UF) and University of Pittsburgh (UPitt) for use in D-AKI risk model development and validation; and c) integrating clinical data and notes to prepare multimodal AI/ML-ready data at UF. We will acquire SDOH data in the domains of economic stability, education access and quality, healthcare access and quality, neighborhood and built environment, and social and community context. Using 9-digit ZIP and/or 11-digit Federal Information Processing Standards (FIPS) codes, we will link patient data to SDOH data (a minimal linkage sketch follows this entry). Our developed risk model will be enriched by incorporating SDOH variables, and we will evaluate the bias of the AKI risk model through subgroup analysis based on SDOH variables. Natural Language Processing (NLP) tools will be employed to extract medical concepts, medications, and SDOH from clinical notes; for medical concepts that cannot be extracted using existing tools, we will develop our own NLP tool using the Bidirectional Encoder Representations from Transformers (BERT) model. The completion of this supplemental project will provide multimodal AI/ML-ready data with additional data elements to improve the efficiency and performance of D-AKI risk prediction models and other AI applications.
NIDDK
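Linking patient records to area-level SDOH by ZIP or FIPS code, as in the Kane-Gill project, reduces to a keyed merge once the codes are normalized. A minimal pandas sketch with invented columns:

```python
import pandas as pd

# Zero-padding matters: codes read as integers lose leading zeros and fail to match.
patients = pd.DataFrame({"mrn": [1, 2], "zip9": ["021151234", "060101111"]})
sdoh = pd.DataFrame({"zip9": ["021151234", "060101111"],
                     "adi_rank": [12, 87]})  # e.g., an area deprivation index

patients["zip9"] = patients["zip9"].str.zfill(9)
sdoh["zip9"] = sdoh["zip9"].str.zfill(9)
linked = patients.merge(sdoh, on="zip9", how="left", validate="many_to_one")
print(linked)
```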
Krening, Samantha
Ohio State University
An automated AI/ML platform for multi-researcher collaborations for a NIH BACPAC funded Spine Phenome Project
Chronic Low Back Pain (cLBP) is a debilitating condition that affects millions of people globally. The parent grant (BACPAC) is focused on developing and validating a digital health platform that collects data from wearable technology on patients and healthy individuals. A goal is to diagnose and create treatment plans based on how someone moves rather than current subjective metrics. This project pertains to a data access pipeline and AI/ML workflows that will enable researchers to focus on the ML rather than data cleaning, organization, and processing. The pipeline begins with the collection of data in clinical and research environments from many sources across the BACPAC consortium, and extends through the validation of trained models. We aim to transform these complex, multi-source data to a “machine learning friendly” format to make them easily accessible to all researchers across the BACPAC consortium. The research is likely to lead to novel methods to analyze large, multi-source, time-series datasets. We expect that these analyses will shed new light on the prevention, diagnosis, and treatment of spine and musculoskeletal maladies.
NIAMS
Lacey, James V
Beckman Research Institute/City Of Hope
A More Perfect Union: Leveraging Clinically Deployed Models and Cancer Epidemiology Cohort Data to Improve AI/ML Readiness of NIH-Supported Population Sciences Resources
" In theory, large prospective observational cohort studies have great potential to contribute to AI/ML. Cohorts often include large sample sizes; long-follow-up; multiple clinical endpoints and phenotypes; and diverse, real-world data on lifestyle, environment, patient-reported outcomes, and social determinants of health, which together could help reduce data bias in AI/ML. In practice, important methodologic and technical questions about how best to use observational-cohort data for AI/ML are unanswered. We are using our California Teachers Study (CTS), a large cohort study, to address those questions. The CTS includes 133,477 female participants who have been followed continuously since 1995. Through surveys and linkages, the CTS has collected comprehensive exposure and lifestyle data and has identified over 28,000 cancers; over 34,000 deaths; and over 800,000 individual hospitalizations. Our team includes AI/ML experts at City of Hope (COH); population-science researchers from the CTS team; and cloud computing specialists from the San Diego Supercomputer Center's (SDSC) Sherlock Cloud. In partnership with SDSC, our CTS research data management lifecycle strategy has been cloud-first since 2015—but, like most cohort studies, that lifecycle was designed for investigator-initiated & project-specific analyses, rather than larger-scale AI/ML. We are addressing this “readiness gap” in cohort data by expanding our data & computing infrastructure and architecture; reconfiguring data exploration and aggregation tools and documentation; and testing a clinically deployed AI/ML model in CTS data. We are deploying the Amazon SageMaker platform within our secure and controlled-access CTS cloud environment. We are generating embeddings that will cluster CTS data into phenotype-based subgroups that can be used for essential AI/ML functions, such as cohort discovery, close-neighbor identification, and imputation. We are evaluating performance of a previously deployed risk model in CTS data to assess the potential for real-world cohort data to improve model performance. Our project’s combination of multidisciplinary experts from relevant fields; new embedding representations of observational cohort data; and a secure cloud infrastructure configured for AI/ML is generating valuable insights and lessons learned about use of NIH-supported cohort data in AI/ML applications. "
NCI
Levey, Allan I
Emory University
Piloting a web-based neuropathology image resource for the ADRC community
Alzheimer's Disease (AD) is the most common dementia, affecting over 45 million people globally. Sharing data between institutions is critical to developing new understanding and treatments. There is great potential for artificial intelligence/machine learning (AI/ML) techniques to be applied in neuropathology, but sharing of digital pathology images is uncommon. We have developed the Digital Slide Archive (DSA), an open-source, web-based platform to visualize, annotate, and analyze neuropathology imaging datasets. The DSA platform, funded through U24 and U01 grants from the NCI/NIH, has primarily been optimized for cancer-related image analysis workflows. Our goal was to collect digital slide sets from AD centers to develop tools and data models that facilitate data sharing. Neuropathologic evaluation of brain tissue is central to the diagnosis and staging of these diseases, but the raw histology data are not widely shared within this community. The increasing availability of whole slide imaging systems now makes data sharing feasible, although numerous technical hurdles make this challenging. We have collected neuropathology datasets from 7 universities and used these images, in conjunction with 500 cases scanned at Emory University, to develop an initial metadata schema to facilitate data sharing. The lack of standard file formats, inconsistent naming schemas, image de-identification, and the enormous size of these images are ongoing challenges. We have developed a set of tools to facilitate large-scale image de-identification and metadata cleanup, which we are continually improving for this work. We have also developed a customized interface to facilitate efficient viewing of neuropathology cases. The interface, leveraging the underlying metadata schema, makes it much easier for pathologists to scan through cases and view annotations and analysis results. The DSA also has a companion set of analysis tools, HistomicsTK, which we have developed to support whole slide image analysis. We are tuning these algorithms to support detection of nuclei, neurofibrillary tangles (NFT), and phospho-TDP inclusions as demonstration workflows. We are developing tutorials and a set of benchmarks to characterize the performance of these algorithms.
NIA
Liang, Rongguang
University of Arizona
Improving AI/ML-Readiness of data generated from NIH-funded research on oral cancer screening
The potential of AI/ML algorithms for solving cancer analysis tasks is well established, but to develop high-performance AI/ML algorithms and applications for automatic oral cancer diagnosis, a well-annotated, compatible, and clean dataset with a sufficient number of images is essential. Supported by NIH 3-R01-DE030682-01, we have developed and evaluated three generations of mobile dual-mode imaging devices, screened over 7,000 patients in India since 2019, and generated a dataset of at least 28,000 images and related information that represents an excellent resource for developing highly accurate and efficient AI/ML algorithms and applications for oral cancer diagnosis. The primary goal of this project is to transform the raw data collected through NIH-funded research on oral cancer screening into a clean, compatible, and well-annotated dataset that can be readily processed with AI/ML tools. Because popular AI/ML tools like PyTorch and TensorFlow are incompatible with string-based inputs, we convert patient information, such as lesion sites, tobacco use history, sex, age, and other subjective descriptions, into an AI/ML-compatible format. To ensure data cleanliness, we identify out-of-distribution (OOD) and low-quality images in the image data and rectify missing values and typos in the patient information data. Oral cancer datasets obtained from high-risk populations often suffer from imbalanced data, leading to AI/ML model bias. To minimize the impact of model bias, we employ data-level and algorithm-level methods to balance the dataset. Since the data are annotated manually, labeling errors are inevitable and can significantly impact the performance of AI/ML models. To address this issue, we use confident learning to identify mislabeled data points and study their effects on the stability of AI/ML models (a minimal sketch follows this entry). The key objective of converting the raw data into an AI/ML-compatible dataset is to facilitate the creation of dependable and precise AI/ML algorithms for automated oral cancer diagnosis. Building on this curated multi-modal dataset, we are designing AI/ML models that offer high levels of accuracy, interpretability, and reliability. This project holds significant potential for accelerating the development of AI/ML-based techniques for early oral cancer detection in low-resource settings, which could ultimately help reduce the morbidity and mortality associated with this disease.
NIDCR
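Confident learning of the kind the Liang project applies is implemented in the open-source cleanlab package. A minimal sketch; the probabilities and labels below are random stand-ins, and in practice pred_probs should be out-of-sample (e.g., from cross-validation):

```python
import numpy as np
from cleanlab.filter import find_label_issues

# Given observed labels and predicted class probabilities, rank the data points
# most likely to be mislabeled.
rng = np.random.default_rng(0)
pred_probs = rng.dirichlet([1, 1], size=500)   # shape (n_samples, n_classes)
labels = rng.integers(0, 2, 500)

issue_idx = find_label_issues(labels=labels, pred_probs=pred_probs,
                              return_indices_ranked_by="self_confidence")
print(f"{len(issue_idx)} suspected label errors; worst first: {issue_idx[:10]}")
```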
Linden, David R.
Mayo Clinic Rochester
Neurobiology of Intrinsic Primary Afferent Neurons
The bowel has independent neural reflexes, initiated by intrinsic primary afferent neurons (IPANs) in the enteric nervous system (ENS), that sense meals and pathogens to alter vascular, secretory, and motor function. Unique IPAN neuroplasticity adapts to inflammatory, hormonal, and neural stimuli to cause digestive disease. Diagnoses and therapeutic decisions rely on symptom classifications and, in some cases, thin-section pathology. 3D assessment of the ENS has the potential to transform digestive disease diagnosis, but solutions to the technical and labor-intensive barriers to adoption are needed. The objective of the parent grant, R01 DK129315, is to test the overall hypothesis that different classes of IPANs possess morphologies and physiology that uniquely contribute to intestinal function. With this supplement, we add approaches to Specific Aim 1 that incorporate artificial intelligence into the data analyses. Importantly, these additions will 1) accelerate the throughput of data analysis beyond that anticipated in the parent grant application, 2) ensure datasets are freely available to all and ready for AI analyses by other investigators, 3) develop AI tools and ground-truth annotated datasets for enteric neurobiology that will be freely available, and 4) expand analyses into human enteric neurons, a major step toward translation of the findings anticipated in the parent grant. There are three revisions to our approach to Specific Aim 1: 1) develop AI tools to acquire and analyze large-volume, high-resolution confocal microscopy data; 2) develop deep learning approaches to segment, annotate, and analyze morphological features of mouse IPANs; and 3) image the human enteric nervous system and apply deep learning tools to human tissues. AI technology may accomplish the goal of rapid and objective ENS annotation. Improving human ENS labeling for in situ and in vivo imaging will allow the deployment of our preclinical AI tools to transform the future of clinical gastrointestinal pathology. Further, the tools we are developing may be of broad scientific interest given ongoing efforts to map the landscape of both the central and peripheral nervous systems.
NIDDK
Majumdar, Amitava
University of California, San Diego
Neuroscience Gateway to Enable Dissemination of Computational And Data Processing Tools And Software.
The supplement project is developing a standardized provenance metadata framework to make NSG datasets and tools Findable, Accessible, Interoperable, and Reusable (FAIR). Since 2012, NSG has catalyzed progress in neuroscience by reducing the technical and administrative barriers that neuroscientists face in large-scale modeling and data processing requiring high-performance computing. NSG provides about twenty neuroscience software packages and tools used for neuronal modeling, processing of data (EEG, fMRI, etc.), and AI/ML work. NSG's user base is growing, with over 1,500 registered users currently. The NSG team acquires time on academic supercomputers yearly and makes it available fairly to NSG users. NSG has been enhanced with new features, making it an efficient environment for dissemination of lab-developed neuroscience software and tools. This supplement project is a first step towards recording metadata, or provenance, in a standards-based provenance metadata framework for any NSG tool and the datasets it produces, allowing NSG resources to be used in reproducible machine learning (ML) workflows. Provenance metadata is critical for supporting scientific reproducibility by implementing FAIR principles. This project integrates the World Wide Web Consortium (W3C) PROV standard-based provenance ontology called ProvCaRe into NSG, allowing users to record provenance metadata using standardized ontology classes (a minimal PROV example follows this entry). The use of the ProvCaRe ontology is expected to reduce term variability and improve the performance of ML workflows that rely on metadata terms for reproducibility. We demonstrate the integration and application of the ProvCaRe ontology using a neuroscience software package called the NeuroIntegrative Connectivity (NIC) tool, which analyzes high-fidelity brain recordings to compute functional brain networks in neurological disorders. The NIC tool has provenance metadata characteristics built into it and is the first NSG tool to carry provenance metadata from the beginning to the end of a dataset's lifecycle. In the second phase of this supplemental project, we utilize the Open Science Chain (OSC) project to provide a blockchain-based solution for maintaining the integrity of datasets and their provenance metadata. The supplement project will allow us, in the future, to integrate provenance metadata for other NSG tools, making NSG comprehensively more AI/ML ready.
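A minimal example of recording W3C PROV provenance in Python with the `prov` package, in the spirit of the NSG framework described above; the namespace, identifiers, and attributes are placeholders rather than ProvCaRe ontology terms:

```python
from prov.model import ProvDocument

# Record that a NIC-tool run, associated with a user, generated a dataset.
doc = ProvDocument()
doc.add_namespace("nsg", "https://example.org/nsg#")

doc.entity("nsg:functional_network_v1", {"prov:label": "NIC output dataset"})
doc.activity("nsg:nic_run_42")
doc.agent("nsg:user_jdoe")
doc.wasGeneratedBy("nsg:functional_network_v1", "nsg:nic_run_42")
doc.wasAssociatedWith("nsg:nic_run_42", "nsg:user_jdoe")

print(doc.get_provn())  # human-readable PROV-N serialization
```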
The Nulliparous Pregnancy Outcomes Study: Monitoring Mothers-to-Be (nuMoM2b; NICHD, 2009-2014) initiated a longitudinal cohort of 10,038 nulliparous women prospectively enrolled early in their first pregnancy that has continued with the nuMoM2b Heart Health Study (nuMoM2b-HHS; NHLBI and NICHD, 2014-2020) and current Continuation of the nuMoM2b Heart Health Study (nuMoM2b-HHS2; NHLBI, 2020-2027). The racially and socioeconomically diverse cohort participants have rich and accurate records-based phenotyping of their first pregnancy and associated outcomes, including risk factors for adverse pregnancy outcomes (APOs) and cardiovascular disease (CVD) measured in early pregnancy, with subsequent examinations during the early postdelivery period. As part of the cohort’s longitudinal follow-ups, a planned subgroup of 4,508 participants have ascertainment of APOs in additional pregnancies, assessments for risk factors and subclinical and clinical CVD, and completed laboratory assays on stored biospecimens. Omics analysis results, including GWAS, WGS, methylation, plasma proteomics, exosome proteomics, and placental RNAseq are also available or are in development for subgroups of the participants. Continued participant follow-ups and separately funded ancillary studies will further expand the duration and scope of data collection. With the support provided by this administrative supplement, the full nuMoM2b and nuMoM2b-HHS data, including omics data not yet deposited, will be moved onto BioData Catalyst and brought into alignment with FAIR principles. Upon completion, the data will be located on a single platform with pipeline and analytic tool support, and datasets will be machine-readable with relevant clinical ontologies assigned, with user-friendly metadata and documentation, and with future-ready processes for conversion of newly gathered data to AI/ML ready status. These steps will result in data ready for use in AI/ML analyses as well as traditional epidemiologic models and will enhance data accessibility to members of the research community, maximizing the potential scientific knowledge gain from the existing and future data contributed by this truly remarkable longitudinal cohort.
NHLBI
Mellins, Claude Ann
New York State Psychiatric Institute
Pathways to successful aging among perinatally HIV-infected and exposed young adults: Risk, resilience, and the role of perinatal HIV infection
The CASAH study has followed 340 predominantly Black and Latino/a youth with perinatally acquired HIV (PHIV) and youth perinatally HIV exposed but uninfected (PHEU) for 20 years – enrolled at ages 9-16 years from vulnerable communities in New York City – documenting health risk and resilience across childhood, adolescence, and emerging adulthood. CASAH is guided by Social Action Theory (SAT), examining: 1) the impact of HIV infection on behavioral health outcomes (e.g., mental health, sexual risk, substance use, adherence) and achievement of adult milestones (e.g., education, vocation, independence); 2) how SAT-informed risk and protective factors affect behavioral health and achievement of milestones in adolescence and young adulthood (AYA); and 3) trajectories of behavioral health across AYA and SAT-informed predictors of these trajectories. Currently in its fourth competing continuation (R01 MH69133-19), CASAH 4 is following this cohort through young adulthood (20s-early 30s). CASAH is one of the most comprehensive longitudinal datasets on adolescents and young adults living with PHIV or PHEU – ideal for machine learning (ML) and sharing. However, gaps in the full CASAH dataset (e.g., unprocessed data, unscaled variables) and in data documentation act as barriers to using it for ML and sharing it via NIH-supported data repositories. The aims of this data readiness administrative supplement are to prepare (i.e., collate, clean, and evaluate) the full CASAH dataset (spanning 10 waves of data collection) and create comprehensive documentation of all data elements (i.e., provenance, missingness, utilized scales, etc.), while ensuring the data can be easily and securely transported and utilized. ML data analytic approaches will also be tested on the finalized data to confirm completeness, accuracy, and usability. This supplement will not only advance our knowledge of health outcomes and social determinants of health in vulnerable young people affected by HIV but will also allow for cross-cohort studies with non-HIV populations (e.g., the National Longitudinal Study of Adolescent to Adult Health; the Boricua Youth Study; ABCT). Overall, this will strengthen the ability to identify multimodal determinants of health across critical developmental stages, aiding in the development of evidence-based interventions for youth living with HIV, youth with chronic health conditions, and youth affected by a range of health disparities.
NIMH
Mirmira, Raghavendra G
University of Chicago
The Integrated Stress Response in Human Islets During Early T1D
In the parent project, we hypothesized that activation of a specific integrated stress response is an early cellular response in type 1 diabetes (T1D) that determines pancreatic β-cell survival and can be monitored in pre- and early-T1D individuals with minimal invasiveness. To test this hypothesis, we have assembled a multidisciplinary team collecting large suites of heterogeneous biological data, including mRNA, lipid, protein, and immune system profiles, from individuals at various stages of T1D, as well as healthy controls. Analyses of these data are yielding panels of potential T1D biomarkers associated with cellular stress in human pancreatic islets. The data collected under the parent project, as well as other data collected from prior collaborations studying genetically at-risk children, such as the Diabetes AutoImmunity Study in the Young (DAISY) and The Environmental Determinants of Diabetes in the Young (TEDDY), are excellent candidates to be used as “flagship” datasets for Artificial Intelligence/Machine Learning (AI/ML) readiness. For the supplement grant, we focus on the development of AI/ML-ready data to tackle the challenges of processing large heterogeneous datasets in addition to identifying molecular signatures of T1D. Our first task focuses on generating AI/ML-ready datasets that are properly annotated to address the main data processing challenges: missing data and the introduction of bias. The second task focuses on generating AI/ML-ready datasets that can be used to establish molecular biomarkers of disease. To allow other AI/ML researchers to use the machine learning datasets efficiently, we will provide detailed information about their performance and intended uses in model cards. Our goal is to create reusable software approaches and data packages that can be directly imported into common AI/ML packages. We will share these with the broader AI/ML research community, gather feedback, and continue to refine the AI/ML readiness software development plan. Datasets are released on the ‘AI/ML Ready Datasets for Type 1 Diabetes’ platform (https://data.pnnl.gov/group/nodes/project/33480).
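As a small illustration of what a machine-readable model card might contain (the field names below are hypothetical, not the project’s actual schema), such a card can be serialized as plain JSON alongside the dataset:

```python
# Hypothetical model-card record; field names are illustrative only.
import json

model_card = {
    "name": "t1d-stress-signature-classifier",  # hypothetical model name
    "intended_use": "ranking candidate T1D biomarkers; research use only",
    "training_data": "multi-omics profiles (mRNA, lipid, protein, immune)",
    "known_limitations": ["missing-data imputation may introduce bias"],
    "metrics": {"auroc": None},  # to be filled in after evaluation
}
print(json.dumps(model_card, indent=2))
```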
NIDDK
Musen, Mark A
Stanford University
Improved metadata authoring to enhance AI/ML readiness of associated datasets
The CEDAR Workbench is technology that makes it easy to describe datasets using metadata that comply with community standards. Through the use of (1) reporting guidelines that enumerate the things that need to be said about an experiment for a scientist—or an AI algorithm—to make sense of what has been done and (2) ontologies that formalize those descriptions, the CEDAR Workbench offers a convenient mechanism for investigators to share their data in a useful way. The standardized metadata that scientists create using CEDAR make datasets more valuable both to people and to machines that might learn from the data to make new discoveries. Many NIH-supported consortia, including those working on the RADx Data Hub for sharing of data related to COVID diagnostics, rely on the CEDAR Workbench. To advance the role of CEDAR in the creation of AI-ready datasets, we worked to make CEDAR deployable in the cloud by containerizing all CEDAR microservices and by making these microservices discoverable and observable. Furthermore, we worked to make CEDAR a highly available system that is easy to maintain and evolve. We simplified and enhanced the system’s architecture, taking advantage of new approaches and components that were not available to us when the system was first designed. As a result, CEDAR is now much more scalable, maintainable, and deployable. The new architecture will help investigators to create standards-adherent metadata more easily, advancing the application of AI techniques to a wide range of data of importance to the biomedical community.
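As a loose sketch of the idea behind template-driven metadata (using the generic jsonschema package, not CEDAR’s own API or template model; the fields are hypothetical), a reporting-guideline-style schema can flag incomplete metadata before it is shared:

```python
# Validate a metadata record against a much-reduced, hypothetical template.
from jsonschema import ValidationError, validate

schema = {
    "type": "object",
    "required": ["organism", "assay", "tissue"],
    "properties": {
        "organism": {"type": "string"},
        "assay": {"type": "string"},
        "tissue": {"type": "string"},
    },
}

record = {"organism": "Homo sapiens", "assay": "RNA-seq"}  # "tissue" missing
try:
    validate(instance=record, schema=schema)
except ValidationError as err:
    print("metadata incomplete:", err.message)
```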
NLM
Neelamegham, Sriram
State University of New York at Buffalo
Application of machine/deep-learning to the systems biology of glycosylation
The NHLBI grant “Systems Biology of Glycosylation” applies bioengineering approaches to study blood cell glycosylation from both a basic science and a translational perspective. The goal is to develop a quantitative link between the cellular transcriptome and epigenetic status and the resulting glycosylation profile. To achieve this, two types of perturbation experiments are performed. In the first, CRISPR-Cas9/sgRNA is used to implement defined system perturbations, and the resulting changes in the cellular glycome are measured. This represents the ‘labeled dataset’. In the second, biochemical stimuli are applied to perturb cell state, and again cell glycosylation status measurements are made. This is the ‘unlabeled dataset’, as the perturbation is imprecise. In each case, several experimental outputs or ‘features’ are measured, including: 1) single-cell next-generation sequencing (NGS) measurements of the cellular transcriptome; 2) lectin binding using spectral flow cytometry; and 3) mass spectrometry to obtain detailed glycan structure data. Mathematical methods are developed to fuse results from these different methods and develop input-output responses. Currently, such modeling relies on prior biochemical knowledge that is curated in pathway maps, linear mixed models, and explicit programming. As an alternative to this traditional approach, this supplement develops ML/DL (machine learning/deep learning) supervised learning models to analyze the same data. Successful completion of this project will confirm the value of ML/DL modeling in the study of blood cell and glycoscience applications, particularly in a multi-omics context. It will reveal whether cellular regulatory pathways discovered using blood cells can be generalized to other cell/tissue/organ systems. The identification of key markers/checkpoints of glycosylation also has translational significance, as it can inform both patient stratification in the context of clinical trials and precision medicine applications.
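A minimal sketch of this labeled-plus-unlabeled setup, using synthetic stand-ins for the glycomic features and scikit-learn’s self-training wrapper (all names and dimensions below are illustrative):

```python
# Semi-supervised sketch: labeled CRISPR-perturbation samples plus
# unlabeled stimulus-perturbation samples; unlabeled targets are -1.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(100, 20))    # e.g., lectin-binding features
y_labeled = rng.integers(0, 2, size=100)  # e.g., glycan phenotype class
X_unlabeled = rng.normal(size=(300, 20))  # biochemical-stimulus samples

X = np.vstack([X_labeled, X_unlabeled])
y = np.concatenate([y_labeled, -np.ones(300, dtype=int)])

model = SelfTrainingClassifier(RandomForestClassifier(n_estimators=200))
model.fit(X, y)
print("pseudo-labeled samples:", int((model.transduction_[100:] != -1).sum()))
```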
NHLBI
Nelson, Amanda E
University of North Carolina Chapel Hill
Development of an AI/ML-ready knee ultrasound dataset in a population-based cohort
Knee osteoarthritis (KOA) is highly prevalent and frequently debilitating. Development of potential treatments has been hampered by the heterogeneous nature of this common chronic condition, which is characterized by several subgroups, or phenotypes, with different underlying pathophysiological mechanisms. Imaging, genetics, biochemical biomarkers, and other features can be used to characterize phenotypes, but variations in data types can make it difficult to harmonize definitions. Ultrasound (US) is a widely accessible, time-efficient, and cost-effective imaging modality that can provide detailed and reliable information for all joint tissues. Application of deep learning methodology to discover ultrasound features associated with pain and radiographic change in KOA is highly innovative and will be a major step forward for the field. We will leverage standardized ultrasound images from the diverse and inclusive population-based Johnston County Health Study (JoCoHS), the new enrollment phase of the 30-year Johnston County OA Project, which includes Black, White, and Hispanic men and women aged 35-70. We will utilize deep learning to identify features of US images that are associated with aspects of knee OA while also generating an AI/ML-ready FAIR dataset for use by the research community. By developing and maintaining an AI/ML-ready repository of standardized ultrasound images from this generalizable cohort, we can enhance the uptake of this modality and contribute to further study of its use in OA worldwide, including in low-resource settings and across populations.
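As a sketch of the intended deep learning workflow (random tensors standing in for real US frames; not the project’s model or labels), a pretrained CNN can be fine-tuned on labeled ultrasound images with PyTorch:

```python
# Transfer-learning sketch: adapt a pretrained ResNet to a 2-class
# ultrasound task (e.g., pain vs. no pain); data here are random stand-ins.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)  # replace the classifier head

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 224, 224)   # stand-in batch of US frames
labels = torch.randint(0, 2, (8,))
optimizer.zero_grad()
loss = loss_fn(model(images), labels)
loss.backward()
optimizer.step()
print(f"one training step, loss = {loss.item():.3f}")
```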
NIAMS
Rhee, Kyu Y
Weill Medical College of Cornell University
Towards AI/ML-enabled molecular epidemiology of Mycobacterium tuberculosis
This project aims to utilize artificial intelligence (AI) and machine learning (ML) to analyze the whole-genome sequencing (WGS) dataset of clinically annotated Mycobacterium tuberculosis isolates generated through the parent award, with the goal of identifying novel potential transmission-blocking targets and vaccine antigens. Our focus will be on the phylodynamics of Mtb, a statistical framework that infers disease dynamics from genetic data by leveraging the evolutionary tree of disease agents. This approach aims to enhance our understanding of genes associated with Mtb transmission and evolution, thereby facilitating the development of targeted control and prevention strategies. (1) We will investigate optimal data representations for phylodynamic applications of AI/ML. To make the Mtb WGS dataset suitable for deep learning (DL) approaches, it must be encoded into a machine-readable format. Current genetic data representations for DL, which simplify datasets into summary statistics or images, are not suitable for infectious diseases, as samples are often collected at different timepoints. We will utilize a genealogy of sampled sequences as a new input data structure for phylodynamic applications of DL, as this type of data structure encodes the underlying evolutionary and epidemiological histories of disease dynamics. (2) We will develop likelihood-free, scalable DL/ML frameworks for inferring important epidemiological parameters and mutations of concern, capitalizing on the principles of epidemiology and population genetics. We will then apply our newly developed framework to the Mtb WGS dataset to identify genetic determinants of transmissibility in Mtb and their phenotypic association with the survivability of this pathogen during inter-host transmission in the aerosol phase. (3) We will provide FAIR-curated, comprehensively annotated AI/ML-ready datasets to the research community. Our Mtb WGS dataset, along with other curated datasets, will be standardized, annotated, and documented to rigorous standards, creating the most extensive centralized AI/ML-ready Mtb genetic database to date. This resource will advance the development of new computational methods for the molecular epidemiology of bacterial pathogens and help launch novel discoveries, including meaningful clinical parameters for the control and prevention of Mtb. To promote the use of our database and methods, we will develop, test, and distribute new publicly available open-source software programs.
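To illustrate the genealogy-as-input idea in miniature (a toy encoding, not the representation under development), a Newick genealogy can be parsed with Biopython and reduced to simple per-tip numeric features:

```python
# Parse a toy genealogy and encode each tip by its root-to-tip path length,
# a crude stand-in for the richer encodings the project investigates.
from io import StringIO
from Bio import Phylo

newick = "((A:1.0,B:1.5):0.5,(C:0.7,D:0.9):1.1);"  # toy four-tip genealogy
tree = Phylo.read(StringIO(newick), "newick")

# Tree.distance(t) with a single argument measures from the root.
encoding = {tip.name: tree.distance(tip) for tip in tree.get_terminals()}
print(encoding)  # {'A': 1.5, 'B': 2.0, 'C': 1.8, 'D': 2.0}
```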
NIAID
Schaefer, Andrew J
Rice University
Administrative Supplement to Support Collaborations to Improve AIML-Readiness of NIH-Supported Data for Parent Award SCH: Personalized Rescheduling of Adaptive Radiation Therapy for Head & Neck Cancer
Toxicity prediction models for cancer therapy suffer from a lack of reference datasets of known provenance. The proposed project aims to develop a highly curated dataset of multi-observer segmented, multi-parametric, and multi-time-point MRI-CT-RT dose/radiomics data for mid- and post-treatment therapeutic response/tumor control probability (TCP) prediction for head and neck cancer (HNC) patients treated with radiation therapy (RT). The dataset includes pre-, on-, and post-therapy serial imaging and prospectively collected toxicity data for head and neck therapy cohorts. The project seeks to address critical barriers to developing accurate and reliable machine learning (ML)/artificial intelligence (AI)-based models through expert curation of the dataset and multi-observer annotation of known provenance and standardized post-processing. The successful completion of the proposed efforts will directly impact scientific advances in predictive modeling for cancer treatment and promote the development of ML-based prediction models for therapy response and normal tissue complication probability, enabling estimation of the trajectory of normal tissue injury and applying ML/AI approaches to both segmentation quality and toxicity assessment.
NCI
Stanton, Bruce A.
Dartmouth College
Retrieval, Reprocessing, Normalization and Sharing of Gene Expression and Lung Microbiome Data Sets to Facilitate AI/ML Analysis Studies of Bacterial Lung Infections
The CDC reports that more than 2.8 million antibiotic-resistant infections occur in the United States each year and that, unfortunately, 35,000 people die from these infections. Moreover, 7.7 million people in the world died from bacterial infections in 2022 alone. Thus, there is a pressing need to develop more effective treatments for bacterial infections, especially those that are resistant to available antibiotics. Accordingly, the goal of our research is to develop new approaches to treat antibiotic-resistant infections and to facilitate research by other scientists. In the research supported by the supplemental funding from the NIH, we will create an easy-to-access compendium of ~25,000 archived data sets on bacterial infections, most funded by the NIH, enabling the research community to easily mine the data, enhance our understanding of the biology of lung infections, and develop new therapeutic approaches that will reduce the significant disease and death caused by these infections. Working with the Dartmouth Research Computing Center, we will create a searchable online data portal containing artificial intelligence and machine learning (AI/ML)-ready gene expression data sets for various antibiotic-resistant pathogens, including the ESKAPE pathogens (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, Enterobacter spp.), other bacterial lung pathogens (Haemophilus influenzae, Burkholderia cenocepacia, Streptococcus pneumoniae), and clinically relevant fungal lung pathogens (Aspergillus fumigatus, Candida albicans). The data portal will also contain searchable and downloadable AI/ML data sets relevant to chronic lung diseases, which often involve antibiotic-resistant microbial communities. By making these archived data easier to find, access, and reuse, we anticipate that we and other scientists will identify new targets in antibiotic-resistant bacteria that will lead to the development of novel treatments for these infections.
NHLBI
Sung, Kyung Hyun
University Of California Los Angeles
A structured multi-scale dataset with prostate MRI for AI/ML research
Prostate magnetic resonance imaging (MRI) datasets are crucial for training and validating artificial intelligence and machine learning (AI/ML) algorithms. However, publicly available datasets are often limited by the uncertainty and bias associated with biopsy-confirmed histopathology, which is prone to sampling error and interpretation variability. This project aims to improve the AI/ML-readiness of prostate MRI data by providing ground truth labels that link multiscale information across clinical, radiologic, and pathologic data. Leveraging an ongoing NIH R01 project (R01-CA248506) focused on developing novel quantitative MRI and AI/ML methods for predicting clinically significant prostate cancer, the proposed work aims to augment the investigative team with experts in MRI-ultrasound fusion biopsy and biomedical informatics. This collaborative effort will result in a unique dataset of consented subjects who underwent prostate MRI, biopsies, and prostatectomy, along with structured clinical, radiologic, and pathologic findings shared in a standardized manner using a clear data dictionary. The availability of this augmented dataset will facilitate direct comparison and validation of different AI/ML models, which utilize different ground truth labels. It will also provide potential opportunities to combine with other publicly available AI/ML datasets. The ultimate goal is to refine and improve AI/ML models for image-to-histopathology correlation and temporal monitoring of prostate cancer in patients undergoing active surveillance.
NCI
Terskikh, Alexey V
Sanford Burnham Prebys Medical Discovery Institute
Novel Strategy to Quantitate Delayed Aging by Caloric Restriction
Our parental R21 application proposed to integrate data sets from single-cell imaging, ATAC-seq, and RNA-seq modalities from control-diet mice and caloric restriction (CR)-diet mice. These experiments will provide a large amount of multidimensional and multimodal data that are not only suitable for machine learning applications but in fact absolutely require machine learning approaches to make sense of the vast datasets. For example, each image of a nucleus (up to 5,000 nuclei are analyzed per replicate per sample) will generate up to 1,000 texture features (e.g., threshold adjacency statistics). Critically, our datasets come from three different types of measurements (microscopic imaging, chromatin accessibility, and gene expression), of which two modalities (imaging and sequencing) are very different. At present, all three data streams are handled in conventional ways, primarily using very large Excel files; this data structure and format is poorly suited both to handling large data sets (increasing the chance of errors during copy/paste procedures) and to machine learning integration. Hence there is a need for custom-built approaches to cleaning, filtering, quality control, handling, and analysis of imaging, ATAC-seq, and RNA-seq data. The most critical step is to structure these diverse datasets in a common format that streamlines storage and handling and enables downstream integration and analyses using machine learning algorithms. To meet these challenges, we will implement the AnnData format, which offers a broad range of computationally efficient features, including sparse data support, compatibility with Scanpy, and a PyTorch interface. To ensure that AnnData objects from very different modalities (imaging, ATAC-seq, RNA-seq) can be analyzed together, we will test-run existing computational algorithms best suited to efficiently assimilate and combine multi-omics data to identify key factors that drive aging. Most critically, we will combine large datasets from imaging, ATAC-seq, and RNA-seq using Bayesian methods for integrating multi-omics data and hyperbolic embedding with principled criteria for choosing the best-fitting curvature and dimension. Finally, we will establish an electronic repository of Python scripts and data structures for straightforward dissemination to the broad research community.
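A minimal sketch of the planned container, with synthetic values: per-nucleus imaging texture features packaged as an AnnData object with sample-level metadata, written to the compact .h5ad on-disk form (the same pattern holds for ATAC-seq or RNA-seq matrices):

```python
# Package per-nucleus texture features in AnnData; values are synthetic.
import anndata as ad
import numpy as np
import pandas as pd
from scipy import sparse

n_nuclei, n_features = 5000, 1000
X = sparse.random(n_nuclei, n_features, density=0.1, format="csr")

adata = ad.AnnData(
    X=X,  # sparse support is one of AnnData's efficiency features
    obs=pd.DataFrame({"diet": ["CR", "control"] * (n_nuclei // 2)},
                     index=[f"nucleus_{i}" for i in range(n_nuclei)]),
    var=pd.DataFrame(index=[f"texture_{j}" for j in range(n_features)]),
)
adata.write_h5ad("imaging_features.h5ad")  # ML-friendly on-disk format
print(adata)
```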
NIA
Vonder Haar, Cole
Ohio State University
Dopamine modulation for the treatment of chronic dysfunction due to traumatic brain injury
The goal of this supplement is to compile, harmonize, and analyze large-scale behavioral datasets in rat models of traumatic brain injury (TBI). As part of the parent grant, multiple datasets describing deficits in attention, impulsivity, and decision-making after TBI were and are being collected. These resulted in millions of lines of data across studies – a rare phenomenon for animal TBI. Given the varied and heterogeneous nature of brain injury, these large datasets capture rare and common phenotypes across this broad spectrum. As such, we will leverage machine learning methods to better understand what factors determine individual vulnerability and resilience to injury. Two databases will be compiled and prepared with metadata for automated processing. One dataset will comprise risky decision-making and have roughly 2 million lines of data, with approximately 70% corresponding to “pure” control or TBI conditions (i.e., no other interventions). The second dataset will have just under 1 million lines of data, with approximately 80% corresponding to “pure” control or TBI conditions, and with multiple injury severities. We will analyze these to understand how vulnerability to chronic impairments from TBI can be detected early so that treatments may be better targeted and developed.
NINDS
Wagner, Alex Handler
Research Institute Nationwide Children's Hospital
Development and validation of a computable knowledge framework for genomic medicine
The clinical interpretation of genomes is a labor-intensive process that remains a barrier to scalable genomic medicine. Efforts to improve this “interpretation bottleneck” have resulted in the development of clinical classification guidelines and databases for genomic variants in Mendelian diseases and cancers. The development of AI-augmented genome interpretation systems is a potential solution to help scale the interpretation process. Development of such systems will benefit from aggregation and collation of evidence that is in an AI-ready state. The parent award for this supplement is addressing this challenge through the development and validation of a computable framework for genomic knowledge (R35 HG011949). These efforts are underway in collaboration with the broader genomic knowledge community under the auspices of the Global Alliance for Genomics and Health (GA4GH). The NIH-supported Genome Aggregation Database (gnomAD) is currently the largest and most widely used public resource for population allele frequency data. These data are commonly used as strong evidence to rule out variant pathogenicity, making this a highly impactful resource for filtering out variants that are unlikely to be causative for a Mendelian disease. The importance of the gnomAD population allele frequency data to clinical interpretation systems makes the resource an ideal candidate for AI-readiness. The objective of this project is to improve the AI/ML readiness of gnomAD through application of the GA4GH Variation Representation Specification (VRS), and the associated genomic knowledge framework developed in the parent award, to data from the gnomAD database. As a result of this work, we will be able to couple the semantic precision of VRS and the genomic knowledge framework to the high-performance genomic data search capabilities of the Hail platform on which the gnomAD data are stored. Our aims include development of tools for translating large-scale genomics file formats (i.e., Variant Call Format/VCF) into VRS objects, design of semantic data models for representing population allele frequency data, and implementation of these tools and models on the gnomAD Hail tables and GraphQL interface. Our work was initially described as part of the gnomAD v4 data release in November 2023: https://gnomad.broadinstitute.org/news/2023-11-ga4gh-gks/.
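For orientation, a VRS (version 1.x) Allele rendered as plain JSON has roughly the shape sketched below; the coordinates and sequence digest are placeholders, and the actual pipeline uses the GA4GH VRS Python tooling, which also computes digest-based identifiers:

```python
# Rough shape of a VRS 1.x Allele; all values here are placeholders.
import json

allele = {
    "type": "Allele",
    "location": {
        "type": "SequenceLocation",
        "sequence_id": "ga4gh:SQ.example",  # placeholder sequence digest
        "interval": {
            "type": "SequenceInterval",
            "start": {"type": "Number", "value": 44908821},
            "end": {"type": "Number", "value": 44908822},
        },
    },
    "state": {"type": "LiteralSequenceExpression", "sequence": "T"},
}
print(json.dumps(allele, indent=2))
```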
NHGRI
Waldron, Levi David
Graduate School of Public Health and Health Policy
Cancer Genomics: Integrative and Scalable Solutions in R/Bioconductor
Large numbers of curated and annotated multi-omic cancer datasets have been developed and enhanced through the parent grant and distributed via the R/Bioconductor project. However, the complexity of existing data structures and their lack of accessibility outside the R/Bioconductor ecosystem have impeded the development of AI/ML-based methods for those datasets. Lack of standardization between data compendia and inadequate documentation of key characteristics of study cohorts have further limited the development of inclusive AI/ML models that leverage diverse datasets while accounting for population subgroups. This administrative supplement creates a FAIR and well-annotated data repository for cancer omics datasets using current best practices for AI/ML-ready data. We automate the conversion of Bioconductor data resources, including 188,323 multi-omic cancer profiles from 373 cBioPortal studies and 22,588 metagenomic profiles from 93 studies in curatedMetagenomicData, into plain-text formats with file manifests of samples and datasets. We further annotate key characteristics of each study cohort by literature review to provide a more complete picture of the representativeness of study participants and the comparability of independent studies. Finally, we harmonize annotations using controlled language from ontology databases, which innately establish relationships between attributes. This project will produce the largest omics data repository specifically for research in AI/ML methods to date, facilitating the development of robust and inclusive models that account for diverse population subgroups. We will provide documented, runnable usage examples using TensorFlow, PyTorch, and scikit-learn, enabling the broader research community to leverage these curated resources in the development of AI/ML-based methods for the analysis of cancer genomics data. We expect applications to include biomarker identification and disease subtyping, validation of therapeutic targets for personalized medicine, and molecular pathway analysis for the development of new therapeutic approaches. The successful completion of this project will extend the impact and utility of the parent grant, address the critical need for more inclusive models that incorporate diverse population subgroups in cancer genomics research, improve the accessibility and standardization of cancer genomics datasets, and accelerate progress in the field of AI/ML-based cancer genomics research.
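A minimal sketch of the intended usage pattern (the file and column names are hypothetical): load a plain-text expression matrix and its sample manifest with pandas, then fit a scikit-learn model:

```python
# Load plain-text omics exports and fit a baseline classifier.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical files: genes x samples matrix plus a sample manifest.
expr = pd.read_csv("study_expression.tsv", sep="\t", index_col=0)
manifest = pd.read_csv("study_manifest.tsv", sep="\t", index_col="sample_id")

X = expr.T.loc[manifest.index]  # align samples with manifest rows
y = manifest["subtype"]         # hypothetical cohort annotation

scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
print(f"mean 5-fold CV accuracy: {scores.mean():.2f}")
```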
NCI
Yin, Yanbin
University of Nebraska Lincoln
Carbohydrate enzyme gene clusters in human gut microbiome
CAZymes (Carbohydrate Active Enzymes) are enzymes that act upon specific glycosidic linkages to degrade or synthesize polysaccharides. CAZymes are an extremely important part of the genomic repertoire of the human microbiome, especially in the human gut. CAZymes work with other proteins, such as sugar transporters, to act on a specific carbohydrate substrate. Genes encoding these proteins often form physically linked gene clusters known as CAZyme Gene Clusters (CGCs) in bacterial genomes. CGCs that have been experimentally characterized are termed polysaccharide utilization loci (PULs), with known polysaccharide substrates (e.g., starch, mannan, xylan, and glucan). The ability to predict the carbohydrate substrates of CAZymes and CGCs will significantly enhance the emerging practice of personalized nutrition, e.g., using gut microbiome sequencing to infer whether a person will respond to certain dietary fibers or prebiotics. To apply AI/ML technology to carbohydrate substrate prediction, two types of training data must be prepared in a computer-readable format: (1) PULs (experimentally characterized gene clusters with known carbohydrate substrates) curated from the literature, and (2) CGCs (without known carbohydrate substrates) predicted from the human microbiome. Therefore, the major goal of this AI/ML-readiness project is to develop a consistent, standardized, and systematic format for PULs and CGCs to make them AI/ML ready, not only for our parent R01 project but also for other data scientists and nutrition scientists. To achieve this goal, we have assembled a multi-disciplinary research team including three faculty, one postdoc, and three graduate students. These members have all the necessary expertise in nutritional science and CAZymes, statistical ML model development, and bioinformatics and ML application development. Two aims with four subtasks and four milestones are planned to format and document the PUL and CGC data so that they are readily available to other data scientists and nutrition scientists. All AI/ML-ready data will be freely available on two online data repositories: dbCAN-PUL and dbCAN-seq. This project will contribute to the basic understanding of dietary modulation of the human microbiome and to applied personalized nutrition research.
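As an illustration of what one standardized record might look like (the field names are illustrative, not the final dbCAN-PUL/dbCAN-seq schema), a single PUL or CGC could be serialized as:

```python
# Hypothetical standardized PUL/CGC record; field names are illustrative.
import json

record = {
    "cluster_id": "CGC0001",  # hypothetical identifier
    "genome": "Bacteroides thetaiotaomicron VPI-5482",
    "substrate": "starch",    # known for PULs; null for predicted CGCs
    "genes": [
        {"locus_tag": "BT3698", "family": "GH13", "role": "CAZyme"},
        {"locus_tag": "BT3702", "family": "SusC", "role": "transporter"},
    ],
}
print(json.dumps(record, indent=2))
```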
NIGMS
Thirty-eight awards were made in 2021 to principal investigators at 35 different institutions across the country. Awardee projects and their descriptions are available below.
Substance misuse comprises a complex set of conditions, often associated with comorbidities and social factors, and can lead to poor outcomes. Opioid misuse, non-opioid illicit drug use, and alcohol misuse can also lead to repeated encounters with hospital emergency departments or first responders. Although substance use disorders are a leading cause of repeat hospital visits, our fragmented data systems do not generate comprehensive information on the scope and character of this poorly treated condition that would allow providers to improve and monitor the quality of care. Crucial social and behavioral determinants strongly linked with substance use (e.g., prehospital behavioral events from ambulances, public health data) are not readily available to health systems, but they are important data that can be used to better train artificial intelligence and machine learning (AI/ML) models. In principle, hospitals are well-positioned to address these challenges. In practice, these opportunities are frequently missed given the fragmented structure and design of current data systems. Many patients living with substance misuse visit a specific hospital for the first time after an overdose or a medical condition related to drug use, such as infection or trauma. This supplemental grant will build an AI/ML-ready public health informatics Substance Use Data Commons and share a novel, all-inclusive prediction model that will help guide clinical interventions and regional health policy. In this supplemental grant, we aim to foster an academic, public, and private collaboration to build a data ecosystem that will harmonize data across a Wisconsin regional hospital, prehospital agencies like local fire departments, and public health agencies for the first time. We will build a cohort of patients with substance misuse, with linked data engineered as an AI/ML-ready data commons. We will train and test an AI/ML model that can prioritize those at the highest risk for poor outcomes and uncover important biases in our data sources with input from health equity experts. Access to combined data from hospitals, public health agencies, and first-responder agencies could provide a comprehensive data resource that would allow us to reliably identify, risk-stratify, and prioritize care for some of Wisconsin’s most vulnerable residents through AI/ML modeling.
NIDA
ALSHAWABKEH, AKRAM N.
NORTHEASTERN UNIVERSITY
Addressing Class Imbalance and Missingness in the PROTECT Database
PROTECT has collected extensive datasets on the environmental and prenatal conditions of pregnant mothers (e.g., exposure, socioeconomic, and health data) from a cohort of over 2,000 expectant mothers and their children, resulting in more than 2,400 data points per participant. The PROTECT database has the potential to help unlock relationships that tie environmental factors to adverse birth outcomes. We would like to leverage powerful AI/ML toolsets to help identify and establish these relationships. But before we can start leveraging these powerful tools, we must make our data AI-ready. Certain PROTECT datasets suffer from a high degree of missingness. Multivariate imputation is a popular method for analyzing data with missingness, though it relies on specifying a single model to impute missing values. We have developed Multiple Imputation by Super Learning (MISL), a method that leverages ensemble learning to combine a variety of parametric and nonparametric models to generate imputations. MISL performs well when estimating complex functions—including interaction terms—which will enable the modeling of complex relationships between observed and missing data, accommodating nonlinear relationships and correlated exposures. In many birth cohorts, the imbalance between term and preterm births results in highly imbalanced datasets, posing challenges to existing machine learning algorithms, which assume relatively balanced labeling. To overcome this challenge, we use area under the curve (AUC) as the cost objective to mitigate the label imbalance while performing classification. For regression, kernel density estimation is used to approximate the likelihood of each sample. This likelihood is then used as an inverse probability weight to reweight each sample during regression, replicating a more balanced design. Once we have completed these efforts, we will engage multiple trainee communities to identify gaps in our AI/ML data readiness through a series of hack-a-thons, where trainees will leverage these AI/ML tools to explore both our curated and original datasets. The result will be a suite of datasets ready for AI/ML analyses.
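A minimal sketch of the kernel density reweighting idea on synthetic data (not the MISL implementation; the outcome is a toy stand-in): estimate the outcome density, then weight each sample by its inverse estimated likelihood during regression:

```python
# Inverse-likelihood reweighting for imbalanced regression (synthetic data).
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                       # toy exposures
y = 39 + X[:, 0] + rng.normal(scale=0.3, size=500)  # toy outcome

density = gaussian_kde(y)(y)                  # estimated likelihood of each y
weights = 1.0 / np.clip(density, 1e-6, None)  # rare outcomes count more
weights /= weights.mean()                     # keep the total weight stable

model = LinearRegression().fit(X, y, sample_weight=weights)
print("coefficients:", np.round(model.coef_, 2))
```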
NIEHS
AMBROSIO, FABRISIA
UNIVERSITY OF PITTSBURGH
Using Machine Learning and Artificial Intelligence Models to Predict Muscle Stem Cell Biological Age and Regenerative Potential
Studies of the parent grant focus on skeletal muscle trauma resulting from an injury or surgery, which typically results in significant functional declines in older adults. In advancing the aims of the parent grant, we performed single-cell level analysis at the transcript and protein levels to investigate the influence of age-related biophysical alterations on muscle stem cell (MuSC) fate. Data revealed that young MuSCs displayed resistance to alterations in substrate stiffness, and lineage progression was not altered. Aged MuSCs, on the other hand, were highly sensitive to extrinsic mechanical stimuli, displaying significantly delayed lineage progression with exposure to a stiff substrate. Whereas studies in the parent grant primarily focus on performing in vitro and in vivo assays to evaluate the direct effect of age-related extracellular matrix (ECM) alterations on MuSC fate and function, we found that escalating complexity and noise in the MuSC system over time impedes our ability to make accurate predictions of cell behavior using traditional methods. This supplement will allow us to expand our study and test whether biological data and domain knowledge relating to MuSC aging can be embedded in a framework of Bayesian optimization to elucidate mechanisms and accurately predict regenerative responses. This project aims to (1) prepare omics data for ML models, (2) perform benchmark ML modeling with Bayesian optimization, and (3) broaden approaches to ML modeling and broaden researcher engagement in the biology of aging. This work will support collaborative efforts that bridge expertise in stem cell biology and regenerative medicine together with machine learning and artificial intelligence to expand foundational studies that aim to understand fundamental factors driving the biological age of stem cells.
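As a rough illustration of the Bayesian optimization framing (a toy one-dimensional objective, not the project's biological models), a Gaussian process surrogate can propose each next experiment via an upper-confidence-bound rule:

```python
# Toy Bayesian optimization loop with a Gaussian process surrogate.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

def objective(x):
    # Stand-in for a measured regenerative response.
    return -((x - 0.6) ** 2)

X = rng.uniform(0, 1, (5, 1))  # initial experimental conditions
y = objective(X).ravel()
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(20):
    gp.fit(X, y)
    cand = rng.uniform(0, 1, (256, 1))
    mu, sd = gp.predict(cand, return_std=True)
    x_next = cand[np.argmax(mu + 1.96 * sd)]  # upper confidence bound
    X = np.vstack([X, x_next[None, :]])
    y = np.append(y, objective(x_next))

print(f"best condition found: {X[np.argmax(y)].item():.3f}")
```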
NIA
BAICKER, KATHERINE
NATIONAL BUREAU OF ECONOMIC RESEARCH
A Machine Learning Platform and Database Linking Digitized Electrocardiogram Waveforms with Hospital Electronic Health Records
This project creates a de-identified and highly curated dataset of digitized electrocardiogram (ECG) waveforms linked to other electronic health records (EHR) from a major research hospital. These records include elaborate clinical information on diagnoses, procedures, treatments, medications, vital signs, lab tests, physician notes, and outcomes over a 5-year longitudinal timeframe. The integrated ECG-EHR dataset will be housed in a secure, monitored cloud platform, and available for noncommercial research applications. An established legal framework is in place for sharing the dataset with a broad coalition of researchers and for using the data in cutting-edge research applications. The project supplements a large multi-component research grant on Improving Health Outcomes for an Aging Population (P01-AG05842), and more specifically, Project 4 of the parent grant on Assessing the Overuse and Underuse of Diagnostic Testing. The use of machine learning algorithms to improve health care decision-making is a major overarching theme of the overall parent grant, and Project 4 draws on advances in machine learning technology to gauge the extent of over- and under-use of diagnostic testing. The incremental database development proposed here builds directly from the work being conducted in Project 4. Machine learning algorithms have the potential to identify patterns from historical patient records and to use those patterns to improve the real-time decisions that physicians need to make when diagnosing and treating patients in clinical settings. The proposed database will be used initially to study the diagnostic and treatment decisions made for patients presenting with acute coronary symptoms in hospital emergency rooms. The data will be used, specifically, to predict adverse events like positive troponin and positive catheterization results from the digitized ECG waveforms. There is no existing equivalent to the platform and dataset we propose to assemble and make available to the scientific community.
NIA
BEDRICK, STEVEN
OREGON HEALTH & SCIENCE UNIVERSITY
Towards Automatic Transcription of Post-Stroke Disordered Speech
Standardized tasks like confrontation naming tests are widely used in the diagnosis and treatment of acquired neurogenic language disorders (e.g., aphasia) in clinical settings, and also play a vital role in speech-language pathology research more broadly. One of the primary goals of our parent grant (#1R01DC015999) is to automate the administration and scoring of such tests and enable professionals to assess aphasia in an objective, precise, efficient, and ecologically valid manner. However, current methods for generating test results involve detailed and careful analysis of manually transcribed speech samples. Such transcription is laborious, time consuming, and error prone and represents a significant barrier to the practical adoption of various automated tools and techniques for scoring and analyzing test results. Automated speech recognition (ASR) technology has the potential to mitigate this obstacle and could in theory be used to quickly produce accurate transcripts of patient responses during testing. Recent years have seen dramatic improvements in the “state-of-the-art” in ASR as applied to connected speech from neurotypical speakers; however, current “off-the-shelf” ASR tools struggle with disordered speech produced by individuals with neurological conditions. In this project, we will develop and make publicly available a high-quality set of transcriptions and annotations of already-existing recordings of speech by people with aphasia (Data Source: AphasiaBank; 4R01DC00852410). Our dataset will be explicitly designed to support the development of automated phoneme recognition systems that would be of use for our automated systems intended to analyze aphasic speech. Further, in order to demonstrate the use of the transformed data in an AI/ML application, we will build a baseline ASR system to automate the transcription of participant responses in a confrontation naming test scenario. We will use this system to empirically investigate the impact of increased amounts of data, and different transcription and annotation decisions, on transcription accuracy. Finally, to facilitate awareness and use of our dataset by the larger AI/ML community, we will organize a community-shared evaluation task focused on automated phonemic transcription of aphasic speech, to which we will invite participation from interested ASR researchers.
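Phonemic transcription accuracy of this kind is conventionally scored with an edit-distance error rate; the sketch below (toy ARPAbet-style sequences, not the project's scoring code) shows the computation:

```python
# Phoneme error rate via dynamic-programming edit distance.
def edit_distance(ref, hyp):
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[-1][-1]

ref = ["HH", "AE", "M", "ER"]  # "hammer", ARPAbet-style reference
hyp = ["HH", "AE", "N", "ER"]  # one substitution by the recognizer
print("phoneme error rate:", edit_distance(ref, hyp) / len(ref))  # 0.25
```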
NIDCD
BRINJIKJI, WALEED
MAYO CLINIC
Impact of Clot Histological and Physical Properties on Revascularization Strategies in Acute Ischemic Stroke
In the last four years, our group collected >1800 stroke emboli from centers across the United States and Canada as part of the Stroke Thromboembolism Registry of Imaging and Pathology (STRIP). The STRIP registry was our parent award’s Aim 1, and we have further elucidated the association between clot histology and revascularization outcomes, as well as between clot histology and imaging characteristics. The STRIP registry has also allowed us to uncover novel mechanisms in stroke thrombosis, catalyzing thrombectomy device research and thrombolysis-related research in our lab. We believe that the most impactful finding from our registry will be uncovering associations between the histopathology of retrieved emboli and stroke etiology. Developing tools to predict stroke etiology is important because nearly 40% of strokes are of unknown etiology. Determining stroke etiology (i.e., cardiac source versus large artery atherosclerosis) matters because secondary stroke prevention strategies are highly dependent on the determination of stroke etiology. However, when we performed superficial quantitative analyses examining the relationships between fibrin, platelet, WBC, and RBC density, our results were unrevealing. Still, we hypothesize that it remains feasible to predict stroke etiology from retrieved stroke emboli through deep learning and machine learning approaches. Machine learning and deep learning approaches can also aid in uncovering histological features associated with device and pharmacological failure related to stroke revascularization. Thus, the goals of this administrative supplement are to (1) allow for complete digitization and online archiving of our database of over 1,800 retrieved clot specimens, as well as all available anonymized clinical data from Aim 1 of our current R01, to facilitate deep learning and machine learning, and (2) perform deep learning on the whole-slide specimens from these patients to determine whether various deep learning and machine learning algorithms can predict stroke etiology based solely on the histological appearance of retrieved stroke emboli.
NINDS
CHENG, FEIXIONG
CLEVELAND CLINIC LERNER RESEARCH INSTITUTE
Using Artificial Intelligence for Alzheimer’s Disease Drug Repurposing
Although researchers have conducted more than 400 human trials for potential treatments of Alzheimer’s disease (AD) in the last two decades, the attrition rate is estimated at over 99%. Furthermore, the “one gene, one drug, one disease” reductionism-informed paradigm overlooks the inherent complexity of the disease and continues to challenge drug discovery for AD. The predisposition to AD involves a complex, polygenic, and pleiotropic genetic architecture. Recent studies have suggested that AD often has common underlying mechanisms and pathobiology, sharing intermediate endophenotypes with many other complex diseases. These endophenotypes—such as amyloidosis and tauopathy—have essential roles in many neurodegenerative diseases. Systematic identification and characterization of novel underlying pathogenesis and disease modules, more so than mutated genes, will serve as a foundation for generating actionable targets as input for drug repurposing and rational design of combination therapy in AD. Integration of the genome, transcriptome, proteome, and the human interactome using artificial intelligence (AI) and machine learning (ML) are essential for such identification. Given our preliminary results, we posit that AI/ML-based identification of novel risk genes and endophenotype network modules offer unexpected opportunities for drug repurposing and combination therapy design in AD compared to traditional single target approaches. To address the underlying hypothesis, we propose to establish an AI/ML-based multi-modal analytic framework to assemble available genetics, genomics, and transcriptomics data generated from NIA-funded AD genome sequencing projects and the AD Knowledge Portal for drug repurposing studies within scope of the parent award (#R01AG066707).
NIA
DEPP, COLIN A.
UNIVERSITY OF CALIFORNIA, SAN DIEGO
A Novel Dataset for Speech Analysis in Serious Mental Illness (Parent Study: Social Cognitive Biases and Suicide in Psychotic Disorders)
This administrative supplemental project is linked with the NIMH-supported R01MH116902, which focuses on social cognition and suicide in people with psychotic disorders. The central aim of the project is to create and share a novel database resource for multi-modal speech analysis from a unique, large, and diverse sample of audio-recorded, standardized social interactions. Social processes are a key dimension of dysfunction in serious mental illness. The linguistic and paralinguistic markers of impairment in serious mental illness and suicide are an active area of research, consistent with the parent study’s focus on social processes in suicide and psychosis. However, small datasets and a lack of standardization of data resources greatly hamper extant speech analysis work, preventing investigation of the mechanisms underlying observed patterns and of the robustness of models across potential sources of bias. In the proposed supplement, we will generate a new corpus of data from over 1,000 individuals who have standardized data for speech analysis, who are richly characterized, and who are diverse across key demographic variables, including minority status. Our focus is on the Social Skills Performance Assessment, the most widely used performance-based measure of social function, entailing expert-rated, audio-recorded, simulated social interactions that involve affiliative and confrontational scenarios. In the proposed supplement, we will generate sharable, de-identified transcripts for natural language processing, with additional annotation according to conventional natural language features as well as more novel dialogue actions. We will also create corresponding de-identified audio files that contain frequency, amplitude, and other paralinguistic annotations. This project extends and expands parallel work in aging research led by our new collaborator team. Indeed, sharable data from speech analysis derived from standardized testing has been highly impactful in other fields—including aging and dementia research—but currently no such data resource exists in large samples of people with mental illnesses. The aims of the project are to create, process, and annotate the dataset; generate toolkits and source code for analysis; examine new markers in relation to the parent study aims; and share the data with the scientific community via the NIMH National Data Archive.
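NIMH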
The goal of the parent grant is to understand the structure and use of the semantic system, its neural basis, and the ways in which it can be impaired by neuropathology. Conceptual or semantic knowledge refers to information about objects, actions, relations, self, and culture and includes meanings of words and phrases. Everyday activities—such as communication, recognition and use of objects, social interactions, and decision-making—are crucially reliant on conceptual knowledge. Impairment of this complex system has serious consequences for quality of life. Several neurological and psychiatric disorders are associated with the impairment of this system, including stroke with aphasia, dementias, temporal lobe epilepsy, schizophrenia, and autism. This project pertains to three neuroimaging datasets—two functional MRI (fMRI) and one from stroke survivors—that speak to the neural basis of concepts and language processing in the brain. The fMRI datasets measure brain activation in response to verbal and nonverbal concepts using multiple tasks. The stroke dataset contains lesion information and a number of behavioral tests, both standardized and novel, that assess linguistic and semantic deficits in survivors. We aim to transform these data and associated metadata to a “machine learning–ready” format and make them easily accessible. This will enable application of powerful modern machine learning techniques—such as deep learning—for their analysis. This may lead to development of new techniques—such as deep learning–based lesion-symptom mapping and multivoxel pattern analysis—that may provide richer insights into effects of lesions in the brain and functioning of the healthy brain. We expect that these analyses will shed new light on the organization and function of core semantic regions in the healthy and impaired brain.
NIDCD
FARRER, LINDSAY A.
BOSTON UNIVERSITY MEDICAL CAMPUS
A Computational Pipeline To Evaluate AI/ML Readiness in Digital Datasets in the Framingham Heart Study
This project is focused on improving the AI/ML-readiness of heterogeneous types of data (e.g., performance on cognitive tests captured by digital voice and digital pen, neuroimaging, biomarker, and multi-omics data) collected by the Framingham Heart Study Brain Aging Program (FHS-BAP) for studies of Alzheimer’s disease and related dementias. With the increased use of digital technologies to capture phenotypic data, machine learning systems continue to take on ever more central roles in addressing clinical questions, and the issue of ML reliability has become critical. While the software industry employs best practices to increase robustness in its product development lifecycle, a similar set of ground rules needs to be established to make data collected for ML clinical research studies more reliable. As such, robust data are key to creating reliable ML models. Our experience with FHS-BAP data revealed that, across thousands of participants, numerous issues arise related to missing data, artifacts, unformatted structures obtained from heterogeneous sources, and unfiltered and biased data. All these aspects directly influence data quality, and using such data without properly harmonizing them would lead to poorly generalizable ML models. There is an urgent need to make data “ML-ready” so that researchers do not have to contend with tedious and nontrivial data processing methods. Here, we describe our plan to build computational frameworks that can evaluate ML readiness in datasets and our intent to build software toolkits for public use.
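One check such a toolkit would automate is a per-column missingness and constancy report; the sketch below uses synthetic data and hypothetical column names:

```python
# Per-column data-quality report: missingness fraction and constant columns.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "voice_f0_hz": rng.normal(120, 15, 200),     # hypothetical feature
    "pen_latency_ms": rng.normal(450, 60, 200),  # hypothetical feature
    "site": ["FHS"] * 200,                       # constant column: flag it
})
df.loc[rng.choice(200, 40, replace=False), "pen_latency_ms"] = np.nan

report = pd.DataFrame({
    "missing_frac": df.isna().mean(),
    "n_unique": df.nunique(dropna=True),
})
report["flag"] = (report["missing_frac"] > 0.1) | (report["n_unique"] <= 1)
print(report)
```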
NIA
GILMORE, JOHN HORACE
UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL
Rescuing Missed Longitudinal MRI Scans in the UNC Early Brain Development Study
The first year of life is a period of rapid and dynamic structural brain development, and data from our cohort suggest that a large portion of individual differences in brain structure in 10- and 12-year-olds is already present in early childhood. Adolescence and puberty constitute the second major period of postnatal brain development associated with emerging risk for psychiatric disorders. The UNC Early Brain Development Study (EBDS) is a unique and innovative longitudinal study that has followed children with imaging and cognitive and behavioral assessments at birth and at 1, 2, 4, 6, 8, and 10 years. Four hundred and eighty-two children from this cohort are reaching adolescence, and we are now following these children at 12, 14, and 16 years of age. New knowledge gained in this study will provide a dramatically improved framework for understanding childhood brain development and identifying children at risk for subsequent psychiatric disorders. Deep learning–based machine learning (ML) methods require datasets with large sample sizes and complete data to successfully train on longitudinal data. Longitudinal MRI studies in young children typically suffer from significant missing data due to acquisition failure secondary to motion, and due to participant attrition. This lack of complete image data significantly hampers the application of machine learning to the EBDS dataset. We propose to impute and generate all missing EBDS timepoints of the structural T1- and T2-weighted longitudinal image data via a multi-modal, multi-timepoint image prediction network. This includes cross-modality data generation (generating missing MRI data from existing MRI data at the same timepoint) where available, as well as multi-timepoint imputation of longitudinal data (generating missing MRI data from existing MRI data at different timepoints). Imputed and generated MRI images will then be processed to compute missing brain measurements (i.e., cortical thickness, surface area). This imputation will increase the training data by over 100% and significantly improve longitudinal ML studies of the EBDS dataset. The imputed MRI images, the estimated validity of the imputed MRI images, intermediate representations from the imputed images (e.g., label maps, surfaces), and morphometric measures will be shared via NDA, alongside the trained imputation network for use by others.
NIMH
GUPTON, STEPHANIE
THE UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL
Collaboration to Improve the AI/ML-Readiness of Plasma Membrane Remodeling Data
The data acquired under the parent award describe fundamental cellular behavior of the developing neuron, namely exocytic vesicle fusion with the plasma membrane and membrane remodeling. Using fluorescent probes in live-cell imaging, we visualize exocytic events in developing neurons and have generated fully automated, unbiased computer-vision tools for the detection and analysis of these events. Our current analysis process involves several software applications—including ImageJ, MATLAB, and R—and requires manual inspection and transfer of data between applications. Although this method has proved valuable, challenges arise in tracking data transformations and performing quality control. To overcome these limitations, we are developing an integrated processing pipeline in ImageTank, a graphical programming language. ImageTank combines image visualization with numerical methods to perform all of the processing steps required to extract data for machine learning–based classification. Our goal is to improve the efficiency of this process, to make quality control and method development more robust, and to simplify the process so that it is more accessible to other researchers.
The parent award (R01NS119825, “PREcision Care In Cardiac ArrEst – ICECAP” [PRECICECAP]) aims to discover novel biomarker signatures of post-cardiac arrest brain injury and extracerebral organ failure that predict treatment responsiveness and long-term recovery. We will train auto-encoder neural networks using high-resolution multi-modal data—including cardiopulmonary waveforms, neuroimaging, and electroencephalography—acquired during intensive care unit (ICU) monitoring. Similar data sources are ubiquitous in modern medicine but remain underutilized in clinical care and research due to their complexity. This supplement supports development of freely available software to facilitate curation of ICU-acquired waveform data, with special attention to neurocritical care, as well as preparation and dissemination of an AI/ML-ready dataset from PRECICECAP. We will create a series of modular functions that permit users to graphically construct processing pipelines, maximizing automation where appropriate and allowing facile interaction with data when needed. Modules will be designed to combine synergistically, maximizing efficiency and reproducibility, while allowing flexibility to meet project-specific objectives. Key modules will allow harmonization of heterogeneous data sources commonly encountered in multicenter research; visualization and annotation of ICU waveform data; and feature creation. These will be made freely available to the research community and can be combined into pipelines using a cloud-based graphical user interface.
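As an example of one harmonization step such a module might perform (toy signal and hypothetical rates), a waveform can be moved to a common sampling grid with an anti-aliased polyphase resampler from SciPy:

```python
# Resample a toy ICU waveform from 240 Hz to a common 125 Hz grid.
import numpy as np
from scipy.signal import resample_poly

fs_in, fs_out = 240, 125
t = np.arange(0, 10, 1 / fs_in)               # 10 s of signal
abp = 90 + 20 * np.sin(2 * np.pi * 1.2 * t)   # toy arterial-pressure wave

# resample_poly(x, up, down) performs rational-rate conversion with
# a built-in anti-aliasing FIR filter.
abp_125 = resample_poly(abp, up=fs_out, down=fs_in)
print(len(abp), "->", len(abp_125))           # 2400 -> 1250
```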
NINDS
INDZHYKULIAN, ARTUR
MASSACHUSETTS EYE AND EAR INFIRMARY
Cross-Modality Imaging Data Annotations for Deep Learning–Based Analysis Solutions in the Auditory Field
The parent award generates a plethora of 2-D and 3-D imaging data of the sensory cells of the inner ear—the hair cells—across three different imaging techniques: (1) confocal microscopy, (2) scanning electron microscopy, and (3) focused ion-beam scanning electron microscopy. These imaging techniques are common across the hearing field, and the resulting images are traditionally labor intensive to analyze. This supplemental award aims to encourage the development of new automated tools to analyze such images by providing annotated, deep learning–ready datasets accompanied by exemplar pre-trained deep learning–based models. We will carefully choose images to annotate that are representative of the field and diverse, with the expectation that models trained on this dataset will generalize well. When possible, we will supplement our data with community-provided exemplar datasets to increase the generalizability of resulting AI/ML-based solutions. We will carefully document and publicly release all images and annotations in formats common to the deep learning field, ensuring ease of access and streamlining their application.
The goal of the parent study (R33AG068931) is to create a comprehensive research repository of aging trajectory datasets and to demonstrate their utility for aging research through: (1) harmonizing and merging multiple datasets to generate the data infrastructure needed to understand change over time in care settings, geriatric syndromes, physical functioning, and shared risk factors at multiple levels and across multiple domains; (2) developing state-of-the-art analytic methods to identify patterns of aging trajectories experienced by older adults during the final years of life and their association with shared risk factors and outcomes; (3) using both model-based approaches and machine learning (ML) algorithms to predict trajectories and outcomes of interest; and (4) disseminating resources generated, including datasets, documentation, source code, and methodology. The AI/ML-readiness supplement aims to develop and implement code for data pre-processing, data fine-tuning and precision, missing data imputation, data connectivity, and fully established hierarchical relationships for the AI/ML framework to interactively model late-life aging trajectories and selected outcomes in a cohort of Medicare beneficiaries within the environment of the Centers for Medicare & Medicaid Services (CMS) Virtual Research Data Center (VRDC). Unique features of this project include creating reweighted race and ethnicity variables that incorporate self-reported race and multiple racial and ethnic categories, as well as extracting features from irregularly spaced individual aging trajectories. Guided by the Government Alliance for Racial Equity (GARE) framework, the proposed supplement project will normalize, organize, and operationalize racial equity throughout data pre-processing and integration. Through ongoing and new collaborations, we will work with and engage the AI/ML community to address ethical issues, including eliminating biases in datasets, algorithms, and applications; considering impacts and unintended consequences for disadvantaged or marginalized groups and health disparities; and contributing to the development of best practice guidelines arising from community consensus and work groups. Completion of this work will contribute to the NIH vision of a modernized and integrated biomedical data ecosystem that adopts the latest data science technologies and best practice guidelines, including FAIR (findable, accessible, interoperable, reusable) principles and open-source development.
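To make the idea of feature extraction from irregularly spaced trajectories concrete, here is a minimal pandas sketch; the variables (`age`, `adl_score`) and the features computed are hypothetical and do not reflect the project's actual CMS/VRDC code.

```python
# Minimal sketch: summarize an irregularly spaced individual trajectory
# into fixed-length features usable by AI/ML models.
import numpy as np
import pandas as pd

visits = pd.DataFrame({
    "age": [79.1, 79.9, 81.4, 82.0],          # irregular observation times
    "adl_score": [6.0, 5.0, 5.0, 3.0],         # functional measure per visit
})

def trajectory_features(df, t="age", y="adl_score"):
    slope = np.polyfit(df[t], df[y], deg=1)[0]  # per-year rate of change
    return {
        "baseline": df[y].iloc[0],
        "last": df[y].iloc[-1],
        "slope_per_year": slope,
        "mean_gap_years": df[t].diff().mean(),   # summary of irregular spacing
    }

print(trajectory_features(visits))
```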
NIA
KENNEDY, RICHARD E.
THE UNIVERSITY OF ALABAMA AT BIRMINGHAM
De-identified Delirium Data: Finding Delirium to Study Delirium
Delirium, or acute confusional state, affects 30% to 40% of hospitalized older adults and is associated with increased risk of death, functional decline, and long-term cognitive impairment. Because up to 75% of cases are not recognized by providers, there is a critical need for advanced methods to identify delirium for clinical and research purposes. Computational methods—such as natural language processing (NLP) and machine learning (ML)—have the potential to automate identification of delirium from electronic health record (EHR) notes but are hampered by a lack of suitable data for algorithm development. We will leverage the systematic delirium screening available through The University of Alabama at Birmingham (UAB) Virtual Acute Care for Elders (ACE) quality improvement program to create and release a de-identified delirium dataset that addresses the data and diagnosis gap in epidemiological studies of delirium. Our Virtual ACE program has determined delirium status for more than 33,000 patients across a six-year period, providing a rich set of data from which this project will draw. We will develop and refine our transfer learning–based de-identification method to help annotators de-identify clinical text more rapidly, opening the door to larger, faster, and more widely available dataset releases. We will apply this de-identification method to clinical notes from 3,000 patients in our Virtual ACE database, half with and half without delirium episodes during their hospitalization. This delirium dataset, containing de-identified clinical notes and associated structured data, will be, to our knowledge, the only text-inclusive corpus specifically for the study of delirium. The dataset will be available for download on PhysioNet with a data use agreement (DUA) to facilitate further development of NLP and ML approaches for determining delirium status, risk factors, and sequelae at other institutions and in other populations via transfer learning. Release of our de-identification algorithms and methodology will also facilitate the development and release of large-scale text corpora for other disorders.
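As a toy stand-in for the transfer learning–based de-identification described above, the following sketch masks entity spans detected by a generic public NER model from the Hugging Face transformers library; a real PHI system would use a purpose-built clinical model and human review, and the model name here is only an illustrative public checkpoint.

```python
# Toy de-identification sketch: mask named-entity spans in a clinical note.
from transformers import pipeline

ner = pipeline("token-classification",
               model="dslim/bert-base-NER",   # illustrative public model
               aggregation_strategy="simple")

def mask_entities(note: str) -> str:
    """Replace detected entity spans with bracketed category tags."""
    spans = sorted(ner(note), key=lambda s: s["start"], reverse=True)
    for s in spans:
        note = note[: s["start"]] + f"[{s['entity_group']}]" + note[s["end"]:]
    return note

print(mask_entities("Mr. John Smith was admitted to UAB Hospital on 3/14."))
```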
NIA
KESSELMAN, CARL
UNIVERSITY OF SOUTHERN CALIFORNIA
Improving AI/ML-Readiness of FaceBase Research Datasets
Facial image analysis has received intense interest from the artificial intelligence (AI) and machine learning (ML) community as a way to study the correlation between genotype and phenotype. One of the most important facial image resources is the FaceBase dataset, which contains over 22,000 facial images labeled with clinical and genomic diagnoses. While FaceBase embraces the FAIR (findable, accessible, interoperable, and reusable) principles, there are unique concerns specific to AI/ML research, including the presence of noise, uncertainty of labels, and bias within the dataset. In this project, we propose to unlock the tremendous potential of FaceBase facial scans by identifying gaps in how the data are characterized, formatted, and preprocessed from the perspective of their use in AI/ML research. By developing a prototypical AI/ML pipeline and designing a revised data submission pipeline for human subjects, we aim to drive the evaluation of FaceBase datasets for ML-readiness and ultimately make the dataset more broadly useful to AI/ML researchers.
NIDCR
LARSCHAN, ERICA NICOLE
BROWN UNIVERSITY
Integrating Experimental and Computational Methods to Understand How Gene Expression is Regulated in Early Development
Every complex organism begins as a single cell that divides and eventually differentiates into a vast array of tissues with highly specialized functions. Much like actors in a play, an elaborate cast of genes must be expressed in a precisely choreographed sequence in order for differentiation to occur correctly. This cooperative process, known as coordinate gene regulation, is achieved through the actions of numerous regulatory factors. These factors are therefore of critical importance, and their disruption leads to a host of pathological conditions ranging from developmental disorders to cancer. Using Drosophila as a model organism, we are investigating the mechanisms by which genes are expressed and regulated in both a spatial and temporal manner. To gain insights into these mechanisms, our approach incorporates next-generation sequencing technologies together with machine learning algorithms. In particular, we use single-cell RNA-sequencing (scRNA-seq) to interrogate gene expression in developing Drosophila embryos at various time points. To measure the presence of regulatory factors—such as histone modifications—we use chromatin immunoprecipitation sequencing (ChIP-seq), and to determine DNA accessibility, we use the assay for transposase-accessible chromatin (ATAC-seq). Finally, we use chromatin conformation capture technologies (Hi-C or Micro-C) to measure chromatin co-localization frequency, which is a proxy for the spatial organization of the genome. We integrate these data with a machine learning model that we previously developed, called the Graph Convolutional Model for Epigenetic Regulation of Gene Expression (GC-MERGE). GC-MERGE uses a class of computational methods known as neural networks and embeds the data into a graph-based representation to better describe the underlying spatial–relational factors regulating gene expression. We couple this with an interpretation technique for graph neural networks to elucidate specific genomic regions and regulatory factors that drive the expression of specific genes of interest. By combining both experimental technologies and computational modeling, we can therefore obtain a deeper understanding of coordinated gene regulation. In turn, this knowledge will help us to infer how gene misregulation leads to the development of chronic diseases, as well as to discover new therapeutics to address them.
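To make the graph-convolution idea concrete, here is a generic NumPy sketch in the spirit of such models—nodes are genomic bins carrying epigenetic signals, edges come from chromatin contacts—though it is not the published GC-MERGE implementation.

```python
# Generic graph convolution: aggregate neighbor features via the
# degree-normalized adjacency, then apply a learned projection.
import numpy as np

def gcn_layer(A, H, W):
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt     # symmetric normalization
    return np.maximum(0.0, A_norm @ H @ W)       # ReLU activation

rng = np.random.default_rng(0)
n_bins, n_marks, n_hidden = 5, 3, 8
A = (rng.random((n_bins, n_bins)) > 0.6).astype(float)  # Hi-C contact graph
A = np.triu(A, 1); A = A + A.T                           # symmetric, no self-loops
H = rng.random((n_bins, n_marks))    # e.g., ChIP-seq/ATAC-seq signal per bin
W = rng.normal(size=(n_marks, n_hidden))
H1 = gcn_layer(A, H, W)              # node embeddings for downstream prediction
```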
The inner ear is one of the most elegant and delicate parts of the human anatomy and is responsible for several physiological processes, including the transduction of acoustic signals into the electrochemical impulses that elicit our sense of hearing. In mature humans, there are only 3,500 inner hair cells (IHCs) and 12,000 outer hair cells (OHCs), responsible for signal transduction and acoustic wave amplification, respectively. In individuals with sensorineural hearing loss (SNHL), however, these sensory hair cells of the inner ear are lost, ultimately resulting in hearing loss. While attempts have been made to convert supporting cells into hair-like cells, there is often a lack of data showing that these reprogrammed cells resemble those of the inner ear. This is partly because of a lack of available transcriptomic data that can highlight differentially expressed genes between IHCs and OHCs. Thus, to support future downstream analysis and applications, we aim to create a conditional generative adversarial network (GAN) that has the potential to mimic, but not replicate, elements of the transcriptome that are differentially expressed between these two rare cell types. Because of the scarcity of these cells and the limited amount of available data, special consideration will be given to the collection, preprocessing, and validation of our model. Specifically, data will be integrated from various studies, and batch corrections will be made to mitigate the bias that arises during this process. Moreover, differentially expressed genes will be identified from single-cell RNA-sequencing (scRNA-seq) studies and will be further utilized to distinguish among IHCs, OHCs, and other cell types present in bulk RNA-seq experiments. The batch corrections will be evaluated through several means, including kBET, t-SNE, and ASW. Finally, linear interpolation between pair-wise combinations of datasets from scRNA-seq will be used to artificially create more inputs for the GAN. After training the GAN, the in silico generated cells will be compared to real cells by t-SNE and MMD calculations, and a random forest classifier will be applied to try to distinguish between the generated and real cells. In summary, this project will focus on integrating datasets and mitigating bias to drive the creation of machine learning models that can facilitate robust downstream analysis and increase experimental reproducibility. Ultimately, by modeling these rare cells of the inner ear, we can augment existing data to validate cellular reprogramming experiments.
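Two of the steps above—interpolation-based augmentation and the random forest real-versus-generated test—can be sketched as follows with scikit-learn, using random placeholder matrices rather than actual expression data.

```python
# Sketch: augment scarce profiles by linear interpolation between pairs,
# then test whether generated cells are distinguishable from real ones.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
real = rng.normal(size=(60, 200))              # cells x genes (placeholder)

def interpolate_pairs(X, n_new):
    """Create synthetic inputs on segments between random pairs of cells."""
    i = rng.integers(0, len(X), n_new)
    j = rng.integers(0, len(X), n_new)
    a = rng.random((n_new, 1))
    return a * X[i] + (1 - a) * X[j]

generated = interpolate_pairs(real, n_new=60)  # stand-in for GAN output

# If the classifier performs near chance (~0.5 accuracy), the generated
# cells are hard to distinguish from real ones.
X = np.vstack([real, generated])
y = np.array([0] * len(real) + [1] * len(generated))
acc = cross_val_score(RandomForestClassifier(n_estimators=200), X, y, cv=5)
print("real-vs-generated accuracy:", acc.mean())
```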
NIDCD
MAINSAH, BOYLA
DUKE UNIVERSITY
An Open Source and Machine Learning–Ready P300-based Brain–Computer Interface Dataset
Brain–computer interfaces (BCIs) offer solutions for replacing neural output that has been lost to injury or disease, for example in individuals with late-stage amyotrophic lateral sclerosis (ALS). BCIs process signals associated with brain activity acquired in real time—such as electroencephalography (EEG) signals—to extract relevant neural information that conveys a user’s intent and translate this information into commands for controlling a device, such as a prosthetic arm or a spelling board. Current BCIs have relatively low communication rates due to the limitations of processing inherently noisy data and highly variable neural signal components to extract the information needed to control the BCI. Thus, improving BCI communication efficiency is an area of significant AI/ML research interest. Part of the development process for any BCI algorithm involves performing simulations with EEG data collected from previous BCI studies to pre-assess various paradigms, AI/ML models, or strategies under consideration and to select promising candidates for real-world testing. Acquiring BCI data is time-consuming and expensive; thus, most BCI research groups rely on publicly available datasets rather than collecting data in-house. However, current publicly available BCI datasets have a limited number of participants, under-represent target BCI end users, and mostly use proprietary file formats. There is also a lack of serial data, collected over several hours and days of BCI use, that is needed for the long-term evaluation of AI/ML algorithms. Drawing on more than 10 years of NIH-supported research, we have acquired a large amount of single- and multi-session data from P300-based BCI speller studies with able-bodied individuals and individuals with ALS under a wide range of experimental conditions. Guided by FAIR principles, we will perform data curation, data cleaning, and data engineering to transform proprietary data files into an open, nonproprietary file format; package the transformed files into a machine-readable dataset with metadata and documentation; and make this transformed dataset publicly available via an open-source repository. We will demonstrate the usability of this BCI dataset in (1) an AI/ML application focused on developing robust data representations that mitigate the negative effect of variability in EEG data on AI/ML algorithms and (2) student research programs focused on skill development in data science and AI/ML.
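A minimal sketch of the repackaging step, assuming h5py and an HDF5 container with JSON metadata; the project's actual target format and schema are not specified here, and the channel names and rates below are placeholders.

```python
# Repackage a session into an open, self-describing format (illustrative).
import json
import numpy as np
import h5py

eeg = np.random.randn(8, 256 * 60).astype("float32")  # channels x samples
meta = {
    "sampling_rate_hz": 256,
    "channel_names": ["Fz", "Cz", "Pz", "Oz", "P3", "P4", "PO7", "PO8"],
    "paradigm": "P300 speller",
    "session": 1,
}

with h5py.File("bci_session01.h5", "w") as f:
    dset = f.create_dataset("eeg", data=eeg, compression="gzip")
    dset.attrs["metadata_json"] = json.dumps(meta)  # machine-readable metadata

with h5py.File("bci_session01.h5") as f:            # round-trip check
    restored = json.loads(f["eeg"].attrs["metadata_json"])
    assert restored["sampling_rate_hz"] == 256
```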
NIDCD
MARSDEN, ALISON L.
STANFORD UNIVERSITY
An AI-Ready Vascular Model Repository for Modeling and Simulation in Cardiovascular Disease
In this project, we will ready the Vascular Model Repository (VMR) for use in artificial intelligence (AI) and machine and deep learning (ML/DL) applications. The VMR is an open database of medical image data, segmented vascular models, and blood flow simulation results developed with support from the National Library of Medicine. We will use the VMR to support two major AI/ML efforts in the community. First, we will greatly accelerate anatomic model construction by using DL for image segmentation, overcoming a major bottleneck to studies with large cohorts of patients. Second, we will develop physics-informed ML methods to drastically reduce currently lengthy simulation times and provide fast, interactive feedback for surgical and interventional planning. This will produce fast, deployable “digital twins” that can provide interactive feedback to clinicians; these methods will directly support the parent proposal and be of general interest to the field. The supplement proposal contains three specific aims: (1) ready the VMR for AI/ML use by researchers who are not domain experts and spark interest in cardiovascular applications within the ML community; (2) run community challenges in image segmentation and physics-informed ML at a major international meeting to identify best-in-class ML/DL methods; and (3) demonstrate ML/DL methods in interactive surgical planning applications by integrating efforts with the parent proposal. We will disseminate our findings, methods, data, and source code to the research community via the open-source SimVascular software project and the open-data VMR. Our team brings together expertise in cardiovascular biomechanics and finite element modeling, computer graphics and reduced-order modeling, physics-informed ML, and the development of open-source and open-data resources for the scientific community.
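Physics-informed ML augments a data-fitting loss with a penalty on the residual of a governing equation. The following PyTorch sketch uses a trivial 1-D constraint (du/dx = 0) as a stand-in for the actual cardiovascular flow equations; it is illustrative only.

```python
# Schematic physics-informed loss: data mismatch + governing-equation residual.
import torch

def physics_informed_loss(model, x_data, u_data, x_colloc):
    u_pred = model(x_data)
    data_loss = torch.mean((u_pred - u_data) ** 2)

    x = x_colloc.clone().requires_grad_(True)      # collocation points
    u = model(x)
    du_dx = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    physics_loss = torch.mean(du_dx ** 2)          # residual of du/dx = 0

    return data_loss + physics_loss

model = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                            torch.nn.Linear(32, 1))
x_d, u_d = torch.rand(16, 1), torch.ones(16, 1)    # placeholder training data
loss = physics_informed_loss(model, x_d, u_d, torch.rand(64, 1))
loss.backward()
```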
NIBIB
MASTERS, COLIN
UNIVERSITY OF MELBOURNE
Democratizing Machine Learning for Researchers Working in Alzheimer’s Space
The parent project combines data from five leading international cohorts to establish and validate the impact of demographics, genotype, and comorbidities on the onset and progression rates of Alzheimer’s dementia (AD). One of the main outcomes is a unique dataset comprising longitudinal data from over 700 participants observed at three or more time points, constituting the largest available AD-related cohort to date. To facilitate the use of this dataset for AI/ML analyses, this proposal seeks to extend the parent project in three major ways: (1) expanding the types of harmonization performed, so that the granularity of the data is kept at high resolution rather than reduced through the use of summary statistics; (2) improving missing data imputation by using methods that allow the incorporation of prior knowledge of disease processes; and (3) generating FAIR-curated and comprehensively documented versions of the datasets, along with open-source software to reproduce the novel cohort construction in line with user-specific requirements. As a result, we hope to remove most artefacts likely to bias downstream analyses when using highly expressive AI/ML models. This is especially important for cases where data are examined at a higher resolution than simple summary statistics. The proposal will ensure that the combined dataset established by the parent project can be used by both clinical researchers and AI/ML practitioners to investigate scientific hypotheses using modern machine learning methodologies, while lowering the risk that the outputs of such methods are biased or uninformative. The ultimate aim of this work is to drive innovation in the application of AI/ML to AD and improve our ability to identify individuals with preclinical AD.
NIA
MCALLISTER, TARA
NEW YORK UNIVERSITY
PERCEPT: A Database of Clinical Child Speech for Automatic Speech Recognition and Classification
Speech sound disorder in childhood poses a barrier to academic and social participation, with potentially lifelong consequences for educational and occupational outcomes. Treatment of speech sound disorder could be enhanced through the development of tools incorporating artificial intelligence and machine learning (AI/ML). For example, applications with automated scoring of speech sounds could in principle be used to augment clinician services and achieve higher-intensity practice for faster progress in therapy. However, no computerized treatment to date has demonstrated sufficient speech processing accuracy for clinical use with children. This project will therefore curate and label child speech data collected from a decade of clinical trials focusing on speech sound disorder intervention, generating a corpus (PERCEPT) to overcome a fundamental barrier to accurate clinical AI/ML: a paucity of speech samples from children with atypical speech on which to train AI/ML algorithms. The PERCEPT corpus will contain audio samples of child speech paired with human-generated ground-truth labels of phoneme accuracy, as well as speaker and transcript metadata. Following corpus development, we will demonstrate the utility of this corpus by training a neural network to classify speech sounds as correct or incorrect. This corpus will meet a critical need for the continued development of speech technology for child speakers and individuals with communication disorders.
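A toy sketch of the stated classification task, assuming scikit-learn and random placeholder acoustic features in place of PERCEPT data:

```python
# Binary correct/incorrect classification of a produced speech sound.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 39))        # e.g., MFCC + delta features (placeholder)
y = rng.integers(0, 2, size=500)      # 1 = perceptually correct production

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```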
NIDCD
MESSINGER, DANIEL S.
UNIVERSITY OF MIAMI
Harnessing Multimodal Data To Enhance Machine Learning of Children’s Vocalizations
We will implement a multi-modal data pipeline to support machine learning of child language production in complex naturalistic environments. The parent R01 award (DC018542) objectively measures the vocal interactions of children with hearing loss (HL) over time in preschool classrooms. Even with cochlear implantation, HL is a life-altering condition with high social costs. Inclusion of children with HL alongside typically hearing (TH) peers in preschool classrooms is a national standard. But how does early vocal interaction of children with HL and their TH peers contribute to their language development? The parent R01 employs top-down computational models of child location and orientation to indicate when children are in social contact with their peers and teachers. Machine learning provides a supplementary bottom-up strategy for pursuing the broad R01 goals of identifying the interactive contexts in which children produce phonemically complex vocalizations and interactive speech. Machine learning algorithms can determine the contextual, individual, and interactive factors that predict children’s vocalizations and vocal interactions. To facilitate machine learning of classroom data, a rigorous diarization process is required to determine speaker identity—the likelihood that each vocalization was spoken by a given child or teacher. We will integrate audio processing of each target child’s and teacher’s first-person audio recording with processing of the recordings of their interactive partners. The influence of these partner recordings will be weighted by their physical distance from, and orientation to, the target. This will yield a weighted speaker identification score for each vocalization. For 25% of the sample, the algorithmic score will be compared to speaker identification provided by trained coders. We will process 7,160 hours of multi-modal recordings of child and teacher classroom movement synchronized with continuously recorded, child- and teacher-specific (first-person) audio. De-identified output data will characterize vocalizations via algorithmically computed speaker identification probabilities, coder-identified speaker identity, phonemic complexity and audio characteristics (e.g., fundamental frequency), the position and relative orientation of all individuals in the classroom, and child demographics (including HL information). Output data, Python code, and metadata descriptions will be disseminated via dedicated distribution portals. Recordings will be released to certified investigators via NIH-funded repositories.
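One plausible form of the distance- and orientation-weighted score, sketched in NumPy; the actual weighting function used by the project is not specified here, so the formulas below are assumptions for illustration.

```python
# Combine audio-based speaker likelihoods with spatial weights into a
# single speaker identification score per vocalization.
import numpy as np

def speaker_weight(distance_m, rel_orientation_rad):
    """Closer and more directly oriented partners contribute more."""
    proximity = 1.0 / (1.0 + distance_m)
    facing = (1.0 + np.cos(rel_orientation_rad)) / 2.0  # 1 = facing, 0 = away
    return proximity * facing

candidates = ["child_A", "child_B", "teacher"]
audio_likelihood = np.array([0.55, 0.30, 0.15])   # from per-person recordings
weights = np.array([speaker_weight(0.4, 0.2),
                    speaker_weight(2.5, 1.8),
                    speaker_weight(1.0, 0.5)])
score = audio_likelihood * weights
score /= score.sum()                               # normalized identification score
print(dict(zip(candidates, np.round(score, 3))))
```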
NIDCD
O’BRYANT, SID E.
UNIVERSITY OF NORTH TEXAS HEALTH SCIENCE CENTER
Improving AI/ML-Readiness of Data Generated from HABLE or Other NIH-Funded Research
Data collected and shared in the biorepository of our parent project—the Health and Aging Brain among Latino Elders study, or HABLE—continue to expand rapidly. However, artificial intelligence (AI) and machine learning (ML) cannot extract full value from these data for biomedical research until the data are AI/ML-ready. There is therefore an urgent need to establish effective AI/ML-readiness for HABLE and other NIH-funded data-sharing projects. This supplemental project will focus on three critical and common areas to improve the AI/ML-readiness of data generated from our parent project: missing data imputation, feature selection and outlier removal, and data readiness reporting. We will address the following three specific aims: (1) develop an ML-based multiple imputation method for handling missing data; (2) develop a Recursive Feature Elimination with Cross-Validation (RFE-CV) algorithm for feature selection and outlier removal; and (3) develop an integrated tool to report data readiness. The algorithms and tools from this project will be the first of their kind to report data readiness for NIH data-sharing projects, facilitating heterogeneous data handling and feature engineering for AI/ML and helping data scientists improve their AI/ML modeling more effectively and with less effort. The administrative supplement will benefit not only the parent HABLE project but also other NIH-funded data-sharing projects. We expect that, with the development of these algorithms and tools, we will achieve high data readiness in the HABLE project, which will ultimately make HABLE more innovative in developing state-of-the-art methods for Alzheimer’s disease (AD) clinical trials, leading to the development of effective personalized treatments that slow the progression of AD or prevent it altogether.
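Aim 2's RFE-CV step corresponds closely to scikit-learn's `RFECV`; a minimal sketch on synthetic data follows (the project's own algorithm may differ in estimator, scoring, and outlier handling).

```python
# Recursive feature elimination with cross-validation on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=25, n_informative=5,
                           random_state=0)
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5)
selector.fit(X, y)
print("selected features:", np.flatnonzero(selector.support_))
```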
NIA
PAKHOMOV, SERGUEI V.S.
UNIVERSITY OF MINNESOTA
Improving Interoperability and Reusability of Existing Datasets Used To Study Dementia
In this project, we plan to engage the AI/ML community that uses speech and language characteristics as diagnostic or prognostic biomarkers of dementia to develop a comprehensive but parsimonious data model representing all data relevant to AI/ML studies of dementia available in three datasets: the Carolina Conversations Collection, the Wisconsin Longitudinal Study, and DementiaBank (Pitt corpus). We will also create an open-source toolkit to ingest the original data available from these datasets and convert them into a database based on the developed data model. The toolkit will include pipelines to process the text of transcripts and the waveform data from audio recordings; standardize the available timestamps to represent the same units (e.g., tokens, utterances); and filter the transcripts and audio based on desired parameters (e.g., remove or keep speech and/or non-speech noise, filled pauses [um’s and ah’s], other disfluencies [e.g., fragments, repetitions, repairs], and transcriptionist notes [e.g., {unintelligible}]). Any available metadata for the audio–transcript pairs (e.g., demographics, diagnoses, neurocognitive and other test scores) will be verified, aligned, and added to the resulting database. We also plan to develop a process and supporting software for users of the toolkit to create and publish machine-readable manifests that store the criteria by which data samples were selected, the actual sample identifiers, and how the samples were pre-processed, so that other researchers can use these manifests to re-create the conditions of prior experiments and replicate the findings. As a result of this work, the research community working on speech and language characteristics in dementia will be able to obtain heterogeneous datasets from their original sources and transform them into interoperable data. AI/ML research and development conducted on these data will also become easier to evaluate, reproduce, and compare to determine state-of-the-art approaches.
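A toy version of one filtering step—removing filled pauses and transcriptionist notes while recording what was removed, as a manifest would require—might look like this in plain Python; the token list and note syntax are illustrative assumptions.

```python
# Filter a transcript and keep a record of removals for a manifest.
import re

FILLED_PAUSES = {"um", "uh", "ah", "er", "hm"}

def filter_transcript(text):
    removed = {"notes": re.findall(r"\{[^}]*\}", text)}
    text = re.sub(r"\{[^}]*\}", "", text)            # {unintelligible} etc.
    kept, dropped = [], []
    for token in text.split():
        if token.strip(",.").lower() in FILLED_PAUSES:
            dropped.append(token)
        else:
            kept.append(token)
    removed["filled_pauses"] = dropped
    return " ".join(kept), removed

clean, manifest = filter_transcript("Well um I went {unintelligible} home, uh, later.")
print(clean)     # -> "Well I went home, later."
print(manifest)  # record for the machine-readable manifest
```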
NIA
PALMER, ABRAHAM A.
UNIVERSITY OF CALIFORNIA, SAN DIEGO
Making Data from the Center for Genome Wide Association Studies in Outbred Rats FAIR and AI/ML-Ready
The NIDA Center for GWAS in outbred rats has collected data on more than 8,000 outbred rats, including phenotypes, genotypes, gene expression, microbiome, and metabolome. The goal of this supplement is to make these data FAIR and ready for AI/ML applications. Data include genotypes at millions of SNPs, complex behavioral and physiological phenotypes, and RNA-seq, ATAC-seq, microbiome, and metabolomic data. While the center is focused on traits relevant to substance use, these datasets are much more broadly applicable: they include other behavioral traits relevant to all fields of neuroscience and physiological traits relevant to numerous organ systems and diseases. These data have been carefully curated, including numerous human and automated quality control steps, and are organized as data types available for each unique individual. However, there is no public-facing description of the data, and no effort has yet been put into making them AI/ML-ready. We will bring together a team with expertise in the dataset, in best practices for information sharing, and in AI/ML for genetic applications to identify the most important and addressable shortcomings. We will then begin to address these goals, meeting frequently to monitor progress and overcome unanticipated challenges. Finally, as the work is completed, we will perform simple AI/ML exercises to confirm the improvements and revise our goals. Over the course of one year, we anticipate establishing a website and making all of our data findable. We will use protocols.io to describe each data type, assign RRIDs to each cohort, deposit data in public repositories, and follow best practices to make all data FAIR and AI/ML-ready. This supplement will provide the impetus and funding to bring together an outstanding team to ensure that NIH’s investment in this database can be used by cutting-edge approaches. This project is within the scope of the parent award but does not duplicate any work already supported by the parent grant.
NIDA
PROMISLOW, DANIEL EDWARD
UNIVERSITY OF WASHINGTON
Development and Use of an AI/ML-Ready Dog Aging Project Dataset
The parent grant (U19AG057377) funds the Dog Aging Project (DAP), a community science project designed to identify the biological and environmental determinants of variation in healthy aging among companion dogs and to test our ability to improve healthy lifespan in companion dogs. This long-term longitudinal study currently includes over 30,000 participating dogs, providing an extraordinarily rich dataset with which to explore questions about aging in dogs. These data include annual owner-reported survey data, electronic medical records, and environmental information, such as air quality, weather patterns, and neighborhood type. In addition, low-pass whole genome sequencing is available for 10,000 of these dogs, and biospecimens from more than 1,000 companion dogs provide diverse clinical and high-dimensional omic data. The DAP is an open science study, with all data made available to researchers around the world, with the goal of maximizing the impact of scientific discoveries that arise from these data. The goals of this supplemental project are two-fold. First, it will ensure that the DAP data are maximally compliant with the needs of AI/ML analytical approaches. Second, in collaboration with the University of Washington eScience Institute, the DAP team will develop webinars and a weeklong hack-a-thon event to introduce and implement AI/ML approaches for DAP data.
NIA
ROUSSOS, PANAGIOTIS
ICAHN SCHOOL OF MEDICINE AT MOUNT SINAI
Multi-omic Human Brain Immune Cell (HBIC) Resources for AI/ML Applications
As part of the parent R01 grant, we generated multi-omic molecular profiles of Human Brain Immune Cells (HBICs), representing characteristics of their DNA, RNA, chromatin accessibility, and protein. The profiles were generated from isolates collected from the prefrontal cortex of 300 donors at different stages of Alzheimer’s disease (AD). In addition, we will collect single-cell transcriptome and regulome data from over 100 donors to better characterize variation at the single-cell level. Using this administrative supplement, we will transform these remarkable resources into AI/ML-ready shared resources using a cloud-based AD knowledge portal. To share ideas and collaborate in an open forum, we will publish uniformly processed and fully annotated AI/ML-ready resources in a self-contained form for rapid prototyping with modern AI/ML tools. This award will provide our team with the support necessary to (1) facilitate access to large-scale, multidimensional datasets on HBICs for AI/ML applications, (2) accelerate research to understand the mechanisms of the onset and progression of AD, (3) provide systems-level insights about transcriptional regulation in HBICs and AD pathogenesis using integrative AI/ML models, and (4) provide a prioritized list of significant loci and genes for future mechanistic studies in AD. Together with exemplary systems-level analyses and annotations of these datasets for AI/ML-based research, this supplement will provide the scientific community with an urgently needed resource to investigate the role of immune cells in AD and increase our mechanistic understanding of dysfunction at AD risk loci.
NIA
SALZMAN, JULIA
STANFORD UNIVERSITY
Enabling the AI/ML-Readiness of Massive Single-Cell Data for Discovering RNA Regulatory Biology
Despite the centrality of precisely detecting RNA expression at single-nucleotide resolution, state-of-the-art, reproducible statistical algorithms to achieve this goal lag far behind the rate at which single-cell RNA-seq (scRNA-seq) data are generated, limiting AI/ML-readiness. Our suite of four algorithms, engineered with support from the parent award—SICILIAN for precise junction calling; SpliZ and ReadZS for differential splicing and RNA transcript processing in single-cell sequencing experiments; and SRPRS for mining spatial transcriptomics data to find novel subcellular expression programs—opens a new field of research and unlocks the untapped discovery potential of the massive single-cell and spatial transcriptomic sequencing data currently available in public repositories. These algorithms will enable AI/ML-readiness of the massive data already generated and continually being produced. The ultimate goal of this supplement is to deliver and apply AI/ML-ready data using novel statistical genomic algorithms engineered through the parent award.
NIGMS
SCHAFFER, CHRIS B.
CORNELL UNIVERSITY
Agent-Based Participation of Machine Learning Models in a Crowdsourcing System
Collective hybrid intelligence may be an effective next step in the evolution of information processing systems that still rely on human cognition. Supervised machine learning has been treated as a panacea for automated image classification under the assumption that prediction performance depends primarily on the quality and volume of the training corpus, which is often obtained through crowdsourcing. We have observed that this assumption can fall short when accurate classification depends on contextual knowledge that is not encoded in the pixels, or on the inference needed to apply that knowledge. The 37,000 volunteers on our citizen science platform have contributed over 12 million classification labels for a biomedical research application. Bespoke “wisdom-of-the-crowd” methods, which effectively create ensemble models out of humans, combined multiple individual labels for the same input to produce 1.5 million research-grade labels. These gold-standard data were used by over 900 participants in a machine learning competition to train 55 unique models exhibiting a range of performance characteristics. Though none of these models produces research-grade labels on its own, the sensitivity and bias distributions of these models are similar to those of individual human volunteers, suggesting the models’ suitability for crowd-based participation. We endowed one of the winning models, named GAIA, with sufficient agency to participate as humans do on our citizen science platform and discovered that our “wisdom-of-the-crowd” algorithm was effective in extracting research-grade classifications when GAIA was a member of the crowd. Our goal now is to examine the extent to which such human–AI ensembles may give purpose to imperfect ML models as an intermediate, practicable step toward fully automated solutions. To this end, we will run a new study that adds two agents employing the other winning models from the competition. Success in this pursuit would allow us to incorporate full-time citizen science bots into Stall Catchers, which could double the number of capillary stalling studies we can conduct in a given year toward elucidating a more complete mechanistic model of the capillary stalling that contributes to dementias, such as Alzheimer’s disease.
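Stripped to its essentials, letting an ML agent vote alongside humans is a weighted-ensemble computation. The following NumPy sketch is a deliberate simplification; the actual Stall Catchers aggregation is sensitivity-weighted and considerably more sophisticated.

```python
# Toy "wisdom-of-the-crowd" aggregation with an ML agent as one member.
import numpy as np

def crowd_answer(votes, weights, threshold=0.5):
    """Weighted vote over binary labels (1 = stalled capillary)."""
    votes, weights = np.asarray(votes, float), np.asarray(weights, float)
    return float(np.average(votes, weights=weights) >= threshold)

human_votes = [1, 0, 1, 1, 0]
human_weights = [1.0, 0.8, 1.2, 0.9, 1.0]   # e.g., per-user reliability
gaia_vote, gaia_weight = 1, 1.0              # ML agent treated like a human member

label = crowd_answer(human_votes + [gaia_vote], human_weights + [gaia_weight])
print("research-grade label:", label)
```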
NIA
SEBASTIANI, PAOLA
TUFTS MEDICAL CENTER
Improving AI/ML-Readiness of Data Generated Under the R01: Protein Signatures of APOE2 and Cognitive Aging
Through the parent award (R01 AG061844, “Protein signatures of APOE2 and cognitive aging”), we are generating proteomic and metabolomic data in a cohort of centenarians, their offspring, and unrelated controls from the New England Centenarian Study (NECS). The goal of the parent project is to validate a proteomic signature of APOE genotypes and to evaluate its value, together with metabolic profiles, for predicting patterns of cognitive function change in aging individuals. While we plan to share data through the Alzheimer’s disease (AD) and new extreme longevity (ELITE) portals with the usual restrictions to preserve the privacy and anonymity of study participants, unrestricted sharing of data would be an attractive option for AI/ML investigators. This project will use advanced machine learning techniques to generate a high-fidelity, privacy-preserving, synthetic version of the data obtained in the parent project that can be shared without restriction. Machine learning methods have emerged that can generate synthetic data using a model trained on the real data. In the synthetic dataset, no single data point corresponds to a real person, so the data can be shared freely, and the synthetic data can be analyzed to produce results statistically consistent with those derived from the original data. Our team includes data scientists and partners from the company Syntegra and will implement a data generation model to create the synthetic dataset. In parallel, we will develop a protocol for validation of the synthetic data that includes fidelity to a variety of results of machine learning analyses and metrics to assess the de-identification of the data. We will conduct the analyses in the real and synthetic datasets and compare the results using these metrics. This is a high-risk but potentially high-return proposal. If the approach works, we will be able to generate data that can be widely shared with the community. The approach will also be applicable to several other studies of aging that struggle with the challenge of sharing data to advance scientific research while preserving privacy.
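One simple fidelity metric of the kind such a validation protocol might include—comparing per-variable marginal distributions between real and synthetic data—can be sketched with SciPy; the project's full protocol also covers ML-analysis fidelity and de-identification metrics not shown here, and the matrices below are random placeholders.

```python
# Flag variables whose marginal distributions diverge between real and
# synthetic datasets, using a two-sample Kolmogorov–Smirnov test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
real = rng.normal(loc=1.0, scale=2.0, size=(500, 10))       # placeholder
synthetic = rng.normal(loc=1.0, scale=2.0, size=(500, 10))  # placeholder

for j in range(real.shape[1]):
    ks = stats.ks_2samp(real[:, j], synthetic[:, j])
    if ks.pvalue < 0.01:   # marginal fidelity failure for this variable
        print(f"variable {j}: KS={ks.statistic:.3f}, p={ks.pvalue:.3g}")
```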
NIA
SHIRTS, MICHAEL R.
UNIVERSITY OF COLORADO
Extending the QCArchive Small Molecule Quantum Chemistry Archive To Support Machine Learning Applications in Biomolecular Modeling
Current-generation molecular simulation models are insufficiently accurate, and the tools for building them are limited, based on aging infrastructure, and in need of automation. Our parent award, “Open Data-Driven Infrastructure for Building Biomolecular Force Fields for Predictive Biophysics and Drug Design,” aims to solve these problems by producing a modern infrastructure for building, applying, and improving accurate molecular mechanics force fields. As part of our NIH-funded project, we have collaborated closely with the Molecular Sciences Software Institute (MolSSI) to use the QCArchive ecosystem to generate and continuously expand very large quantum chemical datasets relevant to biomolecular systems on a variety of supercomputing resources. QCArchive now contains over 42 million quantum chemical calculations for over 39 million molecules and has become enormously popular, with over 1.79 million accesses per month. Large quantum chemical datasets relevant to biomolecular systems are extremely valuable to the AI/ML community: data is the key element needed both for fundamental research into ML architectures and for constructing predictive models for downstream use. Unfortunately, quantum chemical datasets are extremely expensive to generate, limiting in-house generation of large, useful datasets to a few large companies and researchers with access to sufficient computing resources. While AI/ML quantum chemical methods have shown immense promise for biomolecular systems, limited access to large, curated datasets has greatly hindered rapid progress in this area. We aim to bridge this gap by working closely with MolSSI QCArchive developers to address the robustness, scalability, and data-delivery challenges facing the biomolecular AI/ML community, which requires access to large quantum chemistry datasets. Because QCArchive is primarily maintained by a single MolSSI Software Scientist, additional software developers are needed to improve the QCArchive infrastructure to meet the rapidly growing demands of the AI/ML community and to enable it to take full advantage of the wealth of data generated by our NIH-funded project, as well as the data actively being produced by the distributed, fault-tolerant quantum chemistry tools our project has engineered, which are rapidly populating QCArchive. We will additionally develop interfaces and dashboards to enable facile discovery, retrieval, and import of quantum chemical datasets within popular ML frameworks. To ensure our tools are useful for the most promising AI/ML applications, we will collaborate directly with AI researchers in the OpenMM, TorchMD, and SchNetPack communities who are actively developing and deploying quantum machine learning potentials for biomolecular simulation, with the goal of producing tools capable of driving these high-priority applications yet generally suitable for the wider community.
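As a purely hypothetical sketch of what "import into popular ML frameworks" could look like, the following wraps retrieved records in a PyTorch `Dataset`; `load_qc_records` is a placeholder for whatever retrieval interface is ultimately exposed, not a real QCArchive API call.

```python
# Hypothetical bridge from quantum chemistry records to an ML framework.
import torch
from torch.utils.data import Dataset, DataLoader

def load_qc_records(n=100):
    """Placeholder: returns (atomic numbers, coordinates, energy) tuples."""
    return [(torch.randint(1, 9, (5,)), torch.randn(5, 3), torch.randn(()))
            for _ in range(n)]

class QCDataset(Dataset):
    def __init__(self, records):
        self.records = records
    def __len__(self):
        return len(self.records)
    def __getitem__(self, i):
        z, xyz, energy = self.records[i]
        return {"z": z, "xyz": xyz, "energy": energy}

loader = DataLoader(QCDataset(load_qc_records()), batch_size=16)
batch = next(iter(loader))  # batches ready for training an ML potential
```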
There are large amounts of high-quality data that remain underutilized in our attempt to understand the structure and function of neural circuits. In particular, the application of artificial intelligence (AI) and machine learning (ML) approaches could be enabled by having these data in computable form and available via simple download or an application programming interface (API). We will facilitate the construction of knowledge graphs for neural circuits, construct graphs for the C. elegans nervous system and the mouse retina, and demonstrate utility by applying AI and ML to those graphs. Known entities (such as neurons, small molecules, and neuropeptides) and ontologies (such as anatomy, relations, and experimental evidence) provide the underlying data of the graph. Curated assertions provide the knowledge (e.g., a synaptic or functional connection between neurons, the specific receptor for a neuropeptide, or the expression of a neuropeptide in a specific neuron). In this graph model, entities are the nodes and ontological relationships are the edges, yielding an inferred knowledge graph supported by evidence-backed assertions. This type of knowledge graph can be applied to biological pathways that are based on phenotype observations—including expression, neuronal activity, and organismal behavior—rather than physical interactions or enzymatic activities, such as those used to describe biochemical pathways. To accomplish the generation of knowledge graphs, we will (1) refine the relevant vocabularies to focus on relations used in neural circuit research; (2) adjust existing infrastructure to handle the appropriate ontologies, data models, and curation tools for neural circuit data; and (3) incentivize expert contributions by arranging short reviews coupled to computable assertions using the microPublication Biology platform. The knowledge graph will be published on the internet and tested through a hack-a-thon.
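The entity–relation–evidence model described above can be sketched directly with NetworkX; the C. elegans entities and evidence strings shown are illustrative examples only.

```python
# Minimal knowledge-graph sketch: entities as nodes, typed relations as
# edges, each assertion carrying its supporting evidence.
import networkx as nx

G = nx.MultiDiGraph()
G.add_node("AVA", type="neuron")
G.add_node("AVB", type="neuron")
G.add_node("FLP-18", type="neuropeptide")
G.add_node("NPR-4", type="receptor")

G.add_edge("AVA", "AVB", relation="synapses_onto",
           evidence="electron microscopy reconstruction")
G.add_edge("AVA", "FLP-18", relation="expresses",
           evidence="reporter expression")
G.add_edge("FLP-18", "NPR-4", relation="binds_receptor",
           evidence="ligand-receptor assay")

# Simple query over the graph: which neurons could FLP-18 signaling
# originate from?
sources = [u for u, v, d in G.edges(data=True)
           if d["relation"] == "expresses" and v == "FLP-18"]
print(sources)
```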
NHGRI
VAZQUEZ GUILLAMET, MARIA CRISTINA
WASHINGTON UNIVERSITY IN ST. LOUIS
Machine Learning to Identify Sepsis Phenotypes at Risk for Infections Caused by Multidrug-Resistant Gram-Negative Bacilli: Evaluating the Relevance of Unstructured Data and Data Engineering Tools
Sepsis, the body’s response to an infection, is a devastating syndrome and a leading cause of death, morbidity, and health care costs. Its impact is amplified by rising rates of antimicrobial resistance, because improving sepsis outcomes depends primarily on prescribing timely antibiotics based on the estimated risk of multidrug resistance (MDR). Artificial intelligence (AI) and machine learning (ML) are data-driven approaches that look for patterns in massive datasets. The promise of AI/ML in sepsis and antimicrobial resistance research remains largely unfulfilled. The main reason is deficient, inaccessible, and poorly labeled clinical data, which allow access to only a small portion of the electronic health record (EHR). Moreover, rich clinical narratives—such as notes and imaging reports, which contain unstructured data elements in free-text format—are almost never used. Our overall aim is to identify sepsis phenotypes at risk for MDR microbes by leveraging big EHR data and using innovative methods such as ML. This will enable better antibiotic prescribing practices and standardize comparisons across hospitals. This supplement will provide the framework for ML use in sepsis and antimicrobial resistance research. We will augment access to EHR data and test the importance of unstructured data elements retrieved from clinical narratives. Our aims reflect these priorities: (1) analyze barriers to the use of EHR structured data and provide data engineering solutions for data enrichment; (2) extract and assess the importance of unstructured data in developing ML sepsis models; and (3) compare ML sepsis models built from unstructured plus structured data versus structured data only, and ensure algorithm fairness by testing the models across subgroups of interest defined by gender and race. We will incorporate clinical data from the 15 hospitals in our health care system, which serves an ethnically and socioeconomically diverse patient population in rural, suburban, and urban settings. This project will continue a multidisciplinary collaboration at Washington University in St. Louis among clinical experts, data scientists, and computer science engineers.
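The subgroup fairness check in aim 3 can be sketched with scikit-learn as comparing held-out discrimination (AUC) across gender and race strata; the data below are random placeholders, not EHR data.

```python
# Compare model AUC across demographic subgroups on held-out data.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 20))
y = rng.integers(0, 2, size=n)                      # MDR infection label
groups = pd.DataFrame({"gender": rng.choice(["F", "M"], n),
                       "race": rng.choice(["Black", "White", "Other"], n)})

X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
    X, y, groups, test_size=0.3, random_state=0)
model = GradientBoostingClassifier().fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

for col in g_te.columns:
    for level in g_te[col].unique():
        mask = (g_te[col] == level).to_numpy()
        if len(np.unique(y_te[mask])) == 2:         # AUC needs both classes
            print(col, level, round(roc_auc_score(y_te[mask], scores[mask]), 3))
```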
NIGMS
WANG, JUN
THE UNIVERSITY OF TEXAS AT AUSTIN
Detecting Speech Articulation Patterns Following Laryngeal Cancer Treatment Using Artificial Intelligence and Machine Learning
Laryngectomees (individuals who have undergone laryngectomy, a surgery for laryngeal cancer treatment) struggle in their daily communication. The specific goals of the parent award are to determine silent and alaryngeal speech articulation patterns and then to develop a novel, wearable silent speech interface (SSI) that synthesizes quality speech from articulation. Through the parent R01 project, we are collecting a unique, multi-modal dataset from alaryngeal and typical speakers. This dataset includes speech kinematic and acoustic data as well as inertial (rotational and acceleration) and magnetic data from a magnetic tracker attached to the tongue. This supplement aims to develop automated AI/ML-based algorithms for signal quality assessment to assist the data curation process. External researchers from The University of Texas at Austin and the Georgia Institute of Technology will be recruited to validate the ML-readiness of the dataset and provide feedback. Fused data (combinations of low-level features of the multi-modal data) generated using machine learning will also be provided. Making the dataset ML-ready and publicly available will allow our findings to be validated by other teams and will motivate new algorithms that can bring high impact to the field by helping address the long-standing articulation-to-speech mapping question.
NIDCD
XU, DONG
UNIVERSITY OF MISSOURI
Machine Learning Development Environment for Single-Cell Sequencing Data Analyses
Single-cell sequencing has become one of the most powerful biotechnologies for studying biology and diseases, such as cancers and Alzheimer’s disease. However, computational analyses are often the bottleneck to revealing biological insights and defining the cellular heterogeneity underlying the data. The application of machine learning (ML), especially deep learning, holds great promise for addressing these challenges. While ML studies have made significant progress along this line, the involvement of the ML community in single-cell data analysis is limited by the barriers of technological complexity and required biological knowledge. To attract more ML experts into this field, we propose making large-scale, single-cell sequencing data ML-ready and providing an ML-friendly development environment. The specific aims of this supplement are: (1) Collect, process, and manage diverse single-cell sequencing data from public sources and make them ML-ready by converting them into formats efficient for storage and handling. The data will be processed with multiple options—such as imputation, normalization, and dimension reduction—using a pipeline to be developed. (2) Configure the collected data to build benchmarks, gather public benchmarks, and encourage the community to submit their own. The data will be divided into training, validation, and test sets in multiple settings, including a minimum viable benchmark to assist efficient method development and a comprehensive benchmark for full evaluations. We will develop utilities to evaluate results based on the assessment measures and generate detailed reports, and we will run a set of public tools on the benchmarks as baselines for others to compare against. (3) Provide an integrated development environment (IDE) to support method development. We will build an IDE for single-cell sequencing analysis method development with plug-and-play features at the code level and a web interface for ML researchers to contribute and test even minimal new ideas. A report will be provided containing evaluation metrics, computer resource usage, comparisons with public tools, and downstream visualization and interpretation. The newly formatted data, the benchmarks, and the method development and assessment environment will be available on GitHub and through the in-house single-cell data analysis web portal DeepMAPS.
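For aim 2, fixed benchmark splits can be stored alongside the expression matrix itself; here is a minimal sketch assuming the AnnData library, with illustrative file names and split fractions rather than the project's actual benchmark design.

```python
# Attach reproducible train/val/test split labels to a single-cell matrix
# and write it in an efficient on-disk format.
import numpy as np
import pandas as pd
import anndata as ad

adata = ad.AnnData(X=np.random.poisson(1.0, size=(1000, 50)).astype(np.float32))
rng = np.random.default_rng(0)

idx = rng.permutation(adata.n_obs)
n_tr, n_va = int(0.7 * adata.n_obs), int(0.15 * adata.n_obs)
labels = np.empty(adata.n_obs, dtype=object)
labels[idx[:n_tr]] = "train"
labels[idx[n_tr:n_tr + n_va]] = "val"
labels[idx[n_tr + n_va:]] = "test"
adata.obs["benchmark_split"] = pd.Categorical(labels)  # ships with the data

adata.write_h5ad("benchmark_v1.h5ad")
```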
NIGMS
ZHONG, HUA JUDY
NEW YORK UNIVERSITY SCHOOL OF MEDICINE
Fair Risk Predictions for Underrepresented Populations Using Electronic Health Records
Electronic health records (EHRs) have become an increasingly common data source for clinical risk prediction, offering large sample sizes and frequently sampled clinical predictors and outcomes. A key feature of EHR sampling is that patients go to a hospital of their own choice and at their own frequency. During the implementation of our parent R01 project, we found emerging evidence of undersampling bias in EHRs for racial and ethnic minorities and patients with disadvantaged social determinants of health (SDOHs). Such patients are more likely to visit multiple institutions to receive care and often receive fewer diagnostic tests and medications in the EHR data of a single institution. We hypothesize that racial and ethnic minorities and patients with disadvantaged SDOHs are under-represented in EHRs, with smaller sample sizes, insufficient diagnostic and laboratory information, and less frequent encounters. Consequently, we hypothesize that EHR-based risk prediction models—including conventional linear models and modern AI/ML methods—that ignore this unbalanced sampling produce less accurate predictions for these under-represented patient populations. Little or no work has been done to systematically investigate the impact of these biases. Thus, in this project, we propose to (1) investigate under-representation (bias) of racial and ethnic minorities and patients with disadvantaged SDOHs in EHRs; (2) develop fairness-aware EHR prediction methods; and (3) share the simulated EHR datasets and linked SDOH datasets, along with the developed fair risk prediction tools, to inspire and enable the AI/ML research community to further investigate fair EHR algorithms.