Adams, Meredith C. B. | Wake Forest University Health Sciences | Wake Forest IMPOWR Dissemination Education and Coordination Center (IDEA-CC) Using artificial intelligence to transform NIH HEAL Initiative clinical trial data to increase scientific impact. The HEAL Data Ecosystem is working to collect data across its projects and networks to meet FAIR (Findable, Accessible, Interoperable, Reusable) data standards. This administrative supplement builds on the mission of the NIH HEAL IMPOWR network to blend existing and future chronic pain (CP) and opioid use disorder (OUD) data. The proposed work will significantly deepen and augment approaches to FAIR principles in CP and OUD data for both the HEAL network and larger NIH research community. The overall objective of this project is to move CP and OUD data one step closer to FAIR by leveraging existing datasets and developing tools for new projects. The general hypothesis of the project is that leveraging existing CP & OUD data and collecting new data using ML/AI data quality standards will accelerate the impact of the HEAL Data Ecosystem. The aims of the project are to transform existing datasets to be ML/AI ready, and to adapt tools to support ML/AI readiness for existing and prospectively collected HEAL common data elements (CDE). The expected outcome of this project is data optimization pipelines and tools to support the goal of ML/AI ready data. The results of this project will provide a strong basis for further development of the HEAL Data Ecosystem, helping to bring diverse data sources together and meet FAIR data standards. | NIDA |
Alkalay, Ron N | Beth Israel Deaconess Medical Center | Curating musculoskeletal CT data to enable the development of AI/ML approaches for analysis of clinical CT in patients with metastatic spinal disease This project aimed to provide computed tomography image data for developing deep-learning methods to analyze disease progression and fracture risk in patients with metastatic spine disease based on clinical and image-based classifications. Patients with metastatic spine disease are at high risk of pathologic vertebral fracture (PVF). Up to 50% of these patients suffer neurological deficits with further complications that may be fatal. Prediction of PVF risk is a critical clinical need for managing these patients. Segmentation of vertebral anatomy, bone properties, and individual spinal musculature cross-sectional area from clinical CT imaging is fundamental for developing precise, patient-specific diagnostics of PVF risk. Such segmentation faces unique challenges due to the cancer-mediated alteration in skeletal tissues' radiological appearance. Deep learning (DL) methods will speed and standardize the critical segmentation step, permitting analysis of larger datasets and promoting new DL analysis for improved insight into the drivers of PVF risk in patients with metastatic spine disease. For this project, titled "Curating musculoskeletal CT data to enable the development of AI/ML approaches for analysis of clinical CT in patients with metastatic spinal disease", our work aimed to establish a curated, publicly accessible computed tomography imaging dataset from 140 metastatic spine disease patients treated with radiotherapy, imaged as part of our parent study, titled "Predicting Fracture Risk in Patients Treated with Radiotherapy for Spinal Metastatic Disease" (AR075964). For this purpose, we have established manual segmentation of each vertebral level, including delineation for lesion type and fractures. Based on these data, we successfully developed a testbed deep learning model for segmenting 1) thoracic and lumbar vertebrae and 2) a complete set of thoracic and abdominal muscles, demonstrating the applicability of the curated dataset for developing DL methods. Based on this effort, the curated images and associated delineation data from this cohort have been accepted and are undergoing final submission to The Cancer Imaging Archive (TCIA). Integrating DL systems within our approach forms an important step in changing the patient management paradigm from reactive to data-driven proactive management to prevent PVF events and critically reduce bias in patient management. | NIAMS |
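As a minimal sketch of the kind of 3D CT segmentation model this project describes, the following uses the MONAI library; the patch size, channel widths, and three-class labeling are illustrative assumptions, not the project's actual configuration.

```python
# Sketch: 3D U-Net for CT vertebra/muscle segmentation (MONAI).
# Patch size, channels, and label count are illustrative assumptions.
import torch
from monai.networks.nets import UNet
from monai.losses import DiceLoss

model = UNet(
    spatial_dims=3,
    in_channels=1,               # single-channel CT
    out_channels=3,              # e.g., background / vertebra / muscle
    channels=(16, 32, 64, 128),
    strides=(2, 2, 2),
    num_res_units=2,
)
loss_fn = DiceLoss(to_onehot_y=True, softmax=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

ct_patch = torch.randn(2, 1, 96, 96, 96)           # batch of CT patches
labels = torch.randint(0, 3, (2, 1, 96, 96, 96))   # manual delineations

logits = model(ct_patch)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```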
Bateman, Alex | European Molecular Biology Laboratory | UniProt - Protein sequence and function embeddings for AI/Machine Learning readiness We will incorporate new representations of proteins and their functions that will help unlock the power of AI/ML for biomedical researchers using UniProt. UniProt represents a wealth of protein-related information that is very diverse and yet highly structured, covering all living organisms from microbes to humans. It is an ideal source of information for AI/ML and indeed has already been used to help train essential tools such as AlphaFold, which uses deep learning to generate 3D structural models for nearly all proteins. This project further harnesses AI/ML techniques to enhance UniProt, particularly using sequence embeddings. Sequence embeddings provide a representation of protein sequence that can be used for a broad range of AI/ML tasks. We are making these embeddings and AI/ML models available to researchers, saving community compute time and enabling data science. We are testing embeddings for critical tasks in UniProt, such as clustering sequences to enable users to search faster. We are also exploring using embeddings for enzymatic reactions to enhance our ability to identify novel data in the literature and improve the diversity of catalytic reactions captured in UniProt. To better understand the needs of the AI/ML community, we have held a workshop to engage leaders in the field. We have learned about new directions and opportunities we can benefit from, as well as the challenges they face in their own work, which we can help address by improving our data provision. Collectively, we will scale up protein functional annotation with AI/ML-assisted techniques, organize the growing sequence space with AI/ML-enabled sequence clustering to keep sequence computation sustainable, and collaborate with AI/ML research communities to develop new solutions that benefit the broad user community of the UniProt resources. | NHGRI |
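A minimal sketch of the embedding-based clustering task mentioned above, assuming an HDF5 file of fixed-length per-protein embedding vectors keyed by UniProt accession (the format UniProt uses for its downloadable embeddings); the file name and cluster count are illustrative.

```python
# Sketch: cluster per-protein sequence embeddings with mini-batch k-means.
# File name and n_clusters are illustrative assumptions.
import h5py
import numpy as np
from sklearn.cluster import MiniBatchKMeans

with h5py.File("per-protein.h5", "r") as f:
    accessions = list(f.keys())
    X = np.stack([f[acc][:] for acc in accessions])  # (n_proteins, dim)

kmeans = MiniBatchKMeans(n_clusters=50, random_state=0).fit(X)
for acc, label in zip(accessions[:5], kmeans.labels_[:5]):
    print(acc, "-> cluster", label)
```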
Bertagnolli, Monica M. | Brigham And Women's Hospital | A-STOR Cancer Clinical Trial Artificial Intelligence & Machine Learning Readiness 'Big data' generated through clinical trials offers an incredible opportunity for new cancer discoveries, and through this project, we will evaluate optimal approaches to apply artificial intelligence/machine learning algorithms to advance our understanding of cancer diagnosis, treatment, and management. Cancer clinical trials are facilitating an explosion of biomedical data, including complex clinical data, diverse genomic data, pathologic image data, high-dimensional molecular characterization, and clinical imaging data, among others. The rationale for maximizing analyses of samples and data collected through clinical trials is well understood: comprehensive molecular profiling should accelerate our goal of 'precision cancer medicine', especially when applied to randomized clinical trials that incorporate current and emergently effective treatments. Among cancer clinical trials, many high-impact trials are designed and conducted by the National Cancer Institute's (NCI) National Clinical Trials Network (NCTN), including the focus of this study, the Alliance for Clinical Trials in Oncology. However, existing barriers impede cancer clinical trials from unlocking the full potential of these datasets. Currently, omics data generated from trials are largely decentralized: data are housed at a variety of sites, analyses take place locally, and other researchers do not have access until public deposition of data on repositories. Further, analyses vary widely in bioinformatics methods, including choice of tools, dependencies, file formats, parameterizations, data quality filtering thresholds, and other workflow elements, which makes integration across groups challenging. In this proposal, we are expanding the Alliance Standardized Translational Omics Resource (A-STOR) to realize the full potential of artificial intelligence (AI)/machine learning (ML) modeling for cancer clinical trials. Specifically, we will: 1) rapidly expand A-STOR to host data from over a dozen existing or ongoing Alliance clinical trials and optimize infrastructure for AI/ML analyses; 2) develop a unified clinical and adverse event (AE) data dictionary to facilitate clinical data harmonization; and 3) complete an already-approved pooled multi-modal ML-based predictor as a pilot study. Progress to date includes co-localization of digital pathology and genomic data, harmonization of the clinical data dictionary, and initiation of a pilot project to interrogate the breast cancer tumor immune microenvironment through cutting-edge AI-based approaches and best-in-class RNA-based immune signature approaches. | NCI |
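As a minimal sketch of the kind of dictionary-based harmonization aim 2 describes, the following maps trial-specific adverse-event terms onto unified terms with pandas; the column names and mapping table are hypothetical, and a real dictionary would target a controlled vocabulary such as CTCAE or MedDRA.

```python
# Sketch: harmonize AE terms from two trials against a unified dictionary.
# Column names and the mapping table are hypothetical.
import pandas as pd

ae_dictionary = pd.DataFrame({
    "raw_term": ["nausea", "nauseated", "low wbc", "neutropenia"],
    "harmonized_term": ["Nausea", "Nausea", "Neutropenia", "Neutropenia"],
})

trial_a = pd.DataFrame({"patient_id": [1, 2], "ae": ["nauseated", "low wbc"]})
trial_b = pd.DataFrame({"patient_id": [3], "ae": ["neutropenia"]})

pooled = pd.concat([trial_a.assign(trial="A"), trial_b.assign(trial="B")])
pooled = pooled.merge(ae_dictionary, left_on="ae",
                      right_on="raw_term", how="left")
print(pooled[["trial", "patient_id", "harmonized_term"]])
```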
Bhatt, Tanvi | University Of Illinois At Chicago | Perturbation training for enhancing stability and limb support control for fall-risk reduction among stroke survivors Democratizing data-driven approaches in quantitative gait analysis to enhance effectiveness of assessment and treatment approaches for stroke rehabilitation by creating a harmonized gait data repository and scientific workflow library. NICHD funds several clinical trials targeting novel balance and gait interventions. Yet there is a gap in the field pertaining to data sharing, accessibility, and utilization. Machine learning models based on gait data could accurately identify pathological gait patterns, classify motor disorders, predict the need for an ankle-foot orthosis, and assess rehabilitation status for stroke survivors. However, there is a lack of publicly available data repositories for clinicians and researchers, and computational expertise is required for the use of the data. These barriers greatly limit the development of data-driven approaches for health. Therefore, this project aims to take a step towards the democratization of gait analysis to empower a broad range of stakeholders. To create the gait data repository (Aim 1), we will evaluate and enrich metadata through data wrangling and harmonization capabilities following FAIR data principles (findability, accessibility, interoperability and reusability) before building the repository. To uncover data issues (e.g., data loss and data artifacts), Aim 2 will focus on data analytics leveraging harmonized data sets from Aim 1 to create scientific workflows for biomechanical data utilization (data visualization, cleaning, and analysis functionalities). To support cleaning and transformation tasks in gait data, a set of open-source libraries will be created for data cleaning, analysis, and visualization. Our computational libraries will be made available to researchers through an easy-to-use visual interface that allows them to query and visualize the data. Additionally, a centralized website (GaitPortal) will be designed to make data and libraries publicly available. Lastly, we will demonstrate an initial use for the transformed data by developing a fall risk predictive model based on the time-series gait data (Aim 3). The demonstration code containing all basic and advanced functions will be provided for other researchers to enable customization of the code for their specific purpose(s). These user-friendly tools would improve data checking, cleaning, and analysis by cutting manual data analysis time by at least 25% and reducing overall financial cost for researchers and clinicians. Additionally, the association between clinical measures and gait data could guide the development of an objective function to evaluate balance status and training effects, which will help researchers and clinicians to identify individualized impairments in gait performance and balance control. This personalized insight can benefit the development of tailored rehabilitation strategies for people with hemiparetic stroke. Clinicians can also monitor rehabilitation progress based on real-time or post-processed feedback from biomechanical assessments, enhancing the precision and efficacy of interventions in the field of stroke rehabilitation. | NICHD |
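A minimal sketch of the Aim 3 idea of a fall-risk predictor built from time-series gait data; the synthetic signals, summary features, and classifier are illustrative stand-ins for the project's pipeline.

```python
# Sketch: summary features from time-series gait data plus a fall-risk
# classifier. Data, features, and model are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_subjects, n_samples = 60, 1000
trials = rng.normal(size=(n_subjects, n_samples))   # e.g., trunk velocity
fell = rng.integers(0, 2, n_subjects)               # fall outcome label

def summarize(trial):
    # Collapse each time series into simple summary statistics.
    return [trial.mean(), trial.std(), np.abs(np.diff(trial)).mean()]

X = np.array([summarize(t) for t in trials])
scores = cross_val_score(RandomForestClassifier(random_state=0), X, fell, cv=5)
print("CV accuracy:", scores.mean())
```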
Casey, Joan A | Columbia University Health Sciences | Approaches for AI/ML Readiness for Wildfire Exposures Predicting wildfire PM2.5 using AI/ML techniques This project focuses on improving artificial intelligence/machine learning (AI/ML)-readiness of a wide range of environmental data sources used to predict wildfire fine particulate matter (PM2.5) exposure. Developing models to predict wildfire PM2.5 exposure is crucial as nearly 70% of the U.S. population is exposed to wildfire smoke each year, with 30% experiencing more severe levels of exposure. PM2.5 exposure has been associated with increased risk of respiratory care-related medical encounters and mortality among older individuals. As part of our parent R01 we are estimating the risk of incident and worsening mild cognitive impairment (MCI) and Alzheimer's disease and related dementias (ADRD) associated with wildfire PM2.5 exposure. To do so, we are developing models to predict daily exposure to wildfire-specific PM2.5 levels using a two-stage ML approach. Stage one relies on a Bayesian machine learning algorithm to integrate multiple existing PM2.5 prediction models to assign ambient PM2.5 exposures. Stage two uses NOAA's Hazard Mapping System to identify areas exposed to wildfire smoke plumes. We use the smoke plume information combined with statistical techniques to isolate daily estimates of wildfire PM2.5 from non-wildfire PM2.5 levels. The data sources needed to predict PM2.5 and wildfire PM2.5 are disparate, not very accessible, and unfriendly to AI/ML applications. These datasets include weather variables from multiple sources, satellite smoke plumes from multiple sources, air pollution monitor data, national land use variables, topographical data, and others. Although the data are rich and publicly available through US agencies, acquiring and preparing them for analysis represents a significant investment for any researcher. All datasets come in different spatial and temporal resolutions that need to be resolved for them to be merged. Additional processing is also needed to handle potential spurious information due to the periodicity of satellites, monitors, and cloud cover. With this administrative supplement, our goals are to improve the data processing for the vast and wide range of data sources by developing reproducible pipelines. For example, one source of PM2.5 predictions is obtained from the Atmospheric Composition Analysis Group, and we provide reproducible code to process and aggregate this netCDF data by applying a downscaling rasterization strategy using TIGER/Line shapefiles (see https://github.com/NSAPH-Data-Processing/pm25_components_randall_martin for more details). We will also annotate and document the data and ensure computational scalability.
In addition, we will deposit the processed data to a public data repository, Harvard Dataverse, and include a data demonstration to further disseminate the work. | NIA |
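As a minimal sketch of the grid-to-geography aggregation step described above (the project's actual downscaling rasterization code is in the linked GitHub repository), the following joins gridded netCDF PM2.5 values to TIGER/Line polygons; the file names, variable names, and coordinate naming are assumptions.

```python
# Sketch: aggregate gridded PM2.5 from netCDF to TIGER/Line ZCTA polygons
# via a point-in-polygon spatial join. File/variable names are assumptions.
import xarray as xr
import geopandas as gpd

ds = xr.open_dataset("pm25_monthly.nc")             # gridded PM2.5
df = ds["PM25"].to_dataframe().reset_index()        # lat, lon, PM25 columns

points = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df["lon"], df["lat"]), crs="EPSG:4326"
)
zctas = gpd.read_file("tl_2020_us_zcta510.shp").to_crs("EPSG:4326")

joined = gpd.sjoin(points, zctas[["ZCTA5CE10", "geometry"]],
                   predicate="within")
zcta_pm25 = joined.groupby("ZCTA5CE10")["PM25"].mean()
print(zcta_pm25.head())
```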
Chen, Shigang | University of Florida | Supplement: SCH: Enabling Data Outsourcing and Sharing for AI-powered Parkinson's Research Improving AI-readiness of outsourced medical data by addressing the data privacy issue through randomization and noise addition Artificial intelligence holds the promise of transforming data-driven biomedical research for more accurate diagnosis, better treatment, and lower cost. In the meantime, modern digital technologies make it much easier to collect information from patients at large scale. While "big" medical data offers unprecedented opportunities for building deep-learning artificial neural network (ANN) models to advance the research of complex diseases such as Parkinson's disease (PD), it also presents unique challenges to patient data privacy. The task of training and continuously refining ANN models with data from tens of thousands of patients, each with numerous attributes and images, is computation-intensive and time-consuming. Outsourcing computation and data to the cloud is a viable solution. However, the problem of performing the ANN learning operations in the cloud, without the risk of leaking any patient data from their sources, remains open to date. We propose to develop novel data masking technologies based on randomized orthogonal transformation to enable AI-computation outsourcing and data sharing. The proposed research includes (1) experimental studies of training ANN models with data masking for PD prediction and Parkinsonism diagnosis, and (2) theoretical development on data privacy, inference accuracy, and model performance. This supplement project expands the research into a new dimension of differential privacy by incorporating randomized orthogonal transformation and noise addition into the process of data masking. Differential privacy is a rigorously defined and widely adopted model, which provides a quantitative measure for privacy loss in data release. This project consists of a theoretical aim and an experimental aim. The theoretical aim is to develop a new method of achieving differential privacy for data outsourcing that minimizes noise addition with the help of randomized orthogonal transformation. The experimental aim is to use the new method to produce sharable PD datasets under the protection of differential privacy, ready for machine learning studies. The outcome of this research, with the new method of outsourcing data with differential privacy, is expected to have a broader impact beyond PD research in advancing the theory and implementation of cloud-based medical studies. | NLM |
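A minimal sketch of the two ingredients named above, combining a random orthogonal transformation with Gaussian noise addition; the noise scale is a placeholder, and a real implementation would calibrate it to the data's sensitivity and the (epsilon, delta) privacy budget.

```python
# Sketch: mask a data matrix with a random orthogonal transformation, then
# add Gaussian noise in the style of the Gaussian mechanism of differential
# privacy. The noise scale sigma is a placeholder, not a calibrated value.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))          # patient feature matrix (n x d)

# Random orthogonal matrix via QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(20, 20)))
assert np.allclose(Q.T @ Q, np.eye(20), atol=1e-10)

sigma = 0.5                              # placeholder noise scale
X_masked = X @ Q + rng.normal(scale=sigma, size=X.shape)

# Orthogonal transforms preserve pairwise distances, so distance-based
# learning on the masked data behaves much like learning on the original.
```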
Devinsky, Orrin | New York University School of Medicine | Machine learning approaches for improving EEG data utility in SUDEP research Leveraging AI and machine learning to identify EEG biomarkers of sudden unexpected death in epilepsy (SUDEP) The proposed supplemental project is built upon the base of augmented datasets and new AI/ML techniques. Our research team consists of SUDEP and AI/ML experts with complementary expertise, who are uniquely qualified to develop innovative analytic tools for EEG data AI/ML-readiness. First, we have explored new EEG feature extraction/engineering techniques and validated their efficacy with ML approaches. To date, our preliminary results have achieved a median AUC (area under the curve) of 0.87 in classification between SUDEP cases and living epilepsy patient controls, a significant improvement from our previously reported result (median AUC of 0.77, Frontiers in Neurology, 2022). Second, we are developing explainable ML models to enhance result interpretation. Third, we are developing and employing data augmentation techniques to improve the consistency of labeled EEG data from both SUDEP cases and living epilepsy patient controls. Finally, we will validate existing and newly developed ML methods on newly collected SUDEP and control samples at multiple sites. Overall, this project will complement and enrich the research aims in our parent grant, and promote research rigor, transparency and reproducibility. Accomplishing these research goals will maximize data utility and improve AI/ML-readiness in epilepsy research. | NINDS |
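A minimal sketch of an EEG feature-extraction-plus-classification workflow of the kind evaluated here by AUC; the synthetic signals, frequency bands, and linear model are illustrative assumptions, not the project's actual features or models.

```python
# Sketch: band-power EEG features plus a linear classifier evaluated by AUC.
# Signals, bands, and the model are illustrative assumptions.
import numpy as np
from scipy.signal import welch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
fs = 256                                   # sampling rate (Hz)
eeg = rng.normal(size=(80, fs * 30))       # 80 subjects, 30 s of EEG each
y = rng.integers(0, 2, 80)                 # SUDEP case vs. living control

def band_power(signal, lo, hi):
    freqs, psd = welch(signal, fs=fs)
    return psd[(freqs >= lo) & (freqs < hi)].mean()

bands = [(1, 4), (4, 8), (8, 13), (13, 30)]    # delta/theta/alpha/beta
X = np.array([[band_power(s, lo, hi) for lo, hi in bands] for s in eeg])

probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=5, method="predict_proba")[:, 1]
print("AUC:", roc_auc_score(y, probs))
```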
Ellisman, Mark H | University of California, San Diego | 3D Reconstruction and Analysis of Alzheimer's Patient Biopsy Samples to Map and Quantify Hallmarks of Pathogenesis and Vulnerability This project will develop software for normalizing the signal-to-noise ratio, resolution, and contrast of acquired 3D electron microscopic datasets to facilitate use of artificial intelligence and machine learning algorithms for automatic volume segmentation. This administrative supplement is advancing tools and methodology for the normalization of large-scale, 3D electron microscopic (EM) image volumes as a means to enhance the performance, reusability, and repeatability of high-throughput artificial intelligence and machine learning (AI/ML) algorithms for automatic volume segmentation of brain cellular and subcellular ultrastructure. This work is being conducted in the context of an active research project that is advancing the acquisition, processing/refinement, and dissemination of large-scale 3D EM reference data derived from a remarkable collection of legacy biopsy brain samples from patients suffering from Alzheimer's Disease (AD) (5R01AG065549). This active project is deeply rooted in the use of advanced AI/ML technologies for delineating key ultrastructural constituents of neurons and glia exhibiting hallmarks of the progression of AD. It is organized to comprehensively target areas associated with plaques, tangles, and brain vasculature, attending to locations where existing findings suggest cell and network vulnerability and that contain molecular interactions suspected by some to underlie the initiation and progression of AD. Through this work, we are advancing the development and dissemination of fully trained neural-network models for volume segmentation to simplify (and reduce the costs associated with) community efforts to extract their own 3D geometries and associated morphometrics from this collection of AD reference data and similar repositories of neuronal 3D EM data. With this supplemental effort, we will develop, refine and disseminate a set of tools which allow for direct feedback and standardization of primary image quality. With these tools users will be able to optimize and normalize imaging parameters at the time of image acquisition. The outcome of this work will advance the use of transfer learning methods, facilitating repeatability and reuse of trained neural network models for scalable EM image segmentation. | NIA |
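A minimal sketch of one way to normalize contrast across EM acquisitions so a trained segmentation model transfers between them, using histogram matching to a reference tile plus adaptive histogram equalization; the parameters and synthetic tiles are illustrative, not the project's normalization method.

```python
# Sketch: normalize contrast of an EM tile against a reference tile so a
# segmentation model trained on one acquisition generalizes to another.
# Parameters and synthetic data are illustrative assumptions.
import numpy as np
from skimage import exposure

rng = np.random.default_rng(0)
reference = rng.normal(0.5, 0.1, (512, 512)).clip(0, 1)   # reference tile
new_tile = rng.normal(0.4, 0.2, (512, 512)).clip(0, 1)    # new acquisition

matched = exposure.match_histograms(new_tile, reference)
normalized = exposure.equalize_adapthist(matched, clip_limit=0.01)
print("means:", new_tile.mean(), matched.mean(), normalized.mean())
```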
Friel, Kathleen Margaret | Winifred Masterson Burke Medical Research Institute | Targeted transcranial direct current stimulation combined with bimanual training for children with cerebral palsy A path to improving movement therapy by integrating 3D motion information of disabled individuals with current promising therapies. In this supplement, we aim to use Deep Learning (DL) pose estimation models along with 3D depth-sensing cameras to develop a cost-effective, easy-to-use, and compact Deep Learning based markerless kinematic data acquisition (DL-KDA) system that can be applied to children with unilateral cerebral palsy (UCP). To achieve this overall goal, we must establish the accuracy and validity of the kinematic data obtained from the system. We have developed a modular software framework for building and testing DL-KDA systems against a very precise marker-based motion capture gold standard (VICON). To date, we have collected kinematic data from 8 additional children with UCP, 16 typically developing children, and 40 healthy adults performing the Box and Blocks Test. So far, we have also extracted 2D images from 4 years (2015-2018) of previous video recordings. Using kinematic data from healthy adults, we are studying the effects of 3D camera and DL parameters/architecture on the accuracy of the resulting kinematic data. In parallel, using the extracted 2D images from previous recordings (2015-2018), we will annotate images with body joint locations and use this dataset for transfer learning-based retraining of DL pose estimation models. The goal is to address the gap in existing training datasets used for most DL pose estimation models, which are not inclusive of individuals with movement disorders. Without carefully addressing this gap, potential ethical and scientific biases may arise if such pose estimation models are applied to underrepresented groups such as children with UCP. These images will be transformed into ML/AI-ready HDF5 datasets and published in public DL and NIH repositories. These datasets will be available to other researchers using or building DL pose estimation models for applications in UCP clinical research. Finally, we will collect data from an additional 12 children with UCP (in 2023) and use the retrained DL model for body pose estimation. The performance of the retrained DL model will be statistically compared to the original DL model to verify whether bias was indeed present. Validated kinematics for the UCP population will likewise be uploaded to public DL and NIH repositories for use in future UCP research. | NICHD |
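A minimal sketch of packaging images and body-joint annotations as an ML/AI-ready HDF5 file, as proposed above; the dataset layout, joint count, and file name are assumptions for illustration.

```python
# Sketch: write 2D images and body-joint annotations to an ML-ready HDF5
# file. Layout, joint schema, and file name are illustrative assumptions.
import h5py
import numpy as np

rng = np.random.default_rng(0)
images = rng.integers(0, 255, (100, 480, 640, 3), dtype=np.uint8)
keypoints = rng.uniform(0, 640, (100, 17, 2)).astype(np.float32)  # 17 joints

with h5py.File("ucp_pose_dataset.h5", "w") as f:
    f.create_dataset("images", data=images, compression="gzip")
    f.create_dataset("keypoints", data=keypoints)
    f["keypoints"].attrs["joint_order"] = "COCO-17"   # document the schema
    f.attrs["source"] = "Box and Blocks Test recordings (illustrative)"

with h5py.File("ucp_pose_dataset.h5", "r") as f:
    print(f["images"].shape, f["keypoints"].shape)
```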
Fuller, Clifton David | University of Texas MD Anderson Cancer Center | Administrative Supplement: Development of functional magnetic resonance imaging-guided adaptive radiotherapy for head and neck cancer patients using novel MR-Linac device We have developed a unique quality-curated "benchmark" imaging dataset with multiple human observer segmentations as an avenue to improve head and neck cancer therapy using advanced multiparametric imaging, which can ideally be used to assure quality and potentiate improved AI/ML model development through FAIR (re)use. Radiotherapy (RT) treatment of head and neck cancer aims to deliver a therapeutic dose to cancer cells while minimizing the damage to surrounding healthy tissue. Identifying tumors that respond well to treatment and those that do not is essential to make RT more effective and reduce side effects. Multiparametric MRI, a technique that combines anatomical with functional imaging, has proven useful in identifying early responders and radiation-resistant disease in head and neck cancer patients. These techniques could be used to adapt radiation therapy during treatment. Our parent grant aimed to develop hardware, software, and infrastructure for multiparametric MRI-guided RT for head and neck cancer patients. In this supplement, the resulting imaging data will be curated, annotated, and made publicly available to facilitate community-driven artificial intelligence (AI) model building efforts. The proposed one-year supplement includes curation of high-quality anatomical and functional MRI sequences and corresponding clinical data for each patient. These anonymized datasets will be made FAIR (findable, accessible, interoperable, reusable) and available for public use and will support the development of robust AI projects. The project will also initiate a series of public AI data challenges to foster novel AI innovation and solve clinically relevant RT problems. The success of this project will enable a modernized and integrated biomedical data ecosystem for public use of RT data for AI model building. The proposed benchmark datasets will provide a foundation to achieve the long-term goal of personalized medicine for head and neck cancer patients using AI to reduce side effects while maintaining high cure rates. This supplement will positively impact patients by enabling the characterization of malignancy for improved therapeutic intervention and downstream translational application of AI technologies. | NIDCR |
Grundberg, Elin | Children's Mercy Hospital | Contextualizing and Addressing Population-Level Bias in Social Epigenomics Study of Asthma in Childhood This study is developing novel approaches to quantify representativeness of study cohorts with respect to the communities from which they were drawn and create a standardized scorecard to convey intrinsic biases that must be considered when designing analyses and interpreting generalizability of AI/ML results. Artificial intelligence (AI) and machine learning (ML) models are subject to biases inherent in the data used to train them. Efforts to mitigate bias have focused largely on study designs and the implementation of fair quality checks. However, as focus is placed on the generalizability of these models, it is critical to contextualize the representativeness of study data used for modeling with respect to the population on which insights are intended to be used. Currently, research emphasizes descriptions of study cohorts, highlighting on whom analyses were performed. However, summary statistics cannot provide the granularity needed to identify potential bias brought on by diverse populations and limited sample sizes. Given the impacts of recruitment and selection bias, we are developing metrics to benchmark alignment of study participants against a geographic reference (i.e., subjects in the same geographic region who met inclusion criteria but were not enrolled). These metrics capture both geographic factors (e.g., measuring whether study recruitment was well distributed across the set of eligible patients) and comparisons between distributions of sociodemographic factors (e.g., identifying whether recruited families have significantly higher household incomes when compared to a matching eligible population). In parallel we are developing metrics of data stability across subgroups of a study cohort and across recruitment periods. Given that many AI/ML models focus on modeling subspaces, simply utilizing all data that meet inclusion criteria can obfuscate imbalances. For example, race may only be missing in 5% of patients, but understanding that all missing instances come from those aged 10-15 is critical for reliable use of model results. To address some of this bias, we are developing population-aware imputation techniques. It is common to perform imputation prior to utilizing AI/ML models. Yet, when study sampling and/or missingness is not balanced by geographic region, imputation estimates for data in subgroups with a high degree of missingness are driven by relationships from subgroups with more complete information, creating a potential for bias in scenarios where these subgroups differ in a fundamental way. We hypothesize that novel techniques can be developed to better account for the similarity of patients, in turn providing more precise estimates of missing data across subgroups that better reflect exposures. | NIMHD |
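A minimal sketch of one candidate representativeness metric of the kind described above: the standardized mean difference (SMD) between enrolled participants and the eligible-but-not-enrolled reference population for a sociodemographic variable. The data and column names are hypothetical.

```python
# Sketch: standardized mean difference between enrolled participants and an
# eligible-but-not-enrolled reference. Data and columns are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
enrolled = pd.DataFrame({"income": rng.normal(75_000, 15_000, 300)})
eligible = pd.DataFrame({"income": rng.normal(62_000, 18_000, 3_000)})

def smd(a, b):
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

score = smd(enrolled["income"], eligible["income"])
print(f"income SMD = {score:.2f}")   # |SMD| > 0.1 often flags imbalance
```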
Hsieh, Evelyn | Yale University | Use of Optical Character Recognition (OCR) to Enable AI/ML-Readiness of Data from Dual-Energy X-ray Absorptiometry (DXA) Images. The current study proposes to utilize optical character recognition to convert dual-energy x-ray absorptiometry images (DXA scan images) from the VA Connecticut Healthcare System to text documents. This conversion makes the DXA scan images compatible with the natural language processing/machine learning pipelines that our team has developed for extracting bone mineral density-related data from text summaries in the patient medical record. Fragility fractures caused by osteoporosis are costly to individuals and healthcare systems. A key limitation of artificial intelligence/machine learning research focused on osteoporosis has been the difficulty capturing bone mineral density (BMD) data. BMD data do not exist as structured data fields. Furthermore, to use data from dual-energy x-ray absorptiometry (DXA), current methods rely on machine learning (ML) algorithms that draw information from text summaries in the patient medical record. In our preliminary work as part of our parent R01, we have found that while valuable, restricting the ML algorithm to radiologist-generated DXA summaries limits our ability to capture accurate, reliable DXA-related data from the EHR. Text summaries are vulnerable to inaccuracies and lack clarity. In addition, when patients are referred to external facilities for their DXA scans, the information is returned in PDF format and scanned into the EHR. Imaged documents are not accessible to the text-based algorithms often used in Natural Language Processing (NLP)/ML applications. A superior source of these data is the original reports generated by DXA machines. These reports use a consistent, tabular format to present BMD and T-score results at the three regions of interest in osteoporosis research (lumbar spine, total hip, and femoral neck). In collaboration with colleagues at Vanderbilt University Medical Center (VUMC) and the VA Tennessee Valley Healthcare System (VATVHS), we propose to extend the use of a novel optical character recognition (OCR) system developed by their team to extract tabular BMD data directly from machine-generated and PDF versions of DXA reports. This proof-of-concept study will leverage our team's existing work with EHR data from the 2019 VA National Cohort, a national cohort of all Veterans who receive care within the VA system. This work is an important first step towards having accurate, valid, and reliable BMD data to include in the parent R01 and will have important applications for future projects exploring BMD among Veterans, an important and often underserved population. In addition, it makes tabular data from images more easily available and accessible for researchers to apply AI/ML approaches. | NIAMS |
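A minimal sketch of the general idea, OCR on a scanned DXA report followed by extraction of BMD/T-score rows. This is a generic open-source pipeline, not the VUMC/VATVHS OCR system the project extends; the file name and the report's exact row layout are assumptions.

```python
# Sketch: OCR a scanned DXA report PDF, then regex out BMD/T-score rows.
# Generic pytesseract pipeline; file name and row format are assumptions.
import re
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("dxa_report.pdf", dpi=300)
text = "\n".join(pytesseract.image_to_string(p) for p in pages)

# e.g., a line like "Femoral Neck   0.812   -1.9"
row = re.compile(r"(Lumbar Spine|Total Hip|Femoral Neck)\s+([\d.]+)\s+(-?[\d.]+)")
for region, bmd, t_score in row.findall(text):
    print(f"{region}: BMD={bmd} g/cm2, T-score={t_score}")
```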
Jones, Bryan William | University Of Utah | Retinal Circuitry Retinal connectomics allows us to study how neural systems are wired together and has the potential to teach us about how vision works, how vision fails in disease, and how to better design computational systems through biologically inspired computing. Our lab studies retinal connectomics. Connectomes are Rosetta Stones for discovering how neural systems are wired; they reveal how neural systems are structured and function, and inform us how neural circuits are altered by disease. We use retina as a window to understand normal brain wiring as well as blinding diseases like retinitis pigmentosa (RP), age-related macular degeneration (AMD), and glaucoma, as well as brain diseases like Alzheimer's. To study these complex networks, we use electron microscopes to reveal the neurons, glia and all of the connections between these cells, assembling massive databases that reveal normal circuit topology frameworks as well as disease frameworks. These databases allow comparisons of normal networks to the pathological networks that emerge in disease. Prior research efforts from our lab have unmasked unexpected, pervasive complexities in mammalian retinal networks, informing neuronal modeling and understanding of the problems behind vision rescue. These efforts have also revealed that neural systems, while complex, are extraordinarily precise. There are no wiring errors in normal, healthy tissue. This was a surprise because we thought there would be lots of "biological noise". There is not. Connections and the partnerships of those connections are precise. The other surprising finding is that when neural systems fail, they fail in predictable, precise ways, making wiring errors that were unexpected in terms of their predictability. This supplement funds tool development to enable sharing of specific aspects of our datasets that are in great demand from the artificial intelligence/machine learning (AI/ML) community. Our work has built not only connectomics infrastructure for datasets, but the annotation work within the datasets themselves also provides a valuable ground truth to feed AI/ML approaches as training data. While our annotated databases have been the highest-resolution connectomics databases yet available, allowing discrimination of synapses and gap junctions as well as organelle data, we have lacked good tools to subset these connectomes to feed AI/ML approaches for detection of synaptic or sub-cellular features desired by the AI/ML community. The tools deriving from this supplement will interface with our open-source datasets, providing the entire connectomics community access to rich, validated, ground-truth data for AI/ML training and mining. | NHGRI |
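A minimal sketch of the subsetting task described above, extracting labeled training patches centered on annotated synapse coordinates from a volume; the volume layout, dataset names, and patch size are assumptions about how such a tool might work.

```python
# Sketch: subset an annotated EM volume into training patches centered on
# annotated synapse centroids. Dataset names and sizes are assumptions.
import h5py
import numpy as np

patch = 64  # cubic patch edge length (voxels)

with h5py.File("connectome_volume.h5", "r") as vol, \
     h5py.File("synapse_patches.h5", "w") as out:
    em = vol["em"]                            # (z, y, x) EM volume
    coords = vol["synapse_centroids"][:]      # (n, 3) annotated centroids
    patches = []
    for z, y, x in coords.astype(int):
        block = em[z:z + patch, y:y + patch, x:x + patch]
        if block.shape == (patch, patch, patch):   # skip volume edges
            patches.append(block)
    out.create_dataset("patches", data=np.stack(patches), compression="gzip")
```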
Kane-Gill, Sandra L | University of Pittsburgh at Pittsburgh | (MEnD-AKI) Multicenter Implementation of an Electronic Decision Support System for Drug-associated AKI Enabling the preparation and harmonization of multimodal AI/ML-ready data enriched with social determinants of health and unstructured text data to improve performance of drug-associated acute kidney injury (D-AKI) risk prediction models The goal of the parent project, Multicenter Implementation of an Electronic Decision Support System for Drug-associated AKI (MEnD-AKI), is to assess the effectiveness of a clinical surveillance system augmented with real-time predictive analytics that will support a pharmacist-led intervention aimed at reducing the progression and complications of drug-associated acute kidney injury (D-AKI). Social determinants of health (SDOH) are important drivers of health inequities and disparities, and are responsible for between 30% and 50% of health outcomes. There is a wealth of information contained in routine clinical notes that improves the performance of prediction models for AKI, as shown by recent literature. Our ongoing parent project is not yet utilizing SDOH and clinical notes that carry important information about patient health status and access to health care. The proposed supplement project will develop integration, standardization, and processing tools and pipelines to create AI/ML-ready data using SDOH and unstructured clinical text data. We will develop and assess tools for: 1) extracting, cleaning, and representing SDOH and unstructured text data; 2) integrating SDOH into databases from the University of Florida (UF) and University of Pittsburgh (UPitt) for use in D-AKI risk model development and validation; and 3) integrating clinical data and notes to prepare multimodal AI/ML-ready data at UF. We will acquire SDOH data in the domains of economic stability, education access and quality, healthcare access and quality, neighborhood and built environment, and social and community context. Using 9-digit ZIP and/or 11-digit Federal Information Processing Standards (FIPS) codes, we will link patient data to SDOH data. Our developed risk model will be enriched by incorporating SDOH variables into the model, and we will evaluate the bias of our developed AKI risk model by subgroup analysis based on SDOH variables. Natural Language Processing (NLP) tools will be employed to extract medical concepts, medications, and SDOH from clinical notes. For medical concepts unable to be extracted using existing tools, we will develop our own NLP tool using the Bidirectional Encoder Representations from Transformers (BERT) model. The completion of this supplemental project will provide multimodal AI/ML-ready data with additional data elements to improve the efficiency and performance of the D-AKI risk prediction models and other AI applications. | NIDDK |
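A minimal sketch of the geographic linkage step described above, joining patient records to area-level SDOH measures on zero-padded codes; the column names and SDOH table are hypothetical.

```python
# Sketch: link patient records to area-level SDOH measures on zero-padded
# geographic codes. Column names and the SDOH table are hypothetical.
import pandas as pd

patients = pd.DataFrame({"mrn": [101, 102], "zip9": ["152131234", "326010042"]})
sdoh = pd.DataFrame({
    "zip9": ["152131234", "326010042"],
    "median_income": [48_000, 61_000],
    "pct_no_vehicle": [0.18, 0.07],
})

# Codes must stay strings: casting to int silently drops leading zeros.
patients["zip9"] = patients["zip9"].str.zfill(9)
sdoh["zip9"] = sdoh["zip9"].str.zfill(9)

linked = patients.merge(sdoh, on="zip9", how="left")
print(linked)
```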
Krening, Samantha | Ohio State University | An automated AI/ML platform for multi-researcher collaborations for an NIH BACPAC-funded Spine Phenome Project Transforming data from wearable technology from chronic low-back pain patients and healthy individuals into a machine learning-ready format for the prevention, evaluation, and treatment of spine and musculoskeletal disorders. Chronic Low Back Pain (cLBP) is a debilitating condition that affects millions of people globally. The parent grant (BACPAC) is focused on developing and validating a digital health platform that collects data from wearable technology on patients and healthy individuals. A goal is to diagnose and create treatment plans based on how someone moves rather than current subjective metrics. This project pertains to a data access pipeline and AI/ML workflows that will enable researchers to focus on the ML rather than data cleaning, organization, and processing. The pipeline begins with the collection of data in clinical and research environments from many sources across the BACPAC consortium, and extends through the validation of trained models. We aim to transform these complex, multi-source data into a "machine learning friendly" format to make them easily accessible to all researchers across the BACPAC consortium. The research is likely to lead to novel methods to analyze large, multi-source, time-series datasets. We expect that these analyses will shed new light on the prevention, diagnosis, and treatment of spine and musculoskeletal disorders. | NIAMS |
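A minimal sketch of one step in making multi-source wearable data "machine learning friendly": aligning two streams sampled at different rates onto a single timeline with a nearest-timestamp join. The sensor names and rates are illustrative assumptions.

```python
# Sketch: align two wearable streams sampled at different rates with a
# nearest-timestamp join. Sensor names and rates are illustrative.
import pandas as pd

imu = pd.DataFrame({
    "time": pd.date_range("2024-01-01", periods=500, freq="10ms"),  # 100 Hz
    "trunk_accel": range(500),
})
emg = pd.DataFrame({
    "time": pd.date_range("2024-01-01", periods=250, freq="20ms"),  # 50 Hz
    "emg_rms": range(250),
})

aligned = pd.merge_asof(imu.sort_values("time"), emg.sort_values("time"),
                        on="time", tolerance=pd.Timedelta("20ms"))
print(aligned.head())
```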
Lacey, James V | Beckman Research Institute/City Of Hope | A More Perfect Union: Leveraging Clinically Deployed Models and Cancer Epidemiology Cohort Data to Improve AI/ML Readiness of NIH-Supported Population Sciences Resources This project is evaluating how the unique design and data characteristics of large-scale observational cohort studies affect the use of cohort data in AI/ML models. In theory, large prospective observational cohort studies have great potential to contribute to AI/ML. Cohorts often include large sample sizes; long follow-up; multiple clinical endpoints and phenotypes; and diverse, real-world data on lifestyle, environment, patient-reported outcomes, and social determinants of health, which together could help reduce data bias in AI/ML. In practice, important methodologic and technical questions about how best to use observational-cohort data for AI/ML remain unanswered. We are using our California Teachers Study (CTS), a large cohort study, to address those questions. The CTS includes 133,477 female participants who have been followed continuously since 1995. Through surveys and linkages, the CTS has collected comprehensive exposure and lifestyle data and has identified over 28,000 cancers; over 34,000 deaths; and over 800,000 individual hospitalizations. Our team includes AI/ML experts at City of Hope (COH); population-science researchers from the CTS team; and cloud computing specialists from the San Diego Supercomputer Center's (SDSC) Sherlock Cloud. In partnership with SDSC, our CTS research data management lifecycle strategy has been cloud-first since 2015, but, like most cohort studies, that lifecycle was designed for investigator-initiated and project-specific analyses rather than larger-scale AI/ML. We are addressing this "readiness gap" in cohort data by expanding our data and computing infrastructure and architecture; reconfiguring data exploration and aggregation tools and documentation; and testing a clinically deployed AI/ML model in CTS data. We are deploying the Amazon SageMaker platform within our secure and controlled-access CTS cloud environment. We are generating embeddings that will cluster CTS data into phenotype-based subgroups that can be used for essential AI/ML functions, such as cohort discovery, close-neighbor identification, and imputation. We are evaluating the performance of a previously deployed risk model in CTS data to assess the potential for real-world cohort data to improve model performance. Our project's combination of multidisciplinary experts from relevant fields, new embedding representations of observational cohort data, and a secure cloud infrastructure configured for AI/ML is generating valuable insights and lessons learned about the use of NIH-supported cohort data in AI/ML applications. | NCI |
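A minimal sketch of the close-neighbor identification and imputation functions described above, using nearest neighbors over embedding representations of participants; the embeddings here are random stand-ins, and the dimensions and neighbor count are illustrative.

```python
# Sketch: close-neighbor identification and neighbor-based imputation over
# participant embeddings. Embeddings are random stand-ins; k and the
# embedding dimension are illustrative assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1334, 64))         # subsampled cohort

nn = NearestNeighbors(n_neighbors=6).fit(embeddings)
_, idx = nn.kneighbors(embeddings[:1])           # a query participant

# Impute a missing value as the mean over the 5 closest neighbors
# (excluding the participant herself at idx[0][0]).
bmi = rng.normal(27, 4, embeddings.shape[0])
print("imputed BMI:", bmi[idx[0][1:]].mean())
```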
Levey, Allan I | Emory University | Piloting a web-based neuropathology image resource for the ADRC community In order to benefit from recent advances in AI/ML to better understand Alzheimer's disease and related disorders, we have developed an open-source platform to visualize, annotate, analyze, and share large neuropathology data sets and an accompanying data schema to facilitate data-sharing. Alzheimer's Disease (AD) is the most common dementia, affecting over 45 million people globally. Sharing data between institutions is critical to developing new understanding and treatments. There is great potential for artificial intelligence/machine learning (AI/ML) techniques to be applied in neuropathology, but sharing of digital pathology images is uncommon. We have developed the Digital Slide Archive (DSA), an open-source web-based platform to visualize, annotate, and analyze neuropathology imaging datasets. The DSA platform, funded through a U24 and a U01 grant from the NCI/NIH, has primarily been optimized for cancer-related image analysis workflows. Our goal was to collect digital slide sets from AD centers to develop tools and data models to facilitate data sharing. Neuropathologic evaluation of brain tissue is central to the diagnosis and staging of these diseases, but the raw histology data is not widely shared within this community. The increasing availability of whole slide imaging systems now makes data sharing feasible, although there are numerous technical hurdles making this challenging. We have collected neuropathology datasets from 7 universities, and used these images, in conjunction with 500 cases scanned at Emory University, to develop an initial metadata schema to facilitate data sharing. The lack of standard file formats, inconsistent naming schemas, image de-identification, and the enormous size of these images are ongoing challenges. We have developed a set of tools to facilitate large-scale image de-identification and metadata cleanup, which we are continually improving for this work. We have also developed a customized interface to facilitate efficient viewing of neuropathology cases. The interface, leveraging the underlying metadata schema, makes it much easier for pathologists to scan through cases, and view annotations and analysis results. The DSA also has a companion set of analysis tools, HistomicsTK, which we have developed to support whole slide image analysis. We are tuning these algorithms to support detection of nuclei, neurofibrillary tangles (NFT), and phospho-TDP inclusions as demonstration workflows. We are developing tutorials and a set of benchmarks to characterize the performance of these algorithms. | NIA |
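A minimal sketch of programmatic access to a DSA deployment through its Girder REST API via the girder_client package; the server URL, folder ID, and API key are placeholders, and the annotation endpoint assumes the DSA's annotation plugin is installed.

```python
# Sketch: query a Digital Slide Archive server through its Girder REST API.
# URL, folder ID, and API key are placeholders; the "annotation" endpoint
# assumes the DSA annotation plugin is installed.
import girder_client

gc = girder_client.GirderClient(apiUrl="https://example-dsa.org/api/v1")
gc.authenticate(apiKey="MY_API_KEY")          # placeholder credential

for item in gc.listItem("FOLDER_ID_WITH_SLIDES"):
    # List annotations (e.g., NFT markups) attached to each slide item.
    annotations = gc.get("annotation", parameters={"itemId": item["_id"]})
    print(item["name"], "annotations:", len(annotations))
```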
Liang, Rongguang | University of Arizona | Improving AI/ML-Readiness of data generated from NIH-funded research on oral cancer screening Creating a reliable and accurate AI/ML model for oral cancer diagnosis by generating a clean, well-annotated dataset from raw data obtained from oral cancer screening. The potential of AI/ML algorithms in solving cancer analysis tasks is well established, but to develop high-performance AI/ML algorithms and applications for automatic oral cancer diagnosis, a well-annotated, compatible, and clean dataset with a sufficient number of images is essential. Supported by NIH 3-R01-DE030682-01, we have developed and evaluated three generations of mobile dual-mode imaging devices, screened over 7,000 patients in India since 2019, and generated a dataset with at least 28,000 images and related information that represents an excellent resource for developing highly accurate and efficient AI/ML algorithms and applications for oral cancer diagnosis. The primary goal of this project is to transform the raw data collected through NIH-funded research on oral cancer screening into a clean, compatible, and well-annotated dataset that can be readily processed using AI/ML tools. As popular AI/ML tools like PyTorch and TensorFlow are incompatible with string-based inputs, we convert patient information, such as lesion sites, tobacco use history, sex, age, and other subjective descriptions, into an AI/ML compatible format. To ensure data cleanliness, we identify out-of-distribution (OOD) and low-quality images in the image data and rectify missing values and typos in the patient information data. Oral cancer datasets obtained from high-risk populations often suffer from imbalanced data, leading to AI/ML model bias. To minimize the impact of model bias, we employ data-level and algorithm-level methods to balance the dataset. Since the data is annotated manually, labeling errors are inevitable, which can significantly impact the performance of AI/ML models. To address this issue, we use confident learning to identify mislabeled datapoints and study their effects on the stability of AI/ML models. The key objective of converting the raw data into an AI/ML compatible dataset is to facilitate the creation of dependable and precise AI/ML algorithms for automated oral cancer diagnosis. Building on this curated multi-modal dataset, we are designing AI/ML models that offer high levels of accuracy, interpretability, and reliability. This project holds significant potential for accelerating the development of AI/ML-based techniques for early oral cancer detection in low-resource settings, which could ultimately help reduce morbidity and mortality rates associated with this disease. | NIDCR |
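A minimal sketch of the confident-learning step described above, flagging probable label errors with the cleanlab package from out-of-fold predicted probabilities; the synthetic features, classifier, and injected errors are illustrative.

```python
# Sketch: flag probable label errors with confident learning (cleanlab).
# Features, classifier, and injected errors are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                 # image/lesion features
labels = (X[:, 0] > 0).astype(int)
labels[:10] = 1 - labels[:10]                  # inject 10 label errors

pred_probs = cross_val_predict(LogisticRegression(), X, labels,
                               cv=5, method="predict_proba")
issues = find_label_issues(labels=labels, pred_probs=pred_probs)
print("flagged:", issues.sum(), "suspect labels")
```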
Linden, David R. | Mayo Clinic Rochester | Neurobiology of Intrinsic Primary Afferent Neurons Developing computational tools to analyze the structure of nerve cells in the bowel to better understand digestive disease. The bowel has independent neural reflexes initiated by intrinsic primary afferent neurons (IPANs) in the enteric nervous system (ENS) that sense meals and pathogens to alter vascular, secretory, and motor function. Unique IPAN neuroplasticity adapts to inflammatory, hormonal and neural stimuli to cause digestive disease. Diagnoses and therapeutic decisions rely on symptom classifications and, in some cases, thin-section pathology. 3D assessment of the ENS has the potential to transform digestive disease diagnosis, but solutions to the technical and labor-intensive barriers to adoption are needed. The objective of the parent grant, R01 DK129315, is to test the overall hypothesis that different classes of IPANs possess morphologies and physiology that uniquely contribute to intestinal function. With this supplement, we add additional approaches to Specific Aim 1 to incorporate artificial intelligence into the data analyses. Importantly, these additions will 1) accelerate throughput of data analysis from that anticipated in the parent grant application, 2) ensure datasets will be freely available to all and are ready for AI analyses by other investigators, 3) develop AI tools and ground truth annotated datasets for enteric neurobiology that will be freely available, and 4) expand analyses into human enteric neurons, a major step toward translation of findings anticipated in the parent grant. There are three revisions to our approach to Specific Aim 1: 1. Develop AI tools to acquire and analyze large-volume, high-resolution confocal microscopy data. 2. Develop deep learning approaches to segment, annotate and analyze morphological features of mouse IPANs. 3. Image the human enteric nervous system and apply deep learning tools to human tissues. AI technology may accomplish the goal of rapid and objective ENS annotation. Improving human ENS labeling for in situ and in vivo imaging will allow the deployment of our preclinical AI tools to transform the future of clinical gastrointestinal pathology. Further, the tools that we are developing may have broad scientific interest given the ongoing efforts to map the landscape of both the central and peripheral nervous systems. | NIDDK |
Majumdar, Amitava | University of California, San Diego | Neuroscience Gateway to Enable Dissemination of Computational And Data Processing Tools And Software. With a focus on end-to-end ML workflows, this project implements a standardized, specification-based provenance ontology in Neuroscience Gateway (NSG) software for EEG data analysis, integrated with a blockchain-based approach for independent verification of the data. The supplement project is developing a standardized provenance metadata framework to make NSG data sets and tools Findable, Accessible, Interoperable, and Reusable (FAIR). Since 2012, NSG has catalyzed progress in neuroscience by reducing technical and administrative barriers that neuroscientists face in large-scale modeling and data processing, which require high performance computing. NSG provides about twenty neuroscience software packages and tools that are used for neuronal modeling, data (EEG, fMRI, etc.) processing, and AI/ML work. NSG's user base is growing, with over 1500 registered users currently. The NSG team acquires time on academic supercomputers yearly and makes it available fairly to NSG users. NSG has been enhanced by adding new features, making it an efficient environment for dissemination of lab-developed neuroscience software and tools. This supplement project is a first step towards recording metadata, or provenance, for developing a standards-based provenance metadata framework for any of the NSG tools and the data sets they produce, to allow NSG resources to be used in reproducible machine learning (ML) workflows. Provenance metadata is critical for supporting scientific reproducibility by implementing FAIR principles. This project integrates the World Wide Web Consortium PROV standard based provenance ontology called ProvCaRe in NSG to allow users to record provenance metadata using standardized ontology classes. The use of the ProvCaRe ontology is expected to reduce term variability and improve performance of ML workflows that rely on metadata terms for reproducibility. We demonstrate the integration and application of the ProvCaRe ontology using a neuroscience software tool called the NeuroIntegrative Connectivity (NIC) tool, which analyzes high-fidelity brain recordings to compute functional brain networks in neurological disorders. The NIC tool has provenance metadata characteristics built into it, and is the first NSG tool to carry the provenance metadata information from the beginning to the end of a dataset's lifecycle. In the second phase of this supplemental project, we utilize the Open Science Chain (OSC) project to provide a blockchain-based solution to maintain the integrity of datasets and their provenance metadata. The supplement project will allow us, in the future, to integrate provenance metadata information for other NSG tools. This makes NSG comprehensively more AI/ML ready. | NIBIB |
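A minimal sketch of recording W3C PROV provenance for one NIC-style analysis run, using the Python prov package; the namespace and entity/activity names are placeholders standing in for the ProvCaRe ontology classes the project adopts.

```python
# Sketch: record W3C PROV provenance for one analysis run with the Python
# `prov` package. Namespace and names are placeholders standing in for the
# project's ProvCaRe ontology terms.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("nsg", "http://example.org/nsg#")   # placeholder namespace

eeg = doc.entity("nsg:eeg-recording-001")
network = doc.entity("nsg:functional-network-001")
run = doc.activity("nsg:nic-analysis-run-001")

doc.used(run, eeg)                       # the run consumed the EEG data
doc.wasGeneratedBy(network, run)         # and produced the network dataset
print(doc.get_provn())                   # serialize in PROV-N notation
```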
McNeil, Rebecca Boehm | Research Triangle Institute | Continuation of the NuMoM2b Heart Health Study Transforming longitudinal pregnancy and cardiovascular health data and documentation into formats that are easier to use in machine learning models to identify risk factors for adverse pregnancy outcomes and early markers of cardiovascular disease The Nulliparous Pregnancy Outcomes Study: Monitoring Mothers-to-Be (nuMoM2b; NICHD, 2009-2014) initiated a longitudinal cohort of 10,038 nulliparous women prospectively enrolled early in their first pregnancy that has continued with the nuMoM2b Heart Health Study (nuMoM2b-HHS; NHLBI and NICHD, 2014-2020) and current Continuation of the nuMoM2b Heart Health Study (nuMoM2b-HHS2; NHLBI, 2020-2027). The racially and socioeconomically diverse cohort participants have rich and accurate records-based phenotyping of their first pregnancy and associated outcomes, including risk factors for adverse pregnancy outcomes (APOs) and cardiovascular disease (CVD) measured in early pregnancy, with subsequent examinations during the early postdelivery period. As part of the cohort’s longitudinal follow-ups, a planned subgroup of 4,508 participants have ascertainment of APOs in additional pregnancies, assessments for risk factors and subclinical and clinical CVD, and completed laboratory assays on stored biospecimens. Omics analysis results, including GWAS, WGS, methylation, plasma proteomics, exosome proteomics, and placental RNAseq are also available or are in development for subgroups of the participants. Continued participant follow-ups and separately funded ancillary studies will further expand the duration and scope of data collection. With the support provided by this administrative supplement, the full nuMoM2b and nuMoM2b-HHS data, including omics data not yet deposited, will be moved onto BioData Catalyst and brought into alignment with FAIR principles. Upon completion, the data will be located on a single platform with pipeline and analytic tool support, and datasets will be machine-readable with relevant clinical ontologies assigned, with user-friendly metadata and documentation, and with future-ready processes for conversion of newly gathered data to AI/ML ready status. These steps will result in data ready for use in AI/ML analyses as well as traditional epidemiologic models and will enhance data accessibility to members of the research community, maximizing the potential scientific knowledge gain from the existing and future data contributed by this truly remarkable longitudinal cohort. | NHLBI |
Mellins, Claude Ann | New York State Psychiatric Institute | Pathways to successful aging among perinatally HIV-infected and exposed young adults: Risk, resilience, and the role of perinatal HIV infection Spanning 20 years and 10 waves of data collection, this supplement will create an accessible, shareable, accurate, and harmonized database for CASAH, comprising Social Action Theory-informed variables across behavioral, psychiatric, psychosocial, neurocognitive, milestone-achievement, and medical factors. The CASAH study has followed 340 predominantly Black and Latino/a youth with perinatally acquired HIV (PHIV) and youth perinatally HIV exposed but uninfected (PHEU) for 20 years – enrolled at ages 9-16 years from vulnerable communities in New York City – documenting health risk and resilience across childhood, adolescence, and emerging adulthood. CASAH is guided by Social Action Theory (SAT), examining: 1) the impact of HIV infection on behavioral health outcomes (e.g., mental health, sexual risk, substance use, adherence) and achievement of adult milestones (e.g., education, vocation, independence); 2) how SAT-informed risk and protective factors affect behavioral health and achievement of milestones in adolescence and young adulthood (AYA); and 3) trajectories of behavioral health across AYA and SAT-informed predictors of these trajectories. Currently in its fourth competing continuation (R01 MH69133-19), CASAH 4 is following this cohort through young adulthood (20s-early 30s). CASAH is one of the most comprehensive longitudinal datasets on adolescents and young adults living with PHIV or PHEU – ideal for machine learning (ML) and sharing. However, gaps in the full CASAH dataset (e.g., unprocessed data, unscaled variables) and in data documentation act as barriers to using it for ML and sharing it via NIH-supported data repositories. The aims of this data readiness administrative supplement are to prepare (i.e., collate, clean, and evaluate) the full CASAH dataset (spanning 10 waves of data collection) and create comprehensive documentation of all data elements (i.e., provenance, missingness, utilized scales, etc.), while ensuring data can be easily and securely transported and utilized. ML data analytic approaches will also be tested on the finalized data to confirm completeness, accuracy, and usability. This supplement will not only advance our knowledge of health outcomes and social determinants of health in vulnerable young people affected by HIV, but will also allow for cross-cohort studies with non-HIV populations (e.g., the National Longitudinal Study of Adolescent to Adult Health; the Boricua Youth Study; ABCT). Overall, this work will strengthen the ability to identify multimodal determinants of health across critical developmental stages, aiding the development of evidence-based interventions for youth with HIV, chronic health conditions, and a range of health disparities. | NIMH
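One concrete piece of the planned documentation is a per-wave missingness summary. A minimal pandas sketch, with hypothetical variables, might look like this:

```python
# Minimal sketch: fraction of missing values per variable within each wave,
# the kind of summary that feeds a codebook. Columns are illustrative.
import pandas as pd

df = pd.DataFrame(
    {
        "wave": [1, 1, 2, 2],
        "mental_health_score": [12.0, None, 9.0, 11.0],
        "substance_use": [0, 1, None, 0],
    }
)

missingness = df.drop(columns="wave").isna().groupby(df["wave"]).mean()
print(missingness)
```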
Mirmira, Raghavendra G | University of Chicago | The Integrated Stress Response in Human Islets During Early T1D Preparing complex omics datasets for machine learning applications to identify early warning signs of type 1 diabetes that can be measured through minimally invasive tests. In the parent project, we hypothesized that activation of a specific integrated stress response is an early cellular response in type 1 diabetes (T1D) that determines pancreatic β-cell survival and can be monitored in pre- and early-T1D individuals with minimal invasiveness. To test this hypothesis, we have assembled a multidisciplinary team collecting large suites of heterogeneous biological data, including mRNA, lipid, protein, and immune system profiles, from individuals at various stages of T1D, as well as healthy controls. Analyses of these data are yielding panels of potential T1D biomarkers associated with cellular stress in human pancreatic islets. The data collected under the parent project, as well as data from prior collaborations studying genetically at-risk children, such as the Diabetes AutoImmunity Study in the Young (DAISY) and The Environmental Determinants of Diabetes in the Young (TEDDY), are excellent candidates for “flagship” Artificial Intelligence/Machine Learning (AI/ML)-ready datasets. In the supplement grant, we focus on developing AI/ML-ready data to tackle the challenges of processing large heterogeneous datasets while identifying molecular signatures of T1D. Our first task focuses on generating AI/ML-ready datasets that are properly annotated to address the main data processing challenges: missing data and the introduction of bias. The second task focuses on generating AI/ML-ready datasets that can be used to establish molecular biomarkers of disease. To allow other AI/ML researchers to use the machine learning datasets efficiently, we will provide detailed information about their performance and intended uses in model cards. Our goal is to create reusable software approaches and data packages that can be directly imported into common AI/ML packages. We will share these with the broader AI/ML research community, gather feedback, and continue to refine the AI/ML readiness software development plan. Datasets are released on the ‘AI/ML Ready Datasets for Type 1 Diabetes’ platform (https://data.pnnl.gov/group/nodes/project/33480). | NIDDK
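A model card can be as simple as a structured JSON document released alongside each dataset or model. A minimal sketch, with illustrative fields and placeholder values rather than the project's actual cards:

```python
# Minimal sketch: a model card as a plain dictionary saved as JSON. Fields
# follow common model-card practice; all values here are placeholders.
import json

model_card = {
    "name": "t1d-stress-signature-classifier",
    "intended_use": "Research-only ranking of candidate T1D stress biomarkers",
    "training_data": "Multi-omic profiles (mRNA, lipid, protein, immune panels)",
    "known_limitations": [
        "Missing-data imputation may bias low-abundance analytes",
        "Cohorts enriched for genetically at-risk children (DAISY, TEDDY)",
    ],
    "metrics": {"auroc_validation": 0.81},  # placeholder value
}

with open("model_card.json", "w") as fh:
    json.dump(model_card, fh, indent=2)
```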
Musen, Mark A | Stanford University | Improved metadata authoring to enhance AI/ML readiness of associated datasets Re-engineering technology to promote the sharing of AI-ready data that are findable, accessible, interoperable, and reusable The CEDAR Workbench is technology that makes it easy to describe datasets using metadata that comply with community standards. Through the use of (1) reporting guidelines that enumerate the things that need to be said about an experiment for a scientist—or an AI algorithm—to make sense of what has been done and (2) ontologies that formalize those descriptions, the CEDAR Workbench offers a convenient mechanism for investigators to share their data in a useful way. The standardized metadata that scientists create using CEDAR make datasets more valuable both to people and to machines that might learn from the data to make new discoveries. Many NIH-supported consortia, including those working on the RADx Data Hub for sharing data related to COVID diagnostics, rely on the CEDAR Workbench. To advance the role of CEDAR in the creation of AI-ready datasets, we worked to make CEDAR deployable in the cloud by containerizing all CEDAR microservices and by making those microservices discoverable and observable. Furthermore, we worked to make CEDAR a highly available system that is easy to maintain and evolve. We simplified and enhanced the system’s architecture, taking advantage of new approaches and components that were not available when the system was first designed. As a result, CEDAR is now much more scalable, maintainable, and deployable. The new architecture will help investigators create standards-adherent metadata more easily, advancing the application of AI techniques to a wide range of data of importance to the biomedical community. | NLM
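For readers unfamiliar with CEDAR-style metadata, the sketch below shows the general shape of a standards-adherent record in which free-text fields are bound to ontology terms; the template fields and the assay IRI are illustrative placeholders (only the NCBITaxon IRI is a real term).

```python
# Minimal sketch: a metadata record of the kind CEDAR templates elicit, with
# each field carrying a human-readable label plus an ontology term IRI.
import json

metadata_instance = {
    "dataset_title": "Airway epithelium scRNA-seq, donor 12",
    "organism": {
        "label": "Homo sapiens",
        "iri": "http://purl.obolibrary.org/obo/NCBITaxon_9606",
    },
    "assay": {
        "label": "single-cell RNA sequencing",
        "iri": "obo:PLACEHOLDER_ASSAY_TERM",  # replace with the real OBI term
    },
}
print(json.dumps(metadata_instance, indent=2))
```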
Neelamegham, Sriram | State University of New York at Buffalo | Application of machine/deep-learning to the systems biology of glycosylation Machine learning tools and complementary datasets to study cellular glycosylation: their role in human physiology and blood disorders The NHLBI grant “Systems Biology of Glycosylation” applies bioengineering approaches to study blood cell glycosylation from both a basic science and translational perspective. The goal is to develop a quantitative link between the cellular transcriptome and epigenetic status and the resulting glycosylation profile. To achieve this, two types of perturbation experiments are performed. In the first, CRISPR-Cas9/sgRNA is used to implement defined system perturbations, and the resulting changes in the cellular glycome are measured. This represents the ‘labeled dataset’. In the second, biochemical stimuli are applied to perturb cell state, and again cell glycosylation status is measured. This is the ‘unlabeled dataset’, as the perturbation is imprecise. In each case, several experimental output measurements, or ‘features’, are collected, including: 1) single-cell next-generation sequencing (NGS) measurements of the cellular transcriptome; 2) lectin binding using spectral flow cytometry; and 3) mass spectrometry to obtain detailed glycan structure data. Mathematical methods are developed to fuse results from these different methods and derive input-output responses. Currently, such modeling relies on prior biochemical knowledge curated in pathway maps, linear mixed models, and explicit programming. As an alternative to this traditional approach, this supplement develops ML/DL (machine learning/deep learning) supervised learning models to analyze the same data. Successful completion of this project will confirm the value of ML/DL modeling in the study of blood cells and glycoscience applications, particularly in a multi-omics context. It will reveal whether cellular regulatory pathways discovered using blood cells can be generalized to other cell/tissue/organ systems. The identification of key markers/checkpoints of glycosylation also has translational significance, as it can inform both patient stratification in clinical trials and precision medicine applications. | NHLBI
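As a rough illustration of the supervised-learning setup described above, the sketch below fuses synthetic transcriptome, lectin-binding, and glycan mass-spec feature blocks and fits a classifier to perturbation labels with scikit-learn; the real models, features, and labels will differ.

```python
# Minimal sketch: feature-level fusion of multi-omic blocks plus a supervised
# classifier. All data are synthetic stand-ins for the real measurements.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 120
transcriptome = rng.normal(size=(n, 50))  # e.g., per-cell expression features
lectin = rng.normal(size=(n, 12))         # spectral flow cytometry panel
glycan_ms = rng.normal(size=(n, 30))      # mass-spec glycan abundances

X = np.hstack([transcriptome, lectin, glycan_ms])  # simple feature-level fusion
y = rng.integers(0, 2, size=n)            # CRISPR perturbation label ("labeled set")

scores = cross_val_score(RandomForestClassifier(n_estimators=200), X, y, cv=5)
print(scores.mean())
```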
Nelson, Amanda E | University of North Carolina Chapel Hill | Development of an AI/ML-ready knee ultrasound dataset in a population-based cohort Improving utilization of ultrasound for knee osteoarthritis assessment with an AI/ML-ready dataset Knee osteoarthritis (KOA) is highly prevalent and frequently debilitating. Development of potential treatments has been hampered by the heterogeneous nature of this common chronic condition, which is characterized by several subgroups, or phenotypes, with different underlying pathophysiological mechanisms. Imaging, genetics, biochemical biomarkers, and other features can be used to characterize phenotypes, but variations in data types can make it difficult to harmonize definitions. Ultrasound is a widely accessible, time-efficient, and cost-effective imaging modality that can provide detailed and reliable information for all joint tissues. Applying deep learning methodology to discover ultrasound features associated with pain and radiographic change in KOA is highly innovative and will be a major step forward for the field. We will leverage standardized ultrasound images from the diverse and inclusive population-based Johnston County Health Study (JoCoHS), the new enrollment phase of the 30-year Johnston County OA Project, which includes Black, White, and Hispanic men and women aged 35-70. We will utilize deep learning to identify features of ultrasound images that are associated with aspects of knee OA while also generating an AI/ML-ready FAIR dataset for use by the research community. By developing and maintaining an AI/ML-ready repository of standardized ultrasound images from this generalizable cohort, we can enhance the uptake of this modality and contribute to further study of its use in OA worldwide, including in low-resource settings and across populations. | NIAMS
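A deep-learning model for this task could start from something as simple as the small convolutional network sketched below; the architecture, class count, and image size are illustrative only, not the project's actual model.

```python
# Minimal sketch: a small CNN of the kind trainable on standardized grayscale
# knee ultrasound frames. Everything here is illustrative.
import torch
import torch.nn as nn

class KneeUSNet(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes)
        )

    def forward(self, x):
        return self.head(self.features(x))

# One batch of four 256x256 single-channel ultrasound frames.
logits = KneeUSNet()(torch.randn(4, 1, 256, 256))
print(logits.shape)  # torch.Size([4, 2])
```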
Rhee, Kyu Y | Weill Medical College of Cornell University | Towards AI/ML-enabled molecular epidemiology of Mycobacterium tuberculosis The project aims to develop novel deep learning approaches for inferring important epidemiological parameters and events from large whole-genome sequence datasets of Mycobacterium tuberculosis, and to provide FAIR-curated, comprehensively annotated datasets and open-source software to the AI/ML research community. This project aims to utilize artificial intelligence (AI) and machine learning (ML) to analyze the whole-genome sequencing (WGS) dataset of clinically annotated Mycobacterium tuberculosis (Mtb) isolates generated through the parent award, with the goal of identifying novel potential transmission-blocking targets and vaccine antigens. Our focus will be on the phylodynamics of Mtb, a statistical framework that infers disease dynamics from genetic data by leveraging the evolutionary tree of disease agents. This approach aims to enhance our understanding of genes associated with Mtb transmission and evolution, thereby facilitating the development of targeted control and prevention strategies. (1) We will investigate optimal data representations for phylodynamic applications of AI/ML. To make the Mtb WGS dataset suitable for deep learning (DL) approaches, it must be encoded into a machine-readable format. Current genetic data representations for DL, which simplify datasets into summary statistics or images, are not well suited to infectious diseases, where samples are often collected at different timepoints. We will utilize a genealogy of sampled sequences as a new input data structure for phylodynamic applications of DL, as this type of data structure encodes the underlying evolutionary and epidemiological histories of disease dynamics. (2) We will develop likelihood-free, scalable DL/ML frameworks for inferring important epidemiological parameters and mutations of concern, capitalizing on the principles of epidemiology and population genetics. We will then apply our newly developed framework to the Mtb WGS dataset to identify genetic determinants of transmissibility in Mtb and their phenotypic association with the survivability of this pathogen during inter-host transmission in the aerosol phase. (3) We will provide FAIR-curated, comprehensively annotated AI/ML-ready datasets to the research community. Our Mtb WGS dataset, along with other curated datasets, will be standardized, annotated, and documented to rigorous standards, creating the most extensive centralized AI/ML-ready Mtb genetic database to date. This resource will advance the development of new computational methods for the molecular epidemiology of bacterial pathogens and help launch novel discoveries, including meaningful clinical parameters for the control and prevention of Mtb. To promote the use of our database and methods, we will develop, test, and distribute new publicly available open-source software programs. | NIAID
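To illustrate aim (1), the sketch below encodes a toy genealogy as a fixed-length vector of sorted node depths using Biopython; the Newick string and the encoding choice are illustrative, not the project's actual tree representation.

```python
# Minimal sketch: turning a genealogy into a numeric encoding (sorted node
# depths) that a deep-learning model could consume. Illustrative only.
from io import StringIO
import numpy as np
from Bio import Phylo

tree = Phylo.read(StringIO("((A:1.0,B:1.0):0.5,(C:0.8,D:0.8):0.7);"), "newick")

depths = tree.depths()  # root-to-node branch-length distance for every clade
internal = sorted(d for clade, d in depths.items() if not clade.is_terminal())
tips = sorted(d for clade, d in depths.items() if clade.is_terminal())

# Coalescent-style summary: internal-node times followed by tip sampling times.
encoding = np.array(internal + tips)
print(encoding)
```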
Schaefer, Andrew J | Rice University | Administrative Supplement to Support Collaborations to Improve AI/ML-Readiness of NIH-Supported Data for Parent Award SCH: Personalized Rescheduling of Adaptive Radiation Therapy for Head & Neck Cancer We create a rich dataset of MRIs of head-and-neck cancer patients to engage the AI/ML scientific community. Toxicity prediction models for cancer therapy suffer from a lack of reference datasets of known provenance. The proposed project aims to develop a highly curated dataset of multi-observer segmented, multi-parametric, and multi-time-point MRI-CT-RT dose/radiomics data for mid- and post-treatment therapeutic response/tumor control probability (TCP) prediction for head and neck cancer (HNC) patients treated with radiation therapy (RT). The dataset includes prospectively collected pre-, on-, and post-therapy serial imaging and toxicity data for head and neck therapy cohorts. The project seeks to address critical barriers to developing accurate and reliable machine learning (ML)/artificial intelligence (AI)-based models through expert curation of the dataset, multi-observer annotation of known provenance, and standardized post-processing. Successful completion of the proposed efforts will directly impact scientific advances in predictive modeling for cancer treatment by promoting the development of ML-based prediction models for therapy response and normal-tissue complication probability, enabling estimation of the trajectory of normal tissue injury, and applying ML/AI approaches to both segmentation quality and toxicity assessment. | NCI
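One simple ingredient of multi-observer segmentation quality assessment is inter-observer agreement. A minimal sketch of the Dice similarity coefficient on synthetic masks (the real pipeline will operate on curated MRI/CT contours):

```python
# Minimal sketch: Dice similarity between two observers' binary masks.
import numpy as np

def dice(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Dice similarity coefficient between two binary segmentation masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * inter / (mask_a.sum() + mask_b.sum())

obs1 = np.zeros((64, 64), bool); obs1[20:40, 20:40] = True
obs2 = np.zeros((64, 64), bool); obs2[22:42, 22:42] = True
print(round(dice(obs1, obs2), 3))
```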
Stanton, Bruce A. | Dartmouth College | Retrieval, Reprocessing, Normalization and Sharing of Gene Expression and Lung Microbiome Data Sets to Facilitate AI/ML Analysis Studies of Bacterial Lung Infections Data sharing to enhance studies on bacterial lung infections. The CDC reports that more than 2.8 million antibiotic-resistant infections occur in the United States each year and that, unfortunately, 35,000 people die from them. Moreover, 7.7 million people worldwide died from bacterial infections in 2022 alone. Thus, there is a pressing need to develop more effective treatments for bacterial infections, especially those that are resistant to available antibiotics. Accordingly, the goal of our research is to develop new approaches to treat antibiotic-resistant infections and to facilitate research by other scientists. In the research supported by this supplemental funding from the NIH, we will create an easy-to-access compendium of ~25,000 archived data sets on bacterial infections, most funded by the NIH, enabling the research community to easily mine the data, enhance our understanding of the biology of lung infections, and develop new therapeutic approaches that will reduce the significant disease and death caused by these infections. Working with the Dartmouth Research Computing Center, we will create a searchable online data portal containing artificial intelligence and machine learning (AI/ML)-ready gene expression data sets for various antibiotic-resistant pathogens, including the ESKAPE pathogens (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, and Enterobacter spp.), other bacterial lung pathogens (Haemophilus influenzae, Burkholderia cenocepacia, Streptococcus pneumoniae), and clinically relevant fungal lung pathogens (Aspergillus fumigatus, Candida albicans). The data portal will also contain searchable and downloadable AI/ML-ready data sets relevant to chronic lung diseases, which often involve antibiotic-resistant microbial communities. By making these archived data easier to find, access, and reuse, we anticipate that we and other scientists will identify new targets in antibiotic-resistant bacteria that will lead to the development of novel treatments for these infections. | NHLBI
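Reprocessing and normalization can be illustrated with a single step: converting raw counts to log-scaled counts-per-million, a common way to make archived expression data comparable across samples. A sketch with synthetic values (the portal's actual pipelines are not shown here):

```python
# Minimal sketch: per-sample counts-per-million followed by log2, yielding an
# AI/ML-friendly, roughly variance-stabilized expression matrix.
import numpy as np
import pandas as pd

counts = pd.DataFrame(
    {"sample1": [100, 400, 25], "sample2": [90, 350, 50]},
    index=["gene_a", "gene_b", "gene_c"],
)

cpm = counts.div(counts.sum(axis=0), axis=1) * 1e6
log_cpm = np.log2(cpm + 1)
print(log_cpm.round(2))
```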
Sung, Kyung Hyun | University Of California Los Angeles | A structured multi-scale dataset with prostate MRI for AI/ML research This project aims to enhance the AI/ML readiness of prostate MRI data by creating a structured multi-scale dataset that integrates clinical, radiologic, and pathologic data, with the goal of advancing prostate cancer research and improving patient care. Prostate magnetic resonance imaging (MRI) datasets are crucial for training and validating artificial intelligence and machine learning (AI/ML) algorithms. However, publicly available datasets are often limited by the uncertainty and bias associated with biopsy-confirmed histopathology, which is prone to sampling error and interpretation variability. This project aims to improve the AI/ML readiness of prostate MRI data by providing ground-truth labels that link multi-scale information across clinical, radiologic, and pathologic data. Leveraging an ongoing NIH R01 project (R01-CA248506) focused on developing novel quantitative MRI and AI/ML methods for predicting clinically significant prostate cancer, the proposed work augments the investigative team with experts in MRI-ultrasound fusion biopsy and biomedical informatics. This collaborative effort will result in a unique dataset of consented subjects who underwent prostate MRI, biopsy, and prostatectomy, along with structured clinical, radiologic, and pathologic findings shared in a standardized manner using a clear data dictionary. The availability of this augmented dataset will facilitate direct comparison and validation of different AI/ML models that utilize different ground-truth labels. It will also provide opportunities to combine it with other publicly available AI/ML datasets. The ultimate goal is to refine and improve AI/ML models for image-to-histopathology correlation and temporal monitoring of prostate cancer in patients undergoing active surveillance. | NCI
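The "structured multi-scale" idea can be pictured as one record per case that links the three data scales under a shared data dictionary; the sketch below uses hypothetical field names and values, not the project's actual schema.

```python
# Minimal sketch: a multi-scale ground-truth record for a single case, linking
# clinical, radiologic, and pathologic findings. Illustrative only.
import json

case_record = {
    "case_id": "PX-0001",
    "clinical": {"psa_ng_ml": 6.8, "age": 64},
    "radiologic": {"pirads": 4, "lesion_location": "left peripheral zone"},
    "pathologic": {
        "biopsy_gleason": "3+4",
        "prostatectomy_gleason": "4+3",  # whole-gland label reduces sampling error
    },
}
print(json.dumps(case_record, indent=2))
```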
Terskikh, Alexey V | Sanford Burnham Prebys Medical Discovery Institute | Novel Strategy to Quantitate Delayed Aging by Caloric Restriction We will develop a comprehensive pipeline to integrate multimodal and multidimensional datasets (e.g., imaging and sequencing data) from aging modeling experiments so they can be analyzed together. Our parent R21 application proposed to integrate datasets from single-cell imaging, ATAC-seq, and RNA-seq modalities from control-diet mice and caloric restriction (CR)-diet mice. These experiments will provide a large amount of multidimensional and multimodal data that is not only suitable for machine learning applications but in fact requires machine learning approaches to make sense of. For example, each image of a nucleus (up to 5,000 nuclei are analyzed per replicate per sample) will generate up to 1,000 texture features (e.g., threshold adjacency statistics). Critically, our datasets come from three different types of measurements (microscopic imaging, chromatin accessibility, and gene expression), of which two modalities (imaging and sequencing) are very different. At present, all three data streams are handled in conventional ways, primarily using very large Excel files; this data structure and format is poorly suited to handling large datasets (increasing the chance of errors during copy/paste procedures) and to machine learning integration. Hence there is a need for custom-built approaches to cleaning, filtering, quality control, handling, and analysis of imaging, ATAC-seq, and RNA-seq data. The most critical step is to structure these diverse datasets in a common format that streamlines storage and handling and enables downstream integration and analysis using machine learning algorithms. To meet these challenges, we will implement the AnnData format, which offers a broad range of computationally efficient features including, among others, sparse data support, interoperability with Scanpy, and a PyTorch interface. To ensure that AnnData objects from very different modalities (imaging, ATAC-seq, RNA-seq) can be analyzed together, we will test-run existing computational algorithms best suited to efficiently assimilate and combine multi-omics data to identify key factors that drive aging. Most critically, we will combine large datasets from imaging, ATAC-seq, and RNA-seq using Bayesian methods for integrating multi-omics data and hyperbolic embedding with principled criteria for choosing the best-fitting curvature and dimension. Finally, we will establish an electronic repository of Python scripts and data structures for straightforward dissemination to the broad research community. | NIA
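A minimal sketch of the planned AnnData packaging, with illustrative shapes and metadata (per-nucleus texture features plus a diet label); the actual feature sets and annotations will come from the parent study:

```python
# Minimal sketch: packing per-nucleus imaging features into AnnData so that
# imaging, ATAC-seq, and RNA-seq tables share one on-disk structure.
import anndata as ad
import numpy as np
import pandas as pd

n_nuclei, n_features = 5000, 1000          # e.g., texture features per nucleus
X = np.random.rand(n_nuclei, n_features)

obs = pd.DataFrame(
    {"diet": ["CR" if i % 2 else "control" for i in range(n_nuclei)]},
    index=[f"nucleus_{i}" for i in range(n_nuclei)],
)
var = pd.DataFrame(index=[f"texture_{j}" for j in range(n_features)])

adata = ad.AnnData(X=X, obs=obs, var=var)
adata.write_h5ad("imaging_features.h5ad")  # ML-pipeline-friendly on-disk format
```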
Vonder Haar, Cole | Ohio State University | Dopamine modulation for the treatment of chronic dysfunction due to traumatic brain injury Compiling high-dimensional behavioral datasets to better understand traumatic brain injury vulnerability through machine learning The goal of this supplement is to compile, harmonize, and analyze large-scale behavioral datasets in rat models of traumatic brain injury (TBI). As part of the parent grant, multiple datasets describing deficits in attention, impulsivity, and decision-making after TBI have been and continue to be collected. These have yielded millions of lines of data across studies – a rarity for animal TBI research. Given the varied and heterogeneous nature of brain injury, these large datasets capture rare and common phenotypes across this broad spectrum. As such, we will leverage machine learning methods to better understand what factors determine individual vulnerability and resilience to injury. Two databases will be compiled and prepared with metadata for automated processing. One dataset will comprise risky decision-making data and contain roughly 2 million lines, with approximately 70% corresponding to “pure” control or TBI conditions (i.e., no other interventions). The second dataset will contain just under 1 million lines, with approximately 80% corresponding to “pure” control or TBI conditions, across multiple injury severities. We will analyze these to understand how vulnerability to chronic impairments from TBI can be detected early so that treatments may be better targeted and developed. | NINDS
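Compiling the two databases amounts to concatenating per-study tables under shared columns and metadata flags. A minimal pandas sketch with hypothetical columns:

```python
# Minimal sketch: harmonizing two behavioral datasets into one table with
# study-level metadata for automated processing. Columns are illustrative.
import pandas as pd

risky = pd.DataFrame({"rat_id": [1, 2], "trial": [1, 1], "choice": ["risky", "safe"]})
risky["task"], risky["condition"] = "risky_decision_making", "TBI"

attention = pd.DataFrame({"rat_id": [3, 4], "trial": [1, 1], "latency_s": [0.8, 1.2]})
attention["task"], attention["condition"] = "attention", "control"

combined = pd.concat([risky, attention], ignore_index=True, sort=False)
# Flag "pure" TBI/control rows (no additional interventions) for ML filtering.
combined["pure_condition"] = combined["condition"].isin(["TBI", "control"])
print(combined)
```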
Wagner, Alex Handler | Research Institute Nationwide Children's Hospital | Development and validation of a computable knowledge framework for genomic medicine Standardizing the large-scale population genomics data in the Genome Aggregation Database to enable AI-supported clinical genomics. The clinical interpretation of genomes is a labor-intensive process that remains a barrier to scalable genomic medicine. Efforts to improve this “interpretation bottleneck” have resulted in the development of clinical classification guidelines and databases for genomic variants in Mendelian diseases and cancers. The development of AI-augmented genome interpretation systems is a potential solution to help scale the interpretation process, and such systems will benefit from aggregation and collation of evidence in an AI-ready state. The parent award for this supplement is addressing this challenge through the development and validation of a computable framework for genomic knowledge (R35 HG011949). These efforts are underway in collaboration with the broader genomic knowledge community under the auspices of the Global Alliance for Genomics and Health (GA4GH). The NIH-supported Genome Aggregation Database (gnomAD) is currently the largest and most widely used public resource for population allele frequency data. These data are commonly used as strong evidence to rule out variant pathogenicity, making gnomAD a highly impactful resource for filtering out variants that are unlikely to be causative for a Mendelian disease. The importance of the gnomAD population allele frequency data to clinical interpretation systems makes the resource an ideal candidate for AI-readiness. The objective of this project is to improve the AI/ML readiness of gnomAD by applying the GA4GH Variation Representation Specification (VRS) and the associated genomic knowledge framework developed in the parent award to data from the gnomAD database. As a result of this work, we will be able to couple the semantic precision of VRS and the genomic knowledge framework with the high-performance genomic data search capabilities of the Hail platform on which the gnomAD data are stored. Our aims include development of tools for translating large-scale genomics file formats (i.e., Variant Call Format (VCF)) into VRS objects, design of semantic data models for representing population allele frequency data, and application of these tools and models to the gnomAD Hail tables and GraphQL interface. Our work was initially described as part of the gnomAD v4 data release in November 2023: https://gnomad.broadinstitute.org/news/2023-11-ga4gh-gks/. | NHGRI
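For orientation, the sketch below shows the approximate shape of a VRS-style allele derived from one VCF row, written as a plain dictionary rather than the actual vrs-python classes; the sequence digest is a placeholder, and real VRS translation also involves normalization and computed identifiers.

```python
# One VCF row (1-based position) for a simple SNV.
vcf_row = {"chrom": "1", "pos": 55051215, "ref": "G", "alt": "A"}

# Approximate shape of the corresponding VRS allele; VRS uses 0-based
# interbase coordinates and digest-based sequence identifiers.
vrs_allele = {
    "type": "Allele",
    "location": {
        "type": "SequenceLocation",
        "sequence_id": "ga4gh:SQ.PLACEHOLDER",  # refget digest of the chromosome
        "start": vcf_row["pos"] - 1,
        "end": vcf_row["pos"],
    },
    "state": {"type": "LiteralSequenceExpression", "sequence": vcf_row["alt"]},
}
print(vrs_allele["location"]["start"], vrs_allele["location"]["end"])
```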
Waldron, Levi David | Graduate School of Public Health and Health Policy | Cancer Genomics: Integrative and Scalable Solutions in R/Bioconductor Transforming cancer omics datasets into language-agnostic AI/ML-ready resources for diverse research communities. Large numbers of curated and annotated multi-omic cancer datasets have been developed and enhanced through the parent grant and distributed via the R/Bioconductor project. However, the complexity of existing data structures and the lack of accessibility outside the R/Bioconductor ecosystem have impeded the development of AI/ML-based methods for those datasets. Lack of standardization between data compendia and inadequate documentation of key characteristics of study cohorts have further limited the development of inclusive AI/ML models that leverage diverse datasets while accounting for population subgroups. This administrative supplement creates a FAIR and well-annotated data repository for cancer omics datasets using current best practices for AI/ML-ready data. We automate the conversion of Bioconductor data resources, including 188,323 multi-omic cancer profiles from 373 cBioPortal studies and 22,588 metagenomic profiles from 93 studies in curatedMetagenomicData, into plain-text formats with file manifests of samples and datasets. We further annotate key characteristics of each study cohort by literature review to provide a more complete picture of the representativeness of study participants and the comparability of independent studies. Finally, we harmonize annotations using controlled language from ontology databases, which innately establish relationships between attributes. This project will produce the largest omics data repository specifically for research on AI/ML methods to date, facilitating the development of robust and inclusive models that account for diverse population subgroups. We will provide documented, runnable usage examples using TensorFlow, PyTorch, and scikit-learn, enabling the broader research community to leverage these curated resources in developing AI/ML-based methods for the analysis of cancer genomics data. We expect applications to include biomarker identification and disease subtyping, validation of therapeutic targets for personalized medicine, and molecular pathway analysis for the development of new therapeutic approaches. The successful completion of this project will extend the impact and utility of the parent grant, address the critical need for more inclusive models that incorporate diverse population subgroups in cancer genomics research, improve the accessibility and standardization of cancer genomics datasets, and accelerate progress in AI/ML-based cancer genomics research. | NCI
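The promised usage examples could look like the scikit-learn sketch below, which assumes hypothetical plain-text exports (`expression_matrix.tsv`, `sample_manifest.tsv`) with a `disease_subtype` column in the manifest; file names and columns are illustrative.

```python
# Minimal sketch: loading a plain-text expression matrix plus sample manifest
# and fitting a scikit-learn model. Assumes the files exist as described.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = pd.read_csv("expression_matrix.tsv", sep="\t", index_col=0)   # samples x genes
manifest = pd.read_csv("sample_manifest.tsv", sep="\t", index_col=0)
y = manifest.loc[X.index, "disease_subtype"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```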
Yin, Yanbin | University of Nebraska Lincoln | Carbohydrate enzyme gene clusters in human gut microbiome Developing new artificial intelligence and machine learning (AI/ML) applications for dietary fiber utilization enzymes to enhance personalized nutrition and diet intervention CAZymes (Carbohydrate-Active Enzymes) are enzymes that act upon specific glycosidic linkages to degrade or synthesize polysaccharides. CAZymes are an extremely important part of the genomic repertoire of the human microbiome, especially in the human gut. CAZymes work with other proteins, such as sugar transporters, to act on a specific carbohydrate substrate. Genes encoding these proteins often form physically linked gene clusters known as CAZyme Gene Clusters (CGCs) in bacterial genomes. CGCs that have been experimentally characterized are termed polysaccharide utilization loci (PULs), with known polysaccharide substrates (e.g., starch, mannan, xylan, and glucan). The ability to predict the carbohydrate substrates of CAZymes and CGCs would significantly enhance the emerging practice of personalized nutrition, e.g., using gut microbiome sequencing to infer whether a person will respond to certain dietary fibers or prebiotics. To apply AI/ML technology to carbohydrate substrate prediction, two types of training data must be prepared in a computer-readable format: (1) PULs (experimentally characterized gene clusters with known carbohydrate substrates) curated from the literature, and (2) CGCs (without known carbohydrate substrates) predicted from the human microbiome. Therefore, the major goal of this AI/ML-readiness project is to develop a consistent, standardized, and systematic format for PULs and CGCs that makes them AI/ML ready not only for our parent R01 project but also for other data scientists and nutrition scientists. To achieve this goal, we have assembled a multi-disciplinary research team including three faculty, one postdoc, and three graduate students, who collectively have all the necessary expertise in nutritional science and CAZymes, statistical ML model development, and bioinformatics and ML application development. Two aims, with four subtasks and four milestones, are planned to format and document the PUL and CGC data so that they are readily available to other data scientists and nutrition scientists. All AI/ML-ready data will be freely available in two online data repositories: dbCAN-PUL and dbCAN-seq. | NIGMS
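A standardized, machine-readable PUL record might look like the JSON sketch below; the locus tags and family assignments follow the well-known Bacteroides thetaiotaomicron starch utilization system, but the exact schema here is illustrative, not the final dbCAN-PUL format.

```python
# Minimal sketch: one PUL record with its experimentally characterized
# substrate label, suitable as a training example. Schema is illustrative.
import json

pul_record = {
    "pul_id": "PUL0001",
    "organism": "Bacteroides thetaiotaomicron",
    "substrate": "starch",  # experimentally characterized label for ML training
    "genes": [
        {"locus_tag": "BT3698", "family": "GH13", "role": "CAZyme (SusG)"},
        {"locus_tag": "BT3702", "family": None, "role": "SusC-like transporter"},
    ],
    "evidence": {"source": "literature curation", "pmid": None},  # placeholder
}
print(json.dumps(pul_record, indent=2))
```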