Olusola Ajilore | University of Illinois, Chicago | Unobtrusive Monitoring of Affective Symptoms and Cognition using Keyboard Dynamics Refactor the BiAffect codebase to enable collaborative open science. The parent grant, “Unobtrusive Monitoring of Affective Symptoms and Cognition using Keyboard Dynamics” (UnMASCK), uses a novel digital technology (“BiAffect”) to study cognitive dysfunction in the context of mood disorders. BiAffect leverages smartphone keyboard dynamics metadata to unobtrusively and passively monitor cognitive function. Its core technology is a custom-built smartphone virtual keyboard that replaces the native default keyboard, allowing the collection of real-time data of potential clinical relevance while individuals interact with their device as usual within their natural environment. Working with Sage Bionetworks, we will refactor the BiAffect codebase to enable more robust multi-developer contribution and version control. We also plan to create standardized data processing pipelines to support collaborations with researchers who have varying levels of capacity for data science and engineering. The result will be an innovative, meaningful contribution to collaborative open science, modernizing the biomedical research data ecosystem and making data findable, accessible, interoperable, and reusable (FAIR). - With Sage Bionetworks, refactor the BiAffect codebase for collaborative open science, so that other developers can contribute code and collaborators have pipelines to contribute data. | NIMH |
Pamela Bjorkman | California Institute of Technology (PI) | Developing Immunogens to Elicit Broadly Neutralizing anti-HIV-1 Antibodies Enhancing antibody and variant resources by adopting public data streams, standard input formats, and open-source integration The goal of the parent project is to advance our germline-targeting approach to HIV-1 vaccine design through cycles of immunogen design and testing. Similar strategies guided by antibody discovery are showing exciting promise with a range of other difficult pathogens, such as influenza, malaria, hepatitis C, dengue, and Zika virus. As part of our project's integrated approach to immunogen design, selection, and evaluation, the Bjorkman lab has developed the software package HIV Antibody Database, designed to enable frictionless access to, comparison of, and analysis of broadly neutralizing anti-HIV antibody sequences, structures, and neutralization data. More recently, a sister project addressing COVID-19, Variant Database, can quickly search SARS-CoV-2 genome datasets containing millions of sequences. This administrative supplement will build on these codebases to improve their robustness and sustainability by integrating publicly available data streams, adopting standard formats for input and output of structural data, and optimizing performance. We will enhance HIV Antibody Database to automatically download data from the Los Alamos National Laboratory CATNAP database. Antibody Database will also adopt PDBx/mmCIF, a modern, extensible structure file format, and use the RCSB REST-based API for online structure searches. We will ensure that high-performance graphics frameworks (e.g., Metal) are fully utilized. Antibody Database will also be extended for use with other viruses, including access to SARS-CoV-2 data through Variant Database. The build process for Variant Database, an open-source tool, will be improved to use a modern package manager. To facilitate wider use, a Python API for interacting with Variant Database will be developed. - Refactor antibody and variant resources. | NIAID |
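The entry above mentions using the RCSB REST-based API for online structure searches. Below is a minimal sketch of such a search, assuming the public RCSB Search API v2 endpoint and its JSON query schema; this is an illustration, not Antibody Database's actual code.

```python
# Sketch of an online structure search against the RCSB Search API.
# Endpoint and JSON schema follow the public v2 API as of this writing;
# adjust the query if the service has changed.
import json
import requests

query = {
    "query": {
        "type": "terminal",
        "service": "full_text",
        "parameters": {"value": "HIV-1 broadly neutralizing antibody"},
    },
    "return_type": "entry",
}

resp = requests.post(
    "https://search.rcsb.org/rcsbsearch/v2/query",
    data=json.dumps(query),
    headers={"Content-Type": "application/json"},
    timeout=30,
)
resp.raise_for_status()
for hit in resp.json().get("result_set", []):
    print(hit["identifier"], hit.get("score"))  # PDB IDs with relevance scores
```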
John Buse | University of North Carolina Chapel Hill | CAMP FHIR: Lightweight, Open-Source FHIR Conversion Software to Support EHR Data Harmonization and Research Practical, lightweight data standardization: improving CAMP FHIR software to map clinical common data models to FHIR Clinical common data models (CDMs), such as PCORnet, OMOP, and i2b2, aim to ease data harmonization and interoperability, thereby fostering collaboration. However, when institutions support different CDMs, cross-institutional data sharing can be impeded. HL7 Fast Healthcare Interoperability Resources (FHIR) is an increasingly key standard for interoperability and data exchange that aims to address these issues, but mapping an individual CDM to FHIR is resource-intensive. The parent grant developed the CAMP FHIR software to advance the clinical and translational science mission of the NC TraCS center, both in North Carolina and for national CTSA goals. By offering a cloud-based, open-source application that can transform multiple types of input data to FHIR, CAMP FHIR makes it easier and more efficient for organizations with varying data environments to participate in new clinical research partnerships. The aims of this administrative supplement are to (1) make CAMP FHIR more robust by implementing additional software development best practices, (2) improve interoperability by adding support for bidirectional data transformation and new high-value FHIR resources, and (3) improve usability and accessibility by developing a graphical user interface, adding support for new input types, and ensuring cloud-readiness. - Map to FHIR from different clinical data models. | NCATS |
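To make the CDM-to-FHIR mapping concrete, here is an illustrative sketch (not CAMP FHIR itself, which is a separate application) of transforming one row of an OMOP-style person table into a FHIR R4 Patient resource; the OMOP field handling is simplified for the example.

```python
# Illustrative CDM-to-FHIR transform: OMOP person row -> FHIR Patient.
# 8507/8532 are the standard OMOP gender concept IDs for male/female.
OMOP_GENDER_TO_FHIR = {8507: "male", 8532: "female"}

def person_to_fhir_patient(row: dict) -> dict:
    """Map an OMOP person record (as a dict) to a FHIR Patient resource."""
    return {
        "resourceType": "Patient",
        "id": str(row["person_id"]),
        "gender": OMOP_GENDER_TO_FHIR.get(row["gender_concept_id"], "unknown"),
        "birthDate": f'{row["year_of_birth"]:04d}-{row["month_of_birth"]:02d}-{row["day_of_birth"]:02d}',
    }

print(person_to_fhir_patient(
    {"person_id": 42, "gender_concept_id": 8532,
     "year_of_birth": 1980, "month_of_birth": 7, "day_of_birth": 4}
))
```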
Vince Calhoun | Georgia State University | Multivariate methods for identifying multitask/multimodal brain imaging biomarkers GIFTwrap: a containerized and FAIR cloud-based implementation of the widely used GIFT toolbox This supplement will usher in the next phase of the widely used GIFT software (Group ICA of fMRI Toolbox), which provides data-driven methods for capturing intrinsic functional and structural brain networks. GIFT has been continuously updated and extended over the past twenty years and has a large amount of functionality not available in other tools, including 20 different Independent Component Analysis (ICA) algorithms. However, GIFT is primarily based on a standalone software development and analysis model. We will extend GIFT to interface with and leverage modern software tools, to facilitate comparability across ICA analyses, and to fully engage in the community development movement. The GIFTwrap implementation will be a containerized Python tool that will also be accessible via a cloud interface. The work will open up access to a wide suite of approaches, including dozens of different ICA approaches, functional network connectivity, independent vector analysis (a generalization of ICA to multiple datasets), dynamic functional network connectivity, spatial dynamics, connectome visualization, and much more. We have three main goals: (1) improve the architecture to facilitate FAIR principles and modernize the tools; (2) deploy the GIFT tools, especially our robust NeuroMark and auto-labelling approaches that facilitate comparability across analyses, as a brain imaging data structure app (BIDS app) for easy use and integration into modern analysis frameworks (see the sketch below), and deploy them in cloud-based analytic platforms (e.g., brainforge); and (3) provide a cloud interface through which individuals can run fully automated ICA analyses with a simple upload of data. | NIBIB |
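The BIDS-app packaging in goal (2) implies the standard BIDS-Apps command-line contract (bids_dir, output_dir, analysis level). A minimal sketch of such an entry point follows; the ICA call itself is a hypothetical placeholder, not GIFTwrap's actual interface.

```python
# Sketch of a BIDS-app entry point per the BIDS-Apps CLI convention.
import argparse

parser = argparse.ArgumentParser(description="GIFTwrap BIDS app (sketch)")
parser.add_argument("bids_dir", help="root of a BIDS-formatted dataset")
parser.add_argument("output_dir", help="where derivatives are written")
parser.add_argument("analysis_level", choices=["participant", "group"])
args = parser.parse_args()

if args.analysis_level == "group":
    print(f"running group ICA on {args.bids_dir} -> {args.output_dir}")
    # run_group_ica(args.bids_dir, args.output_dir)  # hypothetical entry point
```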
Naomi Caselli | Boston University | Effects of input quality on ASL vocabulary acquisition in deaf children Use best practices in software development to make the sign language assessments developed under the parent grant openly available, and to build a robust platform for collecting and tagging these data as a step towards large-scale machine-readable sign language datasets. The majority of deaf children experience a period of limited exposure to language (spoken or signed), which has cascading effects on many aspects of cognition. The parent project aims to understand how children build a vocabulary in sign language, and whether and how this differs for deaf children who have limited exposure to a sign language. This includes developing American Sign Language (ASL) proficiency tests and using these tests to examine how children learn ASL. The supplement will use best practices in software development to make these ASL proficiency tests widely available in the community. We will do so by converting a code base our lab has already developed (a single-use application for collecting sign language video data) into a platform that can be used to collect and tag video data of all kinds. Not only will this platform be used to make the ASL tests accessible, but it will also be made publicly available to other sign language researchers. This project will remove significant barriers in the field of sign language research, making it much more efficient to develop large-scale machine-readable sign language datasets. Additionally, making the ASL tests widely accessible in the community will help clinicians and researchers track language acquisition among deaf children. | NIDCD |
Connie Celum | University of Washington | Viroverse: Bedside through Bioinformatics database retrieval system Transitioning the Viroverse specimen repository database, lab notebook, and retrieval and visualization system to a cloud-ready application based on the Python Django framework The parent grant supports the Retrovirology and Molecular Data Sciences (RMDS) Core, a component of the University of Washington / Fred Hutchinson Center for AIDS Research. The largest emphasis of this core is the development and dissemination of new software tools, databases, and custom applications for bench support, data management, and visualization across a wide spectrum of bench science-based activities conducted locally, nationally, and internationally. The centerpiece of this effort is the laboratory information management system Viroverse, which provides a highly flexible specimen repository database, lab notebook, and retrieval and visualization system for data acquired from the bedside through the laboratory bench and bioinformatic analysis. The supplement award will enable enhancement of the Viroverse feature set and produce a modern, sustainable, cloud-ready database platform. We will migrate Viroverse from Perl/Catalyst to the Python Django framework, providing a code base that is easy for programmers to contribute to, allows wide community adoption, benefits from intense security monitoring, reporting, and patching, and leverages the many publicly available libraries and modules as well as a broad base of experienced developers. Importantly, this will allow for multiple authentication backends and support industry-standard encryption and single sign-on (SSO) technology. The source code for Viroverse will continue to be made freely available, and we will generate and release a containerized (Docker) version. Finally, we plan to add significant unit testing and a continuous integration pipeline to ensure overall application stability while maintaining gatekeeper review and merge authority over the codebase as it is developed further. | NIAID |
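For the multiple authentication backends and SSO support described above, the relevant Django mechanism is the AUTHENTICATION_BACKENDS setting. A sketch follows; the SAML backend name is purely hypothetical, and a real deployment would use a maintained SAML/OIDC package.

```python
# Sketch of Django settings supporting local accounts alongside SSO.
AUTHENTICATION_BACKENDS = [
    "django.contrib.auth.backends.ModelBackend",  # local username/password accounts
    "viroverse.auth.ViroverseSAMLBackend",        # hypothetical SSO backend name
]

# Standard Django settings that enforce encrypted transport and cookies.
SECURE_SSL_REDIRECT = True
SESSION_COOKIE_SECURE = True
CSRF_COOKIE_SECURE = True
```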
Melissa Cline | University of California Santa Cruz | Eliminating variants of uncertain significance in BRCA1, BRCA2 and beyond Resources for sharing knowledge on the genetic risk of disease An estimated 10% of all cancers arise through inherited genetic risk conferred by harmful variants in the patient’s germline DNA. One particularly vivid example is Hereditary Breast and Ovarian Cancer (HBOC) Syndrome, arising through harmful variation in the BRCA1 and BRCA2 genes. The lifetime risk of breast or ovarian cancer ranges between 42% and 70% for women who inherit a harmful BRCA variant, versus the average risk of 11% in the U.S. population. These cancers can often be prevented through detection and clinical management of the genetic risk, provided the risk can be recognized. Most BRCA variants currently have no known clinical impact. The BRCA Exchange was launched in 2016 with the goal of developing new approaches to share data on BRCA variants to catalyze variant interpretation, with BRCA1/2 in HBOC serving as exemplars for additional genes and heritable disorders. Today, roughly 3,000 users per month visit the site for BRCA variant data aggregated from public variation repositories, along with additional annotation data curated by the ENIGMA Consortium, the internationally recognized organization for the expert interpretation of BRCA variants. This has inspired other research consortia to reach out about launching variant data exchanges for the genes and heritable disorders under their purview. With this supplement, we propose to refactor the BRCA Exchange software to improve its modularity, reusability, and cloud readiness. By refactoring the data integration pipeline and database management, we will add flexibility to the data model, allowing external consortia to integrate the data that is most informative for their variants. By integrating the pipeline with cloud APIs, we will enable external consortia to run the pipeline on NIH secure cloud platforms, alleviating the need for an internal server. Finally, we will produce a simplified front end, which collectively will allow external consortia to build and run their own variant data exchanges. We anticipate that these developments will catalyze research in pediatric and diffuse gastric cancers, as well as contribute valuable new functionality to the parent grant. | NCI |
Stephania Cormier | Louisiana State Baton Rouge | LSU Superfund Research Center - Environmentally Persistent Free Radicals “Cloud-based multilevel mediation analysis for investigating adverse effects of particulate matter from hazardous waste remediation and individual risk factors on respiratory health”. The goal of our project is to create a scalable, open-source, comprehensive toolset for performing multilevel mediation analyses. The link between particulate matter (PM) exposure and poor respiratory health is well established. As part of our parent grant, the LSU Superfund Research Center postulates that environmentally persistent free radicals (EPFRs) present on PM from hazardous waste sites, incinerators/chemical fires, and other sources are the missing mechanistic link between PM exposure and poor respiratory health. We will investigate the mediation of this association using hierarchical mediation analysis to decompose the adverse respiratory effects of air pollutants into direct and indirect (EPFR-mediated) effects. We developed a multilevel mediation analysis method that allows longitudinal assessments of residential environments and individual risk factors to be jointly utilized in determining the mechanistic link between exposure to PM and poor respiratory health. This method is available in our R package, Multilevel Mediation Analysis (mlma). However, effective use of mlma requires knowledge of the R programming language, including manipulating datasets within the R environment and writing commands in R. In addition, the R software must be downloaded and installed on individual computers to perform the analysis. In large or complex applications, computational resources may become heavily taxed due to the high computational burden of the software algorithms. These limitations hinder broader use of the method in research and applications. To address these limitations and facilitate wider adoption of our comprehensive method for hierarchical mediation analysis, we will develop a more robust cloud-based application of our R package mlma by (1) creating an interactive visual interface that allows users to easily import datasets and build conceptual mediation model frameworks using drag-and-drop functionality, (2) translating the R code into an efficient low-level language (e.g., C or C++) that employs a high-performance parallel computing model to improve computational speed, and (3) enhancing usability, speed, storage, and the ability to handle big data by making use of clustering technology in the cloud. | NIEHS |
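For readers unfamiliar with mediation analysis, the single-level, single-mediator decomposition below shows the direct/indirect split the entry refers to; mlma generalizes this to multilevel data with multiple mediators.

```latex
% Single-level, single-mediator case, shown for intuition only.
\begin{align*}
M_i &= \alpha_0 + a\,X_i + \epsilon_{M,i} \\
Y_i &= \beta_0 + c'\,X_i + b\,M_i + \epsilon_{Y,i} \\
\text{total effect of } X \text{ on } Y
    &= \underbrace{c'}_{\text{direct}} + \underbrace{a \times b}_{\text{indirect (mediated)}}
\end{align*}
```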
Adam Eggebrecht | Washington University | Illuminating development of infant and toddler brain function with DOT Extending our NeuroDOT tools to develop robust and efficient software for photometric data-anatomy registration and data fidelity assurance for fNIRS and HD-DOT data, and our NLA software tools for general connectome-wide statistical analyses. The long-term goal of the parent BRAINS R01 (R01MH122751, ‘Illuminating development of infant and toddler brain function with DOT’) is to advance high-density diffuse optical tomography (HD-DOT) methods for evaluating brain-behavior relationships in infants and toddlers at risk for developing autism spectrum disorder (ASD) while they are awake and engaged within a naturalistic setting. Funding from this administrative supplement (NOT-OD-21-091) will promote modernization of our growing components of the data-resources ecosystem by providing crucial support to (1) refactor our Matlab-based NeuroDOT/NLA toolboxes in Python with enhanced development tools and standardization, (2) establish cloud readiness of NeuroDOT/NLA, and (3) expand documentation and support our community of users and developers with tutorials, workshops, and hackathons. Successful completion of these aims will both complement and extend the impact of the parent R01, not only in uncovering longitudinal patterns of covariation between brain function and behavior that may provide novel predictive diagnostic value, but also in harmonizing methods and strategies for high-fidelity optical functional brain mapping that will be crucial to ongoing investigations of brain function during engaged behavior in infants and toddlers. | NIMH |
Evelina Fedorenko | MIT | The neural architecture of pragmatic processing Establishing a common language in human fMRI: linking the traditional group-averaging fMRI approach and functional localization in individual brains through probabilistic functional atlases for four high-level cognitive networks. The parent project examines the contributions of three communication-relevant brain networks (the language network, the social cognition network, and the executive-control network) to pragmatic reasoning, the ability to go beyond the literal meaning to understand the intended meaning. The project adopts the ‘functional localization’ fMRI approach, where networks of interest are defined functionally in each individual brain. Although this approach is superior to the traditional group-averaging fMRI approach, it is not always feasible, and it is unclear how to relate findings from studies that rely on these disparate approaches. We will develop and make publicly available probabilistic functional atlases for four brain networks critical for high-level cognition, based on data from extensively validated functional ‘localizer’ paradigms collected under the parent award and in prior work. Such atlases, based on overlaying large numbers of activation maps, capture not only the areas of most consistent response but also the inter-individual variability in the locations of functional areas. These probabilistic representations of the network landscapes can therefore help estimate the probability that any given location in the common brain space belongs to a particular functional network. In this way, probabilistic atlases can provide a critical bridge between two disparate approaches in human fMRI (traditional group-averaging and functional localization in individual brains) as well as link fMRI work with lesion-behavior patient investigations. The ability to more straightforwardly compare findings across studies is bound to lead to more robust, replicable, and meaningful science in our understanding of human communication and related abilities. | NIDCD |
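At its core, the atlas construction described above reduces to averaging binarized individual activation maps in a common space. A toy numpy sketch, with random stand-in data:

```python
# Probabilistic atlas sketch: the voxelwise mean of binary subject maps
# estimates P(voxel belongs to the network).
import numpy as np

rng = np.random.default_rng(0)
n_subjects, shape = 100, (91, 109, 91)           # MNI-like grid, toy data
maps = rng.random((n_subjects, *shape)) > 0.9    # stand-in binarized maps

atlas = maps.mean(axis=0)  # proportion of subjects activating each voxel
print(atlas.shape, atlas.max())
```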
Alexander Fleischmann | Brown University | Odor Memory Traces in the Mouse Olfactory Cortex Open-source software tools to enhance the processing, reproducibility and shareability of integrated multimodal physiology and behavioral data A major challenge for neuroscience research is the complexity and size of multimodal data sets, which often include a combination of calcium imaging, electrophysiology, behavioral tracking through video and other sensors, and electrical or optogenetic stimulation. Acquisition and analysis of such data typically rely on a mix of vendor-built and custom workflows, with diverse file formats and software applications required at various stages of the analysis pipeline. We will improve and extend our calcium imaging and behavior analysis pipeline, built around the Neurodata Without Borders (NWB) standard, into a general-purpose, cloud-enabled tool for managing and analyzing systems neuroscience data. We will generalize the pipeline from our current data format to create a well-documented, user-friendly application programming interface (API). We will expand integration checks to include Microsoft Windows and macOS and disseminate the outcome as a Python package through standard repositories like PyPI and Conda-Forge. We will adapt the pipeline to enable automatic saving to the cloud, generalize the pipeline to other open-source software tools, and create a graphical user interface to complement the current command-line interface. Together, these enhancements to an already in-use data analysis pipeline will provide a streamlined framework for use by other labs and enhance the reproducibility and shareability of integrated neural activity and behavioral data. | NIDCD |
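A minimal sketch of the NWB standard the pipeline is built around, writing a toy acquisition with pynwb; names and values are illustrative only.

```python
# Sketch: create an NWB file, attach one behavioral time series, write to disk.
from datetime import datetime, timezone
import numpy as np
from pynwb import NWBFile, TimeSeries, NWBHDF5IO

nwbfile = NWBFile(
    session_description="odor discrimination session (toy example)",
    identifier="session-0001",
    session_start_time=datetime.now(timezone.utc),  # must be timezone-aware
)
speed = TimeSeries(name="running_speed", data=np.zeros(1000),
                   unit="cm/s", rate=30.0)
nwbfile.add_acquisition(speed)

with NWBHDF5IO("session-0001.nwb", "w") as io:
    io.write(nwbfile)
```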
Julius Fridriksson | University of South Carolina at Columbia | Center for the Study of Aphasia Recovery (C-STAR) Advanced neuroimaging visualization for cloud computing ecosystems The Center for the Study of Aphasia Recovery (C-STAR, P50-DC014664) explores recovery from language impairments following stroke, bringing together a diverse team of specialists from communication sciences, neurology, psychology, statistics and neuroimaging. This project acquires a broad range of magnetic resonance imaging (MRI) modalities (structural, diffusion, arterial spin labelling, functional, resting state) from stroke survivors to understand the brain areas critical for language, improve prognosis, and identify the optimal treatment or compensation strategy for each individual. Our team has developed novel desktop-based tools (MRIcroGL and Surfice) to visualize these different modalities. The aim of this supplement is to adapt our methods to a web-based tool (NiiVue) that can work on any device (computer, tablet, phone). | NIDCD |
Andrew Gelman | Columbia | Improving representativeness in non-probability surveys and causal inference with regularized regression and post-stratification Improving the flow of the Bayesian workflow by enhancing the Stan probabilistic programming platform The parent grant will develop general, flexible, and reliable Bayesian methods for survey sampling adjustment that can be used for a wide range of problems in public health research. This requires extensive use of the Stan probabilistic programming platform, both to carry out the research itself and to put the resulting methodology into practice. In the supplement we will improve the core Stan platform in three ways: implementing common input and output formats; speeding up the core Stan inference algorithms through more sophisticated parallelization; and improving memory efficiency. This will improve the overall speed and scalability of inference, allowing for Bayesian methods to be used with increasingly complex models, and in turn allowing more stable and effective inference from non-random samples, a problem that is increasingly relevant when learning about populations in public health. | NIA |
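A sketch of the kind of parallelized Stan run the supplement targets, using cmdstanpy with a deliberately trivial model; it assumes a working CmdStan installation.

```python
# Sketch: compile and sample a Stan model from Python, running chains in parallel.
from cmdstanpy import CmdStanModel

stan_code = """
data { int<lower=0> N; array[N] int<lower=0, upper=1> y; }
parameters { real<lower=0, upper=1> theta; }
model { theta ~ beta(1, 1); y ~ bernoulli(theta); }
"""
with open("bernoulli.stan", "w") as f:
    f.write(stan_code)

model = CmdStanModel(stan_file="bernoulli.stan")
fit = model.sample(data={"N": 10, "y": [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]},
                   chains=4, parallel_chains=4)  # chains run concurrently
print(fit.summary())
```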
Guy Genin | Washington University | Multiscale models of fibrous interface mechanics Strain Analysis Software for Open Science This supplement addresses a critical need in biomechanics and mechanobiology, and eventually in clinical practice: seamlessly analyzing large imaging datasets to determine how tissues deform under mechanical loading. The parent grant (R01AR077793) uses a comprehensive modeling and experimental approach to study how fibrous interfaces transfer load between dissimilar materials. The supplement will implement previously developed strain-tracking algorithms in user-friendly software that is cloud-ready and broadly available. This software will enable quantitative analysis of deformation in biomedical images from a wide range of modalities, including microscopy, ultrasound, and optical imaging. Notably, commercially available software packages for this task employ regularization techniques to ensure smooth solutions, and are therefore often unable to accurately identify local tissue deformations or predict soft tissue tears. We will enable researchers to study strain fields associated with injury patterns and rehabilitation protocols by executing two aims: (1) We will develop open-source software, as a plugin to ImageJ, for the strain-tracking algorithm in two dimensions (2D), stereo view (2.5D), and three dimensions (3D). Best practices will be used for open-source software development, and modules will be created to facilitate the development of a user community. (2) We will develop a working, static code implementation on GitHub that can be run on Amazon AWS using data in the cloud. This will help overcome the primary obstacle to widespread adoption of strain-mapping techniques in musculoskeletal research, namely that the 3D datasets require substantial computational resources to analyze. The work will enable collaboration between the PIs of the parent grant and an expert on open-source software development for clinical and research translation, and enhance the impact of a tool with strong potential. | NIAMS |
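As background for the strain-tracking aims, here is a sketch of computing a 2D Green-Lagrange strain field from a displacement field on a regular grid; real strain-tracking software first estimates the displacements from image pairs, which this toy example skips.

```python
# Sketch: Green-Lagrange strain E = 0.5 * (F^T F - I), with F = I + grad(u),
# evaluated for a toy 1% uniaxial stretch.
import numpy as np

u = np.fromfunction(lambda y, x: 0.01 * x, (64, 64))  # x-displacement field
v = np.zeros((64, 64))                                # y-displacement field

du_dy, du_dx = np.gradient(u)   # axis 0 is y (rows), axis 1 is x (cols)
dv_dy, dv_dx = np.gradient(v)

Exx = du_dx + 0.5 * (du_dx**2 + dv_dx**2)
Eyy = dv_dy + 0.5 * (du_dy**2 + dv_dy**2)
Exy = 0.5 * (du_dy + dv_dx + du_dx * du_dy + dv_dx * dv_dy)
print(Exx.mean())  # ~0.01005 for a 1% uniaxial stretch
```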
Thomas Gill | Yale | Claude D. Pepper Older Americans Independence Center at Yale Yale Study Support Suite (YES3): Dashboard and Web Portal Software Supporting Research Workflow through Integrated, Customizable REDCap External Modules This data science project will refactor and refine the Yale Study Support Suite (YES3), a suite of REDCap workflow and data management programmatic extensions (external modules) designed to improve the efficiency and quality of research field operations. The NCATS-funded, Vanderbilt University-developed REDCap platform is used at thousands of institutions worldwide. A powerful characteristic of REDCap is its support for user-contributed programmatic extensions (external modules) that can add features or UI/UX elements, as well as integration with external informatics resources. Over the past decade, the Operations Core of the NIA-funded Claude D. Pepper Older Americans Independence Center (OAIC) at Yale (P30AG021342) has built a suite of REDCap external modules that promote efficiencies in workflows and data management. YES3 is in use by studies and data coordinating centers associated with Yale University, including a large PCORI/NIA-funded national pragmatic trial (the D-CARE Study) led by investigators from three OAICs. YES3 components include a dashboard for advanced data collection and study operations management that can be tailored to support study-specific workflows; a study portal for disseminating study materials and for single- or multi-site conduct-of-study monitoring that includes a comparative "site report card" analysis; and an automatable module that exports both code (SAS and R) and data for datamarts. The REDCap@Yale team will use the NOSI funding to refactor and refine the YES3 codebase into a form suitable for long-term collaborative maintenance by the extensive consortium of REDCap open-source developers. Refactoring will focus on adherence to established coding style and documentation guidelines, design patterns widely in use by consortium developers, and REDCap@Yale software workflow and security practices. The YES3 GitHub repository will include automated workflows for code inspection and security review, in addition to comprehensive documentation for end-users and developers. | NIA |
Karl Grosh | University of Michigan Ann Arbor | Active and Nonlinear Models for Cochlear Mechanics Developing open-source finite-element codes for algorithm development and scientific discovery The overarching goal of this research is to develop a complete fluid-mechanical-electrical model that describes the response of the cochlea to external acoustic stimulation. A predictive model that covers the full range of audio amplitudes and frequencies will help us understand how important classes of signals, such as speech and music, are processed in the cochlea, since our understanding of this processing is incomplete. Potential outcomes of a predictive model include improved speech processing algorithms, approaches for both cochlear implant electrical stimulation and hearing aid receiver stimulation paradigms, noninvasive diagnoses of auditory function, and higher-fidelity input for models of neural processing of sound. While we have made progress in developing efficient, predictive codes, improving their utility and predictive ability rests on innovations in scientific computing along with the ability to rapidly integrate new biophysical mechanisms. The modular nature of our finite element-based approach makes such improvements relatively easy. Two main motivations for creating an open-source resource for these models are as follows: (1) broaden the user base by making the present version of CSound easier to use and directly accessible to the entire auditory research community; and (2) create a GitHub-based open-source resource that enables computational researchers to modify CSound using GitHub's branching structure and pull-request workflow for managing software updates. To facilitate the realization of these goals, we use widely accessible software that harnesses large-scale (parallel) computing capability. We hope that by creating an open-source software ecosystem, we will accelerate discovery by experimentalists through their use of the code and spur advances in scientific computing aimed directly at the grand computational challenge of predicting the response of mammalian cochleae to sound. | NIDCD |
Ron June | Montana State Bozeman | Role of Glucose metabolism in Chondrocyte Mechanotransduction Build open science framework to enable more labs to use metabolomic flux analysis of central metabolism. All cells use various metabolic processes to harvest nutrients and energy from biological inputs. The most studied pathways that cells use are components of the integrated system that is known as central metabolism. This project develops web versions of key tools to study central metabolism that use a type of mathematical modeling called metabolomic flux analysis. This technique integrates experimental metabolomics data with a stoichiometric model. The experimental metabolomics data describes changes in metabolite concentrations in response to an experimental stimulus or clinical treatment. The stoichiometric model describes the quantitative flow of nutrients and energy through central metabolism. Many labs across the nation perform studies of central metabolism, yet this new approach of metabolomic flux analysis is not widely available. The objective of these supplemental studies is to develop open science frameworks that allow users to perform this analysis on their own data. | NIAMS |
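The stoichiometric steady-state constraint at the heart of flux analysis can be sketched in a few lines; the toy network below is purely illustrative, and the entry's integration of measured metabolomics data goes beyond this plain flux-balance example.

```python
# Sketch: find fluxes v with S @ v = 0 (steady state) within bounds while
# maximizing an objective flux, via linear programming.
import numpy as np
from scipy.optimize import linprog

# rows = metabolites A, B; columns = reactions  ->A, A->B, B->
S = np.array([[1, -1,  0],
              [0,  1, -1]])
bounds = [(0, 10)] * 3
c = np.array([0, 0, -1.0])  # linprog minimizes, so negate to maximize v3

res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds, method="highs")
print(res.x)  # optimal flux distribution, here [10, 10, 10]
```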
Daisuke Kihara | Purdue | Building protein structure models for intermediate resolution cryo-electron microscopy maps Make cryo-EM structure modeling software cloud-ready and integrate with popular molecular modeling packages. The overall goal of the parent award is to develop computational methods for modeling global and local structures for interpreting cryo-EM density maps of 4 Å to medium resolution. Under the parent award, we have successfully developed methods including MAINMAST for de novo protein main-chain modeling, VESPER for structure fitting to EM maps, and Emap2Sec+, a deep-learning method for detecting local structures in medium-resolution EM maps. The goal of this administrative supplement is to improve the availability, sustainability, robustness, and user-friendliness of our biomolecular structure modeling software for cryo-EM, with a strong emphasis on cloud computing readiness. To achieve this goal, we will restructure the developed codes to integrate the software with popular molecular modeling packages and to compute effectively on cluster computers. We will also develop web servers that perform computation in the cloud, providing easy access to the software. | NIGMS |
Arjun Krishnan | Michigan State University | Resolving and understanding the genomic basis of heterogeneous complex traits and disease GenePlexus: a cloud platform for network-based machine learning The goal of the parent project is to develop a suite of computational frameworks that integrate massive collections of genomic and biomedical data to advance understanding of the mechanistic relationships between genomic variation, cellular processes, tissue function, and phenotypic variation in relation to complex traits and diseases. Genome-wide molecular networks effectively capture these mechanistic relationships and, when combined with supervised machine learning (ML), lead to state-of-the-art results in predicting novel genes associated with pathways, traits, and diseases. This software supplement will support the development of GenePlexus, a cloud platform for network-based machine learning, to enable: (i) biomedical/experimental researchers to perform network-based ML on massive genome-scale molecular networks and get novel, interpretable predictions about gene attributes, and (ii) computational researchers to programmatically run network-based ML, retrieve results, and integrate with existing -omics data analysis workflows. Building on a current prototype, we will (1) refactor GenePlexus to use a service-oriented architecture to scale ML runs and enable programmatic access, stand-alone use, and community contributions; (2) generalize the data structure and storage design, and improve inter-service communication with existing large-scale resources for genes and networks; and (3) improve security, privacy, database architecture, and cost management to improve the user experience in running, storing, retrieving, and sharing machine learning models and results. | NIGMS |
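A conceptual sketch of network-based supervised ML in the style the entry describes: featurize each gene by its normalized network neighborhood and train a classifier on genes with known labels. The data are random stand-ins, and this is not the GenePlexus implementation.

```python
# Sketch: genes featurized by rows of a row-normalized adjacency matrix;
# a classifier trained on labeled genes scores every gene in the network.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_genes = 500
adjacency = rng.random((n_genes, n_genes)) < 0.02           # toy network
features = adjacency / np.maximum(adjacency.sum(1, keepdims=True), 1)

labeled = rng.choice(n_genes, 60, replace=False)  # genes with known labels
y = rng.integers(0, 2, size=60)                   # 1 = associated with trait

clf = LogisticRegression(max_iter=1000).fit(features[labeled], y)
scores = clf.predict_proba(features)[:, 1]        # predictions for all genes
print(scores[:5])
```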
Maria Kukuruzinska | Boston University | Defining the catenin/CBP axis in head and neck cancer Enhancement and Cloud Deployment of CaDrA, a software tool for Candidate Driver Analysis of Multiomics Data In the parent project we developed CaDrA (Candidate Driver Analysis), a methodology for the identification and prioritization of candidate cancer drivers associated with a given pathway, regulator, or other molecular phenotype of interest through the analysis of multi-omics data. We aim to optimize, harden, and deploy CaDrA as an open-source R package built on best software engineering practices and design principles. We plan to thoroughly document the package, design an R Shiny interface to make the tool “biologist-friendly”, and containerize it via Docker/Singularity to make it cloud-ready, scalable, and compatible with various high-performance computing (HPC) environments. | NIDCR |
Sudhir Kumar | Temple University | Methods for Evolutionary Genomics Analysis Making MEGA cloud-ready for big data phylogenetics The Molecular Evolutionary Genetics Analysis (MEGA) software offers a large repertoire of tools for assembling sequence alignments, inferring evolutionary trees, estimating genetic distances and diversities, inferring ancestral sequences, computing timetrees, and testing selection. These analyses are key to deciphering the genetic basis of evolutionary change and discovering fundamental patterns in the tree of life. New methods for molecular evolutionary genomics, developed during the parent grant of this supplement, will be implemented in MEGA, which is widely used in the community. The primary aim of this supplement is to adapt MEGA’s computational core to run on cloud computing infrastructure, i.e., make MEGA cloud-ready (MEGA-CR). We plan to refactor the source code for scalable distributed execution on cloud infrastructure as well as computer clusters. We will optimize performance through code profiling, containerizing, and minimizing factors that increase latency in cloud environments, such as network communication and data transfer. These developments will increase the scalability of MEGA for analyzing big data via the elastic computing power enabled by cloud infrastructure. Consequently, MEGA-CR will meet the needs of a scientific community that is now analyzing much larger datasets by providing greater accessibility, cost-efficiency, and scalability in computational molecular evolution. | NIGMS |
Barry Lester | Women and Infants Hospital Rhode Island | Clinical markers of neonatal opioid withdrawal syndrome: onset, severity and longitudinal neurodevelopmental outcome Developing cloud-based software to analyze acoustic characteristics of newborn babies’ cries to diagnose withdrawal due to prenatal opioid exposure. The incidence of Neonatal Opioid Withdrawal Syndrome (NOWS), withdrawal in newborn infants due to prenatal opioid exposure, has increased dramatically with the worldwide rise in opioid use. Accurate prediction and diagnosis of NOWS would change the treatment and management of these infants. The purpose of the parent grant is to identify clinical markers of NOWS, including acoustic characteristics of the infant’s cry, and determine the long-term predictive validity of these markers. Computer analysis of infant cry characteristics enables us to predict NOWS diagnosis with 91% accuracy but is not usable at the bedside. With this supplement, we will develop a cloud-based system for the automated analysis of infant cry acoustics for the diagnosis of NOWS. The software will give us a fully automated system in which a user records a baby’s cry in the newborn nursery using a phone or other connected device and, within a few seconds, receives the diagnostic result. We will develop the cloud-based software to change clinical practice by providing a more accurate and reliable diagnosis of NOWS, which will affect the pharmacological treatment of NOWS, including length of hospital stay, and potentially improve the long-term outcome of these infants. - Cloud use for clinical application | NIDA |
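A sketch of the kind of server-side acoustic feature extraction such a service might perform, estimating cry fundamental frequency (F0) with librosa; the file name is a placeholder, and NOWS-relevant features would go well beyond F0.

```python
# Sketch: load an uploaded recording and estimate F0 with the YIN algorithm.
import librosa
import numpy as np

y, sr = librosa.load("cry_recording.wav", sr=None)   # placeholder file name
f0 = librosa.yin(y, fmin=200, fmax=1000, sr=sr)      # rough infant-cry F0 range
print(f"median F0: {np.median(f0):.1f} Hz")
```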
Allan Levey, David Gutman | Emory | AMP-AD Brain Proteomic Network Enhancement, Validation, and Translation into CSF Biomarkers Enhancing the Open-Source Digital Slide Archive Platform to Enhance Proteomic Biomarker Discovery in Alzheimer’s and Related Diseases The overall goal of the AMP-AD project is to identify proteomic networks from cortical tissue and protein co-expression modules that strongly associate with diagnosis, cognition, and neuropathology. Neuropathologic tissue serves as the gold standard for patient diagnosis, but given the intra-rater variability inherent in neuropathology, developing a system that can confirm or standardize diagnosis will improve the robustness of proteomic marker validation studies. The Cancer Digital Slide Archive is a web-based imaging repository that houses 26,000+ digital pathology images. Originally built for pathology data from the NCI’s The Cancer Genome Atlas (TCGA), it grew into the open-source Digital Slide Archive (DSA) platform. The platform supports web-based visualization of pathology images and integrated tools for both computer- and human-generated annotations, and it reads the majority of whole-slide image formats. Code is available on GitHub, and the REST-based API supports integration with other applications. The dockerized DSA platform supports local and cloud-based installation. This proposal will provide foundational support to adapt the DSA platform to serve the neuropathology community. Because the system was initially developed for cancer studies, it needs significant user interface (UI) refinement to support typical neuropathology workflows. We will improve our documentation, develop tutorials, and harden custom UIs we have prototyped. UI enhancements will simplify navigation between the slides, stains, and sections used in neuropathology assessments. This will involve developing an initial data model to standardize neuropathology data management. Additional features needed for safe, secure, and compliant image sharing and organization will also be evaluated. The HistomicsTK toolkit, developed by our group, has ~10,000+ installations a month via PyPI. We will optimize the HistomicsTK algorithms for neuropathology slide sets, with accompanying demos and documentation. | NIA |
Trevor Lujan | Boise State University | Role of Distortion Energy in Fibroblast-Mediated Remodeling of Collagen Matrices Launching a cloud-based application that enables the fast quantification of material anisotropy from two-dimensional images of fiber networks. Our parent R15 application uses a standalone software application developed in our lab, called FiberFit, to quantify differences in the fiber networks of cellular scaffolds. FiberFit was developed to automate the fast and accurate measurement of two structural properties that describe material anisotropy: fiber orientation and fiber dispersion. The accurate measurement of these parameters is of major importance in understanding structure-function relationships at molecular and macroscales in biological systems, and in the engineering of advanced materials, yet no standard software tool to compute these metrics has been broadly adopted. We will address this limitation by re-engineering FiberFit into a robust, intuitive, and sustainable cloud-based application. This project will emphasize ease of use and will include the development of supportive documentation and pre-processing features to create a turn-key solution that automates the accurate and transparent analysis of multiple two-dimensional image files. These images can be acquired from numerous imaging modalities at various length scales, and the materials can be biological, inorganic, or engineered. As a web-based application, FiberFit will utilize cloud storage technology to manage uploaded and generated data, and user accounts will allow analyzed images and project-specific settings to be stored and retrieved from cloud directories. | NIAMS |
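A sketch of FFT-based fiber orientation estimation, the class of computation FiberFit performs (its actual fitting procedure differs in detail); a toy image of parallel stripes stands in for real data.

```python
# Sketch: estimate the dominant orientation of a striped image from the
# angular distribution of its 2D power spectrum.
import numpy as np

yy, xx = np.mgrid[0:256, 0:256]
image = np.sin(2 * np.pi * (xx * np.cos(0.3) + yy * np.sin(0.3)) / 16)

power = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
ky, kx = np.mgrid[-128:128, -128:128]
mask = (kx**2 + ky**2) > 4          # drop the DC / lowest-frequency region

# Principal axis of spectral energy via the doubled-angle (tensor) method.
theta = 0.5 * np.arctan2((2 * kx * ky * power)[mask].sum(),
                         ((kx**2 - ky**2) * power)[mask].sum())
print(np.degrees(theta))            # spectral axis; fibers run 90 deg to it
```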
Rob Macleod | University of Utah | Integration of Uncertainty Quantification with SCIRun Bioelectric Field Simulation Pipeline Enabling cloud computing resources to enhance access to uncertainty quantification tools A rapidly emerging need in biomedical simulation is to quantify the uncertainty that arises because model predictions depend on inevitably incomplete knowledge of simulation parameters. Every simulation has errors (modeling assumptions, imprecise estimates of parameter values, and numerical discretization errors), and the scientist or physician using the results of a simulation should know and take into account the nature of those errors, i.e., how sensitive the results are to the assumptions and uncertainties of the settings and parameters that drive the simulations. Techniques to characterize this sensitivity are known collectively as uncertainty quantification (UQ). While mathematical methods for UQ have made significant recent progress, and verified methods and validated software tools that implement them are available, these tools are not easily accessible to biomedical scientists. The gaps in this process for much of biomedical science currently lie in the integration and easy availability of state-of-the-art UQ tools within simulation pipelines. The parent U24 project for this supplement request addresses the first of these gaps, integration: we are developing an open-source, Python-based software suite, UncertainSCI, for non-intrusively quantifying uncertainty due to a variety of parameters in biomedical simulations. However, while this tool holds considerable promise, its use requires considerable knowledge of software engineering as well as access to in-house computing resources, both of which may be lacking in many biomedical settings. Thus, the second gap, availability, is not addressed as well as might be desired. The overall goal of this supplement request is to enhance the availability of UncertainSCI by leveraging best practices in modern software development and advances in cloud computing. | NIBIB |
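A generic Monte Carlo illustration of the UQ idea, propagating an uncertain parameter through a stand-in model; UncertainSCI itself uses more efficient non-intrusive polynomial-chaos methods, and the model and parameter values here are invented.

```python
# Sketch: sample an uncertain parameter, run the model, summarize the spread.
import numpy as np

def simulate(conductivity: np.ndarray) -> np.ndarray:
    """Stand-in for an expensive bioelectric simulation."""
    return 3.0 / conductivity + 0.1 * conductivity**2

rng = np.random.default_rng(7)
sigma = rng.normal(loc=0.33, scale=0.03, size=10_000)  # uncertain parameter
out = simulate(sigma)
print(f"mean={out.mean():.3f}, std={out.std():.3f}")   # output uncertainty
```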
Brian Macwhinney | Carnegie Mellon University | Computational analysis of child language transcript data To promote FAIR-ness, web delivery, and system replication, we will modernize and containerize the software used by the CHILDES (Child Language Data Exchange System) database and programs. The goal of this supplement is to modernize the software infrastructure supporting the CHILDES Project. We will leverage best practices in software development and advances in cloud computing to improve the FAIR-ness of the data and system. To improve web compatibility and take full advantage of the new TalkBankDB database system, we will convert the current XML-based data model to JSON. This will allow us to improve data validation, repository checking, and cloud deployment. We will modularize and containerize all software components and integrate the containers through the Kubernetes system. These improvements will open the system to open-source development of analyses through R and Python, thereby widening the scope of empirical issues that can be addressed from the database. They will also allow us to replicate the full system at new sites internationally, thereby promoting sustainability. | NICHD |
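A sketch of the XML-to-JSON direction of the planned conversion using only the Python standard library; the element names are invented for illustration and do not reflect the actual TalkBank XML schema.

```python
# Sketch: parse a toy transcript fragment from XML and re-emit it as JSON.
import json
import xml.etree.ElementTree as ET

xml_doc = "<utterance speaker='CHI'><word>more</word><word>cookie</word></utterance>"

root = ET.fromstring(xml_doc)
record = {
    "speaker": root.get("speaker"),
    "words": [w.text for w in root.findall("word")],
}
print(json.dumps(record))  # {"speaker": "CHI", "words": ["more", "cookie"]}
```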
Mark Musen | Stanford | The Metadata Powerwash - Integrated tools to make biomedical data FAIR Modernizing the Protégé ontology-editing system: Enhanced ontology engineering through a Web-based, Cloud-based software architecture To support the needs for ontology engineering in our parent NLM R01 grant, we propose two specific aims to enhance the Web-based version of the Protégé ontology editor: (1) We will convert WebProtégé to a modern, microservice-based architecture, adding new microservices, including a plug-in architecture that will allow third parties to contribute novel additions to the WebProtégé code base. We will use a software-development approach that allows us to implement the new architecture in a controlled, incremental manner. (2) We will modernize WebProtégé to make it Cloud-native. We will take advantage of the NIH STRIDES initiative, containerizing the system for deployment on the Google Cloud Platform (GCP) and adapting the software to operate with Cloud-based third-party software for data storage, data queueing, and search. We will also migrate all current WebProtégé users and their projects to the Cloud-based system. Our work will benefit the biomedical community at large while enhancing our capabilities for ontology engineering as required by our existing grant. | NLM |
Towfique Raj, David Knowles | Icahn School of Medicine at Mount Sinai | Learning the Regulatory Code of Alzheimers Disease Genomes Refactoring and containerizing the LeafCutter RNA splicing pipeline for cluster and cloud infrastructures The parent award for the proposed work (U01 AG068880-01, “Learning the Regulatory Code of Alzheimer's Disease Genomes”) involves developing deep learning (DL) and machine learning (ML) models of pre- and post-transcriptional (in particular, RNA splicing) gene regulation in Alzheimer’s disease-associated cell types and states. The training data for the RNA splicing models are generated by applying our previously published splicing quantification tool, LeafCutter, to large-scale RNA-seq datasets. LeafCutter has also been used by consortia including GTEx and PsychENCODE for their splicing analyses. However, LeafCutter remains “early stage” software with a challenging install process on both cluster and cloud infrastructure. In this supplement we will refactor LeafCutter to use (1) software engineering best practices, including modularization and GA4GH schemas; (2) conda- and Docker-based installation; and (3) standard workflow languages (e.g., Nextflow) to enable straightforward and efficient deployment on the cloud at scale. | NIA |
Adam Resnick | Children’s Hospital of Philadelphia | Data Management and Portal for the INCLUDE (DAPI) Project User-ready tools and scalable workflows for INCLUDE datasets in the cloud: advancing brain imaging data management and analytics This supplement advances large-scale imaging dataset generation, management, and analytics with user-ready tools and cloud-based workflows. Although radiological images are routinely acquired in clinical care, the use of medical imaging data in a research context is limited by the technical expertise required for data preparation and analysis. Moreover, the need for large-scale, ML-ready imaging datasets for predictive analytics has been largely unmet due to a lack of tools and workflows that generalize across scanners and sites and that can be flexibly deployed in high-performance computing environments. To bridge these gaps, the goal of this supplement is to develop interoperable and scalable cloud-based workflows that enable AI/ML analytics with clinically acquired imaging data. We will integrate existing state-of-the-art software with cloud services that can be utilized in a user-friendly web-based platform (Flywheel). Additionally, we will establish pipelines that associate imaging-based ML features with features from other data modalities, creating rich, multi-modal datasets for advanced predictive analytics. This will ultimately support the goal of the parent grant in defining generalizable workflows for large-scale data processes, integrating data across dispersed sources, and providing harmonized datasets to bolster research on Down syndrome and co-occurring conditions. | NHLBI |
Panagiotis Roussos | Icahn School of Medicine at Mount Sinai | Understanding the molecular mechanisms that contribute to neuropsychiatric symptoms in Alzheimer Disease Dreamlet workflow to power analysis of large-scale single cell datasets Recent advances in single-cell and single-nucleus transcriptomic technology have enabled studying the molecular mechanisms of Alzheimer's disease (AD) at unprecedented resolution by profiling the transcriptomes of thousands of single nuclei from hundreds of post mortem brains from AD donors and controls. The parent grant has generated a compendium of single-nucleus transcriptome profiles comprising ~7.2M nuclei from ~1,800 total donors. Yet elucidating the molecular mechanisms of AD from these data requires scalable software and sophisticated statistical modelling. To address this challenge, we are developing the dreamlet package, a widely applicable framework for differential expression analysis of single-cell (and single-nucleus) RNA- and ATAC-seq data that models complex study designs in highly scalable analysis workflows. Dreamlet uses a pseudobulk approach, fitting a regression model for each gene (or open chromatin region) and cell cluster to test differential expression across individuals associated with a trait of interest. Use of precision-weighted linear mixed models enables accounting for repeated-measures study designs, high-dimensional batch effects, and varying sequencing depth or observed cells per biosample. Dreamlet further enables analysis of massive-scale single-cell RNA-seq and ATAC-seq datasets by addressing both CPU and memory usage limitations: it performs preprocessing and statistical analysis in parallel on multicore machines, can distribute work across multiple nodes on a compute cluster, and uses the H5AD format for on-disk data storage, enabling data processing in smaller chunks to dramatically reduce memory usage. The dreamlet workflow integrates easily into the Bioconductor ecosystem and uses the SingleCellExperiment class to facilitate compatibility with other analyses. Beyond differential expression testing, dreamlet provides seamless integration of downstream analyses, including quantifying sources of expression variation, gene set analysis using the full spectrum of gene-level t-statistics, testing differences in cell type composition, and visualizing results. | NIA |
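A sketch of the pseudobulk step dreamlet builds on; dreamlet itself is an R/Bioconductor package, so this is a language-neutral illustration in Python with random stand-in counts.

```python
# Sketch: sum raw counts within each donor x cell-cluster combination,
# yielding one expression profile per donor per cluster for regression.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n_cells, n_genes = 5000, 4
counts = pd.DataFrame(rng.poisson(1.0, (n_cells, n_genes)),
                      columns=[f"gene{i}" for i in range(n_genes)])
meta = pd.DataFrame({
    "donor": rng.choice([f"D{i}" for i in range(20)], n_cells),
    "cluster": rng.choice(["microglia", "astrocyte"], n_cells),
})

pseudobulk = counts.groupby([meta.donor, meta.cluster]).sum()
print(pseudobulk.head())  # one row per donor x cluster, ready for modeling
```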
David Rowe, Peter Maye, Dong-Guk Shin | University of Connecticut | High resolution 3D mapping of cellular heterogeneity within multiple types of mineralized tissues Revamp mGEA (Make GEO Accessible) to be available for a wide user base In this software engineering supplement project, we aim to revamp an important software tool, mGEA (Make GEO Accessible), which we have been using internally to identify candidate MERFISH probes for human knee and bone tissues, so that it can benefit a larger user base that needs to examine population-based reference gene expression data sets readily available from NCBI GEO. Examining GEO-deposited data can be very beneficial for HuBMAP users, since one can acquire a cohort of gene expression data sets from which tissue/organ-specific reference gene expression patterns can be mined. The importance of using population-based signals for probe design was hotly discussed during the recent HuBMAP FISH-assay meeting (March 15, 2021, organized by Dr. Ajay Pillai). Unfortunately, GEO has been mostly optimized for “data archiving”, and as such, use of deposited data by ordinary biologists, and even by computational scientists, has been severely limited. Our tool mGEA could dramatically lower that barrier. Difficulties in using GEO-deposited data include (i) associating experimental platform IDs (e.g., Affymetrix, Illumina, Agilent, etc.) with the gene symbols that biologists are mostly familiar with, and (ii) organizing which populations of samples (biological and technical replicates) can be grouped together and compared (e.g., treatment vs. control, KO population vs. WT, etc.). Using mGEA, scientists should be able to convert the archived data into biologist-friendly formats (e.g., an Excel spreadsheet with gene symbols, fold changes, and statistical sample-wise and gene-wise z-scores precomputed) within a few clicks in any web browser. If everything goes well, users should be able to convert a GEO-deposited data set into a format amenable to local exploration in less than 10 minutes using the tool's user-friendly visual GUI, although problematic cases requiring manual intervention may take longer. Making mGEA cloud-ready would benefit not only the members of the HuBMAP consortium but also constituents far beyond HuBMAP. With mGEA, the majority of wet-bench biologists should be able to explore GEO-deposited data, thus helping GEO unleash its intended power as an important community resource. | NIAMS |
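A sketch of the core transformation mGEA automates, computing fold changes and simple gene-wise z-scores for grouped samples; column names are invented, and real GEO series additionally require platform-ID-to-gene-symbol mapping.

```python
# Sketch: group samples into comparison arms and precompute per-gene stats.
import numpy as np
import pandas as pd

expr = pd.DataFrame(np.random.default_rng(9).normal(8, 1, (5, 6)),
                    index=[f"GENE{i}" for i in range(5)],
                    columns=["trt1", "trt2", "trt3", "ctl1", "ctl2", "ctl3"])

trt, ctl = expr[["trt1", "trt2", "trt3"]], expr[["ctl1", "ctl2", "ctl3"]]
log2_fc = trt.mean(axis=1) - ctl.mean(axis=1)   # data assumed log2 scale
z = log2_fc / expr.std(axis=1)                  # crude gene-wise z-score
print(pd.DataFrame({"log2FC": log2_fc, "z": z}))
```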
Nathan Salomonis | Cincinnati Childrens Hospital | Unbiased identification of spliceosome vulnerabilities across cancer Leveraging cloud workflows for splicing discovery As a central component of our funded NCI R01, “Unbiased identification of spliceosome vulnerabilities across cancer”, we have been extending and leveraging a comprehensive splicing analysis pipeline to define splicing vulnerabilities across human cancers and healthy tissues. The bioinformatics tools yielding these discoveries largely consist of distinct components of the large AltAnalyze open-source project, begun in 2008. To enable fast and comprehensive analyses of splicing in the cloud, we are updating and translating the primary splicing analysis components of AltAnalyze into a CWL pipeline, containerized with Docker. Analyze.cloud will be established as a Terra workflow for integrated supervised and unsupervised splicing analyses of large user datasets and controlled NIH-deposited datasets in the cloud. - Cloud workflow on Terra using CWL | NCI |
Luis Santana | University of California Davis | Multi-Scale Modeling of Vascular Signaling Units Multi-modal data analysis pipeline in the cloud to unify electrophysiology, Ca2+ imaging, super-resolution microscopy, and predictive modelling. Hypertension is one of the largest modifiable risk factors for cardiovascular disease, the leading cause of mortality for men and women. As more antihypertensive therapies become available, it has become clear that males and females respond differently to these treatments, yet the mechanisms behind these sex differences are largely unknown. The parent grant takes a multi-disciplinary approach to reveal the mechanisms of male and female hypertension and to build a detailed model that predicts how drugs can differentially alter vascular function between these groups. This is being achieved by comparing male and female vascular smooth muscle cells using a range of state-of-the-art multi-modal techniques, including electrophysiology, Ca2+ imaging, nano-scale super-resolution microscopy, and in silico predictive modelling. The goal of this supplemental project is to unify the analysis of data from these heterogeneous multi-modal techniques by creating an easy-to-use, reproducible, interoperable, and extensible software analysis pipeline. The pipeline will have both a back-end Python engine with an application programming interface (API) to ensure re-usability and a front-end cloud-based graphical user interface (GUI) to facilitate collaboration and sharing between groups with no programming experience. The pipeline will be hardened using best programming practices, including versioning, unit testing, and code documentation. To make the pipeline discoverable, it will be accessible through online open-source code sharing and installable from package managers. The pipeline will be containerized for one-click access, allowing the same code to be run on individual computers, local clusters, or in the cloud. A key design feature of this pipeline is that curated multi-modal data analysis will seamlessly provide input to in silico computational models, enabling robust ground-truth predictions of functional biophysical parameters. Our goal is to create a cloud-based analysis pipeline that merges heterogeneous multi-modal data and promotes reproducibility, sharing, and collaboration between multi-disciplinary research groups and, ultimately, the greater public. | NHLBI |
Matthew Silva | Washington University | Resource Based Center for Musculoskeletal Biology and Medicine Washington University Musculoskeletal Image analysis program (WUMI): a multi-platform open-source software for the visualization and evaluation of musculoskeletal imaging data. Under the parent grant P30 AR074992, the Washington University Resource-Based Center for Musculoskeletal Biology and Medicine supports the development, implementation, and evaluation of animal models for musculoskeletal biology and medicine. High-resolution imaging modalities such as MicroCT enable the evaluation of bone microstructure and morphology in musculoskeletal research. These advances in imaging technology have enabled investigators to answer crucial questions in health and disease in innovative ways. Yet the ability to evaluate and interpret these data is critically dependent on software tools that can conduct the analyses correctly, rapidly, and rigorously. Increasing imaging data size and complexity often outpaces the development of the tools to handle them. In addition, there is an increased need for remote analysis solutions that can be run on personal computers. Therefore, the premise of this work is driven by the need to develop software that addresses the imaging analysis needs of the musculoskeletal research community. To make a significant impact on the research community, the software must also be widely available, accessible, and easy to use. To address these needs, we have developed software scripts that support 3D visualization and histomorphometric analyses. This led to the creation of the Washington University Musculoskeletal Image analysis program (WUMI), an image analysis software package with a Graphical User Interface (GUI). Compared to available commercial and open-source options, WUMI has equal or superior capabilities, and it has empowered its users with flexible access to and evaluation of their datasets while reducing their reliance on commercial software. Yet, to date, WUMI has required a relatively high degree of technical expertise and is used by only a few power users. Thus, there is a need for additional development efforts to provide this software as a robust, open-source resource to the general research community, both at Washington University and beyond. Our overall objective is to rigorously validate the WUMI 3D histomorphometric analyses and improve the documentation and usability of the software while releasing multi-platform executables to the musculoskeletal research community. | NIAMS |
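To make the histomorphometric validation concrete, the sketch below computes one standard 3D measure, bone volume fraction (BV/TV), from a thresholded MicroCT volume. The file name, threshold, and voxel size are illustrative assumptions, and this is not WUMI's actual code.

```python
# Minimal sketch of bone volume fraction (BV/TV) from a MicroCT stack.
import numpy as np

volume = np.load("microct_stack.npy")        # hypothetical 3D array of grayscale voxels
bone_mask = volume > 1500                    # global threshold separating bone from background

bv_tv = bone_mask.sum() / bone_mask.size     # bone volume / total volume
voxel_mm = 0.01                              # assumed 10 um isotropic voxels
bone_volume_mm3 = bone_mask.sum() * voxel_mm**3
print(f"BV/TV = {bv_tv:.3f}, BV = {bone_volume_mm3:.2f} mm^3")
```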
Peter Sorger | Harvard | Systems Pharmacology of Therapeutic and Adverse Responses to Immune Checkpoint and Small Molecule Drugs Enhancing the MCMICRO data analysis pipeline to use standardized languages and best practices from genome science to facilitate multi-step processing of complex tissue images locally or on the cloud. Highly multiplexed tissue imaging is rapidly emerging as a means to study the properties of single cells in a preserved 3D environment. In a research setting, high-plex imaging provides new insight into the molecular properties of tissues and their spatial organization; in a clinical setting, high-plex imaging promises to augment traditional histopathological diagnosis with the molecular information needed to guide the use of targeted and immuno-therapies. High-plex tissue imaging yields subcellular-resolution data on 20-100 proteins or other biomolecules across 10^6 to 10^7 cells, encoded in up to 1 TB of data. The primary barrier to deeper analysis of highly multiplexed tissue imaging is the computational challenge of processing, managing, and disseminating images of this size. To address this challenge, we recently developed MCMICRO, a modular, open-source image processing system that uses either the Nextflow pipeline language or the Galaxy platform. This pipeline incorporates existing image processing code and an increasing number of newly developed modules. It can be deployed locally or on the commercial cloud (AWS and GCP) and is in use by multiple NIH/NCI-funded tissue atlas consortia, notably the Human Tumor Atlas Network (HTAN). The goal of our current work (under the auspices of the Office of Data Science Strategy) is to further engineer MCMICRO to (i) improve the overall performance of the individual modules through code profiling and optimization; (ii) standardize inputs and outputs to increase interoperability of individual processing steps; (iii) refine the general user and programmer documentation to promote continued contributions from the open-source community; and (iv) enable visualization of pipeline intermediate and final results directly in the cloud. We welcome participation from interested individuals, laboratories, and companies in the creation of a robust, foundational platform for highly multiplexed spatial profiling of human and animal tissues. | NCI |
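For readers unfamiliar with Nextflow-based pipelines, the sketch below shows the general pattern of launching MCMICRO on a local dataset, following the command style in the MCMICRO documentation; the input path is a placeholder, and current flags should be checked against the project docs.

```python
# Illustrative launch of MCMICRO via Nextflow from Python; identical commands
# run on a laptop, a local cluster, or cloud infrastructure.
import subprocess

subprocess.run(
    ["nextflow", "run", "labsyspharm/mcmicro", "--in", "exemplar-001"],
    check=True,
)
```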
Ingo Titze | University of Utah | The Role of the Vocal Ligament in Vocalization FAIR Software Development for Voice and Speech Simulation A major aim of the parent R01 grant entitled “The Role of the Vocal Ligament in Vocalization” is to quantify the fundamental frequency (pitch) range and vocal fold vibration stability. This is achieved by measuring ligament properties and embedding them into a finite element computer model. The fiber-gel model is a primary mathematical simulation component of a software package developed over 40 years by investigators at multiple institutions. It will be shared with users across the nation. Its general application is to predict vocalization outcomes when structural and motor activation inputs are modified with surgery, therapy, or voice training. Muscle activation plots are produced that show how fundamental frequency varies with cricothyroid and thyroarytenoid activation. Laryngeal framework mechanics of translation and rotation of the cricothyroid joint and the cricoarytenoid joint are used to calculate vocal fold strain (elongation). The role of the ligament in control of fundamental frequency (perceptually known as pitch) will be quantified in humans and in non-human species. Results from non-human species provide insights into alternative solutions for surgical and behavioral treatment. The goals of the supplement are (1) to port the code from Fortran to Python, (2) to develop user interfaces that clinicians and voice trainers can use in daily practice, and (3) to disseminate the software package to a wide group of users through national voice and speech organizations. | NIDCD |
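The link between activation, elongation, and pitch can be seen in the ideal-string relation F0 = (1/2L)·sqrt(sigma/rho), which is a common starting point for fundamental-frequency rules in vocal fold models. The worked example below uses illustrative numbers, not measured ligament properties, and shows the Python target language of the planned port.

```python
# Worked example of the ideal-string relation for fundamental frequency:
# F0 rises with longitudinal tissue stress and falls with vibrating length.
import math

L = 0.016        # vibrating vocal fold length in meters (~16 mm, assumed)
sigma = 30e3     # longitudinal tissue stress in Pa (assumed)
rho = 1040.0     # tissue density in kg/m^3

f0 = (1.0 / (2.0 * L)) * math.sqrt(sigma / rho)
print(f"F0 = {f0:.0f} Hz")   # about 168 Hz with these illustrative values
```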
Doug Whalen | Haskins Laboratories | Links Between Production and Perception in Speech Making DeepEdge, a tool for ultrasound analysis for speech, cloud-accessible. A neural network-based tool for automatically tracking tongue contours in ultrasound video This project is a supplement to the parent NIH award (DC-002717), “Links Between Production and Perception in Speech”, intended to enhance software tools for open science. The parent project investigates the variability and flexibility of speech production through the human vocal tract, and how listeners perceive the phonologically meaningful primitives within the speech signal, through a series of speech production and perception experiments. In the speech production experiments, ultrasound imaging provides a non-intrusive and cost-effective technique for measuring tongue movements. To facilitate the analysis of tongue measurements in ultrasound images, we developed a software program, DeepEdge, that tracks midsagittal tongue contours automatically on a proprietary platform. The aim of this supplement project is to migrate the program to a cloud platform, with other improvements, to increase robustness, interoperability, sustainability, portability, and performance. | NIDCD |
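The core task is per-frame contour inference over an ultrasound video. The sketch below illustrates that loop with a generic convolutional network; the model, weights file, and output convention (contour points as x,y pairs) are hypothetical and do not represent DeepEdge's actual architecture.

```python
# Illustrative per-frame tongue-contour inference on ultrasound video.
import cv2
import torch

model = torch.jit.load("tongue_contour_net.pt").eval()   # assumed TorchScript export

cap = cv2.VideoCapture("ultrasound_clip.mp4")
contours = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    x = torch.from_numpy(gray).float().div(255).unsqueeze(0).unsqueeze(0)
    with torch.no_grad():
        pts = model(x).reshape(-1, 2)     # predicted midsagittal contour points
    contours.append(pts.numpy())
cap.release()
```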
Travis John Wheeler | University of Montana | Machine learning approaches for improved accuracy and speed in sequence annotation Improving robustness, release, and quality of genome annotation software through extensive software engineering practices. The goal of the parent grant for this supplement is to develop machine learning approaches that improve both the accuracy and the speed of highly sensitive sequence database search and alignment. We have developed three software tools in support of this effort to annotate genomes correctly: (i) ULTRA, which labels repetitive sequence; (ii) PolyA, which integrates such labels with other sequence annotations in a probabilistic framework, computing uncertainty and improving accuracy; and (iii) SODA, a library that aids in the visualization of annotations and supporting evidence. The effort supported by this supplement will refactor these software tools and their documentation to improve robustness and reliability, and will improve their availability through package management systems and incorporation into cloud-based analysis frameworks. | NIGMS |
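The adjudication idea behind PolyA, attaching a confidence to each competing annotation rather than simply keeping the top scorer, can be illustrated with a toy example. This is a simplification for intuition only, with invented labels and an arbitrary temperature, not PolyA's actual probabilistic model.

```python
# Toy conversion of competing annotation scores for one genomic window into
# normalized, posterior-like confidences.
import numpy as np

candidates = {"AluY": 312.0, "AluSx": 295.0, "L1MA4": 140.0}   # hypothetical scores
scores = np.array(list(candidates.values()))

weights = np.exp((scores - scores.max()) / 10.0)   # temperature of 10 chosen arbitrarily
posterior = weights / weights.sum()
for name, p in zip(candidates, posterior):
    print(f"{name}: {p:.3f}")                       # confidence attached to each label
```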
Guorong Wu | University of North Carolina Chapel Hill | A Scalable Platform for Exploring and Analyzing Whole Brain Tissue Cleared Images Promoting public science and data sharing for 3D nuclei segmentation through a cloud-based annotation system. In our ongoing R01 project, we aim to develop a high-throughput 3D nuclei segmentation platform for the neuroscience field to analyze large-scale microscopy images derived from cleared tissue in a standardized manner, across multiple labs. Since the success of our learning-based nuclei segmentation method requires a large pool of manually-annotated 3D nuclei for both training and validation, the overarching goal of this supplement project is to refactor our current stand-alone annotation software into a “cloud-ready” solution, called “Ninjatō”, as an analogy to the Swiss army knife for 3D nuclei annotation. We will use VTK.js (an open-source JavaScript library for scientific visualization) and Resonant (an open-source data analytics software platform) to implement a client-server architecture for data acquisition, cloud-based processing, and data management components. The output of this administrative supplement project is a new cloud-based 3D nuclei annotation system, which will allow us to (1) significantly augment the pool of manual annotations, (2) further improve the accuracy and robustness of our learning-based nuclei segmentation engine, (3) make annotation data AI/ML-ready using standardized data management, and (4) support the neuroscience community with enhanced extensibility to new analytic tools. | NINDS |
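The server half of such a client-server flow is essentially an annotation store behind a small HTTP API. The actual system builds on Resonant, so the Flask sketch below, with hypothetical routes and payload fields, is only an illustration of the data exchange between an annotation client and cloud storage.

```python
# Minimal sketch of an annotation-upload API; in-memory list stands in for
# cloud-backed storage, and the JSON schema is invented for illustration.
from flask import Flask, jsonify, request

app = Flask(__name__)
annotations = []

@app.post("/annotations")
def add_annotation():
    payload = request.get_json()           # e.g., {"volume_id": ..., "nuclei": [[x, y, z], ...]}
    annotations.append(payload)
    return jsonify(id=len(annotations) - 1), 201

@app.get("/annotations/<int:ann_id>")
def get_annotation(ann_id: int):
    return jsonify(annotations[ann_id])    # served back to the browser client for review

if __name__ == "__main__":
    app.run()
```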
Yana Yunusova | Sunnybrook Research Institute | The development and validation of a novel tool for the assessment of bulbar dysfunction in ALS Creating a user-friendly and intuitive cloud-based platform for research and clinical assessment of bulbar dysfunction The current healthcare environment affords an unprecedented opportunity for the development and implementation of new remote platforms for the assessment of various motor behaviours, including speech and orofacial gestures. As part of our parent grant, we recently created VirtualSLP, a software tool that supports remote, online, multi-modal data collection (high-quality video/kinematic and audio/acoustic) using multi-platform-compatible, web-browser-based audio and video recording. In parallel, we have designed a tool for the automatic extraction of kinematic and acoustic metrics, which provides clinically interpretable features/measures for detecting and tracking the onset and progression of oro-motor and speech (i.e., bulbar) impairments in neurological diseases. The next step in our technology development efforts is to incorporate the automatic metrics-extraction software within VirtualSLP, creating a user-friendly and intuitive cloud-based platform for research and clinical assessment of bulbar dysfunction. To achieve rapid development and deployment of VirtualSLP, we will use modern engineering methodology and user-centered design to iterate through the steps of the software development cycle. The cycle will begin by engaging end users (e.g., researchers, clinicians, and patients) to document their current and anticipated end-to-end experiences with the software, while performing a baseline analysis of the existing software components and determining and implementing the necessary changes. The necessary components will be incorporated to create VirtualSLP on the Amazon Web Services (AWS) platform, enhancing its usability, interoperability, scalability, and security as part of the NIH STRIDES initiative. Usability testing of VirtualSLP with end users will be performed throughout the development process and at the end of the development cycle. When completed, the work will result in an enhanced software tool for the collection of audio and video data, with corresponding AI-based analytics, to be used in the context of clinical research. VirtualSLP will further support our ongoing work on the clinician-administered tool for bulbar ALS assessment and monitoring (ALS-Bulbar Dysfunction Index) by providing a novel cloud-based platform for its clinical validation. This work aims to exemplify the intent of the current funding opportunity by supporting collaborations between clinical speech scientists, data scientists, and software engineers to enhance the design, implementation, and "cloud-readiness" of research software. | NIDCD |
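As one example of the clinically interpretable acoustic measures such a metrics-extraction component could compute, the sketch below derives fundamental-frequency statistics from a recording with librosa; the file name is a placeholder, and this is not the project's actual extraction code.

```python
# Sketch of a basic acoustic metric (F0 statistics) from a speech recording.
import librosa
import numpy as np

y, sr = librosa.load("speech_sample.wav", sr=None)
f0, voiced, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)

f0 = f0[voiced & ~np.isnan(f0)]                   # keep voiced, valid frames only
print(f"median F0: {np.median(f0):.1f} Hz, F0 variability (SD): {np.std(f0):.1f} Hz")
```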
Antonella Zanobetti | Harvard | National Cohort Studies of Alzheimer's Disease, Related Dementias and Air Pollution Refactor and redesign two novel statistical R packages, leveraging professional software engineers to deploy open-source, efficient software, reaching a broad community of users and developers to facilitate nationwide studies on environmental public health. The goals of the parent grant are to: 1) conduct national epidemiological studies of Medicare and Medicaid claims to estimate the effects of long-term exposure to air pollution (PM2.5 and ozone) on Alzheimer's disease and related dementias (ADRD) hospitalization and disease progression; 2) apply machine learning methods to identify co-occurrences of individual-level, environmental, and societal factors that lead to increased vulnerability; 3) develop statistical methods to disentangle the effects of air pollution exposure from other confounding factors and to correct for potential outcome misclassification. To address these aims, we developed two R packages: Causal Rule Ensemble (CRE) (Aim 2) and Gaussian processes for the estimation of causal exposure-response curves (GP_CERF) (Aim 3). With this administrative supplement, we plan to refactor and redesign these packages, converting them into robust, easy-to-maintain, and efficient software that reaches a broader set of users and developers from the open-source community. We will review the numerical implementation from an algorithmic-design point of view, following standard version control, unit testing, and continuous integration practices. We will implement infrastructure to run the packages on shared- and distributed-memory computational nodes to meet cloud-readiness requirements. Toward this goal, we will use a broad spectrum of approaches to handle big data. In addition to R, the packages will also be implemented in Python3. Stable versions of the packages will be hosted on CRAN and PyPI. | NIA |
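Since the packages will also be implemented in Python3, the sketch below illustrates the general technique behind GP_CERF: fitting a smooth exposure-response curve with a Gaussian process. The data are synthetic, and this deliberately omits the causal adjustment for confounders that the package itself implements.

```python
# Illustration of a GP-estimated exposure-response curve on synthetic data.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
exposure = rng.uniform(2, 20, 200)[:, None]                    # e.g., PM2.5 in ug/m^3
outcome = 0.02 * exposure.ravel() + rng.normal(0, 0.05, 200)   # synthetic risk signal

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(exposure, outcome)

grid = np.linspace(2, 20, 50)[:, None]
curve, sd = gp.predict(grid, return_std=True)                  # pointwise uncertainty band
print(curve[:5], sd[:5])
```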