Olusola Ajilore | University of Illinois, Chicago | Unobtrusive Monitoring of Affective Symptoms and Cognition using Keyboard Dynamics Refactor the BiAffect codebase to enable collaborative open science. The parent grant, “Unobtrusive Monitoring of Affective Symptoms and Cognition using Keyboard Dynamics” (UnMASCK), uses a novel digital technology (“BiAffect”) to study cognitive dysfunction in the context of mood disorders. BiAffect leverages smartphone keyboard dynamics metadata to unobtrusively and passively monitor cognitive function. Its core technology is a custom-built smartphone virtual keyboard that replaces the native default keyboard, allowing the collection of real-time data of potential clinical relevance while individuals interact with their device as usual within their natural environment. Working with Sage Bionetworks, we will refactor the BiAffect codebase to enable more robust multi-developer contribution and version control. We also plan to create standardized data processing pipelines to support collaborations with researchers who have varying levels of capacity for data science and engineering. The result will be an innovative, meaningful contribution to collaborative open science, modernizing the biomedical research data ecosystem and making data findable, accessible, interoperable, and reusable (FAIR). - With Sage Bionetworks, refactor the BiAffect codebase for collaborative open science, so that other developers can contribute code and collaborators have pipelines to contribute data. | NIMH |
Pamela Bjorkman | California Institute of Technology (PI) | Developing Immunogens to Elicit Broadly Neutralizing anti-HIV-1 Antibodies Enhancing antibody and variant resources by adopting public data streams, standard input formats, and open-source integration The goal of the parent project is to advance our germline-targeting approach to HIV-1 vaccine design through cycles of immunogen design and testing. Similar strategies guided by antibody discovery are showing exciting promise with a range of other difficult pathogens, such as influenza, malaria, hepatitis C, dengue, and Zika virus. As part of our project's integrated approach to immunogen design, selection, and evaluation, the Bjorkman lab has developed the software package HIV Antibody Database, designed to enable frictionless access to, comparison of, and analysis of broadly neutralizing anti-HIV antibody sequences, structures, and neutralization data. More recently, a sister project addressing COVID-19, Variant Database, can quickly search SARS-CoV-2 genome datasets containing millions of sequences. This administrative supplement will build on these codebases to improve their robustness and sustainability by integrating publicly available data streams, adopting standard formats for input and output of structural data, and optimizing performance. We will enhance HIV Antibody Database to automatically download data from the Los Alamos National Laboratory CATNAP database. Antibody Database will also adopt PDBx/mmCIF, a modern, extensible structure file format, and use the RCSB REST-based API for online structure searches. We will ensure that high-performance graphics frameworks (e.g., Metal) are fully utilized. Antibody Database will also be extended for use with other viruses, including access to SARS-CoV-2 data through Variant Database. The build process for Variant Database, an open-source tool, will be improved to use a modern package manager. To facilitate wider use, a Python API for interacting with Variant Database will be developed. - Refactor antibody and variant resources. | NIAID |
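The entry above mentions using the RCSB REST-based API for online structure searches. Below is a minimal sketch of such a search, assuming the public RCSB Search API v2 endpoint and its JSON query schema; this is an illustration, not Antibody Database's actual code.

```python
# Sketch of an online structure search against the RCSB Search API.
# Endpoint and JSON schema follow the public v2 API as of this writing;
# adjust the query if the service has changed.
import json
import requests

query = {
    "query": {
        "type": "terminal",
        "service": "full_text",
        "parameters": {"value": "HIV-1 broadly neutralizing antibody"},
    },
    "return_type": "entry",
}

resp = requests.post(
    "https://search.rcsb.org/rcsbsearch/v2/query",
    data=json.dumps(query),
    headers={"Content-Type": "application/json"},
    timeout=30,
)
resp.raise_for_status()
for hit in resp.json().get("result_set", []):
    print(hit["identifier"], hit.get("score"))  # PDB IDs with relevance scores
```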
John Buse | University of North Carolina Chapel Hill | CAMP FHIR: Lightweight, Open-Source FHIR Conversion Software to Support EHR Data Harmonization and Research Practical, lightweight data standardization: improving CAMP FHIR software to map clinical common data models to FHIR Clinical common data models (CDMs), such as PCORnet, OMOP, and i2b2, aim to ease data harmonization and interoperability, thereby fostering collaboration. However, when institutions support different CDMs, cross-institutional data sharing can be impeded. HL7 Fast Healthcare Interoperability Resources (FHIR) is an increasingly key standard for interoperability and data exchange that aims to address these issues, but mapping an individual CDM to FHIR is resource-intensive. The parent grant developed the CAMP FHIR software to advance the clinical and translational science mission of the NC TraCS center, both in North Carolina and for national CTSA goals. By offering a cloud-based, open-source application that can transform multiple types of input data to FHIR, CAMP FHIR makes it easier and more efficient for organizations with varying data environments to participate in new clinical research partnerships. The aims of this administrative supplement are to (1) make CAMP FHIR more robust by implementing additional software development best practices, (2) improve interoperability by adding support for bidirectional data transformation and new high-value FHIR resources, and (3) improve usability and accessibility by developing a graphical user interface, adding support for new input types, and ensuring cloud-readiness. - Map to FHIR from different clinical data models. | NCATS |
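To make the CDM-to-FHIR mapping concrete, here is an illustrative sketch (not CAMP FHIR itself, which is a separate application) of transforming one row of an OMOP-style person table into a FHIR R4 Patient resource; the OMOP field handling is simplified for the example.

```python
# Illustrative CDM-to-FHIR transform: OMOP person row -> FHIR Patient.
# 8507/8532 are the standard OMOP gender concept IDs for male/female.
OMOP_GENDER_TO_FHIR = {8507: "male", 8532: "female"}

def person_to_fhir_patient(row: dict) -> dict:
    """Map an OMOP person record (as a dict) to a FHIR Patient resource."""
    return {
        "resourceType": "Patient",
        "id": str(row["person_id"]),
        "gender": OMOP_GENDER_TO_FHIR.get(row["gender_concept_id"], "unknown"),
        "birthDate": f'{row["year_of_birth"]:04d}-{row["month_of_birth"]:02d}-{row["day_of_birth"]:02d}',
    }

print(person_to_fhir_patient(
    {"person_id": 42, "gender_concept_id": 8532,
     "year_of_birth": 1980, "month_of_birth": 7, "day_of_birth": 4}
))
```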
Vince Calhoun | Georgia State University | Multivariate methods for identifying multitask/multimodal brain imaging biomarkers GIFTwrap: a containerized and FAIR cloud-based implementation of the widely used GIFT toolbox This supplement will usher in the next phase of the widely used GIFT software (Group ICA of fMRI Toolbox), which provides data-driven methods for capturing intrinsic functional and structural brain networks. GIFT has been continuously updated and extended over the past twenty years and has a large amount of functionality not available in other tools, including 20 different Independent Component Analysis (ICA) algorithms. However, GIFT is primarily based on a standalone software development and analysis model. We will extend GIFT to interface with and leverage modern software tools, to facilitate comparability across ICA analyses, and to fully engage in the community development movement. The GIFTwrap implementation will be a containerized Python tool that will also be accessible via a cloud interface. The work will open up access to a wide suite of approaches, including dozens of different ICA approaches, functional network connectivity, independent vector analysis (a generalization of ICA to multiple datasets), dynamic functional network connectivity, spatial dynamics, connectome visualization, and much more. We have three main goals: (1) improve the architecture to facilitate FAIR principles and modernize the tools; (2) deploy the GIFT tools, especially our robust NeuroMark and auto-labelling approaches that facilitate comparability across analyses, as a brain imaging data structure app (BIDS app) for easy use and integration into modern analysis frameworks (see the sketch below), and deploy them in cloud-based analytic platforms (e.g., brainforge); and (3) provide a cloud interface through which individuals can run fully automated ICA analyses with a simple upload of data. | NIBIB |
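The BIDS-app packaging in goal (2) implies the standard BIDS-Apps command-line contract (bids_dir, output_dir, analysis level). A minimal sketch of such an entry point follows; the ICA call itself is a hypothetical placeholder, not GIFTwrap's actual interface.

```python
# Sketch of a BIDS-app entry point per the BIDS-Apps CLI convention.
import argparse

parser = argparse.ArgumentParser(description="GIFTwrap BIDS app (sketch)")
parser.add_argument("bids_dir", help="root of a BIDS-formatted dataset")
parser.add_argument("output_dir", help="where derivatives are written")
parser.add_argument("analysis_level", choices=["participant", "group"])
args = parser.parse_args()

if args.analysis_level == "group":
    print(f"running group ICA on {args.bids_dir} -> {args.output_dir}")
    # run_group_ica(args.bids_dir, args.output_dir)  # hypothetical entry point
```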
Naomi Caselli | Boston University | Effects of input quality on ASL vocabulary acquisition in deaf children Use best practices in software development to make the sign language assessments developed under the parent grant openly available, and to build a robust platform for collecting and tagging these data as a step towards large-scale machine-readable sign language datasets. The majority of deaf children experience a period of limited exposure to language (spoken or signed), which has cascading effects on many aspects of cognition. The parent project aims to understand how children build a vocabulary in sign language, and whether and how this differs for deaf children who have limited exposure to a sign language. This includes developing American Sign Language (ASL) proficiency tests and using these tests to examine how children learn ASL. The supplement will use best practices in software development to make these ASL proficiency tests widely available in the community. We will do so by converting a code base our lab has already developed (a single-use application for collecting sign language video data) into a platform that can be used to collect and tag video data of all kinds. Not only will this platform be used to make the ASL tests accessible, but it will also be made publicly available to other sign language researchers. This project will remove significant barriers in the field of sign language research, making it much more efficient to develop large-scale machine-readable sign language datasets. Additionally, making the ASL tests widely accessible in the community will help clinicians and researchers track language acquisition among deaf children. | NIDCD |
Connie Celum | University of Washington | Viroverse: Bedside through Bioinformatics database retrieval system Transitioning the Viroverse specimen repository database, lab notebook, and retrieval and visualization system to a cloud-ready application based on the Python Django framework The parent grant supports the Retrovirology and Molecular Data Sciences (RMDS) Core, a component of the University of Washington / Fred Hutchinson Center for AIDS Research. The largest emphasis of this core is the development and dissemination of new software tools, databases, and custom applications for bench support, data management, and visualization across a wide spectrum of bench science-based activities conducted locally, nationally, and internationally. The centerpiece of this effort is the laboratory information management system Viroverse, which provides a highly flexible specimen repository database, lab notebook, and retrieval and visualization system for data acquired from the bedside through the laboratory bench and bioinformatic analysis. The supplement award will enable enhancement of the Viroverse feature set and produce a modern, sustainable, cloud-ready database platform. We will migrate Viroverse from Perl/Catalyst to the Python Django framework, providing a code base that is easy for programmers to contribute to, allows wide community adoption, benefits from intense security monitoring, reporting, and patching, and leverages the many publicly available libraries and modules as well as a broad base of experienced developers. Importantly, this will allow for multiple authentication backends and support industry-standard encryption and single sign-on (SSO) technology. The source code for Viroverse will continue to be made freely available, and we will generate and release a containerized (Docker) version. Finally, we plan to add significant unit testing and a continuous integration pipeline to ensure overall application stability while maintaining gatekeeper review and merge authority over the codebase as it is developed further. | NIAID |
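For the multiple authentication backends and SSO support described above, the relevant Django mechanism is the AUTHENTICATION_BACKENDS setting. A sketch follows; the SAML backend name is purely hypothetical, and a real deployment would use a maintained SAML/OIDC package.

```python
# Sketch of Django settings supporting local accounts alongside SSO.
AUTHENTICATION_BACKENDS = [
    "django.contrib.auth.backends.ModelBackend",  # local username/password accounts
    "viroverse.auth.ViroverseSAMLBackend",        # hypothetical SSO backend name
]

# Standard Django settings that enforce encrypted transport and cookies.
SECURE_SSL_REDIRECT = True
SESSION_COOKIE_SECURE = True
CSRF_COOKIE_SECURE = True
```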
Melissa Cline | University of California Santa Cruz | Eliminating variants of uncertain significance in BRCA1, BRCA2 and beyond Resources for sharing knowledge on the genetic risk of disease An estimated 10% of all cancers arise through inherited genetic risk conferred by harmful variants in the patient’s germline DNA. One particularly vivid example is Hereditary Breast and Ovarian Cancer (HBOC) Syndrome, arising through harmful variation in the BRCA1 and BRCA2 genes. The lifetime risk of breast or ovarian cancer ranges between 42% and 70% for women who inherit a harmful BRCA variant, versus the average risk of 11% in the U.S. population. These cancers can often be prevented through detection and clinical management of the genetic risk, provided the risk can be recognized. Most BRCA variants currently have no known clinical impact. The BRCA Exchange was launched in 2016 with the goal of developing new approaches to share data on BRCA variants to catalyze variant interpretation, with BRCA1/2 in HBOC serving as exemplars for additional genes and heritable disorders. Today, roughly 3,000 users per month visit the site for BRCA variant data aggregated from public variation repositories, along with additional annotation data curated by the ENIGMA Consortium, the internationally recognized organization for the expert interpretation of BRCA variants. This has inspired other research consortia to reach out about launching variant data exchanges for the genes and heritable disorders under their purview. With this supplement, we propose to refactor the BRCA Exchange software to improve its modularity, reusability, and cloud readiness. By refactoring the data integration pipeline and database management, we will add flexibility to the data model, allowing external consortia to integrate the data that is most informative for their variants. By integrating the pipeline with cloud APIs, we will enable external consortia to run the pipeline on NIH secure cloud platforms, alleviating the need for an internal server. Finally, we will produce a simplified front end, which collectively will allow external consortia to build and run their own variant data exchanges. We anticipate that these developments will catalyze research in pediatric and diffuse gastric cancers, as well as contribute valuable new functionality to the parent grant. | NCI |
Stephania Cormier | Louisiana State Baton Rouge | LSU Superfund Research Center - Environmentally Persistent Free Radicals “Cloud-based multilevel mediation analysis for investigating adverse effects of particulate matter from hazardous waste remediation and individual risk factors on respiratory health”. The goal of our project is to create a scalable, open-source, comprehensive toolset for performing multilevel mediation analyses. The link between particulate matter (PM) exposure and poor respiratory health is well established. As part of our parent grant, the LSU Superfund Research Center postulates that environmentally persistent free radicals (EPFRs) present on PM from hazardous waste sites, incinerators/chemical fires, and other sources are the missing mechanistic link between PM exposure and poor respiratory health. We will investigate the mediation of this association using hierarchical mediation analysis to decompose the adverse respiratory effects of air pollutants into direct and indirect (EPFR-mediated) effects. We developed a multilevel mediation analysis method that allows longitudinal assessments of residential environments and individual risk factors to be jointly utilized in determining the mechanistic link between exposure to PM and poor respiratory health. This method is available in our R package, Multilevel Mediation Analysis (mlma). However, effective use of mlma requires knowledge of the R programming language, including manipulating datasets within the R environment and writing commands in R. In addition, the R software must be downloaded and installed on individual computers to perform the analysis. In large or complex applications, computational resources may become heavily taxed due to the high computational burden of the software algorithms. These limitations hinder broader use of the method in research and applications. To address these limitations and facilitate wider adoption of our comprehensive method for hierarchical mediation analysis, we will develop a more robust cloud-based application of our R package mlma by (1) creating an interactive visual interface that allows users to easily import datasets and build conceptual mediation model frameworks using drag-and-drop functionality, (2) translating the R code into an efficient low-level language (e.g., C or C++) that employs a high-performance parallel computing model to improve computational speed, and (3) enhancing usability, speed, storage, and the ability to handle big data by making use of clustering technology in the cloud. | NIEHS |
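For readers unfamiliar with mediation analysis, the single-level, single-mediator decomposition below shows the direct/indirect split the entry refers to; mlma generalizes this to multilevel data with multiple mediators.

```latex
% Single-level, single-mediator case, shown for intuition only.
\begin{align*}
M_i &= \alpha_0 + a\,X_i + \epsilon_{M,i} \\
Y_i &= \beta_0 + c'\,X_i + b\,M_i + \epsilon_{Y,i} \\
\text{total effect of } X \text{ on } Y
    &= \underbrace{c'}_{\text{direct}} + \underbrace{a \times b}_{\text{indirect (mediated)}}
\end{align*}
```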
Adam Eggebrecht | Washington University | Illuminating development of infant and toddler brain function with DOT Extending our NeuroDOT tools to develop robust and efficient software for photometric data-anatomy registration and data fidelity assurance for fNIRS and HD-DOT data, and our NLA software tools for general connectome-wide statistical analyses. The long-term goal of the parent BRAINS R01 (R01MH122751, ‘Illuminating development of infant and toddler brain function with DOT’) is to advance high-density diffuse optical tomography (HD-DOT) methods for evaluating brain-behavior relationships in infants and toddlers at risk for developing autism spectrum disorder (ASD) while they are awake and engaged within a naturalistic setting. Funding from this administrative supplement (NOT-OD-21-091) will promote modernization of our growing components of the data-resources ecosystem by providing crucial support to (1) refactor our Matlab-based NeuroDOT/NLA toolboxes in Python with enhanced development tools and standardization, (2) establish cloud readiness of NeuroDOT/NLA, and (3) expand documentation and support our community of users and developers with tutorials, workshops, and hackathons. Successful completion of these aims will both complement and extend the impact of the parent R01, not only in uncovering longitudinal patterns of covariation between brain function and behavior that may provide novel predictive diagnostic value, but also in harmonizing methods and strategies for high-fidelity optical functional brain mapping that will be crucial to ongoing investigations of brain function during engaged behavior in infants and toddlers. | NIMH |
Evelina Fedorenko | MIT | The neural architecture of pragmatic processing Establishing a common language in human fMRI: linking the traditional group-averaging fMRI approach and functional localization in individual brains through probabilistic functional atlases for four high-level cognitive networks. The parent project examines the contributions of three communication-relevant brain networks (the language network, the social cognition network, and the executive-control network) to pragmatic reasoning, the ability to go beyond the literal meaning to understand the intended meaning. The project adopts the ‘functional localization’ fMRI approach, where networks of interest are defined functionally in each individual brain. Although this approach is superior to the traditional group-averaging fMRI approach, it is not always feasible, and it is unclear how to relate findings from studies that rely on these disparate approaches. We will develop and make publicly available probabilistic functional atlases for four brain networks critical for high-level cognition, based on data from extensively validated functional ‘localizer’ paradigms collected under the parent award and in prior work. Such atlases, based on overlaying large numbers of activation maps, capture not only the areas of most consistent response but also the inter-individual variability in the locations of functional areas. These probabilistic representations of the network landscapes can therefore help estimate the probability that any given location in the common brain space belongs to a particular functional network. In this way, probabilistic atlases can provide a critical bridge between two disparate approaches in human fMRI (traditional group-averaging and functional localization in individual brains) as well as link fMRI work with lesion-behavior patient investigations. The ability to more straightforwardly compare findings across studies is bound to lead to more robust, replicable, and meaningful science in our understanding of human communication and related abilities. | NIDCD |
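At its core, the atlas construction described above reduces to averaging binarized individual activation maps in a common space. A toy numpy sketch, with random stand-in data:

```python
# Probabilistic atlas sketch: the voxelwise mean of binary subject maps
# estimates P(voxel belongs to the network).
import numpy as np

rng = np.random.default_rng(0)
n_subjects, shape = 100, (91, 109, 91)           # MNI-like grid, toy data
maps = rng.random((n_subjects, *shape)) > 0.9    # stand-in binarized maps

atlas = maps.mean(axis=0)  # proportion of subjects activating each voxel
print(atlas.shape, atlas.max())
```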
Alexander Fleischmann | Brown University | Odor Memory Traces in the Mouse Olfactory Cortex Open-source software tools to enhance the processing, reproducibility and shareability of integrated multimodal physiology and behavioral data A major challenge for neuroscience research is the complexity and size of multimodal data sets, which often include a combination of calcium imaging, electrophysiology, behavioral tracking through video and other sensors, and electrical or optogenetic stimulation. Acquisition and analysis of such data typically rely on a mix of vendor-built and custom workflows, with diverse file formats and software applications required at various stages of the analysis pipeline. We will improve and extend our calcium imaging and behavior analysis pipeline, built around the Neurodata Without Borders (NWB) standard, into a general-purpose, cloud-enabled tool for managing and analyzing systems neuroscience data. We will generalize the pipeline from our current data format to create a well-documented, user-friendly application programming interface (API). We will expand integration checks to include Microsoft Windows and macOS and disseminate the outcome as a Python package through standard repositories like PyPI and Conda-Forge. We will adapt the pipeline to enable automatic saving to the cloud, generalize the pipeline to other open-source software tools, and create a graphical user interface to complement the current command-line interface. Together, these enhancements to an already in-use data analysis pipeline will provide a streamlined framework for use by other labs and enhance the reproducibility and shareability of integrated neural activity and behavioral data. | NIDCD |
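A minimal sketch of the NWB standard the pipeline is built around, writing a toy acquisition with pynwb; names and values are illustrative only.

```python
# Sketch: create an NWB file, attach one behavioral time series, write to disk.
from datetime import datetime, timezone
import numpy as np
from pynwb import NWBFile, TimeSeries, NWBHDF5IO

nwbfile = NWBFile(
    session_description="odor discrimination session (toy example)",
    identifier="session-0001",
    session_start_time=datetime.now(timezone.utc),  # must be timezone-aware
)
speed = TimeSeries(name="running_speed", data=np.zeros(1000),
                   unit="cm/s", rate=30.0)
nwbfile.add_acquisition(speed)

with NWBHDF5IO("session-0001.nwb", "w") as io:
    io.write(nwbfile)
```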
Julius Fridriksson | University of South Carolina at Columbia | Center for the Study of Aphasia Recovery (C-STAR) Advanced neuroimaging visualization for cloud computing ecosystems The Center for the Study of Aphasia Recovery (C-STAR, P50-DC014664) explores recovery from language impairments following stroke, bringing together a diverse team of specialists from communication sciences, neurology, psychology, statistics and neuroimaging. This project acquires a broad range of magnetic resonance imaging (MRI) modalities (structural, diffusion, arterial spin labelling, functional, resting state) from stroke survivors to understand the brain areas critical for language, improve prognosis, and identify the optimal treatment or compensation strategy for each individual. Our team has developed novel desktop-based tools (MRIcroGL and Surfice) to visualize these different modalities. The aim of this supplement is to adapt our methods to a web-based tool (NiiVue) that can work on any device (computer, tablet, phone). | NIDCD |
Andrew Gelman | Columbia | Improving representativeness in non-probability surveys and causal inference with regularized regression and post-stratification Improving the flow of the Bayesian workflow by enhancing the Stan probabilistic programming platform The parent grant will develop general, flexible, and reliable Bayesian methods for survey sampling adjustment that can be used for a wide range of problems in public health research. This requires extensive use of the Stan probabilistic programming platform, both to carry out the research itself and to put the resulting methodology into practice. In the supplement we will improve the core Stan platform in three ways: implementing common input and output formats; speeding up the core Stan inference algorithms through more sophisticated parallelization; and improving memory efficiency. This will improve the overall speed and scalability of inference, allowing for Bayesian methods to be used with increasingly complex models, and in turn allowing more stable and effective inference from non-random samples, a problem that is increasingly relevant when learning about populations in public health. | NIA |
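A sketch of the kind of parallelized Stan run the supplement targets, using cmdstanpy with a deliberately trivial model; it assumes a working CmdStan installation.

```python
# Sketch: compile and sample a Stan model from Python, running chains in parallel.
from cmdstanpy import CmdStanModel

stan_code = """
data { int<lower=0> N; array[N] int<lower=0, upper=1> y; }
parameters { real<lower=0, upper=1> theta; }
model { theta ~ beta(1, 1); y ~ bernoulli(theta); }
"""
with open("bernoulli.stan", "w") as f:
    f.write(stan_code)

model = CmdStanModel(stan_file="bernoulli.stan")
fit = model.sample(data={"N": 10, "y": [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]},
                   chains=4, parallel_chains=4)  # chains run concurrently
print(fit.summary())
```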
Guy Genin | Washington University | Multiscale models of fibrous interface mechanics Strain Analysis Software for Open Science This supplement addresses a critical need in biomechanics and mechanobiology, and eventually in clinical practice: seamlessly analyzing large imaging datasets to determine how tissues deform under mechanical loading. The parent grant (R01AR077793) uses a comprehensive modeling and experimental approach to study how fibrous interfaces transfer load between dissimilar materials. The supplement will implement previously developed strain-tracking algorithms in user-friendly software that is cloud-ready and broadly available. This software will enable quantitative analysis of deformation in biomedical images from a wide range of modalities, including microscopy, ultrasound, and optical imaging. Notably, commercially available software packages for this task employ regularization techniques to ensure smooth solutions, and are therefore often unable to accurately identify local tissue deformations or predict soft tissue tears. We will enable researchers to study strain fields associated with injury patterns and rehabilitation protocols by executing two aims: (1) We will develop open-source software, as a plugin to ImageJ, for the strain-tracking algorithm in two dimensions (2D), stereo view (2.5D), and three dimensions (3D). Best practices will be used for open-source software development, and modules will be created to facilitate the development of a user community. (2) We will develop a working, static code implementation on GitHub that can be run on Amazon AWS using data in the cloud. This will help overcome the primary obstacle to widespread adoption of strain-mapping techniques in musculoskeletal research, namely that the 3D datasets require substantial computational resources to analyze. The work will enable collaboration between the PIs of the parent grant and an expert on open-source software development for clinical and research translation, and enhance the impact of a tool with strong potential. | NIAMS |
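As background for the strain-tracking aims, here is a sketch of computing a 2D Green-Lagrange strain field from a displacement field on a regular grid; real strain-tracking software first estimates the displacements from image pairs, which this toy example skips.

```python
# Sketch: Green-Lagrange strain E = 0.5 * (F^T F - I), with F = I + grad(u),
# evaluated for a toy 1% uniaxial stretch.
import numpy as np

u = np.fromfunction(lambda y, x: 0.01 * x, (64, 64))  # x-displacement field
v = np.zeros((64, 64))                                # y-displacement field

du_dy, du_dx = np.gradient(u)   # axis 0 is y (rows), axis 1 is x (cols)
dv_dy, dv_dx = np.gradient(v)

Exx = du_dx + 0.5 * (du_dx**2 + dv_dx**2)
Eyy = dv_dy + 0.5 * (du_dy**2 + dv_dy**2)
Exy = 0.5 * (du_dy + dv_dx + du_dx * du_dy + dv_dx * dv_dy)
print(Exx.mean())  # ~0.01005 for a 1% uniaxial stretch
```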
Thomas Gill | Yale | Claude D. Pepper Older Americans Independence Center at Yale Yale Study Support Suite (YES3): Dashboard and Web Portal Software Supporting Research Workflow through Integrated, Customizable REDCap External Modules This data science project will refactor and refine the Yale Study Support Suite (YES3), a suite of REDCap workflow and data management programmatic extensions (external modules) designed to improve the efficiency and quality of research field operations. The NCATS-funded, Vanderbilt University-developed REDCap platform is used at thousands of institutions worldwide. A powerful characteristic of REDCap is its support for user-contributed programmatic extensions (external modules) that can add features or UI/UX elements, as well as integration with external informatics resources. Over the past decade, the Operations Core of the NIA-funded Claude D. Pepper Older Americans Independence Center (OAIC) at Yale (P30AG021342) has built a suite of REDCap external modules that promote efficiencies in workflows and data management. YES3 is in use by studies and data coordinating centers associated with Yale University, including a large PCORI/NIA-funded national pragmatic trial (the D-CARE Study) led by investigators from three OAICs. YES3 components include a dashboard for advanced data collection and study operations management that can be tailored to support study-specific workflows; a study portal for disseminating study materials and for single- or multi-site conduct-of-study monitoring that includes a comparative "site report card" analysis; and an automatable module that exports both code (SAS and R) and data for datamarts. The REDCap@Yale team will use the NOSI funding to refactor and refine the YES3 codebase into a form suitable for long-term collaborative maintenance by the extensive consortium of REDCap open-source developers. Refactoring will focus on adherence to established coding style and documentation guidelines, design patterns widely in use by consortium developers, and REDCap@Yale software workflow and security practices. The YES3 GitHub repository will include automated workflows for code inspection and security review, in addition to comprehensive documentation for end-users and developers. | NIA |
Karl Grosh | University of Michigan Ann Arbor | Active and Nonlinear Models for Cochlear Mechanics Developing open-source finite-element codes for algorithm development and scientific discovery The overarching goal of this research is to develop a complete fluid-mechanical-electrical model that describes the response of the cochlea to external acoustic stimulation. A predictive model that covers the full range of audio amplitudes and frequencies will help us understand how important classes of signals, such as speech and music, are processed in the cochlea, since our understanding of this processing is incomplete. Potential outcomes of a predictive model include improved speech processing algorithms, approaches for both cochlear implant electrical stimulation and hearing aid receiver stimulation paradigms, noninvasive diagnoses of auditory function, and higher-fidelity input for models of neural processing of sound. While we have made progress in developing efficient, predictive codes, improving their utility and predictive ability rests on innovations in scientific computing along with the ability to rapidly integrate new biophysical mechanisms. The modular nature of our finite element-based approach makes such improvements relatively easy. Two main motivations for creating an open-source resource for these models are as follows: (1) broaden the user base by making the present version of CSound easier to use and directly accessible to the entire auditory research community; and (2) create a GitHub-based open-source resource that enables computational researchers to modify CSound using GitHub's branching structure and pull-request workflow for managing software updates. To facilitate the realization of these goals, we use widely accessible software that harnesses large-scale (parallel) computing capability. We hope that by creating an open-source software ecosystem, we will accelerate discovery by experimentalists through their use of the code and spur advances in scientific computing aimed directly at the grand computational challenge of predicting the response of mammalian cochleae to sound. | NIDCD |
Ron June | Montana State Bozeman | Role of Glucose metabolism in Chondrocyte Mechanotransduction Build open science framework to enable more labs to use metabolomic flux analysis of central metabolism. All cells use various metabolic processes to harvest nutrients and energy from biological inputs. The most studied pathways that cells use are components of the integrated system that is known as central metabolism. This project develops web versions of key tools to study central metabolism that use a type of mathematical modeling called metabolomic flux analysis. This technique integrates experimental metabolomics data with a stoichiometric model. The experimental metabolomics data describes changes in metabolite concentrations in response to an experimental stimulus or clinical treatment. The stoichiometric model describes the quantitative flow of nutrients and energy through central metabolism. Many labs across the nation perform studies of central metabolism, yet this new approach of metabolomic flux analysis is not widely available. The objective of these supplemental studies is to develop open science frameworks that allow users to perform this analysis on their own data. | NIAMS |
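The stoichiometric steady-state constraint at the heart of flux analysis can be sketched in a few lines; the toy network below is purely illustrative, and the entry's integration of measured metabolomics data goes beyond this plain flux-balance example.

```python
# Sketch: find fluxes v with S @ v = 0 (steady state) within bounds while
# maximizing an objective flux, via linear programming.
import numpy as np
from scipy.optimize import linprog

# rows = metabolites A, B; columns = reactions  ->A, A->B, B->
S = np.array([[1, -1,  0],
              [0,  1, -1]])
bounds = [(0, 10)] * 3
c = np.array([0, 0, -1.0])  # linprog minimizes, so negate to maximize v3

res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds, method="highs")
print(res.x)  # optimal flux distribution, here [10, 10, 10]
```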
Daisuke Kihara | Purdue | Building protein structure models for intermediate resolution cryo-electron microscopy maps Make cryo-EM structure modeling software cloud-ready and integrate with popular molecular modeling packages. The overall goal of the parent award is to develop computational methods for modeling global and local structures for interpreting cryo-EM density maps of 4 Å to medium resolution. Under the parent award, we have successfully developed methods including MAINMAST for de novo protein main-chain modeling, VESPER for structure fitting to EM maps, and Emap2Sec+, a deep-learning method for detecting local structures in medium-resolution EM maps. The goal of this administrative supplement is to improve the availability, sustainability, robustness, and user-friendliness of our biomolecular structure modeling software for cryo-EM, with a strong emphasis on cloud computing readiness. To achieve this goal, we will restructure the developed codes to integrate the software with popular molecular modeling packages and to compute effectively on cluster computers. We will also develop web servers that perform computation in the cloud, providing easy access to the software. | NIGMS |
Arjun Krishnan | Michigan State University | Resolving and understanding the genomic basis of heterogeneous complex traits and disease GenePlexus: a cloud platform for network-based machine learning The goal of the parent project is to develop a suite of computational frameworks that integrate massive collections of genomic and biomedical data to advance understanding of the mechanistic relationships between genomic variation, cellular processes, tissue function, and phenotypic variation in relation to complex traits and diseases. Genome-wide molecular networks effectively capture these mechanistic relationships and, when combined with supervised machine learning (ML), lead to state-of-the-art results in predicting novel genes associated with pathways, traits, and diseases. This software supplement will support the development of GenePlexus, a cloud platform for network-based machine learning, to enable: (i) biomedical/experimental researchers to perform network-based ML on massive genome-scale molecular networks and get novel, interpretable predictions about gene attributes, and (ii) computational researchers to programmatically run network-based ML, retrieve results, and integrate with existing -omics data analysis workflows. Building on a current prototype, we will (1) refactor GenePlexus to use a service-oriented architecture to scale ML runs and enable programmatic access, stand-alone use, and community contributions; (2) generalize the data structure and storage design, and improve inter-service communication with existing large-scale resources for genes and networks; and (3) improve security, privacy, database architecture, and cost management to improve the user experience in running, storing, retrieving, and sharing machine learning models and results. | NIGMS |
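A conceptual sketch of network-based supervised ML in the style the entry describes: featurize each gene by its normalized network neighborhood and train a classifier on genes with known labels. The data are random stand-ins, and this is not the GenePlexus implementation.

```python
# Sketch: genes featurized by rows of a row-normalized adjacency matrix;
# a classifier trained on labeled genes scores every gene in the network.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_genes = 500
adjacency = rng.random((n_genes, n_genes)) < 0.02           # toy network
features = adjacency / np.maximum(adjacency.sum(1, keepdims=True), 1)

labeled = rng.choice(n_genes, 60, replace=False)  # genes with known labels
y = rng.integers(0, 2, size=60)                   # 1 = associated with trait

clf = LogisticRegression(max_iter=1000).fit(features[labeled], y)
scores = clf.predict_proba(features)[:, 1]        # predictions for all genes
print(scores[:5])
```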
Maria Kukuruzinska | Boston University | Defining the catenin/CBP axis in head and neck cancer Enhancement and Cloud Deployment of CaDrA, a software tool for Candidate Driver Analysis of Multiomics Data In the parent project we developed CaDrA (Candidate Driver Analysis), a methodology for the identification and prioritization of candidate cancer drivers associated with a given pathway, regulator, or other molecular phenotype of interest through the analysis of multi-omics data. We aim to optimize, harden, and deploy CaDrA as an open-source R package built on best software engineering practices and design principles. We plan to thoroughly document the package, design an R Shiny interface to make the tool “biologist-friendly”, and containerize it via Docker/Singularity to make it cloud-ready, scalable, and compatible with various high-performance computing (HPC) environments. | NIDCR |
Sudhir Kumar | Temple University | Methods for Evolutionary Genomics Analysis Making MEGA cloud-ready for big data phylogenetics The Molecular Evolutionary Genetics Analysis (MEGA) software offers a large repertoire of tools for assembling sequence alignments, inferring evolutionary trees, estimating genetic distances and diversities, inferring ancestral sequences, computing timetrees, and testing selection. These analyses are key to deciphering the genetic basis of evolutionary change and discovering fundamental patterns in the tree of life. New methods for molecular evolutionary genomics, developed during the parent grant of this supplement, will be implemented in MEGA, which is widely used in the community. The primary aim of this supplement is to adapt MEGA’s computational core to run on cloud computing infrastructure, i.e., make MEGA cloud-ready (MEGA-CR). We plan to refactor the source code for scalable distributed execution on cloud infrastructure as well as computer clusters. We will optimize performance through code profiling, containerizing, and minimizing factors that increase latency in cloud environments, such as network communication and data transfer. These developments will increase the scalability of MEGA for analyzing big data via the elastic computing power enabled by cloud infrastructure. Consequently, MEGA-CR will meet the needs of a scientific community that is now analyzing much larger datasets by providing greater accessibility, cost-efficiency, and scalability in computational molecular evolution. | NIGMS |
Barry Lester | Women and Infants Hospital Rhode Island | Clinical markers of neonatal opioid withdrawal syndrome: onset, severity and longitudinal neurodevelopmental outcome Developing cloud-based software to analyze acoustic characteristics of newborn babies’ cries to diagnose withdrawal due to prenatal opioid exposure. The incidence of Neonatal Opioid Withdrawal Syndrome (NOWS), withdrawal in newborn infants due to prenatal opioid exposure, has increased dramatically with the worldwide rise in opioid use. Accurate prediction and diagnosis of NOWS would change the treatment and management of these infants. The purpose of the parent grant is to identify clinical markers of NOWS, including acoustic characteristics of the infant’s cry, and determine the long-term predictive validity of these markers. Computer analysis of infant cry characteristics enables us to predict NOWS diagnosis with 91% accuracy but is not usable at the bedside. With this supplement, we will develop a cloud-based system for the automated analysis of infant cry acoustics for the diagnosis of NOWS. The software will give us a fully automated system in which a user records a baby’s cry in the newborn nursery using a phone or other connected device and, within a few seconds, receives the diagnostic result. We will develop the cloud-based software to change clinical practice by providing a more accurate and reliable diagnosis of NOWS, which will affect the pharmacological treatment of NOWS, including length of hospital stay, and potentially improve the long-term outcome of these infants. - Cloud use for clinical application | NIDA |
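A sketch of the kind of server-side acoustic feature extraction such a service might perform, estimating cry fundamental frequency (F0) with librosa; the file name is a placeholder, and NOWS-relevant features would go well beyond F0.

```python
# Sketch: load an uploaded recording and estimate F0 with the YIN algorithm.
import librosa
import numpy as np

y, sr = librosa.load("cry_recording.wav", sr=None)   # placeholder file name
f0 = librosa.yin(y, fmin=200, fmax=1000, sr=sr)      # rough infant-cry F0 range
print(f"median F0: {np.median(f0):.1f} Hz")
```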
Allan Levey, David Gutman | Emory | AMP-AD Brain Proteomic Network Enhancement, Validation, and Translation into CSF Biomarkers Enhancing the Open-Source Digital Slide Archive Platform to Enhance Proteomic Biomarker Discovery in Alzheimer’s and Related Diseases The overall goal of the AMP-AD project is to identify proteomic networks from cortical tissue and protein co-expression modules that strongly associate with diagnosis, cognition, and neuropathology. Neuropathologic tissue serves as the gold standard for patient diagnosis, but given the intra-rater variability inherent in neuropathology, developing a system that can confirm or standardize diagnosis will improve the robustness of proteomic marker validation studies. The Cancer Digital Slide Archive is a web-based imaging repository that houses 26,000+ digital pathology images. Originally built for pathology data from the NCI’s The Cancer Genome Atlas (TCGA), it grew into the open-source Digital Slide Archive (DSA) platform. The platform supports web-based visualization of pathology images and integrated tools for both computer- and human-generated annotations, and it reads the majority of whole-slide image formats. Code is available on GitHub, and the REST-based API supports integration with other applications. The dockerized DSA platform supports local and cloud-based installation. This proposal will provide foundational support to adapt the DSA platform to serve the neuropathology community. Because the system was initially developed for cancer studies, it needs significant user interface (UI) refinement to support typical neuropathology workflows. We will improve our documentation, develop tutorials, and harden custom UIs we have prototyped. UI enhancements will simplify navigation between the slides, stains, and sections used in neuropathology assessments. This will involve developing an initial data model to standardize neuropathology data management. Additional features needed for safe, secure, and compliant image sharing and organization will also be evaluated. The HistomicsTK toolkit, developed by our group, has ~10,000+ installations a month via PyPI. We will optimize the HistomicsTK algorithms for neuropathology slide sets, with accompanying demos and documentation. | NIA |
Trevor Lujan | Boise State University | Role of Distortion Energy in Fibroblast-Mediated Remodeling of Collagen Matrices Launching a cloud-based application that enables the fast quantification of material anisotropy from two-dimensional images of fiber networks. Our parent R15 application uses a standalone software application developed in our lab, called FiberFit, to quantify differences in the fiber networks of cellular scaffolds. FiberFit was developed to automate the fast and accurate measurement of two structural properties that describe material anisotropy: fiber orientation and fiber dispersion. The accurate measurement of these parameters is of major importance in understanding structure-function relationships at molecular and macroscales in biological systems, and in the engineering of advanced materials, yet no standard software tool to compute these metrics has been broadly adopted. We will address this limitation by re-engineering FiberFit into a robust, intuitive, and sustainable cloud-based application. This project will emphasize ease of use and will include the development of supportive documentation and pre-processing features to create a turn-key solution that automates the accurate and transparent analysis of multiple two-dimensional image files. These images can be acquired from numerous imaging modalities at various length scales, and the materials can be biological, inorganic, or engineered. As a web-based application, FiberFit will utilize cloud storage technology to manage uploaded and generated data, and user accounts will allow analyzed images and project-specific settings to be stored and retrieved from cloud directories. | NIAMS |
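A sketch of FFT-based fiber orientation estimation, the class of computation FiberFit performs (its actual fitting procedure differs in detail); a toy image of parallel stripes stands in for real data.

```python
# Sketch: estimate the dominant orientation of a striped image from the
# angular distribution of its 2D power spectrum.
import numpy as np

yy, xx = np.mgrid[0:256, 0:256]
image = np.sin(2 * np.pi * (xx * np.cos(0.3) + yy * np.sin(0.3)) / 16)

power = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
ky, kx = np.mgrid[-128:128, -128:128]
mask = (kx**2 + ky**2) > 4          # drop the DC / lowest-frequency region

# Principal axis of spectral energy via the doubled-angle (tensor) method.
theta = 0.5 * np.arctan2((2 * kx * ky * power)[mask].sum(),
                         ((kx**2 - ky**2) * power)[mask].sum())
print(np.degrees(theta))            # spectral axis; fibers run 90 deg to it
```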
Rob Macleod | University of Utah | Integration of Uncertainty Quantification with SCIRun Bioelectric Field Simulation Pipeline Enabling cloud computing resources to enhance access to uncertainty quantification tools A rapidly emerging need in biomedical simulation is to quantify the uncertainty that arises because model predictions depend on inevitably incomplete knowledge of simulation parameters. Every simulation has errors (modeling assumptions, imprecise estimates of parameter values, and numerical discretization errors), and the scientist or physician using the results of a simulation should know and take into account the nature of those errors, i.e., how sensitive the results are to the assumptions and uncertainties of the settings and parameters that drive the simulations. Techniques to characterize this sensitivity are known collectively as uncertainty quantification (UQ). While mathematical methods for UQ have made significant recent progress, and verified methods and validated software tools that implement them are available, these tools are not easily accessible to biomedical scientists. The gaps in this process for much of biomedical science currently lie in the integration and easy availability of state-of-the-art UQ tools within simulation pipelines. The parent U24 project for this supplement request addresses the first of these gaps, integration: we are developing an open-source, Python-based software suite, UncertainSCI, for non-intrusively quantifying uncertainty due to a variety of parameters in biomedical simulations. However, while this tool holds considerable promise, its use requires considerable knowledge of software engineering as well as access to in-house computing resources, both of which may be lacking in many biomedical settings. Thus, the second gap, availability, is not addressed as well as might be desired. The overall goal of this supplement request is to enhance the availability of UncertainSCI by leveraging best practices in modern software development and advances in cloud computing. | NIBIB |
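A generic Monte Carlo illustration of the UQ idea, propagating an uncertain parameter through a stand-in model; UncertainSCI itself uses more efficient non-intrusive polynomial-chaos methods, and the model and parameter values here are invented.

```python
# Sketch: sample an uncertain parameter, run the model, summarize the spread.
import numpy as np

def simulate(conductivity: np.ndarray) -> np.ndarray:
    """Stand-in for an expensive bioelectric simulation."""
    return 3.0 / conductivity + 0.1 * conductivity**2

rng = np.random.default_rng(7)
sigma = rng.normal(loc=0.33, scale=0.03, size=10_000)  # uncertain parameter
out = simulate(sigma)
print(f"mean={out.mean():.3f}, std={out.std():.3f}")   # output uncertainty
```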
Brian Macwhinney | Carnegie Mellon University | Computational analysis of child language transcript data To promote FAIR-ness, web delivery, and system replication, we will modernize and containerize the software used by the CHILDES (Child Language Data Exchange System) database and programs. The goal of this supplement is to modernize the software infrastructure supporting the CHILDES Project. We will leverage best practices in software development and advances in cloud computing to improve the FAIR-ness of the data and system. To improve web compatibility and take full advantage of the new TalkBankDB database system, we will convert the current XML-based data model to JSON. This will allow us to improve data validation, repository checking, and cloud deployment. We will modularize and containerize all software components and integrate the containers through the Kubernetes system. These improvements will open the system to open-source development of analyses through R and Python, thereby widening the scope of empirical issues that can be addressed from the database. They will also allow us to replicate the full system at new sites internationally, thereby promoting sustainability. | NICHD |
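A sketch of the XML-to-JSON direction of the planned conversion using only the Python standard library; the element names are invented for illustration and do not reflect the actual TalkBank XML schema.

```python
# Sketch: parse a toy transcript fragment from XML and re-emit it as JSON.
import json
import xml.etree.ElementTree as ET

xml_doc = "<utterance speaker='CHI'><word>more</word><word>cookie</word></utterance>"

root = ET.fromstring(xml_doc)
record = {
    "speaker": root.get("speaker"),
    "words": [w.text for w in root.findall("word")],
}
print(json.dumps(record))  # {"speaker": "CHI", "words": ["more", "cookie"]}
```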
Mark Musen | Stanford | The Metadata Powerwash - Integrated tools to make biomedical data FAIR Modernizing the Protégé ontology-editing system: Enhanced ontology engineering through a Web-based, Cloud-based software architecture To support the needs for ontology engineering in our parent NLM R01 grant, we propose two specific aims to enhance the Web-based version of the Protégé ontology editor: (1) We will convert WebProtégé to a modern, microservice-based architecture, adding new microservices, including a plug-in architecture that will allow third parties to contribute novel additions to the WebProtégé code base. We will use a software-development approach that allows us to implement the new architecture in a controlled, incremental manner. (2) We will modernize WebProtégé to make it Cloud-native. We will take advantage of the NIH STRIDES initiative, containerizing the system for deployment on the Google Cloud Platform (GCP) and adapting the software to operate with Cloud-based third-party software for data storage, data queueing, and search. We will also migrate all current WebProtégé users and their projects to the Cloud-based system. Our work will benefit the biomedical community at large while enhancing our capabilities for ontology engineering as required by our existing grant. | NLM |
Towfique Raj, David Knowles | Icahn School of Medicine at Mount Sinai | Learning the Regulatory Code of Alzheimers Disease Genomes Refactoring and containerizing the LeafCutter RNA splicing pipeline for cluster and cloud infrastructures The parent award for the proposed work (U01 AG068880-01, “Learning the Regulatory Code of Alzheimer's Disease Genomes”) involves developing deep learning (DL) and machine learning (ML) models of pre- and post-transcriptional (in particular, RNA splicing) gene regulation in Alzheimer’s disease-associated cell types and states. The training data for the RNA splicing models are generated by applying our previously published splicing quantification tool, LeafCutter, to large-scale RNA-seq datasets. LeafCutter has also been used by consortia including GTEx and PsychENCODE for their splicing analyses. However, LeafCutter remains “early stage” software with a challenging install process on both cluster and cloud infrastructure. In this supplement we will refactor LeafCutter to use (1) software engineering best practices, including modularization and GA4GH schemas; (2) conda- and Docker-based installation; and (3) standard workflow languages (e.g., Nextflow) to enable straightforward and efficient deployment on the cloud at scale. | NIA |
Adam Resnick | Children’s Hospital of Philadelphia | Data Management and Portal for the INCLUDE (DAPI) Project User-ready tools and scalable workflows for INCLUDE datasets in the cloud: advancing brain imaging data management and analytics This supplement advances large-scale imaging dataset generation, management, and analytics with user-ready tools and cloud-based workflows. Although radiological images are routinely acquired in clinical care, the use of medical imaging data in a research context is limited by the technical expertise required for data preparation and analysis. Moreover, the need for large-scale, ML-ready imaging datasets for predictive analytics has been largely unmet due to a lack of tools and workflows that generalize across scanners and sites and that can be flexibly deployed in high-performance computing environments. To bridge these gaps, the goal of this supplement is to develop interoperable and scalable cloud-based workflows that enable AI/ML analytics with clinically acquired imaging data. We will integrate existing state-of-the-art software with cloud services that can be utilized in a user-friendly web-based platform (Flywheel). Additionally, we will establish pipelines that associate imaging-based ML features with features from other data modalities, creating rich, multi-modal datasets for advanced predictive analytics. This will ultimately support the goal of the parent grant in defining generalizable workflows for large-scale data processes, integrating data across dispersed sources, and providing harmonized datasets to bolster research on Down syndrome and co-occurring conditions. | NHLBI |
Panagiotis Roussos | Icahn School of Medicine at Mount Sinai | Understanding the molecular mechanisms that contribute to neuropsychiatric symptoms in Alzheimer Disease Dreamlet workflow to power analysis of large-scale single cell datasets Recent advances in single-cell and single-nucleus transcriptomic technology have enabled studying the molecular mechanisms of Alzheimer's disease (AD) at unprecedented resolution by profiling the transcriptomes of thousands of single nuclei from hundreds of post mortem brains from AD donors and controls. The parent grant has generated a compendium of single-nucleus transcriptome profiles comprising ~7.2M nuclei from ~1,800 total donors. Yet elucidating the molecular mechanisms of AD from these data requires scalable software and sophisticated statistical modelling. To address this challenge, we are developing the dreamlet package, a widely applicable framework for differential expression analysis of single-cell (and single-nucleus) RNA- and ATAC-seq data that models complex study designs in highly scalable analysis workflows. Dreamlet uses a pseudobulk approach, fitting a regression model for each gene (or open chromatin region) and cell cluster to test differential expression across individuals associated with a trait of interest. Use of precision-weighted linear mixed models enables accounting for repeated-measures study designs, high-dimensional batch effects, and varying sequencing depth or observed cells per biosample. Dreamlet further enables analysis of massive-scale single-cell RNA-seq and ATAC-seq datasets by addressing both CPU and memory usage limitations: it performs preprocessing and statistical analysis in parallel on multicore machines, can distribute work across multiple nodes on a compute cluster, and uses the H5AD format for on-disk data storage, enabling data processing in smaller chunks to dramatically reduce memory usage. The dreamlet workflow integrates easily into the Bioconductor ecosystem and uses the SingleCellExperiment class to facilitate compatibility with other analyses. Beyond differential expression testing, dreamlet provides seamless integration of downstream analyses, including quantifying sources of expression variation, gene set analysis using the full spectrum of gene-level t-statistics, testing differences in cell type composition, and visualizing results. | NIA |
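A sketch of the pseudobulk step dreamlet builds on; dreamlet itself is an R/Bioconductor package, so this is a language-neutral illustration in Python with random stand-in counts.

```python
# Sketch: sum raw counts within each donor x cell-cluster combination,
# yielding one expression profile per donor per cluster for regression.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n_cells, n_genes = 5000, 4
counts = pd.DataFrame(rng.poisson(1.0, (n_cells, n_genes)),
                      columns=[f"gene{i}" for i in range(n_genes)])
meta = pd.DataFrame({
    "donor": rng.choice([f"D{i}" for i in range(20)], n_cells),
    "cluster": rng.choice(["microglia", "astrocyte"], n_cells),
})

pseudobulk = counts.groupby([meta.donor, meta.cluster]).sum()
print(pseudobulk.head())  # one row per donor x cluster, ready for modeling
```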
David Rowe, Peter Maye, Dong-Guk Shin | University of Connecticut | High resolution 3D mapping of cellular heterogeneity within multiple types of mineralized tissues Revamp mGEA (Make GEO Accessible) to be available for a wide user base In this software engineering supplement project, we aim to revamp an important software tool, mGEA (Make GEO Accessible), which we have been using internally to identify candidate MERFISH probes for human knee and bone tissues, so that it can benefit a larger user base that needs to examine population-based reference gene expression data sets readily available from NCBI GEO. Examining GEO-deposited data can be very beneficial for HuBMAP users, since one can acquire a cohort of gene expression data sets from which tissue/organ-specific reference gene expression patterns can be mined. The importance of using population-based signals for probe design was hotly discussed during the recent HuBMAP FISH-assay meeting (March 15, 2021, organized by Dr. Ajay Pillai). Unfortunately, GEO has been mostly optimized for “data archiving”, and as such, use of deposited data by ordinary biologists, and even by computational scientists, has been severely limited. Our tool mGEA could dramatically lower that barrier. Difficulties in using GEO-deposited data include (i) associating experimental platform IDs (e.g., Affymetrix, Illumina, Agilent, etc.) with the gene symbols that biologists are mostly familiar with, and (ii) organizing which populations of samples (biological and technical replicates) can be grouped together and compared (e.g., treatment vs. control, KO population vs. WT, etc.). Using mGEA, scientists should be able to convert the archived data into biologist-friendly formats (e.g., an Excel spreadsheet with gene symbols, fold changes, and statistical sample-wise and gene-wise z-scores precomputed) within a few clicks in any web browser. If everything goes well, users should be able to convert a GEO-deposited data set into a format amenable to local exploration in less than 10 minutes using the tool's user-friendly visual GUI, although problematic cases requiring manual intervention may take longer. Making mGEA cloud-ready would benefit not only the members of the HuBMAP consortium but also constituents far beyond HuBMAP. With mGEA, the majority of wet-bench biologists should be able to explore GEO-deposited data, thus helping GEO unleash its intended power as an important community resource. | NIAMS |
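A sketch of the core transformation mGEA automates, computing fold changes and simple gene-wise z-scores for grouped samples; column names are invented, and real GEO series additionally require platform-ID-to-gene-symbol mapping.

```python
# Sketch: group samples into comparison arms and precompute per-gene stats.
import numpy as np
import pandas as pd

expr = pd.DataFrame(np.random.default_rng(9).normal(8, 1, (5, 6)),
                    index=[f"GENE{i}" for i in range(5)],
                    columns=["trt1", "trt2", "trt3", "ctl1", "ctl2", "ctl3"])

trt, ctl = expr[["trt1", "trt2", "trt3"]], expr[["ctl1", "ctl2", "ctl3"]]
log2_fc = trt.mean(axis=1) - ctl.mean(axis=1)   # data assumed log2 scale
z = log2_fc / expr.std(axis=1)                  # crude gene-wise z-score
print(pd.DataFrame({"log2FC": log2_fc, "z": z}))
```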
Nathan Salomonis | Cincinnati Childrens Hospital | Unbiased identification of spliceosome vulnerabilities across cancer Leveraging cloud workflows for splicing discovery As a central component of our funded NCI R01, “Unbiased identification of spliceosome vulnerabilities across cancer”, we have been extending and leveraging a comprehensive splicing analysis pipeline to define splicing vulnerabilities across human cancers and healthy tissues. The bioinformatics tools yielding these discoveries largely consist of distinct components of the large AltAnalyze open-source project, begun in 2008. To enable fast and comprehensive analyses of splicing in the cloud, we are updating and translating the primary splicing analysis components of AltAnalyze into a CWL pipeline, containerized with Docker. Analyze.cloud will be established as a Terra workflow for integrated supervised and unsupervised splicing analyses of large user datasets and controlled NIH-deposited datasets in the cloud. - Cloud workflow on Terra using CWL | NCI |
Luis Santana | University of California Davis | Multi-Scale Modeling of Vascular Signaling Units Multi-modal data analysis pipeline in the cloud to unify electrophysiology, Ca2+ imaging, super-resolution microscopy, and predictive modelling. Hypertension is one of the largest modifiable risk factors for cardiovascular disease, the leading cause of mortality for men and women. As more antihypertensive therapies become available, it has become clear that males and females respond differently to these treatments, yet the mechanisms behind these sex differences are largely unknown. The parent grant takes a multi-disciplinary approach to reveal the mechanisms of male and female hypertension and to build a detailed model that predicts how drugs can differentially alter vascular function between these groups. This is being achieved by comparing male and female vascular smooth muscle cells using a range of state-of-the-art multi-modal techniques, including electrophysiology, Ca2+ imaging, nano-scale super-resolution microscopy, and in silico predictive modelling. The goal of this supplemental project is to unify the analysis of data from these heterogeneous multi-modal techniques by creating an easy-to-use, reproducible, interoperable, and extensible software analysis pipeline. The pipeline will have both a back-end Python engine with an application programming interface (API) to ensure re-usability and a front-end cloud-based graphical user interface (GUI) to facilitate collaboration and sharing between groups with no programming experience. The pipeline will be hardened using best programming practices, including versioning, unit testing, and code documentation. To make the pipeline discoverable, it will be accessible through online open-source code sharing and installable from package managers. The pipeline will be containerized for one-click access, allowing the same code to be run on individual computers, local clusters, or in the cloud. A key design feature of this pipeline is that curated multi-modal data analysis will seamlessly provide input to in silico computational models, enabling robust ground-truth predictions of functional biophysical parameters. Our goal is to create a cloud-based analysis pipeline that merges heterogeneous multi-modal data and promotes reproducibility, sharing, and collaboration between multi-disciplinary research groups and, ultimately, the greater public. | NHLBI |
Matthew Silva | Washington University | Resource Based Center for Musculoskeletal Biology and Medicine Washington University Musculoskeletal Image analysis program (WUMI): a multi-platform open-source software for the visualization and evaluation of musculoskeletal imaging data. Under the parent grant P30 AR074992, the Washington University Resource-Based Center for Musculoskeletal Biology and Medicine supports the development, implementation, and evaluation of animal models for musculoskeletal biology and medicine. High-resolution imaging modalities such as MicroCT enable the evaluation of bone microstructure and morphology in musculoskeletal research. These advances in imaging technology have enabled investigators to answer crucial questions in health and disease in innovative ways. Yet the ability to evaluate and interpret these data is critically dependent on software tools that can conduct the analyses correctly, rapidly, and rigorously. Increasing imaging data size and complexity often outpaces the development of the tools to handle them. In addition, there is an increased need for remote analysis solutions that can be run on personal computers. Therefore, the premise of this work is driven by the need to develop software that addresses the imaging analysis needs of the musculoskeletal research community. To make a significant impact on the research community, the software must also be widely available, accessible, and easy to use. To address these needs, we have developed software scripts that support 3D visualization and histomorphometric analyses. This led to the creation of the Washington University Musculoskeletal Image analysis program (WUMI), an image analysis software package with a Graphical User Interface (GUI). Compared to available commercial and open-source options, WUMI has equal or superior capabilities, and it has empowered its users with flexible access to and evaluation of their datasets while reducing their reliance on commercial software. Yet, to date, WUMI has required a relatively high degree of technical expertise and is used by only a few power users. Thus, there is a need for additional development efforts to provide this software as a robust, open-source resource to the general research community, both at Washington University and beyond. Our overall objective is to rigorously validate the WUMI 3D histomorphometric analyses and improve the documentation and usability of the software while releasing multi-platform executables to the musculoskeletal research community. | NIAMS |
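To make the histomorphometric validation concrete, the sketch below computes one standard 3D measure, bone volume fraction (BV/TV), from a thresholded MicroCT volume. The file name, threshold, and voxel size are illustrative assumptions, and this is not WUMI's actual code.

```python
# Minimal sketch of bone volume fraction (BV/TV) from a MicroCT stack.
import numpy as np

volume = np.load("microct_stack.npy")        # hypothetical 3D array of grayscale voxels
bone_mask = volume > 1500                    # global threshold separating bone from background

bv_tv = bone_mask.sum() / bone_mask.size     # bone volume / total volume
voxel_mm = 0.01                              # assumed 10 um isotropic voxels
bone_volume_mm3 = bone_mask.sum() * voxel_mm**3
print(f"BV/TV = {bv_tv:.3f}, BV = {bone_volume_mm3:.2f} mm^3")
```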
Peter Sorger | Harvard | Systems Pharmacology of Therapeutic and Adverse Responses to Immune Checkpoint and Small Molecule Drugs Enhancing the MCMICRO data analysis pipeline to use standardized languages and best practices from genome science to facilitate multi-step processing of complex tissue images locally or on the cloud. Highly multiplexed tissue imaging is rapidly emerging as a means to study the properties of single cells in a preserved 3D environment. In a research setting, high-plex imaging provides new insight into the molecular properties of tissues and their spatial organization; in a clinical setting, high-plex imaging promises to augment traditional histopathological diagnosis with the molecular information needed to guide the use of targeted and immuno-therapies. High-plex tissue imaging yields subcellular-resolution data on 20-100 proteins or other biomolecules across 10^6 to 10^7 cells, encoded in up to 1 TB of data. The primary barrier to deeper analysis of highly multiplexed tissue imaging is the computational challenge of processing, managing, and disseminating images of this size. To address this challenge, we recently developed MCMICRO, a modular, open-source image processing system that uses either the Nextflow pipeline language or the Galaxy platform. This pipeline incorporates existing image processing code and an increasing number of newly developed modules. It can be deployed locally or on the commercial cloud (AWS and GCP) and is in use by multiple NIH/NCI-funded tissue atlas consortia, notably the Human Tumor Atlas Network (HTAN). The goal of our current work (under the auspices of the Office of Data Science Strategy) is to further engineer MCMICRO to (i) improve the overall performance of the individual modules through code profiling and optimization; (ii) standardize inputs and outputs to increase interoperability of individual processing steps; (iii) refine the general user and programmer documentation to promote continued contributions from the open-source community; and (iv) enable visualization of pipeline intermediate and final results directly in the cloud. We welcome participation from interested individuals, laboratories, and companies in the creation of a robust, foundational platform for highly multiplexed spatial profiling of human and animal tissues. | NCI |
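For readers unfamiliar with Nextflow-based pipelines, the sketch below shows the general pattern of launching MCMICRO on a local dataset, following the command style in the MCMICRO documentation; the input path is a placeholder, and current flags should be checked against the project docs.

```python
# Illustrative launch of MCMICRO via Nextflow from Python; identical commands
# run on a laptop, a local cluster, or cloud infrastructure.
import subprocess

subprocess.run(
    ["nextflow", "run", "labsyspharm/mcmicro", "--in", "exemplar-001"],
    check=True,
)
```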
Ingo Titze | University of Utah | The Role of the Vocal Ligament in Vocalization FAIR Software Development for Voice and Speech Simulation A major aim of the parent R01 grant entitled “The Role of the Vocal Ligament in Vocalization” is to quantify the fundamental frequency (pitch) range and vocal fold vibration stability. This is achieved by measuring ligament properties and embedding them into a finite element computer model. The fiber-gel model is a primary mathematical simulation component of a software package developed over 40 years by investigators at multiple institutions. It will be shared with users across the nation. Its general application is to predict vocalization outcomes when structural and motor activation inputs are modified with surgery, therapy, or voice training. Muscle activation plots are produced that show how fundamental frequency varies with cricothyroid and thyroarytenoid activation. Laryngeal framework mechanics of translation and rotation of the cricothyroid joint and the cricoarytenoid joint are used to calculate vocal fold strain (elongation). The role of the ligament in control of fundamental frequency (perceptually known as pitch) will be quantified in humans and in non-human species. Results from non-human species provide insights into alternative solutions for surgical and behavioral treatment. The goals of the supplement are (1) to port the code from Fortran to Python, (2) to develop user interfaces that clinicians and voice trainers can use in daily practice, and (3) to disseminate the software package to a wide group of users through national voice and speech organizations. | NIDCD |
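The link between activation, elongation, and pitch can be seen in the ideal-string relation F0 = (1/2L)·sqrt(sigma/rho), which is a common starting point for fundamental-frequency rules in vocal fold models. The worked example below uses illustrative numbers, not measured ligament properties, and shows the Python target language of the planned port.

```python
# Worked example of the ideal-string relation for fundamental frequency:
# F0 rises with longitudinal tissue stress and falls with vibrating length.
import math

L = 0.016        # vibrating vocal fold length in meters (~16 mm, assumed)
sigma = 30e3     # longitudinal tissue stress in Pa (assumed)
rho = 1040.0     # tissue density in kg/m^3

f0 = (1.0 / (2.0 * L)) * math.sqrt(sigma / rho)
print(f"F0 = {f0:.0f} Hz")   # about 168 Hz with these illustrative values
```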
Doug Whalen | Haskins Laboratories | Links Between Production and Perception in Speech Making DeepEdge, a tool for ultrasound analysis for speech, cloud-accessible. A neural network-based tool for automatically tracking tongue contours in ultrasound video This project is a supplement to the parent NIH award (DC-002717), “Links Between Production and Perception in Speech”, intended to enhance software tools for open science. The parent project investigates the variability and flexibility of speech production through the human vocal tract, and how listeners perceive the phonologically meaningful primitives within the speech signal, through a series of speech production and perception experiments. In the speech production experiments, ultrasound imaging provides a non-intrusive and cost-effective technique for measuring tongue movements. To facilitate the analysis of tongue measurements in ultrasound images, we developed a software program, DeepEdge, that tracks midsagittal tongue contours automatically on a proprietary platform. The aim of this supplement project is to migrate the program to a cloud platform, with other improvements, to increase robustness, interoperability, sustainability, portability, and performance. | NIDCD |
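The core task is per-frame contour inference over an ultrasound video. The sketch below illustrates that loop with a generic convolutional network; the model, weights file, and output convention (contour points as x,y pairs) are hypothetical and do not represent DeepEdge's actual architecture.

```python
# Illustrative per-frame tongue-contour inference on ultrasound video.
import cv2
import torch

model = torch.jit.load("tongue_contour_net.pt").eval()   # assumed TorchScript export

cap = cv2.VideoCapture("ultrasound_clip.mp4")
contours = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    x = torch.from_numpy(gray).float().div(255).unsqueeze(0).unsqueeze(0)
    with torch.no_grad():
        pts = model(x).reshape(-1, 2)     # predicted midsagittal contour points
    contours.append(pts.numpy())
cap.release()
```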
Travis John Wheeler | University of Montana | Machine learning approaches for improved accuracy and speed in sequence annotation Improving robustness, release, and quality of genome annotation software through extensive software engineering practices. The goal of the parent grant for this supplement is to develop machine learning approaches that improve both the accuracy and the speed of highly sensitive sequence database search and alignment. We have developed three software tools in support of this effort to annotate genomes correctly: (i) ULTRA, which labels repetitive sequence; (ii) PolyA, which integrates such labels with other sequence annotations in a probabilistic framework, computing uncertainty and improving accuracy; and (iii) SODA, a library that aids in the visualization of annotations and supporting evidence. The effort supported by this supplement will refactor these software tools and their documentation to improve robustness and reliability, and will improve their availability through package management systems and incorporation into cloud-based analysis frameworks. | NIGMS |
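The adjudication idea behind PolyA, attaching a confidence to each competing annotation rather than simply keeping the top scorer, can be illustrated with a toy example. This is a simplification for intuition only, with invented labels and an arbitrary temperature, not PolyA's actual probabilistic model.

```python
# Toy conversion of competing annotation scores for one genomic window into
# normalized, posterior-like confidences.
import numpy as np

candidates = {"AluY": 312.0, "AluSx": 295.0, "L1MA4": 140.0}   # hypothetical scores
scores = np.array(list(candidates.values()))

weights = np.exp((scores - scores.max()) / 10.0)   # temperature of 10 chosen arbitrarily
posterior = weights / weights.sum()
for name, p in zip(candidates, posterior):
    print(f"{name}: {p:.3f}")                       # confidence attached to each label
```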
Guorong Wu | University of North Carolina Chapel Hill | A Scalable Platform for Exploring and Analyzing Whole Brain Tissue Cleared Images Promoting public science and data sharing for 3D nuclei segmentation through a cloud-based annotation system. In our ongoing R01 project, we aim to develop a high-throughput 3D nuclei segmentation platform for the neuroscience field to analyze large-scale microscopy images derived from cleared tissue in a standardized manner, across multiple labs. Since the success of our learning-based nuclei segmentation method requires a large pool of manually-annotated 3D nuclei for both training and validation, the overarching goal of this supplement project is to refactor our current stand-alone annotation software into a “cloud-ready” solution, called “Ninjatō”, as an analogy to the Swiss army knife for 3D nuclei annotation. We will use VTK.js (an open-source JavaScript library for scientific visualization) and Resonant (an open-source data analytics software platform) to implement a client-server architecture for data acquisition, cloud-based processing, and data management components. The output of this administrative supplement project is a new cloud-based 3D nuclei annotation system, which will allow us to (1) significantly augment the pool of manual annotations, (2) further improve the accuracy and robustness of our learning-based nuclei segmentation engine, (3) make annotation data AI/ML-ready using standardized data management, and (4) support the neuroscience community with enhanced extensibility to new analytic tools. | NINDS |
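The server half of such a client-server flow is essentially an annotation store behind a small HTTP API. The actual system builds on Resonant, so the Flask sketch below, with hypothetical routes and payload fields, is only an illustration of the data exchange between an annotation client and cloud storage.

```python
# Minimal sketch of an annotation-upload API; in-memory list stands in for
# cloud-backed storage, and the JSON schema is invented for illustration.
from flask import Flask, jsonify, request

app = Flask(__name__)
annotations = []

@app.post("/annotations")
def add_annotation():
    payload = request.get_json()           # e.g., {"volume_id": ..., "nuclei": [[x, y, z], ...]}
    annotations.append(payload)
    return jsonify(id=len(annotations) - 1), 201

@app.get("/annotations/<int:ann_id>")
def get_annotation(ann_id: int):
    return jsonify(annotations[ann_id])    # served back to the browser client for review

if __name__ == "__main__":
    app.run()
```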
Yana Yunusova | Sunnybrook Research Institute | The development and validation of a novel tool for the assessment of bulbar dysfunction in ALS Creating a user-friendly and intuitive cloud-based platform for research and clinical assessment of bulbar dysfunction The current healthcare environment affords an unprecedented opportunity for the development and implementation of new remote platforms for the assessment of various motor behaviours, including speech and orofacial gestures. As part of our parent grant, we recently created VirtualSLP, a software tool that supports remote, online, multi-modal data collection (high-quality video/kinematic and audio/acoustic) using multi-platform-compatible, web-browser-based audio and video recording. In parallel, we have designed a tool for the automatic extraction of kinematic and acoustic metrics, which provides clinically interpretable features/measures for detecting and tracking the onset and progression of oro-motor and speech (i.e., bulbar) impairments in neurological diseases. The next step in our technology development efforts is to incorporate the automatic metrics-extraction software within VirtualSLP, creating a user-friendly and intuitive cloud-based platform for research and clinical assessment of bulbar dysfunction. To achieve rapid development and deployment of VirtualSLP, we will use modern engineering methodology and user-centered design to iterate through the steps of the software development cycle. The cycle will begin by engaging end users (e.g., researchers, clinicians, and patients) to document their current and anticipated end-to-end experiences with the software, while performing a baseline analysis of the existing software components and determining and implementing the necessary changes. The necessary components will be incorporated to create VirtualSLP on the Amazon Web Services (AWS) platform, enhancing its usability, interoperability, scalability, and security as part of the NIH STRIDES initiative. Usability testing of VirtualSLP with end users will be performed throughout the development process and at the end of the development cycle. When completed, the work will result in an enhanced software tool for the collection of audio and video data, with corresponding AI-based analytics, to be used in the context of clinical research. VirtualSLP will further support our ongoing work on the clinician-administered tool for bulbar ALS assessment and monitoring (ALS-Bulbar Dysfunction Index) by providing a novel cloud-based platform for its clinical validation. This work aims to exemplify the intent of the current funding opportunity by supporting collaborations between clinical speech scientists, data scientists, and software engineers to enhance the design, implementation, and "cloud-readiness" of research software. | NIDCD |
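As one example of the clinically interpretable acoustic measures such a metrics-extraction component could compute, the sketch below derives fundamental-frequency statistics from a recording with librosa; the file name is a placeholder, and this is not the project's actual extraction code.

```python
# Sketch of a basic acoustic metric (F0 statistics) from a speech recording.
import librosa
import numpy as np

y, sr = librosa.load("speech_sample.wav", sr=None)
f0, voiced, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)

f0 = f0[voiced & ~np.isnan(f0)]                   # keep voiced, valid frames only
print(f"median F0: {np.median(f0):.1f} Hz, F0 variability (SD): {np.std(f0):.1f} Hz")
```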
Antonella Zanobetti | Harvard | National Cohort Studies of Alzheimer's Disease, Related Dementias and Air Pollution Refactor and redesign two novel statistical R packages, leveraging professional software engineers to deploy open-source, efficient software, reaching a broad community of users and developers to facilitate nationwide studies on environmental public health. The goals of the parent grant are to: 1) conduct national epidemiological studies of Medicare and Medicaid claims to estimate the effects of long-term exposure to air pollution (PM2.5 and ozone) on Alzheimer's disease and related dementias (ADRD) hospitalization and disease progression; 2) apply machine learning methods to identify co-occurrences of individual-level, environmental, and societal factors that lead to increased vulnerability; 3) develop statistical methods to disentangle the effects of air pollution exposure from other confounding factors and to correct for potential outcome misclassification. To address these aims, we developed two R packages: Causal Rule Ensemble (CRE) (Aim 2) and Gaussian processes for the estimation of causal exposure-response curves (GP_CERF) (Aim 3). With this administrative supplement, we plan to refactor and redesign these packages, converting them into robust, easy-to-maintain, and efficient software that reaches a broader set of users and developers from the open-source community. We will review the numerical implementation from an algorithmic-design point of view, following standard version control, unit testing, and continuous integration practices. We will implement infrastructure to run the packages on shared- and distributed-memory computational nodes to meet cloud-readiness requirements. Toward this goal, we will use a broad spectrum of approaches to handle big data. In addition to R, the packages will also be implemented in Python3. Stable versions of the packages will be hosted on CRAN and PyPI. | NIA |
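Since the packages will also be implemented in Python3, the sketch below illustrates the general technique behind GP_CERF: fitting a smooth exposure-response curve with a Gaussian process. The data are synthetic, and this deliberately omits the causal adjustment for confounders that the package itself implements.

```python
# Illustration of a GP-estimated exposure-response curve on synthetic data.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
exposure = rng.uniform(2, 20, 200)[:, None]                    # e.g., PM2.5 in ug/m^3
outcome = 0.02 * exposure.ravel() + rng.normal(0, 0.05, 200)   # synthetic risk signal

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(exposure, outcome)

grid = np.linspace(2, 20, 50)[:, None]
curve, sd = gp.predict(grid, return_std=True)                  # pointwise uncertainty band
print(curve[:5], sd[:5])
```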