There is great potential for discovery and innovation as the quantity and accessibility of biomedical data continue to expand. However, this potential cannot be realized without appropriate tools. The Targeted Software Development awards fund software tools and methods development to tackle data management, transformation, and analysis challenges in areas of high need to the biomedical research community. In 2015, awards were made in the areas of data compression, data provenance, data visualization, and data wrangling. In 2016, awards were made in the areas of data privacy, data repurposing, and applying metadata. Each of these areas has unmet challenges that affect research capacity across many of the diverse domains of biomedicine. Awardees were selected from the applicants to the BD2K “Development of Software and Analysis Methods for Biomedical Big Data in Targeted Areas of High Need” funding opportunity announcements RFA-HG-14-020 and RFA-CA-15-017. These funding opportunities were developed in part via a 2013 BD2K Request for Information: Input on Development of Analysis Methods and Software for Big Data.
Deploying ONC National Standards in Support of Metadata for Big Data Research Warehouse Management of Repurposed Laboratory, Pathology & Patient Findings Data from the EHR
University of Nebraska Medical Center
Principal Investigator: Scott Campbell
Grant Number: U01 HG009455
Argumentation and Linked-Metadata Services for Reproducible Target Validation
Massachusetts General Hospital
Principal Investigator: Timothy W. Clark
Grant Number: U01 HG009452
Tools for Standardizing Clinical Research Metadata Using HL7 FHIR
An Intelligent Concept Agent for Assisting with the Application of Metadata
Metadata Applications on Informed Content to Facilitate Biorepository Data Regulation and Sharing
University of Texas Health Science Center - Houston
Principal Investigator: Cui Tao
Grant Number: U01 HG009454
Data Compression & Reduction
Task-Specific Compression for Biomedical Big Data
University of Arizona
Principal Investigator: Ali Bilgin
Grant Number: U01 CA198945
The size and complexity of data produced by biomedical imaging are significant challenges to applying the data to many research questions. To tackle the bottlenecks of high-volume data storage and interactive remote access for high-resolution digital pathology images, this award supports development of advanced data compression algorithms for open-source software. Image compression algorithms currently achieve a compression ratio of approximately 10:1. This project aims to achieve a compression ratio of 100:1 without degradation in visual image quality.
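To make the scale of that goal concrete, the short sketch below computes the stored size of one image at each compression ratio. The 5 GB uncompressed image size is an assumed, illustrative figure and does not come from the award description.

```python
def compressed_size_gb(raw_gb: float, ratio: float) -> float:
    """Return the size after compression at `ratio`:1 (ratio=10 means 10:1)."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return raw_gb / ratio


RAW_IMAGE_GB = 5.0  # hypothetical uncompressed size of one pathology image

size_at_10 = compressed_size_gb(RAW_IMAGE_GB, 10)    # current ratio cited above
size_at_100 = compressed_size_gb(RAW_IMAGE_GB, 100)  # ratio targeted by the project

print(f"10:1 -> {size_at_10:.2f} GB, 100:1 -> {size_at_100:.2f} GB")
# prints: 10:1 -> 0.50 GB, 100:1 -> 0.05 GB
```

Under this assumption, moving from 10:1 to 100:1 cuts the storage and transfer cost of each image by a further factor of ten.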
Theoretical Foundations and Software Infrastructure for Biological Network Databases
Case Western Reserve University
Principal Investigator: Mehmet Koyuturk
Grant Number: U01 CA198941
The software developed for this award will make network data better organized and more accessible for biomedical research. To accomplish these goals, the developers will use unique methods, including the use of graph oriented approaches to implement compression in such networks and the integration of data compression with biological network data storage design and version control. This project is designed to develop and validate new computational theories, storage schemes, compression algorithms, version control mechanisms, and query interfaces. These advances will enable efficient storage, update, processing, and querying of biological networks.
Genomic Compression: From Information Theory to Parallel Algorithms
University of Illinois – Urbana-Champaign and Stanford University
Principal Investigators: Olgica Milenkovic and Tsachy Weissman
Grant Number: U01 CA198943
The goal of this award is to improve compression of application-driven genomic and functional genomic data by introducing new compaction and dimensionality reduction techniques. These techniques are developed from information theory and signal processing. The project will produce new compression algorithms for genomic data, focusing on three major data types: SAM, FASTQ, and WIG files. The aim is to improve upon existing genomic data compression methods by a factor of 10 to 100.
University of California – San Diego
Principal Investigator: Peter Rose
Grant Number: U01 CA198942
As technologies in structural biology continue to improve, many new large complex structures are being characterized. Interactive visualization of large complex structures and structural comparison across the entire Protein Data Bank (PDB) archive exceed the network bandwidth available through the web. These tasks currently require dedicated local network and high-performance computing infrastructure that is not widely available. This award aims to make these structures accessible to all scientists, educators, and students with more efficient ways to represent the structural data. The goal of this project is to develop a set of compression algorithms, applications, and workflows that will significantly improve the performance of interactive visualization of 3-dimensional structures of large complexes over the internet. Such tools would enable users to analyze large structures, carry out large-scale searches, and visualize the coordinates directly in the compressed format.
Meaningful Data Compression and Reduction of High-Throughput Sequencing Data
Principal Investigator: Martin Farach-Colton
Grant Number: U01 CA198952
This project aims to develop novel computational algorithms for lossless data compression and lossy data reduction of sequencing data. The new methods would allow direct downstream computation on compressed data without decompression. As a result, the potential impact of the proposed development is not limited to data storage and transfer; it will also affect high-throughput sequence analysis such as genome comparison and metagenomics analysis. A compressive genomics middleware and related Application Programming Interface will be developed to allow communication with existing genomic analysis software without much new code development.
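The idea of computing on compressed data without decompressing it can be pictured with a deliberately simple stand-in. The sketch below uses run-length encoding, which is not the project's method and is far weaker than real sequencing compressors; all function names are hypothetical. It shows a base-count query answered directly from the compressed runs.

```python
from collections import Counter
from itertools import groupby


def rle_encode(seq: str) -> list[tuple[str, int]]:
    """Losslessly compress a sequence into (base, run_length) pairs."""
    return [(base, sum(1 for _ in run)) for base, run in groupby(seq)]


def rle_decode(runs: list[tuple[str, int]]) -> str:
    """Rebuild the original sequence; used here only to check losslessness."""
    return "".join(base * length for base, length in runs)


def base_counts(runs: list[tuple[str, int]]) -> Counter:
    """Count bases by summing run lengths, never expanding the runs."""
    counts: Counter = Counter()
    for base, length in runs:
        counts[base] += length
    return counts


seq = "AAAACCGGGGGTT"            # toy read, not real data
runs = rle_encode(seq)           # compressed form
counts = base_counts(runs)       # query answered on the compressed form
print(runs, dict(counts))
```

The query touches one tuple per run rather than one character per base, which is the kind of saving that operating in the compressed domain is meant to provide.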
Methods and Software to Enhance Genomic Privacy and Sharing of RNA-seq Data
Principal Investigator: Mark Gerstein
Grant Number: U01 EB023686
Encryption Methods and Software for Privacy-Preserving Analysis of Biomedical Data
Indiana University Bloomington
Principal Investigator: Haixu T. Tang
Grant Number: U01 EB023685
Privacy-Protecting Distributed Analysis of Biomedical Big Data
Harvard Pilgrim Health Care, Inc.
Principal Investigator: Darren Toh
Grant Number: U01 EB023683
Flexible & Executable Provenance in Data-Intensive Biomedical Research: A Flexible Research Data Service
Principal Investigator: Erich S. Huang
Grant Number: U01 EB020957
This award is to create data and compute services that allow one to declare, generate and validate formal data provenance, and to display that provenance via services and visualizations. As a backbone for scientific exchange, the research team will build an open-source flexible research data service. They will also provide common open-source infrastructure to run containerized, modular, scientific workflows within the research data service. Such services will support efforts by other scientists to understand and reproduce complex analyses and to more effectively evaluate the science.
Approximating and Reasoning about Data Provenance
University of Pennsylvania
Principal Investigators: Zachary Ives and Junhyong Kim
Grant Number: U01 EB020954
Source information for portions of large combined datasets can often be missing or incomplete. This award funds the development of software tools to improve data reproducibility and consistency for multiple-source next generation sequencing (NGS) data through reconstruction of provenance records. The team plans to develop a novel model of partial provenance, develop tools for analyzing partial and reconstructed provenances, and develop new algorithms for inferring missing provenance information. The output of the research will be particularly useful to the field of “forensic bioinformatics,” helping to understand how other researchers may have arrived at their results.
A PROV Standard-Based Data Source Agnostic Provenance Engine for Big Data Analytics
Case Western Reserve University
Principal Investigator: Satya Sanket Sahoo
Grant Number: U01 EB020955
The research team proposes to develop a provenance engine based on the PROV standard of the World Wide Web Consortium (W3C), the web technology standards organization, to create a public provenance resource for data quality, scientific reproducibility, and trusted information systems. By combining mathematical approaches such as algebraic graph theory and distributed algorithms for cloud-computing implementations, in conjunction with the new PROV representation standard, the project plans to develop a scalable data-source agnostic provenance engine. Such an engine would be able to judge the reproducibility of research results, score data quality, and compute the trustworthiness of data. The developed provenance engine will be tested on two categories of biomedical Big Data: sleep data from the NHLBI-funded National Sleep Research Resource, and neuroscience data from the NINDS-funded Center Without Walls for Epilepsy.
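As a rough illustration of what a provenance record captures, the sketch below stores two of the core PROV relations, "used" and "wasGeneratedBy", in plain Python and walks them backwards to recover the lineage of a result. It is a minimal sketch and not the project's engine; the class, the file names, and the activity names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ProvGraph:
    # activity -> input entities (the PROV "used" relation)
    used: dict = field(default_factory=dict)
    # entity -> activity that produced it (the PROV "wasGeneratedBy" relation)
    was_generated_by: dict = field(default_factory=dict)

    def record(self, activity, inputs, outputs):
        """Record one processing step with its inputs and outputs."""
        self.used[activity] = list(inputs)
        for entity in outputs:
            self.was_generated_by[entity] = activity

    def lineage(self, entity):
        """Return every upstream entity that contributed to `entity`."""
        seen = set()
        stack = [entity]
        while stack:
            current = stack.pop()
            activity = self.was_generated_by.get(current)
            if activity is None:  # a source entity with no recorded origin
                continue
            for source in self.used[activity]:
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return seen


graph = ProvGraph()
graph.record("filter_artifacts", ["raw_eeg.edf"], ["clean_eeg.edf"])
graph.record("score_sleep_stages",
             ["clean_eeg.edf", "scoring_rules.json"], ["hypnogram.csv"])
upstream = graph.lineage("hypnogram.csv")
print(sorted(upstream))
```

A full PROV model also covers agents and further relations such as wasDerivedFrom and wasAssociatedWith, which this sketch omits.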
Integrating the World's Microbiology Laboratories into a Global Microbial Surveillance System
Brigham and Women's Hospital
Principal Investigator: John Stelling
Grant Number: U01 CA207167
Drug Repurposing for Cancer Therapy: From Man to Molecules to Man
University of North Carolina - Chapel Hill
Principal Investigator: Alexander Tropsha
Grant Number: U01 CA207160
From Terabytes of Pixels to Intuitive Brain Networks
University of Southern California
Principal Investigator: Hong-Wei Dong
Grant Number: U01 CA198932
This project will develop tools to visualize and analyze connectivity networks in the mouse brain. The researchers will further develop the “Connection Lens” software to enable automated and visual analysis of the massive data sets accumulated by the Mouse Connectome Project. They will also develop a new software tool “Projection Lens” to enable the production of comprehensive connectivity diagrams. These tools will enable researchers to drill down from these comprehensive views to visualize connections between regions of interest in the mouse brain. This visualization web application will enable researchers to display all possible routes between regions of interest in the mouse brain in a manner analogous to a road map, as well as perform numerous other analyses and representations of brain connectivity data.
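The road-map analogy corresponds to enumerating paths in a directed graph. The sketch below is an illustrative depth-first search, not the project's software; the region labels and connections are invented placeholders, and the function name is hypothetical.

```python
def all_routes(graph, start, goal, path=None):
    """Return every route from start to goal that visits no region twice."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    routes = []
    for neighbor in graph.get(start, []):
        if neighbor not in path:  # skip regions already on this route
            routes.extend(all_routes(graph, neighbor, goal, path))
    return routes


# Hypothetical directed connections between placeholder regions.
connections = {
    "region_A": ["region_B", "region_C"],
    "region_B": ["region_C", "region_D"],
    "region_C": ["region_D"],
    "region_D": [],
}

routes = all_routes(connections, "region_A", "region_D")
for route in routes:
    print(" -> ".join(route))
```

Exhaustive enumeration grows quickly with the number of connections, so a tool working at whole-brain scale would likely need to bound route length or rank routes.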
Tools for Visualization of Geographic Structure in Population Genomic Data
University of Chicago
Principal Investigator: John Novembre
Grant Number: U01 CA198933
The goal of this project is to develop software for geographic visualization of population genetics data. This web services software will be designed to help the novice user with basic analyses using interactive visualization. The software will also include some more advanced models using local migration rates to help explain the observed relationship between genetic similarities and geography. These tools will provide powerful methods to visualize and track genetic variations of populations. This is essential for the study of human demographics and can also augment other types of bioinformatic studies.
Visual Analysis of Genomic and Clinical Data from Large Patient Cohorts
Harvard Medical School
Principal Investigator: Peter J. Park
Grant Number: U01 CA198935
This project will support research and clinical efforts in diagnosing rare diseases and disease sub-types by developing new visualization algorithms and software. The software will enable identification of patterns across different types of molecular and clinical data obtained from studies containing large numbers of patients. The project includes design of visual analysis methods to efficiently identify disease sub-types and to aid in diagnosing rare and undiagnosed disease. These methods will be developed based on overlapping patient sets across multiple data types. In addition to new visualization methods, this award supports the creation of an extensible, open-source visual analysis framework for analyzing large biomedical, clinical, and epigenomic datasets on the web.
Harvard Medical School
Principal Investigator: Griffin M. Weber
Grant Number: U01 CA198934
This project will develop data visualization tools to extract and visualize Healthcare System Dynamics (HSD) information from the electronic health records (EHRs) of individual patients, clinical trials databases, laboratory test records, and administrative claims databases. Such tools could separate HSD information from patient pathophysiology, making EHRs more useful for clinical research. The ability to segregate and visualize data that are revealed only by aggregating other data sets, like HSD information, should ultimately lead to improved patient care.
MACE2K - Molecular and Clinical Extraction: A Natural Language Processing Tool for Personalized Medicine
Principal Investigator: Subha Madhavan
Grant Number: U01 HG008390
This BD2K project will develop new computational methods and software to retrieve targeted molecular and drug therapy information from multiple sources of big data, including ClinicalTrials.gov, PubMed abstracts, open access articles, and conference proceedings. The software suite, MACE2K, will primarily contain natural language processing features for biological entity and relation extraction, along with disambiguation and normalization modules. The suite can be used by biomedical researchers to generate new hypotheses for research on personalized cancer treatment decisions using large volumes of public data. A novel gamification and evidence-ranking component will be built to allow subject matter experts to verify, rank, and validate the information corpus to enhance accuracy of the software for broader use.
University of Virginia
Principal Investigator: Wladek Minor
Grant Number: U01 HG008424
The Integrated Resource for Reproducibility in Macromolecular Crystallography will develop tools for molecular structure determination. These tools will enhance and sustain macromolecular diffraction data and provide metadata for raw X-ray diffraction images. Such images are the primary data sources for macromolecular atomic coordinates in the Protein Data Bank (PDB). This project will also create a web-based archive for searching and analyzing diffraction images. This will preserve diffraction data for currently unsolved structures and will provide a large test set of data for improving diffraction image algorithms and hardware.
Automatic Discovery and Processing of EEG Cohorts from Clinical Records
Temple University and University of Texas - Dallas
Principal Investigators: Joseph Picone, Iyad Obeid, and Sanda Maria Harabagiu
Grant Number: U01 HG008468
This award supports software development for automated explanatory modeling of complex healthcare data. The researchers will develop a patient cohort retrieval system to provide big mechanism modeling capability for analysis of electroencephalogram (EEG) data. Big mechanisms have been defined as large explanatory models of complex systems with many causal interactions. The project is centered on the aggregation of clinical knowledge automatically discovered from EEG signals and EEG reports into a medical knowledge graph. The software framework established by this project could be transformative for mining the wealth of biomedical knowledge available from hospital medical records.
Mining the Social Web to Monitor Public Health and HIV Risk Behaviors
University of California Los Angeles
Principal Investigators: Wei Wang, Tyson Condie, and Sean Young
Grant Number: U01 HG008488
Surveillance and monitoring of public health-related risk behaviors is a top priority. A growing source of data that can be used to help address this issue is social big data, or information from social media and online platforms on which individuals and communities create, share, and discuss content. Although public health agencies are interested in mining social big data to address public health problems, current tools are not usable by most health scientists, as the tools require advanced computer science expertise. This project is of particularly high impact because it seeks to develop software to allow researchers to analyze the real-time conversations from social big data to monitor and predict public health-related risk behaviors and disease outbreaks.
Community Platform for Data Wrangling of Gene and Genetic Variant Annotations
The Scripps Research Institute
Principal Investigator: Chunlei Wu
Grant Number: U01 HG008473
One of the primary challenges in the biomedical big data era is that the rate of scientific discoveries greatly outpaces traditional efforts to structure the generated data into computable forms. The goal of this work is to develop a community platform to annotate gene and genetic variation data by integrating multiple streams of information that are typically separated and hard to combine. Bridging the gap in this way between the generation and processing of data will provide researchers with access to an important source of aggregated community knowledge. For more information, visit the following websites: MyGene.info (http://mygene.info) and MyVariant.info (http://myvariant.info). Both sites have a blog for communicating news/updates with users. You can also follow this program on Twitter: @mygeneinfo and @myvariantinfo.
Developing Methods for Curating Multi-Omics Data
Icahn School of Medicine at Mount Sinai
Principal Investigator: Jun Zhu
Grant Number: U01 HG008451
The quantity of large genetic, epigenetic, genomic, and environmental perturbation datasets in the public domain continues to grow, and researchers are now integrating across the data types to solve biomedical problems. Due to the complexity of generating large datasets, data errors (e.g., caused by sample mislabeling or data mistyping) in public “omics” databases may go undetected. The impact of such errors can be magnified when additional researchers use these databases. By developing useful and transferable methods and tools to enable the identification and correction of such errors with high precision, this project will maximize statistical power and safeguard scientific conclusions based on public “omics” databases.