BD2K: Investing Now in Tools for the Future of Biomedical Data Science

June 1, 2015

New NIH Big Data to Knowledge initiative grants fund software development to promote biomedical Big Data discovery

There is great potential for discovery and innovation as the quantity and accessibility of biomedical data continue to expand. However, this potential cannot be realized without appropriate tools. A set of 15 National Institutes of Health (NIH) awards starting on June 1, 2015 will fund researchers engaged in the development of software tools for biomedical Big Data applications. The awards come from the Targeted Software Development program within the trans-NIH Big Data to Knowledge (BD2K) initiative. The grants, totaling $6.8 million in this fiscal year, represent innovative and collaborative approaches to common challenges across several areas of biomedical data use.

With these grants, researchers will develop software to tackle data management, transformation, and analysis challenges in areas of high need to the biomedical research community. For the 2015 awards, applications were solicited in four such areas: data compression, data provenance, data visualization, and data wrangling. Each of these areas has unmet challenges that limit research capacity across many of the diverse domains of biomedicine. By addressing researcher needs in these areas, such tools will reduce the cost of data storage and analysis, democratize access to complex data, increase the capacity for data sharing and re-use, and enable innovative biomedical experiments. Additionally, by requiring that these tools be open-source, these awards open the door to future innovations and improvements based upon the initial developments.

Data Compression is a major area of unmet need. Biomedical imaging, DNA sequence data, protein structure data, and molecular network data all require better compression tools. Digital imaging has produced a revolution in the laboratory and the clinic just as it has in our homes. The ease with which images can be produced has created spectacular capacity to answer new biomedical questions, inspired novel experimental instruments, and provided more powerful tools for 21st-century clinical diagnostics. However, just like the overflowing memory card full of snapshots on your smartphone, image data is straining biomedical researchers' storage and compute capabilities. Data compression software has the potential to address some of the most significant data storage, compute, and sharing challenges facing biomedical researchers. The BD2K program has made four awards in the data compression area for software innovations that span these data types and will likely influence technology development for others as well.

Data Provenance is the tracking of the creation, modification, and movement of data during analysis and across its entire lifecycle. While it sounds esoteric, this area of data management is essential for rigor and reproducibility in biomedical Big Data experiments. Without correct and complete data provenance, errors or incorrect assumptions can enter the research process. BD2K is funding three awards in the data provenance area that build the infrastructure and tools required to maximize the experimental value obtained from data provenance information. These new provenance tools will permit researchers to reconstruct missing information about data, to better understand the methods used by others for a particular experiment, and to compute quality and trustworthiness scores for data.

Data Visualization research expands the human capacity to identify connections and trends in data by developing new and improved ways to visualize and explore the data. BD2K is funding four grants under the visualization topic area. These projects address visualization of tissue-level molecular networks, visualization of uncertainty and variety in geographic representations of population genetics data, accessibility of clinically relevant insights, and visual analysis frameworks for dissimilar data from multiple sources. Together, these projects address critical barriers in transforming data to knowledge by enabling researchers to use previously inaccessible data, interact with data in a discovery-oriented manner, and derive new insights by visualizing different data types from across multiple studies.

Data Wrangling is the process of converting or mapping data across different forms, and refining its representation, usually through the use of automated tools. This type of data processing is essential to increase the usability of biomedical data and is required to produce Big Data analyses from disparate and dissimilar data sources. BD2K has funded four data wrangling awards. These awards will generate tools for enhancing protein structure data, integrating disparate clinical data types, annotating genomics data from multiple data sources, and identifying errors in large -omics data sets. Through data wrangling innovations, data will become easier to access and use and will gain value from additional annotation and connection information.

The quantity of data continues to grow across all of the domains of biomedicine. Software tools in these categories will permit scientists to maximize the potential for integration of data sets and minimize errors. Through new methods and innovative designs, they will allow researchers to see connections that were previously obscured. Most importantly, they will allow scientists to better manage, manipulate, and share the data that they are already producing. These tools point to a bright future for biomedical experimentation and make strides towards the development of the type of personalized and precision healthcare that is highly dependent on real-time analysis of biomedical Big Data.

More information about the BD2K program can be found at

More information about the Targeted Software Development Awards can be found at