Driving Discovery Through Data

Inaugural Activities of the NIH Data Science Training Center
Michelle C. Dunn / 01.13.15

Last week saw a couple of inaugural activities of the NIH Data Science Training Center. Originally envisioned as a collection of in-person courses and training activities held at the NIH, the concept of the NIH Data Science Training Center quickly expanded to include online offerings to increase the reach of this national resource and to help meet the international need for training in biomedical data science. The NIH Data Science Training Center aims to support the development of open course materials, the running of online/in-person courses, and the discovery of educational resources.

The NIH Data Science Training Center includes parts of the Big Data to Knowledge (BD2K) training initiative but also includes NIH on-campus training activities that are not part of the BD2K program. Through extramural grants, BD2K supports the development of open educational resources (OER) and courses for biomedical data science. The first set of these R25 grants was announced in October, and applications for the next set are due on April 1, 2015. Also through BD2K, there is an open call for a Training Coordination Center, which will, among other things, pilot an Educational Resource Discovery Index, a key part of the NIH Data Science Training Center. Last week, the first NIH Data Science Training Center on-campus training activities were held: a Hackathon and a Software Carpentry course.

The Hackathon, organized by Ben Busby and sponsored by NCBI, enabled participants to learn new bioinformatics skills and to build useful products by working in small teams. On January 5-7, thirty genomic trainees and scientists congregated on the NIH campus for the first Hackathon. In four teams, they used publicly available data to address problems in DNA-seq, RNA-seq, Epigenomics, and Metagenomics. The goal was to produce useful pipelines by the end of the 3-day Hackathon while also learning and having fun.

Participants came from across the NIH, across the US, and as far away as Mexico and Germany. They ranged in experience from undergraduates to senior scientists. The participants discussed the problem at hand, the available data, and how to use and extend existing tools to create a product. In addition to writing and documenting code (available on GitHub), the teams submitted a paper describing the experience and results to a journal within days. An example of the Hackathon’s output is metagenomics code that runs at least an order of magnitude faster than previous metagenomic BLAST searches, a speed-up achieved through either local or cloud loading of fastq files into the SRA database format.

Ben Busby, who leads genomic outreach activities at NCBI, is considering a larger, more ambitious Hackathon in July after the outstanding success of this one. If you are interested in leading a team at a future Hackathon, please get in touch with him. If you would like to participate, please watch for the announcement on NCBI’s news, Twitter feed, LinkedIn, or Facebook page.

In addition to the Hackathon, last week’s NIH Data Science Training Center activities included the first NIH Software Carpentry workshop. The Software Carpentry workshop, organized by Lisa Federer and sponsored by the NIH Library, equipped participants with time-saving programming and computer skills. The two-day course, taught by Ian Munoz and Daniel Chen, covered four topics: the Unix operating system, programming skills through RStudio, version control and collaborative programming through git and GitHub, and data storage through SQL. The four topics combine to give necessary skills for producing reproducible analyses of Big Data.

  • Unix: Through the Unix command-line interface, tasks can be automated through scripts, and basic operations (e.g. counting words and sorting) can be done quickly without repeatedly reading from and writing to disk.
  • Programming: Programming skills, whether in R or any other programming language, are necessary for scientists who need to manipulate large data files; RStudio is an interactive environment that allows quick prototyping through standard functions and specialty packages while also maintaining a record of how the data were analyzed through the use of scripts.
  • Git/GitHub: Git is a version control system, and GitHub is a web service that enables collaborative programming and the sharing of code.
  • SQL: Relational databases, such as SQLite, manage data efficiently on disk through the building of indices for quick filtering. This allows the user to pull only the relevant data into memory for analysis.
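To make the SQL point above concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. It is not taken from the workshop materials: the table name, column names, and sample rows are hypothetical, and the database is held in memory only to keep the example self-contained (passing a file path to sqlite3.connect would store it on disk instead). It builds an index on one column and then pulls only the matching rows into memory.

```python
import sqlite3

# Hypothetical sample records: (sample_id, organism, read_count).
ROWS = [
    ("s1", "E. coli", 1200),
    ("s2", "H. sapiens", 5300),
    ("s3", "E. coli", 800),
]


def build_db(rows):
    """Create a small SQLite database with an index on the organism column."""
    # ":memory:" keeps the sketch self-contained; use a file path for on-disk storage.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples (sample_id TEXT, organism TEXT, read_count INTEGER)"
    )
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)
    # The index lets SQLite locate matching rows without scanning the whole table.
    conn.execute("CREATE INDEX idx_organism ON samples (organism)")
    conn.commit()
    return conn


def reads_for(conn, organism):
    """Return only the rows for one organism, sorted by sample_id."""
    cur = conn.execute(
        "SELECT sample_id, read_count FROM samples "
        "WHERE organism = ? ORDER BY sample_id",
        (organism,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = build_db(ROWS)
    print(reads_for(conn, "E. coli"))
```

With only three rows the index makes no practical difference; the benefit appears when the table is much larger than what an analysis needs to load at once.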

Using these skills, scientists can build pipelines that combine analysis and documentation, allowing others to understand what was done and allowing analyses to be reproduced. For large or complex data and complicated analyses, these pipelines are essential for good science.

Software Carpentry courses aim to give participants an introduction to basic computational tools. If you are interested in teaching a Software Carpentry course or another course in any aspect of data science on the NIH campus, please get in touch with Lisa Federer or Michelle Dunn.

Last week’s Hackathon and Software Carpentry courses are just the first pilots for on-campus in-person courses and activities. Although held on the NIH campus, they are open to non-NIH participants as well and will complement other in-person courses and activities happening elsewhere. Although some aspects of learning are best served by in-person interaction, much can be accomplished online. The NIH Data Science Training Center includes both online and in-person learning to meet the needs of the diverse biomedical workforce.

The ADDS office would like to know what data science courses you would like to participate in – course descriptions might include particular topics, levels, depths/lengths, and formats (in-person vs. online). You may use the box below to enter your comments.
