
Data Science News

Data Science Community News



NIH awards to test ways to store, access, share, and compute on biomedical data in the cloud

November 6, 2017

NIH Data Commons Pilot Phase to seek best practices for developing and managing a data commons

Twelve awards totaling $9 million in Fiscal Year 2017 will launch a National Institutes of Health Data Commons Pilot Phase. A data commons is a shared virtual space where scientists can work with the digital objects of biomedical research, such as data and analytical tools. The NIH Data Commons will be implemented in a four-year pilot phase to explore the feasibility and best practices for making digital objects available through collaborative platforms. This will be done on public clouds, which are virtual spaces where service providers make resources, such as applications and storage, available over the internet. The goal of the NIH Data Commons Pilot Phase is to accelerate biomedical discoveries by making biomedical research data Findable, Accessible, Interoperable, and Reusable (FAIR) for more researchers.

“Harvesting the wealth of information in biomedical data will advance our understanding of human health and disease,” said NIH Director Francis S. Collins, M.D., Ph.D. “However, poor data accessibility is a major barrier to translating data into understanding. The NIH Data Commons Pilot Phase is an important effort to remove that barrier.”

“The NIH Data Commons Pilot Phase will create new opportunities for research not feasible before,” said NIH Data Commons Pilot Phase Program Manager, Vivien Bonazzi, Ph.D. “Making biomedical data sets accessible and connected at an unprecedented scale will lead to creative new ways to combine, analyze, and ask questions of the data to generate new knowledge.”

The recipients of the 12 awards will form the nucleus of an NIH Data Commons Pilot Phase Consortium in which researchers will start developing the key capabilities needed to make an NIH Data Commons a reality. These key capabilities, which were identified by NIH, collectively represent the principles, policies, processes, and architectures of a data commons for biomedical research data. Key capabilities include making data transparent and interoperable, safeguarding patient data, and getting community buy-in for data standards.

Three NIH-funded data sets will serve as test cases for the NIH Data Commons Pilot Phase. The test cases include data sets from the Genotype-Tissue Expression and the Trans-Omics for Precision Medicine initiatives, as well as the Alliance of Genome Resources, a consortium of Model Organism Databases established in late 2016. These data sets were chosen based on their value to users in the biomedical research community, the diversity of the data they contain, and their coverage of both basic and clinical research. While just three datasets will be used at the outset of the project, it is envisioned the NIH Data Commons efforts will expand to include other data resources once the pilot phase has achieved its primary objectives.

NIH has acquired support from a Federally Funded Research and Development Center, the MITRE Corporation, to assist in establishing new, sustainable NIH infrastructure for data science (people, processes, technologies). The MITRE Corporation will provide a broad range of support services for the NIH Data Commons Pilot Phase, including innovative approaches to assure cost-effective cloud-based computing and storage for scientific data; analyses related to usage, cost, and comparative business models; and other considerations to assure long-term viability of NIH data science efforts.

The trans-NIH Data Commons Pilot Phase receives funding from multiple NIH Institutes and Centers and is managed by the NIH Common Fund within the NIH Office of the Director. The Common Fund; the National Heart, Lung, and Blood Institute; and the National Human Genome Research Institute are the lead NIH entities involved in management of the NIH Data Commons Pilot Phase.

About the NIH Common Fund: The NIH Common Fund encourages collaboration and supports a series of exceptionally high-impact, trans-NIH programs. Common Fund programs are managed by the Office of Strategic Coordination in the Division of Program Coordination, Planning, and Strategic Initiatives in the NIH Office of the Director in partnership with the NIH Institutes, Centers, and Offices. More information is available at the Common Fund website:

About the National Heart, Lung, and Blood Institute (NHLBI): Part of the National Institutes of Health, the National Heart, Lung, and Blood Institute (NHLBI) plans, conducts, and supports research related to the causes, prevention, diagnosis, and treatment of heart, blood vessel, lung, and blood diseases; and sleep disorders. The Institute also administers national health education campaigns on women and heart disease, healthy weight for children, and other topics. NHLBI press releases and other materials are available online at

About the National Human Genome Research Institute (NHGRI): NHGRI is one of the 27 institutes and centers at the National Institutes of Health. The NHGRI Extramural Research Program supports grants for research, and training and career development at sites nationwide. Additional information about NHGRI can be found at

About the National Institutes of Health (NIH): NIH, the nation's medical research agency, includes 27 Institutes and Centers and is a component of the U.S. Department of Health and Human Services. NIH is the primary federal agency conducting and supporting basic, clinical, and translational medical research, and is investigating the causes, treatments, and cures for both common and rare diseases. For more information about NIH and its programs, visit

NIH…Turning Discovery Into Health®



Call for Papers

October 17, 2017

NIPS Workshop on Machine Learning for Health (NIPS ML4H 2017)

What parts of Healthcare are Ripe for Disruption by Machine Learning Right Now?

A workshop at the Thirty-First Annual Conference on Neural Information Processing Systems (NIPS 2017).

Friday, December 8, 2017

Long Beach Convention Center, Long Beach, CA, USA

Please direct questions to:

NOTE 2017/09/28: NIPS 2017 workshop registrations are now sold out. If you have not registered, you may still submit a paper. During submission, please indicate as "corresponding author" an author who will attend, or who could attend in the unlikely event that more registrations become available.


  • Mon Oct 30, 2017: Submission deadline at 11:59pm
  • Fri Nov 10, 2017: Acceptance notification (Poster or Spotlight+Poster)
  • Thu Nov 16, 2017: NIPS deadline to cancel registration (with full refund)
  • Fri Dec 01, 2017: Final papers posted online (with permission)
  • Fri Dec 08, 2017: Workshop


The goal of the Machine Learning for Health Workshop (NIPS ML4H 2017) is to foster collaborations that meaningfully impact medicine by bringing together clinicians, health data experts, and machine learning researchers. We aim to build on the success of the last two NIPS ML4H workshops which were widely attended and helped form the foundations of a new research community.

This year’s program emphasizes identifying problems in healthcare that the machine learning community hasn't yet addressed, or seeing old challenges through a new lens. While healthcare and medicine are often touted as prime examples for disruption by AI and machine learning, there has been vanishingly little evidence of this disruption to date. To interested parties who are outside of the medical establishment (e.g. machine learning researchers), the healthcare system can appear byzantine and impenetrable, which results in a high barrier to entry. In this workshop, we hope to reduce this activation energy by bringing together leaders at the forefront of both machine learning and healthcare for a dialog on areas of medicine that have immediate opportunities for machine learning. Attendees at this workshop will quickly gain an understanding of the key problems that are unique to healthcare and how machine learning can be applied to address these challenges.

The workshop will feature invited talks from leading voices in both medicine and machine learning. Invited clinicians will discuss open clinical problems where data-driven solutions can make an immediate difference. The workshop will conclude with an interactive panel discussion where all speakers respond to questions provided by the audience.

From the research community, we welcome short paper submissions highlighting novel research contributions at the intersection of machine learning and healthcare. Accepted submissions will be featured as poster presentations and (in select cases) as short oral spotlight presentations.


Researchers interested in contributing should upload short, anonymized papers of up to 4 pages in PDF format by Monday, October 30, 2017, 11:59 PM in the timezone of your choice.

Please submit via our ML4H EasyChair website:

Papers should adhere to the NIPS conference paper format, via the NIPS LaTeX style file:

Workshop papers should be at most 4 pages of content, including text and figures. Additional pages containing only bibliographic references can be included without penalty.

Relevant Topics

Submitted papers should describe innovative machine learning research focused on relevant problems in health and medicine. This can mean new models, new datasets, new algorithms, or new applications. Topics of interest include but are not limited to reinforcement learning, temporal models, deep learning, semi-supervised learning, data integration, learning from missing or biased data, learning from non-stationary data, model criticism, model interpretability, causality, model biases, and transfer learning.

Peer Review and Acceptance Criteria

All submissions will undergo double-blind peer review. It will be up to the authors to ensure the proper anonymization of their paper. Do not include any names or affiliations. Refer to your own past work in the third person.

Accepted papers will be chosen based on technical merit and suitability to the workshop's goals. All accepted papers will be included in one of two poster presentation sessions on the day of the workshop. Some accepted papers will be invited to give short oral spotlight presentations at the workshop.

Registration and Attendance

To promote community interaction, we hope at least one presenting author has registered and can attend the workshop. However, because NIPS workshop registration has sold out, we encourage all researchers to submit a paper regardless of their registration status.

Accepted papers whose authors cannot attend will at least be listed on our website. It is unlikely that we will be able to create new registration spots for authors of accepted papers, but we are exploring possibilities. If your paper is accepted and you cannot attend due to registration or other issues, please contact us after you are accepted and we'll find solutions on a case-by-case basis. Acceptance notifications will go out a few days before the NIPS deadline for full refunds.

Copyright for Accepted Papers

This workshop will be informally published online but not officially archived. This means:

  • Authors will retain full copyright of their papers.

  • Acceptance to NIPS ML4H 2017 does not preclude publication of the same material in another journal or conference.

We encourage (but do not require) accepted papers to be posted on arXiv. With author permission, we will post links to accepted short papers on our workshop website.

Our workshop does allow submission of papers that are under review or have been recently published in a conference or a journal. Authors should clearly state any overlapping published work at time of submission.



SYMPOSIUM AND WEBCAST: Principles for Data-Driven Decision Making

September 7, 2017

The abundance of large and complex data, coupled with powerful modeling techniques and analytic methods, creates tremendous opportunity for organizations and individuals to base their decisions on empirical evidence. However, to appreciate both the capabilities and limitations of these data and tools, decision makers need some understanding of data science principles. The National Academies of Sciences, Engineering, and Medicine invite you to attend our upcoming symposium and webcast on data-driven decision making that will take place on September 14, 2017 from 9:00am-5:00pm at the Keck Center in Washington, DC. The event will highlight simple principles that can support data-driven decision making and help decision makers learn the right questions to ask when presented with new analyses.

Register here to attend in person or online.

About Math and Statistics at the National Academies

The Board on Mathematical Sciences and Analytics (BMSA) leads activities in the mathematical sciences at the National Academies in topic areas including applied mathematics, scientific computing, and risk analysis.

The Committee on Applied and Theoretical Statistics (CATS) organizes studies and events focusing on the statistical sciences, big data and data science, statistical education, the use of statistics, and issues affecting the field. CATS occupies a pivotal position in the statistical community, providing expertise in methodology and policy formation.



NIH Data Science Week 2017

September 7, 2017

The NIH Data Science Week is a bi-annual series of talks and workshops focused on data science, hosted by the Data Science and Bioinformatics Scientific Interest Groups.

Monday, September 18th

9 am - 12 pm EDirect workshop at NLM

NCBI staff will offer a workshop on EDirect, NCBI’s suite of programs for easy command line access to literature and biomolecular records. To join the workshop, please register.

1 pm - 3:30 pm Containerization Workshops and Roundtable -- Natcher Balcony A

Details: 1:30 - 2:10 Docker presentation; 2:10 to 2:50 Singularity Presentation; 2:50 to 3:30 Containerization, HPC and cloud roundtable.

Tuesday, September 19th

11 am - 12 pm Speaker: Sarah Pendergrass from Geisinger -- NLM Visitor Center, Bldg 38A (Lister Hill Center) Lobby

From Learning Health Care to Genetic Research: Precision Medicine In Action at Geisinger Health System

Learning Health Care is now becoming a reality within Geisinger Health System. The MyCode Community Health Initiative of Geisinger Health System has whole exome sequencing data and whole genome array genotyping for more than 90,000 individuals to date, and is continuing to expand. Geisinger provides primary and specialty care across the life span, and the biorepository of genetic data is linked to de-identified longitudinal health records. With the breadth of data being collected, Geisinger is returning genetic results to patients and engaging in a variety of research to bring additional clinical and genetic findings back to the clinic. This talk will cover return of results at Geisinger, and new research within the Pendergrass Lab.

*** There are still a few spots available to speak directly with Sarah.  If interested, please email

Thursday, September 21st

11 am - 12 pm Speaker: Jake Lever from UBC -- NLM Visitor Center

PubRunner: Keeping text mining up-to-date with the latest publications

Biologists face a daunting challenge when trying to read all relevant scientific literature for their field. Text mining tools are designed to assist them by aiding search, summarizing the latest research, and identifying important patterns in the literature. However, many published tools lie dormant, as code is not public and any results shared become out-of-date as new publications enter the field. Through the NCBI hackathons initiative, we have built PubRunner, a framework for managing download of the latest publications, execution of text mining tools, and sharing of the results. This effort aims to help research groups keep text mining tools alive and make text mining results even more valuable to the biology community.
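The three stages named in the abstract (download new publications, run text mining tools, share the results) can be pictured with a minimal Python sketch. This is an illustration only and not PubRunner's actual code or interface: the function `run_update`, the `Document` type, and the in-memory "corpus" and "publish" stand-ins are all hypothetical names chosen for this example.

```python
from typing import Callable, Dict, Iterable, List, Set

# All names below are hypothetical and for illustration only;
# none of them come from PubRunner itself.
Document = Dict[str, str]


def run_update(
    corpus: Iterable[Document],
    seen_ids: Set[str],
    tools: Dict[str, Callable[[List[Document]], object]],
    publish: Callable[[str, object], None],
) -> int:
    """Process documents not seen before and return how many were new."""
    # Stage 1: pick up only publications that were not processed earlier.
    new_docs = [doc for doc in corpus if doc["id"] not in seen_ids]
    if not new_docs:
        return 0
    # Stage 2: run every registered text mining tool on the new documents.
    for name, tool in tools.items():
        # Stage 3: share each tool's result under the tool's name.
        publish(name, tool(new_docs))
    # Remember what was processed so the next run skips it.
    seen_ids.update(doc["id"] for doc in new_docs)
    return len(new_docs)


# Usage with an in-memory corpus and a dictionary standing in for a results site.
corpus = [
    {"id": "a", "text": "gene therapy"},
    {"id": "b", "text": "protein folding"},
]
shared: Dict[str, object] = {}
seen: Set[str] = set()
tools = {"word_count": lambda docs: sum(len(d["text"].split()) for d in docs)}

first = run_update(corpus, seen, tools, shared.__setitem__)
second = run_update(corpus, seen, tools, shared.__setitem__)
```

Tracking already-processed identifiers is what lets a repeated run do nothing when no new publications have appeared, which is the property that keeps shared results current without reprocessing everything.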

3 pm - 4 pm Speaker: Imran Haque from Freenome -- NLM Lindberg Room, Bldg 38

Embracing heterogeneity: statistical limitations and opportunities in early detection liquid biopsies

The discovery of tumor-derived circulating cell-free DNA (ctDNA) in cancer patients has ignited interest and investment in developing blood-based assays to detect cancer at early, treatable stages. The existence of many analytical methods (dPCR, BEAMing, UMI-tagged high-depth NGS) to detect mutated tumor-derived material combined with increasing knowledge of the characteristics of tumor genomes has driven an empirical approach of “more is better” to translate assays developed on late-stage cancer patients to the early detection setting. However, there exists a lack of data and analysis on the feasibility of such a translation.

In this presentation, Imran will analyze fundamental statistical challenges in liquid biopsy, including benign somatic heterogeneity, and quantitative limitations in the analysis of patient samples. He will further demonstrate that these limitations arise from upstream statistical assumptions about the nature of the problem, and that relaxing these assumptions admits potential solutions of a different flavor: making use of modern machine learning to integrate both prior data as well as multi-analyte analysis on individual samples to address the fundamental challenges of liquid biopsy.

*** There are still spots available to speak directly with Imran.  If interested, please email

Upon approval by presenters, materials or links will be available at



Lessons Learned from Funding the International Open Science Prize

August 1, 2017

Major funding bodies reflect on developing and implementing the Open Science Prize, a novel approach for funding international open science, in an essay publishing August 1 in the open access journal PLOS Biology. The essay by Elizabeth Kittrie of The National Institutes of Health, Philip Bourne of the University of Virginia, and colleagues from the Wellcome Trust, in partnership with the Howard Hughes Medical Institute, provides a series of reflections, addressing topics such as partnership development and sustainability, and the challenges of multiple funders pursuing joint global health technology initiatives.

The Open Science Prize (launched in October 2015) was a global competition designed to encourage innovative solutions in public health and biomedicine using open digital content. Prize competitions have received increased attention within the U.S. federal government with the passage of the America COMPETES Reauthorization Act of 2010. The PLOS Biology essay points to the importance of aligning policies, procedures and regulations of the various funding agencies when engaging in joint prize competitions. The collaboration for this competition led to the inclusion of international participants, a larger purse for winners, and a shared responsibility for the costs of running the challenge.

The grand prize winner, “Real-time Evolutionary Tracking for Pathogen Surveillance and Epidemiological Investigation,” created a prototype that uses real-time visualization and viral genome data to track the spread of global pathogens such as Zika and Ebola. Prototypes developed by the six finalists can be accessed here:

“The Open Science Prize model accelerates team science and exemplifies the force multiplier effect that can occur when funding agencies join forces around a common goal,” said Dr. Patti Flatley Brennan, NIH Interim Associate Director for Data Science, and director, National Library of Medicine. “At times of declining budgets, leveraging resources through partnerships can be a key strategy for promoting innovation.”

Citation: Kittrie E, Atienza AA, Kiley R, Carr D, MacFarlane A, Pai V, et al. (2017) Developing international open science collaborations: Funder reflections on the Open Science Prize. PLoS Biol 15(8): e2002617.

About PLOS Biology
PLOS Biology is an open-access, peer-reviewed journal published by PLOS, featuring research articles of exceptional significance, originality, and relevance in all areas of biology. For more information visit, or follow @PLOSBiology on Twitter.



NIH Pi Day Celebration: New Date, New Location!

May 18, 2017

The National Institutes of Health will hold its third annual Pi Day Celebration on the NIH Main Campus on Pi Day 2.0, Thursday, May 18, 2017. As you may recall, the original Pi Day festivities, on 3.14, were postponed due to inclement weather. The goal of the NIH Pi Day Celebration is to increase awareness across the biomedical science community of the role that the quantitative sciences play in biomedical science.

Pi Day @ NIH will feature the following activities:

  • 10:00 AM - 11:00 AM: Data Center Tours, Building 12A, Room 1100 (REGISTRATION REQUIRED)
  • 11:00 AM - 12:00 PM: PiCo Lightning Talks by NIH staff, Masur Auditorium, Clinical Center (Building 10), first floor
  • 12:00 PM - 1:00 PM: Poster/Demo Session and Networking, FAES Terrace, Clinical Center (Building 10), first floor
  • 1:00 PM - 2:00 PM: NIH Data Science Distinguished Seminar Series, Lecture by Simons Professor of Mathematics at MIT, Dr. Bonnie Berger, “The Mathematics of Biomedical Data Science,” Masur Auditorium, Clinical Center (Building 10), first floor
  • 2:30 PM - 4:30 PM: Research Reproducibility Workshop, NIH Library Training Room, Clinical Center (Building 10), first floor, near the South Entrance (REGISTRATION REQUIRED)

NIH campus map:

For more information about the day's events, visit the NIH Pi Day website:       

Pi Day is celebrated on March 14th (3/14) around the world and, under normal circumstances, at NIH! The Greek letter Pi is the symbol used in mathematics to represent a constant—the ratio of the circumference of a circle to its diameter—which is approximately 3.14159.

Pi has been calculated to over one trillion digits beyond its decimal point. As an irrational and transcendental number, it will continue infinitely without repetition or pattern. While only a handful of digits are needed for typical calculations, Pi’s infinite nature makes it a fun challenge to memorize, and to computationally calculate more and more digits.
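For readers curious how more digits can be computed, here is a minimal Python sketch using Machin's formula, pi = 16·arctan(1/5) − 4·arctan(1/239), evaluated with exact integer arithmetic. It is one classical method chosen for brevity, not the technique behind trillion-digit computations, and the function names and the choice of ten guard digits are assumptions made for this example.

```python
def arctan_inverse(x: int, unity: int) -> int:
    """Return arctan(1/x) scaled by `unity`, using the alternating Taylor series."""
    total = term = unity // x          # first term: 1/x
    x_squared = x * x
    divisor = 1
    sign = -1
    while term:
        term //= x_squared             # next odd power of 1/x
        divisor += 2                   # 3, 5, 7, ...
        total += sign * (term // divisor)
        sign = -sign
    return total


def pi_digits(n: int) -> str:
    """Return pi truncated (not rounded) to n decimal places, for n >= 1."""
    guard = 10                         # extra digits that absorb truncation error
    unity = 10 ** (n + guard)
    # Machin's formula: pi = 4 * (4*arctan(1/5) - arctan(1/239))
    pi_scaled = 4 * (4 * arctan_inverse(5, unity) - arctan_inverse(239, unity))
    digits = str(pi_scaled // 10 ** guard)
    return digits[0] + "." + digits[1:]


print(pi_digits(5))    # 3.14159
print(pi_digits(50))
```

Because Python integers have arbitrary precision, asking for more digits only requires a larger `n`; the running time grows with the number of digits requested.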

NIH Pi Day is a joint effort of multiple ICs, including CIT, NCI, NHGRI, and NLM, and the NIH Office of the Director, including the NIH Library and the Office of Intramural Research. Additional support is provided by the Foundation for Advanced Education in the Sciences (FAES) and the NIH Bioinformatics Special Interest Group.

For all events, sign language interpreters can be provided. Individuals with disabilities who need reasonable accommodation to participate in this event should contact Jacqueline Roberts, 301-594-6747, or the Federal Relay, 800-877-8339.



Open Science Prize Announces Grand Prize Winner

February 28, 2017

Congratulations to the development team led by Trevor Bedford, PhD, of the Fred Hutchinson Cancer Research Center, Seattle, and Richard Neher, PhD, of Biozentrum at the University of Basel, Switzerland, winners of the grand prize of $230,000. Also participating were students from the laboratories of the team leaders; the University of Washington, Seattle; and the University of Auckland in New Zealand.

Read the official NIH press release.

A prototype online platform that uses real-time visualization and viral genome data to track the spread of global pathogens such as Zika and Ebola is the grand prize winner of the Open Science Prize. The international team competition is an initiative by the National Institutes of Health, in collaboration with the Wellcome Trust and the Howard Hughes Medical Institute (HHMI). The winning team, Real-time Evolutionary Tracking for Pathogen Surveillance and Epidemiological Investigation, created its prototype to pool data from researchers across the globe, perform rapid phylogenetic analysis, and post the results on the platform’s website.

Genome sequences of viral pathogens provide hugely valuable insight into the spread of an epidemic, but to be useful, samples have to be collected and analyzed, and the results disseminated, in near real time. The statistical analyses behind the platform can be conducted in minutes and can reveal patterns of geographic spread and timings of introduction events, and can connect cases to aid contact tracing efforts. The phylogenetic analyses are posted on the website as interactive, easy-to-understand visualizations. The team hopes that the platform will be of great use to researchers, public health officials, and members of the public who want a snapshot of an epidemic.

The winning team placed first out of three top finalists, selected from a pool of 96 multinational, interdisciplinary teams including 450 innovators from 45 countries. This award is the culmination of a year-long process that included development and demonstration of working prototypes and multiple stages of rigorous review by panels of expert Open Science advisors and judges from the Wellcome Trust and NIH. All stages of the competition emphasized open science in both form and process, including public input for the award gathered via a global public voting portal. During the public voting phase, which narrowed the six finalists to three top contenders, nearly 4,000 online votes were cast by members of the public from a total of 76 countries on all six inhabited continents.

The Open Science Prize is a global competition designed to foster innovative solutions in public health and biomedicine using open digital content. As increasing amounts of data are produced by scientists around the world and made openly available through publicly accessible repositories, a major obstacle to making full use of this health information will be the lack of tools, platforms, and services that enable the sharing and synthesizing of disparate data sources. Development in this area is essential to turning diverse types of health data into usable and actionable knowledge.

The prize, which was launched in October 2015, aims to forge new international collaborations that bring together open science innovators to develop services and tools of benefit to the global research community. All six finalist teams were considered exemplary by the funders and are to be commended for their tenacity in developing creative approaches to applying publicly accessible data to solve complex biomedical and public health challenges. The topics spanned the breadth of biomedical and public health challenges, ranging from understanding the genetic basis of rare diseases and mapping the human brain to enhancing the sharing of clinical trial information. As evidenced by the six Open Science Prize finalists, public health and biomedical solutions are enriched when data are combined from geographically diverse sources. Final prototypes developed by the six finalists can be accessed on the Open Science Prize website.



NLM Director Dr. Patricia Flatley Brennan Appointed NIH Interim Associate Director for Data Science

February 9, 2017

On January 6, 2017, the National Institutes of Health announced that National Library of Medicine Director Patricia Flatley Brennan, RN, PhD, will assume an additional role as NIH Interim Associate Director for Data Science.

The NIH Associate Director for Data Science (ADDS) and team provide input to the overall NIH vision and actions undertaken by each of the 27 Institutes and Centers in support of biomedical research as a digital enterprise. Among other duties, the office oversees the Big Data to Knowledge (BD2K) initiative, stimulating the best developments in the data science community.

This year will see the transition of trans-NIH data science initiatives to NLM, with the operational oversight of the BD2K initiatives being housed within the Common Fund programs in the Division of Program Coordination, Planning and Strategic Initiatives. This change builds on the recommendations of the NLM Working Group Report to the NIH Director and takes concrete steps towards the vision of NLM’s future proclaimed in the Advisory Committee to the NIH Director’s report: that the National Library of Medicine become the “epicenter of data science for the NIH.”

“I believe the future of health and health care rests on data—genomic data, environmental sensor-generated data, electronic health records data, patient-generated data, research collected data,” Dr. Brennan observed. “The data originating from research projects is becoming as important as the answers those research projects are providing.”

“NLM must play a key role in preserving data generated in the course of research, whether conducted by professional scientists or citizen scientists,” she continued. “We know how to purposefully create collections of information and organize them for viewing and use by the public. We can extend this skill set to the curation of research data. We also have the utilities in place to protect the data by making sure only those individuals with permission to access data can actually do so.”

“NLM is well positioned to add these new functions to its research portfolio,” the NLM Director observed. “In this new year and the years to follow, we welcome these exciting opportunities and challenges.”  



Big Data to Knowledge Multi-Council Working Group - January 2017

January 9, 2017

Notice is hereby given of a meeting of the Big Data to Knowledge (BD2K) Multi-Council Working Group.

Name of Working Group:  Big Data to Knowledge Multi-Council Working Group

Date:  January 9, 2017 - Canceled

Place:  Teleconference
This portion of the meeting is open to the public and is being held by teleconference.  This is a listen ONLY meeting.  Please submit any questions or comments via email to the contact person listed below.

Join WebEx Meeting
Meeting number: 627 298 875
Meeting password: 1234
Dial-in: 1-877-668-4493
Open Session:  11:00am - 12:00pm ET

Discussion will review current Big Data to Knowledge (BD2K) activities and newly proposed BD2K initiatives.

  • Roll Call and Introduction
  • Update from the Associate Director for Data Science
  • BD2K All Hands Meeting and Open Data Science Symposium Recap

Closed Session:  12:30pm - 3:00pm ET

Agenda:  Discussion will focus on review of proposed FY17 Funding Plans for BD2K Funding Opportunity Announcements and Administrative Supplements.

Event Contact: 
Individuals who plan to attend and need special assistance, such as sign language interpretation or other reasonable accommodations, should notify Tonya Scott, email:, phone: 301-402-9817.

Federal Register Meeting Announcement:
National Institutes of Health, Office of the Director - Notice of Meeting



Public Voting Determines Three Finalists for the Open Science Prize

January 9, 2017

Public voting for the Open Science Prize is now closed. Thank you to everyone who voted. The 3 prototypes which scored highest and will therefore be going forward to the next stage of review are:

MyGene2: Accelerating Gene Discovery with Radically Open Data Sharing


Real-Time Evolutionary Tracking for Pathogen Surveillance and Epidemiological Investigation

We will now be collecting expert reviews of these three prototypes. We anticipate announcing the Grand Prize winner in early March 2017.

For additional information, contact:

