By Phil Bourne, Associate Director for Data Science
The Office of Data Science at NIH has been in existence now for about 22 months. Its main extramural activity – the Big Data to Knowledge (BD2K) initiative – has settled down, and researchers are beginning to describe and publish research that we anticipate will be groundbreaking. It has been a good year. Here I muse on what has been accomplished in 2015 and what comes next.
The Big Data to Knowledge (BD2K) All-hands Meeting was held at the NIH campus in Bethesda from November 10-12, 2015, and provided a good litmus test of our progress in the year since the first awards were made. With over 400 participants, 133 posters, and an appearance by DJ Patil, the Chief Data Scientist of the US Government, there was a consensus that something special was happening in the world of data-driven biomedical research. Former head of the European Bioinformatics Institute (EBI), Prof. Dame Janet Thornton, who came to see what all the fuss was about, summed it up this way:
"BD2K has already changed the landscape of biomedical research in the USA. The All-hands meeting captured the excitement and change in culture that is happening across biomedical science, with the realization that sharing data lies at the heart of biomedical research today and that establishing the international infrastructure to do so is critical. Great science too!!!"
While there were many individual developments presented at the BD2K All-hands, here we focus on the common themes that have come to define the biomedical digital ecosystem we are trying to establish, one that underlies and binds individual efforts.
Not surprisingly, a number of research developments are being driven by larger datasets than were typically used in the past. Mobility data from tens of thousands of individuals and large-scale proteomics data sets are just two examples of data driving new research outcomes. Real progress comes from the ability to access, use, and integrate large and small data sets in ways not previously seen. For example, accurate prediction of health outcomes forms the basis of targeted, tailored interventions. Such utilization of data requires descriptions of the data from which some structure can be derived, as well as new types of analytics, all of which were on display. Perhaps most rewarding was a strong sense of collaboration among BD2K participants to solve outstanding healthcare problems where data and software sharing are key. The formation of working groups to develop APIs, a desire to participate in data and software indexing efforts, and a rush to define metadata templates are three indications of a community in the making.
Aside from funding these research efforts, we have placed particular emphasis on workforce development, engaging new communities, and working with policy makers to further drive the analytics that we believe will characterize future biomedical discoveries. An Innovation Lab over the summer, a joint initiative with NSF, brought together computer scientists, statisticians, and biomedical scientists to work on outstanding biomedical problems. The projects proposed scored relatively highly in the peer-reviewed competition that followed, and more labs are to follow. A workshop with those working in digital media led to a funding opportunity announcement (FOA) for crowdsourcing solutions to outstanding problems. Engagement also involved national and international partners, as exemplified by the Open Science Prize, a joint initiative with the Wellcome Trust and the Howard Hughes Medical Institute, to drive new uses of open content – data, software, and knowledge.
A Distinguished Lecture Series presented a data-driven vision of the future, both broadly in translational science (Eric Lander) and within the neurosciences (Christof Koch and Emery Brown). A Frontiers Lecture Series allowed us to drill down into such topics as machine learning (Andrew Moore) and data citation (Martin Fenner).
Closer to home, we sponsored several NIH events, including Pi Day, Software Carpentry workshops to train the trainers, and a Common Data Elements (CDE) workshop to reconcile how CDEs are proposed and indexed across the NIH.
None of this would be possible without an amazing staff of only 8 full-timers and contributions from many staff across all Institutes and Centers of NIH. The team saw several changes this year:
Leigh Finnegan left us to start medical school at the University of Pennsylvania;
Tonya Scott joined us to take up the many tasks undertaken by Leigh;
Beth Russell, a AAAS fellow, left us to take up a position at NSF;
Lisa Dunnebacke joined us to be our communications specialist taking over from Beth;
Sonynka Ngosso joined us as our new scientific program analyst and also oversees the various committee management needs;
Audie Atienza came to us to launch the Open Science Prize and then went off to the private sector; and
Vivien Bonazzi, Michelle Dunn, Angel Horton, Mark Guyer, and Jennie Larkin continued their good work.
2016 will, we believe, see the first newsworthy developments from BD2K, as well as a furthering of our training and workforce development efforts with added emphasis on diversity. It will also be a time when we revisit each of the 27 Institutes and Centers of NIH to determine their data-related pain points and see what we might do collectively. A pain point we already identified in 2015 was the sustainability of the data-related enterprise. A program management working group, one of a number of working groups, began an inventory of major resources and an exploration of how we might manage such resources more efficiently in the future. Beyond sustainability is the need to use the output of the biomedical digital enterprise more effectively. In 2015 we spoke of the FAIR principles – the ability to Find, Access, Interoperate, and Reuse the various forms of digital output – data, standards, software, courseware, narrative, etc. We proposed the Commons concept as a virtual shared space in which such digital output could be sustained and made FAIR. A number of awards were made to populate and evaluate the Commons as a series of pilots. 2016 will see these in full swing, with a focus towards the end of the year on plans to evaluate the Commons.
A hallmark of our efforts is to be as transparent and open as possible. As such, we welcome your input and sincerely hope that 2016 is a good year for you, however you contribute to data science and its impact on healthcare. Onwards.