It is the time of year to reflect on how far the Office of Data Science at NIH has come and to provide a sense of where we will be going in 2015, all in the spirit of transparency.
This is the first blog on our official nihdatascience.wordpress.com site and is the beginning of a more coordinated and push-oriented communication strategy, which will be used by our staff as well as guests. The blog pebourne.wordpress.com will continue as my personal site for reflections on data science.
Phil Bourne, December 31, 2014.
It has been a demanding and rewarding ten months since I arrived at NIH and we established Data Science @ NIH. I will focus here not on the overall vision, which can be found in my last post, but rather some of the successes and challenges from 2014 and what you can expect in 2015.
Perhaps our greatest success this past year is establishing a team of smart and energetic individuals to execute on the vision for Data Science @ NIH. Let me introduce them (in alphabetical order):
- Vivien Bonazzi PhD leads our work in the Commons and our software initiatives and has extensive experience working with the public and private sectors.
- Michelle Dunn PhD leads our training, outreach and diversity efforts and advises on behalf of the mathematical and statistical communities.
- Leigh Finnegan is a program analyst who is responsible for documenting and analyzing our activities.
- Mark Guyer PhD (part time) comes to us with extensive experience with running consortia on behalf of the National Human Genome Research Institute (NHGRI) and is leading our Big Data to Knowledge (BD2K) consortia efforts.
- Angel Horton assists us with all our administrative needs.
- George Komatsoulis PhD has extensive contractual experience and is establishing the business model for the Commons. George works partly with us on behalf of the National Center for Biotechnology Information (NCBI).
- Jennie Larkin PhD is the go-to person for all our activities. She advises me on all aspects of running an intramural and extramural program across the 27 institutes and centers that comprise the NIH.
- Beth Russell PhD is a AAAS Fellow working on outreach and communications, including our web presence.
- Biomedical Informatics Specialist – position open (be in touch if you have an interest).
While each person has specific responsibilities and skills, we try to be agile and all contribute to major activities, for example, sustainability, international cooperation, interagency cooperation, standards and much more.
The vast majority of our resources are expended on extramural activities (i.e., grants of one type or another to external investigators). In 2014 we awarded $32M and, subject to available funds, will make up to $80M of awards in 2015 as BD2K ramps up.
In 2014 those funds were expended primarily on 12 Centers of Excellence for Big Data Computing, a Data Discovery Index Coordination Consortium (DDICC) and educational resources and courses (details here). The first meeting of awardees took place on November 3-4, 2014, where it was made clear that, beyond the work of individual awardees in furthering the value of data science in biomedical research and healthcare, they are expected to form a consortium that begins to build an ecosystem in support of digital research. An analogy we are using for the consortium is the one that developed around the human genome project. However, we are also aware that the human genome project had a tangible outcome whereas data science is open ended. Nevertheless, that level of community building, and hence cooperation and sharing of data, software, methods etc., is something to aspire to. We will be making awards in 2015 that further the development of the consortium.
One focus for 2015 will be the development of software in support of four main areas of data science – data compression, data wrangling, visualization and provenance. We will also focus on an award to develop a software discovery index – a means to find the right open source software for a given task, akin to finding appropriate data sets and thus designed to be synergistic with the DDICC.
Another focus will be on support for standards development. Many standards exist already, as do resources that catalog and index such standards. The goal is to complement these various efforts through a National Standards Information Resource (NSIR) as a one-stop shop with pointers to existing information on standards.
This year we also launched our concept for the Commons, which is an experiment aimed at achieving sustainability, productivity and reproducibility. We will be launching this experiment early in the New Year with several pilot projects. A key part of the Commons is the business model by which it operates. In 2015 we hope to experiment with a credit-based model for assigning compute resources, which we will evaluate in terms of cost-benefit relative to a more direct awards system. A feature of this model is public-private partnerships (PPPs). The objective of the Commons, the DDICC and many of our other activities is to achieve a FAIR model of biomedical research: Find the data, software etc. you need; Access these research objects; Interoperate with them; and Reuse them. We will be making awards in 2015 that can develop and test these basic tenets of data-driven biomedical research.
We recognize that progress happens where top down (defined by the funders) meets bottom up (defined by the various scientific stakeholders). Thus in 2014 we reached out to many communities – scientific societies, domains typically less engaged with the NIH (e.g., mathematicians, statisticians, computer scientists, game developers, the private sector), new communities (e.g., the Global Alliance for Genomics and Health, the Research Data Alliance, the National Data Service) and potential international partners (e.g., ELIXIR) – and began to see how we might better interact to improve healthcare. In 2015 workshops and new funding calls can be expected to result from these interactions.
Scientific data knows no regional, national or international boundaries, yet is often funded and hence managed as if it did. Maximizing the value of scientific data requires communication across funding agencies, and in 2014 we reached out to many of our fellow agencies, notably NSF, NIST, NITRD, DARPA, NOAA, Wellcome Trust, NHMRC (Aus.), DFG (Germany) and MRC (UK). NIH is also part of the larger Department of Health and Human Services (HHS), and we have reached out to other agencies within HHS, notably FDA, CDC and ONC. Joint initiatives can be expected in 2015.
We work closely with the NIH Office of Science Policy (OSP) on various policies relating to data science. In 2014 a more extensive genomic data sharing policy was announced, and the extension to JATS that supports data citation (the precursor to a policy) was completed. 2015 will likely see a revision to the Common Rule defining the protection of human research subjects as it relates to information availability and policies for using human subjects data in the cloud. We are also working on revisions to NIH data sharing policies to further access to data generated by publicly funded research.
In 2014 we expanded how our actions are governed to be as follows:
- The BD2K Executive Committee (EC), comprising NIH program staff for all 27 institutes and centers, contributes to both the strategy and running of the BD2K program.
- Recommendations of the BD2K EC are discussed with the Scientific Data Council (SDC), an internal body of high level NIH staff and IC directors responsible for setting policy and strategy.
- Funding plans and strategy as defined by the SDC and BD2K EC are discussed with the Multi-Council Working Group (MCWG) comprising a single council member from the majority of the 27 ICs that comprise NIH.
- The SDC and MCWG are chaired by the Associate Director for Data Science (ADDS) when not in conflict.
- The ADDS reports to the Director of NIH, Dr. Francis Collins.
In 2015 we will appoint an external advisory group to advise the MCWG, the SDC, and the ADDS on important issues surrounding data science. A governance model has also been established to oversee the BD2K consortium, which will be discussed in a future post.
Training the biomedical workforce was a focus of 2014 and will continue to be going forward. At the current time the supply of skilled workers in biomedical data science does not meet the demand and is not likely to do so in the foreseeable future. Our training efforts to date can be summarized as follows:
- Awards to individuals for supplemental training in data science.
- Awards to individuals to train in biomedical data science as a major.
- Awards to develop programs and courses in biomedical data science.
- Intramural training initiatives for NIH personnel.
- A funding call for a center to coordinate training activities.
These programs will be expanded in 2015 and we expect workshops to be conducted to further refine how we fund biomedical data science training.
Overall it was a rewarding year, but it is just the beginning of addressing a need that will surely grow as biomedical research continues to migrate from an observational science to one where research and healthcare are increasingly analytical and data driven. For us this is something to look forward to.