Wednesday, February 26, 2020
Office of Data Science Strategy: 2019 Year in Review
2019 was a year of growth and new initiatives for the Office of Data Science Strategy (ODSS). Formed in late 2018 as part of the Division of Program Coordination, Planning, and Strategic Initiatives, the office has brought together dozens of tactical teams from across the NIH’s 27 institutes and centers, along with offices within the Office of the Director, to tackle numerous data science-related challenges.
“Thanks to the tremendous efforts of a few dozen dedicated NIH employees, we have made outstanding strides toward realizing our vision of an integrated biomedical data ecosystem,” said Susan Gregurick, Ph.D., associate director for data science and director of ODSS. “We made great progress in 2019, and I’m looking forward to continuing this progress in 2020.”
The following is a snapshot of some of the biggest data science milestones reached in 2019 through partnerships between ODSS and the institutes, centers, and offices of NIH. To stay up to date on the latest news in data science at NIH, be sure to bookmark our website and follow us on Twitter (@NIHDataScience).
Associate Director for Data Science Appointed in September
Summer Fellows Take on Biomedical and Administrative Data Challenges
Fueling a Fire for Better Tools: Using the FHIR® Standard at NIH
Enhancing the Biomedical Data Repository Landscape with a Generalist Repository Pilot
Looking Ahead to 2020
In 2019 NIH moved more than 30 petabytes of data to the cloud. For perspective, just one petabyte of data is the equivalent of more than 4,000 digital photos per day over the course of one person’s lifetime! The NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative, supported by ODSS and managed with the Center for Information Technology, made moving such a massive amount of data to the cloud possible. This initiative provides NIH and NIH-funded researchers with cost-effective access to cloud storage and advanced computational infrastructure, tools, and services through partnerships with commercial cloud service providers. The STRIDES Initiative currently partners with Google Cloud and Amazon Web Services.
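The petabyte comparison above can be sanity-checked with a little arithmetic. The sketch below is only a back-of-the-envelope illustration: the 80-year lifetime and the decimal definition of a petabyte (10^15 bytes) are assumptions that do not come from the article, and the function name is hypothetical.

```python
# Back-of-the-envelope check of "4,000 digital photos per day over a lifetime".
# Assumptions (not stated in the article): a decimal petabyte and an 80-year lifetime.
PETABYTE_BYTES = 10**15
PHOTOS_PER_DAY = 4_000
LIFETIME_YEARS = 80
DAYS_PER_YEAR = 365


def lifetime_photo_count(photos_per_day: int, years: int) -> int:
    """Total number of photos taken at a fixed daily rate over the given years."""
    return photos_per_day * DAYS_PER_YEAR * years


total_photos = lifetime_photo_count(PHOTOS_PER_DAY, LIFETIME_YEARS)

# Average photo size (in megabytes) at which one petabyte holds exactly that many photos.
implied_photo_size_mb = PETABYTE_BYTES / total_photos / 10**6

print(f"Photos over a lifetime: {total_photos:,}")
print(f"Implied size per photo: {implied_photo_size_mb:.1f} MB")
```

Under these assumptions, one petabyte corresponds to photos of roughly 8.6 MB each, a plausible size for a modern digital photo. Scaling by 30 gives a sense of the total volume moved in 2019.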
Five of the more than 30 petabytes moved to the cloud in 2019 were the public half of the Sequence Read Archive (SRA) data, which includes the genomes, gene expression data, and epigenetic data for humans, pathogens, and nearly all known living things on the planet.
Jim Ostell, Ph.D., director of the National Center for Biotechnology Information at the National Library of Medicine (NLM), described the full impact of this data’s move to the cloud in a blog post, noting that “for the first time in history, it’s now possible for anyone to compute across this entire five-petabyte corpus at will.”
The STRIDES Initiative didn’t just help get NIH data onto the cloud; it also helped researchers learn how to master the computational tools and services available in the cloud. Beyond the training sessions offered to intramural and extramural researchers, the STRIDES Initiative also enabled codeathons to engage citizen scientists. Intramural and extramural researchers alike can expect additional training sessions in 2020. ODSS will also sponsor at least three more codeathons in 2020 that use STRIDES Initiative-supported data, including ones focused on single-cell sequencing and graph genome annotation.
Having served as a senior advisor to ODSS since its formation in 2018, Dr. Gregurick was appointed Associate Director for Data Science (ADDS) and Director of ODSS on Sept. 12, 2019.
In announcing the selection, NIH Director Francis Collins, M.D., Ph.D., stated, “Dr. Gregurick will help lead NIH efforts in coordinating and collaborating with appropriate government agencies, international funders, private organizations, and stakeholders engaged in scientific data generation, management, and analysis…
“She brings substantial experience in computational biology, high performance computing, and bioinformatics to this position. Additionally, she has worked across sectors, in the government at the NIH and the Department of Energy, on trans-government committees, and in academia, which is critical in the convening role that the ADDS plays.”
In 2019, ODSS worked with the NIH Office of Intramural Training and Education to launch the Graduate Data Science Summer Program and with the nonprofit Coding it Forward to bring the first cohort of Civic Digital Fellows to NIH. The two programs were featured in an article titled “Summer Students Tackle Data Challenges” in the Sept. 20, 2019, edition of the NIH Record. Excerpts from the article are below.
This summer brought a unique group of fellows to campus—21 data-savvy students with computational and technology backgrounds were matched with NIH mentors across 14 institutes, centers and offices.
These students impressed their mentors and NIH senior leadership by applying their expertise to hands-on problems such as new challenges in artificial intelligence and data analysis, improving and automating difficult processes and developing new algorithms for classification.
“Our experiences with these summer students emphasized that having more folks with computational and other tech backgrounds in the NIH and biomedical workforces will add great value,” said Dr. Jessica Mazerik, a special assistant to NIH principal deputy director Dr. Lawrence Tabak and workforce advisor to ODSS.
Tabak agrees. After hearing their project presentations, he took a moment at the Civic Digital Fellows’ demo day to give them the ultimate compliment.
“None of you can leave,” Tabak exclaimed. “You’re all hired!”
ODSS is preparing to support a second cohort of both programs in 2020 and is broadening the programs’ reach by including additional institutes, centers, and offices. The students and their mentors are fostering a unique community here at NIH, one that embraces computational approaches and will continue to grow as NIH expands its efforts to recruit new data- and tech-savvy talent.
The Fast Healthcare Interoperability Resources (FHIR®) standard provides a way of exchanging healthcare data from one health information system to another through an application programming interface. Electronic health record systems broadly use the FHIR standard already, and several federal health agencies are promoting the use of the FHIR standard to exchange data. Advancing the FHIR standard for research purposes has the potential to make data more interoperable and reusable.
“Data standards are key to enabling effective data sharing, which is a priority for NIH,” said Teresa Zayas Cabán, Ph.D., FHIR acceleration coordinator at NLM. “Use of the FHIR standard could accelerate the analysis, sharing, and combining of clinical and observational data for research leading to new discoveries and improved health.”
In 2019 ODSS issued a notice (NOT-OD-19-122) encouraging researchers to explore the use of the FHIR standard to capture, integrate, and exchange clinical data for research purposes and to enhance capabilities to share research data. The office also solicited input from the public to better understand researchers’ experiences using the FHIR standard, the extent to which researchers plan to use this standard in the future, what additional tools researchers need, the need for research regarding standards development, and challenges with using the FHIR standard (NOT-OD-19-150).
Two contract awards to advance the development of FHIR-based tools were also made in 2019. ODSS is working with multiple institutes and centers in 2020 to further define how the FHIR standard can successfully be used in research.
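For readers unfamiliar with how the FHIR standard exchanges data through an application programming interface, the following minimal sketch shows the general shape of a FHIR resource and of the REST-style URLs used to read or search for one. The server address `https://fhir.example.org/r4`, the patient details, and the helper function names are hypothetical and chosen only for illustration; the sketch makes no network calls and does not represent any NIH system.

```python
import json
from urllib.parse import urlencode

# Hypothetical FHIR server base URL, used for illustration only.
FHIR_BASE_URL = "https://fhir.example.org/r4"


def build_patient(patient_id: str, family: str, given: str, birth_date: str) -> dict:
    """Return a minimal FHIR Patient resource as a plain dictionary."""
    return {
        "resourceType": "Patient",  # every FHIR resource declares its type
        "id": patient_id,
        "name": [{"family": family, "given": [given]}],
        "birthDate": birth_date,  # ISO 8601 date, YYYY-MM-DD
    }


def read_url(base: str, resource_type: str, resource_id: str) -> str:
    """URL for reading one resource: [base]/[type]/[id]."""
    return f"{base}/{resource_type}/{resource_id}"


def search_url(base: str, resource_type: str, **params: str) -> str:
    """URL for searching a resource type: [base]/[type]?name=value."""
    return f"{base}/{resource_type}?{urlencode(params)}"


# Fictional example data.
patient = build_patient("example-1", "Doe", "Jane", "1980-01-01")

# FHIR servers typically exchange this JSON with the media type application/fhir+json.
payload = json.dumps(patient, indent=2)

patient_read_url = read_url(FHIR_BASE_URL, "Patient", "example-1")
patient_search_url = search_url(FHIR_BASE_URL, "Patient", family="Doe")

print(payload)
print(patient_read_url)
print(patient_search_url)
```

Because every system that follows the standard represents a patient with the same resource structure and URL pattern, a research tool written against one FHIR server can, in principle, be pointed at another with little change.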
In 2019, ODSS explored how data repositories can enhance data sharing. Dr. Gregurick wrote a guest post on the subject, “Enhancing Data Sharing, One Dataset at a Time,” for the NLM director’s blog, “NLM Musings from the Mezzanine,” on Sept. 18, 2019. Excerpts from the blog are below.
The landscape of biomedical data repositories is vast and evolving. Currently, NIH supports many repositories for sharing biomedical data. These data repositories all have a specific focus, either by data type (e.g., sequence data, protein structure, continuous physiological signals) or by biomedical research discipline (e.g., cancer, immunology, or clinical research data associated with a specific NIH institute or center), and often form a nexus of resources for their research communities. These domain-specific, open-access data-sharing repositories, whether funded by NIH or other sources, are good first choices for researchers, and NIH encourages their use.
NIH recently launched an NIH Figshare instance, a short-term pilot project with the generalist repository Figshare. This pilot provides NIH-funded researchers with a generalist repository option for up to 100 GB of data per user. The NIH Figshare instance complies with FAIR principles; supports a wide range of data and file types; captures customized metadata; and provides persistent unique identifiers with the ability to track attention, use, and reuse.
NIH Figshare is just one part of our approach to understanding the role of generalist repositories in making biomedical research data more discoverable. We recognize that making data more FAIR is no small task and certainly not one that we can accomplish on our own.
The NIH Figshare instance was first announced in July 2019 and is expected to last around a year. After the pilot ends, all data will still be available in Figshare. In 2020 ODSS and supporting teams will continue exploring how generalist repositories may make biomedical research data more discoverable.
ODSS is already off to a great start in 2020. In the coming year, ODSS will launch a Data and Technology Advancement (DATA) National Service Scholars program to bring experienced computer and data scientists and engineers to NIH. The office will also initiate new opportunities for software development and a program to support data repositories and knowledgebases. Finally, in partnership with NIH’s Center for Information Technology, ODSS will stand up a Researcher Authentication Service to provide consistent, user-friendly, and secure credentialed access to NIH’s open and controlled data assets and repositories.
For the latest news on what’s happening in data science at NIH, follow us on Twitter: @NIHDataScience.