New Request for Information Seeks Public Input on Use of Cloud Resources and New File Formats for Sequence Read Archive Data

Wednesday, May 20, 2020

Submissions Due July 17

The National Institutes of Health’s Office of Data Science Strategy and the National Center for Biotechnology Information (NCBI) at the National Library of Medicine recently issued a Request for Information (RFI), NOT-OD-20-108, seeking public input on how Sequence Read Archive (SRA) data can be formatted and stored to better support the use, exchange, and scientific impact of the data while maintaining a sustainable, cost-effective footprint that can support continued submissions to the archive.

The SRA is one of NIH's largest and most diverse datasets – a broad collection of experimental DNA and RNA sequences that represent genome diversity across the tree of life. The SRA currently contains more than 36 petabytes of data and is continually growing. In 2019, the SRA was copied to the Google Cloud Platform and Amazon Web Services cloud services as part of the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative. The SRA data also remains accessible from NCBI on-premises (on-prem) storage.

NIH is requesting input on the use of SRA data to understand how best to manage this resource in cloud environments so that it remains easy to use in research while costs are controlled as the archive grows. In particular, NIH would like to better understand how the research community currently uses SRA data, how researchers are using or anticipate using cloud computing with SRA data, and which formats of SRA data are most valuable to the research community.

Responses to the RFI should be submitted electronically by July 17, 2020.