Five Petabytes of Sequence Read Archive Data Now in the Cloud

Wednesday, September 25, 2019

The Sequence Read Archive (SRA) is the largest publicly available repository of raw, next-generation sequence data, and half of it is now available in the cloud.

The National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) recently moved the five petabytes of public SRA data to the cloud with support from the National Institutes of Health (NIH) Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative. These data include a variety of genomes, gene expression data, and more. Plans are underway to move the other half of the SRA data, which is controlled-access human genomic data.

Having this high-throughput sequence data publicly available in the cloud marks the first time in history that researchers can compute across the entire 5-petabyte collection. With this move, NIH is accelerating discoveries by providing researchers with access to this data in a flexible and scalable way via the cloud.

NCBI Director Jim Ostell talks more about the significance of this milestone in a guest blog post on the NLM director’s blog titled “Biomedical Discovery through SRA and the Cloud.”