Open-Access Data and Computational Resources to Address COVID-19

COVID-19 open-access data and computational resources are being provided by federal agencies, including NIH, public consortia, and private entities. These resources are freely available to researchers, and this page will be updated as more information becomes available. 

The Office of Data Science Strategy seeks to provide the research community with links to open-access data, computational, and supporting resources. These resources are being aggregated and posted for scientific and public health interests. Inclusion of a resource on this list does not mean it has been evaluated or endorsed by NIH.

To suggest a new resource, please send an email with the name of the resource, the website, and a short description to datascience@nih.gov.

See Computational Resources
See Supporting Resources

Resource | Resource Description | Data Type | NIH Funded
Amazon Web Services (AWS) data lake for analysis of COVID-19 data

A centralized repository of up-to-date and curated datasets on or related to the spread and characteristics of SARS-CoV-2 and COVID-19. Information on how to best use this resource is available.

dashboards and visualization tools, epidemiology, healthcare resources, literature
Broad Terra cloud commons for pathogen surveillance

The Broad Terra cloud workspace for best practices with COVID-19 genomics data

  • Raw COVID-19 sequencing data from the NCBI Sequence Read Archive (SRA)
  • Workflows for genome assembly, quality control, metagenomic classification, and aggregate statistics
  • A Jupyter Notebook that produces quality control plots for workflow output
genomics
CAS COVID-19 antiviral candidate compounds dataset

This open-source dataset of nearly 50,000 chemical substances includes antiviral drugs and related compounds that are structurally similar to known antivirals. It is intended for use in applications including research, data mining, machine learning, and analytics. A COVID-19 Protein Target Thesaurus is also available. CAS is a division of the American Chemical Society.

chemical structure data
CDC COVID-19 Cases, Data, and Surveillance

The CDC is providing a variety of data on COVID-19 in the United States.

dashboards and visualization tools, epidemiology, healthcare resources
China National Center for Bioinformation's 2019 Novel Coronavirus Resource (2019nCoVR)

Maintained by the China National Center for Bioinformation/National Genomics Data Center, 2019nCoVR is a comprehensive resource on COVID-19 that combines up-to-date information on all published sequences, mutation analyses, literature, and more.

dashboards and visualization tools, genomics, literature
ClinicalTrials.gov COVID-19 related studies

View listed clinical studies related to the coronavirus disease (COVID-19). Studies are submitted in a structured format directly by the sponsors and investigators conducting the studies. Submitted study information is generally posted on ClinicalTrials.gov within 2 days after initial submission and site content is updated daily. Full website content is also available through the API.

clinical studies
check mark
Collection of 3D Print Models of SARS-CoV-2 virions and proteins

This collection of files contains information for printing 3D physical models of SARS-CoV-2 proteins and is part of the NIH 3D Print Exchange.

chemical structure data
check mark
CORD-19: COVID-19 Open Research Dataset and AI Challenge

Freely available dataset of 45,000 scholarly articles, including over 33,000 with full text, on COVID-19, SARS-CoV-2, and related coronaviruses. This machine-readable resource is provided to enable the application of natural language processing and other AI techniques.

See the CORD-19 Challenge, developed in partnership with Kaggle. Amazon Web Services has a CORD-19 search website.

Read the accompanying call to action from the White House Office of Science & Technology Policy and learn more about the creation of CORD-19.

literature
Coronavirus3D

This web-based viewer offers 3D visualization and analysis of SARS-CoV-2 protein structures with respect to SARS-CoV-2 mutational patterns.

chemical structure data
check mark
COVID Digital Pathology Resource (COVID-DPR)

The COVID-DPR provides whole slide images of histopathologic samples relevant to COVID-19, including biopsy samples and autopsy specimens. The current focus of the repository includes tissue from the lungs, heart, liver, and kidney. The repository contains examples of H1N1, SARS, and MERS for comparison. 

digital images
check mark
COVID-19 Datasets on The Cancer Imaging Archive (TCIA)

The NCI Cancer Imaging Program (CIP) is utilizing its Cancer Imaging Archive as a resource for making COVID-19 radiology and digitized histopathology patient image sets publicly available.

digital images
check mark
COVID-19 Genome Sequence Dataset on Registry of Open Data on AWS

A centralized sequence repository for all strains of the novel coronavirus (SARS-CoV-2) submitted to the National Center for Biotechnology Information (NCBI). Included are both the original sequences submitted by the principal investigator and SRA-processed sequences that require the SRA Toolkit for analysis.

genomics
check mark
Dimensions COVID-19 publications, datasets, and clinical trials

All Dimensions publications, datasets, and clinical trials related to COVID-19, updated daily. Content is exported from the openly accessible Dimensions application at https://covid-19.dimensions.ai/.

literature
EMBL-EBI's COVID-19 Data Portal

The European Bioinformatics Institute (EMBL-EBI), part of the European Molecular Biology Laboratory, has a COVID-19 Data Portal to facilitate data sharing and analysis and ultimately contribute to the European COVID-19 Data Platform. EMBL-EBI is part of the International Nucleotide Sequence Database Collaboration (INSDC); the National Center for Biotechnology Information (NCBI) is the U.S. partner of the INSDC.

chemical structure data, genomics, literature, RNA-seq and expression counts
European CDC geographic distribution of COVID-19 cases worldwide

The downloadable data file is updated daily and contains the latest available public data on COVID-19. Each row/entry contains the number of new cases reported per day and per country. You may use the data in line with ECDC’s copyright policy.

epidemiology
GenBank Nucleotide Sequences

Provides rapid, open, and unrestricted access to virus nucleotide sequences and is the repository being recommended by NIAID and CDC for investigator and public health submissions. Due to the scale of data indexing, there may be a delay before new submissions are indexed and retrievable with a term-based query. 

genomics
check mark
GenBank Protein Sequences

Provides rapid, open, and unrestricted access to virus conceptually translated protein sequences and is the repository being recommended by NIAID and CDC for investigator and public health submissions. Due to the scale of data indexing, there may be a delay before new submissions are indexed and retrievable with a term-based query. 

genomics
check mark
GEO DataSets

Human transcriptional responses to SARS-CoV-2 infection

RNA-seq and expression counts
check mark
GISAID

International database of hCoV-19 genome sequences and related clinical and epidemiological data

genomics
Google Cloud Platform (GCP) Datasets for COVID-19 Research

GCP is hosting a repository of public datasets and offering free hosting and queries of COVID datasets. Learn more about the free hosting and queries of COVID datasets.

epidemiology, healthcare resources, social sciences
iSearch COVID-19 Portfolio

Comprehensive, expert-curated portfolio of COVID‑19 publications and preprints that includes peer-reviewed articles from PubMed and preprints from medRxiv, bioRxiv, ChemRxiv, and arXiv.

literature
check mark
LitCovid

NLM curated literature hub for COVID-19

literature
check mark
Modeling Infectious Disease Agents Study (MIDAS) online portal for COVID-19

NIGMS-funded modeling research. Public-access data collections with documented metadata.

case studies, dashboards and visualization tools
check mark
NCATS OpenData | COVID-19

NCATS is generating a collection of datasets by screening a panel of SARS-CoV-2-related assays against all approved drugs. These datasets, as well as the assay protocols used to generate them, are being made immediately available to the scientific community on this site as these screens are completed.

bioactivity, chemical structure data, dashboards and visualization tools
check mark
NCBI Virus: SARS-CoV-2 data hub

SARS-CoV-2 focused content from NCBI Virus, including links to related resources. Search, filter, and download the most up-to-date nucleotide and protein sequences from GenBank and RefSeq (taxid 2697049). Generate multiple sequence alignments and phylogenetic trees for sequences of interest. Provides one-click access to the Betacoronavirus BLAST database and relevant literature in PubMed.

genomics
check mark
Nextstrain COVID-19 genetic epidemiology

Open-source SARS-CoV-2 genome data and analytic and visualization tools

genomics
OpenICPSR's COVID-19 Data Repository

The Inter-university Consortium for Political and Social Research (ICPSR) has launched a new repository of data examining the impact of the novel coronavirus global pandemic. This repository is a free, self-publishing option for researchers to share COVID-19 related data.

social sciences
outbreak.info

A resource to aggregate data critical to scientific research during outbreaks of emerging diseases, such as COVID-19

epidemiology
check mark
PubChem

Small molecule compounds, bioactivity data, biological targets, bioassays, chemical substances, patents, and pathways

bioactivity
check mark
PubMed Central (PMC) COVID-19 Initiative

On March 13, 2020, national science and technology advisors from a dozen countries, including the United States, called on publishers to voluntarily make their COVID-19 and coronavirus-related publications, and the available data supporting them, immediately accessible in PMC and other appropriate public repositories to support the ongoing public health emergency response. The articles added to PMC are distributed through the PMC Open Access Subset and are made available in CORD-19.

literature
check mark
RCSB Protein Data Bank COVID-19/SARS-CoV-2 Resources

The RCSB Protein Data Bank is offering access to COVID-19 related PDB structures for research and related images and videos for education. 

chemical structure data
check mark
Reactome

Reactome is a free, open-source, curated and peer-reviewed pathway database. The goal is to provide intuitive bioinformatics tools for the visualization, interpretation and analysis of pathway knowledge to support basic research, genome analysis, modeling, systems biology and education. In response to the COVID-19 pandemic, Reactome is fast-tracking the annotation of human coronavirus infection pathways.

dashboards and visualization tools, genomics
check mark
SARS-CoV-2 Related Structures

A database of carefully validated SARS-CoV-2 protein structures, including many structural models that have been re-refined or re-processed. The resource is updated weekly by the Minor Lab at the University of Virginia as new SARS-CoV-2 structures are deposited in the Protein Data Bank.

chemical structure data
check mark
Sequence Read Archive (SRA)

Provides rapid, open, and unrestricted access to virus nucleotide or metagenomic sequence data and is the repository being recommended by NIAID and CDC for investigator and public health submissions. Due to the scale of data indexing, there may be a delay before new submissions are indexed and retrievable with a term-based query.

genomics
check mark
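
The European CDC entry above describes a daily file in which each row holds the number of new cases reported per day and per country. The sketch below shows one way such rows could be turned into running totals. It is illustrative only: the column names (`date`, `country`, `new_cases`) and the sample values are hypothetical and do not reflect the real file's header or data.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in a one-row-per-day-per-country layout.
# Column names and values are assumptions made for illustration.
SAMPLE = """date,country,new_cases
2020-05-12,Country A,5
2020-05-11,Country A,3
2020-05-11,Country B,10
2020-05-12,Country B,0
"""


def cumulative_cases(csv_text):
    """Return {country: [(date, running_total), ...]} ordered by date."""
    per_country = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        per_country[row["country"]].append((row["date"], int(row["new_cases"])))

    totals = {}
    for country, rows in per_country.items():
        running = 0
        series = []
        # ISO 8601 dates (YYYY-MM-DD) sort chronologically as plain strings.
        for date, new_cases in sorted(rows):
            running += new_cases
            series.append((date, running))
        totals[country] = series
    return totals


if __name__ == "__main__":
    for country, series in cumulative_cases(SAMPLE).items():
        print(country, series)
```

Sorting by date before accumulating keeps the totals correct even when the rows arrive newest-first, as in the sample.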

Computational Resources to Address COVID-19

See Data Resources
See Supporting Resources

Resource | Resource Description | NIH Funded
Atrio

The Atrio software platform offers easy access to large numbers of freely available, high-performing GPU and CPU resources. Contact support for help creating portable application containers that are performance-optimized for these powerful systems.

Betacoronavirus BLAST

BLAST database containing sequences from Betacoronavirus (taxid 694002), including the latest SARS-CoV-2 sequences in GenBank and RefSeq.

check mark
Cloud resources for COVID-19 research

Freely available high-performance computing resources immediately available for COVID-19 research. Provided by Rescale, Google Cloud, and Microsoft Azure.

The COVID-19 High Performance Computing (HPC) Consortium

Computing infrastructure: XSEDE provides the portal, and the list of computing resources is updated regularly. The consortium includes DOE National Laboratories, IBM, NSF, NASA, tech companies, and academic computing centers.
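
The Betacoronavirus BLAST entry above quotes a taxonomy identifier (taxid 694002). As a rough sketch, a taxid like this can also be used as a search term with NCBI's Entrez E-utilities `esearch` endpoint, which is a separate service from the BLAST database itself. The helper name `taxid_search_url` and its defaults are hypothetical, and the code only builds the URL without sending a request.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def taxid_search_url(taxid, db="nuccore", retmax=20):
    """Build an ESearch URL for records under a taxonomy ID (illustrative helper)."""
    # "[Organism:exp]" asks Entrez to include descendant taxa of the given taxid.
    term = f"txid{taxid}[Organism:exp]"
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # 694002 is the Betacoronavirus taxid quoted in this section.
    print(taxid_search_url(694002))
```

Any real use should follow NCBI's usage guidelines for E-utilities, which this sketch does not cover.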


Supporting Resources

See Data Resources
See Computational Resources

Resource | Resource Description | NIH Funded
Data-Against-COVID Team

A group of more than 600 volunteer data scientists, machine learning experts, bioinformaticians, and professional software developers who have joined together to offer their expertise for any data analysis problems that arise in the context of the ongoing coronavirus pandemic.

GenBank/SRA SARS-CoV-2 Sequence Submissions

Quickly and easily submit assembled and unassembled SARS-CoV-2 data with help from NCBI if needed.

check mark
NASEM Standing Committee on Emerging Infectious Diseases and 21st Century Health Threats

This National Academies of Sciences, Engineering, and Medicine (NASEM) standing committee provides rapid expert consultation on data elements and systems design for modeling and decision making for the COVID-19 pandemic.

NIAID Overview of Coronaviruses

Information about coronaviruses, including COVID-19, and resources for researchers

check mark
Research Data Alliance Working Group

Guidelines for data deposition in any common data hub or platform to facilitate data sharing in public health emergencies for scientific research

Schema.org

Schema.org 7.0 includes fast-tracked new vocabulary to assist the global response to the Coronavirus outbreak. Schema.org creates, maintains, and promotes schemas for structured data.

UC Health clinical data warehouse

Data warehouse using Observational Medical Outcomes Partnership standard to integrate patient data across University of California health systems

Viral Annotation DefineR (VADR) Sequence Annotation Tool

NCBI developed a system called Viral Annotation DefineR (VADR) that validates and annotates viral sequences, including SARS-CoV-2.

check mark
Virus Outbreak Data Network (VODAN)

CODATA, RDA, WDS, and GO FAIR have created a Virus Outbreak Data Network (VODAN) to make SARS-CoV-2 virus data FAIR (findable, accessible, interoperable, and reusable) by both humans and machines.

Virus Pathogen Resource (ViPR)

ViPR is an NIAID-funded resource that supports research on viral pathogens in the NIAID Category A-C Priority Pathogen lists and those causing (re)emerging infectious diseases. It provides a dedicated gateway to SARS-CoV-2 data that integrates data from external sources (GenBank, UniProt, Immune Epitope Database, Protein Data Bank), direct submissions, analysis pipelines, and expert curation, and it provides a suite of bioinformatics analysis and visualization tools for virology research.

check mark
Webinar on Sharing, Discovering, and Citing COVID-19 Data and Code in Generalist Repositories

Generalist repositories are supporting the discoverability and reusability of COVID-19 data and associated code in different ways. See the presentations and resources from the webinar held April 24.
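
The Schema.org entry above mentions fast-tracked vocabulary for the coronavirus response. As a minimal sketch of what such structured data might look like, the code below builds a JSON-LD object using the `SpecialAnnouncement` type, assuming that type is part of the vocabulary the entry refers to. The helper name, the example URL, and the announcement text are hypothetical.

```python
import json


def special_announcement(name, text, date_posted, url):
    """Build a JSON-LD dict for a schema.org SpecialAnnouncement (illustrative)."""
    return {
        "@context": "https://schema.org",
        "@type": "SpecialAnnouncement",
        "name": name,
        "text": text,
        "datePosted": date_posted,  # ISO 8601 date string
        "url": url,
    }


if __name__ == "__main__":
    # All values below are placeholders, not a real announcement.
    doc = special_announcement(
        name="Example testing-site update",
        text="Placeholder announcement text.",
        date_posted="2020-05-13",
        url="https://example.org/announcement",
    )
    print(json.dumps(doc, indent=2))
```

The resulting JSON would typically be embedded in a web page so that search engines and other machines can read the announcement.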


This page last reviewed on May 13, 2020