Architecting Search Across Petabyte-Scale Genomic Sequence

Institute or Center: National Library of Medicine (NLM)

Project: Architecting Search Across Petabyte-Scale Genomic Sequence

Skills Sought:

Experience with high-throughput computational analysis in a cloud environment
Experience with UNIX/Bash, SQL, Git, Python, R, C++, and/or Perl
Machine learning applications in dimensionality reduction, classification, and clustering
Preferred: Familiarity with bioinformatic and genomic workflows
- Sequence data modeling and compression approaches
- Sequence assembly and de Bruijn graphs
- k-mer analysis

About the position: NLM seeks a DATA Scholar to develop new sequence search methods that support identification of both known and as-yet undescribed sequences in the Sequence Read Archive (SRA), the NIH’s largest publicly available repository of high-throughput sequence data, which was recently moved to the cloud.

The DATA Scholar will deliver useful, working, sustainable, scalable search methodologies to the bioinformatics community. The tools will enable search strategies against petabyte-scale, cloud-based SRA data and support identification of genetic patterns associated with biological, environmental and/or clinical attributes. The Scholar will also be asked to communicate findings in workshops, scientific conferences, and publications.

About the work: Cloud computing offers powerful search and discovery opportunities. For the first time, the SRA corpus of more than 36 PB is available for computational search. This creates exciting discovery opportunities, and the Scholar’s work will pioneer detection of novel microbial species and genes and support a deeper biological understanding of genomic variation, gene expression, and functional genomics, as well as bioinformatics methods development. Search algorithms for SRA will be a significant positive driver for biological discovery and high-impact health outcomes. See BLAST for an example.

The recent replication of SRA content into commercial cloud platforms now offers the possibility of developing methods to execute sequence-based searches, including those that involve machine learning or other artificial intelligence approaches, across the entire corpus of SRA data. Ultimately, these methods should enable the correlation of sequence patterns across samples with biological and environmental conditions.

Datasets involved: SRA is a publicly accessible, comprehensive, petabyte-scale archive of sequence data from across the tree of life: viruses, archaea, bacteria, and eukaryotes, including humans. Data in the SRA are generated, deposited, and used by researchers around the world doing wide-ranging types of basic biological and clinical research. Data in SRA are typically used to support reproduction of published results or for mining of new or alternate information from existing datasets.

Why this project matters: Using new sequence search strategies against the corpus of SRA data will stimulate novel approaches to advance data analysis and accelerate biological discoveries. These tools will allow researchers to continue to use NLM’s massive, publicly accessible datasets for biomedical discovery and data-powered health.

Work location: Bethesda, MD

Work environment: The Scholar will work onsite at NIH in the Information Engineering Branch (IEB) of NLM’s National Center for Biotechnology Information (NCBI). The Scholar will interact primarily with the NCBI chief architect, IEB branch chief, and IEB senior leaders. The Scholar will partner with one or more mentors and teams (typically 5-7 members) whose work is aligned with the specific SRA content and datatypes for which the Scholar develops methods. The Scholar will work directly with scientists and researchers who have expertise in bioinformatics, computational biology, statistics, computer science, genetics, and biology, providing an opportunity to apply expertise and integrate efforts across several fields.

To apply to this or other DATA Scholar positions, please see instructions here: datascience.nih.gov/data-scholars-2021.