Sharing Language Models for Biological Sequences in Biomedical Repositories

Institute or Center: National Human Genome Research Institute (NHGRI)

Project: Sharing Language Models for Biological Sequences in Biomedical Repositories

Skills sought:

Expertise in developing unsupervised deep learning models in both PyTorch and TensorFlow.
Expertise in biological sequence-to-function modeling.
Strong written and oral communication, project management, and coordination skills.
Familiarity and experience with DNABERT, ProtTrans, and similar efforts.

About the position: NHGRI seeks a DATA Scholar to work with the larger community to develop metrics, standards, use cases, and similar resources that will help NIH-supported data resources share language models built on sequence data.

Building on a deep understanding of community needs, the DATA Scholar will:
Identify a strategy for how repositories can share pretrained language models.
Build and test prototype applications that evaluate and use existing models for a range of downstream applications, and carry out effective benchmarking, including investigating strategies to speed up model generation, manage model size, or apply multi-modal pretraining.
Present strategies and recommendations to NHGRI staff on the remaining challenges and the best ways to address them.

About the work: NHGRI invests significant effort in creating large data resources for the biomedical community and in developing ML/AI-based approaches to biomedical questions. Given the recent promise of language models for biological sequences, now is the right time to systematically investigate which of these models to share with the larger community in order to enable the largest number of downstream activities. Multiple DNA and protein sequence language models have been published. Which are better, and for what purpose? Can we systematically benchmark these models along multiple axes?

Why this project matters: This activity will help NHGRI create specific initiatives to enhance work in this area. It will also help NHGRI (and wider NIH-supported) repositories make data-driven decisions about which pretrained models to share, which applications built on these pretrained models (e.g., the equivalent of sequence searches) to support, and related questions.

Work Location: Bethesda, MD

Work environment: The DATA Scholar will work under the technical supervision of Dr. Ajay Pillai as a member of the NHGRI Office of Genomic Data Science. The scholar will have the opportunity to be mentored by other NHGRI Program Directors and to interact with scientists in the NIH Intramural Program.

To apply to this or other DATA Scholar positions, please see instructions here: datascience.nih.gov/data-scholars-2022.