Using Large Language Models to Understand Immune Cell Communications (NIAID/DAIT)

Project Point of Contact: Dawei Lin, Associate Director for Bioinformatics, DAIT, NIAID

Goals and Objectives: This project will explore whether large language models and related AI-based language models can be applied to understand how messaging works in the immune system. Understanding communication among immune cells will help explain how the immune system responds to foreign pathogens and allergens while protecting the body's own cells. Such knowledge will be critical to developing vaccines and to preventing and treating immune-mediated diseases.

Significance: Language and cell-cell communication share structural parallels that researchers can exploit. Genes, for example, can be thought of as words, and gene-gene interactions as sentences. The order and interaction of genes can determine their meaning, much like the order of words in a sentence. Similarly, molecular pathways function as language structures, allowing cells to communicate with each other.

This similarity provides the foundation for applying Large Language Models (LLMs) to the study of cell-cell communication. LLMs are machine learning models trained on vast amounts of natural language data, and they can capture many of the nuances of human language. By applying LLMs to biological data, researchers can extract meaningful information and gain a deeper understanding of the complex communication between cells.

Applying LLMs to the study of cell-cell communication has significant potential, given the wealth of available data sources and the support of powerful computational resources. This work may lead to transformative advances in data analysis, collaboration across disciplines, and the overall understanding of cell communication.

Description: COVID-19 killed millions of people, and many more suffered symptoms that remain poorly understood. Such tragedies highlight how little is known about the players in the human immune system and how they work and coordinate with each other to defend against foreign pathogens. The complexity of the human immune system is second only to that of the brain. However, the immune system has no central control mechanism. Instead, it depends on cell-cell and cell-environment communication to provide robust and reliable defenses against diverse known and unknown foreign invaders while protecting the body's own cells. Immune cells communicate using similar cytokines, small chemicals, or cell surface receptors, which makes signals ambiguous. However, activating the production of large numbers of killer T cells or antibodies requires precise commands. How the immune system resolves this ambiguity is not yet fully understood.

Large language models have recently advanced rapidly in resolving ambiguity in human languages. For example, PaLM, developed by Google, can perform reasoning tasks and explain why a joke is funny, and ChatGPT from OpenAI can provide coherent text summaries.

Many public databases and data sources provide researchers with the information they need to study cell-cell communication. One such database is ImmPort, which is extensively used and has gene lists that can serve as a starting point for constructing a vocabulary. Other data sources include the Human Immunology Project Consortium (HIPC), which generates multimodal and longitudinal data, and IEDB, which collects epitopes. Additionally, ImmGen supplies mouse data, and resources such as SRA, GEO, dbGaP, molecular pathway databases, and protein-protein interaction databases hold diverse information that can be used to study various aspects of cell-cell communication.

The recent success of using the Transformer, a core technology of LLMs, to facilitate annotation of biological messages is an exciting development. The approach has the potential to reduce biases and improve interpretability and scalability. It also allows researchers to integrate multiple data sources to detect weak signals and novel connections, which can lead to new insights into cell-cell communication.
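
As a rough illustration of the "genes as words" idea, the sketch below builds a small vocabulary from a gene list and encodes one cell's expression profile as a token sequence. Everything here is an assumption made for illustration: the five-gene list is a hypothetical stand-in for a list that might be drawn from ImmPort, the expression values are toy numbers, and ordering genes by expression level is only one possible way to form a "sentence". The proposal itself does not specify a tokenization scheme.

```python
from typing import Dict, List

# Hypothetical starter gene list (illustrative only; a real vocabulary
# might be seeded from gene lists such as those available in ImmPort).
GENE_LIST = ["CD4", "CD8A", "IFNG", "IL2", "TNF"]
SPECIAL_TOKENS = ["<pad>", "<unk>"]


def build_vocab(genes: List[str]) -> Dict[str, int]:
    """Map special tokens and alphabetically sorted gene symbols to integer IDs."""
    tokens = SPECIAL_TOKENS + sorted(set(genes))
    return {token: index for index, token in enumerate(tokens)}


def cell_to_sentence(expression: Dict[str, float], min_expr: float = 0.0) -> List[str]:
    """Turn one cell's expression profile into a gene "sentence".

    Assumed encoding: keep genes with expression above min_expr and order
    them from highest to lowest expression (ties broken alphabetically).
    """
    expressed = [(gene, value) for gene, value in expression.items() if value > min_expr]
    expressed.sort(key=lambda pair: (-pair[1], pair[0]))
    return [gene for gene, _ in expressed]


def encode(sentence: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Convert gene symbols to IDs, truncating or padding to max_len."""
    ids = [vocab.get(gene, vocab["<unk>"]) for gene in sentence[:max_len]]
    ids += [vocab["<pad>"]] * (max_len - len(ids))
    return ids


vocab = build_vocab(GENE_LIST)

# Toy expression values for a single cell; "XYZ1" is a made-up symbol
# that is deliberately absent from the vocabulary.
toy_cell = {"IFNG": 8.5, "CD8A": 5.0, "TNF": 2.0, "CD4": 0.0, "XYZ1": 1.0}

sentence = cell_to_sentence(toy_cell)
token_ids = encode(sentence, vocab, max_len=6)
print(sentence)
print(token_ids)
```

Fixed-length integer sequences of this kind are the usual input format for Transformer models. Whether expression-ranked ordering is the right "grammar" for immune cell communication is an open question for the project.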

Data set(s) involved: ImmPort, IEDB, SRA, GEO

Anticipated outcomes of the project: Tools to predict immune responses.

Required skills of the DATA Scholar: An AI expert with experience developing language models and tools.

Expected/preferred length of DATA Scholar appointment: 2 years.

Expected/preferred time effort commitment of the DATA Scholar: Part time (50-99%)

Remote work preference: Hybrid preferred.

ICO support: NIAID can provide computing support, office space, and a strong mentorship team with relevant expertise and experience. The mentorship team will include a primary mentor at NIH as well as systems immunologists, NLP experts, and language model experts.

Additional activities: ImmPort program and HIPC (Human Immunology Project Consortium)

Career or professional development opportunities: Careers in the healthcare and pharmaceutical industries, academia and research, and policy development and consulting.

To apply to this or other DATA Scholar positions, please see instructions here: datascience.nih.gov/data-scholars.