LitCoin Natural Language Processing Coalescence (NCATS/ODDPP)

Project Points of Contact: Tyler Beck, Program Director; Christine Colvis, Director

Goals and Objectives:

Combine and coalesce the successful submissions from the LitCoin Natural Language Processing Challenge into a single, powerful system for generating knowledge assertions from biomedical text.

Create a tool to be used in several upcoming projects within the LitCoin program, including the generation of a foundational knowledge graph representing a huge corpus of knowledge in the Helping to End Addiction Long-term (HEAL) data ecosystem.

Significance: This project is critical to advancing the LitCoin program, which aims to greatly improve the findability of the scientific knowledge and data associated with a publication. The program does this by generating computationally accessible knowledge from free-text publications and integrating that knowledge into a knowledge graph. This ambitious program is being piloted in the HEAL Initiative.

Description: The management and mining of scientific knowledge is progressing from a database format for cataloging scientific information to the more powerful knowledge network or knowledge graph format. To prepare for that future, we need to build tools that facilitate the generation of “modules” of knowledge: graph-based representations of the knowledge in individual scientific publications. These modules will serve as the building blocks of large, computationally accessible knowledge graphs, greatly improving the ability to make connections across publications.

For years, NIH has put a strong emphasis on FAIR (Findable, Accessible, Interoperable, and Reusable) data practices, but uptake of these practices has been slow at best. One way to improve the discovery and use of FAIR data is to tie it to computationally accessible knowledge modules and to align this process with the mechanism that researchers already use to report results: scientific publications. Automatically creating high-value knowledge graphs from biomedical publications requires highly sophisticated natural language processing (NLP) systems trained on the right data. Through its LitCoin initiative, NCATS has been working toward a system that will generate knowledge networks during the publication process. The first step toward this longer-term goal was the LitCoin Natural Language Processing Challenge, in which competitors were challenged to build highly accurate NLP systems that generate knowledge assertions from biomedical text. The challenge resulted in eight successful submissions, which are available to NCATS under a broad software license granted by the winning submitters.
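
As a rough illustration of the knowledge "modules" described above, the sketch below represents each publication's assertions as subject-predicate-object triples and merges several modules into one graph. It is only a minimal sketch under stated assumptions: the `Assertion` class, the `merge_modules` function, the publication IDs, and the entity and predicate labels are all hypothetical and are not taken from LitCoin or from any challenge submission.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """One knowledge assertion: subject --predicate--> object (hypothetical schema)."""
    subject: str
    predicate: str
    obj: str


def merge_modules(modules):
    """Merge per-publication modules into a single graph.

    `modules` maps a publication ID to the set of assertions extracted from it.
    The result maps each assertion to the sorted list of publications that
    support it, so an assertion shared by two papers links them together.
    """
    graph = defaultdict(set)
    for pub_id, assertions in modules.items():
        for assertion in assertions:
            graph[assertion].add(pub_id)
    return {assertion: sorted(pubs) for assertion, pubs in graph.items()}


# Hypothetical example data, for illustration only.
modules = {
    "pub-A": {Assertion("gene X", "associated_with", "disease Y")},
    "pub-B": {
        Assertion("gene X", "associated_with", "disease Y"),
        Assertion("drug Z", "treats", "disease Y"),
    },
}
graph = merge_modules(modules)
```

In this toy example the assertion shared by both publications ends up supported by `pub-A` and `pub-B`, which is the kind of cross-publication connection the modules are meant to enable.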

The DATA Scholar’s task will be to evaluate the strengths and weaknesses of the eight NLP software systems made available through the challenge, and then to combine and coalesce them into a single system best suited to generating knowledge assertions from biomedical text across a wide range of subjects within translational research. This system will be used to generate foundational knowledge graphs from previously published work, as well as knowledge assertions from new LitCoin manuscripts that researchers submit for publication.
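
The evaluate-then-combine task described above could start from something like the sketch below. It scores each system's assertions against a gold-standard annotation set, then combines systems by simple majority vote. Majority voting is just one possible combination strategy, chosen here for brevity; it is not stated to be the approach the project will take. The function names, the triple format, and the example data are all hypothetical.

```python
from collections import Counter


def f1_score(predicted, gold):
    """Return (precision, recall, F1) for a predicted set of assertions vs. a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def majority_vote(system_outputs, min_votes=None):
    """Keep assertions proposed by at least `min_votes` systems.

    By default an assertion must be proposed by a strict majority of systems.
    """
    if min_votes is None:
        min_votes = len(system_outputs) // 2 + 1
    # Each system votes at most once per assertion.
    votes = Counter(a for output in system_outputs for a in set(output))
    return {a for a, count in votes.items() if count >= min_votes}


# Hypothetical (subject, predicate, object) triples, for illustration only.
a = ("gene X", "associated_with", "disease Y")
b = ("drug Z", "treats", "disease Y")
c = ("gene X", "treats", "disease Y")  # a spurious assertion

gold = {a, b}
system_1 = {a, b}
system_2 = {a, c}
system_3 = {a, b}

combined = majority_vote([system_1, system_2, system_3])
scores = {
    name: f1_score(output, gold)
    for name, output in [("system_2", system_2), ("combined", combined)]
}
```

Here the spurious assertion from `system_2` is outvoted, so the combined output scores higher than that system alone. A real evaluation would use annotated corpora such as those listed under "Data set(s) involved" and would likely weight systems by their measured strengths rather than counting votes equally.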

Data set(s) involved: NCBI Disease Corpus; other publicly available annotated text datasets as needed.

Anticipated outcomes of the project: An NLP software tool that can generate highly accurate knowledge graphs from the text of biomedical publications.

Required skills of the DATA Scholar:

A deep understanding of machine learning and experience with programming for computational biology projects
Experience with biomedical ontologies and the challenges of fitting disparate data within a single ontological framework
Experience with natural language processing tools such as BioBERT, and with training such systems for specific use cases or data types
Experience bringing software tools from early concept to a production-quality final product, including the implementation of unit tests and other validation strategies
Desire to work in a team environment

Expected/preferred length of DATA Scholar appointment: 2 years.

Expected/preferred time effort commitment of the DATA Scholar: Full time (100%)

Remote work preference: 100% remote allowable

ICO support: The NCATS ODDPP will provide the Scholar with a peer mentor, Tyler Beck, Ph.D., Program Director for LitCoin, who will hold weekly meetings to monitor and support progress on the design of the NLP system. NCATS will provide cloud computing space for the Scholar’s design and implementation work on the NLP system. NCATS and its partners will provide data sets that can be used to optimize the system as it is built.

Additional activities: The Scholar will likely be included in meetings and discussions about the LitCoin project, including the design and implementation of the LitCoin Foundational Knowledge Graph and the LitCoin Pilot Program. They may also be included in committee and working group meetings from various parts of the Biomedical Data Translator consortium, as many concepts discussed in those meetings may be relevant to the design and implementation of LitCoin.

Career or professional development opportunities: The Scholar will have the opportunity to interact with established researchers in the biomedical NLP space, including Dr. Zhiyong Lu’s research group at NCBI. They will also have the opportunity to make connections with contractors who will be building components of the LitCoin system.

To apply to this or other DATA Scholar positions, please see instructions here: datascience.nih.gov/data-scholars.