Much of the data people produce is text, and NIH is no exception. Both extramural and intramural scientists use text to describe their research goals and results. Besides proposals, technical reports, and scientific publications, the raw data collected in NIH-supported studies also contain fair amounts of text: notes, including clinical notes, metadata describing collected data, semi-structured observations, and the results of qualitative studies. In addition, patients produce a wealth of information about health issues as they discuss their reactions to treatments, experiences with clinicians, and other health-related matters on websites, in social media, through their interactions with NIH customer services, and in other venues. The list of text-based documents related to health does not stop there—think patient information, drug labels, and practice guidelines, to name a few.
In the context of research, these documents—along with the rest of big data—share one significant problem: volume. There are too many of them. Too many documents are produced for one individual to find all those relevant to a particular clinical or research question. (PubMed alone contains more than 27 million citations and is growing daily.) Even if all relevant documents are found, no one can read and analyze them all. (Alper et al. estimated that one would need 627.5 hours per month to evaluate the articles published in 341 epidemiology-related journals.) And finally, no one individual can acquire and maintain the knowledge needed to comprehend the entirety of the data.
How can natural language processing (NLP) help this overwhelming situation?
To understand that, we need to set aside the complexity of language understanding and focus on NLP’s data science aspects.
NLP can find relevant documents by bringing the more traditional NLP task of question answering to bear on information retrieval.
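At its simplest, finding relevant documents means ranking a collection against a query. The sketch below illustrates the classic TF-IDF weighting with cosine similarity; the tiny corpus, document names, and query are invented for illustration, and real biomedical retrieval systems are far more sophisticated.

```python
import math
from collections import Counter

# Illustrative toy corpus (hypothetical documents, not real data).
corpus = {
    "doc1": "statin therapy reduces cardiovascular risk in adults",
    "doc2": "clinical notes describe patient reactions to statin treatment",
    "doc3": "social media posts discuss experiences with clinicians",
}

def tokenize(text):
    return text.lower().split()

def cosine(a, b):
    # Cosine similarity between two sparse term-weight vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def rank(query, docs):
    # Document frequency of each term across the collection.
    n = len(docs)
    df = Counter()
    for text in docs.values():
        df.update(set(tokenize(text)))

    def vec(text):
        # TF-IDF weight per term; terms unseen in the corpus are dropped.
        tf = Counter(tokenize(text))
        return {t: c * math.log(n / df[t]) for t, c in tf.items() if df[t]}

    dvecs = {name: vec(text) for name, text in docs.items()}
    qvec = vec(query)
    return sorted(docs, key=lambda name: cosine(qvec, dvecs[name]), reverse=True)
```

For the query "statin treatment reactions", only doc2 contains all three terms, so it ranks first.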
NLP can summarize content, turning large numbers of documents into digestible chunks. For example, NLP can automate the discovery of research for inclusion in systematic reviews, replacing the laborious process of researchers screening thousands of citations to determine their relevance. NLP can also keep systematic reviews and health care guidelines up to date.
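The screening step can be pictured as a first-pass filter over citation titles. The sketch below is a deliberately naive, keyword-based stand-in (the term list and function names are hypothetical): production systems use trained classifiers, but the workflow is the same—machines shrink thousands of citations to a shortlist that humans then review.

```python
# Hypothetical inclusion terms a review team might supply.
INCLUDE_TERMS = {"randomized", "trial", "cohort"}

def screen(citations, include_terms=INCLUDE_TERMS):
    """Split citation titles into a shortlist for human review
    and an excluded pile, based on simple keyword overlap."""
    shortlist, excluded = [], []
    for title in citations:
        words = set(title.lower().replace(",", "").replace(":", "").split())
        (shortlist if words & include_terms else excluded).append(title)
    return shortlist, excluded
```

A title such as "A randomized trial of X" lands on the shortlist, while "Editorial: musings on Y" is filtered out before any reviewer reads it.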
In addition, NLP can support clinical decision-making by integrating and synthesizing symptoms, physical findings, and both positive and negative elements within a patient’s history.
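Distinguishing positive from negative elements is essential here: "denies fever" must not be recorded as fever. The sketch below is a minimal, NegEx-style illustration of that idea—an assumption for exposition, not any actual clinical pipeline—where a finding preceded by a negation cue in the same clause is marked absent.

```python
# Hypothetical cue and finding lists for illustration only.
NEGATION_CUES = ("no ", "denies ", "without ")
FINDINGS = ("fever", "cough", "chest pain")

def extract_findings(note):
    """Mark each mentioned finding 'present' or 'absent',
    treating a negation cue earlier in the same clause as negating it."""
    results = {}
    for clause in note.lower().split(","):
        for finding in FINDINGS:
            if finding in clause:
                prefix = clause.split(finding)[0]
                negated = any(cue in prefix for cue in NEGATION_CUES)
                results[finding] = "absent" if negated else "present"
    return results
```

Given "Patient denies fever, reports cough", the sketch records fever as absent and cough as present—exactly the kind of structured signal a decision-support dashboard can integrate.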
Together these functions can yield a concise representation of the textual dataset under analysis. From a precise pool of relevant documents to meaningful, actionable summaries to a dashboard of germane findings, NLP has the capacity to speed discovery and support decisions.
In the past, many of the above tasks were accomplished using knowledge-based methods, which made biomedical and clinical NLP one of the most resource-rich NLP areas. Today, most solutions rely on statistical and machine-learning methods, with increasing use of deep learning, which both requires and benefits from an abundance of textual data.
This development parallels the progression of biomedical NLP from an art, in which carefully hand-crafted systems served a select few, into the realm of data science, where machines, with humans occasionally in the loop, solve real-world problems using datasets too large for any one individual to process.