Driving Discovery Through Data

How is natural language processing data science?
08.31.17

Much of the data people produce is text, and NIH is no exception. Both extramural and intramural scientists use text to describe their research goals and results. Besides proposals, technical reports, and scientific publications, the raw data collected in NIH-supported studies also contain fair amounts of text: notes, including clinical notes, metadata describing collected data, semi-structured observations, and the results of qualitative studies. In addition, patients produce a wealth of information about health issues as they discuss their reactions to treatments, experiences with clinicians, and other health-related matters on websites, in social media, through their interactions with NIH customer services, and in other venues. The list of text-based documents related to health does not stop there: think patient information, drug labels, and practice guidelines, to name a few.

In the context of research, these documents, along with the rest of big data, share one significant problem: volume. There are too many of them for one individual to find all those relevant to a particular clinical or research question. (PubMed alone contains more than 27 million citations and is growing daily.) Even if all relevant documents are found, one cannot read and analyze them all. (Alper et al. determined that one would need an estimated 627.5 hours per month to evaluate articles published in 341 epidemiology-related journals.) And finally, no one individual can acquire and maintain the knowledge needed to comprehend the entirety of the data.

How can natural language processing (NLP) help with this overwhelming situation?

To understand that, we need to set aside the complexity of language understanding and focus on NLP’s data science aspects.

NLP can find relevant documents by applying the more traditional NLP task of question answering to information retrieval.
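
As a rough illustration of what finding relevant documents for a question can look like in code, here is a minimal sketch in Python. It is not the method used by any NLM system: it substitutes a simple lexical technique (TF-IDF weighting with cosine similarity) for real question answering, and the function names and the three toy document titles are invented for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rank_documents(question, documents):
    """Return (score, index) pairs sorted from most to least relevant.

    Scores are cosine similarities between TF-IDF vectors of the
    question and of each document. This is a hypothetical helper,
    not part of any library.
    """
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))

    def vectorize(tokens):
        counts = Counter(tokens)
        # Smoothed IDF keeps every weight positive, even for unseen terms.
        return {
            t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t, c in counts.items()
        }

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q_vec = vectorize(tokenize(question))
    scored = [
        (cosine(q_vec, vectorize(tokens)), i)
        for i, tokens in enumerate(doc_tokens)
    ]
    # Highest score first; ties are broken by original position.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Toy collection: invented titles, used only to exercise the function.
docs = [
    "Aspirin use and stroke risk in adults",
    "A survey of deep learning methods for image retrieval",
    "Patient notes describing headache after a new treatment",
]
ranked = rank_documents("Does aspirin prevent stroke?", docs)
best_score, best_index = ranked[0]
print(docs[best_index])
```

A sketch like this only matches shared words, so it would miss a relevant document that says "cerebrovascular accident" instead of "stroke". Closing that gap is where knowledge resources and learned models come in.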

NLP can summarize content, turning large numbers of documents into digestible chunks. For example, NLP can automate the discovery of research for inclusion in systematic reviews, replacing the laborious process of researchers screening thousands of citations to determine their relevance. NLP can also keep systematic reviews and health care guidelines up to date.

In addition, NLP can support clinical decision-making by integrating and synthesizing symptoms, physical findings, and both positive and negative elements within a patient’s history. 

Together these functions can yield a concise representation of the textual dataset under analysis. From a precise pool of relevant documents to meaningful, actionable summaries to a dashboard of germane findings, NLP has the capacity to speed discovery and support decisions.

In the past, many of the above tasks were accomplished using knowledge-based methods, which resulted in biomedical and clinical NLP being one of the most resource-rich NLP areas. Today, most of the solutions rely on statistical and machine-learning methods, with increasing use of deep learning, which both requires an abundance of textual data and helps make sense of it.

This development parallels the progression of biomedical NLP from art, in which carefully hand-crafted systems supported few, into the realm of data science, where machines, with humans occasionally in the loop, will solve real-world problems using datasets too large to be processed by any one individual.


About the Author:

Dina Demner-Fushman, MD, PhD, leads research in information retrieval and natural language processing; clinical decision support that links evidence (text and images) to patients’ data; clinical and consumer health question answering; and information extraction from clinical text. Dr. Demner-Fushman earned her Doctor of Medicine degree from Kazan State Medical Institute in 1980 and her clinical research doctorate (PhD) in Medical Science from Moscow Medical and Stomatological Institute in 1989. She earned her BA in Computer Science from Hunter College, CUNY, in 2000, and her MS and PhD in Computer Science from the University of Maryland, College Park, in 2003 and 2006, respectively. Dr. Demner-Fushman is a lead investigator in several NLM projects in the areas of Information Extraction for Clinical Decision Support, EMR Database Research and Development, and Image and Text Indexing for Clinical Decision Support and Education. The outgrowths of these projects are the evidence-based decision support system in use at the NIH Clinical Center since 2009; an image retrieval engine, OpenI, launched in 2012; and an automatic service for answering customers' requests that has supported NLM customer services since May 2014. She is the author of more than 120 articles and book chapters in the fields of information retrieval, natural language processing, and biomedical and clinical informatics. She also co-authored a textbook on biomedical natural language processing, published in 2014.
