Driving Discovery Through Data

What Kinds of Libraries Are Needed to Support Data-driven Discovery?
Patti Brennan / 05.18.17

The National Library of Medicine supports discovery. We help investigators disseminate their findings, and we help others see what has been done and build on those results. And, analyzing the library collection itself—the journal articles on a topic or the sequences in our gene databases—can yield its own discoveries as researchers look beyond the specifics to discern patterns or trends in the whole.

In this era of data-driven discovery, however, we need to do more.

First, we need a PubMed-equivalent for data. While the name “PubData” doesn’t do it for me, we need those basic functions.

After all, data-driven discovery begins with discovering the data. We must make that as easy as possible.

Just as PubMed provides citations to articles in selected journals, we need to compile a common catalog of biomedical data sets. And, as the NLM Literature Selection Technical Review Committee brings together experts to review journals and assess their quality to identify what should be indexed in PubMed, we need an experienced group to identify and select data sets based on predetermined criteria. Those criteria, in turn, will help set the standard for quality data.

Some PubMed citations link to the full-text article. That full-text might be in PubMed Central—which means the article is stored here—but most are not, so we simply link to them where they are, whether that’s a publisher’s site or another library’s repository. Similarly, while some data sets might be deposited in NIH-hosted repositories, most won’t need to be. We can link to them instead.

PubMed offers other value-added services—from standardized metadata to suggestions of similar articles to related data in other databases. The data side of the house will need similar built-in services: links to articles that used a data set, explanations of how a data set was enriched or modified, metadata that fully describe the origins and content of the data.
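To make the idea a little more concrete, here is a minimal sketch in Python of what one entry in such a data catalog might hold. Everything in it is an assumption made for illustration: the `DatasetRecord` class, its field names, the `find_datasets` helper, and the example.org addresses are hypothetical and do not describe any existing NLM system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetRecord:
    """One hypothetical catalog entry describing a biomedical data set."""
    identifier: str
    title: str
    location_url: str                 # where the data live; the catalog links, it need not store
    hosted_by_catalog: bool = False   # True only if the data are deposited with the catalog itself
    origin: str = ""                  # who collected the data, and in what context
    content_description: str = ""     # what the data set contains
    modifications: List[str] = field(default_factory=list)       # how the data were enriched or modified
    citing_article_ids: List[str] = field(default_factory=list)  # articles that used the data set


def find_datasets(catalog: List[DatasetRecord], keyword: str) -> List[DatasetRecord]:
    """Return the records whose title or description mentions the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        record for record in catalog
        if needle in record.title.lower() or needle in record.content_description.lower()
    ]


# Illustrative entries only; none of these data sets or addresses are real.
example_catalog = [
    DatasetRecord(
        identifier="ds-001",
        title="Example cardiovascular trial data",
        location_url="https://example.org/data/ds-001",
        origin="Collected during a hypothetical clinical trial",
        content_description="Blood pressure measurements",
        citing_article_ids=["article-123"],
    ),
    DatasetRecord(
        identifier="ds-002",
        title="Example gene sequence collection",
        location_url="https://example.org/data/ds-002",
        hosted_by_catalog=True,
    ),
]

for record in find_datasets(example_catalog, "cardiovascular"):
    print(record.identifier, record.location_url)
```

The point of the sketch is the split between the record and the data: the catalog entry carries the description, the provenance, and the links, while the data themselves can stay wherever they already live.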

But, that is only the beginning.

Data-driven discovery also relies upon proven methods to investigate the data, create predictive or explanatory models, and conduct a range of operations, so we need to think about how to create a library of models, both those based on statistics and those drawn from operations research and optimization.

We’re just beginning to consider what a library of models might look like. Many of the services we use for curating the literature, such as a common vocabulary and a standard way of attaching metadata, will be necessary. We will want to know the class of modeling tools and a model’s provenance and intended use. It will also likely be useful to tie a model to the data sets investigated with it and to the articles derived from that model’s use.
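As one possible starting point, a model library entry could be sketched in Python along these lines. The `ModelRecord` class, its fields, the `models_for_dataset` helper, and the identifiers are all hypothetical, invented here purely to illustrate the kinds of links described above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelRecord:
    """One hypothetical entry in a library of models."""
    identifier: str
    name: str
    model_class: str        # class of modeling tool, e.g. "statistical" or "optimization"
    provenance: str         # who built the model, and how
    intended_use: str       # what the model was designed to do
    dataset_ids: List[str] = field(default_factory=list)   # data sets investigated with the model
    article_ids: List[str] = field(default_factory=list)   # articles derived from the model's use


def models_for_dataset(library: List[ModelRecord], dataset_id: str) -> List[ModelRecord]:
    """Return every model that has been tied to the given data set."""
    return [model for model in library if dataset_id in model.dataset_ids]


# Illustrative entries only; these models and identifiers are made up.
example_library = [
    ModelRecord(
        identifier="model-01",
        name="Example risk regression",
        model_class="statistical",
        provenance="Built by a hypothetical research group",
        intended_use="Explanatory model of a clinical outcome",
        dataset_ids=["ds-001"],
        article_ids=["article-123"],
    ),
    ModelRecord(
        identifier="model-02",
        name="Example scheduling optimizer",
        model_class="optimization",
        provenance="Built by a hypothetical operations research team",
        intended_use="Allocating clinic appointments",
        dataset_ids=["ds-001", "ds-003"],
    ),
]

for model in models_for_dataset(example_library, "ds-003"):
    print(model.identifier, model.model_class)
```

Keeping the data set and article identifiers on the model record is just one design choice; the same links could equally be stored on the data set side or in a separate table of relationships.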

I’m sure there’s much more. What else would you include?

We’re at the beginning. What do we need to make this work?


Completely agree. Maybe rather than PubData, you could entertain "DataLib" (which suggests both data liberation and data library). In addition to biomedical data sets, there are other related data resources -- particularly data warehouses; federally funded repositories and registries; and common data models. I believe that having some means to increase the research community's awareness of what is already out there, in terms of harmonized data models and data warehouses that have already mapped data to common formats, variable names, and value labels, will accelerate discovery and learning, but in an adjacent way to what you describe in your post. In a perfect world, we'd have a universal data language for research, such that variables are consistently recorded from entity to entity--be it a health system, a cardiovascular clinical trial, or the SEER registries. We have standards and taxonomies, of course, but since they are not universal, much more back-end curation of the type you describe will be integral and necessary to advancement.

Finally, I want to vigorously endorse inclusion of the richest available metadata. Knowing the provenance and context under which data were collected is vitally important to interpretation.

Thanks so much for laying this out and inviting feedback.

Sarah Greene, Executive Director
Health Care Systems Research Network

Researchers go to great lengths to establish new knowledge and then take further steps to turn that knowledge into prose for publication. It is a shame that in the process, they discard the concise statement of that knowledge, because if they kept it, and it could be rendered in some formal way, knowledge bases could easily be constructed.
The MeSH headings and subheadings associated with the publications' citations are a deconstructed view of that knowledge. Some informatics researchers have expended great effort to develop methods for reconstructing that knowledge from the citation or, with natural language processing, the bodies of the papers themselves.
Wouldn't it be nice if we could instead preserve the original knowledge through some additional simple step in the manuscript submission process? Failing that, could the NLM's indexing process evolve to better represent the semantics in the paper? A collection of such statements could form the core of a very nice knowledge base, with assertions backed up not just by papers but by the underlying original data that begat them. I have no doubt that the 10,000 informaticians out there would quickly find ways to exploit that knowledge for all sorts of purposes, such as decision support, hypothesis testing, and education, to name a few.

health care-focused subject primers, NIH specific data management resources, and information on new developments in the intersection between library science and data management.
