The National Library of Medicine supports discovery. We help investigators disseminate their findings, and we help others see what has been done and build on those results. And, analyzing the library collection itself—the journal articles on a topic or the sequences in our gene databases—can yield its own discoveries as researchers look beyond the specifics to discern patterns or trends in the whole.
In this era of data-driven discovery, however, we need to do more.
First, we need a PubMed-equivalent for data. While the name “PubData” doesn’t do it for me, we need those basic functions.
After all, data-driven discovery begins with discovering the data. We must make that as easy as possible.
Just as PubMed provides citations to articles in selected journals, we need to compile a common catalog of biomedical data sets. And, as the NLM Literature Selection Technical Review Committee brings together experts to review journals and assess their quality to identify what should be indexed in PubMed, we need an experienced group to identify and select data sets based on predetermined criteria. Those criteria, in turn, will help set the standard for quality data.
Some PubMed citations link to the full-text article. That full text might be in PubMed Central—which means the article is stored here—but most articles are not, so we simply link to them where they live, whether that’s a publisher’s site or another library’s repository. Similarly, while some data sets might be deposited in NIH-hosted repositories, most won’t need to be. We can link to them instead.
PubMed offers other value-added services—from standardized metadata to suggestions of similar articles to related data in other databases. The data side of the house will need similar built-in services: links to articles that used a data set, explanations of how a data set was enriched or modified, metadata that fully describe the origins and content of the data.
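To make the idea of such a catalog entry concrete, here is a minimal sketch of what one record might hold. Every field name below is an illustrative assumption, not an existing NLM schema: a stable identifier, the data set's origins and content, a link to wherever the data actually live, and the cross-links to enrichments and to articles that used the data.

```python
from dataclasses import dataclass, field

# Hypothetical catalog record for a biomedical data set.
# All field names are illustrative assumptions, not an NLM schema.
@dataclass
class DataSetRecord:
    accession: str                    # catalog identifier, analogous to a PMID
    title: str
    origin: str                       # who produced the data, and how
    description: str                  # what the data contain
    repository_url: str               # where the data actually live
    derived_from: list = field(default_factory=list)  # parent data sets, if enriched or modified
    cited_by: list = field(default_factory=list)      # articles that used this data set

# Example entry; all values are made up for illustration.
record = DataSetRecord(
    accession="DS0000001",
    title="Example cohort sequencing data",
    origin="Hypothetical university sequencing core, 2018",
    description="Whole-genome sequences for a 500-person cohort",
    repository_url="https://example.org/repository/DS0000001",
    cited_by=["PMID:00000000"],
)
print(record.accession, len(record.cited_by))
```

Note that the record stores only a `repository_url`, not the data themselves, matching the link-don't-host model described above.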
But, that is only the beginning.
Data-driven discovery also relies on proven methods to investigate the data, create predictive or explanatory models, and conduct a range of operations. So we need to think about how to create a library of models: both those based on statistics and those drawn from operations research and optimization.
We’re just beginning to consider what a library of models might look like. Many of the services we use for curating the literature, such as a common vocabulary and a standard way of attaching metadata, will be necessary. We will want to know a model’s class, its provenance, and its intended use. It will also likely be useful to tie a model to the data sets investigated with it and to the articles derived from its use.
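A model-library record might mirror the data-set catalog entry. The sketch below is purely illustrative; the field names and the two-way links to data sets and articles are assumptions about what such a record could hold, not an established schema:

```python
# Hypothetical model-library record, mirroring the data-set catalog entry.
# Field names and values are illustrative assumptions only.
model_record = {
    "model_id": "MOD0000001",
    "model_class": "statistical",        # vs. operations research / optimization
    "provenance": "Hypothetical lab, fitted 2019 on cohort data",
    "intended_use": "Predicting 30-day hospital readmission risk",
    "data_sets": ["DS0000001"],          # data sets investigated with this model
    "articles": ["PMID:00000000"],       # articles derived from this model's use
}

# The cross-links let a reader walk from a model to its data and back.
print(model_record["model_class"], model_record["data_sets"])
```

The design choice worth noting is the symmetry: the same identifiers that index the data catalog appear in the model record, so models, data sets, and articles form one navigable graph rather than three separate silos.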
I’m sure there’s much more. What else would you include?
We’re at the beginning. What do we need to make this work?