The National Institutes of Health clearly recognizes the significant opportunities and challenges presented by big data in biomedical research. Charged by the 2015 report of the Advisory Committee to the NIH Director to become the hub of data science for NIH, the National Library of Medicine is playing a leading role in developing a vision for data science at NIH. In doing so, it is consulting broadly with other NIH Institutes and Centers that are active in using big data, and with the NIH policy groups that will address the policy issues associated with scientific data management and sharing.
In a companion post, I report on the progress NLM is making. In a previous DataScience@NIH blog post (6.22.17), I enumerated some of the accomplishments of this past year’s foray into data science. That foray, as it turns out, is a natural extension of much of the work of the NLM enterprise: the systematic acquisition, curation, storage, and dissemination of the information substrate for science is the essence of what a library does. As the platform for discovery grows to encompass data as a substrate for science, so must the library grow.
Within the health care context, data science is a set of principles and practices that underlie the effective use of data relevant to biomedicine to glean insights and make new discoveries that will improve human health.
From the perspective of the NLM, it’s important to tease apart what I see as the two key aspects of data science: discovery-driving methodologies and advanced data management. NLM can and must contribute to both.
Discovery-driving data science methodologies encompass the range of analytical, visualization, and mathematical strategies that allow scholars to interrogate data sets. Many of these exist within the familiar statistical and epidemiological approaches to analytics, but those approaches are best suited to complete data sets that are generated under controlled conditions, have countable numbers of variables and elements, and can be shown to adhere to the distributional assumptions the approach requires. As our sophistication grows, and indeed as the size, dirtiness, and complexity of the data of interest expand, familiar methodologies fall short of the tools of inquiry needed to support data-driven discovery. Enter new methodologies from mathematics, engineering, operations research, computer science, and visual analytics that are unconstrained by distributional assumptions and have properties that allow robust examination of complex data sets. Many of these methodologies are just emerging now, and will be of considerable value in extracting knowledge from data.
Advanced data management techniques include the policies, data structures, repository designs, curation management, and access control strategies necessary for efficient storage and effective reuse of data as a basis for discovery. Key among the advanced data management strategies is the idea of a data commons: the software stack and hardware that hold large data sets, workspaces, pipelines, and commentary, and that permit safe storage and reuse of those data sets. Curation techniques, including establishing metadata, common terminologies, and cataloging and indexing approaches, must also be developed. Discovery strategies should be established in a way that allows data sets to be located and integrated with a minimum of user burden. Policies that govern access, use, and financial sustainability must be proposed, debated, and adopted.
The NLM will not go it alone. Creating a future of discovery driven by data science requires constant engagement with data generators, data customers, and data financiers. Choices about how and when to curate a new data set rely as much on the original investigator and the questions posed as on the anticipation of future users and as-yet-unforeseen explorations of those data sets. Solutions that will accelerate data-driven discovery rest on a continuous interplay between information and biomedical science, and because the NLM is in essence the platform for this interplay, the challenge of realizing the promise of data science lies right in its bailiwick!