We are witnessing an evolution in health data. New forms of data – such as data from wearable devices, patient-generated summaries, and social media feeds – are rapidly proliferating as healthcare becomes increasingly digitized. These new types of data hold great promise: they have the potential to complement the clinical chart as a means of documenting the human condition, and they can serve as fruitful sources of information for analyses that could improve patient outcomes. Yet, to be useful, these data must first be organized and shared in ways that maximize their true potential.
The term “scruffy data” is drawn from the field of artificial intelligence, which characterizes research solutions as being ‘neat’ (i.e., elegant, clear and provably correct) or ‘scruffy’ (i.e., where intelligence is too computationally intractable to be solved with the sorts of homogeneous systems that neat requirements usually mandate). Biologic data (e.g., genomic data) are generally considered ‘neat,’ whereas patient-generated data, which tend to be heterogeneous and messy, are deemed ‘scruffy.’ Emerging sources of ‘scruffy’ data were one of the foci of last week’s Partnering for Cures Conference (P4C) in Boston, MA. The goal of this conference was to accelerate patient-centered solutions and create meaningful collaboration in areas such as data sharing, translational research, training for young investigators, and a patient-centric approach to healthcare. Among the participants were senior leaders from NIH, including Dr. Patricia Flatley Brennan, Director of the National Library of Medicine and interim Associate Director for Data Science, and Carrie Wolinetz, Acting Chief of Staff and Associate Director for Science Policy at NIH.
It is estimated that over half of the emerging health data sources, including patient-generated data, are available as free text. There are valuable insights in that text, but we do not yet have the tools to analyze and maximize large volumes of free-text data. Leveraging artificial intelligence and machine learning is the next frontier for analyzing the insights embodied in patients’ words. Machine learning capabilities can enable researchers to explore many questions simultaneously, providing a comprehensive interrogation of the problems the biomedical research community hopes to solve. As explained by Dr. Brennan, “Machine learning will enable us to move beyond the testing of a limited number of hypotheses to ‘exploring the all of things.’ This opens up new vistas for analysis and understanding. It means we can go from studies that focus on an ‘n’ of one to an ‘n’ of everything.”
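To make the idea concrete, here is a deliberately toy sketch (all patient narratives and term lists are invented for illustration) of what it means to interrogate a whole corpus of free-text entries at once rather than test one hypothesis against one record:

```python
from collections import Counter
import re

# Hypothetical patient-generated free-text entries (invented for illustration).
narratives = [
    "Severe fatigue after the new medication, plus mild nausea.",
    "Fatigue improved this week; sleep tracker shows 7 hours nightly.",
    "Nausea gone, but fatigue persists on higher dose.",
]

# A minimal stopword list, just for this sketch.
STOPWORDS = {"the", "this", "but", "on", "after", "plus", "shows", "hours"}

def term_frequencies(texts):
    """Count word occurrences across ALL narratives at once -- a toy
    stand-in for exploring 'the all of things' rather than testing a
    single hypothesis per study."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts

freqs = term_frequencies(narratives)
# Frequent terms ("fatigue", "nausea") surface as candidate signals
# worth a closer look across the whole population of narratives.
```

Real free-text analysis would of course use clinical NLP and machine learning rather than word counts; the sketch only illustrates the shift from examining one record to querying everything at once.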
Another theme of the panel discussion on ‘scruffy’ data was the use of health information for medical decision-making by patients and providers. The National Library of Medicine, which hosts over 300 databases of biomedical information, plays a critical role in making health data available to clinicians, researchers and the public. One such source of data is the ClinVar database, a freely accessible, public archive of the relationships among human variations and phenotypes. We heard from conference attendees about how this database can serve as a powerful tool in the interpretation of gene variants. We also heard concerns regarding the continuity of personal health data. For instance, patients can view health information contained in the ClinVar database, but cannot directly re-engage this resource for follow-up information when data are missing or incomplete, as only research laboratories or clinical testing labs can submit or update reports. One salient example came from an audience member whose daughter had a genetic condition. When the family looked up the variant, they found the record of a patient with a similar genomic footprint whose age was listed simply as “28.” But the database did not say whether that individual was 28 years old, 28 weeks old, or 28 days old. Knowing the answer would have important implications for the family’s and their provider’s chosen course of treatment. Yet to engage the system, patients have to go through third parties, which can be time-consuming and is wholly dependent on the goodwill and record-keeping of the data submitters.
To make health data usable to their fullest extent, they must be gathered, cleaned and organized into a form that can be readily consumed by others. However, preparing data for analysis can be time-consuming and expensive. It is estimated that between 50 and 80 percent of the effort in data management is spent on curation. Patient-generated data present particular challenges given the range of terminology sets and code profiles, and the lack of agreement on which standards should be used.
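The ambiguous “28” in the earlier anecdote is a tiny instance of exactly this curation burden. A minimal sketch, with entirely invented records and field names, of what such cleaning looks like in practice:

```python
# Hypothetical raw patient-generated records: the same field recorded
# with inconsistent (or missing) units, as often happens with scruffy data.
raw_records = [
    {"patient": "A", "age": "28"},        # 28 what? years, weeks, days?
    {"patient": "B", "age": "28 weeks"},
    {"patient": "C", "age": "3 days"},
]

UNIT_TO_DAYS = {"years": 365, "weeks": 7, "days": 1}

def normalize_age(value):
    """Convert an age string to days; flag entries lacking an explicit unit."""
    parts = value.split()
    if len(parts) == 2 and parts[1] in UNIT_TO_DAYS:
        return int(parts[0]) * UNIT_TO_DAYS[parts[1]], True
    return None, False  # ambiguous: unit missing, needs human curation

curated, needs_review = [], []
for rec in raw_records:
    days, ok = normalize_age(rec["age"])
    if ok:
        curated.append({**rec, "age_days": days})
    else:
        needs_review.append(rec)
# Two records are machine-normalizable; patient "A" still needs a human.
```

The point of the sketch is the `needs_review` pile: much of the 50–80 percent curation effort goes into exactly those entries that no simple rule can resolve.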
Speakers and audience members at the P4C conference shared many exciting new ideas for how to make “scruffy data” more useful and usable, such as: the use of annotation tools allowing database users to comment directly on the data, the creation of federated registries that could allow for the linking of disparate sources of data across distinct registries, and the use of unique patient identifiers to provide a common element that could tie sources of data to a single patient. Also discussed was the need to address patient research and privacy protections in light of concerns related to patient re-identification through advanced data analytics. These are some of the issues we will be examining through DataScience@NIH as we explore the new data frontier that is emerging.
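The third of those ideas – a unique patient identifier as the common element tying data sources together – can be sketched in a few lines. Everything here (registry contents, field names, identifiers) is invented for illustration:

```python
# Hypothetical sketch: two disparate registries whose records share
# a unique patient identifier (all names and data invented).
registry_a = [
    {"patient_id": "P001", "variant": "BRCA1 c.68_69delAG"},
    {"patient_id": "P002", "variant": "CFTR F508del"},
]
registry_b = [
    {"patient_id": "P001", "symptom_notes": "fatigue, reported via wearable"},
]

def link_by_identifier(*registries):
    """Merge records that share a patient identifier across registries,
    so each patient's data from every source lands in one combined record."""
    linked = {}
    for registry in registries:
        for record in registry:
            pid = record["patient_id"]
            linked.setdefault(pid, {}).update(record)
    return linked

linked = link_by_identifier(registry_a, registry_b)
# P001 now carries both the lab-reported variant and the
# patient-generated symptom notes in a single linked record.
```

A real federated system would keep the data in place and link on demand, with the re-identification and privacy safeguards the conference discussion called for; the sketch shows only the core join that a shared identifier makes possible.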
The questions moving forward for NIH and the biomedical community will be how to maximize the value of these rapidly proliferating forms of “scruffy” data and how to leverage them to their fullest potential to advance biomedical research and patient outcomes. To accomplish this, we need to work closely with the patients who are generating the data, the researchers who are analyzing the data, and the clinicians who are using the data for decision-making. As data-intensive research initiatives evolve at NIH (e.g., the All of Us Research Program and the Cancer Moonshot), where patient-generated data will be more fully integrated into data collection, we look forward to updating you on new developments.
About the Author:
Elizabeth Kittrie is a Strategic Advisor for Data and Open Science at the National Library of Medicine (NLM), where she is involved in developing the NLM long-range strategic plan and leading data science and open science initiatives. Prior to joining the NLM, she served as a Senior Advisor to the Associate Director for Data Science at the National Institutes of Health, where she led open innovation efforts including the Open Science Prize, a collaborative partnership between the NIH, Wellcome Trust and Howard Hughes Medical Institute. Kittrie has also served as Senior Advisor to the Chief Technology Officer of the U.S. Department of Health and Human Services, where she led open government activities across HHS and coordinated the Department’s efforts to develop a common Public Access Policy. Prior to joining HHS, Kittrie served as the first Associate Director for the Department of Biomedical Informatics at Arizona State University.