NIH ODSS Search Workshop | Data Science at NIH

NIH ODSS Search Workshop, January 19-20, 2022

The NIH Office of Data Science Strategy (ODSS)-sponsored Search Workshop explored current capacities, gaps, and opportunities for global data search across data ecosystems to enhance data discovery and reuse. Discussions during the two days emphasized the need to establish and enforce clear standards, policies, and requirements to ensure that newly generated and existing data is findable, the first element of the FAIR principles, and amenable to more sophisticated search techniques. Additionally, discussions highlighted the multifaceted search landscape, with complex use cases arising from the diversity and heterogeneity of science. Speakers and attendees agreed that search system design must account for these complexities to prevent researcher fatigue from the effort required to make data searchable. There was an emphasis on the need for interoperability between data platforms, allowing enhanced data sharing for improved research, discovery, and innovation. The discussion and breakout sessions emphasized existing capabilities and opportunities, understanding and reducing bias in assembled data, incorporating emerging search technologies, and recommendations for those involved with these search systems, while aligning to three use case scenarios: data discovery, cohort building, and knowledge searches.

View Full Report

Agenda
Time	Presentation
Day 1 – January 19, 2022
Watch the Day 1 video
11:00 a.m. – 11:20 a.m.	Welcome & Introductions: Dr. Susan Gregurick
Introduction: The introduction focused on current NIH-funded projects on interoperability such as AnVIL and BDC. The NIH is creating incentives for researchers to share data and generate FAIR datasets. Goals of the Workshop: The goal of the workshop was to explore ways to extend current capabilities and opportunities around Search. The workshop aimed to identify up-and-coming transformative Search technologies, help identify recommendations for researchers and users of Search systems, and generate a vision for the future of how Search could be done.

11:00 a.m. – 11:20 a.m.	Search Listening Tour: Dr. Simon Twigger
The Search Listening Tour identified three main categories of use cases for Search: Dataset Discovery, Cohort Building, and Knowledge Retrieval. Each use case has three main aspects: user-focused components, implementation-focused components, and data-focused components. View Presentation

11:20 a.m. – 11:40 a.m.	Keynote Speaker: Danah Boyd
A significant amount of information (scientific or otherwise) is currently behind paywalls. When news is gated behind paywalls, headlines do more work than the content of the article and the risk of misinformation is increased. Search engines have trouble generating valuable results when too little viable information is returned for a query. In these instances, it is easy to manipulate Search results for terms that have not yet gained consensus. It is not possible to just “teach people to search better.” The community needs better methods and needs to pay attention to aspects such as metadata standards to be more responsive to Search needs.

11:50 a.m. – 12:10 p.m.	SET THE STAGE: Dataset Discovery: Dr. Mike Huerta
There is a great need for public data repositories, but it is difficult to find a permanent place to store datasets: hundreds of thousands of datasets need a home each year. The community needs to consider the quality of data repositories and the types of NIH initiatives necessary to improve them, including plans for peer review. There is a need to find and discover datasets housed across a distributed and heterogeneous repository ecosystem. This is doable with tools like DATMM.

12:10 p.m. – 12:30 p.m.	SET THE STAGE: Cohort Discovery: Dr. Anne Deslattes Mays
Rare diseases are difficult to study without a significant coordinated effort across communities and lifespans. There is a great need for institutions to share data, but there are challenges due to protecting privacy. Instead of sharing patient-level data directly, sharing a container to compute summaries of data could make this process easier while respecting privacy.
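The container approach described above can be pictured with a minimal Python sketch. It is an illustration only, not the speaker's implementation: the function names (`summarize_site`, `combine_summaries`), the record fields, and the small-cell suppression threshold of 5 are all assumptions made for this example. Each site runs the same summary function locally and returns only aggregate counts, so patient-level records never leave the institution.

```python
from collections import Counter

MIN_CELL_SIZE = 5  # assumed suppression threshold (hypothetical; set by site policy)


def summarize_site(records, field):
    """Run inside a site's own environment: count records per value of `field`.

    Counts below MIN_CELL_SIZE are dropped so that small groups are not exposed.
    """
    counts = Counter(r[field] for r in records if field in r)
    return {value: n for value, n in counts.items() if n >= MIN_CELL_SIZE}


def combine_summaries(summaries):
    """Run by the coordinating researcher: add up the per-site aggregate counts."""
    total = Counter()
    for summary in summaries:
        total.update(summary)  # Counter.update adds counts rather than replacing them
    return dict(total)


# Hypothetical patient-level records held at two institutions.
site_a = [{"diagnosis": "disease_x"}] * 6 + [{"diagnosis": "disease_y"}] * 2
site_b = [{"diagnosis": "disease_x"}] * 7 + [{"diagnosis": "disease_y"}] * 5

# Only these aggregate dictionaries are shared outside each site.
shared = [summarize_site(site_a, "diagnosis"), summarize_site(site_b, "diagnosis")]
combined = combine_summaries(shared)
print(combined)
```

In this toy run, the two `disease_y` records at the first site fall below the threshold and are withheld, so the combined count for that diagnosis is an undercount. That trade-off between privacy and completeness is especially relevant for rare diseases, where small groups are the norm.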

12:30 p.m. – 12:50 p.m.	SET THE STAGE: Knowledge Retrieval: Dr. Purvesh Khatri
Scientific bias is an issue in Search, as demonstrated through the example of gene annotations. Most annotations are inherited from a very small subset of well-understood genes and their functions, which introduces bias. Traditional approaches to studies tend to control for factors that introduce variance in data, which decreases heterogeneity. This can make it difficult to generalize findings.

12:50 p.m. – 1:10 p.m.	SET THE STAGE: Ethics: Dr. Julia Stoyanovich
There are three facets of equity: Representation, Access, and Outcome. The most valuable Search tools should enhance reliability, accountability, and fairness; MLINSPECT and FairPrep were cited as examples of tools with these aims. Data and models could benefit greatly from added metadata describing their use and limitations.

1:10 p.m. – 1:30 p.m.	SET THE STAGE: Cultural / Social Aspect: Dr. Larry Hunter
Search depends not just on the query, but also on the context. Technical knowledge, domain knowledge, native language, and other contexts need to be considered for effective Search. Different tools are needed at different depths of Search. For simple queries, a string of search terms may suffice. More complicated queries require better ways of capturing context. Queries are also more diverse than simple string searches: for example, the query itself could be an RNAseq/gene expression dataset. How should such queries be handled? Basic methodological research is needed to address biomedical Search needs.

1:30 p.m. – 2:00 p.m.	Break

2:00 p.m. – 2:15 p.m.	CROSS-CUTTING THEMES: Data, Metadata, and Search: Dr. Maryanne Martone
Current Search capabilities are being built on previous Search initiatives and platforms, including the Neuroscience Information Framework (NIF) and the SciCrunch registry. There were many valuable lessons to be learned from these platforms. Rich metadata aids in Search, and it provides origins and important context. Simply cataloging metadata is not sufficient. Semantics are very important, and sometimes we take well-accepted words for granted. Terms like “adult” can be very context dependent. The age at which a human is considered an adult is very different from that of other animals, such as a squirrel. It would benefit research if Search tools could understand these semantics.

2:15 p.m. – 2:30 p.m.	CROSS-CUTTING THEMES: Cutting Edge Technologies for Discovery: Prof. Rick Stevens
Large Language Models (LLMs) could be queried much like databases, opening new avenues for Search. These models can also be tuned based on usage patterns. Computing power needs to scale considerably to meet the needs of large searchable models like LLMs.

2:30 p.m. – 2:45 p.m.	CROSS-CUTTING THEMES: Effective UI/UX: Dr. Jina Huh-Yoo
While there are many theories surrounding the principles of Search, user experience and the context of the query are also important as we design future Search systems.

2:45 p.m. – 3:00 p.m.	CROSS-CUTTING THEMES: Trust in Search: Matt Might
There are important concepts that comprise trust in Search (soundness, completeness, and meaningfulness). Earning the trust of clinicians is critical in providing meaningful tools for Search in medicine.

Day 2 – January 20, 2022
Watch the Day 2 video
11:10 a.m. – 11:30 a.m.	KEYNOTE SPEAKER II: Sir Nigel Shadbolt
Publishing data to the standards required by the government is hard, unspectacular, unrewarding, and often unrecognized work. While the momentum behind such widespread architecture and standards may have slowed, the task is no less important today than it was in the past. There are significant efforts underway by various organizations to maintain infrastructure vital to biomedical research. One example is the work being done by the engineers at BenevolentAI to maintain a knowledge graph of life science. There is still little general adoption or awareness of how to implement FAIR concepts. This was evident during the pandemic response in the UK, where the data that the community thought was enough for modelling turned out to be in poor shape. There is a crucial emphasis on the incentive structures that promote data curation and sharing.

11:30 a.m. – 11:50 a.m.	CURRENT SEARCH LANDSCAPE AT NIH: Dr. Ian Fore
Prior efforts showed it is impossible to confine Search scope to core attributes of datasets. Finding and bringing together cohorts constitutes at least half the need that users have for Search. Search will need more than one solution from the NIH, and the NIH is not well structured for a single top-down approach; aspects of federated Search are therefore required. The community expects the NIH to be the Convener, Provider, and Funder of Search. There is currently a timely and rare opportunity to influence Search through implementation of the new Data Management Policy.
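As a rough illustration of what federated Search means in practice, the sketch below sends one query to several independent repositories and merges the results. It is a toy under stated assumptions, not a description of any NIH system: the `Hit` record, `make_repository`, and `federated_search` are hypothetical names, the repositories are in-memory dictionaries standing in for real repository APIs, and the sketch assumes that repositories share dataset identifiers and that their relevance scores are comparable, which is rarely true without further work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    dataset_id: str  # identifier assumed to be shared across repositories
    title: str
    source: str      # which repository returned the hit
    score: float     # repository-local relevance, assumed to lie in [0, 1]


def make_repository(name, catalog):
    """Build a toy search function over an in-memory catalog of {id: title}."""
    def search(query):
        terms = query.lower().split()
        hits = []
        for dataset_id, title in catalog.items():
            words = title.lower().split()
            matched = sum(1 for term in terms if term in words)
            if matched:
                hits.append(Hit(dataset_id, title, name, matched / len(terms)))
        return hits
    return search


def federated_search(query, repositories):
    """Query every repository and keep the best-scoring hit per dataset."""
    best = {}
    for search in repositories:
        for hit in search(query):
            current = best.get(hit.dataset_id)
            if current is None or hit.score > current.score:
                best[hit.dataset_id] = hit
    # Highest score first; ties broken by identifier for a stable order.
    return sorted(best.values(), key=lambda h: (-h.score, h.dataset_id))


# Hypothetical catalogs; "ds2" is held by both repositories.
repo_a = make_repository("repo_a", {"ds1": "Pediatric asthma cohort",
                                    "ds2": "Adult asthma imaging"})
repo_b = make_repository("repo_b", {"ds2": "Asthma imaging",
                                    "ds3": "Diabetes cohort"})

results = federated_search("asthma cohort", [repo_a, repo_b])
for hit in results:
    print(hit.dataset_id, hit.source, hit.score)
```

Each repository keeps its own index and matching logic; only the thin merge step is shared. This is the design choice that lets federation avoid a single top-down system, at the cost of needing common identifiers and some agreement on how results are ranked.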

12:10 p.m. – 1:40 p.m.	PARALLEL CROSS-POLLINATION BREAKOUT SESSIONS

2:00 p.m. – 2:45 p.m.	REPORTING FROM BREAKOUT SESSIONS
Dataset Discovery: Dr. Deb Agarwal
It is no longer sufficient to just use bespoke datasets: it is necessary to stitch information across multiple datasets. We don’t have the big data providers we used to have, where a high-quality dataset was generated by someone whose task it was to do so. There was a suggestion to identify core datasets and put resources into making “canonical high use datasets.” There was agreement that it is not enough to leave the curation to the community for “free.” Institutions need to help lead the charge, and this requires significant funding. It is important to think about data as a first-class asset, just like publications and grants. It has become clear that when seeking incentives for tasks such as data generation or annotation, there is much to learn from how economists approach similar problems. The community has traditionally focused on making life easier for data providers, but what do data users want? An important question is “How do we encourage citations?” People will cite if there’s value to it. Reducing the cost of citation (auto-generating DOIs, for example) could help. Carrot-and-stick methods (such as the NIH data sharing policy) also need to be considered.

Cohort Discovery: Dr. Adam Resnick
Incentives remain a key barrier to Search because they drive the quality of data, and this links into the policy and regulatory space that frames how query-able a dataset is, where it is located, and within what framework. Incentives also directly intersect with implementation of standards. They repeatedly come up as something needed to empower cohort creation. There are also emerging technologies that change the narrative. Technologies like FHIR begin to press on that opportunity landscape. For cohort creation, there is still a separation between those who submit and provide data and those who want to query and access it. Often these two groups are separated by time.
Cohort creation as a centralized process is not a sustainable long-term framework. A federated setting for cohort creation must be strategically embraced. There needs to be a definition for harmonization, including when it happens. There is also a need to pair the standards of harmonization with origin tracking and tools that empower this. There was caution against a notion that harmonization could be an entirely automated framework. Reality ends up being a two-state solution: the need to invest in curation by human beings and to pair this with automated processes. In genomics we have some of the notions in place for reproducible pipelines. This could be expanded for other fields. Much of the challenge is around text-based data. There is a need for more integration of multimodal data. These data are currently isolated in a hospital system and difficult to aggregate. There is a real opportunity for NIH to invest in this framing.

Knowledge Retrieval: Dr. Anita Crescenzi
The NIH can act as a common good and can also show leadership in providing sustainable funding for knowledge retrieval and augmenting current practices. It will be important to foster an open ecosystem of search systems that can be tailored to communities. We need to consider enabling this ecosystem with open APIs and standardizing around these open APIs rather than the current free-for-all. We need to support a feedback loop between results and Search. A need exists to clarify what is meant by quality and to be aware of the biases that may be contained therein. Contributors should engage in the assessment of quality to prevent the curation from being repeated or lost. Can incentives around data quality exist? There could be an increased role for data curators and ongoing curation as information may rapidly change. “Yesterday's top-quality score can be tomorrow's lowest.”

2:45 p.m. – 3:00 p.m.	CLOSING REMARKS
Dr. Stan Ahalt’s Remarks
We heard a lot of overlap during the workshop. We have a growing understanding of the need to combine data collections for secondary analysis, and we are seeing the value in doing that. It is hard to get the process started, and federation fatigue is a challenge because the effort to do this work is non-trivial. An important Search mantra: “If it isn’t public, for most of us it does not exist” addresses barriers to accessing information which are clearly becoming a problem. The lack of singular standards continues to cause problems for Search. It is hard to measure data maturity. As part of this workshop, we have securely anchored cohort building as a part of Search. Training keeps coming up frequently. Students need more training on these topics early in their undergraduate careers. Provenance can be difficult to determine when trying to recreate day-to-day workflows. Perhaps we need to do some experiments of replacing indexing of data with other options. We need to have “data librarians.” Researchers need more interaction with the curators of repositories.

Dr. Susanna-Assunta Sansone’s Remarks
We heard over these sessions that we need to apply economic thinking to data and its use. It is a powerful concept. Search is powerful, essential, AND lucrative. Several speakers discussed the need for training material for making FAIR concrete because FAIR implementation will not come magically. There is clearly a need for the development of a collaborative framework. More funding models for collaborative frameworks are needed (someone mentioned a “search-a-thon”). We need more data policies, and they must cover the incentive parts of the data. We need to reward compliance and reuse as much as sharing. This means sharing with the public, not just sharing with the industry. We need executable data management policies. We need to expand the landscape of data sources beyond “classical” repositories. We also need to have a feedback loop between the users and the data producers. Ultimately, we all want to trust in the data that we use and reuse, and the quality that comes from the annotations. The workshop helped us understand the plurality of NIH efforts in Search. The expectation is that NIH will lead by example and can move the norms forward. Actionable curation involves an investment in AI.