NIH Virtual Community Workshop on Jumpstarting Access to Clinical Data for COVID-19 Research

The workshop started with an overview of the challenges of creating a collaborative hub for working on COVID-19 and a presentation of the research questions and use cases for the COVID-19 Clinical Data Hub being designed. Keynote speaker Eric Topol outlined the COVID-19 problem, and four panels discussed topics important to the construction of a data hub: data harmonization, data sources/platforms, governance, and other topics such as security, confidentiality, and sustainability. The day ended with a presentation on diversity concerns in COVID-19 research and a discussion of final thoughts and lessons learned from the day. Overall, the workshop provided insights for building the data hub, including:

  1. Using standardized approaches for harmonizing data.
  2. Addressing concerns around governance, security, and sustainability.
  3. Examining patient diversity through recruitment and enrollment.

A description of the workshop, bios on all the moderators, speakers, and panelists, and a link to the Slack channel are available.

Watch the workshop recording.


On July 29, 2020, the NIH hosted a workshop with more than 1,000 virtual participants to explore the design and construction of a COVID-19 Clinical Data Hub: a loosely federated system, built by researchers for researchers, to quickly make available the data and tools necessary to address pressing COVID-19 research questions.

The workshop was chaired by Warren A. Kibbe, Ph.D., chief data officer at Duke Cancer Institute; Susan Gregurick, Ph.D., associate director for data science and director, Office of Data Science Strategy at NIH; and Patricia Flatley Brennan, Ph.D., director of the National Library of Medicine.

Dr. Kibbe opened the workshop by framing the objectives of the day:

  • to socialize the NIH’s efforts in collecting near-real-time clinical data.
  • to discuss the importance of enabling greater access to this type of data.
  • to collaborate with the community to identify gaps in our approach.

Dr. Brennan then briefly introduced the plethora of NIH initiatives aimed at COVID-19 response, contextualizing the place and role of a Clinical Data Hub as a critical, cross-NIH research capability moving forward.

Keynote on COVID-19

Eric Topol, M.D., founder and director of the Scripps Research Translational Institute, delivered the keynote on the state of the COVID-19 pandemic.

Dr. Topol began his remarks by addressing key research on COVID-19, invoking the “tip of the iceberg” adage to describe what we know so far. Dr. Topol pointed out that this disease results in a “pan-body attack” and cited research reporting that 45% of cases are asymptomatic and that only one in eight cases is identified. Comparing COVID-19 to the 1918 pandemic, Dr. Topol emphasized that our advantage is in having “a much larger toolbox,” with methods like big data, analytics, and artificial intelligence, though these have no value without data. He noted, “that is where this workshop comes in.”

To inspire curiosity in the panelists and attendees for the day, Dr. Topol concluded his session by provocatively pointing out new sources of data, from wearables to mobility to even sewage. He asked attendees to consider throughout the day the different orthogonal layers where data might be created and how those data might be leveraged to expand our understanding of COVID-19.

Panel Sessions

The day began with a panel on Acquiring and Linking Data from Different Clinical Environments, which focused on the challenges and approaches to data harmonization, data source linkage, private versus publicly available data, incorporating social determinants of health, and managing data collision requests for healthcare centers. Salient points made by panelists and moderators included:

  • A common data standard is the key to unlocking evidence-based practice from practice-based evidence, and it enables interoperability, consistency, and comparability.
  • Without comparable data, there can be no analysis, inference, or knowledge generated in a clinical environment.
  • It is critical to generate evidence early and in a common format and to use diagnostics to be self-critical on the quality of data.
  • Local healthcare centers are overburdened with data collisions; to mitigate this burden, requesting organizations need to make requests that are clear, concise, and consolidated.

The next session, Creating and Using Platforms, introduced the NIH’s leading data science environments: the All of Us Research Program, the National Center for Advancing Translational Sciences’ National COVID Cohort Collaborative (N3C), and the National Heart, Lung, and Blood Institute’s BioData Catalyst. This session discussed how these platforms are collecting COVID-19-relevant data and making it available.

  • In response to COVID-19, All of Us launched the COVID Participant Experience survey and a seroprevalence survey on previously collected biospecimens, and is working to capture COVID-19-related clinical data in its ongoing electronic health record (EHR) data stream.
  • As of July 28, N3C had reached agreements with 792 member institutions across 65 clinical hubs to collect COVID-19-related EHR data in a FedRAMP-certified data enclave: a secure, reproducible, transparent, versioned, provenanced, attributed, and shareable analytics environment.
  • BioData Catalyst is positioned to enable long-term research in a variety of areas relevant to COVID-19 including: characterizing its natural history and long-term sequelae, protection of blood supply, clinical trials and specifically therapeutics, Multisystem Inflammatory Syndrome in Children multi-omics immunology, and establishing long-term COVID-19 cohorts.
  • Representatives discussed the critical potential for collaboration both in research and discovery, mentioning approaches such as linking consented data through a hash algorithm or creating a cross-platform query system for metadata.
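The hash-based linking mentioned by the representatives can be illustrated with a minimal sketch. This is not the method used by All of Us, N3C, or BioData Catalyst; it only assumes that two platforms share a secret key and derive a keyed hash (HMAC-SHA256) from normalized identifiers, so that records for the same consented person can be matched without exchanging the identifiers themselves. The function names, the choice of fields (first name, last name, date of birth), the key, and the record IDs are all hypothetical.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Reduce a field to lowercase ASCII letters and digits."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.lower() if ch.isalnum())


def linkage_token(secret_key: bytes, first: str, last: str, dob: str) -> str:
    """Derive a keyed hash from normalized identifiers (hypothetical field choice)."""
    message = "|".join(normalize(part) for part in (first, last, dob))
    return hmac.new(secret_key, message.encode("utf-8"), hashlib.sha256).hexdigest()


def link_records(tokens_a: dict, tokens_b: dict) -> dict:
    """Map each token present on both platforms to the pair of local record IDs."""
    return {t: (tokens_a[t], tokens_b[t]) for t in tokens_a.keys() & tokens_b.keys()}


# Illustrative data only: each platform computes tokens locally and shares
# just the token-to-record-ID mapping, never the raw identifiers.
key = b"example-shared-secret"  # assumption: agreed out of band
platform_a = {linkage_token(key, "José", "O'Neil", "1980-01-05"): "A-001"}
platform_b = {linkage_token(key, " jose ", "ONEIL", "1980/01/05"): "B-417"}
matches = link_records(platform_a, platform_b)
```

A keyed hash is used here rather than a plain hash because names and birth dates come from a small, guessable space; without the secret key, tokens could be reversed by enumerating candidates. Exact-match tokens like these also miss typos and name changes, which a production linkage approach would need to handle.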

The third panel featured a group of speakers who defined and discussed generalizability, reproducibility, and validity in research and how these characteristics can be enhanced to improve the efficiency and outcomes of clinical research.

  • The group defined generalizability as the ability to extrapolate research findings to a population at large, and the panelists noted:
    • Representativeness is lacking in current data collection. Generalizability can and should be enhanced in the future by recruiting populations outside of academic research centers.
    • It is critical to take an iterative approach to study design so that studies can be flexible to fit the rapidly changing situation.
  • Reproducibility was defined as the ability to obtain consistent results when an experiment is repeated, and the panelists noted that it is improved by providing real-time, transparent, accessible, and traceable data to other researchers.
  • External validity was defined as study results holding true in various environments. The panelists suggested building a pipeline that feeds findings back into the healthcare system, improving on the current practice of releasing them only in an academic context.

The final session covered Governance, Data Access Committees, Eligibility, Security and Confidentiality, which are considerations that must be examined when creating a collaborative for access to clinical data for COVID-19 research. Salient points made by the panelists and moderators included:

  • Governance should be considered first to encourage collaboration among the wide variety of existing platforms, avoid downstream interoperability challenges, and avoid discouraging innovation.
  • Design should begin with the queries, that is, the 20 questions that will be asked of the platform, and let them drive the construction of governance, data access, tiers, and so on. Panelists encouraged builders to start small and iterate rather than attempting everything at once.
  • Platform building should start in areas of agreement. One panelist noted that stakeholders should be “finding a way to say yes.”
  • NIH should eventually explore technically innovative solutions like blockchain and a hash algorithm for data sharing and connection.

It is critical to promote sustainability by subsidizing compute and to maintain this spirit of collaboration both within the academic community and externally with the private sector.

Considering Issues of Diversity and Representativeness

Monica Webb Hooper, Ph.D., and Lola Fayanju, M.D., concluded the workshop’s presentations with a session on the state of diversity and representativeness in clinical data collection and research and the resulting societal impact.

COVID-19 was initially referred to as the “Great Equalizer,” but in reality, Dr. Fayanju commented, it is the “Great Revealer” of U.S. health disparities: the observed inter-group differences in health screening, outcomes, and/or treatment, which are rooted in inequity and are avoidable. The evidence is significant, with people of color being disproportionately impacted by the pandemic. Dr. Fayanju asked, “Why do only six states report testing by race and ethnicity? A recent mRNA vaccine candidate showed promising results in their Phase 1 study but with 89% of participants white.”

Dr. Fayanju stated that the downstream impact of this lack of representativeness is dangerous. She noted that without thinking about diversity and disparity, we can create and perpetuate blind spots, which result in poor study design. This results in lower generalizability, a risk of using race/ethnicity as a proxy for genetic variation, and disparate outcomes, ultimately leading to a more costly health care system. Dr. Fayanju added that the demography of the United States is going to continue changing over the next 20 to 30 years, and that it is paramount to pay attention and to change the framework within which we do science.

Dr. Fayanju concluded her presentation by asking attendees to consider: “Is this land really made for you and me?”

Finalize Vision and Next Steps

In the last session of the day, Dr. Brennan and Robert Grossman, Ph.D., made concluding remarks, asked provocative questions reflecting on the day’s dialogue, and delivered a call to action for the greater research community in support of the Clinical Data Hub effort.

Key questions included:

  • How do we maintain integrity at such a rapid speed?
  • How can we make COVID-19 research actionable in clinical care?
  • Are there incentives that can be provided to create a common discovery and data access standpoint, providing a single place for researchers to find the data and tools they need, quickly?
  • Can we eventually transition to a federated system with an agreement on data standards for data access and workflow execution for federated analysis of data?
  • What can we do now to make it most likely that the data we collect in these platforms can be used to understand the long-term sequelae of COVID-19?
  • How do we lower the burden of data collisions to frontline healthcare centers? Should we create a common place to submit data and then do cleaning and aggregation after?
  • How can we ensure that the data collected is appropriately representative of the U.S. population? Can we collect data outside typical academic medical centers to ensure that?

With respect to next steps for achieving a Clinical Data Hub, Drs. Brennan and Grossman agreed that there was enthusiasm for developing tools and strategies that promote data harmonization, even if around multiple data models, and for linking patients between platforms and datasets. Paramount to this effort is reaching agreement on a set of common data elements, an area in which Dr. Brennan noted NIH is scaling its investment. Dr. Grossman further emphasized how critical APIs are to this effort in building an open and accessible ecosystem. He stated, “We make progress when data is brought against other data.” Finally, Dr. Brennan pointed out three topics she felt merit greater conversation in the future:

  • protecting patient rights when we extract data for research.
  • exposing the underlying logic, transparency, code and data.
  • engaging patients and policy makers as partners.

Outcomes and Next Steps

The discussion and presentations can be distilled into a set of priorities that should be considered when moving forward:

  • Acquiring and linking clinical data from different contributors necessitates a shared understanding of a set of minimal common data elements for greater comparability and usability across programs. Attention to data collisions from data contributors is also important.
  • Diversity and inclusivity of data and participants should be maximally encouraged to avoid the pitfalls of creating blind spots in datasets and in subsequent analysis. Datasets should accurately reflect the demography of the United States.
  • Data platforms should be developed to the extent possible to enable greater data discoverability and sharing across systems. NIH should consider making it a priority to enhance interoperability and incentivize shared innovative solutions. NIH should also consider methods to maintain these collaborative efforts within the academic and private sectors.

This page last reviewed on March 23, 2023