Distributed Genomic Analysis Workflows and Services on NIH Cloud-Based Data Resources (NHGRI/OGDS)

Project Point of Contact: Valentina Di Francesco, OGDS Director, Chief Data Science Strategist

Goals and Objectives: The key goal of this project is to develop and implement interoperable, privacy-preserving and secure genomic analysis workflows and services that utilize the combination of controlled-access datasets from the National Human Genome Research Institute (NHGRI) AnVIL and other NIH cloud-based data resources.

Significance:

To demonstrate that distributed computing across NIH cloud-based platforms can be performed in a user-friendly, secure, privacy-preserving and trustworthy manner.

To derive best practices from this effort and share them with other cloud-based resources that contribute to the NIH FAIR data ecosystem.

Description: Researchers who wish to analyze combined controlled-access datasets from different NIH cloud-based data resources face significant challenges. The challenges are not only technical (e.g., datasets too large to download to a local system, or a lack of cloud-optimized analysis and visualization tools) but also administrative, owing to the diverse data governance and stewardship processes these resources use to control access. Developing and adopting practical, efficient, privacy-preserving computing technologies in a federated data ecosystem is crucial to support researchers who need to increase the statistical power of their analyses and the diversity of their population cohorts.

The key goal of this project is to develop, implement and test interoperable, privacy-preserving and secure genomic analysis workflows and services that combine datasets, initially from the AnVIL and All of Us (AoU) platforms. Potential projects include a joint imputation service that uses AnVIL- and AoU-hosted datasets as reference panels, and structural variant analysis workflows on long sequence reads using the new Telomere-to-Telomere reference genomes.

These activities will create opportunities for the AnVIL team to extend this work by developing and implementing additional interoperable genomic analysis workflows and resources with other major NIH data generation and sharing programs, such as TOPMed and Bridge2AI, or with NCBI controlled-access data resources.

A DATA Scholar will represent the AnVIL team in developing the technical requirements for the projects and overseeing them, working in collaboration with scientists and engineers from AnVIL and other platforms, as well as NHGRI staff.

The DATA Scholar will also be responsible for project management activities, including setting up and monitoring the technical timelines and milestones and preparing technical reports and best practices.

The DATA Scholar will also contribute expertise and dedicated time to AnVIL's collaborative activities with the partners of the NIH Cloud Platform Interoperability (NCPI) effort, working with the NCPI ODSS staff.

Data set(s) involved: The initial projects will use AnVIL and AoU datasets.

Additional projects will likely include controlled-access data from TOPMed, Bridge2AI, and NCBI data resources. These additional projects may become embedded in the NCPI “interoperability projects,” which are expected to launch in late 2023 or later.

Anticipated outcomes of the project:

An imputation service will be implemented in a FISMA system with strong security controls, accessible to researchers on the Terra platform, using either Google Cloud Platform or Microsoft Azure as the cloud service provider.
The structural variant analysis workflows will be shared with the community.
Best practices for federated, privacy-preserving computing will be shared with ODSS, NCPI partners, and other NIH cloud-based resources.
Publications.

Required skills of the DATA Scholar: Cloud computing and genomic data analysis. Familiarity with privacy-preserving technologies and with the policies that govern controlled-access human data is desired.

Expected/preferred length of DATA Scholar appointment: 2 years.

Expected/preferred time effort commitment of the DATA Scholar: Full time (100%)

Remote work preference: Hybrid preferred.

ICO support: The DATA Scholar will have dedicated office space at the NHGRI Rockledge site, including standard office equipment (laptop, 2 monitors, etc.).

Additional activities: As a member of the NHGRI Office of Genomic Data Science, the DATA Scholar will be engaged in various genomic data science-related activities within both the extramural and intramural programs. In particular, the DATA Scholar will be a fully active member of the NHGRI AnVIL team.

Career or professional development opportunities: At the end of the appointment, the DATA Scholar may have opportunities to pursue either programmatic work supporting cloud-based data sharing resources at various NIH ICOs, or related technical work at NCBI, CIT, or as a staff scientist within the NIH intramural research program.

To apply to this or other DATA Scholar positions, please see instructions here: datascience.nih.gov/data-scholars.