STRIDES Initiative: Starting and Using the Cloud - Preparing a Project for the Cloud

Preparing a Project for the Cloud

Understanding the big picture behind your data (e.g., what the data is, who created it, and what it is used for) is essential because it creates context and enables you to make an informed decision about how to move your data to the cloud.

Additionally, preparing your data before the upload process by removing anomalies, de-identifying protected information, and profiling the data is often a cost-effective and time-saving practice.

Identify Key Metrics and Prepare Your Data

Being prepared with some key metrics about your data will help ensure a streamlined process for uploading data to your chosen STRIDES Initiative partner’s cloud platform. The NIH STRIDES Initiative team can help you answer these questions and determine the best STRIDES Initiative partner to use by matching your needs with the available options.

Begin to gather the following information:
Subject: Question
Current size: How large are the datasets currently (MB/GB/TB/PB)?
File count and size distribution: How many individual files are there? What is the range of file sizes?
Data format: What format is the data currently in?
Data upload: Do you know about or want to use data prep tools that can help you prepare your data during the data upload process? Do you want to upload all the data at once? If not, how do you want to split the data into multiple uploads?
Data transfer: Do you plan on moving the data off drives to a tape drive? Do you want to ship the data to the STRIDES Initiative partner?
State of data: What state do you want the data to be in? How do you plan on cleaning the data (e.g., find anomalies, organize the data, etc.) before you transfer it to the cloud?
Storage duration: How long do these datasets need to live on the cloud (in months)?
Estimated growth: What is the estimated rate of growth for new data (e.g., percentage increase per year)?
Backup and disaster recovery requirements: What data needs to be backed up? How many versions or copies of the backup need to be kept? Where do they need to live?
Update frequency: Will the data be updated or replaced frequently? Is the update a total rewrite of the entire dataset or just an incremental update to the existing dataset? What other tasks must happen following an update (e.g., re-indexing of the dataset)?
Access frequency: How frequently is the data accessed (e.g., daily, a few times per week, once a month, once every six months, almost never)?
Desired access latency: Does the data need to be available almost instantly on request, or is it okay if it appears in a few seconds, one to two minutes, or longer?
Data movement patterns: Will you need to download data from the cloud to a local system, or will it remain in the cloud during any subsequent analysis steps?
Current data location: Where does the data geographically live (country and/or general region within the country)? What devices do the data physically live on (e.g., disks, tape drives, etc.)? How many devices are there?
Predominant researcher location: Where are the main researchers geographically located (country and/or general region within the country)?
Data movement: Is there significant data movement to or from different sites that occurs during the regular course of business?
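
Several of the questions above (current size, file count and size distribution, data format, and estimated growth) can be answered with a quick scan of the directory that holds your data. The following is a minimal Python sketch, not an official STRIDES Initiative tool. The function names profile_directory, human_size, and project_size are hypothetical, file extensions are used only as a rough stand-in for data format, sizes use decimal units (1 KB = 1000 bytes), and growth is assumed to compound once per year.

```python
import os
from collections import Counter


def profile_directory(root):
    """Walk `root` and summarize file count, total size, size range, and extensions."""
    sizes = []
    extensions = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symbolic links so linked files are not counted twice
            sizes.append(os.path.getsize(path))
            # Assumption: the extension is only a rough proxy for the data format.
            ext = os.path.splitext(name)[1].lower() or "(none)"
            extensions[ext] += 1
    return {
        "file_count": len(sizes),
        "total_bytes": sum(sizes),
        "min_bytes": min(sizes) if sizes else 0,
        "max_bytes": max(sizes) if sizes else 0,
        "extensions": dict(extensions),
    }


def human_size(num_bytes):
    """Format a byte count using decimal units (assumption: 1 KB = 1000 bytes)."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if size < 1000:
            return f"{size:.1f} {unit}"
        size /= 1000
    return f"{size:.1f} PB"


def project_size(current_bytes, annual_growth_pct, months):
    """Estimate the size after `months`, assuming growth compounds once per year."""
    years = months / 12
    return current_bytes * (1 + annual_growth_pct / 100) ** years
```

Calling profile_directory on the top-level folder of a dataset returns a dictionary whose total_bytes value can be passed to human_size for reporting, or to project_size together with your estimated yearly growth percentage and planned storage duration in months.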
Data management

How are you keeping your data secure and making it findable, accessible, interoperable, and reusable (FAIR)?

Data security

What measures do you have in place to protect your data, including sensitive or protected data?

Findable

How is your data findable by your team and other collaborators? How is the data indexed?

Accessible

How do users access your data? Is the data open to the public? What kind of sign-in process will you use if approval is required to access the data?

Interoperable

Will you connect different data types for your analysis? If so, how? What is the provenance of the data, and how will it be curated for data harmonization?

Reusable

What kinds of licenses are associated with the data? Is ownership of the data clearly described? If there is human data, are consent agreements propagated for other research uses? Is this information associated with the metadata?


This page last reviewed on April 13, 2023