STRIDES Initiative: Starting and Using the Cloud

Preparing a Project for the Cloud

Understanding the big picture behind your data (what the data are, who created them, what they are used for, and so on) is essential: it provides the context you need to make an informed decision about how to move your data to the cloud.

Additionally, preparing your data before upload by removing anomalies, de-identifying protected information, and profiling the data is often a cost-effective and time-saving practice.

Identify Key Metrics and Prepare Your Data

Having some key metrics about your data on hand will help streamline the process of uploading data to your chosen STRIDES Initiative partner’s cloud platform. The NIH STRIDES Initiative team can help you answer the questions below and determine the best STRIDES Initiative partner to use by matching your needs with the available options.

Begin to gather the following information:
Subject: Question
Current size: How large are the datasets currently (MB/GB/TB/PB)?
File count and size distribution: How many individual files are there? What is the range of file sizes?
Data format: What format is the data currently in?
Data upload: Do you know about or want to use data prep tools that can help you prepare your data during the data upload process? Do you want to upload all the data at once? If not, how do you want to split the data into multiple uploads?
Data transfer: Do you plan on moving the data off drives to a tape drive? Do you want to ship the data to the STRIDES Initiative partner?
State of data: What state do you want the data to be in? How do you plan on cleaning the data (e.g., find anomalies, organize the data, etc.) before you transfer it to the cloud?
Storage duration: How long do these datasets need to live on the cloud (in months)?
Estimated growth: What is the estimated rate of growth for new data (e.g., percentage increase per year)?
Backup and disaster recovery requirements: What data needs to be backed up? How many versions or copies of the backup need to be kept? Where do they need to live?
Update frequency: Will the data be updated or replaced frequently? Is the update a total rewrite of the entire dataset or just an incremental update to the existing dataset? What other tasks must happen following an update (e.g., re-indexing of the dataset)?
Access frequency: How frequently is the data accessed (e.g., daily, a few times per week, once a month, once every six months, almost never)?
Desired access latency: Does the data need to be available almost instantly on request, or is it okay if it appears in a few seconds, one to two minutes, or longer?
Data movement patterns: Will you need to download data from the cloud to a local system, or will it remain in the cloud during any subsequent analysis steps?
Current data location: Where does the data geographically live (country and/or general region within the country)? What devices do the data physically live on (e.g., disks, tape drives, etc.)? How many devices are there?
Predominant researcher location: Where are the main researchers geographically located (country and/or general region within the country)?
Data movement: Is there significant data movement to or from different sites that occurs during the regular course of business?
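
Several of the metrics above (current size, file count and size distribution, data format, and estimated growth) can be gathered with a short script. The following is a minimal sketch, not an official STRIDES tool: the function names are hypothetical, file format is approximated by file extension, and growth is assumed to compound annually at the percentage you supply.

```python
import os
from collections import Counter


def human_size(num_bytes):
    """Format a byte count using binary units (1 KB = 1024 bytes here)."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} PB"


def profile_directory(root):
    """Walk `root` and collect file count, total size, size range, and formats."""
    sizes = []
    extensions = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symbolic links so targets are not counted twice
            sizes.append(os.path.getsize(path))
            # Assumption: the extension is a reasonable proxy for the data format.
            ext = os.path.splitext(name)[1].lower() or "(none)"
            extensions[ext] += 1
    return {
        "file_count": len(sizes),
        "total_bytes": sum(sizes),
        "min_bytes": min(sizes) if sizes else 0,
        "max_bytes": max(sizes) if sizes else 0,
        "formats": dict(extensions),
    }


def projected_bytes(current_bytes, annual_growth_pct, months):
    """Project the size after `months`, assuming compound annual growth."""
    return current_bytes * (1 + annual_growth_pct / 100) ** (months / 12)
```

Calling `profile_directory` on the root folder of a dataset returns a dictionary you can copy into the table above, and `human_size(projected_bytes(report["total_bytes"], 10, 36))` gives a rough size after three years at 10 percent annual growth. Treat the projection as an estimate only, since real growth is rarely this regular.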
Data management
How are you keeping your data secure and making it findable, accessible, interoperable, and reusable (FAIR)?
Data security

What measures do you have in place to protect your data, including sensitive or protected data?

Findable

How is your data findable by your team and other collaborators? How is the data indexed?

Accessible

How do users access your data? Is the data open to the public? What kind of sign-in process will you use if approval is required to access the data?

Interoperable

Will you connect different data types for your analysis? If so, how? What is the provenance of the data, and how will it be curated for data harmonization?

Reusable

What kinds of licenses are associated with the data? Is ownership of the data clearly described? If there is human data, are consent agreements propagated for other research uses? Is this information associated with the metadata?
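
One way to keep license, ownership, and consent information associated with the metadata is to store it in a small machine-readable record alongside the dataset and check it before upload. The sketch below is only an illustration: the field names, the placeholder values, and the `missing_reuse_fields` helper are assumptions and do not come from any STRIDES or partner schema.

```python
import json

# Assumed field names for illustration; adapt them to your own metadata schema.
REQUIRED_REUSE_FIELDS = ("license", "owner", "contains_human_data")


def missing_reuse_fields(metadata):
    """Return the reuse-related fields that are absent or empty."""
    missing = [f for f in REQUIRED_REUSE_FIELDS if metadata.get(f) in (None, "")]
    # Consent terms are only required when the dataset includes human data.
    if metadata.get("contains_human_data") and not metadata.get("consent_terms"):
        missing.append("consent_terms")
    return missing


# Placeholder record; every value here is hypothetical.
record = {
    "title": "Placeholder dataset title",
    "license": "CC-BY-4.0",
    "owner": "Placeholder Lab",
    "contains_human_data": True,
    "consent_terms": "",
}

print(missing_reuse_fields(record))  # prints ['consent_terms']
print(json.dumps(record, indent=2))  # the record as it could be saved next to the data
```

Running a check like this before transfer flags datasets whose reuse terms are not yet documented, so the gaps can be closed while the data is still local.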

This page last reviewed on May 19, 2020