DataScience@NIH

Driving Discovery Through Data

Data Science: Preparing for a World of Tribbles
Patti Brennan / 04.27.17

Those of you born before 1980 or the inveterate fans of 1960s science fiction know what tribbles are. These small, soft, furry, purring space aliens appeared on a memorable 1967 episode of “Star Trek.”

The pleasure and comfort they bring to humans come with a significant downside: tribbles do nothing but eat and reproduce. Born pregnant and with no known predators, tribbles multiply exponentially, quickly consuming anything edible. Their care and feeding exhaust all available resources, leaving little reserve for other pursuits and other species.

Sounds a bit like research-generated data, doesn’t it?

Not long ago (but long after the first tribbles appeared on “Star Trek”) a research project produced a resolution to the hypothesis that initiated the project. Questions were posed, hypotheses generated, data acquired, analytics applied, and interpretation emerged. Done and done. But since the late 1990s or so, many research projects have also generated data sets that promise—or at least hint at—future discoveries.

Researchers the world over spend substantial funds curating and storing these data sets and ensuring they are FAIR (findable, accessible, interoperable, and reusable). NIH alone spends hundreds of millions of dollars annually to ensure high-value data sets are securely stored and made available for future use and to foster among investigators an appreciation for and willingness to engage in data-driven discovery.

All of which leads me to conclude that data science can learn a lot from “Star Trek” and its tribble troubles.

Like tribbles, data are attractive and pleasing to many. Some species, like Klingons, abhor tribbles, and I suspect a few out there abhor data or see data science as another scientific fad. Just as tribbles and their perpetual offspring can place extreme demands on a system, so too can data sets, especially those left to grow without purpose or design that end up competing for scarce research dollars. (Fortunately, we’re not yet at the point of having to choose between new investigations and data-driven science, but that day may come.) And the solution to the Enterprise’s tribble infestation—transporting them to a nearby Klingon vessel—sounds a bit like the hope that “the cloud” will solve our data science challenges, when such a move solves only one problem and shifts the others to a new environment.

Despite those challenges, some data science zealots advocate replacing experimental science with data-driven discovery, but I recommend a more balanced approach. As the Federation’s Prime Directive makes clear, every species and society should be allowed to follow its normal cultural evolution, and, to me, data science is part of the evolution of scholarly discovery.

I invite you to partner with me and the NIH to ensure a principled approach to data science.  

What steps should NIH take to make sure the “data tribbles” don’t crowd out the full range of discovery strategies needed to deliver the greatest benefit to society?

Comments

Let's involve the whole Enterprise crew to go where the NIH wants to go. I recommend involving many more subject specialists, business analysts, data modelers, data architects, data engineers and software project managers in exploring this strange new world. Subject specialists (Dr. McCoy and Mr. Spock) are domain experts and understand the medicine and science behind the numbers. Business analysts (Lt. Uhura) bridge the communication gap between subject matter experts and technologists, helping each side of the chasm understand and talk to the other. Data modelers (Ensign Chekov) start with the big picture of entities and relationships and drill down into the details, helping us navigate the universal view. Data architects (Lt. Sulu) take the perfect worldview model and implement it as a workable framework with credible back-end support. Data engineers (Scotty) take the data in its source format, map it to the models and architectures and import it. And software project managers (Captain Kirk) define the timeline, goals and deliverables and keep everyone happy and moving in the same direction. "Ahead, Warp Factor One!"
"All your people must learn before you can reach for the stars."
--CAPTAIN KIRK, Star Trek: The Original Series, "The Gamesters of Triskelion"
http://www.top10-best.com/c/top_10_best_captain_kirk_quotes.html

PattiBrennan

Thanks so much! This is exactly one of the world views that I think we need!

The only way to control the tribbles (which mindlessly reproduce - they were 'born pregnant') was to control their access to food. Today we create biological data at a scale that disguises its cost.

This is a pretty common theme, one I suspect has very little necessary connection to the problems of biological big data. Because every word is sung, an opera places a premium on the words of a libretto. When we switched from film to digital, actors could do as many takes as needed - even the actor doesn't know what the film is about until the editor is done.

Technologically, we have selected the ability to create data cheaply without the corresponding investment in creating the researchers who can effectively use these data. There are certainly counterexamples to my musings; challenges solved in the data-unlimited paradigm clearly exist. However, this blog post highlights the littleness of the fig leaf concealing our shame that we have not done as well as we would have hoped. The same happened in taxonomy and phylogenetics when we hoped that a fully sequenced genome would finally resolve those polytomies.

Maybe it's not so bad, maybe machine learning will save us from drowning in this data. But by my lights we need to be educating our scientists and students to operate thoughtfully in the space of big data. Sequence is cheap, but the time needed to wade through the data to pick out the gems isn't.

In the end, if there had been fewer tribbles, finding the exploding one would have been a lot easier.

PattiBrennan

Thank you for your thoughts. Clearly you had some new ideas that I did not have, and the analogy to the movie is very cogent. We will need lots of analogies to guide us through the data science experience!

Catalog the tribbles! We need FAIR at our home institutions. Here at CHOP, a small number of us say 'library science before data science'. To wit: our research file system hosts over 400,000 genomic data files (BAM tribbles!) that nobody knows anything about. We have 6,700 studies in REDCap (case report tribbles!) with 5.7M records spread across 1.4M variables. And that's not even touching clinical data and proliferating registries (registry tribbles!). These data are lost to follow-up absent a serendipitous 'hey I study tribble schizogony! we should work together' data sharing connection. Maybe 'before data science' is too procedural, but data science will continue to be stymied if we can't find and qualify the critical data we already have at home. These data are where much of the untapped value lies and are most aligned with our local mission - pediatric, adult, disease area...
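
As a rough illustration of the "catalog the tribbles" idea, here is a minimal Python sketch of a first-pass file inventory: it walks a directory tree and records path, size, modification time and checksum for each matching file. The function names, the `.bam` default and the CSV column layout are assumptions made for this example; they do not describe any system actually in use at CHOP or NIH.

```python
import csv
import hashlib
import os
from datetime import datetime, timezone


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def catalog_files(root, extensions=(".bam",)):
    """Walk `root` and return one metadata record per file with a matching extension.

    Hypothetical helper: the record fields are illustrative, not a standard schema.
    """
    suffixes = tuple(ext.lower() for ext in extensions)
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            records.append({
                "path": os.path.relpath(path, root),
                "size_bytes": info.st_size,
                "modified_utc": modified.isoformat(),
                "sha256": sha256_of(path),
            })
    # Sort so the catalog is stable from run to run.
    records.sort(key=lambda record: record["path"])
    return records


def write_catalog(records, out_path):
    """Write the records to a CSV file with a fixed header row."""
    fields = ["path", "size_bytes", "modified_utc", "sha256"]
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(records)
```

A catalog like this only answers "what files exist and have they changed"; the harder library-science work of attaching study, consent and provenance metadata to each record still has to be done by people who know the data. Checksumming hundreds of thousands of large files is also slow, so a real inventory might skip or defer that column.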
