Big data – data science – data wrangling – data munging – data commons – cloud instances – the lingo of data science is growing almost as fast as the data itself is multiplying. Many of these terms reflect the size of the burgeoning data stores found in almost every scientific discipline, from anthropology to zoology. This leaves the casual observer or the data novice (and even some data impresarios!) with the idea that data science is only for big data, that these two terms are somehow synonymous. I’d like to start a dialog for conceptual clarity about data science, undergirded by a plea for methodological flexibility and philosophical plurality.
First, let’s talk about data science. It means many things to many people; in this blog I have previously advanced the idea that, within the health care context, data science is a set of principles and practices that underlie the effective use of data relevant to biomedicine to glean insights and make new discoveries that will improve human health. While size matters, and it is often the sheer size of a data set that exceeds the limits of human and computational resources, data science isn’t reserved only for large data sets. Indeed, the robustness and methodological flexibility afforded by many data science approaches may be of significant value when applied to a wide variety of data sets. To me, data science approaches step in when a data set lacks the characteristics that make it amenable to investigation using well-understood frequentist inference approaches.
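One concrete instance of a method that sidesteps frequentist distributional assumptions is the bootstrap, which builds a confidence interval by resampling the observed data rather than by assuming, say, normality. The sketch below is purely illustrative (the sample values and function name are invented for this post, not drawn from any real study):

```python
import random

def bootstrap_ci(data, stat, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic.

    No assumption is made about the distribution that generated `data`;
    the sampling variability is estimated by resampling with replacement.
    """
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    low = estimates[int((alpha / 2) * n_resamples)]
    high = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

# Hypothetical small sample -- too small and oddly shaped for
# textbook normal-theory intervals to be comfortable.
sample = [2.1, 3.4, 2.8, 5.9, 3.1, 4.2, 2.5, 3.8, 6.0, 3.3]
low, high = bootstrap_ci(sample, lambda xs: sum(xs) / len(xs))
```

The same function works unchanged for a median, a trimmed mean, or any other statistic one passes in, which is part of the methodological flexibility being described here.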
Data science stands apart from other analytical and philosophical approaches to inquiry with three main distinctions: 1) the methods are unencumbered by distributional assumptions; 2) analytical tools can be applied continuously, as to streams of data, or in a distributed manner, to data sets stored in different locations; and 3) the volume of data often exceeds the storage capacity of computers commonly used in research. Data science reflects a promising way to make sense out of large data sets, to add interpretability to high-volume, high-velocity data.
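The second distinction, applying analytical tools continuously to streams of data, can be made concrete with an online algorithm that updates summary statistics one observation at a time, without ever holding the full stream in memory. This is a minimal sketch using Welford's algorithm; the readings are invented for illustration:

```python
class RunningStats:
    """Welford's online algorithm: mean and variance over a stream,
    updated one observation at a time with constant memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance; defined once at least two values have arrived.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# Hypothetical sensor readings arriving one at a time.
stats = RunningStats()
for reading in [98.6, 99.1, 97.8, 98.4, 100.2]:
    stats.update(reading)
```

Because each update touches only three numbers, the same object could run for years against a live feed, or separate instances could run against data sets stored in different locations and be merged later.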
Data science is more than large data files and a range of analytic and visualization tools. It’s a way of thinking about discovery and knowledge building that complements traditional empirical and experimental approaches with a new set of methods less tied to the semantics of the phenomena and more tied to the characteristics of the data itself. In data science approaches, scholarly rigor emerges in a different manner than in the more familiar design-driven approaches to rigor.
Now let’s unpack the first claim: “less tied to the semantics of the phenomena and more tied to the characteristics of the data itself.” Semantics-based research methods use measurement approaches and analytical strategies governed by the nature of the phenomena of interest. For example, variable definitions tie the actual data generated in measurement to some underlying physical, anatomical, or psychological phenomenon: heart rate values within a certain range indicate normal function of the heart. The data (counts of beats) form a measure (heart rate) that is a representation of some phenomenon (cardiac function). Data science investigations can have, but are not restricted to, a semantic alignment between data and phenomena. That is, the representation constraints are relaxed, and theory becomes useful as an interpretation tool at the end of an investigation, not at the beginning. This becomes exceedingly useful when one is exploring data captured during normal processes, such as traffic flow or sound waves: the linking of the data to the phenomena comes late in the process and is not asserted early.
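To make the order of operations concrete, here is an illustrative sketch (the traffic counts and labels are invented for this post) of letting structure emerge from unlabeled data first and attaching meaning only afterward, using a simple one-dimensional k-means grouping:

```python
import random

def kmeans_1d(values, k=2, iters=50, seed=0):
    """Group raw numbers into k clusters -- no semantic labels attached
    up front; the grouping is driven only by the data's own structure."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Hourly vehicle counts, recorded without any hypothesis in mind.
counts = [12, 9, 14, 11, 95, 102, 88, 110, 10, 97]
centers = kmeans_1d(counts, k=2)
# Interpretation comes last: the low cluster might be read as "off-peak"
# and the high cluster as "rush hour" -- semantics assigned after the fact.
```

The algorithm itself knows nothing about traffic; the theory-laden labels are applied at the end, as an interpretation tool, which is exactly the reversal described above.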
Now for the second assertion: “scholarly rigor emerges in a different manner than in the more familiar design-driven approaches to rigor.” In empirical, experimental approaches to research, the rigor is built in up front: carefully explicated theoretical frameworks guide the initial hypothesis; measurement arises from characteristics of the phenomena under study; randomization, sample size, sampling plan, and procedures are well thought out and applied in a principled manner; only those analytical strategies supported by the data and aligned with the design are employed; and interpretation returns to the original theoretical premise. A priori planning and careful execution lead to rigor, trustworthiness, and interpretability.
Data science methods allow investigation of a data set long after the data have been generated or collected. The curation of the data at the point of use, still executed in a principled manner, affords a level of traceability that enhances trust and interpretation. The data may have been collected for one purpose and then used for a very different one. For example, weather observations collected for the purpose of predicting storms in turn become input into predicting satellite performance or degradation. Traffic patterns that guide the sequencing of stoplight controls become explanatory variables in geographically based investigations of sudden cardiac death. In data science investigations, the rigor emerges from the principled manner of selecting, curating, and investigating data, not from a priori design and conduct.
Indeed, data science investigations yield different insights than do experiments – both are needed for discovery! It would be a mistake to reserve data science approaches for occasions when one faces the challenges of large data sets; these emerging approaches hold great promise for learning from many situations.