NIH Frontiers in Data Science Lecture Series

This series brings ideas at the forefront of data science to the NIH and biomedical science communities. Lectures, webinars, and workshops in this series are intended to inspire biomedical data science innovation and exploration. Many of these events are co-sponsored by individual NIH Institutes and Centers in order to highlight areas of data science that are of relevance or interest to particular biomedical domains. 

Upcoming Events:

Peter Murray-Rust, D.Phil. - November 15, 2016

Reader Emeritus in Molecular Informatics, University of Cambridge and Senior Research Fellow Emeritus, Churchill College, Cambridge

Founder of

Title: ContentMine: High-Throughput Extraction of Facts from Scientific Articles 

1:00pm - 2:00pm ET

Porter Neuroscience Research Center, Bldg 35A, Room 610, NIH Main Campus, Bethesda, MD

Videocast can be found here.

Bio: Dr. Murray-Rust is Founder of the ContentMine project, which has used machines to liberate more than 100,000,000 facts from scientific literature. His research interests involve the automated analysis of data in scientific publications and the creation of virtual scientific communities. He has applied this to Chemistry through the development of the Chemical Markup Language (ChemML or CML). Dr. Murray-Rust holds a Doctor of Philosophy from the University of Oxford, U.K. His academic career spans more than thirty years in Computational Chemistry and Molecular Informatics at Glaxo Group Research at Greenford, the University of Nottingham, and the University of Cambridge. He is known internationally for his activism in scientific open access and open data, which has been primarily focused on making scientific knowledge from literature freely available.

Abstract: There are millions of scientific articles published each year, but much of the content is not accessible because it is non-machine-readable or hidden in supplemental information or bitmapped figures. Content Mining (Text-and-Data mining/TDM) turns this semi-structured material into semantic form (XML) and annotates it with known metadata. EuropePMC, which works closely with PubMedCentral, provides an API for rapid fulltext search and retrieval of fulltext. ContentMine software then extracts "facts" with a number of "facet" tools: word search, regexes, bespoke text tools, chemical NLP (OSCAR), and certain diagram types (phylogenetic trees). The "facts" can be mapped onto triples and incorporated into Wikidata or used to annotate the text to help human readers. Common facets are often supported by dictionaries, but they can be easily extended by anyone with a list of words. Using heuristics, data can be extracted from common diagram types. The vision is to develop a communal open toolbox that can be extended and validated for a wide range of purposes. However, many rightsholders are trying to control TDM through technical and legal means. There is a recent legal exception in the U.K. that allows for text mining of facts for scientific research. The University of Cambridge is doing this and publishing to the open web. This talk will have live demos, many accessible to the participants during the talk.
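The facet idea in the abstract (dictionaries and regexes that pull "facts" out of text and map them onto triples) can be illustrated with a minimal Python sketch. This is not ContentMine's software: the dictionary entries, the dosage regex, the predicate names, and the article identifier are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical dictionary facet: a plain list of words anyone could supply.
SPECIES_DICTIONARY = ["Homo sapiens", "Mus musculus", "Danio rerio"]

# Hypothetical regex facet: matches simple dosage mentions such as "5 mg".
DOSAGE_PATTERN = re.compile(r"\b(\d+(?:\.\d+)?)\s*(mg|g|ml)\b")


def extract_facts(article_id, text):
    """Return (subject, predicate, object) triples found in the text."""
    triples = []
    lowered = text.lower()
    # Dictionary facet: case-insensitive substring lookup of each term.
    for term in SPECIES_DICTIONARY:
        if term.lower() in lowered:
            triples.append((article_id, "mentions_species", term))
    # Regex facet: every dosage match becomes one triple.
    for match in DOSAGE_PATTERN.finditer(text):
        value = f"{match.group(1)} {match.group(2)}"
        triples.append((article_id, "mentions_dosage", value))
    return triples


sample = "Adult Mus musculus received 5 mg of the compound; Danio rerio received 0.5 mg."
facts = extract_facts("article-001", sample)
for fact in facts:
    print(fact)
```

Extending the dictionary facet only requires appending to the word list, which mirrors the abstract's point that facets can be extended by anyone with a list of words.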

Past Events:

Karim R. Lakhani, Ph.D. - September 30, 2016

Professor of Business Administration, Harvard Business School

Title: Should We Go to the Crowd First? Lessons from Running Crowd Contests in Life Sciences: How and Why Crowd Contests for Algorithm Development Work 

2:00pm - 3:00pm ET

Porter Neuroscience Research Center, Bldg 35A, Room 610, NIH Main Campus, Bethesda, MD

This event is being held in conjunction with the NIH CHALLENGE SYMPOSIUM 2016.

Videocast can be found here.

Bio: Karim R. Lakhani is a Professor of Business Administration at the Harvard Business School, the Principal Investigator of the Crowd Innovation Lab and NASA Tournament Lab at the Harvard Institute for Quantitative Social Science, and the faculty co-founder of the Harvard Business School Digital Initiative. He specializes in technology management and innovation. His research examines crowd-based innovation models and the digital transformation of companies and industries. Professor Lakhani is known for his pioneering scholarship on how communities and contests can be designed and managed to achieve innovative outcomes. He has partnered with NASA, TopCoder, and the Harvard Medical School to conduct field experiments on the design of crowd innovation programs. His research on digital transformation has shown the importance of data and analytics as drivers of business and operating model transformation and as a source of competitive advantage.

Abstract: Over the last decade crowd-based models of innovation have been shown to be highly complementary to a range of academic and industrial algorithmic and scientific problems. The work of the Crowd Innovation Lab | NASA Tournament Lab at Harvard University has shown that crowd-driven solutions for algorithmic challenges in computational biology, image analysis and space sciences routinely outperform internally developed solutions from elite organizations. The mini workshop will review the results, provide a systematic framework for understanding the place of crowd contests within academic life science research programs, and provide a deep dive into two recent challenges in genomics and imaging.

NIH Challenge Symposium 2016 - September 30, 2016

1:30pm - 4:00pm ET

Porter Neuroscience Research Center, Bldg 35A, Room 610, NIH Main Campus, Bethesda, MD

Videocast can be found here.

This lecture is part of a Challenge Symposium to explore the use of challenges as a mechanism for achieving scientific goals. The symposium will start with remarks from Tom Kalil, Deputy Director for Policy at the White House Office of Science and Technology Policy, 1:35pm - 2:00pm ET. Next, Dr. Karim Lakhani, a Harvard Business School professor who does research on challenges and innovation, will give a lecture, 2:00pm - 3:00pm ET. The symposium will conclude with a panel discussion with Lora Kutkat (NIH), Sandeep Patel (HHS), and Chris Nelson (OSTP), 3:00pm - 4:00pm ET.


Dario Taraborelli, Ph.D. - September 23, 2016

Head of Research, Wikimedia Foundation

Title: Wikidata: Verifiable, Linked Open Knowledge that Anyone Can Edit 

9:00am - 10:00am ET

Natcher Conference Center, Bldg 45, Balcony A, NIH Main Campus, Bethesda, MD

Videocast can be found here.

Bio: Dario Taraborelli is a social computing researcher and open knowledge advocate based in San Francisco. He is currently Head of Research at the Wikimedia Foundation, the non-profit organization that operates Wikipedia and its sister projects. His research spans the behavioral and social aspects of online collaboration and commons-based peer production. As a co-author of the Altmetrics Manifesto and a long-standing open access advocate, he is interested in the design of open systems to promote, track, and measure the impact, reuse, and discoverability of research objects.

Abstract: As the largest and most popular general reference work on the Internet, Wikipedia is one of the primary entry points to scholarly literature and a primary channel for the dissemination of scientific content. Scholarly communities may be less familiar with the role Wikipedia plays as a source of linked open data for a large number of applications and services. In this talk, Dr. Taraborelli will give an overview of Wikidata, the collaborative knowledge base that anyone can edit, and Wikipedia’s fastest growing sister project. He will focus specifically on Wikidata initiatives of relevance to the scientific community as well as recent efforts to build an open bibliographic and citation repository in Wikidata to help volunteer contributors build the sum of human knowledge. Preserving the provenance and verifiability of information in Wikipedia is critical to the viability of the project. He will showcase several ways in which human, machine, and expert curation can help achieve this goal and how scholarly communities can leverage this knowledge corpus via open APIs.

NIH and the National Science Foundation jointly welcome:

Po-Shen Loh, Ph.D. - June 1, 2016  

Associate Professor of Mathematics, Carnegie Mellon University and Founder of

Title: World-Scale Personalized Learning through Crowdsourcing and Algorithms 

3:30pm - 4:30pm ET

Videocast can be found here.

You will need the meeting information and password: 

Event number: 745 418 342

Event password: NSFNIH#2016

Bio:  Po-Shen Loh is a math enthusiast and evangelist. He is the national coach of the USA International Mathematical Olympiad team, a math professor at Carnegie Mellon University, and the founder of  Po-Shen has numerous distinctions, from an International Mathematical Olympiad silver medal to the National Science Foundation’s CAREER award. His research considers a variety of questions that lie at the intersection of combinatorics (the study of discrete systems), probability theory, and computer science.

Abstract: Improving math and science education is a national priority. Personalized education, like personalized medicine, is both within reach and has the potential to transform current practice through the use of data and algorithms. Personalized education can benefit everyone from young students to established scientists. Established scientists can use it to fill in knowledge gaps efficiently, while young students stay engaged with math and science through active learning. Po-Shen Loh will speak about a pioneering project that will turn every smartphone into a free personalized learning system by combining crowdsourcing and mathematical algorithms. In a modern world where content is everywhere (but of varying quality, and disorganized), the central problem is to identify exactly which piece of content a particular learner should interact with at any given moment based upon the learner’s current knowledge base and long-term goals.

Terry Stouch, Ph.D. - May 20, 2016

President, Science for Solutions, LLC

Title: How Far Can We Trust Our Data?:  The Interpretation, Value, and Use of Drug Discovery Data and the Need for a Bigger Data Approach 

2:00pm - 3:00pm ET

Porter Neuroscience Research Center, Bldg 35A, Room 640, NIH Main Campus, Bethesda, MD

Videocast can be found here.

Bio: Dr. Terry Richard Stouch has been working in depth with drug discovery data for over 30 years in order to accelerate and increase the success of drug discovery. A user of the data himself, he collaborates closely not only with others who use the data, but even more importantly, with those who generate the data. He brings biological, physical chemical, statistical, and computational experience to the evaluation of assays and data. He specializes in drug design, pharmaceutical data analysis, predictive modeling, property prediction, bio-molecular structure, and molecular modeling and simulation. He is President of Science For Solutions, LLC, a consulting firm specializing in molecular and computational sciences; Senior Editor-in-Chief of the Journal of Computer-Aided Molecular Design; and an Adjunct Professor in the Department of Chemistry and Biochemistry at Duquesne University. He is a Fellow of the American Association for the Advancement of Science (AAAS) and a Fellow of the International Union of Pure and Applied Chemistry (IUPAC).

Abstract: How reliable is our drug discovery data? Is it qualitative or quantitative? Is it a hard and fast signpost to direct our efforts or just a hint of what might be? In fact, the percentage of drug discovery data in which we can have high confidence is surprisingly low. And, the common range of "true" error in any measurement is many times larger than the conventional error of measurement. This presentation will address the author's 3-to-10 and 20% rules that capsulize the true error of drug discovery data, the percentage that can be used with high confidence, and how these rules were determined. Examples of some surprising and confounding factors from the experience of many laboratories and companies around the world will be highlighted and lead to a discussion of the need for a "bigger" data of drug discovery that can help improve our understanding and increase the value of our results. The importance of data presentation leading to proper interpretation by users will also be discussed.

Anthony Goldbloom - April 29, 2016

Founder and CEO, Kaggle

Title: Data Science and Medicine: What's Possibly at the Cutting Edge?

1:00pm - 2:00pm ET

This webinar has been recorded for the purpose of sharing content for public use. To watch the full YouTube video presentation, please click here

Bio: Anthony Goldbloom is the founder and CEO of Kaggle, a Silicon Valley start-up which has used predictive modeling to solve large-scale problems for the Federal Government and private industry across a range of fields. In 2011 and 2012, Forbes Magazine named Anthony as one of the 30 Under 30 in Technology, in 2013 the MIT Tech Review named him one of 35 Innovators Under 35, and the University of Melbourne awarded him an Alumni of Distinction Award. He holds a first class honors degree in Econometrics from the University of Melbourne. Anthony has published in the Economist and the Harvard Business Review.

Abstract: Kaggle hosts data science competitions. Data scientists download data and upload solutions to very difficult problems. Kaggle has collaborated with the NIH to use data science to solve healthcare and medical research problems ranging from using data science to diagnose heart failure from fMRIs (by measuring ejection fraction) to predicting seizures from EEG data. This talk will introduce data science competitions and show some of the surprising things at the cutting edge of medical research.

Special Guest Speaker: 

Dr. Andrew Arai, Senior Investigator, National Heart, Lung, and Blood Institute (NHLBI)

This thought-provoking two-part presentation introduces how data science competitions can help to discover solutions to complex biomedical problems. In Part I, Anthony Goldbloom presents an overview of Kaggle’s methodology of designing, conducting, and evaluating data science competitions in medical research. He demonstrates this through case studies of recent biomedical research collaborations, including diagnosing and predicting heart failure, seizures, and diabetic retinopathy. In Part II, Dr. Andrew Arai presents the results of an NHLBI collaboration with Kaggle, which conducted a machine learning competition that analyzed MRI imaging data to identify heart damage indicators that can assist with heart attack prediction.

Ben Shneiderman - April 13, 2016

Distinguished University Professor, University of Maryland

Title: Interactive Visual Discovery in Event Analytics - Electronic Health Records and Other Applications

11:00am - 12:00pm ET

Porter Neuroscience Research Center, Bldg 35A, Room 610, NIH Main Campus, Bethesda, MD

Videocast can be found here.

Bio: Ben Shneiderman is a Distinguished University Professor in the Department of Computer Science at the University of Maryland (UM). He is a Fellow of the AAAS, ACM, IEEE, and NAI, and a Member of the National Academy of Engineering, in recognition of his pioneering contributions to human-computer interaction and information visualization.

Abstract:  Event Analytics is rapidly emerging as a new topic to extract insights from the growing set of temporal event sequences that come from medical histories, e-commerce patterns, social media log analysis, cybersecurity threats, sensor nets, online education, sports, etc. Dr. Shneiderman will review a decade of research on visualizing and exploring temporal event sequences to view compact summaries of thousands of patient histories represented as time-stamped events, such as strokes, vaccinations, or admission to an emergency room. His current work on EventFlow supports point events, such as heart attacks or vaccinations, and interval events, such as medication episodes or long hospitalizations. Demonstrations cover visual interfaces to support hospital quality control analysts who ensure that required procedures were carried out. He will show how domain-specific knowledge and problem-specific insights can lead to sharpening the analytic focus to enable more successful pattern and anomaly detection.
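The distinction the abstract draws between point events and interval events can be sketched as a small data structure. The sketch below is only an illustration in Python and is not EventFlow's data model; the `Event` class, its field names, and the sample records are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One time-stamped event in a patient history (hypothetical schema)."""
    patient_id: str
    category: str
    start: date
    end: Optional[date] = None  # None marks a point event

    @property
    def is_interval(self) -> bool:
        return self.end is not None

    @property
    def duration_days(self) -> int:
        # Point events have zero duration by convention in this sketch.
        return (self.end - self.start).days if self.end else 0


def patient_sequence(events, patient_id):
    """Return one patient's events ordered by start date."""
    own = (e for e in events if e.patient_id == patient_id)
    return sorted(own, key=lambda e: e.start)


records = [
    Event("p1", "hospitalization", date(2015, 3, 2), date(2015, 3, 12)),
    Event("p1", "vaccination", date(2015, 1, 10)),
    Event("p2", "stroke", date(2015, 2, 1)),
    Event("p1", "heart attack", date(2015, 3, 1)),
]

sequence = patient_sequence(records, "p1")
print([e.category for e in sequence])
```

Keeping both kinds of event in one type, with an optional end date, lets a single ordered sequence per patient carry point and interval events side by side.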

Dr. John Shon - April 5, 2016

VP of Bioinformatics and Data Sciences, Illumina

Title: Translating from Bench to Bedside and Back - Challenges and Opportunities from a Data Science Perspective

1:00pm - 2:00pm ET

Natcher Conference Center, Building 45, Room F1/F2, NIH Main Campus, Bethesda, MD

Videocast can be found here.

Bio: John Shon is VP of Bioinformatics and Data Sciences at Illumina. In this role, he leads a global team of bioinformatics scientists in developing algorithms and methods for Illumina NGS instruments and assays. As part of the Enterprise Informatics business unit, he also leads bioinformatics for clinical interpretation and translational informatics software. Prior to Illumina, Dr. Shon had over a decade of experience in large pharmaceutical companies, most recently as VP of Informatics, Research IT, and External Innovation at Janssen Pharmaceuticals (a division of J&J), where he supported R&D, clinical development, and Janssen Diagnostics teams. At Roche, Dr. Shon led informatics groups in translational research for target discovery, biomarker selection, drug safety, and personalized healthcare.

Abstract: As the cost of sequencing decreases, the generation of sequence data, and more importantly, genetic variation data, increases at a rate that exceeds our ability to fully understand it. The data is also increasingly being generated in clinical contexts, providing the potential to enlighten our mechanistic understanding of disease and therapy. From a data science perspective, Dr. Shon will describe how research and clinical operational contexts both facilitate and limit the use of next generation sequencing data for discovery, clinical care, and translational research purposes. Dr. Shon will also review promising approaches coupling the generation of NGS data with clinical data for the systematic application of translational knowledge for precision medicine.

Barend Mons, Ph.D. - February 9, 2016

Professor of Biosemantics at the Human Genetics Department of Leiden University Medical Center, Netherlands

Title: Open Science as a Social Machine

10:00am - 11:00am ET

Natcher Conference Center, Room F1/F2, NIH Main Campus, Bethesda, MD

Videocast can be found here.

Bio: Barend Mons is a Molecular Biologist by training (PhD, Leiden University, 1986) and spent over 15 years in Malaria research. After that, he gained experience in computer-assisted knowledge discovery, which is still his research focus. He spent time with the European Commission (1993-1996) and with the Netherlands Organization for Scientific Research (NWO). Dr. Mons has also co-founded several spin off companies. 

Currently, Dr. Mons is Professor in Biosemantics at the Human Genetics Department of Leiden University Medical Center, is Head of Node for ELIXIR-NL at the Dutch Techcentre for Life Sciences, Integrator Life Sciences at the Netherlands eScience Center, and Board member of the Leiden Centre of Data Science.

In 2014, Dr. Mons initiated the FAIR data initiative, and in 2015, he was appointed Chair of the European Commission's High Level Expert Group for the European Open Science Cloud, DG Research and Innovation.

For the FAIR data initiative  

For nanopublications 

Abstract: Barend Mons is Chair of the European Commission's High Level Expert Group for the European Open Science Cloud (EOSC). The EOSC is meant to be a supporting expert infrastructure for Open Science. In this presentation, Dr. Mons will cover the aspects of open and participatory science in which community curation and annotation of data is key. He will emphasise the joint responsibility for data stewardship in Open Science. He will explain the concepts of Nanopublication, the Explicitome, and FAIR (Findable, Accessible, Interoperable, and Re-usable) data and other research objects, with an emphasis on machine actionability of published research objects. Finally, Dr. Mons will outline the future developments of social machines in science and how users and producers of data merge into knowledge creation communities where man-machine interaction is key. Examples will be from his own field: Human Genetics.

Vahan Simonyan, Ph.D. - January 29, 2015

HIVE Lead Scientist at FDA




Raja Mazumder, Ph.D. - January 29, 2015

HIVE Lead Scientist at George Washington University

Title: High-Performance Integrated Virtual Environment (HIVE): A Regulatory NGS Data Analysis Platform

10:00am - 11:00am ET

Porter Neuroscience Research Center, Bldg 35A, Room 640, NIH Main Campus, Bethesda, MD

Videocast can be found here.

Vahan Simonyan, Ph.D. is the HIVE Lead Scientist at FDA and an author of more than 50 scientific publications in quantum physics and chemistry, nanotechnology, biotechnology, population dynamics, and bioinformatics. The technology developed by Dr. Simonyan and the code base donated to the US government launched HIVE at FDA. This resulted in an enormous success in the form of a regulatory-compliant R&D IT platform capable of handling peta-scale data from sequencing projects, post-market analytics, and clinical and preclinical data analysis. Currently, Dr. Simonyan's collaborations span more than 80 medium to large research and regulatory projects with scientists from government organizations, large healthcare consortia, and academia.

Raja Mazumder, Ph.D. is the HIVE Lead Scientist at GW, an Associate Professor of Biochemistry and Molecular Medicine, and the Director of The McCormick Genomic & Proteomic Center at The George Washington University (GW). Prior to joining GW, Raja was a faculty member at Georgetown University (GU), where he worked on the UniProt project as a Team Lead with colleagues from the European Bioinformatics Institute and the Swiss Institute of Bioinformatics. Prior to GU, Raja worked at the National Center for Biotechnology Information (NCBI) as a Bioinformatics Scientist.

Abstract: The abundance of miscellaneous high-performance computational platforms available across academia, the healthcare industry, and government organizations isn't doing much to close the gap between research and regulatory analytics. Extra iterations in the drug, device, and biologics approval process are causing a significant cost increase for medical product development. The High-Performance Integrated Virtual Environment (HIVE), co-developed by FDA and GW, presents a great opportunity to serve as a bridge. It is authorized as a regulatory NGS data analysis platform and provides a unique capability for healthcare stakeholders to look into NGS data from the regulatory perspective of FDA.

As a distributed storage and computation environment and a multicomponent cloud infrastructure, HIVE provides secure web access for authorized users to deposit, retrieve, annotate, and compute on biomedical big data, and to analyze the outcomes using web interface visual environments appropriately built in collaboration with internal and external end users. In addition to the initial HIVE applications to next generation sequencing, the current universe of HIVE projects covers tailor-made applications involving dimensionality analysis, federated and integrated data mapping, modeling and simulations that are applicable to basic research, biostatistics, epidemiology, clinical studies, post-market evaluation, manufacturing consistency, environmental metagenomics, outbreak detection, and more.

Francine Berman, Ph.D. - November 5, 2015

Chair, Research Data Alliance / US, and Hamilton Distinguished Professor of Computer Science, RPI

Title: Got Data? Building a Sustainable Ecosystem for Data Driven Research

4:00pm – 5:00pm ET

Videocast can be found here.

Bio: Dr. Francine Berman is the Edward P. Hamilton Distinguished Professor in Computer Science at Rensselaer Polytechnic Institute. She is a Fellow of the Association for Computing Machinery (ACM), the Institute of Electrical and Electronics Engineers (IEEE), and the American Association for the Advancement of Science (AAAS). In 2009, Dr. Berman was the inaugural recipient of the ACM/IEEE-CS Ken Kennedy Award for "influential leadership in the design, development, and deployment of national-scale cyberinfrastructure."

Dr. Berman is U.S. lead of the Research Data Alliance (RDA), a community-driven international organization created to accelerate research data sharing world-wide.  She also serves as Chair of the Anita Borg Institute Board of Trustees, as co-Chair of the NSF Advisory Committee for the Computer and Information Science and Engineering  (CISE) Directorate, as a member of the Board of Trustees of the Sloan Foundation, and as a member of the Board of Directors of the Monterey Bay Aquarium Research Institute (MBARI).

Previously, Dr. Berman served as Director of the San Diego Supercomputer Center and as Vice President for Research at Rensselaer Polytechnic Institute. She also served as co-Chair of the National Academies Board on Research Data and Information (BRDI), as co-Chair of the US-UK Blue Ribbon Task Force for Sustainable Digital Preservation and Access, as Chair of the Information, Computing and Communication Section (Section T) of the AAAS, and as a member of the NIGMS Advisory Council. For her accomplishments, leadership, and vision, Dr. Berman was recognized by the Library of Congress as a "Digital Preservation Pioneer", as one of the top women in technology by BusinessWeek and Newsweek, and as one of the top technologists by IEEE Spectrum.

Abstract: Innovation in a digital world presupposes that the data will be there when you need it, but will it? Without sufficient data infrastructure and attention to the stewardship and preservation of digital data, data may become inaccessible or lost. This is particularly problematic for data generated by sponsored research projects where the focus is on innovation rather than infrastructure, and support for stewardship and preservation may be short-term. In this presentation, Dr. Fran Berman discusses sustainability, infrastructure, and data, and explores the opportunities and challenges of creating a viable ecosystem for the data on which current and future research and innovation increasingly depend.

Martin Fenner, M.D. - October 28, 2015

DataCite Technical Director

Title: Data-level Metrics

11:00am - 12:00pm ET

This webinar is co-sponsored by NCI as part of the CBIIT Speaker Series.

Bio: Martin Fenner has been the DataCite Technical Director since August 2015. From 2012 to 2015 he was technical lead for the PLOS Article-Level Metrics project. Martin has a medical degree from the Free University of Berlin and is a Board-certified medical oncologist.

Abstract: From October 2014 to October 2015, the DataONE repository network, California Digital Library, and Public Library of Science (PLOS) worked on an NSF-funded project to explore metrics, including citations, downloads, and social media, for about 150,000 datasets. This presentation will summarize the major hurdles to making this work, the most important findings, and some ideas for going forward, including implementation as a production service.

Andrew Moore, Ph.D. - September 21, 2015

Dean of Computer Science at Carnegie Mellon University

Title: Recent Developments in Artificial Intelligence - Lessons from the Private Sector

12:00pm - 1:00pm ET

Lipsett Auditorium, Bldg 10, NIH Main Campus, Bethesda, MD

This talk is co-sponsored by NLM.  Videocast can be found here.

Bio: Andrew Moore is the Dean of the School of Computer Science at Carnegie Mellon University. His areas of research and expertise include decision and control algorithms, statistical machine learning, artificial intelligence, robotics, and statistical computation for large volumes of data. Andrew Moore previously served as the VP of Engineering at Google Pittsburgh, where he was responsible for the retail segment: Google Shopping. Andrew was involved with a number of Google/University activities, two examples of which were Google Sky (in collaboration with CMU, Hubble Space Telescope Center and University of Washington) and the Android SkyMap app.

Abstract: Andrew Moore will discuss some of the big developments in computer science from the perspective of someone crossing over from industry to academia. He will talk about roadmaps for AI-based consumer and advice products in the commercial world and contrast them with some of the potentially viable roadmaps in healthcare. Andrew Moore will also touch on entity stores (aka knowledge graphs), question answering, and ultra-large data center architectures.

Hadley Wickham, Ph.D. - September 16, 2015

Chief Scientist, RStudio and Adjunct Assistant Professor, Rice University

Title: Data Analysis with Pipes

2:30pm - 3:30pm ET

Building 40, Room 1201/1203, NIH Main Campus, Bethesda, MD

This talk is co-sponsored by NCI.

Bio: Hadley Wickham, Chief Scientist at RStudio and Adjunct Assistant Professor at Rice University is the author of several of the most revolutionary, influential, and popular software packages for the R statistical software environment, including dplyr, ggplot2, reshape2, and numerous others.

Abstract: Over the last year and a half, three things have had a profound impact on how I develop tools for data analysis: Rcpp, writing the Advanced R book, and the pipe operator (%>%, from magrittr). In this talk, I'll focus on the pipe operator and how it’s influenced the development of tidyr, dplyr and ggvis, the next generation of reshape2, plyr and ggplot2. Come along to learn about why I think pipelines are awesome and see how pipelines + tidyr, dplyr, and ggvis can make your data analysis fast, fluent and fun.

Greg Wilson, Ph.D. - May 20, 2015

Co-founder, Software Carpentry Foundation

Title: How to Help Ten Thousand Scientists

2:00pm - 3:00pm ET

Building 40, Room 1201/1203, NIH Main Campus, Bethesda, MD

The inaugural lecture of the Data Science Collaborative Lecture Series is co-sponsored by the National Institute of Mental Health.

Bio: Greg Wilson is a co-founder of the Software Carpentry Foundation. He is an engaging and knowledgeable advocate for best practices in scientific computing (see his recent PLOS Biology paper) that facilitate reproducibility and collaboration. Greg spoke on Software and Data Carpentry’s efforts to improve scientific computing practices worldwide.

Abstract: Reproducible research, open science, peta-this and next-generation that... It would be easy to believe that the revolution is over and that a new kind of science has already won, but the fact is that the majority of researchers don't use computers any more skillfully or productively than they did twenty years ago. Most are still largely self-taught, and as a result, it takes them far longer to do simple tasks than it should, and at the end, they have no idea how reliable their results actually are.

Software Carpentry and Data Carpentry are trying to change this. In the past five years, they have delivered two-day computing skills workshops to over 10,000 scientists in more than 25 countries, and taught more than 300 of them how to teach these skills to their colleagues. This talk will describe what we teach, why and how we teach it, the impact it's having, the mistakes we've made along the way, and what we're planning to do next.
