A Selection of Links and Articles Related to Reproducible Research

I've been interested in reproducibility for several years now and I have collected a bunch of links and articles that inspired me. After Konrad Hinsen pointed out to me that several of these articles were not always publicly available, I decided to make this list public, with pointers to "freely usable URLs". I'm not sure the sorting is really relevant, but people may still find subjects of interest here.

If you want a short introduction to and motivation for this topic, you may want to have a look at the slides I prepared for the Inria days in June 2016, which are riddled with pointers.

[Image: title slide of the talk given at the Inria journées scientifiques, 2016-06-21]

I organize a series of webinars on this topic, where I invite colleagues to present their views on specific subjects. You'll find a lot of useful information in there.

Selection of Articles I particularly enjoyed

Blogs or websites

Articles/Conferences

Webinars

Summer school

I co-organized a summer school on Performance Metrics, Modeling and Simulation of Large HPC Systems, in which I gave a tutorial on best practices for reproducible research. This tutorial was recorded and made available.

Coursera

The series of lectures on Data Science from Johns Hopkins University that can be found on Coursera is not bad. In particular, Roger D. Peng's lecture on reproducible research includes several interesting references. Unfortunately, this series of lectures now seems to require payment. :(

Conferences and events (roughly sorted by date)

Tools

Tools I use on a daily basis

  • Org-mode and R…
  • Sweave, or rather its more modern incarnation: knitr. I use it mostly with students, in conjunction with RStudio (a minimal sketch follows this list)
  • Git obviously
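
To make this concrete, here is a minimal sketch of a knitr (R Markdown) document; the file name and the analysis are hypothetical and only meant for illustration:

    ---
    title: "A small reproducible analysis"
    output: html_document
    ---

    ```{r mtcars-summary}
    # Prose and code live in the same document; knitr re-executes this
    # chunk every time the report is compiled, so the figures and numbers
    # can never drift away from the code that produced them.
    data(mtcars)                    # a data set shipped with R
    summary(mtcars$mpg)
    plot(mpg ~ wt, data = mtcars)   # fuel efficiency vs. weight
    ```

Compiling such a file, e.g., with rmarkdown::render("analysis.Rmd") or the Knit button in RStudio, re-runs the whole analysis from scratch, which is precisely the point.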

Tools I've looked at, which I don't use, but which have good potential

There are many Python tools.

The truth is that, at the moment, I prefer to stick with R because I'm more familiar with it and because it has the lead on many things, in particular in terms of expressiveness. Take graphics, for example:

  • On the R side, there is ggplot2, which is extremely expressive (see the sketch after this list).
  • On the Python side, there is matplotlib, which is by comparison very low level. It can produce very nice graphics, but it's far less expressive.
  • And I'm not the only one to think this. See for example http://blog.yhathq.com/posts/ggplot-for-python.html. That's why there have been several "ports" or reimplementations. In any case, the Python community is likely to "win" in the end, with excellent tools and a "one language to rule them all" approach, but for the moment I'll stick to my combination of specialized languages/tools that I master quite well. :)
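
As a small illustration of what I mean by expressiveness, here is a sketch using ggplot2 and its built-in mpg data set: a declarative description of how data columns map to visual properties is enough to get colored points, per-group regression lines, and one panel per vehicle class.

    library(ggplot2)
    # Declare the mapping from data columns to visual properties;
    # ggplot2 derives the scales, legends, fits and panel layout itself.
    ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
      geom_point() +
      geom_smooth(method = "lm", se = FALSE) +  # one linear fit per drive type
      facet_wrap(~ class)                       # one panel per vehicle class

Writing the equivalent in plain matplotlib means looping over the groups, fitting the regressions yourself, and laying out the subplots by hand.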

Tools I stumbled upon that have interesting features

Sectioning is difficult, as projects span different aspects.

Logging and monitoring

  • ActivePapers (Tutorial). Not sure this is the right section though…
  • Sumatra (a great idea by Andrew Davison, which definitely inspired some of our methodology based on git and org-mode)
  • Burrito is a tool that monitors and logs your activity so that you can later come back to previous states.

Packaging

  • CDE (slides from the AMP workshop) and CARE are tools that help you package code so that it can be rerun by others.
  • There is also a recent effort (by the VisTrails team) called ReproZip that traces dependencies and files and relies on VMs and Vagrant to reproduce the environment.

Experiment management

  • VisTrails (a workflow engine, which Sascha Hunold tried as an alternative to my ugly combo of scripts/Makefiles; definitely interesting).
  • 3X: A Workbench for eXecuting eXploratory eXperiments (never used it, though). It looks like a mixture of the late APST with a code packaging tool and some visualization infrastructure. Somehow, it's related to Sumatra (with much more parallelism and far less journaling). It's crazy to see how we all reinvent the wheel. :(

Analysis platforms

  • Undertracks (an ad hoc platform initially designed at the LIG for studies on technology-enhanced learning systems).
  • Framesoc and a trace archive (I'm involved in these projects to some extent).

Shared Storage Infrastructures

Proprietary tools :(

Scientific instruments

Publication model

Reconciling HPC/Big Data with reproducible research

  • Why should I believe your supercomputing research?

    Confidence in an instrument increases if we can use it to obtain results that are expected, and we gain confidence in an experimental result if it can be replicated with a different instrument or apparatus. Whether we have evidence that claims to scientific knowledge stemming from simulation are justified is not as clear-cut a question as verification and validation.

Statistics