Reproducibility in Data Analysis
Last updated: August 19, 2020
.md file used to generate them.
Reproducibility means each time you run your analysis with the same inputs, you get the same outputs.
Code is the .r file or .py file you wrote.
Data are your raw data files (e.g.,
Environment is the system that runs your code, along with any dependencies: for example, the version of R that you’re using and any external packages needed to run your code.
The simplest thing you can do to make sure your code produces consistent results is to re-run from start to finish with a clean environment (meaning that you restart R or Python to ensure there aren’t left-over dataframes in memory your code could be depending on).
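One way to automate that check in Python is sketched below. It starts a brand-new interpreter for each run, so nothing left in memory can leak into the results, and compares a hash of what the script prints. The helper names `run_clean` and `is_reproducible` are made up for illustration, and the sketch assumes your script writes its results to standard output.

```python
import hashlib
import subprocess
import sys
from pathlib import Path


def run_clean(script: Path) -> str:
    """Run `script` in a fresh Python process and return a SHA-256 of its stdout."""
    result = subprocess.run(
        [sys.executable, str(script)],  # new interpreter, so no left-over state
        capture_output=True,
        check=True,  # raise CalledProcessError if the script exits with an error
    )
    return hashlib.sha256(result.stdout).hexdigest()


def is_reproducible(script: Path, runs: int = 2) -> bool:
    """Return True if every clean run of `script` prints byte-identical output."""
    digests = {run_clean(script) for _ in range(runs)}
    return len(digests) == 1
```

A call such as `is_reproducible(Path("analysis.py"))` (a hypothetical file name) returning `False` usually points to an unseeded random number generator, a timestamp in the output, or a dependence on the order of files on disk.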
You also need some sort of version control. This could be as simple as creating copies of the files with time stamps in the file names, but a better practice is to use a version control system like git.
The best thing you can do to avoid issues with data consistency is to never modify your source data. I like to keep a copy of source data in a .zip file, which prevents accidental modification.
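If the source data stays zipped, the analysis can read straight from the archive instead of extracting files that might later be edited by accident. Here is a minimal sketch using only the Python standard library. It assumes the archive holds UTF-8 CSV files, and `read_csv_from_zip` is a hypothetical helper name.

```python
import csv
import io
import zipfile
from pathlib import Path


def read_csv_from_zip(archive: Path, member: str) -> list[dict[str, str]]:
    """Load one CSV from a .zip archive into memory without extracting it to disk."""
    # mode="r" opens the archive read-only, so the source data cannot be altered here.
    with zipfile.ZipFile(archive, mode="r") as zf:
        with zf.open(member) as raw:
            # zf.open() yields bytes; wrap it so csv can read text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))
```

Every value comes back as a string, so type conversion happens in your analysis code, where it is visible and versioned alongside everything else.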
If you have multiple versions of source data, figure out some sort of system for version control. Again, this could be as simple as .zip files with time stamps in their names. You can also use git for data, though that doesn’t work well with large datasets and may be problematic for sensitive data.
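The time-stamped .zip approach is easy to script. The sketch below is one possible version, not a standard tool: `snapshot` and `sha256_of` are hypothetical names, and the `YYYYMMDD_HHMMSS` stamp format is just one convention that sorts chronologically. Keep the output folder outside the data folder so a snapshot never tries to archive itself.

```python
import hashlib
import zipfile
from datetime import datetime
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(data_dir: Path, out_dir: Path) -> Path:
    """Zip every file under `data_dir` into `out_dir/<name>_<timestamp>.zip`."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    archive = out_dir / f"{data_dir.name}_{stamp}.zip"
    # Mode "x" refuses to overwrite an existing snapshot.
    with zipfile.ZipFile(archive, "x", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(data_dir.rglob("*")):
            if path.is_file():
                zf.write(path, arcname=path.relative_to(data_dir).as_posix())
    return archive
```

Recording `sha256_of(archive)` in your project notes gives you a cheap way to confirm later that a snapshot is the one you think it is.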
Package managers make it possible to re-load the exact same set of external packages on a new computer (or on a collaborator’s computer). Unless you have a good reason to use something else, as of April 2020 you should use renv for R and pipenv for Python.
You should also note the version of the interpreter for the programming language you’re using in your project documentation.
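For a Python project, the interpreter version can be captured by a few lines of code rather than typed by hand. The sketch below is an illustration that complements a package manager rather than replacing it: it collects the interpreter version plus whatever packages are installed in the current environment, using the standard library's `importlib.metadata`. The function name `environment_report` is made up.

```python
import json
import platform
from importlib import metadata


def environment_report() -> dict:
    """Describe the running interpreter and the packages installed alongside it."""
    packages = {
        dist.metadata["Name"]: dist.version
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip entries whose metadata is missing a name
    }
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Paste this output into your project documentation, or save it to a file.
    print(json.dumps(environment_report(), indent=2))
```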
I use the same folder structure for all my data analysis projects. This makes files easy to find, and the structure is designed to promote good practices for reproducibility.
You should have a single “point of entry” that produces your analysis. For me this is often the README file in my project folder, where I explain exactly what to run to execute every part of the analysis. Another option is a single “orchestrator” script responsible for running the entire analysis, which either contains all the analysis code (a poor choice if the code is long) or calls out to separate files. The key thing is that the whole analysis runs through the “point of entry” and that the “point of entry” is readily discoverable. A README in the root project folder is by definition discoverable, as is a file named _run_all.R (which will sort to the top of file lists due to the
See below for more on documentation.
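To make the “orchestrator” option concrete, here is a rough Python counterpart to a file like _run_all.R. Treat it as a sketch: the step file names are placeholders, and running each step in its own process is one design choice among several. It does mean every step starts from a clean environment.

```python
import subprocess
import sys
from pathlib import Path
from typing import Sequence

# Hypothetical step names; replace them with the scripts in your own project.
STEPS = ("01_clean_data.py", "02_fit_model.py", "03_make_figures.py")


def run_all(project_dir: Path, steps: Sequence[str] = STEPS) -> None:
    """Run each analysis step in order, each in a fresh Python process."""
    for step in steps:
        print(f"Running {step} ...")
        # check=True stops the pipeline at the first failing step, so later
        # steps never run against incomplete intermediate files.
        subprocess.run([sys.executable, step], cwd=project_dir, check=True)


if __name__ == "__main__":
    # Assumes this file sits in the project root next to the step scripts.
    run_all(Path(__file__).resolve().parent)
```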
From low to high level:
- You should have comments interspersed with your analysis code. I recommend brief descriptions for each “chunk” of code, plus longer comments explaining anything complex or unintuitive. You should write enough comments so a competent programmer who had never seen your code before could quickly understand what was going on. (Note: this person is most likely you, 6 months from now.)
- At the top of each file (or class/function/method if you’re writing more complex programs), you should have a longer description explaining at a high level what the purpose of the file is, and what the expected inputs and outputs are.
- Finally, you should have a README in the root level of your project explaining the purpose of the project and how to run it.
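As an illustration of the first two levels, here is what a small, fully documented Python file might look like. The task (averaging values by group) and the function name `mean_by_group` are invented for the example. The points to notice are the file-level description of purpose, inputs and outputs, and the brief comment on each chunk.

```python
"""Summarize measurements by group.

Purpose: compute the mean value for each group label.
Inputs:  an iterable of (group, value) pairs.
Outputs: a dict mapping each group to the mean of its values.
"""
from collections import defaultdict
from statistics import mean


def mean_by_group(rows):
    """Return {group: mean value} for an iterable of (group, value) pairs."""
    # Collect the values belonging to each group.
    values = defaultdict(list)
    for group, value in rows:
        values[group].append(value)

    # Reduce each group to its mean. Groups are emitted in sorted order so
    # that printed output does not depend on the order of the input rows.
    return {group: mean(values[group]) for group in sorted(values)}
```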