Project Organization for Data Science & Informatics
Last updated: January 18, 2021
When I was in grad school, I wrote this article on organizing data analysis projects. Its logic is sound, but my structure and naming conventions have evolved somewhat since then. This is partially due to “Good enough practices in scientific computing,” which suggests a fairly similar structure but with different naming conventions.
Here’s what I recommend now for structuring internal projects (i.e., not open-sourced):
📁 project-name
│
├─ README.md
│
├─📁 documents
│
├─📁 output
│  │
│  └─ .gitignore
│
├─📁 src
│  │
│  ├─📁 python
│  ├─📁 r
│  ├─📁 stata
│  └─📁 ...
│
├─📁 studydata
│
└─ .gitignore
This is a highly intentional system, which I will explain in detail. It’s also a minimum viable system: you may need additional structure for a more complex project, but this still provides a good starting point and a set of standards that can be scaled up. For example, I often have many sub-folders inside the folders shown above.
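If you set up projects often, the layout above can be scaffolded with a few lines of Python. This is only a sketch: `scaffold_project` and its `languages` parameter are hypothetical names, the top-level `.gitignore` is created empty, and the README gets nothing but a title line.

```python
from pathlib import Path


def scaffold_project(root, languages=("python", "r")):
    """Create the minimal folder layout described above (hypothetical helper)."""
    root = Path(root)

    # Top-level folders from the layout
    for folder in ("documents", "output", "studydata"):
        (root / folder).mkdir(parents=True, exist_ok=True)

    # One source folder per language or statistical package
    for language in languages:
        (root / "src" / language).mkdir(parents=True, exist_ok=True)

    # Placeholder README with just a title; never overwrite an existing one
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text(f"# {root.resolve().name}\n")

    # Ignore everything in output/ except the .gitignore itself
    (root / "output" / ".gitignore").write_text("*\n!.gitignore\n")

    # Empty top-level .gitignore, to be filled in per project
    (root / ".gitignore").touch()
    return root
```

Calling `scaffold_project("my-study", languages=("python", "stata"))` would create `my-study/` in the current directory with `src/python` and `src/stata` inside it.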
project-name/
Project names should be kebab-case.
Why dashes and not underscores?
This is because project names can become GitHub repository names, like https://github.com/username/project-name. URLs should never have underscores in them because they are obscured by hyperlink underlines. Local git repository folder names should match GitHub names to avoid confusion. Therefore, project names should not contain underscores.
Underscores are fine in subfolders and filenames. But keep the main folder kebab-case.
Why dashes and not spaces?
Spaces are annoying to deal with on the command line and in scripts: they must be escaped or quoted. Therefore, paths should not have spaces if they can at all be avoided (I’m looking at you, “My Documents” on Windows).
Why all lower case?
This just makes typing easier. If using upper case makes a project name easier to read (like NASA-rover), go for it.
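As a rough illustration of these naming rules, here is a small Python sketch that converts an arbitrary name to kebab-case and checks whether a name already conforms. The function names and the `lowercase` switch are my own inventions for this example, and the pattern only treats ASCII letters and digits as word characters.

```python
import re


def to_kebab_case(name, lowercase=True):
    """Join the words of a name with single dashes (hypothetical helper)."""
    # Any run of spaces, underscores, or other non-alphanumeric characters becomes one dash
    slug = re.sub(r"[^A-Za-z0-9]+", "-", name.strip()).strip("-")
    return slug.lower() if lowercase else slug


def is_kebab_case(name):
    """True if the name is lower-case words and digits separated by single dashes."""
    return re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", name) is not None
```

For example, `to_kebab_case("My Documents")` gives `my-documents`, and passing `lowercase=False` keeps a name like `NASA-rover` readable.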
project-name/README.md
Your README file should contain enough information for a smart person who has never seen your project before to be able to reproduce your analysis starting with just the raw data and your code.
Things it should probably contain:
- The purpose of the project
- Who is involved
- Where the data come from, and the version of the data [1]
- What software is used for analysis
- A description of how the project is organized and how to run the code
- Links to external project folders (e.g., Google Drive or OneDrive folders where documents live)
However a README is structured, the key thing is that it should be the single point of access for everything in your project. If it’s related to the project, it should be described in the README. Files outside the version-controlled project folder should be accessible via links in the README.
project-name/documents/
Any documents you want to version control. This often includes background documents such as data dictionaries, analysis plans, etc.
Note that it’s often more convenient to work in Google Docs or Microsoft Office than in plain text on collaborative documents. It’s also not great to version control non-plain text files (like Word documents). I prefer using a separate folder on Google Drive or OneDrive, and I link to this in the README. So the documents/ folder is often fairly empty for my projects.
project-name/output/
Temporary space for output files from code. Everything in this folder should be considered ephemeral and should be easy to re-generate by re-running code. Anything I want to save gets manually moved into documents/ or into the Google Drive/OneDrive folder.
The .gitignore file here indicates that the contents of this folder are not in version control. Git does not track empty folders, but if you put a .gitignore file inside output/ that contains the following, the otherwise empty output/ folder will be included in the git repository:
*
!.gitignore
More on git below.
project-name/src/
This is where all the source code for the project lives.
Use a separate folder for each programming language/statistical package. This is because many languages require package and project management files, like project-name.Rproj files for R and Pipfile/Pipfile.lock for Python projects using Pipenv.
project-name/studydata/
This is where I put all the data related to my project. Nearly all my project folders relate to research studies, which is where the name comes from.
I typically do not want to version control any data. For research involving human subjects, accidentally including participant-level data in a git repository can be a major headache and potentially harm study participants! In general, putting large data files into git is a bad idea because it can slow down common operations and eventually make the repository too large for services like GitHub.
Therefore, I tell git to ignore any folder on my system called studydata/ and make sure that I always store study data inside a studydata/ folder.
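Git reads system-wide ignore patterns from the file named by its `core.excludesFile` setting (which you can set with `git config --global core.excludesFile <path>`). The sketch below shows one way to add the `studydata/` pattern to such a file without duplicating it. The helper name `add_global_ignore` is hypothetical, and the `~/.gitignore_global` path in the usage note is only an assumption about where your excludes file lives.

```python
from pathlib import Path


def add_global_ignore(excludes_file, pattern="studydata/"):
    """Append a pattern to a git excludes file unless it is already listed.

    Returns True if the file was changed, False if the pattern was present.
    """
    path = Path(excludes_file).expanduser()
    text = path.read_text() if path.exists() else ""

    # Nothing to do if the exact pattern is already on its own line
    if pattern in (line.strip() for line in text.splitlines()):
        return False

    # Make sure the new pattern starts on a fresh line
    separator = "" if text == "" or text.endswith("\n") else "\n"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text + separator + pattern + "\n")
    return True
```

You might call it as `add_global_ignore("~/.gitignore_global")`, after pointing `core.excludesFile` at that same file.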
Why not just data/? I could do all this with data/ instead of studydata/, but lots of projects have folders called data/ and globally ignoring them in git could have unintended consequences.
Not working on research studies? Then use projectdata/.
git
I’ve mentioned git multiple times here. This is the most popular version control software used by programmers (think track changes in Microsoft Word, but for code). If you’re new to git, here are some resources for learning more about it.
My snarky philosophy is: code does not exist unless it is version controlled. Put a little more delicately, it’s fine for proof of concept/exploration to be ephemeral, but once you get to the point where you’re making a decision based on analysis, the code generating that analysis should be in version control.
For projects that are more on the software development than the data science/analysis side of the spectrum, git init should always happen before you write a line of code.
A note on dates in folder and file names
When including dates in folder or file names, always write them in 2020-03-15 format. This allows natural sorting by date when looking at a folder’s contents (this won’t work if you use 03-15-2020).
It’s generally better to use analysis_output_2020-03-15.txt rather than analysis_output_v17.txt. This solves the problem of when to increment the version number (every day if it’s a date), and provides an easy path for more frequently updated versions (include the clock time after the date in the file name).
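A minimal Python sketch of this naming scheme follows. The helper `dated_filename` is hypothetical, and the `_HHMM` suffix is just one possible way to add the clock time; the text above does not prescribe a time format.

```python
from datetime import datetime


def dated_filename(stem, extension, when=None, include_time=False):
    """Build a file name such as analysis_output_2020-03-15.txt (hypothetical helper)."""
    when = when or datetime.now()
    # Year-month-day order makes alphabetical order match chronological order
    stamp = when.strftime("%Y-%m-%d_%H%M" if include_time else "%Y-%m-%d")
    return f"{stem}_{stamp}.{extension}"
```

With `when=datetime(2020, 3, 15)`, `dated_filename("analysis_output", "txt", when)` returns `analysis_output_2020-03-15.txt`. The sorting benefit is easy to check: `sorted(["2021-01-02", "2020-03-15"])` puts the 2020 date first, while `sorted(["03-15-2020", "01-02-2021"])` puts the 2021 date first.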
[1] The easiest way to version data is to put a date in the filename of the raw dataset and then reference that in the README.
A better way is to use a checksum. Here are instructions for safely calculating an MD5 checksum on any file.
It can also be helpful to create a .zip archive of the data files and provide the checksum of that archive. This helps prevent accidental modification of data.
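If you would rather compute the checksum from code, Python’s standard `hashlib` module can do it. This is a sketch rather than the procedure from the linked instructions; the function name `md5_checksum` and the 1 MiB chunk size are arbitrary choices for this example.

```python
import hashlib


def md5_checksum(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large data files never sit fully in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Record the returned string in the README next to the data file’s name. Re-running the function later and comparing the two values shows whether the file has changed.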