July 10, 2014
Ending Statistics Software Monoculture
Brian Danielak, writing about how he would improve his graduate program:
If the courses are going to incorporate doing statistical analysis, they should move away from SPSS. There’s no reason students should have to pay for statistics software when there are fantastic, FREE, industry-standard software toolkits to do statistics (including the R+Rstudio ecosystem and the IPython Notebook). Think about it: not only are we asking students to pay NOW when they take the courses, we’re also only teaching them that system, which means later in their careers they’ll have to spend more money to buy SPSS again, because it’s the only system they were trained on.
(Emphasis mine.)
SPSS is one of what I call the “Big Three” paid statistics packages, along with Stata and SAS. It seems like most stats-heavy graduate programs pick one of these three packages, and then teach and use it exclusively.1
Brian mentions one downside of a stats package monoculture: if the chosen package is expensive, students will have to pay licensing fees for the rest of their careers. The reality is that most non-statisticians have neither the time nor the desire to learn a different stats package on their own after they graduate. So they’re stuck.
My graduate program uses SAS – and unless I work at a university that already has a SAS license, I will most likely not be able to use it in my next job because it costs $8,700 for a single copy for personal use.
There are certainly some advantages to SAS. It is easier to learn than R or IPython for non-programmers. With little configuration, it produces great output for certain statistical procedures, output that would be very difficult to produce in R or even Stata. And I’m more confident that the SAS statistical algorithms are bug-free than I am about the algorithms in open source software.
So I’m not sure that the solution is to replace the Big Three completely with an open source (R or IPython) monoculture. Instead, one of the open source packages should be taught alongside a paid package.
There are some big advantages to this:
R and Python are “real” programming languages,2 which makes it much easier to do certain tasks that are very difficult in a stats package. For example, merging a bunch of Excel files into a single table, complex reshaping of data, and programmatically manipulating statistics procedure output are all much easier in Python than in SAS or Stata in my experience.
Experience with a “real” programming language beyond a Big Three statistics package will be a huge benefit for students, who will increasingly be expected to work with large datasets from multiple sources, interface with web APIs,3 and do more complex analysis.
Learning two stats packages will help students develop a better understanding of both packages, just like learning a foreign language can improve your understanding of the structure and grammar of your native language. Likewise, students will find it easier to learn additional stats packages in the future.
R and Python bring some new ideas to the table. For example, both have “notebooks” (R Markdown in RStudio and IPython notebooks, respectively), which allow you to easily mix prose, equations, and code. RStudio has a visual debugger, which is a massive time-saver. IPython can run on an iPad.
And of course, students will not be completely at the mercy of absurd license fees for SAS or SPSS for the rest of their careers.
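To make the first advantage a bit more concrete, here is a minimal sketch of merging several files into one table and then reshaping it from wide to long. It makes some assumptions: the file names, column names, and values are invented for illustration, and it reads CSV text with Python's standard library instead of real Excel files so that it runs without installing anything. In practice you would probably reach for a data analysis library instead.

```python
import csv
import io

# Hypothetical per-site exports (made-up names and values).
# In practice these would be files on disk rather than inline strings.
SITE_FILES = {
    "site_a.csv": "id,score_2013,score_2014\n1,10,12\n2,8,9\n",
    "site_b.csv": "id,score_2013,score_2014\n3,7,11\n",
}


def merge_tables(files):
    """Stack the rows from every file into one list, tagging each row with its source."""
    merged = []
    for name, text in files.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["source"] = name
            merged.append(row)
    return merged


def wide_to_long(rows, prefix="score_"):
    """Reshape one row per id (wide) into one row per id per year (long)."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column.startswith(prefix):
                long_rows.append({
                    "id": int(row["id"]),
                    "source": row["source"],
                    "year": int(column[len(prefix):]),
                    "score": float(value),
                })
    return long_rows


merged = merge_tables(SITE_FILES)
long_table = wide_to_long(merged)
print(len(merged), "wide rows ->", len(long_table), "long rows")
```

The point is less the specific code than the fact that ordinary data structures (lists and dictionaries here) are enough to express both steps in a few lines.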
I would pick Stata and R + RStudio
If I had to pick, I would recommend Stata and R using the RStudio IDE.
I like Stata because I think it is easier to learn than SAS and I think the REPL-style interface (typing a command produces immediate output) is much faster and easier to debug than a SAS-style interface. I’ve never used SPSS, and considering it’s in the same price range as SAS I probably never will.
Stata licensing is also not hostile to users. While it is expensive4, it costs less than a car and you don’t have to call sales to get a price quote. It also is cross-platform, which is great for serious programmers and the many college students who prefer Macs over Windows PCs.
R and IPython share a lot of features, but R is much more widely used in the world of statistics and RStudio is the best interface I’ve seen in any stats package. Also, after learning R, picking up IPython should be fairly easy.
1. I’m talking about graduate programs outside statistics departments, e.g. social science programs. It seems that this may not be true for at least some graduate programs in statistics: most statisticians I know seem to use R in addition to at least one of the Big Three.
2. By this, I am referring to the big practical advantages for data analysis provided by language features in R and Python like a wide range of useful data structures (e.g. arrays and hashes), and having non-study data variables that are first class citizens. Later, I refer to interfacing with APIs and other programming tasks, which are much easier in a “real” programming language than SAS/Stata/SPSS for a myriad of reasons.
3. An API is an “application programming interface,” or a way for two programs to talk to each other. In this case, I might want my analysis program to interface with this API from CMS.
4. $595 for a perpetual academic license, $295 for an annual license.