Ending statistics software monoculture

Brian Danielak, writing about how he would improve his graduate program:

If the courses are going to incorporate doing statistical analysis, they should move away from SPSS. There’s no reason students should have to pay for statistics software when there are fantastic, FREE, industry-standard software toolkits to do statistics (including the R+Rstudio ecosystem and the IPython Notebook). Think about it: not only are we asking students to pay NOW when they take the courses, we’re also only teaching them that system, which means later in their careers they’ll have to spend more money to buy SPSS again, because it’s the only system they were trained on.

(Emphasis mine.)

SPSS is one of what I call the “Big Three” paid statistics packages, along with Stata and SAS. It seems like most stats-heavy graduate programs pick one of these three programs, and then teach and use it exclusively.[1]

Brian mentions one downside of a stats package monoculture: if the chosen package is expensive, students will have to pay licensing fees for the rest of their careers. The reality is that most non-statisticians have neither the time nor the desire to learn a different stats package on their own after they graduate. So they’re stuck.

My graduate program uses SAS – and unless I work at a university that already has a SAS license, I will most likely not be able to use it in my next job because it costs $8,700 for a single copy for personal use.

There are certainly some advantages to SAS. It is easier for non-programmers to learn than R or IPython. With little configuration, it produces great output for certain statistical procedures, output that would be very difficult to produce in R or even Stata. And I’m more confident that the SAS statistical algorithms are bug-free than I am about those in open source software.

So I’m not sure that the solution is to replace the Big Three completely with an open source (R or IPython) monoculture. Instead, one of the open source packages should be taught alongside a paid package.

There are some big advantages to this:

I would pick Stata and R + RStudio

If I had to pick, I would recommend Stata and R using the RStudio IDE.

I like Stata because I think it is easier to learn than SAS, and because its REPL-style interface (typing a command produces immediate output) is much faster to work in and easier to debug than a SAS-style interface. I’ve never used SPSS, and considering it’s in the same price range as SAS, I probably never will.

Stata licensing is also not hostile to users. While it is expensive,[4] it costs less than a car and you don’t have to call sales to get a price quote. It is also cross-platform, which is great for serious programmers and the many college students who prefer Macs over Windows PCs.

R and IPython share a lot of features, but R is much more widely used in the world of statistics and RStudio is the best interface I’ve seen in any stats package. Also, after learning R, picking up IPython should be fairly easy.
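
To give a rough sense of the kind of features the two share, here is a minimal Python sketch of general-purpose data structures sitting alongside study data. The records, site names, and the `mean_score_by_site` helper are all invented for illustration; this is not taken from any real analysis.

```python
from statistics import mean

# Hypothetical study data: one record per participant (values invented).
records = [
    {"site": "A", "score": 10.0},
    {"site": "A", "score": 14.0},
    {"site": "B", "score": 9.0},
]

# A non-study variable: an ordinary lookup table kept next to the data.
site_labels = {"A": "Clinic A", "B": "Clinic B"}


def mean_score_by_site(rows):
    """Group scores by site in a dict (a hash), then average each group."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["site"], []).append(row["score"])
    return {site: mean(scores) for site, scores in grouped.items()}


means = mean_score_by_site(records)
for site, value in means.items():
    print(f"{site_labels[site]}: {value:.1f}")
```

The same idea carries over to R with named lists or vectors; the point is only that lookup tables and helper functions are ordinary values in the language, not something bolted onto a dataset.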

  1. I’m talking about graduate programs outside statistics departments, e.g. social science programs. It seems that this may not be true for at least some graduate programs in statistics: most statisticians I know seem to use R in addition to at least one of the Big Three.
  2. By this, I am referring to the big practical advantages for data analysis provided by language features in R and Python like a wide range of useful data structures (e.g. arrays and hashes), and having non-study data variables that are first class citizens. Later, I refer to interfacing with APIs and other programming tasks, which are much easier in a “real” programming language than SAS/Stata/SPSS for a myriad of reasons.
  3. An API is an “application programming interface,” or a way for two programs to talk to each other. In this case, I might want my analysis program to interface with this API from CMS.
  4. $595 for a perpetual academic license, $295 for an annual license.