August 12, 2015

Why It's Worth Using a Real Text Editor for Data Analysis

I’ve been an advocate of the Sublime Text editor for many years, first as a software developer and now for data analysis. It is a huge boost to productivity over less full-featured text editors, and has the unique advantage¹ of running natively on Windows, Mac, and Linux.

I see many people using the built-in editor for their statistics package, especially SAS and Stata users. I would encourage everyone who does this to take half a day and learn about Sublime Text. At the very least, you may identify problems that would take hours to solve in a default editor, and seconds in Sublime Text.

I recently ran into one of these problems in my own work, and want to share it here.

A motivating example:

Say you wrote a bunch of Stata analysis code for Dataset A, and now you want to run it on Dataset B. Both datasets have the same variable names, except for one key difference: Dataset A’s names are mixed case (like VAR_Name) and Dataset B’s are lowercase (like var_name). Stata has case-sensitive variable names, so the analysis code won’t run on Dataset B without fixing the variable names somehow.

Without Sublime Text, the best option would be to go through all your analysis code and make the variable names lowercase by hand. There’s no automatic way to change the case of the variables in Dataset B to match, because there’s no logic to the capitalization of the variable names in Dataset A.

Enter Sublime Text!

Here’s how I would solve this problem in Sublime Text:

Get a list of all the variables in Dataset A² and paste into Sublime Text. They will be separated by spaces.
Use multiple cursors to create rename commands for all the variables at once.
Paste the rename commands into my .do file, and everything works!

Here’s how that would look in Sublime:

This would take exactly the same amount of time to do in Sublime Text with 4 variables (like in the video) or 4,000.
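For readers who cannot watch the screencast, here is a minimal Python sketch of the text that those steps end up producing. It is a scripted stand-in for illustration, not the multiple-cursor workflow itself. It assumes the rename commands are run on Dataset B, turning each lowercase name back into Dataset A’s mixed-case spelling; the variable names and the function name `rename_commands` are made up for the example.

```python
def rename_commands(variable_list):
    """Build one Stata `rename old new` command per variable name.

    `variable_list` is the space-separated list of Dataset A's
    mixed-case names, as copied out of Stata.
    """
    commands = []
    for name in variable_list.split():
        lower = name.lower()
        # Names that are already lowercase need no rename.
        if lower != name:
            commands.append("rename {} {}".format(lower, name))
    return commands


if __name__ == "__main__":
    # Hypothetical variable names, standing in for Dataset A's list.
    dataset_a_vars = "VAR_Name Patient_ID visitDate age"
    print("\n".join(rename_commands(dataset_a_vars)))
```

The printed lines can be pasted at the top of the .do file, after Dataset B is loaded, in the same way as the commands built with multiple cursors.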

The key feature here is multiple cursors. This is an extremely versatile feature. Even if your stats package does not have case-sensitive variable names (e.g., SAS), you should be able to imagine a whole set of problems that this can quickly solve.³
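To make the example in footnote 3 concrete, here is a rough Python sketch of that rearrangement using a regular expression. This is a different technique from multiple cursors, shown only to spell out what the transformation does. The sample names and the function name `move_marker_to_end` are hypothetical, and the sketch assumes the literal marker `_MOVEME_` appears at most once per name.

```python
import re

# Capture whatever comes before and after the literal "_MOVEME_" marker.
# The first group is lazy, so it stops at the first occurrence of the marker.
PATTERN = re.compile(r"^(.+?)_MOVEME_(.+)$")


def move_marker_to_end(name):
    """Turn substr1_MOVEME_substr2 into substr1_substr2_MOVEME.

    Names that do not contain the marker are returned unchanged.
    """
    return PATTERN.sub(r"\1_\2_MOVEME", name)


if __name__ == "__main__":
    # Hypothetical variable names with parts of varying length.
    for name in ["height_MOVEME_cm", "baseline_MOVEME_weight_kg", "age"]:
        print(name, "->", move_marker_to_end(name))
```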

In addition to multiple cursors, there are lots of other helpful Sublime Text features and plugins. One example is the Paste as Column plugin:

Screencast of the Paste as Column plugin for Sublime Text in action.

Resources for getting started with Sublime Text

lynda.com Sublime Text 2 course (not free, but you can do a free trial)
tuts+ free Sublime Text 2 screencast series
Unofficial documentation
Official documentation

Note that Sublime Text 3 is currently in beta, but is very stable in my experience, and is what I would start with.

Sublime Text is not cheap, but there is a demo that you can use indefinitely (the only difference between the demo and paid versions is a periodic reminder to buy a license).

Sublime Text support for stats packages

Note that you will want to get plugins for your statistics package so you have syntax highlighting and other language-specific features:

Stata: try Stata Enhanced
SAS: There is a Sublime plugin for SAS, but I haven’t tried it yet. I use this old TextMate bundle to get syntax highlighting.
There is built-in support for R syntax highlighting, and it is possible to run the R REPL in Sublime Text.

¹ GitHub’s Atom editor is a new, cross-platform alternative to Sublime Text, but still suffers from some performance issues with large files, which is problematic for data analysis.
² To do this in Stata, select all the variables in the Variables sidebar and click the little arrow next to one of them. This will add all the variables to the Command box, and you can copy them from there.
³ Off the top of my head: renaming variables like substr1_MOVEME_substr2 to substr1_substr2_MOVEME, where substr1 and substr2 are placeholders for words of varying length.
