Stata: Tips & Snippets
Last updated: January 18, 2021
ℹ️ I wrote these notes for myself as a reference, and published them here in case they are helpful for others.
Table of contents:
- Dates and times
- Working with missing data
- Reading MySQL data directly into Stata
- Using
bysortto run commands over levels of a categorical variable - Add group counts to dataset
- Add column total variable
- Calculating arbitrary percentiles
- Counting the number of distinct values of a variable
- Create value labels from another variable
- Summarizing dataset contents
- Flow control
- Auto-complete variable names in Sublime Text
- Colors
- Splitting up long lines
- Reloading
.adoprograms - Working with very large datasets
- Suggestions? Corrections?
Dates and times
Date types in Stata
Summary tables
| Date type | Examples of HRFs |
|---|---|
| datetime | 20jan2010 09:15:22.120 |
| date | 20jan2010, 20/01/2010, ... |
| weekly date | 2010w3 |
| monthly date | 2010m1 |
| quarterly date | 2010q1 |
| half-yearly date | 2010h1 |
| yearly date | 2010 |
| SIF type | Functions to build from components |
|---|---|
| datetime/c | tc = mdyhms(M, D, Y, h, m, s) |
tc = dhms(td, h, m, s) |
|
tc = hms(h, m, s) |
|
| datetime/C | tC = Cmdyhms(M, D, Y, h, m, s) |
tC = Cdhms(td, h, m, s) |
|
tC = Chms(h, m, s) |
|
| date | td = mdy(M, D, Y) |
| weekly date | tw = yw(Y, W) |
| monthly date | tm = ym(Y, M) |
| quarterly date | tq = yq(Y, Q) |
| half-yearly date | th = yh(Y, H) |
| yearly date | ty = y(Y) |
| From ↓ | To datetime/c | To datetime/C | To date |
|---|---|---|---|
| datetime/c | tC = Cofc(tc) |
td = dofc(tc) |
|
| datetime/C | tc = cofC(tC) |
td = dofC(tC) |
|
| date | tc = cofd(td) |
tC = Cofd(td) |
|
| weekly | td = dofw(tw) |
||
| monthly | td = dofm(tm) |
||
| quarterly | td = dofq(tq) |
||
| half-yearly | td = dofh(th) |
||
| yearly | td = dofy(ty) |
Source: Stata documentation
HRF = human readable form
SIF = Stata internal form
Converting strings to SIF date
// String format: 2016-09-28T09:53:38
gen double newvarname = clock(varname, "20YMD#hms")
format newvarname %tc
// String format: 2016-09-28 09:53:38
gen double newvarname = clock(varname, "20YMDhms")
format newvarname %tc
// String format: 2016-09-28
gen double newvarname = date(varname, "20YMD")
format newvarname %tdExtracting hours, minutes, and seconds from SIF dates
gen hours = hh(tcvar)
gen minutes = mm(tcvar)
gen seconds = ss(tcvar)
gen dayofweek = dow(dofc(tcvar))
gen month = month(dofc(tcvar))
gen year = year(dofc(tcvar))Getting the last day of a given month
local yearmonth = ym(2016,2)
di %td dofm(`yearmonth' + 1) - 1
// 29feb2016
di day(dofm(`yearmonth' + 1) - 1)
// 29
Working with missing data
Use missing(varname1, varname2) rather than == . or == "". Details here.
Reading MySQL data directly into Stata
clear
set odbcdriver ansi
odbc list
odbc load, table("tablename") dsn("mysql") clear noquoteOr you can do:
odbc load, exec("select * from tablename") dsn("mysql") clear noquoteOther resources:
Using bysort to run commands over levels of a categorical variable
For example, if you have a categorical variable gender and a continuous variable height, and you want to run summarize by gender (summarize doesn’t support by), you can do this:
bysort gender: summarize heightAdd group counts to dataset
If you want to fill in the n_group column in a dataset like this…
| id | group | n_group |
|---|---|---|
| 1 | 1 | 3 |
| 2 | 1 | 3 |
| 3 | 1 | 3 |
| 4 | 2 | 2 |
| 5 | 2 | 2 |
…you can use this command: egen n_group = count(group), by(group)
Add column total variable
This will sum a column and generate a new variable (the same for each row) with the total:
egen sumvar = sum(var)Calculating arbitrary percentiles
centile varname, centile(70) will give you the 70th percentile for varname.
If you want this as a variable called new_varname in your dataset:
egen p70 = pctile(new_varname), p(70)Counting the number of distinct values of a variable
* Option 1
tab varname, nofreq
display r(r)
* Option 2
inspect varlist
display r(N_unique)
* Option 3
codebook varlist(Source: statdaily.com)
Create value labels from another variable
net install labutil
labmask integervarname, values(stringvarname)Summarizing dataset contents
inspect VARNAME: “Display simple summary of data’s attributes”codebook VARNAME: “Describe data contents”- Documenting and searching a Stata dataset
Flow control
For loops
foreach i in 0 1 2 {
di `i'
}
* 0
* 1
* 2
foreach i in a b c {
di "`a'"
}
* a
* b
* c
forval i = 0/2 {
di `i'
}
* 0
* 1
* 2If you want to loop through all levels of a variable:
levelsof varname, local(levels)
foreach l of local levels {
whatever if varname == `l'
}Auto-complete variable names in Sublime Text
I like writing Stata code in Sublime Text, but I miss variable auto-completion. To get a poor man’s version of this, run the ds command in Stata, and then copy and paste the results into a /* */ comment block in your .do file in Sublime.
Colors
palette_all command.Splitting up long lines
There are two ways to do this:
* Option 1
regress lnwage educ complete age c.age#c.age ///
exp c.exp#c.exp tenure c.tenure#c.tenure ///
i.region female
* Option 2 (only works in do-files and ado-files)
#delimit ;
regress lnwage educ complete age c.age#c.age
exp c.exp#c.exp tenure c.tenure#c.tenure
i.region female ;
predict e, resid ;
#delimit crSee help delimit for more details.
Reloading .ado programs
Use the -discard- command and then run -program_name-. (source)
Working with very large datasets
Stata suggests looking at ftools and gtools.
Suggestions? Corrections?
Please contact me.