Exploring Data
Last updated: January 18, 2021
Setup
library(MASS) # Needed for `birthwt` dataset
library(tidyverse)
# Non-tidyverse packages
library(skimr) # This appears to be tidyverse-adjacent
library(summarytools)
df <- birthwt
df$race_factor <- factor(df$race)
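The factor above keeps the bare codes 1, 2, and 3 as its levels, which is why the later output is hard to read. As a minimal sketch, assuming the coding given in `?MASS::birthwt` (1 = white, 2 = black, 3 = other), readable labels could be attached like this. The column name `race_labeled` is only an illustrative choice, not something used elsewhere on this page.

```r
# Sketch: attach readable labels to the integer race codes.
# The coding (1 = white, 2 = black, 3 = other) comes from ?MASS::birthwt.
df <- MASS::birthwt

df$race_labeled <- factor(   # `race_labeled` is a hypothetical column name
  df$race,
  levels = c(1, 2, 3),
  labels = c("white", "black", "other")
)

# Frequency counts now print with labels instead of bare codes
table(df$race_labeled)
```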
Initial exploration of a new dataset
glimpse() lists out columns, their types, and some example values
This function is part of tibble, the tidyverse extension of data.frame.
df %>% glimpse()
## Observations: 189
## Variables: 11
## $ low <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ age <int> 19, 33, 20, 21, 18, 21, 22, 17, 29, 26, 19, 19, 22, …
## $ lwt <int> 182, 155, 105, 108, 107, 124, 118, 103, 123, 113, 95…
## $ race <int> 2, 3, 1, 1, 1, 3, 1, 3, 1, 1, 3, 3, 3, 3, 1, 1, 2, 1…
## $ smoke <int> 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1…
## $ ptl <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ ht <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ ui <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ ftv <int> 0, 3, 1, 2, 0, 0, 1, 1, 1, 0, 0, 1, 0, 2, 0, 0, 0, 3…
## $ bwt <int> 2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663…
## $ race_factor <fct> 2, 3, 1, 1, 1, 3, 1, 3, 1, 1, 3, 3, 3, 3, 1, 1, 2, 1…
summary() shows summary stats
This is part of base R and is a convenient, if not overly pretty, way of getting summary stats like mean, median, etc.
df %>% summary()
## low age lwt race
## Min. :0.0000 Min. :14.00 Min. : 80.0 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:19.00 1st Qu.:110.0 1st Qu.:1.000
## Median :0.0000 Median :23.00 Median :121.0 Median :1.000
## Mean :0.3122 Mean :23.24 Mean :129.8 Mean :1.847
## 3rd Qu.:1.0000 3rd Qu.:26.00 3rd Qu.:140.0 3rd Qu.:3.000
## Max. :1.0000 Max. :45.00 Max. :250.0 Max. :3.000
## smoke ptl ht ui
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.3915 Mean :0.1958 Mean :0.06349 Mean :0.1481
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :3.0000 Max. :1.00000 Max. :1.0000
## ftv bwt race_factor
## Min. :0.0000 Min. : 709 1:96
## 1st Qu.:0.0000 1st Qu.:2414 2:26
## Median :0.0000 Median :2977 3:67
## Mean :0.7937 Mean :2945
## 3rd Qu.:1.0000 3rd Qu.:3487
## Max. :6.0000 Max. :4990
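One thing the output above makes visible: summary() treats integer-coded categories such as race as numbers, so it reports a mean of 1.847 that has no real interpretation. A possible workaround, sketched here under the assumption that age, lwt, and bwt are the continuous columns of interest, is to select those columns first. The call is written as dplyr::select() because MASS also exports a function named select().

```r
library(dplyr)

df <- MASS::birthwt

# Keep only the continuous columns before summarizing.
# dplyr::select() is namespaced because MASS also exports a select().
continuous_summary <- df %>%
  dplyr::select(age, lwt, bwt) %>%
  summary()

continuous_summary
```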
table() to show frequency counts for a variable
Also part of base R, this is another quick way to show frequency counts.
df$race %>% table()
## .
## 1 2 3
## 96 26 67
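If percentages are wanted alongside the counts, base R's prop.table() can be applied to the result of table(). The following is a small sketch rather than a polished report; useNA = "ifany" is included so that missing values would show up as their own category if the column had any.

```r
df <- MASS::birthwt

# Counts per category; an NA category appears only if NAs are present
counts <- table(df$race, useNA = "ifany")

# Convert counts to shares of the total, then show them as percentages
props <- prop.table(counts)
round(100 * props, 1)
```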
skimr
This is a tidy-compatible package for summarizing data.
df %>% skimr::skim()
## Skim summary statistics
## n obs: 189
## n variables: 11
##
## ── Variable type:factor ──────────────────────────────────────────────────────────
## variable missing complete n n_unique top_counts
## race_factor 0 189 189 3 1: 96, 3: 67, 2: 26, NA: 0
## ordered
## FALSE
##
## ── Variable type:integer ─────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## age 0 189 189 23.24 5.3 14 19 23 26 45 ▃▇▇▃▃▁▁▁
## bwt 0 189 189 2944.59 729.21 709 2414 2977 3487 4990 ▁▁▅▇▇▆▂▁
## ftv 0 189 189 0.79 1.06 0 0 0 1 6 ▇▃▂▁▁▁▁▁
## ht 0 189 189 0.063 0.24 0 0 0 0 1 ▇▁▁▁▁▁▁▁
## low 0 189 189 0.31 0.46 0 0 0 1 1 ▇▁▁▁▁▁▁▃
## lwt 0 189 189 129.81 30.58 80 110 121 140 250 ▃▇▅▂▂▁▁▁
## ptl 0 189 189 0.2 0.49 0 0 0 0 3 ▇▁▁▁▁▁▁▁
## race 0 189 189 1.85 0.92 1 1 1 3 3 ▇▁▁▂▁▁▁▆
## smoke 0 189 189 0.39 0.49 0 0 0 1 1 ▇▁▁▁▁▁▁▅
## ui 0 189 189 0.15 0.36 0 0 0 0 1 ▇▁▁▁▁▁▁▂
Here’s what it looks like with a factor variable (Species):
iris %>% skimr::skim()
## Skim summary statistics
## n obs: 150
## n variables: 5
##
## ── Variable type:factor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n n_unique top_counts
## Species 0 150 150 3 set: 50, ver: 50, vir: 50, NA: 0
## ordered
## FALSE
##
## ── Variable type:numeric ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## Petal.Length 0 150 150 3.76 1.77 1 1.6 4.35 5.1 6.9 ▇▁▁▂▅▅▃▁
## Petal.Width 0 150 150 1.2 0.76 0.1 0.3 1.3 1.8 2.5 ▇▁▁▅▃▃▂▂
## Sepal.Length 0 150 150 5.84 0.83 4.3 5.1 5.8 6.4 7.9 ▂▇▅▇▆▅▂▂
## Sepal.Width 0 150 150 3.06 0.44 2 2.8 3 3.3 4.4 ▁▂▅▇▃▂▁▁
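skim() also accepts column names after the data frame, which helps when a dataset is wide and only a few variables matter. This is a minimal sketch assuming a reasonably recent skimr; the choice of age, lwt, and bwt is only for illustration, and the printed layout differs between skimr versions, so no output is shown.

```r
library(skimr)

df <- MASS::birthwt

# Summarize only the named columns instead of the whole data frame
skimmed <- skim(df, age, lwt, bwt)

skimmed
```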
Another option: summarytools::dfSummary()
This is similar to skimr::skim(), but it is less conservative about
horizontal space and can in theory use image-based graphs rather than
ASCII for histograms (though this requires X11, which doesn’t work on my
system for some reason):
dfSummary(iris)
## Data Frame Summary
## iris
## Dimensions: 150 x 5
## Duplicates: 1
##
## ----------------------------------------------------------------------------------------------------------------------
## No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
## ---- --------------- ------------------------ -------------------- -------------------------------- -------- ---------
## 1 Sepal.Length Mean (sd) : 5.8 (0.8) 35 distinct values . . : : 150 0
## [numeric] min < med < max: : : : : (100%) (0%)
## 4.3 < 5.8 < 7.9 : : : : :
## IQR (CV) : 1.3 (0.1) : : : : :
## : : : : : : : :
##
## 2 Sepal.Width Mean (sd) : 3.1 (0.4) 23 distinct values : 150 0
## [numeric] min < med < max: : (100%) (0%)
## 2 < 3 < 4.4 . :
## IQR (CV) : 0.5 (0.1) : : : :
## . . : : : : : :
##
## 3 Petal.Length Mean (sd) : 3.8 (1.8) 43 distinct values : 150 0
## [numeric] min < med < max: : . : (100%) (0%)
## 1 < 4.3 < 6.9 : : : .
## IQR (CV) : 3.5 (0.5) : : : : : .
## : : . : : : : : .
##
## 4 Petal.Width Mean (sd) : 1.2 (0.8) 22 distinct values : 150 0
## [numeric] min < med < max: : (100%) (0%)
## 0.1 < 1.3 < 2.5 : . . :
## IQR (CV) : 1.5 (0.6) : : : : .
## : : : : : . : : :
##
## 5 Species 1. setosa 50 (33.3%) IIIIII 150 0
## [factor] 2. versicolor 50 (33.3%) IIIIII (100%) (0%)
## 3. virginica 50 (33.3%) IIIIII
## ----------------------------------------------------------------------------------------------------------------------
Frequency counts of categorical variables
When using R, I frequently miss Stata’s tab command for summarizing
categorical data. This is what it looks like:
. webuse iris
(Iris data)
. tab iris
Iris |
species | Freq. Percent Cum.
------------+-----------------------------------
setosa | 50 33.33 33.33
versicolor | 50 33.33 66.67
virginica | 50 33.33 100.00
------------+-----------------------------------
Total | 150 100.00
To achieve this in R, the best thing I’ve found is summarytools::freq():
freq(iris$Species)
## Frequencies
## iris$Species
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ---------------- ------ --------- -------------- --------- --------------
## setosa 50 33.33 33.33 33.33 33.33
## versicolor 50 33.33 66.67 33.33 66.67
## virginica 50 33.33 100.00 33.33 100.00
## <NA> 0 0.00 100.00
## Total 150 100.00 100.00 100.00 100.00
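For cases where a plain data frame is more convenient than printed output, roughly the same columns as Stata's tab (frequency, percent, cumulative percent) can be built with dplyr. This is only a sketch; the column names freq, percent, and cum are my own choices, not anything produced by Stata or summarytools.

```r
library(dplyr)

# Frequency, percent, and cumulative percent for one categorical variable
tab <- iris %>%
  count(Species, name = "freq") %>%
  mutate(
    percent = 100 * freq / sum(freq),  # share of all rows
    cum     = cumsum(percent)          # running total of the shares
  )

tab
```

Because the result is an ordinary data frame, it can be piped into further steps such as filtering or plotting.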
Contingency tables (“2x2 tables”)
I also miss two-way tab (and tab2) from Stata, which looks like this:
. tab tobacco parent, col
+-------------------+
| Key |
|-------------------|
| frequency |
| column percentage |
+-------------------+
| 1 = either parent
| smoked
tobacco usage | nonsmokin smoking | Total
----------------------+----------------------+----------
0 cigarettes | 5,177 4,292 | 9,469
| 70.49 56.06 | 63.13
----------------------+----------------------+----------
1 to 7 cigarettes/day | 1,593 2,213 | 3,806
| 21.69 28.91 | 25.37
----------------------+----------------------+----------
8 to 12 cigarettes/da | 378 672 | 1,050
| 5.15 8.78 | 7.00
----------------------+----------------------+----------
more than 12 cigarett | 196 479 | 675
| 2.67 6.26 | 4.50
----------------------+----------------------+----------
Total | 7,344 7,656 | 15,000
| 100.00 100.00 | 100.00
You can accomplish something similar with summarytools::ctable(), shown here on the tobacco dataset that ships with summarytools:
summarytools::ctable(tobacco$smoker, tobacco$diseased, prop = "c")
## Cross-Tabulation, Column Proportions
## smoker * diseased
## Data Frame: tobacco
##
## -------- ---------- -------------- -------------- ---------------
## diseased Yes No Total
## smoker
## Yes 125 ( 55.8%) 173 ( 22.3%) 298 ( 29.8%)
## No 99 ( 44.2%) 603 ( 77.7%) 702 ( 70.2%)
## Total 224 (100.0%) 776 (100.0%) 1000 (100.0%)
## -------- ---------- -------------- -------------- ---------------
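When summarytools is not available, base R can produce the same kind of column percentages with table() and prop.table(). The sketch below uses the birthwt data from the Setup section instead of tobacco, crossing smoke with low; that pairing is just an illustrative choice.

```r
df <- MASS::birthwt

# Two-way counts: rows are smoking status, columns are low birth weight
counts <- table(smoke = df$smoke, low = df$low)

# margin = 2 divides each cell by its column total (column proportions)
col_pct <- prop.table(counts, margin = 2)

addmargins(counts)        # counts with row and column totals
round(100 * col_pct, 1)   # column percentages
```

Using margin = 1 instead would give row percentages, the analogue of Stata's row option.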
ℹ️ This page is part of my knowledge base for R, the popular statistical programming language. I attempt to use idiomatic practices with the tidyverse collection of packages as much as possible. If you have suggestions for ways to improve this code, please contact me or use the survey link below.