Exploring Data

Last updated: July 18, 2020

Setup

library(MASS) # Needed for `birthwt` dataset
library(tidyverse)

# Non-tidyverse packages
library(skimr) # This appears to be tidyverse-adjacent
library(summarytools)

df <- birthwt
df$race_factor <- factor(df$race)

Initial exploration of a new dataset

glimpse() lists out columns, their types, and some example values

This function is part of tibble, the tidyverse extension of data.frame.

df %>% glimpse()
## Observations: 189
## Variables: 11
## $ low         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ age         <int> 19, 33, 20, 21, 18, 21, 22, 17, 29, 26, 19, 19, 22, …
## $ lwt         <int> 182, 155, 105, 108, 107, 124, 118, 103, 123, 113, 95…
## $ race        <int> 2, 3, 1, 1, 1, 3, 1, 3, 1, 1, 3, 3, 3, 3, 1, 1, 2, 1…
## $ smoke       <int> 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1…
## $ ptl         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ ht          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ ui          <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ ftv         <int> 0, 3, 1, 2, 0, 0, 1, 1, 1, 0, 0, 1, 0, 2, 0, 0, 0, 3…
## $ bwt         <int> 2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663…
## $ race_factor <fct> 2, 3, 1, 1, 1, 3, 1, 3, 1, 1, 3, 3, 3, 3, 1, 1, 2, 1…
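
The race_factor column above prints its levels as the bare codes 1, 2, and 3. As a minimal sketch, the codes could be given readable labels. The labels below assume the coding listed in the MASS documentation for birthwt (1 = white, 2 = black, 3 = other), and bw and race_labeled are illustrative names that are not used elsewhere on this page.

```r
library(MASS)  # provides the birthwt dataset

bw <- birthwt  # work on a copy so the df used elsewhere on this page is untouched

# Assumed coding, per ?birthwt: 1 = white, 2 = black, 3 = other
bw$race_labeled <- factor(
  bw$race,
  levels = c(1, 2, 3),
  labels = c("white", "black", "other")
)

# Frequency counts now carry the labels instead of the numeric codes
table(bw$race_labeled)
```

With labels attached, the output of summary(), table(), and skim() is easier to read without looking up the codebook.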

summary() shows summary stats

This is part of base R and is a convenient, if not especially pretty, way to get summary statistics such as the minimum, quartiles, median, and mean for numeric columns, and level counts for factors.

df %>% summary()
##       low              age             lwt             race
##  Min.   :0.0000   Min.   :14.00   Min.   : 80.0   Min.   :1.000
##  1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0   1st Qu.:1.000
##  Median :0.0000   Median :23.00   Median :121.0   Median :1.000
##  Mean   :0.3122   Mean   :23.24   Mean   :129.8   Mean   :1.847
##  3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0   3rd Qu.:3.000
##  Max.   :1.0000   Max.   :45.00   Max.   :250.0   Max.   :3.000
##      smoke             ptl               ht                ui
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000
##  Mean   :0.3915   Mean   :0.1958   Mean   :0.06349   Mean   :0.1481
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000
##  Max.   :1.0000   Max.   :3.0000   Max.   :1.00000   Max.   :1.0000
##       ftv              bwt       race_factor
##  Min.   :0.0000   Min.   : 709   1:96
##  1st Qu.:0.0000   1st Qu.:2414   2:26
##  Median :0.0000   Median :2977   3:67
##  Mean   :0.7937   Mean   :2945
##  3rd Qu.:1.0000   3rd Qu.:3487
##  Max.   :6.0000   Max.   :4990
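
summary() also works on a single column, and base R's tapply() can apply it within groups. The following is a rough sketch, not part of the original workflow, that summarizes birth weight separately by smoking status. It assumes smoke = 1 marks smokers, and by_smoke is an illustrative name.

```r
library(MASS)  # provides the birthwt dataset

# Apply summary() to bwt within each level of smoke
# (assumption: 0 = non-smoker, 1 = smoker)
by_smoke <- tapply(birthwt$bwt, birthwt$smoke, summary)

by_smoke
```

The result is a list with one summary per group, so an individual value can be pulled out with, for example, by_smoke[["1"]][["Median"]].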

table() shows frequency counts for a variable

Also part of base R, this is a quick way to show frequency counts for a single variable.

df$race %>% table()
## .
##  1  2  3
## 96 26 67
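
Raw counts are often more useful alongside proportions. Here is a small sketch using base R's prop.table(), which converts a table of counts into shares of the total. The names race_counts and race_share are illustrative.

```r
library(MASS)  # provides the birthwt dataset

race_counts <- table(birthwt$race)      # frequency count per race code
race_share <- prop.table(race_counts)   # each count divided by the grand total

# Percentages rounded to one decimal place
round(100 * race_share, 1)
```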

skimr

This is a tidy-compatible package for summarizing data.

df %>% skimr::skim()
## Skim summary statistics
##  n obs: 189
##  n variables: 11
##
## ── Variable type:factor ──────────────────────────────────────────────────────────
##     variable missing complete   n n_unique                 top_counts
##  race_factor       0      189 189        3 1: 96, 3: 67, 2: 26, NA: 0
##  ordered
##    FALSE
##
## ── Variable type:integer ─────────────────────────────────────────────────────────
##  variable missing complete   n     mean     sd  p0  p25  p50  p75 p100       hist
##       age       0      189 189   23.24    5.3   14   19   23   26   45   ▃▇▇▃▃▁▁▁
##       bwt       0      189 189 2944.59  729.21 709 2414 2977 3487 4990   ▁▁▅▇▇▆▂▁
##       ftv       0      189 189    0.79    1.06   0    0    0    1    6   ▇▃▂▁▁▁▁▁
##        ht       0      189 189    0.063   0.24   0    0    0    0    1   ▇▁▁▁▁▁▁▁
##       low       0      189 189    0.31    0.46   0    0    0    1    1   ▇▁▁▁▁▁▁▃
##       lwt       0      189 189  129.81   30.58  80  110  121  140  250   ▃▇▅▂▂▁▁▁
##       ptl       0      189 189    0.2     0.49   0    0    0    0    3   ▇▁▁▁▁▁▁▁
##      race       0      189 189    1.85    0.92   1    1    1    3    3   ▇▁▁▂▁▁▁▆
##     smoke       0      189 189    0.39    0.49   0    0    0    1    1   ▇▁▁▁▁▁▁▅
##        ui       0      189 189    0.15    0.36   0    0    0    0    1   ▇▁▁▁▁▁▁▂
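
skim() also accepts column names and respects dplyr groupings, which helps when only a few variables matter. The following is a hedged sketch that assumes skimr and dplyr are installed. The printed layout differs between skimr versions, so no output is shown, and grouped and bwt_by_smoke are illustrative names.

```r
library(MASS)   # provides the birthwt dataset
library(skimr)

# Group by smoking status, then skim only the birth weight column
grouped <- dplyr::group_by(birthwt, smoke)
bwt_by_smoke <- skim(grouped, bwt)

bwt_by_smoke
```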

Here’s what it looks like with a factor variable (Species):

iris %>% skimr::skim()
## Skim summary statistics
##  n obs: 150
##  n variables: 5
##
## ── Variable type:factor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##  variable missing complete   n n_unique                       top_counts
##   Species       0      150 150        3 set: 50, ver: 50, vir: 50, NA: 0
##  ordered
##    FALSE
##
## ── Variable type:numeric ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##      variable missing complete   n mean   sd  p0 p25  p50 p75 p100       hist
##  Petal.Length       0      150 150 3.76 1.77 1   1.6 4.35 5.1  6.9   ▇▁▁▂▅▅▃▁
##   Petal.Width       0      150 150 1.2  0.76 0.1 0.3 1.3  1.8  2.5   ▇▁▁▅▃▃▂▂
##  Sepal.Length       0      150 150 5.84 0.83 4.3 5.1 5.8  6.4  7.9   ▂▇▅▇▆▅▂▂
##   Sepal.Width       0      150 150 3.06 0.44 2   2.8 3    3.3  4.4   ▁▂▅▇▃▂▁▁

Another option: summarytools::dfSummary()

This is similar to skimr::skim(), but it uses more horizontal space and can in theory draw image-based graphs rather than ASCII histograms (though this requires X11, which doesn’t work on my system for some reason):

dfSummary(iris)
## Data Frame Summary
## iris
## Dimensions: 150 x 5
## Duplicates: 1
##
## ----------------------------------------------------------------------------------------------------------------------
## No   Variable        Stats / Values           Freqs (% of Valid)   Graph                            Valid    Missing
## ---- --------------- ------------------------ -------------------- -------------------------------- -------- ---------
## 1    Sepal.Length    Mean (sd) : 5.8 (0.8)    35 distinct values     . . : :                        150      0
##      [numeric]       min < med < max:                                : : : :                        (100%)   (0%)
##                      4.3 < 5.8 < 7.9                                 : : : : :
##                      IQR (CV) : 1.3 (0.1)                            : : : : :
##                                                                    : : : : : : : :
##
## 2    Sepal.Width     Mean (sd) : 3.1 (0.4)    23 distinct values           :                        150      0
##      [numeric]       min < med < max:                                      :                        (100%)   (0%)
##                      2 < 3 < 4.4                                         . :
##                      IQR (CV) : 0.5 (0.1)                              : : : :
##                                                                    . . : : : : : :
##
## 3    Petal.Length    Mean (sd) : 3.8 (1.8)    43 distinct values   :                                150      0
##      [numeric]       min < med < max:                              :         . :                    (100%)   (0%)
##                      1 < 4.3 < 6.9                                 :         : : .
##                      IQR (CV) : 3.5 (0.5)                          : :       : : : .
##                                                                    : :   . : : : : : .
##
## 4    Petal.Width     Mean (sd) : 1.2 (0.8)    22 distinct values   :                                150      0
##      [numeric]       min < med < max:                              :                                (100%)   (0%)
##                      0.1 < 1.3 < 2.5                               :       . .   :
##                      IQR (CV) : 1.5 (0.6)                          :       : :   :   .
##                                                                    : :   : : : . : : :
##
## 5    Species         1. setosa                50 (33.3%)           IIIIII                           150      0
##      [factor]        2. versicolor            50 (33.3%)           IIIIII                           (100%)   (0%)
##                      3. virginica             50 (33.3%)           IIIIII
## ----------------------------------------------------------------------------------------------------------------------

Frequency counts of categorical variables

When using R, I frequently miss Stata’s tab command for summarizing categorical data. This is what it looks like:

. webuse iris
(Iris data)

. tab iris

       Iris |
    species |      Freq.     Percent        Cum.
------------+-----------------------------------
     setosa |         50       33.33       33.33
 versicolor |         50       33.33       66.67
  virginica |         50       33.33      100.00
------------+-----------------------------------
      Total |        150      100.00

To achieve this in R, the best thing I’ve found is summarytools::freq():

freq(iris$Species)
## Frequencies
## iris$Species
## Type: Factor
##
##                    Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ---------------- ------ --------- -------------- --------- --------------
##           setosa     50     33.33          33.33     33.33          33.33
##       versicolor     50     33.33          66.67     33.33          66.67
##        virginica     50     33.33         100.00     33.33         100.00
##             <NA>      0                               0.00         100.00
##            Total    150    100.00         100.00    100.00         100.00
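
If installing summarytools is not an option, a similar Stata-style table can be approximated in base R. This is a minimal sketch under that assumption. stata_tab is a hypothetical helper written for this page, not a function from any package, and it silently drops missing values.

```r
# Hypothetical helper: one-way tabulation with Stata-like columns
stata_tab <- function(x) {
  counts <- table(x)                             # frequency of each level (NAs dropped)
  pct <- 100 * as.numeric(counts) / sum(counts)  # share of non-missing values
  data.frame(
    value   = names(counts),
    freq    = as.integer(counts),
    percent = round(pct, 2),
    cum     = round(cumsum(pct), 2)              # running total of the percentages
  )
}

tab_species <- stata_tab(iris$Species)
print(tab_species)
```

Because the result is an ordinary data frame, it can be filtered, sorted, or passed to a table-formatting function like any other.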

Contingency tables (two-way tables)

I also miss Stata’s two-way tab (tab with two variables), which looks like this:

. tab tobacco parent, col

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

                      |   1 = either parent
                      |        smoked
        tobacco usage | nonsmokin    smoking |     Total
----------------------+----------------------+----------
         0 cigarettes |     5,177      4,292 |     9,469
                      |     70.49      56.06 |     63.13
----------------------+----------------------+----------
1 to 7 cigarettes/day |     1,593      2,213 |     3,806
                      |     21.69      28.91 |     25.37
----------------------+----------------------+----------
8 to 12 cigarettes/da |       378        672 |     1,050
                      |      5.15       8.78 |      7.00
----------------------+----------------------+----------
more than 12 cigarett |       196        479 |       675
                      |      2.67       6.26 |      4.50
----------------------+----------------------+----------
                Total |     7,344      7,656 |    15,000
                      |    100.00     100.00 |    100.00

You can accomplish something similar with summarytools::ctable(), here using the tobacco dataset that ships with summarytools:

summarytools::ctable(tobacco$smoker, tobacco$diseased, prop = "c")
## Cross-Tabulation, Column Proportions
## smoker * diseased
## Data Frame: tobacco
##
## -------- ---------- -------------- -------------- ---------------
##            diseased            Yes             No           Total
##   smoker
##      Yes              125 ( 55.8%)   173 ( 22.3%)    298 ( 29.8%)
##       No               99 ( 44.2%)   603 ( 77.7%)    702 ( 70.2%)
##    Total              224 (100.0%)   776 (100.0%)   1000 (100.0%)
## -------- ---------- -------------- -------------- ---------------
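
The same kind of column-percentage table can be sketched in base R with table(), prop.table(), and addmargins(). This example swaps in the birthwt data from the setup section (low birth weight by smoking status) rather than the tobacco data. The names counts and col_pct are illustrative.

```r
library(MASS)  # provides the birthwt dataset

# Two-way frequency table: rows = low birth weight flag, columns = smoking status
counts <- table(low = birthwt$low, smoke = birthwt$smoke)

# margin = 2 divides each cell by its column total, giving column percentages
col_pct <- 100 * prop.table(counts, margin = 2)

addmargins(counts)                          # counts with row and column totals
round(addmargins(col_pct, margin = 1), 1)   # percentages with a total row of 100
```

Changing margin = 2 to margin = 1 in prop.table() would give row percentages instead, comparable to Stata's row option.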

ℹ️ This page is part of my knowledge base for R, the popular statistical programming language. I try to use idiomatic practices with the tidyverse collection of packages as much as possible. If you have suggestions for ways to improve this code, please contact me or use the survey link below.