Data Manipulation

Last updated: August 19, 2020

{{TOC}}

Setup

library(MASS) # Needed for `birthwt` dataset
library(tidyverse)

df <- birthwt

Aggregating

Maximum value across rows

df %>%
  select(age, lwt) %>%
  mutate(age = age * 5, test = 110) %>% # Make `age` column large enough to sometimes be > `lwt`
  mutate(
    row_max = pmax(age, lwt, test)
  ) %>%
  head()
##   age lwt test row_max
## 1  95 182  110     182
## 2 165 155  110     165
## 3 100 105  110     110
## 4 105 108  110     110
## 5  90 107  110     110
## 6 105 124  110     124

Filtering

Select rows where variable contains string

str_detect() allows you to filter strings with a regular expression. In this example, I use a$, which looks for the letter a at the end of each value for Species. This has the effect of filtering out all the rows with Species == "versicolor".

iris %>%
  filter(str_detect(Species, 'a$')) %>%
  select(Species) %>%
  table()
## .
##     setosa versicolor  virginica
##         50          0         50

Modifying

Update a subset of rows based on criteria

Using mutate()

The simplest way is to use mutate() and case_when().

df_example <- tibble(x = 1:5, y = c('a', 'b', 'c', 'd', 'e'))

# A tibble: 5 x 2
#       x y
#   <int> <chr>
# 1     1 a
# 2     2 b
# 3     3 c
# 4     4 d
# 5     5 e

df_example <- df_example %>%
  mutate(
    y = case_when(
      x > 3 ~ 'hello',
      TRUE ~ y # Without this, `y` will be `NA` for the first 3 rows
    )
  )

df_example
## # A tibble: 5 x 2
##       x y
##   <int> <chr>
## 1     1 a
## 2     2 b
## 3     3 c
## 4     4 hello
## 5     5 hello

You can also use if_else() to achieve a similar effect as case_when() if you don’t need multiple cases.

Using mutate_at()

If you want to use mutate_at() the syntax is somewhat different:

df_example <- tibble(x = 1:5, y = c('a', 'b', 'c', 'd', 'e'))

# A tibble: 5 x 2
#       x y
#   <int> <chr>
# 1     1 a
# 2     2 b
# 3     3 c
# 4     4 d
# 5     5 e

df_example %>%
  mutate_at(
    vars(x),
    list(~ case_when(
      . == 2 ~ 0L,
      TRUE ~ .
    ))
  )
## # A tibble: 5 x 2
##       x y
##   <int> <chr>
## 1     1 a
## 2     0 b
## 3     3 c
## 4     4 d
## 5     5 e

Note that you have to use . == 2 ~ 0L or TRUE ~ as.numeric(.) in this example to avoid a type error. I have no idea why R can’t figure this out for itself, but it apparently can’t.

Masks

You can also use a mask:

df_example <- tibble(x = 1:5, y = c('a', 'b', 'c', 'd', 'e'))

# A tibble: 5 x 2
#       x y
#   <int> <chr>
# 1     1 a
# 2     2 b
# 3     3 c
# 4     4 d
# 5     5 e

mask <- df_example$x > 3
df_example[mask, 'y'] <- c('hello', 'goodbye')

df_example
## # A tibble: 5 x 2
##       x y
##   <int> <chr>
## 1     1 a
## 2     2 b
## 3     3 c
## 4     4 hello
## 5     5 goodbye

Using other columns in mutate(), mutate_at(), or mutate_all()

df_example <- tibble(x=1:5, y=10:14, z=100:104)
df_example
## # A tibble: 5 x 3
##       x     y     z
##   <int> <int> <int>
## 1     1    10   100
## 2     2    11   101
## 3     3    12   102
## 4     4    13   103
## 5     5    14   104

This is simple for mutate():

df_example %>%
  mutate(
    new_col = x + y
  )
## # A tibble: 5 x 4
##       x     y     z new_col
##   <int> <int> <int>   <int>
## 1     1    10   100      11
## 2     2    11   101      13
## 3     3    12   102      15
## 4     4    13   103      17
## 5     5    14   104      19

For mutate_at() and mutate_all(), it is a bit more complex:

df_example %>%
  mutate_at(
    vars(x, y),
    ~ . + z
  )
## # A tibble: 5 x 3
##       x     y     z
##   <int> <int> <int>
## 1   101   110   100
## 2   103   112   101
## 3   105   114   102
## 4   107   116   103
## 5   109   118   104

This complexity is because mutate_at() requires that you provide a function rather than simply providing an expression like new_col = x + y.

The ~ character is used in base R with formulas, but is overloaded in purrr (one of the tidyverse packages) to implement anonymous functions (i.e. a function that’s not assigned to a name).

This is functinally equivalent to ~ . + z but with more verbose syntax:

f <- function(., other_col) . + other_col
df_example %>%
  mutate_at(
    vars(x, y),
     f, other_col = quote(z)
  )
## # A tibble: 5 x 3
##       x     y     z
##   <int> <int> <int>
## 1   101   110   100
## 2   103   112   101
## 3   105   114   102
## 4   107   116   103
## 5   109   118   104

The quote() function makes this somewhat confusing, and a technical explanation is beyond the scope of this guide. However, you can read more about quoting in R here.

Miscellaneous


ℹ️ This page is part of my knowledge base for R, the popular statistical programming language. I attempt to use idiomatic practices with the tidyverse collection of packages as much as possible. If you have suggestions for ways to improve this code, please contact me or use the survey link below..