Data Manipulation
Last updated: January 18, 2021
Table of contents:
{{TOC}}
Setup
library(MASS) # Needed for `birthwt` dataset
library(tidyverse)
df <- birthwt
Aggregating
Maximum value across rows
df %>%
select(age, lwt) %>%
mutate(age = age * 5, test = 110) %>% # Make `age` column large enough to sometimes be > `lwt`
mutate(
row_max = pmax(age, lwt, test)
) %>%
head()
## age lwt test row_max
## 1 95 182 110 182
## 2 165 155 110 165
## 3 100 105 110 110
## 4 105 108 110 110
## 5 90 107 110 110
## 6 105 124 110 124
Filtering
Select rows where variable contains string
str_detect()
allows you to filter strings with a regular expression.
In this example, I use a$
, which looks for the letter a
at the end
of each value for Species
. This has the effect of filtering out all
the rows with Species == "versicolor"
.
iris %>%
filter(str_detect(Species, 'a$')) %>%
select(Species) %>%
table()
## .
## setosa versicolor virginica
## 50 0 50
Modifying
Update a subset of rows based on criteria
Using mutate()
The simplest way is to use mutate()
and case_when()
.
df_example <- tibble(x = 1:5, y = c('a', 'b', 'c', 'd', 'e'))
# A tibble: 5 x 2
# x y
# <int> <chr>
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
df_example <- df_example %>%
mutate(
y = case_when(
x > 3 ~ 'hello',
TRUE ~ y # Without this, `y` will be `NA` for the first 3 rows
)
)
df_example
## # A tibble: 5 x 2
## x y
## <int> <chr>
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 hello
## 5 5 hello
You can also use if_else()
to achieve a similar effect as
case_when()
if you don’t need multiple cases.
Using mutate_at()
If you want to use mutate_at()
the syntax is somewhat different:
df_example <- tibble(x = 1:5, y = c('a', 'b', 'c', 'd', 'e'))
# A tibble: 5 x 2
# x y
# <int> <chr>
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
df_example %>%
mutate_at(
vars(x),
list(~ case_when(
. == 2 ~ 0L,
TRUE ~ .
))
)
## # A tibble: 5 x 2
## x y
## <int> <chr>
## 1 1 a
## 2 0 b
## 3 3 c
## 4 4 d
## 5 5 e
Note that you have to use . == 2 ~ 0L
or TRUE ~ as.numeric(.)
in
this example to avoid a type error. I have no idea why R can’t figure
this out for itself, but it apparently can’t.
Masks
You can also use a mask:
df_example <- tibble(x = 1:5, y = c('a', 'b', 'c', 'd', 'e'))
# A tibble: 5 x 2
# x y
# <int> <chr>
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
mask <- df_example$x > 3
df_example[mask, 'y'] <- c('hello', 'goodbye')
df_example
## # A tibble: 5 x 2
## x y
## <int> <chr>
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 hello
## 5 5 goodbye
Using other columns in mutate()
, mutate_at()
, or mutate_all()
df_example <- tibble(x=1:5, y=10:14, z=100:104)
df_example
## # A tibble: 5 x 3
## x y z
## <int> <int> <int>
## 1 1 10 100
## 2 2 11 101
## 3 3 12 102
## 4 4 13 103
## 5 5 14 104
This is simple for mutate()
:
df_example %>%
mutate(
new_col = x + y
)
## # A tibble: 5 x 4
## x y z new_col
## <int> <int> <int> <int>
## 1 1 10 100 11
## 2 2 11 101 13
## 3 3 12 102 15
## 4 4 13 103 17
## 5 5 14 104 19
For mutate_at()
and mutate_all()
, it is a bit more complex:
df_example %>%
mutate_at(
vars(x, y),
~ . + z
)
## # A tibble: 5 x 3
## x y z
## <int> <int> <int>
## 1 101 110 100
## 2 103 112 101
## 3 105 114 102
## 4 107 116 103
## 5 109 118 104
This complexity is because mutate_at()
requires that you provide a
function rather than simply providing an expression like
new_col = x + y
.
The ~
character is used in base R with
formulas,
but is overloaded in purrr
(one of the tidyverse
packages) to
implement anonymous functions (i.e. a function that’s not assigned to a
name).
This is functinally equivalent to ~ . + z
but with more verbose
syntax:
f <- function(., other_col) . + other_col
df_example %>%
mutate_at(
vars(x, y),
f, other_col = quote(z)
)
## # A tibble: 5 x 3
## x y z
## <int> <int> <int>
## 1 101 110 100
## 2 103 112 101
## 3 105 114 102
## 4 107 116 103
## 5 109 118 104
The quote()
function makes this somewhat confusing, and a technical
explanation is beyond the scope of this guide. However, you can read
more about quoting in R
here.
Miscellaneous
- Convert column to vector: use
deframe()
with a Tibble.
ℹ️ This page is part of my knowledge base for R, the popular statistical programming language. I attempt to use idiomatic practices with the tidyverse
collection of packages as much as possible. If you have suggestions for ways to improve this code, please contact me or use the survey link below..