Reshaping “Nontraditional” Wide Data (“Multi-choice” Data)
Last updated: January 18, 2021
library(tidyverse)
library(janitor)
Sometimes multiple choice data is represented in a “nontraditional” wide format:
df <- tribble(
~participant_id, ~favorite_fruit_1, ~favorite_fruit_2, ~favorite_fruit_3,
1, "Banana", "Apple", "Dragon fruit",
2, "Apple", "Strawberry", NA,
3, "Banana", NA, NA,
4, "Blueberry", "Kiwi", NA
)
df
## # A tibble: 4 x 4
## participant_id favorite_fruit_1 favorite_fruit_2 favorite_fruit_3
## <dbl> <chr> <chr> <chr>
## 1 1 Banana Apple Dragon fruit
## 2 2 Apple Strawberry <NA>
## 3 3 Banana <NA> <NA>
## 4 4 Blueberry Kiwi <NA>
The way to handle this is to first convert from wide to long, and then from long to “traditional” wide.
“Nontraditional” wide to long
df_long <- df %>%
pivot_longer(starts_with("favorite_fruit"), values_drop_na = TRUE)
df_long
## # A tibble: 8 x 3
## participant_id name value
## <dbl> <chr> <chr>
## 1 1 favorite_fruit_1 Banana
## 2 1 favorite_fruit_2 Apple
## 3 1 favorite_fruit_3 Dragon fruit
## 4 2 favorite_fruit_1 Apple
## 5 2 favorite_fruit_2 Strawberry
## 6 3 favorite_fruit_1 Banana
## 7 4 favorite_fruit_1 Blueberry
## 8 4 favorite_fruit_2 Kiwi
The code above assumes that all your multiple choice variables start
with the same prefix. If they don’t you could simply list the variable
names like
pivot_longer(c(favorite_fruit_1, favorite_fruit_2, favorite_fruit_3), values_drop_na = TRUE)
.
Or you can use any select()
function syntax to
select the columns. In this case,
pivot_longer(-participant_id, values_drop_na = TRUE)
also works
because we want to pivot all the columns that are not named
participant_id
.
It’s now posible to do quick tabulations that were much more difficult with the original structure of the data:
df_long %>% tabyl(value)
## value n percent
## Apple 2 0.250
## Banana 2 0.250
## Blueberry 1 0.125
## Dragon fruit 1 0.125
## Kiwi 1 0.125
## Strawberry 1 0.125
(This uses tabyl()
from the excellent janitor
package for generating the
frequencies table.)
If the order of the multiple choice data is important, it’s easy to
extract this from the name
variable:
df_long <- df_long %>%
extract(name, "order", regex = "([0-9]+)")
df_long
## # A tibble: 8 x 3
## participant_id order value
## <dbl> <chr> <chr>
## 1 1 1 Banana
## 2 1 2 Apple
## 3 1 3 Dragon fruit
## 4 2 1 Apple
## 5 2 2 Strawberry
## 6 3 1 Banana
## 7 4 1 Blueberry
## 8 4 2 Kiwi
This uses a “regex” or regular
expression to extract
the first number that appears in each name
value. Regex are out of
scope for this post, but are an extremely useful tool that is worth
learning.
Long to “traditional” wide
df_wide <- df_long %>%
mutate(checked = 1) %>% # Used for creating a binary 1/0 variable for each value
pivot_wider(
id_cols = participant_id,
names_from = value,
values_from = checked,
values_fill = list(checked = 0) # Fills in 0 for any value not "checked" above by the `mutate()` function
)
df_wide
## # A tibble: 4 x 7
## participant_id Banana Apple `Dragon fruit` Strawberry Blueberry Kiwi
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 1 1 0 0 0
## 2 2 0 1 0 1 0 0
## 3 3 1 0 0 0 0 0
## 4 4 0 0 0 0 1 1
ℹ️ This page is part of my knowledge base for R, the popular statistical programming language. I attempt to use idiomatic practices with the tidyverse
collection of packages as much as possible. If you have suggestions for ways to improve this code, please contact me or use the survey link below..