Reshaping “Nontraditional” Wide Data (“Multi-choice” Data)

Last updated: May 19, 2020

library(tidyverse)
library(janitor)

Sometimes multiple choice data is represented in a “nontraditional” wide format:

df <- tribble(
  ~participant_id, ~favorite_fruit_1, ~favorite_fruit_2, ~favorite_fruit_3,
  1,               "Banana",          "Apple",           "Dragon fruit",
  2,               "Apple",           "Strawberry",      NA,
  3,               "Banana",          NA,                NA,
  4,               "Blueberry",       "Kiwi",            NA
)
df
## # A tibble: 4 x 4
##   participant_id favorite_fruit_1 favorite_fruit_2 favorite_fruit_3
##            <dbl> <chr>            <chr>            <chr>           
## 1              1 Banana           Apple            Dragon fruit    
## 2              2 Apple            Strawberry       <NA>            
## 3              3 Banana           <NA>             <NA>            
## 4              4 Blueberry        Kiwi             <NA>

The way to handle this is to first convert from wide to long, and then from long to “traditional” wide.

“Nontraditional” wide to long

df_long <- df %>% 
  pivot_longer(starts_with("favorite_fruit"), values_drop_na = TRUE)

df_long
## # A tibble: 8 x 3
##   participant_id name             value       
##            <dbl> <chr>            <chr>       
## 1              1 favorite_fruit_1 Banana      
## 2              1 favorite_fruit_2 Apple       
## 3              1 favorite_fruit_3 Dragon fruit
## 4              2 favorite_fruit_1 Apple       
## 5              2 favorite_fruit_2 Strawberry  
## 6              3 favorite_fruit_1 Banana      
## 7              4 favorite_fruit_1 Blueberry   
## 8              4 favorite_fruit_2 Kiwi

The code above assumes that all your multiple choice variables start with the same prefix. If they don’t you could simply list the variable names like pivot_longer(c(favorite_fruit_1, favorite_fruit_2, favorite_fruit_3), values_drop_na = TRUE).

Or you can use any select() function syntax to select the columns. In this case, pivot_longer(-participant_id, values_drop_na = TRUE) also works because we want to pivot all the columns that are not named participant_id.

It’s now posible to do quick tabulations that were much more difficult with the original structure of the data:

df_long %>% tabyl(value)
##         value n percent
##         Apple 2   0.250
##        Banana 2   0.250
##     Blueberry 1   0.125
##  Dragon fruit 1   0.125
##          Kiwi 1   0.125
##    Strawberry 1   0.125

(This uses tabyl() from the excellent janitor package for generating the frequencies table.)

If the order of the multiple choice data is important, it’s easy to extract this from the name variable:

df_long <- df_long %>% 
  extract(name, "order", regex = "([0-9]+)")

df_long
## # A tibble: 8 x 3
##   participant_id order value       
##            <dbl> <chr> <chr>       
## 1              1 1     Banana      
## 2              1 2     Apple       
## 3              1 3     Dragon fruit
## 4              2 1     Apple       
## 5              2 2     Strawberry  
## 6              3 1     Banana      
## 7              4 1     Blueberry   
## 8              4 2     Kiwi

This uses a “regex” or regular expression to extract the first number that appears in each name value. Regex are out of scope for this post, but are an extremely useful tool that is worth learning.

Long to “traditional” wide

df_wide <- df_long %>% 
  mutate(checked = 1) %>% # Used for creating a binary 1/0 variable for each value
  pivot_wider(
    id_cols = participant_id,
    names_from = value,
    values_from = checked,
    values_fill = list(checked = 0) # Fills in 0 for any value not "checked" above by the `mutate()` function
  )
df_wide
## # A tibble: 4 x 7
##   participant_id Banana Apple `Dragon fruit` Strawberry Blueberry  Kiwi
##            <dbl>  <dbl> <dbl>          <dbl>      <dbl>     <dbl> <dbl>
## 1              1      1     1              1          0         0     0
## 2              2      0     1              0          1         0     0
## 3              3      1     0              0          0         0     0
## 4              4      0     0              0          0         1     1

ℹ️ This page is part of my knowledge base for R, the popular statistical programming language. I attempt to use idiomatic practices with the tidyverse collection of packages as much as possible. If you have suggestions for ways to improve this code, please contact me or use the survey link below..