Reshaping data with dplyr

Add
module 4
week 8
tidyr
tidyverse
dplyr
tibble
Author
Affiliation

School of Life Sciences, University of Hawaii

Published

March 9, 2023

Pre-lecture materials

Read ahead

Read ahead

Before class, you can prepare by reading the following materials:

  1. https://r4ds.had.co.nz/tidy-data
  2. tidyr cheat sheet from RStudio

Acknowledgements

Material for this lecture was borrowed and adopted from

Learning objectives

Learning objectives

At the end of this lesson you will:

  • Be able to transform wide data into long data
  • Be able to separate character columns into multiple columns
  • Be able to unite/separate multiple character columns into one column

Overview

Common data table reshaping tasks include reshaping your matrices or dataframes manually, or converting between short-wide to tall-thin formats.

Manual reshaping

Recall from our discussion on data objects that internally, R objects are stored as one huge vector. The various shapes of objects are simply created by R knowing where to break the vector into rows and columns. So it is very easy to reshape matrices:

vv <- 1:10  # a vector
mm <- matrix( vv, nrow=2)  # a matrix
mm
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
dim(mm) <- NULL
mm <- matrix( vv, nrow=2, byrow=T)  # a matrix, but cells are now filled by row
mm
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
dim(mm) <- NULL
mm  # vector is now n a different order because the collapse occurred by column
 [1]  1  6  2  7  3  8  4  9  5 10

Load the geospiza data:

require(geiger)
Loading required package: geiger
Loading required package: ape
data(geospiza)   # load the dataset into the workspace
geo <- geospiza$dat  # save the morphometric data as geo

Other means of “collapsing” dataframes are:

unlist(geo)   # produces a vector from the dataframe
                wingL  tarsusL  culmenL    beakD   gonysW
magnirostris 4.404200 3.038950 2.724667 2.823767 2.675983
conirostris  4.349867 2.984200 2.654400 2.513800 2.360167
difficilis   4.224067 2.898917 2.277183 2.011100 1.929983
scandens     4.261222 2.929033 2.621789 2.144700 2.036944
fortis       4.244008 2.894717 2.407025 2.362658 2.221867
fuliginosa   4.132957 2.806514 2.094971 1.941157 1.845379
pallida      4.265425 3.089450 2.430250 2.016350 1.949125
fusca        3.975393 2.936536 2.051843 1.191264 1.401186
parvulus     4.131600 2.973060 1.974420 1.873540 1.813340
pauper       4.232500 3.035900 2.187000 2.073400 1.962100
Pinaroloxias 4.188600 2.980200 2.311100 1.547500 1.630100
Platyspiza   4.419686 3.270543 2.331471 2.347471 2.282443
psittacula   4.235020 3.049120 2.259640 2.230040 2.073940
            # the atomic type of a dataframe is a list
unclass(geo)  # removes the class attribute, turning the dataframe into a 
                wingL  tarsusL  culmenL    beakD   gonysW
magnirostris 4.404200 3.038950 2.724667 2.823767 2.675983
conirostris  4.349867 2.984200 2.654400 2.513800 2.360167
difficilis   4.224067 2.898917 2.277183 2.011100 1.929983
scandens     4.261222 2.929033 2.621789 2.144700 2.036944
fortis       4.244008 2.894717 2.407025 2.362658 2.221867
fuliginosa   4.132957 2.806514 2.094971 1.941157 1.845379
pallida      4.265425 3.089450 2.430250 2.016350 1.949125
fusca        3.975393 2.936536 2.051843 1.191264 1.401186
parvulus     4.131600 2.973060 1.974420 1.873540 1.813340
pauper       4.232500 3.035900 2.187000 2.073400 1.962100
Pinaroloxias 4.188600 2.980200 2.311100 1.547500 1.630100
Platyspiza   4.419686 3.270543 2.331471 2.347471 2.282443
psittacula   4.235020 3.049120 2.259640 2.230040 2.073940
            # series of vectors  plus any names attributes, same as setting 
            # class(geo) <- NULL
c(geo)  # similar to unclass but without the attributes            
 [1] 4.404200 4.349867 4.224067 4.261222 4.244008 4.132957 4.265425 3.975393
 [9] 4.131600 4.232500 4.188600 4.419686 4.235020 3.038950 2.984200 2.898917
[17] 2.929033 2.894717 2.806514 3.089450 2.936536 2.973060 3.035900 2.980200
[25] 3.270543 3.049120 2.724667 2.654400 2.277183 2.621789 2.407025 2.094971
[33] 2.430250 2.051843 1.974420 2.187000 2.311100 2.331471 2.259640 2.823767
[41] 2.513800 2.011100 2.144700 2.362658 1.941157 2.016350 1.191264 1.873540
[49] 2.073400 1.547500 2.347471 2.230040 2.675983 2.360167 1.929983 2.036944
[57] 2.221867 1.845379 1.949125 1.401186 1.813340 1.962100 1.630100 2.282443
[65] 2.073940

An example of “untidy” data

People often make tables in short-wide format that end up not being tidy data. When people use column names to store data, it is no longer tidy. For example take a look at this built-in dataset that comes with tidyr on religion and income survey data with the number of respondents with income range in column name.

library(tidyr)
relig_income
# A tibble: 18 × 11
   religion      `<$10k` $10-2…¹ $20-3…² $30-4…³ $40-5…⁴ $50-7…⁵ $75-1…⁶ $100-…⁷
   <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 Agnostic           27      34      60      81      76     137     122     109
 2 Atheist            12      27      37      52      35      70      73      59
 3 Buddhist           27      21      30      34      33      58      62      39
 4 Catholic          418     617     732     670     638    1116     949     792
 5 Don’t know/r…      15      14      15      11      10      35      21      17
 6 Evangelical …     575     869    1064     982     881    1486     949     723
 7 Hindu               1       9       7       9      11      34      47      48
 8 Historically…     228     244     236     238     197     223     131      81
 9 Jehovah's Wi…      20      27      24      24      21      30      15      11
10 Jewish             19      19      25      25      30      95      69      87
11 Mainline Prot     289     495     619     655     651    1107     939     753
12 Mormon             29      40      48      51      56     112      85      49
13 Muslim              6       7       9      10       9      23      16       8
14 Orthodox           13      17      23      32      32      47      38      42
15 Other Christ…       9       7      11      13      13      14      18      14
16 Other Faiths       20      33      40      46      49      63      46      40
17 Other World …       5       2       3       4       2       7       3       4
18 Unaffiliated      217     299     374     365     341     528     407     321
# … with 2 more variables: `>150k` <dbl>, `Don't know/refused` <dbl>, and
#   abbreviated variable names ¹​`$10-20k`, ²​`$20-30k`, ³​`$30-40k`, ⁴​`$40-50k`,
#   ⁵​`$50-75k`, ⁶​`$75-100k`, ⁷​`$100-150k`

Take a second to look at this data. In this case the variables are religion (a proper vector), income bracket (in the column names), and the number of respts, which is the third variable, is presented inside the table. onden Converting this data to tidy format would give us

library(tidyverse)

relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "respondents") 
# A tibble: 180 × 3
   religion income             respondents
   <chr>    <chr>                    <dbl>
 1 Agnostic <$10k                       27
 2 Agnostic $10-20k                     34
 3 Agnostic $20-30k                     60
 4 Agnostic $30-40k                     81
 5 Agnostic $40-50k                     76
 6 Agnostic $50-75k                    137
 7 Agnostic $75-100k                   122
 8 Agnostic $100-150k                  109
 9 Agnostic >150k                       84
10 Agnostic Don't know/refused          96
# … with 170 more rows

Now we have each variable along the columns and each row corresponds to one observation (or category, here a combination of religion and income bracket).

Reshaping data with dplyr

pivot_longer()

The tidyr package includes functions to transfer a data frame between long and wide.

  • Wide format data has different attributes or variables describing an observation placed in separate columns.
  • Long format data tends to have different attributes encoded as levels of a single variable, followed by another column that contains the values of the observation at those different levels.

The key problem with the tidyness of the original data is that the income variables are not in their own columns, but rather are embedded in the structure of the columns, making it hard to manipuate the income variables.

To fix this, you can use the pivot_longer() function to gather values spread across several columns into a single column, here with the column names gathered into an income column.

Mutate to convert character to factor

# Gather everything EXCEPT religion to tidy data
relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "respondents") %>%
  mutate(religion = factor(religion), income = factor(income))
# A tibble: 180 × 3
   religion income             respondents
   <fct>    <fct>                    <dbl>
 1 Agnostic <$10k                       27
 2 Agnostic $10-20k                     34
 3 Agnostic $20-30k                     60
 4 Agnostic $30-40k                     81
 5 Agnostic $40-50k                     76
 6 Agnostic $50-75k                    137
 7 Agnostic $75-100k                   122
 8 Agnostic $100-150k                  109
 9 Agnostic >150k                       84
10 Agnostic Don't know/refused          96
# … with 170 more rows
Note
  • When gathering, exclude any columns that you do not want “gathered” (religion in this case) by including the column names with a the minus sign in the pivot_longer() function.
  • Even if your data is in a tidy format, pivot_longer() is occasionally useful for pulling data together to take advantage of faceting, or plotting separate plots based on a grouping variable.

pivot_wider()

The pivot_wider() function is the opposite function. It is useful for creating summary tables for reports, but generally less commonly needed to tidy data.

You use the summarize() function in dplyr to summarize the total number of respondents per income category.

relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "respondents") %>%
  mutate(religion = factor(religion), income = factor(income)) %>% 
  group_by(income) %>% 
  summarize(total_respondents = sum(respondents)) 
# A tibble: 10 × 2
   income             total_respondents
   <fct>                          <dbl>
 1 <$10k                           1930
 2 >150k                           2608
 3 $10-20k                         2781
 4 $100-150k                       3197
 5 $20-30k                         3357
 6 $30-40k                         3302
 7 $40-50k                         3085
 8 $50-75k                         5185
 9 $75-100k                        3990
10 Don't know/refused              6121

pivot_wider() can be flexibly used in combination with pivot_longer to make a nicer table to print.

Notice in this example how pivot_wider() has been used at the very end of the code sequence to convert the summarized data into a shape that offers a better tabular presentation for a report.

relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "respondents") %>%
  mutate(religion = factor(religion), income = factor(income)) %>% 
  group_by(income) %>% 
  summarize(total_respondents = sum(respondents)) %>%

  pivot_wider(names_from = "income", 
              values_from = "total_respondents") %>%
  knitr::kable()
<$10k >150k $10-20k $100-150k $20-30k $30-40k $40-50k $50-75k $75-100k Don’t know/refused
1930 2608 2781 3197 3357 3302 3085 5185 3990 6121
Note
  • In the pivot_wider() call, you first specify the name of the column to use for the new column names (income in this example) and then specify the column to use for the cell values (total_respondents here).
  • Long format is often (but not always) the shape we need for tidy data. The important thing, however, is to think carefully about the shape you need for your analysis.
Example of pivot_longer()

Let’s try another dataset. This data contain an excerpt of the Gapminder data on life expectancy, GDP per capita, and population by country.

library(gapminder)
gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# … with 1,694 more rows

If we wanted to make lifeExp, pop and gdpPercap (all measurements that we observe) go from a wide table into a long table, what would we do?

# try it yourself
Example

One more! Try using pivot_longer() to convert the the following data that contains simulated revenues for three companies by quarter for years 2006 to 2009.

Afterward, use group_by() and summarize() to calculate the average revenue for each company across all years and all quarters.

Bonus: Calculate a mean revenue for each company AND each year (averaged across all 4 quarters).

df <- tibble(
  "company" = rep(1:3, each=4), 
  "year"  = rep(2006:2009, 3),
  "Q1"    = sample(x = 0:100, size = 12),
  "Q2"    = sample(x = 0:100, size = 12),
  "Q3"    = sample(x = 0:100, size = 12),
  "Q4"    = sample(x = 0:100, size = 12),
)
df
# A tibble: 12 × 6
   company  year    Q1    Q2    Q3    Q4
     <int> <int> <int> <int> <int> <int>
 1       1  2006    77     4    95    88
 2       1  2007     0    22    70    45
 3       1  2008    53     7    87    97
 4       1  2009    64    80    33     8
 5       2  2006     3    33    50    14
 6       2  2007    98    39    11    37
 7       2  2008   100    35    88    22
 8       2  2009    38    96    78    89
 9       3  2006    61     6    14    87
10       3  2007    11    52    28    58
11       3  2008    46    37    29    18
12       3  2009    23    82    56    59
# try it yourself 

separate() and unite() cells within columns of data

Still in thetidyr package:

  • unite(): paste contents of two or more columns into a single column
  • separate(): split contents of a column into two or more columns

First, we combine the first three columns into one new column using unite(). This function is similar to newvar <- paste(A,B,C, sep="_")

names(gapminder)
[1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
gapminder %>% 
  unite(col="country_continent_year", 
        country:year, 
        sep="_")
# A tibble: 1,704 × 4
   country_continent_year lifeExp      pop gdpPercap
   <chr>                    <dbl>    <int>     <dbl>
 1 Afghanistan_Asia_1952     28.8  8425333      779.
 2 Afghanistan_Asia_1957     30.3  9240934      821.
 3 Afghanistan_Asia_1962     32.0 10267083      853.
 4 Afghanistan_Asia_1967     34.0 11537966      836.
 5 Afghanistan_Asia_1972     36.1 13079460      740.
 6 Afghanistan_Asia_1977     38.4 14880372      786.
 7 Afghanistan_Asia_1982     39.9 12881816      978.
 8 Afghanistan_Asia_1987     40.8 13867957      852.
 9 Afghanistan_Asia_1992     41.7 16317921      649.
10 Afghanistan_Asia_1997     41.8 22227415      635.
# … with 1,694 more rows

Next, we show how to separate the columns into three separate columns using separate() using the col, into and sep arguments. Note that this works by finding the delimiter, and relies on order of the information.

gapminder %>% 
  unite(col="country_continent_year", 
        country:year, 
        sep="_") %>% 
  separate(col="country_continent_year", 
           into=c("country", "continent", "year"), 
           sep="_")
# A tibble: 1,704 × 6
   country     continent year  lifeExp      pop gdpPercap
   <chr>       <chr>     <chr>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia      1952     28.8  8425333      779.
 2 Afghanistan Asia      1957     30.3  9240934      821.
 3 Afghanistan Asia      1962     32.0 10267083      853.
 4 Afghanistan Asia      1967     34.0 11537966      836.
 5 Afghanistan Asia      1972     36.1 13079460      740.
 6 Afghanistan Asia      1977     38.4 14880372      786.
 7 Afghanistan Asia      1982     39.9 12881816      978.
 8 Afghanistan Asia      1987     40.8 13867957      852.
 9 Afghanistan Asia      1992     41.7 16317921      649.
10 Afghanistan Asia      1997     41.8 22227415      635.
# … with 1,694 more rows

Post-lecture materials

Final Questions

Here are some post-lecture questions to help you think about the material discussed.

Questions
  1. Using prose, describe how the variables and observations are organised in a tidy dataset versus an non-tidy dataset.

  2. What do the extra and fill arguments do in separate()? Experiment with the various options for the following two toy datasets.

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
  separate(x, c("one", "two", "three"))

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"))
  1. Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?

  2. Compare and contrast separate() and extract(). Why are there three variations of separation (by position, by separator, and with groups), but only one unite()?

Solution for gapminder example

gapminder %>% 
  pivot_longer(-c(country, continent, year), names_to = "metrics", values_to = "values")
# A tibble: 5,112 × 5
   country     continent  year metrics       values
   <fct>       <fct>     <int> <chr>          <dbl>
 1 Afghanistan Asia       1952 lifeExp         28.8
 2 Afghanistan Asia       1952 pop        8425333  
 3 Afghanistan Asia       1952 gdpPercap      779. 
 4 Afghanistan Asia       1957 lifeExp         30.3
 5 Afghanistan Asia       1957 pop        9240934  
 6 Afghanistan Asia       1957 gdpPercap      821. 
 7 Afghanistan Asia       1962 lifeExp         32.0
 8 Afghanistan Asia       1962 pop       10267083  
 9 Afghanistan Asia       1962 gdpPercap      853. 
10 Afghanistan Asia       1967 lifeExp         34.0
# … with 5,102 more rows

We stacked the three variables lifeExp, pop, and gdpPercap so now the table is a little thinner and three times as long.

Why did we have to make the non-gathered variables into a vector?