Be able to separate character columns into multiple columns
Be able to unite/separate multiple character columns into one column
Overview
Common data table reshaping tasks include reshaping your matrices or dataframes manually, or converting between short-wide to tall-thin formats.
Manual reshaping
Recall from our discussion on data objects that internally, R objects are stored as one huge vector. The various shapes of objects are simply created by R knowing where to break the vector into rows and columns. So it is very easy to reshape matrices:
vv<-1:10# a vectormm<-matrix(vv, nrow=2)# a matrixmm
People often make tables in short-wide format that end up not being tidy data. When people use column names to store data, it is no longer tidy. For example take a look at this built-in dataset that comes with tidyr on religion and income survey data with the number of respondents with income range in column name.
Take a second to look at this data. In this case the variables are religion (a proper vector), income bracket (in the column names), and the number of respts, which is the third variable, is presented inside the table. onden Converting this data to tidy format would give us
Now we have each variable along the columns and each row corresponds to one observation (or category, here a combination of religion and income bracket).
Reshaping data with dplyr
pivot_longer()
The tidyr package includes functions to transfer a data frame between long and wide.
Wide format data has different attributes or variables describing an observation placed in separate columns.
Long format data tends to have different attributes encoded as levels of a single variable, followed by another column that contains the values of the observation at those different levels.
The key problem with the tidyness of the original data is that the income variables are not in their own columns, but rather are embedded in the structure of the columns, making it hard to manipuate the income variables.
To fix this, you can use the pivot_longer() function to gather values spread across several columns into a single column, here with the column names gathered into an income column.
Mutate to convert character to factor
# Gather everything EXCEPT religion to tidy datarelig_income%>%pivot_longer(-religion, names_to ="income", values_to ="respondents")%>%mutate(religion =factor(religion), income =factor(income))
When gathering, exclude any columns that you do not want “gathered” (religion in this case) by including the column names with a the minus sign in the pivot_longer() function.
Even if your data is in a tidy format, pivot_longer() is occasionally useful for pulling data together to take advantage of faceting, or plotting separate plots based on a grouping variable.
pivot_wider()
The pivot_wider() function is the opposite function. It is useful for creating summary tables for reports, but generally less commonly needed to tidy data.
You use the summarize() function in dplyr to summarize the total number of respondents per income category.
pivot_wider() can be flexibly used in combination with pivot_longer to make a nicer table to print.
Notice in this example how pivot_wider() has been used at the very end of the code sequence to convert the summarized data into a shape that offers a better tabular presentation for a report.
In the pivot_wider() call, you first specify the name of the column to use for the new column names (income in this example) and then specify the column to use for the cell values (total_respondents here).
Long format is often (but not always) the shape we need for tidy data. The important thing, however, is to think carefully about the shape you need for your analysis.
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# … with 1,694 more rows
If we wanted to make lifeExp, pop and gdpPercap (all measurements that we observe) go from a wide table into a long table, what would we do?
# try it yourself
Example
One more! Try using pivot_longer() to convert the the following data that contains simulated revenues for three companies by quarter for years 2006 to 2009.
Afterward, use group_by() and summarize() to calculate the average revenue for each company across all years and all quarters.
Bonus: Calculate a mean revenue for each company AND each year (averaged across all 4 quarters).
Next, we show how to separate the columns into three separate columns using separate() using the col, into and sep arguments. Note that this works by finding the delimiter, and relies on order of the information.
Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?
Compare and contrast separate() and extract(). Why are there three variations of separation (by position, by separator, and with groups), but only one unite()?
# A tibble: 5,112 × 5
country continent year metrics values
<fct> <fct> <int> <chr> <dbl>
1 Afghanistan Asia 1952 lifeExp 28.8
2 Afghanistan Asia 1952 pop 8425333
3 Afghanistan Asia 1952 gdpPercap 779.
4 Afghanistan Asia 1957 lifeExp 30.3
5 Afghanistan Asia 1957 pop 9240934
6 Afghanistan Asia 1957 gdpPercap 821.
7 Afghanistan Asia 1962 lifeExp 32.0
8 Afghanistan Asia 1962 pop 10267083
9 Afghanistan Asia 1962 gdpPercap 853.
10 Afghanistan Asia 1967 lifeExp 34.0
# … with 5,102 more rows
We stacked the three variables lifeExp, pop, and gdpPercap so now the table is a little thinner and three times as long.
Why did we have to make the non-gathered variables into a vector?