Reshaping data with tidyr

Practical Computing and Data Science Tools

Annoucements

  • Midterm II is Thursday, November 14th.
  • Material on the midterm will include all material through Week 11.

Agenda

  • Continuation and review of combining tibbles (i.e. data joins)
  • Reshaping data with tidyr

Data Joins

Data Joins in R

Recall the x and y tibbles from last time.

Types of Data Joins

The dplyr package includes functions for two general types of joins:

  • Mutating joins, which combine the columns of tibbles x and y, and
  • Filtering joins, which match the rows of tibbles x and y.

Think of how mutate() adds columns to a tibble, while filter() removes rows.

Mutating Joins

dplyr contains four mutating joins:

  1. left_join(x, y) keeps all rows of x, but if a row in y does not match to x, an NA is assigned to that row in the new columns.
  2. right_join(x, y) is equivalent to left_join(y, x), except for column order.
  3. inner_join(x, y) keeps only the rows matched between x and y.
  4. full_join(x, y) keeps all rows of both x and y.

Filtering Joins

dplyr contains two filtering joins:

  1. semi_join(x, y) keeps all the rows in x that have a match in y.
  2. anti_join(x, y) removes all the rows in x that have a match in y.

Note: Unlike mutating joins, filtering joins do not add any columns to the data.

Loading data

We’ll create the x and y tibbles in R:

library(tidyverse)
x <- tibble(spp_code = c("ABBA", "BEPA", "PIST"),
            common_name = c("balsum fir", "paper birch", "white pine"))

y <- tibble(spp_code = c("ABBA", "BEPA", "LALA"),
            latin_name = c("Abies balsamea", 
                           "Betula papyrifera",
                           "Larix laricina"))

Loading data

We’ll create the x and y tibbles in R:

x
# A tibble: 3 × 2
  spp_code common_name
  <chr>    <chr>      
1 ABBA     balsum fir 
2 BEPA     paper birch
3 PIST     white pine 
y
# A tibble: 3 × 2
  spp_code latin_name       
  <chr>    <chr>            
1 ABBA     Abies balsamea   
2 BEPA     Betula papyrifera
3 LALA     Larix laricina   

Q1. Write the code to get this output

# A tibble: 3 × 3
  spp_code common_name latin_name       
  <chr>    <chr>       <chr>            
1 ABBA     balsum fir  Abies balsamea   
2 BEPA     paper birch Betula papyrifera
3 PIST     white pine  <NA>             
01:00

Q1. Write the code to get this output

left_join(x, y, by = "spp_code")
# A tibble: 3 × 3
  spp_code common_name latin_name       
  <chr>    <chr>       <chr>            
1 ABBA     balsum fir  Abies balsamea   
2 BEPA     paper birch Betula papyrifera
3 PIST     white pine  <NA>             

Q2. Write the code to get this output

# A tibble: 4 × 3
  spp_code common_name latin_name       
  <chr>    <chr>       <chr>            
1 ABBA     balsum fir  Abies balsamea   
2 BEPA     paper birch Betula papyrifera
3 PIST     white pine  <NA>             
4 LALA     <NA>        Larix laricina   
01:00

Q2. Write the code to get this output

full_join(x, y, by = "spp_code")
# A tibble: 4 × 3
  spp_code common_name latin_name       
  <chr>    <chr>       <chr>            
1 ABBA     balsum fir  Abies balsamea   
2 BEPA     paper birch Betula papyrifera
3 PIST     white pine  <NA>             
4 LALA     <NA>        Larix laricina   

Q3. Write the code to get this output

# A tibble: 3 × 3
  spp_code latin_name        common_name
  <chr>    <chr>             <chr>      
1 ABBA     Abies balsamea    balsum fir 
2 BEPA     Betula papyrifera paper birch
3 LALA     Larix laricina    <NA>       
01:00

Q3. Write the code to get this output

left_join(y, x, by = "spp_code")
# A tibble: 3 × 3
  spp_code latin_name        common_name
  <chr>    <chr>             <chr>      
1 ABBA     Abies balsamea    balsum fir 
2 BEPA     Betula papyrifera paper birch
3 LALA     Larix laricina    <NA>       

Q4. Write the code to get this output

# A tibble: 3 × 3
  spp_code common_name latin_name       
  <chr>    <chr>       <chr>            
1 ABBA     balsum fir  Abies balsamea   
2 BEPA     paper birch Betula papyrifera
3 LALA     <NA>        Larix laricina   
01:00

Q4. Write the code to get this output

right_join(x, y, by = "spp_code")
# A tibble: 3 × 3
  spp_code common_name latin_name       
  <chr>    <chr>       <chr>            
1 ABBA     balsum fir  Abies balsamea   
2 BEPA     paper birch Betula papyrifera
3 LALA     <NA>        Larix laricina   

Q5. Write the code to get this output

# A tibble: 2 × 3
  spp_code common_name latin_name       
  <chr>    <chr>       <chr>            
1 ABBA     balsum fir  Abies balsamea   
2 BEPA     paper birch Betula papyrifera
01:00

Q5. Write the code to get this output

inner_join(x, y, by = "spp_code")
# A tibble: 2 × 3
  spp_code common_name latin_name       
  <chr>    <chr>       <chr>            
1 ABBA     balsum fir  Abies balsamea   
2 BEPA     paper birch Betula papyrifera

Q6. Write the code to get this output

# A tibble: 4 × 3
  spp_code latin_name        common_name
  <chr>    <chr>             <chr>      
1 ABBA     Abies balsamea    balsum fir 
2 BEPA     Betula papyrifera paper birch
3 LALA     Larix laricina    <NA>       
4 PIST     <NA>              white pine 
01:00

Q6. Write the code to get this output

full_join(y, x, by = "spp_code")
# A tibble: 4 × 3
  spp_code latin_name        common_name
  <chr>    <chr>             <chr>      
1 ABBA     Abies balsamea    balsum fir 
2 BEPA     Betula papyrifera paper birch
3 LALA     Larix laricina    <NA>       
4 PIST     <NA>              white pine 

Consider a third tibble, z:

z <- x %>%
  rename(species_code = spp_code)
z
# A tibble: 3 × 2
  species_code common_name
  <chr>        <chr>      
1 ABBA         balsum fir 
2 BEPA         paper birch
3 PIST         white pine 

Q7. Write the code to get this output

# A tibble: 3 × 3
  species_code common_name latin_name       
  <chr>        <chr>       <chr>            
1 ABBA         balsum fir  Abies balsamea   
2 BEPA         paper birch Betula papyrifera
3 PIST         white pine  <NA>             
01:00

Q7. Write the code to get this output

left_join(z, y, by = c("species_code" = "spp_code"))
# A tibble: 3 × 3
  species_code common_name latin_name       
  <chr>        <chr>       <chr>            
1 ABBA         balsum fir  Abies balsamea   
2 BEPA         paper birch Betula papyrifera
3 PIST         white pine  <NA>             

Q8. Write the code to get this output (hint: use a filtering join)

# A tibble: 2 × 2
  spp_code latin_name       
  <chr>    <chr>            
1 ABBA     Abies balsamea   
2 BEPA     Betula papyrifera
01:00

Q8. Write the code to get this output (hint: use a filtering join)

semi_join(y, x, by = "spp_code")
# A tibble: 2 × 2
  spp_code latin_name       
  <chr>    <chr>            
1 ABBA     Abies balsamea   
2 BEPA     Betula papyrifera

Q9. Write the code to get this output (hint: use a filtering join)

# A tibble: 1 × 2
  species_code common_name
  <chr>        <chr>      
1 PIST         white pine 
01:00

Q9. Write the code to get this output (hint: use a filtering join)

anti_join(z, y, by = c("species_code" = "spp_code"))
# A tibble: 1 × 2
  species_code common_name
  <chr>        <chr>      
1 PIST         white pine 

Q10. Write the code to get this output (hint: use a filtering join)

# A tibble: 1 × 2
  spp_code latin_name    
  <chr>    <chr>         
1 LALA     Larix laricina
01:00

Q10. Write the code to get this output (hint: use a filtering join)

anti_join(y, x, by = "spp_code")
# A tibble: 1 × 2
  spp_code latin_name    
  <chr>    <chr>         
1 LALA     Larix laricina
# or 
anti_join(y, z, by = c("spp_code" = "species_code"))
# A tibble: 1 × 2
  spp_code latin_name    
  <chr>    <chr>         
1 LALA     Larix laricina

Reshaping Data

The tidyr package

  • The tidyr package allows us to transform data from long to wide formats, and back.
  • A key use of the tidyr package is getting data in tidy format.
  • But what is tidy format?

Tidy data

“Tidy data” or “tidy format” is a formal concept of how we organize data for analyses, in particular, with tidy data:

  • Each row of the data correspond to a single observation, and
  • Each column of the data correspond to a variable.

Tidy data

“Tidy data” or “tidy format” is a formal concept of how we organize data for analyses, in particular, with tidy data:

  • Each row of the data correspond to a single observation, and
  • Each column of the data correspond to a variable.

Let’s take a look at a blog post from Julie Lowndes and Allison Horst about tidy data

Tidying data with tidyr

Consider the face dataset from IFDAR:

face <- read_csv("../labs/datasets/FACE/FACE_aspen_core_growth.csv") %>%
  select(Rep, Treat, Clone, ID = `ID #`,
        contains(as.character(2001:2005)) & contains("Height"))
face
# A tibble: 1,991 × 9
     Rep Treat Clone    ID `2001_Height` `2002_Height` `2003_Height`
   <dbl> <dbl> <dbl> <dbl>         <dbl>         <dbl>         <dbl>
 1     1     1     8    45            NA            NA            NA
 2     1     1   216    44           547           622           715
 3     1     1     8    43           273           275           305
 4     1     1   216    42           526           619           720
 5     1     1   216    54           328           341           364
 6     1     1   271    55           543           590           634
 7     1     1   271    56           450           502           587
 8     1     1     8    57           217           227           256
 9     1     1   259    58           158           155            NA
10     1     1   271    59           230           241           260
# ℹ 1,981 more rows
# ℹ 2 more variables: `2004_Height` <dbl>, `2005_Height` <dbl>

Tidying data with tidyr

Is this data tidy? If so, what does a row represent? What does a column represent?

face
# A tibble: 1,991 × 9
     Rep Treat Clone    ID `2001_Height` `2002_Height` `2003_Height`
   <dbl> <dbl> <dbl> <dbl>         <dbl>         <dbl>         <dbl>
 1     1     1     8    45            NA            NA            NA
 2     1     1   216    44           547           622           715
 3     1     1     8    43           273           275           305
 4     1     1   216    42           526           619           720
 5     1     1   216    54           328           341           364
 6     1     1   271    55           543           590           634
 7     1     1   271    56           450           502           587
 8     1     1     8    57           217           227           256
 9     1     1   259    58           158           155            NA
10     1     1   271    59           230           241           260
# ℹ 1,981 more rows
# ℹ 2 more variables: `2004_Height` <dbl>, `2005_Height` <dbl>

How do we make this data tidy? 🤔

face
# A tibble: 1,991 × 9
     Rep Treat Clone    ID `2001_Height` `2002_Height` `2003_Height`
   <dbl> <dbl> <dbl> <dbl>         <dbl>         <dbl>         <dbl>
 1     1     1     8    45            NA            NA            NA
 2     1     1   216    44           547           622           715
 3     1     1     8    43           273           275           305
 4     1     1   216    42           526           619           720
 5     1     1   216    54           328           341           364
 6     1     1   271    55           543           590           634
 7     1     1   271    56           450           502           587
 8     1     1     8    57           217           227           256
 9     1     1   259    58           158           155            NA
10     1     1   271    59           230           241           260
# ℹ 1,981 more rows
# ℹ 2 more variables: `2004_Height` <dbl>, `2005_Height` <dbl>
  • Each row should represent a measurement for a given rep, treatment, clone, and ID. But currently, we have multiple measurements on the same row.

  • Further, we have multiple columns for the “height” variable.

  • Not tidy!

We need to pivot the data into a longer format!

Enter, pivot_longer().

pivot_longer(), a function from tidyr takes four key arguments:

  • .data: the data (tibble) you’d like to pivot,
  • cols: the columns you’d like to pivot,
  • names_to: the new column that will be created which takes the column names from cols as values, and
  • values_to: the new column that will be created which takes the column values from cols as values.

Why the long face?

face_long <- face %>%
  pivot_longer(
    cols = c("2001_Height", "2002_Height", "2003_Height", 
             "2004_Height", "2005_Height"),
    names_to = "Year_Type",
    values_to = "Height_cm")

Why the long face?

face_long <- face %>%
  pivot_longer(
    cols = c("2001_Height", "2002_Height", "2003_Height", 
             "2004_Height", "2005_Height"),
    names_to = "Year_Type",
    values_to = "Height_cm")

Why the long face?

face_long <- face %>%
  pivot_longer(
    cols = contains("Height"),
    names_to = "Year_Type",
    values_to = "Height_cm")
  • A cleaner way to select these columns is to use dplyr’s contains() function.

Why the long face?

face_long
# A tibble: 9,955 × 6
     Rep Treat Clone    ID Year_Type   Height_cm
   <dbl> <dbl> <dbl> <dbl> <chr>           <dbl>
 1     1     1     8    45 2001_Height        NA
 2     1     1     8    45 2002_Height        NA
 3     1     1     8    45 2003_Height        NA
 4     1     1     8    45 2004_Height        NA
 5     1     1     8    45 2005_Height        NA
 6     1     1   216    44 2001_Height       547
 7     1     1   216    44 2002_Height       622
 8     1     1   216    44 2003_Height       715
 9     1     1   216    44 2004_Height       716
10     1     1   216    44 2005_Height       817
# ℹ 9,945 more rows
  • To get tidy data!

Going (back) to wide data

  • Sometimes, we need to “widen” a dataset to get it into tidy format.
  • For this example, we will just widen the face_long dataset back to its original form.
  • tidyr has an aptly named function, pivot_wider().
  • Key arguments of pivot_wider():
    • .data: the tibble to widen,
    • names_from: the column that contains values which will be assigned as the new column names,
    • values_from: the column that contains

Pivoting wider

face_wide <- face_long %>%
  pivot_wider(names_from = "Year_Type", values_from = "Height_cm")

all_equal(face, face_wide)
[1] TRUE
  • This results in the same tibble that we started with!

Other useful tidyr functions

  • unite() for pasteing column values together with specified seperators,
  • The separate_wider_*() family: for splitting columns into multiple new columns:
    • separate_wider_delim(): separate by delimiter
    • separate_wider_position(): separate by position
    • separate_wider_regex(): separate by regular expression

unite(): examples

face_long
# A tibble: 9,955 × 6
     Rep Treat Clone    ID Year_Type   Height_cm
   <dbl> <dbl> <dbl> <dbl> <chr>           <dbl>
 1     1     1     8    45 2001_Height        NA
 2     1     1     8    45 2002_Height        NA
 3     1     1     8    45 2003_Height        NA
 4     1     1     8    45 2004_Height        NA
 5     1     1     8    45 2005_Height        NA
 6     1     1   216    44 2001_Height       547
 7     1     1   216    44 2002_Height       622
 8     1     1   216    44 2003_Height       715
 9     1     1   216    44 2004_Height       716
10     1     1   216    44 2005_Height       817
# ℹ 9,945 more rows

unite(): examples

face_long %>%
  unite(col = "Design", Rep, Treat, Clone, sep = "_")
# A tibble: 9,955 × 4
   Design     ID Year_Type   Height_cm
   <chr>   <dbl> <chr>           <dbl>
 1 1_1_8      45 2001_Height        NA
 2 1_1_8      45 2002_Height        NA
 3 1_1_8      45 2003_Height        NA
 4 1_1_8      45 2004_Height        NA
 5 1_1_8      45 2005_Height        NA
 6 1_1_216    44 2001_Height       547
 7 1_1_216    44 2002_Height       622
 8 1_1_216    44 2003_Height       715
 9 1_1_216    44 2004_Height       716
10 1_1_216    44 2005_Height       817
# ℹ 9,945 more rows

unite(): examples

face_long %>%
  unite(col = "Design", Rep, Treat, Clone, sep = ".")
# A tibble: 9,955 × 4
   Design     ID Year_Type   Height_cm
   <chr>   <dbl> <chr>           <dbl>
 1 1.1.8      45 2001_Height        NA
 2 1.1.8      45 2002_Height        NA
 3 1.1.8      45 2003_Height        NA
 4 1.1.8      45 2004_Height        NA
 5 1.1.8      45 2005_Height        NA
 6 1.1.216    44 2001_Height       547
 7 1.1.216    44 2002_Height       622
 8 1.1.216    44 2003_Height       715
 9 1.1.216    44 2004_Height       716
10 1.1.216    44 2005_Height       817
# ℹ 9,945 more rows

unite(): examples

face_long_abc <- face_long %>%
  unite(col = "Design", Rep, Treat, Clone, sep = "abc")
face_long_abc
# A tibble: 9,955 × 4
   Design         ID Year_Type   Height_cm
   <chr>       <dbl> <chr>           <dbl>
 1 1abc1abc8      45 2001_Height        NA
 2 1abc1abc8      45 2002_Height        NA
 3 1abc1abc8      45 2003_Height        NA
 4 1abc1abc8      45 2004_Height        NA
 5 1abc1abc8      45 2005_Height        NA
 6 1abc1abc216    44 2001_Height       547
 7 1abc1abc216    44 2002_Height       622
 8 1abc1abc216    44 2003_Height       715
 9 1abc1abc216    44 2004_Height       716
10 1abc1abc216    44 2005_Height       817
# ℹ 9,945 more rows

separate_wider_delim() examples

face_long_abc
# A tibble: 9,955 × 4
   Design         ID Year_Type   Height_cm
   <chr>       <dbl> <chr>           <dbl>
 1 1abc1abc8      45 2001_Height        NA
 2 1abc1abc8      45 2002_Height        NA
 3 1abc1abc8      45 2003_Height        NA
 4 1abc1abc8      45 2004_Height        NA
 5 1abc1abc8      45 2005_Height        NA
 6 1abc1abc216    44 2001_Height       547
 7 1abc1abc216    44 2002_Height       622
 8 1abc1abc216    44 2003_Height       715
 9 1abc1abc216    44 2004_Height       716
10 1abc1abc216    44 2005_Height       817
# ℹ 9,945 more rows

separate_wider_delim() examples

face_long_abc %>%
  separate_wider_delim(cols = "Design", 
                       delim = "abc",
                       names = c("Rep", "Treat", "Clone"))
# A tibble: 9,955 × 6
   Rep   Treat Clone    ID Year_Type   Height_cm
   <chr> <chr> <chr> <dbl> <chr>           <dbl>
 1 1     1     8        45 2001_Height        NA
 2 1     1     8        45 2002_Height        NA
 3 1     1     8        45 2003_Height        NA
 4 1     1     8        45 2004_Height        NA
 5 1     1     8        45 2005_Height        NA
 6 1     1     216      44 2001_Height       547
 7 1     1     216      44 2002_Height       622
 8 1     1     216      44 2003_Height       715
 9 1     1     216      44 2004_Height       716
10 1     1     216      44 2005_Height       817
# ℹ 9,945 more rows

separate_wider_delim() examples

face_long
# A tibble: 9,955 × 6
     Rep Treat Clone    ID Year_Type   Height_cm
   <dbl> <dbl> <dbl> <dbl> <chr>           <dbl>
 1     1     1     8    45 2001_Height        NA
 2     1     1     8    45 2002_Height        NA
 3     1     1     8    45 2003_Height        NA
 4     1     1     8    45 2004_Height        NA
 5     1     1     8    45 2005_Height        NA
 6     1     1   216    44 2001_Height       547
 7     1     1   216    44 2002_Height       622
 8     1     1   216    44 2003_Height       715
 9     1     1   216    44 2004_Height       716
10     1     1   216    44 2005_Height       817
# ℹ 9,945 more rows
face_long %>%
  separate_wider_delim(cols = "Year_Type", 
                       delim = "_",
                       names = c("Year", NA))
# A tibble: 9,955 × 6
     Rep Treat Clone    ID Year  Height_cm
   <dbl> <dbl> <dbl> <dbl> <chr>     <dbl>
 1     1     1     8    45 2001         NA
 2     1     1     8    45 2002         NA
 3     1     1     8    45 2003         NA
 4     1     1     8    45 2004         NA
 5     1     1     8    45 2005         NA
 6     1     1   216    44 2001        547
 7     1     1   216    44 2002        622
 8     1     1   216    44 2003        715
 9     1     1   216    44 2004        716
10     1     1   216    44 2005        817
# ℹ 9,945 more rows

Next time

  • More reshaping data with tidyr