Combining multiple tibbles with dplyr

Practical Computing and Data Science Tools

Agenda

  • Combining tibbles (i.e. data joins)
  • Applications of data joins to forestry

Data Joins in R

Today, we’ll talk about joining tibbles in R with dplyr.

Artwork by Allison Horst.

Combining tibbles

Data Joins in R

Recall that the dplyr package, short for “data pliers”, is an R package all about wrangling data.

Today, we will explore it’s tools for joining data tables (i.e. “tibbles”).

Data Joins in R

In particular, we’ll look at a few ways to join the following tables x and y:

Motivation

  • It is common in forestry, and in particular forest inventory, to have multiple tibbles where data are stored due to a variety of factors.
  • In order to perform most statistical analyses, you must have the necessary data in one tibble.
  • Example: the US Forest Service, Forest Inventory & Analysis stores plot-level, tree-level, subplot-level, … data in one database. Researchers must combine these data into a singular table to do analyses.

Types of Data Joins

The dplyr package, which is part of the tidyverse, includes functions for two general types of joins:

  • Mutating joins, which combine the columns of tibbles x and y, and
  • Filtering joins, which match the rows of tibbles x and y.

Types of Data Joins

The dplyr package, which is part of the tidyverse, includes functions for two general types of joins:

  • Mutating joins, which combine the columns of tibbles x and y, and
  • Filtering joins, which match the rows of tibbles x and y.

Think of how mutate() adds columns to a tibble, while filter() removes rows.

Example Data

For the following examples of data joins, we will use the tibbles from the first slide. We can load this data into R:

library(tidyverse)
x <- tibble(spp_code = c("ABBA", "BEPA", "PIST"),
            common_name = c("balsum fir", "paper birch", "white pine"))

y <- tibble(spp_code = c("ABBA", "BEPA", "LALA"),
            latin_name = c("Abies balsamea", 
                           "Betula papyrifera",
                           "Larix laricina"))

Mutating Joins

dplyr contains four mutating joins:

  1. left_join(x, y) keeps all rows of x, but if a row in y does not match to x, an NA is assigned to that row in the new columns.

Mutating Joins

dplyr contains four mutating joins:

  1. left_join(x, y) keeps all rows of x, but if a row in y does not match to x, an NA is assigned to that row in the new columns.
  2. right_join(x, y) is equivalent to left_join(y, x), except for column order.

Mutating Joins

dplyr contains four mutating joins:

  1. left_join(x, y) keeps all rows of x, but if a row in y does not match to x, an NA is assigned to that row in the new columns.
  2. right_join(x, y) is equivalent to left_join(y, x), except for column order.
  3. inner_join(x, y) keeps only the rows matched between x and y.

Mutating Joins

dplyr contains four mutating joins:

  1. left_join(x, y) keeps all rows of x, but if a row in y does not match to x, an NA is assigned to that row in the new columns.
  2. right_join(x, y) is equivalent to left_join(y, x), except for column order.
  3. inner_join(x, y) keeps only the rows matched between x and y.
  4. full_join(x, y) keeps all rows of both x and y.

Examples: Mutating Joins

Recall our example data

x
# A tibble: 3 × 2
  spp_code common_name
  <chr>    <chr>      
1 ABBA     balsum fir 
2 BEPA     paper birch
3 PIST     white pine 
y
# A tibble: 3 × 2
  spp_code latin_name       
  <chr>    <chr>            
1 ABBA     Abies balsamea   
2 BEPA     Betula papyrifera
3 LALA     Larix laricina   

left_join()

left_join(x, y)
Joining with `by = join_by(spp_code)`
# A tibble: 3 × 3
  spp_code common_name latin_name       
  <chr>    <chr>       <chr>            
1 ABBA     balsum fir  Abies balsamea   
2 BEPA     paper birch Betula papyrifera
3 PIST     white pine  <NA>             

left_join()

left_join(x, y)
Joining with `by = join_by(spp_code)`
# A tibble: 3 × 3
  spp_code common_name latin_name       
  <chr>    <chr>       <chr>            
1 ABBA     balsum fir  Abies balsamea   
2 BEPA     paper birch Betula papyrifera
3 PIST     white pine  <NA>             

But what’s that message about?

"Joining with `by = join_by(spp_code)`"

We need to specify a key 🔑

A key is can just be thought of the name(s) of the column(s) you’re joining by. In the left_join() from the last slide, R assumed we were joining by the column spp_code since both x and y have a column with that name.

We need to specify a key 🔑

A key is can just be thought of the name(s) of the column(s) you’re joining by. In the left_join() from the last slide, R assumed we were joining by the column spp_code since both x and y have a column with that name.

Keys are important, especially when the columns you are joining by have different names, or you are joining by multiple columns.

left_join(), with a 🔑

left_join(x, y, by = "spp_code")
# A tibble: 3 × 3
  spp_code common_name latin_name       
  <chr>    <chr>       <chr>            
1 ABBA     balsum fir  Abies balsamea   
2 BEPA     paper birch Betula papyrifera
3 PIST     white pine  <NA>             
  • Notice that we specify this key with the by argument. This is the same for all joins in dplyr.

right_join()

right_join(y, x, by = "spp_code")
# A tibble: 3 × 3
  spp_code latin_name        common_name
  <chr>    <chr>             <chr>      
1 ABBA     Abies balsamea    balsum fir 
2 BEPA     Betula papyrifera paper birch
3 PIST     <NA>              white pine 

right_join()

right_join(y, x, by = "spp_code")
# A tibble: 3 × 3
  spp_code latin_name        common_name
  <chr>    <chr>             <chr>      
1 ABBA     Abies balsamea    balsum fir 
2 BEPA     Betula papyrifera paper birch
3 PIST     <NA>              white pine 

Notice that this is the same as our previous left_join().

What happens if we try switching the order of x and y?

right_join(x, y, by = "spp_code")

right_join()

right_join(y, x, by = "spp_code")
# A tibble: 3 × 3
  spp_code latin_name        common_name
  <chr>    <chr>             <chr>      
1 ABBA     Abies balsamea    balsum fir 
2 BEPA     Betula papyrifera paper birch
3 PIST     <NA>              white pine 

Notice that this is the same as our previous left_join().

What happens if we try switching the order of x and y?

right_join(x, y, by = "spp_code")
# A tibble: 3 × 3
  spp_code common_name latin_name       
  <chr>    <chr>       <chr>            
1 ABBA     balsum fir  Abies balsamea   
2 BEPA     paper birch Betula papyrifera
3 LALA     <NA>        Larix laricina   

inner_join()

How many rows will the output have?

inner_join(x, y, by = "spp_code")

inner_join()

How many rows will the output have?

inner_join(x, y, by = "spp_code")
# A tibble: 2 × 3
  spp_code common_name latin_name       
  <chr>    <chr>       <chr>            
1 ABBA     balsum fir  Abies balsamea   
2 BEPA     paper birch Betula papyrifera

Why is this the result?

full_join()

How many rows will the output have?

full_join(x, y, by = "spp_code")

full_join()

How many rows will the output have?

full_join(x, y, by = "spp_code")
# A tibble: 4 × 3
  spp_code common_name latin_name       
  <chr>    <chr>       <chr>            
1 ABBA     balsum fir  Abies balsamea   
2 BEPA     paper birch Betula papyrifera
3 PIST     white pine  <NA>             
4 LALA     <NA>        Larix laricina   

Why is this the result?

Filtering Joins

dplyr contains two filtering joins:

  1. semi_join(x, y) keeps all the rows in x that have a match in y.

Filtering Joins

dplyr contains two filtering joins:

  1. semi_join(x, y) keeps all the rows in x that have a match in y.
  2. anti_join(x, y) removes all the rows in x that have a match in y.

Note: Unlike mutating joins, filtering joins do not add any columns to the data.

semi_join()

How many rows will this semi_join return? How many columns?

semi_join(x, y, by = "spp_code")

semi_join()

How many rows will this semi_join return? How many columns?

semi_join(x, y, by = "spp_code")
# A tibble: 2 × 2
  spp_code common_name
  <chr>    <chr>      
1 ABBA     balsum fir 
2 BEPA     paper birch

semi_join()

What about this semi_join? Will it be the same as semi_join(x, y)?

semi_join(y, x, by = "spp_code")

semi_join()

What about this semi_join? Will it be the same as semi_join(x, y)?

semi_join(y, x, by = "spp_code")
# A tibble: 2 × 2
  spp_code latin_name       
  <chr>    <chr>            
1 ABBA     Abies balsamea   
2 BEPA     Betula papyrifera

anti_join()

Let’s see what anti_join does:

anti_join(x, y, by = "spp_code")

anti_join()

Let’s see what anti_join does:

anti_join(x, y, by = "spp_code")
# A tibble: 1 × 2
  spp_code common_name
  <chr>    <chr>      
1 PIST     white pine 

Why do we get this output?

anti_join()

What happens if we switch the order of x and y?

anti_join(y, x, by = "spp_code")

anti_join()

What happens if we switch the order of x and y?

anti_join(y, x, by = "spp_code")
# A tibble: 1 × 2
  spp_code latin_name    
  <chr>    <chr>         
1 LALA     Larix laricina

An Important Subtlety: Column Names

So far, we have joined x and y by the spp_code column.

But what if y had the same column named differently:

y <- y %>%
  rename(species_code = spp_code)
y
# A tibble: 3 × 2
  species_code latin_name       
  <chr>        <chr>            
1 ABBA         Abies balsamea   
2 BEPA         Betula papyrifera
3 LALA         Larix laricina   

How Do We Join x and y?

left_join(x, y)
Error in `left_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.

Looks like we need to specify by (our 🔑)

How Do We Join x and y?

left_join(x, y)
Error in `left_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.

Looks like we need to specify by (our 🔑)

left_join(x, y, by = "spp_code")
Error in `left_join()`:
! Join columns in `y` must be present in the data.
✖ Problem with `spp_code`.

Still not working!

How Do We Join x and y?

left_join(x, y)
Error in `left_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.

Looks like we need to specify by (our 🔑)

left_join(x, y, by = "spp_code")
Error in `left_join()`:
! Join columns in `y` must be present in the data.
✖ Problem with `spp_code`.

Still not working!

left_join(x, y, by = c("spp_code" = "species_code"))
# A tibble: 3 × 3
  spp_code common_name latin_name       
  <chr>    <chr>       <chr>            
1 ABBA     balsum fir  Abies balsamea   
2 BEPA     paper birch Betula papyrifera
3 PIST     white pine  <NA>             

Applications of data joins in forestry

Today: Applying Allometric Equations Using Joins

Allometric equations are regression equations that relate measurements like species, DBH, and perhaps height to more difficult and expensive to measure quantities such as stem volume or biomass.

Today: Applying Allometric Equations Using Joins

Allometric equations are regression equations that relate measurements like species, DBH, and perhaps height to more difficult and expensive to measure quantities such as stem volume or biomass.

  • Allometric equations take the inexpensive measurements as input (e.g., species and DBH) and return estimates of the expensive quantity (e.g., height, volume, biomass)

Today: Applying Allometric Equations Using Joins

Allometric equations are regression equations that relate measurements like species, DBH, and perhaps height to more difficult and expensive to measure quantities such as stem volume or biomass.

  • Allometric equations take the inexpensive measurements as input (e.g., species and DBH) and return estimates of the expensive quantity (e.g., height, volume, biomass)

  • Today, we’ll focus on calculating tree height based on DBH and species.

The Model

The model we’ll use to calculate height is as follows \[ \text{height} = 4.5 + \exp\left( \beta_1 + \frac{\beta_2}{\text{DBH} + 1.0} \right) \] where \(\beta_1\), \(\beta_2\) are stored in “datasets/FVS_NE_coefficients.csv” and are different for each forest vegetation code.

We’ll access data from a few tibbles in order to create our final height calculation.

The Data: stands

library(tidyverse)
stands <- read_csv("../labs/datasets/two_stands_wo_ht_vol.csv")
head(stands)
# A tibble: 6 × 4
  stand_id plot_id scientific_name   DBH_in
     <dbl>   <dbl> <chr>              <dbl>
1        1       1 Abies balsamea      11.3
2        1       1 Pinus strobus        9.8
3        1       1 Pinus strobus       10.7
4        1       3 Betula papyrifera   15.4
5        1       3 Pinus strobus       13.1
6        2       1 Larix laricina       7.1
dim(stands)
[1] 15  4

The Data: stands

library(tidyverse)
stands <- read_csv("../labs/datasets/two_stands_wo_ht_vol.csv")
head(stands)
# A tibble: 6 × 4
  stand_id plot_id scientific_name   DBH_in
     <dbl>   <dbl> <chr>              <dbl>
1        1       1 Abies balsamea      11.3
2        1       1 Pinus strobus        9.8
3        1       1 Pinus strobus       10.7
4        1       3 Betula papyrifera   15.4
5        1       3 Pinus strobus       13.1
6        2       1 Larix laricina       7.1
dim(stands)
[1] 15  4

Big Picture: We’d like to calculate tree height for each row of this tibble. To do so, we’ll need to load another tibble.

The Data: FVS Coefficients (our \(\beta\)’s)

ht_coeffs <- read_csv("../labs/datasets/FVS_NE_coefficients.csv")
ht_coeffs
# A tibble: 108 × 3
   USFS_FVS_code beta_1 beta_2
   <chr>          <dbl>  <dbl>
 1 BF              4.51  -6.01
 2 TA              4.51  -6.01
 3 WS              4.51  -6.01
 4 RS              4.51  -6.01
 5 NS              4.51  -6.01
 6 BS              4.51  -6.01
 7 PI              4.51  -6.01
 8 RN              4.51  -6.01
 9 WP              4.61  -6.19
10 LP              4.69  -6.88
# ℹ 98 more rows

Q: How can we join these tibbles?

head(stands)
# A tibble: 6 × 4
  stand_id plot_id scientific_name   DBH_in
     <dbl>   <dbl> <chr>              <dbl>
1        1       1 Abies balsamea      11.3
2        1       1 Pinus strobus        9.8
3        1       1 Pinus strobus       10.7
4        1       3 Betula papyrifera   15.4
5        1       3 Pinus strobus       13.1
6        2       1 Larix laricina       7.1
head(ht_coeffs)
# A tibble: 6 × 3
  USFS_FVS_code beta_1 beta_2
  <chr>          <dbl>  <dbl>
1 BF              4.51  -6.01
2 TA              4.51  -6.01
3 WS              4.51  -6.01
4 RS              4.51  -6.01
5 NS              4.51  -6.01
6 BS              4.51  -6.01

A: We need another tibble!

A: We need another tibble!

The “spp_codes” tibble contains the columns needed to be able to link “stands” and “ht_coeffs”.

spp_codes <- read_csv("../labs/datasets/USFS_species_codes.csv")
head(spp_codes)
# A tibble: 6 × 5
  common_name   scientific_name USFS_FVS_code USFS_FIA_code PLANTS_code
  <chr>         <chr>           <chr>                 <dbl> <chr>      
1 balsam fir    Abies balsamea  BF                       12 ABBA       
2 tamarack      Larix laricina  TA                       71 LALA       
3 white spruce  Picea glauca    WS                       94 PIGL       
4 red spruce    Picea rubens    RS                       97 PIRU       
5 Norway spruce Picea abies     NS                       91 PIAB       
6 black spruce  Picea mariana   BS                       95 PIMA       

Now We Can Join Our Data

head(stands, n = 3)
# A tibble: 3 × 4
  stand_id plot_id scientific_name DBH_in
     <dbl>   <dbl> <chr>            <dbl>
1        1       1 Abies balsamea    11.3
2        1       1 Pinus strobus      9.8
3        1       1 Pinus strobus     10.7
stands <- stands %>%
  left_join(spp_codes, by = "scientific_name") %>%
  select(-common_name, -USFS_FIA_code, -PLANTS_code) # remove unecessary columns
head(stands, n = 3)
# A tibble: 3 × 5
  stand_id plot_id scientific_name DBH_in USFS_FVS_code
     <dbl>   <dbl> <chr>            <dbl> <chr>        
1        1       1 Abies balsamea    11.3 BF           
2        1       1 Pinus strobus      9.8 WP           
3        1       1 Pinus strobus     10.7 WP           
  • Now, “stands” contains the USFS_FVS_code column

Q: Why did we use a left_join() in the last slide?

Q: Why did we use a left_join() in the last slide?

A: We want to retain all the rows of “stands”, but only the rows in “spp_codes” that match in “stands”.

Q: How can we check if all rows had a scientific_name match?

Q: How can we check if all rows had a scientific_name match?

A: Use anti_join()!

Q: How can we check if all rows had a scientific_name match?

A: Use anti_join()!

anti_join(stands, spp_codes, by = "scientific_name")
# A tibble: 0 × 5
# ℹ 5 variables: stand_id <dbl>, plot_id <dbl>, scientific_name <chr>,
#   DBH_in <dbl>, USFS_FVS_code <chr>

Join \(\beta\)’s to “stands”

Join \(\beta\)’s to “stands”

Q: What type of join should we use?

Join \(\beta\)’s to “stands”

Q: What type of join should we use?

A: A left_join() (or a right_join(), if you’d like)

stands <- left_join(stands, ht_coeffs, by = "USFS_FVS_code")

# or alternatively
# stands <- right_join(ht_coeffs, stands, by = "USFS_FVS_code")

head(stands)
# A tibble: 6 × 7
  stand_id plot_id scientific_name   DBH_in USFS_FVS_code beta_1 beta_2
     <dbl>   <dbl> <chr>              <dbl> <chr>          <dbl>  <dbl>
1        1       1 Abies balsamea      11.3 BF              4.51  -6.01
2        1       1 Pinus strobus        9.8 WP              4.61  -6.19
3        1       1 Pinus strobus       10.7 WP              4.61  -6.19
4        1       3 Betula papyrifera   15.4 PB              4.44  -4.09
5        1       3 Pinus strobus       13.1 WP              4.61  -6.19
6        2       1 Larix laricina       7.1 TA              4.51  -6.01

Now We Can Make Our Calculation!

Recall the equation for height: \[ \text{height} = 4.5 + \exp\left( \beta_1 + \frac{\beta_2}{\text{DBH} + 1.0} \right) \]

Now We Can Make Our Calculation!

Recall the equation for height: \[ \text{height} = 4.5 + \exp\left( \beta_1 + \frac{\beta_2}{\text{DBH} + 1.0} \right) \]

stands <- stands %>%
  mutate(height_ft = 4.5 + exp( beta_1 + beta_2 / (DBH_in + 1) ) ) %>%
  select(-beta_1, -beta_2) # Don't need these anymore
head(stands)
# A tibble: 6 × 6
  stand_id plot_id scientific_name   DBH_in USFS_FVS_code height_ft
     <dbl>   <dbl> <chr>              <dbl> <chr>             <dbl>
1        1       1 Abies balsamea      11.3 BF                 60.2
2        1       1 Pinus strobus        9.8 WP                 61.1
3        1       1 Pinus strobus       10.7 WP                 63.6
4        1       3 Betula papyrifera   15.4 PB                 70.5
5        1       3 Pinus strobus       13.1 WP                 69.2
6        2       1 Larix laricina       7.1 TA                 47.7

Next time

  • Reshaping data with tidyr