Joining and Reshaping Data
FOR 128: Lab 9
Welcome
Welcome to Lab 9! Today, we’ll focus on data joins and reshaping data. Parts 1 and 2 of this lab focuses on building your dplyr
skills, especially in relation to joins. However, you’ll need to use some other dplyr
functions along the way. In Part 3 of the lab, we’ll focus on reshaping data with tidyr
.
Learning objectives
- Use
dplyr
joins - Use
tidyr
to reshape data
Deliverables (i.e., what to put in the lab drop box)
Upload your rendered PDF (lab_09.pdf
) and Quarto (lab_09.qmd
) document to the lab drop box. Make sure the Quarto document properly renders to PDF.
Collaborator(s)
List any collaborators you worked with below.
Exercise 0
Load any packages you’ll need for this lab below.
Part 1: Toy Data
For this part, you’ll first work with the small datasets x
and y
presented in class to get started with joins. After a few exercises, you’ll move to some real data from the US Forest Service Forest Inventory & Analysis (FIA) program to gain more experience joining complex datasets.
Exercise 1
Create the x
and y
datasets that we have used over the lectures related to data joins using the tibble()
function. These datasets are shown below for your convenience.
In R, your resulting tibbles should look as follows:
x
# A tibble: 3 × 2
spp_code common_name
<chr> <chr>
1 ABBA balsum fir
2 BEPA paper birch
3 PIST white pine
y
# A tibble: 3 × 2
spp_code latin_name
<chr> <chr>
1 ABBA Abies balsamea
2 BEPA Betula papyrifera
3 LALA Larix laricina
Exercise 2
Create a dataset called z
by joining x
and y
by their species code. Retain only the rows in x
and y
that have matching species codes. Print your resulting tibble z
. Justify which join you used.
# A tibble: 2 × 3
spp_code common_name latin_name
<chr> <chr> <chr>
1 ABBA balsum fir Abies balsamea
2 BEPA paper birch Betula papyrifera
Exercise 3
Now, create a dataset called w
that contains all the rows in x
, but only the rows in y
that have a species code that match to x
by joining x
and y
. Make sure the column order is spp_code
, common_name
, latin_name
Justify which join you used.
# A tibble: 3 × 3
spp_code common_name latin_name
<chr> <chr> <chr>
1 ABBA balsum fir Abies balsamea
2 BEPA paper birch Betula papyrifera
3 PIST white pine <NA>
Exercise 4
Next, with another join of x
and y
, create a dataset called w_clone
with the same rows, columns, and values as w
, but that has the latin_name
column before common_name
. Justify which join you used.
# A tibble: 3 × 3
spp_code latin_name common_name
<chr> <chr> <chr>
1 ABBA Abies balsamea balsum fir
2 BEPA Betula papyrifera paper birch
3 PIST <NA> white pine
Exercise 5
Use a filtering join to check if there are any species codes in y
that do not exist in x
. Are there any? If so, what is the species code?
# A tibble: 1 × 2
spp_code latin_name
<chr> <chr>
1 LALA Larix laricina
Part 2: FIA Data
For this part, you’ll use four dataset from the Forest Inventory & Analysis Program (FIA). These data are available on the course website in the “data_lab9.zip” file. The plot_biomass.csv
dataset contains information on the unique FIA plot ID (PLT_CN
) and a measurement of aboveground biomass for each plot. The plot_locations.csv
dataset contains location information for each of those same plots, along with the unique identifier (PLT_CN
). Notably, plot_locations.csv
has columns denoting the state and county the plot resides in. However, we don’t have the actual state and county names for each plot, just their unique code. We will have to do some data joins with the state_reference.csv
and county_reference.csv
datasets to find out where these plots are.
Exercise 6
Load in the four datasets included in the posting of Lab 9 on the course website. Name each of them the same as they are named as files, without the .csv
extension. For example, for the “plot_biomass.csv” file, name the object plot_biomass
, etc. Take a look at these datasets and try to understand them. What does a row represent in each one?
plot_biomass
# A tibble: 191 × 2
PLT_CN DRYBIO_AG_TPA
<dbl> <dbl>
1 1.75e14 122.
2 1.75e14 76.2
3 1.75e14 4.21
4 1.75e14 13.2
5 1.75e14 90.9
6 1.75e14 96.2
7 1.75e14 397.
8 1.75e14 2.54
9 1.75e14 290.
10 2.17e14 182.
# ℹ 181 more rows
plot_locations
# A tibble: 191 × 5
PLT_CN STATECD COUNTYFIPS LON_PUBLIC LAT_PUBLIC
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3.08e13 41 41051 -122. 45.6
2 4.12e14 41 41009 -123. 45.9
3 4.85e14 41 41009 -123. 45.8
4 3.08e13 41 41009 -123. 45.8
5 2.17e14 41 41009 -123. 46.0
6 4.85e14 41 41009 -123. 46.1
7 4.12e14 41 41051 -122. 45.5
8 3.08e13 41 41009 -123. 45.8
9 2.17e14 41 41009 -123. 45.9
10 5.59e14 41 41009 -123. 45.9
# ℹ 181 more rows
state_reference
# A tibble: 59 × 3
state_cd state_name state_abbr
<dbl> <chr> <chr>
1 1 Alabama AL
2 2 Alaska AK
3 4 Arizona AZ
4 5 Arkansas AR
5 6 California CA
6 8 Colorado CO
7 9 Connecticut CT
8 10 Delaware DE
9 11 District of Columbia DC
10 12 Florida FL
# ℹ 49 more rows
county_reference
# A tibble: 3,233 × 3
STATENM COUNTYNM COUNTYFIPS
<chr> <chr> <chr>
1 Ohio Highland 39071
2 California Alpine 06003
3 Florida Escambia 12033
4 Illinois Lawrence 17101
5 Mississippi Wayne 28153
6 Mississippi Tishomingo 28141
7 Montana Sanders 30089
8 New York Albany 36001
9 Pennsylvania Potter 42105
10 Missouri Greene 29077
# ℹ 3,223 more rows
Exercise 7
Now that you’ve got a sense of the data, join two datasets together to get both plot locations and plot biomass in the same tibble. Call this new tibble plots
. Make sure to specify what you are joining by, and justify which join you used. Your new dataset should have 191 rows and 6 columns. Take a glimpse()
at your new tibble.
glimpse(plots)
Rows: 191
Columns: 6
$ PLT_CN <dbl> 1.747633e+14, 1.747635e+14, 1.747635e+14, 1.747636e+14, …
$ DRYBIO_AG_TPA <dbl> 122.36957, 76.15349, 4.20973, 13.23662, 90.90198, 96.228…
$ STATECD <dbl> 41, 53, 53, 53, 53, 41, 41, 53, 41, 41, 41, 41, 41, 41, …
$ COUNTYFIPS <dbl> 41051, 53013, 53013, 53013, 53013, 41051, 41051, 53013, …
$ LON_PUBLIC <dbl> -121.9465, -117.8284, -117.6408, -117.7454, -117.7872, -…
$ LAT_PUBLIC <dbl> 45.63014, 46.08317, 46.24816, 46.22776, 46.10631, 45.530…
Exercise 8
Now you have your plot location information and plot biomass measurements in one tibble, but we still don’t know where these plots are. Add some columns to plots
by joining it with state_reference
in a way such that each row of plots
now has a state name and abbreviation associated with it. Use dplyr
to figure out what state(s) are these plots in?
plots
# A tibble: 191 × 8
PLT_CN DRYBIO_AG_TPA STATECD COUNTYFIPS LON_PUBLIC LAT_PUBLIC state_name
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1.75e14 122. 41 41051 -122. 45.6 Oregon
2 1.75e14 76.2 53 53013 -118. 46.1 Washington
3 1.75e14 4.21 53 53013 -118. 46.2 Washington
4 1.75e14 13.2 53 53013 -118. 46.2 Washington
5 1.75e14 90.9 53 53013 -118. 46.1 Washington
6 1.75e14 96.2 41 41051 -122. 45.5 Oregon
7 1.75e14 397. 41 41051 -122. 45.5 Oregon
8 1.75e14 2.54 53 53013 -118. 46.3 Washington
9 1.75e14 290. 41 41051 -122. 45.5 Oregon
10 2.17e14 182. 41 41051 -122. 45.6 Oregon
# ℹ 181 more rows
# ℹ 1 more variable: state_abbr <chr>
Exercise 9
Use a filtering join to see if any rows of plots
do not have a match in state_reference
. Are there any? If so, how many?
# A tibble: 0 × 8
# ℹ 8 variables: PLT_CN <dbl>, DRYBIO_AG_TPA <dbl>, STATECD <dbl>,
# COUNTYFIPS <dbl>, LON_PUBLIC <dbl>, LAT_PUBLIC <dbl>, state_name <chr>,
# state_abbr <chr>
Exercise 10
Now you’ve added biomass and state information to your plot data. Great job! We have one more dataset to join. Join the county_reference
tibble to plots
so you can see what county each FIA plot is in. Hint: you might get an error here. Make sure to check the column classes you are joining by and change them as needed. You should have 191 rows and 10 columns in plots
after this join. What counties are these plots in?
plots
# A tibble: 191 × 10
PLT_CN DRYBIO_AG_TPA STATECD COUNTYFIPS LON_PUBLIC LAT_PUBLIC state_name
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 1.75e14 122. 41 41051 -122. 45.6 Oregon
2 1.75e14 76.2 53 53013 -118. 46.1 Washington
3 1.75e14 4.21 53 53013 -118. 46.2 Washington
4 1.75e14 13.2 53 53013 -118. 46.2 Washington
5 1.75e14 90.9 53 53013 -118. 46.1 Washington
6 1.75e14 96.2 41 41051 -122. 45.5 Oregon
7 1.75e14 397. 41 41051 -122. 45.5 Oregon
8 1.75e14 2.54 53 53013 -118. 46.3 Washington
9 1.75e14 290. 41 41051 -122. 45.5 Oregon
10 2.17e14 182. 41 41051 -122. 45.6 Oregon
# ℹ 181 more rows
# ℹ 3 more variables: state_abbr <chr>, STATENM <chr>, COUNTYNM <chr>
Exercise 11
Now you’ve wrangled all your data into one tibble. Great job! Use a series of piped dplyr
functions to compute the average biomass in each county.
# A tibble: 3 × 2
COUNTYNM avg_bio
<chr> <dbl>
1 Clark 92.4
2 Columbia 60.6
3 Multnomah 187.
Part 3: Reshaping Data
You’ll find the “example_plot_data.csv” data file in the “data_lab9.zip” file on the course website. It holds tree data measured on 2 1/20-th acre plots.
Your goal is to make a stand table with species as rows and 4-inch DBH classes as columns and values of trees per acre. These trees per acre values will be an average from the per acre values recorded for each plot. This average is our best estimate for the trees per acre for the area sampled using the two plots.
Exercise 12
Read in the “example_plot_data.csv” into a tibble called example_plots
.
example_plots
# A tibble: 8 × 3
plot_id species dbh_in
<dbl> <chr> <dbl>
1 1 Basswood 5.1
2 1 Basswood 5.5
3 1 Beech 10.1
4 1 Red oak 5
5 2 Beech 11.5
6 2 Beech 12.2
7 2 Basswood 6
8 2 Red oak 10
Exercise 13
Add two new columns to the example_plots
tibble. The first dbh_4in
holds the 4-inch DBH class the given tree falls in. The second TF
holds the tree factor for each tree (i.e., how many trees per acre that one tree represents). Add both columns with a single call to mutate()
.
Note that since these are 20 acre plots, the tree factor should be 20 for each tree.
HINT: When creating the dbh_4in
column, you can use cut_width()
from ggplot2
or a case_when()
inside of a mutate()
.
example_plots
# A tibble: 8 × 5
plot_id species dbh_in tf dbh_4in
<dbl> <chr> <dbl> <dbl> <fct>
1 1 Basswood 5.1 20 [2,6]
2 1 Basswood 5.5 20 [2,6]
3 1 Beech 10.1 20 (10,14]
4 1 Red oak 5 20 [2,6]
5 2 Beech 11.5 20 (10,14]
6 2 Beech 12.2 20 (10,14]
7 2 Basswood 6 20 [2,6]
8 2 Red oak 10 20 (6,10]
Exercise 14
Compute a stand table for plot
1 and 2. They should look like the tables below.
Plot 1
# A tibble: 3 × 4
# Groups: species [3]
species `[2,6]` `(6,10]` `(10,14]`
<chr> <dbl> <dbl> <dbl>
1 Basswood 40 0 0
2 Beech 0 0 20
3 Red oak 20 0 0
Plot 2
# A tibble: 3 × 4
# Groups: species [3]
species `[2,6]` `(6,10]` `(10,14]`
<chr> <dbl> <dbl> <dbl>
1 Basswood 20 0 0
2 Beech 0 0 40
3 Red oak 0 20 0
Exercise 15
The stand table estimate we want combines information from each plot stand table by averaging their cell values. For example, plot 1 has 40 Basswood in the [2,6] in DBH class, and Plot 2 has 20 Basswood in the [2,6] in DBH class. So, best estimate of Basswood in the [2,6] in DBH class should be an average of 40 and 20 (i.e., mean(c(40,20))
) which is 30.
The code below computes the desired stand table estimate (i.e., averages across plots) all at once. Your job is to study this code and convince yourself you understand it and can replicate it for different data. Take it apart. Run it bit by bit. Reverse engineer it to understand how it works. Use math by hand to confirm the average from the two tables above are the values you see in the table below.
%>%
example_plots complete(plot_id, species, dbh_4in, fill = list(tf = 0)) %>%
group_by(plot_id, species, dbh_4in) %>%
summarize(tf_plot_level = sum(tf)) %>% ## Plot-level summary of trees per acre.
group_by(species, dbh_4in) %>%
summarise(mean_tpa = mean(tf_plot_level)) %>% ## Forest-level mean of trees per acre.
pivot_wider(names_from = dbh_4in, values_from = mean_tpa)
`summarise()` has grouped output by 'plot_id', 'species'. You can override
using the `.groups` argument.
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 3 × 4
# Groups: species [3]
species `[2,6]` `(6,10]` `(10,14]`
<chr> <dbl> <dbl> <dbl>
1 Basswood 30 0 0
2 Beech 0 0 30
3 Red oak 10 10 0
Wrap up
Congratulations! You’ve made it to the end of Lab 9. Make sure to render your final document and submit both the .pdf and .qmd file to D2L.