Advanced Data Wrangling with dplyr
FOR 128: Lab 8
Welcome
Welcome to Lab 8! Today, we’ll focus on writing extensive dplyr
code on larger data. We will both use the verbs individually and “write a sentence” with the verbs by stringing them together with pipes.
Learning objectives
- Use
dplyr
verbs together with pipes on larger datasets.
Deliverables (i.e., what to put in the lab drop box)
Upload your rendered PDF (lab_08.pdf
) and Quarto (lab_08.qmd
) document to the lab drop box. Make sure the Quarto document properly renders to PDF.
Collaborator(s)
List any collaborators you worked with below.
Introduction
Data: pdxTrees
The pdxTrees
R package contains measurements of every tree in the Portland, Oregon metro area. In particular, it contains two datasets, which we will call parks
and streets
. The parks
dataset contains all of the trees in 174 parks in Portland. The streets
dataset contains all the trees on the streets of Portland. The hex sticker for pdxTrees
(as seen below) contains a fun easter egg from the city of Portland, as its graphics are done in a similar fashion to the famous airport carpet.
Methods: dplyr
Use dplyr
function that we’ve learned over the past few weeks to answer the questions in this lab.
Exercise 0
Load any packages you’ll need for this lab below. Note, you’ll need to install pdxTrees
with install.packages("pdxTrees")
before you can load it with library(pdxTrees)
.
Exercise 1
Part (a)
Load the data. For this lab, we will load both the parks
and streets
datasets. To do so, run the get_pdxTrees_parks()
and get_pdxTrees_streets()
functions (note, you don’t need to specify any arguments for these functions) and assign them to parks
and streets
, respectively.
Your datasets should look like this when you print parks
and streets
as follows:
parks
# A tibble: 25,534 × 34
Longitude Latitude UserID Genus Family DBH Inventory_Date Species
<dbl> <dbl> <chr> <chr> <chr> <dbl> <dttm> <chr>
1 -123. 45.6 1 Pseudotsu… Pinac… 37.4 2017-05-09 00:00:00 PSME
2 -123. 45.6 2 Pseudotsu… Pinac… 32.5 2017-05-09 00:00:00 PSME
3 -123. 45.6 3 Crataegus Rosac… 9.7 2017-05-09 00:00:00 CRLA
4 -123. 45.6 4 Quercus Fagac… 10.3 2017-05-09 00:00:00 QURU
5 -123. 45.6 5 Pseudotsu… Pinac… 33.2 2017-05-09 00:00:00 PSME
6 -123. 45.6 6 Pseudotsu… Pinac… 32.1 2017-05-09 00:00:00 PSME
7 -123. 45.6 7 Pseudotsu… Pinac… 28.4 2017-05-09 00:00:00 PSME
8 -123. 45.6 8 Pseudotsu… Pinac… 27.2 2017-05-09 00:00:00 PSME
9 -123. 45.6 9 Pseudotsu… Pinac… 35.2 2017-05-09 00:00:00 PSME
10 -123. 45.6 10 Pseudotsu… Pinac… 32.4 2017-05-09 00:00:00 PSME
# ℹ 25,524 more rows
# ℹ 26 more variables: Common_Name <chr>, Condition <chr>, Tree_Height <dbl>,
# Crown_Width_NS <dbl>, Crown_Width_EW <dbl>, Crown_Base_Height <dbl>,
# Collected_By <chr>, Park <chr>, Scientific_Name <chr>,
# Functional_Type <chr>, Mature_Size <fct>, Native <chr>, Edible <chr>,
# Nuisance <chr>, Structural_Value <dbl>, Carbon_Storage_lb <dbl>,
# Carbon_Storage_value <dbl>, Carbon_Sequestration_lb <dbl>, …
streets
# A tibble: 218,602 × 23
UserID Inventory_Date Species DBH Condition Site_Type Site_Width Wires
<chr> <dttm> <chr> <dbl> <chr> <chr> <dbl> <chr>
1 1 2010-08-21 00:00:00 PRPL 6.10 Good <NA> 9 High…
2 2 2010-08-21 00:00:00 ACMA 36.3 Poor <NA> 9 High…
3 3 2010-08-21 00:00:00 PLAC 20.5 Fair <NA> 9 No HV
4 4 2010-08-21 00:00:00 CO 8.20 Fair <NA> 9 No HV
5 5 2010-08-21 00:00:00 ACMA 55.7 Good <NA> 9 No HV
6 6 2010-08-21 00:00:00 ACPL 18.6 Good <NA> 9 No HV
7 7 2010-08-21 00:00:00 ACPL 18.8 Good <NA> 9 No HV
8 8 2010-08-21 00:00:00 ACPL 16 Good <NA> 9 No HV
9 9 2010-08-21 00:00:00 QU 6.20 Good <NA> 9 No HV
10 10 2010-08-21 00:00:00 ACPL 19.3 Fair <NA> 9 No HV
# ℹ 218,592 more rows
# ℹ 15 more variables: Site_Development <chr>, Site_Size <chr>, Notes <chr>,
# Address <chr>, Neighborhood <chr>, Collected_By <chr>, Scientific <chr>,
# Family <chr>, Genus <chr>, Common_Name <chr>, Functional_Type <chr>,
# Edible <chr>, Longitude <dbl>, Latitude <dbl>, Mature_Size <fct>
Part (b)
Take a look at the documentation by running ?get_pdxTrees_parks
and ?get_pdxTrees_streets
in your console, and tell me about two columns (what are their names? what do they measure? units? factor levels? etc.) from each of the datasets (four total).
Exercise 2
Find how many trees of each species (use the common name column for species here and throughout the lab) are in the parks
dataset, and order the data from greatest number of trees in a species to least. Notice I’ve called the number of trees num_trees
.
# A tibble: 304 × 2
Common_Name num_trees
<chr> <int>
1 Douglas-Fir 6783
2 Norway Maple 1502
3 Western Redcedar 964
4 Northern Red Oak 736
5 Pin Oak 619
6 Incense Cedar 507
7 Bigleaf Maple 490
8 Japanese Flowering Cherry 418
9 American Elm 379
10 Giant Sequoia 349
# ℹ 294 more rows
Exercise 3
A staple of Portland living in the springtime is to go out and see the Japanese Flowering Cherry trees beginning to bloom. Find the park with the most Japanese Flowering Cherry trees, and how many there are in that park.
# A tibble: 1 × 2
Park num_cherries
<chr> <int>
1 Gov Tom McCall Waterfront Park 104
Exercise 4
Part (a)
Find the top 5 parks with the most total DBH (i.e. the sum of all the trees DBH in those parks).
# A tibble: 5 × 2
Park total_DBH
<chr> <dbl>
1 Mt Tabor Park 31965.
2 Columbia Park 26952.
3 Laurelhurst Park 26890
4 Washington Park 16081.
5 John Luby Park 14519.
Mt. Tabor Park is one of my personal favorites in Portland, I used to live just a few blocks away. It is a wonderful place to watch the sun set over the West hills of the city. Not only does it have the most DBH in all of the parks in Portland, but Portland’s drinking water is stored in giant reservoirs at that park!
Part (b)
Find the top 5 parks with the highest DBH per tree (in other words, the 5 highest mean DBHs), along with how many trees are in those parks.
# A tibble: 5 × 3
Park mean_DBH number_of_trees
<chr> <dbl> <int>
1 Heritage Tree Park 59 1
2 Johnswood Property 43.1 18
3 Patton Square Park 37.0 13
4 Hancock Park 32.5 104
5 Laurelwood Park 31.2 13
Wow! Heritage Tree Park has one giant tree!
Exercise 5
There is a column in the parks
dataset that represents the amount of carbon (in lbs) each tree has sequestered. Find the park with the highest carbon sequestration.
# A tibble: 1 × 2
Park carbon_sequestered
<chr> <dbl>
1 Mt Tabor Park 36515.
Nice job, Mt. Tabor!
Exercise 6
The parks
dataset has a wide variety of numeric columns measuring things from DBH to pollution removal value of a given tree. Take the mean of all of these numeric columns, grouped by park. Make sure to remove NAs in the mean()
function.
HINT: In across()
you can set .cols = where(is.numeric)
to apply a function across all numeric columns.
# A tibble: 174 × 18
Park Longitude Latitude DBH Tree_Height Crown_Width_NS Crown_Width_EW
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Albert Ke… -123. 45.5 20.5 83.7 32.3 31.8
2 Alberta P… -123. 45.6 28.4 96.6 45.5 43.4
3 Ankeny Pl… -123. 45.5 30.8 69 62 62.2
4 April Hil… -123. 45.5 18.9 47.9 34.0 33
5 Arbor Lod… -123. 45.6 19.2 49.4 39.8 39.7
6 Argay Park -123. 45.6 19.6 46.2 35.3 35.3
7 Berkeley … -123. 45.5 26.7 72.9 44.9 47.7
8 Berrydale… -123. 45.5 23.6 69.8 41.3 39.3
9 Bloomingt… -123. 45.5 24.1 62.6 47.9 48.2
10 Brentwood… -123. 45.5 17.4 40.8 29.7 30.8
# ℹ 164 more rows
# ℹ 11 more variables: Crown_Base_Height <dbl>, Structural_Value <dbl>,
# Carbon_Storage_lb <dbl>, Carbon_Storage_value <dbl>,
# Carbon_Sequestration_lb <dbl>, Carbon_Sequestration_value <dbl>,
# Stormwater_ft <dbl>, Stormwater_value <dbl>, Pollution_Removal_value <dbl>,
# Pollution_Removal_oz <dbl>, Total_Annual_Services <dbl>
Exercise 7
For this exercise, consider the streets
dataset.
Part (a)
There is a column delineating the edibility of the trees in the dataset. Find the possible values for this column and how many trees correspond to each value.
# A tibble: 3 × 2
Edible n_rows
<chr> <int>
1 fruit 3649
2 no 211954
3 nut 2999
Part (b)
Create a new tibble called forager
that contains all rows of the streets
with edible fruits or nuts but only the columns denoting the common name, neighborhood, condition, and edibility.
forager
# A tibble: 6,648 × 4
Common_Name Neighborhood Condition Edible
<chr> <chr> <chr> <chr>
1 Persimmon Concordia Good fruit
2 Walnut Concordia Fair nut
3 Walnut Concordia Fair nut
4 Walnut Concordia Fair nut
5 Walnut Concordia Good nut
6 Walnut Concordia Fair nut
7 Walnut Concordia Fair nut
8 Walnut Concordia Poor nut
9 Walnut Concordia Fair nut
10 Fig Concordia Poor fruit
# ℹ 6,638 more rows
Part (c)
Create a new logical column and save it in the forager
dataset. Name it fruit_edible
. The values in this columns should be TRUE
or FALSE
depending on if the given tree has edible fruits or not
forager
# A tibble: 6,648 × 5
Common_Name Neighborhood Condition Edible fruit_edible
<chr> <chr> <chr> <chr> <lgl>
1 Persimmon Concordia Good fruit TRUE
2 Walnut Concordia Fair nut FALSE
3 Walnut Concordia Fair nut FALSE
4 Walnut Concordia Fair nut FALSE
5 Walnut Concordia Good nut FALSE
6 Walnut Concordia Fair nut FALSE
7 Walnut Concordia Fair nut FALSE
8 Walnut Concordia Poor nut FALSE
9 Walnut Concordia Fair nut FALSE
10 Fig Concordia Poor fruit TRUE
# ℹ 6,638 more rows
Part (d)
Arrange the forager dataset by the column you just created in (c), so that the TRUE
values are at the top, and save this change to the forager
dataset.
forager
# A tibble: 6,648 × 5
Common_Name Neighborhood Condition Edible fruit_edible
<chr> <chr> <chr> <chr> <lgl>
1 Persimmon Concordia Good fruit TRUE
2 Fig Concordia Poor fruit TRUE
3 Fig Concordia Good fruit TRUE
4 Fig Concordia Fair fruit TRUE
5 Fig Concordia Poor fruit TRUE
6 Fig Concordia Good fruit TRUE
7 Fig Concordia Fair fruit TRUE
8 Persimmon Concordia Good fruit TRUE
9 Persimmon Concordia Fair fruit TRUE
10 Persimmon Concordia Fair fruit TRUE
# ℹ 6,638 more rows
Part (e)
The Edible
column in named in a bit of a funny way. Rename the column to Edible_Component
and save this change to the forager
dataset.
forager
# A tibble: 6,648 × 5
Common_Name Neighborhood Condition Edible_Component fruit_edible
<chr> <chr> <chr> <chr> <lgl>
1 Persimmon Concordia Good fruit TRUE
2 Fig Concordia Poor fruit TRUE
3 Fig Concordia Good fruit TRUE
4 Fig Concordia Fair fruit TRUE
5 Fig Concordia Poor fruit TRUE
6 Fig Concordia Good fruit TRUE
7 Fig Concordia Fair fruit TRUE
8 Persimmon Concordia Good fruit TRUE
9 Persimmon Concordia Fair fruit TRUE
10 Persimmon Concordia Fair fruit TRUE
# ℹ 6,638 more rows
Part (f)
Which five neighborhoods would a fruit loving forager most like to live (in other words, what are the five neighborhoods with the most trees with edible fruit)?
HINT: using the sum()
function on logical data will make all the TRUE
s 1
s and all the FALSE
s 0
s.
# A tibble: 5 × 2
Neighborhood n_edible_fruit_trees
<chr> <int>
1 Montavilla 166
2 Sellwood-Moreland 154
3 Richmond 148
4 Rose City Park 127
5 Woodstock 125
The Sellwood-Moreland and Woodstock neighborhoods are the heart of southeast Portland (in my opinion). The pdxTrees
R package was written primarily in the Woodstock neighborhood, as a matter of fact. Sellwood-Moreland sits just to the west of Woodstock and has a variety of top-tier coffee shops, along with a gorgeous riverfront park with off-leash dogs allowed. Quite the place to be if you are a fruit forager, too!
Part (g)
Of the five neighborhoods found in part (e), what are the conditions of the edible fruit bearing trees (in other words, how many are in Good condition, Fair condition, etc., grouped by these five neighborhoods).
HINT: you can use group_by()
to group by multiple groups at once. Example: group_by(column1, column2)
.
`summarise()` has grouped output by 'Neighborhood'. You can override using the
`.groups` argument.
# A tibble: 15 × 3
# Groups: Neighborhood [5]
Neighborhood Condition num_fruit_trees
<chr> <chr> <int>
1 Montavilla Fair 92
2 Montavilla Good 63
3 Montavilla Poor 11
4 Richmond Fair 36
5 Richmond Good 104
6 Richmond Poor 8
7 Rose City Park Fair 71
8 Rose City Park Good 54
9 Rose City Park Poor 2
10 Sellwood-Moreland Fair 56
11 Sellwood-Moreland Good 89
12 Sellwood-Moreland Poor 9
13 Woodstock Fair 39
14 Woodstock Good 77
15 Woodstock Poor 9
Wrap up
Congratulations! You’ve made it to the end of Lab 8. Make sure to render your final document and submit both the .pdf and .qmd file to D2L.