Advanced Data Wrangling with dplyr

FOR 128: Lab 8

Published

October 30, 2024

Welcome

Welcome to Lab 8! Today, we’ll focus on writing extensive dplyr code on larger data. We will both use the verbs individually and “write a sentence” with the verbs by stringing them together with pipes.

Learning objectives

  • Use dplyr verbs together with pipes on larger datasets.

Deliverables (i.e., what to put in the lab drop box)

Upload your rendered PDF (lab_08.pdf) and Quarto (lab_08.qmd) document to the lab drop box. Make sure the Quarto document properly renders to PDF.

Collaborator(s)

List any collaborators you worked with below.

Introduction

Data: pdxTrees

The pdxTrees R package contains measurements of every tree in the Portland, Oregon metro area. In particular, it contains two datasets, which we will call parks and streets. The parks dataset contains all of the trees in 174 parks in Portland. The streets dataset contains all the trees on the streets of Portland. The hex sticker for pdxTrees (as seen below) contains a fun easter egg from the city of Portland, as its graphics are done in a similar fashion to the famous airport carpet.

Methods: dplyr

Use dplyr function that we’ve learned over the past few weeks to answer the questions in this lab.

Exercise 0

Load any packages you’ll need for this lab below. Note, you’ll need to install pdxTrees with install.packages("pdxTrees") before you can load it with library(pdxTrees).

Exercise 1

Part (a)

Load the data. For this lab, we will load both the parks and streets datasets. To do so, run the get_pdxTrees_parks() and get_pdxTrees_streets() functions (note, you don’t need to specify any arguments for these functions) and assign them to parks and streets, respectively.

Your datasets should look like this when you print parks and streets as follows:

parks
# A tibble: 25,534 × 34
   Longitude Latitude UserID Genus      Family   DBH Inventory_Date      Species
       <dbl>    <dbl> <chr>  <chr>      <chr>  <dbl> <dttm>              <chr>  
 1     -123.     45.6 1      Pseudotsu… Pinac…  37.4 2017-05-09 00:00:00 PSME   
 2     -123.     45.6 2      Pseudotsu… Pinac…  32.5 2017-05-09 00:00:00 PSME   
 3     -123.     45.6 3      Crataegus  Rosac…   9.7 2017-05-09 00:00:00 CRLA   
 4     -123.     45.6 4      Quercus    Fagac…  10.3 2017-05-09 00:00:00 QURU   
 5     -123.     45.6 5      Pseudotsu… Pinac…  33.2 2017-05-09 00:00:00 PSME   
 6     -123.     45.6 6      Pseudotsu… Pinac…  32.1 2017-05-09 00:00:00 PSME   
 7     -123.     45.6 7      Pseudotsu… Pinac…  28.4 2017-05-09 00:00:00 PSME   
 8     -123.     45.6 8      Pseudotsu… Pinac…  27.2 2017-05-09 00:00:00 PSME   
 9     -123.     45.6 9      Pseudotsu… Pinac…  35.2 2017-05-09 00:00:00 PSME   
10     -123.     45.6 10     Pseudotsu… Pinac…  32.4 2017-05-09 00:00:00 PSME   
# ℹ 25,524 more rows
# ℹ 26 more variables: Common_Name <chr>, Condition <chr>, Tree_Height <dbl>,
#   Crown_Width_NS <dbl>, Crown_Width_EW <dbl>, Crown_Base_Height <dbl>,
#   Collected_By <chr>, Park <chr>, Scientific_Name <chr>,
#   Functional_Type <chr>, Mature_Size <fct>, Native <chr>, Edible <chr>,
#   Nuisance <chr>, Structural_Value <dbl>, Carbon_Storage_lb <dbl>,
#   Carbon_Storage_value <dbl>, Carbon_Sequestration_lb <dbl>, …
streets
# A tibble: 218,602 × 23
   UserID Inventory_Date      Species   DBH Condition Site_Type Site_Width Wires
   <chr>  <dttm>              <chr>   <dbl> <chr>     <chr>          <dbl> <chr>
 1 1      2010-08-21 00:00:00 PRPL     6.10 Good      <NA>               9 High…
 2 2      2010-08-21 00:00:00 ACMA    36.3  Poor      <NA>               9 High…
 3 3      2010-08-21 00:00:00 PLAC    20.5  Fair      <NA>               9 No HV
 4 4      2010-08-21 00:00:00 CO       8.20 Fair      <NA>               9 No HV
 5 5      2010-08-21 00:00:00 ACMA    55.7  Good      <NA>               9 No HV
 6 6      2010-08-21 00:00:00 ACPL    18.6  Good      <NA>               9 No HV
 7 7      2010-08-21 00:00:00 ACPL    18.8  Good      <NA>               9 No HV
 8 8      2010-08-21 00:00:00 ACPL    16    Good      <NA>               9 No HV
 9 9      2010-08-21 00:00:00 QU       6.20 Good      <NA>               9 No HV
10 10     2010-08-21 00:00:00 ACPL    19.3  Fair      <NA>               9 No HV
# ℹ 218,592 more rows
# ℹ 15 more variables: Site_Development <chr>, Site_Size <chr>, Notes <chr>,
#   Address <chr>, Neighborhood <chr>, Collected_By <chr>, Scientific <chr>,
#   Family <chr>, Genus <chr>, Common_Name <chr>, Functional_Type <chr>,
#   Edible <chr>, Longitude <dbl>, Latitude <dbl>, Mature_Size <fct>

Part (b)

Take a look at the documentation by running ?get_pdxTrees_parks and ?get_pdxTrees_streets in your console, and tell me about two columns (what are their names? what do they measure? units? factor levels? etc.) from each of the datasets (four total).

Exercise 2

Find how many trees of each species (use the common name column for species here and throughout the lab) are in the parks dataset, and order the data from greatest number of trees in a species to least. Notice I’ve called the number of trees num_trees.

# A tibble: 304 × 2
   Common_Name               num_trees
   <chr>                         <int>
 1 Douglas-Fir                    6783
 2 Norway Maple                   1502
 3 Western Redcedar                964
 4 Northern Red Oak                736
 5 Pin Oak                         619
 6 Incense Cedar                   507
 7 Bigleaf Maple                   490
 8 Japanese Flowering Cherry       418
 9 American Elm                    379
10 Giant Sequoia                   349
# ℹ 294 more rows

Exercise 3

A staple of Portland living in the springtime is to go out and see the Japanese Flowering Cherry trees beginning to bloom. Find the park with the most Japanese Flowering Cherry trees, and how many there are in that park.

# A tibble: 1 × 2
  Park                           num_cherries
  <chr>                                 <int>
1 Gov Tom McCall Waterfront Park          104

Exercise 4

Part (a)

Find the top 5 parks with the most total DBH (i.e. the sum of all the trees DBH in those parks).

# A tibble: 5 × 2
  Park             total_DBH
  <chr>                <dbl>
1 Mt Tabor Park       31965.
2 Columbia Park       26952.
3 Laurelhurst Park    26890 
4 Washington Park     16081.
5 John Luby Park      14519.

Mt. Tabor Park is one of my personal favorites in Portland, I used to live just a few blocks away. It is a wonderful place to watch the sun set over the West hills of the city. Not only does it have the most DBH in all of the parks in Portland, but Portland’s drinking water is stored in giant reservoirs at that park!

Part (b)

Find the top 5 parks with the highest DBH per tree (in other words, the 5 highest mean DBHs), along with how many trees are in those parks.

# A tibble: 5 × 3
  Park               mean_DBH number_of_trees
  <chr>                 <dbl>           <int>
1 Heritage Tree Park     59                 1
2 Johnswood Property     43.1              18
3 Patton Square Park     37.0              13
4 Hancock Park           32.5             104
5 Laurelwood Park        31.2              13

Wow! Heritage Tree Park has one giant tree!

Exercise 5

There is a column in the parks dataset that represents the amount of carbon (in lbs) each tree has sequestered. Find the park with the highest carbon sequestration.

# A tibble: 1 × 2
  Park          carbon_sequestered
  <chr>                      <dbl>
1 Mt Tabor Park             36515.

Nice job, Mt. Tabor!

Exercise 6

The parks dataset has a wide variety of numeric columns measuring things from DBH to pollution removal value of a given tree. Take the mean of all of these numeric columns, grouped by park. Make sure to remove NAs in the mean() function.

HINT: In across() you can set .cols = where(is.numeric) to apply a function across all numeric columns.

# A tibble: 174 × 18
   Park       Longitude Latitude   DBH Tree_Height Crown_Width_NS Crown_Width_EW
   <chr>          <dbl>    <dbl> <dbl>       <dbl>          <dbl>          <dbl>
 1 Albert Ke…     -123.     45.5  20.5        83.7           32.3           31.8
 2 Alberta P…     -123.     45.6  28.4        96.6           45.5           43.4
 3 Ankeny Pl…     -123.     45.5  30.8        69             62             62.2
 4 April Hil…     -123.     45.5  18.9        47.9           34.0           33  
 5 Arbor Lod…     -123.     45.6  19.2        49.4           39.8           39.7
 6 Argay Park     -123.     45.6  19.6        46.2           35.3           35.3
 7 Berkeley …     -123.     45.5  26.7        72.9           44.9           47.7
 8 Berrydale…     -123.     45.5  23.6        69.8           41.3           39.3
 9 Bloomingt…     -123.     45.5  24.1        62.6           47.9           48.2
10 Brentwood…     -123.     45.5  17.4        40.8           29.7           30.8
# ℹ 164 more rows
# ℹ 11 more variables: Crown_Base_Height <dbl>, Structural_Value <dbl>,
#   Carbon_Storage_lb <dbl>, Carbon_Storage_value <dbl>,
#   Carbon_Sequestration_lb <dbl>, Carbon_Sequestration_value <dbl>,
#   Stormwater_ft <dbl>, Stormwater_value <dbl>, Pollution_Removal_value <dbl>,
#   Pollution_Removal_oz <dbl>, Total_Annual_Services <dbl>

Exercise 7

For this exercise, consider the streets dataset.

Part (a)

There is a column delineating the edibility of the trees in the dataset. Find the possible values for this column and how many trees correspond to each value.

# A tibble: 3 × 2
  Edible n_rows
  <chr>   <int>
1 fruit    3649
2 no     211954
3 nut      2999

Part (b)

Create a new tibble called forager that contains all rows of the streets with edible fruits or nuts but only the columns denoting the common name, neighborhood, condition, and edibility.

forager
# A tibble: 6,648 × 4
   Common_Name Neighborhood Condition Edible
   <chr>       <chr>        <chr>     <chr> 
 1 Persimmon   Concordia    Good      fruit 
 2 Walnut      Concordia    Fair      nut   
 3 Walnut      Concordia    Fair      nut   
 4 Walnut      Concordia    Fair      nut   
 5 Walnut      Concordia    Good      nut   
 6 Walnut      Concordia    Fair      nut   
 7 Walnut      Concordia    Fair      nut   
 8 Walnut      Concordia    Poor      nut   
 9 Walnut      Concordia    Fair      nut   
10 Fig         Concordia    Poor      fruit 
# ℹ 6,638 more rows

Part (c)

Create a new logical column and save it in the forager dataset. Name it fruit_edible. The values in this columns should be TRUE or FALSE depending on if the given tree has edible fruits or not

forager
# A tibble: 6,648 × 5
   Common_Name Neighborhood Condition Edible fruit_edible
   <chr>       <chr>        <chr>     <chr>  <lgl>       
 1 Persimmon   Concordia    Good      fruit  TRUE        
 2 Walnut      Concordia    Fair      nut    FALSE       
 3 Walnut      Concordia    Fair      nut    FALSE       
 4 Walnut      Concordia    Fair      nut    FALSE       
 5 Walnut      Concordia    Good      nut    FALSE       
 6 Walnut      Concordia    Fair      nut    FALSE       
 7 Walnut      Concordia    Fair      nut    FALSE       
 8 Walnut      Concordia    Poor      nut    FALSE       
 9 Walnut      Concordia    Fair      nut    FALSE       
10 Fig         Concordia    Poor      fruit  TRUE        
# ℹ 6,638 more rows

Part (d)

Arrange the forager dataset by the column you just created in (c), so that the TRUE values are at the top, and save this change to the forager dataset.

forager
# A tibble: 6,648 × 5
   Common_Name Neighborhood Condition Edible fruit_edible
   <chr>       <chr>        <chr>     <chr>  <lgl>       
 1 Persimmon   Concordia    Good      fruit  TRUE        
 2 Fig         Concordia    Poor      fruit  TRUE        
 3 Fig         Concordia    Good      fruit  TRUE        
 4 Fig         Concordia    Fair      fruit  TRUE        
 5 Fig         Concordia    Poor      fruit  TRUE        
 6 Fig         Concordia    Good      fruit  TRUE        
 7 Fig         Concordia    Fair      fruit  TRUE        
 8 Persimmon   Concordia    Good      fruit  TRUE        
 9 Persimmon   Concordia    Fair      fruit  TRUE        
10 Persimmon   Concordia    Fair      fruit  TRUE        
# ℹ 6,638 more rows

Part (e)

The Edible column in named in a bit of a funny way. Rename the column to Edible_Component and save this change to the forager dataset.

forager
# A tibble: 6,648 × 5
   Common_Name Neighborhood Condition Edible_Component fruit_edible
   <chr>       <chr>        <chr>     <chr>            <lgl>       
 1 Persimmon   Concordia    Good      fruit            TRUE        
 2 Fig         Concordia    Poor      fruit            TRUE        
 3 Fig         Concordia    Good      fruit            TRUE        
 4 Fig         Concordia    Fair      fruit            TRUE        
 5 Fig         Concordia    Poor      fruit            TRUE        
 6 Fig         Concordia    Good      fruit            TRUE        
 7 Fig         Concordia    Fair      fruit            TRUE        
 8 Persimmon   Concordia    Good      fruit            TRUE        
 9 Persimmon   Concordia    Fair      fruit            TRUE        
10 Persimmon   Concordia    Fair      fruit            TRUE        
# ℹ 6,638 more rows

Part (f)

Which five neighborhoods would a fruit loving forager most like to live (in other words, what are the five neighborhoods with the most trees with edible fruit)?

HINT: using the sum() function on logical data will make all the TRUEs 1s and all the FALSEs 0s.

# A tibble: 5 × 2
  Neighborhood      n_edible_fruit_trees
  <chr>                            <int>
1 Montavilla                         166
2 Sellwood-Moreland                  154
3 Richmond                           148
4 Rose City Park                     127
5 Woodstock                          125

The Sellwood-Moreland and Woodstock neighborhoods are the heart of southeast Portland (in my opinion). The pdxTrees R package was written primarily in the Woodstock neighborhood, as a matter of fact. Sellwood-Moreland sits just to the west of Woodstock and has a variety of top-tier coffee shops, along with a gorgeous riverfront park with off-leash dogs allowed. Quite the place to be if you are a fruit forager, too!

Part (g)

Of the five neighborhoods found in part (e), what are the conditions of the edible fruit bearing trees (in other words, how many are in Good condition, Fair condition, etc., grouped by these five neighborhoods).

HINT: you can use group_by() to group by multiple groups at once. Example: group_by(column1, column2).

`summarise()` has grouped output by 'Neighborhood'. You can override using the
`.groups` argument.
# A tibble: 15 × 3
# Groups:   Neighborhood [5]
   Neighborhood      Condition num_fruit_trees
   <chr>             <chr>               <int>
 1 Montavilla        Fair                   92
 2 Montavilla        Good                   63
 3 Montavilla        Poor                   11
 4 Richmond          Fair                   36
 5 Richmond          Good                  104
 6 Richmond          Poor                    8
 7 Rose City Park    Fair                   71
 8 Rose City Park    Good                   54
 9 Rose City Park    Poor                    2
10 Sellwood-Moreland Fair                   56
11 Sellwood-Moreland Good                   89
12 Sellwood-Moreland Poor                    9
13 Woodstock         Fair                   39
14 Woodstock         Good                   77
15 Woodstock         Poor                    9

Wrap up

Congratulations! You’ve made it to the end of Lab 8. Make sure to render your final document and submit both the .pdf and .qmd file to D2L.