readr and tibble

- tidyverse
- tibble: improved data frames
- readr: tidy reading and writing data
- dplyr

tidyverse
The tidyverse is a set of R packages that work together to provide extensive and intuitive data analysis functions.

tidyverse packages include:
- tibble, to improve on the data.frame,
- readr, to improve reading and writing data,
- dplyr, to manipulate and summarize data “data plyers”,
- tidyr, to clean and reshape data, and
- ggplot2, to produce beautiful graphics with intuitive syntax “the grammar of graphics”.

How do I install tidyverse?
To install tidyverse, you must run this code on your computer.
NOTE: you only have to do this once.
This will take a few minutes to run on your computer. After it’s done, you can load all the tidyverse packages with the library() function:
NOTE: you’ll load the tidyverse with library() on each project (e.g. lab, midterm, project) towards the top of your Quarto document.
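The code chunks for these two steps did not survive in these notes, so here is a minimal sketch of what they would typically look like. The install line is commented out on the assumption that you only want to run it once, by hand, in the console.

```r
# Run once per computer, in the console (left commented out so a
# Quarto render does not reinstall the packages every time):
# install.packages("tidyverse")

# Run near the top of every Quarto document that uses the tidyverse:
library(tidyverse)
```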
tibble and readr
tibble: improved data frames
Recall the pets data frame:
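As a reminder, the pets data frame can be rebuilt roughly as follows. This is a reconstruction from the printed output later in these notes, and it assumes pets has only the two columns shown there (names and ages).

```r
# Assumed reconstruction of `pets`, based on the printed output in these notes
pets <- data.frame(
  names = c("Dude", "Pickle", "Kyle", "Nubs"),
  ages  = c(6, 5, 3, 11)
)

pets
```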
We can improve on this data structure by creating a tibble rather than a data.frame. The syntax is almost exactly the same:
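A short sketch of that near-identical syntax: swap data.frame() for tibble() and keep the arguments the same. The column values are the assumed pets values reconstructed from the output in these notes.

```r
library(tibble)

# Same arguments as data.frame(), different constructor
pets_tbl <- tibble(
  names = c("Dude", "Pickle", "Kyle", "Nubs"),
  ages  = c(6, 5, 3, 11)
)

# Printing a tibble also reports its dimensions and each column's type
pets_tbl
```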
tibbles give us some nice information when printed: dimensions and column classes.

tibbles and data.frames
We can convert existing data.frames into tibbles:
Note that new_pets_tbl is a tibble:
but is of course still a data.frame:
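The conversion and both checks can be sketched like this, again assuming the two-column pets data frame reconstructed from the output in these notes.

```r
library(tibble)

# Assumed reconstruction of `pets`
pets <- data.frame(
  names = c("Dude", "Pickle", "Kyle", "Nubs"),
  ages  = c(6, 5, 3, 11)
)

# Convert an existing data.frame into a tibble
new_pets_tbl <- as_tibble(pets)

is_tibble(new_pets_tbl)      # TRUE: it is a tibble
is.data.frame(new_pets_tbl)  # TRUE: and still a data.frame
is_tibble(pets)              # FALSE: a plain data.frame is not a tibble
```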
tibbles are data.frames, but not all data.frames are tibbles.

In the past, when we have asked for the values from one column of a data frame, R has returned a vector rather than a data frame. For example:
[1] "Dude" "Pickle" "Kyle" "Nubs"
names ages
1 Dude 6
2 Pickle 5
3 Kyle 3
4 Nubs 11
tibbles fix this inconsistency: subsetting a tibble with [ always returns a tibble, no matter how many columns are selected.
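Here is a small sketch of the inconsistency and how tibbles behave instead. It assumes the two-column pets data frame reconstructed from the output above, and R 4.0 or later (where data.frame() keeps strings as character).

```r
library(tibble)

# Assumed reconstruction of `pets`
pets <- data.frame(
  names = c("Dude", "Pickle", "Kyle", "Nubs"),
  ages  = c(6, 5, 3, 11)
)
pets_tbl <- as_tibble(pets)

# data.frame: the type of the result depends on how many columns we ask for
pets[, "names"]             # one column  -> a vector
pets[, c("names", "ages")]  # two columns -> a data.frame

# tibble: `[` returns a tibble either way
pets_tbl[, "names"]
pets_tbl[, c("names", "ages")]

# When we do want a vector from a tibble, we ask for it explicitly
pets_tbl$names
pets_tbl[["names"]]
```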
readr: tidy reading and writing
So far, we have primarily used the built-in read.csv() and read.table() functions to read data into R. Example:
Recall:
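- we called the read.csv() function,
- we gave it the path to the file with file = "../labs/datasets/FEF_trees.csv",
- and we assigned the result with <- to an object in R called fef.

readr
The readr package includes analogous read_csv(), read_table(), and read_delim() functions, along with other read_*() functions.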
- Note that we use read_csv() with an underscore (_) rather than the base R option of read.csv().
- We must load the tidyverse to run this function, unless we call the package name before the function name like so: readr::read_csv(file = "../labs/datasets/FEF_trees.csv")
- readr functions can be up to 100x faster than base R, depending on your dataset.

watershed year plot species dbh_in height_ft stem_green_kg top_green_kg
1 3 1991 29 Acer rubrum 6.0 48 92.2 13.1
2 3 1991 33 Acer rubrum 6.9 48 102.3 23.1
3 3 1991 35 Acer rubrum 6.4 48 124.4 8.7
4 3 1991 39 Acer rubrum 6.5 49 91.7 39.0
5 3 1991 44 Acer rubrum 7.2 51 186.2 8.9
6 3 1992 26 Acer rubrum 3.1 40 20.8 0.9
smbranch_green_kg lgbranch_green_kg allwoody_green_kg leaves_green_kg
1 30.5 48.4 184.2 16.1
2 23.5 57.7 206.6 12.9
3 22.3 44.1 199.5 16.5
4 22.5 35.5 188.7 12.0
5 25.4 65.1 285.6 22.4
6 1.9 1.5 25.1 0.9
stem_dry_kg top_dry_kg smbranch_dry_kg lgbranch_dry_kg allwoody_dry_kg
1 54.7 7.1 15.3 28.0 105.1
2 62.3 12.4 14.8 33.6 123.1
3 73.3 4.6 11.5 25.1 114.4
4 53.6 21.3 11.2 19.8 105.9
5 106.4 4.7 11.7 36.1 159.0
6 11.7 0.5 1.1 0.9 14.2
leaves_dry_kg
1 6.1
2 4.6
3 6.1
4 4.2
5 7.9
6 0.3
# A tibble: 6 × 18
watershed year plot species dbh_in height_ft stem_green_kg top_green_kg
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 3 1991 29 Acer rubrum 6 48 92.2 13.1
2 3 1991 33 Acer rubrum 6.9 48 102. 23.1
3 3 1991 35 Acer rubrum 6.4 48 124. 8.7
4 3 1991 39 Acer rubrum 6.5 49 91.7 39
5 3 1991 44 Acer rubrum 7.2 51 186. 8.9
6 3 1992 26 Acer rubrum 3.1 40 20.8 0.9
# ℹ 10 more variables: smbranch_green_kg <dbl>, lgbranch_green_kg <dbl>,
# allwoody_green_kg <dbl>, leaves_green_kg <dbl>, stem_dry_kg <dbl>,
# top_dry_kg <dbl>, smbranch_dry_kg <dbl>, lgbranch_dry_kg <dbl>,
# allwoody_dry_kg <dbl>, leaves_dry_kg <dbl>
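The two printouts above come from reading the same file with read.csv() and read_csv(). Because the FEF file itself is not bundled with these notes, the sketch below writes a small stand-in CSV to a temporary file (a few rows and columns copied from the output above) and reads it both ways. The object names fef_df and fef_tbl are hypothetical.

```r
library(readr)
library(tibble)

# Stand-in file: three rows and six columns copied from the FEF output above
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "watershed,year,plot,species,dbh_in,height_ft",
  "3,1991,29,Acer rubrum,6.0,48",
  "3,1991,33,Acer rubrum,6.9,48",
  "3,1992,26,Acer rubrum,3.1,40"
), csv_path)

# Base R reader vs. readr reader on the same file
fef_df  <- read.csv(file = csv_path)
fef_tbl <- read_csv(file = csv_path, show_col_types = FALSE)

is_tibble(fef_df)   # FALSE: read.csv() gives a plain data.frame
is_tibble(fef_tbl)  # TRUE: read_csv() gives a tibble
```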
R’s built-in read.csv() function reads in data as a data.frame, but readr::read_csv() reads in data as a tibble. Neat!

We can specify some nifty options with the read_*() functions from readr. We can specify column types:
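One way to do this is with the col_types argument and the cols() helper. The sketch below is an assumption about how the integer columns in the second printout were requested, and it uses a small stand-in CSV (a few rows copied from the FEF output) rather than the real file. The object names fef_guess and fef_int are hypothetical.

```r
library(readr)
library(tibble)

# Stand-in file: three rows and six columns copied from the FEF output
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "watershed,year,plot,species,dbh_in,height_ft",
  "3,1991,29,Acer rubrum,6.0,48",
  "3,1991,33,Acer rubrum,6.9,48",
  "3,1992,26,Acer rubrum,3.1,40"
), csv_path)

# Default: readr guesses types, and whole numbers come in as doubles
fef_guess <- read_csv(file = csv_path, show_col_types = FALSE)

# Ask for integers in specific columns; the remaining columns are still guessed
fef_int <- read_csv(
  file = csv_path,
  col_types = cols(
    watershed = col_integer(),
    year      = col_integer(),
    plot      = col_integer()
  )
)

glimpse(fef_guess)  # watershed, year, plot shown as <dbl>
glimpse(fef_int)    # watershed, year, plot shown as <int>
```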
Rows: 88
Columns: 18
$ watershed <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ year <dbl> 1991, 1991, 1991, 1991, 1991, 1992, 1992, 1992, 1992…
$ plot <dbl> 29, 33, 35, 39, 44, 26, 26, 26, 48, 48, 48, 29, 33, …
$ species <chr> "Acer rubrum", "Acer rubrum", "Acer rubrum", "Acer r…
$ dbh_in <dbl> 6.0, 6.9, 6.4, 6.5, 7.2, 3.1, 2.0, 4.1, 2.4, 2.7, 3.…
$ height_ft <dbl> 48.0, 48.0, 48.0, 49.0, 51.0, 40.0, 30.5, 50.0, 28.0…
$ stem_green_kg <dbl> 92.2, 102.3, 124.4, 91.7, 186.2, 20.8, 5.6, 54.1, 10…
$ top_green_kg <dbl> 13.1, 23.1, 8.7, 39.0, 8.9, 0.9, 0.9, 8.6, 0.7, 5.0,…
$ smbranch_green_kg <dbl> 30.5, 23.5, 22.3, 22.5, 25.4, 1.9, 2.2, 8.0, 3.7, 3.…
$ lgbranch_green_kg <dbl> 48.4, 57.7, 44.1, 35.5, 65.1, 1.5, 0.6, 4.0, 0.5, 1.…
$ allwoody_green_kg <dbl> 184.2, 206.6, 199.5, 188.7, 285.6, 25.1, 9.3, 74.7, …
$ leaves_green_kg <dbl> 16.1, 12.9, 16.5, 12.0, 22.4, 0.9, 1.0, 6.1, 2.5, 1.…
$ stem_dry_kg <dbl> 54.7, 62.3, 73.3, 53.6, 106.4, 11.7, 3.2, 28.3, 5.5,…
$ top_dry_kg <dbl> 7.1, 12.4, 4.6, 21.3, 4.7, 0.5, 0.5, 4.4, 0.4, 2.7, …
$ smbranch_dry_kg <dbl> 15.3, 14.8, 11.5, 11.2, 11.7, 1.1, 1.2, 3.6, 1.8, 0.…
$ lgbranch_dry_kg <dbl> 28.0, 33.6, 25.1, 19.8, 36.1, 0.9, 0.3, 2.1, 0.3, 1.…
$ allwoody_dry_kg <dbl> 105.1, 123.1, 114.4, 105.9, 159.0, 14.2, 5.3, 38.5, …
$ leaves_dry_kg <dbl> 6.1, 4.6, 6.1, 4.2, 7.9, 0.3, 0.3, 1.9, 0.8, 0.5, 1.…
Rows: 88
Columns: 18
$ watershed <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ year <int> 1991, 1991, 1991, 1991, 1991, 1992, 1992, 1992, 1992…
$ plot <int> 29, 33, 35, 39, 44, 26, 26, 26, 48, 48, 48, 29, 33, …
$ species <chr> "Acer rubrum", "Acer rubrum", "Acer rubrum", "Acer r…
$ dbh_in <dbl> 6.0, 6.9, 6.4, 6.5, 7.2, 3.1, 2.0, 4.1, 2.4, 2.7, 3.…
$ height_ft <dbl> 48.0, 48.0, 48.0, 49.0, 51.0, 40.0, 30.5, 50.0, 28.0…
$ stem_green_kg <dbl> 92.2, 102.3, 124.4, 91.7, 186.2, 20.8, 5.6, 54.1, 10…
$ top_green_kg <dbl> 13.1, 23.1, 8.7, 39.0, 8.9, 0.9, 0.9, 8.6, 0.7, 5.0,…
$ smbranch_green_kg <dbl> 30.5, 23.5, 22.3, 22.5, 25.4, 1.9, 2.2, 8.0, 3.7, 3.…
$ lgbranch_green_kg <dbl> 48.4, 57.7, 44.1, 35.5, 65.1, 1.5, 0.6, 4.0, 0.5, 1.…
$ allwoody_green_kg <dbl> 184.2, 206.6, 199.5, 188.7, 285.6, 25.1, 9.3, 74.7, …
$ leaves_green_kg <dbl> 16.1, 12.9, 16.5, 12.0, 22.4, 0.9, 1.0, 6.1, 2.5, 1.…
$ stem_dry_kg <dbl> 54.7, 62.3, 73.3, 53.6, 106.4, 11.7, 3.2, 28.3, 5.5,…
$ top_dry_kg <dbl> 7.1, 12.4, 4.6, 21.3, 4.7, 0.5, 0.5, 4.4, 0.4, 2.7, …
$ smbranch_dry_kg <dbl> 15.3, 14.8, 11.5, 11.2, 11.7, 1.1, 1.2, 3.6, 1.8, 0.…
$ lgbranch_dry_kg <dbl> 28.0, 33.6, 25.1, 19.8, 36.1, 0.9, 0.3, 2.1, 0.3, 1.…
$ allwoody_dry_kg <dbl> 105.1, 123.1, 114.4, 105.9, 159.0, 14.2, 5.3, 38.5, …
$ leaves_dry_kg <dbl> 6.1, 4.6, 6.1, 4.2, 7.9, 0.3, 0.3, 1.9, 0.8, 0.5, 1.…
Writing data with readr looks very similar to base R. Say we’d like to write pets_tbl to “pets.csv”:
With base:
With readr:
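Both versions can be sketched side by side. To avoid cluttering your working directory, this sketch writes into a temporary folder; the two path variables are hypothetical, and in practice you would pass "pets.csv" directly. The pets_tbl values are the assumed ones reconstructed from the output earlier in these notes.

```r
library(tibble)
library(readr)

pets_tbl <- tibble(
  names = c("Dude", "Pickle", "Kyle", "Nubs"),
  ages  = c(6, 5, 3, 11)
)

# Hypothetical output locations inside a temporary folder
base_path  <- file.path(tempdir(), "pets_base.csv")
readr_path <- file.path(tempdir(), "pets.csv")

# With base: row names are written unless we switch them off
write.csv(pets_tbl, file = base_path, row.names = FALSE)

# With readr: row names are never written
write_csv(pets_tbl, readr_path)
```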
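Note one difference in readr::write_csv() compared to write.csv(): write_csv() never writes row names, so there is no row.names argument to set.

dplyr demo (time permitting)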
dplyr