readr and tibble

tidyverse
tibble: improved data frames
readr: tidy reading and writing data
dplyr

tidyverse
The tidyverse is a set of R packages that work together to provide extensive and intuitive data analysis functions.
tidyverse packages:
tibble, to improve on the data.frame,
readr, to improve reading and writing data,
dplyr, to manipulate and summarize data (“data pliers”),
tidyr, to clean and reshape data, and
ggplot2, to produce beautiful graphics with intuitive syntax (“the grammar of graphics”).

How do I install tidyverse?
To install tidyverse, you must run install.packages("tidyverse") on your computer.
NOTE: you only have to do this once.
This will take a few minutes to run on your computer. After it’s done, you can load all the tidyverse packages with the library() function, as in library(tidyverse).
NOTE: you’ll load the tidyverse with library() on each project (e.g. lab, midterm, project) towards the top of your Quarto document.
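
One way to keep the install step from running every time a document is rendered is to guard it. The following is only a sketch of that idea, assuming you are happy to let the document install the package when it is missing; the guard itself is not part of the original notes.

```r
# Install the tidyverse only if it is not already available (a one-time step).
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Attach the core tidyverse packages; this line goes near the top of each document.
library(tidyverse)
```

After this runs, functions such as tibble() and read_csv() can be called without a package prefix.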
tibble and readr

tibbles: improved data frames
Recall the pets data frame, with a names column and an ages column.
We can improve on this data structure by creating a tibble rather than a data.frame. The syntax is almost exactly the same: tibble() takes the same name = value column arguments as data.frame().
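
As a minimal sketch of that comparison, the code below builds the same data both ways. The column names and values are taken from the pets output printed later in these notes; storing ages as plain numeric values is an assumption.

```r
library(tibble)  # also attached by library(tidyverse)

# Base R data frame
pets <- data.frame(
  names = c("Dude", "Pickle", "Kyle", "Nubs"),
  ages  = c(6, 5, 3, 11)
)

# Same arguments, but tibble() instead of data.frame()
pets_tbl <- tibble(
  names = c("Dude", "Pickle", "Kyle", "Nubs"),
  ages  = c(6, 5, 3, 11)
)

pets      # prints rows only
pets_tbl  # also prints the dimensions and the type of each column
```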
tibbles give us some nice information when printed: dimensions and column class.

tibbles and data.frames
We can convert existing data.frames into tibbles with as_tibble().
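
A short sketch of the conversion, assuming the pets data frame holds the names and ages shown elsewhere in these notes. It uses as_tibble() from the tibble package, followed by two class checks.

```r
library(tibble)

# Assumed contents of the pets data frame
pets <- data.frame(
  names = c("Dude", "Pickle", "Kyle", "Nubs"),
  ages  = c(6, 5, 3, 11)
)

# Convert the existing data.frame into a tibble
new_pets_tbl <- as_tibble(pets)

is_tibble(new_pets_tbl)      # TRUE
is.data.frame(new_pets_tbl)  # TRUE: a tibble is still a data.frame
is_tibble(pets)              # FALSE: a plain data.frame is not a tibble
class(new_pets_tbl)          # "tbl_df" "tbl" "data.frame"
```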
Note that new_pets_tbl is a tibble, but is of course still a data.frame.
tibbles are data.frames, but not all data.frames are tibbles.

In the past, when we have wanted the values from one column of a data frame, we were returned a vector, while asking for two columns returned a data frame. For example:
[1] "Dude"   "Pickle" "Kyle"   "Nubs"
   names ages
1   Dude    6
2 Pickle    5
3   Kyle    3
4   Nubs   11
tibbles fix this inconsistency: subsetting a tibble with [ always returns another tibble.
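
To make the difference concrete, here is a small sketch that subsets the same data both ways. It assumes the pets values printed above; the object names pets and pets_tbl follow the names used in these notes.

```r
library(tibble)

pets <- data.frame(
  names = c("Dude", "Pickle", "Kyle", "Nubs"),
  ages  = c(6, 5, 3, 11)
)
pets_tbl <- as_tibble(pets)

# data.frame: the result type depends on how many columns are selected
pets[, 1]    # a character vector
pets[, 1:2]  # a data.frame

# tibble: [ returns a tibble in both cases
pets_tbl[, 1]
pets_tbl[, 1:2]

# To pull a single column out of a tibble as a vector, use $ or [[ ]]
pets_tbl$names
pets_tbl[["ages"]]
```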
readr: tidy reading and writing
So far, we have primarily used the built-in read.csv() and read.table() functions to read data into R. Example: fef <- read.csv(file = "../labs/datasets/FEF_trees.csv")
Recall:
we call the read.csv() function,
with the argument file = "../labs/datasets/FEF_trees.csv",
and assign the result with <- to an object in R called fef.

readr
The readr package includes analogous read_csv(), read_table(), read_delim(), and even more read_*() functions.
Note that we use read_csv(), with an underscore (_), rather than the base R option of read.csv().
We must load the tidyverse to run this function, unless we call the package name before the function name like so: readr::read_csv(file = "../labs/datasets/FEF_trees.csv")
readr functions can be up to 100x faster than base R, depending on your dataset.

  watershed year plot     species dbh_in height_ft stem_green_kg top_green_kg
1         3 1991   29 Acer rubrum    6.0        48          92.2         13.1
2         3 1991   33 Acer rubrum    6.9        48         102.3         23.1
3         3 1991   35 Acer rubrum    6.4        48         124.4          8.7
4         3 1991   39 Acer rubrum    6.5        49          91.7         39.0
5         3 1991   44 Acer rubrum    7.2        51         186.2          8.9
6         3 1992   26 Acer rubrum    3.1        40          20.8          0.9
  smbranch_green_kg lgbranch_green_kg allwoody_green_kg leaves_green_kg
1              30.5              48.4             184.2            16.1
2              23.5              57.7             206.6            12.9
3              22.3              44.1             199.5            16.5
4              22.5              35.5             188.7            12.0
5              25.4              65.1             285.6            22.4
6               1.9               1.5              25.1             0.9
  stem_dry_kg top_dry_kg smbranch_dry_kg lgbranch_dry_kg allwoody_dry_kg
1        54.7        7.1            15.3            28.0           105.1
2        62.3       12.4            14.8            33.6           123.1
3        73.3        4.6            11.5            25.1           114.4
4        53.6       21.3            11.2            19.8           105.9
5       106.4        4.7            11.7            36.1           159.0
6        11.7        0.5             1.1             0.9            14.2
  leaves_dry_kg
1           6.1
2           4.6
3           6.1
4           4.2
5           7.9
6           0.3

# A tibble: 6 × 18
  watershed  year  plot species     dbh_in height_ft stem_green_kg top_green_kg
      <dbl> <dbl> <dbl> <chr>        <dbl>     <dbl>         <dbl>        <dbl>
1         3  1991    29 Acer rubrum    6          48          92.2         13.1
2         3  1991    33 Acer rubrum    6.9        48         102.          23.1
3         3  1991    35 Acer rubrum    6.4        48         124.           8.7
4         3  1991    39 Acer rubrum    6.5        49          91.7         39
5         3  1991    44 Acer rubrum    7.2        51         186.           8.9
6         3  1992    26 Acer rubrum    3.1        40          20.8          0.9
# ℹ 10 more variables: smbranch_green_kg <dbl>, lgbranch_green_kg <dbl>,
#   allwoody_green_kg <dbl>, leaves_green_kg <dbl>, stem_dry_kg <dbl>,
#   top_dry_kg <dbl>, smbranch_dry_kg <dbl>, lgbranch_dry_kg <dbl>,
#   allwoody_dry_kg <dbl>, leaves_dry_kg <dbl>
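
The two printouts above can be reproduced in miniature without the course file. The sketch below is only an illustration: it writes a small stand-in CSV (the first two rows and five columns of the output above) to a temporary path, so csv_path, fef_df, and fef_tbl are hypothetical names rather than objects from the lab.

```r
library(readr)  # also attached by library(tidyverse)

# Hypothetical stand-in for the course CSV, written to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "watershed,year,plot,species,dbh_in",
  "3,1991,29,Acer rubrum,6.0",
  "3,1991,33,Acer rubrum,6.9"
), csv_path)

fef_df  <- read.csv(file = csv_path)  # base R: returns a data.frame
fef_tbl <- read_csv(file = csv_path)  # readr: returns a tibble

class(fef_df)
class(fef_tbl)

head(fef_df)
head(fef_tbl)
```

Calling readr::read_csv(file = csv_path) without the library(readr) line would give the same tibble.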
R’s built-in read.csv() function reads in data as a data.frame, but readr::read_csv() reads in data as a tibble. Neat!

We can specify some nifty options with the read_*() functions from readr. We can specify column types: below, watershed, year, and plot are read as <dbl> by default, and as <int> once their types are specified.
Rows: 88
Columns: 18
$ watershed <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ year <dbl> 1991, 1991, 1991, 1991, 1991, 1992, 1992, 1992, 1992…
$ plot <dbl> 29, 33, 35, 39, 44, 26, 26, 26, 48, 48, 48, 29, 33, …
$ species <chr> "Acer rubrum", "Acer rubrum", "Acer rubrum", "Acer r…
$ dbh_in <dbl> 6.0, 6.9, 6.4, 6.5, 7.2, 3.1, 2.0, 4.1, 2.4, 2.7, 3.…
$ height_ft <dbl> 48.0, 48.0, 48.0, 49.0, 51.0, 40.0, 30.5, 50.0, 28.0…
$ stem_green_kg <dbl> 92.2, 102.3, 124.4, 91.7, 186.2, 20.8, 5.6, 54.1, 10…
$ top_green_kg <dbl> 13.1, 23.1, 8.7, 39.0, 8.9, 0.9, 0.9, 8.6, 0.7, 5.0,…
$ smbranch_green_kg <dbl> 30.5, 23.5, 22.3, 22.5, 25.4, 1.9, 2.2, 8.0, 3.7, 3.…
$ lgbranch_green_kg <dbl> 48.4, 57.7, 44.1, 35.5, 65.1, 1.5, 0.6, 4.0, 0.5, 1.…
$ allwoody_green_kg <dbl> 184.2, 206.6, 199.5, 188.7, 285.6, 25.1, 9.3, 74.7, …
$ leaves_green_kg <dbl> 16.1, 12.9, 16.5, 12.0, 22.4, 0.9, 1.0, 6.1, 2.5, 1.…
$ stem_dry_kg <dbl> 54.7, 62.3, 73.3, 53.6, 106.4, 11.7, 3.2, 28.3, 5.5,…
$ top_dry_kg <dbl> 7.1, 12.4, 4.6, 21.3, 4.7, 0.5, 0.5, 4.4, 0.4, 2.7, …
$ smbranch_dry_kg <dbl> 15.3, 14.8, 11.5, 11.2, 11.7, 1.1, 1.2, 3.6, 1.8, 0.…
$ lgbranch_dry_kg <dbl> 28.0, 33.6, 25.1, 19.8, 36.1, 0.9, 0.3, 2.1, 0.3, 1.…
$ allwoody_dry_kg <dbl> 105.1, 123.1, 114.4, 105.9, 159.0, 14.2, 5.3, 38.5, …
$ leaves_dry_kg <dbl> 6.1, 4.6, 6.1, 4.2, 7.9, 0.3, 0.3, 1.9, 0.8, 0.5, 1.…
Rows: 88
Columns: 18
$ watershed <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ year <int> 1991, 1991, 1991, 1991, 1991, 1992, 1992, 1992, 1992…
$ plot <int> 29, 33, 35, 39, 44, 26, 26, 26, 48, 48, 48, 29, 33, …
$ species <chr> "Acer rubrum", "Acer rubrum", "Acer rubrum", "Acer r…
$ dbh_in <dbl> 6.0, 6.9, 6.4, 6.5, 7.2, 3.1, 2.0, 4.1, 2.4, 2.7, 3.…
$ height_ft <dbl> 48.0, 48.0, 48.0, 49.0, 51.0, 40.0, 30.5, 50.0, 28.0…
$ stem_green_kg <dbl> 92.2, 102.3, 124.4, 91.7, 186.2, 20.8, 5.6, 54.1, 10…
$ top_green_kg <dbl> 13.1, 23.1, 8.7, 39.0, 8.9, 0.9, 0.9, 8.6, 0.7, 5.0,…
$ smbranch_green_kg <dbl> 30.5, 23.5, 22.3, 22.5, 25.4, 1.9, 2.2, 8.0, 3.7, 3.…
$ lgbranch_green_kg <dbl> 48.4, 57.7, 44.1, 35.5, 65.1, 1.5, 0.6, 4.0, 0.5, 1.…
$ allwoody_green_kg <dbl> 184.2, 206.6, 199.5, 188.7, 285.6, 25.1, 9.3, 74.7, …
$ leaves_green_kg <dbl> 16.1, 12.9, 16.5, 12.0, 22.4, 0.9, 1.0, 6.1, 2.5, 1.…
$ stem_dry_kg <dbl> 54.7, 62.3, 73.3, 53.6, 106.4, 11.7, 3.2, 28.3, 5.5,…
$ top_dry_kg <dbl> 7.1, 12.4, 4.6, 21.3, 4.7, 0.5, 0.5, 4.4, 0.4, 2.7, …
$ smbranch_dry_kg <dbl> 15.3, 14.8, 11.5, 11.2, 11.7, 1.1, 1.2, 3.6, 1.8, 0.…
$ lgbranch_dry_kg <dbl> 28.0, 33.6, 25.1, 19.8, 36.1, 0.9, 0.3, 2.1, 0.3, 1.…
$ allwoody_dry_kg <dbl> 105.1, 123.1, 114.4, 105.9, 159.0, 14.2, 5.3, 38.5, …
$ leaves_dry_kg <dbl> 6.1, 4.6, 6.1, 4.2, 7.9, 0.3, 0.3, 1.9, 0.8, 0.5, 1.…
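
The two listings above differ only in the types of watershed, year, and plot (<dbl> versus <int>). One way such a specification could be written is with the col_types argument and readr's cols() and col_integer() helpers. This is a sketch on a small stand-in file, not necessarily the exact call used to produce the output above; csv_path, fef_default, and fef_typed are hypothetical names.

```r
library(readr)

# Hypothetical stand-in for the course CSV, written to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "watershed,year,plot,species,dbh_in",
  "3,1991,29,Acer rubrum,6.0",
  "3,1991,33,Acer rubrum,6.9"
), csv_path)

# Default guessing: whole numbers are read as double
fef_default <- read_csv(file = csv_path)

# Explicit types for three columns; the remaining columns are still guessed
fef_typed <- read_csv(
  file = csv_path,
  col_types = cols(
    watershed = col_integer(),
    year      = col_integer(),
    plot      = col_integer()
  )
)

sapply(fef_default, class)
sapply(fef_typed, class)
```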
Writing data with readr looks very similar to base R. Say we’d like to write pets_tbl to “pets.csv”:
With base R, we use write.csv().
With readr, we use write_csv().
Note the underscore in readr::write_csv() compared to write.csv().
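
A minimal sketch of both calls, assuming pets_tbl holds the pets values shown earlier in these notes. To avoid cluttering the working directory, the files are written to a temporary folder; base_path and readr_path are hypothetical names, and in a project you would pass "pets.csv" directly.

```r
library(tibble)
library(readr)

pets_tbl <- tibble(
  names = c("Dude", "Pickle", "Kyle", "Nubs"),
  ages  = c(6, 5, 3, 11)
)

# Temporary output locations (assumption, for illustration only)
base_path  <- file.path(tempdir(), "pets_base.csv")
readr_path <- file.path(tempdir(), "pets.csv")

# With base R: row names are written unless row.names = FALSE
write.csv(pets_tbl, file = base_path, row.names = FALSE)

# With readr: no row names are written
write_csv(pets_tbl, readr_path)

# Inspect the raw lines of the readr file
readLines(readr_path)
```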

dplyr demo (time permitting)
dplyr