readr and tibble

Practical Computing and Data Science Tools

Announcements

  • Midterm 1 + Lab 5 grades will be released by the end of the weekend.

Agenda

  • Intro to the tidyverse
  • tibble: improved data frames
  • readr: tidy reading and writing data
  • Sneak peek at dplyr

Welcome to the tidyverse

What is the tidyverse?

  • A collection of R packages that work together to provide extensive and intuitive data analysis functions.
  • In this class, we will focus on 5 tidyverse packages:
    • tibble, to improve on the data.frame,
    • readr, to improve reading and writing data,
    • dplyr, to manipulate and summarize data (“data pliers”),
    • tidyr, to clean and reshape data, and
    • ggplot2, to produce beautiful graphics with intuitive syntax (“the grammar of graphics”).

How do we use / install the tidyverse?

To install tidyverse, you must run this code on your computer.

NOTE: you only have to do this once.

install.packages("tidyverse")

This will take a few minutes to run on your computer. After it’s done, you can load all the tidyverse packages with the library() function:

library(tidyverse)

NOTE: you’ll load the tidyverse with library() on each project (e.g. lab, midterm, project) towards the top of your Quarto document.

Today: tibble and readr

tibble: improved data frames

tibbles: improved data frames

Recall the pets data frame:

pets_df <- data.frame(
  names = c("Dude", "Pickle", "Kyle", "Nubs"),
  ages = c(6, 5, 3, 11),
  is_dog = c(TRUE, FALSE, FALSE, FALSE)
)

class(pets_df)
[1] "data.frame"

pets_df
   names ages is_dog
1   Dude    6   TRUE
2 Pickle    5  FALSE
3   Kyle    3  FALSE
4   Nubs   11  FALSE

tibbles: improved data frames

We can improve on this data structure by creating a tibble rather than a data.frame. The syntax is almost exactly the same:

library(tidyverse)
pets_tbl <- tibble(
  names = c("Dude", "Pickle", "Kyle", "Nubs"),
  ages = c(6, 5, 3, 11),
  is_dog = c(TRUE, FALSE, FALSE, FALSE)
)

class(pets_tbl)
[1] "tbl_df"     "tbl"        "data.frame"

pets_tbl
# A tibble: 4 × 3
  names   ages is_dog
  <chr>  <dbl> <lgl> 
1 Dude       6 TRUE  
2 Pickle     5 FALSE 
3 Kyle       3 FALSE 
4 Nubs      11 FALSE 

tibbles: improved data frames

pets_df
   names ages is_dog
1   Dude    6   TRUE
2 Pickle    5  FALSE
3   Kyle    3  FALSE
4   Nubs   11  FALSE
pets_tbl
# A tibble: 4 × 3
  names   ages is_dog
  <chr>  <dbl> <lgl> 
1 Dude       6 TRUE  
2 Pickle     5 FALSE 
3 Kyle       3 FALSE 
4 Nubs      11 FALSE 
  • tibbles print extra information that data.frames do not: the dimensions (4 × 3) and the type of each column (<chr>, <dbl>, <lgl>).

Converting tibbles and data.frames

We can convert existing data.frames into tibbles:

new_pets_tbl <- as_tibble(pets_df)

Note that new_pets_tbl is a tibble:

is_tibble(new_pets_tbl)
[1] TRUE

but is of course still a data.frame:

is.data.frame(new_pets_tbl)
[1] TRUE
  • All tibbles are data.frames, but not all data.frames are tibbles.
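The conversion also works in the other direction. The sketch below rebuilds pets_df so that it runs on its own, then uses base R's as.data.frame() to turn the tibble back into a plain data.frame. The name back_to_df is only for illustration.

```r
library(tibble)  # also attached by library(tidyverse)

pets_df <- data.frame(
  names = c("Dude", "Pickle", "Kyle", "Nubs"),
  ages = c(6, 5, 3, 11),
  is_dog = c(TRUE, FALSE, FALSE, FALSE)
)

# data.frame -> tibble
new_pets_tbl <- as_tibble(pets_df)

# tibble -> plain data.frame (the tibble classes are dropped)
back_to_df <- as.data.frame(new_pets_tbl)

class(back_to_df)
is_tibble(back_to_df)
```

This can be handy when an older function expects a plain data.frame and does not behave well with tibbles.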

A note on subsetting

So far, when we have extracted a single column from a data frame with [ , ], R has returned a vector, but extracting two or more columns returns a data frame. For example:

# the names of the pets in pets_df
pets_df[, 1] # returns a vector
[1] "Dude"   "Pickle" "Kyle"   "Nubs"  
pets_df[, c(1,2)] # returns a data.frame
   names ages
1   Dude    6
2 Pickle    5
3   Kyle    3
4   Nubs   11
  • Inconsistent!

A note on subsetting

tibbles fix this inconsistency: subsetting with [ always returns a tibble.

# the names of the pets in pets_tbl
pets_tbl[, 1] # returns a tibble / data.frame
# A tibble: 4 × 1
  names 
  <chr> 
1 Dude  
2 Pickle
3 Kyle  
4 Nubs  
pets_tbl[, c(1, 2)] # returns a tibble / data.frame
# A tibble: 4 × 2
  names   ages
  <chr>  <dbl>
1 Dude       6
2 Pickle     5
3 Kyle       3
4 Nubs      11
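Since [ always returns a tibble, you need a different operator when you do want a vector. A minimal sketch, rebuilding pets_tbl so that it runs on its own, using [[ ]] and $ for a tibble and drop = FALSE for a data.frame. The object names on the left-hand side are only for illustration.

```r
library(tibble)  # also attached by library(tidyverse)

pets_tbl <- tibble(
  names = c("Dude", "Pickle", "Kyle", "Nubs"),
  ages = c(6, 5, 3, 11),
  is_dog = c(TRUE, FALSE, FALSE, FALSE)
)

# Single-bracket subsetting keeps the tibble structure
names_tbl <- pets_tbl[, 1]

# Double brackets or $ extract the column as a vector
names_vec <- pets_tbl[["names"]]
names_vec2 <- pets_tbl$names

# For a plain data.frame, drop = FALSE keeps a one-column data.frame
pets_df <- as.data.frame(pets_tbl)
names_df <- pets_df[, 1, drop = FALSE]
```

The benefit is predictability: with a tibble, the type of the result depends on the operator you chose, not on how many columns you happened to select.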

readr: tidy reading and writing

Reading data

So far, we have primarily used the built-in read.csv() and read.table() functions to read data into R. Example:

fef <- read.csv(file = "../labs/datasets/FEF_trees.csv")

Recall:

  • We start with the read.csv() function,
  • We specify a relative path with file = "../labs/datasets/FEF_trees.csv",
  • We assign this read-in data with the assignment arrow <- to an object in R called fef.

Reading data with readr

The readr package includes analogous read_csv(), read_table(), read_delim(), and even more read_*() functions.

fef_tidy <- read_csv(file = "../labs/datasets/FEF_trees.csv")
  • Note that we now are using read_csv() with an underscore (_) rather than the base R option of read.csv().
  • We must first load the tidyverse to run this function, unless we call the package name before the function name like so: readr::read_csv(file = "../labs/datasets/FEF_trees.csv")
  • The readr functions can be up to 100x faster than their base R counterparts, depending on your dataset.

Comparing base and tidy data reading:

dim(fef)
[1] 88 18
dim(fef_tidy)
[1] 88 18

Comparing base and tidy data reading:

head(fef)
  watershed year plot     species dbh_in height_ft stem_green_kg top_green_kg
1         3 1991   29 Acer rubrum    6.0        48          92.2         13.1
2         3 1991   33 Acer rubrum    6.9        48         102.3         23.1
3         3 1991   35 Acer rubrum    6.4        48         124.4          8.7
4         3 1991   39 Acer rubrum    6.5        49          91.7         39.0
5         3 1991   44 Acer rubrum    7.2        51         186.2          8.9
6         3 1992   26 Acer rubrum    3.1        40          20.8          0.9
  smbranch_green_kg lgbranch_green_kg allwoody_green_kg leaves_green_kg
1              30.5              48.4             184.2            16.1
2              23.5              57.7             206.6            12.9
3              22.3              44.1             199.5            16.5
4              22.5              35.5             188.7            12.0
5              25.4              65.1             285.6            22.4
6               1.9               1.5              25.1             0.9
  stem_dry_kg top_dry_kg smbranch_dry_kg lgbranch_dry_kg allwoody_dry_kg
1        54.7        7.1            15.3            28.0           105.1
2        62.3       12.4            14.8            33.6           123.1
3        73.3        4.6            11.5            25.1           114.4
4        53.6       21.3            11.2            19.8           105.9
5       106.4        4.7            11.7            36.1           159.0
6        11.7        0.5             1.1             0.9            14.2
  leaves_dry_kg
1           6.1
2           4.6
3           6.1
4           4.2
5           7.9
6           0.3

Comparing base and tidy data reading:

head(fef_tidy)
# A tibble: 6 × 18
  watershed  year  plot species     dbh_in height_ft stem_green_kg top_green_kg
      <dbl> <dbl> <dbl> <chr>        <dbl>     <dbl>         <dbl>        <dbl>
1         3  1991    29 Acer rubrum    6          48          92.2         13.1
2         3  1991    33 Acer rubrum    6.9        48         102.          23.1
3         3  1991    35 Acer rubrum    6.4        48         124.           8.7
4         3  1991    39 Acer rubrum    6.5        49          91.7         39  
5         3  1991    44 Acer rubrum    7.2        51         186.           8.9
6         3  1992    26 Acer rubrum    3.1        40          20.8          0.9
# ℹ 10 more variables: smbranch_green_kg <dbl>, lgbranch_green_kg <dbl>,
#   allwoody_green_kg <dbl>, leaves_green_kg <dbl>, stem_dry_kg <dbl>,
#   top_dry_kg <dbl>, smbranch_dry_kg <dbl>, lgbranch_dry_kg <dbl>,
#   allwoody_dry_kg <dbl>, leaves_dry_kg <dbl>

Comparing base and tidy data reading:

class(fef)
[1] "data.frame"
class(fef_tidy)
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
  • R’s built-in read.csv() function reads in data as a data.frame, but readr::read_csv() reads in data as a tibble. Neat!
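If you do not have FEF_trees.csv at hand, the same comparison can be reproduced on a tiny file. The sketch below writes a made-up two-row CSV to a temporary file (the file contents and the names csv_path, pets_base, and pets_tidy are illustrative only) and reads it with both functions. show_col_types = FALSE just silences readr's column-type message.

```r
library(readr)  # also attached by library(tidyverse)

# Write a small illustrative CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("names,ages,is_dog", "Dude,6,TRUE", "Pickle,5,FALSE"),
  csv_path
)

pets_base <- read.csv(file = csv_path)
pets_tidy <- read_csv(file = csv_path, show_col_types = FALSE)

# Same dimensions, different classes
dim(pets_base)
dim(pets_tidy)
class(pets_base)
class(pets_tidy)

# The guessed column types can differ too:
# read.csv() reads whole numbers as integer, read_csv() as double
class(pets_base$ages)
class(pets_tidy$ages)
```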

Reading data options

We can specify some nifty options with the read_*() functions from readr. We can specify column types:

fef_tidy_cols <- read_csv(
  file = "../labs/datasets/FEF_trees.csv",
  col_types = list(
    watershed = col_integer(),
    year = col_integer(),
    plot = col_integer()
  )
)
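Column types can also be given with readr's cols() helper or as a compact string with one letter per column (i = integer, c = character, d = double). A minimal sketch, assuming a small temporary file that holds only the first five columns and two rows of the FEF data. The names csv_path, fef_cols, and fef_compact are illustrative.

```r
library(readr)  # also attached by library(tidyverse)

# A two-row excerpt with five columns, written to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "watershed,year,plot,species,dbh_in",
    "3,1991,29,Acer rubrum,6.0",
    "3,1991,33,Acer rubrum,6.9"
  ),
  csv_path
)

# cols(): name the columns you care about; the rest are guessed
fef_cols <- read_csv(
  file = csv_path,
  col_types = cols(
    watershed = col_integer(),
    year = col_integer(),
    plot = col_integer()
  )
)

# Compact string: one letter per column, in file order
fef_compact <- read_csv(file = csv_path, col_types = "iiicd")

sapply(fef_cols, class)
sapply(fef_compact, class)
```

The compact string is short but must list every column in order, so the named cols() form is usually easier to maintain for wide files such as the full 18-column dataset.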

Reading data options

glimpse(fef_tidy)
Rows: 88
Columns: 18
$ watershed         <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ year              <dbl> 1991, 1991, 1991, 1991, 1991, 1992, 1992, 1992, 1992…
$ plot              <dbl> 29, 33, 35, 39, 44, 26, 26, 26, 48, 48, 48, 29, 33, …
$ species           <chr> "Acer rubrum", "Acer rubrum", "Acer rubrum", "Acer r…
$ dbh_in            <dbl> 6.0, 6.9, 6.4, 6.5, 7.2, 3.1, 2.0, 4.1, 2.4, 2.7, 3.…
$ height_ft         <dbl> 48.0, 48.0, 48.0, 49.0, 51.0, 40.0, 30.5, 50.0, 28.0…
$ stem_green_kg     <dbl> 92.2, 102.3, 124.4, 91.7, 186.2, 20.8, 5.6, 54.1, 10…
$ top_green_kg      <dbl> 13.1, 23.1, 8.7, 39.0, 8.9, 0.9, 0.9, 8.6, 0.7, 5.0,…
$ smbranch_green_kg <dbl> 30.5, 23.5, 22.3, 22.5, 25.4, 1.9, 2.2, 8.0, 3.7, 3.…
$ lgbranch_green_kg <dbl> 48.4, 57.7, 44.1, 35.5, 65.1, 1.5, 0.6, 4.0, 0.5, 1.…
$ allwoody_green_kg <dbl> 184.2, 206.6, 199.5, 188.7, 285.6, 25.1, 9.3, 74.7, …
$ leaves_green_kg   <dbl> 16.1, 12.9, 16.5, 12.0, 22.4, 0.9, 1.0, 6.1, 2.5, 1.…
$ stem_dry_kg       <dbl> 54.7, 62.3, 73.3, 53.6, 106.4, 11.7, 3.2, 28.3, 5.5,…
$ top_dry_kg        <dbl> 7.1, 12.4, 4.6, 21.3, 4.7, 0.5, 0.5, 4.4, 0.4, 2.7, …
$ smbranch_dry_kg   <dbl> 15.3, 14.8, 11.5, 11.2, 11.7, 1.1, 1.2, 3.6, 1.8, 0.…
$ lgbranch_dry_kg   <dbl> 28.0, 33.6, 25.1, 19.8, 36.1, 0.9, 0.3, 2.1, 0.3, 1.…
$ allwoody_dry_kg   <dbl> 105.1, 123.1, 114.4, 105.9, 159.0, 14.2, 5.3, 38.5, …
$ leaves_dry_kg     <dbl> 6.1, 4.6, 6.1, 4.2, 7.9, 0.3, 0.3, 1.9, 0.8, 0.5, 1.…
glimpse(fef_tidy_cols)
Rows: 88
Columns: 18
$ watershed         <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ year              <int> 1991, 1991, 1991, 1991, 1991, 1992, 1992, 1992, 1992…
$ plot              <int> 29, 33, 35, 39, 44, 26, 26, 26, 48, 48, 48, 29, 33, …
$ species           <chr> "Acer rubrum", "Acer rubrum", "Acer rubrum", "Acer r…
$ dbh_in            <dbl> 6.0, 6.9, 6.4, 6.5, 7.2, 3.1, 2.0, 4.1, 2.4, 2.7, 3.…
$ height_ft         <dbl> 48.0, 48.0, 48.0, 49.0, 51.0, 40.0, 30.5, 50.0, 28.0…
$ stem_green_kg     <dbl> 92.2, 102.3, 124.4, 91.7, 186.2, 20.8, 5.6, 54.1, 10…
$ top_green_kg      <dbl> 13.1, 23.1, 8.7, 39.0, 8.9, 0.9, 0.9, 8.6, 0.7, 5.0,…
$ smbranch_green_kg <dbl> 30.5, 23.5, 22.3, 22.5, 25.4, 1.9, 2.2, 8.0, 3.7, 3.…
$ lgbranch_green_kg <dbl> 48.4, 57.7, 44.1, 35.5, 65.1, 1.5, 0.6, 4.0, 0.5, 1.…
$ allwoody_green_kg <dbl> 184.2, 206.6, 199.5, 188.7, 285.6, 25.1, 9.3, 74.7, …
$ leaves_green_kg   <dbl> 16.1, 12.9, 16.5, 12.0, 22.4, 0.9, 1.0, 6.1, 2.5, 1.…
$ stem_dry_kg       <dbl> 54.7, 62.3, 73.3, 53.6, 106.4, 11.7, 3.2, 28.3, 5.5,…
$ top_dry_kg        <dbl> 7.1, 12.4, 4.6, 21.3, 4.7, 0.5, 0.5, 4.4, 0.4, 2.7, …
$ smbranch_dry_kg   <dbl> 15.3, 14.8, 11.5, 11.2, 11.7, 1.1, 1.2, 3.6, 1.8, 0.…
$ lgbranch_dry_kg   <dbl> 28.0, 33.6, 25.1, 19.8, 36.1, 0.9, 0.3, 2.1, 0.3, 1.…
$ allwoody_dry_kg   <dbl> 105.1, 123.1, 114.4, 105.9, 159.0, 14.2, 5.3, 38.5, …
$ leaves_dry_kg     <dbl> 6.1, 4.6, 6.1, 4.2, 7.9, 0.3, 0.3, 1.9, 0.8, 0.5, 1.…

Writing data

Writing data with readr looks very similar to base R. Say we’d like to write pets_tbl to “pets.csv”:

With base:

write.csv(x = pets_tbl, file = "pets.csv")

With readr:

write_csv(x = pets_tbl, file = "pets.csv")
  • Up to a 2x speed increase with readr::write_csv() compared to write.csv().
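Speed is not the only difference between the two writers. By default, write.csv() adds a column of row names and write_csv() does not. The sketch below writes pets_tbl to two temporary files (base_path and tidy_path are illustrative names, used so that nothing is written to your project folder) and compares the header lines.

```r
library(tibble)  # both packages are also attached by library(tidyverse)
library(readr)

pets_tbl <- tibble(
  names = c("Dude", "Pickle", "Kyle", "Nubs"),
  ages = c(6, 5, 3, 11),
  is_dog = c(TRUE, FALSE, FALSE, FALSE)
)

base_path <- tempfile(fileext = ".csv")
tidy_path <- tempfile(fileext = ".csv")

write.csv(x = pets_tbl, file = base_path)
write_csv(x = pets_tbl, file = tidy_path)

# The base R header starts with an empty "" entry for the row names column
base_header <- readLines(base_path, n = 1)
tidy_header <- readLines(tidy_path, n = 1)
base_header
tidy_header

# Reading the base R file back in therefore gives one extra column
ncol(read.csv(base_path))
ncol(read_csv(tidy_path, show_col_types = FALSE))
```

With base R, passing row.names = FALSE to write.csv() avoids the extra column.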

dplyr demo (time permitting)

Next time

  • dplyr