Advanced grouping and summarization

Practical Computing and Data Science Tools

Agenda

group_by() and summarize() in further detail,
Counting with n() and n_distinct(),
Working across() columns.

`dplyr` verbs

Recall the standard dplyr verb form:

verb(data, action)

The first argument of any dplyr verb is the data (a tibble or data.frame), and the next arguments specify how we are using the verb().
Since the first argument is always the data, dplyr functions can easily be “piped” together to create a pipeline from one verb to the next.
A very common pipeline between verbs occurs between group_by() and summarize().
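As a minimal sketch of why the data-first form matters, the nested call and the piped call below should give the same result. The small `pets` tibble and its column names are hypothetical stand-ins used only for illustration.

```r
library(dplyr)

# Hypothetical stand-in data (not the course dataset)
pets <- tibble(
  type = c("dog", "cat", "cat"),
  age  = c(6, 5, 3)
)

# Standard form: verb(data, action), nested inside-out
nested <- summarize(group_by(pets, type), avg_age = mean(age))

# Piped form: the left-hand side becomes the first argument of the next verb
piped <- pets %>%
  group_by(type) %>%
  summarize(avg_age = mean(age))

# Compare the two results
all.equal(nested, piped)
```

The piped version reads top to bottom in the order the steps happen, which is the main reason pipelines are preferred once more than one verb is involved.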

Again, we will use `more_pets`

library(tidyverse)
more_pets <- tibble(
  names = c("Dude", "Pickle", "Kyle", "Nubs", "Marvin", "Figaro", "Slim"),
  ages = c(6, 5, 3, 11, 11, 3, 6),
  meals_per_day = c(2, 3, 3, 3, 1, 2, 2),
  type_of_animal = c("dog", rep("cat", 3), "sheep/ram", "cat", "dog")
)
more_pets

# A tibble: 7 × 4
  names   ages meals_per_day type_of_animal
  <chr>  <dbl>         <dbl> <chr>         
1 Dude       6             2 dog           
2 Pickle     5             3 cat           
3 Kyle       3             3 cat           
4 Nubs      11             3 cat           
5 Marvin    11             1 sheep/ram     
6 Figaro     3             2 cat           
7 Slim       6             2 dog

Summarizing data with `dplyr`

Summarization

Data summaries are some of the best ways to learn from the data we have.
With increasingly large data, it can be very helpful to summarize.
dplyr provides intuitive and powerful ways to summarize data.
Recall: dplyr’s function for summarization is, unsurprisingly, summarize().

Summarization

The summarize() function takes a very similar form to the other dplyr functions.
In particular, it is of the form:

summarize(data, new_summary_column_name = summarization_code)

Get the mean value of the pets’ ages

more_pets %>%
  summarize(avg_age = mean(ages))

# A tibble: 1 × 1
  avg_age
    <dbl>
1    6.43

Get the total years lived by the pets

more_pets %>%
  summarize(years_lived = sum(ages))

# A tibble: 1 × 1
  years_lived
        <dbl>
1          45

Custom functions in `summarize()`

Recall the quadratic mean:

# Quadratic mean (root mean square): the square root of the mean of the squares
quad_mean <- function(x) {
  return(sqrt(sum(x^2) / length(x)))
}

more_pets %>%
  summarize(q_mean = quad_mean(ages),
            a_mean = mean(ages))

# A tibble: 1 × 2
  q_mean a_mean
   <dbl>  <dbl>
1   7.14   6.43

We can use our own functions within a summarize!

`group_by()`

The group_by() function allows us to group our tibble by a variable of interest.
On its own, group_by() does not change the rows or columns of the tibble; it just marks the tibble as “grouped”.
However, when we go to summarize() “grouped” data, we get the results for each group.
Let’s try it out!

average age for each type of pet

more_pets

# A tibble: 7 × 4
  names   ages meals_per_day type_of_animal
  <chr>  <dbl>         <dbl> <chr>         
1 Dude       6             2 dog           
2 Pickle     5             3 cat           
3 Kyle       3             3 cat           
4 Nubs      11             3 cat           
5 Marvin    11             1 sheep/ram     
6 Figaro     3             2 cat           
7 Slim       6             2 dog

average age for each type of pet

more_pets %>% 
  group_by(type_of_animal)

# A tibble: 7 × 4
# Groups:   type_of_animal [3]
  names   ages meals_per_day type_of_animal
  <chr>  <dbl>         <dbl> <chr>         
1 Dude       6             2 dog           
2 Pickle     5             3 cat           
3 Kyle       3             3 cat           
4 Nubs      11             3 cat           
5 Marvin    11             1 sheep/ram     
6 Figaro     3             2 cat           
7 Slim       6             2 dog

average age for each type of pet

more_pets %>% 
  group_by(type_of_animal) %>%
  summarize(avg_age = mean(ages))

# A tibble: 3 × 2
  type_of_animal avg_age
  <chr>            <dbl>
1 cat                5.5
2 dog                6  
3 sheep/ram         11

How many of each type of animal?

more_pets %>% 
  group_by(type_of_animal) %>%
  summarize(number_of_types = n())

# A tibble: 3 × 2
  type_of_animal number_of_types
  <chr>                    <int>
1 cat                          4
2 dog                          2
3 sheep/ram                    1

What’s going on with `n()`

The n() function counts for us!
When we use n(), nothing ever goes in the parentheses.
For our cases, we will use n() within a summarize() or mutate() on a grouped dataset, where n() counts the rows in each group for us.
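As a small sketch of what n() is doing, the grouped summarize() below is compared with dplyr's count() shortcut, which should produce the same per-group row counts. The `pets` tibble and the column name `type` are hypothetical stand-ins for illustration.

```r
library(dplyr)

# Hypothetical stand-in data: one row per pet
pets <- tibble(
  type = c("dog", "cat", "cat", "cat", "sheep/ram", "cat", "dog")
)

# Count rows per group by hand with n()
by_hand <- pets %>%
  group_by(type) %>%
  summarize(n = n())

# count() groups, counts, and ungroups in one step
shortcut <- pets %>%
  count(type)

by_hand
shortcut
```

Writing the group_by() and n() steps out is still worthwhile when the count is only one of several summaries computed for each group.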

Number of other animals of the same type

more_pets %>% 
  group_by(type_of_animal) %>%
  mutate(other_of_same = n() - 1)

# A tibble: 7 × 5
# Groups:   type_of_animal [3]
  names   ages meals_per_day type_of_animal other_of_same
  <chr>  <dbl>         <dbl> <chr>                  <dbl>
1 Dude       6             2 dog                        1
2 Pickle     5             3 cat                        3
3 Kyle       3             3 cat                        3
4 Nubs      11             3 cat                        3
5 Marvin    11             1 sheep/ram                  0
6 Figaro     3             2 cat                        3
7 Slim       6             2 dog                        1
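Note that the result of a grouped mutate() is still grouped, so later verbs keep operating per group. A minimal sketch of checking for and removing the grouping, using a hypothetical stand-in tibble `pets`:

```r
library(dplyr)

# Hypothetical stand-in data
pets <- tibble(
  type = c("dog", "cat", "cat", "cat", "sheep/ram", "cat", "dog")
)

grouped <- pets %>%
  group_by(type) %>%
  mutate(other_of_same = n() - 1)

# The grouping survives the mutate()
is_grouped_df(grouped)

# ungroup() drops it, so later summaries apply to the whole tibble
ungrouped <- grouped %>%
  ungroup()

is_grouped_df(ungrouped)

# With the grouping removed, n() counts every row
total <- ungrouped %>%
  summarize(total_pets = n())
total
```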

Number of types of animals

more_pets %>% 
  summarize(n_types = n_distinct(type_of_animal))

# A tibble: 1 × 1
  n_types
    <int>
1       3

The `n_distinct()` function

Used similarly to n(), but takes columns as inputs,
Counts the number of distinct values in the supplied columns,
Can be used on grouped data.
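To make the behavior concrete, here is a short sketch on a plain vector. It compares n_distinct() with the base R equivalent and shows how missing values are treated; the vector names are chosen only for illustration.

```r
library(dplyr)

ages <- c(6, 5, 3, 11, 11, 3, 6)

# n_distinct() gives the same answer as length(unique())
distinct_ages <- n_distinct(ages)
base_equivalent <- length(unique(ages))

# By default, NA is counted as its own distinct value
ages_with_na <- c(ages, NA)
with_na <- n_distinct(ages_with_na)

# na.rm = TRUE leaves missing values out of the count
without_na <- n_distinct(ages_with_na, na.rm = TRUE)

c(distinct_ages, base_equivalent, with_na, without_na)
```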

Number of unique ages for each type of animal

more_pets %>%
  group_by(type_of_animal) %>%
  summarize(unique_ages = n_distinct(ages))

# A tibble: 3 × 2
  type_of_animal unique_ages
  <chr>                <int>
1 cat                      3
2 dog                      1
3 sheep/ram                1

Working `across()` columns

`across()`

across() takes two main arguments: .cols and .fns
The .cols argument specifies the columns to which we’d like to apply our function, .fns.
In practice, we use across() within a mutate() or a summarize().
Let’s try it out!

Get the mean of ages and meals_per_day

more_pets %>%
  summarize(
    across(.cols = c(ages, meals_per_day),
           .fns = mean)
  )

# A tibble: 1 × 2
   ages meals_per_day
  <dbl>         <dbl>
1  6.43          2.29
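The .cols argument also accepts selection helpers instead of a list of column names. As a sketch, the call below selects every numeric column with where(is.numeric), which for this tibble should pick out the same two columns as above.

```r
library(dplyr)

more_pets <- tibble(
  names = c("Dude", "Pickle", "Kyle", "Nubs", "Marvin", "Figaro", "Slim"),
  ages = c(6, 5, 3, 11, 11, 3, 6),
  meals_per_day = c(2, 3, 3, 3, 1, 2, 2),
  type_of_animal = c("dog", rep("cat", 3), "sheep/ram", "cat", "dog")
)

# Select columns by type rather than by name
numeric_means <- more_pets %>%
  summarize(
    across(.cols = where(is.numeric),
           .fns = mean)
  )
numeric_means
```

Selecting by type is convenient for wide tibbles, but it will also pick up numeric columns such as ID numbers, so naming the columns explicitly is the safer choice when in doubt.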

Make the ages and meals_per_day columns integers

more_pets %>%
  mutate(
    across(.cols = c(ages, meals_per_day),
           .fns = as.integer)
  )

# A tibble: 7 × 4
  names   ages meals_per_day type_of_animal
  <chr>  <int>         <int> <chr>         
1 Dude       6             2 dog           
2 Pickle     5             3 cat           
3 Kyle       3             3 cat           
4 Nubs      11             3 cat           
5 Marvin    11             1 sheep/ram     
6 Figaro     3             2 cat           
7 Slim       6             2 dog

Get the mean and standard deviation of ages and meals_per_day

Bad example

more_pets %>%
  summarize(
    across(.cols = c(ages, meals_per_day),
           .fns = mean)
  )

# A tibble: 1 × 2
   ages meals_per_day
  <dbl>         <dbl>
1  6.43          2.29

more_pets %>%
  summarize(
    across(.cols = c(ages, meals_per_day),
           .fns = sd)
  )

# A tibble: 1 × 2
   ages meals_per_day
  <dbl>         <dbl>
1  3.36         0.756

Get the mean and standard deviation of ages and meals_per_day

Better

more_pets %>%
  summarize(
    across(.cols = c(ages, meals_per_day),
           .fns = c(mean, sd))
  )

# A tibble: 1 × 4
  ages_1 ages_2 meals_per_day_1 meals_per_day_2
   <dbl>  <dbl>           <dbl>           <dbl>
1   6.43   3.36            2.29           0.756

Get the mean and standard deviation of ages and meals_per_day

Best

more_pets %>%
  summarize(
    across(.cols = c(ages, meals_per_day),
           .fns = c(mean = mean, standard_deviation = sd))
  )

# A tibble: 1 × 4
  ages_mean ages_standard_deviation meals_per_day_mean meals_per_day_standard_…¹
      <dbl>                   <dbl>              <dbl>                     <dbl>
1      6.43                    3.36               2.29                     0.756
# ℹ abbreviated name: ¹meals_per_day_standard_deviation
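If the default column_function names get too long, across() also has a .names argument that takes a naming template. A minimal sketch, in which the shorter labels `avg` and `sd` are just illustrative choices:

```r
library(dplyr)

more_pets <- tibble(
  names = c("Dude", "Pickle", "Kyle", "Nubs", "Marvin", "Figaro", "Slim"),
  ages = c(6, 5, 3, 11, 11, 3, 6),
  meals_per_day = c(2, 3, 3, 3, 1, 2, 2),
  type_of_animal = c("dog", rep("cat", 3), "sheep/ram", "cat", "dog")
)

# {.col} is the column name and {.fn} is the name given to the function
renamed <- more_pets %>%
  summarize(
    across(.cols = c(ages, meals_per_day),
           .fns = list(avg = mean, sd = sd),
           .names = "{.fn}_{.col}")
  )
names(renamed)
```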

Get the mean and standard deviation of ages and meals_per_day, grouped by type of animal

more_pets %>%
  group_by(type_of_animal) %>%
  summarize(
    across(.cols = c(ages, meals_per_day),
           .fns = c(mean = mean, standard_deviation = sd))
  )

# A tibble: 3 × 5
  type_of_animal ages_mean ages_standard_deviation meals_per_day_mean
  <chr>              <dbl>                   <dbl>              <dbl>
1 cat                  5.5                    3.79               2.75
2 dog                  6                      0                  2   
3 sheep/ram           11                     NA                  1   
# ℹ 1 more variable: meals_per_day_standard_deviation <dbl>

Specifying function arguments in calls to `across()`

Consider the toy dataset:

trees <- tibble(
  dbh = c(15, 9, NA),
  height = c(50, 34, 33),
  spp = c("doug-fir", "madrone", "doug-fir")
)
trees

# A tibble: 3 × 3
    dbh height spp     
  <dbl>  <dbl> <chr>   
1    15     50 doug-fir
2     9     34 madrone 
3    NA     33 doug-fir

What if we want the mean dbh and height?

trees %>%
  summarize(
    across(.cols = c(dbh, height),
           .fns = mean)
  )

# A tibble: 1 × 2
    dbh height
  <dbl>  <dbl>
1    NA     39

The mean of dbh comes back as NA because dbh contains a missing value :-(

We need “lambda syntax”

trees %>%
  summarize(
    across(.cols = c(dbh, height),
           .fns = ~ mean(.x, na.rm = TRUE))
  )

# A tibble: 1 × 2
    dbh height
  <dbl>  <dbl>
1    12     39

Before, we had .fns = mean, which passes the bare function. With lambda syntax we write the call out in full, so we can specify additional arguments inside the parentheses.
We use the tilde ~ to start a lambda.
After the ~, we write the function call with its parentheses and any additional arguments, such as na.rm = TRUE.

We now must specify .x in the place where the columns go (before, this happened implicitly)
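The ~ form is shorthand for an ordinary function of one argument. As a sketch, the three spellings below should give the same result; the \(x) shorthand assumes R 4.1 or later.

```r
library(dplyr)

trees <- tibble(
  dbh = c(15, 9, NA),
  height = c(50, 34, 33)
)

# Formula-style lambda: .x stands for the current column
formula_style <- trees %>%
  summarize(across(c(dbh, height), ~ mean(.x, na.rm = TRUE)))

# The same thing written as a regular anonymous function
anonymous_fn <- trees %>%
  summarize(across(c(dbh, height), function(x) mean(x, na.rm = TRUE)))

# Base R shorthand for an anonymous function (R >= 4.1)
shorthand <- trees %>%
  summarize(across(c(dbh, height), \(x) mean(x, na.rm = TRUE)))

all.equal(formula_style, anonymous_fn)
all.equal(formula_style, shorthand)
```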

Let’s take a look at Lab 7

https://www.for128.org

Next time

Joining tibbles with dplyr
