Advanced grouping and summarization

Practical Computing and Data Science Tools

Agenda

  • group_by() and summarize() in further detail,
  • Counting with n() and n_distinct(),
  • Working across() columns.

dplyr verbs

Recall the standard dplyr verb form:

verb(data, action)
  • The first argument of any dplyr verb is the data (a tibble or data.frame), and the next arguments specify how we are using the verb().
  • Because the first argument is always the data, dplyr verbs can be chained with the pipe (%>%): the output of one verb becomes the data argument of the next.
  • A very common pipeline between verbs occurs between group_by() and summarize().
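
As a minimal sketch of the verb(data, action) form, the two calls below should give the same result. The small pets tibble and the choice of filter() as the verb are illustrative assumptions, not part of the lecture data.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
pets <- tibble(
  names = c("Dude", "Pickle", "Kyle"),
  ages = c(6, 5, 3)
)

# Standard form: the data is the first argument
direct <- filter(pets, ages > 4)

# Piped form: %>% passes `pets` in as the first argument of filter()
piped <- pets %>% filter(ages > 4)

# Both forms keep the same two rows
identical(direct, piped)
```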

Again, we will use more_pets

library(tidyverse)
more_pets <- tibble(
  names = c("Dude", "Pickle", "Kyle", "Nubs", "Marvin", "Figaro", "Slim"),
  ages = c(6, 5, 3, 11, 11, 3, 6),
  meals_per_day = c(2, 3, 3, 3, 1, 2, 2),
  type_of_animal = c("dog", rep("cat", 3), "sheep/ram", "cat", "dog")
)
more_pets
# A tibble: 7 × 4
  names   ages meals_per_day type_of_animal
  <chr>  <dbl>         <dbl> <chr>         
1 Dude       6             2 dog           
2 Pickle     5             3 cat           
3 Kyle       3             3 cat           
4 Nubs      11             3 cat           
5 Marvin    11             1 sheep/ram     
6 Figaro     3             2 cat           
7 Slim       6             2 dog           

Summarizing data with dplyr

Summarization

  • Summaries are one of the most direct ways to learn from data.
  • The larger the dataset, the harder it is to read row by row, and the more useful summaries become.
  • dplyr provides intuitive and powerful ways to summarize data.
  • Recall: the dplyr function for summarization is, unsurprisingly, summarize().

Summarization

  • The summarize() function takes a very similar form to the other dplyr functions.
  • In particular, it is of the form:
summarize(data, new_summary_column_name = summarization_code)

Get the mean value of the pets’ ages

more_pets %>%
  summarize(avg_age = mean(ages))
# A tibble: 1 × 1
  avg_age
    <dbl>
1    6.43

Get the total years lived by the pets

more_pets %>%
  summarize(years_lived = sum(ages))
# A tibble: 1 × 1
  years_lived
        <dbl>
1          45

Custom functions in summarize()

Recall the quadratic mean, the square root of the mean of the squared values:

quad_mean <- function(x) {
  # Divide by length(x) inside the square root
  return(sqrt(sum(x^2) / length(x)))
}

more_pets %>%
  summarize(q_mean = quad_mean(ages),
            a_mean = mean(ages))
# A tibble: 1 × 2
  q_mean a_mean
   <dbl>  <dbl>
1   7.14   6.43
  • We can use our own functions within a summarize!

group_by()

  • The group_by() function allows us to group our tibble by a variable of interest.
  • On its own, group_by() does not change the rows or columns of the tibble; it only marks the tibble as “grouped”.
  • However, when we go to summarize() “grouped” data, we get the results for each group.
  • Let’s try it out!

Average age for each type of pet

more_pets %>% 
  group_by(type_of_animal)
# A tibble: 7 × 4
# Groups:   type_of_animal [3]
  names   ages meals_per_day type_of_animal
  <chr>  <dbl>         <dbl> <chr>         
1 Dude       6             2 dog           
2 Pickle     5             3 cat           
3 Kyle       3             3 cat           
4 Nubs      11             3 cat           
5 Marvin    11             1 sheep/ram     
6 Figaro     3             2 cat           
7 Slim       6             2 dog           

Average age for each type of pet

more_pets %>% 
  group_by(type_of_animal) %>%
  summarize(avg_age = mean(ages))
# A tibble: 3 × 2
  type_of_animal avg_age
  <chr>            <dbl>
1 cat                5.5
2 dog                6  
3 sheep/ram         11  
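
To check that group_by() only attaches grouping information, the sketch below inspects a grouped copy of more_pets. It assumes a recent dplyr (1.0 or later); the object names grouped and ungrouped are introduced here for illustration.

```r
library(dplyr)

more_pets <- tibble(
  names = c("Dude", "Pickle", "Kyle", "Nubs", "Marvin", "Figaro", "Slim"),
  ages = c(6, 5, 3, 11, 11, 3, 6),
  meals_per_day = c(2, 3, 3, 3, 1, 2, 2),
  type_of_animal = c("dog", rep("cat", 3), "sheep/ram", "cat", "dog")
)

grouped <- more_pets %>% group_by(type_of_animal)

# Same number of rows and columns as the ungrouped tibble
dim(grouped)

# The grouping itself is stored as metadata
group_vars(grouped)  # name of the grouping column
n_groups(grouped)    # number of groups

# ungroup() removes the grouping again
ungrouped <- grouped %>% ungroup()
is_grouped_df(ungrouped)
```

Calling ungroup() after a grouped mutate() is a common way to avoid later verbs unexpectedly operating group by group.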

How many of each type of animal?

more_pets %>% 
  group_by(type_of_animal) %>%
  summarize(number_of_types = n())
# A tibble: 3 × 2
  type_of_animal number_of_types
  <chr>                    <int>
1 cat                          4
2 dog                          2
3 sheep/ram                    1

What’s going on with n()?

  • The n() function counts for us!
  • When we use n(), nothing ever goes in the parentheses.
  • For our purposes, we will use n() within a summarize() or mutate() on a grouped dataset, where n() counts the rows in each group.
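
The sketch below contrasts n() on ungrouped and grouped data, and shows count(), which dplyr provides as a shorthand for the group_by() plus summarize(n = n()) pattern. The tibble is a trimmed copy of more_pets, and the result names (total, by_type, by_type_short) are illustrative.

```r
library(dplyr)

# Trimmed copy of more_pets: only the column needed here
more_pets <- tibble(
  type_of_animal = c("dog", rep("cat", 3), "sheep/ram", "cat", "dog")
)

# Ungrouped: n() counts every row of the tibble
total <- more_pets %>%
  summarize(n_pets = n())

# Grouped: n() counts the rows in each group
by_type <- more_pets %>%
  group_by(type_of_animal) %>%
  summarize(n = n())

# count() wraps the grouped pattern in a single verb
by_type_short <- more_pets %>%
  count(type_of_animal)
```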

Number of other animals of the same type

more_pets %>% 
  group_by(type_of_animal) %>%
  mutate(other_of_same = n() - 1) 
# A tibble: 7 × 5
# Groups:   type_of_animal [3]
  names   ages meals_per_day type_of_animal other_of_same
  <chr>  <dbl>         <dbl> <chr>                  <dbl>
1 Dude       6             2 dog                        1
2 Pickle     5             3 cat                        3
3 Kyle       3             3 cat                        3
4 Nubs      11             3 cat                        3
5 Marvin    11             1 sheep/ram                  0
6 Figaro     3             2 cat                        3
7 Slim       6             2 dog                        1

Number of types of animals

more_pets %>% 
  summarize(n_types = n_distinct(type_of_animal))
# A tibble: 1 × 1
  n_types
    <int>
1       3

The n_distinct() function

  • Used similarly to n(), but takes one or more columns as input,
  • Counts the number of distinct values in the supplied columns,
  • Works on grouped data, returning one count per group.
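
As a rough illustration of what n_distinct() computes, the sketch below compares it with length(unique()) on a plain vector and shows its na.rm argument. The vectors ages and with_missing are illustrative.

```r
library(dplyr)

ages <- c(6, 5, 3, 11, 11, 3, 6)

# Distinct values are 6, 5, 3 and 11
n_distinct(ages)
length(unique(ages))  # same count, using base R

# By default a missing value counts as one more distinct value
with_missing <- c(ages, NA)
n_distinct(with_missing)

# na.rm = TRUE drops missing values before counting
n_distinct(with_missing, na.rm = TRUE)
```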

Number of unique ages for each type of animal

more_pets %>%
  group_by(type_of_animal) %>%
  summarize(unique_ages = n_distinct(ages))
# A tibble: 3 × 2
  type_of_animal unique_ages
  <chr>                <int>
1 cat                      3
2 dog                      1
3 sheep/ram                1

Working across() columns

across()

  • across() takes two main arguments: .cols and .fns.
  • The .cols argument specifies the columns to which we’d like to apply our function(s), .fns.
  • In practice, we use across() within a mutate() or a summarize().
  • Let’s try it out!

Get the mean of ages and meals_per_day

more_pets %>%
  summarize(
    across(.cols = c(ages, meals_per_day),
           .fns = mean)
  )
# A tibble: 1 × 2
   ages meals_per_day
  <dbl>         <dbl>
1  6.43          2.29
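
Instead of naming each column, .cols can also take a selection helper. The sketch below uses where(is.numeric) to pick every numeric column; it assumes dplyr 1.0.0 or later, and the name numeric_means is illustrative.

```r
library(dplyr)

more_pets <- tibble(
  names = c("Dude", "Pickle", "Kyle", "Nubs", "Marvin", "Figaro", "Slim"),
  ages = c(6, 5, 3, 11, 11, 3, 6),
  meals_per_day = c(2, 3, 3, 3, 1, 2, 2),
  type_of_animal = c("dog", rep("cat", 3), "sheep/ram", "cat", "dog")
)

# where(is.numeric) selects ages and meals_per_day,
# and skips the character columns names and type_of_animal
numeric_means <- more_pets %>%
  summarize(
    across(.cols = where(is.numeric),
           .fns = mean)
  )
numeric_means
```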

Make the ages and meals_per_day columns integers

more_pets %>%
  mutate(
    across(.cols = c(ages, meals_per_day),
           .fns = as.integer)
  )
# A tibble: 7 × 4
  names   ages meals_per_day type_of_animal
  <chr>  <int>         <int> <chr>         
1 Dude       6             2 dog           
2 Pickle     5             3 cat           
3 Kyle       3             3 cat           
4 Nubs      11             3 cat           
5 Marvin    11             1 sheep/ram     
6 Figaro     3             2 cat           
7 Slim       6             2 dog           

Get the mean and standard deviation of ages and meals_per_day

Bad example

more_pets %>%
  summarize(
    across(.cols = c(ages, meals_per_day),
           .fns = mean)
  )
# A tibble: 1 × 2
   ages meals_per_day
  <dbl>         <dbl>
1  6.43          2.29
more_pets %>%
  summarize(
    across(.cols = c(ages, meals_per_day),
           .fns = sd)
  )
# A tibble: 1 × 2
   ages meals_per_day
  <dbl>         <dbl>
1  3.36         0.756

Get the mean and standard deviation of ages and meals_per_day

Better

more_pets %>%
  summarize(
    across(.cols = c(ages, meals_per_day),
           .fns = c(mean, sd))
  )
# A tibble: 1 × 4
  ages_1 ages_2 meals_per_day_1 meals_per_day_2
   <dbl>  <dbl>           <dbl>           <dbl>
1   6.43   3.36            2.29           0.756

Get the mean and standard deviation of ages and meals_per_day

Best

more_pets %>%
  summarize(
    across(.cols = c(ages, meals_per_day),
           .fns = c(mean = mean, standard_deviation = sd))
  )
# A tibble: 1 × 4
  ages_mean ages_standard_deviation meals_per_day_mean meals_per_day_standard_…¹
      <dbl>                   <dbl>              <dbl>                     <dbl>
1      6.43                    3.36               2.29                     0.756
# ℹ abbreviated name: ¹​meals_per_day_standard_deviation
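
If the default column_function names are too long, across() also has a .names argument that takes a glue-style template with {.col} and {.fn}. The sketch below is one possible naming scheme, not the lecture's; the shorter label sd and the name renamed are illustrative choices.

```r
library(dplyr)

more_pets <- tibble(
  names = c("Dude", "Pickle", "Kyle", "Nubs", "Marvin", "Figaro", "Slim"),
  ages = c(6, 5, 3, 11, 11, 3, 6),
  meals_per_day = c(2, 3, 3, 3, 1, 2, 2),
  type_of_animal = c("dog", rep("cat", 3), "sheep/ram", "cat", "dog")
)

# Put the function label first and the column name second
renamed <- more_pets %>%
  summarize(
    across(.cols = c(ages, meals_per_day),
           .fns = list(mean = mean, sd = sd),
           .names = "{.fn}_{.col}")
  )
names(renamed)
```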

Get the mean and standard deviation of ages and meals_per_day, grouped by type of animal

more_pets %>%
  group_by(type_of_animal) %>%
  summarize(
    across(.cols = c(ages, meals_per_day),
           .fns = c(mean = mean, standard_deviation = sd))
  )
# A tibble: 3 × 5
  type_of_animal ages_mean ages_standard_deviation meals_per_day_mean
  <chr>              <dbl>                   <dbl>              <dbl>
1 cat                  5.5                    3.79               2.75
2 dog                  6                      0                  2   
3 sheep/ram           11                     NA                  1   
# ℹ 1 more variable: meals_per_day_standard_deviation <dbl>
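
The NA for sheep/ram arises because sd() needs at least two values and that group has a single row. One way to make this visible, sketched below, is to report the group size with n() next to the across() summaries; the column name n and the result name with_sizes are illustrative.

```r
library(dplyr)

more_pets <- tibble(
  names = c("Dude", "Pickle", "Kyle", "Nubs", "Marvin", "Figaro", "Slim"),
  ages = c(6, 5, 3, 11, 11, 3, 6),
  meals_per_day = c(2, 3, 3, 3, 1, 2, 2),
  type_of_animal = c("dog", rep("cat", 3), "sheep/ram", "cat", "dog")
)

# n() and across() can sit side by side in one summarize() call
with_sizes <- more_pets %>%
  group_by(type_of_animal) %>%
  summarize(
    n = n(),
    across(.cols = c(ages, meals_per_day),
           .fns = list(mean = mean, standard_deviation = sd))
  )
with_sizes
```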

Specifying function arguments in calls to across()

Consider the toy dataset:

trees <- tibble(
  dbh = c(15, 9, NA),
  height = c(50, 34, 33),
  spp = c("doug-fir", "madrone", "doug-fir")
)
trees
# A tibble: 3 × 3
    dbh height spp     
  <dbl>  <dbl> <chr>   
1    15     50 doug-fir
2     9     34 madrone 
3    NA     33 doug-fir

What if we want the mean dbh and height?

trees %>%
  summarize(
    across(.cols = c(dbh, height),
           .fns = mean)
  )
# A tibble: 1 × 2
    dbh height
  <dbl>  <dbl>
1    NA     39
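  • The dbh mean is NA :-( because mean() returns NA whenever its input contains a missing value, unless we set na.rm = TRUE.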

We need “lambda syntax”

trees %>%
  summarize(
    across(.cols = c(dbh, height),
           .fns = ~ mean(.x, na.rm = TRUE))
  )
# A tibble: 1 × 2
    dbh height
  <dbl>  <dbl>
1    12     39
  • Before, we had .fns = mean, which gave us no place to pass additional arguments such as na.rm = TRUE.
  • We use the tilde ~ to specify lambda syntax.
  • After the ~, we write the function call in full, with parentheses, so we can specify additional arguments.
  • We now must specify .x in the place where the columns go (before, this happened implicitly)
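
The tilde form is one of several ways to write the same anonymous function. The sketch below compares it with base R's function(x) and with the \(x) shorthand, which assumes R 4.1 or later; the result names are illustrative.

```r
library(dplyr)

trees <- tibble(
  dbh = c(15, 9, NA),
  height = c(50, 34, 33),
  spp = c("doug-fir", "madrone", "doug-fir")
)

# Tilde (lambda) syntax: the column is referred to as .x
formula_style <- trees %>%
  summarize(across(.cols = c(dbh, height),
                   .fns = ~ mean(.x, na.rm = TRUE)))

# Base R anonymous function: the column is the argument x
function_style <- trees %>%
  summarize(across(.cols = c(dbh, height),
                   .fns = function(x) mean(x, na.rm = TRUE)))

# Shorthand for function(x), available in R 4.1 and later
backslash_style <- trees %>%
  summarize(across(.cols = c(dbh, height),
                   .fns = \(x) mean(x, na.rm = TRUE)))

# All three should agree
all.equal(formula_style, function_style)
all.equal(formula_style, backslash_style)
```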

Let’s take a look at Lab 7

https://www.for128.org

Next time

  • Joining tibbles with dplyr