group_by() and summarize() in further detail,
n() and n_distinct(),
across() columns.

dplyr verbs
Recall the standard dplyr verb form:
The first argument of a dplyr verb is the data (a tibble or data.frame), and the next arguments specify how we are using the verb().
dplyr functions can easily be “piped” together to create a pipeline from one verb to another.
Today we look more closely at group_by() and summarize().

more_pets
library(tidyverse)
more_pets <- tibble(
names = c("Dude", "Pickle", "Kyle", "Nubs", "Marvin", "Figaro", "Slim"),
ages = c(6, 5, 3, 11, 11, 3, 6),
meals_per_day = c(2, 3, 3, 3, 1, 2, 2),
type_of_animal = c("dog", rep("cat", 3), "sheep/ram", "cat", "dog")
)
more_pets
# A tibble: 7 × 4
  names   ages meals_per_day type_of_animal
  <chr>  <dbl>         <dbl> <chr>
1 Dude       6             2 dog
2 Pickle     5             3 cat
3 Kyle       3             3 cat
4 Nubs      11             3 cat
5 Marvin    11             1 sheep/ram
6 Figaro     3             2 cat
7 Slim       6             2 dog
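As a rough illustration of the verb form and the piping described at the start of this lesson, the sketch below applies filter() and select() to more_pets. It assumes more_pets has been created as above; cats and cat_ages are illustrative names that do not appear in the original slides.

```r
library(dplyr)

# Standard verb form: the data is the first argument,
# the remaining arguments say how the verb is used.
cats <- filter(more_pets, type_of_animal == "cat")

# The same step written with the pipe, then chained into a second verb.
cat_ages <- more_pets %>%
  filter(type_of_animal == "cat") %>%
  select(names, ages)
cat_ages
```

Both forms do the same filtering; the pipe simply passes the tibble on its left in as the first argument of the verb on its right.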

Summarizing data with dplyr
dplyr provides intuitive and powerful ways to summarize data.

summarize()
The summarize() function takes a very similar form to the other dplyr functions.
# A tibble: 1 × 1
  years_lived
        <dbl>
1          45
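The call that produced the years_lived output is not shown in the text. A call along the following lines would reproduce it, assuming more_pets as defined above; total_years is an illustrative name.

```r
library(dplyr)

# summarize() follows the usual verb form: data first, then name = expression.
# The result is a one-row tibble with one column per summary.
total_years <- more_pets %>%
  summarize(years_lived = sum(ages))
total_years
```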

summarize()
Recall the quadratic mean:
quad_mean <- function(x) {
  # Quadratic mean (root mean square): the square root of the mean of the squares.
  # The division by length(x) belongs inside sqrt().
  sqrt(sum(x^2) / length(x))
}
more_pets %>%
  summarize(q_mean = quad_mean(ages),
            a_mean = mean(ages))
# A tibble: 1 × 2
  q_mean a_mean
   <dbl>  <dbl>
1   7.14   6.43

group_by()
The group_by() function allows us to group our tibble by a variable of interest.
group_by() on its own does not change the rows or columns of the tibble; it just makes it “grouped”.
When we summarize() “grouped” data, we get the results for each group.
# A tibble: 3 × 2
  type_of_animal number_of_types
  <chr>                    <int>
1 cat                          4
2 dog                          2
3 sheep/ram                    1
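The code behind the grouped count is not shown in the text. The sketch below is one way to reproduce it, assuming more_pets as defined above; grouped_pets and type_counts are illustrative names.

```r
library(dplyr)

# group_by() alone only marks the tibble as grouped;
# its rows and columns are unchanged.
grouped_pets <- more_pets %>%
  group_by(type_of_animal)

# summarize() on the grouped tibble returns one row per group.
type_counts <- grouped_pets %>%
  summarize(number_of_types = n())
type_counts
```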

n()
The n() function counts for us!
With n(), nothing ever goes in the parentheses.
We use n() within a summarize() or mutate() on a grouped dataset. n() counts the rows in each group for us.
# A tibble: 7 × 5
# Groups:   type_of_animal [3]
  names   ages meals_per_day type_of_animal other_of_same
  <chr>  <dbl>         <dbl> <chr>                  <dbl>
1 Dude       6             2 dog                        1
2 Pickle     5             3 cat                        3
3 Kyle       3             3 cat                        3
4 Nubs      11             3 cat                        3
5 Marvin    11             1 sheep/ram                  0
6 Figaro     3             2 cat                        3
7 Slim       6             2 dog                        1
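The other_of_same column can be reproduced with n() inside a grouped mutate(). This is a sketch under the assumption that the column counts the other animals of the same type, which matches the values shown; pets_with_counts is an illustrative name.

```r
library(dplyr)

pets_with_counts <- more_pets %>%
  group_by(type_of_animal) %>%
  # n() takes no arguments; subtracting 1 leaves out the animal itself.
  mutate(other_of_same = n() - 1)
pets_with_counts
```

Because mutate() keeps every row, the group size is repeated on each row of the group, and the result stays grouped until ungroup() is called.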
# A tibble: 1 × 1
  n_types
    <int>
1       3

The n_distinct() function
n_distinct() works like n(), but takes columns as inputs and counts the distinct values in them.
# A tibble: 3 × 2
  type_of_animal unique_ages
  <chr>                <int>
1 cat                      3
2 dog                      1
3 sheep/ram                1
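The calls behind the n_types and unique_ages outputs are not shown in the text. The sketch below shows one way n_distinct() can produce both, assuming more_pets as defined above; ages_by_type and type_total are illustrative names, and the original slides may have computed n_types differently.

```r
library(dplyr)

# Distinct ages within each type of animal.
ages_by_type <- more_pets %>%
  group_by(type_of_animal) %>%
  summarize(unique_ages = n_distinct(ages))
ages_by_type

# Distinct animal types in the whole (ungrouped) tibble.
type_total <- more_pets %>%
  summarize(n_types = n_distinct(type_of_animal))
type_total
```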

across() columns

across()
across() takes two arguments: .cols and .fns.
The .cols argument specifies the columns we’d like to apply our function, .fns, to.
We use across() within a mutate() or a summarize().
# A tibble: 7 × 4
  names   ages meals_per_day type_of_animal
  <chr>  <int>         <int> <chr>
1 Dude       6             2 dog
2 Pickle     5             3 cat
3 Kyle       3             3 cat
4 Nubs      11             3 cat
5 Marvin    11             1 sheep/ram
6 Figaro     3             2 cat
7 Slim       6             2 dog
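In the output above, ages and meals_per_day have changed from <dbl> to <int>. The call is not shown in the text; a mutate() with across() along these lines would produce that result, assuming more_pets as defined earlier. pets_int is an illustrative name.

```r
library(dplyr)

pets_int <- more_pets %>%
  # Apply as.integer() to both numeric columns in one step.
  mutate(across(.cols = c(ages, meals_per_day), .fns = as.integer))
pets_int
```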
Bad example
Better
Best
more_pets %>%
  summarize(
    across(.cols = c(ages, meals_per_day),
           .fns = c(mean = mean, standard_deviation = sd))
  )
# A tibble: 1 × 4
  ages_mean ages_standard_deviation meals_per_day_mean meals_per_day_standard_…¹
      <dbl>                   <dbl>              <dbl>                     <dbl>
1      6.43                    3.36               2.29                     0.756
# ℹ abbreviated name: ¹meals_per_day_standard_deviation
more_pets %>%
  group_by(type_of_animal) %>%
  summarize(
    across(.cols = c(ages, meals_per_day),
           .fns = c(mean = mean, standard_deviation = sd))
  )
# A tibble: 3 × 5
  type_of_animal ages_mean ages_standard_deviation meals_per_day_mean
  <chr>              <dbl>                   <dbl>              <dbl>
1 cat                  5.5                    3.79               2.75
2 dog                  6                      0                  2
3 sheep/ram           11                     NA                  1
# ℹ 1 more variable: meals_per_day_standard_deviation <dbl>
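For comparison, the ungrouped summary can also be written without across(), naming every column-and-function pair by hand. This is only a sketch of why across() scales better as columns are added; it is not taken from the original slides. The hand-written column names are chosen to match the names across() generates (column name, underscore, function name), and by_hand and with_across are illustrative names.

```r
library(dplyr)

# Written by hand: one line per column-and-function pair.
by_hand <- more_pets %>%
  summarize(
    ages_mean = mean(ages),
    ages_standard_deviation = sd(ages),
    meals_per_day_mean = mean(meals_per_day),
    meals_per_day_standard_deviation = sd(meals_per_day)
  )

# With across(): the columns and the functions are each listed once.
with_across <- more_pets %>%
  summarize(
    across(.cols = c(ages, meals_per_day),
           .fns = list(mean = mean, standard_deviation = sd))
  )

by_hand
with_across
```

Adding a third column to the hand-written version needs two more lines, while the across() version only needs one more name in .cols.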

across()
Consider the toy dataset. Summarizing its columns with mean gives:
# A tibble: 1 × 2
    dbh height
  <dbl>  <dbl>
1    NA     39
We get NA :-(
Ignoring the missing value gives the result we want:
# A tibble: 1 × 2
    dbh height
  <dbl>  <dbl>
1    12     39
Before, we had .fns = mean, but with lambda syntax we can specify additional arguments in the newfound parentheses.
We use ~ to specify lambda syntax.
With the ~, we can now put parentheses after the function and specify additional arguments.
# A tibble: 1 × 2
    dbh height
  <dbl>  <dbl>
1    12     39
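The toy dataset and the calls behind these outputs are not shown in the text. The sketch below uses a hypothetical stand-in, trees, whose values were picked only so that the column means match the outputs above; with_na and without_na are likewise illustrative names.

```r
library(dplyr)

# Hypothetical stand-in for the toy dataset (the original values are not shown).
trees <- tibble(dbh = c(10, 14, NA), height = c(36, 40, 41))

# A bare function name: any column containing a missing value summarizes to NA.
with_na <- trees %>%
  summarize(across(.cols = c(dbh, height), .fns = mean))
with_na

# Lambda syntax: ~ starts the lambda, .x stands for the current column,
# and extra arguments such as na.rm go inside the parentheses.
without_na <- trees %>%
  summarize(across(.cols = c(dbh, height), .fns = ~ mean(.x, na.rm = TRUE)))
without_na
```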
We put .x in the place where the columns go (before, this happened implicitly).
dplyr