tidyrtidyrRRecall the x and y tibbles from last time.
The dplyr package includes functions for two general types of joins:
x and y, andx and y.Think of how mutate() adds columns to a tibble, while filter() removes rows.
dplyr contains four mutating joins:
left_join(x, y) keeps all rows of x, but if a row in y does not match to x, an NA is assigned to that row in the new columns.right_join(x, y) is equivalent to left_join(y, x), except for column order.inner_join(x, y) keeps only the rows matched between x and y.full_join(x, y) keeps all rows of both x and y.dplyr contains two filtering joins:
semi_join(x, y) keeps all the rows in x that have a match in y.anti_join(x, y) removes all the rows in x that have a match in y.Note: Unlike mutating joins, filtering joins do not add any columns to the data.
We’ll create the x and y tibbles in R:
We’ll create the x and y tibbles in R:
# A tibble: 3 × 3
spp_code common_name latin_name
<chr> <chr> <chr>
1 ABBA balsum fir Abies balsamea
2 BEPA paper birch Betula papyrifera
3 PIST white pine <NA>
01:00
# A tibble: 4 × 3
spp_code common_name latin_name
<chr> <chr> <chr>
1 ABBA balsum fir Abies balsamea
2 BEPA paper birch Betula papyrifera
3 PIST white pine <NA>
4 LALA <NA> Larix laricina
01:00
# A tibble: 3 × 3
spp_code latin_name common_name
<chr> <chr> <chr>
1 ABBA Abies balsamea balsum fir
2 BEPA Betula papyrifera paper birch
3 LALA Larix laricina <NA>
01:00
# A tibble: 3 × 3
spp_code common_name latin_name
<chr> <chr> <chr>
1 ABBA balsum fir Abies balsamea
2 BEPA paper birch Betula papyrifera
3 LALA <NA> Larix laricina
01:00
# A tibble: 2 × 3
spp_code common_name latin_name
<chr> <chr> <chr>
1 ABBA balsum fir Abies balsamea
2 BEPA paper birch Betula papyrifera
01:00
# A tibble: 4 × 3
spp_code latin_name common_name
<chr> <chr> <chr>
1 ABBA Abies balsamea balsum fir
2 BEPA Betula papyrifera paper birch
3 LALA Larix laricina <NA>
4 PIST <NA> white pine
01:00
z:# A tibble: 3 × 3
species_code common_name latin_name
<chr> <chr> <chr>
1 ABBA balsum fir Abies balsamea
2 BEPA paper birch Betula papyrifera
3 PIST white pine <NA>
01:00
# A tibble: 2 × 2
spp_code latin_name
<chr> <chr>
1 ABBA Abies balsamea
2 BEPA Betula papyrifera
01:00
# A tibble: 1 × 2
species_code common_name
<chr> <chr>
1 PIST white pine
01:00
# A tibble: 1 × 2
spp_code latin_name
<chr> <chr>
1 LALA Larix laricina
01:00
tidyr packagetidyr package allows us to transform data from long to wide formats, and back.tidyr package is getting data in tidy format.“Tidy data” or “tidy format” is a formal concept of how we organize data for analyses, in particular, with tidy data:
“Tidy data” or “tidy format” is a formal concept of how we organize data for analyses, in particular, with tidy data:
Let’s take a look at a blog post from Julie Lowndes and Allison Horst about tidy data
tidyrConsider the face dataset from IFDAR:
face <- read_csv("../labs/datasets/FACE/FACE_aspen_core_growth.csv") %>%
select(Rep, Treat, Clone, ID = `ID #`,
contains(as.character(2001:2005)) & contains("Height"))
face# A tibble: 1,991 × 9
Rep Treat Clone ID `2001_Height` `2002_Height` `2003_Height`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 8 45 NA NA NA
2 1 1 216 44 547 622 715
3 1 1 8 43 273 275 305
4 1 1 216 42 526 619 720
5 1 1 216 54 328 341 364
6 1 1 271 55 543 590 634
7 1 1 271 56 450 502 587
8 1 1 8 57 217 227 256
9 1 1 259 58 158 155 NA
10 1 1 271 59 230 241 260
# ℹ 1,981 more rows
# ℹ 2 more variables: `2004_Height` <dbl>, `2005_Height` <dbl>
tidyrIs this data tidy? If so, what does a row represent? What does a column represent?
# A tibble: 1,991 × 9
Rep Treat Clone ID `2001_Height` `2002_Height` `2003_Height`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 8 45 NA NA NA
2 1 1 216 44 547 622 715
3 1 1 8 43 273 275 305
4 1 1 216 42 526 619 720
5 1 1 216 54 328 341 364
6 1 1 271 55 543 590 634
7 1 1 271 56 450 502 587
8 1 1 8 57 217 227 256
9 1 1 259 58 158 155 NA
10 1 1 271 59 230 241 260
# ℹ 1,981 more rows
# ℹ 2 more variables: `2004_Height` <dbl>, `2005_Height` <dbl>
# A tibble: 1,991 × 9
Rep Treat Clone ID `2001_Height` `2002_Height` `2003_Height`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 8 45 NA NA NA
2 1 1 216 44 547 622 715
3 1 1 8 43 273 275 305
4 1 1 216 42 526 619 720
5 1 1 216 54 328 341 364
6 1 1 271 55 543 590 634
7 1 1 271 56 450 502 587
8 1 1 8 57 217 227 256
9 1 1 259 58 158 155 NA
10 1 1 271 59 230 241 260
# ℹ 1,981 more rows
# ℹ 2 more variables: `2004_Height` <dbl>, `2005_Height` <dbl>
Each row should represent a measurement for a given rep, treatment, clone, and ID. But currently, we have multiple measurements on the same row.
Further, we have multiple columns for the “height” variable.
Not tidy!
Enter, pivot_longer().
pivot_longer(), a function from tidyr takes four key arguments:
.data: the data (tibble) you’d like to pivot,cols: the columns you’d like to pivot,names_to: the new column that will be created which takes the column names from cols as values, andvalues_to: the new column that will be created which takes the column values from cols as values.face?face?face?dplyr’s contains() function.face?# A tibble: 9,955 × 6
Rep Treat Clone ID Year_Type Height_cm
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 1 8 45 2001_Height NA
2 1 1 8 45 2002_Height NA
3 1 1 8 45 2003_Height NA
4 1 1 8 45 2004_Height NA
5 1 1 8 45 2005_Height NA
6 1 1 216 44 2001_Height 547
7 1 1 216 44 2002_Height 622
8 1 1 216 44 2003_Height 715
9 1 1 216 44 2004_Height 716
10 1 1 216 44 2005_Height 817
# ℹ 9,945 more rows
face_long dataset back to its original form.tidyr has an aptly named function, pivot_wider().pivot_wider():
.data: the tibble to widen,names_from: the column that contains values which will be assigned as the new column names,values_from: the column that containsface_wide <- face_long %>%
pivot_wider(names_from = "Year_Type", values_from = "Height_cm")
all_equal(face, face_wide)[1] TRUE
tidyr functionsunite() for pasteing column values together with specified seperators,separate_wider_*() family: for splitting columns into multiple new columns:
separate_wider_delim(): separate by delimiterseparate_wider_position(): separate by positionseparate_wider_regex(): separate by regular expressionunite(): examples# A tibble: 9,955 × 6
Rep Treat Clone ID Year_Type Height_cm
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 1 8 45 2001_Height NA
2 1 1 8 45 2002_Height NA
3 1 1 8 45 2003_Height NA
4 1 1 8 45 2004_Height NA
5 1 1 8 45 2005_Height NA
6 1 1 216 44 2001_Height 547
7 1 1 216 44 2002_Height 622
8 1 1 216 44 2003_Height 715
9 1 1 216 44 2004_Height 716
10 1 1 216 44 2005_Height 817
# ℹ 9,945 more rows
unite(): examples# A tibble: 9,955 × 4
Design ID Year_Type Height_cm
<chr> <dbl> <chr> <dbl>
1 1_1_8 45 2001_Height NA
2 1_1_8 45 2002_Height NA
3 1_1_8 45 2003_Height NA
4 1_1_8 45 2004_Height NA
5 1_1_8 45 2005_Height NA
6 1_1_216 44 2001_Height 547
7 1_1_216 44 2002_Height 622
8 1_1_216 44 2003_Height 715
9 1_1_216 44 2004_Height 716
10 1_1_216 44 2005_Height 817
# ℹ 9,945 more rows
unite(): examples# A tibble: 9,955 × 4
Design ID Year_Type Height_cm
<chr> <dbl> <chr> <dbl>
1 1.1.8 45 2001_Height NA
2 1.1.8 45 2002_Height NA
3 1.1.8 45 2003_Height NA
4 1.1.8 45 2004_Height NA
5 1.1.8 45 2005_Height NA
6 1.1.216 44 2001_Height 547
7 1.1.216 44 2002_Height 622
8 1.1.216 44 2003_Height 715
9 1.1.216 44 2004_Height 716
10 1.1.216 44 2005_Height 817
# ℹ 9,945 more rows
unite(): examples# A tibble: 9,955 × 4
Design ID Year_Type Height_cm
<chr> <dbl> <chr> <dbl>
1 1abc1abc8 45 2001_Height NA
2 1abc1abc8 45 2002_Height NA
3 1abc1abc8 45 2003_Height NA
4 1abc1abc8 45 2004_Height NA
5 1abc1abc8 45 2005_Height NA
6 1abc1abc216 44 2001_Height 547
7 1abc1abc216 44 2002_Height 622
8 1abc1abc216 44 2003_Height 715
9 1abc1abc216 44 2004_Height 716
10 1abc1abc216 44 2005_Height 817
# ℹ 9,945 more rows
separate_wider_delim() examples# A tibble: 9,955 × 4
Design ID Year_Type Height_cm
<chr> <dbl> <chr> <dbl>
1 1abc1abc8 45 2001_Height NA
2 1abc1abc8 45 2002_Height NA
3 1abc1abc8 45 2003_Height NA
4 1abc1abc8 45 2004_Height NA
5 1abc1abc8 45 2005_Height NA
6 1abc1abc216 44 2001_Height 547
7 1abc1abc216 44 2002_Height 622
8 1abc1abc216 44 2003_Height 715
9 1abc1abc216 44 2004_Height 716
10 1abc1abc216 44 2005_Height 817
# ℹ 9,945 more rows
separate_wider_delim() examplesface_long_abc %>%
separate_wider_delim(cols = "Design",
delim = "abc",
names = c("Rep", "Treat", "Clone"))# A tibble: 9,955 × 6
Rep Treat Clone ID Year_Type Height_cm
<chr> <chr> <chr> <dbl> <chr> <dbl>
1 1 1 8 45 2001_Height NA
2 1 1 8 45 2002_Height NA
3 1 1 8 45 2003_Height NA
4 1 1 8 45 2004_Height NA
5 1 1 8 45 2005_Height NA
6 1 1 216 44 2001_Height 547
7 1 1 216 44 2002_Height 622
8 1 1 216 44 2003_Height 715
9 1 1 216 44 2004_Height 716
10 1 1 216 44 2005_Height 817
# ℹ 9,945 more rows
separate_wider_delim() examples# A tibble: 9,955 × 6
Rep Treat Clone ID Year_Type Height_cm
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 1 8 45 2001_Height NA
2 1 1 8 45 2002_Height NA
3 1 1 8 45 2003_Height NA
4 1 1 8 45 2004_Height NA
5 1 1 8 45 2005_Height NA
6 1 1 216 44 2001_Height 547
7 1 1 216 44 2002_Height 622
8 1 1 216 44 2003_Height 715
9 1 1 216 44 2004_Height 716
10 1 1 216 44 2005_Height 817
# ℹ 9,945 more rows
# A tibble: 9,955 × 6
Rep Treat Clone ID Year Height_cm
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 1 8 45 2001 NA
2 1 1 8 45 2002 NA
3 1 1 8 45 2003 NA
4 1 1 8 45 2004 NA
5 1 1 8 45 2005 NA
6 1 1 216 44 2001 547
7 1 1 216 44 2002 622
8 1 1 216 44 2003 715
9 1 1 216 44 2004 716
10 1 1 216 44 2005 817
# ℹ 9,945 more rows
tidyr