tidyr
tidyr
R
Recall the x
and y
tibbles from last time.
The dplyr
package includes functions for two general types of joins:
x
and y
, andx
and y
.Think of how mutate()
adds columns to a tibble, while filter()
removes rows.
dplyr
contains four mutating joins:
left_join(x, y)
keeps all rows of x
, but if a row in y
does not match to x
, an NA
is assigned to that row in the new columns.right_join(x, y)
is equivalent to left_join(y, x)
, except for column order.inner_join(x, y)
keeps only the rows matched between x
and y
.full_join(x, y)
keeps all rows of both x
and y
.dplyr
contains two filtering joins:
semi_join(x, y)
keeps all the rows in x
that have a match in y
.anti_join(x, y)
removes all the rows in x
that have a match in y
.Note: Unlike mutating joins, filtering joins do not add any columns to the data.
We’ll create the x
and y
tibbles in R
:
We’ll create the x
and y
tibbles in R
:
# A tibble: 3 × 3
spp_code common_name latin_name
<chr> <chr> <chr>
1 ABBA balsum fir Abies balsamea
2 BEPA paper birch Betula papyrifera
3 PIST white pine <NA>
01:00
# A tibble: 4 × 3
spp_code common_name latin_name
<chr> <chr> <chr>
1 ABBA balsum fir Abies balsamea
2 BEPA paper birch Betula papyrifera
3 PIST white pine <NA>
4 LALA <NA> Larix laricina
01:00
# A tibble: 3 × 3
spp_code latin_name common_name
<chr> <chr> <chr>
1 ABBA Abies balsamea balsum fir
2 BEPA Betula papyrifera paper birch
3 LALA Larix laricina <NA>
01:00
# A tibble: 3 × 3
spp_code common_name latin_name
<chr> <chr> <chr>
1 ABBA balsum fir Abies balsamea
2 BEPA paper birch Betula papyrifera
3 LALA <NA> Larix laricina
01:00
# A tibble: 2 × 3
spp_code common_name latin_name
<chr> <chr> <chr>
1 ABBA balsum fir Abies balsamea
2 BEPA paper birch Betula papyrifera
01:00
# A tibble: 4 × 3
spp_code latin_name common_name
<chr> <chr> <chr>
1 ABBA Abies balsamea balsum fir
2 BEPA Betula papyrifera paper birch
3 LALA Larix laricina <NA>
4 PIST <NA> white pine
01:00
z
:# A tibble: 3 × 3
species_code common_name latin_name
<chr> <chr> <chr>
1 ABBA balsum fir Abies balsamea
2 BEPA paper birch Betula papyrifera
3 PIST white pine <NA>
01:00
# A tibble: 2 × 2
spp_code latin_name
<chr> <chr>
1 ABBA Abies balsamea
2 BEPA Betula papyrifera
01:00
# A tibble: 1 × 2
species_code common_name
<chr> <chr>
1 PIST white pine
01:00
# A tibble: 1 × 2
spp_code latin_name
<chr> <chr>
1 LALA Larix laricina
01:00
tidyr
packagetidyr
package allows us to transform data from long to wide formats, and back.tidyr
package is getting data in tidy format.“Tidy data” or “tidy format” is a formal concept of how we organize data for analyses, in particular, with tidy data:
“Tidy data” or “tidy format” is a formal concept of how we organize data for analyses, in particular, with tidy data:
Let’s take a look at a blog post from Julie Lowndes and Allison Horst about tidy data
tidyr
Consider the face
dataset from IFDAR:
face <- read_csv("../labs/datasets/FACE/FACE_aspen_core_growth.csv") %>%
select(Rep, Treat, Clone, ID = `ID #`,
contains(as.character(2001:2005)) & contains("Height"))
face
# A tibble: 1,991 × 9
Rep Treat Clone ID `2001_Height` `2002_Height` `2003_Height`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 8 45 NA NA NA
2 1 1 216 44 547 622 715
3 1 1 8 43 273 275 305
4 1 1 216 42 526 619 720
5 1 1 216 54 328 341 364
6 1 1 271 55 543 590 634
7 1 1 271 56 450 502 587
8 1 1 8 57 217 227 256
9 1 1 259 58 158 155 NA
10 1 1 271 59 230 241 260
# ℹ 1,981 more rows
# ℹ 2 more variables: `2004_Height` <dbl>, `2005_Height` <dbl>
tidyr
Is this data tidy? If so, what does a row represent? What does a column represent?
# A tibble: 1,991 × 9
Rep Treat Clone ID `2001_Height` `2002_Height` `2003_Height`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 8 45 NA NA NA
2 1 1 216 44 547 622 715
3 1 1 8 43 273 275 305
4 1 1 216 42 526 619 720
5 1 1 216 54 328 341 364
6 1 1 271 55 543 590 634
7 1 1 271 56 450 502 587
8 1 1 8 57 217 227 256
9 1 1 259 58 158 155 NA
10 1 1 271 59 230 241 260
# ℹ 1,981 more rows
# ℹ 2 more variables: `2004_Height` <dbl>, `2005_Height` <dbl>
# A tibble: 1,991 × 9
Rep Treat Clone ID `2001_Height` `2002_Height` `2003_Height`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 8 45 NA NA NA
2 1 1 216 44 547 622 715
3 1 1 8 43 273 275 305
4 1 1 216 42 526 619 720
5 1 1 216 54 328 341 364
6 1 1 271 55 543 590 634
7 1 1 271 56 450 502 587
8 1 1 8 57 217 227 256
9 1 1 259 58 158 155 NA
10 1 1 271 59 230 241 260
# ℹ 1,981 more rows
# ℹ 2 more variables: `2004_Height` <dbl>, `2005_Height` <dbl>
Each row should represent a measurement for a given rep, treatment, clone, and ID. But currently, we have multiple measurements on the same row.
Further, we have multiple columns for the “height” variable.
Not tidy!
Enter, pivot_longer()
.
pivot_longer()
, a function from tidyr
takes four key arguments:
.data
: the data (tibble) you’d like to pivot,cols
: the columns you’d like to pivot,names_to
: the new column that will be created which takes the column names from cols
as values, andvalues_to
: the new column that will be created which takes the column values from cols
as values.face
?face
?face
?dplyr
’s contains()
function.face
?# A tibble: 9,955 × 6
Rep Treat Clone ID Year_Type Height_cm
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 1 8 45 2001_Height NA
2 1 1 8 45 2002_Height NA
3 1 1 8 45 2003_Height NA
4 1 1 8 45 2004_Height NA
5 1 1 8 45 2005_Height NA
6 1 1 216 44 2001_Height 547
7 1 1 216 44 2002_Height 622
8 1 1 216 44 2003_Height 715
9 1 1 216 44 2004_Height 716
10 1 1 216 44 2005_Height 817
# ℹ 9,945 more rows
face_long
dataset back to its original form.tidyr
has an aptly named function, pivot_wider()
.pivot_wider()
:
.data
: the tibble to widen,names_from
: the column that contains values which will be assigned as the new column names,values_from
: the column that containsface_wide <- face_long %>%
pivot_wider(names_from = "Year_Type", values_from = "Height_cm")
all_equal(face, face_wide)
[1] TRUE
tidyr
functionsunite()
for pasteing column values together with specified seperators,separate_wider_*()
family: for splitting columns into multiple new columns:
separate_wider_delim()
: separate by delimiterseparate_wider_position()
: separate by positionseparate_wider_regex()
: separate by regular expressionunite()
: examples# A tibble: 9,955 × 6
Rep Treat Clone ID Year_Type Height_cm
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 1 8 45 2001_Height NA
2 1 1 8 45 2002_Height NA
3 1 1 8 45 2003_Height NA
4 1 1 8 45 2004_Height NA
5 1 1 8 45 2005_Height NA
6 1 1 216 44 2001_Height 547
7 1 1 216 44 2002_Height 622
8 1 1 216 44 2003_Height 715
9 1 1 216 44 2004_Height 716
10 1 1 216 44 2005_Height 817
# ℹ 9,945 more rows
unite()
: examples# A tibble: 9,955 × 4
Design ID Year_Type Height_cm
<chr> <dbl> <chr> <dbl>
1 1_1_8 45 2001_Height NA
2 1_1_8 45 2002_Height NA
3 1_1_8 45 2003_Height NA
4 1_1_8 45 2004_Height NA
5 1_1_8 45 2005_Height NA
6 1_1_216 44 2001_Height 547
7 1_1_216 44 2002_Height 622
8 1_1_216 44 2003_Height 715
9 1_1_216 44 2004_Height 716
10 1_1_216 44 2005_Height 817
# ℹ 9,945 more rows
unite()
: examples# A tibble: 9,955 × 4
Design ID Year_Type Height_cm
<chr> <dbl> <chr> <dbl>
1 1.1.8 45 2001_Height NA
2 1.1.8 45 2002_Height NA
3 1.1.8 45 2003_Height NA
4 1.1.8 45 2004_Height NA
5 1.1.8 45 2005_Height NA
6 1.1.216 44 2001_Height 547
7 1.1.216 44 2002_Height 622
8 1.1.216 44 2003_Height 715
9 1.1.216 44 2004_Height 716
10 1.1.216 44 2005_Height 817
# ℹ 9,945 more rows
unite()
: examples# A tibble: 9,955 × 4
Design ID Year_Type Height_cm
<chr> <dbl> <chr> <dbl>
1 1abc1abc8 45 2001_Height NA
2 1abc1abc8 45 2002_Height NA
3 1abc1abc8 45 2003_Height NA
4 1abc1abc8 45 2004_Height NA
5 1abc1abc8 45 2005_Height NA
6 1abc1abc216 44 2001_Height 547
7 1abc1abc216 44 2002_Height 622
8 1abc1abc216 44 2003_Height 715
9 1abc1abc216 44 2004_Height 716
10 1abc1abc216 44 2005_Height 817
# ℹ 9,945 more rows
separate_wider_delim()
examples# A tibble: 9,955 × 4
Design ID Year_Type Height_cm
<chr> <dbl> <chr> <dbl>
1 1abc1abc8 45 2001_Height NA
2 1abc1abc8 45 2002_Height NA
3 1abc1abc8 45 2003_Height NA
4 1abc1abc8 45 2004_Height NA
5 1abc1abc8 45 2005_Height NA
6 1abc1abc216 44 2001_Height 547
7 1abc1abc216 44 2002_Height 622
8 1abc1abc216 44 2003_Height 715
9 1abc1abc216 44 2004_Height 716
10 1abc1abc216 44 2005_Height 817
# ℹ 9,945 more rows
separate_wider_delim()
examplesface_long_abc %>%
separate_wider_delim(cols = "Design",
delim = "abc",
names = c("Rep", "Treat", "Clone"))
# A tibble: 9,955 × 6
Rep Treat Clone ID Year_Type Height_cm
<chr> <chr> <chr> <dbl> <chr> <dbl>
1 1 1 8 45 2001_Height NA
2 1 1 8 45 2002_Height NA
3 1 1 8 45 2003_Height NA
4 1 1 8 45 2004_Height NA
5 1 1 8 45 2005_Height NA
6 1 1 216 44 2001_Height 547
7 1 1 216 44 2002_Height 622
8 1 1 216 44 2003_Height 715
9 1 1 216 44 2004_Height 716
10 1 1 216 44 2005_Height 817
# ℹ 9,945 more rows
separate_wider_delim()
examples# A tibble: 9,955 × 6
Rep Treat Clone ID Year_Type Height_cm
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 1 8 45 2001_Height NA
2 1 1 8 45 2002_Height NA
3 1 1 8 45 2003_Height NA
4 1 1 8 45 2004_Height NA
5 1 1 8 45 2005_Height NA
6 1 1 216 44 2001_Height 547
7 1 1 216 44 2002_Height 622
8 1 1 216 44 2003_Height 715
9 1 1 216 44 2004_Height 716
10 1 1 216 44 2005_Height 817
# ℹ 9,945 more rows
# A tibble: 9,955 × 6
Rep Treat Clone ID Year Height_cm
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 1 8 45 2001 NA
2 1 1 8 45 2002 NA
3 1 1 8 45 2003 NA
4 1 1 8 45 2004 NA
5 1 1 8 45 2005 NA
6 1 1 216 44 2001 547
7 1 1 216 44 2002 622
8 1 1 216 44 2003 715
9 1 1 216 44 2004 716
10 1 1 216 44 2005 817
# ℹ 9,945 more rows
tidyr