1 Data-wrangling and tidying for tables

Making tables is often when we stretch our dplyr skills. Here are some additional more advanced examples that you may find helpful for today’s assignment!

We’ll use the pnwflights14 package to practice our dplyr skills. We need to download the package from github using devtools.

# once per machine
install.packages("devtools")
devtools::install_github("ismayc/pnwflights14")

Now, we need to load the flights dataset from the pnwflights14 package.

# once per work session
library(pnwflights14)
data("flights", package = "pnwflights14")

2 More Fun dplyr Tricks

2.1 dplyr::select

Recall that we use select to specify which columns in a dataframe you’d like to keep by name:

# keep these 2 cols
mini_flights <- flights %>% 
  select(carrier, flight)
glimpse(mini_flights)
## Rows: 162,049
## Columns: 2
## $ carrier <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", "UA", "US", "DL", "AA", "AS", "AS", "F9", "AA", "AS", "F9", "UA…
## $ flight  <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569, 649, 1634, 1094, 81, 200, 796, 2240, 426, 144, 1573, 344, 1136…
# keep first five cols
first_five <- flights %>% 
  select(year, month, day, dep_time, dep_delay)
glimpse(first_five)
## Rows: 162,049
## Columns: 5
## $ year      <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2…
## $ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557, 557, 558, 559, 600, 600, 602, 602, 603, 603, 606, 606, 614, …
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2, -1, 0, -10, -3, -3, -2, -3, -4, 6, -1, 2, -2, -6, -2, 2, 13,…
# alternatively, specify range
first_five <- flights %>% 
  select(year:dep_delay)
glimpse(first_five)
## Rows: 162,049
## Columns: 5
## $ year      <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2…
## $ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557, 557, 558, 559, 600, 600, 602, 602, 603, 603, 606, 606, 614, …
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2, -1, 0, -10, -3, -3, -2, -3, -4, 6, -1, 2, -2, -6, -2, 2, 13,…

We can also choose the columns we want by negation, that is, you can specify which columns to drop instead of keep. This way, all variables not listed are kept.

# we can also use negation
all_but_year <- flights %>% 
  select(-year)
glimpse(all_but_year)
## Rows: 162,049
## Columns: 15
## $ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557, 557, 558, 559, 600, 600, 602, 602, 603, 603, 606, 606, 614, …
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2, -1, 0, -10, -3, -3, -2, -3, -4, 6, -1, 2, -2, -6, -2, 2, 13,…
## $ arr_time  <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907, 837, 1134, 825, 801, 916, 1151, 842, 943, 1204, 755, 1202, …
## $ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, -25, -2, -9, -19, -8, 5, 7, -17, -2, -19, 3, -2, -12, -30, -7…
## $ carrier   <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", "UA", "US", "DL", "AA", "AS", "AS", "F9", "AA", "AS", "F9", "…
## $ tailnum   <chr> "N508AS", "N195UW", "N37422", "N547UW", "N762AS", "N806DN", "N14219", "N813UA", "N75433", "N574UA", "N36476", "N548UW", "…
## $ flight    <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569, 649, 1634, 1094, 81, 200, 796, 2240, 426, 144, 1573, 344, 11…
## $ origin    <chr> "PDX", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "PDX", "SEA", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "SEA", "PDX", "S…
## $ dest      <chr> "ANC", "CLT", "IAH", "CLT", "ANC", "DTW", "ORD", "IAH", "DEN", "EWR", "DEN", "PHX", "SLC", "DFW", "ANC", "SJC", "DEN", "O…
## $ air_time  <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, 82, 184, 188, 100, 125, 206, 125, 129, 203, 94, 203, 98, 87, …
## $ distance  <dbl> 1542, 2279, 1825, 2282, 1448, 1927, 1721, 1825, 1024, 2402, 991, 1009, 689, 1660, 1448, 697, 991, 1721, 954, 1024, 1739, …
## $ hour      <dbl> 0, 0, 0, 0, 0, 0, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6…
## $ minute    <dbl> 1, 4, 8, 28, 34, 37, 46, 26, 27, 36, 41, 49, 50, 57, 57, 58, 59, 0, 0, 2, 2, 3, 3, 6, 6, 14, 17, 18, 19, 20, 22, 23, 24, …

dplyr::select comes with several other helper functions…

depart <- flights %>%
  select(starts_with("dep_"))
glimpse(depart)
## Rows: 162,049
## Columns: 2
## $ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557, 557, 558, 559, 600, 600, 602, 602, 603, 603, 606, 606, 614, …
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2, -1, 0, -10, -3, -3, -2, -3, -4, 6, -1, 2, -2, -6, -2, 2, 13,…
times <- flights %>%
  select(contains("time"))
glimpse(times)
## Rows: 162,049
## Columns: 3
## $ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557, 557, 558, 559, 600, 600, 602, 602, 603, 603, 606, 606, 614, 6…
## $ arr_time <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907, 837, 1134, 825, 801, 916, 1151, 842, 943, 1204, 755, 1202, 8…
## $ air_time <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, 82, 184, 188, 100, 125, 206, 125, 129, 203, 94, 203, 98, 87, 1…
# note that we are not actually saving the new dataframe here
flights %>%
  select(-contains("time")) %>% head()
## # A tibble: 6 × 13
##    year month   day dep_delay arr_delay carrier tailnum flight origin dest  distance  hour minute
##   <int> <int> <int>     <dbl>     <dbl> <chr>   <chr>    <int> <chr>  <chr>    <dbl> <dbl>  <dbl>
## 1  2014     1     1        96        70 AS      N508AS     145 PDX    ANC       1542     0      1
## 2  2014     1     1        -6       -23 US      N195UW    1830 SEA    CLT       2279     0      4
## 3  2014     1     1        13        -4 UA      N37422    1609 PDX    IAH       1825     0      8
## 4  2014     1     1        -2       -23 US      N547UW     466 PDX    CLT       2282     0     28
## 5  2014     1     1        44        43 AS      N762AS     121 SEA    ANC       1448     0     34
## 6  2014     1     1        82        88 DL      N806DN    1823 SEA    DTW       1927     0     37
delays <- flights %>%
  select(ends_with("delay"))
glimpse(delays)
## Rows: 162,049
## Columns: 2
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2, -1, 0, -10, -3, -3, -2, -3, -4, 6, -1, 2, -2, -6, -2, 2, 13,…
## $ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, -25, -2, -9, -19, -8, 5, 7, -17, -2, -19, 3, -2, -12, -30, -7…

One of my favorite select helper functions is everything(), which allows you to use select to keep all your variables, but easily rearrange the columns without having to list all the variables to keep/drop.

new_order <- flights %>% 
  select(origin, dest, everything())
head(new_order)
## # A tibble: 6 × 16
##   origin dest   year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight air_time distance  hour minute
##   <chr>  <chr> <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int>    <dbl>    <dbl> <dbl>  <dbl>
## 1 PDX    ANC    2014     1     1        1        96      235        70 AS      N508AS     145      194     1542     0      1
## 2 SEA    CLT    2014     1     1        4        -6      738       -23 US      N195UW    1830      252     2279     0      4
## 3 PDX    IAH    2014     1     1        8        13      548        -4 UA      N37422    1609      201     1825     0      8
## 4 PDX    CLT    2014     1     1       28        -2      800       -23 US      N547UW     466      251     2282     0     28
## 5 SEA    ANC    2014     1     1       34        44      325        43 AS      N762AS     121      201     1448     0     34
## 6 SEA    DTW    2014     1     1       37        82      747        88 DL      N806DN    1823      224     1927     0     37
# with negation
new_order2 <- flights %>% 
  select(origin, dest, everything(), -year)
head(new_order2)
## # A tibble: 6 × 15
##   origin dest  month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight air_time distance  hour minute
##   <chr>  <chr> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int>    <dbl>    <dbl> <dbl>  <dbl>
## 1 PDX    ANC       1     1        1        96      235        70 AS      N508AS     145      194     1542     0      1
## 2 SEA    CLT       1     1        4        -6      738       -23 US      N195UW    1830      252     2279     0      4
## 3 PDX    IAH       1     1        8        13      548        -4 UA      N37422    1609      201     1825     0      8
## 4 PDX    CLT       1     1       28        -2      800       -23 US      N547UW     466      251     2282     0     28
## 5 SEA    ANC       1     1       34        44      325        43 AS      N762AS     121      201     1448     0     34
## 6 SEA    DTW       1     1       37        82      747        88 DL      N806DN    1823      224     1927     0     37

We can also rename variables within select.

flights2 <- flights %>%
  select(tail_num = tailnum, everything())
head(flights2)
## # A tibble: 6 × 16
##   tail_num  year month   day dep_time dep_delay arr_time arr_delay carrier flight origin dest  air_time distance  hour minute
##   <chr>    <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>    <int> <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
## 1 N508AS    2014     1     1        1        96      235        70 AS         145 PDX    ANC        194     1542     0      1
## 2 N195UW    2014     1     1        4        -6      738       -23 US        1830 SEA    CLT        252     2279     0      4
## 3 N37422    2014     1     1        8        13      548        -4 UA        1609 PDX    IAH        201     1825     0      8
## 4 N547UW    2014     1     1       28        -2      800       -23 US         466 PDX    CLT        251     2282     0     28
## 5 N762AS    2014     1     1       34        44      325        43 AS         121 SEA    ANC        201     1448     0     34
## 6 N806DN    2014     1     1       37        82      747        88 DL        1823 SEA    DTW        224     1927     0     37

If you don’t want to move the renamed variables within your dataframe, you can use the rename function.

flights3 <- flights %>%
  rename(tail_num = tailnum)
glimpse(flights3)
## Rows: 162,049
## Columns: 16
## $ year      <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2…
## $ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557, 557, 558, 559, 600, 600, 602, 602, 603, 603, 606, 606, 614, …
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2, -1, 0, -10, -3, -3, -2, -3, -4, 6, -1, 2, -2, -6, -2, 2, 13,…
## $ arr_time  <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907, 837, 1134, 825, 801, 916, 1151, 842, 943, 1204, 755, 1202, …
## $ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, -25, -2, -9, -19, -8, 5, 7, -17, -2, -19, 3, -2, -12, -30, -7…
## $ carrier   <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", "UA", "US", "DL", "AA", "AS", "AS", "F9", "AA", "AS", "F9", "…
## $ tail_num  <chr> "N508AS", "N195UW", "N37422", "N547UW", "N762AS", "N806DN", "N14219", "N813UA", "N75433", "N574UA", "N36476", "N548UW", "…
## $ flight    <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569, 649, 1634, 1094, 81, 200, 796, 2240, 426, 144, 1573, 344, 11…
## $ origin    <chr> "PDX", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "PDX", "SEA", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "SEA", "PDX", "S…
## $ dest      <chr> "ANC", "CLT", "IAH", "CLT", "ANC", "DTW", "ORD", "IAH", "DEN", "EWR", "DEN", "PHX", "SLC", "DFW", "ANC", "SJC", "DEN", "O…
## $ air_time  <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, 82, 184, 188, 100, 125, 206, 125, 129, 203, 94, 203, 98, 87, …
## $ distance  <dbl> 1542, 2279, 1825, 2282, 1448, 1927, 1721, 1825, 1024, 2402, 991, 1009, 689, 1660, 1448, 697, 991, 1721, 954, 1024, 1739, …
## $ hour      <dbl> 0, 0, 0, 0, 0, 0, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6…
## $ minute    <dbl> 1, 4, 8, 28, 34, 37, 46, 26, 27, 36, 41, 49, 50, 57, 57, 58, 59, 0, 0, 2, 2, 3, 3, 6, 6, 14, 17, 18, 19, 20, 22, 23, 24, …

2.2 dplyr::filter

As we have previously seen, filter has many flexible uses:

# flights taking off from PDX
pdx <- flights %>% 
  filter(origin == "PDX")
head(pdx)
## # A tibble: 6 × 16
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest  air_time distance  hour minute
##   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int> <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
## 1  2014     1     1        1        96      235        70 AS      N508AS     145 PDX    ANC        194     1542     0      1
## 2  2014     1     1        8        13      548        -4 UA      N37422    1609 PDX    IAH        201     1825     0      8
## 3  2014     1     1       28        -2      800       -23 US      N547UW     466 PDX    CLT        251     2282     0     28
## 4  2014     1     1      526        -4     1148        15 UA      N813UA     229 PDX    IAH        217     1825     5     26
## 5  2014     1     1      541         1      911         4 UA      N36476    1569 PDX    DEN        130      991     5     41
## 6  2014     1     1      549        24      907        12 US      N548UW     649 PDX    PHX        122     1009     5     49
# january flights from PDX
pdx_jan <- flights %>% 
  filter(origin == "PDX", month == 1) # the comma is an "and"
head(pdx_jan)
## # A tibble: 6 × 16
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest  air_time distance  hour minute
##   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int> <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
## 1  2014     1     1        1        96      235        70 AS      N508AS     145 PDX    ANC        194     1542     0      1
## 2  2014     1     1        8        13      548        -4 UA      N37422    1609 PDX    IAH        201     1825     0      8
## 3  2014     1     1       28        -2      800       -23 US      N547UW     466 PDX    CLT        251     2282     0     28
## 4  2014     1     1      526        -4     1148        15 UA      N813UA     229 PDX    IAH        217     1825     5     26
## 5  2014     1     1      541         1      911         4 UA      N36476    1569 PDX    DEN        130      991     5     41
## 6  2014     1     1      549        24      907        12 US      N548UW     649 PDX    PHX        122     1009     5     49
# flights to ATL (Atlanta) or BNA (Nashville)
to_south <- flights %>% 
  filter(dest == "ATL" | dest == "BNA") %>% # | is "or"
  select(origin, dest, everything())
head(to_south)
## # A tibble: 6 × 16
##   origin dest   year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight air_time distance  hour minute
##   <chr>  <chr> <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int>    <dbl>    <dbl> <dbl>  <dbl>
## 1 SEA    ATL    2014     1     1      624        -6     1401        -6 DL      N617DL     968      235     2182     6     24
## 2 SEA    ATL    2014     1     1      802        -3     1533       -17 AS      N532AS     742      249     2182     8      2
## 3 SEA    ATL    2014     1     1      824        -1     1546       -14 DL      N633DL     128      234     2182     8     24
## 4 PDX    ATL    2014     1     1      944        -6     1727        -8 AS      N548AS     752      252     2172     9     44
## 5 PDX    ATL    2014     1     1     1054        94     1807        84 DL      N377DA    1502      237     2172    10     54
## 6 SEA    ATL    2014     1     1     1158         6     1915       -14 DL      N6712B    1962      234     2182    11     58
# flights from PDX to ATL (Atlanta) or BNA (Nashville)
pdx_to_south <- flights %>% 
  filter(origin == "PDX", dest == "ATL" | dest == "BNA") %>% # | is "or"
  select(origin, dest, everything())
head(pdx_to_south)
## # A tibble: 6 × 16
##   origin dest   year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight air_time distance  hour minute
##   <chr>  <chr> <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int>    <dbl>    <dbl> <dbl>  <dbl>
## 1 PDX    ATL    2014     1     1      944        -6     1727        -8 AS      N548AS     752      252     2172     9     44
## 2 PDX    ATL    2014     1     1     1054        94     1807        84 DL      N377DA    1502      237     2172    10     54
## 3 PDX    ATL    2014     1     1     1323        -2     2038       -15 DL      N393DA     773      235     2172    13     23
## 4 PDX    ATL    2014     1     1     2253         8      611         4 DL      N371DA     503      240     2172    22     53
## 5 PDX    ATL    2014     1     2      627        -3     1350        -7 DL      N3746H    1156      244     2172     6     27
## 6 PDX    ATL    2014     1     2      918        -2     1643        -2 DL      N3756     1940      249     2172     9     18
# alternatively, using group membership
south_dests <- c("ATL", "BNA")
pdx_to_south2 <- flights %>% 
  filter(origin == "PDX", dest %in% south_dests) %>% 
  select(origin, dest, everything())
head(pdx_to_south2)
## # A tibble: 6 × 16
##   origin dest   year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight air_time distance  hour minute
##   <chr>  <chr> <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int>    <dbl>    <dbl> <dbl>  <dbl>
## 1 PDX    ATL    2014     1     1      944        -6     1727        -8 AS      N548AS     752      252     2172     9     44
## 2 PDX    ATL    2014     1     1     1054        94     1807        84 DL      N377DA    1502      237     2172    10     54
## 3 PDX    ATL    2014     1     1     1323        -2     2038       -15 DL      N393DA     773      235     2172    13     23
## 4 PDX    ATL    2014     1     1     2253         8      611         4 DL      N371DA     503      240     2172    22     53
## 5 PDX    ATL    2014     1     2      627        -3     1350        -7 DL      N3746H    1156      244     2172     6     27
## 6 PDX    ATL    2014     1     2      918        -2     1643        -2 DL      N3756     1940      249     2172     9     18
# flights delayed by 1 hour or more
delay_1plus <- flights %>%
  filter(dep_delay >= 60)
head(delay_1plus)
## # A tibble: 6 × 16
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest  air_time distance  hour minute
##   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int> <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
## 1  2014     1     1        1        96      235        70 AS      N508AS     145 PDX    ANC        194     1542     0      1
## 2  2014     1     1       37        82      747        88 DL      N806DN    1823 SEA    DTW        224     1927     0     37
## 3  2014     1     1      346       227      936       219 UA      N14219    1481 SEA    ORD        202     1721     3     46
## 4  2014     1     1      650        90     1037        91 US      N626AW     460 SEA    PHX        141     1107     6     50
## 5  2014     1     1      959       164     1137       157 AS      N534AS     805 SEA    SMF         77      605     9     59
## 6  2014     1     1     1008        68     1242        64 AS      N788AS     456 SEA    LAX        129      954    10      8
# flights delayed by 1 hour, but not more than 2 hours
delay_1hr <- flights %>%
  filter(dep_delay >= 60, dep_delay < 120)
head(delay_1hr)
## # A tibble: 6 × 16
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest  air_time distance  hour minute
##   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int> <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
## 1  2014     1     1        1        96      235        70 AS      N508AS     145 PDX    ANC        194     1542     0      1
## 2  2014     1     1       37        82      747        88 DL      N806DN    1823 SEA    DTW        224     1927     0     37
## 3  2014     1     1      650        90     1037        91 US      N626AW     460 SEA    PHX        141     1107     6     50
## 4  2014     1     1     1008        68     1242        64 AS      N788AS     456 SEA    LAX        129      954    10      8
## 5  2014     1     1     1014        75     1613        81 UA      N37408    1444 SEA    ORD        201     1721    10     14
## 6  2014     1     1     1036        81     1408        63 OO      N218AG    3466 PDX    TUS        134     1119    10     36
range(delay_1hr$dep_delay, na.rm = TRUE)
## [1]  60 119
# even more efficient using between (always inclusive)
delay_bwn <- flights %>%
  filter(between(dep_delay, 60, 119))
head(delay_bwn)
## # A tibble: 6 × 16
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest  air_time distance  hour minute
##   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int> <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
## 1  2014     1     1        1        96      235        70 AS      N508AS     145 PDX    ANC        194     1542     0      1
## 2  2014     1     1       37        82      747        88 DL      N806DN    1823 SEA    DTW        224     1927     0     37
## 3  2014     1     1      650        90     1037        91 US      N626AW     460 SEA    PHX        141     1107     6     50
## 4  2014     1     1     1008        68     1242        64 AS      N788AS     456 SEA    LAX        129      954    10      8
## 5  2014     1     1     1014        75     1613        81 UA      N37408    1444 SEA    ORD        201     1721    10     14
## 6  2014     1     1     1036        81     1408        63 OO      N218AG    3466 PDX    TUS        134     1119    10     36
range(delay_bwn$dep_delay, na.rm = TRUE)
## [1]  60 119

2.3 dplyr::arrange

# default is ascending order
flights %>% 
  arrange(year, month, day) %>% head(n=20)
## # A tibble: 20 × 16
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest  air_time distance  hour minute
##    <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int> <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
##  1  2014     1     1        1        96      235        70 AS      N508AS     145 PDX    ANC        194     1542     0      1
##  2  2014     1     1        4        -6      738       -23 US      N195UW    1830 SEA    CLT        252     2279     0      4
##  3  2014     1     1        8        13      548        -4 UA      N37422    1609 PDX    IAH        201     1825     0      8
##  4  2014     1     1       28        -2      800       -23 US      N547UW     466 PDX    CLT        251     2282     0     28
##  5  2014     1     1       34        44      325        43 AS      N762AS     121 SEA    ANC        201     1448     0     34
##  6  2014     1     1       37        82      747        88 DL      N806DN    1823 SEA    DTW        224     1927     0     37
##  7  2014     1     1      346       227      936       219 UA      N14219    1481 SEA    ORD        202     1721     3     46
##  8  2014     1     1      526        -4     1148        15 UA      N813UA     229 PDX    IAH        217     1825     5     26
##  9  2014     1     1      527         7      917        24 UA      N75433    1576 SEA    DEN        136     1024     5     27
## 10  2014     1     1      536         1     1334        -6 UA      N574UA     478 SEA    EWR        268     2402     5     36
## 11  2014     1     1      541         1      911         4 UA      N36476    1569 PDX    DEN        130      991     5     41
## 12  2014     1     1      549        24      907        12 US      N548UW     649 PDX    PHX        122     1009     5     49
## 13  2014     1     1      550         0      837       -12 DL      N660DL    1634 SEA    SLC         82      689     5     50
## 14  2014     1     1      557        -3     1134       -16 AA      N3JLAA    1094 SEA    DFW        184     1660     5     57
## 15  2014     1     1      557        -3      825       -25 AS      N562AS      81 SEA    ANC        188     1448     5     57
## 16  2014     1     1      558        -2      801        -2 AS      N402AS     200 SEA    SJC        100      697     5     58
## 17  2014     1     1      559        -1      916        -9 F9      N210FR     796 PDX    DEN        125      991     5     59
## 18  2014     1     1      600         0     1151       -19 AA      N3JFAA    2240 SEA    ORD        206     1721     6      0
## 19  2014     1     1      600       -10      842        -8 AS      N786AS     426 SEA    LAX        125      954     6      0
## 20  2014     1     1      602        -3      943         5 F9      N201FR     144 SEA    DEN        129     1024     6      2
# descending order
flights %>% 
  arrange(desc(year), desc(month), desc(day)) %>% head(n=20)
## # A tibble: 20 × 16
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest  air_time distance  hour minute
##    <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int> <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
##  1  2014    12    31        2        12      601        31 AA      N3JKAA    1230 SEA    DFW        210     1660     0      2
##  2  2014    12    31       27        -3      623         3 AA      N3EWAA    1431 SEA    ORD        211     1721     0     27
##  3  2014    12    31       39        14      324         4 AS      N762AS     135 SEA    ANC        207     1448     0     39
##  4  2014    12    31       40         0      549         0 DL      N757AT    2440 SEA    MSP        174     1399     0     40
##  5  2014    12    31       52        -8      917       -21 AA      N3JFAA     371 SEA    MIA        306     2724     0     52
##  6  2014    12    31       54         4      621        17 DL      N128DL    1670 PDX    MSP        174     1426     0     54
##  7  2014    12    31       56        61      848        80 DL      N655DL     929 SEA    ATL        249     2182     0     56
##  8  2014    12    31      512        -3      904         4 US      N653AW     480 SEA    PHX        154     1107     5     12
##  9  2014    12    31      515        -5      855         5 US      N580UW     425 PDX    PHX        144     1009     5     15
## 10  2014    12    31      534         4      859         7 UA      N34460    1075 PDX    DEN        125      991     5     34
## 11  2014    12    31      546         1      916        -4 WN      N8323C     757 PDX    DEN        126      991     5     46
## 12  2014    12    31      548        -2     1351       -13 UA      N461UA     665 PDX    EWR        287     2434     5     48
## 13  2014    12    31      549         4     1208        12 UA      N68807    1457 SEA    IAH        223     1874     5     49
## 14  2014    12    31      550         0      922         2 WN      N797MX    2121 PDX    PHX        138     1009     5     50
## 15  2014    12    31      551        -4     1202        12 AA      N3HXAA    1094 SEA    DFW        211     1660     5     51
## 16  2014    12    31      551        -9      744       -15 AS      N570AS     342 SEA    OAK         91      671     5     51
## 17  2014    12    31      555       -10      824        -1 AS      N548AS     602 SEA    LAS        124      867     5     55
## 18  2014    12    31      558        -2      849         0 DL      N668DN    1831 PDX    SLC         85      630     5     58
## 19  2014    12    31      558        -2     1149         4 AA      N436AA    1534 PDX    DFW        211     1616     5     58
## 20  2014    12    31      558        -2      738        -4 AS      N585AS     406 PDX    SJC         80      569     5     58

2.4 dplyr::distinct

# unique origin-dest combinations
flights %>% 
  select(origin, dest) %>% 
  distinct %>% head(n=50)
## # A tibble: 50 × 2
##    origin dest 
##    <chr>  <chr>
##  1 PDX    ANC  
##  2 SEA    CLT  
##  3 PDX    IAH  
##  4 PDX    CLT  
##  5 SEA    ANC  
##  6 SEA    DTW  
##  7 SEA    ORD  
##  8 SEA    DEN  
##  9 SEA    EWR  
## 10 PDX    DEN  
## # ℹ 40 more rows
# all unique destinations from PDX (there are 49, so we'll just list the first few)
from_pdx <- flights %>% 
  filter(origin == "PDX") %>% 
  select(origin, dest) %>%
  distinct(dest)
head(from_pdx)
## # A tibble: 6 × 1
##   dest 
##   <chr>
## 1 ANC  
## 2 IAH  
## 3 CLT  
## 4 DEN  
## 5 PHX  
## 6 ORD

2.5 dplyr::mutate

mutate is used to transform existing variables or create new ones; here, we are using it to create an indicator variable that identifies flights that were entirely on-time.

# add total delay variable
flights %>%
  mutate(tot_delay = dep_delay + arr_delay) %>%
  select(origin, dest, ends_with("delay"), everything()) %>% 
  head
## # A tibble: 6 × 17
##   origin dest  dep_delay arr_delay tot_delay  year month   day dep_time arr_time carrier tailnum flight air_time distance  hour minute
##   <chr>  <chr>     <dbl>     <dbl>     <dbl> <int> <int> <int>    <int>    <int> <chr>   <chr>    <int>    <dbl>    <dbl> <dbl>  <dbl>
## 1 PDX    ANC          96        70       166  2014     1     1        1      235 AS      N508AS     145      194     1542     0      1
## 2 SEA    CLT          -6       -23       -29  2014     1     1        4      738 US      N195UW    1830      252     2279     0      4
## 3 PDX    IAH          13        -4         9  2014     1     1        8      548 UA      N37422    1609      201     1825     0      8
## 4 PDX    CLT          -2       -23       -25  2014     1     1       28      800 US      N547UW     466      251     2282     0     28
## 5 SEA    ANC          44        43        87  2014     1     1       34      325 AS      N762AS     121      201     1448     0     34
## 6 SEA    DTW          82        88       170  2014     1     1       37      747 DL      N806DN    1823      224     1927     0     37
# flights that were delayed at departure had on time or early arrivals?
arrivals <- flights %>%
  mutate(arr_ok = ifelse(dep_delay > 0 & arr_delay <= 0, 1, 0)) %>% 
  select(origin, dest, ends_with("delay"), carrier, arr_ok)

# peek at it
arrivals %>%
  filter(arr_ok == 1) %>%
  head
## # A tibble: 6 × 6
##   origin dest  dep_delay arr_delay carrier arr_ok
##   <chr>  <chr>     <dbl>     <dbl> <chr>    <dbl>
## 1 PDX    IAH          13        -4 UA           1
## 2 SEA    EWR           1        -6 UA           1
## 3 SEA    SAN           2       -12 AS           1
## 4 PDX    EWR           2       -19 UA           1
## 5 SEA    IAH          13        -4 UA           1
## 6 PDX    IAD          10        -4 UA           1

2.6 dplyr::summarise (or dplyr::summarize)

We have seen summarise used to calculate summary statistics, like so:

flights %>%
  summarise(mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 1
##   `mean(dep_delay, na.rm = TRUE)`
##                             <dbl>
## 1                            6.13

But this can get tedious when we want to compute several summary statistics, like so:

# we can also name that variable, and summarise multiple variables
flights %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE),
            sd_delay = sd(dep_delay, na.rm = TRUE),
            median_delay = median(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 3
##   mean_delay sd_delay median_delay
##        <dbl>    <dbl>        <dbl>
## 1       6.13     29.1           -2

summarise together with across() lets us automate this process a bit:

flights %>%
  filter(!is.na(dep_delay)) %>%
  select(dep_delay) %>%
  summarise(
    across(everything(), 
           lst(mean, sd, median)) # making this a list gets us nice column names
    )
## # A tibble: 1 × 3
##   dep_delay_mean dep_delay_sd dep_delay_median
##            <dbl>        <dbl>            <dbl>
## 1           6.13         29.1               -2
# same thing over all numeric columns
flights %>%
  filter(!is.na(dep_delay)) %>%
  summarise(across(where(is.numeric), lst(mean, sd, median)))
## # A tibble: 1 × 36
##   year_mean year_sd year_median month_mean month_sd month_median day_mean day_sd day_median dep_time_mean dep_time_sd dep_time_median
##       <dbl>   <dbl>       <dbl>      <dbl>    <dbl>        <dbl>    <dbl>  <dbl>      <dbl>         <dbl>       <dbl>           <dbl>
## 1      2014       0        2014       6.62     3.32            7     15.8   8.79         16         1278.        523.            1217
## # ℹ 24 more variables: dep_delay_mean <dbl>, dep_delay_sd <dbl>, dep_delay_median <dbl>, arr_time_mean <dbl>, arr_time_sd <dbl>,
## #   arr_time_median <int>, arr_delay_mean <dbl>, arr_delay_sd <dbl>, arr_delay_median <dbl>, flight_mean <dbl>, flight_sd <dbl>,
## #   flight_median <dbl>, air_time_mean <dbl>, air_time_sd <dbl>, air_time_median <dbl>, distance_mean <dbl>, distance_sd <dbl>,
## #   distance_median <dbl>, hour_mean <dbl>, hour_sd <dbl>, hour_median <dbl>, minute_mean <dbl>, minute_sd <dbl>, minute_median <dbl>
# combine with pivot_longer, change names too
flights %>%
  filter(!is.na(dep_delay)) %>%
  summarise(across(dep_delay, lst(mean, stdev=sd, median))) %>% 
  pivot_longer(everything(), names_to="delay_stat", values_to="value", names_prefix = "dep_delay_")
## # A tibble: 3 × 2
##   delay_stat value
##   <chr>      <dbl>
## 1 mean        6.13
## 2 stdev      29.1 
## 3 median     -2

We can use aggregating functions in summarise

# how many unique destinations?
summary_table <- flights %>% 
  summarise(tot_flights = n(),
            tot_planes = n_distinct(tailnum),
            tot_carriers = n_distinct(carrier),
            tot_dests = n_distinct(dest),
            tot_origins = n_distinct(origin))

summary_table
## # A tibble: 1 × 5
##   tot_flights tot_planes tot_carriers tot_dests tot_origins
##         <int>      <int>        <int>     <int>       <int>
## 1      162049       3023           11        71           2

The resulting table is a bit messy; we can reprocess using pivot_longer and separate:

# chain with tidyr functions
summary_table %>% 
  pivot_longer(everything(), names_to="key", values_to="value") %>% 
  separate(key, into = c("tot", "entity")) %>% 
  select(-tot, total = value)
## # A tibble: 5 × 2
##   entity    total
##   <chr>     <int>
## 1 flights  162049
## 2 planes     3023
## 3 carriers     11
## 4 dests        71
## 5 origins       2

Or we can do it in a single call to pivot_longer():

summary_table %>% 
  pivot_longer(
    everything(), # grab all columns
    names_to="entity", # make a new column called "entity" to store the old column names
    names_prefix="tot_", # remove the "tot_" prefix from the old column names
    values_to="total") # put the values in a column called "total"
## # A tibble: 5 × 2
##   entity    total
##   <chr>     <int>
## 1 flights  162049
## 2 planes     3023
## 3 carriers     11
## 4 dests        71
## 5 origins       2

3 tidyr

We’ll work with a made up dataframe:

df <- data.frame(
  id = 1:10,
  date = as.Date('2015-01-01') + 0:9,
  q1_m1_w1 = rnorm(10, 0, 1),
  q1_m1_w2 = rnorm(10, 0, 1),
  q1_m2_w3 = rnorm(10, 0, 1),
  q2_m1_w1 = rnorm(10, 0, 1),
  q2_m2_w1 = rnorm(10, 0, 1),
  q2_m2_w2 = rnorm(10, 0, 1)
)
# HLO
head(df)
##   id       date   q1_m1_w1   q1_m1_w2   q1_m2_w3    q2_m1_w1   q2_m2_w1   q2_m2_w2
## 1  1 2015-01-01 -1.1116773  0.6142841 -1.1714875  0.87494857 -0.3232590  0.9876592
## 2  2 2015-01-02  1.1799015  0.8463138 -0.3389054  1.75723234 -1.4957872 -0.8978570
## 3  3 2015-01-03 -0.9506309 -0.6169158 -0.3153580 -0.07181477  0.4665644 -1.9269922
## 4  4 2015-01-04  0.1975220  0.7934394 -0.2378307  0.58933216 -0.8207184 -0.2030309
## 5  5 2015-01-05  0.5255518 -0.4919001 -0.9546156 -0.84600664  1.6570551  0.1293742
## 6  6 2015-01-06 -0.4800159 -1.1043834 -0.1293722  0.82895860 -0.5542707  1.1216944
glimpse(df)
## Rows: 10
## Columns: 8
## $ id       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ date     <date> 2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05, 2015-01-06, 2015-01-07, 2015-01-08, 2015-01-09, 2015-01-10
## $ q1_m1_w1 <dbl> -1.1116773, 1.1799015, -0.9506309, 0.1975220, 0.5255518, -0.4800159, 0.7514034, 0.2399251, 0.7111657, 0.4055081
## $ q1_m1_w2 <dbl> 0.6142841, 0.8463138, -0.6169158, 0.7934394, -0.4919001, -1.1043834, 0.4934917, -1.9895106, 1.0143630, -0.6657672
## $ q1_m2_w3 <dbl> -1.1714875, -0.3389054, -0.3153580, -0.2378307, -0.9546156, -0.1293722, 0.6325128, 0.6125887, -0.9325299, 0.1415578
## $ q2_m1_w1 <dbl> 0.87494857, 1.75723234, -0.07181477, 0.58933216, -0.84600664, 0.82895860, 0.69227405, -1.54251402, 0.54486288, 0.97491665
## $ q2_m2_w1 <dbl> -0.3232590, -1.4957872, 0.4665644, -0.8207184, 1.6570551, -0.5542707, -1.5169967, -0.8280475, 0.7440145, -1.0260936
## $ q2_m2_w2 <dbl> 0.9876592, -0.8978570, -1.9269922, -0.2030309, 0.1293742, 1.1216944, 1.1620434, 0.8363186, 0.3752448, 0.4801174

3.1 tidyr::pivot_longer

First, let’s pivot…

df_tidy <- df %>%
  pivot_longer(q1_m1_w1:q2_m2_w2, names_to="key", values_to="value")
head(df_tidy)
## # A tibble: 6 × 4
##      id date       key       value
##   <int> <date>     <chr>     <dbl>
## 1     1 2015-01-01 q1_m1_w1 -1.11 
## 2     1 2015-01-01 q1_m1_w2  0.614
## 3     1 2015-01-01 q1_m2_w3 -1.17 
## 4     1 2015-01-01 q2_m1_w1  0.875
## 5     1 2015-01-01 q2_m2_w1 -0.323
## 6     1 2015-01-01 q2_m2_w2  0.988

3.2 tidyr::separate

# separate 1 col into 3 cols
df_sep <- df_tidy %>%
  separate(key, into = c("quarter", "month", "week"))
head(df_sep)
## # A tibble: 6 × 6
##      id date       quarter month week   value
##   <int> <date>     <chr>   <chr> <chr>  <dbl>
## 1     1 2015-01-01 q1      m1    w1    -1.11 
## 2     1 2015-01-01 q1      m1    w2     0.614
## 3     1 2015-01-01 q1      m2    w3    -1.17 
## 4     1 2015-01-01 q2      m1    w1     0.875
## 5     1 2015-01-01 q2      m2    w1    -0.323
## 6     1 2015-01-01 q2      m2    w2     0.988
# separate 1 col into 2 cols
df_sep2 <- df_tidy %>%
  separate(key, into = c("quarter", "period"), extra = "merge")
head(df_sep2)
## # A tibble: 6 × 5
##      id date       quarter period  value
##   <int> <date>     <chr>   <chr>   <dbl>
## 1     1 2015-01-01 q1      m1_w1  -1.11 
## 2     1 2015-01-01 q1      m1_w2   0.614
## 3     1 2015-01-01 q1      m2_w3  -1.17 
## 4     1 2015-01-01 q2      m1_w1   0.875
## 5     1 2015-01-01 q2      m2_w1  -0.323
## 6     1 2015-01-01 q2      m2_w2   0.988

stringr vs. tidyr separate by regular expression

3.3 tidyr::extract

Extract is essentially the same as separate, let’s see how…

# extract
df_ext <- df_sep2 %>%
  extract(period, into = "month")
head(df_ext)
## # A tibble: 6 × 5
##      id date       quarter month  value
##   <int> <date>     <chr>   <chr>  <dbl>
## 1     1 2015-01-01 q1      m1    -1.11 
## 2     1 2015-01-01 q1      m1     0.614
## 3     1 2015-01-01 q1      m2    -1.17 
## 4     1 2015-01-01 q2      m1     0.875
## 5     1 2015-01-01 q2      m2    -0.323
## 6     1 2015-01-01 q2      m2     0.988

We can see that, by default, it’s taking the first alphanumeric chunk of the value, and extracting it. We can specify much more complex patterns:

# this gives us same output as separate
df_ext <- df_sep2 %>%
  extract(period, into = c("month", "week"), 
          regex = "([[:alnum:]]+)_([[:alnum:]]+)")
head(df_ext)
## # A tibble: 6 × 6
##      id date       quarter month week   value
##   <int> <date>     <chr>   <chr> <chr>  <dbl>
## 1     1 2015-01-01 q1      m1    w1    -1.11 
## 2     1 2015-01-01 q1      m1    w2     0.614
## 3     1 2015-01-01 q1      m2    w3    -1.17 
## 4     1 2015-01-01 q2      m1    w1     0.875
## 5     1 2015-01-01 q2      m2    w1    -0.323
## 6     1 2015-01-01 q2      m2    w2     0.988

3.4 tidyr::unite

The unite() function lets us reverse the process:

# let's say we want to combine quarter and month with an underscore
df_uni <- df_sep %>%
  unite(period, quarter:month) # sep = "_" is the default arg
head(df_uni)
## # A tibble: 6 × 5
##      id date       period week   value
##   <int> <date>     <chr>  <chr>  <dbl>
## 1     1 2015-01-01 q1_m1  w1    -1.11 
## 2     1 2015-01-01 q1_m1  w2     0.614
## 3     1 2015-01-01 q1_m2  w3    -1.17 
## 4     1 2015-01-01 q2_m1  w1     0.875
## 5     1 2015-01-01 q2_m2  w1    -0.323
## 6     1 2015-01-01 q2_m2  w2     0.988
# let's say we want to combine quarter and month with nothing
df_uni <- df_sep %>%
  unite(period, quarter:month, sep = "")
head(df_uni)
## # A tibble: 6 × 5
##      id date       period week   value
##   <int> <date>     <chr>  <chr>  <dbl>
## 1     1 2015-01-01 q1m1   w1    -1.11 
## 2     1 2015-01-01 q1m1   w2     0.614
## 3     1 2015-01-01 q1m2   w3    -1.17 
## 4     1 2015-01-01 q2m1   w1     0.875
## 5     1 2015-01-01 q2m2   w1    -0.323
## 6     1 2015-01-01 q2m2   w2     0.988

3.5 tidyr::pivot_wider

# finally let's spread
df_spread <- df_uni %>%
  pivot_wider(names_from=week, values_from=value)  # fill = NA is default arg
head(df_spread)
## # A tibble: 6 × 6
##      id date       period     w1     w2     w3
##   <int> <date>     <chr>   <dbl>  <dbl>  <dbl>
## 1     1 2015-01-01 q1m1   -1.11   0.614 NA    
## 2     1 2015-01-01 q1m2   NA     NA     -1.17 
## 3     1 2015-01-01 q2m1    0.875 NA     NA    
## 4     1 2015-01-01 q2m2   -0.323  0.988 NA    
## 5     2 2015-01-02 q1m1    1.18   0.846 NA    
## 6     2 2015-01-02 q1m2   NA     NA     -0.339

3.6 Putting it all together (pivot_longer() %>% separate() %>% pivot_wider())

All in one, if we had wanted to end up with one row per month:

df_tidiest <- df %>% 
  pivot_longer(q1_m1_w1:q2_m2_w2, names_to="key", values_to="value") %>% 
  separate(key, into = c("quarter", "month", "week")) %>% 
  pivot_wider(names_from=week, values_from=value)
head(df_tidiest)
## # A tibble: 6 × 7
##      id date       quarter month     w1     w2     w3
##   <int> <date>     <chr>   <chr>  <dbl>  <dbl>  <dbl>
## 1     1 2015-01-01 q1      m1    -1.11   0.614 NA    
## 2     1 2015-01-01 q1      m2    NA     NA     -1.17 
## 3     1 2015-01-01 q2      m1     0.875 NA     NA    
## 4     1 2015-01-01 q2      m2    -0.323  0.988 NA    
## 5     2 2015-01-02 q1      m1     1.18   0.846 NA    
## 6     2 2015-01-02 q1      m2    NA     NA     -0.339

4 broom

“The broom package takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames.” So, broom tidies output from other R functions that are un-tidy.

See here for list of functions: https://github.com/dgrtwo/broom

Vignette: ftp://cran.r-project.org/pub/R/web/packages/broom/vignettes/broom.html

fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear), 
          data = mtcars)

Un-tidy output from lm

summary(fit)
## 
## Call:
## lm(formula = mpg ~ qsec + factor(am) + wt + factor(gear), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5064 -1.5220 -0.7517  1.3841  4.6345 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     9.3650     8.3730   1.118  0.27359    
## qsec            1.2449     0.3828   3.252  0.00317 ** 
## factor(am)1     3.1505     1.9405   1.624  0.11654    
## wt             -3.9263     0.7428  -5.286 1.58e-05 ***
## factor(gear)4  -0.2682     1.6555  -0.162  0.87257    
## factor(gear)5  -0.2697     2.0632  -0.131  0.89698    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.55 on 26 degrees of freedom
## Multiple R-squared:  0.8498, Adjusted R-squared:  0.8209 
## F-statistic: 29.43 on 5 and 26 DF,  p-value: 6.379e-10

Tidy output from broom

tidy(fit)
## # A tibble: 6 × 5
##   term          estimate std.error statistic   p.value
##   <chr>            <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)      9.37      8.37      1.12  0.274    
## 2 qsec             1.24      0.383     3.25  0.00317  
## 3 factor(am)1      3.15      1.94      1.62  0.117    
## 4 wt              -3.93      0.743    -5.29  0.0000158
## 5 factor(gear)4   -0.268     1.66     -0.162 0.873    
## 6 factor(gear)5   -0.270     2.06     -0.131 0.897

This is very helfpul for making e.g. regression tables for papers!

Creative Commons License