1 Data-wrangling and tidying for tables

Making tables is often when we stretch our dplyr skills. Here are some additional more advanced examples that you may find helpful for today’s assignment!

We’ll use the pnwflights14 package to practice our dplyr skills. We need to download the package from github using devtools.

# once per machine
install.packages("devtools")
devtools::install_github("ismayc/pnwflights14")

Now, we need to load the flights dataset from the pnwflights14 package.

# once per work session
library(pnwflights14)
data("flights", package = "pnwflights14")

2 More Fun dplyr Tricks

2.1 dplyr::select

Recall that we use select to specify which columns in a dataframe you’d like to keep by name:

# keep these 2 cols
mini_flights <- flights %>% 
  select(carrier, flight)
glimpse(mini_flights)
## Rows: 162,049
## Columns: 2
## $ carrier <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", "U…
## $ flight  <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569, 6…
# keep first five cols
first_five <- flights %>% 
  select(year, month, day, dep_time, dep_delay)
glimpse(first_five)
## Rows: 162,049
## Columns: 5
## $ year      <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
## $ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
# alternatively, specify range
first_five <- flights %>% 
  select(year:dep_delay)
glimpse(first_five)
## Rows: 162,049
## Columns: 5
## $ year      <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
## $ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…

We can also choose the columns we want by negation, that is, you can specify which columns to drop instead of keep. This way, all variables not listed are kept.

# we can also use negation
all_but_year <- flights %>% 
  select(-year)
glimpse(all_but_year)
## Rows: 162,049
## Columns: 15
## $ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
## $ arr_time  <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907…
## $ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, …
## $ carrier   <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", …
## $ tailnum   <chr> "N508AS", "N195UW", "N37422", "N547UW", "N762AS", "N806DN", …
## $ flight    <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569,…
## $ origin    <chr> "PDX", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "PDX", "SEA…
## $ dest      <chr> "ANC", "CLT", "IAH", "CLT", "ANC", "DTW", "ORD", "IAH", "DEN…
## $ air_time  <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, …
## $ distance  <dbl> 1542, 2279, 1825, 2282, 1448, 1927, 1721, 1825, 1024, 2402, …
## $ hour      <dbl> 0, 0, 0, 0, 0, 0, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, …
## $ minute    <dbl> 1, 4, 8, 28, 34, 37, 46, 26, 27, 36, 41, 49, 50, 57, 57, 58,…

dplyr::select comes with several other helper functions…

depart <- flights %>%
  select(starts_with("dep_"))
glimpse(depart)
## Rows: 162,049
## Columns: 2
## $ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
times <- flights %>%
  select(contains("time"))
glimpse(times)
## Rows: 162,049
## Columns: 3
## $ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557, …
## $ arr_time <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907,…
## $ air_time <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, 8…
# note that we are not actually saving the new dataframe here
flights %>%
  select(-contains("time")) %>% head()
## # A tibble: 6 × 13
##    year month   day dep_de…¹ arr_d…² carrier tailnum flight origin dest  dista…³
##   <int> <int> <int>    <dbl>   <dbl> <chr>   <chr>    <int> <chr>  <chr>   <dbl>
## 1  2014     1     1       96      70 AS      N508AS     145 PDX    ANC      1542
## 2  2014     1     1       -6     -23 US      N195UW    1830 SEA    CLT      2279
## 3  2014     1     1       13      -4 UA      N37422    1609 PDX    IAH      1825
## 4  2014     1     1       -2     -23 US      N547UW     466 PDX    CLT      2282
## 5  2014     1     1       44      43 AS      N762AS     121 SEA    ANC      1448
## 6  2014     1     1       82      88 DL      N806DN    1823 SEA    DTW      1927
## # … with 2 more variables: hour <dbl>, minute <dbl>, and abbreviated variable
## #   names ¹​dep_delay, ²​arr_delay, ³​distance
delays <- flights %>%
  select(ends_with("delay"))
glimpse(delays)
## Rows: 162,049
## Columns: 2
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
## $ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, …

One of my favorite select helper functions is everything(), which allows you to use select to keep all your variables, but easily rearrange the columns without having to list all the variables to keep/drop.

new_order <- flights %>% 
  select(origin, dest, everything())
head(new_order)
## # A tibble: 6 × 16
##   origin dest   year month   day dep_t…¹ dep_d…² arr_t…³ arr_d…⁴ carrier tailnum
##   <chr>  <chr> <int> <int> <int>   <int>   <dbl>   <int>   <dbl> <chr>   <chr>  
## 1 PDX    ANC    2014     1     1       1      96     235      70 AS      N508AS 
## 2 SEA    CLT    2014     1     1       4      -6     738     -23 US      N195UW 
## 3 PDX    IAH    2014     1     1       8      13     548      -4 UA      N37422 
## 4 PDX    CLT    2014     1     1      28      -2     800     -23 US      N547UW 
## 5 SEA    ANC    2014     1     1      34      44     325      43 AS      N762AS 
## 6 SEA    DTW    2014     1     1      37      82     747      88 DL      N806DN 
## # … with 5 more variables: flight <int>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, and abbreviated variable names ¹​dep_time,
## #   ²​dep_delay, ³​arr_time, ⁴​arr_delay
# with negation
new_order2 <- flights %>% 
  select(origin, dest, everything(), -year)
head(new_order2)
## # A tibble: 6 × 15
##   origin dest  month   day dep_time dep_delay arr_time arr_delay carrier tailnum
##   <chr>  <chr> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>  
## 1 PDX    ANC       1     1        1        96      235        70 AS      N508AS 
## 2 SEA    CLT       1     1        4        -6      738       -23 US      N195UW 
## 3 PDX    IAH       1     1        8        13      548        -4 UA      N37422 
## 4 PDX    CLT       1     1       28        -2      800       -23 US      N547UW 
## 5 SEA    ANC       1     1       34        44      325        43 AS      N762AS 
## 6 SEA    DTW       1     1       37        82      747        88 DL      N806DN 
## # … with 5 more variables: flight <int>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>

We can also rename variables within select.

flights2 <- flights %>%
  select(tail_num = tailnum, everything())
head(flights2)
## # A tibble: 6 × 16
##   tail_num  year month   day dep_time dep_delay arr_time arr_de…¹ carrier flight
##   <chr>    <int> <int> <int>    <int>     <dbl>    <int>    <dbl> <chr>    <int>
## 1 N508AS    2014     1     1        1        96      235       70 AS         145
## 2 N195UW    2014     1     1        4        -6      738      -23 US        1830
## 3 N37422    2014     1     1        8        13      548       -4 UA        1609
## 4 N547UW    2014     1     1       28        -2      800      -23 US         466
## 5 N762AS    2014     1     1       34        44      325       43 AS         121
## 6 N806DN    2014     1     1       37        82      747       88 DL        1823
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, and abbreviated variable name
## #   ¹​arr_delay

If you don’t want to move the renamed variables within your dataframe, you can use the rename function.

flights3 <- flights %>%
  rename(tail_num = tailnum)
glimpse(flights3)
## Rows: 162,049
## Columns: 16
## $ year      <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
## $ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
## $ arr_time  <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907…
## $ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, …
## $ carrier   <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", …
## $ tail_num  <chr> "N508AS", "N195UW", "N37422", "N547UW", "N762AS", "N806DN", …
## $ flight    <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569,…
## $ origin    <chr> "PDX", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "PDX", "SEA…
## $ dest      <chr> "ANC", "CLT", "IAH", "CLT", "ANC", "DTW", "ORD", "IAH", "DEN…
## $ air_time  <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, …
## $ distance  <dbl> 1542, 2279, 1825, 2282, 1448, 1927, 1721, 1825, 1024, 2402, …
## $ hour      <dbl> 0, 0, 0, 0, 0, 0, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, …
## $ minute    <dbl> 1, 4, 8, 28, 34, 37, 46, 26, 27, 36, 41, 49, 50, 57, 57, 58,…

2.2 dplyr::filter

As we have previously seen, filter has many flexible uses:

# flights taking off from PDX
pdx <- flights %>% 
  filter(origin == "PDX")
head(pdx)
## # A tibble: 6 × 16
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight
##   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int>
## 1  2014     1     1        1        96      235        70 AS      N508AS     145
## 2  2014     1     1        8        13      548        -4 UA      N37422    1609
## 3  2014     1     1       28        -2      800       -23 US      N547UW     466
## 4  2014     1     1      526        -4     1148        15 UA      N813UA     229
## 5  2014     1     1      541         1      911         4 UA      N36476    1569
## 6  2014     1     1      549        24      907        12 US      N548UW     649
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>
# january flights from PDX
pdx_jan <- flights %>% 
  filter(origin == "PDX", month == 1) # the comma is an "and"
head(pdx_jan)
## # A tibble: 6 × 16
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight
##   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int>
## 1  2014     1     1        1        96      235        70 AS      N508AS     145
## 2  2014     1     1        8        13      548        -4 UA      N37422    1609
## 3  2014     1     1       28        -2      800       -23 US      N547UW     466
## 4  2014     1     1      526        -4     1148        15 UA      N813UA     229
## 5  2014     1     1      541         1      911         4 UA      N36476    1569
## 6  2014     1     1      549        24      907        12 US      N548UW     649
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>
# flights to ATL (Atlanta) or BNA (Nashville)
to_south <- flights %>% 
  filter(dest == "ATL" | dest == "BNA") %>% # | is "or"
  select(origin, dest, everything())
head(to_south)
## # A tibble: 6 × 16
##   origin dest   year month   day dep_t…¹ dep_d…² arr_t…³ arr_d…⁴ carrier tailnum
##   <chr>  <chr> <int> <int> <int>   <int>   <dbl>   <int>   <dbl> <chr>   <chr>  
## 1 SEA    ATL    2014     1     1     624      -6    1401      -6 DL      N617DL 
## 2 SEA    ATL    2014     1     1     802      -3    1533     -17 AS      N532AS 
## 3 SEA    ATL    2014     1     1     824      -1    1546     -14 DL      N633DL 
## 4 PDX    ATL    2014     1     1     944      -6    1727      -8 AS      N548AS 
## 5 PDX    ATL    2014     1     1    1054      94    1807      84 DL      N377DA 
## 6 SEA    ATL    2014     1     1    1158       6    1915     -14 DL      N6712B 
## # … with 5 more variables: flight <int>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, and abbreviated variable names ¹​dep_time,
## #   ²​dep_delay, ³​arr_time, ⁴​arr_delay
# flights from PDX to ATL (Atlanta) or BNA (Nashville)
pdx_to_south <- flights %>% 
  filter(origin == "PDX", dest == "ATL" | dest == "BNA") %>% # | is "or"
  select(origin, dest, everything())
head(pdx_to_south)
## # A tibble: 6 × 16
##   origin dest   year month   day dep_t…¹ dep_d…² arr_t…³ arr_d…⁴ carrier tailnum
##   <chr>  <chr> <int> <int> <int>   <int>   <dbl>   <int>   <dbl> <chr>   <chr>  
## 1 PDX    ATL    2014     1     1     944      -6    1727      -8 AS      N548AS 
## 2 PDX    ATL    2014     1     1    1054      94    1807      84 DL      N377DA 
## 3 PDX    ATL    2014     1     1    1323      -2    2038     -15 DL      N393DA 
## 4 PDX    ATL    2014     1     1    2253       8     611       4 DL      N371DA 
## 5 PDX    ATL    2014     1     2     627      -3    1350      -7 DL      N3746H 
## 6 PDX    ATL    2014     1     2     918      -2    1643      -2 DL      N3756  
## # … with 5 more variables: flight <int>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, and abbreviated variable names ¹​dep_time,
## #   ²​dep_delay, ³​arr_time, ⁴​arr_delay
# alternatively, using group membership
south_dests <- c("ATL", "BNA")
pdx_to_south2 <- flights %>% 
  filter(origin == "PDX", dest %in% south_dests) %>% 
  select(origin, dest, everything())
head(pdx_to_south2)
## # A tibble: 6 × 16
##   origin dest   year month   day dep_t…¹ dep_d…² arr_t…³ arr_d…⁴ carrier tailnum
##   <chr>  <chr> <int> <int> <int>   <int>   <dbl>   <int>   <dbl> <chr>   <chr>  
## 1 PDX    ATL    2014     1     1     944      -6    1727      -8 AS      N548AS 
## 2 PDX    ATL    2014     1     1    1054      94    1807      84 DL      N377DA 
## 3 PDX    ATL    2014     1     1    1323      -2    2038     -15 DL      N393DA 
## 4 PDX    ATL    2014     1     1    2253       8     611       4 DL      N371DA 
## 5 PDX    ATL    2014     1     2     627      -3    1350      -7 DL      N3746H 
## 6 PDX    ATL    2014     1     2     918      -2    1643      -2 DL      N3756  
## # … with 5 more variables: flight <int>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, and abbreviated variable names ¹​dep_time,
## #   ²​dep_delay, ³​arr_time, ⁴​arr_delay
# flights delayed by 1 hour or more
delay_1plus <- flights %>%
  filter(dep_delay >= 60)
head(delay_1plus)
## # A tibble: 6 × 16
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight
##   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int>
## 1  2014     1     1        1        96      235        70 AS      N508AS     145
## 2  2014     1     1       37        82      747        88 DL      N806DN    1823
## 3  2014     1     1      346       227      936       219 UA      N14219    1481
## 4  2014     1     1      650        90     1037        91 US      N626AW     460
## 5  2014     1     1      959       164     1137       157 AS      N534AS     805
## 6  2014     1     1     1008        68     1242        64 AS      N788AS     456
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>
# flights delayed by 1 hour, but not more than 2 hours
delay_1hr <- flights %>%
  filter(dep_delay >= 60, dep_delay < 120)
head(delay_1hr)
## # A tibble: 6 × 16
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight
##   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int>
## 1  2014     1     1        1        96      235        70 AS      N508AS     145
## 2  2014     1     1       37        82      747        88 DL      N806DN    1823
## 3  2014     1     1      650        90     1037        91 US      N626AW     460
## 4  2014     1     1     1008        68     1242        64 AS      N788AS     456
## 5  2014     1     1     1014        75     1613        81 UA      N37408    1444
## 6  2014     1     1     1036        81     1408        63 OO      N218AG    3466
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>
range(delay_1hr$dep_delay, na.rm = TRUE)
## [1]  60 119
# even more efficient using between (always inclusive)
delay_bwn <- flights %>%
  filter(between(dep_delay, 60, 119))
head(delay_bwn)
## # A tibble: 6 × 16
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight
##   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int>
## 1  2014     1     1        1        96      235        70 AS      N508AS     145
## 2  2014     1     1       37        82      747        88 DL      N806DN    1823
## 3  2014     1     1      650        90     1037        91 US      N626AW     460
## 4  2014     1     1     1008        68     1242        64 AS      N788AS     456
## 5  2014     1     1     1014        75     1613        81 UA      N37408    1444
## 6  2014     1     1     1036        81     1408        63 OO      N218AG    3466
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>
range(delay_bwn$dep_delay, na.rm = TRUE)
## [1]  60 119

2.3 dplyr::arrange

# default is ascending order
flights %>% 
  arrange(year, month, day) %>% head(n=20)
## # A tibble: 20 × 16
##     year month   day dep_time dep_delay arr_time arr_de…¹ carrier tailnum flight
##    <int> <int> <int>    <int>     <dbl>    <int>    <dbl> <chr>   <chr>    <int>
##  1  2014     1     1        1        96      235       70 AS      N508AS     145
##  2  2014     1     1        4        -6      738      -23 US      N195UW    1830
##  3  2014     1     1        8        13      548       -4 UA      N37422    1609
##  4  2014     1     1       28        -2      800      -23 US      N547UW     466
##  5  2014     1     1       34        44      325       43 AS      N762AS     121
##  6  2014     1     1       37        82      747       88 DL      N806DN    1823
##  7  2014     1     1      346       227      936      219 UA      N14219    1481
##  8  2014     1     1      526        -4     1148       15 UA      N813UA     229
##  9  2014     1     1      527         7      917       24 UA      N75433    1576
## 10  2014     1     1      536         1     1334       -6 UA      N574UA     478
## 11  2014     1     1      541         1      911        4 UA      N36476    1569
## 12  2014     1     1      549        24      907       12 US      N548UW     649
## 13  2014     1     1      550         0      837      -12 DL      N660DL    1634
## 14  2014     1     1      557        -3     1134      -16 AA      N3JLAA    1094
## 15  2014     1     1      557        -3      825      -25 AS      N562AS      81
## 16  2014     1     1      558        -2      801       -2 AS      N402AS     200
## 17  2014     1     1      559        -1      916       -9 F9      N210FR     796
## 18  2014     1     1      600         0     1151      -19 AA      N3JFAA    2240
## 19  2014     1     1      600       -10      842       -8 AS      N786AS     426
## 20  2014     1     1      602        -3      943        5 F9      N201FR     144
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, and abbreviated variable name
## #   ¹​arr_delay
# descending order
flights %>% 
  arrange(desc(year), desc(month), desc(day)) %>% head(n=20)
## # A tibble: 20 × 16
##     year month   day dep_time dep_delay arr_time arr_de…¹ carrier tailnum flight
##    <int> <int> <int>    <int>     <dbl>    <int>    <dbl> <chr>   <chr>    <int>
##  1  2014    12    31        2        12      601       31 AA      N3JKAA    1230
##  2  2014    12    31       27        -3      623        3 AA      N3EWAA    1431
##  3  2014    12    31       39        14      324        4 AS      N762AS     135
##  4  2014    12    31       40         0      549        0 DL      N757AT    2440
##  5  2014    12    31       52        -8      917      -21 AA      N3JFAA     371
##  6  2014    12    31       54         4      621       17 DL      N128DL    1670
##  7  2014    12    31       56        61      848       80 DL      N655DL     929
##  8  2014    12    31      512        -3      904        4 US      N653AW     480
##  9  2014    12    31      515        -5      855        5 US      N580UW     425
## 10  2014    12    31      534         4      859        7 UA      N34460    1075
## 11  2014    12    31      546         1      916       -4 WN      N8323C     757
## 12  2014    12    31      548        -2     1351      -13 UA      N461UA     665
## 13  2014    12    31      549         4     1208       12 UA      N68807    1457
## 14  2014    12    31      550         0      922        2 WN      N797MX    2121
## 15  2014    12    31      551        -4     1202       12 AA      N3HXAA    1094
## 16  2014    12    31      551        -9      744      -15 AS      N570AS     342
## 17  2014    12    31      555       -10      824       -1 AS      N548AS     602
## 18  2014    12    31      558        -2      849        0 DL      N668DN    1831
## 19  2014    12    31      558        -2     1149        4 AA      N436AA    1534
## 20  2014    12    31      558        -2      738       -4 AS      N585AS     406
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, and abbreviated variable name
## #   ¹​arr_delay

2.4 dplyr::distinct

# unique origin-dest combinations
flights %>% 
  select(origin, dest) %>% 
  distinct %>% head(n=50)
## # A tibble: 50 × 2
##    origin dest 
##    <chr>  <chr>
##  1 PDX    ANC  
##  2 SEA    CLT  
##  3 PDX    IAH  
##  4 PDX    CLT  
##  5 SEA    ANC  
##  6 SEA    DTW  
##  7 SEA    ORD  
##  8 SEA    DEN  
##  9 SEA    EWR  
## 10 PDX    DEN  
## # … with 40 more rows
# all unique destinations from PDX (there are 49, so we'll just list the first few)
from_pdx <- flights %>% 
  filter(origin == "PDX") %>% 
  select(origin, dest) %>%
  distinct(dest)
head(from_pdx)
## # A tibble: 6 × 1
##   dest 
##   <chr>
## 1 ANC  
## 2 IAH  
## 3 CLT  
## 4 DEN  
## 5 PHX  
## 6 ORD

2.5 dplyr::mutate

mutate is used to transform existing variables or create new ones; here, we are using it to create an indicator variable that identifies flights that were entirely on-time.

# add total delay variable
flights %>%
  mutate(tot_delay = dep_delay + arr_delay) %>%
  select(origin, dest, ends_with("delay"), everything()) %>% 
  head
## # A tibble: 6 × 17
##   origin dest  dep_d…¹ arr_d…² tot_d…³  year month   day dep_t…⁴ arr_t…⁵ carrier
##   <chr>  <chr>   <dbl>   <dbl>   <dbl> <int> <int> <int>   <int>   <int> <chr>  
## 1 PDX    ANC        96      70     166  2014     1     1       1     235 AS     
## 2 SEA    CLT        -6     -23     -29  2014     1     1       4     738 US     
## 3 PDX    IAH        13      -4       9  2014     1     1       8     548 UA     
## 4 PDX    CLT        -2     -23     -25  2014     1     1      28     800 US     
## 5 SEA    ANC        44      43      87  2014     1     1      34     325 AS     
## 6 SEA    DTW        82      88     170  2014     1     1      37     747 DL     
## # … with 6 more variables: tailnum <chr>, flight <int>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, and abbreviated variable names
## #   ¹​dep_delay, ²​arr_delay, ³​tot_delay, ⁴​dep_time, ⁵​arr_time
# flights that were delayed at departure had on time or early arrivals?
arrivals <- flights %>%
  mutate(arr_ok = ifelse(dep_delay > 0 & arr_delay <= 0, 1, 0)) %>% 
  select(origin, dest, ends_with("delay"), carrier, arr_ok)

# peek at it
arrivals %>%
  filter(arr_ok == 1) %>%
  head
## # A tibble: 6 × 6
##   origin dest  dep_delay arr_delay carrier arr_ok
##   <chr>  <chr>     <dbl>     <dbl> <chr>    <dbl>
## 1 PDX    IAH          13        -4 UA           1
## 2 SEA    EWR           1        -6 UA           1
## 3 SEA    SAN           2       -12 AS           1
## 4 PDX    EWR           2       -19 UA           1
## 5 SEA    IAH          13        -4 UA           1
## 6 PDX    IAD          10        -4 UA           1

2.6 dplyr::summarise (or dplyr::summarize)

We have seen summarise used to calculate summary statistics, like so:

flights %>%
  summarise(mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 1
##   `mean(dep_delay, na.rm = TRUE)`
##                             <dbl>
## 1                            6.13

But this can get tedious when we want to compute several summary statistics, like so:

# we can also name that variable, and summarise multiple variables
flights %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE),
            sd_delay = sd(dep_delay, na.rm = TRUE),
            median_delay = median(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 3
##   mean_delay sd_delay median_delay
##        <dbl>    <dbl>        <dbl>
## 1       6.13     29.1           -2

summarise together with across() lets us automate this process a bit:

flights %>%
  filter(!is.na(dep_delay)) %>%
  select(dep_delay) %>%
  summarise(
    across(everything(), 
           lst(mean, sd, median)) # making this a list gets us nice column names
    )
## # A tibble: 1 × 3
##   dep_delay_mean dep_delay_sd dep_delay_median
##            <dbl>        <dbl>            <dbl>
## 1           6.13         29.1               -2
# same thing over all numeric columns
flights %>%
  filter(!is.na(dep_delay)) %>%
  summarise(across(where(is.numeric), lst(mean, sd, median)))
## # A tibble: 1 × 36
##   year_…¹ year_sd year_…² month…³ month…⁴ month…⁵ day_m…⁶ day_sd day_m…⁷ dep_t…⁸
##     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>
## 1    2014       0    2014    6.62    3.32       7    15.8   8.79      16   1278.
## # … with 26 more variables: dep_time_sd <dbl>, dep_time_median <dbl>,
## #   dep_delay_mean <dbl>, dep_delay_sd <dbl>, dep_delay_median <dbl>,
## #   arr_time_mean <dbl>, arr_time_sd <dbl>, arr_time_median <int>,
## #   arr_delay_mean <dbl>, arr_delay_sd <dbl>, arr_delay_median <dbl>,
## #   flight_mean <dbl>, flight_sd <dbl>, flight_median <dbl>,
## #   air_time_mean <dbl>, air_time_sd <dbl>, air_time_median <dbl>,
## #   distance_mean <dbl>, distance_sd <dbl>, distance_median <dbl>, …
# combine with pivot_longer, change names too
flights %>%
  filter(!is.na(dep_delay)) %>%
  summarise(across(dep_delay, lst(mean, stdev=sd, median))) %>% 
  pivot_longer(everything(), names_to="delay_stat", values_to="value", names_prefix = "dep_delay_")
## # A tibble: 3 × 2
##   delay_stat value
##   <chr>      <dbl>
## 1 mean        6.13
## 2 stdev      29.1 
## 3 median     -2

We can use aggregating functions in summarise

# how many unique destinations?
summary_table <- flights %>% 
  summarise(tot_flights = n(),
            tot_planes = n_distinct(tailnum),
            tot_carriers = n_distinct(carrier),
            tot_dests = n_distinct(dest),
            tot_origins = n_distinct(origin))

summary_table
## # A tibble: 1 × 5
##   tot_flights tot_planes tot_carriers tot_dests tot_origins
##         <int>      <int>        <int>     <int>       <int>
## 1      162049       3023           11        71           2

The resulting table is a bit messy; we can reprocess using pivot_longer and separate:

# chain with tidyr functions
summary_table %>% 
  pivot_longer(everything(), names_to="key", values_to="value") %>% 
  separate(key, into = c("tot", "entity")) %>% 
  select(-tot, total = value)
## # A tibble: 5 × 2
##   entity    total
##   <chr>     <int>
## 1 flights  162049
## 2 planes     3023
## 3 carriers     11
## 4 dests        71
## 5 origins       2

Or we can do it in a single call to pivot_longer():

summary_table %>% 
  pivot_longer(
    everything(), # grab all columns
    names_to="entity", # make a new column called "entity" to store the old column names
    names_prefix="tot_", # remove the "tot_" prefix from the old column names
    values_to="total") # put the values in a column called "total"
## # A tibble: 5 × 2
##   entity    total
##   <chr>     <int>
## 1 flights  162049
## 2 planes     3023
## 3 carriers     11
## 4 dests        71
## 5 origins       2

3 tidyr

We’ll work with a made up dataframe:

df <- data.frame(
  id = 1:10,
  date = as.Date('2015-01-01') + 0:9,
  q1_m1_w1 = rnorm(10, 0, 1),
  q1_m1_w2 = rnorm(10, 0, 1),
  q1_m2_w3 = rnorm(10, 0, 1),
  q2_m1_w1 = rnorm(10, 0, 1),
  q2_m2_w1 = rnorm(10, 0, 1),
  q2_m2_w2 = rnorm(10, 0, 1)
)
# HLO
head(df)
##   id       date   q1_m1_w1   q1_m1_w2   q1_m2_w3    q2_m1_w1   q2_m2_w1
## 1  1 2015-01-01 -1.1116773  0.6142841 -1.1714875  0.87494857 -0.3232590
## 2  2 2015-01-02  1.1799015  0.8463138 -0.3389054  1.75723234 -1.4957872
## 3  3 2015-01-03 -0.9506309 -0.6169158 -0.3153580 -0.07181477  0.4665644
## 4  4 2015-01-04  0.1975220  0.7934394 -0.2378307  0.58933216 -0.8207184
## 5  5 2015-01-05  0.5255518 -0.4919001 -0.9546156 -0.84600664  1.6570551
## 6  6 2015-01-06 -0.4800159 -1.1043834 -0.1293722  0.82895860 -0.5542707
##     q2_m2_w2
## 1  0.9876592
## 2 -0.8978570
## 3 -1.9269922
## 4 -0.2030309
## 5  0.1293742
## 6  1.1216944
glimpse(df)
## Rows: 10
## Columns: 8
## $ id       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ date     <date> 2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05, 2…
## $ q1_m1_w1 <dbl> -1.1116773, 1.1799015, -0.9506309, 0.1975220, 0.5255518, -0.…
## $ q1_m1_w2 <dbl> 0.6142841, 0.8463138, -0.6169158, 0.7934394, -0.4919001, -1.1…
## $ q1_m2_w3 <dbl> -1.1714875, -0.3389054, -0.3153580, -0.2378307, -0.9546156, -…
## $ q2_m1_w1 <dbl> 0.87494857, 1.75723234, -0.07181477, 0.58933216, -0.84600664,…
## $ q2_m2_w1 <dbl> -0.3232590, -1.4957872, 0.4665644, -0.8207184, 1.6570551, -0.…
## $ q2_m2_w2 <dbl> 0.9876592, -0.8978570, -1.9269922, -0.2030309, 0.1293742, 1.1…

3.1 tidyr::pivot_longer

First, let’s pivot…

df_tidy <- df %>%
  pivot_longer(q1_m1_w1:q2_m2_w2, names_to="key", values_to="value")
head(df_tidy)
## # A tibble: 6 × 4
##      id date       key       value
##   <int> <date>     <chr>     <dbl>
## 1     1 2015-01-01 q1_m1_w1 -1.11 
## 2     1 2015-01-01 q1_m1_w2  0.614
## 3     1 2015-01-01 q1_m2_w3 -1.17 
## 4     1 2015-01-01 q2_m1_w1  0.875
## 5     1 2015-01-01 q2_m2_w1 -0.323
## 6     1 2015-01-01 q2_m2_w2  0.988

3.2 tidyr::separate

# separate 1 col into 3 cols
df_sep <- df_tidy %>%
  separate(key, into = c("quarter", "month", "week"))
head(df_sep)
## # A tibble: 6 × 6
##      id date       quarter month week   value
##   <int> <date>     <chr>   <chr> <chr>  <dbl>
## 1     1 2015-01-01 q1      m1    w1    -1.11 
## 2     1 2015-01-01 q1      m1    w2     0.614
## 3     1 2015-01-01 q1      m2    w3    -1.17 
## 4     1 2015-01-01 q2      m1    w1     0.875
## 5     1 2015-01-01 q2      m2    w1    -0.323
## 6     1 2015-01-01 q2      m2    w2     0.988
# separate 1 col into 2 cols
df_sep2 <- df_tidy %>%
  separate(key, into = c("quarter", "period"), extra = "merge")
head(df_sep2)
## # A tibble: 6 × 5
##      id date       quarter period  value
##   <int> <date>     <chr>   <chr>   <dbl>
## 1     1 2015-01-01 q1      m1_w1  -1.11 
## 2     1 2015-01-01 q1      m1_w2   0.614
## 3     1 2015-01-01 q1      m2_w3  -1.17 
## 4     1 2015-01-01 q2      m1_w1   0.875
## 5     1 2015-01-01 q2      m2_w1  -0.323
## 6     1 2015-01-01 q2      m2_w2   0.988

stringr vs. tidyr separate by regular expression

3.3 tidyr::extract

Extract is essentially the same as separate, let’s see how…

# extract
df_ext <- df_sep2 %>%
  extract(period, into = "month")
head(df_ext)
## # A tibble: 6 × 5
##      id date       quarter month  value
##   <int> <date>     <chr>   <chr>  <dbl>
## 1     1 2015-01-01 q1      m1    -1.11 
## 2     1 2015-01-01 q1      m1     0.614
## 3     1 2015-01-01 q1      m2    -1.17 
## 4     1 2015-01-01 q2      m1     0.875
## 5     1 2015-01-01 q2      m2    -0.323
## 6     1 2015-01-01 q2      m2     0.988

We can see that, by default, it’s taking the first alphanumeric chunk of the value, and extracting it. We can specify much more complex patterns:

# this gives us same output as separate
df_ext <- df_sep2 %>%
  extract(period, into = c("month", "week"), 
          regex = "([[:alnum:]]+)_([[:alnum:]]+)")
head(df_ext)
## # A tibble: 6 × 6
##      id date       quarter month week   value
##   <int> <date>     <chr>   <chr> <chr>  <dbl>
## 1     1 2015-01-01 q1      m1    w1    -1.11 
## 2     1 2015-01-01 q1      m1    w2     0.614
## 3     1 2015-01-01 q1      m2    w3    -1.17 
## 4     1 2015-01-01 q2      m1    w1     0.875
## 5     1 2015-01-01 q2      m2    w1    -0.323
## 6     1 2015-01-01 q2      m2    w2     0.988

3.4 tidyr::unite

The unite() function lets us reverse the process:

# let's say we want to combine quarter and month with an underscore
df_uni <- df_sep %>%
  unite(period, quarter:month) # sep = "_" is the default arg
head(df_uni)
## # A tibble: 6 × 5
##      id date       period week   value
##   <int> <date>     <chr>  <chr>  <dbl>
## 1     1 2015-01-01 q1_m1  w1    -1.11 
## 2     1 2015-01-01 q1_m1  w2     0.614
## 3     1 2015-01-01 q1_m2  w3    -1.17 
## 4     1 2015-01-01 q2_m1  w1     0.875
## 5     1 2015-01-01 q2_m2  w1    -0.323
## 6     1 2015-01-01 q2_m2  w2     0.988
# let's say we want to combine quarter and month with nothing
df_uni <- df_sep %>%
  unite(period, quarter:month, sep = "")
head(df_uni)
## # A tibble: 6 × 5
##      id date       period week   value
##   <int> <date>     <chr>  <chr>  <dbl>
## 1     1 2015-01-01 q1m1   w1    -1.11 
## 2     1 2015-01-01 q1m1   w2     0.614
## 3     1 2015-01-01 q1m2   w3    -1.17 
## 4     1 2015-01-01 q2m1   w1     0.875
## 5     1 2015-01-01 q2m2   w1    -0.323
## 6     1 2015-01-01 q2m2   w2     0.988

3.5 tidyr::pivot_wider

# finally let's spread
df_spread <- df_uni %>%
  pivot_wider(names_from=week, values_from=value)  # fill = NA is default arg
head(df_spread)
## # A tibble: 6 × 6
##      id date       period     w1     w2     w3
##   <int> <date>     <chr>   <dbl>  <dbl>  <dbl>
## 1     1 2015-01-01 q1m1   -1.11   0.614 NA    
## 2     1 2015-01-01 q1m2   NA     NA     -1.17 
## 3     1 2015-01-01 q2m1    0.875 NA     NA    
## 4     1 2015-01-01 q2m2   -0.323  0.988 NA    
## 5     2 2015-01-02 q1m1    1.18   0.846 NA    
## 6     2 2015-01-02 q1m2   NA     NA     -0.339

3.6 Putting it all together (pivot_longer() %>% separate() %>% pivot_wider())

All in one, if we had wanted to end up with one row per month:

df_tidiest <- df %>% 
  pivot_longer(q1_m1_w1:q2_m2_w2, names_to="key", values_to="value") %>% 
  separate(key, into = c("quarter", "month", "week")) %>% 
  pivot_wider(names_from=week, values_from=value)
head(df_tidiest)
## # A tibble: 6 × 7
##      id date       quarter month     w1     w2     w3
##   <int> <date>     <chr>   <chr>  <dbl>  <dbl>  <dbl>
## 1     1 2015-01-01 q1      m1    -1.11   0.614 NA    
## 2     1 2015-01-01 q1      m2    NA     NA     -1.17 
## 3     1 2015-01-01 q2      m1     0.875 NA     NA    
## 4     1 2015-01-01 q2      m2    -0.323  0.988 NA    
## 5     2 2015-01-02 q1      m1     1.18   0.846 NA    
## 6     2 2015-01-02 q1      m2    NA     NA     -0.339

4 broom

“The broom package takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames.” So, broom tidies output from other R functions that are un-tidy.

See here for list of functions: https://github.com/dgrtwo/broom

Vignette: ftp://cran.r-project.org/pub/R/web/packages/broom/vignettes/broom.html

fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear), 
          data = mtcars)

Un-tidy output from lm

summary(fit)
## 
## Call:
## lm(formula = mpg ~ qsec + factor(am) + wt + factor(gear), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5064 -1.5220 -0.7517  1.3841  4.6345 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     9.3650     8.3730   1.118  0.27359    
## qsec            1.2449     0.3828   3.252  0.00317 ** 
## factor(am)1     3.1505     1.9405   1.624  0.11654    
## wt             -3.9263     0.7428  -5.286 1.58e-05 ***
## factor(gear)4  -0.2682     1.6555  -0.162  0.87257    
## factor(gear)5  -0.2697     2.0632  -0.131  0.89698    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.55 on 26 degrees of freedom
## Multiple R-squared:  0.8498, Adjusted R-squared:  0.8209 
## F-statistic: 29.43 on 5 and 26 DF,  p-value: 6.379e-10

Tidy output from broom

tidy(fit)
## # A tibble: 6 × 5
##   term          estimate std.error statistic   p.value
##   <chr>            <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)      9.37      8.37      1.12  0.274    
## 2 qsec             1.24      0.383     3.25  0.00317  
## 3 factor(am)1      3.15      1.94      1.62  0.117    
## 4 wt              -3.93      0.743    -5.29  0.0000158
## 5 factor(gear)4   -0.268     1.66     -0.162 0.873    
## 6 factor(gear)5   -0.270     2.06     -0.131 0.897

This is very helfpul for making e.g. regression tables for papers!

Creative Commons License