1 Data-wrangling and tidying for tables

Making tables is often when we stretch our dplyr skills. Here are some additional more advanced examples that you may find helpful for today’s assignment!

We’ll use the pnwflights14 package to practice our dplyr skills. We need to download the package from github using devtools.

# once per machine
install.packages("devtools")
devtools::install_github("ismayc/pnwflights14")

Now, we need to load the flights dataset from the pnwflights14 package.

# once per work session
library(pnwflights14)
data("flights", package = "pnwflights14")

2 More Fun dplyr Tricks

2.1 dplyr::select

Recall that we use select to specify which columns in a dataframe you’d like to keep by name:

# keep these 2 cols
mini_flights <- flights %>% 
  select(carrier, flight)
glimpse(mini_flights)
## Rows: 162,049
## Columns: 2
## $ carrier <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", "U…
## $ flight  <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569, 6…
# keep first five cols
first_five <- flights %>% 
  select(year, month, day, dep_time, dep_delay)
glimpse(first_five)
## Rows: 162,049
## Columns: 5
## $ year      <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
## $ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
# alternatively, specify range
first_five <- flights %>% 
  select(year:dep_delay)
glimpse(first_five)
## Rows: 162,049
## Columns: 5
## $ year      <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
## $ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…

We can also choose the columns we want by negation, that is, you can specify which columns to drop instead of keep. This way, all variables not listed are kept.

# we can also use negation
all_but_year <- flights %>% 
  select(-year)
glimpse(all_but_year)
## Rows: 162,049
## Columns: 15
## $ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
## $ arr_time  <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907…
## $ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, …
## $ carrier   <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", …
## $ tailnum   <chr> "N508AS", "N195UW", "N37422", "N547UW", "N762AS", "N806DN", …
## $ flight    <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569,…
## $ origin    <chr> "PDX", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "PDX", "SEA…
## $ dest      <chr> "ANC", "CLT", "IAH", "CLT", "ANC", "DTW", "ORD", "IAH", "DEN…
## $ air_time  <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, …
## $ distance  <dbl> 1542, 2279, 1825, 2282, 1448, 1927, 1721, 1825, 1024, 2402, …
## $ hour      <dbl> 0, 0, 0, 0, 0, 0, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, …
## $ minute    <dbl> 1, 4, 8, 28, 34, 37, 46, 26, 27, 36, 41, 49, 50, 57, 57, 58,…

dplyr::select comes with several other helper functions…

depart <- flights %>%
  select(starts_with("dep_"))
glimpse(depart)
## Rows: 162,049
## Columns: 2
## $ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
times <- flights %>%
  select(contains("time"))
glimpse(times)
## Rows: 162,049
## Columns: 3
## $ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557, …
## $ arr_time <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907,…
## $ air_time <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, 8…
# note that we are not actually saving the new dataframe here
flights %>%
  select(-contains("time")) %>% head()
## # A tibble: 6 × 13
##    year month   day dep_delay arr_delay carrier tailnum flight origin dest 
##   <int> <int> <int>     <dbl>     <dbl> <chr>   <chr>    <int> <chr>  <chr>
## 1  2014     1     1        96        70 AS      N508AS     145 PDX    ANC  
## 2  2014     1     1        -6       -23 US      N195UW    1830 SEA    CLT  
## 3  2014     1     1        13        -4 UA      N37422    1609 PDX    IAH  
## 4  2014     1     1        -2       -23 US      N547UW     466 PDX    CLT  
## 5  2014     1     1        44        43 AS      N762AS     121 SEA    ANC  
## 6  2014     1     1        82        88 DL      N806DN    1823 SEA    DTW  
## # … with 3 more variables: distance <dbl>, hour <dbl>, minute <dbl>
delays <- flights %>%
  select(ends_with("delay"))
glimpse(delays)
## Rows: 162,049
## Columns: 2
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
## $ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, …

One of my favorite select helper functions is everything(), which allows you to use select to keep all your variables, but easily rearrange the columns without having to list all the variables to keep/drop.

new_order <- flights %>% 
  select(origin, dest, everything())
head(new_order)
## # A tibble: 6 × 16
##   origin dest   year month   day dep_time dep_delay arr_time arr_delay carrier
##   <chr>  <chr> <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>  
## 1 PDX    ANC    2014     1     1        1        96      235        70 AS     
## 2 SEA    CLT    2014     1     1        4        -6      738       -23 US     
## 3 PDX    IAH    2014     1     1        8        13      548        -4 UA     
## 4 PDX    CLT    2014     1     1       28        -2      800       -23 US     
## 5 SEA    ANC    2014     1     1       34        44      325        43 AS     
## 6 SEA    DTW    2014     1     1       37        82      747        88 DL     
## # … with 6 more variables: tailnum <chr>, flight <int>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>
# with negation
new_order2 <- flights %>% 
  select(origin, dest, everything(), -year)
head(new_order2)
## # A tibble: 6 × 15
##   origin dest  month   day dep_time dep_delay arr_time arr_delay carrier tailnum
##   <chr>  <chr> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>  
## 1 PDX    ANC       1     1        1        96      235        70 AS      N508AS 
## 2 SEA    CLT       1     1        4        -6      738       -23 US      N195UW 
## 3 PDX    IAH       1     1        8        13      548        -4 UA      N37422 
## 4 PDX    CLT       1     1       28        -2      800       -23 US      N547UW 
## 5 SEA    ANC       1     1       34        44      325        43 AS      N762AS 
## 6 SEA    DTW       1     1       37        82      747        88 DL      N806DN 
## # … with 5 more variables: flight <int>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>

We can also rename variables within select.

flights2 <- flights %>%
  select(tail_num = tailnum, everything())
head(flights2)
## # A tibble: 6 × 16
##   tail_num  year month   day dep_time dep_delay arr_time arr_delay carrier
##   <chr>    <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>  
## 1 N508AS    2014     1     1        1        96      235        70 AS     
## 2 N195UW    2014     1     1        4        -6      738       -23 US     
## 3 N37422    2014     1     1        8        13      548        -4 UA     
## 4 N547UW    2014     1     1       28        -2      800       -23 US     
## 5 N762AS    2014     1     1       34        44      325        43 AS     
## 6 N806DN    2014     1     1       37        82      747        88 DL     
## # … with 7 more variables: flight <int>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>

If you don’t want to move the renamed variables within your dataframe, you can use the rename function.

flights3 <- flights %>%
  rename(tail_num = tailnum)
glimpse(flights3)
## Rows: 162,049
## Columns: 16
## $ year      <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
## $ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
## $ arr_time  <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907…
## $ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, …
## $ carrier   <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", …
## $ tail_num  <chr> "N508AS", "N195UW", "N37422", "N547UW", "N762AS", "N806DN", …
## $ flight    <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569,…
## $ origin    <chr> "PDX", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "PDX", "SEA…
## $ dest      <chr> "ANC", "CLT", "IAH", "CLT", "ANC", "DTW", "ORD", "IAH", "DEN…
## $ air_time  <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, …
## $ distance  <dbl> 1542, 2279, 1825, 2282, 1448, 1927, 1721, 1825, 1024, 2402, …
## $ hour      <dbl> 0, 0, 0, 0, 0, 0, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, …
## $ minute    <dbl> 1, 4, 8, 28, 34, 37, 46, 26, 27, 36, 41, 49, 50, 57, 57, 58,…

2.2 dplyr::filter

As we have previously seen, filter has many flexible uses:

# flights taking off from PDX
pdx <- flights %>% 
  filter(origin == "PDX")
head(pdx)
## # A tibble: 6 × 16
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight
##   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int>
## 1  2014     1     1        1        96      235        70 AS      N508AS     145
## 2  2014     1     1        8        13      548        -4 UA      N37422    1609
## 3  2014     1     1       28        -2      800       -23 US      N547UW     466
## 4  2014     1     1      526        -4     1148        15 UA      N813UA     229
## 5  2014     1     1      541         1      911         4 UA      N36476    1569
## 6  2014     1     1      549        24      907        12 US      N548UW     649
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>
# january flights from PDX
pdx_jan <- flights %>% 
  filter(origin == "PDX", month == 1) # the comma is an "and"
head(pdx_jan)
## # A tibble: 6 × 16
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight
##   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int>
## 1  2014     1     1        1        96      235        70 AS      N508AS     145
## 2  2014     1     1        8        13      548        -4 UA      N37422    1609
## 3  2014     1     1       28        -2      800       -23 US      N547UW     466
## 4  2014     1     1      526        -4     1148        15 UA      N813UA     229
## 5  2014     1     1      541         1      911         4 UA      N36476    1569
## 6  2014     1     1      549        24      907        12 US      N548UW     649
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>
# flights to ATL (Atlanta) or BNA (Nashville)
to_south <- flights %>% 
  filter(dest == "ATL" | dest == "BNA") %>% # | is "or"
  select(origin, dest, everything())
head(to_south)
## # A tibble: 6 × 16
##   origin dest   year month   day dep_time dep_delay arr_time arr_delay carrier
##   <chr>  <chr> <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>  
## 1 SEA    ATL    2014     1     1      624        -6     1401        -6 DL     
## 2 SEA    ATL    2014     1     1      802        -3     1533       -17 AS     
## 3 SEA    ATL    2014     1     1      824        -1     1546       -14 DL     
## 4 PDX    ATL    2014     1     1      944        -6     1727        -8 AS     
## 5 PDX    ATL    2014     1     1     1054        94     1807        84 DL     
## 6 SEA    ATL    2014     1     1     1158         6     1915       -14 DL     
## # … with 6 more variables: tailnum <chr>, flight <int>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>
# flights from PDX to ATL (Atlanta) or BNA (Nashville)
pdx_to_south <- flights %>% 
  filter(origin == "PDX", dest == "ATL" | dest == "BNA") %>% # | is "or"
  select(origin, dest, everything())
head(pdx_to_south)
## # A tibble: 6 × 16
##   origin dest   year month   day dep_time dep_delay arr_time arr_delay carrier
##   <chr>  <chr> <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>  
## 1 PDX    ATL    2014     1     1      944        -6     1727        -8 AS     
## 2 PDX    ATL    2014     1     1     1054        94     1807        84 DL     
## 3 PDX    ATL    2014     1     1     1323        -2     2038       -15 DL     
## 4 PDX    ATL    2014     1     1     2253         8      611         4 DL     
## 5 PDX    ATL    2014     1     2      627        -3     1350        -7 DL     
## 6 PDX    ATL    2014     1     2      918        -2     1643        -2 DL     
## # … with 6 more variables: tailnum <chr>, flight <int>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>
# alternatively, using group membership
south_dests <- c("ATL", "BNA")
pdx_to_south2 <- flights %>% 
  filter(origin == "PDX", dest %in% south_dests) %>% 
  select(origin, dest, everything())
head(pdx_to_south2)
## # A tibble: 6 × 16
##   origin dest   year month   day dep_time dep_delay arr_time arr_delay carrier
##   <chr>  <chr> <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>  
## 1 PDX    ATL    2014     1     1      944        -6     1727        -8 AS     
## 2 PDX    ATL    2014     1     1     1054        94     1807        84 DL     
## 3 PDX    ATL    2014     1     1     1323        -2     2038       -15 DL     
## 4 PDX    ATL    2014     1     1     2253         8      611         4 DL     
## 5 PDX    ATL    2014     1     2      627        -3     1350        -7 DL     
## 6 PDX    ATL    2014     1     2      918        -2     1643        -2 DL     
## # … with 6 more variables: tailnum <chr>, flight <int>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>
# flights delayed by 1 hour or more
delay_1plus <- flights %>%
  filter(dep_delay >= 60)
head(delay_1plus)
## # A tibble: 6 × 16
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight
##   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int>
## 1  2014     1     1        1        96      235        70 AS      N508AS     145
## 2  2014     1     1       37        82      747        88 DL      N806DN    1823
## 3  2014     1     1      346       227      936       219 UA      N14219    1481
## 4  2014     1     1      650        90     1037        91 US      N626AW     460
## 5  2014     1     1      959       164     1137       157 AS      N534AS     805
## 6  2014     1     1     1008        68     1242        64 AS      N788AS     456
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>
# flights delayed by 1 hour, but not more than 2 hours
delay_1hr <- flights %>%
  filter(dep_delay >= 60, dep_delay < 120)
head(delay_1hr)
## # A tibble: 6 × 16
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight
##   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int>
## 1  2014     1     1        1        96      235        70 AS      N508AS     145
## 2  2014     1     1       37        82      747        88 DL      N806DN    1823
## 3  2014     1     1      650        90     1037        91 US      N626AW     460
## 4  2014     1     1     1008        68     1242        64 AS      N788AS     456
## 5  2014     1     1     1014        75     1613        81 UA      N37408    1444
## 6  2014     1     1     1036        81     1408        63 OO      N218AG    3466
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>
range(delay_1hr$dep_delay, na.rm = TRUE)
## [1]  60 119
# even more efficient using between (always inclusive)
delay_bwn <- flights %>%
  filter(between(dep_delay, 60, 119))
head(delay_bwn)
## # A tibble: 6 × 16
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight
##   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int>
## 1  2014     1     1        1        96      235        70 AS      N508AS     145
## 2  2014     1     1       37        82      747        88 DL      N806DN    1823
## 3  2014     1     1      650        90     1037        91 US      N626AW     460
## 4  2014     1     1     1008        68     1242        64 AS      N788AS     456
## 5  2014     1     1     1014        75     1613        81 UA      N37408    1444
## 6  2014     1     1     1036        81     1408        63 OO      N218AG    3466
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>
range(delay_bwn$dep_delay, na.rm = TRUE)
## [1]  60 119

2.3 dplyr::arrange

# default is ascending order
flights %>% 
  arrange(year, month, day) %>% head(n=20)
## # A tibble: 20 × 16
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
##    <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>  
##  1  2014     1     1        1        96      235        70 AS      N508AS 
##  2  2014     1     1        4        -6      738       -23 US      N195UW 
##  3  2014     1     1        8        13      548        -4 UA      N37422 
##  4  2014     1     1       28        -2      800       -23 US      N547UW 
##  5  2014     1     1       34        44      325        43 AS      N762AS 
##  6  2014     1     1       37        82      747        88 DL      N806DN 
##  7  2014     1     1      346       227      936       219 UA      N14219 
##  8  2014     1     1      526        -4     1148        15 UA      N813UA 
##  9  2014     1     1      527         7      917        24 UA      N75433 
## 10  2014     1     1      536         1     1334        -6 UA      N574UA 
## 11  2014     1     1      541         1      911         4 UA      N36476 
## 12  2014     1     1      549        24      907        12 US      N548UW 
## 13  2014     1     1      550         0      837       -12 DL      N660DL 
## 14  2014     1     1      557        -3     1134       -16 AA      N3JLAA 
## 15  2014     1     1      557        -3      825       -25 AS      N562AS 
## 16  2014     1     1      558        -2      801        -2 AS      N402AS 
## 17  2014     1     1      559        -1      916        -9 F9      N210FR 
## 18  2014     1     1      600         0     1151       -19 AA      N3JFAA 
## 19  2014     1     1      600       -10      842        -8 AS      N786AS 
## 20  2014     1     1      602        -3      943         5 F9      N201FR 
## # … with 7 more variables: flight <int>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>
# descending order
flights %>% 
  arrange(desc(year), desc(month), desc(day)) %>% head(n=20)
## # A tibble: 20 × 16
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
##    <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>  
##  1  2014    12    31        2        12      601        31 AA      N3JKAA 
##  2  2014    12    31       27        -3      623         3 AA      N3EWAA 
##  3  2014    12    31       39        14      324         4 AS      N762AS 
##  4  2014    12    31       40         0      549         0 DL      N757AT 
##  5  2014    12    31       52        -8      917       -21 AA      N3JFAA 
##  6  2014    12    31       54         4      621        17 DL      N128DL 
##  7  2014    12    31       56        61      848        80 DL      N655DL 
##  8  2014    12    31      512        -3      904         4 US      N653AW 
##  9  2014    12    31      515        -5      855         5 US      N580UW 
## 10  2014    12    31      534         4      859         7 UA      N34460 
## 11  2014    12    31      546         1      916        -4 WN      N8323C 
## 12  2014    12    31      548        -2     1351       -13 UA      N461UA 
## 13  2014    12    31      549         4     1208        12 UA      N68807 
## 14  2014    12    31      550         0      922         2 WN      N797MX 
## 15  2014    12    31      551        -4     1202        12 AA      N3HXAA 
## 16  2014    12    31      551        -9      744       -15 AS      N570AS 
## 17  2014    12    31      555       -10      824        -1 AS      N548AS 
## 18  2014    12    31      558        -2      849         0 DL      N668DN 
## 19  2014    12    31      558        -2     1149         4 AA      N436AA 
## 20  2014    12    31      558        -2      738        -4 AS      N585AS 
## # … with 7 more variables: flight <int>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>

2.4 dplyr::distinct

# unique origin-dest combinations
flights %>% 
  select(origin, dest) %>% 
  distinct %>% head(n=50)
## # A tibble: 50 × 2
##    origin dest 
##    <chr>  <chr>
##  1 PDX    ANC  
##  2 SEA    CLT  
##  3 PDX    IAH  
##  4 PDX    CLT  
##  5 SEA    ANC  
##  6 SEA    DTW  
##  7 SEA    ORD  
##  8 SEA    DEN  
##  9 SEA    EWR  
## 10 PDX    DEN  
## # … with 40 more rows
# all unique destinations from PDX (there are 49, so we'll just list the first few)
from_pdx <- flights %>% 
  filter(origin == "PDX") %>% 
  select(origin, dest) %>%
  distinct(dest)
head(from_pdx)
## # A tibble: 6 × 1
##   dest 
##   <chr>
## 1 ANC  
## 2 IAH  
## 3 CLT  
## 4 DEN  
## 5 PHX  
## 6 ORD

2.5 dplyr::mutate

mutate is used to transform existing variables or create new ones; here, we are using it to create an indicator variable that identifies flights that were entirely on-time.

# add total delay variable
flights %>%
  mutate(tot_delay = dep_delay + arr_delay) %>%
  select(origin, dest, ends_with("delay"), everything()) %>% 
  head
## # A tibble: 6 × 17
##   origin dest  dep_delay arr_delay tot_delay  year month   day dep_time arr_time
##   <chr>  <chr>     <dbl>     <dbl>     <dbl> <int> <int> <int>    <int>    <int>
## 1 PDX    ANC          96        70       166  2014     1     1        1      235
## 2 SEA    CLT          -6       -23       -29  2014     1     1        4      738
## 3 PDX    IAH          13        -4         9  2014     1     1        8      548
## 4 PDX    CLT          -2       -23       -25  2014     1     1       28      800
## 5 SEA    ANC          44        43        87  2014     1     1       34      325
## 6 SEA    DTW          82        88       170  2014     1     1       37      747
## # … with 7 more variables: carrier <chr>, tailnum <chr>, flight <int>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>
# flights that were delayed at departure had on time or early arrivals?
arrivals <- flights %>%
  mutate(arr_ok = ifelse(dep_delay > 0 & arr_delay <= 0, 1, 0)) %>% 
  select(origin, dest, ends_with("delay"), carrier, arr_ok)

# peek at it
arrivals %>%
  filter(arr_ok == 1) %>%
  head
## # A tibble: 6 × 6
##   origin dest  dep_delay arr_delay carrier arr_ok
##   <chr>  <chr>     <dbl>     <dbl> <chr>    <dbl>
## 1 PDX    IAH          13        -4 UA           1
## 2 SEA    EWR           1        -6 UA           1
## 3 SEA    SAN           2       -12 AS           1
## 4 PDX    EWR           2       -19 UA           1
## 5 SEA    IAH          13        -4 UA           1
## 6 PDX    IAD          10        -4 UA           1

2.6 dplyr::summarise (or dplyr::summarize)

We have seen summarise used to calculate summary statistics, like so:

flights %>%
  summarise(mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 1
##   `mean(dep_delay, na.rm = TRUE)`
##                             <dbl>
## 1                            6.13

But this can get tedious when we want to compute several summary statistics, like so:

# we can also name that variable, and summarise multiple variables
flights %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE),
            sd_delay = sd(dep_delay, na.rm = TRUE),
            median_delay = median(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 3
##   mean_delay sd_delay median_delay
##        <dbl>    <dbl>        <dbl>
## 1       6.13     29.1           -2

summarise together with across() lets us automate this process a bit:

flights %>%
  filter(!is.na(dep_delay)) %>%
  select(dep_delay) %>%
  summarise(
    across(everything(), 
           lst(mean, sd, median)) # making this a list gets us nice column names
    )
## # A tibble: 1 × 3
##   dep_delay_mean dep_delay_sd dep_delay_median
##            <dbl>        <dbl>            <dbl>
## 1           6.13         29.1               -2
# same thing over all numeric columns
flights %>%
  filter(!is.na(dep_delay)) %>%
  summarise(across(where(is.numeric), lst(mean, sd, median)))
## # A tibble: 1 × 36
##   year_mean year_sd year_median month_mean month_sd month_median day_mean day_sd
##       <dbl>   <dbl>       <dbl>      <dbl>    <dbl>        <dbl>    <dbl>  <dbl>
## 1      2014       0        2014       6.62     3.32            7     15.8   8.79
## # … with 28 more variables: day_median <dbl>, dep_time_mean <dbl>,
## #   dep_time_sd <dbl>, dep_time_median <dbl>, dep_delay_mean <dbl>,
## #   dep_delay_sd <dbl>, dep_delay_median <dbl>, arr_time_mean <dbl>,
## #   arr_time_sd <dbl>, arr_time_median <int>, arr_delay_mean <dbl>,
## #   arr_delay_sd <dbl>, arr_delay_median <dbl>, flight_mean <dbl>,
## #   flight_sd <dbl>, flight_median <dbl>, air_time_mean <dbl>,
## #   air_time_sd <dbl>, air_time_median <dbl>, distance_mean <dbl>, …
# combine with pivot_longer, change names too
flights %>%
  filter(!is.na(dep_delay)) %>%
  summarise(across(dep_delay, lst(mean, stdev=sd, median))) %>% 
  pivot_longer(everything(), names_to="delay_stat", values_to="value", names_prefix = "dep_delay_")
## # A tibble: 3 × 2
##   delay_stat value
##   <chr>      <dbl>
## 1 mean        6.13
## 2 stdev      29.1 
## 3 median     -2

We can use aggregating functions in summarise

# how many unique destinations?
summary_table <- flights %>% 
  summarise(tot_flights = n(),
            tot_planes = n_distinct(tailnum),
            tot_carriers = n_distinct(carrier),
            tot_dests = n_distinct(dest),
            tot_origins = n_distinct(origin))

summary_table
## # A tibble: 1 × 5
##   tot_flights tot_planes tot_carriers tot_dests tot_origins
##         <int>      <int>        <int>     <int>       <int>
## 1      162049       3023           11        71           2

The resulting table is a bit messy; we can reprocess using pivot_longer and separate:

# chain with tidyr functions
summary_table %>% 
  pivot_longer(everything(), names_to="key", values_to="value") %>% 
  separate(key, into = c("tot", "entity")) %>% 
  select(-tot, total = value)
## # A tibble: 5 × 2
##   entity    total
##   <chr>     <int>
## 1 flights  162049
## 2 planes     3023
## 3 carriers     11
## 4 dests        71
## 5 origins       2

Or we can do it in a single call to pivot_longer():

summary_table %>% 
  pivot_longer(
    everything(), # grab all columns
    names_to="entity", # make a new column called "entity" to store the old column names
    names_prefix="tot_", # remove the "tot_" prefix from the old column names
    values_to="total") # put the values in a column called "total"
## # A tibble: 5 × 2
##   entity    total
##   <chr>     <int>
## 1 flights  162049
## 2 planes     3023
## 3 carriers     11
## 4 dests        71
## 5 origins       2

3 tidyr

We’ll work with a made up dataframe:

df <- data.frame(
  id = 1:10,
  date = as.Date('2015-01-01') + 0:9,
  q1_m1_w1 = rnorm(10, 0, 1),
  q1_m1_w2 = rnorm(10, 0, 1),
  q1_m2_w3 = rnorm(10, 0, 1),
  q2_m1_w1 = rnorm(10, 0, 1),
  q2_m2_w1 = rnorm(10, 0, 1),
  q2_m2_w2 = rnorm(10, 0, 1)
)
# HLO
head(df)
##   id       date   q1_m1_w1   q1_m1_w2   q1_m2_w3   q2_m1_w1   q2_m2_w1
## 1  1 2015-01-01 -1.1700743 -1.3313873  0.7225514 -1.1991584 -1.1453316
## 2  2 2015-01-02  0.7013156  0.4073680  1.4849110  0.9400779  0.8611532
## 3  3 2015-01-03  1.6318222 -0.7944370  1.5842153  0.6983900 -2.6945197
## 4  4 2015-01-04 -0.2343429  0.1907154 -0.9717708  0.1686624 -0.4788981
## 5  5 2015-01-05  0.6487112 -0.1029043 -0.6050051 -0.5839066 -0.2390221
## 6  6 2015-01-06  0.9597311  0.5272105  2.0258127 -1.6533158 -0.1458568
##     q2_m2_w2
## 1 -0.9506309
## 2  0.1975220
## 3  0.5255518
## 4 -0.4800159
## 5  0.7514034
## 6  0.2399251
glimpse(df)
## Rows: 10
## Columns: 8
## $ id       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ date     <date> 2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05, 2…
## $ q1_m1_w1 <dbl> -1.1700743, 0.7013156, 1.6318222, -0.2343429, 0.6487112, 0.9…
## $ q1_m1_w2 <dbl> -1.33138729, 0.40736797, -0.79443701, 0.19071541, -0.10290434…
## $ q1_m2_w3 <dbl> 0.7225514, 1.4849110, 1.5842153, -0.9717708, -0.6050051, 2.02…
## $ q2_m1_w1 <dbl> -1.1991584, 0.9400779, 0.6983900, 0.1686624, -0.5839066, -1.6…
## $ q2_m2_w1 <dbl> -1.1453316, 0.8611532, -2.6945197, -0.4788981, -0.2390221, -0…
## $ q2_m2_w2 <dbl> -0.9506309, 0.1975220, 0.5255518, -0.4800159, 0.7514034, 0.23…

3.1 tidyr::pivot_longer

First, let’s pivot…

df_tidy <- df %>%
  pivot_longer(q1_m1_w1:q2_m2_w2, names_to="key", values_to="value")
head(df_tidy)
## # A tibble: 6 × 4
##      id date       key       value
##   <int> <date>     <chr>     <dbl>
## 1     1 2015-01-01 q1_m1_w1 -1.17 
## 2     1 2015-01-01 q1_m1_w2 -1.33 
## 3     1 2015-01-01 q1_m2_w3  0.723
## 4     1 2015-01-01 q2_m1_w1 -1.20 
## 5     1 2015-01-01 q2_m2_w1 -1.15 
## 6     1 2015-01-01 q2_m2_w2 -0.951

3.2 tidyr::separate

# separate 1 col into 3 cols
df_sep <- df_tidy %>%
  separate(key, into = c("quarter", "month", "week"))
head(df_sep)
## # A tibble: 6 × 6
##      id date       quarter month week   value
##   <int> <date>     <chr>   <chr> <chr>  <dbl>
## 1     1 2015-01-01 q1      m1    w1    -1.17 
## 2     1 2015-01-01 q1      m1    w2    -1.33 
## 3     1 2015-01-01 q1      m2    w3     0.723
## 4     1 2015-01-01 q2      m1    w1    -1.20 
## 5     1 2015-01-01 q2      m2    w1    -1.15 
## 6     1 2015-01-01 q2      m2    w2    -0.951
# separate 1 col into 2 cols
df_sep2 <- df_tidy %>%
  separate(key, into = c("quarter", "period"), extra = "merge")
head(df_sep2)
## # A tibble: 6 × 5
##      id date       quarter period  value
##   <int> <date>     <chr>   <chr>   <dbl>
## 1     1 2015-01-01 q1      m1_w1  -1.17 
## 2     1 2015-01-01 q1      m1_w2  -1.33 
## 3     1 2015-01-01 q1      m2_w3   0.723
## 4     1 2015-01-01 q2      m1_w1  -1.20 
## 5     1 2015-01-01 q2      m2_w1  -1.15 
## 6     1 2015-01-01 q2      m2_w2  -0.951

stringr vs. tidyr separate by regular expression

3.3 tidyr::extract

Extract is essentially the same as separate, let’s see how…

# extract
df_ext <- df_sep2 %>%
  extract(period, into = "month")
head(df_ext)
## # A tibble: 6 × 5
##      id date       quarter month  value
##   <int> <date>     <chr>   <chr>  <dbl>
## 1     1 2015-01-01 q1      m1    -1.17 
## 2     1 2015-01-01 q1      m1    -1.33 
## 3     1 2015-01-01 q1      m2     0.723
## 4     1 2015-01-01 q2      m1    -1.20 
## 5     1 2015-01-01 q2      m2    -1.15 
## 6     1 2015-01-01 q2      m2    -0.951

We can see that, by default, it’s taking the first alphanumeric chunk of the value, and extracting it. We can specify much more complex patterns:

# this gives us same output as separate
df_ext <- df_sep2 %>%
  extract(period, into = c("month", "week"), 
          regex = "([[:alnum:]]+)_([[:alnum:]]+)")
head(df_ext)
## # A tibble: 6 × 6
##      id date       quarter month week   value
##   <int> <date>     <chr>   <chr> <chr>  <dbl>
## 1     1 2015-01-01 q1      m1    w1    -1.17 
## 2     1 2015-01-01 q1      m1    w2    -1.33 
## 3     1 2015-01-01 q1      m2    w3     0.723
## 4     1 2015-01-01 q2      m1    w1    -1.20 
## 5     1 2015-01-01 q2      m2    w1    -1.15 
## 6     1 2015-01-01 q2      m2    w2    -0.951

3.4 tidyr::unite

The unite() function lets us reverse the process:

# let's say we want to combine quarter and month with an underscore
df_uni <- df_sep %>%
  unite(period, quarter:month) # sep = "_" is the default arg
head(df_uni)
## # A tibble: 6 × 5
##      id date       period week   value
##   <int> <date>     <chr>  <chr>  <dbl>
## 1     1 2015-01-01 q1_m1  w1    -1.17 
## 2     1 2015-01-01 q1_m1  w2    -1.33 
## 3     1 2015-01-01 q1_m2  w3     0.723
## 4     1 2015-01-01 q2_m1  w1    -1.20 
## 5     1 2015-01-01 q2_m2  w1    -1.15 
## 6     1 2015-01-01 q2_m2  w2    -0.951
# let's say we want to combine quarter and month with nothing
df_uni <- df_sep %>%
  unite(period, quarter:month, sep = "")
head(df_uni)
## # A tibble: 6 × 5
##      id date       period week   value
##   <int> <date>     <chr>  <chr>  <dbl>
## 1     1 2015-01-01 q1m1   w1    -1.17 
## 2     1 2015-01-01 q1m1   w2    -1.33 
## 3     1 2015-01-01 q1m2   w3     0.723
## 4     1 2015-01-01 q2m1   w1    -1.20 
## 5     1 2015-01-01 q2m2   w1    -1.15 
## 6     1 2015-01-01 q2m2   w2    -0.951

3.5 tidyr::pivot_wider

# finally let's spread
df_spread <- df_uni %>%
  pivot_wider(names_from=week, values_from=value)  # fill = NA is default arg
head(df_spread)
## # A tibble: 6 × 6
##      id date       period     w1     w2     w3
##   <int> <date>     <chr>   <dbl>  <dbl>  <dbl>
## 1     1 2015-01-01 q1m1   -1.17  -1.33  NA    
## 2     1 2015-01-01 q1m2   NA     NA      0.723
## 3     1 2015-01-01 q2m1   -1.20  NA     NA    
## 4     1 2015-01-01 q2m2   -1.15  -0.951 NA    
## 5     2 2015-01-02 q1m1    0.701  0.407 NA    
## 6     2 2015-01-02 q1m2   NA     NA      1.48

3.6 Putting it all together (pivot_longer() %>% separate() %>% pivot_wider())

All in one, if we had wanted to end up with one row per month:

df_tidiest <- df %>% 
  pivot_longer(q1_m1_w1:q2_m2_w2, names_to="key", values_to="value") %>% 
  separate(key, into = c("quarter", "month", "week")) %>% 
  pivot_wider(names_from=week, values_from=value)
head(df_tidiest)
## # A tibble: 6 × 7
##      id date       quarter month     w1     w2     w3
##   <int> <date>     <chr>   <chr>  <dbl>  <dbl>  <dbl>
## 1     1 2015-01-01 q1      m1    -1.17  -1.33  NA    
## 2     1 2015-01-01 q1      m2    NA     NA      0.723
## 3     1 2015-01-01 q2      m1    -1.20  NA     NA    
## 4     1 2015-01-01 q2      m2    -1.15  -0.951 NA    
## 5     2 2015-01-02 q1      m1     0.701  0.407 NA    
## 6     2 2015-01-02 q1      m2    NA     NA      1.48

4 broom

“The broom package takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames.” So, broom tidies output from other R functions that are un-tidy.

See here for list of functions: https://github.com/dgrtwo/broom

Vignette: ftp://cran.r-project.org/pub/R/web/packages/broom/vignettes/broom.html

fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear), 
          data = mtcars)

Un-tidy output from lm

summary(fit)
## 
## Call:
## lm(formula = mpg ~ qsec + factor(am) + wt + factor(gear), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5064 -1.5220 -0.7517  1.3841  4.6345 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     9.3650     8.3730   1.118  0.27359    
## qsec            1.2449     0.3828   3.252  0.00317 ** 
## factor(am)1     3.1505     1.9405   1.624  0.11654    
## wt             -3.9263     0.7428  -5.286 1.58e-05 ***
## factor(gear)4  -0.2682     1.6555  -0.162  0.87257    
## factor(gear)5  -0.2697     2.0632  -0.131  0.89698    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.55 on 26 degrees of freedom
## Multiple R-squared:  0.8498, Adjusted R-squared:  0.8209 
## F-statistic: 29.43 on 5 and 26 DF,  p-value: 6.379e-10

Tidy output from broom

tidy(fit)
## # A tibble: 6 × 5
##   term          estimate std.error statistic   p.value
##   <chr>            <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)      9.37      8.37      1.12  0.274    
## 2 qsec             1.24      0.383     3.25  0.00317  
## 3 factor(am)1      3.15      1.94      1.62  0.117    
## 4 wt              -3.93      0.743    -5.29  0.0000158
## 5 factor(gear)4   -0.268     1.66     -0.162 0.873    
## 6 factor(gear)5   -0.270     2.06     -0.131 0.897

This is very helfpul for making e.g. regression tables for papers!

Creative Commons License