Making tables is often when we stretch our dplyr
skills.
Here are some additional more advanced examples that you may find
helpful for today’s assignment!
We’ll use the pnwflights14
package to practice our
dplyr
skills. We need to download the package from github
using devtools
.
# once per machine
install.packages("devtools")
devtools::install_github("ismayc/pnwflights14")
Now, we need to load the flights
dataset from the
pnwflights14
package.
# once per work session
library(pnwflights14)
data("flights", package = "pnwflights14")
dplyr
Tricksdplyr::select
Recall that we use select to specify which columns in a dataframe you’d like to keep by name:
# keep these 2 cols
mini_flights <- flights %>%
select(carrier, flight)
glimpse(mini_flights)
## Rows: 162,049
## Columns: 2
## $ carrier <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", "U…
## $ flight <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569, 6…
# keep first five cols
first_five <- flights %>%
select(year, month, day, dep_time, dep_delay)
glimpse(first_five)
## Rows: 162,049
## Columns: 5
## $ year <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
# alternatively, specify range
first_five <- flights %>%
select(year:dep_delay)
glimpse(first_five)
## Rows: 162,049
## Columns: 5
## $ year <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
We can also choose the columns we want by negation, that is, you can specify which columns to drop instead of keep. This way, all variables not listed are kept.
# we can also use negation
all_but_year <- flights %>%
select(-year)
glimpse(all_but_year)
## Rows: 162,049
## Columns: 15
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
## $ arr_time <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907…
## $ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, …
## $ carrier <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", …
## $ tailnum <chr> "N508AS", "N195UW", "N37422", "N547UW", "N762AS", "N806DN", …
## $ flight <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569,…
## $ origin <chr> "PDX", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "PDX", "SEA…
## $ dest <chr> "ANC", "CLT", "IAH", "CLT", "ANC", "DTW", "ORD", "IAH", "DEN…
## $ air_time <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, …
## $ distance <dbl> 1542, 2279, 1825, 2282, 1448, 1927, 1721, 1825, 1024, 2402, …
## $ hour <dbl> 0, 0, 0, 0, 0, 0, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, …
## $ minute <dbl> 1, 4, 8, 28, 34, 37, 46, 26, 27, 36, 41, 49, 50, 57, 57, 58,…
dplyr::select
comes with several other helper
functions…
depart <- flights %>%
select(starts_with("dep_"))
glimpse(depart)
## Rows: 162,049
## Columns: 2
## $ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
times <- flights %>%
select(contains("time"))
glimpse(times)
## Rows: 162,049
## Columns: 3
## $ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557, …
## $ arr_time <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907,…
## $ air_time <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, 8…
# note that we are not actually saving the new dataframe here
flights %>%
select(-contains("time")) %>% head()
## # A tibble: 6 × 13
## year month day dep_delay arr_delay carrier tailnum flight origin dest
## <int> <int> <int> <dbl> <dbl> <chr> <chr> <int> <chr> <chr>
## 1 2014 1 1 96 70 AS N508AS 145 PDX ANC
## 2 2014 1 1 -6 -23 US N195UW 1830 SEA CLT
## 3 2014 1 1 13 -4 UA N37422 1609 PDX IAH
## 4 2014 1 1 -2 -23 US N547UW 466 PDX CLT
## 5 2014 1 1 44 43 AS N762AS 121 SEA ANC
## 6 2014 1 1 82 88 DL N806DN 1823 SEA DTW
## # … with 3 more variables: distance <dbl>, hour <dbl>, minute <dbl>
delays <- flights %>%
select(ends_with("delay"))
glimpse(delays)
## Rows: 162,049
## Columns: 2
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
## $ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, …
One of my favorite select helper functions is
everything()
, which allows you to use select to keep
all your variables, but easily rearrange the columns
without having to list all the variables to keep/drop.
new_order <- flights %>%
select(origin, dest, everything())
head(new_order)
## # A tibble: 6 × 16
## origin dest year month day dep_time dep_delay arr_time arr_delay carrier
## <chr> <chr> <int> <int> <int> <int> <dbl> <int> <dbl> <chr>
## 1 PDX ANC 2014 1 1 1 96 235 70 AS
## 2 SEA CLT 2014 1 1 4 -6 738 -23 US
## 3 PDX IAH 2014 1 1 8 13 548 -4 UA
## 4 PDX CLT 2014 1 1 28 -2 800 -23 US
## 5 SEA ANC 2014 1 1 34 44 325 43 AS
## 6 SEA DTW 2014 1 1 37 82 747 88 DL
## # … with 6 more variables: tailnum <chr>, flight <int>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>
# with negation
new_order2 <- flights %>%
select(origin, dest, everything(), -year)
head(new_order2)
## # A tibble: 6 × 15
## origin dest month day dep_time dep_delay arr_time arr_delay carrier tailnum
## <chr> <chr> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr>
## 1 PDX ANC 1 1 1 96 235 70 AS N508AS
## 2 SEA CLT 1 1 4 -6 738 -23 US N195UW
## 3 PDX IAH 1 1 8 13 548 -4 UA N37422
## 4 PDX CLT 1 1 28 -2 800 -23 US N547UW
## 5 SEA ANC 1 1 34 44 325 43 AS N762AS
## 6 SEA DTW 1 1 37 82 747 88 DL N806DN
## # … with 5 more variables: flight <int>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>
We can also rename variables within select.
flights2 <- flights %>%
select(tail_num = tailnum, everything())
head(flights2)
## # A tibble: 6 × 16
## tail_num year month day dep_time dep_delay arr_time arr_delay carrier
## <chr> <int> <int> <int> <int> <dbl> <int> <dbl> <chr>
## 1 N508AS 2014 1 1 1 96 235 70 AS
## 2 N195UW 2014 1 1 4 -6 738 -23 US
## 3 N37422 2014 1 1 8 13 548 -4 UA
## 4 N547UW 2014 1 1 28 -2 800 -23 US
## 5 N762AS 2014 1 1 34 44 325 43 AS
## 6 N806DN 2014 1 1 37 82 747 88 DL
## # … with 7 more variables: flight <int>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>
If you don’t want to move the renamed variables within your
dataframe, you can use the rename
function.
flights3 <- flights %>%
rename(tail_num = tailnum)
glimpse(flights3)
## Rows: 162,049
## Columns: 16
## $ year <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
## $ arr_time <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907…
## $ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, …
## $ carrier <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", …
## $ tail_num <chr> "N508AS", "N195UW", "N37422", "N547UW", "N762AS", "N806DN", …
## $ flight <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569,…
## $ origin <chr> "PDX", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "PDX", "SEA…
## $ dest <chr> "ANC", "CLT", "IAH", "CLT", "ANC", "DTW", "ORD", "IAH", "DEN…
## $ air_time <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, …
## $ distance <dbl> 1542, 2279, 1825, 2282, 1448, 1927, 1721, 1825, 1024, 2402, …
## $ hour <dbl> 0, 0, 0, 0, 0, 0, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, …
## $ minute <dbl> 1, 4, 8, 28, 34, 37, 46, 26, 27, 36, 41, 49, 50, 57, 57, 58,…
dplyr::filter
As we have previously seen, filter
has many flexible
uses:
# flights taking off from PDX
pdx <- flights %>%
filter(origin == "PDX")
head(pdx)
## # A tibble: 6 × 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2014 1 1 1 96 235 70 AS N508AS 145
## 2 2014 1 1 8 13 548 -4 UA N37422 1609
## 3 2014 1 1 28 -2 800 -23 US N547UW 466
## 4 2014 1 1 526 -4 1148 15 UA N813UA 229
## 5 2014 1 1 541 1 911 4 UA N36476 1569
## 6 2014 1 1 549 24 907 12 US N548UW 649
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>
# january flights from PDX
pdx_jan <- flights %>%
filter(origin == "PDX", month == 1) # the comma is an "and"
head(pdx_jan)
## # A tibble: 6 × 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2014 1 1 1 96 235 70 AS N508AS 145
## 2 2014 1 1 8 13 548 -4 UA N37422 1609
## 3 2014 1 1 28 -2 800 -23 US N547UW 466
## 4 2014 1 1 526 -4 1148 15 UA N813UA 229
## 5 2014 1 1 541 1 911 4 UA N36476 1569
## 6 2014 1 1 549 24 907 12 US N548UW 649
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>
# flights to ATL (Atlanta) or BNA (Nashville)
to_south <- flights %>%
filter(dest == "ATL" | dest == "BNA") %>% # | is "or"
select(origin, dest, everything())
head(to_south)
## # A tibble: 6 × 16
## origin dest year month day dep_time dep_delay arr_time arr_delay carrier
## <chr> <chr> <int> <int> <int> <int> <dbl> <int> <dbl> <chr>
## 1 SEA ATL 2014 1 1 624 -6 1401 -6 DL
## 2 SEA ATL 2014 1 1 802 -3 1533 -17 AS
## 3 SEA ATL 2014 1 1 824 -1 1546 -14 DL
## 4 PDX ATL 2014 1 1 944 -6 1727 -8 AS
## 5 PDX ATL 2014 1 1 1054 94 1807 84 DL
## 6 SEA ATL 2014 1 1 1158 6 1915 -14 DL
## # … with 6 more variables: tailnum <chr>, flight <int>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>
# flights from PDX to ATL (Atlanta) or BNA (Nashville)
pdx_to_south <- flights %>%
filter(origin == "PDX", dest == "ATL" | dest == "BNA") %>% # | is "or"
select(origin, dest, everything())
head(pdx_to_south)
## # A tibble: 6 × 16
## origin dest year month day dep_time dep_delay arr_time arr_delay carrier
## <chr> <chr> <int> <int> <int> <int> <dbl> <int> <dbl> <chr>
## 1 PDX ATL 2014 1 1 944 -6 1727 -8 AS
## 2 PDX ATL 2014 1 1 1054 94 1807 84 DL
## 3 PDX ATL 2014 1 1 1323 -2 2038 -15 DL
## 4 PDX ATL 2014 1 1 2253 8 611 4 DL
## 5 PDX ATL 2014 1 2 627 -3 1350 -7 DL
## 6 PDX ATL 2014 1 2 918 -2 1643 -2 DL
## # … with 6 more variables: tailnum <chr>, flight <int>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>
# alternatively, using group membership
south_dests <- c("ATL", "BNA")
pdx_to_south2 <- flights %>%
filter(origin == "PDX", dest %in% south_dests) %>%
select(origin, dest, everything())
head(pdx_to_south2)
## # A tibble: 6 × 16
## origin dest year month day dep_time dep_delay arr_time arr_delay carrier
## <chr> <chr> <int> <int> <int> <int> <dbl> <int> <dbl> <chr>
## 1 PDX ATL 2014 1 1 944 -6 1727 -8 AS
## 2 PDX ATL 2014 1 1 1054 94 1807 84 DL
## 3 PDX ATL 2014 1 1 1323 -2 2038 -15 DL
## 4 PDX ATL 2014 1 1 2253 8 611 4 DL
## 5 PDX ATL 2014 1 2 627 -3 1350 -7 DL
## 6 PDX ATL 2014 1 2 918 -2 1643 -2 DL
## # … with 6 more variables: tailnum <chr>, flight <int>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>
# flights delayed by 1 hour or more
delay_1plus <- flights %>%
filter(dep_delay >= 60)
head(delay_1plus)
## # A tibble: 6 × 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2014 1 1 1 96 235 70 AS N508AS 145
## 2 2014 1 1 37 82 747 88 DL N806DN 1823
## 3 2014 1 1 346 227 936 219 UA N14219 1481
## 4 2014 1 1 650 90 1037 91 US N626AW 460
## 5 2014 1 1 959 164 1137 157 AS N534AS 805
## 6 2014 1 1 1008 68 1242 64 AS N788AS 456
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>
# flights delayed by 1 hour, but not more than 2 hours
delay_1hr <- flights %>%
filter(dep_delay >= 60, dep_delay < 120)
head(delay_1hr)
## # A tibble: 6 × 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2014 1 1 1 96 235 70 AS N508AS 145
## 2 2014 1 1 37 82 747 88 DL N806DN 1823
## 3 2014 1 1 650 90 1037 91 US N626AW 460
## 4 2014 1 1 1008 68 1242 64 AS N788AS 456
## 5 2014 1 1 1014 75 1613 81 UA N37408 1444
## 6 2014 1 1 1036 81 1408 63 OO N218AG 3466
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>
range(delay_1hr$dep_delay, na.rm = TRUE)
## [1] 60 119
# even more efficient using between (always inclusive)
delay_bwn <- flights %>%
filter(between(dep_delay, 60, 119))
head(delay_bwn)
## # A tibble: 6 × 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2014 1 1 1 96 235 70 AS N508AS 145
## 2 2014 1 1 37 82 747 88 DL N806DN 1823
## 3 2014 1 1 650 90 1037 91 US N626AW 460
## 4 2014 1 1 1008 68 1242 64 AS N788AS 456
## 5 2014 1 1 1014 75 1613 81 UA N37408 1444
## 6 2014 1 1 1036 81 1408 63 OO N218AG 3466
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>
range(delay_bwn$dep_delay, na.rm = TRUE)
## [1] 60 119
dplyr::arrange
# default is ascending order
flights %>%
arrange(year, month, day) %>% head(n=20)
## # A tibble: 20 × 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr>
## 1 2014 1 1 1 96 235 70 AS N508AS
## 2 2014 1 1 4 -6 738 -23 US N195UW
## 3 2014 1 1 8 13 548 -4 UA N37422
## 4 2014 1 1 28 -2 800 -23 US N547UW
## 5 2014 1 1 34 44 325 43 AS N762AS
## 6 2014 1 1 37 82 747 88 DL N806DN
## 7 2014 1 1 346 227 936 219 UA N14219
## 8 2014 1 1 526 -4 1148 15 UA N813UA
## 9 2014 1 1 527 7 917 24 UA N75433
## 10 2014 1 1 536 1 1334 -6 UA N574UA
## 11 2014 1 1 541 1 911 4 UA N36476
## 12 2014 1 1 549 24 907 12 US N548UW
## 13 2014 1 1 550 0 837 -12 DL N660DL
## 14 2014 1 1 557 -3 1134 -16 AA N3JLAA
## 15 2014 1 1 557 -3 825 -25 AS N562AS
## 16 2014 1 1 558 -2 801 -2 AS N402AS
## 17 2014 1 1 559 -1 916 -9 F9 N210FR
## 18 2014 1 1 600 0 1151 -19 AA N3JFAA
## 19 2014 1 1 600 -10 842 -8 AS N786AS
## 20 2014 1 1 602 -3 943 5 F9 N201FR
## # … with 7 more variables: flight <int>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>
# descending order
flights %>%
arrange(desc(year), desc(month), desc(day)) %>% head(n=20)
## # A tibble: 20 × 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr>
## 1 2014 12 31 2 12 601 31 AA N3JKAA
## 2 2014 12 31 27 -3 623 3 AA N3EWAA
## 3 2014 12 31 39 14 324 4 AS N762AS
## 4 2014 12 31 40 0 549 0 DL N757AT
## 5 2014 12 31 52 -8 917 -21 AA N3JFAA
## 6 2014 12 31 54 4 621 17 DL N128DL
## 7 2014 12 31 56 61 848 80 DL N655DL
## 8 2014 12 31 512 -3 904 4 US N653AW
## 9 2014 12 31 515 -5 855 5 US N580UW
## 10 2014 12 31 534 4 859 7 UA N34460
## 11 2014 12 31 546 1 916 -4 WN N8323C
## 12 2014 12 31 548 -2 1351 -13 UA N461UA
## 13 2014 12 31 549 4 1208 12 UA N68807
## 14 2014 12 31 550 0 922 2 WN N797MX
## 15 2014 12 31 551 -4 1202 12 AA N3HXAA
## 16 2014 12 31 551 -9 744 -15 AS N570AS
## 17 2014 12 31 555 -10 824 -1 AS N548AS
## 18 2014 12 31 558 -2 849 0 DL N668DN
## 19 2014 12 31 558 -2 1149 4 AA N436AA
## 20 2014 12 31 558 -2 738 -4 AS N585AS
## # … with 7 more variables: flight <int>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>
dplyr::distinct
# unique origin-dest combinations
flights %>%
select(origin, dest) %>%
distinct %>% head(n=50)
## # A tibble: 50 × 2
## origin dest
## <chr> <chr>
## 1 PDX ANC
## 2 SEA CLT
## 3 PDX IAH
## 4 PDX CLT
## 5 SEA ANC
## 6 SEA DTW
## 7 SEA ORD
## 8 SEA DEN
## 9 SEA EWR
## 10 PDX DEN
## # … with 40 more rows
# all unique destinations from PDX (there are 49, so we'll just list the first few)
from_pdx <- flights %>%
filter(origin == "PDX") %>%
select(origin, dest) %>%
distinct(dest)
head(from_pdx)
## # A tibble: 6 × 1
## dest
## <chr>
## 1 ANC
## 2 IAH
## 3 CLT
## 4 DEN
## 5 PHX
## 6 ORD
dplyr::mutate
mutate
is used to transform existing variables or create
new ones; here, we are using it to create an indicator variable that
identifies flights that were entirely on-time.
# add total delay variable
flights %>%
mutate(tot_delay = dep_delay + arr_delay) %>%
select(origin, dest, ends_with("delay"), everything()) %>%
head
## # A tibble: 6 × 17
## origin dest dep_delay arr_delay tot_delay year month day dep_time arr_time
## <chr> <chr> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int>
## 1 PDX ANC 96 70 166 2014 1 1 1 235
## 2 SEA CLT -6 -23 -29 2014 1 1 4 738
## 3 PDX IAH 13 -4 9 2014 1 1 8 548
## 4 PDX CLT -2 -23 -25 2014 1 1 28 800
## 5 SEA ANC 44 43 87 2014 1 1 34 325
## 6 SEA DTW 82 88 170 2014 1 1 37 747
## # … with 7 more variables: carrier <chr>, tailnum <chr>, flight <int>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>
# flights that were delayed at departure had on time or early arrivals?
arrivals <- flights %>%
mutate(arr_ok = ifelse(dep_delay > 0 & arr_delay <= 0, 1, 0)) %>%
select(origin, dest, ends_with("delay"), carrier, arr_ok)
# peek at it
arrivals %>%
filter(arr_ok == 1) %>%
head
## # A tibble: 6 × 6
## origin dest dep_delay arr_delay carrier arr_ok
## <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 PDX IAH 13 -4 UA 1
## 2 SEA EWR 1 -6 UA 1
## 3 SEA SAN 2 -12 AS 1
## 4 PDX EWR 2 -19 UA 1
## 5 SEA IAH 13 -4 UA 1
## 6 PDX IAD 10 -4 UA 1
dplyr::summarise
(or dplyr::summarize
)We have seen summarise
used to calculate summary
statistics, like so:
flights %>%
summarise(mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 1
## `mean(dep_delay, na.rm = TRUE)`
## <dbl>
## 1 6.13
But this can get tedious when we want to compute several summary statistics, like so:
# we can also name that variable, and summarise multiple variables
flights %>%
summarise(mean_delay = mean(dep_delay, na.rm = TRUE),
sd_delay = sd(dep_delay, na.rm = TRUE),
median_delay = median(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 3
## mean_delay sd_delay median_delay
## <dbl> <dbl> <dbl>
## 1 6.13 29.1 -2
summarise
together with across()
lets us
automate this process a bit:
flights %>%
filter(!is.na(dep_delay)) %>%
select(dep_delay) %>%
summarise(
across(everything(),
lst(mean, sd, median)) # making this a list gets us nice column names
)
## # A tibble: 1 × 3
## dep_delay_mean dep_delay_sd dep_delay_median
## <dbl> <dbl> <dbl>
## 1 6.13 29.1 -2
# same thing over all numeric columns
flights %>%
filter(!is.na(dep_delay)) %>%
summarise(across(where(is.numeric), lst(mean, sd, median)))
## # A tibble: 1 × 36
## year_mean year_sd year_median month_mean month_sd month_median day_mean day_sd
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014 0 2014 6.62 3.32 7 15.8 8.79
## # … with 28 more variables: day_median <dbl>, dep_time_mean <dbl>,
## # dep_time_sd <dbl>, dep_time_median <dbl>, dep_delay_mean <dbl>,
## # dep_delay_sd <dbl>, dep_delay_median <dbl>, arr_time_mean <dbl>,
## # arr_time_sd <dbl>, arr_time_median <int>, arr_delay_mean <dbl>,
## # arr_delay_sd <dbl>, arr_delay_median <dbl>, flight_mean <dbl>,
## # flight_sd <dbl>, flight_median <dbl>, air_time_mean <dbl>,
## # air_time_sd <dbl>, air_time_median <dbl>, distance_mean <dbl>, …
# combine with pivot_longer, change names too
flights %>%
filter(!is.na(dep_delay)) %>%
summarise(across(dep_delay, lst(mean, stdev=sd, median))) %>%
pivot_longer(everything(), names_to="delay_stat", values_to="value", names_prefix = "dep_delay_")
## # A tibble: 3 × 2
## delay_stat value
## <chr> <dbl>
## 1 mean 6.13
## 2 stdev 29.1
## 3 median -2
We can use aggregating functions in summarise
# how many unique destinations?
summary_table <- flights %>%
summarise(tot_flights = n(),
tot_planes = n_distinct(tailnum),
tot_carriers = n_distinct(carrier),
tot_dests = n_distinct(dest),
tot_origins = n_distinct(origin))
summary_table
## # A tibble: 1 × 5
## tot_flights tot_planes tot_carriers tot_dests tot_origins
## <int> <int> <int> <int> <int>
## 1 162049 3023 11 71 2
The resulting table is a bit messy; we can reprocess using
pivot_longer
and separate
:
# chain with tidyr functions
summary_table %>%
pivot_longer(everything(), names_to="key", values_to="value") %>%
separate(key, into = c("tot", "entity")) %>%
select(-tot, total = value)
## # A tibble: 5 × 2
## entity total
## <chr> <int>
## 1 flights 162049
## 2 planes 3023
## 3 carriers 11
## 4 dests 71
## 5 origins 2
Or we can do it in a single call to pivot_longer()
:
summary_table %>%
pivot_longer(
everything(), # grab all columns
names_to="entity", # make a new column called "entity" to store the old column names
names_prefix="tot_", # remove the "tot_" prefix from the old column names
values_to="total") # put the values in a column called "total"
## # A tibble: 5 × 2
## entity total
## <chr> <int>
## 1 flights 162049
## 2 planes 3023
## 3 carriers 11
## 4 dests 71
## 5 origins 2
tidyr
We’ll work with a made up dataframe:
df <- data.frame(
id = 1:10,
date = as.Date('2015-01-01') + 0:9,
q1_m1_w1 = rnorm(10, 0, 1),
q1_m1_w2 = rnorm(10, 0, 1),
q1_m2_w3 = rnorm(10, 0, 1),
q2_m1_w1 = rnorm(10, 0, 1),
q2_m2_w1 = rnorm(10, 0, 1),
q2_m2_w2 = rnorm(10, 0, 1)
)
# HLO
head(df)
## id date q1_m1_w1 q1_m1_w2 q1_m2_w3 q2_m1_w1 q2_m2_w1
## 1 1 2015-01-01 -1.1700743 -1.3313873 0.7225514 -1.1991584 -1.1453316
## 2 2 2015-01-02 0.7013156 0.4073680 1.4849110 0.9400779 0.8611532
## 3 3 2015-01-03 1.6318222 -0.7944370 1.5842153 0.6983900 -2.6945197
## 4 4 2015-01-04 -0.2343429 0.1907154 -0.9717708 0.1686624 -0.4788981
## 5 5 2015-01-05 0.6487112 -0.1029043 -0.6050051 -0.5839066 -0.2390221
## 6 6 2015-01-06 0.9597311 0.5272105 2.0258127 -1.6533158 -0.1458568
## q2_m2_w2
## 1 -0.9506309
## 2 0.1975220
## 3 0.5255518
## 4 -0.4800159
## 5 0.7514034
## 6 0.2399251
glimpse(df)
## Rows: 10
## Columns: 8
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ date <date> 2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05, 2…
## $ q1_m1_w1 <dbl> -1.1700743, 0.7013156, 1.6318222, -0.2343429, 0.6487112, 0.9…
## $ q1_m1_w2 <dbl> -1.33138729, 0.40736797, -0.79443701, 0.19071541, -0.10290434…
## $ q1_m2_w3 <dbl> 0.7225514, 1.4849110, 1.5842153, -0.9717708, -0.6050051, 2.02…
## $ q2_m1_w1 <dbl> -1.1991584, 0.9400779, 0.6983900, 0.1686624, -0.5839066, -1.6…
## $ q2_m2_w1 <dbl> -1.1453316, 0.8611532, -2.6945197, -0.4788981, -0.2390221, -0…
## $ q2_m2_w2 <dbl> -0.9506309, 0.1975220, 0.5255518, -0.4800159, 0.7514034, 0.23…
tidyr::pivot_longer
First, let’s pivot…
df_tidy <- df %>%
pivot_longer(q1_m1_w1:q2_m2_w2, names_to="key", values_to="value")
head(df_tidy)
## # A tibble: 6 × 4
## id date key value
## <int> <date> <chr> <dbl>
## 1 1 2015-01-01 q1_m1_w1 -1.17
## 2 1 2015-01-01 q1_m1_w2 -1.33
## 3 1 2015-01-01 q1_m2_w3 0.723
## 4 1 2015-01-01 q2_m1_w1 -1.20
## 5 1 2015-01-01 q2_m2_w1 -1.15
## 6 1 2015-01-01 q2_m2_w2 -0.951
tidyr::separate
# separate 1 col into 3 cols
df_sep <- df_tidy %>%
separate(key, into = c("quarter", "month", "week"))
head(df_sep)
## # A tibble: 6 × 6
## id date quarter month week value
## <int> <date> <chr> <chr> <chr> <dbl>
## 1 1 2015-01-01 q1 m1 w1 -1.17
## 2 1 2015-01-01 q1 m1 w2 -1.33
## 3 1 2015-01-01 q1 m2 w3 0.723
## 4 1 2015-01-01 q2 m1 w1 -1.20
## 5 1 2015-01-01 q2 m2 w1 -1.15
## 6 1 2015-01-01 q2 m2 w2 -0.951
# separate 1 col into 2 cols
df_sep2 <- df_tidy %>%
separate(key, into = c("quarter", "period"), extra = "merge")
head(df_sep2)
## # A tibble: 6 × 5
## id date quarter period value
## <int> <date> <chr> <chr> <dbl>
## 1 1 2015-01-01 q1 m1_w1 -1.17
## 2 1 2015-01-01 q1 m1_w2 -1.33
## 3 1 2015-01-01 q1 m2_w3 0.723
## 4 1 2015-01-01 q2 m1_w1 -1.20
## 5 1 2015-01-01 q2 m2_w1 -1.15
## 6 1 2015-01-01 q2 m2_w2 -0.951
stringr vs. tidyr separate by regular expression
tidyr::extract
Extract
is essentially the same as
separate
, let’s see how…
# extract
df_ext <- df_sep2 %>%
extract(period, into = "month")
head(df_ext)
## # A tibble: 6 × 5
## id date quarter month value
## <int> <date> <chr> <chr> <dbl>
## 1 1 2015-01-01 q1 m1 -1.17
## 2 1 2015-01-01 q1 m1 -1.33
## 3 1 2015-01-01 q1 m2 0.723
## 4 1 2015-01-01 q2 m1 -1.20
## 5 1 2015-01-01 q2 m2 -1.15
## 6 1 2015-01-01 q2 m2 -0.951
We can see that, by default, it’s taking the first alphanumeric chunk
of the value, and extract
ing it. We can specify much more
complex patterns:
# this gives us same output as separate
df_ext <- df_sep2 %>%
extract(period, into = c("month", "week"),
regex = "([[:alnum:]]+)_([[:alnum:]]+)")
head(df_ext)
## # A tibble: 6 × 6
## id date quarter month week value
## <int> <date> <chr> <chr> <chr> <dbl>
## 1 1 2015-01-01 q1 m1 w1 -1.17
## 2 1 2015-01-01 q1 m1 w2 -1.33
## 3 1 2015-01-01 q1 m2 w3 0.723
## 4 1 2015-01-01 q2 m1 w1 -1.20
## 5 1 2015-01-01 q2 m2 w1 -1.15
## 6 1 2015-01-01 q2 m2 w2 -0.951
tidyr::unite
The unite()
function lets us reverse the process:
# let's say we want to combine quarter and month with an underscore
df_uni <- df_sep %>%
unite(period, quarter:month) # sep = "_" is the default arg
head(df_uni)
## # A tibble: 6 × 5
## id date period week value
## <int> <date> <chr> <chr> <dbl>
## 1 1 2015-01-01 q1_m1 w1 -1.17
## 2 1 2015-01-01 q1_m1 w2 -1.33
## 3 1 2015-01-01 q1_m2 w3 0.723
## 4 1 2015-01-01 q2_m1 w1 -1.20
## 5 1 2015-01-01 q2_m2 w1 -1.15
## 6 1 2015-01-01 q2_m2 w2 -0.951
# let's say we want to combine quarter and month with nothing
df_uni <- df_sep %>%
unite(period, quarter:month, sep = "")
head(df_uni)
## # A tibble: 6 × 5
## id date period week value
## <int> <date> <chr> <chr> <dbl>
## 1 1 2015-01-01 q1m1 w1 -1.17
## 2 1 2015-01-01 q1m1 w2 -1.33
## 3 1 2015-01-01 q1m2 w3 0.723
## 4 1 2015-01-01 q2m1 w1 -1.20
## 5 1 2015-01-01 q2m2 w1 -1.15
## 6 1 2015-01-01 q2m2 w2 -0.951
tidyr::pivot_wider
# finally let's spread
df_spread <- df_uni %>%
pivot_wider(names_from=week, values_from=value) # fill = NA is default arg
head(df_spread)
## # A tibble: 6 × 6
## id date period w1 w2 w3
## <int> <date> <chr> <dbl> <dbl> <dbl>
## 1 1 2015-01-01 q1m1 -1.17 -1.33 NA
## 2 1 2015-01-01 q1m2 NA NA 0.723
## 3 1 2015-01-01 q2m1 -1.20 NA NA
## 4 1 2015-01-01 q2m2 -1.15 -0.951 NA
## 5 2 2015-01-02 q1m1 0.701 0.407 NA
## 6 2 2015-01-02 q1m2 NA NA 1.48
pivot_longer() %>% separate() %>% pivot_wider()
)All in one, if we had wanted to end up with one row per month:
df_tidiest <- df %>%
pivot_longer(q1_m1_w1:q2_m2_w2, names_to="key", values_to="value") %>%
separate(key, into = c("quarter", "month", "week")) %>%
pivot_wider(names_from=week, values_from=value)
head(df_tidiest)
## # A tibble: 6 × 7
## id date quarter month w1 w2 w3
## <int> <date> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 2015-01-01 q1 m1 -1.17 -1.33 NA
## 2 1 2015-01-01 q1 m2 NA NA 0.723
## 3 1 2015-01-01 q2 m1 -1.20 NA NA
## 4 1 2015-01-01 q2 m2 -1.15 -0.951 NA
## 5 2 2015-01-02 q1 m1 0.701 0.407 NA
## 6 2 2015-01-02 q1 m2 NA NA 1.48
broom
“The broom package takes the messy output of built-in functions in R,
such as lm
, nls
, or t.test
, and
turns them into tidy data frames.” So, broom tidies output from other R
functions that are un-tidy.
See here for list of functions: https://github.com/dgrtwo/broom
Vignette: ftp://cran.r-project.org/pub/R/web/packages/broom/vignettes/broom.html
fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear),
data = mtcars)
Un-tidy output from lm
summary(fit)
##
## Call:
## lm(formula = mpg ~ qsec + factor(am) + wt + factor(gear), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5064 -1.5220 -0.7517 1.3841 4.6345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.3650 8.3730 1.118 0.27359
## qsec 1.2449 0.3828 3.252 0.00317 **
## factor(am)1 3.1505 1.9405 1.624 0.11654
## wt -3.9263 0.7428 -5.286 1.58e-05 ***
## factor(gear)4 -0.2682 1.6555 -0.162 0.87257
## factor(gear)5 -0.2697 2.0632 -0.131 0.89698
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.55 on 26 degrees of freedom
## Multiple R-squared: 0.8498, Adjusted R-squared: 0.8209
## F-statistic: 29.43 on 5 and 26 DF, p-value: 6.379e-10
Tidy output from broom
tidy(fit)
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 9.37 8.37 1.12 0.274
## 2 qsec 1.24 0.383 3.25 0.00317
## 3 factor(am)1 3.15 1.94 1.62 0.117
## 4 wt -3.93 0.743 -5.29 0.0000158
## 5 factor(gear)4 -0.268 1.66 -0.162 0.873
## 6 factor(gear)5 -0.270 2.06 -0.131 0.897
This is very helfpul for making e.g. regression tables for papers!