Making tables is often when we stretch our dplyr
skills.
Here are some additional more advanced examples that you may find
helpful for today’s assignment!
We’ll use the pnwflights14
package to practice our
dplyr
skills. We need to download the package from github
using devtools
.
# once per machine
install.packages("devtools")
devtools::install_github("ismayc/pnwflights14")
Now, we need to load the flights
dataset from the
pnwflights14
package.
# once per work session
library(pnwflights14)
data("flights", package = "pnwflights14")
dplyr
Tricksdplyr::select
Recall that we use select to specify which columns in a dataframe you’d like to keep by name:
# keep these 2 cols
mini_flights <- flights %>%
select(carrier, flight)
glimpse(mini_flights)
## Rows: 162,049
## Columns: 2
## $ carrier <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", "U…
## $ flight <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569, 6…
# keep first five cols
first_five <- flights %>%
select(year, month, day, dep_time, dep_delay)
glimpse(first_five)
## Rows: 162,049
## Columns: 5
## $ year <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
# alternatively, specify range
first_five <- flights %>%
select(year:dep_delay)
glimpse(first_five)
## Rows: 162,049
## Columns: 5
## $ year <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
We can also choose the columns we want by negation, that is, you can specify which columns to drop instead of keep. This way, all variables not listed are kept.
# we can also use negation
all_but_year <- flights %>%
select(-year)
glimpse(all_but_year)
## Rows: 162,049
## Columns: 15
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
## $ arr_time <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907…
## $ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, …
## $ carrier <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", …
## $ tailnum <chr> "N508AS", "N195UW", "N37422", "N547UW", "N762AS", "N806DN", …
## $ flight <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569,…
## $ origin <chr> "PDX", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "PDX", "SEA…
## $ dest <chr> "ANC", "CLT", "IAH", "CLT", "ANC", "DTW", "ORD", "IAH", "DEN…
## $ air_time <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, …
## $ distance <dbl> 1542, 2279, 1825, 2282, 1448, 1927, 1721, 1825, 1024, 2402, …
## $ hour <dbl> 0, 0, 0, 0, 0, 0, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, …
## $ minute <dbl> 1, 4, 8, 28, 34, 37, 46, 26, 27, 36, 41, 49, 50, 57, 57, 58,…
dplyr::select
comes with several other helper
functions…
depart <- flights %>%
select(starts_with("dep_"))
glimpse(depart)
## Rows: 162,049
## Columns: 2
## $ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
times <- flights %>%
select(contains("time"))
glimpse(times)
## Rows: 162,049
## Columns: 3
## $ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557, …
## $ arr_time <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907,…
## $ air_time <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, 8…
# note that we are not actually saving the new dataframe here
flights %>%
select(-contains("time")) %>% head()
## # A tibble: 6 × 13
## year month day dep_de…¹ arr_d…² carrier tailnum flight origin dest dista…³
## <int> <int> <int> <dbl> <dbl> <chr> <chr> <int> <chr> <chr> <dbl>
## 1 2014 1 1 96 70 AS N508AS 145 PDX ANC 1542
## 2 2014 1 1 -6 -23 US N195UW 1830 SEA CLT 2279
## 3 2014 1 1 13 -4 UA N37422 1609 PDX IAH 1825
## 4 2014 1 1 -2 -23 US N547UW 466 PDX CLT 2282
## 5 2014 1 1 44 43 AS N762AS 121 SEA ANC 1448
## 6 2014 1 1 82 88 DL N806DN 1823 SEA DTW 1927
## # … with 2 more variables: hour <dbl>, minute <dbl>, and abbreviated variable
## # names ¹dep_delay, ²arr_delay, ³distance
delays <- flights %>%
select(ends_with("delay"))
glimpse(delays)
## Rows: 162,049
## Columns: 2
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
## $ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, …
One of my favorite select helper functions is
everything()
, which allows you to use select to keep
all your variables, but easily rearrange the columns
without having to list all the variables to keep/drop.
new_order <- flights %>%
select(origin, dest, everything())
head(new_order)
## # A tibble: 6 × 16
## origin dest year month day dep_t…¹ dep_d…² arr_t…³ arr_d…⁴ carrier tailnum
## <chr> <chr> <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr>
## 1 PDX ANC 2014 1 1 1 96 235 70 AS N508AS
## 2 SEA CLT 2014 1 1 4 -6 738 -23 US N195UW
## 3 PDX IAH 2014 1 1 8 13 548 -4 UA N37422
## 4 PDX CLT 2014 1 1 28 -2 800 -23 US N547UW
## 5 SEA ANC 2014 1 1 34 44 325 43 AS N762AS
## 6 SEA DTW 2014 1 1 37 82 747 88 DL N806DN
## # … with 5 more variables: flight <int>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, and abbreviated variable names ¹dep_time,
## # ²dep_delay, ³arr_time, ⁴arr_delay
# with negation
new_order2 <- flights %>%
select(origin, dest, everything(), -year)
head(new_order2)
## # A tibble: 6 × 15
## origin dest month day dep_time dep_delay arr_time arr_delay carrier tailnum
## <chr> <chr> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr>
## 1 PDX ANC 1 1 1 96 235 70 AS N508AS
## 2 SEA CLT 1 1 4 -6 738 -23 US N195UW
## 3 PDX IAH 1 1 8 13 548 -4 UA N37422
## 4 PDX CLT 1 1 28 -2 800 -23 US N547UW
## 5 SEA ANC 1 1 34 44 325 43 AS N762AS
## 6 SEA DTW 1 1 37 82 747 88 DL N806DN
## # … with 5 more variables: flight <int>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>
We can also rename variables within select.
flights2 <- flights %>%
select(tail_num = tailnum, everything())
head(flights2)
## # A tibble: 6 × 16
## tail_num year month day dep_time dep_delay arr_time arr_de…¹ carrier flight
## <chr> <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <int>
## 1 N508AS 2014 1 1 1 96 235 70 AS 145
## 2 N195UW 2014 1 1 4 -6 738 -23 US 1830
## 3 N37422 2014 1 1 8 13 548 -4 UA 1609
## 4 N547UW 2014 1 1 28 -2 800 -23 US 466
## 5 N762AS 2014 1 1 34 44 325 43 AS 121
## 6 N806DN 2014 1 1 37 82 747 88 DL 1823
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, and abbreviated variable name
## # ¹arr_delay
If you don’t want to move the renamed variables within your
dataframe, you can use the rename
function.
flights3 <- flights %>%
rename(tail_num = tailnum)
glimpse(flights3)
## Rows: 162,049
## Columns: 16
## $ year <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
## $ arr_time <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907…
## $ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, …
## $ carrier <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", …
## $ tail_num <chr> "N508AS", "N195UW", "N37422", "N547UW", "N762AS", "N806DN", …
## $ flight <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569,…
## $ origin <chr> "PDX", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "PDX", "SEA…
## $ dest <chr> "ANC", "CLT", "IAH", "CLT", "ANC", "DTW", "ORD", "IAH", "DEN…
## $ air_time <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, …
## $ distance <dbl> 1542, 2279, 1825, 2282, 1448, 1927, 1721, 1825, 1024, 2402, …
## $ hour <dbl> 0, 0, 0, 0, 0, 0, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, …
## $ minute <dbl> 1, 4, 8, 28, 34, 37, 46, 26, 27, 36, 41, 49, 50, 57, 57, 58,…
dplyr::filter
As we have previously seen, filter
has many flexible
uses:
# flights taking off from PDX
pdx <- flights %>%
filter(origin == "PDX")
head(pdx)
## # A tibble: 6 × 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2014 1 1 1 96 235 70 AS N508AS 145
## 2 2014 1 1 8 13 548 -4 UA N37422 1609
## 3 2014 1 1 28 -2 800 -23 US N547UW 466
## 4 2014 1 1 526 -4 1148 15 UA N813UA 229
## 5 2014 1 1 541 1 911 4 UA N36476 1569
## 6 2014 1 1 549 24 907 12 US N548UW 649
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>
# january flights from PDX
pdx_jan <- flights %>%
filter(origin == "PDX", month == 1) # the comma is an "and"
head(pdx_jan)
## # A tibble: 6 × 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2014 1 1 1 96 235 70 AS N508AS 145
## 2 2014 1 1 8 13 548 -4 UA N37422 1609
## 3 2014 1 1 28 -2 800 -23 US N547UW 466
## 4 2014 1 1 526 -4 1148 15 UA N813UA 229
## 5 2014 1 1 541 1 911 4 UA N36476 1569
## 6 2014 1 1 549 24 907 12 US N548UW 649
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>
# flights to ATL (Atlanta) or BNA (Nashville)
to_south <- flights %>%
filter(dest == "ATL" | dest == "BNA") %>% # | is "or"
select(origin, dest, everything())
head(to_south)
## # A tibble: 6 × 16
## origin dest year month day dep_t…¹ dep_d…² arr_t…³ arr_d…⁴ carrier tailnum
## <chr> <chr> <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr>
## 1 SEA ATL 2014 1 1 624 -6 1401 -6 DL N617DL
## 2 SEA ATL 2014 1 1 802 -3 1533 -17 AS N532AS
## 3 SEA ATL 2014 1 1 824 -1 1546 -14 DL N633DL
## 4 PDX ATL 2014 1 1 944 -6 1727 -8 AS N548AS
## 5 PDX ATL 2014 1 1 1054 94 1807 84 DL N377DA
## 6 SEA ATL 2014 1 1 1158 6 1915 -14 DL N6712B
## # … with 5 more variables: flight <int>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, and abbreviated variable names ¹dep_time,
## # ²dep_delay, ³arr_time, ⁴arr_delay
# flights from PDX to ATL (Atlanta) or BNA (Nashville)
pdx_to_south <- flights %>%
filter(origin == "PDX", dest == "ATL" | dest == "BNA") %>% # | is "or"
select(origin, dest, everything())
head(pdx_to_south)
## # A tibble: 6 × 16
## origin dest year month day dep_t…¹ dep_d…² arr_t…³ arr_d…⁴ carrier tailnum
## <chr> <chr> <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr>
## 1 PDX ATL 2014 1 1 944 -6 1727 -8 AS N548AS
## 2 PDX ATL 2014 1 1 1054 94 1807 84 DL N377DA
## 3 PDX ATL 2014 1 1 1323 -2 2038 -15 DL N393DA
## 4 PDX ATL 2014 1 1 2253 8 611 4 DL N371DA
## 5 PDX ATL 2014 1 2 627 -3 1350 -7 DL N3746H
## 6 PDX ATL 2014 1 2 918 -2 1643 -2 DL N3756
## # … with 5 more variables: flight <int>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, and abbreviated variable names ¹dep_time,
## # ²dep_delay, ³arr_time, ⁴arr_delay
# alternatively, using group membership
south_dests <- c("ATL", "BNA")
pdx_to_south2 <- flights %>%
filter(origin == "PDX", dest %in% south_dests) %>%
select(origin, dest, everything())
head(pdx_to_south2)
## # A tibble: 6 × 16
## origin dest year month day dep_t…¹ dep_d…² arr_t…³ arr_d…⁴ carrier tailnum
## <chr> <chr> <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr>
## 1 PDX ATL 2014 1 1 944 -6 1727 -8 AS N548AS
## 2 PDX ATL 2014 1 1 1054 94 1807 84 DL N377DA
## 3 PDX ATL 2014 1 1 1323 -2 2038 -15 DL N393DA
## 4 PDX ATL 2014 1 1 2253 8 611 4 DL N371DA
## 5 PDX ATL 2014 1 2 627 -3 1350 -7 DL N3746H
## 6 PDX ATL 2014 1 2 918 -2 1643 -2 DL N3756
## # … with 5 more variables: flight <int>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, and abbreviated variable names ¹dep_time,
## # ²dep_delay, ³arr_time, ⁴arr_delay
# flights delayed by 1 hour or more
delay_1plus <- flights %>%
filter(dep_delay >= 60)
head(delay_1plus)
## # A tibble: 6 × 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2014 1 1 1 96 235 70 AS N508AS 145
## 2 2014 1 1 37 82 747 88 DL N806DN 1823
## 3 2014 1 1 346 227 936 219 UA N14219 1481
## 4 2014 1 1 650 90 1037 91 US N626AW 460
## 5 2014 1 1 959 164 1137 157 AS N534AS 805
## 6 2014 1 1 1008 68 1242 64 AS N788AS 456
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>
# flights delayed by 1 hour, but not more than 2 hours
delay_1hr <- flights %>%
filter(dep_delay >= 60, dep_delay < 120)
head(delay_1hr)
## # A tibble: 6 × 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2014 1 1 1 96 235 70 AS N508AS 145
## 2 2014 1 1 37 82 747 88 DL N806DN 1823
## 3 2014 1 1 650 90 1037 91 US N626AW 460
## 4 2014 1 1 1008 68 1242 64 AS N788AS 456
## 5 2014 1 1 1014 75 1613 81 UA N37408 1444
## 6 2014 1 1 1036 81 1408 63 OO N218AG 3466
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>
range(delay_1hr$dep_delay, na.rm = TRUE)
## [1] 60 119
# even more efficient using between (always inclusive)
delay_bwn <- flights %>%
filter(between(dep_delay, 60, 119))
head(delay_bwn)
## # A tibble: 6 × 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2014 1 1 1 96 235 70 AS N508AS 145
## 2 2014 1 1 37 82 747 88 DL N806DN 1823
## 3 2014 1 1 650 90 1037 91 US N626AW 460
## 4 2014 1 1 1008 68 1242 64 AS N788AS 456
## 5 2014 1 1 1014 75 1613 81 UA N37408 1444
## 6 2014 1 1 1036 81 1408 63 OO N218AG 3466
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>
range(delay_bwn$dep_delay, na.rm = TRUE)
## [1] 60 119
dplyr::arrange
# default is ascending order
flights %>%
arrange(year, month, day) %>% head(n=20)
## # A tibble: 20 × 16
## year month day dep_time dep_delay arr_time arr_de…¹ carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2014 1 1 1 96 235 70 AS N508AS 145
## 2 2014 1 1 4 -6 738 -23 US N195UW 1830
## 3 2014 1 1 8 13 548 -4 UA N37422 1609
## 4 2014 1 1 28 -2 800 -23 US N547UW 466
## 5 2014 1 1 34 44 325 43 AS N762AS 121
## 6 2014 1 1 37 82 747 88 DL N806DN 1823
## 7 2014 1 1 346 227 936 219 UA N14219 1481
## 8 2014 1 1 526 -4 1148 15 UA N813UA 229
## 9 2014 1 1 527 7 917 24 UA N75433 1576
## 10 2014 1 1 536 1 1334 -6 UA N574UA 478
## 11 2014 1 1 541 1 911 4 UA N36476 1569
## 12 2014 1 1 549 24 907 12 US N548UW 649
## 13 2014 1 1 550 0 837 -12 DL N660DL 1634
## 14 2014 1 1 557 -3 1134 -16 AA N3JLAA 1094
## 15 2014 1 1 557 -3 825 -25 AS N562AS 81
## 16 2014 1 1 558 -2 801 -2 AS N402AS 200
## 17 2014 1 1 559 -1 916 -9 F9 N210FR 796
## 18 2014 1 1 600 0 1151 -19 AA N3JFAA 2240
## 19 2014 1 1 600 -10 842 -8 AS N786AS 426
## 20 2014 1 1 602 -3 943 5 F9 N201FR 144
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, and abbreviated variable name
## # ¹arr_delay
# descending order
flights %>%
arrange(desc(year), desc(month), desc(day)) %>% head(n=20)
## # A tibble: 20 × 16
## year month day dep_time dep_delay arr_time arr_de…¹ carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2014 12 31 2 12 601 31 AA N3JKAA 1230
## 2 2014 12 31 27 -3 623 3 AA N3EWAA 1431
## 3 2014 12 31 39 14 324 4 AS N762AS 135
## 4 2014 12 31 40 0 549 0 DL N757AT 2440
## 5 2014 12 31 52 -8 917 -21 AA N3JFAA 371
## 6 2014 12 31 54 4 621 17 DL N128DL 1670
## 7 2014 12 31 56 61 848 80 DL N655DL 929
## 8 2014 12 31 512 -3 904 4 US N653AW 480
## 9 2014 12 31 515 -5 855 5 US N580UW 425
## 10 2014 12 31 534 4 859 7 UA N34460 1075
## 11 2014 12 31 546 1 916 -4 WN N8323C 757
## 12 2014 12 31 548 -2 1351 -13 UA N461UA 665
## 13 2014 12 31 549 4 1208 12 UA N68807 1457
## 14 2014 12 31 550 0 922 2 WN N797MX 2121
## 15 2014 12 31 551 -4 1202 12 AA N3HXAA 1094
## 16 2014 12 31 551 -9 744 -15 AS N570AS 342
## 17 2014 12 31 555 -10 824 -1 AS N548AS 602
## 18 2014 12 31 558 -2 849 0 DL N668DN 1831
## 19 2014 12 31 558 -2 1149 4 AA N436AA 1534
## 20 2014 12 31 558 -2 738 -4 AS N585AS 406
## # … with 6 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, and abbreviated variable name
## # ¹arr_delay
dplyr::distinct
# unique origin-dest combinations
flights %>%
select(origin, dest) %>%
distinct %>% head(n=50)
## # A tibble: 50 × 2
## origin dest
## <chr> <chr>
## 1 PDX ANC
## 2 SEA CLT
## 3 PDX IAH
## 4 PDX CLT
## 5 SEA ANC
## 6 SEA DTW
## 7 SEA ORD
## 8 SEA DEN
## 9 SEA EWR
## 10 PDX DEN
## # … with 40 more rows
# all unique destinations from PDX (there are 49, so we'll just list the first few)
from_pdx <- flights %>%
filter(origin == "PDX") %>%
select(origin, dest) %>%
distinct(dest)
head(from_pdx)
## # A tibble: 6 × 1
## dest
## <chr>
## 1 ANC
## 2 IAH
## 3 CLT
## 4 DEN
## 5 PHX
## 6 ORD
dplyr::mutate
mutate
is used to transform existing variables or create
new ones; here, we are using it to create an indicator variable that
identifies flights that were entirely on-time.
# add total delay variable
flights %>%
mutate(tot_delay = dep_delay + arr_delay) %>%
select(origin, dest, ends_with("delay"), everything()) %>%
head
## # A tibble: 6 × 17
## origin dest dep_d…¹ arr_d…² tot_d…³ year month day dep_t…⁴ arr_t…⁵ carrier
## <chr> <chr> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int> <chr>
## 1 PDX ANC 96 70 166 2014 1 1 1 235 AS
## 2 SEA CLT -6 -23 -29 2014 1 1 4 738 US
## 3 PDX IAH 13 -4 9 2014 1 1 8 548 UA
## 4 PDX CLT -2 -23 -25 2014 1 1 28 800 US
## 5 SEA ANC 44 43 87 2014 1 1 34 325 AS
## 6 SEA DTW 82 88 170 2014 1 1 37 747 DL
## # … with 6 more variables: tailnum <chr>, flight <int>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, and abbreviated variable names
## # ¹dep_delay, ²arr_delay, ³tot_delay, ⁴dep_time, ⁵arr_time
# flights that were delayed at departure had on time or early arrivals?
arrivals <- flights %>%
mutate(arr_ok = ifelse(dep_delay > 0 & arr_delay <= 0, 1, 0)) %>%
select(origin, dest, ends_with("delay"), carrier, arr_ok)
# peek at it
arrivals %>%
filter(arr_ok == 1) %>%
head
## # A tibble: 6 × 6
## origin dest dep_delay arr_delay carrier arr_ok
## <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 PDX IAH 13 -4 UA 1
## 2 SEA EWR 1 -6 UA 1
## 3 SEA SAN 2 -12 AS 1
## 4 PDX EWR 2 -19 UA 1
## 5 SEA IAH 13 -4 UA 1
## 6 PDX IAD 10 -4 UA 1
dplyr::summarise
(or dplyr::summarize
)We have seen summarise
used to calculate summary
statistics, like so:
flights %>%
summarise(mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 1
## `mean(dep_delay, na.rm = TRUE)`
## <dbl>
## 1 6.13
But this can get tedious when we want to compute several summary statistics, like so:
# we can also name that variable, and summarise multiple variables
flights %>%
summarise(mean_delay = mean(dep_delay, na.rm = TRUE),
sd_delay = sd(dep_delay, na.rm = TRUE),
median_delay = median(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 3
## mean_delay sd_delay median_delay
## <dbl> <dbl> <dbl>
## 1 6.13 29.1 -2
summarise
together with across()
lets us
automate this process a bit:
flights %>%
filter(!is.na(dep_delay)) %>%
select(dep_delay) %>%
summarise(
across(everything(),
lst(mean, sd, median)) # making this a list gets us nice column names
)
## # A tibble: 1 × 3
## dep_delay_mean dep_delay_sd dep_delay_median
## <dbl> <dbl> <dbl>
## 1 6.13 29.1 -2
# same thing over all numeric columns
flights %>%
filter(!is.na(dep_delay)) %>%
summarise(across(where(is.numeric), lst(mean, sd, median)))
## # A tibble: 1 × 36
## year_…¹ year_sd year_…² month…³ month…⁴ month…⁵ day_m…⁶ day_sd day_m…⁷ dep_t…⁸
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014 0 2014 6.62 3.32 7 15.8 8.79 16 1278.
## # … with 26 more variables: dep_time_sd <dbl>, dep_time_median <dbl>,
## # dep_delay_mean <dbl>, dep_delay_sd <dbl>, dep_delay_median <dbl>,
## # arr_time_mean <dbl>, arr_time_sd <dbl>, arr_time_median <int>,
## # arr_delay_mean <dbl>, arr_delay_sd <dbl>, arr_delay_median <dbl>,
## # flight_mean <dbl>, flight_sd <dbl>, flight_median <dbl>,
## # air_time_mean <dbl>, air_time_sd <dbl>, air_time_median <dbl>,
## # distance_mean <dbl>, distance_sd <dbl>, distance_median <dbl>, …
# combine with pivot_longer, change names too
flights %>%
filter(!is.na(dep_delay)) %>%
summarise(across(dep_delay, lst(mean, stdev=sd, median))) %>%
pivot_longer(everything(), names_to="delay_stat", values_to="value", names_prefix = "dep_delay_")
## # A tibble: 3 × 2
## delay_stat value
## <chr> <dbl>
## 1 mean 6.13
## 2 stdev 29.1
## 3 median -2
We can use aggregating functions in summarise
# how many unique destinations?
summary_table <- flights %>%
summarise(tot_flights = n(),
tot_planes = n_distinct(tailnum),
tot_carriers = n_distinct(carrier),
tot_dests = n_distinct(dest),
tot_origins = n_distinct(origin))
summary_table
## # A tibble: 1 × 5
## tot_flights tot_planes tot_carriers tot_dests tot_origins
## <int> <int> <int> <int> <int>
## 1 162049 3023 11 71 2
The resulting table is a bit messy; we can reprocess using
pivot_longer
and separate
:
# chain with tidyr functions
summary_table %>%
pivot_longer(everything(), names_to="key", values_to="value") %>%
separate(key, into = c("tot", "entity")) %>%
select(-tot, total = value)
## # A tibble: 5 × 2
## entity total
## <chr> <int>
## 1 flights 162049
## 2 planes 3023
## 3 carriers 11
## 4 dests 71
## 5 origins 2
Or we can do it in a single call to pivot_longer()
:
summary_table %>%
pivot_longer(
everything(), # grab all columns
names_to="entity", # make a new column called "entity" to store the old column names
names_prefix="tot_", # remove the "tot_" prefix from the old column names
values_to="total") # put the values in a column called "total"
## # A tibble: 5 × 2
## entity total
## <chr> <int>
## 1 flights 162049
## 2 planes 3023
## 3 carriers 11
## 4 dests 71
## 5 origins 2
tidyr
We’ll work with a made up dataframe:
df <- data.frame(
id = 1:10,
date = as.Date('2015-01-01') + 0:9,
q1_m1_w1 = rnorm(10, 0, 1),
q1_m1_w2 = rnorm(10, 0, 1),
q1_m2_w3 = rnorm(10, 0, 1),
q2_m1_w1 = rnorm(10, 0, 1),
q2_m2_w1 = rnorm(10, 0, 1),
q2_m2_w2 = rnorm(10, 0, 1)
)
# HLO
head(df)
## id date q1_m1_w1 q1_m1_w2 q1_m2_w3 q2_m1_w1 q2_m2_w1
## 1 1 2015-01-01 -1.1116773 0.6142841 -1.1714875 0.87494857 -0.3232590
## 2 2 2015-01-02 1.1799015 0.8463138 -0.3389054 1.75723234 -1.4957872
## 3 3 2015-01-03 -0.9506309 -0.6169158 -0.3153580 -0.07181477 0.4665644
## 4 4 2015-01-04 0.1975220 0.7934394 -0.2378307 0.58933216 -0.8207184
## 5 5 2015-01-05 0.5255518 -0.4919001 -0.9546156 -0.84600664 1.6570551
## 6 6 2015-01-06 -0.4800159 -1.1043834 -0.1293722 0.82895860 -0.5542707
## q2_m2_w2
## 1 0.9876592
## 2 -0.8978570
## 3 -1.9269922
## 4 -0.2030309
## 5 0.1293742
## 6 1.1216944
glimpse(df)
## Rows: 10
## Columns: 8
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ date <date> 2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05, 2…
## $ q1_m1_w1 <dbl> -1.1116773, 1.1799015, -0.9506309, 0.1975220, 0.5255518, -0.…
## $ q1_m1_w2 <dbl> 0.6142841, 0.8463138, -0.6169158, 0.7934394, -0.4919001, -1.1…
## $ q1_m2_w3 <dbl> -1.1714875, -0.3389054, -0.3153580, -0.2378307, -0.9546156, -…
## $ q2_m1_w1 <dbl> 0.87494857, 1.75723234, -0.07181477, 0.58933216, -0.84600664,…
## $ q2_m2_w1 <dbl> -0.3232590, -1.4957872, 0.4665644, -0.8207184, 1.6570551, -0.…
## $ q2_m2_w2 <dbl> 0.9876592, -0.8978570, -1.9269922, -0.2030309, 0.1293742, 1.1…
tidyr::pivot_longer
First, let’s pivot…
df_tidy <- df %>%
pivot_longer(q1_m1_w1:q2_m2_w2, names_to="key", values_to="value")
head(df_tidy)
## # A tibble: 6 × 4
## id date key value
## <int> <date> <chr> <dbl>
## 1 1 2015-01-01 q1_m1_w1 -1.11
## 2 1 2015-01-01 q1_m1_w2 0.614
## 3 1 2015-01-01 q1_m2_w3 -1.17
## 4 1 2015-01-01 q2_m1_w1 0.875
## 5 1 2015-01-01 q2_m2_w1 -0.323
## 6 1 2015-01-01 q2_m2_w2 0.988
tidyr::separate
# separate 1 col into 3 cols
df_sep <- df_tidy %>%
separate(key, into = c("quarter", "month", "week"))
head(df_sep)
## # A tibble: 6 × 6
## id date quarter month week value
## <int> <date> <chr> <chr> <chr> <dbl>
## 1 1 2015-01-01 q1 m1 w1 -1.11
## 2 1 2015-01-01 q1 m1 w2 0.614
## 3 1 2015-01-01 q1 m2 w3 -1.17
## 4 1 2015-01-01 q2 m1 w1 0.875
## 5 1 2015-01-01 q2 m2 w1 -0.323
## 6 1 2015-01-01 q2 m2 w2 0.988
# separate 1 col into 2 cols
df_sep2 <- df_tidy %>%
separate(key, into = c("quarter", "period"), extra = "merge")
head(df_sep2)
## # A tibble: 6 × 5
## id date quarter period value
## <int> <date> <chr> <chr> <dbl>
## 1 1 2015-01-01 q1 m1_w1 -1.11
## 2 1 2015-01-01 q1 m1_w2 0.614
## 3 1 2015-01-01 q1 m2_w3 -1.17
## 4 1 2015-01-01 q2 m1_w1 0.875
## 5 1 2015-01-01 q2 m2_w1 -0.323
## 6 1 2015-01-01 q2 m2_w2 0.988
stringr vs. tidyr separate by regular expression
tidyr::extract
Extract
is essentially the same as
separate
, let’s see how…
# extract
df_ext <- df_sep2 %>%
extract(period, into = "month")
head(df_ext)
## # A tibble: 6 × 5
## id date quarter month value
## <int> <date> <chr> <chr> <dbl>
## 1 1 2015-01-01 q1 m1 -1.11
## 2 1 2015-01-01 q1 m1 0.614
## 3 1 2015-01-01 q1 m2 -1.17
## 4 1 2015-01-01 q2 m1 0.875
## 5 1 2015-01-01 q2 m2 -0.323
## 6 1 2015-01-01 q2 m2 0.988
We can see that, by default, it’s taking the first alphanumeric chunk
of the value, and extract
ing it. We can specify much more
complex patterns:
# this gives us same output as separate
df_ext <- df_sep2 %>%
extract(period, into = c("month", "week"),
regex = "([[:alnum:]]+)_([[:alnum:]]+)")
head(df_ext)
## # A tibble: 6 × 6
## id date quarter month week value
## <int> <date> <chr> <chr> <chr> <dbl>
## 1 1 2015-01-01 q1 m1 w1 -1.11
## 2 1 2015-01-01 q1 m1 w2 0.614
## 3 1 2015-01-01 q1 m2 w3 -1.17
## 4 1 2015-01-01 q2 m1 w1 0.875
## 5 1 2015-01-01 q2 m2 w1 -0.323
## 6 1 2015-01-01 q2 m2 w2 0.988
tidyr::unite
The unite()
function lets us reverse the process:
# let's say we want to combine quarter and month with an underscore
df_uni <- df_sep %>%
unite(period, quarter:month) # sep = "_" is the default arg
head(df_uni)
## # A tibble: 6 × 5
## id date period week value
## <int> <date> <chr> <chr> <dbl>
## 1 1 2015-01-01 q1_m1 w1 -1.11
## 2 1 2015-01-01 q1_m1 w2 0.614
## 3 1 2015-01-01 q1_m2 w3 -1.17
## 4 1 2015-01-01 q2_m1 w1 0.875
## 5 1 2015-01-01 q2_m2 w1 -0.323
## 6 1 2015-01-01 q2_m2 w2 0.988
# let's say we want to combine quarter and month with nothing
df_uni <- df_sep %>%
unite(period, quarter:month, sep = "")
head(df_uni)
## # A tibble: 6 × 5
## id date period week value
## <int> <date> <chr> <chr> <dbl>
## 1 1 2015-01-01 q1m1 w1 -1.11
## 2 1 2015-01-01 q1m1 w2 0.614
## 3 1 2015-01-01 q1m2 w3 -1.17
## 4 1 2015-01-01 q2m1 w1 0.875
## 5 1 2015-01-01 q2m2 w1 -0.323
## 6 1 2015-01-01 q2m2 w2 0.988
tidyr::pivot_wider
# finally let's spread
df_spread <- df_uni %>%
pivot_wider(names_from=week, values_from=value) # fill = NA is default arg
head(df_spread)
## # A tibble: 6 × 6
## id date period w1 w2 w3
## <int> <date> <chr> <dbl> <dbl> <dbl>
## 1 1 2015-01-01 q1m1 -1.11 0.614 NA
## 2 1 2015-01-01 q1m2 NA NA -1.17
## 3 1 2015-01-01 q2m1 0.875 NA NA
## 4 1 2015-01-01 q2m2 -0.323 0.988 NA
## 5 2 2015-01-02 q1m1 1.18 0.846 NA
## 6 2 2015-01-02 q1m2 NA NA -0.339
pivot_longer() %>% separate() %>% pivot_wider()
)All in one, if we had wanted to end up with one row per month:
df_tidiest <- df %>%
pivot_longer(q1_m1_w1:q2_m2_w2, names_to="key", values_to="value") %>%
separate(key, into = c("quarter", "month", "week")) %>%
pivot_wider(names_from=week, values_from=value)
head(df_tidiest)
## # A tibble: 6 × 7
## id date quarter month w1 w2 w3
## <int> <date> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 2015-01-01 q1 m1 -1.11 0.614 NA
## 2 1 2015-01-01 q1 m2 NA NA -1.17
## 3 1 2015-01-01 q2 m1 0.875 NA NA
## 4 1 2015-01-01 q2 m2 -0.323 0.988 NA
## 5 2 2015-01-02 q1 m1 1.18 0.846 NA
## 6 2 2015-01-02 q1 m2 NA NA -0.339
broom
“The broom package takes the messy output of built-in functions in R,
such as lm
, nls
, or t.test
, and
turns them into tidy data frames.” So, broom tidies output from other R
functions that are un-tidy.
See here for list of functions: https://github.com/dgrtwo/broom
Vignette: ftp://cran.r-project.org/pub/R/web/packages/broom/vignettes/broom.html
fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear),
data = mtcars)
Un-tidy output from lm
summary(fit)
##
## Call:
## lm(formula = mpg ~ qsec + factor(am) + wt + factor(gear), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5064 -1.5220 -0.7517 1.3841 4.6345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.3650 8.3730 1.118 0.27359
## qsec 1.2449 0.3828 3.252 0.00317 **
## factor(am)1 3.1505 1.9405 1.624 0.11654
## wt -3.9263 0.7428 -5.286 1.58e-05 ***
## factor(gear)4 -0.2682 1.6555 -0.162 0.87257
## factor(gear)5 -0.2697 2.0632 -0.131 0.89698
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.55 on 26 degrees of freedom
## Multiple R-squared: 0.8498, Adjusted R-squared: 0.8209
## F-statistic: 29.43 on 5 and 26 DF, p-value: 6.379e-10
Tidy output from broom
tidy(fit)
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 9.37 8.37 1.12 0.274
## 2 qsec 1.24 0.383 3.25 0.00317
## 3 factor(am)1 3.15 1.94 1.62 0.117
## 4 wt -3.93 0.743 -5.29 0.0000158
## 5 factor(gear)4 -0.268 1.66 -0.162 0.873
## 6 factor(gear)5 -0.270 2.06 -0.131 0.897
This is very helfpul for making e.g. regression tables for papers!