Making tables is often when we stretch our dplyr
skills.
Here are some additional more advanced examples that you may find
helpful for today’s assignment!
We’ll use the pnwflights14
package to practice our
dplyr
skills. We need to download the package from github
using devtools
.
# once per machine
install.packages("devtools")
devtools::install_github("ismayc/pnwflights14")
Now, we need to load the flights
dataset from the
pnwflights14
package.
# once per work session
library(pnwflights14)
data("flights", package = "pnwflights14")
dplyr
Tricksdplyr::select
Recall that we use select to specify which columns in a dataframe you’d like to keep by name:
# keep these 2 cols
mini_flights <- flights %>%
select(carrier, flight)
glimpse(mini_flights)
## Rows: 162,049
## Columns: 2
## $ carrier <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", "UA", "US", "DL", "AA", "AS", "AS", "F9", "AA", "AS", "F9", "UA…
## $ flight <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569, 649, 1634, 1094, 81, 200, 796, 2240, 426, 144, 1573, 344, 1136…
# keep first five cols
first_five <- flights %>%
select(year, month, day, dep_time, dep_delay)
glimpse(first_five)
## Rows: 162,049
## Columns: 5
## $ year <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557, 557, 558, 559, 600, 600, 602, 602, 603, 603, 606, 606, 614, …
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2, -1, 0, -10, -3, -3, -2, -3, -4, 6, -1, 2, -2, -6, -2, 2, 13,…
# alternatively, specify range
first_five <- flights %>%
select(year:dep_delay)
glimpse(first_five)
## Rows: 162,049
## Columns: 5
## $ year <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557, 557, 558, 559, 600, 600, 602, 602, 603, 603, 606, 606, 614, …
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2, -1, 0, -10, -3, -3, -2, -3, -4, 6, -1, 2, -2, -6, -2, 2, 13,…
We can also choose the columns we want by negation, that is, you can specify which columns to drop instead of keep. This way, all variables not listed are kept.
# we can also use negation
all_but_year <- flights %>%
select(-year)
glimpse(all_but_year)
## Rows: 162,049
## Columns: 15
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557, 557, 558, 559, 600, 600, 602, 602, 603, 603, 606, 606, 614, …
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2, -1, 0, -10, -3, -3, -2, -3, -4, 6, -1, 2, -2, -6, -2, 2, 13,…
## $ arr_time <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907, 837, 1134, 825, 801, 916, 1151, 842, 943, 1204, 755, 1202, …
## $ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, -25, -2, -9, -19, -8, 5, 7, -17, -2, -19, 3, -2, -12, -30, -7…
## $ carrier <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", "UA", "US", "DL", "AA", "AS", "AS", "F9", "AA", "AS", "F9", "…
## $ tailnum <chr> "N508AS", "N195UW", "N37422", "N547UW", "N762AS", "N806DN", "N14219", "N813UA", "N75433", "N574UA", "N36476", "N548UW", "…
## $ flight <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569, 649, 1634, 1094, 81, 200, 796, 2240, 426, 144, 1573, 344, 11…
## $ origin <chr> "PDX", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "PDX", "SEA", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "SEA", "PDX", "S…
## $ dest <chr> "ANC", "CLT", "IAH", "CLT", "ANC", "DTW", "ORD", "IAH", "DEN", "EWR", "DEN", "PHX", "SLC", "DFW", "ANC", "SJC", "DEN", "O…
## $ air_time <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, 82, 184, 188, 100, 125, 206, 125, 129, 203, 94, 203, 98, 87, …
## $ distance <dbl> 1542, 2279, 1825, 2282, 1448, 1927, 1721, 1825, 1024, 2402, 991, 1009, 689, 1660, 1448, 697, 991, 1721, 954, 1024, 1739, …
## $ hour <dbl> 0, 0, 0, 0, 0, 0, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6…
## $ minute <dbl> 1, 4, 8, 28, 34, 37, 46, 26, 27, 36, 41, 49, 50, 57, 57, 58, 59, 0, 0, 2, 2, 3, 3, 6, 6, 14, 17, 18, 19, 20, 22, 23, 24, …
dplyr::select
comes with several other helper
functions…
depart <- flights %>%
select(starts_with("dep_"))
glimpse(depart)
## Rows: 162,049
## Columns: 2
## $ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557, 557, 558, 559, 600, 600, 602, 602, 603, 603, 606, 606, 614, …
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2, -1, 0, -10, -3, -3, -2, -3, -4, 6, -1, 2, -2, -6, -2, 2, 13,…
times <- flights %>%
select(contains("time"))
glimpse(times)
## Rows: 162,049
## Columns: 3
## $ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557, 557, 558, 559, 600, 600, 602, 602, 603, 603, 606, 606, 614, 6…
## $ arr_time <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907, 837, 1134, 825, 801, 916, 1151, 842, 943, 1204, 755, 1202, 8…
## $ air_time <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, 82, 184, 188, 100, 125, 206, 125, 129, 203, 94, 203, 98, 87, 1…
# note that we are not actually saving the new dataframe here
flights %>%
select(-contains("time")) %>% head()
## # A tibble: 6 × 13
## year month day dep_delay arr_delay carrier tailnum flight origin dest distance hour minute
## <int> <int> <int> <dbl> <dbl> <chr> <chr> <int> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2014 1 1 96 70 AS N508AS 145 PDX ANC 1542 0 1
## 2 2014 1 1 -6 -23 US N195UW 1830 SEA CLT 2279 0 4
## 3 2014 1 1 13 -4 UA N37422 1609 PDX IAH 1825 0 8
## 4 2014 1 1 -2 -23 US N547UW 466 PDX CLT 2282 0 28
## 5 2014 1 1 44 43 AS N762AS 121 SEA ANC 1448 0 34
## 6 2014 1 1 82 88 DL N806DN 1823 SEA DTW 1927 0 37
delays <- flights %>%
select(ends_with("delay"))
glimpse(delays)
## Rows: 162,049
## Columns: 2
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2, -1, 0, -10, -3, -3, -2, -3, -4, 6, -1, 2, -2, -6, -2, 2, 13,…
## $ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, -25, -2, -9, -19, -8, 5, 7, -17, -2, -19, 3, -2, -12, -30, -7…
One of my favorite select helper functions is
everything()
, which allows you to use select to keep
all your variables, but easily rearrange the columns
without having to list all the variables to keep/drop.
new_order <- flights %>%
select(origin, dest, everything())
head(new_order)
## # A tibble: 6 × 16
## origin dest year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight air_time distance hour minute
## <chr> <chr> <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 PDX ANC 2014 1 1 1 96 235 70 AS N508AS 145 194 1542 0 1
## 2 SEA CLT 2014 1 1 4 -6 738 -23 US N195UW 1830 252 2279 0 4
## 3 PDX IAH 2014 1 1 8 13 548 -4 UA N37422 1609 201 1825 0 8
## 4 PDX CLT 2014 1 1 28 -2 800 -23 US N547UW 466 251 2282 0 28
## 5 SEA ANC 2014 1 1 34 44 325 43 AS N762AS 121 201 1448 0 34
## 6 SEA DTW 2014 1 1 37 82 747 88 DL N806DN 1823 224 1927 0 37
# with negation
new_order2 <- flights %>%
select(origin, dest, everything(), -year)
head(new_order2)
## # A tibble: 6 × 15
## origin dest month day dep_time dep_delay arr_time arr_delay carrier tailnum flight air_time distance hour minute
## <chr> <chr> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 PDX ANC 1 1 1 96 235 70 AS N508AS 145 194 1542 0 1
## 2 SEA CLT 1 1 4 -6 738 -23 US N195UW 1830 252 2279 0 4
## 3 PDX IAH 1 1 8 13 548 -4 UA N37422 1609 201 1825 0 8
## 4 PDX CLT 1 1 28 -2 800 -23 US N547UW 466 251 2282 0 28
## 5 SEA ANC 1 1 34 44 325 43 AS N762AS 121 201 1448 0 34
## 6 SEA DTW 1 1 37 82 747 88 DL N806DN 1823 224 1927 0 37
We can also rename variables within select.
flights2 <- flights %>%
select(tail_num = tailnum, everything())
head(flights2)
## # A tibble: 6 × 16
## tail_num year month day dep_time dep_delay arr_time arr_delay carrier flight origin dest air_time distance hour minute
## <chr> <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 N508AS 2014 1 1 1 96 235 70 AS 145 PDX ANC 194 1542 0 1
## 2 N195UW 2014 1 1 4 -6 738 -23 US 1830 SEA CLT 252 2279 0 4
## 3 N37422 2014 1 1 8 13 548 -4 UA 1609 PDX IAH 201 1825 0 8
## 4 N547UW 2014 1 1 28 -2 800 -23 US 466 PDX CLT 251 2282 0 28
## 5 N762AS 2014 1 1 34 44 325 43 AS 121 SEA ANC 201 1448 0 34
## 6 N806DN 2014 1 1 37 82 747 88 DL 1823 SEA DTW 224 1927 0 37
If you don’t want to move the renamed variables within your
dataframe, you can use the rename
function.
flights3 <- flights %>%
rename(tail_num = tailnum)
glimpse(flights3)
## Rows: 162,049
## Columns: 16
## $ year <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557, 557, 558, 559, 600, 600, 602, 602, 603, 603, 606, 606, 614, …
## $ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2, -1, 0, -10, -3, -3, -2, -3, -4, 6, -1, 2, -2, -6, -2, 2, 13,…
## $ arr_time <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907, 837, 1134, 825, 801, 916, 1151, 842, 943, 1204, 755, 1202, …
## $ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, -25, -2, -9, -19, -8, 5, 7, -17, -2, -19, 3, -2, -12, -30, -7…
## $ carrier <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", "UA", "US", "DL", "AA", "AS", "AS", "F9", "AA", "AS", "F9", "…
## $ tail_num <chr> "N508AS", "N195UW", "N37422", "N547UW", "N762AS", "N806DN", "N14219", "N813UA", "N75433", "N574UA", "N36476", "N548UW", "…
## $ flight <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569, 649, 1634, 1094, 81, 200, 796, 2240, 426, 144, 1573, 344, 11…
## $ origin <chr> "PDX", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "PDX", "SEA", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "SEA", "PDX", "S…
## $ dest <chr> "ANC", "CLT", "IAH", "CLT", "ANC", "DTW", "ORD", "IAH", "DEN", "EWR", "DEN", "PHX", "SLC", "DFW", "ANC", "SJC", "DEN", "O…
## $ air_time <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, 82, 184, 188, 100, 125, 206, 125, 129, 203, 94, 203, 98, 87, …
## $ distance <dbl> 1542, 2279, 1825, 2282, 1448, 1927, 1721, 1825, 1024, 2402, 991, 1009, 689, 1660, 1448, 697, 991, 1721, 954, 1024, 1739, …
## $ hour <dbl> 0, 0, 0, 0, 0, 0, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6…
## $ minute <dbl> 1, 4, 8, 28, 34, 37, 46, 26, 27, 36, 41, 49, 50, 57, 57, 58, 59, 0, 0, 2, 2, 3, 3, 6, 6, 14, 17, 18, 19, 20, 22, 23, 24, …
dplyr::filter
As we have previously seen, filter
has many flexible
uses:
# flights taking off from PDX
pdx <- flights %>%
filter(origin == "PDX")
head(pdx)
## # A tibble: 6 × 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2014 1 1 1 96 235 70 AS N508AS 145 PDX ANC 194 1542 0 1
## 2 2014 1 1 8 13 548 -4 UA N37422 1609 PDX IAH 201 1825 0 8
## 3 2014 1 1 28 -2 800 -23 US N547UW 466 PDX CLT 251 2282 0 28
## 4 2014 1 1 526 -4 1148 15 UA N813UA 229 PDX IAH 217 1825 5 26
## 5 2014 1 1 541 1 911 4 UA N36476 1569 PDX DEN 130 991 5 41
## 6 2014 1 1 549 24 907 12 US N548UW 649 PDX PHX 122 1009 5 49
# january flights from PDX
pdx_jan <- flights %>%
filter(origin == "PDX", month == 1) # the comma is an "and"
head(pdx_jan)
## # A tibble: 6 × 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2014 1 1 1 96 235 70 AS N508AS 145 PDX ANC 194 1542 0 1
## 2 2014 1 1 8 13 548 -4 UA N37422 1609 PDX IAH 201 1825 0 8
## 3 2014 1 1 28 -2 800 -23 US N547UW 466 PDX CLT 251 2282 0 28
## 4 2014 1 1 526 -4 1148 15 UA N813UA 229 PDX IAH 217 1825 5 26
## 5 2014 1 1 541 1 911 4 UA N36476 1569 PDX DEN 130 991 5 41
## 6 2014 1 1 549 24 907 12 US N548UW 649 PDX PHX 122 1009 5 49
# flights to ATL (Atlanta) or BNA (Nashville)
to_south <- flights %>%
filter(dest == "ATL" | dest == "BNA") %>% # | is "or"
select(origin, dest, everything())
head(to_south)
## # A tibble: 6 × 16
## origin dest year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight air_time distance hour minute
## <chr> <chr> <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 SEA ATL 2014 1 1 624 -6 1401 -6 DL N617DL 968 235 2182 6 24
## 2 SEA ATL 2014 1 1 802 -3 1533 -17 AS N532AS 742 249 2182 8 2
## 3 SEA ATL 2014 1 1 824 -1 1546 -14 DL N633DL 128 234 2182 8 24
## 4 PDX ATL 2014 1 1 944 -6 1727 -8 AS N548AS 752 252 2172 9 44
## 5 PDX ATL 2014 1 1 1054 94 1807 84 DL N377DA 1502 237 2172 10 54
## 6 SEA ATL 2014 1 1 1158 6 1915 -14 DL N6712B 1962 234 2182 11 58
# flights from PDX to ATL (Atlanta) or BNA (Nashville)
pdx_to_south <- flights %>%
filter(origin == "PDX", dest == "ATL" | dest == "BNA") %>% # | is "or"
select(origin, dest, everything())
head(pdx_to_south)
## # A tibble: 6 × 16
## origin dest year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight air_time distance hour minute
## <chr> <chr> <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 PDX ATL 2014 1 1 944 -6 1727 -8 AS N548AS 752 252 2172 9 44
## 2 PDX ATL 2014 1 1 1054 94 1807 84 DL N377DA 1502 237 2172 10 54
## 3 PDX ATL 2014 1 1 1323 -2 2038 -15 DL N393DA 773 235 2172 13 23
## 4 PDX ATL 2014 1 1 2253 8 611 4 DL N371DA 503 240 2172 22 53
## 5 PDX ATL 2014 1 2 627 -3 1350 -7 DL N3746H 1156 244 2172 6 27
## 6 PDX ATL 2014 1 2 918 -2 1643 -2 DL N3756 1940 249 2172 9 18
# alternatively, using group membership
south_dests <- c("ATL", "BNA")
pdx_to_south2 <- flights %>%
filter(origin == "PDX", dest %in% south_dests) %>%
select(origin, dest, everything())
head(pdx_to_south2)
## # A tibble: 6 × 16
## origin dest year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight air_time distance hour minute
## <chr> <chr> <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 PDX ATL 2014 1 1 944 -6 1727 -8 AS N548AS 752 252 2172 9 44
## 2 PDX ATL 2014 1 1 1054 94 1807 84 DL N377DA 1502 237 2172 10 54
## 3 PDX ATL 2014 1 1 1323 -2 2038 -15 DL N393DA 773 235 2172 13 23
## 4 PDX ATL 2014 1 1 2253 8 611 4 DL N371DA 503 240 2172 22 53
## 5 PDX ATL 2014 1 2 627 -3 1350 -7 DL N3746H 1156 244 2172 6 27
## 6 PDX ATL 2014 1 2 918 -2 1643 -2 DL N3756 1940 249 2172 9 18
# flights delayed by 1 hour or more
delay_1plus <- flights %>%
filter(dep_delay >= 60)
head(delay_1plus)
## # A tibble: 6 × 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2014 1 1 1 96 235 70 AS N508AS 145 PDX ANC 194 1542 0 1
## 2 2014 1 1 37 82 747 88 DL N806DN 1823 SEA DTW 224 1927 0 37
## 3 2014 1 1 346 227 936 219 UA N14219 1481 SEA ORD 202 1721 3 46
## 4 2014 1 1 650 90 1037 91 US N626AW 460 SEA PHX 141 1107 6 50
## 5 2014 1 1 959 164 1137 157 AS N534AS 805 SEA SMF 77 605 9 59
## 6 2014 1 1 1008 68 1242 64 AS N788AS 456 SEA LAX 129 954 10 8
# flights delayed by 1 hour, but not more than 2 hours
delay_1hr <- flights %>%
filter(dep_delay >= 60, dep_delay < 120)
head(delay_1hr)
## # A tibble: 6 × 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2014 1 1 1 96 235 70 AS N508AS 145 PDX ANC 194 1542 0 1
## 2 2014 1 1 37 82 747 88 DL N806DN 1823 SEA DTW 224 1927 0 37
## 3 2014 1 1 650 90 1037 91 US N626AW 460 SEA PHX 141 1107 6 50
## 4 2014 1 1 1008 68 1242 64 AS N788AS 456 SEA LAX 129 954 10 8
## 5 2014 1 1 1014 75 1613 81 UA N37408 1444 SEA ORD 201 1721 10 14
## 6 2014 1 1 1036 81 1408 63 OO N218AG 3466 PDX TUS 134 1119 10 36
range(delay_1hr$dep_delay, na.rm = TRUE)
## [1] 60 119
# even more efficient using between (always inclusive)
delay_bwn <- flights %>%
filter(between(dep_delay, 60, 119))
head(delay_bwn)
## # A tibble: 6 × 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2014 1 1 1 96 235 70 AS N508AS 145 PDX ANC 194 1542 0 1
## 2 2014 1 1 37 82 747 88 DL N806DN 1823 SEA DTW 224 1927 0 37
## 3 2014 1 1 650 90 1037 91 US N626AW 460 SEA PHX 141 1107 6 50
## 4 2014 1 1 1008 68 1242 64 AS N788AS 456 SEA LAX 129 954 10 8
## 5 2014 1 1 1014 75 1613 81 UA N37408 1444 SEA ORD 201 1721 10 14
## 6 2014 1 1 1036 81 1408 63 OO N218AG 3466 PDX TUS 134 1119 10 36
range(delay_bwn$dep_delay, na.rm = TRUE)
## [1] 60 119
dplyr::arrange
# default is ascending order
flights %>%
arrange(year, month, day) %>% head(n=20)
## # A tibble: 20 × 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2014 1 1 1 96 235 70 AS N508AS 145 PDX ANC 194 1542 0 1
## 2 2014 1 1 4 -6 738 -23 US N195UW 1830 SEA CLT 252 2279 0 4
## 3 2014 1 1 8 13 548 -4 UA N37422 1609 PDX IAH 201 1825 0 8
## 4 2014 1 1 28 -2 800 -23 US N547UW 466 PDX CLT 251 2282 0 28
## 5 2014 1 1 34 44 325 43 AS N762AS 121 SEA ANC 201 1448 0 34
## 6 2014 1 1 37 82 747 88 DL N806DN 1823 SEA DTW 224 1927 0 37
## 7 2014 1 1 346 227 936 219 UA N14219 1481 SEA ORD 202 1721 3 46
## 8 2014 1 1 526 -4 1148 15 UA N813UA 229 PDX IAH 217 1825 5 26
## 9 2014 1 1 527 7 917 24 UA N75433 1576 SEA DEN 136 1024 5 27
## 10 2014 1 1 536 1 1334 -6 UA N574UA 478 SEA EWR 268 2402 5 36
## 11 2014 1 1 541 1 911 4 UA N36476 1569 PDX DEN 130 991 5 41
## 12 2014 1 1 549 24 907 12 US N548UW 649 PDX PHX 122 1009 5 49
## 13 2014 1 1 550 0 837 -12 DL N660DL 1634 SEA SLC 82 689 5 50
## 14 2014 1 1 557 -3 1134 -16 AA N3JLAA 1094 SEA DFW 184 1660 5 57
## 15 2014 1 1 557 -3 825 -25 AS N562AS 81 SEA ANC 188 1448 5 57
## 16 2014 1 1 558 -2 801 -2 AS N402AS 200 SEA SJC 100 697 5 58
## 17 2014 1 1 559 -1 916 -9 F9 N210FR 796 PDX DEN 125 991 5 59
## 18 2014 1 1 600 0 1151 -19 AA N3JFAA 2240 SEA ORD 206 1721 6 0
## 19 2014 1 1 600 -10 842 -8 AS N786AS 426 SEA LAX 125 954 6 0
## 20 2014 1 1 602 -3 943 5 F9 N201FR 144 SEA DEN 129 1024 6 2
# descending order
flights %>%
arrange(desc(year), desc(month), desc(day)) %>% head(n=20)
## # A tibble: 20 × 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2014 12 31 2 12 601 31 AA N3JKAA 1230 SEA DFW 210 1660 0 2
## 2 2014 12 31 27 -3 623 3 AA N3EWAA 1431 SEA ORD 211 1721 0 27
## 3 2014 12 31 39 14 324 4 AS N762AS 135 SEA ANC 207 1448 0 39
## 4 2014 12 31 40 0 549 0 DL N757AT 2440 SEA MSP 174 1399 0 40
## 5 2014 12 31 52 -8 917 -21 AA N3JFAA 371 SEA MIA 306 2724 0 52
## 6 2014 12 31 54 4 621 17 DL N128DL 1670 PDX MSP 174 1426 0 54
## 7 2014 12 31 56 61 848 80 DL N655DL 929 SEA ATL 249 2182 0 56
## 8 2014 12 31 512 -3 904 4 US N653AW 480 SEA PHX 154 1107 5 12
## 9 2014 12 31 515 -5 855 5 US N580UW 425 PDX PHX 144 1009 5 15
## 10 2014 12 31 534 4 859 7 UA N34460 1075 PDX DEN 125 991 5 34
## 11 2014 12 31 546 1 916 -4 WN N8323C 757 PDX DEN 126 991 5 46
## 12 2014 12 31 548 -2 1351 -13 UA N461UA 665 PDX EWR 287 2434 5 48
## 13 2014 12 31 549 4 1208 12 UA N68807 1457 SEA IAH 223 1874 5 49
## 14 2014 12 31 550 0 922 2 WN N797MX 2121 PDX PHX 138 1009 5 50
## 15 2014 12 31 551 -4 1202 12 AA N3HXAA 1094 SEA DFW 211 1660 5 51
## 16 2014 12 31 551 -9 744 -15 AS N570AS 342 SEA OAK 91 671 5 51
## 17 2014 12 31 555 -10 824 -1 AS N548AS 602 SEA LAS 124 867 5 55
## 18 2014 12 31 558 -2 849 0 DL N668DN 1831 PDX SLC 85 630 5 58
## 19 2014 12 31 558 -2 1149 4 AA N436AA 1534 PDX DFW 211 1616 5 58
## 20 2014 12 31 558 -2 738 -4 AS N585AS 406 PDX SJC 80 569 5 58
dplyr::distinct
# unique origin-dest combinations
flights %>%
select(origin, dest) %>%
distinct %>% head(n=50)
## # A tibble: 50 × 2
## origin dest
## <chr> <chr>
## 1 PDX ANC
## 2 SEA CLT
## 3 PDX IAH
## 4 PDX CLT
## 5 SEA ANC
## 6 SEA DTW
## 7 SEA ORD
## 8 SEA DEN
## 9 SEA EWR
## 10 PDX DEN
## # ℹ 40 more rows
# all unique destinations from PDX (there are 49, so we'll just list the first few)
from_pdx <- flights %>%
filter(origin == "PDX") %>%
select(origin, dest) %>%
distinct(dest)
head(from_pdx)
## # A tibble: 6 × 1
## dest
## <chr>
## 1 ANC
## 2 IAH
## 3 CLT
## 4 DEN
## 5 PHX
## 6 ORD
dplyr::mutate
mutate
is used to transform existing variables or create
new ones; here, we are using it to create an indicator variable that
identifies flights that were entirely on-time.
# add total delay variable
flights %>%
mutate(tot_delay = dep_delay + arr_delay) %>%
select(origin, dest, ends_with("delay"), everything()) %>%
head
## # A tibble: 6 × 17
## origin dest dep_delay arr_delay tot_delay year month day dep_time arr_time carrier tailnum flight air_time distance hour minute
## <chr> <chr> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 PDX ANC 96 70 166 2014 1 1 1 235 AS N508AS 145 194 1542 0 1
## 2 SEA CLT -6 -23 -29 2014 1 1 4 738 US N195UW 1830 252 2279 0 4
## 3 PDX IAH 13 -4 9 2014 1 1 8 548 UA N37422 1609 201 1825 0 8
## 4 PDX CLT -2 -23 -25 2014 1 1 28 800 US N547UW 466 251 2282 0 28
## 5 SEA ANC 44 43 87 2014 1 1 34 325 AS N762AS 121 201 1448 0 34
## 6 SEA DTW 82 88 170 2014 1 1 37 747 DL N806DN 1823 224 1927 0 37
# flights that were delayed at departure had on time or early arrivals?
arrivals <- flights %>%
mutate(arr_ok = ifelse(dep_delay > 0 & arr_delay <= 0, 1, 0)) %>%
select(origin, dest, ends_with("delay"), carrier, arr_ok)
# peek at it
arrivals %>%
filter(arr_ok == 1) %>%
head
## # A tibble: 6 × 6
## origin dest dep_delay arr_delay carrier arr_ok
## <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 PDX IAH 13 -4 UA 1
## 2 SEA EWR 1 -6 UA 1
## 3 SEA SAN 2 -12 AS 1
## 4 PDX EWR 2 -19 UA 1
## 5 SEA IAH 13 -4 UA 1
## 6 PDX IAD 10 -4 UA 1
dplyr::summarise
(or dplyr::summarize
)We have seen summarise
used to calculate summary
statistics, like so:
flights %>%
summarise(mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 1
## `mean(dep_delay, na.rm = TRUE)`
## <dbl>
## 1 6.13
But this can get tedious when we want to compute several summary statistics, like so:
# we can also name that variable, and summarise multiple variables
flights %>%
summarise(mean_delay = mean(dep_delay, na.rm = TRUE),
sd_delay = sd(dep_delay, na.rm = TRUE),
median_delay = median(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 3
## mean_delay sd_delay median_delay
## <dbl> <dbl> <dbl>
## 1 6.13 29.1 -2
summarise
together with across()
lets us
automate this process a bit:
flights %>%
filter(!is.na(dep_delay)) %>%
select(dep_delay) %>%
summarise(
across(everything(),
lst(mean, sd, median)) # making this a list gets us nice column names
)
## # A tibble: 1 × 3
## dep_delay_mean dep_delay_sd dep_delay_median
## <dbl> <dbl> <dbl>
## 1 6.13 29.1 -2
# same thing over all numeric columns
flights %>%
filter(!is.na(dep_delay)) %>%
summarise(across(where(is.numeric), lst(mean, sd, median)))
## # A tibble: 1 × 36
## year_mean year_sd year_median month_mean month_sd month_median day_mean day_sd day_median dep_time_mean dep_time_sd dep_time_median
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014 0 2014 6.62 3.32 7 15.8 8.79 16 1278. 523. 1217
## # ℹ 24 more variables: dep_delay_mean <dbl>, dep_delay_sd <dbl>, dep_delay_median <dbl>, arr_time_mean <dbl>, arr_time_sd <dbl>,
## # arr_time_median <int>, arr_delay_mean <dbl>, arr_delay_sd <dbl>, arr_delay_median <dbl>, flight_mean <dbl>, flight_sd <dbl>,
## # flight_median <dbl>, air_time_mean <dbl>, air_time_sd <dbl>, air_time_median <dbl>, distance_mean <dbl>, distance_sd <dbl>,
## # distance_median <dbl>, hour_mean <dbl>, hour_sd <dbl>, hour_median <dbl>, minute_mean <dbl>, minute_sd <dbl>, minute_median <dbl>
# combine with pivot_longer, change names too
flights %>%
filter(!is.na(dep_delay)) %>%
summarise(across(dep_delay, lst(mean, stdev=sd, median))) %>%
pivot_longer(everything(), names_to="delay_stat", values_to="value", names_prefix = "dep_delay_")
## # A tibble: 3 × 2
## delay_stat value
## <chr> <dbl>
## 1 mean 6.13
## 2 stdev 29.1
## 3 median -2
We can use aggregating functions in summarise
# how many unique destinations?
summary_table <- flights %>%
summarise(tot_flights = n(),
tot_planes = n_distinct(tailnum),
tot_carriers = n_distinct(carrier),
tot_dests = n_distinct(dest),
tot_origins = n_distinct(origin))
summary_table
## # A tibble: 1 × 5
## tot_flights tot_planes tot_carriers tot_dests tot_origins
## <int> <int> <int> <int> <int>
## 1 162049 3023 11 71 2
The resulting table is a bit messy; we can reprocess using
pivot_longer
and separate
:
# chain with tidyr functions
summary_table %>%
pivot_longer(everything(), names_to="key", values_to="value") %>%
separate(key, into = c("tot", "entity")) %>%
select(-tot, total = value)
## # A tibble: 5 × 2
## entity total
## <chr> <int>
## 1 flights 162049
## 2 planes 3023
## 3 carriers 11
## 4 dests 71
## 5 origins 2
Or we can do it in a single call to pivot_longer()
:
summary_table %>%
pivot_longer(
everything(), # grab all columns
names_to="entity", # make a new column called "entity" to store the old column names
names_prefix="tot_", # remove the "tot_" prefix from the old column names
values_to="total") # put the values in a column called "total"
## # A tibble: 5 × 2
## entity total
## <chr> <int>
## 1 flights 162049
## 2 planes 3023
## 3 carriers 11
## 4 dests 71
## 5 origins 2
tidyr
We’ll work with a made up dataframe:
df <- data.frame(
id = 1:10,
date = as.Date('2015-01-01') + 0:9,
q1_m1_w1 = rnorm(10, 0, 1),
q1_m1_w2 = rnorm(10, 0, 1),
q1_m2_w3 = rnorm(10, 0, 1),
q2_m1_w1 = rnorm(10, 0, 1),
q2_m2_w1 = rnorm(10, 0, 1),
q2_m2_w2 = rnorm(10, 0, 1)
)
# HLO
head(df)
## id date q1_m1_w1 q1_m1_w2 q1_m2_w3 q2_m1_w1 q2_m2_w1 q2_m2_w2
## 1 1 2015-01-01 -1.1116773 0.6142841 -1.1714875 0.87494857 -0.3232590 0.9876592
## 2 2 2015-01-02 1.1799015 0.8463138 -0.3389054 1.75723234 -1.4957872 -0.8978570
## 3 3 2015-01-03 -0.9506309 -0.6169158 -0.3153580 -0.07181477 0.4665644 -1.9269922
## 4 4 2015-01-04 0.1975220 0.7934394 -0.2378307 0.58933216 -0.8207184 -0.2030309
## 5 5 2015-01-05 0.5255518 -0.4919001 -0.9546156 -0.84600664 1.6570551 0.1293742
## 6 6 2015-01-06 -0.4800159 -1.1043834 -0.1293722 0.82895860 -0.5542707 1.1216944
glimpse(df)
## Rows: 10
## Columns: 8
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ date <date> 2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05, 2015-01-06, 2015-01-07, 2015-01-08, 2015-01-09, 2015-01-10
## $ q1_m1_w1 <dbl> -1.1116773, 1.1799015, -0.9506309, 0.1975220, 0.5255518, -0.4800159, 0.7514034, 0.2399251, 0.7111657, 0.4055081
## $ q1_m1_w2 <dbl> 0.6142841, 0.8463138, -0.6169158, 0.7934394, -0.4919001, -1.1043834, 0.4934917, -1.9895106, 1.0143630, -0.6657672
## $ q1_m2_w3 <dbl> -1.1714875, -0.3389054, -0.3153580, -0.2378307, -0.9546156, -0.1293722, 0.6325128, 0.6125887, -0.9325299, 0.1415578
## $ q2_m1_w1 <dbl> 0.87494857, 1.75723234, -0.07181477, 0.58933216, -0.84600664, 0.82895860, 0.69227405, -1.54251402, 0.54486288, 0.97491665
## $ q2_m2_w1 <dbl> -0.3232590, -1.4957872, 0.4665644, -0.8207184, 1.6570551, -0.5542707, -1.5169967, -0.8280475, 0.7440145, -1.0260936
## $ q2_m2_w2 <dbl> 0.9876592, -0.8978570, -1.9269922, -0.2030309, 0.1293742, 1.1216944, 1.1620434, 0.8363186, 0.3752448, 0.4801174
tidyr::pivot_longer
First, let’s pivot…
df_tidy <- df %>%
pivot_longer(q1_m1_w1:q2_m2_w2, names_to="key", values_to="value")
head(df_tidy)
## # A tibble: 6 × 4
## id date key value
## <int> <date> <chr> <dbl>
## 1 1 2015-01-01 q1_m1_w1 -1.11
## 2 1 2015-01-01 q1_m1_w2 0.614
## 3 1 2015-01-01 q1_m2_w3 -1.17
## 4 1 2015-01-01 q2_m1_w1 0.875
## 5 1 2015-01-01 q2_m2_w1 -0.323
## 6 1 2015-01-01 q2_m2_w2 0.988
tidyr::separate
# separate 1 col into 3 cols
df_sep <- df_tidy %>%
separate(key, into = c("quarter", "month", "week"))
head(df_sep)
## # A tibble: 6 × 6
## id date quarter month week value
## <int> <date> <chr> <chr> <chr> <dbl>
## 1 1 2015-01-01 q1 m1 w1 -1.11
## 2 1 2015-01-01 q1 m1 w2 0.614
## 3 1 2015-01-01 q1 m2 w3 -1.17
## 4 1 2015-01-01 q2 m1 w1 0.875
## 5 1 2015-01-01 q2 m2 w1 -0.323
## 6 1 2015-01-01 q2 m2 w2 0.988
# separate 1 col into 2 cols
df_sep2 <- df_tidy %>%
separate(key, into = c("quarter", "period"), extra = "merge")
head(df_sep2)
## # A tibble: 6 × 5
## id date quarter period value
## <int> <date> <chr> <chr> <dbl>
## 1 1 2015-01-01 q1 m1_w1 -1.11
## 2 1 2015-01-01 q1 m1_w2 0.614
## 3 1 2015-01-01 q1 m2_w3 -1.17
## 4 1 2015-01-01 q2 m1_w1 0.875
## 5 1 2015-01-01 q2 m2_w1 -0.323
## 6 1 2015-01-01 q2 m2_w2 0.988
stringr vs. tidyr separate by regular expression
tidyr::extract
Extract
is essentially the same as
separate
, let’s see how…
# extract
df_ext <- df_sep2 %>%
extract(period, into = "month")
head(df_ext)
## # A tibble: 6 × 5
## id date quarter month value
## <int> <date> <chr> <chr> <dbl>
## 1 1 2015-01-01 q1 m1 -1.11
## 2 1 2015-01-01 q1 m1 0.614
## 3 1 2015-01-01 q1 m2 -1.17
## 4 1 2015-01-01 q2 m1 0.875
## 5 1 2015-01-01 q2 m2 -0.323
## 6 1 2015-01-01 q2 m2 0.988
We can see that, by default, it’s taking the first alphanumeric chunk
of the value, and extract
ing it. We can specify much more
complex patterns:
# this gives us same output as separate
df_ext <- df_sep2 %>%
extract(period, into = c("month", "week"),
regex = "([[:alnum:]]+)_([[:alnum:]]+)")
head(df_ext)
## # A tibble: 6 × 6
## id date quarter month week value
## <int> <date> <chr> <chr> <chr> <dbl>
## 1 1 2015-01-01 q1 m1 w1 -1.11
## 2 1 2015-01-01 q1 m1 w2 0.614
## 3 1 2015-01-01 q1 m2 w3 -1.17
## 4 1 2015-01-01 q2 m1 w1 0.875
## 5 1 2015-01-01 q2 m2 w1 -0.323
## 6 1 2015-01-01 q2 m2 w2 0.988
tidyr::unite
The unite()
function lets us reverse the process:
# let's say we want to combine quarter and month with an underscore
df_uni <- df_sep %>%
unite(period, quarter:month) # sep = "_" is the default arg
head(df_uni)
## # A tibble: 6 × 5
## id date period week value
## <int> <date> <chr> <chr> <dbl>
## 1 1 2015-01-01 q1_m1 w1 -1.11
## 2 1 2015-01-01 q1_m1 w2 0.614
## 3 1 2015-01-01 q1_m2 w3 -1.17
## 4 1 2015-01-01 q2_m1 w1 0.875
## 5 1 2015-01-01 q2_m2 w1 -0.323
## 6 1 2015-01-01 q2_m2 w2 0.988
# let's say we want to combine quarter and month with nothing
df_uni <- df_sep %>%
unite(period, quarter:month, sep = "")
head(df_uni)
## # A tibble: 6 × 5
## id date period week value
## <int> <date> <chr> <chr> <dbl>
## 1 1 2015-01-01 q1m1 w1 -1.11
## 2 1 2015-01-01 q1m1 w2 0.614
## 3 1 2015-01-01 q1m2 w3 -1.17
## 4 1 2015-01-01 q2m1 w1 0.875
## 5 1 2015-01-01 q2m2 w1 -0.323
## 6 1 2015-01-01 q2m2 w2 0.988
tidyr::pivot_wider
# finally let's spread
df_spread <- df_uni %>%
pivot_wider(names_from=week, values_from=value) # fill = NA is default arg
head(df_spread)
## # A tibble: 6 × 6
## id date period w1 w2 w3
## <int> <date> <chr> <dbl> <dbl> <dbl>
## 1 1 2015-01-01 q1m1 -1.11 0.614 NA
## 2 1 2015-01-01 q1m2 NA NA -1.17
## 3 1 2015-01-01 q2m1 0.875 NA NA
## 4 1 2015-01-01 q2m2 -0.323 0.988 NA
## 5 2 2015-01-02 q1m1 1.18 0.846 NA
## 6 2 2015-01-02 q1m2 NA NA -0.339
pivot_longer() %>% separate() %>% pivot_wider()
)All in one, if we had wanted to end up with one row per month:
df_tidiest <- df %>%
pivot_longer(q1_m1_w1:q2_m2_w2, names_to="key", values_to="value") %>%
separate(key, into = c("quarter", "month", "week")) %>%
pivot_wider(names_from=week, values_from=value)
head(df_tidiest)
## # A tibble: 6 × 7
## id date quarter month w1 w2 w3
## <int> <date> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 2015-01-01 q1 m1 -1.11 0.614 NA
## 2 1 2015-01-01 q1 m2 NA NA -1.17
## 3 1 2015-01-01 q2 m1 0.875 NA NA
## 4 1 2015-01-01 q2 m2 -0.323 0.988 NA
## 5 2 2015-01-02 q1 m1 1.18 0.846 NA
## 6 2 2015-01-02 q1 m2 NA NA -0.339
broom
“The broom package takes the messy output of built-in functions in R,
such as lm
, nls
, or t.test
, and
turns them into tidy data frames.” So, broom tidies output from other R
functions that are un-tidy.
See here for list of functions: https://github.com/dgrtwo/broom
Vignette: ftp://cran.r-project.org/pub/R/web/packages/broom/vignettes/broom.html
fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear),
data = mtcars)
Un-tidy output from lm
summary(fit)
##
## Call:
## lm(formula = mpg ~ qsec + factor(am) + wt + factor(gear), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5064 -1.5220 -0.7517 1.3841 4.6345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.3650 8.3730 1.118 0.27359
## qsec 1.2449 0.3828 3.252 0.00317 **
## factor(am)1 3.1505 1.9405 1.624 0.11654
## wt -3.9263 0.7428 -5.286 1.58e-05 ***
## factor(gear)4 -0.2682 1.6555 -0.162 0.87257
## factor(gear)5 -0.2697 2.0632 -0.131 0.89698
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.55 on 26 degrees of freedom
## Multiple R-squared: 0.8498, Adjusted R-squared: 0.8209
## F-statistic: 29.43 on 5 and 26 DF, p-value: 6.379e-10
Tidy output from broom
tidy(fit)
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 9.37 8.37 1.12 0.274
## 2 qsec 1.24 0.383 3.25 0.00317
## 3 factor(am)1 3.15 1.94 1.62 0.117
## 4 wt -3.93 0.743 -5.29 0.0000158
## 5 factor(gear)4 -0.268 1.66 -0.162 0.873
## 6 factor(gear)5 -0.270 2.06 -0.131 0.897
This is very helfpul for making e.g. regression tables for papers!