Become familiar with tools for generating publication-ready tables directly in R.
We will use data from the following paper: MacFarlane, H., Gorman, K., Ingham, R., Presmanes Hill, A., Papadakis, K., Kiss, G., & van Santen, J. (2017). Quantitative analysis of disfluency in children with autism spectrum disorder or language impairment. PLoS ONE, 12(3), e0173936.
mazes <- read_csv("http://bit.ly/mazes-gist") %>% 
  clean_names()  #janitor package
glimpse(mazes)
Rows: 381
Columns: 12
$ study_id <chr> "CSLU-001", "CSLU-001", "CSLU-001", "CSLU-001", "CSLU-002", "…
$ ca       <dbl> 5.6667, 5.6667, 5.6667, 5.6667, 6.5000, 6.5000, 6.5000, 6.500…
$ viq      <dbl> 124, 124, 124, 124, 124, 124, 124, 124, 108, 108, 108, 108, 1…
$ dx       <chr> "TD", "TD", "TD", "TD", "TD", "TD", "TD", "TD", "TD", "TD", "…
$ activity <chr> "Conversation", "Picture Description", "Play", "Wordless Pict…
$ content  <dbl> 24, 1, 21, 8, 3, 5, 8, 2, 25, 10, 2, 5, 32, 20, 13, 21, 27, 9…
$ filler   <dbl> 31, 2, 6, 2, 10, 3, 8, 2, 21, 13, 10, 2, 12, 9, 4, 4, 12, 6, …
$ rep      <dbl> 2, 0, 3, 0, 3, 2, 3, 0, 4, 0, 0, 0, 13, 5, 5, 6, 10, 5, 5, 2,…
$ rev      <dbl> 5, 0, 8, 4, 0, 1, 2, 0, 4, 2, 1, 3, 8, 7, 2, 8, 5, 1, 4, 2, 1…
$ fs       <dbl> 17, 1, 10, 4, 0, 2, 3, 2, 17, 8, 1, 2, 11, 8, 6, 7, 12, 3, 3,…
$ cued     <dbl> 36, 2, 6, 2, 10, 3, 9, 2, 29, 13, 11, 2, 14, 12, 4, 11, 17, 7…
$ not_cued <dbl> 50, 3, 27, 10, 13, 8, 15, 4, 38, 23, 11, 7, 42, 26, 17, 18, 3…
The workhorse for making tables in R Markdown documents is the knitr package’s kable function. This function is really versatile, but also free of fancy formatting options, for better or worse.
knitr::kablekable all tables everywhereIn order to tell RMarkdown to use kable to format all tabular output, update the YAML of your document. For HTML:
---
title: "My Awesome Data Vis Lab"
output:
  html_document:
    df_print: kable
---
You can also define the html format in the global options.
# If you don't define format here, you'll need put `format = "html"` in every kable function.
options(knitr.table.format = "html") 
# You may also wish to set this option
options(scipen = 1, digits = 2)
kable table in a chunkFor HTML output:
head(mazes) %>% 
  kable(format = "html")
| study_id | ca | viq | dx | activity | content | filler | rep | rev | fs | cued | not_cued | 
|---|---|---|---|---|---|---|---|---|---|---|---|
| CSLU-001 | 5.6667 | 124 | TD | Conversation | 24 | 31 | 2 | 5 | 17 | 36 | 50 | 
| CSLU-001 | 5.6667 | 124 | TD | Picture Description | 1 | 2 | 0 | 0 | 1 | 2 | 3 | 
| CSLU-001 | 5.6667 | 124 | TD | Play | 21 | 6 | 3 | 8 | 10 | 6 | 27 | 
| CSLU-001 | 5.6667 | 124 | TD | Wordless Picture Book | 8 | 2 | 0 | 4 | 4 | 2 | 10 | 
| CSLU-002 | 6.5000 | 124 | TD | Conversation | 3 | 10 | 3 | 0 | 0 | 10 | 13 | 
| CSLU-002 | 6.5000 | 124 | TD | Picture Description | 5 | 3 | 2 | 1 | 2 | 3 | 8 | 
Note that this is a bit rough-looking. We will use various features of kable to improve our output.
For starters, we can add a caption:
head(mazes) %>% 
  kable(format = "html", digits = 2, caption = "A table produced by kable.")
| study_id | ca | viq | dx | activity | content | filler | rep | rev | fs | cued | not_cued | 
|---|---|---|---|---|---|---|---|---|---|---|---|
| CSLU-001 | 5.67 | 124 | TD | Conversation | 24 | 31 | 2 | 5 | 17 | 36 | 50 | 
| CSLU-001 | 5.67 | 124 | TD | Picture Description | 1 | 2 | 0 | 0 | 1 | 2 | 3 | 
| CSLU-001 | 5.67 | 124 | TD | Play | 21 | 6 | 3 | 8 | 10 | 6 | 27 | 
| CSLU-001 | 5.67 | 124 | TD | Wordless Picture Book | 8 | 2 | 0 | 4 | 4 | 2 | 10 | 
| CSLU-002 | 6.50 | 124 | TD | Conversation | 3 | 10 | 3 | 0 | 0 | 10 | 13 | 
| CSLU-002 | 6.50 | 124 | TD | Picture Description | 5 | 3 | 2 | 1 | 2 | 3 | 8 | 
We can also manually specify human-readable column names:
my_maze_names <- c("Participant", "Age", "Verbal\nIQ", "Group", "Activity", "Content\nMaze", "Filler\nMaze", "Repetition", "Revision", "False\nStart", "Cued", "Not\nCued")
head(mazes) %>% 
  kable(format = "html", digits = 2, caption = "A table produced by kable.",
        col.names = my_maze_names)
| Participant | Age | Verbal IQ | Group | Activity | Content Maze | Filler Maze | Repetition | Revision | False Start | Cued | Not Cued | 
|---|---|---|---|---|---|---|---|---|---|---|---|
| CSLU-001 | 5.67 | 124 | TD | Conversation | 24 | 31 | 2 | 5 | 17 | 36 | 50 | 
| CSLU-001 | 5.67 | 124 | TD | Picture Description | 1 | 2 | 0 | 0 | 1 | 2 | 3 | 
| CSLU-001 | 5.67 | 124 | TD | Play | 21 | 6 | 3 | 8 | 10 | 6 | 27 | 
| CSLU-001 | 5.67 | 124 | TD | Wordless Picture Book | 8 | 2 | 0 | 4 | 4 | 2 | 10 | 
| CSLU-002 | 6.50 | 124 | TD | Conversation | 3 | 10 | 3 | 0 | 0 | 10 | 13 | 
| CSLU-002 | 6.50 | 124 | TD | Picture Description | 5 | 3 | 2 | 1 | 2 | 3 | 8 | 
kable tables in a chunkTo improve the visual layout of the table, we can use the kableExtra package, which provides the kable_styling() function:
head(mazes) %>% 
  kable(format = "html", digits = 2, caption = "A styled kable table.",
        col.names = my_maze_names) %>% 
  kable_styling()
| Participant | Age | Verbal IQ | Group | Activity | Content Maze | Filler Maze | Repetition | Revision | False Start | Cued | Not Cued | 
|---|---|---|---|---|---|---|---|---|---|---|---|
| CSLU-001 | 5.67 | 124 | TD | Conversation | 24 | 31 | 2 | 5 | 17 | 36 | 50 | 
| CSLU-001 | 5.67 | 124 | TD | Picture Description | 1 | 2 | 0 | 0 | 1 | 2 | 3 | 
| CSLU-001 | 5.67 | 124 | TD | Play | 21 | 6 | 3 | 8 | 10 | 6 | 27 | 
| CSLU-001 | 5.67 | 124 | TD | Wordless Picture Book | 8 | 2 | 0 | 4 | 4 | 2 | 10 | 
| CSLU-002 | 6.50 | 124 | TD | Conversation | 3 | 10 | 3 | 0 | 0 | 10 | 13 | 
| CSLU-002 | 6.50 | 124 | TD | Picture Description | 5 | 3 | 2 | 1 | 2 | 3 | 8 | 
There are lots of printing options: https://haozhu233.github.io/kableExtra/awesome_table_in_html.html
head(mazes) %>% 
  kable(format = "html", digits = 2, caption = "A non-full width zebra kable table.") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)
| study_id | ca | viq | dx | activity | content | filler | rep | rev | fs | cued | not_cued | 
|---|---|---|---|---|---|---|---|---|---|---|---|
| CSLU-001 | 5.67 | 124 | TD | Conversation | 24 | 31 | 2 | 5 | 17 | 36 | 50 | 
| CSLU-001 | 5.67 | 124 | TD | Picture Description | 1 | 2 | 0 | 0 | 1 | 2 | 3 | 
| CSLU-001 | 5.67 | 124 | TD | Play | 21 | 6 | 3 | 8 | 10 | 6 | 27 | 
| CSLU-001 | 5.67 | 124 | TD | Wordless Picture Book | 8 | 2 | 0 | 4 | 4 | 2 | 10 | 
| CSLU-002 | 6.50 | 124 | TD | Conversation | 3 | 10 | 3 | 0 | 0 | 10 | 13 | 
| CSLU-002 | 6.50 | 124 | TD | Picture Description | 5 | 3 | 2 | 1 | 2 | 3 | 8 | 
head(mazes) %>% 
  kable(format = "html", digits = 2, caption = "Over here!") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| study_id | ca | viq | dx | activity | content | filler | rep | rev | fs | cued | not_cued | 
|---|---|---|---|---|---|---|---|---|---|---|---|
| CSLU-001 | 5.67 | 124 | TD | Conversation | 24 | 31 | 2 | 5 | 17 | 36 | 50 | 
| CSLU-001 | 5.67 | 124 | TD | Picture Description | 1 | 2 | 0 | 0 | 1 | 2 | 3 | 
| CSLU-001 | 5.67 | 124 | TD | Play | 21 | 6 | 3 | 8 | 10 | 6 | 27 | 
| CSLU-001 | 5.67 | 124 | TD | Wordless Picture Book | 8 | 2 | 0 | 4 | 4 | 2 | 10 | 
| CSLU-002 | 6.50 | 124 | TD | Conversation | 3 | 10 | 3 | 0 | 0 | 10 | 13 | 
| CSLU-002 | 6.50 | 124 | TD | Picture Description | 5 | 3 | 2 | 1 | 2 | 3 | 8 | 
We can control the formatting of individual columns using the column_spec() function:
head(mazes) %>% 
  kable(format = "html", digits = 2, caption = "Over here!") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left") %>% 
  column_spec(4, width="3cm", background="lightblue", border_right=TRUE)
| study_id | ca | viq | dx | activity | content | filler | rep | rev | fs | cued | not_cued | 
|---|---|---|---|---|---|---|---|---|---|---|---|
| CSLU-001 | 5.67 | 124 | TD | Conversation | 24 | 31 | 2 | 5 | 17 | 36 | 50 | 
| CSLU-001 | 5.67 | 124 | TD | Picture Description | 1 | 2 | 0 | 0 | 1 | 2 | 3 | 
| CSLU-001 | 5.67 | 124 | TD | Play | 21 | 6 | 3 | 8 | 10 | 6 | 27 | 
| CSLU-001 | 5.67 | 124 | TD | Wordless Picture Book | 8 | 2 | 0 | 4 | 4 | 2 | 10 | 
| CSLU-002 | 6.50 | 124 | TD | Conversation | 3 | 10 | 3 | 0 | 0 | 10 | 13 | 
| CSLU-002 | 6.50 | 124 | TD | Picture Description | 5 | 3 | 2 | 1 | 2 | 3 | 8 | 
kable + kableExtra + formattablecolor_tile and color_bar are neat extras if used wisely!
http://haozhu233.github.io/kableExtra/use_kableExtra_with_formattable.html
library(formattable)
head(mazes) %>% 
  mutate(ca = color_tile("transparent", "lightpink")(ca),
         viq = color_bar("lightseagreen")(viq)) %>%
  kable("html", escape = F, caption = 'This table is colored.') %>%
  kable_styling(position = "center") %>% 
  column_spec(4, width = "3cm") 
| study_id | ca | viq | dx | activity | content | filler | rep | rev | fs | cued | not_cued | 
|---|---|---|---|---|---|---|---|---|---|---|---|
| CSLU-001 | 5.6667 | 124 | TD | Conversation | 24 | 31 | 2 | 5 | 17 | 36 | 50 | 
| CSLU-001 | 5.6667 | 124 | TD | Picture Description | 1 | 2 | 0 | 0 | 1 | 2 | 3 | 
| CSLU-001 | 5.6667 | 124 | TD | Play | 21 | 6 | 3 | 8 | 10 | 6 | 27 | 
| CSLU-001 | 5.6667 | 124 | TD | Wordless Picture Book | 8 | 2 | 0 | 4 | 4 | 2 | 10 | 
| CSLU-002 | 6.5000 | 124 | TD | Conversation | 3 | 10 | 3 | 0 | 0 | 10 | 13 | 
| CSLU-002 | 6.5000 | 124 | TD | Picture Description | 5 | 3 | 2 | 1 | 2 | 3 | 8 | 
tibble + kable + kableExtraYou can also use any of these tools with plain text tables using the tibble package to create a table. Two main functions:
tribble: enter tibble by rowstibble: enter tibble by columnsFor example, I used tribble to make this table in our slide decks:
math_table <- tibble::tribble(
  ~Operator, ~Description, ~Usage,
  "\\+", "addition", "x + y",
  "\\-", "subtraction", "x - y",
  "\\*", "multiplication", "x * y",
  "/", "division", "x / y",
  "^", "raised to the power of", "x ^ y",
  "abs", "absolute value", "abs(x)",
  "%/%", "integer division", "x %/% y",
  "%%", "remainder after division", "x %% y"
)
Then I used this chunk to print it:
```{r, results = 'asis'}
knitr::kable(math_table, format = "html", caption = "Helpful mutate functions") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
```
knitr::kable(math_table, format = "html", caption = "Helpful mutate functions") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
| Operator | Description | Usage | 
|---|---|---|
| + | addition | x + y | 
| - | subtraction | x - y | 
| * | multiplication | x * y | 
| / | division | x / y | 
| ^ | raised to the power of | x ^ y | 
| abs | absolute value | abs(x) | 
| %/% | integer division | x %/% y | 
| %% | remainder after division | x %% y | 
Sometimes you may just want to type in a table in Markdown and ignore R. Four kinds of tables may be used. The first three kinds presuppose the use of a fixed-width font, such as Courier. The fourth kind can be used with proportionally spaced fonts, as it does not require lining up columns. All of the below will render when typed outside of an R code chunk since these are based on pandoc being used to render your markdown document. Note that these should all work whether you are knitting to either html or PDF.
This code for a simple table:
  Right     Left     Center     Default
-------     ------ ----------   -------
     12     12        12            12
    123     123       123          123
      1     1          1             1
Table:  Demonstration of simple table syntax.
Produces this simple table:
| Right | Left | Center | Default | 
|---|---|---|---|
| 12 | 12 | 12 | 12 | 
| 123 | 123 | 123 | 123 | 
| 1 | 1 | 1 | 1 | 
The headers and table rows must each fit on one line. Column alignments are determined by the position of the header text relative to the dashed line below it:3
The column headers may be omitted, provided a dashed line is used to end the table.
This code for a multi-line table:
-------------------------------------------------------------
 Centered   Default           Right Left
  Header    Aligned         Aligned Aligned
----------- ------- --------------- -------------------------
   First    row                12.0 Example of a row that
                                    spans multiple lines.
  Second    row                 5.0 Here's another one. Note
                                    the blank line between
                                    rows.
-------------------------------------------------------------
Table: Here's the caption. It, too, may span
multiple lines.
Produces this multi-line table:
| Centered Header | Default Aligned | Right Aligned | Left Aligned | 
|---|---|---|---|
| First | row | 12.0 | Example of a row that spans multiple lines. | 
| Second | row | 5.0 | Here’s another one. Note the blank line between rows. | 
This code for a grid table:
: Sample grid table.
+---------------+---------------+--------------------+
| Fruit         | Price         | Advantages         |
+===============+===============+====================+
| Bananas       | $1.34         | - built-in wrapper |
|               |               | - bright color     |
+---------------+---------------+--------------------+
| Oranges       | $2.10         | - cures scurvy     |
|               |               | - tasty            |
+---------------+---------------+--------------------+
Produces this grid table:
| Fruit | Price | Advantages | 
|---|---|---|
| Bananas | $1.34 | 
  | 
| Oranges | $2.10 | 
  | 
Alignments are not supported, nor are cells that span multiple columns or rows.
This code for a pipe table:
| Right | Left | Default | Center |
|------:|:-----|---------|:------:|
|   12  |  12  |    12   |    12  |
|  123  |  123 |   123   |   123  |
|    1  |    1 |     1   |     1  |
  : Demonstration of pipe table syntax.
Produces this pipe table:
| Right | Left | Default | Center | 
|---|---|---|---|
| 12 | 12 | 12 | 12 | 
| 123 | 123 | 123 | 123 | 
| 1 | 1 | 1 | 1 | 
If you want to make tables that include R output (like output from functions like means, variances, or output from models), there are two steps:
This section covers (1). But, although there are some nice options for (2) within R Markdown via various packages, I am not dogmatic about doing everything in R Markdown, especially things like (2).
Note that many of the tidying techniques we’ll be using are ones we’ve seen before; you’ll be using them in the challenge assignment this week on this dataset, so hopefully this walk-through will give you some ideas.
dplyrWe’ll use the pnwflights14 package to practice our dplyr skills. We need to download the package from github using devtools.
# once per machine
install.packages("devtools")
devtools::install_github("ismayc/pnwflights14")
Now, we need to load the flights dataset from the pnwflights14 package.
# once per work session
data("flights", package = "pnwflights14")
dplyr::selectRecall that we use select to specify which columns in a dataframe you’d like to keep by name:
# keep these 2 cols
mini_flights <- flights %>% 
  select(carrier, flight)
glimpse(mini_flights)
Rows: 162,049
Columns: 2
$ carrier <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", "U…
$ flight  <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569, 6…
# keep first five cols
first_five <- flights %>% 
  select(year, month, day, dep_time, dep_delay)
glimpse(first_five)
Rows: 162,049
Columns: 5
$ year      <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
$ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
$ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
# alternatively, specify range
first_five <- flights %>% 
  select(year:dep_delay)
glimpse(first_five)
Rows: 162,049
Columns: 5
$ year      <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
$ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
$ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
We can also choose the columns we want by negation, that is, you can specify which columns to drop instead of keep. This way, all variables not listed are kept.
# we can also use negation
all_but_year <- flights %>% 
  select(-year)
glimpse(all_but_year)
Rows: 162,049
Columns: 15
$ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
$ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
$ arr_time  <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907…
$ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, …
$ carrier   <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", …
$ tailnum   <chr> "N508AS", "N195UW", "N37422", "N547UW", "N762AS", "N806DN", …
$ flight    <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569,…
$ origin    <chr> "PDX", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "PDX", "SEA…
$ dest      <chr> "ANC", "CLT", "IAH", "CLT", "ANC", "DTW", "ORD", "IAH", "DEN…
$ air_time  <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, …
$ distance  <dbl> 1542, 2279, 1825, 2282, 1448, 1927, 1721, 1825, 1024, 2402, …
$ hour      <dbl> 0, 0, 0, 0, 0, 0, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, …
$ minute    <dbl> 1, 4, 8, 28, 34, 37, 46, 26, 27, 36, 41, 49, 50, 57, 57, 58,…
dplyr::select comes with several other helper functions…
depart <- flights %>%
  select(starts_with("dep_"))
glimpse(depart)
Rows: 162,049
Columns: 2
$ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
$ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
times <- flights %>%
  select(contains("time"))
glimpse(times)
Rows: 162,049
Columns: 3
$ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557, …
$ arr_time <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907,…
$ air_time <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, 8…
# note that we are not actually saving the new dataframe here
flights %>%
  select(-contains("time")) %>% head()
| year | month | day | dep_delay | arr_delay | carrier | tailnum | flight | origin | dest | distance | hour | minute | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014 | 1 | 1 | 96 | 70 | AS | N508AS | 145 | PDX | ANC | 1.54e+03 | 0 | 1 | 
| 2014 | 1 | 1 | -6 | -23 | US | N195UW | 1830 | SEA | CLT | 2.28e+03 | 0 | 4 | 
| 2014 | 1 | 1 | 13 | -4 | UA | N37422 | 1609 | PDX | IAH | 1.82e+03 | 0 | 8 | 
| 2014 | 1 | 1 | -2 | -23 | US | N547UW | 466 | PDX | CLT | 2.28e+03 | 0 | 28 | 
| 2014 | 1 | 1 | 44 | 43 | AS | N762AS | 121 | SEA | ANC | 1.45e+03 | 0 | 34 | 
| 2014 | 1 | 1 | 82 | 88 | DL | N806DN | 1823 | SEA | DTW | 1.93e+03 | 0 | 37 | 
delays <- flights %>%
  select(ends_with("delay"))
glimpse(delays)
Rows: 162,049
Columns: 2
$ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
$ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, …
One of my favorite select helper functions is everything(), which allows you to use select to keep all your variables, but easily rearrange the columns without having to list all the variables to keep/drop.
new_order <- flights %>% 
  select(origin, dest, everything())
head(new_order)
| origin | dest | year | month | day | dep_time | dep_delay | arr_time | arr_delay | carrier | tailnum | flight | air_time | distance | hour | minute | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PDX | ANC | 2014 | 1 | 1 | 1 | 96 | 235 | 70 | AS | N508AS | 145 | 194 | 1.54e+03 | 0 | 1 | 
| SEA | CLT | 2014 | 1 | 1 | 4 | -6 | 738 | -23 | US | N195UW | 1830 | 252 | 2.28e+03 | 0 | 4 | 
| PDX | IAH | 2014 | 1 | 1 | 8 | 13 | 548 | -4 | UA | N37422 | 1609 | 201 | 1.82e+03 | 0 | 8 | 
| PDX | CLT | 2014 | 1 | 1 | 28 | -2 | 800 | -23 | US | N547UW | 466 | 251 | 2.28e+03 | 0 | 28 | 
| SEA | ANC | 2014 | 1 | 1 | 34 | 44 | 325 | 43 | AS | N762AS | 121 | 201 | 1.45e+03 | 0 | 34 | 
| SEA | DTW | 2014 | 1 | 1 | 37 | 82 | 747 | 88 | DL | N806DN | 1823 | 224 | 1.93e+03 | 0 | 37 | 
# with negation
new_order2 <- flights %>% 
  select(origin, dest, everything(), -year)
head(new_order2)
| origin | dest | month | day | dep_time | dep_delay | arr_time | arr_delay | carrier | tailnum | flight | air_time | distance | hour | minute | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PDX | ANC | 1 | 1 | 1 | 96 | 235 | 70 | AS | N508AS | 145 | 194 | 1.54e+03 | 0 | 1 | 
| SEA | CLT | 1 | 1 | 4 | -6 | 738 | -23 | US | N195UW | 1830 | 252 | 2.28e+03 | 0 | 4 | 
| PDX | IAH | 1 | 1 | 8 | 13 | 548 | -4 | UA | N37422 | 1609 | 201 | 1.82e+03 | 0 | 8 | 
| PDX | CLT | 1 | 1 | 28 | -2 | 800 | -23 | US | N547UW | 466 | 251 | 2.28e+03 | 0 | 28 | 
| SEA | ANC | 1 | 1 | 34 | 44 | 325 | 43 | AS | N762AS | 121 | 201 | 1.45e+03 | 0 | 34 | 
| SEA | DTW | 1 | 1 | 37 | 82 | 747 | 88 | DL | N806DN | 1823 | 224 | 1.93e+03 | 0 | 37 | 
We can also rename variables within select.
flights2 <- flights %>%
  select(tail_num = tailnum, everything())
head(flights2)
| tail_num | year | month | day | dep_time | dep_delay | arr_time | arr_delay | carrier | flight | origin | dest | air_time | distance | hour | minute | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| N508AS | 2014 | 1 | 1 | 1 | 96 | 235 | 70 | AS | 145 | PDX | ANC | 194 | 1.54e+03 | 0 | 1 | 
| N195UW | 2014 | 1 | 1 | 4 | -6 | 738 | -23 | US | 1830 | SEA | CLT | 252 | 2.28e+03 | 0 | 4 | 
| N37422 | 2014 | 1 | 1 | 8 | 13 | 548 | -4 | UA | 1609 | PDX | IAH | 201 | 1.82e+03 | 0 | 8 | 
| N547UW | 2014 | 1 | 1 | 28 | -2 | 800 | -23 | US | 466 | PDX | CLT | 251 | 2.28e+03 | 0 | 28 | 
| N762AS | 2014 | 1 | 1 | 34 | 44 | 325 | 43 | AS | 121 | SEA | ANC | 201 | 1.45e+03 | 0 | 34 | 
| N806DN | 2014 | 1 | 1 | 37 | 82 | 747 | 88 | DL | 1823 | SEA | DTW | 224 | 1.93e+03 | 0 | 37 | 
If you don’t want to move the renamed variables within your dataframe, you can use the rename function.
flights3 <- flights %>%
  rename(tail_num = tailnum)
glimpse(flights3)
Rows: 162,049
Columns: 16
$ year      <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
$ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
$ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
$ arr_time  <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907…
$ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, …
$ carrier   <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", …
$ tail_num  <chr> "N508AS", "N195UW", "N37422", "N547UW", "N762AS", "N806DN", …
$ flight    <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569,…
$ origin    <chr> "PDX", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "PDX", "SEA…
$ dest      <chr> "ANC", "CLT", "IAH", "CLT", "ANC", "DTW", "ORD", "IAH", "DEN…
$ air_time  <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, …
$ distance  <dbl> 1542, 2279, 1825, 2282, 1448, 1927, 1721, 1825, 1024, 2402, …
$ hour      <dbl> 0, 0, 0, 0, 0, 0, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, …
$ minute    <dbl> 1, 4, 8, 28, 34, 37, 46, 26, 27, 36, 41, 49, 50, 57, 57, 58,…
dplyr::filterAs we have previously seen, filter has many flexible uses:
# flights taking off from PDX
pdx <- flights %>% 
  filter(origin == "PDX")
head(pdx)
| year | month | day | dep_time | dep_delay | arr_time | arr_delay | carrier | tailnum | flight | origin | dest | air_time | distance | hour | minute | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014 | 1 | 1 | 1 | 96 | 235 | 70 | AS | N508AS | 145 | PDX | ANC | 194 | 1.54e+03 | 0 | 1 | 
| 2014 | 1 | 1 | 8 | 13 | 548 | -4 | UA | N37422 | 1609 | PDX | IAH | 201 | 1.82e+03 | 0 | 8 | 
| 2014 | 1 | 1 | 28 | -2 | 800 | -23 | US | N547UW | 466 | PDX | CLT | 251 | 2.28e+03 | 0 | 28 | 
| 2014 | 1 | 1 | 526 | -4 | 1148 | 15 | UA | N813UA | 229 | PDX | IAH | 217 | 1.82e+03 | 5 | 26 | 
| 2014 | 1 | 1 | 541 | 1 | 911 | 4 | UA | N36476 | 1569 | PDX | DEN | 130 | 991 | 5 | 41 | 
| 2014 | 1 | 1 | 549 | 24 | 907 | 12 | US | N548UW | 649 | PDX | PHX | 122 | 1.01e+03 | 5 | 49 | 
# january flights from PDX
pdx_jan <- flights %>% 
  filter(origin == "PDX", month == 1) # the comma is an "and"
head(pdx_jan)
| year | month | day | dep_time | dep_delay | arr_time | arr_delay | carrier | tailnum | flight | origin | dest | air_time | distance | hour | minute | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014 | 1 | 1 | 1 | 96 | 235 | 70 | AS | N508AS | 145 | PDX | ANC | 194 | 1.54e+03 | 0 | 1 | 
| 2014 | 1 | 1 | 8 | 13 | 548 | -4 | UA | N37422 | 1609 | PDX | IAH | 201 | 1.82e+03 | 0 | 8 | 
| 2014 | 1 | 1 | 28 | -2 | 800 | -23 | US | N547UW | 466 | PDX | CLT | 251 | 2.28e+03 | 0 | 28 | 
| 2014 | 1 | 1 | 526 | -4 | 1148 | 15 | UA | N813UA | 229 | PDX | IAH | 217 | 1.82e+03 | 5 | 26 | 
| 2014 | 1 | 1 | 541 | 1 | 911 | 4 | UA | N36476 | 1569 | PDX | DEN | 130 | 991 | 5 | 41 | 
| 2014 | 1 | 1 | 549 | 24 | 907 | 12 | US | N548UW | 649 | PDX | PHX | 122 | 1.01e+03 | 5 | 49 | 
# flights to ATL (Atlanta) or BNA (Nashville)
to_south <- flights %>% 
  filter(dest == "ATL" | dest == "BNA") %>% # | is "or"
  select(origin, dest, everything())
head(to_south)
| origin | dest | year | month | day | dep_time | dep_delay | arr_time | arr_delay | carrier | tailnum | flight | air_time | distance | hour | minute | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SEA | ATL | 2014 | 1 | 1 | 624 | -6 | 1401 | -6 | DL | N617DL | 968 | 235 | 2.18e+03 | 6 | 24 | 
| SEA | ATL | 2014 | 1 | 1 | 802 | -3 | 1533 | -17 | AS | N532AS | 742 | 249 | 2.18e+03 | 8 | 2 | 
| SEA | ATL | 2014 | 1 | 1 | 824 | -1 | 1546 | -14 | DL | N633DL | 128 | 234 | 2.18e+03 | 8 | 24 | 
| PDX | ATL | 2014 | 1 | 1 | 944 | -6 | 1727 | -8 | AS | N548AS | 752 | 252 | 2.17e+03 | 9 | 44 | 
| PDX | ATL | 2014 | 1 | 1 | 1054 | 94 | 1807 | 84 | DL | N377DA | 1502 | 237 | 2.17e+03 | 10 | 54 | 
| SEA | ATL | 2014 | 1 | 1 | 1158 | 6 | 1915 | -14 | DL | N6712B | 1962 | 234 | 2.18e+03 | 11 | 58 | 
# flights from PDX to ATL (Atlanta) or BNA (Nashville)
pdx_to_south <- flights %>% 
  filter(origin == "PDX", dest == "ATL" | dest == "BNA") %>% # | is "or"
  select(origin, dest, everything())
head(pdx_to_south)
| origin | dest | year | month | day | dep_time | dep_delay | arr_time | arr_delay | carrier | tailnum | flight | air_time | distance | hour | minute | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PDX | ATL | 2014 | 1 | 1 | 944 | -6 | 1727 | -8 | AS | N548AS | 752 | 252 | 2.17e+03 | 9 | 44 | 
| PDX | ATL | 2014 | 1 | 1 | 1054 | 94 | 1807 | 84 | DL | N377DA | 1502 | 237 | 2.17e+03 | 10 | 54 | 
| PDX | ATL | 2014 | 1 | 1 | 1323 | -2 | 2038 | -15 | DL | N393DA | 773 | 235 | 2.17e+03 | 13 | 23 | 
| PDX | ATL | 2014 | 1 | 1 | 2253 | 8 | 611 | 4 | DL | N371DA | 503 | 240 | 2.17e+03 | 22 | 53 | 
| PDX | ATL | 2014 | 1 | 2 | 627 | -3 | 1350 | -7 | DL | N3746H | 1156 | 244 | 2.17e+03 | 6 | 27 | 
| PDX | ATL | 2014 | 1 | 2 | 918 | -2 | 1643 | -2 | DL | N3756 | 1940 | 249 | 2.17e+03 | 9 | 18 | 
# alternatively, using group membership
south_dests <- c("ATL", "BNA")
pdx_to_south2 <- flights %>% 
  filter(origin == "PDX", dest %in% south_dests) %>% 
  select(origin, dest, everything())
head(pdx_to_south2)
| origin | dest | year | month | day | dep_time | dep_delay | arr_time | arr_delay | carrier | tailnum | flight | air_time | distance | hour | minute | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PDX | ATL | 2014 | 1 | 1 | 944 | -6 | 1727 | -8 | AS | N548AS | 752 | 252 | 2.17e+03 | 9 | 44 | 
| PDX | ATL | 2014 | 1 | 1 | 1054 | 94 | 1807 | 84 | DL | N377DA | 1502 | 237 | 2.17e+03 | 10 | 54 | 
| PDX | ATL | 2014 | 1 | 1 | 1323 | -2 | 2038 | -15 | DL | N393DA | 773 | 235 | 2.17e+03 | 13 | 23 | 
| PDX | ATL | 2014 | 1 | 1 | 2253 | 8 | 611 | 4 | DL | N371DA | 503 | 240 | 2.17e+03 | 22 | 53 | 
| PDX | ATL | 2014 | 1 | 2 | 627 | -3 | 1350 | -7 | DL | N3746H | 1156 | 244 | 2.17e+03 | 6 | 27 | 
| PDX | ATL | 2014 | 1 | 2 | 918 | -2 | 1643 | -2 | DL | N3756 | 1940 | 249 | 2.17e+03 | 9 | 18 | 
# flights delayed by 1 hour or more
delay_1plus <- flights %>%
  filter(dep_delay >= 60)
head(delay_1plus)
| year | month | day | dep_time | dep_delay | arr_time | arr_delay | carrier | tailnum | flight | origin | dest | air_time | distance | hour | minute | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014 | 1 | 1 | 1 | 96 | 235 | 70 | AS | N508AS | 145 | PDX | ANC | 194 | 1.54e+03 | 0 | 1 | 
| 2014 | 1 | 1 | 37 | 82 | 747 | 88 | DL | N806DN | 1823 | SEA | DTW | 224 | 1.93e+03 | 0 | 37 | 
| 2014 | 1 | 1 | 346 | 227 | 936 | 219 | UA | N14219 | 1481 | SEA | ORD | 202 | 1.72e+03 | 3 | 46 | 
| 2014 | 1 | 1 | 650 | 90 | 1037 | 91 | US | N626AW | 460 | SEA | PHX | 141 | 1.11e+03 | 6 | 50 | 
| 2014 | 1 | 1 | 959 | 164 | 1137 | 157 | AS | N534AS | 805 | SEA | SMF | 77 | 605 | 9 | 59 | 
| 2014 | 1 | 1 | 1008 | 68 | 1242 | 64 | AS | N788AS | 456 | SEA | LAX | 129 | 954 | 10 | 8 | 
# flights delayed by 1 hour, but not more than 2 hours
delay_1hr <- flights %>%
  filter(dep_delay >= 60, dep_delay < 120)
head(delay_1hr)
| year | month | day | dep_time | dep_delay | arr_time | arr_delay | carrier | tailnum | flight | origin | dest | air_time | distance | hour | minute | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014 | 1 | 1 | 1 | 96 | 235 | 70 | AS | N508AS | 145 | PDX | ANC | 194 | 1.54e+03 | 0 | 1 | 
| 2014 | 1 | 1 | 37 | 82 | 747 | 88 | DL | N806DN | 1823 | SEA | DTW | 224 | 1.93e+03 | 0 | 37 | 
| 2014 | 1 | 1 | 650 | 90 | 1037 | 91 | US | N626AW | 460 | SEA | PHX | 141 | 1.11e+03 | 6 | 50 | 
| 2014 | 1 | 1 | 1008 | 68 | 1242 | 64 | AS | N788AS | 456 | SEA | LAX | 129 | 954 | 10 | 8 | 
| 2014 | 1 | 1 | 1014 | 75 | 1613 | 81 | UA | N37408 | 1444 | SEA | ORD | 201 | 1.72e+03 | 10 | 14 | 
| 2014 | 1 | 1 | 1036 | 81 | 1408 | 63 | OO | N218AG | 3466 | PDX | TUS | 134 | 1.12e+03 | 10 | 36 | 
range(delay_1hr$dep_delay, na.rm = TRUE)
[1]  60 119
# even more efficient using between (always inclusive)
delay_bwn <- flights %>%
  filter(between(dep_delay, 60, 119))
head(delay_bwn)
| year | month | day | dep_time | dep_delay | arr_time | arr_delay | carrier | tailnum | flight | origin | dest | air_time | distance | hour | minute | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014 | 1 | 1 | 1 | 96 | 235 | 70 | AS | N508AS | 145 | PDX | ANC | 194 | 1.54e+03 | 0 | 1 | 
| 2014 | 1 | 1 | 37 | 82 | 747 | 88 | DL | N806DN | 1823 | SEA | DTW | 224 | 1.93e+03 | 0 | 37 | 
| 2014 | 1 | 1 | 650 | 90 | 1037 | 91 | US | N626AW | 460 | SEA | PHX | 141 | 1.11e+03 | 6 | 50 | 
| 2014 | 1 | 1 | 1008 | 68 | 1242 | 64 | AS | N788AS | 456 | SEA | LAX | 129 | 954 | 10 | 8 | 
| 2014 | 1 | 1 | 1014 | 75 | 1613 | 81 | UA | N37408 | 1444 | SEA | ORD | 201 | 1.72e+03 | 10 | 14 | 
| 2014 | 1 | 1 | 1036 | 81 | 1408 | 63 | OO | N218AG | 3466 | PDX | TUS | 134 | 1.12e+03 | 10 | 36 | 
range(delay_bwn$dep_delay, na.rm = TRUE)
[1]  60 119
dplyr::arrange# default is ascending order
flights %>% 
  arrange(year, month, day) %>% head(n=20)
| year | month | day | dep_time | dep_delay | arr_time | arr_delay | carrier | tailnum | flight | origin | dest | air_time | distance | hour | minute | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014 | 1 | 1 | 1 | 96 | 235 | 70 | AS | N508AS | 145 | PDX | ANC | 194 | 1.54e+03 | 0 | 1 | 
| 2014 | 1 | 1 | 4 | -6 | 738 | -23 | US | N195UW | 1830 | SEA | CLT | 252 | 2.28e+03 | 0 | 4 | 
| 2014 | 1 | 1 | 8 | 13 | 548 | -4 | UA | N37422 | 1609 | PDX | IAH | 201 | 1.82e+03 | 0 | 8 | 
| 2014 | 1 | 1 | 28 | -2 | 800 | -23 | US | N547UW | 466 | PDX | CLT | 251 | 2.28e+03 | 0 | 28 | 
| 2014 | 1 | 1 | 34 | 44 | 325 | 43 | AS | N762AS | 121 | SEA | ANC | 201 | 1.45e+03 | 0 | 34 | 
| 2014 | 1 | 1 | 37 | 82 | 747 | 88 | DL | N806DN | 1823 | SEA | DTW | 224 | 1.93e+03 | 0 | 37 | 
| 2014 | 1 | 1 | 346 | 227 | 936 | 219 | UA | N14219 | 1481 | SEA | ORD | 202 | 1.72e+03 | 3 | 46 | 
| 2014 | 1 | 1 | 526 | -4 | 1148 | 15 | UA | N813UA | 229 | PDX | IAH | 217 | 1.82e+03 | 5 | 26 | 
| 2014 | 1 | 1 | 527 | 7 | 917 | 24 | UA | N75433 | 1576 | SEA | DEN | 136 | 1.02e+03 | 5 | 27 | 
| 2014 | 1 | 1 | 536 | 1 | 1334 | -6 | UA | N574UA | 478 | SEA | EWR | 268 | 2.4e+03 | 5 | 36 | 
| 2014 | 1 | 1 | 541 | 1 | 911 | 4 | UA | N36476 | 1569 | PDX | DEN | 130 | 991 | 5 | 41 | 
| 2014 | 1 | 1 | 549 | 24 | 907 | 12 | US | N548UW | 649 | PDX | PHX | 122 | 1.01e+03 | 5 | 49 | 
| 2014 | 1 | 1 | 550 | 0 | 837 | -12 | DL | N660DL | 1634 | SEA | SLC | 82 | 689 | 5 | 50 | 
| 2014 | 1 | 1 | 557 | -3 | 1134 | -16 | AA | N3JLAA | 1094 | SEA | DFW | 184 | 1.66e+03 | 5 | 57 | 
| 2014 | 1 | 1 | 557 | -3 | 825 | -25 | AS | N562AS | 81 | SEA | ANC | 188 | 1.45e+03 | 5 | 57 | 
| 2014 | 1 | 1 | 558 | -2 | 801 | -2 | AS | N402AS | 200 | SEA | SJC | 100 | 697 | 5 | 58 | 
| 2014 | 1 | 1 | 559 | -1 | 916 | -9 | F9 | N210FR | 796 | PDX | DEN | 125 | 991 | 5 | 59 | 
| 2014 | 1 | 1 | 600 | 0 | 1151 | -19 | AA | N3JFAA | 2240 | SEA | ORD | 206 | 1.72e+03 | 6 | 0 | 
| 2014 | 1 | 1 | 600 | -10 | 842 | -8 | AS | N786AS | 426 | SEA | LAX | 125 | 954 | 6 | 0 | 
| 2014 | 1 | 1 | 602 | -3 | 943 | 5 | F9 | N201FR | 144 | SEA | DEN | 129 | 1.02e+03 | 6 | 2 | 
# descending order
flights %>% 
  arrange(desc(year), desc(month), desc(day)) %>% head(n=20)
| year | month | day | dep_time | dep_delay | arr_time | arr_delay | carrier | tailnum | flight | origin | dest | air_time | distance | hour | minute | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014 | 12 | 31 | 2 | 12 | 601 | 31 | AA | N3JKAA | 1230 | SEA | DFW | 210 | 1.66e+03 | 0 | 2 | 
| 2014 | 12 | 31 | 27 | -3 | 623 | 3 | AA | N3EWAA | 1431 | SEA | ORD | 211 | 1.72e+03 | 0 | 27 | 
| 2014 | 12 | 31 | 39 | 14 | 324 | 4 | AS | N762AS | 135 | SEA | ANC | 207 | 1.45e+03 | 0 | 39 | 
| 2014 | 12 | 31 | 40 | 0 | 549 | 0 | DL | N757AT | 2440 | SEA | MSP | 174 | 1.4e+03 | 0 | 40 | 
| 2014 | 12 | 31 | 52 | -8 | 917 | -21 | AA | N3JFAA | 371 | SEA | MIA | 306 | 2.72e+03 | 0 | 52 | 
| 2014 | 12 | 31 | 54 | 4 | 621 | 17 | DL | N128DL | 1670 | PDX | MSP | 174 | 1.43e+03 | 0 | 54 | 
| 2014 | 12 | 31 | 56 | 61 | 848 | 80 | DL | N655DL | 929 | SEA | ATL | 249 | 2.18e+03 | 0 | 56 | 
| 2014 | 12 | 31 | 512 | -3 | 904 | 4 | US | N653AW | 480 | SEA | PHX | 154 | 1.11e+03 | 5 | 12 | 
| 2014 | 12 | 31 | 515 | -5 | 855 | 5 | US | N580UW | 425 | PDX | PHX | 144 | 1.01e+03 | 5 | 15 | 
| 2014 | 12 | 31 | 534 | 4 | 859 | 7 | UA | N34460 | 1075 | PDX | DEN | 125 | 991 | 5 | 34 | 
| 2014 | 12 | 31 | 546 | 1 | 916 | -4 | WN | N8323C | 757 | PDX | DEN | 126 | 991 | 5 | 46 | 
| 2014 | 12 | 31 | 548 | -2 | 1351 | -13 | UA | N461UA | 665 | PDX | EWR | 287 | 2.43e+03 | 5 | 48 | 
| 2014 | 12 | 31 | 549 | 4 | 1208 | 12 | UA | N68807 | 1457 | SEA | IAH | 223 | 1.87e+03 | 5 | 49 | 
| 2014 | 12 | 31 | 550 | 0 | 922 | 2 | WN | N797MX | 2121 | PDX | PHX | 138 | 1.01e+03 | 5 | 50 | 
| 2014 | 12 | 31 | 551 | -4 | 1202 | 12 | AA | N3HXAA | 1094 | SEA | DFW | 211 | 1.66e+03 | 5 | 51 | 
| 2014 | 12 | 31 | 551 | -9 | 744 | -15 | AS | N570AS | 342 | SEA | OAK | 91 | 671 | 5 | 51 | 
| 2014 | 12 | 31 | 555 | -10 | 824 | -1 | AS | N548AS | 602 | SEA | LAS | 124 | 867 | 5 | 55 | 
| 2014 | 12 | 31 | 558 | -2 | 849 | 0 | DL | N668DN | 1831 | PDX | SLC | 85 | 630 | 5 | 58 | 
| 2014 | 12 | 31 | 558 | -2 | 1149 | 4 | AA | N436AA | 1534 | PDX | DFW | 211 | 1.62e+03 | 5 | 58 | 
| 2014 | 12 | 31 | 558 | -2 | 738 | -4 | AS | N585AS | 406 | PDX | SJC | 80 | 569 | 5 | 58 | 
dplyr::distinct# unique origin-dest combinations
flights %>% 
  select(origin, dest) %>% 
  distinct %>% head(n=50)
| origin | dest | 
|---|---|
| PDX | ANC | 
| SEA | CLT | 
| PDX | IAH | 
| PDX | CLT | 
| SEA | ANC | 
| SEA | DTW | 
| SEA | ORD | 
| SEA | DEN | 
| SEA | EWR | 
| PDX | DEN | 
| PDX | PHX | 
| SEA | SLC | 
| SEA | DFW | 
| SEA | SJC | 
| SEA | LAX | 
| PDX | ORD | 
| SEA | OAK | 
| SEA | SFO | 
| PDX | SJC | 
| SEA | SNA | 
| SEA | SAN | 
| PDX | DFW | 
| PDX | EWR | 
| SEA | IAH | 
| SEA | ATL | 
| PDX | BUR | 
| SEA | MDW | 
| PDX | LAX | 
| PDX | SAN | 
| PDX | IAD | 
| SEA | PSP | 
| PDX | MDW | 
| SEA | MSP | 
| PDX | SNA | 
| SEA | PHX | 
| PDX | HNL | 
| SEA | JFK | 
| PDX | MSP | 
| PDX | SFO | 
| SEA | LAS | 
| SEA | LGB | 
| PDX | ONT | 
| PDX | LGB | 
| PDX | JFK | 
| PDX | KOA | 
| PDX | LAS | 
| SEA | ONT | 
| SEA | MKE | 
| SEA | BUR | 
| SEA | KTN | 
# all unique destinations from PDX (there are 49, so we'll just list the first few)
from_pdx <- flights %>% 
  filter(origin == "PDX") %>% 
  select(origin, dest) %>%
  distinct(dest)
head(from_pdx)
| dest | 
|---|
| ANC | 
| IAH | 
| CLT | 
| DEN | 
| PHX | 
| ORD | 
dplyr::mutatemutate is used to transform existing variables or create new ones; here, we are using it to create an indicator variable that identifies flights that were entirely on-time.
# add total delay variable
flights %>%
  mutate(tot_delay = dep_delay + arr_delay) %>%
  select(origin, dest, ends_with("delay"), everything()) %>% 
  head
| origin | dest | dep_delay | arr_delay | tot_delay | year | month | day | dep_time | arr_time | carrier | tailnum | flight | air_time | distance | hour | minute | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PDX | ANC | 96 | 70 | 166 | 2014 | 1 | 1 | 1 | 235 | AS | N508AS | 145 | 194 | 1.54e+03 | 0 | 1 | 
| SEA | CLT | -6 | -23 | -29 | 2014 | 1 | 1 | 4 | 738 | US | N195UW | 1830 | 252 | 2.28e+03 | 0 | 4 | 
| PDX | IAH | 13 | -4 | 9 | 2014 | 1 | 1 | 8 | 548 | UA | N37422 | 1609 | 201 | 1.82e+03 | 0 | 8 | 
| PDX | CLT | -2 | -23 | -25 | 2014 | 1 | 1 | 28 | 800 | US | N547UW | 466 | 251 | 2.28e+03 | 0 | 28 | 
| SEA | ANC | 44 | 43 | 87 | 2014 | 1 | 1 | 34 | 325 | AS | N762AS | 121 | 201 | 1.45e+03 | 0 | 34 | 
| SEA | DTW | 82 | 88 | 170 | 2014 | 1 | 1 | 37 | 747 | DL | N806DN | 1823 | 224 | 1.93e+03 | 0 | 37 | 
# flights that were delayed at departure had on time or early arrivals?
arrivals <- flights %>%
  mutate(arr_ok = ifelse(dep_delay > 0 & arr_delay <= 0, 1, 0)) %>% 
  select(origin, dest, ends_with("delay"), carrier, arr_ok)
# peek at it
arrivals %>%
  filter(arr_ok == 1) %>%
  head
| origin | dest | dep_delay | arr_delay | carrier | arr_ok | 
|---|---|---|---|---|---|
| PDX | IAH | 13 | -4 | UA | 1 | 
| SEA | EWR | 1 | -6 | UA | 1 | 
| SEA | SAN | 2 | -12 | AS | 1 | 
| PDX | EWR | 2 | -19 | UA | 1 | 
| SEA | IAH | 13 | -4 | UA | 1 | 
| PDX | IAD | 10 | -4 | UA | 1 | 
dplyr::summarise (or dplyr::summarize)We have seen summarise used to calculate summary statistics, like so:
flights %>%
  summarise(mean(dep_delay, na.rm = TRUE))
| mean(dep_delay, na.rm = TRUE) | 
|---|
| 6.13 | 
But this can get tedious when we want to compute several summary statistics, like so:
# we can also name that variable, and summarise multiple variables
flights %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE),
            sd_delay = sd(dep_delay, na.rm = TRUE),
            median_delay = median(dep_delay, na.rm = TRUE))
| mean_delay | sd_delay | median_delay | 
|---|---|---|
| 6.13 | 29.1 | -2 | 
summarise_each lets us automate this process a bit:
flights %>%
  filter(!is.na(dep_delay)) %>%
  select(dep_delay) %>%
  summarise_each(lst(mean, sd, median))
| mean | sd | median | 
|---|---|---|
| 6.13 | 29.1 | -2 | 
# same thing
flights %>%
  filter(!is.na(dep_delay)) %>%
  summarise_each(lst(mean, sd, median), dep_delay)
| mean | sd | median | 
|---|---|---|
| 6.13 | 29.1 | -2 | 
# combine with pivot_longer, change names too
flights %>%
  filter(!is.na(dep_delay)) %>%
  summarise_each(lst(mean, stdev = sd, median), dep_delay) %>%
  pivot_longer(everything(), names_to="delay_stat", values_to="value")
| delay_stat | value | 
|---|---|
| mean | 6.13 | 
| stdev | 29.1 | 
| median | -2 | 
We can use aggregating functions in summarise
# how many unique destinations?
summary_table <- flights %>% 
  summarise(tot_flights = n(),
            tot_planes = n_distinct(tailnum),
            tot_carriers = n_distinct(carrier),
            tot_dests = n_distinct(dest),
            tot_origins = n_distinct(origin))
summary_table
| tot_flights | tot_planes | tot_carriers | tot_dests | tot_origins | 
|---|---|---|---|---|
| 162049 | 3023 | 11 | 71 | 2 | 
The resulting table is a bit messy; we can reprocess using pivot_longer and separate:
# chain with tidyr functions
summary_table %>% 
  pivot_longer(everything(), names_to="key", values_to="value") %>% 
  separate(key, into = c("tot", "entity")) %>% 
  select(-tot, total = value)
| entity | total | 
|---|---|
| flights | 162049 | 
| planes | 3023 | 
| carriers | 11 | 
| dests | 71 | 
| origins | 2 | 
pivot_longer():
summary_table %>% 
  pivot_longer(
    everything(), # grab all columns
    names_to="entity", # make a new column called "entity" to store the old column names
    names_prefix="tot_", # remove the "tot_" prefix from the old column names
    values_to="total") # put the values in a column called "total"
| entity | total | 
|---|---|
| flights | 162049 | 
| planes | 3023 | 
| carriers | 11 | 
| dests | 71 | 
| origins | 2 | 
tidyrWe’ll work with a made up dataframe:
df <- data.frame(
  id = 1:10,
  date = as.Date('2015-01-01') + 0:9,
  q1_m1_w1 = rnorm(10, 0, 1),
  q1_m1_w2 = rnorm(10, 0, 1),
  q1_m2_w3 = rnorm(10, 0, 1),
  q2_m1_w1 = rnorm(10, 0, 1),
  q2_m2_w1 = rnorm(10, 0, 1),
  q2_m2_w2 = rnorm(10, 0, 1)
)
# HLO
head(df)
| id | date | q1_m1_w1 | q1_m1_w2 | q1_m2_w3 | q2_m1_w1 | q2_m2_w1 | q2_m2_w2 | 
|---|---|---|---|---|---|---|---|
| 1 | 2015-01-01 | -0.313 | -0.607 | 0.17 | -0.255 | 0.131 | 0.26 | 
| 2 | 2015-01-02 | -0.808 | -0.768 | 1.13 | 1.83 | -0.562 | -0.297 | 
| 3 | 2015-01-03 | -1.27 | -0.165 | -1.28 | 0.419 | -1.23 | -0.669 | 
| 4 | 2015-01-04 | -2.12 | -1.04 | -0.603 | -0.19 | -1.21 | -1.64 | 
| 5 | 2015-01-05 | 0.19 | -0.631 | 0.359 | 1.98 | -1.1 | 1.36 | 
| 6 | 2015-01-06 | 0.539 | 0.449 | -0.704 | -0.466 | 0.731 | 0.125 | 
glimpse(df)
Rows: 10
Columns: 8
$ id       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ date     <date> 2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05, 2…
$ q1_m1_w1 <dbl> -0.3134072, -0.8077594, -1.2704517, -2.1222188, 0.1895081, 0…
$ q1_m1_w2 <dbl> -0.6066112, -0.7676025, -0.1651904, -1.0356756, -0.6314616, 0…
$ q1_m2_w3 <dbl> 0.17043485, 1.12869876, -1.28033499, -0.60293507, 0.35930413,…
$ q2_m1_w1 <dbl> -0.25450850, 1.82724513, 0.41912644, -0.18991891, 1.98128295,…
$ q2_m2_w1 <dbl> 0.1313910, -0.5619366, -1.2254890, -1.2102771, -1.1036748, 0.…
$ q2_m2_w2 <dbl> 0.26023119, -0.29650726, -0.66912554, -1.63523370, 1.35751588…
tidyr::pivot_longerFirst, let’s pivot…
df_tidy <- df %>%
  pivot_longer(q1_m1_w1:q2_m2_w2, names_to="key", values_to="value")
head(df_tidy)
| id | date | key | value | 
|---|---|---|---|
| 1 | 2015-01-01 | q1_m1_w1 | -0.313 | 
| 1 | 2015-01-01 | q1_m1_w2 | -0.607 | 
| 1 | 2015-01-01 | q1_m2_w3 | 0.17 | 
| 1 | 2015-01-01 | q2_m1_w1 | -0.255 | 
| 1 | 2015-01-01 | q2_m2_w1 | 0.131 | 
| 1 | 2015-01-01 | q2_m2_w2 | 0.26 | 
tidyr::separate# separate 1 col into 3 cols
df_sep <- df_tidy %>%
  separate(key, into = c("quarter", "month", "week"))
head(df_sep)
| id | date | quarter | month | week | value | 
|---|---|---|---|---|---|
| 1 | 2015-01-01 | q1 | m1 | w1 | -0.313 | 
| 1 | 2015-01-01 | q1 | m1 | w2 | -0.607 | 
| 1 | 2015-01-01 | q1 | m2 | w3 | 0.17 | 
| 1 | 2015-01-01 | q2 | m1 | w1 | -0.255 | 
| 1 | 2015-01-01 | q2 | m2 | w1 | 0.131 | 
| 1 | 2015-01-01 | q2 | m2 | w2 | 0.26 | 
# separate 1 col into 2 cols
df_sep2 <- df_tidy %>%
  separate(key, into = c("quarter", "period"), extra = "merge")
head(df_sep2)
| id | date | quarter | period | value | 
|---|---|---|---|---|
| 1 | 2015-01-01 | q1 | m1_w1 | -0.313 | 
| 1 | 2015-01-01 | q1 | m1_w2 | -0.607 | 
| 1 | 2015-01-01 | q1 | m2_w3 | 0.17 | 
| 1 | 2015-01-01 | q2 | m1_w1 | -0.255 | 
| 1 | 2015-01-01 | q2 | m2_w1 | 0.131 | 
| 1 | 2015-01-01 | q2 | m2_w2 | 0.26 | 
tidyr::extractExtract is essentially the same as separate, let’s see how…
# extract
df_ext <- df_sep2 %>%
  extract(period, into = "month")
head(df_ext)
| id | date | quarter | month | value | 
|---|---|---|---|---|
| 1 | 2015-01-01 | q1 | m1 | -0.313 | 
| 1 | 2015-01-01 | q1 | m1 | -0.607 | 
| 1 | 2015-01-01 | q1 | m2 | 0.17 | 
| 1 | 2015-01-01 | q2 | m1 | -0.255 | 
| 1 | 2015-01-01 | q2 | m2 | 0.131 | 
| 1 | 2015-01-01 | q2 | m2 | 0.26 | 
We can see that, by default, it’s taking the first alphanumeric chunk of the value, and extracting it. We can specify much more complex patterns:
# this gives us same output as separate
df_ext <- df_sep2 %>%
  extract(period, into = c("month", "week"), 
          regex = "([[:alnum:]]+)_([[:alnum:]]+)")
head(df_ext)
| id | date | quarter | month | week | value | 
|---|---|---|---|---|---|
| 1 | 2015-01-01 | q1 | m1 | w1 | -0.313 | 
| 1 | 2015-01-01 | q1 | m1 | w2 | -0.607 | 
| 1 | 2015-01-01 | q1 | m2 | w3 | 0.17 | 
| 1 | 2015-01-01 | q2 | m1 | w1 | -0.255 | 
| 1 | 2015-01-01 | q2 | m2 | w1 | 0.131 | 
| 1 | 2015-01-01 | q2 | m2 | w2 | 0.26 | 
tidyr::uniteThe unite() function lets us reverse the process:
# let's say we want to combine quarter and month with an underscore
df_uni <- df_sep %>%
  unite(period, quarter:month) # sep = "_" is the default arg
head(df_uni)
| id | date | period | week | value | 
|---|---|---|---|---|
| 1 | 2015-01-01 | q1_m1 | w1 | -0.313 | 
| 1 | 2015-01-01 | q1_m1 | w2 | -0.607 | 
| 1 | 2015-01-01 | q1_m2 | w3 | 0.17 | 
| 1 | 2015-01-01 | q2_m1 | w1 | -0.255 | 
| 1 | 2015-01-01 | q2_m2 | w1 | 0.131 | 
| 1 | 2015-01-01 | q2_m2 | w2 | 0.26 | 
# let's say we want to combine quarter and month with nothing
df_uni <- df_sep %>%
  unite(period, quarter:month, sep = "")
head(df_uni)
| id | date | period | week | value | 
|---|---|---|---|---|
| 1 | 2015-01-01 | q1m1 | w1 | -0.313 | 
| 1 | 2015-01-01 | q1m1 | w2 | -0.607 | 
| 1 | 2015-01-01 | q1m2 | w3 | 0.17 | 
| 1 | 2015-01-01 | q2m1 | w1 | -0.255 | 
| 1 | 2015-01-01 | q2m2 | w1 | 0.131 | 
| 1 | 2015-01-01 | q2m2 | w2 | 0.26 | 
tidyr::pivot_wider# finally let's spread
df_spread <- df_uni %>%
  pivot_wider(names_from=week, values_from=value)  # fill = NA is default arg
head(df_spread)
| id | date | period | w1 | w2 | w3 | 
|---|---|---|---|---|---|
| 1 | 2015-01-01 | q1m1 | -0.313 | -0.607 | |
| 1 | 2015-01-01 | q1m2 | 0.17 | ||
| 1 | 2015-01-01 | q2m1 | -0.255 | ||
| 1 | 2015-01-01 | q2m2 | 0.131 | 0.26 | |
| 2 | 2015-01-02 | q1m1 | -0.808 | -0.768 | |
| 2 | 2015-01-02 | q1m2 | 1.13 | 
pivot_longer() %>% separate() %>% pivot_wider())All in one, if we had wanted to end up with one row per month:
df_tidiest <- df %>% 
  pivot_longer(q1_m1_w1:q2_m2_w2, names_to="key", values_to="value") %>% 
  separate(key, into = c("quarter", "month", "week")) %>% 
  pivot_wider(names_from=week, values_from=value)
head(df_tidiest)
| id | date | quarter | month | w1 | w2 | w3 | 
|---|---|---|---|---|---|---|
| 1 | 2015-01-01 | q1 | m1 | -0.313 | -0.607 | |
| 1 | 2015-01-01 | q1 | m2 | 0.17 | ||
| 1 | 2015-01-01 | q2 | m1 | -0.255 | ||
| 1 | 2015-01-01 | q2 | m2 | 0.131 | 0.26 | |
| 2 | 2015-01-02 | q1 | m1 | -0.808 | -0.768 | |
| 2 | 2015-01-02 | q1 | m2 | 1.13 | 
broom“The broom package takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames.” So, broom tidies output from other R functions that are un-tidy.
See here for list of functions: https://github.com/dgrtwo/broom
Vignette: ftp://cran.r-project.org/pub/R/web/packages/broom/vignettes/broom.html
fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear), 
          data = mtcars)
Un-tidy output from lm
summary(fit)
Call:
lm(formula = mpg ~ qsec + factor(am) + wt + factor(gear), data = mtcars)
Residuals:
    Min      1Q  Median      3Q     Max 
-3.5064 -1.5220 -0.7517  1.3841  4.6345 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     9.3650     8.3730   1.118  0.27359    
qsec            1.2449     0.3828   3.252  0.00317 ** 
factor(am)1     3.1505     1.9405   1.624  0.11654    
wt             -3.9263     0.7428  -5.286 1.58e-05 ***
factor(gear)4  -0.2682     1.6555  -0.162  0.87257    
factor(gear)5  -0.2697     2.0632  -0.131  0.89698    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.55 on 26 degrees of freedom
Multiple R-squared:  0.8498,    Adjusted R-squared:  0.8209 
F-statistic: 29.43 on 5 and 26 DF,  p-value: 6.379e-10
Tidy output from broom
tidy(fit)
| term | estimate | std.error | statistic | p.value | 
|---|---|---|---|---|
| (Intercept) | 9.37 | 8.37 | 1.12 | 0.274 | 
| qsec | 1.24 | 0.383 | 3.25 | 0.00317 | 
| factor(am)1 | 3.15 | 1.94 | 1.62 | 0.117 | 
| wt | -3.93 | 0.743 | -5.29 | 1.58e-05 | 
| factor(gear)4 | -0.268 | 1.66 | -0.162 | 0.873 | 
| factor(gear)5 | -0.27 | 2.06 | -0.131 | 0.897 | 
gtgt is a relatively new package from RStudio designed for publication-ready figures.
library(gt)
Let’s set up a tiny little data table…
prepped_flight_data <- flights %>% filter(dest %in% c("DEN", "DCA", "JFK", "SFO")) %>% 
  group_by(dest) %>% 
  mutate_at(vars(dep_delay), replace_na, replace=0.0) %>%  
  summarise(mean_delay=mean(dep_delay, na.rm=TRUE), median_delay=median(dep_delay, na.rm=TRUE))
glimpse(prepped_flight_data)
Rows: 4
Columns: 3
$ dest         <chr> "DCA", "DEN", "JFK", "SFO"
$ mean_delay   <dbl> 1.475138, 8.753835, 9.235996, 12.996799
$ median_delay <dbl> -3, -1, -1, -1
… and now make a table out of it:
gt_formatted <- prepped_flight_data %>%  gt() %>% 
  tab_header(title="PDX Departure Delays, DCA/DEN/JFK/SFO", subtitle="(Delay in minutes)") %>% 
  fmt_number(columns=vars(mean_delay, median_delay), decimals=2) %>% 
  cols_label(dest="Destination", mean_delay="Mean", median_delay="Median") 
gt_formatted 
| PDX Departure Delays, DCA/DEN/JFK/SFO | ||
|---|---|---|
| (Delay in minutes) | ||
| Destination | Mean | Median | 
| DCA | 1.48 | −3.00 | 
| DEN | 8.75 | −1.00 | 
| JFK | 9.24 | −1.00 | 
| SFO | 13.00 | −1.00 | 
We can adjust things like text alignment after-the-fact:
gt_formatted <- gt_formatted %>% 
  cols_align(align="center", columns=vars(dest)) %>%
  tab_style(
    cell_text(align = "right"), 
    locations = cells_title(groups = c("subtitle"))
  ) %>% 
  tab_style(
    cell_text(style = "italic"),
    locations = cells_title(groups=c("title"))
  )
gt_formatted
| PDX Departure Delays, DCA/DEN/JFK/SFO | ||
|---|---|---|
| (Delay in minutes) | ||
| Destination | Mean | Median | 
| DCA | 1.48 | −3.00 | 
| DEN | 8.75 | −1.00 | 
| JFK | 9.24 | −1.00 | 
| SFO | 13.00 | −1.00 | 
Tip: In addition to cells_title(), there are helpers to select any of the other cell types (e.g. cells_column_labels(), etc.).
Now we can turn it into whatever we need! As Latex:
gt_formatted %>% gtsave("my_table.tex")
As RTF, to put in Word:
gt_formatted %>% gtsave("my_table.rtf")
tableonetableone is for generating “Table 1” in your paper. You know, the “Subject Characteristics” table- lots of boring summary statistics.
Vignette: https://cran.r-project.org/web/packages/tableone/vignettes/introduction.html
library(tableone)
By default, its output is probably not what we want (though we can see hints of it down at the bottom)…
CreateTableOne(data = mazes)
                          
                           Overall       
  n                           381        
  study_id (%)                           
     CSLU-001                   4 ( 1.0) 
     CSLU-002                   4 ( 1.0) 
     CSLU-007                   4 ( 1.0) 
     CSLU-010                   4 ( 1.0) 
     CSLU-020                   4 ( 1.0) 
     CSLU-024                   4 ( 1.0) 
     CSLU-027                   4 ( 1.0) 
     CSLU-031                   4 ( 1.0) 
     CSLU-036                   3 ( 0.8) 
     CSLU-046                   4 ( 1.0) 
     CSLU-053                   4 ( 1.0) 
     CSLU-054                   4 ( 1.0) 
     CSLU-059                   4 ( 1.0) 
     CSLU-062                   4 ( 1.0) 
     CSLU-066                   4 ( 1.0) 
     CSLU-073                   4 ( 1.0) 
     CSLU-077                   4 ( 1.0) 
     CSLU-080                   4 ( 1.0) 
     CSLU-082                   3 ( 0.8) 
     CSLU-084                   4 ( 1.0) 
     CSLU-089                   4 ( 1.0) 
     CSLU-095                   3 ( 0.8) 
     CSLU-096                   4 ( 1.0) 
     CSLU-101                   4 ( 1.0) 
     CSLU-104                   4 ( 1.0) 
     CSLU-112                   4 ( 1.0) 
     CSLU-117                   4 ( 1.0) 
     CSLU-119                   4 ( 1.0) 
     CSLU-122                   4 ( 1.0) 
     CSLU-124                   3 ( 0.8) 
     CSLU-142                   4 ( 1.0) 
     CSLU-144                   4 ( 1.0) 
     CSLU-146                   4 ( 1.0) 
     CSLU-154                   4 ( 1.0) 
     CSLU-156                   4 ( 1.0) 
     CSLU-161                   4 ( 1.0) 
     CSLU-163                   4 ( 1.0) 
     CSLU-165                   4 ( 1.0) 
     CSLU-167                   4 ( 1.0) 
     CSLU-180                   4 ( 1.0) 
     CSLU-191                   4 ( 1.0) 
     CSLU-203                   4 ( 1.0) 
     CSLU-204                   4 ( 1.0) 
     CSLU-213                   4 ( 1.0) 
     CSLU-216                   4 ( 1.0) 
     CSLU-220                   4 ( 1.0) 
     CSLU-226                   4 ( 1.0) 
     CSLU-233                   4 ( 1.0) 
     CSLU-238                   4 ( 1.0) 
     CSLU-245                   4 ( 1.0) 
     CSLU-258                   4 ( 1.0) 
     CSLU-259                   4 ( 1.0) 
     CSLU-263                   4 ( 1.0) 
     CSLU-269                   4 ( 1.0) 
     CSLU-274                   4 ( 1.0) 
     CSLU-275                   4 ( 1.0) 
     CSLU-277                   4 ( 1.0) 
     CSLU-284                   4 ( 1.0) 
     CSLU-290                   4 ( 1.0) 
     CSLU-303                   4 ( 1.0) 
     CSLU-306                   4 ( 1.0) 
     CSLU-312                   4 ( 1.0) 
     CSLU-315                   4 ( 1.0) 
     CSLU-316                   4 ( 1.0) 
     CSLU-320                   4 ( 1.0) 
     CSLU-324                   4 ( 1.0) 
     CSLU-332                   4 ( 1.0) 
     CSLU-335                   4 ( 1.0) 
     CSLU-339                   4 ( 1.0) 
     CSLU-348                   4 ( 1.0) 
     CSLU-349                   4 ( 1.0) 
     CSLU-355                   4 ( 1.0) 
     CSLU-359                   4 ( 1.0) 
     CSLU-372                   4 ( 1.0) 
     CSLU-373                   4 ( 1.0) 
     CSLU-375                   4 ( 1.0) 
     CSLU-379                   4 ( 1.0) 
     CSLU-388                   2 ( 0.5) 
     CSLU-389                   4 ( 1.0) 
     CSLU-393                   4 ( 1.0) 
     CSLU-395                   4 ( 1.0) 
     CSLU-417                   4 ( 1.0) 
     CSLU-419                   4 ( 1.0) 
     CSLU-427                   4 ( 1.0) 
     CSLU-432                   3 ( 0.8) 
     CSLU-435                   4 ( 1.0) 
     CSLU-441                   4 ( 1.0) 
     CSLU-442                   4 ( 1.0) 
     CSLU-447                   4 ( 1.0) 
     CSLU-454                   4 ( 1.0) 
     CSLU-460                   4 ( 1.0) 
     CSLU-470                   4 ( 1.0) 
     CSLU-472                   4 ( 1.0) 
     CSLU-477                   4 ( 1.0) 
     CSLU-482                   4 ( 1.0) 
     CSLU-486                   4 ( 1.0) 
     CSLU-499                   4 ( 1.0) 
  ca (mean (SD))             6.83 (1.06) 
  viq (mean (SD))          100.82 (18.74)
  dx (%)                                 
     ASD                      183 (48.0) 
     SLI                       71 (18.6) 
     TD                       127 (33.3) 
  activity (%)                           
     Conversation              94 (24.7) 
     Picture Description       94 (24.7) 
     Play                      96 (25.2) 
     Wordless Picture Book     97 (25.5) 
  content (mean (SD))       18.73 (24.84)
  filler (mean (SD))        11.20 (17.59)
  rep (mean (SD))            6.24 (9.45) 
  rev (mean (SD))            3.79 (4.31) 
  fs (mean (SD))             8.70 (12.76)
  cued (mean (SD))          14.36 (24.22)
  not_cued (mean (SD))      26.77 (31.73)
We need to tell it a bit about what we are looking for:
## Vector of variables to summarize
my_num_vars <- c("ca", "viq", "content", "filler", "rep", "rev", "fs", "cued", "not_cued")
## Vector of categorical variables that need transformation
my_cat_vars <- c("dx", "activity")
## Create a TableOne object
tab2 <- CreateTableOne(vars = my_num_vars, data = mazes, factorVars = my_cat_vars)
print(tab2, showAllLevels = TRUE)
                      
                       level Overall       
  n                             381        
  ca (mean (SD))               6.83 (1.06) 
  viq (mean (SD))            100.82 (18.74)
  content (mean (SD))         18.73 (24.84)
  filler (mean (SD))          11.20 (17.59)
  rep (mean (SD))              6.24 (9.45) 
  rev (mean (SD))              3.79 (4.31) 
  fs (mean (SD))               8.70 (12.76)
  cued (mean (SD))            14.36 (24.22)
  not_cued (mean (SD))        26.77 (31.73)
If we want the summary statistics to be computed on a per-strata basis, we can ask for that:
# Another 
tab3 <- CreateTableOne(vars = my_num_vars, strata = "dx" , data = mazes)
tab3
                      Stratified by dx
                       ASD           SLI           TD             p      test
  n                      183            71            127                    
  ca (mean (SD))        6.74 (1.11)   7.15 (1.00)    6.76 (0.97)   0.015     
  viq (mean (SD))      95.29 (17.62) 86.24 (5.95)  116.94 (12.82) <0.001     
  content (mean (SD))  20.46 (29.73) 17.34 (24.35)  17.00 (15.67)  0.422     
  filler (mean (SD))    7.86 (13.54) 10.56 (16.35)  16.38 (21.84) <0.001     
  rep (mean (SD))       7.25 (11.82)  5.45 (6.86)    5.23 (6.21)   0.134     
  rev (mean (SD))       3.87 (4.85)   3.25 (3.55)    3.98 (3.85)   0.498     
  fs (mean (SD))        9.35 (14.60)  8.63 (15.00)   7.80 (7.55)   0.574     
  cued (mean (SD))     10.66 (21.94) 13.21 (22.54)  20.35 (27.10)  0.002     
  not_cued (mean (SD)) 25.52 (33.49) 25.25 (31.84)  29.41 (29.04)  0.517     
DT packageDT lets us produce interactive data tables for use in HTML reports that go far beyond kable’s capabilities.
An excellent tutorial on DT is available at https://rstudio.github.io/DT/.
“Out of the box”, datatable gives us a very usable table:
datatable(mazes)
pixiedust (best for PDF)Pixiedust is a ggplot-inspired table package that includes a lot of flexibility and works well for PDF-based output. We can illustrate its use by making a regression table.
Remember that broom package we used earlier? Here is what the model parameters look like; this is obviously not what a publicatio-ready version of the table would look like:
tidy(fit)
| term | estimate | std.error | statistic | p.value | 
|---|---|---|---|---|
| (Intercept) | 9.37 | 8.37 | 1.12 | 0.274 | 
| qsec | 1.24 | 0.383 | 3.25 | 0.00317 | 
| factor(am)1 | 3.15 | 1.94 | 1.62 | 0.117 | 
| wt | -3.93 | 0.743 | -5.29 | 1.58e-05 | 
| factor(gear)4 | -0.268 | 1.66 | -0.162 | 0.873 | 
| factor(gear)5 | -0.27 | 2.06 | -0.131 | 0.897 | 
dust(fit) %>% 
  sprinkle(cols = "term", 
           replace = c("Intercept", "Quarter Mile Time", "Automatic vs. Manual",
                       "Weight", "Gears: 4 vs. 3", "Gears: 5 vs 3")) %>%
  sprinkle(cols = c("estimate", "std.error", "statistic"),
           round = 3) %>% 
  sprinkle(cols = "p.value", fn = quote(pvalString(value))) %>% 
  sprinkle_colnames("Term", "Coefficient", "SE", "T-statistic", "P-value")
| Term | Coefficient | SE | T-statistic | P-value | 
|---|---|---|---|---|
| Intercept | 9.365 | 8.373 | 1.118 | 0.27 | 
| Quarter Mile Time | 1.245 | 0.383 | 3.252 | 0.003 | 
| Automatic vs. Manual | 3.151 | 1.941 | 1.624 | 0.12 | 
| Weight | -3.926 | 0.743 | -5.286 | < 0.001 | 
| Gears: 4 vs. 3 | -0.268 | 1.655 | -0.162 | 0.87 | 
| Gears: 5 vs 3 | -0.27 | 2.063 | -0.131 | 0.9 | 
For more information about Pixiedust, consult:
extrafontThe extrafont package is the best place to start with fonts in R and ggplot.
library(extrafont)
Note: Make sure to follow all installation instructions from github!
You can access fonts on your local system in ggplot by using the theme() function to set the relevant properties of your figure.
demo_plt <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color=Species)) + 
  geom_point(size=3) +
  scale_color_discrete(labels=c("Setosa", "Versicolor", "Virginica")) + 
  labs(x="Sepal Length (cm)", y="Sepal Width (cm)", title="Fisher's Irises") 
demo_plt

In ggplot, the theme() function lets us control the appearance of various aspects of the plot. There are many, many properties we can tweak; we will begin by setting the global text property for all text on the entire plot.
Let’s try changing that font a bit:
demo_plt + theme(text=element_text(family="Optima"))

demo_plt + theme(text=element_text(family="Palatino"))

demo_plt + theme(text=element_text(family="Zapfino"))

Note that the specific name that you must use to refer to the font is not exactly obvious. It is not the font name, but rather the font’s “family name”. Because of some under-the-hood details about the way that fonts work, this is a bit different from what you might see in e.g. a font selection menu in Word. You can see a list of all the fonts that R knows about by using the fonttable() command:
fonttable() %>% glimpse()
Rows: 149
Columns: 10
$ package    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ afmfile    <chr> "Keyboard.afm.gz", "SFNSDisplay.afm.gz", "SFNSText.afm.gz",…
$ fontfile   <chr> "/System/Library/Fonts/Keyboard.ttf", "/System/Library/Font…
$ FullName   <chr> ".Keyboard", "System Font", "System Font", "System Font Ita…
$ FamilyName <chr> ".Keyboard", "System Font", "System Font", "System Font", "…
$ FontName   <chr> "-Keyboard", "-SFNSDisplay", "-SFNSText", "-SFNSText-Italic…
$ Bold       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ Italic     <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALS…
$ Symbol     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ afmsymfile <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
fonttable() %>% head()
| package | afmfile | fontfile | FullName | FamilyName | FontName | Bold | Italic | Symbol | afmsymfile | 
|---|---|---|---|---|---|---|---|---|---|
| Keyboard.afm.gz | /System/Library/Fonts/Keyboard.ttf | .Keyboard | .Keyboard | -Keyboard | FALSE | FALSE | FALSE | ||
| SFNSDisplay.afm.gz | /System/Library/Fonts/SFNSDisplay.ttf | System Font | System Font | -SFNSDisplay | FALSE | FALSE | FALSE | ||
| SFNSText.afm.gz | /System/Library/Fonts/SFNSText.ttf | System Font | System Font | -SFNSText | FALSE | FALSE | FALSE | ||
| SFNSTextItalic.afm.gz | /System/Library/Fonts/SFNSTextItalic.ttf | System Font Italic | System Font | -SFNSText-Italic | FALSE | TRUE | FALSE | ||
| Andale Mono.afm.gz | /Library/Fonts/Andale Mono.ttf | Andale Mono | Andale Mono | AndaleMono | FALSE | FALSE | FALSE | ||
| Apple Braille.afm.gz | /System/Library/Fonts/Apple Braille.ttf | Apple Braille | Apple Braille | AppleBraille | FALSE | FALSE | FALSE | 
fonttable() %>% select(FullName, FamilyName, FontName) %>% head(20)
| FullName | FamilyName | FontName | 
|---|---|---|
| .Keyboard | .Keyboard | -Keyboard | 
| System Font | System Font | -SFNSDisplay | 
| System Font | System Font | -SFNSText | 
| System Font Italic | System Font | -SFNSText-Italic | 
| Andale Mono | Andale Mono | AndaleMono | 
| Apple Braille | Apple Braille | AppleBraille | 
| Apple Braille Outline 6 Dot | Apple Braille | AppleBraille-Outline6Dot | 
| Apple Braille Outline 8 Dot | Apple Braille | AppleBraille-Outline8Dot | 
| Apple Braille Pinpoint 6 Dot | Apple Braille | AppleBraille-Pinpoint6Dot | 
| Apple Braille Pinpoint 8 Dot | Apple Braille | AppleBraille-Pinpoint8Dot | 
| AppleMyungjo Regular | AppleMyungjo | AppleMyungjo | 
| Arial Black | Arial Black | Arial-Black | 
| Arial Bold Italic | Arial | Arial-BoldItalicMT | 
| Arial Bold | Arial | Arial-BoldMT | 
| Arial Italic | Arial | Arial-ItalicMT | 
| Arial | Arial | ArialMT | 
| Arial Narrow | Arial Narrow | ArialNarrow | 
| Arial Narrow Bold | Arial Narrow | ArialNarrow-Bold | 
| Arial Narrow Bold Italic | Arial Narrow | ArialNarrow-BoldItalic | 
| Arial Narrow Italic | Arial Narrow | ArialNarrow-Italic | 
fonttable() %>% select(FamilyName) %>% distinct()
| FamilyName | 
|---|
| .Keyboard | 
| System Font | 
| Andale Mono | 
| Apple Braille | 
| AppleMyungjo | 
| Arial Black | 
| Arial | 
| Arial Narrow | 
| Arial Rounded MT Bold | 
| Arial Unicode MS | 
| Bitter | 
| Bodoni Ornaments | 
| Bodoni 72 Smallcaps | 
| Brush Script MT | 
| Calibri | 
| Calibri Light | 
| Comic Sans MS | 
| Consolas | 
| Courier New | 
| DIN Alternate | 
| DIN Condensed | 
| Fira Sans Black | 
| Fira Sans | 
| Fira Sans ExtraBold | 
| Fira Sans ExtraLight | 
| Fira Sans Light | 
| Fira Sans Medium | 
| Fira Sans SemiBold | 
| Fira Sans Thin | 
| Fondamento | 
| Georgia | 
| Gorni | 
| Impact | 
| Karla | 
| Khmer Sangam MN | 
| Lao Sangam MN | 
| Lato Black | 
| Lato | 
| Lato Hairline | 
| Lato Light | 
| Luminari | 
| Microsoft Sans Serif | 
| Noto Sans | 
| Noto Serif | 
| Open Sans | 
| Open Sans Extrabold | 
| Open Sans Light | 
| Open Sans Semibold | 
| SILDoulos IPA93 | 
| SILManuscript IPA93 | 
| SILSophia IPA93 | 
| Tahoma | 
| Tangerine | 
| Times New Roman | 
| Trattatello | 
| Trebuchet MS | 
| Verdana | 
| Webdings | 
| WeePeople | 
| Wingdings | 
| Wingdings 2 | 
| Wingdings 3 | 
| .SF NS Rounded | 
| Harding Text Web Bold | 
| Harding Text Web Regular | 
Long story short, you’ll want to use the value in FamilyName to refer to your font of interest.
What if we wanted to have a different font for the title? The plot.title property is where we would want to look:
demo_plt + theme(
  text=element_text(family="Palatino"),
  plot.title=element_text(family="Zapfino")
  )

We can follow this pattern to change anything we like about the text in different parts of the plot:
demo_plt + theme(
  text=element_text(family="Palatino"),
  plot.title=element_text(family="Zapfino", size=8, color="blue"),
  axis.title.x = element_text(family="Optima", size=24)
  )

This part is a bit beyond the scope of this class, but the upshot is that after you install new fonts in your computer, you should tell extrafont about it using the font_import() command.
extrafont only knows about TrueType fonts (ones whose file ends in .ttf). If you have an OpenType font you want to use with R, your best bet probably involves the showtext package.