1 Goals for Lab 05

Become familiar with tools for generating publication-ready tables directly in R.

We will use data from the following paper: MacFarlane, H., Gorman, K., Ingham, R., Presmanes Hill, A., Papadakis, K., Kiss, G., & van Santen, J. (2017). Quantitative analysis of disfluency in children with autism spectrum disorder or language impairment. PLoS ONE, 12(3), e0173936.

mazes <- read_csv("http://bit.ly/mazes-gist") %>% 
  clean_names()  #janitor package

glimpse(mazes)
Rows: 381
Columns: 12
$ study_id <chr> "CSLU-001", "CSLU-001", "CSLU-001", "CSLU-001", "CSLU-002", "…
$ ca       <dbl> 5.6667, 5.6667, 5.6667, 5.6667, 6.5000, 6.5000, 6.5000, 6.500…
$ viq      <dbl> 124, 124, 124, 124, 124, 124, 124, 124, 108, 108, 108, 108, 1…
$ dx       <chr> "TD", "TD", "TD", "TD", "TD", "TD", "TD", "TD", "TD", "TD", "…
$ activity <chr> "Conversation", "Picture Description", "Play", "Wordless Pict…
$ content  <dbl> 24, 1, 21, 8, 3, 5, 8, 2, 25, 10, 2, 5, 32, 20, 13, 21, 27, 9…
$ filler   <dbl> 31, 2, 6, 2, 10, 3, 8, 2, 21, 13, 10, 2, 12, 9, 4, 4, 12, 6, …
$ rep      <dbl> 2, 0, 3, 0, 3, 2, 3, 0, 4, 0, 0, 0, 13, 5, 5, 6, 10, 5, 5, 2,…
$ rev      <dbl> 5, 0, 8, 4, 0, 1, 2, 0, 4, 2, 1, 3, 8, 7, 2, 8, 5, 1, 4, 2, 1…
$ fs       <dbl> 17, 1, 10, 4, 0, 2, 3, 2, 17, 8, 1, 2, 11, 8, 6, 7, 12, 3, 3,…
$ cued     <dbl> 36, 2, 6, 2, 10, 3, 9, 2, 29, 13, 11, 2, 14, 12, 4, 11, 17, 7…
$ not_cued <dbl> 50, 3, 27, 10, 13, 8, 15, 4, 38, 23, 11, 7, 42, 26, 17, 18, 3…

2 TL;DR

The workhorse for making tables in R Markdown documents is the knitr package’s kable function. This function is really versatile, but also free of fancy formatting options, for better or worse.

3 knitr::kable

3.1 kable all tables everywhere

In order to tell RMarkdown to use kable to format all tabular output, update the YAML of your document. For HTML:

---
title: "My Awesome Data Vis Lab"
output:
  html_document:
    df_print: kable
---

You can also define the html format in the global options.

# If you don't define format here, you'll need put `format = "html"` in every kable function.
options(knitr.table.format = "html") 
# You may also wish to set this option
options(scipen = 1, digits = 2)

3.2 kable table in a chunk

For HTML output:

head(mazes) %>% 
  kable(format = "html")
study_id ca viq dx activity content filler rep rev fs cued not_cued
CSLU-001 5.6667 124 TD Conversation 24 31 2 5 17 36 50
CSLU-001 5.6667 124 TD Picture Description 1 2 0 0 1 2 3
CSLU-001 5.6667 124 TD Play 21 6 3 8 10 6 27
CSLU-001 5.6667 124 TD Wordless Picture Book 8 2 0 4 4 2 10
CSLU-002 6.5000 124 TD Conversation 3 10 3 0 0 10 13
CSLU-002 6.5000 124 TD Picture Description 5 3 2 1 2 3 8

Note that this is a bit rough-looking. We will use various features of kable to improve our output.

For starters, we can add a caption:

head(mazes) %>% 
  kable(format = "html", digits = 2, caption = "A table produced by kable.")
A table produced by kable.
study_id ca viq dx activity content filler rep rev fs cued not_cued
CSLU-001 5.67 124 TD Conversation 24 31 2 5 17 36 50
CSLU-001 5.67 124 TD Picture Description 1 2 0 0 1 2 3
CSLU-001 5.67 124 TD Play 21 6 3 8 10 6 27
CSLU-001 5.67 124 TD Wordless Picture Book 8 2 0 4 4 2 10
CSLU-002 6.50 124 TD Conversation 3 10 3 0 0 10 13
CSLU-002 6.50 124 TD Picture Description 5 3 2 1 2 3 8

We can also manually specify human-readable column names:

my_maze_names <- c("Participant", "Age", "Verbal\nIQ", "Group", "Activity", "Content\nMaze", "Filler\nMaze", "Repetition", "Revision", "False\nStart", "Cued", "Not\nCued")
head(mazes) %>% 
  kable(format = "html", digits = 2, caption = "A table produced by kable.",
        col.names = my_maze_names)
A table produced by kable.
Participant Age Verbal IQ Group Activity Content Maze Filler Maze Repetition Revision False Start Cued Not Cued
CSLU-001 5.67 124 TD Conversation 24 31 2 5 17 36 50
CSLU-001 5.67 124 TD Picture Description 1 2 0 0 1 2 3
CSLU-001 5.67 124 TD Play 21 6 3 8 10 6 27
CSLU-001 5.67 124 TD Wordless Picture Book 8 2 0 4 4 2 10
CSLU-002 6.50 124 TD Conversation 3 10 3 0 0 10 13
CSLU-002 6.50 124 TD Picture Description 5 3 2 1 2 3 8

3.3 Styled kable tables in a chunk

To improve the visual layout of the table, we can use the kableExtra package, which provides the kable_styling() function:

head(mazes) %>% 
  kable(format = "html", digits = 2, caption = "A styled kable table.",
        col.names = my_maze_names) %>% 
  kable_styling()
A styled kable table.
Participant Age Verbal IQ Group Activity Content Maze Filler Maze Repetition Revision False Start Cued Not Cued
CSLU-001 5.67 124 TD Conversation 24 31 2 5 17 36 50
CSLU-001 5.67 124 TD Picture Description 1 2 0 0 1 2 3
CSLU-001 5.67 124 TD Play 21 6 3 8 10 6 27
CSLU-001 5.67 124 TD Wordless Picture Book 8 2 0 4 4 2 10
CSLU-002 6.50 124 TD Conversation 3 10 3 0 0 10 13
CSLU-002 6.50 124 TD Picture Description 5 3 2 1 2 3 8

There are lots of printing options: https://haozhu233.github.io/kableExtra/awesome_table_in_html.html

head(mazes) %>% 
  kable(format = "html", digits = 2, caption = "A non-full width zebra kable table.") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)
A non-full width zebra kable table.
study_id ca viq dx activity content filler rep rev fs cued not_cued
CSLU-001 5.67 124 TD Conversation 24 31 2 5 17 36 50
CSLU-001 5.67 124 TD Picture Description 1 2 0 0 1 2 3
CSLU-001 5.67 124 TD Play 21 6 3 8 10 6 27
CSLU-001 5.67 124 TD Wordless Picture Book 8 2 0 4 4 2 10
CSLU-002 6.50 124 TD Conversation 3 10 3 0 0 10 13
CSLU-002 6.50 124 TD Picture Description 5 3 2 1 2 3 8
head(mazes) %>% 
  kable(format = "html", digits = 2, caption = "Over here!") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Over here!
study_id ca viq dx activity content filler rep rev fs cued not_cued
CSLU-001 5.67 124 TD Conversation 24 31 2 5 17 36 50
CSLU-001 5.67 124 TD Picture Description 1 2 0 0 1 2 3
CSLU-001 5.67 124 TD Play 21 6 3 8 10 6 27
CSLU-001 5.67 124 TD Wordless Picture Book 8 2 0 4 4 2 10
CSLU-002 6.50 124 TD Conversation 3 10 3 0 0 10 13
CSLU-002 6.50 124 TD Picture Description 5 3 2 1 2 3 8

3.4 Controlling column appearance

We can control the formatting of individual columns using the column_spec() function:

head(mazes) %>% 
  kable(format = "html", digits = 2, caption = "Over here!") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left") %>% 
  column_spec(4, width="3cm", background="lightblue", border_right=TRUE)
Over here!
study_id ca viq dx activity content filler rep rev fs cued not_cued
CSLU-001 5.67 124 TD Conversation 24 31 2 5 17 36 50
CSLU-001 5.67 124 TD Picture Description 1 2 0 0 1 2 3
CSLU-001 5.67 124 TD Play 21 6 3 8 10 6 27
CSLU-001 5.67 124 TD Wordless Picture Book 8 2 0 4 4 2 10
CSLU-002 6.50 124 TD Conversation 3 10 3 0 0 10 13
CSLU-002 6.50 124 TD Picture Description 5 3 2 1 2 3 8

3.5 kable + kableExtra + formattable

color_tile and color_bar are neat extras if used wisely!

http://haozhu233.github.io/kableExtra/use_kableExtra_with_formattable.html

library(formattable)
head(mazes) %>% 
  mutate(ca = color_tile("transparent", "lightpink")(ca),
         viq = color_bar("lightseagreen")(viq)) %>%
  kable("html", escape = F, caption = 'This table is colored.') %>%
  kable_styling(position = "center") %>% 
  column_spec(4, width = "3cm") 
This table is colored.
study_id ca viq dx activity content filler rep rev fs cued not_cued
CSLU-001 5.6667 124 TD Conversation 24 31 2 5 17 36 50
CSLU-001 5.6667 124 TD Picture Description 1 2 0 0 1 2 3
CSLU-001 5.6667 124 TD Play 21 6 3 8 10 6 27
CSLU-001 5.6667 124 TD Wordless Picture Book 8 2 0 4 4 2 10
CSLU-002 6.5000 124 TD Conversation 3 10 3 0 0 10 13
CSLU-002 6.5000 124 TD Picture Description 5 3 2 1 2 3 8

3.6 tibble + kable + kableExtra

You can also use any of these tools with plain text tables using the tibble package to create a table. Two main functions:

  • tribble: enter tibble by rows
  • tibble: enter tibble by columns

For example, I used tribble to make this table in our slide decks:

math_table <- tibble::tribble(
  ~Operator, ~Description, ~Usage,
  "\\+", "addition", "x + y",
  "\\-", "subtraction", "x - y",
  "\\*", "multiplication", "x * y",
  "/", "division", "x / y",
  "^", "raised to the power of", "x ^ y",
  "abs", "absolute value", "abs(x)",
  "%/%", "integer division", "x %/% y",
  "%%", "remainder after division", "x %% y"
)

Then I used this chunk to print it:

```{r, results = 'asis'}
knitr::kable(math_table, format = "html", caption = "Helpful mutate functions") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
```
knitr::kable(math_table, format = "html", caption = "Helpful mutate functions") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Helpful mutate functions
Operator Description Usage
+ addition x + y
- subtraction x - y
* multiplication x * y
/ division x / y
^ raised to the power of x ^ y
abs absolute value abs(x)
%/% integer division x %/% y
%% remainder after division x %% y

4 Markdown Tables

Sometimes you may just want to type in a table in Markdown and ignore R. Four kinds of tables may be used. The first three kinds presuppose the use of a fixed-width font, such as Courier. The fourth kind can be used with proportionally spaced fonts, as it does not require lining up columns. All of the below will render when typed outside of an R code chunk since these are based on pandoc being used to render your markdown document. Note that these should all work whether you are knitting to either html or PDF.

4.1 Simple table

This code for a simple table:

  Right     Left     Center     Default
-------     ------ ----------   -------
     12     12        12            12
    123     123       123          123
      1     1          1             1

Table:  Demonstration of simple table syntax.

Produces this simple table:

Demonstration of simple table syntax.
Right Left Center Default
12 12 12 12
123 123 123 123
1 1 1 1

The headers and table rows must each fit on one line. Column alignments are determined by the position of the header text relative to the dashed line below it:3

  • If the dashed line is flush with the header text on the right side but extends beyond it on the left, the column is right-aligned.
  • If the dashed line is flush with the header text on the left side but extends beyond it on the right, the column is left-aligned.
  • If the dashed line extends beyond the header text on both sides, the column is centered.
  • If the dashed line is flush with the header text on both sides, the default alignment is used (in most cases, this will be left).
  • The table must end with a blank line, or a line of dashes followed by a blank line.

The column headers may be omitted, provided a dashed line is used to end the table.

4.2 Multi-line tables

This code for a multi-line table:

-------------------------------------------------------------
 Centered   Default           Right Left
  Header    Aligned         Aligned Aligned
----------- ------- --------------- -------------------------
   First    row                12.0 Example of a row that
                                    spans multiple lines.

  Second    row                 5.0 Here's another one. Note
                                    the blank line between
                                    rows.
-------------------------------------------------------------

Table: Here's the caption. It, too, may span
multiple lines.

Produces this multi-line table:

Here’s the caption. It, too, may span multiple lines.
Centered Header Default Aligned Right Aligned Left Aligned
First row 12.0 Example of a row that spans multiple lines.
Second row 5.0 Here’s another one. Note the blank line between rows.

4.3 Grid tables

This code for a grid table:

: Sample grid table.

+---------------+---------------+--------------------+
| Fruit         | Price         | Advantages         |
+===============+===============+====================+
| Bananas       | $1.34         | - built-in wrapper |
|               |               | - bright color     |
+---------------+---------------+--------------------+
| Oranges       | $2.10         | - cures scurvy     |
|               |               | - tasty            |
+---------------+---------------+--------------------+

Produces this grid table:

Sample grid table.
Fruit Price Advantages
Bananas $1.34
  • built-in wrapper
  • bright color
Oranges $2.10
  • cures scurvy
  • tasty

Alignments are not supported, nor are cells that span multiple columns or rows.

4.4 Pipe tables

This code for a pipe table:

| Right | Left | Default | Center |
|------:|:-----|---------|:------:|
|   12  |  12  |    12   |    12  |
|  123  |  123 |   123   |   123  |
|    1  |    1 |     1   |     1  |

  : Demonstration of pipe table syntax.

Produces this pipe table:

Demonstration of pipe table syntax.
Right Left Default Center
12 12 12 12
123 123 123 123
1 1 1 1

5 Making tables in R

If you want to make tables that include R output (like output from functions like means, variances, or output from models), there are two steps:

  1. Get the numbers you need in tabular format; then
  2. Render that information in an aesthetically-pleasing way.

This section covers (1). But, although there are some nice options for (2) within R Markdown via various packages, I am not dogmatic about doing everything in R Markdown, especially things like (2).

Note that many of the tidying techniques we’ll be using are ones we’ve seen before; you’ll be using them in the challenge assignment this week on this dataset, so hopefully this walk-through will give you some ideas.

5.1 dplyr

We’ll use the pnwflights14 package to practice our dplyr skills. We need to download the package from github using devtools.

# once per machine
install.packages("devtools")
devtools::install_github("ismayc/pnwflights14")

Now, we need to load the flights dataset from the pnwflights14 package.

# once per work session
data("flights", package = "pnwflights14")

5.1.1 dplyr::select

Recall that we use select to specify which columns in a dataframe you’d like to keep by name:

# keep these 2 cols
mini_flights <- flights %>% 
  select(carrier, flight)
glimpse(mini_flights)
Rows: 162,049
Columns: 2
$ carrier <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", "U…
$ flight  <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569, 6…
# keep first five cols
first_five <- flights %>% 
  select(year, month, day, dep_time, dep_delay)
glimpse(first_five)
Rows: 162,049
Columns: 5
$ year      <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
$ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
$ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
# alternatively, specify range
first_five <- flights %>% 
  select(year:dep_delay)
glimpse(first_five)
Rows: 162,049
Columns: 5
$ year      <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
$ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
$ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…

We can also choose the columns we want by negation, that is, you can specify which columns to drop instead of keep. This way, all variables not listed are kept.

# we can also use negation
all_but_year <- flights %>% 
  select(-year)
glimpse(all_but_year)
Rows: 162,049
Columns: 15
$ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
$ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
$ arr_time  <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907…
$ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, …
$ carrier   <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", …
$ tailnum   <chr> "N508AS", "N195UW", "N37422", "N547UW", "N762AS", "N806DN", …
$ flight    <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569,…
$ origin    <chr> "PDX", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "PDX", "SEA…
$ dest      <chr> "ANC", "CLT", "IAH", "CLT", "ANC", "DTW", "ORD", "IAH", "DEN…
$ air_time  <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, …
$ distance  <dbl> 1542, 2279, 1825, 2282, 1448, 1927, 1721, 1825, 1024, 2402, …
$ hour      <dbl> 0, 0, 0, 0, 0, 0, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, …
$ minute    <dbl> 1, 4, 8, 28, 34, 37, 46, 26, 27, 36, 41, 49, 50, 57, 57, 58,…

dplyr::select comes with several other helper functions…

depart <- flights %>%
  select(starts_with("dep_"))
glimpse(depart)
Rows: 162,049
Columns: 2
$ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
$ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
times <- flights %>%
  select(contains("time"))
glimpse(times)
Rows: 162,049
Columns: 3
$ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557, …
$ arr_time <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907,…
$ air_time <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, 8…
# note that we are not actually saving the new dataframe here
flights %>%
  select(-contains("time")) %>% head()
yearmonthdaydep_delayarr_delaycarriertailnumflightorigindestdistancehourminute
2014119670ASN508AS145PDXANC1.54e+0301
201411-6-23USN195UW1830SEACLT2.28e+0304
20141113-4UAN374221609PDXIAH1.82e+0308
201411-2-23USN547UW466PDXCLT2.28e+03028
2014114443ASN762AS121SEAANC1.45e+03034
2014118288DLN806DN1823SEADTW1.93e+03037
delays <- flights %>%
  select(ends_with("delay"))
glimpse(delays)
Rows: 162,049
Columns: 2
$ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
$ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, …

One of my favorite select helper functions is everything(), which allows you to use select to keep all your variables, but easily rearrange the columns without having to list all the variables to keep/drop.

new_order <- flights %>% 
  select(origin, dest, everything())
head(new_order)
origindestyearmonthdaydep_timedep_delayarr_timearr_delaycarriertailnumflightair_timedistancehourminute
PDXANC20141119623570ASN508AS1451941.54e+0301
SEACLT2014114-6738-23USN195UW18302522.28e+0304
PDXIAH201411813548-4UAN3742216092011.82e+0308
PDXCLT20141128-2800-23USN547UW4662512.28e+03028
SEAANC201411344432543ASN762AS1212011.45e+03034
SEADTW201411378274788DLN806DN18232241.93e+03037
# with negation
new_order2 <- flights %>% 
  select(origin, dest, everything(), -year)
head(new_order2)
origindestmonthdaydep_timedep_delayarr_timearr_delaycarriertailnumflightair_timedistancehourminute
PDXANC1119623570ASN508AS1451941.54e+0301
SEACLT114-6738-23USN195UW18302522.28e+0304
PDXIAH11813548-4UAN3742216092011.82e+0308
PDXCLT1128-2800-23USN547UW4662512.28e+03028
SEAANC11344432543ASN762AS1212011.45e+03034
SEADTW11378274788DLN806DN18232241.93e+03037

We can also rename variables within select.

flights2 <- flights %>%
  select(tail_num = tailnum, everything())
head(flights2)
tail_numyearmonthdaydep_timedep_delayarr_timearr_delaycarrierflightorigindestair_timedistancehourminute
N508AS20141119623570AS145PDXANC1941.54e+0301
N195UW2014114-6738-23US1830SEACLT2522.28e+0304
N37422201411813548-4UA1609PDXIAH2011.82e+0308
N547UW20141128-2800-23US466PDXCLT2512.28e+03028
N762AS201411344432543AS121SEAANC2011.45e+03034
N806DN201411378274788DL1823SEADTW2241.93e+03037

If you don’t want to move the renamed variables within your dataframe, you can use the rename function.

flights3 <- flights %>%
  rename(tail_num = tailnum)
glimpse(flights3)
Rows: 162,049
Columns: 16
$ year      <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, …
$ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
$ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2,…
$ arr_time  <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907…
$ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16, …
$ carrier   <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", …
$ tail_num  <chr> "N508AS", "N195UW", "N37422", "N547UW", "N762AS", "N806DN", …
$ flight    <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569,…
$ origin    <chr> "PDX", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "PDX", "SEA…
$ dest      <chr> "ANC", "CLT", "IAH", "CLT", "ANC", "DTW", "ORD", "IAH", "DEN…
$ air_time  <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, …
$ distance  <dbl> 1542, 2279, 1825, 2282, 1448, 1927, 1721, 1825, 1024, 2402, …
$ hour      <dbl> 0, 0, 0, 0, 0, 0, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, …
$ minute    <dbl> 1, 4, 8, 28, 34, 37, 46, 26, 27, 36, 41, 49, 50, 57, 57, 58,…

5.1.2 dplyr::filter

As we have previously seen, filter has many flexible uses:

# flights taking off from PDX
pdx <- flights %>% 
  filter(origin == "PDX")
head(pdx)
yearmonthdaydep_timedep_delayarr_timearr_delaycarriertailnumflightorigindestair_timedistancehourminute
20141119623570ASN508AS145PDXANC1941.54e+0301
201411813548-4UAN374221609PDXIAH2011.82e+0308
20141128-2800-23USN547UW466PDXCLT2512.28e+03028
201411526-4114815UAN813UA229PDXIAH2171.82e+03526
20141154119114UAN364761569PDXDEN130991       541
2014115492490712USN548UW649PDXPHX1221.01e+03549
# january flights from PDX
pdx_jan <- flights %>% 
  filter(origin == "PDX", month == 1) # the comma is an "and"
head(pdx_jan)
yearmonthdaydep_timedep_delayarr_timearr_delaycarriertailnumflightorigindestair_timedistancehourminute
20141119623570ASN508AS145PDXANC1941.54e+0301
201411813548-4UAN374221609PDXIAH2011.82e+0308
20141128-2800-23USN547UW466PDXCLT2512.28e+03028
201411526-4114815UAN813UA229PDXIAH2171.82e+03526
20141154119114UAN364761569PDXDEN130991       541
2014115492490712USN548UW649PDXPHX1221.01e+03549
# flights to ATL (Atlanta) or BNA (Nashville)
to_south <- flights %>% 
  filter(dest == "ATL" | dest == "BNA") %>% # | is "or"
  select(origin, dest, everything())
head(to_south)
origindestyearmonthdaydep_timedep_delayarr_timearr_delaycarriertailnumflightair_timedistancehourminute
SEAATL201411624-61401-6DLN617DL9682352.18e+03624
SEAATL201411802-31533-17ASN532AS7422492.18e+0382
SEAATL201411824-11546-14DLN633DL1282342.18e+03824
PDXATL201411944-61727-8ASN548AS7522522.17e+03944
PDXATL201411105494180784DLN377DA15022372.17e+031054
SEAATL201411115861915-14DLN6712B19622342.18e+031158
# flights from PDX to ATL (Atlanta) or BNA (Nashville)
pdx_to_south <- flights %>% 
  filter(origin == "PDX", dest == "ATL" | dest == "BNA") %>% # | is "or"
  select(origin, dest, everything())
head(pdx_to_south)
origindestyearmonthdaydep_timedep_delayarr_timearr_delaycarriertailnumflightair_timedistancehourminute
PDXATL201411944-61727-8ASN548AS7522522.17e+03944
PDXATL201411105494180784DLN377DA15022372.17e+031054
PDXATL2014111323-22038-15DLN393DA7732352.17e+031323
PDXATL201411225386114DLN371DA5032402.17e+032253
PDXATL201412627-31350-7DLN3746H11562442.17e+03627
PDXATL201412918-21643-2DLN375619402492.17e+03918
# alternatively, using group membership
south_dests <- c("ATL", "BNA")
pdx_to_south2 <- flights %>% 
  filter(origin == "PDX", dest %in% south_dests) %>% 
  select(origin, dest, everything())
head(pdx_to_south2)
origindestyearmonthdaydep_timedep_delayarr_timearr_delaycarriertailnumflightair_timedistancehourminute
PDXATL201411944-61727-8ASN548AS7522522.17e+03944
PDXATL201411105494180784DLN377DA15022372.17e+031054
PDXATL2014111323-22038-15DLN393DA7732352.17e+031323
PDXATL201411225386114DLN371DA5032402.17e+032253
PDXATL201412627-31350-7DLN3746H11562442.17e+03627
PDXATL201412918-21643-2DLN375619402492.17e+03918
# flights delayed by 1 hour or more
delay_1plus <- flights %>%
  filter(dep_delay >= 60)
head(delay_1plus)
yearmonthdaydep_timedep_delayarr_timearr_delaycarriertailnumflightorigindestair_timedistancehourminute
20141119623570ASN508AS145PDXANC1941.54e+0301
201411378274788DLN806DN1823SEADTW2241.93e+03037
201411346227936219UAN142191481SEAORD2021.72e+03346
20141165090103791USN626AW460SEAPHX1411.11e+03650
2014119591641137157ASN534AS805SEASMF77605       959
201411100868124264ASN788AS456SEALAX129954       108
# flights delayed by 1 hour, but not more than 2 hours
delay_1hr <- flights %>%
  filter(dep_delay >= 60, dep_delay < 120)
head(delay_1hr)
yearmonthdaydep_timedep_delayarr_timearr_delaycarriertailnumflightorigindestair_timedistancehourminute
20141119623570ASN508AS145PDXANC1941.54e+0301
201411378274788DLN806DN1823SEADTW2241.93e+03037
20141165090103791USN626AW460SEAPHX1411.11e+03650
201411100868124264ASN788AS456SEALAX129954       108
201411101475161381UAN374081444SEAORD2011.72e+031014
201411103681140863OON218AG3466PDXTUS1341.12e+031036
range(delay_1hr$dep_delay, na.rm = TRUE)
[1]  60 119
# even more efficient using between (always inclusive)
delay_bwn <- flights %>%
  filter(between(dep_delay, 60, 119))
head(delay_bwn)
yearmonthdaydep_timedep_delayarr_timearr_delaycarriertailnumflightorigindestair_timedistancehourminute
20141119623570ASN508AS145PDXANC1941.54e+0301
201411378274788DLN806DN1823SEADTW2241.93e+03037
20141165090103791USN626AW460SEAPHX1411.11e+03650
201411100868124264ASN788AS456SEALAX129954       108
201411101475161381UAN374081444SEAORD2011.72e+031014
201411103681140863OON218AG3466PDXTUS1341.12e+031036
range(delay_bwn$dep_delay, na.rm = TRUE)
[1]  60 119

5.1.3 dplyr::arrange

# default is ascending order
flights %>% 
  arrange(year, month, day) %>% head(n=20)
yearmonthdaydep_timedep_delayarr_timearr_delaycarriertailnumflightorigindestair_timedistancehourminute
20141119623570ASN508AS145PDXANC1941.54e+0301
2014114-6738-23USN195UW1830SEACLT2522.28e+0304
201411813548-4UAN374221609PDXIAH2011.82e+0308
20141128-2800-23USN547UW466PDXCLT2512.28e+03028
201411344432543ASN762AS121SEAANC2011.45e+03034
201411378274788DLN806DN1823SEADTW2241.93e+03037
201411346227936219UAN142191481SEAORD2021.72e+03346
201411526-4114815UAN813UA229PDXIAH2171.82e+03526
201411527791724UAN754331576SEADEN1361.02e+03527
20141153611334-6UAN574UA478SEAEWR2682.4e+03 536
20141154119114UAN364761569PDXDEN130991       541
2014115492490712USN548UW649PDXPHX1221.01e+03549
2014115500837-12DLN660DL1634SEASLC82689       550
201411557-31134-16AAN3JLAA1094SEADFW1841.66e+03557
201411557-3825-25ASN562AS81SEAANC1881.45e+03557
201411558-2801-2ASN402AS200SEASJC100697       558
201411559-1916-9F9N210FR796PDXDEN125991       559
20141160001151-19AAN3JFAA2240SEAORD2061.72e+0360
201411600-10842-8ASN786AS426SEALAX125954       60
201411602-39435F9N201FR144SEADEN1291.02e+0362
# descending order
flights %>% 
  arrange(desc(year), desc(month), desc(day)) %>% head(n=20)
yearmonthdaydep_timedep_delayarr_timearr_delaycarriertailnumflightorigindestair_timedistancehourminute
2014123121260131AAN3JKAA1230SEADFW2101.66e+0302
2014123127-36233AAN3EWAA1431SEAORD2111.72e+03027
2014123139143244ASN762AS135SEAANC2071.45e+03039
201412314005490DLN757AT2440SEAMSP1741.4e+03 040
2014123152-8917-21AAN3JFAA371SEAMIA3062.72e+03052
2014123154462117DLN128DL1670PDXMSP1741.43e+03054
20141231566184880DLN655DL929SEAATL2492.18e+03056
20141231512-39044USN653AW480SEAPHX1541.11e+03512
20141231515-58555USN580UW425PDXPHX1441.01e+03515
2014123153448597UAN344601075PDXDEN125991       534
201412315461916-4WNN8323C757PDXDEN126991       546
20141231548-21351-13UAN461UA665PDXEWR2872.43e+03548
201412315494120812UAN688071457SEAIAH2231.87e+03549
2014123155009222WNN797MX2121PDXPHX1381.01e+03550
20141231551-4120212AAN3HXAA1094SEADFW2111.66e+03551
20141231551-9744-15ASN570AS342SEAOAK91671       551
20141231555-10824-1ASN548AS602SEALAS124867       555
20141231558-28490DLN668DN1831PDXSLC85630       558
20141231558-211494AAN436AA1534PDXDFW2111.62e+03558
20141231558-2738-4ASN585AS406PDXSJC80569       558

5.1.4 dplyr::distinct

# unique origin-dest combinations
flights %>% 
  select(origin, dest) %>% 
  distinct %>% head(n=50)
origindest
PDXANC
SEACLT
PDXIAH
PDXCLT
SEAANC
SEADTW
SEAORD
SEADEN
SEAEWR
PDXDEN
PDXPHX
SEASLC
SEADFW
SEASJC
SEALAX
PDXORD
SEAOAK
SEASFO
PDXSJC
SEASNA
SEASAN
PDXDFW
PDXEWR
SEAIAH
SEAATL
PDXBUR
SEAMDW
PDXLAX
PDXSAN
PDXIAD
SEAPSP
PDXMDW
SEAMSP
PDXSNA
SEAPHX
PDXHNL
SEAJFK
PDXMSP
PDXSFO
SEALAS
SEALGB
PDXONT
PDXLGB
PDXJFK
PDXKOA
PDXLAS
SEAONT
SEAMKE
SEABUR
SEAKTN
# all unique destinations from PDX (there are 49, so we'll just list the first few)
from_pdx <- flights %>% 
  filter(origin == "PDX") %>% 
  select(origin, dest) %>%
  distinct(dest)
head(from_pdx)
dest
ANC
IAH
CLT
DEN
PHX
ORD

5.1.5 dplyr::mutate

mutate is used to transform existing variables or create new ones; here, we are using it to create an indicator variable that identifies flights that were entirely on-time.

# add total delay variable
flights %>%
  mutate(tot_delay = dep_delay + arr_delay) %>%
  select(origin, dest, ends_with("delay"), everything()) %>% 
  head
origindestdep_delayarr_delaytot_delayyearmonthdaydep_timearr_timecarriertailnumflightair_timedistancehourminute
PDXANC96701662014111235ASN508AS1451941.54e+0301
SEACLT-6-23-292014114738USN195UW18302522.28e+0304
PDXIAH13-492014118548UAN3742216092011.82e+0308
PDXCLT-2-23-2520141128800USN547UW4662512.28e+03028
SEAANC44438720141134325ASN762AS1212011.45e+03034
SEADTW828817020141137747DLN806DN18232241.93e+03037
# flights that were delayed at departure had on time or early arrivals?
arrivals <- flights %>%
  mutate(arr_ok = ifelse(dep_delay > 0 & arr_delay <= 0, 1, 0)) %>% 
  select(origin, dest, ends_with("delay"), carrier, arr_ok)

# peek at it
arrivals %>%
  filter(arr_ok == 1) %>%
  head
origindestdep_delayarr_delaycarrierarr_ok
PDXIAH13-4UA1
SEAEWR1-6UA1
SEASAN2-12AS1
PDXEWR2-19UA1
SEAIAH13-4UA1
PDXIAD10-4UA1

5.1.6 dplyr::summarise (or dplyr::summarize)

We have seen summarise used to calculate summary statistics, like so:

flights %>%
  summarise(mean(dep_delay, na.rm = TRUE))
mean(dep_delay, na.rm = TRUE)
6.13

But this can get tedious when we want to compute several summary statistics, like so:

# we can also name that variable, and summarise multiple variables
flights %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE),
            sd_delay = sd(dep_delay, na.rm = TRUE),
            median_delay = median(dep_delay, na.rm = TRUE))
mean_delaysd_delaymedian_delay
6.1329.1-2

summarise_each lets us automate this process a bit:

flights %>%
  filter(!is.na(dep_delay)) %>%
  select(dep_delay) %>%
  summarise_each(lst(mean, sd, median))
meansdmedian
6.1329.1-2
# same thing
flights %>%
  filter(!is.na(dep_delay)) %>%
  summarise_each(lst(mean, sd, median), dep_delay)
meansdmedian
6.1329.1-2
# combine with pivot_longer, change names too
flights %>%
  filter(!is.na(dep_delay)) %>%
  summarise_each(lst(mean, stdev = sd, median), dep_delay) %>%
  pivot_longer(everything(), names_to="delay_stat", values_to="value")
delay_statvalue
mean6.13
stdev29.1 
median-2   

We can use aggregating functions in summarise

# how many unique destinations?
summary_table <- flights %>% 
  summarise(tot_flights = n(),
            tot_planes = n_distinct(tailnum),
            tot_carriers = n_distinct(carrier),
            tot_dests = n_distinct(dest),
            tot_origins = n_distinct(origin))

summary_table
tot_flightstot_planestot_carrierstot_deststot_origins
162049302311712

The resulting table is a bit messy; we can reprocess using pivot_longer and separate:

# chain with tidyr functions
summary_table %>% 
  pivot_longer(everything(), names_to="key", values_to="value") %>% 
  separate(key, into = c("tot", "entity")) %>% 
  select(-tot, total = value)

entitytotal
flights162049
planes3023
carriers11
dests71
origins2
Or we can do it in a single call to pivot_longer():

summary_table %>% 
  pivot_longer(
    everything(), # grab all columns
    names_to="entity", # make a new column called "entity" to store the old column names
    names_prefix="tot_", # remove the "tot_" prefix from the old column names
    values_to="total") # put the values in a column called "total"
entitytotal
flights162049
planes3023
carriers11
dests71
origins2

5.2 tidyr

We’ll work with a made up dataframe:

df <- data.frame(
  id = 1:10,
  date = as.Date('2015-01-01') + 0:9,
  q1_m1_w1 = rnorm(10, 0, 1),
  q1_m1_w2 = rnorm(10, 0, 1),
  q1_m2_w3 = rnorm(10, 0, 1),
  q2_m1_w1 = rnorm(10, 0, 1),
  q2_m2_w1 = rnorm(10, 0, 1),
  q2_m2_w2 = rnorm(10, 0, 1)
)
# HLO
head(df)
iddateq1_m1_w1q1_m1_w2q1_m2_w3q2_m1_w1q2_m2_w1q2_m2_w2
12015-01-01-0.313-0.6070.17 -0.2550.1310.26 
22015-01-02-0.808-0.7681.13 1.83 -0.562-0.297
32015-01-03-1.27 -0.165-1.28 0.419-1.23 -0.669
42015-01-04-2.12 -1.04 -0.603-0.19 -1.21 -1.64 
52015-01-050.19 -0.6310.3591.98 -1.1  1.36 
62015-01-060.5390.449-0.704-0.4660.7310.125
glimpse(df)
Rows: 10
Columns: 8
$ id       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ date     <date> 2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05, 2…
$ q1_m1_w1 <dbl> -0.3134072, -0.8077594, -1.2704517, -2.1222188, 0.1895081, 0…
$ q1_m1_w2 <dbl> -0.6066112, -0.7676025, -0.1651904, -1.0356756, -0.6314616, 0…
$ q1_m2_w3 <dbl> 0.17043485, 1.12869876, -1.28033499, -0.60293507, 0.35930413,…
$ q2_m1_w1 <dbl> -0.25450850, 1.82724513, 0.41912644, -0.18991891, 1.98128295,…
$ q2_m2_w1 <dbl> 0.1313910, -0.5619366, -1.2254890, -1.2102771, -1.1036748, 0.…
$ q2_m2_w2 <dbl> 0.26023119, -0.29650726, -0.66912554, -1.63523370, 1.35751588…

5.2.1 tidyr::pivot_longer

First, let’s pivot…

df_tidy <- df %>%
  pivot_longer(q1_m1_w1:q2_m2_w2, names_to="key", values_to="value")
head(df_tidy)
iddatekeyvalue
12015-01-01q1_m1_w1-0.313
12015-01-01q1_m1_w2-0.607
12015-01-01q1_m2_w30.17 
12015-01-01q2_m1_w1-0.255
12015-01-01q2_m2_w10.131
12015-01-01q2_m2_w20.26 

5.2.2 tidyr::separate

# separate 1 col into 3 cols
df_sep <- df_tidy %>%
  separate(key, into = c("quarter", "month", "week"))
head(df_sep)
iddatequartermonthweekvalue
12015-01-01q1m1w1-0.313
12015-01-01q1m1w2-0.607
12015-01-01q1m2w30.17 
12015-01-01q2m1w1-0.255
12015-01-01q2m2w10.131
12015-01-01q2m2w20.26 
# separate 1 col into 2 cols
df_sep2 <- df_tidy %>%
  separate(key, into = c("quarter", "period"), extra = "merge")
head(df_sep2)

iddatequarterperiodvalue
12015-01-01q1m1_w1-0.313
12015-01-01q1m1_w2-0.607
12015-01-01q1m2_w30.17 
12015-01-01q2m1_w1-0.255
12015-01-01q2m2_w10.131
12015-01-01q2m2_w20.26 
stringr vs. tidyr separate by regular expression

5.2.3 tidyr::extract

Extract is essentially the same as separate, let’s see how…

# extract
df_ext <- df_sep2 %>%
  extract(period, into = "month")
head(df_ext)
iddatequartermonthvalue
12015-01-01q1m1-0.313
12015-01-01q1m1-0.607
12015-01-01q1m20.17 
12015-01-01q2m1-0.255
12015-01-01q2m20.131
12015-01-01q2m20.26 

We can see that, by default, it’s taking the first alphanumeric chunk of the value, and extracting it. We can specify much more complex patterns:

# this gives us same output as separate
df_ext <- df_sep2 %>%
  extract(period, into = c("month", "week"), 
          regex = "([[:alnum:]]+)_([[:alnum:]]+)")
head(df_ext)
iddatequartermonthweekvalue
12015-01-01q1m1w1-0.313
12015-01-01q1m1w2-0.607
12015-01-01q1m2w30.17 
12015-01-01q2m1w1-0.255
12015-01-01q2m2w10.131
12015-01-01q2m2w20.26 

5.2.4 tidyr::unite

The unite() function lets us reverse the process:

# let's say we want to combine quarter and month with an underscore
df_uni <- df_sep %>%
  unite(period, quarter:month) # sep = "_" is the default arg
head(df_uni)
iddateperiodweekvalue
12015-01-01q1_m1w1-0.313
12015-01-01q1_m1w2-0.607
12015-01-01q1_m2w30.17 
12015-01-01q2_m1w1-0.255
12015-01-01q2_m2w10.131
12015-01-01q2_m2w20.26 
# let's say we want to combine quarter and month with nothing
df_uni <- df_sep %>%
  unite(period, quarter:month, sep = "")
head(df_uni)
iddateperiodweekvalue
12015-01-01q1m1w1-0.313
12015-01-01q1m1w2-0.607
12015-01-01q1m2w30.17 
12015-01-01q2m1w1-0.255
12015-01-01q2m2w10.131
12015-01-01q2m2w20.26 

5.2.5 tidyr::pivot_wider

# finally let's spread
df_spread <- df_uni %>%
  pivot_wider(names_from=week, values_from=value)  # fill = NA is default arg
head(df_spread)
iddateperiodw1w2w3
12015-01-01q1m1-0.313-0.607   
12015-01-01q1m2        0.17
12015-01-01q2m1-0.255       
12015-01-01q2m20.1310.26    
22015-01-02q1m1-0.808-0.768   
22015-01-02q1m2        1.13

5.2.6 Putting it all together (pivot_longer() %>% separate() %>% pivot_wider())

All in one, if we had wanted to end up with one row per month:

df_tidiest <- df %>% 
  pivot_longer(q1_m1_w1:q2_m2_w2, names_to="key", values_to="value") %>% 
  separate(key, into = c("quarter", "month", "week")) %>% 
  pivot_wider(names_from=week, values_from=value)
head(df_tidiest)
iddatequartermonthw1w2w3
12015-01-01q1m1-0.313-0.607   
12015-01-01q1m2        0.17
12015-01-01q2m1-0.255       
12015-01-01q2m20.1310.26    
22015-01-02q1m1-0.808-0.768   
22015-01-02q1m2        1.13

5.3 broom

“The broom package takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames.” So, broom tidies output from other R functions that are un-tidy.

See here for list of functions: https://github.com/dgrtwo/broom

Vignette: ftp://cran.r-project.org/pub/R/web/packages/broom/vignettes/broom.html

fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear), 
          data = mtcars)

Un-tidy output from lm

summary(fit)

Call:
lm(formula = mpg ~ qsec + factor(am) + wt + factor(gear), data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5064 -1.5220 -0.7517  1.3841  4.6345 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     9.3650     8.3730   1.118  0.27359    
qsec            1.2449     0.3828   3.252  0.00317 ** 
factor(am)1     3.1505     1.9405   1.624  0.11654    
wt             -3.9263     0.7428  -5.286 1.58e-05 ***
factor(gear)4  -0.2682     1.6555  -0.162  0.87257    
factor(gear)5  -0.2697     2.0632  -0.131  0.89698    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.55 on 26 degrees of freedom
Multiple R-squared:  0.8498,    Adjusted R-squared:  0.8209 
F-statistic: 29.43 on 5 and 26 DF,  p-value: 6.379e-10

Tidy output from broom

tidy(fit)
termestimatestd.errorstatisticp.value
(Intercept)9.37 8.37 1.12 0.274   
qsec1.24 0.3833.25 0.00317 
factor(am)13.15 1.94 1.62 0.117   
wt-3.93 0.743-5.29 1.58e-05
factor(gear)4-0.2681.66 -0.1620.873   
factor(gear)5-0.27 2.06 -0.1310.897   

6 Specialized Packages

6.1 gt

gt is a relatively new package from RStudio designed for publication-ready figures.

library(gt)

Let’s set up a tiny little data table…

prepped_flight_data <- flights %>% filter(dest %in% c("DEN", "DCA", "JFK", "SFO")) %>% 
  group_by(dest) %>% 
  mutate_at(vars(dep_delay), replace_na, replace=0.0) %>%  
  summarise(mean_delay=mean(dep_delay, na.rm=TRUE), median_delay=median(dep_delay, na.rm=TRUE))
glimpse(prepped_flight_data)
Rows: 4
Columns: 3
$ dest         <chr> "DCA", "DEN", "JFK", "SFO"
$ mean_delay   <dbl> 1.475138, 8.753835, 9.235996, 12.996799
$ median_delay <dbl> -3, -1, -1, -1

… and now make a table out of it:

gt_formatted <- prepped_flight_data %>%  gt() %>% 
  tab_header(title="PDX Departure Delays, DCA/DEN/JFK/SFO", subtitle="(Delay in minutes)") %>% 
  fmt_number(columns=vars(mean_delay, median_delay), decimals=2) %>% 
  cols_label(dest="Destination", mean_delay="Mean", median_delay="Median") 
gt_formatted 
PDX Departure Delays, DCA/DEN/JFK/SFO
(Delay in minutes)
Destination Mean Median
DCA 1.48 −3.00
DEN 8.75 −1.00
JFK 9.24 −1.00
SFO 13.00 −1.00

We can adjust things like text alignment after-the-fact:

gt_formatted <- gt_formatted %>% 
  cols_align(align="center", columns=vars(dest)) %>%
  tab_style(
    cell_text(align = "right"), 
    locations = cells_title(groups = c("subtitle"))
  ) %>% 
  tab_style(
    cell_text(style = "italic"),
    locations = cells_title(groups=c("title"))
  )
gt_formatted
PDX Departure Delays, DCA/DEN/JFK/SFO
(Delay in minutes)
Destination Mean Median
DCA 1.48 −3.00
DEN 8.75 −1.00
JFK 9.24 −1.00
SFO 13.00 −1.00

Tip: In addition to cells_title(), there are helpers to select any of the other cell types (e.g. cells_column_labels(), etc.).

Now we can turn it into whatever we need! As Latex:

gt_formatted %>% gtsave("my_table.tex")

As RTF, to put in Word:

gt_formatted %>% gtsave("my_table.rtf")

6.2 tableone

tableone is for generating “Table 1” in your paper. You know, the “Subject Characteristics” table- lots of boring summary statistics.

Vignette: https://cran.r-project.org/web/packages/tableone/vignettes/introduction.html

library(tableone)

By default, its output is probably not what we want (though we can see hints of it down at the bottom)…

CreateTableOne(data = mazes)
                          
                           Overall       
  n                           381        
  study_id (%)                           
     CSLU-001                   4 ( 1.0) 
     CSLU-002                   4 ( 1.0) 
     CSLU-007                   4 ( 1.0) 
     CSLU-010                   4 ( 1.0) 
     CSLU-020                   4 ( 1.0) 
     CSLU-024                   4 ( 1.0) 
     CSLU-027                   4 ( 1.0) 
     CSLU-031                   4 ( 1.0) 
     CSLU-036                   3 ( 0.8) 
     CSLU-046                   4 ( 1.0) 
     CSLU-053                   4 ( 1.0) 
     CSLU-054                   4 ( 1.0) 
     CSLU-059                   4 ( 1.0) 
     CSLU-062                   4 ( 1.0) 
     CSLU-066                   4 ( 1.0) 
     CSLU-073                   4 ( 1.0) 
     CSLU-077                   4 ( 1.0) 
     CSLU-080                   4 ( 1.0) 
     CSLU-082                   3 ( 0.8) 
     CSLU-084                   4 ( 1.0) 
     CSLU-089                   4 ( 1.0) 
     CSLU-095                   3 ( 0.8) 
     CSLU-096                   4 ( 1.0) 
     CSLU-101                   4 ( 1.0) 
     CSLU-104                   4 ( 1.0) 
     CSLU-112                   4 ( 1.0) 
     CSLU-117                   4 ( 1.0) 
     CSLU-119                   4 ( 1.0) 
     CSLU-122                   4 ( 1.0) 
     CSLU-124                   3 ( 0.8) 
     CSLU-142                   4 ( 1.0) 
     CSLU-144                   4 ( 1.0) 
     CSLU-146                   4 ( 1.0) 
     CSLU-154                   4 ( 1.0) 
     CSLU-156                   4 ( 1.0) 
     CSLU-161                   4 ( 1.0) 
     CSLU-163                   4 ( 1.0) 
     CSLU-165                   4 ( 1.0) 
     CSLU-167                   4 ( 1.0) 
     CSLU-180                   4 ( 1.0) 
     CSLU-191                   4 ( 1.0) 
     CSLU-203                   4 ( 1.0) 
     CSLU-204                   4 ( 1.0) 
     CSLU-213                   4 ( 1.0) 
     CSLU-216                   4 ( 1.0) 
     CSLU-220                   4 ( 1.0) 
     CSLU-226                   4 ( 1.0) 
     CSLU-233                   4 ( 1.0) 
     CSLU-238                   4 ( 1.0) 
     CSLU-245                   4 ( 1.0) 
     CSLU-258                   4 ( 1.0) 
     CSLU-259                   4 ( 1.0) 
     CSLU-263                   4 ( 1.0) 
     CSLU-269                   4 ( 1.0) 
     CSLU-274                   4 ( 1.0) 
     CSLU-275                   4 ( 1.0) 
     CSLU-277                   4 ( 1.0) 
     CSLU-284                   4 ( 1.0) 
     CSLU-290                   4 ( 1.0) 
     CSLU-303                   4 ( 1.0) 
     CSLU-306                   4 ( 1.0) 
     CSLU-312                   4 ( 1.0) 
     CSLU-315                   4 ( 1.0) 
     CSLU-316                   4 ( 1.0) 
     CSLU-320                   4 ( 1.0) 
     CSLU-324                   4 ( 1.0) 
     CSLU-332                   4 ( 1.0) 
     CSLU-335                   4 ( 1.0) 
     CSLU-339                   4 ( 1.0) 
     CSLU-348                   4 ( 1.0) 
     CSLU-349                   4 ( 1.0) 
     CSLU-355                   4 ( 1.0) 
     CSLU-359                   4 ( 1.0) 
     CSLU-372                   4 ( 1.0) 
     CSLU-373                   4 ( 1.0) 
     CSLU-375                   4 ( 1.0) 
     CSLU-379                   4 ( 1.0) 
     CSLU-388                   2 ( 0.5) 
     CSLU-389                   4 ( 1.0) 
     CSLU-393                   4 ( 1.0) 
     CSLU-395                   4 ( 1.0) 
     CSLU-417                   4 ( 1.0) 
     CSLU-419                   4 ( 1.0) 
     CSLU-427                   4 ( 1.0) 
     CSLU-432                   3 ( 0.8) 
     CSLU-435                   4 ( 1.0) 
     CSLU-441                   4 ( 1.0) 
     CSLU-442                   4 ( 1.0) 
     CSLU-447                   4 ( 1.0) 
     CSLU-454                   4 ( 1.0) 
     CSLU-460                   4 ( 1.0) 
     CSLU-470                   4 ( 1.0) 
     CSLU-472                   4 ( 1.0) 
     CSLU-477                   4 ( 1.0) 
     CSLU-482                   4 ( 1.0) 
     CSLU-486                   4 ( 1.0) 
     CSLU-499                   4 ( 1.0) 
  ca (mean (SD))             6.83 (1.06) 
  viq (mean (SD))          100.82 (18.74)
  dx (%)                                 
     ASD                      183 (48.0) 
     SLI                       71 (18.6) 
     TD                       127 (33.3) 
  activity (%)                           
     Conversation              94 (24.7) 
     Picture Description       94 (24.7) 
     Play                      96 (25.2) 
     Wordless Picture Book     97 (25.5) 
  content (mean (SD))       18.73 (24.84)
  filler (mean (SD))        11.20 (17.59)
  rep (mean (SD))            6.24 (9.45) 
  rev (mean (SD))            3.79 (4.31) 
  fs (mean (SD))             8.70 (12.76)
  cued (mean (SD))          14.36 (24.22)
  not_cued (mean (SD))      26.77 (31.73)

We need to tell it a bit about what we are looking for:

## Vector of variables to summarize
my_num_vars <- c("ca", "viq", "content", "filler", "rep", "rev", "fs", "cued", "not_cued")

## Vector of categorical variables that need transformation
my_cat_vars <- c("dx", "activity")

## Create a TableOne object
tab2 <- CreateTableOne(vars = my_num_vars, data = mazes, factorVars = my_cat_vars)
print(tab2, showAllLevels = TRUE)
                      
                       level Overall       
  n                             381        
  ca (mean (SD))               6.83 (1.06) 
  viq (mean (SD))            100.82 (18.74)
  content (mean (SD))         18.73 (24.84)
  filler (mean (SD))          11.20 (17.59)
  rep (mean (SD))              6.24 (9.45) 
  rev (mean (SD))              3.79 (4.31) 
  fs (mean (SD))               8.70 (12.76)
  cued (mean (SD))            14.36 (24.22)
  not_cued (mean (SD))        26.77 (31.73)

If we want the summary statistics to be computed on a per-strata basis, we can ask for that:

# Another 
tab3 <- CreateTableOne(vars = my_num_vars, strata = "dx" , data = mazes)
tab3
                      Stratified by dx
                       ASD           SLI           TD             p      test
  n                      183            71            127                    
  ca (mean (SD))        6.74 (1.11)   7.15 (1.00)    6.76 (0.97)   0.015     
  viq (mean (SD))      95.29 (17.62) 86.24 (5.95)  116.94 (12.82) <0.001     
  content (mean (SD))  20.46 (29.73) 17.34 (24.35)  17.00 (15.67)  0.422     
  filler (mean (SD))    7.86 (13.54) 10.56 (16.35)  16.38 (21.84) <0.001     
  rep (mean (SD))       7.25 (11.82)  5.45 (6.86)    5.23 (6.21)   0.134     
  rev (mean (SD))       3.87 (4.85)   3.25 (3.55)    3.98 (3.85)   0.498     
  fs (mean (SD))        9.35 (14.60)  8.63 (15.00)   7.80 (7.55)   0.574     
  cued (mean (SD))     10.66 (21.94) 13.21 (22.54)  20.35 (27.10)  0.002     
  not_cued (mean (SD)) 25.52 (33.49) 25.25 (31.84)  29.41 (29.04)  0.517     

6.3 The DT package

DT lets us produce interactive data tables for use in HTML reports that go far beyond kable’s capabilities.

An excellent tutorial on DT is available at https://rstudio.github.io/DT/.

“Out of the box”, datatable gives us a very usable table:

datatable(mazes)

6.4 pixiedust (best for PDF)

Pixiedust is a ggplot-inspired table package that includes a lot of flexibility and works well for PDF-based output. We can illustrate its use by making a regression table.

Remember that broom package we used earlier? Here is what the model parameters look like; this is obviously not what a publicatio-ready version of the table would look like:

tidy(fit)

termestimatestd.errorstatisticp.value
(Intercept)9.37 8.37 1.12 0.274   
qsec1.24 0.3833.25 0.00317 
factor(am)13.15 1.94 1.62 0.117   
wt-3.93 0.743-5.29 1.58e-05
factor(gear)4-0.2681.66 -0.1620.873   
factor(gear)5-0.27 2.06 -0.1310.897   
Pixiedust gives us many controls for things like column names, row values, formatting for columns, etc.:

dust(fit) %>% 
  sprinkle(cols = "term", 
           replace = c("Intercept", "Quarter Mile Time", "Automatic vs. Manual",
                       "Weight", "Gears: 4 vs. 3", "Gears: 5 vs 3")) %>%
  sprinkle(cols = c("estimate", "std.error", "statistic"),
           round = 3) %>% 
  sprinkle(cols = "p.value", fn = quote(pvalString(value))) %>% 
  sprinkle_colnames("Term", "Coefficient", "SE", "T-statistic", "P-value")
Term Coefficient SE T-statistic P-value
Intercept 9.365 8.373 1.118 0.27
Quarter Mile Time 1.245 0.383 3.252 0.003
Automatic vs. Manual 3.151 1.941 1.624 0.12
Weight -3.926 0.743 -5.286 < 0.001
Gears: 4 vs. 3 -0.268 1.655 -0.162 0.87
Gears: 5 vs 3 -0.27 2.063 -0.131 0.9


For more information about Pixiedust, consult:

7 Finally, fonts!

7.1 Your friend, extrafont

The extrafont package is the best place to start with fonts in R and ggplot.

library(extrafont)

Note: Make sure to follow all installation instructions from github!

You can access fonts on your local system in ggplot by using the theme() function to set the relevant properties of your figure.

demo_plt <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color=Species)) + 
  geom_point(size=3) +
  scale_color_discrete(labels=c("Setosa", "Versicolor", "Virginica")) + 
  labs(x="Sepal Length (cm)", y="Sepal Width (cm)", title="Fisher's Irises") 
demo_plt

In ggplot, the theme() function lets us control the appearance of various aspects of the plot. There are many, many properties we can tweak; we will begin by setting the global text property for all text on the entire plot.

Let’s try changing that font a bit:

demo_plt + theme(text=element_text(family="Optima"))

demo_plt + theme(text=element_text(family="Palatino"))

demo_plt + theme(text=element_text(family="Zapfino"))

Note that the specific name that you must use to refer to the font is not exactly obvious. It is not the font name, but rather the font’s “family name”. Because of some under-the-hood details about the way that fonts work, this is a bit different from what you might see in e.g. a font selection menu in Word. You can see a list of all the fonts that R knows about by using the fonttable() command:

fonttable() %>% glimpse()
Rows: 149
Columns: 10
$ package    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ afmfile    <chr> "Keyboard.afm.gz", "SFNSDisplay.afm.gz", "SFNSText.afm.gz",…
$ fontfile   <chr> "/System/Library/Fonts/Keyboard.ttf", "/System/Library/Font…
$ FullName   <chr> ".Keyboard", "System Font", "System Font", "System Font Ita…
$ FamilyName <chr> ".Keyboard", "System Font", "System Font", "System Font", "…
$ FontName   <chr> "-Keyboard", "-SFNSDisplay", "-SFNSText", "-SFNSText-Italic…
$ Bold       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ Italic     <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALS…
$ Symbol     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ afmsymfile <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
fonttable() %>% head()
packageafmfilefontfileFullNameFamilyNameFontNameBoldItalicSymbolafmsymfile
Keyboard.afm.gz/System/Library/Fonts/Keyboard.ttf.Keyboard.Keyboard-KeyboardFALSEFALSEFALSE
SFNSDisplay.afm.gz/System/Library/Fonts/SFNSDisplay.ttfSystem FontSystem Font-SFNSDisplayFALSEFALSEFALSE
SFNSText.afm.gz/System/Library/Fonts/SFNSText.ttfSystem FontSystem Font-SFNSTextFALSEFALSEFALSE
SFNSTextItalic.afm.gz/System/Library/Fonts/SFNSTextItalic.ttfSystem Font ItalicSystem Font-SFNSText-ItalicFALSETRUEFALSE
Andale Mono.afm.gz/Library/Fonts/Andale Mono.ttfAndale MonoAndale MonoAndaleMonoFALSEFALSEFALSE
Apple Braille.afm.gz/System/Library/Fonts/Apple Braille.ttfApple BrailleApple BrailleAppleBrailleFALSEFALSEFALSE
fonttable() %>% select(FullName, FamilyName, FontName) %>% head(20)
FullNameFamilyNameFontName
.Keyboard.Keyboard-Keyboard
System FontSystem Font-SFNSDisplay
System FontSystem Font-SFNSText
System Font ItalicSystem Font-SFNSText-Italic
Andale MonoAndale MonoAndaleMono
Apple BrailleApple BrailleAppleBraille
Apple Braille Outline 6 DotApple BrailleAppleBraille-Outline6Dot
Apple Braille Outline 8 DotApple BrailleAppleBraille-Outline8Dot
Apple Braille Pinpoint 6 DotApple BrailleAppleBraille-Pinpoint6Dot
Apple Braille Pinpoint 8 DotApple BrailleAppleBraille-Pinpoint8Dot
AppleMyungjo RegularAppleMyungjoAppleMyungjo
Arial BlackArial BlackArial-Black
Arial Bold ItalicArialArial-BoldItalicMT
Arial BoldArialArial-BoldMT
Arial ItalicArialArial-ItalicMT
ArialArialArialMT
Arial NarrowArial NarrowArialNarrow
Arial Narrow BoldArial NarrowArialNarrow-Bold
Arial Narrow Bold ItalicArial NarrowArialNarrow-BoldItalic
Arial Narrow ItalicArial NarrowArialNarrow-Italic
fonttable() %>% select(FamilyName) %>% distinct()
FamilyName
.Keyboard
System Font
Andale Mono
Apple Braille
AppleMyungjo
Arial Black
Arial
Arial Narrow
Arial Rounded MT Bold
Arial Unicode MS
Bitter
Bodoni Ornaments
Bodoni 72 Smallcaps
Brush Script MT
Calibri
Calibri Light
Comic Sans MS
Consolas
Courier New
DIN Alternate
DIN Condensed
Fira Sans Black
Fira Sans
Fira Sans ExtraBold
Fira Sans ExtraLight
Fira Sans Light
Fira Sans Medium
Fira Sans SemiBold
Fira Sans Thin
Fondamento
Georgia
Gorni
Impact
Karla
Khmer Sangam MN
Lao Sangam MN
Lato Black
Lato
Lato Hairline
Lato Light
Luminari
Microsoft Sans Serif
Noto Sans
Noto Serif
Open Sans
Open Sans Extrabold
Open Sans Light
Open Sans Semibold
SILDoulos IPA93
SILManuscript IPA93
SILSophia IPA93
Tahoma
Tangerine
Times New Roman
Trattatello
Trebuchet MS
Verdana
Webdings
WeePeople
Wingdings
Wingdings 2
Wingdings 3
.SF NS Rounded
Harding Text Web Bold
Harding Text Web Regular

Long story short, you’ll want to use the value in FamilyName to refer to your font of interest.

What if we wanted to have a different font for the title? The plot.title property is where we would want to look:

demo_plt + theme(
  text=element_text(family="Palatino"),
  plot.title=element_text(family="Zapfino")
  )

We can follow this pattern to change anything we like about the text in different parts of the plot:

demo_plt + theme(
  text=element_text(family="Palatino"),
  plot.title=element_text(family="Zapfino", size=8, color="blue"),
  axis.title.x = element_text(family="Optima", size=24)
  )

7.2 Installing new fonts

This part is a bit beyond the scope of this class, but the upshot is that after you install new fonts in your computer, you should tell extrafont about it using the font_import() command.

7.2.1 Caveats

extrafont only knows about TrueType fonts (ones whose file ends in .ttf). If you have an OpenType font you want to use with R, your best bet probably involves the showtext package.

Creative Commons License