1 Goals for Lab 05

Become familiar with tools for generating publication-ready tables directly in R.

We will use data from the following paper: MacFarlane, H., Gorman, K., Ingham, R., Presmanes Hill, A., Papadakis, K., Kiss, G., & van Santen, J. (2017). Quantitative analysis of disfluency in children with autism spectrum disorder or language impairment. PLoS ONE, 12(3), e0173936.

mazes <- read_csv("http://bit.ly/mazes-gist") %>% 
  clean_names()  #janitor package

glimpse(mazes)

Rows: 381
Columns: 12
$ study_id <chr> "CSLU-001", "CSLU-001", "CSLU-001", "CSLU-001", "CSLU-002", …
$ ca       <dbl> 5.6667, 5.6667, 5.6667, 5.6667, 6.5000, 6.5000, 6.5000, 6.50…
$ viq      <dbl> 124, 124, 124, 124, 124, 124, 124, 124, 108, 108, 108, 108, …
$ dx       <chr> "TD", "TD", "TD", "TD", "TD", "TD", "TD", "TD", "TD", "TD", …
$ activity <chr> "Conversation", "Picture Description", "Play", "Wordless Pic…
$ content  <dbl> 24, 1, 21, 8, 3, 5, 8, 2, 25, 10, 2, 5, 32, 20, 13, 21, 27, …
$ filler   <dbl> 31, 2, 6, 2, 10, 3, 8, 2, 21, 13, 10, 2, 12, 9, 4, 4, 12, 6,…
$ rep      <dbl> 2, 0, 3, 0, 3, 2, 3, 0, 4, 0, 0, 0, 13, 5, 5, 6, 10, 5, 5, 2…
$ rev      <dbl> 5, 0, 8, 4, 0, 1, 2, 0, 4, 2, 1, 3, 8, 7, 2, 8, 5, 1, 4, 2, …
$ fs       <dbl> 17, 1, 10, 4, 0, 2, 3, 2, 17, 8, 1, 2, 11, 8, 6, 7, 12, 3, 3…
$ cued     <dbl> 36, 2, 6, 2, 10, 3, 9, 2, 29, 13, 11, 2, 14, 12, 4, 11, 17, …
$ not_cued <dbl> 50, 3, 27, 10, 13, 8, 15, 4, 38, 23, 11, 7, 42, 26, 17, 18, …

2 TL;DR

The workhorse for making tables in R Markdown documents is the knitr package’s kable function. This function is really versatile, but also free of fancy formatting options, for better or worse.

3 `knitr::kable`

3.1 `kable` all tables everywhere

Update the YAML of your document. For HTML:

---
title: "My Awesome Data Vis Lab"
output:
  html_document:
    df_print: kable
---

You can also define the html format in the global options.

# If you don't define format here, you'll need put `format = "html"` in every kable function.
options(knitr.table.format = "html") 
# You may also wish to set this option
options(scipen = 1, digits = 2)

3.2 `kable` table in a chunk

For HTML:

head(mazes) %>% 
  kable(format = "html")

study_id	ca	viq	dx	activity	content	filler	rep	rev	fs	cued	not_cued
CSLU-001	5.6667	124	TD	Conversation	24	31	2	5	17	36	50
CSLU-001	5.6667	124	TD	Picture Description	1	2	0	0	1	2	3
CSLU-001	5.6667	124	TD	Play	21	6	3	8	10	6	27
CSLU-001	5.6667	124	TD	Wordless Picture Book	8	2	0	4	4	2	10
CSLU-002	6.5000	124	TD	Conversation	3	10	3	0	0	10	13
CSLU-002	6.5000	124	TD	Picture Description	5	3	2	1	2	3	8

head(mazes) %>% 
  kable(format = "html", digits = 2, caption = "A table produced by kable.")

A table produced by kable.
study_id	ca	viq	dx	activity	content	filler	rep	rev	fs	cued	not_cued
CSLU-001	5.67	124	TD	Conversation	24	31	2	5	17	36	50
CSLU-001	5.67	124	TD	Picture Description	1	2	0	0	1	2	3
CSLU-001	5.67	124	TD	Play	21	6	3	8	10	6	27
CSLU-001	5.67	124	TD	Wordless Picture Book	8	2	0	4	4	2	10
CSLU-002	6.50	124	TD	Conversation	3	10	3	0	0	10	13
CSLU-002	6.50	124	TD	Picture Description	5	3	2	1	2	3	8

my_maze_names <- c("Participant", "Age", "Verbal\nIQ", "Group", "Activity", "Content\nMaze", "Filler\nMaze", "Repetition", "Revision", "False\nStart", "Cued", "Not\nCued")
head(mazes) %>% 
  kable(format = "html", digits = 2, caption = "A table produced by kable.",
        col.names = my_maze_names)

A table produced by kable.
Participant	Age	Verbal IQ	Group	Activity	Content Maze	Filler Maze	Repetition	Revision	False Start	Cued	Not Cued
CSLU-001	5.67	124	TD	Conversation	24	31	2	5	17	36	50
CSLU-001	5.67	124	TD	Picture Description	1	2	0	0	1	2	3
CSLU-001	5.67	124	TD	Play	21	6	3	8	10	6	27
CSLU-001	5.67	124	TD	Wordless Picture Book	8	2	0	4	4	2	10
CSLU-002	6.50	124	TD	Conversation	3	10	3	0	0	10	13
CSLU-002	6.50	124	TD	Picture Description	5	3	2	1	2	3	8

3.3 Styled `kable` tables in a chunk

Solution: apply some Bootstrap CSS styling using the kableExtra package.

head(mazes) %>% 
  kable(format = "html", digits = 2, caption = "A styled kable table.",
        col.names = my_maze_names) %>% 
  kable_styling()

A styled kable table.
Participant	Age	Verbal IQ	Group	Activity	Content Maze	Filler Maze	Repetition	Revision	False Start	Cued	Not Cued
CSLU-001	5.67	124	TD	Conversation	24	31	2	5	17	36	50
CSLU-001	5.67	124	TD	Picture Description	1	2	0	0	1	2	3
CSLU-001	5.67	124	TD	Play	21	6	3	8	10	6	27
CSLU-001	5.67	124	TD	Wordless Picture Book	8	2	0	4	4	2	10
CSLU-002	6.50	124	TD	Conversation	3	10	3	0	0	10	13
CSLU-002	6.50	124	TD	Picture Description	5	3	2	1	2	3	8

Lots of printing options: https://haozhu233.github.io/kableExtra/awesome_table_in_html.html

head(mazes) %>% 
  kable(format = "html", digits = 2, caption = "A non-full width zebra kable table.") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)

A non-full width zebra kable table.
study_id	ca	viq	dx	activity	content	filler	rep	rev	fs	cued	not_cued
CSLU-001	5.67	124	TD	Conversation	24	31	2	5	17	36	50
CSLU-001	5.67	124	TD	Picture Description	1	2	0	0	1	2	3
CSLU-001	5.67	124	TD	Play	21	6	3	8	10	6	27
CSLU-001	5.67	124	TD	Wordless Picture Book	8	2	0	4	4	2	10
CSLU-002	6.50	124	TD	Conversation	3	10	3	0	0	10	13
CSLU-002	6.50	124	TD	Picture Description	5	3	2	1	2	3	8

head(mazes) %>% 
  kable(format = "html", digits = 2, caption = "Over here!") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

Over here!
study_id	ca	viq	dx	activity	content	filler	rep	rev	fs	cued	not_cued
CSLU-001	5.67	124	TD	Conversation	24	31	2	5	17	36	50
CSLU-001	5.67	124	TD	Picture Description	1	2	0	0	1	2	3
CSLU-001	5.67	124	TD	Play	21	6	3	8	10	6	27
CSLU-001	5.67	124	TD	Wordless Picture Book	8	2	0	4	4	2	10
CSLU-002	6.50	124	TD	Conversation	3	10	3	0	0	10	13
CSLU-002	6.50	124	TD	Picture Description	5	3	2	1	2	3	8

3.4 `kable` + `kableExtra` + `formattable`

color_tile and color_bar are neat extras if used wisely!

http://haozhu233.github.io/kableExtra/use_kableExtra_with_formattable.html

library(formattable)
head(mazes) %>% 
  mutate(ca = color_tile("transparent", "lightpink")(ca),
         viq = color_bar("lightseagreen")(viq)) %>%
  kable("html", escape = F, caption = 'This table is colored.') %>%
  kable_styling(position = "center") %>% 
  column_spec(4, width = "3cm")

This table is colored.
study_id	ca	viq	dx	activity	content	filler	rep	rev	fs	cued	not_cued
CSLU-001	5.6667	124	TD	Conversation	24	31	2	5	17	36	50
CSLU-001	5.6667	124	TD	Picture Description	1	2	0	0	1	2	3
CSLU-001	5.6667	124	TD	Play	21	6	3	8	10	6	27
CSLU-001	5.6667	124	TD	Wordless Picture Book	8	2	0	4	4	2	10
CSLU-002	6.5000	124	TD	Conversation	3	10	3	0	0	10	13
CSLU-002	6.5000	124	TD	Picture Description	5	3	2	1	2	3	8

3.5 `tibble` + `kable` + `kableExtra`

You can also use any of these tools with plain text tables using the tibble package to create a table. Two main functions:

tribble: enter tibble by rows
tibble: enter tibble by columns

For example, I used tribble to make this table in our slide decks:

math_table <- tibble::tribble(
  ~Operator, ~Description, ~Usage,
  "\\+", "addition", "x + y",
  "\\-", "subtraction", "x - y",
  "\\*", "multiplication", "x * y",
  "/", "division", "x / y",
  "^", "raised to the power of", "x ^ y",
  "abs", "absolute value", "abs(x)",
  "%/%", "integer division", "x %/% y",
  "%%", "remainder after division", "x %% y"
)

Then I used this chunk to print it:

```{r, results = 'asis'}
knitr::kable(math_table, format = "html", caption = "Helpful mutate functions") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
```

knitr::kable(math_table, format = "html", caption = "Helpful mutate functions") %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

Helpful mutate functions
Operator	Description	Usage
+	addition	x + y
-	subtraction	x - y
*	multiplication	x * y
/	division	x / y
^	raised to the power of	x ^ y
abs	absolute value	abs(x)
%/%	integer division	x %/% y
%%	remainder after division	x %% y

4 Markdown Tables

Sometimes you may just want to type in a table in Markdown and ignore R. Four kinds of tables may be used. The first three kinds presuppose the use of a fixed-width font, such as Courier. The fourth kind can be used with proportionally spaced fonts, as it does not require lining up columns. All of the below will render when typed outside of an R code chunk since these are based on pandoc being used to render your markdown document. Note that these should all work whether you are knitting to either html or PDF.

4.1 Simple table

This code for a simple table:

  Right     Left     Center     Default
-------     ------ ----------   -------
     12     12        12            12
    123     123       123          123
      1     1          1             1

Table:  Demonstration of simple table syntax.

Produces this simple table:

Demonstration of simple table syntax.
Right	Left	Center	Default
12	12	12	12
123	123	123	123
1	1	1	1

The headers and table rows must each fit on one line. Column alignments are determined by the position of the header text relative to the dashed line below it:3

If the dashed line is flush with the header text on the right side but extends beyond it on the left, the column is right-aligned.
If the dashed line is flush with the header text on the left side but extends beyond it on the right, the column is left-aligned.
If the dashed line extends beyond the header text on both sides, the column is centered.
If the dashed line is flush with the header text on both sides, the default alignment is used (in most cases, this will be left).
The table must end with a blank line, or a line of dashes followed by a blank line.

The column headers may be omitted, provided a dashed line is used to end the table.

4.2 Multi-line tables

This code for a multi-line table:

-------------------------------------------------------------
 Centered   Default           Right Left
  Header    Aligned         Aligned Aligned
----------- ------- --------------- -------------------------
   First    row                12.0 Example of a row that
                                    spans multiple lines.

  Second    row                 5.0 Here's another one. Note
                                    the blank line between
                                    rows.
-------------------------------------------------------------

Table: Here's the caption. It, too, may span
multiple lines.

Produces this multi-line table:

Here’s the caption. It, too, may span multiple lines.
Centered Header	Default Aligned	Right Aligned	Left Aligned
First	row	12.0	Example of a row that spans multiple lines.
Second	row	5.0	Here’s another one. Note the blank line between rows.

4.3 Grid tables

This code for a grid table:

: Sample grid table.

+---------------+---------------+--------------------+
| Fruit         | Price         | Advantages         |
+===============+===============+====================+
| Bananas       | $1.34         | - built-in wrapper |
|               |               | - bright color     |
+---------------+---------------+--------------------+
| Oranges       | $2.10         | - cures scurvy     |
|               |               | - tasty            |
+---------------+---------------+--------------------+

Produces this grid table:

Sample grid table.
Fruit	Price	Advantages
Bananas	$1.34	built-in wrapper bright color
Oranges	$2.10	cures scurvy tasty

Alignments are not supported, nor are cells that span multiple columns or rows.

4.4 Pipe tables

This code for a pipe table:

| Right | Left | Default | Center |
|------:|:-----|---------|:------:|
|   12  |  12  |    12   |    12  |
|  123  |  123 |   123   |   123  |
|    1  |    1 |     1   |     1  |

  : Demonstration of pipe table syntax.

Produces this pipe table:

Demonstration of pipe table syntax.
Right	Left	Default	Center
12	12	12	12
123	123	123	123
1	1	1	1

5 Making tables in R

If you want to make tables that include R output (like output from functions like means, variances, or output from models), there are two steps:

Get the numbers you need in tabular format; then
Render that information in an aesthetically-pleasing way.

This section covers (1). But, although there are some nice options for (2) within R Markdown via various packages, I am not dogmatic about doing everything in R Markdown, especially things like (2).

5.1 `dplyr`

We’ll use the pnwflights14 package to practice our dplyr skills. We need to download the package from github using devtools.

# once per machine
install.packages("devtools")
devtools::install_github("ismayc/pnwflights14")

Now, we need to load the flights dataset from the pnwflights14 package.

# once per work session
data("flights", package = "pnwflights14")

5.1.1 `dplyr::select`

Use select to specify which columns in a dataframe you’d like to keep by name. Heretofore, this was not possible in base R! In base R, this can only be achieved using numeric variable positions. But most of the time, you keep track of your variables by name (like carrier) rather than position (the 8th column).

# keep these 2 cols
mini_flights <- flights %>% 
  select(carrier, flight)
glimpse(mini_flights)

Rows: 162,049
Columns: 2
$ carrier <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA", "…
$ flight  <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569, …

# keep first five cols
first_five <- flights %>% 
  select(year, month, day, dep_time, dep_delay)
glimpse(first_five)

Rows: 162,049
Columns: 5
$ year      <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014,…
$ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557…
$ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2…

# alternatively, specify range
first_five <- flights %>% 
  select(year:dep_delay)
glimpse(first_five)

Rows: 162,049
Columns: 5
$ year      <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014,…
$ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557…
$ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2…

We can also choose the columns we want by negation, that is, you can specify which columns to drop instead of keep. This way, all variables not listed are kept.

# we can also use negation
all_but_year <- flights %>% 
  select(-year)
glimpse(all_but_year)

Rows: 162,049
Columns: 15
$ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557…
$ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2…
$ arr_time  <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 90…
$ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16,…
$ carrier   <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA",…
$ tailnum   <chr> "N508AS", "N195UW", "N37422", "N547UW", "N762AS", "N806DN",…
$ flight    <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569…
$ origin    <chr> "PDX", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "PDX", "SE…
$ dest      <chr> "ANC", "CLT", "IAH", "CLT", "ANC", "DTW", "ORD", "IAH", "DE…
$ air_time  <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122,…
$ distance  <dbl> 1542, 2279, 1825, 2282, 1448, 1927, 1721, 1825, 1024, 2402,…
$ hour      <dbl> 0, 0, 0, 0, 0, 0, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6,…
$ minute    <dbl> 1, 4, 8, 28, 34, 37, 46, 26, 27, 36, 41, 49, 50, 57, 57, 58…

dplyr::select comes with several other helper functions…

depart <- flights %>%
  select(starts_with("dep_"))
glimpse(depart)

Rows: 162,049
Columns: 2
$ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557…
$ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2…

times <- flights %>%
  select(contains("time"))
glimpse(times)

Rows: 162,049
Columns: 3
$ dep_time <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557,…
$ arr_time <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 907…
$ air_time <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122, …

# note that we are not actually saving the new dataframe here
flights %>%
  select(-contains("time")) %>% head()

year	month	day	dep_delay	arr_delay	carrier	tailnum	flight	origin	dest	distance	hour	minute
2014	1	1	96	70	AS	N508AS	145	PDX	ANC	1.54e+03	0	1
2014	1	1	-6	-23	US	N195UW	1830	SEA	CLT	2.28e+03	0	4
2014	1	1	13	-4	UA	N37422	1609	PDX	IAH	1.82e+03	0	8
2014	1	1	-2	-23	US	N547UW	466	PDX	CLT	2.28e+03	0	28
2014	1	1	44	43	AS	N762AS	121	SEA	ANC	1.45e+03	0	34
2014	1	1	82	88	DL	N806DN	1823	SEA	DTW	1.93e+03	0	37

delays <- flights %>%
  select(ends_with("delay"))
glimpse(delays)

Rows: 162,049
Columns: 2
$ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2…
$ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16,…

One of my favorite select helper functions is everything(), which allows you to use select to keep all your variables, but easily rearrange the columns without having to list all the variables to keep/drop.

new_order <- flights %>% 
  select(origin, dest, everything())
head(new_order)

origin	dest	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	air_time	distance	hour	minute
PDX	ANC	2014	1	1	1	96	235	70	AS	N508AS	145	194	1.54e+03	0	1
SEA	CLT	2014	1	1	4	-6	738	-23	US	N195UW	1830	252	2.28e+03	0	4
PDX	IAH	2014	1	1	8	13	548	-4	UA	N37422	1609	201	1.82e+03	0	8
PDX	CLT	2014	1	1	28	-2	800	-23	US	N547UW	466	251	2.28e+03	0	28
SEA	ANC	2014	1	1	34	44	325	43	AS	N762AS	121	201	1.45e+03	0	34
SEA	DTW	2014	1	1	37	82	747	88	DL	N806DN	1823	224	1.93e+03	0	37

# with negation
new_order2 <- flights %>% 
  select(origin, dest, everything(), -year)
head(new_order2)

origin	dest	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	air_time	distance	hour	minute
PDX	ANC	1	1	1	96	235	70	AS	N508AS	145	194	1.54e+03	0	1
SEA	CLT	1	1	4	-6	738	-23	US	N195UW	1830	252	2.28e+03	0	4
PDX	IAH	1	1	8	13	548	-4	UA	N37422	1609	201	1.82e+03	0	8
PDX	CLT	1	1	28	-2	800	-23	US	N547UW	466	251	2.28e+03	0	28
SEA	ANC	1	1	34	44	325	43	AS	N762AS	121	201	1.45e+03	0	34
SEA	DTW	1	1	37	82	747	88	DL	N806DN	1823	224	1.93e+03	0	37

We can also rename variables within select.

flights2 <- flights %>%
  select(tail_num = tailnum, everything())
head(flights2)

tail_num	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	flight	origin	dest	air_time	distance	hour	minute
N508AS	2014	1	1	1	96	235	70	AS	145	PDX	ANC	194	1.54e+03	0	1
N195UW	2014	1	1	4	-6	738	-23	US	1830	SEA	CLT	252	2.28e+03	0	4
N37422	2014	1	1	8	13	548	-4	UA	1609	PDX	IAH	201	1.82e+03	0	8
N547UW	2014	1	1	28	-2	800	-23	US	466	PDX	CLT	251	2.28e+03	0	28
N762AS	2014	1	1	34	44	325	43	AS	121	SEA	ANC	201	1.45e+03	0	34
N806DN	2014	1	1	37	82	747	88	DL	1823	SEA	DTW	224	1.93e+03	0	37

If you don’t want to move the renamed variables within your dataframe, you can use the rename function.

flights3 <- flights %>%
  rename(tail_num = tailnum)
glimpse(flights3)

Rows: 162,049
Columns: 16
$ year      <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014,…
$ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ dep_time  <int> 1, 4, 8, 28, 34, 37, 346, 526, 527, 536, 541, 549, 550, 557…
$ dep_delay <dbl> 96, -6, 13, -2, 44, 82, 227, -4, 7, 1, 1, 24, 0, -3, -3, -2…
$ arr_time  <int> 235, 738, 548, 800, 325, 747, 936, 1148, 917, 1334, 911, 90…
$ arr_delay <dbl> 70, -23, -4, -23, 43, 88, 219, 15, 24, -6, 4, 12, -12, -16,…
$ carrier   <chr> "AS", "US", "UA", "US", "AS", "DL", "UA", "UA", "UA", "UA",…
$ tail_num  <chr> "N508AS", "N195UW", "N37422", "N547UW", "N762AS", "N806DN",…
$ flight    <int> 145, 1830, 1609, 466, 121, 1823, 1481, 229, 1576, 478, 1569…
$ origin    <chr> "PDX", "SEA", "PDX", "PDX", "SEA", "SEA", "SEA", "PDX", "SE…
$ dest      <chr> "ANC", "CLT", "IAH", "CLT", "ANC", "DTW", "ORD", "IAH", "DE…
$ air_time  <dbl> 194, 252, 201, 251, 201, 224, 202, 217, 136, 268, 130, 122,…
$ distance  <dbl> 1542, 2279, 1825, 2282, 1448, 1927, 1721, 1825, 1024, 2402,…
$ hour      <dbl> 0, 0, 0, 0, 0, 0, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6,…
$ minute    <dbl> 1, 4, 8, 28, 34, 37, 46, 26, 27, 36, 41, 49, 50, 57, 57, 58…

5.1.2 `dplyr::filter`

# flights taking off from PDX
pdx <- flights %>% 
  filter(origin == "PDX")
head(pdx)

year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
2014	1	1	1	96	235	70	AS	N508AS	145	PDX	ANC	194	1.54e+03	0	1
2014	1	1	8	13	548	-4	UA	N37422	1609	PDX	IAH	201	1.82e+03	0	8
2014	1	1	28	-2	800	-23	US	N547UW	466	PDX	CLT	251	2.28e+03	0	28
2014	1	1	526	-4	1148	15	UA	N813UA	229	PDX	IAH	217	1.82e+03	5	26
2014	1	1	541	1	911	4	UA	N36476	1569	PDX	DEN	130	991	5	41
2014	1	1	549	24	907	12	US	N548UW	649	PDX	PHX	122	1.01e+03	5	49

# january flights from PDX
pdx_jan <- flights %>% 
  filter(origin == "PDX", month == 1) # the comma is an "and"
head(pdx_jan)

year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
2014	1	1	1	96	235	70	AS	N508AS	145	PDX	ANC	194	1.54e+03	0	1
2014	1	1	8	13	548	-4	UA	N37422	1609	PDX	IAH	201	1.82e+03	0	8
2014	1	1	28	-2	800	-23	US	N547UW	466	PDX	CLT	251	2.28e+03	0	28
2014	1	1	526	-4	1148	15	UA	N813UA	229	PDX	IAH	217	1.82e+03	5	26
2014	1	1	541	1	911	4	UA	N36476	1569	PDX	DEN	130	991	5	41
2014	1	1	549	24	907	12	US	N548UW	649	PDX	PHX	122	1.01e+03	5	49

# flights to ATL (Atlanta) or BNA (Nashville)
to_south <- flights %>% 
  filter(dest == "ATL" | dest == "BNA") %>% # | is "or"
  select(origin, dest, everything())
head(to_south)

origin	dest	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	air_time	distance	hour	minute
SEA	ATL	2014	1	1	624	-6	1401	-6	DL	N617DL	968	235	2.18e+03	6	24
SEA	ATL	2014	1	1	802	-3	1533	-17	AS	N532AS	742	249	2.18e+03	8	2
SEA	ATL	2014	1	1	824	-1	1546	-14	DL	N633DL	128	234	2.18e+03	8	24
PDX	ATL	2014	1	1	944	-6	1727	-8	AS	N548AS	752	252	2.17e+03	9	44
PDX	ATL	2014	1	1	1054	94	1807	84	DL	N377DA	1502	237	2.17e+03	10	54
SEA	ATL	2014	1	1	1158	6	1915	-14	DL	N6712B	1962	234	2.18e+03	11	58

# flights from PDX to ATL (Atlanta) or BNA (Nashville)
pdx_to_south <- flights %>% 
  filter(origin == "PDX", dest == "ATL" | dest == "BNA") %>% # | is "or"
  select(origin, dest, everything())
head(pdx_to_south)

origin	dest	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	air_time	distance	hour	minute
PDX	ATL	2014	1	1	944	-6	1727	-8	AS	N548AS	752	252	2.17e+03	9	44
PDX	ATL	2014	1	1	1054	94	1807	84	DL	N377DA	1502	237	2.17e+03	10	54
PDX	ATL	2014	1	1	1323	-2	2038	-15	DL	N393DA	773	235	2.17e+03	13	23
PDX	ATL	2014	1	1	2253	8	611	4	DL	N371DA	503	240	2.17e+03	22	53
PDX	ATL	2014	1	2	627	-3	1350	-7	DL	N3746H	1156	244	2.17e+03	6	27
PDX	ATL	2014	1	2	918	-2	1643	-2	DL	N3756	1940	249	2.17e+03	9	18

# alternatively, using group membership
south_dests <- c("ATL", "BNA")
pdx_to_south2 <- flights %>% 
  filter(origin == "PDX", dest %in% south_dests) %>% 
  select(origin, dest, everything())
head(pdx_to_south2)

origin	dest	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	air_time	distance	hour	minute
PDX	ATL	2014	1	1	944	-6	1727	-8	AS	N548AS	752	252	2.17e+03	9	44
PDX	ATL	2014	1	1	1054	94	1807	84	DL	N377DA	1502	237	2.17e+03	10	54
PDX	ATL	2014	1	1	1323	-2	2038	-15	DL	N393DA	773	235	2.17e+03	13	23
PDX	ATL	2014	1	1	2253	8	611	4	DL	N371DA	503	240	2.17e+03	22	53
PDX	ATL	2014	1	2	627	-3	1350	-7	DL	N3746H	1156	244	2.17e+03	6	27
PDX	ATL	2014	1	2	918	-2	1643	-2	DL	N3756	1940	249	2.17e+03	9	18

# flights delayed by 1 hour or more
delay_1plus <- flights %>%
  filter(dep_delay >= 60)
head(delay_1plus)

year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
2014	1	1	1	96	235	70	AS	N508AS	145	PDX	ANC	194	1.54e+03	0	1
2014	1	1	37	82	747	88	DL	N806DN	1823	SEA	DTW	224	1.93e+03	0	37
2014	1	1	346	227	936	219	UA	N14219	1481	SEA	ORD	202	1.72e+03	3	46
2014	1	1	650	90	1037	91	US	N626AW	460	SEA	PHX	141	1.11e+03	6	50
2014	1	1	959	164	1137	157	AS	N534AS	805	SEA	SMF	77	605	9	59
2014	1	1	1008	68	1242	64	AS	N788AS	456	SEA	LAX	129	954	10	8

# flights delayed by 1 hour, but not more than 2 hours
delay_1hr <- flights %>%
  filter(dep_delay >= 60, dep_delay < 120)
head(delay_1hr)

year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
2014	1	1	1	96	235	70	AS	N508AS	145	PDX	ANC	194	1.54e+03	0	1
2014	1	1	37	82	747	88	DL	N806DN	1823	SEA	DTW	224	1.93e+03	0	37
2014	1	1	650	90	1037	91	US	N626AW	460	SEA	PHX	141	1.11e+03	6	50
2014	1	1	1008	68	1242	64	AS	N788AS	456	SEA	LAX	129	954	10	8
2014	1	1	1014	75	1613	81	UA	N37408	1444	SEA	ORD	201	1.72e+03	10	14
2014	1	1	1036	81	1408	63	OO	N218AG	3466	PDX	TUS	134	1.12e+03	10	36

range(delay_1hr$dep_delay, na.rm = TRUE)

[1]  60 119

# even more efficient using between (always inclusive)
delay_bwn <- flights %>%
  filter(between(dep_delay, 60, 119))
head(delay_bwn)

year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
2014	1	1	1	96	235	70	AS	N508AS	145	PDX	ANC	194	1.54e+03	0	1
2014	1	1	37	82	747	88	DL	N806DN	1823	SEA	DTW	224	1.93e+03	0	37
2014	1	1	650	90	1037	91	US	N626AW	460	SEA	PHX	141	1.11e+03	6	50
2014	1	1	1008	68	1242	64	AS	N788AS	456	SEA	LAX	129	954	10	8
2014	1	1	1014	75	1613	81	UA	N37408	1444	SEA	ORD	201	1.72e+03	10	14
2014	1	1	1036	81	1408	63	OO	N218AG	3466	PDX	TUS	134	1.12e+03	10	36

range(delay_bwn$dep_delay, na.rm = TRUE)

[1]  60 119

5.1.3 `dplyr::arrange`

# default is ascending order
flights %>% 
  arrange(year, month, day) %>% head(n=20)

year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
2014	1	1	1	96	235	70	AS	N508AS	145	PDX	ANC	194	1.54e+03	0	1
2014	1	1	4	-6	738	-23	US	N195UW	1830	SEA	CLT	252	2.28e+03	0	4
2014	1	1	8	13	548	-4	UA	N37422	1609	PDX	IAH	201	1.82e+03	0	8
2014	1	1	28	-2	800	-23	US	N547UW	466	PDX	CLT	251	2.28e+03	0	28
2014	1	1	34	44	325	43	AS	N762AS	121	SEA	ANC	201	1.45e+03	0	34
2014	1	1	37	82	747	88	DL	N806DN	1823	SEA	DTW	224	1.93e+03	0	37
2014	1	1	346	227	936	219	UA	N14219	1481	SEA	ORD	202	1.72e+03	3	46
2014	1	1	526	-4	1148	15	UA	N813UA	229	PDX	IAH	217	1.82e+03	5	26
2014	1	1	527	7	917	24	UA	N75433	1576	SEA	DEN	136	1.02e+03	5	27
2014	1	1	536	1	1334	-6	UA	N574UA	478	SEA	EWR	268	2.4e+03	5	36
2014	1	1	541	1	911	4	UA	N36476	1569	PDX	DEN	130	991	5	41
2014	1	1	549	24	907	12	US	N548UW	649	PDX	PHX	122	1.01e+03	5	49
2014	1	1	550	0	837	-12	DL	N660DL	1634	SEA	SLC	82	689	5	50
2014	1	1	557	-3	1134	-16	AA	N3JLAA	1094	SEA	DFW	184	1.66e+03	5	57
2014	1	1	557	-3	825	-25	AS	N562AS	81	SEA	ANC	188	1.45e+03	5	57
2014	1	1	558	-2	801	-2	AS	N402AS	200	SEA	SJC	100	697	5	58
2014	1	1	559	-1	916	-9	F9	N210FR	796	PDX	DEN	125	991	5	59
2014	1	1	600	0	1151	-19	AA	N3JFAA	2240	SEA	ORD	206	1.72e+03	6	0
2014	1	1	600	-10	842	-8	AS	N786AS	426	SEA	LAX	125	954	6	0
2014	1	1	602	-3	943	5	F9	N201FR	144	SEA	DEN	129	1.02e+03	6	2

# descending order
flights %>% 
  arrange(desc(year), desc(month), desc(day)) %>% head(n=20)

year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
2014	12	31	2	12	601	31	AA	N3JKAA	1230	SEA	DFW	210	1.66e+03	0	2
2014	12	31	27	-3	623	3	AA	N3EWAA	1431	SEA	ORD	211	1.72e+03	0	27
2014	12	31	39	14	324	4	AS	N762AS	135	SEA	ANC	207	1.45e+03	0	39
2014	12	31	40	0	549	0	DL	N757AT	2440	SEA	MSP	174	1.4e+03	0	40
2014	12	31	52	-8	917	-21	AA	N3JFAA	371	SEA	MIA	306	2.72e+03	0	52
2014	12	31	54	4	621	17	DL	N128DL	1670	PDX	MSP	174	1.43e+03	0	54
2014	12	31	56	61	848	80	DL	N655DL	929	SEA	ATL	249	2.18e+03	0	56
2014	12	31	512	-3	904	4	US	N653AW	480	SEA	PHX	154	1.11e+03	5	12
2014	12	31	515	-5	855	5	US	N580UW	425	PDX	PHX	144	1.01e+03	5	15
2014	12	31	534	4	859	7	UA	N34460	1075	PDX	DEN	125	991	5	34
2014	12	31	546	1	916	-4	WN	N8323C	757	PDX	DEN	126	991	5	46
2014	12	31	548	-2	1351	-13	UA	N461UA	665	PDX	EWR	287	2.43e+03	5	48
2014	12	31	549	4	1208	12	UA	N68807	1457	SEA	IAH	223	1.87e+03	5	49
2014	12	31	550	0	922	2	WN	N797MX	2121	PDX	PHX	138	1.01e+03	5	50
2014	12	31	551	-4	1202	12	AA	N3HXAA	1094	SEA	DFW	211	1.66e+03	5	51
2014	12	31	551	-9	744	-15	AS	N570AS	342	SEA	OAK	91	671	5	51
2014	12	31	555	-10	824	-1	AS	N548AS	602	SEA	LAS	124	867	5	55
2014	12	31	558	-2	849	0	DL	N668DN	1831	PDX	SLC	85	630	5	58
2014	12	31	558	-2	1149	4	AA	N436AA	1534	PDX	DFW	211	1.62e+03	5	58
2014	12	31	558	-2	738	-4	AS	N585AS	406	PDX	SJC	80	569	5	58

5.1.4 `dplyr::distinct`

# unique origin-dest combinations
flights %>% 
  select(origin, dest) %>% 
  distinct %>% head(n=50)

origin	dest
PDX	ANC
SEA	CLT
PDX	IAH
PDX	CLT
SEA	ANC
SEA	DTW
SEA	ORD
SEA	DEN
SEA	EWR
PDX	DEN
PDX	PHX
SEA	SLC
SEA	DFW
SEA	SJC
SEA	LAX
PDX	ORD
SEA	OAK
SEA	SFO
PDX	SJC
SEA	SNA
SEA	SAN
PDX	DFW
PDX	EWR
SEA	IAH
SEA	ATL
PDX	BUR
SEA	MDW
PDX	LAX
PDX	SAN
PDX	IAD
SEA	PSP
PDX	MDW
SEA	MSP
PDX	SNA
SEA	PHX
PDX	HNL
SEA	JFK
PDX	MSP
PDX	SFO
SEA	LAS
SEA	LGB
PDX	ONT
PDX	LGB
PDX	JFK
PDX	KOA
PDX	LAS
SEA	ONT
SEA	MKE
SEA	BUR
SEA	KTN

# all unique destinations from PDX (there are 49, so we'll just list the first few)
from_pdx <- flights %>% 
  filter(origin == "PDX") %>% 
  select(origin, dest) %>%
  distinct(dest)
head(from_pdx)

dest

ANC

IAH

CLT

DEN

PHX

ORD

5.1.5 `dplyr::mutate`

# add total delay variable
flights %>%
  mutate(tot_delay = dep_delay + arr_delay) %>%
  select(origin, dest, ends_with("delay"), everything()) %>% 
  head(n=100)

origin	dest	dep_delay	arr_delay	tot_delay	year	month	day	dep_time	arr_time	carrier	tailnum	flight	air_time	distance	hour	minute
PDX	ANC	96	70	166	2014	1	1	1	235	AS	N508AS	145	194	1.54e+03	0	1
SEA	CLT	-6	-23	-29	2014	1	1	4	738	US	N195UW	1830	252	2.28e+03	0	4
PDX	IAH	13	-4	9	2014	1	1	8	548	UA	N37422	1609	201	1.82e+03	0	8
PDX	CLT	-2	-23	-25	2014	1	1	28	800	US	N547UW	466	251	2.28e+03	0	28
SEA	ANC	44	43	87	2014	1	1	34	325	AS	N762AS	121	201	1.45e+03	0	34
SEA	DTW	82	88	170	2014	1	1	37	747	DL	N806DN	1823	224	1.93e+03	0	37
SEA	ORD	227	219	446	2014	1	1	346	936	UA	N14219	1481	202	1.72e+03	3	46
PDX	IAH	-4	15	11	2014	1	1	526	1148	UA	N813UA	229	217	1.82e+03	5	26
SEA	DEN	7	24	31	2014	1	1	527	917	UA	N75433	1576	136	1.02e+03	5	27
SEA	EWR	1	-6	-5	2014	1	1	536	1334	UA	N574UA	478	268	2.4e+03	5	36
PDX	DEN	1	4	5	2014	1	1	541	911	UA	N36476	1569	130	991	5	41
PDX	PHX	24	12	36	2014	1	1	549	907	US	N548UW	649	122	1.01e+03	5	49
SEA	SLC	0	-12	-12	2014	1	1	550	837	DL	N660DL	1634	82	689	5	50
SEA	DFW	-3	-16	-19	2014	1	1	557	1134	AA	N3JLAA	1094	184	1.66e+03	5	57
SEA	ANC	-3	-25	-28	2014	1	1	557	825	AS	N562AS	81	188	1.45e+03	5	57
SEA	SJC	-2	-2	-4	2014	1	1	558	801	AS	N402AS	200	100	697	5	58
PDX	DEN	-1	-9	-10	2014	1	1	559	916	F9	N210FR	796	125	991	5	59
SEA	ORD	0	-19	-19	2014	1	1	600	1151	AA	N3JFAA	2240	206	1.72e+03	6	0
SEA	LAX	-10	-8	-18	2014	1	1	600	842	AS	N786AS	426	125	954	6	0
SEA	DEN	-3	5	2	2014	1	1	602	943	F9	N201FR	144	129	1.02e+03	6	2
PDX	ORD	-3	7	4	2014	1	1	602	1204	UA	N77295	1573	203	1.74e+03	6	2
SEA	OAK	-2	-17	-19	2014	1	1	603	755	AS	N577AS	344	94	672	6	3
SEA	ORD	-3	-2	-5	2014	1	1	603	1202	UA	N37290	1136	203	1.72e+03	6	3
SEA	SFO	-4	-19	-23	2014	1	1	606	806	AS	N525AS	308	98	679	6	6
PDX	SJC	6	3	9	2014	1	1	606	746	AS	N796AS	406	87	569	6	6
SEA	SNA	-1	-2	-3	2014	1	1	614	850	AS	N607AS	524	135	978	6	14
SEA	SAN	2	-12	-10	2014	1	1	617	850	AS	N579AS	494	131	1.05e+03	6	17
PDX	DFW	-2	-30	-32	2014	1	1	618	1135	AA	N4XMAA	1650	184	1.62e+03	6	18
SEA	SFO	-6	-7	-13	2014	1	1	619	822	VX	N841VA	751	101	679	6	19
SEA	LAX	-2	0	-2	2014	1	1	620	905	OO	N804SK	4693	125	954	6	20
PDX	EWR	2	-19	-17	2014	1	1	622	1412	UA	N36472	1021	270	2.43e+03	6	22
SEA	IAH	13	-4	9	2014	1	1	623	1218	UA	N72405	1101	213	1.87e+03	6	23
SEA	ATL	-6	-6	-12	2014	1	1	624	1401	DL	N617DL	968	235	2.18e+03	6	24
SEA	DEN	-9	-1	-10	2014	1	1	629	1014	UA	N566UA	883	132	1.02e+03	6	29
PDX	BUR	-10	-14	-24	2014	1	1	630	834	OO	N219AG	3456	105	817	6	30
SEA	MDW	-3	5	2	2014	1	1	632	1235	WN	N8302F	3694	204	1.73e+03	6	32
PDX	LAX	-8	-7	-15	2014	1	1	637	858	AS	N408AS	560	108	834	6	37
PDX	SAN	-3	-6	-9	2014	1	1	637	854	AS	N588AS	572	120	933	6	37
PDX	PHX	-2	-5	-7	2014	1	1	638	1003	AS	N508AS	646	125	1.01e+03	6	38
PDX	IAD	10	-4	6	2014	1	1	638	1408	UA	N526UA	250	253	2.33e+03	6	38
SEA	PSP	-1	3	2	2014	1	1	639	918	AS	N794AS	556	130	987	6	39
SEA	DFW	-1	-10	-11	2014	1	1	639	1216	AS	N442AS	668	189	1.66e+03	6	39
PDX	MDW	4	-11	-7	2014	1	1	639	1219	WN	N8308K	401	207	1.75e+03	6	39
SEA	LAX	-3	-2	-5	2014	1	1	647	923	AS	N315AS	470	124	954	6	47
SEA	DEN	-3	19	16	2014	1	1	647	1038	AS	N553AS	682	136	1.02e+03	6	47
SEA	MSP	-7	-9	-16	2014	1	1	648	1203	AS	N419AS	36	164	1.4e+03	6	48
PDX	SNA	-12	-11	-23	2014	1	1	648	910	AS	N611AS	582	122	859	6	48
SEA	PHX	90	91	181	2014	1	1	650	1037	US	N626AW	460	141	1.11e+03	6	50
PDX	HNL	-6	-33	-39	2014	1	1	654	1047	AS	N586AS	833	327	2.6e+03	6	54
SEA	JFK	-6	-10	-16	2014	1	1	654	1455	DL	N686DA	418	273	2.42e+03	6	54
PDX	MSP	-5	-7	-12	2014	1	1	655	1210	DL	N963DN	2090	168	1.43e+03	6	55
PDX	ORD	-4	-28	-32	2014	1	1	656	1242	AA	N3KMAA	2218	200	1.74e+03	6	56
PDX	SFO	-4	8	4	2014	1	1	656	853	VX	N523VA	805	82	550	6	56
PDX	SFO	0	-1	-1	2014	1	1	700	844	UA	N479UA	519	80	550	7	0
SEA	LAS	-4	-6	-10	2014	1	1	701	918	AS	N618AS	606	115	867	7	1
SEA	PHX	1	-9	-8	2014	1	1	701	1036	WN	N8318F	669	133	1.11e+03	7	1
SEA	LAX	-3	-8	-11	2014	1	1	702	932	VX	N840VA	1780	127	954	7	2
SEA	LGB	-8	-2	-10	2014	1	1	702	940	OO	N217AG	3472	130	965	7	2
SEA	DFW	-6	-20	-26	2014	1	1	704	1245	AA	N3KNAA	1308	188	1.66e+03	7	4
PDX	ONT	-1	-9	-10	2014	1	1	704	910	OO	N223AG	3460	106	838	7	4
PDX	LGB	0	9	9	2014	1	1	705	931	B6	N627JB	1121	113	846	7	5
SEA	SLC	6	-13	-7	2014	1	1	706	952	AS	N306AS	726	86	689	7	6
SEA	JFK	-7	-19	-26	2014	1	1	708	1510	AA	N3DNAA	236	281	2.42e+03	7	8
PDX	JFK	-2	-20	-22	2014	1	1	708	1453	DL	N3772H	2258	267	2.45e+03	7	8
SEA	SJC	-1	-5	-6	2014	1	1	709	915	AS	N558AS	332	103	697	7	9
SEA	MSP	9	-4	5	2014	1	1	709	1209	DL	N374NW	2314	160	1.4e+03	7	9
PDX	SFO	-5	-21	-26	2014	1	1	710	906	AS	N309AS	244	82	550	7	10
SEA	SLC	-3	-1	-4	2014	1	1	712	1013	DL	N355NW	2397	85	689	7	12
SEA	IAH	20	10	30	2014	1	1	713	1317	UA	N854UA	461	218	1.87e+03	7	13
SEA	LAX	0	-10	-10	2014	1	1	715	944	AS	N617AS	478	127	954	7	15
PDX	KOA	-5	-47	-52	2014	1	1	715	1055	AS	N536AS	855	326	2.61e+03	7	15
SEA	PHX	-3	-11	-14	2014	1	1	717	1057	AS	N769AS	642	134	1.11e+03	7	17
PDX	LAS	-3	3	0	2014	1	1	722	933	WN	N407WN	947	101	763	7	22
SEA	DEN	1	17	18	2014	1	1	726	1117	WN	N603SW	2181	133	1.02e+03	7	26
SEA	ONT	-2	-7	-9	2014	1	1	728	951	AS	N320AS	530	124	956	7	28
SEA	SFO	5	-9	-4	2014	1	1	735	940	UA	N16732	1114	104	679	7	35
SEA	SAN	14	-2	12	2014	1	1	739	1003	AS	N518AS	240	132	1.05e+03	7	39
SEA	SFO	31	19	50	2014	1	1	741	947	AS	N431AS	222	96	679	7	41
SEA	MKE	5	7	12	2014	1	1	745	1327	WN	N924WN	3872	200	1.69e+03	7	45
SEA	BUR	-6	-3	-9	2014	1	1	749	1013	AS	N627AS	522	122	937	7	49
PDX	LAX	-6	-14	-20	2014	1	1	754	1011	AS	N706AS	568	111	834	7	54
SEA	KTN	52	45	97	2014	1	1	757	855	AS	N705AS	65	100	680	7	57
PDX	OAK	-4	-18	-22	2014	1	1	801	932	WN	N447WN	48	78	543	8	1
SEA	ATL	-3	-17	-20	2014	1	1	802	1533	AS	N532AS	742	249	2.18e+03	8	2
SEA	LAS	-1	-8	-9	2014	1	1	809	1021	AS	N557AS	600	115	867	8	9
PDX	MSP	-6	-5	-11	2014	1	1	809	1326	DL	N951DN	1588	170	1.43e+03	8	9
SEA	ORD	-10	-11	-21	2014	1	1	810	1419	AA	N3DSAA	1236	202	1.72e+03	8	10
SEA	DCA	0	-2	-2	2014	1	1	810	1604	AS	N514AS	4	259	2.33e+03	8	10
SEA	ANC	17	-4	13	2014	1	1	817	1045	AS	N529AS	83	184	1.45e+03	8	17
SEA	ABQ	-3	-7	-10	2014	1	1	817	1158	WN	N379SW	253	138	1.18e+03	8	17
PDX	DFW	14	-7	7	2014	1	1	819	1343	AA	N4XPAA	1220	178	1.62e+03	8	19
SEA	IAD	10	-8	2	2014	1	1	821	1552	UA	N514UA	419	250	2.31e+03	8	21
SEA	ATL	-1	-14	-15	2014	1	1	824	1546	DL	N633DL	128	234	2.18e+03	8	24
SEA	DFW	-5	-3	-8	2014	1	1	825	1412	AA	N3BDAA	312	183	1.66e+03	8	25
SEA	MCO	25	30	55	2014	1	1	825	1650	AS	N551AS	16	301	2.55e+03	8	25
SEA	ORD	-5	15	10	2014	1	1	825	1436	AS	N305AS	28	202	1.72e+03	8	25
SEA	MSP	0	10	10	2014	1	1	825	1352	DL	N584NW	588	162	1.4e+03	8	25
PDX	IAH	2	-16	-14	2014	1	1	827	1414	UA	N38446	1510	210	1.82e+03	8	27
SEA	OGG	-1	-22	-23	2014	1	1	829	1228	AS	N563AS	861	338	2.64e+03	8	29
SEA	PHX	0	-12	-12	2014	1	1	830	1206	US	N667AW	440	132	1.11e+03	8	30

# flights that were delayed at departure had on time or early arrivals?
arrivals <- flights %>%
  mutate(arr_ok = ifelse(dep_delay > 0 & arr_delay <= 0, 1, 0)) %>% 
  select(origin, dest, ends_with("delay"), carrier, arr_ok)

# peek at it
arrivals %>%
  filter(arr_ok == 1) %>%
  head

origin	dest	dep_delay	arr_delay	carrier	arr_ok
PDX	IAH	13	-4	UA	1
SEA	EWR	1	-6	UA	1
SEA	SAN	2	-12	AS	1
PDX	EWR	2	-19	UA	1
SEA	IAH	13	-4	UA	1
PDX	IAD	10	-4	UA	1

5.1.6 `dplyr::summarise` (or `dplyr::summarize`)

flights %>%
  summarise(mean(dep_delay, na.rm = TRUE))

mean(dep_delay, na.rm = TRUE)

6.13

# we can also name that variable, and summarise multiple variables
flights %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE),
            sd_delay = sd(dep_delay, na.rm = TRUE),
            median_delay = median(dep_delay, na.rm = TRUE))

mean_delay	sd_delay	median_delay
6.13	29.1	-2

But this can get tedious with multiple summaries…

flights %>%
  filter(!is.na(dep_delay)) %>%
  select(dep_delay) %>%
  summarise_each(lst(mean, sd, median))

mean	sd	median
6.13	29.1	-2

# same thing
flights %>%
  filter(!is.na(dep_delay)) %>%
  summarise_each(lst(mean, sd, median), dep_delay)

mean	sd	median
6.13	29.1	-2

# combine with gather, change names too
flights %>%
  filter(!is.na(dep_delay)) %>%
  summarise_each(lst(mean, stdev = sd, median), dep_delay) %>%
  gather(delay_stat, value)

delay_stat	value
mean	6.13
stdev	29.1
median	-2

Using aggregating functions in summarise

# how many unique destinations?
summary_table <- flights %>% 
  summarise(tot_flights = n(),
            tot_planes = n_distinct(tailnum),
            tot_carriers = n_distinct(carrier),
            tot_dests = n_distinct(dest),
            tot_origins = n_distinct(origin))

summary_table

tot_flights	tot_planes	tot_carriers	tot_dests	tot_origins
162049	3023	11	71	2

# chain with tidyr functions
summary_table %>% 
  gather(key, value) %>% 
  separate(key, into = c("tot", "entity")) %>% 
  select(-tot, total = value)

entity	total
flights	162049
planes	3023
carriers	11
dests	71
origins	2

5.2 `tidyr`

We’ll work with a made up dataframe:

df <- data.frame(
  id = 1:10,
  date = as.Date('2015-01-01') + 0:9,
  q1_m1_w1 = rnorm(10, 0, 1),
  q1_m1_w2 = rnorm(10, 0, 1),
  q1_m2_w3 = rnorm(10, 0, 1),
  q2_m1_w1 = rnorm(10, 0, 1),
  q2_m2_w1 = rnorm(10, 0, 1),
  q2_m2_w2 = rnorm(10, 0, 1)
)

# HLO
head(df)

id	date	q1_m1_w1	q1_m1_w2	q1_m2_w3	q2_m1_w1	q2_m2_w1	q2_m2_w2
1	2015-01-01	0.908	-1.06	0.259	-0.0417	0.312	0.96
2	2015-01-02	-0.8	0.378	-0.74	-0.888	-0.474	1.13
3	2015-01-03	0.373	0.12	-0.954	-0.767	-0.789	-0.0994
4	2015-01-04	-0.0835	-1.18	0.336	-1.02	-0.808	-0.201
5	2015-01-05	-0.261	-0.47	1.12	1.24	0.423	-0.781
6	2015-01-06	0.16	0.264	0.939	-1.08	0.17	0.0755

glimpse(df)

Rows: 10
Columns: 8
$ id       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ date     <date> 2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05,…
$ q1_m1_w1 <dbl> 0.90833895, -0.80014146, 0.37278036, -0.08348309, -0.2609710…
$ q1_m1_w2 <dbl> -1.0636021, 0.3780610, 0.1200822, -1.1750064, -0.4698987, 0.…
$ q1_m2_w3 <dbl> 0.2585625, -0.7404383, -0.9540600, 0.3363360, 1.1159808, 0.9…
$ q2_m1_w1 <dbl> -0.04166279, -0.88810449, -0.76684055, -1.02118865, 1.243960…
$ q2_m2_w1 <dbl> 0.3118560, -0.4739947, -0.7892439, -0.8077828, 0.4226551, 0.…
$ q2_m2_w2 <dbl> 0.96043415, 1.13194728, -0.09936716, -0.20098062, -0.7809419…

5.2.1 `tidyr::gather`

First, let’s gather…

df_tidy <- df %>%
  gather(key, value, q1_m1_w1:q2_m2_w2)
head(df_tidy)

id	date	key	value
1	2015-01-01	q1_m1_w1	0.908
2	2015-01-02	q1_m1_w1	-0.8
3	2015-01-03	q1_m1_w1	0.373
4	2015-01-04	q1_m1_w1	-0.0835
5	2015-01-05	q1_m1_w1	-0.261
6	2015-01-06	q1_m1_w1	0.16

Now let’s gather using subtraction…

df_tidy <- df %>%
  gather(key, value, -id, -date)
head(df_tidy)

id	date	key	value
1	2015-01-01	q1_m1_w1	0.908
2	2015-01-02	q1_m1_w1	-0.8
3	2015-01-03	q1_m1_w1	0.373
4	2015-01-04	q1_m1_w1	-0.0835
5	2015-01-05	q1_m1_w1	-0.261
6	2015-01-06	q1_m1_w1	0.16

5.2.2 `tidyr::separate`

# separate 1 col into 3 cols
df_sep <- df_tidy %>%
  separate(key, into = c("quarter", "month", "week"))
head(df_sep)

id	date	quarter	month	week	value
1	2015-01-01	q1	m1	w1	0.908
2	2015-01-02	q1	m1	w1	-0.8
3	2015-01-03	q1	m1	w1	0.373
4	2015-01-04	q1	m1	w1	-0.0835
5	2015-01-05	q1	m1	w1	-0.261
6	2015-01-06	q1	m1	w1	0.16

# separate 1 col into 2 cols
df_sep2 <- df_tidy %>%
  separate(key, into = c("quarter", "period"), extra = "merge")
head(df_sep2)

id	date	quarter	period	value
1	2015-01-01	q1	m1_w1	0.908
2	2015-01-02	q1	m1_w1	-0.8
3	2015-01-03	q1	m1_w1	0.373
4	2015-01-04	q1	m1_w1	-0.0835
5	2015-01-05	q1	m1_w1	-0.261
6	2015-01-06	q1	m1_w1	0.16

stringr vs. tidyr separate by regular expression

5.2.3 `tidyr::extract`

Extract is essentially the same as separate, let’s see how…

# extract
df_ext <- df_sep2 %>%
  extract(period, into = "month")
head(df_ext)

id	date	quarter	month	value
1	2015-01-01	q1	m1	0.908
2	2015-01-02	q1	m1	-0.8
3	2015-01-03	q1	m1	0.373
4	2015-01-04	q1	m1	-0.0835
5	2015-01-05	q1	m1	-0.261
6	2015-01-06	q1	m1	0.16

# this gives us same output as separate
df_ext <- df_sep2 %>%
  extract(period, into = c("month", "week"), 
          regex = "([[:alnum:]]+)_([[:alnum:]]+)")
head(df_ext)

id	date	quarter	month	week	value
1	2015-01-01	q1	m1	w1	0.908
2	2015-01-02	q1	m1	w1	-0.8
3	2015-01-03	q1	m1	w1	0.373
4	2015-01-04	q1	m1	w1	-0.0835
5	2015-01-05	q1	m1	w1	-0.261
6	2015-01-06	q1	m1	w1	0.16

5.2.4 `tidyr::unite`

# let's say we want to combine quarter and month with an underscore
df_uni <- df_sep %>%
  unite(period, quarter:month) # sep = "_" is the default arg
head(df_uni)

id	date	period	week	value
1	2015-01-01	q1_m1	w1	0.908
2	2015-01-02	q1_m1	w1	-0.8
3	2015-01-03	q1_m1	w1	0.373
4	2015-01-04	q1_m1	w1	-0.0835
5	2015-01-05	q1_m1	w1	-0.261
6	2015-01-06	q1_m1	w1	0.16

# let's say we want to combine quarter and month with nothing
df_uni <- df_sep %>%
  unite(period, quarter:month, sep = "")
head(df_uni)

id	date	period	week	value
1	2015-01-01	q1m1	w1	0.908
2	2015-01-02	q1m1	w1	-0.8
3	2015-01-03	q1m1	w1	0.373
4	2015-01-04	q1m1	w1	-0.0835
5	2015-01-05	q1m1	w1	-0.261
6	2015-01-06	q1m1	w1	0.16

5.2.5 `tidyr::spread`

# finally let's spread
df_spread <- df_uni %>%
  spread(week, value) # fill = NA is default arg
head(df_spread)

id	date	period	w1	w2	w3
1	2015-01-01	q1m1	0.908	-1.06
1	2015-01-01	q1m2			0.259
1	2015-01-01	q2m1	-0.0417
1	2015-01-01	q2m2	0.312	0.96
2	2015-01-02	q1m1	-0.8	0.378
2	2015-01-02	q1m2			-0.74

5.2.6 Gather multiple sets of columns (`gather() %>% separate() %>% spread()`)

Gather multiple sets of columns

All in one, if we had wanted to essentially “gather” three sets of columns (here, one for each week)…

df_tidiest <- df %>%
  gather(key, value, -id, -date) %>%
  separate(key, into = c("quarter", "month", "week")) %>%
  spread(week, value)
head(df_tidiest)

id	date	quarter	month	w1	w2	w3
1	2015-01-01	q1	m1	0.908	-1.06
1	2015-01-01	q1	m2			0.259
1	2015-01-01	q2	m1	-0.0417
1	2015-01-01	q2	m2	0.312	0.96
2	2015-01-02	q1	m1	-0.8	0.378
2	2015-01-02	q1	m2			-0.74

5.3 `broom`

“The broom package takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames.” So, broom tidies output from other R functions that are un-tidy.

See here for list of functions: https://github.com/dgrtwo/broom

Vignette: ftp://cran.r-project.org/pub/R/web/packages/broom/vignettes/broom.html

fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear), 
          data = mtcars)

Un-tidy output from lm

summary(fit)


Call:
lm(formula = mpg ~ qsec + factor(am) + wt + factor(gear), data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5064 -1.5220 -0.7517  1.3841  4.6345 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     9.3650     8.3730   1.118  0.27359    
qsec            1.2449     0.3828   3.252  0.00317 ** 
factor(am)1     3.1505     1.9405   1.624  0.11654    
wt             -3.9263     0.7428  -5.286 1.58e-05 ***
factor(gear)4  -0.2682     1.6555  -0.162  0.87257    
factor(gear)5  -0.2697     2.0632  -0.131  0.89698    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.55 on 26 degrees of freedom
Multiple R-squared:  0.8498,    Adjusted R-squared:  0.8209 
F-statistic: 29.43 on 5 and 26 DF,  p-value: 6.379e-10

Tidy output from broom

tidy(fit)

term	estimate	std.error	statistic	p.value
(Intercept)	9.37	8.37	1.12	0.274
qsec	1.24	0.383	3.25	0.00317
factor(am)1	3.15	1.94	1.62	0.117
wt	-3.93	0.743	-5.29	1.58e-05
factor(gear)4	-0.268	1.66	-0.162	0.873
factor(gear)5	-0.27	2.06	-0.131	0.897

6 Specialized Packages

6.1 `gt`

gt is a relatively new package from RStudio designed for publication-ready figures.

library(gt)

Let’s set up a tiny little data table…

prepped_flight_data <- flights %>% filter(dest %in% c("DEN", "DCA", "JFK", "SFO")) %>% 
  group_by(dest) %>% 
  mutate_at(vars(dep_delay), replace_na, replace=0.0) %>%  
  summarise(mean_delay=mean(dep_delay, na.rm=TRUE), median_delay=median(dep_delay, na.rm=TRUE))
glimpse(prepped_flight_data)

Rows: 4
Columns: 3
$ dest         <chr> "DCA", "DEN", "JFK", "SFO"
$ mean_delay   <dbl> 1.475138, 8.753835, 9.235996, 12.996799
$ median_delay <dbl> -3, -1, -1, -1

… and now make a table out of it:

gt_formatted <- prepped_flight_data %>%  gt() %>% 
  tab_header(title="PDX Departure Delays, DCA/DEN/JFK/SFO", subtitle="(Delay in minutes)") %>% 
  fmt_number(columns=vars(mean_delay, median_delay), decimals=2) %>% 
  cols_label(dest="Destination", mean_delay="Mean", median_delay="Median") 
gt_formatted

PDX Departure Delays, DCA/DEN/JFK/SFO
(Delay in minutes)
Destination	Mean	Median
DCA	1.48	−3.00
DEN	8.75	−1.00
JFK	9.24	−1.00
SFO	13.00	−1.00

We can adjust things like text alignment after-the-fact:

gt_formatted <- gt_formatted %>% 
  cols_align(align="center", columns=vars(dest)) %>%
  tab_style(
    cell_text(align = "right"), 
    locations = cells_title(groups = c("subtitle"))
  ) %>% 
  tab_style(
    cell_text(style = "italic"),
    locations = cells_title(groups=c("title"))
  )
gt_formatted

PDX Departure Delays, DCA/DEN/JFK/SFO
(Delay in minutes)
Destination	Mean	Median
DCA	1.48	−3.00
DEN	8.75	−1.00
JFK	9.24	−1.00
SFO	13.00	−1.00

Tip: In addition to cells_title(), there are helpers to select any of the other cell types (e.g. cells_column_labels(), etc.).

Now we can turn it into whatever we need! As Latex:

gt_formatted %>% gtsave("my_table.tex")

As RTF, to put in Word:

gt_formatted %>% gtsave("my_table.rtf")

6.2 `tableone`

tableone is for generating “Table 1” in your paper. You know, the “Subject Characteristics” table- lots of boring summary statistics.

Vignette: https://cran.r-project.org/web/packages/tableone/vignettes/introduction.html

library(tableone)

CreateTableOne(data = mazes)

                          
                           Overall       
  n                           381        
  study_id (%)                           
     CSLU-001                   4 ( 1.0) 
     CSLU-002                   4 ( 1.0) 
     CSLU-007                   4 ( 1.0) 
     CSLU-010                   4 ( 1.0) 
     CSLU-020                   4 ( 1.0) 
     CSLU-024                   4 ( 1.0) 
     CSLU-027                   4 ( 1.0) 
     CSLU-031                   4 ( 1.0) 
     CSLU-036                   3 ( 0.8) 
     CSLU-046                   4 ( 1.0) 
     CSLU-053                   4 ( 1.0) 
     CSLU-054                   4 ( 1.0) 
     CSLU-059                   4 ( 1.0) 
     CSLU-062                   4 ( 1.0) 
     CSLU-066                   4 ( 1.0) 
     CSLU-073                   4 ( 1.0) 
     CSLU-077                   4 ( 1.0) 
     CSLU-080                   4 ( 1.0) 
     CSLU-082                   3 ( 0.8) 
     CSLU-084                   4 ( 1.0) 
     CSLU-089                   4 ( 1.0) 
     CSLU-095                   3 ( 0.8) 
     CSLU-096                   4 ( 1.0) 
     CSLU-101                   4 ( 1.0) 
     CSLU-104                   4 ( 1.0) 
     CSLU-112                   4 ( 1.0) 
     CSLU-117                   4 ( 1.0) 
     CSLU-119                   4 ( 1.0) 
     CSLU-122                   4 ( 1.0) 
     CSLU-124                   3 ( 0.8) 
     CSLU-142                   4 ( 1.0) 
     CSLU-144                   4 ( 1.0) 
     CSLU-146                   4 ( 1.0) 
     CSLU-154                   4 ( 1.0) 
     CSLU-156                   4 ( 1.0) 
     CSLU-161                   4 ( 1.0) 
     CSLU-163                   4 ( 1.0) 
     CSLU-165                   4 ( 1.0) 
     CSLU-167                   4 ( 1.0) 
     CSLU-180                   4 ( 1.0) 
     CSLU-191                   4 ( 1.0) 
     CSLU-203                   4 ( 1.0) 
     CSLU-204                   4 ( 1.0) 
     CSLU-213                   4 ( 1.0) 
     CSLU-216                   4 ( 1.0) 
     CSLU-220                   4 ( 1.0) 
     CSLU-226                   4 ( 1.0) 
     CSLU-233                   4 ( 1.0) 
     CSLU-238                   4 ( 1.0) 
     CSLU-245                   4 ( 1.0) 
     CSLU-258                   4 ( 1.0) 
     CSLU-259                   4 ( 1.0) 
     CSLU-263                   4 ( 1.0) 
     CSLU-269                   4 ( 1.0) 
     CSLU-274                   4 ( 1.0) 
     CSLU-275                   4 ( 1.0) 
     CSLU-277                   4 ( 1.0) 
     CSLU-284                   4 ( 1.0) 
     CSLU-290                   4 ( 1.0) 
     CSLU-303                   4 ( 1.0) 
     CSLU-306                   4 ( 1.0) 
     CSLU-312                   4 ( 1.0) 
     CSLU-315                   4 ( 1.0) 
     CSLU-316                   4 ( 1.0) 
     CSLU-320                   4 ( 1.0) 
     CSLU-324                   4 ( 1.0) 
     CSLU-332                   4 ( 1.0) 
     CSLU-335                   4 ( 1.0) 
     CSLU-339                   4 ( 1.0) 
     CSLU-348                   4 ( 1.0) 
     CSLU-349                   4 ( 1.0) 
     CSLU-355                   4 ( 1.0) 
     CSLU-359                   4 ( 1.0) 
     CSLU-372                   4 ( 1.0) 
     CSLU-373                   4 ( 1.0) 
     CSLU-375                   4 ( 1.0) 
     CSLU-379                   4 ( 1.0) 
     CSLU-388                   2 ( 0.5) 
     CSLU-389                   4 ( 1.0) 
     CSLU-393                   4 ( 1.0) 
     CSLU-395                   4 ( 1.0) 
     CSLU-417                   4 ( 1.0) 
     CSLU-419                   4 ( 1.0) 
     CSLU-427                   4 ( 1.0) 
     CSLU-432                   3 ( 0.8) 
     CSLU-435                   4 ( 1.0) 
     CSLU-441                   4 ( 1.0) 
     CSLU-442                   4 ( 1.0) 
     CSLU-447                   4 ( 1.0) 
     CSLU-454                   4 ( 1.0) 
     CSLU-460                   4 ( 1.0) 
     CSLU-470                   4 ( 1.0) 
     CSLU-472                   4 ( 1.0) 
     CSLU-477                   4 ( 1.0) 
     CSLU-482                   4 ( 1.0) 
     CSLU-486                   4 ( 1.0) 
     CSLU-499                   4 ( 1.0) 
  ca (mean (SD))             6.83 (1.06) 
  viq (mean (SD))          100.82 (18.74)
  dx (%)                                 
     ASD                      183 (48.0) 
     SLI                       71 (18.6) 
     TD                       127 (33.3) 
  activity (%)                           
     Conversation              94 (24.7) 
     Picture Description       94 (24.7) 
     Play                      96 (25.2) 
     Wordless Picture Book     97 (25.5) 
  content (mean (SD))       18.73 (24.84)
  filler (mean (SD))        11.20 (17.59)
  rep (mean (SD))            6.24 (9.45) 
  rev (mean (SD))            3.79 (4.31) 
  fs (mean (SD))             8.70 (12.76)
  cued (mean (SD))          14.36 (24.22)
  not_cued (mean (SD))      26.77 (31.73)

my_maze_names <- c("Participant", "Age", "Verbal\nIQ", "Group", "Activity", "Content\nMaze", "Filler\nMaze", "Repetition", "Revision", "False\nStart", "Cued", "Not\nCued")
## Vector of variables to summarize
my_num_vars <- c("ca", "viq", "content", "filler", "rep", "rev", "fs", "cued", "not_cued")
## Vector of categorical variables that need transformation
my_cat_vars <- c("dx", "activity")
## Create a TableOne object
tab2 <- CreateTableOne(vars = my_num_vars, data = mazes, factorVars = my_cat_vars)
print(tab2, showAllLevels = TRUE)

                      
                       level Overall       
  n                             381        
  ca (mean (SD))               6.83 (1.06) 
  viq (mean (SD))            100.82 (18.74)
  content (mean (SD))         18.73 (24.84)
  filler (mean (SD))          11.20 (17.59)
  rep (mean (SD))              6.24 (9.45) 
  rev (mean (SD))              3.79 (4.31) 
  fs (mean (SD))               8.70 (12.76)
  cued (mean (SD))            14.36 (24.22)
  not_cued (mean (SD))        26.77 (31.73)

tab3 <- CreateTableOne(vars = my_num_vars, strata = "dx" , data = mazes)
tab3

                      Stratified by dx
                       ASD           SLI           TD             p      test
  n                      183            71            127                    
  ca (mean (SD))        6.74 (1.11)   7.15 (1.00)    6.76 (0.97)   0.015     
  viq (mean (SD))      95.29 (17.62) 86.24 (5.95)  116.94 (12.82) <0.001     
  content (mean (SD))  20.46 (29.73) 17.34 (24.35)  17.00 (15.67)  0.422     
  filler (mean (SD))    7.86 (13.54) 10.56 (16.35)  16.38 (21.84) <0.001     
  rep (mean (SD))       7.25 (11.82)  5.45 (6.86)    5.23 (6.21)   0.134     
  rev (mean (SD))       3.87 (4.85)   3.25 (3.55)    3.98 (3.85)   0.498     
  fs (mean (SD))        9.35 (14.60)  8.63 (15.00)   7.80 (7.55)   0.574     
  cued (mean (SD))     10.66 (21.94) 13.21 (22.54)  20.35 (27.10)  0.002     
  not_cued (mean (SD)) 25.52 (33.49) 25.25 (31.84)  29.41 (29.04)  0.517

6.3 The `DT` package

An excellent tutorial on DT is available at https://rstudio.github.io/DT/.

datatable(mazes)

6.4 `xtable` (best for html)

The xtable is a solution that delivers both HTML and LaTeX. The syntax is very similar to kable:

output <- 
  matrix(sprintf("Content %s", LETTERS[1:4]),
         ncol=2, byrow=TRUE)
colnames(output) <- 
  c("1st header", "2nd header")
rownames(output) <- 
  c("1st row", "2nd row")

print(xtable(output, 
             caption="A test table", 
             align = c("l", "c", "r")), 
      type="html")

<!-- html table generated in R 3.6.3 by xtable 1.8-4 package -->
<!-- Mon May  4 16:21:07 2020 -->
<table border=1>
<caption align="bottom"> A test table </caption>
<tr> <th>  </th> <th> 1st header </th> <th> 2nd header </th>  </tr>
  <tr> <td> 1st row </td> <td align="center"> Content A </td> <td align="right"> Content B </td> </tr>
  <tr> <td> 2nd row </td> <td align="center"> Content C </td> <td align="right"> Content D </td> </tr>
   </table>

Note that to make it knit, you need to specify a chunk option: results = 'asis'

print(xtable(output, 
             caption="A test table", 
             align = c("l", "c", "r")), 
      type="html")

A test table
	1st header	2nd header
1st row	Content A	Content B
2nd row	Content C	Content D

print(xtable(head(iris)), type = 'html', html.table.attributes = '')

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	5.10	3.50	1.40	0.20	setosa
2	4.90	3.00	1.40	0.20	setosa
3	4.70	3.20	1.30	0.20	setosa
4	4.60	3.10	1.50	0.20	setosa
5	5.00	3.60	1.40	0.20	setosa
6	5.40	3.90	1.70	0.40	setosa

6.5 `pixiedust` (best for PDF)

Remember that broom package we used earlier? We can make this table better…

tidy(fit)

term	estimate	std.error	statistic	p.value
(Intercept)	9.37	8.37	1.12	0.274
qsec	1.24	0.383	3.25	0.00317
factor(am)1	3.15	1.94	1.62	0.117
wt	-3.93	0.743	-5.29	1.58e-05
factor(gear)4	-0.268	1.66	-0.162	0.873
factor(gear)5	-0.27	2.06	-0.131	0.897

https://cran.r-project.org/web/packages/pixiedust/vignettes/pixiedust.html

http://www.suchanutter.net/pixiedust/index.html

dust(fit) %>% 
  sprinkle(cols = "term", 
           replace = c("Intercept", "Quarter Mile Time", "Automatic vs. Manual",
                       "Weight", "Gears: 4 vs. 3", "Gears: 5 vs 3")) %>%
  sprinkle(cols = c("estimate", "std.error", "statistic"),
           round = 3) %>% 
  sprinkle(cols = "p.value", fn = quote(pvalString(value))) %>% 
  sprinkle_colnames("Term", "Coefficient", "SE", "T-statistic", "P-value")

Term	Coefficient	SE	T-statistic	P-value
Intercept	9.365	8.373	1.118	0.27
Quarter Mile Time	1.245	0.383	3.252	0.003
Automatic vs. Manual	3.151	1.941	1.624	0.12
Weight	-3.926	0.743	-5.286	< 0.001
Gears: 4 vs. 3	-0.268	1.655	-0.162	0.87
Gears: 5 vs 3	-0.27	2.063	-0.131	0.9

7 Finally, fonts!

7.1 Your friend, `extrafont`

The extrafont package is the best place to start with fonts in R and ggplot.

library(extrafont)

Note: Follow all installation instructions from github!

You can access fonts on your local system in ggplot by using the theme() function to set the relevant properties of your figure.

demo_plt <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color=Species)) + 
  geom_point(size=3) +
  scale_color_discrete(labels=c("Setosa", "Versicolor", "Virginica")) + 
  labs(x="Sepal Length (cm)", y="Sepal Width (cm)", title="Fisher's Irises") 
demo_plt

Let’s try changing that font a bit:

demo_plt + theme(text=element_text(family="Optima"))

demo_plt + theme(text=element_text(family="Palatino"))

demo_plt + theme(text=element_text(family="Zapfino"))

Note that the specific name that you must use to refer to the font is not exactly obvious. It is not the font name, but rather the font’s “family name”. Because of some under-the-hood details about the way that fonts work, this is a bit different from what you might see in e.g. a font selection menu in Word. You can see a list of all the fonts that R knows about by using the fonttable() command:

fonttable() %>% glimpse()

Rows: 122
Columns: 10
$ package    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ afmfile    <chr> "Keyboard.afm.gz", "SFNSDisplay.afm.gz", "SFNSText.afm.gz"…
$ fontfile   <chr> "/System/Library/Fonts/Keyboard.ttf", "/System/Library/Fon…
$ FullName   <chr> ".Keyboard", "System Font", "System Font", "System Font It…
$ FamilyName <chr> ".Keyboard", "System Font", "System Font", "System Font", …
$ FontName   <chr> "-Keyboard", "-SFNSDisplay", "-SFNSText", "-SFNSText-Itali…
$ Bold       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
$ Italic     <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL…
$ Symbol     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
$ afmsymfile <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

fonttable() %>% head()

package	afmfile	fontfile	FullName	FamilyName	FontName	Bold	Italic	Symbol	afmsymfile
	Keyboard.afm.gz	/System/Library/Fonts/Keyboard.ttf	.Keyboard	.Keyboard	-Keyboard	FALSE	FALSE	FALSE
	SFNSDisplay.afm.gz	/System/Library/Fonts/SFNSDisplay.ttf	System Font	System Font	-SFNSDisplay	FALSE	FALSE	FALSE
	SFNSText.afm.gz	/System/Library/Fonts/SFNSText.ttf	System Font	System Font	-SFNSText	FALSE	FALSE	FALSE
	SFNSTextItalic.afm.gz	/System/Library/Fonts/SFNSTextItalic.ttf	System Font Italic	System Font	-SFNSText-Italic	FALSE	TRUE	FALSE
	Andale Mono.afm.gz	/Library/Fonts/Andale Mono.ttf	Andale Mono	Andale Mono	AndaleMono	FALSE	FALSE	FALSE
	Apple Braille.afm.gz	/System/Library/Fonts/Apple Braille.ttf	Apple Braille	Apple Braille	AppleBraille	FALSE	FALSE	FALSE

fonttable() %>% select(FullName, FamilyName, FontName) %>% head(20)

FullName	FamilyName	FontName
.Keyboard	.Keyboard	-Keyboard
System Font	System Font	-SFNSDisplay
System Font	System Font	-SFNSText
System Font Italic	System Font	-SFNSText-Italic
Andale Mono	Andale Mono	AndaleMono
Apple Braille	Apple Braille	AppleBraille
Apple Braille Outline 6 Dot	Apple Braille	AppleBraille-Outline6Dot
Apple Braille Outline 8 Dot	Apple Braille	AppleBraille-Outline8Dot
Apple Braille Pinpoint 6 Dot	Apple Braille	AppleBraille-Pinpoint6Dot
Apple Braille Pinpoint 8 Dot	Apple Braille	AppleBraille-Pinpoint8Dot
AppleMyungjo Regular	AppleMyungjo	AppleMyungjo
Arial Black	Arial Black	Arial-Black
Arial Bold Italic	Arial	Arial-BoldItalicMT
Arial Bold	Arial	Arial-BoldMT
Arial Italic	Arial	Arial-ItalicMT
Arial	Arial	ArialMT
Arial Narrow	Arial Narrow	ArialNarrow
Arial Narrow Bold	Arial Narrow	ArialNarrow-Bold
Arial Narrow Bold Italic	Arial Narrow	ArialNarrow-BoldItalic
Arial Narrow Italic	Arial Narrow	ArialNarrow-Italic

fonttable() %>% select(FamilyName) %>% distinct()

FamilyName

.Keyboard

System Font

Andale Mono

Apple Braille

AppleMyungjo

Arial Black

Arial

Arial Narrow

Arial Rounded MT Bold

Arial Unicode MS

Bitter

Bodoni Ornaments

Bodoni 72 Smallcaps

Brush Script MT

Calibri

Calibri Light

Comic Sans MS

Consolas

Courier New

DIN Alternate

DIN Condensed

Fondamento

Georgia

Gorni

Impact

Khmer Sangam MN

Lao Sangam MN

Lato Black

Lato

Lato Hairline

Lato Light

Luminari

Microsoft Sans Serif

Noto Sans

Noto Serif

Open Sans

Open Sans Extrabold

Open Sans Light

Open Sans Semibold

SILDoulos IPA93

SILManuscript IPA93

SILSophia IPA93

Tahoma

Tangerine

Times New Roman

Trattatello

Trebuchet MS

Verdana

Webdings

WeePeople

Wingdings

Wingdings 2

Wingdings 3

You’ll want to use the value in FamilyName.

7.2 Installing new fonts

This part is a bit beyond the scope of this class, but the upshot is that after you install new fonts in your computer, you should tell extrafont about it using the font_import() command.

7.3 Caveats

extrafont only knows about TrueType fonts (ones whose file ends in .ttf). If you have an OpenType font you want to use with R, your best bet probably involves the showtext package.

Lab 05: Fonts & Tables

CS631

Alison Hill & Steven Bedrick

1 Goals for Lab 05

2 TL;DR

3 knitr::kable

3.1 kable all tables everywhere

3.2 kable table in a chunk

3.3 Styled kable tables in a chunk

3.4 kable + kableExtra + formattable

3.5 tibble + kable + kableExtra

4 Markdown Tables

4.1 Simple table

4.2 Multi-line tables

4.3 Grid tables

4.4 Pipe tables

5 Making tables in R

5.1 dplyr

5.1.1 dplyr::select

5.1.2 dplyr::filter

5.1.3 dplyr::arrange

5.1.4 dplyr::distinct

5.1.5 dplyr::mutate

5.1.6 dplyr::summarise (or dplyr::summarize)

5.2 tidyr

5.2.1 tidyr::gather

5.2.2 tidyr::separate

5.2.3 tidyr::extract

5.2.4 tidyr::unite

5.2.5 tidyr::spread

5.2.6 Gather multiple sets of columns (gather() %>% separate() %>% spread())

5.3 broom