Lab 02: BMI 5/625

class: center, middle, inverse, title-slide

.title[
# Lab 02: BMI 5/625
]
.subtitle[
## Working in the Tidyverse
]
.author[
### Alison Hill (w/ modifications by Steven Bedrick)
]

---

# Tidyverse basics

Last week, we covered some basics:

- `<-` (variable assignment)
- `%>%` (then...)
- `dplyr`, `ggplot2` (packages)
  - `install.packages("dplyr")` (1x per machine)
  - `library(dplyr)` (1x per work session)
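
A minimal sketch of assignment and the pipe working together. The vector `squares` and the names `total` and `total_nested` are made up for illustration:

```r
library(dplyr)  # also makes the %>% pipe available

# `<-` stores a value under a name
squares <- c(1, 4, 9)

# `%>%` passes the left-hand side as the first argument of the next call,
# so this reads "take squares, then take square roots, then sum"
total <- squares %>% sqrt() %>% sum()

# the same computation written as nested calls
total_nested <- sum(sqrt(squares))
```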

---
# Data for today

We'll use data from the Museum of Modern Art (MoMA)

- Publicly available on [GitHub](https://github.com/MuseumofModernArt/collection)
- As analyzed by [fivethirtyeight.com](https://fivethirtyeight.com/features/a-nerds-guide-to-the-2229-paintings-at-moma/)
- And by [others](https://medium.com/@foe/here-s-a-roundup-of-how-people-have-used-our-data-so-far-80862e4ce220)

---
# Get the data

Use this code chunk to import my cleaned CSV file:

```r
library(readr)
moma <- read_csv("../data/artworks-cleaned.csv")
```

---

# Data wrangling:

All of these functions come from the `dplyr` package

.pull-left[
A few basics:

- print a tibble

- `filter`

- `arrange`

- `mutate`
]

.pull-right[
From Lab 01

- `glimpse`

- `distinct`

- `count`
]

---
class: middle, center, inverse

![](../images/rladylego-pipe.jpg)

## Plus: `%>%`

*image courtesy [@LegoRLady](https://twitter.com/LEGO_RLady/status/986661916855754752)*

---

## Three core functions: `filter`

`filter` subsets data according to a _predicate_ (logical statement)

- Use for things like "keep only subjects younger than 18 years"

```r
peds <- all.patients %>% filter(age < 18)
```

- Note that predicates can be as complex as you like (examples to come)
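
One detail worth seeing in action: `filter` keeps only rows where the predicate is `TRUE`, so rows where it evaluates to `NA` are dropped too. A small sketch, assuming a hypothetical `all.patients` table with invented values:

```r
library(dplyr)

# hypothetical patient table; the values are made up for illustration
all.patients <- tibble(
  id  = 1:4,
  age = c(5, 17, NA, 40)
)

# keeps rows where the predicate is TRUE;
# rows where it is FALSE *or* NA are dropped
peds <- all.patients %>% filter(age < 18)

# to keep rows with a missing age as well, say so explicitly
peds_or_unknown <- all.patients %>% filter(age < 18 | is.na(age))
```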

---

## Three core functions: `arrange`

`arrange` _sorts_ a dataframe by one or more columns

```r
peds <- peds %>% arrange(age)
```

- The default sort order is _ascending_ (smallest to largest); you can reverse it in two ways: with the `desc()` function, or by negating a numeric column:

```r
# option 1:
peds <- peds %>% arrange(desc(age))
```

```r
# option 2:
peds <- peds %>% arrange(-age)
```
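
`arrange` also accepts several columns, using later ones to break ties. A sketch on a made-up `peds` table (names and ages are invented):

```r
library(dplyr)

# hypothetical data; values invented for illustration
peds <- tibble(
  name = c("Ana", "Ben", "Cy", "Dee"),
  age  = c(7, 12, 7, 3)
)

# sort by age (oldest first), breaking ties alphabetically by name
by_age_then_name <- peds %>% arrange(desc(age), name)

# desc() also works on character columns, where `-name` would fail
reverse_alpha <- peds %>% arrange(desc(name))
```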

---

## Three core functions: `mutate`

`mutate` adds a new column (or replaces an existing one)

```r
peds <- peds %>% mutate(age.in.months = age * 12)
```

```r
# convert to meters from feet
peds <- peds %>% mutate(height = height * 0.305)
```

- Multiple columns can be worked on at the same time:

```r
peds <- peds %>% mutate(
    age.in.months = age * 12, 
    is.school.age = age >= 5,
    height = height * 0.305
  )
```
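
Within a single `mutate` call, later expressions can use columns created earlier in the same call. A short sketch with made-up ages (the 60-month cutoff simply restates `age >= 5`):

```r
library(dplyr)

peds <- tibble(age = c(2, 6, 11))  # made-up ages in years

peds <- peds %>% mutate(
  age.in.months = age * 12,
  # age.in.months was created on the previous line of the same call
  is.school.age = age.in.months >= 60
)
```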

---

class: middle, center, inverse

# ⌛️

## Let's review some helpful functions for `filter`

---
class: inverse, bottom, center
background-image: url("../images/peapod.png")
background-size: 25%

## Base R + Tidyverse

---
class: middle, center, inverse

# 💡

## First:

## Logical Operators

---

```r
?base::Logic
```

---

Logical or (`|`) is inclusive, so `x | y` really means:

* x or 
* y or 
* both x & y

Exclusive or (`xor`) is exclusive, so `xor(x, y)` really means:

* x or
* y...
* but not both x & y

```r
x <- c(0, 1, 0, 1)
y <- c(0, 0, 1, 1)
boolean_or <- x | y
exclusive_or <- xor(x, y)
cbind(x, y, boolean_or, exclusive_or)
```

```
     x y boolean_or exclusive_or
[1,] 0 0          0            0
[2,] 1 0          1            1
[3,] 0 1          1            1
[4,] 1 1          1            0
```
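
The same operators can be used inside `filter`. A sketch on a hypothetical `patients` table (values invented for illustration):

```r
library(dplyr)

# hypothetical data; values invented for illustration
patients <- tibble(
  age    = c(4, 10, 16, 30),
  smoker = c(FALSE, FALSE, TRUE, TRUE)
)

# comma-separated conditions are combined with & (and)
a <- patients %>% filter(age < 18, !smoker)
b <- patients %>% filter(age < 18 & !smoker)

# | keeps rows matching either condition (or both)
either <- patients %>% filter(age < 18 | smoker)

# xor() keeps rows matching exactly one of the two conditions
exactly_one <- patients %>% filter(xor(age < 18, smoker))
```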

---
class: middle, center, inverse

# 💡

## Second:

## Comparisons

---

```r
?Comparison
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Operator </th>
   <th style="text-align:left;"> Description </th>
   <th style="text-align:left;"> Usage </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> &lt; </td>
   <td style="text-align:left;"> less than </td>
   <td style="text-align:left;"> x &lt; y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> &lt;= </td>
   <td style="text-align:left;"> less than or equal to </td>
   <td style="text-align:left;"> x &lt;= y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> &gt; </td>
   <td style="text-align:left;"> greater than </td>
   <td style="text-align:left;"> x &gt; y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> &gt;= </td>
   <td style="text-align:left;"> greater than or equal to </td>
   <td style="text-align:left;"> x &gt;= y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> == </td>
   <td style="text-align:left;"> exactly equal to </td>
   <td style="text-align:left;"> x == y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> != </td>
   <td style="text-align:left;"> not equal to </td>
   <td style="text-align:left;"> x != y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> %in% </td>
   <td style="text-align:left;"> group membership* </td>
   <td style="text-align:left;"> x %in% y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> is.na </td>
   <td style="text-align:left;"> is missing </td>
   <td style="text-align:left;"> is.na(x) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> !is.na </td>
   <td style="text-align:left;"> is not missing </td>
   <td style="text-align:left;"> !is.na(x) </td>
  </tr>
</tbody>
</table>

*(shortcut to using `|` repeatedly with `==`)
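
To make the `%in%` shortcut concrete, here is a small base R sketch. The vector `species` is made up for illustration:

```r
species <- c("Adelie", "Gentoo", "Chinstrap", "Adelie")

# repeated == joined with |
long_form  <- species == "Adelie" | species == "Gentoo"

# the same test with %in%
short_form <- species %in% c("Adelie", "Gentoo")

identical(long_form, short_form)
```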

---

## Another level: `group_by`

Many `dplyr` verbs can be _grouped_

I.e., their operation can be performed on partitions of your data:

("average of `X`, _by_ `Y`")

Consider `summarise`:

```r
penguins %>% filter(!is.na(bill_length_mm)) %>% 
  summarise(mean_length=mean(bill_length_mm))
```

```
# A tibble: 1 × 1
  mean_length
        <dbl>
1        43.9
```
---

## New this week: `group_by`

Many `dplyr` verbs can be _grouped_

I.e., their operation can be performed on partitions of your data:

("average of `X`, _by_ `Y`")

```r
penguins %>% filter(!is.na(bill_length_mm)) %>% 
  group_by(species) %>% 
  summarise(mean_length=mean(bill_length_mm))
```

```
# A tibble: 3 × 2
  species   mean_length
  <fct>           <dbl>
1 Adelie           38.8
2 Chinstrap        48.8
3 Gentoo           47.5
```

Most other `dplyr` verbs will "play nicely" with grouped data:

`arrange`, `slice`, `count`, `top_n`, etc.
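
A grouped `summarise` can compute several statistics at once. The sketch below uses a toy stand-in for `penguins` (the numbers are invented, not taken from the real data):

```r
library(dplyr)

# toy stand-in for `penguins`; the numbers are invented for illustration
toy <- tibble(
  species        = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  bill_length_mm = c(38, 40, 46, 48, NA)
)

by_species <- toy %>%
  filter(!is.na(bill_length_mm)) %>%
  group_by(species) %>%
  summarise(
    n           = n(),                  # rows per group
    mean_length = mean(bill_length_mm),
    max_length  = max(bill_length_mm)
  )
```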

---

## Under the hood

What does `group_by` actually _do_?

```r
penguins.grouped <- penguins %>% group_by(species)
penguins.grouped
```

```
# A tibble: 344 × 8
# Groups:   species [3]
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
 1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
 2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
 3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
 4 Adelie  Torgersen           NA            NA           NA      NA <NA>   2007
 5 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
 6 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
 7 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
 8 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
 9 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
10 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
#   ²body_mass_g
```
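
As the printout suggests, `group_by` leaves the rows alone and only attaches grouping information (the `# Groups:` line). A sketch of how to inspect and remove that information, using a made-up `toy` table:

```r
library(dplyr)

# toy table; values invented for illustration
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo"),
  year    = c(2007, 2008, 2007)
)

toy.grouped <- toy %>% group_by(species)

group_vars(toy.grouped)          # names of the grouping columns
n_groups(toy.grouped)            # number of groups
nrow(toy.grouped) == nrow(toy)   # the rows themselves are untouched

# ungroup() removes the grouping information again
toy.ungrouped <- toy.grouped %>% ungroup()
```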

---

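## Multiple Groups

"How many males and females of each species do we have?"

```r
penguins %>% group_by(species, sex) %>% tally()
```

Note that the result of `tally()` is still grouped by `species`: only the last grouping variable (`sex`) is dropped. Before tallying, the data is grouped by both variables: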

```r
penguins %>% group_by(species, sex)
```

```
# A tibble: 344 × 8
# Groups:   species, sex [8]
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
 1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
 2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
 3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
 4 Adelie  Torgersen           NA            NA           NA      NA <NA>   2007
 5 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
 6 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
 7 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
 8 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
 9 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
10 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
#   ²body_mass_g
```
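
The note about `tally` keeping one level of grouping can be checked on a toy table (values invented for illustration). `count()` is shown as one possible alternative when no leftover grouping is wanted:

```r
library(dplyr)

# toy table; values invented for illustration
toy <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo"),
  sex     = c("male", "female", "female", "male")
)

counts <- toy %>% group_by(species, sex) %>% tally()
group_vars(counts)       # only `species` remains as a grouping column

# count() on ungrouped data gives the same numbers with no grouping left over
counts_flat <- toy %>% count(species, sex)
group_vars(counts_flat)
```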

---
## Lab 02: Challenge 1 (`dplyr`)

1. How many paintings (rows) are in `moma`? How many variables (columns) are in `moma`?
1. What is the first painting acquired by MoMA? Which year? Which artist? What title?
    - *Hint: you may want to look into `select` + `arrange`*
1. What is the oldest painting in the collection? Which year? Which artist? What title? *(see above hint)*
1. How many distinct artists are there?
1. Which artist has the most paintings in the collection? How many paintings are by this artist?
1. How many paintings are by male vs female artists?

If you want more:
1. How many artists of each gender are there?
1. In what year were the most paintings acquired? Created?
1. In what year was the first painting by a (solo) female artist acquired? When was that painting created? Which artist? What title?

---

# From Last Week  2

From `ggplot2`:

- `aes(x = , y = )` (aesthetics)
- `aes(x = , y = , color = )` (add color)
- `aes(x = , y = , size = )` (add size)
- `+ facet_wrap(~ )` (faceting)

---
# "Old School" (Challenge 2)<sup>1</sup>

- Sketch the graphics listed below on paper, where the `x`-axis is the variable `painted` and the `y`-axis is the variable `acquired`

```
# A tibble: 4 × 4
  painted acquired  area gender
    <dbl>    <dbl> <dbl> <chr> 
1    1980     1985     3 male  
2    1990     1995     2 male  
3    2000     2005     1 female
4    2010     2015     2 female
```

1. A scatter plot
1. A scatter plot where the `color` of the points corresponds to `gender`
1. A scatter plot where the `size` of the points corresponds to `area`
1. A version of (1), but with separate plots by gender
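
To try these on screen afterwards, the four example rows can be typed in by hand. This is one possible way to build the table; the name `moma_ex` is an assumption, chosen to match the plotting code on the following slides:

```r
library(tibble)

# the four example rows shown in the table above
moma_ex <- tribble(
  ~painted, ~acquired, ~area, ~gender,
      1980,      1985,     3, "male",
      1990,      1995,     2, "male",
      2000,      2005,     1, "female",
      2010,      2015,     2, "female"
)
```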

.footnote[
[1] Shamelessly borrowed with much appreciation to [Chester Ismay](https://ismayc.github.io/talks/ness-infer/slide_deck.html)
]

---

# 1. A scatterplot

```r
library(ggplot2)
ggplot(moma_ex, aes(painted, acquired)) + 
  geom_point()
```

---

# 2. `color` points by `gender`

```r
library(ggplot2)
ggplot(moma_ex, aes(painted, acquired, color = gender)) + 
  geom_point()
```

---

# 3. `size` points by `area`

```r
library(ggplot2)
ggplot(moma_ex, aes(painted, acquired, size = area)) + 
  geom_point()
```

---

# 4. Faceting

```r
library(ggplot2)
ggplot(moma_ex, aes(painted, acquired, color = gender)) + 
  geom_point() + facet_wrap(~gender)
```

--
<img src="02-slides_files/figure-html/unnamed-chunk-31-1.png" width="80%" style="display: block; margin: auto;" />

---

# [The Five-Named Graphs](http://moderndive.com/3-viz.html#FiveNG)

- Scatterplot: `geom_point()`
- Line graph: `geom_line()`
- Histogram: `geom_histogram()`
- Boxplot: `geom_boxplot()`
- Bar graph: `geom_bar()` or `geom_col()` (see [Lab 01](../01-eda_hot_dogs.html))
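
The two bar-graph geoms differ in where the bar heights come from. A sketch with two made-up data frames (`paintings` has one row per painting, `totals` has counts already computed):

```r
library(ggplot2)

# made-up example rows
paintings <- data.frame(gender = c("male", "male", "female", "female"))
totals    <- data.frame(gender = c("female", "male"), n = c(2, 2))

# geom_bar() counts the rows in each category for you
p_bar <- ggplot(paintings, aes(x = gender)) + geom_bar()

# geom_col() plots heights you have already computed
p_col <- ggplot(totals, aes(x = gender, y = n)) + geom_col()
```

Both plots should show the same two bars, since the counts in `totals` match the rows in `paintings`.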

---
# Lab 02: Plotting Challenges

Challenges 3-5 are in the [Lab 02 code-through](../02-moma.html)!

https://stevenbedrick.github.io/data-vis-labs-2023/02-moma.html

---
class: inverse, middle, center

# 📊

## Basics of `ggplot2` and `dplyr`:

[R4DS `ggplot2` chapter](http://r4ds.had.co.nz/data-visualisation.html)

[ModernDive `ggplot2` chapter](http://moderndive.com/3-viz.html)

[RStudio `ggplot2` Cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf)

[R4DS `dplyr` chapter](http://r4ds.had.co.nz/transform.html)

[ModernDive `dplyr` chapter](https://moderndive.com/3-wrangling.html)

[RStudio `dplyr` Cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf)