Lab 02: BMI 5/625

class: center, middle, inverse, title-slide

# Lab 02: BMI 5/625
## Working in the Tidyverse
### Alison Hill (w/ modifications by Steven Bedrick)

---

# Tidyverse basics

Last week, we covered some basics:

- `<-` (variable assignment)
- `%>%` (then...)
- `dplyr`, `ggplot2` (packages)
  - `install.packages("dplyr")` (1x per machine)
  - `library(dplyr)` (1x per work session)
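
As a quick refresher, here is a minimal sketch of assignment and the pipe working together. The vector `x` and the name `total` are made up for illustration and are not part of the lab data.

```r
library(dplyr)  # attaching dplyr also makes the %>% pipe available

# `<-` stores a value under a name (x is a made-up example vector)
x <- c(4, 9, 16)

# `%>%` passes the left-hand side in as the first argument of the next call,
# so this reads: "take x, then take square roots, then add them up"
total <- x %>% sqrt() %>% sum()

# The piped version is equivalent to the nested call
identical(total, sum(sqrt(x)))
```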

---
class: center, middle, inverse

# 📇

## Let's review

---
# Data for today

We'll use data from the Museum of Modern Art (MoMA)

- Publicly available on [GitHub](https://github.com/MuseumofModernArt/collection)
- As analyzed by [fivethirtyeight.com](https://fivethirtyeight.com/features/a-nerds-guide-to-the-2229-paintings-at-moma/)
- And by [others](https://medium.com/@foe/here-s-a-roundup-of-how-people-have-used-our-data-so-far-80862e4ce220)

---
# Get the data

Use this code chunk to import my cleaned CSV file:

```r
library(readr)
moma <- read_csv("../data/artworks-cleaned.csv")
```

---

# Data wrangling so far

All of these functions come from the `dplyr` package

.pull-left[
From Last Week

- print a tibble

- `filter`

- `arrange`

- `mutate`
]

.pull-right[
From Lab 01

- `glimpse`

- `distinct`

- `count`
]

---
class: middle, center, inverse

![](../images/rladylego-pipe.jpg)

## Plus: `%>%`

*image courtesy [@LegoRLady](https://twitter.com/LEGO_RLady/status/986661916855754752)*

---
class: middle, center, inverse

# ⌛️

## Let's review some helpful functions for `filter`

---
class: inverse, bottom, center
background-image: url("../images/peapod.png")
background-size: 25%

## Base R + Tidyverse

---
class: middle, center, inverse

#💡

## First:

## Logical Operators

---

```r
?base::Logic
```

---

Logical or (`|`) is inclusive, so `x | y` really means:

* x or 
* y or 
* both x & y

Exclusive or (`xor`) rules out the "both" case, so `xor(x, y)` really means:

* x or
* y...
* but not both x & y

```r
x <- c(0, 1, 0, 1)
y <- c(0, 0, 1, 1)
boolean_or <- x | y
exclusive_or <- xor(x, y)
cbind(x, y, boolean_or, exclusive_or)
```

```
     x y boolean_or exclusive_or
[1,] 0 0          0            0
[2,] 1 0          1            1
[3,] 0 1          1            1
[4,] 1 1          1            0
```
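
Inside `filter`, these operators combine conditions row by row. A small sketch, assuming a made-up tibble `pets` (hypothetical, not part of the lab data):

```r
library(dplyr)
library(tibble)

# Hypothetical data, for illustration only
pets <- tibble(
  name  = c("Ada", "Bo", "Cy", "Di"),
  furry = c(FALSE, TRUE, FALSE, TRUE),
  small = c(FALSE, FALSE, TRUE, TRUE)
)

# Inclusive or: keeps rows that are furry, small, or both
either <- pets %>% filter(furry | small)

# Exclusive or: keeps rows that are furry or small, but not both
only_one <- pets %>% filter(xor(furry, small))

# And: keeps rows where both conditions hold
both <- pets %>% filter(furry & small)
```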

---
class: middle, center, inverse

#💡

## Second:

## Comparisons

---

```r
?Comparison
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Operator </th>
   <th style="text-align:left;"> Description </th>
   <th style="text-align:left;"> Usage </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> &lt; </td>
   <td style="text-align:left;"> less than </td>
   <td style="text-align:left;"> x &lt; y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> &lt;= </td>
   <td style="text-align:left;"> less than or equal to </td>
   <td style="text-align:left;"> x &lt;= y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> &gt; </td>
   <td style="text-align:left;"> greater than </td>
   <td style="text-align:left;"> x &gt; y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> &gt;= </td>
   <td style="text-align:left;"> greater than or equal to </td>
   <td style="text-align:left;"> x &gt;= y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> == </td>
   <td style="text-align:left;"> exactly equal to </td>
   <td style="text-align:left;"> x == y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> != </td>
   <td style="text-align:left;"> not equal to </td>
   <td style="text-align:left;"> x != y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> %in% </td>
   <td style="text-align:left;"> group membership* </td>
   <td style="text-align:left;"> x %in% y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> is.na </td>
   <td style="text-align:left;"> is missing </td>
   <td style="text-align:left;"> is.na(x) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> !is.na </td>
   <td style="text-align:left;"> is not missing </td>
   <td style="text-align:left;"> !is.na(x) </td>
  </tr>
</tbody>
</table>

*(shortcut to using `|` repeatedly with `==`)
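
To make the footnote concrete, here is a hedged sketch comparing `%in%` with chained `==`, plus the missing-value helpers. The tibble `art` and its values are invented for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical data, for illustration only
art <- tibble(
  title  = c("A", "B", "C", "D"),
  medium = c("oil", "ink", NA, "acrylic")
)

# %in% checks membership in a set of values...
with_in <- art %>% filter(medium %in% c("oil", "acrylic"))

# ...which keeps the same rows as chaining == with |
with_or <- art %>% filter(medium == "oil" | medium == "acrylic")

# Comparing against NA with == gives NA rather than TRUE,
# so missing values need is.na() / !is.na()
missing_medium <- art %>% filter(is.na(medium))
known_medium   <- art %>% filter(!is.na(medium))
```

Both `with_in` and `with_or` keep titles A and D; `filter` drops any row where the condition evaluates to `NA`.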

---
## Lab 02: Challenge 1 (`dplyr`)

1. How many paintings (rows) are in `moma`? How many variables (columns) are in `moma`?
1. What is the first painting acquired by MoMA? Which year? Which artist? What title?
    - *Hint: you may want to look into `select` + `arrange`*
1. What is the oldest painting in the collection? Which year? Which artist? What title? *(see above hint)*
1. How many distinct artists are there?
1. Which artist has the most paintings in the collection? How many paintings are by this artist?
1. How many paintings are by male vs female artists?

If you want more:
1. How many artists of each gender are there?
1. In what year were the most paintings acquired? Created?
1. In what year was the first painting by a (solo) female artist acquired? When was that painting created? Which artist? What title?

---

## New this week: `group_by`

Many `dplyr` verbs can be _grouped_

I.e., their operation can be performed on partitions of your data:

("average of `X`, _by_ `Y`")

Consider `summarise`:

```r
penguins %>% filter(!is.na(bill_length_mm)) %>% 
  summarise(mean_length=mean(bill_length_mm))
```

```
# A tibble: 1 × 1
  mean_length
        <dbl>
1        43.9
```
---

## New this week: `group_by`

Many `dplyr` verbs can be _grouped_

I.e., their operation can be performed on partitions of your data:

("average of `X`, _by_ `Y`")

```r
penguins %>% filter(!is.na(bill_length_mm)) %>% 
  group_by(species) %>% 
  summarise(mean_length=mean(bill_length_mm))
```

```
# A tibble: 3 × 2
  species   mean_length
  <fct>           <dbl>
1 Adelie           38.8
2 Chinstrap        48.8
3 Gentoo           47.5
```

Most other `dplyr` verbs will "play nicely" with grouped data:

`arrange`, `slice`, `count`, `top_n`, etc.
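
A short sketch of two other verbs on grouped data, assuming a made-up tibble `birds` (hypothetical, standing in for `penguins`). It uses `slice_max`, which is the replacement for `top_n` in dplyr 1.0.0 and later.

```r
library(dplyr)
library(tibble)

# Hypothetical data, for illustration only
birds <- tibble(
  species = c("a", "a", "a", "b", "b"),
  mass    = c(10, 12, 11, 20, 18)
)

# Grouped slice: the heaviest bird within each species
heaviest <- birds %>%
  group_by(species) %>%
  slice_max(mass, n = 1)

# Grouped mutate: each bird's mass relative to its own species mean
centered <- birds %>%
  group_by(species) %>%
  mutate(diff_from_mean = mass - mean(mass)) %>%
  ungroup()
```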

---

## Under the hood

What does `group_by` actually _do_?

```r
penguins.grouped <- penguins %>% group_by(species)
penguins.grouped
```

```
# A tibble: 344 × 8
# Groups:   species [3]
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# … with 334 more rows, and 2 more variables: sex <fct>, year <int>
```
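
Apart from the `# Groups:` line, the printout looks the same as the ungrouped data: `group_by` does not reorder or split the rows, it records the grouping as metadata that later verbs consult. A minimal sketch for inspecting that metadata, using a made-up tibble `birds` (hypothetical):

```r
library(dplyr)
library(tibble)

# Hypothetical data, for illustration only
birds <- tibble(
  species = c("a", "b", "a"),
  mass    = c(10, 20, 12)
)
grouped <- birds %>% group_by(species)

group_vars(grouped)     # names of the grouping columns
n_groups(grouped)       # how many groups there are
group_keys(grouped)     # one row per group
is_grouped_df(grouped)  # TRUE for a grouped tibble

# The rows themselves are untouched: same order, same values
identical(grouped$mass, birds$mass)

# ungroup() removes the grouping metadata again
plain <- grouped %>% ungroup()
group_vars(plain)
```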

---

## Multiple Groups

"How many males and females of each species do we have?"

```r
penguins %>% group_by(species, sex) %>% tally()
```

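Note that after `tally`, the result is still grouped by `species`: only the last grouping variable (`sex`) is dropped. Before the tally, the data are grouped by both variables: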

```r
penguins %>% group_by(species, sex)
```

```
# A tibble: 344 × 8
# Groups:   species, sex [8]
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# … with 334 more rows, and 2 more variables: sex <fct>, year <int>
```
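
To check which grouping survives a tally, `group_vars` can be called on the result. A hedged sketch with a made-up tibble `birds` (hypothetical, standing in for `penguins`):

```r
library(dplyr)
library(tibble)

# Hypothetical data, for illustration only
birds <- tibble(
  species = c("a", "a", "a", "b", "b"),
  sex     = c("f", "m", "m", "f", "f")
)

counts <- birds %>%
  group_by(species, sex) %>%
  tally()

# Only the last grouping variable (sex) has been dropped
group_vars(counts)

# ungroup() clears whatever grouping remains
counts_flat <- counts %>% ungroup()

# count() groups, tallies, and returns ungrouped output in one step
counted <- birds %>% count(species, sex)
```

Calling `ungroup()` before further steps is a common defensive habit, since leftover grouping silently changes how later verbs behave.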

---
## Lab 02: Challenge 1 (`dplyr`)

---

# From Last Week  2

From `ggplot2`:

- `aes(x = , y = )` (aesthetics)
- `aes(x = , y = , color = )` (add color)
- `aes(x = , y = , size = )` (add size)
- `+ facet_wrap(~ )` (faceting)

---
# "Old School" (Challenge 2)<sup>1</sup>

- Sketch the graphics below on paper, where the `x`-axis is the variable `painted` and the `y`-axis is the variable `acquired`

```
# A tibble: 4 × 4
  painted acquired  area gender
    <dbl>    <dbl> <dbl> <chr> 
1    1980     1985     3 male  
2    1990     1995     2 male  
3    2000     2005     1 female
4    2010     2015     2 female
```

1. A scatter plot
1. A scatter plot where the `color` of the points corresponds to `gender`
1. A scatter plot where the `size` of the points corresponds to `area`
1. A version of (1), but with separate plots by gender
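
The plotting code on the following slides expects an object named `moma_ex`, but the slides do not show how it was built. One way to recreate it, assuming it is an ordinary tibble holding exactly the four rows printed above:

```r
library(tibble)

# Rebuild the small example table row by row (values copied from the printout)
moma_ex <- tribble(
  ~painted, ~acquired, ~area, ~gender,
      1980,      1985,     3, "male",
      1990,      1995,     2, "male",
      2000,      2005,     1, "female",
      2010,      2015,     2, "female"
)

moma_ex
```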

.footnote[
[1] Shamelessly borrowed with much appreciation to [Chester Ismay](https://ismayc.github.io/talks/ness-infer/slide_deck.html)
]

---

# 1. A scatterplot

```r
library(ggplot2)
ggplot(moma_ex, aes(painted, acquired)) + 
  geom_point()
```

---

# 2. `color` points by `gender`

```r
library(ggplot2)
ggplot(moma_ex, aes(painted, acquired, color = gender)) + 
  geom_point()
```

---

# 3. `size` points by `area`

```r
library(ggplot2)
ggplot(moma_ex, aes(painted, acquired, size = area)) + 
  geom_point()
```

---

# 4. Faceting

```r
library(ggplot2)
ggplot(moma_ex, aes(painted, acquired, color = gender)) + 
  geom_point() + facet_wrap(~gender)
```

--
<img src="02-slides_files/figure-html/unnamed-chunk-24-1.png" width="80%" style="display: block; margin: auto;" />

---

# [The Five-Named Graphs](http://moderndive.com/3-viz.html#FiveNG)

- Scatterplot: `geom_point()`
- Line graph: `geom_line()`
- Histogram: `geom_histogram()`
- Boxplot: `geom_boxplot()`
- Bar graph: `geom_bar()` or `geom_col()` (see [Lab 01](../01-eda_hot_dogs.html))
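
The two bar geoms differ in what they expect: `geom_bar()` counts rows for you, while `geom_col()` plots heights you have already computed. A minimal sketch, assuming two made-up tibbles `raw` and `counted` (hypothetical, not the MoMA data):

```r
library(ggplot2)
library(tibble)

# Hypothetical data: one row per painting
raw <- tibble(gender = c("male", "male", "female", "female", "female"))

# The same information, already counted
counted <- tibble(gender = c("female", "male"), n = c(3, 2))

# geom_bar() needs only x; it counts the rows in each category
p_bar <- ggplot(raw, aes(x = gender)) +
  geom_bar()

# geom_col() needs both x and y; bar heights come straight from the data
p_col <- ggplot(counted, aes(x = gender, y = n)) +
  geom_col()
```

Printing `p_bar` and `p_col` should draw bars of the same heights from the two differently shaped inputs.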

---
# Lab 02: Plotting Challenges

Challenges 3-5 are in the [Lab 02 code-through](../02-moma.html)!

https://stevenbedrick.github.io/data-vis-labs-2022/02-moma.html

---
class: inverse, middle, center

# 📊

## Basics of `ggplot2` and `dplyr`:

[R4DS `ggplot2` chapter](http://r4ds.had.co.nz/data-visualisation.html)

[ModernDive `ggplot2` chapter](http://moderndive.com/3-viz.html)

[RStudio `ggplot2` Cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf)

[R4DS `dplyr` chapter](http://r4ds.had.co.nz/transform.html)

[ModernDive `dplyr` chapter](https://moderndive.com/3-wrangling.html)

[RStudio `dplyr` Cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf)