class: center, middle, inverse, title-slide

# Lab 02: BMI 5/625
## Working in the Tidyverse
### Alison Hill (w/ modifications by Steven Bedrick)

---

# Tidyverse basics

Last week, we covered some basics:

- `<-` (variable assignment)
- `%>%` (then...)
- `dplyr`, `ggplot2` (packages)
- `install.packages("dplyr")` (1x per machine)
- `library(dplyr)` (1x per work session)

---
class: center, middle, inverse

# 📇
## Let's review

---

# Data for today

We'll use data from the Museum of Modern Art (MoMA)

- Publicly available on [GitHub](https://github.com/MuseumofModernArt/collection)
- As analyzed by [fivethirtyeight.com](https://fivethirtyeight.com/features/a-nerds-guide-to-the-2229-paintings-at-moma/)
- And by [others](https://medium.com/@foe/here-s-a-roundup-of-how-people-have-used-our-data-so-far-80862e4ce220)

---

# Get the data

Use this code chunk to import my cleaned CSV file:

```r
library(readr)
moma <- read_csv("../data/artworks-cleaned.csv")
```

---

# Data wrangling so far

All functions from `dplyr` package

.pull-left[
From Last Week

- print a tibble
- `filter`
- `arrange`
- `mutate`
]

--

.pull-right[
From Lab 01

- `glimpse`
- `distinct`
- `count`
]

---
class: middle, center, inverse

![](../images/rladylego-pipe.jpg)

## Plus: `%>%`

*image courtesy [@LegoRLady](https://twitter.com/LEGO_RLady/status/986661916855754752)*

---
class: middle, center, inverse

# ⌛️
## Let's review some helpful functions for `filter`

---
class: inverse, bottom, center
background-image: url("../images/peapod.png")
background-size: 25%

## Base R + Tidyverse

---
class: middle, center, inverse

# 💡
## First:
## Logical Operators

---

```r
?base::Logic
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Operator </th>
   <th style="text-align:left;"> Description </th>
   <th style="text-align:left;"> Usage </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> &amp; </td>
   <td style="text-align:left;"> and </td>
   <td style="text-align:left;"> x &amp; y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> | </td>
   <td style="text-align:left;"> or </td>
   <td style="text-align:left;"> x | y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> xor </td>
   <td style="text-align:left;"> exactly x or y </td>
   <td style="text-align:left;"> xor(x, y) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ! </td>
   <td style="text-align:left;"> not </td>
   <td style="text-align:left;"> !x </td>
  </tr>
</tbody>
</table>

---

Logical or (`|`) is inclusive, so `x | y` really means:

* x or
* y or
* both x & y

Exclusive or (`xor`) is exclusive, so `xor(x, y)` really means:

* x or
* y...
* but not both x & y

```r
x <- c(0, 1, 0, 1)
y <- c(0, 0, 1, 1)
boolean_or <- x | y
exclusive_or <- xor(x, y)
cbind(x, y, boolean_or, exclusive_or)
```

```
     x y boolean_or exclusive_or
[1,] 0 0          0            0
[2,] 1 0          1            1
[3,] 0 1          1            1
[4,] 1 1          1            0
```

---
class: middle, center, inverse

# 💡
## Second:
## Comparisons

---

```r
?Comparison
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Operator </th>
   <th style="text-align:left;"> Description </th>
   <th style="text-align:left;"> Usage </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> &lt; </td>
   <td style="text-align:left;"> less than </td>
   <td style="text-align:left;"> x &lt; y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> &lt;= </td>
   <td style="text-align:left;"> less than or equal to </td>
   <td style="text-align:left;"> x &lt;= y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> &gt; </td>
   <td style="text-align:left;"> greater than </td>
   <td style="text-align:left;"> x &gt; y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> &gt;= </td>
   <td style="text-align:left;"> greater than or equal to </td>
   <td style="text-align:left;"> x &gt;= y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> == </td>
   <td style="text-align:left;"> exactly equal to </td>
   <td style="text-align:left;"> x == y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> != </td>
   <td style="text-align:left;"> not equal to </td>
   <td style="text-align:left;"> x != y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> %in% </td>
   <td style="text-align:left;"> group membership* </td>
   <td style="text-align:left;"> x %in% y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> is.na </td>
   <td style="text-align:left;"> is missing </td>
   <td style="text-align:left;"> is.na(x) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> !is.na </td>
   <td style="text-align:left;"> is not missing </td>
   <td style="text-align:left;"> !is.na(x) </td>
  </tr>
</tbody>
</table>

*(shortcut to using `|` repeatedly with `==`)

---

## Lab 02: Challenge 1 (`dplyr`)

1. How many paintings (rows) are in `moma`? How many variables (columns) are in `moma`?
1. What is the first painting acquired by MoMA? Which year? Which artist? What title?
    - *Hint: you may want to look into `select` + `arrange`*
1. What is the oldest painting in the collection? Which year? Which artist? What title? *(see above hint)*
1. How many distinct artists are there?
1. Which artist has the most paintings in the collection? How many paintings are by this artist?
1. How many paintings are by male vs female artists?

If you want more:

1. How many artists of each gender are there?
1. In what year were the most paintings acquired? Created?
1. In what year was the first painting by a (solo) female artist acquired? When was that painting created? Which artist? What title?
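---

## Operators inside `filter`: a sketch

The logical and comparison operators reviewed earlier are most often used inside `filter`. Here is a minimal sketch on a toy tibble; the name `art`, its columns, and its values are made up for illustration and are not the real `moma` columns.

```r
library(dplyr)

# Hypothetical toy data (NOT the real moma data)
art <- tibble(
  artist        = c("A", "B", "C", "D"),
  year_created  = c(1905, 1950, NA, 1990),
  year_acquired = c(1930, 1960, 1975, 1995)
)

# `&`: both conditions must hold; rows where a comparison is NA are dropped
both <- art %>% filter(year_created > 1900 & year_acquired < 1970)

# `%in%`: shortcut for artist == "A" | artist == "D"
picked <- art %>% filter(artist %in% c("A", "D"))

# `!is.na()`: keep only rows where year_created is not missing
known <- art %>% filter(!is.na(year_created))
```

Note that `filter` keeps only rows where the condition is `TRUE`, so a missing `year_created` silently drops a row from `both`.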
---

## New this week: `group_by`

Many `dplyr` verbs can be _grouped_

--

I.e., their operation can be performed on partitions of your data:

--

("average of `X`, _by_ `Y`")

--

Consider `summarise`:

```r
library(dplyr)
library(palmerpenguins)  # provides the `penguins` data

penguins %>%
  filter(!is.na(bill_length_mm)) %>%
  summarise(mean_length = mean(bill_length_mm))
```

```
# A tibble: 1 × 1
  mean_length
        <dbl>
1        43.9
```

---

## New this week: `group_by`

Many `dplyr` verbs can be _grouped_

I.e., their operation can be performed on partitions of your data:

("average of `X`, _by_ `Y`")

```r
penguins %>%
  filter(!is.na(bill_length_mm)) %>%
  group_by(species) %>%
  summarise(mean_length = mean(bill_length_mm))
```

```
# A tibble: 3 × 2
  species   mean_length
  <fct>           <dbl>
1 Adelie           38.8
2 Chinstrap        48.8
3 Gentoo           47.5
```

--

Most other `dplyr` verbs will "play nicely" with grouped data:

--

`arrange`, `slice`, `count`, `top_n`, etc.

---

## Under the hood

What does `group_by` actually _do_?

--

```r
penguins.grouped <- penguins %>% group_by(species)
penguins.grouped
```

--

```
# A tibble: 344 × 8
# Groups:   species [3]
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# … with 334 more rows, and 2 more variables: sex <fct>, year <int>
```

---

## Multiple Groups

"How many males and females of each species do we have?"

--

```r
penguins %>%
  group_by(species, sex) %>%
  tally()
```

--

Note that the resulting dataframe is still grouped by `species`!
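--

One way to check which grouping survives is `group_vars()`. The sketch below uses a toy stand-in for `penguins`; the name `birds` and its values are hypothetical, chosen only so the example is self-contained.

```r
library(dplyr)

# Hypothetical toy stand-in for `penguins`
birds <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  sex     = c("female", "male", "female", "male")
)

counts <- birds %>%
  group_by(species, sex) %>%
  tally()

# tally() peels off the last grouping variable, leaving `species`
group_vars(counts)

# ungroup() removes whatever grouping is left
group_vars(ungroup(counts))
```

If later steps should treat the counts as one plain table, adding `ungroup()` at the end of the pipeline is a common precaution.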
--

```r
penguins %>%
  group_by(species, sex)
```

```
# A tibble: 344 × 8
# Groups:   species, sex [8]
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# … with 334 more rows, and 2 more variables: sex <fct>, year <int>
```

---

## Lab 02: Challenge 1 (`dplyr`)

1. How many paintings (rows) are in `moma`? How many variables (columns) are in `moma`?
1. What is the first painting acquired by MoMA? Which year? Which artist? What title?
    - *Hint: you may want to look into `select` + `arrange`*
1. What is the oldest painting in the collection? Which year? Which artist? What title? *(see above hint)*
1. How many distinct artists are there?
1. Which artist has the most paintings in the collection? How many paintings are by this artist?
1. How many paintings are by male vs female artists?

If you want more:

1. How many artists of each gender are there?
1. In what year were the most paintings acquired? Created?
1. In what year was the first painting by a (solo) female artist acquired? When was that painting created? Which artist? What title?
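---

## Setup for the plotting slides

The plotting examples later in this deck use a small tibble called `moma_ex`, which the slides never define. Here is a minimal sketch that would recreate it, assuming it is exactly the four-row tibble printed on the "Old School" slide:

```r
library(tibble)

# Assumed reconstruction of the example data printed on the "Old School" slide
moma_ex <- tribble(
  ~painted, ~acquired, ~area, ~gender,
      1980,      1985,     3, "male",
      1990,      1995,     2, "male",
      2000,      2005,     1, "female",
      2010,      2015,     2, "female"
)

moma_ex
```

`tribble()` lets the rows be typed in the same layout as the printed table, which makes the values easy to check by eye.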
---

# From Last Week 2

From `ggplot2`:

- `aes(x = , y = )` (aesthetics)
- `aes(x = , y = , color = )` (add color)
- `aes(x = , y = , size = )` (add size)
- `+ facet_wrap(~ )` (faceting)

---

# "Old School" (Challenge 2)<sup>1</sup>

- Sketch the graphics below on paper, where the `x`-axis is variable `painted` and the `y`-axis is variable `acquired`

```
# A tibble: 4 × 4
  painted acquired  area gender
    <dbl>    <dbl> <dbl> <chr>
1    1980     1985     3 male
2    1990     1995     2 male
3    2000     2005     1 female
4    2010     2015     2 female
```

<!-- Copy to chalkboard/whiteboard -->

1. A scatter plot
1. A scatter plot where the `color` of the points corresponds to `gender`
1. A scatter plot where the `size` of the points corresponds to `area`
1. A version of (1), but with separate plots by gender

.footnote[
[1] Shamelessly borrowed with much appreciation to [Chester Ismay](https://ismayc.github.io/talks/ness-infer/slide_deck.html)
]

---

# 1. A scatterplot

```r
library(ggplot2)
ggplot(moma_ex, aes(painted, acquired)) +
  geom_point()
```

--

<img src="02-slides_files/figure-html/unnamed-chunk-18-1.png" width="80%" style="display: block; margin: auto;" />

---

# 2. `color` points by `gender`

```r
library(ggplot2)
ggplot(moma_ex, aes(painted, acquired, color = gender)) +
  geom_point()
```

--

<img src="02-slides_files/figure-html/unnamed-chunk-20-1.png" width="80%" style="display: block; margin: auto;" />

---

# 3. `size` points by `area`

```r
library(ggplot2)
ggplot(moma_ex, aes(painted, acquired, size = area)) +
  geom_point()
```

--

<img src="02-slides_files/figure-html/unnamed-chunk-22-1.png" width="80%" style="display: block; margin: auto;" />

---

# 4. Faceting

```r
library(ggplot2)
ggplot(moma_ex, aes(painted, acquired, color = gender)) +
  geom_point() +
  facet_wrap(~gender)
```

--

<img src="02-slides_files/figure-html/unnamed-chunk-24-1.png" width="80%" style="display: block; margin: auto;" />

---

# [The Five-Named Graphs](http://moderndive.com/3-viz.html#FiveNG)

- Scatterplot: `geom_point()`
- Line graph: `geom_line()`
- Histogram: `geom_histogram()`
- Boxplot: `geom_boxplot()`
- Bar graph: `geom_bar()` or `geom_col()` (see [Lab 01](../01-eda_hot_dogs.html))

---

# Lab 02: Plotting Challenges

Challenges 3-5 are in the [Lab 02 code-through](../02-moma.html)!

https://stevenbedrick.github.io/data-vis-labs-2022/02-moma.html

---
class: inverse, middle, center

# 📊
## Basics of `ggplot2` and `dplyr`:

[R4DS `ggplot2` chapter](http://r4ds.had.co.nz/data-visualisation.html)

[ModernDive `ggplot2` chapter](http://moderndive.com/3-viz.html)

[RStudio `ggplot2` Cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf)

[R4DS `dplyr` chapter](http://r4ds.had.co.nz/transform.html)

[ModernDive `dplyr` chapter](https://moderndive.com/3-wrangling.html)

[RStudio `dplyr` Cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf)