class: center, middle, inverse, title-slide

.title[
# Lab 02: BMI 5/625
]
.subtitle[
## Working in the Tidyverse
]
.author[
### Alison Hill (w/ modifications by Steven Bedrick)
]

---

# Tidyverse basics

Last week, we covered some basics:

- `<-` (variable assignment)
- `%>%` (then...)
- `dplyr`, `ggplot2` (packages)
- `install.packages("dplyr")` (1x per machine)
- `library(dplyr)` (1x per work session)

---

# Data for today

We'll use data from the Museum of Modern Art (MoMA)

- Publicly available on [GitHub](https://github.com/MuseumofModernArt/collection)
- As analyzed by [fivethirtyeight.com](https://fivethirtyeight.com/features/a-nerds-guide-to-the-2229-paintings-at-moma/)
- And by [others](https://medium.com/@foe/here-s-a-roundup-of-how-people-have-used-our-data-so-far-80862e4ce220)

---

# Get the data

Use this code chunk to import my cleaned CSV file:

```r
library(readr)
moma <- read_csv("../data/artworks-cleaned.csv")
```

---

# Data wrangling: All functions from `dplyr` package

.pull-left[
A few basics:

- print a tibble
- `filter`
- `arrange`
- `mutate`
]

--

.pull-right[
From Lab 01

- `glimpse`
- `distinct`
- `count`
]

---

class: middle, center, inverse

![](../images/rladylego-pipe.jpg)

## Plus: `%>%`

*image courtesy [@LegoRLady](https://twitter.com/LEGO_RLady/status/986661916855754752)*

---

## Three core functions: `filter`

--

`filter` subsets data according to a _predicate_ (logical statement): rows where the predicate is `TRUE` are kept

--

- Use for things like "keep only subjects who are 18 years old or younger"

--

```r
peds <- all.patients %>% filter(age <= 18)
```

--

- Note that predicates can be as complex as you like (examples to come)

---

## Three core functions: `arrange`

--

`arrange` _sorts_ a dataframe by one or more columns

--

```r
peds <- peds %>% arrange(age)
```

--

- The default sort order is _ascending_ (smallest to largest); you can reverse this in two ways:

--

- The `desc()` function, and negation:

--

```r
# option 1:
peds <- peds %>% arrange(desc(age))
```

--

```r
# option 2:
peds <- peds
%>% arrange(-age)
```

---

## Three core functions: `mutate`

--

`mutate` adds a new column (or replaces an existing one)

--

```r
peds <- peds %>% mutate(age.in.months = age * 12)
```

--

```r
# convert height from feet to meters
peds <- peds %>% mutate(height = height * 0.305)
```

--

- Multiple columns can be worked on at the same time:

--

```r
peds <- peds %>% mutate(
  age.in.months = age * 12,
  is.school.age = age >= 5,
  height = height * 0.305
)
```

---

class: middle, center, inverse

# ⌛️

## Let's review some helpful functions for `filter`

---

class: inverse, bottom, center
background-image: url("../images/peapod.png")
background-size: 25%

## Base R + Tidyverse

---

class: middle, center, inverse

# 💡

## First:

## Logical Operators

---

```r
?base::Logic
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Operator </th>
   <th style="text-align:left;"> Description </th>
   <th style="text-align:left;"> Usage </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> & </td>
   <td style="text-align:left;"> and </td>
   <td style="text-align:left;"> x & y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> | </td>
   <td style="text-align:left;"> or </td>
   <td style="text-align:left;"> x | y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> xor </td>
   <td style="text-align:left;"> exactly x or y </td>
   <td style="text-align:left;"> xor(x, y) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ! </td>
   <td style="text-align:left;"> not </td>
   <td style="text-align:left;"> !x </td>
  </tr>
</tbody>
</table>

---

Logical or (`|`) is inclusive, so `x | y` really means:

* x or
* y or
* both x & y

Exclusive or (`xor`) is exclusive, so `xor(x, y)` really means:

* x or
* y...
* but not both x & y

```r
x <- c(0, 1, 0, 1)
y <- c(0, 0, 1, 1)
boolean_or <- x | y
exclusive_or <- xor(x, y)
cbind(x, y, boolean_or, exclusive_or)
```

```
     x y boolean_or exclusive_or
[1,] 0 0          0            0
[2,] 1 0          1            1
[3,] 0 1          1            1
[4,] 1 1          1            0
```

---

class: middle, center, inverse

# 💡

## Second:

## Comparisons

---

```r
?Comparison
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Operator </th>
   <th style="text-align:left;"> Description </th>
   <th style="text-align:left;"> Usage </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> < </td>
   <td style="text-align:left;"> less than </td>
   <td style="text-align:left;"> x < y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> <= </td>
   <td style="text-align:left;"> less than or equal to </td>
   <td style="text-align:left;"> x <= y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> > </td>
   <td style="text-align:left;"> greater than </td>
   <td style="text-align:left;"> x > y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> >= </td>
   <td style="text-align:left;"> greater than or equal to </td>
   <td style="text-align:left;"> x >= y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> == </td>
   <td style="text-align:left;"> exactly equal to </td>
   <td style="text-align:left;"> x == y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> != </td>
   <td style="text-align:left;"> not equal to </td>
   <td style="text-align:left;"> x != y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> %in% </td>
   <td style="text-align:left;"> group membership* </td>
   <td style="text-align:left;"> x %in% y </td>
  </tr>
  <tr>
   <td style="text-align:left;"> is.na </td>
   <td style="text-align:left;"> is missing </td>
   <td style="text-align:left;"> is.na(x) </td>
  </tr>
  <tr>
   <td style="text-align:left;"> !is.na </td>
   <td style="text-align:left;"> is not missing </td>
   <td style="text-align:left;"> !is.na(x) </td>
  </tr>
</tbody>
</table>

*(shortcut to using `|` repeatedly with `==`)

---

## Another level: `group_by`

Many `dplyr` verbs can be _grouped_

--

I.e., their operation can be performed on
partitions of your data:

--

("average of `X`, _by_ `Y`")

--

Consider `summarise`:

```r
penguins %>%
  filter(!is.na(bill_length_mm)) %>%
  summarise(mean_length=mean(bill_length_mm))
```

```
# A tibble: 1 × 1
  mean_length
        <dbl>
1        43.9
```

---

## New this week: `group_by`

Many `dplyr` verbs can be _grouped_

I.e., their operation can be performed on partitions of your data:

("average of `X`, _by_ `Y`")

```r
penguins %>%
  filter(!is.na(bill_length_mm)) %>%
  group_by(species) %>%
  summarise(mean_length=mean(bill_length_mm))
```

```
# A tibble: 3 × 2
  species   mean_length
  <fct>           <dbl>
1 Adelie           38.8
2 Chinstrap        48.8
3 Gentoo           47.5
```

--

Most other `dplyr` verbs will "play nicely" with grouped data:

--

`arrange`, `slice`, `count`, `top_n`, etc.

---

## Under the hood

What does `group_by` actually _do_?

--

```r
penguins.grouped <- penguins %>% group_by(species)
penguins.grouped
```

--

```
# A tibble: 344 × 8
# Groups:   species [3]
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
 1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
 2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
 3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
 4 Adelie  Torgersen           NA            NA           NA      NA <NA>   2007
 5 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
 6 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
 7 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
 8 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
 9 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
10 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
#   ²body_mass_g
```

---

## Multiple Groups

"How many males and females of each species do we have?"

--

```r
penguins %>%
  group_by(species, sex) %>%
  tally()
```

--

Note that the resulting dataframe is still grouped by `species`!
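To see the leftover grouping for yourself, here is a minimal sketch on a made-up tibble (`toy` is a hypothetical stand-in, not the real `penguins` data). It uses `group_vars()` to list the current grouping columns and `ungroup()` to drop them:

```r
library(dplyr)

# Hypothetical toy data standing in for `penguins` (not the real dataset)
toy <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"),
  sex     = c("male", "female", "female", "male", "male")
)

counts <- toy %>%
  group_by(species, sex) %>%
  tally()

# tally() peels off the last grouping variable (sex); species is still a group
group_vars(counts)

# ungroup() removes whatever grouping remains
counts_flat <- counts %>% ungroup()
group_vars(counts_flat)
```

If later steps should treat the counts as one ordinary table, adding `ungroup()` at the end of the pipeline is usually the safest habit.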
--

```r
penguins %>% group_by(species, sex)
```

```
# A tibble: 344 × 8
# Groups:   species, sex [8]
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
 1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
 2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
 3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
 4 Adelie  Torgersen           NA            NA           NA      NA <NA>   2007
 5 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
 6 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
 7 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
 8 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
 9 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
10 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
# … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
#   ²body_mass_g
```

---

## Lab 02: Challenge 1 (`dplyr`)

1. How many paintings (rows) are in `moma`? How many variables (columns) are in `moma`?
1. What is the first painting acquired by MoMA? Which year? Which artist? What title?
    - *Hint: you may want to look into `select` + `arrange`*
1. What is the oldest painting in the collection? Which year? Which artist? What title? *(see above hint)*
1. How many distinct artists are there?
1. Which artist has the most paintings in the collection? How many paintings are by this artist?
1. How many paintings are by male vs female artists?

If you want more:

1. How many artists of each gender are there?
1. In what year were the most paintings acquired? Created?
1. In what year was the first painting by a (solo) female artist acquired? When was that painting created? Which artist? What title?
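The `select` + `arrange` hint can be sketched on a made-up tibble. Everything in `toy_art` is invented for illustration, and its column names (`title`, `artist`, `year_acquired`) are assumptions that may not match the columns in `moma`:

```r
library(dplyr)

# Made-up rows for illustration only; not real MoMA records
toy_art <- tibble(
  title         = c("Work A", "Work B", "Work C"),
  artist        = c("Artist X", "Artist Y", "Artist Z"),
  year_acquired = c(1950, 1931, 1975)
)

earliest <- toy_art %>%
  select(artist, title, year_acquired) %>%  # keep only the columns of interest
  arrange(year_acquired) %>%                # sort so the earliest year comes first
  slice(1)                                  # keep just the top row

earliest
```

Swapping in `desc()` inside `arrange` would give the latest row instead of the earliest.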
---

# From Last Week 2

From `ggplot2`:

- `aes(x = , y = )` (aesthetics)
- `aes(x = , y = , color = )` (add color)
- `aes(x = , y = , size = )` (add size)
- `+ facet_wrap(~ )` (faceting)

---

# "Old School" (Challenge 2)<sup>1</sup>

- Sketch the graphics below on paper, where the `x`-axis is variable `painted` and the `y`-axis is variable `acquired`

```
# A tibble: 4 × 4
  painted acquired  area gender
    <dbl>    <dbl> <dbl> <chr>
1    1980     1985     3 male
2    1990     1995     2 male
3    2000     2005     1 female
4    2010     2015     2 female
```

<!-- Copy to chalkboard/whiteboard -->

1. A scatter plot
1. A scatter plot where the `color` of the points corresponds to `gender`
1. A scatter plot where the `size` of the points corresponds to `area`
1. A version of (1), but with separate plots by gender

.footnote[
[1] Shamelessly borrowed with much appreciation to [Chester Ismay](https://ismayc.github.io/talks/ness-infer/slide_deck.html)
]

---

# 1. A scatterplot

```r
library(ggplot2)
ggplot(moma_ex, aes(painted, acquired)) +
  geom_point()
```

--

<img src="02-slides_files/figure-html/unnamed-chunk-25-1.png" width="80%" style="display: block; margin: auto;" />

---

# 2. `color` points by `gender`

```r
library(ggplot2)
ggplot(moma_ex, aes(painted, acquired, color = gender)) +
  geom_point()
```

--

<img src="02-slides_files/figure-html/unnamed-chunk-27-1.png" width="80%" style="display: block; margin: auto;" />

---

# 3. `size` points by `area`

```r
library(ggplot2)
ggplot(moma_ex, aes(painted, acquired, size = area)) +
  geom_point()
```

--

<img src="02-slides_files/figure-html/unnamed-chunk-29-1.png" width="80%" style="display: block; margin: auto;" />

---

# 4.
Faceting

```r
library(ggplot2)
ggplot(moma_ex, aes(painted, acquired, color = gender)) +
  geom_point() +
  facet_wrap(~gender)
```

--

<img src="02-slides_files/figure-html/unnamed-chunk-31-1.png" width="80%" style="display: block; margin: auto;" />

---

# [The Five-Named Graphs](http://moderndive.com/3-viz.html#FiveNG)

- Scatterplot: `geom_point()`
- Line graph: `geom_line()`
- Histogram: `geom_histogram()`
- Boxplot: `geom_boxplot()`
- Bar graph: `geom_bar()` or `geom_col()` (see [Lab 01](../01-eda_hot_dogs.html))

---

# Lab 02: Plotting Challenges

Challenges 3-5 are in the [Lab 02 code-through](../02-moma.html)!

https://stevenbedrick.github.io/data-vis-labs-2023/02-moma.html

---

class: inverse, middle, center

# 📊

## Basics of `ggplot2` and `dplyr`:

[R4DS `ggplot2` chapter](http://r4ds.had.co.nz/data-visualisation.html)

[ModernDive `ggplot2` chapter](http://moderndive.com/3-viz.html)

[RStudio `ggplot2` Cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf)

[R4DS `dplyr` chapter](http://r4ds.had.co.nz/transform.html)

[ModernDive `dplyr` chapter](https://moderndive.com/3-wrangling.html)

[RStudio `dplyr` Cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf)