class: center, middle, inverse, title-slide .title[ # Lab 03: BMI 5/625 ] .subtitle[ ## Working with Data ] .author[ ### Alison Hill, updates by Steven Bedrick ] --- # Plan for today - Refresher on Base R types and operators -- - Tying it together with `dplyr` -- - `ggplot` aesthetics and colors -- - Lab workbook --- # Data for today We'll use data from [Wordbank](http://wordbank.stanford.edu), an open-source database of children's vocabulary development. The tool used to measure children's language and communicative development in this database is the [MacArthur-Bates Communicative Development Inventories (MB-CDI)](http://mb-cdi.stanford.edu). The MB-CDI is a parent-reported questionnaire. - R package [`wordbankr`](https://cran.r-project.org/web/packages/wordbankr/index.html) - [`wordbankr` vignette](https://cran.r-project.org/web/packages/wordbankr/vignettes/wordbankr.html) - More about [Wordbank](http://wordbank.stanford.edu) - More about [MB-CDI](http://mb-cdi.stanford.edu) --- # Get the data Use this code chunk to import my cleaned CSV file: ```r library(readr) sounds <- read_csv("http://bit.ly/cs631-meow") ``` --- class: inverse, middle, center <img src="../images/r-data-types.png" width="65%" style="display: block; margin: auto;" /> ## RStudio Base R Cheatsheet https://github.com/rstudio/cheatsheets/blob/master/base-r.pdf --- ## Know your data types * Numeric (2 subtypes) - Integer (`1L, 50L`; a bare `1` is stored as a double) - Double (`1.5, 50.25`, `?double`) * Character (`"hello"`) * Factor (`grade = "A" | grade = "B"`) * Logical (`TRUE | FALSE`) -- ```r typeof(sounds$age) ``` ``` [1] "double" ``` ```r typeof(sounds$sound) ``` ``` [1] "character" ``` ```r typeof(sounds$sound == "meow") ``` ``` [1] "logical" ``` --- # Even better: `glimpse` ```r library(dplyr) glimpse(sounds) ``` ``` Rows: 33 Columns: 4 $ age <dbl> 8, 8, 8, 9, 9, 9, 10, 10, 10, 11, 11, 11, 12, 12, 12, 13,… $ sound <chr> "cockadoodledoo", "meow", "woof woof", "cockadoodledoo", … $ kids_produce <dbl> 1, 0, 3, 0, 2, 2, 0, 5, 4, 0, 5, 12, 0, 
12, 28, 9, 125, 2… $ kids_respond <dbl> 35, 35, 35, 91, 93, 93, 139, 145, 143, 94, 94, 94, 141, 1… ``` --- # `sounds` (a subset) - `age`: child age in months - `sound`: a string describing a type of animal sound - `kids_produce`: the number of parents who answered "yes, my child produces this animal sound" - `kids_respond`: the number of parents who responded to this question at all <table> <thead> <tr> <th style="text-align:right;"> age </th> <th style="text-align:left;"> sound </th> <th style="text-align:right;"> kids_produce </th> <th style="text-align:right;"> kids_respond </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> cockadoodledoo </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 35 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> meow </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 35 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> woof woof </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 35 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> cockadoodledoo </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 91 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> meow </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 93 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> woof woof </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 93 </td> </tr> </tbody> </table> --- # Data types <img src="http://r4ds.had.co.nz/diagrams/data-structures-overview.png" width="65%" style="display: block; margin: auto;" /> --- # Lists? Vectors? -- What is `2`, anyway? ```r typeof(2) ``` ``` [1] "double" ``` -- ```r length(2) ``` ``` [1] 1 ``` -- `2` is really a vector of length 1! 
```r is.vector(2) ``` ``` [1] TRUE ``` --- # Lists? Vectors? What about longer vectors? -- ```r typeof(c(2,2)) ``` ``` [1] "double" ``` ```r length(c(2,2)) ``` ``` [1] 2 ``` --- # Lists? Vectors? _Atomic Vectors_ in `R` hold elements that are all of the same _type_ (all numbers, all logicals, etc.) -- _Lists_ are vectors whose items can be of _different_ types: ```r my.list <- list(c(1,2,3), c(TRUE, FALSE, TRUE)) my.list ``` ``` [[1]] [1] 1 2 3 [[2]] [1] TRUE FALSE TRUE ``` --- # Lists? Vectors? _Atomic Vectors_ in `R` hold elements that are all of the same _type_ (all numbers, all logicals, etc.) _Lists_ are vectors whose items can be of _different_ types: ```r my.list <- list(x=c(1,2,3), y=c(TRUE, FALSE, TRUE)) my.list ``` ``` $x [1] 1 2 3 $y [1] TRUE FALSE TRUE ``` It is usually better to give names to list elements... -- - Note: a data frame is just a list where each element is: - An atomic vector - Of uniform length --- # Lists? Vectors? Many operations in R are _vectorized_ (i.e., operate on an entire vector all in one go): -- Adding a "scalar" (i.e., a vector of length 1) to a vector: ```r some.numbers <- c(1,2,3) some.numbers + 1 ``` ``` [1] 2 3 4 ``` -- Adding two vectors of identical length: ```r some.numbers + c(2,2,2) ``` ``` [1] 3 4 5 ``` --- # Lists? Vectors? What if there is a length mismatch? -- ```r some.numbers + c(8,8,8,8) ``` ``` [1] 9 10 11 9 ``` -- What's going on? -- The shorter vector is _recycled_ until it matches the length of the longer vector: -- ```r c(1,2,3,1) + c(8,8,8,8) ``` ``` [1] 9 10 11 9 ``` -- (Note that R gives a _warning_ when the longer length is not a multiple of the shorter one, as it is here) -- This is how adding a scalar to a vector works, under the hood! --- # Lists? Vectors? Vectorization is very useful: ```r y_hat <- predict(some.model) residuals <- my.data$y - y_hat ``` -- Many things that would use _loops_ in other languages can be done using vectorized operations in R. -- Many functions operate on vectors: -- - `mean()`, `sum()`, etc. 
-- One way to think of them: factories that convert atomic vectors of length `n` into atomic vectors of length 1! --- # Data wrangling with `dplyr` .pull-left[ From previous: - `group_by` - `summarize` ] -- .pull-right[ Adding to your arsenal... - `filter` - `arrange` - `mutate` - `glimpse` - `distinct` - `count` - `tally` - `pull` - `top_n` ] --- class: middle, center, inverse # 😈 ## More on `mutate` --- # 3 ways to `mutate` 1. <font color="#ED1941">Create a new variable with a specific value</font> 1. Create a new variable based on other variables 1. Change an existing variable -- ```r sounds %>% mutate(form = "WS") ``` ``` # A tibble: 33 × 5 age sound kids_produce kids_respond form <dbl> <chr> <dbl> <dbl> <chr> 1 8 cockadoodledoo 1 35 WS 2 8 meow 0 35 WS 3 8 woof woof 3 35 WS 4 9 cockadoodledoo 0 91 WS 5 9 meow 2 93 WS 6 9 woof woof 2 93 WS 7 10 cockadoodledoo 0 139 WS 8 10 meow 5 145 WS 9 10 woof woof 4 143 WS 10 11 cockadoodledoo 0 94 WS # … with 23 more rows ``` --- # 3 ways to `mutate` 1. Create a new variable with a specific value 1. <font color="#ED1941">Create a new variable based on other variables</font> 1. Change an existing variable -- ```r sounds %>% mutate(prop_produce = kids_produce / kids_respond) ``` ``` # A tibble: 33 × 5 age sound kids_produce kids_respond prop_produce <dbl> <chr> <dbl> <dbl> <dbl> 1 8 cockadoodledoo 1 35 0.0286 2 8 meow 0 35 0 3 8 woof woof 3 35 0.0857 4 9 cockadoodledoo 0 91 0 5 9 meow 2 93 0.0215 6 9 woof woof 2 93 0.0215 7 10 cockadoodledoo 0 139 0 8 10 meow 5 145 0.0345 9 10 woof woof 4 143 0.0280 10 11 cockadoodledoo 0 94 0 # … with 23 more rows ``` --- # 3 ways to `mutate` 1. Create a new variable with a specific value 1. Create a new variable based on other variables 1. 
<font color="#ED1941">Change an existing variable</font> -- ```r sounds %>% mutate(prop_produce = kids_produce / kids_respond) %>% mutate(prop_produce = prop_produce * 100) ``` ``` # A tibble: 33 × 5 age sound kids_produce kids_respond prop_produce <dbl> <chr> <dbl> <dbl> <dbl> 1 8 cockadoodledoo 1 35 2.86 2 8 meow 0 35 0 3 8 woof woof 3 35 8.57 4 9 cockadoodledoo 0 91 0 5 9 meow 2 93 2.15 6 9 woof woof 2 93 2.15 7 10 cockadoodledoo 0 139 0 8 10 meow 5 145 3.45 9 10 woof woof 4 143 2.80 10 11 cockadoodledoo 0 94 0 # … with 23 more rows ``` --- class: middle, center, inverse # ⌛️ ## Let's review some helpful functions for `mutate` + `summarize` --- class: inverse, bottom, center background-image: url("../images/peapod.png") background-size: 25% ## Remember: ## Base R + Tidyverse --- class: middle, center, inverse #💡 ## First: ## Arithmetic *especially useful for* `mutate` See: http://r4ds.had.co.nz/transform.html#mutate-funs --- ```r ?Arithmetic ``` <table> <thead> <tr> <th style="text-align:left;"> Operator </th> <th style="text-align:left;"> Description </th> <th style="text-align:left;"> Usage </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> + </td> <td style="text-align:left;"> addition </td> <td style="text-align:left;"> x + y </td> </tr> <tr> <td style="text-align:left;"> - </td> <td style="text-align:left;"> subtraction </td> <td style="text-align:left;"> x - y </td> </tr> <tr> <td style="text-align:left;"> * </td> <td style="text-align:left;"> multiplication </td> <td style="text-align:left;"> x * y </td> </tr> <tr> <td style="text-align:left;"> / </td> <td style="text-align:left;"> division </td> <td style="text-align:left;"> x / y </td> </tr> <tr> <td style="text-align:left;"> ^ </td> <td style="text-align:left;"> raised to the power of </td> <td style="text-align:left;"> x ^ y </td> </tr> <tr> <td style="text-align:left;"> abs </td> <td style="text-align:left;"> absolute value </td> <td style="text-align:left;"> abs(x) </td> </tr> <tr> <td style="text-align:left;"> %/% </td> <td 
style="text-align:left;"> integer division </td> <td style="text-align:left;"> x %/% y </td> </tr> <tr> <td style="text-align:left;"> %% </td> <td style="text-align:left;"> remainder after division </td> <td style="text-align:left;"> x %% y </td> </tr> </tbody> </table> ```r 5 %/% 2 # 2 goes into 5 two times with... ``` ``` [1] 2 ``` ```r 5 %% 2 # 1 left over ``` ``` [1] 1 ``` --- class: middle, center, inverse #💡 ## Second: ## Summaries *especially useful for* `summarize` *even more useful after a* `group_by` See: http://r4ds.had.co.nz/transform.html#summarise-funs --- <table> <thead> <tr> <th style="text-align:left;"> Description </th> <th style="text-align:left;"> Usage </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> sum </td> <td style="text-align:left;"> sum(x) </td> </tr> <tr> <td style="text-align:left;"> minimum </td> <td style="text-align:left;"> min(x) </td> </tr> <tr> <td style="text-align:left;"> maximum </td> <td style="text-align:left;"> max(x) </td> </tr> <tr> <td style="text-align:left;"> mean </td> <td style="text-align:left;"> mean(x) </td> </tr> <tr> <td style="text-align:left;"> median </td> <td style="text-align:left;"> median(x) </td> </tr> <tr> <td style="text-align:left;"> standard deviation </td> <td style="text-align:left;"> sd(x) </td> </tr> <tr> <td style="text-align:left;"> variance </td> <td style="text-align:left;"> var(x) </td> </tr> <tr> <td style="text-align:left;"> rank </td> <td style="text-align:left;"> rank(x) </td> </tr> </tbody> </table> * All except `rank()` accept an `na.rm` argument to remove `NA` values before summarizing (`rank()` handles missing values through its `na.last` argument instead). The default is *always* `na.rm = FALSE`, so if there is even one `NA` value the summary will be `NA`. * See "Maths Functions" in the RStudio Base R Cheatsheet: https://github.com/rstudio/cheatsheets/blob/master/base-r.pdf * Any function that operates _on a vector_ can be used... 
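---

## `na.rm` in practice

A minimal sketch of the `na.rm` behavior described above. The vector `x` is made up for illustration, and `sounds_head` is a hand-typed copy of the first six rows of `sounds` shown earlier, so the block runs without downloading anything. It assumes `dplyr` is installed.

```r
library(dplyr)

# Hypothetical vector with one missing value (not part of the sounds data)
x <- c(2, 4, NA, 6)

mean_default <- mean(x)                # NA: na.rm = FALSE lets the NA propagate
mean_no_na   <- mean(x, na.rm = TRUE)  # 4: the NA is dropped before averaging

# First six rows of `sounds`, typed in by hand from the table shown earlier
sounds_head <- tibble(
  age          = c(8, 8, 8, 9, 9, 9),
  sound        = rep(c("cockadoodledoo", "meow", "woof woof"), times = 2),
  kids_produce = c(1, 0, 3, 0, 2, 2),
  kids_respond = c(35, 35, 35, 91, 93, 93)
)

# One summary row per age group
by_age <- sounds_head %>%
  group_by(age) %>%
  summarize(total_produce = sum(kids_produce, na.rm = TRUE))

by_age
```

Here `summarize()` collapses each age group's three rows into a single total, which is the "vector of length `n` in, vector of length 1 out" idea applied once per group.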
--- class: inverse, middle, center ![](../images/alicedata-lego-colors.jpg) ## <small>"Spent day pondering grayscale vs colourscale using `ggplot`"</small> *photo and caption courtesy [@alice_data](https://twitter.com/alice_data)* --- # Today's lab: COLORS Specifically, discrete colors. At the end of today's lab, you'll see an extra section on continuous colors. --- ## But first: `shape` -- `shape` works like any other `ggplot` aesthetic: -- ```r penguins %>% ggplot(aes(x=bill_length_mm, y=bill_depth_mm, shape=species, color=species)) + geom_point(size=2) ``` <img src="03-slides_files/figure-html/unnamed-chunk-30-1.png" width="65%" style="display: block; margin: auto;" /> --- ## But first: `shape` <img src="03-slides_files/figure-html/unnamed-chunk-31-1.png" width="65%" style="display: block; margin: auto;" /> --- ## Shapes with `color = "hotpink"` <img src="03-slides_files/figure-html/unnamed-chunk-32-1.png" width="65%" style="display: block; margin: auto;" /> --- ## Shapes with `fill = "gold"` <img src="03-slides_files/figure-html/unnamed-chunk-33-1.png" width="65%" style="display: block; margin: auto;" /> --- ## Default shape for `geom_point` 🕵🏽 Requires spelunking into the dark corners of the `ggplot2` code on [GitHub](https://github.com/tidyverse/ggplot2/blob/master/R/geom-point.r): -- ```r default_aes = aes( shape = 19, colour = "black", size = 1.5, fill = NA, alpha = NA, stroke = 0.5 ) ``` So the default is `geom_point(shape = 19)`! This is important to remember: this shape "understands" only the *color* aesthetic, not the *fill* aesthetic. --- ## Beyond `fill`/`color` aesthetics: "Scales" For when we need to go beyond what `fill` and `color` can do for us... * Just as we can add multiple `geom_*` layers, we can add several `scale_*` functions * e.g. `scale_fill_discrete()`, etc. * Time to jump to the lab worksheet!
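
---

## Scales: a quick sketch

A small sketch of adding a `scale_*` function to the penguin plot from earlier. It assumes the `ggplot2` and `palmerpenguins` packages are installed; the three color names are arbitrary choices for illustration, not the ones used in the lab worksheet.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the `penguins` data used earlier

p <- ggplot(penguins,
            aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  # na.rm = TRUE silently drops the rows with missing bill measurements
  geom_point(size = 2, na.rm = TRUE) +
  # scale_color_manual() maps each level of `species` to a color we choose
  scale_color_manual(values = c(Adelie    = "darkorange",
                                Chinstrap = "purple",
                                Gentoo    = "cyan4"))

p
```

Because the points use the default shape 19, the scale that matters here is a `color` scale. A `scale_fill_*` function would only affect shapes that have a fill, such as shape 21.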