Read all the way through step 6, and note that there is a file that needs to be turned in to Sakai before Wednesday at noon!

1 Overview

In this class, we will be working primarily with R, a free and open-source software environment for statistical computing and graphics.

What is R?

  • R is the name of the programming language itself. It is based on S, a language from Bell Labs, and users access it through a command-line interpreter (the > prompt).

What is RStudio?

  • RStudio is a powerful and convenient user interface that allows you to access the R programming language along with a lot of other bells and whistles that enhance functionality (and sanity).

What is RStudio Cloud?

  • RStudio Cloud is a web-based version of RStudio. Think Google Docs, but for R. Pretty much anything that you can do in RStudio, you can do in RStudio Cloud, but without having to install anything locally. For this class, you are certainly free to install RStudio on your local computer, but we will be using several features of RStudio Cloud to help manage assignments and save everybody time.

1.0.1 Our Goal for Today

Our end goal is to get you looking at a screen like this:

2 Sign up for RStudio Cloud

Go to https://rstudio.cloud and sign up for an account.

Once you’re signed in, you should be looking at something like this:

At this point, you are ready to join the class workspace, which is where you will find all of the labs for this term, and in which you will do all of the assignments. To join the workspace, look on Sakai for the sharing link. When you click the link, you should be prompted to join the class workspace:

Click the “Projects” tab at the top to see the various labs and assignments (right now, there should just be one):

You can create your own projects from scratch in the class workspace, or you can start with one of the template projects. For labs and some assignments, I will have put together templates for you to start from. When you click “Start” next to a template project, RStudio Cloud makes you a personal copy of that project, and then all of your changes and work are specific to your copy.

Begin by clicking the “Start” button next to the first project (“Lab 0”). After a few moments, your screen should look something like this:

2.1 Check in

  • Place your cursor where you see > and type x <- 2 + 2, hit enter or return, then type x, and hit enter/return again.
  • If [1] 4 prints to the screen, you’re all set!
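As a small sketch of what the check-in is doing: the arrow <- stores a value under a name without printing anything, and typing the name on its own prints the stored value. The second variable, y, is only an illustration and is not part of the lab.

```r
# Assignment stores the result; nothing is printed at this point
x <- 2 + 2

# Typing the name prints the stored value: [1] 4
x

# y is a made-up second variable, built from x
y <- x * 10
y   # [1] 40
```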

There’s a lot going on here. If you’re familiar with the desktop version of RStudio, this should look very familiar; if not, don’t worry! You’ll find your way around very quickly.

Also, note that on the left-hand side of the screen there is a set of links labeled “Learn”. RStudio has done a great job of providing tutorials and documentation about how to use their tools, and I encourage you to take a look at the various resources under that tab to familiarize yourself with the RStudio environment.

3 Install packages

One of the best things about R is its rich ecosystem of add-on packages and tools. “Out of the box”, R comes with good but basic statistical computing and graphics powers. For analytical and graphical super-powers, you’ll need to install add-on packages, which are user-written, to extend/expand your R capabilities. Packages can live in one of two places:

  • They may be carefully curated by CRAN (which involves a thorough submission and review process), and thus are easy to install using install.packages("name_of_package", dependencies = TRUE).
  • Alternatively, they may be available via GitHub. To download these packages, you first need to install the devtools package, and then refer to the package by its GitHub username and repository name.
install.packages("devtools")
library(devtools)
install_github("github_username/name_of_package")

One nice thing about using RStudio Cloud is that a workspace project (like the one you’ve just opened up) can come pre-loaded with the necessary libraries, which is a real time-saver in a classroom environment. For example, the project you’re using right now already has the devtools package installed. But for the next part of the lab, you’ll need to install one additional package.

Place your cursor in the console again (where you last typed x and [1] 4 printed on the screen). You can use the first method that we described above (install.packages()) to install the babynames package from CRAN:

install.packages("babynames")
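If you are not sure whether a package is already installed in your project, one possible way to check before installing is sketched below. The helper missing_packages() is a hypothetical name written for this illustration, not a function from R or any package; it relies only on base R's requireNamespace(), which reports whether a package can be loaded.

```r
# Hypothetical helper (not part of any package): returns the names in `pkgs`
# that are not installed. requireNamespace() loads a package's namespace
# without attaching it, and returns FALSE if the package is unavailable.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# The three packages used later in this lab
to_install <- missing_packages(c("babynames", "dplyr", "ggplot2"))
to_install

# Uncomment the next line to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install step is left commented out so that running the sketch only reports what is missing.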

Mind your use of quotes with packages.

  • To install a package, you put the name of the package in quotes as in install.packages("name_of_package").
  • To use an already installed package, you must load it first, as in library(name_of_package), leaving the name of the package bare. You only need to do this once per RStudio session.
  • If you want help, no quotes are needed: help(name_of_package) or ?name_of_package.
  • If you want the citation for a package (and you should give credit where credit is due), ask R as in citation("name_of_package").
library(dplyr)
help("dplyr")
citation("ggplot2")

To cite ggplot2 in publications, please use:

  H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
  Springer-Verlag New York, 2016.

A BibTeX entry for LaTeX users is

  @Book{,
    author = {Hadley Wickham},
    title = {ggplot2: Elegant Graphics for Data Analysis},
    publisher = {Springer-Verlag New York},
    year = {2016},
    isbn = {978-3-319-24277-4},
    url = {https://ggplot2.tidyverse.org},
  }

Heads up: R is case-sensitive, so ?dplyr works but ?Dplyr will not. Likewise, a variable called A is different from a.
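A quick sketch of what case sensitivity means in practice, using two made-up variables whose names differ only by case:

```r
# Two separate variables: the names differ only in capitalization
a <- "lowercase a"
A <- "uppercase A"

a
A

# FALSE: a and A hold different values because they are different objects
identical(a, A)
```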

4 Make a name plot

We can do everything we need to directly from the console, but it is often a lot easier to work in a more traditional editing environment. Open a new R script in RStudio by going to File --> New File --> R Script. For this first foray into R, I’ll give you the code, so sit back and relax and feel free to copy and paste my code with some small tweaks. Don’t worry if you’re not familiar with the commands and functions that we are using; as the course goes on, you will learn more about all of these.

First load the packages:

library(babynames) # contains the actual data
library(dplyr) # for manipulating data
library(ggplot2) # for plotting data

In an RStudio editor window, you run code by either clicking the button marked “Run”, or (more frequently) by using the “run line/selection” keyboard shortcut. On a Mac, this is “command+enter”; on Windows or Linux, it’s “control+enter”. If no text is selected, this will run the current line; if you’ve selected more than one line, your entire selection will be run. Depending on how your screen is laid out, you may see your selection (or line) be copied automatically down into the Console tab.

Begin by executing the three library() calls above.

Next, we’ll follow best practices for inspecting a freshly read dataset. Also, see “What I do when I get a new data set as told through tweets” for more ideas about exploring a new dataset. Here are some critical commands to obtain a high-level overview (HLO) of your freshly read dataset in R. We’ll call it saying hello to your dataset:

glimpse(babynames) # dplyr
Observations: 1,924,665
Variables: 5
$ year <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880…
$ sex  <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",…
$ name <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret", "Ida"…
$ n    <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 1288, 1258…
$ prop <dbl> 0.07238359, 0.02667896, 0.02052149, 0.01986579, 0.01788843, 0.01…
head(babynames) # base R
# A tibble: 6 x 5
   year sex   name          n   prop
  <dbl> <chr> <chr>     <int>  <dbl>
1  1880 F     Mary       7065 0.0724
2  1880 F     Anna       2604 0.0267
3  1880 F     Emma       2003 0.0205
4  1880 F     Elizabeth  1939 0.0199
5  1880 F     Minnie     1746 0.0179
6  1880 F     Margaret   1578 0.0162
tail(babynames) # same
# A tibble: 6 x 5
   year sex   name       n       prop
  <dbl> <chr> <chr>  <int>      <dbl>
1  2017 M     Zyhier     5 0.00000255
2  2017 M     Zykai      5 0.00000255
3  2017 M     Zykeem     5 0.00000255
4  2017 M     Zylin      5 0.00000255
5  2017 M     Zylis      5 0.00000255
6  2017 M     Zyrie      5 0.00000255
names(babynames) # same
[1] "year" "sex"  "name" "n"    "prop"
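A few other base R commands can also help when saying hello to a dataset. The sketch below is only an illustration: first_rows is a hypothetical name, and it holds just the six rows printed by head(babynames) above, typed in by hand so the example runs without any add-on packages. The same commands can be pointed at babynames itself.

```r
# The six rows shown by head(babynames) above, entered by hand
first_rows <- data.frame(
  year = c(1880, 1880, 1880, 1880, 1880, 1880),
  sex  = c("F", "F", "F", "F", "F", "F"),
  name = c("Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret"),
  n    = c(7065L, 2604L, 2003L, 1939L, 1746L, 1578L),
  prop = c(0.0724, 0.0267, 0.0205, 0.0199, 0.0179, 0.0162)
)

dim(first_rows)        # number of rows, then number of columns: 6 and 5
str(first_rows)        # the type of each column plus its first few values
summary(first_rows$n)  # minimum, quartiles, mean, and maximum of one column
```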

If you have done the above and produced sane-looking output, you are ready for the next bit. Use the code below to create a new data frame called alison.

alison <- babynames %>%
  filter(name == "Alison" | name == "Allison") %>% 
  filter(sex == "F") 
  • The first bit makes a new dataset called alison that is a copy of the babynames dataset; the %>% tells you we are doing some other stuff to it later.

  • The second bit filters our babynames to only keep rows where the name is either Alison or Allison (read | as “or”).

  • The third bit applies another filter to keep only those where sex is female.
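To see the three steps above at a small scale, here is a minimal sketch on a tiny made-up data frame. The object toy and its values are invented for illustration and are not taken from babynames. The sketch also shows %in%, an alternative way to write the “or” condition, and puts both conditions in a single filter() call, where a comma between conditions means “and”.

```r
library(dplyr)

# Made-up rows, for illustration only
toy <- tibble(
  year = c(2000, 2000, 2000, 2000),
  sex  = c("F", "F", "M", "F"),
  name = c("Alison", "Allison", "Alison", "Mary"),
  n    = c(10L, 20L, 1L, 50L)
)

# Same pattern as the lab code: two filters chained with the pipe
with_or <- toy %>%
  filter(name == "Alison" | name == "Allison") %>%
  filter(sex == "F")

# Alternative spelling: %in% for "or", and a comma for "and"
with_in <- toy %>%
  filter(name %in% c("Alison", "Allison"), sex == "F")

# Both versions keep the same two rows (the female Alison and Allison rows)
with_or
with_in
```

Either spelling works; %in% tends to be easier to read once you are comparing against more than two names.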

When you ran that command, you may have noticed a new entry appear on the right-hand side of the screen, in the “Environment” tab. This tab lists all of the variables that your current environment has loaded.

Let’s check out the data.

alison
# A tibble: 218 x 5
    year sex   name        n      prop
   <dbl> <chr> <chr>   <int>     <dbl>
 1  1905 F     Alison      7 0.0000226
 2  1907 F     Alison      5 0.0000148
 3  1908 F     Allison     6 0.0000169
 4  1910 F     Alison      5 0.0000119
 5  1910 F     Allison     5 0.0000119
 6  1911 F     Allison     9 0.0000204
 7  1912 F     Allison    12 0.0000204
 8  1912 F     Alison      9 0.0000153
 9  1913 F     Alison     12 0.0000183
10  1913 F     Allison     7 0.0000107
# … with 208 more rows
glimpse(alison)
Observations: 218
Variables: 5
$ year <dbl> 1905, 1907, 1908, 1910, 1910, 1911, 1912, 1912, 1913, 1913, 1914…
$ sex  <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",…
$ name <chr> "Alison", "Alison", "Allison", "Alison", "Allison", "Allison", "…
$ n    <int> 7, 5, 6, 5, 5, 9, 12, 9, 12, 7, 22, 11, 16, 13, 24, 15, 20, 15, …
$ prop <dbl> 2.259e-05, 1.482e-05, 1.692e-05, 1.192e-05, 1.192e-05, 2.037e-05…

Again, if you have sane-looking output here, move along to plotting the data!

plot <- ggplot(alison, aes(x = year, 
                           y = prop,  
                           group = name, 
                           color = name)) + 
  geom_line()  

Now if you did this right, you will not see your plot! Because we saved the ggplot with a name (plot), R just saved the object for you. But check out the top right pane in RStudio again: under Data you should see plot, so it is there; you just have to ask for it. Here’s how:

plot 
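Because a saved ggplot is just an object, you can also keep adding to it before you display it. The sketch below is an illustration under a few assumptions: alison_head is a hypothetical name holding only the ten rows printed in the alison output above (typed in by hand), and the title and axis labels are my own wording, not part of the lab.

```r
library(ggplot2)

# The ten rows printed in the alison output above, entered by hand
alison_head <- data.frame(
  year = c(1905, 1907, 1908, 1910, 1910, 1911, 1912, 1912, 1913, 1913),
  name = c("Alison", "Alison", "Allison", "Alison", "Allison",
           "Allison", "Allison", "Alison", "Alison", "Allison"),
  prop = c(0.0000226, 0.0000148, 0.0000169, 0.0000119, 0.0000119,
           0.0000204, 0.0000204, 0.0000153, 0.0000183, 0.0000107)
)

# Saving the plot under a name draws nothing
p <- ggplot(alison_head, aes(x = year, y = prop, group = name, color = name)) +
  geom_line()

# Add labels to the saved object; the wording here is illustrative
p_labeled <- p +
  labs(title = "Alison vs. Allison", x = "Year", y = "Proportion (prop)")

# Asking for the object (or calling print() on it) is what draws the plot
print(p_labeled)
```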

5 Make a new name plot

Edit my code above to create a new dataset. Pick 2 names to compare how popular they each are (these could be different spellings of your own name, like I did, but you can choose any 2 names that are present in the dataset). Make the new plot, changing the name of the first argument alison in ggplot() to the name of your new dataset.

5.1 Save and share

Save your work so you can share your favorite plot with us. You will not like the looks of your plot if you mouse over to Export and save it. Instead, use ggplot2’s command for saving a plot with sensible defaults:

help(ggsave)
ggsave("alison_hill.pdf", plot) # please make the filename unique!

Upload this exported plot to Sakai before Wednesday at noon.

(Note that the plot you turn in should not be called alison_hill.pdf; give it a different, unique name.)
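If the default size of the saved file does not suit your plot, ggsave() also accepts width and height arguments (in inches by default). The sketch below is one possible way to use them, under stated assumptions: the small data frame holds only the Alison rows from the output shown earlier, firstname_lastname.pdf is a placeholder filename, and the file is written to a temporary folder so that nothing in your project is overwritten.

```r
library(ggplot2)

# A few of the Alison rows from the output shown earlier, entered by hand
alison_rows <- data.frame(
  year = c(1905, 1907, 1910, 1912, 1913),
  prop = c(0.0000226, 0.0000148, 0.0000119, 0.0000153, 0.0000183)
)

my_plot <- ggplot(alison_rows, aes(x = year, y = prop)) +
  geom_line()

# Placeholder filename inside a temporary folder; use your own name in the lab
out_file <- file.path(tempdir(), "firstname_lastname.pdf")

# width and height are in inches unless you set the units argument
ggsave(out_file, plot = my_plot, width = 7, height = 4)

file.exists(out_file)
```

In the lab itself you would pass a plain filename such as "firstname_lastname.pdf", which saves into your project folder.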

6 Other cool babynames projects

Creative Commons License