Read all the way through step 6, and note that there is a file that needs to be turned in to Sakai before Wednesday at noon!
In this class, we will be working primarily with R, a free and open-source software environment for statistical computing and graphics.
What is R?
What is RStudio?
What is RStudio Cloud?
Our end goal is to get you looking at a screen like this:
Go to https://rstudio.cloud and sign up for an account.
Once you’re signed in, you should be looking at something like this:
At this point, you are ready to join the class workspace, which is where you will find all of the labs for this term, and in which you will do all of the assignments. To join the workspace, look on Sakai for the sharing link. When you click the link, you should be prompted to join the class workspace:
Click the “Projects” tab at the top to see the various labs and assignments (right now, there should just be one):
You can create your own projects from scratch in the class workspace, or you can start with one of the template projects. For labs and some assignments, I will have put together templates for you to start from. When you click “Start” next to a template project, RStudio Cloud makes you a personal copy of that project, and then all of your changes and work are specific to your copy.
Begin by clicking the “Start” button next to the first project (“Lab 0”). After a few moments, your screen should look something like this:
There’s a lot going on here. If you’re familiar with the desktop version of RStudio, this should look very familiar; if not, don’t worry! You’ll find your way around very quickly.
Place your cursor in the console, next to the > prompt, and type x <- 2 + 2, hit enter or return, then type x, and hit enter/return again. If [1] 4 prints to the screen, you’re all set!
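If you’d like to poke at the console a bit more, here is a minimal sketch of what those two lines are doing. Assigning with <- stores a value without printing anything, and typing a name by itself prints whatever is stored there. The variable name y is just something made up for this illustration.

```r
# Assignment stores the result but prints nothing.
x <- 2 + 2

# Typing the name on its own line prints the stored value.
x

# Stored values can be reused in later calculations.
y <- x * 10   # y is a made-up name for this example
y
```

If you try this, both x and y should then show up in the list of variables that RStudio keeps for your session.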
Also, note that on the left-hand side of the screen there is a set of links labeled “Learn”. RStudio has done a great job of providing tutorials and documentation about how to use their tools, and I encourage you to take a look at the various resources under that tab to familiarize yourself with the RStudio environment.
One of the best things about R is its rich ecosystem of add-on packages and tools. “Out of the box”, R comes with good but basic statistical computing and graphics powers. For analytical and graphical super-powers, you’ll need to install add-on packages, which are user-written, to extend/expand your R capabilities. Packages can live in one of two places:
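On CRAN, in which case you can install them with install.packages("name_of_package", dependencies = TRUE).
On GitHub, in which case you first need to install and load the devtools package, and then point install_github() at the GitHub account that hosts the package:
install.packages("devtools")
library(devtools)
install_github("username/name_of_package")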
One nice thing about using RStudio Cloud is that a workspace project (like the one you’ve just opened up) can come pre-loaded with the necessary libraries, which is a real time-saver in a classroom environment. For example, the project you’re using right now already has the devtools package installed. But for the next part of the lab, you’ll need to install one additional package.
Place your cursor in the console again (where you last typed x and [1] 4 printed on the screen). You can use the first method that we described above (install.packages()) to install the babynames package from CRAN:
install.packages("babynames")
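If you aren’t sure whether a package is already installed, one common approach is to check first and only install when it is missing. The sketch below assumes two small helpers that I’ve called is_installed() and install_if_missing(); these names are made up for illustration and are not built into R. They rely on base R’s requireNamespace(), which returns FALSE rather than stopping with an error when a package can’t be found.

```r
# is_installed() is a hypothetical helper name, not part of R itself.
is_installed <- function(pkg) {
  # requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise.
  requireNamespace(pkg, quietly = TRUE)
}

# install_if_missing() is also a made-up name for this sketch.
install_if_missing <- function(pkg) {
  if (!is_installed(pkg)) {
    install.packages(pkg)
  }
}

# The stats package ships with R, so this check should come back TRUE.
is_installed("stats")

# To use the helper for this lab, you would run:
# install_if_missing("babynames")
```

Note that the package name is quoted everywhere here, because it is being passed around as a piece of text.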
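Mind your use of quotes carefully with packages:
To install a package, put its name in quotes, as in install.packages("name_of_package").
To use an installed package, load it with library(name_of_package), leaving the name of the package bare. You only need to do this once per RStudio session.
To get help, use help(name_of_package) or ?name_of_package.
To get the citation for a package, use citation("name_of_package").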
.library(dplyr)
help("dplyr")
citation("ggplot2")
To cite ggplot2 in publications, please use:
H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
Springer-Verlag New York, 2016.
A BibTeX entry for LaTeX users is
@Book{,
author = {Hadley Wickham},
title = {ggplot2: Elegant Graphics for Data Analysis},
publisher = {Springer-Verlag New York},
year = {2016},
isbn = {978-3-319-24277-4},
url = {https://ggplot2.tidyverse.org},
}
Heads up: R is case-sensitive, so ?dplyr works but ?Dplyr will not. Likewise, a variable called A is different from a.
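Here is a quick sketch of case sensitivity in action, using only base R. The variable name lab_score is made up for this example.

```r
# Two different variables whose names differ only by case.
a <- 1
A <- 100

a        # prints 1
A        # prints 100
a == A   # FALSE: they hold different values

# exists() checks whether a name is defined; lab_score is a made-up name.
lab_score <- 95
exists("lab_score")   # TRUE
exists("Lab_Score")   # FALSE, because the capitalization differs
```

If R ever tells you that an object was not found, a capitalization slip is one of the first things worth checking.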
We can do everything we need to directly from the console, but it is often a lot easier to work in a more traditional editing environment. Open a new R script in RStudio by going to File --> New File --> R Script. For this first foray into R, I’ll give you the code, so sit back and relax and feel free to copy and paste my code with some small tweaks. Don’t worry if you’re not familiar with the commands and functions that we are using; as the course goes on, you will learn more about all of these.
First load the packages:
library(babynames) # contains the actual data
library(dplyr) # for manipulating data
library(ggplot2) # for plotting data
In an RStudio editor window, you run code by either clicking the button marked “Run”, or (more frequently) by using the “run line/selection” keyboard shortcut. On a Mac, this is “command+enter”; on Windows or Linux, it’s “control+enter”. If no text is selected, this will run the current line; if you’ve selected more than one line, your entire selection will be run. Depending on how your screen is laid out, you may see your selection (or line) copied automatically down into the Console tab.
Begin by executing the three library() calls above.
Next, we’ll follow best practices for inspecting a freshly read dataset. Also, see “What I do when I get a new data set as told through tweets” for more ideas about exploring a new dataset. Here are some critical commands to obtain a high-level overview (HLO) of your freshly read dataset in R. We’ll call it saying hello to your dataset:
glimpse(babynames) # dplyr
Observations: 1,924,665
Variables: 5
$ year <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880…
$ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",…
$ name <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret", "Ida"…
$ n <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 1288, 1258…
$ prop <dbl> 0.07238359, 0.02667896, 0.02052149, 0.01986579, 0.01788843, 0.01…
head(babynames) # base R
# A tibble: 6 x 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1880 F Mary 7065 0.0724
2 1880 F Anna 2604 0.0267
3 1880 F Emma 2003 0.0205
4 1880 F Elizabeth 1939 0.0199
5 1880 F Minnie 1746 0.0179
6 1880 F Margaret 1578 0.0162
tail(babynames) # same
# A tibble: 6 x 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 2017 M Zyhier 5 0.00000255
2 2017 M Zykai 5 0.00000255
3 2017 M Zykeem 5 0.00000255
4 2017 M Zylin 5 0.00000255
5 2017 M Zylis 5 0.00000255
6 2017 M Zyrie 5 0.00000255
names(babynames) # same
[1] "year" "sex" "name" "n" "prop"
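A few base R functions give a similar overview without loading any packages. The sketch below uses mtcars, a small dataset that ships with R, purely as a stand-in so that it runs on its own; you could swap in babynames once that package is loaded.

```r
# mtcars is a built-in dataset, used here only as a stand-in.
dim(mtcars)       # number of rows, then number of columns
nrow(mtcars)      # just the number of rows
ncol(mtcars)      # just the number of columns
str(mtcars)       # compact structure: one line per variable
summary(mtcars)   # summary statistics for each variable
head(mtcars, 3)   # first three rows only
```

Of these, str() is the closest base R relative of glimpse(): both show each variable’s type along with its first few values.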
If you have done the above and produced sane-looking output, you are ready for the next bit. Use the code below to create a new data frame called alison.
alison <- babynames %>%
  filter(name == "Alison" | name == "Allison") %>%
  filter(sex == "F")
The first bit makes a new dataset called alison that starts out as a copy of the babynames dataset; the %>% tells you we are going to do some other stuff to it next.
The second bit filters our babynames to only keep rows where the name is either Alison or Allison (read | as “or”).
The third bit applies another filter to keep only those rows where sex is female.
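To see what the conditions inside those filters actually evaluate to, here is a minimal sketch in base R on a tiny made-up data frame. The name toy and its four rows are invented for illustration and are not part of babynames. Each condition produces one TRUE or FALSE per row, and only the rows marked TRUE are kept.

```r
# toy is a made-up data frame, used only to illustrate the logic.
toy <- data.frame(
  name = c("Alison", "Allison", "Alice", "Alison"),
  sex  = c("F", "F", "F", "M"),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per row: is the name Alison OR Allison?
is_alison <- toy$name == "Alison" | toy$name == "Allison"
is_alison   # TRUE TRUE FALSE TRUE

# One TRUE/FALSE per row: is the sex female?
is_female <- toy$sex == "F"
is_female   # TRUE TRUE TRUE FALSE

# Applying both filters in turn keeps rows where both conditions hold (AND).
keep <- is_alison & is_female
keep        # TRUE TRUE FALSE FALSE

# Keep only the rows marked TRUE.
kept <- toy[keep, ]
kept

# %in% is a shorter way to write "equal to any of these values".
in_set <- toy$name %in% c("Alison", "Allison")
in_set      # TRUE TRUE FALSE TRUE
```

Chaining two filters one after the other, as in the lab code, has the same effect as requiring both conditions at once.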
When you ran that command, you may have noticed a new entry appear on the right-hand side of the screen, in the “environment” tab. This tab lists all of the variables that your current environment has loaded.
Let’s check out the data.
alison
# A tibble: 218 x 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1905 F Alison 7 0.0000226
2 1907 F Alison 5 0.0000148
3 1908 F Allison 6 0.0000169
4 1910 F Alison 5 0.0000119
5 1910 F Allison 5 0.0000119
6 1911 F Allison 9 0.0000204
7 1912 F Allison 12 0.0000204
8 1912 F Alison 9 0.0000153
9 1913 F Alison 12 0.0000183
10 1913 F Allison 7 0.0000107
# … with 208 more rows
glimpse(alison)
Observations: 218
Variables: 5
$ year <dbl> 1905, 1907, 1908, 1910, 1910, 1911, 1912, 1912, 1913, 1913, 1914…
$ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",…
$ name <chr> "Alison", "Alison", "Allison", "Alison", "Allison", "Allison", "…
$ n <int> 7, 5, 6, 5, 5, 9, 12, 9, 12, 7, 22, 11, 16, 13, 24, 15, 20, 15, …
$ prop <dbl> 2.259e-05, 1.482e-05, 1.692e-05, 1.192e-05, 1.192e-05, 2.037e-05…
Again, if you have sane-looking output here, move along to plotting the data!
plot <- ggplot(alison, aes(x = year,
                           y = prop,
                           group = name,
                           color = name)) +
  geom_line()
Now if you did this right, you will not see your plot! Because we saved the ggplot with a name (plot), R just saved the object for you. But check out the top right pane in RStudio again: under Data you should see plot, so it is there; you just have to ask for it. Here’s how:
plot
Edit my code above to create a new dataset. Pick 2 names to compare how popular they each are (these could be different spellings of your own name, like I did, but you can choose any 2 names that are present in the dataset). Make the new plot, changing the name of the first argument alison in ggplot() to the name of your new dataset.