This walkthrough will illustrate a potential workflow for reproducing Table 4 from the following paper:
Axtell B, Munteanu C. Tea, Earl Grey, Hot: Designing Speech Interactions from the Imagined Ideal of Star Trek. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. Yokohama Japan: ACM; 2021. p. 1–14.
The associated data can be found here (and is on Posit.cloud).
Note: There is no deliverable for this lab; follow along and experiment at your leisure!
library(tidyverse)
library(here)
library(jsonlite)
library(gt)
Before we even look at the JSON itself, we should read the originating paper and get a sense of the analyses performed. Were there important variables that the authors used for stratifying the study population? Were there particular dimensions over which they computed summary statistics?
Next, manually check the structure of the data file in the text editor of your choice (or the JSON viewer of your choice; Firefox will work as well):
{
  "102": {
    "255": {
      "char": "Tasha",
      "line": "Battle bridge.",
      "direction": "The doors snap closed and the lift moves. Riker looks Tasha over waiting then:",
      "type": [
        "Statement"
      ],
      "pri_type": "Statement",
      "domain": [
        "IoT"
      ],
      "sub-domain": [
        "Turbolift"
      ],
      "nv_resp": true,
      "interaction": "Battle bridge.",
      "char_type": "Person",
      "is_fed": true,
      "error": false
    },
    ...
What do we see?
The outer-most part of the document is a dictionary whose keys are episodes, and whose values are also dictionaries. Each of these represents a single episode. What is in these dictionaries?
Each episode-level dictionary contains utterance IDs as keys and yet another dictionary as a value. Each of these represents a single utterance. What is in these dictionaries?
Each utterance-level dictionary contains the actual information about the utterance. This dictionary’s keys refer to specific attributes of the utterance (what character was speaking, the actual line of dialogue itself, etc.).
Some of these attributes are scalar (character, etc.) while others are lists (utterances can be of more than one type, etc.).
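To make the three levels concrete, here is a minimal sketch that parses a cut-down copy of the excerpt above from an inline string. The object names (toy_json, toy, utt) are made up for illustration, and only four of the twelve attributes are included.

```r
library(jsonlite)

# Toy document mirroring the episode -> utterance -> attribute nesting.
# Values are copied from the excerpt above; most attributes are omitted.
toy_json <- '{
  "102": {
    "255": {
      "char": "Tasha",
      "type": ["Statement"],
      "char_type": "Person",
      "error": false
    }
  }
}'

# simplifyVector = FALSE keeps every JSON container as an R list
toy <- fromJSON(toy_json, simplifyVector = FALSE)

names(toy)            # episode IDs
names(toy[["102"]])   # utterance IDs within episode 102

utt <- toy[["102"]][["255"]]
utt$char              # scalar attribute: a length-1 character vector
utt$type              # list attribute: an R list with one element
str(toy)
```

The scalar attributes come back as length-1 vectors, while the JSON arrays come back as R lists; that difference matters later when we unnest.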
After looking over the structure, go back to your notes about the paper and see if the things you were looking for — variables, things from figures, etc. — appear in the document anywhere. If they don’t, get ready to figure out how to recompute analytical variables!
Recall that our goal is to reproduce Table 4 from the paper, which was a final tally of interaction type by speaker type (human vs. computer). Here’s a “sketch” of a dataframe we might want to ultimately end up with:
A long dataframe, in which each row is an utterance, with columns for:
Looking at the structure of our data, it looks like the important
inner keys are going to be type and
char_type.
jsonlite::read_json() is a good place to start; note
that for some JSON files, jsonlite can automatically do a fair bit of
the work that we are about to do by hand. If your JSON file is
relatively simple, and does not have deeply-nested objects with varying
dimensions, jsonlite::fromJSON() (or read_json() with
simplifyVector = TRUE) may be able to coerce the
nested data into a dataframe… but not always, and not for this
JSON file, because of the amount of heterogeneity in the data. So we
will be doing this “by hand”.
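For contrast, here is a small sketch of the kind of JSON that jsonlite can simplify on its own: a flat array of records that all share the same scalar fields. The records are toy values, not taken from the dataset.

```r
library(jsonlite)

# A flat array of records: every object has the same scalar fields (toy values)
simple_json <- '[
  {"char": "Speaker A", "char_type": "Person"},
  {"char": "Speaker B", "char_type": "Computer"}
]'

# fromJSON() simplifies arrays of records to a data frame by default
simple_df <- fromJSON(simple_json)
class(simple_df)
simple_df
```

Our file is keyed by episode and utterance ID rather than being an array of uniform records, which is one reason this shortcut does not apply directly.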
j_path <- here("data/lab11/teaearlgreyhotdataset.json")
j <- read_json(j_path)
Note what we get back from read_json:
typeof(j)
## [1] "list"
length(j)
## [1] 137
Why do we get a list and not a data frame? Because JSON allows for
arbitrarily complicated nested structures, and so we don’t have
any guarantee that what’s in that file will be amenable to flattening
without some very file-specific work. So jsonlite punts on
the issue and makes a list-of-lists for us, so we can deal with it
ourselves.
Before we continue, note that this file is pretty big. To make it a little easier to work with while we’re just getting started, I am going to make a small version that only has two episodes’ worth of data. I’ll do this “by hand” outside of R, in my text editor.
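If you would rather not edit the file by hand, a similar subset can be made from within R. The sketch below uses a toy stand-in for the parsed file (the episode contents are made up) and writes to a temporary path; with the real data you would subset the list returned by read_json() instead.

```r
library(jsonlite)

# Stand-in for the parsed full file: a named list of episodes (toy values)
full <- list(
  "102" = list("255" = list(char = "Speaker A", type = list("Statement"))),
  "104" = list("4"   = list(char = "Speaker B", type = list("Command"))),
  "105" = list("12"  = list(char = "Speaker C", type = list("Question")))
)

# Keep two episodes by name
small <- full[c("102", "104")]

# auto_unbox = TRUE writes length-1 vectors as JSON scalars rather than arrays;
# R lists such as `type` are still written as JSON arrays
small_path <- tempfile(fileext = ".json")
write_json(small, small_path, auto_unbox = TRUE, pretty = TRUE)

small_again <- read_json(small_path)
length(small_again)
names(small_again)
```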
j_path <- here("data/lab11/teaearlgreyhotdataset.small.json")
j <- read_json(j_path)
length(j)
## [1] 2
This is a named list, so we can access its elements by name:
names(j)
## [1] "102" "104"
length(j$"102")
## [1] 11
And the sub-lists are themselves named:
names(j$"102")
## [1] "255" "345" "347" "363" "421" "422" "426" "427" "428" "429" "430"
Look back at the JSON file itself; you’ll see that these correspond to the dictionary keys in the file.
It’s turtles all the way down, to the final layer of the information objects themselves:
names(j$"102"$"255")
## [1] "char" "line" "direction" "type" "pri_type"
## [6] "domain" "sub-domain" "nv_resp" "interaction" "char_type"
## [11] "is_fed" "error"
j$"102"$"255"$"char"
## [1] "Tasha"
Now, we are going to work our way through the document, flattening
and filtering as we go, until we end up with a tidy data frame. Step
one: turn our list into a very simple DF, using
enframe():
j2 <- j %>% enframe
j2
## # A tibble: 2 × 2
## name value
## <chr> <list>
## 1 102 <named list [11]>
## 2 104 <named list [3]>
See what happened here? We started with a named list, and ended up with a dataframe where one column is the names of the list and the second column is the values.
To get a little more concrete: each row of our dataframe now
corresponds to a single top-level item from our input JSON file (i.e.,
an episode), with a column (by default, named
name) representing the item’s key and a second column (by
default, named value) containing that item’s value. In this
JSON file, the keys correspond to particular episode IDs, with the
matching values representing dictionaries containing utterances from
that episode.
Next, we’re going to go one more level down, but before we do, let’s
rename our columns. By default, enframe just gives the
uninformative column names “name” and “value”, which (as you will see)
can quickly become confusing. We can do this using rename(),
or we can override those defaults when we call enframe.
j2 <- j2 %>% rename(episode.id=name, episode=value)
j2
## # A tibble: 2 × 2
## episode.id episode
## <chr> <list>
## 1 102 <named list [11]>
## 2 104 <named list [3]>
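The other option mentioned above, overriding the defaults in the call to enframe(), might look like the following sketch. The toy list stands in for j, and its contents are abbreviated.

```r
library(tibble)

# Toy named list standing in for `j` (two episodes, contents abbreviated)
toy <- list("102" = list(a = 1, b = 2), "104" = list(a = 3))

# name= and value= set the output column names directly
toy_df <- enframe(toy, name = "episode.id", value = "episode")
toy_df
```

This saves a separate rename() step and keeps the naming next to the call that creates the columns.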
Now that we’ve got useful names, let’s go one level deeper (Inception-style) and turn our episodes into nested dataframes:
j3 <- j2 %>% mutate(episode=purrr::map(episode,enframe))
j3
## # A tibble: 2 × 2
## episode.id episode
## <chr> <list>
## 1 102 <tibble [11 × 2]>
## 2 104 <tibble [3 × 2]>
Now, if we unnest the episode column…
j3 <- j3 %>% unnest(episode)
j3
## # A tibble: 14 × 3
## episode.id name value
## <chr> <chr> <list>
## 1 102 255 <named list [12]>
## 2 102 345 <named list [12]>
## 3 102 347 <named list [12]>
## 4 102 363 <named list [12]>
## 5 102 421 <named list [12]>
## 6 102 422 <named list [12]>
## 7 102 426 <named list [12]>
## 8 102 427 <named list [12]>
## 9 102 428 <named list [12]>
## 10 102 429 <named list [12]>
## 11 102 430 <named list [12]>
## 12 104 4 <named list [12]>
## 13 104 45 <named list [12]>
## 14 104 54 <named list [12]>
Now, see what we’ve got? A tidy data frame, one row per utterance,
with the default column names from enframe. Before we
forget, let’s rename our columns:
j3 <- j3 %>% rename(utterance.id=name, utterance=value)
j3
## # A tibble: 14 × 3
## episode.id utterance.id utterance
## <chr> <chr> <list>
## 1 102 255 <named list [12]>
## 2 102 345 <named list [12]>
## 3 102 347 <named list [12]>
## 4 102 363 <named list [12]>
## 5 102 421 <named list [12]>
## 6 102 422 <named list [12]>
## 7 102 426 <named list [12]>
## 8 102 427 <named list [12]>
## 9 102 428 <named list [12]>
## 10 102 429 <named list [12]>
## 11 102 430 <named list [12]>
## 12 104 4 <named list [12]>
## 13 104 45 <named list [12]>
## 14 104 54 <named list [12]>
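The enframe-then-unnest pattern we have used so far can also be written as one pipeline with the column names set up front. This is a sketch on a toy list (its contents are made up), not a replacement for the step-by-step version above.

```r
library(tibble)
library(dplyr)
library(purrr)
library(tidyr)

# Toy stand-in for the parsed JSON: two episodes with three utterances in total
toy <- list(
  "102" = list("255" = list(char = "A"), "345" = list(char = "B")),
  "104" = list("4" = list(char = "C"))
)

toy_utterances <- toy %>%
  enframe(name = "episode.id", value = "episode") %>%
  # enframe each episode, naming the inner columns as we go
  mutate(episode = map(episode, enframe,
                       name = "utterance.id", value = "utterance")) %>%
  # one row per utterance; episode.id is repeated as needed
  unnest(episode)

toy_utterances
```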
I bet you can guess what comes next: that’s right, we are going to
enframe our utterances, and unnest!
j4 <- j3 %>% mutate(utterance=purrr::map(utterance,enframe))
j4
## # A tibble: 14 × 3
## episode.id utterance.id utterance
## <chr> <chr> <list>
## 1 102 255 <tibble [12 × 2]>
## 2 102 345 <tibble [12 × 2]>
## 3 102 347 <tibble [12 × 2]>
## 4 102 363 <tibble [12 × 2]>
## 5 102 421 <tibble [12 × 2]>
## 6 102 422 <tibble [12 × 2]>
## 7 102 426 <tibble [12 × 2]>
## 8 102 427 <tibble [12 × 2]>
## 9 102 428 <tibble [12 × 2]>
## 10 102 429 <tibble [12 × 2]>
## 11 102 430 <tibble [12 × 2]>
## 12 104 4 <tibble [12 × 2]>
## 13 104 45 <tibble [12 × 2]>
## 14 104 54 <tibble [12 × 2]>
When we unnest this time, let’s see what happens:
j5 <- j4 %>% unnest(utterance)
j5
## # A tibble: 168 × 4
## episode.id utterance.id name value
## <chr> <chr> <chr> <list>
## 1 102 255 char <chr [1]>
## 2 102 255 line <chr [1]>
## 3 102 255 direction <chr [1]>
## 4 102 255 type <list [1]>
## 5 102 255 pri_type <chr [1]>
## 6 102 255 domain <list [1]>
## 7 102 255 sub-domain <list [1]>
## 8 102 255 nv_resp <lgl [1]>
## 9 102 255 interaction <chr [1]>
## 10 102 255 char_type <chr [1]>
## # ℹ 158 more rows
Woah! Now we’ve got one row for each key in the inner-most part of the JSON file, with a corresponding list value. A few things to note:
The value column here is of type list, but
note that the different rows have different types of lists in them:
some have characters, some have logicals, some have lists, etc.
Our next step will be to pivot this into a slightly wider dataframe:
j6 <- j5 %>% pivot_wider(names_from=name, values_from=value)
j6
## # A tibble: 14 × 14
## episode.id utterance.id char line direction type pri_type domain
## <chr> <chr> <list> <list> <list> <list> <list> <list>
## 1 102 255 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 2 102 345 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 3 102 347 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 4 102 363 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 5 102 421 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 6 102 422 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 7 102 426 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 8 102 427 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 9 102 428 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 10 102 429 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 11 102 430 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 12 104 4 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 13 104 45 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 14 104 54 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## # ℹ 6 more variables: `sub-domain` <list>, nv_resp <list>, interaction <list>,
## # char_type <list>, is_fed <list>, error <list>
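To see the pivot in isolation, here is a sketch on a toy long dataframe with a mixed list column, similar in shape to what we had before pivoting. The values are abbreviated from the excerpt, and only two attributes per utterance are kept.

```r
library(tibble)
library(tidyr)

# One row per (utterance, attribute); `value` mixes scalars and lists
long <- tibble(
  utterance.id = c("255", "255", "345", "345"),
  name  = c("char_type", "type", "char_type", "type"),
  value = list("Person", list("Statement"), "Person", list("Command"))
)

# Each distinct `name` becomes its own column; the columns stay list columns
wide <- long %>% pivot_wider(names_from = name, values_from = value)
wide
```

Every new column is still a list column after the pivot, which is why the scalar attributes need one more unnesting step.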
This is getting closer to a dataframe we can work with! We still have a little bit of work to do, however. First, let’s make our lives simpler by getting rid of some of the columns we don’t need for our analysis:
j6 <- j6 %>% select(episode.id, utterance.id, char_type, type)
j6
## # A tibble: 14 × 4
## episode.id utterance.id char_type type
## <chr> <chr> <list> <list>
## 1 102 255 <chr [1]> <list [1]>
## 2 102 345 <chr [1]> <list [1]>
## 3 102 347 <chr [1]> <list [1]>
## 4 102 363 <chr [1]> <list [3]>
## 5 102 421 <chr [1]> <list [1]>
## 6 102 422 <chr [1]> <list [1]>
## 7 102 426 <chr [1]> <list [2]>
## 8 102 427 <chr [1]> <list [2]>
## 9 102 428 <chr [1]> <list [2]>
## 10 102 429 <chr [1]> <list [2]>
## 11 102 430 <chr [1]> <list [2]>
## 12 104 4 <chr [1]> <list [1]>
## 13 104 45 <chr [1]> <list [1]>
## 14 104 54 <chr [1]> <list [1]>
Notice that char_type and type are list
columns, with char_type always being of length 1 and
type being of varying lengths. In our original data,
char_type was a scalar, not a list: we ended
up with it as a list because of a quirk of how our raw JSON parse was
processed and enframed. We can easily deal with this via
unnest:
j6 <- j6 %>% unnest(char_type)
j6
## # A tibble: 14 × 4
## episode.id utterance.id char_type type
## <chr> <chr> <chr> <list>
## 1 102 255 Person <list [1]>
## 2 102 345 Person <list [1]>
## 3 102 347 Person <list [1]>
## 4 102 363 Person <list [3]>
## 5 102 421 Person <list [1]>
## 6 102 422 Computer <list [1]>
## 7 102 426 Computer <list [2]>
## 8 102 427 Person <list [2]>
## 9 102 428 Computer <list [2]>
## 10 102 429 Computer <list [2]>
## 11 102 430 Person <list [2]>
## 12 104 4 Person <list [1]>
## 13 104 45 Person <list [1]>
## 14 104 54 Person <list [1]>
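An alternative to unnest() for a column like this is to pull the first element out of each entry with purrr::map_chr(). This is a sketch on toy data, and it assumes every entry holds at least one character value.

```r
library(tibble)
library(dplyr)
library(purrr)

# Toy list column in which every entry is a length-1 character vector
toy <- tibble(
  utterance.id = c("255", "422"),
  char_type = list("Person", "Computer")
)

# map_chr(x, 1) extracts the first element of each entry and returns
# a plain character vector
toy_flat <- toy %>% mutate(char_type = map_chr(char_type, 1))
toy_flat
```

Unlike unnest(), this never changes the number of rows, which can be a useful guarantee for a column that is supposed to be scalar.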
Now, let’s turn to the type column. Look at the entry
for utterance 363 in episode 102: it has a length of 3. Looking at the
JSON file, we can see that this is because this utterance was assigned
three types: “Wake Word”, “Conversation”, and “Question”.
Because the “type” column’s JSON origin was as a container element (a list), which in principle could have had entries of varying types, R has given it to us as an R list (which can have varying contents) rather than as a vector (which can only have one kind of data). We know what is in that list, but R doesn’t.
As such, we have a little bit of extra work to do; we actually will
need to unnest it twice:
The first unnest() call will get it out of its “outer” R list.
The second unnest() call gets it out of its length-1 vector into a scalar.
This is the sort of thing that jsonlite has helpers to
do, but for a simple file format like this one I personally usually just
do it all by hand, so I can be sure about what’s going on.
j7 <- j6 %>% unnest(type) %>% unnest(type)
j7
## # A tibble: 21 × 4
## episode.id utterance.id char_type type
## <chr> <chr> <chr> <chr>
## 1 102 255 Person Statement
## 2 102 345 Person Command
## 3 102 347 Person Statement
## 4 102 363 Person Wake Word
## 5 102 363 Person Question
## 6 102 363 Person Conversation
## 7 102 421 Person Command
## 8 102 422 Computer Response
## 9 102 426 Computer Info
## 10 102 426 Computer Alert
## # ℹ 11 more rows
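To see what each of the two unnest() calls contributes, here is a sketch that keeps the intermediate result. The toy data has the same shape as our type column, with two of the three types of utterance 363 kept for brevity.

```r
library(tibble)
library(tidyr)

# `type` is a list column whose entries are themselves R lists
toy <- tibble(
  utterance.id = c("255", "363"),
  type = list(list("Statement"), list("Wake Word", "Question"))
)

# First unnest: one row per type, but each value is still wrapped in a list
step1 <- toy %>% unnest(type)
class(step1$type)

# Second unnest: the length-1 vectors become a plain character column
step2 <- step1 %>% unnest(type)
class(step2$type)
step2
```

The row count changes only in the first step; the second step changes the column type without adding rows.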
A more modern and simpler option, that works in this specific
case but might not always work, is to use
unnest_longer:
j7 <- j6 %>% unnest_longer(type)
j7
## # A tibble: 21 × 4
## episode.id utterance.id char_type type
## <chr> <chr> <chr> <chr>
## 1 102 255 Person Statement
## 2 102 345 Person Command
## 3 102 347 Person Statement
## 4 102 363 Person Wake Word
## 5 102 363 Person Question
## 6 102 363 Person Conversation
## 7 102 421 Person Command
## 8 102 422 Computer Response
## 9 102 426 Computer Info
## 10 102 426 Computer Alert
## # ℹ 11 more rows
At this point, we’ve got a nice, tidy data frame.
We can use regular tidyverse methods to compute our table:
j7 %>% group_by(char_type, type) %>%
summarise(n=n()) %>%
mutate(denom=sum(n)) %>%
mutate(prop=n/denom) %>%
select(char_type, type, prop) %>%
pivot_wider(names_from=char_type, values_from=prop) %>%
select(type, Person, Computer) %>%
arrange(-Person)
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by char_type and type.
## ℹ Output is grouped by char_type.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(char_type, type))` for per-operation grouping
## (`?dplyr::dplyr_by`) instead.
## # A tibble: 9 × 3
## type Person Computer
## <chr> <dbl> <dbl>
## 1 Statement 0.357 NA
## 2 Conversation 0.214 0.286
## 3 Command 0.214 NA
## 4 Comment 0.0714 NA
## 5 Question 0.0714 NA
## 6 Wake Word 0.0714 NA
## 7 Alert NA 0.286
## 8 Info NA 0.143
## 9 Response NA 0.286
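The NA entries above appear wherever a type never occurs for one of the speaker types. A slightly more compact variant, sketched here on toy data, uses count() for the tally and values_fill to turn those gaps into zeros.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Toy utterance-level data: one row per (utterance, type) pair
toy <- tibble(
  char_type = c("Person", "Person", "Person", "Computer"),
  type      = c("Command", "Command", "Question", "Response")
)

props <- toy %>%
  count(char_type, type) %>%
  group_by(char_type) %>%
  mutate(prop = n / sum(n)) %>%   # proportion within each speaker type
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = char_type, values_from = prop,
              values_fill = 0) %>%
  select(type, Person, Computer) %>%
  arrange(desc(Person))

props
```

Grouping explicitly by char_type makes the denominator visible in the code instead of relying on the grouping that summarise() leaves behind.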
And we’re all set! Now we just need to repeat the above workflow, pointed at the true file:
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by char_type and type.
## ℹ Output is grouped by char_type.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(char_type, type))` for per-operation grouping
## (`?dplyr::dplyr_by`) instead.
| type | Person | Computer |
|---|---|---|
| Command | 33.99% | 0.00% |
| Wake Word | 31.51% | 0.00% |
| Statement | 13.23% | 0.00% |
| Question | 10.31% | 0.00% |
| Conversation | 5.70% | 1.21% |
| Password | 2.56% | 0.00% |
| Comment | 1.97% | 0.00% |
| command | 0.66% | 0.00% |
| question | 0.07% | 0.00% |
| Response | 0.00% | 56.36% |
| Alert | 0.00% | 15.45% |
| Info | 0.00% | 8.48% |
| Countdown | 0.00% | 8.18% |
| Clarification | 0.00% | 6.36% |
| Progress | 0.00% | 3.94% |
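The code that produced the table above is not shown, so here is one way the flattening steps could be packaged for reuse. The helper name flatten_utterances() is hypothetical, and the toy list stands in for the parsed full file (presumably what full_j holds in the code further down, though its definition is not shown).

```r
library(tibble)
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical helper: parsed JSON list -> one row per utterance,
# one list column per attribute
flatten_utterances <- function(parsed) {
  parsed %>%
    enframe(name = "episode.id", value = "episode") %>%
    mutate(episode = map(episode, enframe,
                         name = "utterance.id", value = "utterance")) %>%
    unnest(episode) %>%
    mutate(utterance = map(utterance, enframe)) %>%
    unnest(utterance) %>%
    pivot_wider(names_from = name, values_from = value)
}

# Toy stand-in for the parsed file (abbreviated attributes)
toy <- list(
  "102" = list(
    "255" = list(char_type = "Person",   type = list("Statement")),
    "422" = list(char_type = "Computer", type = list("Response"))
  )
)

flat <- flatten_utterances(toy) %>%
  select(episode.id, utterance.id, char_type, type) %>%
  unnest(char_type) %>%
  unnest_longer(type)

flat
```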
Note that these numbers don’t match what is in the paper; I suspect
that the authors may have used the
pri_type field instead of the type field.
Repeating our analysis with that field, we see:
full_j %>% enframe %>%
rename(episode.id=name, episode=value) %>%
mutate(episode=purrr::map(episode,enframe)) %>%
unnest(episode) %>%
rename(utterance.id=name, utterance=value) %>%
mutate(utterance=purrr::map(utterance,enframe)) %>%
unnest(utterance) %>%
pivot_wider(names_from=name, values_from=value) %>%
select(episode.id, utterance.id, char_type, pri_type) %>%
unnest(c(char_type, pri_type)) %>%
group_by(char_type, pri_type) %>%
summarise(n=n()) %>%
mutate(denom=sum(n)) %>%
mutate(prop=n/denom) %>%
select(char_type, pri_type, prop) %>%
pivot_wider(names_from=char_type, values_from=prop, values_fill=0.0) %>%
select(pri_type, Person, Computer) %>%
arrange(-Person, -Computer) %>%
gt %>% fmt_percent(columns=c(Person, Computer))
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by char_type and pri_type.
## ℹ Output is grouped by char_type.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(char_type, pri_type))` for per-operation grouping
## (`?dplyr::dplyr_by`) instead.
| pri_type | Person | Computer |
|---|---|---|
| Command | 59.42% | 0.00% |
| Statement | 21.23% | 0.00% |
| Question | 17.09% | 0.00% |
| Password | 0.75% | 0.00% |
| Wake Word | 0.75% | 0.00% |
| Comment | 0.63% | 0.00% |
| Conversation | 0.13% | 0.20% |
| Response | 0.00% | 66.27% |
| Alert | 0.00% | 19.25% |
| Clarification | 0.00% | 8.33% |
| Info | 0.00% | 2.78% |
| Countdown | 0.00% | 1.79% |
| Progress | 0.00% | 1.39% |
This still doesn’t match what is in the paper, and I think it’s because I was mistaken about the denominator.
full_j %>% enframe %>%
rename(episode.id=name, episode=value) %>%
mutate(episode=purrr::map(episode,enframe)) %>%
unnest(episode) %>%
rename(utterance.id=name, utterance=value) %>%
mutate(utterance=purrr::map(utterance,enframe)) %>%
unnest(utterance) %>%
pivot_wider(names_from=name, values_from=value) %>%
select(episode.id, utterance.id, char_type, type) %>%
unnest(c(char_type, type)) %>% unnest(type) %>%
janitor::tabyl(type, char_type) %>% data.frame %>%
mutate(Comp.prop=Computer/sum(Computer), Person.prop=Person/sum(Person)) %>%
arrange(-Person.prop, -Comp.prop) %>%
select(type,Person=Person.prop, Computer=Comp.prop) %>%
gt
| type | Person | Computer |
|---|---|---|
| Command | 0.3399122807 | 0.00000000 |
| Wake Word | 0.3150584795 | 0.00000000 |
| Statement | 0.1323099415 | 0.00000000 |
| Question | 0.1030701754 | 0.00000000 |
| Conversation | 0.0570175439 | 0.01212121 |
| Password | 0.0255847953 | 0.00000000 |
| Comment | 0.0197368421 | 0.00000000 |
| command | 0.0065789474 | 0.00000000 |
| question | 0.0007309942 | 0.00000000 |
| Response | 0.0000000000 | 0.56363636 |
| Alert | 0.0000000000 | 0.15454545 |
| Info | 0.0000000000 | 0.08484848 |
| Countdown | 0.0000000000 | 0.08181818 |
| Clarification | 0.0000000000 | 0.06363636 |
| Progress | 0.0000000000 | 0.03939394 |
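The same column-wise proportions can be checked with base R alone, which is a handy cross-check when a denominator is in doubt. This is a sketch on toy data; prop.table() with margin = 2 divides each cell by its column total.

```r
# Toy utterance-level data: one row per (utterance, type) pair
toy <- data.frame(
  char_type = c("Person", "Person", "Person", "Computer", "Computer"),
  type      = c("Command", "Command", "Question", "Response", "Alert")
)

# Cross-tabulate type (rows) by speaker type (columns)
counts <- table(toy$type, toy$char_type)

# margin = 2: proportions within each column, i.e. within each speaker type
col_props <- prop.table(counts, margin = 2)
round(col_props, 3)
colSums(col_props)
```

Switching to margin = 1 gives proportions within each type instead, and omitting margin divides by the grand total, so the three candidate denominators are easy to compare.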
This still isn’t in alignment with what the paper reported, but from this point, the issue becomes one of replicating an analysis rather than parsing JSON, and thus is out of scope for our lab. :-D