This walkthrough illustrates one possible workflow for reproducing an analysis from the following paper:
Axtell B, Munteanu C. Tea, Earl Grey, Hot: Designing Speech Interactions from the Imagined Ideal of Star Trek. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. Yokohama, Japan: ACM; 2021. p. 1–14.
The associated data can be found here (and is on Posit.cloud).
library(tidyverse)
library(here)
library(jsonlite)
library(gt)
Check the file in a text editor (or the JSON viewer of your choice; Firefox will work as well):
{
"102": {
"255": {
"char": "Tasha",
"line": "Battle bridge.",
"direction": "The doors snap closed and the lift moves. Riker looks Tasha over waiting then:",
"type": [
"Statement"
],
"pri_type": "Statement",
"domain": [
"IoT"
],
"sub-domain": [
"Turbolift"
],
"nv_resp": true,
"interaction": "Battle bridge.",
"char_type": "Person",
"is_fed": true,
"error": false
},
...
The outermost part of the document is a dict whose keys are episode IDs; each value is itself a dictionary mapping utterance IDs to a dictionary holding the actual information about that utterance.
Since we want a final tally of interaction type by speaker type, we’ll want something like this: each row should be an utterance, with columns for the interaction type and the speaker (character) type. Looking at the structure of our data, it looks like the important inner keys are going to be type and char_type.
jsonlite::read_json() is a good place to start; note that for some JSON files, it can automatically do a fair bit of the work that we are about to do by hand. If your JSON file is relatively simple, and does not have deeply-nested objects with varying dimensions, as_tibble() may be able to coerce the nested list into a dataframe… but not always, and not for this JSON file, because of the amount of heterogeneity in the data. So we will be doing this “by hand”.
j_path <- here("data/lab11/teaearlgreyhotdataset.json")
j <- read_json(j_path)
Note what we get back from read_json():
typeof(j)
## [1] "list"
length(j)
## [1] 137
Why do we get a list and not a data frame? Because JSON allows for arbitrarily complicated nested structures, and so we don’t have any guarantee that what’s in that file will be amenable to flattening without some very file-specific work. So jsonlite punts on the issue and makes a list-of-lists for us, so we can deal with it ourselves.
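As an aside, you can ask jsonlite to attempt that simplification itself; for a file as heterogeneous as this one, the result is still riddled with nested lists, so it doesn’t buy us much. A sketch:
# Ask jsonlite to simplify what it can while reading; with this file, we
# still end up with nested list structures rather than a clean dataframe.
j_auto <- read_json(j_path, simplifyVector = TRUE)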
Before we continue, note that this file is pretty big. To make it a little easier to work with while we’re just getting started, I am going to make a small version that only has two episodes’ worth of data. I’ll do this “by hand” outside of R, in my text editor.
j_path <- here("data/lab11/teaearlgreyhotdataset.small.json")
j <- read_json(j_path)
length(j)
## [1] 2
This is a named list, so we can access its elements by name:
names(j)
## [1] "102" "104"
length(j$"102")
## [1] 11
And the sub-lists are themselves named:
names(j$"102")
## [1] "255" "345" "347" "363" "421" "422" "426" "427" "428" "429" "430"
Look back at the JSON file itself; you’ll see that these correspond to the dictionary keys in the file.
It’s turtles all the way down, to the final layer of the information objects themselves:
names(j$"102"$"255")
## [1] "char" "line" "direction" "type" "pri_type"
## [6] "domain" "sub-domain" "nv_resp" "interaction" "char_type"
## [11] "is_fed" "error"
j$"102"$"255"$"char"
## [1] "Tasha"
Now, we are going to work our way through the document, flattening and filtering as we go, until we end up with a tidy data frame. Step one: turn our list into a very simple dataframe, using enframe():
j2 <- j %>% enframe
j2
## # A tibble: 2 × 2
## name value
## <chr> <list>
## 1 102 <named list [11]>
## 2 104 <named list [3]>
See what happened here? We started with a named list, and ended up with a dataframe where one column is the names of the list and the second column is the values.
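If that move feels abstract, here it is on a toy list; a minimal sketch:
# A tiny named list...
toy <- list(a = 1, b = 2)
# ...becomes a two-column tibble: list names in `name`, contents in `value`.
enframe(toy)
## # A tibble: 2 × 2
##   name  value
##   <chr> <list>
## 1 a     <dbl [1]>
## 2 b     <dbl [1]>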
Next, we’re going to go one more level down, but before we do, let’s rename our columns. By default, enframe just gives the uninformative column names “name” and “value”, which (as you will see) can quickly become confusing. We can do this using rename(), or we can override those defaults when we call enframe.
j2 <- j2 %>% rename(episode.id=name, episode=value)
j2
## # A tibble: 2 × 2
## episode.id episode
## <chr> <list>
## 1 102 <named list [11]>
## 2 104 <named list [3]>
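For completeness, the override-at-the-source version looks like this; it produces exactly the same result in one step:
# Same result as enframe() followed by rename():
j2 <- j %>% enframe(name = "episode.id", value = "episode")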
Now that we’ve got useful names, let’s go one level deeper (Inception-style) and turn our episodes into nested dataframes:
j3 <- j2 %>% mutate(episode=map(episode,enframe))
j3
## # A tibble: 2 × 2
## episode.id episode
## <chr> <list>
## 1 102 <tibble [11 × 2]>
## 2 104 <tibble [3 × 2]>
Now, if we unnest the episode column…
j3 <- j3 %>% unnest(episode)
j3
## # A tibble: 14 × 3
## episode.id name value
## <chr> <chr> <list>
## 1 102 255 <named list [12]>
## 2 102 345 <named list [12]>
## 3 102 347 <named list [12]>
## 4 102 363 <named list [12]>
## 5 102 421 <named list [12]>
## 6 102 422 <named list [12]>
## 7 102 426 <named list [12]>
## 8 102 427 <named list [12]>
## 9 102 428 <named list [12]>
## 10 102 429 <named list [12]>
## 11 102 430 <named list [12]>
## 12 104 4 <named list [12]>
## 13 104 45 <named list [12]>
## 14 104 54 <named list [12]>
Now, see what we’ve got? A tidy data frame, one row per utterance, with the default column names from enframe. Before we forget, let’s rename our columns:
j3 <- j3 %>% rename(utterance.id=name, utterance=value)
j3
## # A tibble: 14 × 3
## episode.id utterance.id utterance
## <chr> <chr> <list>
## 1 102 255 <named list [12]>
## 2 102 345 <named list [12]>
## 3 102 347 <named list [12]>
## 4 102 363 <named list [12]>
## 5 102 421 <named list [12]>
## 6 102 422 <named list [12]>
## 7 102 426 <named list [12]>
## 8 102 427 <named list [12]>
## 9 102 428 <named list [12]>
## 10 102 429 <named list [12]>
## 11 102 430 <named list [12]>
## 12 104 4 <named list [12]>
## 13 104 45 <named list [12]>
## 14 104 54 <named list [12]>
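As an aside, tidyr’s unnest_longer() can collapse the map(enframe) + unnest() + rename() dance into a single call; a sketch (not the route we’re taking here, but worth knowing about):
# One row per utterance, straight from j2; the inner names land in their own column:
j2 %>%
  unnest_longer(episode, values_to = "utterance", indices_to = "utterance.id")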
I bet you can guess what comes next: that’s right, we are going to enframe our utterances, and unnest!
j4 <- j3 %>% mutate(utterance=map(utterance,enframe))
j4
## # A tibble: 14 × 3
## episode.id utterance.id utterance
## <chr> <chr> <list>
## 1 102 255 <tibble [12 × 2]>
## 2 102 345 <tibble [12 × 2]>
## 3 102 347 <tibble [12 × 2]>
## 4 102 363 <tibble [12 × 2]>
## 5 102 421 <tibble [12 × 2]>
## 6 102 422 <tibble [12 × 2]>
## 7 102 426 <tibble [12 × 2]>
## 8 102 427 <tibble [12 × 2]>
## 9 102 428 <tibble [12 × 2]>
## 10 102 429 <tibble [12 × 2]>
## 11 102 430 <tibble [12 × 2]>
## 12 104 4 <tibble [12 × 2]>
## 13 104 45 <tibble [12 × 2]>
## 14 104 54 <tibble [12 × 2]>
When we unnest this time, let’s see what happens:
j5 <- j4 %>% unnest(utterance)
j5
## # A tibble: 168 × 4
## episode.id utterance.id name value
## <chr> <chr> <chr> <list>
## 1 102 255 char <chr [1]>
## 2 102 255 line <chr [1]>
## 3 102 255 direction <chr [1]>
## 4 102 255 type <list [1]>
## 5 102 255 pri_type <chr [1]>
## 6 102 255 domain <list [1]>
## 7 102 255 sub-domain <list [1]>
## 8 102 255 nv_resp <lgl [1]>
## 9 102 255 interaction <chr [1]>
## 10 102 255 char_type <chr [1]>
## # ℹ 158 more rows
Woah! Now we’ve got one line for each key in the inner-most part of the JSON file, with a corresponding list value. One thing to note: the value column here is of type list, but the different rows have different types of lists in them - some have characters, some have logicals, some have lists, etc.

Our next step will be to pivot this into a slightly wider dataframe:
j6 <- j5 %>% pivot_wider(names_from=name, values_from=value)
j6
## # A tibble: 14 × 14
## episode.id utterance.id char line direction type pri_type domain
## <chr> <chr> <list> <list> <list> <list> <list> <list>
## 1 102 255 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 2 102 345 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 3 102 347 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 4 102 363 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 5 102 421 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 6 102 422 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 7 102 426 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 8 102 427 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 9 102 428 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 10 102 429 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 11 102 430 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 12 104 4 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 13 104 45 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 14 104 54 <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## # ℹ 6 more variables: `sub-domain` <list>, nv_resp <list>, interaction <list>,
## # char_type <list>, is_fed <list>, error <list>
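As another aside, tidyr’s unnest_wider() is purpose-built for this spread-a-named-list-into-columns move, and could have replaced the whole enframe/unnest/pivot_wider sequence; a sketch:
# From j3 (one row per utterance, utterance as a named list-column),
# spread each utterance's fields into their own columns:
j3 %>% unnest_wider(utterance)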
This is starting to look like we’re getting closer to a dataframe we can work with! We still have a little bit of work to do, however. First, let’s make our lives simpler by getting rid of some of the columns we don’t need:
j6 <- j6 %>% select(episode.id, utterance.id, char_type, type)
j6
## # A tibble: 14 × 4
## episode.id utterance.id char_type type
## <chr> <chr> <list> <list>
## 1 102 255 <chr [1]> <list [1]>
## 2 102 345 <chr [1]> <list [1]>
## 3 102 347 <chr [1]> <list [1]>
## 4 102 363 <chr [1]> <list [3]>
## 5 102 421 <chr [1]> <list [1]>
## 6 102 422 <chr [1]> <list [1]>
## 7 102 426 <chr [1]> <list [2]>
## 8 102 427 <chr [1]> <list [2]>
## 9 102 428 <chr [1]> <list [2]>
## 10 102 429 <chr [1]> <list [2]>
## 11 102 430 <chr [1]> <list [2]>
## 12 104 4 <chr [1]> <list [1]>
## 13 104 45 <chr [1]> <list [1]>
## 14 104 54 <chr [1]> <list [1]>
Note the length of some of the entries in the type column: utterance 363 in episode 102 has length 3. Looking at the JSON file, we can see that this is because this utterance was assigned three types: “Wake Word”, “Conversation”, and “Question”.
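If you’d rather not eyeball the JSON for these, lengths() works directly on a list-column; a quick sketch:
# lengths() returns the length of each element of the type list-column,
# so this filters down to the multi-type utterances (363, 426-430 above):
j6 %>%
  mutate(n_types = lengths(type)) %>%
  filter(n_types > 1)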
Because the “type” column is itself a container JSON element (a list), we actually will need to unnest it twice. This is the sort of thing that jsonlite has helpers to do, but for a simple file format like this one I personally usually just do it all by hand, so I can be sure about what’s going on.
j7 <- j6 %>% unnest(char_type) %>% unnest(type) %>% unnest(type)
j7
## # A tibble: 21 × 4
## episode.id utterance.id char_type type
## <chr> <chr> <chr> <chr>
## 1 102 255 Person Statement
## 2 102 345 Person Command
## 3 102 347 Person Statement
## 4 102 363 Person Wake Word
## 5 102 363 Person Question
## 6 102 363 Person Conversation
## 7 102 421 Person Command
## 8 102 422 Computer Response
## 9 102 426 Computer Info
## 10 102 426 Computer Alert
## # ℹ 11 more rows
At this point, we’ve got a nice, tidy data frame, and we are ready to use regular methods to compute our table:
j7 %>% group_by(char_type, type) %>%
summarise(n=n()) %>%
mutate(denom=sum(n)) %>%
mutate(prop=n/denom) %>%
select(char_type, type, prop) %>%
pivot_wider(names_from=char_type, values_from=prop) %>%
select(type, Person, Computer) %>%
arrange(-Person)
## `summarise()` has grouped output by 'char_type'. You can override using the
## `.groups` argument.
## # A tibble: 9 × 3
## type Person Computer
## <chr> <dbl> <dbl>
## 1 Statement 0.357 NA
## 2 Conversation 0.214 0.286
## 3 Command 0.214 NA
## 4 Comment 0.0714 NA
## 5 Question 0.0714 NA
## 6 Wake Word 0.0714 NA
## 7 Alert NA 0.286
## 8 Info NA 0.143
## 9 Response NA 0.286
And we’re all set! Now we just need to repeat the above workflow, pointed at the full file:
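The code for that full-file run isn’t reproduced above; here is a sketch of it, mirroring the pipeline we just built (and the pri_type version further below), pointed at the complete dataset:
# Back to the full dataset:
full_j <- read_json(here("data/lab11/teaearlgreyhotdataset.json"))

full_j %>% enframe %>%
  rename(episode.id=name, episode=value) %>%
  mutate(episode=map(episode,enframe)) %>%
  unnest(episode) %>%
  rename(utterance.id=name, utterance=value) %>%
  mutate(utterance=map(utterance,enframe)) %>%
  unnest(utterance) %>%
  pivot_wider(names_from=name, values_from=value) %>%
  select(episode.id, utterance.id, char_type, type) %>%
  unnest(char_type) %>% unnest(type) %>% unnest(type) %>%
  group_by(char_type, type) %>%
  summarise(n=n()) %>%
  mutate(denom=sum(n)) %>%
  mutate(prop=n/denom) %>%
  select(char_type, type, prop) %>%
  pivot_wider(names_from=char_type, values_from=prop, values_fill=0.0) %>%
  select(type, Person, Computer) %>%
  arrange(-Person, -Computer) %>%
  gt %>% fmt_percent(columns=c(Person, Computer))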
## `summarise()` has grouped output by 'char_type'. You can override using the
## `.groups` argument.
type | Person | Computer |
---|---|---|
Command | 33.99% | 0.00% |
Wake Word | 31.51% | 0.00% |
Statement | 13.23% | 0.00% |
Question | 10.31% | 0.00% |
Conversation | 5.70% | 1.21% |
Password | 2.56% | 0.00% |
Comment | 1.97% | 0.00% |
command | 0.66% | 0.00% |
question | 0.07% | 0.00% |
Response | 0.00% | 56.36% |
Alert | 0.00% | 15.45% |
Info | 0.00% | 8.48% |
Countdown | 0.00% | 8.18% |
Clarification | 0.00% | 6.36% |
Progress | 0.00% | 3.94% |
Note that these numbers don’t match what is in the paper; I suspect the authors may have used the pri_type field instead of the type field. Repeating our analysis with that field, we see:
full_j %>% enframe %>%
rename(episode.id=name, episode=value) %>%
mutate(episode=map(episode,enframe)) %>%
unnest(episode) %>%
rename(utterance.id=name, utterance=value) %>%
mutate(utterance=map(utterance,enframe)) %>%
unnest(utterance) %>%
pivot_wider(names_from=name, values_from=value) %>%
select(episode.id, utterance.id, char_type, pri_type) %>%
unnest(c(char_type, pri_type)) %>%
group_by(char_type, pri_type) %>%
summarise(n=n()) %>%
mutate(denom=sum(n)) %>%
mutate(prop=n/denom) %>%
select(char_type, pri_type, prop) %>%
pivot_wider(names_from=char_type, values_from=prop, values_fill=0.0) %>%
select(pri_type, Person, Computer) %>%
arrange(-Person, -Computer) %>%
gt %>% fmt_percent(columns=c(Person, Computer))
## `summarise()` has grouped output by 'char_type'. You can override using the
## `.groups` argument.
pri_type | Person | Computer |
---|---|---|
Command | 59.42% | 0.00% |
Statement | 21.23% | 0.00% |
Question | 17.09% | 0.00% |
Password | 0.75% | 0.00% |
Wake Word | 0.75% | 0.00% |
Comment | 0.63% | 0.00% |
Conversation | 0.13% | 0.20% |
Response | 0.00% | 66.27% |
Alert | 0.00% | 19.25% |
Clarification | 0.00% | 8.33% |
Info | 0.00% | 2.78% |
Countdown | 0.00% | 1.79% |
Progress | 0.00% | 1.39% |
This still doesn’t match what is in the paper, and I think it’s because I was mistaken about the denominator.
full_j %>% enframe %>%
rename(episode.id=name, episode=value) %>%
mutate(episode=map(episode,enframe)) %>%
unnest(episode) %>%
rename(utterance.id=name, utterance=value) %>%
mutate(utterance=map(utterance,enframe)) %>%
unnest(utterance) %>%
pivot_wider(names_from=name, values_from=value) %>%
select(episode.id, utterance.id, char_type, type) %>%
unnest(c(char_type, type)) %>% unnest(type) %>%
janitor::tabyl(type, char_type) %>% data.frame %>%
mutate(Comp.prop=Computer/sum(Computer), Person.prop=Person/sum(Person)) %>% arrange(-Person.prop, -Comp.prop) %>%
select(type,Person=Person.prop, Computer=Comp.prop) %>%
gt
type | Person | Computer |
---|---|---|
Command | 0.3399122807 | 0.00000000 |
Wake Word | 0.3150584795 | 0.00000000 |
Statement | 0.1323099415 | 0.00000000 |
Question | 0.1030701754 | 0.00000000 |
Conversation | 0.0570175439 | 0.01212121 |
Password | 0.0255847953 | 0.00000000 |
Comment | 0.0197368421 | 0.00000000 |
command | 0.0065789474 | 0.00000000 |
question | 0.0007309942 | 0.00000000 |
Response | 0.0000000000 | 0.56363636 |
Alert | 0.0000000000 | 0.15454545 |
Info | 0.0000000000 | 0.08484848 |
Countdown | 0.0000000000 | 0.08181818 |
Clarification | 0.0000000000 | 0.06363636 |
Progress | 0.0000000000 | 0.03939394 |
This still doesn’t align with what the paper reported, but at this point you get the idea. :-D