Introduction

This walkthrough illustrates a possible workflow for reproducing data from the following paper:

Axtell B, Munteanu C. Tea, Earl Grey, Hot: Designing Speech Interactions from the Imagined Ideal of Star Trek. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. Yokohama, Japan: ACM; 2021. p. 1–14.

The associated data can be found here (and is on Posit.cloud).

Setup

library(tidyverse)  # dplyr, tidyr, purrr, tibble, etc.
library(here)       # project-relative file paths
library(jsonlite)   # JSON parsing
library(gt)         # display tables

Overall workflow:

  1. Know your data
  2. Think about the desired final structure
  3. Read the JSON into a gnarly list
  4. Work from the outside inwards

Know your data:

Check the file in a text editor (or the JSON viewer of your choice; Firefox will work as well):

{
    "102": {
        "255": {
            "char": "Tasha",
            "line": "Battle bridge.",
            "direction": "The doors snap closed and the lift moves. Riker looks Tasha over  waiting  then:",
            "type": [
                "Statement"
            ],
            "pri_type": "Statement",
            "domain": [
                "IoT"
            ],
            "sub-domain": [
                "Turbolift"
            ],
            "nv_resp": true,
            "interaction": "Battle bridge.",
            "char_type": "Person",
            "is_fed": true,
            "error": false
        },
        ...

The outermost part of the document is a dict; its keys are episode IDs, and its values are dictionaries that in turn map utterance IDs to dictionaries holding the actual information about each utterance.

Thinking about the desired final structure:

Since we want a final tally of interaction type by speaker type, we’ll want something like this:

Each row should be an utterance, with columns for:

  1. Episode ID
  2. Utterance ID within that episode
  3. The category of speaker (person or computer)
  4. The type of interaction (command, wake word, etc.)
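As a concrete sketch of that target shape, hand-written from the example utterance above (just an illustration, not part of the real pipeline):

tribble(
  ~episode.id, ~utterance.id, ~char_type, ~type,
  "102",       "255",         "Person",   "Statement"
)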

Looking at the structure of our data, it looks like the important inner keys are going to be type and char_type.

Initial loading

jsonlite::read_json() is a good place to start. Note that for some JSON files it can automatically do a fair bit of the work we are about to do by hand: if your JSON file is relatively simple, and does not have deeply-nested objects with varying dimensions, as_tibble() may be able to coerce the resulting nested list into a dataframe. But not always, and not for this JSON file, because of the amount of heterogeneity in the data. So we will be doing this “by hand”.

j_path <- here("data/lab11/teaearlgreyhotdataset.json")
j <- read_json(j_path)

Note what we get back from read_json:

typeof(j)
## [1] "list"
length(j)
## [1] 137

Why do we get a list and not a data frame? Because JSON allows for arbitrarily complicated nested structures, so there is no guarantee that what’s in the file will be amenable to flattening without some very file-specific work. jsonlite therefore punts on the issue and hands us a list-of-lists to deal with ourselves.
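If you do want jsonlite to attempt the simplification itself, read_json() takes a simplifyVector argument. A sketch (for this file, simplification mostly affects the innermost JSON arrays; the outer objects-of-objects stay as nested lists):

j_auto <- read_json(j_path, simplifyVector = TRUE)
typeof(j_auto)   # still "list": objects don't collapse into a data frame here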

Before we continue, note that this file is pretty big. To make it a little easier to work with while we’re just getting started, I am going to make a small version that only has two episodes’ worth of data. I’ll do this “by hand” outside of R, in my text editor.
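(You could also carve out the subset from inside R; a sketch using jsonlite’s write_json(). One caveat: the boxing defaults mean the output won’t be byte-identical to hand-editing — by default scalars are written as length-1 arrays, while auto_unbox=TRUE would instead unbox length-1 arrays like type.)

# keep two episodes' worth of data and write them to a new file
j_small <- j[c("102", "104")]
write_json(j_small, here("data/lab11/teaearlgreyhotdataset.small.json"), pretty=TRUE)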

j_path <- here("data/lab11/teaearlgreyhotdataset.small.json")
j <- read_json(j_path)
length(j)
## [1] 2

This is a named list, so we can access its elements by name:

names(j)
## [1] "102" "104"
length(j$"102")
## [1] 11

And the sub-lists are themselves named:

names(j$"102")
##  [1] "255" "345" "347" "363" "421" "422" "426" "427" "428" "429" "430"

Look back at the JSON file itself; you’ll see that these correspond to the dictionary keys in the file.

It’s turtles all the way down, to the final layer of the information objects themselves:

names(j$"102"$"255")
##  [1] "char"        "line"        "direction"   "type"        "pri_type"   
##  [6] "domain"      "sub-domain"  "nv_resp"     "interaction" "char_type"  
## [11] "is_fed"      "error"
j$"102"$"255"$"char"
## [1] "Tasha"
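As an aside, purrr’s pluck() expresses the same chained lookup, which some find easier to read:

pluck(j, "102", "255", "char")   # same as j$"102"$"255"$"char"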

Work from the outside in

Now, we are going to work our way through the document, flattening and filtering as we go, until we end up with a tidy data frame. Step one: turn our list into a very simple DF, using enframe():

j2 <- j %>% enframe
j2
## # A tibble: 2 × 2
##   name  value            
##   <chr> <list>           
## 1 102   <named list [11]>
## 2 104   <named list [3]>

See what happened here? We started with a named list, and ended up with a dataframe where one column is the names of the list and the second column is the values.

Next, we’re going to go one more level down, but before we do, let’s rename our columns: by default, enframe just gives the uninformative column names “name” and “value”, which (as you will see) can quickly become confusing. We can do this using rename(), or we can override those defaults when we call enframe (an example of the latter follows below).

j2 <- j2 %>% rename(episode.id=name, episode=value) 
j2
## # A tibble: 2 × 2
##   episode.id episode          
##   <chr>      <list>           
## 1 102        <named list [11]>
## 2 104        <named list [3]>
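For reference, the override-at-the-source alternative looks like this (equivalent to the rename() above):

j %>% enframe(name="episode.id", value="episode")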

Now that we’ve got useful names, let’s go one level deeper (Inception-style) and turn our episodes into nested dataframes:

j3 <- j2 %>% mutate(episode=map(episode,enframe))
j3
## # A tibble: 2 × 2
##   episode.id episode          
##   <chr>      <list>           
## 1 102        <tibble [11 × 2]>
## 2 104        <tibble [3 × 2]>

Now, if we unnest the episode column…

j3 <- j3 %>% unnest(episode)
j3
## # A tibble: 14 × 3
##    episode.id name  value            
##    <chr>      <chr> <list>           
##  1 102        255   <named list [12]>
##  2 102        345   <named list [12]>
##  3 102        347   <named list [12]>
##  4 102        363   <named list [12]>
##  5 102        421   <named list [12]>
##  6 102        422   <named list [12]>
##  7 102        426   <named list [12]>
##  8 102        427   <named list [12]>
##  9 102        428   <named list [12]>
## 10 102        429   <named list [12]>
## 11 102        430   <named list [12]>
## 12 104        4     <named list [12]>
## 13 104        45    <named list [12]>
## 14 104        54    <named list [12]>

Now, see what we’ve got? A tidy data frame, one row per utterance, with the default column names from enframe. Before we forget, let’s rename our columns:

j3 <- j3 %>% rename(utterance.id=name, utterance=value)
j3
## # A tibble: 14 × 3
##    episode.id utterance.id utterance        
##    <chr>      <chr>        <list>           
##  1 102        255          <named list [12]>
##  2 102        345          <named list [12]>
##  3 102        347          <named list [12]>
##  4 102        363          <named list [12]>
##  5 102        421          <named list [12]>
##  6 102        422          <named list [12]>
##  7 102        426          <named list [12]>
##  8 102        427          <named list [12]>
##  9 102        428          <named list [12]>
## 10 102        429          <named list [12]>
## 11 102        430          <named list [12]>
## 12 104        4            <named list [12]>
## 13 104        45           <named list [12]>
## 14 104        54           <named list [12]>

I bet you can guess what comes next: that’s right, we are going to enframe our utterances, and unnest!

j4 <- j3 %>% mutate(utterance=map(utterance,enframe))
j4
## # A tibble: 14 × 3
##    episode.id utterance.id utterance        
##    <chr>      <chr>        <list>           
##  1 102        255          <tibble [12 × 2]>
##  2 102        345          <tibble [12 × 2]>
##  3 102        347          <tibble [12 × 2]>
##  4 102        363          <tibble [12 × 2]>
##  5 102        421          <tibble [12 × 2]>
##  6 102        422          <tibble [12 × 2]>
##  7 102        426          <tibble [12 × 2]>
##  8 102        427          <tibble [12 × 2]>
##  9 102        428          <tibble [12 × 2]>
## 10 102        429          <tibble [12 × 2]>
## 11 102        430          <tibble [12 × 2]>
## 12 104        4            <tibble [12 × 2]>
## 13 104        45           <tibble [12 × 2]>
## 14 104        54           <tibble [12 × 2]>

When we unnest this time, let’s see what happens:

j5 <- j4 %>% unnest(utterance)
j5
## # A tibble: 168 × 4
##    episode.id utterance.id name        value     
##    <chr>      <chr>        <chr>       <list>    
##  1 102        255          char        <chr [1]> 
##  2 102        255          line        <chr [1]> 
##  3 102        255          direction   <chr [1]> 
##  4 102        255          type        <list [1]>
##  5 102        255          pri_type    <chr [1]> 
##  6 102        255          domain      <list [1]>
##  7 102        255          sub-domain  <list [1]>
##  8 102        255          nv_resp     <lgl [1]> 
##  9 102        255          interaction <chr [1]> 
## 10 102        255          char_type   <chr [1]> 
## # ℹ 158 more rows

Woah! Now we’ve got one row for each key in the innermost part of the JSON file, with a corresponding list value. A few things to note:

  1. The value column here is of type list, but different rows hold different types of lists: some contain characters, some logicals, some lists, etc.
  2. Some of those lists have different numbers of entries; most have 1, but some have 0 (the sketch after this list checks this).
  3. The actual data we want (the speaker, the interaction type, etc.) is still locked away in a list.
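A quick sketch to verify those first two observations:

# tally the length of each value, per inner key
j5 %>% mutate(len=lengths(value)) %>% count(name, len)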

Our next step will be to pivot this into a slightly wider dataframe:

j6 <- j5 %>% pivot_wider(names_from=name, values_from=value)
j6
## # A tibble: 14 × 14
##    episode.id utterance.id char      line      direction type   pri_type  domain
##    <chr>      <chr>        <list>    <list>    <list>    <list> <list>    <list>
##  1 102        255          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
##  2 102        345          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
##  3 102        347          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
##  4 102        363          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
##  5 102        421          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
##  6 102        422          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
##  7 102        426          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
##  8 102        427          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
##  9 102        428          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 10 102        429          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 11 102        430          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 12 104        4            <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 13 104        45           <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 14 104        54           <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## # ℹ 6 more variables: `sub-domain` <list>, nv_resp <list>, interaction <list>,
## #   char_type <list>, is_fed <list>, error <list>

Now we’re getting closer to a dataframe we can work with! We still have a little bit of work to do, however. First, let’s make our lives simpler by getting rid of the columns we don’t need:

j6 <- j6 %>% select(episode.id, utterance.id, char_type, type)
j6
## # A tibble: 14 × 4
##    episode.id utterance.id char_type type      
##    <chr>      <chr>        <list>    <list>    
##  1 102        255          <chr [1]> <list [1]>
##  2 102        345          <chr [1]> <list [1]>
##  3 102        347          <chr [1]> <list [1]>
##  4 102        363          <chr [1]> <list [3]>
##  5 102        421          <chr [1]> <list [1]>
##  6 102        422          <chr [1]> <list [1]>
##  7 102        426          <chr [1]> <list [2]>
##  8 102        427          <chr [1]> <list [2]>
##  9 102        428          <chr [1]> <list [2]>
## 10 102        429          <chr [1]> <list [2]>
## 11 102        430          <chr [1]> <list [2]>
## 12 104        4            <chr [1]> <list [1]>
## 13 104        45           <chr [1]> <list [1]>
## 14 104        54           <chr [1]> <list [1]>

Note the length of some of the entries in the type column: utterance 363 in episode 102 has length 3. Looking at the JSON file, we can see that this is because this utterance was assigned three types: “Wake Word”, “Conversation”, and “Question”.

Because each type value is itself a JSON container (an array, read in as a list of length-1 character vectors), we actually need to unnest it twice: once to expand the array into one row per type, and once more to unwrap each remaining length-1 element into a plain character column. This is the sort of thing that jsonlite has helpers to do, but for a simple file format like this one I personally usually just do it all by hand, so I can be sure about what’s going on.

j7 <- j6 %>% unnest(char_type) %>% unnest(type) %>% unnest(type)
j7
## # A tibble: 21 × 4
##    episode.id utterance.id char_type type        
##    <chr>      <chr>        <chr>     <chr>       
##  1 102        255          Person    Statement   
##  2 102        345          Person    Command     
##  3 102        347          Person    Statement   
##  4 102        363          Person    Wake Word   
##  5 102        363          Person    Question    
##  6 102        363          Person    Conversation
##  7 102        421          Person    Command     
##  8 102        422          Computer  Response    
##  9 102        426          Computer  Info        
## 10 102        426          Computer  Alert       
## # ℹ 11 more rows

At this point, we’ve got a nice, tidy data frame, and we are ready to use regular methods to compute our table:

j7 %>% group_by(char_type, type) %>% 
  summarise(n=n()) %>%        # one row per (speaker type, interaction type)
  mutate(denom=sum(n)) %>%    # still grouped by char_type, so denom is per speaker type
  mutate(prop=n/denom) %>% 
  select(char_type, type, prop) %>% 
  pivot_wider(names_from=char_type, values_from=prop) %>% 
  select(type, Person, Computer) %>% 
  arrange(-Person) 
## `summarise()` has grouped output by 'char_type'. You can override using the
## `.groups` argument.
## # A tibble: 9 × 3
##   type          Person Computer
##   <chr>          <dbl>    <dbl>
## 1 Statement     0.357    NA    
## 2 Conversation  0.214     0.286
## 3 Command       0.214    NA    
## 4 Comment       0.0714   NA    
## 5 Question      0.0714   NA    
## 6 Wake Word     0.0714   NA    
## 7 Alert        NA         0.286
## 8 Info         NA         0.143
## 9 Response     NA         0.286
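An aside on that summarise() message: it is just dplyr noting that the result is still grouped by char_type, which is exactly what makes denom a per-speaker-type total. To state that explicitly (and silence the message), you can pass .groups yourself; a minimal sketch:

j7 %>% group_by(char_type, type) %>% 
  summarise(n=n(), .groups="drop_last") %>%   # same as the default, stated explicitly
  mutate(prop=n/sum(n))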

And we’re all set! Now we just need to repeat the above workflow, pointed at the full file:
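A sketch of that full run, assuming the full dataset is read into full_j (the object used in the pipelines below):

full_j <- read_json(here("data/lab11/teaearlgreyhotdataset.json"))

full_j %>% enframe %>% 
  rename(episode.id=name, episode=value) %>% 
  mutate(episode=map(episode,enframe)) %>% 
  unnest(episode) %>% 
  rename(utterance.id=name, utterance=value) %>% 
  mutate(utterance=map(utterance,enframe)) %>% 
  unnest(utterance) %>% 
  pivot_wider(names_from=name, values_from=value) %>% 
  select(episode.id, utterance.id, char_type, type) %>% 
  unnest(char_type) %>% unnest(type) %>% unnest(type) %>% 
  group_by(char_type, type) %>% 
  summarise(n=n()) %>% 
  mutate(denom=sum(n)) %>% 
  mutate(prop=n/denom) %>% 
  select(char_type, type, prop) %>% 
  pivot_wider(names_from=char_type, values_from=prop, values_fill=0.0) %>% 
  select(type, Person, Computer) %>% 
  arrange(-Person, -Computer) %>% 
  gt %>% fmt_percent(columns=c(Person, Computer))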

## `summarise()` has grouped output by 'char_type'. You can override using the
## `.groups` argument.
type            Person   Computer
Command         33.99%      0.00%
Wake Word       31.51%      0.00%
Statement       13.23%      0.00%
Question        10.31%      0.00%
Conversation     5.70%      1.21%
Password         2.56%      0.00%
Comment          1.97%      0.00%
command          0.66%      0.00%
question         0.07%      0.00%
Response         0.00%     56.36%
Alert            0.00%     15.45%
Info             0.00%      8.48%
Countdown        0.00%      8.18%
Clarification    0.00%      6.36%
Progress         0.00%      3.94%

Note that these numbers don’t match what is in the paper; I suspect the authors may have used the pri_type field instead of the type field. Repeating our analysis with that field, we see:

full_j %>% enframe %>% 
  rename(episode.id=name, episode=value) %>% 
  mutate(episode=map(episode,enframe)) %>% 
  unnest(episode) %>% 
  rename(utterance.id=name, utterance=value) %>% 
  mutate(utterance=map(utterance,enframe)) %>% 
  unnest(utterance) %>% 
  pivot_wider(names_from=name, values_from=value) %>% 
  select(episode.id, utterance.id, char_type, pri_type) %>% 
  unnest(c(char_type, pri_type)) %>%   # pri_type is a scalar, so one unnest suffices
  group_by(char_type, pri_type) %>% 
  summarise(n=n()) %>% 
  mutate(denom=sum(n)) %>% 
  mutate(prop=n/denom) %>% 
  select(char_type, pri_type, prop) %>% 
  pivot_wider(names_from=char_type, values_from=prop, values_fill=0.0) %>% 
  select(pri_type, Person, Computer) %>% 
  arrange(-Person, -Computer) %>% 
  gt %>% fmt_percent(columns=c(Person, Computer))
## `summarise()` has grouped output by 'char_type'. You can override using the
## `.groups` argument.
pri_type        Person   Computer
Command         59.42%      0.00%
Statement       21.23%      0.00%
Question        17.09%      0.00%
Password         0.75%      0.00%
Wake Word        0.75%      0.00%
Comment          0.63%      0.00%
Conversation     0.13%      0.20%
Response         0.00%     66.27%
Alert            0.00%     19.25%
Clarification    0.00%      8.33%
Info             0.00%      2.78%
Countdown        0.00%      1.79%
Progress         0.00%      1.39%

This still doesn’t match what is in the paper, and I think it’s because I was mistaken about the denominator.

full_j %>% enframe %>% 
  rename(episode.id=name, episode=value) %>% 
  mutate(episode=map(episode,enframe)) %>% 
  unnest(episode) %>% 
  rename(utterance.id=name, utterance=value) %>% 
  mutate(utterance=map(utterance,enframe)) %>% 
  unnest(utterance) %>% 
  pivot_wider(names_from=name, values_from=value) %>% 
  select(episode.id, utterance.id, char_type, type) %>% 
  unnest(c(char_type, type)) %>% unnest(type) %>% 
  janitor::tabyl(type, char_type) %>% data.frame %>% 
  mutate(Comp.prop=Computer/sum(Computer), Person.prop=Person/sum(Person)) %>% 
  arrange(-Person.prop, -Comp.prop) %>% 
  select(type, Person=Person.prop, Computer=Comp.prop) %>% 
  gt
type           Person         Computer
Command        0.3399122807   0.00000000
Wake Word      0.3150584795   0.00000000
Statement      0.1323099415   0.00000000
Question       0.1030701754   0.00000000
Conversation   0.0570175439   0.01212121
Password       0.0255847953   0.00000000
Comment        0.0197368421   0.00000000
command        0.0065789474   0.00000000
question       0.0007309942   0.00000000
Response       0.0000000000   0.56363636
Alert          0.0000000000   0.15454545
Info           0.0000000000   0.08484848
Countdown      0.0000000000   0.08181818
Clarification  0.0000000000   0.06363636
Progress       0.0000000000   0.03939394

This still isn’t in alignment with what the paper reported, but at this point you get the idea. :-D
