Introduction

This walkthrough will illustrate a potential workflow for reproducing Table 4 from the following paper:

Axtell B, Munteanu C. Tea, Earl Grey, Hot: Designing Speech Interactions from the Imagined Ideal of Star Trek. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. Yokohama Japan: ACM; 2021. p. 1–14.

The associated data can be found here (and is on Posit.cloud).

Note: There is no deliverable for this lab; follow along and experiment at your leisure!

Setup

library(tidyverse)
library(here)
library(jsonlite)
library(gt)

Overall workflow:

  1. Know your data
  2. Think about desired final structure
  3. Read JSON into gnarly list
  4. Work from the outside inwards

Know your data:

Before we even look at the JSON itself, we should read the originating paper and get a sense of the analyses performed. Were there important variables that the authors used for stratifying the study population? Were there particular dimensions over which they computed summary statistics?

Next, manually check the structure of the data file in the text editor of your choice (or a JSON viewer; Firefox works well for this):

{
    "102": {
        "255": {
            "char": "Tasha",
            "line": "Battle bridge.",
            "direction": "The doors snap closed and the lift moves. Riker looks Tasha over  waiting  then:",
            "type": [
                "Statement"
            ],
            "pri_type": "Statement",
            "domain": [
                "IoT"
            ],
            "sub-domain": [
                "Turbolift"
            ],
            "nv_resp": true,
            "interaction": "Battle bridge.",
            "char_type": "Person",
            "is_fed": true,
            "error": false
        },
        ...

What do we see?

Top-level Structure

The outermost part of the document is a dictionary whose keys are episode IDs and whose values are also dictionaries. Each of these inner dictionaries represents a single episode. What is in these dictionaries?

Episode-level Structure

Each episode-level dictionary contains utterance IDs as keys and yet another dictionary as a value. Each of these represents a single utterance. What is in these dictionaries?

Utterance-level structure

Each utterance-level dictionary contains the actual information about the utterance. This dictionary’s keys refer to specific attributes of the utterance (what character was speaking, the actual line of dialogue itself, etc.).

Some of these attributes are scalar (character, etc.) while others are lists (utterances can be of more than one type, etc.).
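To make the three levels concrete, here is a minimal sketch that parses a hand-typed, abbreviated copy of the excerpt above (only four of the utterance's keys are kept) and walks down from episode to utterance to attribute. The variable names toy, scalar_attr, and list_attr are just for illustration.

```r
library(jsonlite)

# Abbreviated copy of the excerpt shown earlier: one episode, one utterance,
# four of its attributes.
txt <- '{
  "102": {
    "255": {
      "char": "Tasha",
      "type": ["Statement"],
      "char_type": "Person",
      "error": false
    }
  }
}'

# simplifyVector = FALSE keeps JSON arrays as R lists, which is also
# what read_json() does by default.
toy <- parse_json(txt, simplifyVector = FALSE)

names(toy)                     # top level: episode IDs
names(toy[["102"]])            # episode level: utterance IDs
names(toy[["102"]][["255"]])   # utterance level: attribute names

# A scalar attribute comes back as a length-1 vector; an array attribute
# comes back as a list.
scalar_attr <- toy[["102"]][["255"]][["char"]]
list_attr   <- toy[["102"]][["255"]][["type"]]
str(scalar_attr)
str(list_attr)
```

The distinction between the scalar and list attributes matters later, when we have to flatten them differently.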

Now, go back to the paper

After looking over the structure, go back to your notes about the paper and see if the things you were looking for — variables, things from figures, etc. — appear in the document anywhere. If they don’t, get ready to figure out how to recompute analytical variables!

Thinking about desired final structure:

Recall that our goal is to reproduce Table 4 from the paper, which is a final tally of interaction type by speaker type (human vs. computer). Here’s a “sketch” of a dataframe we might want to ultimately end up with:

A long dataframe, in which each row is one utterance paired with one of its interaction types (an utterance with several types gets several rows), with columns for:

  1. Episode ID
  2. Utterance ID within that episode
  3. The category of speaker (person or computer)
  4. The type of interaction (command, wake word, etc.)

Looking at the structure of our data, it looks like the important inner keys are going to be type and char_type.
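It can help to write the target down as an actual (tiny) tibble before doing any parsing. The sketch below types in three rows by hand; the values are taken from the first episode in the data file, and the column names are the ones this walkthrough ends up using.

```r
library(tibble)

# Hand-typed sketch of the shape we are aiming for: one row per
# utterance/type pair, with the two IDs kept as character strings.
target <- tribble(
  ~episode.id, ~utterance.id, ~char_type, ~type,
  "102",       "255",         "Person",   "Statement",
  "102",       "345",         "Person",   "Command",
  "102",       "422",         "Computer", "Response"
)
target
```

Having this in view makes it easier to tell, at each later step, how far the current dataframe is from the goal.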

Initial loading

jsonlite::read_json() is a good place to start. Note that for some JSON files, jsonlite can automatically do a fair bit of the work that we are about to do by hand: if your JSON file is relatively simple and does not have deeply nested objects with varying dimensions, jsonlite::fromJSON() (or read_json() with simplifyVector = TRUE) may be able to coerce the parsed result into a dataframe… but not always, and not for this JSON file, because of the amount of heterogeneity in the data. So we will be doing this “by hand”.
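As a rough illustration of when automatic simplification does and does not help, here is a sketch using two made-up JSON strings (the keys ep1, u1, and u2 are hypothetical, not from the real file). A flat array of uniform records simplifies to a data frame; keyed, nested objects like ours stay a list.

```r
library(jsonlite)

# A flat array of uniform records: fromJSON() simplifies this to a data frame.
flat <- '[{"char": "A", "char_type": "Person"},
          {"char": "B", "char_type": "Computer"}]'
flat_df <- fromJSON(flat)
class(flat_df)

# Objects keyed by ID (hypothetical keys), nested two levels deep:
# there is no array of records to turn into rows, so we get a list back.
nested <- '{"ep1": {"u1": {"char": "A", "type": ["Statement"]},
                    "u2": {"char": "B", "type": ["Wake Word", "Question"]}}}'
nested_out <- fromJSON(nested)
class(nested_out)
```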

j_path <- here("data/lab11/teaearlgreyhotdataset.json")
j <- read_json(j_path)

Note what we get back from read_json:

typeof(j)
## [1] "list"
length(j)
## [1] 137

Why do we get a list and not a data frame? Because JSON allows for arbitrarily complicated nested structures, and so we don’t have any guarantee that what’s in that file will be amenable to flattening without some very file-specific work. So jsonlite punts on the issue and makes a list-of-lists for us, so we can deal with it ourselves.

Before we continue, note that this file is pretty big. To make it a little easier to work with while we’re just getting started, I am going to make a small version that only has two episodes’ worth of data. I’ll do this “by hand” outside of R, in my text editor.

j_path <- here("data/lab11/teaearlgreyhotdataset.small.json")
j <- read_json(j_path)
length(j)
## [1] 2
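The small file above was made by hand in a text editor. If you would rather stay in R, something like the following sketch should also work. It uses a toy stand-in for the parsed file (episode "103" and the "X" values are hypothetical) and writes to a temporary file rather than into the data directory.

```r
library(jsonlite)

# Toy stand-in for the parsed full file; only the 102/255 entry mirrors
# the real data, the rest is placeholder content.
full <- list(
  "102" = list("255" = list(char = "Tasha", type = list("Statement"))),
  "103" = list("10"  = list(char = "X",     type = list("Command"))),
  "104" = list("4"   = list(char = "X",     type = list("Question")))
)

# Keep two episodes by name.
small <- full[c("102", "104")]

# auto_unbox = TRUE writes length-1 vectors as JSON scalars ("Tasha")
# rather than one-element arrays (["Tasha"]); lists are still written as arrays.
out_path <- tempfile(fileext = ".json")
write_json(small, out_path, auto_unbox = TRUE, pretty = TRUE)

# Read it back to check that the structure survived.
small_again <- read_json(out_path)
names(small_again)
```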

This is a named list, so we can access its elements by name:

names(j)
## [1] "102" "104"
length(j$"102")
## [1] 11

And the sub-lists are themselves named:

names(j$"102")
##  [1] "255" "345" "347" "363" "421" "422" "426" "427" "428" "429" "430"

Look back at the JSON file itself; you’ll see that these correspond to the dictionary keys in the file.

It’s turtles all the way down, to the final layer of the information objects themselves:

names(j$"102"$"255")
##  [1] "char"        "line"        "direction"   "type"        "pri_type"   
##  [6] "domain"      "sub-domain"  "nv_resp"     "interaction" "char_type"  
## [11] "is_fed"      "error"
j$"102"$"255"$"char"
## [1] "Tasha"

Work from the outside in

Now, we are going to work our way through the document, flattening and filtering as we go, until we end up with a tidy data frame. Step one: turn our list into a very simple DF, using enframe():

j2 <- j %>% enframe
j2
## # A tibble: 2 × 2
##   name  value            
##   <chr> <list>           
## 1 102   <named list [11]>
## 2 104   <named list [3]>

See what happened here? We started with a named list, and ended up with a dataframe where one column is the names of the list and the second column is the values.

To get a little more concrete: each row of our dataframe now corresponds to a single top-level item from our input JSON file (i.e., an episode), with a column (by default, named name) representing the item’s key and a second column (by default, named value) containing that item’s value. In this JSON file, the keys correspond to particular episode IDs, with the matching values representing dictionaries containing utterances from that episode.

Next, we’re going to go one more level down, but before we do, let’s rename our columns. By default, enframe just gives the uninformative column names “name” and “value”, which (as you will see) can quickly become confusing. We can fix this using rename(), or we can override those defaults when we call enframe.

j2 <- j2 %>% rename(episode.id=name, episode=value) 
j2
## # A tibble: 2 × 2
##   episode.id episode          
##   <chr>      <list>           
## 1 102        <named list [11]>
## 2 104        <named list [3]>
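The other option, overriding the defaults when calling enframe(), might look like this. The sketch uses a toy named list (its contents are placeholders) rather than the real data.

```r
library(tibble)

# Toy named list standing in for the parsed JSON.
toy <- list("102" = list(a = 1, b = 2), "104" = list(a = 3))

# name= and value= set the output column names directly,
# so no separate rename() step is needed.
toy_df <- enframe(toy, name = "episode.id", value = "episode")
toy_df
```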

Now that we’ve got useful names, let’s go one level deeper (Inception-style) and turn our episodes into nested dataframes:

j3 <- j2 %>% mutate(episode=purrr::map(episode,enframe))
j3
## # A tibble: 2 × 2
##   episode.id episode          
##   <chr>      <list>           
## 1 102        <tibble [11 × 2]>
## 2 104        <tibble [3 × 2]>

Now, if we unnest the episode column…

j3 <- j3 %>% unnest(episode)
j3
## # A tibble: 14 × 3
##    episode.id name  value            
##    <chr>      <chr> <list>           
##  1 102        255   <named list [12]>
##  2 102        345   <named list [12]>
##  3 102        347   <named list [12]>
##  4 102        363   <named list [12]>
##  5 102        421   <named list [12]>
##  6 102        422   <named list [12]>
##  7 102        426   <named list [12]>
##  8 102        427   <named list [12]>
##  9 102        428   <named list [12]>
## 10 102        429   <named list [12]>
## 11 102        430   <named list [12]>
## 12 104        4     <named list [12]>
## 13 104        45    <named list [12]>
## 14 104        54    <named list [12]>

Now, see what we’ve got? A tidy data frame, one row per utterance, with the default column names from enframe. Before we forget, let’s rename our columns:

j3 <- j3 %>% rename(utterance.id=name, utterance=value)
j3
## # A tibble: 14 × 3
##    episode.id utterance.id utterance        
##    <chr>      <chr>        <list>           
##  1 102        255          <named list [12]>
##  2 102        345          <named list [12]>
##  3 102        347          <named list [12]>
##  4 102        363          <named list [12]>
##  5 102        421          <named list [12]>
##  6 102        422          <named list [12]>
##  7 102        426          <named list [12]>
##  8 102        427          <named list [12]>
##  9 102        428          <named list [12]>
## 10 102        429          <named list [12]>
## 11 102        430          <named list [12]>
## 12 104        4            <named list [12]>
## 13 104        45           <named list [12]>
## 14 104        54           <named list [12]>

I bet you can guess what comes next: that’s right, we are going to enframe our utterances, and unnest!

j4 <- j3 %>% mutate(utterance=purrr::map(utterance,enframe))
j4
## # A tibble: 14 × 3
##    episode.id utterance.id utterance        
##    <chr>      <chr>        <list>           
##  1 102        255          <tibble [12 × 2]>
##  2 102        345          <tibble [12 × 2]>
##  3 102        347          <tibble [12 × 2]>
##  4 102        363          <tibble [12 × 2]>
##  5 102        421          <tibble [12 × 2]>
##  6 102        422          <tibble [12 × 2]>
##  7 102        426          <tibble [12 × 2]>
##  8 102        427          <tibble [12 × 2]>
##  9 102        428          <tibble [12 × 2]>
## 10 102        429          <tibble [12 × 2]>
## 11 102        430          <tibble [12 × 2]>
## 12 104        4            <tibble [12 × 2]>
## 13 104        45           <tibble [12 × 2]>
## 14 104        54           <tibble [12 × 2]>

When we unnest this time, let’s see what happens:

j5 <- j4 %>% unnest(utterance)
j5
## # A tibble: 168 × 4
##    episode.id utterance.id name        value     
##    <chr>      <chr>        <chr>       <list>    
##  1 102        255          char        <chr [1]> 
##  2 102        255          line        <chr [1]> 
##  3 102        255          direction   <chr [1]> 
##  4 102        255          type        <list [1]>
##  5 102        255          pri_type    <chr [1]> 
##  6 102        255          domain      <list [1]>
##  7 102        255          sub-domain  <list [1]>
##  8 102        255          nv_resp     <lgl [1]> 
##  9 102        255          interaction <chr [1]> 
## 10 102        255          char_type   <chr [1]> 
## # ℹ 158 more rows

Whoa! Now we’ve got one row for each key in the inner-most part of the JSON file, with a corresponding list value. A few things to note:

  1. The value column here is of type list, but the different rows hold different kinds of things: some have characters, some have logicals, some have lists, etc.
  2. Some of those entries have different lengths: most are 1, but some are 0.
  3. The actual data we want (the speaker, the interaction type, etc.) is still locked away in a list.
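One way to check the first two points without clicking through the list column is to tabulate what each cell holds. This is a minimal sketch on a toy tibble shaped like j5 (the empty domain entry is a made-up example of a length-0 cell); the column names kind and len are just for illustration.

```r
library(tibble)
library(dplyr)
library(purrr)

# Toy stand-in for j5: a name column plus a list column of mixed contents.
toy <- tibble(
  name  = c("char", "type", "nv_resp", "domain"),
  value = list("Tasha", list("Statement"), TRUE, list())
)

toy_summary <- toy %>%
  mutate(
    kind = map_chr(value, ~ class(.x)[1]),  # what kind of object each cell holds
    len  = lengths(value)                   # how many entries each cell has
  )
toy_summary
```

Running the same mutate() on the real j5 and then count(name, kind, len) would show at a glance which attributes are scalars and which are lists.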

Our next step will be to pivot this into a slightly wider dataframe:

j6 <- j5 %>% pivot_wider(names_from=name, values_from=value)
j6
## # A tibble: 14 × 14
##    episode.id utterance.id char      line      direction type   pri_type  domain
##    <chr>      <chr>        <list>    <list>    <list>    <list> <list>    <list>
##  1 102        255          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
##  2 102        345          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
##  3 102        347          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
##  4 102        363          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
##  5 102        421          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
##  6 102        422          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
##  7 102        426          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
##  8 102        427          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
##  9 102        428          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 10 102        429          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 11 102        430          <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 12 104        4            <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 13 104        45           <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## 14 104        54           <chr [1]> <chr [1]> <chr [1]> <list> <chr [1]> <list>
## # ℹ 6 more variables: `sub-domain` <list>, nv_resp <list>, interaction <list>,
## #   char_type <list>, is_fed <list>, error <list>

This is starting to look like we’re getting closer to a dataframe we can work with! We still have a little bit of work to do, however. First, let’s make our lives simpler by getting rid of some of the columns we don’t need for our analysis:

j6 <- j6 %>% select(episode.id, utterance.id, char_type, type)
j6
## # A tibble: 14 × 4
##    episode.id utterance.id char_type type      
##    <chr>      <chr>        <list>    <list>    
##  1 102        255          <chr [1]> <list [1]>
##  2 102        345          <chr [1]> <list [1]>
##  3 102        347          <chr [1]> <list [1]>
##  4 102        363          <chr [1]> <list [3]>
##  5 102        421          <chr [1]> <list [1]>
##  6 102        422          <chr [1]> <list [1]>
##  7 102        426          <chr [1]> <list [2]>
##  8 102        427          <chr [1]> <list [2]>
##  9 102        428          <chr [1]> <list [2]>
## 10 102        429          <chr [1]> <list [2]>
## 11 102        430          <chr [1]> <list [2]>
## 12 104        4            <chr [1]> <list [1]>
## 13 104        45           <chr [1]> <list [1]>
## 14 104        54           <chr [1]> <list [1]>

Notice that char_type and type are list columns, with char_type always being of length 1 and type being of varying lengths. In our original data, char_type was a scalar, not a list: we ended up with it as a list because of a quirk of how our raw JSON parse was processed and enframed. We can easily deal with this via unnest:

j6 <- j6 %>% unnest(char_type)
j6
## # A tibble: 14 × 4
##    episode.id utterance.id char_type type      
##    <chr>      <chr>        <chr>     <list>    
##  1 102        255          Person    <list [1]>
##  2 102        345          Person    <list [1]>
##  3 102        347          Person    <list [1]>
##  4 102        363          Person    <list [3]>
##  5 102        421          Person    <list [1]>
##  6 102        422          Computer  <list [1]>
##  7 102        426          Computer  <list [2]>
##  8 102        427          Person    <list [2]>
##  9 102        428          Computer  <list [2]>
## 10 102        429          Computer  <list [2]>
## 11 102        430          Person    <list [2]>
## 12 104        4            Person    <list [1]>
## 13 104        45           Person    <list [1]>
## 14 104        54           Person    <list [1]>
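If you want to be warned when a supposedly scalar column is not actually scalar, a stricter alternative to unnest() is to convert it with purrr::map_chr(), which raises an error unless every cell yields exactly one string. A sketch on toy data:

```r
library(tibble)
library(dplyr)
library(purrr)

# Toy list column in which every cell is a length-1 character vector.
toy <- tibble(
  utterance.id = c("255", "422"),
  char_type    = list("Person", "Computer")
)

# map_chr() returns a plain character vector and errors if any cell
# does not produce a single value.
toy_flat <- toy %>% mutate(char_type = map_chr(char_type, identity))
toy_flat
```

By contrast, unnest() would quietly drop a row whose cell is empty and duplicate a row whose cell has several entries.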

Now, let’s turn to the type column. Look at the entry for utterance 363 in episode 102: it has a length of 3. Looking at the JSON file, we can see that this is because this utterance was assigned three types: “Wake Word”, “Question”, and “Conversation”.

Because the “type” column’s JSON origin was as a container element (a list), which in principle could have had entries of varying types, R has given it to us as an R list (which can have varying contents) rather than as a vector (which can only have one kind of data). We know what is in that list, but R doesn’t.

This is the sort of thing that jsonlite has helpers to do, but for a simple file format like this one I personally usually just do it all by hand, so I can be sure about what’s going on.

As such, we have a little bit of extra work to do; we will actually need to unnest it twice:

j7 <- j6 %>% unnest(type) %>% unnest(type)
j7
## # A tibble: 21 × 4
##    episode.id utterance.id char_type type        
##    <chr>      <chr>        <chr>     <chr>       
##  1 102        255          Person    Statement   
##  2 102        345          Person    Command     
##  3 102        347          Person    Statement   
##  4 102        363          Person    Wake Word   
##  5 102        363          Person    Question    
##  6 102        363          Person    Conversation
##  7 102        421          Person    Command     
##  8 102        422          Computer  Response    
##  9 102        426          Computer  Info        
## 10 102        426          Computer  Alert       
## # ℹ 11 more rows
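To see why two passes were needed, here is a sketch on a two-row toy tibble built to mimic the type column (a list column whose cells are themselves lists of strings). The first unnest() gives one row per type but leaves the column as a list; the second turns it into a character vector.

```r
library(tibble)
library(tidyr)

# Toy version of the type column: each cell is a list of strings.
toy <- tibble(
  utterance.id = c("255", "363"),
  type = list(
    list("Statement"),
    list("Wake Word", "Question", "Conversation")
  )
)

once  <- toy %>% unnest(type)                   # 4 rows, type is still a list column
twice <- toy %>% unnest(type) %>% unnest(type)  # 4 rows, type is now character

class(once$type)
class(twice$type)
```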

A more modern and simpler option that works in this specific case, but might not always work, is to use unnest_longer:

j7 <- j6 %>% unnest_longer(type)
j7
## # A tibble: 21 × 4
##    episode.id utterance.id char_type type        
##    <chr>      <chr>        <chr>     <chr>       
##  1 102        255          Person    Statement   
##  2 102        345          Person    Command     
##  3 102        347          Person    Statement   
##  4 102        363          Person    Wake Word   
##  5 102        363          Person    Question    
##  6 102        363          Person    Conversation
##  7 102        421          Person    Command     
##  8 102        422          Computer  Response    
##  9 102        426          Computer  Info        
## 10 102        426          Computer  Alert       
## # ℹ 11 more rows

At this point, we’ve got a nice, tidy data frame, and we are ready to use regular methods to compute our table.
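For comparison, the same flattening can probably be done in fewer steps with tidyr’s rectangling helpers. The sketch below is an alternative to the enframe/unnest/pivot_wider route above, not the route this lab takes; it runs on a toy list that mirrors two utterances from episode 102 and keeps only the two keys we need.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Toy stand-in for the parsed JSON: one episode, two utterances.
toy <- list(
  "102" = list(
    "255" = list(char_type = "Person", type = list("Statement")),
    "363" = list(char_type = "Person",
                 type = list("Wake Word", "Question", "Conversation"))
  )
)

flat <- toy %>%
  enframe(name = "episode.id", value = "episode") %>%
  # one row per utterance; the list names become the utterance IDs
  unnest_longer(episode, indices_to = "utterance.id", values_to = "utterance") %>%
  # pull just the two keys we need out of each utterance
  hoist(utterance, char_type = "char_type", type = "type") %>%
  select(episode.id, utterance.id, char_type, type) %>%
  # one row per utterance/type pair
  unnest_longer(type)
flat
```

The step-by-step version is easier to debug when something unexpected is lurking in the file, which is why the walkthrough does it the long way.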

Actually Replicating an Analysis

With a tidy data frame in hand, ordinary tidyverse verbs are enough to compute our table:

j7 %>% group_by(char_type, type) %>% 
  summarise(n=n()) %>% 
  mutate(denom=sum(n)) %>% 
  mutate(prop=n/denom) %>% 
  select(char_type,  type, prop) %>% 
  pivot_wider(names_from=char_type, values_from=prop) %>% 
  select(type, Person, Computer) %>% 
  arrange(-Person) 
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by char_type and type.
## ℹ Output is grouped by char_type.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(char_type, type))` for per-operation grouping
##   (`?dplyr::dplyr_by`) instead.
## # A tibble: 9 × 3
##   type          Person Computer
##   <chr>          <dbl>    <dbl>
## 1 Statement     0.357    NA    
## 2 Conversation  0.214     0.286
## 3 Command       0.214    NA    
## 4 Comment       0.0714   NA    
## 5 Question      0.0714   NA    
## 6 Wake Word     0.0714   NA    
## 7 Alert        NA         0.286
## 8 Info         NA         0.143
## 9 Response     NA         0.286
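The denominator in that pipeline is easy to miss: summarise() drops the last grouping variable (type), so the following mutate() computes sum(n) within each char_type. A more explicit way to write the same calculation, sketched here on a hypothetical four-row toy table, is:

```r
library(tibble)
library(dplyr)

# Hypothetical toy data: utterance 1 carries two types.
toy <- tribble(
  ~utterance.id, ~char_type, ~type,
  "1",           "Person",   "Command",
  "1",           "Person",   "Wake Word",
  "2",           "Person",   "Command",
  "3",           "Computer", "Response"
)

props <- toy %>%
  count(char_type, type) %>%       # n = number of labels per speaker/type pair
  group_by(char_type) %>%          # make the denominator's grouping explicit
  mutate(prop = n / sum(n)) %>%    # share of that speaker category's labels
  ungroup()
props
```

Note that the denominator here is the number of type labels per speaker category, not the number of utterances; an utterance with two types is counted twice.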

And we’re all set! Now we just need to repeat the above workflow, pointed at the true file:

## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by char_type and type.
## ℹ Output is grouped by char_type.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(char_type, type))` for per-operation grouping
##   (`?dplyr::dplyr_by`) instead.
type            Person   Computer
Command         33.99%      0.00%
Wake Word       31.51%      0.00%
Statement       13.23%      0.00%
Question        10.31%      0.00%
Conversation     5.70%      1.21%
Password         2.56%      0.00%
Comment          1.97%      0.00%
command          0.66%      0.00%
question         0.07%      0.00%
Response         0.00%     56.36%
Alert            0.00%     15.45%
Info             0.00%      8.48%
Countdown        0.00%      8.18%
Clarification    0.00%      6.36%
Progress         0.00%      3.94%

Note that these numbers don’t match what is in the paper; I suspect that the authors may have used the pri_type field instead of the type field. Repeating our analysis with that field instead, we see:

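# full_j is the parsed full dataset, read from the same path used earlier
full_j <- read_json(here("data/lab11/teaearlgreyhotdataset.json"))
full_j %>% enframe %>% 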
  rename(episode.id=name, episode=value) %>% 
  mutate(episode=purrr::map(episode,enframe)) %>% 
  unnest(episode) %>% 
  rename(utterance.id=name, utterance=value) %>% 
  mutate(utterance=purrr::map(utterance,enframe)) %>% 
  unnest(utterance) %>% 
  pivot_wider(names_from=name, values_from=value) %>% 
  select(episode.id, utterance.id, char_type, pri_type) %>% 
  unnest(c(char_type, pri_type)) %>% 
  group_by(char_type, pri_type) %>% 
  summarise(n=n()) %>% 
  mutate(denom=sum(n)) %>% 
  mutate(prop=n/denom) %>% 
  select(char_type,  pri_type, prop) %>% 
  pivot_wider(names_from=char_type, values_from=prop, values_fill=0.0) %>% 
  select(pri_type, Person, Computer) %>% 
  arrange(-Person, -Computer) %>% 
  gt %>% fmt_percent(columns=c(Person, Computer))
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by char_type and pri_type.
## ℹ Output is grouped by char_type.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(char_type, pri_type))` for per-operation grouping
##   (`?dplyr::dplyr_by`) instead.
pri_type        Person   Computer
Command         59.42%      0.00%
Statement       21.23%      0.00%
Question        17.09%      0.00%
Password         0.75%      0.00%
Wake Word        0.75%      0.00%
Comment          0.63%      0.00%
Conversation     0.13%      0.20%
Response         0.00%     66.27%
Alert            0.00%     19.25%
Clarification    0.00%      8.33%
Info             0.00%      2.78%
Countdown        0.00%      1.79%
Progress         0.00%      1.39%

This still doesn’t match what is in the paper, and I think it’s because I was mistaken about the denominator.

full_j %>% enframe %>% 
  rename(episode.id=name, episode=value) %>% 
  mutate(episode=purrr::map(episode,enframe)) %>% 
  unnest(episode) %>% 
  rename(utterance.id=name, utterance=value) %>% 
  mutate(utterance=purrr::map(utterance,enframe)) %>% 
  unnest(utterance) %>% 
  pivot_wider(names_from=name, values_from=value) %>% 
  select(episode.id, utterance.id, char_type, type) %>%
  unnest(c(char_type, type)) %>% unnest(type) %>%
  janitor::tabyl(type, char_type) %>% data.frame %>% 
  mutate(Comp.prop=Computer/sum(Computer), Person.prop=Person/sum(Person)) %>% arrange(-Person.prop, -Comp.prop) %>% 
  select(type,Person=Person.prop, Computer=Comp.prop) %>% 
  gt
type            Person         Computer
Command         0.3399122807   0.00000000
Wake Word       0.3150584795   0.00000000
Statement       0.1323099415   0.00000000
Question        0.1030701754   0.00000000
Conversation    0.0570175439   0.01212121
Password        0.0255847953   0.00000000
Comment         0.0197368421   0.00000000
command         0.0065789474   0.00000000
question        0.0007309942   0.00000000
Response        0.0000000000   0.56363636
Alert           0.0000000000   0.15454545
Info            0.0000000000   0.08484848
Countdown       0.0000000000   0.08181818
Clarification   0.0000000000   0.06363636
Progress        0.0000000000   0.03939394

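This still isn’t in alignment with what the paper reported. In fact, these are the same proportions as in the first full-file table above, just unformatted, because each column is again divided by its own total. From this point, the issue becomes one of replicating an analysis rather than parsing JSON, and thus is out of scope for our lab. :-D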
