7  Preparing data to answer causal questions

7.1 Introduction to the data

Throughout this book we will be using data obtained from Touring Plans. Touring Plans is a company that helps folks plan their trips to Disney and Universal theme parks. One of their goals is to accurately predict attraction wait times at these theme parks by leveraging data and statistical modeling. The touringplans R package includes several datasets containing information about Disney theme park attractions. A summary of the attractions included in the package can be found by running the following:

library(touringplans)
attractions_metadata
# A tibble: 14 × 8
   dataset_name name  short_name park  land  opened_on 
   <chr>        <chr> <chr>      <chr> <chr> <date>    
 1 alien_sauce… Alie… Alien Sau… Disn… Toy … 2018-06-30
 2 dinosaur     DINO… DINOSAUR   Disn… Dino… 1998-04-22
 3 expedition_… Expe… Expeditio… Disn… Asia  2006-04-07
 4 flight_of_p… Avat… Flight of… Disn… Pand… 2017-05-27
 5 kilimanjaro… Kili… Kilimanja… Disn… Afri… 1998-04-22
 6 navi_river   Na'v… Na'vi Riv… Disn… Pand… 2017-05-27
 7 pirates_of_… Pira… Pirates o… Magi… Adve… 1973-12-17
 8 rock_n_roll… Rock… Rock Coas… Disn… Suns… 1999-07-29
 9 seven_dwarf… Seve… 7 Dwarfs … Magi… Fant… 2014-05-28
10 slinky_dog   Slin… Slinky Dog Disn… Toy … 2018-06-30
11 soarin       Soar… Soarin'    Epcot Worl… 2005-05-05
12 spaceship_e… Spac… Spaceship… Epcot Worl… 1982-10-01
13 splash_moun… Spla… Splash Mo… Magi… Fron… 1992-07-17
14 toy_story_m… Toy … Toy Story… Disn… Toy … 2008-05-31
# ℹ 2 more variables: duration <dbl>,
#   average_wait_per_hundred <dbl>

Additionally, this package contains a dataset with raw metadata about the parks, with observations recorded daily. This metadata includes information like the Walt Disney World ticket season on the particular day (was it high season – think Christmas – or low season – think right when school started), what the historic temperatures were in the park on that day, and whether there was a special event, such as “extra magic hours” in the park on that day (did the park open early to guests staying in the Walt Disney World resorts?).

parks_metadata_raw
# A tibble: 2,079 × 181
   date       wdw_ticket_season dayofweek dayofyear
   <date>     <chr>                 <dbl>     <dbl>
 1 2015-01-01 <NA>                      5         0
 2 2015-01-02 <NA>                      6         1
 3 2015-01-03 <NA>                      7         2
 4 2015-01-04 <NA>                      1         3
 5 2015-01-05 <NA>                      2         4
 6 2015-01-06 <NA>                      3         5
 7 2015-01-07 <NA>                      4         6
 8 2015-01-08 <NA>                      5         7
 9 2015-01-09 <NA>                      6         8
10 2015-01-10 <NA>                      7         9
# ℹ 2,069 more rows
# ℹ 177 more variables: weekofyear <dbl>,
#   monthofyear <dbl>, year <dbl>, season <chr>,
#   holidaypx <dbl>, holidaym <dbl>, holidayn <chr>,
#   holiday <dbl>, wdwticketseason <chr>,
#   wdwracen <chr>, wdweventn <chr>, wdwevent <dbl>,
#   wdwrace <dbl>, wdwseason <chr>, …

Suppose the causal question of interest is:

Is there a relationship between whether there were “Extra Magic Hours” in the morning at Magic Kingdom and the average wait time for an attraction called the “Seven Dwarfs Mine Train” the same day between 9am and 10am in 2018?

Let’s begin by diagramming this causal question (Figure 7.1).

Warning in geom_segment(aes(x = 0.5, xend = 2.5, y = 0.95, yend = 0.95)): All aesthetics have length 1, but the data has 6 rows.
ℹ Please consider using `annotate()` or provide this
  layer with data containing a single row.
Warning in geom_segment(aes(x = 1.5, xend = 1.5, y = 0.95, yend = 1.1)): All aesthetics have length 1, but the data has 6 rows.
ℹ Please consider using `annotate()` or provide this
  layer with data containing a single row.
Warning in geom_segment(aes(x = 0.5, xend = 1, y = 0.95, yend = 0.65)): All aesthetics have length 1, but the data has 6 rows.
ℹ Please consider using `annotate()` or provide this
  layer with data containing a single row.
Warning in geom_segment(aes(x = 1, xend = 1.5, y = 0.65, yend = 0.65)): All aesthetics have length 1, but the data has 6 rows.
ℹ Please consider using `annotate()` or provide this
  layer with data containing a single row.
Warning in geom_segment(aes(x = 1.55, xend = 2.05, y = 0.95, yend = 0.65)): All aesthetics have length 1, but the data has 6 rows.
ℹ Please consider using `annotate()` or provide this
  layer with data containing a single row.
Warning in geom_segment(aes(x = 2.05, xend = 2.55, y = 0.65, yend = 0.65)): All aesthetics have length 1, but the data has 6 rows.
ℹ Please consider using `annotate()` or provide this
  layer with data containing a single row.

Figure 7.1: Diagram of the causal question “Is there a relationship between whether there were”Extra Magic Hours” in the morning at Magic Kingdom and the average wait time for an attraction called the “Seven Dwarfs Mine Train” the same day between 9am and 10am in 2018?”

Historically, guests who stayed in a Walt Disney World resort hotel could access the park during “Extra Magic Hours,” during which the park was closed to all other guests. These extra hours could be in the morning or evening. The Seven Dwarfs Mine Train is a ride at Walt Disney World’s Magic Kingdom. Magic Kingdom may or may not be selected each day to have these “Extra Magic Hours.” We are interested in examining the relationship between whether there were “Extra Magic Hours” in the morning and the average wait time for the Seven Dwarfs Mine Train on the same day between 9 am and 10 am. Below is a proposed DAG for this question.

Figure 7.2: Proposed DAG for the relationship between Extra Magic Hours in the morning at a particular park and the average wait time between 9 am and 10 am. Here we are saying that we believe 1) Extra Magic Hours impacts average wait time and 2) both Extra Magic Hours and average wait time are determined by the time the park closes, historic high temperatures, and ticket season.

Since we are not in charge of Walt Disney World’s operations, we cannot randomize dates to have (or not) “Extra Magic Hours”, therefore, we need to rely on previously collected observational data and do our best to emulate the target trial that we would have created, should it have been possible. Here, our observations are days. Looking at the diagram above, we can map each element of the causal question to elements of our target trial protocol:

  • Eligibility criteria: Days must be from 2018
  • Exposure definition: Magic kingdom had “Extra Magic Hours” in the morning
  • Assignment procedures: Observed – if the historic data suggests there were “Extra Magic Hours” in the morning on a particular day, that day is classified as “exposed” otherwise it is “unexposed”
  • Follow-up period: From park open to 10am.
  • Outcome definition: The average posted wait time between 9am and 10am
  • Causal contrast of interest: Average treatment effect (we will discuss this in Chapter 11)
  • Analysis plan: We use inverse probability weighting after fitting a propensity score model to estimate the average treatment effect of the exposure on the outcome of interest. We will adjust for variables as determined by our DAG (Figure 7.2)

7.2 Data wrangling and recipes

Most of our data manipulation tools come from the dplyr package (Table 7.1). We will also use lubridate to help us manipulate dates.

Table 7.1: Mapping target trial protocol elements to commonly used dplyr functions
Target trial protocol element {dplyr} functions
Eligibility criteria filter()
Exposure definition mutate()
Assignment procedures mutate()
Follow-up period mutate() pivot_longer() pivot_wider()
Outcome definition mutate()
Analysis plan select() mutate()

To answer this question, we are going to need to manipulate both the seven_dwarfs_train dataset as well as the parks_metadata_raw dataset. Let’s start with the seven_dwarfs_train data set. The Seven Dwarfs Mine Train ride is an attraction at Walt Disney World’s Magic Kingdom. The seven_dwarfs_train dataset in the {touringplans} package contains information about the date a particular wait time was recorded (park_date), the time of the wait time (wait_datetime), the actual wait time (wait_minutes_actual), and the posted wait time (wait_minutes_posted). Let’s take a look at this dataset. The {skimr} package is great for getting a quick glimpse at a new dataset.

library(skimr)
skim(seven_dwarfs_train)
Data summary
Name seven_dwarfs_train
Number of rows 321631
Number of columns 4
_______________________
Column type frequency:
Date 1
numeric 2
POSIXct 1
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
park_date 0 1 2015-01-01 2021-12-28 2018-04-07 2334

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
wait_minutes_actual 313996 0.02 23.99 1064.06 -92918 21 31 46 217 ▁▁▁▁▇
wait_minutes_posted 30697 0.90 76.96 33.99 0 50 70 95 300 ▆▇▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
wait_datetime 0 1 2015-01-01 07:51:12 2021-12-28 22:57:34 2018-04-07 23:14:06 321586

Examining the output above, we learn that this dataset contains four columns and 321,631 rows. We also learn that the dates span from 2015 to 2021. We can also examine the distribution of each of the variables to detect any potential anomalies. Notice anything strange? Look at the p0 (that is the minimum value) for wait_minutes_actual. It is -92918! We are not using this variable for this analysis, but we will for future analyses, so this is good to keep in mind.

We need this dataset to calculate our outcome. Recall from above that our outcome is defined as the average posted wait time between 9am and 10am. Additionally, recall our eligibility criteria states that we need to restrict our analysis to days in 2018.

library(dplyr)
library(lubridate)
seven_dwarfs_train_2018 <- seven_dwarfs_train |>
  filter(year(park_date) == 2018) |> # eligibility criteria
  mutate(hour = hour(wait_datetime)) |> # get hour from wait
  group_by(park_date, hour) |> # group by date
  summarise(
    wait_minutes_posted_avg = mean(wait_minutes_posted, na.rm = TRUE),
    .groups = "drop"
  ) |> # get average wait time
  mutate(
    wait_minutes_posted_avg =
      case_when(
        is.nan(wait_minutes_posted_avg) ~ NA,
        TRUE ~ wait_minutes_posted_avg
      )
  ) |> # if it is NAN make it NA
  filter(hour == 9) # only keep the average wait time between 9 and 10
seven_dwarfs_train_2018
# A tibble: 362 × 3
   park_date   hour wait_minutes_posted_avg
   <date>     <int>                   <dbl>
 1 2018-01-01     9                    60  
 2 2018-01-02     9                    60  
 3 2018-01-03     9                    60  
 4 2018-01-04     9                    68.9
 5 2018-01-05     9                    70.6
 6 2018-01-06     9                    33.3
 7 2018-01-07     9                    46.4
 8 2018-01-08     9                    69.5
 9 2018-01-09     9                    64.3
10 2018-01-10     9                    74.3
# ℹ 352 more rows

Now that we have our outcome settled, we need to get our exposure variable, as well as any other park-specific variables about the day in question that may be used as variables that we adjust for. Examining Figure 7.2, we see that we need data for three proposed confounders: the ticket season, the time the park closed, and the historic high temperature. These are in the parks_metadata_raw dataset. This data will require extra cleaning, since the names are in the original format.

We like to have our variable names follow a clean convention – one way to do this is to follow Emily Riederer’s “Column Names as Contracts” format (Riederer 2020). The basic idea is to predefine a set of words, phrases, or stubs with clear meanings to index information, and use these consistently when naming variables. For example, in these data, variables that are specific to a particular wait time are prepended with the term wait (e.g. wait_datetime and wait_minutes_actual), variables that are specific to the park on a particular day, acquired from parks metadata, are prepended with the term park (e.g. park_date or park_temperature_high).

Let’s first decide what variables we will need. In practice, this decision may involve an iterative process. For example, after drawing our DAG or after conducting diagnostic, we may determine that we need more variables than what we originally cleaned. Let’s start by skimming this dataframe.

skim(parks_metadata_raw)
Data summary
Name parks_metadata_raw
Number of rows 2079
Number of columns 181
_______________________
Column type frequency:
character 42
Date 1
difftime 46
logical 6
numeric 86
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
wdw_ticket_season 861 0.59 4 7 0 3 0
season 253 0.88 4 29 0 17 0
holidayn 1865 0.10 3 7 0 43 0
wdwticketseason 761 0.63 4 7 0 3 0
wdwracen 1992 0.04 4 6 0 5 0
wdweventn 1832 0.12 3 12 0 8 0
wdwseason 87 0.96 4 29 0 17 0
mkeventn 1546 0.26 3 11 0 10 0
epeventn 868 0.58 4 5 0 4 0
hseventn 1877 0.10 4 7 0 5 0
akeventn 2010 0.03 4 4 0 2 0
holidayj 2037 0.02 5 15 0 8 0
insession 105 0.95 2 3 0 95 0
insession_enrollment 105 0.95 2 4 0 100 0
insession_wdw 105 0.95 2 4 0 94 0
insession_dlr 105 0.95 2 4 0 94 0
insession_sqrt_wdw 105 0.95 2 4 0 97 0
insession_sqrt_dlr 105 0.95 2 4 0 97 0
insession_california 105 0.95 2 4 0 89 0
insession_dc 105 0.95 2 4 0 86 0
insession_central_fl 105 0.95 2 4 0 71 0
insession_drive1_fl 105 0.95 2 4 0 85 0
insession_drive2_fl 105 0.95 2 4 0 95 0
insession_drive_ca 105 0.95 2 4 0 91 0
insession_florida 105 0.95 2 4 0 86 0
insession_mardi_gras 105 0.95 2 4 0 82 0
insession_midwest 105 0.95 2 4 0 75 0
insession_ny_nj 105 0.95 2 4 0 8 0
insession_ny_nj_pa 105 0.95 2 4 0 19 0
insession_new_england 105 0.95 2 4 0 45 0
insession_new_jersey 105 0.95 2 4 0 2 0
insession_nothwest 105 0.95 2 4 0 17 0
insession_planes 105 0.95 2 4 0 81 0
insession_socal 105 0.95 2 4 0 80 0
insession_southwest 105 0.95 2 4 0 86 0
mkprddn 183 0.91 33 41 0 2 0
mkprdnn 1358 0.35 29 38 0 2 0
mkfiren 134 0.94 18 65 0 8 0
epfiren 126 0.94 13 35 0 2 0
hsfiren 485 0.77 24 66 0 6 0
hsshwnn 164 0.92 10 28 0 2 0
akshwnn 883 0.58 15 33 0 2 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2015-01-01 2021-08-31 2017-11-05 2079

Variable type: difftime

skim_variable n_missing complete_rate min max median n_unique
mkopen 0 1.00 21600 secs 32400 secs 09:00:00 4
mkclose 0 1.00 54000 secs 107940 secs 22:00:00 13
mkemhopen 0 1.00 21600 secs 32400 secs 09:00:00 5
mkemhclose 0 1.00 54000 secs 107940 secs 23:00:00 14
mkopenyest 0 1.00 21600 secs 32400 secs 09:00:00 4
mkcloseyest 0 1.00 54000 secs 107940 secs 22:00:00 13
mkopentom 0 1.00 21600 secs 32400 secs 09:00:00 4
mkclosetom 0 1.00 54000 secs 107940 secs 22:00:00 13
epopen 0 1.00 25200 secs 43200 secs 09:00:00 6
epclose 0 1.00 61200 secs 90000 secs 21:00:00 9
epemhopen 0 1.00 25200 secs 43200 secs 09:00:00 6
epemhclose 0 1.00 61200 secs 90000 secs 21:00:00 12
epopenyest 0 1.00 25200 secs 43200 secs 09:00:00 6
epcloseyest 0 1.00 61200 secs 90000 secs 21:00:00 9
epopentom 0 1.00 25200 secs 43200 secs 09:00:00 6
epclosetom 0 1.00 61200 secs 90000 secs 21:00:00 9
hsopen 0 1.00 21600 secs 36000 secs 09:00:00 6
hsclose 0 1.00 50400 secs 86400 secs 21:00:00 14
hsemhopen 0 1.00 21600 secs 36000 secs 09:00:00 7
hsemhclose 0 1.00 50400 secs 93600 secs 21:00:00 18
hsopenyest 0 1.00 21600 secs 36000 secs 09:00:00 6
hscloseyest 0 1.00 50400 secs 86400 secs 21:00:00 14
hsopentom 0 1.00 21600 secs 36000 secs 09:00:00 6
hsclosetom 0 1.00 50400 secs 86400 secs 21:00:00 14
akopen 0 1.00 25200 secs 32400 secs 09:00:00 3
akclose 0 1.00 50400 secs 86400 secs 20:00:00 16
akemhopen 0 1.00 25200 secs 32400 secs 09:00:00 3
akemhclose 0 1.00 50400 secs 90000 secs 20:00:00 17
akopenyest 0 1.00 25200 secs 32400 secs 09:00:00 3
akcloseyest 0 1.00 50400 secs 86400 secs 20:00:00 16
akopentom 0 1.00 25200 secs 32400 secs 09:00:00 3
akclosetom 0 1.00 50400 secs 86400 secs 20:00:00 16
mkprddt1 183 0.91 39600 secs 61200 secs 15:00:00 5
mkprddt2 1851 0.11 50400 secs 73800 secs 15:30:00 5
mkprdnt1 1358 0.35 68400 secs 82800 secs 21:00:00 11
mkprdnt2 1480 0.29 0 secs 84600 secs 23:00:00 8
mkfiret1 134 0.94 66600 secs 80100 secs 21:15:00 12
mkfiret2 2069 0.00 85800 secs 85800 secs 23:50:00 1
epfiret1 126 0.94 64800 secs 81000 secs 21:00:00 6
epfiret2 2074 0.00 85200 secs 85200 secs 23:40:00 1
hsfiret1 485 0.77 0 secs 82800 secs 21:00:00 17
hsfiret2 2045 0.02 0 secs 81000 secs 21:00:00 5
hsshwnt1 164 0.92 65100 secs 79200 secs 20:30:00 10
hsshwnt2 1369 0.34 72000 secs 82800 secs 21:30:00 11
akshwnt1 883 0.58 65700 secs 76500 secs 20:30:00 13
akshwnt2 1149 0.45 70200 secs 81000 secs 21:45:00 13

Variable type: logical

skim_variable n_missing complete_rate mean count
hsprddt1 2079 0 NaN :
hsprddn 2079 0 NaN :
akprddt1 2079 0 NaN :
akprddt2 2079 0 NaN :
akprddn 2079 0 NaN :
akfiren 2079 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
dayofweek 0 1 4.000e+00 2.000e+00 1.000e+00 2.000e+00 4.000e+00 6.000e+00 7.000e+00 ▇▃▃▃▇
dayofyear 0 1 1.818e+02 1.063e+02 0.000e+00 8.900e+01 1.840e+02 2.730e+02 3.650e+02 ▇▇▇▇▇
weekofyear 0 1 2.609e+01 1.520e+01 0.000e+00 1.300e+01 2.600e+01 3.900e+01 5.300e+01 ▇▇▇▇▇
monthofyear 0 1 6.510e+00 3.480e+00 1.000e+00 3.000e+00 7.000e+00 1.000e+01 1.200e+01 ▇▅▆▅▇
year 0 1 2.017e+03 1.740e+00 2.015e+03 2.016e+03 2.017e+03 2.019e+03 2.021e+03 ▇▃▃▃▃
holidaypx 0 1 7.850e+00 6.890e+00 0.000e+00 3.000e+00 6.000e+00 1.000e+01 3.300e+01 ▇▅▂▁▁
holidaym 0 1 5.400e-01 1.350e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 5.000e+00 ▇▁▁▁▁
holiday 0 1 1.000e-01 3.000e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▁
wdwevent 0 1 1.200e-01 3.200e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▁
wdwrace 0 1 4.000e-02 2.000e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▁
wdwmaxtemp 5 1 8.280e+01 8.530e+00 5.111e+01 7.829e+01 8.450e+01 8.954e+01 9.772e+01 ▁▁▃▇▆
wdwmintemp 6 1 6.550e+01 1.018e+01 2.748e+01 5.903e+01 6.835e+01 7.410e+01 8.128e+01 ▁▂▃▆▇
wdwmeantemp 6 1 7.415e+01 9.060e+00 3.975e+01 6.876e+01 7.637e+01 8.161e+01 8.776e+01 ▁▂▃▆▇
mkevent 0 1 2.600e-01 4.400e-01 0.000e+00 0.000e+00 0.000e+00 1.000e+00 1.000e+00 ▇▁▁▁▃
epevent 0 1 5.800e-01 4.900e-01 0.000e+00 0.000e+00 1.000e+00 1.000e+00 1.000e+00 ▆▁▁▁▇
hsevent 0 1 1.000e-01 3.000e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▁
akevent 0 1 3.000e-02 1.800e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▁
mkemhmorn 0 1 1.900e-01 4.000e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▂
mkemhmyest 0 1 1.900e-01 4.000e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▂
mkemhmtom 0 1 1.900e-01 4.000e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▂
mkemheve 0 1 1.300e-01 3.300e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▁
mkhoursemh 0 1 1.364e+01 1.980e+00 7.500e+00 1.300e+01 1.400e+01 1.500e+01 2.398e+01 ▁▇▅▁▁
mkhoursemhyest 0 1 1.365e+01 1.980e+00 7.500e+00 1.300e+01 1.400e+01 1.500e+01 2.398e+01 ▁▇▅▁▁
mkhoursemhtom 0 1 1.364e+01 1.980e+00 7.500e+00 1.300e+01 1.400e+01 1.500e+01 2.398e+01 ▁▇▅▁▁
mkemheyest 0 1 1.300e-01 3.300e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▁
mkemhetom 0 1 1.300e-01 3.300e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▁
epemhmorn 0 1 1.300e-01 3.400e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▁
epemhmyest 0 1 1.300e-01 3.400e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▁
epemhmtom 0 1 1.300e-01 3.400e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▁
epemheve 0 1 1.300e-01 3.400e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▁
epemheyest 0 1 1.300e-01 3.400e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▁
epemhetom 0 1 1.300e-01 3.400e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▁
ephoursemh 0 1 1.241e+01 9.600e-01 9.000e+00 1.200e+01 1.200e+01 1.300e+01 1.700e+01 ▁▇▃▂▁
ephoursemhyest 0 1 1.241e+01 9.600e-01 9.000e+00 1.200e+01 1.200e+01 1.300e+01 1.700e+01 ▁▇▃▂▁
ephoursemhtom 0 1 1.241e+01 9.600e-01 9.000e+00 1.200e+01 1.200e+01 1.300e+01 1.700e+01 ▁▇▃▂▁
hsemhmorn 0 1 1.800e-01 3.800e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▂
hsemhmyest 0 1 1.800e-01 3.800e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▂
hsemhmtom 0 1 1.800e-01 3.800e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▂
hsemheve 0 1 6.000e-02 2.500e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▁
hsemheyest 0 1 6.000e-02 2.500e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▁
hsemhetom 0 1 6.000e-02 2.500e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▁
hshoursemh 0 1 1.232e+01 1.490e+00 8.000e+00 1.100e+01 1.200e+01 1.300e+01 1.800e+01 ▁▇▇▂▁
hshoursemhyest 0 1 1.232e+01 1.490e+00 8.000e+00 1.100e+01 1.200e+01 1.300e+01 1.800e+01 ▁▇▇▂▁
hshoursemhtom 0 1 1.232e+01 1.490e+00 8.000e+00 1.100e+01 1.200e+01 1.300e+01 1.800e+01 ▁▇▇▂▁
akemhmorn 0 1 3.000e-01 4.600e-01 0.000e+00 0.000e+00 0.000e+00 1.000e+00 1.000e+00 ▇▁▁▁▃
akemhmyest 0 1 3.000e-01 4.600e-01 0.000e+00 0.000e+00 0.000e+00 1.000e+00 1.000e+00 ▇▁▁▁▃
akemhmtom 0 1 3.000e-01 4.600e-01 0.000e+00 0.000e+00 0.000e+00 1.000e+00 1.000e+00 ▇▁▁▁▃
akemheve 0 1 4.000e-02 2.000e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▁
akemheyest 0 1 4.000e-02 2.000e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▁
akemhetom 0 1 4.000e-02 2.000e-01 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 ▇▁▁▁▁
akhoursemh 0 1 1.177e+01 1.800e+00 7.000e+00 1.100e+01 1.200e+01 1.300e+01 1.700e+01 ▂▇▇▅▁
akhoursemhyest 0 1 1.177e+01 1.800e+00 7.000e+00 1.100e+01 1.200e+01 1.300e+01 1.700e+01 ▂▇▇▅▁
akhoursemhtom 0 1 1.176e+01 1.800e+00 7.000e+00 1.100e+01 1.200e+01 1.300e+01 1.700e+01 ▂▇▇▅▁
mkhours 0 1 1.326e+01 2.010e+00 7.000e+00 1.200e+01 1.300e+01 1.500e+01 2.398e+01 ▂▇▇▁▁
mkhoursyest 0 1 1.326e+01 2.010e+00 7.000e+00 1.200e+01 1.300e+01 1.500e+01 2.398e+01 ▂▇▇▁▁
mkhourstom 0 1 1.326e+01 2.000e+00 7.000e+00 1.200e+01 1.300e+01 1.500e+01 2.398e+01 ▂▇▇▁▁
ephours 0 1 1.202e+01 6.400e-01 8.000e+00 1.200e+01 1.200e+01 1.200e+01 1.700e+01 ▁▁▇▁▁
ephoursyest 0 1 1.203e+01 6.400e-01 8.000e+00 1.200e+01 1.200e+01 1.200e+01 1.700e+01 ▁▁▇▁▁
ephourstom 0 1 1.202e+01 6.400e-01 8.000e+00 1.200e+01 1.200e+01 1.200e+01 1.700e+01 ▁▁▇▁▁
hshours 0 1 1.192e+01 1.190e+00 5.000e+00 1.100e+01 1.200e+01 1.250e+01 1.800e+01 ▁▁▇▂▁
hshoursyest 0 1 1.193e+01 1.200e+00 5.000e+00 1.100e+01 1.200e+01 1.250e+01 1.800e+01 ▁▁▇▂▁
hshourstom 0 1 1.192e+01 1.190e+00 5.000e+00 1.100e+01 1.200e+01 1.250e+01 1.800e+01 ▁▁▇▂▁
akhours 0 1 1.146e+01 1.680e+00 6.000e+00 1.050e+01 1.100e+01 1.250e+01 1.700e+01 ▁▃▇▃▁
akhoursyest 0 1 1.147e+01 1.680e+00 6.000e+00 1.050e+01 1.100e+01 1.250e+01 1.700e+01 ▁▃▇▃▁
akhourstom 0 1 1.146e+01 1.680e+00 6.000e+00 1.050e+01 1.100e+01 1.250e+01 1.700e+01 ▁▃▇▃▁
weather_wdwhigh 0 1 8.235e+01 7.860e+00 7.020e+01 7.460e+01 8.280e+01 9.060e+01 9.230e+01 ▅▃▂▂▇
weather_wdwlow 0 1 6.410e+01 9.260e+00 4.920e+01 5.580e+01 6.360e+01 7.400e+01 7.610e+01 ▅▅▃▂▇
weather_wdwprecip 0 1 1.500e-01 8.000e-02 3.000e-02 8.000e-02 1.200e-01 2.300e-01 3.500e-01 ▇▆▃▅▁
capacitylost_mk 0 1 4.221e+05 3.646e+04 3.529e+05 3.858e+05 4.339e+05 4.561e+05 4.736e+05 ▅▂▁▇▆
capacitylost_ep 0 1 3.669e+05 2.402e+04 3.252e+05 3.384e+05 3.808e+05 3.820e+05 3.947e+05 ▅▁▁▂▇
capacitylost_hs 0 1 2.875e+05 3.320e+04 2.038e+05 2.796e+05 3.019e+05 3.119e+05 3.219e+05 ▂▁▁▁▇
capacitylost_ak 0 1 2.282e+05 1.497e+04 2.108e+05 2.208e+05 2.232e+05 2.328e+05 2.739e+05 ▇▅▁▁▂
capacitylostwgt_mk 0 1 4.137e+07 3.621e+06 3.466e+07 3.764e+07 4.259e+07 4.458e+07 4.655e+07 ▅▂▂▇▆
capacitylostwgt_ep 0 1 3.534e+07 2.201e+06 3.169e+07 3.269e+07 3.667e+07 3.667e+07 3.768e+07 ▅▁▁▁▇
capacitylostwgt_hs 0 1 2.753e+07 3.049e+06 1.981e+07 2.677e+07 2.877e+07 2.976e+07 3.075e+07 ▂▁▁▁▇
capacitylostwgt_ak 0 1 2.239e+07 1.398e+06 2.079e+07 2.178e+07 2.180e+07 2.278e+07 2.674e+07 ▇▅▁▁▂
mkprdday 0 1 1.070e+00 5.900e-01 0.000e+00 1.000e+00 1.000e+00 1.000e+00 3.000e+00 ▁▇▁▁▁
mkprdngt 0 1 6.400e-01 9.000e-01 0.000e+00 0.000e+00 0.000e+00 2.000e+00 3.000e+00 ▇▁▁▃▁
mkfirewk 0 1 9.400e-01 2.600e-01 0.000e+00 1.000e+00 1.000e+00 1.000e+00 2.000e+00 ▁▁▇▁▁
epfirewk 0 1 9.400e-01 2.500e-01 0.000e+00 1.000e+00 1.000e+00 1.000e+00 3.000e+00 ▁▇▁▁▁
hsprdday 0 1 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ▁▁▇▁▁
hsfirewk 0 1 7.800e-01 4.500e-01 0.000e+00 1.000e+00 1.000e+00 1.000e+00 2.000e+00 ▂▁▇▁▁
hsshwngt 0 1 1.280e+00 6.200e-01 0.000e+00 1.000e+00 1.000e+00 2.000e+00 3.000e+00 ▁▇▁▅▁
hsfirewks 0 1 1.000e+00 0.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 ▁▁▇▁▁
akprdday 0 1 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ▁▁▇▁▁
akshwngt 0 1 1.040e+00 9.700e-01 0.000e+00 0.000e+00 1.000e+00 2.000e+00 3.000e+00 ▇▂▁▇▁

This dataset contains many more variables than the one we worked with previously. For this analysis, we are going to select date (the observation date), wdw_ticket_season (the ticket season for the observation), wdwmaxtemp (the maximum temperature), mkclose (the time Magic Kingdom closed), mkemhmorn (whether Magic Kingdom had an “Extra Magic Hour” in the morning).

parks_metadata_clean <- parks_metadata_raw |>
  ##  based on our analysis plan, we will select the following variables
  select(date, wdw_ticket_season, wdwmaxtemp, mkclose, mkemhmorn) |>
  ## based on eligibility criteria, limit to 2018
  filter(year(date) == 2018) |>
  ## rename variables
  rename(
    park_date = date,
    park_ticket_season = wdw_ticket_season,
    park_temperature_high = wdwmaxtemp,
    park_close = mkclose,
    park_extra_magic_morning = mkemhmorn
  )

7.3 Working with multiple data sources

Frequently we find ourselves merging data from multiple sources when attempting to answer causal questions in order to ensure that all of the necessary factors are accounted for. The way we can combine datasets is via joins – joining two or more datasets based on a set or sets of common variables. We can think of three main types of joins: left, right, and inner. A left join combines data from two datasets based on a common variable and includes all records from the left dataset along with matching records from the right dataset (in dplyr, left_join()), while a right join includes all records from the right dataset and their corresponding matches from the left dataset (in dplyr right_join()); an inner join, on the other hand, includes only the records with matching values in both datasets, excluding non-matching records (in dplyr inner_join(). For this analysis, we need to use a left join to pull in the cleaned parks metadata.

seven_dwarfs_train_2018 <- seven_dwarfs_train_2018 |>
  left_join(parks_metadata_clean, by = "park_date")

7.4 Recognizing missing data

It is important to recognize whether we have any missing data in our variables. The visdat package is great for getting a quick sense of whether we have any missing data.

library(visdat)
vis_miss(seven_dwarfs_train_2018)

It looks like we only have a few observations (2%) missing our outcome of interest. This is not too bad. For this first analysis we will ignore the missing values. We can explicitly drop them using the drop_na() function from dplyr.

seven_dwarfs_train_2018 <- seven_dwarfs_train_2018 |>
  drop_na()

7.5 Exploring and visualizing data and assumptions

The positivity assumption requires that within each level and combination of the study variables used to achieve exchangeability, there are exposed and unexposed subjects (Section 3.3). We can explore this by visualizing the distribution of each of our proposed confounders stratified by the exposure.

7.5.1 Single variable checks for positivity violations

Figure 7.3 shows the distribution of Magic Kingdom park closing time by whether the date had extra magic hours in the morning. There is not clear evidence of a lack of positivity here as both exposure levels span the majority of the covariate space.

ggplot(
  seven_dwarfs_train_2018,
  aes(
    x = factor(park_close),
    group = factor(park_extra_magic_morning),
    fill = factor(park_extra_magic_morning)
  )
) +
  geom_bar(position = position_dodge2(width = 0.9, preserve = "single")) +
  labs(
    fill = "Extra Magic Morning",
    x = "Time of Park Close"
  )

Figure 7.3: Distribution of Magic Kingdom park closing time by whether the date had extra magic hours in the morning

To examine the distribution of historic temperature high at Magic Kingdom by whether the date had extra magic hours in the morning we can use a mirrored histogram. We’ll use the {halfmoon} package’s geom_mirror_histogram() to create one. Examining Figure 7.4, it does look like there are very few days in the exposed group with maximum temperatures less than 60 degrees, while not necessarily a positivity violation it is worth keeping an eye on, particularly because the dataset is not very large, so this could make it difficult to estimate an average exposure effect across this whole space. If we found this to be particularly difficult, we could posit changing our causal question to instead restrict the analysis to warmer days. This of course would also restrict which days we could draw conclusions about for the future.

library(halfmoon)
ggplot(
  seven_dwarfs_train_2018,
  aes(
    x = park_temperature_high,
    group = factor(park_extra_magic_morning),
    fill = factor(park_extra_magic_morning)
  )
) +
  geom_mirror_histogram(bins = 20) +
  labs(
    fill = "Extra Magic Morning",
    x = "Historic maximum temperature (degrees F)"
  )

Figure 7.4: Distribution of historic temperature high at Magic Kingdom by whether the date had extra magic hours in the morning

Finally, let’s look at the distribution of ticket season by whether there were extra magic hours in the morning. Examining Figure 7.5, we do not see any positivity violations.

ggplot(
  seven_dwarfs_train_2018,
  aes(
    x = park_ticket_season,
    group = factor(park_extra_magic_morning),
    fill = factor(park_extra_magic_morning)
  )
) +
  geom_bar(position = "dodge") +
  labs(
    fill = "Extra Magic Morning",
    x = "Magic Kingdom Ticket Season"
  )

Figure 7.5: Distribution of historic temperature high at Magic Kingdom by whether the date had extra magic hours in the morning

7.5.2 Multiple variable checks for positivity violations

We have confirmed that for each of the three confounders, we do not see strong evidence of positivity violations. Because we have so few variables here, we can examine this a bit more closely. Let’s start by discretizing the park_temperature_high variable a bit (we will cut it into tertiles).

seven_dwarfs_train_2018 |>
  ## cut park_temperature_high into tertiles
  mutate(park_temperature_high_bin = cut(park_temperature_high, breaks = 3)) |>
  ## bin park close time
  mutate(park_close_bin = case_when(
    hour(park_close) < 19 & hour(park_close) > 12 ~ "(1) early",
    hour(park_close) >= 19 & hour(park_close) < 24 ~ "(2) standard",
    hour(park_close) >= 24 | hour(park_close) < 12 ~ "(3) late"
  )) |>
  group_by(park_close_bin, park_temperature_high_bin, park_ticket_season) |>
  ## calculate the proportion exposed in each bin
  summarise(prop_exposed = mean(park_extra_magic_morning), .groups = "drop") |>
  ggplot(aes(x = park_close_bin, y = park_temperature_high_bin, fill = prop_exposed)) +
  geom_tile() +
  scale_fill_gradient2(midpoint = 0.5) +
  facet_wrap(~park_ticket_season) +
  labs(
    y = "Historic Maximum Temperature (F)",
    x = "Magic Kingdom Park Close Time",
    fill = "Proportion of Days Exposed"
  )

Figure 7.6: Check for positivity violations across three confounders: historic high temperature, park close time, and ticket season.

Interesting! Figure 7.6 shows an interesting potential violation. It looks like 100% of days with lower temperatures (historic highs between 51 and 65 degrees) that are in the peak ticket season have extra magic hours in the morning. This actually makes sense if we think a bit about this data set. The only days with cold temperatures in Florida that would also be considered a “peak” time to visit Walt Disney World would be over Christmas / New Years. During this time there historically were always extra magic hours.

We are going to proceed with the analysis, but we will keep these observations in mind.

7.6 Presenting descriptive statistics

Let’s examine a table of the variables of interest in this data frame. To do so, we are going to use the tbl_summary() function from the gtsummary package. (We’ll also use the labelled package to clean up the variable names for the table.)

library(gtsummary)
library(labelled)
seven_dwarfs_train_2018 <- seven_dwarfs_train_2018 |>
  set_variable_labels(
    park_ticket_season = "Ticket Season",
    park_close = "Close Time",
    park_temperature_high = "Historic High Temperature"
  )

tbl_summary(
  seven_dwarfs_train_2018,
  by = park_extra_magic_morning,
  include = c(park_ticket_season, park_close, park_temperature_high)
) |>
  # add an overall column to the table
  add_overall(last = TRUE)
Table 7.2: A descriptive table of Extra Magic Morning in the touringplans dataset. This table shows the distributions of these variables in the observed population.
Characteristic 0, N = 2941 1, N = 601 Overall, N = 3541
Ticket Season


    peak 60 (20%) 18 (30%) 78 (22%)
    regular 158 (54%) 35 (58%) 193 (55%)
    value 76 (26%) 7 (12%) 83 (23%)
Close Time


    16:30:00 1 (0.3%) 0 (0%) 1 (0.3%)
    18:00:00 37 (13%) 18 (30%) 55 (16%)
    20:00:00 18 (6.1%) 2 (3.3%) 20 (5.6%)
    21:00:00 28 (9.5%) 0 (0%) 28 (7.9%)
    22:00:00 91 (31%) 11 (18%) 102 (29%)
    23:00:00 78 (27%) 11 (18%) 89 (25%)
    24:00:00 40 (14%) 17 (28%) 57 (16%)
    25:00:00 1 (0.3%) 1 (1.7%) 2 (0.6%)
Historic High Temperature 84 (78, 89) 83 (76, 87) 84 (78, 89)
1 n (%); Median (IQR)