Throughout this book we will be using data obtained from Touring Plans. Touring Plans is a company that helps folks plan their trips to Disney and Universal theme parks. One of their goals is to accurately predict attraction wait times at these theme parks by leveraging data and statistical modeling. The touringplans R package includes several datasets containing information about Disney theme park attractions. A summary of the attractions included in the package can be found by running the following:
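## One standard way to list the datasets included in the package; the
## book may instead print a package-specific summary of the attractions.
library(touringplans)
data(package = "touringplans")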
Additionally, this package contains a dataset with raw metadata about the parks, with observations recorded daily. This metadata includes information like the Walt Disney World ticket season on the particular day (was it high season – think Christmas – or low season – think right when school started), what the historic temperatures were in the park on that day, and whether there was a special event, such as “extra magic hours” in the park on that day (did the park open early to guests staying in the Walt Disney World resorts?).
Is there a relationship between whether there were “Extra Magic Hours” in the morning at Magic Kingdom and the average wait time for an attraction called the “Seven Dwarfs Mine Train” the same day between 9am and 10am in 2018?
Let’s begin by diagramming this causal question (Figure 7.1).
Historically, guests who stayed in a Walt Disney World resort hotel could access the park during “Extra Magic Hours,” during which the park was closed to all other guests. These extra hours could be in the morning or evening. The Seven Dwarfs Mine Train is a ride at Walt Disney World’s Magic Kingdom. Magic Kingdom may or may not be selected each day to have these “Extra Magic Hours.” We are interested in examining the relationship between whether there were “Extra Magic Hours” in the morning and the average wait time for the Seven Dwarfs Mine Train on the same day between 9 am and 10 am. Below is a proposed DAG for this question.
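As a rough sketch (not the book's exact figure code), a DAG like this can be drawn with the {ggdag} package; the node names below are assumptions that mirror the cleaned variable names used later in this chapter:

library(ggdag)
library(ggplot2)

## proposed DAG: the three confounders affect both the exposure
## (extra magic morning) and the outcome (average posted wait time)
emm_wait_dag <- dagify(
  wait_minutes_posted_avg ~ park_extra_magic_morning +
    park_ticket_season + park_close + park_temperature_high,
  park_extra_magic_morning ~ park_ticket_season + park_close +
    park_temperature_high,
  exposure = "park_extra_magic_morning",
  outcome = "wait_minutes_posted_avg"
)

ggdag(emm_wait_dag, text_size = 2.5) +
  theme_dag()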
Since we are not in charge of Walt Disney World’s operations, we cannot randomize dates to have (or not have) “Extra Magic Hours.” We therefore need to rely on previously collected observational data and do our best to emulate the target trial that we would have conducted, had it been possible. Here, our observations are days. Looking at the diagram above, we can map each element of the causal question to elements of our target trial protocol:
Eligibility criteria: Days must be from 2018
Exposure definition: Magic Kingdom had “Extra Magic Hours” in the morning
Assignment procedures: Observed – if the historic data suggest there were “Extra Magic Hours” in the morning on a particular day, that day is classified as “exposed”; otherwise, it is “unexposed”
Follow-up period: From park open to 10am.
Outcome definition: The average posted wait time between 9am and 10am
Causal contrast of interest: Average treatment effect (we will discuss this in Chapter 11)
Analysis plan: We use inverse probability weighting after fitting a propensity score model to estimate the average treatment effect of the exposure on the outcome of interest. We will adjust for variables as determined by our DAG (Figure 7.2)
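As a preview of that analysis plan, here is a minimal sketch, assuming the cleaned seven_dwarfs_train_2018 data constructed later in this chapter; the estimation details, and the actual implementation, are covered in later chapters:

library(dplyr)

## fit a propensity score model for the exposure given the
## three proposed confounders from the DAG
propensity_model <- glm(
  park_extra_magic_morning ~ park_ticket_season + park_close +
    park_temperature_high,
  data = seven_dwarfs_train_2018,
  family = binomial()
)

## convert the propensity scores into inverse probability
## weights for the average treatment effect
seven_dwarfs_train_2018 |>
  mutate(
    propensity_score = predict(propensity_model, type = "response"),
    ate_weight = ifelse(
      park_extra_magic_morning == 1,
      1 / propensity_score,
      1 / (1 - propensity_score)
    )
  )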
7.2 Data wrangling and recipes
Most of our data manipulation tools come from the dplyr package (Table 7.1). We will also use lubridate to help us manipulate dates.
Table 7.1: Mapping target trial protocol elements to commonly used dplyr functions
To answer this question, we are going to need to manipulate both the seven_dwarfs_train dataset and the parks_metadata_raw dataset. Let’s start with the seven_dwarfs_train dataset. The Seven Dwarfs Mine Train ride is an attraction at Walt Disney World’s Magic Kingdom. The seven_dwarfs_train dataset in the {touringplans} package contains information about the date a particular wait time was recorded (park_date), the time of the wait time (wait_datetime), the actual wait time (wait_minutes_actual), and the posted wait time (wait_minutes_posted). Let’s take a look at this dataset. The {skimr} package is great for getting a quick glimpse at a new dataset.
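library(touringplans)
library(skimr)

skim(seven_dwarfs_train)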
Examining the output above, we learn that this dataset contains four columns and 321,631 rows. We also learn that the dates span from 2015 to 2021. We can also examine the distribution of each of the variables to detect any potential anomalies. Notice anything strange? Look at the p0 (that is the minimum value) for wait_minutes_actual. It is -92918! We are not using this variable for this analysis, but we will for future analyses, so this is good to keep in mind.
We need this dataset to calculate our outcome. Recall from above that our outcome is defined as the average posted wait time between 9am and 10am. Additionally, recall our eligibility criteria states that we need to restrict our analysis to days in 2018.
library(dplyr)
library(lubridate)

seven_dwarfs_train_2018 <- seven_dwarfs_train |>
  filter(year(park_date) == 2018) |> # eligibility criteria
  mutate(hour = hour(wait_datetime)) |> # get hour from wait
  group_by(park_date, hour) |> # group by date
  summarise( # get average wait time
    wait_minutes_posted_avg = mean(wait_minutes_posted, na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate( # if it is NaN, make it NA
    wait_minutes_posted_avg = case_when(
      is.nan(wait_minutes_posted_avg) ~ NA,
      TRUE ~ wait_minutes_posted_avg
    )
  ) |>
  filter(hour == 9) # only keep the average wait time between 9 and 10
Now that we have our outcome settled, we need to get our exposure variable, as well as any other park-specific variables about each day that we may need to adjust for. Examining Figure 7.2, we see that we need data for three proposed confounders: the ticket season, the time the park closed, and the historic high temperature. These are in the parks_metadata_raw dataset. These data will require extra cleaning, since the variable names are still in their original, raw format.
We like to have our variable names follow a clean convention – one way to do this is to follow Emily Riederer’s “Column Names as Contracts” format (Riederer 2020). The basic idea is to predefine a set of words, phrases, or stubs with clear meanings to index information, and to use these consistently when naming variables. For example, in these data, variables that are specific to a particular wait time are prefixed with the term wait (e.g., wait_datetime and wait_minutes_actual), while variables that are specific to the park on a particular day, acquired from the parks metadata, are prefixed with the term park (e.g., park_date or park_temperature_high).
Let’s first decide which variables we will need. In practice, this decision may involve an iterative process. For example, after drawing our DAG or after conducting diagnostics, we may determine that we need more variables than we originally cleaned. Let’s start by skimming this dataframe.
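skim(parks_metadata_raw)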
This dataset contains many more variables than the one we worked with previously. For this analysis, we are going to select date (the observation date), wdw_ticket_season (the ticket season for the observation), wdwmaxtemp (the maximum temperature), mkclose (the time Magic Kingdom closed), and mkemhmorn (whether Magic Kingdom had an “Extra Magic Hour” in the morning).
parks_metadata_clean <- parks_metadata_raw |>
  ## based on our analysis plan, we will select the following variables
  select(date, wdw_ticket_season, wdwmaxtemp, mkclose, mkemhmorn) |>
  ## based on eligibility criteria, limit to 2018
  filter(year(date) == 2018) |>
  ## rename variables
  rename(
    park_date = date,
    park_ticket_season = wdw_ticket_season,
    park_temperature_high = wdwmaxtemp,
    park_close = mkclose,
    park_extra_magic_morning = mkemhmorn
  )
7.3 Working with multiple data sources
Frequently, we find ourselves merging data from multiple sources when attempting to answer causal questions, in order to ensure that all of the necessary factors are accounted for. The way we combine datasets is via joins – joining two or more datasets based on a set or sets of common variables. We can think of three main types of joins: left, right, and inner. A left join combines data from two datasets based on a common variable and includes all records from the left dataset along with matching records from the right dataset (in dplyr, left_join()). A right join includes all records from the right dataset and their corresponding matches from the left dataset (in dplyr, right_join()). An inner join, on the other hand, includes only the records with matching values in both datasets, excluding non-matching records (in dplyr, inner_join()). For this analysis, we need to use a left join to pull in the cleaned parks metadata.
seven_dwarfs_train_2018 <- seven_dwarfs_train_2018 |>
  left_join(parks_metadata_clean, by = "park_date")
7.4 Recognizing missing data
It is important to recognize whether we have any missing data in our variables. The visdat package is great for getting a quick sense of whether we have any missing data.
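library(visdat)

## visualize the pattern of missingness across all variables
vis_miss(seven_dwarfs_train_2018)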
It looks like we only have a few observations (2%) missing our outcome of interest. This is not too bad. For this first analysis, we will ignore the missing values. We can explicitly drop them using the drop_na() function from tidyr.
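library(tidyr)

## drop days with a missing outcome (a sketch; one could instead
## drop rows with a missing value in any column)
seven_dwarfs_train_2018 <- seven_dwarfs_train_2018 |>
  drop_na(wait_minutes_posted_avg)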
7.5 Exploring and visualizing data and assumptions
The positivity assumption requires that within each level and combination of the study variables used to achieve exchangeability, there are exposed and unexposed subjects (Section 3.3). We can explore this by visualizing the distribution of each of our proposed confounders stratified by the exposure.
7.5.1 Single variable checks for positivity violations
Figure 7.3 shows the distribution of Magic Kingdom park closing time by whether the date had extra magic hours in the morning. There is no clear evidence of a positivity violation here, as both exposure levels span the majority of the covariate space.
library(ggplot2)

ggplot(
  seven_dwarfs_train_2018,
  aes(
    x = factor(park_close),
    group = factor(park_extra_magic_morning),
    fill = factor(park_extra_magic_morning)
  )
) +
  geom_bar(position = position_dodge2(width = 0.9, preserve = "single")) +
  labs(
    fill = "Extra Magic Morning",
    x = "Time of Park Close"
  )
To examine the distribution of the historic high temperature at Magic Kingdom by whether the date had extra magic hours in the morning, we can use a mirrored histogram. We’ll use the {halfmoon} package’s geom_mirror_histogram() to create one. Examining Figure 7.4, it does look like there are very few days in the exposed group with maximum temperatures less than 60 degrees. While this is not necessarily a positivity violation, it is worth keeping an eye on, particularly because the dataset is not very large, so it could be difficult to estimate an average exposure effect across this whole space. If this proved especially difficult, we could change our causal question to restrict the analysis to warmer days. That, of course, would also restrict which days we could draw conclusions about in the future.
library(halfmoon)

ggplot(
  seven_dwarfs_train_2018,
  aes(
    x = park_temperature_high,
    group = factor(park_extra_magic_morning),
    fill = factor(park_extra_magic_morning)
  )
) +
  geom_mirror_histogram(bins = 20) +
  labs(
    fill = "Extra Magic Morning",
    x = "Historic maximum temperature (degrees F)"
  )
Finally, let’s look at the distribution of ticket season by whether there were extra magic hours in the morning. Examining Figure 7.5, we do not see any positivity violations.
ggplot(
  seven_dwarfs_train_2018,
  aes(
    x = park_ticket_season,
    group = factor(park_extra_magic_morning),
    fill = factor(park_extra_magic_morning)
  )
) +
  geom_bar(position = "dodge") +
  labs(
    fill = "Extra Magic Morning",
    x = "Magic Kingdom Ticket Season"
  )
7.5.2 Multiple variable checks for positivity violations
We have confirmed that for each of the three confounders, we do not see strong evidence of positivity violations. Because we have so few variables here, we can examine this a bit more closely. Let’s start by discretizing the park_temperature_high variable a bit (we will cut it into tertiles).
seven_dwarfs_train_2018 |>
  ## cut park_temperature_high into tertiles
  mutate(park_temperature_high_bin = cut(park_temperature_high, breaks = 3)) |>
  ## bin park close time
  mutate(park_close_bin = case_when(
    hour(park_close) < 19 & hour(park_close) > 12 ~ "(1) early",
    hour(park_close) >= 19 & hour(park_close) < 24 ~ "(2) standard",
    hour(park_close) >= 24 | hour(park_close) < 12 ~ "(3) late"
  )) |>
  group_by(park_close_bin, park_temperature_high_bin, park_ticket_season) |>
  ## calculate the proportion exposed in each bin
  summarise(
    prop_exposed = mean(park_extra_magic_morning),
    .groups = "drop"
  ) |>
  ggplot(aes(
    x = park_close_bin,
    y = park_temperature_high_bin,
    fill = prop_exposed
  )) +
  geom_tile() +
  scale_fill_gradient2(midpoint = 0.5) +
  facet_wrap(~park_ticket_season) +
  labs(
    y = "Historic Maximum Temperature (F)",
    x = "Magic Kingdom Park Close Time",
    fill = "Proportion of Days Exposed"
  )
Interesting! Figure 7.6 shows a potential violation. It looks like 100% of days with lower temperatures (historic highs between 51 and 65 degrees) that are in the peak ticket season have extra magic hours in the morning. This actually makes sense if we think a bit about this dataset. The only cold days in Florida that would also be considered a “peak” time to visit Walt Disney World fall around Christmas and New Year’s. During that time, there historically were always extra magic hours.
We are going to proceed with the analysis, but we will keep these observations in mind.
7.6 Presenting descriptive statistics
Let’s examine a table of the variables of interest in this data frame. To do so, we are going to use the tbl_summary() function from the gtsummary package. (We’ll also use the labelled package to clean up the variable names for the table.)
library(gtsummary)
library(labelled)

seven_dwarfs_train_2018 <- seven_dwarfs_train_2018 |>
  set_variable_labels(
    park_ticket_season = "Ticket Season",
    park_close = "Close Time",
    park_temperature_high = "Historic High Temperature"
  )

tbl_summary(
  seven_dwarfs_train_2018,
  by = park_extra_magic_morning,
  include = c(park_ticket_season, park_close, park_temperature_high)
) |>
  # add an overall column to the table
  add_overall(last = TRUE)
Table 7.2: A descriptive table of Extra Magic Morning in the touringplans dataset. This table shows the distributions of these variables in the observed population.