21  Sensitivity analysis

Because many of the assumptions of causal inference are unverifiable, it’s reasonable to be concerned about the validity of your results. In this chapter, we’ll provide some ways to probe our assumptions and results for strengths and weaknesses. We’ll explore two main ways to do so: exploring the logical implications of the causal question and related DAGs and using mathematical techniques to quantify how different our results would be under other circumstances, such as in the presence of unmeasured confounding. These approaches are known as sensitivity analyses: How sensitive is our result to conditions other than those laid out in our assumptions and analysis?

21.1 Checking DAGs for robustness

Let’s start with where we began the modeling process: creating causal diagrams. Because DAGs encode the assumptions on which we base our analysis, they are natural points of critique for both others and ourselves.

21.1.1 Alternate adjustment sets and alternate DAGs

The same mathematical underpinnings of DAGs that allow us to to query them for things like adjustment sets also allow us to query other implications of DAGs. One of the simplest is that if your DAG is correct and your data are well-measured, any valid adjustment set should result in an unbiased estimate of the causal effect. Let’s consider the DAG we introduced in Figure 7.2.

Figure 21.1: The original proposed DAG for the relationship between Extra Magic Hours in the morning at a particular park and the average wait time between 9 am and 10 am. As before, we are saying that we believe 1) Extra Magic Hours impacts average wait time and 2) both Extra Magic Hours and average wait time are determined by the time the park closes, historic high temperatures, and ticket season.

In Figure 21.1, there’s only one adjustment set because all three confounders represent independent backdoor paths. Let’s say, though, that we had used Figure 21.2 instead, which is missing arrows from the park close time and historical temperature to whether there was an Extra Magic Morning.

Figure 21.2: An alternative DAG for the relationship between Extra Magic Hours in the morning at a particular park and the average wait time between 9 am and 10 am. This DAG has no arrows from park close time and historical temperature to Extra Magic Hours.

Now there are 4 potential adjustment sets: park_ticket_season, park_close + park_ticket_season, park_temperature_high + park_ticket_season, or park_close + park_temperature_high + park_ticket_season. Table 21.1 presents the IPW estimates for each adjustment set. The effects are quite different. Some slight variation in the estimates is expected since they are estimated using different variables that may not be measured perfectly; if this DAG were right, however, we should see them much more closely aligned than this. In particular, there seems to be a 3-minute difference between the models with and without park close time. The difference in these results implies that there is something off about the causal structure we specified.

Table 21.1: A table of ATE estimates from the IPW estimator. Each estimate was calculated for one of the valid adjustment sets for the DAG. The estimates are sorted by effect size in order. If the DAG is right and all the data well measured, different adjustment sets should give roughly the same answer.
Adjustment Set ATE
Close time, ticket season 6.579
Historic temperature, close time, ticket season 6.199
Ticket season 4.114
Historic temperature, ticket season 3.627

21.1.2 Negative controls

Alternate adjustment sets are a way of probing the logical implications of your DAG: if it’s correct, there could be many ways to account for the open backdoor paths correctly. The reverse is also true: the causal structure of your research question also implies relationships that should be null. One way that researchers take advantage of this implication is through negative controls. A negative control is either an exposure (negative exposure control) or outcome (negative outcome control) similar to your question in as many ways as possible, except that there shouldn’t be a causal effect. Lipsitch, Tchetgen Tchetgen, and Cohen (2010) describe negative controls for observational research. In their article, they reference standard controls in bench science. In a lab experiment, any of these actions should lead to a null effect:

  1. Leave out an essential ingredient.
  2. Inactivate the hypothesized active ingredient.
  3. Check for an effect that would be impossible by the hypothesized outcome.

There’s nothing unique to lab work here; these scientists merely probe the logical implications of their understanding and hypotheses. To find a good negative control, you usually need to extend your DAG to include more of the causal structure surrounding your question. Let’s look at some examples.

21.1.2.1 Negative exposures

First, we’ll look at a negative exposure control. If Extra Magic Mornings really cause an increase in wait time, it stands to reason that this effect is time-limited. In other words, there should be some period after which the effect of Extra Magic Morning dissipates. Let’s call today i and the previous day i - n, where n is the number of days before the outcome that the negative exposure control occurs. First, let’s explore n = 63, e.g., whether or not there was an Extra Magic Morning nine weeks ago. That is a pretty reasonable starting point: it’s unlikely that the effect on wait time would still be present 63 days later. This analysis is an example of leaving out an essential ingredient: we waited too long for this to be a realistic cause. Any remaining effect is likely due to residual confounding.

Let’s look at a DAG to visualize this situation. In Figure 21.3, we’ve added an identical layer to our original one: now there are two Extra Magic Mornings: one for day i and one for day i - 63. Similarly, there are two versions of the confounders for each day. One crucial detail in this DAG is that we’re assuming that there is an effect of day i - 63’s Extra Magic Morning on day i’s; whether or not there is an Extra Magic Morning one day likely affects whether or not it happens on another day. The decision about where to place them across the year is not random. If this is true, we would expect an effect: the indirect effect via day i’s Extra Magic Morning status. To get a valid negative control, we need to inactivate this effect, which we can do statistically by controlling for day i’s Extra Magic Morning status. So, given the DAG, our adjustment set is any combination of the confounders (as long as we have at least one version of each) and day i’s Extra Magic Morning (suppressing the indirect effect).

Figure 21.3: An expansion of the causal structure presented in Figure 7.2. In this DAG, the exposure is instead whether or not there were Extra Magic Hours 63 days before the day’s wait time we are examining. Because of the long period, there should be no effect. Similarly, the DAG also has earlier confounders related to day i - 63.

Since the exposure is on day i - 63, we prefer to control for the confounders related to that day, so we’ll use the i - 63 versions. We’ll use lag() from dplyr to get those variables.

n_days_lag <- 63
distinct_emm <- seven_dwarfs_train_2018 |>
  filter(wait_hour == 9) |>
  arrange(park_date) |>
  transmute(
    park_date,
    prev_park_extra_magic_morning = lag(park_extra_magic_morning, n = n_days_lag),
    prev_park_temperature_high = lag(park_temperature_high, n = n_days_lag),
    prev_park_close = lag(park_close, n = n_days_lag),
    prev_park_ticket_season = lag(park_ticket_season, n = n_days_lag)
  )

seven_dwarfs_train_2018_lag <- seven_dwarfs_train_2018 |>
  filter(wait_hour == 9) |>
  left_join(distinct_emm, by = "park_date") |>
  drop_na(prev_park_extra_magic_morning)

When we use these data for the IPW effect, we get -0.93 minutes, much closer to null than we found on day i. Let’s take a look at the effect over time. While there might be a lingering effect of Extra Magic Mornings for a little while (say, the span of an average trip to Disney World), it should quickly approach null. However, in Figure 21.4, we see that, while it eventually approaches null, there is quite a bit of lingering effect. If these results are accurate, it implies that we have some residual confounding in our effect.

Figure 21.4: A scatterplot with a smoothed regression of the relationship between wait times on day i and whether there were Extra Magic Hours on day i - n, where n represents the number of days previous to day i. We expect this relationship to rapidly approach the null, but the effect hovers above null for quite some time. This lingering effect implies we have some residual confounding present.

21.1.2.2 Negative outcomes

Now, let’s examine an example of a negative control outcome: the wait time for a ride at Universal Studios. Universal Studios is also in Orlando, so the set of causes for wait times are likely comparable to those at Disney World on the same day. Of course, whether or not there are Extra Magic Mornings at Disney shouldn’t affect the wait times at Universal on the same day: they are separate parks, and most people don’t visit both within an hour of one another. This negative control is an example of an effect implausible by the hypothesized mechanism.

We don’t have Universal’s ride data, so let’s simulate what would happen with and without residual confounding. We’ll generate wait times based on the historical temperature, park close time, and ticket season (the second two are technically specific to Disney, but we expect a strong correlation with the Universal versions). Because this is a negative outcome, it is not related to whether or not there were Extra Magic Morning hours at Disney.

seven_dwarfs_sim <- seven_dwarfs_train_2018 |>
  mutate(
    # we scale each variable and add a bit of random noise
    # to simulate reasonable Universal wait times
    wait_time_universal =
      park_temperature_high / 150 +
        as.numeric(park_close) / 1500 +
        as.integer(factor(park_ticket_season)) / 1000 +
        rnorm(n(), 5, 5)
  )

If we calculate the IPW effect of park_extra_magic_morning on wait_time_universal, we get 0.21 minutes, a roughly null effect, as expected. But what if we missed an unmeasured confounder, u, which caused Extra Magic Mornings and wait times at both Disney and Universal? Let’s simulate that scenario but augment the data further.

seven_dwarfs_sim2 <- seven_dwarfs_train_2018 |>
  mutate(
    u = rnorm(n(), mean = 10, sd = 3),
    wait_minutes_posted_avg = wait_minutes_posted_avg + u,
    park_extra_magic_morning = ifelse(
      u > 10,
      rbinom(1, 1, .1),
      park_extra_magic_morning
    ),
    wait_time_universal =
      park_temperature_high / 150 +
        as.numeric(park_close) / 1500 +
        as.integer(factor(park_ticket_season)) / 1000 +
        u +
        rnorm(n(), 5, 5)
  )

Now, the effect for both Disney and Universal wait times is different. If we had seen 0.67 minutes for the effect for Disney, we wouldn’t necessarily know that we had a confounded result. However, since we know the wait times at Universal should be unrelated, it’s suspicious that the result, -2.32 minutes, is not null. That is evidence that we have unmeasured confounding.

21.1.3 DAG-data consistency

Negative controls use the logical implications of the causal structure you assume. We can extend that idea to the entire DAG. If the DAG is correct, there are many implications for statistically determining how different variables in the DAG should and should not be related to each other. Like negative controls, we can check if variables that should be independent are independent in the data. Sometimes, the way that DAGs imply independence between variables is conditional on other variables. Thus, this technique is sometimes called implied conditional independencies (Textor et al. 2017). Let’s query our original DAG to find out what it says about the relationships among the variables.

query_conditional_independence(emm_wait_dag) |>
  unnest(conditioned_on)
# A tibble: 3 × 4
  set   a                     b          conditioned_on
  <chr> <chr>                 <chr>      <chr>         
1 1     park_close            park_temp… <NA>          
2 2     park_close            park_tick… <NA>          
3 3     park_temperature_high park_tick… <NA>          

In this DAG, three relationships should be null: 1) park_close and park_temperature_high, 2) park_close and park_ticket_season, and 3) park_temperature_high and park_ticket_season. None of these relationships need to condition on another variable to achieve independence; in other words, they should be unconditionally independent. We can use simple techniques like correlation and regression, as well as other statistical tests, to see if nullness holds for these relationships. Conditional independencies quickly grow in number in complex DAGs, and so dagitty implements a way to automate checks for DAG-data consistency given these implied nulls. dagitty checks if the residuals of a given conditional relationship are correlated, which can be modeled automatically in several ways. We’ll tell dagitty to calculate the residuals using non-linear models with type = "cis.loess". Since we’re working with correlations, the results should be around 0 if our DAG is right. As we see in Figure 21.5, though, one relationship doesn’t hold. There is a correlation between the park’s close time and ticket season.

test_conditional_independence(
  emm_wait_dag,
  data = seven_dwarfs_train_2018 |>
    filter(wait_hour == 9) |>
    mutate(
      across(where(is.character), factor),
      park_close = as.numeric(park_close),
    ) |>
    as.data.frame(),
  type = "cis.loess",
  # use 200 bootstrapped samples to calculate CIs
  R = 200
) |>
  ggdag_conditional_independence()

Figure 21.5: A plot of the estimates and 95% confidence intervals of the correlations between the residuals resulting from a regression of variables in the DAG that should have no relationship. While two relationships appear null, park close time and ticket season seem to be correlated, suggesting we have misspecified the DAG. One source of this misspecification may be missing arrows between the variables. Notably, the adjustment sets are identical with and without this arrow.

Why might we be seeing a relationship when there isn’t supposed to be one? A simple explanation is chance: just like in any statistical inference, we need to be cautious about over-extrapolating what we see in our limited sample. Since we have data for every day in 2018, we could probably rule that out. Another reason is that we’re missing direct arrows from one variable to the other, e.g. from historic temperature to park close time. Adding additional arrows is reasonable: park close time and ticket season closely track the weather. That’s a little bit of evidence that we’re missing an arrow.

At this point, we need to be cautious about overfitting the DAG to the data. DAG-data consistency tests cannot prove your DAG right and wrong, and as we saw in Chapter 6, statistical techniques alone cannot determine the causal structure of a problem. So why use these tests? As with negative controls, they provide a way to probe your assumptions. While we can never be sure about them, we do have information in the data. Finding that conditional independence holds is a little more evidence supporting your assumptions. There’s a fine line here, so we recommend being transparent about these types of checks: if you make changes based on the results of these tests, you should report your original DAG, too. Notably, in this case, adding direct arrows to all three of these relationships results in an identical adjustment set.

Let’s look at an example that is more likely to be misspecified, where we remove the arrows from park close time and ticket season to Extra Magic Morning.

emm_wait_dag2 <- dagify(
  wait_minutes_posted_avg ~ park_extra_magic_morning + park_close +
    park_ticket_season + park_temperature_high,
  park_extra_magic_morning ~ park_temperature_high,
  coords = coord_dag,
  labels = labels,
  exposure = "park_extra_magic_morning",
  outcome = "wait_minutes_posted_avg"
)

query_conditional_independence(emm_wait_dag2) |>
  unnest(conditioned_on)
# A tibble: 5 × 4
  set   a                        b       conditioned_on
  <chr> <chr>                    <chr>   <chr>         
1 1     park_close               park_e… <NA>          
2 2     park_close               park_t… <NA>          
3 3     park_close               park_t… <NA>          
4 4     park_extra_magic_morning park_t… <NA>          
5 5     park_temperature_high    park_t… <NA>          

This alternative DAG introduces two new relationships that should be independent. In Figure 21.6, we see an additional association between ticket season and Extra Magic Morning.

test_conditional_independence(
  emm_wait_dag2,
  data = seven_dwarfs_train_2018 |>
    filter(wait_hour == 9) |>
    mutate(
      across(where(is.character), factor),
      park_close = as.numeric(park_close),
    ) |>
    as.data.frame(),
  type = "cis.loess",
  R = 200
) |>
  ggdag_conditional_independence()

Figure 21.6: A plot of the estimates and 95% confidence intervals of the correlations between the residuals resulting from a regression of variables in the DAG that should have no relationship. While two relationships appear null, park close time and ticket season seem to be correlated, suggesting we have misspecified the DAG. One source of this misspecification may be missing arrows between the variables.

So, is this DAG wrong? Based on our understanding of the problem, it seems likely that’s the case, but interpreting DAG-data consistency tests has a hiccup: different DAGs can have the same set of conditional independencies. In the case of our DAG, one other DAG can generate the same implied conditional independencies (Figure 21.7). These are called equivalent DAGs because their implications are the same.

ggdag_equivalent_dags(emm_wait_dag2)

Figure 21.7: Equivalent DAGs for the likely misspecified version of Figure 7.2. These two DAGs produce the same set of implied conditional independencies. The difference between them is only the direction of the arrow between historic high temperature and Extra Magic Hours.

Equivalent DAGs are generated by reversing arrows. The subset of DAGs with reversible arrows that generate the same implications is called an equivalence class. While technical, this connection can condense the visualization to a single DAG where the reversible edges are denoted by a straight line without arrows.

ggdag_equivalent_class(emm_wait_dag2, use_text = FALSE, use_labels = TRUE)

Figure 21.8: An alternative way of visualizing Figure 21.7 where all the equivalent DAGs are condensed to a single version where the reversible edges are denoted with edges without arrows.

So, what do we do with this information? Since many DAGs can produce the same set of conditional independencies, one strategy is to find all the adjustment sets that would be valid for every equivalent DAG. dagitty makes this straightforward by calling equivalenceClass() and adjustmentSets(), but in this case, there are no overlapping adjustment sets.

library(dagitty)
# determine valid sets for all equiv. DAGs
equivalenceClass(emm_wait_dag2) |>
  adjustmentSets(type = "all")

We can see that by looking at the individual equivalent DAGs.

dags <- equivalentDAGs(emm_wait_dag2)

# no overlapping sets
dags[[1]] |> adjustmentSets(type = "all")
{ park_temperature_high }
{ park_close, park_temperature_high }
{ park_temperature_high, park_ticket_season }
{ park_close, park_temperature_high,
  park_ticket_season }
dags[[2]] |> adjustmentSets(type = "all")
 {}
{ park_close }
{ park_ticket_season }
{ park_close, park_ticket_season }

The good news is that, in this case, one of the equivalent DAGs doesn’t make logical sense: the reversible edge is from historical weather to Extra Magic Morning, but that is impossible for both time-ordering reasons (historical temperature occurs in the past) and for logical ones (Disney may be powerful, but to our knowledge, they can’t yet control the weather). Even though we’re using more data in these types of checks, we need to consider the logical and time-ordered plausibility of possible scenarios.

21.1.4 Alternate DAGs

As we mentioned in Section 5.4.1, you should specify your DAG ahead of time with ample feedback from other experts. Let’s now take the opposite approach to the last example: what if we used the original DAG but received feedback after the analysis that we should add more variables? Consider the expanded DAG in Figure 21.9. We’ve added two new confounders: whether it’s a weekend or a holiday. This analysis differs from when we checked alternate adjustment sets in the same DAG; in that case, we checked the DAG’s logical consistency. In this case, we’re considering a different causal structure.

Figure 21.9: An expansion of Figure 7.2, which now includes two new variables on their own backdoor paths: whether or not it’s a holiday and/or a weekend.

We can calculate these features from park_date using the timeDate package.

library(timeDate)

holidays <- c(
  "USChristmasDay",
  "USColumbusDay",
  "USIndependenceDay",
  "USLaborDay",
  "USLincolnsBirthday",
  "USMemorialDay",
  "USMLKingsBirthday",
  "USNewYearsDay",
  "USPresidentsDay",
  "USThanksgivingDay",
  "USVeteransDay",
  "USWashingtonsBirthday"
) |>
  holiday(2018, Holiday = _) |>
  as.Date()

seven_dwarfs_with_days <- seven_dwarfs_train_2018 |>
  mutate(
    is_holiday = park_date %in% holidays,
    is_weekend = isWeekend(park_date)
  ) |>
  filter(wait_hour == 9)

Both Extra Magic Morning hours and posted wait times are associated with whether it’s a holiday or weekend.

Table 21.2: The descriptive associations between the two new variables, holiday and weekend, and the exposure and outcome. The average posted waiting time differs on both holidays and weekends, as do the occurrences of Extra Magic Hours. While we can’t determine a confounding relationship from descriptive statistics alone, this adds to the evidence that these are confounders.
Characteristic Weekend Holiday
FALSE, N = 2531 TRUE, N = 1011 FALSE, N = 3431 TRUE, N = 111
Posted Wait Time 67 (60, 77) 65 (56, 75) 66 (58, 77) 76 (63, 77)
Extra Magic Morning 55 (22%) 5 (5.0%) 59 (17%) 1 (9.1%)
1 Median (IQR); n (%)

When we refit the IPW estimator, we get 7.58 minutes, slightly bigger than we got without the two new confounders. Because it was a deviation from the analysis plan, you should likely report both effects. That said, this new DAG is probably more correct than the original one. From a decision point of view, though, the difference is slight in absolute terms (about a minute) and the effect in the same direction as the original estimate. In other words, the result is not terribly sensitive to this change regarding how we might act on the information.

One other point here: sometimes, people present the results of using increasingly complicated adjustment sets. This comes from the tradition of comparing complex models to parsimonious ones. That type of comparison is a sensitivity analysis in its own right, but it should be principled: rather than fitting simple models for simplicity’s sake, you should compare competing adjustment sets or conditions. For instance, you may feel like these two DAGs are equally plausible or want to examine if adding other variables better captures the baseline crowd flow at the Magic Kingdom.

21.2 Quantitative bias analyses

Thus far, we’ve probed some of the assumptions we’ve made about the causal structure of the question. We can take this further using quantitative bias analysis, which uses mathematical assumptions to see how results would change under different conditions.

21.2.1 Tipping point analyses

21.2.2 Other types of QBA