9 Advanced topics
The previous sections describe propensity score analysis in the simplest case: a binary treatment administered at a single time point. In reality, research is often more complicated, and more complicated research questions require adjustments to this simple case. Below, we briefly outline some of these more complicated scenarios, including performing subgroup analysis, dealing with multi-category and continuous treatments, dealing with longitudinal/sequential treatments, and dealing with missing data. In practice, one should consult with a biostatistician specially trained in these methods rather than attempt them oneself.
9.1 Subgroup analysis
Subgroup analysis is required to understand how treatments affect different types of patients and to provide reasoned recommendations when information about individual patients is available (in contrast to the broad policy-based recommendations implied by the usual estimands). Subgroup analysis can be done simply by performing separate analyses within each subgroup (Green and Stuart 2014), though in some cases it can be beneficial to share information (e.g., estimation of the propensity score or outcome model) across subgroups (Dong et al. 2020).
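As an illustration, below is a minimal sketch of the separate-analyses approach using WeightIt; the dataset `d` and the variable names (`treat`, `y`, `age`, `comorbidity`, `sex`) are hypothetical:

```r
library(WeightIt)

# Hypothetical dataset `d` with binary treatment `treat`, outcome `y`,
# covariates `age` and `comorbidity`, and subgroup variable `sex`

# Separate weighting analysis within each subgroup
est_by_subgroup <- lapply(split(d, d$sex), function(sub) {
  w <- weightit(treat ~ age + comorbidity, data = sub,
                method = "glm", estimand = "ATE")
  # In practice, use a robust variance estimator for inference
  fit <- glm(y ~ treat, data = sub, weights = w$weights)
  coef(fit)["treat"]
})
```

`weightit()` also offers a `by` argument that estimates the propensity score separately within levels of a subgrouping variable in a single call, which accomplishes the same separation.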
It is also important to remember that performing subgroup analysis does not allow one to make a causal claim about the effect of subgroup membership on the treatment effect unless additional work is done to remove confounding from subgroup membership. For example, one may be interested in a subgroup analysis stratified by hospital. It may be that the treatment effect in one hospital differs from that in another, but that does not mean which hospital one goes to causes differences in the treatment effect (e.g., because of different quality of care); it may simply be that one hospital caters to patients for whom treatment is less effective (e.g., because of systemic issues that cause people both to suffer from comorbidities that change the treatment effect and to live closer to one hospital than another). This distinction between a scenario in which subgroup membership causes treatment effect heterogeneity and one in which subgroup membership is merely associated with treatment effect heterogeneity is described in detail by VanderWeele (2009).
9.2 Multi-category and continuous treatments
Treatments do not have to be binary to be used with propensity score analysis. Methods also exist for multi-category and continuous treatments. An example of a multi-category treatment might be drug type in a study comparing two drugs to each other and to control. Strasser et al. (2022) treated virus variant as a multi-category exposure when examining the effect of COVID subvariant (Delta, Omicron, and Omicron BA.2) on patient health outcomes. An example of a continuous treatment might be the effect of pollutant exposure on mortality, as examined by Wu et al. (2022).
9.2.1 Multi-category treatments
Estimating effects for multi-category treatments involves adjusting the sample so that the distributions of covariates in all categories resemble each other and some target population corresponding to the estimand of interest. This can be done using matching (Lopez and Gutman 2017) or weighting (McCaffrey et al. 2013). Instead of a single-valued propensity score, each unit has a vector-valued “generalized” propensity score corresponding to the probability of receiving each level of treatment (Imbens 2000). For example, for a three-level treatment, an individual unit may have a generalized propensity score of \([.1, .4, .5]\). McCaffrey et al. (2013) and Li and Li (2019) describe how to use these generalized propensity scores to compute weights. Currently, weighting methods are better developed and easier to use than matching methods for multi-category treatments and are available in WeightIt.
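As a minimal sketch of multi-category weighting with WeightIt and cobalt, assuming a hypothetical dataset `d` with a three-level treatment `drug` and illustrative covariates `age`, `sex`, and `severity`:

```r
library(WeightIt)
library(cobalt)

# Hypothetical dataset `d` with a three-level treatment `drug`
# (e.g., "control", "drug_a", "drug_b") and covariates

# Multinomial regression produces the vector-valued generalized
# propensity score, from which ATE weights are computed
w_multi <- weightit(drug ~ age + sex + severity, data = d,
                    method = "glm", estimand = "ATE")

# Check covariate balance across all treatment groups
bal.tab(w_multi, un = TRUE)
```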
9.2.2 Continuous treatments
The usual estimand for a continuous treatment is the average dose-response function (ADRF), which links the expected potential outcome (i.e., the average outcome if everyone was assigned to a single treatment value) to the corresponding treatment level. Propensity score analysis for continuous treatments involves adjusting the sample so that the treatment is independent of the covariates. This can be done using matching (Wu et al. 2022) or weighting (Robins, Hernán, and Brumback 2000; Zhu, Coffman, and Ghosh 2015; Huling, Greifer, and Chen 2023) (available in WeightIt). The propensity score is instead represented as a single-valued generalized propensity score corresponding to the conditional density of the treatment given the covariates (i.e., rather than the probability, which would be 0 for all values of a truly continuous treatment) (Hirano and Imbens 2005). Balance is often assessed using the correlations between the treatment and each covariate in the adjusted sample (Austin 2019), though more holistic measures such as the distance covariance have also been developed (Huling, Greifer, and Chen 2023). To estimate the ADRF, one can fit a flexible model for the outcome given the treatment in the weighted sample.
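Below is a minimal sketch of this workflow using WeightIt and cobalt, assuming a hypothetical dataset `d` with continuous exposure `pollutant`, outcome `mortality`, and illustrative covariates:

```r
library(WeightIt)
library(cobalt)
library(splines)

# Hypothetical dataset `d` with continuous exposure `pollutant`,
# outcome `mortality`, and covariates

# method = "glm" models the conditional density of the exposure;
# stabilized weights are advisable for continuous treatments
w_cont <- weightit(pollutant ~ age + income + urbanicity, data = d,
                   method = "glm", stabilize = TRUE)

# Balance: treatment-covariate correlations in the weighted sample
bal.tab(w_cont, un = TRUE)

# Flexible weighted outcome model for the ADRF, here a natural
# cubic spline of the exposure
fit <- glm(mortality ~ ns(pollutant, df = 4), data = d,
           weights = w_cont$weights)
```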
9.3 Longitudinal/sequential treatments
Methods have been developed for estimating the effect of a treatment that can occur at multiple time points. For example, Robins, Hernán, and Brumback (2000) described methods for estimating the effect of zidovudine (AZT) treatment on mortality in HIV-infected patients, where treatment was defined each day since the start of follow-up as the dose of AZT received that day. These special methods must be used when confounding is time-varying, i.e., when confounders of subsequent treatment and the outcome are themselves affected by previous treatments. Simply adjusting for these time-varying confounders by regression adjustment or standard propensity score analysis causes the same problems that adjusting for any post-treatment variable does.
The methods used for adjusting for time-varying confounding are called “g-methods” and are described in Hernán and Robins (2020). The simplest one is inverse probability weighting of marginal structural models, which essentially involves creating a propensity score weight at each time point and multiplying them together (available in WeightIt); ideally, this yields a scenario analogous to one in which treatment is randomized at each time point. Thoemmes and Ong (2016) and Robins, Hernán, and Brumback (2000) provide clear examples of the method.
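A minimal sketch of this approach using WeightIt's `weightitMSM()` is below, assuming hypothetical wide-format data `d` with treatments `tx1` and `tx2` at two time points, a baseline covariate `x0`, a time-varying covariate `x1` measured between the two treatments, and an end-of-study outcome `y`:

```r
library(WeightIt)

# Hypothetical wide-format data `d`: treatments `tx1`, `tx2` at two
# time points, baseline covariate `x0`, time-varying covariate `x1`
# (measured after tx1, before tx2), and end-of-study outcome `y`

# One propensity score model per time point; the final weight for
# each unit is the product of its weights across time points
w_msm <- weightitMSM(list(tx1 ~ x0,
                          tx2 ~ x0 + x1 + tx1),
                     data = d, method = "glm", stabilize = TRUE)

# Marginal structural model for the joint effect of treatment history
# (robust standard errors should be used for inference)
fit <- glm(y ~ tx1 + tx2, data = d, weights = w_msm$weights)
```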
9.4 Missing data
Missing data is often present in the analysis of real datasets. There are a variety of reasons why data could be missing: an administrative error, loss to follow-up, or participant refusal to provide information are some examples. Handling missing data correctly is a serious topic that requires expertise, though there are mainstream methods that are commonly used, have been shown to be compatible with propensity score analysis, and can yield accurate results if certain assumptions about why the data are missing are met (Cham and West 2016). The most common methods for dealing with missing data in propensity score analysis are multiple imputation (Rubin 2004) and censoring weights (Hernán and Robins 2020, Ch 12.6).
Imputation involves making a guess about the true value of each missing value. This guess often comes from a predictive model that describes the relationships among the variables in the data. Instead of making a single guess, multiple imputation involves making many guesses, each stored in a separate version of the dataset with the guesses filled in. The analysis occurs in each imputed dataset, and then the results are pooled across datasets to arrive at a final single estimate. Although there have been doubts about the best way to perform propensity score analysis with multiply imputed data, simulations frequently verify that the standard approach described above yields the most accurate results (Leyrat et al. 2019). The MatchThem package provides some utilities for matching and weighting with multiply imputed data (Pishgar et al. 2021), and cobalt supports assessing balance across imputations (Greifer 2020).
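A minimal sketch of this workflow with mice, MatchThem, and cobalt is below; the dataset `d` and variable names are hypothetical, and exact arguments may differ across package versions:

```r
library(mice)
library(survey)
library(MatchThem)
library(cobalt)

# Hypothetical dataset `d` with missing covariate values

# Step 1: generate multiply imputed datasets
imp <- mice(d, m = 10, printFlag = FALSE)

# Step 2: estimate weights within each imputed dataset
w_imp <- weightthem(treat ~ age + comorbidity, datasets = imp,
                    approach = "within", method = "glm",
                    estimand = "ATE")

# Step 3: assess balance across the imputations
bal.tab(w_imp)

# Step 4: fit the outcome model in each dataset and pool the results
fits <- with(w_imp, svyglm(y ~ treat, family = gaussian()))
summary(pool(fits))
```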
Censoring weights are an alternative to imputation that are more commonly used when a single variable, e.g., the outcome, is missing for some units. Censoring weights discard any units with missing data and weight the remaining units to resemble the full sample (i.e., the original sample that included those with missing data) (Hernán and Robins 2020, Ch 12.6). Censoring weights are multiplied by propensity score weights when both are used to create a final set of weights that adjust for both confounding and censoring. Censoring weights are especially common with longitudinal treatments and in survival analysis.
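A minimal sketch of censoring weights in base R is below, assuming hypothetical data `d` with an observation indicator `observed`, treatment `treat`, outcome `y`, illustrative covariates, and previously computed propensity score weights `ps_w`:

```r
# Hypothetical data `d`: observation indicator `observed` (1 = outcome
# measured, 0 = censored), treatment `treat`, outcome `y`, covariates,
# and previously computed propensity score weights `ps_w`

# Model the probability of remaining uncensored given covariates
cens_fit <- glm(observed ~ age + comorbidity + treat, data = d,
                family = binomial)
p_obs <- predict(cens_fit, type = "response")

# Inverse-probability-of-censoring weights for observed units
d$cens_w <- ifelse(d$observed == 1, 1 / p_obs, 0)

# Multiply by the propensity score weights to adjust for both
# confounding and censoring
d$final_w <- d$cens_w * d$ps_w

# Outcome analysis among units with observed outcomes
fit <- glm(y ~ treat, data = subset(d, observed == 1),
           weights = final_w)
```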