4 Conditioning: matching and weighting
In the conditioning step, we perform the matching or weighting on the data, which adjusts the sample in such a way that (ideally) balance is achieved, i.e., the distributions of the covariates are similar between the treated and untreated groups. There are many options one can choose from in the conditioning step, such as which matching or weighting method is used and how each method is to be customized. Remember that the initial conditioning step does not need to be perfect because it will be respecified in the respecification step (Chapter 6). In Chapter 5, we discuss how to evaluate the choices made in the conditioning step.
First, we will describe weighting and then matching. The focus here will be on the choices one can make with these methods and how to make them rather than on the technical detail. Before we describe matching and weighting, we introduce the propensity score (Rosenbaum and Rubin 1983), a constructed variable that is often used in matching or weighting.
4.1 The Propensity Score
The propensity score is a one-dimensional summary of the variables to be adjusted for, computed as the predicted probability of receiving treatment given the variables (i.e., covariates), often written as \(p_i = P(A=1|X_i)\), where \(X_i\) represents unit \(i\)’s covariates to be adjusted for and \(A\) represents treatment, where \(A=1\) indicates being treated. The simplest and most common way to compute propensity scores is to run a logistic regression with treatment membership as the outcome and the covariates as the predictors, and use the predicted probability of being treated as the propensity score for each unit, though more advanced methods have been developed that incorporate machine learning or optimization as well.
Rosenbaum and Rubin (1983) proved that adjusting for the (true) propensity score is equivalent to adjusting for the covariates used to compute the propensity score, which is what makes it such a powerful technique. In practice, though, the performance of the adjustment must be evaluated (Ho et al. 2007), a process we describe in detail later. Though the propensity score is a popular component of matching and weighting methods, it is by no means a necessary component, and many alternatives have been developed that avoid some of the problems propensity scores face.
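As a minimal sketch of the logistic regression approach described above, the following fits a one-covariate logistic model by Newton-Raphson using only the Python standard library and computes each unit's predicted probability of treatment. The function names `expit` and `fit_logistic`, the toy data, and the restriction to a single covariate are assumptions made for illustration; in practice one would use an established statistical package and include all covariates to be adjusted for.

```python
import math

def expit(z):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, a, iterations=25):
    """Fit P(A=1|x) = expit(b0 + b1*x) by Newton-Raphson; return (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Accumulate the score (gradient) and the information matrix.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ai in zip(x, a):
            p = expit(b0 + b1 * xi)
            v = p * (1.0 - p)
            g0 += ai - p
            g1 += (ai - p) * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^{-1} times the score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Toy data (hypothetical): one covariate and a binary treatment indicator.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
a = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(x, a)
# The propensity score is the predicted probability of treatment for each unit.
propensity = [expit(b0 + b1 * xi) for xi in x]
print([round(p, 3) for p in propensity])
```

Because the model includes an intercept, the fitted propensity scores sum to the number of treated units, which is a quick sanity check on convergence.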
4.2 Weighting
Weighting involves estimating a weight for each unit in the sample such that, in the weighted sample, the covariates are balanced. Effect estimation proceeds by computing the weighted outcome means in each treatment group and computing their contrast or fitting a weighted outcome regression model with the treatment as a predictor. Weighting can be used to target any estimand, but the formulas for how the weights are computed depend on the estimand desired. Balancing weights function similarly to sampling weights; while sampling weights shift a sample to be representative of a national population, balancing weights shift each treatment group to resemble each other and a target population.
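The first effect estimation strategy mentioned above, a contrast of weighted outcome means, can be sketched as follows. The function name `weighted_mean_difference` and the toy outcomes and weights are hypothetical; the sketch covers only the point estimate and says nothing about standard errors.

```python
def weighted_mean_difference(y, a, w):
    """Weighted outcome mean among the treated minus that among the untreated."""
    def weighted_mean(group):
        num = sum(wi * yi for yi, ai, wi in zip(y, a, w) if ai == group)
        den = sum(wi for ai, wi in zip(a, w) if ai == group)
        return num / den
    return weighted_mean(1) - weighted_mean(0)

# Toy data (hypothetical): outcomes, treatment indicators, and balancing weights.
y = [3.0, 5.0, 2.0, 4.0]
a = [1, 1, 0, 0]
w = [1.0, 3.0, 2.0, 2.0]

estimate = weighted_mean_difference(y, a, w)
unweighted = weighted_mean_difference(y, a, [1.0] * len(y))
print(estimate, unweighted)
```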
The most basic form of weighting is propensity score weighting for the ATE, also known as inverse probability of treatment weighting (IPTW). The weights have the following formula:
\[ w_{ATE, i} = \begin{cases} 1/p_i, & \text{if } A_i = 1, \\ 1/(1-p_i), & \text{if } A_i = 0 \end{cases} \]
where \(p_i\) is an individual’s propensity score and \(A_i\) is an individual’s treatment value, 1 if treated and 0 if untreated. They are known as “inverse probability” weights because each weight equals the inverse of the probability of receiving the treatment actually received: \(p_i=P(A=1|X_i)\) for treated units and \(1-p_i = P(A=0|X_i)\) for untreated units. IPTW weights shift the distributions of both the treated and untreated groups to resemble that of the full sample and therefore each other.
Propensity score weighting for the ATT is also known as standardized mortality ratio weighting, and the weights have the following formula:
\[ w_{ATT, i} = p_i \times w_{ATE, i} = \begin{cases} 1, & \text{if } A_i = 1, \\ p_i/(1-p_i), & \text{if } A_i = 0 \end{cases} \]
Weights for the ATT shift the distribution of the untreated units to resemble that of the treated units and leave the treated units untouched.
Propensity score weighting for the ATO is also known as overlap weighting, and the weights have the following formula:
\[ w_{ATO, i} = p_i(1-p_i) \times w_{ATE, i} = \begin{cases} 1-p_i, & \text{if } A_i = 1, \\ p_i, & \text{if } A_i = 0 \end{cases} \]
Overlap weights upweight units most like those in the other treatment group. Though there are other methods to compute weights that target an overlap sample (e.g., the “matching weights” of L. Li and Greene (2013)), overlap weights tend to outperform them and produce a weighted sample with the most precision of any propensity score weights.
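The three weighting formulas above translate directly into code. The following is a minimal sketch, assuming the propensity scores have already been estimated; the function name `ps_weights` and the toy values are hypothetical.

```python
def ps_weights(p, a, estimand="ATE"):
    """Propensity score weights for the ATE, ATT, or ATO.

    p: propensity scores, a: treatment indicators (1 = treated, 0 = untreated).
    """
    weights = []
    for pi, ai in zip(p, a):
        if estimand == "ATE":
            w = 1 / pi if ai == 1 else 1 / (1 - pi)
        elif estimand == "ATT":
            w = 1.0 if ai == 1 else pi / (1 - pi)
        elif estimand == "ATO":
            w = 1 - pi if ai == 1 else pi
        else:
            raise ValueError("estimand must be 'ATE', 'ATT', or 'ATO'")
        weights.append(w)
    return weights

# Toy propensity scores and treatment indicators (hypothetical).
p = [0.8, 0.5, 0.2, 0.5]
a = [1, 1, 0, 0]

w_ate = ps_weights(p, a, "ATE")
w_att = ps_weights(p, a, "ATT")
w_ato = ps_weights(p, a, "ATO")
print(w_ate, w_att, w_ato)
```

The ATT and ATO weights can be checked against the identities in the formulas: each ATT weight equals \(p_i\) times the ATE weight, and each ATO weight equals \(p_i(1-p_i)\) times the ATE weight.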
A choice that researchers must make when using propensity score weighting is how to estimate the propensity score. This choice affects the properties of the weights (i.e., the balance they induce and the precision in the weighted sample). The most common method is a logistic regression of the treatment on the covariates. Other popular methods involve machine learning methods like generalized boosted modeling (GBM) (McCaffrey, Ridgeway, and Morral 2004), Bayesian additive regression trees (BART) (Hill, Weiss, and Zhai 2011), and Super Learner (Alam, Moodie, and Stephens 2019), though any model that produces predicted class probabilities can be used to estimate propensity scores. Often, versions of these methods incorporate balance optimization into the estimation of the weights; for example, a popular implementation of GBM chooses the value of a tuning parameter as that which minimizes an imbalance statistic (McCaffrey, Ridgeway, and Morral 2004). Logistic regression has a particular benefit when using overlap weights for the ATO: the covariate means will be exactly balanced between the treatment groups.
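The exact mean balance property of overlap weights can be seen from the score equations of logistic regression. The following is a sketch of the standard argument, assuming the model is fit by maximum likelihood, includes an intercept, and includes the covariates \(X_i\) as linear terms; \(\hat{p}_i\) denotes the fitted propensity score. At the maximum likelihood estimate, the score equations are

\[ \sum_{i} (A_i - \hat{p}_i) = 0, \qquad \sum_{i} (A_i - \hat{p}_i) X_i = 0. \]

Because \(A_i - \hat{p}_i\) equals \(1-\hat{p}_i\) for treated units and \(-\hat{p}_i\) for untreated units, splitting each sum by treatment group gives

\[ \sum_{i: A_i = 1} (1-\hat{p}_i) = \sum_{i: A_i = 0} \hat{p}_i, \qquad \sum_{i: A_i = 1} (1-\hat{p}_i) X_i = \sum_{i: A_i = 0} \hat{p}_i X_i. \]

Dividing the second equality by the first yields

\[ \frac{\sum_{i: A_i = 1} w_{ATO, i} X_i}{\sum_{i: A_i = 1} w_{ATO, i}} = \frac{\sum_{i: A_i = 0} w_{ATO, i} X_i}{\sum_{i: A_i = 0} w_{ATO, i}}, \]

so the weighted covariate means are identical in the two groups. The argument applies only to terms that appear in the propensity score model; covariates or transformations left out of the model are not guaranteed to be balanced.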
Many modern methods skip the step of estimating a propensity score and estimate the weights directly. Examples of this approach include entropy balancing (Hainmueller 2012; Zhao and Percival 2017), stable balancing weights (Zubizarreta 2015), and energy balancing (Huling and Mak 2024). This distinction is explored in detail by Chattopadhyay, Hase, and Zubizarreta (2020). A popular weighting method, covariate balancing propensity score (CBPS) weighting, combines optimization and logistic regression-based propensity score estimation (Imai and Ratkovic 2014) (though this does not necessarily confer any benefits over methods that don’t estimate a propensity score (Y. Li and Li 2021)). These optimization-based methods often exactly or approximately balance features of the distribution of covariates while retaining precision in the weighted sample, making them highly effective.[1]
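To give a flavor of how weights can be estimated directly, the sketch below solves a deliberately simplified version of the entropy balancing problem for the ATT: with a single covariate, it finds weights for the untreated units of the form \(w_i \propto \exp(\lambda x_i)\) whose weighted covariate mean equals the treated group mean. This is an illustration under strong assumptions (one covariate, one mean constraint, uniform base weights, a bisection search), not the algorithm of Hainmueller (2012); the function name `entropy_balance_att` and the toy data are hypothetical.

```python
import math

def entropy_balance_att(x_control, target_mean, lo=-50.0, hi=50.0):
    """Weights w_i proportional to exp(lam * x_i) with weighted mean target_mean.

    Assumes target_mean lies strictly inside the range of x_control.
    """
    def tilted(lam):
        # Subtract the largest exponent for numerical stability.
        m = max(lam * xi for xi in x_control)
        e = [math.exp(lam * xi - m) for xi in x_control]
        total = sum(e)
        return [ei / total for ei in e]

    def tilted_mean(lam):
        return sum(wi * xi for wi, xi in zip(tilted(lam), x_control))

    # The tilted mean is increasing in lam, so bisection finds the root.
    for _ in range(200):
        mid = (lo + hi) / 2
        if tilted_mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return tilted((lo + hi) / 2)

# Toy data (hypothetical): covariate values in each group.
x_treated = [4.0, 5.0, 6.0, 5.0]
x_control = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

treated_mean = sum(x_treated) / len(x_treated)
w = entropy_balance_att(x_control, treated_mean)
weighted_control_mean = sum(wi * xi for wi, xi in zip(w, x_control))
print(treated_mean, round(weighted_control_mean, 6))
```

No propensity score is estimated at any point: the balance constraint is imposed on the weights themselves.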
Weighting methods sometimes yield “extreme” weights, i.e., a few weights that take on a much larger value than the others and dominate the analysis, which can make balance worse and reduce precision. This occurs especially with propensity score weighting for the ATE, as small propensity scores in the treated group and large propensity scores in the untreated group cause weights to be large when inverted. One approach to dealing with extreme weights is to trim the weights, which can involve either removing units with large weights or extreme propensity scores or setting the value of their weights to a smaller value (winsorizing). These methods often change the estimand from the ATE, in which case the ATO should be targeted using overlap weights instead.
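The effect of a single extreme weight, and of winsorizing it, can be illustrated with a short sketch. The effective sample size formula used here, \((\sum_i w_i)^2 / \sum_i w_i^2\), is a common approximate precision diagnostic that is brought in as an assumption rather than taken from the text above; the function names, the toy weights, and the cap of 4 are hypothetical.

```python
def winsorize_weights(w, cap):
    """Set any weight larger than cap to cap, leaving the others unchanged."""
    return [min(wi, cap) for wi in w]

def effective_sample_size(w):
    """Approximate effective sample size: (sum of w)^2 / (sum of w^2)."""
    return sum(w) ** 2 / sum(wi * wi for wi in w)

# Toy weights (hypothetical): one unit dominates the weighted sample.
w = [1.0, 1.0, 1.0, 1.0, 16.0]

ess_before = effective_sample_size(w)
w_capped = winsorize_weights(w, cap=4.0)
ess_after = effective_sample_size(w_capped)
print(round(ess_before, 3), round(ess_after, 3))
```

In this toy example the five units carry roughly the information of 1.5 units before winsorizing and 3.2 units afterward, at the cost of changing the weights and therefore potentially the estimand.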
4.3 Matching
Matching involves dropping or reorganizing units into strata such that the remaining sample is balanced (Greifer and Stuart 2021a). Examples include pair matching, pure subset selection (in which units are dropped from the sample but no pairing occurs), and subclassification, among others. The outputs of a matching method are a set of matching weights and, if pairing or stratification is done, pair or stratum membership for each unit. Matching weights function identically to propensity score weights as described above; indeed, matching can be seen as a restricted form of weighting, where that restriction can sometimes afford benefits in terms of robustness and precision. Stuart (2010) provides an excellent introduction to matching.
Below, we briefly describe the two major forms of matching, stratification and subset selection (including pair matching).
4.3.1 Stratification
Stratification simply involves assigning units into strata such that, within strata, the covariates are balanced between treatment groups. The most straightforward method of stratification is exact matching, in which sets of units with identical covariate values are placed into strata based on those values. Any units with no exact matches in the other treatment group are dropped. The remaining sample will be exactly balanced on all included covariates. In practice, continuous variables or categorical variables with many categories make exact matching impossible. One alternative is coarsened exact matching (CEM) (Iacus, King, and Porro 2012, 2011), which is simply exact matching on coarsened versions of the covariates. The degree of coarsening is controlled by the researcher to strike a balance between discarding units with no matches and ensuring the units within strata are relatively homogeneous in the covariates.
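Coarsened exact matching can be sketched in a few lines: coarsen each covariate into bins, group units that share all bins into a stratum, and drop strata that lack either treatment group. The sketch below assumes equal-width bins whose widths are chosen by the researcher; the function name `coarsened_exact_match`, the bin widths, and the toy data are hypothetical, and real implementations offer more flexible coarsening.

```python
from collections import defaultdict

def coarsened_exact_match(rows, a, bin_widths):
    """Return a stratum label per unit, or None if the unit is dropped.

    rows: one tuple of covariate values per unit.
    a: treatment indicators (1 = treated, 0 = untreated).
    bin_widths: width of the equal-width bins used to coarsen each covariate.
    """
    strata = defaultdict(list)
    for i, row in enumerate(rows):
        # Coarsen each covariate to the index of the bin it falls in.
        key = tuple(int(v // width) for v, width in zip(row, bin_widths))
        strata[key].append(i)
    labels = [None] * len(rows)
    for key, members in strata.items():
        # Keep only strata containing both treated and untreated units.
        if {a[i] for i in members} == {0, 1}:
            for i in members:
                labels[i] = key
    return labels

# Toy data (hypothetical): (age, income in thousands) for five units.
rows = [(23, 31.0), (27, 38.0), (41, 52.0), (44, 55.0), (62, 20.0)]
a = [1, 0, 1, 0, 1]

labels = coarsened_exact_match(rows, a, bin_widths=(10, 10))
print(labels)
```

Wider bins discard fewer units but allow units within a stratum to differ more on the original covariates, which is the trade-off described above.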
Another alternative is propensity score subclassification (Rosenbaum and Rubin 1984), in which units are placed into strata based on their propensity score values. The number of strata is decided by the researcher; although early literature recommends as few as 5 subclasses, it is always best to try larger numbers of subclasses to find the one that yields the highest quality matches, which can sometimes be in the hundreds depending on the sample size (Desai et al. 2017).
The result of stratification is a set of stratification weights. These are computed as follows: 1) compute the proportion of units in each stratum that are treated, 2) for each unit, assign to it this proportion in its stratum as a new stratum “propensity score”, and 3) apply the propensity score weighting formulas above to the new stratum propensity score using the formula that corresponds to the desired estimand. In this way, stratification is a form of propensity score weighting in which the weights are estimated using a multi-step procedure rather than directly from the propensity scores. This method is known as marginal mean weighting through stratification (MMWS) (Hong 2010) or fine stratification (Desai et al. 2017). This blog post explains this idea in more detail.
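The three steps above can be sketched as follows, assuming strata are formed by splitting the units into groups of roughly equal size after sorting on the propensity score. The function name `stratification_weights`, the rule for forming strata, the choice to give zero weight to strata containing only one treatment group, and the toy data are all assumptions made for illustration.

```python
def stratification_weights(p, a, n_strata, estimand="ATE"):
    """Stratification weights computed from propensity scores p and treatment a."""
    n = len(p)
    # Form strata of roughly equal size, ordered by propensity score.
    order = sorted(range(n), key=lambda i: p[i])
    stratum = [0] * n
    for rank, i in enumerate(order):
        stratum[i] = rank * n_strata // n
    # Step 1: count units and treated units in each stratum.
    size = [0] * n_strata
    treated = [0] * n_strata
    for i in range(n):
        size[stratum[i]] += 1
        treated[stratum[i]] += a[i]
    weights = []
    for i in range(n):
        # Step 2: the stratum's treated proportion is the new "propensity score".
        s = treated[stratum[i]] / size[stratum[i]]
        # Step 3: apply the weighting formula for the desired estimand.
        if s == 0.0 or s == 1.0:
            w = 0.0  # stratum has only one treatment group: drop its units
        elif estimand == "ATE":
            w = 1 / s if a[i] == 1 else 1 / (1 - s)
        elif estimand == "ATT":
            w = 1.0 if a[i] == 1 else s / (1 - s)
        else:
            raise ValueError("estimand must be 'ATE' or 'ATT'")
        weights.append(w)
    return weights

# Toy data (hypothetical): estimated propensity scores and treatment indicators.
p = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
a = [0, 0, 1, 0, 1, 0, 1, 1]

w_ate = stratification_weights(p, a, n_strata=2, estimand="ATE")
w_att = stratification_weights(p, a, n_strata=2, estimand="ATT")
print([round(w, 3) for w in w_ate])
print([round(w, 3) for w in w_att])
```

With two strata, the lower stratum has one treated unit out of four, so its stratum propensity score is 0.25; under the ATT formula, its three untreated units each receive weight 1/3 and together stand in for the single treated unit.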
4.3.2 Subset selection and pair matching
Subset selection involves taking a subset of units from the original sample and dropping the rest, ideally in such a way that the remaining sample is balanced on the covariates. The most common method of subset selection is pair matching, in which treated units are paired with untreated units, and any unpaired units are dropped. There are a number of ways to customize pair matching to improve the balance and precision of the resulting sample:
The distance measure used to compute the closeness between units. The most commonly used measure is the propensity score difference between each treated and untreated unit. Other distances include the Mahalanobis distance (Rubin 1980) and its robust variant (Rosenbaum 2010) and the scaled Euclidean distance, which are computed from the covariates directly and do not require a propensity score, though a propensity score can be added as an additional covariate in computing them. A powerful optimization-based matching method called genetic matching adjusts elements of the distance measure used in order to optimize the balance of the resulting sample (Diamond and Sekhon 2013). The best distance measure to use will depend on the unique features of the dataset (Ripollone et al. 2018), so several should be tried, though genetic matching automates this process. Because propensity scores involve an extreme coarsening of the covariates into a one-dimensional score, matching on them can do more harm than good: units closely matched on the propensity score may not be close to each other on the covariates of interest (King and Nielsen 2019).
The number of matches each unit receives. Each treated unit can receive one or more control units as a match, and this number can be chosen by the researcher. For example, one can request 2:1 matching instead of 1:1 matching, which increases the size of the resulting sample but may worsen balance because worse matches are being included (Rassen et al. 2012). The number of matches each unit receives can be fixed across all units or varied (Ming and Rosenbaum 2000).
Whether matching is done with or without replacement. One can choose whether untreated units can be reused as matches for multiple treated units. Matching with replacement often yields better balance and eliminates the effect of who gets matched first on the resulting sample. However, it can yield imprecision in the effect estimate when the same untreated unit is matched many times; the number of times each untreated unit can be matched can be limited by the researcher. Inference after matching with replacement can be more challenging than when matching without replacement.[2] Matching without replacement is only feasible when the control pool is much larger than the pool of treated units and all treated units have propensity scores below 0.5 (F. Sävje 2022).
The order of matches. When matching without replacement, the order in which treated units are matched matters. There are a variety of ways one can specify this order with varying evidence supporting each choice (Rubin 1973; Austin 2014), so it is best to try various orders. An alternative is to use optimal pair matching (Hansen and Klopfer 2006; Gu and Rosenbaum 1993), which optimizes a global distance criterion but may be infeasible or slow with large datasets.
Calipers and exact matching constraints. A caliper is a limit on how far apart two units can be before they are disallowed from being matched. Using calipers can improve balance but decrease precision because additional units are discarded (i.e., those without any matches within the caliper) (Austin 2014). It is very common to place a caliper on the propensity score, though it is also possible to place calipers on covariates directly. Though there has been some research into optimal caliper widths (Austin 2011), the best caliper will depend on the unique features of the dataset, and so many should be tried and evaluated. An exact matching constraint requires that two units have identical values of the given covariate in order to be allowed to be matched. When calipers or exact matching constraints are used, balance often improves (sometimes dramatically), but discarding treated units changes the estimand to the ATO, which may not be desired (Greifer and Stuart 2021b; Rosenbaum and Rubin 1985). Applying calipers can also make balance worse if good balance has already been achieved without them (King and Nielsen 2019).
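Several of the choices above (distance measure, replacement, matching order, and caliper) can be seen together in a minimal sketch of greedy 1:1 nearest-neighbor matching. The sketch assumes the distance is the absolute propensity score difference, matching is without replacement, treated units are matched in descending order of propensity score, and the caliper is expressed on the raw propensity score scale; each of these is just one of the options discussed, and the function name `greedy_pair_match` and the toy data are hypothetical.

```python
def greedy_pair_match(p, a, caliper=None):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    Returns a list of (treated index, untreated index) pairs.
    """
    treated = [i for i, ai in enumerate(a) if ai == 1]
    available = {i for i, ai in enumerate(a) if ai == 0}
    # Matching order: treated units with the largest propensity scores first.
    treated.sort(key=lambda i: p[i], reverse=True)
    pairs = []
    for t in treated:
        if not available:
            break
        # Closest remaining untreated unit (ties broken by index).
        c = min(available, key=lambda j: (abs(p[t] - p[j]), j))
        if caliper is not None and abs(p[t] - p[c]) > caliper:
            continue  # no acceptable match: this treated unit is discarded
        pairs.append((t, c))
        available.remove(c)  # without replacement: each control is used once
    return pairs

# Toy data (hypothetical): propensity scores and treatment indicators.
p = [0.9, 0.6, 0.3, 0.55, 0.35, 0.1, 0.2]
a = [1, 1, 1, 0, 0, 0, 0]

pairs_no_caliper = greedy_pair_match(p, a)
pairs_caliper = greedy_pair_match(p, a, caliper=0.2)
print(pairs_no_caliper)
print(pairs_caliper)
```

Without a caliper, every treated unit is matched, but the first treated unit is paired with a control 0.35 away on the propensity score. With a caliper of 0.2, that treated unit is discarded, which frees its former match for another treated unit and yields closer pairs at the cost of a smaller sample and a changed estimand.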
There are also methods of subset selection without pairing, such as cardinality and profile matching (Zubizarreta, Paredes, and Rosenbaum 2014; Cohn and Zubizarreta 2022). These use optimization to find the largest matched sample that satisfies balance constraints set by the user and are starting to see broader use in medical research (e.g., Niknam and Zubizarreta 2022; Fortin and Schuemie 2022).
4.3.3 Full matching
Full matching is an effective matching method that is somewhat of a cross between pair matching and stratification (Stuart and Green 2008; Hansen and Klopfer 2006). Every unit in the sample is assigned to a stratum as with stratification methods, but the strata are formed based on the pairwise distances between units. Full matching tends to outperform other matching methods and can be customized in many of the same ways (Austin and Stuart 2015, 2017b). Variations of full matching often run much faster than other matching algorithms (Fredrik Sävje, Higgins, and Sekhon 2021). Unlike other pair matching methods, full matching can be used to estimate the ATT, ATC, or ATE. Full matching can also be seen as an alternative to IPTW that can be more robust to misspecification (Austin and Stuart 2017a).
4.4 Choosing a specification
We explain how to choose among the variety of weighting and matching methods in Chapter 5. In short, the choice should depend on covariate balance, precision, and respect for the desired estimand. One does not need to choose a method and commit to it; one can instead try many, assess their quality, and move forward with effect estimation using only the best among those compared. However, this search can be shortened by using methods known to perform exceptionally well or that can be specified to respond to a researcher’s precise requirements.
Some methods are more popular in certain fields; for example, medical research often uses matching, whereas epidemiological research often uses weighting. In some cases, the popularity of methods in certain fields reflects real substantive demands, but in many cases it simply reflects trends and cultures or the specific methods emphasized in training materials for students. For example, epidemiological training emphasizes the use of weighting over matching (Hernán and Robins 2020), even though they often serve the same purpose and perform equally well (Greifer and Stuart 2021a; Kush et al. 2022).
One key aspect to remember is that the basic or default method is almost never the best method and should not be used just because it is the most familiar or popular in a field. For example, 1:1 pair matching on the propensity score is a popular matching method in medical research even though its problems have been well documented (F. Sävje 2022; King and Nielsen 2019), it is uniformly outperformed by genetic matching (Diamond and Sekhon 2013), and it very often performs worse than methods that are no more difficult to use, such as full matching (Austin and Stuart 2015) and cardinality matching (Visconti and Zubizarreta 2018; Angeles Resa and Zubizarreta 2016). Similarly, propensity score weighting often performs worse than modern optimization-based methods like entropy balancing (Hainmueller 2012).
[1] A common theme in many areas of statistics is that newer methods perform better than older methods in general. However, those newer methods are often less well studied and are opaque to, or unheard of by, applied researchers. Their absence from the applied literature and the popularity of older, more basic methods are not an indication that the basic methods should be preferred; rather, they often reflect fear or ignorance of newer, better-performing methods among researchers and reviewers.
[2] Although much early research into statistical inference for matching was done for matching with replacement, the inference methods required highly specific uses of matching and complicated estimators of standard errors (Abadie and Imbens 2006, 2016). The most applicable methods of inference for matching with replacement rely on simulation evidence and are only approximations (Hill and Reiter 2006).