7 Respecification

If a given specification is inadequate because balance is too poor, the effective sample size (ESS) is too small, or the sample is no longer representative of the target population, one must respecify. Respecification can involve changing some aspect of the conditioning strategy, such as a parameter of the matching or weighting method or the model used to estimate propensity scores, if any. Because there are so many parameters that can be changed, and so many ways to change them, it is impossible to give a complete account of the best way to respecify. One should try many specifications, examining patterns in how each change affects the quality of the resulting sample. As long as the outcome is not involved in this process, doing so will not invalidate inferences made at the end.

Some tricks can nudge the respecification process in the right direction. Below are some common issues and potential solutions.

Poor balance as measured by SMDs: consider using an optimization-based method, like entropy balancing (Hainmueller 2012) or cardinality matching (Zubizarreta, Paredes, and Rosenbaum 2014), or using a method that changes the estimand, like caliper matching or overlap weighting.
Poor balance beyond SMDs (e.g., on polynomial terms, variance ratios, or KS statistics): consider adding polynomial or interaction terms to the propensity score model; using a machine-learning method that flexibly models the propensity score (Lee, Lessler, and Stuart 2010); using an optimization-based method that balances the full distribution, like energy balancing (Huling and Mak 2024); using coarsened exact matching to balance the full distribution approximately (Iacus, King, and Porro 2012); or adding an exact matching constraint to a matching specification.
Low ESS: consider using a method to regularize the propensity score model (e.g., ridge or lasso regression); increasing the matching ratio; using an optimization-based method that maximizes the ESS (e.g., profile matching or stable balancing weights [Zubizarreta 2015]); relaxing the caliper (if used); trimming extreme weights; or using overlap weighting.
Poor representativeness: consider using a method that strongly respects the estimand (e.g., entropy balancing; not cardinality matching, caliper matching, or overlap weighting) or removing a caliper or exact matching restriction.
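To make the "Low ESS" item concrete, the following minimal sketch computes the ESS with the usual Kish formula, (sum of weights)^2 / (sum of squared weights), and shows how capping one extreme weight raises it. The weights, the cap of 3.0, and the helper names `ess` and `cap_weights` are illustrative assumptions, not output from any particular package.

```python
def ess(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def cap_weights(weights, cap):
    """Hypothetical helper: replace any weight above `cap` with `cap`."""
    return [min(w, cap) for w in weights]


# Toy weights for one treatment group; the last unit has an extreme weight.
weights = [1.0, 1.2, 0.8, 1.1, 0.9, 12.0]
capped = cap_weights(weights, cap=3.0)

print("ESS before capping:", round(ess(weights), 2))
print("ESS after capping: ", round(ess(capped), 2))
```

Capping changes the weights, so balance and representativeness should be reassessed afterward; a higher ESS alone does not make the specification adequate.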

Having broad experience with the variety of matching and weighting methods available makes this process quick. Fortunately, the software we recommend and use in the examples, the R packages MatchIt and WeightIt, makes switching between various specifications easy.

To avoid endless respecification, it is a good idea to use methods designed to optimize the evaluation criteria in a simple way. Often, the oldest and most commonly used methods are the worst in that they perform poorly and require manual respecification to get right. For example, 1:1 propensity score matching with a caliper is the most commonly used propensity score method in medical research, but it is widely known to have many problems: it hampers representativeness because the caliper discards units from both treatment groups (Rosenbaum and Rubin 1985), it reduces the effective sample size by dropping many units from the sample, it can make balance worse when used thoughtlessly (King and Nielsen 2019), and it has many specification parameters that need to be adjusted arbitrarily (e.g., the propensity score model, caliper width, and matching order). Another popular but old method, propensity score weighting, also has many problems, including inability to achieve balance, low ESS due to extreme weights, and reduced representativeness when measures are taken to rectify the other problems.
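The way a caliper discards units and shifts the sample away from the target population can be seen in a small sketch. It assumes made-up propensity scores, a caliper of 0.05 on the raw propensity score scale, and greedy matching in input order without replacement. The function `caliper_match` is a hypothetical illustration, not the algorithm of any specific package.

```python
def caliper_match(ps_treated, ps_control, caliper):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    Treated units are processed in input order (the "matching order").
    A treated unit whose nearest available control is farther away than
    `caliper` is left unmatched and dropped.
    Returns a list of (treated_index, control_index) pairs.
    """
    available = set(range(len(ps_control)))
    pairs = []
    for i, p in enumerate(ps_treated):
        if not available:
            break
        # sorted() makes the choice deterministic if two distances tie
        j = min(sorted(available), key=lambda k: abs(ps_control[k] - p))
        if abs(ps_control[j] - p) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs


# Toy propensity scores (assumed for illustration only).
ps_treated = [0.30, 0.45, 0.60, 0.85, 0.95]
ps_control = [0.10, 0.20, 0.28, 0.44, 0.50, 0.62, 0.15]

pairs = caliper_match(ps_treated, ps_control, caliper=0.05)
matched_treated = [ps_treated[i] for i, _ in pairs]

mean_all = sum(ps_treated) / len(ps_treated)
mean_matched = sum(matched_treated) / len(matched_treated)

print("matched pairs:", pairs)
print("treated units dropped:", len(ps_treated) - len(pairs))
print("control units dropped:", len(ps_control) - len(pairs))
print("mean score, all treated vs. matched treated:", mean_all, mean_matched)
```

In this toy data, the treated units with the highest scores have no control within the caliper, so the matched treated group no longer resembles the full treated group. This is the representativeness problem described above, alongside the loss of sample size.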

Methods that consistently perform well include entropy balancing (Hainmueller 2012) and energy balancing (Huling and Mak 2024), as these ensure balance and representativeness without requiring major respecification. Entropy balancing guarantees exact balance as measured by the SMD, but it may be necessary to include other terms to fully balance the covariate distributions. Energy balancing balances the full covariate distribution but can decrease ESS (though the trade-off between balance and ESS can be managed with a single parameter). Though these methods are newer, they are beginning to see use in applied research (e.g., Bramante et al. 2022; Sharma et al. 2023) and should be the first line of defense when adjusting for confounders, in preference to older, more familiar, but poorly performing methods.
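As a rough illustration of why entropy balancing yields exact mean balance, the sketch below handles a single covariate with the ATT as the estimand. It relies on the fact that, with uniform base weights, the entropy-balancing solution takes the exponential-tilting form w_i ∝ exp(λ x_i), and it finds λ by bisection. The data, the function names, and the bisection solver are assumptions made for illustration; real implementations handle many covariates and use more efficient optimizers.

```python
import math


def tilted_weights(x, lam):
    """Normalized weights proportional to exp(lam * x_i)."""
    shift = max(lam * xi for xi in x)  # subtract the max for numerical stability
    raw = [math.exp(lam * xi - shift) for xi in x]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_mean(x, w):
    return sum(xi * wi for xi, wi in zip(x, w)) / sum(w)


def entropy_balance_1d(x_control, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Bisection on lam: the tilted mean is increasing in lam, so a unique
    solution exists when target_mean lies strictly inside the control range."""
    for _ in range(iters):
        lam = (lo + hi) / 2
        if weighted_mean(x_control, tilted_weights(x_control, lam)) < target_mean:
            lo = lam
        else:
            hi = lam
    return tilted_weights(x_control, (lo + hi) / 2)


def ess(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Toy covariate values (assumed for illustration only).
x_treated = [3.0, 4.0, 4.5, 5.0, 3.5]
x_control = [1.0, 2.0, 3.0, 4.0, 5.0]

target = sum(x_treated) / len(x_treated)
w = entropy_balance_1d(x_control, target)

print("treated mean:         ", target)
print("weighted control mean:", weighted_mean(x_control, w))
print("control ESS:          ", ess(w))
```

Only the mean of the covariate is constrained here, so other features of its distribution (e.g., its variance) may remain imbalanced unless additional terms such as its square are also constrained.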

Bramante, Carolyn T., Steven G. Johnson, Victor Garcia, Michael D. Evans, Jeremy Harper, Kenneth J. Wilkins, Jared D. Huling, et al. 2022. “Diabetes Medications and Associations with Covid-19 Outcomes in the N3C Database: A National Retrospective Cohort Study.” Edited by Surasak Saokaew. PLOS ONE 17 (11): e0271574. https://doi.org/10.1371/journal.pone.0271574.

Hainmueller, Jens. 2012. “Entropy Balancing for Causal Effects: A Multivariate Reweighting Method to Produce Balanced Samples in Observational Studies.” Political Analysis 20 (1): 25–46. https://doi.org/10.1093/pan/mpr025.

Huling, Jared D., and Simon Mak. 2024. “Energy Balancing of Covariate Distributions.” Journal of Causal Inference 12 (1). https://doi.org/10.1515/jci-2022-0029.

Iacus, Stefano M., Gary King, and Giuseppe Porro. 2012. “Causal Inference Without Balance Checking: Coarsened Exact Matching.” Political Analysis 20 (1): 1–24. https://doi.org/10.1093/pan/mpr013.

King, Gary, and Richard Nielsen. 2019. “Why Propensity Scores Should Not Be Used for Matching.” Political Analysis 27 (4): 435–54. https://doi.org/10.1017/pan.2019.11.

Lee, Brian K., Justin Lessler, and Elizabeth A. Stuart. 2010. “Improving Propensity Score Weighting Using Machine Learning.” Statistics in Medicine 29 (3): 337–46. https://doi.org/10.1002/sim.3782.

Rosenbaum, Paul R., and Donald B. Rubin. 1985. “The Bias Due to Incomplete Matching.” Biometrics 41 (1): 103–16. https://doi.org/10.2307/2530647.

Sharma, Mayur, Truong H. Do, Elise F. Palzer, Jared D. Huling, and Clark C. Chen. 2023. “Comparable Safety Profile Between Neuro-Oncology Procedures Involving Stereotactic Needle Biopsy (SNB) Followed by Laser Interstitial Thermal Therapy (LITT) and LITT Alone Procedures.” Journal of Neuro-Oncology 162 (1): 147–56. https://doi.org/10.1007/s11060-023-04275-w.

Zubizarreta, José R. 2015. “Stable Weights That Balance Covariates for Estimation with Incomplete Outcome Data.” Journal of the American Statistical Association 110 (511): 910–22. https://doi.org/10.1080/01621459.2015.1023805.

Zubizarreta, José R., Ricardo D. Paredes, and Paul R. Rosenbaum. 2014. “Matching for Balance, Pairing for Heterogeneity in an Observational Study of the Effectiveness of for-Profit and Not-for-Profit High Schools in Chile.” The Annals of Applied Statistics 8 (1): 204–31. https://doi.org/10.1214/13-AOAS713.