11  Example Data

Below, we’ll demonstrate how to perform matching and weighting in R. We’ll use the famous right-heart catheterization (RHC) dataset analyzed in Connors et al. (1996), which examines the effect of RHC on death by 60 days. This dataset can be downloaded here or loaded using Hmisc::getHdata("rhc")1. Connors et al. (1996) used 1:1 matching with a caliper to estimate the effect, which corresponds to an ATO (though they provided no justification for this choice of estimand). It turns out this choice matters quite a bit; the ATT, ATC, and ATE differ from each other and lead to different conclusions about the risk of RHC.
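To make the loading step concrete, here is a minimal sketch (the first line assumes the Hmisc package is installed; the second uses the modified version described in the footnote):

# Original dataset from the Hmisc data repository (requires Hmisc)
Hmisc::getHdata("rhc")

# Or: the slightly modified version used in this chapter (see footnote 1)
rhc <- readRDS(url("https://github.com/IQSS/dss-ps/raw/refs/heads/main/rhc.rds"))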

The choice of estimand depends on the policy implied by the analysis. Are we interested in examining whether RHC is harmful and should be withheld from patients receiving it? If so, we are interested in the ATT of RHC. Are we interested in examining whether RHC would benefit patients not receiving it? If so, we are interested in the ATC of RHC. Are we interested in the average effect of RHC for the whole study population? If so, we are interested in the ATE of RHC.

We’ll assume that if we are making a causal inference about the effect of RHC, we have collected a sufficient set of variables to remove confounding. This may be a long list, but to keep the example short, we’ll use a list of 13 covariates thought to be related to receipt of RHC and death at 60 days, all measured prior to receipt of RHC.
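Because the same covariate set is used repeatedly, one optional way to organize the code is to store the covariate names in a vector and build the formula programmatically; the names covs and f below are our own conveniences, not part of the original analysis (the calls later in this section spell out the full formula, which is equivalent):

# The 13 baseline covariates used for adjustment
covs <- c("aps1", "meanbp1", "pafi1", "crea1", "hema1", "paco21",
          "surv2md1", "resp1", "card", "edu", "age", "race", "sex")

# Treatment ~ covariates formula, reusable in the balance assessments below
f <- reformulate(covs, response = "RHC")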

Let’s take a look at our dataset:

summary(rhc)
      aps1           meanbp1           pafi1           crea1         
 Min.   :  3.00   Min.   :  0.00   Min.   : 11.6   Min.   : 0.09999  
 1st Qu.: 41.00   1st Qu.: 50.00   1st Qu.:133.3   1st Qu.: 1.00000  
 Median : 54.00   Median : 63.00   Median :202.5   Median : 1.50000  
 Mean   : 54.67   Mean   : 78.52   Mean   :222.3   Mean   : 2.13302  
 3rd Qu.: 67.00   3rd Qu.:115.00   3rd Qu.:316.6   3rd Qu.: 2.39990  
 Max.   :147.00   Max.   :259.00   Max.   :937.5   Max.   :25.09766  
     hema1           paco21          surv2md1          resp1         card     
 Min.   : 2.00   Min.   :  1.00   Min.   :0.0000   Min.   :  0.00   No :3804  
 1st Qu.:26.10   1st Qu.: 31.00   1st Qu.:0.4709   1st Qu.: 14.00   Yes:1931  
 Median :30.00   Median : 37.00   Median :0.6280   Median : 30.00             
 Mean   :31.87   Mean   : 38.75   Mean   :0.5925   Mean   : 28.09             
 3rd Qu.:36.30   3rd Qu.: 42.00   3rd Qu.:0.7430   3rd Qu.: 38.00             
 Max.   :66.19   Max.   :156.00   Max.   :0.9620   Max.   :100.00             
      edu             age            race          sex            RHC        
 Min.   : 0.00   Min.   : 18.04   white:4460   Female:2543   Min.   :0.0000  
 1st Qu.:10.00   1st Qu.: 50.15   black: 920   Male  :3192   1st Qu.:0.0000  
 Median :12.00   Median : 64.05   other: 355                 Median :0.0000  
 Mean   :11.68   Mean   : 61.38                              Mean   :0.3808  
 3rd Qu.:13.00   3rd Qu.: 73.93                              3rd Qu.:1.0000  
 Max.   :30.00   Max.   :101.85                              Max.   :1.0000  
     death      
 Min.   :0.000  
 1st Qu.:0.000  
 Median :1.000  
 Mean   :0.649  
 3rd Qu.:1.000  
 Max.   :1.000  

Our treatment variable is RHC (1 for receipt, 0 for non-receipt), our outcome is death (1 for death by 60 days, 0 otherwise), and the remaining variables are the covariates we adjust for to remove confounding; they include a mix of continuous and categorical variables.
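As a quick sanity check before assessing balance (a minimal sketch; output omitted), we can cross-tabulate the treatment and the outcome and confirm the variable types:

# Cross-tabulate receipt of RHC against death by 60 days
with(rhc, table(RHC, death))

# Confirm which variables are continuous and which are factors
str(rhc)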

Let’s examine balance on the variables between the treatment groups using cobalt, which provides the function bal.tab() for creating a balance table containing balance statistics for each variable.

library("cobalt")
 cobalt (Version 4.5.5, Build Date: 2024-04-02)

We’ll request the standardized mean difference by including "m" in the stats argument and setting binary = "std" (by default, binary variables are not standardized), and we’ll request KS statistics by including "ks" in stats. Supplying the treatment and covariates as a formula in the first argument, along with the dataset, gives us the following:

bal.tab(RHC ~ aps1 + meanbp1 + pafi1 + crea1 + hema1 +
          paco21 + surv2md1 + resp1 + card + edu +
          age + race + sex, data = rhc,
        stats = c("m", "ks"), binary = "std")
Note: `s.d.denom` not specified; assuming "pooled".
Balance Measures
              Type Diff.Un  KS.Un
aps1       Contin.  0.5014 0.2127
meanbp1    Contin. -0.4551 0.2117
pafi1      Contin. -0.4332 0.1816
crea1      Contin.  0.2696 0.2011
hema1      Contin. -0.2693 0.1479
paco21     Contin. -0.2486 0.1081
surv2md1   Contin. -0.1985 0.0957
resp1      Contin. -0.1655 0.0910
card_Yes    Binary  0.2950 0.1395
edu        Contin.  0.0914 0.0511
age        Contin. -0.0614 0.0703
race_white  Binary  0.0152 0.0063
race_black  Binary -0.0310 0.0114
race_other  Binary  0.0208 0.0050
sex_Male    Binary  0.0931 0.0462

Sample sizes
    Control Treated
All    3551    2184

We can see substantial imbalances on many of the covariates, with large SMDs (greater than .1) and KS statistics (greater than .1, though there is no widely accepted threshold for these). We can also see the sample sizes for each treatment group. Because they are somewhat close in size (the control group is less than twice the size of the treated group), this will limit the available matching options and might affect our ability to achieve balance using methods that require a large pool of controls relative to the treated group.
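The same balance statistics can be displayed graphically with cobalt’s love.plot(), which accepts the same arguments as bal.tab(); a minimal sketch:

# Love plot of absolute standardized mean differences before adjustment;
# the dashed line marks the .1 threshold used above
love.plot(RHC ~ aps1 + meanbp1 + pafi1 + crea1 + hema1 +
            paco21 + surv2md1 + resp1 + card + edu +
            age + race + sex, data = rhc,
          binary = "std", abs = TRUE, thresholds = c(m = .1))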

Other balance statistics can be requested, too, using the stats argument. It is straightforward to assess balance on particular transformations of covariates using the addl argument, e.g., addl = ~age:edu to assess balance on the interaction (i.e., product) of age and edu. We can also supply int = TRUE and poly = 3, for example, to assess balance on all pairwise interactions of the covariates and all squares and cubes of the continuous covariates. This can make for large tables, but there are ways to keep them short and summarize them. For example, we can hide the balance table and request the number of covariates that fail to satisfy balance criteria, along with the covariates with the worst imbalance, using the code below:

bal.tab(RHC ~ aps1 + meanbp1 + pafi1 + crea1 + hema1 +
          paco21 + surv2md1 + resp1 + card + edu +
          age + race + sex, data = rhc,
        int = TRUE, poly = 3,
        stats = c("m", "ks"), binary = "std",
        thresholds = c(m = .1, ks = .1),
        disp.bal.tab = FALSE)
Note: `s.d.denom` not specified; assuming "pooled".
Balance tally for mean differences
                   count
Balanced, <0.1        61
Not Balanced, >0.1   105

Variable with the greatest mean difference
        Variable Diff.Un     M.Threshold.Un
 meanbp1 * pafi1 -0.5965 Not Balanced, >0.1

Balance tally for KS statistics
                   count
Balanced, <0.1        80
Not Balanced, >0.1    86

Variable with the greatest KS statistic
        Variable  KS.Un    KS.Threshold.Un
 meanbp1 * pafi1 0.2562 Not Balanced, >0.1

Sample sizes
    Control Treated
All    3551    2184

We can see that many covariates and their transformations (interactions, squares, and cubes) are not balanced based on our criteria for SMDs or KS statistics. We’ll use matching and weighting in the next sections to attempt to achieve balance on the covariates.
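Before moving on, it can also help to look at the full distribution of an individual covariate in each treatment group rather than a single summary statistic; cobalt’s bal.plot() does this. A minimal sketch for meanbp1, one of the more imbalanced covariates:

# Unadjusted density of mean blood pressure (meanbp1) by treatment group
bal.plot(RHC ~ meanbp1, data = rhc, var.name = "meanbp1",
         type = "density")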


  1. The version we use here has slight modifications and can be downloaded here or brought into R using rhc <- readRDS(url("https://github.com/IQSS/dss-ps/raw/refs/heads/main/rhc.rds"))