PSM assumptions
Suppose one has N units. In addition to a response value \( Y_i \), each of the N units has a covariate vector \( \mathbf{X}_i = (X_{i1}, \dots, X_{iK})' \), where i = 1, …, N, and K is the number of covariates. Let \( T_i \) be the treatment condition: \( T_i = 1 \) indicates that unit i is in the treatment group and \( T_i = 0 \) that it is in the comparison group. Rosenbaum and Rubin [1] defined the propensity score for unit i as the probability of the unit being assigned to the treatment group, conditional on the covariate vector \( \mathbf{X}_i \); that is,
$$ p\left({\mathbf{X}}_i\right)= Pr\left({T}_i=1\Big|{\mathbf{X}}_i\right). $$
(1)
PSM is based on the following two strong ignorability assumptions in treatment assignment [1]: (1) \( (Y_{1i}, Y_{0i}) \perp T_i \mid \mathbf{X}_i \); and (2) \( 0 < p(\mathbf{X}_i) < 1 \), where \( Y_{1i} \) and \( Y_{0i} \) denote the responses of unit i under the treatment and comparison conditions, respectively. The first assumption states that treatment assignment \( T_i \) and response \( (Y_{1i}, Y_{0i}) \) are conditionally independent, given \( \mathbf{X}_i \); the second ensures common support between the treatment and comparison groups.
Rosenbaum and Rubin [1] further demonstrated in their Theorem 3 that ignorability conditional on \( \mathbf{X}_i \) implies ignorability conditional on \( p(\mathbf{X}_i) \); that is,
$$ \left({Y}_{1i},{Y}_{0i}\right)\perp {T}_i \mid {\mathbf{X}}_i \Rightarrow \left({Y}_{1i},{Y}_{0i}\right)\perp {T}_i \mid p\left({\mathbf{X}}_i\right). $$
(2)
Thus, under the assumptions of strong ignorability in treatment assignment, if a unit in the treatment group and its matched unit in the comparison group have the same propensity score, the two matched units will have, in probability, the same value of the covariate vector \( \mathbf{X}_i \). Therefore, outcome analysis on the matched data tends to produce unbiased estimates of treatment effects, because balancing the distributions of observed covariates between the treatment and comparison groups reduces selection bias [1, 2, 7]. In practice, the logit of the propensity score, \( l(\mathbf{X}_i) = \ln\{p(\mathbf{X}_i)/[1 - p(\mathbf{X}_i)]\} \), rather than the propensity score \( p(\mathbf{X}_i) \) itself, is commonly used because the distribution of \( l(\mathbf{X}_i) \) is closer to normal than that of \( p(\mathbf{X}_i) \) [1].
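As a minimal sketch of Equation 1 and the logit transformation, the snippet below computes a propensity score and its logit for one unit under a logistic model. The coefficient values and the helper names `propensity` and `logit` are hypothetical and chosen only for illustration; they do not come from the study.

```python
import math

def propensity(x, coef, intercept):
    """Logistic-model propensity score p(X) = Pr(T = 1 | X)."""
    linear = intercept + sum(b * v for b, v in zip(coef, x))
    return 1.0 / (1.0 + math.exp(-linear))

def logit(p):
    """l(X) = ln{p / (1 - p)}; defined only when 0 < p < 1 (assumption 2)."""
    if not 0.0 < p < 1.0:
        raise ValueError("propensity score must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))

# Hypothetical fitted coefficients and covariate vector (illustration only).
coef, intercept = [0.8, -0.5], -0.2
x_i = [1.0, 2.0]

p_i = propensity(x_i, coef, intercept)
l_i = logit(p_i)
print(round(p_i, 4), round(l_i, 4))
```

Under a logistic model the logit is simply the linear predictor, so it is unbounded, whereas the propensity score is confined to the interval (0, 1).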
PSM methods
The basis of PSM is nearest neighbor matching [8], which matches unit i in the treatment group with unit j in the comparison group with the closest distance between the two units’ logit of their propensity scores expressed as follows:
$$ d\left(i,j\right)={ \min}_j\left\{\left|l\left({\mathbf{X}}_i\right)-l\left({\mathbf{X}}_j\right)\right|\right\}. $$
(3)
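Equation 3 can be sketched as a simple greedy search. The example assumes 1:1 matching without replacement and processes treated units in the order given; neither choice is specified in the text, and the unit labels and logit values are made up.

```python
def nearest_neighbor_match(treated, comparison):
    """Greedy 1:1 nearest neighbor matching on the logit of the propensity score.

    treated, comparison: dicts mapping a unit id to its logit l(X).
    Matching is without replacement (an assumption of this sketch).
    """
    available = dict(comparison)
    pairs = {}
    for i, l_i in treated.items():
        if not available:
            break  # no comparison units left to match
        # Equation 3: pick the comparison unit j minimizing |l(X_i) - l(X_j)|.
        j = min(available, key=lambda k: abs(l_i - available[k]))
        pairs[i] = j
        del available[j]
    return pairs

# Hypothetical logits (illustration only).
treated = {"t1": -0.40, "t2": 0.35}
comparison = {"c1": -1.20, "c2": -0.35, "c3": 0.50, "c4": 1.40}

pairs = nearest_neighbor_match(treated, comparison)
print(pairs)
```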
Alternatively, caliper matching [4] matches unit i in the treatment group with unit j in the comparison group within a pre-set caliper band b; that is,
$$ d\left(i,j\right)={ \min}_j\left\{\left|l\left({\mathbf{X}}_i\right)-l\left({\mathbf{X}}_j\right)\right|<b\right\}. $$
(4)
Based on Cochran and Rubin’s work [4], Rosenbaum and Rubin [8] recommended setting b to 0.25 of the pooled standard deviation (SD) of the propensity scores. Austin [9] further asserted that b = 0.20 × SD of the propensity scores is the optimal caliper width.
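The caliper rule in Equation 4 can be sketched as follows. The pooled SD is taken here as the square root of the average of the two group variances, which is an assumption of this sketch, as is matching without replacement. The SD is computed on whatever score scale is passed in (logits in this example), and the data are made up.

```python
import math
import statistics

def pooled_sd(treated_scores, comparison_scores):
    """Pooled SD, assumed here to be sqrt of the mean of the two group variances."""
    v_t = statistics.variance(treated_scores)
    v_c = statistics.variance(comparison_scores)
    return math.sqrt((v_t + v_c) / 2.0)

def caliper_match(treated, comparison, b):
    """Greedy 1:1 matching that only accepts matches closer than the caliper b."""
    available = dict(comparison)
    pairs = {}
    for i, s_i in treated.items():
        # Keep only comparison units inside the caliper band (Equation 4).
        inside = {j: abs(s_i - s_j) for j, s_j in available.items()
                  if abs(s_i - s_j) < b}
        if inside:
            j = min(inside, key=inside.get)
            pairs[i] = j
            del available[j]
    return pairs

# Hypothetical logits (illustration only).
treated = {"t1": -0.40, "t2": 0.35, "t3": 2.50}
comparison = {"c1": -1.20, "c2": -0.35, "c3": 0.50, "c4": 1.40}

b = 0.25 * pooled_sd(list(treated.values()), list(comparison.values()))
pairs = caliper_match(treated, comparison, b)
print(round(b, 3), pairs)
```

In this toy data set the third treated unit has no comparison unit inside the caliper and is therefore left unmatched, which is how caliper matching avoids poor matches at the cost of discarding units.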
Correspondingly, Mahalanobis metric matching (or Mahalanobis metric matching including the propensity score) and Mahalanobis caliper matching (or Mahalanobis metric matching within a propensity score caliper) [8] are two additional matching techniques similar to nearest neighbor matching and caliper matching, respectively, but use a different distance measure. In Mahalanobis metric matching, unit i in the treatment group is matched with unit j in the comparison group with the closest Mahalanobis distance measured as follows:
$$ d\left(i,j\right)={ \min}_j\left\{{D}_{ij}\right\}, $$
(5)
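where \( D_{ij} = (\mathbf{Z}_i' - \mathbf{Z}_j')' \mathbf{S}^{-1} (\mathbf{Z}_i' - \mathbf{Z}_j') \), \( \mathbf{Z}_{\bullet} \) (• = i or j) is a new vector \( (\mathbf{X}_{\bullet}, l(\mathbf{X}_{\bullet})) \), and \( \mathbf{S} \) is the sample variance-covariance matrix of the vector for the comparison group. Mahalanobis caliper matching is a variant of Mahalanobis metric matching and it uses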
$$ d\left(i,j\right)={ \min}_j\left\{{D}_{ij}<b\right\}, $$
(6)
where the selection of the caliper band b is the same as in caliper matching.
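The distance \( D_{ij} \) in Equation 5 can be sketched in plain Python for the simplest case of a single covariate, so that each Z has two components (the covariate and the logit). Restricting to two dimensions is an assumption made here to keep the matrix inverse explicit; the data and helper names are hypothetical.

```python
import statistics

def covariance_2d(rows):
    """Sample variance-covariance matrix S of two-component vectors."""
    n = len(rows)
    m0 = statistics.mean(r[0] for r in rows)
    m1 = statistics.mean(r[1] for r in rows)
    s00 = sum((r[0] - m0) ** 2 for r in rows) / (n - 1)
    s11 = sum((r[1] - m1) ** 2 for r in rows) / (n - 1)
    s01 = sum((r[0] - m0) * (r[1] - m1) for r in rows) / (n - 1)
    return [[s00, s01], [s01, s11]]

def inverse_2d(m):
    """Inverse of a 2 x 2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def mahalanobis(z_i, z_j, s_inv):
    """D_ij = (Z_i - Z_j)' S^{-1} (Z_i - Z_j), as defined in the text (no square root)."""
    d0, d1 = z_i[0] - z_j[0], z_i[1] - z_j[1]
    return (d0 * (s_inv[0][0] * d0 + s_inv[0][1] * d1)
            + d1 * (s_inv[1][0] * d0 + s_inv[1][1] * d1))

# Hypothetical Z = (covariate, logit) vectors (illustration only).
comparison = {"c1": (1.0, -1.20), "c2": (2.0, -0.35),
              "c3": (3.0, 0.50), "c4": (5.0, 1.40)}
z_t1 = (2.0, -0.40)

# S is estimated from the comparison group, as stated in the text.
s_inv = inverse_2d(covariance_2d(list(comparison.values())))
distances = {j: mahalanobis(z_t1, z_j, s_inv) for j, z_j in comparison.items()}
best = min(distances, key=distances.get)
print(best, round(distances[best], 4))
```

For Mahalanobis caliper matching (Equation 6), the same distances would be filtered by the caliper band b before the minimum is taken.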
Data reduction after matching is common and unavoidable in PSM. Losing data from the comparison group may look like a problem, but the units lost are unmatched cases that are assumed to be a potential source of selection bias, so retaining them would have a negative impact on the estimation of treatment effects. The matched data, although possibly smaller in sample size, will therefore produce more valid (or less biased) estimates than the original data. PSM may not be applicable to small samples, which are not uncommon in medical research, but it is particularly useful in secondary data analysis on national databases such as the CMS data.
PSM algorithms
All of the aforementioned PSM methods can be implemented with either a greedy matching or an optimal matching algorithm [10]. The two algorithms usually produce similar matched data when the comparison group is large, although optimal matching gives rise to smaller overall distances within matched units [11, 12]. All of these matching techniques, whether implemented with greedy or optimal matching, are based on the distance between point estimates of propensity scores. The problem with this approach is that it is difficult to establish a meaningful criterion for evaluating the closeness of the matched units without knowing the standard errors of the estimated unit-specific propensity scores. Simply put, without knowing the standard errors of \( l(\mathbf{X}_i) \) and \( l(\mathbf{X}_j) \), we do not know whether \( l(\mathbf{X}_j) \) in the comparison group is the best match for \( l(\mathbf{X}_i) \) in the treatment group. In other words, a score a little smaller than \( l(\mathbf{X}_j) \) might be a better match for \( l(\mathbf{X}_i) \); or conversely, \( l(\mathbf{X}_j) \) might be better matched with a score a little larger than \( l(\mathbf{X}_i) \).
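The difference between greedy and optimal matching can be illustrated on a toy example. In this sketch the optimal solution is found by brute force over all assignments, which is feasible only for tiny samples; practical implementations typically rely on network-flow or assignment algorithms instead. The logit values are made up.

```python
from itertools import permutations

def total_distance(pairs, treated, comparison):
    """Sum of |l(X_i) - l(X_j)| over all matched pairs."""
    return sum(abs(treated[i] - comparison[j]) for i, j in pairs)

def greedy_match(treated, comparison):
    """Each treated unit, in order, takes the closest remaining comparison unit."""
    available = dict(comparison)
    pairs = []
    for i, l_i in treated.items():
        j = min(available, key=lambda k: abs(l_i - available[k]))
        pairs.append((i, j))
        del available[j]
    return pairs

def optimal_match(treated, comparison):
    """Brute-force search for the 1:1 assignment with the smallest total distance."""
    ids = list(treated)
    best = None
    for perm in permutations(comparison, len(ids)):
        pairs = list(zip(ids, perm))
        if best is None or (total_distance(pairs, treated, comparison)
                            < total_distance(best, treated, comparison)):
            best = pairs
    return best

# Hypothetical logits chosen so that the two algorithms disagree.
treated = {"t1": 0.0, "t2": 0.4}
comparison = {"c1": 0.3, "c2": -0.6}

greedy = greedy_match(treated, comparison)
optimal = optimal_match(treated, comparison)
print(greedy, total_distance(greedy, treated, comparison))
print(optimal, total_distance(optimal, treated, comparison))
```

Here the greedy algorithm gives the first treated unit its closest neighbor and leaves the second with a poor match, while the optimal assignment accepts a slightly worse first match to reduce the overall distance. Both, however, compare point estimates only, which is the limitation discussed above.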
Although caliper matching, one of the most effective matching methods [13–15], uses a caliper band to avoid “bad” matches, a caliper band is fixed (or unit-invariant) and cannot capture the unit-specific standard error of the estimated propensity score for each unit. Therefore, a new matching technique is needed for gauging standard errors of propensity scores.
Interval matching
Interval matching extends caliper matching for accommodating the estimation error (or standard error) of the estimated propensity score by establishing a CI of the estimated propensity score for each unit. In interval matching, if the CI of a unit in the treatment group overlaps with that of one or more units in the comparison group, they are considered as matched units. Because the true distribution of propensity scores is unknown, the bootstrap [5] is utilized for obtaining a unit-specific CI for each unit. The bootstrap is a statistical method of assessing the accuracy (e.g., standard errors and CIs) of sample estimates to population parameters, based on the empirical distribution of sample estimates from random resamples of a given sample whose distribution is unknown.
Let \( \{X_1, \dots, X_N\} \) be a random sample of size N from an unknown distribution F; θ(F) is a parameter of interest. The specific procedure of the bootstrap for computing a CI of the parameter estimate, \( [\widehat{\theta}_{\alpha/2}(X_1, \dots, X_N), \widehat{\theta}_{1-\alpha/2}(X_1, \dots, X_N)] \), where (1 − α) is the confidence level, consists of the following four steps:
1. Obtain a bootstrap sample \( \{X_1^*, \dots, X_N^*\} \) that is randomly resampled with replacement from the empirical distribution \( F_N \) represented by the original sample \( \{X_1, \dots, X_N\} \);
2. Calculate the parameter estimate \( \widehat{\theta}(X_1^*, \dots, X_N^*) \) for the quantity \( \theta(F_N) = \theta(X_1, \dots, X_N) \);
3. Repeat the same independent resampling-calculating scheme B times (typically 500 times), resulting in B bootstrap estimates \( \widehat{\theta}(X_1^{*(b)}, \dots, X_N^{*(b)}) \), b = 1, …, B, which constitute an empirical distribution (or sampling distribution) of the estimate \( \widehat{\theta}(X_1, \dots, X_N) \); and
4. Obtain the estimated CI of the parameter estimate, \( [\widehat{\theta}_{\alpha/2}(X_1, \dots, X_N), \widehat{\theta}_{1-\alpha/2}(X_1, \dots, X_N)] \), by computing the (α/2)th and (1 − α/2)th percentiles of the sampling distribution, \( \widehat{\theta}_{\alpha/2}(X_1^*, \dots, X_N^*) \) and \( \widehat{\theta}_{1-\alpha/2}(X_1^*, \dots, X_N^*) \).
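The four steps above can be sketched for a generic estimator as follows. The sample values, the fixed random seed, and the linear-interpolation rule used to read off percentiles are assumptions of this sketch rather than details given in the text.

```python
import random
import statistics

def quantile(sorted_values, q):
    """Quantile by linear interpolation between order statistics (one common convention)."""
    pos = q * (len(sorted_values) - 1)
    low = int(pos)
    high = min(low + 1, len(sorted_values) - 1)
    return sorted_values[low] + (pos - low) * (sorted_values[high] - sorted_values[low])

def bootstrap_ci(sample, estimator, alpha=0.05, B=500, seed=0):
    """Percentile bootstrap CI at confidence level 1 - alpha."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(B):                                            # step 3: repeat B times
        resample = [sample[rng.randrange(n)] for _ in range(n)]   # step 1: resample with replacement
        estimates.append(estimator(resample))                     # step 2: re-estimate
    estimates.sort()
    # Step 4: percentiles of the bootstrap distribution.
    return quantile(estimates, alpha / 2), quantile(estimates, 1 - alpha / 2)

# Hypothetical sample (illustration only).
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]
lower, upper = bootstrap_ci(sample, statistics.mean)
print(lower, upper)
```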
To obtain the bootstrap CIs for interval matching, one can simply follow the steps described above. First, conduct the bootstrap resampling B times on units in the sample data \( (T, \mathbf{X}) \), where T is the indicator of the treatment conditions and \( \mathbf{X} \) is the covariate value matrix \( (\mathbf{X}_1, \dots, \mathbf{X}_N)' \), resulting in B bootstrap samples \( (T^{(b)}, \mathbf{X}^{(b)}) \), where \( \mathbf{X}^{(b)} = (\mathbf{X}_1^{*(b)}, \dots, \mathbf{X}_N^{*(b)})' \), b = 1, …, B. Second, a logistic regression (or other propensity score estimation model) is applied to each of the B bootstrap samples, resulting in B propensity scores for each unit i (i = 1, …, N): \( p(\mathbf{X}_i^{*(1)}), \dots, p(\mathbf{X}_i^{*(B)}) \); then, their logits, \( l(\mathbf{X}_i^{*(1)}), \dots, l(\mathbf{X}_i^{*(B)}) \), are calculated. Last, for each unit i, a CI at a certain confidence level (e.g., 68% CI) is obtained by calculating the corresponding percentiles of the sampling distribution of the logit of the B bootstrap propensity scores. Specifically, an estimated bootstrap 68% CI for the logit of the propensity score of unit i would be \( [l_{.16}(\mathbf{X}_i^*), l_{.84}(\mathbf{X}_i^*)] \) (see Fig. 1 for an illustration).
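A rough sketch of this procedure with a single covariate is given below. Several choices are assumptions made for illustration and are not taken from the text: the logistic model is fitted with a fixed number of gradient-ascent steps and a small ridge penalty (so that perfectly separated bootstrap samples still give finite coefficients), each fitted model is used to score every original unit, and the data are made up.

```python
import math
import random

def fit_logistic(xs, ts, steps=200, rate=0.5, ridge=0.01):
    """Fit logit Pr(T = 1 | x) = b0 + b1 * x by gradient ascent (illustrative only)."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, t in zip(xs, ts):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += t - p
            g1 += (t - p) * x
        # The ridge term is an assumption of this sketch, added for stability.
        b0 += rate * (g0 / n - ridge * b0)
        b1 += rate * (g1 / n - ridge * b1)
    return b0, b1

def quantile(sorted_values, q):
    """Quantile by linear interpolation between order statistics."""
    pos = q * (len(sorted_values) - 1)
    low = int(pos)
    high = min(low + 1, len(sorted_values) - 1)
    return sorted_values[low] + (pos - low) * (sorted_values[high] - sorted_values[low])

def bootstrap_logit_cis(xs, ts, B=500, level=0.68, seed=0):
    """Unit-specific percentile CIs for the logit of the propensity score."""
    rng = random.Random(seed)
    n = len(xs)
    logits = [[] for _ in range(n)]
    for _ in range(B):
        # Resample units, keeping each (T, X) pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        b0, b1 = fit_logistic([xs[k] for k in idx], [ts[k] for k in idx])
        # Score every original unit with the model fitted on this bootstrap sample.
        for i in range(n):
            logits[i].append(b0 + b1 * xs[i])
    tail = (1.0 - level) / 2.0   # 0.16 for a 68% CI
    cis = []
    for values in logits:
        values.sort()
        cis.append((quantile(values, tail), quantile(values, 1.0 - tail)))
    return cis

# Hypothetical covariate values and treatment indicators (illustration only).
xs = [-1.5, -1.0, -0.8, -0.3, 0.0, 0.2, 0.5, 0.9, 1.2, 1.6]
ts = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

cis = bootstrap_logit_cis(xs, ts)
for x, (lo, hi) in zip(xs, cis):
    print(x, round(lo, 3), round(hi, 3))
```

Because the CI is computed separately for each unit, its width varies across units, which is the unit-specific information that a fixed caliper cannot provide.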
Once a CI of the estimate of the logit of propensity score is obtained for each unit, interval matching can be conducted by examining whether the CI for a unit in the treatment group overlaps with that for one or more units in the comparison group. In other words, if the two CIs overlap; that is,
$$ \left[{l}_{.16}\left({\mathbf{X}}_i^{*}\right),{l}_{.84}\left({\mathbf{X}}_i^{*}\right)\right]\cap \left[{l}_{.16}\left({\mathbf{X}}_j^{*}\right),{l}_{.84}\left({\mathbf{X}}_j^{*}\right)\right]\ne \varnothing, $$
(7)
the two units are taken as matched units. In practice, one can do either 1:1 or 1:K interval matching. In 1:1 interval matching, one needs to take only one unit that has the closest distance, as defined by the matching method (e.g., Equation 3 for nearest neighbor matching and Equation 6 for Mahalanobis caliper matching), between the logit of the propensity scores among all the units in the comparison group whose CIs overlap with that of the unit in the treatment group. If there are two or more units in the comparison group within the overlap having the same closest distance, the program will randomly select one as the matched unit. In 1:K interval matching, one can simply take K closest units in the comparison group whose CIs overlap with that of the unit in the treatment group.
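The overlap rule in Equation 7 and the 1:1 and 1:K selection step can be sketched as follows. The sketch uses the absolute logit difference of Equation 3 as the distance and matches without replacement; the latter, along with the unit labels, logits, and CIs, is an assumption made for illustration.

```python
import random

def overlaps(ci_a, ci_b):
    """True when two closed intervals share at least one point (Equation 7)."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

def interval_match(treated, comparison, k=1, seed=0):
    """1:k interval matching.

    treated, comparison: dicts mapping a unit id to (logit, (ci_lower, ci_upper)).
    Returns a dict mapping each matched treated id to a list of comparison ids.
    """
    rng = random.Random(seed)
    available = dict(comparison)
    matches = {}
    for i, (l_i, ci_i) in treated.items():
        # Candidates are comparison units whose CI overlaps that of unit i.
        candidates = [j for j, (_, ci_j) in available.items() if overlaps(ci_i, ci_j)]
        # Shuffle first so that the stable sort breaks distance ties at random.
        rng.shuffle(candidates)
        candidates.sort(key=lambda j: abs(l_i - available[j][0]))
        chosen = candidates[:k]
        if chosen:
            matches[i] = chosen
            for j in chosen:
                del available[j]   # without replacement (assumption of this sketch)
    return matches

# Hypothetical logits and 68% CIs (illustration only).
treated = {"t1": (-0.40, (-0.70, -0.10)),
           "t2": (2.50, (2.20, 2.80))}
comparison = {"c1": (-1.20, (-1.50, -0.90)),
              "c2": (-0.35, (-0.60, -0.05)),
              "c3": (0.50, (0.10, 0.90)),
              "c4": (1.40, (1.00, 1.80))}

matches = interval_match(treated, comparison, k=1)
print(matches)
```

In this toy example the second treated unit remains unmatched because no comparison CI overlaps its own.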
It is worth noting that using the logit of the propensity score \( l(\mathbf{X}_i) \) is particularly important in interval matching because the distribution of the logit \( l(\mathbf{X}_i) \) is more symmetric than that of the propensity score \( p(\mathbf{X}_i) \); therefore, interval matching based on the logit \( l(\mathbf{X}_i) \) will be more balanced in terms of matching from both sides (left or right) of the distribution of the logit \( l(\mathbf{X}_i) \).