Skip to main content

Comparison of two propensity score-based methods for balancing covariates: the overlap weighting and fine stratification methods in real-world claims data



Two propensity score (PS) based balancing covariate methods, the overlap weighting method (OW) and the fine stratification method (FS), produce superb covariate balance. OW has been compared with various weighting methods while FS has been compared with the traditional stratification method and various matching methods. However, no study has yet compared OW and FS. In addition, OW has not yet been evaluated in large claims data with low prevalence exposure and with low frequency outcomes, a context in which optimal use of balancing methods is critical. In the study, we aimed to compare OW and FS using real-world data and simulations with low prevalence exposure and with low frequency outcomes.


We used the Texas State Medicaid claims data on adult beneficiaries with diabetes in 2012 as an empirical example (N = 42,628). Based on its real-world research question, we estimated an average treatment effect of health center vs. non-health center attendance in the total population. We also performed simulations to evaluate their relative performance. To preserve associations between covariates, we used the plasmode approach to simulate outcomes and/or exposures with N = 4,000. We simulated both homogeneous and heterogeneous treatment effects with various outcome risks (1-30% or observed: 27.75%) and/or exposure prevalence (2.5-30% or observed:10.55%). We used a weighted generalized linear model to estimate the exposure effect and the cluster-robust standard error (SE) method to estimate its SE.


In the empirical example, we found that OW had smaller standardized mean differences in all covariates (range: OW: 0.0–0.02 vs. FS: 0.22–3.26) and Mahalanobis balance distance (MB) (< 0.001 vs. > 0.049) than FS. In simulations, OW also achieved smaller MB (homogeneity: <0.04 vs. > 0.04; heterogeneity: 0.0-0.11 vs. 0.07–0.29), relative bias (homogeneity: 4.04–56.20 vs. 20–61.63; heterogeneity: 7.85–57.6 vs. 15.0-60.4), square root of mean squared error (homogeneity: 0.332–1.308 vs. 0.385–1.365; heterogeneity: 0.263-0.526 vs 0.313-0.620), and coverage probability (homogeneity: 0.0–80.4% vs. 0.0-69.8%; heterogeneity: 0.0-97.6% vs. 0.0-92.8%), than FS, in most cases.


These findings suggest that OW can yield nearly perfect covariate balance and therefore enhance the accuracy of average treatment effect estimation in the total population.

Peer Review reports


Due to infeasibility of running a randomized experiment, observational data are often used to estimate the population health effects of interventions. When estimating plausibly causal effects using observational data, it is necessary to reduce imbalance in the empirical distribution of the pretreatment confounders between the treated and control groups [1]. Lowering imbalance can reduce the degree of model dependence for the statistical estimation of causal effects [1,2,3,4], and thus reduces inefficiency and bias [1]. To achieve balanced covariates, propensity scores (PS) have become a cornerstone in observational studies aimed at estimating causal effects [5, 6]. PS are defined as the predicted probability of receiving a particular treatment (or exposure) for the given covariate realizations of a study subject.

In this paper, we study PS-based approaches to estimate the average treatment effect in the total population (ATE). There are three common types of balancing methods via PS: matching, stratifying, and weighting. Among matching methods, the PS matching method (PSM) is the most commonly used in practice [1]. It is simple and intuitive by reducing the multidimensional covariate space to one dimension. Despite its widespread adoption, a large sample size is required as it discards some subjects who are not matched. In addition, PSM has been shown to increase model “imbalance, inefficiency, model dependence, and bias,” which is not the case with most other matching methods [1]. Among the stratification methods, the most common one is to stratify subjects into five quintiles of PS. With the stratum boundaries determined by PS distribution in the exposed and the comparison group combined, it eliminates approximately 90% of bias due to measured confounding [7]. However, when exposure is infrequent, it may result in all exposed subjects being aggregated in one or more extreme strata [5, 8]. The fine stratification weights method (FS), a recent method, can solve this issue by increasing number of strata and by determining stratum boundaries based on PS distribution in exposed group only. It has been shown to gain greater efficiency than the traditional one [8]. Among the weighting methods, inverse probability weighting (IPW) is popular but performs poorly when some subjects have extreme PS [9,10,11]. The PS based overlap weighting method (OW), another recent method, overcomes IPW’s extreme weight issue and produces impressive covariate balance [12, 13].

OW has been theoretically proven to have small-sample exact balance property [12]. That is, it leads to exact balance on the mean of every covariate when the PS is estimated by a logistic regression. It is less sensitive to model misspecification compared to the inverse probability weighting method (IPW) in a simulation study [14]. Despite these features, to our knowledge, OW has only been evaluated by comparing with weighting methods such as IPW and trimmed IPW [9,10,11,12,13,14]. Little is known about the relative performance of OW compared with other types of balancing methods including matching and stratification methods [15]. In addition, OW has not been evaluated in large claims data with low prevalence exposure and/or with low frequency events (i.e., outcomes), a context in which optimal use of balancing methods is critical.

Furthermore, matching on PS is limited by exclusion of subjects without a suitable match leading to a non-representative population and a loss of statistical power [16]. PSM including 1:1, 1:5, and full matching have less model precision than FS in at least two claims studies [8, 17]. Therefore, we aimed to compare OW with FS only, both relatively new and promising methods, using real-world and simulated claims data in settings with infrequent exposure and/or with low prevalence outcomes.


Empirical example

We used a cohort of 42,628 Texas State Medicaid beneficiaries, aged 18–64, diagnosed with type 2 diabetes, who had at least one primary care visit between January 2012 and December 2012. About 10.55% (n = 4,498) of the patients received the majority of their primary care at federally qualified health centers (FQHCs) (exposure), while the rest (89.45%, n = 38,130) received care at non-FQHCs (control). Researchers analyzed whether or not those patients who had routine primary care at FQHCs had fewer hospitalizations and emergency room visits than the non-FQHC patients. Five continuous and 12 binary covariates were selected based on clinical relevance and previous literature. The empirical example has 10.55% exposure rate which is near rare (typically < 10% considered as rare) and hospitalization quite often is a rare outcome.

The study was reviewed by the University of Chicago Institutional Review Board and determined to be non-human subject research.

Overlap weighting method (OW)

The OW method mimics a randomized trial by assigning appropriate weights to generate a clinically relevant target population – overlapped between groups. That is, a subject in the treatment group receives a weight that is the probability of not receiving the treatment (i.e., 1 – PS), while a subject in the control group receives a weight that is the probability of receiving the treatment (i.e., PS). As a consequence, the two groups have overlapped PS distributions. Those subjects overlapped between the two groups in the PS distribution receive more weight, while those who are only in one non-overlapping tail of the PS distribution receive less. Also OW does not prune any subjects. The target of inference, advantages, and disadvantages of the OW and FS methods are compared in eTable 1.

PS-based fine stratification method (FS)

The FS method proposed by Desai et al. (2017) [8] finds matched balancing scores (PS) via stratification with a large number of PS strata (much larger than five in the traditional stratification method), and then assigns appropriate weights to subjects per stratum. It minimizes any loss of exposed subjects that may be relevant especially when treatment exposure is rare, because losing subjects decreases precision of the treatment effect estimates [8]. The method only excludes subjects whose PS are not in the overlapped PS regions between the two groups. There are two steps for implementation: (1) create equally-sized PS strata by ranking only treated/exposed subjects based on PS values and then assign control/unexposed subjects to these strata; (2) following stratification, in all strata with at least one treated patient and one control patient, weights are calculated (see below).

Regarding the optimum number of strata, Desai et al. stated that it may be difficult to make general recommendations because it may depend on the prevalence of a rare-exposed treatment [8]. The number of PS strata they used was 10, 50, or 100 and all produced similar bias and precisions in their simulations. In this study, we chose 20 PS strata, their stratification width about 0.05 on average, smaller than the recommended PS width of 0.2 [7]. . Each stratum had about 225 subjects from the FQHC-exposed group in our empirical example.

Target of inference (estimand) and weights

In the study, since each patient can switch their primary care visits between FQHCs and non-FQHCs, we estimated ATE among all patients [6]. In literature, there are two existing approaches to assign weights for ATE. One approach is to generate equal total weights between groups, denoted as ‘ATE-equ,’ is based on Ntotal in stratum i/Ntotal exp in stratum i for the exposed group and N"total in stratum i"/N"total unexp in stratum i" for the unexposed group [18, 19]. The other approach, denoted as ‘ATE-unequ,’ is based on (Ntotal in stratum i/Ntotal)/(Ntotal exp in stratum i/Ntotal exp") for the exposed group and (Ntotal in stratum i/Ntotal)/(Ntotal unexp in stratum i/Ntotal unexp) for the unexposed group [6, 20]. This alternative approach results in the total weight in one group equivalent to the sample size in that group. The two weighting methods are very similar, except that ATE-unequ has a weight of N“total exp” (N“total unexp”) for the exposed (unexposed) group.

As a weighting method, OW targets the overlap population and its corresponding estimand is referred to as ATE on the overlap population (ATO) [12]. Zhou et al. (2020) stated that OW was part of a class of balancing weights that target a judiciously chosen subpopulation of interest from which an estimand is closely related to ATE [14]. Not surprisingly, OW’s total weights are identical between groups, the same as ATE-equ.

Evaluation of performance via the empirical example

In the empirical study for the method evaluation [21], we used the standardized mean difference between the two groups (SMD), Mahalanobis balance (MB), and final sample size of retaining sample. SMD is a distance measure of balancing criterion for each covariate [22]. MB is a metric that measures the distance between two group mean vectors of all covariates and is standardized by the sample covariance matrix [1, 17, 23]. Final sample size [8] is a measure of model precision and can be important for a rare event outcome.


After balancing covariates, we determined relative performance of OW and FS for model bias and precision. The degree of covariate imbalance is proportional to bias in the treatment effect [24], and final sample size is associated with precision. However, due to lack of knowledge of the true FQHC effect in the empirical example, we do not know the real size of model bias and precision, especially, in a setting of infrequent exposure/outcome. Therefore, we conducted simulations.

Instead of using ordinary simulation approaches that do not capture important features that may exist between covariates, we chose the plasmode approach to conduct simulations [25,26,27]. Through resampling with replacement from all the observed covariates, plasmode can preserve the associations between covariates with potential complex covariance structures, which are common in healthcare claims databases [25]. Via a logistic regression model, details were provided in Appendix A (including R code) on how to simulate an outcome and/or an exposure factor. There were two logit models for outcome and exposure, respectively, with two different linear combinations of covariates. We simulated two types of treatment effects: homogeneity and heterogeneity. To simulate a heterogeneous treatment effect, we replaced constant treatment effect with an interaction term between exposure and sex (or age): sex, as an example, represented as a binary heterogeneity factor and age was a continuous one [25]. Age was standardized first before conducting a heterogeneous treatment effect. The simulation settings can be found in eTable 2.

To examine settings with infrequent outcome and/or occasional exposure, for each type of treatment effect, we simulated four scenarios by varying outcome risks and/or exposure prevalence. Scenarios simulated outcome risks of 1%, 10%, and 30% with the observed exposure prevalence (10.55%) or with 2.5% simulated exposure. We also simulated exposure prevalence of 2.5%, 10%, and 30% with the observed outcome risk (27.75%) or with 1% simulated outcome risk. We set the true FQHC effect to be one as a coefficient to both homogeneous and heterogeneous treatment terms. For each scenario, we simulated 500 datasets, each with the sample size of 4,000. For each simulated dataset, a weighted generalized linear model (GLM) with the log link function in the SAS GENMOD procedure was used to estimate the FQHC effect, i.e., natural logarithms of relative risk ratio [8]. Due to non-uniform weights included in our GLMs, instead of using the default delta method, we used the cluster-robust standard error method to estimate standard error (SE) of the effect [28, 29]. After covariates balanced, adjusting further for covariates is unnecessary because it is unrelated to the treatment independent variable [30]. That is, a simple difference in means on the balanced data can estimate the causal effect.

Evaluation of performance in simulations

In the simulation study, we used the following criteria to evaluate the methods: mean MB, mean relative bias (rbias), standard deviation (SD) of rbias, square root of mean squared error of bias (rMSE), average SE of the estimated effect, average final sample size, two coverages [8, 17, 23], and significance. Relative bias is the percent relative difference, 100(estimated effect -truth)/truth [12,13,14]. The rMSE combines squared bias (not rbias) and its variance. The coverage is a probability of the 95% confidence interval (CI) that covers the true effect (denoted as ‘coverage’) [12, 13, 31]. It can be obtained with two steps: (1) to compute a CI via our weighted GLM and (2) then to calculate a proportion of samples covering the true effect among 500 simulations. In our simulation study, a CI could cover both the non-zero true effect and zero, and statistical significance may be influenced. Therefore, to distinguish from the traditional coverage, we generated another one (‘coverageT’) counting those CIs that cover the true effect but not zero. In some cases where CIs were too narrow to cover the true effect (see results below), significance was defined as a proportion of samples obtaining a significant effect (by a weighted GLM with a two-sided p-value < 0.05). The two coverages and significance are associated with model precision [32], but more targeted to detect the true treatment effect. Among the criteria, the least useful criterion is SE because it measures variability of effect in a model, not bias, precision, or measures in covariate balance.

Unmatched subjects

Although matching was not involved in the study, via simulations we discovered whether pruning those clearly unmatched subjects has any effects on model bias and precision. The unmatched subjects are those who are available from one group but not from the other group in terms of combination cells of binary covariates.

Summary of all methods

In summary, we used two datasets for performance evaluation: one was the original full dataset (denoted as ‘F’); the other was the dataset (denoted as ‘X’) after deleting those unmatched subjects. We also evaluated the two weighting approaches. Therefore, there were a total of seven methods for comparisons: crude, OWF, OWX, FSF−equ, FSX−equ, FSF−uneq, and FSX−unequ (summary can be found in eTable 3). We used SAS version 9.4 to conduct covariate balancing, and statistical modeling for both empirical and simulation studies (Appendix A for analysis of one simulation), and we used R function (Appendix B) to generate simulation datasets in R version 4.3.0.


Analysis of the empirical dataset

Table 1 shows the evaluation of the seven methods using the real-world data. OWF and OWX were nearly identical and performed the best by reducing all SMD of covariates to zero, and the smallest MB over all covariates, indicating perfect balancing of covariates. The four FS methods consistently performed fairly well over all covariates, with all SMD around zero. Among them, the two FS with equal weights between groups (FSF−equ and FSX−equ) were closer to each other and achieved better MB than the other two FS with unequal weights (FSF−unequ and FSX−unequ). The crude method exhibited the worst performance, far more imbalanced.

For the final sample size used for further analysis, OWX excluded 147 subjects (0.345%) who had no matches for combinations of all binary variables. Using the full dataset, FS excluded 21 subjects (< 0.05%) whose PS were in non-overlapped regions. Distributions of PS per group were in eFigure 1.

Analysis of simulated datasets with the homogeneous treatment effect

Tables 2 and 3 showed the simulation evaluation of each method by risk level of outcome and by exposure prevalence, respectively. In most scenarios, OWF and OWX had very similar results and both performed better than the other methods. That is, both OW had small MB, rbias, SD of rbias, rMSE, relatively small SE, and relatively large coverage and coverageT. There were two exceptions. One was that the crude method had smallest SE of estimate. The reason is that model estimation by the crude method is consistent with the simulation method (i.e., two logit models with a constant and additive treatment effect). The second exception was the cases with rare outcome events (1%) and low exposures (2.5% and 10%) (Table 2), where the crude method had the smallest rbias, SD of rbias, and rMSE compared to the others. The reason is that both rare events and low exposures resulted in complete separation or quasi-complete separation of data points that caused model estimation to be unstable [33, 34]. After removing these simulated samples, both OW had smaller rbias and larger coverage than the others (eTable 4).

Table 1 Evaluation of OW and FS in the empirical example
Table 2 Evaluation of OW and FS methods by simulation with the constant true effect by outcome risk along with observed/simulated exposure
Table 3 Evaluation of OW and FS methods by simulation with the constant true effect and by exposure prevalence with observed outcome risk or with simulated outcome risk of 1%

Similar to the empirical study, the four FS methods were quite close to each other. The two FS with equal weights had smaller MB than the two FS with unequal weights. However, each pair of FS using the data with the same sample sizes (either full or reduced datasets) had almost the same model estimations. These indicate that the two ATE weighting methods had minor difference in balancing values but almost identical values in model estimation. The two FS using the full datasets generally had better model estimations than the two using the reduced datasets.

In the criteria, there were different change patterns over simulations. As outcome risk (Table 2) or exposure prevalence (Table 3) increased, the power increased, SE, SD, rMSE, and both coverages decreased, and significance increased. Coverages decreased as outcome risk (or exposure prevalence) increased because smaller SE resulted in a narrower CI of effect that were too narrow to cover the true effect. MB, unrelated to model estimation, remained stable in a method when outcome risk increased, and reduced greatly when exposure prevalence increased. On the other hand, rbias increased when outcome risk increased, and remained similar in a method when exposure increased.

To determine whether the simulation results were due to real differences or Monte Carlo error (MCE), we calculated MCE for both MB and rbias (eTables 5–6). We evaluated the number of simulations needed (Appendix B) and found that 500 simulations were enough for most settings of outcome and exposure.

Analysis of simulated datasets with heterogeneous treatment effect

Tables 4 and 5 and eTables 7 and 8 showed the evaluation results due to sex(age)-dependent treatment effects. Similar to the results with the constant treatment effect, the two OW methods had very similar results and both performed better than the other methods in terms of MB, rbias, rMSE, coverages, and significance. Also the same to the homogeneous cases, there were two exceptions. One was that the crude method had smallest SE of estimate. The other was that FS had smaller rMSE than OW in the case with 1% outcome and 10% or 10.55% exposure. It was also due to the issue of complete separation or quasi-complete separation of data points in a few simulated samples. After removing those samples, the OW methods still performed the best (eTable 9). All change patterns across scenarios were consistent to those in the homogeneous cases.

Table 4 Evaluation of OW and FS methods by simulations with sex-dependent heterogeneous treatment effect by outcome risk along with observed/simulated exposure.*
Table 4 Evaluation of OW and FS methods by simulation with the sex-dependent heterogeneous treatment effect by exposure prevalence with observed outcome risk or with simulated outcome risk of 1%


Both OW and FS methods performed well among PS-based balancing methods for causal inference. To our knowledge, our study is the first to compare OW and FS as the two types of PS-based balancing methods: weighting and stratification. We used a real-world and simulated claims data for their relative performance. We included simulations of rare outcome and/or exposure, not rare in a claims-based observational study. We simulated data for both homogeneous and heterogeneous treatment effects. The OW method obtained nearly perfect covariate balance and performed much better in covariate balance, model bias, and model precision and coverages than FS.

The target of inference (estimand) we focused on was ATE due to the nature of the intervention in the real-world example where the intervention was feasible to treat all eligible patients. The target of inference by OW is a special ATE, called ATO. OW is part of a class of balancing weights that target a judiciously chosen subpopulation from which an estimand is closely related to ATE [14]. OW produces equal total weights between groups by its definition, i.e., making the two groups overlapped in terms of PS values. For the FS method, we evaluated the two published weighting algorithms for ATE estimation: with and without equal total weights between groups. We found that the ATE-equ performed better than ATE-unequ in terms of covariate balance (SMD and MB) but both algorithms had almost identical model estimation in terms of bias and precision. In its formula, compared with ATE-equ, ATE-unequ includes a group sample size in its numerator to a subject in that group. This additional piece was designed to normalize and stabilize weights by limiting unduly large weights [20]. However, the additional piece unequaled total weights between groups, reducing covariate balance slightly, but did not affect model bias and precision.

We assume that our study met all the key assumptions for causal inference, including the stable unit treatment value assumption, the consistency assumption, and the positivity assumption [14, 35]. However, practical violations of the positivity assumption occur when some subjects almost always (or almost never) receive treatment [14], for example, those unmatched in combinations of binary covariates. Our study explored if removing those unmatched helped covariate balance and model estimation. This was a minor matter in our case, maybe because the proportion of those removed was very low, about 0.34% of the whole population. Using the reduced data compared to using the full data, the simulation results showed slightly smaller covariate imbalance, but slightly larger model bias and imprecision. That is, although covariate balance is slightly reduced by allowing those clearly unmatched subjects between groups, larger sample size kept model estimation less biased and imprecise, especially with infrequent outcome and/or exposure. In addition, FS further removed some subjects with extreme PS, due to their PS not in the overlapped PS region between groups. However, comparing FS with OW which did not remove any subjects, we confidently state that given balanced PS values between groups, including mismatched subjects does not affect model estimation in settings with infrequent outcome/exposure.

In our weighted GLM analysis, we used the cluster-robust method to estimate SE of the intervention effect. It is inaccurate to use the delta method, the default model-based method, when using matching weights, because it assumes weights are frequency weights rather than probability weights [28].

In simulation results for both homogeneous and heterogeneous scenarios, we observed that as outcome risk level increased, bias increased. Higher risks and stronger correlations among exposure, outcome, and covariates led to larger bias in effect estimation [36]. That is, higher confounding, which we did not adjust for in analysis, caused more bias. Among 17 covariates, more than half of them were confounders, i.e., associated with hospitalization rate. As outcome risk increased, these confounders had more confounding effect that resulted in larger bias. Adjusting for those confounders could have improved model precision and accuracy. However, we purposely did not adjust further for them in the modeling stage because in the real-world example, investigators did not know which covariates were real confounders.

In simulation results, we also observed that as exposure prevalence increased, MB values in the crude method decreased. One possible reason is that higher exposure, and stronger correlations between covariates and exposure, resulted in more covariate balance. Furthermore, we found that as rate of outcome and/or exposure increased, coverages decreased and even became zero. That is, when there was larger power, CIs became too narrow to cover the true effect. Their 100% significance rate confirmed the reason.

The choices of our performance criteria were based on the guidance of metrics for covariate balance [21]. The MB criterion, which considers pairwise correlations between covariates, provides new insights beyond SMD. This is the first study to use MB to evaluate OW. In some simulation settings, coverage probability could be a maximum of 100% because it is different from confidence level [32]. Our study also solved the issue of some misleading results using the coverage probability as a criterion [14] by providing two coverages and one significance to replace the traditional one.

Besides OW, the FS method performed relatively well comparing the crude method. The FQHC and non-FQHC groups had significantly overlapped PS distributions, and only < 0.05% subjects were removed due to non-overlapped PS between them. Just as in Desai et al.’s study evaluating FS [8], after balancing covariates, the PS distributions became perfectly overlapped in the empirical example. This indicates that the number of strata, 20, was sufficient.

Our simulation results for constant treatment effect are consistent with Ripollone et al.’s study which also used simulated claims data [17]. In the simulated outcome with risk level of 20%, 20% exposure prevalence, and a sample size of 25,000, their FS analysis had 0.054 MB, 0.07–0.08 bias, and 0.178–0.172 rMSE, while ours had 0.047 MB, 0.0183 bias, and 0.025 rMSE, given the sample size of > 40,000 (eTable 10).

In our study, the two study groups were quite similar in that their PS distributions were substantially overlapped. However, when comparator groups are very different, the advantages of OW are actually greatest [37]. This is because the OW method will add more weight on those overlapped PS regions and fewer weights on those tailed PS regions. Given the same situation, the FS method will remove more subjects from non-overlapped regions which results in more severe bias and probably less model precision due to reduced sample size.

Our study has some limitations. First, the OW method can be used to estimate only ATE on the ATO population, but not average treatment effect on the treated population (ATT). However, two studies showed that when the exposure prevalence is small, ATO approximates ATT [12, 35]. Second, due to simulating rare outcome (1%) and exposure (2.5 -10.55%), some simulated samples faced the issue of complete separation or quasi-complete separation of data points that caused model estimation to be unstable. More advanced modeling methods could be used such as Firth’s method [34] and Bayesian method [33]. However, this is beyond the goal of the study. Third, our simulation findings may not be generalizable because our simulations were based on one empirical study. However, both OW and FS have been separately evaluated in multiple studies. Fourth, our simulation did not consider misspecifications of a PS model and/or degrees of overlap of PS distributions. However, Zhou et al [14] conducted simulations for such situations to compare the performances of OW, IPW, and other weighting methods. They found that OW was robust to these situations. One possible reason they pointed out was that the estimand of OW was not defined on the estimated but true PS and OW smoothly down-weights the influence of observations at both end of PS spectrum [14]. Last, to estimate PS, we used a logistic regression, that is, a logit modeled as a linear combination of covariates. To capture complex dependency patterns between outcome and covariates, a machine learning method such as random forest may provide more accurate and less model dependent estimate of PS [38]. This will be our future work.


As demonstrated by our analysis with real-world and extensive simulated claims data, the OW method can yield nearly-perfect covariate balance while also retaining all of the sample. Therefore, OW can enhance the accuracy of ATE estimation over FS in most cases. Balancing covariates between treatment and control groups in observational studies can be challenging, especially in settings with infrequent outcomes and exposures. Both OW and FS methods can effectively balance covariates. These two different PS-based methods have been separately evaluated against other methods [8,9,10,11,12,13,14, 17] but have never been compared against each other. We found that OW generally led to better covariate balance and model precision. However, in settings with extremely rare outcomes (≤ 1%) and exposures (≤ 10%), OW performed slightly worse than FS in at least one evaluation criterion. Future studies should analyze scenarios with rare outcomes and exposures in more detail. In conclusion, OW could be considered an effective and easy-to-implement method for balancing covariates for ATE estimation in settings with infrequent but not too rare outcomes and exposures.

Data availability

Data are available from the Centers for Medicare & Medicaid Services (CMS) under data use agreement provisions. Per the data use agreement, the relevant limited datasets cannot be made publicly available. For any data request, please contact CMS via the link: After obtaining Taxes State Medicaid claims data in 2012, the definitions of study population with diabetes and the diabetes-related hospitalization outcome can be found in the main study published in Knitter et al. (2022, Medical Care) via the link: The authors confirm that we all did not have any special access privileges that others would not have.


Abbreviation :



Average treatment effect in the total population


The ATE method to generate equal total weights between groups


The ATE method to generate unequal total weights between groups


Average treatment effect on the overlap population


Average treatment effect on the treated population


95% confidence interval


Probability of the 95% confidence interval that covers the true effect, ignoring whehter zero was covered or not


Probability of the 95% confidence interval (CI) that covers the true effect, but not zero


No balancing method used


Federally qualified health centers


Propensity score-based fine stratification method

FSF−equ :

The FS method with a full set of data and subjects’ weights assigned by ATE-equ

FSF−uneq :

The FS method with a full set of data and subjects’ weights assigned by ATE-unequ

FSX−equ :

The FS method with a subset of data and subjects’ weights assigned by ATE-equ

FSX−unequ :

The FS method with a subset of data and subjects’ weights assigned by ATE-unequ


Inverse probability weighting method


Mahalanobis balance


Propensity score-based Overlap weighting method


The OW method with a full set of data


The OW method with a subset of data


Propensity score


Propensity score matching method


Mean relative bias


Square root of mean squared error of bias


Standard deviation of rbias


Standardized mean difference between the two groups


  1. King G, Nielsen R. Why Propensity scores should not be used for matching. Political Anal. 2019;27(4).

  2. Ho DE, Imai K, King G, Stuart EA. Matching as nonparametric preprocessing for reducing Model Dependence in Parametric Causal Inference. Political Anal. 2007;15:199–236.

    Article  Google Scholar 

  3. Imai K, King G, Nall C. The essential role of pair matching in cluster-randomized experiments, with application to the Mexican Universal Health Insurance Evaluation. Stat Sci. 2009;24(1):29–53.

    Article  Google Scholar 

  4. Iacus SM, King G, Porro G. Multivariate Matching methods that are Monotonic Imbalance Bounding. J Am Stat Assoc. 2011;106(493):345–61.

    Article  CAS  Google Scholar 

  5. Rosenbaum PR, Rubin DB. Reducing Bias in Observational studies using subclassification on the Propensity score. J Am Stat Assoc. 1984;79(387):516–24.

    Article  Google Scholar 

  6. Desai RJ, Franklin JM. Alternative approaches for confounding adjustment in observational studies using weighting based on the propensity score: a primer for practitioners. BMJ. 2019;367:l5657. PubMed PMID: 31645336.

    Article  PubMed  Google Scholar 

  7. Austin PC. An introduction to Propensity score methods for reducing the effects of confounding in Observational studies. Multivar Behav Res. 2011;46(3):399–424. doi: 10.1080/00273171.2011.568786. PubMed PMID: 21818162; PubMed Central PMCID: PMCPMC3144483.

    Article  Google Scholar 

  8. Desai RJ, Rothman KJ, Bateman BT, Hernandez-Diaz S, Huybrechts KF. A propensity-score-based Fine Stratification Approach for Confounding Adjustment when exposure is infrequent. Epidemiology. 2017;28(2):249–57. doi: 10.1097/EDE.0000000000000595. PubMed PMID: 27922533; PubMed Central PMCID: PMCPMC5497217.

    Article  PubMed  PubMed Central  Google Scholar 

  9. Stuart EA. Matching methods for causal inference: a review and a look forward. Stat Sci. 2010;25(1):1–21. Epub 2010/09/28. doi: 10.1214/09-STS313. PubMed PMID: 20871802; PubMed Central PMCID: PMCPMC2943670.

    Article  PubMed  PubMed Central  Google Scholar 

  10. Lee BK, Lessler J, Stuart EA. Weight trimming and propensity score weighting. PLoS ONE. 2011;6(3):e18174. PubMed PMID: 21483818; PubMed Central PMCID: PMCPMC3069059.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. Hirano K, Imbens GW. Estimation of Causal effects using propensity score weighting: an application to data on Right Heart catheterization. Health Serv Outcomes Res Methodol Volume. 2001;2:259–78.

    Article  Google Scholar 

  12. Li F, Morgan KL, Zaslavsky AM. Balancing covariates via Propensity score weighting. J Am Stat Assoc. 2018;113(521):390–400.

    Article  CAS  Google Scholar 

  13. Li F, Thomas LE, Li F. Addressing Extreme Propensity Scores via the Overlap Weights. Am J Epidemiol. 2019;188(1):250-7. PubMed PMID: 30189042.

  14. Zhou Y, Matsouaka RA, Thomas L. Propensity score weighting under limited overlap and model misspecification. Stat Methods Med Res. 2020;29(12):3721–56. Epub 2020/07/23. doi: 10.1177/0962280220940334. PubMed PMID: 32693715.

    Article  PubMed  Google Scholar 

  15. Benedetto U, Head SJ, Angelini GD, Blackstone EH. Statistical primer: propensity score matching and its alternatives. Eur J Cardiothorac Surg. 2018;53(6):1112–7. PubMed PMID: 29684154.

    Article  PubMed  Google Scholar 

  16. Chatton A, Borgne FL, Leyrat C, Foucher Y. G-computation and doubly robust standardisation for continuous-time data: a comparison with inverse probability weighting. Stat Methods Med Res. 2022;31(4):706–18. 10.1177/09622802211047345. PubMed PMID: 34861799.

    Article  PubMed  Google Scholar 

  17. Ripollone JE, Huybrechts KF, Rothman KJ, Ferguson RE, Franklin JM. Evaluating the utility of coarsened exact matching for Pharmacoepidemiology using real and simulated Claims Data. Am J Epidemiol. 2020;189(6):613–22. PubMed PMID: 31845719; PubMed Central PMCID: PMCPMC7368132.

    Article  PubMed  Google Scholar 

  18. SAS. SAS/STAT 14.3 User’s Guide: The PSMATCH Procedure. SAS. 2017;

  19. Guo S, Fraser MW. Propensity score analysis: statistical methods and applications. Thousand Oaks, CA: Sage; 2015.

    Google Scholar 

  20. Hong G. Marginal Mean Weighting through Stratification: Adjustment for Selection Bias in Multilevel Data. J Educational Behav Stat. 2010;35(5):499–531.

    Article  Google Scholar 

  21. Franklin JM, Rassen JA, Ackermann D, Bartels DB, Schneeweiss S. Metrics for covariate balance in cohort studies of causal effects. Stat Med. 2014;33(10):1685–99. PubMed PMID: 24323618.

    Article  PubMed  Google Scholar 

  22. Austin PC. Using the standardized difference to compare the prevalence of a Binary Variable between two groups in Observational Research. Commun Stat - Simul Comput. 2009;38(6):1228–34.

    Article  Google Scholar 

  23. Ripollone JE, Huybrechts KF, Rothman KJ, Ferguson RE, Franklin JM. Implications of the Propensity score matching Paradox in Pharmacoepidemiology. Am J Epidemiol. 2018;187(9):1951–61. PubMed PMID: 29750409; PubMed Central PMCID: PMCPMC6118075.

    Article  PubMed  PubMed Central  Google Scholar 

  24. Yang S, Starks MA, Hernandez AF, Turner EL, Califf RM, O’Connor CM, et al. Impact of baseline covariate imbalance on bias in treatment effect estimation in cluster randomized trials: race as an example. Contemp Clin Trials. 2020;88:105775. PubMed PMID: 31228563; PubMed Central PMCID: PMCPMC8337048.

    Article  PubMed  Google Scholar 

  25. Franklin JM, Schneeweiss S, Polinski JM, Rassen JA. Plasmode simulation for the evaluation of pharmacoepidemiologic methods in complex healthcare databases. Comput Stat Data Anal. 2014;72:219–26. PubMed PMID: 24587587; PubMed Central PMCID: PMCPMC3935334.

    Article  PubMed  Google Scholar 

  26. Vaughan LK, Divers J, Padilla M, Redden DT, Tiwari HK, Pomp D, et al. The use of plasmodes as a supplement to simulations: a simple example evaluating individual admixture estimation methodologies. Comput Stat Data Anal. 2009;53(5):1755–66. PubMed PMID: 20161321; PubMed Central PMCID: PMCPMC2678733.

    Article  PubMed  PubMed Central  Google Scholar 

  27. Franklin JM, Eddings W, Glynn RJ, Schneeweiss S. Regularized regression Versus the high-dimensional propensity score for Confounding Adjustment in secondary database analyses. Am J Epidemiol. 2015;182(7):651–9. PubMed PMID: 26233956.

    Article  PubMed  Google Scholar 

  28. Greifer N, Estimating Effects After M. 2022;

  29. Liang K-Y, Zeger S. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13–22.

    Article  Google Scholar 

  30. Blackwell M, Iacus S, King G. Cem: coarsened exact matching in Stata. Stata J. 2009;9(4):524–46.

    Article  Google Scholar 

  31. Trikalinos T, Hoaglin D, Schmid C, Empirical. and Simulation-Based Comparison of Univariate and Multivariate Meta-Analysis for Binary Outcomes. 2013;

  32. Romano JL, Kromrey JD, Hibbard ST. A Monte Carlo Study of eight confidence interval methods for Coefficient Alpha. Educ Psychol Meas. 2010;70(3):376–93.

    Article  Google Scholar 

  33. Rainey C. Dealing with separation in logistic regression models. Political Anal. 2016;24:339–55.

    Article  Google Scholar 

  34. Heinze G, Schemper M. A solution to the problem of separation in logistic regression. Stat Med. 2002;21(16):2409–19. Epub 2002/09/05.

    Article  PubMed  Google Scholar 

  35. Matsouaka RA, Zhou Y. A framework for causal inference in the presence of extreme inverse probability weights: the role of overlap weights. Math arXiv: Methodol. 2020;

  36. Sjoding MW, Luo K, Miller MA, Iwashyna TJ. When do confounding by indication and inadequate risk adjustment bias critical care studies? A simulation study. Crit Care. 2015;19:195. Epub 2015/05/01.

    Article  PubMed  PubMed Central  Google Scholar 

  37. Thomas LE, Li F, Pencina MJ. Overlap weighting: a propensity score method that mimics attributes of a Randomized Clinical Trial. JAMA. 2020;323(23):2417–8. PubMed PMID: 32369102.

    Article  PubMed  Google Scholar 

  38. Zhao P, Su X, Ge T, Fan J. Propensity score and proximity matching using random forest. Contemp Clin Trials. 2016;47:85–92. Epub 2015/12/27.

    Article  PubMed  Google Scholar 

Download references


We would like to thank Neda Laiteerapong, MD, and her team members for inputs on how to identify diabetic adults in Medicaid. Furthermore, we would like to express our gratitude to all reviewers and editors who made excellent comments and helped us strengthen the work greatly. 


All authors have no financial/commercial conflicts of interests. This study was funded by Health Resources and Services Administration (HRSA) (HHSH250201300025I) (MPI: Huang and Chin). Drs. Wan, Chin, and Huang were supported in part by the Chicago Center for Diabetes Translation Research (NIDDK P30 DK092949).

Author information

Authors and Affiliations



WW, MM, RSN, MHC, and ESH had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.

Study concept and design: All authors.

Acquisition, analysis, or interpretation of data: All authors.

Drafting of the manuscript: WW.

Critical revision of the manuscript for important intellectual content: All authors.

Statistical analysis: WW

Obtained funding: ESH and MHC.

Administrative, technical, or material support: All authors.

Study supervision: All authors.

Corresponding author

Correspondence to Wen Wan.

Ethics declarations

Ethics approval and consent to participate

The study was reviewed by the University of Chicago Institutional Review Board and determined to be non-human subject research. There was no specific informed consent for the study due to de-identification of the Medicaid data purchased from the Centers for Medicare and Medicaid Services (CMS) under a data use agreement.

Competing interests

The authors declare no competing interests.

Consent for publication

Not applicable.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Wan, W., Murugesan, M., Nocon, R.S. et al. Comparison of two propensity score-based methods for balancing covariates: the overlap weighting and fine stratification methods in real-world claims data. BMC Med Res Methodol 24, 122 (2024).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: