Skip to main content

Quasi-rerandomization for observational studies



In the causal analysis of observational studies, covariates should be carefully balanced to approximate a randomized experiment. Numerous covariate balancing methods have been proposed for this purpose. However, it is often unclear what type of randomized experiments the balancing approaches aim to approximate; and this may cause ambiguity and hamper the synthesis of balancing characteristics within randomized experiments.


Randomized experiments based on rerandomization, known for significant improvement on covariate balance, have recently gained attention in the literature, but no attempt has been made to integrate this scheme into observational studies for improving covariate balance. Motivated by the above concerns, we propose quasi-rerandomization, a novel reweighting method, where observational covariates are rerandomized to be the anchor for reweighting such that the balanced covariates obtained from rerandomization can be reconstructed by the weighted data.


Through extensive numerical studies, not only does our approach demonstrate similar covariate balance and comparable estimation precision of treatment effect to rerandomization in many situations, but it also exhibits advantages over other balancing techniques in inferring the treatment effect.


Our quasi-rerandomization method can approximate the rerandomized experiments well in terms of improving the covariate balance and the precision of treatment effect estimation. Furthermore, our approach shows competitive performance compared with other weighting and matching methods. The codes for the numerical studies are available at

Peer Review reports


Randomized experiments are widely recognized as the gold standard for causal inference, due to the covariate balance and objectivity in treatment assignment [1, 2]. However, randomized experiments may be infeasible due to financial or ethical reasons, and it is often costly and takes a long time to conduct such experiments that may delay decision making. There is a tendency of using real-world evidence in observational studies with non-randomized data to infer the causal treatment effect in practice.

To analyze observational data, Bind and Rubin [3] recently advocated embedding the observational study in the context of a hypothetical randomized experiment, motivated by the ideas in [1, 4]. Specifically, they divided the analysis procedure into four major stages: (1) a conceptual stage that formulates the causal questions with the related assumptions in terms of a hypothetical randomized experiment; (2) a design stage that reconstructs the hypothetical randomized experiment on the observed data without access to the outcome data; (3) a statistical analysis stage that estimates the causal effect; and (4) a summary stage that summarizes the findings for the causal question.

In the cardinal design stage, many approaches have been proposed to approximate randomized experiments in terms of the covariate balance and thus reduce the estimation bias of treatment effect. One popular strategy uses a matching procedure, which generally assembles the units with similar propensity scores [5] between the treatment and control groups, including but not limited to the nearest-neighbor matching [6], optimal matching [7] and full matching [8]. One can refer to [9] for a systematic review of various matching approaches. Another scheme reweights observations to improve the covariate balance. One leading paradigm of reweighting is based on the inverse of propensity scores [10]. Imai and Ratkovic [11] leveraged the dual properties of propensity scores as a covariate balancing score to reweight samples. Hainmueller [12], Zubizarreta [13] and Chan et al. [14] directly optimized the sample weights to attain a set of predefined balancing conditions.

Although the aforementioned methods can improve the covariate balance, they cannot formally specify which type of randomized experiment is approximated. Such ambiguity would conceptually diminish the credibility of the whole observational analysis, because the experiment reconstructed at the design stage may not agree with the one considered at the conceptual stage. Furthermore, it may technically hinder the existing balancing approaches from synthesizing the valuable characteristics of some randomized experiments. Particularly, the celebrated rerandomization (ReR) proposed by Morgan and Rubin [15] has been widely recognized to outperform the classical complete randomization (CR) in terms of covariate balance [16]. It is thus highly preferable to make the balanced data approximate the rerandomized experiment rather than the CR experiment. Branson [17] proposed a test to diagnose the covariate balance of a matched dataset in contrast with rerandomzation, rather than directly balancing the observational data. Although the effectiveness of ReR has provoked further research for accommodating more complex experiments [18,19,20] and high-dimensional covariates [21,22,23], none of those extensions formally considered integrating rerandomization into the observational data analysis.

To address the above concerns, it is worth noting that only covariates are required for randomization. Therefore, we can also randomize the covariates in an observational study and then leverage the randomized data as the template to adjust the observational data. In this way, we bridge the observational study with the nominal randomized experiment, where adjusted observational data can directly imitate the appealing balancing properties. Towards this goal, we propose a reweighting approach, called quasi-rerandomization (QReR), which learns a generative neural network to yield random weight vectors such that the corresponding weighted datasets possess similar virtues of covariate balance to the rerandomized datasets.

Our approach has several advantages at the statistical analysis stage. First, our weight vectors can be conveniently paired with any weighted estimator for estimating the treatment effect. Second, it is allowed to ensemble multiple diverse weight vectors for improving estimation precision. Empirically, we compare the proposed method with the original rerandomization and other balancing methods through extensive numerical experiments. Not only does QReR demonstrate similar covariate balance and estimation performance to rerandomization in many situations, but it also shows superiority in estimating treatment effect in comparison with other balancing approaches, especially under the setting with complex response surfaces.

The remainder of this paper is organized as follows. In ‘Methods’, we introduce the problem setup of causal inference in observational studies and review the fundamental concepts of rerandomization. In particular, the subsection ‘Quasi-Rerandomization’ introduces the proposed QReR method in detail. In ‘Experiments and Discussion’, we conduct simulated experiments to compare our approach with rerandomization and other balancing algorithms as well as demonstrate the feasibility of our method with a real data example. We conclude with a discussion in ‘Conclusions’.


Treatment effect in observational studies

Suppose that the observational data consist of N units. Let \(\varvec{T}=(T_1,\dots ,T_N)\in \{0,1\}^{N}\) denote the treatment allocation vector for all units, where \(T_i=1\) if the ith unit is assigned to the treatment group and \(T_i=0\) if it is assigned to the control group. We define \(N_1=\sum _{i=1}^NT_i\) and \(N_0=\sum _{i=1}^N(1-T_i)\) to be the corresponding sample sizes for treatment and control groups. Let \(\varvec{X}_i=(X_{i1},\dots ,X_{id})^{\top }\in \mathbb {R}^{d}\) be the observed covariates for the ith unit. Following the potential outcome framework [24], each unit is associated with two potential outcomes \(Y_i(0)\) and \(Y_i(1)\) but only one of them can be observed,

$$\begin{aligned} Y_i^{\textrm{obs}}=Y_i(T_i)=T_iY_i(1)+(1-T_i)Y_i(0). \end{aligned}$$

It is typically assumed that there is no interference of the treatment effect between units and no hidden versions of treatment, known as the stable unit treatment value assumption (SUTVA) [25]. Furthermore, we impose the strongly ignorable assumption [5], such that the treatment assignment is independent of potential outcomes given the observed covariates and each unit has a chance to receive the treatment.

Given the samples \(\{(\varvec{X}_i,Y_i^{\textrm{obs}},T_i)\}_{i=1}^N\), we aim to estimate the sample average treatment effect (SATE),

$$\begin{aligned} \tau _{\textrm{SATE}} = \frac{1}{N}\sum \limits _{i=1}^N\{Y_i(1)-Y_i(0)\}, \end{aligned}$$

which is also the causal estimand in rerandomization. In addition, we consider the estimation of the population average treatment effect (PATE),

$$\begin{aligned} \tau _{\textrm{PATE}} = \mathbb {E}(\tau _{\textrm{SATE}}) = \mathbb {E}\{Y(1)-Y(0)\}, \end{aligned}$$

where the population refers to the set from which the finite observations are sampled. The estimand \(\tau _{\textrm{PATE}}\) is commonly considered in various balancing algorithms [9, 13, 14].


Given a covariate matrix \(\varvec{X}=(\varvec{X}_1,\dots ,\varvec{X}_N)^{\top }\in \mathbb {R}^{N\times d}\), we elaborate on how a rerandomized experiment can be carried out to generate a balanced allocation and conduct inference. We first randomly generate a vector denoted by \(\widetilde{\varvec{T}}=(\widetilde{T}_1,\dots ,\widetilde{T}_N)^{\top }\in \mathbb {R}^{N}\) with the constraint

$$\begin{aligned} \sum \limits _{i=1}^N\widetilde{T}_i=N_1,\quad \sum \limits _{i=1}^N(1-\widetilde{T}_i)=N_0. \end{aligned}$$

The covariate balance under the allocation \(\widetilde{\varvec{T}}\) is then evaluated by the Mahalanobis distance,

$$\begin{aligned} D(\varvec{X},\widetilde{\varvec{T}})=\varvec{\Delta }(\widetilde{\varvec{T}})^{\top }\left[ \textrm{cov}\{\varvec{\Delta }(\widetilde{\varvec{T}})\}\right] ^{-1}\varvec{\Delta }(\widetilde{\varvec{T}})=\frac{N_1N_0}{N}\varvec{\Delta }(\widetilde{\varvec{T}})^{\top }\left\{ \widehat{\textrm{cov}}(\varvec{X})\right\} ^{-1}\varvec{\Delta }(\widetilde{\varvec{T}}), \end{aligned}$$

where the vector \(\varvec{\Delta }(\widetilde{\varvec{T}}) = \bar{\varvec{X}}_1-\bar{\varvec{X}}_0\) is the mean difference of covariates between the treatment and control groups with \(\bar{\varvec{X}}_1=\sum _{i=1}^N\widetilde{T}_i\varvec{X}_i/N_1\) and \(\bar{\varvec{X}}_0=\sum _{i=1}^N(1-\widetilde{T}_i)\varvec{X}_i/N_0\). The matrix \(\textrm{cov}\left\{ \varvec{\Delta }(\widetilde{\varvec{T}})\right\}\) refers to the covariance of \(\varvec{\Delta }(\widetilde{\varvec{T}})\) regarding all random \(\widetilde{\varvec{T}}\)’s under the constraint (1), and \(\widehat{\textrm{cov}}(\varvec{X})\) is the sample covariance matrix with respect to \(\varvec{X}\). We accept the allocation \(\widetilde{\varvec{T}}\) if the corresponding Mahalanobis distance is no larger than a prefixed threshold \(a>0\), i.e., \(D(\varvec{X},\widetilde{\varvec{T}})\le a\); otherwise a new allocation \(\widetilde{\varvec{T}}\) is generated. Morgan and Rubin [15] showed that \(D(\varvec{X},\widetilde{\varvec{T}})\) asymptotically follows a chi-squared distribution so that one can determine the threshold a by \(P(\chi ^2_d\le a)=p_a\) given a predefined acceptance probability \(p_a\in (0,1]\). A smaller \(p_a\) used in rerandomization leads to more balanced covariates, and when \(p_a=1\), rerandomization reduces to the complete randomization.

Connecting rerandomization and observational studies

For the observational data analysis, the covariates \(\varvec{X}_i\)’s should be carefully balanced to approximate a hypothetical randomized experiment without access to the outcome variables. This stage helps to reduce the underlying confounding effects in the observational data for estimating \(\tau _{\textrm{SATE}}\) or \(\tau _{\textrm{PATE}}\). Particularly, most existing balancing techniques are proposed to essentially achieve some of the following empirical equations,

$$\begin{aligned} \sum \limits _{i=1}^{N}W_iT_if(\varvec{X}_i)= & {} \sum \limits _{i=1}^{N}W_i(1-T_i)f(\varvec{X}_i), \end{aligned}$$
$$\begin{aligned} \frac{1}{N}\sum \limits _{i=1}^{N}f(\varvec{X}_i)= & {} \sum \limits _{i=1}^{N}W_iT_if(\varvec{X}_i), \end{aligned}$$
$$\begin{aligned} \frac{1}{N}\sum \limits _{i=1}^{N}f(\varvec{X}_i)= & {} \sum \limits _{i=1}^{N}W_i(1-T_i)f(\varvec{X}_i). \end{aligned}$$

where \(\varvec{W}=(W_1,\dots ,W_N)\) denotes the vector of sample weights, and f refers to an vector-valued function of \(\varvec{X}_i\)’s. Heuristically, the first equation directly ensures the covariate balance between treatment and control groups, which is typically leveraged by propensity-score-based approaches. The propensity score \(\pi (\varvec{X}_i,\varvec{\eta })=P(T_i=1|\varvec{X}_i)\) can be estimated for each sample with unknown parameter \(\varvec{\eta }\). It can be shown that the propensity scores satisfy \(W_i=T_i/\pi (\varvec{X}_i,\varvec{\eta })+(1-T_i)/\{1-\pi (\varvec{X}_i,\varvec{\eta })\}\) and \(f(\varvec{X}_i)=\partial \pi (\varvec{X}_i,\varvec{\eta })/\partial \varvec{\eta }\) with respect to (3) [11]. Equations (4) and (5) quantify the fact that randomized covariates in treatment and control groups have similar characteristics to the pooled covariates, which also imply the covariate balance in Eq. (3). Some balancing techniques hereby treat \(W_i\)’s as unknown parameters for optimization, and modify (4) and (5) as either hard equality constraints [12, 14] or soft constraints, namely \(|\frac{1}{N}\sum _{i=1}^{N}f(\varvec{X}_i) -\sum _{i=1}^{N}W_iT_if(\varvec{X}_i)|\le \delta\) with a threshold \(\delta >0\) [13]. The function \(f(\cdot )\) is typically specified as the covariate moment, such as \(f(\varvec{X}_i)=\varvec{X}_i\), in those constraints.

However, the aforementioned approaches do not clarify what type of randomized experiment (e.g., complete randomization, rerandomization, or constrained randomization) they intend to approximate, which not only leads to conceptual uncertainty, but can also prevent integrating the properties of randomized experiments into balancing the covariates. One salient potential of such integration is to directly and objectively calibrate the covariate balance based on properly randomized covariates without assuming a parametric propensity score model for (3) or completely depending on implicit balancing constraints in (4) and (5). Finally, one may also exploit the superior balance of some randomized experiments, such as rerandomization rather than complete randomization, to enhance efficiency and inference in observational studies.

We hence propose the quasi-rerandomization (QReR), a reweighting method, to bridge rerandomization and observational studies. We first conduct rerandomization over the covariates and generate multiple acceptable allocation vectors, because rerandomization does not require the availability of responses. We then compute a balancing metric based on (2) for all acceptable assignments. Those metric values imply how the covariates would be balanced under the rerandomized experiment, and thus can be adopted as the anchor to guide further balance adjustment based on (3), (4) and (5) for the observational data.


In quasi-rerandomization, we first generate a large number of acceptable rerandomized allocations. A transformation function is then fitted via a neural network, which generates weight vectors from Dirichlet noises such that the weighted covariates have comparable balance properties to the rerandomized covariates.

Specifically, the sample weights satisfy \(\varvec{W}^{\top }\varvec{T}=1\) and \(\varvec{W}^{\top }(\varvec{1}_N-\varvec{T})=1\), where \(\varvec{1}_N\in \mathbb {R}^{N}\) is a vector of length N with all entries of 1, and \(\varvec{T}=(T_1,\dots ,T_N)\) is the observed treatment allocation vector. Without loss of generality, we assume that \(\varvec{T}=(\varvec{1}^{\top }_{N_1},\varvec{0}^{\top }_{N_0})^{\top }\), and the weight vector can thus be rewritten as \(\varvec{W}=(\varvec{W}_1^{\top },\varvec{W}_0^{\top })^{\top }\), where \(\varvec{W}_1\) and \(\varvec{W}_0\) respectively denote the sub-vectors in \(\varvec{W}\) for the treatment and control units with \(\varvec{W}_1^{\top }\varvec{1}_{N_1}=1\) and \(\varvec{W}_0^{\top }\varvec{1}_{N_0}=1\). We assume that \(\varvec{W}_1\) and \(\varvec{W}_0\) are drawn from Dirichlet distributions \(\textrm{Dir}(\varvec{1}_{N_1})\) and \(\textrm{Dir}(\varvec{1}_{N_0})\), respectively. We aim to learn a transformation function \(\widetilde{\varvec{W}} = G(\varvec{W}|\varvec{X},\varvec{\theta })\) with parameter \(\varvec{\theta }\) such that the weighted covariate mean difference with respect to \(\widetilde{\varvec{W}}\),

$$\begin{aligned} \varvec{\Delta }(\widetilde{\varvec{W}}) = \sum \limits _{i=1}^N\widetilde{W}_{i}T_{i}\varvec{X}_{i}-\sum \limits _{i=1}^N\widetilde{W}_{i}(1-T_{i})\varvec{X}_{i}, \end{aligned}$$

has a similar distribution to \(\varvec{\Delta }(\widetilde{\varvec{T}})\). Intuitively, we treat the rerandomized covariate mean difference \(\varvec{\Delta }(\widetilde{\varvec{T}})\) as the template to adjust the weighted counterpart \(\varvec{\Delta }(\widetilde{\varvec{W}})\), where \(\varvec{\Delta }(\widetilde{\varvec{W}})\) corresponds to (3) with \(f(\varvec{X}_i)=\varvec{X}_i\). As a result, the reweighted observational data imitate the balance characteristics as in the rerandomized data. We specify \(G(\varvec{W}|\varvec{X},\varvec{\theta })\) as a multi-layer neural network. The neural network has two hidden layers with each layer containing 512 neurons, where the number of neurons is selected based on the empirical observation that this value is sufficient in most cases. We adopt the ReLU activation function [26] for both hidden layers. We also apply the dropout scheme [27] with a dropout rate of 0.5 after each hidden layer, which is a common choice as suggested by [28]. The output layer applies two separate softmax functions to yield the unified weight vector \(\widetilde{\varvec{W}}=(\widetilde{\varvec{W}}_1^{\top },\widetilde{\varvec{W}}_0^{\top })^{\top }\).

The rationale for approximating the distribution of the covariate mean difference \(\varvec{\Delta }(\widetilde{\varvec{T}})\) from rerandomization is given as follows. First, the vector \(\varvec{\Delta }(\widetilde{\varvec{T}})\) itself is a balance measure, and thus the approximation can ensure the fundamental balancing property of our weighted data. Second, the distribution-based approximation can further incorporate the characteristics of covariate balance from rerandomization into the weighted data. In rerandomization, the covariate balance of all acceptable allocations is determined by the distribution of Mahalanobis distance, which is fully governed by \(\varvec{\Delta }(\widetilde{\varvec{T}})\) because the sample covariance matrix \(\widehat{\textrm{cov}}(\varvec{X})\) is fixed for the observed covariates as illustrated by (2). Moreover, the distribution of \(\varvec{\Delta }(\widetilde{\varvec{T}})\) contains more abundant high-dimensional information among covariates for the balancing property in contrast to that of Mahalanobis distance. Finally, the distribution-based approximation can also provide randomness and diversity for our generated weight vectors, which offers more options for making inference at the statistical analysis stage, such as using some ensembling techniques to estimate the treatment effects. Specifically, one may follow the idea of bagging [29] by separately applying a weighted treatment effect estimator (e.g., the weighted mean difference of response and the doubly robust estimator [30]) to each weight vector and then aggregating all estimators by taking the mean or median.

Motivated by [31], we adopt the maximum mean discrepancy (MMD) proposed by [32, 33] as the loss function to minimize the distribution deviation between \(\varvec{\Delta }(\widetilde{\varvec{W}})\) and \(\varvec{\Delta }(\widetilde{\varvec{T}})\). Let \(\{\varvec{\delta }(\widetilde{\varvec{T}}^{(b)})\}_{b=1}^B\) and \(\{\varvec{\delta }(\widetilde{\varvec{W}}^{(b)})\}_{b=1}^B\) be the samples for loss calculation, where \(\{\widetilde{\varvec{T}}^{(b)}\}_{b=1}^B\) are initially obtained from rerandomization, and \(\{\widetilde{\varvec{W}}^{(b)}\}_{b=1}^B\) with \(\widetilde{\varvec{W}}^{(b)}=G({\varvec{W}}^{(b)}|\varvec{X},\varvec{\theta })\) are obtained through transformation from the initial Dirichlet weights \(\{{\varvec{W}}^{(b)}\}_{b=1}^B\). Based on those samples, the MMD loss is defined as

$$\begin{aligned} \mathcal {L}_{\textrm{MMD}}(\varvec{\theta })= & {} \left\| \frac{1}{B}\sum \limits _{b=1}^{B}\phi \left\{ \varvec{\delta }(\widetilde{\varvec{W}}^{(b)})\right\} -\frac{1}{B} \sum \limits _{b=1}^{B}\phi \left\{ \varvec{\delta }(\widetilde{\varvec{T}}^{(b)})\right\} \right\| _2 \nonumber \\= & {} \left[ \frac{1}{B^2}\sum \limits _{i=1}^{B}\sum \limits _{j=1}^{B}\phi \left\{ \varvec{\delta }(\widetilde{\varvec{W}}^{(i)})\right\} ^{\top } \phi \left\{ \varvec{\delta }(\widetilde{\varvec{W}}^{(j)})\right\} \right. \nonumber \\{} & {} +\frac{1}{B^2}\sum \limits _{i=1}^{B}\sum \limits _{j=1}^{B}\phi \left\{ \varvec{\delta }(\widetilde{\varvec{T}}^{(i)})\right\} ^{\top } \phi \left\{ \varvec{\delta }(\widetilde{\varvec{T}}^{(j)})\right\} \nonumber \\{} & {} \left. -\frac{2}{B^2}\sum \limits _{i=1}^{B}\sum \limits _{j=1}^{B}\phi \left\{ \varvec{\delta }(\widetilde{\varvec{W}}^{(i)})\right\} ^{\top } \phi \left\{ \varvec{\delta }(\widetilde{\varvec{T}}^{(j)})\right\} \right] ^{1/2} \nonumber \\= & {} \left[ \frac{1}{B^2}\sum \limits _{i=1}^{B}\sum \limits _{j=1}^{B}K\left\{ \varvec{\delta }(\widetilde{\varvec{W}}^{(i)}), \varvec{\delta }(\widetilde{\varvec{W}}^{(j)})\right\} \right. \nonumber \\{} & {} +\frac{1}{B^2}\sum \limits _{i=1}^{B}\sum \limits _{j=1}^{B}K\left\{ \varvec{\delta }(\widetilde{\varvec{T}}^{(i)}), \varvec{\delta }(\widetilde{\varvec{T}}^{(j)})\right\} \nonumber \\{} & {} \left. -\frac{2}{B^2}\sum \limits _{i=1}^{B}\sum \limits _{j=1}^{B}K\left\{ \varvec{\delta }(\widetilde{\varvec{W}}^{(i)}), \varvec{\delta }(\widetilde{\varvec{T}}^{(j)})\right\} \right] ^{1/2}, \end{aligned}$$

where \(\phi (\varvec{x})\) is the feature mapping vector, and \(K(\varvec{x},\varvec{y})=\phi (\varvec{x})^{\top }\phi (\varvec{y})\) is the corresponding kernel function with respect to any feature vectors \(\varvec{x}, \varvec{y}\).

If \(\phi\) is the identity mapping, the MMD loss reduces to the Euclidean norm of the difference between sample means of \(\{\varvec{\delta }(\widetilde{\varvec{T}}^{(b)})\}_{b=1}^B\) and \(\{\varvec{\delta }(\widetilde{\varvec{W}}^{(b)})\}_{b=1}^B\), i.e., the deviation between the first moments. Using the kernel trick, one can implicitly map the sample vectors to a high-dimensional and nonlinear feature space, and the loss would correspond to the difference between higher-order moments of two samples. Gretton et al. [32, 33] showed that when the feature space is a universal reproduced kernel Hilbert space, the MMD loss asymptotically equals zero if and only if the distributions of \(\varvec{\Delta }(\widetilde{\varvec{T}})\) and \(\varvec{\Delta }(\widetilde{\varvec{W}})\) are the same. We choose the common radial basis function (RBF) kernel \(K(\varvec{x}_1,\varvec{x}_2)=\exp (- \gamma \Vert \varvec{x}_1-\varvec{x}_2\Vert ^2_2)\) with bandwidth \(\gamma\), whose feature space consists of moments of all orders.

We incorporate two additional regularization terms to the MMD loss to further improve the balancing and inference properties for the generated weights. It is easy to derive the relationships concerning any allocation vector \({\varvec{T}}\) between the fixed sample mean \(\bar{\varvec{X}}=\sum _{i=1}^N\varvec{X}_i/N\) and the sample means of treatment and control groups, \(\bar{\varvec{X}}_1\) and \(\bar{\varvec{X}}_0\),

$$\begin{aligned} \varvec{\delta }(\varvec{T}) = \frac{N}{N_0}\left( \bar{\varvec{X}}_1-\bar{\varvec{X}}\right) = -\frac{N}{N_1}\left( \bar{\varvec{X}}_0-\bar{\varvec{X}}\right) . \end{aligned}$$

Thus, the balance measure \(\varvec{\delta }(\varvec{T})\) can be represented by the mean difference \(\bar{\varvec{X}}-\bar{\varvec{X}}_1\) or \(\bar{\varvec{X}}-\bar{\varvec{X}}_0\). Moreover, \(\bar{\varvec{X}}_1\) and \(\bar{\varvec{X}}_0\) are close to the fixed \(\bar{\varvec{X}}\) under the rerandomization due to \(\varvec{\delta }(\widetilde{\varvec{T}})\approx \varvec{0}\) when \(\varvec{T}=\widetilde{\varvec{T}}\), i.e., \(\bar{\varvec{X}}\approx \bar{\varvec{X}}_1\approx \bar{\varvec{X}}_0\). The first regularizer aims to retain the above characteristics of rerandomization for QReR,

$$\begin{aligned} \mathcal {R}_1(\varvec{\theta }) =\frac{1}{B}\sum \limits _{b=1}^B\left( \left\| \sum \limits _{j=1}^N\widetilde{W}^{(b)}_{j}T_{j}\varvec{X}_{j}-\bar{\varvec{X}}\right\| _2^2+\left\| \sum \limits _{j=1}^N\widetilde{W}^{(b)}_{j}(1-T_{j})\varvec{X}_{j}-\bar{\varvec{X}}\right\| _2^2\right) . \end{aligned}$$

This regularizer essentially ensures the Eqs. in (4) and (5) for constraining the imbalance among covariates. To avoid extreme values among the weights, we introduce another regularization term,

$$\begin{aligned} \mathcal {R}_2(\varvec{\theta }) = \frac{1}{B}\sum \limits _{b=1}^B\left( \left\| \widetilde{\varvec{W}}_1^{(b)}-\varvec{1}_{N_1}/N_1\right\| _2^2 +\left\| \widetilde{\varvec{W}}_0^{(b)}-\varvec{1}_{N_0}/N_0\right\| _2^2\right) . \end{aligned}$$

Modified from the objective function of the stable balancing weights [13], this regularizer can control the variation of the transformed weights within each treatment group and help to avoid extreme weights, which can further improve inference, such as reducing the variance of the weighted point estimator. The uniform weights \(\varvec{1}_{N_1}/N_1\) and \(\varvec{1}_{N_0}/N_0\) are widely used as the base weights [12,13,14]. Based on (7) and (8), we estimate the parameter \(\varvec{\theta }\) of the network model by solving the optimization problem,

$$\begin{aligned} \min _{\varvec{\theta }} \left\{ \mathcal {L}_{\textrm{MMD}}(\varvec{\theta }) + \lambda _1\mathcal {R}_1(\varvec{\theta }) + \lambda _2\mathcal {R}_2(\varvec{\theta })\right\} , \end{aligned}$$

where \(\lambda _1\) and \(\lambda _2\) are the positive regularization parameters.

Network training

The network training can be divided into three main steps: network initialization, loss computation with parameter \(\varvec{\theta }\) updating, and early stopping of the training, as detailed in Algorithm 1. Accordingly, three distinct sets of weight vectors and acceptable ReR allocations are generated in advance for implementing those steps. We generate \(B_{\textrm{init}}\) weight vectors \(\{\varvec{W}^{(b)}\}_{b=1}^{B_{\textrm{init}}}\) for network initialization, \(B_{\textrm{loss}}\) acceptable allocations \(\{\widetilde{\varvec{T}}^{(b)}\}_{b=1}^{B_{\textrm{loss}}}\) for loss computation as well as parameter updating, and \(B_{\textrm{stop}}\) weight vectors \(\{\varvec{W}^{(b)}\}_{b=1}^{B_{\textrm{stop}}}\) together with allocations \(\{\widetilde{\varvec{T}}^{(b)}\}_{b=1}^{B_{\textrm{stop}}}\) for early stopping.

figure a

Algorithm 1 Network Training Algorithm.

In the network initialization for the parameter \(\varvec{\theta }\), we minimize the mean squared error between the logarithmic weights \(\{\log (\varvec{W}^{(b)})\}_{b=1}^{B_{\textrm{init}}}\) and \(\{\log (\widetilde{\varvec{W}}^{(b)})\}_{b=1}^{B_{\textrm{init}}}\) with \(\widetilde{\varvec{W}}^{(b)}=G({\varvec{W}}^{(b)}|\varvec{X},\varvec{\theta })\),

$$\begin{aligned} \min _{\varvec{\theta }}\frac{1}{B_{\textrm{init}}}\sum \limits _{b=1}^{B_{\textrm{init}}}\left\| \log (\varvec{W}^{(b)})-\log (\widetilde{\varvec{W}}^{(b)})\right\| _2^2, \end{aligned}$$

where the logarithmic transformation helps to amplify the difference between vectors. The intuition is that the initial network should be close to the identity mapping, i.e., \(\varvec{W}\approx G(\varvec{W}|\varvec{X},\varvec{\theta })\). Our initialization can be viewed as a special type of unsupervised pre-training [28, 34], which acts as a regularizer to improve the robustness of the network [35].

After initializing the network, we calculate the loss and optimize the network parameter \(\varvec{\theta }\) using the stochastic gradient descent. Different from the typical network training with a prefixed dataset, our training data, \(\varvec{\delta }(\widetilde{\varvec{T}})\) and \(\varvec{\delta }(\widetilde{\varvec{W}})\), can be generated infinitely from rerandomization and Dirichlet distributions; that is, we can generate a new batch of \(\widetilde{\varvec{T}}^{(b)}\)’s and \(\varvec{W}^{(b)}\)’s for every iteration. However, it could be computationally expensive to generate new \(\widetilde{\varvec{T}}^{(b)}\)’s in each iteration if the acceptance probability \(p_a\) for rerandomization is small. We thus generate a large number of feasible allocations \(\{\widetilde{\varvec{T}}^{(b)}\}_{b=1}^{B_{\textrm{loss}}}\) prior to the network training, and then sample a small subset \(\{\widetilde{\varvec{T}}^{(b)}\}_{b=1}^B\) from \(\{\widetilde{\varvec{T}}^{(b)}\}_{b=1}^{B_{\textrm{loss}}}\) with replacement. A new batch of weight vectors \(\{\widetilde{\varvec{W}}^{(b)}\}_{b=1}^B\) are jointly generated for the loss calculation.

Given a bandwidth \(\gamma\) and the regularization coefficients \(\lambda _1\) and \(\lambda _2\), we minimize the loss function (9) to update \(\varvec{\theta }\) based on \(\{\varvec{\delta }(\widetilde{\varvec{T}}^{(b)})\}_{b=1}^B\) and \(\{\varvec{\delta }(\widetilde{\varvec{W}}^{(b)})\}_{b=1}^B\) for each training iteration. The regularization parameters \(\lambda _1\) and \(\lambda _2\) are kept as constants throughout the training, while the bandwidth parameter \(\gamma \in (0,\infty )\) should be properly chosen to maximize the MMD loss \(\mathcal {L}_{\textrm{MMD}}(\varvec{\theta })\) [31, 36]. Instead of fixing the value of \(\gamma\) [31] or conducting a heuristic line search for \(\gamma\) [36], we adopt a data-driven strategy to adaptively update the value of \(\gamma\) along with the network training through the stochastic gradient descent. Given the updated network parameter \(\varvec{\theta }\), we update \(\gamma\) to maximize the MMD loss (6) based on the same batch of weight vectors and ReR allocations.

We also introduce a metric based on the Mahalanobis distance to early stop the training once the metric cannot be improved for more iterations. We define a weighted Mahalanobis distance \(D(\varvec{X},\widetilde{\varvec{W}})\) for the vector \(\widetilde{\varvec{W}}\) by plugging the corresponding weighted mean and covariance estimators into (2),

$$\begin{aligned} D(\varvec{X},\widetilde{\varvec{W}}) = \frac{N_1N_0}{N}\varvec{\Delta }(\widetilde{\varvec{W}})^{\top }\left\{ \widehat{\textrm{cov}}_{\widetilde{\varvec{W}}}(\varvec{X})\right\} ^{-1}\varvec{\Delta }(\widetilde{\varvec{W}}), \end{aligned}$$


$$\begin{aligned} \widehat{\textrm{cov}}_{\widetilde{\varvec{W}}}(\varvec{X})=\frac{1}{1-\sum _{i=1}^N\widetilde{W}^{*2}_i}\sum \limits _{i=1}^N \widetilde{W}^*_i(\varvec{X}_i-\bar{\varvec{X}}_{\widetilde{\varvec{W}}})(\varvec{X}_i-\bar{\varvec{X}}_{\widetilde{\varvec{W}}})^{\top }, \end{aligned}$$


$$\begin{aligned} \bar{\varvec{X}}_{\widetilde{\varvec{W}}} = \sum \limits _{i=1}^N\widetilde{W}^*_i\varvec{X}_i\quad \textrm{and} \quad \widetilde{W}^*_i = \left\{ N_1T_i+N_0(1-T_i)\right\} \widetilde{W}_i/N, \quad i =1,\dots , N. \end{aligned}$$

It is easy to check that \(\widehat{\textrm{cov}}_{\widetilde{\varvec{W}}}(\varvec{X}) = \widehat{\textrm{cov}}(\varvec{X})\) and \(D(\varvec{X},\widetilde{\varvec{W}})=D(\varvec{X},\widetilde{\varvec{T}})\) when the weights are all equal within the treatment and the control groups, i.e., \(\widetilde{\varvec{W}}_1=\varvec{1}_{N_1}/N_1\) and \(\widetilde{\varvec{W}}_0=\varvec{1}_{N_0}/N_0\). Motivated by the test proposed in [17], where the distribution of Mahalanobis distance incorporates the characteristics of covariate balance for randomized experiments, we specify the stopping metric as the Kolmogorov–Smirnov statistic with respect to the empirical distributions of Mahalanobis distances \(\{D(\varvec{X},\widetilde{\varvec{W}}^{(b)})\}_{b=1}^{B_{\textrm{stop}}}\) and \(\{D(\varvec{X},\widetilde{\varvec{T}}^{(b)})\}_{b=1}^{B_{\textrm{stop}}}\) based on \(\{\varvec{W}^{(b)}\}_{b=1}^{B_{\textrm{stop}}}\) and \(\{\widetilde{\varvec{T}}^{(b)}\}_{b=1}^{B_{\textrm{stop}}}\) respectively. Heuristically, a smaller value of the stopping metric indicates that the weighted data obtained from our model is more likely to resemble the rerandomized experiment.

Estimation for weighted data

After the network training, one can conduct statistical analysis using a set of transformed weights \(\{\widetilde{\varvec{W}}^{(m)}\}_{m=1}^M\) generated from the trained network, where M denotes the number of weight vectors used for estimation. For the sample average treatment effect \(\tau _{\textrm{SATE}}\), we adopt the mean difference estimator,

$$\begin{aligned} \hat{\tau }(\widetilde{\varvec{W}},\varvec{Y}^{\textrm{obs}}) = \sum \limits _{i=1}^N\widetilde{W}_{i}T_{i}Y^{\textrm{obs}}_{i}-\sum \limits _{i=1}^N\widetilde{W}_{i}(1-T_{i})Y^{\textrm{obs}}_{i}. \end{aligned}$$

Given \(\{\hat{\tau }(\widetilde{\varvec{W}}^{(m)},\varvec{Y}^{\textrm{obs}})\}_{m=1}^M\), we consider the following point estimator for \(\tau _{\textrm{SATE}}\),

$$\begin{aligned} \hat{\tau }_{M} = \frac{1}{M}\sum \limits _{m=1}^{M}\hat{\tau }(\widetilde{\varvec{W}}^{(m)},\varvec{Y}^{\textrm{obs}}) = \hat{\tau }\left( \frac{1}{M}\sum \limits _{m=1}^M\widetilde{\varvec{W}}^{(m)},\varvec{Y}^{\textrm{obs}}\right) . \end{aligned}$$

Such an estimation strategy leverages multiple weight vectors, and is equivalent to a weighted mean difference estimator based on the average weight vector \(\sum _{m=1}^M\widetilde{\varvec{W}}^{(m)}/M\). When using one vector with \(M=1\), the estimator mimics the estimation in rerandomization, where only one acceptable allocation would be used to collect the responses and estimate \(\tau _{\textrm{SATE}}\).

For a single weight vector \(\widetilde{\varvec{W}}\), we can pass it to any point estimator designed for \(\tau _{\textrm{PATE}}\) that supports weighted inputs, and the confidence interval can be constructed accordingly based on the robust sampling variance of the weighted point estimator [12, 37]. Particularly, we consider the simplest estimator using a weighted linear model to regress responses on the allocation indicators. Through some linear algebras, it is easy to check that such a point estimator for \(\tau _{\textrm{PATE}}\) exactly has the same form of the weighted mean difference in (12), and we can simultaneously obtain its sampling variance based on the linear model. Therefore, we similarly consider the ensembled estimator \(\hat{\tau }_{M}\) based on the average weight vector, where the aggregated vector empirically leads to smaller bias and root mean squared error than a random single vector.

Experiments and discussion

Experimental settings

The simulated settings are modified from those in [17, 38], which are designed to mimic the real cases. We fix the sample size for the treatment group as \(N_1=250\) and set the sample size of the control group as \(N_0=r\times N_1\) with the ratio \(r\in \{1, 2\}\). The treatment indicator vector is kept as \(\varvec{T}=(\varvec{1}_{N_1}^{\top },\varvec{0}_{N_0}^{\top })^{\top }\) for a given ratio r. The 8-dimensional covariate vector \(\varvec{X}_i=(X_{i1},\dots ,X_{i8})^{\top }\) is simulated as follows,

$$\begin{aligned} (X_{i1},\dots ,X_{i4})|T_i&\sim N\left( T_i\varvec{\mu },T_i\varvec{\Sigma }+(1-T_i)\varvec{I}_4\right) ,\\ X_{i5}, X_{i6}|T_i&\sim \textrm{Bernoulli}(0.1+0.068T_i),\\ X_{i7}, X_{i8}|T_i&\sim \textrm{Bernoulli}(0.4+0.242T_i),\quad i=1,\dots ,N, \end{aligned}$$

where we consider three different combinations of \(\varvec{\mu }\) and \(\varvec{\Sigma }\),

  • Scenario 1: \(\varvec{\mu }=(0.2,0.2,0.5,0.5)^{\top }\) and \(\varvec{\Sigma }=\varvec{I}_4\);

  • Scenario 2: \(\varvec{\mu }=\sqrt{1.5}\times (0.2,0.2,0.5,0.5)^{\top }\) and \(\varvec{\Sigma }=2\varvec{I}_4\);

  • Scenario 3: \(\varvec{\mu }=\sqrt{1.5}\times (0.2,0.2,0.5,0.5)^{\top }\) and \(\varvec{\Sigma }=1.5\varvec{I}_4+0.5\varvec{1}_4\varvec{1}_4^{\top }\).

Scenarios 1 and 2 represent the cases where the continuous covariates respectively have homogeneous and heterogeneous variances between the treatment and control groups. Scenario 3 considers the situation where there exist correlations among the Gaussian covariates in the treatment group. Let \((\mu _{T}-\mu _{C})/\sqrt{(\sigma ^2_T+\sigma ^2_C)/2}\) and \((p_{T}-p_{C})/\sqrt{\{p_T(1-p_T)+p_C(1-p_C)\}/2}\) be the true standardized mean difference for the Gaussian and Bernoulli covariates respectively, where \(\mu _{}\) and \(\sigma ^2\) correspond to the true mean and variance for a Gaussian variable, and p is the probability of taking a value of 1 for a Bernoulli variable. In all three scenarios, we keep the true standardized mean difference as 0.2 or 0.5 for each covariate, which represents a meaningful covariate imbalance due to its value larger than 0.1 [39, 40].

After obtaining the covariate matrix \(\varvec{X}=(\varvec{X}_1,\dots ,\varvec{X}_N)^{\top }\), we generate the response \(Y_i\) for each subject \((\varvec{X}_i,T_i)\) from the following model with \(\tau _{\textrm{SATE}}=\tau _{\textrm{PATE}}=\tau\),

$$\begin{aligned} Y_i = g(\varvec{X}_i) + \tau T_i + \varepsilon _i,\quad \varepsilon _i\sim N(0,1), \end{aligned}$$

where we consider three different forms for the response surface function \(g(\varvec{X}_i)\),

$$\begin{aligned} g_{\textrm{L}}(\varvec{X}_i)= & {} 3.5X_{i1}+4.5X_{i3}+1.5X_{i5}+2.5X_{i7}, \quad \quad \quad \quad \quad (\textrm{Linear})\\ g_{\textrm{I}}(\varvec{X}_i)= & {} g_{\textrm{L}}(\varvec{X}_i)+2.5\textrm{sign}(X_{i1})\sqrt{|X_{i1}|} + 2.5X_{i3}X_{i7} ,\quad (\textrm{Interaction})\\ g_{\textrm{P}}(\varvec{X}_i)= & {} g_{\textrm{I}}(\varvec{X}_i)+5.5X_{i3}^2-4.5X_{i1}X_{i3}^3.\quad \ \quad \quad \quad \quad \quad \quad \quad (\textrm{Polynomial}) \end{aligned}$$

Therefore, we can obtain three responses for the ith sample under different degrees of nonlinearity. The three types of responses represent the complex response surfaces in real situations, and partial inclusion of covariates in the response functions mimics the fact that not all covariates have an influence on the outcome in practice [38]. The first function \(g_{\textrm{L}}(\varvec{X}_i)\) represents a linear surface, whereas the other two functions \(g_{\textrm{I}}(\varvec{X}_i)\) and \(g_{\textrm{P}}(\varvec{X}_i)\) incrementally consider the nonlinear interactions and higher-order polynomial features. The additive treatment effect is fixed as \(\tau =1\) in all situations.

Through the above procedure, we can obtain a dataset including a covariate matrix, a treatment indicator vector, and three response vectors corresponding to three response surface functions for a given ratio r and a covariate scenario. We further standardize all covariates to have a unit variance and a zero mean. Following the setting in [41,42,43], we let the dataset with ratio \(r=2\) include the dataset with \(r=1\) under the same covariate scenario to correlate different datasets, which can increase the precision of comparisons and save the number of randomly generated datasets. We replicate 200 datasets for all combinations of the ratio r and covariate scenario to evaluate different methods.

We first compare the proposed QReR with the original rerandomization (ReR) in terms of the covariate balance and estimation for \(\tau _{\textrm{SATE}}\), which reveals how well QReR approximates ReR. For the covariate balance, we consider the similarity between the distributions of the unweighted and weighted covariate mean differences, i.e., \(\varvec{\delta }(\widetilde{\varvec{T}})\) and \(\varvec{\delta }(\widetilde{\varvec{W}})\). We report the average Kolmogorov–Smirnov (KS) statistics and the corresponding average p-values of the mean differences across each covariate, which are based on 1000 weight vectors and acceptable treatment allocations generated by QReR and ReR, respectively.

For the estimation of treatment effect, QReR uses both the observed covariates and responses for inference, whereas we perform ReR on the observed covariates and regenerate an acceptable allocation vector and corresponding responses. Regarding the metrics of assessment, the empirical bias and root mean square error (RMSE) are used to evaluate the precision of the estimator, where the mean difference between treated and control responses is chosen as the estimator for ReR as in [15]. Moreover, we report the Monte Carlo standard errors (MCSEs) for the above performance measurements of simulations (average KS statistics, average p-values, bias and RMSE) [44, 45], which evaluate the overall adequacy of simulations with finite repetitions under different random-number seeds.

We consider three different levels of acceptance probability for ReR and QReR, i.e., \(p_a=0.1,0.5,1\). For the other hyper-parameters of QReR in Algorithm 1, we set the iteration numbers as \((N_{\textrm{init}},N_{\textrm{train}},N_{\textrm{stop}})=(500,5000,15)\) for network initialization, training and early-stopping, and the batch sizes for training are specified as \((B,B_{\textrm{init}},B_{\textrm{stop}},B_{\textrm{loss}})=(512,1000,1000,10000)\). We use the Adam optimizer [46] with default parameters to conduct the stochastic gradient descent algorithm. The regularization coefficients \(\lambda _1\) and \(\lambda _2\) are both fixed as 1. During the inference, the QReR estimator \(\hat{\tau }_M\) is calculated based on \(M=1000\) weight vectors, denoted by \(\textrm{QReR}_{\textrm{M}}\). In addition, we compute the estimator using a single weight vector (\(M=1\)) denoted by \(\textrm{QReR}_{\textrm{S}}\) to compare with ReR and show the advantage of ensembling multiple weight vectors.

We further study the performance of QReR on inferring \(\tau _{\textrm{PATE}}\) in comparison with several popular balancing approaches. We consider the propensity score matching (PSM) provided by the R package Matching [47], optimal full matching (FM) in the MatchIt package [48, 49], inverse probability weighting (IPW) using the propensity score (Eq. 7 in [50]), entropy balancing (EBAL) proposed by [12], stable balancing weights (SBW) in [13] and empirical balancing calibration weighting (EBCW) based on the package ATE [14], where EBAL, SBW and EBCW are non-parametric reweighting algorithms. We use the R package WeightIt [37] to conduct SBW as well as EBAL, and the tolerance parameter of SBW is specified as 0.01. The propensity scores are estimated using the logistic regression in the relevant benchmarks, including PSM, FM and IPW. No further covariate adjustment is applied when estimating \(\tau _{\textrm{PATE}}\) after matching or reweighting. For QReR, we keep the same settings of hyper-parameters in the \(\tau _{\textrm{SATE}}\) estimation except that we only approximate the most stringent ReR with \(p_a=0.1\). Similar to [11,12,13,14], we focus on the bias and RMSE in the estimation of \(\tau _{\textrm{PATE}}\) when evaluating different methods.

Simulation results

Table 1 shows that QReR can approximate ReR well in terms of covariate balance. First, the average MCSEs have small values for both the average KS statistics and p-values, which implies that replications of 200 are adequate. For different combinations of r and scenarios, the average KS statistics are all small with the corresponding average p-values larger than 0.1, indicating that the weighted mean differences of QReR for each covariate share similar distributions to the counterparts in ReR. Furthermore, the KS statistics are larger for smaller \(p_a\) (more stringent), which implies that it is more difficult for QReR to reconstruct ReR when the criterion of covariate balance is more stringent. It may result from the fact that the distribution of \(\varvec{\delta }(\varvec{T})\) is more concentrated around zero for small \(p_a\) and thus is more difficult to approximate. For a more intuitive illustration, we draw the boxplots to visualize the covariate mean differences of a representative case for ReR and QReR with \(p_a=0.1\) in Fig. 1, where covariates are generated under Scenario 1 with \(r=2\). We observe that the paired boxplots of QReR and ReR generally exhibit similar shapes especially for the continuous covariates and a large value of \(p_a\). The medians in the paired boxplots are close, indicating that QReR and ReR have similar covariate balance, whereas other quantile points further show that the covariate mean difference of QReR displays a similar variation to that of ReR.

Table 1 The average Kolmogorov–Smirnov (KS) statistics and average p-values of covariate mean differences across all covariates between 1000 weighted and balanced datasets generated respectively by quasi-rerandomization and rerandomization. The average Monte Carlo standard errors (MCSEs) are 0.001 and 0.005 for the average KS statistics and the average p-values, respectively
Fig. 1
figure 1

An illustrative example for the (weighted) mean differences of all covariates between quasi-rerandomization (QReR) and rerandomization (ReR) with \(p_a=0.1, 0.5, 1\). The boxplots are based on 1000 acceptable ReR allocations \(\widetilde{\varvec{T}}\) and transformed weights \(\widetilde{\varvec{W}}\) under a simulated dataset from Scenario 1 with \(r=N_0/N_1=2\)

Tables 2 and 3 show the estimation performance of QReR in comparison with ReR. We first observe that the average MCSEs have relatively larger values for the NonLinear (Polynomial) model, because it is more difficult to balance covariates and thus leads to larger variations under the complex surface function. For QReR, we simultaneously consider two estimators \(\textrm{QReR}_{\textrm{M}}\) and \(\textrm{QReR}_{\textrm{S}}\) based on (13). When the outcome is generated from Linear or Nonlinear (Interaction) models, QReR yields comparable bias and RMSE to ReR. However, we observe much larger bias and RMSE under QReR for the response surface of Nonlinear (Polynomial) and covariates from Scenarios 2 and 3. The results under Nonlinear (Interaction) indicate that QReR for the observational data can demonstrate similar performance to rerandomized experiments even in the presence of unobserved nonlinear covariates. This could be explained by the MMD loss being capable of incorporating some nonlinear information of covariates by taking various orders of moments for \(\varvec{\Delta }(\widetilde{\varvec{T}})\) into account. In contrast, ReR inherently improves the covariate balance of any unobserved covariates due to the nature of randomization. Additionally, we find that QReR mimics ReR by delivering smaller RMSEs using a smaller acceptance probability \(p_a\) under various situations, which results from smaller values of \(\varvec{\delta }(\widetilde{\varvec{T}})\)’s entries in (6) due to more balanced covariates in ReR. The RMSEs of QReR and ReR are also smaller for \(r=2\) in contrast to \(r=1\), because more observations are provided. Concerning \(\textrm{QReR}_{\textrm{M}}\) and \(\textrm{QReR}_{\textrm{S}}\), we find that the simpler estimator \(\textrm{QReR}_{\textrm{S}}\) demonstrates more similar values of bias and RMSE to ReR, which stems from the similar covariate balance between QReR and ReR. The similarity also reflects that the single weight vector from QReR can approximate the inference properties of the acceptable allocation in ReR. The estimator \(\textrm{QReR}_{\textrm{M}}\) generally has smaller RMSE because it ensembles multiple \(\textrm{QReR}_{\textrm{S}}\)’s via taking the average, which shows the advantage of generating diverse weight vectors.

Table 2 The bias and RMSE for \(\tau _{\textrm{SATE}}\) under quasi-rerandomization (QReR) and rerandomization (ReR) under different combinations of the acceptance probability \(p_a\), covariate scenarios and response surfaces when \(r=N_0/N_1=1\). The average Monte Carlo standard errors (MCSEs) of bias are 0.03, 0.04 and 0.31 and those of RMSE are 0.02, 0.03 and 0.31 for Linear, NonLinear (Interaction) and NonLinear (Polynomial) models, respectively
Table 3 The bias and RMSE for \(\tau _{\textrm{SATE}}\) under quasi-rerandomization (QReR) and rerandomization (ReR) under different combinations of the acceptance probability \(p_a\), covariate scenarios and response surfaces when \(r=N_0/N_1=2\). The average Monte Carlo standard errors (MCSEs) of bias are 0.02, 0.03 and 0.26 and those of RMSE are 0.02, 0.02 and 0.30 for Linear, NonLinear (Interaction) and NonLinear (Polynomial) models, respectively

The point estimator for \(\tau _{\textrm{SATE}}\) has the same form of \(\tau _{\textrm{PATE}}\) so that we only compare \(\textrm{QReR}_{\textrm{M}}\) with other balancing algorithms in terms of bias and RMSE as shown by Table 4. In most cases, our approach has RMSE and bias as small as other non-parametric weighting approaches including SBW, EBAL and EBCW, and outperforms the parametric weighting method IPW as well as other matching methods such as PSM and FM. In addition, we find that SBW, EBAL and EBCW perform better than \(\textrm{QReR}_{\textrm{M}}\) and the propensity score methods in terms of bias under Linear and Nonlinear (Interaction) models. Furthermore, \(\textrm{QReR}_{\textrm{M}}\) demonstrates an evident advantage when the response is generated from a highly nonlinear response surface with heterogeneous covariate distributions. It shows that despite being inferior to ReR, our method performs better in the presence of nonlinear covariates relative to other balancing approaches. Moreover, unlike those benchmarks that rely on the balanced covariates without clear hypothetical randomized experiments, our estimator is pillared by the weight vectors that directly approximate the rerandomized experiment. Therefore, our method can incorporate rerandomization to balance covariates and simultaneously yield precise estimation of the treatment effect.

Table 4 The bias and RMSE for \(\tau _{\textrm{PATE}}\) under quasi-rerandomization (QReR) and other balancing methods under different combinations of the covariate scenarios, response surfaces and ratios (\(r=N_0/N_1\)). The average Monte Carlo standard errors (MCSEs) of bias are 0.02, 0.03 and 0.40 and those of RMSE are 0.02, 0.02 and 0.77 for Linear, NonLinear (Interaction) and NonLinear (Polynomial) models, respectively

Real application

We demonstrate the application of our proposed QReR on semi-synthetic data [51, 52], which consist of real and imbalanced covariates with allocations and simulated responses. The covariates were collected from the Infant Health and Development Program (IHDP), which targeted the low-birth-weight and premature infants. There were 19 binary covariates and 6 continuous covariates for each participant. High-quality childcare and professional home visits were provided for the treatment group, and the infants’ cognitive test score was the outcome of interest. There were 747 observations with 139 and 608 units for the treatment and control groups, respectively. Given the real covariates and allocations, Hill [51] simulated responses from various response surface functions to obtain the true treatment effect and thus evaluate different methods.

Shalit et al. [52] publicly provided 100 such datasetsFootnote 1 with different average treatment effects, each of which includes the potential outcomes \(\{(Y_{i}(0), Y_{i}(1))\}_{i=1}^{747}\) with their expectations \(\left\{ (\mathbb {E}\{Y_{i}(1)|X_i\}, \mathbb {E}\{Y_{i}(0)|X_i\})\right\} _{i=1}^{747}\), the treatment indicator vector \(\varvec{T}\in \mathbb {R}^{747}\) with \(\sum _{i=1}^NT_i=139\) and the same covariate matrix \(\varvec{X}\in \mathbb {R}^{747\times 25}\) using all 25 covariates. Moreover, these 100 datasets have heterogeneous individual treatment effects, i.e., both \(Y_{i}(1)-Y_{i}(0)\) and its expectation \(\mathbb {E}\{Y_{i}(1)|X_i\} - \mathbb {E}\{Y_{i}(0)|X_i\}\) have different values for \(i=1,\dots , N\). We further standardize all covariates to have zero means and unit variances. Based on the potential outcomes, we can easily obtain the finite sample average treatment effect \(\tau _{\textrm{SATE}}\) for each dataset, so that QReR can be similarly compared with ReR in terms of the bias and RMSE. We set the true population average treatment effect \(\tau _{\textrm{PATE}}=\sum _{i=1}^N\left[ \mathbb {E}\{Y_{i}(1)|X_i\}-\mathbb {E}\{Y_{i}(0)|X_i\}\right] /N\) following [52], and calculate the corresponding bias and RMSE for QReR and other balancing techniques. We use the same parameter settings in the simulations for various balancing approaches in analysis of IHDP datasets.

In Fig. 2, we present the covariate mean differences between QReR and ReR for continuous (\(X_1,\dots , X_6\)) and binary (\(X_7,\dots , X_{25}\)) covariates under different values of \(p_a\). It shows that \(\varvec{\delta }(\widetilde{\varvec{W}})\) of QReR generally has similar distributions to \(\varvec{\delta }(\widetilde{\varvec{T}})\) of ReR on the 25 covariates, particularly for the continuous ones, and thus QReR well reconstructs ReR in terms of the covariate mean differences. Furthermore, we observe that QReR can better approximate ReR when \(p_a\) increases, which is consistent with the findings from the simulations.

Fig. 2
figure 2

The (weighted) mean differences of all covariates between quasi-rerandomization (QReR) and rerandomization (ReR) with \(p_a=0.1, 0.5, 1\) on the covariates of IHDP datasets. The boxplots are based on 1000 acceptable ReR allocations \(\widetilde{\varvec{T}}\) and transformed weights \(\widetilde{\varvec{W}}\). The first six covariates are continuous, whereas the last nineteen covariates are binary

From the results in Table 5, we find that QReR also yields comparable bias and RMSE to ReR, and \(\textrm{QReR}_{\textrm{M}}\) performs better than \(\textrm{QReR}_{\textrm{S}}\) under the heterogeneous treatment effects, which strengthens the tendency of using multiple weight vectors. Table 6 shows that \(\textrm{QReR}_{\textrm{M}}\) achieves the smallest bias in contrast with other benchmarks. Its RMSE is much smaller than the matching approaches, but slightly worse than SBW, EBAL and EBCW due to a trade-off on the bias. This hence shows that our proposed approach is still advantageous in estimating \(\tau _{\textrm{PATE}}\) under the context of real covariates and heterogeneous individual treatment effects, and thus can be adopted broadly in the real applications.

Table 5 The bias and RMSE for the sample average treatment effect using quasi-rerandomization (QReR) and rerandomization (ReR) on the IHDP datasets, under the acceptance probability \(p_a= 0.1, 0.5, 1\)
Table 6 The bias and RMSE for the population average treatment effect using various balancing methods on the IHDP datasets. The quasi-rerandomization (QReR) reconstructs the rerandomization under the acceptance probability \(p_a=0.1\)


We propose a novel balancing technique, named quasi-rerandomization, for observational studies, which incorporates the covariate balance from rerandomization into the observational data via reweighting. The weights obtained from our method can be conveniently combined with weighted point estimators to perform the subsequent inference for both finite-sample and population treatment effects. We empirically show that our method can well approximate the rerandomized experiments in terms of improving the covariate balance and the precision of treatment effect estimation. Furthermore, our approach demonstrates competitive performance compared with other weighting and matching methods. Possible extensions may modify our algorithm to approximate other types of randomized experiments, such as block randomization. One may also explore whether the Bayesian framework can be leveraged to generate the weights that can reconstruct rerandomization in terms of covariate balance. The codes and datasets for the simulations and real application can be found at

Availability of data and materials

The data used in the manuscript are either simulated or publicly available, which can be found at


  1. The datasets can be downloaded from by merging two subsets IHDP-100 (train) and IHDP-100 (test).


  1. Rubin DB. For objective causal inference, design trumps analysis. Ann Appl Stat. 2008;2(3):808–40.

    Article  Google Scholar 

  2. Yin G. Clinical trial design: Bayesian and frequentist adaptive methods, vol 876. John Wiley & Sons; 2012.

  3. Bind MA, Rubin DB. Bridging observational studies and randomized experiments by embedding the former in the latter. Stat Methods Med Res. 2019;28(7):1958–78.

    Article  PubMed  Google Scholar 

  4. Rubin DB. The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials. Stat Med. 2007;26(1):20–36.

    Article  PubMed  Google Scholar 

  5. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55.

    Article  Google Scholar 

  6. Rubin DB. Matching to remove bias in observational studies. Biometrics. 1973;29(1):159–83.

    Article  Google Scholar 

  7. Rosenbaum PR. Optimal matching for observational studies. J Am Stat Assoc. 1989;84(408):1024–32.

    Article  Google Scholar 

  8. Hansen BB. Full matching in an observational study of coaching for the SAT. J Am Stat Assoc. 2004;99(467):609–18.

    Article  Google Scholar 

  9. Stuart EA. Matching methods for causal inference: A review and a look forward. Stat Sci. 2010;25(1):1–21.

    Article  PubMed  PubMed Central  Google Scholar 

  10. Rosenbaum PR. Model-based direct adjustment. J Am Stat Assoc. 1987;82(398):387–94.

    Article  Google Scholar 

  11. Imai K, Ratkovic M. Covariate balancing propensity score. J R Stat Soc Ser B Stat Methodol. 2014;786(1):243–63.

    Article  Google Scholar 

  12. Hainmueller J. Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Polit Anal. 2012;20(1):25–46.

    Article  Google Scholar 

  13. Zubizarreta JR. Stable weights that balance covariates for estimation with incomplete outcome data. J Am Stat Assoc. 2015;110(511):910–22.

    Article  CAS  Google Scholar 

  14. Chan KCG, Yam SCP, Zhang Z. Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting. J R Stat Soc Ser B Stat Methodol. 2016;78(3):673–700.

    Article  Google Scholar 

  15. Morgan KL, Rubin DB. Rerandomization to improve covariate balance in experiments. Ann Stat. 2012;40(2):1263–82.

    Article  Google Scholar 

  16. Li X, Ding P, Rubin DB. Asymptotic theory of rerandomization in treatment-control experiments. Proc Natl Acad Sci. 2018;115(37):9157–62.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Branson Z. Randomization Tests to Assess Covariate Balance When Designing and Analyzing Matched Datasets. Observational Stud. 2021;7:1–36.

    Article  Google Scholar 

  18. Branson Z, Dasgupta T, Rubin DB. Improving covariate balance in \(2^{K}\) factorial designs via rerandomization with an application to a New York City Department of Education High School Study. Ann Appl Stat. 2016;10(4):1958–76.

  19. Zhou Q, Ernst PA, Morgan KL, Rubin DB, Zhang A. Sequential rerandomization. Biometrika. 2018;105(3):745–52.

    Article  PubMed  Google Scholar 

  20. Wang X, Wang T, Liu H. Rerandomization in stratified randomized experiments. arXiv preprint arXiv:2011.05195. 2020;.

  21. Morgan KL, Rubin DB. Rerandomization to balance tiers of covariates. J Am Stat Assoc. 2015;110(512):1412–21.

    Article  CAS  PubMed  Google Scholar 

  22. Branson Z, Shao S. Ridge rerandomization: An experimental design strategy in the presence of covariate collinearity. J Stat Plan Infer. 2021;211:287–314.

    Article  Google Scholar 

  23. Zhang H, Yin G, Rubin DB. PCA Rerandomization. Can J Stat. In press. arXiv preprint arXiv:2102.12262. 2022.

  24. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66(5):688–701.

    Article  Google Scholar 

  25. Rubin D. Comment on ‘Randomization Analysis of Experimental Data: The Fisher Randomization Test’ by Debabrata Basu. J Am Stat Assoc. 1980;75(371):591–3.

    Google Scholar 

  26. Nair V, Hinton GE. Rectified linear units improve restricted Boltzmann machines. In: International Conference on Machine Learning. Proceedings of the 27th International Conference on Machine Learning. 2010. p. 807–814.

  27. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(1):1929–58.

    Google Scholar 

  28. Goodfellow I, Bengio Y, Courville A. Deep Learning. Cambridge, Massachusetts, USA: MIT Press; 2016.

    Google Scholar 

  29. Breiman L. Bagging predictors. Mach Learn. 1996;24:123–40.

    Article  Google Scholar 

  30. Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61(4):962–73.

    Article  PubMed  Google Scholar 

  31. Li Y, Swersky K, Zemel R. Generative moment matching networks. In: International Conference on Machine Learning. Proceedings of the 32nd International Conference on Machine Learning, vol 37. 2015. p. 1718–1727.

  32. Gretton A, Borgwardt K, Rasch M, Schölkopf B, Smola A. A kernel method for the two-sample-problem. In: Advances in Neural Information Processing Systems, vol 19. MIT Press. 2006. p. 513–520.

  33. Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, Smola A. A kernel two-sample test. J Mach Learn Res. 2012;13(1):723–73.

    Google Scholar 

  34. Bengio Y, Lamblin P, Popovici D, Larochelle H. Greedy Layer-Wise Training of Deep Networks. In: Advances in Neural Information Processing Systems, vol 19. 2007. p. 153–160.

  35. Erhan D, Bengio Y, Courville A, Manzagol PA, Vincent P, Bengio S. Why Does Unsupervised Pre-training Help Deep Learning? J Mach Learn Res. 2010;11(19):625–60.

    Google Scholar 

  36. Sriperumbudur BK, Fukumizu K, Gretton A, Lanckriet GR, Schölkopf B. Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions. In: Advances in Neural Information Processing Systems, vol 22. Curran Associates, Inc. 2009. p. 1750–1758.

  37. Greifer N. WeightIt: Weighting for Covariate Balance in Observational Studies, 2020. R package version 0.10.2.

  38. De Los Angeles Resa M, Zubizarreta JR. Evaluation of subset matching methods and forms of covariate balance. Stat Med. 2016;35(27):4961–4979.

  39. Austin PC, Mamdani MM. A comparison of propensity score methods: a case-study estimating the effectiveness of post-AMI statin use. Stat Med. 2006;25(12):2084–106.

    Article  PubMed  Google Scholar 

  40. Austin PC. Propensity-score matching in the cardiovascular surgery literature from 2004 to 2006: a systematic review and suggestions for improvement. J Thorac Cardiovasc Surg. 2007;134(5):1128–35.

    Article  PubMed  Google Scholar 

  41. Rubin DB. Using multivariate matched sampling and regression adjustment to control bias in observational studies. J Am Stat Assoc. 1979;74(366a):318–28.

    Article  Google Scholar 

  42. Cangul M, Chretien YR, Gutman R, Rubin DB. Testing treatment effects in unconfounded studies under model misspecification: Logistic regression, discretization, and their combination. Stat Med. 2009;28(20):2531–51.

    Article  CAS  PubMed  Google Scholar 

  43. Gutman R, Rubin DB. Robust estimation of causal effects of binary treatments in unconfounded studies with dichotomous outcomes. Stat Med. 2013;32(11):1795–814.

    Article  CAS  PubMed  Google Scholar 

  44. Koehler E, Brown E, Haneuse SJP. On the assessment of Monte Carlo error in simulation-based statistical analyses. Am Stat. 2009;63(2):155–62.

    Article  PubMed  PubMed Central  Google Scholar 

  45. Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med. 2019;38(11):2074–102.

    Article  PubMed  PubMed Central  Google Scholar 

  46. Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.

  47. Sekhon JS. Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R. J Stat Softw. 2011;42(7):1–52.

    Article  Google Scholar 

  48. Hansen BB, Klopfer SO. Optimal full matching and related designs via network flows. J Comput Graph Stat. 2006;15(3):609–27.

    Article  Google Scholar 

  49. Ho DE, Imai K, King G, Stuart EA. MatchIt: Nonparametric Preprocessing for Parametric Causal Inference. J Stat Softw. 2011;42(8):1–28.

    Article  Google Scholar 

  50. Lunceford JK, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat Med. 2004;23(19):2937–60.

    Article  PubMed  Google Scholar 

  51. Hill JL. Bayesian nonparametric modeling for causal inference. J Comput Graph Stat. 2011;20(1):217–40.

    Article  Google Scholar 

  52. Shalit U, Johansson FD, Sontag D. Estimating individual treatment effect: generalization bounds and algorithms. In: International Conference on Machine Learning. Proceedings of the 34th International Conference on Machine Learning. 2017. p. 3076–3085.

Download references


We thank both the editor and reviewers for providing valuable comments and suggestions that have helped us to improve the quality of the paper immensely.


The Research Grants Council of Hong Kong (17308420) for GY.

Author information

Authors and Affiliations



GY and HZ contributed to the ideas, methods, discussions, and writing of the manuscript, WS contributed the revision and writing, and HZ conducted all numerical studies.

Corresponding author

Correspondence to Guosheng Yin.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Zhang, H., Su, W. & Yin, G. Quasi-rerandomization for observational studies. BMC Med Res Methodol 23, 155 (2023).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: