 Research
 Open access
Quantifying and reducing inequity in average treatment effect estimation
BMC Medical Research Methodology volume 23, Article number: 297 (2023)
Abstract
Background
Across studies of average treatment effects, some population subgroups consistently have lower representation than others, which can lead to discrepancies in how well results generalize.
Methods
We develop a framework for quantifying inequity due to systemic disparities in sample representation and a method for mitigation during data analysis. Assuming subgroup treatment effects are exchangeable, an unbiased sample average treatment effect estimator will have higher mean-squared error, on average across studies, for subgroups with less representation when treatment effects vary. We present a method for estimating average treatment effects in representation-adjusted samples, which enables subgroups to optimally leverage information from the full sample rather than only their own subgroup’s data. Two approaches for specifying representation adjustment are offered: one minimizes average mean-squared error for each subgroup separately, and the other balances minimization of mean-squared error and equal representation. We conduct simulation studies to compare the performance of the proposed estimators to several subgroup-specific estimators.
Results
We find that the proposed estimators generally provide lower mean squared error, particularly for smaller subgroups, relative to the other estimators. As a case study, we apply this method to a subgroup analysis from a published study.
Conclusions
We recommend the use of the proposed estimators to mitigate the impact of disparities in representation, though structural change is ultimately needed.
Background
Historically, racial and ethnic minorities and women have not been afforded the same representation in clinical studies as White men [1, 2]. We refer to the proportion of a sample that belongs to a particular subgroup as that subgroup’s sample representation. Despite governmental policies aimed at increasing inclusion of women and racial and ethnic minorities [3,4,5], reviews of published results from NIH-funded randomized controlled trials have shown that disparities in sample representation persist [6, 7]. Indeed, the NIH RCDC (Research, Condition, Disease Category) Inclusion Statistics report shows large disparities in the typical sample representation of racial and ethnic groups [8]. For example, the median representation of individuals identifying as Asian in cancer studies in 2021 was 2%; less than 1% each for American Indian or Alaska Native, Native Hawaiian or Other Pacific Islander, and individuals indicating more than one race; 8% for Black or African American; and 74% for White; 6% identified as Hispanic or Latino and 87% as not Hispanic or Latino. Disparities in sample representation extend to other segments of the population as well, including older adults [9] and adults with less than 12 years of education [10].
Subgroup sample representation plays a role in the generalizability of average treatment effects (ATEs) in experimental and observational studies. We distinguish two types of generalizability: out-of-sample and within-sample generalizing (Fig. 1). Researchers seeking to generalize their findings from the sample to a target population, such as a geographic region or a population diagnosed with a particular disease, are generalizing out-of-sample. In this setting, if treatment effects vary across subgroups that are disproportionately represented relative to the target population, sample estimates can be biased for the quantity of interest in the target population [11]. Accordingly, researchers might aim to have the representation of subgroups in a sample align with their corresponding representation in a target population. When this is infeasible, analytic methods have been developed for out-of-sample generalizing [12,13,14,15,16,17,18,19].
Even if researchers are able to sample uniformly at random from the target population of interest (or make statistical adjustments to mimic this), results (i.e., average treatment effects) might not generalize equally well across subgroups within the sample. In contrast to out-of-sample generalizing, we refer to generalizing study results to subgroups as within-sample generalizing, which is the focus of this paper. Subgroup analyses and hypothesis tests for interactions are common ways to explore and/or confirm within-sample generalizability of ATEs [20, 21]. NIH requires Phase III clinical trials to provide “...valid analysis of whether the variables studied in the trial affect women or members of minority groups...differently than other subjects in the trial...” [5]. Nonetheless, subgroup analyses are not always performed and have been cautioned against for their potentially high statistical noise [22, 23]. When treatment effects vary and when some subgroups systematically have greater representation than others, an ethical question arises: who benefits and who is disadvantaged when researchers generalize a sample ATE to specific subgroups? While intuition might suggest that results will generalize best for subgroups with the greatest sample proportion, we offer a formal approach to answering this question below.
Trading off some bias in estimation for a reduction of noise is one way to ease one of the main concerns with subgroup analysis: imprecision [24, 25]. Biased estimators can borrow strength across subgroups and result in lower mean-squared error, which incorporates both bias and noise, for each subgroup. Bayesian hierarchical modeling provides one route to improving subgroup-specific estimates by borrowing strength across subgroups [24, 26, 27], and these models can be fit using the beanz R package [28]. We distinguish the approach presented in this paper below in the Discussion section.
The key innovation of our approach is to extend existing methodological ideas used for out-of-sample generalizing to the context of within-sample generalizing, by creating pseudosamples in which representation has been reweighted to improve subgroup effect estimates. This work builds on recent approaches focusing on subgroup-level inferences [29,30,31,32]. This approach makes it easy for researchers to (1) use existing unbiased subgroup-specific estimators, (2) combine this method with existing methods for generalizing subgroup-specific effects to a broader target population of interest, and (3) incorporate stakeholder input and judgement into subgroup modeling in a straightforward way. We proceed by first introducing notation used throughout the paper. Under some assumptions, we show how disparities in sample representation affect within-sample generalizability of the sample ATE. We define effects of interest, which we conceptualize as representation-adjusted ATEs, and provide identification proofs and estimators. Lastly, we examine the performance of the proposed estimators in several simulation studies and a case study. We hope that providing a straightforward method for modestly improving the accuracy of subgroup-specific estimates will support researchers conducting subgroup analyses rather than reporting ATEs only.
Methods
Notation and definitions
We define the following random variables: \(A \in \mathcal {A} = \{0,1\}\) indicates treatment assignment; \(X\in \mathcal {X}\) is a vector of baseline covariates; \(Y(a) \in \mathbb {R}\) is the potential outcome under assignment to treatment a; Y is the outcome that is observed; and \(S \in \{0,1\}\) indicates sample membership. We observe a sample of N independent realizations of \((A, X, Y, S=1)\), where \(S=1\) indicates membership in the sample (data not in the study sample would have \(S=0\)). We assume the data follow a joint distribution \(\mathbb {P}\). Since we are interested in subgroups of the sample defined by baseline covariates, we define a partition of \(\mathcal {X}\) into G mutually exclusive, nonempty subgroups that cover all possible values: \(V = \{v_1, \ldots , v_G\}\). That is, each \(v_g\) defines a subgroup of individuals with \(X \in v_g\) [31].
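We define the following quantities: the sample ATE (SATE),
\[\beta = \mathbb{E}[Y(1) - Y(0) \mid S = 1], \tag{1}\]
the sample representation of individuals with \(X \in v_g\),
\[p_g = \mathbb{P}(X \in v_g \mid S = 1), \tag{2}\]
and the subgroup sample ATE for individuals with \(X \in v_g\),
\[\tau_g = \mathbb{E}[Y(1) - Y(0) \mid X \in v_g, S = 1]. \tag{3}\]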
We let \(p = (p_1, \ldots , p_G)^T\) be the vector of subgroup representation probabilities and \(\tau = (\tau _1, \ldots , \tau _G)^T\) be the vector of subgroup sample ATEs. By the law of iterated expectations, \(\beta = p^T \tau\). For clarity of exposition, we focus on estimation of the sample ATEs since our focus is on within-sample generalization, but we note that the arguments and methods presented can be generalized to a target population using existing methods [31].
The impact of disparities in sample representation
Our first task is to reason quantitatively about the impact that disparities in sample representation have on within-sample generalizability. For motivation, we consider a toy example. Researchers conduct five different studies, each with three subgroups (groups A, B, and C) with representation fixed to 70%, 20%, and 10%, respectively. In each study, researchers obtain an unbiased and very precise estimate of the SATE. We examine how applicable the overall study result (SATE) is for each subgroup by measuring the absolute difference between the SATE and the subgroup’s true effect. Hypothetical data are shown in Table 1.
From this example, we see that for studies 2–4, the SATE is closest to the effect of the most represented subgroup (group A). However, in study 1, the SATE is closest to \(\tau _B\), and in study 5, the SATE is closest to \(\tau _C\). This illustrates that the generalizability of the SATE to a particular subgroup is not solely determined by that subgroup’s representation in the study [33]. The SATE is a weighted average of the subgroup-specific ATEs; although subgroups with greater representation get more weight, an extreme group might pull the average away from a majority group (as in study 1) or two groups with similar effects might benefit from each other’s presence (as in study 5). Consequently, representation is important, but not the only consideration when assessing the within-sample generalizability of results for a given study; knowledge of how much subgroup-specific ATEs are expected to vary is also required.
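The weighted-average argument can be checked numerically. The sketch below uses the 70%/20%/10% representation from the toy example but made-up subgroup effects (they are not the values in Table 1); the function names are illustrative only.

```python
# Hypothetical illustration: representation from the toy example,
# subgroup effects invented for this sketch (not the Table 1 values).

def sate(p, tau):
    """Sample ATE as the representation-weighted average of subgroup ATEs."""
    return sum(p_g * t_g for p_g, t_g in zip(p, tau))

def abs_gaps(p, tau):
    """Absolute difference between the SATE and each subgroup's true effect."""
    beta = sate(p, tau)
    return [abs(beta - t_g) for t_g in tau]

p = [0.7, 0.2, 0.1]      # representation of groups A, B, C
tau = [1.0, 2.0, 6.0]    # hypothetical subgroup ATEs; group C is extreme

gaps = abs_gaps(p, tau)
closest = "ABC"[gaps.index(min(gaps))]
print(round(sate(p, tau), 3), [round(g, 3) for g in gaps], closest)
```

With these invented effects, the extreme group C pulls the SATE away from the majority group A, so the SATE ends up closest to group B even though B holds only 20% of the sample.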
We formalize these observations by examining the risk function, which specifies the expected loss over repeated samples from the data-generating distribution for a particular estimator and parameter of interest [34]. Under squared error loss, the risk function is the mean squared error (MSE). For an unbiased estimator of the SATE, \(\hat{\beta }\), the MSE for estimating \(\tau _g\) can be expressed as:
\[\text{MSE}(\hat{\beta}; \tau_g) = \mathbb{E}\left[(\hat{\beta} - \tau_g)^2\right] = \sigma_{\hat{\beta}}^2 + (p - e_g)^T \tau \tau^T (p - e_g), \tag{4}\]
where \(\sigma _{\hat{\beta }}\) is the standard error for \(\hat{\beta }\) and \(e_g\) is a column vector with the gth entry equal to 1 and all other entries equal to 0 (derivation is in the section named "Impact of disparities in representation" of Additional file 1). The risk function (Eq. 4) for \(\hat{\beta }\) depends on: 1) the standard error of the SATE estimator and 2) the sample representation of subgroups coupled with the products of all pairs of subgroup ATEs.
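As a concrete reading of this risk function, the following sketch computes the MSE of an unbiased SATE estimator when it is used as an estimate of \(\tau_g\), once as variance plus squared bias and once through the quadratic form in \(p - e_g\). The function names and input values are hypothetical, and the decomposition assumed is the standard one for an unbiased estimator of \(\beta = p^T\tau\).

```python
def sate_risk(p, tau, se_beta, g):
    """MSE of an unbiased SATE estimator used to estimate tau[g].

    Variance term: se_beta**2. Bias term: (p^T tau - tau_g)**2.
    """
    bias = sum(p_k * t_k for p_k, t_k in zip(p, tau)) - tau[g]
    return se_beta ** 2 + bias ** 2

def sate_risk_quadratic(p, tau, se_beta, g):
    """Same quantity via the quadratic form (p - e_g)^T tau tau^T (p - e_g)."""
    d = [p_k - (1.0 if k == g else 0.0) for k, p_k in enumerate(p)]
    n = len(p)
    quad = sum(d[i] * tau[i] * tau[j] * d[j] for i in range(n) for j in range(n))
    return se_beta ** 2 + quad

# Hypothetical inputs: representation, subgroup ATEs, and SATE standard error.
p = [0.7, 0.2, 0.1]
tau = [1.0, 2.0, 6.0]
for g in range(3):
    print(g, round(sate_risk(p, tau, 0.1, g), 4))
```

The two functions agree because \((p - e_g)^T\tau = p^T\tau - \tau_g\); the second form makes visible why the risk depends on products of pairs of subgroup ATEs.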
Using the risk function directly to draw conclusions about within-sample generalizability is impractical due to its dependence on the unknown subgroup ATEs in \(\tau\). To make progress, we propose examining the average risk, also known as the Bayes risk, given a prior distribution or weighting for the parameter values [34]. We model \(\tau\) as following a joint distribution \(\pi\), which we interpret in two ways. Taking a Bayesian perspective, we can interpret \(\pi\) as a prior belief about the probability of different values of \(\tau\). Alternatively, from a frequentist perspective, we could interpret \(\pi\) to be the distribution of subgroup ATEs that we would observe across many different studies (similar to the toy example above). The latter framing allows us to examine the impact of systemic disparities in sample representation.
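In general, we define the inequity in average risk between two subgroups, \(X \in v_i\) and \(X \in v_{j}\), as the difference in their average MSEs,
\[\mathbb{E}_\tau\left[\text{MSE}(\hat{\theta}_i; \tau_i)\right] - \mathbb{E}_\tau\left[\text{MSE}(\hat{\theta}_j; \tau_j)\right], \tag{5}\]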
where \(\hat{\theta }_i, \hat{\theta }_j\) are estimators for \(\tau _i, \tau _j\), respectively. Critically, inequity in average risk provides us a path forward for reasoning about the impact that disparities in sample representation have on within-sample generalizability. Other aspects of the distribution of differences in MSEs across studies could be considered, but here we focus on the mean difference. When, as is often the case, an unbiased SATE estimator is used to obtain a single estimate from a sample, the inequity for individuals with \(X \in v_1\) relative to individuals with \(X \in v_2\) is
\[(p - e_1)^T\, \mathbb{E}_\tau[\tau \tau^T]\, (p - e_1) - (p - e_2)^T\, \mathbb{E}_\tau[\tau \tau^T]\, (p - e_2). \tag{6}\]
To calculate the inequity measure of \(\hat{\beta }\) in Eq. 6, we do not need to specify the full distribution of \(\tau\) but rather only \(\mathbb {E}_\tau [\tau \tau ^T] = \Sigma _\tau + \mu _\tau \mu _\tau ^T\), where \(\Sigma _\tau\) and \(\mu _\tau\) are the covariance matrix and mean vector for \(\tau\), respectively. If researchers are uncertain how the treatment effect varies across subgroups, we recommend they assume \(\pi\) is exchangeable across subgroups (i.e., permutation of subgroup labels leaves the joint distribution \(\pi\) unchanged) [27]. This results in the following simplification,
\[\phi^2 (p_2 - p_1), \tag{7}\]
where \(\phi ^2 = \text {var}_\tau (\tau _i - \tau _j) = \mathbb {E}_\tau [(\tau _i - \tau _j)^2]\) for any \(i \ne j\).
Equation 7 has consequential implications: when treatment effects vary and researchers report only average treatment effects, results will be less applicable on average for subgroups with lower representation. This holds both in a single study in which researchers have no prior knowledge of subgroup effect heterogeneity and across a collection of studies with consistent representation disparities. The inequity in average risk between two subgroups is directly proportional to the difference in representation of the subgroups and the variance of subgroup-specific ATE differences, under the assumptions given above. At the design stage of a study, given an approximation of how much treatment effects are expected to vary across subgroups \(\phi\), this simple formula could inform how much disparity in representation could be tolerated during study enrollment. At the analysis and interpretation stage, researchers can use Eq. 7 to reason more quantitatively about the within-sample generalizability of study results, by considering both disparities in representation and expectations of subgroup differences in ATEs.
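A short derivation shows where this proportionality comes from. It is a sketch worked out from the definitions in this section under the exchangeability assumption, not a reproduction of the derivation in Additional file 1; the symbols \(m_2\) and \(c\) are introduced here only for illustration.

```latex
% Exchangeability gives common second moments:
%   m_2 = E[tau_k^2],  c = E[tau_k tau_l] (k != l),  so  phi^2 = 2 (m_2 - c).
\begin{aligned}
\mathbb{E}_\tau\!\left[(p^T\tau - \tau_g)^2\right]
  &= \mathbb{E}_\tau\!\Big[\Big(\sum_{k \ne g} p_k(\tau_k - \tau_g)\Big)^2\Big] \\
  &= \phi^2 \sum_{k \ne g} p_k^2
     + \frac{\phi^2}{2} \sum_{\substack{k \ne l \\ k,\, l \ne g}} p_k p_l
     \qquad \text{since } \mathbb{E}_\tau[(\tau_k - \tau_g)(\tau_l - \tau_g)] = m_2 - c = \tfrac{\phi^2}{2} \\
  &= \frac{\phi^2}{2}\Big(\sum_{k} p_k^2 + 1 - 2 p_g\Big). \\[4pt]
\text{Hence}\quad
\mathbb{E}_\tau\!\left[(p^T\tau - \tau_1)^2\right] - \mathbb{E}_\tau\!\left[(p^T\tau - \tau_2)^2\right]
  &= \frac{\phi^2}{2}\,(2 p_2 - 2 p_1) = \phi^2 (p_2 - p_1).
\end{aligned}
```

The variance of \(\hat{\beta}\) is the same whichever subgroup the estimate is applied to, so it cancels in the difference and only the bias terms remain.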
In cases where treatment effects are expected to vary substantially across subgroups, researchers could sample subgroups in equal proportion, so that \(p_2 - p_1 = 0\), to eliminate the inequity in Eq. 7. However, there are no guarantees on how accurate the SATE estimate would be for each of the subgroups. Another option is to change the estimator used to obtain each of the subgroup-specific estimates. Simple alternatives would be to obtain unbiased estimates of subgroup-specific treatment effects by stratifying the analysis by subgroup or fitting a regression model of the outcome with treatment-subgroup interaction terms. We quantify the inequity of this approach as
\[\sigma_1^2 - \sigma_2^2, \tag{8}\]
where \(\hat{\tau }_1\), \(\hat{\tau }_2\) are unbiased subgroup-specific estimators of \(\tau _1, \tau _2\), respectively, and \(\sigma _1, \sigma _2\) are the corresponding standard errors. In large samples, \(\sigma _1\) and \(\sigma _2\) will tend to be small. In small samples, when \(p_2 > p_1\), we generally have that \(\sigma _2 < \sigma _1\) due to having fewer data for the less represented group. This implies that subgroups with less representation will have higher risk on average. While this approach addresses the inequity due to differences in the bias of the SATE estimator, it creates another issue due to differences in variance. Next, we discuss a third option which seeks to find a balance between these two alternatives.
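To make the two sources of inequity comparable, the sketch below evaluates both for a hypothetical trial. It takes the exchangeable-case inequity of the SATE to be \(\phi^2(p_2 - p_1)\) and, as a simplifying assumption not made in the text, approximates each stratified estimator's variance by that of a difference in means under 1:1 randomization with a common outcome standard deviation. All names and numbers are illustrative.

```python
def sate_inequity(phi, p1, p2):
    """Average-risk gap phi^2 * (p2 - p1) when the SATE is applied to group 1 vs group 2."""
    return phi ** 2 * (p2 - p1)

def stratified_inequity(n, p1, p2, outcome_sd):
    """sigma_1^2 - sigma_2^2 for unbiased stratified estimators.

    Assumption for this sketch: difference in means within each subgroup,
    1:1 randomization, common outcome SD, so var = 4 * sd^2 / (n * p_g).
    """
    def var(p_g):
        return 4.0 * outcome_sd ** 2 / (n * p_g)
    return var(p1) - var(p2)

# Hypothetical design: N = 300, group 1 at 10%, group 2 at 75%, phi = 1, SD = 1.
print(round(sate_inequity(1.0, 0.10, 0.75), 4))
print(round(stratified_inequity(300, 0.10, 0.75, 1.0), 4))
```

Both quantities are positive here, meaning the less represented group fares worse under either estimator; the relative size of the two gaps shifts with \(\phi\), the sample size, and the outcome variance.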
Representation-adjusted ATEs
When estimating a subgroup-specific ATE, we need not completely dispense with the information provided by the other subgroups in the sample. Effect sizes for some subgroups in the sample can give an approximate sense of reasonable values for the effect sizes of other subgroups. For example, if we know that for many subgroups, the ATE is generally an increase of the outcome by 2 to 5 units, then a subgroup-specific effect of 20 units would be suspect (though not impossible). When subgroups are analyzed separately, this valuable information is lost. Similar to Bayesian analyses of subgroup effects [25, 27, 35], we sought to make use of this information to improve the precision of subgroup-specific ATE estimation.
Since we noted that differences in representation led to inequity in average MSE, we consider multiple pseudosamples in which all sample data are retained for each subgroup. In each, subgroup representations are adjusted in an optimal way to improve the accuracy of the SATE for the subgroup of interest. We develop a method that does not require specification of full prior distributions and is less computationally expensive than fully Bayesian approaches. In addition, our approach allows for correlated subgroup-specific ATE estimators. We denote membership in the representation-adjusted sample for individuals with \(X \in v_g\) with the discrete random variables \(S_g\), taking values 0 or 1. The observed sample indicator is still denoted by S without a subscript. For each \(g = 1,\ldots , G\), we define new effects of interest as
\[\text{RATE}_g = \mathbb{E}[Y(1) - Y(0) \mid S_g = 1]. \tag{9}\]
We refer to this effect as a representation-adjusted ATE (RATE) for individuals with \(X \in v_g\). The degree of representation adjustment will depend on both the amount of information we have for each subgroup and prior expectations of how much subgroup ATEs differ, which we discuss under Estimation and inference below.
Identification
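The RATE can be expressed as a function of observed variables \((A,X,Y,S=1)\) under certain assumptions. We assume that mean potential outcomes in the subgroup in the observed sample are equal to mean potential outcomes in the same subgroup in the pseudosample; that is, \(\mathbb {E}[Y(a) \mid X \in v_g, S=1] = \mathbb {E}[Y(a) \mid X \in v_g, S_g = 1]\), for all g and \(a \in \mathcal {A}\). With this assumption of exchangeability over sample indicators (which is different from the subgroup effect exchangeability assumption we discussed in the previous subsection), the RATE can be expressed as a weighted average of the subgroup-specific effects as follows,
\[\begin{aligned}
\text{RATE}_g &= \mathbb{E}[Y(1) - Y(0) \mid S_g = 1] \\
&= \sum_{k=1}^{G} \mathbb{E}[Y(1) - Y(0) \mid X \in v_k, S_g = 1]\, \mathbb{P}(X \in v_k \mid S_g = 1) \\
&= \sum_{k=1}^{G} \mathbb{E}[Y(1) - Y(0) \mid X \in v_k, S = 1]\, \mathbb{P}(X \in v_k \mid S_g = 1) \\
&= \sum_{k=1}^{G} \tau_k\, \mathbb{P}(X \in v_k \mid S_g = 1),
\end{aligned} \tag{10}\]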
where the first equality is the definition of the effect, the second follows from the law of iterated expectations, the third follows from the assumption of exchangeability over sampling indicators, and the last follows from the definition of \(\tau _k\). Equation 10 is analogous to the transport formula given in [14]. Note that the equalities shown above could be applicable to both experimental and observational studies. In the case of observational studies, however, assumptions about exchangeability of treatment assignment are necessary for the identification of \(\tau _k\). Steps for identifying the subgroup-specific effects \(\tau _k\) closely follow the identification proofs in [31], which we detail in the section named "Identification" of Additional file 1. If out-of-sample generalizing is of interest, \(\tau _k\) can simply be replaced with the respective subgroup effects in the target population. Again, identification arguments follow those in [31] as described in the section named "Identification" of Additional file 1.
Estimation and inference
Based on Eq. 10, RATE estimators can be expressed as:
\[\tilde{\tau} = Q \hat{\tau}, \tag{11}\]
where \(\tilde{\tau }\) is a vector of RATE estimators, Q is a matrix of probabilities with entries \(q_{ij} = \mathbb {P}(X \in v_j \mid S_i = 1)\), and \(\hat{\tau }\) is a vector of unbiased estimators for \(\tau\). Accordingly, we need to 1) unbiasedly estimate the subgroup-specific ATEs, \(\hat{\tau }\), and 2) specify subgroup probabilities in the representation-adjusted samples, Q, which we treat as fixed. For step 1, the estimators presented in [31] can be adapted to estimate sample-specific subgroup effects by conditioning on sample membership. For example, a typical outcome modeling approach might estimate the subgroup-specific ATEs as
\[\hat{\tau}_g = \mathbb{E}_n\left[\hat{g}_1(X) - \hat{g}_0(X) \mid X \in v_g, S = 1\right], \tag{12}\]
where \(\mathbb {E}_n[\cdot ]\) denotes an empirical expectation and \(\hat{g}_a(X)\) is an estimator for \(\mathbb {E}[Y \mid X, S=1, A = a]\) for \(a = 0, 1\). Inverse-probability weighting estimators, augmented or not, are another possibility.
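For a randomized trial, step 1 can be illustrated with an even simpler unbiased estimator than outcome modeling: a within-subgroup difference in means. The sketch below swaps in that technique for brevity; it is not the outcome-modeling estimator described above, and the record layout and function name are hypothetical.

```python
from statistics import mean, variance

def subgroup_diff_in_means(records):
    """Unbiased subgroup ATE estimates for a randomized trial.

    records: list of (group, a, y) tuples with a in {0, 1}.
    Returns {group: (estimate, estimated variance)}; needs at least
    two observations per arm in every subgroup.
    """
    out = {}
    for g in sorted({grp for grp, _, _ in records}):
        y1 = [y for grp, a, y in records if grp == g and a == 1]
        y0 = [y for grp, a, y in records if grp == g and a == 0]
        est = mean(y1) - mean(y0)
        var = variance(y1) / len(y1) + variance(y0) / len(y0)
        out[g] = (est, var)
    return out

# Tiny made-up data set with two subgroups.
records = [("A", 1, 3.0), ("A", 1, 5.0), ("A", 0, 1.0), ("A", 0, 2.0),
           ("B", 1, 2.0), ("B", 1, 2.5), ("B", 0, 2.0), ("B", 0, 3.0)]
print(subgroup_diff_in_means(records))
```

The returned variances would populate the diagonal of \(\Sigma_{\hat{\tau}}\); estimators fit separately by subgroup are uncorrelated, whereas estimates from a joint model generally are not.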
For step 2, we consider two approaches. First, we can use estimated probabilities that minimize the average MSE for each subgroup. This approach is motivated by the notion that what is fair is to provide each subgroup with the best estimator possible given the data, where the best estimator is defined by minimal average MSE. In general, this requires specification of a particular prior distribution for the subgroup effects, but we partially avoid this by assuming exchangeability across subgroups. Assuming exchangeability, minimizing the average MSE yields the following construction for representation-adjusted samples for subgroup g, which we refer to as the optimal weights (detailed derivations in the section named "Specifying subgroup representation for RATE estimators" of Additional file 1):
\[q_g^T = e_g^T \Omega + \frac{1 - e_g^T \Omega \textbf{1}}{\textbf{1}^T \Omega \textbf{1}}\, \textbf{1}^T \Omega, \tag{13}\]
where \(q_g^T\) is the gth row of Q, \(\Omega = \phi ^2 (2\Sigma _{\hat{\tau }} + \phi ^2 \mathbb {I})^{-1}\), \(\Sigma _{\hat{\tau }}\) is the covariance matrix for \(\hat{\tau }\), and \(\phi ^2\) was defined in the last section as \(\text {var}_\tau (\tau _i - \tau _j)\) for \(i\ne j\). When the set of subgroup-specific effects \(\tau\) are uncorrelated and the set of subgroup-specific estimators \(\hat{\tau }\) are uncorrelated, the RATE estimator corresponds to the pooled estimator from a simple Bayesian normal hierarchical model with a uniform prior on the hypermean and fixed hypervariance for \(\tau\).
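The optimal weights can be computed with a few lines of linear algebra. The sketch below minimizes \(q^T\Sigma_{\hat{\tau}}q + (\phi^2/2)\lVert q - e_g\rVert^2\) subject to the weights summing to one, which is the average MSE under exchangeability as worked out here from the definitions; the resulting closed form, \(q_g = \Omega e_g + \Omega\textbf{1}\,(1 - \textbf{1}^T\Omega e_g)/(\textbf{1}^T\Omega\textbf{1})\), is this sketch's own arrangement and may be written differently in Additional file 1. The helper names are hypothetical, and \(\phi > 0\) is assumed.

```python
def solve(m, b):
    """Solve m x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [[float(v) for v in row] + [float(b[i])] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def optimal_weights(sigma, phi, g):
    """Representation-adjusted weights q_g for subgroup g (requires phi > 0).

    sigma: covariance matrix of the unbiased subgroup estimators (list of lists).
    """
    G = len(sigma)
    # M = 2 * Sigma + phi^2 * I, so that Omega = phi^2 * M^{-1}.
    m = [[2.0 * sigma[i][j] + (phi ** 2 if i == j else 0.0) for j in range(G)]
         for i in range(G)]
    e_g = [1.0 if k == g else 0.0 for k in range(G)]
    omega_e = [phi ** 2 * x for x in solve(m, e_g)]        # Omega e_g
    omega_1 = [phi ** 2 * x for x in solve(m, [1.0] * G)]  # Omega 1
    scale = (1.0 - sum(omega_e)) / sum(omega_1)
    return [a + scale * b for a, b in zip(omega_e, omega_1)]

# Hypothetical inputs: three uncorrelated estimators with equal variance 0.5, phi = 1.
sigma = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
print([round(q, 4) for q in optimal_weights(sigma, 1.0, 0)])
```

In this symmetric case the subgroup of interest keeps two thirds of the weight and the rest is split evenly, which matches the shrinkage factor \(\phi^2/(\phi^2 + 2\sigma^2)\) of the simple hierarchical model mentioned above.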
One potential concern might be that the optimal weights could still result in disparities in the representation of each subgroup within their respective representation-adjusted sample, with larger subgroups tending to have greater representation. To address this, a second way to specify probabilities in Q could follow a similar process but constrain subgroup probabilities to be the same for each representation-adjusted sample. Adding constraints to optimization algorithms has been a common way of tackling unfairness in model performance in other applications [36, 37]. One way to force representation to be the same for the effect estimation for each subgroup is to require \(q_g = w e_g + (G-1)^{-1}(1-w)(\textbf{1} - e_g)\) for some \(w \in [0,1]\). Then, we can choose a w to use for all subgroups by minimizing a joint function of the subgroup-specific average MSEs; specifically, we consider the average MSE averaged over the subgroups. Under mild regularity conditions, this yields the following representation probabilities for the RATE for subgroup g, which we refer to as the shared weights (details in the section named "Specifying subgroup representation for RATE estimators" of Additional file 1):
\[q_g = \frac{1}{1 + \gamma}\, e_g + \frac{\gamma}{(1 + \gamma)(G - 1)}\, (\textbf{1} - e_g), \tag{14}\]
where \(\gamma = \frac{\bar{\sigma ^2}(G-1) - V_1}{\phi ^2G/2 + V_2/(G-1) - V_1}\), \(\bar{\sigma ^2} = G^{-1}\sum _{g\in \mathcal {G}}\sigma _g^2\), \(\sigma _g^2 = \text {var}(\hat{\tau }_g)\), \(V_1 = G^{-1}\sum _{g\in \mathcal {G}} e_g^T\Sigma _{\hat{\tau }}(\textbf{1} - e_g)\), and \(V_2 = G^{-1}\sum _{g\in \mathcal {G}} (\textbf{1} - e_g)^T\Sigma _{\hat{\tau }}(\textbf{1} - e_g)\).
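The shared weights follow directly from \(\gamma\). The sketch below computes \(\gamma\) from the quantities just defined and then sets the own-group weight to \(w = 1/(1+\gamma)\); that mapping from \(\gamma\) to \(w\) is derived here by minimizing the subgroup-averaged MSE over \(w\) and should be read as an assumption of this sketch. Function and variable names are hypothetical.

```python
def shared_weights(sigma, phi):
    """Shared-weight construction: returns (gamma, w, Q).

    sigma: covariance matrix of the unbiased subgroup estimators (list of lists).
    Each row of Q puts weight w on its own subgroup and splits 1 - w evenly.
    """
    G = len(sigma)
    sigma_bar = sum(sigma[g][g] for g in range(G)) / G
    # V1: average covariance between each estimator and the sum of the others.
    v1 = sum(sigma[g][k] for g in range(G) for k in range(G) if k != g) / G
    # V2: average variance of the sum of the other subgroups' estimators.
    v2 = sum(sigma[i][j]
             for g in range(G)
             for i in range(G) if i != g
             for j in range(G) if j != g) / G
    gamma = (sigma_bar * (G - 1) - v1) / (phi ** 2 * G / 2 + v2 / (G - 1) - v1)
    w = 1.0 / (1.0 + gamma)
    Q = [[w if j == g else (1.0 - w) / (G - 1) for j in range(G)]
         for g in range(G)]
    return gamma, w, Q

# Hypothetical inputs: unequal variances, uncorrelated estimators, phi = 1.
sigma = [[0.02, 0.0, 0.0], [0.0, 0.10, 0.0], [0.0, 0.0, 0.30]]
gamma, w, Q = shared_weights(sigma, 1.0)
print(round(gamma, 4), round(w, 4))
```

Because a single \(w\) is used for every subgroup, precisely estimated (large) subgroups are shrunk more than they would be under the optimal weights, which is consistent with the weaker performance for the largest group reported in the simulations.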
In practice, using either approach, researchers could specify \(\phi\) or a range of \(\phi\) values directly based on substantive knowledge and empirically estimate the values of \(\hat{\tau }\) and \(\Sigma _{\hat{\tau }}\). Note in Eqs. 13 and 14 that as \(\phi \rightarrow \infty\), \(q_g \rightarrow e_g\). This means that for large values of \(\phi\), subgroups are effectively analyzed separately and consequently, estimates become unbiased. In other words, large values of \(\phi\) effectively stratify the analysis by subgroup. On the other extreme, setting \(\phi = 0\) effectively ignores possible treatment effect heterogeneity. Choosing an intermediate value of \(\phi\), based on an expectation of the amount of treatment effect heterogeneity, permits some bias for a reduction in variance. To help reason about appropriate values for \(\phi\), researchers could make use of Popoviciu’s inequality on variances [38], which implies that if the difference in subgroup-specific ATEs is bounded, that is \(\mathbb {P}(\tau _i - \tau _j \le c) = 1\), then \(\phi = \text {SD}[\tau _i - \tau _j] \le c/2\). If a researcher knows that subgroup-specific ATEs should not differ by more than 2 units, then \(\phi\) should be no more than 1. The Bhatia-Davis inequality is another option [39]. Figure 2 summarizes the steps a researcher would take to estimate RATEs. Note that \(\Sigma _{\hat{\tau }}\) in Eqs. 13 and 14 needs to be replaced with an estimate \(\hat{\Sigma }_{\hat{\tau }}\).
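The role of \(\phi\) can be seen in a small sensitivity sweep. The sketch below applies the \(c/2\) bound stated above and, for the special case of uncorrelated estimators with a common variance \(\sigma^2\), uses the closed form own-group weight \(\omega + (1-\omega)/G\) with \(\omega = \phi^2/(2\sigma^2 + \phi^2)\). That closed form is derived here for illustration under those simplifying assumptions, and the function names are hypothetical.

```python
def phi_upper_bound(c):
    """Bound on phi when subgroup ATEs are believed to differ by at most c."""
    return c / 2.0

def own_weight_equal_variance(phi, sigma2, G):
    """Weight a subgroup places on its own estimate under the optimal weights,
    assuming uncorrelated subgroup estimators with common variance sigma2."""
    omega = phi ** 2 / (2.0 * sigma2 + phi ** 2)
    return omega + (1.0 - omega) / G

# Hypothetical setting: four subgroups, estimator variance 0.25.
print(phi_upper_bound(2.0))
for phi in (0.0, 0.5, 1.0, 2.0, 10.0):
    print(phi, round(own_weight_equal_variance(phi, 0.25, 4), 4))
```

At \(\phi = 0\) every subgroup receives the fully pooled estimate (weight \(1/G\) each), and as \(\phi\) grows the own-group weight approaches 1, reproducing the two extremes described above.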
Representation-adjusted samples constructed in this way have some useful statistical properties. As the sample size grows, standard errors for subgroup estimators will shrink, causing the representation for the subgroup of interest to approach 1. This means that RATE estimators are asymptotically unbiased (further detail in the section named "Large sample properties of RATE" of Additional file 1). As a result, improvements in the recruitment of participants from a subgroup naturally will reduce the bias in estimation, but in contexts where this is infeasible, this approach enables small subgroups to wield the full information of the data sample to inform their estimate. Lastly, inference for the RATE estimator is straightforward given it is a weighted average of the subgroup-specific estimators (\(\hat{\tau }\)). The covariance matrix of the RATE estimators is given by \(\Sigma _{\tilde{\tau }} = Q\Sigma _{\hat{\tau }}Q^T\), which could be used to construct confidence intervals. Nonparametric bootstrap intervals are another option.
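The inference step can be sketched as follows: given a fixed Q, the RATE estimates are \(Q\hat{\tau}\) and their variances are the diagonal of \(Q\Sigma_{\hat{\tau}}Q^T\). The normal-approximation (Wald) interval used here is one assumed way to turn the covariance into confidence intervals; the function name and inputs are hypothetical.

```python
def rate_estimates(Q, tau_hat, sigma, z=1.96):
    """RATE point estimates with Wald intervals, treating Q as fixed.

    Q: G x G weight matrix (rows sum to 1); tau_hat: unbiased subgroup
    estimates; sigma: their covariance matrix.
    Returns a list of (estimate, lower, upper) tuples, one per subgroup.
    """
    G = len(tau_hat)
    out = []
    for g in range(G):
        est = sum(Q[g][k] * tau_hat[k] for k in range(G))
        # g-th diagonal entry of Q Sigma Q^T.
        var = sum(Q[g][i] * sigma[i][j] * Q[g][j]
                  for i in range(G) for j in range(G))
        half = z * var ** 0.5
        out.append((est, est - half, est + half))
    return out

# Hypothetical two-subgroup example with full pooling.
Q = [[0.5, 0.5], [0.5, 0.5]]
tau_hat = [1.0, 3.0]
sigma = [[0.04, 0.0], [0.0, 0.04]]
print(rate_estimates(Q, tau_hat, sigma))
```

These intervals reflect sampling variability only; they are centered on a deliberately shrunken estimate, so their coverage for \(\tau_g\) depends on how well \(\phi\) reflects the true heterogeneity.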
Simulation
While the optimally weighted RATE estimator yields the lowest average MSE across a broad class of estimators, the distribution of MSE, sensitivity to misspecifying \(\phi\), and impact of adding a shared weight constraint are unclear. To address these questions, we simulated randomized controlled trials of a binary treatment and continuous outcome with sample size of 300. For simplicity, we assumed that trial participants were sampled uniformly at random from some target population. We considered trials with either three or five subgroups, and trials with subgroup-specific effects independently and identically drawn from three different distributions (standard normal; bimodal; Gamma(3,3)) for a total of six different scenarios. The bimodal distribution was a mixture of two normal distributions: N(0.5, 1) with probability 0.8 and N(3, 0.5) with probability 0.2. All distributions were scaled such that the true value of \(\phi\) was 1. In trials with three subgroups, representation was fixed to 75%, 15%, and 10% for groups 1–3, respectively. For trials with five subgroups, representation was fixed to 67%, 15%, 10%, 5%, 3% for groups 1–5, respectively. Treatment was randomly allocated with equal proportions within each subgroup to ensure treatment balance. Outcomes were generated as follows:
where A is a binary indicator of treatment assignment, \(\beta _G\) are the sampled subgroup effects for group G, \(X_1\) is a continuous covariate sampled from N(1, 1), \(X_2\) is a binary covariate sampled from a Bernoulli(0.3), and \(\epsilon\) is random error sampled from N(0, 1).
For each of the six scenarios, we drew 500 sets of subgroup effects and for each set, we simulated 500 data samples from which we estimated the root mean squared error (RMSE) of subgroup-specific effect estimates. In each data sample, we obtained estimates from a stratified model, a regression model with treatment-subgroup interaction terms, a model with a random effect for the treatment, and four different RATE estimators. Subgroup-specific effect estimates from the regression model were obtained using the multcomp R package [40]. In the random effects model, we obtained subgroup-specific predicted effects using the lme4 R package [41]. The random effects estimator is equivalent to a basic Bayesian shrinkage estimator with a fixed value for the prior variance of the subgroup-specific effects and noninformative prior for the mean of the subgroup-specific effects [25]; this served as an easy-to-implement substitute for comparing the RATE estimators to a fully Bayesian hierarchical model. The RATE estimators used subgroup-specific estimates from the interaction model. For three of the RATE estimators, we used the optimal weights with different values of \(\phi\) (0.75, 1, 1.5); these values were chosen to fall below, match, and exceed the true value of \(\phi\), respectively. For the fourth RATE estimator we used the shared weights with \(\phi = 1\). We plotted cumulative estimates of the 25th, 50th, and 75th percentile of the RMSE distribution to confirm that estimates had stabilized after 500 draws from the subgroup effect distribution. All simulations were performed in R Statistical Software v4.2.1 [42].
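The scaling of the effect distributions to \(\phi = 1\) can be illustrated for the normal scenario. For independent and identically distributed effects with variance \(v\), \(\text{var}(\tau_i - \tau_j) = 2v\), so a standard normal draw must be multiplied by \(1/\sqrt{2}\). The sketch below is written in Python rather than the R used for the actual simulations, covers only the normal case, and uses hypothetical function names and seed.

```python
import random

def draw_effects(G, rng):
    """Draw G subgroup effects i.i.d. N(0, 1/2), so var(tau_i - tau_j) = 1."""
    return [rng.gauss(0.0, 0.5 ** 0.5) for _ in range(G)]

def empirical_phi2(n_draws, G=3, seed=1):
    """Monte Carlo estimate of phi^2 = var(tau_1 - tau_2) across simulated studies."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_draws):
        tau = draw_effects(G, rng)
        diffs.append(tau[0] - tau[1])
    m = sum(diffs) / len(diffs)
    return sum((d - m) ** 2 for d in diffs) / (len(diffs) - 1)

print(round(empirical_phi2(20000), 3))
```

The same check could be applied to the bimodal and Gamma scenarios by dividing each draw by \(\sqrt{2v}\), with \(v\) the variance of the unscaled distribution.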
Results
Simulation results
In the scenario with three subgroup effects drawn from a common normal distribution, we found that all the RATE estimators had slightly lower median RMSE than the other estimators for groups 2 and 3, even when \(\phi\) was misspecified. However, in cases where a given subgroup’s true effect happened to be very different from the effect in the other subgroups, the RMSE from the RATE estimators for the given subgroup was high. Boxplots of estimated RMSE of the subgroupspecific effect estimators are shown in Fig. 3; corresponding summary statistics are shown in Additional file 1 (Table S1). RATE estimators with a shared set of weights had similar performance to those with optimal weights for groups 2 and 3, but resulted in substantially worse performance for group 1, the largest group. Except for the shared weight RATE estimator, other estimators had similar RMSE for group 1. There was considerably more variability in the RMSE from the random effects model estimates for groups 2 and 3 compared to the other estimates. Corresponding figures for the five remaining scenarios are shown in Additional file 1 (Figs. S1–S5); results were generally consistent.
Empirical example
To demonstrate the use of this method in an applied example, we estimated RATEs using results from an analysis of the Moving to Opportunity study (MTO) [43]. The MTO ran from 1994 to 1998 and was sponsored by the U.S. Department of Housing and Urban Development in five U.S. cities. Briefly, the MTO randomly assigned families living in public housing in high-poverty areas to receive a voucher that would subsidize rent in the private market. The authors in [43] assessed how rental subsidies impacted psychological distress and behavioral problems of the children in the study, focusing on effect modification by gender and family health vulnerability. Families were considered “vulnerable” if any household member had a disability or any child in the household had a health or developmental problem. We focus on the analysis of psychological distress, which was measured using standardized factor scores from a latent variable analysis of the Kessler 6 scale.
The authors in [43] found that the intervention benefited girls from non-vulnerable families but had a detrimental effect on boys from vulnerable families. We explored how robust these findings were when there was little prior expectation of these differences. We used published estimates for each of the four groups from [43] (non-vulnerable girls, vulnerable girls, non-vulnerable boys, vulnerable boys) and calculated corresponding standard errors based on the 95% confidence intervals. Although these estimates are likely correlated since they are not from a fully stratified model, we assumed they were uncorrelated for illustrative purposes. We assumed that subgroup-specific effects should not differ by more than 0.25 SDs. Based on Popoviciu’s inequality on variances, we would expect that \(\phi\), the standard deviation of differences in subgroup-specific effects, is less than or equal to 0.125. With \(\phi = 0.125\), we obtained RATE representation, estimates, and 95% confidence intervals (Table 2). The full Q matrix can be found in the section named "Case study detail" of the Supplementary Material. Even under this skeptical prior, there is still evidence of a difference in treatment effects between groups. Additionally, RATE estimates provide tighter confidence intervals for all subgroups compared to the original subgroup-specific estimates.
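The two preprocessing steps in this case study, recovering standard errors from published 95% confidence intervals and bounding \(\phi\), can be sketched as below. The interval limits are placeholders invented for illustration, not the estimates published in [43], and the conversion assumes symmetric normal-approximation intervals.

```python
def se_from_ci(lower, upper, z=1.96):
    """Standard error implied by a symmetric normal-approximation 95% CI."""
    return (upper - lower) / (2.0 * z)

def phi_bound(max_difference):
    """Bound on phi from the c/2 rule used in the text."""
    return max_difference / 2.0

# Placeholder interval for one subgroup (not a published MTO estimate).
se = se_from_ci(-0.392, 0.392)
phi = phi_bound(0.25)   # effects assumed to differ by at most 0.25 SDs
print(round(se, 3), phi)
```

With standard errors in hand, the diagonal of \(\Sigma_{\hat{\tau}}\) is their squares under the working assumption of uncorrelated estimates, and the weights follow from \(\phi = 0.125\).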
Discussion
Some population subgroups, including but not limited to racially and ethnically marginalized groups, women, and older adults, consistently have lower representation in experimental and observational studies compared to their counterparts. In many cases, such as investigations of associations between Framingham risk factors and cardiovascular disease [44] and genetic studies of various health outcomes [45], imbalance in study representation has led to study findings that generalize better for subgroups with greater representation. In other cases, such as randomized controlled trials related to depression, the impact of this imbalance is unknown because the heterogeneity in treatment effects is left unexplored [46]. In this paper, we have presented a statistical framework for understanding this phenomenon, focusing on SATE estimation, that can partially inform the design, analysis, and interpretation of studies in heterogeneous populations. We showed that the difference between subgroups in average risk (MSE) of the SATE estimator increases linearly with the disparity in representation and with the variance of treatment effect differences. Improving data collection and community engagement will be essential to addressing the inadequate inclusion of marginalized groups in experimental and observational studies and to reducing this inequity.
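The linear relationship between the MSE gap and the disparity in representation can be checked numerically. The sketch below rests on simplifying assumptions that are ours rather than the paper's: subgroup effects are exchangeable with common variance v and common covariance c, and sampling error is ignored, so the SATE is the representation-weighted average of the true subgroup effects. Under these assumptions, the gap in expected squared error between two subgroups equals \(\phi^2\) times the difference in their representation, where \(\phi^2 = 2(v - c)\) is the variance of the difference between two subgroup effects. All names and numbers are hypothetical.

```python
def sate_mse_by_subgroup(p, v, c):
    """Expected squared distance between the SATE and each subgroup's effect.

    p: representation proportions (must sum to 1)
    v: variance of each subgroup effect
    c: covariance between any two subgroup effects
    """
    k = len(p)
    mse = []
    for g in range(k):
        # SATE - tau_g = sum_j a_j * tau_j, with a = p - e_g (entries sum to 0)
        a = [p[j] - (1.0 if j == g else 0.0) for j in range(k)]
        # Quadratic form a' Sigma a with an exchangeable covariance matrix
        q = sum(a[i] * a[j] * (v if i == j else c)
                for i in range(k) for j in range(k))
        mse.append(q)
    return mse


p = [0.70, 0.20, 0.10]      # hypothetical sample representation
v, c = 0.02, 0.005          # hypothetical variance and covariance
phi_sq = 2 * (v - c)        # variance of a difference in subgroup effects

mse = sate_mse_by_subgroup(p, v, c)
gap = mse[2] - mse[0]                   # smallest minus largest subgroup
expected_gap = phi_sq * (p[0] - p[2])   # linear in the representation gap
```

In this setup, the subgroup with the least representation has the largest expected squared error, and doubling either the representation gap or \(\phi^2\) doubles the gap in MSE.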
In practice, given that many studies do have substantial disparities in representation, we sought to improve estimation accuracy, on average across studies. Motivated by the idea that sample representation has an impact on the generalizability of study results and that adjusting sample representation for less represented subgroups could improve generalizability for these groups, we introduced a new effect of interest, which we refer to as a RATE: the ATE in a representation-adjusted sample. In general, the RATE is any weighted average of subgroup-specific effects. The RATE estimators require researchers to input into the analysis how different they expect subgroup-specific effects to be. The SATE estimator and the unbiased subgroup-specific estimators are particular cases of the RATE estimators in which researchers implicitly assume that subgroup-specific effects are completely homogeneous or completely distinct from one another, respectively. Specification of \(\phi\) allows for a balance between these choices and could be a topic of discussion among key stakeholders and community members of the relevant sociodemographic subgroups. After \(\phi\) is specified, the unbiased subgroup-specific effects are optimally weighted and combined in a different way for each subgroup to minimize the average risk. This method can improve upon simple unbiased subgroup-specific estimates by borrowing strength from the other subgroups. That said, the performance of the RATE estimator in any given study will depend on the true, unknown subgroup-specific effects in that study. The theoretical results presented in this paper show that the RATE estimator can provide the lowest MSE on average across many studies.
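Because the RATE is a weighted average of subgroup-specific effects, the two familiar special cases differ only in the weights supplied. The sketch below uses made-up effect estimates and representation proportions. It does not compute the optimal weights described in this paper, which also depend on \(\phi\) and on the covariance of the subgroup estimators. The intermediate weights shown are arbitrary and serve only to illustrate partial pooling.

```python
def weighted_effect(effects, weights):
    """Weighted average of subgroup-specific effect estimates."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * e for w, e in zip(weights, effects))


effects = [-0.10, 0.05, 0.20]     # hypothetical subgroup estimates
sample_rep = [0.70, 0.20, 0.10]   # hypothetical sample representation

# Weights equal to sample representation reproduce the SATE (full pooling).
sate = weighted_effect(effects, sample_rep)

# All weight on one subgroup reproduces its own estimate (no pooling).
subgroup_3 = weighted_effect(effects, [0.0, 0.0, 1.0])

# Arbitrary intermediate weights: a representation-adjusted average.
partial = weighted_effect(effects, [0.35, 0.15, 0.50])
```

For the third subgroup, the intermediate weights give a value between the pooled estimate and its own estimate, which is the sense in which a RATE borrows information from the other subgroups.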
We explored an alternative RATE estimator in which subgroups were constrained to use a shared set of representation probabilities. We found in our simulations that this led to substantially worse performance for the largest subgroup relative to the optimally weighted RATE. While a set of shared weights might be valued for its ability to give subgroups equal representation, lower average MSE can always be achieved by using the optimally weighted RATE. In many scenarios in which algorithmic fairness is a concern (e.g., employment, criminal sentencing, and loan applications), relevant parties generally differ in their goals (e.g., hiring the best candidates vs. securing a job), leading to deliberation over what is or is not fair. Aside from cases of scarce resource allocation (e.g., organ transplantation), medical care differs in that all parties share the same goal: improving patient health [47]. Consequently, obtaining the most accurate estimates of treatment effectiveness possible for each subgroup should be the main objective. For this reason, we view the optimally weighted RATE as a more equitable approach to adjusting sample representation.
We distinguish the RATE estimators from Bayesian subgroup modeling, such as the methods discussed in [27], in a few ways. First, we do not make assumptions about the exact distribution of the underlying subgroup effects and relevant hyperparameters as a fully Bayesian approach would. Second, we allow for correlation between the subgroup-specific effect estimators. Subgroup-specific estimators obtained from a regression model adjusted for covariates are typically correlated unless covariate effects are estimated separately for each subgroup. This correlation allows for additional information borrowing. Third, we specify the standard deviation of the differences in subgroup effects directly rather than incorporate an estimate from the data. Estimating this parameter from the data can be challenging when the number of subgroups is small. In simulation, we saw that in the case of 3 subgroups, the random effects estimator performed substantially worse. However, there were some cases in which specifying this parameter directly led to large RMSEs as well.
The RATE estimator with optimal weights relies on the assumption that the subgroup effects are exchangeable, which is typical for Bayesian subgroup analyses, though more complex methods are available [27]. The full exchangeability assumption could be weakened, which would require the researcher to specify additional hyperparameters based on the number of subsets of subgroup effects that could reasonably be considered exchangeable. Future studies could explore this extension.
Finally, one of the complications of studying effect modification is that it is scale-dependent. Subgroup differences in ATEs on the difference scale may be smaller or larger than those on the risk ratio or odds ratio scales, or may be absent on one scale and present on another. In fact, if baseline risk of an outcome varies across subgroups, then treatment effectiveness will vary on at least one scale. In this article, we focused on the risk difference scale, which is easy to interpret and is the scale with the greatest public health and policy importance [48, 49]. The approach we presented can be extended to other scales by shifting focus to estimating the potential outcome means under different treatment assignments and then combining them in the appropriate way. The identification and estimator derivation logic would remain the same, but two hyperparameters would need to be specified (the standard deviation of mean differences under treatment and the standard deviation of mean differences under control) rather than just one (the standard deviation of treatment effect differences).
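A small numerical example illustrates the scale dependence described above. The risks are invented for illustration: two subgroups with different baseline risks share the same risk ratio, so their risk differences (and, here, their odds ratios) must differ.

```python
def effect_measures(risk_control, risk_treated):
    """Return (risk difference, risk ratio, odds ratio) for one subgroup."""
    def odds(r):
        return r / (1 - r)

    rd = risk_treated - risk_control
    rr = risk_treated / risk_control
    oratio = odds(risk_treated) / odds(risk_control)
    return rd, rr, oratio


# Hypothetical subgroup A: baseline risk 10%, risk under treatment 5%.
rd_a, rr_a, or_a = effect_measures(0.10, 0.05)

# Hypothetical subgroup B: baseline risk 40%, risk under treatment 20%.
rd_b, rr_b, or_b = effect_measures(0.40, 0.20)
```

Both subgroups have a risk ratio of one half, yet the risk difference is four times larger in magnitude in subgroup B, so whether the effect appears to be modified depends on the scale chosen.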
Conclusions
In conclusion, the framework laid out in this article provides a way to quantitatively assess the impact of reporting only the SATE when there is disparity in representation across population subgroups. Estimators that borrow strength across subgroups, such as the RATE estimator, can reduce the inequitable impact at the data analysis stage. Ultimately, structural change regarding data collection and funding priorities is needed to address systemic disparities in sample representation [50].
Availability of data and materials
Source code for simulations and RATE estimation is publicly available on GitHub at https://github.com/knieser/RATE.
Abbreviations
ATE: Average treatment effect
SATE: Sample average treatment effect
RATE: Representation-adjusted average treatment effect
SD: Standard deviation
MSE: Mean-squared error
RMSE: Root mean-squared error
References
Dresser R. Wanted single, white male for medical research. Hast Cent Rep. 1992;22(1):24–9.
Meltzer LA, Childress JF. What Is Fair Participant Selection? In: Emanuel EJ, Grady CC, Crouch RA, Lie RK, Miller FG, Wendler DD, editors. The Oxford Textbook of Clinical Research Ethics. Oxford Textbook Ser. New York: Oxford University Press; 2008. p. 377–85.
US Food and Drug Administration. Guideline for the study and evaluation of gender differences in the clinical evaluation of drugs; notice. Fed Regist. 1993;58(139):39406–16.
National Institutes of Health. NIH guidelines on the inclusion of women and minorities as subjects in clinical research. Fed Regist. 1994;59:1408–13.
National Institutes of Health. NIH policy and guidelines on the inclusion of women and minorities as subjects in clinical research. 2001. https://grants.nih.gov/policy/inclusion/womenandminorities/guidelines.htm. Accessed 6 Dec 2023.
Geller SE, Koch A, Pellettieri B, Carnes M. Inclusion, Analysis, and Reporting of Sex and Race/Ethnicity in Clinical Trials: Have We Made Progress? J Wom Health. 2011;20(3):315–20.
Geller SE, Koch AR, Roesch P, Filut A, Hallgren E, Carnes M. The More Things Change, the More They Stay the Same: A Study to Evaluate Compliance With Inclusion and Assessment of Women and Minorities in Randomized Controlled Trials. Acad Med. 2018;93(4):630–5.
National Institutes of Health. NIH RCDC Inclusion Statistics Report. 2022. https://report.nih.gov/risr/#/. Accessed 02 June 2022.
Zulman DM, Sussman JB, Chen X, Cigolle CT, Blaum CS, Hayward RA. Examining the evidence: a systematic review of the inclusion and analysis of older adults in randomized controlled trials. J Gen Intern Med. 2011;26(7):783–90.
Susukida R, Crum RM, Stuart EA, Ebnesajjad C, Mojtabai R. Assessing sample representativeness in randomized controlled trials: application to the National Institute of Drug Abuse Clinical Trials Network. Addiction. 2016;111(7):1226–34.
Cole SR, Stuart EA. Generalizing Evidence From Randomized Clinical Trials to Target Populations. Am J Epidemiol. 2010;172(1):107–15.
Degtiar I, Rose S. A review of generalizability and transportability. Ann Rev Stat Appl. 2023;10(1):501–24.
Stuart EA, Cole SR, Bradshaw CP, Leaf PJ. The use of propensity scores to assess the generalizability of results from randomized trials. J R Stat Soc Ser A (Stat Soc). 2011;174(2):369–86.
Pearl J, Bareinboim E. Transportability of causal and statistical relations: a formal approach. In: Twenty-fifth AAAI conference on artificial intelligence. Palo Alto: AAAI Press; 2011. p. 247–54.
Westreich D, Edwards JK, Lesko CR, Stuart E, Cole SR. Transportability of Trial Results Using Inverse Odds of Sampling Weights. Am J Epidemiol. 2017;186(8):1010–4.
Tipton E. Improving Generalizations From Experiments Using Propensity Score Subclassification: Assumptions, Properties, and Contexts. J Educ Behav Stat. 2013;38(3):239–66.
Chan W. Partially Identified Treatment Effects for Generalizability. J Res Educ Eff. 2017;10(3):646–69.
Dahabreh IJ, Robertson SE, Tchetgen EJ, Stuart EA, Hernán MA. Generalizing causal inferences from individuals in randomized trials to all trial-eligible individuals. Biometrics. 2019;75(2):685–94.
Kennedy L, Gelman A. Know your population and know your model: using model-based regression and poststratification to generalize findings beyond the observed sample. Psychol Methods. 2021;26:547–58.
Varadhan R, Segal JB, Boyd CM, Wu AW, Weiss CO. A framework for the analysis of heterogeneity of treatment effect in patient-centered outcomes research. J Clin Epidemiol. 2013;66(8):818–25.
Rothwell PM. Subgroup analysis in randomised controlled trials: importance, indications, and interpretation. Lancet. 2005;365(9454):176–86.
Assmann SF, Pocock SJ, Enos LE, Kasten LE. Subgroup analysis and other (mis)uses of baseline data in clinical trials. Lancet. 2000;355(9209):1064–9.
Pocock SJ, Assmann SE, Enos LE, Kasten LE. Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems. Stat Med. 2002;21(19):2917–30.
Simon R. Bayesian subset analysis: application to studying treatment-by-gender interactions. Stat Med. 2002;21(19):2909–16.
Henderson NC, Louis TA, Wang C, Varadhan R. Bayesian analysis of heterogeneous treatment effects for patient-centered outcomes research. Health Serv Outcome Res Methodol. 2016;16(4):213–33.
Greenland S. Principles of multilevel modelling. Int J Epidemiol. 2000;29(1):158–67.
Jones HE, Ohlssen DI, Neuenschwander B, Racine A, Branson M. Bayesian models for subgroup analysis in clinical trials. Clin Trials. 2011;8(2):129–43.
Wang C, Louis TA, Henderson NC, Weiss CO, Varadhan R. beanz: An R Package for Bayesian Analysis of Heterogeneous Treatment Effects with a Graphical User Interface. J Stat Softw. 2018;85(7):1–31.
Seamans MJ, Hong H, Ackerman B, Schmid I, Stuart EA. Generalizability of Subgroup Effects. Epidemiology. 2021;32(3):389–92.
Mehrotra ML, Westreich D, Glymour MM, Geng E, Glidden DV. Transporting Subgroup Analyses of Randomized Controlled Trials for Planning Implementation of New Interventions. Am J Epidemiol. 2021;190(8):1671–80.
Robertson SE, Steingrimsson JA, Joyce NR, Stuart EA, Dahabreh IJ. Estimating subgroup effects in generalizability and transportability analyses. Am J Epidemiol. 2022. https://doi.org/10.1093/aje/kwac036. Online ahead of print.
Robertson SE, Steingrimsson JA, Dahabreh IJ. Regression-based estimation of heterogeneous treatment effects when extending inferences from a randomized trial to a target population. Eur J Epidemiol. 2023;38:123–33. https://doi.org/10.1007/s10654-022-00901-5.
Rothman KJ, Gallacher JE, Hatch EE. Why representativeness should be avoided. Int J Epidemiol. 2013;42(4):1012–4.
Lehmann EL, Casella G. Theory of point estimation. New York: Springer Science & Business Media; 2006.
Davis CE, Leffingwell DP. Empirical Bayes Estimates of Subgroup Effects in Clinical Trials. Control Clin Trials. 1990;11(1):37–42.
Corbett-Davies S, Pierson E, Feller A, Goel S, Huq A. Algorithmic decision making and the cost of fairness. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. New York: Association for Computing Machinery; 2017. p. 797–806.
Zafar MB, Valera I, Gomez Rodriguez M, Gummadi KP. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In: Proceedings of the 26th international conference on world wide web. Geneva: International World Wide Web Conferences Steering Committee; 2017. p. 1171–80.
Popoviciu T. Sur les équations algébriques ayant toutes leurs racines réelles. Mathematica. 1935;9(129–145):20.
Bhatia R, Davis C. A better bound on the variance. Am Math Mon. 2000;107(4):353–7.
Hothorn T, Bretz F, Westfall P. Simultaneous Inference in General Parametric Models. Biom J. 2008;50(3):346–63.
Bates D, Mächler M, Bolker B, Walker S. Fitting Linear Mixed-Effects Models Using lme4. J Stat Softw. 2015;67(1):1–48.
R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria; 2022. https://www.R-project.org/. Accessed 6 Dec 2023.
Osypuk TL, Tchetgen EJT, Acevedo-Garcia D, Earls FJ, Lincoln A, Schmidt NM, et al. Differential mental health effects of neighborhood relocation among youth in vulnerable families: results from a randomized trial. Arch Gen Psychiatr. 2012;69(12):1284–94.
Gijsberts CM, Groenewegen KA, Hoefer IE, Eijkemans MJC, Asselbergs FW, Anderson TJ, et al. Race/Ethnic Differences in the Associations of the Framingham Risk Factors with Carotid IMT and Cardiovascular Events. PLoS ONE. 2015;10(7):e0132321. https://doi.org/10.1371/journal.pone.0132321.
Borrell LN, Elhawary JR, Fuentes-Afflick E, Witonsky J, Bhakta N, Wu AHB, et al. Race and Genetic Ancestry in Medicine – A Time for Reckoning with Racism. N Engl J Med. 2021. https://doi.org/10.1056/NEJMms2029562.
Polo AJ, Makol BA, Castro AS, Colón-Quintana N, Wagstaff AE, Guo S. Diversity in randomized clinical trials of depression: A 36-year review. Clin Psychol Rev. 2019;67:22–35. https://doi.org/10.1016/j.cpr.2018.09.004.
Paulus JK, Kent DM. Predictably unequal: understanding and addressing concerns that algorithmic clinical prediction may increase health disparities. npj Digit Med. 2020;3(1):1–8.
Lesko CR, Henderson NC, Varadhan R. Considerations when assessing heterogeneity of treatment effect in patient-centered outcomes research. J Clin Epidemiol. 2018;100:22–31. https://doi.org/10.1016/j.jclinepi.2018.04.005.
Rothman KJ, Greenland S, Lash TL, et al. Modern epidemiology. vol. 3. Philadelphia: Wolters Kluwer Health/Lippincott Williams & Wilkins; 2008.
Chen IY, Pierson E, Rose S, Joshi S, Ferryman K, Ghassemi M. Ethical machine learning in healthcare. Ann Rev Biomed Data Sci. 2021;4:123–44.
Acknowledgements
We thank Dr. Ronald Gangnon for feedback on an earlier draft of this manuscript.
Funding
No funding to report.
Author information
Contributions
Mr. Nieser and Dr. Cochran conceptualized the methods, Mr. Nieser derived mathematical results, performed simulation studies and the case study, and drafted the initial version of the manuscript. Dr. Cochran supervised and edited the manuscript. Both authors approved the final manuscript and agreed to submit for publication.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Supplementary material. This file contains additional mathematical details and support accompanying the main text, along with figures with results from the rest of the simulation studies and case study details.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Nieser, K., Cochran, A. Quantifying and reducing inequity in average treatment effect estimation. BMC Med Res Methodol 23, 297 (2023). https://doi.org/10.1186/s12874-023-02104-2
DOI: https://doi.org/10.1186/s12874-023-02104-2