Treatment effect estimation using the propensity score in clinical trials with historical control

Background Clinical trials assessing new treatment effects require a control group to compare the pure treatment effects. However, in clinical trials on regenerative medicine, rare diseases, and intractable diseases, it may be ethically difficult to assign participants to the control group. In recent years, the use of historical control data has attracted attention as a method for supplementing the number of participants in the control group. When combining historical control data with new randomized controlled trial (RCT) data, the assessment of heterogeneity using outcome data is not sufficient. Therefore, several statistical methods that consider participant outcomes and baseline characteristics, including the propensity score (PS) method have been proposed. Methods We propose a new method considering “information on whether the data are RCT data or not” in the PS model when combining the RCT and historical control data. The performance of the proposed method in estimating the treatment effect is evaluated using simulation data. Results When the distribution of covariates is similar between the RCT and historical control data, not much difference in performance is found between the proposed and conventional methods to estimate the treatment effect. On the other hand, when the distribution of covariates is not similar between the two kinds of data, the proposed method shows higher performance. Conclusions Even when it is not known whether RCT and historical control data are similar, the proposed PS model is useful to estimate the treatment effect appropriately in RCTs using historical control data. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-023-02127-9.


Introduction
Clinical trials that assess new treatment effects require a control group to compare the pure treatment effects, which exclude baseline characteristics [1].Randomized controlled trials (RCTs) are considered the gold standard approach in confirmatory trials for reducing bias and assessing objective effects.However, in clinical trials for regenerative medicine, rare diseases, and intractable diseases, random assignment of participants to the control group may be ethically difficult.Recently, there has been active collection of real-world data and construction of a disease registry [2][3][4], and the utilization of historical control data has attracted attention as a supplement for the number of control group participants in clinical trials.Appropriate utilization of historical control data can ensure that patients are offered promising treatments faster by reducing the number of participants assigned to control groups, thus accelerating drug development [5,6].The U.S. Food and Drug Administration has issued draft guidance on natural history studies for rare disease drug development [7], and further utilization of external control is expected [2][3][4].
The use of historical control data is still being debated [8][9][10].Frequentist approaches include, the pooling method in which historical control data are equated with the new trial control group and merged as is, and the test-then-pool method, which is used after determining the similarity between both outcome data by hypothesis test [11].Bayesian approaches include power priors [12] and hierarchical modeling [13,14], which discount the amount of information in historical control data [11,15].A previous study proposed a method that calculates the difference between outcome data of a new trial control group and historical control data and used weighting as an estimate of heterogeneity [16].Evaluation of heterogeneity with outcome data is useful, but not sufficient in situations with different measurement periods and conditions.Besides, the information from historical control data may distort the true results from new trials [15], or conversely, historical control data may be hardly used [16], which poses a large risk for implementation.
In the causal inference framework, propensity scores (PS) [17,18] may be used to compare groups that are not randomized.The PS indicates the probability of treatment allocation calculated using baseline characteristics.Thus, by aligning the baseline characteristics between treatment groups, it is possible to estimate the treatment effect while minimizing the effect of confounding on treatment allocation.When utilizing historical control data, a method using the PS has been proposed for considering the heterogeneity of baseline characteristics.
In general, the matching [19,20] and inverse probability of treatment weighting (IPTW) [21] methods are used as PS methods [22,23].Methods using PS to assess the generalizability of the population participating in RCT to the patient population [24], and to merge RCT data with observational data [25] have also been proposed.Additionally, a method combining the PS methods and Bayesian dynamic borrowing framework has been proposed [26].
Furthermore, as this study considers a special clinical trial that uses historical data in combination with new RCT data includes information on whether the data are RCT or historical control data.This information could be an important confounding factor along with baseline characteristics.Accordingly, we evaluate the performance of the method used for the clinical trial that newly considers "information on whether the data are RCT data or not" in the conventional PS model when estimating the treatment effect using simulation data.

Proposal of the PS model
In a clinical trial in which the primary endpoint is binary outcome Y (presence or absence of an event), we assume historical control data are combined with new twoarmed RCT data as part of a control group.Y i = 1 indi- cates that an event has occurred, and Y i = 0 indicates that no event has occurred with participant i (i = 1 . . .l) .We set T as the treatment group indicator ( T i = 1 for the treatment and T i = 0 for the control groups for partici- pant i ) and X as the vector of all covariates X j ( j = 1 . . .k and X ij denotes the j th covariate of participant i ), which are the possible confounding factors.When estimating the PS, a model would generally be expressed as using a logistic regression [27,28], where β j j = 1 . . .k denotes a coefficient of the regression model.
Here, we might consider the information on whether the data were derived from the new RCT or historical control data as an important confounding factor.Therefore, in the proposed method of this study, the PS model newly considers information on whether the data are RCT data or not and sets that information as indicator variable X r .X ir = 1 indicates that participant i is from the RCT, and X ir = −1 indicates that participant i is from the historical control group.As a proposed method including X r , the PS model could be expressed as We considered that the performance in estimating the treatment effect between the conventional method using π and proposed method using π * may vary due to the dif- ference in the distribution of covariates between the RCT and historical control data.In Stimulation study section, we evaluate the performance of the method using simulated data.
As a PS method, although the matching method is easy to understand, there is a possibility that the amount of information will be drastically reduced.In this study, we apply the IPTW method to utilize more information when evaluating the model's performance.When estimating the treatment effect, each participant's weight w could be w = T /expit(π ) + (1 − T )/ 1 − expit(π) in the conventional method and w = T /expit(π * ) + (1 − T )/ 1 − expit(π * ) in the proposed method. (1)

Settings
In this simulation study, to evaluate the treatment effect, we set the total number of participants as n = 900 and the allocation ratio between the RCT treatment group, RCT control group, and historical control group as 1:1:2.Moreover, we set the outcome event rates as 50%, 10%, and 5%; the odds ratios as 1.0, 2.0, 5.0, and 10.0; and the two-sided significance level as 5%.Furthermore, we also examined cases where the number of participants was small.The simulation results assuming the total number of participants as n = 200 are shown.The method and conditions in the simulation setting are the same as those shown in the setting assuming that n = 900, except for the total number of participants.The supplementary examination was conducted by assuming a situation with odds ratios of 1.5 and 2.5 (Additional file 1: Appendix A).In addition, we assume a situation wherein the allocation ratios are different (Additional file 1: Appendix B) and one of the four covariates is binary data (Additional file 1: Appendix C).We also conducted simulations in which the assignment of treatment variables was completely random in the RCT population (Additional file 1: Appendix D), and simulations were based on parameter settings from the actual clinical trial [29] (Additional file 1: Appendix G).To estimate the treatment effect, the IPTW using the PS method is applied, and the odds ratio based on the weight is estimated by the logistic regression model.
The performance measurements of the simulation result include the following: (1) difference of the estimated log odds ratio from the true log odds ratio (bias), (2) mean squared error (MSE), (3) coverage of 95% confidence interval (coverage), and (4) type I error rate and power.The simulation data are generated while assuming two scenarios wherein the distribution of covariates is either similar or not similar between the RCT and historical control data.

Scenario (I)
In this situation, the distribution of covariates is similar between the RCT and historical control data.From the multivariate standard normal distribution, four covariates are generated for participant i as (Additional file 1: Appendix E).Based on Eq. ( 4), each participant's treatment allocation is determined from the Bernoulli distribution: The model that generates outcome data y i is as follows: where {α 0 , α 1 , α 2 , α 3 , α 4 } = {a 0 , 0.274, 0.137, −0.137, 0.137} .Here, β treat is the true log odds ratio of the treatment effect, and the error term ε i ∼ N (0, 1) is generated according to inde- pendent normal distribution.Besides, a 0 is a constant correction value corresponding to the outcome event rate (Additional file 1: Appendix E).Based on Eq. ( 6), each participant's outcome Y i is determined from the Bernoulli distribution:

Scenario (II)
In this situation, the distribution of covariates is not similar between the RCT and historical control data.As with scenario (I), after generating covariates from the multivariate standard normal distribution, each covariate in the RCT data are transformed as follows: For historical control data, the covariates without transformation, X i1 , X i2 , X i3 , and X i4 , are simply used from the generation of standard multivariate normal distributions, that is, Here, the true PS model π * i,true is provided as is the coefficient value of indicator variable X r in the true PS model for each treatment allocation ratio (Additional file 1: Appendix F).These parameters are simultaneously calculated using a true PS model for only RCT data, (5) the true PS model for only historical control data, and a covariate of each participant (Additional file 1: Appendix F; calculation method).Based on Eq. ( 12), treatment allocation T i for each participant is determined from the Bernoulli distribution: Outcome data y i are generated by where {α 0 , α 1 , α 2 , α 3 , α 4 , α r } = {a 0 , 0.274, 0.137, −0.137, 0.137, 0.137} .Based on Eq. ( 15), each participant's outcome Y i is deter- mined from the Bernoulli distribution:

The usual number of participants
In scenario (I), wherein the distribution of covariates is similar between the RCT and historical control data, not much difference in the proposed and conventional methods was found in the bias, MSE, coverage of 95% confidence interval, and type I error (Table 1).
On the other hand, in scenario (II), wherein the distribution of covariates is not similar between the RCT and historical control data, the proposed method tended to have a smaller bias, coverage of 95% confidence interval closer to 95%, and a type I error rate closer to 5%.In addition, there was not much difference between the proposed and conventional methods in the MSE (Table 2).
When the allocation ratios between the RCT treatment group, RCT control group, and historical control group were 2:1:3 (Additional file 1: Appendix . B.10)-that is, extremely skewed-the bias and MSE had increased.
In addition, the same trends were observed for all performance measures when one of the four covariates was binary data (Additional file 1: Appendix Table C.1) as when the four covariates were continuous data.
Moreover, the simulation where the treatment variable in RCT population was generated independent of covariates (Additional file 1: Appendix Table D.1) shown also almost the same result in the text.In a simulation where the parameter settings of an actual clinical trial were applied (Additional file 1: Appendix Table G.1) was also similar result in the text.

Small number of participants
In the case where the total number of participants is n = 200, the same tendency was observed in all performance measurements as in the case where the number of participants is n = 900.
That is, in scenario (I), wherein the distribution of covariates is similar between the RCT and historical control data, not much difference in the proposed and conventional methods was found in the bias, MSE, coverage of 95% confidence interval, and type I error rate (Table 3).
And then, in scenario (II), wherein the distribution of covariates is not similar between the RCT and historical control data, the proposed method tended to have a smaller bias, coverage of 95% confidence interval closer to 95%, and a type I error rate closer to 5%.In addition, there was not much difference between the proposed and conventional methods in the MSE (Table 4).

Discussion
The results in this study suggest that a situation wherein the distribution of covariates is similar between the RCT and historical control data-that is, scenario (I)-the estimation bias of the treatment effect in the PS model would not be affected by including the information on whether the participant data is RCT data or not.On the other hand, a situation wherein the distribution of covariates is not similar between the RCT and historical control datathat is, scenario (II)-the use of the proposed PS method is recommended because the performance of estimating the treatment effect is improved by including the information on whether the participant data is RCT data or not.
As for the relationship between the outcome event rate and performance of estimating the treatment effect, it is considered appropriate that the higher the outcome event rate, the higher the performance of the estimation.Therefore, in the situation where the distributions of covariates are similar, the treatment effect could be estimated appropriately using both the proposed and conventional methods for this situation.Meanwhile, where    the distributions of covariates are not similar, a similar tendency is observed when using the proposed method, and so it is considered that the appropriate treatment effect can be estimated.However, in the conventional method, the lower the outcome event rate, the higher the performance that can be estimated, and so there is a possibility that the appropriate treatment effect cannot be estimated.Moreover, even when the allocation ratio between the RCT treatment group, RCT control group, and historical control group is changed, if the allocation ratio is not extremely skewed, the same consideration is possible as in the allocation ratio of 1:1:2 in this situation.Namely, in the situation where the distributions of covariates are similar, when considering the information on whether the data are RCT data or not in the PS model, the effect on the performance of estimating the treatment effect was not as marked.And also, in the situation where the distributions of covariates are not similar, the performance of estimating the treatment effect was improved by considering whether the data are RCT data or not.Meanwhile, when the allocation ratio was extremely skewed, bias and MSE increased tremendously, and the estimation could not be conducted appropriately.This is because the number of participants in the RCT control group was extremely small when the allocation ratio was extremely biased.
As another situation, even if the total number of participants is small or and the covariates include binary data, the same consideration is possible as that when the total number of participant is n = 900 and the covariates are all continuous data.The same trend is suggested when the treatment variables in the RCT population are considered completely independently and randomly from the covariates.In other words, when the distribution of covariates is similar between the RCT and historical control data, not much difference in performance is found between the proposed and conventional methods to estimate the treatment effect.And, when the distribution of covariates is not similar between the two kinds of data, the proposed method shows higher performance.In addition, the same argument as above can be considered to apply even when there is variation in data such as actual clinical trial data.
For these reasons, when combining the RCT and historical control data in the clinical trial setting, it is important to consider whether the distribution of important participant baseline characteristics that influence the outcomes is similar or not.Moreover, for appropriate utilization of historical control data, it is useful to apply the proposed PS model that considers X r while assessing pos- sible differences.However, when considering the utilization of historical control data to reinforce the number of participants in the RCT control group, it is necessary to simulate several patterns of allocation ratio and evaluate the performance of the allowable range of how small the control group can be from the planning stage of the clinical trial, and use this with caution.In addition, since the proposed method uses PS, the possibility of the presence of unmeasured confounding factors, that is, whether the covariates used in the PS model are sufficient, should also be considered.And, this method is assuming that use single historical control data set, and have limited that could not have considered for difference between two or more historical control data set.Furthermore, in this study, we focused on the treatment effect in the entire population, including historical control data, and investigated a method for estimating the Average Treatment Effect (ATE).However, there may be situations in which it is desirable to estimate the Average Treatment Effect on the Treated (ATT) in the RCT population or treatment group, and we would like to consider the performance evaluation in such cases to be a future issue.While paying attention to issues such as the increase in type I error rate, it is possible to appropriately reduce the number of participants assigned to the RCT control group.We believe that this will help improve the efficiency of clinical trials, solve ethical problems, and thus save more people.

Conclusions
In clinical trials utilizing historical control data, considering information on whether the data are RCT data or not in the proposed PS model is useful for appropriately estimating the treatment effect, even when it is not known whether the RCT data and the historical control data are similar.Promotion of appropriate utilization of historical control data will contribute to the realization of better medical care.

Table 1
Scenario (I): performance of the estimated propensity score (PS) model π (without X r ): the conventional method; π * (with X r ): the proposed method

Table 2
Scenario (II): performance of the estimated propensity score (PS) model π (without X r ): the conventional method, π * (with X r ): the proposed method

Table 3
Scenario (I): performance of the estimated propensity score (PS) model by simulation setting assuming n = 200 π (without X r ): the conventional method; π * (with X r ): the proposed method

Table 4
Scenario (II): performance of the estimated propensity score (PS) model by simulation setting assuming n = 200 π (without X r ): the conventional method, π * (with X r ): the proposed method