 Research article
Adaptive propensity score procedure improves matching in prospective observational trials
BMC Medical Research Methodology volume 19, Article number: 150 (2019)
Abstract
Background
Randomized controlled trials are the gold standard for clinical trials. However, randomization is not always feasible. In this article, we propose a prospective and adaptive matched case-control trial design, assuming that a control group already exists.
Methods
We propose and discuss an interim analysis step to estimate the matching rate using a resampling step followed by a sample size recalculation. The sample size recalculation is based on the observed mean resampling matching rate. We applied our approach in a simulation study and to a real data set to evaluate the characteristics of the proposed design and to compare the results to a naive approach.
Results
The proposed design achieves at least 10% higher matching rate than the naive approach at final analysis, thus providing a better estimation of the true matching rate. A good choice for the interim analysis seems to be a fraction of around \(\frac {1}{2}\) to \(\frac {2}{3}\) of the control patients.
Conclusion
The proposed resampling step in a prospective matched case-control trial design leads to an improved estimate of the final matching rate and thus to a gain in power through a sensible sample size recalculation.
Background
Randomized controlled designs are the gold standard for clinical trials. The advantage of randomly allocating patients to the treatment and control groups is the comparability of the patient groups. However, there are situations where a randomized trial is not applicable, for instance due to ethical concerns or practical reasons, but an observational trial is possible [1, 2]. An observational single-arm study might be an option, but has the disadvantage that a direct comparison to placebo or the standard therapy is not possible. When data on a control group recruited within an earlier study are available, an alternative is to use this external control group for comparison. Naively comparing study arms of different trials may lead to severe bias due to differences in patient characteristics. This lack of comparability can be addressed by matching procedures such as Optimal Matching or Propensity Score Matching [3, 4]. These methods aim to balance the groups with respect to the variables considered within the matching procedure. Combining a prospective single-arm study with an external control group by means of a matching approach is called a prospective matched case-control trial [5].
One essential part of planning a clinical trial is the calculation of the required sample size. In a prospective matched case-control trial, sample size calculation is not straightforward. The aim is to find an appropriate matching partner for as many patients of the already recruited study arm as possible. Usually, one cannot expect to find a matching partner for all patients in the control group when recruiting just the same number of patients in the treatment arm. Therefore, published trials fixed an additional percentage of intervention patients; for example, the trial of Charpentier et al. [5] recruited 1.3 times the number of control patients. If fewer patients than expected can be matched to one of the controls, the sample size is too small; if more can be matched, more patients than needed are recruited. In the following, the fraction of patients matched to one of the controls is called the matching rate.
This work is motivated by a real data example, the KEEP SIMPLEST trial [6]. The trial aimed to compare aspects of peri-interventional management in acute ischemic stroke (AIS) patients treated according to a new SOP (standard operating procedure) with patients who had been randomized into the conscious sedation group of the SIESTA trial [7]. A randomized controlled trial was not applicable, because the early stage application of the new method cannot be reproduced. Therefore, a prospective matched case-control design was planned. The matching rate was unknown and, as in most other trials, most likely less than 100%, meaning that recruiting just the same number of patients as in the external control arm would not yield a matching partner for every patient in the control group. One way to address this uncertainty about the matching rate is to perform an interim analysis to estimate the actual matching rate. The results of the interim analysis are then used to recalculate the sample size. This leads to an adaptive matched case-control design.
The recalculation might be done by using all available patients in the matching procedure (at interim and final analysis, respectively). Based on the matching rate observed in the interim analysis, the sample size is recalculated. This strategy will be called the naive method. In practice, we expect this method to overestimate the matching rate. As a consequence, fewer patients than necessary are recruited after the interim analysis, and a smaller matching rate is achieved at the final analysis. To avoid this overestimation, we propose a method for calculating the matching rate at interim analysis which is characterized by a resampling step and a recalculation of the sample size based on the mean resampling matching rate. The time point of this interim analysis needs to be fixed at the beginning of the trial. We conducted a simulation study to investigate the operating characteristics of the proposed approach and to develop a recommendation for the time point of the interim analysis.
In “Methods”, we explain the proposed method for calculating the sample size and the details of the conducted simulation study. Its results are described and followed by the application to the real data example in “Results”. The results are discussed and conclusions are drawn in “Discussion and conclusion”.
Methods
We propose an adaptive design for recalculating the sample size in a prospective matched case-control trial which is characterized by a resampling approach and a propensity score matching step.
Within the suggested design, two matching steps are conducted. At interim analysis, the matching rate is determined and used to recalculate the sample size needed to find a matching partner for all control patients in the final analysis. The matching procedure at interim analysis is used solely for calculating the matching rate; the final 1:1 matches are determined in the final analysis. To find pairs of treated and control patients, the propensity score method by Rosenbaum and Rubin [8, 9] is used. Propensity score matching aims to minimize the influence of observed and considered baseline characteristics on the treatment effect [10]. The propensity score e(X) is the conditional probability of being assigned to the treated study arm given (relevant) confounders X [10]. Assuming that there are n patients included, the propensity score is defined as
\[e(X_{i}) = P(Z_{i}=1 \mid X_{i}), \quad i=1,\ldots,n,\]
where Z_{i}∈{0,1} defines the group assignment and X_{i} is the vector of considered baseline characteristics.
The propensity score is estimated by using baseline characteristics as covariates in a logistic regression model with treatment status as outcome variable [9]:
\[\text{logit}(e(X_{i})) = \log \left(\frac{e(X_{i})}{1-e(X_{i})}\right) = \beta_{0} + \beta^{T} X_{i}.\]
This model provides the propensity scores, that is, the estimated probabilities of being assigned to the treatment group. The treatment and control patients are matched according to the logit of the estimated propensity score
\[\text{logit}(\hat{e}(X_{i})) = \log \left(\frac{\hat{e}(X_{i})}{1-\hat{e}(X_{i})}\right)\]
by using some caliper width of these estimates [4]. Austin recommends a caliper width of 0.2 of the standard deviation of the logit of the propensity score [10].
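The caliper matching described above can be illustrated with a minimal Python sketch. It is not the authors' implementation (the simulations in this article were run in R with the Matching package); it assumes that estimated propensity scores are already available, uses a simple greedy nearest-neighbour rule without replacement, and computes the standard deviation of the logit over both groups pooled. The function names caliper_match and matching_rate are hypothetical.

```python
import math
import statistics


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def caliper_match(ps_treated, ps_control, caliper_sd=0.2):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    Matching is done on the logit of the propensity score. A pair is only
    accepted if the distance is within caliper_sd standard deviations of
    the pooled logit scores (ASSUMPTION: pooled over both groups).
    Returns a list of (treated_index, control_index) pairs.
    """
    lt = [logit(p) for p in ps_treated]
    lc = [logit(p) for p in ps_control]
    caliper = caliper_sd * statistics.stdev(lt + lc)
    free = list(range(len(lc)))  # controls that are still unmatched
    pairs = []
    for i, x in enumerate(lt):
        if not free:
            break
        j = min(free, key=lambda k: abs(lc[k] - x))  # closest free control
        if abs(lc[j] - x) <= caliper:
            pairs.append((i, j))
            free.remove(j)
    return pairs


def matching_rate(pairs, n_control):
    """Fraction of control patients for whom a partner was found."""
    return len(pairs) / n_control
```

A greedy rule is only one possible choice; optimal matching, as mentioned in the Background, would minimize the total distance over all pairs instead.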
The already recruited study arm includes n_{control} patients. To determine the matching rate at the interim analysis, we resample the control patients. To further avoid an overestimation of the matching rate, equally sized groups are used for the matching procedure at interim analysis. That is, a sample of size n_{treated,interim} is drawn without replacement from the n_{control} patients. The matching step is then performed on the sampled set of patients, as explained in the following. The resampling and matching steps at interim analysis are repeated b times.
We calculate the mean resampling matching rate \(\overline {mr}\) and the lower limit of the 100·(1−α_{CI}) % confidence interval (CI) (using n_{control})
The total number of patients needed in the treated group is estimated by
In the following, this approach is called resampling CI method.
Another option would be to use a quantile of the distribution of the resampling matching rates which are independent of the number of patients in the control group. One would expect to observe a higher matching rate in trials with a large control arm, because a higher diversity of patients may be represented. Therefore, taking the number of control patients into account has the advantage of a smaller confidence interval for a larger number of control patients. For this reason, we stick with our proposed definition of the 100·(1−α_{CI})% CI.
Steps of the procedure at interim analysis:
Given entities:
b: the number of resampling steps.
n_{control}: the number of control patients in the already recruited study arm.
n_{treated,interim}: the number of treated patients at interim analysis.

Step 1: Repeat (a) to (d) b times:
(a) Sample n_{treated,interim} patients without replacement out of the control group.
(b) Calculate propensity scores for sampled control patients and treated patients.
(c) Conduct a 1:1 matching according to the logit of the propensity scores.
(d) Calculate the matching rate mr.

Step 2: Calculate the mean matching rate \(\overline {mr}\) of the b matching rates calculated in step 1.

Step 3: Calculate the lower limit of the 100·(1−α_{CI})% confidence interval using formula (2).

Step 4: Calculate the total number of treated patients needed for analysis as in formula (3).
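The steps above can be sketched in Python as follows. Formulas (2) and (3) are not reproduced in this text, so the sketch rests on two loudly stated assumptions: the lower limit is taken as a two-sided Wald-type limit based on n_{control}, and the total number of treated patients is taken as n_{control} divided by that lower limit. Both may differ from the authors' formulas. The callable count_pairs, which stands for steps (b) and (c), is a hypothetical placeholder.

```python
import math
import random
from statistics import NormalDist, mean


def interim_recalculation(treated, controls, count_pairs,
                          b=200, alpha_ci=0.01, seed=1):
    """Resampling step at interim analysis (illustrative sketch only).

    treated, controls: sequences of patient records.
    count_pairs(treated, sampled_controls): hypothetical callable that
    estimates propensity scores, performs the 1:1 matching and returns
    the number of matched pairs.
    """
    rng = random.Random(seed)
    n_interim = len(treated)
    n_control = len(controls)
    rates = []
    for _ in range(b):
        sampled = rng.sample(controls, n_interim)  # (a) without replacement
        pairs = count_pairs(treated, sampled)      # (b) and (c)
        rates.append(pairs / n_interim)            # (d) matching rate
    mr_bar = mean(rates)                           # step 2
    # Step 3. ASSUMPTION: Wald-type two-sided lower limit using n_control.
    z = NormalDist().inv_cdf(1.0 - alpha_ci / 2.0)
    lower = mr_bar - z * math.sqrt(mr_bar * (1.0 - mr_bar) / n_control)
    # Step 4. ASSUMPTION: treated patients needed = n_control / lower limit.
    n_total = math.ceil(n_control / lower) if lower > 0 else None
    return mr_bar, lower, n_total
```

In practice, count_pairs would wrap a propensity score model and a caliper matching routine; any function with the same signature can be plugged in.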

We conducted a simulation study with 10,000 replications of each scenario to assess the performance of our approach. First, we compare the resampling CI method for recalculating the sample size to the naive strategy. The second part covers the determination of the optimal time point for the interim analysis.
Simulation setting
General setting
The parameter values used in the simulation were chosen with the clinical example in mind (“Real data example” section). Some simplifications (e.g. fewer variables within the matching procedure) were made for the simulation study.
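The outcome variable is assumed to be binary, indicating some favourable event, and the corresponding hypotheses are
\[H_{0}: p_{\text{control}} = p_{\text{treated}} \quad \text{vs.} \quad H_{1}: p_{\text{control}} \neq p_{\text{treated}},\]
where p_{control} and p_{treated} are the true rates in the control and the treatment group, respectively.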
All distribution parameters and regression coefficients used to simulate the data are given in Table 1. The simulated data include three binary variables (incl. the group variable Z), one categorical variable, and two continuous variables. The variables are used to simulate the group assignment and the outcome variable, and they are also considered within the matching procedure.
First, two binary variables (X_{1},X_{2}) and one continuous variable (X_{3}) are sampled, which describe, for example, gender, diabetes (yes/no), and age.
The group assignment depends on the variables X_{1} and X_{3}. Therefore, in the next step, the group variable is simulated based on a logistic regression model using the baseline variables X_{1} and X_{3}.
Based on the group allocation (group is considered as Z in the following), two additional variables (X_{4} and X_{5}) are simulated. The variable X_{4} is an ordinal variable with 10 levels which represents the ASPECTS score here. The Alberta Stroke Program Early CT score (ASPECTS) is a tool for detecting early ischemic changes on noncontrast CT scans [11]. The variable X_{5} follows a normal distribution and describes the NIHSS here. The National Institutes of Health Stroke Scale (NIHSS) is a tool to assess stroke severity [12]. These two additional variables are sampled from different distributions (according to the group). By using a logistic regression model for the group allocation, as given in Table 1 for variable Z, and sampling clinical variables from different distributions, differences between the groups are simulated which can be addressed by the matching procedure. In order to simplify the simulation study, the variables X_{1} to X_{5} are assumed to be independent. However, in practice correlations may occur and should be considered when selecting the matching variables.
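The data-generating steps described above can be sketched in Python as follows. Table 1 is not reproduced in this text, so every distribution parameter and regression coefficient in the sketch is a hypothetical placeholder chosen only for illustration; the function names are hypothetical as well. Only the structure follows the description: X_{1}, X_{2}, and X_{3} are drawn first, Z depends on X_{1} and X_{3} through a logistic model, and X_{4} and X_{5} are drawn from group-specific distributions.

```python
import math
import random


def inv_logit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_patient(rng):
    """Draw one patient; all numeric values are hypothetical placeholders."""
    x1 = int(rng.random() < 0.5)   # binary, e.g. gender
    x2 = int(rng.random() < 0.2)   # binary, e.g. diabetes (yes/no)
    x3 = rng.gauss(70.0, 10.0)     # continuous, e.g. age
    # Group assignment Z depends on X1 and X3 only (logistic model).
    p_treated = inv_logit(-0.2 + 0.5 * x1 + 0.03 * (x3 - 70.0))
    z = int(rng.random() < p_treated)
    # X4: ordinal with 10 levels, drawn from a group-specific distribution.
    levels = list(range(1, 11))
    weights = list(range(1, 11)) if z == 1 else [5] * 10
    x4 = rng.choices(levels, weights=weights, k=1)[0]
    # X5: normally distributed with a group-specific mean.
    x5 = rng.gauss(17.0 if z == 1 else 15.0, 5.0)
    return {"x1": x1, "x2": x2, "x3": x3, "z": z, "x4": x4, "x5": x5}


def simulate_cohort(n, seed=2019):
    """Simulate n independent patients with a reproducible seed."""
    rng = random.Random(seed)
    return [simulate_patient(rng) for _ in range(n)]
```

Because X_{4} and X_{5} differ between the groups while Z depends on X_{1} and X_{3}, the two simulated arms are imbalanced in a way that a matching procedure can address.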
The propensity score is estimated by using the baseline variables X_{2}, X_{3}, and X_{5} as covariates in a logistic regression model with group as outcome variable (Z). This model includes the baseline variable X_{2}, which is not part of the true group model. This leads to a misspecification of the propensity score model. However, the true model is usually not known, and this setting therefore avoids overly optimistic simulation results.
The parameter α_{CI} is set to α_{CI}∈{0.01,0.05,0.1}; hence, the resampling CI method is evaluated for 99%, 95%, and 90% confidence intervals in this simulation study.
For testing the null hypothesis at the end of the trial, the McNemar test for paired data is used to account for the matched design.
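As an illustration of the test for matched pairs, the following Python sketch computes the McNemar statistic from the two discordant pair counts. The text does not state whether a continuity correction was applied, so the uncorrected chi-square version is shown as an assumption; the function name mcnemar_test and its arguments are hypothetical.

```python
import math


def mcnemar_test(n01, n10):
    """Uncorrected McNemar test for paired binary outcomes.

    n01: pairs with the event in the treated patient only.
    n10: pairs with the event in the control patient only.
    Returns the chi-square statistic (1 degree of freedom) and its p-value.
    """
    n = n01 + n10
    if n == 0:
        return 0.0, 1.0  # no discordant pairs, no evidence of a difference
    stat = (n01 - n10) ** 2 / n
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

Concordant pairs do not enter the statistic, which is why only pairs with differing outcomes carry information about the treatment effect in this test.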
Fixed time point, varying number of control patients
We start with a given number of patients in the control group n_{control} and a fixed fraction t of patients for the interim analysis. We set t=0.5 and
The number of patients needed to show the simulated effect with a power of 80% at a type I error rate of 5% would have been 142 per group. We investigate underpowered as well as overpowered scenarios. Underpowered situations occur when the effect expected in the existing trial, from which our control group is taken, was larger than the effect expected in the new trial. In cases with a smaller expected effect or multiple primary hypotheses in the existing trial, we may face an overpowered scenario. At interim analysis, we calculated the matching rate on b=200 resampling sets of size n_{treated,interim}. Using the simulated data and performing the steps described above, we compare the proposed method with the naive approach. To evaluate the properties of the two approaches, we assessed the matching rate at final analysis, the recruited sample size n_{treated,final}, and the type I error and power at final analysis.
Time point of interim analysis
The starting point is a fixed number of patients in the control group, n_{control}, used as one arm of our controlled trial. The confidence level for the resampling CI method is set to 99% (α_{CI}=0.01). The “time points” s considered for the interim analysis are
For recalculation of the sample size, the resampling CI method as well as the naive approach are used. Our recommendation will be based on the evaluation of the matching rate, the recruited sample size n_{treated}, as well as the type I error and power at final analysis.
We considered a small (n_{control}=50), medium (n_{control}=150), and a large (n_{control}=500) sample size in the control group to identify the influence on the time point of interim analysis. The regression coefficient β_{Z,outcome} varies between the considered sample sizes to obtain 80% power within each scenario:
For the small sample size, the considered time points of interim analysis start at \(\frac {1}{3} \cdot n_{\text {control}}\) and for the medium sample size at \(\frac {1}{4} \cdot n_{\text {control}}\); this is due to problems in finding matching partners if fewer than 15 patients are included in the matching procedure at the interim analysis step.
All simulations were done in R version 3.4.3 with the packages Matching (using the function Match) and boot (using the function inv.logit) [13–15].
Results
Simulation results
First, the results using a fixed time point but varying the number of control patients are shown and discussed, followed by the results for the time point of interim analysis which is assessed for both described methods.
Fixed time point, varying number of control patients
Comparing the matching rate curves at interim and final analysis within the naive approach, one observes that the matching rate at interim analysis is higher in all scenarios. The consequence of an overestimation of the matching rate at interim analysis is the recruitment of a too small number of patients. If the matching rate is smaller at final analysis (overestimation of matching rate at interim analysis), this results in a loss of power (Figs. 1, 2).
Our proposed method uses equal sample sizes for the matching procedure at interim analysis, which underestimates the true matching rate. Therefore, more patients are recruited (Fig. 3) and a higher matching rate is achieved at the final analysis. Hence, more matched pairs are included in the final analysis, which results in a higher power.
For the naive method, we observed a dependency between the matching rate and the number of patients in the control group: the matching rate grows with the number of patients in the control group. In contrast, for our proposed method the matching rate at final analysis stays at a constant level, with a mean matching rate between 91.7% and 91.8% for α_{CI}=0.01.
Here, the required number of patients per group in a fixed design would have been n=142. For the proposed design, 80% power is reached for n_{control}≈150. Thus, a power of 80% is achieved requiring only slightly more patients in the control group than would have been in a fixed randomized design (Fig. 2). This higher number of control patients is caused by the matching rate which is lower than 100% (Fig. 1).
The type I error rate is approximately 5% (between 4.37% and 5.72%) for all scenarios and both methods. As expected, no difference between the two methods with respect to the type I error is observed (Fig. 4).
Varying the confidence level within the resampling CI method results in small differences in the mean lower CI limit of the matching rate. The mean lower CI limit of the matching rate at interim analysis increases slightly with increasing α_{CI}, that is, with decreasing confidence level (Table 2). This increase leads to a slightly lower number of recruited patients for lower confidence levels. For α_{CI}=0.05, the mean recruited sample size in the treatment group is around 4 patients higher than for α_{CI}=0.1, and for α_{CI}=0.01 it is another 8 patients higher; for details see Table 3.
Time point of interim analysis
In this section, we only consider the results for a medium sample size in the control arm (n_{control}=150), as the simulations for small and large sample sizes show comparable results.
Using the naive method, we observe a matching rate close to 100% for early time points of the interim analysis, but in the final analysis it is less than 85% (Fig. 5). Even for later time points, the matching rate is lower than 90%, and as a consequence the power is less than 80% for all considered time points (Fig. 6). The total sample size is lowest for the early time point (Fig. 7) because the matching rate at interim analysis is highest for this time point. As expected, the type I error rate is around 5% (Fig. 8).
Our proposed method uses equally sized groups at the interim analysis. When performing an early interim analysis, the matching rate is very low and a high number of patients need to be additionally recruited for the final analysis. Comparing the matching rate at the final analysis between the different time points, the gain in the matching rate, and therefore in power, is small when performing an early interim analysis. With an increasing number of patients at interim analysis, the matching rate seems to converge, and the changes in the matching rate are very small when the number of control patients used at interim analysis is increased above 50% (Fig. 5). Taking the recruited sample size into account as well, a time point between \(\frac {1}{2}\) and \(\frac {2}{3}\) of the control patients seems to be a good choice as a trade-off between matching rate and sample size (Fig. 7). The matching rate lies between 90.7% and 93.4% for all considered time points. In all scenarios, the achieved power is around 80% and the type I error rate around 5% (Figs. 6, 8).
It seems that a later interim analysis could be a good choice for small sample sizes in the control group, and an earlier time point for large sample sizes. Using only 50% of the control patients at the interim analysis in small trials leads to a low absolute number of patients, which underestimates the matching rate. For large sample sizes, a smaller proportion of control patients already leads to a good estimate of the matching rate. The results are shown in Additional files 1 and 2 (Figures S1 to S8).
Real data example
The KEEP SIMPLEST trial aimed to compare aspects of peri-interventional management in AIS patients treated according to a new SOP (standard operating procedure) with patients who had been randomized into the conscious sedation (CS) group of the SIESTA trial [7].
The CS group of the SIESTA trial includes 77 patients, but only 73 were considered in the matching analysis due to missing values in matching variables. The study protocol intended an interim analysis after 50 patients to estimate the matching rate. The actually recruited number of treated patients at interim analysis was 51. We applied the resampling CI method for the recalculation of the sample size using 200 resampling steps. To compare the two methods here, we additionally conducted the analysis using the naive method.
Within the matching procedure, four baseline variables were considered: Age, NIHSS on admission, premorbid mRS, and the ASPECTS score. The propensity score was estimated by a logistic regression model. For the propensity score matching, a caliper width of 0.2 of the standard deviation of the propensity score was used. The interim analysis showed the following results:
The extrapolation of \(\overline {mr}=0.461\) resulted in a total sample size of 161. The KEEP SIMPLEST data set consists of 154 patients with complete data (161 patients were included in the trial but 7 patients had a missing ASPECTS score). The matching procedure reached 0.945 matching rate in the final analysis, hence 69 pairs were found and analyzed.
The naive method would result in a total sample size of 122. Using only 122 patients in the treated group for the second matching procedure would have resulted in 63 matched pairs and hence a matching rate of 0.863. Thus, in our real data example, the resampling CI method achieves a matching rate in the final analysis that is 8.2 percentage points higher.
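The reported matching rates follow directly from the number of matched pairs and the 73 control patients considered in the matching analysis. As a short worked check (the subscripts are labels introduced here for illustration only):

```latex
\[
mr_{\text{resampling CI}} = \frac{69}{73} \approx 0.945, \qquad
mr_{\text{naive}} = \frac{63}{73} \approx 0.863, \qquad
\frac{69-63}{73} = \frac{6}{73} \approx 0.082 .
\]
```

The difference therefore corresponds to six additional matched pairs out of 73 controls.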
Discussion and conclusion
In this article, we propose to include an interim analysis step within a prospective matched case-control trial. We showed by simulations that the naive method might severely overestimate the matching rate at the interim analysis. This leads to a low matching rate at the final analysis and, therefore, to insufficient power. The resampling CI method avoids this overestimation, and the recalculation results in a higher number of treated patients. As a consequence, a higher matching rate can be achieved at the final analysis, which is associated with a gain in power. At the same time, this approach still leads to a reasonable sample size and is therefore a very efficient approach. An increase of the confidence level showed only a small influence on the sample size in the treatment group and on the matching rate at final analysis. The application to a real data example also showed that the resampling CI method leads to a very good matching rate in contrast to the naive approach.
Our proposed approach is a powerful technique to reach a good matching rate and a high power at the final analysis. The simulation study demonstrates the characteristics of the approaches based on a single model including different types of covariates. Even though more complex models are not evaluated in the simulations, a higher model complexity is not expected to strongly influence the performance of the approaches when model convergence is guaranteed. A higher degree of misspecification of the propensity score model would lead to a lower matching rate. However, this would be the case for both methods. A limitation of our approach is the very specific application area and the relatively high number of intervention patients to be included. Another limitation is the fact that the maximal sample size per group is limited by the number of patients in the control group, which leads to power restrictions. On the other hand, if a very large control group exists, for instance from a registry, our proposed method might underestimate the true matching rate at interim analysis, and one should consider a 1:k matching design. As a trade-off between matching rate, power, and sample size, we recommend performing the interim analysis at a proportion of \(\frac {1}{2}\) to \(\frac {2}{3}\) of the number of patients in the control group. A good estimate of the matching rate appears to depend more on the absolute number of patients at the interim analysis than on the relative number of control patients. Giving a recommendation for an absolute number of patients needed for the interim analysis independent of the trial size is difficult, because this number would be limited by the trials with a small sample size. Nevertheless, we provide clear recommendations for situations that appear typical in clinical trials and therefore helpful instructions for clinical researchers in the planning stage of a trial.
Availability of data and materials
Any requests for simulated data should be directed to the corresponding author. The data of the real data example are not publicly available as they may contain information that could compromise research participant privacy.
Abbreviations
AIS: Acute ischemic stroke
ASPECTS: Alberta Stroke Program Early CT score
CI: Confidence interval
NIHSS: National Institutes of Health Stroke Scale
SOP: Standard operating procedure
References
1. Frakt A. An observational study goes where randomized clinical trials have not. JAMA. 2015; 313(11):1091–2.
2. Faraoni D, Schaefer S. Randomized controlled trials vs. observational studies: why not just live together? BMC Anesthesiol. 2016; 16(1):102.
3. Gu X, Rosenbaum P. Comparison of Multivariate Matching Methods: Structures, Distances, and Algorithms. J Comput Graph Stat. 1993; 2(4):405–20.
4. Rosenbaum P, Rubin D. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Stat. 1985; 39(1):33–8.
5. Charpentier P, Bogardus S, Inouye S. An algorithm for prospective individual matching in a nonrandomized clinical trial. J Clin Epidemiol. 2001; 54(11):1166–73.
6. Schönenberger S, Weber D, Ungerer MN, et al. The KEEP SIMPLEST Study: Improving In-House Delays and Peri-interventional Management in Stroke Thrombectomy - A Matched Pair Analysis. Neurocrit Care. 2019. https://doi.org/10.1007/s12028018006673.
7. Schönenberger S, Uhlmann L, Hacke W, et al. Effect of conscious sedation vs general anesthesia on early neurological improvement among patients with ischemic stroke undergoing endovascular thrombectomy: A randomized clinical trial. JAMA. 2016; 316(19):1986–96.
8. Rosenbaum P, Rubin D. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983; 70(1):41–55.
9. Rosenbaum P, Rubin D. Reducing bias in observational studies using subclassification on the propensity score. J Am Stat Assoc. 1984; 79(387):516–24.
10. Austin P. Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies. Pharm Stat. 2011; 10:150–61.
11. Barber P, Demchuk A, Zhang J, Buchan A. Validity and reliability of a quantitative computed tomography score in predicting outcome of hyperacute stroke before thrombolytic therapy. Lancet. 2000; 355(9216):1670–4.
12. Lyden P, Lu M, Levine S, Brott T, Broderick J. NINDS rt-PA Stroke Study Group. A modified National Institutes of Health Stroke Scale for use in stroke clinical trials: preliminary reliability and validity. Stroke. 2001; 32:1310–7.
13. R Core Team. R: A Language and Environment for Statistical Computing. Vienna; 2017.
14. Sekhon J. Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching Package for R. J Stat Softw. 2011; 42(7):1–52.
15. Canty A, Ripley B. boot: Bootstrap R (S-Plus) Functions. 2017. R package version 1.3-20.
Acknowledgments
The authors want to thank two reviewers, who committed their time and efforts to improve this manuscript by their very helpful comments.
Funding
We acknowledge financial support by Deutsche Forschungsgemeinschaft within the funding programme Open Access Publishing, by the Baden-Württemberg Ministry of Science, Research and the Arts and by Ruprecht-Karls-Universität Heidelberg.
Author information
Contributions
MK provided the initial methodological question which we addressed in this project. DW developed the methodology, conducted the simulation study and prepared the final manuscript in consultation with and under the supervision of MK and LU. SSch provided clinical support and the real data example. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Dorothea Weber.
Ethics declarations
Ethics approval and consent to participate
The trial of the real data example was approved by our institutional review board (Ethikkommission Medizinische Fakultät Heidelberg, ID S325/2015).
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file 1
Time point of interim analysis, small sample size. Figure S1 Mean matching rate for different time points of the interim analysis (n_{control}=50). Figure S2 Power for different time points of the interim analysis (n_{control}=50). Figure S3 Mean sample size in treated group for different time points of the interim analysis (n_{control}=50). Figure S4 Type I error for different time points of the interim analysis (n_{control}=50). (ZIP 17.9 kb)
Additional file 2
Time point of interim analysis, large sample size. Figure S5 Mean matching rate for different time points of the interim analysis (n_{control}=500). Figure S6 Power for different time points of the interim analysis (n_{control}=500). Figure S7 Mean sample size in treated group for different time points of the interim analysis (n_{control}=500). Figure S8 Type I error for different time points of the interim analysis (n_{control}=500). (ZIP 17.6 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Keywords
 Adaptive design
 Clinical Trials
 Sample size recalculation
 Matched cohort
 Prospective matching