Multivariate permutation test to compare survival curves for matched data
BMC Medical Research Methodology volume 13, Article number: 16 (2013)
In the absence of randomization, the comparison of an experimental treatment with respect to the standard may be done based on a matched design. When there is a limited set of cases receiving the experimental treatment, matching of a proper set of controls in a non fixed proportion is convenient.
In order to deal with the highly stratified survival data generated by multiple matching, we extend the multivariate permutation testing approach, since standard nonparametric methods for the comparison of survival curves cannot be applied in this setting.
We demonstrate the validity of the proposed method with simulations, and we illustrate its application to data from an observational study for the comparison of bone marrow transplantation and chemotherapy in the treatment of paediatric leukaemia.
The use of the multivariate permutation testing approach is recommended in the highly stratified context of survival matched data, especially when the proportional hazards assumption does not hold.
In clinical trials with right-censored failure time outcome, inference often focuses on the comparison of survival curves. When data from observational studies are used to explore the role of different treatments, the main problem is to limit the biases due to the lack of randomization. Matching on relevant baseline features can be used in order to increase the comparability between subjects treated with an experimental therapy and those receiving standard treatment. In settings where matching is done with a variable number of controls (multiple matching), highly stratified data are produced, with strata containing a few, possibly censored, observations. For this reason, the statistical comparison of survival in the two groups cannot be directly addressed by means of the usual nonparametric procedures, such as the log-rank test. The stratified version of these tests, which should account for matching, is inefficient when the number of strata increases and the stratum size is small . Furthermore, these methods are less sensitive when proportional hazard is not satisfied and this might often be the case, especially in the clinical setting that motivated this paper, i.e. the comparison of bone marrow transplantation and chemotherapy in the treatment of leukemia [2, 3].
The comparison of the survival curves for highly stratified data due to non-paired matching was addressed by Galimberti et al. , who proposed a weighted Kaplan-Meier estimator for the controls and a nonparametric permutation test on the survival difference at one pre-fixed time point. The aim of this work is to extend the comparison of survival at multiple time points, using the multivariate permutation approach originally introduced by Pesarin . In the proposed procedure, differently from Pesarin and Salmaso , a desirable feature is that there is no need to resort to the missing data framework in order to deal with censoring.
Our method is compared with the stratified log-rank test, the modified log-rank test for highly stratified data of Schoenfeld and Tsiatis  and the Cox model with a sandwich robust standard error for the treatment effect [6, 7]. There were no natural competitors among other permutation tests on survival proposed in the literature [8–10].
The paper is organized as follows. In the next section, we show how the multivariate permutation approach can be extended to the survival matched data setting. We investigate the performance of the proposed test through simulation studies that are presented in the Simulation section and we illustrate the method, using data from the motivating study on childhood leukemia , in the Application section. Finally, we provide a discussion and concluding remarks.
Multivariate permutation tests for survival matched data
In the context of non randomized studies, especially in rare diseases, where only selected patients undergo experimental therapies, matching is an approach to identify a proper set of controls for an unbiased comparison. This was the case of a multicenter study conducted in Italy on the role of allogeneic Bone Marrow Transplantation (BMT) from matched sibling donors in children with acute lymphoblastic leukaemia (ALL) who, by presenting features, have a dismal prognosis . Matching produced a stratum for each transplanted patient, that included as many controls as available, in order to recover the maximum amount of information in this rare subgroup.
More formally, this setting presents finely stratified data: each of the k strata (j = 1,…,k) has 1+mj subjects, where the first is the case belonging to the experimental arm (group 1) and the remaining mj are the matched subjects in the control arm (group 2).
The available data refer to the potential survival and censoring times. We assume that, conditionally on the observed j-th stratum, the survival times of the mj controls are i.i.d., but we cannot exclude, due to matching, some dependence between the case and the corresponding controls. Moreover, the usual assumption that censoring is independent of the failure mechanism and of the treatment must hold within stratum. Across strata, both cases and controls are independent, but not necessarily identically distributed. Finally, under the null hypothesis, the distribution of the survival times of the case and the matched controls in each stratum is invariant under permutations.
The multidimensional permutation approach introduced by Pesarin is here adapted to the context of survival analysis with matched data for the comparison of the marginal survival distributions of the two groups, S1(t) and S2(t). The procedure is based on a nonparametric combination of multiple, possibly dependent, univariate permutation tests. It relies on the following main steps: reformulate the inferential question of interest into a set of sub-questions, set up the partial tests and set up the form for their combination .
Firstly, the null and the alternative hypothesis are decomposed into a finite set of q sub-hypotheses H0i and H1i (i=1,…,q) with the property that, if H0 is true, all the H0i are jointly true, while if H1 is true, at least one sub-hypothesis is true. The standard hypotheses on the survival distributions are thus rephrased into q sub-hypotheses H0i, H1i (i=1,…,q) so that:
H0: S1(t)=S2(t) becomes H0:with H0i: S1(ti)=S2(ti),H1: S1(t)≠S2(t) becomes H1: with H1i: S1(ti)≠ S2(ti).
Appropriate partial tests Ti 1 are then performed separately on each q sub-hypothesis H0i (i=1,…,q). We considered two versions of test statistics that are marginally unbiased and consistent, as required for the validity of the procedure . They are based on the distance of the two survival curves or on their complementary log-log transformation  estimated at q different times points (t1,.,ti,.,tq):
where is the usual Kaplan-Meier estimate of the survival distribution in group 1, while is a weighted version, with stratum specific weights used in group 2 to account for the variable number of subjects in each stratum . If we indicate with dj(u) and rj(u) the number of events and the number of subjects at risk at time u in stratum j, and with wj =1/mj the weight applied to the mj controls of the stratum, the formula for is:
which can also have a product integral representation for continuous time variables. This estimator has an intuitive appeal because it can be viewed as a Kaplan-Meier estimator computed on k components, one for each stratum, in which all the mj controls are given the same weight. The imbalance in prognostic factors induced by multiple matching is adjusted giving less weight to individuals from strata that are over-represented and vice versa. The idea behind this estimator is to standardize the survival curve of controls, using the population of cases as reference. The weighted Kaplan-Meier estimator is an unbiased estimator for S2, characterized by a Gaussian asymptotic distribution with a variance that can be consistently estimated via bootstrap .
The first order tests are performed by considering the permutation distribution of the estimated distances TSi 1 and TLSi 1. When the cardinality of the permutations is large, due to the large number of strata and to their size, it is worthwhile approximating the permutation distribution of the q test statistics with a Monte Carlo strategy that considers B random samples from the population of all possible permutations. Each of the B permutation samples is obtained by assigning, within each observed stratum, one subject to the experimental treatment and the remaining mj to the standard treatment. The q first order tests are then calculated on each permuted sample and, at each time point, the partial p-values λi’s for the q observed TSi 1 or TLSi 1 (i=1,…,q) are obtained from the empirical permutation distribution. Starting from the q permutation distributions, the p-values are also calculated for the first order test statistics of each permuted sample.
In order to test the global null hypothesis H0, the partial tests Ti 1 are subsequently combined in a unidimensional second-order test statistic T2. This combination has to be done nonparametrically, since the q univariate tests are dependent, and it is applied to the p-values λi, which are permutationally equivalent to the partial tests . Two alternative functions proposed by Fisher and Tippett, respectively, are considered, which can be applied on p-values either from TSi 1 or TLSi 1 as follows:
As the Tippett combinig function can be written as , it has a monotonic relationship with the well known function used by Westfall and Young  in their resampling approach to multiple testing.
While the q p-values λi are obtained from the marginal permutation distribution of the first order tests, the p-value of the global test is derived from the permutation distribution of the second order test, which is related to the q-variate distribution in the first step. Note that, no further permutations are needed at this stage, because we use, for the global test, the results on the p-values obtained from each permuted sample that define the permutation distributions of the first order tests .
In the survival setting the definition of the time points (t1,…,ti,…,tq) involved in the comparison may be relevant. We explored different strategies: i) fixed time points, ii) time points identified by percentiles of the overall event distribution and iii) all the observed event times included between the 10th and the 90th percentiles of the overall event distribution.
The performance of these tests in the finite sample situation are evaluated through simulations. The size and the power are calculated for different alternatives under proportional hazards and for three different scenarios of non-proportionality (e.g. survival curves showing both an early and a late difference or crossing each other). The behavior of the proposed tests is also explored in the presence of different number of strata (k=30, 50, 100) and of different degree of strata heterogeneity (no or low strata effect). The dimension of each stratum is defined by one case plus the number of controls generated by a Poisson (μ=4)+1 (1 is added in order to guarantee at least 1 control in each stratum).
In the simulations under H0, the case and the matched controls share the same conditional piecewise exponential hazard function: λj(t|αj,bj)= αj[b1I(t≤L)+ b2I(t>L)]. The randomness of λj is determined by αj~G[1/w,w], where G(a,b) is a Gamma distribution with mean ab and variance ab2, and the parameter w indicates the level of heterogeneity across strata (w=0, 0.22; specifically, for w=0, no strata effect is assumed as αj=1 with probability 1). Different scenarios are obtained by giving specific values to the landmark L, to b1 and b2. Under the alternative hypothesis, the hazard function for cases was decreased in order to define a survival which differs from that of controls, as specified in Figure 1. Censoring was uniform over 3–6 in order to have approximately 28–38 percent censoring, on average. For all the configurations studied, the results are based on 1000 samples and each of them uses B=2000 Monte Carlo permutations in order to define the empirical permutation distributions.
Definitions of the q time points involved in the comparison are as follows: i) 8 or 4 fixed equi-spaced times from 0.5-1 to 4, ii) 9 time points identified by the 10th, 20th,., 90th percentiles of the overall event distribution, and iii) all the observed failure times between the 10th and the 90th percentiles of the overall event distribution.
The empirical level of the tests is shown in Tables 1.A (Fisher combining function) and 1.B (Tippett function), while the power in detecting a difference in survival is reported in Table 2.A-B, accordingly.
As expected, the performance of the test Under H0 is generally good in all settings. In particular, with a small number of strata, the test based on the survival distance behaves very well. When the complementary log-log transform is considered, the permutation distribution at early time points, such as the 10th percentile, may be very irregular because this transformation magnifies survival values near one. In order to have a good behavior even with small number of strata, it is sufficient to start from a higher percentile such as the 20th (data not shown).
Under H1, as expected, power increases with increasing number of strata under every scenario and the test has a good behavior also in the situation of crossing hazards. No major gain in power is achieved by increasing the number of fixed time points from 4 to 8 and similar results are obtained when points are fixed at percentiles of the event distribution or take 80% of all observed event times. The fixed time point strategy seems generally more satisfactory for the proportional hazards and the late difference scenarios. A clear advantage in the use of the complementary log-log transform is seen only in the early difference scenario; conversely the survival difference seems to offer an advantage in the crossing hazards setting. Again, by comparing Table 2.A and 2.B, where the best performers for each scenario are in bold, the performance under the Fisher and Tippet functions is very similar, except for an advantage of Tippett in the crossing hazards scenario.
Results on the comparison of our permutation tests with other approaches are reported in Table 3. The stratified log-rank test, the modified log-rank test by Schoenfeld-Tsiatis and the robust test from the Cox model including only the covariate for treatment have size close to the nominal level. The multivariate permutation tests perform better than the latter tests in terms of power in the crossing hazards and early difference scenarios, while they have a similar behavior in the presence of proportional hazards.
The clinical context that motivated this work is a retrospective multicentre study on 427 children enrolled between 1985 and 1994 in high-risk protocols for the treatment of ALL, where 30 received an allogeneic bone-marrow transplant after achieving first remission. In this setting, where transplant was considered as an experimental treatment to be administered only to selected patients, we used matching to define the appropriate group of children treated with chemotherapy alone to be used for an unbiased comparison. The matching procedure was based on 6 factors (centre, front-line treatment protocol, white blood cell count and age at diagnosis, immunophenotype and waiting time to transplant). For each of the 30 BMT cases, at least one non transplanted matched control was found, with a maximum of 17 and a median of 3, for a total of 130 controls. The Disease Free Survival (DFS) curves of the two groups depicted in Figure 2 were estimated using the standard and the weighted version of Kaplan-Meier for the transplant and the chemotherapy group, respectively. This plot indicates a long term advantage of transplant over chemotherapy after 1 year from transplant, with a DFS at 3 years of 59.1% (Greenwood s.e. 9.1) and 39.5% (bootstrap s.e. 7.7), respectively.
A local two-sided permutation test was originally used on the estimated DFS difference of 19.6% at the time point of 3 years, considered a meaningful time point according to clinical experience (p-value=0.112). At 3 years, 7 cases and 39 controls were still at risk and most of the overall number of events (97%) had already occurred, as shown by the event distribution below Figure 2. We have now applied the multivariate permutation approach in order to compare the two DFS curves with a global test. The Fisher and the Tippett tests are performed based both on the direct estimates of survival and on its complementary log-log transformation, according to different choices of time points involved in the comparison. Based on previous studies and knowledge on the treatment implications on toxicity and survival, the most likely alternative to the null is a non proportional setting with an early crossing and late differences in hazards. The result in term of significance (p-value=0.199) points to the Tippett test on log-survival involving 4 points (at 1, 2, 3 and 4 years from the origin). Other tests on 4 fixed time points, starting from 0.5 or 1 year (and not necessarily equispaced), are applied, and results are in keeping with the observation that curves start diverging at 1 year. The tests involving the percentiles and all the event times within the 10th and 90th percentile, give p-values ranging from 0.486 to 0.829 (Table 4).
Due to the relatively small sample size, the power is likely not adequate to detect the advantage given by transplantation, and this holds true even if we are applying a more appropriate test with regard to standard methods (stratified log-rank test, p-values=0.966; Cox model with robust standard error, p-values=0.915).
This paper provides a global test for the comparison of survival curves in the context of highly stratified data produced by matching on a non-fixed proportion, by extending the multivariate permutation approach proposed by Pesarin . The nonparametric combination of partial tests is able to capture their underlying dependence without assumptions and to control the level of significance of the global test, avoiding the issue of multiplicity. In principle, the proposed tests can be viewed as a weighted combination of an infinite number of time points, giving weight 1 to those involved in the comparison, and weight 0 otherwise. Yet, in survival analysis, the choice of the time points where partial tests are performed is a crucial issue. Our approach carefully considers this aspect by evaluating three different convenient strategies among the many possible. The most simple is the one that identifies a number of fixed time points. This is appealing for applications where the researcher has reasons for evaluating certain times that are relevant for the phenomenon. Importantly, given the degree of arbitrariness involved, the choice of the time points should be carefully done, possibly a-priori, in a time window where more information is expected, given the experimental setting and the follow-up. Typically, in time points near the boundaries, i.e. at the very beginning or on the tail of the survival curves, where few events are generally observed, the permutation distribution of the partial tests may be irregular. Following this consideration, alternative non subjective definitions of time points are based on the observed distribution of events, either with a limited number defined by percentiles, or on the whole set of event times within the 10th up to the 90th percentile.
As expected, the extended simulation study shows good performances of the proposed test in terms of alpha coverage. Under the alternative, apparently a slight advantage is obtained considering all event times for partial tests in the crossing hazard and early difference scenarios. We could generally recommend the Tippett transform with some caution on small number of strata where, to achieve a regular behaviour, it is however sufficient to start from a percentile higher than the 10th. The Fisher test is very similar, slightly better under proportional hazards, but worst in the crossing hazards setting. Differently from what found in a non permutation test by Klein et al. , we do not observe a general superiority of the complementary log-log transform of survival on survival, as distance measure, except in the early distance scenario. Actually, the Tippett transform behaves well using the survival distance even with small number of strata. While a standardized survival difference could have been considered, the expected gain in performance would not counterbalance the heavy additional computational effort needed. This because the standard error of the difference is based on a bootstrap estimate that should account for the complex form of dependence possibly induced by matching, that cannot be easily specified.
In the context of non matched data (or no strata effect), the proposed multivariate permutation test maintains a good performance for survival data comparison. Indeed, this approach could be useful even in non matched data in the presence of departures from the proportional hazards assumption, especially crossing hazards, where an overall test is still matter of development.
Conditions for the applicability of this approach are very mild. The appropriate application of the multivariate permutation testing approach relies on the assumptions on censoring being independent from the failure process and treatment within strata. While this could be considered a weakness, it is indeed the usual assumption which applies in survival when a correct study planning is adopted. The test relies also on the assumption of exchangeability under the null. In our context, where permutations are within strata, this latter assumption is reasonable due to matching. Finally, the test can easily be extended to more than one subject of the experimental arm in each stratum, modifying the weighting scheme of the Kaplan-Meier estimator of survival.
Other permutation tests were not suitable for extension to this context. Specifically, the proposals of both Pesarin-Salmaso  and of Heller-Venkatraman  do not account for stratified and possibly correlated data; the approach of Sun and Sherman  needs relatively large strata of independent observations, and the versatile test of Shih and Fay , while dealing with highly stratified matched data, behaves similarly to the log-rank test or its stratified version, depending on the level of inter-strata heterogeneity.
In conclusion, our approach is different from other permutation tests proposed for survival analysis [8–10], as it relies on the multivariate permutation approach originally introduced by Pesarin . It is recommended in the highly stratified context of matched data where it is proven by simulations to be at least equal or superior to the stratified log-rank test, the modified log-rank test for highly stratified data and a Cox model with robust variance estimate. Preference to this test should be definitely given when the proportional hazards assumption does not hold, even in non matched data. As a general indication, we would suggest the application on survival differences at fixed time points for the first order test and the Tippett transform for the global combining function. This test requires some computational effort, but it is, in our experience, quite feasible with ad-hoc routines (one in fortran, as well as a function in R, can be provided by the authors upon request).
Acute lymphoblastic leukaemia
Bone Marrow Transplantation
Schoenfeld DA, Tsiatis AA: A modified log-rank test for highly stratified data. Biometrika. 1987, 74: 167-175. 10.1093/biomet/74.1.167.
Galimberti S, Sasieni P, Valsecchi MG: A weighted Kaplan-Meier estimator for matched data with application to the comparison of chemotherapy and bone-marrow transplant in leukemia. Stat Med. 2002, 21: 3847-3864. 10.1002/sim.1357.
Logan BR, Klein JP, Zhang MJ: Comparing treatments in the presence of crossing survival curves: an application to bone marrow transplantation. Biometrics. 2008, 64: 733-740. 10.1111/j.1541-0420.2007.00975.x.
Pesarin F: Multivariate Permutation Tests: With Applications in Biostatistics. 2001, Wiley, Chichester
Pesarin F, Salmaso L: Permutation Tests for Complex Data: Theory, Applications and Software. 2010, Wiley, Chichester
Lee EW, Wei LJ, Amato DA: Cox-type regression analysis for large numbers of small groups of correlated failure time observations. Survival analysis: state of the art. 1992, Kluwer Academic, Dordrecht, 237-247.
Therneau TM, Grambsch PM: Modeling Survival Data: Extending the Cox Model. 2000, Springer-Verlag, New-York
Heller G, Venkatraman ES: A bootstrap procedure to compare two survival distributions in the presence of right censored data. Biometrics. 1996, 52: 1204-1213. 10.2307/2532836.
Sun Y, Sherman M: Some permutation tests for survival data. Biometrics. 1996, 52: 87-97. 10.2307/2533147.
Shih JH, Fay MP: A class of permutation tests for stratified survival data. Biometrics. 1999, 55: 1156-1162. 10.1111/j.0006-341X.1999.01156.x.
Uderzo C, Valsecchi MG, Balduzzi A, Dini G, Miniero R, Locatelli F, Rondelli R, Pession A, Arcese W, Bacigalupo A, Polchi P, Andolina M, Messina C, Conter V, Aricò M, Galimberti S, Masera G: Allogeneic bone marrow transplantation versus chemotherapy in high-risk childhood acute lymphoblastic leukemia in first remission. Br J Haematol. 1997, 96: 387-394. 10.1046/j.1365-2141.1997.d01-2033.x.
Klein JP, Logan B, Harhoff M, Andersen PK: Analyzing survival curves at a fixed point in time. Stat Med. 2007, 26: 4505-4519. 10.1002/sim.2864.
Westfall PH, Young SS: Resampling-Based Multiple Testing: Examples and Methods for P-value Adjustment. 1993, Wiley, New-York
The pre-publication history for this paper can be accessed here:http://www.biomedcentral.com/1471-2288/13/16/prepub
We are grateful to Prof. David Glidden and Prof. Fortunato Pesarin for their precious suggestions. We are also indebted with the two reviewers for the valuable comments that certainly improved the manuscript. This research was partially supported by a grant from the Associazione Italiana Ricerca Cancro (AIRC IG 5017) and the European Commission (ENCCA Project: FP7-HEALTH-F2-2011 261474).
The authors declare that they have no competing interests.
SG performed the simulation studies and drafted the manuscript. MGV supervised the analyses and contributed to the manuscript. All authors read and approved the final manuscript.
About this article
Cite this article
Galimberti, S., Valsecchi, M.G. Multivariate permutation test to compare survival curves for matched data. BMC Med Res Methodol 13, 16 (2013). https://doi.org/10.1186/1471-2288-13-16
- Highly stratified data
- Matched survival data
- Multiple matching
- Multivariate permutation tests