 Technical advance
 Open Access
 Published:
Multivariate permutation test to compare survival curves for matched data
BMC Medical Research Methodology volume 13, Article number: 16 (2013)
Abstract
Background
In the absence of randomization, the comparison of an experimental treatment with respect to the standard may be done based on a matched design. When there is a limited set of cases receiving the experimental treatment, matching of a proper set of controls in a non fixed proportion is convenient.
Methods
In order to deal with the highly stratified survival data generated by multiple matching, we extend the multivariate permutation testing approach, since standard nonparametric methods for the comparison of survival curves cannot be applied in this setting.
Results
We demonstrate the validity of the proposed method with simulations, and we illustrate its application to data from an observational study for the comparison of bone marrow transplantation and chemotherapy in the treatment of paediatric leukaemia.
Conclusions
The use of the multivariate permutation testing approach is recommended in the highly stratified context of survival matched data, especially when the proportional hazards assumption does not hold.
Background
In clinical trials with rightcensored failure time outcome, inference often focuses on the comparison of survival curves. When data from observational studies are used to explore the role of different treatments, the main problem is to limit the biases due to the lack of randomization. Matching on relevant baseline features can be used in order to increase the comparability between subjects treated with an experimental therapy and those receiving standard treatment. In settings where matching is done with a variable number of controls (multiple matching), highly stratified data are produced, with strata containing a few, possibly censored, observations. For this reason, the statistical comparison of survival in the two groups cannot be directly addressed by means of the usual nonparametric procedures, such as the logrank test. The stratified version of these tests, which should account for matching, is inefficient when the number of strata increases and the stratum size is small [1]. Furthermore, these methods are less sensitive when proportional hazard is not satisfied and this might often be the case, especially in the clinical setting that motivated this paper, i.e. the comparison of bone marrow transplantation and chemotherapy in the treatment of leukemia [2, 3].
The comparison of the survival curves for highly stratified data due to nonpaired matching was addressed by Galimberti et al. [2], who proposed a weighted KaplanMeier estimator for the controls and a nonparametric permutation test on the survival difference at one prefixed time point. The aim of this work is to extend the comparison of survival at multiple time points, using the multivariate permutation approach originally introduced by Pesarin [4]. In the proposed procedure, differently from Pesarin and Salmaso [5], a desirable feature is that there is no need to resort to the missing data framework in order to deal with censoring.
Our method is compared with the stratified logrank test, the modified logrank test for highly stratified data of Schoenfeld and Tsiatis [1] and the Cox model with a sandwich robust standard error for the treatment effect [6, 7]. There were no natural competitors among other permutation tests on survival proposed in the literature [8–10].
The paper is organized as follows. In the next section, we show how the multivariate permutation approach can be extended to the survival matched data setting. We investigate the performance of the proposed test through simulation studies that are presented in the Simulation section and we illustrate the method, using data from the motivating study on childhood leukemia [11], in the Application section. Finally, we provide a discussion and concluding remarks.
Methods
Multivariate permutation tests for survival matched data
In the context of non randomized studies, especially in rare diseases, where only selected patients undergo experimental therapies, matching is an approach to identify a proper set of controls for an unbiased comparison. This was the case of a multicenter study conducted in Italy on the role of allogeneic Bone Marrow Transplantation (BMT) from matched sibling donors in children with acute lymphoblastic leukaemia (ALL) who, by presenting features, have a dismal prognosis [11]. Matching produced a stratum for each transplanted patient, that included as many controls as available, in order to recover the maximum amount of information in this rare subgroup.
More formally, this setting presents finely stratified data: each of the k strata (j = 1,…,k) has 1+m_{j} subjects, where the first is the case belonging to the experimental arm (group 1) and the remaining m_{j} are the matched subjects in the control arm (group 2).
The available data refer to the potential survival and censoring times. We assume that, conditionally on the observed jth stratum, the survival times of the m_{j} controls are i.i.d., but we cannot exclude, due to matching, some dependence between the case and the corresponding controls. Moreover, the usual assumption that censoring is independent of the failure mechanism and of the treatment must hold within stratum. Across strata, both cases and controls are independent, but not necessarily identically distributed. Finally, under the null hypothesis, the distribution of the survival times of the case and the matched controls in each stratum is invariant under permutations.
The multidimensional permutation approach introduced by Pesarin is here adapted to the context of survival analysis with matched data for the comparison of the marginal survival distributions of the two groups, S_{1}(t) and S_{2}(t). The procedure is based on a nonparametric combination of multiple, possibly dependent, univariate permutation tests. It relies on the following main steps: reformulate the inferential question of interest into a set of subquestions, set up the partial tests and set up the form for their combination [4].
Firstly, the null and the alternative hypothesis are decomposed into a finite set of q subhypotheses H_{0i} and H_{1i} (i=1,…,q) with the property that, if H_{0} is true, all the H_{0i} are jointly true, while if H_{1} is true, at least one subhypothesis is true. The standard hypotheses on the survival distributions are thus rephrased into q subhypotheses H_{0i}, H_{1i} (i=1,…,q) so that:
H_{0}: S_{1}(t)=S_{2}(t) becomes H_{0}:{\displaystyle \underset{\text{i}=1}{\overset{\text{q}}{\cap}}{\text{H}}_{0\text{i}}}with H_{0i}: S_{1}(t_{i})=S_{2}(t_{i}),H_{1}: S_{1}(t)≠S_{2}(t) becomes H_{1}:{\displaystyle \underset{\text{i}=1}{\overset{\text{q}}{\cup}}{\text{H}}_{1\text{i}}} with H_{1i}: S_{1}(t_{i})≠ S_{2}(t_{i}).
Appropriate partial tests T_{i} ^{1} are then performed separately on each q subhypothesis H_{0i} (i=1,…,q). We considered two versions of test statistics that are marginally unbiased and consistent, as required for the validity of the procedure [4]. They are based on the distance of the two survival curves or on their complementary loglog transformation [12] estimated at q different times points (t_{1},.,t_{i},.,t_{q}):
TLS_{i} ^{1}=log{log[{\widehat{\mathrm{S}}}_{1}(t_{i})]}log{log[{\widehat{\mathrm{S}}}_{2}^{w}(t_{i})]}
where {\widehat{\mathrm{S}}}_{1}\left(.\right) is the usual KaplanMeier estimate of the survival distribution in group 1, while {\widehat{\mathrm{S}}}_{2}^{w}\left(.\right) is a weighted version, with stratum specific weights used in group 2 to account for the variable number of subjects in each stratum [2]. If we indicate with d_{j}(u) and r_{j}(u) the number of events and the number of subjects at risk at time u in stratum j, and with w_{j} =1/m_{j} the weight applied to the m_{j} controls of the stratum, the formula for {\widehat{\mathrm{S}}}_{2}^{w}\left(.\right) is:
which can also have a product integral representation for continuous time variables. This estimator has an intuitive appeal because it can be viewed as a KaplanMeier estimator computed on k components, one for each stratum, in which all the m_{j} controls are given the same weight. The imbalance in prognostic factors induced by multiple matching is adjusted giving less weight to individuals from strata that are overrepresented and vice versa. The idea behind this estimator is to standardize the survival curve of controls, using the population of cases as reference. The weighted KaplanMeier estimator {\widehat{\mathrm{S}}}_{2}^{w}\left(.\right) is an unbiased estimator for S_{2}, characterized by a Gaussian asymptotic distribution with a variance that can be consistently estimated via bootstrap [2].
The first order tests are performed by considering the permutation distribution of the estimated distances TS_{i} ^{1} and TLS_{i} ^{1}. When the cardinality {\displaystyle \prod _{\mathrm{j}=1}^{\mathrm{k}}{(\mathrm{m}}_{\mathrm{j}}}+1) of the permutations is large, due to the large number of strata and to their size, it is worthwhile approximating the permutation distribution of the q test statistics with a Monte Carlo strategy that considers B random samples from the population of all possible permutations. Each of the B permutation samples is obtained by assigning, within each observed stratum, one subject to the experimental treatment and the remaining m_{j} to the standard treatment. The q first order tests are then calculated on each permuted sample and, at each time point, the partial pvalues λ_{i}’s for the q observed TS_{i} ^{1} or TLS_{i} ^{1} (i=1,…,q) are obtained from the empirical permutation distribution. Starting from the q permutation distributions, the pvalues are also calculated for the first order test statistics of each permuted sample.
In order to test the global null hypothesis H_{0}, the partial tests T_{i} ^{1} are subsequently combined in a unidimensional secondorder test statistic T^{2}. This combination has to be done nonparametrically, since the q univariate tests are dependent, and it is applied to the pvalues λ_{i}, which are permutationally equivalent to the partial tests [4]. Two alternative functions proposed by Fisher and Tippett, respectively, are considered, which can be applied on pvalues either from TS_{i} ^{1} or TLS_{i} ^{1} as follows:
As the Tippett combinig function can be written as \underset{\phantom{\rule{0.25em}{0ex}}1\le \mathrm{i}\le \mathrm{q}}{1\mathrm{min}}\phantom{\rule{0.5em}{0ex}}{\lambda}_{i}, it has a monotonic relationship with the well known function \underset{\phantom{\rule{0.25em}{0ex}}1\le \mathrm{i}\le \mathrm{q}}{\mathrm{min}}\phantom{\rule{0.5em}{0ex}}{\lambda}_{i} used by Westfall and Young [13] in their resampling approach to multiple testing.
While the q pvalues λ_{i} are obtained from the marginal permutation distribution of the first order tests, the pvalue of the global test is derived from the permutation distribution of the second order test, which is related to the qvariate distribution in the first step. Note that, no further permutations are needed at this stage, because we use, for the global test, the results on the pvalues obtained from each permuted sample that define the permutation distributions of the first order tests [4].
In the survival setting the definition of the time points (t_{1},…,t_{i},…,t_{q}) involved in the comparison may be relevant. We explored different strategies: i) fixed time points, ii) time points identified by percentiles of the overall event distribution and iii) all the observed event times included between the 10^{th} and the 90^{th} percentiles of the overall event distribution.
Results
Simulations
The performance of these tests in the finite sample situation are evaluated through simulations. The size and the power are calculated for different alternatives under proportional hazards and for three different scenarios of nonproportionality (e.g. survival curves showing both an early and a late difference or crossing each other). The behavior of the proposed tests is also explored in the presence of different number of strata (k=30, 50, 100) and of different degree of strata heterogeneity (no or low strata effect). The dimension of each stratum is defined by one case plus the number of controls generated by a Poisson (μ=4)+1 (1 is added in order to guarantee at least 1 control in each stratum).
In the simulations under H_{0}, the case and the matched controls share the same conditional piecewise exponential hazard function: λ_{j}(tα_{j},b_{j})= α_{j}[b_{1}I(t≤L)+ b_{2}I(t>L)]. The randomness of λ_{j} is determined by α_{j}~G[1/w,w], where G(a,b) is a Gamma distribution with mean ab and variance ab^{2}, and the parameter w indicates the level of heterogeneity across strata (w=0, 0.22; specifically, for w=0, no strata effect is assumed as α_{j}=1 with probability 1). Different scenarios are obtained by giving specific values to the landmark L, to b_{1} and b_{2}. Under the alternative hypothesis, the hazard function for cases was decreased in order to define a survival which differs from that of controls, as specified in Figure 1. Censoring was uniform over 3–6 in order to have approximately 28–38 percent censoring, on average. For all the configurations studied, the results are based on 1000 samples and each of them uses B=2000 Monte Carlo permutations in order to define the empirical permutation distributions.
Definitions of the q time points involved in the comparison are as follows: i) 8 or 4 fixed equispaced times from 0.51 to 4, ii) 9 time points identified by the 10^{th}, 20^{th},., 90^{th} percentiles of the overall event distribution, and iii) all the observed failure times between the 10^{th} and the 90^{th} percentiles of the overall event distribution.
The empirical level of the tests is shown in Tables 1.A (Fisher combining function) and 1.B (Tippett function), while the power in detecting a difference in survival is reported in Table 2.AB, accordingly.
As expected, the performance of the test Under H_{0} is generally good in all settings. In particular, with a small number of strata, the test based on the survival distance behaves very well. When the complementary loglog transform is considered, the permutation distribution at early time points, such as the 10^{th} percentile, may be very irregular because this transformation magnifies survival values near one. In order to have a good behavior even with small number of strata, it is sufficient to start from a higher percentile such as the 20^{th} (data not shown).
Under H_{1}, as expected, power increases with increasing number of strata under every scenario and the test has a good behavior also in the situation of crossing hazards. No major gain in power is achieved by increasing the number of fixed time points from 4 to 8 and similar results are obtained when points are fixed at percentiles of the event distribution or take 80% of all observed event times. The fixed time point strategy seems generally more satisfactory for the proportional hazards and the late difference scenarios. A clear advantage in the use of the complementary loglog transform is seen only in the early difference scenario; conversely the survival difference seems to offer an advantage in the crossing hazards setting. Again, by comparing Table 2.A and 2.B, where the best performers for each scenario are in bold, the performance under the Fisher and Tippet functions is very similar, except for an advantage of Tippett in the crossing hazards scenario.
Results on the comparison of our permutation tests with other approaches are reported in Table 3. The stratified logrank test, the modified logrank test by SchoenfeldTsiatis and the robust test from the Cox model including only the covariate for treatment have size close to the nominal level. The multivariate permutation tests perform better than the latter tests in terms of power in the crossing hazards and early difference scenarios, while they have a similar behavior in the presence of proportional hazards.
Application
The clinical context that motivated this work is a retrospective multicentre study on 427 children enrolled between 1985 and 1994 in highrisk protocols for the treatment of ALL, where 30 received an allogeneic bonemarrow transplant after achieving first remission. In this setting, where transplant was considered as an experimental treatment to be administered only to selected patients, we used matching to define the appropriate group of children treated with chemotherapy alone to be used for an unbiased comparison. The matching procedure was based on 6 factors (centre, frontline treatment protocol, white blood cell count and age at diagnosis, immunophenotype and waiting time to transplant). For each of the 30 BMT cases, at least one non transplanted matched control was found, with a maximum of 17 and a median of 3, for a total of 130 controls. The Disease Free Survival (DFS) curves of the two groups depicted in Figure 2 were estimated using the standard and the weighted version of KaplanMeier for the transplant and the chemotherapy group, respectively. This plot indicates a long term advantage of transplant over chemotherapy after 1 year from transplant, with a DFS at 3 years of 59.1% (Greenwood s.e. 9.1) and 39.5% (bootstrap s.e. 7.7), respectively.
A local twosided permutation test was originally used on the estimated DFS difference of 19.6% at the time point of 3 years, considered a meaningful time point according to clinical experience (pvalue=0.112). At 3 years, 7 cases and 39 controls were still at risk and most of the overall number of events (97%) had already occurred, as shown by the event distribution below Figure 2. We have now applied the multivariate permutation approach in order to compare the two DFS curves with a global test. The Fisher and the Tippett tests are performed based both on the direct estimates of survival and on its complementary loglog transformation, according to different choices of time points involved in the comparison. Based on previous studies and knowledge on the treatment implications on toxicity and survival, the most likely alternative to the null is a non proportional setting with an early crossing and late differences in hazards. The result in term of significance (pvalue=0.199) points to the Tippett test on logsurvival involving 4 points (at 1, 2, 3 and 4 years from the origin). Other tests on 4 fixed time points, starting from 0.5 or 1 year (and not necessarily equispaced), are applied, and results are in keeping with the observation that curves start diverging at 1 year. The tests involving the percentiles and all the event times within the 10^{th} and 90^{th} percentile, give pvalues ranging from 0.486 to 0.829 (Table 4).
Due to the relatively small sample size, the power is likely not adequate to detect the advantage given by transplantation, and this holds true even if we are applying a more appropriate test with regard to standard methods (stratified logrank test, pvalues=0.966; Cox model with robust standard error, pvalues=0.915).
Discussion
This paper provides a global test for the comparison of survival curves in the context of highly stratified data produced by matching on a nonfixed proportion, by extending the multivariate permutation approach proposed by Pesarin [4]. The nonparametric combination of partial tests is able to capture their underlying dependence without assumptions and to control the level of significance of the global test, avoiding the issue of multiplicity. In principle, the proposed tests can be viewed as a weighted combination of an infinite number of time points, giving weight 1 to those involved in the comparison, and weight 0 otherwise. Yet, in survival analysis, the choice of the time points where partial tests are performed is a crucial issue. Our approach carefully considers this aspect by evaluating three different convenient strategies among the many possible. The most simple is the one that identifies a number of fixed time points. This is appealing for applications where the researcher has reasons for evaluating certain times that are relevant for the phenomenon. Importantly, given the degree of arbitrariness involved, the choice of the time points should be carefully done, possibly apriori, in a time window where more information is expected, given the experimental setting and the followup. Typically, in time points near the boundaries, i.e. at the very beginning or on the tail of the survival curves, where few events are generally observed, the permutation distribution of the partial tests may be irregular. Following this consideration, alternative non subjective definitions of time points are based on the observed distribution of events, either with a limited number defined by percentiles, or on the whole set of event times within the 10^{th} up to the 90^{th} percentile.
As expected, the extended simulation study shows good performances of the proposed test in terms of alpha coverage. Under the alternative, apparently a slight advantage is obtained considering all event times for partial tests in the crossing hazard and early difference scenarios. We could generally recommend the Tippett transform with some caution on small number of strata where, to achieve a regular behaviour, it is however sufficient to start from a percentile higher than the 10^{th}. The Fisher test is very similar, slightly better under proportional hazards, but worst in the crossing hazards setting. Differently from what found in a non permutation test by Klein et al. [12], we do not observe a general superiority of the complementary loglog transform of survival on survival, as distance measure, except in the early distance scenario. Actually, the Tippett transform behaves well using the survival distance even with small number of strata. While a standardized survival difference could have been considered, the expected gain in performance would not counterbalance the heavy additional computational effort needed. This because the standard error of the difference is based on a bootstrap estimate that should account for the complex form of dependence possibly induced by matching, that cannot be easily specified.
In the context of non matched data (or no strata effect), the proposed multivariate permutation test maintains a good performance for survival data comparison. Indeed, this approach could be useful even in non matched data in the presence of departures from the proportional hazards assumption, especially crossing hazards, where an overall test is still matter of development.
Conditions for the applicability of this approach are very mild. The appropriate application of the multivariate permutation testing approach relies on the assumptions on censoring being independent from the failure process and treatment within strata. While this could be considered a weakness, it is indeed the usual assumption which applies in survival when a correct study planning is adopted. The test relies also on the assumption of exchangeability under the null. In our context, where permutations are within strata, this latter assumption is reasonable due to matching. Finally, the test can easily be extended to more than one subject of the experimental arm in each stratum, modifying the weighting scheme of the KaplanMeier estimator of survival.
Other permutation tests were not suitable for extension to this context. Specifically, the proposals of both PesarinSalmaso [5] and of HellerVenkatraman [8] do not account for stratified and possibly correlated data; the approach of Sun and Sherman [9] needs relatively large strata of independent observations, and the versatile test of Shih and Fay [10], while dealing with highly stratified matched data, behaves similarly to the logrank test or its stratified version, depending on the level of interstrata heterogeneity.
Conclusions
In conclusion, our approach is different from other permutation tests proposed for survival analysis [8–10], as it relies on the multivariate permutation approach originally introduced by Pesarin [4]. It is recommended in the highly stratified context of matched data where it is proven by simulations to be at least equal or superior to the stratified logrank test, the modified logrank test for highly stratified data and a Cox model with robust variance estimate. Preference to this test should be definitely given when the proportional hazards assumption does not hold, even in non matched data. As a general indication, we would suggest the application on survival differences at fixed time points for the first order test and the Tippett transform for the global combining function. This test requires some computational effort, but it is, in our experience, quite feasible with adhoc routines (one in fortran, as well as a function in R, can be provided by the authors upon request).
Abbreviations
 ALL:

Acute lymphoblastic leukaemia
 BMT:

Bone Marrow Transplantation
 PH:

Proportional Hazards
 ED:

Early Difference
 LD:

Late Difference
 CH:

Crossing Hazards.
References
Schoenfeld DA, Tsiatis AA: A modified logrank test for highly stratified data. Biometrika. 1987, 74: 167175. 10.1093/biomet/74.1.167.
Galimberti S, Sasieni P, Valsecchi MG: A weighted KaplanMeier estimator for matched data with application to the comparison of chemotherapy and bonemarrow transplant in leukemia. Stat Med. 2002, 21: 38473864. 10.1002/sim.1357.
Logan BR, Klein JP, Zhang MJ: Comparing treatments in the presence of crossing survival curves: an application to bone marrow transplantation. Biometrics. 2008, 64: 733740. 10.1111/j.15410420.2007.00975.x.
Pesarin F: Multivariate Permutation Tests: With Applications in Biostatistics. 2001, Wiley, Chichester
Pesarin F, Salmaso L: Permutation Tests for Complex Data: Theory, Applications and Software. 2010, Wiley, Chichester
Lee EW, Wei LJ, Amato DA: Coxtype regression analysis for large numbers of small groups of correlated failure time observations. Survival analysis: state of the art. 1992, Kluwer Academic, Dordrecht, 237247.
Therneau TM, Grambsch PM: Modeling Survival Data: Extending the Cox Model. 2000, SpringerVerlag, NewYork
Heller G, Venkatraman ES: A bootstrap procedure to compare two survival distributions in the presence of right censored data. Biometrics. 1996, 52: 12041213. 10.2307/2532836.
Sun Y, Sherman M: Some permutation tests for survival data. Biometrics. 1996, 52: 8797. 10.2307/2533147.
Shih JH, Fay MP: A class of permutation tests for stratified survival data. Biometrics. 1999, 55: 11561162. 10.1111/j.0006341X.1999.01156.x.
Uderzo C, Valsecchi MG, Balduzzi A, Dini G, Miniero R, Locatelli F, Rondelli R, Pession A, Arcese W, Bacigalupo A, Polchi P, Andolina M, Messina C, Conter V, Aricò M, Galimberti S, Masera G: Allogeneic bone marrow transplantation versus chemotherapy in highrisk childhood acute lymphoblastic leukemia in first remission. Br J Haematol. 1997, 96: 387394. 10.1046/j.13652141.1997.d012033.x.
Klein JP, Logan B, Harhoff M, Andersen PK: Analyzing survival curves at a fixed point in time. Stat Med. 2007, 26: 45054519. 10.1002/sim.2864.
Westfall PH, Young SS: ResamplingBased Multiple Testing: Examples and Methods for Pvalue Adjustment. 1993, Wiley, NewYork
Prepublication history
The prepublication history for this paper can be accessed here:http://www.biomedcentral.com/14712288/13/16/prepub
Acknowledgements
We are grateful to Prof. David Glidden and Prof. Fortunato Pesarin for their precious suggestions. We are also indebted with the two reviewers for the valuable comments that certainly improved the manuscript. This research was partially supported by a grant from the Associazione Italiana Ricerca Cancro (AIRC IG 5017) and the European Commission (ENCCA Project: FP7HEALTHF22011 261474).
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
SG performed the simulation studies and drafted the manuscript. MGV supervised the analyses and contributed to the manuscript. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Galimberti, S., Valsecchi, M.G. Multivariate permutation test to compare survival curves for matched data. BMC Med Res Methodol 13, 16 (2013). https://doi.org/10.1186/147122881316
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/147122881316
Keywords
 Highly stratified data
 Matched survival data
 Multiple matching
 Multivariate permutation tests