Multivariate permutation test to compare survival curves for matched data

Background In the absence of randomization, the comparison of an experimental treatment with respect to the standard may be done based on a matched design. When there is a limited set of cases receiving the experimental treatment, matching of a proper set of controls in a non fixed proportion is convenient. Methods In order to deal with the highly stratified survival data generated by multiple matching, we extend the multivariate permutation testing approach, since standard nonparametric methods for the comparison of survival curves cannot be applied in this setting. Results We demonstrate the validity of the proposed method with simulations, and we illustrate its application to data from an observational study for the comparison of bone marrow transplantation and chemotherapy in the treatment of paediatric leukaemia. Conclusions The use of the multivariate permutation testing approach is recommended in the highly stratified context of survival matched data, especially when the proportional hazards assumption does not hold.


Background
In clinical trials with right-censored failure time outcome, inference often focuses on the comparison of survival curves. When data from observational studies are used to explore the role of different treatments, the main problem is to limit the biases due to the lack of randomization. Matching on relevant baseline features can be used in order to increase the comparability between subjects treated with an experimental therapy and those receiving standard treatment. In settings where matching is done with a variable number of controls (multiple matching), highly stratified data are produced, with strata containing a few, possibly censored, observations. For this reason, the statistical comparison of survival in the two groups cannot be directly addressed by means of the usual nonparametric procedures, such as the log-rank test. The stratified version of these tests, which should account for matching, is inefficient when the number of strata increases and the stratum size is small [1]. Furthermore, these methods are less sensitive when proportional hazard is not satisfied and this might often be the case, especially in the clinical setting that motivated this paper, i.e. the comparison of bone marrow transplantation and chemotherapy in the treatment of leukemia [2,3].
The comparison of the survival curves for highly stratified data due to non-paired matching was addressed by Galimberti et al. [2], who proposed a weighted Kaplan-Meier estimator for the controls and a nonparametric permutation test on the survival difference at one pre-fixed time point. The aim of this work is to extend the comparison of survival at multiple time points, using the multivariate permutation approach originally introduced by Pesarin [4]. In the proposed procedure, differently from Pesarin and Salmaso [5], a desirable feature is that there is no need to resort to the missing data framework in order to deal with censoring.
Our method is compared with the stratified log-rank test, the modified log-rank test for highly stratified data of Schoenfeld and Tsiatis [1] and the Cox model with a sandwich robust standard error for the treatment effect [6,7]. There were no natural competitors among other permutation tests on survival proposed in the literature [8][9][10].
The paper is organized as follows. In the next section, we show how the multivariate permutation approach can be extended to the survival matched data setting. We investigate the performance of the proposed test through simulation studies that are presented in the Simulation section and we illustrate the method, using data from the motivating study on childhood leukemia [11], in the Application section. Finally, we provide a discussion and concluding remarks.

Multivariate permutation tests for survival matched data
In the context of non randomized studies, especially in rare diseases, where only selected patients undergo experimental therapies, matching is an approach to identify a proper set of controls for an unbiased comparison. This was the case of a multicenter study conducted in Italy on the role of allogeneic Bone Marrow Transplantation (BMT) from matched sibling donors in children with acute lymphoblastic leukaemia (ALL) who, by presenting features, have a dismal prognosis [11]. Matching produced a stratum for each transplanted patient, that included as many controls as available, in order to recover the maximum amount of information in this rare subgroup.
More formally, this setting presents finely stratified data: each of the k strata (j = 1,. . .,k) has 1+m j subjects, where the first is the case belonging to the experimental arm (group 1) and the remaining m j are the matched subjects in the control arm (group 2).
The available data refer to the potential survival and censoring times. We assume that, conditionally on the observed j-th stratum, the survival times of the m j controls are i.i.d., but we cannot exclude, due to matching, some dependence between the case and the corresponding controls. Moreover, the usual assumption that censoring is independent of the failure mechanism and of the treatment must hold within stratum. Across strata, both cases and controls are independent, but not necessarily identically distributed. Finally, under the null hypothesis, the distribution of the survival times of the case and the matched controls in each stratum is invariant under permutations.
The multidimensional permutation approach introduced by Pesarin is here adapted to the context of survival analysis with matched data for the comparison of the marginal survival distributions of the two groups, S 1 (t) and S 2 (t). The procedure is based on a nonparametric combination of multiple, possibly dependent, univariate permutation tests. It relies on the following main steps: reformulate the inferential question of interest into a set of sub-questions, set up the partial tests and set up the form for their combination [4].
Firstly, the null and the alternative hypothesis are decomposed into a finite set of q sub-hypotheses H 0i and H 1i (i=1,. . .,q) with the property that, if H 0 is true, all the H 0i are jointly true, while if H 1 is true, at least one sub-hypothesis is true. The standard hypotheses on the survival distributions are thus rephrased into q subhypotheses H 0i , H 1i (i=1,. . .,q) so that: H 0i with H 0i : S 1 (t i )=S 2 (t i ), H 1i with H 1i : S 1 (t i )≠ S 2 (t i ).
Appropriate partial tests T i 1 are then performed separately on each q sub-hypothesis H 0i (i=1,. . .,q). We considered two versions of test statistics that are marginally unbiased and consistent, as required for the validity of the procedure [4]. They are based on the distance of the two survival curves or on their complementary log-log transformation [12] estimated at q different times points (t 1 ,..,t i ,..,t q ): ð Þ is the usual Kaplan-Meier estimate of the survival distribution in group 1, whileŜ 2 w : ð Þ is a weighted version, with stratum specific weights used in group 2 to account for the variable number of subjects in each stratum [2]. If we indicate with d j (u) and r j (u) the number of events and the number of subjects at risk at time u in stratum j, and with w j =1/m j the weight applied to the m j controls of the stratum, the formula for S 2 w : ð Þ is: which can also have a product integral representation for continuous time variables. This estimator has an intuitive appeal because it can be viewed as a Kaplan-Meier estimator computed on k components, one for each stratum, in which all the m j controls are given the same weight. The imbalance in prognostic factors induced by multiple matching is adjusted giving less weight to individuals from strata that are over-represented and vice versa. The idea behind this estimator is to standardize the survival curve of controls, using the population of cases as reference.
The weighted Kaplan-Meier estimatorŜ 2 w : ð Þ is an unbiased estimator for S 2 , characterized by a Gaussian asymptotic distribution with a variance that can be consistently estimated via bootstrap [2].
The first order tests are performed by considering the permutation distribution of the estimated distances TS i 1 and TLS i 1 . When the cardinality Y k j¼1 m j þ 1 À Á of the permutations is large, due to the large number of strata and to their size, it is worthwhile approximating the permutation distribution of the q test statistics with a Monte Carlo strategy that considers B random samples from the population of all possible permutations. Each of the B permutation samples is obtained by assigning, within each observed stratum, one subject to the experimental treatment and the remaining m j to the standard treatment. The q first order tests are then calculated on each permuted sample and, at each time point, the partial pvalues λ i 's for the q observed TS i 1 or TLS i 1 (i=1,. . .,q) are obtained from the empirical permutation distribution. Starting from the q permutation distributions, the pvalues are also calculated for the first order test statistics of each permuted sample.
In order to test the global null hypothesis H 0 , the partial tests T i 1 are subsequently combined in a unidimensional second-order test statistic T 2 . This combination has to be done nonparametrically, since the q univariate tests are dependent, and it is applied to the p-values λ i , which are permutationally equivalent to the partial tests [4]. Two alternative functions proposed by Fisher and Tippett, respectively, are considered, which can be applied on p-values either from TS i 1 or TLS i 1 as follows: As the Tippett combinig function can be written as 1 À min 1≤i≤q λ i , it has a monotonic relationship with the well known function min 1≤i≤q λ i used by Westfall and Young [13] in their resampling approach to multiple testing.
While the q p-values λ i are obtained from the marginal permutation distribution of the first order tests, the p-value of the global test is derived from the permutation distribution of the second order test, which is related to the q-variate distribution in the first step. Note that, no further permutations are needed at this stage, because we use, for the global test, the results on the p-values obtained from each permuted sample that define the permutation distributions of the first order tests [4].
In the survival setting the definition of the time points (t 1 ,. . .,t i ,. . .,t q ) involved in the comparison may be relevant. We explored different strategies: i) fixed time points, ii) time points identified by percentiles of the overall event distribution and iii) all the observed event times included between the 10 th and the 90 th percentiles of the overall event distribution.

Simulations
The performance of these tests in the finite sample situation are evaluated through simulations. The size and the power are calculated for different alternatives under proportional hazards and for three different scenarios of non-proportionality (e.g. survival curves showing both an early and a late difference or crossing each other). The behavior of the proposed tests is also explored in the presence of different number of strata (k=30, 50, 100) and of different degree of strata heterogeneity (no or low strata effect). The dimension of each stratum is defined by one case plus the number of controls generated by a Poisson (μ=4)+1 (1 is added in order to guarantee at least 1 control in each stratum).
In the simulations under H 0 , the case and the matched controls share the same conditional piecewise exponential hazard function: The randomness of λ j is determined by α j~G [1/w,w], where G(a,b) is a Gamma distribution with mean ab and variance ab 2 , and the parameter w indicates the level of heterogeneity across strata (w=0, 0.22; specifically, for w=0, no strata effect is assumed as α j =1 with probability 1). Different scenarios are obtained by giving specific values to the landmark L, to b 1 and b 2 . Under the alternative hypothesis, the hazard function for cases was decreased in order to define a survival which differs from that of controls, as specified in Figure 1. Censoring was uniform over 3-6 in order to have approximately 28-38 percent censoring, on average. For all the configurations studied, the results are based on 1000 samples and each of them uses B=2000 Monte Carlo permutations in order to define the empirical permutation distributions.
Definitions of the q time points involved in the comparison are as follows: i) 8 or 4 fixed equi-spaced times from 0.5-1 to 4, ii) 9 time points identified by the 10 th , 20 th ,.., 90 th percentiles of the overall event distribution, and iii) all the observed failure times between the 10 th and the 90 th percentiles of the overall event distribution.
The empirical level of the tests is shown in Tables 1.A (Fisher combining function) and 1.B (Tippett function), while the power in detecting a difference in survival is reported in Table 2.A-B, accordingly.
As expected, the performance of the test Under H 0 is generally good in all settings. In particular, with a small number of strata, the test based on the survival distance behaves very well. When the complementary log-log transform is considered, the permutation distribution at early time points, such as the 10 th percentile, may be very irregular because this transformation magnifies survival values near one. In order to have a good behavior even with small number of strata, it is sufficient to start from a higher percentile such as the 20 th (data not shown).
Under H 1 , as expected, power increases with increasing number of strata under every scenario and the test has a good behavior also in the situation of crossing hazards. No major gain in power is achieved by increasing the number of fixed time points from 4 to 8 and similar results are obtained when points are fixed at percentiles of the event distribution or take 80% of all observed event times. The fixed time point strategy seems generally more satisfactory for the proportional hazards and the late difference scenarios. A clear advantage in the use of the complementary log-log transform is seen only in the early difference scenario; conversely the survival difference seems to offer an advantage in the crossing hazards setting. Again, by comparing Table 2.A and 2.B, where the best performers for each scenario are in bold, the performance under the Fisher and Tippet functions is very similar, except for an advantage of Tippett in the crossing hazards scenario.
Results on the comparison of our permutation tests with other approaches are reported in Table 3. The stratified log-rank test, the modified log-rank test by Schoenfeld-Tsiatis and the robust test from the Cox model including only the covariate for treatment have size close to the nominal level. The multivariate permutation tests perform better than the latter tests in terms of power in the crossing hazards and early difference scenarios, while they have a similar behavior in the presence of proportional hazards.

Application
The clinical context that motivated this work is a retrospective multicentre study on 427 children enrolled between 1985 and 1994 in high-risk protocols for the treatment of ALL, where 30 received an allogeneic bone-marrow transplant after achieving first remission. In this setting, where transplant was considered as an experimental treatment to be administered only to selected patients, we used matching to define the appropriate group of children treated with chemotherapy alone to be used for an unbiased comparison. The matching procedure was based on 6 factors (centre, front-line treatment protocol, white blood cell count and age at diagnosis, immunophenotype and waiting time to transplant). For each of the 30 BMT cases, at least one non transplanted matched control was found, with a maximum of 17 and a median of 3, for a total of 130 controls. The Disease Free Survival (DFS) curves of the two groups depicted in Figure 2 were estimated using    -value=0.112). At 3 years, 7 cases and 39 controls were still at risk and most of the overall number of events (97%) had already occurred, as shown by the event distribution below Figure 2. We have now applied the multivariate permutation approach in order to compare the two DFS curves with a global test. The Fisher and the Tippett tests are performed based both on the direct estimates of survival and on its complementary log-log transformation, according to different choices of time points involved in the comparison. Based on previous studies and knowledge on the treatment implications on toxicity and survival, the most likely alternative to the null is a non proportional setting with an early crossing and late differences in hazards. The result in term of significance (p-value=0.199) points to the Tippett test on logsurvival involving 4 points (at 1, 2, 3 and 4 years from the origin). Other tests on 4 fixed time points, starting from 0.5 or 1 year (and not necessarily equispaced), are applied, and results are in keeping with the observation that curves start diverging at 1 year. The tests involving the percentiles and all the event times within the 10 th and 90 th percentile, give p-values ranging from 0.486 to 0.829 (Table 4).
Due to the relatively small sample size, the power is likely not adequate to detect the advantage given by transplantation, and this holds true even if we are applying a more appropriate test with regard to standard methods (stratified log-rank test, p-values=0.966; Cox model with robust standard error, p-values=0.915).

Discussion
This paper provides a global test for the comparison of survival curves in the context of highly stratified data produced by matching on a non-fixed proportion, by extending the multivariate permutation approach proposed by Pesarin [4]. The nonparametric combination of partial tests is able to capture their underlying dependence without assumptions and to control the level of significance of the global test, avoiding the issue of multiplicity. In principle, the proposed tests can be viewed as a weighted combination of an infinite number of time points, giving weight 1 to those involved in the comparison, and weight 0 otherwise. Yet, in survival analysis, the choice of the time points where partial tests are performed is a crucial issue. Our approach carefully considers this aspect by evaluating three different convenient strategies among the many possible. The most simple is the one that identifies a number of fixed time points. This is appealing for applications where the researcher has reasons for evaluating certain times that are relevant for the phenomenon. Importantly, given the degree of arbitrariness involved, the choice of the time points should be carefully done, possibly a-priori, in a time window where more information is expected, given the experimental setting and the follow-up. Typically, in time points near the boundaries, i.e. at the very beginning or on the tail of the survival curves, where few events are generally observed, the permutation distribution of the partial tests may be irregular. Following this consideration, alternative non subjective definitions of time points are based on the observed distribution of events, either with a limited number defined by percentiles, or on the whole set of event times within the 10 th up to the 90 th percentile.
As expected, the extended simulation study shows good performances of the proposed test in terms of alpha coverage. Under the alternative, apparently a slight advantage is obtained considering all event times for partial tests in the crossing hazard and early difference  , we do not observe a general superiority of the complementary log-log transform of survival on survival, as distance measure, except in the early distance scenario. Actually, the Tippett transform behaves well using the survival distance even with small number of strata. While a standardized survival difference could have been considered, the expected gain in performance would not counterbalance the heavy additional computational effort needed. This because the standard error of the difference is based on a bootstrap estimate that should account for the complex form of dependence possibly induced by matching, that cannot be easily specified.
In the context of non matched data (or no strata effect), the proposed multivariate permutation test maintains a good performance for survival data comparison. Indeed, this approach could be useful even in non matched data in the presence of departures from the proportional hazards assumption, especially crossing hazards, where an overall test is still matter of development.
Conditions for the applicability of this approach are very mild. The appropriate application of the multivariate permutation testing approach relies on the assumptions on censoring being independent from the failure process and treatment within strata. While this could be considered a weakness, it is indeed the usual assumption which applies in survival when a correct study planning is adopted. The test relies also on the assumption of exchangeability under the null. In our context, where permutations are within strata, this latter assumption is reasonable due to matching. Finally, the test can easily    be extended to more than one subject of the experimental arm in each stratum, modifying the weighting scheme of the Kaplan-Meier estimator of survival. Other permutation tests were not suitable for extension to this context. Specifically, the proposals of both Pesarin-Salmaso [5] and of Heller-Venkatraman [8] do not account for stratified and possibly correlated data; the approach of Sun and Sherman [9] needs relatively large strata of independent observations, and the versatile test of Shih and Fay [10], while dealing with highly stratified matched data, behaves similarly to the log-rank test or its stratified version, depending on the level of inter-strata heterogeneity.

Conclusions
In conclusion, our approach is different from other permutation tests proposed for survival analysis [8][9][10], as it relies on the multivariate permutation approach originally introduced by Pesarin [4]. It is recommended in the highly stratified context of matched data where it is proven by simulations to be at least equal or superior to the stratified log-rank test, the modified log-rank test for highly stratified data and a Cox model with robust variance estimate. Preference to this test should be definitely given when the proportional hazards assumption does not hold, even in non matched data. As a general indication, we would suggest the application on survival differences at fixed time points for the first order test and the Tippett transform for the global combining function. This test requires some computational effort, but it is, in our experience, quite feasible with ad-hoc routines (one in fortran, as well as a function in R, can be provided by the authors upon request).

Competing interests
The authors declare that they have no competing interests.  ) and for different choices of the time points (fixed or points at 10-90 th percentiles (perc) or 80% of all the observed event times (all)).