Correction of the significance level when attempting multiple transformations of an explanatory variable in generalized linear models

Background In statistical modeling, finding the most favorable coding for an exploratory quantitative variable involves many tests. This process involves multiple testing problems and requires the correction of the significance level. Methods For each coding, a test on the nullity of the coefficient associated with the new coded variable is computed. The selected coding corresponds to that associated with the largest statistical test (or equivalently the smallest pvalue). In the context of the Generalized Linear Model, Liquet and Commenges (Stat Probability Lett,71:33–38,2005) proposed an asymptotic correction of the significance level. This procedure, based on the score test, has been developed for dichotomous and Box-Cox transformations. In this paper, we suggest the use of resampling methods to estimate the significance level for categorical transformations with more than two levels and, by definition those that involve more than one parameter in the model. The categorical transformation is a more flexible way to explore the unknown shape of the effect between an explanatory and a dependent variable. Results The simulations we ran in this study showed good performances of the proposed methods. These methods were illustrated using the data from a study of the relationship between cholesterol and dementia. Conclusion The algorithms were implemented using R, and the associated CPMCGLM R package is available on the CRAN.


Background
In applied studies, the relationship between an explanatory and a dependent variable is routinely measured using a statistical model. For instance, in epidemiology it is quite common that a study focuses on one particular risk factor. The scientific problem is to analyze whether this risk factor has an influence on the risk of occurrence of a disease, a biological trait , or another outcome. To answer to this http://www.biomedcentral.com/1471-2288/ 13/75 Binary coding is often used in epidemiology, either to make interpretation easier, or because a threshold effect is suspected. In a regression model with multiple explanatory variables, the interpretation of the regression coefficient for a binary variable may be easier to understand than a change in one unit of the continuous variable. Dichotomous transformations of a variable X are defined as: Other transformations are also used, in particular Box-Cox transformations which have been defined as: but the choice of the transformation is often subjective. The arbitrariness of the choice of cutpoints may lead to the idea of trying more than one set of values. Hence to analyze data, the statistician may have to use several transformations, and for each the statistician applies a test for "β = 0" (where β is the coefficient representing the effect of the risk factor of interest). The most favorable transformation is then chosen. The cutpoint giving the minimum p value is often termed "optimal" [2,3]. When testing several codings of a variable, there is a problem with the multiplicity of tests performed, leading to an incorrect p value and possible overestimation of effects [4]. Generally, researchers fail to consider this problem and do not correct the significance level in relation to the number of tests performed [3], which can lead to an increase in the Type-I error [5]. The p value should thus be corrected to take into account the multiplicity of tests. In many cases, it is now widely recognized that categorization of a continuous variable could introduce major problems to an analysis and interpretation of the associated model [1,3]. It is important to note that the aim of this paper is not to defend this practice, but to improve a practice commonly used by epidemiologists in terms of multiple testing. Furthermore, despite known loss of power following dichotomization in the univariate case, Westfall [6] shown that dichotomizing continuous data can greatly improve the power when multiple comparisons are performed.
Many methods of correction exist, the most simple and well known being the Bonferroni rule. Several authors have improved this method to make it more powerful, however most do not take into account the correlation between the tests [7][8][9][10][11]. If the tests are independent, or moderately dependent, then they provide an upper bound which may be satisfactory. Efron [12] proposed a correction that account for the correlation between two consecutive tests if there is a natural order between the tests, with high correlation between adjacent tests. Liquet and Commenges [13,14] and Hashemi and Commenges [15] proposed a more exact correction, accounting for the whole correlation matrix, for score tests obtained in logistic regression, generalized linear model and proportional hazards models.
Here, we proprose extending these studies to a categorical transformation (with m > 2 categories) of the continuous variable by involving more than one parameter in the model; m − 1 dummy variables are introduced in the model. The categorical transformation is a more flexible way to explore the unknown shape of the effect. In this context, we propose a method and an R program based on resampling approaches to determine the significance level for a series of several transformations (including dichotomous, Box-Cox and categorical transformations) of an explanatory variable in a Generalized Linear Model. The problem of correcting the estimation of the effect will not be examined here.
First, we revisit the example proposed by Liquet and Commenges [14] on the relationship between cholesterol and dementia [16] to provide a framework for our discussion. In section 'Methods: Statistical context' , we present the statistical contexts relating to multiple testing; the model, the maximum test and the minimum p value procedure and finally the score tests are exposed. Section 'Methods: Significance level correction' presents the different methods of correction of the Type-I error. A simulation study for the different strategies of coding, and application of the model to the initial example are presented in the section 'Results' . Concluding remarks are given in the two last sections.

Example: revisiting the PAQUID cohort example
We revisited the example presented in the article of Liquet and Commenges [13] for a coding of a binary variable in a logistic regression. This example is based on the work of Bonarek et al. [16], who studied the relationship between serum cholesterol levels and dementia. The data came from a nested case-control study of 334 elderly French subjects aged 73 and over who participated in the PAQUID cohort (37 subjects with dementia and 297 controls). The variables age, sex, level of education and wine consumption were considered as adjustment variables. The analysis focused on the influence of HDL-cholesterol(high-density lipoprotein) on the risk of dementia. Bonarek et al. [16] first considered HDL-cholesterol as a continuous variable; then, to ease clinical interpretation, they chose to transform the HDLcholesterol into a categorical variable with four classes. Finally, as there was no significant difference between the first three quartiles, HDL-cholesterol was split into two categories with a cutpoint at the last quartile. The best p value , 0.007, was obtained in the latter analysis http://www.biomedcentral.com/1471-2288/13/75 and was selected for interpretation. However, this p value did not take into account the numerous transformations performed to determine the best representation of the variable of interest. Legitimate questions arising from this include the following: What is the real association between dementia and HDL-cholesterol, with a correction of the Type-I error? Is it really significant? Liquet and Commenges [14] proposed correcting the p value associated with multiple transformation including dichotomous and Box-Cox transformation, however, their method cannot be used with categorical transformation.

Statistical context Model
Let us consider a Generalized Linear Model with p explanatory variables [17], where Y i (1 ≤ i ≤ n) are independently distributed with probability density function in the exponential family defined as follows: and where a(·), b(·), and c(·) are known and differentiable functions. b(·) is three times differentiable, and its first derivative b (·) is invertible. Parameters (θ i , φ) belong to ⊂ R 2 , where θ i is the canonical parameter and φ is the dispersion parameter.
In this context, we wished to test the association between the outcome Y i and explanatory variable of interest X i , adjusted on a vector of explanatory variables Z i . The form of the effect of X i is unknown, so we may consider K transformations of this variable X i (k) = g k (X i ) with k = 1, . . . , K. For example, if we transform the continuous variable in m k classes, m k − 1 dummy variables are defined from the function g k (·): . Different numbers of level m k of the categorical transformation are possible. The model for a transformation k can then be obtained by modeling the canonical parameter θ i as: ) and γ = (γ 0 , . . . , γ p−1 ) T is a p − 1 vector of coefficients, and β k is the m k − 1 vector of coefficients associated with a categorical transformation k of the variable X i . For dichotomous or Box-Cox transformations β k reduce to a scalar (β k ∈ R). The hypothesis of the test for the transformation k is defined as follows: Thus all the null hypotheses are the same, and denote it by H 0 .

Maximum test and minimum P-value procedures
For each coding, k, of the variable X i , a test statistic T k is performed on the nullity of the vector β k . We then have a vector of test statistics T = (T 1 , . . . , T K ) for the same null hypothesis (no effect of the risk factor of interest). In the context of dichotomous and Box-Cox transformations, each test statisitic, T k , has asymptotically, a standard normal distribution. Thus rejecting the null hypothesis if one of the absolute values of the test T k is larger than a critical value c α , is equivalent to rejecting the null hypothesis To cope with the multiplicity problem, Liquet and Commenges [13,14] proposed that the probablity of Type-I error for the statistic T max under the null hypothesis be computed as: where t max is the realization of T max . An equivalent approach is to use a procedure based on the individual p value of each test T k noted P k = P(|T k | > |t k |) (where t k is the realization of T k ). The minimum of the K realized p value corresponds to the test k which obtains the highest realization (in absolute values; k/ t max = |t k |). Then, we have: where P min = min(P 1 , . . . , P K ) and p min is the realization of P min . The interest of using a procedure based on the p value is the possibility of combining statistical tests which do not follow the same distribution. In the current context, we will combine dichotomous, Box-Cox and categorical transformations with more than two levels.

Score test
We briefly present the score test used for all of the K transformations where the same null hypothesis is tested (i.e. H 0 : "β k = 0 m k −1 " given by θ i (k) = γ Z i (with different alternatives)). We present the main results obtained by Liquet and Commenges [14] for the Generalized Linear Model in the context of dichotomous and Box-Cox transformations, and then consider the score test for categorical transformations.

Dichotomous and Box-cox Transformations
In the context of dichotomous and Box-Cox transformations, the score test used for testing the effect of the transformed http://www.biomedcentral.com/1471-2288/13/75 variable (β k = 0 with β k ∈ R) follows asymptotically a standard normal distribution: The correlation between the different tests has been defined by Liquet and Commenges [14]. Asymptotically, the joint distribution of T 1 , . . . , T K is a multivariate normal distribution with zero mean and a certain covariance matrix. Thus Liquet and Commenges [14] propose that the p value (associated with the test T max ) defined in (2) using numerical integration [18] be calculated. They called their method the "exact method".

Categorical transformations
In the context of a categorical transformation in m k classes, the score test testing H 0 : "β k = 0 m k −1 " (with β k ∈ R m k −1 ) follows asymptotically a χ 2 distribution with m k − 1 degrees of freedom and is defined as: where U k and I k are respectively the score function and the Fisher information matrix under the null hypothesis [19]. To compute the p value defined in (3), it is necessary to know the joint distribution of T = (T 1 , . . . , T K ). Some studies have defined the distribution of the multivariate χ 2 [20,21]. However, even though the correlation between the different tests could be easily estimated, it has not been possible, as far as we know, to obtain the joint distribution of T = (T 1 , . . . , T K ). To overcome this problem, we propose approximating the p value (defined in (3) by the minimum p value procedure) using a resampling method (defined in the next section) which also accounts for the correlation between the test statistics.

Significance level correction Bonferroni method
One of the most common corrections in multiple testing is the Bonferroni method. It has been described by several authors in various applications [7,11,22]. It allows an upper bound of the significance level of the minimum p value procedure to be computed as: where K is the number of tests. This method is very simple and does not require any assumption about the correlation between the different tests. It can therefore be applied directly to the different possible codings of an explanatory variable. However, this only provides an upper bound of the p value , which may be very conservative if the correlation between tests are high and the number of transformation are large.

Resampling based methods
We propose the use of resampling based methods [23,24] with the aim of building a reference distribution for the test statistics. These procedures have the advantage of taking into account the dependence of the test statistics for evaluating the correct significance level of the minimum p value procedure (or the maximum test procedure). The principle of resampling procedures is to define new samples from the probability measure defined under H 0 : "β k = 0 m k −1 ".
Permutation test procedure Permutation methods can be used to construct tests which control the Type-I error rate [25]. In our context, the algorithm of the permutation procedure is defined as follows: 1. Apply the minimum p value procedure to the original data for the K transformations considered. We note p min the realization of the minimum of the p value ; 2. As under H 0 , the X i variable has no effect on the response variable, a new dataset is generated by permuting the X i variable in the initial dataset; 3. Generate B new datasets s * b , b = 1, . . . , B by repeating B times the step 2; 4. For each new dataset, apply the minimum p value procedure for the transfomation under consideration. We note p * b min the smallest p value for each new dataset. 5. The p value defined in (3) is then approximated by: where I {·} is an indicator function.
However, it is important to note that exchangeability need to be satisfied [25][26][27][28][29][30]. This condition is much more restrictive than it appears at first sight. In fact, Commenges [29] and Commenges and Liquet [25] showed that the permutation test approach for the score test is robust if the model has only one intercept under the null hypothesis, or if X i are independent of Z i for all i in the context of a linear model and the proportional hazards model. This issue applies in our context. Thus we investigated, the robustness of the permutation method when the exchangeability assumptions is violated.
Parametric bootstrap procedure In 2000, Good [31] explained: "Permutations test hypotheses concerning distributions; bootstraps test hypotheses concerning parameters. As a result, the bootstrap implies less stringent assumptions". Therefore, an alternative way may be to use http://www.biomedcentral.com/1471-2288/13/75 resampling method based on bootstrap [32], which give us an asymptotic reference distribution. This procedure could be defined by the following algorithm: 1. Apply the minimum p value procedure to the original data for the K transformations being considered. We note p min the realization of the minimum of the p value ; 2. Fit the model under the null hypothesis, using the observed data, and obtainγ , the maximum likelihood estimate (MLE) of γ ; 3. Generate a new outcome Y * i for each subject from the probability measure defined under H 0 . For example, for a logistic model (where a(φ) Repeat this for all the subjects to obtain a sample noted . . , B by repeating B times the step 3; 5. Apply for each new dataset, the minimum p value procedure for the transformation considered. We note p * b min the smallest p value for each new dataset. 6. Then, the p value defined in (3) is then approximated by:

Simulation study
The aim of this simulation study was to assess the performance of the two resampling methods to correct the significance level. Three different scenarios of transformations were investigated: dichotomous transformations, categorical transformations with three classes, and categorical transformations with different numbers of classes.
To shorten the simulation study section we have not presented the results for the Box-Cox transformations. For each simulation case, the control of the Type-I error and the power of the developed methods were evaluated. For all simulations, the data come from a logistic model (where a(φ) = 1, b(θ i ) = log(1 + e θ i ), and μ i = E(Y i ) = e θ i /(1 + e θ i )) consisting of two explanatory variables: Z, an adjustment variable, and X, the variable of interest. We considered the following models: where Z i and X i are independent and were generated according to a standard normal distribution and the vector X i (k) was a transformation of a continuous variable X i . The sample size was set to be 100. We used 1000 replications for each simulation and 1000 samples for the resampling methods.

Dichotomous transformations
We only considered dichotomous transformations to explore a shape effect of the variable of interest. To obtain the best transformation, several cutpoints c k may be tested. When epidemiological references are not available, a strategy based on the quantile of the continuous variable is most commonly applied. In this simulation we used the median for one dichotomous transformation. For two dichotomous transformations we used the first tercile as the first cutpoint, and the second tercile as the second cutpoint, and so on. This strategy is summarized in Table 1. Firstly, we investigated the Type-I error rate. For a replication, the rejection criterion of the null hypothesis (β k = 0) was a p value less than 0.05. Thus, for a simulation of 1 000 replications, the empirical Type-I error rate was the proportion of tests where the p value was less than 0.05. Figure 1(a) shows the evolution of the Type-I error rate for dichotomous transformations. The naive method, without correction of the multiple testing, increases the Type-I error rate with the number of codings tried. For ten codings this error rate reached 0.27. The error rate calculated by the Bonferroni method decreased with the number of cutpoints. This correction was therefore too conservative whereas the exact method and resampling methods gave a Type-I error rate close to the nominal 0.05 value.
When information on the shape of the effect of the explanatory variable was unknown we investigated the power of the methods applied above. We studied the power for a threshold effect model with a cutpoint value at the first tercile. Figure 1(b) gives the power as a function of the number of cutpoints tried. The power of the exact and resampling methods are quite similar to one another, and higher than the Bonferroni method. The difference between these methods and Bonferroni method  increases with the number of cutpoints. We also observed that the power was highest at two cutpoints (two transformations). This result, was in fact, expected since we used the first and second terciles respectively as cutpoints for each dichotomous transformation. Power increased again when trying five and eight codings due to the fact that one of these codings corresponded to the first tercile. To conclude, the simulation study with dichotomous transformations showed that the resampling methods provide similar results for the Type-I error rate control and the power as those seen with the exact method.

Categorical transformations with same number of classes
We considered here only categorical transformations with three classes. In this situation, the choices of the two cutpoints (noted c 1 k and c 2 k ) defining the categorical variables into three classes are also subjective. For this simulation study, our strategy was to attempt to find the most favorable transformation into three classes. This consisted of using the tercile of the variable for one transformation with two cutpoints (c 1 1 = q 1/3 and c 2 1 = q 2/3 ); for two transformations we add to the previous choice a transformation with the first quartile and the third quartile for the two cutpoints (c 1 2 = q 1/4 and c 2 2 = q 3/4 ). The global strategy until we obtain 10 transformations in three classes is presented in Table 2.
We investigated the Type-I error rate. Figure 2(a) shows the evolution of the Type-I error rate for categorical transformations in three classes. The results are similar to those we observed for dichotomous transformations. The Bonferroni correction was still too conservative, while resampling methods gave a Type-I error rate close to the nominal 0.05 value.
Next we considered the power of the different methods when the simulated model was specified with a categorical transformation of the continuous variable in three classes defined by cutpoints at the first and third quartile. The two resampling methods gave similar results with a higher power than the Bonferroni method (see Figure 2(b)). The power was highest for two transformations. This result was also expected because, with the strategy presented in Table 2, the transformation into three classes with cutpoints at the first and third quartile is used.

Various categorical transformations
In this last simulation, we presented a more realistic situation where different kinds of transformations were used to investigate the effect of the variable of interest. We proposed trying different categorical transformations and varying the number of classes. The most natural method is to use a dichotomous transformation at the median for one transformation. For two transformations, we added the previous coding and a categorical transformation in Table 2  three classes based on the tercile. For three transformations, we added the two previous codings and a categorical transformation in four classes based on the quartile, and so on. The strategy proposed in this simulation is presented in Table 3.
The results for the Type-I error rate were similar to the previous simulation case (not shown here). We then studied the power of the different methods when the simulated model is specified with a categorical transformation of the continuous variable in five classes defined by cutpoints at the quintile. We can see in Figure 3, that, in this situation, the parametric bootstrap method seems slightly more powerful than the permutation method. The resampling methods were also more powerful than the Bonferroni method. Finally, as expected, we can see that the power was highest for four transformations, where one of the transformations used corresponded to a categorical transformation with quintiles as cutpoints.

Robustness of resampling methods
We investigated the robustness of the resampling methods when the exchangeability assumption is violated. The data came from the model defined in (4) with two dependent variables X i (k) and Z i . The dependency between X i (k) and Z i (formalized by the correlation ratio(η 2 )) was specified by the following model: where X i (k) is the binary coding of the X i variable with a cutpoint at the median. The coefficient β * was computed according to η 2 and the variance of X i (k) variable. We tested three different binary codings with cutpoints at the first, the second and the third tercile. The strategy is used for various values of the correlation ratio (η 2 ) from 0 to 0.6. The robustness of the permutation method when the exchangeability assumption is violated was evaluated with respect to the results of the exact method. For different correlation ratios (η 2 ) we evaluated the control of the Type-I error, the power, the Mean Square Error (MSE) of the estimated p value (p value from the exact method was used as a reference), and the rate of good decision (same decision as for the exact method). These results are presented in Figure 4 and show the good behavior of the permutation method since the Type-I error is controlled at the level 0.05, the power is the same for all the methods, the rate of good decision is always greater than 0.97, and the MSE is very low. Moreover, the distributions of the estimated p value are quite similar for different methods (not shown).

Example: revisiting the PAQUID cohort example
In order to find the real association between the two variables of interest in the example described at the end of Background section, we applied our newly developed approach which combined different kinds of transformations. Liquet and Commenges [14] have proposed seven dichotomous and five Box-Cox transformations. However, their method did not allow for categorical transformations. We proposed to add, to the seven dichotomous and http://www.biomedcentral.com/1471-2288/13/75 five Box-Cox transformations for this application, four codings in three classes and four codings in four classes. The best transformation appeared to be the dichotomous transformation of HDL-cholesterol with a cutpoint at the third quartile, as already found by Bonarek et al. [16]. The Bonferroni correction gave a p value equal to 0.140, thus not significant for an α level at 0.05. The p value , which is given by both resampling based methods is 0.038. To conclude, it is important to chose a powerful method of correction, because in this context the p value with no correction given by Bonarek et al. [16] was very optimistic (0.007), and the Bonferroni correction was very conservative, yielding an incorrect conclusion. The proposed approach based on the resampling procedure gave a result which was still significant and more realistic than the uncorrected p value .

Discussion
In this paper, we have considered the problem of correction of significance level for a series of several codings of an explanatory variable in a Generalized Linear Model with several adjusting variables. The methods developed, based on resampling methods, enable us to consider categorical transformations as more flexible in order to explore the unknown shape of the effect between an explanatory and a dependent variable. The simulation studies presented above show, firstly, that the resampling method provides similar results for the Type-I error rate control and the power as those found with the exact method proposed by Liquet and Commenges [14] for dichotomous and Box-Cox transformations. Secondly, in the situation of categorical transformations, these simulations demonstrate the good performance of our proposed approaches. Finally we observed the robustness estimation of the p value by the resampling methods. These methods can be easily generalized to other models, such as the proportional hazards model, and to potentially extend the work of Hashemi and Commenges [15] in the same context.

Conclusion
To conclude, the methods developed, based on resampling, demonstrate good performances, and we have implemented different methods and different strategies of coding in an R package called CPMCGLM M (for Correction of the Pvalue after Multiple Coding in a Generalized Linear Model).