Spurious interaction as a result of categorization

Background It is common in applied epidemiological and clinical research to convert continuous variables into categorical variables by grouping values into categories. Such categorized variables are then often used as exposure variables in some regression model. There are numerous statistical arguments why this practice should be avoided, and in this paper we present yet another such argument. Methods We show that categorization may lead to spurious interaction in multiple regression models. We give precise analytical expressions for when this may happen in the linear regression model with normally distributed exposure variables, and we show by simulations that the analytical results are valid also for other distributions. Further, we give an interpretation of the results in terms of a measurement error problem. Results We show that, in the case of a linear model with two normally distributed exposure variables, both categorized at the same cut point, a spurious interaction will be induced unless the two variables are categorized at the median or they are uncorrelated. In simulations with exposure variables following other distributions, we confirm this general effect of categorization, but we also show that the effect of the choice of cut point varies over different distributions. Conclusion Categorization of continuous exposure variables leads to a number of problems, among them spurious interaction effects. Hence, this practice should be avoided and other methods should be considered.


Background
It is common in epidemiological and medical research to categorize exposure variables measured on a continuous scale and treat them as categorical in the statistical analysis. The continuous variables can be dichotomized or they can be divided into more than two groups, the latter alternative allowing investigation of a possible dose-response relationship. Examples of this practice include Body Mass Index (BMI) categorized according to pre-defined values and nutritional intake categorized according to observed quintiles. There may be several reasons for this practice. The most common ones being that it makes the analysis and the interpretation easier; one avoids having to model the actual relationship between the exposure variables and the response, and that it mimics clinical practice where one typically divides patients into groups (hypertensive vs. normotensive, obese vs. non-obese).
A number of papers have appeared, both in the biostatistical [1][2][3][4][5][6][7], epidemiological [8][9][10][11][12][13] and psychological [14][15][16] literature, pointing to problems with this approach and arguing against it. Among the problems are loss of information and power, but also an increased risk of type I error if continuous confounder variables are categorized. Recently, also predictive performance of models with categorized predictors is criticized [7]. We will not repeat their arguments here, but rather point to a problem that has received much less attention; that categorization of continuous exposure variables may lead to spurious interaction effects in a multiple regression model against an outcome.
This problem was observed already in 1974, by Humphreys and Fleishman [17], in a simulation study of the behavior of the ANOVA model in situations with two categorized explanatory variables. Later, the same type of problem was also noted by Paunonen and Jackson [18] in an investigation of possible effect modification of the association between personality trait measures and resulting behavior. They noted that effect modification, or interaction, was often observed if the effect modifier in question was dichotomized and a stratified analysis was performed, while if the same association was investigated in a linear regression model, keeping the effect modifier on its original scale and introducing a product term, effect modification was less often observed. This observation led Bissonnette et al. [19] to perform a simulation study investigating the same problem. They also found that when they simulated a model with no interaction, dichotomized the potential effect modifier and then performed a stratified analysis, effect modification was often observed. However, no real understanding of or explanation for the findings was provided.
Maxwell and Delaney [14] carried out a formal analytical investigation of the effects of dichotomizing the explanatory variables in a linear regression model. Specifically, they showed that in a model with two correlated explanatory variables X 1 and X 2 , if the true relationship between one of the explanatory variables and the outcome Y was non-linear, a spurious interaction effect appeared when dichotomizing X 1 and X 2 at the median. This result has later been referred to by a number of other authors who have discussed the practice of categorization [2,16], and the non-linear nature of the relationship between the explanatory variable and the outcome seems to have been taken as the explanation of the spurious interaction.
In this paper we will look into this problem in some more detail. We have two exposure variables X 1 and X 2 that are correlated, and where both of them are dichotomized. The situation where only one of them is dichotomized follows directly. Furthermore, they are both to be related to an outcome variable Y by some regression model. Throughout we will assume that there is no interaction between X 1 and X 2 . In much earlier work, it has been assumed that the variables have been categorized at the median. However, in many medical and epidemiological applications, it is more relevant to consider categorization at more extreme values, and / or at several cut points. This leads to some interesting findings.
In our analytical treatment of the problem, we take the regression model to be linear, and we assume normally distributed variables X 1 and X 2 . However, the results apply to regression models and distributions in general. We have also carried out a small simulation study in order to explore the effects for different distributions. We will first give two examples to show the relevance of the problem.

Illustration 1, height and lung function
The first illustration uses data on lung function collected among Norwegian medical students. Peak expiratory flow (PEF), l/min, was measured six times for each student; three times in sitting position and three times in standing position. We will be using the mean of the six measurements in this illustration. In addition to PEF, we also measured the height of the students (cm) and we have gender information. In total we have data from 377 students, and we will model PEF as a function of gender and height (centered). Running the simple linear regression model PEF = β 0 + β 1 × gender + β 2 × height + β3 × gender × height + ε leads to the following estimated coefficients (SE):β 0 = 556.7 l/min (9.2 l/min),β 1 = − 129.7 l/min (11.0 l/min),β 2 = 3.7 (l/min)/cm (0.9 (l/min)/cm), β 3 = − 0.8 l/min (1.1 l/min). Using the conventional 5% significance level, we have clearly significant effects of gender and height, but no interaction. Next, we categorize height according to the gender specific 90th percentiles; 189 cm for men and 175 cm for women, coding zero if the subject is below the cut-off and one if above. We will then run the same linear model as above, but with the categorized version of height. This leads to the following estimated coefficients (SE):β 0 =578.4 l/min (6.4 l/min),β 1 = − 168.8 l/min (8.1 l/min),β 2 = 64.7 l/min (17.3 l/min),β 3 = − 48.5 l/min (22.6 l/min). We notice that we have a significant interaction at the conventional 5% level (p = 0.03). This would indicate that the effect of being among the higher 10% is significantly lower among women than among men, and while being among the 10% highest males leads to an increased PEF of 64.7 l/ min, the same increase is only 16.2 l/min for females.

Illustration 2, myocardial infarction
The second example is taken from a huge Norwegian health survey. During the period from 1985 until 1999 the Norwegian government conducted health surveys inviting men and women in the age of 40-42 years to participate. We will be analyzing a subset of these data, collected during the period 1985 to 1994. We have measured, among other things, Body Mass Index (BMI) and systolic blood pressure (BP) on a total of 133,139 subjects. These subjects have been followed for on average 19 years, and death of myocardial infarction is registered through a linkage to the Norwegian Cause of Death Registry. We will restrict our analysis to subjects with BMI > 20 to avoid having to deal with obvious non-linearities. This leaves us with 132,150 subjects. Among these there were 2542 (1.9%) deaths. We fit a logistic regression model for the odds of death by myocardial infarction as a linear function of BMI (kg/m 2 ) and BP (per 10 mmHg) and the interaction between BMI and BP. The estimated coefficients (SE) are 0.20 (0.04) for BMI, 0.61 (0.08) for BP, − 0.01 (0.003) for the interaction term and a constant term of − 14.06 (1.20). Due to the large sample size, all the estimated effects are clearly significant at the conventional 5% level. However, the estimated interaction term has no practical significance. Next, we divide the sample into two BMI groups (obese vs. non-obese) by making a cut-off at 30 kg/m 2 , and we divide the sample into two BP groups (hypertensive vs. normotensive) by making a cut-off at 140 mmHg. Based on the two categorized variables we run the same logistic model as above (main effects of BMI and BP and the interaction BMI × BP). The estimated coefficients are 0.81 (0.09) for BMI, 1.03 (0.04) for BP, − 0.41 (0.11) for the interaction and a constant term of − 4.43 (0.03). This would indicate that the effect of being obese varies between normotensive and hypertensive so that among normotensive subjects there is an odds ratio of death equal Exp(0.81) = 2.25, while among the hypertensive the effect of being obese is reduced to an odds ratio of Exp(0.81-0.41) = 1.49, a substantial difference.
Both examples show how categorization may lead to substantial changes in the interpretation of the data in practical data analysis.

Analytical developments, categorization in the bivariate normal situation
Assume we have a linear relationship between two exposure variables X 1 , X 2 , and an outcome variable Y, satisfying the linear regression equation.
where we assume ðX 1 ; X 2 Þ∼Nð0; ΣÞ; Σ ¼ 1 ρ ρ 1 : Notice that there is no interaction in the true model. Furthermore, assumeX 1 We have no interaction betweenX 1 andX 2 if and only if μ 00 − μ 10 = μ 01 − μ 11 ⇒ μ 11 − μ 01 − μ 10 + μ 00 = 0. Based on model (1) we have In order to further investigate the relationships of interest, we need to be able to calculate the conditional expectations that enter these expressions. Define F 00 = , the probabilities of belonging to each of the four combinations ofX 1 ;X 2 . Due to symmetry, so in the following we will focus on the conditional expectations of X 1 . Regier and Hamdan [20], using the Mehler identity [21], gave the following identity: where φ(⋅) denotes the normal density function and Φ(⋅) denotes the corresponding cumulative distribution function.
As indicated by Regier and Hamdan [20], corresponding identities can be found by appropriate combinations of integrals and we have that As mentioned, we have no interaction if μ 11 − μ 01 − μ 10 + μ 00 = 0. Using Eqs. (2), (3), (4), (5), this leads to From this, it is immediately clear that we can still have a spurious interaction even in situations where one of the two exposure variables is not associated with the outcome (β 1 or β 2 equal zero) as the product in (10) can still be different from zero. Furthermore, no spurious interaction can take place if β 1 = − β 2 . However, this last point is a function of our highly symmetrical situation and hence less relevant.
Using Eqs. (6), (7), (8), (9) and the fact that in our example, E(X 1 ) = 0 and F 01 = F 10 , formula (10) can be written We have to remember that F 11 , F 00 and F 01 are also functions of c and ρ. Solving this (numerically) leads to ρ = 0 and / or c = 0, in addition to β 1 = − β 2 (and the degenerate solution ρ = − 1 ). That is, we have no interaction if X 1 and X 2 are uncorrelated (ρ = 0) or if we split at the median (c = 0). Otherwise, categorization leads to spurious interaction. By the same arguments one can show that a spurious interaction will also appear if only one of the two variables X 1 and X 2 are categorized, or if X 1 and X 2 are split into more than two categories.
One should remember that Eq. (11) is only relevant for our specific situation, with normally distributed variables and common cut point c. The important message here is that the equation is true in only very few situations, which means that with a few exceptions, categorization leads to spurious interaction. This general message is true also for other distributions.
Having established the presence of spurious interaction, it is of interest to investigate the potential size of the problem. Based on the model above, we can investigate the size of the induced interaction term relative to main effects of the categorized versions of X 1 , X 2 . Assume we fit a model Y ¼β 0 þβ 1X 1 þβ 2X 2 þβ 3X 1X 2 þε whereX 1 ;X 2 are coded 0, 1. It is easy to show that β 2 is given by μ 01 − μ 00 where μ 00 and μ 01 are given in (2) and (3), and we have already established thatβ 3 is given by μ 11 − μ 01 − μ 10 + μ 00 . We can use this to investigate the size ofβ 3 relative toβ 2 for different choices of β 1 , β 2 , the cut point c and the correlation ρ. Figure 1 gives the absolute value ofβ 3 =β 2 for varying c and ρ, for β 1 = β 2 = 1. We observe that with increasing cut points, the influence of the induced product term becomes substantial, even for moderate values of the correlation ρ. In this case, the ratio ofβ 3 toβ 1 will of course be the same as toβ 2 . As a bi-product we can also study the size ofβ 1 andβ 2 as a function of the same cut point c and correlation ρ (Fig. 2), and we observe that the estimated effects increase quite rapidly with the more extreme cut points and correlations.

Interpretation
Categorization of a continuous variable can be seen as an extreme form of measurement error. If there is measurement error in X 1 and / or X 2 , and the error in X 1 is differential with respect to X 2 (meaning the measurement error in X 1 varies with X 2 ) or vice versa, it can be shown that this leads to an induced interaction between the observed versions of X 1 , X 2 in a regression model [22][23][24][25]. In our case, the measurement error can be characterized by the reliability of the dichotomized vari-ableX i , relative to the continuous variable X i , i = 1, 2, as measured by the point-biserial correlation. If X i is normally distributed, this correlation is given by h= ffiffiffiffiffi pq p , where h denotes the ordinate of the normal curve at the cut point and p and q denote the proportion of the population (or probability mass) above and below the cut point. It is easily seen that this correlation will vary with the other variable X j , i ≠ j, as p and q will vary with X j . Hence, the measurement error is differential and an induced interaction is to be expected. Another way of looking at the same problem is as a problem of residual confounding. If we have a situation with a confounding variable where the effect of the confounder is not properly adjusted for, we are left with some residual confounding. This is exactly what is happening when a confounder is categorized. It is well known that this leads to biased estimates of exposureoutcome associations. In particular, Marshall and Hastrup [26] showed through simulations that such residual confounding can lead to apparent effects of variables that are strongly correlated to the confounder, but which in reality bear no association with the outcome. Marshall and Hastrup termed this effect "resonance of strong confounders". Important for our investigation, differences in residual confounding across strata may lead to spurious interaction [27].

Simulations
To illustrate our findings and to further explore the effects of dichotomization in different situations we conducted a small simulation study. Notice that our main focus here has been to investigate the qualitative aspects of the induced interactions, and hence, we used a rather large sample size to minimize randomness.
We simulated X 1 , X 2 according to three different distributions; standard bivariate normal, uniform [0,1], and chi-square with 2 df. We let the correlation between X 1 and X 2 , ρ, vary over 0.2, 0.5, and 0.7, and we categorized at the 60th and the 80th percentiles, respectively. Furthermore, we simulated a response Y according to the following model, Y = β 0 + β 1 X 1 + β 2 X 2 + ε with ε normally distributed with zero expectation and variance σ 2 and independent of X 1 , X 2 . We then fit a model Y ¼β 0 þβ 1X 1 þβ 2X 2 þβ 3X 1X 2 þε whereβ 3 is an interaction parameter. In all our simulations we let β 0 = 0 and β 1 = β 2 = 1. Furthermore, we let the residual variance σ 2 vary over the distributions in such a way that Corr(Y, X i ) = 0.3, i = 1, 2, when ρ = 0.5. The correlation ρ was then varied without changing σ 2 . The results of these simulations are given in Table 1. In addition, we repeated the situation with ρ = 0.7 and X 1 , X 2 categorized at the 80th percentile, but now with β i = 2 for i = 1, 2. Further, we simulated the same situation once more with β 1 = 1 and β 2 = 0. The results of these simulations are given in Table 2.
In order to generate correlated variates from the three distributions, we generated bivariate normal variates with a specified correlation structure in the standard way. Furthermore, we generated uniform marginals by applying the standard normal cumulative distribution function to each of the normal variates. Finally, on the basis of a uniform variate V j , we can generate X j ∼ chi-square with 2 df. by X j = − 2 ln(V j ) [28]. It is an easy task to adjust the pre-specified correlation structure so that the observed correlations are as one wish. This is done empirically, by running preliminary simulations. For each setting we ran 1000 simulations with a sample size of 10,000. Tables 1 and 2 give the results of these simulations. There are some common trends, but also some interesting differences between the distributions. First, the interaction effect becomes stronger with increasing correlation for all the distributions. We also observe that, naturally, the interaction effect becomes stronger with increasing main effects (the difference between Table 1 and the second situation in Table 2). In general, the normal distribution and the uniform distribution behave very similar. We  The table gives estimated regression coefficients with corresponding empirical standard errors in parentheses see a stronger interaction effect with the more extreme cut-off (80th percentile vs. 60th percentile). In the skewed chi-square distribution, we observe the opposite; the stronger interaction effects appear with a cut-off at the 60th percentile. Furthermore, for the normal-and uniform distributions, the estimated interaction parameter has a negative sign, while for the chi-square distribution it has a positive sign. Finally, we confirm the result from the theoretical calculations, that even in the situation with β 2 = 0, a spurious interaction appear.

Results
If we look to the estimated standard errors, the interaction parameter becomes significant with a cut-off at the 60th percentile for the chi-square distribution in both Table 1 and Table 2, with an exception of the situation with the weaker correlation ρ = 0.2 in Table 1. For the other two distributions, we need in general to go to the more extreme cut-off (80th percentile) to find significant effects. Here, however, there are few significant effects for the chi-square distribution. It should be mentioned that categorization in general will lead to efficiency loss and the power to detect interaction effects will be low. For a general treatment of this topic, see [29].
To explain our findings, it will again be instructional to look at the problem as a measurement error problem. For the normal distribution, it is easy to show that the correlation between the continuous variable and the categorized version of the same variable is at its maximum with a cut-off at the median value, and it is then decreased with more extreme cut-off values. This means that there is more measurement error with the more extreme cut-off values. Based on the simulations, one can show that this is also true for the uniform distribution. However, also based on the simulations, for the chi-square distribution there is more measurement error with a cut-off at the 60th percentile than with a cut-off at the 80th percentile, as measured by the point-biserial correlation. This explains why we find the stronger interaction effects with a cut-off at the 60th percentile for the chi-square distribution, and with a cut-off at the 80th percentile for the two other distributions, as there will be more residual confounding, and hence more room for "resonance" in situations with more measurement error.
Furthermore, let us look into the differentiality of the measurement error. Thinking to the standard measurement error situation, with β 1 and β 2 > 0 we will have a positive interaction when there is less attenuation in the effect of X 1 for higher values of X 2 (and opposite), meaning the measurement error decreases with increasing X 2 . Again based on the simulated data, we can look at the measurement error (the point-biserial correlation) of X 1 forX 2 ¼ 0 vs.X 2 ¼ 1. For the normal-and the Uniform distribution, we find more measurement error in X 1 for the lower category ofX 2 , while for the chi-square distribution, there is more measurement error in X 1 for the higher category ofX 2 . This explains why the sign of the interaction term differs between the distributions.

Collapsing categorical covariates
We will briefly mention another consequence of this type of induced interaction. It is well known that tests of interaction usually have rather low power. When modelling an interaction between categorical exposure variables that takes more than two categories, a number of extra parameters have to be introduced. A common advice is then to combine exposure groups in order to decrease the number of extra parameters and hence, increase power (see e.g. Kirkwood & Sterne, Ch. 29.5) [30]. However, by doing this we will again run the risk of introducing a spurious interaction, which can be easily realized by the following illustration.
Assume we have one binary exposure variable X 1 (e.g. gender) and one categorical exposure variable X 2 with four categories. These two exposure variables are to be related to a binary outcome (diseased / not diseased). Assume the data look as in Table 3.
As seen from the table, there is no interaction between X 1 and X 2 , as the association between X 2 and the outcome as measured by the relative risk (RR) is constant across the two levels of X 1 . Next, we will combine categories 1 and 2, and 3 and 4 of X 2 , producing a new variableX 2 taking only two categories. This leads to Table 4.
As observed, the association betweenX 2 and the outcome now differs between the two levels of X 1 and a spurious interaction is introduced. Although the effect is not The table gives the true situation particularly strong in this example, an apparent increase in power may partially be due to such an induced interaction.

Discussion
We have given yet another argument to why continuous explanatory variables should not be categorized when entered into a regression model. If we have correlated exposure variables and categorize, this may lead to spurious interactions in the regression model. Furthermore, we have given an interpretation of this as a measurement error problem. Statistical interaction is scale dependent, both with regard to model and measurement. An additive effect on a linear scale may appear as multiplicative on a transformed scale, like the logit. In the same way, any monotone transformation of the measurement scales of X 1 , X 2 may lead to interactions. Hence, statistical interactions need to be interpreted relative to the scales on which they appear. As such, the main problem with the type of interaction discussed in the present work is not its existence but the fact that it is typically interpreted relative to its original measurement scale.
Our formal development has been within the framework of the linear regression model. However, based on the considerations above, it is easy to realize that this holds also for other regression models. Indeed, we have shown the appearance of such an interaction effect in logistic models through an example.
The practical implication of this is that whenever an interaction effect appears in an analysis based on categorized explanatory variables, the categorization itself should be considered as a possible explanation. However, one should also be aware that such an induced interaction may counteract any possible true interaction present in the data, and hence, mask this true interaction. It should be mentioned that in a practical data analysis, one will need a rather large sample size or strong effects for these interactions to appear as statistically significant.

Conclusion
In summary, categorization of continuous variables should be avoided. It leads to a number of problems, including biased estimates, loss of power, inflated type-I error, and spurious interaction effects. If the true effect of the exposure variable(s) in question cannot be easily modelled by classical parametric models, non-parametric regression methods should be preferred in order to avoid the above-mentioned problems and to gain insight into the underlying relationship. If one choose to categorize despite such warnings, it is generally preferable to categorize into more than two groups in order to minimize the information loss.