Dichotomisation using a distributional approach when the outcome is skewed

Background: Dichotomisation of continuous outcomes has been rightly criticised by statisticians because of the loss of information incurred. However, to communicate a comparison of risks, dichotomised outcomes may be necessary. Peacock et al. developed a distributional approach to the dichotomisation of normally distributed outcomes, allowing the presentation of a comparison of proportions with a measure of precision which reflects the comparison of means. Many common health outcomes are skewed, so the distributional method for the dichotomisation of continuous outcomes may not apply.
Methods: We present a methodology to obtain dichotomised outcomes for skewed variables, illustrated with data from several observational studies. We also report the results of a simulation study which tests the robustness of the method to deviations from normality and assesses the validity of the newly developed method.
Results: The review showed that the pattern of dichotomisation varied between outcomes. Birthweight, blood pressure and BMI can either be transformed to normal, so that normal distributional estimates for a comparison of proportions can be obtained, or, better, the skew-normal method can be used. For gestational age, no satisfactory transformation is available and only the skew-normal method is reliable. The normal distributional method remains reliable when there are small deviations from normality.
Conclusions: The distributional method, with its applicability to common skewed data, allows researchers to provide both continuous and dichotomised estimates without losing information or precision. This will have the effect of providing a practical understanding of the difference in means in terms of proportions.


Background
Researchers and practitioners in medicine often use continuous measurements to classify subjects as either normal or abnormal according to a particular cut-off. This dichotomisation is typically done for one of three reasons. The first is to facilitate a treatment decision for an individual, such as giving anti-hypertensive drugs if systolic blood pressure is over 160 mmHg. Secondly, dichotomisation may be used to quantify the proportion of a population with an abnormal outcome, such as the proportion of babies with low birthweight, i.e. birthweight under 2500 g. Thirdly, dichotomisation is used to provide estimates that are more clinically meaningful, for example when comparing two groups: a difference in, say, mean birthweight may be difficult to interpret, while a difference in the proportion with low birthweight is intuitively more meaningful. Dichotomisation is thus commonly seen and used, but is known to be problematic because of the obvious loss of information and reduced statistical power.
*Correspondence: odile.sauzet@uni-bielefeld.de. Epidemiology and International Public Health, School of Public Health, Universität Bielefeld, Bielefeld, Germany.
The distributional approach [1,2] was developed to remedy this problem by providing a way to dichotomise a continuous outcome without losing precision by considering the proportion below a cut-off as a function of the mean and standard deviation of the distribution. In this way researchers may present both a mean difference and a comparison of proportions below a given cut-off with equivalent precision. With dual outcomes, the dichotomisation of continuous data is statistically rigorous.
The distributional method requires that the data follow a normal distribution or can be transformed to normal, for example by using a logarithmic transform. Many common health outcomes, e.g. blood pressure, body mass index (BMI), are not normally distributed because of perturbations due to the presence in the population of a few people with very high blood pressures or BMIs. This process has been described to lead to a skew-normal distribution of outcomes [3].
A small systematic review was undertaken to illustrate the ways in which three common outcomes, blood pressure, body mass index, and gestational age, are analysed and presented in medical journals. To do this, the PubMed database was searched using the terms blood pressure, body mass index (BMI), and gestational age OR preterm birth, and all their related MeSH terms. One hundred and ninety studies were retrieved and, after screening the full texts, 49 eligible studies were identified (blood pressure (BP): 23, BMI: 13, gestational age (GA): 13). Among the BP studies, the analysis used the continuous data in 17/23 studies, the dichotomous data in 9/23 and both in 3/23. BMI was analysed as continuous in 9/13 studies and as dichotomous in 5/13; one study included both continuous and dichotomous outcomes. The pattern for GA was different, as most studies (12/13) used the dichotomous form, while 3/13 used the continuous outcome and two studies used both forms. Over all three outcomes, authors rarely (4/49) commented on the distribution of the data. These are typical outcomes for which the distributional method for dichotomisation could be beneficial, because the population at risk is defined by a threshold. It is not known how robust the distributional method is to small deviations from normality. In this paper we investigate whether the distributional method remains reliable in the case of deviations from normality, and propose a generalisation of the distributional method that allows for skewness by using the skew-normal distribution.

Methods
The methods section consists of two parts. In the first part we derive the estimates and standard errors for the skew-normal distributional method for dichotomisation, and in the second part we provide the methods for two studies. The first study illustrates the skew-normal method with real data; the second assesses the robustness of the normal method to small deviations from normality and validates the skew-normal method through simulation. The research reported does not require ethical approval because of its methodological nature.

Distributional method for the dichotomisation of skewed data
The skew-normal distributional method
The normal distributional method has been previously described in detail [1,2]. In brief, it provides a large-sample approximation for the estimation of proportions and their standard errors, assuming a normal distribution for the underlying population with parameters obtained from the data. The skew-normal distributional method uses the skew-normal distribution, which has been extensively studied in [3]. This distribution generalises the normal distribution by adding a third parameter α which defines the skewness (if α = 0, the distribution is normal). The derivation of the distributional standard error for the proportion above or below a threshold uses the delta method and is similar to that in [1].
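Let $\bar{X}_n$ be the sample mean of n independent, identically skew-normal distributed random variables $X_i$, $i = 1, \dots, n$, with mean $\mu$, variance $\sigma^2$ and skewness parameter $\alpha$. Let $x_0$ be a threshold of interest. The random variable $p(\bar{X}_n)$ for the proportion of the population with outcome value under the threshold $x_0$ is defined as the skew-normal distribution function evaluated at $x_0$, $$p(\bar{X}_n) = \int_{-\infty}^{x_0} \frac{2}{\omega}\,\phi\!\left(\frac{x-\xi}{\omega}\right)\Phi\!\left(\alpha\,\frac{x-\xi}{\omega}\right)dx,$$ where $\phi$ and $\Phi$ are the standard normal density and distribution function, $\xi$ and $\omega$ are the location and scale parameters of the skew-normal distribution, $\xi = \mu - \omega\mu_z$ and $\mu_z = \sqrt{2/\pi}\,\alpha/\sqrt{1+\alpha^2}$ (see [3]), with $\mu$ estimated by $\bar{X}_n$. From the delta method we obtain that $p(\bar{X}_n)$ is approximately normally distributed, with a standard deviation that depends on the derivative $p'(\bar{X}_n)$. We outline the derivation of $p'(\bar{X}_n)$ and the formula for the standard deviation in the Appendix.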
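The computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' software: the function names (`owens_t`, `skew_normal_cdf`, `sn_proportion_below`) are hypothetical, the skew-normal distribution function is evaluated through Owen's T function by simple numerical integration, and the delta-method standard error treats σ and α as fixed so that only the sample mean varies. The exact expression given in the Appendix is not reproduced here.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def owens_t(h, a, steps=2000):
    """Owen's T function T(h, a), by Simpson's rule on [0, a] (steps is even)."""
    if a == 0:
        return 0.0
    width = a / steps  # negative when a < 0, which gives T(h, -a) = -T(h, a)

    def integrand(x):
        return math.exp(-0.5 * h * h * (1.0 + x * x)) / (1.0 + x * x)

    total = integrand(0.0) + integrand(a)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(k * width)
    return total * width / 3.0 / (2.0 * math.pi)


def skew_normal_cdf(z, alpha):
    """Distribution function of the standard skew-normal: Phi(z) - 2 T(z, alpha)."""
    return STD_NORMAL.cdf(z) - 2.0 * owens_t(z, alpha)


def skew_normal_pdf(z, alpha):
    """Density of the standard skew-normal: 2 phi(z) Phi(alpha z)."""
    return 2.0 * STD_NORMAL.pdf(z) * STD_NORMAL.cdf(alpha * z)


def sn_proportion_below(mean, sd, alpha, x0, n):
    """Proportion below x0 and a delta-method se (sd and alpha assumed fixed)."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    mu_z = math.sqrt(2.0 / math.pi) * delta      # mean of the standard skew-normal
    omega = sd / math.sqrt(1.0 - mu_z * mu_z)    # scale, from sd^2 = omega^2 (1 - mu_z^2)
    xi = mean - omega * mu_z                     # location, xi = mu - omega * mu_z
    z0 = (x0 - xi) / omega
    p = skew_normal_cdf(z0, alpha)
    # |dp/dmean| = density at x0; multiplied by the se of the sample mean
    se = skew_normal_pdf(z0, alpha) / omega * sd / math.sqrt(n)
    return p, se
```

With α = 0 the function reduces to the normal distributional estimate, which gives a convenient sanity check of the sketch.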
Let $n_1$, $n_2$, $\mu_1$, $\mu_2$, $\alpha$, sd, $p_1$ and $p_2$ be, for the two groups being compared, the sample sizes, the sample means, the pooled sample skew coefficient, the pooled sample standard deviation and the skew-normal distributional estimates of the proportions under the threshold $x_0$ in each group. For each group i = 1, 2, the standard error of $p_i$ is obtained as described above. Let d, rr and or be the skew-normal distributional estimates of the difference in proportions, risk ratio and odds ratio. The following formulae provide the variances (se$^2$) for these estimates or their logarithm.
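As an illustration of how such variances can be assembled, the sketch below uses the usual delta-method forms for a difference, a log risk ratio and a log odds ratio. It assumes that the two group estimates are independent, which ignores the fact that they share a pooled standard deviation and skew coefficient, and it is not copied from the paper. The function name `comparison_estimates` and the dictionary keys are hypothetical.

```python
import math


def comparison_estimates(p1, se1, p2, se2):
    """Difference, risk ratio and odds ratio of two proportions, with the
    variances of d, log(rr) and log(or), treating p1 and p2 as independent."""
    d = p1 - p2
    rr = p1 / p2
    odds_ratio = (p1 / (1.0 - p1)) / (p2 / (1.0 - p2))
    return {
        "d": d,
        "var_d": se1 ** 2 + se2 ** 2,
        "rr": rr,
        "var_log_rr": (se1 / p1) ** 2 + (se2 / p2) ** 2,
        "or": odds_ratio,
        "var_log_or": (se1 / (p1 * (1.0 - p1))) ** 2
                      + (se2 / (p2 * (1.0 - p2))) ** 2,
    }


def ratio_confidence_interval(estimate, var_log, z=1.96):
    """95% interval for a ratio, computed on the log scale and back-transformed."""
    half_width = z * math.sqrt(var_log)
    return estimate * math.exp(-half_width), estimate * math.exp(half_width)
```

Intervals for rr and or are built on the log scale because the sampling distribution of a ratio is closer to normal after taking logarithms.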
These standard errors use more information than the standard errors of proportions estimated directly from the dichotomised data. They depend on the underlying distribution and not just on the sample proportion and sample size.

Proportions and transformed data
Transformed data present difficulties of interpretation because it may not be possible to back-transform to the natural scale, and even when this can be done, the meaning is changed. However, the proportion below a cut-point is not affected if the transformation function is continuous and strictly monotonic, such as the logarithm, square root or reciprocal. The proportion of patients with a condition defined by a threshold remains unchanged under a transformation of the outcome. In mathematical terms: if y is an outcome and Y a threshold such that, for example, patient i is to be treated if the outcome $y_i$ is smaller than Y, then for f a continuous increasing function $$y_i < Y \iff f(y_i) < f(Y),$$ and for g a continuous decreasing function $$y_i < Y \iff g(y_i) > g(Y).$$ Among the usual functions used for transforming data, the logarithm, the square root and the square (all three applied only to positive values) are increasing functions. The reciprocal (1/x) for positive outcomes and the opposite value (-x) are decreasing functions, so a proportion in the lower tail on the original scale will be in the upper tail on the transformed scale.
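A small numerical check of this invariance is sketched below. The blood pressure values are made up for illustration and the helper `proportion` is a hypothetical name; only the threshold of 160 mmHg comes from the text.

```python
import math


def proportion(values, predicate):
    """Share of values for which the predicate holds."""
    return sum(1 for v in values if predicate(v)) / len(values)


# Hypothetical systolic blood pressure readings (mmHg), for illustration only
sbp = [112, 128, 135, 141, 150, 158, 160, 163, 171, 188, 204]
cut = 160

# Original scale: proportion strictly below the threshold
below_raw = proportion(sbp, lambda v: v < cut)

# Increasing transformation (logarithm): the proportion stays in the lower tail
below_log = proportion(sbp, lambda v: math.log(v) < math.log(cut))

# Decreasing transformation (reciprocal): the proportion moves to the upper tail
above_inverse = proportion(sbp, lambda v: 1.0 / v > 1.0 / cut)

print(below_raw, below_log, above_inverse)
```

All three proportions coincide because each comparison selects exactly the same patients.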

Study 1: Examples from data from several observational studies
To illustrate the use of the distributional method for the dichotomisation of skewed outcomes, we present the analysis of skewed data using the skew-normal distributional method and compare the results with the normal distributional method for transformed data. The data come from two observational studies: birthweight (BW), body mass index (BMI) and gestational age (GA) are outcomes taken from the St George's Birthweight Study [4], and systolic blood pressure (SBP) was measured on stroke patients included in the South London Stroke Register [5,6], which was set up in 1995 and records all first-ever strokes in an inner city area of South London.

Study 2: Robustness to small deviation from normality and validation of the skew-normal method
We assess the robustness of the (normal) distributional method in the presence of skewness for two reasons: to find out whether the results remain reliable even if the data are not exactly normally distributed, and to establish the necessity of an alternative method for data with more skewness. We also validate the skew-normal method. Data were generated from (1) a lognormal distribution, which has a skewed upper tail, and (2) left- and right-skewed skew-normal distributions. The data were analysed using the normal distributional method and, for the skew-normal data, also using the skew-normal method. The log standard deviation $\sigma_{\log}$ provides a measure of skewness for the lognormal data via the ratio of the expected value to the median, which is equal to $\exp(\sigma_{\log}^2/2)$. Values of the log standard deviation considered in this study range between 0.02 and 1. The parameter α of the skew-normal distribution was used as a measure of skewness for the skew-normal data and ranged from -20 to 20. The values -1 and 1 give small deviations from normality.
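To give a feel for what this range of log standard deviations means, the following sketch tabulates the mean-to-median ratio together with the standard skewness coefficient of a lognormal distribution. The function names are hypothetical and the intermediate values 0.1 and 0.5 are chosen only for illustration; the paper states the range 0.02 to 1 but not the grid used.

```python
import math


def lognormal_mean_median_ratio(sigma_log):
    """Ratio of expected value to median for a lognormal: exp(sigma_log^2 / 2)."""
    return math.exp(sigma_log ** 2 / 2.0)


def lognormal_skewness(sigma_log):
    """Skewness coefficient of a lognormal: (e^{s^2} + 2) * sqrt(e^{s^2} - 1)."""
    w = math.exp(sigma_log ** 2)
    return (w + 2.0) * math.sqrt(w - 1.0)


for sigma_log in (0.02, 0.1, 0.5, 1.0):  # illustrative grid within the stated range
    print(sigma_log,
          round(lognormal_mean_median_ratio(sigma_log), 4),
          round(lognormal_skewness(sigma_log), 3))
```

Both quantities depend only on the log standard deviation, not on the log mean, which is why a single parameter can index the degree of skewness in the simulation.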
The validity of the distributional method is assessed through the bias of the estimate, how accurately the standard error (se) measures the variability of the estimate, and the coverage of the true value by the 95% confidence interval. The varying parameters used for the simulation are the cut-point, the skewness (by varying the log standard deviation, from 0.02 to 1), the effect size (mean difference over standard error, from 0.01 to 0.5) and the sample size (20 to 500).
Simulations were performed using the statistical software R. The following algorithm was followed 20 000 times for each set of parameter values. For each simulated dataset, the mean and standard error are obtained to compute the normal distributional estimates with standard error for the difference in proportions, risk ratio and odds ratio.
Summaries are then obtained for the 20 000 datasets in the following way:
• Mean values over the 20 000 datasets are obtained for all estimates and standard errors.
• Standard deviations over the 20 000 datasets are also obtained for the difference in proportions, RR and OR in order to be compared with the mean standard errors.
• The mean bias (defined as the relative difference between true values and estimates) is obtained for all estimates.
• The coverage of the 95% distributional confidence interval (DCI) is computed as the proportion of datasets for which the true value of the parameter was in the DCI.
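The summaries listed above can be sketched as follows for the difference in proportions with lognormal data. This is a simplified Python illustration, not the authors' R code: the function names, the seed, the default number of replications and the way the group shift is specified on the log scale are all assumptions, and the standard error treats the pooled standard deviation as fixed.

```python
import math
import random
from statistics import NormalDist, mean, stdev

STD_NORMAL = NormalDist()


def normal_distributional_difference(x1, x2, cut):
    """Normal distributional estimate of p1 - p2 (proportions below cut) and its se."""
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * stdev(x1) ** 2 + (n2 - 1) * stdev(x2) ** 2) / (n1 + n2 - 2)
    pooled_sd = math.sqrt(pooled_var)
    z1 = (cut - mean(x1)) / pooled_sd
    z2 = (cut - mean(x2)) / pooled_sd
    d = STD_NORMAL.cdf(z1) - STD_NORMAL.cdf(z2)
    # Delta method on the sample means only (pooled sd treated as fixed)
    se = math.sqrt(STD_NORMAL.pdf(z1) ** 2 / n1 + STD_NORMAL.pdf(z2) ** 2 / n2)
    return d, se


def simulate(sigma_log, shift_log, cut, n, reps=2000, seed=1):
    """Summarise bias, se accuracy and coverage over reps lognormal datasets."""
    rng = random.Random(seed)
    # True proportions below the cut-point under the lognormal model
    true_d = (STD_NORMAL.cdf(math.log(cut) / sigma_log)
              - STD_NORMAL.cdf((math.log(cut) - shift_log) / sigma_log))
    estimates, ses, covered = [], [], 0
    for _ in range(reps):
        x1 = [rng.lognormvariate(0.0, sigma_log) for _ in range(n)]
        x2 = [rng.lognormvariate(shift_log, sigma_log) for _ in range(n)]
        d, se = normal_distributional_difference(x1, x2, cut)
        estimates.append(d)
        ses.append(se)
        if abs(d - true_d) <= 1.96 * se:
            covered += 1
    return {
        "true_d": true_d,
        "relative_bias": (mean(estimates) - true_d) / true_d,
        "mean_se": mean(ses),
        "sd_of_estimates": stdev(estimates),
        "coverage": covered / reps,
    }
```

A call such as `simulate(0.02, 0.01, math.exp(-0.02), 100)` corresponds to a nearly normal outcome; increasing `sigma_log` towards 1 shows how the summaries change as skewness grows.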

Results
Study 1: Skew-normal distributional method illustrated with data from several observational studies
Normal data
Data from the St George's Birthweight study [4] were used to compare the proportions of low birthweight (LBW) babies among smoking and non-smoking mothers. Results are given in Table 1a.
• Birthweight data for term babies are known to be normally distributed [7] (Figure 1) and the distributional method can be used without transformation.
• The mean BW (SD) in the non-smoking group was 3452g (435) for 983 observations and in the smoking group 3267g (441) for 494 observations.
• The data are normally distributed (see above) and the standard deviations can be assumed to be equal.
• The difference in means (SE) between smoking and non-smoking mothers is 184 (24).
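Using only the summary statistics listed above, the normal distributional estimates of the proportion of low birthweight babies can be sketched as follows. The variable and function names are hypothetical, the pooled standard deviation is computed in the usual way from the two group values, and the standard error uses the simple delta-method form with the standard deviation treated as fixed. The printed values may therefore differ slightly from those reported in Table 1a.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

# Summary statistics reported in the text (birthweight in grams)
n_ns, mean_ns, sd_ns = 983, 3452.0, 435.0   # non-smoking mothers
n_s, mean_s, sd_s = 494, 3267.0, 441.0      # smoking mothers
cut = 2500.0                                 # low birthweight threshold

pooled_sd = math.sqrt(((n_ns - 1) * sd_ns ** 2 + (n_s - 1) * sd_s ** 2)
                      / (n_ns + n_s - 2))


def proportion_below(group_mean, sd, n, threshold):
    """Normal distributional proportion below threshold and its delta-method se."""
    z = (threshold - group_mean) / sd
    return STD_NORMAL.cdf(z), STD_NORMAL.pdf(z) / math.sqrt(n)


p_ns, se_ns = proportion_below(mean_ns, pooled_sd, n_ns, cut)
p_s, se_s = proportion_below(mean_s, pooled_sd, n_s, cut)

d = p_s - p_ns                               # difference in proportions of LBW
se_d = math.sqrt(se_s ** 2 + se_ns ** 2)     # groups treated as independent

print(f"LBW non-smokers: {p_ns:.4f}, smokers: {p_s:.4f}")
print(f"difference: {d:.4f}, 95% CI: {d - 1.96 * se_d:.4f} to {d + 1.96 * se_d:.4f}")
```

The precision of the difference in proportions is driven by the precision of the two means, which is what allows the dichotomised comparison to retain the power of the comparison of means.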

Lognormal data
A dataset from the South London Stroke Register provided the last recorded systolic blood pressure (SBP) before the first-ever stroke of 1896 patients. There are known differences in the risk of stroke for ethnic minorities in the UK [5,6], and here we look at the difference in proportions of high blood pressure between white and non-white patients. Results are given in Table 1b.
• SBP is a right-skewed outcome (see Figure 2a) and the proportion of interest is in the right tail (patients with SBP ≥ 160). A logarithmic transformation provides a normally distributed outcome. On the transformed scale, high blood pressure patients are those with transformed SBP above log(160).

Inverse transformation
Data from the St George's Birthweight study [4] were used to obtain the BMI from the height and weight of pregnant women at the beginning of pregnancy. The usual threshold of 30 kg/m 2 to compute the proportion of mothers with obesity was used. Results are given in Table 1c.
• The histogram of BMIs (Figure 3a) showed a right-skewed distribution. Taking the inverse of BMI provides a distribution which is approximately normal (Figure 3b).

Other types of transformations
A newborn is considered preterm if its gestational age (GA) is under 37 completed weeks. Because of natural termination and medical intervention, the duration of gestation does not normally go much over 43 weeks, while there is a small number of very early births; the distribution of GA is therefore left skewed. We tried to find a suitable transformation, but it remains imperfect, and the results show that the skew-normal distributional method is the best alternative to reflect the difference in means on the original scale. Results are presented in Table 1d.
Figure 1 Histogram of birthweights for term babies.
• The first transformation is to take 45-GA, which provides a right-skewed positive outcome. Then a log transformation provides a fit close to normal (see Figure 4b).
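Because 45-GA is a decreasing function of GA, the preterm cut-off moves to the upper tail on the transformed scale. The short sketch below makes the mapping of the threshold explicit; the function names and the example gestational ages are hypothetical, and only the transformation and the 37-week cut-off come from the text.

```python
import math

PRETERM_CUT = 37.0  # completed weeks


def transform_ga(ga_weeks):
    """Two-step transformation log(45 - GA); defined only for GA below 45 weeks."""
    if ga_weeks >= 45.0:
        raise ValueError("transformation requires GA < 45 weeks")
    return math.log(45.0 - ga_weeks)


# On the transformed scale the cut-off becomes log(45 - 37) = log(8)
transformed_cut = transform_ga(PRETERM_CUT)


def is_preterm_original(ga_weeks):
    return ga_weeks < PRETERM_CUT


def is_preterm_transformed(ga_weeks):
    # Decreasing transformation: the lower tail becomes the upper tail
    return transform_ga(ga_weeks) > transformed_cut


example_ga = [24.0, 30.5, 36.9, 37.0, 38.2, 40.0, 42.5]  # illustrative values
print([is_preterm_transformed(ga) for ga in example_ga])
```

The proportion of preterm births is therefore the proportion above log(8) on the transformed scale.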

Study 2: Robustness of the distributional method and validation of the skew-normal method
Results of the simulations are summarised in Table 2 for the lognormal data and in Table 3 for the skew-normal data. Bias of estimates is summarised by the third quartile of its absolute value. This shows that the bias remains small for all sample sizes when the skewness is under 0.1 (lognormal data), but then increases to a level which may not be acceptable. For skew-normal data, the normal method provides satisfactory results for small coefficients of skewness (±1 in these simulations). For RR and OR, the estimates are biased for small sample sizes, as seen in [2], but for sample sizes of 50 (100 for OR) per group or more the estimates are more robust to skewness than the difference in proportions. With increasing skewness the normal method is no longer reliable, but the skew-normal method then provides acceptable results for the skew-normal data. For small values of the skewness parameter the skew-normal method is unreliable and the normal method must be used.
Figure 2 Histogram of systolic blood pressure. a Systolic blood pressure (mmHg), HBP in the right tail.
Bias of the standard error, defined as the difference between the mean standard error and the standard deviation relative to the standard deviation, is summarised in Tables 2 and 3 by the third quartile of its absolute value. This shows that the standard error reflects the true variability of the parameter estimates well, unless the skewness is very large (lognormal data) or the sample size is small (20 per group) for the skew-normal method.
The results for bias of estimates and of standard error are reflected in the coverage of the 95% (normal) distributional confidence interval also shown in Tables 2 and 3 with the interquartile range.

Discussion
Our small review of the literature mentioned in the introduction showed that, in 49 studies, only 4 authors described the distribution of their data. Skewed data were often analysed and presented as means, perhaps because they are easier to interpret on the original scale. Relatively few authors present both the continuous and dichotomous form of their outcome, when in fact the dual presentation provides a richer summary of the data. The distributional method provides a way to remedy this by providing dichotomised estimates that sit alongside the continuous outcome comparison without losing power. However, the distributional method requires the data to follow a normal distribution, and so we have sought to generalise the normal distributional method by adding a parameter and using the skew-normal distribution. We have performed two studies to complement the skew-normal method. In Study 2, we have seen using simulations that small deviations from normality did not affect the reliability of the normal distributional method, but for larger skewness a correction was required. We also saw that for larger skewness the skew-normal method was reliable even for smaller sample sizes (50 per group or more, less so for 20 per group). In Study 1, we illustrated the skew-normal method with real data. We have shown with the gestational age example that a good transformation is not always available and that the skew-normal distributional method is a better alternative. More generally, the distributional method applied to transformed data will reflect the difference in means on the transformed scale (leading to potentially different conclusions), while both the skew-normal and normal distributional methods will reflect the difference in means on the original scale, and the most appropriate should be preferred.
In Study 1, in the birthweight example, we saw that for almost normal data the skew-normal and normal methods provided similar results. However, the sample size in this dataset was large. Study 2 showed that for almost normal data the skew-normal method did not perform well unless the sample size was large enough. The reason for this remains unclear, but if the data look normal and the sample size is not large, the normal method should be preferred.
In this paper we presented only unadjusted estimates of comparisons of proportions. However, the method can also be applied after a linear regression (including mixed models). Software is available for Stata and R [8].

Conclusion
This study has dealt with two issues. First, we have shown that the normal distributional method continues to perform well even if the actual distribution is slightly skewed, so the method can be used with confidence with real data, which will only ever be approximately normal. Second, we have generalised the method to include skewed data. The distributional method, with its applicability to skewed data, allows researchers to provide both continuous and dichotomised estimates without losing information or precision. This will have the effect of providing a practical understanding of the difference in means in terms of proportions.
*Mean of the relative difference between estimates and true parameter to the true parameter.
**Relative difference between the mean standard error and the standard deviation to the standard deviation.
Varying parameters include effect size (difference in means) and cut-point.