Transformations of summary statistics as input in meta-analysis for linear dose-response models on a logarithmic scale: a methodology developed within EURRECA

Background To derive micronutrient recommendations in a scientifically sound way, it is important to obtain and analyse all published information on the association between micronutrient intake and biochemical proxies for micronutrient status using a systematic approach. Therefore, it is important to incorporate information from randomized controlled trials as well as observational studies as both of these provide information on the association. However, original research papers present their data in various ways. Methods This paper presents a methodology to obtain an estimate of the dose–response curve, assuming a bivariate normal linear model on the logarithmic scale, incorporating a range of transformations of the original reported data. Results The simulation study, conducted to validate the methodology, shows that there is no bias in the transformations. Furthermore, it is shown that when the original studies report the mean and standard deviation or the geometric mean and confidence interval the results are less variable compared to when the median with IQR or range is reported in the original study. Conclusions The presented methodology with transformations for various reported data provides a valid way to estimate the dose–response curve for micronutrient intake and status using both randomized controlled trials and observational studies.


Background
Meta-analysis of the association between micronutrient intake and biochemical proxies for micronutrient status or function is needed when setting micronutrient recommendations. Information on this association may come from randomized controlled trials as well as from observational studies. In a randomized trial subjects are randomized to receive either the intervention treatment or the control treatment, and a meta-analysis of such studies will usually provide a mean difference in micronutrient status between placebo and intervention groups, answering the question whether the biochemical status marker responds to the dietary intake of a micronutrient [1][2][3]. However, this analysis does not provide an estimate of the slope of the dose-response relationship. On the other hand, a meta-analysis of observational studies provides an estimate of the slope of the dose-response relation, but observational studies are hampered by for instance measurement error in the intake estimates, which causes bias in the reported association [4][5][6].
Ideally, information from observational studies and randomized controlled trials should be compared or even combined in a single meta-analysis to ensure that all reported information is taken into account over a broad range of intake. This requires that the summary statistics reported in individual studies are transformed into estimates of the dose-response relation. Since both intake and status are continuous variables, this estimate is actually an estimate of the regression coefficient of the linear regression of micronutrient status on micronutrient intake. The individual estimates of the dose-response regression coefficient may then be combined in a meta-analysis.
The statistical combination of study results may be complicated by the variety of ways that individual studies report the summary statistics. The results from randomized controlled trials as well as the baseline summary statistics of micronutrient intake and status may be reported as means, medians or geometric means. Variability is often reported as standard deviations, standard errors, interquartile ranges (IQR), ranges or confidence intervals (CI). In observational studies the relation between intake and status can be reported as a Pearson correlation coefficient, a Spearman rank correlation coefficient or a regression coefficient. In addition, either the intake variable or the status variable or both could have been logarithmically transformed before the correlation or association was calculated. All these different ways of reporting need to be standardized before meta-analysis is even possible.
This paper gives an overview of transformation methods to algebraically derive an estimate from each study of the regression coefficient (slope, b) and its standard error (se(b)), for studies that do not directly report these. The methods are validated by comparing the calculated values with theoretical values in a small-scale simulation study.

Methods
In order to derive transformations we assume a bivariate normal distribution on the log-scale for intake and status of an individual person. The log-scale was chosen because both intake and status values are always above zero, and the observed distributions of the micronutrient variables are often right-skewed. Moreover, as the true shape of the dose-response curve is usually unknown, the linear relation between logarithmically transformed quantities provides the simplest approximation. In more detail, for the dose-response meta-analysis of observational studies we assume that ξ0 (intake of micronutrient) and η0 (status or continuous health outcome) are log-normally distributed. The assumption of bivariate normality entails a linear association between ξ = ln(ξ0) and η = ln(η0), where ln denotes the natural logarithm. Note that we use the Greek letters ξ and η for the theoretical values of intake and status/response, and the Latin letters X and Y for the observed values of these variables. Furthermore, we reserve letters without subscript (e.g., X and Y) for values expressed on the ln-scale, and use letters with subscript 0 (e.g., X0 and Y0) for values expressed on the absolute (i.e., original) scale.
The process of data transformations to obtain the required statistics from what is reported in observational studies consists of four steps (Figure 1). The first step is to obtain the mean of X (mX) and Y (mY) and the standard deviation of X (sX) and Y (sY). Secondly, the mean of X0 (mX0) and Y0 (mY0) and the standard deviation of X0 (sX0) and Y0 (sY0) are calculated when needed for the calculations in step 3. In this third step the correlation coefficient of the association between X and Y (rXY) is calculated from the reported data. In the last step, the regression coefficient of the linear regression of Y on X (bYX) is calculated from rXY, and se(bYX) is calculated from rXY, sY, sX and the sample size (n). For reports on randomized controlled trials, the process consists of three steps. In the first step, mY and sY are obtained for both the intervention and the placebo group. In the second step, mX is obtained, and in the last step, bYX and se(bYX) are calculated. The equations for all these transformations are given below.

Figure 1: flow chart of the transformation steps. Observational studies: Step 1, calculate mX, sX and mY, sY from the reported univariate statistics using equations (1)-(7); Step 2, calculate mX0, sX0 and mY0, sY0 from mX, sX and mY, sY using equations (8) and (9); Step 3, calculate rXY from the reported bivariate statistic using equations (10)-(17); Step 4, calculate bYX from rXY. Randomized controlled trials: Step 1, calculate mY and sY from the reported values after intervention for the placebo and intervention groups using equations (1)-(7); Step 2, calculate mX as ln(mX0) for both the placebo and the intervention group.

Univariate transformations
First, we describe how the univariate statistics of the normal distributions on the ln-scale can be obtained from various reported statistics. We present formulas for mX and sX, which can be used in the same way for mY and sY in observational studies. For randomized controlled trials the situation is different, because the variation in X is artificial and is not described by a normal distribution. Therefore, the transformations should be used only to obtain mY and sY in the intervention and placebo groups separately. In most trials the within-group variation in X will be negligible compared with the difference between the groups; consequently mX is calculated simply as mX_con = ln(mX0_con) for the placebo group and as mX_int = ln(mX0_int) for the intervention group.
For these transformations, we assume that ξ is normally distributed with parameters μξ and σξ. For a lognormal distribution the mean on the absolute scale, μξ0, is given by
\[ \mu_{\xi_0} = \exp\left(\mu_{\xi} + \tfrac{1}{2}\sigma_{\xi}^{2}\right) \]
and the standard deviation on the absolute scale, σξ0, is given by
\[ \sigma_{\xi_0} = \sqrt{\exp\left(2\mu_{\xi} + \sigma_{\xi}^{2}\right)\left(\exp\left(\sigma_{\xi}^{2}\right) - 1\right)} \]
It follows that when the mean (mX0) and the standard deviation (sX0) are reported, mX and sX can be calculated as:
\[ \mathrm{mX} = \ln(\mathrm{mX}_0) - \tfrac{1}{2}\ln\left(1 + \frac{\mathrm{sX}_0^{2}}{\mathrm{mX}_0^{2}}\right) \tag{1} \]
\[ \mathrm{sX} = \sqrt{\ln\left(1 + \frac{\mathrm{sX}_0^{2}}{\mathrm{mX}_0^{2}}\right)} \tag{2} \]
The exponential of the mean on the ln-scale is equal to the median of the lognormal distribution on the absolute scale. Therefore, when the median (medX0) has been reported on the absolute scale, mX is calculated as:
\[ \mathrm{mX} = \ln(\mathrm{medX}_0) \tag{3} \]
As a measure of variability, an interquartile range (IQRX0) or range (rangeX0) is often reported together with the median or mean. The IQR is the difference between the third quartile Q3 and the first quartile Q1 (the 75th percentile and the 25th percentile). There are two cases. If the lower and upper limits are reported separately, the difference between the ln-transformed limits may be equated to an appropriate multiple of the standard deviation sX. If only the width of the IQR or range is reported, the derivation is more complex.
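The mean/SD and median cases described above rest on the standard moment relations of the lognormal distribution, which can be sketched in Python as follows. The function names are hypothetical and are not part of the published methodology; the sketch only restates the lognormal relations and should be checked against the reported statistics of each study before use.

```python
import math


def log_scale_from_mean_sd(mean0, sd0):
    """Return (mX, sX) on the ln-scale from an arithmetic mean and SD
    reported on the original scale, assuming a lognormal distribution."""
    cv2 = (sd0 / mean0) ** 2                # squared coefficient of variation
    s_x = math.sqrt(math.log1p(cv2))        # sX = sqrt(ln(1 + sX0^2 / mX0^2))
    m_x = math.log(mean0) - 0.5 * s_x ** 2  # mX = ln(mX0) - sX^2 / 2
    return m_x, s_x


def log_mean_from_median(median0):
    """The median on the original scale equals exp(mX) for a lognormal variable."""
    return math.log(median0)


def original_scale_from_log(m_x, s_x):
    """Inverse direction: arithmetic mean and SD on the original scale."""
    mean0 = math.exp(m_x + 0.5 * s_x ** 2)
    sd0 = mean0 * math.sqrt(math.expm1(s_x ** 2))
    return mean0, sd0
```

A round trip such as `log_scale_from_mean_sd(*original_scale_from_log(1.60, 0.85))` should return the ln-scale values it started from, which is a convenient check on a transcription of the formulas.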
When IQRX0 is reported together with the median, the relation between these and sX is given by
\[ \mathrm{IQRX}_0 = \mathrm{medX}_0\left(\exp(z_{0.75}\,\mathrm{sX}) - \exp(-z_{0.75}\,\mathrm{sX})\right) \]
where z represents the appropriate percentage point of the standard normal distribution (i.e., z0.75 = 0.6745). In this case sX may be calculated as
\[ \mathrm{sX} = \frac{1}{z_{0.75}}\ln\left(\frac{\mathrm{IQRX}_0/\mathrm{medX}_0 + \sqrt{\left(\mathrm{IQRX}_0/\mathrm{medX}_0\right)^{2} + 4}}{2}\right) \tag{4} \]
When the IQR is reported together with the mean, no explicit formula exists to derive sX. Therefore, to obtain an estimate of sX from these quantities a nonlinear function optimization is employed to find the value of sX for which the following equation holds
\[ \mathrm{IQRX}_0 = \mathrm{mX}_0\exp\left(-\tfrac{1}{2}\mathrm{sX}^{2}\right)\left(\exp(z_{0.75}\,\mathrm{sX}) - \exp(-z_{0.75}\,\mathrm{sX})\right) \tag{5} \]
When the lower and upper bounds of the IQR (i.e., Q1(X0) and Q3(X0), respectively) are reported, rather than the difference, sX may be calculated as
\[ \mathrm{sX} = \frac{\ln Q_3(X_0) - \ln Q_1(X_0)}{2\,z_{0.75}} \tag{6} \]
The range is the difference between the maximum and the minimum value of the data. Equations (4) and (5) may be similarly used when the range is reported, but here we consider that the minimum and the maximum represent the lower and upper (1/n) fraction of the dataset of n observations. Therefore we expect a fraction 1/(2n) of the distribution below the minimum and the same fraction above the maximum, and in the equations above we need to use z_p with p = 1 - 1/(2n) instead of z0.75. For example, in a dataset with n = 100 we use z0.995 = 2.576.
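The IQR and range cases can be sketched as follows. The function names are hypothetical. For the mean-with-IQR case the text calls for a nonlinear optimization; the sketch below substitutes a simple bisection, and it assumes that the smaller of the two possible solutions is the one wanted, which is an assumption of this illustration rather than something stated in the methodology.

```python
import math
from statistics import NormalDist

Z75 = NormalDist().inv_cdf(0.75)  # upper-quartile point of the standard normal, about 0.6745


def s_from_quartiles(q1, q3, z=Z75):
    """sX from separately reported lower and upper limits."""
    return (math.log(q3) - math.log(q1)) / (2.0 * z)


def s_from_median_iqr(median0, iqr0, z=Z75):
    """sX from the median and the IQR width: IQR/median = 2*sinh(z*sX)."""
    return math.asinh(iqr0 / (2.0 * median0)) / z


def s_from_mean_iqr(mean0, iqr0, z=Z75, iterations=200):
    """sX from the arithmetic mean and the IQR width, solved by bisection."""
    def width(s):
        return mean0 * math.exp(-0.5 * s * s) * 2.0 * math.sinh(z * s)

    # width(s) rises from 0 to a single peak located where s*tanh(z*s) = z.
    lo, hi = 0.0, z + 1.0
    while hi * math.tanh(z * hi) < z:
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mid * math.tanh(z * mid) < z:
            lo = mid
        else:
            hi = mid
    peak = 0.5 * (lo + hi)
    if width(peak) < iqr0:
        raise ValueError("no ln-scale SD reproduces this mean and IQR")

    # Assumption: take the root on the rising branch [0, peak].
    lo, hi = 0.0, peak
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if width(mid) < iqr0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def z_for_range(n):
    """Percentage point used when minimum and maximum replace the quartiles."""
    return NormalDist().inv_cdf(1.0 - 1.0 / (2.0 * n))
```

When a minimum and maximum are reported, a call such as `s_from_quartiles(minimum, maximum, z=z_for_range(n))` applies the same logic with the range-specific percentage point.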
The geometric mean (gm) of the lognormal distribution is equal to exp(mX), and is most often reported in papers together with the 95% confidence limits. mX and sX are obtained from these quantities using:
\[ \mathrm{mX} = \ln(\mathrm{gm}), \qquad \mathrm{sX} = \sqrt{n}\;\frac{\ln(X_{0,\mathrm{upp}}) - \ln(X_{0,\mathrm{low}})}{2\,z_{0.975}} \tag{7} \]
where X0,upp is the upper limit and X0,low is the lower limit of the 95% confidence interval, n is the sample size, and z0.975 = 1.96 represents the 97.5th percentage point of the standard normal distribution.
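A minimal sketch of the geometric-mean case follows. It assumes that the reported interval was built on the ln-scale with a standard normal percentage point, as in the text; intervals built with a t quantile would give a slightly different sX. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def log_scale_from_gm_ci(gm, low, upp, n, level=0.95):
    """Return (mX, sX) from a geometric mean and its confidence limits.

    The interval is assumed to be exp(mX -/+ z * sX / sqrt(n)).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # 1.96 for a 95% interval
    m_x = math.log(gm)
    s_x = math.sqrt(n) * (math.log(upp) - math.log(low)) / (2.0 * z)
    return m_x, s_x
```

Because the width of the interval shrinks with the square root of n, the sample size has to be taken from the same group for which the interval was reported.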
Then, in step 2 for observational studies, mX0 and sX0 are calculated in case these estimates were not already available. These statistics on the original scale may be needed in the bivariate transformations described below. The equations are:
\[ \mathrm{mX}_0 = \exp\left(\mathrm{mX} + \tfrac{1}{2}\mathrm{sX}^{2}\right) \tag{8} \]
\[ \mathrm{sX}_0 = \sqrt{\exp\left(2\,\mathrm{mX} + \mathrm{sX}^{2}\right)\left(\exp\left(\mathrm{sX}^{2}\right) - 1\right)} \tag{9} \]

Bivariate transformations (to obtain regression or correlation coefficients)
For observational studies, the next step is to obtain an estimate of the correlation between X and Y (rXY). The equations below can be used to obtain rXY from reported correlation and regression coefficients, taking into account the possibility that either X0, log10(X0), X, Y0, log10(Y0) or Y was used for the originally reported statistic. When a study reports the association as a Spearman rank correlation coefficient (rS), rXY is calculated as
\[ \mathrm{rXY} = 2\sin\left(\frac{\pi\,r_S}{6}\right) \tag{10} \]
Another option is that the association between X0 and Y0 is reported as a regression coefficient (bY0X0). In that case the correlation coefficient, rX0Y0, is calculated first using
\[ \mathrm{rX_0Y_0} = \mathrm{bY_0X_0}\,\frac{\mathrm{sX}_0}{\mathrm{sY}_0} \tag{11} \]
and then rXY is calculated using the following equation, which was derived from Johnson & Kotz [7]:
\[ \mathrm{rXY} = \frac{\ln\left(1 + \mathrm{rX_0Y_0}\sqrt{\left(\exp(\mathrm{sX}^{2}) - 1\right)\left(\exp(\mathrm{sY}^{2}) - 1\right)}\right)}{\mathrm{sX}\,\mathrm{sY}} \tag{12} \]
This formula (12) is also used when the Pearson product-moment correlation coefficient rX0Y0 is directly reported in a paper.
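The conversions for a Spearman coefficient and for statistics reported on the original scale can be sketched as follows. The relations used are the standard ones for a bivariate normal distribution on the ln-scale; the function names are hypothetical.

```python
import math


def r_from_spearman(r_s):
    """Pearson correlation on the ln-scale implied by a Spearman rank
    correlation. Ranks are unchanged by the log transformation, so the
    bivariate normal relation applies directly."""
    return 2.0 * math.sin(math.pi * r_s / 6.0)


def r_orig_from_slope(b_y0x0, s_x0, s_y0):
    """Correlation between X0 and Y0 from the slope of Y0 on X0."""
    return b_y0x0 * s_x0 / s_y0


def r_log_from_r_orig(r_x0y0, s_x, s_y):
    """Correlation on the ln-scale from the correlation on the original scale."""
    scale = math.sqrt(math.expm1(s_x ** 2) * math.expm1(s_y ** 2))
    return math.log1p(r_x0y0 * scale) / (s_x * s_y)
```

Note that `r_log_from_r_orig` needs the ln-scale standard deviations, whereas `r_orig_from_slope` needs the original-scale ones, which is why both sets of univariate statistics are prepared beforehand.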
For observational studies that report the regression coefficient of Y0 on X (bY0X), the correlation coefficient, rXY0, is calculated using
\[ \mathrm{rXY_0} = \mathrm{bY_0X}\,\frac{\mathrm{sX}}{\mathrm{sY}_0} \tag{13} \]
When log10(X0) is used instead of X, sX is replaced by sX/ln(10) in formula (13).
Then rXY is calculated using the following equation [8,9]:
\[ \mathrm{rXY} = \mathrm{rXY_0}\,\frac{\sqrt{\exp(\mathrm{sY}^{2}) - 1}}{\mathrm{sY}} \tag{14} \]
This formula (14) is also used when rXY0 is reported directly or when the Pearson product-moment correlation coefficient is reported between log10(X0) and Y0.
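For the mixed case, in which intake is on a logarithmic scale and status on the original scale, a minimal sketch is given below. The function names and the `log10_x` flag are hypothetical; the relation follows from the covariance between a normal variable and the exponential of a correlated normal variable.

```python
import math


def r_xy0_from_slope(b_y0x, s_x, s_y0, log10_x=False):
    """Correlation between ln-scale intake and original-scale status from
    the slope of Y0 on X (or on log10(X0) when log10_x is True)."""
    sd_predictor = s_x / math.log(10.0) if log10_x else s_x
    return b_y0x * sd_predictor / s_y0


def r_xy_from_r_xy0(r_xy0, s_y):
    """Correlation with both variables on the ln-scale."""
    return r_xy0 * math.sqrt(math.expm1(s_y ** 2)) / s_y
```

The mirror case, with status on a logarithmic scale and intake on the original scale, follows by exchanging the roles of the two variables, so that the ln-scale standard deviation of intake is used in the second function.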
When the regression coefficient of Y on X0 (bYX0) is reported in an observational study, the correlation coefficient, rX0Y, is calculated using
\[ \mathrm{rX_0Y} = \mathrm{bYX_0}\,\frac{\mathrm{sX}_0}{\mathrm{sY}} \tag{15} \]
When log10(Y0) is used instead of Y, sY is replaced by sY/ln(10) in formula (15).
Using rX0Y or the directly reported Pearson product-moment correlation coefficient between X0 and log10(Y0) or Y in an observational study, rXY is calculated using [8,9]:
\[ \mathrm{rXY} = \mathrm{rX_0Y}\,\frac{\sqrt{\exp(\mathrm{sX}^{2}) - 1}}{\mathrm{sX}} \tag{16} \]
When the regression coefficient of Y on X is reported, rXY is calculated as
\[ \mathrm{rXY} = \mathrm{bYX}\,\frac{\mathrm{sX}}{\mathrm{sY}} \tag{17} \]

Calculation of dose-response regression coefficient
In the last step, for both observational studies and randomized controlled trials, we need to obtain bYX and se(bYX). For observational studies, the required regression coefficient bYX is calculated from the correlation coefficient:
\[ \mathrm{bYX} = \mathrm{rXY}\,\frac{\mathrm{sY}}{\mathrm{sX}} \tag{18} \]
and the corresponding standard error (se(bYX)) is calculated as
\[ \mathrm{se(bYX)} = \frac{\mathrm{sY}}{\mathrm{sX}}\sqrt{\frac{1 - \mathrm{rXY}^{2}}{n - 2}} \tag{19} \]
For randomized controlled trials, the required regression coefficient bYX is calculated as:
\[ \mathrm{bYX} = \frac{\mathrm{mY_{int}} - \mathrm{mY_{con}}}{\mathrm{mX_{int}} - \mathrm{mX_{con}}} \tag{20} \]
where 'int' indicates the intervention group and 'con' indicates the control or placebo group. The corresponding standard error is calculated as:
\[ \mathrm{se(bYX)} = \frac{\sqrt{\mathrm{sY_{int}}^{2}/n_{\mathrm{int}} + \mathrm{sY_{con}}^{2}/n_{\mathrm{con}}}}{\mathrm{mX_{int}} - \mathrm{mX_{con}}} \tag{21} \]

Simulation study
A simulation study was conducted to validate the performance of the transformations given in this paper. Bivariate lognormal data were simulated, with X ~ Normal(1.60, 0.85²) and Y ~ Normal(5.70, 0.45²) on the ln-scale. Parameter values were based on values of vitamin B12 intake (X) and serum/plasma vitamin B12 (Y) [10][11][12][13]. Different strengths of the correlation between X and Y were simulated, namely 0.1, 0.5 and 0.9.
A sample of individuals (with sample size 100, 200 or 500) was randomly drawn, and values that represent different often used reporting options were calculated from this sample, namely the mean and SD, the median and IQR, the median and range and the geometric mean and 95% CI (all summary statistics on the absolute scale). Also, the correlation and regression coefficients of X and Y expressed in different scales were calculated. These 'reported' values were rounded to two decimal places. From these 'reported' values, the parameter estimates mX, mY, sX, sY and rXY were calculated using the transformations described in this paper. This process was repeated 1000 times. Table 1 shows the simulation results for the univariate statistics. On average the calculated values of mX and mY are almost the same as the true values, indicating that no important bias is present in these calculations. As expected, the 95% CI of the simulations is smaller for the simulations with a sample size of 500 than for the simulations with a sample size of 200 or 100. For sX and sY, the estimates are most precise when a geometric mean with a 95% CI is reported, and least precise when a median with a range is reported. Figure 2 shows the simulation results when a correlation coefficient is reported, and Figure 3 shows the simulation results when a linear regression coefficient is reported. Both these figures show the simulation results with true rXY = 0.5. Results are similar for true rXY = 0.9 and true rXY = 0.1 (data not shown). For the situation in which a correlation coefficient is the reported bivariate statistic, there is no difference for the four univariate reporting options. Therefore, these results are pooled in Figure 2.
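The logic of the simulation can be illustrated with a much reduced sketch that covers only two of the reporting options (mean with SD, and median with IQR) and only the univariate statistics. It is not the code used for the results reported here: the number of replicates, the seed, the quartile definition of `statistics.quantiles` and all function names are assumptions of this illustration.

```python
import math
import random
import statistics
from statistics import NormalDist

Z75 = NormalDist().inv_cdf(0.75)


def draw_sample(n, rho, rng, mx=1.60, sx=0.85, my=5.70, sy=0.45):
    """Draw n (X0, Y0) pairs whose logarithms are bivariate normal."""
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x = mx + sx * z1
        y = my + sy * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
        pairs.append((math.exp(x), math.exp(y)))
    return pairs


def recover_from_mean_sd(values):
    """'Report' mean and SD rounded to two decimals, then back-transform."""
    mean0 = round(statistics.fmean(values), 2)
    sd0 = round(statistics.stdev(values), 2)
    s = math.sqrt(math.log1p((sd0 / mean0) ** 2))
    return math.log(mean0) - 0.5 * s * s, s


def recover_from_median_iqr(values):
    """'Report' median and IQR rounded to two decimals, then back-transform."""
    q1, med, q3 = (round(q, 2) for q in statistics.quantiles(values, n=4))
    return math.log(med), math.asinh((q3 - q1) / (2.0 * med)) / Z75


def run_simulation(replicates=200, n=500, rho=0.5, seed=2012, variable=0):
    """Average recovered (mean, SD) on the ln-scale per reporting option.

    variable=0 analyses intake (X0), variable=1 analyses status (Y0).
    """
    rng = random.Random(seed)
    recovered = {"mean_sd": [], "median_iqr": []}
    for _ in range(replicates):
        values = [pair[variable] for pair in draw_sample(n, rho, rng)]
        recovered["mean_sd"].append(recover_from_mean_sd(values))
        recovered["median_iqr"].append(recover_from_median_iqr(values))
    return {
        option: (statistics.fmean(m for m, _ in pairs),
                 statistics.fmean(s for _, s in pairs))
        for option, pairs in recovered.items()
    }
```

Calling `run_simulation()` returns, for each reporting option, the average recovered ln-scale mean and standard deviation, which can be compared with the simulated values of 1.60 and 0.85 for intake.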

Results
None of the combinations of univariate and bivariate reporting options shows evidence of bias, with the average of the simulations almost equal to the true value. The width of the confidence interval indicates the variability of the simulations. Because there is no appreciable bias, a smaller CI width indicates that the individual simulations are closer to the true correlation. The accuracy is best when rX0Y is reported and worst when rX0Y0 is reported. As expected, the accuracy is also better when the sample size is larger. Figure 3 shows that the CI is wider when the reported univariate statistics are the median and IQR or median and range. The larger variation in the results for the transformation from bYX0 (Figure 3B) compared with the variation in the results from bY0X (Figure 3C) is caused by the fact that X was simulated with a larger standard deviation than Y.

Example
To illustrate the methodology some examples of its use on real data for vitamin B12 are reported in Table 2 (observational studies [14,15]) and Table 3 (randomized controlled trials [16,17]). The tables show the statistics as reported in the studies and the statistics that are calculated using the different equations presented in this paper (which are entitled 'required statistics' in the tables).
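The final step for both study designs can also be sketched in code. The two standard errors used below are assumptions of this illustration: the usual least-squares standard error of a slope for observational studies, and the standard error of a difference between two independent group means divided by the intake contrast for trials, with the group intakes treated as fixed. The function names are hypothetical, and any numbers passed to them in a quick check are made up rather than taken from Table 2 or Table 3.

```python
import math


def slope_from_correlation(r_xy, s_x, s_y, n):
    """Observational study: slope of Y on X (both on the ln-scale) and its
    standard error from the correlation and the two standard deviations."""
    b = r_xy * s_y / s_x
    se = (s_y / s_x) * math.sqrt((1.0 - r_xy ** 2) / (n - 2))
    return b, se


def slope_from_trial(m_y_int, s_y_int, n_int,
                     m_y_con, s_y_con, n_con,
                     intake0_int, intake0_con):
    """Randomized trial: slope from the difference in ln-scale mean status
    divided by the difference in ln-transformed mean intake."""
    dx = math.log(intake0_int) - math.log(intake0_con)
    b = (m_y_int - m_y_con) / dx
    se = math.sqrt(s_y_int ** 2 / n_int + s_y_con ** 2 / n_con) / abs(dx)
    return b, se
```

Both functions return a slope on the same ln-ln scale, which is what makes estimates from the two designs comparable in a subsequent meta-analysis.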

Discussion
The investigated means, standard deviations, correlation coefficients and sample sizes were based on real-life values. The univariate statistics that are investigated in this paper were limited to mean and SD, median and IQR or range and geometric mean and 95% CI. These do not represent all reporting options that can be encountered in the literature, but cover most published papers. Other combinations of univariate statistics that were seen are for example mean with IQR, mean with range, and geometric mean with standard deviation. Also, the investigated regression and correlation coefficients are limited in this paper to those on the absolute or logarithmic scale, whereas sometimes other transformations to normality have been used in reports, such as a square root transformation. However, as the logarithmic transformation is by far the most often used transformation in papers in the medical research area, the equations in this paper will cover most published papers in this field.
The bivariate normal linear model on the logarithmic scale is an approximation that is used here because the data are positive data. Note that it allows the relationship between X 0 and Y 0 to be a linear, monotonic convex or monotonic concave function (i.e., for a slope equal, higher or lower than one, respectively). Even though some randomized controlled trials may investigate the dose-response relationship by providing multiple dosages in their study, most of these studies include only one intervention and one control group and consequently it is often unknown what the true relationship is. Therefore, this approximation provides a practical methodology to estimate the dose-response relationship and to combine the results from randomized controlled trials and observational studies. It was outside the scope of the simulation study to investigate other shapes of the dose-response relation.
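The statement about the shape of the relationship on the original scale can be made explicit with a short derivation. The symbols α and β are introduced here only for illustration and denote the intercept and slope of the linear relation on the ln-scale; the derivation concerns the systematic part of the relation (the conditional median of status given intake).

```latex
\[
\eta = \alpha + \beta\,\xi
\quad\Longrightarrow\quad
\eta_0 = e^{\alpha}\,\xi_0^{\beta},
\qquad
\frac{d^{2}\eta_0}{d\xi_0^{2}} = \beta(\beta - 1)\,e^{\alpha}\,\xi_0^{\beta - 2}.
\]
```

For a positive slope, the second derivative is zero when β = 1 (a straight line through the origin), positive when β > 1 (convex) and negative when 0 < β < 1 (concave).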
The transformations in this paper consider reported regression and correlation coefficients that are unadjusted for other variables. It is possible to adjust the equations for adjusted regression or correlation coefficients, if these adjustments were done on the log-scale. However, most often adjustment has been done on another scale, and moreover studies do not report all required statistics. Therefore, we did not consider adjusted coefficients.
In this paper we presented a methodology that allows for information from RCTs and observational studies to be summarised in comparable statistics. One possible application is to combine results of both types of study in a single meta-analysis. In general, a meta-analysis should include as much information as possible. However, there may be systematic differences between observational studies and randomized controlled trials. Therefore, it is advisable to check whether the size of the estimated regression coefficient differs between these different study designs. This may be done by stratified analysis or by using meta-regression techniques.
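A stratified check of this kind can be sketched as follows. The sketch assumes a fixed-effect, inverse-variance pooling within each study design and a simple z statistic for the difference between the two pooled slopes; the choice of meta-analytic model is not specified by the methodology, a random-effects model or a meta-regression may be more appropriate in practice, and the function names are hypothetical.

```python
import math


def pool_fixed_effect(estimates):
    """Inverse-variance pooled slope and standard error.

    estimates: list of (bYX, se(bYX)) pairs, one per study.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))


def design_difference_z(observational, trials):
    """z statistic for the difference between the pooled trial slope and
    the pooled observational slope."""
    b_obs, se_obs = pool_fixed_effect(observational)
    b_rct, se_rct = pool_fixed_effect(trials)
    return (b_rct - b_obs) / math.sqrt(se_obs ** 2 + se_rct ** 2)
```

A large absolute z value would suggest that the two designs estimate different slopes and that a single combined estimate should be interpreted with caution.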

Conclusions
The presented methodology provides calculations to use results from published literature to estimate the slope of the dose-response relation incorporating information from both randomized controlled trials and observational studies. The simulations clearly show that there is no observable bias associated with the transformations. Also, it can be seen that when a regression coefficient is reported, it is preferable to report the univariate statistics as mean and SD or geometric mean and 95% CI rather than as median with IQR or range.

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
OS participated in the design of the simulation study, performed the statistical analysis and drafted the manuscript. CD helped to draft the manuscript and participated in the design of the simulation study. PvtV participated in the coordination of the study and revised the manuscript critically. HvdV conceived of the study, helped with the statistical analysis and its interpretation, and revised the manuscript critically. All authors read and approved the final manuscript.

Authors' information
OS and CD are both postdoctoral research fellows at the Division of Human Nutrition of Wageningen University, the Netherlands. PvtV is professor of Nutrition and Epidemiology at the Division of Human Nutrition, the Netherlands. HvdV is statistician at Biometris, Wageningen University and Research centre, the Netherlands.