Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range

Background In systematic reviews and meta-analysis, researchers often pool the results of the sample mean and standard deviation from a set of similar clinical trials. A number of the trials, however, reported the study using the median, the minimum and maximum values, and/or the first and third quartiles. Hence, in order to combine results, one may have to estimate the sample mean and standard deviation for such trials. Methods In this paper, we propose to improve the existing literature in several directions. First, we show that the sample standard deviation estimation in Hozo et al.’s method (BMC Med Res Methodol 5:13, 2005) has some serious limitations and is always less satisfactory in practice. Inspired by this, we propose a new estimation method by incorporating the sample size. Second, we systematically study the sample mean and standard deviation estimation problem under several other interesting settings where the interquartile range is also available for the trials. Results We demonstrate the performance of the proposed methods through simulation studies for the three frequently encountered scenarios, respectively. For the first two scenarios, our method greatly improves existing methods and provides a nearly unbiased estimate of the true sample standard deviation for normal data and a slightly biased estimate for skewed data. For the third scenario, our method still performs very well for both normal data and skewed data. Furthermore, we compare the estimators of the sample mean and standard deviation under all three scenarios and present some suggestions on which scenario is preferred in real-world applications. Conclusions In this paper, we discuss different approximation methods in the estimation of the sample mean and standard deviation and propose some new estimation methods to improve the existing literature. 
We conclude our work with a summary table (an Excel spreadsheet including all formulas) that serves as comprehensive guidance for performing meta-analysis in different situations. Electronic supplementary material The online version of this article (doi:10.1186/1471-2288-14-135) contains supplementary material, which is available to authorized users.


Introduction
In medical research, it is common to find that several similar trials are conducted to verify the clinical effectiveness of a certain treatment. While an individual trial could fail to show a statistically significant treatment effect, systematic reviews and meta-analysis of combined results might reveal the potential benefits of treatment. For instance, Antman et al. (1992) pointed out that systematic reviews and meta-analysis of randomized control trials would have led to earlier recognition of the benefits of thrombolytic therapy for myocardial infarction and might have saved a large number of patients.
Prior to the 1990s, the traditional approach to combining results from multiple trials was to conduct narrative (unsystematic) reviews, which are mainly based on the experience and subjectivity of experts in the area (Cipriani & Geddes 2003). However, this approach suffers from many critical flaws. The major one is due to the inconsistent criteria of different reviewers. To claim a treatment effect, different reviewers may use different thresholds, which often leads to opposite conclusions from the same study. Hence, from the mid-1980s, systematic reviews and meta-analysis have become an imperative tool in medical effectiveness measurement. Systematic reviews use specific and explicit criteria to identify and assemble related studies and usually provide a quantitative (statistical) estimate of the aggregate effect over all the included studies. The methodology in systematic reviews is usually referred to as meta-analysis. With the combination of several studies and more data taken into consideration in systematic reviews, the accuracy of estimation improves and more precise interpretations of the treatment effect can be achieved via meta-analysis.
When performing meta-analysis, some summary statistics, mainly the sample mean and standard deviation, are required from the included studies. This, however, can be difficult because results from different studies are often presented in different and inconsistent forms. Specifically in medical research, instead of reporting the sample mean and standard deviation of the trials, some trial studies only report the median, the minimum and maximum values, and/or the first and third quartiles. Therefore, we need to estimate the sample mean and standard deviation from these quantities so that we can pool results in a consistent format. Hozo et al. (2005) were the first to address this estimation problem and proposed a simple method for estimating the sample mean and the sample variance (or equivalently the sample standard deviation) from the median, range, and the size of the sample. Their method is now widely accepted in the literature of systematic reviews and meta-analysis. For instance, a search of Google Scholar on 18 August 2014 showed that the article of Hozo et al. (2005) had been cited 655 times, with 360 of those citations made in 2013 and 2014.
In this paper, we will show that the estimation of the sample standard deviation in Hozo et al. (2005) has some serious limitations. In particular, their estimator does not fully incorporate the information of the sample size and so, consequently, it is always less satisfactory in practice. Inspired by this, we will propose a new estimation method that will greatly improve their method. In addition, we will systematically investigate the estimation problem under more general settings where the first and third quartiles are also available for the trials.
Throughout the paper, we define the following summary statistics: $a$ = the minimum value, $q_1$ = the first quartile, $m$ = the median, $q_3$ = the third quartile, $b$ = the maximum value, $n$ = the sample size.
The set $\{a, q_1, m, q_3, b\}$ is often referred to as the 5-number summary (Triola 2009). Note that the 5-number summary may not always be given in full. The three frequently encountered scenarios are:
$$
C_1 = \{a, m, b; n\}, \qquad C_2 = \{a, q_1, m, q_3, b; n\}, \qquad C_3 = \{q_1, m, q_3; n\}.
$$
Hozo et al. (2005) only addressed the estimation of the sample mean and variance under Scenario $C_1$, while Scenarios $C_2$ and $C_3$ are also common in systematic reviews and meta-analysis. In Sections 2-4, we study the estimation problem under these three scenarios, respectively. Simulation studies are conducted in each scenario to demonstrate the superiority of the proposed methods. We conclude the paper in Section 5 with some discussion and a summary table that provides comprehensive guidance for performing meta-analysis in different situations.

2 Estimating $\bar{X}$ and $S$ from $C_1 = \{a, m, b; n\}$
Scenario $C_1$ assumes that the median, the minimum, the maximum and the sample size are given for a clinical trial study. This is the same assumption as made in Hozo et al. (2005). To estimate the sample mean and standard deviation, we first review the estimation method in Hozo et al. (2005) and point out some limitations of their method in estimating the sample standard deviation. We then propose to improve their estimation by incorporating the information of the sample size.
Throughout the paper, we let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from the normal distribution $N(\mu, \sigma^2)$, and $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$ be the ordered statistics of $X_1, X_2, \ldots, X_n$. Also for the sake of simplicity, we assume that $n = 4Q + 1$ with $Q$ being a positive integer. Then
$$
\begin{aligned}
a = X_{(1)} \le X_{(2)} \le \cdots \le X_{(Q+1)} &= q_1 \\
\le X_{(Q+2)} \le \cdots \le X_{(2Q+1)} &= m \\
\le X_{(2Q+2)} \le \cdots \le X_{(3Q+1)} &= q_3 \\
\le X_{(3Q+2)} \le \cdots \le X_{(4Q+1)} &= X_{(n)} = b.
\end{aligned}
\tag{1}
$$
In this section, we are interested in estimating the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and the sample standard deviation $S = \left[\sum_{i=1}^{n} (X_i - \bar{X})^2/(n - 1)\right]^{1/2}$, given that $a$, $m$, $b$, and $n$ of the data are known.

Hozo et al. (2005)'s method
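For ease of notation, let $M = 2Q + 1$, so that $M = (n + 1)/2$. To estimate the mean value, Hozo et al. (2005) applied the following inequalities:
$$
\begin{aligned}
a \le X_{(1)} &\le a, \\
a \le X_{(i)} &\le m, \quad i = 2, \ldots, M - 1, \\
m \le X_{(M)} &\le m, \\
m \le X_{(i)} &\le b, \quad i = M + 1, \ldots, n - 1, \\
b \le X_{(n)} &\le b.
\end{aligned}
$$
Adding up all the above inequalities and dividing by $n$, we have $LB_1 \le \bar{X} \le UB_1$, where the lower and upper bounds are
$$
LB_1 = \frac{a + m + b}{n} + \frac{(M - 2)(a + m)}{n}, \qquad
UB_1 = \frac{a + m + b}{n} + \frac{(M - 2)(m + b)}{n}.
$$
Hozo et al. (2005) then estimated the sample mean by
$$
\bar{X} \approx \frac{LB_1 + UB_1}{2} = \frac{a + 2m + b}{4} + \frac{a - 2m + b}{4n}. \tag{2}
$$
Note that the second term in (2) is negligible when the sample size is large. A simplified mean estimation is given as
$$
\bar{X} \approx \frac{a + 2m + b}{4}. \tag{3}
$$
For estimating the sample standard deviation, by assuming that the data are non-negative, Hozo et al. (2005) applied the following inequalities:
$$
\begin{aligned}
a X_{(1)} \le X_{(1)}^2 &\le a X_{(1)}, \\
a X_{(i)} \le X_{(i)}^2 &\le m X_{(i)}, \quad i = 2, \ldots, M - 1, \\
m X_{(M)} \le X_{(M)}^2 &\le m X_{(M)}, \\
m X_{(i)} \le X_{(i)}^2 &\le b X_{(i)}, \quad i = M + 1, \ldots, n - 1, \\
b X_{(n)} \le X_{(n)}^2 &\le b X_{(n)}.
\end{aligned}
\tag{4}
$$
With some simple algebra and approximations on formula (4), we have $LSB_1 \le \sum_{i=1}^{n} X_i^2 \le USB_1$, where the lower and upper bounds are
$$
LSB_1 = a^2 + m^2 + b^2 + (M - 2)\,\frac{a^2 + am + m^2 + mb}{2}, \qquad
USB_1 = a^2 + m^2 + b^2 + (M - 2)\,\frac{am + m^2 + mb + b^2}{2}.
$$
Then approximately,
$$
\sum_{i=1}^{n} X_i^2 \approx a^2 + m^2 + b^2 + (M - 2)\,\frac{a^2 + 2am + 2m^2 + 2mb + b^2}{4}. \tag{5}
$$
By (3) and (5), the sample standard deviation is estimated by $S = \sqrt{S^2}$, where
$$
S^2 = \frac{1}{n - 1}\left(a^2 + m^2 + b^2 + \frac{n - 3}{2}\cdot\frac{(a + m)^2 + (m + b)^2}{4} - \frac{n(a + 2m + b)^2}{16}\right).
$$
When $n$ is large, it results in the following well-known range rule of thumb:
$$
S \approx \frac{b - a}{4}. \tag{6}
$$
Note that the range rule of thumb (6) is independent of the sample size. It may not work well in practice, especially when $n$ is extremely small or large. To overcome this problem, Hozo et al. (2005) proposed the following improved range rule of thumb with respect to the different size of the sample:
$$
S \approx
\begin{cases}
\dfrac{1}{\sqrt{12}}\left[(b - a)^2 + \dfrac{(a - 2m + b)^2}{4}\right]^{1/2}, & n \le 15, \\[2ex]
\dfrac{b - a}{4}, & 15 < n \le 70, \\[2ex]
\dfrac{b - a}{6}, & n > 70,
\end{cases}
\tag{7}
$$
where the formula for $n \le 15$ is derived under the equidistantly spaced data assumption, and the formula for $n > 70$ is suggested by Chebyshev's inequality (Hogg & Craig 1995). Note also that when the data are symmetric, we have $a + b \approx 2m$ and so the first case of (7) reduces to $S \approx (b - a)/\sqrt{12}$ for $n \le 15$. Hozo et al. (2005) showed that the adaptive formula (7) performs better than the original formula (6) in most settings.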
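The estimators reviewed above are simple enough to evaluate by hand, but a short sketch makes the role of the thresholds 15 and 70 explicit. The Python below is an illustration only, assuming the mean estimate (2)/(3) and the three-case rule (7) take the forms $(a + 2m + b)/4 + (a - 2m + b)/(4n)$ and $\{[(b-a)^2 + (a-2m+b)^2/4]^{1/2}/\sqrt{12},\ (b-a)/4,\ (b-a)/6\}$; the function names `hozo_mean` and `hozo_sd` are hypothetical.

```python
import math


def hozo_mean(a, m, b, n=None):
    """Mean estimate from min a, median m, max b.

    With n given this is the two-term form (2); without n it is the
    simplified large-sample form (3).
    """
    base = (a + 2 * m + b) / 4
    if n is None:
        return base
    return base + (a - 2 * m + b) / (4 * n)


def hozo_sd(a, m, b, n):
    """Adaptive range rule of thumb (7), switching on the sample size."""
    if n <= 15:
        # Small samples: formula derived under equidistantly spaced data.
        return math.sqrt(((b - a) ** 2 + (a - 2 * m + b) ** 2 / 4) / 12)
    if n <= 70:
        # Moderate samples: classical range / 4 rule.
        return (b - a) / 4
    # Large samples: range / 6.
    return (b - a) / 6
```

For example, `hozo_sd(2, 5, 10, 30)` uses the middle case and returns the range divided by 4. The estimate jumps discontinuously as $n$ crosses 15 or 70, which is the behaviour the next section sets out to smooth.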

Improved estimation of S
We think, however, that the adaptive formula (7) may still be less accurate for practical use. First, the threshold values 15 and 70 are suggested somewhat arbitrarily. Second, given the normal data $N(\mu, \sigma^2)$ with $\sigma > 0$ being a finite value, the formula for $n > 70$ implies that $\sigma \approx (b - a)/6 \to \infty$ as $n \to \infty$, because the range $b - a$ of normal data grows without bound. This contradicts the assumption that $\sigma$ is a finite value.
Third, the non-negative data assumption in Hozo et al. (2005) is also quite restrictive.
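In this section, we propose a new estimator to further improve (7) and, in addition, we remove the non-negative assumption on the data. Let $Z_1, \ldots, Z_n$ be independent and identically distributed (i.i.d.) random variables from the standard normal distribution $N(0, 1)$, and $Z_{(1)} \le \cdots \le Z_{(n)}$ be the ordered statistics of $Z_1, \ldots, Z_n$. Then $X_i = \mu + \sigma Z_i$ and $X_{(i)} = \mu + \sigma Z_{(i)}$ for $i = 1, \ldots, n$. In particular, we have $a = \mu + \sigma Z_{(1)}$ and $b = \mu + \sigma Z_{(n)}$. Since $E(Z_{(1)}) = -E(Z_{(n)})$, we have $E(b - a) = 2\sigma E(Z_{(n)})$. Hence, by letting $\xi(n) = 2E(Z_{(n)})$, we choose the following estimation for the sample standard deviation:
$$
S \approx \frac{b - a}{\xi(n)}. \tag{8}
$$
Next, we present a method to approximate $\xi(n)$ and establish an adaptive rule of thumb for standard deviation estimation. By David & Nagaraja (2003), the expected value of $Z_{(n)}$ is
$$
E(Z_{(n)}) = n \int_{-\infty}^{\infty} z\,[\Phi(z)]^{n-1}\phi(z)\,dz,
$$
where $\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$ is the probability density function and $\Phi(z) = \int_{-\infty}^{z} \phi(t)\,dt$ is the cumulative distribution function of the standard normal distribution. For ease of reference, we have computed the values of $\xi(n)$ by numerical integration in Table 1 for $n$ up to 100. From Table 1, it is evident that the adaptive formula (7) by Hozo et al. (2005) is less accurate and also less flexible.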
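The numerical integration behind $\xi(n) = 2E(Z_{(n)})$ can be sketched with the Python standard library alone. The block below is a rough illustration rather than the computation used for Table 1: it applies composite Simpson's rule to the standard expression $E(Z_{(n)}) = n\int z\,[\Phi(z)]^{n-1}\phi(z)\,dz$ for the expected maximum of $n$ standard normal variables. The truncation to $[-10, 10]$, the number of steps, and the names `expected_max` and `xi` are all assumptions made for this example.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def expected_max(n, lo=-10.0, hi=10.0, steps=4000):
    """Approximate E(Z_(n)) by composite Simpson's rule on [lo, hi].

    The integrand is the density of the maximum of n i.i.d. N(0, 1)
    variables, n * Phi(z)^(n-1) * phi(z), multiplied by z.
    """
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    h = (hi - lo) / steps

    def integrand(z):
        return n * z * _STD_NORMAL.cdf(z) ** (n - 1) * _STD_NORMAL.pdf(z)

    total = integrand(lo) + integrand(hi)
    for k in range(1, steps):
        weight = 4 if k % 2 else 2
        total += weight * integrand(lo + k * h)
    return total * h / 3


def xi(n):
    """xi(n) = 2 * E(Z_(n)), the divisor applied to the range b - a."""
    return 2 * expected_max(n)
```

A quick sanity check is the closed form for two observations, $E(Z_{(2)}) = 1/\sqrt{\pi}$, so `xi(2)` should agree with $2/\sqrt{\pi}$ to several decimal places.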
When $n$ is large (say $n > 50$), we may also apply Blom (1958)'s method to approximate $E(Z_{(n)})$. Specifically, Blom (1958) suggested the following approximation for the expected values of the order statistics:
$$
E(Z_{(r)}) \approx \Phi^{-1}\left(\frac{r - \alpha}{n - 2\alpha + 1}\right), \quad r = 1, \ldots, n, \tag{9}
$$
where $\Phi^{-1}(z)$ is the inverse function of $\Phi(z)$, or equivalently, the $z$th quantile of the standard normal distribution. Blom observed that the value of $\alpha$ increases as $n$ increases, with the lowest value being 0.330 for $n = 2$. Overall, Blom suggested $\alpha = 0.375$ as a compromise value for practical use. Further discussion on the choice of $\alpha$ can be seen, for example, in Harter (1961) and Cramér (1999). Finally, by (8) and (9) with $r = n$ and $\alpha = 0.375$, we estimate the sample standard deviation by
$$
S \approx \frac{b - a}{2\Phi^{-1}\left(\dfrac{n - 0.375}{n + 0.25}\right)}. \tag{10}
$$
In the statistical software R, the quantile $\Phi^{-1}(z)$ can be computed by the command "qnorm(z)".
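As an illustration of the Blom-based estimator, the sketch below evaluates $(b - a)/\bigl(2\Phi^{-1}((n - 0.375)/(n + 0.25))\bigr)$ in Python, where `NormalDist().inv_cdf` from the standard library plays the role of R's `qnorm`. The function names are hypothetical and the code is a minimal sketch, not a reference implementation.

```python
from statistics import NormalDist


def blom_expected_order_stat(r, n, alpha=0.375):
    """Blom's approximation to E(Z_(r)) for a standard normal sample of size n."""
    return NormalDist().inv_cdf((r - alpha) / (n - 2 * alpha + 1))


def sd_from_range(a, b, n):
    """Estimate S from the range b - a and sample size n, as in (10)."""
    xi_approx = 2 * blom_expected_order_stat(n, n)  # xi(n) ~ 2 * E(Z_(n))
    return (b - a) / xi_approx
```

Unlike a fixed divisor of 4 or 6, the divisor here grows smoothly with $n$, so the same observed range yields a smaller estimate of $S$ for a larger sample.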

Simulations studies
In this section, we conduct simulation studies to compare the performance of Hozo et al.'s method and our new method for estimating the sample standard deviation. From Figure 2, with the skewed data, our proposed method (10) gives a slightly biased estimate, with relative errors of about 5% of the true sample standard deviation.
Nevertheless, it is still obvious that the new method is much better than Hozo et al.'s method. We also note that, for the beta and Weibull distributions, the best cutoff values of $n$ for switching between $(b - a)/4$ and $(b - a)/6$ should be larger than 70. This again coincides with Table 1 in Hozo et al. (2005), where the suggested cutoff value is $n = 100$ for Beta and $n = 110$ for Weibull.

3 Estimating $\bar{X}$ and $S$ from $C_2 = \{a, q_1, m, q_3, b; n\}$
Scenario $C_2$ assumes that the first quartile, $q_1$, and the third quartile, $q_3$, are also available in addition to $C_1$. In this setting, Bland (2013) extended Hozo et al.'s results by incorporating the additional information of the interquartile range (IQR). He further claimed that the new estimators for the sample mean and standard deviation are superior to those in Hozo et al. (2005). In this section, we first review the estimation method in Bland (2013) and point out some limitations of this method. We then, accordingly, propose to improve this method by incorporating the size of a sample.

Bland (2013)'s method
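Noting that $n = 4Q + 1$, we have $Q = (n - 1)/4$. To estimate the sample mean, Bland (2013) considered the following inequalities:
$$
\begin{aligned}
a \le X_{(1)} &\le a, \\
a \le X_{(i)} &\le q_1, \quad i = 2, \ldots, Q, \\
q_1 \le X_{(Q+1)} &\le q_1, \\
q_1 \le X_{(i)} &\le m, \quad i = Q + 2, \ldots, 2Q, \\
m \le X_{(2Q+1)} &\le m, \\
m \le X_{(i)} &\le q_3, \quad i = 2Q + 2, \ldots, 3Q, \\
q_3 \le X_{(3Q+1)} &\le q_3, \\
q_3 \le X_{(i)} &\le b, \quad i = 3Q + 2, \ldots, n - 1, \\
b \le X_{(n)} &\le b.
\end{aligned}
$$
Adding up all the above inequalities and dividing by $n$, we have $LB_2 \le \bar{X} \le UB_2$, where the lower and upper bounds are
$$
LB_2 = \frac{a + q_1 + m + q_3 + b}{n} + \frac{(Q - 1)(a + q_1 + m + q_3)}{n}, \qquad
UB_2 = \frac{a + q_1 + m + q_3 + b}{n} + \frac{(Q - 1)(q_1 + m + q_3 + b)}{n}.
$$
We then estimate the sample mean by
$$
\bar{X} \approx \frac{LB_2 + UB_2}{2} = \frac{a + 2q_1 + 2m + 2q_3 + b}{8} + \frac{3(a + b) - 2(q_1 + m + q_3)}{8n}. \tag{12}
$$
When the sample size is large, the second term in (12) is negligible and thus a simplified mean estimation is
$$
\bar{X} \approx \frac{a + 2q_1 + 2m + 2q_3 + b}{8}. \tag{13}
$$
For the sample standard deviation, Bland considered the following inequalities:
$$
\begin{aligned}
a X_{(1)} \le X_{(1)}^2 &\le a X_{(1)}, \\
a X_{(i)} \le X_{(i)}^2 &\le q_1 X_{(i)}, \quad i = 2, \ldots, Q, \\
q_1 X_{(Q+1)} \le X_{(Q+1)}^2 &\le q_1 X_{(Q+1)}, \\
q_1 X_{(i)} \le X_{(i)}^2 &\le m X_{(i)}, \quad i = Q + 2, \ldots, 2Q, \\
m X_{(2Q+1)} \le X_{(2Q+1)}^2 &\le m X_{(2Q+1)}, \\
m X_{(i)} \le X_{(i)}^2 &\le q_3 X_{(i)}, \quad i = 2Q + 2, \ldots, 3Q, \\
q_3 X_{(3Q+1)} \le X_{(3Q+1)}^2 &\le q_3 X_{(3Q+1)}, \\
q_3 X_{(i)} \le X_{(i)}^2 &\le b X_{(i)}, \quad i = 3Q + 2, \ldots, n - 1, \\
b X_{(n)} \le X_{(n)}^2 &\le b X_{(n)}.
\end{aligned}
\tag{14}
$$
With some simple algebra and approximation on formula (14), we have $LSB_2 \le \sum_{i=1}^{n} X_i^2 \le USB_2$, where the lower and upper bounds are
$$
\begin{aligned}
LSB_2 &= \frac{1}{8}\left[(n + 3)(a^2 + q_1^2 + m^2 + q_3^2) + 8b^2 + (n - 5)(a q_1 + q_1 m + m q_3 + q_3 b)\right], \\
USB_2 &= \frac{1}{8}\left[8a^2 + (n + 3)(q_1^2 + m^2 + q_3^2 + b^2) + (n - 5)(a q_1 + q_1 m + m q_3 + q_3 b)\right].
\end{aligned}
$$
Taking the average of $LSB_2$ and $USB_2$, $\sum_{i=1}^{n} X_i^2$ can be estimated by
$$
\sum_{i=1}^{n} X_i^2 \approx \frac{1}{16}\left[(n + 11)(a^2 + b^2) + 2(n + 3)(q_1^2 + m^2 + q_3^2) + 2(n - 5)(a q_1 + q_1 m + m q_3 + q_3 b)\right].
$$
Further, by the formula $S^2 = \left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)/(n - 1)$, we have approximately
$$
S^2 \approx \frac{a^2 + 2q_1^2 + 2m^2 + 2q_3^2 + b^2}{16} + \frac{a q_1 + q_1 m + m q_3 + q_3 b}{8} - \frac{(a + 2q_1 + 2m + 2q_3 + b)^2}{64}. \tag{15}
$$
Finally, Bland (2013) took the square root $\sqrt{S^2}$ to estimate the sample standard deviation.
Note that the estimator (15) is independent of the sample size $n$. Hence, it may not be sufficient for general use, especially when $n$ is small or large. In the next section, we propose an improved estimation for the sample standard deviation by incorporating the additional information of the sample size.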
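To make the sample-size independence of (15) concrete, the following Python sketch evaluates the simplified mean estimate (13) and the variance estimate (15), assuming they take the forms $(a + 2q_1 + 2m + 2q_3 + b)/8$ and $(a^2 + 2q_1^2 + 2m^2 + 2q_3^2 + b^2)/16 + (a q_1 + q_1 m + m q_3 + q_3 b)/8 - (a + 2q_1 + 2m + 2q_3 + b)^2/64$. The function names are hypothetical; note that neither function takes $n$ as an argument.

```python
import math


def bland_mean(a, q1, m, q3, b):
    """Simplified mean estimate (13) from the 5-number summary."""
    return (a + 2 * q1 + 2 * m + 2 * q3 + b) / 8


def bland_sd(a, q1, m, q3, b):
    """Standard deviation estimate based on (15); the sample size is unused."""
    squares = (a * a + 2 * q1 * q1 + 2 * m * m + 2 * q3 * q3 + b * b) / 16
    cross = (a * q1 + q1 * m + m * q3 + q3 * b) / 8
    mean_sq = (a + 2 * q1 + 2 * m + 2 * q3 + b) ** 2 / 64
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(squares + cross - mean_sq, 0.0))
```

Algebraically, this variance equals the variance of four equally weighted points placed at the midpoints of $[a, q_1]$, $[q_1, m]$, $[m, q_3]$ and $[q_3, b]$, which explains why it cannot adapt to how far the extremes spread as $n$ grows.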

Improved estimation of S
Recall that the range $b - a$ was used to estimate the sample standard deviation in Scenario $C_1$. Now for Scenario $C_2$, since the IQR $q_3 - q_1$ is also known, another approach is to estimate the sample standard deviation by $(q_3 - q_1)/\eta(n)$, where $\eta(n)$ is a function of $n$.
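As a minimal sketch of how the range and the IQR might be combined under Scenario $C_2$, assume the estimate is the simple average of the range-based estimate $(b - a)/\xi(n)$ and the IQR-based estimate $(q_3 - q_1)/\eta(n)$, with $\xi(n) \approx 2\Phi^{-1}((n - 0.375)/(n + 0.25))$ and $\eta(n) \approx 2\Phi^{-1}((0.75n - 0.125)/(n + 0.25))$. This averaged form is an assumption, chosen so that the two denominators match the quantities $4\Phi^{-1}(\cdot)$ quoted in the Conclusion; the function names are hypothetical.

```python
from statistics import NormalDist

_PHI_INV = NormalDist().inv_cdf  # standard normal quantile function


def xi_blom(n):
    """Approximate xi(n) = 2 E(Z_(n)) with Blom's alpha = 0.375."""
    return 2 * _PHI_INV((n - 0.375) / (n + 0.25))


def eta_blom(n):
    """Approximate eta(n) = 2 E(Z_(3Q+1)) for n = 4Q + 1, same alpha."""
    return 2 * _PHI_INV((0.75 * n - 0.125) / (n + 0.25))


def sd_from_range_and_iqr(a, q1, q3, b, n):
    """Average of the range-based and IQR-based estimates (assumed form)."""
    range_part = (b - a) / xi_blom(n)
    iqr_part = (q3 - q1) / eta_blom(n)
    return (range_part + iqr_part) / 2
```

Equal weighting of the two parts is the simplest choice; other weightings are conceivable but are not considered here.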

Simulations studies
In this section, we evaluate the performance of the proposed method (17) and compare it to Bland's method (15). Following Bland's settings, we consider (i) the normal distribution with mean $\mu = 5$ and standard deviation $\sigma = 1$, and (ii) the log-normal distribution with location parameter $\mu = 5$ and scale parameter $\sigma = 0.25$, $0.5$, and $1$, respectively. For simplicity, we consider the sample size being $n = 4Q + 1$, where $Q$ takes values from 1 to 50. As in Section 2.3, we assess the accuracy of the two estimates by the relative error defined in (11).
In each simulation, we draw a total of $n$ observations randomly from the given distribution and compute the true sample standard deviation of the sample. We then use only the minimum value, the first quartile, the median, the third quartile, and the maximum value to estimate the sample standard deviation by formulas (15) and (17), respectively. With 1000 simulations, we report the average relative errors in Figure 3 for the four specified distributions. From Figure 3, we observe that the new method provides a nearly unbiased estimate of the true sample standard deviation. Even for the very highly skewed log-normal data with $\sigma = 1$, the relative error of the new method is less than 10% for most sample sizes. In contrast, Bland's method is less satisfactory.
As reported in Bland (2013), the method (15) only works for a small range of sample sizes (in our simulations, roughly from 20 to 40). When the sample size gets larger or the distribution is highly skewed, the sample standard deviation will be highly overestimated. Additionally, we note that the sample standard deviation will be seriously underestimated if $n$ is very small. Overall, it is evident that the new method is better than Bland's method in most settings.

4 Estimating $\bar{X}$ and $S$ from $C_3 = \{q_1, m, q_3; n\}$
Scenario $C_3$ is an alternative way to report the study other than Scenarios $C_1$ and $C_2$. It reports the first and third quartiles instead of the minimum and maximum values. One main reason to report $C_3$ is that the IQR is usually less sensitive to outliers than the range. For the new scenario, we note that the methods in Hozo et al. (2005) and Bland (2013) will no longer be applicable. In particular, if their ideas are followed, we have the following inequalities:
$$
\begin{aligned}
-\infty \le X_{(i)} &\le q_1, \quad i = 1, \ldots, Q, \\
q_1 \le X_{(Q+1)} &\le q_1, \\
q_1 \le X_{(i)} &\le m, \quad i = Q + 2, \ldots, 2Q, \\
m \le X_{(2Q+1)} &\le m, \\
m \le X_{(i)} &\le q_3, \quad i = 2Q + 2, \ldots, 3Q, \\
q_3 \le X_{(3Q+1)} &\le q_3, \\
q_3 \le X_{(i)} &\le \infty, \quad i = 3Q + 2, \ldots, n,
\end{aligned}
$$
where the first $Q$ inequalities are unbounded for the lower limit, and the last $Q$ inequalities are unbounded for the upper limit. Now adding up all the above inequalities and dividing by $n$, we have $-\infty \le \bar{X} \le \infty$. This shows that the approaches based on the inequalities do not apply to Scenario $C_3$.
In contrast, the following procedure is commonly adopted in the recent literature including Liu et al. (2007) and Zhu et al. (2014): "If the study provided medians and IQR, we imputed the means and standard deviations as described by Hozo et al. (2005). We calculated the lower and upper ends of the range by multiplying the difference between the median and upper and lower ends of the IQR by 2 and adding or subtracting the product from the median, respectively." This procedure, however, performs very poorly in our simulations (not shown).

4.1 A quantile method for estimating $\bar{X}$ and $S$
In this section, we propose a quantile method for estimating the sample mean and the sample standard deviation, respectively. In detail, we first revisit the estimation method in Scenario $C_2$. By (13), we have
$$
\bar{X} \approx \frac{a + b}{8} + \frac{q_1 + m + q_3}{4}.
$$
Now for Scenario $C_3$, $a$ and $b$ are not given. Hence, a reasonable solution is to remove $a$ and $b$ from the estimation and keep the second term. By doing so, we have the estimation form $\bar{X} \approx (q_1 + m + q_3)/C$, where $C$ is a constant. Finally, noting that $E(q_1 + m + q_3) = 3\mu + \sigma E(Z_{(Q+1)} + Z_{(2Q+1)} + Z_{(3Q+1)}) = 3\mu$, we let $C = 3$ and define the estimator of the sample mean as follows:
$$
\bar{X} \approx \frac{q_1 + m + q_3}{3}. \tag{18}
$$
For the sample standard deviation, following the idea in constructing (16), we propose the following estimation:
$$
S \approx \frac{q_3 - q_1}{\eta(n)}, \tag{19}
$$
where $\eta(n) = 2E(Z_{(3Q+1)})$. As mentioned above, $E(q_3 - q_1) = 2\sigma E(Z_{(3Q+1)}) = \sigma\eta(n)$; therefore, the estimator (19) provides a good estimate for the sample standard deviation.
The numerical values of $\eta(n)$ are given in Table 2 for $n \le 50$. When $n$ is large, by the approximation $E(Z_{(3Q+1)}) \approx \Phi^{-1}((0.75n - 0.125)/(n + 0.25))$, we can also estimate the sample standard deviation by
$$
S \approx \frac{q_3 - q_1}{2\Phi^{-1}\left(\dfrac{0.75n - 0.125}{n + 0.25}\right)}. \tag{20}
$$
To the best of our knowledge, our work is the first to systematically address the sample mean and standard deviation estimation under Scenario $C_3$, or more generally, under the scenarios where the minimum and maximum values are not reported in the clinical trial study.
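The quantile method for Scenario $C_3$ reduces to two one-line formulas, sketched below in Python under the large-sample approximation $E(Z_{(3Q+1)}) \approx \Phi^{-1}((0.75n - 0.125)/(n + 0.25))$ stated above. The function names are hypothetical, and for small $n$ the tabulated $\eta(n)$ would be the more accurate divisor.

```python
from statistics import NormalDist


def mean_from_quartiles(q1, m, q3):
    """Mean estimate (q1 + m + q3) / 3 for Scenario C3."""
    return (q1 + m + q3) / 3


def sd_from_iqr(q1, q3, n):
    """SD estimate (q3 - q1) / eta(n), with eta(n) approximated via Blom."""
    eta = 2 * NormalDist().inv_cdf((0.75 * n - 0.125) / (n + 0.25))
    return (q3 - q1) / eta
```

As $n$ grows, the divisor approaches $2\Phi^{-1}(0.75) \approx 1.349$, the familiar factor relating the IQR of a normal distribution to its standard deviation.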

Simulations studies
Since our method is the first to estimate the sample mean and standard deviation under Scenario $C_3$, there is no existing method to compare against in simulation studies. Instead, we conduct a comparison study that not only assesses the accuracy of the proposed method under Scenario $C_3$, but also addresses a more realistic question in meta-analysis: "For a clinical trial study, which summary statistics should be preferred to report, $C_1$, $C_2$ or $C_3$, and why?" For the sample mean estimation, we consider the formulas (3), (13), and (18) under the three different scenarios, respectively. The accuracy of the mean estimation is also assessed by the relative error, which is defined in the same way as that for the sample standard deviation estimation. Similarly, for the sample standard deviation estimation, we consider the formulas (10), (17), and (19) under the three different scenarios, respectively.
The distributions we considered are the same as in Section 2.3, i.e., the normal, log-normal, beta, exponential and Weibull distributions with the same parameters as those in the previous two simulation studies.
In each simulation, we first draw a random sample of size $n$ from each distribution. The true sample mean and the true sample standard deviation are computed using the whole sample. The summary statistics are also computed and are categorized into Scenarios $C_1$, $C_2$ and $C_3$. We then use the aforementioned formulas to estimate the sample mean and standard deviation, respectively. The sample sizes are $n = 4Q + 1$, where $Q$ takes values from 1 to 50. With 1000 simulations, we report the average relative errors in Figure 4 for both $\bar{X}$ and $S$ with the normal distribution, in Figure 5 for the sample mean estimation with the non-normal distributions, and in Figure 6 for the sample standard deviation estimation with the non-normal distributions.
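The simulation procedure can be sketched for normal data in a few lines of Python. This is an illustration under stated assumptions, not the code behind Figures 4-6: the relative error is taken to be (average estimate minus average true value) divided by the average true value; the normal parameters $\mu = 5$, $\sigma = 1$ are borrowed from the Scenario $C_2$ simulation; the $C_2$ estimate is assumed to average the range-based and IQR-based estimates; Blom's approximation is used for all divisors; and all function names are hypothetical.

```python
import random
from statistics import NormalDist, stdev

_PHI_INV = NormalDist().inv_cdf  # standard normal quantile function


def five_number_summary(sample):
    """Return (a, q1, m, q3, b) for a sample of size n = 4Q + 1."""
    xs = sorted(sample)
    q = (len(xs) - 1) // 4
    return xs[0], xs[q], xs[2 * q], xs[3 * q], xs[-1]


def sd_estimates(a, q1, m, q3, b, n):
    """SD estimates under C1 (range), C2 (average) and C3 (IQR)."""
    s_range = (b - a) / (2 * _PHI_INV((n - 0.375) / (n + 0.25)))
    s_iqr = (q3 - q1) / (2 * _PHI_INV((0.75 * n - 0.125) / (n + 0.25)))
    return s_range, (s_range + s_iqr) / 2, s_iqr


def relative_errors(n, reps=1000, mu=5.0, sigma=1.0, seed=0):
    """Average relative error of the three SD estimates for normal data."""
    rng = random.Random(seed)
    true_total = 0.0
    est_totals = [0.0, 0.0, 0.0]
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        true_total += stdev(sample)  # true S, with the n - 1 denominator
        summary = five_number_summary(sample)
        for k, est in enumerate(sd_estimates(*summary, n)):
            est_totals[k] += est
    return [(total - true_total) / true_total for total in est_totals]
```

Calling `relative_errors(41)` returns three numbers, one per scenario; replacing `rng.gauss` with another generator would give a rough analogue for skewed data.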
For normal data, which meta-analysis would commonly assume, all three methods provide a nearly unbiased estimate of the true sample mean. The relative errors in the sample standard deviation estimation are also very small in most settings (within 1% in general). Among the three methods, however, we recommend estimating $\bar{X}$ and $S$ using the summary statistics in Scenario $C_3$. One main reason is that the first and third quartiles are usually less sensitive to outliers than the minimum and maximum values. Consequently, $C_3$ produces a more stable estimation than $C_1$, and also than $C_2$, which is partially affected by the minimum and maximum values.
For non-normal data, from Figure 5, we note that the mean estimation from $C_2$ is always better than that from $C_1$. That is, if the additional information in the first and third quartiles is available, we should always use such information. On the other hand, the estimation from $C_2$ may not always be better than that from $C_3$. Therefore, if the additional information contains extreme values, we need to be cautious as they may not be fully reliable and may even lead to worse estimation. It is also noteworthy that (i) the mean estimation from $C_3$ is not sensitive to the sample size, and (ii) $C_1$ and $C_3$ always lead to opposite estimations (one underestimates and the other overestimates the true value).
From Figure 6, we observe that (i) the standard deviation estimation from $C_3$ is quite sensitive to the skewness of the data, (ii) $C_1$ and $C_3$ would also lead to opposite estimations except for very small sample sizes, and (iii) $C_2$ turns out to be a good compromise for estimating the sample standard deviation. Taking both into account, we recommend reporting Scenario $C_2$ in clinical trial studies. However, if we do not have all the information in the 5-number summary and have to make a decision between $C_1$ and $C_3$, we recommend $C_1$ for small sample sizes (say $n \le 30$), and $C_3$ for large sample sizes.

Conclusion and Discussion
Researchers often use the sample mean and standard deviation to perform meta-analysis from clinical trials. However, sometimes, the reported results may only include the sample size, median, range and/or IQR. To combine these results in meta-analysis, we need to estimate the sample mean and standard deviation from them. In this paper, we discuss different approximation methods in the estimation of the sample mean and standard deviation and propose some new estimation methods to improve the existing literature.
Through simulation studies, we demonstrate that the proposed methods greatly improve the existing methods and enrich the literature. Here we summarize all discussed and proposed estimators under different scenarios in Table 3.
Specifically, we point out that the widely accepted estimator of standard deviation proposed by Hozo et al. (2005) has some serious limitations and is always less satisfactory in practice because the estimator does not fully incorporate the sample size. As we explained in Section 2, using $(b - a)/6$ for $n > 70$ in Hozo et al.'s adaptive estimation is untenable because the range $b - a$ tends to infinity as $n$ approaches infinity if the distribution is not bounded, such as the normal and log-normal distributions. Our estimator replaces the adaptively selected thresholds ($\sqrt{12}$, 4, 6) with a unified quantity $2\Phi^{-1}((n - 0.375)/(n + 0.25))$, which can be quickly computed and is more stable and adaptive. In addition, our method removes the non-negative data assumption in Hozo et al. (2005) and so is more applicable in practice.
Bland (2013) extended Hozo et al.'s method by using the additional information in the IQR. Since extra information is included, it is expected that Bland's estimators are superior to those in Hozo et al. (2005). However, the sample size is still not considered in Bland (2013) for the sample standard deviation, which again limits its capability in real-world cases. Our simulation studies show that Bland's estimator significantly overestimates the sample standard deviation when the sample size is large while seriously underestimating it when the sample size is small. Again, we incorporate the information of the sample size in the estimation of standard deviation via two unified quantities, $4\Phi^{-1}((n - 0.375)/(n + 0.25))$ and $4\Phi^{-1}((0.75n - 0.125)/(n + 0.25))$. With some extra but trivial computing costs, our method makes a significant improvement over Bland's method when the IQR is available.
Moreover, we pay special attention to an overlooked scenario where the minimum and maximum values are not available. We show that the methodology following the ideas in Hozo et al. (2005) and Bland (2013) will lead to unbounded estimators and is not feasible.
In contrast, we follow the ideas of our proposed methods in the other two scenarios and again construct a simple but still valid estimator. After that, we take a step forward to compare the estimators of the sample mean and standard deviation under all three scenarios.
Finally, we mention that the proposed methods are established under the assumption that the data are normally distributed. In contrast, the methods in Hozo et al. (2005) and Bland (2013) make no assumption on the distribution of the underlying data. Nevertheless, via simulation studies we have demonstrated that our methods still outperform the existing methods for non-normal data including the log-normal, beta, exponential, and Weibull distributions. Of course, if the distribution of the underlying data is known, it may be possible to further improve the methods proposed in this paper. In addition, for simplicity we have only considered the three most commonly used scenarios, $C_1$, $C_2$ and $C_3$. Our method, however, can be readily generalized to other scenarios, e.g., when only $\{a, q_1, q_3, b; n\}$ are known or when additional quantile information is given.

Note that $\xi(n)$ plays an important role in the sample standard deviation estimation. If we let $\xi(n) \equiv 4$, then (8) reduces to the original rule of thumb in (6). If we let $\xi(n) = \sqrt{12}$ for $n \le 15$, 4 for $15 < n \le 70$, or 6 for $n > 70$, then (8) reduces to the improved rule of thumb in (7).
In Figure 1, the method $(b - a)/4$ crosses the x-axis between $n = 20$ and $n = 30$, and the method $(b - a)/6$ crosses the x-axis between $n = 400$ and $n = 500$.

Figure 1: Relative errors of the sample standard deviation estimation for normal data, where the red lines with solid circles represent Hozo et al.'s method, and the green lines with empty circles represent the new method.