 Research article
 Open Access
 Open Peer Review
 Published:
Likelihoodbased randomeffects metaanalysis with few studies: empirical and simulation studies
BMC Medical Research Methodologyvolume 19, Article number: 16 (2019)
Abstract
Background
Standard randomeffects metaanalysis methods perform poorly when applied to few studies only. Such settings however are commonly encountered in practice. It is unclear, whether or to what extent smallsamplesize behaviour can be improved by more sophisticated modeling.
Methods
We consider likelihoodbased methods, the DerSimonianLaird approach, Empirical Bayes, several adjustment methods and a fully Bayesian approach. Confidence intervals are based on a normal approximation, or on adjustments based on the Studenttdistribution. In addition, a linear mixed model and two generalized linear mixed models (GLMMs) assuming binomial or Poisson distributed numbers of events per study arm are considered for pairwise binary metaanalyses. We extract an empirical data set of 40 metaanalyses from recent reviews published by the German Institute for Quality and Efficiency in Health Care (IQWiG). Methods are then compared empirically as well as in a simulation study, based on few studies, imbalanced study sizes, and considering oddsratio (OR) and risk ratio (RR) effect sizes. Coverage probabilities and interval widths for the combined effect estimate are evaluated to compare the different approaches.
Results
Empirically, a majority of the identified metaanalyses include only 2 studies. Variation of methods or effect measures affects the estimation results. In the simulation study, coverage probability is, in the presence of heterogeneity and few studies, mostly below the nominal level for all frequentist methods based on normal approximation, in particular when sizes in metaanalyses are not balanced, but improve when confidence intervals are adjusted. Bayesian methods result in better coverage than the frequentist methods with normal approximation in all scenarios, except for some cases of very large heterogeneity where the coverage is slightly lower. Credible intervals are empirically and in the simulation study wider than unadjusted confidence intervals, but considerably narrower than adjusted ones, with some exceptions when considering RRs and small numbers of patients per trialarm. Confidence intervals based on the GLMMs are, in general, slightly narrower than those from other frequentist methods. Some methods turned out impractical due to frequent numerical problems.
Conclusions
In the presence of betweenstudy heterogeneity, especially with unbalanced study sizes, caution is needed in applying metaanalytical methods to few studies, as either coverage probabilities might be compromised, or intervals are inconclusively wide. Bayesian estimation with a sensibly chosen prior for betweentrial heterogeneity may offer a promising compromise.
Background
Metaanalyses of few studies are common in practice. For instance, a review of the Cochrane Library revealed that half of the metaanalyses reported in the Cochrane Library are conducted with two or three studies [1]. However, standard randomeffects metaanalysis methods perform poorly when applied to few studies only [2, 3]. It is unclear, whether or to what extent smallsamplesize behaviour can be improved by more sophisticated modeling. Bayesian randomeffects metaanalyses with weakly informative priors for the betweenstudy heterogeneity have been proposed for this setting [4] and their performance has been found to be satisfactory in numerical applications and simulations [3, 5]. Other alternative approaches including likelihood based methods have been mentioned as potential remedies [6].
In metaanalyses commonly a twostage approach is applied. In the first step, data from the individual studies are analyzed resulting in effect estimates with standard errors. These are then combined in a second step. As individual patient data (IPD) are not generally available and effects with standard errors can typically extracted from publications, this twostage approach makes a lot of sense from a practical point of view. With binary data, however, the individual patient data are summarized by 2 × 2 frequency tables and are usually readily available from publications [7]. Therefore, preference might be given to onestage approaches in this setting over the commonly applied twostage approach. However, numerical differences between the onestage and twostage approaches have been found to be small in a simple Gaussian model [8]. If differences are observed, these arise mostly for differing models [9, 10] or relate not to the main effects but interactions [11]. So, while a simpler twostage model is often sufficient (especially in case of many studies and nonrare events), a onestage model may on the other hand be expected to be more flexible and more exact [12]. A Bayesian approach may be more suitable especially in cases of few studies [3–5]. For a more detailed discussion of common models for binary data, see also Jackson et al. [13].
Although some model and method comparison studies appeared recently [13, 14], a systematic evaluation and comparison of the various methods is lacking in the context of few studies. Here we intend to close this gap by an empirical study and comprehensive simulations.
This manuscript is structured as follows. In the following section we summarize the metaanalysis approaches compared, the extraction of the empirical data set and the setup of the simulation study. Then the results of the empirical study as well as of the simulation study are presented. We close with a brief discussion and some conclusions.
Methods
Modeling approaches
In the following, we will consider metaanalyses based on binary endpoints, where each study’s outcome may be summarized in a 2 × 2 table giving the the numbers of participants with and without an event in both study arms.
Normalnormal hierarchical model (NNHM)
Model specification
Traditionally, metaanalytical methods often follow a contrastbased summary measure approach which is based on the logtransformed empirical estimates of the outcome measure and their standard errors, and assuming an approximate normal likelihood [15].
In a common situation in randomeffects metaanalysis, k independent studies are available in which the treatment effect θ_{i} is the parameter of interest (i=1,2,…,k). From each study, an effect estimate \(\hat {\theta }_{i}\) with its estimated variance (squared standard error) \(\sigma _{i}^{2}\) is provided for this treatment effect. It is then assumed that \(\hat {\theta }_{i}\) follows a normal distribution centered around the unknown true treatment effect θ_{i}, with the variance \(\sigma _{i}^{2}\) accounting for the measurement uncertainty, or withinstudy variation. Although \(\sigma _{i}^{2}\) usually only is an estimate, it is commonly treated as known. The θ_{i} may vary across study populations around a global mean μ due to the betweenstudy heterogeneity τ. After integrating out the parameters θ_{i}, the marginal model can be expressed as
This model is commonly applied to both logtransformed risk ratio (RR) or odds ratio (OR) measures of treatment effect for binary data \(\hat {\theta }_{i}\) [16, 17]; it is denoted as “model 1” in the investigation by Jackson et al. [13].
Inference
We will consider frequentist and Bayesian approaches to inference within the generic NNHM. In the frequentist approaches, an estimate of the betweenstudy heterogeneity τ is usually required first. Different estimators are available; in the following we consider the commonly used DerSimonianLaird (DL) [18], maximum likelihood (ML), restricted maximum likelihood (REML) [19, 20] and empirical Bayes (EB) estimators, the latter also being known as the PauleMandel estimator [21, 22]. Based on an estimate of this heterogeneity \(\hat {\tau }\), the mean effect estimates are determined in a subsequent step by conditioning on the \(\hat {\tau }\) value as if it were known.
The fully Bayesian estimation within the NNHM framework is done using three different prior specifications for the betweenstudy heterogeneity (τ). Uncertainty in the heterogeneity is naturally accounted for when estimating the combined treatment effect μ by marginalisation. Especially if the number of studies is small, however, the choice of priors matters, as has been discussed by Turner et al. [23], Dias et al. [24, Sec. 6.2], or Röver [25]. We follow Friede et al. [3] and Spiegelhalter et al. [26, Sec. 5.7] and consider two halfnormal priors with scales 0.5 and 1.0 for the betweenstudy heterogeneity. These specifications include up to “fairly high” and “fairly extreme” heterogeneity [26, Sec. 5.7.3], and they also span the range of values considered in the simulations (see Table 1). In all of these approaches risk ratios (RR) and odds ratios (OR) can be used as the treatment effect.
Generalized linear mixed models (GLMM)
Models
The statistical model may also be based directly on the count data, using either a binomial or a Poisson assumption on the numbers of events per study arm. Generalized linear mixed models (GLMMs) may then be fitted to the data, using a logarithmic link for Poisson rates or a logit link for proportions. Treatment effects may be modeled based on ORs or RRs, and random effects may be included at several stages in order to account for heterogeneity. In addition, we also consider some approximate variants of these models. The models used are outlined briefly below; most of these are also discussed in more detail by Jackson et al. [13].
Model specification and inference
If a Poisson distribution is assumed for the number of events per arm and study, a loglink will be used to model the RR. Following Böhning et al. [7, Ch. 2] this model is estimated using the profile likelihood; in the following, this model will be denoted as the “PNPL” model.
For binomially distributed numbers of events per study arm, a logitlink will be applied to model ORs in a logistic regression. Four different specifications are included in the comparison. Unconditional logistic regression with fixed and random studyspecific nuisance parameters as discussed by Turner et al. [27] are considered (“UM.FS” and “UM.RS”, respectively, in the following). These correspond to models 4 and 5 in Jackson et al. [13].
In addition, we follow van Houwelingen et al. [28] in using a conditional logistic approach, where the total number of events per study is conditioned upon, in order to avoid the need to also model their variability [19]. The likelihood of this conditional model can be described using Fisher’s noncentral hypergeometric distribution [28] (“CM.EL” in the following, and corresponding to model 7 in [13]).
Fisher’s noncentral hypergeometric distribution may be approximated by a binomial distribution, if the number of cases is small compared to the overall participants in that study [29]; this model specification will be denoted by “CM.AL” in the following (approximate version of model 7 in [13, Sec. 3.7.2]). All of the logistic regression models are fitted using maximum likelihood.
Confidence and credible intervals combined effects
The 95% credible intervals in the Bayesian estimation and confidence intervals in the frequentist approaches are estimated for the combined treatment effect μ. The narrowest 95% highest posterior density intervals are used in the Bayesian estimation. For the construction of confidence intervals, Waldtype intervals based on normal quantiles are considered, which are known to be anticonservative when the number of studies is small or nonnegligible amounts of heterogeneity are present [2, 3, 30, 31]. To account for this behaviour, confidence intervals are in addition constructed using Student’s tdistribution in case of the GLMMs, and the HartungKnappSidikJonkman (HKSJ) adjustment [30–32] in case of the NNHM. The HKSJadjusted intervals tend to be wider than the Waldtype intervals, although this is not strictly the case [2, 32, 33]. Knapp and Hartung [33] proposed a modification of the HartungKnappSidikJonkman adjustment (mHKSJ) correcting HKSJadjusted intervals in the cases where they are counterintuitively narrow. These modified confidence intervals are also considered.
I ^{2} as measure of betweenstudy heterogeneity
The “relative amount of betweenstudy heterogeneity” can be expressed in terms of the measure I^{2}, which expresses the the betweenstudy variance (τ^{2}) in relation to the overall variance (\(\tilde {\sigma }^{2}\)) [34], which is stated as
In the calculation of I^{2}, a “typical” \(\tilde {\sigma }^{2}\) value is required as an estimate of the withinstudy variances \(\sigma _{i}^{2}\). Higgins and Thompson [34] suggest a weighted average of the individual withinstudy variances as “typical” value. This, together with the fact that the I^{2} is bounded between zero and one, permits the interpretation of heterogeneity magnitude as a relative percentage. The I^{2} is used to set the amount of heterogeneity in the simulation study. Hoaglin [35] remarks that the probability for observing a moderate (estimated) I^{2} even in the absence of heterogeneity is dependent on the number of studies included and is not negligible. As the I^{2} expresses the betweenstudy variation relative to the total variation, the same values of τ may lead to different values of I^{2}, depending on the precision of the underlying studies and should therefore always be interpreted as a relative measure [36].
Extraction of the empirical data set
A data set of 40 metaanalyses was extracted from publications of the German Institute for Quality and Efficiency in Health Care (IQWiG). IQWiG publications were searched chronologically for metaanalyses of binary data in April 2017 starting with the most recent available ones and reaching back to March 2012. In total, 521 documents were screened, including all document types in the search. If a detailed and a short version of a document existed, only the detailed version was considered. From documents including at least one metaanalysis of binary data, the first one was extracted to obtain a realistic data set with respect to the number of studies typically included in a metaanalysis and the sample sizes of those studies. Metaanalyses involving studies with zero events in one or more arms were excluded from the data set for better comparability of the evaluated methods.
Simulation procedure
To compare properties of the investigated approaches to metaanalysis, we conducted a MonteCarlo simulation adapting the setup from IntHout et al. [14] who described the simulation of 2 × 2 tables. In deviation from IntHout et al. [14], series of trials with up to 10 studies were simulated, and each series was repeated only 2000 times. Three different designs were considered, where in the first one all studies were of equal size, one study was ten times larger than the other studies in the second, and one study was only a tenth of the size of the other studies in the third design. It should be noted however that this ratio corresponds to extreme, but not unrealistic cases, as is also illustrated in the right panel of Fig. 1. The (less common) case of equal sizes is of interest here, as this is where we expect the HKSJ methods to perform best [2, 14].
To generate dichotomous outcomes, p_{0} and I^{2} have to be set in advance. Considered values of the I^{2} correspond to levels of no, low, moderate, high and very high heterogeneity, respectively [37]. Note however, that the same I^{2} value may correspond to different values of betweenstudy heterogeneity τ depending on the effect measure used, and on whether or not study sizes are balanced; the resulting τ values are shown in Table 1. From the τ values one can see that in some of the scenarios, the I^{2} settings imply unrealistically large absolute heterogeneity [26, Sec. 5.7.3], which needs to be considered in the interpretation. This would be true for instance for odds ratios with I^{2} in the range of 0.75 and 0.90 and one small study when τ is roughly in the range of 1 to 2 (see Table 1). The baseline event rate (p_{0}) needs to be set as an additional parameter and varies from 0.1 to 0.9 in steps of 0.2. The treatment effect θ_{i} is set to unity for both RR and OR, which corresponds to the absence of an effect. Note that while for metaanalyses of continuous (or, more specifically, normally distributed) endpoints the magnitude of the simulated treatment effect (θ_{i}) should not affect performance, e.g. for binomial counts it may make a difference, as it affects the chances of observing few or zero events in the treatment arm. However, since we chose not to focus on rareevent issues, and in order to keep the number of simulation scenarios manageable, only the case of no effect was investigated. For every combination of the simulation parameters 2000 repetitions are simulated. In case zero event counts occurred, for the models based on the NNHM, a continuity correction of 0.5 was added to all cells of the affected study’s contingency table. Zero counts, however, were rare in the scenarios considered. The simulation scenarios are also summarized in Table 2. For more details on the simulation procedure see also IntHout et al. [14] and Fig. 2 below. As in the case of the empirical data set, the twosided significance level α was set to 0.05. Different methods and scenarios are compared based on observed confidence or credible interval coverage probabilities and lengths.
Estimation in R
The software environment R [38] and two of its extensions, the metafor [39, 40] and bayesmeta [25, 41] packages are used with their default options. As no implementation in R was found for the PL estimation of Poissonnormal model we translated the steps described by Böhning et al. [7, Ch. 2] into R code which is shown in the Additional file 1.
Results
Empirical study
Most (419; 80%) of the 521 documents searched did not include a metaanalysis, because either the assignment was canceled (11; 2%), the assignment had just started without results being available at the time of search (70; 13%), no metaanalysis was included or accepted by the IQWiG (186; 36%), no study (34; 7%) or just one study (118; 23%) was identified. Out of the remaining 102 documents which included at least one metaanalysis, 25 (5%) did not include any binary metaanalysis, 19 (4%) were network metaanalyses, and in 18 (3%) cases the first binary metaanalysis included at least one study with zero events. An overview over the identified metaanalyses is given in Table 3; the data are also available online [42].
In the original publications, a slight majority of studies (26 of 40) was analyzed using RR as the effect measure. In the extracted data set, 21 out of the 40 metaanalyses (53%) included only 2 studies, while 10 (25%) consisted of three studies. Even in this small example, a common occurrence of 2 and 3study metaanalyses is found, which is also observed empirically by [43] and [44]. The distribution of study sizes and endpoints is also illustrated on the left panel in Fig. 1. With only two studies included, three methods coincide: the DL, the REML and the EB estimation [45]. As this is the case for a major share of the data set, these three methods are expected to show similar results in the analysis. The maximum number of studies observed is 18. The original analyses were based on the NNHM, and, with only the exceptions of the publications A1545, S1101 and A1130 performed using DL variance estimation.
Imbalance in study sizes may influence the estimation of an overall treatment effect [2, 14, 46]. As IntHout et al. [14] observe in an empirical study, such unequal study sizes are common in metaanalyses. In the data set extracted from IQWiG publications, ratios of sample sizes between the largest and the smallest study in a metaanalysis ranged from 1.0 up to 15.8, with a mean of 3.4 and a median of 1.9. Nearly half of the metaanalyses included at least one study twice as large as the smallest study. In the NNHM, studyspecific variances \(\sigma _{i}^{2}\) should roughly be inversely proportional to sample sizes; imbalances in sample size then affect analysis via an imbalance in the σ_{i}. Ratios of largest to smallest study sizes and variances using both effect measures for all studies are shown on the right panel in Fig. 1, where the ratio between the largest and the smallest value is ordered by the ratio of sample sizes in descending order. It can be observed that the ratios of the variances of ORs seem to vary more when study sizes are unbalanced than those of the RRs. However, they both roughly follow the same pattern as the ratio of study sizes.
The extracted data set is then analyzed based on the models and methods described above. As 2 × 2 tables are available for all studies, both effect measures are used to summarize the individual metaanalyses and to evaluate the influence of the choice of effect measure on the estimation results. The ratios of point estimates of the different methods against the standard DL approach are illustrated by the first row of Fig. 3 where the RR is displayed in the left and the OR in the right panel. As expected, DL, REML and EB estimation coincide in the majority of cases including only two studies [45]. These three estimators are also observed to behave comparable when more than two studies are included, as do the point estimates of the Bayesian approach. The greatest deviation from the standard DL approach is observed in the GLMMs in both effect measures. In the case of OR as an effect measure, UM.FS and UM.RS perform comparable. CM.EL estimation does not converge in all cases, however, in the cases where convergence was achieved, it is in line with DL estimation. The CM.AL however, is in general different from the DL estimation. The Poissonbased results also differ considerably from the DL estimates.
The length of confidence intervals for the frequentist and credible intervals in the Bayesian estimation are also of importance as it might not be possible to detect significant treatment effects if intervals are inconclusively wide. For both effect measures, all intervals and the discussed adjustments are shown in the second row in Fig. 3. Again, the RR is displayed in the left and the OR in the right panel. The Bayesian credible intervals are generally wider than the unadjusted confidence intervals and more similar to the adjusted ones with respect to the median length, but exhibiting less variability.
Simulation study
Figures 4 and 5 illustrate the coverage rates (first row) and lengths (second row) of the 95% confidence or credible intervals of the different methods for the relative risks and odds ratios, respectively. All results shown here exemplarily refer to the combination of 100 participants per arm and study and a baseline event rate of 0.7. Results of the other scenarios may be found in the supplement (see Additional files 2 and 3). The different methods are indicated by colours, while the different adjustments are indicated by the line type.
Nonconvergence rates averaged over all scenarios and both effect measures are mostly negligible in the methods based on the normal likelihood on the logscale (ML: 0.049%, EB: 0.032%, HN(1.0): 0.002%, HN(0.5): 0.036%). Estimation based on REML or the methods taking the distributional assumptions on the trialarms lead to slightly higher nonconvergence rates (REML: 0.43%, UM.FS: 0.43%, UM.RS: 0.22%, CM.AL: 0.47%). The only method with high nonconvergence rates is CM.EL, with an average of 18%. None of the methods fully dominates the others over the range of the investigated scenarios. Estimation using CM.EL for the binomialnormal model however was computationally expensive and convergence was problematic in a large proportion of scenarios (using the default values), as has been noted before [13, 19]; these results are omitted here. Coverage rates of all methods are comparable when either the number of studies included in each metaanalysis is sufficiently large or when the heterogeneity is absent or low (I^{2} ≤ 0.25). However, given the frequency with which 2 or 3study metaanalyses occur empirically in our example data set and others [43, 44] and the difficulties in the determination of the absence of heterogeneity [2] this is hardly relevant in practice. In general when study sizes were not balanced, coverage rates for all methods were substantially lower even in the presence of only low heterogeneity in the simulation of OR, while Bayesian estimation and the adjustment of frequentist confidence intervals resulted in better coverage when RR was used. This might be due to the I^{2} values translating to lower values of absolute heterogeneity in the latter case. In the presence of heterogeneity, coverage could drop as low as 40% for some extreme scenarios in both, frequentist and Bayesian estimation, resulting in high falsepositive rates. In general, it was also observed that one large study tended to lead to lower coverage than one small study per metaanalysis in the frequentist methods when more than k = 2 studies are present, which is in line with [14]. This effect is more noticeable in the unadjusted methods, in scenarios where the number of patients per arm (n_{i}) is small, or when heterogeneity (I^{2}) is large. When considering k = 2 trials per metaanalysis, coverages are comparable (OR) or the effect is even reversed (RR). The frequentist methods based on the normalnormal hierarchical model perform similarly, or, in the case of two studies per metaanalysis, even identically [45]. In the case of heterogeneous data, in particular regarding small study sizes, all frequentist methods perform below the nominal coverage probability when confidence intervals are not adjusted. In the scenarios with unbalanced study sizes this is even more pronounced than in the balanced scenarios; this is in line with the findings of [14] and [2]. Coverage can, at the cost of interval width, be increased by either using the HKSJ or the mHKSJ adjustment, but the HKSJ adjustment yields in some scenarios coverage probabilities which are still below the nominal level [14].
The length of confidence and credible intervals is illustrated in the second row of Figs. 4 and 5. Bayesian credible intervals are, as in the case for the empirical data set, in general wider than the unadjusted confidence intervals from the frequentist estimations. When compared to adjusted confidence intervals, Bayesian credible intervals tend to be, especially in the presence of heterogeneity and with only two studies per metaanalysis, narrower than the frequentist intervals based on the normalnormal model as long as the number of patients per trial arm is small. However, when n_{i} increases, this is no longer true for RR (in contrast to the scenarios using OR). These differences may be due to the fact that idential I^{2} settings can imply very different (and sometimes possibly unrealistically large) magnitudes of heterogeneity values on the τ scale, as can also be seen in Table 1. In these extreme scenarios, adjusted frequentist confidence intervals are observed to be inconclusively wide, with the exception of BNUM.FS and BNUM.RS, and especially when estimation is based on the NNHM. In the other scenarios, Bayesian credible and adjusted confidence intervals are comparable.
Discussion
In our empirical study we found that the majority of the 40 metaanalyses extracted from publications of IQWiG included only two studies. This is in agreement with a much larger empirical investigation based on the Cochrane Library by Turner et al. [1]. This finding emphasizes the need for methods appropriate for metaanalysis with few studies. Furthermore, varying methods and / or effect measures lead to differences in the results for the 40 metaanalyses considered. This demonstrates that prespecification of methods as well as effect measures is important for controlling operating characteristics. The problems encountered in metaanalyses of few studies may mostly be attributed to the estimation of heterogeneity, and in particular to the proper accounting for its uncertainty in constructing intervals for the combined effect. The difference in performance between different heterogeneity estimators is relatively small compared to the difference in whether or how heterogeneity uncertainty is propagated through to the effect estimate [3].
In the simulation study, coverage probability was below the nominal level for all frequentist methods in the presence of heterogeneity and few studies. This phenomenon is even more pronounced when studies included in a metaanalysis are of unequal size. However, coverage probabilities generally improve when confidence intervals are adjusted based on the Studenttdistribution. Bayesian methods mostly result in better coverage across all scenarios, except for some cases of very large heterogeneity (in terms of τ) where the coverage is slightly lower. Credible intervals are empirically and in the simulation study wider than unadjusted confidence intervals, but considerably narrower than adjusted ones, with some exceptions when considering RRs and large numbers of patients per trialarm. Previous simulation studies comparing a more restricted set of methods including standard frequentist and Bayesian approaches only led to similar conclusions. The simulations presented here considering a wider set of methods show that the issues entailed by the increased complexity of some likelihoodbased approaches may often outweigh their expected advantages [6]. However, confidence intervals based on the GLMMs for example are in general slightly narrower than those from other frequentist methods. Furthermore, certain maximumlikelihood methods turned out to suffer from frequent numerical problems in the setting with few studies. To our knowledge, this has not been described previously.
Our empirical investigation did not consider all IQWiG reports, but only the most recent 40 metaanalyses at the time of extraction. A consideration of all metaanalyses might have led to a more complete picture, but was not feasible with the resources of this project as no specific funding was available. Furthermore, the simulation study could have been enriched by additional methods. For instance, we only considered Bayesian twostage approaches but did not include Bayesian approaches utilizing the full information of the 2 × 2 tables. The latter was considered recently by [47] in the context of network metaanalyses, where pairwise metaanalysis would be a special case. As for the likelihood methods, we would expect that the results of the onestage approach are overall quite similar to those of the twostage approach considered, maybe with the potential of some small improvements. As discussed in the context of the simulation setup, a prespecified I^{2} value may correspond to rather different τ values, depending on the circumstances (see also Table 1). Consequently, one may generally expect larger I^{2} values for logRR endpoints, and smaller I^{2} values for logOR endpoints, while heterogeneity priors are probably best discussed at the scale of τ values (a prior specification in terms of I^{2} would be possible [25], but this would be hard to motivate). By relating the heterogeneity to the τ value, the question to consider is by what factor the true RRs or ORs θ_{i} are expected to differ solely due to betweentrial heterogeneity [26, Sec. 5.7.3], and the reasonably expected range should then be covered by the prior. For example, a heterogeneity of τ = 1.0 implies that the central 95% of true study means (θ_{i}) span a range of a factor of 50 [3, 26]. The HN(0.5)prior confines τ to values below 1.0 with roughly 95% probability, while the HN(1.0)prior constitutes a conservative variation that instead allows for twice as large heterogeneity, implying a plausible range of roughly up to factor of 50^{2}=2500.
The limits of applicability of approximate metaanalysis methods have been discussed from the perspective of the NNHM by Jackson and White [48]. In the limit of many studies (large k) and large sample sizes (large n_{i}), the normal approximation usually works well. It starts breaking down, however, when the number of studies (k) gets too small. The problem then is related to the estimation of heterogeneity (τ) and proper accounting for the associated uncertainty; inference would still be exact if the heterogeneity was known. In the frequentist context, use of the HKSJ adjustment helps, especially if the studyspecific standard errors are roughly balanced [49]. This is not so much of a problem when Bayesian methods along with reasonable priors are used; these methods yield valid inference irrespective of the number of included studies [50]. Problems also arise when events are rare or sample sizes (n_{i}) are small. In either case, the chances of observing few or no events in a treatment group increase, and normal approximations to the likelihood break down. In such situations, a solution might be to resort to exact likelihoods respecting the discrete nature of the data, for example a GLMM, which may again be done in frequentist or Bayesian frameworks [7, 40, 51].
Conclusions
In the presence of betweenstudy heterogeneity, especially with unbalanced study sizes, caution is needed in applying metaanalytical methods to few studies, as either coverage probabilities of intervals may be compromised, or they may be inconclusively wide. Bayesian estimation with sensibly chosen prior for the betweenstudy heterogeneity may offer a compromise and promising alternative.
Abbreviations
 For abbreviations of model variations used,:

see also Table 4.
 DL:

DerSimonianLaird
 EB:

Empirical Bayes
 GLMM:

Generalized linear mixed model
 HKSJ:

HartungKnappSidikJonkman
 IPD:

Individual patient data
 IQWiG:

Institut für Qualität und Wirtschaftlichkeit im Gesundheitswesen (Institute for Quality and Efficiency in Health Care)
 mHKSJ:

Modified HKSJ
 ML:

Maximum likelihood
 NNHM:

Normalnormal hierarchical model
 OR:

Odds ratio
 PL:

Profile likelihood
 REML:

Restricted ML
 RR:

Risk ratio
References
 1
Turner RM, Davey J, Clarke MJ, Thompson SG, Higgins JPT. Predicting the extent of heterogeneity in metaanalysis, using empirical data from the Cochrane Database of Systematic Reviews. Int J Epidemiol. 2012; 41(3):818–27. https://doi.org/10.1093/ije/dys041.
 2
Röver C, Knapp G, Friede T. HartungKnappSidikJonkman approach and its modification for randomeffects metaanalysis with few studies. BMC Med Res Methodol. 2015;15. https://doi.org/10.1186/s1287401500911.
 3
Friede T, Röver C, Wandel S, Neuenschwander B. Metaanalysis of few small studies in orphan diseases. Res Synth Methods. 2017; 8(1):79–91. https://doi.org/10.1002/jrsm.1217.
 4
Higgins JPT, Thompson SG, Spiegelhalter DJ. A reevaluation of randomeffects metaanalysis. Journal of the Royal Statistical Society A. 2009; 172(1):137–59. https://doi.org/10.1111/j.1467985X.2008.00552.x.
 5
Friede T, Röver C, Wandel S, Neuenschwander B. Metaanalysis of two studies in the presence of heterogeneity with applications in rare diseases. Biom J. 2017; 59(4):658–71. https://doi.org/10.1002/bimj.201500236.
 6
Bender R, Friede T, Koch A, Kuss O, Schlattmann P, Schwarzer G, Skipka G. Methods for evidence synthesis in the case of very few studies. Res Synth Methods. 2018. https://doi.org/10.1002/jrsm.1297.
 7
Böhning D, Rattanasiri S, Kuhnert R. Metaanalysis of Binary Data Using Profile Likelihood. Boca Raton: Taylor & Francis; 2008.
 8
Morris TP, Fisher DJ, Kenward MG, Carpenter JR. Metaanalysis of Gaussian individual patient data: twostage or not twostage?Stat Med. 2018. https://doi.org/10.1002/sim.7589.
 9
Mathew T, Nordström K. Comparison of onestep and twostep analysis models using individual patient data. Biom J. 2010; 52(2):271–87. https://doi.org/10.1002/bimj.200900143.
 10
Burke DL, Ensor J, Riley RD. Metaanalysis using individual participant data: onestage and twostage approaches, and why they may differ. Stat Med. 2017; 36(5):855–75.
 11
Kontopantelis E. A comparison of onestage vs twostage individual patient data metaanalysis methods: a simulation study. Res Synth Methods. 2018. https://doi.org/10.1002/jrsm.1303.
 12
Debray T, Moons KGM, AboZaid GMA, Koffijberg H, Riley RD. Individual participant data metaanalysis for a binary outcome: onestage or twostage?PLoS ONE. 2013; 8(4):60650.
 13
Jackson D, Law M, Stijnen T, Viechtbauer W, White IR. A comparison of seven randomeffects models for metaanalyses that estimate the summary odds ratio. Stat Med. 2018; 37(7):1059–85. https://doi.org/10.1002/sim.7588.
 14
IntHout J, Ioannidis JPA, Borm GF. The HartungKnappSidikJonkman method for random effects metaanalysis is straightforward and considerably outperforms the standard DerSimonianLaird method. BMC Med Res Methodol. 2014; 14:25. https://doi.org/10.1186/147122881425.
 15
Fleiss JL. The statistical basis of metaanalysis. Stat Methods Med Res. 1993; 2(2):121–45.
 16
Hedges LV, Olkin I. Statistical Methods for Metaanalysis. San Diego: Academic Press; 1985.
 17
Hartung J, Knapp G, Sinha BK. Statistical Metaanalysis with Applications. Hoboken: Wiley; 2008.
 18
DerSimonian R, Laird N. Metaanalysis in clinical trials. Control Clin Trials. 1986; 7(3):177–88. https://doi.org/10.1016/01972456(86)900462.
 19
Viechtbauer W. Bias and efficiency of metaanalytic variance estimators in the randomeffects model. J Educ Behav Stat. 2005; 30(3):261–93. https://doi.org/10.3102/10769986030003261.
 20
Raudenbush SW. Analyzing effect sizes: randomeffects models In: Cooper HM, Larry VH, Valentine JC, editors. The Handbook of Research Synthesis and MetaAnalysis. New York City: Russell Sage Foundation: 2009. p. 295–316.
 21
Morris CN. Empirical Bayes methods for combining likelihoods: comment. J Am Stat Assoc. 1996; 91(434):555–8. https://doi.org/10.2307/2291646.
 22
Paule RC, Mandel J. Consensus values and weighting factors. J Res Natl Bur Stand. 1982; 87(5):1–9. https://doi.org/10.6028/jres.087.022.
 23
Turner RM, Jackson D, Wei Y, Thompson SG, Higgins PT. Predictive distributions for betweenstudy heterogeneity and simple methods for their application in Bayesian metaanalysis. Stat Med. 2015; 34(6):984–98. https://doi.org/10.1002/sim.6381.
 24
Dias S, Sutton AJ, Welton NJ, Ades AE. NICE DSU Technical Support Document 2: A Generalized Linear Modelling Framework for Pairwise and Network Metaanalysis of Randomized Controlled Trials. London: National Institute for Health and Clinical Excellence (NICE); 2014. National Institute for Health and Clinical Excellence (NICE). available from: http://www.nicedsu.org.uk.
 25
Röver C. Bayesian randomeffects metaanalysis using the bayesmeta R package. arXiv preprint 1711.08683. 2017. http://www.arxiv.org/abs/1711.08683.
 26
Spiegelhalter DJ, Abrams KR, Myles JP. Bayesian Approaches to Clinical Trials and Healthcare Evaluation. Chichester: Wiley; 2004. https://doi.org/10.1002/0470092602.
 27
Turner RM, Omar RZ, Yang M, Goldstein H, Thompson SG. A multilevel model framework for metaanalysis of clinical trials with binary outcomes. Stat Med. 2000; 19(24):3417–32.
 28
van Houwelingen HC, Zwinderman KH, Stijnen T. A bivariate approach to metaanalysis. Stat Med. 1993; 12(24):2273–84. https://doi.org/10.1002/sim.4780122405.
 29
Stijnen T, Hamza TH, Özdemir P. Random effects metaanalysis of event outcome in the framework of the generalized linear mixed model with applications in sparse data. Stat Med. 2010; 29(29):3046–67. https://doi.org/10.1002/sim.4040.
 30
Hartung J, Knapp G. On tests of the overall treatment effect in metaanalysis with normally distributed responses. Stat Med. 2001; 20(12):1771–82. https://doi.org/10.1002/sim.791.
 31
Hartung J, Knapp G. A refined method for the metaanalysis of controlled clinical trials with binary outcome. Stat Med. 2001; 20(24):3875–89. https://doi.org/10.1002/sim.1009.
 32
Sidik K, Jonkman JN. A simple confidence interval for metaanalysis. Stat Med. 2002; 21(21):3153–9. https://doi.org/10.1002/sim.1262.
 33
Knapp G, Hartung J. Improved tests for a random effects metaregression with a single covariate. Stat Med. 2003; 22(17):2693–710. https://doi.org/10.1002/sim.1482.
 34
Higgins JPT, Thompson SG. Quantifying heterogeneity in a metaanalysis. Stat Med. 2002; 21(11):1539–58. https://doi.org/10.1002/sim.1186.
 35
Hoaglin DC. Misunderstandings about Q and ’Cochran’s Q test’ in metaanalysis. Stat Med. 2016; 35(4):485–95. https://doi.org/10.1002/sim.6632.
 36
Borenstein M, Higgins JPT, Hedges LV, Rothstein HR. Basics of metaanalysis: I ^{2} is not an absolute measure of heterogeneity. Res Synth Methods. 2017; 8(1):5–18. https://doi.org/10.1002/jrsm.1230.
 37
Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in metaanalyses. BMJ. 2003; 327(7414):557–60. https://doi.org/10.1136/bmj.327.7414.557.
 38
R Core Team. R: a Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2016. R Foundation for Statistical Computing. https://www.Rproject.org/.
 39
Viechtbauer W. metafor: Metaanalysis Package For R. 2009. R package. https://cran.rproject.org/package=metafor.
 40
Viechtbauer W. Conducting metaanalyses in R with the metafor package. J Stat Softw. 2010; 36(3):1–48.
 41
Röver C. bayesmeta: Bayesian randomeffects Metaanalysis. 2015. R package. https://cran.rproject.org/package=bayesmeta.
 42
Seide SE, Röver C, Friede T. Metaanalysis data extracted from IQWiG publications. Göttingen Research Online. 2018. https://doi.org/10.25625/BWYBNK.
 43
Turner RM, Davey J, Clarke MJ, Thompson SG, Higgins JPT. Predicting the extent of heterogeneity in metaanalysis, using empirical data from the Cochrane Database of Systematic Reviews. Int J Epidemiol. 2012; 41(3):818. https://doi.org/10.1093/ije/dys041.
 44
Kontopantelis E, Springate DA, Reeves D. A reanalysis of the Cochrane Library data: the dangers of unobserved heterogeneity in metaanalyses. PLoS ONE. 2013; 8(7):1–14. https://doi.org/10.1371/journal.pone.0069930.
 45
Rukhin AL. Estimating heterogeneity variance in metaanalysis. J R Stat Soc Ser B (Stat Methodol). 2013; 75(3):451–69. https://doi.org/10.1111/j.14679868.2012.01047.x.
 46
Partlett C, Riley RD. Random effects metaanalysis: Coverage performance of 95% confidence and prediction intervals following REML estimation. Stat Med. 2017; 36(2):301–17. https://doi.org/10.1002/sim.7140.
 47
Günhan BK, Friede T, Held L. A designbytreatment interaction model for network metaanalysis and metaregression with integrated nested Laplace approximations. 2018; 9(2):179–94. https://doi.org/10.1002/jrsm.1285.
 48
Jackson D, White IR. When should metaanalysis avoid making hidden normality assumptions?Biom J. 2018. https://doi.org/10.1002/bimj.201800071.
 49
Veroniki AA, Jackson D, Bender R, Kuß O, Langan D, Higgins JPT, Knapp G, Salanti G. Methods to calculate uncertainty in the estimated overall effect size from a randomeffects metaanalysis. 2018. https://doi.org/10.1002/jrsm.1319.
 50
Röver C, Friede T. Contribution to the discussion of “When should metaanalysis avoid making hidden normality assumptions?”: A Bayesian perspective. Biom J. 2018; 60(6):1068–70. https://doi.org/10.1002/bimj.201800179.
 51
Günhan BK, Röver S, Friede T. Metaanalysis of few studies involving rare events. arXiv preprint 1809.04407. 2018.
Acknowledgements
The authors would like to thank Ralf Bender (IQWiG, Köln, Germany) for helpful comments on the manuscript.
Funding
Not applicable.
Availability of data and materials
The IQWiG dataset used in the current study is available online [42].
Author information
Affiliations
Contributions
TF conceived the concept of this study. SES carried out the simulations and drafted the manuscricpt. CR critically reviewed and made substantial contributions to the manuscript. All authors commented on and approved the final manuscript.
Corresponding author
Correspondence to Christian Röver.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file 2
Supplement1.pdf: The plots analogous to Fig. 4, for all simulation scenarios. (PDF 676 kb)
Additional file 3
Supplement2.pdf: The plots analogous to Fig. 5, for all simulation scenarios. (PDF 826 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Randomeffects metaanalysis
 Normalnormal hierarchical model (NNHM)
 HartungKnappSidikJonkman (HKSJ) adjustment
 Generalized linear mixed model (GLMM)
 Count data