Accurate confidence intervals for risk difference in meta-analysis with rare events

Background Meta-analysis provides a useful statistical tool to effectively estimate treatment effect from multiple studies. When the outcome is binary and it is rare (e.g., safety data in clinical trials), the traditionally used methods may have unsatisfactory performance. Methods We propose using importance sampling to compute confidence intervals for risk difference in meta-analysis with rare events. The proposed intervals are not exact, but they often have the coverage probabilities close to the nominal level. We compare the proposed accurate intervals with the existing intervals from the fixed- or random-effects models and the interval by Tian et al. (2009). Results We conduct extensive simulation studies to compare them with regards to coverage probability and average length, when data are simulated under the homogeneity or heterogeneity assumption of study effects. Conclusions The proposed accurate interval based on the random-effects model for sample space ordering generally has satisfactory performance under the heterogeneity assumption, while the traditionally used interval based on the fixed-effects model works well when the studies are homogeneous.

exact conditional framework [2]. It is well known that asymptotic approaches often do not have satisfactory performance when outcome is extreme or sample size is small.
Multiple methods have been developed for metaanalysis with rare events over decades [3,4]. The fixedeffects models are conveniently used in practice, such as the Mantel-Haenszel method [5]. When one or both groups in a study have zero events, a continuity correction is often needed in order to estimate risk ratio or odds ratio, but the traditional correction by adding 0.5 may lead to undesirable influence on the analysis results as pointed out by Sweeting et al [6]. Later, they developed a continuity correction method by adding a float value based on the size of each group to improve the coverage probability. Multiple follow-up articles discussed this issue whether or not a small value should be added to studies with rare events in data analysis [7,8]. Kuss et al. [8] suggested using a beta-binomial model to avoid adding arbitrary values to each cell in data analysis. Recently, Tian et al. [9] proposed a simple and effective method for confidence interval calculation without artificial continuity correction. The confidence intervals from each study were weighted to construct an overall interval from simulation studies under the fixed-effects model. Their developed confidence intervals were shown to have better coverages when the events are rare, but the length of their intervals could be much longer than others.
In contrast to fixed-effects models, the treatment effect in the random-effects model is assumed to follow a normal distribution. DerSimonian and Laird [10] proposed a random-effects model by including a random study effect to account for the variation of study population or study design. The statistical software R package meta can be used to compute confidence intervals for the fixed-effects model and random-effects model [11]. Recently, Bakbergenuly and Kulinskaya [12] suggested the generalized linear mixed models (GLMMs) in meta-analysis to include the correlation between point estimate and its variance estimate in data analysis.
The aforementioned exact conditional approach assumes both marginal totals in each study fixed [2]. It is reasonable to assume that the numbers of participant in each treatment group are fixed. It is not usual that a repeated study has the same total number of events as the observed study. The exact one-sided limit by Buehler [13] follows the study design with sample size in each treatment group fixed [14][15][16]. However, it is too computationally intensive to generate all possible samples in meta-analysis with binary outcome [17].
In this article, we propose using importance sampling to construct confidence interval for risk difference in metaanalysis with rare events. We apply the importance sampling method described by Lloyd and Li [18] to compute the profile confidence limit proposed by Kabaila and Lloyd [19]. Importance sampling methods have been studied by many researchers with regards to coverages of confidence intervals [20,21]. Importance sampling does not require to enumerate all possible samples [19]. This approach simulates samples from the distribution estimated from the observed data. Importance sampling has to be used in conjunction with a designated statistic to order the limits of simulated samples. We consider the existing intervals from the fixed-effects and random-effects models as designated statistics in this article.
The rest of this article is organized as follows. In "Methods" section, we describe the fixed-effects and random-effects models to estimate confidence intervals for risk difference. We then introduce importance sampling for interval calculation. In "Results" section, we use an example from 18 schizophrenia clinical trials to illustrate the application of the proposed intervals, and then compare the proposed intervals with the existing inte rvals with regards to coverage probability and average length. In "Conclusions" section, we provide some remarks on data analysis for meta-analysis with rare events.

Methods
For meta-analysis with binary outcome, data can be organized in a K × 4 table, where K is the number of studies (Table 1). Each row represents the results from a parallel study with the number of events and the number of nonevents in the new treatment group and the control group, respectively. Let the two treatment groups be indexed by 0 and 1 for the control and the new treatment, respectively. Suppose X ijr is the number of participants having r events from the treatment j in ith study, where i = 1, 2, · · · , K, j = 0, 1, and r = 0, 1. For studies with rare events, X ij1 is often very small. Let n ij = X ij1 + X ij0 be the total number of participants from the treatment j in the ith study, and N 1 = (n 11 , n 21 , · · · , n K1 ) and N 0 = (n 10 , n 20 , · · · , n K0 ) be the sample sizes for the new treatment group and the control group, respectively. Suppose p j is the event rate of the treatment j. Given the sample size n ij , the number of responses among these participants, X ij1 , follows a binomial distribution, B(n ij , p j ). We assume that each study is independent from each other, and the two groups within each study are independent from each other as well. The parameter of interest here is the risk difference between the treatment group and the control group, We first review the existing methods to construct twosided confidence intervals for in "Intervals based on fixed or random-effects model" section, and then develop accurate intervals in "Accurate intervals" section.

Intervals based on fixed or random-effects model
We first consider the fixed-effects model to calculate confidence interval for . Under the study homogeneity assumption, the treatment effect in each study is assumed to be the same, where μ is the treatment effect. In the ith study, the risk difference i is estimated as · · · · · · · · · · · · · · · K X K11 X K10 X K21 X K20 wherep ij = X ij1 /n ij is the estimated rate of the treatment j in the ith study. The variance is estimated as from two independent proportions. The weight for the ith study is n i1 +n i0 is the factor to standardize the weight values, with K i=1 w i = 1. It is easy to show that w i is an increasing function of n i1 (n i0 ) when n i0 (n i1 ) is fixed.
The overall weighted treatment effect using the fixedeffects model is calculated as and its variance is estimated as The standardized statistic / SE F follows the standard normal distribution asymptotically when = 0. Therefore, the asymptotic confidence interval for based on the fixed-effects model (the F interval) at the nominal level of 100(1 − α)% is where z a is the ath quantile of the standard normal distribution.
In the observation of study heterogeneity which could be caused by study population or study design or influential covariates, DerSimonian and Laird [10] proposed using the random-effects model to include the study random effect in the model as where u i is the deviation of the ith study from the population mean μ, and it follows a normal distribution. Let v i be the weight of the ith study from the fitted random-effects model. Then, the weighted treatment effect and its vari- It follows that the asymptotic confidence interval for using the random-effects model (the R interval) is computed as It can be seen that the difference between CI F and CI R is the weights used in the treatment effect and its variance calculation. The F interval and the R interval can be computed by using the function metabin from the statistical software package meta [11,22]. In the metabin function, we use MH.exact = TRUE in the option with no continuity correction in the estimates.

Accurate intervals
Exact confidence limit by Buehler [13] for is preferable, but it is computationally intensive to save all the possible samples in meta-analysis with sample size n ij fixed. For this reason, we consider importance sampling (IS) to construct accurate intervals for by simulating samples from the distribution estimated from the observed data to make statistical inference. Importance sampling has been applied to many important medical research areas that often only have one nuisance parameter (e.g., the proportion difference in a parallel study [21,23]). We extend the application of IS to meta-analysis with multiple nuisance parameters in confidence interval calculation. The intervals computed using importance sampling are accurate with coverage close to the nominal level. In addition, importance sampling has the computational advantage over exact methods [19].
The calculation of the IS intervals has to be used in conjunction with a designated statistic for the interval ordering. Let T be the considered designated statistic. Suppose p 0 = (p 10 , p 20 , · · · , p K0 ) is the probability vector of the control group, where p i0 is the probability of the control group in the ith study. The accurate upper limit based on the designated statistic T is computed as the supremum of such that where y obs is the observed data, Y is data from the simulated data set, andp 0 ( ) is the maximum likelihood estimate of p 0 given . Suppose we simulate B data sets from independent binomial distributions with the probabilities using * and p 0 ( * ) estimated from the observed data y obs . For studies with double zeros, although their estimated risk differences are zero, sample sizes from such studies are still valuable information in estimating the overall and it confidence intervals [24]. Sample sizes from all studies including the ones with double zeros are used in the proposed method. The number of events are simulated from binomial distributions with the probabilities of p 0 ( * ).
The designated statistic of each simulated data set is computed, and compared with T(y obs ). The set of T(Y) ≤ T(y obs ) equals to T (y obs ) = {Y : T(Y) ≤ T(y obs )}. Let the size of T (y obs ) be B 1 with data: Y 1 , · · · , Y B 1 . Then, the upper limit in Eq. 3 can be rewritten as the supremum of such that is the probability density function of Y b , which is a product of independent binomial distributions with parameters (n ij , p ij ) for the treatment j in the ith study. For a given , numerical algorithms can be used to find the maximum likelihood estimator of p 0 ( ) to calculate G( ). Similarly, the IS lower limit can be computed. It should be noted that designated statistics from the same model are used for the IS upper limit and the IS lower limit. For example, the asymptotic upper limit from the fixed-effects model is used as the designated statistic for the accurate upper limit, and then the lower limits from the same model is used for the accurate lower limit. We refer this accurate interval as the IS-F interval. When the asymptotic limits from the random-effects model are used as the designated statistics, the computed accurate limits are referred to be as the IS-R interval.

Results
We first use an example from 18 schizophrenia clinical trials to illustrate the application of the proposed accurate intervals. In addition to the F interval, the R interval, the IS-F interval, and the IS-R interval, We also include the confidence interval for by Tian [9] in the comparison (referred to be as the Tian interval). Tian interval can be computed by using their developed R function meta.exact from the exactmeta function, without the mid-p value approach. All data including studies with zero events are used in the confidence interval calculation.
These 18 schizophrenia clinical trials reported the number of all-cause mortality for patients treated with the long-acting injectable antipsychotics (LAI-AP) or the oral antipsychotics (OAP) which is the control treatment here. Data of these 18 trials are presented in Table 2, which was provided by Efthimiou [25]. Out of a total of 3774 participants treated with the LAI-AP, 7 events were observed. In the OAP group, there were 6 events recorded from a total of 2145 participants in the control group. The naive estimates for all-cause mortality rates are 0.185% and 0.279% in the LAI-AP group and the OAP group, respectively. Table 3 presents the estimated and the 95% confidence interval for using the five methods. The point estimate of from the R method is similar to the Tian method, and they are larger than that from the F method. It can be seen that the Tian interval is much wider than others, and the asymptotic F or R intervals have shorter lengths than the proposed accurate intervals. The upper limits of the proposed accurate intervals are smaller than those of other intervals. All the intervals contain zero. Therefore, we fail to reject the null hypothesis that there is no difference between the LAI-AP treatment and the OAP treatment with regards to the all-cause mortality rate.

Simulation studies
We conduct extensive simulation studies to compare coverage probability and average length of the five intervals: the F interval, the R interval, the IS-F interval, the IS-R interval, and the Tian interval. The nominal confidence level is set as 95%. The sample sizes, n ij , are assumed to be the same as those in the aforementioned example, as N 1 and N 0 in Table 2. The number of responses X ij1 follows a binomial distribution (n ij , p ij ). We simulate D = 1, 000 data for each configuration: Y 1 , Y 2 , · · · , and Y D . For the proposed IS intervals, we generate B = 2, 000 importance samples from the estimated distribution using each simulated data.
Coverage probability is defined as the proportion of the pre-specified risk difference being included in the confidence intervals: A confidence interval with the simulated interval being closer to the nominal level is preferable. Average length is defined as the average of all the lengths where CI lower and CI upper are the lower limit and the upper limit of an interval. When two intervals are comparable with regards to coverage probability, the one with a shorter average length outperforms the other.

Homogeneity of study effects
We first compare the coverage probabilities of the five methods with fixed probabilities, p 1 and p 0 . For simplicity, we assume a common rate in the control group, p i0 = p, with p from 0.01% to 10%. The treatment probability is p i1 = p + . For each configuration of (p, ), the coverage probabilities of these methods are computed, see Fig. 1 when =0.005 and 0.05. It can be seen that the F method has the coverage closer to the nominal level when =0.005, except the case in which p is very low. As is increased to 0.05, the F interval, the IS-R interval, and the IS-F interval have similar coverages when p is small. The IS-F interval and the IS-R interval are conservative when p is large. In this plot with = 0.05, the Tian interval and the R interval have the coverage probabilities below the nominal level. Overall, the F interval has good performance with regards to coverage when studies are homogeneous and have common rates.
Given the number of nuisance probabilities, it is difficult to compare the performance of the five methods under each configuration. With 18 studies and 5 considered probabilities, the number of possible configurations is 5 36 , which is over 10 25 . For this reason, we follow the approach by Tian et al. [9] to compare the performance of these methods by simulating the probabilities of the control group (p 0 ) from uniform distributions:   Table 4 presents coverage and average length comparison between the five intervals when p 0 ∼ U(0, 0.01%). Coverage probabilities of the F interval range from 89% to 96%. The R interval is very conservative when is small, and its coverage is below 95% when is larger. The Tian interval is conservative when ≤ 1%, but it could be as low as 76% when is 10%. The proposed accurate intervals always have the coverage probabilities close to the nominal level as compared to the existing intervals. Average length is always an increasing function of for each confidence interval method. The Tian intervals are wider than others when they all guarantee the coverage probability. The IS-R interval generally has a shorter length as compared to the R interval and the IS-F intervals.
When the event rates of the control group are higher with p 0 ∼ U(0, 0.1%) in Fig. 2, the F interval generally performs better than others with regards to coverage probability and average length. When is large (e.g., 10%), coverage probabilities of these intervals are all slightly below 95%. In this case with a small p 0 and a relatively large , the proposed intervals (IS-R or IS-F intervals) have better coverage probabilities than the F interval, and the length difference between the accurate intervals and the F interval is small. When = 10%, the coverage probability of the Tian interval is below 80%. When the rates are even higher with p 0 ∼ U(0, 1%), the rates are not rare in these configurations, and the F interval outperforms others as seen in Fig. 2.

Heterogeneity of study effects
Under the study heterogeneity assumption, the probability in the treatment group is p i1 =p i0 + u i , where u i is   the random study effect that follows a normal distribution with mean of and standard deviation of /2. Figure 3 presents the coverage probability and average length comparison between the five intervals when p 0 ∼ U(0, 0.01%), U(0, 0.1%), and U(0, 1%). As increases, the standard deviation of the probabilities in the treatment group goes up. When is small, the F interval, the IS-R interval, and the IS-F interval have the coverage probabilities closer to the nominal level as compared to the R interval and the Tian interval. Coverage probabilities of the F interval and the IS-F interval drop to almost 50% when is 10%. The R interval generally has good coverage when is large. However, the R interval's coverage probabilities are very low when = 1% in meta-analysis with rare events (e.g., p 0 ∼ U(0, 0.01%) or U(0, 0.1%)). The IS-R interval has consistent good performance with regards to coverage and length as compared to others in metaanalysis with rare events. Figure 3 also presents the results when the event rates are not rare (e.g., p 0 ∼ U(0, 1%)). When is large, the R interval and the IS-R interval have better coverage probabilities than others. When variance of study effects is small (for the configurations with

Conclusions
We propose using importance sampling to construct confidence intervals for risk difference in meta-analysis with rare events. The traditionally used F interval has satisfactory performance with regards to coverage probability and interval length when the rate of events is not rare under the study homogeneity assumption, but this interval could have a very low coverage probability under the study heterogeneity assumption. The IS-R interval based on the asymptotic limits from the random-effects model outperforms the existing intervals under the heterogeneity assumption. The IS intervals use the existing asymptotic limits to order the sample space. Although the asymptotic limits are computed from asymptotic approaches whose performances are based on the approximation of the test statistic to the limiting distribution, the order of these limits provides a useful information to produce better IS limits.
The Tian interval often guarantees the coverage probability when the rates of both groups are rare, but that interval could have the coverage probability below the nominal level when is large. Theoretically, the Tian interval can be used as a designated statistic to order the sample space. However, simulations are involved in the Tian interval calculation that would significantly increase the computational intensity of the proposed IS intervals. In addition, the ordering of the sample space based on the Tian interval may change as the number of simulations being utilized. For these reasons, we do not include the IS intervals based on the ordering by the Tian interval.

Discussion
The method by Buehler [13] to construct exact one-sided confidence interval is ideal for binary outcome when the size of the sample size is not too large that allows a full enumeration of the sample space [16,[26][27][28][29]. However, it is not feasible in meta-analysis as it is extremely difficult to save the sample space under the unconditional framework with sample size in each treatment group fixed. If the upper bound of the possible number of events can be determined and the size of the sample size is not too large, exact Buehler interval may be computed. Otherwise, an efficient search algorithm should be developed to order the sample space efficiently.
Exact confidence intervals are preferable for statistical inference. However, it is often computationally intensive, such as the aforementioned the exact interval by Buehler [28,[30][31][32]. For these reasons, simulation based intervals are proposed for use in practice, including the proposed interval here, the Tian interval, and the interval based on confidence distribution [24,[33][34][35]. It is still a big challenge in exact meta-analysis by enumerating all possible data, which becomes a big data problem with the requirement of huge memory and computational power.
In addition to risk difference, odd ratio and risk ratio are also used to measure the treatment effect. For studies with zero events in one or both treatment groups, the estimated risk difference is zero. However, the estimated ratios could be infinity [17,[36][37][38][39]. In order to avoid this issue, an arbitrary small number (e.g., =0.5, 1) is often added to each cell in the data. The performance of the test statistics is affected by the chosen small value [6,[40][41][42]. The added value also raises the question of whether the number of participants in a study should be n ij or n ij + 2 . We consider this as future work to study the IS intervals for ratios.