Data
The Nordic Cochrane Centre provided the content of the first issue from 2008 of the CDSR. The database includes meta-analyses within reviews which have previously been classified by outcome type, medical specialty and types of interventions included in the pairwise comparisons [12]. The database did not record whether the data type was time-to-event; however, based on the outcome classification we were able to identify (using words such as "survival", "death", "fatality") three sets of time-to-event meta-analyses:

"binary": those with outcome classification "all-cause mortality", where the information recorded was based only on the number of events and participants per arm;

"O-E-V": those with outcome classifications "overall survival" and "progression/disease-free survival", where the information recorded was based on "binary" data in addition to the log-rank "O-E" and "V" statistics; these were originally analysed as HRs in the RevMan software;

those with an estimated log HR and its standard error. These were removed from further analyses since no information was available on the number of events and participants per arm, and therefore no binary-data meta-analysis could be conducted.
Therefore, we identified two subsets of time-to-event meta-analyses: those with binary summaries, and those with binary summaries in addition to O-E-V data; we analysed each outcome per dataset separately to assess whether differences exist due to different characteristics of the outcomes. We also examined whether the information obtained from "O-E-V" data was based on aggregate data or IPD by examining the individual Cochrane reviews.
Eligibility criteria
RMT (for "binary" data) and TS (for "O-E-V" data) initially extracted these data and conducted cleaning, including examination of the outcome classification; TS repeated the "binary" data extraction to confirm that the information obtained was accurate, and RMT confirmed the choice of included meta-analyses obtained from the "O-E-V" data extraction. Both datasets could contribute more than one meta-analysis per Cochrane review. RMT and TS identified 46 misclassifications due to disagreement with the original outcome classification as listed in the datasets, conflicting information in the database, or unavailability of the correct version of the Cochrane review. We excluded 1,284 studies with double-zero events, since these do not contribute to the meta-analysis results [12, 13]. We removed another 359 meta-analyses with fewer than three studies, because some of the models applied below (i.e. generalised linear mixed models) are affected by estimation issues and inevitable failures with small numbers of studies [14]; hence we wanted to make fair comparisons between the models applied. Derivation of the analysis sample is provided in Fig. 1.
Descriptive statistics
We describe the number of studies per meta-analysis, the number of events and the study size by the median and interquartile range. We also report the number of medical specialties, and the median number of events (with interquartile range) per medical specialty.
Model description for “binary” data
We used the following meta-analysis models to analyse the data on the OR or HR scale. The first was a model proposed for "binary" data (assuming a binomial likelihood with a logit link), which is based only on the number of patients and the number of events which occurred. Interpretation of the treatment effect is in terms of the logarithm of an OR.
In the second approach, we modelled the binary data using a normal approximation to the binomial likelihood with a complementary log-log link (cloglog), where interpretation of the treatment effect was based on the logarithm of a HR. This method is also based only on the number of patients and events which occurred, and ignores censoring and the time element; however, it is closely related to continuous-time models, has a built-in proportional hazards assumption, and therefore has important application in survival analysis [6].
Fitting two-stage random-effects models for "binary" data
Prior to fitting the two-stage random-effects models, study arms with zero events were identified in the "binary" data. For 771 studies, a "treatment arm" continuity correction was applied as proposed by Sweeting et al. [15], constrained to sum to one, as this ensures that the same amount of information is added to each study.
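A correction of this kind can be sketched as follows. This is an illustrative Python sketch, not the code used in the paper; the function name is hypothetical, and the exact formulation (corrections proportional to the reciprocal of the opposite arm's sample size, normalised to sum to one) is our reading of the approach in Sweeting et al. [15].

```python
def treatment_arm_correction(n_t, n_c):
    """Continuity corrections for a study with a zero-event arm.

    Each arm's correction is proportional to the reciprocal of the
    *opposite* arm's sample size, and the two corrections are
    constrained to sum to one so that every study receives the same
    total amount of added information.
    (Sketch of the "treatment arm" correction of Sweeting et al. [15];
    the exact formulation here is an assumption.)
    """
    k_t = (1 / n_c) / (1 / n_c + 1 / n_t)  # added to both cells of the treatment arm
    k_c = (1 / n_t) / (1 / n_c + 1 / n_t)  # added to both cells of the control arm
    return k_t, k_c

# With balanced arms this reduces to the familiar 0.5 / 0.5 correction:
kt, kc = treatment_arm_correction(100, 100)
```

With unbalanced arms the corrections differ (e.g. `treatment_arm_correction(200, 100)` gives 2/3 and 1/3) while still summing to one.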
Let \(i=1,2,\dots,n\) denote the study. The estimated log odds and log hazard ratios were given by:
$$y_i=\left\{\begin{array}{ll}\log\left(\frac{A_i}{B_i}\right)-\log\left(\frac{C_i}{D_i}\right)\ \mathrm{for\ ORs} & \qquad(1)\\ \log\left[-\log\left(1-P_{Ti}\right)\right]-\log\left[-\log\left(1-P_{Ci}\right)\right]\ \mathrm{for\ HRs} & \qquad(2)\end{array}\right.$$
where \(A_i, C_i\) represented the number of events and \(B_i, D_i\) the number of non-events in the treatment and control groups respectively, \(P_{Ti}=\frac{A_i}{A_i+B_i}\) was the proportion of events on the treatment arm of the \(i^{th}\) study, and \(P_{Ci}=\frac{C_i}{C_i+D_i}\) was the proportion of events on the control arm of the \(i^{th}\) study.
The corresponding variances were given by:
$$s_i^2=\left\{\begin{array}{ll}\frac{1}{A_i}+\frac{1}{B_i}+\frac{1}{C_i}+\frac{1}{D_i}\ \mathrm{for\ ORs} & \qquad(3)\\ \left(\frac{1}{\log\left(1-P_{Ti}\right)\left(P_{Ti}-1\right)}\right)^2\frac{P_{Ti}\left(1-P_{Ti}\right)}{A_i+B_i}+\left(\frac{1}{\log\left(1-P_{Ci}\right)\left(P_{Ci}-1\right)}\right)^2\frac{P_{Ci}\left(1-P_{Ci}\right)}{C_i+D_i}\ \mathrm{for\ HRs} & \qquad(4)\end{array}\right.$$
Equations 2 and 4 provide a HR estimate via the complementary log-log link, considered a useful link function for discrete-time hazards models as recommended by Hedeker et al. [7] and Singer et al. [6]. We estimated the study-specific log odds ratios or log hazard ratios \(y_i\) and their within-study variances \(s_i^2\) as shown above, and fitted a standard two-stage random-effects model to these. Additionally, we obtained the \(I^2\) statistic from the fitted models as follows:
$${I}^{2}=\frac{{\widehat{\tau }}^{2}}{{\widehat{\tau }}^{2}+{\widehat{\sigma }}^{2}}$$
where \({\tau }^{2}\) denotes the variance of the underlying true effects across studies and \({\sigma }^{2}\) the typical within-study variance.
To avoid downward bias in the variance component estimates, we used the REML estimator for model implementation [16]. The models were implemented via the "rma.uni" command from the "metafor" package in R. We also fitted one-stage random-effects models for the "binary" data. The methods related to one-stage meta-analysis models, and the corresponding code, are available in Additional file 1.
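The two-stage pipeline above (study-level estimates from Eqs. 1-4, then random-effects pooling and \(I^2\)) can be sketched as follows. The paper fits these models with REML via `rma.uni` in the R package metafor; the dependency-free Python sketch below substitutes the DerSimonian-Laird moment estimator for \(\tau^2\), so its numerical results will differ slightly from REML.

```python
import math

def study_estimates(a, b, c, d, scale="OR"):
    """Study-level estimate y_i and variance s_i^2 from a 2x2 table
    (a, c = events; b, d = non-events in treatment/control), per Eqs. 1-4."""
    if scale == "OR":
        y = math.log(a / b) - math.log(c / d)
        s2 = 1 / a + 1 / b + 1 / c + 1 / d
    else:  # "HR" via the complementary log-log transformation
        p_t, p_c = a / (a + b), c / (c + d)
        y = math.log(-math.log(1 - p_t)) - math.log(-math.log(1 - p_c))
        s2 = (1 / (math.log(1 - p_t) * (p_t - 1))) ** 2 * p_t * (1 - p_t) / (a + b) \
           + (1 / (math.log(1 - p_c) * (p_c - 1))) ** 2 * p_c * (1 - p_c) / (c + d)
    return y, s2

def random_effects(ys, s2s):
    """Two-stage random-effects pooling with a DerSimonian-Laird tau^2
    (a stand-in for the paper's REML fit) and the I^2 statistic, using
    the Higgins-Thompson 'typical' within-study variance for sigma^2."""
    w = [1 / s2 for s2 in s2s]
    s1, s2sum = sum(w), sum(wi ** 2 for wi in w)
    fixed = sum(wi * yi for wi, yi in zip(w, ys)) / s1
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, ys))
    tau2 = max(0.0, (q - (len(ys) - 1)) / (s1 - s2sum / s1))
    w_star = [1 / (v + tau2) for v in s2s]
    pooled = sum(wi * yi for wi, yi in zip(w_star, ys)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    sigma2 = (len(ys) - 1) * s1 / (s1 ** 2 - s2sum)  # typical within-study variance
    i2 = tau2 / (tau2 + sigma2)
    return pooled, se, tau2, i2
```

For example, `study_estimates(10, 90, 20, 80, "OR")` returns the log OR \(\log(10/90)-\log(20/80)\) with variance \(1/10+1/90+1/20+1/80\), and feeding several such pairs into `random_effects` gives the pooled estimate, its standard error, \(\hat\tau^2\) and \(I^2\).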
Model description for "O-E-V" data
For the "O-E-V" data, the "O-E" and "V" statistics were available in the Cochrane database alongside the number of patients and events. These data came either from published reports or from IPD; TS examined the individual reviews from the Cochrane database and assessed the data origin. Since more information was available for these data, the following three models were applied, using only two-stage meta-analysis models.
Similarly to the "binary" data, we initially analysed the "O-E-V" data as "binary" and modelled them as described in detail in the preceding section. We also used the log-rank observed minus expected events (O-E) and log-rank variance (V) statistics, calculated previously from the number of events and the individual times to event on each arm of the trial; we used the log-rank approach [17] in order to obtain another type of HR estimate. We used random-effects models to analyse the data throughout, including between-study heterogeneity to account for variation across studies.
Fitting two-stage random-effects models for "O-E-V" data
Similarly to the "binary" data, the estimated log odds and log hazard ratios were given by Eqs. 1 and 2 for the binary summaries, while the "O-E" and "V" statistics were used as follows:
$$y_i=\frac{O-E}{V}\ \mathrm{for\ HRs}\qquad(5)$$
where \(O-E\) denotes the log-rank observed minus expected events statistic.
The corresponding variances were given by Eqs. 3 and 4 for the binary summaries, while for the "O-E" and "V" statistics as follows:
$$s_i^2=\frac{1}{V}\ \mathrm{for\ HRs}\qquad(6)$$
where \(V\) denotes the variance of the log-rank statistic. We used the REML estimator for model implementation [16], and the models were implemented via the "rma.uni" command from the "metafor" package in R.
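Equations 5 and 6 amount to a one-line transformation of the study-level statistics, sketched here in Python for illustration (the O-E and V values below are hypothetical, not from the dataset):

```python
import math

def logrank_hr(o_minus_e, v):
    """Study-level log HR and its variance from the log-rank
    O-E and V statistics (Eqs. 5 and 6)."""
    y = o_minus_e / v   # estimated log hazard ratio
    s2 = 1 / v          # within-study variance of the log HR
    return y, s2

# Hypothetical study: O-E = -4.2 with log-rank variance V = 10.5
y, s2 = logrank_hr(-4.2, 10.5)
hr = math.exp(y)  # back-transformed to the HR scale for reporting
```

These \((y_i, s_i^2)\) pairs then enter the same two-stage random-effects model as the binary summaries.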
Model comparison for “binary” data
The following model comparisons were performed. For the "binary" dataset, we examined whether the results from analysing survival data as binary on an OR scale are similar to the results from analysing on the HR scale using the cloglog link, under both two-stage and one-stage models. For presentation purposes, we present only the comparisons under two-stage models in the main paper (and those under one-stage models in Additional file 1), in order to assess the discrepancies between the model using the logit link and the model using the complementary log-log link.
First, we examined the proportion of significant and non-significant meta-analytic pooled effect estimates under the different scales used (OR vs HR scale); we identified the number of meta-analyses which were significant under one scale and non-significant under the other at a two-sided 5% level of significance.
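This count of discordant meta-analyses can be sketched as follows; the function name and the example z-values are hypothetical, used purely to illustrate the comparison.

```python
def discordant_significance(z_scale_a, z_scale_b, z_crit=1.96):
    """Count meta-analyses significant on one scale but not the other,
    given standardised pooled estimates (estimate / SE) on each scale
    and a two-sided 5% critical value by default."""
    return sum((abs(a) > z_crit) != (abs(b) > z_crit)
               for a, b in zip(z_scale_a, z_scale_b))

# Of these four hypothetical OR/HR pairs, the first and last disagree
n = discordant_significance([2.1, 1.2, 2.5, 0.4], [1.8, 1.1, 2.6, 2.0])
```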
Bland-Altman plots with associated 95% limits of agreement were constructed, with the aim of facilitating interpretation of results and producing fair comparisons between the two scales [18]. To create these plots, results were standardised by dividing the logarithm of the estimate by its standard error. Plots were produced for the standardised treatment effect estimates and for the \({I}^{2}\) statistics. \({I}^{2}\) represents the percentage of variability that is due to between-study heterogeneity rather than chance; \({I}^{2}\) values range from 0 to 100%. This measure was chosen for model comparison as it enables us to compare results directly between the two scales used. The variance of underlying true effects across studies (\({\tau }^{2}\)) was not used as it does not allow direct comparison between different outcome measures.
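The quantities behind such a plot (per-pair means and differences, the mean difference, and limits of agreement at mean ± 1.96 SD of the differences) can be computed as in this illustrative sketch; the input values are hypothetical standardised estimates, not results from the paper.

```python
import math

def bland_altman(x, y):
    """Bland-Altman summary for paired standardised estimates on two
    scales: per-pair means and differences, the mean difference (bias),
    95% limits of agreement, and indices of pairs outside those limits."""
    diffs = [a - b for a, b in zip(x, y)]
    means = [(a + b) / 2 for a, b in zip(x, y)]
    n = len(diffs)
    bias = sum(diffs) / n
    sd = math.sqrt(sum((d - bias) ** 2 for d in diffs) / (n - 1))
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)  # 95% limits of agreement
    outliers = [i for i, d in enumerate(diffs) if not loa[0] <= d <= loa[1]]
    return means, diffs, bias, loa, outliers

# Hypothetical standardised estimates on the OR and HR scales
means, diffs, bias, loa, outliers = bland_altman(
    [1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

A scatter of `diffs` against `means`, with horizontal lines at `bias` and the two `loa` values, reproduces the familiar plot; `outliers` flags the meta-analyses outside the limits, which are the ones examined further below.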
We identified "outliers" as meta-analyses outside the 95% limits of agreement, and we examined their characteristics. The meta-analysis characteristics we examined were the following:

between-scale differences in the magnitude of the pooled treatment effect estimate and its 95% confidence intervals

the levels of within-study standard error, between-study heterogeneity and study weights in the meta-analysis

study-specific event probabilities and baseline risk
We summarised these differences by meta-analysis and reported the characteristics most strongly associated with substantial differences between OR pooled effect estimates and the corresponding HR pooled effect estimates.
Model comparison for "O-E-V" data
For the "O-E-V" dataset, comparisons on the overall survival and progression/disease-free survival outcomes were conducted separately; this was because differences between these outcomes might be observed in the presence of different disease severities, which would be associated with different lengths of follow-up and different risks of the outcome.
For both outcomes, we performed comparisons by examining the differences between analysing the data as binary on an OR scale, analysing the data as binary using the cloglog link on a HR scale, and analysing the data using the "O-E" and "V" statistics on a HR scale. We assessed whether the differences observed from analysing the data as binary on an OR scale could be reduced by the use of the cloglog link. We present only comparisons of the results under two-stage models, since no IPD were available to perform comparisons under one-stage models.
Similarly to the "binary" data, we examined the proportion of significant and non-significant meta-analytic pooled effect estimates under the different scales used, and identified the number of meta-analyses which were significant under one scale and non-significant under the other. We created Bland-Altman plots for the standardised treatment effect estimates and for the \({I}^{2}\) statistics to explore agreement among the methods, producing fair comparisons between the scales [18]. Meta-analyses outside the 95% limits of agreement were examined for their characteristics.