A sensitive search of PubMed was performed to identify RCTs published between 1 January 2003 and 31 December 2005 in four leading medical journals: BMJ, JAMA, Lancet, and the New England Journal of Medicine (NEJM). The search was limited to citations with abstracts. The search strategy is available in Additional file 1.
The full texts of all retrieved RCTs were then screened to identify eligible articles, i.e. RCTs including at least one Kaplan-Meier plot presenting a comparison of two or more therapies. One Kaplan-Meier plot from each eligible article was assessed, preferably a plot displaying the outcome "all-cause mortality" (or a composite outcome including all-cause mortality). If no mortality outcome was reported, the plot for the primary endpoint was used.
Data were extracted using an extraction form that is available from the authors on request. The items extracted were: (1) definition and number of events of interest and competing events; (2) information on numbers of patients (for each group separately, if possible) (a) randomised, (b) analysed, (c) with incomplete follow-up, and (d) at risk; (3) minimum duration of follow-up (preferably the actual duration, or if not available, either the duration estimated by means of the period between end of enrolment and end of study or the planned duration).
In articles including information on all items above, the numbers of LTFU patients in each group can be inferred from the Kaplan-Meier plot if numbers at risk are given at a time point before minimum follow-up. These articles were classified as "assessable". In some articles details on LTFU can also be inferred even if information on some items is missing. For instance, in small studies each patient can be identified in the plot. These articles were also classified as "assessable". The remaining publications were classified as "not assessable".
Assessable articles underwent further evaluation. At the last time point with information on numbers at risk before the time of minimum follow-up ("time point t"), the survival probability was read from the curve. As no patient should be censored before time point t, the Kaplan-Meier curve up to this point represents 1 minus the empirical failure distribution function. The number of patients who ought still to be at risk at time point t can therefore be calculated by multiplying the survival probability by the number of randomised patients (see Figure 1 for an example calculation). If the calculated number of patients at risk was higher than the number at risk reported in the figure legend, we tried to resolve this discrepancy by considering information on LTFU reported in the text. If the outcome of interest was not "all-cause mortality" (or a composite outcome including all-cause mortality), the number of competing events was also considered. Articles were classified as "consistent" if the calculated numbers matched the reported numbers at risk. Articles were classified as "not consistent" (see Figure 2 for an example calculation) if (a) inconsistencies were noted between the LTFU information derived from the plot and that given in the text, (b) LTFU information could be derived from the plot but no further information was provided in the text, or (c) the calculated number at risk was larger than the reported one.
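The calculation at time point t can be illustrated with a minimal sketch. The function names, the rounding to whole patients, and the numbers in the usage note are hypothetical and are not taken from Figure 1 or Figure 2; the sketch only mirrors the arithmetic described above.

```python
def expected_at_risk(survival_prob, n_randomised):
    """Number of patients who ought still to be at risk at time point t.

    Valid only if no patient is censored before t, so that the
    Kaplan-Meier estimate equals 1 minus the empirical failure
    distribution function.
    """
    # Rounding to whole patients is an assumption of this sketch.
    return round(survival_prob * n_randomised)


def unexplained_difference(survival_prob, n_randomised, reported_at_risk,
                           reported_ltfu=0, competing_events=0):
    """Calculated minus reported number at risk, after subtracting the
    LTFU reported in the text and (for non-mortality outcomes) the
    competing events. A positive result suggests unreported LTFU."""
    expected = expected_at_risk(survival_prob, n_randomised)
    return expected - reported_at_risk - reported_ltfu - competing_events
```

For example, with 200 randomised patients and a survival probability of 0.85 read from the curve, `expected_at_risk(0.85, 200)` gives 170. If the plot reports 158 patients at risk and the text reports 5 LTFU, `unexplained_difference(0.85, 200, 158, reported_ltfu=5)` gives 7 patients whose absence from the risk set is not explained.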
All articles were assessed by either EV or MK. A subset of articles (those published in 2005) was assessed by both authors and no relevant discrepancies in the assessment were noted. Articles that were classified as "not consistent" and articles where classification was initially unclear were reassessed by a second reviewer (MK, EV, TK, or GS). Disagreement was resolved by consensus.
To evaluate the robustness and validity of the study results, sensitivity analyses were performed for all publications classified as "not consistent". In these publications the calculated number of patients at risk was higher than the number reported in the Kaplan-Meier plot, and the difference could not be explained by the reported LTFU. We aimed to assess the potential risk of bias caused by this discrepancy. For this purpose, we generated a 2 × 2 contingency table for time point t (the last time point with numbers at risk before minimum follow-up, as defined above) by calculating the number of events of interest up to this time, and then performed a χ² test. We then generated a second contingency table in which the unreported LTFU, i.e. the difference between calculated and reported numbers at risk minus the reported LTFU, was imputed. If no LTFU was reported, the reported LTFU was assumed to be zero and the total difference was imputed. Three imputation scenarios were considered. In the equal-case scenario, the unreported LTFU data were imputed as "event" in both groups. In the worst-case scenario, unreported LTFU data were imputed as "event" in the test group and "no event" in the control group; in the best-case scenario, the imputation was reversed. We classified a treatment effect as "robust" if, after imputation, the effect estimate did not change direction and the corresponding p-value remained either significant or not significant (α = 5%).
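The sensitivity analysis can be sketched as follows. Several choices here are assumptions of the sketch and not details stated in the text: the baseline table is taken to exclude the unreported LTFU patients, the χ² test is Pearson's test without continuity correction, the direction of the effect is judged by the sign of the risk difference, and all function names and numbers are hypothetical.

```python
from math import erfc, sqrt


def chi2_2x2(table):
    """Pearson chi-squared test (1 df, no continuity correction) for
    [[events_test, no_events_test], [events_ctrl, no_events_ctrl]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # degenerate table: no evidence of a difference
    stat = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    return stat, erfc(sqrt(stat / 2))


def risk_difference(table):
    """Event proportion in the test group minus that in the control group."""
    (a, b), (c, d) = table
    return a / (a + b) - c / (c + d)


def impute(table, unreported, scenario):
    """Add unreported LTFU (test, control) to the table cells."""
    (a, b), (c, d) = table
    u_test, u_ctrl = unreported
    if scenario == "equal":   # "event" in both groups
        return [[a + u_test, b], [c + u_ctrl, d]]
    if scenario == "worst":   # "event" in test, "no event" in control
        return [[a + u_test, b], [c, d + u_ctrl]]
    if scenario == "best":    # "no event" in test, "event" in control
        return [[a, b + u_test], [c + u_ctrl, d]]
    raise ValueError(f"unknown scenario: {scenario}")


def is_robust(table, unreported, scenario, alpha=0.05):
    """True if direction and significance status are both unchanged."""
    def sign(x):
        return (x > 0) - (x < 0)

    imputed = impute(table, unreported, scenario)
    same_direction = sign(risk_difference(table)) == sign(risk_difference(imputed))
    same_significance = (chi2_2x2(table)[1] < alpha) == (chi2_2x2(imputed)[1] < alpha)
    return same_direction and same_significance
```

With a hypothetical baseline table `[[30, 170], [50, 150]]` and 10 unreported LTFU patients per group, the effect remains significant in the equal-case and best-case scenarios but loses significance in the worst-case scenario, so `is_robust(table, (10, 10), "worst")` is `False`.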