 Software
 Open Access
IPDfromKM: reconstruct individual patient data from published Kaplan-Meier survival curves
BMC Medical Research Methodology volume 21, Article number: 111 (2021)
Abstract
Background
When applying secondary analysis to published survival data, it is critical to obtain each patient’s raw data, because the individual patient data (IPD) approach is considered the gold standard of data analysis. However, researchers often lack access to IPD. We aim to propose a straightforward and robust approach to obtain IPD from published survival curves with a user-friendly software platform.
Results
Improving upon existing methods, we propose an easy-to-use, two-stage approach to reconstruct IPD from published Kaplan-Meier (KM) curves. Stage 1 extracts raw data coordinates and Stage 2 reconstructs IPD using the proposed method. To facilitate the use of the proposed method, we developed the R package IPDfromKM and an accompanying web-based Shiny application. Both the R package and Shiny application have an “all-in-one” feature such that users can use them to extract raw data coordinates from published KM curves, reconstruct IPD from the extracted data coordinates, visualize the reconstructed IPD, assess the accuracy of the reconstruction, and perform secondary analysis on the basis of the reconstructed IPD. We illustrate the use of the R package and the Shiny application with KM curves from published studies. Extensive simulations and real-world data applications demonstrate that the proposed method has high accuracy and great reliability in estimating the number of events, number of patients at risk, survival probabilities, median survival times, and hazard ratios.
Conclusions
IPDfromKM has great flexibility and accuracy to reconstruct IPD from published KM curves with different shapes. We believe that the R package and the Shiny application will greatly facilitate the potential use of quality IPD and advance the use of secondary data to support informed decision making in medical research.
Background
Typical information used for meta-analysis of survival data reported from clinical trials often includes a summary of outcomes for each arm, including but not limited to hazard ratios and Kaplan-Meier (KM) curves along with the number of patients at risk [1]. When applying secondary analysis to such published survival data, difficulties usually arise from insufficient detail in the reported data, which are often presented as aggregated summary statistics. For example, when conducting a meta-analysis of time-to-event data, the proportional hazards assumption may not hold, and alternative measures of the survival difference are needed to avoid bias [2, 3]. In this case, it is of great importance to obtain individual patient data (IPD), with which one can not only perform the standard survival analysis, but also assess whether the assumption of the original method is appropriate in order to decide whether alternative methods should be applied [4]. Furthermore, one can undertake additional subgroup analyses not reported in the aggregated data. For this reason, the IPD approach is considered the gold standard in data analysis.
However, researchers conducting meta-analysis or other secondary analyses may lack access to the IPD, partly due to the confidentiality of clinical data. Therefore, a method that can reconstruct IPD from published KM curves can greatly facilitate secondary analyses of survival data. Several methods have been reported in the literature. The iterative algorithm based on the KM estimation method (referred to as “iKM” hereafter) proposed by Guyot et al. [5] is a classic approach among those proposed and has been used in various secondary analyses. For example, Satagopan et al. [6] used it to reconstruct time-to-event data from a melanoma data set to evaluate treatment benefits across biomarker subgroups. Wei and Royston [7] developed a STATA function that applies the iKM algorithm with some adaptations and used it to reconstruct IPD from KM curves from multiple trials.
However, a few factors limit the use of the iKM method. First, external software is needed to extract data points before using the iKM algorithm. The original iKM method suggests a manual approach that picks up a sufficient number of points from KM curves via mouse-clicks using the DigitizeIt software.
Second, the iKM method imposes strict requirements when picking up the points manually: (1) the survival probability must decrease monotonically as time increases, (2) the points where the number of patients at risk is reported must be included, and (3) users need to sort the data coordinates into time intervals determined by the time points at which the number of patients at risk is reported. KM curves in publications typically report the number of patients at risk at several time points under the x-axis. If the number of patients at risk is reported at months 3, 6, and 9, then the data coordinates need to be manually organized into the time intervals [0, 3), [3, 6), [6, 9), and [9, +∞). The first two requirements are hard to satisfy across a large number of manual mouse-clicks, and the last one is time consuming. Thus, it is important to have a flexible function that performs these tasks automatically.
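The binning described in requirement (3), which the original iKM method leaves to the user, can be expressed in a few lines of base R (the coordinate values below are made up for illustration):

```r
# Hypothetical extracted coordinates (time, survival probability)
times <- c(0.4, 1.2, 2.8, 3.5, 5.1, 6.9, 8.2, 9.7)
surv  <- c(0.98, 0.95, 0.90, 0.86, 0.80, 0.74, 0.70, 0.66)

# Time points at which the number of patients at risk is reported
trisk <- c(3, 6, 9)

# findInterval() assigns each coordinate to one of the intervals
# [0, 3), [3, 6), [6, 9), [9, +Inf)
interval <- findInterval(times, trisk) + 1
split(data.frame(times, surv), interval)
```

This sorting (along with the monotonicity check) is exactly what the software described below automates.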
Third, there is a numerical issue in the original iKM method: estimates can take negative values in certain scenarios. Our examination reveals that this is due to a boundary setup in the iterative process.
Finally, there is a lack of user-friendly software for clinical researchers to apply this method. The published functions [5, 6, 8] for the iKM method each have at least one of the following inconveniences: external digitizing software is needed for data extraction; users need to manually check whether the data input is appropriate, or the automated processing is inconvenient or unstable; or no accuracy assessment is provided for users to evaluate the reconstruction results directly within the same software.
To overcome the aforementioned limitations, we propose a two-stage modified-iKM approach, which provides an improved, accurate, user-friendly, and stable workflow to reconstruct IPD. In this approach, we not only relax the strict requirements for data input, but also improve the robustness and stability of the original iKM method. More importantly, we develop an all-in-one software platform that allows clinicians and medical researchers to go from a single KM curve image directly to reconstructed IPD, without using additional software to aid data extraction or manually adjusting the data before reconstruction. The software includes an R package and a web-based Shiny application. Specifically, users can use either the R package or the Shiny application to (1) extract data coordinates from published KM curves; (2) preprocess extracted data points; (3) run the modified-iKM algorithm to estimate the number of patients at risk, the number of events, and the number of censored outcomes for each prespecified interval, and to reconstruct IPD; (4) generate graphs and statistical summaries for users to evaluate the accuracy of the reconstruction process; and (5) conduct survival analysis based on the reconstructed IPD. The R package is beneficial for users with some programming skills, as they can perform the reconstruction of IPD within R or build future research on the package. The advantages of the Shiny application are that it has a point-and-click interface, does not require installing any software, and can be used on any machine with an internet browser. This feature is appealing for clinicians and medical researchers who are not necessarily familiar with statistical programming. Extensive simulations and real-world data applications demonstrate that the proposed method has high accuracy and great reliability in estimating the parameters of interest, e.g., median survival, hazard ratio, and survival probability.
Our work is of great importance because it not only supplements and strengthens the original iKM algorithm, but also provides user-friendly, all-in-one software available on different platforms to accommodate the needs of researchers with or without statistical programming familiarity. The availability of the two software platforms can further enhance quality extraction of IPD for meta-analysis and other secondary analyses using time-to-event data.
Implementation
The modified-iKM algorithm for IPD reconstruction is a two-stage process, as shown in Fig. 1. Stage 1 aims to extract quality data coordinates (time, survival probability) from KM curves. In Stage 2, the data coordinates are preprocessed and IPD is reconstructed using the proposed iterative algorithm.
Coordinates extraction in Stage 1
Data coordinates can be extracted from KM curves using the R package and Shiny application we developed (details in the “Results” section). A number of other software options are also available on Windows or Mac operating systems to digitize graphs. Commonly used options are DigitizeIt (http://www.digitizeit.de/), ScanIt (https://www.amsterchem.com/scanit.html), and Plot Digitizer (http://plotdigitizer.sourceforge.net/). Extensive applications to real-world trial examples and simulated KM curves show that data extracted using the different approaches yield comparable results during IPD reconstruction in Stage 2 (more details in the “Implementation” and “Simulation result” sections). We provide video tutorials on how to extract data coordinates using these tools in our Shiny application.
To ensure accuracy of estimation in Stage 2, it is critical to extract quality data coordinates in Stage 1. We recommend the following when extracting data points:

1.
Use Adobe Illustrator to separate the curves before extracting data points, when there are multiple KM curves in the same figure and the curves are tangled together. A video tutorial on this is included in the “Shiny application” section.

2.
Extract as many points as possible.

3.
Make sure the data points extracted are as evenly distributed as possible on the KM curves.

4.
Click on both the top and bottom points of the vertical segment at points where survival probability drops.
IPD reconstruction in Stage 2
The IPD reconstruction is carried out using the modified-iKM algorithm, which is based on the KM estimation method [9] and improves upon the iKM algorithm [5]. We provide an overview of the algorithm in this section and delineate it in detail in the Appendix (Modified-iKM algorithm in detail). Let T_{k} and S_{k} denote the time and survival probability, respectively, at point k. The data points extracted in Stage 1 typically form an N×2 table, with each row being (T_{k}, S_{k}), for k=1,2,...,N, where N is the number of data points extracted. The IPD reconstruction consists of two main steps as follows:

1.
Process the raw data.

a
Sort the data by T_{k}.

b
Make monotonicity adjustment by first using Tukey’s fence [10] to detect and remove unreasonable inputs, and then ensure that survival probability decreases over time (known as “force monotonicity” hereafter).

c
Perform step control.


2.
Reconstruct IPD using an iterative algorithm adapted from the iKM method.

a
Estimate the KM parameters at each coordinate k, including the number of patients at risk (\(\hat {n}_{k}\)), the number of patients censored (\(\widehat {cens}_{k}\)), and the number of events (\(\hat {d}_{k}\)), for \(k=1,\cdots, N\). In this step, we modified the boundary setup for the number of censored observations to prevent aberrant estimates in some scenarios (more details in the Appendices: Modified-iKM algorithm in detail and Robustness and stability of the modified-iKM algorithm).

b
Construct IPD using the parameters estimated from Step (2a).

Tukey’s fence in Step (1b) is a non-parametric method for outlier detection. It creates a “fence” boundary TF=[Q_{1}−k(Q_{3}−Q_{1}), Q_{3}+k(Q_{3}−Q_{1})], where Q_{1} and Q_{3} are the first and third quartiles of the data points, and k is a constant that commonly takes the value 1.5 or 3. We used k=3 in our algorithm so as not to accidentally delete useful data points. Any point outside of TF is considered an outlier. The illustrative example in Appendix: Robustness and stability of the modified-iKM algorithm shows that the use of Tukey’s fence in Step (1b) improves the robustness of the algorithm in the presence of outliers (Fig. 5 in Appendix). Step (1c) safeguards the algorithm against a potential underestimation problem (details in Appendix: Modified-iKM algorithm in detail). Step (2a) enhances the stability of the iterative algorithm. In Appendix: Robustness and stability of the modified-iKM algorithm, we demonstrate the robustness and stability of the modified-iKM algorithm using two data applications.
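As a minimal base-R sketch of the rule above (the vector of survival-probability drops is made up; in the actual algorithm the fence is applied during preprocessing in Step (1b)):

```r
# Tukey's fence: flag points outside [Q1 - k*IQR, Q3 + k*IQR] as outliers
tukey_fence <- function(x, k = 3) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr   # TRUE marks an outlier
}

drops <- c(0.02, 0.03, 0.01, 0.02, 0.45, 0.02)  # one aberrant drop
tukey_fence(drops)
# only the fifth value (0.45) lies outside the fence and would be removed
```

With k = 3 the fence is deliberately wide, so only clearly aberrant inputs are discarded.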
To assess the accuracy of the modified-iKM algorithm, we employ several metrics and make them easily accessible through our developed software. First, we provide graphs to visualize the reconstructed results: (1) the estimated survival probability at each read-in time point based on the reconstructed IPD, compared with the corresponding read-in survival probabilities, and (2) the estimated number of patients at risk (when given), compared with the reported values. Second, we provide several summary statistics to aid the accuracy assessment. One is the root mean square error (RMSE), which measures the difference in survival probabilities calculated using the reconstructed data and the original data. Additionally, the mean absolute error and the max absolute error are provided to assess the precision of the estimation. Upon careful assessment of the empirical distributions of the RMSE, max absolute error, and mean absolute error for all of the real trial examples and simulated curves considered later in this study, we recommend the following thresholds: RMSE ≤0.05, mean absolute error ≤0.02, and max absolute error ≤0.05, to indicate that the extracted data points are sufficiently well captured for subsequent analyses. Please note that these recommendations are based on a limited number of examples. In practice, users can use these thresholds as a rule of thumb, along with their own judgement, to determine whether they need to re-extract the data points. Third, we use the Kolmogorov-Smirnov test to compare the distributions of the read-in and the estimated survival curves. A large p-value is desired, as it indicates a lack of statistical evidence of discrepancy between the read-in and the estimated survival curves.
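The summary statistics above can be sketched in base R as follows (the two probability vectors are made-up stand-ins for the read-in and reconstructed survival probabilities at matched time points):

```r
# Hypothetical survival probabilities at matched time points
s_readin <- c(0.95, 0.88, 0.80, 0.71, 0.64)
s_recon  <- c(0.94, 0.89, 0.79, 0.72, 0.63)

diff <- s_recon - s_readin
c(RMSE     = sqrt(mean(diff^2)),   # root mean square error
  mean.abs = mean(abs(diff)),      # mean absolute error
  max.abs  = max(abs(diff)))       # max absolute error
# here all three equal 0.01, well within the recommended thresholds
```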
Results
R package
To facilitate the use of the modified-iKM method, we developed an R package called “IPDfromKM” with R version 3.6.0. The package is available via the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=IPDfromKM. The package contains several functions; the descriptions and objects returned by the functions are presented in Table 1. We provide an example for each of the functions below.
Extract data coordinates
The getpoints() function is used to extract data coordinates from published KM curves. The function has the following arguments:

f: the KM curves in a bitmap image (e.g., .png, .jpeg, .bmp, .tiff).

x1: the actual label of the leftmost point on the x-axis.

x2: the actual label of the rightmost point on the x-axis.

y1: the actual label of the lowest point on the y-axis.

y2: the actual label of the highest point on the y-axis.
For the image, the use of a .png file is highly recommended, since it can shorten the processing time in R. In addition to the image itself, two x-coordinates (x1 and x2) and two y-coordinates (y1 and y2) are needed to determine the location and scale of the coordinate system. Below is an example of reading in an image for data extraction, where 60 is the rightmost label on the x-axis.
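A call of this kind might look as follows (the file name is hypothetical; the calibration values assume x-axis labels running from 0 to 60 and y-axis labels from 0 to 100):

```r
library(IPDfromKM)

# Read a bitmap image of the KM curve; x1/x2 and y1/y2 calibrate the
# axes so that subsequent pixel clicks map back to (time, survival) values
points <- getpoints(f = "KMcurve.png", x1 = 0, x2 = 60, y1 = 0, y2 = 100)
```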
After the file is read into R, instructions will be provided in the R console to guide the extraction of data points. Specifically, users will be guided to click on the leftmost and rightmost points on the x-axis, and on the lowest and highest points on the y-axis. Then they can collect the data coordinates by mouse-clicking on the curve. To get a desirable estimation, we suggest that users follow recommendations 2-4 in the “Coordinates extraction in Stage 1” section. The data points extracted will be returned as a two-column data set (e.g., points in the example), and this data set can be used as the input of the preprocess() function described in the “Process data coordinates” section.
In the following text, we use the built-in data set Radiationdata to demonstrate the use of the package to preprocess raw data, reconstruct IPD, and conduct secondary analysis on the reconstructed IPD. The data set was extracted from published KM curves from a two-arm randomized controlled trial [11] using ScanIt. This study randomized 424 head and neck cancer patients to two treatment groups: 213 in the radiotherapy group (referred to as “radio”) and 211 in the radiotherapy plus cetuximab group (referred to as “radio_plus”). The primary outcome was the duration of locoregional control. There were 145 pairs of coordinates extracted for the radio treatment, and 136 pairs for the radio_plus treatment. Coordinates for each treatment were saved as a two-column table in Radiationdata: the first column contains the survival times, and the second column contains the survival probabilities reported as percentages. Radiationdata also includes the risk times (in months), trisk=(0,10,20,30,40,50,60,70), and the number of patients at risk at each risk time point (nrisk).
Process data coordinates
To prepare data to reconstruct IPD, we first preprocess the raw data points in an appropriate format. This can be done using the preprocess() function. This function contains the following arguments:
The output of the preprocess() function is a class object that can be used directly in the getIPD() function to construct IPD.
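As a sketch with the built-in Radiationdata set (the argument and component names below follow the package’s documentation; treat them as assumptions to check against your installed version):

```r
library(IPDfromKM)

# Coordinates for the radiotherapy arm, together with the reported
# risk-table information; maxy = 100 because survival probabilities
# in Radiationdata are reported as percentages
pre_radio <- preprocess(dat   = Radiationdata$radio,
                        trisk = Radiationdata$trisk,
                        nrisk = Radiationdata$nrisk.radio,
                        maxy  = 100)
```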
Reconstruct IPD
After the raw data is processed using the preprocess() function, we can use the getIPD() function to reconstruct the IPD. The getIPD() function has the following arguments:

prep: the class object returned from the preprocess() function.

armID: an arbitrary label used as the group indicator for the reconstructed IPD. Typically 0 for the control group and 1 for the treatment group.

tot.events: total number of events. Only available for some published curves, and the default value is NULL.
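Continuing the Radiationdata sketch (again, the component names are assumptions drawn from the package documentation):

```r
library(IPDfromKM)

pre_radio <- preprocess(dat   = Radiationdata$radio,
                        trisk = Radiationdata$trisk,
                        nrisk = Radiationdata$nrisk.radio,
                        maxy  = 100)

# armID = 0 labels this arm as the control group; tot.events is left
# NULL because the total number of events was not reported
ipd_radio <- getIPD(prep = pre_radio, armID = 0, tot.events = NULL)
```

The returned object can then be passed to summary() and plot() for the accuracy assessment.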
Accuracy assessment
To view the accuracy assessment results, simply call the summary() function. Because of page limits, we show only one example below.
We see that our algorithm can accurately estimate the numbers at risk and provide estimates of the number of events at each risk time. The small values for the RMSE (<0.05), mean absolute error (<0.02), and max absolute error (<0.05), along with the large p-value of the Kolmogorov-Smirnov test, show that the reconstructed IPD is accurate. Additionally, we can use the plot() function to graph the survival curves from the reconstructed IPD, and compare them with those generated using the original data points. The function takes the object returned by getIPD() directly.
The output of plot() for the radio group is provided in Fig. 6 (Appendix: Figures returned for accuracy assessment or secondary analysis), which shows three graphs: (1) KM curves based on the reconstructed IPD and the read-in data points, respectively; (2) the number of patients at risk using the reconstructed IPD versus the reported values; and (3) the difference in survival probability at different time points between the reconstructed IPD and the read-in data. When the interval information is not available while reconstructing the IPD, the second graph is not shown.
Secondary analysis
If survival analysis is of interest, we can run the survreport() function, which includes the following arguments:

ipd1: a three-column (i.e., time, status, treatment indicator) table of IPD for treatment 1.

ipd2: a three-column (i.e., time, status, treatment indicator) table of IPD for treatment 2.

arms: number of treatment arms (value of either 1 or 2).

interval: the time intervals for which the landmark survival probabilities are of interest. The default is at every 6 months.

s: the survival probabilities for which the corresponding survival times are of interest, e.g., s=0.5 means that the median survival time is desired.
Researchers working with clinical data are often interested in the survival times at which specified survival probabilities are reached (e.g., the median survival time, at which 50% of patients survive). The example below shows the survival times for the prespecified survival probabilities s=(0.50, 0.75, 0.95). This function also returns KM curves and cumulative risk in a figure for both groups (Fig. 7 in Appendix: Figures returned for accuracy assessment or secondary analysis).
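A two-arm sketch of such a call (the component names for the radio_plus arm are assumptions, and the reconstructed IPD tables are assumed to be stored in the $IPD element of the getIPD() output, as in the package’s documentation):

```r
library(IPDfromKM)

pre1 <- preprocess(dat = Radiationdata$radio, trisk = Radiationdata$trisk,
                   nrisk = Radiationdata$nrisk.radio, maxy = 100)
pre2 <- preprocess(dat = Radiationdata$radioplus, trisk = Radiationdata$trisk,
                   nrisk = Radiationdata$nrisk.radioplus, maxy = 100)
ipd1 <- getIPD(prep = pre1, armID = 0)
ipd2 <- getIPD(prep = pre2, armID = 1)

# Landmark survival every 6 months; survival times at which the
# survival probability reaches 0.50, 0.75, and 0.95
survreport(ipd1 = ipd1$IPD, ipd2 = ipd2$IPD, arms = 2,
           interval = 6, s = c(0.50, 0.75, 0.95))
```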
Shiny application
To facilitate the use of the modified-iKM algorithm to reconstruct IPD for a broader audience not necessarily familiar with statistical programming, we developed a user-friendly Shiny application, which is freely available at https://www.trialdesign.org/onepageshell.html#IPDfromKM. Compared with the R package, the Shiny application does not require any installation, is not limited by the hardware of users’ computers, and comes with a point-and-click feature such that no statistical programming skill is needed to use it [12]. In addition, we regularly maintain and update the application on the server, so users do not have to worry about software updates. Whenever users have questions, they can contact the authors directly through the Shiny application host website. Appendix: Long-term commitment of the Shiny application shows how users can directly contact the authors.
The application has a straightforward interface with three main panels: Data Extraction (used to extract data points), Reconstruct Individual Patient Data (IPD) (used to reconstruct IPD), and User Guide (providing extensive tutorials). Overall, the app has the capability to complete the following tasks:

1.
Extract data points from published KM curves.

2.
Process extracted data points.

3.
Reconstruct IPD using the modified-iKM algorithm.

4.
Assess the accuracy of the reconstruction.

5.
Perform survival analysis using the reconstructed IPD.

6.
Generate a concise report for the IPD reconstruction.

7.
Provide an extensive user guide for understanding the method and using the Shiny application.
Typical input required to use the app includes the following, where input 1 is for data extraction and inputs 2-6 are for IPD reconstruction.

1.
An image file (.png or .jpeg) containing the KM curve of interest.

2.
A .csv or .txt file containing the coordinates extracted from published KM curves (for IPD reconstruction). File templates are available in the Shiny application.

3.
Risk time (trisk).

4.
Number of patients at risk (nrisk).

5.
Total number of patients reported (optional when information for nrisk is provided).

6.
Total number of events reported (optional, but having it will improve accuracy).
We provide video tutorials on how to extract coordinates using the Shiny application and other software, and how to reconstruct IPD from extracted data coordinates. The tutorials are accessible under the User Guide panel of the Shiny application. In the following text, we show an example of using the Shiny application to reconstruct IPD for two treatment groups simultaneously.
We illustrate the use of the application using data for the radio and radio_plus treatment groups introduced in the R package examples. The data set, saved in the radio_radioplus.csv file, is provided in the app as a template. As shown in Fig. 2, to input the data, we select “Two” under Number of treatment groups and upload the data file. We then type in the available information, including the risk times, the number of patients at risk, and the total number of patients for each group. After finishing the data input, we simply click the Begin Calculation button and the results are shown on the right side of the app. There are four tabs for displaying the results.
As shown in Fig. 2, the first tab, Reconstructed IPD, shows the individual patient data reconstructed from the data provided. The Accuracy assessment tab (Fig. 10 in Appendix) shows two plots and a table. The first plot displays the comparison of KM curves using the IPD and the original data set for each treatment group, followed by a plot of the difference between survival probabilities. The table shows summary statistics such as the RMSE and the Kolmogorov-Smirnov test statistics and p-values to help assess the accuracy of the IPD reconstruction. Under the Survival analysis tab (Fig. 11 in Appendix), KM curves and cumulative hazard functions are displayed for both treatments. For each treatment, the landmark survival probabilities, corresponding standard errors, and 95% confidence intervals (CIs) are reported. Below the landmark probability tables, the application also provides the critical survival times given prespecified survival probabilities. For example, we see that the median survival time is 14.9 months for treatment 1 (radio group) and 24.5 months for treatment 2 (radio_plus group). A concise report is available under the Download report tab (Fig. 12 in Appendix).
Implementation example
We now illustrate the use of the modified-iKM method for reconstructing IPD from KM curves published for a two-arm randomized controlled trial in non-small cell lung cancer (NSCLC) patients [13]. This trial is known as the POPLAR trial, in which a total of 287 patients with previously treated, advanced or metastatic NSCLC were randomized to receive either atezolizumab or docetaxel. The aim of this trial was to assess the efficacy of the two drugs for patients with NSCLC, analyzed by PD-L1 expression levels on tumor cells and tumor-infiltrating immune cells. Here, the baseline PD-L1 expression was scored by immunohistochemistry in tumor cells (as the percentage of PD-L1-expressing tumor cells: TC3 (≥50%), TC2 (≥5% and <50%), TC1 (≥1% and <5%), and TC0 (<1%)) and tumor-infiltrating immune cells (as the percentage of tumor area: IC3 (≥10%), IC2 (≥5% and <10%), IC1 (≥1% and <5%), and IC0 (<1%)). Overall survival was estimated for patients in five groups: TC3 or IC3, TC2/3 or IC2/3, TC1/2/3 or IC1/2/3, TC0 and IC0, and intention-to-treat, for both the atezolizumab and docetaxel treatments. The five pairs of KM curves in the POPLAR trial are shown in Fig. 8 (Appendix: Original KM curves in the POPLAR trial).
To apply our method and assess its accuracy, we extract raw data points from the KM curves, reconstruct IPD from the points extracted, use the reconstructed IPD to calculate the median overall survival (OS) and hazard ratio, and compare them with the published results. As shown in Table 2, the point estimates for the median OS and HR are close to reported values based on raw data.
Considering that the reconstructed IPD may not be independent and that data extraction may introduce extra variation into the reconstruction process, it is of interest to see how the variability of test statistics based on the reconstructed IPD impacts the hypothesis testing of interest. We evaluated this by conducting hypothesis tests using the reconstructed IPD from the five pairs of KM curves in the POPLAR trial and six additional pairs of simulated KM curves (generated in the “Simulation result” section). As shown in Table 3 (Appendix: Discussion on variability of test statistics), our method, with reconstructed IPD, reaches the same conclusion as that using the true IPD, indicating that results based on reconstructed IPD are reliable. Moreover, we further investigated the variability of the statistics of interest (e.g., hazard ratio) between using true IPD and using reconstructed IPD, through bootstrap confidence intervals. As shown in Table 4 (Appendix: Discussion on variability of test statistics), the variability of statistics based on reconstructed IPD is comparable to that based on true IPD. The examinations of the variability of test statistics in our study are limited; future studies are warranted to further investigate this aspect. In terms of the software used, the data points extracted using our function from the IPDfromKM package showed performance competitive with those extracted using DigitizeIt and ScanIt. In practice, users can use the software of their choice as long as they follow the instructions to carefully extract data points. Nevertheless, the all-in-one feature of our software greatly streamlines the data extraction, IPD reconstruction, and subsequent data analysis, making it an excellent choice for users who wish to do everything within one platform.
Simulation result
To further assess the accuracy of the estimations using the modified-iKM algorithm, we conducted a simulation study with six trials, each with both a control and a treatment group. The sample size for each of the 12 groups was 200. We generated the underlying survival times (T_{i}, i=1,⋯,200) using the Weibull distribution with survival function S(t)=exp(−λt^{γ}), where λ and γ are the scale and shape parameters, respectively [14]. When γ>1, the survival curve has an increasing hazard; when γ=1, the survival curve has a constant hazard; and when γ<1, the survival curve has a decreasing hazard.
In the simulation, the scale parameter was specified such that the mean survival time E(T)=λ^{−1/γ}Γ(1+1/γ) was 12 months for the treatment groups and 6 months for the control groups. We assumed 36 months of recruiting time and a maximum of 24 months of follow-up. The censoring time was generated as the minimum of a time drawn from an exponential distribution with parameter λ^{∗} and the maximum follow-up time. The value of λ^{∗} was set such that the censoring rate was either 30% or 60%. KM curves were then generated by the survfit() function from the survival R package. The number of patients at risk was reported every 3 months (20 intervals), every 10 months (6 intervals), or not at all (no risk information). The simulated trials represented diverse situations that mimicked KM survival curves obtained from real trials.
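The data-generating mechanism above can be sketched in base R; the shape value and censoring rate used here are illustrative choices, and the scale λ is solved from the stated mean survival of 12 months:

```r
set.seed(2021)
n     <- 200
shape <- 1.5   # gamma > 1: increasing hazard (illustrative choice)

# Solve the scale so that E(T) = lambda^(-1/gamma) * Gamma(1 + 1/gamma) = 12
lambda <- (gamma(1 + 1/shape) / 12)^shape

# Inverse-transform sampling from S(t) = exp(-lambda * t^gamma)
surv_time <- (-log(runif(n)) / lambda)^(1 / shape)

# Exponential censoring truncated at 24 months of maximum follow-up;
# the rate 0.05 is an illustrative stand-in for lambda*
cens_time <- pmin(rexp(n, rate = 0.05), 24)

time   <- pmin(surv_time, cens_time)
status <- as.numeric(surv_time <= cens_time)   # 1 = event, 0 = censored

# KM curve, as in the simulation, via the survival package
library(survival)
fit <- survfit(Surv(time, status) ~ 1)
```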
We used the IPDfromKM package, DigitizeIt, and ScanIt to extract the coordinates of the curves; then we used the preprocess() and getIPD() functions from the IPDfromKM package to process the raw data and obtain IPD. We compared the estimated IPD with the true IPD generated. We first examined the accuracy of the three software tools in coordinate extraction by comparing the estimated number at risk to the true number at risk. Figure 3 shows the results for two simulated curves. As demonstrated, the modified-iKM method accurately estimated the number of patients at risk, regardless of the software used to extract data points, as long as the study reported the number of patients at risk, under both low and high censoring rates. Without reported numbers at risk, estimation cannot be accurate, as previously noted in the literature [5]. Thus, reporting this information in published studies is highly recommended.
Figure 3b shows the mean absolute error of the estimated number of patients at risk when the reported number at risk was available (i.e., the cases with 6 or 20 time intervals in the trial), where the error was determined by the absolute difference between the estimated and true values. The data extracted using the different software yielded comparable results. The results were more accurate when more time intervals were provided. For instance, when there was information for 20 intervals, the mean absolute error was less than 2 regardless of the censoring rate. When the censoring rate was high (e.g., 60%), the estimation using information from six intervals had greater error.
Next, we used the reconstructed IPD to conduct standard survival analyses to evaluate the performance of our method. Figure 4a shows the estimated median survival times and the one-year survival probabilities. The difference between the estimated and true median survival times was within 5% in most cases. Figure 4b shows the comparison between the true and estimated hazard ratios on the log scale. The hazard ratios and corresponding standard deviations were almost identical in most cases, regardless of which software was used to extract the data coordinates.
Conclusion
In this work, we introduce the R package IPDfromKM, and its accompanying Shiny application, to reconstruct IPD from published KM curves based on the proposed modified-iKM algorithm. The modified-iKM algorithm offers several improvements over the original iKM method. First, we provide a function for users to extract data points from KM curves directly. This enables the all-in-one feature of our software, which streamlines the process of IPD reconstruction without requiring extra software to obtain the raw data coordinates. Our real-world and simulated examples show that data points extracted using the IPDfromKM package yield results competitive with those extracted using DigitizeIt and ScanIt, as long as the points are extracted carefully.
Second, our procedure to preprocess raw data coordinates is flexible and accurate. The original iKM method has no function for preprocessing raw data coordinates, so users need to manually check the monotonicity assumption and sort the extracted data points. Satagopan et al. [6] published a function that can automatically preprocess the raw data coordinates. However, their function tends to be unstable. For example, it simply deletes points that have a larger survival probability than the points immediately before them, which introduces additional errors into the extracted data points. The function also trims data points at the tail of a KM curve. This can be detrimental, as a KM curve often has a long horizontal tail; if data points were trimmed at the tail, the number of censored observations occurring at the end of the trial could not be reliably estimated. The preprocess() function from the reconstructKM package published on GitHub [8] can preprocess the data coordinates, but it also requires manual checks of monotonicity and the curve endings. To overcome these limitations, we propose a flexible and accurate algorithm to preprocess the raw data coordinates extracted from KM curves; it is easily invoked through the preprocess() function of the IPDfromKM package or by uploading the data to the Shiny application. Moreover, the use of Tukey's fences improves the robustness of reconstruction in Stage 2 in the presence of outliers.
Third, in the estimation stage, we refine the boundary setup in the iterative procedure to prevent abysmal estimations, which greatly improves the stability of the iKM method.
Finally, we demonstrate through simulation that the variability of statistics of interest is comparable to that based on the true IPD, and that hypothesis tests based on reconstructed IPD reach the same conclusions as those based on the true IPD. These findings show that the IPD reconstructed using our method is reliable. In the R package and the Shiny application, we provide several easy-to-use approaches to evaluate the accuracy of the algorithm, with which the quality of the reconstructed IPD is easily assessed.
Despite the strengths of the modified-iKM algorithm and its accompanying software, a couple of challenges or limitations are worth mentioning. First, when curves are tangled together with many censoring markers, it remains challenging to separate out each KM curve for digitizing, even with the help of software such as Adobe Illustrator. Second, while the modified-iKM algorithm can accurately estimate the survival probabilities, the number of events, and the number of patients at risk, it relies on a uniform censoring assumption, which may be violated in certain trials. Future work can expand on this and consider extensions to non-uniform censoring mechanisms.
Availability and requirements
Project name: IPDfromKM R package and Shiny web application.
Project home page: https://CRAN.R-project.org/package=IPDfromKM, and https://www.trialdesign.org/one-page-shell.html#IPDfromKM.
Operating system(s): Platform independent.
Programming language: R.
Other requirements: none.
License: GPL-2.
Any restrictions to use by nonacademics: none
Appendix
Modified-iKM algorithm in detail
The IPD reconstruction is carried out using the modified-iKM algorithm, which is based on the KM estimation method [9] and improves upon the iKM algorithm [5]. Details regarding the KM estimator and the modified-iKM algorithm are provided below.
The KM estimator
The KM estimator, first proposed by Kaplan and Meier [9], is a nonparametric estimator of the survival function. It is given by the product, over the observed failure times, of the conditional probabilities of surviving past each failure time. Specifically, suppose there are Q distinct failure times. Let t_{q} denote a time at which at least one event (e.g., a patient dies) is observed, let n_{q} be the number of subjects at risk at time t_{q}, and let d_{q} be the number of subjects who experience the event at that time, q=1,⋯,Q. The KM estimator (\(\hat{S}^{KM}_{t_{q}}\)) is formally defined as
$$ \hat{S}^{KM}_{t_{q}}=\prod_{j:\,t_{j}\le t_{q}}\left(1-\frac{d_{j}}{n_{j}}\right). $$(1)
The number of patients at risk at the next failure time, n_{q+1}, is obtained from the number at risk at time t_{q} minus the number of patients who experienced the event and the number of patients censored between t_{q} and t_{q+1} (denoted c_{q}):
$$ n_{q+1}=n_{q}-d_{q}-c_{q}. $$(2)
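As a concrete illustration, the two recursions above can be sketched in a few lines. Python is used here purely for exposition; the function and variable names are ours, not part of the IPDfromKM package.

```python
# Illustrative sketch of the KM estimator (Eq. 1) together with the
# at-risk bookkeeping (Eq. 2). All names here are ours, for exposition.

def km_survival(n0, d_events, c_censored):
    """Compute KM survival estimates at each distinct failure time.

    n0            -- number of subjects at risk at the first failure time
    d_events[q]   -- events observed at time t_q
    c_censored[q] -- subjects censored between t_q and t_{q+1}
    """
    s, n = 1.0, n0
    surv, at_risk = [], []
    for d, c in zip(d_events, c_censored):
        s *= 1.0 - d / n       # Eq. (1): product of conditional survival
        surv.append(s)
        n = n - d - c          # Eq. (2): carry the number at risk forward
        at_risk.append(n)
    return surv, at_risk
```

For example, with 10 subjects, two events and one censoring at the first failure time, and one further event, `km_survival(10, [2, 1], [1, 0])` gives survival probabilities 0.8 and 0.8 × 6/7 ≈ 0.686, with 7 and then 6 subjects carried forward at risk.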
The modified-iKM algorithm
Before initializing the iterative algorithm, we first sort the extracted coordinates by time and make monotonicity adjustments to the survival probabilities. In addition, to improve estimation accuracy, we propose a step control procedure when calculating the number of events at coordinate k (denoted \(\hat {d}_{k}\)). Often, multiple points are read out at the same time, and thus multiple survival probabilities are available at that time. In a KM curve, this is reflected by a drop in the survival probability at a time point; we refer to such a time point as a jumping step. Suppose s consecutive points are read out at a jumping step from the target KM curve, and denote them as j,j+1,...,j+s−1. The original iKM method uses all the values \(\hat {d}_{j}, \hat {d}_{j+1},...,\hat {d}_{j+s-1}\) and adds them together to estimate the total number of events at the jumping step. However, this can underestimate the number of events due to rounding: each of \(\hat {d}_{j}, \hat {d}_{j+1},...,\hat {d}_{j+s-1}\) can be small enough that it rounds to zero. When many jumping points exist, the number of events is underestimated. As a simple illustrative example, suppose s=10 and all the values \(\hat {d}_{j}, \hat {d}_{j+1},...,\hat {d}_{j+s-1}\) lie within [0.1,0.5). Rounding these values separately and adding them together gives an estimated number of events of zero, whereas in reality it lies between 1 and 5. We note that, based on extensive simulations and real trial examples, this rarely happens in practice. Nevertheless, it is prudent to safeguard against the potential underestimation problem. For this reason, we recommend controlling the steps so that only two points remain in each vertical segment on the KM curve: one at the beginning and the other at the end of the segment. The corresponding survival rates at these two points are simply \(\hat {S}^{KM}_{last(k)}\) and S_{k}. By doing this, the number of events at the step is calculated and rounded only once, avoiding the potential underestimation problem.
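The rounding problem behind the step control can be seen in a toy computation (ours, not package code):

```python
# Ten extracted points on one vertical segment, each contributing an
# estimated fractional event count of 0.3 (so 3 events in total).
d_hats = [0.3] * 10

per_point = sum(round(d) for d in d_hats)  # each 0.3 rounds down to 0
per_segment = round(sum(d_hats))           # total of 3.0 rounds to 3

# per_point is 0, losing all 3 events; keeping only the first and last
# point of the segment forces a single rounding, as in the step control.
```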
To start the iterative estimation process, we first divide the preprocessed coordinates from the KM curve into I intervals. Denote the number of patients at risk at the beginning of the intervals as (nrisk_{1},nrisk_{2},⋯,nrisk_{I}) and denote the time at which the number of patients at risk is provided as (trisk_{1},trisk_{2},⋯,trisk_{I}). For each interval, we denote the index of the first point as lower_{i} and of the last point as upper_{i}. The iterative estimation process proceeds as follows:

1.
Initialize the total number of patients censored in interval i (\(\widehat {ncensor}_{i}\)) as the difference between the number at risk at the beginning of interval i+1 if no censoring occurred in interval i (\(nrisk_{i+1}^{nocensor}\)) and the reported number at risk nrisk_{i+1}. The value of \(nrisk_{i+1}^{nocensor}\) is given by \(nrisk_{i+1}^{nocensor}=nrisk_{i}*S_{lower_{i+1}}/S_{lower_{i}}\), rounded to the nearest integer, where \(S_{lower_{i+1}}/S_{lower_{i}}\) is the probability of survival at the beginning of interval i+1 conditional on being alive at the beginning of interval i. Thus we have
$$ \widehat{ncensor}_{i}=nrisk_{i}*S_{lower_{i+1}}/S_{lower_{i}}-nrisk_{i+1}. $$(3) 
2.
Determine the number of patients censored between the extracted coordinates k and k+1 (denoted \(\widehat {ncensor}_{k}\)). Assuming a constant censoring rate, the censoring times are obtained by distributing the censored patients evenly over interval i: \(tcensor_{m}=T_{lower_{i}}+m*(T_{lower_{i+1}}-T_{lower_{i}})/(\widehat{ncensor}_{i}+1), m=1, \cdots, \widehat {ncensor}_{i}\). The value of \(\widehat {ncensor}_{k}\) is determined by counting the censoring times that lie within [T_{k},T_{k+1}]:
$$ \widehat{ncensor}_{k}= \sum_{m=1}^{\widehat{ncensor}_{i}}I(tcensor_{m}\in[T_{k}, T_{k+1}]), $$(4)where I(tcensor_{m}∈[T_{k},T_{k+1}]) is an indicator function, which returns 1 if the censoring time tcensor_{m} lies within the interval [T_{k},T_{k+1}].

3.
Determine the number of patients at risk for the coordinate k+1 (i.e., \(\hat {n}_{k+1}\)) as
$$ \hat{n}_{k+1}=\hat{n}_{k}-\hat{d}_{k}-\widehat{ncensor}_{k}, \quad k=lower_{i}, \cdots, upper_{i}, $$(5)according to Eq. (2), where \(\hat {d}_{k}\) is determined by \(\hat {n}_{k}*\left (1-S_{k}/\hat {S}^{KM}_{last(k)}\right)\), rounded to the nearest integer, based on a rearrangement of Eq. (1). Here \(\hat {S}^{KM}_{last(k)}\), instead of \(\hat {S}^{KM}_{k-1}\), is used because there may not be an event at the extracted coordinate k−1. Thus, \(\hat {S}^{KM}_{last(k)}\) is the survival probability of the last point before time T_{k}.

4.
Check if the estimated number at risk at the start of the next interval i+1 (i.e., \(\widehat {nrisk}_{i+1}\), which is just \(\hat {n}_{upper_{i}+1}\)) is equal to the reported value nrisk_{i+1}.

If \(\widehat {nrisk}_{i+1}=nrisk_{i+1}\), move to step 5.

Otherwise, set \( \widehat{ncensor}_{i}:= \widehat{ncensor}_{i}+\hat{n}_{(upper_{i})+1} - nrisk_{i+1}. \)


5.
Repeat the iteration in steps 1-4, as long as one of the following two conditions holds:
$$ \left(\hat{n}_{(upper_{i})+1} > nrisk_{i+1}\right) \text{ and } \left(\widehat{ncensor}_{i} < nrisk_{i} - nrisk_{i+1} \right), $$(6)or
$$ \left(\hat{n}_{(upper_{i})+1} < nrisk_{i+1}\right) \text{ and } \left(\widehat{ncensor}_{i} > 0\right). $$(7)Condition (6) indicates that the estimated number of patients at risk at the end of interval i is greater than the reported value, so the iteration continues to increase the number of censored patients for this interval. Conversely, condition (7) indicates that the estimated number of patients at risk at the end of interval i is less than the reported value, so the iteration continues to decrease the estimated number of censored patients for the interval. Note that, in condition (6), an upper bound on the number of censored patients is imposed (i.e., \(\widehat {ncensor}_{i} < nrisk_{i} - nrisk_{i+1}\)) to ensure that the number of censored patients for each interval lies within [0,nrisk_{i}−nrisk_{i+1}]. This boundary condition is needed to avoid an infinite number of iterations, and was not considered in the original iKM method.

6.
Check if the iteration reaches the last interval.

If i+1 is not the last interval, i.e., i+1≠I, repeat steps 1-4.

Otherwise, adjust the initial guess of the censored patients in the last interval as
$$ \begin{aligned} \widehat{ncensor}_{I} = \min\Big(&\text{average censoring rate}\times\text{length of the last interval}, \\ & nrisk_{I} - endpts - \big(tot.events - \sum_{k=1}^{upper_{I-1}} \hat{d}_{k}\big)\Big), \end{aligned} $$(8)where endpts is the last reported number of patients at risk. This adjustment is made because the number at risk at the end of the last interval is typically not available, and thus the initial guess in Eq. (3) cannot be determined directly.
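To make the steps concrete, the following sketch implements the iteration for one interior interval. It is Python for illustration only, with our own simplified bookkeeping (e.g., half-open sub-intervals for counting censoring times, an iteration cap, and the whole-difference adjustment of the censoring count); it omits the special handling of the last interval in step 6, and is not the getIPD() implementation.

```python
def rnd(x):
    """Round a nonnegative value to the nearest integer."""
    return int(x + 0.5)

def reconstruct_interval(n_start, n_next, T, S):
    """Sketch of steps 1-5 of the modified-iKM iteration for one interval.

    n_start, n_next -- reported numbers at risk at the start of this
                       interval and at the start of the next one
    T, S            -- extracted coordinates in this interval; the last
                       entry is the first coordinate of the next interval
    Returns (events per coordinate, censorings per coordinate, n at end).
    """
    # Step 1: initial guess for the interval's censorings (Eq. 3),
    # clamped to the valid range [0, n_start - n_next].
    ncensor = rnd(n_start * S[-1] / S[0]) - n_next
    ncensor = min(max(ncensor, 0), n_start - n_next)

    for _ in range(200):                  # iteration cap guards this sketch
        # Step 2: spread censoring times uniformly over the interval
        # (Eq. 4); half-open sub-intervals avoid double counting.
        tc = [T[0] + m * (T[-1] - T[0]) / (ncensor + 1)
              for m in range(1, ncensor + 1)]
        cens_k = [sum(T[k] <= t < T[k + 1] for t in tc)
                  for k in range(len(T) - 1)]

        # Step 3: walk the coordinates, updating events and number at risk.
        n, s_last, d_k = n_start, S[0], []
        for k in range(len(T) - 1):
            d = rnd(n * (1.0 - S[k] / s_last))   # rearranged Eq. (1)
            d_k.append(d)
            s_last = S[k]
            n = n - d - cens_k[k]                # Eq. (5)

        # Steps 4-5: adjust the censoring count while conditions (6)/(7)
        # hold; otherwise the interval is consistent with n_next.
        if n > n_next and ncensor < n_start - n_next:
            ncensor = min(ncensor + n - n_next, n_start - n_next)
        elif n < n_next and ncensor > 0:
            ncensor = max(ncensor + n - n_next, 0)
        else:
            break
    return d_k, cens_k, n
```

For instance, with 100 patients at risk at the start of an interval, 80 reported at the next, and extracted survival probabilities 1.0, 0.9, 0.85 at times 0, 4, 8 (with 0.8 at time 12 opening the next interval), the iteration settles on 15 events and 5 censored patients, matching the reported number at risk.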

Robustness and stability of the modifiediKM algorithm
Steps (1a) and (2b) of the modified-iKM algorithm improve the robustness and stability of the original iKM method. In this section, we present two data applications to demonstrate these advantages.
Robustness We demonstrate the advantage of Tukey's fences using the acute myelogenous leukemia (AML) survival data in the survival package. We generated the KM curve for the control group with the survminer package and extracted data points from the curve using the IPDfromKM package. The first panel in Fig. 5a shows the data points extracted from the published KM curve, where the red dots indicate the outliers flagged using Tukey's fences. The second panel shows the reconstructed KM curves with and without the outliers. As shown, removing outliers with Tukey's fences ensures accurate estimation. The use of Tukey's fences thus increases the robustness of the algorithm: outliers in the extracted data points are removed automatically and have minimal impact on the estimation. After removing outliers, we use "force monotonicity" to ensure that the survival probability decreases monotonically with time. While isotonic regression is commonly used when the monotonicity assumption is violated, force monotonicity performed better on most of the curves considered in this study, possibly because isotonic regression is more likely to smooth out the curve (an example is shown in Fig. 5b).
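A minimal version of the two preprocessing safeguards described above can be sketched as follows (our own simplified code, applied here to a generic numeric vector; the package's preprocess() function is the actual implementation):

```python
def tukey_outliers(values, k=1.5):
    """Flag values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    xs = sorted(values)

    def quantile(p):  # simple linear-interpolation quantile
        idx = p * (len(xs) - 1)
        lo = int(idx)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (idx - lo) * (xs[hi] - xs[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [not (low <= v <= high) for v in values]

def force_monotone(surv):
    """Force survival probabilities to be non-increasing over time."""
    out, cur = [], float("inf")
    for s in surv:
        cur = min(cur, s)   # carry the running minimum forward
        out.append(cur)
    return out
```

For instance, `force_monotone([1.0, 0.90, 0.92, 0.80])` returns `[1.0, 0.90, 0.90, 0.80]`, replacing only the digitization glitch at 0.92 rather than adjusting neighboring points as isotonic regression might.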
Stability Step (2a) proves most useful when there are many reported time intervals. Take the tenth KM curve in our simulation study as an example. We extracted 132 data points from the curve and applied both the modified-iKM and the original iKM methods to estimate the number of patients at risk (nrisk) and the number of patients censored in each interval (ncensor). As shown in Table 5, when 6-interval information was used, the modified-iKM and original iKM methods produced similar estimates. However, when 11-interval information was used, the modified-iKM method reliably produced \(\widehat {ncensor}\) for each interval, owing to the additional boundary setup (Appendix Modified-iKM algorithm in detail). In contrast, the original iKM algorithm behaved abysmally (i.e., produced negative estimates). This demonstrates the stability of the modified-iKM method.
Figures returned for accuracy assessment or secondary analysis
The IPDfromKM package provides the plot() function to visualize the accuracy of the IPD reconstruction (Fig. 6), in addition to formal testing. It also provides the survreport() function to conduct secondary survival analysis and return figures for the new analysis (Fig. 7).
Original KM curves in the POPLAR trial
Figure 8 shows the KM curves and the estimated median survival times and hazard ratios for the subgroups in the POPLAR trial.
Discussion on variability of test statistics
We show in Table 2 that the point estimates for the median OS and HR are close to the reported values based on raw data. Considering that the reconstructed IPD may not be independent and that data extraction may introduce extra variation into the reconstruction process, it is of interest to see how the variability of test statistics based on the reconstructed IPD affects hypothesis testing. We evaluated this by (1) conducting hypothesis tests using the reconstructed IPD from the five pairs of KM curves in the POPLAR trial and six additional pairs of simulated KM curves (generated in the "Simulation result" section), and (2) constructing bootstrap confidence intervals for the six simulated trials using both true and reconstructed IPD. Table 3 shows that our method, with reconstructed IPD, reaches the same conclusions as with the true IPD, indicating that results based on reconstructed IPD are reliable. Table 4 demonstrates that the variability of statistics based on the reconstructed IPD is comparable to that based on the true IPD.
Long-term commitment of the Shiny application
We developed the IPDfromKM Shiny application to help users extract data points from published KM curves and estimate the parameters of interest using the modified-iKM algorithm. The application is freely available at https://www.trialdesign.org/one-page-shell.html#IPDfromKM and is regularly maintained. Figure 9 shows the procedure to contact the authors should users have any questions. Specifically, panel A of Fig. 9 shows the part of the app interface where the "CONTACT" button (upper right corner) can be used to communicate with the authors. Upon clicking "CONTACT," users can fill out their contact information and specific questions, as shown in panel B of Fig. 9. Responses will be sent to the email addresses provided.
Additional output of using the app to reconstruct IPD for two treatment groups
Availability of data and materials
The R code for the project can be downloaded at https://CRAN.R-project.org/package=IPDfromKM.
Declarations
Abbreviations
IPD: Individual patient data
iKM: Iterative algorithm based on Kaplan-Meier estimation
RMSE: Root mean square error
KM: Kaplan-Meier
References
Deeks JJ, Higgins JPT, Altman DG. Analysing and presenting results. In: Higgins JPT, Green S, editors. Cochrane Handbook for Systematic Reviews of Interventions 4.2.6 [updated September 2006]; Section 8. In: The Cochrane Library, Issue 4. Chichester: Wiley; 2006.
Earle CC, Wells GA. An assessment of methods to combine published survival curves. Med Decis Making. 2000; 20(1):104–11.
Lueza B, Rotolo F, Bonastre J, Pignon JP, Michiels S. Bias and precision of methods for estimating the difference in restricted mean survival time from an individual patient data meta-analysis. BMC Med Res Methodol. 2016; 16(1):37.
Tudur C, Williamson PR, Khan S, Best LY. The value of the aggregate data approach in meta-analysis with time-to-event outcomes. J R Stat Soc Ser A Stat Soc. 2001; 164(2):357–70.
Guyot P, Ades A, Ouwens MJ, Welton NJ. Enhanced secondary analysis of survival data: reconstructing the data from published Kaplan-Meier survival curves. BMC Med Res Methodol. 2012; 12(1):9.
Satagopan JM, Iasonos A, Kanik JG. A reconstructed melanoma data set for evaluating differential treatment benefit according to biomarker subgroups. Data Brief. 2017; 12:667–75.
Wei Y, Royston P. Reconstructing time-to-event data from published Kaplan-Meier curves. Stata J. 2017; 17(4):786–802.
Sun R. R package to Reconstruct Individual Level Data From KM Plots Published in Academic Journals. 2018. https://github.com/ryanrsun/reconstructKM.
Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc. 1958; 53(282):457–81.
Tukey JW. Exploratory Data Analysis. Addison-Wesley Series in Behavioral Science: Quantitative Methods. 1977.
Bonner JA, Harari PM, Giralt J, Azarnia N, Shin DM, Cohen RB, Jones CU, Sur R, Raben D, Jassem J, et al. Radiotherapy plus cetuximab for squamous-cell carcinoma of the head and neck. N Engl J Med. 2006; 354(6):567–78.
Zhou Y, Lin R, Kuo YW, Lee JJ, Yuan Y. BOIN suite: a software platform to design and implement novel early-phase clinical trials. JCO Clin Cancer Inform. 2021; 5:91–101.
Fehrenbacher L, Spira A, Ballinger A, Lewanski C, et al. Atezolizumab versus docetaxel for patients with previously treated non-small-cell lung cancer (POPLAR): a multicentre, open-label, phase 2 randomised controlled trial. The Lancet. 2016; 387(10030):1837–46.
Kim JS, Yum BJ. Selection between Weibull and lognormal distributions: a comparative simulation study. Comput Stat Data Anal. 2008; 53(2):477–85.
Acknowledgements
The authors thank Ryan Sun for his constructive comments on an earlier version of the manuscript and Jessica Swann for her editorial assistance.
Funding
JJL’s research was supported in part by the grants CA016672 and CA221703 from the National Cancer Institute, RP150519 and RP160668 from the Cancer Prevention and Research Institute of Texas, and The University of Texas MD Anderson Cancer Center Oropharynx Cancer Program, generously supported by Mr. and Mrs. Charles W. Stiefel.
Author information
Authors and Affiliations
Contributions
All authors contributed to the conception and design of the study. NL and YZ developed the R package and the webbased Shiny application. All authors participated in manuscript writing. JJL gave final approval of the manuscript submitted for publication. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Liu, N., Zhou, Y. & Lee, J.J. IPDfromKM: reconstruct individual patient data from published Kaplan-Meier survival curves. BMC Med Res Methodol 21, 111 (2021). https://doi.org/10.1186/s12874-021-01308-8
Keywords
 Individual patient data (IPD)
 Kaplan-Meier curve
 Meta-analysis
 R package
 Shiny application
 Survival analysis