IPDfromKM: reconstruct individual patient data from published Kaplan-Meier survival curves

Background When applying secondary analysis on published survival data, it is critical to obtain each patient’s raw data, because the individual patient data (IPD) approach has been considered as the gold standard of data analysis. However, researchers often lack access to IPD. We aim to propose a straightforward and robust approach to obtain IPD from published survival curves with a user-friendly software platform. Results Improving upon existing methods, we propose an easy-to-use, two-stage approach to reconstruct IPD from published Kaplan-Meier (K-M) curves. Stage 1 extracts raw data coordinates and Stage 2 reconstructs IPD using the proposed method. To facilitate the use of the proposed method, we developed the R package IPDfromKM and an accompanying web-based Shiny application. Both the R package and Shiny application have an “all-in-one” feature such that users can use them to extract raw data coordinates from published K-M curves, reconstruct IPD from the extracted data coordinates, visualize the reconstructed IPD, assess the accuracy of the reconstruction, and perform secondary analysis on the basis of the reconstructed IPD. We illustrate the use of the R package and the Shiny application with K-M curves from published studies. Extensive simulations and real-world data applications demonstrate that the proposed method has high accuracy and great reliability in estimating the number of events, number of patients at risk, survival probabilities, median survival times, and hazard ratios. Conclusions IPDfromKM has great flexibility and accuracy to reconstruct IPD from published K-M curves with different shapes. We believe that the R package and the Shiny application will greatly facilitate the potential use of quality IPD and advance the use of secondary data to facilitate informed decision making in medical research.

undertake additional subgroup analyses not reported in the aggregated data. For this reason, the IPD approach is considered as the gold standard in data analysis.
However, researchers conducting meta-analysis or other secondary analyses may lack access to the IPD, partly due to the confidentiality of clinical data. Therefore, a method that is able to reconstruct IPD from published K-M curves can greatly facilitate secondary analyses on survival data. Several methods have been reported in the literature. The iterative algorithm based on K-M estimation method (referred to as "iKM" hereafter) proposed by Guyot et al. [5] is a classic approach among many proposed. It has been used in various secondary analyses. For example, Satagopan et al. [6] used it to reconstruct time-to-event data from a melanoma data set for evaluating different treatment benefits according to biomarker subgroups. Wei and Royston [7] developed a STATA function to apply the iKM algorithm with some adaptions and applied it to reconstruct IPD from K-M curves from multiple trials.
However, there are a few factors that limit the use of the iKM method. First, external software is needed to extract data points before using the iKM algorithm. The original iKM method suggests a manual approach to pick up a sufficient number of points from K-M curves via mouse-clicks using the DigitizeIt software.
Second, the iKM method has restricted requirements when picking up the points manually: (1) the survival probability needed to decrease monotonically as time increases, (2) the points where the number of patients at risk are reported must be included, and (3) users need to sort the data coordinates into time intervals determined by the time points, at which the number of patients at risk is reported. K-M curves in publication typically report the number of patients at risk at several time points under the x-axis. If the number of patients at risk is reported at month 3, 6, and 9, then data coordinates need to be manually organized into the time intervals: [ 0 −3), [ 3−6), [ 6− 9), and [ 9, +∞). The first two requirements are generally hard to keep for a large amount of manual mouse-clicks and the last one is also time consuming. Thus, it is important to have a flexible function to perform these tasks automatically.
Third, there is a numeric issue in the original iKM method: estimation can be negative values in certain scenarios. Our examination reveals that this is due to a boundary setup in the iterative process.
Finally, there is a lack of user-friendly software for clinical researchers to use for this method. The published functions [5,6,8] for the iKM method have at least one of the following inconveniences: external digitizing software is need for data extraction; users need to manually check if data input is appropriate, or the auto process program is not convenient or stable; or no accuracy assessment is provided for users to directly evaluate the reconstruction results within the same software.
To overcome the aforementioned limitations, we propose a two-stage modified-iKM approach, which provides an improved, accurate, user-friendly, and stable workflow to reconstruct IPD. In this approach, we not only relax the restricted requirement for data input, but also improve the robustness and stability of the original iKM method. More importantly, we develop an all-in-one software platform that allows clinicians and medical researchers to go from a single K-M curve image directly to reconstructed IPD, without using additional software to aid data extraction or without manual treatment on the data before reconstruction. The software includes an R package and a web-based Shiny application. Specifically, users can use either the R package or the Shiny application to (1) extract data coordinates from published K-M curves; (2) preprocess extracted data points; (3) run the modified-iKM algorithm to estimate the number of patients at risk, the number of events, and the number of censored outcomes for each pre-specified interval, and to reconstruct IPD; (4) provide graphs and statistical summary for users to evaluate the accuracy of the reconstruction process; and (5) conduct survival analysis based on reconstructed IPD. The R package is beneficial for users with some programming skills, as they can perform the reconstruction of IPD within R or expand their future research on the package. The advantages of the Shiny application are that it has a point-and-click interface, does not require installing any software, and can be used on any machine with an internet browser. This feature is appealing for clinicians and medical researchers who are not necessarily familiar with statistical programming. Extensive simulations and real-world data applications demonstrate that the proposed method has high accuracy and great reliability in the estimations of the parameters of interest, e.g., median survival, hazard ratio, and survival probability.
Our work is of great importance because it not only further supplements and strengthens the original iKM algorithm, but it also provides a user-friendly, all-in-one software available in different platforms to accommodate the needs of researchers with or without statistical programming familiarity. The availability of the two different software platforms can further enhance quality extraction of IPD for meta-analysis and other secondary analyses using time-to-event data.

Implementation
The modified-iKM algorithm for IPD reconstruction is a two-stage process, as shown in Fig. 1. Stage 1 aims to extract quality data coordinates (time, survival probability) from K-M curves. In Stage 2, the data coordinates are preprocessed and IPD is reconstructed using the proposed iterative algorithm.

Coordinates extraction in Stage 1
Data coordinates can be extracted from K-M curves using the R package and Shiny application we develop (details in the "Results" section). There are also a number of other software options available on Windows or Mac operating systems to digitize the graphs. The commonly used software are DigitizeIt (http://www.digitizeit. de/), ScanIt (https://www.amsterchem.com/scanit.html), and Plot Digitizer (http://plotdigitizer.sourceforge.net/). Extensive applications of real-world trial examples and simulated K-M curves show that data extracted using the different approaches yield comparable results during IPD reconstruction in Stage 2 (more details in the "Implementation" and "Simulation result" sections). We provide video tutorials on how to extract data coordinates using these tools in our Shiny application.
To ensure accuracy of estimation in Stage 2, it is critical to extract quality data coordinates in Stage 1. We recommend the following when extracting data points: 1. Use Adobe Illustrator to separate the curves before extracting data points, when there are multiple K-M curves in the same figure and the curves are tangled together. A video tutorial on this is included in the "Shiny application" section. 2. Extract as many points as possible. 3. Make sure the data points extracted are as evenly distributed as possible on the K-M curves. 4. Click on both the top and bottom points of the vertical segment at points where survival probability drops.

IPD reconstruction in Stage 2
The IPD reconstruction is carried out using the modified-iKM algorithm, which is based on the K-M estimation method [9] and improves upon the iKM algorithm [5]. We provide an overview of the algorithm in this section and delineate it in detail in Appendix Modified-iKM algorithm in details. Let T k and S k denote the time and survival probability, respectively, at time k. The data points extracted in Stage 1 are typically a N × 2 table, with each row being (T k , S k ), for k = 1, 2, ..., N, where N is the number of data points extracted. The IPD reconstruction consists of two main steps as follows: 1. Process the raw data.
a Sort the data by T k . b Make monotonicity adjustment by first using Tukey's fence [10] to detect and remove unreasonable inputs, and then ensure that survival probability decreases over time (known as "force monotonicity" hereafter). c Perform step control.
2. Reconstruct IPD using an iterative algorithm adapted from the iKM method.
a Estimate K-M parameters at each coordinate (k ) including the number of patients at risk (n k ), number of patients censored ( cens k ), and number of events(d k ), k = 1, · · · , N. In this step, we modified the boundary setup for the number of censored observations to prevent abysmal estimations in some scenarios (more details in Appendices Modified-iKM algorithm in details-Robustness and stability of the modified-iKM algorithm). b Construct IPD using the parameters estimated from Step (2a).
Tukey's fence in Step (1b) is a nonparametric method for outlier detection. It is calculated by creating a "fence" boundary using where Q 1 and Q 3 are the first and third quartiles of the data points, and k is a constant that takes common values of 1.5 or 3. We used k = 3 in our algorithm in order to not accidentally delete useful data points. Any point outside of TF is considered an outlier. The illustrative example in Appendix: Robustness and stability of the modified-iKM algorithm shows that the use of Tukey's fence in Step (1b) improves the robustness of the algorithm in the presence of outliers (Fig. 5 in Appendix).
Step (1c) safeguards the algorithm from a potential underestimation problem (details in Appendix: Modified-iKM algorithm in details).
Step (2a) enhances the stability of the iterative algorithm. In Appendix: Robustness and stability of the modified-iKM algorithm, we demonstrate the robustness and stability of the modified-iKM algorithm using two data applications.
To assess the accuracy of the modified-iKM algorithm, we employ several metrics and make them easily accessible through our developed software. First, we provide graphs to visualize the reconstructed results: (1) estimated survival probability at each read-in time point using the reconstructed IPD compared with the corresponding read-in survival probabilities, and (2) estimated number of patients at risk (when given) compared with reported values. Second, we provide several summary statistics to aid the accuracy assessment. One is the root mean square error (RMSE), which measures the difference in survival probabilities calculated using reconstructed data and original data. Additionally, the mean absolute error and the max absolute error are provided to assess the precision of the estimation. Upon careful assessment of the empirical distributions of the RMSE, max absolute error, and mean absolute error for all of the real trial examples and simulated curves considered later in this study, we recommend the following thresholds: RMSE ≤0.05, mean absolute error ≤0.02, and max absolute error ≤0.05 to indicate that the extracted data points are sufficiently well-captured for subsequent analyses. Please note that these recommendations are based on a limited number of examples. In practice, users can use these thresholds as a rule of thumb along with their own judgement to determine if they need to re-extract the data points. Third, we use the Kolmogorov-Smirnov test to compare the distributions of the read-in and the estimated survival curves. A large p-value is desired, as it indicates that there is a lack of statistical evidence to show the discrepancy between the read-in and the estimated survival curves.

R package
To facilitate the use of the modified-iKM method, we developed an R package called "IPDfromKM" with R version 3.6.0. The package is available via the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project. org/package=IPDfromKM. The package contains several functions; the descriptions and objects returned for the functions are presented in Table 1. We provide an example for each of the functions below.

Extract data coordinates
The getpoints() function is used to extract data coordinates from published K-M curves. The function has the following arguments: After the file is read into R, instructions will be provided in the R console to guide the extraction of data points. Specifically, users will be guided to click on the leftmost and rightmost points on the x-axis, and click on the lowest and the highest points on the y-axis. Then they can collect the data coordinates by mouse-clicking on the curve. To get a desirable estimation, we suggest that users follow recommendations 2-4 in the "Coordinates extraction in Stage 1" section. The data points extracted will be returned as a two-column data set (e.g., points in the example), and this data set can be used as the input of the preprocess() function described in the "Process data coordinates" section.
In the following text, we take the build-in data set Radiationdata to demonstrate the use of the package to preprocess raw data, reconstruct IPD, and conduct secondary analysis on the reconstructed IPD. The data set was extracted from published K-M curves from a two-arm randomized controlled trial [11] using ScanIt. This study randomized 424 head and neck cancer patients to two treatment groups: 213 in the radiotherapy group (referred to as "radio") and 211 in the radiotherapy plus cetuximab group (referred to as "radio_plus"). The primary outcome was the duration of locoregional control. There were 145 pairs of coordinates extracted for the radio treatment, and 136 pairs for the radio_plus treatment. Coordinates for each treatment were saved as a two-column table in A data frame containing the x-and y-coordinates of the K-M curve of interest.
preprocess Preprocess the read-in data coordinates. A list including cleaned data ready for reconstruction and a "riskmat" table displaying the index of read-in points within each time interval.
getIPD Estimate the IPD. A list including the reconstructed IPD.
survreport Perform survival analysis on reconstructed IPD. K-M curve, cumulative hazard, times for targeted survival probabilities.
plot Plot the object returned by getIPD(). K-M curves and number at risk for both reconstructed IPD and read-in data.
Descriptive results for accuracy assessment and survival analysis on reconstructed IPD.
Please consult the documentation (e.g., help("preprocess")) for function arguments and detailed return types Radiationdata: the first column is for survival times, and the second column is for survival probabilities reported in percentages. Radiationdata also includes risk times (in months): trisk = (0, 10, 20, 30, 40, 50, 60, 70) and the number of patients at risk for each risk time point (nrisk).

Process data coordinates
To prepare data to reconstruct IPD, we first preprocess the raw data points in an appropriate format. This can be done using the preprocess() function. This function contains the following arguments: • dat: a two-column data set with the first column being survival time, and the second the survival probability extracted from a published K-M curve. • trisk: a vector containing risk time points (i.e., time points at which the number of patients at risk is reported). This often can be found under the x-axis of a K-M curve. • nrisk: a vector containing the number of patients at risk. This often can be found under the x-axis of a K-M curve. • maxy: the scale of survival probability. Set maxy=100 when the probabilities are reported in percentages (e.g., 70%). Set maxy=1 when the probabilities are reported using decimal numbers (e.g., 0.7).

Reconstruct IPD
After the raw data is processed using the preprocess() function, we can use the getIPD() function to reconstruct the IPD. The getIPD() function has the following arguments: • prep: the class object returned from the preprocess() function. • armID: an arbitrary label used as the group indicator for the reconstructed IPD. Typically 0 for the control group and 1 for the treatment group. • tot.events: total number of events. Only available for some published curves, and the default value is NULL.

Accuracy assessment
To view the accuracy assessment results, simply call the summary() function. Because of page limits, we show only one example below.

R > summary(est_radio)
The function read in 144 points from the K-M curve, and 6 number of patients at risk. Thus the read-in points are divided into 6 time intervals.
The Kolmogorov-Smirnov test: Test statistics D= 0.07639 p-value= 0.7948 Null hypothesis: distributions of the read-in and estimated survival probabilities are the same.
We see that our algorithm can accurately estimate the numbers at risk and provide estimates on the number of events at each risk time. The small values for RMSE (< 0.05), mean absolute error (< 0.02), and max absolute error (< 0.05), along with the large p-value of the Kolmogorov-Smirnov test shows that the reconstructed IPD is accurate. Additionally, we can use the plot() function to graph the survival curves from the reconstructed IPD, and compare them with those generated using original data points. The function takes the object returned by getIPD() directly.

R > plot(est_radio) R > plot(est_radio_plus)
The output using the plot() for the radio group is provided in Fig. 6 (Appendix: Figures returned for accuracy assessment or secondary analysis), which shows three graphs: (1) K-M curves based on reconstructed IPD and read-in data points, respectively; (2) number of patients at risk using the reconstructed IPD versus reported; and (3) difference in survival probability at different time points for the reconstructed IPD and the read-in data. When the interval information is not available while reconstructing the IPD, the second graph will not be shown.

Secondary analysis
If survival analysis is of interest, we can run the survreport() function, which includes the following arguments: • ipd1: a three-column (i.e., time, status, treatment indicator) Researchers working with clinical data are often interested in the survival times at which survival probabilities are specified (e.g., median survival time at which 50% of patients survive). The example below shows the survival times for the pre-specified survival probability s =

Shiny application
To facilitate the use of the modified-iKM algorithm to reconstruct IPD for a broader audience, who is not necessarily familiar with statistical programming, we develop a user-friendly Shiny application, which is freely available at https://www.trialdesign.org/one-page-shell.html# IPDfromKM. Compared with the R package, the Shiny application does not require any installation, is not limited by the hardware of users' computers, and comes with a point-and-click feature such that no statistical programming skill is needed to use it [12]. In addition, we regularly maintain and update the application on the server, thus users do not have to worry about software updates. Whenever users have any questions, they can contact the authors directly through the Shiny application host website. Appendix: Long-term commitment of the shiny application shows how users can directly contact the authors.
The application has a straightforward interface with three main panels: Data Extraction (used to extract data points), Reconstruct Individual Patient Data (IPD) (used to reconstruct IPD), and User Guide (providing extensive tutorials). Overall, the app has the capability to complete the following tasks: 1. Extract data points from published K-M curves. 2. Process extracted data points. 3. Reconstruct IPD using the modified-iKM algorithm. 4. Assess the accuracy of the reconstruction. 5. Perform survival analysis using the reconstructed IPD. 6. Generate a concise report for the IPD reconstruction. 7. Provide an extensive user guide for understanding the method and using the Shiny application.
Typical input required to use the app includes the following, where input 1 is for data extraction and inputs 2-6 are for IPD reconstruction.
We provide video tutorials on how to extract coordinates using the Shiny application and other software, and how to reconstruct IPD from extracted data coordinates. The tutorial are accessible under the User Guide panel of the Shiny application. In the following text, we show an example of using the Shiny application to reconstruct IPD for two treatment groups simultaneously.
We illustrate the use of the application using data for the radio and radio_plus treatment groups introduced in the R package examples. The data set saved in the radio_radioplus.csv file has been provided in the app as a template. As shown in Fig. 2, to input the data, we select "Two" under Number of treatment groups and upload the data file. We then type in available information including risk times, number of patients at risk, and the total number of patients for each group. When finishing data input, we simply click the Begin Calculation button and the results are shown on the right side of the app. There are four tabs for displaying the results.
As shown in Fig. 2, the first tab Reconstructed IPD shows the individual patient data reconstructed from the data provided. The Accuracy assessment tab (Fig. 10 in Appendix) shows two plots and a table. The first plot displays the comparison of K-M curves using the IPD and original data set for each treatment group, followed by a plot of the difference between survival probabilities. The table shows the summary statistics such as RMSE and Kolmogorov-Smirnov test statistics and p-values to help assess the accuracy of the IPD reconstruction. Under the Survival analysis tab (Fig. 11 in Appendix), K-M curves and cumulative hazard functions are displayed for both treatments. For each treatment, the landmark survival probabilities, corresponding standard error, and 95% confidence interval (CI) are reported. Below the landmark probability tables, the application also provides the critical survival times given pre-specified survival probabilities. For example, we see that the median survival time is 14.9 months for treatment 1 (radio group) and 24.5 months for treatment 2 (radio_plus group). A concise report is available under the Download report tab (Fig. 12 in Appendix).

Implementation example
We now illustrate the use of the modified-iKM method for reconstructing IPD from K-M curves published for a two-arm randomized controlled trial in non-small cell lung cancer (NSCLC) patients [13]. This trial is known as the POLAR trial, in which a total of 287 patients with previously treated, advanced or metastatic NSCLC were randomized to receive either atezolizumab or docetaxel. The aim of this trial was to assess the efficacy of the two drugs for patients with NSCLC, analyzed by PD-L1 expression levels on tumor cells and tumor-infiltrating immune cells. Here, the baseline PD-L1 expression was scored by immunohistochemistry in tumor cells (as the percentage of PD-L1 expressing tumor cells TC3 (≥50%), TC2 (≥5% and <50%), TC1 (≥1% and < 5%), and TC0 (<1%)) and tumor-infiltrating immune cells (as the percentage of tumor area: IC3 (≥10%), IC2 (≥5% and <10%), IC1 (≥1% and <5%), and IC0 (<1%)). Overall survival was estimated for all patients in five groups: TC3 or IC3 To apply our method and assess its accuracy, we extract raw data points from the K-M curves, reconstruct IPD from the points extracted, use the reconstructed IPD to calculate the median overall survival (OS) and hazard ratio, and compare them with the published results. As shown in Table 2, the point estimates for the median OS and HR are close to reported values based on raw data.
Considering that the reconstructed IPD data may not be independent and that data extraction may bring up extra variation in the reconstruction process, it is of interest to see how the variability of test statistics based on the reconstructed IPD impacts the hypothesis testing of interest. We evaluated this by conducting hypothesis tests using the reconstructed IPD from the five pairs of K-M curves in the POLAR trial and six additional pairs of simulated K-M curves (generated in "Simulation result" section). As shown in Table 3 (Appendix: Discussion on variability of test statistics), our method, with reconstructed IPD, reaches the same conclusion as that using the true IPD, indicating that results based on reconstructed IPD are reliable. Moreover, we further investigated the variability of statistics of interest (e.g., hazard ratio) between using true IPD and using reconstructed IPD, through bootstrap confidence intervals. As shown in Table 4 (Appendix: Discussion on variability of test statistics), the variability of statistics based on reconstructed IPD is comparable to that based on true IPD. The examinations on the variability of test statistics in our study are limited. Future studies are warranted to further investigate this aspect. In terms of software used, we see that the data points extracted using our function from the IPDfromKM package have competitive performance to those extracted using Digi-tizeIt and ScanIt. In practice, users can use the software of their choice as long as they follow the instructions to carefully extract data points. Nevertheless, the all-inone feature of our software greatly streamlines the data extraction, IPD reconstruction, and subsequent data analysis, making it a superb choice for users who wish to do everything within one platform.

Simulation result
To further assess the accuracy of the estimations using the modified-iKM algorithm, we conducted a simulation study with six trials that each had both control and treatment groups. The sample size for each of the 12 groups was 200. We generated the underlying survival time (T i , i = 1, · · · , 200) using the Weibull distribution with the survival function S(t) = exp(−λt γ ), where λ and γ are the scale and shape parameters, respectively [14]. When γ > 1, the survival curve has an increasing hazard; when γ = 1, the survival curve has a constant hazard; and when γ < 1, the survival curve has a decreasing hazard.
In the simulation, the scale parameter was specified such that the mean survival E(X) = λ (1 + 1/γ ) was 12 months for treatment groups, and 6 months for control groups. We assume 36 months of recruiting time, and a maximum of 24 months of follow-up. The censoring time Table 2 Estimates of median overall survival (OS) and hazard ratio using the modified-iKM algorithm based on data extracted using different software (R: R package IPDfromKM, D: DigitizeIt, S: ScanIt), in comparison to published results (Report) in the POLAR trial was generated by the minimum value of the time generated from exponential distribution with parameter λ * and the maximum follow-up time. The value of λ * was set such that the censor rate was either 30% or 60%. K-M curves were then generated by the survfit() function from the survival R package. The number of patients at risk was reported every 3 months (20 intervals), or every 10 months (6 intervals), or not reported at all (no risk information). The simulated trials represented diverse situations that mimicked K-M survival curves obtained from real trials. We used the IPDfromKM package, DigitizeIt, and ScanIt software to extract the coordinates of the curves, then we used the preprocess() and getIPD() functions from the IPDfromKM package to process the raw data and obtain IPD. We compared the estimated IPD with the true IPD generated. We first examined the accuracy of the three software packages in the coordinates extraction by comparing the estimated number at risk to the true number at risk. Figure 3 shows the results for two simulated curves. As demonstrated, modified-iKM accurately estimated the number of patients at risk, regardless of software used to extract data points, as long as a study reported the number of patients at risk, under both low and high censoring rates. Without reported numbers at risk, estimation cannot be accurate, as previously noted in the literature [5]. Thus, reporting this information in published studies is highly recommended. Figure 3b shows the mean absolute error of the estimation for the number of patients at risk when the reported number at risk was available (i.e., the case with 6 or 20 time intervals in the trial), where the error was determined by (estimated value − true value). The data extracted yielded comparable results for the different software.
The results were more accurate with more time intervals provided. For instance, when there was information for 20 intervals, the mean absolute error was less than 2 regardless of censoring rate. When the censoring rate was large (e.g., 60%), the estimation using information from six intervals had greater error.
Next, we used the reconstructed IPD to conduct standard survival analysis to evaluate the performance of our method. Figure 4a shows the estimated median survival times and the one-year survival probability. The difference between the estimated and true median survival times was within 5% in most cases. Figure 4b shows the comparison between the true hazard ratio and estimated hazard ratios on the log scale. The hazard ratios and corresponding standard deviations were almost identical in most cases, regardless of which software was used to extract the data coordinates.

Conclusion
In this work, we introduce the R package IPDfromKM, and its accompanying Shiny application, to reconstruct IPD from published K-M curves based on the proposed modified-iKM algorithm. There are several improvements in the modified-iKM algorithm in comparison to the original iKM method. First, we provide a function for users to extract data points from K-M curves directly. This enables the all-in-one feature of our software, which streamlines the process of IPD reconstruction without requiring extra software to obtain the raw data coordinates. Second, our procedure to preprocess raw data coordinates is flexible and accurate. The original iKM method does not have a function for preprocessing raw data coordinates, and users need to manually check the monotonicity assumption and sort the extracted data points. Stagopan et al. [6] published a function that can automatically preprocess the raw data coordinates. However, their function tends to be unstable. For example, the function simply deletes points that have a larger survival probability than the points right before them, which introduces additional errors into the extracted data points. The function also trims data points at the tail of a K-M curve. This can be detrimental, as the K-M curve often has a long horizontal tail. If data points were trimmed at the tail, it would be less likely to reliably estimate the number of censored observations occurring at the end of the trial. The preprocess() function from the ReconstructKM package published on GitHub [8] can preprocess the data coordinates but it also requires manual check for monotonicity and the curve endings. To overcome these limitations, we proposed a flexible and accurate algorithm to preprocess the raw data coordinates extracted from K-M curves and the algorithm is easily implemented by using the preprocess() function from the IPDfromKM package or by uploading the data to the Shiny application. Moreover, the use of Tukey's fence improves the robustness of reconstruction in Stage 2 in the presence of outliers.
Third, in the estimation stage, we refine the boundary setup in the iterative procedure to prevent abysmal estimations, which greatly improves the stability of the iKM method.
Finally, we demonstrate through simulation that the variability of statistics of interest is comparable to that based on the true IPD and that the conclusion based on reconstructed IPD is the same as that based on the true IPD in hypothesis testing. These findings show that the IPD reconstructed using our method is reliable. In the R package and the Shiny application, we provide several approaches to evaluate the accuracy of the algorithm with an easy-to-use function, with which the quality of the reconstructed IPD is easily assessed.
Despite the strengths of the modified-iKM algorithm and its accompanying software, there are a couple of challenges or limitations worth mentioning. First, when there are tangled lines with a lot of censoring markers, even with the help of software such as Adobe Illustrator, it is still challenging to separate out each K-M curve for digitizing. Second, while the modified-iKM algorithm can accurately estimate the survival probabilities, the number of events, and the number of patients at risk, the algorithm is based on a uniform censoring assumption, which may be violated in certain trials. Future work can expand on this and consider extension to a non-uniform censoring mechanism.

Modified-iKM algorithm in details
The IPD reconstruction is carried out using the modified-iKM algorithm, which is based on the K-M estimation method [9] and improves upon the iKM algorithm [5]. Details regarding the iKM estimation method and the modified-iKM algorithm are provided below.

The KM estimator
The KM estimator, first proposed by Kaplan and Meier [9] is a non-parametric estimator of the survival function. It is determined by the product over the failure times of the conditional probabilities of surviving to the next failure time. Specifically, suppose there are Q distinct failure times. Let t q denote a time where at least an event (e.g., a patient dies) is observed, n q is the number of subjects at risk at time t q , and d q is the number of subjects who experience the event at that time, q = 1, · · · , Q. The KM estimator (S KM t q ) is formally defined as The number of patients at risk (n q+1 ) can be determined by the number of patients at risk at time t q , minus the corresponding number of patient experienced events and number of patients censored.

The modified-iKM algorithm
Before initializing the iterative algorithm, we first sort extracted coordinates by time and make monotonicity adjustments on survival probabilities. In addition, to improve estimation accuracy, we propose a step control procedure while calculating the number of events at the coordinate k (denoted asd k ). Usually, there will be multiple points at the same time, and thus multiple survival probabilities are available at this time. In a K-M curve, this is reflected by a drop in the survival probability at a time point. We refer to such a time point as a jumping step. Suppose there are s consecutive points read out at the jumping step from the target K-M curve, and denote them as j, j + 1, ...j + s − 1. The original iKM method requires the use of all the values ford j ,d j+1 , ...,d j+s−1 and adds them together to get the estimated total number of events that happened at the jumping step. However, this could end up in an under-estimated number at risk due to rounding. That is,d j ,d j+1 , ...,d j+s−1 can be too small to round off to zero. When too many jumping points exist, the number at risk is under-estimated. As a simple illustrative example, suppose s = 10, if all the values ford j ,d j+1 , ...,d j+s−1 are within [ 0.1, 0.5). After rounding these values separately and adding them together, the estimated number at risk is zero, but in reality, it lies between 1 and 5. We note that this rarely happens in practice based on extensive simulations and real trial examples. Nevertheless, it is great to safe guard against the potential under-estimation problem. For this reason, we recommend controlling the steps to ensure that there are only two points remaining in each vertical segment on the K-M curve: one at the beginning and the other one at the end of the segment. The corresponding survival rates at these two points are simplyŜ KM last(k) and S k . By doing this, the number of events at this step needs to be calculated and rounded only one time, and it would not end up with the potential under-estimated problem.
To start the iterative estimation process, we first divide the preprocessed coordinates from the K-M curve into I intervals. Denote the number of patients at risk at the beginning of the intervals as (nrisk 1 , nrisk 2 , · · · , nrisk I ) and denote the time at which the number of patients at risk is provided as (trisk 1 , trisk 2 , · · · , trisk I ). For each interval, we denote the index of the first point as lower i and of the last point as upper i . The iterative estimation process proceeds as follows: 1. Initialize the total number of patients censored at interval i ( ncensor i ) using the difference between the reported number at risk at the beginning of interval i + 1 and the number at risk in this interval if no censoring occurs (nrisk nocensor i+1 ) in interval i. The value of nrisk nocensor i+1 is given by nrisk nocensor i+1 = nrisk i * S lower i+1 /S lower i rounded to the nearest integer, where S lower i+1 /S lower i is the probability of survival at the beginning of interval i + 1 conditional on being alive at the beginning of interval i. Thus we have 2. Determine the number of patients censored between the extracted coordinates k and k + 1 (denoted as ncensor k ). Assuming a constant censoring rate, it is straightforward to determine the censoring time by distributing the number of censored patients evenly over the interval i : is the survival probability of the last point before time T k . 4. Check if the estimated number at risk at the start of the next interval i + 1 (i.e., nrisk i+1 , which is just n upper i +1 ) is equal to the reported value nrisk i+1 .

5.
Repeat the iteration in steps 1-4, as long as one of the following two conditions holds: or n (upper i )+1 < nrisk i+1 and ncensor i > 0 . (7) The condition in (6) shows that the number of patients at risk for interval i is greater than the reported value, so the iteration will continue to increase the number of censored patients for this interval. On the other hand, condition (7) shows that the number of patients at risk at the end of interval i is less than the reported value, thus the iteration will continue to decrease the estimation of the number of censored patients for the interval. Note that, in condition (6), the upper bound on the number of censored patients ( ncensor i ) is added (i.e., ncensor i < nrisk i − nrisk i+1 ) to ensure that the number of censored patients for each interval lies within the range of [ 0, nrisk i − nrisk i+1 ]. This proper boundary condition is needed to avoid having an infinite number of iterations, which was not considered in the original iKM method. 6. Check if the iteration reaches the last interval.
• If i + 1 is not the last interval, i.e., i + 1 = I, repeat steps 1-4. • Otherwise, adjust the initial guess of the censored patients in the last interval as ncensor I = min(Average censoring rate * length of the last interval, where endpts is the last reported number of patients at risk. The adjustment is made because the number at risk for the last interval (nrisk I ) is typically not available, and thus the equation in (3) can cannot be determined directly.

Robustness and stability of the modified-iKM algorithm
We state that Step (1a) and (2b) of the modified-iKM algorithm improve the robustness and stability of the original iKM. In this section, we show two data applications to demonstrate these advantages.

Robustness
We demonstrate the advantage of Tukey's fence using the data for the acute Myelogenous Leukemia (AML) survival data in the survival package. We generated the K-M curve using the data for the control group with the survminer package and extracted data points from the K-M curve using the IPDfromKM package. The first panel in Fig. 5a shows the data points extracted from the published K-M curve, where the red dots indicate the outliers flagged using Tukey's fence. The second panel shows the reconstructed K-M curves with and without the outliers. As shown, we can see that by using Tukey's fence to remove outliers, the algorithm can ensure accurate estimation. This shows that the use of Tukey's fence increases the robustness of the algorithm such that outliers in the extracted data points will be removed automatically and have a minute impact on the estimation. After removing outliers, we use "force monotonicity" to ensure that survival probability monotonically decrease with time. While isotonic regression is commonly used when monotonicity assumption is violated, we show that force monotonicity is better in most of the curves considered in this study. This may be due to the fact that the former is more likely to smooth out the curve (an example is show in Fig. 5b).

Stability
Step (2a) proves to be most useful when there are many reported time intervals. Take the tenth K-M curve in our simulation study for example. We extracted 132 data points from the curve and applied both the modified-iKM and the original iKM methods to estimate the number of patients at risk (nrisk) and number of patients censored at each interval (ncensor). As shown in Table 5, if 6-interval information was used, both modified-iKM and original K-M would produce similar   estimations. However, if 11-interval information was used, the modified-iKM can reliably produce ncensor for each interval due to its stability by adding the additional boundary setup (Appendix Modified-iKM algorithm in details). In contrast, the original iKM algorithm behaves abysmally (i.e., negative estimated values). This demonstrates the stability of the modified-iKM.

Figures returned for accuracy assessment or secondary analysis
The IPDfromKM package provides the plot() function to visualize the accuracy of IPD reconstruction (Fig. 6), in addition to a formal testing. It also provides the survreport() function to conduct secondary survival analysis and returns figures for the new analysis (Fig. 7). Table 4 Hazard ratio and 95% Bootstrap confidence interval (BCI) for the six simulated trials in Table 3 Results Original K-M curves in the POLAR trial Figure 8 shows the K-M curves and estimation of median survival and hazard ratio for the subgroups in the POLAR trial.

Discussion on variability of test statistics
We show in Table 2 that the point estimates for the median OS and HR are close to reported values based on raw data. Considering that the reconstructed IPD data may not be independent and that data extraction may bring up extra variation in the reconstruction process, it is of interest to see how the variability of test statistics based on the reconstructed IPD impacts hypothesis testing. We evaluated this by (1) conducting hypothesis tests using the reconstructed IPD from the five pairs of K-M curves in the POLAR trial and six additional pairs of simulated K-M curves (generated in "Simulation result" section), and (2) constructing bootstrap confidence intervals for the six simulated trials using both true and reconstructed IPDs. Table 3 shows that our method, with reconstructed IPD, reaches the same conclusion as that with the true IPD, indicating that results based on reconstructed IPD are reliable. Table 4 demonstrates that the variability of statistics based on the reconstructed IPD is comparable to that based on the true IPD.

Long-term commitment of the shiny application
We develop the IPDfromKM Shiny application to help users extract data points from published K-M curves and estimate parameters of interest using the modified-iKM.