
Goodness-of-fit two-phase sampling designs for time-to-event outcomes: a simulation study based on New York University Women’s Health Study for breast cancer



Sub-cohort sampling designs such as the case-cohort design play a key role in studying biomarker-disease associations because of their cost-effectiveness. Time-to-event outcomes are often the focus in cohort studies, where the research goal is to assess the association between event risk and risk factors. In this paper, we propose a novel goodness-of-fit two-phase sampling design for time-to-event outcomes when some covariates (e.g., biomarkers) can only be measured on a subgroup of study subjects.


Assuming that an external model is available to relate the outcome to the complete covariates, whether a well-established risk model (e.g., the Gail model for breast cancer, the Gleason score for prostate cancer, or the Framingham risk models for heart disease) or a model built from preliminary data, we propose to oversample subjects with worse goodness-of-fit (GOF) based on the external survival model and the observed time-to-event. With the cases and controls sampled using the GOF two-phase design, the inverse sampling probability weighting method is used to estimate the log hazard ratios of both incomplete and complete covariates. We conducted extensive simulations to evaluate the efficiency gain of our proposed GOF two-phase sampling designs over case-cohort study designs.


Through extensive simulations based on a dataset from the New York University Women’s Health Study, we showed that the proposed GOF two-phase sampling designs yielded unbiased estimates and generally had higher efficiency than the standard case-cohort study designs.


In cohort studies with rare outcomes, an important design question is how to select informative subjects to reduce sampling costs while maintaining statistical efficiency. Our proposed goodness-of-fit two-phase design provides an efficient alternative to standard case-cohort designs for assessing the association between time-to-event outcomes and risk factors. The method can be conveniently implemented in standard software.



In biomedical studies, sub-cohort sampling designs have been widely used to estimate biomarker-disease associations because of their cost-effectiveness. Wang et al. recently developed novel two-phase sampling designs for binary outcomes [1]. Assuming that an external model is available to relate the outcome to the complete covariates available in the first phase, the designs oversample cases and controls with worse goodness-of-fit (GOF) based on the external model and further match them on complete covariates, similarly to the balanced design [2]. The GOF designs exhibit improved efficiency compared with the case–control design or the balanced case–control design for binary outcomes [1].

In cohort studies that follow subjects over time, a time-to-event outcome (or survival outcome) is commonly of interest. In our motivating study, the New York University Women’s Health Study (NYUWHS), one outcome of interest is time to breast cancer diagnosis, and we are interested in studying the association between hormone biomarkers and breast cancer risk in younger women [3, 4]. Although full-cohort studies provide an ideal setting to study biomarker-disease associations, the combination of large sample sizes, low incidence rates, and high costs (e.g., blood measurements) makes it difficult and costly to measure the biomarkers on the entire cohort [5, 6].

Two-phase sampling designs such as nested case–control (NCC) designs [7] and case-cohort (CC) designs [8] can help overcome this limitation. Previous studies [5, 9, 10] that examined the efficiency of these designs have primarily compared inference procedures rather than sampling designs: e.g., the Prentice [8], Self and Prentice [11], and Lin and Ying [6] methods for un-stratified CC designs and the Borgan I and II methods [12] for stratified CC designs.

In this paper, we extend the novel GOF two-phase sampling designs proposed by Wang et al. [1] for estimating hazard ratio parameters with time-to-event data. Assuming that an external model exists to relate the survival outcome and phase I complete covariates, we propose a sampling strategy that is based on the survival probability computed from the external model as well as the follow-up time, thereby extending the GOF design to survival outcomes. For estimation and inference, we propose to use the inverse probability weighting (IPW) method to account for the sampling design.

The paper is organized as follows. In the Methods section, we describe the sampling designs and estimation procedures of the GOF two-phase sampling designs. The Simulation of NYUWHS data section presents simulation studies evaluating the efficiency of our proposed designs based on the real dataset from the NYUWHS. We conclude with the Discussion section.


Outline and notations

Consider a cohort of \(N\) subjects followed over time. Let \(T=\mathrm{min}\left({T}^{*}, C\right)\) be the observed survival time (or failure time), where \({T}^{*}\) is the true time-to-event (for those who develop the event) and \(C\) is the censoring time (for those who have not developed the event by the end of follow-up). Let \(\delta =I\left({T}^{*}\le C\right)\) denote the event indicator, where the indicator function \(I\left(\cdot\right)\) takes the value 1 if \({T}^{*}\le C\) and 0 otherwise. Let \(X\) denote the collection of phase I covariates that are available for the entire cohort, and \(Z\) denote the phase II covariates (e.g., biomarkers) that can only be measured on a subset of \(m \left(m\ll N\right)\) subjects. We assume that the censoring time and the true survival time are independent conditional on the covariates. The Cox proportional hazards (PH) regression model can be used to describe the relationship between the covariates and the time-to-event outcome,

$$\lambda \left(t\right)={\lambda }_{0}\left(t\right){e}^{{\beta }^{T}X+{\alpha }^{T}Z},\tag{1}$$

where \({\lambda }_{0}\left(t\right)\) is the unknown baseline hazard function, and \(\beta\) and \(\alpha\) are the log HR parameters for covariates \(X\) and \(Z\), respectively. The partial likelihood principle is used to estimate the regression coefficients \(\beta\) and \(\alpha\) while circumventing estimation of the infinite-dimensional baseline hazard function [13, 14].

Goodness-of-fit two-phase sampling design for time-to-event outcome

We first assume that an external working model exists and only depends on \(X\), that is,

$${\lambda }_{e}\left(t|X\right)={\lambda }_{e0}(t){e}^{{\eta }^{T}X},\tag{2}$$

where \({\lambda }_{e0}(t)\), the baseline hazard function, and \(\eta\), the log hazard ratio parameters, are both known or can be obtained from external models. Here and in the sequel, the subscript “\(e\)” denotes the external model. Such preliminary models often exist: e.g., breast cancer risk prediction models such as the Gail model [15, 16]. Either the complete set or a subset of \(X\) can be included in the external model. We compute a GOF-based quantity for subject \(i \left(i=1,\dots ,N\right)\) using the external model while accounting for the length of follow-up, i.e., \(D\left({T}_{i},{\delta }_{i}, {X}_{i}\right)=\left|{\delta }_{i}-{P}_{e}\left(T<{T}_{i}|{X}_{i}\right)\right|=\left|{\delta }_{i}-\left(1-{P}_{e}\left(T\ge {T}_{i}|{X}_{i}\right)\right)\right|=\left|{\delta }_{i}-1+{S}_{e}\left({T}_{i}|{X}_{i}\right)\right|,\) with survival function \({S}_{e}\left(t|X\right)=\mathrm{exp}\left(-{\int }_{0}^{t}{\lambda }_{e}\left(s|X\right)ds\right)\).
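As an illustration of the GOF quantity, the following minimal Python sketch evaluates \(D\) for one subject under a proportional hazards external model. The function names, the Weibull-type cumulative baseline hazard `H0`, and the single coefficient in `eta` are hypothetical choices made for illustration only; they are not the external model used in the paper.

```python
import math

def external_survival(t, x, eta, cum_base_hazard):
    """S_e(t | x) = exp(-H_e0(t) * exp(eta' x)) under a PH external model."""
    linear_predictor = sum(e * xi for e, xi in zip(eta, x))
    return math.exp(-cum_base_hazard(t) * math.exp(linear_predictor))

def gof_quantity(t, delta, x, eta, cum_base_hazard):
    """D(T, delta, X) = |delta - 1 + S_e(T | X)|."""
    return abs(delta - 1 + external_survival(t, x, eta, cum_base_hazard))

# Hypothetical external model (assumption): cumulative baseline hazard
# H_e0(t) = 0.002 * t**0.929 and a single covariate with eta = 0.5.
H0 = lambda t: 0.002 * t ** 0.929
eta = [0.5]

d_case = gof_quantity(10.0, 1, [1.0], eta, H0)     # event observed at t = 10
d_control = gof_quantity(10.0, 0, [1.0], eta, H0)  # censored at t = 10
```

For a case the quantity reduces to \({S}_{e}\left({T}_{i}|{X}_{i}\right)\) and for a control to \(1-{S}_{e}\left({T}_{i}|{X}_{i}\right)\), so cases with high predicted survival and controls with low predicted survival receive the largest values of \(D\).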

Let \(R\) denote whether a subject is selected into phase II, with \(R=1\) indicating selection and \(R=0\) non-selection. We propose to use the quantity \(D\left({T}_{i},{\delta }_{i}, {X}_{i}\right)\) to select \(m\) subjects into phase II, where \(m={\sum }_{i=1}^{N}{R}_{i}\). Because \(D\) measures the goodness-of-fit (GOF) of the external model (2), the GOF two-phase design over-samples subjects who fit the working risk prediction model poorly, as they are potentially more informative and more likely to benefit from adding the new phase II biomarkers to their risk prediction. It is also desirable to achieve a prespecified case–control ratio among the \(m\) phase II subjects, as is commonly done in epidemiological studies. We therefore set the sampling probability \(P\left({R}_{i}=1|{T}_{i}, {\delta }_{i}, {X}_{i}\right)\) to \(D\left({T}_{i}, {\delta }_{i}, {X}_{i}\right)\) multiplied by a constant \({c}_{1} \left({c}_{1}>0\right)\) for cases and \({c}_{0} \left({c}_{0}>0\right)\) for controls, truncated at 1, i.e., \(P\left({R}_{i}=1|{\delta }_{i}=k,{T}_{i}, {X}_{i}\right)=\mathrm{min}\left\{1, {c}_{k}D\left({T}_{i},{\delta }_{i}, {X}_{i}\right)\right\}, k=\) 0 or 1. When it is desirable to include all cases in phase II, the sampling probability for cases can be set to 1, and \({c}_{0}\) is selected to achieve the targeted number of controls through \(P\left({R}_{i}=1|{\delta }_{i}=0,{T}_{i}, {X}_{i}\right)=\mathrm{min}\left\{1, {c}_{0}D\left({T}_{i},{\delta }_{i}, {X}_{i}\right)\right\}.\)
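One way to choose the constant \({c}_{0}\) is to solve for the value at which the expected number of sampled controls, \(\sum_i \mathrm{min}\{1, {c}_{0}{D}_{i}\}\), equals the target. The Python sketch below does this by bisection and then draws the selection indicators. The bisection approach, the helper names, and the toy \(D\) values are assumptions made for illustration; the paper does not state how the constant was computed.

```python
import random

def expected_sample_size(c, d_values):
    """Expected number selected when P(R_i = 1) = min(1, c * D_i)."""
    return sum(min(1.0, c * d) for d in d_values)

def solve_constant(d_values, target, tol=1e-10):
    """Bisection for c > 0 such that sum_i min(1, c * D_i) = target."""
    positive = [d for d in d_values if d > 0]
    if not 0 < target <= len(positive):
        raise ValueError("target must lie in (0, number of subjects with D > 0]")
    lo, hi = 0.0, 1.0 / min(positive)  # at hi, every subject with D > 0 has probability 1
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if expected_sample_size(mid, d_values) < target:
            lo = mid
        else:
            hi = mid
    return hi

# Toy GOF values for 1000 controls (hypothetical, not NYUWHS data).
rng = random.Random(1)
control_d = [rng.uniform(0.01, 0.30) for _ in range(1000)]

c0 = solve_constant(control_d, target=100)
probs = [min(1.0, c0 * d) for d in control_d]
selected = [rng.random() < p for p in probs]                    # R_i ~ Bernoulli(p_i)
weights = [1.0 / p for p, r in zip(probs, selected) if r]       # IPW weights for phase II controls
```

Because selection is random, the realized number of sampled controls varies around the target; only its expectation is fixed by \({c}_{0}\).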

Sub-cohort sampling designs often stratify on some full-cohort covariates (i.e., phase I covariates) for various reasons: i) controlling for confounders, ii) reducing measurement error, and iii) improving the efficiency of the estimates. Briefly, stratified designs first partition the cohort into strata by confounder values (e.g., age group and race), then select random samples of sub-cohort subjects from each stratum. Our GOF sampling designs for survival outcomes can also be implemented by stratifying on phase I covariates, as demonstrated in the Discussion section of Wang et al. [1]. In this case, different sampling probabilities are allowed for different strata when subjects are selected into phase II. We term this design the balanced GOF two-phase sampling design.

Statistical inference for GOF two-phase sampling designs

Directly fitting the Cox PH model (1) to only the phase II subset selected via the GOF two-phase design can lead to biased estimation of the parameters \(\beta\) and \(\alpha\), because the phase II subjects are not a random, representative sample of the full cohort: they are selected based on an external model using information on the outcome and phase I covariates. Thus, we propose to apply the IPW partial likelihood method for analysis, where the inverse sampling probabilities are used as weights. Based on Eq. (1), the weighted partial likelihood function is specified as

$$PL\left(\beta,\alpha\right)=\prod_{i=1}^{m}{\left[\frac{w_i\cdot {e}^{\beta^TX_i+\alpha^TZ_i}}{\sum_{j=1}^mY_j\left(T_i\right)\cdot w_j\cdot e^{\beta^TX_j+\alpha^TZ_j}}\right]}^{\delta_i},\tag{3}$$

where \({Y}_{j}\left(t\right)=I\left({T}_{j}\ge t\right)\) is the at-risk indicator function, and \({w}_{i}=1/P\left({R}_{i}=1|{\delta }_{i}, {T}_{i}, {X}_{i}\right)\).
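To make the weighted partial likelihood concrete, the following Python sketch evaluates its logarithm exactly as displayed above, for a given coefficient vector and phase II data. The function name and the toy data are hypothetical, and covariates from \(X\) and \(Z\) are stacked into one row per subject. Note that standard software typically multiplies each case's log contribution by its weight instead; the two forms coincide when all cases are sampled with probability 1, so that \({w}_{i}=1\) for every case.

```python
import math

def weighted_log_partial_likelihood(coef, times, events, covariates, weights):
    """Log of the displayed weighted partial likelihood.

    Each case i contributes
        log(w_i) + lp_i - log(sum over j with T_j >= T_i of w_j * exp(lp_j)),
    where lp is the linear predictor beta'X + alpha'Z.
    """
    lp = [sum(b * v for b, v in zip(coef, row)) for row in covariates]
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i != 1:
            continue  # censored subjects enter only through the risk sets
        risk_sum = sum(w * math.exp(l)
                       for t, w, l in zip(times, weights, lp) if t >= t_i)
        total += math.log(weights[i]) + lp[i] - math.log(risk_sum)
    return total

# Toy phase II data (hypothetical): all cases kept (weight 1), controls weight 4.
times = [2.0, 5.0, 3.0, 8.0]
events = [1, 0, 1, 0]
covariates = [[1.0], [0.0], [0.5], [0.2]]
weights = [1.0, 4.0, 1.0, 4.0]

ll = weighted_log_partial_likelihood([0.3], times, events, covariates, weights)
```

Maximizing this function over the coefficients (for example with a Newton-type routine) would give the IPW estimates; the sketch only evaluates the objective.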

For implementation, \(\widehat{\beta }\), \(\widehat{\alpha }\), and their standard errors can be obtained directly from standard statistical software by fitting the weighted Cox PH regression model to the phase II data (e.g., the coxph function in the R package “survival”, with the inverse of the sampling probability \(P\left({R}_{i}=1|{\delta }_{i}, {T}_{i}, {X}_{i}\right)\) supplied in the weights argument) [17]. Because the weights are calculated from the external model, the standard errors of the estimates are calculated using the robust variance formula, obtained by specifying the option robust = TRUE in the coxph function. Under this assumption, the variability of weight estimation is not accounted for when evaluating the standard errors of the hazard ratios of the main model. When the weights are estimated using preliminary data, other approaches such as the delta method or bootstrapping should be considered to properly account for the variability of the weight estimation.

Simulation of NYUWHS data

Data generating process

Our simulation designs were based on the NYUWHS, which included 6550 women younger than 50 years of age at enrollment and whose objective was to identify risk factors for breast cancer in young women [3, 4]. As phase I covariates, we used the real values of the risk factors, including age at enrollment (AGE; continuous), age at menarche (AGEMEN; continuous), history of benign breast biopsy (BIOPSY; yes or no), experience of full-term pregnancy (FTP; yes or no), family history of breast cancer (REL; yes or no), and race (RACE; white or non-white). Given these covariates, we generated the time to breast cancer onset from Eq. (1), where \(X\) denoted the set of phase I covariates and a biomarker \(Z\), the phase II covariate, was simulated as \(Z=-2.15+0.05Age+\epsilon , \epsilon \sim N\left(0, 1\right)\) to yield a correlation of approximately 0.2 between \(Z\) and AGE. The true parameter vector \(\beta\) for the phase I covariates \(X\) was set to \({\left(0.028, -0.034, 0.431, -0.105, 0.541, 0.347\right)}^{T}\) based on the NYUWHS full-cohort analysis. We set \(\alpha\), the coefficient for \(Z\), to 0.2 or 0.5, corresponding to a weak or strong biomarker association with disease risk, respectively. The baseline hazard function, \({\lambda }_{0}\left(t\right)\), was assumed to follow \(Weibull(k=0.929, \lambda =0.002)\). Random censoring times were independently generated from \(\mathrm{min}\left(exp\left({\lambda }^{*}\right), 25\right)\), where \({\lambda }^{*}\) was set to yield an event rate of approximately 5% or 10%.
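The data-generating steps can be sketched in Python as follows. Several details here are assumptions, because the text does not fix them: the Weibull baseline is parameterized through the cumulative hazard \(H_0(t)=\lambda t^{k}\), \(exp\left({\lambda }^{*}\right)\) is read as an exponential distribution with rate \({\lambda }^{*}\), the censoring rate of 0.05 is arbitrary, and the toy cohort uses only AGE (drawn uniformly) in place of the six real NYUWHS covariates.

```python
import math
import random

def simulate_biomarker(age, rng):
    """Z = -2.15 + 0.05 * AGE + eps, eps ~ N(0, 1), as in the text."""
    return -2.15 + 0.05 * age + rng.gauss(0.0, 1.0)

def simulate_event_time(linear_predictor, rng, shape=0.929, rate=0.002):
    """Inverse-transform draw from a PH model with H0(t) = rate * t**shape
    (assumed parameterization of the Weibull baseline)."""
    u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is finite
    return (-math.log(u) / (rate * math.exp(linear_predictor))) ** (1.0 / shape)

def simulate_observation(linear_predictor, rng, censor_rate, max_follow_up=25.0):
    """Return (T, delta) with T = min(T*, C) and C = min(Exp(censor_rate), 25)."""
    t_star = simulate_event_time(linear_predictor, rng)
    c = min(rng.expovariate(censor_rate), max_follow_up)
    return min(t_star, c), int(t_star <= c)

# Toy cohort (hypothetical): AGE only, with beta_AGE = 0.028 and alpha = 0.2.
rng = random.Random(2023)
cohort = []
for _ in range(2000):
    age = rng.uniform(35.0, 50.0)
    z = simulate_biomarker(age, rng)
    t, delta = simulate_observation(0.028 * age + 0.2 * z, rng, censor_rate=0.05)
    cohort.append((age, z, t, delta))

event_rate = sum(delta for _, _, _, delta in cohort) / len(cohort)
```

In practice the censoring rate would be tuned, for example by trial and error, until `event_rate` is close to the targeted 5% or 10%.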

Comparisons of sub-cohort sampling designs

In each simulation, the full-cohort analysis results were considered the gold standard. For the GOF two-phase sampling designs, we selected phase II subjects using the sampling probability based on the GOF quantity from the external model, which was independently developed from a working Cox PH model \({\lambda }^{e}(t)={\lambda }_{0}^{e}(t){e}^{{\eta }^{T}X}\) using 10,000 samples bootstrapped from the full cohort data. To be comparable with the case-cohort designs, we selected all cases and used a constant \({c}_{0}\) to ensure 1-to-1 or 1-to-2 case–control ratios. We generated case-cohort data in which the sub-cohort was randomly selected with a size chosen so that the sample sizes of our GOF two-phase sampling designs and the CC designs were almost the same. Two different stratification procedures were considered: (i) unstratified and (ii) stratified by the median of the AGE variable.

We applied the standard partial likelihood method to the full-cohort data and the IPW method to our GOF sampled data. As commonly used methods for CC designs, the Prentice and Borgan I methods were applied to un-stratified and stratified CC data, respectively. Because both the Prentice and Borgan I methods use the inverse of the sub-cohort selection probability as individual weights, the estimation technique is essentially the same as our IPW method under the GOF two-phase designs. Therefore, differences in the simulation results can readily be interpreted as the consequence of using different designs. Furthermore, we applied the semiparametric maximum-likelihood estimation (SMLE) approach, which is known to be efficient among sub-cohort sampling methods, using the R package “TwoPhaseReg” [18]. The SMLE method models the conditional probability of the phase II covariates given the phase I covariates in the likelihood function using a B-spline sieve approximation. Even though SMLE can accommodate continuous phase I covariates when analyzing two-phase data, the dimensionality of the phase I covariates has to be small [18].

Measures of model performance

The bias and standard deviation of the log hazard ratio estimates were reported as performance measures of the methods. The asymptotic standard error of the estimated log HR and the coverage probability (CP) of the 95% confidence interval (CI) were also obtained to evaluate the precision of the estimates. To compare efficiency between methods, we computed the relative efficiency as the averaged ratio of the asymptotic variances of the two methods. With a large number of phase I covariates and a large sample size (e.g., over 5,000 as in the NYUWHS), the SMLE method was extremely time-consuming. Thus, we used a random sample of \(N=2000\) subjects from the full NYUWHS cohort in each simulation and ran 500 simulations. To investigate the type I error and power of our proposed GOF two-phase designs, we additionally conducted 5,000 simulations with event rates of 5% and 10% and true \(\alpha =0.2\) and \(0.5\). All computations were conducted in R (version 4.0.3).
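The performance measures above can be computed from per-replicate estimates and standard errors as in the following Python sketch. The function names and toy inputs are hypothetical, and the 95% CI is taken to be a Wald interval (estimate ± 1.96 × SE), which is an assumption not stated in the text.

```python
import statistics

def performance(estimates, std_errors, truth, z=1.959964):
    """Bias, empirical SD, mean asymptotic SE, and 95% Wald CI coverage."""
    covered = [abs(est - truth) <= z * se
               for est, se in zip(estimates, std_errors)]
    return {
        "bias": statistics.fmean(estimates) - truth,
        "emp_sd": statistics.stdev(estimates),
        "mean_se": statistics.fmean(std_errors),
        "coverage": sum(covered) / len(covered),
    }

def relative_efficiency(se_numerator, se_denominator):
    """Average over replicates of the ratio of asymptotic variances."""
    ratios = [(a / b) ** 2 for a, b in zip(se_numerator, se_denominator)]
    return statistics.fmean(ratios)

# Toy replicate results for a true alpha of 0.2 (hypothetical numbers).
alpha_hat = [0.18, 0.25, 0.21, 0.15, 0.22]
se_ipw = [0.10, 0.11, 0.10, 0.09, 0.10]
se_smle = [0.09, 0.10, 0.09, 0.085, 0.095]

summary = performance(alpha_hat, se_ipw, truth=0.2)
re_smle_vs_ipw = relative_efficiency(se_smle, se_ipw)
```

With the variance of one method in the numerator and that of a comparison method in the denominator, a value below 1 means the numerator method has the smaller asymptotic variance.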

Simulation results

The results on estimation of the biomarker’s coefficient \(\alpha\) are presented in Table 1. Under all sampling designs, including the full-cohort design, the estimates of \(\alpha\) had negligible bias. The CPs of the 95% CIs for \(\alpha\) were close to the nominal level for all methods, indicating that the standard error estimates were accurate. As expected, the full-cohort analysis showed the highest efficiency (i.e., the lowest standard deviation of the estimates). In general, the proposed GOF two-phase sampling designs showed better efficiency than the standard CC designs, and the SMLE method was more efficient than the IPW method and the weighted methods for CC designs.

Table 1 Performance measures of the simulated biomarker coefficient \(\left(\widehat{\alpha }\right)\): Bias (emp SD; CP)

We visualized the standard error of the estimated \(\alpha\) for the 5% event rate (Fig. 1). The results for the 10% event rate were similar (Supplementary Fig. 1). Our proposed GOF two-phase sampling designs generally had higher efficiency than the standard CC designs. The SMLE method and the IPW method under the GOF two-phase sampling design were comparably efficient. The relative efficiencies based on the asymptotic variance of \(\widehat{\alpha }\) are summarized in Table 2. In general, the proposed GOF two-phase design was more efficient than the standard CC design. When we compared each method (the denominator of the relative efficiency) with the SMLE method (the numerator) under the GOF two-phase design, the relative efficiency of the IPW method ranged from 0.75 to 0.95 (i.e., a 5–25% efficiency loss), while the standard methods under the CC designs had an additional 40–50% efficiency loss. We note that computing the SMLE can be expensive when the number of biomarkers and covariates increases. Therefore, our simulations demonstrated the value of the novel sampling design, which improves the efficiency of two-phase sample collection using an easily implemented estimation method and is scalable to studies with large sample sizes and large numbers of biomarkers and covariates. The estimates for all phase I covariates were also unbiased and showed reasonable efficiency under our proposed two-phase designs (Supplementary Tables 1 to 6).

Fig. 1

Asymptotic standard error of the estimated log HR for simulated biomarker \(\left(\widehat{\alpha }\right)\) under 5% of the event rate. Abbreviations: Standard Cox PH model (Cox); full cohort design (Full cohort); IPW based Cox PH model (IPW); GOF two-phase sampling design (Two-phase); semiparametric maximum-likelihood method (SMLE); Prentice method as unstratified approach and Borgan I method as stratified approach (Standard); standard case-cohort design (CC). Note that we describe each method under each design as method:design using the abbreviations

Table 2 Performance measures of the simulated biomarker coefficient \(\left(\widehat{\alpha }\right)\): Relative efficiency of the asymptotic variance of \(\widehat{\alpha }\) under SMLE relative to each method

As shown in Table 3, the empirical type I error rate was close to the nominal level of 0.05. The power of our proposed two-phase design to reject the null hypothesis increased as the true \(\alpha\) deviated further from zero and as the event rate increased. As expected, full-cohort designs showed higher power than our proposed two-phase designs.
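Empirical type I error and power are both rejection rates across simulation replicates: the former when the data are generated with \(\alpha =0\), the latter when \(\alpha \ne 0\). The Python sketch below assumes a two-sided Wald test at the 0.05 level, which the text does not specify; the function name and toy inputs are hypothetical.

```python
def rejection_rate(estimates, std_errors, z_crit=1.959964):
    """Share of replicates in which |estimate / SE| exceeds the critical value."""
    rejections = [abs(est / se) > z_crit
                  for est, se in zip(estimates, std_errors)]
    return sum(rejections) / len(rejections)

# Toy replicate results (hypothetical numbers, not from Table 3).
alpha_hat = [0.05, 0.31, -0.02, 0.27, 0.22]
se_hat = [0.10, 0.11, 0.10, 0.12, 0.10]

rate = rejection_rate(alpha_hat, se_hat)
```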

Table 3 Type I error and power of the simulated biomarker


Motivated by common epidemiologic time-to-event analyses, for instance, to identify risk factors of a disease in prospective cohorts, we extended the GOF two-phase sampling designs proposed by Wang et al. [1] for binary outcomes to time-to-event outcomes. We followed their approach of oversampling subjects who show poor goodness-of-fit based on an external model. We based our simulations on data from an existing study of risk factors for breast cancer in a prospective cohort, the NYUWHS. Through extensive simulations, we empirically compared our proposed method with full-cohort analysis, standard weighting methods under the CC designs, and the SMLE method under both GOF two-phase sampling and CC designs. Our simulations demonstrated that inverse probability weighting methods generally showed higher efficiency under our proposed GOF two-phase sampling designs than under the standard CC designs. Furthermore, the IPW method performed well in terms of both unbiasedness and efficiency under the GOF two-phase sampling design. Notably, balanced GOF designs achieved additional efficiency, in particular for estimating the coefficients of the covariates used for stratification (Supplementary Tables 1 and 4). This finding is consistent with the case of binary outcomes in the previous study [1]. We also investigated the efficiency gain at different levels of correlation between the AGE variable and the simulated biomarker. Our proposed GOF two-phase designs consistently showed higher efficiency (i.e., a relative efficiency below 1) than the standard CC designs (Supplementary Table 7).

In addition to the simulated external model used in the Simulation of NYUWHS data section, we conducted simulations in which the GOF sampling probability was generated using the Gail model [15] as implemented in the R package “BCRA”, which provides risk projections of invasive breast cancer according to the National Cancer Institute’s Breast Cancer Risk Assessment Tool algorithm [19]. Specifically, we followed the same simulation setup with a 5% event rate, true \(\alpha =0.2\), and a 1-to-1 case–control ratio. Using all 6550 subjects from the NYUWHS cohort, we compared the proposed GOF two-phase designs with standard case-cohort designs. The simulation results demonstrated that the proposed GOF two-phase sampling design maintained higher efficiency (30–40% efficiency gain) than the standard CC designs (Supplementary Table 8).

Even though the SMLE promises the highest efficiency for analyzing two-phase data, it has practical limitations: i) the number of phase I covariates has to be small, especially when the covariates are continuous, and ii) the computational time depends heavily on the sample size. When the number of phase I covariates increases along with the sample size, the numerical cost of implementing SMLE becomes too high for practical use. In contrast, the IPW method can be conveniently implemented in standard software. Furthermore, rather than randomly sampling the sub-cohort as in the standard CC designs, the proposed GOF two-phase sampling design provides a new perspective for defining “informative” subjects for efficient sampling, especially with respect to the potential value added by the phase II covariates to risk characterization or prediction. By oversampling subjects with worse goodness-of-fit based on an external model, the design can include those more “informative” subjects and thus lead to an efficiency gain. The key idea of our proposed GOF two-phase design, as in Wang et al. [1], is that lack of fit suggests the need to include the phase II covariates in the model to achieve better goodness-of-fit. Lastly, our proposed GOF two-phase sampling designs with the IPW method for analysis are readily scalable in cohort studies even when the sample size is large and the event rate is low.

Availability of data and materials

The datasets used and R code for this study are not publicly available but are available from the corresponding author on reasonable request.


  1. Wang L, et al. Novel two-phase sampling designs for studying binary outcomes. Biometrics. 2020;76(1):210–23.

  2. Breslow N, Cain K. Logistic regression for two-stage case-control data. Biometrika. 1988;75(1):11–20.

  3. Clendenen TV, et al. Breast cancer risk prediction in women aged 35–50 years: impact of including sex hormone concentrations in the Gail model. Breast Cancer Res. 2019;21(1):1–12.

  4. Ge W, et al. Circulating anti-Müllerian hormone and breast cancer risk: a study in ten prospective cohorts. Int J Cancer. 2018;142(11):2215–26.

  5. Ganna A, et al. Risk prediction measures for case-cohort and nested case-control designs: an application to cardiovascular disease. Am J Epidemiol. 2012;175(7):715–24.

  6. Lin D, Ying Z. Cox regression with incomplete covariate measurements. J Am Stat Assoc. 1993;88(424):1341–9.

  7. Liddell F, McDonald J, Thomas D. Methods of cohort analysis: appraisal by application to asbestos mining. J R Stat Soc Ser A (General). 1977;140(4):469–83.

  8. Prentice RL. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika. 1986;73(1):1–11.

  9. Kim RS. A new comparison of nested case–control and case–cohort designs and methods. Eur J Epidemiol. 2015;30(3):197–207.

  10. Lee M, Zeleniuch-Jacquotte A, Liu M. Empirical evaluation of sub-cohort sampling designs for risk prediction modeling. J Appl Stat. 2021;48(8):1374–401.

  11. Self SG, Prentice RL. Asymptotic distribution theory and efficiency results for case-cohort studies. Ann Stat. 1988;16(1):64–81.

  12. Borgan O, et al. Exposure stratified case-cohort designs. Lifetime Data Anal. 2000;6(1):39–58.

  13. Cox DR. Regression models and life-tables. J Roy Stat Soc Ser B (Methodol). 1972;34(2):187–202.

  14. Cox DR. Partial likelihood. Biometrika. 1975;62(2):269–76.

  15. Gail MH, et al. Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. J Natl Cancer Inst. 1989;81(24):1879–86.

  16. Kim G, Bahl M. Assessing risk of breast cancer: a review of risk prediction models. J Breast Imaging. 2021;3(2):144–55.

  17. Therneau TM, Lumley T. Package ‘survival.’ R Top Doc. 2015;128(10):28–33.

  18. Tao R, Zeng D, Lin D-Y. Efficient semiparametric inference under two-phase sampling, with applications to genetic association studies. J Am Stat Assoc. 2017;112(520):1468–76.

  19. Zhang F, et al. Package ‘BCRA’. 2020.


The authors would like to thank the editor and two reviewers for their constructive and insightful suggestions that greatly improved the paper.


This work was partially supported by the U.S. National Institutes of Health grants R01ES032808, 5U01CA182934, 5R01CA178949, R01HL138306 and R01CA236468.

Author information




MLee, JC, MLiu contributed to the conception and design of the study, prepared the dataset, performed the simulation study, and contributed to the writing of the manuscript. AZ contributed the NYUWHS dataset and the design of the simulation studies. All authors contributed to data interpretation and revised the intellectual content of the manuscript and approved the final version of this manuscript.

Corresponding author

Correspondence to Mengling Liu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1:

 Supplementary table 1. Simulation results for 5% event rate: Bias (SD). Supplementary table 2. Simulation results for 5% event rate: average SE (SD of SE). Supplementary table 3. Simulation results for 5% event rate: coverage probability for 95% CI. Supplementary table 4. Simulation results for 10% event rate: Bias (SD). Supplementary table 5. Simulation results for 10% event rate: average SE (SD of SE). Supplementary table 6. Simulation results for 10% event rate: coverage probability for 95% CI. Supplementary table 7. Relative efficiency of the asymptotic variance under GOF two-phase designs to standard CC designs by different level of correlations. Supplementary table 8. Performance measures: Bias (SD), asymptotic standard error (SD of SE) and coverage probability for 95% CI for additional simulations using the external BCRA model. Supplementary figure 1. Asymptotic standard error (SE) of the estimated log HR for simulated biomarker \(\left(\widehat{\alpha }\right)\) from each method under each simulation setting of 10% event rate.

Rights and permissions

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Lee, M., Chen, J., Zeleniuch-Jacquotte, A. et al. Goodness-of-fit two-phase sampling designs for time-to-event outcomes: a simulation study based on New York University Women’s Health Study for breast cancer. BMC Med Res Methodol 23, 119 (2023).
