- Research
- Open access
- Published:
The iterative bisection procedure: a useful tool for determining parameter values in data-generating processes in Monte Carlo simulations
BMC Medical Research Methodology volume 23, Article number: 45 (2023)
Abstract
Background
Data-generating processes are key to the design of Monte Carlo simulations. It is important for investigators to be able to simulate data with specific characteristics.
Methods
We described an iterative bisection procedure that can be used to determine the numeric values of parameters of a data-generating process to produce simulated samples with specified characteristics. We illustrated the application of the procedure in four different scenarios: (i) simulating binary outcome data from a logistic model such that the prevalence of the outcome is equal to a specified value; (ii) simulating binary outcome data from a logistic model based on treatment status and baseline covariates so that the simulated outcomes have a specified treatment relative risk; (iii) simulating binary outcome data from a logistic model so that the model c-statistic has a specified value; (iv) simulating time-to-event outcome data from a Cox proportional hazards model so that treatment induces a specified marginal or population-average hazard ratio.
Results
In each of the four scenarios the bisection procedure converged rapidly and identified parameter values that resulted in the simulated data having the desired characteristics.
Conclusion
An iterative bisection procedure can be used to identify numeric values for parameters in data-generating processes to generate data with specified characteristics.
Introduction
Monte Carlo simulations are a critical tool in modern statistical research [1, 2]. Simulations allow one to investigate the properties of statistical estimators and procedures in settings in which analytic calculations are not feasible. A crucial component of any simulation is a data-generating process that allows the investigator to simulate data with specified characteristics. While a data-generating process can often be quickly constructed, it is more difficult to specify the values of the parameters of the data-generating process to result in the simulated data having specified characteristics.
For instance, given a set of baseline covariates simulated from a multivariate distribution, a logistic model can be used to simulate binary outcomes so as to induce an odds ratio of a specified magnitude for the association between a variable denoting treatment status (treatment vs. control) and the outcome. However, if one wanted to induce a treatment effect with a specific relative risk or risk difference, rather than a specific odds ratio, one would need to identify the given log-odds ratio for treatment that resulted in the desired relative risk or risk difference [3, 4]. This log-odds ratio would depend on the distribution of baseline covariates. Similarly, if one wanted to simulate binary outcome data such that the logistic regression model for the outcome had a specified c-statistic (equivalent to the area under the receiver operating characteristic (ROC) curve), one would need to determine the regression coefficients for the logistic regression model that result in the desired c-statistic.
We describe an iterative bisection procedure that allows researchers to determine the required value of parameters in a data-generating process to result in simulated data with the desired characteristics. We illustrate the iterative bisection procedure by applying it to four different examples. In “Determining the intercept of a logistic regression model so that prevalence of treatment is equal to a specified value when using a logistic regression model to simulate outcomes” section, we apply the iterative bisection procedure to construct a data-generating process for simulating binary outcomes from a multivariable logistic regression model so that the prevalence of the outcome in the population is equal to a specified probability. In “Determining the odds ratio for a binary treatment variable in a logistic regression model to induce a desired treatment risk difference or relative risk in the population” section, we apply the bisection procedure to construct a data-generating process for simulating binary outcomes using a multivariable logistic regression model such that a binary treatment (or exposure) induces a relative risk of a given magnitude. In “Determining the regression coefficients for a logistic regression model so that the model has a specified c-statistic” section, we apply the bisection procedure to construct a data-generating process for simulating binary outcomes from a multivariable logistic regression model with a specified c-statistic. In “Determining the conditional hazard ratio for treatment/exposure in an adjusted Cox regression model to induce a specified marginal hazard ratio” section, we apply the bisection procedure to construct a data-generating process for simulating time-to-event outcomes with a specified marginal hazard ratio for treatment. Finally, we provide a summary in “Discussion” section.
Determining the intercept of a logistic regression model so that prevalence of treatment is equal to a specified value when using a logistic regression model to simulate outcomes
Description of method
In this section we consider a setting in which one wants to simulate a binary outcome that is related to a vector of covariates, such that the prevalence of the outcome in the population is equal to a specified value. Let \({\text{P}}^{{{\text{target}}}}\) denote the specified or target prevalence of the outcome in the population.
The first step is to simulate a vector of covariates for each subject in a large super-population, say of size N = 1,000,000. The distribution of the baseline covariates can be chosen by the investigator. The application of the bisection procedure is independent of this distributional decision. Assume that we simulate p baseline covariates (\(X_{1} , \ldots ,X_{{\text{p}}}\)) from a given multivariable distribution.
The second step is to specify a logistic regression model for generating the binary outcomes:
The regression coefficients \(\beta_{1} , \ldots ,\beta_{{\text{p}}}\) can be chosen by the investigator to reflect the desired relationship between each of the p covariates and the log-odds of the outcome. The prevalence of the outcome in the population is primarily determined by the intercept, \(\beta_{0}\). One must determine the value of the intercept that produces the desired prevalence of the outcome. Lower values of \(\beta_{0}\) are associated with lower prevalences of the outcome, while higher values of \(\beta_{0}\) are associated with higher prevalences of the outcome. For a given value of \(\beta_{0}\) we can simulate a binary outcome for each subject in the super-population from a Bernoulli distribution with subject-specific parameter determined by formula (1). Let \({\text{Y}}_{i}^{{\beta_{0} }}\) denote the simulated outcome for the ith subject when the intercept for the regression model (1) is set equal to \(\beta_{0}\).
The next step is to specify the endpoints of an interval for the parameter of interest; in this case the regression intercept, \(\beta_{0}\). Denote this interval by \((\beta_{0}^{{{\text{lower}}}} ,\beta_{0}^{{{\text{upper}}}} )\). The lower endpoint \(\beta_{0}^{{{\text{lower}}}}\) is chosen such that \(\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{Y}}_{i}^{{{\beta_{0}{{\text{lower}}}} }} } < {\text{P}}^{{{\text{target}}}}\). In other words, the prevalence of the simulated outcome is less than the target value when using \(\beta_{0}^{{{\text{lower}}}}\). Similarly, the upper endpoint \(\beta_{0}^{{{\text{upper}}}}\) is chosen such that \(\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{Y}}_{i}^{{{\beta_{0}{{\text{upper}}}} }} } {\text{ > P}}^{{{\text{target}}}}\). In other words, the prevalence of the simulated outcome is greater than the target value when using \(\beta_{0}^{{{\text{upper}}}}\). The endpoints can be identified through a grid search or by trial and error.
Once the endpoints of the interval \((\beta_{0}^{{{\text{lower}}}} ,\beta_{0}^{{{\text{upper}}}} )\) have been determined, compute the midpoint of the interval: \(\beta_{0}^{{{\text{midpoint}}}} = \frac{{\beta_{0}^{{{\text{lower}}}} + \beta_{0}^{{{\text{upper}}}} }}{2}\) (e.g., if the original endpoints of the interval are ± 10, the original midpoint will be 0). We then use \(\beta_{0}^{{{\text{midpoint}}}}\) in formula (1) and simulate a binary outcome for each subject: \({\text{Y}}_{i}^{{{\beta_{0}^{{\text{midpoint}}}} }}\). We then compute the prevalence of the outcome in the super-population: \(\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{Y}}_{i}^{{{\beta_{0}^{{\text{midpoint}}}} }} }\). If \(\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{Y}}_{i}^{{{\beta_{0}^{{\text{midpoint}}}} }} } < {\text{P}}^{{{\text{target}}}}\), the prevalence of the outcome is too low and the intercept of formula (1) has to be increased. If \(\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{Y}}_{i}^{{{\beta_{0}^{{\text{midpoint}}}} }} } {\text{ > P}}^{{{\text{target}}}}\), the prevalence is too high and the intercept of formula (1) has to be decreased.
If \(\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{Y}}_{i}^{{{\beta_{0}^{{\text{midpoint}}}} }} } < {\text{P}}^{{{\text{target}}}}\), then define a new interval: \((\beta_{0}^{{{\text{midpoint}}}} ,\beta_{0}^{{{\text{upper}}}} )\). Conversely, if \(\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{Y}}_{i}^{{{\beta_{0}^{{\text{midpoint}}}} }} } {\text{ > P}}^{{{\text{target}}}}\), then define a new interval \((\beta_{0}^{{{\text{lower}}}} ,\beta_{0}^{{{\text{midpoint}}}} )\). In the first case, the new interval is the upper half of the initial interval, while the in the second case the new interval is the lower half of the initial interval. In either case, the width of the new interval is half the width of the initial interval. We have bisected the initial interval. One then repeats this process iteratively. After K iterations, the width of the resultant interval is \(\frac{1}{{2^{K} }}\) of the width of the initial interval. The iterative process can be continued until \(\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{Y}}_{i}^{{{\beta_{0}^{{\text{midpoint}}}} }} }\) is as close to \({\text{P}}^{{{\text{target}}}}\) as desired.
Application of method
We applied the iterative bisection procedure to simulate data for a sample of size N = 1,000,000. We simulated 10 baseline covariates. The first 5 from independent standard normal distributions, and the last five from independent Bernoulli distributions with parameter 0.5. The regression coefficients (equivalent to log-odds ratios) for the 10 covariates were set to \(\begin{gathered} \beta_{1} = \log (1.25),\;\beta_{2} = \log (1.5),\;\beta_{3} = \log (1.75),\;\beta_{4} = \log (2),\;\beta_{5} = \log (2.5),\; \hfill \\ \beta_{6} = \log (1.25),\;\beta_{7} = \log (1.5),\;\beta_{8} = \log (1.75),\;\beta_{9} = \log (2),\;\beta_{10} = \log (2.5). \hfill \\ \end{gathered}\).
Our objective was to simulate data such that the prevalence of the outcome was 0.10 (10%). The initial interval for \((\beta_{0}^{{{\text{lower}}}} ,\beta_{0}^{{{\text{upper}}}} )\) was set to (-10,10). R code to implement the bisection procedure is provided at the author’s GitHub account [https://github.com/peter-austin/BMC_MRM-bisection-procedures-for-Monte-Carlo-simulations]. The estimates of \(\beta_{0}^{{{\text{midpoint}}}} {\text{ and }}\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{Y}}_{i}^{{{\beta_{0}^{{\text{midpoint}}}} }} }\) at each iteration are reported in Table 1. After 14 iterations of the procedure, an intercept equal to -4.368896 resulted in the generation of outcomes such that the prevalence of the outcome was 0.099923.
Determining the odds ratio for a binary treatment variable in a logistic regression model to induce a desired treatment risk difference or relative risk in the population
Description of method
The logistic regression model is commonly-used in biomedical and epidemiological research for assessing the association between a binary outcome and a set of covariates [5]. When using a logistic regression model, the odds ratio is the resultant measure of association. The odds ratio denotes the relative increase in the odds of the binary outcome associated with a one unit increase in the given covariate. Other measures of effect for binary outcomes include: the risk difference, the relative risk, and the number needed to treat, where the latter is the reciprocal of the risk difference. Several clinical commentators have suggested that these latter three measures of effect are preferable to the odds ratio for clinical decision making [6,7,8,9].
To study the performance of statistical methods for estimating risk differences or relative risks, one requires a data-generating process that can simulate data with a given risk difference or relative risk [3, 4]. We assume that our data-generating process for simulating outcomes is a modification of the one described above. We modify the logistic regression model as follows:
The model has been modified by including a binary treatment variable (Z = 1 treated; Z = 0 control) with an associated log-odds ratio of \(\gamma\). Thus, treatment is associated with an increase of \(\gamma\) in the log-odds of the outcome. Let \({\text{RR}}^{{{\text{target}}}}\) denote the target treatment relative risk in the population.
The first step is to simulate baseline covariates \(X_{1} , \ldots ,X_{{\text{p}}}\) from a chosen distribution. One can then simulate treatment status using methods described in “Determining the intercept of a logistic regression model so that prevalence of treatment is equal to a specified value when using a logistic regression model to simulate outcomes” section, so that receipt of treatment has a specified association with each of the baseline covariates and so that the prevalence of treatment in the population is equal to the specified value.
The second step is to set the regression coefficients associated with the baseline covariates in formula (2) to the desired quantities. One can use the methods described in “Determining the intercept of a logistic regression model so that prevalence of treatment is equal to a specified value when using a logistic regression model to simulate outcomes” section to determine the intercept (\(\beta_{0}\)) of formula (2) so that the prevalence of the outcome in the population if no one were treated is equal to a specified value.
We introduce the potential outcomes framework, as this facilitates identifying the appropriate value of \(\gamma\) [10]. Given a binary treatment Z, let Y(1) and Y(0) denote a subject’s outcomes under treatment (Z = 1) and control (Z = 0) if received under identical circumstances. The average treatment effect (ATE) is defined as E[Y(1) – Y(0)]. The marginal value of the relative risk is defined as E[Y(1)]/E[Y(0)].
The population relative risk due to treatment is determined by the log-odds ratio for treatment, \(\gamma\). One must determine the value of \(\gamma\) that results in the desired relative risk. As the value of \(\gamma\) increases, the relative risk increases. Lower values of \(\gamma\) are associated with lower relative risks, while higher values of \(\gamma\) are associated with higher relative risks. For a given value of \(\gamma\) we can simulate the two potential outcomes for each subject using formula (2). First, we set Z = 0 (control) for all subjects in the super-population and simulate a binary outcome for each subject in the super-population from a Bernoulli distribution with subject-specific parameter determined by formula (2). Let \({\text{Y(0)}}_{i}^{\gamma }\) denote the simulated outcome under control for the ith subject when the log-odds ratio for treatment in regression model (2) is set equal to \(\gamma\). Second, we set Z = 1 (treated) for all subjects in the super-population and simulate a binary outcome for each subject in the super-population from a Bernoulli distribution with subject-specific parameter determined by formula (2). Let \({\text{Y(1)}}_{i}^{\gamma }\) denote the simulated outcome under treatment for the ith subject when log-odds ratio for treatment in regression model (2) is set equal to \(\gamma\). The population relative risk when the log-odds ratio for treatment is set to \(\gamma\) is equal to \({\text{E[Y(1)}}_{i}^{\gamma } {\text{]/Y[(0)}}_{i}^{\gamma } ] = \frac{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{[Y(1)}}_{i}^{\gamma } ]} }}{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{[Y(0)}}_{i}^{\gamma } ]} }}\).
The next step is to specify the endpoints of an interval for the log-odds ratio for treatment, \(\gamma\). Denote this interval by \((\gamma_{{}}^{{{\text{lower}}}} ,\gamma_{{}}^{{{\text{upper}}}} )\). The lower endpoint \(\gamma^{{{\text{lower}}}}\) is chosen such that \(\frac{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{[Y(1)}}_{i}^{\gamma } ]} }}{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{[Y(0)}}_{i}^{\gamma } ]} }} < {\text{RR}}^{{{\text{target}}}}\). Similarly, the upper endpoint \(\gamma^{{{\text{upper}}}}\) is chosen such that \(\frac{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{[Y(1)}}_{i}^{\gamma } ]} }}{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{[Y(0)}}_{i}^{\gamma } ]} }}{\text{ > RR}}^{{{\text{target}}}}\). The endpoints can be identified through a grid search or by trial and error.
Once the endpoints of the interval \((\gamma_{{}}^{{{\text{lower}}}} ,\gamma_{{}}^{{{\text{upper}}}} )\) have been determined, compute the midpoint of the interval: \(\gamma^{{{\text{midpoint}}}} = \frac{{\gamma^{{{\text{lower}}}} + \gamma^{{{\text{upper}}}} }}{2}\). We then use \(\gamma^{{{\text{midpoint}}}}\) in formula (2) and simulate the two potential outcomes under treatment and control for each subject: \({\text{Y(1)}}_{i}^{{\gamma^{{{\text{midpoint}}}} }}\) and \({\text{Y(0)}}_{i}^{{\gamma^{{{\text{midpoint}}}} }}\). We then compute the treatment relative risk in the super-population: \(\frac{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{[Y(1)}}_{i}^{{\gamma^{{{\text{midpoint}}}} }} ]} }}{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{[Y(0)}}_{i}^{{\gamma^{{{\text{midpoint}}}} }} ]} }}\). If \(\frac{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{[Y(1)}}_{i}^{{\gamma^{{{\text{midpoint}}}} }} ]} }}{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{[Y(0)}}_{i}^{{\gamma^{{{\text{midpoint}}}} }} ]} }} < {\text{RR}}^{{{\text{target}}}}\), the relative risk is too low and \(\gamma\) in formula (2) has to be increased. If \(\frac{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{[Y(1)}}_{i}^{{\gamma^{{{\text{midpoint}}}} }} ]} }}{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{[Y(0)}}_{i}^{{\gamma^{{{\text{midpoint}}}} }} ]} }}{\text{ > RR}}^{{{\text{target}}}}\), the relative risk is too large and \(\gamma\) in formula (2) has to be decreased.
If \(\frac{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{[Y(1)}}_{i}^{{\gamma^{{{\text{midpoint}}}} }} ]} }}{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{[Y(0)}}_{i}^{{\gamma^{{{\text{midpoint}}}} }} ]} }} < {\text{RR}}^{{{\text{target}}}}\), then define a new interval: \((\gamma_{{}}^{{{\text{midpoint}}}} ,\gamma_{{}}^{{{\text{upper}}}} )\). Conversely, if \(\frac{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{[Y(1)}}_{i}^{{\gamma^{{{\text{midpoint}}}} }} ]} }}{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{[Y(0)}}_{i}^{{\gamma^{{{\text{midpoint}}}} }} ]} }}{\text{ > RR}}^{{{\text{target}}}}\), then define a new interval \((\gamma_{{}}^{{{\text{lower}}}} ,\gamma_{{}}^{{{\text{midpoint}}}} )\). In the first case, the new interval is the upper half of the initial interval, while the in the second case the new interval is the lower half of the initial interval. In either case, the width of the new interval is half the width of the initial interval. We have bisected the initial interval. One then repeats this process iteratively until \(\frac{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{[Y(1)}}_{i}^{{\gamma^{{{\text{midpoint}}}} }} ]} }}{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{[Y(0)}}_{i}^{{\gamma^{{{\text{midpoint}}}} }} ]} }} \,\) is as close to \({\text{RR}}^{{{\text{target}}}}\) as desired.
The above procedure allows one to determine the value of \(\gamma\) necessary to induce a given treatment relative. The procedure can be modified to induce a given treatment risk difference. To do so, all occurrences of \(\frac{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{Y(1)}}_{i}^{\gamma } } }}{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{Y(0)}}_{i}^{\gamma } } }}\) are replaced by \(\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{[Y(1)}}_{i}^{\gamma } {\text{ - Y(0)}}_{i}^{\gamma } ]}\).
Application of method
We applied the iterative bisection procedure to simulate data for a sample of size N = 1,000,000. We simulated 10 baseline covariates as in “Description of method” of “Determining the intercept of a logistic regression model so that prevalence of treatment is equal to a specified value when using a logistic regression model to simulate outcomes” section. We first simulated a binary treatment status variable using formula (1) with the 10 regression coefficients for the baseline covariates in the treatment-selection model set to log(1.1), log(2), log(3), log(1.5), log(1.5), log(1.1), log(2), log(3), log(1.5), and log(1.5). We used the bisection process described in Determining the intercept of a logistic regression model so that prevalence of treatment is equal to a specified value when using a logistic regression model to simulate outcomes” section to determine the intercept for the treatment selection model such that the prevalence of treatment in the population was 0.2. The resultant intercept was -3.31749.
The regression coefficients (equivalent to log-odds ratios) for the 10 covariates in the outcome model (formula (2)) were set to \(\begin{gathered} \beta_{1} = \log (1.25),\;\beta_{2} = \log (1.5),\;\beta_{3} = \log (1.75),\;\beta_{4} = \log (2),\;\beta_{5} = \log (2.5),\;\beta_{6} = \log (1.25),\;\beta_{7} = \log (1.5),\; \hfill \\ \beta_{8} = \log (1.75),\;\beta_{9} = \log (2),\;\beta_{10} = \log (2.5). \hfill \\ \end{gathered}\)
We set the value of \(\beta_{0}\) to that determined in the first “Description of method” of “Determining the intercept of a logistic regression model so that prevalence of treatment is equal to a specified value when using a logistic regression model to simulate outcomes” (i.e., the subsection in the Determining the intercept of a logistic regression model so that prevalence of treatment is equal to a specified value when using a logistic regression model to simulate outcomes section) section (\(\beta_{0}\) = -4.367676) so that the prevalence of outcome was 0.10 (10%) if no subjects were treated.
Our objective was to simulate data such that the treatment relative risk is 0.80. The initial interval for \(\gamma\) was set to (-10,10). R code to implement the bisection procedure is provided at the author’s GitHub account [https://github.com/peter-austin/BMC_MRM-bisection-procedures-for-Monte-Carlo-simulations]. The estimates of \(\gamma_{{}}^{{{\text{midpoint}}}} {\text{ and }}\frac{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{Y(1)}}_{i}^{\gamma } } }}{{\frac{1}{N}\sum\limits_{i = 1}^{N} {{\text{Y(0)}}_{i}^{\gamma } } }}\) at each iteration are reported in Table 2. After 16 iterations, the procedure identified that a treatment log-odds ratio of -0.2999878 (equivalent to an odds ratio of 0.741) resulted in a relative risk of 0.8000121.
Determining the regression coefficients for a logistic regression model so that the model has a specified c-statistic
Description of method
The c-statistic (equivalent to the area under the receiver operating characteristic (ROC) curve, which is sometimes abbreviated as the AUC) is a measure of discrimination used to assess the predictive performance of logistic regression models [11, 12]. In this section we describe how to specify the coefficients of a logistic regression model to simulate outcomes such that the underlying logistic regression model has a specified c-statistic. In doing so, we make use of the fact that the c-statistic of a univariate logistic regression model is a function of the variance of the covariate and the log-odds ratio for that covariate [13]. Let \({\text{AUC}}^{{{\text{target}}}}\) denote the target c-statistic.
We modify the logistic regression model described in formula (1):
The regression coefficients \(\beta_{1} , \ldots ,\beta_{{\text{p}}}\) can be chosen by the investigator to reflect the desired relationship between each of the covariates and the log-odds of the outcome. By introducing the scalar \(\sigma\), we are modifying each log-odds ratio, but doing so in such a way that the ratio of any two log-odds ratios remains constant after modification. We need to identify the value of \(\sigma\) required to induce the desired c-statistic. Larger values of \(\sigma\) are associated with larger values of the c-statistic, while lower values of \(\sigma\) are associated with smaller values of the c-statistic.
The first step is to simulate a vector of covariates for each subject in a large super-population, say of size N = 1,000,000. The distribution of the baseline covariates can be chosen by the investigator. The application of the bisection procedure is independent of this distributional decision.
For a given value of \(\sigma\) we can simulate a binary outcome for each subject in the super-population from a Bernoulli distribution with subject-specific parameter determined by formula (3). Let \({\text{Y}}_{i}^{\sigma }\) denote the simulated outcome for the ith subject when \(\sigma\) is a constant scalar as shown in formula (3).
The next step is to specify the endpoints of an interval for \(\sigma\). Denote this interval by \((\sigma^{{{\text{lower}}}} ,\sigma^{{{\text{upper}}}} )\). The lower endpoint \(\sigma^{{{\text{lower}}}}\) is chosen such that when binary outcomes are simulated using a Bernoulli distribution with subject-specific parameter determined using formula (3), the c-statistic of the logistic regression model fit to the simulated data has a c-statistic that is less than \({\text{AUC}}^{{{\text{target}}}}\). Similarly, the upper endpoint \(\sigma^{{{\text{upper}}}}\) is chosen that when binary outcomes are simulated using a Bernoulli distribution with subject-specific parameter determined using formula (3), the c-statistic of the logistic regression model fit to the simulated data has a c-statistic that is greater than \({\text{AUC}}^{{{\text{target}}}}\). The endpoints can be identified through a grid search or by trial and error.
Once the endpoints of the interval \((\sigma^{{{\text{lower}}}} ,\sigma^{{{\text{upper}}}} )\) have been determined, compute the midpoint of the interval: \(\sigma^{{{\text{midpoint}}}} = \frac{{\sigma^{{{\text{lower}}}} + \sigma^{{{\text{upper}}}} }}{2}\). We then use \(\sigma^{{{\text{midpoint}}}}\) in formula (3) and simulate a binary outcome for each subject: \({\text{Y}}_{i}^{{\sigma^{{{\text{midpoint}}}} }}\). We fit a logistic regression model in the simulated sample and determine its c-statistic, which we refer to as \({\text{AUC}}^{{\sigma^{{{\text{midpoint}}}} }}\). If \({\text{AUC}}^{{\sigma^{{{\text{midpoint}}}} }} < {\text{AUC}}^{{{\text{target}}}}\), the c-statistic is too low and \(\sigma\) has to be increased. If \({\text{AUC}}^{{\sigma^{{{\text{midpoint}}}} }} {\text{ > AUC}}^{{{\text{target}}}}\), the c-statistic is too large and \(\sigma^{{{\text{midpoint}}}}\) has to be decreased.
If \({\text{AUC}}^{{\sigma^{{{\text{midpoint}}}} }} < {\text{AUC}}^{{{\text{target}}}}\), then define a new interval: \((\sigma^{{{\text{midpoint}}}} ,\sigma^{{{\text{upper}}}} )\). Conversely, if \({\text{AUC}}^{{\sigma^{{{\text{midpoint}}}} }} {\text{ > AUC}}^{{{\text{target}}}}\), then define a new interval \((\sigma^{{{\text{lower}}}} ,\sigma^{{{\text{midpoint}}}} )\). One then repeats this process iteratively until \({\text{AUC}}^{{\sigma^{{{\text{midpoint}}}} }}\) is as close to \({\text{AUC}}^{{{\text{target}}}}\) as desired.
Application of method
We applied the iterative bisection procedure to simulate data for a sample of size N = 1,000,000. We simulated 10 baseline covariates as in “Description of method” of “Determining the intercept of a logistic regression model so that prevalence of treatment is equal to a specified value when using a logistic regression model to simulate outcomes” section. As above, the regression coefficients (equivalent to log-odds ratios) for the 10 covariates in the outcome model (formula (3)) were set to \(\begin{gathered} \beta_{1} = \log (1.25),\;\beta_{2} = \log (1.5),\;\beta_{3} = \log (1.75),\;\beta_{4} = \log (2),\;\beta_{5} = \log (2.5),\;\beta_{6} = \log (1.25),\;\beta_{7} = \log (1.5),\; \hfill \\ \beta_{8} = \log (1.75),\;\beta_{9} = \log (2),\;\beta_{10} = \log (2.5). \hfill \\ \end{gathered}\)
We set the value of \(\beta_{0}\) to that determined in the first “Description of method” of “Determining the intercept of a logistic regression model so that prevalence of treatment is equal to a specified value when using a logistic regression model to simulate outcomes” section (\(\beta_{0}\) = -4.368896).
Our objective was to simulate binary outcomes such that c-statistic of the logistic regression model was 0.8. The initial interval for \(\sigma\) was set to (0,10). R code to implement the bisection procedure is provided at the author’s GitHub account [https://github.com/peter-austin/BMC_MRM-bisection-procedures-for-Monte-Carlo-simulations. The values of \(\sigma_{{}}^{{{\text{midpoint}}}} ,{\text{ and AUC}}^{{\sigma^{{{\text{midpoint}}}} }}\) at each iteration are reported in Table 3. After 11 iterations of the bisection procedure, \(\sigma = 0.8349609\) resulted in an empirical c-statistic of 0.8006215.
Determining the conditional hazard ratio for treatment/exposure in an adjusted Cox regression model to induce a specified marginal hazard ratio
Description of method
The Cox proportional hazard regression is frequently used in biomedical and epidemiological research [14]. When fitting a multivariable Cox regression model, the regression coefficients, when exponentiated, are interpretated as conditional (or adjusted) hazard ratios. For a given covariate, the conditional hazard ratio compares the relative difference in the hazard of the outcome between two subjects for whom the covariate in question differs by one unit and for whom all the other covariates are identical. In contrast to the conditional (or adjusted) hazard ratio is the marginal (or population-average) hazard ratio. The marginal hazard ratio denotes the relative difference in the hazard function between two populations, for whom the covariate in question differs by one unit, and all other covariates are identical between populations. Due to the phenomenon known as the non-collapsibility of the hazard ratio, marginal and conditional hazard ratios do not coincide (unless one is null) [15].
Bender and colleagues have described data-generating processes for time-to-event outcomes based on an underlying hazard regression model [16]. This data-generating process simulates time-to-event outcomes with specified conditional hazard ratios. One can use the bisection approach with Bender’s data-generating process to induce data with a desired marginal effect for a given covariate.
Assume that we simulate p baseline covariates (\(X_{1} , \ldots ,X_{{\text{p}}}\)) from a given multivariable distribution. Furthermore, assume that we have the following Cox proportional hazards model:
where \(h_{i} (t)\) denotes the hazard function for the ith subject, h0(t) denotes the baseline hazard function, and where Z denotes a binary treatment variable (Z = 1 treated; Z = 0 control) with an associated conditional log-hazard ratio of \(\gamma\). Thus, treatment is associated with an increase of \(\gamma\) in the log-hazard of the outcome. Let \({\text{HR}}^{{{\text{target}}}}\) denote the target marginal hazard ratio in the population.
The first step is to simulate baseline covariates \(X_{1} , \ldots ,X_{{\text{p}}}\) from a chosen distribution. One can then simulate treatment status (Z) using methods described above, so that receipt of treatment has a specified association with each of the baseline covariates and the prevalence of treatment in the population is equal to the specified value.
The second step is to set the regression coefficients associated with the baseline covariates in formula (4) to the desired quantities.
As above, we use the potential outcomes framework. The marginal hazard ratio due to treatment is determined by the log-hazard ratio for treatment, \(\gamma\). One must determine the value of \(\gamma\) that results in the desired marginal hazard ratio. For a given value of \(\gamma\) we can use Bender’s approach to simulate the two potential outcomes for each subject using formula (4). First, we set Z = 0 (control) for all subjects in the super-population and simulate a time-to-event outcome for each subject in the super-population. Let \({\text{T(0)}}_{i}^{\gamma }\) denote the simulated outcome under control for the ith subject when the log-hazard ratio for treatment in regression model (4) is set to \(\gamma\). Second, we set Z = 1 (treated) for all subjects in the super-population and simulate a time-to-event outcome for each subject in the super-population. Let \({\text{T(1)}}_{i}^{\gamma }\) denote the simulated outcome under treatment for the ith subject when the log-hazard ratio for treatment in regression model (4) is set to \(\gamma\). One then creates a large super-population by concatenating the two simulated datasets (one under control and one under treatment). Using a univariate Cox proportional hazards model, one then regresses the hazard of the outcome on the variable denoting treatment status. The resultant hazard ratio is an estimate of the marginal hazard ratio.
The next step is to specify the endpoints of an interval for the log-hazard ratio for treatment, \(\gamma\). Denote this interval by \((\gamma_{{}}^{{{\text{lower}}}} ,\gamma_{{}}^{{{\text{upper}}}} )\). The lower endpoint \(\gamma^{{{\text{lower}}}}\) is chosen such that the estimated marginal hazard ratio is less than \({\text{HR}}^{{{\text{target}}}}\). Similarly, the upper endpoint \(\gamma^{{{\text{upper}}}}\) is chosen such that the estimated marginal hazard ratio is greater than \({\text{HR}}^{{{\text{target}}}}\). The endpoints can be identified through a grid search or by trial and error.
Once the endpoints of the interval \((\gamma_{{}}^{{{\text{lower}}}} ,\gamma_{{}}^{{{\text{upper}}}} )\) have been determined, compute the midpoint of the interval: \(\gamma^{{{\text{midpoint}}}} = \frac{{\gamma^{{{\text{lower}}}} + \gamma^{{{\text{upper}}}} }}{2}\). We then use \(\gamma^{{{\text{midpoint}}}}\) in formula (4) and simulate the two potential outcomes under treatment and control for each subject: \({\text{T(1)}}_{i}^{{\gamma^{{{\text{midpoint}}}} }}\) and \({\text{T(0)}}_{i}^{{\gamma^{{{\text{midpoint}}}} }}\). We then compute the marginal treatment when the dataset consisting of both potential outcomes for all subjects is used to regress the hazard of the outcome on treatment status. If the estimated marginal hazard ratio is less than \({\text{HR}}^{{{\text{target}}}}\), the hazard ratio is too low and \(\gamma\) in formula (4) must be increased. If the estimated marginal hazard ratio is greater than \({\text{HR}}^{{{\text{target}}}}\), the hazard ratio is too large and \(\gamma\) in formula (4) must be decreased.
If the estimated marginal hazard ratio is less than \({\text{HR}}^{{{\text{target}}}}\), then define a new interval: \((\gamma_{{}}^{{{\text{midpoint}}}} ,\gamma_{{}}^{{{\text{upper}}}} )\). Conversely, if the estimated marginal hazard ratio is greater than \({\text{HR}}^{{{\text{target}}}}\), then define a new interval \((\gamma_{{}}^{{{\text{lower}}}} ,\gamma_{{}}^{{{\text{midpoint}}}} )\). In the first case, the new interval is the upper half of the initial interval, while the in the second case the new interval is the lower half of the initial interval. In either case, the width of the new interval is half the width of the initial interval. We have bisected the initial interval. One then repeats this process iteratively until the estimated marginal hazard ratio is as close to the target marginal hazard ratio, \({\text{HR}}^{{{\text{target}}}}\), as desired.
Application of method
We applied the iterative bisection procedure to simulate data for a sample of size N = 1,000,000. We simulated 10 baseline covariates as in “Description of method” of “Determining the intercept of a logistic regression model so that prevalence of treatment is equal to a specified value when using a logistic regression model to simulate outcomes” section. We used the regression coefficients for the treatment-selection model described in “Description of method” of “Determining the odds ratio for a binary treatment variable in a logistic regression model to induce a desired treatment risk difference or relative risk in the population” section, with the same intercept as determined in “Description of method” of “Determining the odds ratio for a binary treatment variable in a logistic regression model to induce a desired treatment risk difference or relative risk in the population” section, so that the prevalence of treatment was 0.20.
The regression coefficients for the Cox regression model described in formula (4) were set to \(\begin{gathered} \beta_{1} = \log (1.25),\;\beta_{2} = \log (1.5),\;\beta_{3} = \log (1.75),\;\beta_{4} = \log (2),\;\beta_{5} = \log (2.5),\;\beta_{6} = \log (1.25),\;\beta_{7} = \log (1.5),\; \hfill \\ \beta_{8} = \log (1.75),\;\beta_{9} = \log (2),\;\beta_{10} = \log (2.5). \hfill \\ \end{gathered}\)
Our objective was to simulate data such that the marginal hazard ratio for treatment was 0.80. The initial interval for \(\gamma\) was set to (-10,10). R code to implement the bisection procedure is provided at the author’s GitHub account [https://github.com/peter-austin/BMC_MRM-bisection-procedures-for-Monte-Carlo-simulations]. The estimates of \(\gamma_{{}}^{{{\text{midpoint}}}} {\text{ and HR}}_{{{\text{marginal}}}}^{{\gamma_{{}}^{{{\text{midpoint}}}} }}\) at each iteration are reported in Table 4. After 12 iterations of the bisection procedure, a conditional log-hazard ratio for treatment equal to 0.6298828 resulted in simulated outcomes with a marginal hazard ratio of 0.7991514.
Discussion
We illustrated the application of an iterative bisection procedure that allows investigators to select the numeric values of parameters in a data-generating process to produce simulated datasets with specified characteristics. This will facilitate designing data-generating processes that produce simulated datasets that are tailored to the investigators’ specifications.
We illustrated the use of the bisection procedure when there is one characteristic that requires specification (e.g., the prevalence of the outcome or the c-statistic of a logistic regression model). The procedure can be modified to simulate data such that two characteristics are fixed at specified values (e.g., both the prevalence of the outcome and the c-statistic of the logistic regression model). To do so, one would apply the procedure sequentially and then iteratively repeat the sequential process until both characteristics are close to the target values. It is necessary to repeat the process iteratively as modifying the parameter values during the second application of the procedure (e.g., for the c-statistic of the regression model) may modify the value of the first characteristic (e.g., the prevalence of the outcome).
The bisection procedure has been successfully used in previous studies that used simulations to: assess the ability to rank hospitals by their performance on composite indicators [17], describe a data-generating process for data with a specified marginal odds ratio [18], describe a data-generating process for data with a specified risk difference or number needed to treat [19], to determine the rate parameter for an exponential censoring distribution so as to induce the desired proportion of censoring [20], in a study of the performance of double propensity score adjustment [21], in a study on the performance of the generalized propensity score for estimating the effect of quantitative exposures on time-to-event outcomes [22], to assess the performance of propensity score methods for estimating marginal hazard ratios [23], in a study of the use of the bootstrap with propensity score matching [24], in a comparison of algorithms for matching on the propensity score [25], assess the use of optimal matching with survival outcomes [26], to assess methods of variance estimation when using inverse probability of treatment weighting with survival outcomes [27], to assess the use of propensity score matching in the presence of competing risks [28], to assess the performance of calibration metrics for survival models [29], to assess the performance of variance estimators for survival outcomes when using propensity score matching with replacement [30], to assess the effect of constraints on the matching ratio when using full matching [31], to examine the consequences multiply-imputing missing potential outcomes under control [32], and to examine sample size and power calculations when using inverse probability of treatment weighting [33].
Conclusion
We have a described an iterative bisection procedure that can be used in designing data-generating processes that produce simulated datasets with specific characteristics.
Availability of data and materials
No data were used in the study. The software code for simulating artificial datasets is provided on the author’s GitHub account [https://github.com/peter-austin/BMC_MRM-bisection-procedures-for-Monte-Carlo-simulations].
Abbreviations
- AUC:
-
Area under the curve
- RR:
-
Relative risk
- HR:
-
Hazard ratio
- ROC:
-
Receiver operating characteristic
References
Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med. 2019;38(11):2074–102.
Harrison RL. Introduction To Monte Carlo Simulation. AIP Conf Proc. 2010;1204:17–21.
Austin PC. The performance of different propensity-score methods for estimating relative risks. J Clin Epidemiol. 2008;61(6):537–45.
Austin PC. Comparing paired vs non-paired statistical methods of analyses when making inferences about absolute risk reductions in propensity-score matched samples. Stat Med. 2011;30(11):1292–301.
Hosmer DW, Lemeshow S. Applied Logistic Regression. New York, NY: John Wiley & Sons; 1989.
Laupacis A, Sackett DL, Roberts RS. An assessment of clinically useful measures of the consequences of treatment. N Engl J Med. 1988;318:1728–33.
Cook RJ, Sackett DL. The number needed to treat: a clinically useful measure of treatment effect. BMJ. 1995;310(6977):452–4.
Sackett DL. Down with odds ratios! Evid Based Med. 1996;1:164–6.
Jaeschke R, Guyatt G, Shannon H, Walter S, Cook D, Heddle N. Basic statistics for clinicians: 3. Assessing the effects of treatment: measures of association. Can Med Assoc J. 1995;152(3):351–7.
Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66:688–701.
Steyerberg EW. Clinical Prediction Models. 2nd ed. New York: Springer-Verlag; 2019.
Harrell FE Jr. Regression modeling strategies. 2nd ed. New York, NY: Springer-Verlag; 2015.
Austin PC, Steyerberg EW. Interpreting the concordance statistic of a logistic regression model: relation to the variance and odds ratio of a continuous explanatory variable. BMC Med Res Methodol. 2012;12:82.
Cox DR. Regression models and life tables (with discussion). Journal of the Royal Statistical Society - Series B. 1972;34:187–220.
Gail MH, Wieand S, Piantadosi S. Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika. 1984;7:431–44.
Bender R, Augustin T, Blettner M. Generating survival times to simulate Cox proportional hazards models. Stat Med. 2005;24(11):1713–23.
Austin PC, Ceyisakar IE, Steyerberg EW, Lingsma HF, Marang-van de Mheen PJ. Ranking hospital performance based on individual indicators: can we increase reliability by creating composite indicators? BMC Med Res Methodol. 2019;19(1):131.
Austin PC, Stafford J. The performance of two data-generation processes for data with specified marginal treatment odds ratios. Communications in Statistics - Simulation and Computation. 2008;37:1039–51.
Austin PC. A data-generation process for data with specified risk differences or numbers needed to treat. Communications in Statistics - Simulation and Computation. 2010;39:563–77.
Austin PC, Putter H, Giardiello D, van Klaveren D. Graphical calibration curves and the integrated calibration index (ICI) for competing risk models. Diagn Progn Res. 2022;6(1):2.
Austin PC. Double propensity-score adjustment: A solution to design bias or bias due to incomplete matching. Stat Methods Med Res. 2017;26(1):201–22.
Austin PC. Assessing the performance of the generalized propensity score for estimating the effect of quantitative or continuous exposures on survival or time-to-event outcomes. Stat Methods Med Res. 2019;28(8):2348–67.
Austin PC. The performance of different propensity score methods for estimating marginal hazard ratios. Stat Med. 2013;32(16):2837–49.
Austin PC, Small DS. The use of bootstrapping when using propensity-score matching without replacement: A simulation study. Stat Med. 2014;33(24):4306–19.
Austin PC. A comparison of 12 algorithms for matching on the propensity score. Stat Med. 2014;33(6):1057–69.
Austin PC, Stuart EA. Optimal full matching for survival outcomes: a method that merits more widespread use. Stat Med. 2015;34(30):3949–67.
Austin PC. Variance estimation when using inverse probability of treatment weighting (IPTW) with survival analysis. Statisics in Medicine. 2016;35(30):5642–55.
Austin PC, Fine JP. Propensity-score matching with competing risks in survival analysis. Stat Med. 2019;38(5):751–77.
Austin PC, Harrell FE Jr, van Klaveren D. Graphical calibration curves and the integrated calibration index (ICI) for survival models. Stat Med. 2020;39(21):2714–42.
Austin PC, Cafri G. Variance estimation when using propensity-score matching with replacement with survival or time-to-event outcomes. Stat Med. 2020;39(11):1623–40.
Austin PC, Stuart EA. The effect of a constraint on the maximum number of controls matched to each treated subject on the performance of full matching on the propensity score when estimating risk differences. Stat Med. 2021;40(1):101–18.
Austin PC, Rubin DB, Thomas N. Estimating adjusted risk differences by multiply-imputing missing control binary potential outcomes following propensity score-matching. Stat Med. 2021;40(25):5565–86.
Austin PC. Informing power and sample size calculations when using inverse probability of treatment weighting using the propensity score. Stat Med. 2021;40(27):6150–63.
Acknowledgements
Not applicable.
Funding
ICES is an independent, non-profit research institute funded by an annual grant from the Ontario Ministry of Health (MOH) and the Ministry of Long-Term Care (MLTC). As a prescribed entity under Ontario’s privacy legislation, ICES is authorized to collect and use health care data for the purposes of health system analysis, evaluation and decision support. Secure access to these data is governed by policies and procedures that are approved by the Information and Privacy Commissioner of Ontario. The opinions, results and conclusions reported in this paper are those of the authors and are independent from the funding sources. No endorsement by ICES or the Ontario MOH or MLTC is intended or should be inferred. This research was supported by operating grant from the Canadian Institutes of Health Research (CIHR) (PJT 166161).
Author information
Authors and Affiliations
Contributions
PA conceived the study, conducted the simulations, wrote the manuscript, and approved the final manuscript.
Authors’ information
Not applicable.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
No data were used in this study. The study described algorithms for randomly generating data.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Austin, P.C. The iterative bisection procedure: a useful tool for determining parameter values in data-generating processes in Monte Carlo simulations. BMC Med Res Methodol 23, 45 (2023). https://doi.org/10.1186/s12874-023-01836-5
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12874-023-01836-5