Evaluating the performance of copula models in phase I-II clinical trials under model misspecification

Background Traditionally, phase I oncology trials are designed to determine the maximum tolerated dose (MTD), defined as the highest dose with an acceptable probability of dose limiting toxicities(DLT), of a new treatment via a dose escalation study. An alternate approach is to jointly model toxicity and efficacy and allow dose escalation to depend on a pre-specified efficacy/toxicity tradeoff in a phase I-II design. Several phase I-II trial designs have been discussed in the literature; while these model-based designs are attractive in their performance, they are potentially vulnerable to model misspecification. Methods Phase I-II designs often rely on copula models to specify the joint distribution of toxicity and efficacy, which include an additional correlation parameter that can be difficult to estimate. We compare and contrast three models for the joint probability of toxicity and efficacy, including two copula models that have been proposed for use in phase I-II clinical trials and a simple model that assumes the two outcomes are independent. We evaluate the performance of the various models through simulation both when the models are correct and under model misspecification. Results Both models exhibited similar performance, as measured by the probability of correctly identifying the optimal dose and the number of subjects treated at the optimal dose, regardless of whether the data were generated from the correct or incorrect copula, even when there is substantial correlation between the two outcomes. Similar results were observed for a simple model that assumes independence, even in the presence of strong correlation. Further simulation results indicate that estimating the correlation parameter in copula models is difficult with the sample sizes used in Phase I-II clinical trials. Conclusions Our simulation results indicate that the operating characteristics of phase I-II clinical trials are robust to misspecification of the copula model but that a simple model that assumes independence performs just as well due to difficulty in estimating the copula model correlation parameters from binary data.


Background
Phase I oncology trials are primarily concerned with establishing the safety profile of a new treatment via determining the maximum tolerated dose (MTD), defined as the highest dose with the probability of toxicity less than a pre-specified target toxicity rate. Dose escalation studies can be categorized as either rule-based, such as the traditional 3 + 3 [1], or model-based, such as the continual reassessment method (CRM) [2]. Standard rulebased designs are advantageous with their simplicity in implementation; however, such designs are less desirable in their performance and efficiency, since the selection probability for the true MTD can be poor and dose assignment is based on information from the current dose level only. The standard CRM uses a simple parametric model, such as a one-parameter power model or two-parameter logistic regression mode, to characterize the relationship between dose level and the probability of experiencing a dose limiting toxicity (DLT). This method assumes a monotonic dose-response relationship between dose level http://www.biomedcentral.com/1471-2288/14/51 and toxicity and has a variety of proposed modifications to better ensure patient safety [3].
Standard phase I designs assume that both the probabilities of toxicity and efficacy of a new drug increase as dose level increases. Nevertheless, in some instances an increase in dose level may result in a substantial increase in toxicity but only a small increase in efficacy. Thus, an alternate approach is to consider the tradeoff between toxicity and efficacy during dose escalation. To this end, several phase I-II study designs that jointly model toxicity and efficacy have been discussed in the literature [4][5][6]. As with the CRM, these methods assume a parametric model for both the dose-toxicity and dose-efficacy relationship, which may also include a quadratic term for efficacy to allow for model flexibility. In addition, these methods require that we specify a joint probability model for efficacy and toxicity, which is often accomplished using a copula model.
Copula models provide a flexible framework for specifying the joint distribution of two random variables [7]. In a copula model, a joint distribution is specified on the unit square and a joint distribution for any two random variables can be derived using an inverse transformation. In the context of a phase I-II clinical trial, we specify parametric models for the dose-response relationship for toxicity and efficacy and a copula model is used to specify a joint model for efficacy and toxicity.
Model-based designs tend to surpass rule-based designs in their ability to correctly identify the MTD and in the number of patients treated at the MTD [8], but are potentially vulnerable to model misspecification. This problem is exacerbated by the presence of a copula model in efficacy/toxicity tradeoff designs. Copula models impose a rigid structure on the relationship between toxicity and efficacy, which may not accurately reflect the underlying data generating process. An incorrectly specified model could potentially lead to a decreased probability of correctly identifying the MTD and decreased number of patients treated at the MTD.
In this manuscript, we use simulation to investigate the impact of model misspecification in phase I-II clinical trials. We consider five scenarios for the true probabilities of toxicity and efficacy. Data are simulated assuming one of two copula models and fit using the correct copula, an incorrect copula, and a model that assumes independence. Data are also simulated assuming differing degrees of positive correlation between toxicity and efficacy. Our results show that the two models are relatively robust to model misspecification but that the independence model actually performs better in many cases.
The remaining sections of this manuscript are organized as follows. In Section 'Methods', we introduce several joint probability models used in phase I-II clinical trials and describe the dose-finding algorithm used in our simulation study. In Section 'Results and discussion', we present simulation results evaluating the performance of the two copula models when correctly specified and under model specification. Finally, we conclude with a brief discussion in Section 'Conclusions'.

Methods
In this section, we introduce two joint probability models used in phase I-II clinical trials. In both cases, we specify marginal models for the probabilities of toxicity and efficacy and develop a joint model using a copula model. First, we specify models for the marginal probability of toxicity and the marginal probability of efficacy.
Let Y T and Y E be the binary indicators of toxicity and efficacy, respectively. Denote π(y T , y E |z) = Pr(Y T = y T , Y E = y E |z) as the joint probability of toxicity and efficacy given dose level z, with marginal probabilities of toxicity and efficacy, π T and π E , respectively, also functions of z. We can model the dose-toxicity and dose-efficacy relationships with any monotonic function. For simplicity, we assume logistic regression models for efficacy and toxicity as follows: and log We include a quadratic term for efficacy to allow model flexibility should the probability of efficacy level off or diminish after a certain dose level. We note that the intercept terms in (1) and (2) correspond to the log-odds of toxicity and efficacy, respectively, at the first dose level. This is useful for interpretation and prior specification purposes. We next describe two copula models used in phase I-II clinical trials for specifying a joint distribution for Y T and Y E .

Braun copula
We first consider the copula model discussed by Arnold and Strauss [9] and applied to joint modeling of efficacy and toxicity in the setting of a Phase I-II clinical trial by Braun [5]. These authors specify the joint distribution of Y T and Y E as: Here, ψ 1 represents the correlation between Y T and Y E and takes on values between 0 and 1. ψ 1 greater than 0.5 reflects positive correlation, ψ 1 less than 0.5 reflects http://www.biomedcentral.com/1471-2288/14/51 negative correlation and ψ 1 = 0.5 represents independence. We note also that k(π T , π E , ψ 1 ) is a constant that is included to assure that the four probabilities sum to 1 and depends on π T , π E , and ψ 1 . The conditional probability of Y E |Y T can be derived from the joint probability in (3) and is equal to: . An analogous conditional probability of Y T |Y E can also be derived.
There are two key properties of the above model that are worthy of discussion. First, the correlation parameter, ψ 1 , has the useful interpretation that ψ 1 / (1 − ψ 1 ) is the odds ratio between Y E and Y T . A second, less desirable property, is that π T and π E are no longer the marginal probabilities of Y T and Y E equal to 1, respectively, if ψ 1 = .5. Instead, the marginal probability of Y E equal to 1 is: and the marginal probability of Y T equal to 1 is: This is a key point that must be considered during dose finding.

Gumbel copula
Thall and Cook [4] instead model the joint probability of efficacy and toxicity using the Gumbel copula discussed by Murtaugh and Fisher [10], which implies the following joint probability model for Y T and Y E : Here, ψ 2 ∈ (−1, 1) captures the correlation between Y T and Y E , with ψ 2 = 0 implying independence and ψ 2 ∈ (0, 1) implying positive correlation. We can again derive the conditional probability of Y E given Y T , The conditional probability of Y T given Y E can be expressed in an analogous fashion.
An advantage of this model is that both π E and π T retain their original interpretations as the marginal probabilities of efficacy and toxicity, respectively. This can be easily seen by summing P (Y E = 1, Y T = 1) and P (Y E = 1, Y T = 0) from (4). Unlike the Braun Copula, the correlation parameter for the Gumbel copula, ψ 2 , does not have a straight-forward interpretation.

Independent model
An alternate approach would be to assume independence between Y T and Y E , in which case the joint probability of toxicity and efficacy is simply the product of the marginal probabilities, (5) which is of course what we get by setting ψ 1 = 0.5 and ψ 2 = 0 in the Braun and Gumbel copulas, respectively. While it is unlikely that this model accurately reflects the true association between Y T and Y E , this model may still be useful because the sample size in phase I-II oncology trials is limited and we may lack the sample size to precisely estimate ψ 1 and ψ 2 . If the likelihood contains very little information about these parameters, it may be that we do not lose much with respect to our ability to identify the optimal dose by assuming independence instead of a more complicated model.
We must specify a prior distribution for each regression and association parameter, to complete a Bayesian analysis. We specify the following normal priors for the two intercept terms and the quadratic term for efficacy: 1 4 . The priors for β 0,T and β 0,E correspond to a prior belief of P(Y T = 1|z = 1) = 0.05 and P(Y E = 1|z = 1) = 0.27 but provide sufficient support over all plausible values for β 0,T and β 0,E and represent only mildly informative priors. The prior for β 2,E is chosen to reflect a strong belief against a quadratic relationship but allows the model flexibility should there be drastic departures from a linear relationship. Gamma priors were set for β 1,T and β 1,E with mean 1 and standard deviation 2, corresponding to a Gamma( 1 4 , 1 4 ). Assuming Gamma priors for β 1,T and β 1,E implies that the marginal probability of toxicity will be monotonically increasing but the same is not true for the marginal probability of efficacy due to the inclusion of a quadratic term for the marginal probability of efficacy. Finally, we specify non-informative uniform priors http://www.biomedcentral.com/1471-2288/14/51 for the association parameters: Uniform (0, 1) for ψ 1 and Uniform (−1, 1) for ψ 2 .

Dose-finding algorithm
For our simulation study, we follow the dose-finding algorithm proposed by Thall and Cook [4]. These authors identify a set of acceptable doses by defining a maximum acceptable probability of toxicity assuming 100% efficacy, a minimum acceptable probability of efficacy assuming no toxicity and define a desirability index to identify the optimal dose from the set of acceptable doses.
Let π T be the maximum acceptable probability of toxicity assuming 100% efficacy and π E be the minimum acceptable probability of efficacy assuming no toxicity, as specified by the physician. A dose, z, is acceptable if the posterior probabilities of the two events π T (z) < π T and π E (z) > π E exceed a pre-specified threshold, p, i.e.
The trial terminates for futility if, at any point during the trial, all doses are unacceptable according to Equation (6). The optimal dose is selected from the set of acceptable doses using a desirability index. The desirability index for a (π T (z), π E (z)) pair is defined by Thall and Cook as follows: where q is defined by identifying a probability of toxicity and probability of efficacy pair, π * T , π * E , that is equally desirable to (π T , 1.0) and 0, π E , plugging π * T , π * E into (7) and solving for q when D(z) equals 0. Larger values of D(z) are considered more desirable and the optimal combination, (0.0, 1.0), has D(z) equal to 1 regardless of q.
The dose-finding algorithm proceeds as follows: 1. Treat the first cohort of m patients at the lowest dose level. 2. Update the posterior distributions of the probabilities of toxicity and efficacy for each dose level using data from all previous cohorts. 3. Identify the set of acceptable doses using criterion (6).
If no dose is found acceptable, terminate for futility. 4. Treat the next cohort at the dose that maximizes D(z) under the restriction that dose levels may not be skipped when escalating. Return to step 2. 5. Repeat steps 2-4 until the maximum sample size is reached. The dose that maximizes D(z) at study completion is considered the optimal dose.

Results and discussion
We completed a small simulation study to evaluate the performance of phase I-II clinical trials when the copula model is misspecified. Trial parameters were set as follows. We assume a cohort size of 3 patients with a maximum of 15 cohorts, for a maximum sample size of 45 patients. The maximum acceptable probability of toxicity assuming 100% efficacy and minimum acceptable probability of efficacy assuming no toxicity were set at π T = 0.5 and π E = 0.55, respectively. We set π * T , π * E equal to (0.25, 0.60), which corresponds to q = 2. The pre-specified threshold, p, to determine the set of acceptable doses using the posterior probability of toxicity and efficacy is assumed to be 0.05. We consider four dose levels with dose index, z = {1, 2, 3, 4}. All simulations were completed in R version 2.15.1 [11]. Gibbs sampling was completed in JAGS as called from R using rjags [12]. We simulate 1000 trials; within each trial, 1000 iterations were kept for inference following a period of 5000 iterations for burn-in.

Scenarios
We simulated data from five scenarios to evaluate the performance of our models in a variety of settings. Table 1 contains the true marginal probability of toxicity, marginal probability of efficacy and D(z) at each dose level for the five scenarios, with the optimal dose in bold. Scenarios 1-3 represent the case where the probability of toxicity and efficacy increase as dose increases with the optimal dose ranging from dose level 1 to dose level 4. In scenario 4, both dose levels three and four are acceptable but dose level three is the optimal dose. Finally, in scenario 5, all doses are safe but none have acceptable efficacy and the correct decision is to terminate for futility. For each scenario, we simulated data from both copula models and fit the data using either the correct copula model, the incorrect copula model or the independence model. The association parameters, ψ 1 and ψ 2 , were varied to determine the impact of the correlation of Y T and Y E on model performance under model misspecification. The Braun association parameter, ψ 1 , in (3) ranges from 0.5 to 0.9 by increments of 0.2; this corresponds to an odds ratio between toxicity and efficacy ranging from 1 to 9. The Gumbel association parameter, ψ 2 in (4) ranges from 0 to 0.8 by increments of 0.4. Recall that efficacy and toxicity are independent if ψ 1 equals 0.5 for the Braun model and ψ 2 equals 0 for the Gumbel model. In these cases, the independence model is the correct model and the Braun and Gumbel models are unnecessarily trying to estimate a correlation parameter when the two endpoints are actually independent.

Results
Tables 2, 3, 4, 5 and 6 display the performance of the Braun, Gumbel, and independence models for the five scenarios at no, moderate, and very strong degrees of positive association; each table summarizes the results for data simulated from both the Braun and Gumbel copula http://www.biomedcentral.com/1471-2288/14/51 models. Within a model stratum, the first row displays the posterior probability of selecting that dose; while the second row displays the average number of patients treated at that dose. Operating characteristics for the optimal dose are in bold. Table 2 displays results for Scenario 1. Concentrating first on the results when data are generated from the Braun model, we see that both copula models perform similarly well in their ability to select the correct dose and in the number of patients treated at the optimal dose regardless of the true correlation. The probability of correctly identifying the optimal dose differs by less than 0.024 and the number of patients treated at the optimal dose differs by less than 1 across the three correlation conditions. Surprisingly, the independence model also performs very well regardless of the true correlation and has the highest probability of identifying the optimal dose in two of the three correlation conditions. Although, the differences are small and the independence model has essentially the same performance as the two copula models.
Similar results are observed when data are generated from the Gumbel model. Both copula models performed similarly with respect to the probability of accurately identifying the optimal dose and the number of patients treated at the optimal dose. The independence model again exhibits good performance across the three scenarios, which is surprising because the independence model lacks the flexibility to model the correlation between the The first row is the selection probability for dose z; and the second is the average number of patients treated at dose z. Y T and Y S are simulated with association, ψ k ; k = 1, 2 when appropriate. The operating characteristics for the target dose are in bold.
two outcomes. Across the three correlation scenarios, the probability of correctly identifying the optimal dose differed by less than 0.05 and the average number of http://www.biomedcentral.com/1471-2288/14/51 patients treated at the optimal dose differed by less than 1.5 between the three models. Results for Scenarios 2 and 3 are found in Tables 3  and 4, respectively. The results for Scenarios 2 and 3 are similar to the results for Scenario 1. The probability of correctly identifying the optimal dose and the average number of patients treated at the optimal dose are similar for both copula models regardless of how the data are generated and the independence model provides similar performance even though the independence model is unable to appropriately model the correlation between Y T and Y E . The first row is the selection probability for dose z; and the second is the average number of patients treated at dose z. Y T and Y S are simulated with association, ψ k ; k = 1, 2 when appropriate. The operating characteristics for the target dose are in bold.
Scenario 4 (Table 5), represents the scenario where multiple dose levels are acceptable, dose levels 3 and 4, but dose level 3 is optimal. This scenario represents one of the primary motivations for phase I-II designs as there is http://www.biomedcentral.com/1471-2288/14/51 a dose level where further escalation results in a greater probability of toxicity but relatively little efficacy benefit. The results for Scenario 4 are consistent with our previous results: there is little difference between the three models in the probability of correctly identifying the optimal dose and the average number of patients treated at the optimal dose regardless of the correlation between endpoints and how the data are generated. Finally, the results for Scenario 5 can be found in Table 6. In this scenario, all dose levels are safe but have unacceptable efficacy and the correct decision is to terminate for futility. The Gumbel and independence models exhibit similar performance across all scenarios but we do observe that the Braun model is more likely to terminate for futility when Y T and Y E are correlated and the data are generated for the Braun model. Although, the differences are small in both cases.
The simulations results presented in Tables 2, 3, 4, 5 and 6 indicate that specifying the correct copula model has little impact on the operating characteristics of Phase I-II clinical trials but are dependent on a number of factors, including the sample size and priors specified for the logistic regression models used for efficacy and toxicity. In order to account for these factors, we completed additional simulations to evaluate the robustness of our conclusions to changing the sample size and prior distributions. Figure 1 presents simulation results to evaluate the impact of sample size on our conclusions. Presented are the simulated probabilities of correctly identifying the optimal dose for Scenario 1 for maximum sample sizes of 30, 45, 60 and 75 subjects, with data simulated from both models and various levels of correlation. As expected, the probability of correctly identifying the optimal dose increases as the maximum sample size increases. More importantly, increasing the sample size appears to have no impact on our primary conclusion that incorrectly specifying the copula model has little impact on the probability of correctly identifying the optimal dose regardless of the level of correlation between the two endpoints and the model from which the data are generated. Figure 2 presents simulation results to evaluate the impact of changing the priors for the parameters of the logistic regression models for efficacy and toxicity. We considered four prior specifications. Prior specification 1 is the original priors: Presented are the simulated probabilities of correctly identifying the optimal dose for Scenario 1 for the four prior specifications, with data simulated from both models and various levels of correlation. We see that increasing the prior variance of the intercept terms has little impact on the probability of correctly identifying the optimal dose but increasing the variance of the slope decreases the probability of correctly identifying the optimal dose considerably. More importantly, we see that the independence model performs as well or better than the other model in all cases. In addition, we see that the Braun model performs worse than the other two models when we increase the prior variance of the slope parameters with a larger effect observed with moderate and high correlation. This effect is similar regardless of how the data are generated and, in fact, appears slightly larger when the Braun model is actually the correct model.
Our simulation results suggest that specifying the correct copula model does not improve the operating characteristics of our study compared to simply using an independence model and, in fact, results in worse operating characteristics in many cases. This is counter-intuitive as specifying the correct model should result in more efficient estimates of the model parameters. One possible explanation for this phenomenon is that the correlation parameter between Y T and Y E may be too difficult to estimate given the limited sample size. We completed a small simulation study to investigate our ability to estimate ψ 1 and ψ 2 in the Braun and Gumbel copulas, respectively. For each scenario, we considered a study with 11 subjects at each dose level, for a total of 44 subjects, and considered various levels of correlation between Y T and Y E . 1000 simulations were considered for each scenario. Table 7 presents the mean and standard deviation of the posterior mean for ψ 1 and ψ 2 under various levels of correlation. We see that the Braun model, while certainly biased towards independence, appears to be learning about ψ 1 and has posterior means of approximately 0.65 and 0.80 when ψ 1 is equal to 0.70 and 0.90, respectively. In contrast, there is little information about ψ 2 in the Gumbel model and the posterior means are approximately 0.10 and 0.20 when ψ 2 is equal to 0.40 and 0.80, respectively.

Conclusions
We completed a simulation study to evaluate the performance of copula models in phase I-II clinical trials under model misspecification. Our results suggest that the operating characteristics of our study are relatively robust to misspecifying the copula model. Both models exhibited similar performance, as measured by the probability of correctly identifying the optimal dose and the number of subjects treated at the optimal dose, regardless of whether the data were generated from the correct or incorrect copula, even when there is substantial correlation between Y T and Y E . These results were robust to changes in the maximum sample size and the prior distributions for the parameters of the logistic regression models for toxicity and efficacy. In comparing the two models, there was little difference in the operating characteristics, although, the straight-forward interpretation of the model parameters in the Gumbel model may make the Gumbel model more desirable. http://www.biomedcentral.com/1471-2288/14/51 Surprisingly, the naive model that ignores the correlation between Y E and Y T performed as well, better in some cases, with respect to correctly identifying the optimal dose and the number of subjects treated at the optimal dose than even the correct model. This was true regardless of the scenario and true correlation between Y E and Y T . This result is not intuitive as we would expect that correctly specifying the copula model would result in more efficient parameter estimates and improved operating characteristics of our study.
There are several possible explanations for the lack of benefit when utilizing the correct copula model in Phase I-II clinical trials. First, it is possible that the likelihood contains very little information about the correlation parameter and any benefit of modeling the correlation is negated by the need to estimate an additional correlation parameter. In this case, fitting a copula model may result in more variable estimates, in general, which would result in performance that is no better, and potentially worse, than simply assuming that the two endpoints are independent. A second explanation is that Phase I-II clinical trials do not provide sufficient information for selecting the correct copula model. Phase I-II clinical trials utilize small sample sizes, which makes it difficult to properly evaluate the fit of a model. Furthermore, regulatory bodies typically require that a model is specified in advance when utilizing an adaptive trial design. These challenges make it difficult to identify the correct copula from the data, which may negate any benefit from modeling the correlation between the toxicity and efficacy endpoint. Finally, properly modeling the correlation between two endpoints is necessary to complete proper inference (hypothesis tests, credible intervals, etc.) but it may be that modeling this correlation is not necessary in a Phase I-II clinical trial where the goal is to select a dose at study completion regardless of the error associated with estimates of the probability of efficacy and toxicity. In this case, we would not expect any benefit from modeling the correlation, which is consistent with our simulation results.
We completed a second simulation study to investigate the model's ability to estimate the correlation parameters with the sample sizes used in phase I-II clinical trials in order to fully understand the behavior of copula models in phase I-II clinical trials. Estimates of the correlation parameters were biased towards the prior mean of no correlation in both cases but the average posterior mean of the correlation parameter in the Braun model was much closer to the true value of the correlation parameter than in the Gumbel model. This suggests that, while the likelihood for the Braun model contains a fair amount of information for the correlation parameter, the likelihood for the Gumbel model contains very little information about the correlation parameter and provides a potential explanation for the apparent lack of benefit due to properly modeling the correlation between efficacy and toxicity in phase I-II clinical trials.
The results of this manuscript are dependent the decision rule proposed by Thall and Cook [4]. Other decision rules have been proposed for phase I-II clinical trials [5,13] and it is possible that the results of our simulation study would change with a different decision rule. We think that this is unlikely given that we consistently found no benefit of appropriately modeling the correlation between toxicity and efficacy in all scenarios and additional simulation results illustrated that the likelihood contains little information for estimating the correlation