Skip to main content

Optimal designs for phase II/III drug development programs including methods for discounting of phase II results



Go/no-go decisions after phase II and sample size chosen for phase III are usually based on phase II results (e.g., the treatment effect estimate of phase II). Due to the decision rule (only promising phase II results lead to phase III), treatment effect estimates from phase II that initiate a phase III trial commonly overestimate the true treatment effect. Underpowered phase III trials are the consequence. Optimistic findings may then not be reproduced, leading to the failure of potentially expensive drug development programs. For some disease areas these failure rates are described to be quite high: 62.5%.


We integrate the ideas of multiplicative and additive adjustment of treatment effect estimates after go decisions in a utility-based framework for optimizing drug development programs. The design of a phase II/III program, i.e., the “right amount of adjustment”, the allocation of the resources to phase II and III in terms of sample size, and the rule applied to decide whether to stop or to proceed with phase III influences its success considerably. Given specific drug development program characteristics (e.g., fixed and variable per patient costs for phase II and III, probable gain in case of market launch), optimal designs with respect to the maximal expected utility can be identified by the proposed Bayesian-frequentist approach. The method will be illustrated by application to practical examples characteristic for oncological studies.


In general, our results show that the program set-ups with adjusted treatment effect estimate used for phase III planning are superior to the “naïve” program set-ups with respect to the maximal expected utility. Therefore, we recommend considering an adjusted phase II treatment effect estimate for the phase III sample size calculation. However, there is no one-fits-all design.


Individual drug development planning for a specific program is necessary to find the optimal design. The optimal choice of the design parameters for a specific drug development program at hand can be found by our user friendly R Shiny application and package (both assessable open-source via [1]).

Peer Review reports


Exploratory studies are usually carried out to provide a basis for deciding whether or not to proceed with a confirmatory trial and, if necessary, to provide information for planning purposes. In drug development programs, this strong link between exploratory (e.g., phase II) and confirmatory (e.g., phase III) studies favors integrated planning. In particular, the costs of phase III studies have increased remarkably in recent years [2, 3], while failure rates are quite high (approx. 45%, see [4] and the reference mentioned therein). Therefore, the availability and application of quantitative methods for decision making, which should be data-driven and objective, is desirable [5].

Already over 30 years ago, Hughes and Pocock [6] pointed out that decision rules in clinical trials can lead to a bias in the point estimate of the treatment effect, so that the true underlying effect might be overestimated at the time of an early positive decision. Twenty four years and various attempts of authors to adjust for overestimation of the treatment effect (in group sequential designs) later (e.g., [7] and references mentioned therein), Zhang et al. [8] still criticize that the cause and effect of this phenomenon is generally not well-understood. Trying to illustrate the problem, they provide a graphical explanation for the occurrence of overestimation. They argue that random variability (i.e., random highs and lows) of the treatment effect estimate is always present, but stabilizes around the true treatment effect as the trial continues to its end. However, when implementing a decision rule the variability favors the random highs: in a phase II/III drug development program with a go/no-go decision rule, it is only proceeded with phase III when large treatment effects are observed, but stopped when small effects occur. This selective handling of random variability may lead to overestimation of the magnitude of the treatment effect after phase II.

Ellenberg et al. [9] as well as Nardini [10] emphasize that the aim of treatment effect estimation is not to decide whether or not one therapy is better than the other, but to describe the size of therapeutic effects. Thus, we are concerned with a problem of estimation, not a problem of testing. Nardini concludes that estimates arising after a decision rule “should [consequently] not be taken at face value as true estimates of the new treatment’s effect”. Ellenberg et al. point out that statistical methods to adjust for this “random-high bias” exist, but criticize that “they are not applied as often as they should be”. Recently, the U.S. Food & Drug Administration reported 22 case studies since 1999 in which promising phase II clinical trial results were not confirmed in phase III clinical testing [11]. Such experiences are not rare: for some disease areas, the failure rate for phase III trials is reported to be as high as 62.5% [12] and about 50% for approval [13]. Chuang-Stein and Kirby [14] give cause for serious concern, as the severity of this may multiply, considering that the bigger the estimated effect from, e.g., a proof of concept trial, the greater the temptation to invest heavily and conduct multiple studies in parallel. They advise to use the concept of “assurance” for quantification of success probabilities and, moreover, to apply an adjustment for the overestimation of the treatment effect (e.g., [15]) when planning the next phase of a drug development program.

In our framework, we follow the concept of “assurance” [16, 17], which had first been introduced by Spiegelhalter et al. in 1986 with the concept of Bayesian predictive power (compare also “average power”) [18, 19]. This methodology was used later in various contexts by O’Hagan et al. [16, 17] (“assurance”), Chuang-Stein [20], Chuang-Stein and Yang [21] (“average success probability”) and finally by Gasparini et al. and Saint-Hilary et al. (“predictive probability of success”) [22, 23]. The idea is to use a prior distribution for the true assumed treatment effect for trial planning. This is in contrast to the “frequentist world”, where a fixed value is assumed. The “assurance” is then the weighted (unconditional) probability of a successful trial for a given effect, the weighting resulting from the likelihood that the therapy will achieve this effect. Due to synthesizing Bayesian principles in the planning phase and frequentistic decision-making procedures in the analysis, the above-mentioned approaches are described in the literature as “mixed Bayesian-frequentist”.

Kirby et al. [15] and Wang et al. [24] attempt to reduce the impact of overestimation by discounting the phase II treatment effect estimate by applying a multiplicative or additive adjustment, respectively. However, their suggestions are not universally applicable, and are rather “rules of thumb”, e.g., Kirby et al. suggest to use a retention factor of 0.9 times the assumed ratio of the phase III effect to phase II effect.

De Martini [4, 25] reports that the phase II sample size should be almost as large as the ideal phase III sample size (at least 2/3 of the latter) in order to have a sufficiently good information basis for phase III planning. He criticizes that in practice this ratio is only 1/4 on average and that an increase in sponsorship gains from drug development through larger phase II studies has not yet been well investigated. Larger phase II sample sizes would reduce the level of overestimation but increase the estimated phase III sample size [26] and could retrospectively be regarded as an unnecessary high investment in case of a no-go decision. Therefore, an optimal balance is required.

In this article, we integrate the general concepts of using a multiplicative or additive adjustment method to correct for overestimation of the treatment effect in a framework of utility-based optimization of phase II/III development programs [27]. That is, we want to critically examine adjustment methods from an economic point of view. In addition to simultaneously optimizing the phase II go/no-go decision rule and the sample size, we also optimize over the adjustment parameter used for the phase II treatment effect estimation to find “the right level of adjustment” for the specific situation at hand. Our approach can build the bridge between the long existing gap of theory and practice: we provide a Bayesian-frequentist hybrid framework, in which methods proposed for addressing the problem of overestimation of the treatment effect after go decisions are included in the optimization of drug development programs.

In the second section of this paper, we will introduce the basic setting and notation, explain the adjustment methods and show how they are incorporated in our optimization framework. After introducing the utility function and explaining the optimization procedure, we present optimal designs for exemplary settings of drug development programs in Section 3. We finish with a discussion in Section 4 and a conclusion in Section 5.


Basic setting

The considered drug development program consists of one exploratory phase II and one confirmatory phase III trial. Both are randomized trials with two arms (each with 1:1 sample size allocation), performed independently, investigating the same time-to-event primary endpoint and the same population. The true treatment effect is given by the negative logarithm of the true hazard ratio (θ =  − log(HR)), which is the ratio of the hazard functions of the treatment and the control group. In order to reflect the uncertainty in the true treatment effect, θ can be modelled by a prior distribution f(θ). In phase II, the total number of events is denoted by d2 and the maximum likelihood estimate of θ is given by \( {\hat{\theta}}_2 \). We assume that the estimator \( {\hat{\theta}}_2 \) is asymptotically normally distributed with \( {\hat{\theta}}_2\mid \theta \sim N\left(\theta, 4/{d}_2\right) \) (Note that the notation used will not differentiate between the treatment effect estimator (i.e., rule applied to estimate the quantity of interest, which is a random variable) and the treatment effect estimate (i.e., particular realization, fixed value), but by context it will be clear which quantity is meant.). Furthermore, we require that only phase II trials with promising results lead to a phase III trial. This is quantified by a go/no-go criterion with a go-decision in case of \( {\hat{\theta}}_2\ge \kappa \), where κ is a predefined threshold value. In case of a go decision, the number of events for the phase III trial is calculated based on the observed treatment effect of phase II. If the confirmatory analysis in phase III reveals a significant result, program success is declared (compare Fig. 1).

Fig. 1
figure 1

Graphical illustration of basic phase II/III drug development program. The drug development program consists of one exploratory phase II trial, which is, in case of a go decision (i.e., treatment effect estimate of phase II \( {\hat{\theta}}_2 \) exceeds predefined threshold value κ =  − log(HRgo)), followed by one confirmatory phase III trial, where the sample size planning is based on \( {\hat{\theta}}_2 \). The program is considered successful if phase III has a positive (significant) result (i.e., normalized log rank test statistic of phase III T3 is above the 1 − α quantile of the standard normal distribution z1 − α)

Due to the decision rule after phase II, the treatment effect estimate of phase II is biased. The bias is positive with κ > 0 as probability mass is shifted towards higher values:

$$ {\displaystyle \begin{array}{l}E\left[{\hat{\theta}}_2|{\hat{\theta}}_2\ge \kappa \right]=\underset{-\infty }{\overset{\infty }{\int }}\underset{-\infty }{\overset{\infty }{\int }}{1}_{\left\{{\hat{\theta}}_2\ge \kappa \right\}}\cdot {\hat{\theta}}_2\cdot \frac{f\left({\hat{\theta}}_2|\theta \right)}{P\left({\hat{\theta}}_2\ge \kappa |\theta \right)}d{\hat{\theta}}_2\cdot f\left(\theta \right) d\theta \\ {}\kern6em +\underset{-\infty }{\overset{\infty }{\int }}\underset{-\infty }{\overset{\infty }{\int }}{1}_{\left\{{\hat{\theta}}_2\ge \kappa \right\}}\cdot {\hat{\theta}}_2\cdot 0\;d{\hat{\theta}}_2\cdot f\left(\theta \right) d\theta \\ {}\kern6em >\underset{-\infty }{\overset{\infty }{\int }}\underset{-\infty }{\overset{\infty }{\int }}{1}_{\left\{{\hat{\theta}}_2\ge \kappa \right\}}\cdot {\hat{\theta}}_2\cdot f\left({\hat{\theta}}_2|\theta \right)d{\hat{\theta}}_2\cdot f\left(\theta \right) d\theta \\ {}\kern6em +\underset{-\infty }{\overset{\infty }{\int }}\underset{-\infty }{\overset{\infty }{\int }}{1}_{\left\{{\hat{\theta}}_2\ge \kappa \right\}}\cdot {\hat{\theta}}_2\cdot f\left({\hat{\theta}}_2|\theta \right)d{\hat{\theta}}_2\cdot f\left(\theta \right) d\theta =\underset{-\infty }{\overset{\infty }{\int }}\underset{-\infty }{\overset{\infty }{\int }}{\hat{\theta}}_2\cdot f\left({\hat{\theta}}_2|\theta \right)\cdot f\left(\theta \right)d{\hat{\theta}}_2 d\theta =E\left[{\hat{\theta}}_2\right],\end{array}} $$

where here and in the following 1A denotes the indicator function of event A and the density of the distribution of the respective argument is indicated by f(.). The inequation holds as \( \frac{1}{P\left({\hat{\theta}}_2\ge \kappa |\theta \right)}>1 \) and \( {\int}_{-\infty}^{\infty }{1}_{\left\{{\hat{\theta}}_2\ge \kappa \right\}}\frac{f\left({\hat{\theta}}_2|\theta \right)}{P\left({\hat{\theta}}_2\ge \kappa |\theta \right)}d{\hat{\theta}}_2=1 \) and, therefore, the probability mass assigned to values less than κ in the unconditional expectation \( E\left[{\hat{\theta}}_2\right] \) is distributed between values greater than κ in \( E\left[{\hat{\theta}}_2|{\hat{\theta}}_2\ge \kappa \right] \).

Note that the representation of the bias cannot be further simplified, neither by calculating \( \mathrm{E}\left[{\hat{\theta}}_2\right]-\mathrm{E}\left[{\hat{\theta}}_2|{\hat{\theta}}_2\ge \kappa \right] \) nor \( \mathrm{E}\left[{\hat{\theta}}_2\right]/\mathrm{E}\left[{\hat{\theta}}_2|{\hat{\theta}}_2\ge \kappa \right] \).

Therefore, in the following, multiplicative and additive adjustment methods for the treatment effect estimate obtained in phase II will be investigated. Afterwards, dependent on the respective adjustment method, launch criteria and approaches to calculate the number of events for phase III will be presented.

Additive and multiplicative adjustment methods

In this section, we introduce two methods (an additive and a multiplicative adjustment method) to adjust for the overestimation of the phase II treatment effect estimate. It should be mentioned that the terms “multiplicative” and “additive” relate to the specific type of scale and endpoint considered here.

Wang et al. [24] advise to apply an additive adjustment to the phase II treatment effect estimate if it is used for planning the sample size of phase III. They discuss using the lower limit of the one and two standard deviation confidence interval (CI) from the phase II trial (i.e., the lower limit of the CI for \( {\hat{\theta}}_2 \), corresponding to one or two standard deviations below the point estimate), respectively. We denote the significance level of the lower bound for the one-sided CI related to the phase II treatment effect estimate as αCI [0.025, 0.5] and define the additive adjusted treatment effect estimate by \( {\hat{\theta}}_2^{a_{CI}}={\hat{\theta}}_2-{z}_{1-{a}_{CI}}\cdot \sqrt{4/{d}_2} \), with z1 − γ = Φ−1(1 − γ), where Φ(.) denotes the distribution function of the standard normal distribution. Note that our version of the additive adjusted treatment effect estimate is a generalization of that of Wang et al., as they use the lower limit of the one and two standard deviation two-sided CI (i.e., in our notation αCI = 0.32/2 and αCI = 0.05/2) and we allow αCI ranging from 0.025 to 0.5. For αCI = 0.5, the additive adjusted treatment effect estimate is not discounted as \( {\hat{\theta}}_2-{z}_{1-0.5}\cdot \sqrt{4/{d}_2}={\hat{\theta}}_2 \).

Kirby et al. [15] propose a multiplicative adjustment approach. They multiply the observed treatment effect estimate with a factor λ, which can be understood as a retention factor, that is, the fraction of the treatment effect retained. Integrated in our setting, we define \( {\hat{\theta}}_2^{\lambda }=\lambda \cdot {\hat{\theta}}_2 \), where the multiplicative adjustment parameter λ [0.2, 1] can be viewed as the result of discounting the observed treatment effect of phase II by 1 − λ. Note that for λ = 1 the multiplicative adjusted treatment effect estimate is not discounted.

Go/no-go criteria, calculation of expected number of events for phase III and related program characteristics

When designing the phase II/III program, the observed treatment effect estimate of phase II plays a key role in two ways: 1. when making the go/no-go decision (selection s1); 2. when calculating the phase III sample size (selection s2; compare Fig. 1). At both instances, one has to decide whether or not to use an adjusted or unadjusted treatment effect estimate. To ease notation, the naïve (unadjusted) treatment effect estimate of phase II is denoted by \( {\hat{\theta}}_2^u={\hat{\theta}}_2 \).

1.: If the treatment effect estimate \( {\hat{\theta}}_2^{s_1} \), where s1 = λ, αCI or u (i.e., the multiplicatively adjusted, additively adjusted or unadjusted treatment effect estimate is selected for the decision rule), exceeds a predefined threshold value κ, it is decided to go to phase III and otherwise to stop the program. Hence, the expected probability to go to phase III can be determined by

$$ {p}_{go}\left({\hat{\theta}}_2^{s_1}\right)={\int}_{-\infty}^{\infty }P\left({\hat{\theta}}_2^{s_1}\ge \kappa |\theta \right)\cdot f\left(\theta \right) d\theta, $$

s1 = λ, αCI or u (compare Table A0 in the Additional file 1).

2.: In case of a go decision, the number of events for phase III is calculated based on the treatment effect estimate of phase II \( {\hat{\theta}}_2^{s_2} \), s2 = λ, αCI or u, a desired power 1 − β, and a one-sided significance level α. For a balanced allocation ratio, it can be calculated by

$$ {D}_3={D}_3\left({\hat{\theta}}_2^{s_2}\right)=\frac{4\cdot {\left({z}_{1-\alpha }+{z}_{1-\beta}\right)}^2}{{\left({\hat{\theta}}_2^{s_2}\right)}^2}, $$

by assuming proportional hazards and asymptotic properties of the log-rank test statistic [28]. When going to phase III, the expected number of events (in phase II/III programs with decision rule \( {\hat{\theta}}_2^{s_1}\ge \kappa \) and \( {\hat{\theta}}_2^{s_2} \) used to calculate the number of events for phase III) can be determined by

$$ {d}_3\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{s_2}\right)=\mathrm{E}\left[{D}_3\left({\hat{\theta}}_2^{s_2}\right)\cdot {1}_{\left\{{\hat{\theta}}_2^{s_1}\ge \kappa \right\}}\right]={\int}_{-\infty}^{\infty }{\int}_{-\infty}^{\infty }{1}_{\left\{{\hat{\theta}}_2^{s_1}\ge \kappa \right\}}\cdot \frac{4\cdot {\left({z}_{1-\alpha }+{z}_{1-\beta}\right)}^2}{{\left({\hat{\theta}}_2^{s_2}\right)}^2}\cdot f\left({\hat{\theta}}_2|\theta \right)d{\hat{\theta}}_2\cdot f\left(\theta \right) d\theta, $$

(compare Table A0). The expectation of the estimate (of phase II) used for the sample size calculation can be calculated by

$$ {e}_2={e}_2\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{s_2}\right)=\mathrm{E}\left[{\hat{\theta}}_2^{s_2}|{\hat{\theta}}_2^{s_1}\ge \kappa \right]={\int}_{-\infty}^{\infty}\frac{1}{\mathrm{P}\left({\hat{\theta}}_2^{s_1}\ge \kappa |\theta \right)}{\int}_{-\infty}^{\infty }{1}_{\left\{{\hat{\theta}}_2^{s_1}\ge \kappa \right\}}\cdot {\hat{\theta}}_2^{s_2}\cdot f\left({\hat{\theta}}_2|\theta \right)d{\hat{\theta}}_2\cdot f\left(\theta \right) d\theta, $$

for s1, s2 = λ, αCI or u (compare Table A0) in order to calculate the bias \( \mathrm{E}\left[{\hat{\theta}}_2^{s_2}|{\hat{\theta}}_2^{s_1}\ge \kappa \right]-\mathrm{E}\left[{\hat{\theta}}_2\right] \). As proposed by De Martini [4, 25], the ratio of the number of events in phase II and III will also be calculated.

The program is considered to be successful, if the one-sided null hypothesis H0 : θ ≤ 0 is rejected in favour of H1 : θ > 0 at a one-sided significance level α. This is the case if T3 > z1 − α, where T3 is the normalized log-rank test statistic in phase III, which is assumed to be asymptotically normally distributed, i.e., \( {T}_3={T}_3\mid {\hat{\theta}}_2,\theta \sim N\left(\theta /\sqrt{4/{D}_3},1\right) \). Note that significance testing is performed on phase III data only. Therefore, the expected probability of a successful program \( PsP\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{s_2}\right) \) (with decision rule \( {\hat{\theta}}_2^{s_1}\ge \kappa \), and \( {\hat{\theta}}_2^{s_2} \) used to calculate the number of events for phase III), which is defined as the expected probability of the joint event of going to phase III and achieving a significant result [25, 27], can be calculated by

$$ PsP\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{s_2}\right)={\int}_{-\infty}^{\infty }{\int}_{-\infty}^{\infty }{1}_{\left\{{\hat{\theta}}_2^{s_1}\ge \kappa \right\}}\cdot {\int}_{\left\{{z}_{1-\alpha}\right\}}^{\infty }f\left({t}_3|{\hat{\theta}}_2,\theta \right)d{t}_3\cdot f\left({\hat{\theta}}_2|\theta \right)d{\hat{\theta}}_2\cdot f\left(\theta \right) d\theta, $$

where t3 is a realization of \( {T}_3\mid {\hat{\theta}}_2,\theta \) (compare Table A0). One reviewer pointed out that this definition of a successful program records a false positive result (i.e. T3 > z1 − α under H0) as program success. We discuss this aspect in detail in Section A1 of Additional file 1. In reality, regulatory approval and with that a monetary gain, which is the core driver for our utility function, is achieved when a significant result is observed in phase III, acknowledging that there is a probability of α that it is a false positive decision. Thus, we keep the commonly used term “success” and PsP which should be regarded as probability of market access and not a probability of a correct decision.

Considered program set-ups

We investigate the impact of using adjusted treatment effect estimates (i.e., \( {\hat{\theta}}_2^{\lambda } \) or \( {\hat{\theta}}_2^{\alpha_{CI}} \)) for the go/no-go decision and/or for the calculation of the number of events for phase III on the drug development program characteristics and compare the results to those where the unadjusted (naïve) treatment effect estimate \( {\hat{\theta}}_2^u \) was used. Therefore, we investigate different program set-ups \( S\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{s_2}\right) \) which are defined by the selection of the treatment effect estimate used for the decision rule (selection s1) and, in case of a go decision, by the choice of the treatment effect estimate used for the calculation of the number of events for phase III (selection s2).

Table 1 gives an overview of the considered program set-ups. We compare the “unadjusted” set-up \( \left({\hat{\theta}}_2^u,{\hat{\theta}}_2^u\right) \), where \( {\hat{\theta}}_2^u={\hat{\theta}}_2 \) (i.e., s1, s2 = u), with two “multiplicatively adjusted” set-ups \( S\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{\lambda}\right) \) (s1 {u, λ}, s2 = λ), and two “additively adjusted” set-ups \( S\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{\alpha_{CI}}\right) \) (s1 {u, αCI}, s2 = αCI). Note that if s1 ≠ u, we define s2 = s1, which means that if an adjustment parameter is used for the decision rule, the same adjustment parameter is used for the calculation of the expected number of events for phase III (for reasons which will be given later).

Table 1 Overview of program set-ups \( S\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{s_2}\right) \)

Utility function

The aim is to optimize a phase II/III drug development program in terms of the adjustment parameters λ or αCI, the number of events in phase II d2, and the go/no-go decision threshold value κ. Therefore, we set up a utility function, which utilizes the difference between program costs and potential gains after successful market launch (compare Fig. 2 for a graphical illustration). For the costs, fixed (c02, c03) and variable per-patient (c2, c3) costs are included for the phase II and III trial, respectively. By dividing the number of events by the event rate ξi, the total number of patients can be calculated for the respective phase i = 2, 3. Obviously, only in case of a go decision the costs of the phase III trial apply. In case of program success, a benefit b is obtained, and we assume that the level of benefit depends on the observed treatment effect in the phase III trial as suggested by a report of the German Institute for Quality and Efficiency in Health Care [29]. As proposed by them, three effect size categories (small, medium and large) are used, whereby each category is defined by a threshold value (1, 0.95, 0.85) for the upper boundary of the 95% confidence interval for the HR (for details on the derivation of these threshold values, the interested reader may be referred to the “Anhang A”of [29]). The corresponding amount of benefit is denoted by b1, b2 and b3, respectively. Based on this, costs c(d2, κ, s2) and gain g(d2, κ, s2) for a phase II/III program with program set-up \( S\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{s_2}\right) \) are given by

$$ c\left({d}_2,\kappa, {s}_2\right)={c}_{02}+\frac{d_2}{\xi_2}\cdot {c}_2+{1}_{\left\{{\hat{\theta}}_2^{s_1}\ge \kappa \right\}}\cdot \left({c}_{03}+\frac{D_3}{\xi_3}\cdot {c}_3\right) $$
$$ g\left({d}_2,\kappa, {s}_2\right)={1}_{\left\{{\hat{\theta}}_2^{s_1}\ge \kappa \right\}}\cdot \left({b}_1\cdot {1}_{\left\{{T}_3\in {I}_1\right\}}+{b}_2\cdot {1}_{\left\{{T}_3\in {I}_2\right\}}+{b}_3\cdot {1}_{\left\{{T}_3\in {I}_3\right\}}\right), $$

where \( {I}_1=\left({z}_{1-\alpha },-\log (0.95)/\sqrt{4/{D}_3}+{z}_{1-\alpha}\right] \), \( {I}_2=\left(-\log (0.95)/\sqrt{4/{D}_3}+{z}_{1-\alpha },-\log (0.85)/\sqrt{4/{D}_3}+{z}_{1-\alpha}\right] \) and \( {I}_3=\left(-\log (0.85)/\sqrt{4/{D}_3}+{z}_{1-\alpha },\infty \right) \) are transformations of the effect size intervals to intervals on the test statistic scale of T3. Thus, the costs depend on the observed treatment effect in phase II and the gain depends on the observed treatment effect in phase II and III.

Fig. 2
figure 2

Graphical illustration of (adjusted) utility-based optimization. The treatment effect estimate of phase II may be adjusted for the decision rule (selection s1 {u, s2} and/or for the calculation of the number of events for phase III (selection s2 {λ, αCI, u}). The utility (including the costs and the gain) is optimized over the number of events for phase II d2, the threshold value for the decision rule HRgo, and the adjustment parameter s2 = λ or αCI (see Section 2.5 for details), ξi event rate in phase i = 2, 3, bj = bj(T3) benefit categories j = 1, 2, 3

The utility is defined as the difference between costs and gain and expressed as a function of d2 and κ over which it is simultaneously optimized. In the adjusted program set-ups, the optimization is also over λ in the multiplicatively, and over αCI in the additively adjusted set-ups, respectively. Thus, we define the utility for program set-up \( S\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{s_2}\right) \) by

$$ u\left({d}_2,\kappa, {s}_2\right)=g\left({d}_2,\kappa, {s}_2\right)-c\left({d}_2,\kappa, {s}_2\right), $$

where for the unadjusted program set-up \( S\left({\hat{\theta}}_2^u,{\hat{\theta}}_2^u\right) \) u(d2, κ, s2) = u(d2, κ). To incorporate the development risk in terms of success probabilities, we consider the expected utility with respect to θ, \( {\hat{\theta}}_2 \) and T3 E[u(d2, κ, s2)] = E[g(d2, κ, s2)] − E[c(d2, κ, s2)], where the expected costs and gain with respect to θ, \( {\hat{\theta}}_2 \) and T3 are given by

$$ {\displaystyle \begin{array}{c}E\left[c\left({d}_2,\kappa, {s}_2\right)\right]={c}_{02}+{d}_2/{\xi}_2\cdot {c}_2+{c}_{03}\cdot {\int}_{-\infty}^{\infty }{\int}_{-\infty}^{\infty }{1}_{\left\{{\hat{\theta}}_2^{s_1}\ge \kappa \right\}}\cdot f\left({\hat{\theta}}_2|\theta \right)\cdot f\left(\theta \right)d{\hat{\theta}}_2 d\theta \\ {}+{c}_3/{\xi}_3\cdot {\int}_{-\infty}^{\infty }{\int}_{-\infty}^{\infty }{1}_{\left\{{\hat{\theta}}_2^{s_1}\ge \kappa \right\}}\cdot {D}_3\left({\hat{\theta}}_2^{s_2}\right)\cdot f\left({\hat{\theta}}_2|\theta \right)\cdot f\left(\theta \right)d{\hat{\theta}}_2 d\theta, \\ {}E\left[g\left({d}_2,\kappa, {s}_2\right)\right]=\sum \limits_{j=1}^3{b}_j{\int}_{-\infty}^{\infty }{\int}_{-\infty}^{\infty }{\int}_{-\infty}^{\infty }{1}_{\left\{{\hat{\theta}}_2^{s_1}\ge \kappa \right\}}\cdot {1}_{\left\{{T}_3\in {I}_j\right\}}\cdot f\left({t}_3|{\hat{\theta}}_2,\theta \right)\cdot f\left({\hat{\theta}}_2|\theta \right)\cdot f\left(\theta \right)d{t}_3d{\hat{\theta}}_2 d\theta .\end{array}} $$

The aim is to find a design δ = (d2, κ, s2) that maximizes the expected utility E[u(d2, κ, s2)] for programs with program set-up \( S\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{s_2}\right). \) The optimization is carried out over d2, κ, and λ in the multiplicatively or αCI in the additively adjusted set-ups, respectively. The optimal design δ for each program set-up \( S\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{s_2}\right) \) is defined to be the design for which the expected utility is maximized, that is, \( \mathrm{E}\left[u\left({\delta}^{\ast}\right)\right]=\underset{\delta \in D}{\max}\mathrm{E}\left[u\left(\delta \right)\right] \), where D = {δ = (d2, κ, s2)} is the optimization set.

The optimization is solved by using numerical integration procedures written in the programming language R [30]. In order to facilitate the application of the approach, an user friendly R Shiny App (bias) and an R package (drugdevelopR including the R function optimal_bias) are provided open-source (both assessable via [1]).

Illustration of the framework by application to oncology trial example and practical extensions

In this paper, the parameters in the oncology trial example are chosen as in Kirchner et al. [27] to allow comparison of results. It should be noted that the example is primarily given to illustrate the framework and the chosen parameters should not be taken as face values. We tried to elicit the design parameters as realistic as possible to mimic an oncology drug development program by means of information from relevant literature and consultation with experts from the pharmaceutical industry in the field of oncology. However, it should be noted, that these parameters must be chosen carefully and specifically for each drug development scenario at hand.

The event rates for phase II and III are set to ξi = 0.7 for i = 2, 3. Therefore, the total sample size can be calculated by di/0.7, i = 2, 3. In practice, estimates on the event rates could be obtained by taking recruitment rates and duration as well as drop-out rates and treatment group specific hazards into account. However, using those parameters often leads to event rates around ξi = 0.7 as it is a compromise between data maturity and avoidance of long follow-up times if drop-out rates are higher than expected. If ξi < 0.5 the median event time might not be observed while if ξi is too high, the planned number of events might not be reached at all with substantial drop-out rates.

For phase III oncology trials, per-patient costs between 75,000 and 125,000 US $ are reported [31]. Therefore, per-patient costs for phase III of $105 are considered and c3 is set to 1 (in $105). Furthermore, the per-patient costs for phase II are set to c2 = 0.75 (in $105). Due to, for example, additional biomarker measurements made in phase III, or because regulatory agencies may require more extensive data collection in phase III [32], higher per-patient costs in phase III compared with phase II are reasonable. In this example, the fixed costs for phase II and III are set to c02 = 100 and c03 = 150 (in $105), respectively. To investigate different scenarios, the benefit parameters b1, b2 and b3 are chosen to embody a low (b1, b2, b3) 1 : (1000, 2000, 3000), 2 : (1000, 2000, 4000), 3 : (1000, 3000, 4000) and a medium to large (b1, b2, b3) = 4 : (1000, 3000, 5000), 5 : (1000, 4000, 5000), 6 : (1000, 3000, 6000), 7 : (1000, 4000, 6000) over-all benefit (in $105), where we assume a 5-year income period and profit margin of 0.2. Thus, seven different benefit scenarios (bs 1–7) will be considered. A mixture distribution consisting of the weighted sum of two normal distributions

$$ \theta \sim w\cdotp N\left(-\mathit{\log}(0.69),\left(4/210\right)\right)+\left(1-w\right)\cdotp N\left(-\mathit{\log}(0.88),\left(4/420\right)\right), $$

as proposed by Götte et al. [26] can be used to model the true treatment effect. The two normal distributions each depict a distribution for θ, whereby the means represent values of the assumed true treatment effect and the denominators of the associated variances can be viewed as “amount of certainty” about the treatment effect size in terms of numbers of events. The parameters of the distributions (i.e., means and variances) are elected such that a realistic range for the HR is covered (compare Fig. A2 in Additional file 1 and/or investigate the prior distribution with the help of our R shiny App prior [33]). The mean of the first of the two normal distributions characterizes a strong, the second one a moderate to low treatment effect, so that by ranging w from, e.g., 0.3 to 0.9 we can mirror pessimistic to more optimistic opinions about the true treatment effect. In practice, the choice of w can be guided by formal expert elicitation methods. Dallow et al. [34] presented an overview of such methods including elicitation of Gaussian mixture distributions. Note that the approach is general and allows for implementation of any alternative prior distribution. Again, elicitation methods (compare also, e.g., [35]) are a useful tool that may help (a group of) experts to quantify their opinions about the treatment effect as a probability distribution. Various software packages enable their practical application (compare, e.g., [36]).

In our framework it is also possible to account for, e.g., different population structures in phase II and phase III (due to different countries, centers, in-/exclusion criteria, …) by assuming different distributions for the assumed true treatment effect in phase II and III (i.e., θ2θ3), so that \( {\hat{\theta}}_2\mid {\theta}_2\sim N\left({\theta}_2,4/{d}_2\right) \) and \( {T}_3\mid {\hat{\theta}}_2,{\theta}_2,{\theta}_3\sim N\left({\theta}_3/\sqrt{4/{D}_3},1\right) \). For ease of interpretation, all formulas and results presented in the main part are for the special case, where the true treatment effect is modelled by the same distribution for phase II and III (e.g., θ~θ2~θ3), and a brief investigation of this aspect can be found in Section A2 of Additional file 1.

In this example, we chose a wide range for κ (and d2, as well as λ or αCI, respectively) such that the optimization is not influenced by that choice. Therefore, the optimization set is D ={δ = (d2, κ, s2), d2 {50, 52, …, 350}, κ {− log(0.9), − log(0.89), …, − log(0.7)}, s2 = λ {0.2, 0.225, …, 1} or s2 = αCI {0.025, 0.075, …, 0.5}}. However, the lower bound of the decision rule set for κ can also be seen as representing a predefined clinically relevant effect size: phase III trials are then only conducted if the treatment effect observed in phase II is at least of that size. In Section A3 of Additional file 1, we present results of the procedure, where we chose min(κ) =  − log(0.8). Furthermore, it might be interesting to see how the optimal program design is influenced by the sponsor’s real life budget constraint. Therefore, we also consider optimizing the drug development program with a constraint K on the expected costs of the program, i.e., E[c(d2, κ, s2)] ≤ K (see Section A4 of Additional file 1 for more details). In pharmaceutical industry there are often discussions about skipping the phase II trial. For example, if competitors have already approved a drug with a similar mode of action one might see no need for further learning about the drug and go directly to a confirmatory trial. Our framework allows to systematically assess this aspect by setting d2 = 0, c02 = c2 = 0 and pgo = 1 (see Section A5 of Additional file 1 for more details). In addition, different definitions of the cost and benefit functions are possible. As mentioned above, the choice of three effect size categories (and therefore the benefit function) is based on a report of the German Institute for Quality and Efficiency in Health Care [29]. However, the presented framework could also be applied to an alternative set-up as, for example, the one proposed by Ding et al. [32]. Here, a proportional relationship between benefit and effect size is considered. In Section A6 of Additional file 1 we investigate this possibility in more detail.


This section is organized as follows. It starts with general observations across all program set-ups \( S\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{s_2}\right) \). Then, we compare multiplicative \( S\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{\lambda}\right) \) vs. additive \( S\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{\alpha_{CI}}\right) \) vs. no adjustment \( S\left({\hat{\theta}}_2^u,{\hat{\theta}}_2^u\right) \), where s1 = u or s1 = s2. The impact of adjusting the go/no go decision making, i.e., the differences between both multiplicative (\( S\left({\hat{\theta}}_2^u,{\hat{\theta}}_2^{\lambda}\right) \) vs. \( S\left({\hat{\theta}}_2^{\lambda },{\hat{\theta}}_2^{\lambda}\right) \)) and both additive adjustment methods (\( S\left({\hat{\theta}}_2^u,{\hat{\theta}}_2^{\alpha_{CI}}\right) \) vs. \( S\left({\hat{\theta}}_2^{\alpha_{CI}},{\hat{\theta}}_2^{\alpha_{CI}}\right) \)) are also presented. A discussion of the results is given in the next section.

The optimization results are presented in Table 2 (naïve setting, multiplicative adjustment), Table 3 (additive adjustment) and Figure 3, which show the optimal design parameters \( {\delta}^{\ast }=\left({d}_2^{\ast },{\kappa}^{\ast },{s}_2^{\ast}\right) \):

  • optimal total number of events for phase II \( {d}_2^{\ast } \) (given by the optimal value of d2D),

  • optimal go/no-go decision rule threshold value \( {HR}_{go}^{\ast } \) (given by the optimal value of κD in “HR-scale”, i.e., \( {HR}_{go}^{\ast }=\exp \left(-{\kappa}^{\ast}\right) \)) and

  • optimal adjustment parameter \( {s}_2^{\ast}\in \left\{{\lambda}^{\ast },{a}_{CI}^{\ast}\right\} \) (given by the optimal value of s2D) for the multiplicative and additive adjustment method, respectively,

with corresponding program characteristics for the optimal design:

  • maximal expected utility u = E[u(δ)],

  • expected number of events for phase III \( {d}_3^{\ast }={d}_3\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{s_2^{\ast }}\right) \), where we chose a desired power of 1 − β = 0.9 and a one-sided significance level α = 0.025,

  • total number of expected events in the program \( {d}^{\ast }={d}_3^{\ast }+{d}_2^{\ast } \),

  • expected probability to go to phase III \( {p}_{go}^{\ast }={p}_{go}\left({\hat{\theta}}_2^{s_1}\right) \),

  • expected probability of a successful program \( {sP}^{\ast }= PsP\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{s_2^{\ast }}\right) \) and

  • expected estimate of phase II used for sample size calculation \( {\varepsilon}_2^{\ast }=\exp \left(-{e}_2\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{s_2^{\ast }}\right)\right) \) in “HR-scale”,

for program set-ups \( S\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{s_2}\right) \), where s1 = u or \( {s}_1={s}_2^{\ast}\in \left\{{\lambda}^{\ast },{a}_{CI}^{\ast}\right\} \), benefit scenarios (bs 1-7) and weights for the prior distribution of the true underlying effect (w = 0.3, 0.6, 0.9), where \( E\left[{\hat{\theta}}_2\right]={\int}_{-\infty}^{\infty }{\int}_{-\infty}^{\infty }{\hat{\theta}}_2\cdot f\left({\hat{\theta}}_2|\theta \right)\cdot f\left(\theta \right)d{\hat{\theta}}_2 d\theta \).

Table 2 Optimal design parameters for unadjusted and multiplicatively adjusted program set-ups
Table 3 Optimal design parameters for additively adjusted program set-ups
Fig. 3
figure 3

Optimization results. Maximal expected utility u, corresponding optimal design parameters \( {\delta}^{\ast }=\left({d}_2^{\ast },{HR}_{go}^{\ast}\right) \), \( {\delta}^{\ast }=\left({d}_2^{\ast },{HR}_{go}^{\ast },{\lambda}^{\ast}\right) \) or \( {\delta}^{\ast }=\left({d}_2^{\ast },{HR}_{go}^{\ast },{\alpha}_{CI}^{\ast}\right) \), expected probability to go to phase III \( {p}_{go}^{\ast } \), expected probability of a successful program sP, expected estimate used for sample size calculation \( {\varepsilon}_2^{\ast } \), expected number of events in phase III when going to phase III \( {d}_3^{\ast } \) and expected total number of events of program d in the optimal design, for c2 = 0.75, c3 = 1, c02 = 100, c03 = 150 in $ 105, ξ2 = ξ3 = 0.7, 1 − β = 0.9, α = 0.025 (one sided), for program set-ups \( S\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{s_2}\right) \), s1, s2 = λ, αCI or u (that is \( S\left({\hat{\theta}}_2^u,{\hat{\theta}}_2^u\right) \): black circle; \( S\left({\hat{\theta}}_2^u,{\hat{\theta}}_2^{\lambda}\right) \), \( S\left({\hat{\theta}}_2^{\lambda },{\hat{\theta}}_2^{\lambda}\right) \): green cross; \( S\left({\hat{\theta}}_2^u,{\hat{\theta}}_2^{\alpha_{CI}}\right) \), \( S\left({\hat{\theta}}_2^{\alpha_{CI}},{\hat{\theta}}_2^{\alpha_{CI}}\right) \): violet triangle), benefit scenarios bs 1–7, and weights for the prior distribution w = 0.3, 0.6, 0.9, where the yellow line indicates \( \exp \left(-\mathrm{E}\left[{\hat{\theta}}_2\right]\right) \). Note that the symbols used to show the program characteristics of both multiplicatively and additively adjusted program set-ups, i.e., green crosses and violet triangles, appear as stars when plotted on top of each other

Overall, larger assumed benefits (i.e., larger values for (b1, b2, b3)) lead to more liberal optimal decision rules (i.e., larger values for \( {HR}_{go}^{\ast } \)) and higher investment in phase II (i.e., larger number of events for phase II \( {d}_2^{\ast } \)). This leads to a larger investment (in phase III), i.e., a higher expected probability to go to phase III \( {p}_{go}^{\ast } \) and a larger expected number of events in phase III \( {d}_3^{\ast } \), respectively. This results in a larger expected probability of a successful program sP and thus in a larger maximal expected utility u.

In the multiplicatively adjusted program set-ups \( S\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{\lambda}\right) \), the maximal expected utility is always higher than the maximal expected utility in the additively adjusted program set-ups \( S\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{\alpha_{CI}}\right) \), which in turn is always higher than the maximal expected utility in the unadjusted program set-up \( S\left({\hat{\theta}}_2^u,{\hat{\theta}}_2^u\right) \). It stands out that the investment in terms of numbers of events (i.e., \( {d}_2^{\ast },{d}_3^{\ast },{d}^{\ast } \)) tends to be higher in the adjusted program set-ups compared to the unadjusted program set-up, especially for scenarios with higher benefits and more optimistic prior. The expected probability to go to phase III \( {p}_{go}^{\ast } \) is notably lower in the adjusted program set-ups compared to the unadjusted program set-up, whereas the expected probability of a successful program sP is higher.

Dividing the optimal number of events in phase II by the expected number of events in phase III (i.e., \( {d}_2^{\ast } \) / \( {d}_3^{\ast } \)), leads to values of 0.55–0.64, 0.55–0.64, 0.58–0.67, 0.43–0.54 and 0.42–0.54 in program set-up \( S\left({\hat{\theta}}_2^u,{\hat{\theta}}_2^u\right) \), \( S\left({\hat{\theta}}_2^u,{\hat{\theta}}_2^{\alpha_{CI}}\right) \), \( S\left({\hat{\theta}}_2^{\alpha_{CI}},{\hat{\theta}}_2^{\alpha_{CI}}\right) \), \( S\left({\hat{\theta}}_2^u,{\hat{\theta}}_2^{\lambda}\right) \) and \( S\left({\hat{\theta}}_2^{\lambda },{\hat{\theta}}_2^{\lambda}\right) \), respectively. Furthermore, it can be observed that the treatment effect estimate of phase II used for sample size calculation in the optimal design is overestimated in the unadjusted setting (\( {\varepsilon}_2^{\ast }<\exp \left(-\mathrm{E}\left[{\hat{\theta}}_2\right]\right) \) as indicated by the black circles and yellow line in Figure 3). This overestimation is lower in the adjusted settings and can even result in an underestimation (compare multiplicative settings for w = 0.9).

The operating characteristics for the optimal designs (e.g., u, sP) compared between the two multiplicatively and the two additively adjusted program set-ups do not vary (much) for each benefit scenario bs and choice of weight for the prior distribution w, respectively. However, there are differences in the optimal choice of the threshold value for the decision rule \( {HR}_{go}^{\ast } \): in the program set-ups with adjusted phase II treatment effect estimate used for decision making (\( S\left({\hat{\theta}}_2^{\lambda },{\hat{\theta}}_2^{\lambda}\right) \) and \( S\left({\hat{\theta}}_2^{\alpha_{CI}},{\hat{\theta}}_2^{\alpha_{CI}}\right) \)), \( {HR}_{go}^{\ast } \) is always larger (by 0.04 to 0.06 and by 0.01 to 0.07, respectively) than in the program set-ups with unadjusted treatment effect used for decision making (\( S\left({\hat{\theta}}_2^u,{\hat{\theta}}_2^{\lambda}\right) \) and \( S\left({\hat{\theta}}_2^u,{\hat{\theta}}_2^{\alpha_{CI}}\right) \)).


To find optimal drug development designs, the costs of the program (fixed/variable costs for phase II/III), the assumed benefit, and the development risk (i.e., the expected probability of a successful program) are taken into account. By maximizing the expected utility with respect to the design parameters (adjustment parameter, number of events for phase II and threshold value for the go/no-go decision rule), optimal phase II/III drug development program designs can be found. Therefore, it enables quantitative reasoning for the design (i.e., the optimal “amount of adjustment”, sample size and decision rule) for specific drug development programs at hand.

We investigated two adjustment methods (additive and multiplicative adjustment), several benefit scenarios (e.g., low, medium, large overall benefit), different distributions for the true treatment effect (with the same and different distributions in phase II and III), scenarios with a real life budget constraint, scenarios with a predefined clinically relevant effect, and scenarios where phase II could be skipped, hence presented a method for the implementation of a variety of possible oncology drug development program scenarios, and an opportunity for assessing associated changes of the optimal design parameters. Of course, the implementation of alternative (e.g., proportional relationship between benefit and effect size) or more complex planning situations and broader application to other research areas are possible by choosing relevant (e.g., cost and benefit) parameters appropriately [37,38,39]. As the framework has been shown to be very flexible, frequent scenarios in oncology drug development are adequately mapped with our approach. However, certain situations may be simplified. For example, in our framework the development program consists entirely of just one phase II trial and one phase III trial, which is, however, not unusual in oncology. For situations that two or more phase III trials are performed, the framework of optimal planning of development programs was presented in a recent article by Preussler et al. [40]. Furthermore, we assumed the phase II trial to be two-armed. In the field of oncology dose investigations are often performed before and not as a part of phase II. However, in other indications dose-finding is performed in phase II. Methods for optimizing phase II/III programs with multi-armed phase II/III studies are presented in Preussler et al. [41]. Futility investigations in the phase III trial and/or considering a “seamless design” for the final analysis may be a worthwhile option, and it will be a topic of future research to investigate their impact on the optimal design. We assumed that the endpoint used in phase II and phase III is the same. We are currently exploring the situation that a surrogate (like progression-free or disease-free survival) is captured in phase II and overall survival is the primary endpoint in phase III. Another important point is that time-effects are not considered in this article. The program is unaccounted for the duration of development which is amongst others discussed in Preussler et al. [41]. That work presents in detail how to incorporate the impact of trial duration into the framework (compare Supplementary Material A2 [41]). However, when trying to incorporate “time” into the utility function, many aspects have to be considered. For example, one could take into account the “life cycle” of a drug as proposed by Patel & Ankolekar [42] who describe a typical life cycle by an early growth phase followed by a plateau, after which the sales decline as the patent expires. Furthermore, if there are several competitors investigating a similar drug then the company, who is the first to bring the drug to the market, usually gets the higher market share, i.e., higher gain. However, including these aspects requires competitor information and assumptions about their unknown future observed treatment effects. Any such assumptions are usually associated with very high uncertainty. Instead of trying to include too many (unknown) aspects into the utility function a rather simplified approach, as presented here, is advisable. If after observing phase II data further information about the potential of the drug, dose, target population or (time-dependent) benefits are available the probability of success (compare [43]) and the utility function could be updated to support go/no-go decisions as well as the design of the phase III trial.

In general, our results show that the adjusted program set-ups are superior to the unadjusted program set-up with respect to the maximal expected utility. This is associated with higher investments in terms of number of events and lower expected probabilities to go to phase III in the adjusted program set-ups compared to the unadjusted approach. Thus, in the adjusted program set-ups it is less often decided to go to phase III, but in case of a go decision, the investment in terms of sample size is higher. These aspects are particularly true for the multiplicatively adjusted program set-ups, which have also higher expected probabilities of a successful program compared to the additively adjusted and unadjusted program set-ups. Simply said, the money is spent more wisely when adjustment methods are used.

Values for the adjustment parameters that do not lead to an adjustment (i.e., αCI = 0.5 and λ = 1 in the additively and multiplicatively adjusted program set-ups, respectively) were included but never selected in the optimization. Thus, the results suggest that adjustment should always be considered, which is in line with Chuang-Stein and Kirby [14]. Furthermore, we see that in the unadjusted case there is an overestimation of the treatment effect after phase II, which is mitigated by the adjustments. In the multiplicative setting it is even shown that an overcorrection and thus an even larger investment in terms of sample size can be worthwhile with respect to the expected utility. Note that the focus is on maximal expected utility and the expected estimate of phase II is only a supporting variable, i.e., obtaining a “perfectly” unbiased estimator is not the goal in this application. With regard to the optimal number of events in phase II compared to phase III (\( {d}_2^{\ast } \) / \( {d}_3^{\ast } \)), it can be seen that with the framework in the unadjusted and additive case one ends up in the “desirable” (according to De Martini [4, 25]) range of 2/3 and also in the multiplicative case with lower \( {d}_2^{\ast } \) / \( {d}_3^{\ast } \), one still exceeds the often used 1/4. However, it should be noted that the total optimal sample size is highest for the multiplicative case.

Both multiplicatively adjusted (i.e., \( S\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{\lambda}\right) \)) and additively adjusted (i.e., \( S\left({\hat{\theta}}_2^{s_1},{\hat{\theta}}_2^{\alpha_{CI}}\right) \)) program set-ups do not differ in their maximal expected utility, whereas the program set-ups with adjusted estimate used for decision making (i.e., \( S\left({\hat{\theta}}_2^{\lambda },{\hat{\theta}}_2^{\lambda}\right) \) and \( S\left({\hat{\theta}}_2^{\alpha_{CI}},{\hat{\theta}}_2^{\alpha_{CI}}\right) \)) have larger optimal threshold values for the decision rule than program set-ups where only the estimate used for calculating the expected number of events for phase III is adjusted (i.e., \( S\left({\hat{\theta}}_2^u,{\hat{\theta}}_2^{\lambda}\right) \) and \( S\left({\hat{\theta}}_2^u,{\hat{\theta}}_2^{\alpha_{CI}}\right) \)). Considering only these two aspects, adjustment of the treatment effect estimate used for the decision rule may be omitted when also optimizing the threshold value for the decision rule: this only leads to larger values for \( {HR}_{go}^{\ast } \) (i.e., more liberal decision rules) which compensate the adjusted (more conservative) treatment effect estimates. For the same reason, program set-ups \( S\left({\hat{\theta}}_2^{\lambda },{\hat{\theta}}_2^u\right) \) and \( S\left({\hat{\theta}}_2^{\alpha_{CI}},{\hat{\theta}}_2^u\right) \) (i.e., multiplicative or additive adjustment used for the decision rule and no adjustment applied for the calculation of the number of events for phase III) are not considered. Furthermore, as adjustment of the treatment effect estimate used for the decision rule may be omitted when also optimizing over the threshold value for the decision rule, we did not consider program set-ups where different adjustment parameters used for the decision rule and the calculation of the expected number of events are optimized (in our notation \( S\left({\hat{\theta}}_2^{\lambda_1},{\hat{\theta}}_2^{\lambda_2}\right) \) and \( S\left({\hat{\theta}}_2^{{\alpha_{CI}}_1},{\hat{\theta}}_2^{{\alpha_{CI}}_2}\right) \)).


Based on our results, we highly recommend using (multiplicatively) adjusted phase II treatment effect estimates for calculation of the phase III number of events in a phase II/III drug development program with go/no-go decision rule (compare Chuang-Stein & Kirby [14], Kirby et al. [15] and De Martini [4, 25]). However, as our results also show that the optimal design parameters of each method depend on the cost and benefit parameters as well as on the applied prior distribution, no general rule exists. In contrast, the design parameters should be determined by applying our proposed optimization procedure for specific values of the parameters in the respective drug development program. Therefore, we provide an user friendly R Shiny App (bias) and an R package (drugdevelopR including the R function optimal_bias) open-source (both assessable via [1]).

Availability of data and materials

The datasets used can be generated with the help of the R package drugdevelopR and the code containing the respective function calls is provided in the additional files (see file Code.R).


α CI :

Adjustment parameter for additive and multiplicative adjustment method, respectively

bs :

Benefit scenario

CI :

Confidence interval

d 2 ,d 3 ,d :

Total number of events for phase II, III and the program, respectively

HR :

True assumed hazard ratio

κ :

Threshold value for the go/no-go decision rule, κ =  −  log (HRgo)

s 1 ,s 2 :

Estimate used for go/no-go decision and calculation of number of events, respectively

θ :

True assumed treatment effect, θ =  −  log (HR)


  1. Erdmann, S. drugdevelopR: bias. Accessed 02 Jul 2020.

  2. DiMasi JA, Hansen RW, Grabowski HC, Lasagna L. Research and development costs for new drugs by therapeutic category. Pharmaco Economics. 1995;7:152–69.

    Article  CAS  Google Scholar 

  3. DiMasi JA, Feldman L, Seckler A, Wilson A. Trends in risks associated with new drug development: success rates for investigational drugs. Clinical Pharmacology & Therapeutics. 2010;87:272–7.

    Article  CAS  Google Scholar 

  4. De Martini D. Empowering phase II clinical trials to reduce phase III failures. Pharm Stat. 2020;19:178–86.

    Article  Google Scholar 

  5. Antonijevic Z. Optimization of Pharmaceutical R&D Programs and portfolios: design and investment strategy. Heidelberg: Springer; 2015.

    Book  Google Scholar 

  6. Hughes MD, Pocock SJ. Stopping rules and estimation problems in clinical trials. Stat Med. 1988;7:1231–42.

    Article  CAS  Google Scholar 

  7. Fan X, DeMets DL, Lan KG. Conditional bias of point estimates following a group sequential test. J Biopharm Stat. 2004;14:505–30.

    Article  Google Scholar 

  8. Zhang JJ, Blumenthal G, He K, Tang S, Cortazar P, Sridhara R. Overestimation of the effect size in group sequential trials. Clin Cancer Res. 2012;18:18,4872–6.

    Google Scholar 

  9. Ellenberg SS, DeMets DL, Fleming TR. Bias and trials stopped early for benefit. Jama. 2010;304:156–9.

    Article  Google Scholar 

  10. Nardini C. Monitoring in clinical trials: benefit or bias? Theoretical Medicine and Bioethics. 2013;34:259–74.

    Article  Google Scholar 

  11. US Food and Drug Administration. 22 case studies where phase 2 and phase 3 trials had divergent results. 2017. Available at Accessed 02 Jul 2020.

  12. Gan HK, You B, Pond GR, Chen EX. Assumptions of expected benefits in randomized phase III trials evaluating systemic treatments for cancer. J Natl Cancer Inst. 2012;104:590–8.

    Article  Google Scholar 

  13. Arrowsmith J. Trial watch: phase II failures: 2008–2010. Nat Rev Drug Discov. 2011;10:328–9.

    Article  CAS  Google Scholar 

  14. Chuang-Stein C, Kirby S. The shrinking or disappearing observed treatment effect. Pharm Stat. 2014;13:277–80.

    Article  Google Scholar 

  15. Kirby S, Burke J, Chuang-Stein C, Sin C. Discounting phase 2 results when planning phase 3 clinical trials. Pharm Stat. 2012;11:373–85.

    Article  CAS  Google Scholar 

  16. O'Hagan A, Stevens JW, Montmartin J. Bayesian cost-effectiveness analysis from clinical trial data. Stat Med. 2001;20:733–753.2005.

    Article  CAS  Google Scholar 

  17. O'Hagan A, Stevens JW, Campbell MJ. Assurance in clinical trial design. Pharm Stat. 2005;4:187–201.

    Article  Google Scholar 

  18. Spiegelhalter DJ, Freedman LS, Blackburn PR. Monitoring clinical trials: conditional or predictive power? Control Clin Trials. 1986;7:8–17.

    Article  CAS  Google Scholar 

  19. Spiegelhalter DJ, Freedman LS. A predictive approach to selecting the size of a clinical trial, based on subjective clinical opinion. Statistics Med. 1986;5:1–13.

    Article  CAS  Google Scholar 

  20. Chuang-Stein C. Sample size and the probability of a successful trial. Pharm Stat J Appl Stat Pharm Ind. 2006;5:305–9.

    Google Scholar 

  21. Chuang-Stein C, Yang R. A revisit of sample size decisions in confirmatory trials. Statistics in Biopharmaceutical Research. 2010;2:239–48.

    Article  Google Scholar 

  22. Gasparini M, Di Scala L, Bretz F, Racine-Poon A. Some uses of predictive probability of success in clinical drug development. Epidemiology, biostatistics and. Public Health. 2013;10:1.

    Google Scholar 

  23. Saint-Hilary G, Barboux V, Pannaux M, Gasparini M, Robert V, Mastrantonio G. Predictive probability of success using surrogate endpoints. Stat Med. 2019;38:1753–74.

    Article  PubMed  Google Scholar 

  24. Wang SJ, Hung HM, O'Neill RT. Adapting the sample size planning of a phase III trial based on phase II data. Pharm Stat. 2006;5:85–97.

    Article  CAS  Google Scholar 

  25. De Martini D. Adapting by calibration the sample size of a phase III trial on the basis of phase II data. Pharm Stat. 2011;10:89–95.

    Article  Google Scholar 

  26. Götte H, Schüler A, Kirchner M, Kieser M. Sample size planning for phase II trials based on success probabilities for phase III. Pharm Stat. 2015;14:515–24.

    Article  Google Scholar 

  27. Kirchner M, Kieser M, Götte H, Schüler A. Utility-based optimization of phase II/III programs. Stat Med. 2016;35:305–16.

    Article  Google Scholar 

  28. Schoenfeld D. The asymptotic properties of nonparametric tests for comparing survival distributions. Biometrika. 1981;68:316–9.

    Article  Google Scholar 

  29. IQWiG. Allgemeine Methoden. Version 5.0, 10.07.2016, Technical Report.

  30. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2019. Available at Accessed 02 Jul 2020.

  31. Steensma DP, Kantarjian HM. Impact of cancer research bureaucracy on innovation, costs, and patient care. J Clin Oncol. 2014;32:376–8.

    Article  Google Scholar 

  32. Ding M, Rosner GL, Müller P. Bayesian optimal design for phase II screening trials. Biometrics. 2008;3:886–94.

    Article  Google Scholar 

  33. Erdmann, S. drugdevelopR: prior. Accessed 02 Jul 2020.

  34. Dallow N, Best N, Montague TH. Better decision making in drug development through adoption of formal prior elicitation. Pharm Stat. 2018;17:301–16.

    Article  Google Scholar 

  35. O'Hagan A, Buck CE, Daneshkhah A, Eiser JR, Garthwaite PH, Jenkinson DJ, et al. Uncertain judgements: eliciting experts' probabilities. Chichester: Wiley; 2006.

    Book  Google Scholar 

  36. Devilee JLA, Knol AB. Software to support expert elicitation: an exploratory study of existing software packages; 2011.

    Google Scholar 

  37. DiMasi JA, Grabowski HG, Vernon J. R&D costs and returns by therapeutic category. Drug Information J. 2004;38:211–23.

    Article  Google Scholar 

  38. Adams CP, Brantner VV. Spending on new drug development. Health Econ. 2010;19:130–41.

    Article  Google Scholar 

  39. Morgan S, Grootendorst P, Lexchin J, Cunningham C, Greyson D. The cost of drug development: a systematic review. Health Policy. 2011;100:4–17.

    Article  Google Scholar 

  40. Preussler S, Kieser M, Kirchner M. Optimal sample size allocation and go/no-go decision rules for phase II/III programs where several phase III trials are performed. Biom J. 2019;61(2):357–78.

    Article  Google Scholar 

  41. Preussler S, Kirchner M, Götte H, Kieser M. Optimal designs for multi-arm phase II/III drug development programs. Statistics in Biopharmaceutical Res. 2019.

  42. Patel NR, Ankolekar S. A Bayesian approach for incorporating economic factors in sample size design for clinical trials of individual drugs and portfolios of drugs. Stat Med. 2007;26:4976–88.

    Article  Google Scholar 

  43. Götte H, Kirchner M, Sailer MO, Kieser M. Simulation-based adjustment after exploratory biomarker subgroup selection in phase II. Stat Med. 2017;36:2378–90.

    Article  Google Scholar 

Download references


We thank the reviewer for their valuable comments, which improved the manuscript remarkably.


We would like to thank the Deutsche Forschungsgemeinschaft (DFG) for supporting this research by the research grant KI 708/2–1 (financing a statistical position of the Institute of Medical Biometry of the University Hospital Heidelberg). Furthermore, we would like acknowledge the financial support by the DFG within the funding program “Open Access Publishing”, by the Baden-Württemberg Ministry of Science, Research and the Arts and by Ruprecht-Karls-Universität Heidelberg (payment of the publishing costs). The role of the funding bodies was of financial manner only, therefore they took no direct part in the analysis or in writing the manuscript. Open access funding provided by Projekt DEAL.

Author information

Authors and Affiliations



SE, MKir, HG, MKie developed the proposed method. SE wrote the manuscript and associated software code. All authors read and approved the final manuscript.

Authors’ information

The first author of this manuscript (SE) changed her name from Stella Preussler to Stella Erdmann. Therefore, the (first) author of [1, 33, 40], [41] and this manuscript is the same.

Corresponding author

Correspondence to Stella Erdmann.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests. HG has no conflict of interest with the subject matter of this manuscript while being an employee of Merck Healthcare KGaA.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

In the Additional file 1, an overview of formulas in program set-ups \( S\left({\hat{\theta}}_2^{s_1},,,{\hat{\theta}}_2^{s_2}\right) \),s1, s2 = λ, aCI, u (A0) and investigation of an alternative definition of program success is given (A1). Furthermore, more details and results of the application example when modelling different population structures in phase II and III (A2), when using a predefined minimal clinically relevant effect for phase III planning (A3), when using a budget constraint (A4), when skipping phase II (A5) and when using a linear function for modelling the gain (A6) are presented. The file Code.R includes the main function calls for generating the datasets and tables, using the R package drugdevelopR.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Erdmann, S., Kirchner, M., Götte, H. et al. Optimal designs for phase II/III drug development programs including methods for discounting of phase II results. BMC Med Res Methodol 20, 253 (2020).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: