Optimal designs for phase II/III drug development programs including methods for discounting of phase II results

Background Go/no-go decisions after phase II and sample size chosen for phase III are usually based on phase II results (e.g., the treatment effect estimate of phase II). Due to the decision rule (only promising phase II results lead to phase III), treatment effect estimates from phase II that initiate a phase III trial commonly overestimate the true treatment effect. Underpowered phase III trials are the consequence. Optimistic findings may then not be reproduced, leading to the failure of potentially expensive drug development programs. For some disease areas these failure rates are described to be quite high: 62.5%. Methods We integrate the ideas of multiplicative and additive adjustment of treatment effect estimates after go decisions in a utility-based framework for optimizing drug development programs. The design of a phase II/III program, i.e., the “right amount of adjustment”, the allocation of the resources to phase II and III in terms of sample size, and the rule applied to decide whether to stop or to proceed with phase III influences its success considerably. Given specific drug development program characteristics (e.g., fixed and variable per patient costs for phase II and III, probable gain in case of market launch), optimal designs with respect to the maximal expected utility can be identified by the proposed Bayesian-frequentist approach. The method will be illustrated by application to practical examples characteristic for oncological studies. Results In general, our results show that the program set-ups with adjusted treatment effect estimate used for phase III planning are superior to the “naïve” program set-ups with respect to the maximal expected utility. Therefore, we recommend considering an adjusted phase II treatment effect estimate for the phase III sample size calculation. However, there is no one-fits-all design. Conclusion Individual drug development planning for a specific program is necessary to find the optimal design. 
The optimal choice of the design parameters for a specific drug development program at hand can be found with our user-friendly R Shiny application and package (both accessible as open source via [1]).


Background
Exploratory studies are usually carried out to provide a basis for deciding whether or not to proceed with a confirmatory trial and, if necessary, to provide information for planning purposes. In drug development programs, this strong link between exploratory (e.g., phase II) and confirmatory (e.g., phase III) studies favors integrated planning. In particular, the costs of phase III studies have increased remarkably in recent years [2,3], while failure rates are quite high (approx. 45%, see [4] and the reference mentioned therein). Therefore, the availability and application of quantitative methods for decision making, which should be data-driven and objective, is desirable [5].
More than 30 years ago, Hughes and Pocock [6] pointed out that decision rules in clinical trials can lead to a bias in the point estimate of the treatment effect, so that the true underlying effect might be overestimated at the time of an early positive decision. Twenty-four years later, after various attempts to adjust for overestimation of the treatment effect in group sequential designs (e.g., [7] and references mentioned therein), Zhang et al. [8] still criticized that the cause and effect of this phenomenon are generally not well understood. To illustrate the problem, they provide a graphical explanation for the occurrence of overestimation. They argue that random variability (i.e., random highs and lows) of the treatment effect estimate is always present, but stabilizes around the true treatment effect as the trial continues to its end. However, a decision rule favors the random highs: in a phase II/III drug development program with a go/no-go decision rule, the program proceeds to phase III only when large treatment effects are observed and is stopped when small effects occur. This selective handling of random variability may lead to overestimation of the magnitude of the treatment effect after phase II.
Ellenberg et al. [9] as well as Nardini [10] emphasize that the aim of treatment effect estimation is not to decide whether or not one therapy is better than the other, but to describe the size of therapeutic effects. Thus, we are concerned with a problem of estimation, not a problem of testing. Nardini concludes that estimates arising after a decision rule "should [consequently] not be taken at face value as true estimates of the new treatment's effect". Ellenberg et al. point out that statistical methods to adjust for this "random-high bias" exist, but criticize that "they are not applied as often as they should be". Recently, the U.S. Food & Drug Administration reported 22 case studies since 1999 in which promising phase II clinical trial results were not confirmed in phase III clinical testing [11]. Such experiences are not rare: for some disease areas, the failure rate for phase III trials is reported to be as high as 62.5% [12] and about 50% for approval [13]. Chuang-Stein and Kirby [14] give cause for serious concern, as the severity of this may multiply, considering that the bigger the estimated effect from, e.g., a proof of concept trial, the greater the temptation to invest heavily and conduct multiple studies in parallel. They advise to use the concept of "assurance" for quantification of success probabilities and, moreover, to apply an adjustment for the overestimation of the treatment effect (e.g., [15]) when planning the next phase of a drug development program.
In our framework, we follow the concept of "assurance" [16,17], which was first introduced by Spiegelhalter et al. in 1986 as Bayesian predictive power (compare also "average power") [18,19]. This methodology was later used in various contexts by O'Hagan et al. [16,17] ("assurance"), Chuang-Stein [20], Chuang-Stein and Yang [21] ("average success probability") and finally by Gasparini et al. and Saint-Hilary et al. ("predictive probability of success") [22,23]. The idea is to use a prior distribution for the assumed true treatment effect for trial planning. This is in contrast to the "frequentist world", where a fixed value is assumed. The "assurance" is then the weighted (unconditional) probability of a successful trial, where the success probability for a given effect is weighted by the likelihood that the therapy achieves this effect. Because they combine Bayesian principles in the planning phase with frequentist decision-making procedures in the analysis, the above-mentioned approaches are described in the literature as "mixed Bayesian-frequentist".
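The difference between power for a fixed effect and "assurance" can be made concrete with a small numerical sketch. The code below is only an illustration under stated assumptions: it uses a single normal prior on θ = −log(HR) (not the mixture prior used later), the asymptotic variance 4/d of the log hazard ratio estimate, and hypothetical function names `power` and `assurance`; all numerical inputs are placeholders.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(theta, events, alpha=0.025):
    """Power of a one-sided test for a FIXED effect theta = -log(HR)."""
    se = (4.0 / events) ** 0.5                 # asymptotic SE of the log-HR estimate
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    return STD_NORMAL.cdf(theta / se - z_alpha)


def assurance(prior_mean, prior_sd, events, alpha=0.025, grid=2001):
    """Power averaged over a normal prior on theta (trapezoidal rule).

    The normal prior is an assumption made for this sketch only.
    """
    prior = NormalDist(prior_mean, prior_sd)
    lo, hi = prior_mean - 8.0 * prior_sd, prior_mean + 8.0 * prior_sd
    step = (hi - lo) / (grid - 1)
    total = 0.0
    for i in range(grid):
        theta = lo + i * step
        weight = 0.5 if i in (0, grid - 1) else 1.0   # trapezoid end points
        total += weight * power(theta, events, alpha) * prior.pdf(theta)
    return total * step
```

With a prior centered at an effect for which the fixed-effect power exceeds 0.5, the assurance is smaller than that power, because the prior also puts weight on smaller effects.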
Kirby et al. [15] and Wang et al. [24] attempt to reduce the impact of overestimation by discounting the phase II treatment effect estimate by applying a multiplicative or additive adjustment, respectively. However, their suggestions are not universally applicable, and are rather "rules of thumb", e.g., Kirby et al. suggest to use a retention factor of 0.9 times the assumed ratio of the phase III effect to phase II effect.
De Martini [4,25] reports that the phase II sample size should be almost as large as the ideal phase III sample size (at least 2/3 of the latter) in order to have a sufficiently good information basis for phase III planning. He criticizes that in practice this ratio is only 1/4 on average and that an increase in sponsorship gains from drug development through larger phase II studies has not yet been well investigated. Larger phase II sample sizes would reduce the level of overestimation but increase the estimated phase III sample size [26] and could retrospectively be regarded as an unnecessary high investment in case of a no-go decision. Therefore, an optimal balance is required.
In this article, we integrate the general concepts of using a multiplicative or additive adjustment method to correct for overestimation of the treatment effect in a framework of utility-based optimization of phase II/III development programs [27]. That is, we want to critically examine adjustment methods from an economic point of view. In addition to simultaneously optimizing the phase II go/no-go decision rule and the sample size, we also optimize over the adjustment parameter used for the phase II treatment effect estimation to find "the right level of adjustment" for the specific situation at hand. Our approach can build the bridge between the long existing gap of theory and practice: we provide a Bayesian-frequentist hybrid framework, in which methods proposed for addressing the problem of overestimation of the treatment effect after go decisions are included in the optimization of drug development programs.
In the second section of this paper, we will introduce the basic setting and notation, explain the adjustment methods and show how they are incorporated in our optimization framework. After introducing the utility function and explaining the optimization procedure, we present optimal designs for exemplary settings of drug development programs in Section 3. We finish with a discussion in Section 4 and a conclusion in Section 5.

Basic setting
The considered drug development program consists of one exploratory phase II and one confirmatory phase III trial. Both are randomized trials with two arms (each with 1:1 sample size allocation), performed independently, investigating the same time-to-event primary endpoint and the same population. The true treatment effect is given by the negative logarithm of the true hazard ratio ($\theta = -\log(HR)$), where the HR is the ratio of the hazard functions of the treatment and the control group. In order to reflect the uncertainty in the true treatment effect, $\theta$ can be modelled by a prior distribution $f(\theta)$. In phase II, the total number of events is denoted by $d_2$ and the maximum likelihood estimate of $\theta$ is given by $\hat{\theta}_2$. We assume that the estimator $\hat{\theta}_2$ is asymptotically normally distributed with $\hat{\theta}_2 \mid \theta \sim N(\theta, 4/d_2)$. (Note that the notation used will not differentiate between the treatment effect estimator, i.e., the rule applied to estimate the quantity of interest, which is a random variable, and the treatment effect estimate, i.e., a particular realization and thus a fixed value; it will be clear from the context which quantity is meant.) Furthermore, we require that only phase II trials with promising results lead to a phase III trial. This is quantified by a go/no-go criterion with a go decision in case of $\hat{\theta}_2 \geq \kappa$, where $\kappa$ is a predefined threshold value. In case of a go decision, the number of events for the phase III trial is calculated based on the observed treatment effect of phase II. If the confirmatory analysis in phase III reveals a significant result, program success is declared (compare Fig. 1).
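Due to the decision rule after phase II, the treatment effect estimate of phase II is biased. The bias is positive with $\kappa > 0$, as probability mass is shifted towards higher values:
$$E[\hat{\theta}_2 \mid \hat{\theta}_2 \geq \kappa] = \int \int \hat{\theta}_2 \, 1_{\{\hat{\theta}_2 \geq \kappa\}} \, \frac{f(\hat{\theta}_2 \mid \theta)}{P(\hat{\theta}_2 \geq \kappa \mid \theta)} \, d\hat{\theta}_2 \, f(\theta) \, d\theta \; > \; \int \int \hat{\theta}_2 \, f(\hat{\theta}_2 \mid \theta) \, d\hat{\theta}_2 \, f(\theta) \, d\theta = E[\hat{\theta}_2],$$
where here and in the following $1_A$ denotes the indicator function of event $A$ and the density of the distribution of the respective argument is indicated by $f(.)$. The inequality holds as $1 / P(\hat{\theta}_2 \geq \kappa \mid \theta) > 1$ and $\int 1_{\{\hat{\theta}_2 \geq \kappa\}} \, f(\hat{\theta}_2 \mid \theta) / P(\hat{\theta}_2 \geq \kappa \mid \theta) \, d\hat{\theta}_2 = 1$ and, therefore, the probability mass assigned to values less than $\kappa$ in the unconditional expectation $E[\hat{\theta}_2]$ is distributed between values greater than $\kappa$ in $E[\hat{\theta}_2 \mid \hat{\theta}_2 \geq \kappa]$.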
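The size of this selection bias can be illustrated for a fixed true effect. The following sketch assumes the normal approximation for the phase II estimate with variance 4/d2 and uses the standard formula for the mean of a normal distribution truncated from below; the function name and the example values are hypothetical and not taken from the paper.

```python
from math import log
from statistics import NormalDist

STD_NORMAL = NormalDist()


def conditional_mean_given_go(theta, d2, kappa):
    """E[theta_hat_2 | theta_hat_2 >= kappa, theta] for theta_hat_2 ~ N(theta, 4/d2)."""
    se = (4.0 / d2) ** 0.5
    a = (kappa - theta) / se
    # Mean of a normal distribution truncated from below at kappa
    # (theta plus se times the inverse Mills ratio).
    return theta + se * STD_NORMAL.pdf(a) / (1.0 - STD_NORMAL.cdf(a))


# Illustrative values only: true HR 0.80, go if the observed HR is at most 0.85.
theta_true = -log(0.80)
kappa = -log(0.85)
bias_small_trial = conditional_mean_given_go(theta_true, 100, kappa) - theta_true
bias_large_trial = conditional_mean_given_go(theta_true, 400, kappa) - theta_true
```

In this setting the bias is positive for both trial sizes and shrinks as the number of phase II events grows, which is in line with the trade-off between phase II size and overestimation discussed in the Background.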
Therefore, in the following, multiplicative and additive adjustment methods for the treatment effect estimate obtained in phase II will be investigated. Afterwards, dependent on the respective adjustment method, launch criteria and approaches to calculate the number of events for phase III will be presented.

Additive and multiplicative adjustment methods
In this section, we introduce two methods (an additive and a multiplicative adjustment method) to adjust for the overestimation of the phase II treatment effect estimate. It should be mentioned that the terms "multiplicative" and "additive" relate to the specific type of scale and endpoint considered here.
Wang et al. [24] advise applying an additive adjustment to the phase II treatment effect estimate if it is used for planning the sample size of phase III. They discuss using the lower limit of the one and two standard deviation confidence interval (CI) from the phase II trial (i.e., the lower limit of the CI for $\hat{\theta}_2$, corresponding to one or two standard deviations below the point estimate), respectively. We denote the significance level of the lower bound for the one-sided CI related to the phase II treatment effect estimate as $\alpha_{CI} \in [0.025, 0.5]$ and define the additively adjusted treatment effect estimate by
$$\hat{\theta}_2^{\alpha_{CI}} = \hat{\theta}_2 - z_{1-\alpha_{CI}} \cdot \sqrt{4/d_2}, \qquad z_{1-\alpha_{CI}} = \Phi^{-1}(1-\alpha_{CI}),$$
where $\Phi(.)$ denotes the distribution function of the standard normal distribution. Note that our version of the additively adjusted treatment effect estimate is a generalization of that of Wang et al., as they use the lower limit of the one and two standard deviation two-sided CI (i.e., in our notation $\alpha_{CI} = 0.32/2$ and $\alpha_{CI} = 0.05/2$) and we allow $\alpha_{CI}$ to range from 0.025 to 0.5. For $\alpha_{CI} = 0.5$, the additively adjusted treatment effect estimate is not discounted, as $\hat{\theta}_2 - z_{1-0.5} \cdot \sqrt{4/d_2} = \hat{\theta}_2$.
Kirby et al. [15] propose a multiplicative adjustment approach. They multiply the observed treatment effect estimate by a factor $\lambda$, which can be understood as a retention factor, that is, the fraction of the treatment effect retained. Integrated in our setting, we define $\hat{\theta}_2^{\lambda} = \lambda \cdot \hat{\theta}_2$, where the multiplicative adjustment parameter $\lambda \in [0.2, 1]$ can be viewed as the result of discounting the observed treatment effect of phase II by $1 - \lambda$. Note that for $\lambda = 1$ the multiplicatively adjusted treatment effect estimate is not discounted.
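As a minimal sketch, the two adjustments can be written as small functions on the scale θ = −log(HR). The function names are hypothetical, and the additive version assumes the standard error sqrt(4/d2) of the phase II estimate used throughout this section.

```python
from math import log, sqrt
from statistics import NormalDist


def additive_adjusted(theta_hat, d2, alpha_ci):
    """Lower one-sided confidence limit of theta_hat at level alpha_ci."""
    z = NormalDist().inv_cdf(1.0 - alpha_ci)
    return theta_hat - z * sqrt(4.0 / d2)


def multiplicative_adjusted(theta_hat, lam):
    """Retain the fraction lam of the observed effect."""
    return lam * theta_hat


# Illustrative values only: observed HR of 0.7 from 100 phase II events.
theta_hat = -log(0.7)
additive_example = additive_adjusted(theta_hat, 100, 0.25)
multiplicative_example = multiplicative_adjusted(theta_hat, 0.8)
```

Both functions return the unadjusted estimate at the boundary of their parameter range (α_CI = 0.5 and λ = 1), so the naïve approach is contained as a special case.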
Go/no-go criteria, calculation of expected number of events for phase III and related program characteristics
When designing the phase II/III program, the observed treatment effect estimate of phase II plays a key role in two ways: 1. when making the go/no-go decision (selection $s_1$); 2. when calculating the phase III sample size (selection $s_2$; compare Fig. 1). At both instances, one has to decide whether to use an adjusted or an unadjusted treatment effect estimate. To ease notation, the naïve (unadjusted) treatment effect estimate of phase II is denoted by $\hat{\theta}_2^{u} = \hat{\theta}_2$.
Fig. 1 Graphical illustration of basic phase II/III drug development program. The drug development program consists of one exploratory phase II trial, which is, in case of a go decision (i.e., treatment effect estimate of phase II $\hat{\theta}_2$ exceeds predefined threshold value $\kappa = -\log(HR_{go})$), followed by one confirmatory phase III trial, where the sample size planning is based on $\hat{\theta}_2$. The program is considered successful if phase III has a positive (significant) result (i.e., normalized log rank test statistic of phase III $T_3$ is above the $1-\alpha$ quantile of the standard normal distribution $z_{1-\alpha}$)
1.: If the treatment effect estimate $\hat{\theta}_2^{s_1}$, where $s_1 = \lambda$, $\alpha_{CI}$ or $u$ (i.e., the multiplicatively adjusted, additively adjusted or unadjusted treatment effect estimate is selected for the decision rule), exceeds a predefined threshold value $\kappa$, it is decided to go to phase III and otherwise to stop the program. Hence, the expected probability to go to phase III can be determined by
$$p_{go} = \int \int 1_{\{\hat{\theta}_2^{s_1} \geq \kappa\}} \, f(\hat{\theta}_2 \mid \theta) \, f(\theta) \, d\hat{\theta}_2 \, d\theta$$
(compare Table A0 in the Additional file 1).
2.: In case of a go decision, the number of events for phase III is calculated based on the treatment effect estimate of phase II $\hat{\theta}_2^{s_2}$, $s_2 = \lambda$, $\alpha_{CI}$ or $u$, a desired power $1 - \beta$, and a one-sided significance level $\alpha$. For a balanced allocation ratio, it can be calculated by
$$D_3 = \frac{4 \, (z_{1-\alpha} + z_{1-\beta})^2}{(\hat{\theta}_2^{s_2})^2}.$$
Following De Martini [4,25], the ratio of the number of events in phase II and III will also be calculated.
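To show how strongly the planning estimate drives the size of phase III, the sketch below computes the required number of events with the standard Schoenfeld approximation for a 1:1 log-rank comparison. Using this approximation here is an assumption, chosen because it is consistent with the variance 4/d used for the phase II estimate; the function name and example values are hypothetical.

```python
from math import log
from statistics import NormalDist


def phase3_events(theta_planning, alpha=0.025, beta=0.1):
    """Events needed in phase III for a one-sided level alpha and power 1 - beta."""
    if theta_planning <= 0:
        raise ValueError("planning effect must be positive (HR below 1)")
    z = NormalDist()
    return 4.0 * (z.inv_cdf(1.0 - alpha) + z.inv_cdf(1.0 - beta)) ** 2 / theta_planning ** 2


# Illustrative values only: observed phase II HR of 0.7,
# planned naively and with a retention factor of 0.8.
theta_hat = -log(0.7)
events_naive = phase3_events(theta_hat)
events_discounted = phase3_events(0.8 * theta_hat)
```

Because the number of events depends on the inverse square of the planning effect, a retention factor λ inflates it by 1/λ², so λ = 0.8 already requires about 56% more events than the naïve calculation.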
The program is considered to be successful if the one-sided null hypothesis $H_0: \theta \leq 0$ is rejected in favour of $H_1: \theta > 0$ at a one-sided significance level $\alpha$. This is the case if $T_3 > z_{1-\alpha}$, where $T_3$ is the normalized log-rank test statistic in phase III, which is assumed to be asymptotically normally distributed, i.e., $T_3 \mid \hat{\theta}_2, \theta \sim N(\theta / \sqrt{4/D_3}, 1)$. Note that significance testing is performed on phase III data only. Therefore, the expected probability of a successful program $PsP(\hat{\theta}_2^{s_1}, \hat{\theta}_2^{s_2})$ (with decision rule $\hat{\theta}_2^{s_1} \geq \kappa$, and $\hat{\theta}_2^{s_2}$ used to calculate the number of events for phase III), which is defined as the expected probability of the joint event of going to phase III and achieving a significant result [25,27], can be calculated by
$$PsP(\hat{\theta}_2^{s_1}, \hat{\theta}_2^{s_2}) = \int \int \int 1_{\{\hat{\theta}_2^{s_1} \geq \kappa\}} \, 1_{\{t_3 > z_{1-\alpha}\}} \, f(t_3 \mid \hat{\theta}_2, \theta) \, f(\hat{\theta}_2 \mid \theta) \, f(\theta) \, dt_3 \, d\hat{\theta}_2 \, d\theta,$$
where $t_3$ is a realization of $T_3 \mid \hat{\theta}_2, \theta$ (compare Table A0). One reviewer pointed out that this definition of a successful program records a false positive result (i.e., $T_3 > z_{1-\alpha}$ under $H_0$) as program success. We discuss this aspect in detail in Section A1 of Additional file 1. In reality, regulatory approval, and with it the monetary gain that is the core driver of our utility function, is achieved when a significant result is observed in phase III, acknowledging that there is a probability of $\alpha$ that it is a false positive decision. Thus, we keep the commonly used term "success" and PsP, which should be regarded as the probability of market access and not as the probability of a correct decision.
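For a fixed true effect, the probability to go and the probability of a successful program can be approximated by one-dimensional numerical integration over the phase II estimate. The sketch below makes several simplifying assumptions that are not the paper's implementation: a point prior on θ, the unadjusted estimate in the decision rule, a retention factor `lam` for phase III planning, the Schoenfeld approximation for the phase III events, and a simple midpoint rule. All names and example values are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

Z = NormalDist()


def go_and_success_probability(theta, d2, kappa, lam=1.0,
                               alpha=0.025, beta=0.1, steps=4000):
    """Return (p_go, PsP) for a fixed true effect theta and threshold kappa > 0."""
    if kappa <= 0:
        raise ValueError("kappa must be positive in this sketch")
    se2 = sqrt(4.0 / d2)
    phase2 = NormalDist(theta, se2)
    p_go = 1.0 - phase2.cdf(kappa)
    z_a, z_b = Z.inv_cdf(1.0 - alpha), Z.inv_cdf(1.0 - beta)
    upper = max(kappa, theta) + 8.0 * se2      # upper tail beyond this is negligible
    h = (upper - kappa) / steps
    psp = 0.0
    for i in range(steps):
        est = kappa + (i + 0.5) * h            # midpoint of the i-th sub-interval
        d3 = 4.0 * (z_a + z_b) ** 2 / (lam * est) ** 2
        power = 1.0 - Z.cdf(z_a - theta / sqrt(4.0 / d3))
        psp += power * phase2.pdf(est) * h
    return p_go, psp


# Illustrative values only.
p_go, psp_naive = go_and_success_probability(-log(0.75), 100, -log(0.85))
_, psp_discounted = go_and_success_probability(-log(0.75), 100, -log(0.85), lam=0.8)
```

Since success requires a go decision, PsP can never exceed the go probability; discounting the planning estimate enlarges phase III and therefore raises PsP for a positive true effect, at the price of higher costs.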

Considered program set-ups
We investigate the impact of using adjusted treatment effect estimates (i.e., $\hat{\theta}_2^{\lambda}$ or $\hat{\theta}_2^{\alpha_{CI}}$) for the go/no-go decision and/or for the calculation of the number of events for phase III on the drug development program characteristics and compare the results to those where the unadjusted (naïve) treatment effect estimate $\hat{\theta}_2^{u}$ was used. Therefore, we investigate different program set-ups $S(\hat{\theta}_2^{s_1}, \hat{\theta}_2^{s_2})$, which are defined by the selection of the treatment effect estimate used for the decision rule (selection $s_1$) and, in case of a go decision, by the choice of the treatment effect estimate used for the calculation of the number of events for phase III (selection $s_2$). Table 1 gives an overview of the considered program set-ups. We compare the "unadjusted" set-up $S(\hat{\theta}_2^{u}, \hat{\theta}_2^{u})$, where $\hat{\theta}_2^{u} = \hat{\theta}_2$ (i.e., $s_1, s_2 = u$), with two "multiplicatively adjusted" set-ups $S(\hat{\theta}_2^{s_1}, \hat{\theta}_2^{\lambda})$ ($s_1 \in \{u, \lambda\}$, $s_2 = \lambda$), and two "additively adjusted" set-ups $S(\hat{\theta}_2^{s_1}, \hat{\theta}_2^{\alpha_{CI}})$ ($s_1 \in \{u, \alpha_{CI}\}$, $s_2 = \alpha_{CI}$). Note that if $s_1 \neq u$, we define $s_2 = s_1$, which means that if an adjustment parameter is used for the decision rule, the same adjustment parameter is used for the calculation of the expected number of events for phase III (for reasons which will be given later).

Utility function
The aim is to optimize a phase II/III drug development program in terms of the adjustment parameters $\lambda$ or $\alpha_{CI}$, the number of events in phase II $d_2$, and the go/no-go decision threshold value $\kappa$. Therefore, we set up a utility function, which utilizes the difference between program costs and potential gains after successful market launch (compare Fig. 2 for a graphical illustration). For the costs, fixed ($c_{02}$, $c_{03}$) and variable per-patient ($c_2$, $c_3$) costs are included for the phase II and III trial, respectively. By dividing the number of events by the event rate $\xi_i$, the total number of patients can be calculated for the respective phase $i = 2, 3$. Obviously, the costs of the phase III trial apply only in case of a go decision. In case of program success, a benefit $b$ is obtained, and we assume that the level of benefit depends on the observed treatment effect in the phase III trial, as suggested by a report of the German Institute for Quality and Efficiency in Health Care [29]. As proposed there, three effect size categories (small, medium and large) are used, whereby each category is defined by a threshold value (1, 0.95, 0.85) for the upper boundary of the 95% confidence interval for the HR (for details on the derivation of these threshold values, the interested reader is referred to "Anhang A" of [29]). The corresponding amount of benefit is denoted by $b_1$, $b_2$ and $b_3$, respectively. Based on this, costs $c(d_2, \kappa, s_2)$ and gain $g(d_2, \kappa, s_2)$ for a phase II/III program with program set-up $S(\hat{\theta}_2^{s_1}, \hat{\theta}_2^{s_2})$ are given by
$$c(d_2, \kappa, s_2) = c_{02} + c_2 \cdot \frac{d_2}{\xi_2} + 1_{\{\hat{\theta}_2^{s_1} \geq \kappa\}} \cdot \left( c_{03} + c_3 \cdot \frac{D_3}{\xi_3} \right),$$
$$g(d_2, \kappa, s_2) = 1_{\{\hat{\theta}_2^{s_1} \geq \kappa\}} \cdot \sum_{j=1}^{3} b_j \cdot 1_{\{T_3 \in I_j\}},$$
where $I_1 = [z_{1-\alpha}, \, z_{1-\alpha} - \log(0.95)/\sqrt{4/D_3})$, $I_2 = [z_{1-\alpha} - \log(0.95)/\sqrt{4/D_3}, \, z_{1-\alpha} - \log(0.85)/\sqrt{4/D_3})$ and $I_3 = [z_{1-\alpha} - \log(0.85)/\sqrt{4/D_3}, \, \infty)$ are transformations of the effect size intervals to intervals on the test statistic scale of $T_3$. Thus, the costs depend on the observed treatment effect in phase II and the gain depends on the observed treatment effect in phase II and III.
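The mapping from the phase III test statistic to a benefit category can be sketched as follows. The translation of the thresholds for the upper 95% confidence limit of the HR into cut-offs for T3 is derived here under the normal approximation with standard error sqrt(4/D3) for the log hazard ratio and should be read as an assumption of this sketch; the function name and the default benefit values (benefit scenario 4, in units of 10^5 US dollars) are used for illustration only.

```python
from math import log, sqrt
from statistics import NormalDist


def benefit(t3, d3, benefits=(1000.0, 3000.0, 5000.0),
            hr_thresholds=(1.0, 0.95, 0.85), alpha=0.025):
    """Benefit for an observed phase III statistic t3 based on d3 events.

    The upper confidence limit of the HR is below a threshold h exactly when
    t3 > z_{1-alpha} - log(h) / sqrt(4 / d3) under the normal approximation.
    """
    z = NormalDist().inv_cdf(1.0 - alpha)
    se = sqrt(4.0 / d3)
    gain = 0.0                                  # no benefit without significance
    for b, h in zip(benefits, hr_thresholds):
        if t3 > z - log(h) / se:
            gain = b                            # highest category reached so far
    return gain
```

With 400 events, for example, a statistic just above the critical value falls into the small category, whereas a larger statistic is needed to push the whole confidence interval below 0.95 or 0.85.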
Fig. 2 Graphical illustration of (adjusted) utility-based optimization. The treatment effect estimate of phase II may be adjusted for the decision rule (selection $s_1 \in \{u, s_2\}$) and/or for the calculation of the number of events for phase III (selection $s_2 \in \{\lambda, \alpha_{CI}, u\}$). The utility (including the costs and the gain) is optimized over the number of events for phase II $d_2$, the threshold value for the decision rule $HR_{go}$, and the adjustment parameter $s_2 = \lambda$ or $\alpha_{CI}$ (see Section 2.5 for details), $\xi_i$ event rate in phase $i = 2, 3$, $b_j = b_j(T_3)$ benefit categories $j = 1, 2, 3$
The utility is defined as the difference between gain and costs and expressed as a function of $d_2$ and $\kappa$, over which it is simultaneously optimized. In the adjusted program set-ups, the optimization is also over $\lambda$ in the multiplicatively, and over $\alpha_{CI}$ in the additively adjusted set-ups, respectively. Thus, we define the utility for program set-up $S(\hat{\theta}_2^{s_1}, \hat{\theta}_2^{s_2})$ as
$$u(d_2, \kappa, s_2) = g(d_2, \kappa, s_2) - c(d_2, \kappa, s_2),$$
where for the unadjusted program set-up $S(\hat{\theta}_2^{u}, \hat{\theta}_2^{u})$ no adjustment parameter is involved and the utility depends on $d_2$ and $\kappa$ only. To incorporate the development risk in terms of success probabilities, we consider the expected utility with respect to $\theta$, $\hat{\theta}_2$ and $T_3$,
$$E[u(d_2, \kappa, s_2)] = E[g(d_2, \kappa, s_2)] - E[c(d_2, \kappa, s_2)],$$
where the expected costs and gain with respect to $\theta$, $\hat{\theta}_2$ and $T_3$ are given by
$$E[c(d_2, \kappa, s_2)] = \int \int c(d_2, \kappa, s_2) \, f(\hat{\theta}_2 \mid \theta) \, f(\theta) \, d\hat{\theta}_2 \, d\theta,$$
$$E[g(d_2, \kappa, s_2)] = \int \int \int g(d_2, \kappa, s_2) \, f(t_3 \mid \hat{\theta}_2, \theta) \, f(\hat{\theta}_2 \mid \theta) \, f(\theta) \, dt_3 \, d\hat{\theta}_2 \, d\theta.$$
The aim is to find a design $\delta = (d_2, \kappa, s_2)$ that maximizes the expected utility $E[u(d_2, \kappa, s_2)]$ for programs with program set-up $S(\hat{\theta}_2^{s_1}, \hat{\theta}_2^{s_2})$. The optimization is carried out over $d_2$, $\kappa$, and $\lambda$ in the multiplicatively or $\alpha_{CI}$ in the additively adjusted set-ups, respectively. The optimal design $\delta^*$ for each program set-up $S(\hat{\theta}_2^{s_1}, \hat{\theta}_2^{s_2})$ is defined to be the design for which the expected utility is maximized, that is,
$$\delta^* = \arg\max_{\delta \in D} E[u(d_2, \kappa, s_2)].$$
The optimization is solved by using numerical integration procedures written in the programming language R [30]. In order to facilitate the application of the approach, a user-friendly R Shiny App (bias) and an R package (drugdevelopR including the R function optimal_bias) are provided open-source (both accessible via [1]).
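The optimization can be mimicked in a few lines for a strongly simplified situation. The following Python sketch is not the authors' R implementation (drugdevelopR) and makes several assumptions: a discrete prior given as (weight, θ) pairs instead of the continuous mixture prior, the unadjusted estimate in the decision rule with multiplicative adjustment for planning, the Schoenfeld approximation for the phase III events, benefit cut-offs derived from the normal approximation, a midpoint rule for the integral over the phase II estimate, and a plain grid search. All function names, grids and prior values are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

Z = NormalDist()
ALPHA, BETA = 0.025, 0.1
Z_A, Z_B = Z.inv_cdf(1 - ALPHA), Z.inv_cdf(1 - BETA)


def expected_utility(d2, kappa, lam, prior, c2=0.75, c3=1.0, c02=100.0, c03=150.0,
                     xi2=0.7, xi3=0.7, benefits=(1000.0, 3000.0, 5000.0), steps=400):
    """Expected gain minus expected costs (in 10^5 US dollars); requires kappa > 0."""
    se2 = sqrt(4.0 / d2)
    cost = c02 + c2 * d2 / xi2                  # phase II costs always apply
    gain = 0.0
    for weight, theta in prior:
        phase2 = NormalDist(theta, se2)
        upper = max(kappa, theta) + 8.0 * se2
        h = (upper - kappa) / steps
        for i in range(steps):                  # integrate over the go region only
            est = kappa + (i + 0.5) * h
            mass = weight * phase2.pdf(est) * h
            d3 = 4.0 * (Z_A + Z_B) ** 2 / (lam * est) ** 2
            se3 = sqrt(4.0 / d3)
            cost += mass * (c03 + c3 * d3 / xi3)
            # P(T3 above each category cut-off), T3 ~ N(theta / se3, 1)
            tails = [1.0 - Z.cdf(Z_A - log(hr) / se3 - theta / se3)
                     for hr in (1.0, 0.95, 0.85)]
            probs = (tails[0] - tails[1], tails[1] - tails[2], tails[2])
            gain += mass * sum(b * p for b, p in zip(benefits, probs))
    return gain - cost


def optimise(prior, d2_grid, hr_go_grid, lam_grid):
    """Grid search; returns (max expected utility, d2, HR_go, lambda)."""
    best = None
    for d2 in d2_grid:
        for hr_go in hr_go_grid:
            for lam in lam_grid:
                value = expected_utility(d2, -log(hr_go), lam, prior)
                if best is None or value > best[0]:
                    best = (value, d2, hr_go, lam)
    return best


# Hypothetical two-point prior: HR 0.7 with weight 0.6, HR 0.9 with weight 0.4.
example_prior = [(0.6, -log(0.7)), (0.4, -log(0.9))]
example_best = optimise(example_prior, [100, 200], [0.8, 0.9], [0.8, 1.0])
```

Because λ = 1 is part of the grid, the optimum over the adjusted designs can never be worse than the best naïve design on the same grid; whether it is strictly better depends on the costs, benefits and prior.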

Illustration of the framework by application to oncology trial example and practical extensions
In this paper, the parameters in the oncology trial example are chosen as in Kirchner et al. [27] to allow comparison of results. It should be noted that the example is primarily given to illustrate the framework and that the chosen parameters should not be taken at face value. We tried to elicit design parameters that mimic an oncology drug development program as realistically as possible, using information from the relevant literature and consultation with experts from the pharmaceutical industry in the field of oncology. However, these parameters must be chosen carefully and specifically for each drug development scenario at hand.
The event rates for phase II and III are set to ξ i = 0.7 for i = 2, 3. Therefore, the total sample size can be calculated by d i /0.7, i = 2, 3. In practice, estimates on the event rates could be obtained by taking recruitment rates and duration as well as drop-out rates and treatment group specific hazards into account. However, using those parameters often leads to event rates around ξ i = 0.7 as it is a compromise between data maturity and avoidance of long follow-up times if drop-out rates are higher than expected. If ξ i < 0.5 the median event time might not be observed while if ξ i is too high, the planned number of events might not be reached at all with substantial drop-out rates.
For phase III oncology trials, per-patient costs between 75,000 and 125,000 USD are reported [31]. Therefore, per-patient costs for phase III of $10^5$ USD are considered and $c_3$ is set to 1 (in $10^5$ USD). Furthermore, the per-patient costs for phase II are set to $c_2 = 0.75$ (in $10^5$ USD). Due to, for example, additional biomarker measurements made in phase III, or because regulatory agencies may require more extensive data collection in phase III [32], higher per-patient costs in phase III compared with phase II are reasonable. In this example, the fixed costs for phase II and III are set to $c_{02} = 100$ and $c_{03} = 150$ (in $10^5$ USD), respectively. To investigate different scenarios, the benefit parameters $(b_1, b_2, b_3)$ are chosen to embody a low overall benefit (scenario 1: (1000, 2000, 3000), 2: (1000, 2000, 4000), 3: (1000, 3000, 4000)) and a medium to large overall benefit (scenario 4: (1000, 3000, 5000), 5: (1000, 4000, 5000), 6: (1000, 3000, 6000), 7: (1000, 4000, 6000)), all in $10^5$ USD, where we assume a 5-year income period and a profit margin of 0.2. Thus, seven different benefit scenarios (bs 1-7) will be considered. A mixture distribution consisting of the weighted sum of two normal distributions, as proposed by Götte et al. [26], can be used to model the true treatment effect. The two normal distributions each depict a distribution for $\theta$, whereby the means represent values of the assumed true treatment effect and the denominators of the associated variances can be viewed as the "amount of certainty" about the treatment effect size in terms of numbers of events. The parameters of the distributions (i.e., means and variances) are selected such that a realistic range for the HR is covered (compare Fig. A2 in Additional file 1 and/or investigate the prior distribution with the help of our R Shiny App prior [33]). The mean of the first of the two normal distributions characterizes a strong treatment effect, the second one a moderate to low treatment effect, so that by varying the weight $w$ of the first component from, e.g., 0.3 to 0.9 we can mirror pessimistic to more optimistic opinions about the true treatment effect. In practice, the choice of $w$ can be guided by formal expert elicitation methods. Dallow et al. [34] presented an overview of such methods, including elicitation of Gaussian mixture distributions. Note that the approach is general and allows for implementation of any alternative prior distribution. Again, elicitation methods (compare also, e.g., [35]) are a useful tool that may help (a group of) experts to quantify their opinions about the treatment effect as a probability distribution. Various software packages enable their practical application (compare, e.g., [36]). In our framework it is also possible to account for, e.g., different population structures in phase II and phase III (due to different countries, centers, in-/exclusion criteria, …) by assuming different distributions for the assumed true treatment effect in phase II and III (i.e., $\theta_2 \nsim \theta_3$), so that $\hat{\theta}_2 \mid \theta_2 \sim N(\theta_2, 4/d_2)$ and $T_3 \mid \hat{\theta}_2, \theta_2, \theta_3 \sim N(\theta_3 / \sqrt{4/D_3}, 1)$. For ease of interpretation, all formulas and results presented in the main part are for the special case where the true treatment effect is modelled by the same distribution for phase II and III (e.g., $\theta \sim \theta_2 \sim \theta_3$), and a brief investigation of this aspect can be found in Section A2 of Additional file 1.
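A two-component normal mixture prior of this kind can be written down directly. In the sketch below, the component means are given on the HR scale and the variances as 4 divided by a number of events expressing the "amount of certainty"; the function names and all numerical values (HRs of 0.7 and 0.9, 200 and 400 events) are placeholders chosen for illustration and are not the values used in the paper.

```python
from math import log, sqrt
from statistics import NormalDist


def mixture_prior_pdf(theta, w, hr_strong, hr_moderate, id_strong, id_moderate):
    """Density of w * N(-log(hr_strong), 4/id_strong) + (1 - w) * N(-log(hr_moderate), 4/id_moderate)."""
    strong = NormalDist(-log(hr_strong), sqrt(4.0 / id_strong))
    moderate = NormalDist(-log(hr_moderate), sqrt(4.0 / id_moderate))
    return w * strong.pdf(theta) + (1.0 - w) * moderate.pdf(theta)


def mixture_prior_mean(w, hr_strong, hr_moderate):
    """Prior mean of theta = -log(HR) under the mixture."""
    return w * (-log(hr_strong)) + (1.0 - w) * (-log(hr_moderate))


# Placeholder parameters: pessimistic (w = 0.3) versus optimistic (w = 0.9) weighting.
pessimistic_mean = mixture_prior_mean(0.3, 0.7, 0.9)
optimistic_mean = mixture_prior_mean(0.9, 0.7, 0.9)
```

Increasing w shifts prior mass towards the strong-effect component, which is how a single weight can express pessimistic to optimistic opinions.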
In this example, we chose a wide range for $\kappa$ (and $d_2$, as well as $\lambda$ or $\alpha_{CI}$, respectively) such that the optimization is not influenced by that choice. Therefore, the optimization set is $D = \{\delta = (d_2, \kappa, s_2)$, $d_2 \in \{50, 52, …, 350\}$, $\kappa \in \{-\log(0.9), -\log(0.89), …, -\log(0.7)\}$, $s_2 = \lambda \in \{0.2, 0.225, …, 1\}$ or $s_2 = \alpha_{CI} \in \{0.025, 0.075, …, 0.5\}\}$. However, the lower bound of the decision rule set for $\kappa$ can also be seen as representing a predefined clinically relevant effect size: phase III trials are then only conducted if the treatment effect observed in phase II is at least of that size. In Section A3 of Additional file 1, we present results of the procedure where we chose $\min(\kappa) = -\log(0.8)$. Furthermore, it might be interesting to see how the optimal program design is influenced by the sponsor's real-life budget constraint. Therefore, we also consider optimizing the drug development program with a constraint $K$ on the expected costs of the program, i.e., $E[c(d_2, \kappa, s_2)] \leq K$ (see Section A4 of Additional file 1 for more details). In the pharmaceutical industry, there are often discussions about skipping the phase II trial. For example, if competitors already have an approved drug with a similar mode of action, one might see no need for further learning about the drug and go directly to a confirmatory trial. Our framework allows this aspect to be assessed systematically by setting $d_2 = 0$, $c_{02} = c_2 = 0$ and $p_{go} = 1$ (see Section A5 of Additional file 1 for more details). In addition, different definitions of the cost and benefit functions are possible. As mentioned above, the choice of three effect size categories (and therefore the benefit function) is based on a report of the German Institute for Quality and Efficiency in Health Care [29]. However, the presented framework could also be applied to an alternative set-up such as, for example, the one proposed by Ding et al. [32], in which a proportional relationship between benefit and effect size is considered. In Section A6 of Additional file 1 we investigate this possibility in more detail.
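To give an impression of the size of the search space, the optimization set for the multiplicative adjustment can be enumerated explicitly. This is only a sketch with hypothetical variable names; the grid for α_CI is left out here.

```python
from math import log

# Number of phase II events: 50, 52, ..., 350
d2_grid = list(range(50, 351, 2))

# Go thresholds on the HR scale: 0.90, 0.89, ..., 0.70, and kappa = -log(HR_go)
hr_go_grid = [round(0.90 - 0.01 * i, 2) for i in range(21)]
kappa_grid = [-log(hr) for hr in hr_go_grid]

# Retention factors: 0.2, 0.225, ..., 1
lambda_grid = [round(0.2 + 0.025 * i, 3) for i in range(33)]

# Every design delta = (d2, kappa, lambda) considered for a multiplicative set-up
n_designs = len(d2_grid) * len(kappa_grid) * len(lambda_grid)
```

A full grid evaluation therefore requires the expected utility for roughly 10^5 designs per scenario, which is why an efficient numerical integration routine matters; a budget constraint can be imposed by discarding designs whose expected costs exceed K before taking the maximum.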

Results
This section is organized as follows. It starts with general observations across all program set-ups $S(\hat{\theta}_2^{s_1}, \hat{\theta}_2^{s_2})$. The optimization results are presented in Table 2 (naïve setting, multiplicative adjustment), Table 3 (additive adjustment) and Figure 3, which show the optimal design parameters $\delta^* = (d_2^*, \kappa^*, s_2^*)$: the optimal total number of events for phase II $d_2^*$ (given by the optimal value of $d_2 \in D$), the optimal go/no-go decision rule threshold value $HR_{go}^*$ (given by the optimal value of $\kappa \in D$ on the HR scale, i.e., $HR_{go}^* = \exp(-\kappa^*)$) and the optimal adjustment parameter $s_2^* \in \{\lambda^*, \alpha_{CI}^*\}$ (given by the optimal value of $s_2 \in D$) for the multiplicative and additive adjustment method, respectively. They also show the corresponding program characteristics for the optimal design: the maximal expected utility $u^*$, the expected estimate used for sample size calculation $\varepsilon_2^*$, the expected number of events in phase III $d_3^*$, where we chose a desired power of $1 - \beta = 0.9$ and a one-sided significance level $\alpha = 0.025$, the total number of expected events in the program $d^*$, the expected probability to go to phase III $p_{go}^*$, and the expected probability of a successful program $sP^*$. Results are reported for the considered program set-ups, benefit scenarios (bs 1-7) and weights for the prior distribution of the true underlying effect ($w$ = 0.3, 0.6, 0.9).
Overall, larger assumed benefits (i.e., larger values for $(b_1, b_2, b_3)$) lead to more liberal optimal decision rules (i.e., larger values for $HR_{go}^*$) and higher investment in phase II (i.e., a larger number of events for phase II $d_2^*$). This leads to a larger investment in phase III, i.e., a higher expected probability to go to phase III $p_{go}^*$ and a larger expected number of events in phase III $d_3^*$, respectively. This results in a larger expected probability of a successful program $sP^*$ and thus in a larger maximal expected utility $u^*$.
In the multiplicatively adjusted program set-ups $S(\hat{\theta}_2^{s_1}, \hat{\theta}_2^{\lambda})$, the maximal expected utility is always higher than the maximal expected utility in the additively adjusted program set-ups $S(\hat{\theta}_2^{s_1}, \hat{\theta}_2^{\alpha_{CI}})$, which in turn is always higher than the maximal expected utility in the unadjusted program set-up $S(\hat{\theta}_2^{u}, \hat{\theta}_2^{u})$. It stands out that the investment in terms of numbers of events (i.e., $d_2^*$, $d_3^*$, $d^*$) tends to be higher in the adjusted program set-ups compared to the unadjusted program set-up, especially for scenarios with higher benefits and a more optimistic prior. The expected probability to go to phase III $p_{go}^*$ is notably lower in the adjusted program set-ups compared to the unadjusted program set-up, whereas the expected probability of a successful program $sP^*$ is higher.
Dividing the optimal number of events in phase II by the expected number of events in phase III (i.e., […] $\hat{\theta}_2^{\lambda})$, respectively. Furthermore, it can be observed that the treatment effect estimate of phase II used for sample size calculation in the optimal design is overestimated in the unadjusted setting ($\varepsilon_2^* < \exp(-E[\hat{\theta}_2])$, as indicated by the black circles and yellow line in Figure 3). This overestimation is lower in the adjusted settings and can even result in an underestimation (compare multiplicative settings for $w$ = 0.9).
The operating characteristics for the optimal designs (e.g., $u^*$, $sP^*$) do not vary (much) between the two multiplicatively adjusted program set-ups or between the two additively adjusted program set-ups for each benefit scenario bs and choice of weight for the prior distribution $w$. However, there are differences in the optimal choice of the threshold value for the decision rule $HR_{go}^*$: in the program set-ups with adjusted phase II treatment effect estimate used for decision making [...]

Discussion
To find optimal drug development designs, the costs of the program (fixed/variable costs for phase II/III), the assumed benefit, and the development risk (i.e., the expected probability of a successful program) are taken into account. By maximizing the expected utility with respect to the design parameters (adjustment parameter, number of events for phase II and threshold value for the go/no-go decision rule), optimal phase II/III drug development program designs can be found. The framework thus enables quantitative reasoning about the design (i.e., the optimal "amount of adjustment", sample size and decision rule) for the specific drug development program at hand.
Table 2 Optimal design parameters $\lambda^*$, $d_2^*$ and $HR_{go}^*$, corresponding value of maximal expected utility $u^*$, expected estimate used for sample size calculation $\varepsilon_2^*$, expected number of events in phase III when going to phase III $d_3^*$, expected total number of events of program $d^*$, expected probability to go to phase III $p_{go}^*$, and expected probability of a successful program $sP^*$ for the optimal design, for $c_2 = 0.75$, $c_3 = 1$, $c_{02} = 100$, $c_{03} = 150$ (in $10^5$ dollars), $\xi_2 = \xi_3 = 0.7$, $1 - \beta = 0.9$, $\alpha = 0.025$ (one-sided), benefit scenarios bs 1-7, weights for the prior distribution $w = 0.3, 0.6, 0.9$, for the unadjusted program set-up $S(\hat\theta_2^{u}, \hat\theta_2^{u})$ and the multiplicatively adjusted program set-ups $S(\hat\theta_2^{s_1}, \hat\theta_2^{\lambda})$, respectively
We investigated two adjustment methods (additive and multiplicative adjustment), several benefit scenarios (e.g., low, medium, large overall benefit), different distributions for the true treatment effect (with the same and different distributions in phase II and III), scenarios with a real-life budget constraint, scenarios with a predefined clinically relevant effect, and scenarios where phase II could be skipped. We have thus presented a method that covers a variety of possible oncology drug development program scenarios and allows the associated changes in the optimal design parameters to be assessed. Alternative planning situations (e.g., a proportional relationship between benefit and effect size), more complex planning situations, and applications to other research areas can be implemented by choosing the relevant (e.g., cost and benefit) parameters appropriately [37][38][39]. As the framework has been shown to be very flexible, frequent scenarios in oncology drug development are adequately mapped by our approach. However, certain situations may be simplified. For example, in our framework the development program consists of just one phase II trial and one phase III trial, which is, however, not unusual in oncology.
Table 3 Optimal design parameters $\alpha_{CI}^*$, $d_2^*$ and $HR_{go}^*$, corresponding value of expected utility $u^*$, expected estimate used for sample size calculation $\varepsilon_2^*$, expected number of events in phase III when going to phase III $d_3^*$, expected total number of events of program $d^*$, expected probability to go to phase III $p_{go}^*$, and expected probability of a successful program $sP^*$ for the optimal design, for $c_2 = 0.75$, $c_3 = 1$, $c_{02} = 100$, $c_{03} = 150$ (in $10^5$ dollars), $\xi_2 = \xi_3 = 0.7$, $1 - \beta = 0.9$, $\alpha = 0.025$ (one-sided), benefit scenarios bs 1-7, weights for the prior distribution $w = 0.3, 0.6, 0.9$, for the additively adjusted program set-ups $S(\hat\theta_2^{s_1}, \hat\theta_2^{\alpha_{CI}})$
For situations in which two or more phase III trials are performed, the framework of optimal planning of development programs was presented in a recent article by Preussler et al. [40]. Furthermore, we assumed the phase II trial to be two-armed. In oncology, dose investigations are often performed before and not as part of phase II. In other indications, however, dose-finding is performed in phase II. Methods for optimizing phase II/III programs with multi-armed phase II/III studies are presented in Preussler et al. [41]. Futility investigations in the phase III trial and/or a "seamless design" for the final analysis may be worthwhile options, and investigating their impact on the optimal design will be a topic of future research. We assumed that the endpoint used in phase II and phase III is the same. We are currently exploring the situation in which a surrogate (like progression-free or disease-free survival) is captured in phase II and overall survival is the primary endpoint in phase III. Another important point is that time effects are not considered in this article. The duration of development is not accounted for in the program; this issue is discussed, among others, in Preussler et al. [41].
That work presents in detail how to incorporate the impact of trial duration into the framework (compare Supplementary Material A2 [41]). However, when trying to incorporate "time" into the utility function, many aspects have to be considered. For example, one could take into account the "life cycle" of a drug as proposed by Patel & Ankolekar [42], who describe a typical life cycle as an early growth phase followed by a plateau, after which sales decline as the patent expires. Furthermore, if several competitors are investigating a similar drug, the company that is first to bring the drug to market usually gets the higher market share, i.e., a higher gain. However, including these aspects requires competitor information and assumptions about their unknown, yet to be observed treatment effects. Any such assumptions are usually associated with very high uncertainty. Instead of trying to include too many (unknown) aspects in the utility function, a rather simplified approach, as presented here, is advisable. If, after observing phase II data, further information about the potential of the drug, dose, target population or (time-dependent) benefits is available, the probability of success (compare [43]) and the utility function could be updated to support go/no-go decisions as well as the design of the phase III trial.
Fig. 3 Optimization results. Maximal expected utility $u^*$, corresponding optimal design parameters $\delta^* = (d_2^*, HR_{go}^*)$, $\delta^* = (d_2^*, HR_{go}^*, \lambda^*)$ or $\delta^* = (d_2^*, HR_{go}^*, \alpha_{CI}^*)$, expected probability to go to phase III $p_{go}^*$, expected probability of a successful program $sP^*$, expected estimate used for sample size calculation $\varepsilon_2^*$, expected number of events in phase III when going to phase III $d_3^*$ and expected total number of events of program $d^*$ in the optimal design, for $c_2 = 0.75$, $c_3 = 1$, $c_{02} = 100$, $c_{03} = 150$ (in $10^5$ dollars), $\xi_2 = \xi_3 = 0.7$, $1 - \beta = 0.9$, $\alpha = 0.025$ (one-sided), for program set-ups [...]. Note that the symbols used to show the program characteristics of both multiplicatively and additively adjusted program set-ups, i.e., green crosses and violet triangles, appear as stars when plotted on top of each other
In general, our results show that the adjusted program set-ups are superior to the unadjusted program set-up with respect to the maximal expected utility. This is associated with higher investments in terms of number of events and lower expected probabilities to go to phase III in the adjusted program set-ups compared to the unadjusted approach. Thus, in the adjusted program set-ups a go decision is made less often, but when it is made, the investment in terms of sample size is higher. This holds in particular for the multiplicatively adjusted program set-ups, which also have higher expected probabilities of a successful program than the additively adjusted and unadjusted program set-ups. Simply put, the money is spent more wisely when adjustment methods are used.
Values for the adjustment parameters that do not lead to an adjustment (i.e., $\alpha_{CI} = 0.5$ and $\lambda = 1$ in the additively and multiplicatively adjusted program set-ups, respectively) were included but never selected in the optimization. Thus, the results suggest that adjustment should always be considered, which is in line with Chuang-Stein and Kirby [14]. Furthermore, we see that in the unadjusted case there is an overestimation of the treatment effect after phase II, which is mitigated by the adjustments. In the multiplicative setting, even an overcorrection, and thus an even larger investment in terms of sample size, can be worthwhile with respect to the expected utility. Note that the focus is on maximal expected utility and the expected estimate of phase II is only a supporting variable, i.e., obtaining a "perfectly" unbiased estimator is not the goal in this application. With regard to the ratio of the optimal number of events in phase II to that in phase III ($d_2^*/d_3^*$), the framework ends up in the "desirable" range of 2/3 (according to De Martini [4,25]) in the unadjusted and additive case; in the multiplicative case, $d_2^*/d_3^*$ is lower but still exceeds the often used 1/4. However, it should be noted that the total optimal sample size is highest for the multiplicative case.
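To make the link between adjustment and phase III size concrete, the sketch below computes the number of phase III events from an unadjusted and from an adjusted phase II estimate. It is an illustration under stated assumptions, not the implementation in drugdevelopR. It assumes the Schoenfeld approximation for 1:1 randomization, a multiplicative adjustment that multiplies the estimate by lambda, and an additive adjustment that replaces the estimate by the lower bound of a one-sided confidence interval with level 1 - alpha_CI (standard error taken as the square root of 4/d2). All function names and input values are hypothetical.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def events_phase3(theta_hat, alpha=0.025, beta=0.1):
    """Number of events by the Schoenfeld approximation (assumed, 1:1 allocation)."""
    if theta_hat <= 0:
        raise ValueError("estimate used for planning must be positive")
    z_sum = Z.inv_cdf(1 - alpha) + Z.inv_cdf(1 - beta)
    return math.ceil(4 * z_sum ** 2 / theta_hat ** 2)


def adjust_multiplicative(theta_hat, lam):
    """Assumed multiplicative adjustment: shrink the estimate by the factor lam."""
    return lam * theta_hat


def adjust_additive(theta_hat, d2, alpha_ci):
    """Assumed additive adjustment: lower bound of a one-sided confidence interval."""
    return theta_hat - Z.inv_cdf(1 - alpha_ci) * math.sqrt(4 / d2)


# Hypothetical phase II result: estimated HR of 0.7 based on 200 events.
theta_hat = -math.log(0.7)
d2 = 200

d3_naive = events_phase3(theta_hat)
d3_mult = events_phase3(adjust_multiplicative(theta_hat, lam=0.8))
d3_add = events_phase3(adjust_additive(theta_hat, d2, alpha_ci=0.3))

print("unadjusted:    ", d3_naive)
print("multiplicative:", d3_mult)
print("additive:      ", d3_add)
```

With lam = 1 or alpha_ci = 0.5 both functions return the unadjusted estimate, which matches the statement that these parameter values lead to no adjustment. Any stronger adjustment reduces the estimate used for planning and therefore increases the number of phase III events.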
Both multiplicatively adjusted [...]. Considering only these two aspects, adjustment of the treatment effect estimate used for the decision rule may be omitted when also optimizing the threshold value for the decision rule: this only leads to larger values for $HR_{go}^*$ (i.e., more liberal decision rules), which compensate for the adjusted (more conservative) treatment effect estimates. For the same reason, program set-ups $S(\hat\theta_2^{\lambda}, \hat\theta_2^{u})$ and $S(\hat\theta_2^{\alpha_{CI}}, \hat\theta_2^{u})$ (i.e., multiplicative or additive adjustment used for the decision rule and no adjustment applied for the calculation of the number of events for phase III) are not considered. Furthermore, as adjustment of the treatment effect estimate used for the decision rule may be omitted when also optimizing over the threshold value for the decision rule, we did not consider program set-ups in which different adjustment parameters for the decision rule and for the calculation of the expected number of events are optimized (in our notation [...]

Conclusions
Based on our results, we highly recommend using (multiplicatively) adjusted phase II treatment effect estimates for calculating the phase III number of events in a phase II/III drug development program with a go/no-go decision rule (compare Chuang-Stein & Kirby [14], Kirby et al. [15] and De Martini [4,25]). However, as our results also show that the optimal design parameters of each method depend on the cost and benefit parameters as well as on the applied prior distribution, no general rule exists. Instead, the design parameters should be determined by applying our proposed optimization procedure to the specific parameter values of the respective drug development program. To this end, we provide a user-friendly R Shiny App (bias) and an R package (drugdevelopR, including the R function optimal_bias), both open-source and accessible via [1].
Additional file 1. Additional file 1 gives an overview of the formulas in program set-ups $S(\hat\theta_2^{s_1}, \hat\theta_2^{s_2})$, $s_1, s_2 = \lambda, \alpha_{CI}, u$ (A0) and an investigation of an alternative definition of program success (A1). Furthermore, it presents more details and results of the application example when modelling different population structures in phase II and III (A2), when using a predefined minimal clinically relevant effect for phase III planning (A3), when using a budget constraint (A4), when skipping phase II (A5) and when using a linear function for modelling the gain (A6). The file Code.R includes the main function calls for generating the datasets and tables, using the R package drugdevelopR.
Abbreviations
$\alpha_{CI}$, $\lambda$: Adjustment parameter for additive and multiplicative adjustment method, respectively; bs: Benefit scenario; CI: Confidence interval; $d_2$, $d_3$, $d$: Total number of events for phase II, III and the program, respectively; HR: True assumed hazard ratio; $\kappa$: Threshold value for the go/no-go decision rule, $\kappa = -\log(HR_{go})$; $s_1$, $s_2$: Estimate used for go/no-go decision and calculation of number of events, respectively; $\theta$: True assumed treatment effect, $\theta = -\log(HR)$