### The jackknife estimator of the value function

In this section, we introduce general notation and review existing work relevant to this paper. Let \(\{(\mathbf {X}_{i}, A_i, Y_i )\}_{i=1}^n\) be independent and identically distributed copies of the triple \(({\mathbf {X}}, A, Y )\) for *n* patients. We define \(\mathbf {X} \in \mathcal {X} \subseteq \mathbb {R}^p\) as a vector of baseline patient characteristics, \(A \in \{0, 1 \}\) as the binary treatment indicator, and \(Y \in \mathbb {R}\) as a scalar outcome, with higher values of *Y* representing more favorable outcomes. Precision medicine seeks to maximize the expected reward of a potential outcome under a treatment rule, where a treatment rule maps the covariate space \(\mathcal {X}\) to the binary treatment space \(\{ 0, 1 \}\). In a single-stage setting, the optimal treatment rule is the rule, within the class \(\mathcal {D}\) of all possible treatment rules, that maximizes the value of a potential outcome when applied to future patients [2]. The expected reward, or value function, is defined as:

$$\begin{aligned} V(d) = E^d (Y) = E\left[ Y \frac{1 \{ A=d (\mathbf {X})\}}{P(A|\mathbf {X})} \right] , \end{aligned}$$

where \(P(A|\mathbf {X})\) is the propensity score, which is known in a randomized study and can be estimated with logistic regression or some other model in an observational study. The optimal treatment rule is defined as \(d^* = \arg \max _{d \in \mathcal {D}} V(d)\).
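As a concrete illustration, the inverse-probability-weighted value above can be computed directly for a fixed rule. The sketch below (function and variable names are ours, not the paper's) assumes a 1:1 randomized trial with known propensity 0.5:

```python
import numpy as np

def ipw_value(Y, A, X, rule, propensity):
    """Inverse-probability-weighted estimate of V(d) = E^d(Y).
    `rule` maps covariates to {0, 1}; `propensity` holds P(A_i | X_i).
    All names are illustrative, not the paper's notation."""
    match = (A == rule(X)).astype(float)     # 1{A_i = d(X_i)}
    return float(np.mean(Y * match / propensity))

# Toy 1:1 randomized trial: under the always-treat rule the estimate
# recovers the mean outcome of the treated arm (here about 2).
rng = np.random.default_rng(0)
n = 10000
X = rng.normal(size=(n, 2))
A = rng.integers(0, 2, size=n)
Y = 1.0 + A + rng.normal(size=n)
v = ipw_value(Y, A, X, lambda x: np.ones(len(x), dtype=int), np.full(n, 0.5))
```

The weighting discards patients whose observed treatment disagrees with the rule and reweights the rest by the inverse propensity, mimicking a trial in which everyone followed the rule.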

In the right-censored survival data setting, we let \(\tilde{T}\) denote the true survival time and \(T= \min (\tilde{T}, \tau )\) the \(\tau\)-truncated survival time, where \(\tau < \infty\) is the maximum follow-up time. We also assume that the censoring time *C* is independent of *T* given \(({\mathbf {X}}, A)\), and we denote the censoring indicator by \(\delta =I(T \le C)\). When the interest is in maximizing the restricted mean survival time, the value function above becomes

$$\begin{aligned} V(d) = E\left[ T \frac{1 \{ A=d (\mathbf {X})\}}{P(A|\mathbf {X})} \right] . \end{aligned}$$

The optimal treatment rule is estimated with regression-based or classification-based approaches. Regression-based approaches model the mean outcome conditional on treatment assignment and patient covariates [3, 17,18,19]. Their performance depends on the correct specification of the posited models. In addition, modeling treatment-covariate interactions can lead to a high-dimensional situation, further complicating the estimation procedure. Classification-based approaches were developed to avoid issues of model misspecification [4, 5, 20]. This alternative class of approaches relies on fewer modeling assumptions, and the optimal treatment rule is estimated by minimizing the expected weighted misclassification error. The estimated treatment rule is then evaluated by assessing its value function, typically via cross-validation.

Jiang et al. [7] proposed a leave-one-out cross-validation (LOOCV), or jackknife, estimator of the value function for continuous outcomes. The approach requires only weak assumptions, namely that the data are independent and identically distributed. The proposed jackknife estimator is expressed as:

$$\begin{aligned} \widehat{V}^{jk} (\hat{d}_{n}) = \frac{ \sum _{i=1}^{n} Y_{i} \frac{1\left\{A_{i}=\hat{d}^{(-i)}_{n} (\mathbf {X}_{i})\right\}}{{P}(A_{i}|\mathbf {X}_{i})} }{ \sum _{i=1}^{n} \frac{1\left\{A_{i}=\hat{d}^{(-i)}_{n} (\mathbf {X}_{i})\right\}}{{P}(A_{i}|\mathbf {X}_{i})}} , \end{aligned}$$

(1)

where \(\hat{d}^{(-i)}_{n}\) represents the treatment rule estimated from the training set of size \(n-1\) obtained by leaving out the *i*-th observation, and \({P}(A_{i}|\mathbf {X}_{i})\) is the propensity score of the held-out observation *i*. The variance of the estimator in (1) is estimated by

$$\begin{aligned} \widehat{Var}\left[\widehat{V}^{jk} (\hat{d}_{n})\right] = \frac{1}{n (n-1)} \sum _{i=1}^{n} \left(R^{jk}_{i}\right)^2, \end{aligned}$$

where \(R_i^{jk}= \frac{1}{\overline{W}_{n}} U_i - \frac{\overline{U}_{n}}{\overline{W}^2_{n}} W_i\), with \(U_i = Y_{i} \frac{1\left\{A_{i}=\hat{d}^{(-i)}_{n} (\mathbf {X}_{i})\right\}}{{P}(A_{i}|\mathbf {X}_{i})}\), \(W_i= \frac{1\left\{A_{i}=\hat{d}^{(-i)}_{n} (\mathbf {X}_{i})\right\}}{{P}(A_{i}|\mathbf {X}_{i})}\), \(\overline{U}_{n}=n^{-1} \sum _{i=1}^{n} U_{i}\), and \(\overline{W}_{n}=n^{-1} \sum _{i=1}^{n} W_{i}\). Through proofs and simulations, Jiang et al. [7] showed consistency and asymptotic normality of the proposed jackknife estimator of the value function. Note that we are ignoring the contribution to the limiting distribution due to estimating the propensity scores. It appears from our simulation study that this omission does not harm the performance of our method for a range of sample sizes. Evaluating this question theoretically is beyond the scope of this paper.
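A minimal sketch of the estimator in (1) and its variance estimate is given below. The `fit_rule` interface and the toy difference-in-means rule are illustrative assumptions of ours, not part of the paper; any rule estimator that can be refit on each leave-one-out training set would slot in:

```python
import numpy as np

def jackknife_value(Y, A, X, fit_rule, propensity):
    """Jackknife (LOOCV) estimate of the value in Eq. (1) and its
    variance estimate, following Jiang et al. [7].
    fit_rule(Y, A, X) must return a function d: X -> {0, 1}; this
    interface is illustrative."""
    n = len(Y)
    U, W = np.empty(n), np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        d = fit_rule(Y[keep], A[keep], X[keep])     # rule without obs i
        match = float(A[i] == d(X[i:i + 1])[0])     # 1{A_i = d^(-i)(X_i)}
        W[i] = match / propensity[i]
        U[i] = Y[i] * W[i]
    Ubar, Wbar = U.mean(), W.mean()
    V = Ubar / Wbar                                 # ratio estimator (1)
    R = U / Wbar - (Ubar / Wbar**2) * W             # influence-type residual
    var = np.sum(R**2) / (n * (n - 1))
    return V, var

# Toy rule estimator: assign every patient the arm with the larger
# leave-one-out sample mean outcome.
def fit_rule(Y, A, X):
    best = int(Y[A == 1].mean() > Y[A == 0].mean())
    return lambda Xnew: np.full(len(Xnew), best, dtype=int)

rng = np.random.default_rng(1)
n = 400
X = rng.normal(size=(n, 1))
A = rng.integers(0, 2, size=n)
Y = 1.0 * A + rng.normal(size=n)    # treatment 1 is uniformly better
V, var = jackknife_value(Y, A, X, fit_rule, np.full(n, 0.5))
```

Here treatment 1 raises the mean outcome by 1, so the leave-one-out rule is always-treat and the jackknife value should be close to 1.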

In the presence of dropout or administrative censoring, the unobserved events complicate the statistical analysis. In particular, informative censoring (i.e., censoring time that depends on the failure time) may bias the estimation of both survival and censoring probabilities. Thus, the higher the proportion of censoring in the data, the more caution is required in the estimation procedure. One approach commonly used to address censoring is the inverse probability of censoring weighted (IPCW) estimator, which was initially developed to correct the bias in survival probability estimates caused by censoring [21,22,23,24]. The IPCW estimator corrects for bias by giving zero weight to censored observations and extra weight to uncensored observations with characteristics similar to those of the censored ones. The weights are usually estimated by logistic regression or Cox proportional hazards regression. Applying IPCW recreates a hypothetical scenario without censoring, and the re-weighted sample is asymptotically unbiased. However, IPCW does not address possible model misspecification when parametric or semi-parametric estimation methods are employed.
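As a concrete illustration of IPCW, the sketch below estimates the censoring survival function with a Kaplan-Meier estimator that treats censorings as the events, then forms the weights \(\delta_i / \hat{S}_c(T_i-)\). It deliberately ignores covariates, whereas the methods above allow \(S_c(t|\mathbf {X})\) from, e.g., a Cox model; all names are ours:

```python
import numpy as np

def km_censoring_survival(time, delta):
    """Kaplan-Meier estimate of the censoring survival S_c(t), treating
    censorings (delta == 0) as the 'events'.  Covariate-free sketch."""
    order = np.argsort(time)
    t, d = time[order], delta[order]
    n = len(t)
    at_risk = n - np.arange(n)                       # risk set just before each t
    # multiplicative step (1 - 1/at_risk) at censoring times, 1 otherwise
    factors = np.where(d == 0, 1.0 - 1.0 / at_risk, 1.0)
    S = np.cumprod(factors)

    def S_c(query):
        # evaluate at the left limit t-, the usual IPCW convention
        idx = np.searchsorted(t, query, side="left") - 1
        return np.where(idx >= 0, S[np.clip(idx, 0, n - 1)], 1.0)

    return S_c

# IPCW weights on simulated tau-truncated survival data: zero weight for
# censored subjects, delta / S_c(T-) for the rest; the weights should
# average approximately one, mimicking an uncensored sample.
rng = np.random.default_rng(2)
n, tau = 500, 2.0
T_true = np.minimum(rng.exponential(1.0, size=n), tau)
C = rng.exponential(2.0, size=n)
T = np.minimum(T_true, C)
delta = (T_true <= C).astype(int)
S_c = km_censoring_survival(T, delta)
w = delta / np.clip(S_c(T), 1e-8, None)
```

The key property is that \(E[\delta \mid T] = S_c(T-)\) under the assumed conditional independence of *C* and *T*, so dividing by the censoring survival restores, on average, the weight of the subjects lost to censoring.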

### Jackknife estimator of the value for right-censored data

#### Definition of the jackknife estimator of the value function for right-censored data

We propose an extension of the jackknife value estimator developed by Jiang et al. [7] to the presence of right censoring. Our jackknife value estimator corrects for the censoring-induced bias by applying the IPCW method to the estimator in (1). The proposed jackknife value estimator has the following form:

$$\begin{aligned} \widehat{V}^{jk} (\hat{d}_{n}) = \frac{ \sum _{i=1}^{n} T_{i} \frac{1\left\{A_{i}=\hat{d}^{(-i)}_{n} (\mathbf {X}_{i})\right\}}{{P}(A_{i}|\mathbf {X}_{i})} \frac{\delta _{i}}{{S}_{c}(T_i|\mathbf {X}_{i})} }{ \sum _{i=1}^{n} \frac{1\left\{A_{i}=\hat{d}^{(-i)}_{n} (\mathbf {X}_{i})\right\}}{{P}(A_{i}|\mathbf {X}_{i})} \frac{\delta _{i}}{{S}_{c}(T_i|\mathbf {X}_{i})} }, \end{aligned}$$

(2)

where \(\hat{d}^{(-i)}_{n}\) represents the treatment rule estimated from the training set of size \(n-1\) obtained by leaving out the *i*-th observation, \({P}(A_{i}|\mathbf {X}_{i})\) is the propensity score of the held-out observation *i*, \({S}_{c}(t|\mathbf {X}_{i})\) is the probability that the *i*-th observation has not been censored by time *t* conditional on \(\mathbf {X}_{i}\), and \(\delta _i=I(T_i \le C_i)\). Here, we assume that \({P}(A_{i}|\mathbf {X}_{i})\) and \({S}_{c}(t|\mathbf {X}_{i})\) are known. In practice, however, they are often unknown and are estimated by regression models; the estimated propensity scores and censoring probabilities are then plugged into the expression for the jackknife estimator in (2).

The estimated variance of the jackknife value estimator is

$$\begin{aligned} \widehat{Var}\left[\widehat{V}^{jk} (\hat{d}_{n})\right] = \frac{1}{n (n-1)} \sum _{i=1}^{n} \left(R^{jk}_{i}\right)^2, \end{aligned}$$

where \(R_i^{jk}= \frac{1}{\overline{W}_{n}} U_i - \frac{\overline{U}_{n}}{\overline{W}^2_{n}} W_i\) is the bias-corrected contribution of the *i*-th observation to the value, inspired by the influence function, with \(U_i= T_{i} \frac{1\left\{A_{i}=\hat{d}^{(-i)}_{n} (\mathbf {X}_{i})\right\}}{{P}(A_{i}|\mathbf {X}_{i})} \frac{\delta _{i}}{{S}_{c}(T_i|\mathbf {X}_{i})}\), \(W_i= \frac{1\left\{A_{i}=\hat{d}^{(-i)}_{n} (\mathbf {X}_{i})\right\}}{{P}(A_{i}|\mathbf {X}_{i})} \frac{\delta _{i}}{{S}_{c}(T_i|\mathbf {X}_{i})}\), \(\overline{U}_{n}=n^{-1} \sum _{i=1}^{n} U_{i}\), and \(\overline{W}_{n}=n^{-1} \sum _{i=1}^{n} W_{i}\).

We note that, as before, we ignore the potential variability due to estimating \({P}(A_{i}|\mathbf {X}_{i})\) and \({S}_{c}(t|\mathbf {X}_{i})\); our simulations verify that the approach works acceptably well for the range of sample sizes considered.
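The estimator in (2) and its variance can be sketched as follows. The `fit_rule` interface, the always-treat toy rule, and the uniform censoring distribution are illustrative assumptions of ours; the propensity scores and censoring probabilities are taken as known, as in the formula:

```python
import numpy as np

def jackknife_value_censored(T, delta, A, X, fit_rule, propensity, S_c):
    """Jackknife estimate of the value in Eq. (2) with IPCW correction,
    plus its variance estimate.  propensity[i] = P(A_i | X_i) and
    S_c[i] = S_c(T_i | X_i) are supplied as known quantities."""
    n = len(T)
    U, W = np.empty(n), np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        d = fit_rule(T[keep], delta[keep], A[keep], X[keep])
        match = float(A[i] == d(X[i:i + 1])[0])      # 1{A_i = d^(-i)(X_i)}
        W[i] = match / propensity[i] * delta[i] / S_c[i]
        U[i] = T[i] * W[i]
    Ubar, Wbar = U.mean(), W.mean()
    V = Ubar / Wbar
    R = U / Wbar - (Ubar / Wbar**2) * W              # influence-type residual
    return V, np.sum(R**2) / (n * (n - 1))

# Toy check: treatment 1 doubles mean survival (exponential scale 1 + A),
# times are truncated at tau = 2, censoring C ~ Uniform(0, 5) has known
# survival S_c(t) = 1 - t/5, and the toy rule treats everyone.
rng = np.random.default_rng(3)
n, tau = 600, 2.0
A = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 1))
T_true = np.minimum(rng.exponential(1.0 + A), tau)
C = rng.uniform(0.0, 5.0, size=n)
T = np.minimum(T_true, C)
delta = (T_true <= C).astype(int)
S_c = 1.0 - T / 5.0
fit_rule = lambda T, delta, A, X: (lambda Xnew: np.ones(len(Xnew), dtype=int))
V, var = jackknife_value_censored(T, delta, A, X, fit_rule,
                                  np.full(n, 0.5), S_c)
```

Under the always-treat rule the target value is \(E[\min(\tilde T, 2)]\) for an exponential with mean 2, i.e. \(2(1 - e^{-1}) \approx 1.26\), which the IPCW-corrected jackknife estimate should approach.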

#### Estimation of the optimal treatment rule

The optimal treatment rule \(d_n^{opt}\) maps the covariate space \(\mathcal {X}\) to the binary treatment space \(\{ 0, 1 \}\). It is the rule, within the class \(\mathcal {D}\) of all possible treatment rules, that maximizes the value of a potential outcome when applied to future patients [2].

A wide range of machine learning approaches have been used to obtain \(\hat{d}_{n}^{opt}\), the estimated optimal treatment rule, when the outcome is right-censored. Goldberg and Kosorok [15] proposed a Q-learning algorithm with backward recursion for the multi-stage setting. Zhao et al. [11] developed a doubly robust outcome-weighted learning approach that uses IPCW and requires the correct specification of either the survival or the censoring model. Cui et al. [12] further proposed an extension of outcome-weighted learning that relaxes these requirements through a tree-based approach. In this manuscript, we use random survival forests (RSF) [25] and the Cox proportional hazards model (COX) as our estimation approaches. Random survival forests are among the most commonly used tree-based methods for survival analysis; tree-based methods are popular because they are flexible and make no distributional assumptions about the failure times. Although Cox proportional hazards models [26] make some assumptions about the distribution of the failure times, they are robust to misspecification and remain the most commonly used regression method for survival analysis [27].

Using each of the selected estimation approaches, we obtain the estimated restricted mean survival time (RMST) [28] corresponding to each treatment option. The optimal treatment for patient *i* is then the treatment that yields the larger RMST, i.e., the larger area under the estimated survival curve up to \(\tau\).
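This selection step can be sketched as follows. The predicted survival curves would come from the fitted RSF or Cox model; here they are supplied directly on a time grid, and all names are illustrative:

```python
import numpy as np

def rmst(times, surv, tau):
    """Restricted mean survival time: area under a right-continuous
    survival step function up to tau, with S(0) = 1 prepended."""
    times = np.asarray(times, dtype=float)
    surv = np.asarray(surv, dtype=float)
    t = np.concatenate(([0.0], times[times < tau], [tau]))
    s = np.concatenate(([1.0], surv[times < tau]))
    return float(np.sum(s * np.diff(t)))   # sum of step heights x widths

def pick_treatment(curve0, curve1, tau):
    """Assign the arm whose predicted survival curve has the larger RMST."""
    (t0, s0), (t1, s1) = curve0, curve1
    return int(rmst(t1, s1, tau) > rmst(t0, s0, tau))

# Example with analytic curves on a grid: arm 1 halves the hazard, so it
# has the larger RMST up to tau = 2 and is selected.
grid = np.arange(0.01, 3.0, 0.01)
r = rmst(grid, np.exp(-grid), 2.0)         # close to 1 - e^{-2}
pick = pick_treatment((grid, np.exp(-grid)),
                      (grid, np.exp(-grid / 2.0)), 2.0)
```

For a step-function survival curve this sum is exact; for a smooth curve evaluated on a fine grid it is a close upper Riemann approximation.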

Under mild assumptions, we prove that the proposed jackknife estimator for right-censored data is consistent.

### Consistency of the jackknife estimator for right-censored data

In this section, we prove the consistency of the jackknife estimator. First, we make the following two assumptions.

### Assumption 1

\(E[P_{X}(\hat{d}_{n} (\mathbf {X}) \ne \hat{d}_{n-1} (\mathbf {X}) )] \longrightarrow 0\) as \(n\rightarrow \infty\).

### Assumption 2

\(E\Big [\frac{T^{2} + 1}{P^2(A|\mathbf {X}) \; S^2_{c} (T|\mathbf {X}) } \Big ] < \infty\)

Assumption 1 states that the treatment rules estimated from samples of size *n* and \(n-1\) are asymptotically equal in probability. Assumption 2 requires that the second moment of the outcome, weighted by the inverse squared propensity score and censoring probability, is finite.

### Theorem 1

Given the above assumptions,

$$\begin{aligned} \frac{\sum _{i=1}^{n} T_{i} \frac{1\left\{A_{i}=\hat{d}^{(-i)}_{n} (\mathbf {X}_{i})\right\}}{P(A_{i}|\mathbf {X}_{i})} \frac{\delta _{i}}{S_{c}(T_{i}|\mathbf {X}_{i})} }{ \sum _{i=1}^{n} \frac{1\left\{A_{i}=\hat{d}^{(-i)}_{n} (\mathbf {X}_{i})\right\}}{P(A_{i}|\mathbf {X}_{i})} \frac{\delta _{i}}{S_{c}(T_{i}|\mathbf {X}_{i})} } - E\big [ T|A= \hat{d}_{n} (\mathbf {X}) \big ] \overset{p}{\longrightarrow } 0 \quad \text {as} \; n \rightarrow \infty , \end{aligned}$$

(3)

when \(P(A|\mathbf {X})\) and \(S_{c}(T|\mathbf {X})\) are known.

The complete proof is shown in Additional file 1.

### Z-test for comparing two ITRs in the presence of right-censored data

We construct a test to compare the optimal treatment rules obtained from two estimation approaches. Following Jiang et al. [7], the test statistic is expressed as:

$$\begin{aligned} T(\hat{d}_{n,1}, \hat{d}_{n,2}) = \frac{\widehat{V}(\hat{d}_{n,1}) -\widehat{V}(\hat{d}_{n,2}) }{\sqrt{\frac{\sum _{i=1}^{n} (R_{1,i}-R_{2,i})^2}{n(n-1)}}} , \end{aligned}$$

(4)

where \(\widehat{V}(\hat{d}_{n,1})\) and \(\widehat{V}(\hat{d}_{n,2})\) are the estimated value functions under the optimal treatment rules estimated by the two approaches, and \(R_{1,i}\) and \(R_{2,i}\) are the bias-corrected, influence-function-inspired contributions of the *i*-th individual to the value under the two treatment rules.

The *p*-value of the test is obtained as \(p= 2 \; \int _{\left| T \right| }^{\infty } f(z) \; dz\), where *f* is the density of the standard normal distribution. We estimate the power of the test as the proportion of simulation replicates with *p*-values below 0.05.
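A sketch of the test in (4) and its normal-approximation *p*-value follows; the residuals \(R_{1,i}\) and \(R_{2,i}\) would come from the jackknife computation, and all names here are illustrative:

```python
import math
import numpy as np

def z_test_two_rules(V1, V2, R1, R2):
    """Z-test of Eq. (4): compares the estimated values of two treatment
    rules using their per-subject residuals on the same n subjects."""
    n = len(R1)
    se = math.sqrt(np.sum((np.asarray(R1) - np.asarray(R2)) ** 2)
                   / (n * (n - 1)))
    z = (V1 - V2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|z|))
    return z, p

# Sanity check: with n = 101 and unit residual differences the standard
# error is 0.1, so a value gap of 0.196 gives z = 1.96 and p near 0.05.
z, p = z_test_two_rules(0.196, 0.0, np.ones(101), np.zeros(101))
```

Because the two value estimates are computed on the same subjects, the per-subject residual differences capture their dependence, which the denominator of (4) exploits.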

Through simulations, we study the asymptotic properties of the jackknife estimator. We provide Q-Q plots of the distribution of the test statistic above and boxplots of the jackknife value estimates over 500 replications.