### Data and Notation

Let \(i=1,2,...,n\) index the *n* subjects in a longitudinal dataset for training the prediction model. Let \(\tilde{T}_{i}\) be the time to the clinical endpoint of interest, and \(C_{i}\) be the time to censoring. The observed time-to-event is \(T_{i} = {\textrm{min}}\{ \tilde{T}_{i}, C_{i} \}\). The observed event indicator is \(\delta _{i} = 1\{ \tilde{T}_{i} \leqslant C_{i} \}\), which equals 1 if the event occurrence is observed and 0 if the event is censored. We make the conventional independent censoring assumption that \(C_{i}\) is independent of \(\tilde{T}_{i}\) and the predictor variables. This assumption is valid in studies where censoring is mainly due to staggered study entry or loss to follow-up for non-medical reasons. Let \(t_{ij}\), \(j = 1, ..., n_{i}\), index the clinical visit times of the *i*-th subject. The patient’s characteristics and health conditions are measured at these visits. Throughout this paper, we assume that \(\{ t_{ij} \}\) are non-informative observation times in the sense that their distribution does not depend on other variables except \(0 \leqslant t_{ij} \leqslant T_i\). Without loss of generality, we let \(0 \equiv t_{i1}< t_{i2}< ...< t_{in_i} < T_i\). Here \(t_{i1}\) is the baseline time, which may be study entry or enrollment, receipt of study intervention, diagnosis of a disease condition, etc., depending on the context. As described above, this formulation is most useful if the baseline represents a clinical milestone, but this is not required. Unless stated otherwise, all time variables above are expressed as the time since \(t_{i1}\).

Let \(\varvec{Z}_i(t_{ij})\) denote subject *i*’s vector of predictors at time \(t_{ij}\). For ease of notation, let \(\varvec{Z}_{ij} \equiv \varvec{Z}_i(t_{ij})\). This notation covers both time-varying and time-invariant variables. For the latter, \(\varvec{Z}_{ij} \equiv \varvec{Z}_{ij'}\) for any \(j \ne j'\). The predictor vector \(\varvec{Z}_{ij}\) may include any predictive numerical features of patient *i* available at time \(t_{ij}\), including biomarkers from lab tests, clinical symptoms, patient demographics and genetic features, etc. \(\varvec{Z}_{ij}\) may also include any pre-defined numerical features of the longitudinal history of this patient up to \(t_{ij}\), such as the mean, slope or volatility of a biomarker within a period of time prior to \(t_{ij}\). One or two of the three time variables (\(t_{ij}\), the baseline age, and the time-varying age at \(t_{ij}\)) should always be included in \(\varvec{Z}_{ij}\); otherwise the timing of the longitudinal data is lost. All three cannot be included together because of exact collinearity: the age at \(t_{ij}\) equals the baseline age plus \(t_{ij}\) in a deterministic relationship, and the baseline age is just the time-varying age at \(t_{i1}\).

Let \(\tilde{T}_{ij} = \tilde{T}_i - t_{ij}\) and \(C_{ij} = C_i - t_{ij}\) be the residual survival and residual censoring time since the clinical visit at \(t_{ij}\). \(\tilde{T}_{ij}\) and \(C_{ij}\) are conditionally independent given \(\varvec{Z}_{ij}\), due to the independent censoring assumption. The observed residual survival is hence \(T_{ij} \equiv T_i(t_{ij}) = {\textrm{min}}\left( \tilde{T}_{ij}, C_{ij} \right) = T_{i} - t_{ij}\). The observed residual censoring indicator is \(\delta _{ij} \equiv \delta _i(t_{ij}) = \textbf{1}\left( \tilde{T}_{ij} \le C_{ij} \right) = \delta _i\). The landmark dataset for training purposes has \(\sum _{i=1}^n n_i\) rows. Each row includes \(T_{ij}\), \(\delta _{ij}\), and \(\{ t_{ij} \} \cup \{ \varvec{Z}_{ij} \}\). By definition, patient *i* is at risk at the *j*-th visit if the corresponding \(t_{ij}\) is in the landmark dataset. The goal of the landmark analysis is to build a model for the bilateral relationship between \(\varvec{Z}_{ij}\) and \(\tilde{T}_{ij}\) from the landmark dataset. For a new patient in validation data, we first determine the \(\varvec{Z}\) of that patient at the time of prediction, and then plug the \(\varvec{Z}\) into the model to obtain the prediction.
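As an illustration, the stacking of this landmark dataset can be sketched in a few lines of code. The long-format column names (`id`, `t`, `T`, `delta`) and the pandas-based implementation are our own illustrative choices, not part of the formulation above.

```python
import pandas as pd

def build_landmark_dataset(visits: pd.DataFrame, outcomes: pd.DataFrame) -> pd.DataFrame:
    """Stack one row per (subject, visit): the residual survival
    T_ij = T_i - t_ij, the indicator delta_ij = delta_i, and the
    predictors Z_ij measured at t_ij.

    visits:   columns `id`, `t` (visit time since baseline), predictors.
    outcomes: columns `id`, `T` (observed time), `delta` (event indicator).
    """
    lm = visits.merge(outcomes, on="id")
    lm = lm[lm["t"] < lm["T"]].copy()   # keep only at-risk visits, t_ij < T_i
    lm["T_resid"] = lm["T"] - lm["t"]   # observed residual survival T_ij
    lm["delta_resid"] = lm["delta"]     # residual censoring indicator = delta_i
    return lm.drop(columns=["T", "delta"])
```

The resulting table has one row per at-risk visit, i.e., \(\sum _{i=1}^n n_i\) rows when all recorded visits satisfy \(t_{ij} < T_i\).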

### Model parameterization for landmark analysis

The LA model takes the following general form, based on the Cox model formulation:

$$\begin{aligned} \lambda (u; \varvec{Z}(s)) = \lambda _0(u, s) + \varvec{\beta }(u, s)^T \bar{\varvec{Z}}(s)~~, ~~~~ u \geqslant 0. \end{aligned}$$

(1)

This is a log hazard function of the residual survival since the landmark time *s*, conditional on \(\varvec{Z}(s)\). The notation *u* denotes time since the landmark, and hence is on the residual survival time scale. Here \(\lambda _0(u, s)\) is the log baseline hazard function, and \(\varvec{\beta }(u, s)\) is the vector of time-varying log hazard ratios. We use the notation \(\bar{\varvec{Z}}(s)\) to denote the vector of predictor variables measured at *s* other than the landmark time *s* itself. The effect of *s* is absorbed into \(\lambda _0(u, s)\). This model is specified among patients who have \(\varvec{Z}(s)\), i.e., who are at risk of the event outcome. At a given *s*, (1) is a Cox model with time-independent covariates and time-varying coefficients. But since *s* can be any landmark time, the LA models as a whole include a bivariate log baseline hazard function, which is quite different from the Cox model. The LA models at different landmark times do not generally satisfy the coherence condition of a survival process [14]. Consequently, \(\lambda (s_2 - s_1; \varvec{Z}(s_1)) \ne \lambda (0; \varvec{Z}(s_2))\) in general for \(s_1 < s_2\). Therefore, (1) is usually viewed as a working model [1]. However, as far as prediction is concerned, this model is useful if it ensures a good approximation to the bilateral relationship between \(\varvec{Z}_{ij}\) and \(\tilde{T}_{ij}\) at all landmark times. In other words, the model-predicted survival distribution \(1 - \textrm{exp}\{ -\int _0^u e^{\lambda (v; \varvec{Z}(s))} dv \}\) needs to match the observed residual survival data. While most LA research has used a Cox model-based formulation such as (1), other survival regression models can be used [15]. The previously introduced notation \(\varvec{\theta }(s)\) encompasses all the model parameters at landmark time *s*, including \(\varvec{\beta }(u, s)\) and \(\lambda _0(u, s)\).

Model (1) has never been applied in its fully general form in the published literature, possibly because of the excessive number of parameters if all coefficients are bivariate functions of time. Simplified versions used in data analysis include: (A) letting \(\varvec{\beta }(u, s)\) depend only on *s* [1, 7, 12], (B) letting \(\varvec{\beta }(u, s)\) depend only on *u* [6, 16], and (C) letting \(\lambda _0(u, s)\) be a product of two univariate functions of *u* and *s* [17]. Regardless of these parameterizations, the LA model (1) implies infinitely many survival models, each defined on a distinct *s* and with parameter \(\varvec{\theta }(s)\). It should not be viewed as a single varying-coefficient model with bivariate time-varying coefficients, because there are infinitely many values of *s* as well as a second argument *u* which corresponds to the time since each *s*. Note that *u* is on the residual survival time scale, so the set of model parameters \(\varvec{\theta }(s)\) is indexed by the landmark time *s* only, not by *u*. The parameterization (B) was proposed to deal with violation of the proportional hazards assumption. Since predictions are repeated at clinical visits, the prediction horizon is usually not very large, and the log hazard ratios often do not vary substantially within the horizon. Therefore, in this paper, we focus the discussion on parameterization (A):

$$\begin{aligned} \lambda (u; \varvec{Z}(s)) = \lambda _0(u, s) + \varvec{\beta }(s)^T \bar{\varvec{Z}}(s) . \end{aligned}$$

(2)

This parameterization can be estimated by kernel weighting [7], where both \(\lambda _0(u, s)\) and \(\varvec{\beta }(s)\) are estimated nonparametrically. In theory, (1) can also be estimated this way. The intuitive idea is to assign kernel weights to the rows of the landmark dataset whose \(t_{ij}\) is close to *s*, and estimate \(\varvec{\theta }(s)\) by fitting a Cox model with time-independent covariates to the weighted landmark data. Since a single subject may contribute multiple rows with positive weights, this is a Cox model for multivariate survival data with working independence, which is commonly used in landmark analysis research [1, 7, 15]. In this paper, we say this approach “localizes” on the landmark time \(\{ t_{ij} \}\). When the clinical visit times of different subjects are synchronized (i.e., \(t_{ij} = t_{j}\), such as discrete time points), the kernel approach reduces to just fitting a Cox model at each distinct landmark time [12]. When a uniform kernel function is used, weighting is equivalent to matching: a weight of 1 indicates being matched and a weight of 0 indicates being unmatched. For any new subject in the validation dataset, we search the training landmark dataset for rows with similar landmark times; a Cox model is fit to the matched landmark dataset and a prediction is generated based on the new subject’s predictor \(\varvec{Z}\).

In summary, the commonly used LA model (1) is a parameterization for the bilateral relationship between \(\varvec{Z}_{ij}\) and \(\tilde{T}_{ij}\) in the landmark dataset. This parameterization implies a distinct survival regression model at each landmark time *s*. This model can be estimated by a kernel weighting method that localizes on the landmark time. An algorithmic approach can also be used, where we match the new subject with the landmark dataset on the landmark time, fit a regression model to the matched data, and plug \(\varvec{Z}\) into that model to obtain the prediction.

### A different model parameterization: the generalized landmark analysis

When the baseline is not a clinically meaningful milestone, the time since baseline, i.e., the landmark time *s*, may not be a useful index of the model parameters with good interpretation. In this paper, we propose a different parameterization for this scenario. The matching interpretation of the estimation approach to model (1) suggests that, instead of matching on the landmark time, we could match on another time-varying predictor variable which better reflects disease severity or stage, denoted by \(\varvec{V}\). We call \(\varvec{V}\) a landmark variable. It is part of \(\varvec{Z}\). A survival regression model can then be fit to the matched landmark dataset to produce the prediction.

In light of the connection between matching and weighting, we can also use a kernel weighting approach. Let \(\varvec{V}^*\) be the landmark variable of a new subject in the validation dataset. The weights of the rows in the landmark training dataset are calculated by a kernel function \(K_h\) with bandwidth *h*: \(W_{ij} = K_h(\varvec{V}^* - \varvec{V}_{ij})\). For example, \(K_h(u) = \frac{3}{4h}\left\{ 1-(u/h)^2 \right\} \textbf{1}(|u|\le h)\) is the Epanechnikov kernel. Other kernel functions or local linear approximations [18] can also be used. Local polynomial theory suggests that the choice of kernel function and degree of the polynomial are less important than the bandwidth. For simplicity, we use the kernel, i.e., local constant, approach. The bandwidth is a tuning parameter of the algorithm and will be discussed later.
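As a sketch, the row weights for one new subject can be computed as follows; a scalar landmark variable and the Epanechnikov kernel are assumed for illustration.

```python
import numpy as np

def gla_weights(v_star: float, v_train: np.ndarray, h: float) -> np.ndarray:
    """Epanechnikov kernel weights W_ij = K_h(v_star - V_ij) for the rows
    of the landmark training dataset, given a new subject with landmark
    value v_star; v_train holds the V_ij values of the training rows."""
    u = (v_star - v_train) / h
    return np.where(np.abs(u) < 1.0, 0.75 * (1.0 - u**2) / h, 0.0)
```

Rows with weight zero drop out of the weighted model fit, which recovers the matching interpretation in the special case of a uniform kernel.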

Taking one step further, we can view this kernel weighting method as the estimation approach to the following parameterization of the bilateral relationship between \(\varvec{Z}_{ij}\) and \(\tilde{T}_{ij}\) in the landmark training dataset:

$$\begin{aligned} \lambda (u; \varvec{Z}(s)) = \lambda _0(u, \varvec{V}(s)) + \varvec{\beta }(\varvec{V}(s))^T \bar{\varvec{Z}}(s)~~, ~~~~ u \geqslant 0. \end{aligned}$$

(3)

Similar to (1), \(\bar{\varvec{Z}}(s)\) in this context denotes the vector of predictor variables without the landmark variable \(\varvec{V}\). The effect of \(\varvec{V}\) is absorbed into \(\lambda _0(u, \varvec{V})\). The development above leads to an analogue of the LA in terms of both model parameterization and estimation approach, with a change from the landmark time to the more general concept of a landmark variable. This change bypasses the use of the time since baseline when it does not have an explicit clinical interpretation. Both (1) and (3) are different parameterizations of this more general bilateral relationship, and (3) includes (2) as a special case because the landmark time is part of \(\varvec{Z}\). For this reason, we call (3) and its corresponding matching or weighting estimation the *generalized landmark analysis (GLA)*.

**Remark 1**. The matching idea is similar to the localized regression proposed by Kosel and Heagerty [13] (note: not local polynomial regression), though that paper was written in the context of cross-sectional data, whereas ours is for longitudinal data. The localized regression essentially matches training data with the subject in the validation data with respect to all predictors. However, such a matching approach is not feasible even with a small number of predictor variables in \(\varvec{Z}\), unless the sample size is very large or a large caliper is allowed. To solve this problem, we use a hybrid approach: matching/weighting on the landmark variable(s), followed by a regression analysis using the other variables and the matched or weighted data.

**Remark 2**. In (3), the effect of \(\varvec{V}\) is modeled nonparametrically, but the effects of the other predictors in \(\bar{\varvec{Z}}\) are modeled with a linearity assumption. Therefore, there is less chance of model misspecification on \(\varvec{V}\). This justifies selecting a strong prognostic variable as \(\varvec{V}\), because presumably, misspecifying the effect of a stronger predictor has a larger effect on the prediction result. In the LA, the landmark time is usually a strong predictor when the baseline marks the start of a risk-changing period. In the GLA, there are often stronger predictors than the time since baseline, because the latter does not have clinical meaning. For the CKD data application, the landmark variable can be the time-varying age or eGFR at clinical visits. The eGFR is the estimated glomerular filtration rate, an important biomarker for renal function. Matching on age at clinical visits is equivalent to the age-alignment method [19, 20], which was previously proposed for landmark analysis. However, it is widely known among nephrologists that eGFR is a much stronger predictor than age for terminal renal outcomes. In fact, the classification of CKD progression stages is based solely on eGFR values, not age. Broadly speaking, the GLA localizing on eGFR is equivalent to developing a prediction model tailored to the CKD stage of the new subject in the validation dataset. This perhaps has a better interpretation than the GLA localizing on age (a prediction model tailored to the new subject’s age), given that eGFR is clinically more relevant.

**Remark 3**. Exact matching may not be feasible with continuous variables. Caliper matching is equivalent to kernel weighting with a uniform kernel, unless it is restricted to the less efficient approach of selecting no more than one matched row per subject from the landmark training dataset. The implementation of GLA in this paper is based on kernel weighting. There could be more than one landmark variable in \(\varvec{V}\), as illustrated by the WisARD data example below. Based on the kernel regression literature, it should be rare to use more than two landmark variables, unless the dataset is extremely large.

**Remark 4**. The GLA algorithm has a tuning parameter, i.e., the bandwidth (or caliper in the case of matching). When a very large bandwidth is used, it is equivalent to fitting the landmark dataset \(\left\{ T_{ij}, \delta _{ij}, t_{ij}, \varvec{Z}_{ij} \right\}\) with a single survival model and assuming that the model parameters do not vary with the landmark variable. When a very small bandwidth is used, the model parameters are estimated with large sampling variation. Neither case is expected to produce optimal prediction. However, there is no guarantee that the relationship between bandwidth and prediction accuracy takes a U-shape with a single optimal bandwidth in the middle. Therefore, we recommend calculating the cross-validated prediction accuracy over a range of reasonable bandwidth choices and picking the one producing the best accuracy. This is illustrated in our data application below. If multiple bandwidths provide similar prediction accuracy in cross-validation, the smallest is preferred as long as the estimated parameters vary with the landmark variable. Loss of efficiency is another aspect of the bias-variance trade-off, but it is less of a concern when the prediction accuracy is similar. In practice, the distribution of the landmark variable may have some sparse regions. To deal with this issue, we can use an adaptive bandwidth choice, such as the span: we define the bandwidth as the \(\gamma\)-th quantile of the distances between \(\varvec{V}^*\) and all \(\varvec{V}_{ij}\) in the training dataset. Here \(\gamma \in (0, 1]\) is a pre-specified span parameter.
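The span-based adaptive bandwidth described above can be sketched as follows (a scalar landmark variable is assumed for illustration):

```python
import numpy as np

def span_bandwidth(v_star: float, v_train: np.ndarray, gamma: float) -> float:
    """Adaptive bandwidth: the gamma-th quantile (gamma in (0, 1]) of the
    distances between the new subject's landmark value v_star and all
    V_ij in the training dataset, so that roughly a fraction gamma of the
    training rows fall within the kernel support."""
    return float(np.quantile(np.abs(v_train - v_star), gamma))
```

Because the bandwidth adapts to the local density of \(\varvec{V}_{ij}\), sparse regions of the landmark variable automatically receive a wider window.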

### Prediction accuracy assessment

Following the common practice in the dynamic prediction literature [1, 2, 7, 15], we used two prediction accuracy measures in our data analysis: the area under the time-dependent ROC curve (AUC) [21] and the time-dependent Brier score (BS) [22]. They quantify the discrimination and calibration of the predicted risk score, respectively. Let \(\tilde{T}\) and \(p^*\) denote the time-to-event outcome and the predicted probability of the event occurring within the prediction horizon \(\tau\). The time-dependent sensitivity (*Se*) and specificity (*Sp*) are defined as \(Se(c) = P(p^* \geqslant c | \tilde{T} \leqslant \tau )\) and \(Sp(c) = P(p^* < c | \tilde{T} > \tau )\), where *c* is the threshold for a positive or negative prediction. The time-dependent ROC curve is a plot of *Se*(*c*) vs. 1 - *Sp*(*c*) over all possible values of *c*. The AUC is the area under this curve. The time-dependent Brier score is defined as \(E( 1\{ \tilde{T} \leqslant \tau \} - p^* )^2\). In the definitions above, the probability or expectation is defined on the distribution of \(\{ \tilde{T}, p^* \}\) in the validation data. Various statistical methods have been proposed to estimate the AUC and BS as defined above, accounting for censored \(\tilde{T}\) in the validation dataset [21,22,23]. To guard against over-fitting, we used cross-validation in the numerical studies of this paper. We randomly split the original dataset into a training and a validation dataset, each with half of the subjects. We fit the model using the training data and calculated the prediction accuracy using the validation data. We repeated the model-fitting and prediction accuracy assessment in 2,000 bootstrap samples drawn from the training and validation data, respectively, and then calculated the 95% bootstrap confidence interval. To further reduce the sampling variability of this process, we repeated the split 5 times and averaged the results.
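To fix ideas, the two measures can be computed directly when the validation outcomes are fully observed; the sketch below ignores censoring for clarity, whereas the estimators in [21,22,23] apply censoring corrections to these same definitions.

```python
import numpy as np

def time_dependent_auc(t_event: np.ndarray, p_star: np.ndarray, tau: float) -> float:
    """Empirical AUC of predicted risks p_star for the event {T <= tau},
    assuming (for clarity) that no subject is censored before tau."""
    case = t_event <= tau
    ctrl = ~case
    # P(p_case > p_ctrl) + 0.5 * P(tie), averaged over all case/control pairs
    diff = p_star[case][:, None] - p_star[ctrl][None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

def brier_score(t_event: np.ndarray, p_star: np.ndarray, tau: float) -> float:
    """Empirical time-dependent Brier score E(1{T <= tau} - p*)^2,
    under the same no-censoring assumption."""
    return float(np.mean(((t_event <= tau).astype(float) - p_star) ** 2))
```

Higher AUC indicates better discrimination, while a lower Brier score indicates better calibration of the predicted risks.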