 Research article
Evaluating screening approaches for hepatocellular carcinoma in a cohort of HCV related cirrhosis patients from the Veteran’s Affairs Health Care System
BMC Medical Research Methodology volume 18, Article number: 1 (2018)
Abstract
Background
Hepatocellular carcinoma (HCC) has limited treatment options in patients with advanced stage disease, and early detection of HCC through surveillance programs is a key component of reducing mortality. Current practice guidelines recommend that high-risk cirrhosis patients be screened every six months with ultrasonography, but these examinations are done in local hospitals with variable quality, leading to disagreement about the benefit of HCC surveillance. The well-established diagnostic biomarker α-fetoprotein (AFP) is used widely in screening, but its reported performance varies widely across studies. We evaluate two biomarker screening approaches, a six-month risk prediction model and a parametric empirical Bayes (PEB) algorithm, in terms of their ability to improve the likelihood of early detection of HCC compared to current AFP alone when applied prospectively in a future study.
Methods
We used electronic medical records from the Department of Veterans Affairs Hepatitis C Clinical Case Registry to construct our analysis cohort, which consists of serial AFP tests in 11,222 cirrhosis control patients and 902 HCC cases prior to their HCC diagnosis. The six-month risk prediction model incorporates routinely measured laboratory tests, age, and the rate of change in AFP over the past year, together with the current AFP. The PEB algorithm incorporates prior AFP screening values to identify patients with a significantly elevated level of AFP at their current screen. We split the analysis cohort into independent training and validation datasets. All model fitting and parameter estimation was performed using the training data, and the algorithm performance was assessed by applying each approach to patients in the validation dataset.
Results
When the screening-level false positive rate was set at 10%, the patient-level true positive rate using current AFP alone was 53.88%, while the patient-level true positive rate was 58.09% for the six-month risk prediction model (an increase of 4.21 percentage points) and 63.64% for the PEB approach (an increase of 9.76 percentage points). Both screening approaches identify a greater proportion of HCC cases earlier than using AFP alone.
Conclusions
The two approaches show greater potential to improve early detection of HCC compared to using the current AFP only and are worthy of further study.
Background
The incidence of hepatocellular carcinoma (HCC) in the United States has tripled over the last twenty years; however, the prognosis of patients diagnosed with HCC has remained poor, with five-year survival remaining less than 12% [1]. Patients with advanced stage HCC have few treatment options, with five-year survival between 0–10%, while those with early stage HCC have multiple treatment options (including surgical resection and liver transplantation), with five-year survival for patients receiving these treatments > 60% [2]. Early detection of HCC through surveillance programs is a key component in reducing mortality.
The majority (80–90%) of HCC cases occur in patients with cirrhosis. Targeted cancer surveillance programs focus on those patients at high risk of disease and aim to increase the likelihood of early detection of cancer, while maintaining reasonable costs. The American Association for the Study of Liver Diseases (AASLD) recommends ultrasonography every six months in patients with cirrhosis [3]. The majority of surveillance ultrasounds in the United States take place in local hospitals with variable quality because ultrasonography is operator dependent, not sensitive in detecting early lesions and difficult to perform in obese patients. While ultrasonography has greater than 90% specificity, the reported sensitivity varies between 65–80%. Consequently, there is disagreement in the field about the benefit of surveillance, since there has been little evidence of improved survival in the few randomized clinical trials conducted. Considerable research has focused on developing highly sensitive standardized biomarker screening tests to complement (or replace) ultrasonography and provide motivation for HCC surveillance. A potential approach needs thorough vetting prior to being used in a prospective screening trial where the algorithm is used to trigger additional diagnostic workups.
Serum α-fetoprotein (AFP) is a well-established diagnostic biomarker for HCC that is widely used in screening despite the wide variation in its reported performance. A population-based US cohort study found that among HCC patients with a prior diagnosis of cirrhosis who received regular surveillance, 52% received both ultrasonography and AFP, 46% received AFP alone and 2% received ultrasonography alone [4]. A 2017 update of the AASLD guidelines recommends surveillance using ultrasonography, with or without AFP, every six months [5]. The sensitivity for AFP varies between 41–65% and the specificity between 80–95% in both diagnostic and screening settings and across a range of study designs when using a threshold of 20 ng/ml [6]. While other FDA approved biomarkers exist, it is unlikely that any other biomarker will be integrated into widespread HCC surveillance practice in the United States in the near future. Methods that improve the performance of AFP, which can be used independently and in conjunction with ultrasonography, are critically needed in the short term.
In this paper, we evaluate two approaches to improve the performance of AFP screening. The first incorporates routinely measured laboratory tests for evaluating the underlying liver disease of patients with cirrhosis and the rate of change in AFP in a six-month risk prediction model [7, 8]. This approach was motivated by several studies that have explored the association between elevated AFP and other factors [9, 10]. In particular, Richardson et al. [11] found that in patients with no HCC, elevated AFP was associated with elevated alanine aminotransferase (ALT). Adjusting for these factors could improve the specificity of AFP in HCC surveillance. Since AFP is elevated in early stage HCC in only a subset of cases, including laboratory tests that monitor liver function could improve early detection of HCC.
The second approach is a parametric empirical Bayes (PEB) screening algorithm. The PEB method was first proposed by McIntosh & Urban [12] for cancer screening with a longitudinal biomarker. Previous algorithms for screening with longitudinal biomarkers, such as Skates et al. [13], required specifying the early preclinical behavior of the biomarker after disease onset in cases, which can be challenging when faced with limited serial data, as well as the biomarker trajectory in control patients. Patients whose biomarker trajectory to date more closely resembled that of a case than a control patient were flagged as positive screens. In contrast, the PEB algorithm specifies the biomarker trajectory in control patients only, for whom there is often a great deal of data, and flags any significant deviations from the expected behavior given the model and the patient’s own serial history to date. The PEB algorithm has been applied to serial AFP data from the Hepatitis C Antiviral Long-term Treatment against Cirrhosis (HALT-C) trial [14]. In this randomized controlled trial, the PEB algorithm improved the sensitivity of AFP by almost 17 percentage points compared to the standard thresholding approach (77.1% vs 60.4%) when the false positive rate among all screenings was set to 10%.
Our goal in this paper is to assess the performance of both the laboratory-based algorithm and the PEB algorithm for their ability to improve the likelihood of early detection of HCC when applied prospectively in a future study. We consider (1) the sensitivity at fixed false positive rates during the entire screening period, within periods close to diagnosis, and within periods close to diagnosis while excluding intervals very close to clinical diagnosis where clinically significant earlier detection is unlikely; (2) the true positive rate, false positive rate, positive predictive value, and negative predictive value curves; and (3) the timing of first positive screen for each approach.
The most common etiology for cirrhosis in the United States is hepatitis C virus (HCV) infection. The Department of Veterans Affairs (VA) is the largest integrated healthcare provider in the United States, and the veterans who utilize the VA are at high risk of HCV infection. The VA HCV Clinical Case Registry includes patient data for all HCV-infected patients at 128 VA facilities; the detailed patient records and history contained in the registry are possible because of the longstanding use of electronic medical records at VA facilities. Current VA good practice guidelines recommend regular surveillance testing with ultrasonography and AFP at 6–12 month intervals in patients with cirrhosis. In the VA HCV-cirrhosis cohort, disease progression, variability of the biomarkers under consideration and adherence to recommended screening visits more accurately reflect HCC screening in practice than a randomized controlled trial does. In this analysis, we examine whether we can improve the likelihood of earlier detection of HCC in the regular clinical care setting, within the largest health care system in the United States, either by using routine blood tests in addition to current AFP levels in the laboratory-based algorithm or by using longitudinal AFP screening history via the PEB algorithm.
Methods
VA cohort construction
The VA HCV Clinical Case Registry includes patient demographic characteristics, laboratory test results, inpatient and outpatient visits, diagnostic and procedure codes, and date of death for all HCV-infected patients at VA facilities. All patients in our analysis cohort had a positive HCV antibody and HCV RNA test between 10/1/1997 and 9/30/2005. We used three ICD-9 diagnostic codes to identify patients with cirrhosis from the HCV cohort. The date of the first appearance of either 571.2, 571.5 or 571.6 in the electronic medical records was defined to be the cirrhosis diagnosis date. This definition has been validated and found to have 90% positive predictive value and 87% negative predictive value [15]. Our analysis cohort consisted of patients with a cirrhosis diagnosis at any time prior to end of study (12/31/2006). All the available patient information between the HCV index date (date of HCV diagnosis) and HCC diagnosis or end of study (12/31/2006) was included in the analysis dataset. In most patients (∼ 80%), the HCV index date preceded the cirrhosis diagnosis date, and we chose to retain patient information between the HCV index date and the cirrhosis diagnosis date since the cirrhosis diagnosis is often delayed in these patients. If the HCV index date occurred after the cirrhosis diagnosis date, then it is likely that the cirrhosis diagnosis prompted the HCV testing.
The HCC diagnosis date was determined using both the ICD-9 codes and a subsequent manual structured review of the electronic medical records. First, we defined all patients with an ICD-9 code of 155.0 but without 155.1 to be probable HCC cases. A subset (∼ 82%) of these were manually reviewed, and the date of HCC diagnosis was defined as the date of the earliest appearance of a liver mass on ultrasound that was subsequently confirmed by computed tomography (CT), magnetic resonance imaging (MRI), and/or biopsy, or in the absence of a mass on ultrasound, by the first evidence on CT, MRI and/or biopsy. We excluded patients who had ICD-9 codes that indicated an HCC diagnosis but had no confirmation of the HCC diagnosis in the manual review of the electronic medical records. The date of the first appearance of ICD-9 codes was defined to be the HCC diagnosis date in the small subset of HCC cases that were not manually reviewed.
Additional inclusion criteria used to create the analysis dataset were: at least one valid (> 0 ng/mL) serum AFP during the study period and HCC diagnosed at least 6 months after HCV index date and prior to the end of the study (12/31/2006). The analysis dataset at this stage consisted of 12,508 patients, of whom 930 have an HCC diagnosis during the study period. The dataset construction flow diagram is given in Fig. 1.
In order to obtain unbiased estimates of each algorithm's performance, we split the analysis cohort into training and validation datasets. All model fitting and parameter estimation was performed using the training data, and the algorithm performance was assessed by independently applying each approach to patients in the validation dataset. The training and validation cohorts were constructed by randomly dividing both the HCC cases and controls into two subgroups of equal sample size. Then one random sample of HCC cases and controls formed the training dataset and the other random sample of HCC cases and controls formed the validation dataset.
We use the following notation in the description of both screening algorithms. Suppose that there are N_{ T } patients in the training dataset used to fit the model and N_{ V } patients in the validation dataset used to assess the model performance. Without loss of generality, we assume all time is measured from the HCV index date. In the notation that follows, the subscript i=1,…,N_{ T } indexes the training dataset patients and i=N_{ T }+1,…,N_{ T }+N_{ V } indexes the validation dataset patients. For those patients diagnosed with HCC during the study (i.e., cases), let δ_{ i }=1 and d_{ i } be the months to HCC diagnosis since HCV index date. For patients not diagnosed with HCC during the study (i.e., controls), let δ_{ i }=0 and d_{ i } be the months to the end of study since HCV index date.
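The split described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `split_cohort`, the use of integer patient identifiers and the fixed seed are hypothetical and not taken from the paper.

```python
import random


def split_cohort(case_ids, control_ids, seed=0):
    """Randomly halve cases and controls separately, then pair the halves.

    Hypothetical helper: identifiers and seed are illustrative only.
    """
    rng = random.Random(seed)

    def halve(ids):
        shuffled = list(ids)
        rng.shuffle(shuffled)
        mid = len(shuffled) // 2
        return shuffled[:mid], shuffled[mid:]

    train_cases, valid_cases = halve(case_ids)
    train_controls, valid_controls = halve(control_ids)
    training = {"cases": train_cases, "controls": train_controls}
    validation = {"cases": valid_cases, "controls": valid_controls}
    return training, validation
```

Splitting cases and controls separately keeps the case fraction the same in both halves, which a single shuffle of the pooled cohort would not guarantee.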
Screening algorithms
The standard approach to screening with AFP is to compare the biomarker level at each screening time to a fixed threshold. In our paper, we explore two approaches to screening that incorporate information beyond the current AFP level at each screening visit.
Laboratory-based algorithm
The first is based on using a short-term risk prediction model that incorporates AFP, the rate of change in AFP in the past year (if available), other laboratory tests and demographic variables to determine which patients are at high risk of developing HCC in the short term. The risk score from this model can then be used to determine which patients should be sent for further imaging because they could possibly have HCC.
El-Serag et al. [7] considered laboratory tests that are widely used, standardized, reproducible and clearly defined in the electronic medical records for inclusion in their six-month risk prediction model. The final model included AFP, ALT, platelets (PLT), age and two-way interactions between AFP and ALT and between AFP and PLT. Logarithmic transformations and spline functions were used to model the nonlinear relationship between the included covariates and the six-month probability of HCC. This model selection was done using the same cohort of patients that we are using in this study. Therefore, while we have attempted to reduce the bias from overfitting by splitting the VA cohort into training and testing datasets, we will require a new cohort of patients to get truly independent validation of the performance results for this proposed laboratory-based algorithm. In White et al. [8], the six-month risk prediction model was updated to include the rate of change in AFP in the past year, since it has been shown that both the current AFP levels and the trajectory of AFP are predictive of HCC [16]. Since not all patients will have an AFP measurement in the prior year, we have adapted their model to include change in AFP in the last year when it is available.
We use laboratory tests extracted from the electronic medical record to estimate the model parameters and assess the performance of the proposed laboratory-based algorithm. We consider each AFP test date to be a screening visit. It is unlikely that all the patients will have the other tests (ALT, PLT) performed on the same day; in practice these tests will not be rerun if they were recently performed. We considered any ALT or PLT laboratory test within 6 months prior to the AFP test to be a valid concurrent lab test.
A cross-sectional resampling approach was used to estimate the predicted probability of HCC. For each patient in the training dataset, a random screening visit was chosen from all possible screenings and the six-month risk prediction model was fit using a logistic regression model. This process was repeated 100 times and the parameter estimates from each iteration were saved. For each new patient, the predicted probability of HCC within six months is calculated by averaging the 100 estimates of the predicted probability of HCC within six months based on the parameter estimates from each iteration. The full details of the approach are described below.
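The rule for picking a concurrent ALT or PLT value can be illustrated as follows. This is a minimal sketch, assuming times are measured in months since the HCV index date and that the six-month window is closed at both ends; the function name `most_recent_lab` and the tuple layout are hypothetical.

```python
def most_recent_lab(lab_results, afp_time, window_months=6.0):
    """Return the latest lab value measured in [afp_time - window, afp_time].

    lab_results is a list of (time_in_months, value) pairs. Returns None
    when no result falls inside the window (assumed handling, not from
    the paper).
    """
    eligible = [
        (t, v) for t, v in lab_results
        if afp_time - window_months <= t <= afp_time
    ]
    if not eligible:
        return None
    # The most recent eligible measurement is the one with the largest time.
    return max(eligible, key=lambda tv: tv[0])[1]
```

For example, with ALT results at months 1, 5 and 9 and an AFP test at month 8, only the month-5 result is eligible: the month-1 result is too old and the month-9 result is after the AFP test.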
Each patient has n_{ i } AFP tests performed at screening visits {t_{ i j },j=1,…,n_{ i }}. We define an indicator variable for each screening visit that is 1 if the patient is diagnosed with HCC within six months of that visit and 0 otherwise; i.e., at the j^{th} screening visit for the i^{th} patient, D_{ i j }=1 if d_{ i }<(t_{ i j }+6) and δ_{ i }=1, and D_{ i j }=0 otherwise. For the i^{th} patient at the j^{th} screening visit, we extract AFP_{ i j }=AFP level in ng/ml, ALT_{ i j }=most recent ALT level in IU/ml measured within the interval [t_{ i j }−6,t_{ i j }], PLT_{ i j }= most recent PLT level in 1000’s measured within the interval [t_{ i j }−6,t_{ i j }] and Age_{ i j }= age in years at AFP test. Note that AFP_{ i j }, ALT_{ i j } and PLT_{ i j } are truncated at 1. We define an indicator function \(\Delta AFP_{ij}^{obs}\) that is 1 if t_{ i j }−t_{i(j−1)}≤12, i.e. the previous AFP measurement was within the last year, and 0 otherwise. The AFP rate of change in the previous year is defined to be log2(ΔAFP_{ i j })=[ log2(AFP_{ i j })− log2{AFP_{i(j−1)}}]/[{t_{ i j }−t_{i(j−1)}}/12].
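As a worked illustration of the rate-of-change definition, the sketch below computes the annualised change in log2(AFP) and the accompanying availability indicator. It assumes that "truncated at 1" means values below 1 are set to 1, and that a non-positive gap between tests is treated as unobserved; both are assumptions, and the function name is hypothetical.

```python
import math


def afp_rate_of_change(afp_now, afp_prev, t_now, t_prev):
    """Return (observed_flag, rate) for the change in log2(AFP) per year.

    Times are in months. The flag is 1 when the previous AFP test was
    within the last 12 months, and 0 (with rate None) otherwise.
    """
    gap = t_now - t_prev
    if gap <= 0 or gap > 12:
        return 0, None
    # Assumed reading of "truncated at 1": floor the raw values at 1 ng/ml.
    afp_now = max(afp_now, 1.0)
    afp_prev = max(afp_prev, 1.0)
    rate = (math.log2(afp_now) - math.log2(afp_prev)) / (gap / 12.0)
    return 1, rate
```

A doubling of AFP from 8 to 16 ng/ml over six months gives a change of 1 on the log2 scale in half a year, so the annualised rate is 2.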
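The six-month risk prediction model is
$$ \text{logit}\left\{\Pr\left(D_{ij}=1 \mid \mathbf{X}_{ij}\right)\right\} = \beta_{1} \mathbf{X}_{ij}^{T} \qquad (1) $$
where β_{1} is the row vector containing the model parameters and X_{ i j } is defined to be the row vector \([\mathbf {AFP}_{ij},\mathbf {ALT}_{ij},\mathbf {PLT}_{ij},\mathbf {Age}_{ij},\mathbf {AFP}_{ij}*\mathbf {ALT}_{ij}, \mathbf {AFP}_{ij}*\mathbf {PLT}_{ij}, 1-\Delta AFP_{ij}^{obs}, \mathbf {\Delta AFP}_{ij}]\), in which the bold terms denote transformed versions of the corresponding covariates and the rate-of-change term contributes only when \(\Delta AFP_{ij}^{obs}=1\).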
Note that I(·) is an indicator function that takes the value 1 when the argument is true and 0 when the argument is false.
The cross-sectional resampling algorithm used to estimate the predicted probability of HCC for each patient in the validation cohort is:

For each k=1,…,100,

1. Create the k^{th} cross-sectional draw from the longitudinal training data: for each patient draw a random visit t_{ i j } from {t_{ i j },j=1,…,n_{ i }} with replacement, i=1,…,N_{ T }.

2. Fit logistic regression model (1) to get parameter estimates \(\hat {\beta _{1}}_{k}\).

The predicted probability of HCC within six months at the j^{th} screening visit for the i^{th} patient (i=N_{ T }+1,…,N_{ T }+N_{ V }) is
$$\begin{array}{@{}rcl@{}} \eta_{ij}=\frac{1}{100} \sum_{k=1}^{100} \frac{\exp\left(\hat{\beta_1}_{k} \mathbf{X}_{ij}^{T}\right)}{1+\exp\left(\hat{\beta_1}_{k} \mathbf{X}_{ij}^{T}\right)} \end{array} $$
The laboratory-based algorithm will indicate a positive screen if η_{ i j } exceeds a prespecified threshold c.
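The resampling and averaging steps can be sketched without committing to a particular regression library. The code below is an illustration only: it shows how one visit per patient might be drawn and how the predicted probability is averaged over a list of already fitted coefficient vectors. The logistic regression fit itself is omitted, and all function names are hypothetical.

```python
import math
import random


def draw_cross_section(visits_by_patient, rng):
    """Pick one random screening visit per patient (one cross-sectional draw)."""
    return {pid: rng.choice(visits) for pid, visits in visits_by_patient.items()}


def logistic(linear):
    """Numerically stable inverse logit."""
    if linear >= 0:
        return 1.0 / (1.0 + math.exp(-linear))
    e = math.exp(linear)
    return e / (1.0 + e)


def averaged_risk(beta_draws, x):
    """Average the predicted probability over the fitted coefficient vectors.

    beta_draws holds one coefficient vector per resampling iteration and
    x is the covariate vector for a single screening visit.
    """
    probs = [
        logistic(sum(b * xi for b, xi in zip(beta, x)))
        for beta in beta_draws
    ]
    return sum(probs) / len(probs)


def lab_positive(eta, c):
    """Flag a positive screen when the averaged risk exceeds threshold c."""
    return eta > c
```

Averaging the probabilities rather than the coefficients matches the formula for η_{ i j }: each of the fitted models contributes its own prediction before the mean is taken.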
Parametric empirical Bayes algorithm
The second approach was proposed by McIntosh and Urban [12] and incorporates the longitudinal history of screening biomarker to define subject and screen specific thresholds. The defining feature of a useful screening biomarker is that it is predictable or stable in the absence of disease and exhibits a characteristic change after disease onset. For these biomarkers, the PEB algorithm incorporates the known information about the variability of longitudinal biomarker measurements within a patient and between patients to detect smaller but significant increases in the biomarker. In addition, the PEB algorithm may reduce the number of false positive screens in patients with no disease and a stable biomarker trajectory that is higher than average since it has the ability to learn from prior false positive screens.
Let Y_{ i j }= log2(AFP_{ i j }) be the transformed AFP level in the i^{th} patient at the j^{th} screen. The PEB approach assumes the following hierarchical model to describe the distribution of the transformed biomarker in the population of control patients.
$$ Y_{ij} \mid \theta_{i} \sim N\left(\theta_{i}, \sigma^{2}\right), \qquad \theta_{i} \sim N\left(\bar{\theta}, \tau^{2}\right) $$
I.e. given the patient-specific mean θ_{ i }, the transformed biomarker levels Y_{ i j } are independent and identically distributed with mean θ_{ i } and variance σ^{2}, and θ_{ i } itself is normally distributed with mean \(\bar {\theta }\) and variance τ^{2}. The within-subject variance σ^{2} and between-subject variance τ^{2} are key measures that affect the performance of the PEB algorithm. Y_{ i j } can be centered and rescaled to simplify the derivation. Let \(Z_{ij}=(Y_{ij}-\bar {\theta })/\sqrt {\sigma ^{2} + \tau ^{2}}\). Then
$$ Z_{ij} \mid \mu_{i} \sim N\left(\mu_{i}, 1-B_{1}\right), \qquad \mu_{i} \sim N\left(0, B_{1}\right) $$
where \(\mu_{i}=(\theta_{i}-\bar{\theta})/\sqrt{\sigma^{2}+\tau^{2}}\) and \(B_{1}=\tau^{2}/(\sigma^{2}+\tau^{2})\).
Note that a simple calculation verifies that the marginal distribution of Z_{ i j } is the standard normal distribution. The PEB algorithm can be modified using different distributional assumptions but we continue with the original formulation of the approach for two reasons. Firstly, screening rules are invariant to monotonic transformation so for any continuous marker, a transformation to normality is assured and secondly, the hierarchical normal model results in simplified derivations and calculations. In the implementation of the PEB algorithm, standard tests of normality can be used to evaluate whether the chosen transformation is appropriate.
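The calculation referred to above can be written out on the original scale, using only the hierarchical normal model for Y_{ i j } described in the text. Writing Y_{ i j } as θ_{ i } plus an independent normal error, the marginal distribution is normal with the moments below; the final line also records the variance of the rescaled patient-specific mean, which is the quantity the surrounding text calls B_{1}.

```latex
\begin{align*}
E(Y_{ij}) &= E\{E(Y_{ij}\mid\theta_{i})\} = E(\theta_{i}) = \bar{\theta},\\
\mathrm{Var}(Y_{ij}) &= E\{\mathrm{Var}(Y_{ij}\mid\theta_{i})\}
  + \mathrm{Var}\{E(Y_{ij}\mid\theta_{i})\} = \sigma^{2} + \tau^{2},\\
Z_{ij} &= \frac{Y_{ij}-\bar{\theta}}{\sqrt{\sigma^{2}+\tau^{2}}} \sim N(0,1),\\
\mathrm{Var}\left(\frac{\theta_{i}-\bar{\theta}}{\sqrt{\sigma^{2}+\tau^{2}}}\right)
  &= \frac{\tau^{2}}{\sigma^{2}+\tau^{2}} = B_{1}.
\end{align*}
```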
The standard threshold approach ignores the prior screening history of the patient and instead uses the same threshold for all patients. One possible approach for determining this threshold is to use the above model, which describes the transformed biomarker distribution in the control population, to specify a threshold that controls the population-wide false positive rate (FPR). Since Z_{ i j } is assumed to follow a standard normal distribution, \(Pr(Z_{ij} > z_{1-f_{0}})=f_{0}\), where \(z_{1-f_{0}}\) is the 100(1−f_{0}) percentile of the standard normal distribution. Therefore, using the standard threshold screening rule, patient i has a positive screen at the j^{th} screening visit if \(Z_{ij} > z_{1-f_{0}}\).
If the patient’s mean biomarker level (μ_{ i }) were known, we could define an individually tailored screening rule that still ensures the population-wide FPR is not more than f_{0}, since given μ_{ i }, \((Z_{ij}-\mu _{i})/\sqrt {1-B_{1}}\) follows a standard normal distribution. Therefore \(Pr\{(Z_{ij}-\mu _{i})/\sqrt {1-B_{1}} > z_{1-f_{0}} \mid \mu _{i}\}=f_{0}\) and patient i has a positive screen at the j^{th} screening visit if \(Z_{ij} > \mu _{i} + z_{1-f_{0}}\sqrt {1-B_{1}}\).
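However μ_{ i } is not known, so instead we use the PEB estimate of this parameter. This estimate, denoted by \(\hat {\mu }_{ij}\), is a weighted average of the population mean (which is 0 in this case) and the sample average of the patient's screening history. The PEB screening rule then indicates a positive screen for patient i at the j^{th} screening visit if
$$ Z_{ij} > \hat{\mu}_{ij} + z_{1-f_{0}}\sqrt{1-B_{1}B_{j}} $$
where \(\hat {\mu }_{ij} = 0*(1-B_{j}) + \bar {Z}_{ij}*B_{j}\), \(\bar {Z}_{ij}=\frac {1}{j-1}\sum _{j'=1}^{j-1} Z_{ij'}\) and \(B_{j}=\frac {\tau ^{2}}{\sigma ^{2}/(j-1) + \tau ^{2}}\) for j≥2. The B_{1} inside the square root is the between-subject fraction of the total variance, τ^{2}/(σ^{2}+τ^{2}), as in the conditional variance 1−B_{1} above. At the first screen (j=1) there is no screening history, so \(\hat{\mu}_{i1}=0\), the shrinkage weight is 0 and the rule reduces to the standard threshold.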
To implement the PEB screening algorithm, we require estimates for the parameters \(\bar {\theta }\), σ^{2} and τ^{2}. These can be obtained by fitting a linear mixed model with a random intercept in the control patients from the training cohort. We then apply the PEB screening rule to all the screenings conducted in the validation cohort.
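A minimal sketch of a PEB-style screening rule is given below. It assumes the threshold takes the McIntosh and Urban form, in which the standardized current value is compared with the shrunken history mean plus a normal quantile scaled by the square root of 1 minus the product of the single-screen and n-screen shrinkage factors, where n is the number of prior screens. The function name and argument layout are hypothetical, and the variance parameters would in practice come from the fitted random-intercept model.

```python
from statistics import NormalDist


def peb_positive(z_history, z_current, sigma2, tau2, f0=0.10):
    """Return (flag, threshold) for one screen under an assumed PEB rule.

    z_history: standardized values from the patient's earlier screens.
    sigma2, tau2: within- and between-subject variances on the log2 scale.
    f0: target false positive rate.
    """
    n = len(z_history)
    b1 = tau2 / (sigma2 + tau2)          # between-subject share of variance
    if n == 0:
        b_n, mu_hat = 0.0, 0.0           # no history: population mean of 0
    else:
        b_n = tau2 / (sigma2 / n + tau2)  # shrinkage weight with n prior screens
        mu_hat = b_n * sum(z_history) / n
    z_crit = NormalDist().inv_cdf(1.0 - f0)
    threshold = mu_hat + z_crit * (1.0 - b1 * b_n) ** 0.5
    return z_current > threshold, threshold
```

With σ^{2}=τ^{2}=1 and two earlier screens both at Z=2, a current value of 2 is not flagged, because the patient-specific threshold rises to about 2.38, whereas the fixed threshold of about 1.28 would have flagged it. This is the sense in which the algorithm can learn from a stable but higher than average trajectory.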
Incorporation of an OR rule
In clinical practice, if the current AFP level is very high (e.g. AFP_{ i j }≥400 ng/ml) then the patient will automatically be sent for follow-up imaging with CT or MRI and no additional screening algorithm will be used. To formalize this practice, we include an OR rule in the implementation of both the screening algorithms in the validation dataset. This approach is called an OR rule since patient i has a positive screen at time t_{ i j } if AFP_{ i j }≥400 ng/ml or if the screening algorithm indicates a positive screen. We define a general variable P_{ i j }(·) that is 1 when patient i has a positive screen at time t_{ i j } and 0 otherwise and is a function of the thresholding parameter for each algorithm. For the laboratory-based algorithm, P_{ i j }(·) is defined to be
$$ P_{ij}(c) = I\left(AFP_{ij} \geq 400 \text{ or } \eta_{ij} > c\right) $$
and for the PEB algorithm it is
$$ P_{ij}(f_{0}) = I\left[AFP_{ij} \geq 400 \text{ or } 1-\Phi\left\{\frac{Z_{ij}-\hat{\mu}_{ij}}{\sqrt{1-B_{1}B_{j}}}\right\} < f_{0}\right] $$
where Φ is the standard normal cumulative distribution function. The threshold of 400 ng/ml for AFP was chosen because it corresponds to a very low false positive rate (0.006) in our training dataset.
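The OR rule itself is a one-line combination, sketched here for concreteness. The function name is hypothetical; `algorithm_positive` stands for the result of whichever screening algorithm is in use.

```python
def or_rule_positive(afp_ng_ml, algorithm_positive, afp_cutoff=400.0):
    """Positive screen if AFP is at or above the cutoff OR the algorithm flags it."""
    return afp_ng_ml >= afp_cutoff or bool(algorithm_positive)
```

Because the high-AFP branch fires regardless of the algorithm, the two algorithms can only differ in the screens where AFP is below 400 ng/ml.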
Evaluation of screening algorithms
The standard measures used to evaluate the performance of biomarker screening approaches are based on screening at a single time point. For example, sensitivity is the proportion of cases with a positive test and specificity is the proportion of controls with a negative test. We have extended these definitions to the longitudinal screening setting. In patients not diagnosed with HCC during the study period, it is clear that any negative screening result is a true negative, while any positive screening result is a false positive screen. However, in patients diagnosed with HCC, we do not know when the cancer started developing; we only know when it was clinically diagnosed. Therefore, we consider multiple possible definitions for sensitivity and specificity in the longitudinal setting that all depend on which screenings in HCC cases are considered true positive screens and which are considered false positive screens because it is unlikely that additional imaging with CT or MRI would have resulted in detection of HCC at that time.
In Fig. 2 we illustrate these definitions, where we progressively increase the time period prior to clinical diagnosis during which positive screens in HCC cases were considered to be true positive screens. In definitions A1-D1, we considered all screenings prior to clinical diagnosis of HCC when calculating the patient-level sensitivity. In definitions A2-D2, we excluded screenings within three months of clinical diagnosis of HCC when calculating the patient-level sensitivity, since the goal of the screening algorithms is to increase the earlier detection of HCC and a positive screen result within three months of the clinical diagnosis of HCC is unlikely to result in a clinically significant difference in the prognosis for a patient.
We then define patient-level sensitivity or true positive rate (TPR) as the probability of an HCC case having at least one positive screen during the specified preclinical detection period indicated in Fig. 2:
$$ TPR(\tau_{1},\tau_{2}) = \Pr\left\{\sum_{j=1}^{n_{i}} P_{ij}(\cdot)\, I\left(d_{i}-\tau_{1} \leq t_{ij} \leq d_{i}-\tau_{2}\right) \geq 1 \;\middle|\; \delta_{i}=1\right\} $$
where τ_{1} and τ_{2} define the boundaries within which a positive screen was considered to be a true positive. For example, in definition A1, τ_{1}=6 months and τ_{2}=0 months and in definition A2, τ_{1}=6 months and τ_{2}=3 months. We defined sensitivity at the patient level because the goal is to assess the future performance of the algorithm in terms of the number of HCC cases that could be detected prior to clinical diagnosis. In the future, a single positive screen that leads to confirmation of HCC via additional imaging would terminate further screening.
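Screening-level FPR (1 − specificity) was defined as the probability of a positive screen among (1) all the screenings conducted in the control patients and (2) the screenings conducted in HCC cases that are considered to be outside the detection period indicated in Fig. 2:
$$ FPR(\tau_{1},\tau_{2}) = \Pr\left\{P_{ij}(\cdot)=1 \;\middle|\; \delta_{i}=0, \text{ or } \delta_{i}=1 \text{ and } t_{ij} \notin [d_{i}-\tau_{1},\, d_{i}-\tau_{2}]\right\} $$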
The FPR was defined at the screening level because each false positive result would lead to further testing that can be expensive and may increase the likelihood of complications and anxiety.
The positive predictive value (PPV) was defined as the probability that a positive screen occurs in an HCC case within the specified preclinical detection period indicated in Fig. 2:
$$ PPV(\tau_{1},\tau_{2}) = \Pr\left\{\delta_{i}=1 \text{ and } t_{ij} \in [d_{i}-\tau_{1},\, d_{i}-\tau_{2}] \;\middle|\; P_{ij}(\cdot)=1\right\} $$
This measure was reported at the screening level because the goal is to evaluate the probability of any positive screen being a true positive.
The negative predictive value (NPV) was defined as the probability that a negative screen occurs (1) in a control patient or (2) in an HCC case outside the detection period indicated in Fig. 2:
$$ NPV(\tau_{1},\tau_{2}) = \Pr\left\{\delta_{i}=0, \text{ or } \delta_{i}=1 \text{ and } t_{ij} \notin [d_{i}-\tau_{1},\, d_{i}-\tau_{2}] \;\middle|\; P_{ij}(\cdot)=0\right\} $$
This measure was reported at the screening level because the goal is to evaluate the probability of any negative screen being a true negative. Note that both the PPV and NPV measures are influenced by the prevalence of HCC in our analysis cohort as well as the number of screenings conducted in patients. In Additional file 1: Appendix A, we provide estimators of the four measures that we used in our analysis.
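The four measures can be illustrated with simple empirical proportions. The sketch below is not the estimator from Additional file 1: Appendix A; it assumes a closed detection window, counts every HCC case in the TPR denominator, and uses a hypothetical data layout in which each patient is a dictionary with `is_case`, `d` (months to diagnosis or end of study) and a list of `(time, positive)` screens.

```python
def in_window(t, d, tau1, tau2):
    """True if screen time t lies in [d - tau1, d - tau2]; tau1=None means no lower bound."""
    lower = float("-inf") if tau1 is None else d - tau1
    return lower <= t <= d - tau2


def ratio(num, den):
    """Proportion, or None when the denominator is empty."""
    return num / den if den else None


def screening_metrics(patients, tau1, tau2):
    """Patient-level TPR and screening-level FPR, PPV and NPV."""
    tp = fp = tn = fn = 0
    cases_total = cases_detected = 0
    for p in patients:
        detected = False
        for t, positive in p["screens"]:
            target = p["is_case"] and in_window(t, p["d"], tau1, tau2)
            if target and positive:
                tp += 1
                detected = True
            elif target:
                fn += 1
            elif positive:
                fp += 1
            else:
                tn += 1
        if p["is_case"]:
            cases_total += 1
            cases_detected += int(detected)
    return {
        "TPR": ratio(cases_detected, cases_total),
        "FPR": ratio(fp, fp + tn),
        "PPV": ratio(tp, tp + fp),
        "NPV": ratio(tn, tn + fn),
    }
```

Note the asymmetry the text describes: TPR is counted once per HCC case, while FPR, PPV and NPV are counted once per screening, so a positive screen in an HCC case outside the window adds to the false positives.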
Results
Of the 12,508 patients in the analysis cohort, 12,124 had at least one AFP test with both an ALT and PLT laboratory test performed within the prior six months. This cohort of patients was randomly split into the training and validation cohorts each consisting of 451 HCC cases and 5611 controls. Our goal is to assess the performance of each of the screening algorithms within the OR rule, i.e. the patient has a positive screen if either AFP≥400 ng/ml or the screening algorithm indicates a positive screen. Therefore the training cohort was further restricted to only those with AFP<400 ng/ml since the screening algorithms will only be applied in those patients. We do not restrict the validation cohort since our goal is to assess the performance of the screening algorithms as they would be used in clinical practice, which includes the OR rule. Note that in our analysis we have patients with multiple laboratory tests on the same day. For these patients, the multiple laboratory tests on the same day were summarized (average of the log2 measurements) and this value was used in the analysis.
In Table 1 we describe the training and validation cohorts. Across the cohorts, we observe that age at baseline (first AFP test), the proportion of white and black patients, the months between AFP tests and the baseline AFP, ALT and PLT were all similar within controls and HCC cases. In control patients, baseline AFP and ALT were slightly lower and baseline PLT was slightly higher compared to those patients eventually diagnosed with HCC. The average screening interval was around 12 months. Approximately 28% of the patients had only a single AFP test during the study, while ∼ 22% had more than four AFP tests during the study.
In the tables and figures, the “AFP only” approach is a sixmonth risk prediction model with AFP only, the laboratorybased algorithm is referred to as the “AFP+Lab+ ΔAFP” algorithm and the PEB algorithm applied to AFP is referred to as the “PEB: AFP” approach. In the first comparison of the screening algorithms, we focused on the patientlevel TPR when the screeninglevel FPR was fixed. In Table 2, the screeninglevel FPR was fixed at 10% and in Table A in Additional file 1: Appendix B, the screeninglevel FPR was fixed at 5%. We observe that both the laboratorybased algorithm and the PEB approach show improved TPR over the standard thresholding approach with AFP only across all the definitions of true positive screenings in HCC cases (A1D1 and A2D2). The TPR of the PEB algorithm was 9.75% greater than the standard thresholding approach with AFP only (63.64% vs 53.88%) and 5.55% greater than the AFP+Lab+ ΔAFP approach (63.64% vs 58.09%) over the entire screening period (definition D1) and 5.81% greater than the standard thresholding approach with AFP only (60.23% vs 54.42%) and 1.16% greater than the AFP+Lab+ ΔAFP approach (60.23% vs 59.07%) in the twoyears prior to clinical diagnosis (defintion C1).
When the screening-level FPR is fixed at 5% (Table A in Additional file 1: Appendix B), we observe that the PEB algorithm outperforms the other approaches for all the definitions of true positive screenings in HCC cases except when compared to the AFP+Lab+ΔAFP approach in the three to six and three to twelve months prior to clinical diagnosis (definitions A2 and B2). In the remaining analyses we focus on 10% screening-level FPR because HCC screening is performed in high-risk cirrhosis patients and therefore we can allow for a higher number of false positive screenings. In our validation cohort a 10% screening-level FPR corresponds to a fixed threshold of 35.7 ng/ml for AFP in the standard approach based on definition D1. This is higher than the most commonly used threshold for AFP of 20 ng/ml, which would have a higher screening-level FPR.
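The link between a fixed screening-level FPR and a marker threshold can be illustrated with a simple empirical sketch. It assumes that a screen is positive when its value is at least the threshold and that the FPR is the plain proportion of control screens flagged; the estimators actually used are defined in Additional file 1: Appendix A and may differ. The function names are hypothetical.

```python
from bisect import bisect_left


def screening_fpr(control_values, c):
    """Share of control screens flagged positive by the rule value >= c."""
    ordered = sorted(control_values)
    return (len(ordered) - bisect_left(ordered, c)) / len(ordered)


def threshold_for_fpr(control_values, target_fpr):
    """Smallest observed value c whose empirical FPR does not exceed target_fpr.

    Returns None when even the largest observed value flags too many screens
    (possible when many screens are tied at the maximum).
    """
    ordered = sorted(control_values)
    n = len(ordered)
    for c in ordered:  # candidates in increasing order, so FPR is non-increasing
        if (n - bisect_left(ordered, c)) / n <= target_fpr:
            return c
    return None
```

Because the empirical FPR can only decrease as the threshold rises, any threshold below the one returned for a 10% target (for example 20 ng/ml versus 35.7 ng/ml) necessarily flags at least as many control screens.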
We chose a split-sample approach with training and validation cohorts to evaluate our HCC screening algorithms since we have a large cohort with 902 HCC cases and 11,222 controls. In a sensitivity analysis, we utilized an out-of-bag bootstrap validation approach, where each bootstrap training cohort consisted of 12,124 patients drawn with replacement from the full analysis cohort and each bootstrap validation cohort consisted of all the patients not included in the bootstrap training cohort. The model parameters for each of the HCC screening algorithms were estimated using the training cohort, the screening algorithms were implemented in the validation cohort and the patient-level TPR at 10% screening-level FPR was estimated. This procedure was repeated 300 times and the results were averaged over the bootstrap iterations. In Table B in Additional file 1: Appendix B, we observe that the results are mostly consistent: both the laboratory-based algorithm and the PEB approach show improved TPR over the standard thresholding approach with AFP only across all the definitions of true positive screenings in HCC cases except one. For definition A2 with a restrictive time frame (only positive screens within 3–6 months prior to HCC diagnosis are true positives) and fewer HCC cases, the PEB algorithm and AFP only algorithm are approximately equivalent. In the two years prior to HCC diagnosis (definition C1), the TPR of the PEB algorithm was 5.03% greater than the standard thresholding approach with AFP only (61.26% vs 56.23%) and 1.57% greater than the AFP+Lab+ΔAFP approach (61.26% vs 59.69%).
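The out-of-bag bootstrap procedure described above can be sketched as follows. This is an illustrative outline only: `fit` and `evaluate` are hypothetical callables standing in for parameter estimation and for the TPR estimate at 10% screening-level FPR, and resampling is done at the patient level as in the text.

```python
import random


def out_of_bag_splits(patient_ids, n_iterations, seed=0):
    """Yield (training, validation) lists of patient ids for each iteration.

    Training ids are drawn with replacement and have the same size as the
    full cohort; validation ids are the patients never drawn.
    """
    rng = random.Random(seed)
    ids = list(patient_ids)
    for _ in range(n_iterations):
        training = [rng.choice(ids) for _ in ids]
        in_bag = set(training)
        validation = [pid for pid in ids if pid not in in_bag]
        yield training, validation


def bootstrap_average(patient_ids, fit, evaluate, n_iterations=300, seed=0):
    """Average an evaluation measure over out-of-bag bootstrap iterations."""
    estimates = []
    for training, validation in out_of_bag_splits(patient_ids, n_iterations, seed):
        if not validation:  # degenerate draw with no out-of-bag patients
            continue
        model = fit(training)
        estimates.append(evaluate(model, validation))
    return sum(estimates) / len(estimates)
```

On average roughly a third of the patients are left out of each bootstrap training cohort, so every iteration evaluates the algorithms on patients that did not contribute to parameter estimation.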
In Fig. 3, we compared the patient-level TPR, screening-level FPR, PPV and NPV curves of the screening algorithms when only positive screens within two years of clinical diagnosis were considered to be true positive screens (definition C1). We defined the four measures in the “Evaluation of screening algorithms” section, where all measures are functions of the thresholding parameter for each screening algorithm (c or 1−f_{0}). In order to standardize the curves for each approach, we redefined each measure to be a function of the risk percentile: the proportion of screens that lie below c or 1−f_{0}. In addition, we estimated the risk of HCC within τ_{1}=24 months for each decile. That is, we estimated the probability of being diagnosed with HCC within the next two years, given that a patient’s current screen places them within the k^{th} decile, for k=1,…,10. In Fig. 3, we used a cubic spline to create the estimated risk curve.
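The risk percentile and the decile-specific risk can be illustrated with a naive empirical sketch. It assumes complete two-year follow-up for every screen and uses simple proportions, which is a simplification of the estimators in Additional file 1: Appendix A; the function names and inputs are hypothetical.

```python
from bisect import bisect_left


def risk_percentile(all_scores, threshold):
    """Proportion of screens whose score lies strictly below the threshold."""
    ordered = sorted(all_scores)
    return bisect_left(ordered, threshold) / len(ordered)


def decile_risk(scores, hcc_within_window):
    """Naive share of screens followed by HCC within the window, per decile.

    `scores` holds one algorithm score per screen and `hcc_within_window`
    the matching booleans. Deciles are indexed 0..9 by the rank of each
    score; an empty decile yields None.
    """
    ordered = sorted(scores)
    n = len(ordered)
    totals = [0] * 10
    events = [0] * 10
    for score, event in zip(scores, hcc_within_window):
        rank = bisect_left(ordered, score)  # number of screens below this one
        k = min(rank * 10 // n, 9)          # integer arithmetic avoids float error
        totals[k] += 1
        events[k] += int(event)
    return [events[k] / totals[k] if totals[k] else None for k in range(10)]
```

Expressing every algorithm on the percentile scale is what allows the curves in Fig. 3 to share a single horizontal axis, even though the underlying thresholds (c or 1−f_{0}) are on different scales.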
We observed small differences in the PPV and NPV curves across the screening algorithms (middle panel of Fig. 3). In the bottom panel of Fig. 3, we observed the screening-level FPR had a linear relationship with the risk percentile that was the same for each approach (by definition) and that there was separation between the patient-level TPR across the different methods.
The structure of Fig. 3 allows for comparison across the different methods and conveys a great deal of information. For example, we illustrate how to extract the results for Table 2 from these curves in Fig. 3. In the bottom panel, we fix the screening-level FPR at 0.1 and find the corresponding risk percentile. Using vertical dashed lines in each panel, we can extract the patient-level TPR, PPV and NPV as well as the corresponding estimate of the risk of HCC within two years for each algorithm. Alternatively, we could fix any other measure at a prespecified level and compare the screening algorithms with respect to the remaining measures. In Additional file 1: Appendix B, we include the corresponding figures for definitions A1, B1 and D1 in Figures A, B and C respectively.
Next, we evaluated the screening algorithms at the individual patient level. When we considered all positive screens more than two years prior to clinical diagnosis in HCC cases to be false positive screens (definition C1) and fixed the screening-level FPR at 10%, we observed that 4.15%, 3.93% and 1.79% of patients have more than two false positive screens using the AFP only, AFP+Lab+ΔAFP and PEB approaches, respectively. Since we have fixed the number of false positive screenings allowed in each method, this illustrates how each method distributes the number of false positive screens across the patients and reveals one of the advantages of the PEB approach: the ability to learn from prior false positive screens and reduce the number of false positive screens in an individual patient.
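The patient-level summary above, the share of patients with more than two false positive screens, can be computed with a short sketch such as the following. The input layout (one patient id per false positive screen) and the function name are assumptions made for illustration.

```python
from collections import Counter


def share_with_more_than(fp_screen_patient_ids, all_patient_ids, k=2):
    """Share of patients with more than k false positive screens.

    `fp_screen_patient_ids` lists the patient id of every false positive
    screen (one entry per screen); `all_patient_ids` lists every patient
    screened, including those with no false positives.
    """
    counts = Counter(fp_screen_patient_ids)
    patients = set(all_patient_ids)
    return sum(1 for pid in patients if counts[pid] > k) / len(patients)
```

Two algorithms with the same total number of false positive screens can give very different values here: six false positives concentrated in two of four patients yield a share of 0.5, whereas the same six spread as 2, 2, 1 and 1 yield 0.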
Among the 430 HCC cases with screenings in the two years prior to clinical diagnosis, 282 had at least one positive screen and 148 had no positive screening for any of the approaches during this period. In Fig. 4, we compare the timing of the first positive screens in the 282 HCC cases that were flagged positive by at least one screening algorithm. The time of the first positive screen for any screening algorithm with no positive screens during the two years prior to clinical diagnosis was defined to be the clinical diagnosis time. In the first panel, we compared the AFP only approach to the AFP+Lab+ΔAFP algorithm and observed that while 69.86% of the HCC cases were first flagged positive at the same screening visit, 17.73% were flagged first by the AFP+Lab+ΔAFP algorithm compared to the 6.38% that were first flagged positive by the AFP only approach. In the middle panel, we observe that while a similar proportion of the HCC cases were first flagged positive at the same time by both the AFP only approach and the PEB approach (70.21%), 20.92% were flagged first by the PEB algorithm while only 3.19% were first flagged positive by the AFP only approach. The earlier positive screens for the PEB approach are also evident in the third panel, which compares the PEB approach to the AFP+Lab+ΔAFP algorithm.
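The pairwise timing comparison can be sketched as below. The sketch assumes that each algorithm's first positive screen is recorded as a lead time in months before clinical diagnosis, and that a case with no positive screen for an algorithm is assigned a lead time of zero (the clinical diagnosis time), following the convention stated above. The function name and dictionary layout are hypothetical.

```python
def compare_first_flags(lead_a, lead_b, patients):
    """Compare when two algorithms first flag each HCC case.

    `lead_a` and `lead_b` map patient id to the lead time (months before
    clinical diagnosis) of the first positive screen. Patients missing from
    a mapping had no positive screen and get a lead time of 0.0.
    """
    same = a_first = b_first = 0
    for pid in patients:
        a = lead_a.get(pid, 0.0)
        b = lead_b.get(pid, 0.0)
        if a == b:
            same += 1
        elif a > b:  # longer lead time means an earlier flag
            a_first += 1
        else:
            b_first += 1
    n = len(patients)
    return {"same": same / n, "a_first": a_first / n, "b_first": b_first / n}
```

Under this convention a case flagged by neither of the two algorithms being compared is counted as flagged at the same time.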
Discussion
We have evaluated multiple approaches for HCC screening in a cohort of active HCV-related cirrhosis patients from the VA patient population between 1997 and 2006. Each of the approaches under consideration included information beyond the current AFP level to increase the number of patients that are flagged with positive screens. Across all the analyses, we observed that including additional widely available and objective information leads to improvements in HCC screening performance measures. The goal of HCC screening is to detect HCC earlier, when there are potentially more curative treatment options available to the patient, and screening algorithms that have positive screens in the one to two years prior to clinical diagnosis of HCC are more likely to lead to earlier detection of HCC.
The performance of the PEB algorithm is affected by the variability of AFP both within and between patients. In the HALT-C Trial, a clinical trial with strict patient inclusion criteria, the PEB algorithm improved the sensitivity of AFP by 16.7% compared to the standard thresholding approach (77.1% vs 60.4%) when the screening-level false positive rate was set to 10% and all positive screens in HCC cases were considered to be true positive screens. By comparison, within this VA cohort, which is a more realistic setting for HCC screening, we observed a 9.76% improvement in the PEB algorithm compared to the standard thresholding approach (63.64% vs 53.88%). The within- and between-subject variability of AFP across control patients in these study populations could explain the difference in performance. In the HALT-C Trial, the between-subject variance of log2(AFP) was 1.77, the within-subject variance was 0.39 and the resulting intraclass correlation (ratio of between-subject variance to total variance) was 0.82. In the VA cohort, the between-subject variance was 1.90, the within-subject variance was 0.71 and the intraclass correlation was then 0.73. Therefore, in the VA cohort we observed almost twice the variability in the longitudinal AFP measurements within a patient compared to the HALT-C Trial in patients that do not develop HCC. In this study we do not have information regarding the brand of AFP assay kits used, but this could be a source of the additional variability observed in the VA cohort that we are unable to quantify.
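The intraclass correlations quoted above follow directly from the reported variance components. The short check below reproduces them from the numbers in the text; the function name is chosen for illustration.

```python
def intraclass_correlation(between_var, within_var):
    """Ratio of between-subject variance to total variance."""
    return between_var / (between_var + within_var)


# Variance components of log2(AFP) in control patients, as reported in the text
icc_halt_c = intraclass_correlation(between_var=1.77, within_var=0.39)
icc_va = intraclass_correlation(between_var=1.90, within_var=0.71)

# Ratio of within-subject variances (VA cohort relative to HALT-C Trial)
within_ratio = 0.71 / 0.39

print(round(icc_halt_c, 2), round(icc_va, 2), round(within_ratio, 2))
```

The within-subject variance ratio of about 1.82 is the basis for the statement that the VA cohort shows almost twice the within-patient variability.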
We explored multiple extensions of the PEB algorithm in the VA cohort, including using demographic variables and other liver function markers to explain the variability of AFP; however, none of these approaches resulted in clinically significant improvements in the screening performance over the standard PEB algorithm with AFP only (see Additional file 1: Appendix B). The sensitivity of the PEB algorithm also depends on the biomarker behavior after HCC onset, as well as the likelihood of having a screening test soon after HCC onset. In the HALT-C Trial, patients had AFP tests every three months during the first 48 months post-randomization and every six months thereafter. An exploratory analysis that considered only those AFP tests from the HALT-C Trial that were six months apart found that the PEB algorithm improved the sensitivity of AFP by 12.6% compared to the standard thresholding approach when the screening-level false positive rate was set to 10% and all positive screens in HCC cases were considered to be true positive screens. In the VA cohort, the average time between AFP tests was around 12 months. If we restrict our analysis to only those VA patients with frequent AFP tests (no more than nine months between AFP tests) then the improvement of the PEB algorithm with AFP compared to the standard thresholding approach was 4.37% (57.28% vs 52.91%) when the screening-level false positive rate was set to 10% and all positive screens in HCC cases were considered to be true positive screens.
There are several limitations of this study. The current VA cohort is restricted to those patients with active HCV-related cirrhosis and therefore we do not know how these screening algorithms will perform in patients with other etiologies. In addition, the VA patient population in general is older and overwhelmingly male, with few Hispanics and Asians and with high rates of comorbid conditions including alcohol abuse; therefore we do not know how well the results generalize to the cirrhosis population in the United States. We are assembling an updated cohort of cirrhosis patients from the VA (2010–2015) that will include multiple cirrhosis etiologies, including HCV, hepatitis B infection, alcoholic liver disease and nonalcoholic fatty liver disease. Patients with non-HCV etiologies have been shown to have lower risk of progression to HCC [17] and in this patient population, we can study the performance of the screening approaches in different cirrhosis subgroups and tailor the algorithms, if necessary, to each disease etiology. We will also study the screening approaches in an external cohort of cirrhosis patients from the community-based Kaiser Permanente Northern California health care system. In this cohort, we will have a more representative sample of the general cirrhosis population in which to further study the screening approaches that we have developed.
Conclusions
We have evaluated multiple screening algorithms from different perspectives to better understand the potential performance in a future prospective study. In addition, we have extended the definitions of the standard measures (sensitivity, specificity, positive and negative predictive value) from those used when a biomarker is measured at a single time point to the longitudinal screening setting. The proposed measures reflect clinically relevant performance characteristics of the screening algorithms that allow for a clearer understanding of potential future performance.
Abbreviations
AASLD: American Association for the Study of Liver Diseases
AFP: α-Fetoprotein
ALT: Alanine aminotransferase
CT: Computed tomography
FPR: False positive rate
HALT-C: Hepatitis C antiviral long-term treatment against cirrhosis
HCC: Hepatocellular carcinoma
HCV: Hepatitis C virus infection
MRI: Magnetic resonance imaging
NPV: Negative predictive value
PEB: Parametric empirical Bayes
PLT: Platelets
PPV: Positive predictive value
TPR: True positive rate
VA: Department of Veterans Affairs
References
1. El-Serag HB. Hepatocellular carcinoma. N Engl J Med. 2011; 365(12):1118–27.
2. Bruix J, Sherman M. Management of hepatocellular carcinoma. Hepatology. 2005; 42(5):1208–36.
3. Bruix J, Sherman M. Management of hepatocellular carcinoma: an update. Hepatology. 2011; 53(3):1020–2.
4. Davila JA, Morgan RO, Richardson PA, Du XL, McGlynn KA, El-Serag HB. Use of surveillance for hepatocellular carcinoma among patients with cirrhosis in the United States. Hepatology. 2010; 52(1):132–41.
5. Heimbach JK, Kulik LM, Finn RS, Sirlin CB, Abecassis MM, Roberts LR, Zhu AX, Murad MH, Marrero JA. AASLD guidelines for the treatment of hepatocellular carcinoma. Hepatology. 2018; 67:358–80.
6. Gupta S, Bent S, Kohlwes J. Test characteristics of α-fetoprotein for detecting hepatocellular carcinoma in patients with hepatitis C: a systematic review and critical analysis. Ann Intern Med. 2003; 139(1):46–50.
7. El-Serag HB, Kanwal F, Davila JA, Kramer J, Richardson P. A new laboratory-based algorithm to predict development of hepatocellular carcinoma in patients with hepatitis C and cirrhosis. Gastroenterology. 2014; 146(5):1249–55.e1.
8. White DL, Richardson P, Tayoub N, Davila JA, Kanwal F, El-Serag HB. The updated model: An adjusted serum alpha-fetoprotein-based algorithm for hepatocellular carcinoma detection with hepatitis C virus-related cirrhosis. Gastroenterology. 2015; 149(7):1986–7.
9. Di Bisceglie AM, Sterling RK, Chung RT, Everhart JE, Dienstag JL, Bonkovsky HL, Wright EC, Everson GT, Lindsay KL, Lok ASF, Lee WM, Morgan TR, Ghany MG, Gretch DR, The HALT-C Trial Group. Serum alpha-fetoprotein levels in patients with advanced hepatitis C: Results from the HALT-C trial. J Hepatol. 2005; 43(3):434–41.
10. Chen TM, Huang PT, Tsai MH, Lin LF, Liu CC, Ho KS, Siauw CP, Chao PL, Tung JN. Predictors of alpha-fetoprotein elevation in patients with chronic hepatitis C, but not hepatocellular carcinoma, and its normalization after pegylated interferon alfa 2a-ribavirin combination therapy. J Gastroenterol Hepatol. 2007; 22(5):669–75.
11. Richardson P, Duan Z, Kramer J, Davila JA, Tyson GL, El-Serag HB. Determinants of serum alpha-fetoprotein levels in hepatitis C-infected patients. Clin Gastroenterol Hepatol. 2012; 10(4):428–33.
12. McIntosh MW, Urban N. A parametric empirical Bayes method for cancer screening using longitudinal observations of a biomarker. Biostatistics. 2003; 4(1):27–40.
13. Skates SJ, Pauler DK, Jacobs IJ. Screening based on the risk of cancer calculation from Bayesian hierarchical change-point and mixture models of longitudinal markers. JASA. 2001; 96(454):429–39.
14. Tayob N, Lok ASF, Do KA, Feng Z. Improved detection of hepatocellular carcinoma by using a longitudinal alpha-fetoprotein screening algorithm. Clin Gastroenterol Hepatol. 2016; 14(3):469–75.
15. Kramer JR, Davila JA, Miller ED, Richardson P, Giordano TP, El-Serag HB. The validity of viral hepatitis and chronic liver disease diagnoses in Veterans Affairs administrative databases. Aliment Pharmacol Ther. 2008; 27(3):274–82.
16. Lee E, Edward S, Singal AG, Lavieri MS, Volk M. Improving screening for hepatocellular carcinoma by incorporating data on levels of α-fetoprotein, over time. Clin Gastroenterol Hepatol. 2013; 11(4):437–40.
17. White DL, Kanwal F, El-Serag HB. Nonalcoholic fatty liver disease and hepatocellular cancer: A systematic review. Clin Gastroenterol Hepatol. 2012; 10(12):1342–59.
Acknowledgments
Not applicable
Funding
This research was supported by an NCI Grant (R01CA190776).
Availability of data and materials
The data that support the findings of this study are available from the study PI (HES) upon reasonable request and with BCM IRB and Michael E DeBakey VA Research & Development Committee permission.
Author information
Contributions
HES obtained funding. PR oversaw data collection. NT conducted analysis and drafted the manuscript. All authors contributed to study design, analysis, interpretation and subsequent manuscript revisions. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Nabihah Tayob.
Ethics declarations
Ethics approval and consent to participate
The Institutional Review Board for Human Subject Research for Baylor College of Medicine and Affiliated Hospitals (BCM IRB) reviewed and approved the study protocol. Participant consent was waived due to minimal risks and no adverse effect on the privacy rights and the welfare of the individuals in a retrospective review of information from the patient medical record. All patient identifying information was removed from the analytical dataset.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1
Supplementary Materials. Appendix A: Estimators of measures used to evaluate screening algorithms defined in the Methods section. Appendix B: Additional results including Tables A–C and Figures A–C. (PDF 267 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Keywords
 Early detection
 Hepatocellular carcinoma
 Longitudinal biomarkers
 α-fetoprotein
 Parametric empirical Bayes
 Short-term risk prediction