 Research Article
 Open Access
 Published:
Riskadjusted CUSUM control charts for shared frailty survival models with application to hip replacement outcomes: a study using the NJR dataset
BMC Medical Research Methodology volume 19, Article number: 217 (2019)
Abstract
Background
Continuous monitoring of surgical outcomes after joint replacement is needed to detect which brands’ components have a higher than expected failure rate and are therefore no longer recommended to be used in surgical practice. We developed a monitoring method based on cumulative sum (CUSUM) chart specifically for this application.
Methods
Our method entails the use of the competing risks model with the Weibull and the Gompertz hazard functions adjusted for observed covariates to approximate the baseline timetorevision and timetodeath distributions, respectively. The correlated shared frailty terms for competing risks, corresponding to the operating unit, are also included in the model. A bootstrapbased boundary adjustment is then required for riskadjusted CUSUM charts to guarantee a given probability of the false alarm rates. We propose a method to evaluate the CUSUM scores and the adjusted boundary for a survival model with the shared frailty terms. We also introduce a unit performance quality score based on the posterior frailty distribution. This method is illustrated using the 20032012 hip replacement data from the UK National Joint Registry (NJR).
Results
We found that the best model included the shared frailty for revision but not for death. This means that the competing risks of revision and death are independent in NJR data. Our method was superior to the standard NJR methodology. For one of the two monitored components, it produced alarms four years before the increased failure rate came to the attention of the UK regulatory authorities. The hazard ratios of revision across the units varied from 0.38 to 2.28.
Conclusions
An earlier detection of failure signal by our method in comparison to the standard method used by the NJR may be explained by proper riskadjustment and the ability to accommodate timedependent hazards. The continuous monitoring of hip replacement outcomes should include risk adjustment at both the individual and unit level.
Background
Continuous monitoring of healthcare, and increasingly, social care across various providers is an important task of the healthcare regulator, such as the Care Quality Commission (QCC) in the UK. Additionally, a number of professional bodies and registers take on the same function for their clinical discipline. For instance, in regards to joint replacement, surgeon and operating unitlevel outcomes are compiled by National Joint Registry for England and Wales (NJR). Methods of continuous monitoring of production quality have been initially developed and employed in quality control in the industry [1]. One of the most popular methods is the cumulative sum (CUSUM) chart, a graphical method based on sequential monitoring of cumulative performance over time. This method is based on sequential procedures and allows timely identification of a deterioration in performance. A number of CUSUMbased quality control systems are being implemented in various clinical disciplines, with the earliest application being in cardiothoracic surgery [2]. Currently they are used in surveillance of the healthcare quality by QCC [3], and by Dr Foster unit at Imperial College [4]. In this paper we expand the CUSUM methodology and adapt it for monitoring the performance of hip prostheses using the NJR data.
A hip replacement is a surgical operation where the damaged hip joint is replaced by a prosthesis. This operation is recommended to reduce pain and improve mobility of a patient after other therapies have failed. There are currently hundreds of types and brands of prosthesis components for use in the hip replacement surgery, and new brands of implant continue to be introduced through technological innovations. An important aspect of an implant brand’s performance is its expected timetorevision. The current exception is that all prostheses used as treatment for end stage arthritis should have a failure rate of less than 5% at 10 years. Because of a relatively long timetofailure of the hip prosthesis, longterm premarketing clinical trials are unfeasible. Therefore, continuous monitoring methods are needed for early detection of poor performance and timely withdrawal of the inferior components from clinical practice.
The first CUSUMbased methods for healthcare were based on binomial or Poisson distributions, monitoring failure rates within a fixed time interval, e.g. 30days mortality [5], or one year hip replacement failure rates [6]. CUSUM methods for survival data are a natural extension of the methods for binary data. Censoring, truncation, adjustment for observed covariates and unobserved factors (frailties) can be easily included in survival models. By monitoring the individualspecific outcomes, the CUSUM score can be evaluated sequentially, changing at each individual failure. However, this method seems not to be appropriate in the case of hip replacement, where the expected timetorevision is longer than 10 years. Hardoon et al. [7] proposed to compare the number of revisions within a certain time interval to that expected given a target revision rate and the total number of hip years in the interval. That is, patients contribute to a CUSUM score until revision or censoring (death or end of followup). They analysed the data from Swedish Arthroplasty register using Weibull distribution to model time to revision of hip replacement.
However, timetorevision of hip prostheses varies depending on the patient characteristics, and on the type of fixation used [8]. This necessitates the use of case mix adjusted monitoring methods. The first risk adjusted CUSUM methods for timetofailure (survival) data were introduced by Biswas and Kalbfleisch [9]. This method was picked up by the Scottish Arthroplasty Project, where CUSUMs are used to monitor complication rates of joint replacements by surgeon and unit from 2010. This is achieved by likelihoodbased scoring method with risk adjustment for age, sex, osteoarthritis (OA) and rheumatoid arthritis (RA) [10]. A Bayesianbased CUSUM method for Weibull survival time is described in Assareh et al. [11].
Although the event of interest in our study is a revision, a priori death should not be treated as noninformative censoring. We develop a general competing risk version of the survival model for NJR data, where death is a competing risk. To safeguard the properties of the CUSUM charts, the control limits for riskadjusted CUSUMs need to be revised to accommodate the estimation error.
We propose and implement a parametric version of the approach by Gandy and Kvaløy [12], of using bootstrap to provide the control limits conditional on the estimated incontrol distribution, resulting in less conservative, i.e. more powerful, procedures.
We are using the Weibull distribution for fitting the baseline revisionspecific hazard function, because this distribution has a good fit to the empirical distribution of timetorevision [7]. The Gompertz distribution is used for fitting the baseline mortalityspecific hazard function. The observed covariates and the correlated frailty components at the unit level are included in the model, assuming that all patients from a unit share the same unobservable gamma distributed risks of prosthesis revision and of death after hip replacement surgery.
We develop a bootstrapbased boundary adjustment for the riskadjusted CUSUM chart to guarantee a given conditional probability of the false alarm rates. We also propose a score characterizing the quality of the hip replacement surgery in a unit. This score is based on the estimate of the posterior conditional frailties for units given the observed data. Mathematical development of the CUSUM scores for a Weibull/Gompertz survival model with shared frailty is provided in the Appendix.
The developed methods are applied to the 20032012 hip replacement data from the NJR. We illustrate the use of riskadjusted CUSUM methodology to monitor the performance of two specific hip prostheses brands: the DePuy ASR Resurfacing Cup and the Biomet M2A38 cup, which were flagged as outliers by NJR [13].
Methods
Motivating example
An artificial hip includes three major components: a stem that is inserted into the femur, a head (a ball) attached to the top of the femur and a cup, also called the acetabular component, that is implanted into the pelvis. A hip resurfacing procedure is typically used in younger patients where it can delay the need for a total hip replacement, it replaces the socket with an artificial cup and resurfaces the head of the femur instead of removing it. In 2010, NJR recorded 123 brands of acetabular cups, 13 brands of resurfacing cups and 146 brands of femoral stems used in primary and revision procedures [14].
Given a vast variety of available types and brands of prosthesis components for use in the hip replacement surgery, monitoring implant quality is the main objective of the NJR implant scrutiny group that was established in 2009. According to the current NJR methodology [15], an implant is considered to be a Level 1 outlier when its Patient Time Incident Rate (PTIR) is twice the PTIR of the implant group, where the group rate is weighted by the relevant implant types. From 2009 to 2014, three hip stems, three hip acetabular components and 17 hip stem/cup combinations were reported as Level 1 outliers [13].
To test our analytical approach on real world data, our analysis will focus on two of these outlier compoents: (i) the DePuy ASR Resurfacing Cup (first identified as a part of an outlier head/cup combination in April 2010 and last implanted in July 2010) and (ii) the Biomet M2A38 acetabular cup (first identified by the NJR as an outlier in 2014, and last implanted in June 2011).
A standard CUSUM chart usually has a learning period where the parameters of the relevant null distribution are estimated, and the deviation from the null of clinical concern is decided upon to calibrate the control limits. The chart is then run with these control limits. An example of this approach is by Hardoon et al. [7], 2007 who monitored a constant target revision rate in a time interval. However, the failure rates differ by implant types, the age of the patients, and other case mix characteristics. They also may vary by the site at which operations take place (the operating unit). Therefore we consider a riskadjusted CUSUM where the target rates are estimated for the popular implants (top 80%), and experienced units (more than 1 surgery per week, on average), which requires an introduction of shared frailty terms, describing similarities within and heterogeneity between units, to our survival models, and an appropriate adjustment of the control limits.
Description of the NJR data
The NJR data were made available after a formal request to the NJR Research Committee. The dataset is related to the data cut used in the 10th NJR Annual Report [16]. The data were anonymised in respect to patient, to surgeon and to operating unit identifying details. Approval was obtained from Computing Subcommittee of the University of East Anglia Ethics Committee, reference number CMP/1718/F/10A. The NJR dataset provides the following four groups of variables used in the timetofailure analysis of the hip replacements to riskadjust the CUSUM boundaries.
Information on procedures, such as date of operation or revision, and side;
Institution and staff involved, such as unit and consultant IDs (anonymised), and surgeon grade;
Hip prosthesis characteristics, such as fixation type (cemented, uncemented, hybrid, resurfacing), its components (head, cup, stem, and liner brands), head size, bearing surfaces (metal, polyethylene, ceramic);
Patient characteristics, such as age, sex, ASA physical status classification [17] at 5 levels from healthy (1) to near death (5), Body Mass Index (BMI), index of multiple deprivation (IMD)[18] (a higher IMD means higher proportion of people in the area classed as deprived), and death date.
Since about a half of records had missing BMI values, this factor was excluded from further consideration. ASA scores were grouped into two categories in further analysis: ASA 12  normal healthy patients and patients with mild systemic disease, ASA 35  patients with serious, nonincapacitating systemic disease, patients with lifethreatening incapacitating systemic disease and patients that are near death.
Data selection in SQL (elimination of duplicates, second and subsequent revisions) resulted in 504,024 records with the fields listed above. By further cleaning the following records have been additionally excluded:
Patients with bilateral operations;
Records with missing or misreported side;
Records with time to revision equal to 0;
Records with date of operation after 31 December 2012;
Patients younger than 50 years at operation day;
Records with missing values of IMD.
This process resulted in 281,265 records. Finally, all records for the patients operated in units with less than 52 operations per year (i.e. less than once per week, on average), and all records with implanted cup/head brands in the bottom 20% in popularity that year, as well as cup/head brands “DePuy” and “Biomet” were excluded in the incontrol dataset, resulting in 113,772 records in total. To test the efficiency of our CUSUM procedure, we have also selected two test datasets including only the records with cup brands “DePuy ASR Resurfacing Cup” (1734 records) and “Biomet M2A 38” (764 records), respectively. The cases for prostheses revised within three months of implantation were censored at the time of revision to exclude failures that might be directly attributive to surgical technique or postoperative complications. Description of the three datasets is given in Table 1. We provide analysis of these data performed in R [19] in the “Results” section.
Basics of CUSUM method for timetoevent data
The CUSUM method is a sequential analysis technique based on the calculation of the series W_{i}, i=0,1,2,..., defined by a simple recurrent equation
where index i stands for a single observation or for a group of observations and X_{i} is the weight or score assigned to index i. The CUSUM alerts when W_{i} crosses a control limit, usually chosen to guarantee a long average run length (ARL) when the process is in control, or to provide a low false alarm probability [20]. In applications to survival data, and assuming independent competing risks of revision and death, the score X_{i} for an individual i with timetorevision t_{i} and vector of covariates u_{i} can be defined as the logarithm of the revisionspecific factor of the likelihood ratio
where δ_{i} is a censoring indicator, \(S^{j}_{i}(.)\) and \(f^{j}_{i}(.)\) are survival and density functions, respectively, and index j=0,1, stands for null hypothesis H_{0} (process is under control) and alternative hypothesis H_{1} (failure rate is higher than expected by a certain margin). Under the assumption of independent competing risks, the revisionspecific factor of the likelihood coincides with the likelihood function that would be obtained be treating failures from any other causes as censored observations.
For a set I of independent individuals, the score X_{I} can be calculated as a sum of individual scores X_{i}, i∈I:
Assuming proportional hazards model with the Weibull baseline distribution under hypotheses H_{j}, j=0, 1, the hazard functions h_{j}(tu)=μ_{j}(t)χ(u) are proportional to the Weibull baseline hazards μ_{j}(t) and a regressor function χ(u). The regressor function is usually specified as χ(u)= exp(β^{∗}u) (the Cox’s regression term) for a transposed column vector of unknown parameters β. The baseline hazard function under H_{0} corresponds to the hazard function μ_{0}(t)=(k/λ)(t/λ)^{k−1} for the Weibull distribution with the shape parameter k and the scale parameter λ, and the baseline hazard function μ_{1}(t) under the alternative hypothesis H_{1} is proportional to μ_{0}, μ_{1}(t)=HRμ_{0}(t). The hazard ratio HR represents the departure from the target survival that we want to detect.
For consecutive time intervals T, consider a subset I=I_{T} of N_{I} individuals observed (prostheses in use) over the time interval T. In this case, the scores X_{I} can be calculated as [7]
where O_{I} is the observed number of failures (revisions) occurring during the interval T and E_{I} is the number of failures that would be expected in the same interval under hypothesis H_{0}.
Denote by (t_{1i},t_{2i}) an intersection of the interval T with the lifetime of the prosthesis i implanted at t_{0i}. Then t_{1i} is the maximum of the lower bound of interval T and t_{0i}, and t_{2i} is the minimum of the upper bound of interval T, the time of revision of prosthesis i and the time of censoring of the patient with prosthesis i. From this, the value of (t_{2i}−t_{1i}) is equal to the length of time when prosthesis i is in use in the time interval T. The values of E_{I} can be computed as
CUSUM scores for shared frailty competing risks model
Under the proportional hazards model with frailty, the hazard functions h(tu,Z) for an observed vector of covariates u and unobserved nonnegative random frailty component Z, is proportional to the baseline hazard μ(t), frailty term Z, and a regressor function χ(u)= exp(β^{∗}u). The conditional survival function is given by
The marginal survival function is defined by
We will use the index f, f=r,d, to denote the types of failure (revision of implant or death of a patient without implant failure, respectively), considered as competing risks. For mathematical convenience, it is frequently assumed that frailty Z_{f} is gammadistributed with mean 1 and unknown variance \(\sigma _{f}^{2}\). The assumption of gamma distributed frailty is not too restrictive, as a number of authors demonstrated that gammabased shared frailty models are robust for a wide class of frailty distributions [21, 22]. The frailty variance \(\sigma _{f}^{2}\) characterizes heterogeneity in the population.
We also assume that the baseline hazard functions are \(\phantom {\dot {i}\!}\mu _{0,r}(t)=(k_{r}/\lambda _{r})(t/\lambda _{r})^{k_{r}1}\) and \(\mu _{0,d}(t)=\lambda _{d}\exp (k_{d}t)\phantom {\dot {i}\!}\) with the shape parameter k_{f} and the scale parameter λ_{f}, f=r,d, for the Weibull and Gompertz distributions, respectively. In this case, the typeoffailure specific marginal survival function is given by
with the typeoffailure specific baseline cumulative hazards \(H_{r}(t)=(t/\lambda _{r})^{k_{r}}\phantom {\dot {i}\!}\) and \(\phantom {\dot {i}\!}H_{d}(t)=(\lambda _{d}/k_{d})(\exp (k_{d}t)1)\).
Correlated frailty terms for revision and death can be constructed as
for independent gamma distributed random variables Y_{0}∼G(l_{0},m_{r}) and Y_{f}∼G(l_{f},m_{f}) with \(l_{f}=1/\sigma _{f}^{2}l_{0}\), \(m_{f}=1/\sigma _{f}^{2}\), f=r,d; 0≤ρ≤ min(σ_{r}/σ_{d},σ_{d}/σ_{r}). The result of this construction is that the frailties are gammadistributed with \(\mathbb {E}Z_{f}=1\), \(\text {Var}Z_{f}=\sigma _{f}^{2}\), and Corr(Z_{r},Z_{d})=ρ. Given the frailties (Z_{r},Z_{d}) and the covariates (u_{r}, u_{d}), typeoffailure specific instantaneous risks are assumed to be conditionally independent at any time t.
The bivariate marginal survival function for the typeoffailure specific latent time moments (t_{r}, t_{d}) is given by the formula
[23]. If left truncation is present at ages (t_{0r}, t_{0d}), we calculate the conditional survival function by dividing the bivariate survival function by S(t_{0r},t_{0d}u_{r},u_{d}).
In the context of hip replacement, the shared frailty terms arise from the assumption that the n_{j} patients who have undergone surgery in the same unit j, j=1,⋯,J, have the same, possibly correlated, unobserved risks of revision and death. This means that the full likelihood function for our model has a form of \({\mathcal L}=\prod _{j=1}^{J}{\mathcal L}_{j}(\bar t_{jr},\bar t_{jd}\bar {\mathbf {u}}_{jr},\bar {\mathbf {u}}_{jd})\) for
where δ_{f}=0,1 is the censoring indicator with δ_{f}=0 indicating right censoring, and \(\bar t_{jf}\) and \( \bar {\mathbf {u}}_{jf}\) are the vectors of causespecific latent times and of covariates for the patients from unit j, respectively, f=r, d, and
where a subscript i, i=1,...,n_{j}, corresponds to a current patient i from unit j. This likelihood can be used for parameter estimation.
Proposed CUSUM scores for a competing risks model with shared frailty are based on the likelihood ratio \({\mathcal L}\). For a time interval T, let I_{j}(T) be a set of individuals from unit j whose implants are in use during the period T, and \(I=I(T)=\bigcup I_{j}(T)\). The scores X_{I}(T) for the time interval T are defined as
where Z_{jr}, Z_{jd} are the shared frailty terms for unit j, the superscript h, h=0,1, stands for hypothesis, and
In general case, expression for X_{I}(T) does not have a simple closed form. In the special case of ρ=0, the competing risks of revision and death are independent, and the score X_{I}(T) is the sum of the respective component scores for revision and death (see Appendix). If the interest lies in the risk of revision only, death can be treated as a noninformative censoring, and we concentrate on the CUSUM analysis of revision scores to the end of this Section.
For the baseline Weibull hazard function, under the proportionate alternatives μ_{1}(t)=HRμ_{0}(t), we can rewrite the revision component of the score (3) as
where O_{j} is a number of revisions in the unit j during period T so that \(O_{I}=\sum _{j}O_{j}\) (see Additional file 1 for proof).
Often, the proportional hazards assumption is too strong; different groups of patients and prostheses do not necessarily have proportional hazard functions for the hip revision times and/or for death. We weaken this assumption by allowing different shape parameters k_{f}(u) in the baseline Weibull and Gompertz hazard functions which depend on covariates through additional Coxregression multipliers, \(k_{f}({\mathbf u})=\exp (\beta ^{*}_{k} \mathbf u)k_{f}\). Then the CUSUM scores for revision are calculated as
CUSUM chart control limits for the shared frailty model for revision
The unknown parameters of the timetorevision model under the null hypothesis H_{0} are estimated from the incontrol (learning) dataset. These are the Coxregression parameters β and β_{k}, parameters of the Weibull baseline distributions k and λ, and the variance of the frailty term σ^{2}. The vector of unknown parameters ξ=(lnk, lnλ, lnσ^{2},β,β_{k}) is estimated using the maximum likelihood method to obtain the estimates \(\hat \xi \). The timetofailure distribution with these estimated parameters is then used to compute the CUSUM scores for the two test datasets and to estimate the control limits for the CUSUM chart: See Additional file 1 for details of calculation of the CUSUM score. Let P=P(ξ) be the true distribution function for revision times, and τ=τ_{c}(P;ξ) is the time at which the chart alerts when it exceeds a threshold c. The false alarm probability in T time units is \(hit(P;\xi) = \mathbb P(\tau _{c}(P;\xi) \leq T)\) for some finite T>0. The threshold c_{hit}(P;ξ)= inf{c>0:hit(P;ξ)≤α} for some 0<α<1 is needed to restrict the false alarm probability to α. However, only \(\hat P\) and \(\hat \xi =\xi (\hat P)\) are known.
A parametric version of the bootstrap algorithm proposed by Gandy and Kvaløy [12] is used to estimate the control limits to guarantee, that the false alarm rate of a CUSUM chart with the incontrol distribution P, conditional on \(\hat \xi \), is below nominal level α with high probability 1−γ.
Define the first time \(\tau _{c}(P\hat \xi)\) at which the CUSUM chart conditional on \(\hat \xi \) exceeds the given value c. We are interested in the boundary \(c_{hit}(P\hat \xi)\) defined by equation \(c_{hit}(P\hat \xi)=\inf \{c>0:\; \mathbb P(\tau _{c} (P\hat \xi)\leq T)\leq \alpha \}\) for some 0<α<1. Since P is unknown, \(c_{hit}(P\hat \xi)\) is unknown too and the estimate \(c_{hit}(\hat P\hat \xi)\) is usually used instead. However, such estimate does not guarantee the false alarm rate of the chart. Following [12], we estimate the 1−γ quantile for the threshold \(c_{hit}(P\hat \xi)\) for some 0<γ<1 using the following algorithm.
Algorithm.
Let N be the number of records (patients) in the control dataset, N_{Sim} be the number of simulations needed to estimate \(c_{hit}(\hat P\hat \xi)\), N_{Boot} be the number of bootstrap replicates, and T=[T_{min},T_{max}] be the observation period.
 1.
Calculate the maximum likelihood estimate (MLE) \(\hat \xi \) of the vector of unknown parameters ξ as well as the estimate \(\widehat {\text {Cov}}\) of the covariance matrix cov (inverse Hessian) for \(\hat \xi \) using the control dataset and the survival model with Weibull hazard described above;
 2.
Generate from the multivariate normal distribution with mean \(\hat \xi \) and the covariance matrix \(\widehat {\text {Cov}}\), a random vector ξ_{cur};
 3.
Keeping the covariates in all three test datasets fixed, generate for all patients new timestorevision t_{rev} on the basis of the survival model with Weibull hazard described above and vector ξ_{cur}. Update the censoring using the rule δ=1 if t_{rev}<= min{t_{death},T_{max}} and δ=0, otherwise. Replace t_{rev} for δ=0 by t_{rev}= min{t_{death},T_{max}}. Repeat N_{Sim} times and calculate for the test dataset j, j=1,2, the values of \(c_{\text {hit}}^{j}(\hat P_{cur}\hat \xi _{cur})\) and \(c_{\text {hit}}^{j}(\hat P\hat \xi _{cur})\);
 4.
To take into account multiple testing, we set \(c_{\text {hit}}(\hat P_{cur}\hat \xi _{cur})=\underset {j=1,2}\max \{c_{\text {hit}}^{j}(\hat P_{cur}\hat \xi _{cur})\}\) and \(c_{\text {hit}}(\hat P\hat \xi _{cur})=\underset {j=1,2}\max \{c_{\text {hit}}^{j}(\hat P\hat \xi _{cur})\}\). Calculate \(p_{cur}=c_{\text {hit}}(\hat P_{cur}\hat \xi _{cur})c_{\text {hit}}(\hat P\hat \xi _{cur})\);
 5.
Repeat steps 24 N_{Boot} times and calculate the 1−γ empirical quantile p_{γ} of p_{cur}.
The estimate of the adjusted threshold is equal to \(c_{hit}(\hat P\hat \xi)p_{\gamma }\). This threshold guarantees that in approximately 100(1−γ)% of the applications the probability of false alarm will not exceed the value of α.In the “Results” section, we use the values of N_{Sim}=100, N_{Boot}=100, α=0.1, and γ=0.1, T_{min}=01.01.2005, and T_{max}=31.12.2012 for the analysis of the NJR data.
Estimating operating unit performance
Estimating performance across surgical units is also of potential importance in the quality control setting. The posterior frailty distribution obtained from the fitted shared frailty survival model described in the “Methods” section, can be used for this purpose. Given the prior gamma distribution with (shape, scale) parameters (a,b)=(σ^{−2},σ^{2}), mean ab=1 and variance ab^{2}, and the observed data D_{j}, the posterior frailty distribution for unit j, is the gamma distribution with (shape, scale) parameters (a_{j},b_{j}) equal to
where O_{j} is the number of observed revisions in unit j, I_{j} is set of all patients from unit j, and H(t_{i},u_{i}) is the cumulative hazard for individual i from unit j with time to revision (or censoring) t_{i} and the vector of covariates u_{i} [24].
The effects of the units (shared frailties) are given by the conditional expectation \(\mathbb {E}(Z_{j}D_{j})=a_{j}b_{j}\), and parameters a_{j} and b_{j} can be estimated by substituting the MLE estimates \(\hat \xi \) of the unknown parameters ξ [21]. Given the proportional hazards formulation, the shared frailty term can be interpreted as an excess hazard of a unit relative to the baseline hazard. Because of this interpretation, we refer to these estimated frailties as unitlevel hazard ratios and denote them by HR_{j}.
Additionally, we propose a new score characterizing the quality of the hip replacement surgery in a unit as
where D_{j} is the data from the control dataset relating to unit j. Large value of Q indicates a decreased hazard of revision in a unit, whereas small value of Q indicates poor performance of a unit. Since the values of Q and HR depend on the vector of unknown parameters ξ and only the MLE estimate \(\hat \xi \) of this vector is available, we generate a set of N_{average} estimates \(\hat \xi _{l}\) from \(\mathrm {N}(\hat \xi,\widehat {cov})\) distribution, and take the average of the obtained estimates of \(Q(\hat \xi _{l})\) and of \(\text {HR}_{j}(\hat \xi _{l})\) over this set of parameters.
Results
For the control dataset described in the “Methods” section, we estimated unknown parameters of the competing risks model with and without shared frailty terms maximizing the likelihood function (2). These include the parameters for the baseline hazard distributions and the coefficients of the Cox’s regressions for timetorevision and timetodeath, allowing for the possible covariatedependent shape parameters, as described at the end of the “Methods” section. Significant predictors had been chosen using the backward elimination in stepwise regression. The estimated coefficients and their confidence intervals for the models with and without frailty components are given in Table 2. The notation \(\phantom {\dot {i}\!}^{\prime \prime }{k}_{f}^{\prime \prime }\) before the name of a variable means that its coefficient relates to the shape parameter k_{f}. The baseline values for the categorical and binary regressors were: males for sex, cemented for fixation, ceramic/ceramic for cup/head bearing surfaces, and operation date before 01.01.2007.
Comparing likelihood, AIC and BIC values in Table 2 we see that the correlation between causespecific frailties Z_{r} and Z_{d} does not differ significantly from zero, and the best (in terms of AIC and BIC) model includes a frailty term only for revision. That is, the risks of revision and death can be modelled as independent, and formula (5) can be used to calculate CUSUM scores for revision.
Females had a decreased hazard of revision of hip prostheses compared to males on the timetorevision interval [ 0,λ]. Hazard of revision decreased with age and head size. Uncemented hip prostheses had an increased hazard of revision compared to cemented or hybrid fixation. The cup/head combinations with resurfacing/metal and resurfacing/resurfacing bearing surfaces also had increased hazards compared to other types of bearings, whereas the polyethylene/ceramic bearing surfaces provided a decreased risk of revision compared to the ceramic/ceramic ones. These results agree with the findings by [8]. Those patients who underwent the surgery after 01.01.2007 had an increased hazard of revision. This may reflect the fact that early revisions were missed by the NJR due to poor data quality in the early years. We also have found a significant random effect of units, with the estimated frailty variance \(\sigma ^{2}_{r}\) equal to 0.18 with confidence interval of (0.12−0.28). i.e. the hazard of revision differed by units.
Patients with serious disease (ASA P3P5) and patients from areas with high deprivation (IMD 45) had increased hazards of death. The cup/head combination with polyethylene/metal bearing surfaces had a significantly increased hazard of death compared to ceramic/ceramic bearing. The shape parameters for baseline hazards of death also differed by these factors and by the date of surgery before/after 01.01.2007.
Based on the fitted revision submodel with frailty under independent competing risks, and targeting the hazard ratios of 1.25, 1.50 and 1.75 under alternative hypotheses, the CUSUM scores were calculated quarterly for the period 200512. The bootstrapbased boundaries were calculated at the false alarm rate α=0.1 and the tolerance level 1−γ=0.9 and adjusted for multiple comparisons for two tested hip implants. The CUSUM scores did not differ much between the models with and without frailty component. Figure 1 presents the CUSUM charts for the two test datasets as well as the incontrol dataset for the models without/with frailty component at all three target hazard ratios. The CUSUM charts without frailty for DePuy ASR Resurfacing Cup produced alarm in the 4th quarter of 2009 for HRs of 1.25 and 1.75, and in the 3rd quarter of 2009 for HR of 1.50. The charts with frailty produced alarm somewhat later, in the 4th quarter of 2009 for all three values of the hazard ratio. This is comparable with the alarm based on PTIR by NJR in April 2010. For the Biomet M2A 38, the CUSUM charts without frailty hit the boundary in the second quarter of 2011 for HR=1.25, in the first quarter of 2011 for HR of 1.50, and in the second quarter of 2010 for HR= 1.75. The CUSUM charts with frailty alarm in the 2rd, the 1nd and the 2nd quarter of 2011, respectively. This is 3 to 4 years prior to the NJR alarm issued in 2014 [8].
The estimates of the quality scores Q_{j} and the hazard ratios HR_{j} have been calculated for 269 units included in the control dataset using N_{average}=100. Our results demonstrate high heterogeneity in performance. 17 units out of the total of 269 had the quality scores greater than 0.9. HRs for these units were between 0.38 and 0.67. 15 units had the quality score values less than 0.1. Their HRs varied from 1.52 to 2.28.
To check the goodnessoffit of chosen parametric distributions in our models for revision and mortality, we compared semiparametric estimates of baseline cumulative hazard functions to baseline cumulative hazards obtained from our parametric models, separately within each strata of a moderate to large size with a particular shape value. The results are shown in Fig. 2 for the Weibull baseline hazards in the revision model, and in Fig. 3 for the Gompertz baseline hazards in the mortality model. Additionally, these figures include plots of the residuals between the parametric and semiparametric estimates of the baseline hazards pooled across the strata. In Fig. 2, the larger deviations are still very small in absolute value, and mostly correspond to the small number of operations performed before 2007. Figure 3 is the confirmation of a wellknown fact [25] that the Gompertz distribution describes human mortality well only up to 95 years, and the oldest patients in Fig. 3 are the outliers. Overall, the Weibull and the Gompertz models fit the revision and the mortality data, respectively, very well.
To assess the predictive value of our models, we also calculated the Harrell’s concordance index [26, 27] between the predicted and the observed survival. In the models without frailty, the estimates of the concordance were equal to 0.818 (SE=0.009) and 0.732 (SE=0.003) for revision and mortality data, respectively. For the models with frailty, the concordance values were equal to 0.819 (SE=0.009) and 0.732 (SE=0.003), respectively.
Discussion
In hip replacement surgery, the continuous monitoring of the revision experience of hip prostheses is necessary due to delayed outcomes after the introduction of new brands into practice. CUSUM charts are a useful tool for early detection of changes in the revision rates after hip replacement. In the standard applications of the CUSUMbased monitoring, the learning data set required for the model identification is usually chosen from a preceding period. This assumes the stationarity of the process and leads to loss of information and the reduction of the period under study. Instead, we chose the incontrol and the test data from the same period. This novel approach is especially beneficial for the future development of the adaptive version of the algorithm.
In the absence of the gold standard, the choices of the learning dataset and the model describing the data play an important role in the analysis using a selfstarting CUSUM. After the routine cleaning of the original dataset, we excluded the records from units with less than 52 hip replacements per year to guarantee to some degree the sufficient experience of the implant within surgical teams. Similarly, only the top 80% of cup/head brands in each year were included to exclude rarely used brands, where the measure of failure rate was unlikely to be stable or robust.
Naive analysis treating competing risk events as noninformative censoring can lead to bias in estimates if competing risks are not independent. The competing risks model with dependent unobserved risk factors (frailties) is a convenient analytical tool for such data.
Two types of failure  revision and death without revision  are considered in this study. Other events during the followup period (e.g. loss to followup due to migration) are treated as noninformative censoring. In addition to observed factors, we included in the competing risks model correlating typeoffailurespecific random effects and all patients from a unit shared their values [28]. Sex, age, fixation, bearing surfaces, head size, and the date of operation were significantly associated with the lifetime of the hip prosthesis. Bad health (ASA 35), high deprivation (IMD 45), polyethylene/metal bearing surfaces, and the date of operation were significantly associated with the higher hazards of death. These effects were robust against the frailty settings.
Identifiability of the competing risks model with random effects was studied in [29]. The main assumption for the identifiability of this model is the finite mean of the frailty. Identifiability of the bivariate survival models with timedependent frailties given by the correlated Lévyprocesses was studied in the recent publication [30]. Our methodology can be easily adapted to this scenario.
There is no consensus on whether the risks of revision and death are independent in hip replacement. Shwarzer et al. [31] showed these risks to be dependent in their data. However, a recent publication by Sayers et al. [32] argued for independence. Comparing the results from four competing risk models with and without shared frailty terms, we found that the best model included the shared frailty for revision but not for death. This means that the competing risks of revision and death are independent in the NJR data. The variance of the frailty term for revision differed significantly from zero, in other words, there were significant differences between units.
We used the classical AIC and BIC for the model selection. However, the conditional AIC (cAIC) [33–35] is more appropriate for use in frailty models, since the marginal AIC favors smaller models excluding random effects. We believe that the use of cAIC would not have changed our models because of the negligibly small values of the estimates for the variance of the frailty for mortality, the very small correlation between frailties, and the practically unchanged value of the loglikelihood compared to the model without a random effect for mortality. The cAIC methods are also very computationally intensive. However, our final model includes the random effect for revision. We intend to incorporate cAIC for model selection in our further work.
We proceeded with CUSUM monitoring of revision rates. The two cup brands, DePuy ASR Resurfacing Cup and Biomet M2A 38, were not included in the learning dataset and their performances were monitored using CUSUM charts. We calculated the adjusted boundaries for three target values of the hazard ratios to guarantee approximately 10% of false alarm rate with probability of 0.9 during the observation period 200512. The estimates of the boundary calculated using the models with the frailty component were higher, i.e. more conservative, than the one calculated using the model without the frailty component. This delayed two of the alarms, by three and by 12 months. The charts were comparatively robust to the changes in the target HR levels. The estimated CUSUM scores of the DePuy ASR Resurfacing Cup consistently increased from mid2009. The increase of the CUSUM scores for the Biomet cup also started in 2009 and produced alarms in 201011, four years before the increased failure rate came to the attention of the UK regulatory authorities [15].
Estimating the posterior frailty distribution allows to compare the quality of the hip replacement surgery across units. From the 269 units included in the control dataset, 17 (6.3%) had a decreased hazard of revision with a quality score higher than 0.90 and 15 (5.6%) had an increased hazard of revision with a quality score less than 0.10. The associated hazard ratios of revision across the units varied from 0.38 to 2.28.
Due to low revision rates, the data set under study has about 90% censoring. The properties of the statistical methods in highly censored data sets are not well known. A further simulation study is required to assess the performance of our methods under varying amounts of censoring. Another limitation of this study is the choice of the gamma distribution for the correlated frailties. The advantage of the gamma frailty is a closed form expression for its Laplace transform. It allows for simple expressions for CUSUM scores. However, this choice results in necessarily positive correlations between revision and mortality frailties. Other forms of the frailty distributions (e.g. lognormal) to allow possible negative correlations will be pursued in our future work.
Conclusions
This study developed and implemented, for the NJR data, continuous monitoring methods for surgical outcomes. We used the Weibull and the Gompertz hazard functions to describe the baseline hazards of revision and death, respectively. These functions appear to provide a good approximation to the respective typeoffailure lifetime. However, adjustment for observed covariates is necessary to improve this approximation and to better understand the influence of the different factors on the lifetimes of the hip prosthesis and the patient.
Flexible parametrization taking into account possible influence of observed covariates on the shape and the slope parameters of the revision and mortality hazard functions as well as inclusion of the random effects (frailties) accommodate nonproportional hazards and improve the fit of our models to observed data.
Our results demonstrate that the competing risks of revision and death are independent in the NJR data. This finding will facilitate further development of continuous monitoring methods for these data.
We developed a novel method of CUSUMbased monitoring of revision rates. This method includes the choice of the incontrol and the test data from the same period, and can be expanded for the subsequent development of an adaptive algorithm. Implementation of the special bootstrap algorithm to estimate the control limits in the CUSUM method guarantees with high probability that the false alarm rate is below a prespecified level. An earlier detection of failure signal by our method in comparison to the PTIR method may be explained by proper riskadjustment and the ability to accommodate timedependent hazards.
We found considerable variation in the hazard ratios of revision across the units. Therefore, the continuous monitoring of hip replacement outcomes should include risk adjustment at both the individual and unit level.
Our approach can be easily adapted to other practice areas requiring the continuous monitoring of the failure rates. Further development of the dynamic CUSUMbased methodology similar to that of [36] is needed to adapt our approach to realtime applications, where the new data are regularly updated. Additionally, more sophisticated methods are required to adjust for multiplicity if testing hundreds of various implant brands. We intend to address these further challenges elsewhere.
Availability of data and materials
The NJR data are available to interested researchers subject to approval of the data access request by the Healthcare Quality Improvement Partnership (HQIP) and governance controls. The R programs used to analyse the data are available from the authors on request.
Abbreviations
 ASA:

American Society of Anesthesiologists
 BMI:

Body mass index
 CUSUM:

Cumulative sum
 HR:

Hazard Ratio
 IMD:

Index of Multiple Deprivation
 NJR:

National Joint Register
 PTIR:

Patient time incident rate
 QCC:

Care Quality Commission
References
 1
Page E. Continuous inspection schemes. Biometrika. 1954; 14:100–15.
 2
de Leval M, Franćois K, Bull C, Brawn W, Spiegelhalter D. Analysis of a cluster of surgical failures: Application to a series of neonatal arterial switch operations. J Thorac Cardiovasc Surg. 1994; 107(3):914–924.
 3
Spiegelhalter D, SherlawJohnson C, Bardsley M, Blunt I, Wood C, Grigg O. Statistical methods for healthcare regulation: rating, screening and surveillance. J R Stat Soc Ser A Stat Soc. 2012; 175(1):1–47.
 4
Bottle A, Aylin P. Intelligent information: A national system for monitoring clinical performance. Health Serv Res. 2008; 43:10–31.
 5
Grigg O, Farewell V, Spiegelhalter D. Use of riskadjusted CUSUM and RSPRT charts for monitoring in medical contexts. Stat Methods Med Res. 2003; 12(2):147–170.
 6
Biau D, Meziane M, Bhumbra R, Dumaine V, Babinet A, Anract P. Monitoring the quality of total hip replacement in a tertiary care department using a cumulative summation statistical method (CUSUM). J Bone Joint Surg Br. 2011; 93:1183–1188.
 7
Hardoon S., Lewsey J., van der Meulen J. Continuous monitoring of longterm outcomes with application to hip prostheses. Stat Med. 2007; 26(28):5081–5099.
 8
National Joint Register. 14th Annual report 2017. surgical data to 31 December 2016. 2017. https://reports.njrcentre.org.uk/Portals/6/PDFdownloads/NJR%2014th%20Annual%20Report%202017.pdf.
 9
Biswas P, Kalbfleisch J. A riskadjusted CUSUM in continuous time based on the Cox model. Stat Med. 2008; 27(17):3382–3406.
 10
Macpherson G, Brenkel I, Smith R, Howie C. Outlier analysis in orthopaedics: Use of CUSUM: The Scottish Arthroplasty Project: Shouldering the burden of improvement. J Bone Joint Surg Am. 2011; 93:81–88.
 11
Assareh H, Smith I, Mengersen K. Bayesian estimation of the time of a linear trend in riskadjusted control charts. Int J Comput Sci. 2011; 38(4):409–417.
 12
Gandy A, Kvaløy J. Guaranteed conditional performance of control charts via bootstrap methods. Scand Stat Theory Appl. 2013; 40:647–668.
 13
National Joint Register. 12th Annual Report 2015. Surgical data to 31 December 2014. 2015. http://www.njrcentre.org.uk/njrcentre/Portals/0/Documents/England/Reports/12th%20annual%20report/NJR%20Online%20Annual%20Report%202015.pdf.
 14
National Joint Register. 8th Annual Report 2011. Surgical data to 31 December 2010. 2011. http://www.njrcentre.org.uk/njrcentre/Portals/0/Documents/NJR%208th%20Annual%20Report%202011.pdf.
 15
National Joint Register. NJR implant performance analysis methodology. 2017.
 16
National Joint Register. 10th Annual Report 2013. Surgical data to 31 December 2012. 2013.
 17
Owens W, Felts J, Spitznagel JE. ASA physical status classifications: a study of consistency of ratings. Anesthesiol. 1978; 49:239–43.
 18
English Indices of Deprivation. Guidance Document. https://www.gov.uk/government/uploads/system/uploads/\\attachment_data/file/6222/1871538.pdf.
 19
R Core Team. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2016. https://www.Rproject.org/.
 20
Gandy A, Lau FH. Nonrestarting CUSUM charts and control of the false discovery rate. Biometrika. 2013; 100(1):261–8.
 21
Glidden D, Vittinghoff E. Modelling clustered survival data from multicentre clinical trials. Stat Med. 2004; 23(3):369–88.
 22
Gleiss A, Gnant M, Schemper M. Explained variation in shared frailty models. Stat Med. 2017; 37(9):1472–90.
 23
Wienke A. Frailty Models in Survival Analysis. New York: Chapman & Hall; 2010.
 24
Nielsen G, Gill R, Andersen P, Sørensen T. A counting process approach to maximum likelihood estimation in frailty models. Scand J Stat Theory Appl. 1992; 19:25–43.
 25
Vaupel JW. Biodemography of human ageing. Nature. 2010; 464(7288):536–542.
 26
Harrell Jr FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996; 15(4):361–87.
 27
Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, Pencina MJ, Kattan MW. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010; 21(1):128–138.
 28
Gorfine M, Hsu L. Frailtybased competing risks model for multivariate survival data. Biometrics. 2011; 67(2):415–26.
 29
Abbring JH. The identifiability of the mixed proportional hazards competing risks model. J R Statist Soc B. 2003; 65(3):701–10.
 30
Begun A, Yashin A. Study of the bivariate survival data using frailty models based on lévy processes. AStA Adv Stat Anal. 2018; 103(1):37–67. https://doi.org/10.1007/s101820180322y.
 31
Shwarzer G, Schumacher M, Maurer T, PE O. Statistical analysis of failure times in total joint replacement. J Clin Epidemiol. 2001; 54:997–1003.
 32
Sayers A, Evans J, Whitehouse M, Blom A. Are competing risks models appropriate to describe implant failure?. Acta Orthopaedica. 2018; 89(3):256–8.
 33
Vaida F, Blanchard S. Conditional Akaike information for mixedeffects models. Biometrika. 2005; 92(2):351–70.
 34
Greven S, Kneib T. On the behaviour of marginal and conditional AIC in linear mixed models. Biometrika. 2010; 97(4):773–89.
 35
Ha ID, Jeong JH, Lee Y. Statistical Modelling of Survival Data with Random Effects. Singapore: Springer; 2017.
 36
Zhang X, Woodall W. Dynamic probability control limits for riskadjusted Bernoulli CUSUM charts. Stat Med. 2015; 34(25):3336–3348.
Acknowledgements
The authors thank Sophie E. Garrett and Dr Wenjia Wang for the extraction of the preliminary NJR dataset in an analysis friendly format. The authors also thank the referees, Ha Il Do and Vera Tomazalla for their useful suggestions for improving the presentation of the material of this article.
We thank the patients and staff of all the hospitals in England, Wales and Northern Ireland who have contributed data to the National Joint Registry. We are grateful to the Healthcare Quality Improvement Partnership (HQIP), the NJR Research Subcommittee and staff at the NJR Centre for facilitating this work. The authors have conformed to the NJR’s standard protocol for data access and publication. The views expressed represent those of the authors and do not necessarily reflect those of the National Joint Registry Steering Committee or the Health Quality Improvement Partnership (HQIP) who do not vouch for how the information is presented.
The Healthcare Quality Improvement Partnership ("HQIP") and/or the National Joint Registry ("NJR") take no responsibility for the accuracy, currency, reliability and correctness of any data used or referred to in this report, nor for the accuracy, currency, reliability and correctness of links or references to other information sources and disclaims all warranties in relation to such data, links and references to the maximum extent permitted by legislation. HQIP and NJR shall have no liability (including but not limited to liability by reason of negligence) for any loss, damage, cost or expense incurred or arising by reason of any person using or relying on the data within this report and whether caused by reason of any error, omission or misrepresentation in the report or otherwise. This report is not to be taken as advice. Third parties using or relying on the data in this report do so at their own risk and will be responsible for making their own assessment and should verify all relevant representations, statements and information with their own professional advisers.
Funding
The work by A. Begun and E. Kulinskaya was supported by the Economic and Social Research Council [grant number ES/L011859/1]. The work by A. Begun was also supported by the Orthopaedics Trust.
Author information
Affiliations
Contributions
All authors have made contributions to conception, design and methodology of this study. AJM formulated the problem and obtained the data, AB and EK contributed to methods development, AB carried out the analysis, and EK drafted the first version of the manuscript. All authors have been involved in revisions, read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
The NJR data were made available after a formal request to the NJR Research Committee. The data were anonymised in respect to patient, to surgeon and to operating unit identifying details. Approval was obtained from Computing Subcommittee of the University of East Anglia Ethics Committee, reference number CMP/1718/F/10A.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interest.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
calculation of the CUSUM score
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Begun, A., Kulinskaya, E. & MacGregor, A.J. Riskadjusted CUSUM control charts for shared frailty survival models with application to hip replacement outcomes: a study using the NJR dataset. BMC Med Res Methodol 19, 217 (2019). https://doi.org/10.1186/s1287401908532
Received:
Accepted:
Published:
Keywords
 CUSUM charts
 Baseline hazard function
 Risk adjustment
 Competing risks
 Shared frailty
 Bootstrap