BivRec: an R package for the nonparametric and semiparametric analysis of bivariate alternating recurrent events

Castro-Pearson, Sandra; Sur, Aparajita; Lee, Chi Hyun; Huang, Chiung-Yu; Luo, Xianghua

doi:10.1186/s12874-022-01558-0

Software
Open access
Published: 03 April 2022

BivRec: an R package for the nonparametric and semiparametric analysis of bivariate alternating recurrent events

Sandra Castro-Pearson¹,
Aparajita Sur¹,
Chi Hyun Lee²,
Chiung-Yu Huang³ &
…
Xianghua Luo ORCID: orcid.org/0000-0001-7501-6582^1,4

BMC Medical Research Methodology volume 22, Article number: 92 (2022) Cite this article

1554 Accesses
1 Citations
1 Altmetric
Metrics details

Abstract

Background

Bivariate alternating recurrent event data can arise in longitudinal studies where patients with chronic diseases go through two states that occur repeatedly, e.g., care periods and break periods. However, there was no statistical software that provided tools for the analysis of such data. To meet this software need, we developed BivRec, a package for R that contains a set of tools for exploratory, nonparametric and semiparametric regression analysis of bivariate alternating recurrent events.

Results

The BivRec package provides functions for nonparametric estimations for the joint distribution of bivariate gap times (bivrecNP) and semiparametric regression methods for evaluating covariate effects on the two types of gap times under the accelerated failure time model framework (bivrecReg). The package also provides exploratory data analysis tools such as a visualization of the gap times by groups. We utilize a subset of the South Verona Psychiatric Case Register (PCR) data to illustrate the use of the BivRec package for the reviewed methods.

Conclusions

We demonstrate BivRec’s capability for data visualization, nonparametric and regression based analysis, as well as data simulation. The package has default methods with satisfactory performance despite the complexity of calculations and fills a gap in software for statistical analysis of bivariate alternating recurrent events. BivRec is accessible under the GPL-3 General Public License through CRAN, facilitating its installation.

Peer Review reports

Background

Bivariate alternating recurrent event data can arise in longitudinal studies where individuals go through two interchanging states that reoccur over time. For instance, patients with chronic conditions, such as addiction or mental illnesses, experience care periods and break periods repeatedly. The on and off periods of care or disease form a sequence of recurrent gap times. Consider a group of patients who enter a study at the time of a hospitalization due to a particular disease and during the study experience a series of recurrent hospitalizations caused by the same disease. In this case, patients experience alternating periods of care and breaks from care (referred to as Type I and Type II gap times, respectively, hereinafter) until a censoring event such as the end of study occurs.

First, we introduce some notation. Let $X^{0}_{ij}$ and $Y^{0}_{ij}$ be random variables representing the length of Type I and Type II gap times of episode j experienced by subject i, respectively. Denote the collection of all episodes of subject i by $N_{i}= \left \{\left (X^{0}_{i1}, Y^{0}_{i1}\right), \left (X^{0}_{i2}, Y^{0}_{i2}\right), \ldots \right \}, i=1, 2, \ldots, n$. In addition, we assume that the bivariate recurrent event process of subject i is subject to right censoring C_i, which has survival function G(·) with maximum support τ_c= sup{t:G(t)>0}. Let m_i denote the number of episodes of bivariate alternating recurrence times for subject i, which satisfies the conditions:

$$\sum\limits_{j=1}^{m_{i}-1} (X^{0}_{ij} + Y^{0}_{ij}) \le C_{i} \text{and} \sum\limits_{j=1}^{m_{i}} (X^{0}_{ij} + Y^{0}_{ij}) > C_{i}. $$

Note that, in the last bivariate pair $\left (X^{0}_{{im}_{i}},Y^{0}_{{im}_{i}}\right)$, $X^{0}_{{im}_{i}}$ may or may not be censored, but $Y^{0}_{{im}_{i}}$ is always censored, for which we use the hospitalization example to illustrate. When the censoring occurs during the last care period, $X_{{im}_{i}}^{0}$, the break period afterwards, $Y^{0}_{{im}_{i}}$ cannot be observed, and when the censoring occurs during the last break period, $Y^{0}_{{im}_{i}}$ is only partially observed.

The observation of the gap times $X^{0}_{ij}$ and $Y^{0}_{ij}$ is subject to the censoring time $C_{ij}^{*} = C_{i} - \sum \limits _{l=1}^{j-1} \left (X^{0}_{il} + Y^{0}_{il}\right)$ and $\max \left \{C_{ij}^{*}-X_{ij}^{0},0\right \}$, respectively, where $\sum \limits _{1}^{0}=0$. Due to the censoring, the observed data for subject i are $\left \{\left (X_{ij}, Y_{ij}, \Delta ^{X}_{ij}, \Delta ^{Y}_{ij}\right),j=1,\ldots,m_{i}\right \}$, where $X_{ij}= X_{ij}^{0}, Y_{ij} = Y_{ij}^{0}$ and $\Delta ^{X}_{ij}=\Delta ^{Y}_{ij}=1$ for j=1,…,m_i−1, while $X_{{im}_{i}} = {\min }\left (X_{{im}_{i}}^{0}, C_{{im}_{i}}^{*}\right), Y_{{im}_{i}} = \min \left \{Y_{{im}_{i}}^{0}, \max \left (C_{{im}_{i}}^{*}-X_{{im}_{i}}, 0\right)\right \}, \Delta ^{X}_{{im}_{i}}= I\left (X_{{im}_{i}}^{0} < C_{{im}_{i}}^{*}\right)$ and $\Delta ^{Y}_{{im}_{i}}=0$. Figure 1 shows an illustration of this process where the Type I gap of the last episode is observed but the Type II gap following it is censored (i.e., $0 < C_{{im}_{i}}^{*}-X_{{im}_{i}} < Y_{{im}_{i}}^{0}$).

The BivRec package was designed to analyze bivariate alternating recurrent data with the form as depicted in Fig. 1. It provides a consistent and user-friendly set of functions to: explore and visualize the data, estimate and plot the joint cumulative distribution function (cdf), the marginal survival and the conditional cdf using nonparametric methods [1] and fit semiparametric accelerated failure time models [2, 3] to estimate the effect of covariates on the two alternating gap times.

BivRec version 1.2.1 is available through CRAN at https://cran.r-project.org/package=BivRec. The reference manual can be found in both CRAN and Additional file 1 of this paper. The implementation in R was designed for users with some experience analyzing survival data and follows conventions used in similar R packages. Most functions in BivRec are S4 methods that produce S4 class objects. Where possible functions have been optimized using Fortran 90 to reduce running time.

Implementation

Nonparametric analysis

In 2005, Huang and Wang [1] developed a nonparametric method for estimating the joint distribution of the two types of alternating gap times, which is a useful data summary tool for bivariate alternating recurrent event data. In the BivRec package, we implemented Huang and Wang’s (2005) [1] nonparametric methods in the function bivrecNP() to estimate:

1.
the joint cumulative distribution function (cdf) for the two types of gap times, $\Pr \left (X_{ij}^{0} \leq x, Y_{ij}^{0} \leq y\right)$, and its associated standard error (in an output data frame and a contour plot),
2.
the marginal survival function for the Type I gap times, $\Pr \left (X_{ij}^{0} > x\right)$, and its associated standard error (in an output data frame and a survival plot), and
3.
the conditional cdf for the Type II gap times given the Type I gap times, $\Pr \left (Y_{ij}^{0} \leq y|X_{ij}^{0} \leq x\right)$, and its associated standard error (in an output data frame and a conditional cdf plot).

For estimation for the cdf, assume that there exists a subject-level latent variable Ψ_i with an unspecified cdf, P_Ψ(·) such that the bivariate gap times $\left (X^{0}_{ij}, Y^{0}_{ij}\right), j=1, 2,\ldots $ are identically and independently distributed (i.i.d.) given Ψ_i and that the censoring time C_i is independent of (N_i,Ψ_i). Define variables $Z^{0}_{ij}= X^{0}_{ij} + Y^{0}_{ij}$ and $\boldsymbol {W}^{0}_{ij} = \left (X^{0}_{ij}, Y^{0}_{ij}\right)$, then their joint cdf is $F_{Z^{0}, \boldsymbol {W}^{0}}( {z}, \boldsymbol {u})=\Pr \left (X^{0}_{i1} + Y^{0}_{i1} \leq {z}, X^{0}_{i1} \leq x, Y^{0}_{i1} \leq y\right)$ for z=x+y and u=(x,y). The marginal survival function of $Z^{0}_{ij}$ is hence $S_{Z^{0}}( {z}) = 1 - F_{Z^{0}, \boldsymbol {W}^{0}}( {z}, (\infty, \infty))$. Our interest lies in the estimation of the joint distribution $F_{X^{0}, Y^{0}}(x, y)= \Pr (X^{0}_{ij} \leq x, Y^{0}_{ij} \leq y)$, which is determined by $F_{Z^{0}, \boldsymbol {W}^{0}}( {z}, \boldsymbol {u})$ through the identity $F_{X^{0}, Y^{0}}(x, y) = F_{Z^{0}, \boldsymbol {W}^{0}}(x+y, (x,y))$. Let F_a(z,u)=E[a_iI(Z_i1≤z,W_i1≤u,Δ_i1=1)] and R_a(z)=E[a_iI(Z_i1≥z)], where the weight a_i=a(C_i) is a non-negative function of C_i and satisfies $\mathrm {E}[a_{i}^{2}]<\infty $, with a special case of a_i=1, i.e., no weights. Then, following Huang and Louis (1998) [4], it is shown that $F_{Z^{0}, \boldsymbol {W}^{0}}( {z}, \boldsymbol {u}) = \int _{0}^{ {z}} S_{Z^{0}}(s-) \frac {F_{a}(ds,\boldsymbol {u})}{R_{a} (s)}$, where $S_{Z^{0}}(\cdot)$ is the marginal survival function of $Z^{0}_{ij}$ [1]. Noticing that F_a(ds,u) and R_a(s) can be replaced with their respective empirical estimators and that the survival function $S_{Z^{0}}(\cdot)$ can be estimated by the estimator of [5] for univariate (i.e., single-type) recurrent gap times $Z^{0}_{ij}$, one can estimate the joint distribution of interest for any (x,y) satisfying x+y≤τ_c as follows:

$$ {{\begin{aligned} \hat{F}_{X^{0}, Y^{0}}(x,y) = \sum\limits_{t_{k}^{*} \leq x+y} \prod_{l< k} \left\{1- \frac{\hat{H}_{a}(t_{l}^{*}, \infty)}{\hat{R}_{a}(t_{l}^{*})}\right\} \frac{\hat{H}_{a}(t_{k}^{*}, (x,y))}{\hat{R}_{a}(t_{k}^{*})}, \end{aligned}}} $$

(1)

where $t_{1}^{*}, t_{2}^{*}, \dots t_{K}^{*}$ are the distinct and uncensored recurrence times from $\{Z_{ij}, j= 1, \dots m_{i}^{*}, i=1, \dots, n\}$ with $m_{i}^{*} = m_{i} - 1$ for m_i≥2 and $m_{i}^{*}=1$ for m_i=1, $\hat {H}_{a}( {z},\boldsymbol {u}) = n^{-1} \sum \limits _{i=1}^{n} \frac {a_{i} I(m_{i} \geq 2)}{m_{i}^{*}} \sum \limits _{j=1}^{m_{i}^{*}}I(Z_{ij} = {z}, \boldsymbol {W}_{ij} \leq \boldsymbol {u})$ and $\hat {R}_{a}( {z}) = n^{-1} \sum \limits _{i=1}^{n} \frac {a_{i}}{m_{i}^{*}} \sum \limits _{j=1}^{m_{i}^{*}} I(Z_{ij} \geq {z})$.

Next, we focus on the estimation of the marginal survival function for Type I gap times, $S_{X^{0}}(x)=\Pr (X^{0}_{i1}\ge x)$. Let $x_{1}^{*}, x_{2}^{*}, \dots x_{M}^{*}$ denote the distinct and uncensored Type I gap times from $\{X_{ij}, j= 1, \dots m_{i}^{*}, i=1, \dots, n\}$. Following Wang and Chang (1999) [5], Huang and Wang in 2005 [1] proposed to estimate $S_{X^{0}}(x)$ for x≤τ_c with:

$$\hat{S}_{X^{0}}(x) = \prod_{x^{*}_{k} \leq x} \left\{1- \frac{\hat{H}_{X}(x^{*}_{k})}{\hat{R}_{X}(x^{*}_{k})}\right\}, $$

where $\hat {H}_{X}(t) = \frac {1}{n} \sum \limits _{i=1}^{n} \frac {a_{i} I(m_{i} \geq 2)}{m_{i}^{*}} \sum \limits _{j=1}^{m_{i}^{*}}I(X_{ij} = t)$ and $\hat {R}_{X}(t) = \frac {1}{n} \sum \limits _{i=1}^{n} \frac {a_{i}}{m_{i}^{*}} \sum \limits _{j=1}^{m_{i}^{*}} I(X_{ij} \geq t).$ Finally, they argued that the marginal distribution of Type II gap times $Y^{0}_{ij}$ is not estimable due to induced dependent censoring, and hence proposed to estimate the conditional distribution of Type II gap times given Type I gap times, $F_{Y^0|X^{0}}(y|x) = \Pr (Y^{0}_{i1} \leq y | X^{0}_{i1} \leq x)$ using the estimator

$$\hat{F}_{Y^0|X^{0}}(x,y) = \frac{\hat{F}_{X^{0}, Y^{0}}(x,y)}{1 - \hat{S}_{X^{0}}(x)}, \text{for}\ x+y \leq \tau_{c}. $$

Similarly, one can estimate $\Pr (Y^{0}_{i1} \leq y |x_{2}\leq X^{0}_{i1} \leq x_{1})$ by $ \frac {\hat {F}_{X^{0}, Y^{0}}(x,y)}{\hat {S}_{X^{0}}(x_{1}) - \hat {S}_{X^{0}}(x_{2})}$ for x₁+y≤τ_c.

The standard errors of the estimators for the joint distribution and the marginal survival function of Type I gap times are estimated based on the large sample properties of these estimators proved in [1]. Briefly, for 0≤x+y≤L,L<τ<τ_c with τ being the maximal support of R_a(z), $\sqrt {n}\{\hat {F}_{X^{0}Y^{0}}(x,y) - F_{X^{0}Y^{0}}(x,y)\}$ weakly converges to a Gaussian process with mean zero and variance-covariance function Σ=E[ϕ₁(z₁,u₁)ϕ₁(z₂,u₂)]. Similarly, $\sqrt {n} \{\hat {S}_{x^{0}}(x) - S_{x^{0}}(x)\}$ converges weakly to a Gaussian process with mean zero and with variance-covariance function $\Sigma _{s} = S_{x^{0}}(x_{1})S_{x^{0}}(x_{2})\mathrm {E}[\xi _{1}(x_{1})\xi _{1}(x_{2})]$ where x₁,x₂∈[0,L]. The definitions of ϕ₁(z,u) and ξ₁(x) can be found in [1]. For the conditional distribution estimator, $\hat F_{Y^{0}|X^{0}}(y|x)$, the package provides the bootstrap standard error and confidence intervals.

Semiparametric regression

Researchers, especially in clinical settings, are often more interested in regression methods that allow them to understand the relationship between covariates and the recurrent event process. In this regard, Chang in 2004 [2] proposed a semiparametric accelerated failure time (AFT) model which allows the estimation of the covariate effects on the two types of alternating gap times to be done simultaneously (referred to as Chang’s method hereinafter). Recognizing that the estimation of the AFT model coefficients by Chang is based on a nonsmooth, rank-based estimating function, Lee et al. in 2018 [3] proposed a smooth, U-statistic-based estimating function whose solution is found to be more computationally tractable (referred to as Lee et al.’s method hereinafter). We now briefly review the AFT model for bivariate recurrent gap times and the estimation methods developed by [2] and [3] and implemented in the function bivrecReg().

Let A_i denote a p×1 vector of baseline covariates of subject i and γ_i=(γ_i1,γ_i2)^′, a subject-specific latent vector that carries information for within subject correlations among the recurrent gap times. The censoring time C_i is assumed to be independent of (N_i,A_i,γ_i). The AFT model assumes that conditioning on A_i and γ_i, the bivariate gap time pairs $(X_{ij}^{0},Y_{ij}^{0}),j=1,2,\ldots $ are i.i.d. within subject i. Furthermore, each (log) gap time is linearly related to the covariates as follows:

$$\begin{array}{@{}rcl@{}} \log X_{ij}^{0} &=& \gamma_{i1} + \boldsymbol{A}_{i}'\boldsymbol{\beta}_{1} + \varepsilon_{ij1},\\ \log Y_{ij}^{0} &=& \gamma_{i2} + \boldsymbol{A}_{i}'\boldsymbol{\beta}_{2} + \varepsilon_{ij2}, \end{array} $$

(2)

where β₁,β₂ denote the regression coefficients for Type I and Type II gap times, respectively, and ε_ijk,k=1,2, are mutually independent random errors with mean zero. Both the errors and the latent vectors γ_i come from unspecified distributions.

In 2004, Chang [2] considered the transformed, complete gap times given by $\tilde {X^{0}_{ij}}(\boldsymbol {b})=X^{0}_{ij}\exp (-\boldsymbol {A}_{i}^{\prime } \boldsymbol {b}_{1})$ and $\tilde {Z^{0}_{ij}}(\boldsymbol {b})=X^{0}_{ij}\exp (-\boldsymbol {A}_{i}^{\prime } \boldsymbol {b}_{1})+Y^{0}_{ij}\exp (-\boldsymbol {A}_{i}^{\prime } \boldsymbol {b}_{2})$, where $\boldsymbol {b}=(\boldsymbol {b}_{1}^{\prime },\boldsymbol {b}_{2}^{\prime })^{\prime }$. Their observed counterparts are:

$${{\begin{aligned} \tilde{X}_{ij}(\boldsymbol{b})=\min\left\{\tilde{X}^{0}_{ij}(\boldsymbol{b}),C_{i} \exp(-\boldsymbol{A}^{\prime}\boldsymbol{b}_{1})-\sum\limits_{l=1}^{j-1}\tilde{Z}_{il}(\boldsymbol{b})\right\}, \text{and} \end{aligned}}} $$

$$\tilde{Z}_{ij}(\boldsymbol{b})=\min\left\{\tilde{Z}^{0}_{ij}(\boldsymbol{b}),C_{i} \exp(-\boldsymbol{A}^{\prime}\boldsymbol{b}_{1})-\sum\limits_{l=1}^{j-1}\tilde{Z}_{il}(\boldsymbol{b})\right\}. $$

The rank-based estimating functions [2] are:

$$\begin{array}{@{}rcl@{}} U_{1}(\boldsymbol{b}) &=& n^{-1/2} \sum\limits_{i=1}^{n}\frac{1}{m_{i}^{*}}\sum\limits_{j=1}^{m_{i}^{*}} \left[\boldsymbol{A}_{i}-\frac{S_{1}^{X}\{\boldsymbol{b},\tilde{X}_{ij}(\boldsymbol{b})\}}{S_{0}^{X}\{\boldsymbol{b},\tilde{X}_{ij}(\boldsymbol{b})\}}\right],\\ U_{2}(\boldsymbol{b}) &=& n^{-1/2} \sum\limits_{i=1}^{n}\frac{1}{m_{i}^{*}}\sum\limits_{j=1}^{m_{i}^{*}} \left[\boldsymbol{A}_{i}-\frac{S_{1}^{Z}\{\boldsymbol{b},\tilde{Z}_{ij}(\boldsymbol{b})\}}{S_{0}^{Z}\{\boldsymbol{b},\tilde{Z}_{ij}(\boldsymbol{b})\}}\right], \end{array} $$

where $S_{k}^{X}(\boldsymbol {b},t) = n^{-1}\sum \limits _{i=1}^{n}\frac {1}{m_{i}^{*}}\sum \limits _{j=1}^{m_{i}^{*}}\boldsymbol {A}_{i}^{\otimes k}I(\tilde {X}_{ij}(\boldsymbol {b})\ge t)$ and $S_{k}^{Z}(\boldsymbol {b},t) = n^{-1}\sum \limits _{i=1}^{n}\frac {1} {m_{i}^{*}}\sum \limits _{j=1}^{m_{i}^{*}}\boldsymbol {A}_{i}^{\otimes k}\cdot I(\tilde {Z}_{ij}(\boldsymbol {b})\ge t)$ for k=0,1 and $\boldsymbol {A}_{i}^{\otimes k} =1$ for k=0; $\boldsymbol {A}_{i}^{\otimes k} = \boldsymbol {A}_{i}$ for k=1. Denote the solution to U₁(b)=0 and U₂(b)=0 by $\hat {\beta }_{\text {Chang}}$. Chang (2004) proposed to use the resampling method in [6] to estimate the covariance matrix for $\hat {\boldsymbol {\beta }}_{\text {Chang}}$. Since both the point estimate and the resampling-based interval estimate rely on solving nonsmooth estimating functions, fitting the AFT model with Chang’s method can be computationally inefficient and even encounter a nonconvergence problem, in which case the R function will give the user an error message.

Motivated by a multi-state model [7], Lee et al. [3] defined the transformed gap times $X^{0}_{ii^{\prime }j}(\boldsymbol {b}_{1}) = \exp \left (\boldsymbol {A}_{ii^{\prime }}^{\prime }\boldsymbol {b}_{1}\right)X_{ij}^{0}$ and $Z^{0}_{ii^{\prime }j}(\boldsymbol {b}) = \exp \left (\boldsymbol {A}_{ii^{\prime }}^{\prime }\boldsymbol {b}_{1}\right)X_{ij}^{0} + \exp \left (\boldsymbol {A}_{ii^{\prime }}^{\prime }\boldsymbol {b}_{2}\right)Y_{ij}^{0}$ with $\boldsymbol {A}_{ii^{\prime }} = \boldsymbol {A}_{i^{\prime }}-\boldsymbol {A}_{i}$. Their observed counterparts are:

$$X_{ii^{\prime}j}(\boldsymbol{b}_{1}) = \exp(\boldsymbol{A}_{ii^{\prime}}^{\prime}\boldsymbol{b}_{1})X_{ij},\ \text{and} $$

$$Z_{ii^{\prime}j}(\boldsymbol{b}) \,=\, \exp(\boldsymbol{A}_{ii^{\prime}}^{\prime}\boldsymbol{b}_{1})X_{ij} + \exp(\boldsymbol{A}_{ii^{\prime}}^{\prime}\boldsymbol{b}_{2})Y_{ij} \text{ for}\ j \,=\, 1, \dots, m_{i}^{*}\!. $$

The authors use O_L(·,·), a symmetric, continuous function on {(t,s):0≤t≤L,0≤s≤L} such that O_L(s,t) is monotonic in t if s is given and vice versa to derive the following U-statistic-based estimating equations:

$${{\begin{aligned} D_{1}^{*}(\boldsymbol{b}_{1}) = n^{-2} \sum\limits_{i=1}^{n} \left[ \sum\limits_{i^{\prime}=1}^{n} \boldsymbol{A}_{ii^{\prime}} \frac{1}{m_{i}^{*}} \sum\limits_{j=1}^{m_{i}^{*}} \frac{\Delta_{ij}^{X} O_{L_{1}} \{ X_{ij}, X_{ii^{\prime}j}(\boldsymbol{b}_{1})\}}{\hat{G}_{1}\{X_{ij} \wedge L_{1} \}} \right], \end{aligned}}} $$

$${{\begin{aligned} D_{2}^{*}(\boldsymbol{b}) = n^{-2} \sum\limits_{i=1}^{n} \left[ \sum\limits_{i^{\prime}=1}^{n} \boldsymbol{A}_{ii^{\prime}} \frac{1}{m_{i}^{*}} \sum\limits_{j=1}^{m_{i}^{*}} \frac{\Delta_{ij}^{Y} O_{L_{2}} \{ Z_{ij}, Z_{ii^{\prime}j}(\boldsymbol{b})\}}{\hat{G}_{2}\{Z_{ij} \wedge L_{2}\}} \right], \end{aligned}}} $$

where $\hat {G}_{1}$ and $\hat {G}_{2}$ are Kaplan-Meier estimators of the survival function of the censoring time G(·) based on the data $\{(X_{i1}, 1-\Delta _{i1}^{X}), i=1, \dots, n\}$ and $\{(Z_{i1}, 1-\Delta _{i1}^{Y}), i=1, \dots, n\}$, respectively, and L₁<τ_c and L₂<τ_c are limits that ensure respect for the support of G(·). The bivariate functions in the estimating functions, $O_{L_{1}}$ and $O_{L_{2}}$ are not necessarily the same. Current implementations used L₁=L₂=L and O_L(s,t)= log[ min{max(t,s),L}]− log(L). The regression coefficient estimators, $\hat {\boldsymbol {\beta }}_{1}$ and $\hat {\boldsymbol {\beta }}_{2}$ can be obtained by inductively solving $D_{1}^{*}(\boldsymbol {b}_{1}) = 0$ and $D_{2}^{*}((\hat {\boldsymbol {\beta }}_{1}', \boldsymbol {b}_{2}^{\prime })^{\prime })=0$. Note that the smooth and monotonic nature of the estimating equations guarantees a unique solution, a property not possessed by Chang’s method. Moreover, Lee et al. proved the weak convergence of $n^{1/2}(\hat {\boldsymbol {\beta }} - \boldsymbol {\beta })$ to a mean zero normal distribution with variance that can be consistently estimated by $\hat {\Sigma }_{\hat {\boldsymbol {\beta }}}^{-1} \hat {\Omega }^{*} (\hat {\Sigma }_{\hat {\boldsymbol {\beta }}}^{-1})'$ where the definitions of $\hat {\Sigma }_{\hat {\boldsymbol {\beta }}}$ and $\hat {\Omega }^{*}$ can be found in [3].

Results

We use a subset from the South Verona Psychiatric Case Register (PCR) [8] to illustrate functions in BivRec for data exploration, visualization and analysis in R. Our PCR sample contains data on patients’ care and break periods and disease-related and socioeconomic factors such as age at disease onset, education level, and sex from 336 patients with schizophrenia or related disorders with their conditions first recorded between 1981 and 1995 in South Verona, Italy. We focus on two covariates, one categorical, EDU, and one continuous, Age10, that were previously studied in [3]. These correspond to the education level and age at onset (in 10 years), respectively. We also show a simulated data set and how to simulate data with a function in BivRec in a later section.

Data preparation

Even though bivariate alternating recurrent data may be displayed in a wide or long format, in line with various longitudinal methods, the BivRec package requires that the data is in a long format, with possibly multiple rows for each participant, reflecting the number of episodes that the participant experienced. In addition to a set of baseline covariates (which repeats for each row if a participant has more than one row in the data), the long format data should have six columns corresponding to $i,j,X_{ij},Y_{ij},\Delta _{ij}^{X}$, and $\Delta _{ij}^{Y}$ which are defined in the previous section and can be specified in the statements of a data object function bivrecSurv() by “id=”, “episode=”, “xij=”, “yij=”, “d1=”, and “d2=”, respectively. See Table 1 for the detailed definitions of these arguments. All the functions for data exploration and analysis in BivRec use the data object created with this function. We used the PCR data to create a bivrecSurv() object as follows (see the package manual [9] for further details):

Table 1 Arguments and compatible standard functions for function bivrecSurv()

Full size table

Note that in practice, the input data may have quality issues such as unequal length of variables, negative values for the gap time variables (xij and yij), gaps or non-integers in the episode variable within a subject (e.g., j=1,3,4 or j= a, b, c), unreasonable values in the censoring indicators (e.g., (d1, d2) = (0, 1) for any episode or (d1, d2) = (1, 1) for the last episode), or no subjects having any uncensored episodes observed (i.e., all d2 = 0). In these cases, the user will get an error message such as “Error: Data not cleaned” with possibly more details to help pinpoint the problem. However, missing values are allowed in the input data even though only subjects with complete data will be used in any subsequent analysis. It is worthwhile to mention that in applications, the time variables may not be observed continuously (e.g., in days); for example, a subject’s censoring event could occur on the same day as the last observed event, causing the situation of (d1, d2) = (1, 1) for the last episode of events (j=m_i). For these subjects, we suggest users to add a small quantity to the censoring time during the data cleaning process.

Data exploration

We begin the exploration of this data set by obtaining a visualization of the care and break periods (Fig. 2) using the plot() function on a bivrecSurv object in the following way:

The resulting plot shows that during the study, the majority of the observation time was spent out of the hospital or care facility, as accentuated by the larger portions of blue in the graph.

The data can also be viewed in subgroups defined by a categorical covariate of six or less levels if a user specifies such a covariate using the by argument of plot(). An example of this feature is shown by looking at times in and out of care based on education levels in the PCR data set. Figure 3 shows the resulting subgroup plots for the levels of the variable EDU, an indicator with values of one for participants with secondary or higher education levels and zero otherwise. Note that in this example, the covariate Age10 is automatically dropped from the by statement, since the function detects that Age10 is possibly a continuous variable. In addition, 10 subjects had missing values for the EDU variable, which is reflected in a message letting the user know only 326 of the subjects were used instead of the full 336 sample.

The “bluer” shade of the left panel compared to the one on the right in Fig. 3 indicates that patients with higher education level might have longer break periods than those with less education. Thus, considering education as a factor that affects the re-hospitalization process is reasonable. Note that more than one categorical covariate can be specified in the by argument, but subgroup plots are created for each covariate separately. If subgroups defined by the combinations of multiple categorical variables (e.g., sex and race) are desired, users can define a new categorical variable (with ≤6 levels) before applying the plot() function.

Nonparametric analysis

We use the function bivrecNP() to estimate the joint distribution $F_{X^{0}, Y^{0}}(x, y)$ for all combinations of values given by the options u1 and u2. The non-negative weight in the nonparametric estimator in Equation (1), a_i=a(C_i), is specified by the option ai. For the data example, we use a simple unit weight a(C_i)=1 by setting ai=1 (the default). If one sets ai=2, the weight will be the censoring time of each subject, a(C_i)=C_i. The function will automatically estimate the marginal survival probability for Type I gap times $S_{X^{0}}$ for all distinct and uncensored recurrence times. We set conditional=TRUE to request the conditional distribution estimate for Type II gap times given that Type I gap times fall into a certain interval specified by the given.interval argument. In the following example, we set given.interval=c(100, 500) to estimate $F_{Y^{0}|X^{0}}(y|100 \leq X^{0} \leq 500)$, and the confidence level of the point-wise confidence intervals is set by level=0.99 as an illustration that the confidence level can be changed to values different than 0.95. Table 2 and the package manual [9] provide further details of the function bivrecNP().

Table 2 Arguments and compatible standard functions for function bivrecNP()

Full size table

Using the function head() we show a snapshot of all the output elements. To see the entire data frame for all the output elements, use the R function print(). The individual output elements (joint_cdf, marginal_survival, and conditional_cdf) can also be retrieved using the $ operator following the output object.

We use the plot() function on the resulting object to simultaneously generate plots for all the estimated distributions. To individually plot the joint cdf, the marginal survival probability and the conditional cdf, one can use the plotJoint, plotMarg and plotCond functions, respectively.

Figure 4 is the contour plot of the joint cdf in the half-plane where estimations meet the condition x+y≤τ_c, for τ_c=5697. The joint cdf shows that, for instance, the probability that participants were in care for 2000 days or less and out of care for 3000 days or less is over 0.70. Note that the choices of u1 and u2 in the code will determine if the plot of the joint distribution is drawn on a half plane or not. For instance, if we had coded u1=seq(100, 2000, 50) and u2=seq(100, 3000, 50), then all the (x,y) pairs defined by u1 and u2 would be estimable because max(u1)+ max(u2)≤τ_c. In addition, the resulting object (npresult in this example) will show values of NA for the joint probability for any combination of u1 and u2 that does not meet the x+y≤τ_c condition.

The line plots of the estimated marginal survival probability and conditional cdf (with corresponding 99% confidence intervals in this example) are shown in Fig. 5. The left panel shows that care times shorter than a year and a half are the most likely as the survival probability drops sharply before 500 days. With this in mind, we look at the conditional cdf plot for patients who received between 100 and 500 days of care and conclude that the probability that those patients spent 3000 or fewer days out of care was close to 80%. The bivrecNP() function with conditional=TRUE specification for the PCR data produced results in 248.39 seconds (4.14 minutes, using an AMD Ryzen 5 3500U with Radeon Vega Mobile Gfx 2.10 GHz processor with 16 GB RAM for this and other analyses in this paper). Note that the computing time would be shorter if a less fine grid was specified for u1 and u2. The visualizations for the results, shown in Figs. 4 and 5 ran in less than 2 seconds each.

Semiparametric regression

The function bivrecReg() provides point estimates, standard error estimates, an estimated variance-covariance matrix and confidence intervals at a specified confidence level for the effects of covariate(s) on the two types of gap times, based on the Chang’s method and Lee et al.’s method. Using the PCR data, we show how to estimate the effects of the education and age of disease onset on the two types of gap times $X_{ij}^{0}$ (care time) and $Y_{ij}^{0}$ (break time) by fitting the AFT model using the function bivrecReg(). We again need to specify a bivrecSurv object as the response, but now we add the covariates to the right hand side of the formula (for further details on the syntax of this function see Table 3 and the reference manual of this package [9]). Note that only baseline (i.e., time-invariant) covariates can be fit in the AFT model in Eq. (2), otherwise, user will get the message, ~Error: Time-varying covariates not allowed.~ We use the function’s default estimation method, Lee et al.’s method, to obtain estimates and Wald confidence intervals based on the asymptotic standard error estimate. This method can also be explicitly specified by the option method=~Lee.et.al~. The messages shown while the function is running and the model fitting output from summary() are as follows.

Table 3 Arguments and compatible standard functions for function bivrecReg()

Full size table

We show the confidence interval for education and the full variance-covariance matrix using the functions confint() and vcov(), respectively. Similar to the standard R function confint(), one can specify the confidence level and an individual parameter of interest. If needed, one could also see coefficient results using the function coef().

Based on the above results, we conclude that patients in the secondary education or higher (EDU=1) group tended to have 1.76 (=e^0.56, p-value =0.007) times longer care periods and 1.21 (=e^0.19, p-value >0.05) times longer break periods than patients with less education (EDU=0) after adjusting for covariate Age10. In addition, a ten year delay in age of disease onset was associated with a 1.27-fold (=e^0.24, p-value =0.001) increase in the length of break period and a minimal change in the care period of less than 1% (=e^0.007, p-value >0.05), holding EDU constant.

We also fit the AFT model using Chang’s method for point estimation along with Parzen’s resampling algorithm to obtain standard error estimates and construct Wald confidence intervals. The results are as follows.

It is important to point out that due to the resampling algorithm needed to obtain variances, applying Chang’s method to the PCR data (n=326 after omitting missing values for EDU) led to substantially longer computing times than applying Lee et al.’s method (56 minutes vs. 67 seconds). Similarly, applying these methods to a simulated data set with a smaller sample size, n=150, led to a much shorter computing time (602 and 16 seconds for the two methods, respectively). In addition, as discussed earlier, the convergence of Chang’s method is not guaranteed for any dataset, and when the convergence is not achieved, users get the following error message, ~Error: Max Iterations reached. Did not converge.~ For these reasons, we recommend to use the default ~Lee.et.al~ method for fitting the AFT model. For a comparison of the estimation bias, standard error, and coverage probability between these two methods with more extensive simulation studies, see [3]. Nonetheless, we show how to use function simBivRec() provided in the BivRec package to simulate a bivariate alternating recurrent event data. This provides a way for users to perform their own simulation studies.

Following [3], in the example below, we show how to simulate a dataset with a categorical covariate a1 from a binomial distribution with success probability 0.5 and a continuous covariate a2 from a uniform (0,1) distribution using the function simBivRec(). This function has options to simulate data based on the various scenarios outlined in [3]. As an example, we set the parameters in the same way as for the scenario presented in the top panel of Table 3 in [3]: the sample size is n=150 (nsize=150); the regression coefficients for the effect of covariate a1 on the Type I and Type II gap times are set as beta1=c(0.5, 0.5); those for covariate a2 are set as beta2=c(0, -0.5); and the support of the uniform distribution (0,τ_c) for the censoring time is set with tau_c=63, which yields censoring rate of 15% for the first bivariate gap time pairs, on average. Additional parameters for the within-subject correlation structure of the gap times are set through the option, set=1.1, meaning the first parameter setting of simulation scenario 1 in [3].

The first few lines of the simulated data are shown as follows, where id is the subject ID (which can take both numerical and string values in the package), epi is the episode number of a gap time pair, xij and yij are the j^th episode of the observed Type I and Type II gap times of subject i, respectively, ci is the overall censoring time of subject i, d1 and d2 are their corresponding censoring indicators, and a1 and a2 are two baseline covariates.

Conclusions

Despite the growing need and interest in the study of recurrent event data along with the development of R packages for its analysis such as survrec [10], reda [11] and reReg [12], software for analyzing bivariate alternating recurrent events has been lacking. In this paper, we reviewed nonparametric and semiparametric regression methods for gap times between alternating recurrent events and demonstrated how to use the BivRec package that we developed in R to perform these analyses. We also demonstrated BivRec’s capabilities for data visualization and simulation.

There is still a need for further package development such as additions of univariate recurrent gap times (i.e., gap times of recurrent events of the same type) methods as the degenerated case of the bivariate alternating gap times data such as the seminal nonparametric work by Wang and Chang (1999) [5] and the subsequent semiparametric regression methods by various authors [2, 13–16]. Finally, some competing, intensity-based models for bivariate alternating recurrent event data, which are not included in our package, can be found in [17] and [18].

Availability and requirements

Project name: Bivariate Alternating Recurrent Event Data Analysis

Project home page: https://cran.r-project.org/web/packages/BivRec/index.html

Operating system(s): Platform independent

Programming language: R

Other requirements: R 3.5.0 or higher

License: GPL-3

Any restrictions to use by non-academics: None

Availability of data and materials

Requests for the South Verona PCR data can be made by contacting the principal investigator of the WHO Collaborating Centre for Research and Training in Mental Health and Service Evaluation at https://www.dnbm.univr.it/?ent=bibliocr&id=254&tipobc=6&lang=en. R source code is available at https://github.com/SandraCastroPearson/BivRec.

Abbreviations

PCR:: Psychiatric Case Register
AFT:: Accelerated Failure Time

References

Huang C-Y, Wang M-C. Nonparametric estimation of the bivariate recurrence time distribution. Biometrics. 2005; 61:392–402.
Article Google Scholar
Chang S-H. Estimating marginal effects in accelerated failure time models for serial sojourn times among repeated events. Lifetime Data Anal. 2004; 10:175–90.
Article Google Scholar
Lee CH, Huang C-Y, Xu G, Luo X. Semiparametric regression analysis for alternating recurrent event data. Stat Med. 2018; 37:996–1008.
Article Google Scholar
Huang Y, Louis TA. Nonparametric estimation of the joint distribution of survival time and mark variables. Biometrika. 1998; 85:785–98.
Article Google Scholar
Wang M-C, Chang S-H. Nonparametric estimation of a recurrent survival function. J Am Stat Assoc. 1999; 94:146–53.
Article Google Scholar
Parzen MI, Wei LJ, Ying Z. A resampling method based on pivotal estimating functions. Biometrika. 1994; 81:341–50.
Article Google Scholar
Huang Y. J R Stat Soc Ser B (Stat Methodol). 2002; 64:17–29.
Tansella M. Community-based psychiatry: long-term patterns of care in south-verona. Psychol Med Monogr Suppl. 1991; 19:1–54.
Article Google Scholar
Castro-Pearson S, Sur A, Lee CH, Huang C-Y, Luo X. BivRec: Bivariate alternating recurrent event data analysis. 2021. https://CRAN.R-project.org/package=BivRec. Accessed 10 June 2021.
González JR, Peña EA, Strawderman RL. survrec: Survival analysis for recurrent event Data. 2012. https://CRAN.R-project.org/src/contrib/Archive/survrec/. Accessed 1 May 2021.
Wang W, Fu H, Chiou SH, Yan J. reda: Recurrent event data analysis. 2020. https://CRAN.R-project.org/package=reda. Accessed 22 Sep 2021.
Chiou SH, Huang C-Y. reReg: Recurrent event regression. 2020. https://CRAN.R-project.org/package=reReg. Accessed 22 Sep 2021.
Huang Y, Chen YQ. Marginal regression of gaps between recurrent events. Lifetime Data Anal. 2003; 9:293–303.
Article Google Scholar
Lu W. Marginal regression of multivariate event times based on linear transformation models. Lifetime Data Anal. 2005; 11:389–404.
Article Google Scholar
Sun L, Park D-H, Sun J. The additive hazards model for recurrent gap times. Stat Sin. 2006; 16:919–32.
Google Scholar
Luo X, Huang C-Y, Wang L. Quantile regression for recurrent gap time data. Biometrics. 2013; 69:375–85.
Article Google Scholar
Xue X, Brookmeyer R. Bivariate frailty model for the analysis of multivariate survival time. Lifetime Data Anal. 1996; 2:277–89.
Article CAS Google Scholar
Li X, Chen Y, Li R. A frailty model for recurrent events during alternating restraint and non-restraint time periods. Stat Med. 2017; 36:643–54.
Article Google Scholar

Download references

Acknowledgements

The authors would like to thank Dr. Sy Han Chiou who has provided valuable suggestions during the development of the package.

Funding

National Institute of Mental Health R03MH112895 to XL, National Cancer Institute R01CA193888 to CYH, and the National Heart, Lung, and Blood Institute (NHLBI) T32HL129956 to SCP. The funders of the study had no role in study design, data collection, data analysis, data interpretation, or writing of the manuscript.

Author information

Authors and Affiliations

Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, USA
Sandra Castro-Pearson, Aparajita Sur & Xianghua Luo
Department of Biostatistics and Epidemiology, School of Public Health and Health Science, University of Massachusetts Amherst, Amherst, USA
Chi Hyun Lee
Department of Epidemiology and Biostatistics, University of California San Francisco, San Francisco, USA
Chiung-Yu Huang
Biostatistics Core, Masonic Cancer Center, University of Minnesota, Minneapolis, USA
Xianghua Luo

Authors

Sandra Castro-Pearson
View author publications
You can also search for this author in PubMed Google Scholar
Aparajita Sur
View author publications
You can also search for this author in PubMed Google Scholar
Chi Hyun Lee
View author publications
You can also search for this author in PubMed Google Scholar
Chiung-Yu Huang
View author publications
You can also search for this author in PubMed Google Scholar
Xianghua Luo
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

SCP wrote the BivRec package and drafted the manuscript. AS assisted in the development of the BivRec package and the manuscript draft. CHL and CYH revised the manuscript. XL initiated the research idea, helped in the design and development of the BivRec package and significantly revised the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xianghua Luo.

Ethics declarations

Ethics approval and consent to participate

Not applicable. No data were collected in this study.

Consent for publication

Not applicable.

Competing interests

XL is a member of the editorial board for the BMC Medical Research Methodology journal.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Castro-Pearson, S., Sur, A., Lee, C.H. et al. BivRec: an R package for the nonparametric and semiparametric analysis of bivariate alternating recurrent events. BMC Med Res Methodol 22, 92 (2022). https://doi.org/10.1186/s12874-022-01558-0

Download citation

Received: 30 September 2021
Accepted: 28 February 2022
Published: 03 April 2022
DOI: https://doi.org/10.1186/s12874-022-01558-0

BivRec: an R package for the nonparametric and semiparametric analysis of bivariate alternating recurrent events