### Hurdle model

The excess zeroes present in the data cannot be ignored. In count data, this is typically addressed using a zero-inflated model [18], which models the increased probability of observing zeroes in addition to the zeroes expected from the response distribution, e.g., a Poisson distribution. In the present study, however, the number of seconds on therapy is more closely represented by a normal distribution with a strictly positive domain (e.g., the log-normal or truncated normal distribution). Moreover, in this context, zeroes have only one interpretation, namely that of the therapy not being initiated on a given day. If we regard initiating treatment as a hurdle that patients need to overcome daily, we can model the initiation of therapy as a two-step process. This approach is referred to as hurdle modeling, and is generally applicable when a response variable takes distinct values conditional on the occurrence of an event.

A hurdle model comprises a finite mixture of a point mass at zero and a distribution with a positive domain. Excess zeroes in a count variable can arise from a significant hurdle or a factor preventing the event from happening, but can also arise when the time available for counting events is too short relative to the frequency at which the event occurs. Lee et al. [19] investigated the risk of miscarriage in women with sleep-disordered breathing using truncated Poisson hurdle regression. Hurdle modeling is not restricted to count data, as it can be applied with any distribution that does not contain the hurdle response value (e.g., a truncated distribution). Saberi et al. [20] investigated the percentage of HIV medication non-adherence using a Gamma hurdle model.

Let \(\mathbf{y}_{i}=\left\{ y_{i,1},y_{i,2},\ldots,y_{i,J_{i}}\right\}\) denote the adherence trajectory of patient *i*∈*I*, consisting of *J*_{i}=90 observations, where *I* denotes the set of available patients. Here, *y*_{i,j} denotes the time on therapy of patient *i* on the *j*th measurement at time *t*_{j}. The daily hurdle of initiating therapy can be modeled as a Bernoulli process with probability

$$ \Pr\left(H_{i,j}=h_{i,j}\right)=\left\{\begin{array}{ll} \nu_{j} & h_{i,j}=0\\ 1-\nu_{j} & h_{i,j}=1 \end{array}\right., $$

(1)

where *h*_{i,j}∈{0,1} denotes whether the hurdle is overcome for patient *i* at time *t*_{j}, and *ν*_{j}∈(0,1) represents the probability of failing to pass the hurdle at time *t*_{j}.
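As a minimal illustration of the daily hurdle process in Eq. (1), the sketch below simulates 90 Bernoulli draws in Python (the analysis itself is carried out in R); the constant failure probability ν_j = 0.3 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Probability of *failing* the hurdle on each of J = 90 days.
# nu_j may vary over time; a constant 0.3 is used here for illustration.
nu = np.full(90, 0.3)

# h[j] = 1 when the hurdle is overcome (therapy initiated) on day j,
# i.e., Pr(h_j = 1) = 1 - nu_j, matching Eq. (1).
h = rng.binomial(n=1, p=1.0 - nu)

print(h.mean())  # empirical pass rate, close to 1 - 0.3
```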

**Truncated normal hurdle model** With the exception of models involving two-sided truncation, as seen in double hurdle modeling, applications of hurdle models with a left-truncated normal distribution are scarce. Cragg [21] first proposed a truncated normal hurdle model for modeling the consumer demand of durable goods, to account for periods of time during which no purchases were made. For observations where the hurdle is passed (i.e., *h*_{i,j}=1), we assume the time on therapy *y*_{i,j} to be normally distributed with strictly positive values. The probability density function (PDF) is given by

$$ f_{\mathrm{N}}(y;\mu,\sigma)=\frac{\phi\left(\tfrac{y-\mu}{\sigma}\right)}{\sigma\left[1-\Phi\left(\tfrac{-\mu}{\sigma}\right)\right]}\quad y>0 $$

(2)

and zero otherwise, where *ϕ*(·) is the standard normal PDF, *Φ*(·) is the standard normal CDF, and *μ* and *σ* are the mean and standard deviation of the non-truncated normal distribution. If *X* has a normal distribution, the moments of the truncated normal distribution are then given by [22]

$$\begin{array}{*{20}l} {}\mathbf{E}(X|X >0)&= \mu+\sigma\frac{\phi\left(\tfrac{-\mu}{\sigma}\right)}{1-\Phi\left(\tfrac{-\mu}{\sigma}\right)},\\ {}\mathbf{E}\left(X^{2}|X >0\right)&= \mu^{2}+2\sigma\mu\frac{\phi\left(\tfrac{-\mu}{\sigma}\right)}{1-\Phi\left(\tfrac{-\mu}{\sigma}\right)}\\&\quad+\sigma^{2}\left[\frac{\tfrac{-\mu}{\sigma}\phi\left(\tfrac{-\mu}{\sigma}\right)}{1-\Phi\left(\tfrac{-\mu}{\sigma}\right)}+1\right], \end{array} $$

(3)

$$\begin{array}{*{20}l} {}\mathbf{E}\left(X^{3}|X > 0\right)&= \mu^{3}+3\mu\sigma^{2}\\&\quad+\sigma\left(\mu^{2}+2\sigma^{2}\right)\frac{\phi\left(\tfrac{-\mu}{\sigma}\right)}{1-\Phi\left(\tfrac{-\mu}{\sigma}\right)}. \end{array} $$

(4)
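The closed-form conditional moments above can be checked numerically; the Python sketch below (illustrative only, not part of the analysis pipeline) compares them against direct integration of the left-truncated normal density, using the first moment from Eq. (3) and the standard compact form of the third moment.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma = 2.0, 3.0
alpha = -mu / sigma                              # standardized truncation point
lam = norm.pdf(alpha) / (1 - norm.cdf(alpha))    # inverse Mills ratio

# Closed-form conditional moments of X | X > 0
m1 = mu + sigma * lam
m2 = mu**2 + 2 * sigma * mu * lam + sigma**2 * (alpha * lam + 1)
m3 = mu**3 + 3 * mu * sigma**2 + sigma * (mu**2 + 2 * sigma**2) * lam

# Reference values: numerical integration of y^k against the
# left-truncated normal density
def pdf(y):
    return norm.pdf((y - mu) / sigma) / (sigma * (1 - norm.cdf(alpha)))

for k, m in zip((1, 2, 3), (m1, m2, m3)):
    ref = quad(lambda y: y**k * pdf(y), 0, np.inf)[0]
    assert np.isclose(m, ref), (k, m, ref)
```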

The time on therapy for patient *i* at time *t*_{j} is distributed as

$$ \Pr\left(y_{i,j}\leq y\right)=\nu_{j}+(1-\nu_{j})F_{N}\left(y;\mu_{j},\sigma_{j}\right) $$

(5)

with *y*≥0 and *F*_{N} the cumulative distribution function of the truncated normal distribution.

The truncated normal hurdle distribution described here, denoted by TNH, is represented by three parameters. We allow each of the parameters to change over time. As such, each patient *i* at the *j*th observation at time *t*_{j} is represented by the probability *ν*_{j} of failing to pass the hurdle, the conditional mean *μ*_{j}, and the conditional standard deviation *σ*_{j}. In a single-group analysis the hurdle and the conditional time on therapy essentially operate on disjoint data, and can therefore be estimated separately, using logistic regression to model the hurdle, and truncated normal regression for the conditional time on therapy. However, this does not hold when the terms across the models are assumed to be correlated, or when a mixture of hurdle models is being estimated.
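Sampling from the TNH distribution follows the same two-step logic: draw the hurdle outcome, then, conditional on passing, draw from the left-truncated normal. A minimal Python sketch, with made-up parameter values (roughly 5.5 hours of mean usage); the study itself uses R.

```python
import numpy as np
from scipy.stats import truncnorm

def rtnh(n, nu, mu, sigma, rng):
    """Draw n values from the truncated normal hurdle (TNH) distribution:
    a point mass of nu at zero (hurdle not passed), otherwise a normal
    distribution truncated to positive values."""
    y = np.zeros(n)
    passed = rng.random(n) >= nu                  # hurdle overcome w.p. 1 - nu
    y[passed] = truncnorm.rvs(a=-mu / sigma, b=np.inf, loc=mu, scale=sigma,
                              size=int(passed.sum()), random_state=rng)
    return y

# Hypothetical parameters: 25% chance of not initiating therapy, otherwise
# about 20,000 s of usage with an 8,000 s standard deviation.
rng = np.random.default_rng(7)
y = rtnh(10_000, nu=0.25, mu=20_000.0, sigma=8_000.0, rng=rng)
```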

### Generalized additive modeling for location, scale and shape

GAMLSS is a method for modeling a numerical univariate response variable in terms of a general parametric distribution. Whereas generalized additive modeling (GAM) and generalized linear modeling (GLM) can only handle exponential family distributions and assume a variance as a function of the mean with a constant scaling factor [23, 24], GAMLSS can describe parametric response distributions by their mean (i.e., location *μ*), variance (i.e., scale *σ*) and shape (e.g., skewness and kurtosis) in terms of linear predictors and additive functions. Furthermore, through the inclusion of a distributional parameter for the excess zeros, hurdle and zero-inflated distributions can be handled.

GAMLSS was proposed by Rigby & Stasinopoulos [14, 25], and has been developed into a framework implemented in various packages in R [26, 27]. To describe our model in GAMLSS terms, let \(\boldsymbol{y}_{i}^{\top}=\left(y_{i,1},\ldots,y_{i,J_{i}}\right)\) denote the longitudinal measurements of patient *i*∈*I* from the set of patients *I*, with *y*_{i,j}∼TNH(*μ*_{i,j},*σ*_{i,j},*ν*_{i,j}). Each of the distributional parameters can be described by a linear model over the \(J=\sum _{i\in I}J_{i}\) observations across all patients. The PDF of the complete model is given by *f*_{Y}(*y*_{i,j};*μ*_{i,j},*σ*_{i,j},*ν*_{i,j}). For brevity, the predictor vector of length *J* for the *k*th distributional parameter is denoted by *d*_{k}, with *d*_{1}=*μ*, *d*_{2}=*σ*, and *d*_{3}=*ν*. The general random effects GAMLSS model [28] for the *k*th distributional parameter is given by

$$ g_{k}\left(\boldsymbol{d}_{k}\right)=\boldsymbol{X}_{k}\boldsymbol{\beta}_{k}+\sum_{m=1}^{M_{k}}\boldsymbol{Z}_{k,m}\boldsymbol{\gamma}_{k,m}, $$

(6)

where *g*_{k}(·) denotes the monotonic link function for the respective distribution parameter. The linear additive terms of the model are represented by a *J*×*L*_{k} design matrix denoted by *X*_{k} for *L*_{k} fixed effects, with coefficients \(\boldsymbol{\beta}_{k}^{\top}=\left(\beta_{k,1},\ldots,\beta_{k,L_{k}}\right)\). The *J*×*Q*_{k,m} design matrix *Z*_{k,m} models the random effects with *γ*_{k,m} as a vector of *Q*_{k,m} random variables. These random effects also allow for (penalized) smoothing as a function of an explanatory variable, e.g., cubic splines, P-splines, and fractional polynomials. An advantage of GAMLSS is that the random effects can be included in any of the distributional parameters, although this comes at the cost of increased computational complexity.

We limit the model complexity by representing each distributional parameter with a linear parametric predictor. In addition, the hierarchical nature of the longitudinal data needs to be taken into account. Patients have different levels of expected usage, variance, and attempts, arising from factors such as sleep schedule, quality of sleep, and tolerance of the therapy. We can account for these patient-specific differences by partitioning the random effects design matrix into patient-specific matrices. We only consider the case of *M*_{k}=1 (i.e., a random intercept model for each distributional parameter), and therefore omit the *m* subscript from the notation hereafter. The patient-specific random effects design matrices *Z*_{k,i} of order *J*_{i}×*Q*_{k} are concatenated to yield \(\boldsymbol {Z}_{k}^{\top }=\left [\boldsymbol {Z}_{k,1}^{\top }\left |\boldsymbol {Z}_{k,2}^{\top }\right |\cdots \left |\boldsymbol {Z}_{k,|I|}^{\top }\right.\right ]\) [28]. The random effects vector is denoted by \(\boldsymbol{\gamma}_{k,i}=(\gamma_{k,1,i},\ldots,\gamma_{k,Q_{k},i})\), with *γ*_{k,i}∼*N*(0,*Σ*_{k}) for each of the distributional parameters, where *Σ*_{k} is the variance-covariance matrix for the random effects of the respective distributional parameter.

Although we observe a marginally better fit using smoothing functions of time, we prefer modeling change using linear additive terms in this analysis for its lower complexity and greatly reduced computation time. We therefore model each of the distributional parameters using a second-order polynomial in time. The identity link function suffices for the mean *μ*_{i,j}, whereas a log link is used for the standard deviation *σ*_{i,j} in order to ensure positive values. The hurdle probability *ν*_{i,j} is modeled using logistic regression by assuming a logit link \(g_{3}(\nu _{i,j})=\log \left (\frac {\nu _{i,j}}{1-\nu _{i,j}}\right)\). Accordingly, the random effects model is given by

$$\begin{array}{*{20}l} \mu_{i,j} & =\beta_{1,0}+\beta_{1,1}t_{i,j}+\beta_{1,2}t_{i,j}^{2}+\boldsymbol{Z}_{1,i,j}\boldsymbol{\gamma}_{1,i}, \\ \log\sigma_{i,j} & =\beta_{2,0}+\beta_{2,1}t_{i,j}+\beta_{2,2}t_{i,j}^{2}+\boldsymbol{Z}_{2,i,j}\boldsymbol{\gamma}_{2,i}, \\ \log\frac{\nu_{i,j}}{1-\nu_{i,j}} & =\beta_{3,0}+\beta_{3,1}t_{i,j}+\beta_{3,2}t_{i,j}^{2}+\boldsymbol{Z}_{3,i,j}\boldsymbol{\gamma}_{3,i}. \end{array} $$

(7)

We will use this model to compare against the mixture model described in the next section.
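To make the link functions concrete, the Python sketch below evaluates the linear predictors of Eq. (7) for a single patient and maps them to the distributional parameters; all coefficient values are hypothetical, and this is an illustration rather than the R implementation used in the analysis.

```python
import numpy as np

def tnh_parameters(t, beta1, beta2, beta3, z1=0.0, z2=0.0, z3=0.0):
    """Evaluate the second-order polynomial predictors of Eq. (7) and map
    them to (mu, sigma, nu) via the identity, log, and logit links.
    z1..z3 represent the realized random-effect contributions
    Z_{k,i,j} gamma_{k,i} (zero when evaluating the population-level curve)."""
    T = np.column_stack([np.ones_like(t), t, t**2])   # [1, t, t^2]
    mu = T @ beta1 + z1                               # identity link
    sigma = np.exp(T @ beta2 + z2)                    # log link: sigma > 0
    nu = 1.0 / (1.0 + np.exp(-(T @ beta3 + z3)))      # inverse logit: nu in (0, 1)
    return mu, sigma, nu

# Hypothetical coefficients for a 90-day trajectory (time scaled to [0, 1])
t = np.linspace(0.0, 1.0, 90)
mu, sigma, nu = tnh_parameters(t,
                               beta1=np.array([20_000.0, 4_000.0, -2_000.0]),
                               beta2=np.array([9.0, 0.2, 0.0]),
                               beta3=np.array([-1.5, 0.5, 0.0]))
```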

#### Latent-class modeling

The findings from previous studies on CPAP adherence suggest a complex, non-normal distribution of adherence patterns [11, 12]. We therefore opt for a non-parametric approach to modeling the heterogeneity, by describing the patient-specific deviations from the population mean in terms of a finite number of structural deviations. In a cross-sectional data context, this approach is commonly referred to as finite mixture modeling [29]. This has the added benefit of accounting for the (possibly non-linear) relationship between the distributional parameters through the different clusters. In particular, an association can be expected between the attempt probability and the mean level of usage.

Growth mixture modeling (GMM) is an approach to modeling longitudinal change (i.e., a growth curve) that accounts for patient heterogeneity by assuming each patient belongs to one of several unobserved (i.e., latent) classes [30–32]. Because the class models include patient-specific random effects, the approach essentially assumes the heterogeneous data to consist of a set of heterogeneous subgroups.

The appeal of allowing for patient-specific deviations within the latent classes is that it places the emphasis on the change of adherence over time, as opposed to the expected average time on therapy. Without a random intercept, most of the group trajectories would represent differences in mean time on therapy, resulting in many constant group trajectories. To a lesser degree, patients may also exhibit different levels in their attempt probability and conditional standard deviation. However, in consideration of the increased model complexity with an increasing number of latent classes, we opt to simplify the class model. We therefore only include a random intercept \(\gamma _{i}^{(g)}\sim N(0,\sigma _{\gamma })\) for the mean level. Each latent class is described by its own model, where the model for class *g* is given by

$$\begin{array}{*{20}l} \mu_{i,j}^{(g)} & =\eta_{1,i,j}^{(g)}=\beta_{1,0}^{(g)}+\beta_{1,1}^{(g)}t_{i,j}+\beta_{1,2}^{(g)}t_{i,j}^{2}+\gamma_{i}^{(g)}, \\ \log\sigma_{i,j}^{(g)} & =\eta_{2,i,j}^{(g)}=\beta_{2,0}^{(g)}+\beta_{2,1}^{(g)}t_{i,j}+\beta_{2,2}^{(g)}t_{i,j}^{2}, \\ \log\frac{\nu_{i,j}^{(g)}}{1-\nu_{i,j}^{(g)}} & =\eta_{3,i,j}^{(g)}=\beta_{3,0}^{(g)}+\beta_{3,1}^{(g)}t_{i,j}+\beta_{3,2}^{(g)}t_{i,j}^{2}. \end{array} $$

(8)

Each of these class models represents a proportion of the overall heterogeneity in the data. The overall model is given by

$$ f(y_{i,j};\boldsymbol{\Theta},\boldsymbol{\pi})=\sum_{g=1}^{G}\pi_{g}f_{g}\left(y_{i,j};\boldsymbol{\beta}_{1}^{(g)},\boldsymbol{\beta}_{2}^{(g)},\boldsymbol{\beta}_{3}^{(g)},\sigma_{\gamma}\right) $$

(9)

where *Θ*={*θ*^{(1)},…,*θ*^{(G)}} comprises the group model parameters, *f*_{g} denotes the model for group *g*, and *π* is the vector of group proportions *π*_{g} for group *g* with *π*_{g}≥0 and \(\sum _{g}\pi _{g}=1\). The class assignment of patients is probabilistic, which is in contrast to other approaches such as longitudinal *k*-means (KML) where the cluster edges are well-defined but arbitrarily selected due to the distance measure used [33].
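The mixture density of Eq. (9) can be sketched as follows in Python (illustrative only): the TNH "density" combines the point mass ν at zero with the truncated normal density for positive values, and the mixture is a proportion-weighted sum over classes. The class parameters below are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def tnh_pdf(y, mu, sigma, nu):
    """TNH density: probability mass nu at zero, and (1 - nu) times the
    left-truncated normal density of Eq. (2) for y > 0."""
    f_pos = norm.pdf((y - mu) / sigma) / (sigma * (1 - norm.cdf(-mu / sigma)))
    return np.where(y == 0, nu, (1 - nu) * f_pos)

def mixture_pdf(y, pis, params):
    """G-component mixture of TNH densities; params holds one
    (mu, sigma, nu) tuple per latent class."""
    return sum(pi * tnh_pdf(y, *theta) for pi, theta in zip(pis, params))

# Two hypothetical classes: irregular low users and consistent high users
y = np.array([0.0, 10_000.0, 25_000.0])
dens = mixture_pdf(y, pis=[0.4, 0.6],
                   params=[(8_000.0, 4_000.0, 0.5), (24_000.0, 5_000.0, 0.05)])
```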

A few studies have used a similar approach in the context of hurdle modeling. Maruotti [34] proposed a longitudinal latent-class hurdle mixed effects model that accounts for missing data patterns arising from drop-outs, and applied it to the analysis of skin cancer counts, for which the data had a considerable number of missing measurements in addition to zero inflation. Moreover, Ma et al. [35] used a log-normal hurdle mixture to identify patterns of factors contributing to vehicle crash rates. To the best of our knowledge, no studies to date have combined a hurdle approach with within-class heterogeneity using GAMLSS, in particular in combination with class-specific temporal heteroskedasticity.

### Model estimation

The analysis is performed in R 3.5.0 [36] using version 5.1-2 of the *gamlss* package [14] for the implementation of GAMLSS. The GAMLSS model is fitted using the RS algorithm proposed by Rigby & Stasinopoulos [37]. The algorithm maximizes the (penalized) likelihood of the full model using expectation maximization (EM). The estimation of the random patient factor is based on penalized quasi-likelihood. The zero-truncated normal distribution is available in the *gamlss.tr* package (version 5.1-0) [38], and was adapted to account for excess zeros through the parameter *ν*.

We estimate the mixture model specified in Equation 9 using a nonparametric maximum likelihood (NPML) approach [39, 40], as implemented in the gamlssNP function in the package *gamlss.mx* (version 4.3-5) [41]. This approach describes the data heterogeneity through a non-parametric density function comprising a finite mixture [40–42]. The marginal likelihood for the data is given by

$$ f(\boldsymbol{y};\boldsymbol{\Theta},\boldsymbol{\pi})=\prod_{i\in I}\sum_{g=1}^{G}\left[\pi_{g}\prod_{j=1}^{J_{i}}f_{g}\left(y_{i,j};\boldsymbol{\theta}^{(g)}\right)\right]. $$

(10)

Here, the group parameters *θ*^{(g)} represent the mass points of the non-parametric density, occurring with probability (i.e., the masses) *π*_{1},…,*π*_{G}, respectively.

Each trajectory is assumed to have been generated by one of the group models; however, the true group membership is unknown. The membership of the trajectory *y*_{i} to group *g* is indicated by *δ*_{i,g}, with *δ*_{i,g}=1 if the trajectory belongs to group *g*, and *δ*_{i,g}=0 otherwise. The vector of group indicators for trajectory *i* is denoted by *δ*_{i}=(*δ*_{i,1},…,*δ*_{i,G}). We denote the set of all indicator vectors across patients by *δ*. With this, the likelihood of the model with specified group memberships *δ*, referred to as the complete model, is given by

$$\begin{array}{*{20}l} L_{c} & =L(\boldsymbol{y},\boldsymbol{\Theta},\boldsymbol{\pi},\boldsymbol{\delta})=f(\boldsymbol{y},\boldsymbol{\delta})\\ & =f(\boldsymbol{y}|\boldsymbol{\delta})f(\boldsymbol{\delta}) \\ & =\prod_{i\in I}f(\boldsymbol{y}_{i}|\boldsymbol{\delta}_{i})f(\boldsymbol{\delta}_{i}) \\ & =\prod_{i\in I}\prod_{g=1}^{G}\left[\pi_{g}^{\delta_{i,g}}\prod_{j=1}^{J_{i}}f_{g}\left(y_{i,j}\right)^{\delta_{i,g}}\right]. \end{array} $$

(11)

Here, the parameters *Θ* and *π* were left out for conciseness. A more detailed derivation is provided by Stasinopoulos et al. [41]. The log-likelihood of the complete model is given by

$$ \ell_{c}=\sum_{i\in I}\sum_{g=1}^{G}\delta_{i,g}\log\pi_{g}+\sum_{i\in I}\sum_{g=1}^{G}\sum_{j=1}^{J_{i}}\delta_{i,g}\log f_{g}(y_{i,j}). $$

(12)
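The complete-data log-likelihood translates directly into code; the Python sketch below is a toy illustration only (the actual estimation is handled by the *gamlss.mx* package in R).

```python
import numpy as np

def complete_loglik(log_f, delta, pis):
    """Complete-data log-likelihood. log_f[i, g, j] holds log f_g(y_{i,j}),
    delta[i, g] the 0/1 class membership indicators, and pis the class
    proportions pi_g."""
    membership_term = (delta * np.log(pis)).sum()
    density_term = (delta[:, :, None] * log_f).sum()
    return membership_term + density_term

# Toy example: 2 trajectories, 2 classes, 3 observations each
log_f = np.full((2, 2, 3), -1.0)      # constant log-density for simplicity
delta = np.array([[1, 0], [0, 1]])    # patient 1 in class 1, patient 2 in class 2
llc = complete_loglik(log_f, delta, pis=np.array([0.5, 0.5]))
```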

The complete model with *G* classes is equivalent to a weighted regression model over repeated data observations for *g*=1,...,*G* with an additional covariate indicating the class membership. The observations are weighted by the posterior probability

$$ w_{i,g}=\hat{\pi}_{i,g}=\frac{\pi_{g}\prod_{j=1}^{J_{i}}f_{g}\left(y_{i,j};\boldsymbol{\theta}^{(g)}\right)}{\sum_{g^{\prime}=1}^{G}\pi_{g^{\prime}}\prod_{j=1}^{J_{i}}f_{g^{\prime}}\left(y_{i,j};\boldsymbol{\theta}^{(g^{\prime})}\right)}. $$

(13)

The latent class proportions of the mixture model are computed from the respective average posterior probability, given by

$$ \hat{\pi}_{g}=\frac{1}{|I|}\sum_{i\in I}w_{i,g}. $$

(14)
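The posterior weights and updated proportions above can be sketched as an E-step in Python (illustrative only); working in the log domain avoids numerical underflow in the products over the *J*_{i} observations, a practical consideration not spelled out in the formulas.

```python
import numpy as np
from scipy.special import logsumexp

def e_step(log_f, log_pis):
    """Posterior weights w[i, g] and updated proportions. log_f[i, g] is
    the summed log-density of trajectory i under class g."""
    log_num = log_pis + log_f       # log pi_g + sum_j log f_g(y_{i,j})
    w = np.exp(log_num - logsumexp(log_num, axis=1, keepdims=True))
    return w, w.mean(axis=0)        # proportions: average posterior weight

# Toy example with two trajectories and two classes
log_f = np.log(np.array([[0.9, 0.1],
                         [0.2, 0.8]]))
w, pis = e_step(log_f, log_pis=np.log([0.5, 0.5]))
```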

The EM algorithm is initialized by fitting a fixed-effects GAMLSS model. In the E-step, the patient weights are updated, followed by the maximization of the weighted GAMLSS model likelihood in the M-step. The optimization process is halted when the reduction in the deviance, computed as *D*=−2 log*L*, falls below a certain threshold. In the analysis, we use a lenient threshold of 0.3, which we determined to provide sufficiently stable results due to the large amount of data, while halting relatively quickly. Details on the algorithm and the initialization are given by Einbeck & Hinde [40]. Moreover, we observed stable solutions across repeated random starts, which is in agreement with findings by other researchers [39, 40]. Nevertheless, the model occasionally fails to converge, so repeated random starts are recommended.

### Evaluation

Prior to the mixture modeling analysis, we explore several mixed models based on Equation 7, with polynomial random effects \(\sum _{p=0}^{P}\gamma _{k,p,i}t_{i,j}^{p}\) of order *P* in the predictor *η*_{k} of the respective distributional parameter. In addition, we estimate a fixed effects model as a baseline.

We assess the fit of the models by investigating the standardized residuals for normality, using a detrended quantile-quantile (Q-Q) plot. The different models are compared using the Akaike information criterion (AIC). The AIC measures the amount of information lost about the data by the model representation, while penalizing overfitting. It is defined as AIC=2*m*−2 log*L*, where *m* is the number of model parameters and *L* is the likelihood of the model. This is a special case of the generalized Akaike information criterion (GAIC) [43], which is recommended for comparing non-nested models [14]. Only models that converged successfully are evaluated. Likelihood ratio tests were considered, but due to the large sample size they consistently yielded p-values of effectively zero for any improvement in AIC, and we therefore do not report them in the results section. Lastly, we measure the separation between classes in terms of the relative entropy [44], given by

$$ \mathrm{relative\:entropy}=1-\frac{\sum_{i\in I}\sum_{g=1}^{G}-\hat{\pi}_{i}^{(g)}\log\hat{\pi}_{i}^{(g)}}{|I|\log G}. $$

(15)
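The relative entropy translates directly into code; a short illustrative Python version, taking a matrix of posterior class probabilities per patient:

```python
import numpy as np

def relative_entropy(w):
    """One minus the total posterior entropy, normalized by the maximum
    entropy |I| log G; 1 indicates perfectly separated classes, 0 none."""
    n, G = w.shape
    logw = np.log(np.where(w > 0, w, 1.0))   # log(1) = 0 avoids -inf at w = 0
    ent = -np.sum(w * logw)
    return 1.0 - ent / (n * np.log(G))

# Hypothetical posterior probabilities for three patients, two classes
w = np.array([[0.95, 0.05], [0.10, 0.90], [0.85, 0.15]])
sep = relative_entropy(w)
```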

For the selected mixture model we compare the subgroups in order to create distinct descriptors of the groups, and to highlight meaningful differences between the groups in terms of adherence. We assess the group trajectories on each of the distributional parameters visually. Furthermore, we explore whether the groups differ on any of the other available covariates of interest, which are the residual AHI, leakage, pressure settings, and motivation score. Here, each patient is assigned to the most likely group (i.e., modal assignment).

Although the additional covariates could have been included in the GAMLSS mixture model, this was omitted for practical reasons because the computation time would increase considerably. Moreover, preliminary tests on a random subset of 1,000 patients yielded mostly the same groups as the mixture model without the inclusion of an additional covariate.

We therefore apply a three-step approach: in the first step, the mixture model is estimated (i.e., the measurement model); in the second step, each patient is assigned to their most likely group; lastly, the patient groups are compared on the covariates of interest. The means are compared using ANOVA F-tests, whereas the medians of skewed distributions are compared using the Kruskal-Wallis test. Due to the large sample size of this study, even small differences between groups are statistically significant. We therefore only highlight practically significant differences that are deemed clinically relevant.
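As an illustration of the third step, the Python sketch below compares a hypothetical covariate (e.g., a motivation score) across three modally assigned groups using the two tests mentioned above; all data and effect sizes are simulated for illustration.

```python
import numpy as np
from scipy.stats import f_oneway, kruskal

rng = np.random.default_rng(3)
# Simulated covariate values for patients modally assigned to three groups
groups = [rng.normal(5.0, 1.0, 200),
          rng.normal(5.4, 1.0, 200),
          rng.normal(6.0, 1.0, 200)]

F, p_anova = f_oneway(*groups)   # ANOVA F-test: compares the group means
H, p_kw = kruskal(*groups)       # Kruskal-Wallis: rank-based, robust to skew
```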

The three-step approach has been shown to lead to biased estimates of the effects of external variables [16]. We therefore considered the modified Bolck-Croon-Hagenaars (BCH) approach, which applies a correction for the misclassification errors [16, 45]. However, when applied to the case study at hand, we observed that the correction did not result in a meaningful difference in the mean estimates between groups, nor in different conclusions regarding statistical significance. This is likely attributable to the large sample size, and the low misclassification error due to the large number of observations per trajectory.