 Research
 Open access
 Published:
Flexible Bayesian semiparametric mixedeffects model for skewed longitudinal data
BMC Medical Research Methodology volumeÂ 24, ArticleÂ number:Â 56 (2024)
Abstract
Background
In clinical trials and epidemiological research, mixedeffects models are commonly used to examine populationlevel and subjectspecific trajectories of biomarkers over time. Despite their increasing popularity and application, the specification of these models necessitates a great deal of care when analysing longitudinal data with nonlinear patterns and asymmetry. Parametric (linear) mixedeffect models may not capture these complexities flexibly and adequately. Additionally, assuming a Gaussian distribution for random effects and/or model errors may be overly restrictive, as it lacks robustness against deviations from symmetry.
Methods
This paper presents a semiparametric mixedeffects model with flexible distributions for complex longitudinal data in the Bayesian paradigm. The nonlinear time effect on the longitudinal response was modelled using a spline approach. The multivariate skewt distribution, which is a more flexible distribution, is utilized to relax the normality assumptions associated with both randomeffects and model errors.
Results
To assess the effectiveness of the proposed methods in various model settings, simulation studies were conducted. We then applied these models on chronic kidney disease (CKD) data and assessed the relationship between covariates and estimated glomerular filtration rate (eGFR). First, we compared the proposed semiparametric partially linear mixedeffect (SPPLM) model with the fully parametric one (FPLM), and the results indicated that the SPPLM model outperformed the FPLM model. We then further compared four different SPPLM models, each assuming different distributions for the random effects and model errors. The model with a skewt distribution exhibited a superior fit to the CKD data compared to the Gaussian model. The findings from the application revealed that hypertension, diabetes, and followup time had a substantial association with kidney function, specifically leading to a decrease in GFR estimates.
Conclusions
The application and simulation studies have demonstrated that our work has made a significant contribution towards a more robust and adaptable methodology for modeling intricate longitudinal data. We achieved this by proposing a semiparametric Bayesian modeling approach with a spline smoothing function and a skewt distribution.
Introduction
Longitudinal data are present in numerous clinical and other followup studies that involve monitoring subjects over time to understand the impact of exposures, processes, or characteristics on outcomes. These studies involve tracking a group of subjects and recording data at different time points throughout the study duration. For example, one or more renal functional progress biomarkers (e.g., serum creatinine, albuminuria, glomerular filtration rate, and other biomarkers) of a chronic kidney disease (CKD) patient can be measured repeatedly until endstage renal disease and/or other events of interest occur.
This research was driven by longitudinal data on CKD, a significant global health issue that affects approximately 500 million individuals worldwide [1]. Around 80 percent of these CKD cases are found in low and middleincome countries. A prevalence of approximately 35.52 percent of CKD was observed among people with diabetes in Ethiopia [2]. To comprehend how CKD progresses within individuals and across populations, as well as to assess the impact of treatments over time, conducting longitudinal data analysis is necessary.
Longitudinal data can show a variety of features over time and across subjects in many realworld situations during followup studies. A suitable choice of methods for analysing such complex longitudinal data is therefore sought. The most popular method proposed is the linear mixedeffects (LME) model [3,4,5,6] with a Gaussian response. The generalized linear mixedeffects models [7,8,9] and nonlinear mixedeffects models [10] have been also used as an extension of LME model.
Despite the increasing popularity of LME models in applications, the specification and statistical inference of these models may necessitate much attention when treating and analysing longitudinal data with many features. One of these features is that longitudinal data can exhibit nonlinear, irregular patterns over time, along with asymmetry. Thus, to model and analyse longitudinal data with this feature, LME (fully parametric) models may not be flexible enough.
Another feature is that, unlike linear models, mixed models make assumptions regarding the distribution of model errors as well as randomeffects. In the literature, it is usually assumed that the model errors and/or randomeffects follow a multivariate normal distribution. In practice, longitudinal data might exhibit asymmetric distributions, leading to biased statistical results [11, 12]. Because of this, employing a normal distribution for model errors may lack robustness against deviations from normality and may be too limited to accurately describe the among and withinsubject variability of longitudinal outcomes [13]. Many previous studies suggest considering a more flexible distribution for model errors to make a valid statistical inference [13, 14]. There are different suggestions in the literature concerning the impact of misspecification of a randomeffect distribution on parameter estimation and inference. For instance, Molenberghs and Verbeke [15] suggest that misspecification of the randomeffects distribution can lead to biased parameter estimates in nonlinear and generalised linear mixed models; in linear mixed models, however, deviations from the normality assumption may have very little impact on parameter estimation. McCulloch and Neuhaus [16] considered a generalised linear mixed model using a maximum likelihood estimation technique to evaluate the misspecification of the distribution of a random effect. Their findings demonstrated the robustness of most aspects of statistical inferences to the normality of random effects. Other authors in the recent literature, however, suggest that future research should accept more flexible distributional assumption for randomeffects in addition to model errors [17, 18]. As a result, skew distributions have recently been used in the literature to handle asymmetry and model longitudinal data more flexibly [18,19,20].
Thus, in this study, we propose a flexible Bayesian mixedeffects model in a semiparametric setting with a smoothing spline specification and skew distributions for longitudinal data with the aforementioned features. To assess the effectiveness of the proposed methods in various model specifications, simulation studies were conducted. Finally, the proposed model was applied to data on CKD.
Methods
Motivating CKD data and longitudinal outcome trajectories
This paper utilizes a dataset spanning eight years, from June 2014 to June 2022, in the context of chronic kidney disease (CKD). The CKD data was gathered from the University of Gondar Comprehensive Specialized Hospital in Ethiopia, primarily extracted from patientsâ€™ profiles (or charts) and medical records. Only patients with three or more followups are included in the analysis. The dataset encompasses repeatedly recorded renal function biomarkers, comorbidities, and baseline characteristics of 198 CKD patients. On average, the patients were approximately 55 years old, with 56.6% being male. Around onethird (34.4%) of the CKD patients in the study population had baseline hypertension. Furthermore, the baseline prevalence of diabetes among the CKD patients was determined to be 23.81%.
The estimated glomerular filtration rate (eGFR), which estimates the rate at which the kidneys filter blood, is utilized as a longitudinal response variable. Thus, the analysis of this study considered 1,425 eGFR measurements from 189 patients. The minimum, maximum and average number of measurements per patient were 3, 18 and 8, respectively. 63.5% (120) patients had six and above number of measurements, and out of them 43% patients had ten and above measurements. Of the total 1,425Â measurements, based on the National Kidney Foundation guidelines [21], 39.7% indicated CKD Stage 3 (moderate kidney disease), 32.9% indicated Stage 4 (severe kidney disease), and 14.6% indicated Stage 5 (endstage renal disease). To accurately represent the diverse patterns of renal function progression and create an appropriate model, the analysis includes patients with an eGFR value below ninety. FigureÂ 1 displays the eGFR profiles of patients with CKD. The figure depicts the presence of nonlinear trajectories and a positively skewed distribution of eGFR over time.
Bayesian modelling
The semiparametric mixedeffects longitudinal outcome model
In this paper, the longitudinal variable is denoted as \({y}_{ij}\), which represents the value of the response eGFR for subject \(i\) at the \({j}^{th}\) time point \({t}_{ij}\). The indices \(i\) and \(j\) range from 1 to \(m\) and 1 to \({m}_{i}\) respectively, indicating the total number of subjects and the number of measurements for each subject. Let \({{\varvec{x}}}_{ij}={({x}_{1ij}, \dots , {x}_{pij})}^{T}\) denotes a \(1\times p\) vector of associated \(p\) covariates. Most previous studies on chronic kidney disease have taken a parametric approach, like utilizing linear mixedeffects models, to model the longitudinal response variable \({y}_{ij}\) and the associated covariates \({{\varvec{x}}}_{ij}\). However, as demonstrated in the presentation of the motivating CKD data above, the outcome eGFR exhibits irregular (nonlinear) trajectories over time. Therefore, this paper introduces a semiparametric mixedeffects model that considers the nonlinear trajectories of \({{\varvec{y}}}_{i}\) by employing a spline approach.
where \({{\varvec{y}}}_{i}={({y}_{i1}, \dots , {y}_{imi})}^{T}\) represent the vector of longitudinal response variable, \({{\varvec{X}}}_{i}={({{\varvec{x}}}_{1i}, \dots , {{\varvec{x}}}_{pi})}^{T}\) denote the design matrix of fixedeffects, and \({{\varvec{H}}}_{i}={({{\varvec{h}}}_{1i}, \dots , {{\varvec{h}}}_{qi})}^{T}\) represent the design matrix of randomeffects. \({\varvec{\beta}}\) and \({{\varvec{\xi}}}_{i}\) represent parameter vectors that are associated with the covariates of fixed and random effects. In Eq.Â (1), the effect of measurement time \({{\varvec{t}}}_{i}={({t}_{i1}, \dots , {t}_{imi})}^{T}\) on the response \({{\varvec{y}}}_{i}\) is modelled using a nonparametric approach. This is achieved by employing a smoothing function \({{\varvec{N}}}_{i}\left({{\varvec{t}}}_{i}\right)\), which can be defined as follows:
where \(\mathbf{U}\left({\mathbf{t}}_{i}\right)\) and \({\mathbf{V}}_{i}\left({\mathbf{t}}_{i}\right)\) represent unknown smoothing functions for the population and subjectspecific variations of the longitudinal response \({{\varvec{y}}}_{i}\) due to time effects \({\mathbf{t}}_{i}\), respectively. The random vectors \({{\varvec{\xi}}}_{i}\), \({\mathbf{V}}_{i}\left({\mathbf{t}}_{i}\right),\) and \({{\varvec{\varepsilon}}}_{i}\) are assumed to be independent one another.
A regression spline method is utilized to specify the unknown functions \(\mathbf{U}\left({\mathbf{t}}_{i}\right)\) and \({\mathbf{V}}_{i}\left({\mathbf{t}}_{i}\right)\) in Eq.Â (2), and can be defined as a linear combination of spline basis functions,
\({{\varvec{\Phi}}}_{ki}\left({\mathbf{t}}_{i}\right)={({{\varvec{\Phi}}}_{k} ({t}_{ij} ), \dots , {{\varvec{\Phi}}}_{k} ({t}_{imi}))}^{T}\) andÂ \({{\varvec{\Lambda}}}_{li}\left({\mathbf{t}}_{i}\right)={({{\varvec{\Lambda}}}_{l} ({t}_{ij} ), \dots , {{\varvec{\Lambda}}}_{l} ({t}_{imi} ))}^{T}\). Where.
\({{\varvec{\Phi}}}_{k} \left({t}_{ij}\right)={({\upphi }_{0} ({t}_{ij} ),{\upphi }_{1} ({t}_{ij} ) \dots , {\upphi }_{k1} ({t}_{ij} ))}^{T}\) and.
\({{\varvec{\Lambda}}}_{l} \left({t}_{ij}\right)={({\uplambda }_{0} ({t}_{ij} ),{\uplambda }_{1} ({t}_{ij} ) \dots , {\uplambda }_{l1} ({t}_{ij} ))}^{T}; j=1, \dots ,{m}_{i}.\)Â Mathematically, the specification can be given by
where \({{\varvec{\eta}}}_{k}={({\upeta }_{0} ,{\upeta }_{1}, \dots , {\upeta }_{k1} )}^{T}\) and \({\boldsymbol{\vartheta }}_{il}={({\vartheta }_{i0} ,{\mathrm{\vartheta }}_{i1}, \dots , {\mathrm{\vartheta }}_{i(l1)} )}^{T}\) are \(k\times 1\) and \(l\times 1\) parameter vectors of the fixedeffect spline basis \({{\varvec{\Phi}}}_{k}\left({\mathbf{t}}_{i}\right)\) and random spline basis effects\({{\varvec{\Lambda}}}_{l}\left({\mathbf{t}}_{i}\right)\), respectively. The Bspline, truncated power or natural cubic spline basis can be used to construct the bases (\({{\varvec{\Phi}}}_{k}\left({\mathbf{t}}_{i}\right)\) and\({{\varvec{\Lambda}}}_{l}\left({\mathbf{t}}_{i}\right)\)) in (3). In this study, natural cubic spline with percentilebased knots is considered to approximate the bases. By using Eq.Â (3), model (1) can be rewritten as:
Let \({{\varvec{Z}}}_{i}=({{\varvec{X}}}_{i}, {{\varvec{\Phi}}}_{k}\left({\mathbf{t}}_{i}\right))\) and \({{\varvec{R}}}_{i}=({{\varvec{H}}}_{i}, {{\varvec{\Lambda}}}_{l}\left({\mathbf{t}}_{i}\right))\) be the fixedeffect (population) and random effects design matrices, respectively. Furthermore, let \(\boldsymbol{\alpha }={({{\varvec{\beta}}}_{p}^{T}, {{\varvec{\eta}}}_{k}^{T})}^{T}\) and \({\boldsymbol{\varphi }}_{i}={({{\varvec{\xi}}}_{iq}^{T}, {\boldsymbol{\vartheta }}_{il}^{T})}^{T}\) be the associated parameter vectors. Then, model (1) can be reformulated as
Most previous studies assumed a Gaussian distribution for the randomeffects \({\boldsymbol{\varphi }}_{i}\) (representing intersubject variation of \({\mathbf{y}}_{i}\)) as well as for the model errors \({{\varvec{\epsilon}}}_{i}\) (representing withinsubject variation) due to its computational convenience. However, in this study, we considered multivariate skewt distributions [22] for both \({\boldsymbol{\varphi }}_{i}\) and \({{\varvec{\varepsilon}}}_{i}\). That is, \({\boldsymbol{\varphi }}_{i}\sim S{T}_{q+l,{\rho }_{\varphi }}\left(0,{{\varvec{\Sigma}}}_{\varphi },{{\varvec{\delta}}}_{\varphi }\right)\) and \({{\varvec{\varepsilon}}}_{i}\sim S{T}_{{m}_{i},{\rho }_{\varepsilon }}\left(0,{\sigma }_{\varepsilon }^{2}{\mathbf{I}}_{{m}_{i}},{\delta }_{\varepsilon }{\mathbf{I}}_{{m}_{i}}\right)\). Where \(ST(.)\) is a skewt distribution; \({\rho }_{\varphi }\) and \({\rho }_{\varepsilon }\) denote degrees of freedom; \({{\varvec{\Sigma}}}_{\varphi }\) and \({{\varvec{\Sigma}}}_{\varepsilon }={\sigma }_{\varepsilon }^{2}{\mathbf{I}}_{{m}_{i}}\) denote covariance matrices; and \({{\varvec{\delta}}}_{\varphi }\) and \({{\varvec{\delta}}}_{\varepsilon }={\delta }_{\varepsilon }{\mathbf{I}}_{{m}_{i}}\) are skewness vectors of the randomeffects \({\boldsymbol{\varphi }}_{i}\) and model errors \({{\varvec{\varepsilon}}}_{i}\), respectively.
Hierarchical reformulation of the model
The statistical inference from a semiparametric mixedeffects model with multivariate skewt distributions using the likelihood approach can be computationally demanding. Hence, to overcome this challenge, we adopted the Bayesian approach, which offers computational efficiency. This approach not only reduces the computational load but also allows for more accurate parameter estimation by leveraging existing information (prior knowledge) for parameter estimation. By employing Markov Chain Monte Carlo (MCMC) algorithms, the Bayesian approach enables us to estimate the parameters more efficiently while obtaining posterior distributions that provide a comprehensive quantification of parameter uncertainty.
In order to carry out the MCMC, it is crucial to reformulate the model (5) by representing the skewt distributions using the stochastic representation considered by [22] (See Appendix B). To achieve this, we introduced random vectors \({{\varvec{W}}}_{{\varphi }_{i}}={({W}_{{\varphi }_{i}1},\dots ,{W}_{{\varphi }_{i}(q+1)})}^{T}\) and \({{\varvec{W}}}_{{{\varvec{\varepsilon}}}_{i}}={({W}_{{\varepsilon }_{i}1},\dots ,{W}_{{\varepsilon }_{i}{m}_{i}})}^{T}\), as well as random variables \({v}_{\varphi }\) and \({v}_{\varepsilon }\) to represent the skewt distributions associated with the random effects \({\boldsymbol{\varphi }}_{i}\) and model errors \({{\varvec{\varepsilon}}}_{i}\), respectively. Consequently, we present the hierarchical reformulation of model (5) as follows:
Where \({N}_{b}()\), in general, stands for a multivariate normal distribution with a dimension of b, and \(\Gamma ()\) denotes a gamma distribution.
Prior specification and the posterior distribution
Let \({\varvec{\Omega}}=\left\{\boldsymbol{\alpha },{\sigma }_{\varepsilon }^{2},{{\varvec{\Sigma}}}_{\varphi }, {{\varvec{\delta}}}_{\varphi }, {\delta }_{\epsilon },{\rho }_{\varphi },{\rho }_{\varepsilon }\right\}\) represent the set of all parameters in the hierarchical model (6). We specify the prior distributions for each parameter in \({\varvec{\Omega}}\) as follows:

The fixedeffects and skewness parameters \(\boldsymbol{\alpha }\), \({{\varvec{\delta}}}_{\varphi }\), and \({\delta }_{\epsilon }\) are assumed to follow independent normal prior distributions \({N}_{{\text{p}}}({{\varvec{\upalpha}}}_{0}, {{\varvec{\Theta}}}_{\boldsymbol{\alpha }})\), \({N}_{{\text{q}}+{\text{l}}}(0, {{\varvec{\upkappa}}}_{{\delta }_{\varphi }})\), and \(N(0, {\upkappa }_{{\delta }_{\varepsilon }})\), respectively.

The scale parameters \({{\varvec{\Sigma}}}_{\varphi }\) and \({\sigma }_{\varepsilon }^{2}\) follow an inverseWishart and inverseGamma prior distributions, \({IW}_{{\text{q}}+{\text{l}}}({\mathbf{D}}_{\varphi }, {\upnu }_{\varphi })\) and \(IG\left({\varrho }_{\varepsilon 1}, {\varrho }_{\varepsilon 2}\right),\) respectively.

The degrees of freedom parameters \({\rho }_{\varphi }\) and \({\rho }_{\varepsilon }\) are assumed to follow truncated exponential prior distributions, \(Exp({\rho }_{\vartheta 0})I({\rho }_{\vartheta }>3)\) and \(Exp({\rho }_{\varepsilon 0})I({\rho }_{\varepsilon }>3)\), respectively.
The hyperparameter matrices \({{\varvec{\Theta}}}_{\boldsymbol{\alpha }}\) and \({\mathbf{D}}_{\varphi }\) are assumed diagonal for convenient implementation. Then, the prior distribution of all the parameters, denoted as \(\pi ({\varvec{\Omega}})\), can be defined as the product of the individual prior distributions of each parameter.
Suppose \(\mathbf{G}=\left\{{\mathbf{y}}_{i},{\mathbf{Z}}_{i}, {\mathbf{W}}_{{\varphi }_{i}}, {\mathbf{W}}_{{\varepsilon }_{i}},{v}_{\varphi }, {v}_{\varepsilon }\right\}\) be the observed data. An approximation of the posterior density of Î© given G can be obtained as follows:
where \(f(\mathcal{G}{\varvec{\Omega}})\) is the joint likelihood function of G given Î© and \({{\varvec{\mu}}}_{y}={\mathbf{Z}}_{i}\boldsymbol{\alpha }+{\mathbf{R}}_{i}{\boldsymbol{\varphi }}_{i}+{\delta }_{\varepsilon }{\mathbf{W}}_{\varepsilon i}\).
The Metropolisâ€“Hastings algorithm within Gibbs sampler can be used to draw samples from the full conditional posterior distributions of the parameters and to estimate their posterior means and standard deviations. For all models, the Markov chain Monte Carlo (MCMC) procedure was implemented using WinBUGS14 software, which simplifies the implementation of the MCMC algorithm by eliminating the need to derive full conditionals and specify the algorithm explicitly.
Model comparison and diagnostics checking
The specification and implementation of the proposed model in the Bayesian approach may require to conduct convergence diagnostic checks and thoroughly examine the distributional assumptions before drawing any statistical inferences about the parameters. Failure to do so may result in biased estimates and invalid statistical inference. Thus, in this study, the BrooksGelmanRubin (BGR) plot [23], trace plot, ACF plot and the Gewekeâ€™s test of convergence are all used to evaluate convergence. After confirming convergence, we proceed to evaluate the effectiveness of the proposed semiparametric mixedeffects model (5) by exploring various distributional assumptions for the randomeffects and model errors. This model comparison involves considering different distributional specifications and examining their performance in capturing the underlying characteristics of the data. These specifications are given below:

MoSTST: A semiparametric partially linear mixedeffects model (SPPLMEM) with multivariate skewt (ST) distributions for both the random effects \({\boldsymbol{\varphi }}_{i}\) and model errors \({{\varvec{\varepsilon}}}_{i}\).

MoNST: An SPPLMEM with multivariate normal (N) distribution of \({\boldsymbol{\varphi }}_{i}\) and STdistribution of \({{\varvec{\varepsilon}}}_{i}\).

MoSNSN: An SPPLMEM with both \({\boldsymbol{\varphi }}_{i}\) and \({{\varvec{\varepsilon}}}_{i}\) follow multivariate skewnormal (SN) distributions.

MoNN: An SPPLMEM with multivariate normal (N) distributions of \({\boldsymbol{\varphi }}_{i}\) and \({{\varvec{\varepsilon}}}_{i}\). MoNN model is the standard choice in longitudinal data analysis.
In order to evaluate the performance of the estimators and make comparisons between different models, we utilised several statistical measures. Additionally, to compare and select the bestfitting Bayesian model for the skewed longitudinal response, we employed the deviance information criterion, which takes into account both the goodness of fit and model complexity.
Deviance information criterion (DIC)
In this paper, DIC [24] is used to choose the bestfitting Bayesian semiparametric mixedeffects model. DIC is the most popular Bayesian model comparison tool in the literature: the smaller this value, the better the model fit. The DIC for the hierarchical Bayesian model (6) with parameters vector Î© and observed longitudinal data D can be defined as
where
is the deviance computed at the posterior mean of model parameters. And
is effective number of parameters. Where \(\stackrel{}{Dev(\Omega )}\) represents the expected deviance; \(Dev\left(\Omega \right)=2{\text{log}}(f({\varvec{D}}{\varvec{\Omega}}))\) is the deviance function; and \(f({\varvec{D}}{\varvec{\Omega}})\) is the likelihood of the parameters in Eq.Â (6).
Results
Simulation studies
Simulation studies were conducted to assess and compare the effectiveness of the proposed semiparametric mixedeffects model in various model settings. In these simulations, a sample of 400 individuals was considered, each having eleven equally spaced measurement times, resulting in a total of 4,400 observations. The longitudinal data was simulated using the semiparametric mixedeffects model (5). The specifications of this general model can be given as follows:
where \({y}_{ij}\) and \({Z}_{pij}\) are the longitudinal response and binary covariates, \(p=1, 2\). \(\boldsymbol{\alpha }={({\alpha }_{1}, {\alpha }_{2}, {\alpha }_{3})}^{T}\) and \({\varvec{\lambda}}={({\lambda }_{1}, {\lambda }_{2}, {\lambda }_{3})}^{T}\) denote parameter vectors of the fixed effects and \({\boldsymbol{\varphi }}_{i}=\) (\({\varphi }_{i1}\), \({\varphi }_{i2}\), \({\varphi }_{i3}\), \({\varphi }_{i4}\))^{T} denote parameter vector of the randomeffects. \({\varvec{\Phi}}\left({t}_{ij}\right)=\) (\({\phi }_{1}\left({t}_{ij}\right)\), \({\phi }_{2}\left({t}_{ij}\right)\),\({\phi }_{3}\left({t}_{ij}\right)\))^{T} is a vector of natural cubic spline bases used in the regression spline method. We used eleven equally spaced time points (t_{ij}â€‰=â€‰0, 1, 2, 3, â€¦, 10) with percentile based knots to generate the spline bases [14, 25, 26].
To create longitudinal data with a skewed distribution, the components of the random effects \({\boldsymbol{\varphi }}_{i}\) and the error terms \({{\varvec{\varepsilon}}}_{i}\) are simulated from a gamma distribution \(\gamma (\mathrm{2,1})\). These generated values are then subtracted by two [27, 28]. The vectors Î±â€‰=â€‰(27.5,â€‰âˆ’â€‰5,â€‰âˆ’â€‰4)^{T} and Î»â€‰=â€‰(âˆ’â€‰9,â€‰âˆ’â€‰25,â€‰âˆ’â€‰5) are set accordingly.
Furthermore, \({{\varvec{Z}}}_{1}\) and \({{\varvec{Z}}}_{2}\) are generated using Bernoulli distributions with probabilities (proportions) 0.24 and 0.44, respectively.
While performing the Bayesian inference, we considered weakly informative priors for the parameters. Specifically, each component of \(\boldsymbol{\alpha },{\varvec{\uplambda}}, {{\varvec{\delta}}}_{\varphi },\) and \({\delta }_{\varepsilon }\) was assumed to follow a normal prior distribution, N (0, 100). Furthermore, inverse Wishart \(IW(0.01{\mathbf{I}}_{4}, 4)\), inverse gamma IG(0.01, 0.01), Exp(0.5), and Exp(0.5) priors are considered for \({{\varvec{\Sigma}}}_{\varphi }\), \({\sigma }_{\varepsilon }^{2}\), \({\rho }_{\varphi }\), and \({\rho }_{\varepsilon }\), respectively.
Three MCMC chains were run using R2WinBUGS in R. Each chain consisted of 90,000 iterations, and a burnin of 45,000 iterations was applied. After thinning, we retained a total of 4,500 posterior estimates for each parameter from each model.
In assessing convergence, Figure A.1 (Appendix A) displays the trace plots, while Figure A.2 (Appendix A) exhibits the plots of ACF (autocorrelation function) and BGR diagnostic plots of the parameters derived from the proposed semiparameteric mixedeffects model [5). These figures clearly demonstrate convergence. In addition, none of the absolute values of Geweke's test statistics results (Appendix A) for the parameters exceeded the 95% critical value of 1.96, demonstrating strong evidence of convergence.
We computed the relative bias (RB), which indicates the extent of bias in the estimators; the 95% coverage probability (CP) to assess the accuracy of credible intervals; and the rootmeansquare (RMS) error to measure the overall prediction accuracy. The results presented in TableÂ 1 provide an evaluation of different models in terms of their posterior mean estimates, along with RB, RMSE, CP, and DIC based on simulation studies. Specifically, the evaluation focuses on semiparametric mixedeffect models (SPMEMs) with skew distributions in comparison to a Gaussian SPME model for skewed longitudinal data. The findings indicate that the proposed Bayesian SPMEMs with skew distributions (MoSTST, MoNST, and MoSNSN) outperformed the Gaussian model (MoNN). The DIC values of the skewed models MoNST, MoSNSN, and MoSTST are comparatively smaller (11,674, 12,997, and 13,249, respectively) than those of the normal model MoNN (DICâ€‰=â€‰14,528). The two models with skewt and skewnormal distributions of model errors and random effects (MoSTST and MoSNSN) have relatively closer DIC values. Specifically, the model with a skewt distribution for model errors and a normal distribution of random effects (MoNST) has the smallest DIC value and exhibited better performance compared to the other models. In terms of relative bias (RB) and RMSE, however, the model with a skewt distribution for both random effects and model errors (MoSTST) demonstrated superior performance. This suggests that incorporating skewness in modelling the longitudinal data and proposing a more flexible distributional assumption (skew distribution) allows for better capturing the inherent asymmetries and heavy tails present in the data, leading to more accurate estimates. Overall, these results emphasize the advantages of employing Bayesian SPMEM with skew distribution over the conventional Gaussian model, offering greater flexibility and improved performance in accurately modelling complex longitudinal data.
Results of the CKD data analysis
In this paper, we included diabetes and hypertension as binary covariates based on the real CKD dataset and three spline basis functions of time with four randomeffects to model and analyse the longitudinal response, the estimated glomerular filtration rate (eGFR). Accordingly, we reformulate the general semiparametric mixedeffects model (6) as follows:
where the parameter vectors Î±, Î», \({\boldsymbol{\varphi }}_{i}\), and Î¦(\({Time}_{ij}\)) are as defined as in the simulation section study. In order to obtain an approximation of the spline bases, we considered two internal knots at 9 and 25Â months and two boundary knots at 0 and 96Â months. The locations of these knots were determined based on the quantiles of the distribution of observed measurement time points. We proceed to analyse the CKD data using the proposed model with varying distributional assumptions, and subsequently compare and interpret the results. We begin by initially comparing the performance of two models: the proposed semiparametric (partially linear) mixedeffects model (SPPLMEM) specified in Eq.Â (12), and a fully parametric (linear) mixedeffects model (FPLMEM) that assumes Gaussian distributions for both the random effects and model errors. The FPLMEM is specifically defined as follows:
where \({Time}_{ij}\) denotes the observed measurement time of the longitudinal biomarkers for the i^{th} subject at the j^{th} visit. The results (TableÂ 2) show that the estimates of some parameters become large from FPLMEM compared to SPPLMEM. For instance, the estimates of \({\alpha }_{2}\) and Ïƒ^{2} from SPPLMEM wereâ€‰âˆ’â€‰6.38 and 60.19, while from FPLMEM they becameâ€‰âˆ’â€‰7.58 and 72.30, respectively. In addition, in order to select the most suitable Bayesian model that accurately represents the CKD data, we also compute the deviance information criterion (DIC) [24]. Our analysis reveals that the SPPLMEM gives a lower DIC value (DICâ€‰=â€‰10, 290) in comparison to the FPLMEM (DICâ€‰=â€‰10, 430).
After selecting the SPPLMEM as the most suitable model that accurately represents the data, we proceed to further compare four different SPPLMEMs by taking into account different distributional specifications. For model errors and randomeffects as described in the simulation study. We fitted four Bayesian semiparametric mixedeffects models to the CKD data. The MCMC setup, computations, and convergence diagnostic methods employed were identical to those described in the simulation study. Table 3. displays a summary of the data analysis results and estimates for the parameters (Par) obtained from the four models with different distributional specifications.
As shown in TableÂ 3., the CKD data analysis results reveal that each model produces slightly varied yet statistically significant estimates of most of the parameters. When comparing the models, the findings reveal that the utilization of the 4th model (MoNN), which employs a multivariate normal distribution for random effects and model errors, may result in an overestimation of some of the parameters. Specifically, the population parameters\({\alpha }_{1}\),Â \({\alpha }_{2}\),Â \({\alpha }_{3}\),Â \({\lambda }_{1}\),Â \({\lambda }_{2}\), and \({\lambda }_{3}\) are prone to being overestimated. Notably, as can be clearly seen, the estimated scale parameter (the variance) of model errors (\({\sigma }_{\varepsilon }^{2}\)) is significantly larger in MoNN compared to the other models. The 3rd model (MoSNSN) also gives larger parameter estimates (e.g.,\({\widehat{\sigma }}_{\varepsilon }^{2}\)) compared to the first two models. Furthermore, the estimated skewness parameter of the outcome eGFR (\({\delta }_{\varepsilon }\)) is significantly different from zero in the first three models: MoSTST, MoNST, and MoSNSN. Some of the skewness parameters of the random effects (\({{\varvec{\delta}}}_{\varphi }\)) are also significantly different from zero in MoSTST and MoSNSN. Thus, the significantly different from zero positive estimates of \({\delta }_{\varepsilon }\) and the subjectspecific random intercept \({\delta }_{{\varphi }_{1}}\) confirm the presence of positive skewness in the longitudinal eGFR data. In other words, the nonzero estimates of the skewness parameters and relatively small estimates of the variances may indicate that the proposed Bayesian models with skewt distribution of model errors and/or random effects (MoSTST and MoNST) fit the CKD data well. This is in line with the results of the simulation studies.
In general, the proposed models (MoSTST and MoNST) outperform and the standard MoNN. In particular, MoNST has been chosen as the best model for further indepth interpretation and discussion of the results because it has a relatively small DIC value, despite the fact that both MoSTST and MoSNSN have some significant skewness parameter estimates for the random effects. As can be seen from the simulation studies, MoNST also has a lower DIC value. This finding, a mixed model (skewed in our case) with a normal distribution of randomeffects, is consistent with the study [15].
The results of all models indicate that the variables examined in this study, namely hypertension, diabetes, and followup time (the spline bases), are statistically significant factors contributing to the decline of patientsâ€™ kidney function. This is attributed to the negative and significant association between these covariates and the response variable, eGFR. In other words, it is evident that these covariates have a substantial association with the decrease in GFR estimates. For example, the diabetes coefficient (\({\widehat{\alpha }}_{2}=6.66\), 95% CI: [âˆ’â€‰10.16,â€‰âˆ’â€‰3.11]) from MoNST (the bestfitting model) can be interpreted as the eGFR value of a CKD patient with diabetes being reduced by 6.66 units compared to a CKD patient without diabetes, while holding the same covariates and random effects. Additionally, a hypertensive CKD patient is associated with a 4.44 unit lower eGFR value (\({\widehat{\alpha }}_{3}=4.44\), 95% CI: [âˆ’â€‰5.45,â€‰âˆ’â€‰3.46]) compared to a nonhypertensive CKD patient, with the same covariates and random effects.
Discussion
In recent years, there has been a growing emphasis in the literature on effectively modeling longitudinal data with many features. This includes giving careful consideration to the functional forms of longitudinal markers and the assumptions made about the distribution of random effects and model errors. With this in mind, the main objective of this study was to develop a flexible Bayesian mixedeffects model that addresses the problems commonly observed in longitudinal CKD data, encompassing characteristics such as skewness, nonlinear effects over time, and flexible distributions for both random effects and model errors. The ultimate goal was to establish a robust statistical methodology that enables accurate and reliable inference in complex longitudinal data analysis.
We therefore proposed a Bayesian semiparametric mixedeffects model for the longitudinal response eGFR that addresses the above issues. To capture the nonlinear effects of time and the flexibility of eGFR, regression splines were employed in the model. Additionally, multivariate skew distributions were incorporated to account for skewness in eGFR and to relax the assumptions about its distribution. Simulation studies were first conducted to provide a comprehensive description and evaluation of the performance of the proposed model.
We applied the proposed model by analysing data on chronic kidney disease (CKD) and assessing the relationship between covariates and estimated glomerular filtration rate (eGFR). The model comparison process in this study involved two steps. Firstly, we compared the proposed semiparametric partially linear mixedeffect (SPPLM) model with the fully parametric one (FPLM), and our results indicated that the SPPLM model outperformed the FPLM model. In the second step, we further compared four different SPPLM models, each assuming different distributions for the randomeffects and model errors. As described in the data analysis and results, the SPPLM models with skewt distribution exhibited a superior fit to the CKD data in comparison to the Gaussian SPPLM model.
The findings from the application revealed that hypertension, diabetes, and followup time had a substantial association with kidney function, specifically leading to a decrease in eGFR. These factors were identified as important predictors and exhibited a negative correlation with kidney function.
Additionally, the results of this study imply that when dealing with longitudinal data characterized by the aforementioned features, it is useful to incorporate nonparametric smoothing functions (splines) to capture nonlinear timeeffects and utilize skew distributions for model errors and/or random effects. In particular, accounting for skewness in the longitudinal data analysis by utilizing a more flexible distribution, the skewt distribution, is crucial to handle asymmetry in the data and get unbiased results. By doing so, we can obtain less biased results and draw valid statistical inferences. Additionally, employing a flexible distributional assumption for the randomeffects can lead to a more accurate explanation of subjectspecific variations.
Apart from the CKD followup data that served as motivation, our methodology has broader applicability in cases where the longitudinal data have similar characteristics and the fundamental model requirements (or settings) are satisfied.
Conclusion
In conclusion, we have proposed a semiparametric Bayesian modeling approach with flexible distributions for complex longitudinal data. The results of the simulation and application studies have demonstrated that our work has made a significant contribution towards a more robust and adaptable methodology for modeling intricate longitudinal data. We recommend paying special attention to the specifications of the functional forms of longitudinal biomarkers and distributional assumptions of model errors when modeling complex longitudinal data.
Availability of data and materials
The actual CKD data utilized to exemplify the proposed model can be obtained from the corresponding author upon a substantial request.
References
Stanifer JW, Muiru A, Jafar TH, Patel UD. Chronic kidney disease in lowand middleincome countries. Nephrol Dial Transplant. 2016;31(6):868â€“74.
Shiferaw WS, Akalu TY, Aynalem YA. Chronic Kidney Disease among Diabetes Patients in Ethiopia: A Systematic Review and MetaAnalysis. Int J Nephrol. 2020;2020:15.
Laird NM, Ware JH. Randomeffects models for longitudinal data. Biometrics. 1982;1:963â€“74.
Diggle PJ, Heagerty P, Liang K, Zegger SL. Analysis of longitudinal data. 2nd ed. Oxford: Oxford University Press; 2002.
Hedeker D, Gibbons RD. Longitudinal data analysis. Hoboken, NJ: John Wiley & Sons; 2006.
Nguyen DV, SÂ¸entuÂ¨rk D, Carroll RJ. Covariateadjusted linear mixed effects model with an application to longitudinal data. J Nonparametr Stat. 2008;20(6):459â€“81.
Wu H, Ding AA, De Gruttola V. Estimation of HIV dynamic parameters. Stat Med. 1998;17(21):2463â€“85.
Nelder JA, Wedderburn RW. Generalized linear models. Journal of the Royal Statistical Society: Series A (General). 1972;135(3):370â€“84.
Tang NS, Tang AM, Pan DD. Semiparametric Bayesian joint models of multivariate longitudinal and survival data. Comput Stat Data Anal. 2014;77:113â€“29.
Lu X, Huang Y. Bayesian analysis of nonlinear mixedeffects mixture models for longitudinal data with heterogeneity and skewness. Stat Med. 2014;33(16):2830â€“49.
Sahu SK, Dey DK, Branco MD. A new class of multivariate skew distributions with applications to Bayesian regression models. Canadian Journal of Statistics. 2003;31(2):129â€“50.
Huang X, Li G, Elashoff RM. A joint model of longitudinal and competing risks survival data with heterogeneous random effects and outlying longitudinal measurements. Statistics and its interface. 2010;3(2):185.
ArellanoValle R, Bolfarine H, Lachos V. Bayesian inference for skewnormal linear mixed models. J Appl Stat. 2007;34(6):663â€“82.
Ariyo OS, Adeleke MA. Simultaneous Bayesian modelling of skewnormal longitudinal measurements with nonignorable dropout. Comput Statistics. 2022;37(1):303â€“25.
Molenberghs G, Verbeke G. Models for Discrete Longitudinal Data. New York: Springer Series in Statistics. 2005. p. 419â€“435. https://scihub.se/https://link.springer.com/10.1007/0387289801.
McCulloch CE, Neuhaus JM. Misspecifying the Shape of a Random Effects Distribution: Why Getting It Wrong May Not Matter. Stat Sci. 2011;26(3):388â€“402.
Baghfalaki T, Kalantari S, Ganjali M, Hadaegh F, Pahlavanzadeh B. Bayesian joint modeling of ordinal longitudinal measurements and competing risks survival data for analysing Tehran Lipid and Glucose Study. J Biopharm Stat. 2020;30(4):689â€“703.
Zhang H, Huang Y. Bayesian joint modeling for partially linear mixedeffects quantile regression of longitudinal and timetoevent data with limit of detection, covariate measurement errors and skewness. J Biopharm Stat. 2021;31(3):295â€“316.
Lu X, Huang Y, Chen J, Zhou R, Yu S, Yin P. Bayesian joint analysis of heterogeneousand skewedlongitudinal data and a binary outcome, with application to AIDS clinical studies. Stat Methods Med Res. 2018;27(10):2946â€“63.
Azarbar A, Wang Y, Nadarajah S. Simultaneous Bayesian modeling of longitudinal and survival data in breast cancer patients. Communications in StatisticsTheory and Methods. 2021;50(2):400â€“14.
Goolsby MJ. National Kidney Foundation Guidelines for chronic kidney disease: evaluation, classification, and stratification. J Am Acad Nurse Pract. 2002;14(6):238â€“42.
Lee S, McLachlan GJ. Finite mixtures of multivariate skew tdistributions: some recent and new results. Stat Comput. 2014;24(2):181â€“202.
Brooks SP, Gelman A. General methods for monitoring convergence of iterative simulations. J Comput Graph Stat. 1998;7(4):434â€“55.
Spiegelhalter DJ, Best NG, Carlin BP, Van Der Linde A. Bayesian measures of model complexity and _t. Journal of the royal statistical society: Series b (statistical methodology). 2002;64(4):583â€“639.
Dagne GA, Huang Y. Bayesian semiparametric mixture Tobit models with left censoring, skewness, and covariate measurement errors. Stat Med. 2013;32(22):3881â€“98.
Andrinopoulou ER, Rizopoulos D, Takkenberg JJ, Lesare E. Combined dynamic predictions using joint models of two longitudinal outcomes and competing risk data. Stat Methods Med Res. 2017;26(4):1787â€“801.
Zhang H, Huang Y. Quantile regressionbased Bayesian joint modeling analysis of longitudinalsurvival data, with application to an AIDS cohort study. Lifetime Data Anal. 2020;26:339â€“68.
Ferede MM, Mwalili S, Dagne G, Karanja S, Hailu W, ElMorshedy M, et al. A Semiparametric Bayesian Joint Modelling of Skewed Longitudinal and Competing Risks Failure Time Data: With Application to Chronic Kidney Disease. Mathematics. 2022;10(24):4816.
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
MMF, GAD, SMM, and HAE conceived and designed the study. MMF and WHB participated in the data collection. MMF and HAE carried out the data analysis and wrote the manuscript. SMM, GAD, HAE, WHB and SMK reviewed drafts of the manuscript and offered interpretation and critical comments. All authors participated in the review and approval of the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
The study was reviewed and approved by the Institutional Ethical Review Board of the University of Gondar, Ethiopia (Ref. VP/RTT/05/777/2022). All methods were carried out in accordance with relevant guidelines and regulations (the Helsinki Declaration). The Institutional Ethical Review Board of the University of Gondar also waived the need for written informed consent from individuals due to the retrospective nature of the study and its major focus on model development. The data were therefore anonymized before the analysis.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisherâ€™s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1:
Appendix A. Convergence diagnostic checking results. Figure A.1. Trace plots of some representative parameters from the chosen model. Figure A.2. Autocorrelation function plots (a) and BGR plots (b) of some representative parameters. Table A.1. Results of the Geweke's test of convergence. The computed value of the test statistic for each parameter from the chosen model. Appendix B. Skew Distributions.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Ferede, M.M., Dagne, G.A., Mwalili, S.M. et al. Flexible Bayesian semiparametric mixedeffects model for skewed longitudinal data. BMC Med Res Methodol 24, 56 (2024). https://doi.org/10.1186/s1287402402164y
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1287402402164y