Data
In this retrospective cohort study (approved by “The Ethical Committee of the Hamadan University of Medical Sciences”; NO. IR.UMSHA.REC.1398.075), all records of patients with schizophrenia hospitalized at Sina Hospital (Farshchian) in Hamadan, Iran, from March 2011 to March 2019 were reviewed, and finally, 413 patients were selected according to inclusion criteria. All methods were carried out by relevant guidelines and regulations. Inclusion criteria were: the diagnostic criteria for schizophrenia in ICD-10 (International Classification of Diseases, version 10), patients without change of disease in each hospitalization, and patients without other psychiatric disorders at the same time. To extract the information from the patients’ files (without patient involvement), a checklist made by the researchers and a clinical consultant was used. In this checklist, the independent variables were divided into two parts: demographic and clinical characteristics, which were:
-
a)
Demographic characteristics: age, gender, birth season, the birthplace township, education status, marital status, number of children, employment status, residence status, the township of residence, living status, having homogeneous siblings, number of siblings, history of emotional distress, including illness or death of a family member or loved one, parental divorce, marriage, pregnancy, emotional breakdown, etc., history of arrest or prison, history of substance abuse, including opium, heroin, cannabis, methamphetamine, crack, and so on (based on a compilation of interviews with patients and their companions and laboratory results), and a history of smoking. In addition, to determine the population of the city or village where people live, the results of the general population and housing censuses in 2011 and 2016 were used.
-
b)
Clinical characteristics: age at onset of illness, duration of illness, family history of psychiatric illness, history of medical disease, including cardiovascular, visual and auditory diseases, skin, hypertension, hyperlipidemia, glands, gastrointestinal, neurological, rheumatological and urological, history of non-adherence to antipsychotic drugs (including failure to take medication, taking lower doses than prescribed, and premature termination of medication), and history of the suicide attempt.
The variable of the number of psychiatric rehospitalizations of schizophrenic patients (from the onset of the disease to the end of the study (March 2019)) was considered as the response variable. It is noteworthy that variables were measured at the time of the last recurrence of the symptoms of the patients’ disease. Descriptive statistics of the response variable and characteristics of the studied schizophrenic patients are presented in Tables 2, 3 and 4.
Statistical analysis
Variable selection methods including ALASSO, SCAD, MCP, Backward Stepwise, and RF were used through a simulation study by generating Negative Binomial distribution for the response variable. Methods were compared and the best one was used to fit on schizophrenia data to determine variables related to the number of psychiatric rehospitalization for schizophrenia disorder. The best model also was compared to other models such as zero-inflated and full models using schizophrenia data.
Poisson regression (PR) model
Poisson regression is the first model used specifically for modeling count data and also is considered as the basis for many count models [13]. If \(Y_{i}\) (\(i = 1,...,n\)) is a discrete random variable and follows the Poisson distribution with the parameter \(\mu_{i}\) and \(n\) is the sample size, then its density function is defined as follows:
$$f(Y_{i} = y_{i} ;\mu_{i} ) = \frac{{(e^{{ - \mu_{i} }} )\,(\mu_{i}^{{y_{i} }} )}}{{y_{i} !}}\,\,\,\,\,\,;y_{i} = 0,1,2,...\,$$
(1)
With equal mean and variance: \(E(Y_{i} ) = V(Y_{i} ) = \mu_{i}\).
In the Poisson regression model, a logarithmic link function is typically used to link \(\mu_{i}\) to the linear predictor \(X_{i}\):
$$\log (\mu_{i} ) = X_{i}^{T} \beta$$
(2)
where the \(X_{i} = (1,X_{i1} ,X_{i2} ,...,X_{ip} )^{T}\) shows the vector of predictor variables for the ith subject, \(\beta = (\beta_{0} ,\beta_{1} ,\beta_{2} ,...,\beta_{p} )^{T}\) shows the vector of regression coefficients, and \(p\) indicates the number of predictor variables.
Negative Binomial (NB) regression model
In practice, the variability of the response variable is often greater than the expected value (over-dispersion), in which case the use of the Negative Binomial regression model is common [14]. If \(Y_{i}\) as a random variable has the Negative Binomial distribution with the mean parameter \(\mu_{i}\) and the shape parameter \(k\) (to control over-dispersion), its probability density function is defined as follows:
$$f(Y_{i} = y_{i} ;\mu_{i} ,k) = \frac{{\Gamma (y_{i} + k)}}{{\Gamma (k)\Gamma (y_{i} + 1)}}\left( {\frac{{\mu_{i} }}{{\mu_{i} + k}}} \right)^{{y_{i} }} \left( {\frac{k}{{\mu_{i} + k}}} \right)^{k} \,\,\,\,;y_{i} = 0,1,2,...$$
(3)
With mean \(E(Y_{i} ) = \mu_{i} \,\) and variance \(V(Y_{i} ) = \mu_{i} + \frac{{\mu_{i}^{2} }}{k}\).
The Negative Binomial regression model is defined as the Poisson regression model (Eq. (2)).
Zero-inflated (ZI) regression models
Zero-inflated models for high-zero data were proposed by Mullahy in 1986 [15]. The zero-inflated count model is a mixed model in which one component is count and the other is zero-degenerated. The count component is usually modeled as the Poisson distribution or Negative Binomial. Lambert in 1992 proposed the zero-inflated Poisson (ZIP) model with an application to defects in manufacturing [16]. The zero-inflated Negative Binomial (ZINB) model was introduced as a generalization of the Negative Binomial model to include large zeros in the data by Mwalili et al. in 2008 [17]. The density of the ZI model is expressed as follows (\(0 < \phi < 1\)):
$$\Pr [Y = n] = \left\{ {\begin{array}{*{20}c} {\phi + (1 - \phi )\Pr [Y = 0]} & {for\,\,\,n = 0\,\,\,\,\,\,\,\,} \\ {(1 - \phi )\Pr [Y = n]} & {for\,\,\,n = 1,2,...} \\ \end{array} } \right.\,$$
(4)
where the random variable Y follows a standard distribution such as Poisson or Negative Binomial and \(\phi\) represents the uncertainty parameter (mixing proportion).
Penalized methods
Various penalty methods such as LASSO, SCAD, ALASSO, and MCP have been proposed for variable selection. In penalty methods, removing the predictor variables not related to the response variable increases the interpretability of the model and reduces the over-fitting of the data. These methods penalize the regression coefficients in the likelihood function and select the variables by setting some coefficients to zero. In other words, the estimation of regression coefficients is obtained by minimizing the logarithm of the penalized likelihood function. Thus, variable selection and estimation of regression coefficients do simultaneously [18]:
$$\widehat\beta\,_{\mathbf{Penalized}\,\,\mathbf{method}}=\arg\;\min\nolimits_{\mathrm\beta}\left(-\underbrace{l(\beta)}_{\mathbf{Log}\,\boldsymbol-\mathbf l\mathbf i\mathbf k\mathbf e\mathbf l\mathbf i\mathbf h\mathbf o\mathbf o\mathbf d}+\sum_{j=1}^k\underbrace{P(\beta_j)}_{\mathbf{Penalty}\,\,\mathbf{Function}}\right)\,$$
(5)
where \(l(\beta )\) is the logarithm of the likelihood function (the function can be considered PR, NB, ZIP, and ZINB), \(P(.)\) is the penalty function, \(\beta = \left( {\beta_{1} ,\beta_{2} ,...,\beta_{k} } \right)^{T}\) is the vector of regression coefficients, and \(k\) is the number of explanatory variables.
Penalty functions
LASSO (Tibshirani (1996)) [19]: \(P_{\lambda } (\beta_{j} ) = \sum\limits_{j = 1}^{k} {\lambda \left| {\beta_{j} } \right|}\), where \({{\varvec{\uplambda}}} \ge 0\) is the tuning parameter.
SCAD (Fan and Li (2001)) [11]: \(p_{\lambda } (\beta_{j} ;a) = \left\{ {\begin{array}{*{20}c} {\lambda \left| {\beta_{j} } \right|} & {\left| {\beta_{j} } \right| \le \lambda } \\ { - \left( {\frac{{\beta_{j}^{2} - 2a\lambda \left| {\beta_{j} } \right| + \lambda^{2} }}{2(a - 1)}} \right)} & {\,\,\,\,\lambda < \left| {\beta_{j} } \right| \le a\lambda } \\ {\frac{{(a + 1)\lambda^{2} }}{2}} & {\left| {\beta_{j} } \right| > a\lambda } \\ \end{array} } \right.\).
Where \({{\varvec{\uplambda}}} \ge 0\) and \(a > 2\) are tuning parameters.
ALASSO (Zou (2006)) [10]: \(P_{\lambda } (\beta_{j} ) = \sum\limits_{j = 1}^{k} {\lambda w_{j} \left| {\beta_{j} } \right|}\), where \({{\varvec{\uplambda}}} \ge 0\) is the tuning parameter and \({\hat{\mathbf{w}}}_{j} = \frac{1}{{\left| {\hat{\beta }_{j} } \right|^{\gamma } }}\) is an adaptive weight vector (\(\gamma > 0\); a possible value for \(\hat{\beta }_{j}\) is the coefficients obtained from the ordinary least squares estimator method).
MCP (Zhang (2010)) [12]: \(p_{a} (\beta_{j} ;\lambda ) = \left\{ {\begin{array}{*{20}c} {\lambda \left| {\beta_{j} } \right| - \frac{{\beta_{j}^{2} }}{2a}} & {\left| {\beta_{j} } \right| < a\lambda } \\ {\frac{{\lambda^{2} a}}{2}} & {\left| {\beta_{j} } \right| \ge a\lambda } \\ \end{array} } \right.\).
Where \({{\varvec{\uplambda}}} \ge 0\) and \(a > 1\) are tuning parameters.
In the present study, the tuning parameter \(a\) for SCAD and MCP was considered equal to 3.7 [11, 20] and the tuning parameter λ was estimated by optimization.
Random Forest variable selection
Random Forest introduced by Breiman (2001) is a popular algorithm and belongs to the family of ensemble methods used for both classification and regression problems [21]. This technique predicts an outcome by averaging the output of hundreds or more decision trees [22]. RF is also used as a variable selection approach in order to find informative variables [23]. In this paper, we used RF variable selection using the tree minimal depth methodology introduced by Ishwaran et al. (2010) [24].
Simulation study
We carried out a simulation study by setting up six different scenarios to evaluate and compare the performance of five different variable selection methods. In the simulation study, we generated 100 data sets with sample sizes of 500 as a training set to optimize tunning parameters of ALASSO, SCAD, MCP, and RF. The tenfold cross-validation and out-of-bag (OOB) error were used to optimize the tuning parameters in penalized methods and the Random Forest, respectively. Furthermore, we generated additional data sets with sample sizes of 1000 in order to calculate evaluation criteria including the number of false-negative cases (NO. FN), number of true-positive cases (NO. TP), number of false-positive cases (NO. FP), number of true-negative cases (NO. TN), total accuracy (TA), sensitivity, and specificity. The number of covariates was set to 20, and 40, and they were generated from Multivariate Normal distribution with different values of correlations (\(\rho = 0.2,\,\,0.5\,\,,0.7\)). Five covariates were considered effective (informative covariates). For non-informative covariates, the regression coefficients were considered zero. The response variable was generated from the Negative Binomial distribution.
Software
After entering the information recorded in the patients’ files into SPSS software (version 24) and grouping the variables, regression models were fitted using R software (version 4.1.1) by mpath, randomForestSRC, and MASS packages. Also, the pscl package was used to run the Vuong test. We used a significance level of 0.05 for all statistical analyzes.