 Research
 Open Access
To tune or not to tune, a case study of ridge logistic regression in small or sparse datasets
BMC Medical Research Methodology volume 21, Article number: 199 (2021)
Abstract
Background
For finite samples with binary outcomes, penalized logistic regression, such as ridge logistic regression, has the potential to achieve smaller mean squared errors (MSE) of coefficients and predictions than maximum likelihood estimation. There is evidence, however, that ridge logistic regression can result in highly variable calibration slopes in small or sparse data situations.
Methods
In this paper, we elaborate on this issue by performing a comprehensive simulation study, investigating the performance of ridge logistic regression in terms of coefficients and predictions and comparing it to Firth’s correction, which has been shown to perform well in low-dimensional settings. In addition to tuned ridge regression, where the penalty strength is estimated from the data by minimizing some measure of the out-of-sample prediction error or an information criterion, we also considered ridge regression with a prespecified degree of shrinkage. We included ‘oracle’ models in the simulation study, in which the complexity parameter was chosen based on the true event probabilities (prediction oracle) or regression coefficients (explanation oracle), to demonstrate the capability of ridge regression if the truth was known.
Results
The performance of ridge regression strongly depends on the choice of the complexity parameter. As shown in our simulations and illustrated by a data example, values optimized in small or sparse datasets are negatively correlated with optimal values and suffer from substantial variability, which translates into large MSE of coefficients and large variability of calibration slopes. In contrast, in our simulations prespecifying the degree of shrinkage prior to fitting led to accurate coefficients and predictions even in non-ideal settings such as those encountered in the context of rare outcomes or sparse predictors.
Conclusions
Applying tuned ridge regression in small or sparse datasets is problematic as it results in unstable coefficients and predictions. In contrast, determining the degree of shrinkage according to some meaningful prior assumptions about true effects has the potential to reduce bias and stabilize the estimates.
Background
In medical research, logistic regression is commonly used to study the relationship between a binary outcome and a set of covariates. For a dataset with a similar prevalence of the two outcome levels and sufficient sample size, maximum likelihood estimation of the regression coefficients facilitates inference, i.e. interpretability of effect estimates, as well as accuracy of predictions given the covariates. Thus, maximum likelihood logistic regression may be used for explanation or prediction, depending on the context. These attractive properties of maximum likelihood logistic regression, however, vanish when the sample size is small or the prevalence of one of the two outcome levels (for some combinations of the exposures) is low, yielding coefficient estimates biased away from zero and very unstable predictions that generalize poorly to a new dataset from the same population [1, 2].
In theory, a straightforward approach to alleviate the problem would be to apply penalized maximum likelihood logistic regression: a penalty term added to the log-likelihood function provides shrinkage of the coefficients towards zero, thereby decreasing the variance of the maximum likelihood estimates and stabilizing the predictions by pulling them towards the observed event rate [3]. A common form of shrinkage is ridge logistic regression, where the penalty is defined as minus the squared Euclidean norm of the coefficients multiplied by a non-negative complexity parameter λ. The multiplier λ controls the strength of the penalty, i.e. the amount of shrinkage towards zero. According to the idea of the bias-variance tradeoff, the expected prediction error can be decomposed into three components: bias, variance and irreducible error [4]. Hence, the goal in ridge regression is to find the value of λ that balances the model between underfitting and overfitting, producing generalizable results [5]. Compared to maximum likelihood estimation, the resulting coefficients may achieve lower mean squared errors (MSE) but are usually biased towards zero; therefore, conventional inference by hypothesis tests and confidence intervals based on standard errors is difficult [6]. A further complication for inference arises from the estimation of λ, which is often performed on the same dataset by cross-validation, as its sampling variability contributes to the uncertainty in the regression coefficients.
Tuned ridge logistic regression has been extensively investigated in simulation studies and was commonly found to perform well in low-dimensional settings in terms of small MSE of coefficients and predictions [2, 7, 8]. However, one should not expect that penalization can overcome the problem of insufficient sample sizes when developing prediction models [9]. Indeed, there has been evidence that ridge regression is sensitive to small or sparse data situations, yielding poor performance in individual datasets [10,11,12,13]. Recent recommendations, therefore, advise caution when using ridge logistic regression for developing prediction models in case of a low sample size or a low events-per-variable ratio and call for more research investigating the impact of specific combinations of shrinkage and tuning methods [11]. While in theory there always exists some value of λ for which ridge regression outperforms maximum likelihood estimation in terms of the MSE of predictions [14], choosing λ adequately in datasets that suffer from large random sampling variation is difficult. For such datasets, tuning procedures based on out-of-sample prediction performance might fail to approximate the U-shaped curve arising from the bias-variance tradeoff and result in an arbitrary choice of λ that equals either the smallest or the largest value of the prespecified range of values. This will yield large variability of tuned solutions and, consequently, very unstable estimates [13].
We assume that the larger variability of calibration slopes in small or sparse datasets as compared to Firth’s correction [11] is closely related to tuning and not to ridge regression as a shrinkage method per se. Therefore, in the present paper we investigate the performance of different commonly used approaches to tune ridge logistic regression in a low-dimensional sparse data setting by means of a simulation study. We also include ridge regression with prespecified λ, which is interpretable as a semi-Bayesian analysis with a normal prior centered at zero [1, 15, 16], and Firth’s correction [17] in our comparison, as these approaches were proposed for similar settings [7, 11, 12, 18] and do not suffer from the convergence issues that may occur in maximum likelihood estimation [19]. We structured the paper accordingly: in the following section we introduce Firth’s correction and ridge logistic regression and describe different ways to choose the complexity parameter λ in ridge regression. We then illustrate the problems that might arise with tuning in sparse data situations. Subsequently, we present the setup and report the results of our simulation study with respect to the accuracy of coefficients and predictions. Further on, we perform an analysis of a real data example by fitting ridge regression and Firth’s correction models. Finally, we summarize our main findings.
Methods
Let y_{i} ∈ {0, 1}, i = 1, …, N, be a realization of a binary outcome variable Y, where y_{i} = 1 denotes an event occurring in the ith observation. The logistic regression model associates y_{i} with a set of corresponding covariate values x_{i} = (1, x_{i1}, …, x_{iK}), K < N, by assuming
\( {\pi}_i=P\left({y}_i=1|{\mathbf{x}}_i\right)={\left(1+\exp \left(-{\mathbf{x}}_i\boldsymbol{\upbeta} \right)\right)}^{-1}, \)
where β_{0} is an intercept and β_{k}, k = 1, …, K, are regression coefficients. The parameters β = (β_{0}, β_{1}, …, β_{K}) of the model can be estimated by the maximum likelihood method, maximizing the log-likelihood function
\( \ell \left(\boldsymbol{\upbeta} \right)=\sum \limits_{i=1}^N\left({y}_i\log {\pi}_i+\left(1-{y}_i\right)\log \left(1-{\pi}_i\right)\right), \) using an iterative algorithm [20].
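The log-likelihood above can be written out directly. The following Python sketch is our own illustration (the paper's analyses were done in R, and the function name is ours); it only mirrors the formula:

```python
import numpy as np

def log_likelihood(beta, X, y):
    """l(beta) = sum_i [ y_i log(pi_i) + (1 - y_i) log(1 - pi_i) ],
    with pi_i = (1 + exp(-x_i beta))^{-1}; X carries a leading column of ones."""
    eta = X @ beta
    # log(pi_i) = -logaddexp(0, -eta_i) and log(1 - pi_i) = -logaddexp(0, eta_i);
    # logaddexp keeps both terms numerically stable for large |eta_i|
    return np.sum(-y * np.logaddexp(0.0, -eta) - (1 - y) * np.logaddexp(0.0, eta))
```

For β = 0 every π_{i} equals 1/2, so the log-likelihood reduces to −N log 2.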
Firth’s correction
Maximum likelihood estimation is asymptotically unbiased; however, in situations when data are small or sparse, coefficient estimates become biased away from zero and very unstable, or may not even exist [19]. To reduce the bias of maximum likelihood estimates, Firth [17] proposed to penalize the likelihood function by Jeffreys’ invariant prior, so that the penalized log-likelihood becomes
\( {\ell}^{\ast}\left(\boldsymbol{\upbeta} \right)=\ell \left(\boldsymbol{\upbeta} \right)+\frac{1}{2}\log \left|I\left(\boldsymbol{\upbeta} \right)\right|, \)
where I(β) is the Fisher information matrix evaluated at β. Since the intercept is included in the penalty term, the average predicted probability may not equal the observed event rate but is instead biased towards one-half. To correct for this bias, which may become especially apparent in situations with unbalanced outcomes, Puhr et al. [7] proposed a simple modification, Firth’s logistic regression with intercept-correction (FLIC), that alters the intercept such that the average predicted probability becomes equal to the observed event rate.
Ridge regression
In ridge regression the coefficients are constrained by the squared Euclidean norm of the coefficients, i.e. the penalized log-likelihood reads
\( {\ell}_{ridge}^{\ast}\left(\boldsymbol{\upbeta} \right)=\ell \left(\boldsymbol{\upbeta} \right)-\lambda \sum \limits_{k=1}^K{\beta}_k^2, \)
where the positive complexity parameter λ controls the amount of shrinkage towards zero. The intercept β_{0} is excluded from the penalty term, yielding an average predicted probability equal to the observed event rate. Unlike Firth’s correction, ridge regression is not invariant to linear transformation of the design matrix. Therefore, to facilitate interpretation and ensure that coefficients are represented on the same scale, suitable standardization of covariates is required, usually to zero mean and unit variance.
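A direct numerical implementation of this penalized log-likelihood can be sketched as follows. This is our own illustrative Python code, not the R routine used in the paper; the optimizer choice and function name are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def fit_ridge_logistic(X, y, lam):
    """Maximize l(beta) - lam * sum_k beta_k^2, leaving the intercept beta_0
    unpenalized. X holds the (already standardized) covariates, without a
    column of ones."""
    N, K = X.shape
    Xd = np.column_stack([np.ones(N), X])  # prepend intercept column

    def neg_penalized_ll(beta):
        eta = Xd @ beta
        # stable binomial log-likelihood, as in the definitions above
        ll = np.sum(-y * np.logaddexp(0.0, -eta) - (1 - y) * np.logaddexp(0.0, eta))
        return -(ll - lam * np.sum(beta[1:] ** 2))  # intercept excluded from penalty

    return minimize(neg_penalized_ll, np.zeros(K + 1), method="BFGS").x
```

Increasing lam pulls the covariate coefficients towards zero, while the unpenalized intercept adjusts so that the average predicted probability stays at the observed event rate.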
Tuning procedures
To select the complexity parameter λ, generally, a sequence of λ values is prespecified and the corresponding set of models is evaluated. The optimized λ^{∗} is the one that produces the model minimizing the expected out-of-sample prediction error, often estimated by cross-validation. The out-of-sample prediction error may be defined in different ways [3, 21,22,23], e.g. as

deviance (D) [3]
$$ D=-2\sum \limits_{i=1}^N\left({y}_i\log {\hat{\pi}}_{\left(i\right)}+\left(1-{y}_i\right)\log \left(1-{\hat{\pi}}_{\left(i\right)}\right)\right), $$
generalized crossvalidation (GCV) [21, 23]
$$ GCV=\frac{N\cdot D}{{\left(N-d{f}_e\right)}^2}, $$
where df_{e} are the effective degrees of freedom, \( d{f}_e=\mathrm{trace}\left(\frac{\partial^2\ell }{\partial^2\boldsymbol{\upbeta}}\left(\hat{\boldsymbol{\upbeta}}\right){\left(\frac{\partial^2{\ell}_{ridge}^{\ast }}{\partial^2\boldsymbol{\upbeta}}\left(\hat{\boldsymbol{\upbeta}}\right)\right)}^{-1}\right) \),

classification error (CE) [3]
$$ CE=\frac{1}{N}\sum \limits_{i=1}^N\left({y}_iI\left({\hat{\pi}}_{\left(i\right)}<c\right)+\left(1-{y}_i\right)I\left({\hat{\pi}}_{\left(i\right)}>c\right)+\frac{1}{2}I\left({\hat{\pi}}_{\left(i\right)}=c\right)\right), $$
with I denoting an indicator function and c some cutoff, usually set to 1/2. Since in datasets with unbalanced outcomes c = 1/2 would assign most observations to the more frequent outcome level, Blagus and Lusa [10] advised setting c equal to the marginal event rate instead.
In the definitions above, \( {\hat{\pi}}_{\left(i\right)} \) is the event probability estimate for the ith observation computed from the model in which that observation has been left out from the estimation of the model parameters. Alternatively, 10-fold cross-validation may be used to speed up computations; however, this produces different optimized λ^{∗} values for different assignments of observations to folds. To stabilize the selection of λ^{∗}, 10-fold cross-validation may be repeated several times, and a particular quantile θ of the obtained values may be used [2, 24].
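The leave-one-out tuning loop based on D can be sketched as follows, for any fitting routine fit(X, y, lam) that returns a coefficient vector with the intercept first (an assumed interface; this Python sketch is ours, not the paper's R code):

```python
import numpy as np

def loocv_deviance(X, y, lam, fit):
    """Leave-one-out cross-validated deviance D for a single value of lambda."""
    N = X.shape[0]
    D = 0.0
    for i in range(N):
        keep = np.arange(N) != i
        beta = fit(X[keep], y[keep], lam)  # refit without observation i
        eta = beta[0] + X[i] @ beta[1:]    # linear predictor for observation i
        # d_i = -2 [ y_i log(pi_(i)) + (1 - y_i) log(1 - pi_(i)) ], written stably
        D += 2.0 * np.logaddexp(0.0, -eta) if y[i] == 1 else 2.0 * np.logaddexp(0.0, eta)
    return D

def tune_lambda(X, y, fit, grid=np.logspace(-6, 2, 200)):
    """Pick lambda* over a grid of log-linearly spaced values, as in the paper."""
    return min(grid, key=lambda lam: loocv_deviance(X, y, lam, fit))
```

Because the whole model is refitted N times per candidate λ, this exact leave-one-out scheme is what the repeated 10-fold variant above trades against for speed.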
Alternatively, to avoid resampling, λ may be tuned by using Akaike’s information criterion (AIC) [6, 25], where \( AIC=-2\ell \left(\hat{\boldsymbol{\upbeta}}\right)+2d{f}_e \), with df_{e} the effective degrees of freedom defined above.
Prespecifying the degree of shrinkage
Mathematically, ridge regression is identical to a Bayesian analysis with zero-centered univariate normal priors imposed on the coefficients [1]. The variance v_{prior} of these priors is inversely proportional to λ. If the priors for the coefficients are assumed to have different variances, this translates into a penalty equal to a weighted sum of squared coefficients, with a different weight for each coefficient. In the Bayesian analysis approach suggested by Sullivan and Greenland [15], the degree of shrinkage is not determined by tuning but is instead based on prior assumptions about the covariates’ odds ratios that can be easily converted into v_{prior}. The prior variance v_{prior} can be obtained from a plausible (usually 95%) prior interval for a covariate’s odds ratio, specified according to background knowledge. In a particular setting, Sullivan and Greenland [15] considered as plausible the 95% odds ratio interval ranging from 1/4 to 4, which translates to v_{prior} = 1/2. However, if one wishes to avoid the effort of specifying prior distributions, one could apply weakly informative priors, e.g. assuming a 95% probability that the odds ratio falls between 1/16 and 16, which still help to stabilize the estimates.
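The conversion from a symmetric 95% prior odds-ratio interval (1/U, U) to v_{prior} amounts to treating log(U) as 1.96 prior standard deviations on the log odds-ratio scale. A small worked sketch (the helper name is ours):

```python
import math

def prior_variance(upper_or, z=1.959964):
    """v_prior implied by a symmetric 95% prior interval (1/upper_or, upper_or)
    for a covariate's odds ratio: the zero-mean normal prior on the log odds
    ratio has standard deviation log(upper_or) / z."""
    return (math.log(upper_or) / z) ** 2

print(prior_variance(4))   # ~0.5: the informative interval 1/4 to 4
print(prior_variance(16))  # ~2: the weakly informative interval 1/16 to 16
```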
Illustration
Consider the two datasets described below, each with 100 independent observations of a binary outcome, y_{i} ∈ {0, 1}, and a single covariate x_{i} ∈ {0, 1}, i = 1, …, 100.
Dataset 1

          y = 0    y = 1
  x = 0      20        0
  x = 1      71        9

Dataset 2

          y = 0    y = 1
  x = 0      19        1
  x = 1      71        9
In dataset 1, separation occurs as there are no observations with x_{i} = 0 and y_{i} = 1. Therefore, maximum likelihood estimation yields perfect leave-one-out cross-validated predictions \( {\hat{\pi}}_{\left(i\right)}=0 \) for x_{i} = 0, and thus the corresponding individual out-of-sample prediction errors equal \( {D}_{x_i=0}=0 \). These errors, however, increase with shrinkage (in particular, we considered a fixed sequence of 200 log-linearly equidistant λ values ranging from 10^{−6} to 100) as the predicted probabilities get pulled towards 9/99 (Fig. 1). In addition, the errors increase with shrinkage for the 9 observations with an event, as \( {\hat{\pi}}_{\left(i\right)}=8/79=0.1 \) in the maximum likelihood model and \( {\hat{\pi}}_{\left(i\right)}=0.08 \) for λ = 100. Conversely, shrinkage reduces the errors of the 71 observations with x_{i} = 1 and y_{i} = 0, but the predicted probabilities are similar for λ = 100 (\( {\hat{\pi}}_{\left(i\right)}=0.09 \)) and λ = 10^{−6} (\( {\hat{\pi}}_{\left(i\right)}=0.11 \)), and so are the differences between the error estimates for λ = 100 and λ = 10^{−6} (Fig. 1). Therefore, the tuning procedure based on D favors the smallest of the prespecified range of λ values, in our example λ^{∗} = 10^{−6}. In this case, fitting a ridge regression model with a standardized covariate X using the R [26] package penalized [27] yields an estimate of β_{1} as large as 13.94, a consequence of data sparsity [1, 19, 28]. In contrast, FLIC (fitted using the package logistf [29]) and ridge regression with an informative prior (IP), assuming a 95% prior interval for the odds ratio of a standardized covariate ranging from 1/4 to 4, yield interpretable coefficient estimates (Table 1).
In dataset 2, we have a single observation with an event and x_{i} = 0 for which maximum likelihood estimation falsely predicts \( {\hat{\pi}}_{\left(i\right)}=0 \). While for all other observations the out-of-sample prediction errors D_{i} do not change much when applying shrinkage (for some observations D_{i} gets slightly larger and for others slightly smaller), the error for this single observation decreases considerably with increasing shrinkage (Fig. 1). This results in a λ^{∗} that equals the largest of the prespecified range of values, in our example λ^{∗} = 100. Obviously, this overshrinks the coefficients as compared to FLIC and IP (Table 1).
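The two illustration datasets can be reconstructed from their cell counts, making the separation in dataset 1 explicit (illustrative Python; the variable and helper names are ours):

```python
import numpy as np

def expand(cells):
    """Expand (x, y, count) cells of a 2x2 table into observation vectors."""
    x = np.repeat([c[0] for c in cells], [c[2] for c in cells])
    y = np.repeat([c[1] for c in cells], [c[2] for c in cells])
    return x, y

x1, y1 = expand([(0, 0, 20), (0, 1, 0), (1, 0, 71), (1, 1, 9)])  # dataset 1
x2, y2 = expand([(0, 0, 19), (0, 1, 1), (1, 0, 71), (1, 1, 9)])  # dataset 2

# Dataset 1 is separated at x = 0: no events occur there, so maximum likelihood
# pushes the coefficient of x away from zero without bound.
assert y1[x1 == 0].sum() == 0 and y1.size == 100
# Dataset 2 contains exactly one event at x = 0, which dominates the tuning
# criterion and drives lambda* to the upper end of the grid.
assert y2[x2 == 0].sum() == 1 and y2.size == 100
```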
Repeating the data-generating process used to generate the two datasets 500 times, in which a binary covariate X was sampled with E(X) = 0.8 and the binary outcome y_{i} was drawn from a Bernoulli distribution with true event probability (1 + exp(−(−3.05 + x_{i})))^{−1}, each time standardizing X and tuning the value of λ^{∗} by D, results in a choice of λ^{∗} that for 42% of the simulated datasets simply equals the smallest and for 38% of the datasets the largest value of the prespecified range of λ values. This reflects a large variability of tuned λ^{∗} values and, consequently, very unstable coefficients with large expected MSE. The large MSE of coefficients is mostly due to data sparsity, which leads to very small optimized λ^{∗} values and huge coefficients. It is reasonable to assume that the instability in optimized complexity parameter values carries over to prediction performance, translating into calibration slopes of large variability (models that strongly underfit or overfit). Indeed, the median (25th and 75th percentile) of calibration slopes, evaluated on a dataset of size 10,000 independently generated from the same distribution, was 0.07 (0.07, 18.4). In contrast, for FLIC and IP no tuning is required, and plausible estimates of β_{1} and more stable calibration slopes were produced over the 500 datasets. In particular, the bias (\( \mathrm{E}\left(\hat{\beta}-\beta \right) \)) and MSE (\( \mathrm{E}{\left(\hat{\beta}-\beta \right)}^2 \)) of the coefficient estimate, and the median (25th and 75th percentile) of calibration slopes, were −0.15, 0.64 and 0.77 (0.53, 1.77) for FLIC and −0.11, 0.45 and 0.82 (0.6, 1.6) for IP, respectively.
Simulation study
Design
We describe the simulation study design following recommendations by Morris [30].
Aims
Our aim was to systematically investigate the performance of ridge logistic regression in terms of effect estimation and prediction in low-dimensional sparse data settings, where the complexity parameter λ was determined using different approaches, and to compare it to Firth’s correction with predictions obtained by FLIC.
Datagenerating mechanisms
To allow a fair comparison of the approaches [31], we considered a data-generating scheme similar to the one described in Binder et al. [32], featuring covariates with mixed types and shapes of distributions and a complex correlation structure, similar to what an analyst is usually confronted with in biomedical prognostic studies. Covariates X_{1}, …, X_{15} were obtained by applying certain transformations to variables Z_{1}, …, Z_{15} sampled from a standard multivariate normal distribution with correlation matrix Σ (Table 2). In particular, X_{1}, …, X_{4} were binary, X_{5} and X_{6} ordinal with three levels and X_{7}, …, X_{15} continuous. To avoid extreme values, the continuous variables were generated from truncated distributions, where the truncation was at the third quartile plus five times the interquartile range of the respective underlying distribution. The values of the binary outcome y_{i} were sampled from Bernoulli distributions with event probabilities (1 + exp(−β_{0} − a ∗ (β_{1}x_{i1} + … + β_{K}x_{iK})))^{−1}, where i = 1, …, N, N ∈ {100, 250, 500}, K ∈ {2, 5, 10}, the effect multiplier a ∈ {0.5, 1} for moderate and strong effects, respectively, and the true regression coefficients β_{1}, …, β_{K} were defined as follows: β_{1} = 2.08, β_{2} = 1.39, β_{3} = β_{4} = 0.69, β_{5} = β_{6} = 0.35, and β_{7}, …, β_{10} were chosen such that the log odds ratio between the first and the fifth sextile of the corresponding distribution was 0.69 (Table 3). An intercept β_{0} was determined for each simulation scenario such that the desired marginal event rate E(Y) ∈ {0.1, 0.25} was approximately obtained. We considered two types of analysis: one using exactly the set of real predictors used to generate the data, and one also including five noise covariates X_{11}, …, X_{15} that were not associated with the outcome. We refer to this factor as ‘noise’ (absent/present).
Combining the simulation parameters N (sample size), K (number of true predictors), a (effect multiplier), E(Y) (marginal event rate) and noise (absent/present) in a full factorial design resulted in 72 possible scenarios. At the request of a reviewer we added 16 more scenarios, also considering N = 1000 and combining it with a ∈ {0.5, 1}, E(Y) ∈ {0.1, 0.25}, K ∈ {5, 10} and noise (absent/present). We simulated 1000 datasets for each scenario. Table S1 in Additional file 1 shows the minimum sample size required for developing a prediction model in the different scenarios based on recent guidance [9, 33].
Methods
We analyzed each simulated dataset by fitting ridge and Firth’s logistic regression models. To obtain predictions based on Firth’s correction we applied FLIC as suggested by Puhr et al. [7]. To fit ridge regression models we first standardized the covariates of each dataset to zero mean and unit variance, and then optimized the complexity parameter λ^{∗} over a fixed sequence of 200 log-linearly equidistant values ranging from 10^{−6} to 100 by using the following procedures:

D;

GCV;

CE where the cutoff c was set to the observed event rate \( \frac{1}{N}\sum \limits_i{y}_i. \) As CE is discrete in nature and has no unique optimum in λ, in our study λ^{∗} was the largest λ minimizing CE;

D by 10-fold cross-validation with 50 repetitions (RCV), where λ^{∗} was chosen as the θth quantile of the obtained values with θ ∈ {0.5, 0.95} (RCV50, RCV95) [2, 24];

AIC;

restricting the standardized coefficients by informative (IP, λ = 2) and weakly informative prior assumptions (WP, λ = 1/2). In the simulations the degree of shrinkage was the same for all the covariates.
As a benchmark we defined two oracle models, determined by an amount of shrinkage ideal with respect to the estimation of β_{1} (explanation oracle, OEX) and to predictions (prediction oracle, OP). For OEX, λ^{∗} was chosen such that \( {\left({\hat{\beta}}_1-{\beta}_1\right)}^2 \) (or equivalently \( \mid {\hat{\beta}}_1-{\beta}_1\mid \)), where \( {\hat{\beta}}_1 \) is the ridge regression estimate of β_{1}, was minimized; for OP, λ^{∗} was the one minimizing \( {\sum}_{\mathrm{i}}{\left({\hat{\pi}}_i-{\pi}_i\right)}^2 \), where \( {\hat{\pi}}_i \) is the estimate of the ith true event probability π_{i}. To avoid model fitting problems, all ridge regression models were fitted by data augmentation [15] in the following way: two artificial data records were added for each covariate; the values for this covariate were set to 1/s and to zero for the other covariates, where s = 10 was a rescaling factor improving the approximation. Maximum likelihood estimation on this augmented dataset was then performed, specifying weights that equaled 1 for the original observations and 2s^{2}λ for the pseudo-observations. We used the libraries brglm2 [34] for detecting separation, penalized [27] for performing cross-validation and logistf [29] for model fitting in R version 4.0.2 [26].
Estimands
The true regression coefficient β_{1} and the vector of true event probabilities π were the estimands in our study.
Performance measures
We evaluated the root mean squared errors (RMSE) of coefficients, \( {\left(\frac{1}{1000}{\sum}_{s=1}^{1000}{\left({\hat{\beta}}_{1,s}-{\beta}_1\right)}^2\right)}^{1/2} \), where \( {\hat{\beta}}_{1,s} \), s ∈ {1, …, 1000}, is the estimate of β_{1} in the sth simulated dataset, and of predictions, \( {\left(\frac{1}{1000N}{\sum}_{s=1}^{1000}{\sum}_{i=1}^N{\left({\hat{\pi}}_{i,s}-{\pi}_{i,s}\right)}^2\right)}^{1/2} \), where \( {\hat{\pi}}_{i,s} \) and π_{i, s} are the estimated and true event probability for the ith observation in the sth simulated dataset. We also evaluated c-statistics, estimated with respect to newly generated outcomes, and calibration slopes evaluated on a validation dataset generated once for each scenario from the same population with a sample size of N = 10,000. The variability of calibration slopes was assessed by the median absolute deviation (MAD) of log(slope_{s}). To combine bias and variability of calibration slopes, we calculated root mean squared distances (RMSD, \( {\left(\frac{1}{1000}{\sum}_{s=1}^{1000}{d}_s^2\right)}^{1/2} \)), where the sth distance was defined as d_{s} = log(1) − log(slope_{s}), as suggested by Van Calster et al. [11]. To avoid issues with the negative slopes that were rarely obtained by the methods, we winsorized the slopes at 0.01 for the calculation of the RMSD. In addition, we assessed the Spearman correlation coefficients between calibration slopes and tuned complexity parameters λ^{∗}, as well as between the RMSD of calibration slopes achieved by the methods and the variability of the tuned complexity parameters λ^{∗}, expressed by the median absolute deviation, over all simulated scenarios.
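The calibration-slope summaries can be sketched as follows (our own Python illustration; as a simplifying assumption, winsorizing is applied before both summaries, whereas the paper describes it for the RMSD calculation):

```python
import numpy as np

def slope_summaries(slopes, winsor=0.01):
    """MAD of log(slope_s) and RMSD of d_s = log(1) - log(slope_s) = -log(slope_s),
    computed over the calibration slopes of all simulation runs."""
    s = np.maximum(np.asarray(slopes, dtype=float), winsor)  # winsorize at 0.01
    log_s = np.log(s)
    mad = np.median(np.abs(log_s - np.median(log_s)))
    rmsd = np.sqrt(np.mean(log_s ** 2))
    return mad, rmsd
```

A method that is perfectly calibrated in every run (all slopes equal to 1) attains MAD = 0 and RMSD = 0; systematic overfitting (slopes below 1) and unstable tuning (widely scattered slopes) both inflate the RMSD.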
Results
Among the 88 simulated scenarios, the prevalence of separation ranged from zero in scenarios with moderate effects, large sample sizes and E(Y) = 0.25 to at most 85% in a scenario with strong effects, N = 100, K = 5 and E(Y) = 0.1 (Tables S2 and S3 in Additional file 1).
First, we describe the distribution of λ^{∗} values obtained by optimizing the different tuning criteria over 1000 simulation runs and their correlations with the ‘optimal’ λ^{∗} as achieved by OEX and OP, respectively (Fig. 2). For brevity, Fig. 2 focuses on scenarios with E(Y) = 0.1 and K = 5 only. Tuning procedures often led to large variability of the selected λ^{∗}, which was especially apparent in moderate-effects scenarios. Generally, the variability was smaller when the true effects were strong, with larger N and K, i.e. the number of predictors associated with the outcome, and with more balanced outcomes. With moderate effects the methods tended to overshrink, often producing very wide distributions of optimized λ^{∗} values. The smallest variability of optimized values over all scenarios in terms of MAD was obtained by the AIC, followed by CE and RCV50. Considerable variability of λ^{∗} was also obtained by OP; however, the correlations between the ‘optimal’ λ^{∗} (of both OP and OEX) and the λ^{∗} obtained by optimizing the different tuning criteria were mostly negative. OEX resulted in less variability of λ^{∗} and less shrinkage than OP. The λ prespecified by IP was in median very close to the one obtained by OEX.
Accuracy of coefficients
Figure 3 shows the RMSE of β_{1} across the simulated scenarios and models with and without noise by means of nested loop plots [35, 36]. More detailed results, also including the scenarios with N = 1000, are contained in Tables S2 and S3 of Additional file 1. As expected, the best performance across all simulated scenarios was achieved by OEX. Generally, the performance of the other tuned ridge regression approaches was extremely variable and unreliable, and due to data sparsity these methods yielded coefficients with extremely large RMSE. Interestingly, the RMSE of β_{1} did not always decrease with increasing sample size, and noise did not necessarily worsen the performance of those methods. In contrast, the methods where tuning was not required showed stable performance across all simulated scenarios. While the performance of Firth’s correction was satisfactory in almost all scenarios, suffering from an RMSE larger than one only in scenarios with expected event rate E(Y) = 0.1 and sample size N = 100, it was clearly outperformed by IP, which produced a small RMSE of β_{1} across all scenarios. Although WP generally resulted in worse performance than Firth’s correction, it was less sensitive to very sparse data situations, in which the performance of Firth’s correction was poor.
Accuracy of predictions
Results regarding the RMSE of predicted probabilities for E(Y) = 0.1 are shown in Fig. 4. While the performance of the methods was similar in scenarios with larger sample sizes and no noise, the differences between them became apparent in scenarios with N = 100, and especially when noise was included. Noise considerably worsened the performance of the methods. Least affected by noise were the methods based on cross-validation (apart from CE), which generally yielded the best performance. However, with no noise and strong effects (higher c-indices), IP always outperformed the other methods (apart from OP). The performance of FLIC and WP was consistently somewhat worse than that of IP. The performance of AIC was similar to that of the cross-validation-based methods if there was no noise; however, when N = 100 it appeared sensitive to noise.
Calibration slopes are presented by means of boxplots for scenarios with E(Y) = 0.1 and K = 5 (Fig. 5). For clarity of presentation, datasets where calibration slopes were larger than 5 are not shown. IP, WP and FLIC yielded similar performance with small variability over simulation runs and were generally close to a slope of 1 in strong-effects scenarios but suffered from overfitting in moderate-effects scenarios. The addition of noise increased the overfitting of these methods. Tuning procedures (specifically, D, RCV50 and AIC) yielded calibration slopes that were in median relatively close to 1; however, they suffered from large variability. While this variability decreased with N, there was still a considerable number of outliers produced by these methods even with N = 500 if the outcomes were very unbalanced and the effects only moderate. Among the tuning procedures, AIC achieved on average the smallest variability of calibration slopes. With respect to the RMSD of the logarithm of calibration slopes (Tables S10 and S11 in Additional file 1), OP overall achieved the best performance, followed by IP if no noise was included, or AIC in case of noise. Interestingly, noise did not necessarily increase the RMSD of the tuning procedures, and they appeared to be less sensitive to it than the methods where shrinkage was prespecified. Calibration slopes were strongly positively correlated with the optimized λ^{∗} values (Fig. S1 in Additional file 1), as was the RMSD of calibration slopes with the variability of λ^{∗}.
In terms of c-indices there were no considerable differences between the methods (Tables S12 and S13 in Additional file 1).
Data example
As an example we consider the study described by Poplas Susič et al. [37]. The aim of the study was to estimate the prevalence of dependence in daily activities (binary) and its risk factors in a group of individuals whose health status is not well known to family practice teams (patients who had not visited their chosen family physician in the last 5 years). Nine risk factors were considered: age (continuous), sex (binary), body mass index category (BMI; ordinal with 4 levels), APGAR (measuring family function via five constructs: Adaptation, Partnership, Growth, Affection, and Resolve; binary), chronic disease (CD; ordinal with four levels), fall (measuring increased risk of a fall; binary), loneliness (measured on a discrete scale from 1 to 10; continuous), health (measured on a discrete scale from 1 to 10; continuous), and pain (measured on a discrete scale from 1 to 10; continuous). A complete case analysis on a sample of N = 1814 individuals, of whom 423 (23%) had an event, was performed to quantify the effects of the risk factors.
For our demonstration we randomly selected N = 275 individuals from the complete sample. According to recent guidance, the minimum required sample size for developing a prediction model was N = 273, based on an expected value of the (Cox-Snell) R-squared of 0.5, 9 predictors, an expected event rate E(Y) = 0.23, and a desired level of shrinkage of 0.9 [9, 33]. In the subsample, 60 (22%) events were observed. We used this subsample for fitting the ridge regression and FLIC models, while the remaining data served as a validation dataset for calculating the calibration slopes (Table 4). Ridge regression and FLIC models were fitted as described in the Methods section. Most shrinkage was induced by the tuned ridge regression methods, with the exception of AIC and CE, while the methods with fixed penalization strength yielded calibration slopes (slightly) smaller than 1.
Discussion
Numerous studies have shown that shrinkage is effective in preventing overfitting and may solve issues that arise in classical clinical settings with a relatively large number of correlated covariates [2, 7, 8, 38]. Therefore, applying shrinkage has been recommended not only when developing prediction models but also when interest lies in coefficients with reduced MSE and inference is not required [7]. A recent study, however, noted that while calibration slopes obtained by shrinkage methods are on average close to 1, the variability of calibration slopes in small or sparse situations is large and, therefore, improved performance in a single dataset cannot be guaranteed [11]. If one considers that the number of observations (with event) constitutes the amount of information contained in the data, this may not seem surprising. However, many researchers would still utilize shrinkage methods in small or sparse datasets, expecting all problems to be solved. Therefore, in this paper we have elaborated this issue further by focusing on ridge logistic regression. We evaluated its performance in a low-dimensional setting and compared it to Firth's correction by means of a simulation study. The amount of shrinkage in ridge regression was determined using different tuning procedures and prior assumptions, respectively. We were interested in the accuracy of both coefficient estimates and predictions.
With respect to the large variability of calibration slopes, the results of our study confirm the findings of Van Calster et al. [11]. Furthermore, as already indicated by Riley et al. [13], we observed that the RMSD of the logarithm of calibration slopes was strongly correlated with the variability of the optimized complexity parameters λ*. By an illustrative example we demonstrated that tuning procedures may fail to approximate the U-shaped curve arising from the bias-variance trade-off and result in a completely arbitrary choice of λ* that simply equals the smallest or the largest λ of the pre-specified sequence of values. However, in the simulation study we then observed that substantial variability of λ* must be expected even from an oracle that 'knows' the true event probabilities (prediction oracle) or regression coefficients (explanation oracle) and uses this knowledge to determine the optimal values of λ*. This indicates that the variability of λ* by itself would not be that much of a problem, but the tuning procedures yielded optimized values that were negatively correlated with their 'optimal counterparts' determined by the explanation or prediction oracle. On the one hand this can be explained by separation, which often makes tuning procedures result in an optimized value of λ* close to zero [12]. In such datasets, if the amount of shrinkage is too small, this will yield coefficients with large MSE and predictions that may be numerically indistinguishable from zero or one by software packages [18]. While some may argue that this is not problematic when interest lies in predictions, we observed that the prediction oracle generally favored larger values of λ* than the explanation oracle, suggesting that in typical clinical studies seemingly perfect predictions should not be accepted, as they potentially imply overfitting and reflect increased variability of calibration slopes.
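The instability of tuned penalties is easy to reproduce. The sketch below (our own illustration, using scikit-learn's LogisticRegressionCV rather than the R packages used in this study) repeatedly draws small samples from one fixed data-generating mechanism and records the cross-validation-tuned inverse penalty C = 1/λ; the tuned values typically scatter over several orders of magnitude even though the truth never changes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(12345)

def tuned_inverse_penalty(n=50, p=5, beta=1.5):
    """Draw a small sample with one true predictor and p-1 noise
    covariates, then return the CV-tuned C = 1/lambda."""
    X = rng.standard_normal((n, p))
    prob = 1.0 / (1.0 + np.exp(-beta * X[:, 0]))
    y = rng.binomial(1, prob)
    model = LogisticRegressionCV(
        Cs=np.logspace(-3, 3, 13), cv=5, penalty="l2",
        scoring="neg_log_loss", max_iter=2000,
    ).fit(X, y)
    return model.C_[0]

tuned = [tuned_inverse_penalty() for _ in range(20)]
print(min(tuned), max(tuned))  # the spread is usually large
```

Note that the tuned value frequently hits the boundary of the candidate grid, which is exactly the arbitrary-endpoint behavior described above.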
On the other hand, as demonstrated in our illustration, if only a few observations prevent the dataset from separation, a large value of λ* is needed to avoid very large out-of-sample prediction errors for the crucial, separation-preventing observations, while in fact the performance of maximum likelihood estimation in such datasets is already satisfactory [18]. In the words of van Houwelingen [39], if β is 'large' by random fluctuation, tuning procedures tend to keep the model large instead of correcting for the 'large' β by setting λ* > 0; and vice versa. In our simulation study we observed that for tuned ridge logistic regression calibration slopes were more stable in scenarios with larger sample sizes, more balanced outcomes, stronger effects, and often even with a larger number of true predictors, while noise possibly increased the variability of calibration slopes. This suggests that simply breaking the problem down to a measure such as the events-per-variable ratio is unsatisfactory, as not only the number of covariates but also their relations to the outcome are decisive here [8, 9].
Our results show that optimization of the complexity parameter in ridge regression is difficult in datasets where sampling variability is large and sampling artefacts, e.g. separation, are likely to occur. Van Calster et al. [11] instead suggested applying FLIC [7], which provides only little shrinkage but results in lower variability of calibration slopes. Another convenient property of Firth's correction, not shared by ridge regression, is its invariance to linear transformations of the design matrix. However, ridge regression may generally be preferred over Firth's correction in the case of highly correlated covariates, where I(β) is close to singularity, causing Firth's correction to deteriorate. Alternatively, the choice of the complexity parameter in ridge regression may be based on prior expectations about the magnitude of the underlying effects [15]. Prespecifying the degree of shrinkage seems reasonable as it stabilizes λ*, and it appeared beneficial in our study, in which we included such a semi-Bayesian approach with zero-centered informative or weakly informative normal priors (IP and WP, respectively). Despite the different motivation behind the methods with fixed penalization strength, IP clearly outperformed tuned ridge regression (and Firth's correction) with regard to the RMSE of coefficients. With tuned ridge regression valid inference is hard to achieve due to the bias introduced in the coefficients and the additional variability that comes along with tuning λ (which possibly leads to less bias and more variance) [14, 40]. By contrast, for IP and WP (as with Firth's correction) valid 95% posterior limits could be obtained easily by data augmentation, using any statistical software that enables maximum-likelihood fitting and weighting of observations [15]. Moreover, in scenarios with no noise included in the model, IP yielded small RMSE of predictions and small RMSD of calibration slopes.
Although one should usually devote additional work to specifying prior distributions, we straightforwardly followed the outline of Sullivan and Greenland [15] in defining our priors, assuming that the true effects are not too extreme. IP therefore performed extremely well in all scenarios with strong effects, associated with higher true c-indices. While it seems that this approach is to some extent robust to misspecification of the prior, the results showed that with moderate effects only (and lower true c-indices), or with noise present in the data, it could be reasonable to choose smaller prior variances (preferably for each coefficient separately) to better handle overfitting. However, if one is in doubt about how to determine the limits of the prior interval, weaker penalties are preferred. More guidance on how to specify prior distributions can be found in the paper by Greenland et al. [1].
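For concreteness, a prior variance in this scheme is usually derived from a 95% prior interval for the odds ratio. A small sketch (our own; the interval (1/4, 4) is only one illustration of 'effects that are not too extreme'):

```python
import math

def prior_variance(or_lower, or_upper):
    """Variance of a normal prior on a log odds ratio whose 95%
    prior interval for the odds ratio is (or_lower, or_upper)."""
    sd = (math.log(or_upper) - math.log(or_lower)) / (2 * 1.96)
    return sd ** 2

v = prior_variance(0.25, 4.0)
print(round(v, 2))  # 0.5
```

Widening the prior interval (e.g. to (1/10, 10)) yields a larger variance and therefore a weaker penalty, matching the advice above to prefer weaker penalties when the limits are in doubt.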
Our study showed that particularly the methods with a fixed degree of shrinkage were sensitive to noise, especially in terms of calibration slopes. In high-dimensional settings, where much noise is contained in the data, defining appropriate priors may be much more challenging and these methods thus less appropriate. We expect that in such settings larger complexity parameter values will yield smaller out-of-sample prediction errors, so tuning may be successful in preventing overfitting, but the problem of large calibration slopes due to very large optimized λ* might remain or even increase. Future research should investigate the behavior of tuning approaches in high-dimensional settings, also considering other penalized regression methods, e.g. the lasso, which in addition to shrinkage also performs variable selection. In a classical clinical prediction modelling context, however, the lasso may be too restrictive and result in too sparse models. Furthermore, based on limited additional simulations (results not shown), we suspect that the issues we discussed with respect to tuning will also appear with the lasso or any other tuned penalized regression method [11, 13].
Summarizing, while tuning has the potential to reduce the MSE of the estimates, as demonstrated by the oracles, applying tuned ridge logistic regression in small or sparse datasets is problematic, as tuned λ* values are highly variable and, in addition, negatively correlated with the optimal values, yielding unstable coefficients and predictions. Naturally, only limited performance of the methods can be expected if little information is provided by the data, as is the case with small or sparse datasets. In order to alleviate the problem and allow for a more efficient use of the available sample size, we recommend determining the degree of shrinkage a priori with respect to some meaningful assumptions about the true effects, as demonstrated by Sullivan and Greenland [15]. Our simulations indicate that this approach has the potential to stabilize the estimates and may reduce the bias of the coefficients away from zero, such that relatively accurate coefficients and predictions can be obtained even in non-ideal settings which are typical, e.g., in the context of rare outcomes or sparse predictors. Also with larger sample sizes an analysis might benefit from this approach, especially when (in addition to point estimates or predictions) valid Bayesian inference is required [1, 15, 16].
Availability of data and materials
The code for the illustration, simulation study and data example is contained in Additional file 1. The dependence in daily activities data analyzed in Section 5 are available from Antonija Poplas Susič (antonija.poplassusic@zdlj.si), but restrictions apply to the availability of these data, which were used under license for the current study, and so they are not publicly available. The dependence in daily activities data cannot be shared by the authors as this would violate confidentiality.
Abbreviations
AIC: Akaike's information criterion
CE: classification error
D: deviance
FLIC: Firth's logistic regression with intercept-correction
GCV: generalized cross-validation
IP: ridge regression based on informative prior assumptions
MAD: median absolute deviation
MSE: mean squared error
OEX: explanation oracle
OP: prediction oracle
RCV: repeated 10-fold cross-validated deviance
RMSD: root mean squared distance
RMSE: root mean squared error
WP: ridge regression based on weakly informative prior assumptions
References
1. Greenland S, Mansournia MA, Altman DG. Sparse data bias: a problem hiding in plain sight. BMJ. 2016;352:i1981. https://doi.org/10.1136/bmj.i1981.
2. Pavlou M, Ambler G, Seaman S, De Iorio M, Omar RZ. Review and evaluation of penalised regression methods for risk prediction in low-dimensional data with few events. Stat Med. 2016;35(7):1159–77. https://doi.org/10.1002/sim.6782.
3. Le Cessie S, Van Houwelingen JC. Ridge estimators in logistic regression. J R Stat Soc Ser C Appl Stat. 1992;41(1):191–201. https://doi.org/10.2307/2347628.
4. Hastie T, Tibshirani R, Friedman JH. The elements of statistical learning: data mining, inference, and prediction. Springer; 2009. https://doi.org/10.1007/978-0-387-84858-7.
5. Belkin M, Hsu D, Ma S, Mandal S. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proc Natl Acad Sci. 2019;116(32):15849–54. https://doi.org/10.1073/pnas.1903070116.
6. Harrell FE Jr. Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. Springer; 2001.
7. Puhr R, Heinze G, Nold M, Lusa L, Geroldinger A. Firth's logistic regression with rare events: accurate effect estimates and predictions? Stat Med. 2017;36(14):2302–17. https://doi.org/10.1002/sim.7273.
8. van Smeden M, Moons KG, de Groot JA, Collins GS, Altman DG, Eijkemans MJ, et al. Sample size for binary logistic prediction models: beyond events per variable criteria. Stat Methods Med Res. 2019;28(8):2455–74. https://doi.org/10.1177/0962280218784726.
9. Riley RD, Ensor J, Snell KIE, Harrell FE, Martin GP, Reitsma JB, et al. Calculating the sample size required for developing a clinical prediction model. BMJ. 2020;368:m441.
10. Blagus R, Lusa L. Class prediction for high-dimensional class-imbalanced data. BMC Bioinformatics. 2010;11(1):523. https://doi.org/10.1186/1471-2105-11-523.
11. Van Calster B, van Smeden M, De Cock B, Steyerberg EW. Regression shrinkage methods for clinical prediction models do not guarantee improved performance: simulation study. Stat Methods Med Res. 2020;29(11):3166–78. https://doi.org/10.1177/0962280220921415.
12. Šinkovec H, Geroldinger A, Heinze G, Blagus R. Tuning in ridge logistic regression to solve separation. arXiv:2011.14865; 2020.
13. Riley RD, Snell KIE, Martin GP, Whittle R, Archer L, Sperrin M, Collins GS. Penalisation and shrinkage methods produced unreliable clinical prediction models especially when sample size was small. J Clin Epidemiol. 2021;132:88–96. https://doi.org/10.1016/j.jclinepi.2020.12.005.
14. Blagus R, Goeman JJ. Mean squared error of ridge estimators in logistic regression. Statistica Neerlandica. 2020;74(2):159–91. https://doi.org/10.1111/stan.12201.
15. Sullivan SG, Greenland S. Bayesian regression in SAS software. Int J Epidemiol. 2013;42(1):308–17. https://doi.org/10.1093/ije/dys213.
16. Greenland S. Methods for epidemiologic analyses of multiple exposures: a review and comparative study of maximum-likelihood, preliminary-testing, and empirical-Bayes regression. Stat Med. 1993;12(8):717–36. https://doi.org/10.1002/sim.4780120802.
17. Firth D. Bias reduction of maximum likelihood estimates. Biometrika. 1993;80(1):27–38. https://doi.org/10.1093/biomet/80.1.27.
18. Šinkovec H, Geroldinger A, Heinze G. Bring more data!—a good advice? Removing separation in logistic regression by increasing sample size. Int J Environ Res Public Health. 2019;16(23):4658. https://doi.org/10.3390/ijerph16234658.
19. Heinze G, Schemper M. A solution to the problem of separation in logistic regression. Stat Med. 2002;21(16):2409–19. https://doi.org/10.1002/sim.1047.
20. Agresti A. Categorical data analysis. Wiley; 2012.
21. Golub GH, Heath M, Wahba G. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics. 1979;21(2):215–23. https://doi.org/10.1080/00401706.1979.10489751.
22. van Wieringen WN. Lecture notes on ridge regression. arXiv:1509.09169; 2020.
23. Wood S. Generalized additive models: an introduction with R. Taylor & Francis; 2006. https://doi.org/10.1201/9781420010404.
24. Roberts S, Nowak G. Stabilizing the lasso against cross-validation variability. Comput Stat Data Anal. 2014;70:198–211. https://doi.org/10.1016/j.csda.2013.09.008.
25. Akaike H. A new look at the statistical model identification. IEEE Trans Autom Control. 1974;19(6):716–23. https://doi.org/10.1109/TAC.1974.1100705.
26. R Core Team. R: a language and environment for statistical computing. 2020.
27. Goeman JJ, Meijer R, Chaturvedi N. penalized: L1 (lasso and fused lasso) and L2 (ridge) penalized estimation in GLMs and in the Cox model. 2018. R package version 0.9-51.
28. Mansournia MA, Geroldinger A, Greenland S, Heinze G. Separation in logistic regression: causes, consequences, and control. Am J Epidemiol. 2017;187(4):864–70. https://doi.org/10.1093/aje/kwx299.
29. Heinze G, Ploner M, Jiricka L. logistf: Firth's bias-reduced logistic regression. 2020. R package version 1.24.
30. Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med. 2019;38(11):2074–102. https://doi.org/10.1002/sim.8086.
31. Boulesteix AL, Binder H, Abrahamowicz M, Sauerbrei W. On the necessity and design of studies comparing statistical methods. Biom J. 2018;60(1):216–8. https://doi.org/10.1002/bimj.201700129.
32. Binder H, Sauerbrei W, Royston P. Multivariable model-building with continuous covariates: 1. Performance measures and simulation design. Technical Report FDM-Preprint 105; 2011.
33. Ensor J, Martin EC, Riley RD. pmsampsize: calculates the minimum sample size required for developing a multivariable prediction model. 2020. R package version 1.0.3.
34. Kosmidis I. brglm2: bias reduction in generalized linear models. 2020. R package version 0.6.2.
35. Kammer M. looplot: a package for creating nested loop plots. 2020. R package version 0.5.0.9001.
36. Rücker G, Schwarzer G. Presenting simulation results in a nested loop plot. BMC Med Res Methodol. 2014;14(1):129. https://doi.org/10.1186/1471-2288-14-129.
37. Poplas Susič A, Klemenc-Ketiš Z, Blagus R, Ružić Gorenjec N. Factors that determine dependence in daily activities: a cross-sectional study of family practice non-attenders from Slovenia. PLoS One. 2021;16(1):e0245465. https://doi.org/10.1371/journal.pone.0245465.
38. Steyerberg EW, Eijkemans MJC, Harrell FE Jr, Habbema JDF. Prognostic modelling with logistic regression analysis: a comparison of selection and estimation methods in small data sets. Stat Med. 2000;19(8):1059–79. https://doi.org/10.1002/(SICI)1097-0258(20000430)19:8<1059::AID-SIM412>3.0.CO;2-0.
39. Van Houwelingen JC. Shrinkage and penalized likelihood as methods to improve predictive accuracy. Statistica Neerlandica. 2001;55(1):17–34. https://doi.org/10.1111/1467-9574.00154.
40. Heinze G, Wallisch C, Dunkler D. Variable selection – a review and recommendations for the practicing statistician. Biom J. 2018;60(3):431–49. https://doi.org/10.1002/bimj.201700067.
Acknowledgements
We thank Zalika Klemenc Ketiš and Antonija Poplas Susič from the Health Centre Ljubljana for providing the dependence in daily activities data. We also thank Michael Kammer for programming the nested loop plot function, which is publicly available [30], and the reviewers who helped us improve the manuscript.
Funding
This research was funded by the Austrian Science Fund (FWF), grant number I 2276-N33.
Author information
Affiliations
Contributions
HŠ performed the analysis, wrote the R code and drafted the manuscript. GH, RB and AG provided revisions and improvements to the simulation study and the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no conflict of interest.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Šinkovec, H., Heinze, G., Blagus, R. et al. To tune or not to tune, a case study of ridge logistic regression in small or sparse datasets. BMC Med Res Methodol 21, 199 (2021). https://doi.org/10.1186/s12874-021-01374-y
Received:
Accepted:
Published:
Keywords
 Calibration slope
 Firth’s correction
 Mean squared error
 Penalized logistic regression
 Ridge regression
 Shrinkage
 Tuning