 Research article
 Open access
 Published:
A Monte Carlo simulation study comparing linear regression, beta regression, variabledispersion beta regression and fractional logit regression at recovering average difference measures in a two sample design
BMC Medical Research Methodology volume 14, Article number: 14 (2014)
Abstract
Background
In biomedical research, response variables are often encountered which have bounded support on the open unit interval  (0,1). Traditionally, researchers have attempted to estimate covariate effects on these types of response data using linear regression. Alternative modelling strategies may include: beta regression, variabledispersion beta regression, and fractional logit regression models. This study employs a Monte Carlo simulation design to compare the statistical properties of the linear regression model to that of the more novel beta regression, variabledispersion beta regression, and fractional logit regression models.
Methods
In the Monte Carlo experiment we assume a simple two sample design. We assume observations are realizations of independent draws from their respective probability models. The randomly simulated draws from the various probability models are chosen to emulate average proportion/percentage/rate differences of prespecified magnitudes. Following simulation of the experimental data we estimate average proportion/percentage/rate differences. We compare the estimators in terms of bias, variance, type1 error and power. Estimates of Monte Carlo error associated with these quantities are provided.
Results
If response data are beta distributed with constant dispersion parameters across the two samples, then all models are unbiased and have reasonable type1 error rates and power profiles. If the response data in the two samples have different dispersion parameters, then the simple beta regression model is biased. When the sample size is small (N_{0} = N_{1} = 25) linear regression has superior type1 error rates compared to the other models. Small sample type1 error rates can be improved in beta regression models using bias correction/reduction methods. In the power experiments, variabledispersion beta regression and fractional logit regression models have slightly elevated power compared to linear regression models. Similar results were observed if the response data are generated from a discrete multinomial distribution with support on (0,1).
Conclusions
The linear regression model, the variabledispersion beta regression model and the fractional logit regression model all perform well across the simulation experiments under consideration. When employing beta regression to estimate covariate effects on (0,1) response data, researchers should ensure their dispersion submodel is properly specified, else inferential errors could arise.
Background
In biomedical research it is common to encounter response variables which have support on the interval (0,1). These types of response variables may arise in the form of proportions/percentages, or certain types of fractions and rates. The traditional approach to analyzing these types of response data – across virtually all scientific disciplines  is via linear regression. If desired, the response variable can be transformed prior to estimation of the linear regression parameters. This transformed linear model may improve diagnostic performance; however, this may render interpretation of estimated regression parameters challenging. Alternatively, the beta distribution allows specification of a probability model for continuous random variables with support over the interval (0,1). For many years statisticians have exploited the flexibility of the beta distribution in theoretical modelling exercises; however, its use in applied research settings has not garnered equal attention. Johnson et al. [1] cite numerous instances where the beta distribution has been used in theory/practice and champion increased use of the beta distribution in applied research settings. Gupta et al. [2] also cite numerous applications where the beta distribution provides a useful probability generating model for continuous data with support on the interval (0,1). However, neither of these extensive resources on the beta distribution cites a regression modelling framework for estimating covariate effects on beta distributed response variables. Recent developments by Paulino [3], Ferrari and CribrariNeto [4], Smithson and Verkuilen [5] and others have resulted in a more general purpose beta regression machinery. The variabledispersion beta regression model [5] will be used extensively in our simulation experiments, as it is particularly useful for modelling covariate effects on response variables which are assumed to follow a beta distribution. The beta regression model extends on ideas of generalized linear models [6] both in terms of their specification and estimation. Use of the beta regression model has been increasing in recent years. In slides from an unpublished presentation given by Ferrari [7], the author suggests over 100 instances where beta regression has been used in theoretical and applied research settings. Some application areas include: medicine, veterinary science, pharmacology, odontology, hydrobiology, nutritional science, forest science, waste management, education, political science, economics and finance. Clearly, embedding the beta distribution within a more general regression modelling framework has enhanced its uptake in applied research settings. A final model which we consider for estimating the average proportion/percentage/rate difference in our twosample model is the fractional logit regression model. The fractional logit model is a popular model for fractional response variables in econometrics and was proposed (independently) by Papke and Wooldridge [8] and by Cox [9]. The fractional logit model is similar to generalized linear regression models [6]; however, it does not make any fully parametric assumptions regarding the distributional form for the response variable. Rather the fractional logit model only specifies a parametric form for the conditional mean and conditional variance of the response. The form of the conditional mean and variance functions are chosen to ensure admissible predictions/fittedvalues from such models. In this case, the model specification is chosen to ensure predictions/fitted values from the fractional logit model fall in the interval (0,1). The estimator proposed by Papke and Wooldridge [8] is the one we pursue in this manuscript as they specify forms for robust variance estimators which have more desirable coverage/power properties than the more traditional quasilikelihood models proposed by Cox [9].
Given the recent popularity of the beta regression model, especially in biomedical research, we thought it prudent to compare linear regression, beta regression, variabledispersion beta regression and fractional logit regression models for estimating covariate effects on a response variable which lives on the interval (0,1). To accomplish this goal we conducted a Monte Carlo simulation experiment where we generated response variables following different (parametric) probability generating models. First, we considered simulating response data from the continuous beta distribution with support on (0,1). This experiment allows us to compare models when we know the beta regression model is properly specified given the response data. Specifically, we can investigate efficiency gains which may be observed from specifying an appropriate statistical model to observed response data. Additionally, we simulate response data from the discrete multinomial distribution with probability mass observed only on a finite number of points in (0,1). This experiment allows us to investigate model performance when response data is noncontinuous. In this case, all models are incorrectly specified given the response data. This experiment allows us to investigate whether estimated regression models are robust to noncontinuous response data. In all scenarios, we fit linear regression, beta regression, variabledispersion beta regression, and fractional logit regression models to these randomly generated response data and compared the finite sample statistical properties of the respective estimators. We are particularly interested in the ability of each estimator to recover the average proportion/percentage/rate differences from a simple two sample design. In terms of statistical properties we will compare the respective estimators in terms of: bias, variance, type1 error and power. Understanding the performance of these models on simulated datasets (where population parameters are known) is important for applied researchers who must discern whether to estimate covariate effects on (0,1) response data using the traditional linear regression model or more novel regression models, such as beta regression, variabledispersion beta regression and fractional logit regression models.
Methods
Statistical methods
The linear regression model
The linear regression model is a workhorse of applied statisticians. It is used to model the effect of continuous/categorical covariates on a scalar response (assumed to be generated according to a Gaussian probability model). Thorough introductions to the linear regression model are given in Weisberg [10], McCullagh and Nelder [6], and White [11].
In this study we consider a simple two sample problem, recast under a regression framework, such that our response variable is modelled as a function of a single intercept parameter and a single slope parameter. The linear model and its conditional mean function look as follows:
The notation above suggests that we observe a vector of response variables, Y_{1}…Y_{n}. Further, we have information on a single binary covariate, X_{i} ∈ {0,1}, again for i = 1…n. The regression coefficients β_{0} and β_{1} are estimated from the data. Estimation and inferential procedures are justified given that the following assumptions are satisfied [11]:

1.
The model is properly specified

2.
X is a nonstochastic and finite dimensional (n by p) matrix with n ≥ p

3.
(X^{T}X) is nonsingular and hence invertible

4.
E(ϵ_{i}) = 0 ∀ (i = 1…n)

5.
ϵ_{i} ~ Normal(0, σ^{2}) ∀ (i = 1…n)
In our experiment, we are interested in the ability of the linear regression estimator to recover the average proportion/percentage/rate difference given our simple two sample design. Taking linear combinations of the estimated model parameters we arrive at the following estimator:
Therefore a test of ∆ = 0 is equivalent to a test of β_{1} = 0. In this simulation we carry out such a test using a Wald statistic, W, which follows an asymptotic standard normal distribution. We reject the null hypothesis in instances where W > 1.96 (corresponding to an α = 5% significance threshold).
The beta regression model (and some extensions)
The beta regression model was proposed by Paulino [3], Ferrari and CribrariNeto [4], and Smithson and Verkuilen [5] for modelling covariate effects on a continuous response variable which assumes support on the interval (0,1).
The beta distribution is thoroughly described in Johnson et al. [1] and Gupta [2]. The beta density is a very flexible density, assuming support on the interval (0,1). The most common parameterization of the beta density is in terms of its two shape parameters {p,q}:
In the parameterization given above we assume p > 0, q > 0, y ∈ (0,1) and use (·) to denote the gamma function (a generalization of the factorial function to noninteger arguments). The density assumes probability mass on the interval (0,1) and is zero elsewhere. Further, under this parameterization we define the mean and variance of the random variable, Y, as follows:
Above E(·) and VAR(·) denote the expectation and variance operators, with respect to the given beta distribution. In regression modelling, it is more common to parameterize the density in terms of a mean (μ) and dispersion parameter (ϕ) instead of two shape parameters, {p,q}. In this parameterization we have the following relationships:
This implies: p = μϕ and q = (1 – μ)ϕ.
Given the above relationships we can derive the mean and the variance of the beta density in terms of a mean and dispersion parameter as follows:
Given a fixed value for the mean, the larger the value of ϕ the smaller the variance of the response variable, Y (and viceversa). Under this new parameterization, in terms of a mean and dispersion parameter, the density of Y looks as follows:
In Figure 1, we graphically represent some of the forms the beta density can take on for different values of {p,q}, or alternatively, {μ, ϕ}.
The beta regression model, and the variabledispersion extensions which we will discuss in this study are being increasingly utilized to model covariate effects on response variables observed on the interval (0,1). The beta regression model is an obvious choice for modelling response data which follow a beta distribution. Consider the scenario where we observe response data Y_{1…}Y_{n} on the interval (0,1). The beta regression model assumes that the mean of these random variables, can be represented in the following form:
Our link function g(·) can be any function which is strictly monotone, twice differentiable, and maps the response variable observed on the interval (0,1) to the real line. The most commonly used link function in beta regression is the logit link. Alternative link functions include: the probit, the complementary loglog, the loglog and the Cauchy link. In general, any inverse cumulative distribution function will be an appropriate link function in a beta regression framework as they act to map the interval (0,1) to the real line.
The components of the basic beta regression model can be summarized as:

1.
A response variable from a beta distribution

2.
A linear predictor, η_{i}

3.
A suitable link function, such that: E(Y _{ i }X _{ i }) = g(μ _{i}) = η _{i}
Given the above components, the loglikelihood of the beta regression model can be written as follows:
The loglikelihood function can be maximized numerically as described in Ferrari and CribrariNeto [4]. The mean and dispersion parameter estimates are known to be biased, especially in small samples. Kosmidis and Firth [12] discuss the issue of finite sample bias in beta regression. The authors propose a general purpose algorithm for producing biasreduced and biascorrected parameter estimates via adjustments to the score function. In our simulation experiment we estimate parameters from the beta regression model via standard maximum likelihood (ML) methods, as well as the biasreduced (BR) and biascorrected (BC) methods. In our simulation experiments we employ the simple ML estimators; however, we note that BC/BR methods may improve type1 error rates in small sample situations.
The beta regression model proposed above assumes that the dispersion parameter is constant for all individuals under consideration. In many biomedical applications this may be an unrealistic assumption (especially if one expects a nonzero mean difference across categorical groups). As its name implies, the variabledispersion beta regression model [5] allows the value of the dispersion parameter to vary across individuals. Further, the value of the dispersion parameter can actually be modelled as a function of covariates. The variabledispersion beta regression model is a type of doubleindex regression model [13], as it contains two regression equations, one modelling the mean as a function of covariates and the other modelling the dispersion as a function of covariates.
Again, we consider the scenario where we observe response data Y_{1…}Y_{n} on the interval (0,1). The variabledispersion beta regression model assumes that the mean and dispersion of these random variables can be represented in the following form:
Once again, we assume that both g(·) and h(·) are strictly monotonic, twice differentiable functions which act to map the mean, μ_{i}, and the dispersion, ϕ _{ i }, to the real line. Once again, suitable choices of g(·) include the following link functions: logit, probit, complementary loglog, loglog, Cauchy or any other inverse cumulative distribution function. The link function for h(·) is typically chosen to be the log link. The identity link can also be used; however, it has the undesirable property of possibly suggesting nonpositive values of ϕ _{ i }.
The loglikelihood function for the variabledispersion beta regression model can be numerically maximized and is subject to similar finite sample biases as the basic beta regression model. Below, we illustrate the loglikelihood function for this model:
In the case of both the beta regression model and the variabledispersion (doubleindex) beta regression models, estimates of mean and dispersion parameters {β,γ} are achieved by numerically solving the likelihood equations given above. The resulting parameter estimates are asymptotically normally distributed and take the following form:
Further,
For our purposes it suffices to realize that the estimators of the mean and dispersion parameters are consistent estimators of their target parameters and are distributed according to a multivariate normal distribution, with variancecovariance matrix C^{1}. Detailed derivations of these formulas (particularly pertaining to the forms of the C^{1} matrix) are given in Ferrari and CribrariNeto [4].
Again, we are interested in the ability of the (variabledispersion) beta regression estimator to recover the average proportion/percentage/rate difference given our two sample design. In all of our simulation experiments we assume a logit link for the mean function. The (default) identity link is used in the beta regression modelling context and the log link is used in the variabledispersion beta regression context. In all scenarios, our target of inference is the average proportion/percentage/rate difference and we view the terms in the dispersion submodel as a nuisance. A point estimator of the proportion/percentage/rate difference from the beta regression model is:
We use the delta method to estimate the variance and standard error of this estimator, respectively. We construct a Wald style test of the null hypothesis that ∆ = 0. The Wald statistic, W, is computed as the ratio of the difference in proportion/percentage/rates over the estimated standard error. The test statistic is presumed to follow an asymptotic standard normal distribution. Again, we use a 5% critical threshold for rejecting the null hypothesis (this corresponds to rejection of H_{0} if W > 1.96)
The fractional logit regression model
The final methodology we consider for estimating average proportion/percentage/rate differences in our twosample design is the fractional logit regression model [8, 9]. The fractional logit regression model is most commonly encountered in the econometrics literature and has been demonstrated as being an effective means for estimating covariate effects on a response variable which lives on (0,1). Hence we consider it in this manuscript – as a result, introducing health services researchers to yet another plausible strategy for modelling proportions/percentages/fractions/rates.
The fractional logit regression model is considered a quasiparametric regression model. In other words, the fractional logit regression model does not make any parametric assumption regarding the distribution of the response variable being modelled; rather, it makes assumptions regarding only the first two conditional moments of the response variable – the conditional mean and the conditional variance. The choice of the conditional mean and conditional variance function are typically made to ensure that predictions/fittedvalues from the specified model are admissible. In our case, this implies that the predictions/fittedvalues fall in the interval (0,1).
As mentioned above, quasilikelihood models typically only make assumptions regarding the first two conditional moments of the response variable [6, 9]. The conditional variance is assumed to be a known function of the mean (up to a scale parameter) and the conditional mean function therein is assumed to be a function of unknown model parameters:
In the first Equation V(·) denotes the variance operator, σ^{2} is a scale parameter which is estimated from observed data. ν(·) is a known variance function, and μ_{i} is the mean function. In the second Equation E(·) denotes the variance operator, g(·) is a known link function and β_{j} represent the unknown mean function parameters which must be estimated from the data. Y_{i} and X_{i} represent the response variable and observed covariates, respectively.
In describing the fractional logit model we adopt the terminology of Papke and Wooldridge [8]. Our chief assumption relates to the specification of the conditional mean function, namely:
Generally, h(·) is a known function which maps our real valued linear predictor into the interval (0,1). Again, their exist many plausible function which could accomplish this goal, in this manuscript we choose h(·) to be the logistic function and arrive at the fractional logit model. That is:
Further, the conditional variance of the response variable is assumed to be:
Papke and Wooldridge [8] argue that this conditional variance assumption is too restrictive for modelling response data with support over (0,1). Therefore, in their manuscript they offer two alternative strategies: first, using robust/sandwich estimators of the variancecovariance matrix and second, adjusting the estimated variancecovariance matrix by the Pearson scale adjustment factor. We considered both approaches; however, noted little difference in performance between the two estimators of the variancecovariance matrix. Hence, we report on only the fractional logit model with sandwich/robust variancecovariance matrix.
Parameter estimation under the fractional logit model proceeds by maximizing the following Bernoulli quasilikelihood function:
Monte Carlo simulation design
The goal of this simulation experiment is to compare the properties of the linear regression model, the beta regression model, the variabledispersion beta regression model and the fractional logit regression model at recovering estimates of average proportion/percentage/rate differences from a simple two sample design. In all experiments we simulate data from parametric probability generating models such that the observed response data is on the interval (0,1). Subsequently, we estimate covariate effects on the response variable using one of four regression models: the linear regression model, the beta regression model, the (doubleindex) variabledispersion beta regression model and the fractional logit regression model. Given estimates of average proportion/percentage/rate differences from the respective models, we compare statistical properties of the respective estimators, such as: bias, variance, type1 error and power [14, 15]. We investigate finite sample performance of each of the estimators by varying the sample size within each unique simulation experiment. In all scenarios, the sample size in group 1 is set equal to the sample size in group 2. Group specific sample sizes under consideration in this simulation are: 25, 100, 250, and 750. The total sample size for a given simulation experiment is double the groupspecific sample size (as this experiment assumes a 2sample design). In each instance we consider 20,000 replications of each experiment. We choose 20,000 replicate simulations such that coverage in the type1 error experiments is based off of approximately 1000 rejections of a true null hypothesis. We present mean estimates of bias, variance, type1 error and power averaged across the 20,000 replicate simulations. Further we present Monte Carlo error estimates of bias, variance and power. Detailed derivations of Monte Carlo error are described in White [16]. The “seeds” which govern the pseudorandomness of the various Monte Carlo experiments are given in the attached R/SAS codes.
The first parametric probability model which we consider for generating response data on the interval (0,1) is the beta distribution. Table 1 describes the parameter values used to generate randomly simulated beta response variables. The response variables are generated such that certain mean and dispersion properties are achieved. For example, mean differences of zero are used to assess the type1 error rates of respective estimators (for both fixed and varying dispersion). Further, nonzero mean differences are used to assess power (again for both fixed and varying dispersion). In this experiment response data are generated as independent draws from the respective beta distributions. That is, observations within and between the two samples are independently distributed. Within the type1 error and power experiment frameworks, respectively, we have 3 subexperiments: the first set of experiments consider the scenario where the central tendency of the simulated response distribution is near the center of the support (0.5); the second set of experiments considers the effect of shifting the central tendency to the right such that it is centered near 0.25; and finally, the last experiment considers the effect of shifting the central tendency to the boundary of the support, near 0.05. As subscenarios we vary the shape of the beta distribution when data are simulated from the center, rightcenter and farright of the support, considering scenarios where the simulated data are symmetric and other scenarios where the simulated data is highly skewed. As the data are beta distributed we expect the beta regression models to perform well in all scenarios; however, we anticipate that the linear model will perform well when data are symmetric and unimodal. That is, we expect the linear model to perform well as the shape/rate parameters both become large and as the ratio of the shape/rate parameters approach 1 (resulting in a symmetric and unimodal beta distribution – which converges to that of a normal distribution).
The next parametric probability model under consideration is the discrete multinomial model which takes probability mass only on a finite number of points on the interval (0,1). More specifically, we assume our response variable Y_{i} can take on the following values:
That said, we do not assume the probability of assuming these values is necessarily uniform. Rather, we assign a vector of probabilities to these points, corresponding to the relative likelihood that the response variable assumes that particular value. Table 2 describes the particular probability vectors used to generate response variables for each group in our two sample design. Once again, we vary the expected value of the response to assess differences in type1 error rates and power across our linear regression, beta regression, (doubleindex) variabledispersion beta regression and fractional logit regression models. Again, in this experiment response data are generated as independent draws from the respective multinomial distributions. That is, observations within and between the two samples are independently distributed.
Statistical software
This simulation experiment was conducted using R version 3.02 [17] and results were also verified using SAS 9.3 [18].
Simulation of the beta and multinomial response variables were carried out using the rbeta() and rmultinom() functions, respectively. Linear regression modelling was performed using the lm() function. Beta regression was performed using the betareg() function in the betareg library [13]. Fractional logit regression models were estimated using the glm() function and the sandwich() function [19]. Standard errors for the proportion/percentage/rate differences from beta regression and fractional logit regression models were calculated using the deltamethod() function in the msm library [20].
SAS PROC NLMIXED was used to specify the linear regression model, beta regression model, variabledispersion betaregression model and fraction logit regression model likelihood equations, respectively, and model parameters were estimated via likelihood methods.
All R and SAS code used to conduct this simulation can be obtained by contacting the corresponding author.
Results
Detailed results of the Monte Carlo simulation study are given in Tables 3, 4, 5 and 6. Tables 3 and 4 describe the type1 error and power experiments, respectively, given response data simulated according to independent draws from various parameterizations of the beta distribution. Tables 5 and 6 describe the type1 error and power experiments, respectively, given response data simulated according to independent draws from various parameterizations of the multinomial distribution.
Table 3 describes the results of the type1 error experiment (∆ = 0) given response data distributed according to independent draws from a beta distribution. The top half of Table 3 illustrates results when the dispersion parameter is equal across groups; whereas, the bottom half of Table 3 illustrates results when the dispersion parameter varies as a function of group membership. As probability mass moves away from the center of the support (i.e. 0.5) and towards the boundary of the support (0 or 1) we observe that the beta regression model provides biased estimates of the average proportion/percentage/rate difference between the two samples when the dispersion parameters vary as a function of group membership. For example, when μ_{0} = μ_{1} = 0.25 and ϕ _{0} = 5 and ϕ _{1} = 10 we observe biased estimates of effect from the beta regression model (biases range from 2.27E02 through 2.41E02). As the dispersion parameters increase (i.e. ϕ _{0} = 100 and ϕ _{1} = 200) the observed bias in the beta regression model is slightly attenuated (biases range from 1.22E03 through 1.26E03). Similar findings are observed when the mean parameters are adjusted, such that μ_{0} = μ_{1} = 0.05. Table 4 describes the results of the power experiments (∆ = 0.025) given response data distributed according to independent draws from a beta distribution. Near identical results are observed as were discussed for the type1 error experiments in Table 3. That is, when the respective means are near the boundary of the support, and the dispersion parameters vary as a function of group membership the beta regression model can yield biased estimates of effect. When the dispersion parameters are small (ϕ _{0} = 5 and ϕ _{1} = 10) the bias in effect estimates is appreciable (biases range from 2.22E02 through 2.31E02). On an absolute scale these biases are meaningful; however, when expressed on a relative scale these biases are even more pronounced. As the dispersion parameters increase in magnitude the magnitude of the bias in the beta regression models is attenuated (biases range from 9.73E04 through 1.20E03). Given that the simple beta regression model is biased in certain scenarios we eliminate it from consideration in the results/discussion sections which follow.
In small sample scenarios, when N_{0} = N_{1} = 25, the linear regression model had a mean type1 error of approximately 0.050; whereas, the variabledispersion beta regression model had mean type1 error rate of 0.058 and the fractional logit regression model had a mean type1 error rate of 0.060. As the sample size is increased to 100 per group, 250 per group and 750 per group, respectively, the type1 error rates of the linear regression model, the variable dispersion beta regression model and the fractional logit regression model became more similar. Further, improvements in the type1 error rate of the variabledispersion beta regression model for small samples (N_{0} = N_{1} = 25) were observed when we used bias corrected/reduced estimation methods instead of the more traditional ML estimation methods (results not shown; however, can be verified by modifying simulation codes in R).
When considering the power experiments estimates of average bias across the 20,000 replicate experiments were small for the linear regression model, the variabledispersion beta regression model and the fractional logit regression model (of magnitude 1E04 through 1E06 respectively). Further, estimates of average variance across the 20,000 replicate simulations were similar across the linear regression model, the variabledispersion beta regression model and the fractional logit regression model. These findings imply the estimators have similar average mean squared error. That said, the power for estimated variabledispersion beta regression models and the fractional logit regression models, respectively, marginally exceeded that of the linear regression model across all simulation experiments considered.
Table 5 describes the results of the type1 error experiment (∆ = 0) given response data distributed according to independent draws from a multinomial distribution. For the type1 error experiments all estimators are relatively free of bias. The magnitudes of estimated biases are similar for the linear regression model, variabledispersion beta regression model and the fractional logit model. Again, average variance across the 20,000 replicate simulations were similar for all models. Type1 error rates are closest to the desired 5% level for the linear regression model. Again the variabledispersion beta regression model and the fractional logit regression model have elevated type1 error rates when sample sizes are small (N_{0} = N_{1} = 25). Table 6 describes the results of the power experiment (∆ = 0.10) given response data distributed according to independent draws from a multinomial distribution. When data are simulated according to either a symmetric or asymmetric discrete multinomial distribution we observe that the beta regression model is biased. In the symmetric case biases are attenuated (biases range from 2.32E03 through 2.55E03) compared to the asymmetric case (biases range from 1.35E02 through 1.38E02). The magnitude of the bias in the linear regression estimator and the fractional logit regression estimator are similar. However, in the case of discrete data we notice that the variabledispersion beta regression model has slightly elevated mean bias levels. That said, the variabledispersion beta regression model is slightly more powerful than the linear regression model and the fractional logit regression model (however, this is likely an artifact of the difference in magnitudes of bias in these models). Among models with comparable biases, the fractional logit model is more powerful than the linear regression model when data are generated from a discrete multinomial distribution on (0,1).
Discussion
The main findings of this Monte Carlo simulation study are summarized in Tables 3, 4, 5 and 6 in the results section. In general, properties of the respective estimators are similar regardless of whether the underlying data generating mechanism is beta distributed (Tables 3 and 4) or multinomial distributed (Tables 5 and 6). Hence we will discuss findings from the type1 error experiments and the power experiments in general, as results seem to hold irrespective of the probability generating models. We note interesting exceptions where warranted.
Considering the type1 error experiments (Table 3 and Table 5) we observe that the linear regression model, the variabledispersion beta regression model and the fractional logit regression model provide unbiased estimates of our population proportion/percentage/rate difference (∆ = 0) under all simulated scenarios. The magnitudes of bias tend to be similar across estimators, ranging from 1E04 through 1E06. In many circumstances the simple beta regression model also provide unbiased estimates of our null (∆ = 0) effect. However, in circumstances where the dispersion parameter varied between groups, the simple beta regression model demonstrated fairly substantial bias in its attempt to recover the average population proportion/percentage/rate difference. The impact of nonconstant dispersion amongst individuals in this simulation experiment were more pronounced when the dispersion parameters were small (e.g. ϕ _{0} = 5 and ϕ _{1} = 10) compared to when the dispersion parameters were large (e.g. ϕ _{0} = 100 and ϕ _{1} = 200). Further, the effects of nonconstant dispersion between groups appear more pronounced when the group means are near the boundary of the distributions support (0 or 1) compared to when they are near the center of the support (½). This is demonstrated by observed biases in the beta regression model of about 0.02 units in certain circumstances (Table 3). It is interesting to note that in terms of type1 error rates the linear regression model performed well regardless of sample size; whereas, the variabledispersion beta regression model and fractional logit regression model experienced slightly elevated type1 error rates when the group specific sample sizes were small (N_{0} = N_{1} = 25). Another important point is that improvements in the small sample type1 error rates of the beta regression estimators could be achieved by using the bias corrected/reduced estimation methods in place of the more traditional ML estimators. These BC/BR estimators are easily implemented in the R betareg() procedure [12, 13].
Considering the power experiments (Table 4 and Table 6) we again observe that the linear regression model, the variabledispersion beta regression model and the fractional logit regression model provide (relatively) unbiased estimates of our proportion/percentage/rate difference (∆ = 0.025 in the beta distributed simulations and ∆ = 0.10 in the multinomial distributed simulations). The magnitude of the average biases across the 20,000 replicate experiments is similar across these three models when the data are beta distributed (Table 4); however, when the data arise from a multinomial distribution the variabledispersion beta regression model has slightly elevated bias levels compared to the linear regression model and fractional logit regression model. That said, on a relative (or absolute) scale, the observed biases in the variabledispersion beta regression model are not overly large. Again, in cases where the dispersion parameter varies across groups we observe that the simple beta regression model has trouble recovering the desired epidemiological effect measure. Again, this problem is more pronounced when the dispersion parameters are small and the group means are situated near the boundary of the support. The beta regression model also struggles at recovering the desired difference measure in the multinomial experiment where the response variable is skewed (Table 6); however, demonstrates more comparable performance to the linear regression model, the variabledispersion beta regression model and the fractional logit model when the response variable is simulated from a symmetric multinomial distribution. In general, the linear regression model, the variabledispersion beta regression model and the fractional logit regression model perform well in terms of recovering unbiased estimates of the nonzero effect measure. The models have similar power profiles across the continuous beta distributed simulation experiments – with minor power advantages appearing in the variabledispersion beta regression models and the fractional logit regression models. Further when response data are distributed according to a discrete multinomial distribution minor advantages in power appear for the variabledispersion beta regression model (at the cost of small magnitude increases in bias) and the fractional logit model compared to the linear regression model (Table 6).
The results of this Monte Carlo simulation study indicate that the linear regression model, the variabledispersion beta regression model and the fractional logit regression model are capable of producing unbiased estimates of average proportion/percentage/rate differences given response data observed on the interval (0,1) from a two sample design. The simple beta regression model struggles if the dispersion submodel is incorrectly specified. When sample sizes are small, type1 error rates appear closer to the nominal 5% level in the linear regression model. The variabledispersion beta regression model and fractional logit model appear slightly more powerful than the linear regression model when a nonzero difference between groups is present.
A similar study was conducted by Kieschnick and McCullough [21]. In their article they made similar conclusions favouring the (variabledispersion) beta regression model and the fractional logit regression model for estimating covariate effects on response data observed on (0,1). In their article they dismissed the linear regression model, because in the more complex regression scenarios they were considering it could lead to inadmissible predictions (e.g. predicted values outside of (0,1)). In our simulation experiment we are not necessarily interested in the predictions or fitted values, rather we are interested in the ability of our model to recover the average difference in proportions/percentages/rates across a twosample design. This is typically viewed as an absolute measure of “effect” in epidemiological research. If one is interested in this measure of effect, rather than in model based predictions then it appears from this simulation that the linear regression model performs similarly well as the more novel variabledispersion beta regression model and the fractional logit model. That said, if ones interest lies in predicted/fittedvalues then it likely behoves the researcher to choose a model which will result in admissible predictions.
Introduction of the beta distribution into a general regression framework has resulted in enhanced attention/use of the beta distribution by both theoretical and applied researchers. Ferrari [7] provides many examples in which the beta regression model has been applied in theoretical/applied modelling exercises. Two biomedical applications where the beta regression framework has been implemented include modelling scales scores, such as SF6D response data [22] and modelling stroke lesion volumes [23]. Many other biomedical applications of the beta regression model exist as suggested by Ferrari [7].
The purpose of this Monte Carlo simulation experiment was to investigate the properties of the linear regression model, the beta regression model, the variabledispersion beta regression model and the fractional logit regression model at recovering average proportion/percentage/rate differences from a two sample design. The simplicity of the design aids in interpreting properties of the respective models. In addition, the two sample design is one of the most commonly encountered study designs by epidemiologists and biostatisticians. Hence the simulation study answers a very important question for applied epidemiologists/biostatisticians, namely: given a more traditional linear regression framework for modelling covariate effects on response data observed on the interval (0,1) are there any benefits in fitting a more novel beta regression model, variabledispersion beta regression model or fractional logit regression model to these same data and using it for inference? Results of this simulation study suggest that the more novel variabledispersion beta regression model and fractional logit regression model have comparable properties to the traditional linear regression model. While the linear regression model may perform better in terms of type1 error rates in small samples, the variabledispersion beta regression model and fractional logit regression model seem slightly more powerful at detecting a true nonzero difference between groups in a twosample design. Conversely, the simple beta regression model appears to struggle if the dispersion submodel is incorrectly specified. Given this finding applied researchers should be cautious in fitting off the shelf beta regression models to their (0,1) response data. If a choice is made to fit a beta regression model to observed data practitioners must strive to ensure correct specification of both the mean and dispersion submodels in order to generate proper inferences.
Conclusion
The purpose of this Monte Carlo simulation study was to compare the properties of the linear regression model to the more novel beta regression, variabledispersion beta regression and fractional logit regression models at recovering estimates of average proportion/percentage/rate differences in a twosample design. We observe that the simple beta regression model is biased if the dispersion submodel is incorrectly specified. The variabledispersion beta regression model is unbiased (given proper specification of the dispersion submodel). The fractional logit regression model is also an unbiased estimator of effect. Moreover, the power and type1 error profiles are very similar for the linear model, variabledispersion beta regression model and the fractional logit regression model. These results seem to suggest promise for the beta regression model going forward; however, for the time being applied researchers should be cautious in applying off the shelf beta regression algorithms to their response data observed on the interval (0,1) and should strive to ensure correct specification of both the mean and dispersion submodels such that proper inferences are generated from the observed data.
References
Johnson N, Kotz S, Balakrishnan N: Continuous Univariate Distributions. 1995, Hoboken, New Jersey: Wiley, 2
Gupta A, Nadarajah S: Handbook of Beta Distribution and its Applications. 2004, Boca Raton, Florida: CRC Press, 1
Paolino P: Maximum likelihood estimation of models with beta distributed dependent variables. Polit Anal. 2001, 9 (4): 325346. 10.1093/oxfordjournals.pan.a004873.
Ferrari S, CribrariNeto F: Beta regression for modelling rates and proportions. J Appl Stat. 2004, 10: 118.
Smithson M, Verkuilen J: A better lemon squeezer? Maximumlikelihood regression with beta distributed dependent variables. Psychol Methods. 2006, 11 (1): 5471.
McCullagh P, Nelder J: Generalized linear models. 1989, Boca Raton: CRC Press, 2
Ferrari S: Beta Regression Modelling: Recent Advances and in Theory and Applications. 2013, Unpublished presentation: http://www.ime.usp.br/~sferrari/13EMRslidesSilvia.pdf
Papke L, Wooldridge J: Econometric methods for fractional response variables with an application to 401(K) plan participation rates. J Appl Econ. 1996, 11: 619632. 10.1002/(SICI)10991255(199611)11:6<619::AIDJAE418>3.0.CO;21.
Cox C: Nonlinear quasilikelihood models: applications to continuous proportions. Comput Stat Data Anal. 1996, 21 (4): 449461. 10.1016/01679473(95)000240.
Weisberg S: Applied Linear Regression. 2005, Hoboken, New Jersey: Wiley, 3
White H: Asymptotic Theory for Econometricians. 2000, San Diego, California: Academic Press
Kosmidis I, Firth D: A generic algorithm for reducing bias in parametric estimation. Electron J Stat. 2010, 4: 10971112. 10.1214/10EJS579.
Grun B, Kosmidis I, Zeileis A: Extended beta regression in R: shaken, stirred, mixed and partitioned. J Stat Softw. 2012, 48 (11): 125.
Knight K: Mathematical Statistics. 2000, Boca Raton, Florida: CRC Press
Wasserman L: All of Statistics: A Concise Course in Statistical Inference. 2004, New York, New York: Springer
White I: SIMSUM: analyses of simulation studies including Monte Carlo Error. Stata J. 10 (3): 369385.
R Development Core Team: R: A language and environment for statistical computing. 2013, Vienna, Austria: R Foundation for Statistical Computing, http://www.Rproject.org/, 3900051070,
SAS Institute: 2013, North Carolina, USA, http://www.sas.com/en_us/legal/editorialguidelines.html,
Zeileis A: Econometric computing with HC and HAC covariance matrix estimators. J Stat Softw. 2004, 11 (10): 117.
Jackson C: Multistate models for panel data: the msm package for R. J Stat Softw. 2011, 38 (8): 129.
Kieschnick R, McCullough B: Regression analysis of variates observed on (0,1): percentages, proportions and fractions. Stat Model. 2003, 3 (3): 193213. 10.1191/1471082X03st053oa.
Hunger M, Beaumert J, Holle R: Analysis of SF6D index data: is beta regression appropriate?. Value Health. 2011, 14: 759767. 10.1016/j.jval.2010.12.009.
Swearingen C, Tilley B, Adams R, Rumboldt Z, Nicholas J, Bandyopadhyay D, Woolson R: Application of beta regression to analyze ischemic stroke volume in NINDS rtPA clinical trials. Methods in Neuroepidemiology. 2011, 37: 7382. 10.1159/000330375.
Prepublication history
The prepublication history for this paper can be accessed here:http://www.biomedcentral.com/14712288/14/14/prepub
Acknowledgements
The authors would like to acknowledge the contributions of the assigned BMC Medical Research Methodology editor – Monica Taljaard – and two reviewers Jay Verkuilen and Maarten Buis. The reviewers’ comments helped to improve the manuscript greatly.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interest
Neither of the two authors of this manuscript have any competing interest to disclose.
Authors’ contributions
CM played a role in the conceptualization of the study, programming the simulation, interpreting results and writing the first draft of the manuscript. RM played a role in the conceptualization of the study, programming the simulation, interpreting the results and critically revising drafts of the manuscript. Both authors read and approved the final manuscript.
An erratum to this article is available at http://dx.doi.org/10.1186/s1287401602566.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Meaney, C., Moineddin, R. A Monte Carlo simulation study comparing linear regression, beta regression, variabledispersion beta regression and fractional logit regression at recovering average difference measures in a two sample design. BMC Med Res Methodol 14, 14 (2014). https://doi.org/10.1186/147122881414
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/147122881414