Many health and nursing related studies focus on outcome measures that can be used to identify superior treatments and/or to reveal deficiencies in practices [1]. While substantial effort has been made on research design and data collection, researchers are more concerned with the validity of statistical conclusions should the reliability of the measurement be compromised [2] or the basic statistical assumptions be violated, because non-normal data distributions are common with these outcomes [3]. In the latter case, data transformation is a powerful tool for developing parsimonious models, for detecting structural effects or predictive factors, and for better data representation and interpretation [4–6]. Ever since the pioneering work on the formal estimation of a suitable transformation [3], the nonlinear monotonic power transformation family in the form of

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \text{if } \lambda \neq 0 \\
\log(y), & \text{if } \lambda = 0
\end{cases}
$$

has been the focus of extensive research and has found widespread application in linear model analysis. With advances in statistical research and computational technology, the Box-Cox transformation has recently been applied in linear mixed model settings [7–9], an active field of research as hierarchical experimental designs and longitudinal studies become more desirable. Under the linear model framework, the estimate of *λ* for the power transformation family is, by definition, obtained together with the structural effects so that the error term is normally distributed, *ε* ~ N (0, σ^{2}), in the model y^{(λ)} = **Xθ** + ε, where y^{(λ)}, **X**, and **θ** represent the transformed response, the design matrix of structural effects, and the vector of parameter estimates, respectively. This implies that one should know a priori what the structure is before actually estimating the transformation parameter (λ). In reality, the factors with potential structural effects on the outcome can be numerous and unknown, and they are often of primary research interest, especially in large observational or cross-sectional studies such as the National Database of Nursing Quality Indicators (NDNQI^{®}). This study used simulation to contrast the effects of estimating the Box-Cox power transformation parameter, and of the subsequent analysis, with and without a priori knowledge of the predictor variables under the classic ANOVA model. It then illustrated these effects by extending the Box-Cox transformation to hierarchical analysis with the mixed model on two NDNQI nursing-sensitive indicators.

### Basic assumption for linear model methodology

Statistical analyses with the linear model methodology are based on the assumption that the population being investigated is normally distributed with a common variance and additive mean structure [10, 11]. Let *Y*_{ijk} be the response for the k^{th} unit in the ij^{th} subclass of a two-way classification model, *β* the vector of regression parameters, and *X*_{ij} the design matrix for the ij^{th} subclass. The linear model (1),

$$
Y_{ijk} = X_{ij}\beta + \varepsilon_{ijk}, \tag{1}
$$

then assumes that the error is an independent and identically distributed normal variable, *ε*_{ijk} ~ *N* (0, σ^{2}), after removing the structural effect *X*_{ij}*β*.
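As a small illustration of "removing the structural effect", the sketch below computes residuals for a two-way classification. It assumes, for simplicity, that the structural part *X*_{ij}*β* is estimated by the mean of the ij^{th} cell (a saturated model with interaction); the function name `cell_residuals` and the example data are hypothetical and not taken from the study.

```python
from collections import defaultdict

def cell_residuals(observations):
    """Return (i, j, residual) for each observation in a two-way layout.

    observations: list of (i, j, y) tuples, where (i, j) labels the subclass.
    Assumption: the structural effect for cell (i, j) is its cell mean.
    """
    cells = defaultdict(list)
    for i, j, y in observations:
        cells[(i, j)].append(y)
    # Estimated structural effect: one mean per (i, j) subclass.
    means = {key: sum(vals) / len(vals) for key, vals in cells.items()}
    # What remains after removing the structural effect is the error term.
    return [(i, j, y - means[(i, j)]) for i, j, y in observations]

# Hypothetical data: three cells with unequal numbers of units.
obs = [(1, 1, 2.0), (1, 1, 4.0), (1, 2, 5.0),
       (2, 1, 1.0), (2, 1, 3.0), (2, 1, 5.0)]
for row in cell_residuals(obs):
    print(row)
```

Under model (1), it is these residuals, not the raw responses, that are assumed to be normally distributed with a common variance.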

When the theoretical assumption is not satisfied, data transformation can be applied so that inferences about unknown factors are still valid on the transformed scale [11]. Depending on the type of data and the form of their distribution, a number of different transformations have been proposed so that the transformed data meet the theoretical assumptions. These include logit transformations for proportions, the square root transformation for count data, and a logarithm or inverse transformation for continuous data skewed to either side with a heavy tail. The family of power transformations is useful when the choice of transformation to improve the approximation of normality is not obvious [12]. The power transformation was first introduced by Tukey [13] and later modified by Box & Cox [3] to take account of the discontinuity at *λ* = 0. The Box-Cox power transformation takes the following form (2), so that the transformed values are a monotonic function of the observations,

$$
Y_{ijk}^{(\lambda)} =
\begin{cases}
\dfrac{Y_{ijk}^{\lambda} - 1}{\lambda}, & \text{if } \lambda \neq 0 \\
\log(Y_{ijk}), & \text{if } \lambda = 0
\end{cases}
\tag{2}
$$

and, for the unknown transformation parameter *λ*,

$$
Y_{ijk}^{(\lambda)} = X_{ij}\beta + \varepsilon_{ijk}, \tag{3}
$$

where *Y*_{ijk}, *X*_{ij}, *β* and *ε*_{ijk} are all defined as in equation (1). This transformation may allow the response variable to achieve simplicity and additivity in mean structure for the expected value of *y*^{(λ)} and make the variance more nearly constant among points in the factor space [14].
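The Box-Cox power transformation described above can be written in a few lines. This is a minimal sketch in plain Python; the function name `box_cox` and the tolerance `tol` used to decide when *λ* is treated as zero are illustrative choices, not part of the original study.

```python
import math

def box_cox(y, lam, tol=1e-12):
    """Box-Cox power transform of a strictly positive value y.

    Uses (y**lam - 1) / lam when lam is nonzero and log(y) when lam is
    (numerically) zero, which is the limit of the first form as lam -> 0.
    """
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive observations")
    if abs(lam) < tol:
        return math.log(y)
    return (y ** lam - 1.0) / lam

# Illustrative values only: the transform is monotonic in y for every lam.
values = [0.5, 1.0, 2.0, 4.0]
for lam in (1.0, 0.5, 0.0, -1.0):
    print(lam, [round(box_cox(v, lam), 4) for v in values])
```

Subtracting 1 and dividing by *λ* is what removes the discontinuity at *λ* = 0: for very small *λ*, the first branch approaches log(*y*).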

Substantial research has been conducted on the theoretical aspects of the Box-Cox modification [15], and a wide variety of applications have used the Box-Cox transformation [16–18]. It has been reported that maximum likelihood-based variance components analysis applied to non-normal data had inflated type I errors, which were best controlled by the Box-Cox transformation [19]. The Box-Cox transformation can be used to improve the signal-to-noise ratio, to map families of distributions, and to obtain more efficient and robust results [20]. Analysis of diagnostic accuracy using the receiver operating characteristic curve methodology required a Box-Cox transformation within each cluster to map the test outcomes to a common family of distributions [21]. Recently, median regression after applying the Box-Cox transformation was reported to be notably more efficient and robust than the standard least absolute deviations estimator [22]. Because of its highly structured nature, however, the Box-Cox power transformation model is controversial: some theoretical and Monte Carlo studies indicated that the data-based estimate of *λ* is unstable and that, much like the case of multivariate collinearity, *λ* and *β* are highly correlated [7–9, 16, 17]. Other studies downplayed the cost of data-based Box-Cox transformation, arguing that it should be moderate on the whole and seldom large [23]. It has been suggested that the joint effects of variable selection and data transformation need to be better understood [7, 8, 23]. Under the Box-Cox transformation (2), one can put the data on the correct scale for an ANOVA model when the predictor variables (X) are identified and included during the transformation process. Unfortunately, for many non-randomized studies it is not clear which predictor variables should be included when the dependent variable deviates significantly from the normal distribution.
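To make the dependence of the estimated *λ* on the predictor variables concrete, the following sketch evaluates the standard Box-Cox profile log-likelihood, −(n/2) log(RSS(λ)/n) + (λ − 1) Σ log *y*, over a grid of *λ* values, once with residuals taken around group means (structure included) and once around the grand mean (structure ignored). It is a simplified one-way layout with made-up data, not the simulation design used in this study; the names `profile_loglik` and `estimate_lambda`, the grid, and the data are all hypothetical.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive value y (log when lam is 0)."""
    return math.log(y) if abs(lam) < 1e-12 else (y ** lam - 1.0) / lam

def profile_loglik(groups, lam, use_structure):
    """Profile log-likelihood of lam (up to a constant) for a one-way layout.

    groups: list of lists of positive observations, one list per group.
    use_structure: True  -> residuals around group means (X included);
                   False -> residuals around the grand mean (X ignored).
    """
    z = [[box_cox(y, lam) for y in g] for g in groups]
    n = sum(len(g) for g in z)
    if use_structure:
        centers = [sum(g) / len(g) for g in z]
    else:
        grand_mean = sum(sum(g) for g in z) / n
        centers = [grand_mean] * len(z)
    rss = sum((v - c) ** 2 for g, c in zip(z, centers) for v in g)
    # Jacobian term of the transformation from y to y^(lam).
    log_jacobian = (lam - 1.0) * sum(math.log(y) for g in groups for y in g)
    return -0.5 * n * math.log(rss / n) + log_jacobian

def estimate_lambda(groups, use_structure, grid=None):
    """Grid-search estimate of lam on [-2, 2] in steps of 0.05 by default."""
    if grid is None:
        grid = [i / 20.0 for i in range(-40, 41)]
    return max(grid, key=lambda lam: profile_loglik(groups, lam, use_structure))

# Hypothetical data: three groups whose means differ on the log scale.
offsets = [-0.6, -0.3, 0.0, 0.3, 0.6]
groups = [[math.exp(mu + d) for d in offsets] for mu in (0.0, 1.0, 2.0)]

lam_with = estimate_lambda(groups, use_structure=True)
lam_without = estimate_lambda(groups, use_structure=False)
print("lambda with group structure:   ", lam_with)
print("lambda without group structure:", lam_without)
```

The two estimates need not coincide, because ignoring the group structure folds between-group differences into the error term that the transformation is asked to normalize.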

Under the linear mixed model setting, the error term *ε*_{ijk} in model (3) is no longer independent and identically distributed (iid) normal, but rather correlated, because sampling and experimental units may be hierarchical or each sampling unit may be repeatedly measured.

### NDNQI database overview

In 1998, NDNQI^{®} was established by the American Nurses Association (ANA) to monitor nursing-sensitive indicators that measure nursing quality and patient safety across all 50 states in the US [24]. Over the last decade, the number of hospitals participating in NDNQI has grown from 35 in 1998 to 1,450 by the end of 2009 [25]. With nursing data collected at the unit level within member institutions, NDNQI provides hospitals with unit-level performance reports containing 8-quarter trend data, along with national comparison data grouped by hospital staffed bed size, teaching status, Magnet status, various other hospital characteristics, and unit type [25].

Nursing-sensitive indicators reflect the structure, process and outcomes of nursing care. Examples of nursing structure measures include the supply of nurses, skill level, RN education and certification [24–26]. The Patient Falls indicator is an example of a nursing-sensitive outcome and is defined as the rate per 1,000 patient days at which patients experience an unplanned descent to the floor during the course of their hospital stay.

Patient Injury Falls, as another example, is defined as:

Both Patient Falls and Patient Injury Falls have a common denominator of Total Number of Patient Days. Conceptually, a patient day is 24 hours, beginning with the hour of admission. The operational definition of patient days is the total number of inpatients present at the midnight census plus the total number of hours of short stay patients divided by 24. Short stay patients are patients on a unit for less than 24 hours either for observation or same day surgery.
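The operational definition of patient days and the per-1,000 scaling of the Patient Falls rate can be sketched as follows. The function names and the example counts are hypothetical. Applying the same per-1,000 scaling to injury falls is an assumption made here for illustration, since the text above states only that the two indicators share the patient-days denominator.

```python
def patient_days(midnight_census_total, short_stay_hours):
    """Total patient days: inpatients counted at the midnight census
    plus short-stay patient hours divided by 24."""
    return midnight_census_total + short_stay_hours / 24.0

def rate_per_1000_patient_days(event_count, total_patient_days):
    """Events (e.g. falls) per 1,000 patient days."""
    if total_patient_days <= 0:
        raise ValueError("total patient days must be positive")
    return 1000.0 * event_count / total_patient_days

# Hypothetical unit-month: 2,800 inpatient census days, 480 short-stay hours.
days = patient_days(2800, 480)                    # 2800 + 480 / 24
falls_rate = rate_per_1000_patient_days(9, days)  # 9 falls in the month
# Assumption: injury falls use the same per-1,000 patient-day scaling.
injury_rate = rate_per_1000_patient_days(2, days)
print(days, round(falls_rate, 2), round(injury_rate, 2))
```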

Both Patient Falls and Patient Injury Falls are critical nursing quality indicators that may be associated with nursing workforce characteristics, as well as with unit type and some hospital characteristics such as teaching status and Magnet status. Other unknown factors might also affect the rates of Patient Falls and Patient Injury Falls in NDNQI hospitals across a wide spectrum of settings over the entire United States. Further, if such factors do exist, it would be of great interest to examine what administrative or nursing process adjustments a hospital might take to reduce these rates and thus improve the overall quality of service.