Although Rasch models [1] were originally designed and used for educational assessment, in recent years they have increasingly been used in health research. This renewed interest in these models has largely been encouraged by a number of potential advantages of Rasch models over traditional psychometric methods, including the ability to reduce the number of items in a questionnaire, and hence patient burden, whilst retaining the psychometric properties of the instrument, and the pooling of data drawn from different samples, which allows more accurate parameter estimation. Recent studies in health have explored the use of Rasch models in instrument development [2–4], in the modification of existing questionnaires [5–8], and in instrument and cross-linguistic comparison [9, 10].

Rasch models are a family of measurement models [11] which can be used to describe latent traits, with questionnaire items and person scores located along the same scale of the latent trait. Item locations ("difficulties") and person measures ("abilities") are estimated separately to produce estimates for each parameter which are sample independent and item independent, respectively [12]. Rasch models specify a number of criteria which, if fulfilled, result in interval scales where adjacent scores along the scale are equally spaced, a feature which is particularly important for interpreting clinically meaningful differences [13]. Firstly, the data should describe a unidimensional construct, that is, a single latent trait should explain the variance in the data. Unidimensionality can be assessed using principal components analysis of the residuals [14]. Secondly, item invariance stipulates that item (or person) parameters should be independent of the sample (or items) used. This criterion can be evaluated using differential item functioning to determine whether item bias is present. The final criterion, which forms the focus of this paper, is item fit, in other words whether individual items in a scale fit the Rasch model.
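
For orientation, the simplest member of this family, the dichotomous Rasch model, can be written as below. The notation is an assumption introduced here for illustration and is not taken from the cited sources: theta_n denotes the measure of person n, b_i the location of item i, and X_ni the scored response of person n to item i.

```latex
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}
```

Person and item parameters enter only through their difference, which is why both can be placed on the same scale. Polytomous Rasch models extend this form by adding threshold parameters for the response categories.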

There continues to be considerable debate about which fit statistic is the most appropriate to use, what range of values should be employed when evaluating fit, and how fit statistics should be interpreted [15, 16].

Both chi-square statistics and infit and outfit mean squares (described in more detail below) have been advocated for assessing item fit to the model. The mean squares can be converted to infit and outfit t-statistics through a cube-root (Wilson-Hilferty) transformation.
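
To make these quantities concrete, the following is a minimal sketch of how the four statistics could be computed for a single item. It rests on several assumptions that go beyond the text: it uses the dichotomous model rather than the polytomous models studied here, it treats person measures and the item location as already estimated, it uses the form of the Wilson-Hilferty transformation commonly given in the Rasch literature, and the function names (`rasch_prob`, `item_fit`) are hypothetical rather than taken from any software package.

```python
import math


def rasch_prob(theta, b):
    """Probability of a response of 1 under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_fit(responses, thetas, b):
    """Infit/outfit mean squares and their t-statistics for one item.

    responses: 0/1 responses of each person to the item
    thetas:    person measures (assumed to be already estimated)
    b:         item location (assumed to be already estimated)
    """
    n = len(responses)
    sum_sq_resid = 0.0   # sum of squared raw residuals
    sum_z_sq = 0.0       # sum of squared standardised residuals
    sum_w = 0.0          # sum of model variances
    var_out = 0.0        # accumulates the variance of the outfit mean square
    var_in = 0.0         # accumulates the variance of the infit mean square
    for x, theta in zip(responses, thetas):
        p = rasch_prob(theta, b)
        w = p * (1.0 - p)                        # model variance of the response
        c = w * ((1.0 - p) ** 3 + p ** 3)        # fourth central moment
        sq_resid = (x - p) ** 2
        sum_sq_resid += sq_resid
        sum_z_sq += sq_resid / w
        sum_w += w
        var_out += c / w ** 2 - 1.0
        var_in += c - w ** 2

    outfit_ms = sum_z_sq / n                     # unweighted mean square
    infit_ms = sum_sq_resid / sum_w              # information-weighted mean square
    q_out = math.sqrt(var_out) / n               # model SD of the outfit mean square
    q_in = math.sqrt(var_in) / sum_w             # model SD of the infit mean square

    def to_t(ms, q):
        # Cube-root (Wilson-Hilferty) transformation; undefined when q == 0,
        # which happens only if every probability is exactly 0.5.
        return (ms ** (1.0 / 3.0) - 1.0) * (3.0 / q) + q / 3.0

    return {
        "outfit_ms": outfit_ms,
        "infit_ms": infit_ms,
        "outfit_t": to_t(outfit_ms, q_out),
        "infit_t": to_t(infit_ms, q_in),
    }


if __name__ == "__main__":
    # Four persons of increasing measure answering an item located at 0.
    print(item_fit([0, 1, 1, 1], [-1.0, 0.0, 1.0, 2.0], 0.0))
```

The outfit mean square averages the squared standardised residuals, so it is sensitive to unexpected responses from persons located far from the item, whereas the infit mean square weights each residual by its model variance and is dominated by persons located near the item.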

The mean square fit statistics are perhaps the most commonly used fit statistics in health research. A series of ranges has been suggested [17] for evaluating item fit depending on the type of test; however, the majority of studies employ a range of 0.7 to 1.3. Despite the popularity of this approach, some concerns have been voiced about the use of a single, universal range to evaluate fit and the lack of adjustment of the range to sample size. For instance, Smith et al. [16], using simulated dichotomous datasets, found that Type I error rates (defined here as the probability of falsely rejecting an item as not fitting the Rasch model) were significantly less than α = 0.05 for both infit and outfit mean squares across a range of critical values (0.7, 0.8, 0.9 – 1.1, 1.2, 1.3). Furthermore, Type I error rates for the outfit mean square decreased as sample size increased. In contrast, the Type I error rates for the t-statistics, although not equal to 5%, showed fewer discrepancies.
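
The dependence of the Type I error rate on sample size under a fixed range can be illustrated with a small simulation. This is a sketch under stated assumptions only and does not reproduce the design of the studies cited above: data are generated from a dichotomous Rasch model for a single item located at 0, person measures are drawn from a standard normal distribution and treated as known rather than estimated, and the function names, sample sizes, number of replications and seed are all hypothetical choices.

```python
import math
import random


def outfit_mean_square(responses, thetas, b):
    """Unweighted (outfit) mean square for one dichotomous item."""
    total = 0.0
    for x, theta in zip(responses, thetas):
        p = 1.0 / (1.0 + math.exp(-(theta - b)))
        total += (x - p) ** 2 / (p * (1.0 - p))
    return total / len(responses)


def type1_rate(n_persons, n_reps=500, lower=0.7, upper=1.3, seed=1):
    """Share of replications in which a model-fitting item is flagged.

    Because responses are simulated from the Rasch model itself, every
    flagged replication is a false rejection (a Type I error).
    """
    rng = random.Random(seed)
    rejected = 0
    for _ in range(n_reps):
        thetas = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]
        responses = [
            1 if rng.random() < 1.0 / (1.0 + math.exp(-theta)) else 0
            for theta in thetas
        ]
        ms = outfit_mean_square(responses, thetas, 0.0)
        if ms < lower or ms > upper:
            rejected += 1
    return rejected / n_reps


if __name__ == "__main__":
    for n in (50, 500):
        print(n, type1_rate(n))
```

Because the sampling variability of the mean square shrinks as the number of persons grows while the range 0.7 to 1.3 stays fixed, a fitting item is flagged less often in the larger sample.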

More recently, work using data collected from a large sample of examinees [18] has demonstrated that t-statistics may identify more items that do not fit the model than either the infit or the outfit mean square fit statistics. For instance, the number of misfitting items identified by the t-statistic was more than four times the number identified by the mean square fit statistic (23 and 5, respectively).

In addition to research on the dichotomous model, recent work on the polytomous (Rating Scale) model with simulated data has suggested that the variability of mean squares is dependent on sample size and furthermore that the standard deviations for the t-statistics are generally smaller than their expected value (unity) [19]. These authors propose adjusting the critical range employed for both types of fit statistic depending on sample size.

Finally, Smith & Suh [18] have concluded that using mean square statistics may lead researchers to miss significant numbers of misfitting items, which may have an important impact on the development of unidimensional instruments, and that there is, furthermore, a need to understand the Type I error rates associated with critical values for fit statistics. On this basis, Smith and colleagues [16, 18] have suggested either that the t-statistic, rather than the weighted and unweighted mean squares, should be used to identify misfit, given that this statistic appears to be less sensitive to changes in sample size, or alternatively that mean square fit statistics should be adjusted using a correction based on the square root of the sample size [16].
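
A sample-size correction of this kind can be sketched as follows. The constants used here (2 for the infit and 6 for the outfit mean square, each divided by the square root of the number of persons) are the values commonly attributed to Smith and colleagues in the Rasch literature; they are stated as an assumption and should be checked against [16] before use. The function name is hypothetical.

```python
import math


def adjusted_critical_values(n_persons):
    """Upper critical values for mean squares, scaled by sqrt(sample size).

    Assumed form: infit = 1 + 2 / sqrt(N), outfit = 1 + 6 / sqrt(N).
    """
    root_n = math.sqrt(n_persons)
    return {
        "infit": 1.0 + 2.0 / root_n,
        "outfit": 1.0 + 6.0 / root_n,
    }


if __name__ == "__main__":
    for n in (100, 400):
        print(n, adjusted_critical_values(n))
```

Under this assumed form the critical values move closer to 1 as the sample grows, in contrast to a fixed range such as 0.7 to 1.3, which is applied unchanged at every sample size.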

However, other methodological studies [15, 20] have shown that the t-statistic is itself highly sample dependent.

The evaluation and identification of item misfit is critical to the development of unidimensional instruments, and reliable fit statistics play an important part in this. The literature currently offers little clear guidance to help health researchers determine the most appropriate fit statistic to select when developing or modifying questionnaires. Previous research on simulated datasets has focused on the relationship between sample size and fit statistics at the level of groups of items. For test users, however, the emphasis is more on which fit statistics are able to identify misfit consistently for individual items. Identification and removal of misfitting items will not only reduce patient burden, but may also improve person measure assessment [5].

Therefore, the aim of this study was to investigate the impact of sample size on four commonly used fit statistics, namely the infit and outfit mean squares and their t-statistics, for two polytomous Rasch models using data collected from a cancer patient sample.

The study attempted to determine (1) whether fit statistics (and therefore Type I error rates, i.e. the probability of falsely rejecting an item which does fit the Rasch model) vary with sample size, and (2) whether there were any differences in this variation between the different types of fit statistic.