Comparison of two Bayesian methods to detect mode effects between paper-based and computerized adaptive assessments: a preliminary Monte Carlo study
© Riley and Carle; licensee BioMed Central Ltd. 2012
Received: 20 October 2011
Accepted: 31 July 2012
Published: 17 August 2012
Computerized adaptive testing (CAT) is being applied to health outcome measures developed as paper-and-pencil (P&P) instruments. Differences in how respondents answer items administered by CAT vs. P&P can increase error in CAT-estimated measures if not identified and corrected.
Two methods for detecting item-level mode effects are proposed using Bayesian estimation of posterior distributions of item parameters: (1) a modified robust Z (RZ) test, and (2) 95% credible intervals (CrI) for the CAT-P&P difference in item difficulty. A simulation study was conducted under the following conditions: (1) data-generating model (one- vs. two-parameter IRT model); (2) moderate vs. large DIF sizes; (3) percentage of DIF items (10% vs. 30%), and (4) mean difference in θ estimates across modes of 0 vs. 1 logits. This resulted in a total of 16 conditions with 10 generated datasets per condition.
Both methods evidenced good to excellent false positive control, with RZ providing better control of false positives and with slightly higher power for CrI, irrespective of measurement model. False positives increased when items were very easy to endorse and when there were mode differences in mean trait level. True positives were predicted by CAT item usage, absolute item difficulty and item discrimination. RZ outperformed CrI, due to better control of false positive DIF.
Whereas false positives were well controlled, particularly for RZ, power to detect DIF was suboptimal. Research is needed to examine the robustness of these methods under varying prior assumptions concerning the distribution of item and person parameters and when data fail to conform to prior assumptions. False identification of DIF when items were very easy to endorse is a problem warranting additional investigation.
Computerized adaptive testing (CAT) is widely used in education and has gained acceptance as a mode for administering health outcomes measures [1, 2]. CAT offers several potential advantages over conventional (e.g., paper-and-pencil) administration, including automated scoring and storage of questionnaire data, and reduction of respondent burden. Instruments developed for paper-and-pencil administration frequently form the basis for CAT. In these situations, the transition to computerized adaptive testing requires establishing the equivalence between CAT-administered measures and their original paper-and-pencil version [3, 4]. A meta-analytic review of 65 studies comparing computerized and paper-and-pencil administration of patient-reported outcome measures suggests that scores obtained by computer are comparable to those obtained by conventional modes of administration. This study, however, did not focus on CAT. Unlike computer-based assessment, CAT selects items for administration based on item parameters that, if not accurate for CAT mode of administration, may diminish the reliability or efficiency of CAT [5, 6]. Item-level mode effects, in other words, may have a greater effect on CAT compared to other assessment modalities. The shift in item parameters resulting from changes in administration mode reflects the presence of differential item functioning (DIF), which can be defined as differential performance (e.g., differences in level of endorsement) of an item between two or more groups matched on the total score or measure [7, 8]. This paper will focus on the detection of DIF between CAT and paper-and-pencil administrations of a measure.
Methods used for assessing DIF by mode of administration fall into two general categories: (1) approaches based on classical test theory (CTT), such as comparisons of item p values, representing percentage of endorsement; and (2) methods based on item response theory (IRT) [9–12], including comparisons of item difficulty parameters. Confidence intervals of item endorsement probabilities (i.e., p-values) have been found to vary significantly by mode [13, 14]. Pommerich  also presented the proportion of items statistically favoring each mode. In another study , item p-values and IRT item difficulty parameters were compared and scatterplots of item parameters across mode were constructed. Johnson and Green  compared p-values of items as well as conducted a qualitative examination of error types (e.g., transcription error, place value error, partial answer, computation error, misunderstanding) made by students in each mode. Keng, McClarty, and Davis  examined differences in mode at the item level by comparing p-values and differences in chosen response category and by computing IRT-based DIF tests. Finally, Kim and Huynh  employed a robust-Z statistic to determine whether differences in item parameters across mode were statistically significant.
Though these studies often employed multiple methods of assessing item comparability, systematic comparisons across methods were not conducted. Nevertheless, there is reason to believe that some methods, such as item p-values, may not be appropriate when detecting mode effects involving CAT-administered items. That is, differences in item p-values may not be valid indicators of DIF if the samples completing each mode of assessment differ in mean level on the measure. Moreover, item p-values can be influenced by the selective administration of items that takes place during CAT. For instance, CAT typically selects items that have an approximate probability of endorsement of 50% (i.e., items tailored to the individual to provide maximum information). Therefore, comparing CAT vs. P&P item p-values would likely result in items erroneously flagged as exhibiting DIF.
Several methods have been developed that attempt to overcome the limitations of classical procedures for detecting mode effects. Most of these methods are based on item response theory and involve comparisons of item parameters after matching of respondents according to trait level. Achieving accurate identification of DIF based on an IRT framework requires precise estimates of item parameters and person measures and the use of an appropriate measurement model . However, a limitation of IRT-based methods is that missing data (e.g., resulting from CAT administration) can reduce accuracy in parameter estimates and in DIF detection [19, 20]. In their simulation study, Robitzsch and Rupp  observed that when the missing data rate was 30% and data were missing at random, mean bias (difference between true and observed differences in item difficulty between groups) was 0.60, nearly two standard deviations above average bias across all conditions. CAT can reduce the number of items administered by as much as 90%, depending upon the size and quality of the item bank and criteria for stopping the test [21–23]. Therefore, higher rates of bias would likely occur when examining DIF in CAT-administered items with these methods.
Given the uncertainty in trait and item parameters, some investigators have recommended methods to identify DIF based on Bayesian probability theory. Bayesian approaches use probability distributions to model uncertainty in model parameters. These probability distributions represent prior beliefs or assumptions concerning the nature of the data and the level of uncertainty regarding various parameters. For instance, an investigator may specify that item discrimination parameters adhere to a lognormal distribution with log mean of 0 and variance of 0.5. The prior (particularly the prior variance) reflects uncertainty about the values before observing the data. Conversely, the posterior distribution reflects updated knowledge about parameter values after observing the data. Bayesian approaches make inferences using the posterior distribution. Unlike frequentist statistics, Bayesian methods do not rely on asymptotic (large-sample) theory in order to obtain standard errors, making Bayesian methods particularly attractive when small samples or missing data are involved.
Two general methods of DIF detection employing Bayesian methods have been proposed. The first approach is the use of Bayesian procedures to directly estimate DIF magnitude such as the Mantel-Haenszel (MH) test . Zwick and her associates [25–27] tested an empirical Bayes (EB) formulation of the MH test and demonstrated that EB results more closely approximated targeted DIF values (i.e., values used to simulate DIF in the item response data) compared to standard MH. The latter finding was particularly true for the relatively small (N=1,000 per group) sample size condition. Power ranged from 63.8 to 81.4% depending on sample size and mean difference in proficiency between groups. However, Zwick and Thayer  acknowledged that EB resulted in a higher Type I error rate (ranging from 10% to 20%) compared to conventional MH.
A second approach involves estimation of the posterior distribution of model parameters, which can be used in subsequent DIF analyses [28–31]. Wang, Bradlow, Wainer and Muller  examined DIF for a given item by producing separate item difficulty estimates for each group. Posterior distributions of the item difficulty parameter (b iG1 and b iG2 for groups 1 and 2, respectively) are computed, and from this a Bayesian p value representing the number of times (b iF - b iR) >0 can be used as an indicator of DIF. This procedure provided more accurate results compared to standard MH DIF analysis, especially when items were very easy to endorse . In a similar application  posterior distributions of proficiency measures were used in two nonparametric regression models (one with and one without group membership as a covariate) to compute posterior mean p values for the likelihood ratio based on the two models. Using Bonferroni-adjusted p values and a total sample size of 900 simulees, the investigators were able to obtain power of .90 to 1.00 and false-positive rates well below the set alpha level of .05.
Despite these promising results, none of the studies employing posterior distributions of item parameter estimates assessed DIF in CAT-administered items. Moreover, to our knowledge there has been no application of Bayesian methods to the assessment of DIF between non-CAT and CAT-administered assessments. Standard methods of assessing DIF can be problematic when comparing CAT- and P&P-administered data because of the confounding of CAT item selection, sample differences in trait level, and actual mode effects.
Rationale of the Study
It is common practice to employ paper-based forms when validating and scaling an item bank for use in CAT. Thus, it is important to determine that the resulting item parameters are not influenced by mode DIF. As suggested earlier, current methods of assessing DIF may not be appropriate when comparing adaptively and non-adaptively administered items. One solution would be to administer the entire item bank via computer and conventional modes of administration and then employ standard methods of DIF assessment. Whereas this approach could be used with small item banks, it would be quite burdensome to respondents and likely require collecting data apart from standard assessment practice with very large item banks.
Other researchers have already faced this issue. For example, to reduce respondent burden, the Patient-Reported Outcomes Measurement Information System (PROMIS) only administered the entire set of initially developed PROMIS items to a small set of individuals from the total PROMIS calibration sample. This has limited the PROMIS collective’s ability to address some key issues, similar to what we raise here. Thus, while the less technical approach is possible, we suspect the common problem of needing to reduce respondent burden will generally limit the application of the less technical approach, indicating the need for alternative approaches. One alternative approach, which we present in this paper, is to develop procedures appropriate for detecting mode DIF in CAT vs. non-CAT-administered items, enabling assessment of DIF using data collected as part of standard assessment.
The purpose of the present study was to develop and evaluate two approaches to assessing item-level mode effects employing a Bayesian framework. In the following sections we outline this framework and describe the design and results from a preliminary Monte Carlo simulation study. The procedures are described and evaluated with respect to false-positive (i.e., DIF is detected when not simulated) and true-positive (i.e., DIF is detected when simulated) detection rates under several study conditions. We then examine factors associated with true and false DIF identification.
1. How well does each method detect item-level mode effects as indicated by ROC analysis, true positive and false positive rates? In the present study, true positives are defined as identification of items as exhibiting mode DIF when mode DIF is simulated, which is also referred to as correct DIF detection. Conversely, false positives refer to flagging of items as exhibiting DIF when DIF was not simulated, which is also referred to as incorrect DIF detection.
2. What factors influence correct (true positive) and incorrect (false positive) detection of item-level mode DIF using each procedure?
The methods employed in this study will be presented in three main sections. First, we describe the development and underlying assumptions of two Bayesian methods for detecting item-level mode effects. Second, we describe the simulation study, including its design and data generation procedures. The third section outlines the analysis of the simulated data.
A Bayesian procedure for detecting item-level mode effects
In the proposed model, analysis of mode effects involved a three-step process:
Step 1. Estimate θ using item response data pooled across administration modes (CAT and P&P). That is, θ is obtained using item parameters based on the combined CAT and P&P response data. This is to ensure that item parameters estimated in subsequent steps are on a common metric.
Step 2. Using θ i obtained in Step 1, estimate the posterior distributions of mode-specific item parameters for subsequent comparison in step 3.
where Med is the median and IQR is the interquartile range. The standard robust Z is asymptotically consistent with a standard normal distribution while minimizing the effect of extreme values. It has been used as a screening method for identifying stable items for IRT linking and DIF procedures [18, 32]. Unlike previous applications of the robust Z, in which the median and interquartile range are based on point estimates of parameters for all items in the instrument, here these values are based on the posterior distribution of the parameters for item j in each administration mode.
The second approach involved constructing the 95% credible interval (CrI) of the CAT vs. P&P difference for item j’s difficulty parameter. This interval is computed by obtaining the 2.5 and 97.5 percentiles of item j’s posterior distribution of βj CAT – βj P&P. In order to obtain a single value reflecting the level of mode DIF, we also computed the minimum difference of each bound of the CrI from zero (referred to as Δ CrI). Note that Δ CrI = 0 if the credible interval includes zero. The following priors were used in the model:
θ i ~ Normal(0,1)
α j ~ Lognormal(0,0.5)
β j ~ Normal(0,2)
Y ij ~ Bernoulli(P[Y ij = 1|θ, α, β])
where the first value in parentheses for the priors of θ i, α j, and β j is the prior mean and the second is the prior variance. These priors may be regarded as “semi-informative.” They are similar to priors employed in earlier IRT studies, with the exception that we selected a lognormal rather than a truncated normal prior for the discrimination parameters [33–35].
The Markov chain Monte Carlo estimation consisted of three parallel chains each with a separate and randomly generated set of starting values for model parameters. For each chain, the first 1,000 MCMC iterations were discarded (burn-in phase), followed by 500 iterations per chain retained for subsequent analysis. The total number of iterations and the length of the burn-in phase were chosen on the basis of preliminary examination of trace plots of item and person parameters which revealed good convergence of the three chains of parameter estimates (analysis results are available upon request). Using additional iterations or a longer burn-in did not change DIF analysis results.
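The two DIF indices described above can be computed directly from retained MCMC draws. The following is a minimal sketch rather than the authors' implementation. It assumes that the modified robust Z takes the form Med(d) / (0.74 × IQR(d)), where d is the vector of posterior draws of βj CAT – βj P&P and 0.74 is the scaling constant of the conventional robust Z. It also assumes that Δ CrI carries the sign of the credible-interval bound nearest zero. The function names and the simulated draws are hypothetical.

```python
import random
import statistics


def robust_z(diff_draws):
    """Assumed form: median of the posterior difference over 0.74 * IQR."""
    q1, _, q3 = statistics.quantiles(diff_draws, n=4)
    return statistics.median(diff_draws) / (0.74 * (q3 - q1))


def delta_cri(diff_draws):
    """Distance of the nearest 95% CrI bound from zero; 0 if the CrI covers zero."""
    cuts = statistics.quantiles(diff_draws, n=40)  # cut points every 2.5%
    lower, upper = cuts[0], cuts[-1]               # 2.5th and 97.5th percentiles
    if lower > 0:
        return lower
    if upper < 0:
        return upper
    return 0.0


rng = random.Random(1)
n_draws = 1500  # 3 chains x 500 retained iterations, as described above

# Hypothetical posterior draws for one item with a 0.63-logit mode effect.
beta_cat = [rng.gauss(0.63, 0.15) for _ in range(n_draws)]
beta_pp = [rng.gauss(0.00, 0.15) for _ in range(n_draws)]
dif_draws = [c - p for c, p in zip(beta_cat, beta_pp)]

# Hypothetical posterior draws of the difference for an item without DIF.
null_draws = [rng.gauss(0.0, 0.2) for _ in range(n_draws)]

print(robust_z(dif_draws), delta_cri(dif_draws))
print(robust_z(null_draws), delta_cri(null_draws))
```

Under these assumptions an item is flagged when |RZ| > 1.96 or when Δ CrI ≠ 0, matching the cutoffs used later in the analysis.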
Item responses were modeled with the two-parameter logistic (2PL) IRT model,

$$P(Y_{ij} = 1 \mid \theta_i, \alpha_j, \beta_j) = \frac{\exp[D\alpha_j(\theta_i - \beta_j)]}{1 + \exp[D\alpha_j(\theta_i - \beta_j)]}$$

where Y ij is the response to item j by respondent i, α j is the discrimination parameter and β j is the difficulty parameter for item j, θ i is respondent i’s measure on the latent trait, and D is a scaling constant. In our simulations, D = 1.702, which makes the estimated response probabilities consistent with the normal ogive model and is used by the IRT estimation software employed in the study. In the one-parameter (1PL) case, all α j are equal across items.
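The response function defined by these symbols can be sketched in a few lines. This is an illustration that assumes the standard 2PL logistic form with scaling constant D; the function name `p_endorse` is hypothetical.

```python
import math

D = 1.702  # scaling constant stated in the text


def p_endorse(theta, alpha, beta, d=D):
    """Probability of endorsing an item under the 2PL logistic model.

    theta: person measure; alpha: discrimination; beta: difficulty.
    Setting alpha to the same value for every item gives the 1PL case.
    """
    return 1.0 / (1.0 + math.exp(-d * alpha * (theta - beta)))


# A respondent located exactly at the item difficulty endorses with p = 0.5.
print(p_endorse(theta=0.0, alpha=1.0, beta=0.0))
# One logit above the difficulty; with D = 1.702 this is close to the
# normal ogive value of about 0.841.
print(p_endorse(theta=1.0, alpha=1.0, beta=0.0))
```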
In this study, the following factors were investigated: (1) data-generating model (one-parameter [1PL] vs. two-parameter [2PL] logistic IRT model); (2) DIF magnitude (|β CAT – β P&P|) of 0.42 vs. 0.63 logits, which corresponds to “B” and “C” class DIF, respectively, according to Educational Testing Service criteria; (3) DIF percentage (10% vs. 30% of items in the item bank), and (4) mean difference in θ estimates across modes of 0 vs. 1 logits. We employed a fully crossed research design that resulted in a total of 16 conditions, with 10 replications (datasets) per condition.
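The fully crossed design can be enumerated as follows. This is a small sketch for checking the counts; the variable names are illustrative and not taken from the study's scripts.

```python
from itertools import product

models = ("1PL", "2PL")      # data-generating model
dif_sizes = (0.42, 0.63)     # DIF magnitude in logits
dif_pcts = (0.10, 0.30)      # proportion of items with DIF
mean_shifts = (0.0, 1.0)     # mean difference in theta across modes
REPLICATIONS = 10            # datasets per condition

# Every combination of the four two-level factors.
conditions = list(product(models, dif_sizes, dif_pcts, mean_shifts))

# One entry per generated dataset: (condition, replication index).
datasets = [(cond, rep) for cond in conditions for rep in range(REPLICATIONS)]

print(len(conditions), len(datasets))
```

The 2 × 2 × 2 × 2 crossing gives the 16 conditions, and 10 replications per condition give the 160 datasets used in the CAT simulations.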
Data were generated for the present study in three steps: (1) generation of validation (paper-and-pencil) data, (2) generation of CAT item response data, and (3) CAT simulation, which produced item response datasets containing only those items selected by the CAT. Each of these steps is outlined in the following sections.
Generation of the Validation (Paper-and-Pencil) Item Parameters and Response Data
For each IRT model, a set of item parameters and corresponding item response datasets were generated. Both item banks consisted of 100 items. In the 1PL model, discrimination (α j) parameters for all items were set to 1.0; in the 2PL item bank, α j parameters were randomly generated from a lognormal distribution with log mean = 0 and SD = 0.5, with values restricted to a range of 0.5 to 2.5. Discrimination parameters were limited to this range because items with very low discrimination (i.e., less than 0.5) are rarely used in item banks, whereas highly discriminating items (i.e., true discrimination parameters greater than 2.5) tend to yield poorly estimated (i.e., positively biased) parameters. For both item banks, item difficulty (β j ) parameters were generated from a uniform distribution ranging from −3.0 to 3.0 logits, in increments of 0.25 logits. Person measures (θ i ) for 500 simulees were generated using an N(0, 1) standard normal distribution.
The generated item-response data were then used to estimate IRT item parameters (see Additional file 1). For both datasets, the standard deviation of the theta estimates was set to 1.0 in order to identify the model. In the 1PL case, discrimination parameters were also constrained to be equal across items. Maximum likelihood estimation was employed rather than a Bayesian procedure in order to avoid potential confounds between Bayesian priors used in item calibration and subsequent DIF analysis. Correlations between true and estimated β j parameters were 0.99 and 1.00 and root mean squared error (RMSE) values were 0.11 and 0.15 for 1PL and 2PL-generated datasets, respectively. For the 2PL data, correlation between true and estimated α j parameters was .9 and RMSE was 0.14. As previously observed, RMSEs for the discrimination parameters increased with higher values of α j . The estimated item parameters were used in subsequent CAT simulations. RMSEs and correlations between item parameters and their estimates were consistent with parameter recovery results presented elsewhere [37–39].
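The parameter-generation step can be sketched as below. This is an illustration, not the authors' R code, and it rests on two assumptions: difficulties are sampled with replacement from the 0.25-logit grid, and the 0.5 to 2.5 restriction on discrimination is enforced by redrawing out-of-range values. The function names are hypothetical.

```python
import random


def make_item_bank(n_items=100, two_pl=True, seed=0):
    """Return a list of (alpha, beta) pairs for a simulated item bank."""
    rng = random.Random(seed)
    # Difficulty grid from -3.0 to 3.0 logits in steps of 0.25 (25 values).
    grid = [-3.0 + 0.25 * k for k in range(25)]
    bank = []
    for _ in range(n_items):
        beta = rng.choice(grid)
        alpha = 1.0  # 1PL: common discrimination
        if two_pl:
            # Lognormal with log mean 0 and log SD 0.5, redrawn until in range.
            alpha = rng.lognormvariate(0.0, 0.5)
            while not 0.5 <= alpha <= 2.5:
                alpha = rng.lognormvariate(0.0, 0.5)
        bank.append((alpha, beta))
    return bank


def make_thetas(n=500, mean=0.0, seed=1):
    """Person measures drawn from N(mean, 1)."""
    rng = random.Random(seed)
    return [rng.gauss(mean, 1.0) for _ in range(n)]


bank_2pl = make_item_bank(two_pl=True)
bank_1pl = make_item_bank(two_pl=False)
thetas = make_thetas()
print(len(bank_2pl), len(thetas))
```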
Generation of CAT Item Response Data
Prior to performing CAT simulations, response data for all 100 items in the simulated item banks described above were generated for a total of 3000 simulees in each iteration. This sample size permitted examination of the effect of CAT item usage on DIF detection rates. Employing the study variables described above, a total of 160 item-response datasets were created and used for CAT simulation. For each dataset, person measures were generated from an N(μ CAT, 1.0) distribution, where μ CAT = 0.0 or 1.0. Non-DIF-item response data were generated using the estimated parameters in Additional file 1. Items simulated to exhibit mode effects (DIF) were randomly selected according to the percentage of DIF items (10% or 30%) for the specified simulation condition. The direction of DIF (i.e., easier vs. more difficult to endorse in the CAT sample) was also randomized. Specifically, a value of 1 (harder to endorse) or −1 (easier to endorse) was generated from a uniform discrete distribution. This value was then multiplied by the appropriate DIF magnitude (0.42 or 0.63 logits), with the resulting value added to the corresponding β j parameter (see Additional file 1 for table of generated and estimated item parameters and Additional file 2, Additional file 3, Additional file 4 for data files containing these parameters and item response data used in the simulation). The α j parameters for the generated CAT item responses were the same as those used to generate the initial P&P data.
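The DIF-injection rule described in this paragraph can be sketched as follows. The function name `inject_dif` and its arguments are hypothetical; only the rule itself (random item selection, random sign, fixed magnitude added to β j) comes from the text.

```python
import random


def inject_dif(betas, dif_pct, dif_size, seed=0):
    """Shift a random subset of difficulties by +/- dif_size logits.

    Returns the CAT-mode difficulties and the set of DIF item indices.
    """
    rng = random.Random(seed)
    n_dif = round(dif_pct * len(betas))
    dif_items = set(rng.sample(range(len(betas)), n_dif))
    shifted = []
    for j, beta in enumerate(betas):
        if j in dif_items:
            # +1 = harder to endorse in CAT, -1 = easier to endorse in CAT.
            direction = rng.choice((-1, 1))
            shifted.append(beta + direction * dif_size)
        else:
            shifted.append(beta)
    return shifted, dif_items


pp_betas = [0.0] * 100  # placeholder P&P difficulties for illustration
cat_betas, dif_items = inject_dif(pp_betas, dif_pct=0.30, dif_size=0.63)
print(len(dif_items))
```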
Each generated dataset was then used in a series of CAT simulations. In order to ensure comparability across conditions, a fixed-length CAT consisting of 30 administered items for each simulee was conducted. This stopping rule is similar to that used in a previous investigation of CAT and DIF . All CAT simulations employed maximum-likelihood estimation and item selection based on Fisher’s information criterion, a standard CAT algorithm. Each CAT simulation produced the following data: (1) item responses of items selected during the simulated CAT session, (2) index numbers identifying the items selected by CAT, and (3) estimated theta and standard error of theta for each CAT simulee. The originally simulated P&P response data and simulated CAT item-response data were employed in the DIF analysis procedures described earlier (see “A Bayesian Procedure for Detecting Item-Level Mode Effects”).
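Item selection by maximum Fisher information can be sketched as below. This is a generic illustration of the criterion for the 2PL model, in which item information is (Dα)² P(1 − P); it is not a description of how Firestar implements it, and the function names are hypothetical.

```python
import math

D = 1.702


def item_information(theta, alpha, beta):
    """Fisher information of a 2PL item at trait level theta."""
    p = 1.0 / (1.0 + math.exp(-D * alpha * (theta - beta)))
    return (D * alpha) ** 2 * p * (1.0 - p)


def select_next_item(theta_hat, bank, administered):
    """Index of the not-yet-administered item with maximum information."""
    candidates = [j for j in range(len(bank)) if j not in administered]
    return max(candidates, key=lambda j: item_information(theta_hat, *bank[j]))


# Tiny illustrative bank of (alpha, beta) pairs.
bank = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0), (2.0, 0.5)]
first = select_next_item(0.1, bank, administered=set())
second = select_next_item(0.1, bank, administered={first})
print(first, second)
```

Because information grows with the square of the discrimination parameter, highly discriminating items near the current trait estimate tend to be chosen first, which is one reason item usage varies widely across a bank.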
Prior to addressing the main research questions, descriptive analyses were performed for both the CAT simulation results and the RZ and CrI statistics. Descriptive statistics for the CAT simulations included CAT-to-full-scale correlations and mean standard errors (MSE). Distributional properties of the RZ and CrI statistics, including mean, standard deviation, skewness, kurtosis, and values corresponding to the 2.5 and 97.5 percentiles, were calculated.
Detection of Mode Effects (Research Question 1)
The overall performance of the robust Z (RZ) and Bayesian credible interval (CrI, as measured by the minimum difference of CrI to 0 or Δ CrI) was assessed first by examining the sensitivity, specificity, and correct classification rates using cutoff values for α = .05 (i.e., | RZ| > 1.96 and 95% Δ CrI ≠ 0).
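The cutoff-based evaluation can be sketched as follows. The flagging rules are those stated above (|RZ| > 1.96 and Δ CrI ≠ 0); the function names and the six example items are hypothetical.

```python
def flag_rz(rz, cutoff=1.96):
    """Flag an item as showing mode DIF when |RZ| exceeds the cutoff."""
    return abs(rz) > cutoff


def flag_cri(delta):
    """Flag an item when the 95% credible interval excludes zero."""
    return delta != 0


def classification_rates(flags, truth):
    """Sensitivity, specificity and correct classification rate."""
    tp = sum(1 for f, t in zip(flags, truth) if f and t)
    tn = sum(1 for f, t in zip(flags, truth) if not f and not t)
    fp = sum(1 for f, t in zip(flags, truth) if f and not t)
    fn = sum(1 for f, t in zip(flags, truth) if not f and t)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "correct": (tp + tn) / len(truth),
    }


# Hypothetical RZ values for six items; the first three were simulated with DIF.
rz_values = [2.5, -2.1, 0.4, 1.2, -0.3, 2.0]
simulated_dif = [True, True, True, False, False, False]
rates = classification_rates([flag_rz(z) for z in rz_values], simulated_dif)
print(rates)
```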
Logistic regression and ROC analyses were also performed to examine the predictive accuracy of each statistic without reference to specific cutoff values. Since both RZ and Δ CrI can have negative and positive values that are indicative of mode DIF, we first fit a logistic regression model with a quadratic term (i.e., RZ + RZ² and Δ CrI + Δ CrI² for robust Z and credible interval models, respectively) to predict simulated mode DIF. ROC analyses were then conducted based on predicted probabilities from each logistic regression model. The difference in the area under the ROC curves (AUCs) was also assessed for statistical significance using a chi-square procedure. Descriptive statistics (percentages) were used to summarize the true positive and false positive mode-of-administration DIF results in the simulation study.
Factors Related to True and False Positive Mode Effects (Research Question 2)
A series of multilevel random-intercept logistic regression analyses were performed at both univariate (single predictor) and multivariate levels. At the multivariate level, four models were developed, one for each statistical test (RZ and Δ CrI) and each DIF decision (correct and incorrect). In each model, the main predictors were: (a) size of DIF, (b) percentage of DIF items in the dataset, (c) IRT model used to generate the response data, (d) difference in mean performance between the P&P and CAT samples (0 vs. 1 logit), (e) number of times a given item was administered by CAT (item usage), (f) item difficulty, and (g) item discrimination, the latter two predictors based on the estimated parameters using the simulated P&P dataset. Preliminary analyses revealed that absolute values of item difficulty better predicted correct DIF detection, whereas signed item difficulty values more accurately predicted incorrect DIF decisions. With the exception of binary variables (i.e., IRT model, difference in CAT vs. P&P mean trait level), predictors were normalized by dividing each variable by two standard deviations prior to analysis. AUC values derived from ROC analyses based on each model and each individual predictor were also reported to indicate predictive efficacy. Random intercepts were estimated at both item and dataset levels.
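The rescaling of continuous predictors can be sketched as below. The sketch follows the text literally by dividing by two standard deviations without centering. Use of the sample standard deviation is an assumption, and the function name and example values are hypothetical.

```python
import statistics


def scale_by_two_sd(values):
    """Divide each value by twice the sample standard deviation."""
    two_sd = 2.0 * statistics.stdev(values)
    return [v / two_sd for v in values]


# Hypothetical CAT item usage counts for a handful of items.
item_usage = [87, 119, 504, 553, 586, 1312, 1333]
scaled_usage = scale_by_two_sd(item_usage)
print(statistics.stdev(scaled_usage))
```

After this rescaling each continuous predictor has a standard deviation of 0.5, which puts its coefficient on a scale roughly comparable to that of a binary predictor.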
Relationship of Item Difficulty to Power and Type I Error
In order to provide a clearer picture of the relationship of item difficulty with power and Type I error, a plot of mean power and Type I error by P&P item difficulty was created. This plot was based on a series of linear regression analyses to predict mean power and Type I error for both RZ and CrI using the paper-and-pencil item difficulties and their higher-order (i.e., quadratic, cubic, quartic, and quintic) terms as predictors. Predicted values from these regression analyses were used to create the plot.
Generation of item and person parameters and item response data was performed in the R statistical package. Estimation of P&P item parameters was performed using Mplus version 6.0. CAT simulations were performed with Firestar version 1.33. For the DIF procedures, estimation in Steps 1 and 2 of the DIF analyses outlined above was performed using WinBUGS version 1.4.3 (see Additional file 5), which has been used in previous IRT applications [28, 30, 39]. Specifically, we called WinBUGS from R using the R2WinBUGS package, which was also used to retrieve the posterior estimates generated by WinBUGS for subsequent analysis. Descriptive analyses and analyses of the simulation results were performed in Stata version 11.0 (Stata Corp., College Station, Texas).
Table: Summary of CAT simulations by underlying measurement model, DIF size, mean CAT measures and percentage of DIF items. Columns: CAT to full-scale θ correlation; mean standard error. Rows: difference in mean θ (0 or 1) by percentage of DIF items (10 or 30). (Table body not available.)
With respect to the number of times a given item was administered by CAT (CAT item usage), the median number of item administrations across items and simulation conditions was 553 (IQR = 119–1318). The median and IQR were 586 (144–1312) and 504 (87–1333) for 1PL and 2PL item banks, respectively. Item usage was comparable for items simulated with DIF (Med = 557.5, IQR = 123–1315) and non-DIF items (Med = 551, IQR = 117–1320). For RZ, item usage of ≥ 369 and ≥ 422 was associated with power to detect DIF of 80 percent for the 1PL and 2PL conditions, respectively. For CrI, 80 percent power was associated with CAT item usage of 305 and 341 for 1PL and 2PL conditions, respectively. In the 2PL condition, item usage was positively correlated with item discrimination (r = .46, p < .01), reflecting the fact that CAT bases item selection on item discrimination.
Robust Z and 95% Credible Interval Indices
Among non-DIF items, RZ had a mean of −0.10 and a standard deviation of 0.82. Mean Δ CrI was 0.01 (SD = .06). Though both indices were positively skewed and leptokurtic, this was particularly true for Δ CrI (RZ skewness = 0.26, Δ CrI skewness = 14.94; RZ kurtosis = 1.46; Δ CrI kurtosis = 253.70). RZ values of −1.60 and 1.53 corresponded to the 2.5 and 97.5 percentiles, respectively, for items not simulated with mode DIF. Both 2.5th and 97.5th percentiles corresponded to a Δ CrI of 0.00 for non-DIF items.
Detection of mode effects (Research question 1)
Correct classification, sensitivity, and specificity were examined using expected cutoff values at the α = .05 level, i.e., |RZ| > 1.96 and Δ CrI ≠ 0. Employing these criteria resulted in correct classification, sensitivity, and specificity of 92.4%, 69.1%, and 98.1% for RZ and 92.3%, 71.8%, and 97.2% for Δ CrI, respectively. Since our descriptive results presented above suggest that both indices are non-normal, these cutoff values may not be appropriate. We therefore performed logistic regression and ROC analyses to examine the relative performance of the two indices without reference to specific cutoff values. ROC analyses revealed an area under the curve (AUC) of .91 and .82 for RZ and Δ CrI, respectively. This difference in AUCs was statistically significant [χ²(1) = 545.06, p < .0001]. This indicates that RZ values are a significantly stronger predictor of the presence of mode DIF compared to Δ CrI values. Further analyses revealed that empirically derived cutoff values for both RZ and Δ CrI may help to improve sensitivity or specificity. However, since these results are preliminary, and for convenience, results presented in subsequent sections of the paper will use the original cutoff values of |RZ| > 1.96 and Δ CrI ≠ 0.
Table: True positive and false positive rates as a function of generating IRT model, DIF size, number of DIF items, and mean difference between modes (Diff. Mean θ), reported for the robust Z and the Bayes 95% CrI. (Table body not available.)
The present findings revealed power (true positive) rates of 69.1% and 71.8% for RZ and Δ CrI, respectively. Power was highest in the 1PL condition when DIF was large (0.63 logits), the percentage of items with DIF was high (30%), and the mean difference in trait level between CAT and P&P modes was 0 (RZ: 82.7%; Δ CrI: 87.0%). Power was lowest for RZ in the 1PL, medium DIF effect size (0.42), 10% DIF items, and mean θ CAT – θ P&P = 0 condition (54.9%), whereas for Δ CrI it was lowest under the 2PL, medium DIF effect size, 10% DIF items, and mean θ CAT – θ P&P = 1.0 condition (55.1%). For RZ, the average true positive rate was 64.2% when DIF size = 0.42 and 74.1% when DIF size = 0.63 logits. Similarly, true positive rates of 64.7% and 78.9% were observed using Δ CrI for medium and large DIF effect sizes, respectively.
Factors related to true and false positive mode effects (Research question 2)
Table: Univariate and multivariate multilevel logistic regression to predict correct detection of mode effects defined by Robust Z and Bayesian 95% credible interval as a function of study variables. Panels: Robust Z (Model AUC = 0.95); Bayesian 95% Credible Interval (Model AUC = 0.93). Predictors listed in each panel: size of DIF, percentage of DIF, 2PL IRT model, Diff. Mean θ = 1.0, CAT item usage, absolute item difficulty. (Table body not available.)
Table: Univariate and multivariate multilevel logistic regression to predict incorrect detection of mode effects defined by Robust Z and Bayesian 95% credible interval as a function of study variables. Panels: Robust Z (Model AUC = 0.77); Bayesian 95% Credible Interval (Model AUC = 0.74). Predictors listed in each panel: size of DIF, percentage of DIF, 2PL IRT model, Diff. Mean θ = 1.0, CAT item usage. (Table body not available.)
For the RZ procedure, univariate logistic regression analyses revealed that the following were significantly and positively associated with increased false-positive DIF results: size of DIF, difference in mean trait level by mode, CAT item usage, and item discrimination (see Table 4). Conversely, item difficulty was inversely associated with false positive results, indicating that items of higher difficulty were less likely to be incorrectly flagged as exhibiting mode effects. These predictors were also significant at the multivariate level with the exception of item discrimination. For the CrI procedure, size of DIF and difference in mean trait level by mode significantly and positively predicted false-positive DIF results, whereas item difficulty was significantly and inversely associated with false positive mode DIF. These factors were also statistically significant in the multivariate model. CAT item usage was also significantly and positively predictive of false positive DIF results in the multivariate model. Based on AUCs, item difficulty was the single best predictor of false positives in DIF identification for both RZ and CrI, followed by difference in mean trait level between modes. The overall model AUCs were 0.77 and 0.74 for RZ and Δ CrI DIF indices, respectively.
Relationship of Item Difficulty to Power and Type I Error
Bayesian methods have been widely used in IRT and have received considerable attention in DIF analysis. However, their application to detecting DIF between CAT and conventional modes of administration has received relatively little attention. Thus, this study sought to develop and test methods for assessing CAT vs. P&P mode DIF within a Bayesian framework. The present study revealed that the robust Z (RZ) and Bayesian credible interval (CrI) methods generally showed good control of false positive DIF results. Power, as measured by the true-positive rate, varied considerably for both methods but was consistent with previous reports [25–27]. The CrI method resulted in slightly higher power, but this was offset by a higher false positive rate relative to RZ. ROC analysis revealed that RZ significantly outperformed CrI, which appears mainly attributable to improved control of false positives. The results of the study indicate that neither RZ nor Δ CrI conforms to a standard normal or similar distribution. In fact, RZ and particularly Δ CrI evidenced positive skewness and kurtosis. Thus, empirically derived cutoff values for each statistic may yield improved results. Nevertheless, the use of conventional cutoff values (e.g., 1.96 for RZ at α = .05) is not likely to increase Type I error.
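To make the two decision rules concrete, the following is a minimal sketch of how an item might be flagged from posterior draws of its difficulty under each mode. It assumes the commonly used robust Z form, (d − median) / (0.74 × IQR), and an equal-tailed 95% interval; the study's modified RZ may differ in detail. The function names, array shapes and simulated draws are hypothetical and stand in for MCMC output.

```python
import numpy as np

def robust_z(diff):
    """Robust Z for per-item CAT-minus-P&P difficulty differences,
    using (d - median(d)) / (0.74 * IQR(d))."""
    diff = np.asarray(diff, dtype=float)
    q1, med, q3 = np.percentile(diff, [25, 50, 75])
    return (diff - med) / (0.74 * (q3 - q1))

def cri_flags(draws_cat, draws_pp, level=0.95):
    """Flag items whose equal-tailed credible interval for the
    CAT-minus-P&P difficulty difference excludes zero.
    Inputs have shape (n_draws, n_items)."""
    delta = np.asarray(draws_cat) - np.asarray(draws_pp)
    tail = 100 * (1 - level) / 2
    lo, hi = np.percentile(delta, [tail, 100 - tail], axis=0)
    return (lo > 0) | (hi < 0)

# Simulated stand-in for posterior draws (assumption: 20 items, 2000 draws).
rng = np.random.default_rng(0)
n_items, n_draws = 20, 2000
shift = np.zeros(n_items)
shift[0] = 1.0                                   # one item with 1-logit mode DIF
b_pp = rng.normal(0, 1, n_items)                 # P&P difficulties
est_err = rng.normal(0, 0.1, n_items)            # item-level estimation error
draws_pp = b_pp + rng.normal(0, 0.15, (n_draws, n_items))
draws_cat = b_pp + shift + est_err + rng.normal(0, 0.15, (n_draws, n_items))

rz = robust_z((draws_cat - draws_pp).mean(axis=0))
rz_flags = np.abs(rz) > 1.96                     # conventional cutoff at alpha = .05
cri = cri_flags(draws_cat, draws_pp)
```

An empirically derived cutoff, as suggested above, could replace the fixed 1.96 with a percentile of |RZ| taken over simulated items known to be free of DIF.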
CAT item usage was found to be the single best predictor of detecting simulated mode effects, followed by absolute item difficulty. In fact, the multivariate model performed only slightly better than a model in which CAT item usage was the only predictor. Among items with DIF, those administered most often by CAT were more likely to be detected than items administered less frequently. This is not surprising given the wide variability in the frequency with which the various items were administered during the CAT simulations. The frequency with which an item is administered by CAT could therefore form the basis of a power analysis conducted prior to DIF analysis for a given item. This would be particularly useful in the context of ongoing data collection, potentially improving power and minimizing analysis time.
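One simple way to act on this during ongoing data collection is to tally how many respondents have seen each item and to defer DIF testing for items below a chosen count. The sketch below assumes a hypothetical log format (one list of item IDs per respondent) and a hypothetical threshold `min_n`; neither comes from the study, and a suitable threshold would need to be set by a power analysis.

```python
from collections import Counter

def cat_item_usage(administered):
    """Count how many respondents were administered each item.
    `administered` is an iterable of per-respondent lists of item IDs."""
    usage = Counter()
    for items in administered:
        usage.update(set(items))   # count each item once per respondent
    return usage

def items_below_usage(usage, bank, min_n):
    """Return bank items administered to fewer than `min_n` respondents."""
    return [item for item in bank if usage.get(item, 0) < min_n]

# Toy administration log for three respondents and a six-item bank.
log = [["i1", "i2", "i3"], ["i2", "i3", "i4"], ["i2", "i5", "i3"]]
bank = ["i1", "i2", "i3", "i4", "i5", "i6"]
usage = cat_item_usage(log)
low_usage = items_below_usage(usage, bank, min_n=2)
```

Items in `low_usage` would be candidates for postponing the DIF analysis until more CAT administrations have accumulated.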
There are two likely explanations for the observed relationship between absolute item difficulty and power in DIF detection. First, items with difficulty parameters closest to the mean theta values will be more likely to be administered by CAT. Since measures with mean trait levels of 0 or 1 logit were simulated, items in this range of difficulty would be most frequently administered. Second, items towards the extremes of the measurement continuum are less precisely estimated (i.e., have larger standard errors). Thus, power to detect DIF in items that are very easy or difficult to endorse is lower than that for items of average difficulty. This would likely explain why absolute item difficulty was a significant predictor of power even after controlling for CAT item usage. These findings may in part reflect the use of a fixed-length CAT during the simulation. In the case of a variable-length CAT, more items would likely be administered to simulees at the extremes of the trait continuum in order to achieve sufficient measurement precision, including items that are very easy or difficult to endorse. Conversely, we would expect fewer items to be administered to simulees who are in the center of the trait distribution under a variable-length CAT.
With respect to incorrect DIF decisions, easier-to-endorse items were more likely to be erroneously flagged than more difficult items. This finding is in contrast to Wang, Bradlow, Wainer and Muller, who found that, unlike the standard Mantel-Haenszel test, a Bayesian approach did not result in elevated false positive errors for easy items. There are a number of differences between the Wang, Bradlow, Wainer and Muller study and the present investigation that may account for the differential findings. The former study did not examine DIF in CAT-administered items, employed a testlet model, and analyzed DIF using posterior p values. Further, in Wang, Bradlow, Wainer and Muller, Type I error was examined in the absence of DIF items. Conversely, the present study assessed Type I error (false positive DIF results) under conditions in which some DIF items were present, thus contaminating the estimated measures used in group matching. Research is clearly needed to determine the causes of the elevated false positive rate for easy-to-endorse items. Two possible avenues of research in this area include: (1) further examination of different priors for item parameters and their effect on DIF detection for easy-to-endorse items, and (2) an iterative process of identifying DIF items and then removing or appropriately weighting them in the estimation of person measures.
As might be expected, DIF magnitude (i.e., the difference between CAT and P&P item parameters for a given item) was significantly and positively related to power. The same was not true for the percentage of items with DIF in the item bank. The latter result suggests that the power to detect a single DIF item is not significantly affected by the presence of other DIF items in the bank which may "contaminate" the person measures.
The results of this study revealed a positive relationship between item discrimination and power to identify items with mode DIF. One possible explanation for this finding is that CAT using a 2PL model and maximum information item selection will tend to select items with higher discrimination parameters for administration. In other words, DIF in highly discriminating items may be easier to detect simply because these items are administered more frequently in CAT. Yet the results of the multivariate logistic regression analysis failed to support this explanation: item discrimination remained statistically significant even when controlling for CAT item usage. High item discrimination therefore appears to enhance power in mode-effect detection. This finding is corroborated by previous DIF research examining the relationship of item discrimination to power using several analytic procedures [48, 49]. Using the RZ procedure, item discrimination was positively associated with false DIF results at the univariate level, though this effect was no longer significant at the multivariate level. The latter findings partially confirm previous studies that reported a positive relationship between item discrimination and Type I error rate for uniform DIF [50, 51].
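The link between discrimination, difficulty and item selection can be seen from the standard 2PL response and information functions, written here without a scaling constant. The symbols a_i (discrimination), b_i (difficulty) and θ (trait level) follow common IRT notation and are not taken from the article's own equations.

```latex
P_i(\theta) = \frac{1}{1 + \exp\left[-a_i(\theta - b_i)\right]},
\qquad
I_i(\theta) = a_i^{2}\, P_i(\theta)\,\bigl[1 - P_i(\theta)\bigr],
\qquad
\max_{\theta} I_i(\theta) = \frac{a_i^{2}}{4} \ \text{at}\ \theta = b_i .
```

Because information grows with the square of a_i and peaks where θ equals b_i, a maximum-information CAT will tend to favor highly discriminating items whose difficulty lies near the trait levels of most respondents.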
For both RZ and CrI, power to detect DIF was lower in the 2PL condition. This appears to be related to some extent to CAT item usage. Though the number of items administered to each simulee was the same across the two conditions, median CAT item usage was lower (Med=504) in the 2PL than in the 1PL (Med=586) condition. However, the logistic regression results indicate that IRT model remained significant even when CAT item usage was included in the model. Thus, CAT item usage may not completely explain why power was lower in the 2PL condition. Though these findings are based on a small number of replications per condition and need to be interpreted cautiously, the observed relationship between measurement model and power to detect mode effects warrants further exploration.
In addition to the effect of item parameters, false positive DIF results were significantly associated with DIF size and mean difference in trait level between CAT and P&P administration modes. These effects likely reflect problems with the trait estimate used as the matching variable in the DIF analysis. Items with large DIF effects and mean differences in trait level between groups limit the effectiveness of matching, as has been observed in previous DIF studies [50–53]. These results highlight the need for careful sampling of respondents who complete each form of the instrument and assessment of trait-level differences prior to assessment of mode effects. The percentage of DIF items in the item bank was not associated with false DIF results. Though false positive rates were smaller in the 10% compared to the 30% DIF conditions, DIF percentage was not found to be significantly predictive of false positive DIF in either the univariate or multivariate logistic regression models for either RZ or CrI. Note that due to the computational demands involved in estimating posterior distributions of parameters, we decided not to perform item purification in this simulation.
The strength of Monte Carlo simulation lies in its ability to systematically vary several factors thought to affect identification of simulated effects. In this study, several factors were directly examined with respect to detection of mode-of-administration DIF, including DIF size, percentage of DIF items, mean difference in trait level between modes, item response model, and analytic procedure. We also examined the effects of variables that were not part of the research design, including CAT item usage, item discrimination, and item difficulty parameters. A particular strength of the study is the examination of CAT item usage rather than sample size as a factor related to identification of DIF.
Nevertheless, our study has several limitations. For example, several other factors were not considered in the simulation. Of particular importance is the degree to which the mean, variance, and shape of the parameter distributions are consistent with the priors specified in the Bayesian estimation model. Though differences in mean trait levels were examined, deviations from prior assumptions concerning parameter variances or distribution types were not. For instance, there is a need to conduct further studies examining the potential effect of skewed theta and item parameter distributions on the performance of DIF procedures. Methods of CAT item selection and stopping rules also deserve further attention. There is also a need to assess the RZ and CrI procedures in identifying items exhibiting non-uniform mode DIF. Additional limitations of the present study include the small number of replications per experimental condition and the use of a fixed-length CAT and a fixed item bank size.
We also intentionally did not address non-uniform DIF, which limits our conclusions to uniform DIF only. Importantly, though, no theoretical reasons preclude conducting similar analyses on non-uniform DIF. However, given the nascent status of research in this field, we chose to focus on a single type of DIF. We hope that future research will address non-uniform DIF in one study and both types simultaneously in a final study. By addressing each in a stepwise fashion, we hope to avoid spurious conclusions that could arise by addressing all simultaneously in the initial study. For example, we did not want the presence of non-uniform DIF to influence the detection of uniform DIF using the methods developed here. Finally, we used only simulated data. Future studies employing these procedures with real data are also needed.
This study yielded mixed results concerning the methods for assessing mode effects. Whereas Type I error was well controlled, power to detect DIF was suboptimal, though the present findings were consistent with those reported in similar studies [25–27]. The modified robust Z test provided better control of the Type I error rate than CrI. True positive rates were primarily predicted by CAT item usage, absolute item difficulty and item discrimination. Further research is needed to examine the robustness of these methods under varying prior assumptions concerning the distribution of item and person parameters and when data fail to conform to these prior assumptions. False identification of DIF when items were very easy to endorse is a problem requiring additional investigation.
1PL: One-parameter logistic item response theory model
2PL: Two-parameter logistic item response theory model
AUC: Area under the curve
CAT: Computerized adaptive testing
DIF: Differential item functioning
IRT: Item response theory
RZ: Robust Z test.
The development of this paper was supported by the National Institute on Drug Abuse (NIDA) under grant R21 DA 025371. NIDA had no direct role in the design of the study, analyses or interpretation of the study findings. The authors would like to thank Leanne Welch and Tim Feeney for their help proofreading the manuscript. Finally, the authors wish to thank the Research Open Access Publishing (ROAAP) Fund of the University of Illinois at Chicago for financial support toward the open access publishing fee for this article.
- Reeve BB: Special issues for building computerized-adaptive tests for measuring patient-reported outcomes: The National Institute of Health’s investment in new technology. Medical Care. 2006, 44 (11 Supp 3): S198-S204.
- Reeve BB, Hays RD, Bjorner JB, Cook KF, Crane PK, Teresi JA, Thissen D, Revicki DA, Weiss DJ, Hambleton RK, et al: Psychometric evaluation and calibration of health-related quality of life item banks: plans for the Patient-Reported Outcomes Measurement Information System (PROMIS). Medical Care. 2007, 45 (5 Suppl 1): S22-S31.
- Schulenberg SE, Yutrzenka BA: The equivalence of computerized and paper-and-pencil psychological instruments: Implications for measures of negative affect. Behavioral Research Methods Instruments and Computers. 1999, 31: 315-321. 10.3758/BF03207726.
- Gwaltney CJ, Shields AL, Shiffman S: Equivalence of electronic and paper-and-pencil administration of patient-reported outcome measures: A meta-analytic review. Value Health. 2008, 11 (2): 322-333. 10.1111/j.1524-4733.2007.00231.x.
- Pommerich M: The effect of using item parameters calibrated from paper administrations in computer adaptive test administrations. Journal of Technology, Learning, and Assessment. 2007, 5: 1-29.
- Zwick R, Thayer DT, Wingersky M: Effect of Rasch calibration on ability and DIF estimation in computer-adaptive tests. J Educ Meas. 1995, 32 (4): 341-363. 10.1111/j.1745-3984.1995.tb00471.x.
- Holland PW, Thayer DT: Differential item functioning and the Mantel-Haenszel procedure. 1986, Evanston, IL: Educational Testing Service
- Dorans NJ, Kulick E: Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. J Educ Meas. 1986, 23 (4): 355-368. 10.1111/j.1745-3984.1986.tb00255.x.
- Birnbaum A: Some latent trait models and their use in inferring an examinee's ability. Statistical theories of mental test scores. Edited by: Lord FM, Novick MR. 1968, Reading, MA: Addison-Wesley, 397-472.
- Lord FM: Estimating true-score distributions in psychological testing (An empirical Bayes estimation problem). Psychometrika. 1969, 34 (3): 259-299. 10.1007/BF02289358.
- Lord FM, Novick MR: Statistical theories of mental test scores. 1968, Reading, MA: Addison-Wesley
- Rasch G: Probabilistic models for some intelligence and attainment tests. 1960, Copenhagen: Danmarks Paedogogiske Institut
- Pommerich M: Developing computerized versions of paper-and-pencil tests: Mode effects for passage-based tests. Journal of Technology, Learning, and Assessment. 2004, 2 (6): 1-44.
- Higgins J, Russell M, Hoffmann T: Examining the effect of computer-based passage presentation on reading test performance. Journal of Technology, Learning, and Assessment. 2005, 3 (4): 1-34.
- Sandene B, Horkay N, Bennett R, Allen N, Braswell J, Kaplan B, Oranje A: Online assessment in mathematics and writing. NAEP technology-based assessment project, research and development series (National Center for Education Statistics Publication No NCES 2005–457). 2005, Washington DC: U.S. Government Printing Office
- Johnson M, Green S: On-line mathematics assessment: The impact of mode on performance and question answering strategies. The Journal of Technology, Learning, and Assessment. 2006, 4 (5): 1-35.
- Keng L, McClarty KL, Davis LL: Item-level comparative analysis of online and paper administrations of the Texas Assessment of Knowledge and Skills. Appl Meas Educ. 2008, 21 (3): 207-226. 10.1080/08957340802161774.
- Kim D, Huynh H: Comparability of computer and paper-and-pencil versions of algebra and biology assessments. Journal of Technology, Learning and Assessment. 2007, 6 (4): 1-31.
- Robitzsch A, Rupp AA: Impact of missing data on the detection of differential item functioning: The case of Mantel-Haenszel and logistic regression analysis. Educ Psychol Meas. 2008, 69 (1): 18-34. 10.1177/0013164408318756.
- Zhang B, Walker CM: Impact of missing data on person model fit and person trait estimation. Appl Psychol Meas. 2008, 32 (6): 466-479. 10.1177/0146621607307692.
- Gershon RC: Computer adaptive testing. J Appl Meas. 2005, 6 (1): 109-127.
- Jenkinson C, Fitzpatrick R, Garratt A, Peto V, Stewart-Brown S: Can item response theory reduce patient burden when measuring health status in neurological disorders? Results from Rasch analysis of the SF-36 physical functioning scale (PF-10). J Neurol Neurosurg Psychiatry. 2001, 71 (2): 220-224. 10.1136/jnnp.71.2.220.
- Riley BB, Conrad KJ, Bezruczko N, Dennis ML: Relative precision, efficiency and construct validity of different starting and stopping rules for a computerized adaptive test: The GAIN Substance Problem Scale. J Appl Meas. 2007, 8 (1): 48-65.
- Mantel N, Haenszel W: Statistical aspects of the analysis of data from retrospective studies. J Natl Cancer Inst. 1959, 22 (4): 719-748.
- Zwick R, Thayer DT: An empirical Bayes approach to Mantel-Haenszel DIF analysis. J Educ Meas. 1999, 36 (1): 1-28. 10.1111/j.1745-3984.1999.tb00543.x.
- Zwick R, Thayer DT: Application of an empirical Bayes enhancement of Mantel-Haenszel differential item functioning analysis to a computerized adaptive test. Appl Psychol Meas. 2002, 26 (1): 57-76. 10.1177/0146621602026001004.
- Zwick R, Thayer DT: An empirical Bayes enhancement of Mantel-Haenszel DIF analysis for computer-adaptive tests. 2003, Newton, PA USA: Law School Admission Council
- Chaimongkol S, Kamata K: An explanatory differential item functioning (DIF) model by the WinBUG 1.4. Songklanakarin Journal of Science and Technology. 2007, 29 (2): 449-458.
- Glickman ME, Seal P, Eisen SV: A non-parametric Bayesian diagnostic for detecting differential item functioning in IRT models. Health Services and Outcomes Research Methodology. 2009, 9 (3): 145-161. 10.1007/s10742-009-0052-4.
- Soares TM, Goncalves FB, Gamerman D: An integrated Bayesian model for DIF analysis. J Educ Behav Stat. 2009, 34 (3): 348-377. 10.3102/1076998609332752.
- Wang X, Bradlow E, Wainer H, Muller E: A Bayesian method for studying DIF: A cautionary tale filled with surprises and delights. J Educ Behav Stat. 2008, 33 (3): 363-384.
- Huynh H, Meyer P: Use of robust z in detecting unstable items in item response theory models. Practical Assessment Research & Evaluation. 2010, 15 (2): 1-8.
- Patz RJ, Junker BW: Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. J Educ Behav Stat. 1999, 24 (4): 342-366.
- Patz RJ, Junker BW: A straightforward approach to Markov chain Monte Carlo methods for item response models. J Educ Behav Stat. 1999, 24 (2): 146-178.
- Sahu SK: Bayesian estimation and model choice in item response models. J Stat Comput Simul. 2002, 72: 217-232. 10.1080/00949650212387.
- Hambleton RK, Jones RW, Rogers HJ: Influence of item parameter estimation errors in test development. J Educ Meas. 1993, 30 (2): 143-155. 10.1111/j.1745-3984.1993.tb01071.x.
- Hulin CL, Lissak RI, Drasgow F: Recovery of two- and three-parameter logistic item characteristic curves: A monte carlo study. Appl Psychol Meas. 1982, 6 (3): 249-260. 10.1177/014662168200600301.
- Kang T, Cohen AS: IRT model selection methods for dichotomous items. Appl Psychol Meas. 2007, 31 (4): 331-358. 10.1177/0146621606292213.
- Stone CA: Recovery of marginal maximum likelihood estimates in the two-parameter logistic response model: An evaluation of MULTILOG. Appl Psychol Meas. 1992, 16 (1): 1-16. 10.1177/014662169201600101.
- Zwick R, Thayer DT, Wingersky M: A simulation study of methods for assessing differential item functioning in computerized adaptive tests. Appl Psychol Meas. 1994, 18 (1): 121-140.
- DeLong ER, DeLong DM, Clarke-Pearson DL: Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics. 1988, 44 (3): 837-845. 10.2307/2531595.
- Gelman A: Scaling regression inputs by dividing by two standard deviations. Stat Med. 2008, 27 (15): 2865-2873. 10.1002/sim.3107.
- R Development Core Team: R: R Development Core Team. Statistical programming language. 2011, 212
- Muthén LK: Mplus. 2010, Los Angeles, CA: Muthén & Muthén, 60
- Choi SW: Firestar: Computerized adaptive testing simulation program for polytomous IRT models. Appl Psychol Meas. 2009, 33 (8): 644-645. 10.1177/0146621608329892.
- Spiegelhalter D, Thomas A, Best N, Lunn D: WinBUGS version 1.4.3 user manual. 2007, Cambridge, United Kingdom: MRC Biostatistics Unit
- Gelman A, Sturtz S, Ligges U, Gorjanc G, Kerman J: The R2WinBUGS Package Manual Version 2.0-4. 2006, New York: Statistic Department Faculty
- Kristjansson E, Aylesworth R, Mcdowell I, Zumbo BD: A comparison of four methods for detecting differential item functioning in ordered response items. Educ Psychol Meas. 2005, 65: 935-953. 10.1177/0013164405275668.
- Zwick R, Donoghue JR, Grima A: Assessment of differential item functioning for performance tasks. J Educ Meas. 1993, 30: 233-251. 10.1111/j.1745-3984.1993.tb00425.x.
- Ankenmann RD, Witt EA, Dunbar SB: An investigation of the power of the likelihood ratio goodness-of-fit statistic in detecting differential item functioning. J Educ Meas. 1999, 36 (4): 277-300. 10.1111/j.1745-3984.1999.tb00558.x.
- Roussos LA, Stout WF: Simulation studies of the effects of small sample size and studied item parameters on SIBTEST and Mantel-Haenszel Type I error performance. J Educ Meas. 1996, 33 (2): 215-230. 10.1111/j.1745-3984.1996.tb00490.x.
- Zwick R, Thayer DT, Mazzeo J: Descriptive and inferential procedures for assessing differential item functioning in polytomous items. Appl Meas Educ. 1997, 10 (4): 321-344. 10.1207/s15324818ame1004_2.
- Jodoin MG, Gierl MJ: Evaluating type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Appl Meas Educ. 2001, 14: 329-349. 10.1207/S15324818AME1404_2.
- The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/12/124/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.