Bayesian methods have been widely used in IRT and have received considerable attention in DIF analysis. However, their application to detecting DIF between CAT and conventional modes of administration has received relatively little attention. Thus, this study sought to develop and test methods for assessing CAT vs. P&P mode DIF employing a Bayesian framework. The present study revealed that the robust Z (RZ) and Bayesian credible interval (CrI) methods generally showed good control of false positive DIF results. Power as measured by the true-positive rate varied considerably for both methods but was consistent with previous reports [25–27]. The CrI method resulted in slightly higher power, but this was offset by a higher false positive rate relative to RZ. ROC analysis revealed that RZ significantly outperformed CrI, which appears mainly attributable to improved control of false positives. The results of the study indicate that neither RZ nor Δ CrI conforms to a standard normal or similar distribution. In fact, RZ and particularly Δ CrI evidenced positive skewness and kurtosis. Thus, empirically derived cutoff values for each statistic may yield improved results. Nevertheless, the use of conventional cutoff values (e.g., 1.96 for RZ at α = .05) is not likely to increase Type I error.
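As a rough illustration of the flagging rule with a conventional cutoff, the following Python sketch computes a robust Z statistic for a set of between-mode differences in item difficulty. It assumes the commonly cited form of the statistic (each difference minus the median of all differences, divided by 0.74 times the interquartile range); the function names, the quartile convention, and the example values are hypothetical and are not taken from the present study.

```python
import statistics


def robust_z(differences):
    """Robust Z per item, assuming RZ = (d - median(d)) / (0.74 * IQR(d)).

    `differences` holds one CAT-minus-P&P difficulty difference per item.
    """
    med = statistics.median(differences)
    # Quartiles with the "inclusive" method; other quartile conventions
    # would give slightly different IQR values.
    q1, _, q3 = statistics.quantiles(differences, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; robust Z is undefined")
    return [(d - med) / (0.74 * iqr) for d in differences]


def flag_dif(differences, cutoff=1.96):
    """Flag items whose |RZ| exceeds the cutoff (1.96 corresponds to alpha = .05)."""
    return [abs(z) > cutoff for z in robust_z(differences)]


# Hypothetical differences for eight items; only the last one is large.
example_differences = [0.0, 0.1, -0.1, 0.05, -0.05, 0.02, -0.02, 1.0]
example_flags = flag_dif(example_differences)
```

Because the median and interquartile range are insensitive to a few extreme values, a small number of items with large differences should have little influence on the scaling applied to the remaining items. An empirically derived cutoff could be passed through the `cutoff` argument in place of 1.96.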
CAT item usage was found to be the single best predictor of detecting simulated mode effects, followed by absolute item difficulty. In fact, the multivariate model performed only slightly better than the model in which CAT item usage was the only predictor. For items with DIF, those items administered most often by CAT were more likely to be detected than items administered less frequently. This is not surprising given the wide variability in how often individual items were administered during the CAT simulations. The frequency with which an item is administered by CAT could therefore form the basis of a power analysis conducted prior to DIF analysis for a given item. This would be particularly useful in the context of ongoing data collection, potentially improving power and minimizing analysis time.
There are two likely explanations for the observed relationship between absolute item difficulty and power in DIF detection. First, items with difficulty parameters closest to the mean theta values will be more likely to be administered by CAT. Since measures with mean trait levels of 0 or 1 logit were simulated, items in this range of difficulty would be most frequently administered. Second, items towards the extremes of the measurement continuum are less precisely estimated (i.e., have larger standard errors). Thus, power to detect DIF in items that are very easy or difficult to endorse is lower than that for items of average difficulty. This would likely explain why absolute item difficulty was a significant predictor of power even after controlling for CAT item usage. These findings may in part reflect the use of a fixed-length CAT during the simulation. In the case of a variable-length CAT, more items would likely be administered to simulees at the extremes of the trait continuum in order to achieve sufficient measurement precision, including items that are very easy or difficult to endorse. Conversely, we would expect fewer items to be administered to simulees who are in the center of the trait distribution under a variable-length CAT.
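The second explanation above, that items at the extremes of the continuum are estimated less precisely, can be illustrated with a minimal sketch. It assumes a Rasch (1PL) model with known person measures, under which the standard error of an item difficulty is approximately the inverse square root of the summed response variances. The function names and the evenly spaced trait values are hypothetical and do not reproduce the simulation design of the study.

```python
import math


def rasch_p(theta, b):
    """Probability of endorsing an item of difficulty b under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def difficulty_se(thetas, b):
    """Approximate SE of b: 1 / sqrt(sum of P * (1 - P)) over respondents."""
    information = sum(rasch_p(t, b) * (1.0 - rasch_p(t, b)) for t in thetas)
    return 1.0 / math.sqrt(information)


# Hypothetical respondents spread evenly from -2 to +2 logits.
thetas = [-2.0 + 0.1 * k for k in range(41)]

se_average_item = difficulty_se(thetas, 0.0)  # item near the centre of the sample
se_extreme_item = difficulty_se(thetas, 3.0)  # item far above most respondents
```

With the same respondents, the item located far from the centre of the trait distribution accumulates less information and therefore has the larger standard error, which is consistent with lower power to detect DIF in very easy or very difficult items.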
With respect to incorrect DIF decisions, easier-to-endorse items were more likely to be erroneously flagged than more difficult items. This finding is in contrast to Wang, Bradlow, Wainer and Muller, who found that, unlike the standard Mantel-Haenszel test, a Bayesian approach did not result in elevated false positive errors for easy items. There are a number of differences between the Wang, Bradlow, Wainer and Muller study and the present investigation that may account for the differential findings. The former study did not examine DIF in CAT-administered items, employed a testlet model, and analyzed DIF using posterior p values. Further, in Wang, Bradlow, Wainer and Muller, Type I error was examined in the absence of DIF items. Conversely, the present study assessed Type I error (false positive DIF results) under conditions in which some DIF items were present, thus contaminating the estimated measures used in group matching. Research is clearly needed to determine the causes of the elevated false positive rate for easy-to-endorse items. Two possible avenues of research in this area include: (1) further examination of different priors for item parameters and their effect on DIF detection for easy-to-endorse items, and (2) an iterative process of identifying DIF items and then removing or appropriately weighting them in the estimation of person measures.
As might be expected, DIF magnitude (i.e., the difference between CAT and P&P item parameters for a given item) was significantly and positively related to power. The same was not true for the percentage of items with DIF in the item bank. The latter result suggests that the power to detect a single DIF item is not significantly affected by the presence of other DIF items in the bank which may "contaminate" the person measures.
The results of this study revealed a positive relationship between item discrimination and power to identify items with mode DIF. One possible explanation for this finding is that CAT using a 2PL model and maximum information item selection will tend to select items with higher discrimination parameters for administration. In other words, DIF in high discriminating items may be easier to detect because these items are administered more frequently in CAT. Yet the results of the multivariate logistic regression analysis failed to support this conclusion. Item discrimination remained statistically significant even when controlling for CAT item usage. High item discrimination therefore appears to enhance power in mode-effect detection. This finding is corroborated by previous DIF research examining the relationship of item discrimination to power using several analytic procedures [48, 49]. Using the RZ procedure, item discrimination was positively associated with false DIF results at the univariate level, though this effect was no longer significant at the multivariate level. The latter findings partially confirmed previous studies that reported a positive relationship between item discrimination and Type I error rate for uniform DIF [50, 51].
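The tendency of maximum information selection to favor highly discriminating items follows from the 2PL item information function, I(θ) = a²P(θ)(1 − P(θ)). The Python sketch below is a minimal illustration under that standard formula; the three-item bank, the function names, and the selection routine are hypothetical and are not the CAT algorithm used in the simulations.

```python
import math


def p_2pl(theta, a, b):
    """2PL probability of endorsement for discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


def select_item(theta, bank, administered):
    """Return the index of the unadministered item with maximum information."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *bank[i]))


# Hypothetical bank of (a, b) pairs: items 0 and 1 share a difficulty but
# differ in discrimination; item 2 is discriminating but far off target.
bank = [(0.8, 0.0), (1.6, 0.0), (1.6, 2.5)]
first_choice = select_item(0.0, bank, administered=set())
second_choice = select_item(0.0, bank, administered={first_choice})
```

For a simulee at θ = 0, the more discriminating of the two well-targeted items is chosen first, and the poorly targeted item is chosen last despite its high discrimination. This is the mechanism by which discrimination and CAT item usage can become confounded, which is why the multivariate model that controls for usage is informative.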
For both RZ and CrI, power to detect DIF was lower in the 2PL condition. This appears to be related to some extent to CAT item usage. Though the number of items administered to each simulee was the same across the two conditions, median CAT item usage was lower (Med=504) in the 2PL than in the 1PL (Med=586) condition. However, the logistic regression results indicate that IRT model remained significant even when CAT item usage was included in the model. Thus, CAT item usage may not completely explain why power was lower in the 2PL condition. Though these findings are based on a small number of replications per condition and need to be interpreted cautiously, the observed relationship between measurement model and power to detect mode effects warrants further exploration.
In addition to the effect of item parameters, false positive DIF results were significantly associated with DIF size and mean difference in trait level between CAT and P&P administration modes. These effects likely reflect problems with the trait estimate used as the matching variable in the DIF analysis. Items with large DIF effects and mean differences in trait level between groups limit the effectiveness of matching, as has been observed in previous DIF studies [50–53]. These results highlight the need for careful sampling of respondents who complete each form of the instrument and assessment of trait-level differences prior to assessment of mode effects. The percentage of DIF items in the item bank was not associated with false DIF results. Though false positive rates were smaller in the 10% compared to the 30% DIF conditions, DIF percentage was not found to be significantly predictive of false positive DIF in either the univariate or multivariate logistic regression models for either RZ or CrI. Note that due to the computational demands involved in estimating posterior distributions of parameters, we decided not to perform item purification in this simulation.
The strength of Monte Carlo simulation lies in its ability to systematically vary several factors thought to affect identification of simulated effects. In this study, several factors were directly examined with respect to detection of mode-of-administration DIF, including DIF size, percentage of DIF items, mean difference in trait level between modes, item response model, and analytic procedure. We also examined the effects of variables not part of the research design, including CAT item usage, item discrimination, and item difficulty parameters. A particular strength of the study is the examination of CAT item usage rather than sample size as a factor related to identification of DIF.
Nevertheless, our study has several limitations. For example, several other factors were not considered in the simulation. Of particular importance is the degree to which the mean, variance, and shape of distributions of parameters are consistent with specified priors in the Bayesian estimation model. Though differences in mean trait levels were examined, deviations from prior assumptions concerning parameter variances or distribution types were not examined. For instance, there is a need to conduct further studies examining the potential effect of skewed theta and item parameter distributions on the performance of DIF procedures. Methods of CAT item selection and stopping rules also deserve further attention. There is also a need to assess the RZ and CrI procedures in identifying items exhibiting non-uniform mode DIF. Additional limitations of the present study include the small number of replications per experimental condition and the use of a fixed-length CAT and a fixed item bank size.
Also, we intentionally did not address non-uniform DIF, which limits our conclusions to uniform DIF only. Importantly, though, no theoretical reasons preclude conducting similar analyses on non-uniform DIF. However, given the nascent status of research in this field, we chose to focus on a single type of DIF. Our future research will hopefully address non-uniform DIF in one study and both types simultaneously in a final study. By addressing each in a stepwise fashion, we hope to avoid spurious conclusions that could arise from addressing all simultaneously in the initial study. For example, we did not want the presence of non-uniform DIF to influence the detection of uniform DIF using the methods we developed here. Finally, we used only simulated data. Future studies employing these procedures with real data are also needed.