Clinicians routinely use structured clinical interviews to diagnose personality disorders (PDs), and research on clinical conditions such as PDs commonly relies on multiple raters. When multiple raters are used, it is particularly important to document that they achieve an adequate level of agreement.
The Structured Clinical Interview for DSM-IV Axis II Personality Disorders (SCID II), based on the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition, is one of the standard tools used to diagnose personality disorders. Because this assessment yields dichotomous outcomes, Cohen’s Kappa [2, 3] is commonly used to assess the reliability of raters. Only a few studies have assessed inter-rater reliability using the SCID II, but our recent report showed that the overall Kappa for the Thai version of the SCID II was .80, ranging from .70 for Depressive Personality Disorder to .90 for Obsessive-compulsive Personality Disorder. However, some investigators have expressed concerns about the low Kappa values found for some criteria despite a high percentage of agreement [4–6]. This problem has been referred to as the “Kappa paradox” by Feinstein and Cicchetti, who stated, “in one paradox, a high value of the observed agreement (Po) can be drastically lowered by a substantial imbalance in the table's marginal totals either vertically or horizontally. In the second paradox, kappa will be higher with an asymmetrical rather than symmetrical imbalance in marginal totals, and with imperfect rather than perfect symmetry in the imbalance. An adjusted kappa does not repair either problem, and seems to make the second one worse.” Di Eugenio and Glass stated that κ is affected by skewed distributions of categories (the prevalence problem) and by the degree to which coders disagree (the bias problem).
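To make the first paradox concrete, consider a purely illustrative 2 × 2 table (hypothetical counts, not drawn from our study) in which two raters assess one dichotomous criterion in 100 subjects: 80 subjects are rated “present” by both raters, 10 by rater A only, 5 by rater B only, and 5 are rated “absent” by both. Observed agreement is high, yet the skewed marginal totals drive Kappa down:

```latex
% Cohen's Kappa for the hypothetical 2 x 2 table above (n = 100)
\[
P_o = \frac{80 + 5}{100} = 0.85, \qquad
P_e = (0.90)(0.85) + (0.10)(0.15) = 0.78,
\]
\[
\kappa = \frac{P_o - P_e}{1 - P_e} = \frac{0.85 - 0.78}{1 - 0.78} \approx 0.32 .
\]
```

Here the raters agree on 85% of subjects, yet κ is only about .32, because nearly all of the expected chance agreement is absorbed by the dominant “present” category.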
In an attempt to address these problems, Gwet proposed two new agreement coefficients. The first can be used with any number of raters but requires a simple categorical rating system; the second can also be used with any number of raters but is more appropriate when an ordered categorical rating system is used. The first coefficient is called the “first-order agreement coefficient,” or the AC1 statistic, and it adjusts the overall agreement probability for the chance that raters agree even though one or all of them have assigned a rating at random. A random rating occurs when a rater is not certain how to classify an object, which can happen when the object’s characteristics do not match the rating instructions. Such chance agreement inflates the overall agreement probability but should not contribute to the measure of actual agreement between raters. Therefore, as is done with the Kappa statistic, Gwet adjusted for chance agreement, so that the AC1 between two or more raters is defined as the conditional probability that two randomly selected raters agree, given that the agreement does not occur by chance. Gwet found that Kappa gives a slightly higher value than other coefficients when there is a high level of agreement; however, in the paradoxical situation in which Kappa is low despite a high level of agreement, Gwet proposed AC1 as a “paradox-resistant” alternative to the unstable Kappa coefficient.
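The sketch below illustrates this difference on the same hypothetical 2 × 2 table used above; the function name and the counts are our own illustrative choices. For two raters and two categories, Gwet’s chance-agreement term reduces to 2π(1 − π), where π is the mean of the two raters’ marginal proportions of “present” ratings.

```python
# Illustrative comparison of Cohen's Kappa and Gwet's AC1 for two raters
# rating one dichotomous criterion (counts are hypothetical, not study data).

def cohen_kappa_and_ac1(both_yes, a_only, b_only, both_no):
    """Return (observed agreement, Cohen's Kappa, Gwet's AC1) for a 2 x 2 table."""
    n = both_yes + a_only + b_only + both_no
    p_o = (both_yes + both_no) / n            # observed agreement Po

    p_a_yes = (both_yes + a_only) / n         # rater A's marginal "present" proportion
    p_b_yes = (both_yes + b_only) / n         # rater B's marginal "present" proportion

    # Cohen's chance agreement: products of the raters' marginal proportions
    p_e_kappa = p_a_yes * p_b_yes + (1 - p_a_yes) * (1 - p_b_yes)
    kappa = (p_o - p_e_kappa) / (1 - p_e_kappa)

    # Gwet's chance agreement for two categories: 2 * pi * (1 - pi),
    # where pi is the mean of the two marginal "present" proportions
    pi = (p_a_yes + p_b_yes) / 2
    p_e_gamma = 2 * pi * (1 - pi)
    ac1 = (p_o - p_e_gamma) / (1 - p_e_gamma)

    return p_o, kappa, ac1

# Skewed marginals: 80 "both present", 10 and 5 disagreements, 5 "both absent"
p_o, kappa, ac1 = cohen_kappa_and_ac1(80, 10, 5, 5)
print(f"Po = {p_o:.2f}, kappa = {kappa:.2f}, AC1 = {ac1:.2f}")
# Approximately: Po = 0.85, kappa = 0.32, AC1 = 0.81
```

With balanced marginal totals the two coefficients give similar values; they diverge mainly when one category dominates, which is exactly the situation that produces the Kappa paradox.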
Gwet has also demonstrated the validity of the multiple-rater versions of the AC1 and Fleiss’s Kappa statistics, using a Monte Carlo simulation approach with various estimators.
To the best of our knowledge, Gwet’s AC1 has never been tested in an inter-rater reliability analysis of personality disorders; therefore, in this study we analyzed the data using both Cohen’s Kappa and Gwet’s AC1 and compared the reliability estimates they produced.