By comparing overall scores between reviewers and authors, we found remarkably low inter-rater reliability for the NOS quality assessment tool. The majority of the cohort studies were rated as being at higher risk of bias by the authors than by the reviewers; that is, the reviewers assigned NOS scores that were higher by two or more points out of the nine possible points for the overall NOS score.
Inter-rater reliability between authors and reviewers was remarkably low, and agreement across items was only minimal. The item on adequacy of follow-up of cohorts had the highest kappa value, at 0.15 (95% CI = −0.19, 0.48), which still represents only slight agreement. The comparatively small room for subjectivity might explain this relatively high kappa value: authors simply had to indicate whether there was loss to follow-up in their cohort and whether it could have introduced bias into their study. The overall lack of agreement between reviewers and authors persisted even in a second inter-rater reliability analysis in which studies were categorized into three risk-of-bias groups. Of note, the difference in NOS scores was 2 points or more in almost half of the studies. While one may argue that a one-point difference would have no practical implications, a two-point difference would likely be regarded as clinically important, given that it reflects a >20% (i.e., 2/9 points) difference in the assessment.
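For context, Cohen's kappa measures agreement beyond what would be expected by chance; a minimal sketch of the standard unweighted statistic (our analysis is summarized above only by its results) is

\[
\kappa = \frac{p_o - p_e}{1 - p_e},
\]

where \(p_o\) is the observed proportion of agreement between the two raters and \(p_e\) is the agreement expected by chance given each rater's marginal frequencies. Under the commonly used Landis and Koch benchmarks, values between 0 and 0.20 are conventionally interpreted as slight agreement, which is why even the highest item-level kappa of 0.15 indicates only slight agreement.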
One possible explanation for these differences is that the reviewers may not have had all of the information needed to assess the risk of bias reliably available from the published article. For example, when evaluating the representativeness of the exposed cohort, reviewers may treat the study population as truly representative, whereas the authors may know that the study group was not representative of the average population in the respective setting.
Methodological studies have found that reviewers may overestimate the risk of bias of randomized controlled trials (RCTs) because of unclear or missing details resulting from insufficient reporting by authors [5, 6]. In one of these studies, involving 98 RCTs, 55% failed to report allocation concealment and from 25% to 96% did not report blinding status (e.g., for participants, data analysts) [5]. However, once authors were contacted, it became apparent that appropriate concealment and blinding were in fact in place in many of these studies [5]. In other words, reviewers without the unpublished details from authors would have overestimated the risk of bias of RCTs, contrary to our findings when using the NOS for observational studies. One possible explanation might be that the NOS allows for more subjectivity than tools for assessing RCTs: a lack of information in an RCT automatically results in a judgment of higher risk of bias, whereas the NOS requires the reviewer to decide subjectively on the risk of bias for each item based on the information available in the report. Others have also found the NOS tool’s decision rules and guidelines to be vague and difficult to use [3]. For example, the difference between a “structured interview” and a “written self-report” was difficult to determine when the study used a structured, validated survey that was completed independently by study participants [3]. We found similar ambiguity for the item ascertainment of exposure, where reviewers identified 59 of the 65 cohort studies as using secure records (i.e., one point assigned), whereas 44 of the 65 authors reported having used medical records (i.e., no points assigned).
Inter-rater reliability
The developers of the NOS evaluated the tool for face and criterion validity by comparing it with an established scale by Downs and Black [11]. Ten cohort and ten case–control studies on the association between breast cancer and hormone replacement therapy were examined. Criterion validity showed strong agreement for cohort studies (intra-class correlation coefficient [ICC] = 0.88) and moderate agreement for case–control studies (ICC = 0.62). Inter-rater reliability was also high for both cohort (ICC = 0.94) and case–control studies (ICC = 0.82).
To our knowledge, only two studies other than those conducted by the developers of the tool have examined the inter-rater reliability of the NOS [3, 4]. Contrary to the developers’ findings, both found overall low inter-rater reliability. In a study by Hartling [3], two reviewers applied the NOS to 131 cohort studies included in eight meta-analyses on different medical topics. Inter-rater reliability between reviewers ranged from poor (K = −0.06, 95% CI = −0.20, 0.07) to substantial (K = 0.68, 95% CI = 0.47, 0.89), although eight of the nine NOS items had K values <0.50 [3]. Reliability for the overall NOS score was fair (K = 0.29, 95% CI = 0.10, 0.47). A similar lack of agreement was found in a study applying the NOS to observational studies of cognitive impairment following electroconvulsive therapy for major depressive disorder [4]. Using inexperienced student raters, inter-rater reliability ranged from K = −0.14 (95% CI = −0.28, 0.00) to K = 0.39 (95% CI = −0.02, 0.81) for cohort studies, and from K = −0.20 (95% CI = −0.49, 0.09) to K = 1.00 (95% CI = 1.00, 1.00) for case–control studies. All nine NOS items had K values <0.50 for cohort studies, and seven of the nine items for case–control studies. However, the findings of the latter study may partly explain the disagreement between methodologically trained reviewers and study authors: in our study, too, the disagreement between raters may have been attributable to the authors’ lack of experience in applying the NOS.
Limitations
One limitation of our study was the relatively small sample size of 65 cohort studies. With a response rate of 36%, selection bias may be present, as authors who responded to the survey might have a different level of expertise and interest in research methodology than non-responders. In addition, all included studies came from a single field of medicine. Notably, however, the reviewers’ risk-of-bias assessments were similar for the included and excluded studies. It remains unclear whether the difference between the reviewers’ and authors’ assessments was due to the different information available to them, or whether it resulted from the authors’ lack of familiarity with the NOS tool. Given that agreement on the one item that would not need to be downgraded for lack of information (i.e., representativeness of the exposed cohort) was as low as agreement on items that would be expected to be downgraded for lack of reporting, subjective interpretation was probably the primary driver of the low agreement between reviewers and authors. This emphasizes, as suggested by Hartling [3] and Oremus [4], that training and detailed guidance are needed to apply the NOS appropriately. Although the survey explained in detail how the assessment should be conducted, the authors’ lack of familiarity with the NOS tool may have negatively affected their performance.
Another limitation concerns the nature of the inter-rater reliability statistic itself: some have considered kappa to be an overly conservative measure of agreement [12]. Because the kappa value depends on the observed frequencies of each category, it may underestimate agreement when one category predominates. An example was the item selection of the non-exposed cohort: in 82% of studies, both reviewers and authors agreed that the non-exposed were drawn from the same population as the exposed (i.e., one point assigned), but there was no study in which both raters agreed that the non-exposed were not drawn from the same population, which resulted in an underestimation of agreement. A larger sample of studies might have circumvented this problem.
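To illustrate this prevalence effect, consider a hypothetical two-by-two table consistent with the figures above, in which both raters assign one point in 82% of studies, both never assign zero points, and the remaining 18% of disagreements are split evenly between the two raters (the even split is assumed purely for illustration; the actual marginal frequencies are not reproduced here). Each rater then assigns one point in 91% of studies, so

\[
p_o = 0.82,\qquad p_e = 0.91^2 + 0.09^2 \approx 0.836,\qquad \kappa = \frac{0.82 - 0.836}{1 - 0.836} \approx -0.10.
\]

Despite 82% raw agreement, kappa is close to zero (here slightly negative), because nearly all of the observed agreement is already expected by chance under the highly skewed marginal frequencies.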