Item generation
The systematic reviews produced a list of 28 possible items for inclusion in the quality assessment tool. These are shown in Table 1 [see Additional file 1] together with the results of the systematic reviews on sources of bias and variation, and existing quality assessment tools. The evidence from the review on sources of bias and variation was summarised as the number of studies reporting empirical evidence (E), theoretical evidence (T), or absence (A) of bias or variability. The number of studies providing each type of evidence of bias or variability is shown in columns 2–4 of Table 1 [see Additional file 1]. The results from the review of existing quality assessment tools were summarised as the proportion of tools covering each item. The proportions were grouped into four categories: I (75–100%), II (50–74%), III (25–49%) and IV (0–24%), and are shown in the final column of Table 1 [see Additional file 1]. For some items, evidence from the reviews was only available in combination with other items rather than for each item individually, e.g. setting and disease prevalence and severity. For these items the evidence for the combination is provided. Table 1 [see Additional file 1] also shows the item on the QUADAS tool to which each item in this table refers. Evidence from the first review was not available for a number of items; these items were classed as "na".
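The banding of coverage proportions into categories I–IV described above is a simple threshold rule. As an illustration only (the function name is ours, not part of the published methods), it can be sketched as:

```python
def coverage_category(proportion):
    """Map the proportion of existing tools covering an item (0-100%)
    to the four bands used in Table 1: I (75-100%), II (50-74%),
    III (25-49%), IV (0-24%). Illustrative sketch only."""
    if not 0 <= proportion <= 100:
        raise ValueError("proportion must be between 0 and 100")
    if proportion >= 75:
        return "I"
    if proportion >= 50:
        return "II"
    if proportion >= 25:
        return "III"
    return "IV"
```

For example, an item covered by 80% of existing tools falls into category I, while one covered by 10% falls into category IV.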
Assessment of face validity: The Delphi procedure
Nine of the eleven people invited to take part in the Delphi procedure agreed to do so. The names of the panel members are listed at the end of this paper.
Delphi Round 1
Eight of the nine people who agreed to take part in the procedure returned completed questionnaires. The ninth panel member did not have time to take part in this round. Following the results of this round, six items were selected for inclusion, one item was removed from the tool, and the remaining items were put forward to be re-rated as part of round 2. Items selected for inclusion were:
1. Appropriate selection of patient spectrum
2. Appropriate reference standard
3. Absence of partial verification bias
4. Absence of review bias (both test and diagnostic)
5. Clinical review bias
6. Reporting of uninterpretable/indeterminate/intermediate results
The item removed from the tool was:
1. Test utility
Panel members made a number of suggestions regarding rephrasing of items. We considered these and made changes where appropriate. Based on some of the comments received we added an additional item to the category "Spectrum composition". This item was "What was the study design?". This item was rated for inclusion in the tool as part of round 2.
Delphi Round 2
Of the nine people invited to take part in round 2, eight returned completed questionnaires. Based on the results of this round, a further four items were selected for inclusion in the tool:
1. Absence of disease progression bias
2. Absence of differential verification bias
3. Absence of incorporation bias
4. Reporting of study withdrawals.
Panel members did not achieve consensus on a further five items; these were re-rated as part of round 3:
1. Reporting of selection criteria
2. Reporting of disease severity
3. Description of index test execution
4. Description of reference standard execution
5. Independent derivation of cut-off points
All other items, including the new item added based on feedback from round 1, were excluded from the process. Based on the comments from round 2, we proposed the following additional items which were included in the round 3 questionnaire:
1. Are there other aspects of the design of this study which cause concern about whether or not it will correctly estimate test accuracy?
2. Are there other aspects of the conduct of this study which cause concern about whether or not it will correctly estimate test accuracy?
3. Are there special issues concerning patient selection which might invalidate test results?
4. Are there special issues concerning the conduct of test which might invalidate test results?
Since none of the panel members were in favour of highlighting a number of key items in the quality assessment tool, this approach was not followed. At this stage, five of the panel members reported that they endorsed the Delphi procedure so far, one did not, and two were unclear. The member who did not endorse the Delphi procedure stated: "I fundamentally believe that it is not possible to develop a reliable discriminatory diagnostic assessment tool that will apply to all, or even the majority of diagnostic test studies." One of the comments from a panel member who was "unclear" also related to the problem of producing a quality assessment tool that applies to all diagnostic accuracy studies. The other related to the process used to derive the initial list of items and the problems of suggesting additional items. All panel members agreed to let the steering group produce the background document to accompany the tool. The feedback suggested that there was some confusion regarding the proposed validation methods. These were clarified and re-rated as part of round 3.
Delphi Round 3
All nine panel members invited to take part in round 3 returned completed questionnaires. Agreement was reached on items to be included in the tool following the results of this round.
Three of the five items re-rated as part of this round were selected for inclusion. These were:
1. Reporting of selection criteria
2. Description of index test execution
3. Description of reference standard execution
The other two items and the additional items rated as part of this round were not included in the tool.
The panel members agreed with the scoring system proposed by the steering group. Each of the proposed validation steps was approved by at least 7/9 of the panel members. These methods will therefore be used to validate the tool. Five of the panel members indicated that they would like to see the development of design- and topic-specific criteria. Of these, four stated that they would like to see this done via a Delphi procedure. The development of these elements will take place after the generic section of the tool has been evaluated.
At this stage, all but one of the panel members stated that they endorsed the Delphi procedure. This member remained unclear as to whether he/she endorsed the procedure and stated that "all my reservations still apply". These reservations related to earlier comments regarding the problems of developing a quality assessment tool which can be applied to all studies of diagnostic accuracy. Seven of the panel members reported using the evidence provided by the systematic reviews to help decide which items to include in QUADAS. Of the two who did not use the evidence, one stated that (s)he was too busy; the other stated that there was no new information in the evidence. Seven of the panel members reported using the feedback from earlier rounds of the Delphi procedure. Of the two who did not, one stated that he/she was "not seeking conformity with other respondents"; the other did not explain why he or she did not use the feedback. The two panel members who did not use the feedback were different from the two who did not use the evidence provided by the reviews. These responses suggest that the evidence provided by the reviews did contribute towards the production of QUADAS.
Delphi Round 4
The fourth and final round did not include a questionnaire, although panel members were given the opportunity to feed back any additional comments that they had. Only one panel member provided further feedback. This related mainly to the broadness of the first item included in the tool, and the fact that several items relate to the reporting of the study rather than directly to the quality of the study.
The QUADAS tool
The tool is structured as a list of 14 questions which should each be answered "yes", "no", or "unclear". The tool is presented in Table 2. A more detailed description of each item together with a guide on how to score each item is provided below.
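The structure described above, fourteen questions each answered "yes", "no", or "unclear", can be represented as a simple data structure. The sketch below is purely illustrative: the abbreviated item labels and function name are ours, not part of the published tool.

```python
# Illustrative sketch only: one QUADAS assessment as data.
# Item wordings are abbreviated from the 14 questions in Table 2.
ANSWERS = {"yes", "no", "unclear"}

QUADAS_ITEMS = [
    "Representative spectrum",
    "Selection criteria described",
    "Acceptable reference standard",
    "Acceptable delay between tests",
    "Partial verification avoided",
    "Differential verification avoided",
    "Incorporation avoided",
    "Index test execution described",
    "Reference standard execution described",
    "Index test results blinded",
    "Reference standard results blinded",
    "Relevant clinical information available",
    "Uninterpretable results reported",
    "Withdrawals explained",
]

def record_assessment(answers):
    """Validate one study's assessment: 14 answers, each yes/no/unclear,
    and return them keyed by item label."""
    if len(answers) != len(QUADAS_ITEMS):
        raise ValueError("expected one answer per QUADAS item")
    for item, answer in zip(QUADAS_ITEMS, answers):
        if answer not in ANSWERS:
            raise ValueError(f"invalid answer {answer!r} for {item!r}")
    return dict(zip(QUADAS_ITEMS, answers))
```

Note that the tool deliberately has no summary score; the representation above keeps the per-item answers rather than collapsing them into a total.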
Users' guide to QUADAS
1. Was the spectrum of patients representative of the patients who will receive the test in practice?
a. What is meant by this item
Differences in demographic and clinical features between populations may produce measures of diagnostic accuracy that vary considerably; this is known as spectrum bias. It refers more to the generalisability of results than to the possibility that the study may produce biased results. Reported estimates of diagnostic accuracy may have limited clinical applicability (generalisability) if the spectrum of tested patients is not similar to the patients in whom the test will be used in practice. The spectrum of patients refers not only to the severity of the underlying target condition, but also to demographic features and to the presence of differential diagnoses and/or co-morbidity. It is therefore important that diagnostic test evaluations include an appropriate spectrum of patients for the test under investigation and also that a clear description is provided of the population actually included in the study.
b. Situations in which this item does not apply
This item is relevant to all studies of diagnostic accuracy and should always be included in the quality assessment tool.
c. How to score this item
Studies should score "yes" for this item if you believe, based on the information reported or obtained from the study's authors, that the spectrum of patients included in the study was representative of those in whom the test will be used in practice. The judgement should be based on both the method of recruitment and the characteristics of those recruited. Studies which recruit a group of healthy controls and a group known to have the target disorder should be scored as "no" on this item in nearly all circumstances. Reviewers should pre-specify in the protocol of the review what spectrum of patients would be acceptable, taking factors such as disease prevalence and severity, age, and sex into account. If you think that the population studied does not fit into what you specified as acceptable, the item should be scored as "no". If there is insufficient information available to make a judgement then it should be scored as "unclear".
2. Were selection criteria clearly described?
a. What is meant by this item
This refers to whether the study provided a clear definition of the inclusion and exclusion criteria used for entry into the study.
b. Situations in which this item does not apply
This item is relevant to all studies of diagnostic accuracy and should always be included in the quality assessment tool.
c. How to score this item
If you think that all relevant information regarding how participants were selected for inclusion in the study has been provided then this item should be scored as "yes". If study selection criteria are not clearly reported then this item should be scored as "no". In situations where selection criteria are partially reported and you feel that you do not have enough information to score this item as "yes", then it should be scored as "unclear".
3. Is the reference standard likely to correctly classify the target condition?
a. What is meant by this item
The reference standard is the method used to determine the presence or absence of the target condition. To assess the diagnostic accuracy of the index test its results are compared with the results of the reference standard; subsequently indicators of diagnostic accuracy can be calculated. The reference standard is therefore an important determinant of the diagnostic accuracy of a test. Estimates of test performance are based on the assumption that the index test is being compared to a reference standard which is 100% sensitive and specific. If there are any disagreements between the reference standard and the index test then it is assumed that the index test is incorrect. Thus, from a theoretical point of view the choice of an appropriate reference standard is very important.
b. Situations in which this item does not apply
This item is relevant to all studies of diagnostic accuracy and should always be included in the quality assessment tool.
c. How to score this item
If you believe that the reference standard is likely to correctly classify the target condition or is the best method available, then this item should be scored "yes". Making a judgement as to the accuracy of the reference standard may not be straightforward. You may need experience of the topic area to know whether a test is an appropriate reference standard, or if a combination of tests are used you may have to consider carefully whether these were appropriate. If you do not think that the reference standard was likely to have correctly classified the target condition then this item should be scored as "no". If there is insufficient information to make a judgement then this should be scored as "unclear".
4. Is the time period between reference standard and index test short enough to be reasonably sure that the target condition did not change between the two tests?
a. What is meant by this item
Ideally the results of the index test and the reference standard are collected on the same patients at the same time. If this is not possible and a delay occurs, misclassification due to spontaneous recovery or to progression to a more advanced stage of disease may occur. This is known as disease progression bias. The length of the time period which may cause such bias will vary between conditions. For example, a delay of a few days is unlikely to be a problem for chronic conditions; for many infectious diseases, however, a delay of only a few days between performance of the index test and reference standard may be important. This type of bias may also occur in chronic conditions in which the reference standard involves clinical follow-up of several years.
b. Situations in which this item does not apply
This item is likely to apply in most situations.
c. How to score this item
Whether to score this item as "yes" depends on the target condition. For conditions that progress rapidly, even a delay of several days may be important. For such conditions this item should be scored "yes" only if the delay between the performance of the index test and reference standard is very short, a matter of hours or days. However, for chronic conditions, disease status is unlikely to change in a week, a month, or even longer. In such conditions longer delays between performance of the index test and reference standard may be scored as "yes". You will have to make judgements regarding what is considered "short enough". You should think about this before starting work on a review, and define what you consider to be "short enough" for the specific topic area that you are reviewing. If you think the time period between the performance of the index test and the reference standard was sufficiently long that disease status may have changed between the performance of the two tests then this item should be scored as "no". If insufficient information is provided this should be scored as "unclear".
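The scoring rule above amounts to comparing the reported delay against a threshold the reviewer pre-specifies for the topic area. A minimal sketch, assuming the delay and threshold are both expressed in days (the function name and parameters are ours, purely for illustration):

```python
def score_delay_item(delay_days, max_acceptable_days):
    """Score QUADAS item 4 against a pre-specified, topic-specific
    threshold. Illustrative sketch only.

    delay_days: reported delay between index test and reference
        standard, or None if the study does not report it
    max_acceptable_days: the "short enough" limit the reviewer
        defined in the review protocol for this condition
    """
    if delay_days is None:
        return "unclear"  # insufficient information reported
    if delay_days <= max_acceptable_days:
        return "yes"      # disease status unlikely to have changed
    return "no"           # delay long enough that status may have changed
```

For a rapidly progressing infection a reviewer might set the threshold at a few days, whereas for a chronic condition a threshold of weeks or months could be defensible.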
5. Did the whole sample, or a random selection of the sample, receive verification using a reference standard?
a. What is meant by this item
Partial verification bias (also known as work-up bias, (primary) selection bias, or sequential ordering bias) occurs when not all of the study group receive confirmation of the diagnosis by the reference standard. If the results of the index test influence the decision to perform the reference standard then biased estimates of test performance may arise. If patients are randomly selected to receive the reference standard the overall diagnostic performance of the test is, in theory, unchanged. In most cases however, this selection is not random, possibly leading to biased estimates of the overall diagnostic accuracy.
b. Situations in which this item does not apply
Partial verification bias generally only occurs in diagnostic cohort studies in which patients are tested by the index test prior to the reference standard. In situations where the reference standard is assessed before the index test, you should first decide whether there is a possibility that verification bias could occur and, if not, how to score this item. This may depend on how quality will be incorporated in the review. There are two options: either score this item as "yes", or remove it from the quality assessment tool.
c. How to score this item
If it is clear from the study that all patients, or a random selection of patients, who received the index test went on to receive verification of their disease status using a reference standard then this item should be scored as "yes". This item should be scored as "yes" even if the reference standard was not the same for all patients. If some of the patients who received the index test did not receive verification of their true disease state, and the selection of patients to receive the reference standard was not random, then this item should be scored as "no". If this information is not reported by the study then it should be scored as "unclear".
6. Did patients receive the same reference standard regardless of the index test result?
a. What is meant by this item
Differential verification bias occurs when some of the index test results are verified by a different reference standard. This is especially a problem if these reference standards differ in their definition of the target condition, for example histopathology of the appendix and natural history for the detection of appendicitis. This usually occurs when patients testing positive on the index test receive a more accurate, often invasive, reference standard than those with a negative test result. The link (correlation) between a particular (negative) test result and verification by a less accurate reference standard will affect measures of test accuracy in a similar way to partial verification bias, but less severely.
b. Situations in which this item does not apply
Differential verification bias is possible in all types of diagnostic accuracy studies.
c. How to score this item
If it is clear that patients received verification of their true disease status using the same reference standard then this item should be scored as "yes". If some patients received verification using a different reference standard this item should be scored as "no". If this information is not reported by the study then it should be scored as "unclear".
7. Was the reference standard independent of the index test (i.e. the index test did not form part of the reference standard)?
a. What is meant by this item
When the result of the index test is used in establishing the final diagnosis, incorporation bias may occur. This incorporation will probably increase the amount of agreement between index test results and the outcome of the reference standard, and hence overestimate the various measures of diagnostic accuracy. It is important to note that knowledge of the results of the index test alone does not automatically mean that these results are incorporated in the reference standard. For example, a study investigating MRI for the diagnosis of multiple sclerosis could have a reference standard composed of clinical follow-up, CSF analysis and MRI. In this case the index test forms part of the reference standard. If the same study used a reference standard of clinical follow-up, and the results of the MRI were known when the clinical diagnosis was made but were not specifically included as part of the reference standard, then the index test does not form part of the reference standard.
b. Situations in which this item does not apply
This item will only apply when a composite reference standard is used to verify disease status. In such cases it is essential that a full definition is provided of how disease status was verified and which tests formed part of the reference standard. For studies in which a single reference standard is used, this item will not be relevant and should either be scored as "yes" or be removed from the quality assessment tool.
c. How to score this item
If it is clear from the study that the index test did not form part of the reference standard then this item should be scored as "yes". If it appears that the index test formed part of the reference standard then this item should be scored as "no". If this information is not reported by the study then it should be scored as "unclear".
8. Was the execution of the index test described in sufficient detail to permit replication of the test?
9. Was the execution of the reference standard described in sufficient detail to permit its replication?
a. What is meant by these items
A sufficient description of the execution of the index test and the reference standard is important for two reasons. Firstly, variation in measures of diagnostic accuracy can sometimes be traced back to differences in the execution of the index test or reference standard. Secondly, a clear and detailed description (or citations) is needed to implement a certain test in another setting. If tests are executed in different ways then this would be expected to impact on test performance. The extent to which this would be expected to affect results would depend on the type of test being investigated.
b. Situations in which these items do not apply
These items are likely to apply in most situations.
c. How to score these items
If the study reports sufficient details or citations to permit replication of the index test and reference standard then these items should be scored as "yes". In other cases they should be scored as "no". In situations where details of test execution are partially reported and you feel that you do not have enough information to score an item as "yes", it should be scored as "unclear".
10. Were the index test results interpreted without knowledge of the results of the reference standard?
11. Were the reference standard results interpreted without knowledge of the results of the index test?
a. What is meant by these items
These items are similar to "blinding" in intervention studies. Interpretation of the results of the index test may be influenced by knowledge of the results of the reference standard, and vice versa. This is known as review bias, and may lead to inflated measures of diagnostic accuracy. The extent to which this may affect test results will be related to the degree of subjectivity in the interpretation of the test result. The more subjective the interpretation, the more likely it is that the interpreter can be influenced by the results of the reference standard in interpreting the index test, and vice versa. It is therefore important to consider the topic area that you are reviewing and to determine whether the interpretation of the index test or reference standard could be influenced by knowledge of the results of the other test.
b. Situations in which these items do not apply
If, in the topic area that you are reviewing, the index test is always performed first, then interpretation of the results of the index test will usually be without knowledge of the results of the reference standard. Similarly, if the reference standard is always performed first (for example, in a diagnostic case-control study) then the results of the reference standard will be interpreted without knowledge of the index test. However, if test results can be interpreted at a later date, after both the index test and reference standard have been completed, then it is still important for a study to provide a description of whether the interpretation of each test was performed blind to the results of the other test. In situations where one form of review bias does not apply there are two possibilities: either score the relevant item as "yes" or remove this item from the list. If tests are entirely objective in their interpretation then test interpretation is not susceptible to review bias. In such situations review bias may not be a problem and these items can be omitted from the quality assessment tool. Another situation in which this form of bias may not apply is when test results are interpreted in an independent laboratory. In such situations it is unlikely that the person interpreting the test results will have knowledge of the results of the other test (either index test or reference standard).
c. How to score these items
If the study clearly states that the test results (index or reference standard) were interpreted blind to the results of the other test then these items should be scored as "yes". If this does not appear to be the case they should be scored as "no". If this information is not reported by the study then it should be scored as "unclear".
12. Were the same clinical data available when test results were interpreted as would be available when the test is used in practice?
a. What is meant by this item
The availability of clinical data during interpretation of test results may affect estimates of test performance. In this context, clinical data are defined broadly to include any information relating to the patient obtained by direct observation, such as age, sex and symptoms. Knowledge of such factors can influence the diagnostic test result if the test involves an interpretative component. If clinical data will be available when the test is interpreted in practice then these should also be available when the test is evaluated. If, however, the index test is intended to replace other clinical tests, then clinical data should not be available, or should be available for all index tests. It is therefore important to determine what information will be available when test results are interpreted in practice before assessing studies for this item.
b. Situations in which this item does not apply
If the interpretation of the index test is fully automated and involves no interpretative judgement then this item may not be relevant and can be omitted from the quality assessment tool.
c. How to score this item
If clinical data would normally be available when the test is interpreted in practice and similar data were available when interpreting the index test in the study then this item should be scored as "yes". Similarly, if clinical data would not be available in practice and these data were not available when the index test results were interpreted then this item should be scored as "yes". If this is not the case then this item should be scored as "no". If this information is not reported by the study then it should be scored as "unclear".
13. Were uninterpretable/intermediate test results reported?
a. What is meant by this item
A diagnostic test can produce an uninterpretable/indeterminate/intermediate result with varying frequency depending on the test. These problems are often not reported in diagnostic accuracy studies, with the uninterpretable results simply removed from the analysis. This may lead to a biased assessment of the test characteristics. Whether bias will arise depends on the possible correlation between uninterpretable test results and the true disease status. If uninterpretable results occur randomly and are not related to the true disease status of the individual then, in theory, they should not have any effect on test performance. Whatever the cause of uninterpretable results, it is important that these are reported so that their impact on test performance can be determined.
b. Situations in which this item does not apply
This item is relevant to all studies of diagnostic accuracy and should always be included in the quality assessment tool.
c. How to score this item
If it is clear that all test results, including uninterpretable/indeterminate/intermediate results, are reported then this item should be scored as "yes". If you think that such results occurred but have not been reported then this item should be scored as "no". If it is not clear whether all study results have been reported then this item should be scored as "unclear".
14. Were withdrawals from the study explained?
a. What is meant by this item
Withdrawals occur when patients leave the study before the results of either or both of the index test and reference standard are known. If patients lost to follow-up differ systematically from those who remain, for whatever reason, then estimates of test performance may be biased.
b. Situations in which this item does not apply
This item is relevant to all studies of diagnostic accuracy and should always be included in the quality assessment tool.
c. How to score this item
If it is clear what happened to all patients who entered the study, for example if a flow diagram of study participants is reported, then this item should be scored as "yes". If it appears that some of the participants who entered the study did not complete the study, i.e. did not receive both the index test and reference standard, and these patients were not accounted for then this item should be scored as "no". If it is not clear whether all patients who entered the study were accounted for then this item should be scored as "unclear".