  • Research article
  • Open access

Development and evaluation of a quality score for abstracts



Background

The evaluation of abstracts for scientific meetings has been shown to suffer from poor interobserver reliability. A measure was developed to assess the formal quality of abstract submissions in a standardized way.


Methods

Item selection was based on scoring systems for full reports, taking into account published guidelines for structured abstracts. Interrater agreement was examined using a random sample of submissions to the American Gastroenterological Association, stratified for research type (n = 100, 1992–1995). For construct validity, the association of formal quality with acceptance for presentation was examined. A questionnaire to expert reviewers evaluated sensibility items, such as ease of use and comprehensiveness.


Results

The index comprised 19 items. The summary quality scores showed good interrater agreement (intraclass correlation coefficient 0.60–0.81). Good abstract quality was associated with acceptance for presentation at the meeting. The instrument was found acceptable by expert reviewers.


Conclusions

A quality index was developed for the evaluation of scientific meeting abstracts and was shown to be reliable, valid and useful.



Background

The presentation of research at scientific meetings is an essential part of scientific communication. The choice of which research will be presented at a meeting is usually based on peer-review of abstracts submitted by interested investigators. Only about 50% of the research projects initially submitted as conference abstracts will eventually be published as full articles in peer-reviewed journals and full publication may not occur for several years [1, 2]. As a result, a published abstract from a scientific meeting is often the only permanent source of information available on the methodology and results of a research project.

When writing an abstract, investigators are limited in the amount of detail on the study's methodology and results that they can include. This may partly explain the poor observer agreement and the several kinds of bias that have been demonstrated in the abstract selection procedures of various medical specialty meetings [3–9]. However, there is evidence that the quality of the presentation of information in an abstract is associated with the scientific quality of the research [10]. Therefore, standardized methods for assessing formal abstract quality may help improve the reliability of abstract selection and, if used as a checklist, result in more informative and useful abstracts.

Quality checklists and scoring instruments have been published for use with full reports of clinical research [11, 12]. No such score is available for the evaluation of abstracts. Also, there is little information addressing the reporting of the methodology and results in basic science research. The objective of this study was to develop a reliable instrument to assess the quality of meeting abstracts that would be applicable to a wide variety of research types, including both clinical and basic science.



Methods

We used abstracts submitted to the 1992 to 1995 annual meetings of the American Gastroenterological Association (AGA) to develop and test a quality scoring instrument for abstracts. Each year, several thousand abstracts on all aspects of research in gastroenterology are submitted to this meeting. All submitted abstracts were published in a yearly supplement to the journal Gastroenterology, whether or not they were accepted for presentation at the meeting. Abstracts selected for oral or poster presentation were marked; however, the mode of presentation was not specified. Abstracts were submitted on a standard form that limited the length to between 200 and 350 words, depending on font size and inclusion of graphs. A structured format (Background & Aims, Methods, Results, Conclusion) was not mandatory but was commonly used.

For the purpose of this study, we divided the abstracts into different research types. Basic science studies (BSS) comprised all studies, done in a laboratory setting, in which the unit of analysis was not an intact human. Controlled clinical trials (CCT) included prospective studies of the effects of diagnostic tests or therapeutic interventions using parallel or crossover controls. All other studies in humans were categorized as "other clinical research" (OCR); these included human physiology experiments, epidemiological studies and uncontrolled therapeutic studies.

A large sample (n = 1000) was selected for use in a follow-up study on publication bias, using computer-generated random numbers applied to a database containing all 17 205 submitted abstracts. This sampling was stratified by research type to increase the proportion of controlled trials. For instrument development and testing, a random subsample (n = 100) was drawn from this sample; it comprised 42 CCTs, 39 OCR and 19 BSS. The abstracts were assessed in a prespecified order.
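The stratified draw described above can be sketched as follows. This is a simplified illustration: the field name, strata sizes and fixed seed are assumptions for the example, not the original sampling procedure.

```python
import random

def stratified_sample(abstracts, per_stratum, seed=0):
    """Draw a random sample within each research-type stratum.

    abstracts: list of dicts, each with a 'type' key ('CCT', 'OCR' or 'BSS').
    per_stratum: dict mapping research type -> number to draw from that stratum.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = []
    for research_type, n in per_stratum.items():
        pool = [a for a in abstracts if a["type"] == research_type]
        sample.extend(rng.sample(pool, n))  # sampling without replacement
    return sample
```

Oversampling a stratum (here, controlled trials) simply means requesting a larger `n` for that stratum than its share of the database would give under simple random sampling.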

Item generation and reduction

Formal abstract quality was conceptualized as the combination of the methodological strength of the research and the clarity of its presentation. Methodological strength comprised those methods used to improve the internal validity of the study, as reported in the abstract.

Item selection and instrument structure were based on previously published instruments used to assess the quality of full manuscripts in therapeutic trials and epidemiological research [11, 13–15]. A measure developed by Cho and Bero was particularly influential because of its applicability to different types of clinical research [16]. We could not identify any work on quality scoring of basic science reports.

When adapting this scale for use with abstracts, we took into account published guidelines for the composition of structured abstracts for original research in biomedical journals [17, 18]. Similar guidelines for meeting abstracts and basic science reports had not been published. For content validity, the list of items was discussed repeatedly, face to face, with one researcher each from laboratory-based medicine (J. Wallace, Calgary), health care research (RJ Hilsden), and clinical trials (LR Sutherland). The instrument was further modified following a pilot study of 80 abstracts that were scored by two raters and subsequently discussed in detail. Items were simplified or dropped if consensus decisions were difficult or if sufficient information was absent from the majority of abstracts, unless the item was considered essential for the evaluation of study validity by at least one of the reviewers. Input on the final instrument was sought from researchers in the three fields from across Canada by sending them the instrument and a questionnaire on the appropriateness of the items included (Table 2). Those surveyed suggested no further items to add or drop.

The resulting instrument allowed for the calculation of a summary score with a range from 0 to 1. Summary scores are reported as mean scores with standard errors of the mean (SEM).
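As an illustration of how such a 0-to-1 summary score can be computed, the sketch below scores an abstract as the fraction of applicable items fulfilled. The item names and the handling of not-applicable items are assumptions for the example; the actual scoring rules are defined by the instrument and manual (see appendix).

```python
def summary_score(item_scores):
    """Summary quality score in the range 0 to 1.

    item_scores: dict mapping item name -> 1 (criterion met),
    0 (not met), or None (item not applicable to this study design).
    Not-applicable items are excluded from numerator and denominator,
    so abstracts of different designs remain comparable.
    """
    applicable = [v for v in item_scores.values() if v is not None]
    if not applicable:
        raise ValueError("no applicable items for this abstract")
    return sum(applicable) / len(applicable)
```

With this convention, a randomized-trial item such as "method of randomization described" is simply marked `None` for an observational study rather than counting against it.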

Interrater and test-retest reliability

We tested interrater reliability on 100 abstracts scored by two raters with training in gastroenterology and epidemiology (AT and RJH). Test-retest reliability was determined through duplicate scoring of 85 abstracts by the same observer (AT) at an interval of four to six weeks. The raters were blinded to the names and institutions of the authors and to whether the abstract had been selected for presentation. For each item, the kappa coefficient was used to assess the degree of agreement beyond what might be expected by chance [19]. A kappa score of 0.41 to 0.60 was considered to represent moderate agreement, and a kappa score of 0.61 to 0.80 substantial agreement [20]. Agreement between summary scores was assessed using the intraclass correlation coefficient [21]. This measure is the ratio of between-abstract variance to overall variance. A ratio (Ri) close to 1 is found when the variance is almost exclusively due to differences between the abstracts, while low ratios represent a strong influence of random error and/or observer variance on the overall variance. There is no universally accepted cutoff of Ri for good agreement, as this depends on the number of subjects studied and the context in which decisions based on the instrument will be made [21]. For this study, moderate agreement was defined as Ri > 0.6 and good agreement as Ri > 0.8.
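Both agreement statistics can be computed from first principles. The sketch below (plain Python, standard library only) implements Cohen's kappa for a single categorical item and a one-way intraclass correlation for paired summary scores, matching the description of Ri as the ratio of between-abstract variance to overall variance; the data shapes are assumptions for illustration.

```python
import statistics
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters on one categorical item."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    p1, p2 = Counter(rater1), Counter(rater2)
    # Agreement expected by chance if the two raters were independent
    expected = sum((p1[c] / n) * (p2[c] / n) for c in set(p1) | set(p2))
    return (observed - expected) / (1 - expected)

def icc_oneway(score_pairs):
    """One-way random-effects ICC for paired summary scores.

    score_pairs: list of (rater1_score, rater2_score), one pair per abstract.
    Approaches 1 when total variance is dominated by true differences
    between abstracts rather than rater disagreement.
    """
    n, k = len(score_pairs), 2
    grand = statistics.mean(x for pair in score_pairs for x in pair)
    row_means = [statistics.mean(pair) for pair in score_pairs]
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)  # between-abstract mean square
    msw = sum((x - statistics.mean(p)) ** 2
              for p in score_pairs for x in p) / (n * (k - 1))    # within-abstract mean square
    return (msb - msw) / (msb + (k - 1) * msw)
```

Kappa is applied per item (categorical judgments), while the ICC is applied to the continuous summary scores, mirroring the split in the analysis above.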

Construct validity

To examine construct validity, we hypothesized that higher abstract quality scores should correlate with abstract acceptance by the AGA. To test this hypothesis, the mean quality scores of rejected and accepted abstracts were compared using Student t-tests for independent samples.
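That comparison corresponds to a pooled-variance Student t statistic for two independent groups; a minimal sketch (the data in the test are illustrative only, not the study's scores):

```python
import math
import statistics

def students_t_independent(group1, group2):
    """Student's t-test for independent samples with pooled variance.

    Returns (t, degrees_of_freedom); the p-value would then be read from
    the t distribution with that df (e.g. via a statistics package).
    """
    n1, n2 = len(group1), len(group2)
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)  # sample variances
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2
```

Under the construct-validity hypothesis, the accepted-abstract group should yield a positive t when entered as `group1`.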


Sensibility

The sensibility of an instrument, as defined by Feinstein, comprises different aspects of usability, such as comprehensibility, ease of use, face and content validity, and suitability for the purpose at hand [22]. A sensibility questionnaire was adapted from Oxman et al. [23, 24] to assess the usefulness of the newly developed score (see appendix). This questionnaire was sent to 24 Canadian researchers in gastroenterology and epidemiology. The researchers rated each item on a seven-point scale (1 = unacceptable, 7 = criterion fully met). We considered a mean rating of >5 to indicate general satisfaction with, and acceptability of, an item.


Results

The review of the literature and expert opinion resulted in the generation of 24 quality assessment items pertaining to the study purpose, methods, reporting of results, and conclusions. Following the pilot study, the instrument was reduced to 19 items. Two items on research methods were discarded because they were never applicable to the abstracts evaluated (1. whether random selection from the study population was performed; 2. if yes, which method of random selection was used). A priori sample size considerations were also never reported; instead, for use in abstracts, the reporting of a posteriori sample size considerations or power calculations for negative results was considered sufficient. Two pairs of items were combined into single items. First, "control for confounding by the study design" and "control for confounding by the analysis" were combined into "confounding controlled for?". Second, because it was too difficult to distinguish between the inclusion/exclusion criteria and the baseline characteristics of the subjects, it was considered sufficient if either was included in the abstract. Although the method of randomization in randomized trials was never described, an item on this was retained because it was considered essential for the assessment of study quality. Following the pilot testing, several items were reworded for simplification and clarity. A detailed manual was written that provided the definitions used for each item (manual available from the authors). Using the instrument and manual, the average time required to score an abstract was 3 minutes 45 seconds (range, 2 minutes 30 seconds to 6 minutes). The final 19-item instrument is shown in the appendix.


For the 100 abstracts independently assessed by two raters, quality scores ranged from 0.28 to 0.95 and were approximately normally distributed. As shown in Table 1, interobserver agreement was highest for controlled clinical trials and lowest for basic science abstracts. Individual item analysis showed substantial agreement for the description of random allocation, investigator and subject blinding, appropriateness of sample size, reporting of attrition, and reporting of statistical tests. Moderate agreement was achieved for the appropriateness of the controls and the appropriateness of the statistical methods. Agreement was low for the remaining items (description of study objective, description and appropriateness of the design, method of subject selection, definition of outcome measure, control for confounding, exact p-values or confidence intervals, reporting of results, and conclusions supported by the results).

Table 1 Interrater reliability.
Table 2 Sensibility ratings

Test-retest reliability was good independent of research type, with identical mean scores for the first and second evaluations (0.57) and an intraclass correlation coefficient of 0.85. Individual items also had good test-retest reliability (range of kappa coefficients, 0.54 to 0.85).

Construct validity

In total, 67 of the 100 abstracts were accepted for presentation at the AGA meeting. These abstracts had significantly higher quality scores than the rejected abstracts (0.61 vs. 0.54, p = 0.001). Quality scores were higher for accepted than for rejected abstracts across all research types, but the difference was statistically significant only for clinical trials (n = 42, 0.67 vs. 0.52, p < 0.001).


Sensibility ratings were available from 15 reviewers (survey response rate 62.5%). As shown in Table 2, clinical and health care researchers found the instrument useful and acceptable for the purpose described. The reviewers felt that both the research methodology and quality of the report were adequately assessed by our instrument (Table 2). In general, basic scientists found the instrument less useful, largely because of the time and effort required. Both groups of reviewers considered the applicability to basic science abstracts to be a problem. The issue of the limited amount of information available from abstracts was also raised.


Discussion

We have developed a reliable instrument for assessing the quality of an abstract submitted to a scientific meeting. The instrument rates both the quality of the reporting and the quality of the research methods. While a variety of scales are available for the assessment of full reports of clinical trials, abstract evaluation has usually involved informal ad hoc scales or checklists [4, 18, 25, 26] or has been restricted to the quality of reporting only [18, 27, 28]. Furthermore, most abstract evaluation studies were based on summaries of published research.

The limited amount of information available from abstracts was the greatest challenge in the development of this instrument. The methodological quality of research can only be assessed to the extent that pertinent information is available in the report. There is controversy over how well the quality of research is reflected in a short report [25, 29, 30]. Furthermore, the quality of the report and the quality of the research are often intertwined. The experts in our sensibility survey felt that our instrument adequately assessed both the research methodology and the quality of the report. However, we appreciate that the lack of a clear distinction between these two concepts of quality in our instrument may be unsatisfactory to some users, as may the restriction to formal aspects of scientific quality. Our instrument avoids any judgment of content, such as originality, ethics or scientific relevance. We recommend that these factors be assessed separately, as suggested by other authors [16, 28, 31]. However, there is evidence that the formal quality of a research report reflects the content or overall quality of the research [28, 32]. This view is supported by the association of higher quality scores with abstract acceptance by the AGA.

A few important items may seem to be missing from our instrument. These are characteristics that were never reported in any of the abstracts, although they have been shown to be essential to the validity of study results. Most prominently, this concerns concealment of allocation in RCTs. Possibly, the importance of this item was not appreciated by abstract authors because the paper by Schulz et al., demonstrating the bias introduced by open allocation, was only published in 1995, i.e. after the end of the abstract submission period studied [33].

Generally, the use of summary scores should be viewed with caution. In the context of meta-analysis, individual assessment of the various methodological aspects has been shown to be more informative with respect to their effects on effect size. The excellent meta-analysis by Jüni et al. on quality scoring of clinical trials had not yet been published when we developed our instrument [12]. In contrast, a summary score is more helpful than component scores when a range of diverse designs and topics is evaluated, as few components are shared across them.

One of the objections raised in the sensibility survey was the time and effort needed to complete a checklist of 19 items. We appreciate that this list may seem daunting, especially as previous instruments for full reports were restricted to as few as three items. However, instruments such as the checklist used by Chalmers, the Delphi list, and the short instrument by Jadad were developed to assess full reports of randomized clinical trials in the context of meta-analysis [13, 14, 34]. This entails the possibility (in some cases, the necessity) of tailoring the instrument to the study subject at hand. In contrast, our aim was to develop an instrument applicable to a wider variety of research types and subjects. Furthermore, even for randomized controlled trials, a three-item score was shown not to be useful in the evaluation of abstract quality in a follow-up study of scientific abstracts by I. Chalmers [26], possibly because of insufficient information provided in abstracts. A more sensitive instrument is needed for abstracts. Also, other items are required for the evaluation of other types of research, such as adequate control for confounding in observational studies. Consequently, depending on the study design, only a limited number of items are applicable to each individual abstract. The average time of less than 4 minutes per abstract seemed a reasonable effort. However, this time does not include an evaluation of the scientific background of the study question or an appreciation of the importance of its conclusions for the scientific community.

Problems were encountered when we tried to tailor the instrument to basic science research reports. In fact, the results of both the reliability testing and the sensibility survey indicate that the instrument may be less suitable for this type of research. We are not aware of any literature on the formal assessment of quality in the design and reporting of basic science research. Therefore, some of the problems encountered were due to a lack of generally agreed upon criteria for the evaluation of this type of research. However, the main difficulty when reading basic science abstracts related to the omission of basic information such as the formulation of the research objective and clarification of the research design. Lower quality scores, an increased need for subjective judgments because of limited information, and lack of scale acceptance by basic science reviewers may be interrelated and may reflect, in general, a lesser interest in structured reporting in basic science abstracts.

The score was subsequently used successfully in a study examining the determinants of abstract acceptance and publication in gastroenterological research [35, 36]. High abstract quality scores were associated with a higher chance of acceptance for presentation at a meeting and, where abstracts were followed by full publication, with higher journal impact factors.


Conclusions

In summary, we have developed a reliable, valid and applicable instrument for evaluating the quality of scientific abstracts. While it is most useful in the clinical research setting, particularly for clinical trials, limitations may apply to its use in basic science. It may also be helpful as a checklist for the preparation of scientific abstracts, or serve as an instrument to compare abstract quality between meetings or over time. Other possible applications include adjunct use in abstract peer review. More work is needed to improve the instrument for use in basic science research and to assess its applicability to other medical specialties. Further research should also test the hypothesis that this instrument will reduce bias in the selection of abstracts for presentation.

Tables and Appendix

Tables and appendices are added at the end of the manuscript as follows:

Table 1: Interrater reliability

Table 2: Sensibility ratings

Appendix A: Quality scoring instrument (see additional file 1)

Appendix B: Sensibility questionnaire for reviewer survey (see additional file 2)


References

  1. Scherer RW, Langenberg P: Full publication of results initially presented in abstracts (Cochrane Methodology Review). The Cochrane Library. 2001, Oxford: Update Software (4)


  2. Scherer RW, Dickersin K, Langenberg P: Full publication of results initially presented in abstracts. A meta-analysis. JAMA. 1994, 272: 158-162. 10.1001/jama.272.2.158.


  3. Sacks JJ, Peterson DE: Improving conference abstract selection [letter]. Epidemiology. 1994, 5: 636-637.


  4. Conn HO: An experiment in blind program selection. Clin Res. 1974, 22: 128-134.


  5. Appleton DR, Kerr DN: Choosing the programme for an international congress. Br Med J. 1978, 1: 421-423.


  6. Kemper KJ, McCarthy PL, Cicchetti DV: Improving participation and interrater agreement in scoring Ambulatory Pediatric Association abstracts. How well have we succeeded?. Arch Pediatr Adolesc Med. 1996, 150: 380-383.


  7. Vilstrup H, Sorensen HT: A comparative study of scientific evaluation of abstracts submitted to the 1995 European Association for the Study of the Liver Copenhagen meeting. Dan Med Bull. 1998, 45: 317-319.


  8. Rubin HR, Redelmeier DA, Wu AW, Steinberg EP: How reliable is peer review of scientific abstracts? Looking back at the 1991 Annual Meeting of the Society of General Internal Medicine. J Gen Intern Med. 1993, 8: 255-258.


  9. Koren G, Graham K, Shear H, Einarson T: Bias against the null hypothesis: the reproductive hazards of cocaine. Lancet. 1989, 1140-1142.


  10. McIntosh N: Abstract information and structure at scientific meetings [letter]. Lancet. 1996, 347: 544-545. 10.1016/S0140-6736(96)91177-0.


  11. Moher D, Jadad AR, Nichol G, Penman M, Tugwell P, Walsh S: Assessing the quality of randomized controlled trials: an annotated bibliography of scales and checklists. Control Clin Trials. 1995, 16: 62-73. 10.1016/0197-2456(94)00031-W.


  12. Juni P, Witschi A, Bloch R, Egger M: The hazards of scoring the quality of clinical trials for meta-analysis. JAMA. 1999, 282: 1054-1060. 10.1001/jama.282.11.1054.


  13. Chalmers TC, Smith H, Blackburn B, Silverman B, Schroeder B, Reitman D, Ambroz A: A method for assessing the quality of a randomized controlled trial. Control Clin Trials. 1981, 2: 31-49. 10.1016/0197-2456(81)90056-8.


  14. Jadad AR, Moore RA, Carroll D, Jenkinson C, Reynolds DJ, Gavaghan DJ, McQuay HJ: Assessing the quality of reports of randomized clinical trials: is blinding necessary?. Control Clin Trials. 1996, 17: 1-12. 10.1016/0197-2456(95)00134-4.


  15. Friedenreich CM: Methods for pooled analyses of epidemiologic studies. Epidemiology. 1993, 4: 295-302.


  16. Cho MK, Bero LA: Instruments for assessing the quality of drug studies published in the medical literature. JAMA. 1994, 272: 101-104. 10.1001/jama.272.2.101.


  17. Haynes RB: More informative abstracts: current status and evaluation. J Clin Epidemiol. 1993, 46: 595-597.


  18. Froom P, Froom J: Deficiencies in structured medical abstracts. J Clin Epidemiol. 1993, 46: 591-594.


  19. Landis JR, Koch GG: The measurement of observer agreement for categorical data. Biometrics. 1977, 33: 159-174.


  20. Seigel DG, Podgor MJ, Remaley NA: Acceptable values of kappa for comparison of two groups. Am J Epidemiol. 1992, 135: 571-578.


  21. Streiner DL, Norman GR: Health measurement scales. A practical guide to their development and use. 1989, Oxford, New York, Tokyo: Oxford University Press


  22. Feinstein AR: The theory and evaluation of sensibility. In Clinimetrics. 1987, New Haven, London: Yale University Press, 141-166.


  23. Oxman AD, Guyatt GH, Cook DJ, Jaeschke R, Heddle N, Keller J: An index of scientific quality for health reports in the lay press. J Clin Epidemiol. 1993, 46: 987-1001. 10.1016/0895-4356(93)90166-X.


  24. Oxman AD, Guyatt GH: Validation of an index of the quality of review articles. J Clin Epidemiol. 1991, 44: 1271-1278. 10.1016/0895-4356(91)90160-B.


  25. Panush RS, Delafuente JC, Connelly CS, Edwards NL, Greer JM, Longley S, Bennett F: Profile of a meeting: how abstracts are written and reviewed. J Rheumatol. 1989, 16: 145-147.


  26. Chalmers I, Adams M, Dickersin K, Hetherington J, Tarnow-Mordi W, Meinert C, Tonascia S, Chalmers TC: A cohort study of summary reports of controlled trials. JAMA. 1990, 263: 1401-1405. 10.1001/jama.263.10.1401.


  27. Narine L, Yee DS, Einarson TR, Ilersich AL: Quality of abstracts of original research articles in CMAJ in 1989. CMAJ. 1991, 144: 449-453.


  28. Scherer RW, Crawley B: Reporting of randomized clinical trial descriptors and use of structured abstracts. JAMA. 1998, 280: 269-272. 10.1001/jama.280.3.269.


  29. Deeks JJ, Altman DG: Inadequate reporting of controlled trials as short reports. Lancet. 1998, 352: 1908-1909.


  30. Weintraub WH: Are published manuscripts representative of the surgical meeting abstracts? An objective appraisal. J Ped Surg. 1987, 22: 11-13.


  31. Weber EJ, Callaham ML, Wears RL, Barton C, Young G: Unpublished research from a medical specialty meeting. Why investigators fail to publish. JAMA. 1998, 280: 257-259. 10.1001/jama.280.3.257.


  32. Moher D, Jadad AR, Tugwell P: Assessing the quality of randomized controlled trials. Current issues and future directions. International Journal of Technology Assessment in Health Care. 1996, 12: 195-208.


  33. Schulz KF, Chalmers I, Hayes RJ: Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment in controlled trials. JAMA. 1995, 273: 408-412. 10.1001/jama.273.5.408.


  34. Verhagen AP, de Vet HC, de Bie RA, Kessels AG, Boers M, Bouter LM, Knipschild PG: The Delphi list: a criteria list for quality assessment of randomized clinical trials for conducting systematic reviews developed by Delphi consensus. J Clin Epidemiol. 1998, 51: 1235-1241. 10.1016/S0895-4356(98)00131-0.


  35. Timmer A, Hilsden RJ, Sutherland LR: Determinants of abstract acceptance for the Digestive Diseases Week – a cross sectional study. BMC Med Res Methodol. 2001, 1: 13-10.1186/1471-2288-1-13.


  36. Timmer A, Hilsden RJ, Cole J, Hailey D, Sutherland LR: Publication bias in gastroenterological research. BMC Med Res Methodol. 2002, 2: 7-10.1186/1471-2288-2-7.





Acknowledgements

We wish to thank Dr. MK Cho for providing information and instructions on the instruments developed by her and L. Bero [16]. We would also like to thank J. Wallace and all other expert raters for participation in the sensibility survey. Special thanks to C. Macarthur who contributed to the development and validation of the instrument, as well as to the writing of the manuscript. A. Timmer was a fellow of the German Academic Exchange Service (DAAD). The study was supported by a grant from the Calgary Regional Health Authority (CRHA R&D) and by Searle Canada.

Author information



Corresponding author

Correspondence to Antje Timmer.

Additional information

Competing interests

None declared.

Authors' contributions

A. Timmer and RJ Hilsden developed the instrument, evaluated abstracts and jointly wrote the manuscript. The statistical analyses were performed by A. Timmer. LR Sutherland conceived and supervised the study, and contributed to the instrument modifications and to the writing of the manuscript.

Electronic supplementary material


Cite this article

Timmer, A., Sutherland, L.R. & Hilsden, R.J. Development and evaluation of a quality score for abstracts. BMC Med Res Methodol 3, 2 (2003).
