Bmc Medical Research Methodology Open Access Examining Intra-rater and Inter-rater Response Agreement: a Medical Chart Abstraction Study of a Community-based Asthma Care Program

Background: To assess the intra-and inter-rater agreement of chart abstractors from multiple sites involved in the evaluation of an Asthma Care Program (ACP).

tion highlights the importance of assessing the intra-rater and inter-rater reliability of the collected data, as it reflects the quality of the data and the value of the results [4].
Although many investigators have expressed concern regarding the reliability associated with data abstracted from medical records [1,2,5,6], few studies report details regarding the types of data elements that were assessed and the methods used to ensure reliability [2,[7][8][9]. For multi-centre studies, the need for reliability assessments is especially important given the involvement of multiple data abstractors and the potential for variability. Thus, designing a reliability assessment study that uses appropriate sampling techniques and analytic methods offers an efficient way to verify the reliability of the data collected through chart review. Indications of high agreement can provide confidence in the data collection process and the subsequent conclusions drawn from those data [10].
Previous studies have assessed the reliability of medical record reviews in the context of screening and the detection of adverse events [11,12]. This paper describes a secondary chart abstraction study carried out as part of the multi-site Primary Care Asthma Pilot Project (PCAPP) to determine the effectiveness of an evidence-based community asthma care program. The purpose of chart re-abstraction was to measure the intra-and inter-rater reliability of abstracted patient chart data across sites and assessors involved in PCAPP.

Primary Care Asthma Pilot Project (PCAPP)
The Primary Care Asthma Pilot Project (PCAPP) was a community-based participatory study funded by the Ontario Ministry of Health and Long-Term Care. Initiated in 2003, PCAPP was designed to determine whether the use of an evidence-based Asthma Care Program (ACP) would lead to improved asthma care delivery and outcomes for patients from 15 satellite clinics in eight local communities across Ontario. Patients included in PCAPP were those aged 2 to 55 with mild to moderate asthma. The satellite clinics included eight Community Health Centres, a Rural Family Health Team, a Group Health Centre and an Aboriginal Access Centre. Participants in the PCAPP project consented to have their medical charts reviewed on four occasions to measure the process of care involved in the implementation of the ACP. Ten different research staff conducted chart abstraction across the various sites, thus it was essential to ensure that data were collected consistently over time within each participating site (intra-rater reliability) and across sites (inter-rater reliability).

Initial data collection and reabstraction
The original chart extraction and coding of the 1,433 PCAPP participants' records occurred between September 2003 and June 2005. Information was abstracted on prospective patient visits that took place at baseline, 6-month follow-up, and 12-month follow-up to measure the process of care during the implementation of the ACP. Information was also abstracted from retrospective patient visits that happened between January 1, 2002 and December 31, 2002 in order to describe the pattern of asthma care prior to the ACP implementation. Data from patient medical records were collected using a data abstraction form containing 33 items grouped into six categories (Table 1). Abstractors indicated on the form whether information pertaining to each item was "documented" or "not documented" in the patient's chart. The chart abstraction study and the accompanying form were approved by the Research Ethics Board at the Hospital for Sick Children, Toronto.
In the current study, assessment of chart abstraction reliability involved distinct methods of comparison for intrarater agreement and for inter-rater agreement. Intra-rater agreement was based on the re-abstraction of medical charts by abstractors from the same site at Time 2 (between July 2005 and February 2006) and comparing the re-abstracted data to data collected at the initial abstraction at Time 1 (between September 2003 and June 2005). To minimize the potential for recall and artificial inflation of observed agreement, all data abstracted during Time 1 (initial chart abstraction) were returned to the research centre upon their completion, and chart abstractors were not granted access to their original responses at Time 2 (chart re-abstraction). Inter-rater agreement involved comparisons between abstractors and a gold standard using a single set of eight fictitious medical charts at the end of the study.

Definition of chart abstraction items
Six chart abstraction categories were used as a basis for determining chart abstractor agreement: physical assessment, asthma control, spirometry, asthma education, referrals to specialists or education programs, and medications prescribed. The chart abstraction form included space for abstractors to record specific medication information. However, details such as generic/trade name, strength, dosing, route of administration and duration were considered especially challenging to analyze quantitatively. Thus for the purpose of this study, only the presence of medication side effects was included as a dichotomous yes/no variable for agreement analyses. A summary of the number of chart abstraction items within each category is shown in Table 1.

Data quality control measures
As part of the main study, site coordinators and chart abstractors took part in training workshops and orientation sessions to introduce the project and research methodology to the local staff. Chart abstraction guides were distributed to all sites and included abstraction procedures, coding instructions, and several scenarios for discussion so that chart abstractors would handle potentially challenging medical charts in a consistent manner. Additional site visits were conducted three to five months following the implementation of PCAPP to review data collection processes and to ensure that research protocols were being followed. Other ongoing communication methods included email, teleconferencing, and regularly scheduled Advisory Committee meetings.
Following abstraction, chart abstractors entered the data into a database using Microsoft Access (Microsoft Corporation, Redmond, Washington). Data were transferred monthly between the participating sites and the lead investigating centre, where data were tracked and cleaned.

Assessing intra-rater reliability
To assess intra-rater reliability, ten abstractors re-abstracted data at Time 2 from randomly selected patient charts that had been abstracted at Time 1. A sample of 110 patient charts representing 8% of the main study population was randomly selected from the original sample for re-abstrac-tion. Random number generation and subsequent analyses were conducted using SAS Statistical Software version 9 [13]. The formula given by Sim and Wright [14], indicated that the sample size for this re-abstraction study would allow the detection of a kappa statistic between the values 0.60 and 0.70 with 80% power at an alpha level of 0.05.

Assessing inter-rater reliability
To assess inter-rater agreement while ensuring that patient charts remained in their respective sites to safeguard patient confidentiality, an innovative method was used wherein 8 fictitious medical charts were distributed to each site chart abstractor as well as an experienced chart abstractor who is not involved in the study. All abstractors abstracted information from the eight simulated charts using the study chart abstraction form. The simulated charts reflected a range of patient characteristics: young versus old, few versus multiple co-morbidities, controlled versus uncontrolled asthma. Pilot testing of the fictitious medical charts for face and content validity occurred at two sites by individuals not involved in chart abstraction. The set of charts underwent minimal revision. Examples of the simulated patient chart and the chart abstraction form are illustrated in Figure 1.

Analysis
Intra-rater reliability was assessed using the overall percentage of agreement [15], and the kappa statistic. The abstractors' responses given during the initial abstraction (Time 1) were compared to those provided during the reabstraction (Time 2). To ensure proper comparison, the abstracted data were matched by study identification numbers and corresponding asthma-related visit dates.
Percentage agreement and the kappa statistic were also used to assess inter-rater reliability across the study sites for the eight simulated patient charts. Data abstractions from the simulated charts by site abstractors were checked against the data abstracted by the experienced chart abstractor, who served as a gold standard. Sensitivity and specificity estimates were calculated comparing results from all raters against the gold standard.
Using the algorithm of Landis and Koch [16], kappa values of 0.80 and above represented excellent agreement, values between 0.61 and 0.80 represented substantial agreement, 0.41 to 0.61 represented moderate agreement, and values below 0.40 suggested fair to poor agreement.
Summary kappa scores and percentage agreement were calculated for each category (n = 6) and also for each variable (n = 33). Agreement coefficients for each of the variables were calculated to determine the range of agreement within each category. The overall kappa score was calculated by summation of scores within each category, and subsequently using the category-specific kappa values to compute an overall kappa summary statistic. The homogeneity of kappas across chart abstraction categories was tested using the method outlined by Donner et al [17].

Intra-rater agreement
For the intra-rater component of the re-abstraction study, 110 charts were reviewed, and 218 documented asthmarelated patient visits were included in comparing chart abstraction at Time 1 and Time 2.
Chart abstraction form and sample portion of a fictitious medical chart used for assessing inter-rater reliability Figure 1 Chart abstraction form and sample portion of a fictitious medical chart used for assessing inter-rater reliability.
The overall intra-rater percentage of agreement and kappa statistics were 93% and 0.81 respectively. Across the six categories, kappa varied between 0.51 and 0.84, with categories pertaining to asthma referrals and medication side effects showing only moderate agreement. Table 2 presents a summary of the results for category-specific and overall measures of agreement for intra-rater reliability. Homogeneity testing of the category-specific kappa values suggested statistically significant heterogeneity. It was therefore of interest to explore possible sources of heterogeneity. Data were pooled and subsequently stratified by abstractor to compare the degree of agreement between Time 1 and Time 2 for each individual abstractor ( Table  3). The abstractor-specific percentage agreement varied from 79% to 98%, while the abstractor-specific intra-rater kappa statistics ranged from 0.21 to 0.94. The heterogeneity chi-square statistic for the difference in the kappas calculated for the 10 abstractors was also statistically significant (χ 2 = 238, p < 0.0001), suggesting that individual differences in abstractor consistency may account for the heterogeneity of the overall category-specific kappa values. Intra-rater agreement was also examined for all the abstractors according to when the chart was abstracted, either at the start of the study (on retrospective patient visits prior to baseline) or at 6-month follow-up. The kappa statistics were 0.76 (95% CI: 0.73, 0.79) for retrospectively abstracted charts, and 0.82 (95% CI: 0.80, 0.85) for 6-month follow-up chart abstraction (data not shown in tables).

Inter-rater agreement
As expected, inter-rater agreement measures were slightly lower than those for intra-rater agreement. The overall inter-rater percentage of agreement and kappa statistics were 88% and 0.76 respectively (Table 4). Kappa values showed moderate agreement for the category of asthma education, and could not be calculated for the spirometry and medication side effects categories due to a high observed percentage of agreement. The overall sensitivity and specificity estimates across all six categories were 0.91 and 0.89 respectively (Table 4), which indicate consistent abstractions conducted by the raters in comparison to the gold standard.
The inter-rater kappa values were shown to be statistically heterogeneous according to the homogeneity test (χ 2 = 737, p < 0.0001). To explore the source of heterogeneity, data abstracted from the eight simulated charts were pooled and agreement coefficients were calculated for each paired combination of abstractors including the standard abstractor who was not involved in the study. The range of percentage agreement and kappa values are shown in Figure 2. The percentage agreement across the different pairs of abstractors ranged from 78% to 98%. Kappa values ranged from 0.56 to 0.90 and were statistically heterogeneous (χ 2 = 128 on 54 degrees of freedom, p < 0.0001). Inter-rater agreement was then examined for each simulated chart. As shown in Table 5, the percentage agreement was consistent across the eight charts (85% to 91%), and kappa statistics ranged from 0.69 to 0.78. The sensitivity estimates ranged from 0.84 to 0.99 and specificity estimates ranged from 0.85 to 0.96.

Discussion
The results of this study showed that chart abstractors involved in data collection for the Primary Care Asthma Pilot Project reliably extracted information contained in the medical charts. Previous studies of intra-rater and inter-rater reliability have also demonstrated moderate to substantial intra-rater and inter-rater reliability associated with medical chart abstraction [6,18]. When abstracted by the same rater, or raters within the same centre, the majority of items (27 of 33, 82%) had kappa values greater than 0.61, suggesting that intra-rater agreement was substantial to excellent overall as well as on a per-item basis despite changes in staff that occurred at some sites. For inter-rater agreement, 10 of 33 items (30%) were associated with a kappa value greater than 0.61. Fewer items showed substantial inter-rater agreement as compared to intra-rater agreement, which may be attributable to the content of the fictitious charts used for inter-rater assessment rather than differences in the abstraction process or between the abstractors themselves.
Closer examination of the agreement coefficients revealed differences in kappa statistics depending on the chart abstraction category. For example, the kappa statistics indicated only moderate agreement for intra-rater data abstraction relating to asthma referrals and medication side effects. Moderate agreement was also shown for interrater data abstraction relating to asthma education. Information pertaining to these abstraction categories was presented in the medical charts as free-text or in narrative form. The demonstration of moderate agreement may be due to differences in the abstractors' familiarity with the concepts and terminology of asthma treatment strategies, the indications for specialist referrals, and how and where these details would be noted in the medical chart. Other studies have identified similar issues as having a potential impact on reliability [7,9,19]. Our findings showed that intra-rater reliability was significantly improved for 6-month follow-up chart abstraction compared to retrospective chart abstraction at baseline, which may be attributable to the effect of ongoing abstractor training and increasing familiarization with the nuances of chart abstraction over time.
A salient challenge of chart review is the accurate abstraction of medication information, dosing, frequency and duration. Quantifying agreement for the documentation of pharmaceutical treatments has important clinical implications. This study was unable to analyse treatment in this context. However, results showed moderate intrarater and inter-rater agreement for the categories of asthma education, asthma referral, and medication side effects, suggesting that agreement for more specific medication regimens may be moderate at best. Further studies are needed to assess the agreement in medical chart documentation of treatments.
Results from this study also reflect the paradox associated with the kappa statistic wherein an item or category demonstrates high percentage agreement but a low kappa coefficient [20]. This inherent limitation of kappa is wellestablished and acknowledged. Thus, in some instances, the percentage agreement may be a more appropriate measure of reliability. In the presence of a reference or a gold standard, test statistics such as sensitivity, specificity, predictive values, and likelihood ratios are more often used than the simple kappa statistic. In this study, since a gold standard was available, it was possible to compute sensitivity and specificity estimates across all the categories and for all the charts created. The overall sensitivity and specificity measures in this study were in the order of 90% indicating good validity. Despite the limitations of the kappa statistic, it was also used to present results of the current study since it provides a simple measure to assess intra-rater reliability (precision) and also inter-rater reliability in studies that involve multiple raters. While there Ranges of inter-rater agreement among pairs of abstractors Figure 2 Ranges of inter-rater agreement among pairs of abstractors. Data are shown as percentage agreement (%), kappa coefficient, and 95% confidence interval for kappa. Abbreviations: Std = Gold Standard.  To address the objectives of this multi-site study, assessing chart abstraction agreement required the computation of multiple kappa statistics with the underlying assumption that these individual measures could be combined statistically into overall values for intra-rater reliability and inter-rater reliability. The overall results indicated substantial to excellent agreement though statistical tests suggested significant heterogeneity in the kappa statistics. Further investigation of possible sources of heterogeneity, including the clinical knowledge base of chart abstractors, may be relevant to enhancing the design of studies involving clinical interventions that are documented and subsequently abstracted from patient medical charts.
Although the overall intra-and inter-rater agreement was high in the current study, the reliability of chart abstraction data remained less than 100%. This means that there may still be some misclassification during data collection, which may result in a bias towards the null hypothesis (Type II error) of the intervention study. One way to overcome this potential problem is to obtain a large enough sample size. Future evaluations of chart abstraction may include regression modelling to investigate the heterogeneity in kappa statistics and the impact of possible explanatory factors on the probability of achieving agreement. Methods such as the generalized estimating equation (GEE) approach of Liang and Zeger [21] are considered useful for intra-rater agreement involving clustered, multicentre designs where data are collected at multiple time periods. Based on the elements involved in the current study, possible covariates for investigation may include the number of abstractors involved, the differences in chart abstraction items or categories, and the duration of time between abstractions.

Conclusion
This study demonstrates that chart abstraction can be a reliable form of data collection in multi-centre studies for asthma. The ongoing assessment of data reliability offers an effective way of monitoring data quality, which can subsequently improve the reliability of the results drawn from the data.
Publish with Bio Med Central and every scientist can read your work free of charge