We found that using a patient-completed questionnaire was a feasible and valid method of capturing health care use and costs for patients with OA compared with accessing administrative databases. First, with respect to the questionnaire's feasibility and research costs, response rates were high (89%), although a minority of participants needed up to two telephone calls to return the questionnaire. Item-level missing data were minimal (1.6%), and research costs were considerably lower than those of relying on administrative databases. A review by Verbrugge reported item-level missing data ranging from 7% to 33% for diaries. This metric was not reported by Goossens et al.; however, they reported that only 68% of diaries were returned.
As in previous studies [3, 4, 10, 33, 34], we found high levels of agreement between the data-collection methods for salient, high-cost health services such as hospitalizations and emergency visits. Participants reported their non-use of health care (specificity) more accurately than their use (sensitivity). This likely reflects the three-month time horizon, which for most participants resulted in relatively low use of health care. For GP contacts, paracetamol, and NSAIDs, sensitivity was higher than specificity (Table 2). The lower specificity for GP contacts may be the result of recall bias. Given the short timeframe of our study, it is unlikely that this discrepancy reflects participants forgetting their visits. Rather, it may be the result of 'telescoping' (or 'reverse telescoping'), which occurs when a person includes ('telescopes') health care used outside the study time period, or when health care from within the study time period is 'reverse telescoped' out. Telescoping does not result in a consistent bias. The lower specificity for paracetamol and NSAIDs may be associated with these medications being used less (often on an 'as needed' basis) than as prescribed. In other words, a participant may have accurately reported use of a medication, but because this use was less than prescribed, no new dispensing was required within the three-month time horizon, leaving no record in the database.
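As an illustration of how these metrics are derived from the cross tables, the sketch below computes sensitivity and specificity from a 2x2 table of self-report against the database record; the counts are hypothetical placeholders, not values from Table 2.

```python
# Sensitivity and specificity of self-report against a database record,
# computed from a 2x2 cross table. Counts are hypothetical, not study data.

def sensitivity_specificity(tp, fn, fp, tn):
    """tp: use reported and in database; fn: in database but not reported;
    fp: reported but not in database; tn: non-use agreed by both."""
    sensitivity = tp / (tp + fn)  # share of database-recorded use that was reported
    specificity = tn / (tn + fp)  # share of database-recorded non-use that was reported
    return sensitivity, specificity

sens, spec = sensitivity_specificity(tp=42, fn=3, fp=8, tn=12)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```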
We were only able to analyse quantity data for GP visits and the number of medications used. Utilization was too low to assess quantity data for the other health care services included in our study. Levels of agreement between the assessment methods were acceptable (ρc > 0.40) for the number of GP visits and the total number of medications used.
We also compared cost estimates using Lin's concordance correlation coefficient (ρc) and Bland-Altman comparisons. Societal costs and co-payments associated with GP visits agreed with database results (ρc = 0.502). For the majority of medication subsidies, the cost estimates from the OCC-Q and NZHIS agreed. The questionnaire underestimated the paracetamol subsidy by an average of $0.58 per person (2009 NZD) and overestimated the patient co-payment by $12.34 per person. We believe this to be the result of some participants being prescribed the medication by their GP but, instead of filling the prescription at a pharmacy, choosing to purchase the medication over-the-counter.
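For reference, the sketch below shows how these two agreement statistics, Lin's ρc and Bland-Altman limits of agreement, are computed for paired per-person cost estimates; the cost vectors are hypothetical, not OCC-Q or NZHIS data.

```python
# Lin's concordance correlation coefficient and Bland-Altman limits of
# agreement for paired per-person cost estimates. Data are hypothetical.
import numpy as np

def lins_ccc(x, y):
    """rho_c = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxy = np.cov(x, y, bias=True)[0, 1]  # population covariance
    return 2 * sxy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

def bland_altman(x, y):
    """Mean difference and 95% limits of agreement (mean +/- 1.96 SD)."""
    d = np.asarray(x, float) - np.asarray(y, float)
    half_width = 1.96 * d.std(ddof=1)
    return d.mean(), d.mean() - half_width, d.mean() + half_width

occq_costs  = [12.0, 0.0, 34.5, 8.0, 15.0]   # hypothetical questionnaire costs (NZD)
nzhis_costs = [10.0, 0.0, 30.0, 8.0, 18.0]   # hypothetical database costs (NZD)
print(f"rho_c = {lins_ccc(occq_costs, nzhis_costs):.3f}")
print("mean difference, lower, upper =", bland_altman(occq_costs, nzhis_costs))
```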
The self-reported omeprazole subsidy disagreed with the NZHIS results (ρc = 0.13), despite dichotomous reporting of omeprazole use corresponding to sensitivity and specificity levels of 80% and 91%, respectively. This discrepancy appeared to be related to the reimbursement of a more expensive branded drug, Losec, in the NZHIS record. We did not account for the reimbursement of Losec in the questionnaire-based estimates because it was not included in the pharmaceutical schedule. We can only speculate as to why Losec was reimbursed by the government agency (PHARMAC) despite not being on the pharmaceutical schedule; it may have been the result of a rebate agreement with the manufacturer or a purchase by the Otago DHB. Removing omeprazole from the total medication cost increased the concordance correlation coefficient to levels well beyond the threshold of clinical/practical agreement (data not reported). This illustrates the importance of knowing which unit costs to use when estimating costs. It also shows how reasonable attempts at using the appropriate unit cost may fail to capture costs that are not reported in the public domain.
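The effect of a single mis-costed item on agreement for a total can be checked with a simple leave-one-item-out comparison, sketched below with hypothetical cost data (the lins_ccc helper is as in the previous sketch).

```python
# Leave-one-item-out check: recompute agreement on total medication costs
# with one item (omeprazole) excluded. All cost figures are hypothetical.
import numpy as np

def lins_ccc(x, y):  # as defined in the previous sketch
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxy = np.cov(x, y, bias=True)[0, 1]
    return 2 * sxy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

# Hypothetical per-person subsidy costs by medication (one entry per participant)
occq  = {"paracetamol": [4.0, 2.0, 6.0, 3.0], "omeprazole": [5.0, 5.0, 0.0, 5.0]}
nzhis = {"paracetamol": [4.5, 2.5, 6.0, 3.0], "omeprazole": [30.0, 5.0, 0.0, 42.0]}

def totals(items, exclude=None):
    return np.sum([v for k, v in items.items() if k != exclude], axis=0)

print(lins_ccc(totals(occq), totals(nzhis)))                    # all medications
print(lins_ccc(totals(occq, exclude="omeprazole"),
               totals(nzhis, exclude="omeprazole")))            # omeprazole removed
```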
For most medications, agreement on co-payments was poor, though mean differences were small (< $4.26 for all but paracetamol). This was likely due to a combination of participants not accurately recalling their co-payments and errors in the expected contribution reported by NZHIS. Recall was particularly difficult for participants because they often combined store purchases with medication co-payments at the pharmacy. For this reason, we recommend that analysts use caution when capturing co-payments for particular cost items. An understanding of the environment in which co-payments are made is important. For example, an option other than self-report may be required for studies set in countries where medication co-payments take place in a pharmacy in which purchases unrelated to health care can be made. If available, options such as average medication co-payments may be applied. However, considering the low impact of co-payments on the cost of total health care use, it may not be worthwhile to invest substantial analyst resources in increasing the accuracy of this particular cost.
Our results compare favourably with other studies that assessed agreement between patient report and administrative records [3, 34]. We compared our results with two studies that reported data in cross tables, which allowed us to calculate sensitivity and specificity from their findings. Data reported by Ruof et al. indicated excellent sensitivity for physician visits (100%). This physician visit category was a general measure that included GPs and specialists. Ruof et al. grouped physicians to increase agreement; however, applying costs to such a general measure would be difficult. Disaggregated physician visits were not included in the cross tables, but the authors indicated that kappa values were < 0.2 (poor agreement) for all physician categories. With respect to GP visits, data reported by Raiana et al. indicated sensitivity levels (95%) similar to our study (93%), but lower specificity levels (20%), suggesting that participants in their study were less able to accurately recall a non-attendance than our participants. Our results were stronger for the quantity of GP visits reported when compared with Raiana et al. (intra-class correlation coefficient of 0.34), though their population included a greater proportion of participants with demographic characteristics associated with poorer agreement. Ruof et al. reported high sensitivity levels (> 90%) for the majority of medications assessed. These were higher than in our study, which may have been due to differences in the populations studied. Ruof et al. studied patients with rheumatoid arthritis, who typically receive intensive, medication-based treatment that slows disease progression. The primary use of a medication as a disease-modifying treatment may increase a patient's ability to recall use of that particular health care item.
Two characteristics of our study may have increased the complexity of our data collection and analysis. The first is the use of an imperfect gold standard. Administrative databases have been shown to contain errors [6, 7]. An imperfect gold standard complicates the assessment of accuracy, often biasing the accuracy estimate upward or downward. Reitsma et al. review methods for correcting the gold standard, which include adjusting it by percentages of error found in the literature or considered plausible. However, little evidence exists to guide the correction of cost data from administrative databases. An example from our study illustrates the imperfection of our gold standard with respect to the reporting of OA-related inpatient services. One total knee replacement was recorded on the patient questionnaire but not in the administrative database. When we manually searched the hospital records for this participant, we found operative notes indicating that a knee replacement had indeed been performed during the study period. In addition to decreasing our estimate of accuracy for the reporting of OA-related inpatient events, the removal of this procedure from the total costs of health care services would have substantially affected our results.
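As a rough illustration of the kind of correction Reitsma et al. describe, the sketch below recomputes self-report sensitivity after assuming the database reference itself misses a plausible fraction of true events (as with the unrecorded knee replacement above); the miss rates and counts are assumptions, not measured error rates.

```python
# Sensitivity analysis for an imperfect gold standard: assume a fraction of
# apparent false positives are events the database failed to record, and
# recompute self-report sensitivity. All numbers are illustrative assumptions.

def adjusted_sensitivity(tp, fn, fp, db_miss_rate):
    """Reclassify a fraction of apparent false positives as true events
    missed by the database, then recompute sensitivity."""
    rescued = fp * db_miss_rate  # self-reported events wrongly absent from the database
    return (tp + rescued) / (tp + rescued + fn)

tp, fn, fp = 42, 3, 8            # hypothetical cross-table counts
print(f"naive sensitivity = {tp / (tp + fn):.2f}")
for miss_rate in (0.05, 0.10, 0.25):
    print(f"database miss rate {miss_rate:.0%}: adjusted sensitivity = "
          f"{adjusted_sensitivity(tp, fn, fp, miss_rate):.2f}")
```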
The second characteristic that may have complicated data collection was the use of OA-related costs. Patients and researchers may not agree on what 'OA-related' means. We attempted to minimize this discrepancy by using firm definitions to determine which services constitute OA-related health care. Although we used definitions for OA-related health care use, we acknowledge that medicine is multifactorial and that there is little evidence supporting the use of definitions to identify what is and is not disease-related. The decision to use OA-related costs was made to reduce the variability of our cost estimates by limiting costs to those related to OA. Our results may support researchers interested in analyses of 'OA-related' disease in applying a definition that resulted in high levels of agreement between patients completing a questionnaire and researchers applying the definition to administrative data.
Our study has several limitations. First, our sample size was small. We reported descriptive results so that our findings are interpretable within the context of the sample, and we provided cross tables to allow comparisons with current and future studies. Second, we did not screen participants with a mental health assessment to decide who should be allowed to take part in the self-reporting exercise. Without such screening, a small number of participants may not have been able to complete the questionnaire accurately; however, considering the age range of our population and the features of our questionnaire designed to assist recall, we believe this to be of little consequence. Third, the reduced research costs found when using the OCC-Q may have been related to the small sample size. The balance of fixed and variable costs differs between the analytic approaches used in the study: many of the databases had larger fixed costs and smaller variable costs than the questionnaire. For example, if the sample size had increased to 150 participants, time spent on the phone would have been unchanged for the databases but would have nearly tripled for the questionnaire-based approach (see the sketch below). It is possible that, given a substantially larger sample size, research costs would be higher for the OCC-Q than for the databases. Fourth, we were unable to compare important cost items such as patients' work-related productivity losses, community-based allied health visits, and informal care. These costs were collected using the questionnaire but were unavailable from administrative databases, and so they were not included in the present study.
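To make the fixed versus variable cost trade-off concrete, the sketch below shows a simple break-even calculation; all dollar figures are hypothetical assumptions, not the study's actual research costs.

```python
# Break-even sketch for research costs: database access with a high fixed
# cost and low per-participant cost versus a questionnaire with the
# reverse profile. All dollar figures are hypothetical assumptions.

def research_cost(fixed, per_participant, n):
    return fixed + per_participant * n

for n in (50, 150, 500):
    database = research_cost(fixed=5000, per_participant=5, n=n)
    occ_q    = research_cost(fixed=500, per_participant=25, n=n)
    cheaper = "OCC-Q" if occ_q < database else "databases"
    print(f"n = {n:>3}: databases = ${database}, OCC-Q = ${occ_q} -> {cheaper} cheaper")
```

Under these assumed figures the questionnaire is cheaper at small sample sizes, while the databases become cheaper once the per-participant questionnaire costs outweigh their larger fixed costs.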