In this study we described a number of parameters which may affect the precision and power of CRR harm estimates in general practice. We combined a wide range of different parameter values into different CRR scenarios and used computer simulation to establish which ones would yield harm rate estimates with acceptable precision and adequate power. From this, we derived a formula which we used to calculate the minimum number of harm incidents that had to be detected during any CRR process to ensure the harm estimates had acceptable precision and adequate power. We found that any CRR scenario which detected a minimum of 100 harm incidents would have harm rate estimates with the level of precision we pre-specified. Using the formula and our simulated data, we calculated that detecting 50% and 20% reductions in harm with adequate power would require CRR scenarios to detect at least 100 and 500 harm incidents, respectively, over a given period of time.
The practical implication of the CRR scenarios that assure harm rate estimates with acceptable precision (as defined by us) is that approximately 2000 records (assuming a high baseline harm rate), increasing to 20 000 records (assuming a low harm rate), would have to be reviewed. If the aim of the CRR is to detect changes in harm rates over time with adequate power, as many as 120 000 records may have to be reviewed, depending on the prevalence of the harm in the patient population of interest. Health care researchers, clinicians, policy makers and others can combine different parameter values into different CRR scenarios to fit their aims and resources. By applying our formula, they could ensure that the harm estimates of these potential CRR scenarios will have acceptable precision and adequate power.
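The link between a minimum incident count and review workload can be illustrated with a deliberately naive calculation. This is a sketch, not the formula derived in this study: it assumes every record yields incidents at the same rate and ignores inter-practice and inter-patient variation, so it gives a lower bound at best. The function name and the example rates (5 and 0.5 incidents/100 patients/year) are hypothetical and chosen for illustration only.

```python
import math


def naive_records_needed(min_incidents, rate_per_100, review_years=1.0):
    """Smallest number of records whose expected yield reaches min_incidents.

    Assumption (not from the study): incidents arrive at a constant rate of
    rate_per_100 incidents per 100 patients per year, with no clustering
    within practices or patients.
    """
    if rate_per_100 <= 0 or review_years <= 0:
        raise ValueError("rate and review period must be positive")
    return math.ceil(min_incidents * 100 / (rate_per_100 * review_years))


# Hypothetical rates, for illustration only.
for rate in (5, 0.5):
    n = naive_records_needed(100, rate)
    print(f"{rate} incidents/100 patients/year: at least {n} records")
```

Because the calculation ignores the sources of variation built into the simulations, real scenarios can be expected to need more records than it suggests.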
Comparison with the literature
The vast majority of studies with a CRR methodology aim to detect either patient safety incidents (PSIs) in general, or more specific subsets of PSIs such as harm, adverse drug reactions or errors, to estimate a harm rate for a defined geographical location or clinical department at specified points in time. Our non-systematic search of the relevant international literature [1, 3] did not uncover a single study in which the precision of these reported rates was either considered or documented. In addition, none seems to have explicitly considered the required parameter values of its CRR method. Instead, the size of the patient record samples seemed to be determined only by resources, time and feasibility concerns. While this observation does not necessarily imply that all previous harm rate estimates were imprecise, our findings suggest that any CRR which detected fewer than 100 harm incidents may not have had adequate precision (as defined by us).
To illustrate this point further, we provide three practical examples. Example one: Singh and colleagues measured the adverse drug event rate amongst older patients with established cardiovascular disease by reviewing a 12-month period in 393 pre-screened trigger-positive records from six general practices in the UK. They found 232 adverse drug events, of which 92 were judged preventable, with an estimated rate of 24.6 preventable adverse drug incidents/100 patients/year. Applying our formula to their CRR method and findings suggests their estimated rate has adequate precision (as defined by us). Example two: Gaal and colleagues reviewed 1000 unique medical records in Dutch general practice over a 12-month period, and estimated a rate of 21.1 patient safety incidents/100 patients/year (CI 18.5-24.1) and 5.8 harm events/100 patients/year. Applying our formula to their CRR method and findings suggests the estimated PSI rate is precise but the harm rate estimate may not be. Example three: De Wet and Bowie reviewed a 12-month period in each of 100 records randomly sampled in five participating practices in Scotland. Overall, 64 PSIs were found, which is fewer than the 100 harm incidents our formula suggests are needed for acceptable precision.
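To show why the number of detected incidents drives precision, the sketch below computes an approximate 95% confidence interval for a rate from an incident count, using a normal approximation on the log scale for Poisson counts. It is an illustration, not the formula derived in this study, and it assumes independent incidents with no clustering by practice or patient. The counts of 211 and 58 are inferred here from the rates reported by Gaal and colleagues (21.1 and 5.8 per 100 patients/year across 1000 records); they are not quoted from that study.

```python
import math


def approx_rate_ci(incidents, records, years=1.0, z=1.96):
    """Rate per 100 patients/year with an approximate 95% CI.

    Uses rate * exp(+/- z / sqrt(incidents)), a log-scale normal
    approximation for a Poisson count. Assumes independent incidents.
    """
    if incidents <= 0:
        raise ValueError("need at least one incident")
    rate = 100.0 * incidents / (records * years)
    factor = math.exp(z / math.sqrt(incidents))
    return rate, rate / factor, rate * factor


# Incident counts inferred from reported rates (an assumption of this sketch).
for label, incidents in (("PSI", 211), ("harm", 58)):
    rate, low, high = approx_rate_ci(incidents, records=1000)
    print(f"{label}: {rate:.1f} ({low:.1f} to {high:.1f}) per 100 patients/year")
```

Under this approximation the relative width of the interval depends only on the incident count, which is why the harm estimate (fewer incidents) is less precise than the PSI estimate from the same 1000 records.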
Only a tiny minority of studies using the CRR method have aimed to measure reductions in harm rates over time. They were all conducted in secondary care settings and, to our knowledge, there have been none in primary care. Carter described the experiences of a hospital in the UK with the global trigger tool over a five-year period. While it ‘appeared’ that the incidence of ‘more serious’ events reduced and ‘more minor’ harm incidents increased, the changes were not quantified. Landrigan and colleagues’ review of 2341 admissions to 10 USA hospitals over a six-year period was the largest study of its type when it was published, but failed to detect a significant reduction in the rate of harm during this period [8, 32]. Applying our formula (which suggests detecting a 20% change in harm would require a CRR to detect at least 500 preventable harm incidents) to their findings suggests at least two possibilities: either there was no reduction in harm, or there was a small reduction but the sample was insufficiently powered to detect it.
Our simulations represent ‘best case’ scenarios and likely underestimate the number of records that may have to be reviewed. While we know that a substantial proportion of PSIs may not be preventable because they originated in different settings, are recognized as side effects of appropriate treatment or are dependent on patient factors, this was not directly controlled for in our simulations. Current estimates suggest between 10 and 50% of detected harm incidents may be preventable [9, 18, 33, 34]. Therefore, when researchers or reviewers attempt to measure reductions in harm over time, they have to remove, or at the very least consider, the proportion of the detected harm incidents that are likely to be ‘non-preventable’. Otherwise, the observed reduction will appear ‘smaller’ (as a percentage) than it actually was, and their CRR scenario’s power to detect the change will also be decreased.
To illustrate this point further, consider the study conducted by Takata and colleagues as a practical example. They detected 107 adverse drug events (ADEs), of which 24 (22%) were judged preventable, in their review of 960 paediatric records from 12 USA hospitals. If they aimed to reduce the number of preventable incidents by an ambitious 50% (e.g. a reduction from 24 to 12 incidents) over a given period of time, this reduction would ‘only’ be 11.2% of their overall ADE rate. Our findings suggest this would require a CRR of many thousands of records, and certainly many more than if their aim had been a reduction of 50% in the overall rate.
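The arithmetic behind this dilution effect can be written out as a short sketch. The counts (107 ADEs, 24 preventable) are those reported above; the function name is hypothetical.

```python
def overall_reduction(total_incidents, preventable, reduction_in_preventable):
    """Fraction by which the overall count falls when only the preventable
    incidents are reduced by the given fraction."""
    if not 0 <= preventable <= total_incidents or total_incidents <= 0:
        raise ValueError("need 0 <= preventable <= total_incidents, total > 0")
    return preventable * reduction_in_preventable / total_incidents


share = 24 / 107                          # proportion judged preventable
overall = overall_reduction(107, 24, 0.5)  # halve preventable ADEs: 24 -> 12
print(f"preventable share: {share:.1%}")   # 22.4%
print(f"overall reduction: {overall:.1%}")  # 11.2%
```

A 50% reduction in preventable incidents therefore has to be detected as a change of roughly 11% in the overall count, which is the smaller and harder target for a CRR.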
Potential application of findings
There is considerable political and policy interest in a measure to reliably quantify and then track rates of harm in primary care records over time. The ideal attributes of such a measure are that it should be: relevant; valid; reliable; discriminative; credible; timely; feasible; accessible; and actionable. CRR has most of these attributes, but may be limited by feasibility concerns. Our findings are the first known attempt to quantify the minimum CRR parameter values that affect feasibility (e.g. number of practices reviewing records and number of records reviewed per practice) and may therefore help to inform the discussion and planning of health care policy makers and leaders who are interested in measuring harm in general practice.
While our findings suggest a single general practice cannot feasibly measure its rate of harm with acceptable precision or adequate power, we provided many CRR scenarios that would yield harm rate estimates with acceptable precision and adequate power if implemented at national or regional level, and a formula to test any other proposed CRR adaptations.
At national level, there are 1003 general medical practices in Scotland. Our findings suggest that if at least 300 practices each reviewed 25 records twice over a given period of time (say 12 months), the CRR sample would yield harm rate estimates with acceptable precision and would have adequate power to detect a 50% reduction in ‘any’ assumed baseline harm rate if it occurred during this period. Smaller changes in harm rates could be detected if every practice in Scotland participated, although engagement would likely have to be sought through contractual incentivisation.
Let us consider two examples at the regional level. Example one: A Scottish regional Health Board with 100 general medical practices aims to estimate its harm rate with acceptable precision. If it assumes a real (baseline) harm rate of 10 incidents/100 patients/year, our formula indicates that each practice will have to review 50 records to achieve this aim. If the Health Board assumes a lower harm rate of 5 incidents/100 patients/year or selects a less harm-prone patient population, each practice will have to review 100 records to achieve a harm rate estimate with acceptable precision. Example two: A Scottish regional Health Board wants to estimate the harm rate in its region, which has 57 general practices. If it assumes a baseline harm rate of 5 incidents/100 patients/year, each practice would have to review 150 records to estimate the harm rate with acceptable precision.
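The total workload implied by these regional examples can be tallied with a short sketch. The practice numbers, records per practice and assumed harm rates are taken from the examples above; the expected incident yield is a naive product (records × rate) that assumes a uniform rate and no inter-practice variation, so it is an illustration only and need not match the output of our formula, which is not reproduced here.

```python
def naive_yield(practices, records_per_practice, rate_per_100, years=1.0):
    """Return (total records, naive expected incidents) for a CRR scenario.

    Assumption of this sketch: every record yields incidents at the same
    rate, expressed as incidents per 100 patients per year.
    """
    total_records = practices * records_per_practice
    expected_incidents = total_records * rate_per_100 * years / 100
    return total_records, expected_incidents


scenarios = [
    ("Example one, rate 10", 100, 50, 10),
    ("Example one, rate 5", 100, 100, 5),
    ("Example two, rate 5", 57, 150, 5),
]
for label, practices, per_practice, rate in scenarios:
    records, incidents = naive_yield(practices, per_practice, rate)
    print(f"{label}: {records} records, about {incidents:.0f} expected incidents")
```

Tallying scenarios in this way gives a Health Board a quick view of the review burden before committing reviewer time.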
Measuring at regional and national level will require substantial investment in training and support, allocation of additional resources and protected time for clinician reviewers.
Strengths and limitations
Our findings were derived by aggregating the results of multiple simulated data sets for different CRR scenarios derived from predefined parameters and parameter values. Our assumptions about these parameter values were informed by practical experience and available literature. Given that the available evidence of harm prevalence and preventability varies widely, our choices of harm rates and potential reductions in harm are therefore likely to include overestimations of incidence and reductions.
Our statistical method allowed simulation of complex scenarios, but the data remain simulated and are at best a simplified and imprecise representation of reality. We accepted the principle that the same patient may suffer more than one incident during a review period. This meant that data had to be treated as ‘count’ rather than binary. The consequences were that harm rates had to be expressed as rates (i.e. incidents/100 patients/year) and not percentages, and sensitivity, specificity and predictive value could not be calculated. Potential inter-rater bias and intra-rater error (inconsistency) were accounted for by ‘including’ them as part of the inter-practice variation in harm rate. We assumed the same patients’ records were reviewed at the beginning and end of the study period. This reduced inter-patient variation and increased power.
We also identified a problem of substantial positive bias in harm rate estimates where there are high levels of inter-patient variation. The standard approach of quantifying and adjusting for inter-patient variation was not feasible due to the very low numbers of harm incidents in some CRR scenarios. These results suggest that estimates of harm rates from CRRs could contain unquantifiable upward bias due to unknown levels of inter-patient variation. This is a problem that will affect real studies and not an artefact of our analysis. It is a consequence of making estimates from multilevel data where the numbers of events are too small to allow the multilevel effects to be adjusted for. The sample sizes required to adjust for these effects were beyond the realistic range explored here and may be unfeasible. The implications of this inability to estimate random effects go beyond bias in harm rate estimates to scenarios where variation between practices is of primary interest rather than simply a parameter to be adjusted for. If the aim of CRR was to determine whether some practices have significantly higher harm rates than others, or if the harm rates of some practices are changing (increasing or decreasing) faster than others, considerably larger numbers of patient safety incidents would have to be detected than in our simulations. This would require increasing the number of records reviewed, lengthening the review period and/or selecting an unusually harm-prone population of patients.
We simulated CRR to detect changes over a single time period. In our scenarios power was maximised by reviewing records at only two time points - the beginning and end of a 12-month period. However, many patient safety programs may not be time-limited or will measure harm at multiple time points. The availability of data at additional time points will allow the detection of trends. Monte Carlo simulations could be used in future research to optimise experimental design for such longitudinal scenarios.
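The general idea of estimating power by Monte Carlo simulation for a two-time-point design can be sketched as follows. This is a heavily simplified illustration and not the statistical model used in this study: it draws total incident counts at baseline and follow-up as independent Poisson variables and omits inter-practice variation, inter-patient variation and the pairing of records. The function names, the z-test on the two totals, the number of simulations and the seed are all assumptions made for this sketch.

```python
import math
import random


def draw_poisson(rng, mean):
    """Draw one Poisson(mean) count by counting unit-rate arrivals in [0, mean]."""
    count = 0
    t = rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def simulated_power(expected_baseline, reduction, n_sims=2000, z_crit=1.96, seed=1):
    """Share of simulated reviews in which the fall in incidents is significant.

    expected_baseline: expected incidents detected at the first review.
    reduction: true proportional fall in the harm rate (e.g. 0.2 for 20%).
    """
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_sims):
        before = draw_poisson(rng, expected_baseline)
        after = draw_poisson(rng, expected_baseline * (1 - reduction))
        total = before + after
        if total == 0:
            continue  # no incidents at all: nothing to test
        # Two-sided z-test for equality of two Poisson totals.
        z = (before - after) / math.sqrt(total)
        if abs(z) > z_crit:
            significant += 1
    return significant / n_sims


for baseline, reduction in ((100, 0.5), (500, 0.2), (100, 0.2)):
    power = simulated_power(baseline, reduction)
    print(f"baseline {baseline}, reduction {reduction:.0%}: power ~ {power:.2f}")
```

Under these optimistic assumptions the sketch gives power above 0.8 for 100 baseline incidents with a 50% reduction and for 500 with a 20% reduction, and much lower power for 100 incidents with a 20% reduction. Adding practice-level and patient-level variation, or further time points, would require extending the data-generating step.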
The relationship between measurement and improvement, and the challenge of ‘getting one to follow the other’, has previously been described. We still do not know which interventions can successfully improve patient safety in general practice. What little evidence there is suggests successful interventions will likely require a multi-method approach, rigorous evaluation and small, local clinician-led pilots. Future research should therefore examine the utility of CRR as a learning and improvement tool, ‘…working on the nuts and bolts of how we turn measurement for improvement into tangible change in practice…’. Other potential research questions include: the effects of inter-patient and inter-practice variation on estimated harm rates; and what the ideal mixture of parameter values (number of practices, records reviewed in each practice and review time per record) is to detect the minimum number of harm incidents needed to ensure acceptable precision and adequate power. Finally, our statistical model and formula need to be validated further through practical application.