Addressing and correcting procedural inconsistencies
Because we recognized most of these procedural discrepancies among clinics early in the data collection process, we were able to address them early enough to have a beneficial effect on data quality. Our first step was to provide written clarification of any ambiguity in the manual of procedures, including a description of which stool and ruler to use for sitting height, the correct placement and tension of the measuring tape for waist circumference, and the importance of five minutes of rest before measuring blood pressure.
At the first refresher training workshop, we reviewed measurement procedures for sitting height, waist circumference, and blood pressure. We discussed the protocol and reinforced the correct procedures with the pediatricians in a large group session, followed by small-group practice sessions. At subsequent refresher training sessions, we revisited these topics to ensure that there was no lingering confusion.
Because some clinics had been using non-standard sitting stools, we also decided to measure the height of the stool used at each polyclinic. In cases where we confirmed that a clinic had used an incorrect ruler or study stool, we adjusted the sitting height measurements accordingly.
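The arithmetic behind such an adjustment can be sketched as follows. This is a minimal illustration only, under the assumption that a recorded sitting height was obtained by subtracting a presumed stool height from a floor-to-vertex reading; the function name, parameter names, and example values are hypothetical and do not describe the exact correction applied in the study.

```python
def adjust_sitting_height(recorded_cm, assumed_stool_cm, actual_stool_cm):
    """Correct a sitting height that was derived using the wrong stool height.

    Assumption (hypothetical, for illustration): the recorded value equals the
    floor-to-vertex reading minus the stool height the protocol presumed.
    """
    # Recover the raw floor-to-vertex reading from the recorded value.
    floor_to_vertex_cm = recorded_cm + assumed_stool_cm
    # Subtract the height of the stool that was actually used at the clinic.
    return floor_to_vertex_cm - actual_stool_cm


# Hypothetical example: the protocol presumed a 40 cm stool,
# but the clinic's stool was measured at 45 cm.
corrected = adjust_sitting_height(70.0, 40.0, 45.0)
```

Under this assumption, the correction is a constant shift per clinic, which is why measuring the stool at each polyclinic is sufficient to apply it.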
While the large difference in sitting height is easy to detect visually with box plots, a similar equipment error might result in a smaller absolute difference that would not be so apparent. Our experience with sitting height measurement provides a rather extreme example of differences in equipment or technique across centers that are likely very common, and may not be so readily detected if smaller in magnitude.
Following our retraining and clarification efforts at subsequent workshops, we also noticed a decrease in the ICCs for waist circumference and systolic blood pressure (Table 1). The ICC for waist circumference displayed a steady, gradual decline from a peak of 0.11 to 0.07, while the ICC for systolic blood pressure dropped from 0.18 to 0.12. However, the remaining variability was higher than we would expect, especially as we measured blood pressure in triplicate using a well-calibrated machine. Our results contrast with those of Vierron et al., who found that about 20% of the variability in SBP measurement with mercury sphygmomanometer and Doppler probe was attributable to differences between centers, but identified no center effect for measurements with a semiautomatic device [12]. We hypothesize that differences in the demeanor of the pediatricians and the setting of the polyclinics (e.g., the presence or absence of central heating) may explain the residual clustering we observed, but further evaluation including centralized measurement would be necessary to confirm this hypothesis.
Although we did identify the cause of the head circumference ICC increase, we were unable to clarify the procedure in time to result in a substantial decrease in the ICC value. By the time we discussed this measurement technique with the pediatricians, the two clinics that had been incorrectly measuring head circumference had completed the majority of their study visits, and we could not reliably adjust the incorrect measurements.
For measurements that are highly operator dependent, such as measurement of skinfold thicknesses and circumferences, it is likely that the reproducibility of each individual's measurements increases with experience. However, it is uncertain whether this increased precision would improve the accuracy of measurement and consistency across centers. Therefore, ongoing evaluation using ICCs, followed by retraining, is essential to ensure that measurement techniques are optimally consistent across sites.
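As an illustration of what such ongoing evaluation involves computationally, the sketch below estimates an ICC from measurements grouped by clinic using the standard one-way ANOVA estimator. The choice of estimator is an assumption made for illustration (mixed-model estimators are also common), and the function name and toy data are hypothetical.

```python
def anova_icc(groups):
    """One-way ANOVA estimate of the intraclass correlation coefficient.

    groups: dict mapping a clinic label to a list of numeric measurements.
    Requires at least two clinics and more observations than clinics.
    """
    sizes = [len(values) for values in groups.values()]
    k = len(sizes)                      # number of clinics
    n_total = sum(sizes)                # total number of measurements
    grand_mean = sum(sum(values) for values in groups.values()) / n_total

    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        clinic_mean = sum(values) / len(values)
        ss_between += len(values) * (clinic_mean - grand_mean) ** 2
        ss_within += sum((x - clinic_mean) ** 2 for x in values)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)

    # Effective cluster size, which allows for unequal numbers per clinic.
    n0 = (n_total - sum(n * n for n in sizes) / n_total) / (k - 1)

    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)


# Toy data (hypothetical): two clinics with clearly different means.
example_icc = anova_icc({"clinic_a": [1, 2, 3], "clinic_b": [4, 5, 6]})
```

This estimator can return a negative value when clinic means differ less than expected by chance; in practice such estimates are commonly reported as 0. Recomputing the estimate on the accumulating data at regular intervals shows whether retraining is reducing between-clinic variation.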
We suggest that investigators leading similar studies do their best to ensure that data entry and reporting to study investigators occur as quickly as possible, and that auditing and retraining workshops occur as soon as possible after data collection commences, optimally within about a month. While we recognized these precepts from the start of our study, the logistical challenges of leading a large study distributed across a country with transportation challenges and limited computer accessibility outside of the central data center in Minsk were such that we were unable to entirely eliminate measurement clustering. Centralized measurement would limit the influence of measurement clustering, but it was not feasible for participants to travel to a central site for research measurements in our study, and we suspect this would be true in most other studies as well.
While an ICC of 0 might be ideal for a characteristic that truly does not vary by cluster, it is likely that many characteristics vary between centers for reasons in addition to clustered variation in measurement technique [13]. For example, there may be geographic differences in the racial or ethnic compositions of populations served in different centers, or there may be regional differences in diet or other behaviors that lead to expected variation in characteristics such as weight or BMI. These differences are a particular concern in cluster-randomized trials, in which a limited number of often heterogeneous groups makes it difficult for randomization to distribute potential sources of confounding evenly [14]. In Belarus, the population is relatively homogeneous, which is reflected by the low ICCs (≤ 0.03) for many measurements such as height, weight, upper arm circumference, and bioimpedance. Although some guidelines are available for expected values of ICC, especially among studies of adults [14], little information is available for anthropometric values among children. Given the lack of regional variation in our study population, we considered an ICC to be higher than expected if it was above 0.05. However, in other settings in which regional variation is more likely, investigators will need to determine the appropriate threshold for investigation, perhaps based on instrument-measured characteristics or centralized measurement of selected variables.
Even after correcting errors in measurement techniques and achieving low ICCs, some clustering remains, as we found when we compared analyses of the effect of the intervention on sitting height with and without accounting for clustering. Therefore, even with excellent consistency of measurement resulting in low ICCs, it is still vital to account for clustering within center in statistical analyses.
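One way to see why a low ICC can still matter is the widely used design effect approximation for equal cluster sizes, 1 + (m - 1) × ICC, where m is the number of participants per cluster. The sketch below applies it; the cluster size and naive standard error are hypothetical values chosen for illustration, not figures from this study, and the approximation is not necessarily the method used in the analyses described here.

```python
import math


def design_effect(icc, mean_cluster_size):
    """Variance inflation due to clustering, assuming equal cluster sizes."""
    return 1.0 + (mean_cluster_size - 1.0) * icc


def clustered_standard_error(naive_se, icc, mean_cluster_size):
    """Inflate a standard error that was computed ignoring clustering."""
    return naive_se * math.sqrt(design_effect(icc, mean_cluster_size))


# Hypothetical illustration: an ICC of 0.03 with 101 participants per clinic.
deff = design_effect(0.03, 101)
adjusted_se = clustered_standard_error(0.5, 0.03, 101)
```

With these illustrative inputs the variance is inflated roughly fourfold, so the standard error approximately doubles even though the ICC is small. Large clusters amplify even modest within-center correlation, which is why analyses that ignore clustering can substantially overstate precision.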