To our knowledge, no other study has examined and compared the effect of calibration on inter-instrument reliability after applying unit-specific calibration factors to data obtained both in the laboratory and in the field.
As the primary finding, this study revealed that unit-specific calibration factors, which were shown to reduce inter-instrumental variability considerably in the experimental setup in the laboratory, should be considered rather ineffectual when applied to field data in children and adolescents. Furthermore, a significantly reduced inter-instrument reliability was observed over time post hoc in the MTI monitors, and a significantly increased (9.95%) mean acceleration response was observed in the batch of MTI instruments when compared to the CSA instruments. These findings should be interpreted in the light of several considerations.
General strengths and limitations
The strengths of the present study include the large number of accelerometers examined in a mechanical setup producing highly standardized reference acceleration values. As a limitation, however, the calibration machine in the laboratory offers only an isolated and standardized sinusoidal movement, which potentially affects the comparability of inter-instrument variability estimated from mechanical movements in the laboratory and inter-instrument variability experienced in the field, where instruments are exposed to complex human locomotion. Therefore, in order to improve the variation and complexity of movement in the mechanical setup, all instruments were calibrated in three different radius settings using four different frequencies, which produced a total of four different acceleration values.
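To illustrate how radius and frequency settings jointly determine the reference acceleration, the following minimal sketch assumes ideal sinusoidal motion, for which the peak acceleration is r(2πf)². The function name and the radius and frequency values are hypothetical and are not the settings used in this study.

```python
import math


def peak_acceleration(radius_m: float, frequency_hz: float) -> float:
    """Peak acceleration (m/s^2) of ideal sinusoidal motion x(t) = r*sin(2*pi*f*t)."""
    omega = 2.0 * math.pi * frequency_hz  # angular frequency in rad/s
    return radius_m * omega ** 2


# Hypothetical (radius in m, frequency in Hz) pairs, for illustration only.
settings = [(0.020, 1.5), (0.035, 2.0), (0.050, 2.5)]

for radius, frequency in settings:
    accel = peak_acceleration(radius, frequency)
    print(f"r = {radius:.3f} m, f = {frequency:.1f} Hz -> {accel:.2f} m/s^2")
```

Because acceleration scales with the square of frequency but only linearly with radius, different radius and frequency combinations can produce the same acceleration magnitude.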
Unit-specific acceleration response varies over time (intra-instrument variation), although intra-instrument reliability has been reported to be fairly good at any given time point [26, 27]. Therefore, the calibration factors estimated in this study will include residual unit-specific test-retest variation. In this study, unit-specific acceleration responses were assessed at three different time points during the period November 2003 to March 2004 in order to minimize the effect of intra-instrumental variation. Examining acceleration responses within a range of different accelerations in multiple units more frequently than was done in the present study becomes rather problematic if calibration is to be feasible in large-scale population studies, since the procedure requires significant time and manpower.
Because laboratory data were collected in parallel with field data, the results observed in the mechanical setup and during free-living conditions are more readily comparable.
The aim was to examine acceleration responses in the laboratory under standardized conditions where accelerometer outputs (counts·min⁻¹) were comparable to typical values obtained during free-living activities. Compared with validation studies in children, the outputs produced in the mechanical setup corresponded to locomotion speeds in the field of approximately 4.0 to 8.0 km·h⁻¹ (i.e. the range from walking to running). However, the absence of test conditions producing only very limited instrument output must be regarded as a limitation of the present study. This is underlined by the findings of Brage et al., who previously showed that the CSA accelerometer displays larger relative variability at very low accelerations. Brage and colleagues suggested that the poor reliability at very low accelerations may be explained by the dead band of the Actigraph (approx. 0.3 m·s⁻²) differing between units, meaning that different units have different lower thresholds at which they begin to register movement. Acceleration of the human body is expected to clearly exceed the dead band, and therefore, in relation to calibrated field data, the clinical significance of poor reliability at very low accelerations caused by the dead band is probably very limited. Nevertheless, different lower thresholds of registration might affect the number of valid days of measurement, since many research groups interpret long bouts of zero activity as non-monitored time. Furthermore, varying lower thresholds of registration will potentially influence the amount of time spent in sedentary and/or light intensity categories, depending on how cut points are used.
It might be speculated that numerous periods of zero activity (where children are not moving at all) will attenuate the potential impact of unit-specific calibration when applied to field data, because all instruments produce exactly the same output (i.e., zero) when exposed to no acceleration at all. Therefore, the effect of applying unit-specific calibration factors to field data representing the percentage of total registered time spent in high or vigorous activity levels, defined according to Trost et al., was analysed post hoc, with the unit-specific calibration factors applied separately to each downloaded epoch (data not shown). However, under these circumstances, where periods of zero activity are largely eliminated from field data, the exact same result was observed: calibration did not reduce random variation caused by inter-instrumental variability across the examined group of children and adolescents.
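The reasoning about zero-activity epochs can be illustrated with a minimal sketch. It assumes a purely multiplicative calibration factor, defined here as the batch mean laboratory response divided by the unit's own response. This definition, the function names, and all numbers are illustrative assumptions, not the procedure or data of the present study.

```python
from statistics import mean


def calibration_factors(lab_response: dict) -> dict:
    """Assumed definition: factor = batch mean response / unit response."""
    batch_mean = mean(lab_response.values())
    return {unit: batch_mean / resp for unit, resp in lab_response.items()}


def calibrate_epochs(epochs: list, factor: float) -> list:
    """Apply one unit's factor to each epoch count separately."""
    return [count * factor for count in epochs]


# Hypothetical laboratory responses (counts per minute) for three units.
lab = {"unit_A": 950.0, "unit_B": 1000.0, "unit_C": 1050.0}
factors = calibration_factors(lab)

# Hypothetical field epochs for unit_A, including periods of zero activity.
raw = [0, 0, 120, 0, 2400, 0]
calibrated = calibrate_epochs(raw, factors["unit_A"])

for before, after in zip(raw, calibrated):
    print(f"{before:>5} -> {after:8.1f}")
```

Under this assumption, zero-count epochs are unchanged by calibration, so the correction acts only on epochs in which movement was registered.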
The mechanical setup offers only isolated and standardized sinusoidal accelerations. However, children have been reported to be typically involved in many different activities, including different games, jumping, dancing, running, climbing, and biking, which introduce a wide range of frequencies and accelerations through more complex movements of the human body. These dissimilarities between types of movement, as well as biomechanical differences between subjects, even when involved in the same type of activity, might affect the comparability between inter-instrument variability characterized in the mechanical setup and in the field when assessing the complex and heterogeneous behaviour of human locomotion in children and young people. Even when examining reliability using a standardized treadmill protocol, Welk and colleagues found that Actigraph accelerometer counts for a standardized bout of activity can vary by 20% for participants wearing the same monitor and performing the same absolute workload. For example, differences in step frequencies have been reported to explain 11% and 40% of the speed-adjusted variance in Actigraph output in walking and running, respectively. Therefore, the effect of calibration on inter-instrumental reliability might be reduced by increased between-individual variation caused by differences in step frequencies during free-living conditions.
Furthermore, Brage et al. previously found inter-instrument differences to be heteroscedastic in response to the acceleration magnitude, which indicates that inter-instrument variability is related to the frequency and/or magnitude of movement. Similar findings have been reported by Jakicic et al., who found that inter-instrument reliability in the TriTrac-R3D accelerometer appeared to depend on the specific type of PA being assessed.
Optimal measuring axis of movement
When trying to achieve successful calibration, it is important to optimize the parallelism between the measuring axis of the instrument and the axis of movement actually experienced. When calibration and quality checks were performed in the laboratory, the instruments were attached to the plate of the calibration machine in a standardized manner in order to ensure that the registration of acceleration along the vertical axis was optimized. Ideally, every child would wear the accelerometer at the exact same angle in the field. However, even though participants were carefully instructed how to wear the accelerometer, rather individual attachments to the body must be expected, and instrument position might change as a result of loose attachment combined with body movements (which in the end will contribute to increased random variation in the field). The scope of this problem is illustrated by previous findings showing a reduced accelerometer output of 6%, 16%, and 29% when the optimal angle at the axis of measurement was reduced by 15°, 30°, and 45°, respectively, during standardized conditions in the laboratory.
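As a rough plausibility check on the magnitude of these reductions, the sketch below assumes an idealized single-axis sensor that registers only the cosine component of a vertical acceleration when tilted. This projection model and the function name are assumptions made for illustration; they are not the method of the cited study, whose reported values are reproduced only for comparison.

```python
import math

# Reductions in output (%) as quoted in the text for each tilt angle (degrees).
reported_reduction_pct = {15: 6.0, 30: 16.0, 45: 29.0}


def cosine_reduction_pct(tilt_deg: float) -> float:
    """Reduction (%) predicted by an idealized cosine projection model."""
    return (1.0 - math.cos(math.radians(tilt_deg))) * 100.0


for tilt, reported in reported_reduction_pct.items():
    predicted = cosine_reduction_pct(tilt)
    print(f"tilt {tilt:>2} deg: model {predicted:5.1f}%, reported {reported:4.1f}%")
```

The idealized model gives reductions of roughly 3.4%, 13.4%, and 29.3%, which are of the same order as the reported values, although it does not reproduce them exactly at the smaller angles.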
Agreement between raw and calibrated field data
The amount of variation introduced to field data after applying unit-specific calibration factors was estimated to be only 1.1% and 4.2% of the total amount of variation in HPA in children and adolescents, respectively. This amount of variation must be considered small, especially considering the size of the reproducibility coefficient (R) of a 4-day period previously observed in the children and adolescents examined in the present study. In children, R was found to be approximately 0.65, whereas R was found to be approximately 0.70 in adolescents (unpublished data).
The high correlation between raw and calibrated field data observed in the present study is probably explained by a combination of improved data quality due to the repeated quality checks and the presence of other major sources of variation (e.g. biological variation, day-to-day variation, seasonal variation, and poor compliance with correct mounting of the device on the body), meaning that the inter-instrumental variability will be relatively small when compared to the total amount of variation in field data.
In children who were measured with the MTI instruments, the Bland-Altman plot showed that the relative 95% limits of agreement between raw and calibrated instrument output in the field were approximately ± 5.5%. In adolescents measured with the CSA monitors, however, the relative limits of agreement showed that 95% of all subjects stayed within a wider range of approximately ± 13% when comparing raw and calibrated field output. A number of outliers caused the limits of agreement in adolescents to be slightly skewed and widened.
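For readers unfamiliar with relative limits of agreement, the following minimal sketch shows one common way to compute them. It assumes that each difference is expressed as a percentage of the mean of the raw and calibrated values and that the limits are the mean difference ± 1.96 standard deviations; the text does not state the exact formulation used, and the function name and data are hypothetical.

```python
from statistics import mean, stdev


def relative_limits_of_agreement(raw: list, calibrated: list) -> tuple:
    """Return (lower limit, bias, upper limit) in percent of the pairwise mean."""
    rel_diff = [
        100.0 * (c - r) / ((c + r) / 2.0)  # difference relative to the pair mean
        for r, c in zip(raw, calibrated)
    ]
    bias = mean(rel_diff)
    half_width = 1.96 * stdev(rel_diff)  # sample standard deviation
    return bias - half_width, bias, bias + half_width


# Hypothetical mean outputs (counts per minute) for five participants.
raw = [410.0, 520.0, 630.0, 705.0, 880.0]
calibrated = [402.0, 531.0, 641.0, 690.0, 899.0]

lower, bias, upper = relative_limits_of_agreement(raw, calibrated)
print(f"bias {bias:.2f}%, 95% limits of agreement {lower:.2f}% to {upper:.2f}%")
```

Because the limits depend on the standard deviation of the differences, a few participants measured with poorly performing units can widen them noticeably.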
Theoretically, ideal calibration factors applied to instruments with zero intra-instrument variation would cause inter-instrumental variability to disappear when examined under standardized conditions in the mechanical setup in the laboratory. However, even though inter-instrument variability was substantially reduced after applying the calibration factors, considerable inter-instrument variation was still observed under standardized conditions in the laboratory. Therefore, even though we would assume that participants whose activity level changed considerably after calibration actually achieved an HPA level closer to their "true" level, as if monitored with no measurement error at all, the calibration factors estimated and applied in this study will include residual unit-specific test-retest variation and could therefore, in theory, also add to the random variation.
Furthermore, we speculate that unit-specific calibration factors estimated in the laboratory do not fully reflect inter-instrument variations in the field. This, in combination with the presence of other major sources of variation, indicates that the Bland-Altman plot will capture only to a certain degree the "true" individual difference between raw field data and field data obtained without any measurement error. Nevertheless, the outliers that were observed are probably explained by repeated measurements with one or a few units with particularly poor reliability. This highlights the importance of performing continuous calibration checks according to an a priori limit of variability.
Changed inter-instrumental reliability over time
Significantly non-homogeneous standard deviations, increasing in size over time, were observed in the group of MTI instruments when exposed to standardized accelerations in the mechanical setup. This indicates a modestly reduced inter-instrument reliability throughout the data collection period. Although the SD increased slightly over time in the CSA instruments, no significantly heterogeneous pattern was observed. It should be noted that when the MTI instruments were calibrated the first time in November, no instrument had yet been sent into the field. Therefore, we speculate that the reduced reliability over time might partly be the result of mechanical wear on the cantilevered moving arm (the accelerometer sensor) caused by everyday movements and instrument shocks.
In the group of CSA instruments, inter-instrument variability was found to be rather high to begin with. However, it should be noted that the somewhat older CSA instruments had been used in another study before the first calibration was performed in November 2003. As time went by from November 2003 to March 2004, inter-instrument reliability in the MTI instruments was approaching the level of the CSA instruments.
Comparing acceleration responses between MTI and CSA instruments
The Actigraph count output has previously been found to increase as frequency decreases at a given acceleration [27, 28]. Therefore, the fact that the CSA instruments in the present study were exposed to slightly higher frequencies than the MTI instruments potentially challenges the validity of our results indicating a batch effect. However, when identical frequencies and acceleration magnitudes were applied post hoc in June 2004, the MTI instruments displayed a significantly (p < 0.001) increased acceleration response of 9.50% when compared to the group of CSA instruments. Indications of batch/lot effects have also previously been reported by Esliger et al., who compared mean accelerometer output in six testing conditions in a mechanical setup.
To test whether the different acceleration response observed between the batches of MTI and CSA instruments in the present study was due to the past use of the CSA instruments, the mean acceleration response was compared post hoc in the mechanical setup in 2006, immediately after all CSA units were calibrated according to the spinning procedure recommended by the manufacturer. Results revealed a significantly (p = 0.002) increased mean acceleration response of 10.7% in the MTI instruments, indicating that the difference observed three years earlier apparently mirrored a more universal disparity across the two generations of instruments.