Study approach
This study combined regression analysis and machine learning to develop the DETACH algorithm, which incorporates temperature rate-of-change, and then compared the algorithm with published non-wear detection methods to evaluate performance. Specifically, phase one of the study used linear regression to establish the viability of, and criteria for, temperature rate-of-change as a parameter for identifying device removal. Phase two used a decision tree classifier to determine the optimal series of features, and their respective thresholds, for determining the start and end of a non-wear period. Following this phase, edge cases identified from the training data were used to establish additional algorithm rules. Finally, outputs from the finalized DETACH algorithm and from the van Hees [11] and Zhou [25] algorithms were compared using accuracy, precision, recall, and F1 score.
Data source
Data were collected as part of the Remote Monitoring in Neurodegenerative Disease (ReMiNDD) study conducted by the Ontario Neurodegenerative Disease Research Initiative (ONDRI) [34], which included 39 participants with a confirmed diagnosis of cerebrovascular disease, Alzheimer’s disease/amnestic mild cognitive impairment, frontotemporal dementia, Parkinson’s disease, or amyotrophic lateral sclerosis (36% female; age range: 45–83 years). Detailed descriptions of the study participants and protocol are provided in [16].
Data collection
Briefly, the study consisted of a baseline clinic visit to Sunnybrook Hospital in Toronto, Canada, a 7-day remote monitoring period using wearable technology, and an in-person discharge visit. Data collection took place from May 2019 to March 2020. Participants were instrumented with five wearable devices located bilaterally on the wrists and ankles, and on the chest. Participants were asked to wear the devices 24 h per day except during bathing and swimming. Limb-worn devices were GENEActiv Original devices [35], which contain a tri-axial accelerometer, a near-body temperature sensor, and a light sensor. The limb devices were mounted on the wrists using rubber watch straps and on the ankles with custom-made, medical-grade wraps [36]. Accelerometer data were collected at a sampling rate of 75 Hz with a dynamic range of ±8 g (1 g = 9.81 m/s²) and temperature data at a sampling rate of 0.25 Hz. The temperature sensor was accurate to ±1 °C [37].
GENEActiv data processing
Data from one wrist device were used for each participant, since the wrist is a common wear location for activity and sleep studies (e.g., [38, 39]) and was used in the comparator studies [11, 25]. Specifically, the left wrist data were used, which represented the non-dominant limb for 95% of participants. Data consisted of six continuous 24-hour periods extracted from the 7-day protocol. For two participants, fewer than six 24-hour periods were available due to early discharge from the study (142/144 hours; 113/144 hours).
The raw GENEActiv accelerometer data were stored in standard gravity units for subsequent analysis. These data files were converted from compressed binary files into standardized European Data Format (EDF) using the Python package PyEDFlib [40] and stored by sensor type (tri-axial accelerometer, temperature, light) as part of a standard data management process. Data files were also cropped to the end of data collection (i.e., final device removal) as determined by study logs combined with visual inspection of the data. Due to the low sampling frequency of the temperature signal, temperature data were smoothed using a second-order low-pass Butterworth filter with a cut-off of 0.005 Hz prior to subsequent analysis, except when re-creating the algorithm developed by [25], where a moving average model was used to smooth the temperature data. No conditioning of accelerometer data took place, consistent with the work of both [11] and [25]. The temperature rate-of-change parameter used in the DETACH algorithm was a one-minute rate-of-change, calculated as the difference between smoothed temperature values one minute apart. To submit an equivalent number of datapoints for accelerometer and temperature data to the decision tree classifiers, a one-minute rolling standard deviation of the accelerometer data was computed, and this output was downsampled from 75 Hz to 0.25 Hz.
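As a rough illustration of the two derived signals described above, the following sketch computes a one-minute temperature rate-of-change and a one-minute rolling standard deviation that is then reduced to the temperature sampling rate. It is not the study code: the function names, the trailing (rather than centred) window, the use of the population standard deviation, and the choice to keep every 300th sample are all assumptions made for illustration. Butterworth smoothing is assumed to have been applied to the temperature values beforehand.

```python
import math
from collections import deque

TEMP_HZ = 0.25   # temperature sampling rate reported in the text
ACCEL_HZ = 75    # accelerometer sampling rate reported in the text


def one_minute_rate_of_change(temps, fs=TEMP_HZ):
    """Difference between (already smoothed) temperatures one minute apart, in deg C/min."""
    lag = round(60 * fs)  # 15 samples at 0.25 Hz
    # Assumption: samples with no value one minute earlier are left as None.
    head = [None] * min(lag, len(temps))
    return head + [temps[i] - temps[i - lag] for i in range(lag, len(temps))]


def rolling_std(values, window):
    """Trailing population standard deviation; None until the window is full."""
    out, buf = [], deque()
    total = total_sq = 0.0
    for v in values:
        buf.append(v)
        total += v
        total_sq += v * v
        if len(buf) > window:
            old = buf.popleft()
            total -= old
            total_sq -= old * old
        if len(buf) < window:
            out.append(None)
        else:
            mean = total / window
            # Clamp at zero to absorb small negative values from rounding.
            out.append(math.sqrt(max(total_sq / window - mean * mean, 0.0)))
    return out


def accel_std_at_temp_rate(axis, accel_fs=ACCEL_HZ, temp_fs=TEMP_HZ):
    """One-minute rolling SD of one accelerometer axis, kept at the temperature rate."""
    window = round(60 * accel_fs)     # 4500 samples at 75 Hz
    step = round(accel_fs / temp_fs)  # keep every 300th value (assumed alignment)
    return rolling_std(axis, window)[::step]
```

With these defaults, each retained accelerometer value lines up with one temperature sample, so both signals can be passed to a classifier as equal-length feature columns.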
Reference dataset
The reference dataset used to evaluate the DETACH algorithm was based on visual inspection of non-wear periods conducted independently by two expert analysts using raw temperature and accelerometer data, with each analyst assigned to review data from half of the participants. Non-wear detection criteria included the absence of acceleration and a sustained, decreasing temperature (non-wear start) and the presence of acceleration with an accompanying increase in temperature (non-wear end). These changes in temperature, both in rate and direction, were distinguished from temperature changes that can be associated with sleep (Fig. 1). Prior to inspecting and annotating the reference dataset, analysts were trained on a known dataset by independently identifying probable non-wear periods, then reviewing and resolving discrepancies with the assistance of the study team. The known dataset was obtained from 2-day collections using GENEActiv accelerometers worn on the wrist. Each participant completed a series of structured (within any two-hour window) and unstructured (time of their choosing) removals, varying in length from one to 15 minutes, with instructions to remove the device as needed outside of these procedures. Non-wear time accounted for an average of 6 ± 14% of the collection period with a median duration of 5 minutes (range: 1 to 596 minutes). Following training, analysts independently annotated the reference dataset, with any uncertainties resolved via consensus using device removal logs as reference. Removal logs were completed by study participants or their enrolled caregiver study partner, who were asked to record, without delay, which sensors were removed, the time of removal and re-attachment, and the reason for removal. Non-wear start and stop times were recorded with one-minute precision.
Comparison algorithms
For both the van Hees [11, 41] and Zhou [25] algorithms, each data point from the raw acceleration signal is classified as wear or non-wear. Only the Zhou algorithm [25] uses temperature in addition to acceleration. The van Hees algorithm [41] examines only acceleration, in 60-minute overlapping windows (15-minute steps, 45-minute overlap). Each window is classified as non-wear if the standard deviation of acceleration is less than 13 mg (1 mg = 0.00981 m/s²) or the acceleration range of that window is less than 50 mg in at least two of the three accelerometer axes. To remove implausible wear periods, a secondary condition is applied: a detected wear period shorter than six hours is reclassified as non-wear if it is less than 30% of the combined duration of the bordering non-wear periods, and a wear period shorter than three hours is reclassified if it is less than 80% of that combined duration [11].
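The primary window rule of the van Hees algorithm, as summarized above, might be sketched as follows. This is one reading of the description, not the published implementation: it assumes a window is non-wear when at least two axes fall below the standard deviation threshold or at least two axes fall below the range threshold. The function names are hypothetical, acceleration is assumed to be in g, and the secondary rule for implausible wear periods is not shown.

```python
import statistics

SD_THRESHOLD_G = 0.013     # 13 mg expressed in g
RANGE_THRESHOLD_G = 0.050  # 50 mg expressed in g


def window_is_non_wear(x, y, z):
    """Apply the SD and range criteria to one window of tri-axial data (in g)."""
    axes = (x, y, z)
    low_sd = sum(statistics.pstdev(a) < SD_THRESHOLD_G for a in axes)
    low_range = sum((max(a) - min(a)) < RANGE_THRESHOLD_G for a in axes)
    # Assumed interpretation: either criterion met on two or more axes.
    return low_sd >= 2 or low_range >= 2


def window_starts(n_samples, fs, window_min=60, step_min=15):
    """Start indices of 60-minute windows advanced in 15-minute steps."""
    window = int(window_min * 60 * fs)
    step = int(step_min * 60 * fs)
    return list(range(0, n_samples - window + 1, step))
```

Because consecutive windows overlap by 45 minutes, each sample is covered by up to four windows; how overlapping labels are reconciled is left to the original publications [11, 41].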
The Zhou algorithm [25] examines both temperature and acceleration over a one-minute moving window. Each window is classified as non-wear if the mean temperature is less than or equal to 26 °C and the standard deviation of acceleration for each of the three axes is less than 13 mg. A secondary condition is also used to identify non-wear when the temperature is below 26 °C but the accelerometer standard deviation criterion is not met. In this instance, if the temperature in the current window is lower than the mean temperature of the previous one-minute window, the current window is labeled non-wear.
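A minimal sketch of the two Zhou conditions described above is given below. The function name and the decision to label every remaining case as wear are assumptions made for illustration; the published algorithm [25] should be consulted for its full decision logic. Acceleration is assumed to be in g and temperature in °C.

```python
import statistics

TEMP_THRESHOLD_C = 26.0  # absolute temperature threshold from the text
SD_THRESHOLD_G = 0.013   # 13 mg expressed in g


def classify_window(x, y, z, temps, prev_mean_temp=None):
    """Label one one-minute window as 'wear' or 'non-wear'."""
    mean_temp = statistics.fmean(temps)
    still = all(statistics.pstdev(a) < SD_THRESHOLD_G for a in (x, y, z))

    # Primary condition: cool and motionless on all three axes.
    if mean_temp <= TEMP_THRESHOLD_C and still:
        return "non-wear"

    # Secondary condition: cool but moving, and cooler than the previous window.
    if (mean_temp < TEMP_THRESHOLD_C and not still
            and prev_mean_temp is not None and mean_temp < prev_mean_temp):
        return "non-wear"

    # Assumption: all other cases are treated as wear in this sketch.
    return "wear"
```

In use, the mean temperature of each window would be passed to the next call as `prev_mean_temp`.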
Establishing temperature rate-of-change as a non-wear feature
Phase one of algorithm development focused on characterizing the temperature dynamics associated with periods of sensor removal and subsequent donning. The decision to examine temperature rate-of-change as a parameter was driven by the observation that (a) there was often a delay between the absence of acceleration marking the potential start of a non-wear period and the point at which temperature reached the absolute threshold of 26 °C utilized by [25], and (b) there were cases when the absolute threshold was not met despite known periods of sensor removal. There was also concern that differences in seasonal weather or climate would affect the accuracy of an absolute threshold for temperature [25]. To explore the association between rate of temperature change and known non-wear periods, regression analyses were conducted using a training dataset (see below) with starting temperature as the independent variable and negative rate-of-change (°C/minute) at 1, 3, 5, and 10 minutes as the targets, using all non-wear periods within the training dataset (Fig. 2). At one minute, the mean negative temperature rate-of-change was indistinguishable from normal temperature variations (i.e., values near zero). At three, five, and ten minutes, however, temperature rates of change were appreciable. Based on these analyses, as detailed in the Results, temperature rate-of-change was deemed a viable feature for non-wear detection. Further, the five-minute window was selected because confidence in the classification increased with longer window lengths, but a substantial proportion (12%) of the non-wear periods were less than 10 minutes in duration.
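The regression described above can be illustrated with an ordinary least-squares fit of rate-of-change against starting temperature. The sketch below is an assumption-laden example rather than the study code: `rate_after` and `fit_line` are hypothetical helper names, and the rate at N minutes is assumed to be the temperature change over the first N minutes of a non-wear period divided by N.

```python
def rate_after(temps, start_idx, minutes, fs=0.25):
    """Mean rate-of-change (deg C/min) over the first `minutes` of a non-wear period."""
    lag = round(minutes * 60 * fs)  # samples spanned at the temperature rate
    return (temps[start_idx + lag] - temps[start_idx]) / minutes


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept, plus R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot if ss_tot else None
    return slope, intercept, r_squared
```

One fit would be run per window length (1, 3, 5, and 10 minutes), with one point per non-wear period: `x` holds the starting temperatures and `y` the corresponding rates.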
Non-wear algorithm development
The DETACH algorithm was designed to detect non-wear start and end times more accurately than the van Hees [41] and Zhou [25] algorithms, both of which classify individual windows of predetermined length as either wear or non-wear. To establish the optimal parameters for detecting the start and end of non-wear periods, an open-source classification and regression trees (CART) decision tree classifier [42] with a depth of three was used to determine the best series of true-false conditions, based on the given features, for classifying the data. Two separate decision tree classifiers were trained: one to detect the start of a non-wear period and another to detect the end. Both classifiers used the same features, which included temperature rate-of-change as established in phase one of this study, as well as the previously established parameters of absolute temperature and one-minute rolling standard deviation for each of the three accelerometer axes used in one or both comparison algorithms [25, 41]. The depth hyperparameter of three was validated for both decision trees using cross-validation on the training set (see Supplementary File 1).
Fifteen of the 39 participants (38.5%) were used to train the classifiers with the remaining 24 participants used for testing. Data within the training dataset were first prepared by labelling all points as: wear, non-wear start (first 10 minutes), non-wear middle (beyond the first 10 minutes), or non-wear end (the 10 minutes following the end of a non-wear period). Wear and non-wear start data subsets were input into the “non-wear start” classifier while non-wear middle and non-wear end data subsets were input into the “non-wear end” classifier. The training dataset (n = 15 participants) contained 75 non-wear periods while the testing dataset (n = 24 participants) contained 111 non-wear periods. The results of the decision tree classifiers were used to create rules for detecting non-wear.
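The labelling scheme described above could be sketched as follows, assuming samples at the 0.25 Hz temperature rate and non-wear periods given as half-open `(start, end)` index pairs. The function name, the index convention, and the rule that a non-wear label overrides a "non-wear end" label when two periods fall close together are assumptions for illustration only.

```python
def label_samples(n_samples, non_wear_periods, fs=0.25, edge_min=10):
    """Assign each sample one of four training labels."""
    edge = round(edge_min * 60 * fs)  # 150 samples for 10 minutes at 0.25 Hz
    labels = ["wear"] * n_samples

    # The 10 minutes of wear that follow each non-wear period.
    for _start, end in non_wear_periods:
        for i in range(end, min(end + edge, n_samples)):
            labels[i] = "non-wear end"

    # Samples inside each period: first 10 minutes versus the remainder.
    for start, end in non_wear_periods:
        for i in range(start, min(end, n_samples)):
            labels[i] = "non-wear start" if i - start < edge else "non-wear middle"
    return labels


# Label subsets passed to each classifier, as described in the text.
START_CLASSIFIER_LABELS = {"wear", "non-wear start"}
END_CLASSIFIER_LABELS = {"non-wear middle", "non-wear end"}
```

Filtering the feature rows by these two label sets yields the inputs for the "non-wear start" and "non-wear end" classifiers, respectively.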
Lastly, to establish the final set of rules for the DETACH algorithm, results of the decision tree analysis were supplemented based on a) edge cases observed within the training data (e.g., low temperatures observed at start of some non-wear periods) and b) parameters reported in previously published non-wear papers, specifically using more than one axis of acceleration in the parameter set [26].
Algorithm evaluation
Non-wear detection was compared for the DETACH algorithm and both the van Hees [11] and Zhou [25] algorithms using data from the 24 test participants. All algorithms were implemented using Python with one-second classification windows and then compared to the manually labelled non-wear periods from the reference dataset.
Accuracy (the fraction of correct predictions, both wear and non-wear, across all data), precision (the fraction of predicted non-wear time that was correctly identified), recall (the fraction of actual non-wear time that was correctly identified), and the F1 score (the harmonic mean of recall and precision) were computed for each of the three algorithms based on classification of algorithm predictions as true positive (TP), true negative (TN), false positive (FP), or false negative (FN), where TPs were correctly identified non-wear predictions and TNs were correctly identified wear predictions. Performance metrics were calculated for each participant using the formulae below:
$$\mathrm{Accuracy}=\left(\left(\mathrm{TP}+\mathrm{TN}\right)/\left(\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}\right)\right)$$
$$\mathrm{Precision}=\left(\mathrm{TP}/\left(\mathrm{TP}+\mathrm{FP}\right)\right)$$
$$\mathrm{Recall}=\left(\mathrm{TP}/\left(\mathrm{TP}+\mathrm{FN}\right)\right)$$
$$\mathrm{F}1\ \mathrm{score}=\left(2\times \left(\mathrm{precision}\times \mathrm{recall}\right)/\left(\mathrm{precision}+\mathrm{recall}\right)\right)$$
Outcomes were presented as an average with 95% confidence intervals. Analysis was conducted using Python (custom-coded or [42]).
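A minimal sketch of the per-participant metric calculation is shown below, with non-wear treated as the positive class. The function names are hypothetical, and the confidence interval uses a normal approximation (mean ± 1.96 standard errors), which is an assumption; the text does not state how the 95% confidence intervals were derived.

```python
import math
import statistics


def confusion_counts(predicted, reference):
    """Count TP, TN, FP, FN for boolean sequences where True means non-wear."""
    pairs = list(zip(predicted, reference))
    tp = sum(p and r for p, r in pairs)
    tn = sum((not p) and (not r) for p, r in pairs)
    fp = sum(p and (not r) for p, r in pairs)
    fn = sum((not p) and r for p, r in pairs)
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 score; undefined ratios become NaN."""
    nan = float("nan")
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else nan
    recall = tp / (tp + fn) if tp + fn else nan
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else nan
    return accuracy, precision, recall, f1


def mean_ci95(values):
    """Mean with a normal-approximation 95% confidence interval (assumed method)."""
    mean = statistics.fmean(values)
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - half_width, mean + half_width
```

Each metric would be computed once per participant and the resulting per-participant values passed to `mean_ci95` to summarize an algorithm.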