Investigating psychometric properties of the arm activity measure – Thai version (ArmA-TH) sub‐scales using the Rasch model

Background: This study investigated the ArmA-TH sub-scale measurement properties based on item response theory using the Rasch model. Methods: Patients with upper limb hemiplegia resulting from cerebrovascular and other brain disorders were asked to complete the ArmA-TH questionnaire. Rasch analysis was performed to test how well the ArmA-TH passive and active function sub-scales fit the Rasch model by investigating unidimensionality, response category functioning, person and item reliability, and differential item functioning (DIF) for age, sex, and education. Results: Participants had stroke or other acquired brain injury (n = 185), and the majority were men (126, 68.1 %), with a mean age of 55 (SD 22). Most patients (91, 49.2 %) had graduated from elementary/primary school. For the ArmA-TH passive function scale, all items had acceptable fit statistics. The scale's unidimensionality and local independence were supported. The reliability was acceptable. Disordered thresholds were found for five items, and no item showed DIF. For the ArmA-TH active function scale, one item showed misfit and three were locally dependent. The reliability was good. No items showed DIF. All items had disordered thresholds, and the data fitted the Rasch model better after rescoring. Conclusions: Both sub-scales of ArmA-TH fitted the Rasch model and were valid and reliable. The disordered thresholds should be further investigated.


Background
The Arm Activity Measure questionnaire (ArmA) is a twenty-item patient and/or carer-reported outcome measure of function of hemiparetic upper limbs developed in 2013 by Ashford et al. with the primary goal of addressing 'real-life' function, that is, day-to-day performance in the person's normal environment [1]. The unique characteristic of ArmA is its two separate constructs, as it has passive and active function sub-scales, for evaluating the most clinically relevant goals [2]. As there has never been an objective self-report measure for assessing hemiparetic upper limb function for patients in Thailand, ArmA was translated into the Thai language as ArmA-TH with a preliminary evaluation of psychometric properties, including a content validity index at both the item (I-CVI) and scale (S-CVI) levels, interrater reliability, and internal consistency [3]. However, neither the construct validity of the items nor a detailed item evaluation of ArmA-TH based on measurement theory was initially explored. According to Rasch measurement theory, an outcome measure scale should demonstrate unidimensionality (all items contribute to the same construct) and have no DIF (invariance across sub-populations) [4]. All these properties can be evaluated by conformity to the Rasch model, against which the original English version of the ArmA passive function sub-scale was evaluated in a UK sample [5].
In this study, we therefore aimed to examine the extent to which our data, from a Thai sample, fit the Rasch measurement model. The Rasch model is an item-response latent trait model: a probabilistic logistic model that predicts that the response to a particular item is influenced by the qualities of both the person and the item. The first key concept of the Rasch model is that it transforms non-linear raw scores into logit-scale measures, in which the locations (in logits) of both the person and the item are determined on the same interval scale. This interval scale differentiates between people in a way that adheres to the fundamental measurement principle, providing interval-level measurement as opposed to the ordinal scaling of the raw score [6].
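As an illustration of the logit scale described above, the following is a minimal sketch of the simplest (dichotomous) form of the Rasch model, not the partial credit model used in this study. The function names `rasch_probability` and `log_odds` and the example values are assumptions made for illustration only.

```python
import math


def rasch_probability(theta: float, b: float) -> float:
    """Probability of a positive response under the dichotomous Rasch model.

    theta: person location in logits (illustrative value, not study data)
    b:     item location in logits (illustrative value, not study data)
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def log_odds(p: float) -> float:
    """Convert a probability back to the logit (log-odds) scale."""
    return math.log(p / (1.0 - p))


# A person located exactly at the item location has a 50 % chance of success.
p_equal = rasch_probability(theta=0.0, b=0.0)

# The log-odds of success equal the person-item distance in logits.
distance = log_odds(rasch_probability(theta=1.5, b=0.5))
```

Because the log-odds depend only on the difference between person and item locations, both can be placed on one interval scale, which is the property the paragraph above relies on.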
The second concept is 'local independence' [4,6], which implies that there should not be any correlation between two items after the effect of the latent trait is conditioned out (the correlation of residuals should be zero) [7]. Violation of local independence can affect unidimensionality, and both local independence and the absence of DIF are important for differentiating individuals as a function of their latent trait scores.

Population
Patients with hemiplegic upper limb impairment resulting from stroke or other acquired brain injury who were receiving rehabilitation services in Chiang Mai, Thailand were asked to complete the ArmA-TH questionnaire in person. All patients were asked to give written informed consent before proceeding with the questionnaire. The patients were between 20 and 85 years of age, had Thai as their mother tongue, and had graduated from at least elementary school with the ability to understand Thai communication for daily activities. The patient characteristics recorded in this study were age, sex, hemiparetic side, diagnosis, education level, and ArmA-TH passive and active scores. Ethical approval for the research programme was received from the Research Ethics Committee of the Faculty of Medicine, Chiang Mai University, Thailand (ethics approval number REH-2558-03109).

Measure
The ArmA-TH is a twenty-item questionnaire for assessing the difficulty in functioning of a hemiparetic upper limb. There are seven items in the passive function sub-scale and thirteen items in the active function sub-scale. Each item is scored on a Likert scale from 0 (no difficulty) to 4 (unable to do task), so the passive function sub-scale scores range from 0 (high function) to 28, and the active function sub-scale scores range from 0 (high function) to 52 [2].
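The scoring rule above can be sketched as follows. This is a minimal illustration that assumes complete responses with no missing items; the function name `subscale_score` and the constants are hypothetical and are not part of the ArmA documentation.

```python
PASSIVE_ITEMS = 7    # items in the passive function sub-scale
ACTIVE_ITEMS = 13    # items in the active function sub-scale
MAX_ITEM_SCORE = 4   # 0 = no difficulty, 4 = unable to do task


def subscale_score(responses, n_items):
    """Sum the item responses of one sub-scale (assumes no missing items)."""
    if len(responses) != n_items:
        raise ValueError(f"expected {n_items} responses, got {len(responses)}")
    for r in responses:
        if r not in range(MAX_ITEM_SCORE + 1):
            raise ValueError(f"response {r!r} is outside 0..{MAX_ITEM_SCORE}")
    return sum(responses)


# Maximum possible raw scores follow directly from the item counts.
max_passive = subscale_score([MAX_ITEM_SCORE] * PASSIVE_ITEMS, PASSIVE_ITEMS)
max_active = subscale_score([MAX_ITEM_SCORE] * ACTIVE_ITEMS, ACTIVE_ITEMS)
```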

Analysis
Descriptive statistics were used to describe the demographic characteristics of the patients, presented as mean (SD). The ArmA-TH sub-scale scores are presented in terms of median (interquartile range).

Rasch analysis
A partial credit Rasch model was used for the analysis, and the following criteria were investigated [8,9].
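For readers unfamiliar with the partial credit model, the category probabilities for a single item can be sketched as below. This is a minimal illustration of the standard model formula, not the estimation software used in this study; the function name and the threshold values are assumptions chosen for the example.

```python
import math


def pcm_category_probabilities(theta, thresholds):
    """Category probabilities for one item under the partial credit model.

    theta:      person location in logits
    thresholds: item-specific threshold locations in logits, one per step
                between adjacent categories (4 thresholds for 5 categories)
    """
    numerators = [1.0]  # category 0: exp of an empty sum
    cumulative = 0.0
    for delta in thresholds:
        cumulative += theta - delta          # running sum of (theta - delta_j)
        numerators.append(math.exp(cumulative))
    total = sum(numerators)
    return [n / total for n in numerators]


# Hypothetical ordered thresholds for a 0-4 item, evaluated at theta = 0.
probs = pcm_category_probabilities(0.0, [-1.5, -0.5, 0.5, 1.5])
```

With a single threshold the expression reduces to the dichotomous Rasch model, which is why the partial credit model is regarded as its polytomous extension.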
(1) Unidimensionality. Two methods were used to evaluate unidimensionality. First, the variance explained by the first principal component of the residuals (first construct) should be no more than 15 %, or its eigenvalue should be less than 2 [10]. Second, item fit was assessed using outfit and infit mean-square (MNSQ) statistics. The outfit mean-square is calculated by averaging the squared standardized residuals for each item across all persons, whereas the infit mean-square weights the squared residuals by their variances before averaging. The outfit MNSQ and infit MNSQ should lie between 0.70 and 1.50 [10]. In addition, both the correlation between the two sets of person measures and that correlation disattenuated for measurement error should be greater than 0.7 to indicate unidimensionality.
(2) Local independence. To satisfy local independence, no pair of items should have an inter-item residual correlation higher than 0.2 [11].
(3) Reliability. Two kinds of reliability are evaluated by Rasch analysis: person reliability and item reliability. Person reliability is interpreted as the ability of the scale to reliably rank persons by their location on the scale of the measure. It is similar to Cronbach's alpha, but its value is often lower because it does not include extreme scores. The item reliability coefficient reflects the extent to which the item hierarchy is replicable with a different set of individuals. A reliability coefficient of > 0.70 is considered acceptable for persons, and a coefficient of ≥ 0.80 is considered acceptable for items.
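The fit statistics and the disattenuated correlation referred to under criterion (1) can be sketched as follows. This is a minimal illustration assuming that the model-expected scores and model variances for each person-item response are already available from a fitted Rasch model; the function names are hypothetical, and the disattenuation uses the standard correction for attenuation rather than any software-specific routine.

```python
import math


def outfit_mnsq(observed, expected, variance):
    """Outfit: unweighted mean of squared standardized residuals for one item."""
    z_squared = [(x - e) ** 2 / w
                 for x, e, w in zip(observed, expected, variance)]
    return sum(z_squared) / len(z_squared)


def infit_mnsq(observed, expected, variance):
    """Infit: information-weighted mean square (sum of squared residuals
    divided by the sum of the model variances)."""
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    return sum(squared_residuals) / sum(variance)


def within_fit_range(mnsq, low=0.70, high=1.50):
    """Apply the 0.70-1.50 acceptance range used in this study."""
    return low <= mnsq <= high


def disattenuated_correlation(r, reliability_a, reliability_b):
    """Correct an observed correlation for measurement error in both measures."""
    return r / math.sqrt(reliability_a * reliability_b)


# Hypothetical dichotomous responses from four persons to one item.
obs, exp_scores, var = [1, 0, 1, 0], [0.5] * 4, [0.25] * 4
outfit = outfit_mnsq(obs, exp_scores, var)
infit = infit_mnsq(obs, exp_scores, var)
```

Because outfit is unweighted, it is more sensitive to unexpected responses from persons located far from the item, whereas infit emphasises responses from persons located near the item.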
Response category functioning. Ordered categories and thresholds are expected for measurement; that is, the thresholds between adjacent categories should follow the same order on the latent trait as the categories themselves [12]. Items with disordered thresholds between categories can be identified from category probability curves, and the fit of each response category is examined; category fit values less than 2.0 are acceptable [8]. Items that exhibited disordered thresholds were rescored by collapsing adjacent categories, and a reanalysis was performed to check whether the rescored data showed a better fit to the model.
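The rescoring step can be sketched as a simple recoding of the five response categories into three. The two mappings below mirror the collapsing schemes reported in the Results (0 + 1, 2, 3 + 4 for the passive sub-scale and 0, 1 + 2, 3 + 4 for the active sub-scale); the names and the threshold values in the example are illustrative assumptions, and the sketch does not re-estimate the model.

```python
# Collapsing schemes, expressed as original category -> rescored category.
PASSIVE_RECODE = {0: 0, 1: 0, 2: 1, 3: 2, 4: 2}   # 0 + 1, 2, 3 + 4
ACTIVE_RECODE = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2}    # 0, 1 + 2, 3 + 4


def collapse_categories(responses, recode):
    """Rescore a list of 0-4 responses using the given collapsing scheme."""
    return [recode[r] for r in responses]


def thresholds_ordered(thresholds):
    """True if each threshold lies above the previous one on the logit scale."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))


rescored_passive = collapse_categories([0, 1, 2, 3, 4], PASSIVE_RECODE)
rescored_active = collapse_categories([0, 1, 2, 3, 4], ACTIVE_RECODE)

# Hypothetical threshold estimates: the second set is disordered because
# the third threshold falls below the second.
ordered = thresholds_ordered([-1.0, 0.0, 1.0])
disordered = thresholds_ordered([-1.0, 0.5, 0.2])
```

After recoding, the model would need to be re-estimated on the rescored data before fit, reliability, and threshold ordering can be compared with the original analysis.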
(4) Targeting of persons, items, and item hierarchy. Acceptable item-test targeting for compliance with the Rasch model is evaluated through the closeness of the mean of the persons and the mean of the items on the Wright map (no more than 1 logit apart) [13]. The item hierarchy indicates how well the ordering of items by difficulty matches the intentions of the instrument developer and the expectations of those planning to use the test results [14].
(5) DIF for age, sex, and education. An ideal item is one with invariant measurement properties across subgroups, meaning that item calibration should be the same in different subgroups of people [8]. A DIF contrast of less than 0.64 logits was considered acceptable, whereas a statistically significant contrast of 0.64 logits or more was taken to indicate moderate-to-large DIF [6]. In this study, DIF due to age, sex, and education was examined.
Both the ArmA-TH passive and active function sub-scales were separately evaluated for fit to the Rasch model.
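The size criterion for DIF can be sketched as below. This is a minimal illustration that assumes item difficulty has already been estimated separately in two subgroups; it checks only the magnitude of the contrast against 0.64 logits and does not perform the accompanying significance test. The function names and example values are hypothetical.

```python
DIF_THRESHOLD = 0.64  # logits; magnitude criterion used in this study


def dif_contrast(difficulty_group_a, difficulty_group_b):
    """Difference in item difficulty (logits) between two subgroups."""
    return difficulty_group_a - difficulty_group_b


def exceeds_dif_threshold(contrast, threshold=DIF_THRESHOLD):
    """True if the absolute contrast reaches the moderate-to-large range.

    The statistical significance of the contrast must be assessed
    separately; it is not covered by this sketch.
    """
    return abs(contrast) >= threshold


# Hypothetical difficulty estimates for one item in two subgroups.
small = dif_contrast(0.30, 0.10)   # 0.20 logits apart
large = dif_contrast(1.20, 0.40)   # 0.80 logits apart
```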

Results
A total of 185 patients participated in the questionnaire evaluation. The majority were men (126, 68.1 %), with a mean age of 55 (SD 22). The hemiparesis resulted from haemorrhagic stroke (81, 43.8 %), ischemic stroke (78, 42.1 %), traumatic brain injury (24, 13.0 %) and other causes (2, 1.1 %). Most patients (91, 49.2 %) had graduated at the elementary/primary school level, followed by secondary school level (40, 21.6 %) and vocational or high vocational certificate (28, 15.1 %); the smallest group had a bachelor's degree or above (26, 14.1 %). The ArmA-TH passive function sub-scale scores ranged from 0 to 28, covering the total range from minimum to maximum score. The ArmA-TH active function sub-scale scores ranged from 0 to 49, almost reaching the maximum score of 52. Details are shown in Table 1.

ArmA-TH passive function
For the ArmA-TH passive function sub-scale, the fit statistics ranged from 0.73 to 1.31, indicating that all items fitted the Rasch measurement model. Principal component analysis of the residuals showed a first eigenvalue of 1.73 (11.5 %), which supported unidimensionality, and the standardized residual correlations were less than 0.3, indicating local independence. Figure 1 illustrates the plot of item loadings on the first factor extracted from the residuals, which separated the items into three clusters (1, 2, 3…). The plot graphically maps the loadings of items in the off-target dimension, with item loadings at the two ends of the plot. The loadings of item 2 (0.69), item 3 (0.66), and item 1 (0.15) appeared to form another dimension, in contrast to item 5 (-0.69), item 7 (-0.56), and item 4 (-0.10). However, the contrast of the item loadings in the passive function scale was not strong (eigenvalue < 2), so a second dimension is unlikely. Moreover, the disattenuated correlation between clusters of passive items approached 1.000, indicating that the two clusters of items were measuring the same thing. The person reliability was acceptable (0.70), while Cronbach's alpha was 0.83. The item reliability was excellent (0.97). No disordered category was found; however, disordered thresholds were found for item 1 (Cleaning palm), item 2 (Cutting fingernails), item 3 (Putting on a glove), item 6 (Put on a splint), and item 7 (Positioning arm on a cushion or support in sitting) (Table 2). The ArmA-TH passive function sub-scale seemed not to be well targeted in this sample, as the difference between the mean item and mean person logits was more than 1 SD. No item bias or DIF was found for the ArmA-TH passive function sub-scale.
Based on the disordered threshold of each item (not shown here), it was suggested that categories 0-1 and 3-4 be collapsed, which reduced the responses from 5 to 3.
Reanalysis after rescoring from 5 to 3 response options (0 + 1, 2, 3 + 4)
The eigenvalue of the first construct was reduced to 1.52 (12.6 %). The infit and outfit MNSQ ranged from 0.82 (item 2) to 1.41 (item 5) (Table 3). The disattenuated correlation between person measures was 1.00, and no local dependence was found, all of which suggested unidimensionality. The person reliability was reduced to 0.51. Cronbach's alpha was 0.82. The item reliability was excellent (0.96). No disordered category threshold was found after this reanalysis.

ArmA-TH active function
For the ArmA-TH active function sub-scale, all items except item 7 fell within the acceptable range of fit indices, implying that item 7 could distort the Rasch measurement model (Table 2). The principal component analysis of residuals showed a first construct with an eigenvalue of 2.57 (8.9 %), suggesting a violation of unidimensionality. The standardized residual correlations were 0.59 between items 12 and 13, 0.52 between items 11 and 12, and 0.45 between items 11 and 13, indicating local dependence that could be a source of another dimension (Table 2). Figure 2 illustrates the plot of item loadings on the first factor extracted from the residuals, which separated the items into three clusters. The plot graphically maps the loadings of items in the off-target dimension, with item loadings at the two ends of the plot. The disattenuated correlation approached 1.000; therefore, the person measures from the two clusters of items were statistically the same, indicating that the two clusters of items were measuring the same thing. To put it another way, the secondary dimension underlying the first contrast was not a separate dimension but a strand of the same construct. The person reliability was acceptable (0.77), while Cronbach's alpha was 0.85 and the item reliability was excellent (0.99). A disordered category was found in item 10, and disordered thresholds were found in all items. The ArmA-TH active function sub-scale did not appear to be well targeted in this sample, as the difference between the mean item and mean person logits was more than 1 SD. No item bias or DIF was found in the ArmA-TH active function sub-scale for age, sex, or education level.
Table footnotes: (a) Asks about 'caring' for your affected arm either yourself with your unaffected arm or by a carer or a combination of both of these. This section does not ask about using your affected arm to complete any of the tasks. (b) Asks what you can do with your affected arm or using both arms.
Reanalysis after rescoring from 5 to 3 response options (0, 1 + 2, 3 + 4)
The eigenvalue of the first construct was reduced to 2.23 (8.5 %). The infit and outfit MNSQ ranged from 0.32 (item 11) to 1.67 (item 8) (Table 3). Notably, the outfit MNSQ of item 7 was reduced, whereas that of item 8 increased. The standardized residual correlations were 0.28 between items 12 and 13, 0.33 between items 11 and 12, and 0.61 between items 11 and 13, indicating some local dependence. However, the disattenuated correlation between person measures was 1.00, suggesting sufficient unidimensionality. The person reliability increased to 0.71, Cronbach's alpha was 0.84, and the item reliability was excellent (0.98).
The transformation of raw scores to Rasch-scaled scores using the original 5 response options is illustrated in Table 4. Ideally, the ArmA raw score should be converted to the Rasch-scaled score using the users' own data. However, this conversion to the logit scale should be applicable to situations where the data exhibit a similar fit to the model.

Discussion
This study aimed to explore the measurement and scaling properties of the ArmA-TH using Rasch analysis in patients with hemiparetic upper limbs. Our findings confirmed the unidimensionality of both the passive and the active function sub-scales. We found disordered thresholds in the same items as did Ashford et al. (except for item 2). Although rescoring seemed to make the data fit the Rasch measurement model, it risks reducing person reliability in the passive sub-scale, which has fewer items than the active sub-scale. However, some investigators have been less concerned by disordered thresholds because they do not impact construct validity [8].
For the active function sub-scale, we found that the original data did not fit the Rasch measurement model as well as the passive function sub-scale data did. We assume that the poor fit comes from two sources. First, some items do not contribute sufficiently to the construct. Four items were identified as problematic. Item 7 'Hold an object still while using unaffected hand' did not contribute to the same construct as the other items. The high misfit value indicated that this item was not productive, albeit not harmful to the overall scale. Although items 11, 12, and 13 were dependent on each other to the extent that they could form a second dimension, the disattenuated correlation between person measures on the two item clusters suggested that they measured the same thing.
Second, all items in the active function sub-scale were found to have disordered thresholds; rescoring from 5 to 3 response options improved the fit to the Rasch model to an acceptable level. A possible reason for the disordered thresholds is limited comprehension of the rating scale by stroke patients due to cognitive impairment, which is found in 20-80 % of post-stroke patients and is present as early as 3-6 months after stroke onset [15,16]. The previous study showed that the active function scale fitted a non-parametric item response theory model (Mokken analysis) [2], whereas the present study, which used the stricter Rasch model, did not find a good fit. This means that using a sum score to produce a person measure on an interval scale may not be completely accurate. Using a Rasch model creates an opportunity to identify potentially problematic items.
Further exploration of category and threshold adjustment should be carefully considered, particularly for the passive function sub-scale, which had fewer items, rendering a low level of person reliability. While related studies have shown that a disordered threshold does not cause much of a problem for the model compared with misfitting items [17,18], we found that rescoring seemed not to improve the fit to the model. Therefore, we preferred keeping the original 5 response options. One limitation of the study was that a larger sample size, e.g. 400, as recommended by experts [19], is still needed for further analysis to ascertain the fit of the data to the Rasch model.
Another limitation was that some participants required assistance to read the questionnaires because of visual or physical impairment. The assistance might have influenced their responses or interfered with their freedom to respond.

Conclusions
According to the results of the Rasch analysis, data from both the ArmA-TH active and passive function sub-scales fit the Rasch model. Even though item 7 of the active function sub-scale seemed to present an extra challenge for the Rasch model, the item was not considered harmful to the overall measurement, provided useful clinical information, and was therefore retained. It is worth noting that a better fit to the model was observed when the item responses were rescored from 5 to 3 categories. Rescoring the item responses to fewer than 5 categories should be considered in future evaluations of ArmA-TH. Poor targeting in this sample implied that easier items assessing arm function should be added [2,19].