Test equating sleep scales: applying the Leunbach’s model
BMC Medical Research Methodology volume 19, Article number: 141 (2019)
Abstract
Background
In most cases, the total scores from different instruments assessing the same construct are not directly comparable, but must be equated. In this study we aimed to illustrate a novel test equating methodology applied to sleep functions, a domain in which few score comparability studies exist.
Methods
Eight scales from two cross-sectional self-report studies were considered, and one scale was common to both studies. The International Classification of Functioning, Disability and Health (ICF) was used to establish content comparability. Direct (common persons) and indirect (common item) equating was assessed by means of Leunbach’s model, which equates the scores of two scales depending on the same person parameter, taking into account several tests of fit and the Standard Error of Equating (SEE).
Results
All items were linked to the body functions category b134 of the ICF, which corresponds to ‘Sleep functions’. The scales were classified into three sleep aspects: four scales were assessing mainly sleep disturbance, one quality of sleep, and three impact of sleep on daily life. Of 16 direct equated pairs, 15 could be equated according to Leunbach’s model, and of 12 indirect equated pairs, 8 could be equated. Raw score conversion tables between each of these 23 equated pairs are provided. The SEE was higher for indirect than for direct equating. Pairs measuring the same sleep aspect did not show better fit indices than pairs from different aspects. The instruments mapped to a higher order concept of sleep functions.
Conclusion
Leunbach’s equating model has been successfully applied to a functioning domain little explored in test equating. This novel methodology, together with the ICF, enables comparison of clinical outcomes and research results, and facilitates communication among clinicians.
Background
To measure functioning, several instruments are commonly available. Typically, one clinician or researcher uses one instrument while another uses another instrument, both of which measure the same concept. However, the scores of the two instruments cannot be directly compared as they may have different operational ranges, or measure different levels of the concept, such that their total scores are on different ordinal metrics. This restricts their comparability and impedes research and communication among clinicians.
To be able to compare scores from different instruments, they must be equated. Equating can be defined as a statistical process used to adjust scores on test forms so that the scores can be used interchangeably [1]. While various equating techniques were developed during the twentieth century, it was not until the 1980s that they became popular [1]. Equating techniques include linear, equipercentile, and Modern Test Theory (MTT) methods. They all can be used to equate scores in different data collection designs, such as those in which two or more instruments are administered to the same group of persons (known as common persons design), or those in which common items are found across different studies (known as common items design).
In linear equating, the standardized deviation scores of the two forms are set to be equal by means of a linear conversion [2]. However, the formula converting scores from one form to the other may be nonlinear. Equipercentile equating admits nonlinear relationships —it identifies scores on one form that have the same percentile ranks as scores on the second form— but it assumes that the scores are continuous when they are usually discrete. Although the data can be made continuous [3, 4], equipercentile relies on observed scores. MTT methods —including Item Response Theory (IRT) [5] and Rasch Measurement Theory (RMT) [6]— assume that a common latent variable lies behind responses to the items of the instruments. MTT refers to the outcomes of the latent variable as person parameters and regards an estimate of the person parameter as a measure of the respondent’s ability or trait level. IRT and RMT share a number of assumptions, including: unidimensionality, monotonicity of item characteristic functions, local independence, and no Differential Item Functioning (DIF) [7]. Testing these assumptions adds strength to the equating process, because it is possible to test that the scores of the two instruments to be equated actually measure the same construct. This test adds evidence to one of the requirements of test equating, namely construct equivalence [8].
In IRT models, the person estimate is a complex function of patterns of item responses. In RMT the situation is much simpler and better suited to test equating, because raw scores map one-to-one onto estimates of the person parameter. This follows from the statistical sufficiency of the raw score in RMT, which permits conditional inference that requires no assumptions about person distributions and sampling, and which yields specifically objective measurement [6]. Hence, IRT and RMT are different paradigms within MTT [9].
MTT methods have been applied to create a common metric in health domains such as depression [10,11,12], anxiety [13], pain [14], or physical functioning [15, 16]. Furthermore, Andrich [17] presented an application of the polytomous Rasch model in equating two instruments intended to assess the same trait treating the total scores of two instruments as partial credit items from a test with two items. This approach has been employed in the health literature [18] together with the International Classification of Functioning, Disability and Health (ICF) [19] for conceptual matching. The polytomous Rasch models used in these studies are formally the same as the model described by Leunbach [20] used in this paper.
Gustav Leunbach developed the model in 1976 to assess whether two instruments measure the same trait, relating their total scores to a common scale [20]. This model is supported by a sound statistical theory on conditional inference, as well as the property of raw score sufficiency [21]. The Rasch model also possesses these properties. Although Leunbach’s model seems promising in test equating, it has rarely been applied, probably because it has gone unnoticed by the scientific community. Andrich [17] acknowledged that the model he uses came from Leunbach’s report [20], but apart from this, Leunbach’s report has rarely been cited and, as far as we are aware, the model had not been implemented in any software until recently. Hence, it can be considered a ‘novel’ methodology, which we wanted to rediscover by applying it to a functioning domain. We considered sleep functions as a case in point because this domain has been little explored in the field of equating.
Thus, the objective of our article is to illustrate an application of a novel methodology for equating functioning scales. Specifically, we aim (1) to rediscover Leunbach’s model and its properties, (2) to show how to interpret tests of fit and precision to decide on the adequacy of the equating, and (3) to apply the model to a domain little explored in the field of equating in health.
Methods
Sample and instruments
Secondary data were analysed from two cross-sectional self-report studies: Trajectories of Outcome in Neurological Conditions (TONiC), and Patient-Reported Outcomes Measurement Information System (PROMIS). These studies were chosen because they were available at the time when the current project was designed and they were suitable for secondary analyses.
The TONiC study (https://tonic.thewaltoncentre.nhs.uk/) examines the factors that influence quality of life in patients with neurological conditions. The sample for the current study consists of a cohort of patients with clinical definite Multiple Sclerosis from consecutive individual outpatient attendances in three neuroscience centres in the UK. The data were collected over the first 12 months of study recruitment, and the participants received a questionnaire pack including sleep instruments. The study had approval from the local research ethics committees. All subjects received written information on the study and gave written informed consent prior to participation [22].
The PROMIS initiative (http://www.healthmeasures.net/explore-measurement-systems/promis) aims to build item pools and develop core questionnaires that measure key health outcome domains including sleep [23]. The sample for the current study consists of an internet (YouGov Polimetrix, https://today.yougov.com/solutions/overview) sample and a clinical sample [24]. The latter included patients recruited from sleep medicine, general medicine, and psychiatric clinics at the University of Pittsburgh Medical Center.
The Epworth Sleepiness Scale (ESS) [25] was common to both studies. The Medical Outcomes Study (MOS) [26] and the three subscales of the Neurological Sleep Index (NSI) [22] (Diurnal sleepiness, Non-restorative nocturnal sleep, and Fragmented nocturnal sleep) were only present in TONiC. The Pittsburgh Sleep Quality Index (PSQI) [27], PROMIS-Sleep Disturbance [28], and PROMIS-Sleep Related Impairment [28] were only present in PROMIS.
The ESS and the PSQI are the most widely used scales in sleep medicine. However, new generic (PROMIS) and disease-specific scales are emerging, and a set of these was available from the PROMIS and TONiC studies, with the ESS as the link. We therefore took advantage of having eight sleep instruments available, on the view that equating all of them pairwise would be of interest to researchers and clinicians. The eight sleep instruments, and the study in which each instrument was administered, are described in Table 1.
International Classification of Functioning, Disability and Health (ICF)
The ICF is an international standard offering a common language to describe functioning [19]. It is based on the integrative biopsychosocial model of functioning, disability and health of the World Health Organization [19]. Body functions (‘b’), Body structures (‘s’), Activities and Participation (‘d’), and Environmental factors (‘e’) are classified using an alphanumeric system. Second-, third-, and fourth-level categories are found under each letter, so that, for example, under the second-level category b134 Sleep functions, seven third-level categories exist: b1340 Amount of sleep, b1341 Onset of sleep, b1342 Maintenance of sleep, b1343 Quality of sleep, b1344 Functions involving the sleep cycle, b1348 Sleep functions, other specified, and b1349 Sleep functions, unspecified.
One of the key requirements of test equating is construct equivalence [1]. In health, when equating scales in functioning domains, it is recommended to first link the items from the different tests to the ICF so that content comparability among the scales can be established and thus satisfy the requirement of equivalent constructs. In addition, the International Standards Organization [29] has prescribed the ICF as the framework for cataloguing health in e-health informatics (the concept of health is based on the health components of the ICF). Consequently, the use of ICF codes is twofold in the current study: (1) to ensure concept comparability, a prerequisite for test equating, and (2) to lay a marker for the future when e-health informatics will be at the forefront of data management techniques in health care.
Hence, the items from all the scales were linked to the ICF. Two researchers independently linked the items to ICF categories using the latest ICF linking rules [30], and then discussed any disagreements to reach a final solution. As suggested by Stucki et al. [31], the ICF Core Set for sleep disorders [32] was taken into account.
Leunbach’s model
Leunbach [20] used a Power Series Distribution depending on an underlying common latent trait to relate the total scores of two instruments to a common scale. A Power Series Distribution [33] is a discrete probability distribution over non-negative integers of the form
\( P\left(X=x\right)=\frac{\xi^x{\gamma}_x}{\sum_h{\xi}^h{\gamma}_h} \)  (2.1)
where the probability of obtaining a score x depends on a person parameter ξ and several score parameters γ_{x}. For each score x, a score parameter is estimated.
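As a minimal sketch of this distribution, the probabilities for scores 0 to m can be computed directly from a person parameter and a vector of score parameters. The function name `power_series_pmf` and the numeric values are hypothetical and chosen only for illustration; they are not estimates from the study.

```python
def power_series_pmf(xi, gamma):
    """Return P(X = x) for x = 0..m, where gamma[x] is the score parameter of x.

    Each probability is xi**x * gamma[x], normalised over all possible scores.
    """
    weights = [xi ** x * g for x, g in enumerate(gamma)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical score parameters for a scale with scores 0..3.
gamma = [1.0, 2.0, 1.5, 0.5]
pmf = power_series_pmf(xi=1.3, gamma=gamma)
```

A larger person parameter shifts probability mass towards higher scores, while the score parameters fix the shape of the distribution.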
Leunbach’s model is a test equating method for two tests (A and B), hence only the total score in each of the tests, not each item response, is considered. For each total score in A, a corresponding equated total score in B is estimated.
Let X_{1} be a test score of A and X_{2} a test score of B. (X_{1}, X_{2}) depend on the same person parameter ξ, and (A, B) have maximum scores equal to (m_{1}, m_{2}). The two test scores are assumed to be conditionally independent given ξ. Under this assumption, the probability of a total score over the test scores, r = X_{1} + X_{2}, can be computed as
\( P\left({X}_1+{X}_2=r\mid \xi \right)=\sum \limits_xP\left({X}_1=x\mid \xi \right)P\left({X}_2=r-x\mid \xi \right) \)
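Mesbah & Kreiner [34] showed that the distributions of polytomous Rasch items can be parameterized as Power Series Distributions as described in (2.1) and that the same applies for the total score over several items including the total raw score over all items. In this sense, it is correct to regard Leunbach’s model as the joint distribution of two Rasch model super items with distributions defined by:
\( P\left({X}_1=x\mid \xi \right)=\frac{\xi^x{\gamma}_{1x}}{\sum_{h=0}^{m_1}{\xi}^h{\gamma}_{1h}} \)  (2.2)
\( P\left({X}_2=x\mid \xi \right)=\frac{\xi^x{\gamma}_{2x}}{\sum_{h\prime =0}^{m_2}{\xi}^{h\prime }{\gamma}_{2h\prime }} \)  (2.3)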
Then, from (2.2) and (2.3) we can derive the distribution of the total score X_{1} + X_{2}:
\( P\left({X}_1+{X}_2=r\mid \xi \right)=\frac{\xi^r{\omega}_r}{D} \)  (2.4)
where
\( {\omega}_r=\sum \limits_{x=0}^{m_1}{\gamma}_{1x}{\gamma}_{2,r-x}\kern0.3em \mathrm{and}\kern0.3em D=\left({\sum}_{h=0}^{m_1}{\xi}^h{\gamma}_{1h}\right)\left({\sum}_{h\prime =0}^{m_2}{\xi}^{h\prime }{\gamma}_{2h\prime}\right)=\sum \limits_{r=0}^{m_1+{m}_2}{\xi}^r{\omega}_r \),
with \( {\gamma}_{2,r-x}=0 \) whenever r − x lies outside the range 0, ..., m_{2}.
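The joint distribution of (X_{1}, X_{2}) is:
\( P\left({X}_1={x}_1,{X}_2={x}_2\mid \xi \right)=\frac{\xi^{x_1+{x}_2}{\gamma}_{1{x}_1}{\gamma}_{2{x}_2}}{D} \)  (2.5)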
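From this it follows that the conditional probability of the responses (x, r − x) of a person to the two instruments, given the person’s total score r, is given by the ratio of (2.5) to (2.4):
\( P\left({X}_1=x,{X}_2=r-x\mid {X}_1+{X}_2=r\right)=\frac{\gamma_{1x}{\gamma}_{2,r-x}}{\omega_r} \)  (2.6)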
which is independent of the person parameter ξ so that the total score r is a sufficient statistic for ξ. It also follows (1) that the score parameters can be estimated by the same conditional maximum likelihood estimation procedures that Andersen [35] proposed for estimates of item parameters in Rasch models, that is, by methods that make no assumptions on the distribution and sampling of persons [7]; and (2) that person parameters can also be estimated by the same maximum likelihood procedures that are used to calculate maximum likelihood estimates of person parameters in Rasch models [7].
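The sufficiency argument can be checked numerically. The sketch below, which uses hypothetical score parameters and helper names of our own choosing, builds the joint distribution of two conditionally independent power series scores and then conditions on the total score; the resulting conditional distribution should be the same for any value of the person parameter.

```python
def power_series_pmf(xi, gamma):
    """P(X = x) for x = 0..m under a power series distribution."""
    weights = [xi ** x * g for x, g in enumerate(gamma)]
    total = sum(weights)
    return [w / total for w in weights]


def conditional_given_total(xi, gamma1, gamma2, r):
    """P(X1 = x, X2 = r - x | X1 + X2 = r) for every feasible x."""
    p1 = power_series_pmf(xi, gamma1)
    p2 = power_series_pmf(xi, gamma2)
    # Joint probabilities of the cells on the diagonal X1 + X2 = r.
    joint = {
        x: p1[x] * p2[r - x]
        for x in range(len(gamma1))
        if 0 <= r - x < len(gamma2)
    }
    total = sum(joint.values())
    return {x: p / total for x, p in joint.items()}


# Hypothetical score parameters for scale A (0..3) and scale B (0..2).
gamma_a = [1.0, 2.0, 1.5, 0.5]
gamma_b = [1.0, 0.8, 0.3]

low_trait = conditional_given_total(0.4, gamma_a, gamma_b, r=3)
high_trait = conditional_given_total(2.5, gamma_a, gamma_b, r=3)
same = all(abs(low_trait[x] - high_trait[x]) < 1e-12 for x in low_trait)
```

Because the person parameter cancels, the conditional probabilities depend only on products of score parameters, which is what makes conditional estimation of the score parameters possible.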
Iterative proportional fitting [36] is used to calculate the conditional maximum likelihood estimate of score parameters and Newton-Raphson [37] to calculate the maximum likelihood estimates of person parameters.
Notice that Leunbach’s model fits raw scores from the Rasch model with conditionally independent items because the raw scores over all items have Power Series Distributions. In this sense, Leunbach’s approach applies automatically. However, Leunbach’s model is more general than that, because it may apply in situations where the items of the two scores do not fit the Rasch model. The only requirement is that the two raw scores fit Leunbach’s model. Kreiner & Christensen [38] describe loglinear Rasch models where uniform local dependence is permitted, and where the raw scores do fit Leunbach’s model.
Note also that the proposed method of equating based on Leunbach’s model could be considered as an example of Nonlinear IRT True score equating [8]. Considering (X_{1}, X_{2}) from above, Nonlinear True score equating assumes that X_{1} and X_{2} are raw scores summarizing the responses to sets of items from IRT models with a common latent variable θ.
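In such models, true scores \( {\tau}_{X_1} \) and \( {\tau}_{X_2} \) are the expected outcomes given θ,
\( {\tau}_{X_1}=E\left({X}_1\mid \theta \right)={\nu}_{x_1}\left(\theta \right)\kern0.3em \mathrm{and}\kern0.3em {\tau}_{X_2}=E\left({X}_2\mid \theta \right)={\nu}_{x_2}\left(\theta \right) \)  (2.7)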
The functions \( {\nu}_{x_1}\left(\theta \right) \) and \( {\nu}_{x_2}\left(\theta \right) \) define test characteristic curves of X_{1} and X_{2}. They define a monotonic but nonlinear symmetric relationship between the true scores given by
\( {\tau}_{X_2}={\nu}_{x_2}\left({\nu}_{x_1}^{-1}\left({\tau}_{X_1}\right)\right) \)  (2.8)
Holland and Dorans [8] suggested replacing the true scores with observed scores in (2.8) so that one has
\( {X}_2={\nu}_{x_2}\left({\nu}_{x_1}^{-1}\left({X}_1\right)\right) \)  (2.9)
The maximum likelihood estimates of the person parameters in Leunbach’s model are equal to the person parameters where the expected value of the total score is the same as the observed score and therefore defined by \( {\nu}_{x_1}^{-1}\left({X}_1\right) \) and \( {\nu}_{x_2}^{-1}\left({X}_2\right) \). For this reason, we may regard the observed score as an unbiased maximum likelihood estimate of the true score and, therefore, the suggestion (2.9) is justified.
Besides, the three steps of the equating process in Leunbach’s model are the same as the steps taken in IRT true score equating, namely (1) take a score on scale A, (2) find the person parameter that corresponds to that score, and (3) find the score on scale B that corresponds to that person parameter. These steps are described in Fig. 1. More details of Leunbach’s model are given in Additional file 1.
In Leunbach’s report [20], only direct equating is addressed. In this study, we apply Leunbach’s model for both direct and indirect test equating.
Direct equating
For direct equating (see Fig. 1), also known as common person equating, we assume that we have two tests (A and B), and that a number of persons responds to both. This is the case, for instance, when we equate the ESS (A) to the MOS (B) from the TONiC study. In this case, the analysis by Leunbach’s model proceeds in four steps.
The first step estimates the score parameters (γ_{x}) of the two tests by conditional maximum likelihood in the same way that item parameters are estimated in the Rasch model [34].
The second step tests the fit of the model to the two-way contingency table with the joint distribution of the raw scores of A and B. Since this table may be large and sparse, so that we cannot rely on the asymptotic distribution of the test statistics, p-values are calculated by parametric bootstrapping. Bootstrapping consists of taking multiple random samples with replacement from the sample data at hand [39]. We use three tests to assess the fit of the model to the table; they are similar to tests used to test for multidimensionality in Rasch models. The first (1) is a conditional Likelihood Ratio Test comparing observed and expected counts given the total score of the two tests. The second (2) compares the observed correlation (Goodman and Kruskal’s Gamma [40]) of the scores to the expected value under the model; Horton et al. describe a similar test of unidimensionality for Rasch models [41]. The third (3) counts the number of persons whose two scores depart significantly from each other, at a 5% critical level, under Leunbach’s model. Since the person parameter of Leunbach’s model can be estimated separately from the two scores, this test is similar to a t-test of unidimensionality in Rasch models comparing person estimates from different subscores [42]. The advantage of focusing on subscores instead of person parameters is that the analysis avoids the problematic assumption that person estimates are normally distributed. A chi-square p-value is obtained for (3) on whether the observed frequency of persons with significant differences is larger than 5%. Following Cox and Snell (page 37) [43], we only regard p-values less than or equal to 0.01 as strong evidence against the fit of the model to the data. Moderate evidence provided by p-values below 0.05 will of course occur, but will only be regarded as conclusive if more than one of the three tests is significant. However, the reader is free to draw their own conclusions concerning model fit in Table 5.
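For the second test, the observed statistic is Goodman and Kruskal’s Gamma computed on the two-way table of raw scores. The following is a minimal sketch of that observed coefficient only, using the standard definition based on concordant and discordant pairs; the function name and the toy table are hypothetical, and the comparison with the expected value under the model (with its bootstrap p-value) is not reproduced here.

```python
def goodman_kruskal_gamma(table):
    """Gamma = (C - D) / (C + D) for a two-way table of counts.

    table[i][j] is the number of persons with score i on A and score j on B.
    C counts concordant pairs of persons, D counts discordant pairs.
    """
    rows, cols = len(table), len(table[0])
    concordant = 0
    discordant = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(i + 1, rows):      # second person has a higher A score
                for l in range(cols):
                    pairs = table[i][j] * table[k][l]
                    if l > j:                 # and a higher B score
                        concordant += pairs
                    elif l < j:               # but a lower B score
                        discordant += pairs
    return (concordant - discordant) / (concordant + discordant)


# Toy table: rows are scores on A, columns are scores on B.
toy_table = [
    [8, 3, 1],
    [2, 9, 4],
    [0, 3, 10],
]
observed_gamma = goodman_kruskal_gamma(toy_table)
```

Values near 1 indicate that persons scoring high on one scale tend to score high on the other; the fit test asks whether the observed value is compatible with what the model predicts.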
The third step equates a score on A to a score on B: Firstly, by calculating a maximum likelihood estimate of the person parameter given the A score, and secondly, by calculating the expected B score for persons with a person parameter equal to this estimate. Since the equated B score has to be an integer instead of a real number, the equated B score is defined as the rounded value of the expected B score.
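The third step can be sketched as follows, assuming the score parameters of both scales have already been estimated. The person parameter is found as the value at which the expected A score equals the observed A score. This sketch uses simple bisection on the log of the person parameter in place of the Newton-Raphson procedure mentioned earlier, and it maps the extreme A scores, which have no finite maximum likelihood estimate, to the extreme B scores; both choices, the function names, and the score parameters are assumptions made for illustration and are not taken from DIGRAM.

```python
import math


def expected_score(log_xi, gamma):
    """Expected raw score under a power series distribution, given log(xi)."""
    log_w = [x * log_xi + math.log(g) for x, g in enumerate(gamma)]
    top = max(log_w)                          # subtract the maximum for stability
    w = [math.exp(v - top) for v in log_w]
    return sum(x * wx for x, wx in enumerate(w)) / sum(w)


def ml_log_xi(score, gamma, lo=-30.0, hi=30.0, tol=1e-10):
    """Solve expected_score(log_xi) = score by bisection.

    The expected score increases with log_xi, so the root is unique.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_score(mid, gamma) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def equate(score_a, gamma_a, gamma_b):
    """Expected B score for a person whose ML estimate matches score_a on A."""
    max_a, max_b = len(gamma_a) - 1, len(gamma_b) - 1
    if score_a <= 0:                          # assumption: minimum maps to minimum
        return 0.0
    if score_a >= max_a:                      # assumption: maximum maps to maximum
        return float(max_b)
    return expected_score(ml_log_xi(score_a, gamma_a), gamma_b)


# Hypothetical score parameters for scale A (0..3) and scale B (0..5).
gamma_a = [1.0, 2.0, 1.5, 0.5]
gamma_b = [1.0, 3.0, 4.0, 3.0, 1.0, 0.2]

# Note: Python's round() rounds exact halves to the nearest even integer.
conversion = {a: round(equate(a, gamma_a, gamma_b)) for a in range(len(gamma_a))}
```

Because the expected B score increases with the person parameter, the resulting conversion table is non-decreasing in the A score.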
The final step assesses the error of equating by bootstrapping from the observed contingency table. If the model was accepted during step two, the variation of the results of the three steps on the bootstrapped data will provide an unbiased estimate of the random error associated with the equated results. Such error is the Standard Error of Equating (SEE) [1] and is computed for each equated score. In other words, the SEE corresponds to the standard deviation of equated scores over hypothetical replications of an equating procedure in samples from a population of test takers [1]. For a score x_{i} of test A, the SEE of the equated score on test B, \( {\hat{eq}}_B\left({x}_i\right) \), can be computed using the following formula
\( SEE\left[{\hat{eq}}_B\left({x}_i\right)\right]=\sqrt{\operatorname{var}\left[{\hat{eq}}_B\left({x}_i\right)\right]} \)
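We calculated the replications of the equating procedure in S = 1000 bootstrap samples. The SEE formula using bootstrap samples is as follows:
\( {SEE}_{boot}\left[{\hat{eq}}_B\left({x}_i\right)\right]=\sqrt{\frac{\sum_{s=1}^S{\left[{\hat{eq}}_{B,s}\left({x}_i\right)-{\overline{eq}}_B\left({x}_i\right)\right]}^2}{S-1}} \)
where
\( {\overline{eq}}_B\left({x}_i\right)=\frac{1}{S}\sum \limits_{s=1}^S{\hat{eq}}_{B,s}\left({x}_i\right) \)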
A weighted SEE mean for all the equated scores is then calculated. We calculated a weighted instead of an unweighted mean because we are summarizing errors over a large number of score groups, some of them with very few cases, and an unweighted mean would mean that the errors in the small groups would inflate the assessment of the degree of error in the population.
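A minimal sketch of this summary is given below. It assumes that the SEE for each A score is the sample standard deviation (divisor S − 1) of the equated B scores across bootstrap replications, and that the weights are the numbers of persons observed at each A score; the weighting scheme, the function names, and the toy numbers are our assumptions for illustration.

```python
from statistics import stdev


def bootstrap_see(replicated_scores):
    """Standard deviation of the equated scores over bootstrap replications."""
    return stdev(replicated_scores)


def weighted_mean_see(see_by_score, counts_by_score):
    """Mean SEE over scores, weighting each score by its number of persons."""
    total = sum(counts_by_score[x] for x in see_by_score)
    weighted = sum(see_by_score[x] * counts_by_score[x] for x in see_by_score)
    return weighted / total


# Toy data: equated B scores for each A score over four replications.
# A real analysis would use many more replications (S = 1000 in the text).
replications = {
    0: [0, 0, 1, 0],
    1: [2, 2, 3, 2],
    2: [4, 5, 3, 4],
}
counts = {0: 50, 1: 30, 2: 2}   # hypothetical numbers of persons per A score

see = {x: bootstrap_see(scores) for x, scores in replications.items()}
summary = weighted_mean_see(see, counts)
unweighted = sum(see.values()) / len(see)
```

In this toy case the sparsely populated score group has the largest SEE, so the weighted mean is lower than the unweighted mean, which is the effect the weighting is meant to achieve.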
As explained in Table 2 and in Additional file 1, we regard a weighted SEE mean below 0.91 as acceptable.
Indirect equating
For indirect equating, also known as common item equating, imagine that we have three tests (A, B, and C); one sample of persons responds to A and B, and another sample responds to A and C. Equating from B to C can be indirectly done via A, which is the ‘common item’ (or common scale) enabling the equating. This is the case, for instance, when we equate the MOS (B), available only in the TONiC sample, to the PSQI (C), available only in the PROMIS sample, via the common scale ESS (A), available in both the TONiC and PROMIS samples.
The scale A should not work differently for the two samples of persons. Therefore, Differential Item Functioning (DIF) [44] for sample was assessed in each indirect equating triplet A, B, C.
Indirect equating from B to C is a three-step procedure. In the first step, direct equating of B to A is performed. In the second step, direct equating of A to C is performed. In the third step, the results of the previous steps are used to establish a correspondence of scores from B to C (i.e., to perform indirect equating). For example, as shown in Fig. 2, imagine that we want to know the score of C that corresponds to a score of 6 on B. We first find in step 1 the expected A score for B = 6, which is 4.5. Then in step 2 we see that the expected C scores for A = 4 and A = 5 are 3.5 and 5.3, respectively. Hence, the expected C score lies between 3.5 and 5.3, and by interpolating we find that it is (3.5 + 5.3)/2 = 4.4, which corresponds to a rounded integer of 4.
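The worked example can be reproduced with a short sketch. It assumes linear interpolation between the two neighbouring integer A scores, which agrees with the midpoint calculation in the example but is our assumption about the general rule; the function name and the dictionaries holding the step 1 and step 2 results are hypothetical.

```python
import math


def interpolate_expected(expected_by_score, value):
    """Linearly interpolate expected scores between neighbouring integer scores."""
    lower = math.floor(value)
    upper = math.ceil(value)
    if lower == upper:                       # value is already an integer score
        return expected_by_score[lower]
    fraction = value - lower
    return ((1.0 - fraction) * expected_by_score[lower]
            + fraction * expected_by_score[upper])


# Step 1 (B to A): expected A score for B = 6, as in the example.
b_to_a = {6: 4.5}
# Step 2 (A to C): expected C scores for A = 4 and A = 5, as in the example.
a_to_c = {4: 3.5, 5: 5.3}

# Step 3: chain the two results to obtain the expected and rounded C score.
expected_c = interpolate_expected(a_to_c, b_to_a[6])
equated_c = round(expected_c)
```

With the numbers from the example, the expected C score is 4.4 and the equated integer score is 4.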
The tests of fit (second step in section Direct equating) are not available for indirect equating because to evaluate misfit the contingency table shown in Fig. 1 is needed, and it cannot be obtained if different sets of persons have responded to the tests. Nevertheless, in the first two steps of indirect equating from B to C via A, it is tested whether B and A measure the same construct, and whether A and C measure the same construct. If both tests accept the hypotheses, it follows logically that B and C must measure the same construct. On the other hand, the SEE of the indirect equating from B to C can be estimated by bootstrapping in exactly the same way as for direct equating. In addition, Additional file 1: Table S20 provides an example where the ESS and the MOS are equated directly and indirectly via the NSI-D, and the score correspondences in both cases are very similar.
Software
Direct and indirect equating pairs among the eight sleep instruments were assessed with Leunbach’s model as implemented in DIGRAM [45], which is free and can be downloaded from http://staff.pubhealth.ku.dk/~skm/skm/. Additional file 1 shows how to perform Test Equating with DIGRAM. DIF was assessed with RUMM2030 [42]. The statistical test used for detecting DIF in RUMM2030 is a two-way Analysis of Variance (ANOVA) [46] of the person-item deviation residuals, with the person factor (i.e., sample) and class intervals (i.e., strata along the trait) as factors.
Results
Sample
The TONiC sample consisted of 722 multiple sclerosis patients, and the PROMIS sample of 2252 participants recruited from the internet and from clinics. Of the 1993 participants from the PROMIS internet sample, 1259 reported no sleep problems and 734 reported sleep problems. The clinical sample consisted of 259 adults from clinics at the University of Pittsburgh Medical Center. Table 3 shows the distribution of sex and age in the TONiC and PROMIS samples, as well as globally.
ICF
The 106 items of the 8 instruments were linked to the second-level ICF category b134 Sleep functions. Some were also linked to a third-level sleep category (b1340 Amount of sleep, b1341 Onset of sleep, b1342 Maintenance of sleep, b1343 Quality of sleep). The b categories in the brief ICF Core Set for sleep disorders (b134 Sleep functions, b130 Energy and drive functions, b140 Attention functions, b110 Consciousness functions, and b440 Respiration functions) were found in our linking; while b134 was the primary concept, the rest were secondary concepts. Three of the four Core Set d categories (d475 Driving, d240 Handling stress and other psychological demands, and d230 Carrying out daily routine) were also found as secondary concepts. More b, d, and e categories were identified as secondary concepts, too. All these secondary categories are the contextual parameters for items in the sleep instruments.
Five main sleep aspects to which each item could belong were derived: Sleep disturbance (b1341, b1342), Quality of sleep (b1343), Amount of sleep (b1340), Impact of sleep on daily life (b134), and Facilitators/barriers of sleep (b134). Table 4 shows the number of items per instrument belonging to each sleep aspect.
MOS, NSI-F, PSQI, and PSD were assessing mainly sleep disturbance, NSI-N quality of sleep, and ESS, NSI-D, and PSRI impact of sleep on daily life. ESS and NSI-D were the only instruments with all the items pointing to one sleep aspect. NSI-F and PSRI involved two aspects; MOS, NSI-N, and PSD three; and PSQI four.
The two PSQI items belonging to Facilitators/barriers of sleep (How often have you taken medicine to help you sleep (prescribed or ‘over the counter’)? / Do you have a bed partner or roommate?) were not considered in the summated score. They are Environmental factors in ICF nomenclature, and thus cannot be summated with the other items. The PSQI ended up with 12 items, and with a score range of 0–36.
Leunbach’s model
For each pairwise direct equating, DIGRAM uses the estimates of the score parameters to calculate the expected counts under Leunbach’s model and to test whether the model fits the data. Three tests of fit are available (Likelihood ratio test, Gamma coefficient, and the number and percentage of persons with significant differences between measurements). A bootstrap p-value is provided for the first and second tests, and an asymptotic chi-square p-value is obtained for the third. These p-values are presented in Table 5 (columns 2–4) for each directly equated pair, highlighting the p-values below the 0.01 significance level. The equating of ESS-PSD, ESS-PSRI, and PSD-PSRI presented a significant percentage of persons with significant differences between measurements. ESS-PSD also presented a significant Gamma coefficient, so there is evidence from two tests that ESS and PSD measure different constructs; equating these two tests or using them for indirect equating was therefore not recommended. MOS-NSI-F and NSI-N-NSI-F presented a significant Likelihood Ratio Test.
To assess the precision of the equating results, for each equated score in each equated pair, bootstrap samples were generated in order to compute the standard deviation of the equated scores over replications, namely the SEE. The distribution of the SEE among the equated scores for each equated pair is presented in the last four columns of Table 5. The most relevant value is the weighted mean, and values above 0.91 are highlighted. The minimum SEE values were practically 0 for all the pairs, and the maximum ranged between 0.5 and 3.55. The weighted SEE mean is below 1 in all the pairs except ESS-PSD.
The indirect equated pairs (via ESS), excluding the PSD ones (which involved ESS-PSD), were first tested for DIF by sample. ESS showed DIF only for NSI-F-PSQI, and the marginal value was considered not to be substantial enough to prevent the equating. Then we assessed the tests of fit in the first two direct equating steps: if these were acceptable, the fit of the indirect equating was also acceptable. The fit was acceptable for all the pairs except the ones involving ESS-PSD. Regarding the SEE, bootstrap samples were generated and evaluated. Table 6 shows the distribution of the SEE for each pairwise indirect equating excluding PSD pairs. The SEE values were higher than for direct equating, with the maximum ranging between 0.56 and 4.99, and the highest weighted mean value was 1.4. The pairs involving PSRI presented a weighted mean above 1.
Additional file 1 contains detailed results of the direct equating of ESS and MOS and of the indirect equating of MOS and PSQI via ESS.
Tables 5 and 6 show that pairs belonging to the same aspect did not necessarily have better fit indices and precision than pairs from different aspects. For example, MOS-ESS (different aspects) shows better fit values than PSQI-PSD (same aspect). While MOS-PSQI (same aspect) shows better SEE values than MOS-PSRI (different aspects), NSI-D-PSQI (different aspects) shows better SEE values than NSI-D-PSRI (same aspect). Also, both tables show that the SEE is lower when we equate the large scale (in terms of scale range) to the small one than vice versa. For example, the SEE for ESS-NSI-D (small to large) is 0.80 while the SEE for NSI-D-ESS (large to small) is 0.38.
Out of the 28 possible pairs, 23 could be equated. The exchange tables for these 23 equated pairs can be found in Additional file 2.
Discussion
In this study we described a novel methodology for equating functioning scales and we applied it to a domain little explored in the field of equating, sleep functions. Leunbach’s model equates the scores of two scales considering that they depend on the same person parameter. It has been shown how to take into account the three tests of fit, as well as the SEE, to decide on the adequateness of the equating.
In our case in point, 23 out of the 28 possible pairs among 8 instruments could be equated according to the model. For the ESS-PSD pair, the significance of the Gamma coefficient and of the count of persons with two scores that depart significantly at a 5% critical level from each other under the model could be due to a type 1 error. In addition, the scale range difference between ESS and PSD, 84, is the highest among all the direct equated pairs. The larger this difference, the more problematic the equating.
Issues remain for ESS-PSRI, PSD-PSRI, MOS-NSI-F, and NSI-N-NSI-F. Their misfit may be due to local dependence between scores and/or because the latent trait assumed by Leunbach’s model to lie behind the scores is measured on logit scales with different units [47]. While equating the ESS with the PSD should be avoided, the scores of ESS-PSRI, PSD-PSRI, MOS-NSI-F, and NSI-N-NSI-F could be equated. The indirect equating was free of DIF for sample, with one exception showing marginal DIF that did not impede the equating.
The SEE for indirect equating was larger than for direct equating because the former uses results from two sets of direct equating estimates, both of which have error. Indirect equating is, therefore, less robust than direct. We also observed that there is less precision in terms of SEE when we equate the small scale (having a lower score range) to the large one (having a bigger score range) than vice versa. This makes sense because when going from small to large, for each score there is a wider range of options of scores to be equated.
As explained in the Methods section, when equating scales in functioning domains, linking the items to the ICF makes it possible to establish content comparability among the scales and thus to satisfy the requirement of construct equivalence [1]. In our case in point, the instruments were classified into three sleep aspects: sleep disturbance, quality of sleep, and impact of sleep on daily life. Given that pairs belonging to the same aspect did not necessarily present better fit indices than pairs from different aspects, it seems that the instruments map to a higher order concept of sleep functions (b134). Moreover, as only 2 of the 8 instruments (ESS and NSID) measured a single aspect, different aspects of sleep are already covered by the existing instruments. ESS and NSID are therefore more limited than the remaining instruments, which have greater content validity. Hence, the linking process also helped in the interpretation of the results.
Sleep scales have been previously linked to the ICF [48], and the ICF has also been used to compare the content of health status measures, where the b134 sleep functions category appears [49,50,51], or where the content relates to sleep medicine practice and research [52, 53]. The PSQI has also been linked to the ICF together with instruments from other health domains [54]. Problems in functioning of people with sleep disorders have also been identified via the ICF [55,56,57]. However, we are unaware of any study that uses the ICF beyond the content comparability to formally equate sleep scales.
Leunbach’s model, developed by Gustav Leunbach in 1976, has rarely been applied despite its desirable properties of raw score sufficiency, sound statistical theory on conditional tests, and similarity with Rasch models for measurement. This similarity should not be surprising; Leunbach collaborated with Rasch for many years (Leunbach translated, or according to Rasch transformed, Rasch’s 1960 book [6] from Danish into English; see page ix of the book [6]), and it is not an unreasonable conjecture that the idea of using power series distributions for measurement models came from Rasch himself. The similarity between the power series distribution, the distribution of test scores in Rasch’s multiplicative Poisson model, and the distribution of the raw score in the Rasch model for item analysis (see formula (5.5) in Chapter X of Rasch’s book [6]) is also an indicator of the inspiration for Leunbach’s model.
A limitation of this study, given the current implementation of Leunbach’s model in DIGRAM, is that only the raw scores observed in the sample appear in the equating table. In our case in point, this affects the MOS, whose theoretical range is 0–24 but for which only the range 0–21 is equated, because the raw scores 22–24 were not observed. This problem could be solved by interpolation, and we are currently working on how to implement it so that the next version of DIGRAM can incorporate it. Another limitation is that the ESS, the common scale used for indirect equating, assesses only one sleep aspect (impact of sleep on daily life), and therefore the indirect equating is not optimal. Nevertheless, we have shown that it is possible to equate several sleep scales using Leunbach’s model. The score conversions between pairs of sleep instruments, available in Additional file 2, will facilitate the comparison of clinical outcomes and research results. Any clinician or researcher can continue using the sleep scale they feel most comfortable with and look up the correspondence of each raw score on any other sleep scale.
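One simple way to fill unobserved raw scores in a conversion table is linear interpolation between neighbouring observed scores. The sketch below is an assumption-laden illustration, not the method planned for DIGRAM: the table values are invented and the function name is hypothetical. Scores outside the observed range (such as MOS 22–24 here) would require extrapolation, which this sketch deliberately leaves undefined.

```python
def fill_gaps(table, lo, hi):
    """Return a conversion table covering every raw score from lo to hi.

    Observed scores keep their equated value. Gaps between two observed
    scores are filled by linear interpolation. Scores outside the observed
    range are set to None rather than extrapolated.
    """
    observed = sorted(table)
    filled = {}
    for score in range(lo, hi + 1):
        if score in table:
            filled[score] = table[score]
        elif observed[0] < score < observed[-1]:
            left = max(k for k in observed if k < score)
            right = min(k for k in observed if k > score)
            weight = (score - left) / (right - left)
            filled[score] = table[left] + weight * (table[right] - table[left])
        else:
            filled[score] = None
    return filled


# Invented toy table: raw score 2 was not observed, and neither were 5 and 6.
toy_table = {0: 0.0, 1: 2.0, 3: 6.0, 4: 7.0}
filled = fill_gaps(toy_table, 0, 6)
print(filled)
```

A user could then look up any raw score in `filled` and treat a `None` entry as "no equated score available" instead of reading a silently extrapolated value.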
In this study we applied a particular test equating methodology to two specific datasets. Hence, the results obtained are not generalizable. Although the main focus of this study was not to provide generalizable findings but to illustrate the application of a novel test equating method, future studies could run simulations under different testing conditions to assess the robustness of Leunbach’s model. Future research could also compare Leunbach’s model with other equating methods. DIGRAM also provides equating results from the equipercentile method, and Additional file 1 includes the equipercentile results for the ESS and MOS equating.
In conclusion, we illustrated how to apply a novel test equating methodology implemented (partly during the current study) in the DIGRAM software, which is free and easy to use. We encourage its use in future applications.
Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Abbreviations
DIF: Differential Item Functioning
ESS: Epworth Sleepiness Scale
ICF: International Classification of Functioning, Disability and Health
IRT: Item Response Theory
MOS: Medical Outcomes Study
MTT: Modern Test Theory
NSI: Neurological Sleep Index
NSID: NSI-Diurnal sleepiness
NSIF: NSI-Fragmented nocturnal sleep
NSIN: NSI-Non-restorative nocturnal sleep
PROMIS: Patient-Reported Outcomes Measurement Information System
PSD: PROMIS-Sleep Disturbance
PSQI: Pittsburgh Sleep Quality Index
PSRI: PROMIS-Sleep Related Impairment
RMT: Rasch Measurement Theory
SEE: Standard Error of Equating
TONiC: Trajectories of Outcome in Neurological Conditions
References
1. Kolen MJ, Brennan RL. Test equating, scaling, and linking. Methods and practices. 2nd ed. New York: Springer; 2004.
2. González J, Wiberg M. Applying test equating methods: using R. Springer International Publishing; 2017.
3. Holland PW, Thayer DT. The kernel method of equating score distributions. (Technical report 87–79) Princeton, NJ: Educational Testing Service; 1989.
4. von Davier AA, Holland PW, Thayer DT. The kernel method of test equating. New York: Springer Verlag; 2004.
5. van der Linden WJ, Hambleton RK. Handbook of Modern Item Response Theory. New York: Springer; 1996.
6. Rasch G. Probabilistic models for some intelligence and attainment tests. Danmarks Paedagogiske Institut; 1960.
7. Christensen KB, Kreiner S, Mesbah M. Rasch models in health. Hoboken: Wiley; 2013.
8. Holland PW, Dorans NJ. Linking and equating. In: Brennan RL, editor. Educational Measurement. Westport: Praeger Publishers; 2006.
9. Andrich D. Rating scales and Rasch measurement. Expert Rev Pharmacoecon Outcomes Res. 2011;11(5):571–85. Epub 2011/10/01.
10. Wahl I, Lowe B, Bjorner JB, Fischer F, Langs G, Voderholzer U, et al. Standardization of depression measurement: a common metric was developed for 11 self-report depression measures. J Clin Epidemiol. 2014;67(1):73–86. Epub 2013/11/23.
11. Choi SW, Schalet B, Cook KF, Cella D. Establishing a common metric for depressive symptoms: linking the BDI-II, CES-D, and PHQ-9 to PROMIS depression. Psychol Assess. 2014;26(2):513–27. Epub 2014/02/20.
12. Zhao Y, Chan W, Lo BC. Comparing five depression measures in depressed Chinese patients using item response theory: an examination of item properties, measurement precision and score comparability. Health Qual Life Outcomes. 2017;15(1):60. Epub 2017/04/05.
13. Schalet BD, Cook KF, Choi SW, Cella D. Establishing a common metric for self-reported anxiety: linking the MASQ, PANAS, and GAD-7 to PROMIS anxiety. J Anxiety Disord. 2014;28(1):88–96. Epub 2014/02/11.
14. Chen WH, Revicki DA, Lai JS, Cook KF, Amtmann D. Linking pain items from two studies onto a common scale using item response theory. J Pain Symptom Manag. 2009;38(4):615–28. Epub 2009/07/07.
15. Fisher WP Jr, Eubanks RL, Marier RL. Equating the MOS SF-36 and the LSU HSI physical functioning scales. J Outcome Meas. 1997;1(4):329–62. Epub 1997/01/01.
16. Schalet BD, Revicki DA, Cook KF, Krishnan E, Fries JF, Cella D. Establishing a common metric for physical function: linking the HAQ-DI and SF-36 PF subscale to PROMIS(®) physical function. J Gen Intern Med. 2015;30(10):1517–23. Epub 2015/05/21.
17. Andrich D. The Polytomous Rasch model and the equating of two instruments. In: Christensen KB, Kreiner S, Mesbah M, editors. Rasch Models in Health. Hoboken: Wiley; 2013. p. 163–96.
18. Prodinger B, O'Connor RJ, Stucki G, Tennant A. Establishing score equivalence of the Functional Independence Measure motor scale and the Barthel Index, utilising the International Classification of Functioning, Disability and Health and Rasch measurement theory. J Rehabil Med. 2017;49(5):416–22. Epub 2017/05/05.
19. World Health Organisation. International Classification of Functioning, Disability and Health. Geneva: World Health Organization (WHO); 2001.
20. Leunbach G. A probabilistic measurement model for assessing whether two tests measure the same personal factor. Copenhagen: Danish institute for educational research; 1976.
21. Fischer GH. Derivations of the Rasch model. In: Fischer GH, Molenaar IW, editors. Rasch models: foundations, recent developments, and applications. New York: Springer-Verlag; 1995. p. 15–38.
22. Mills RJ, Tennant A, Young CA. The Neurological Sleep Index: A suite of new sleep scales for multiple sclerosis. Mult Scler J Exp Transl Clin. 2016;2:2055217316642263. Epub 2016/04/07.
23. van Kooten JA, Terwee CB, Kaspers GJ, van Litsenburg RR. Content validity of the Patient-Reported Outcomes Measurement Information System Sleep Disturbance and Sleep Related Impairment item banks in adolescents. Health Qual Life Outcomes. 2016;14:92. Epub 2016/06/19.
24. PROMIS 2 sleep wake [database on the internet]. Harvard Dataverse. 2016. Available from: https://doi.org/10.7910/DVN/XESLRZ.
25. Sargento P, Perea V, Ladera V, Lopes P, Oliveira J. The Epworth Sleepiness Scale in Portuguese adults: from classical measurement theory to Rasch model analysis. Sleep Breath. 2015;19(2):693–701. Epub 2014/11/21.
26. Viala-Danten M, Martin S, Guillemin I, Hays RD. Evaluation of the reliability and validity of the Medical Outcomes Study sleep scale in patients with painful diabetic peripheral neuropathy during an international clinical trial. Health Qual Life Outcomes. 2008;6:113. Epub 2008/12/19.
27. Buysse DJ, Reynolds CF 3rd, Monk TH, Berman SR, Kupfer DJ. The Pittsburgh Sleep Quality Index: a new instrument for psychiatric practice and research. Psychiatry Res. 1989;28(2):193–213. Epub 1989/05/01.
28. Buysse DJ, Yu L, Moul DE, Germain A, Stover A, Dodds NE, et al. Development and validation of patient-reported outcome measures for sleep disturbance and sleep-related impairments. Sleep. 2010;33(6):781–92. Epub 2010/06/17.
29. International Organization for Standardization. Health informatics - Capacity-based eHealth architecture roadmap - Part 2: Architectural components and maturity model. ISO/TR 14639–2:2014. United Kingdom 2014.
30. Cieza A, Fayed N, Bickenbach J, Prodinger B. Refinements of the ICF linking rules to strengthen their potential for establishing comparability of health information. Disabil Rehabil. 2019;41(5):574–83.
31. Stucki G, Prodinger B, Bickenbach J. Four steps to follow when documenting functioning with the International Classification of Functioning, Disability and Health. Eur J Phys Rehabil Med. 2017;53(1):144–9. Epub 2017/01/26.
32. Stucki A, Cieza A, Michel F, Stucki G, Bentley A, Culebras A, et al. Developing ICF Core sets for persons with sleep disorders based on the International Classification of Functioning, Disability and Health. Sleep Med. 2008;9(2):191–8. Epub 2007/07/24.
33. Noack A. A class of random variables with discrete distributions. Ann Math Stat. 1950;21(1):127–32.
34. Kreiner S, Mesbah M. Rasch Models for Ordered Polytomous Items. In: Christensen KB, Kreiner S, Mesbah M, editors. Rasch Models in Health. Hoboken: Wiley; 2013. p. 27–41.
35. Andersen EB. Asymptotic properties of conditional maximum-likelihood estimators. J R Stat Soc Ser B Methodol. 1970;32(2):283–301.
36. Bishop Y, Fienberg SE, Holland PW. Discrete Multivariate Analysis: Theory and Practice. Cambridge: Massachusetts Institute of Technology Press; 1975.
37. Gil A, Segura J, Temme N. Numerical Methods for Special Functions; 2007.
38. Kreiner S, Christensen KB. Validity and Objectivity in Health-Related Scales: Analysis by Graphical Loglinear Rasch Models. In: von Davier M, Carstensen C, editors. Multivariate and Mixture Distribution Rasch Models - Extensions and Applications. New York: Springer-Verlag; 2007. p. 329–46.
39. Efron B, Tibshirani RJ. An introduction to the bootstrap. New York: Chapman & Hall; 1993.
40. Goodman LA, Kruskal WH. Measures of Association for Cross Classifications. J Amer Stat Assoc. 1954;49:732–64.
41. Horton M, Marais I, Christensen KB. Dimensionality. In: Christensen KB, Kreiner S, Mesbah M, editors. Rasch Models in Health. Hoboken: Wiley; 2013. p. 137–57.
42. Andrich D, Sheridan B, Luo G. Rasch models for measurement: RUMM2030. Perth: RUMM Laboratory Pty, Ltd; 2010.
43. Cox DR, Snell EJ. Applied statistics: principles and examples. London: Chapman and Hall; 1981.
44. Holland PW, Wainer H. Differential item functioning. Hillsdale: Lawrence Erlbaum; 1993.
45. Kreiner S, Nielsen T. Item analysis in DIGRAM 3.04. Part I: Guided tours. Research report 2013/06. Copenhagen: University of Copenhagen, Department of Public Health; 2013.
46. Gelman A, Hill J. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press; 2006.
47. Humphry SM. The role of the unit in physics and psychometrics. Measurement. 2011;9(1):1–24.
48. Andelic N, Johansen JB, Bautz-Holter E, Mengshoel AM, Bakke E, Roe C. Linking self-determined functional problems of patients with neck pain to the International Classification of Functioning, Disability, and Health (ICF). Patient Prefer Adherence. 2012;6:749–55. Epub 2012/11/03.
49. Borchers M, Cieza A, Sigl T, Kollerits B, Kostanjsek N, Stucki G. Content comparison of osteoporosis-targeted health status measures in relation to the International Classification of Functioning, Disability and Health (ICF). Clin Rheumatol. 2005;24(2):139–44. Epub 2004/09/17.
50. Brockow T, Wohlfahrt K, Hillert A, Geyh S, Weigl M, Franke T, et al. Identifying the concepts contained in outcome measures of clinical trials on depressive disorders using the International Classification of Functioning, Disability and Health as a reference. J Rehabil Med. 2004;(44 Suppl):49–55. Epub 2004/09/17.
51. Roe Y, Soberg HL, Bautz-Holter E, Ostensjo S. A systematic review of measures of shoulder pain and functioning using the International Classification of Functioning, Disability and Health (ICF). BMC Musculoskelet Disord. 2013;14:73. Epub 2013/03/01.
52. Gradinger F, Glassel A, Bentley A, Stucki A. Content comparison of 115 health status measures in sleep medicine using the International Classification of Functioning, Disability and Health (ICF) as a reference. Sleep Med Rev. 2011;15(1):33–40. Epub 2010/09/08.
53. Stucki A, Cieza A, Schuurmans MM, Ustun B, Stucki G, Gradinger F, et al. Content comparison of health-related quality of life instruments for obstructive sleep apnea. Sleep Med. 2008;9(2):199–206. Epub 2007/07/24.
54. Campos TF, Rodrigues CA, Farias IM, Ribeiro TS, Melo LP. Comparison of instruments for sleep, cognition and function evaluation in stroke patients according to the International Classification of Functioning, Disability and Health (ICF). Rev Bras Fisioter. 2012;16(1):23–9. Epub 2012/03/24.
55. Gradinger F, Boldt C, Hogl B, Cieza A. Part 2. Identification of problems in functioning of persons with sleep disorders from the health professional perspective using the International Classification of Functioning, Disability and Health (ICF) as a reference: a worldwide expert survey. Sleep Med. 2011;12(1):97–101. Epub 2010/12/15.
56. Gradinger F, Glassel A, Gugger M, Cieza A, Braun N, Khatami R, et al. Identification of problems in functioning of people with sleep disorders in a clinical setting using the International Classification of Functioning, Disability and Health (ICF) checklist. J Sleep Res. 2011;20(3):445–53. Epub 2010/10/05.
57. Gradinger F, Kohler B, Khatami R, Mathis J, Cieza A, Bassetti C. Problems in functioning from the patient perspective using the International Classification of Functioning, Disability and Health (ICF) as a reference. J Sleep Res. 2011;20(1 Pt 2):171–82. Epub 2010/07/21.
58. TONiC study group. The Walton Centre NHS Foundation Trust. Liverpool.
59. Cella D, Yount S, Rothrock N, Gershon R, Cook K, Reeve B, et al. The Patient-Reported Outcomes Measurement Information System (PROMIS): progress of an NIH Roadmap cooperative group during its first two years. Med Care. 2007;45(5 Suppl 1):S3–S11. Epub 2007/04/20.
Acknowledgements
We would like to thank Kannit Pongpipatpaiboon for helping with the linking of the sleep items to the ICF. We would also like to thank Jerome Bickenbach for his comments on a previous version of this paper. Finally, we are extremely grateful to the TONiC study group [58] and the PROMIS roadmap initiative [59] for providing us with the data used to illustrate the methodology.
Funding
This work was supported by Swiss Paraplegic Research and University of Lucerne.
Author information
Contributions
NDA conceived and designed the study, analysed and interpreted the data, and wrote the paper. SK implemented the methodology in the software used, interpreted the data, and helped to prepare the draft manuscript. CY and RM provided data. AT conceived and designed the study, interpreted the data, and helped to prepare the draft manuscript. All authors critically revised the draft manuscript and approved the final version.
Author’s information
This paper is part of the cumulative PhD thesis of NDA.
Corresponding author
Correspondence to Núria Duran Adroher.
Ethics declarations
Ethics approval and consent to participate
In this study, secondary data from TONiC and PROMIS studies were analysed. Ethics approval and consent to participate were obtained in the primary studies.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file 1:
Direct and indirect test equating in DIGRAM 4.06. (PDF 699 kb)
Additional file 2:
Raw score conversion tables among sleep instruments. (XLSX 28 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Keywords
 Test equating
 Leunbach’s model
 International Classification of Functioning, Disability and Health
 Rasch models
 ESS
 MOS
 NSI
 PSQI
 PROMIS-SD
 PROMIS-SRI