Test equating sleep scales: applying Leunbach's model

Background: In most cases, the total scores from different instruments assessing the same construct are not directly comparable, but must be equated. In this study we aimed to illustrate a novel test equating methodology applied to sleep functions, a domain in which few score comparability studies exist.

Methods: Eight scales from two cross-sectional self-report studies were considered, and one scale was common to both studies. The International Classification of Functioning, Disability and Health (ICF) was used to establish content comparability. Direct (common persons) and indirect (common item) equating was assessed by means of Leunbach's model, which equates the scores of two scales depending on the same person parameter, taking into account several tests of fit and the Standard Error of Equating (SEE).

Results: All items were linked to the body functions category b134 of the ICF, which corresponds to 'Sleep functions'. The scales were classified into three sleep aspects: four scales were assessing mainly sleep disturbance, one quality of sleep, and three impact of sleep on daily life. Of 16 direct equated pairs, 15 could be equated according to Leunbach's model, and of 12 indirect equated pairs, 8 could be equated. Raw score conversion tables between each of these 23 equated pairs are provided. The SEE was higher for indirect than for direct equating. Pairs measuring the same sleep aspect did not show better fit indices than pairs from different aspects. The instruments mapped to a higher order concept of sleep functions.

Conclusion: Leunbach's equating model has been successfully applied to a functioning domain little explored in test equating. This novel methodology, together with the ICF, enables comparison of clinical outcomes and research results, and facilitates communication among clinicians.
Electronic supplementary material The online version of this article (10.1186/s12874-019-0768-y) contains supplementary material, which is available to authorized users.


Background
To measure functioning, several instruments are commonly available. Typically, one clinician or researcher uses one instrument while another uses a different instrument, both measuring the same concept. However, the scores of the two instruments cannot be directly compared, as they may have different operational ranges, or measure different levels of the concept, such that their total scores are on different ordinal metrics. This restricts their comparability and impedes research and communication among clinicians.
To be able to compare scores from different instruments, they must be equated. Equating can be defined as a statistical process used to adjust scores on test forms so that the scores can be used interchangeably [1]. While various equating techniques were developed during the twentieth century, it was not until the 1980s that they became popular [1]. Equating techniques include linear, equipercentile, and Modern Test Theory (MTT) methods. They all can be used to equate scores in different data collection designs, such as those in which two or more instruments are administered to the same group of persons (known as common persons design), or those in which common items are found across different studies (known as common items design).
In linear equating, the standardized deviation scores of the two forms are set to be equal by means of a linear conversion [2]. However, the formula converting scores from one form to the other may be non-linear. Equipercentile equating admits non-linear relationships (it identifies scores on one form that have the same percentile ranks as scores on the second form), but it assumes that the scores are continuous, whereas they are usually discrete. Although the data can be made continuous [3,4], equipercentile equating relies on observed scores. MTT methods, including Item Response Theory (IRT) [5] and Rasch Measurement Theory (RMT) [6], assume that a common latent variable lies behind responses to the items of the instruments. MTT refers to the outcomes of the latent variable as person parameters and regards an estimate of the person parameter as a measure of the respondent's ability or trait level. IRT and RMT share a number of assumptions, including: unidimensionality, monotonicity of item characteristic functions, local independence, and no Differential Item Functioning (DIF) [7]. Testing these assumptions adds strength to the equating process, because it is possible to test that the scores of the two instruments to be equated actually measure the same construct. This test adds evidence to one of the requirements of test equating, namely construct equivalence [8].
In IRT models, the person estimate is a complex function of the pattern of item responses. By comparison, the situation is much simpler, and better suited for test equating, in RMT, where there is a one-to-one mapping of raw scores to estimates of the person parameter. This follows from the statistical sufficiency of the raw score in RMT, from which in turn follow conditional inference, in which assumptions about person distributions and sampling are not required, and specific objective measurement [6]. Hence, IRT and RMT are different paradigms within MTT [9].
MTT methods have been applied to create a common metric in health domains such as depression [10][11][12], anxiety [13], pain [14], or physical functioning [15,16]. Furthermore, Andrich [17] presented an application of the polytomous Rasch model in equating two instruments intended to assess the same trait treating the total scores of two instruments as partial credit items from a test with two items. This approach has been employed in the health literature [18] together with the International Classification of Functioning, Disability and Health (ICF) [19] for conceptual matching. The polytomous Rasch models used in these studies are formally the same as the model described by Leunbach [20] used in this paper.
Gustav Leunbach developed the model in 1976 to assess whether two instruments measure the same trait, relating their total scores to a common scale [20]. The model is supported by a sound statistical theory of conditional inference, as well as by the property of raw score sufficiency [21]; the Rasch model also possesses these properties. Although Leunbach's model seems promising for test equating, it has rarely been applied, probably because it has gone unnoticed by the scientific community. Andrich [17] acknowledged that the model he used came from Leunbach's report [20], but apart from this, Leunbach's report has rarely been cited and, as far as we are aware, the model had not been implemented in any software until recently. Hence, it can be considered a 'novel' methodology that we wanted to rediscover by applying it to a functioning domain. We considered sleep functions as a case in point because it has been little explored in the field of equating.
Thus, the objective of our article is to illustrate an application of a novel methodology for equating functioning scales. Specifically, we aim (1) to rediscover Leunbach's model and its properties, (2) to show how to interpret tests of fit and precision to decide on the adequacy of the equating, and (3) to apply the model to a domain little explored in the field of equating in health.

Sample and instruments
Secondary data were analysed from two cross-sectional self-report studies: Trajectories of Outcome in Neurological Conditions (TONiC), and Patient-Reported Outcomes Measurement Information System (PROMIS). These studies were chosen because they were available at the time when the current project was designed and they were suitable for secondary analyses.
The TONiC study (https://tonic.thewaltoncentre.nhs.uk/) examines the factors that influence quality of life in patients with neurological conditions. The sample for the current study consists of a cohort of patients with clinically definite Multiple Sclerosis from consecutive individual outpatient attendances in three neuroscience centres in the UK. The data were collected over the first 12 months of study recruitment, and the participants received a questionnaire pack including sleep instruments. The study had approval from the local research ethics committees. All subjects received written information on the study and gave written informed consent prior to participation [22].
The PROMIS initiative (http://www.healthmeasures.net/explore-measurement-systems/promis) aims to build item pools and develop core questionnaires that measure key health outcome domains including sleep [23]. The sample for the current study consists of an internet (YouGov Polimetrix, https://today.yougov.com/solutions/overview) sample and a clinical sample [24]. The latter included patients recruited from sleep medicine, general medicine, and psychiatric clinics at the University of Pittsburgh Medical Center.
The Epworth Sleepiness Scale (ESS) and the Pittsburgh Sleep Quality Index (PSQI) are the most widely used scales in sleep medicine. However, new generic (PROMIS) and disease-specific scales are emerging, and a set of these were available from the PROMIS and TONiC studies, with the ESS as the link. Hence, we took advantage of having eight sleep instruments available and considered that equating all of them pairwise would be of interest to researchers and clinicians. The eight sleep instruments, as well as the study in which each instrument was administered, are described in Table 1.

International Classification of Functioning, Disability and Health (ICF)
The ICF is an international standard offering a common language to describe functioning [19]. It is based on the integrative bio-psycho-social model of functioning, disability and health of the World Health Organization [19]. Body functions ('b'), Body structures ('s'), Activities and Participation ('d'), and Environmental factors ('e') are classified using an alphanumeric system. Second, third, and fourth-level categories are found under each letter, so that, for example, under the two-level category b134 sleep functions, seven third-level categories exist: b1340 Amount of sleep, b1341 Onset of sleep, b1342 Maintenance of sleep, b1343 Quality of sleep, b1344 Functions involving the sleep cycle, b1348 Sleep functions, other specified, and b1349 Sleep functions, unspecified.
One of the key requirements of test equating is construct equivalence [1]. In health, when equating scales in functioning domains, it is recommended to first link the items from the different tests to the ICF so that content comparability among the scales can be established and thus satisfy the requirement of equivalent constructs. In addition, the International Standards Organization [29] has prescribed the ICF as the framework for cataloguing health in e-health informatics (the concept of health is based on the health components of the ICF). Consequently, the use of ICF codes is two-fold in the current study: (1) to ensure concept comparability, a prerequisite for test equating, and (2) to lay a marker for the future when e-health informatics will be at the forefront of data management techniques in health care.
Hence, the items from all the scales were linked to the ICF. Two researchers independently performed the linking of items to ICF categories using the latest ICF linking rules [30], and then discussed possible disagreements to reach a final solution. As suggested by Stucki et al. [31], the ICF Core Set for sleep disorders [32] was taken into account.

Leunbach's model
Leunbach [20] used a Power Series Distribution depending on an underlying common latent trait to relate the total scores of two instruments to a common scale. A Power Series Distribution [33] is a discrete probability distribution over non-negative integers of the form

P(X = x | ξ) = γ_x ξ^x / Γ(ξ), with Γ(ξ) = Σ_{j=0}^{m} γ_j ξ^j, (2.1)

where the probability of obtaining a score x depends on a person parameter ξ and on score parameters γ_x. For each score x, a score parameter is estimated. Leunbach's model is a test equating method for two tests (A and B); hence only the total score on each test, not each item response, is considered. For each total score on A, a corresponding equated total score on B is estimated.

Let X_1 be a test score of A and X_2 a test score of B, where (X_1, X_2) depend on the same person parameter ξ and (A, B) have maximum scores equal to (m_1, m_2). The two test scores are assumed to be conditionally independent given ξ. Mesbah & Kreiner [34] showed that the distributions of polytomous Rasch items can be parameterized as Power Series Distributions as described in (2.1), and that the same applies to the total score over several items, including the total raw score over all items. In this sense, it is correct to regard Leunbach's model as the joint distribution of two Rasch model super items with distributions defined by:

P(X_1 = x | ξ) = γ_{1x} ξ^x / Γ_1(ξ), (2.2)
P(X_2 = x | ξ) = γ_{2x} ξ^x / Γ_2(ξ). (2.3)

Then, from (2.2) and (2.3) we can derive the distribution of the total score r = X_1 + X_2:

P(r | ξ) = ω_r ξ^r / (Γ_1(ξ) Γ_2(ξ)), with ω_r = Σ_x γ_{1x} γ_{2,r-x}. (2.4)

The joint distribution of (X_1, X_2) is:

P(X_1 = x_1, X_2 = x_2 | ξ) = γ_{1x_1} γ_{2x_2} ξ^{x_1 + x_2} / (Γ_1(ξ) Γ_2(ξ)). (2.5)

From this it follows that the conditional probability of the responses (x, r - x) of a person to the two instruments, given the person's total score r, is given by the ratio of (2.5) to (2.4):

P(X_1 = x | r) = γ_{1x} γ_{2,r-x} / ω_r, (2.6)

which is independent of the person parameter ξ, so that the total score r is a sufficient statistic for ξ. It also follows (1) that the score parameters can be estimated by the same conditional maximum likelihood estimation procedures that Andersen [35] proposed for estimates of item parameters in Rasch models, that is, by methods that make no assumptions on the distribution and sampling of persons [7]; and (2) that person parameters can also be estimated by the same maximum likelihood procedures that are used to calculate maximum likelihood estimates of person parameters in Rasch models [7]. (As noted under Table 1, only the categorical items of the PSQI were considered, and the sum of the individual items was used instead of the existing scoring algorithm.)
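The sufficiency property above, namely that the conditional probability of the responses given the total score does not involve ξ, can be checked numerically. The following sketch uses hypothetical score parameters chosen purely for illustration; it computes the conditional probability once from the score parameters alone and once from the joint distribution at a given ξ, and the two agree for any ξ.

```python
def score_probs(gammas, xi):
    """P(X = x | xi) for a Power Series Distribution with score
    parameters gammas[0..m] (hypothetical values for illustration)."""
    weights = [g * xi**x for x, g in enumerate(gammas)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical score parameters for two short tests A and B.
gamma_a = [1.0, 2.5, 2.0, 0.8]          # maximum score m1 = 3
gamma_b = [1.0, 1.8, 1.5, 0.9, 0.3]     # maximum score m2 = 4

def conditional_prob(x, r, gamma_a, gamma_b):
    """P(X1 = x | X1 + X2 = r): depends only on the score parameters,
    not on the person parameter xi (r is sufficient for xi)."""
    omega_r = sum(gamma_a[j] * gamma_b[r - j]
                  for j in range(len(gamma_a))
                  if 0 <= r - j < len(gamma_b))
    return gamma_a[x] * gamma_b[r - x] / omega_r

def conditional_via_joint(x, r, xi):
    """The same quantity computed from the joint distribution at a
    specific xi; the xi terms cancel in the ratio."""
    pa = score_probs(gamma_a, xi)
    pb = score_probs(gamma_b, xi)
    joint = {j: pa[j] * pb[r - j]
             for j in range(len(pa)) if 0 <= r - j < len(pb)}
    return joint[x] / sum(joint.values())
```

Evaluating `conditional_via_joint` at two very different values of ξ returns the same number as `conditional_prob`, which is the numerical face of the sufficiency argument.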
Iterative proportional fitting [36] is used to calculate the conditional maximum likelihood estimate of score parameters and Newton-Raphson [37] to calculate the maximum likelihood estimates of person parameters.
Notice that Leunbach's model fits raw scores from the Rasch model with conditionally independent items, because the raw score over all items has a Power Series Distribution. In this sense, Leunbach's approach applies automatically. However, Leunbach's model is more general than that, because it may apply in situations where the items of the two scores do not fit the Rasch model. The only requirement is that the two raw scores fit Leunbach's model. Kreiner & Christensen [38] describe loglinear Rasch models where uniform local dependence is permitted, and where the raw scores do fit Leunbach's model.
Note also that the proposed method of equating based on Leunbach's model could be considered as an example of Non-linear IRT True score equating [8]. Considering (X 1 , X 2 ) from above, Nonlinear True score equating assumes that X 1 and X 2 are raw scores summarizing the responses to sets of items from IRT models with a common latent variable θ.
In such models, true scores τ_{X_1} and τ_{X_2} are the expected outcomes given θ:

τ_{X_1} = ν_{X_1}(θ) = E[X_1 | θ], τ_{X_2} = ν_{X_2}(θ) = E[X_2 | θ]. (2.7)

The functions ν_{X_1}(θ) and ν_{X_2}(θ) define the test characteristic curves of X_1 and X_2. They define a monotonic but non-linear symmetric relationship between the true scores given by

τ_{X_2} = ν_{X_2}(ν_{X_1}^{-1}(τ_{X_1})). (2.8)

Holland and Dorans [8] suggested replacing the true scores with observed scores in (2.8), so that one has

x_2 = ν_{X_2}(ν_{X_1}^{-1}(x_1)). (2.9)

The maximum likelihood estimates of the person parameters in Leunbach's model are the person parameters at which the expected value of the total score equals the observed score, and are therefore defined by ν_{X_1}^{-1}(X_1) and ν_{X_2}^{-1}(X_2). For this reason, we may regard the observed score as an unbiased maximum likelihood estimate of the true score, and, therefore, the suggestion (2.9) is justified.
Besides, the three steps of the equating process in Leunbach's model are the same as the steps taken in IRT true score equating, namely (1) take a score on scale A, (2) find the person parameter that corresponds to that score, and (3) find the score on scale B that corresponds to that person parameter. These steps are described in Fig. 1. More details of Leunbach's model are given in Additional file 1.
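The three steps can be sketched for the Power Series (super item) parameterization. This is a minimal illustration, not the implementation used in the paper: the score parameters are hypothetical, and a bisection search stands in for the Newton-Raphson procedure mentioned above.

```python
def expected_score(gammas, xi):
    """E[X | xi] under a Power Series Distribution with score
    parameters gammas[0..m] (the test characteristic function)."""
    weights = [g * xi**x for x, g in enumerate(gammas)]
    total = sum(weights)
    return sum(x * w for x, w in enumerate(weights)) / total

def equate_a_to_b(score_a, gamma_a, gamma_b, lo=1e-6, hi=1e6, tol=1e-10):
    """Three-step equating sketch: (1) take a score on A, (2) find the
    person parameter whose expected A score equals it (bisection on a
    log scale stands in for Newton-Raphson), (3) return the rounded
    expected B score at that parameter. Extreme scores (0 or the
    maximum) have no finite person estimate. Suitable only for short
    score ranges, since xi**x is computed directly."""
    m1 = len(gamma_a) - 1
    if not 0 < score_a < m1:
        raise ValueError("no finite person estimate for extreme scores")
    while hi - lo > tol * max(1.0, lo):
        mid = (lo * hi) ** 0.5          # geometric midpoint
        if expected_score(gamma_a, mid) < score_a:
            lo = mid
        else:
            hi = mid
    xi = (lo * hi) ** 0.5
    return round(expected_score(gamma_b, xi))

# Hypothetical score parameters, purely for illustration.
gamma_a = [1.0, 2.5, 2.0, 0.8]          # test A, maximum score 3
gamma_b = [1.0, 1.8, 1.5, 0.9, 0.3]     # test B, maximum score 4
```

With these illustrative parameters, a score of 2 on the shorter test A equates to a score of 3 on the longer test B, since the person parameter reproducing an expected A score of 2 yields an expected B score of about 2.54.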
In Leunbach's report [20], only direct equating is addressed. In this study, we apply Leunbach's model for both direct and indirect test equating.

Direct equating
For direct equating (see Fig. 1), also known as common person equating, we assume that we have two tests (A and B) and that a number of persons respond to both. This is the case, for instance, when we equate the ESS (A) to the MOS (B) from the TONiC study. In this case, the analysis by Leunbach's model proceeds in four steps.
The first step estimates the score parameters (γ x ) of the two tests by conditional maximum likelihood in the same way that item parameters are estimated in the Rasch model [34].
The second step tests the fit of the model to the two-way contingency table with the joint distribution of the raw scores of A and B. Since this table may be large and sparse, so that we cannot rely on the asymptotic distribution of the test statistics, p-values are calculated by parametric bootstrapping. Bootstrapping consists of taking multiple random samples with replacement from the sample data at hand [39]. We use three tests, similar to tests of multidimensionality in Rasch models, to assess the fit of the model to the table. First, (1) a conditional Likelihood Ratio Test comparing observed and expected counts given the total score of the two tests. Second, (2) a test comparing the observed correlation (Goodman and Kruskal's Gamma [40]) of the scores to the expected value under the model; Horton et al. describe a similar test of unidimensionality for Rasch models [41]. Third, (3) a count of the number of persons whose two scores depart significantly from each other at a 5% critical level under Leunbach's model. Since the person parameter of Leunbach's model can be estimated separately from each of the two scores, this test is similar to a t-test of unidimensionality in Rasch models comparing person estimates from different subscores [42]. The advantage of focusing on subscores instead of person parameters is that the analysis avoids the problematic assumption that person estimates are normally distributed. For (3), a chi-square p-value is obtained on whether the observed frequency of persons with significant differences is larger than 5%. Following Cox and Snell (page 37) [43], we only regard p-values less than or equal to 0.01 as strong evidence against the fit of the model to the data. Moderate evidence provided by p-values less than 5% will of course occur, but will only be regarded as conclusive if more than one of the three tests is significant. However, the reader is free to draw their own conclusions concerning model fit from Table 5.
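The second fit test relies on Goodman and Kruskal's Gamma. A minimal sketch of how the observed Gamma is computed from a two-way contingency table of counts follows; the model-expected value against which it is compared would come from the fitted expected counts, which are assumed rather than shown here.

```python
def goodman_kruskal_gamma(table):
    """Goodman and Kruskal's Gamma for a two-way contingency table of
    counts (rows = scores on test A, columns = scores on test B):
    (concordant - discordant) / (concordant + discordant) over all
    pairs of observations, ignoring ties."""
    concordant = discordant = 0
    rows, cols = len(table), len(table[0])
    for i in range(rows):
        for j in range(cols):
            n = table[i][j]
            if n == 0:
                continue
            for i2 in range(rows):
                for j2 in range(cols):
                    if i2 > i and j2 > j:       # both scores higher
                        concordant += n * table[i2][j2]
                    elif i2 > i and j2 < j:     # scores move oppositely
                        discordant += n * table[i2][j2]
    return (concordant - discordant) / (concordant + discordant)
```

A perfectly concordant table gives Gamma = 1; under Leunbach's model a high positive Gamma is expected, since both scores are driven by the same person parameter.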
The third step equates a score on A to a score on B: Firstly, by calculating a maximum likelihood estimate of the person parameter given the A score, and secondly, by calculating the expected B score for persons with a person parameter equal to this estimate. Since the equated B score has to be an integer instead of a real number, the equated B score is defined as the rounded value of the expected B score.
The final step assesses the error of equating by bootstrapping from the observed contingency table. If the model was accepted during step two, the variation of the results of the three steps on the bootstrapped data will provide an unbiased estimate of the random error associated with the equated results. Such error is the Standard Error of Equating (SEE) [1] and is computed for each equated score. In other words, the SEE corresponds to the standard deviation of equated scores over hypothetical replications of an equating procedure in samples from a population of test takers [1]. For a score x_i of test A, the SEE of the equated score on test B, eq_B(x_i), is the standard deviation of eq_B(x_i) over such replications. We calculated the replications of the equating procedure in S = 1000 bootstrap samples, in which case the SEE is computed as:

SEE(eq_B(x_i)) = sqrt( (1 / (S - 1)) Σ_{s=1}^{S} ( eq_B^(s)(x_i) - mean_B(x_i) )^2 ),

where eq_B^(s)(x_i) is the equated score in bootstrap sample s and mean_B(x_i) is the mean of the S equated scores. A weighted SEE mean for all the equated scores is then calculated. We calculated a weighted instead of an unweighted mean because we are summarizing errors over a large number of score groups, some of them with very few cases, and with an unweighted mean the errors in the small groups would inflate the assessment of the degree of error in the population.

Fig. 1 Direct equating. The crosses in the contingency table indicate a non-zero value. A_i is any raw score for scale A, and B_j is any raw score for scale B. A_m1 is the maximum possible total score for A, and B_m2 the maximum possible total score for B. The equating process shows that for any A_i, an estimate for the equated value in scale B is computed given the estimate of the person parameter for A_i
As explained in Table 2 and in Additional file 1, we regard a weighted SEE mean below 0.91 as acceptable.
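The bootstrap SEE and its weighted mean can be sketched as follows. The `equate` argument stands in for the full Leunbach estimation pipeline, which is assumed rather than shown; any function mapping a list of observed (A, B) score pairs to a dict of equated scores will do for illustration.

```python
import random

def bootstrap_see(equate, pairs, S=1000, seed=1):
    """Sketch of the bootstrap SEE: resample the observed (A, B) score
    pairs with replacement, re-run the equating on each replicate, and
    take the standard deviation of the equated score for each A score
    over the S replications."""
    random.seed(seed)
    reps = {}
    for _ in range(S):
        sample = [random.choice(pairs) for _ in pairs]
        for a, b_eq in equate(sample).items():
            reps.setdefault(a, []).append(b_eq)
    see = {}
    for a, values in reps.items():
        mean = sum(values) / len(values)
        see[a] = (sum((v - mean) ** 2 for v in values)
                  / (len(values) - 1)) ** 0.5
    return see

def weighted_see_mean(see, score_counts):
    """Weight each score's SEE by how many respondents obtained that
    score, so sparsely populated score groups do not inflate the
    summary error."""
    total = sum(score_counts.get(a, 0) for a in see)
    return sum(see[a] * score_counts.get(a, 0) for a in see) / total
```

For example, if two score groups have SEEs of 0.5 and 1.5 but the first contains three times as many respondents, the weighted mean is 0.75 rather than the unweighted 1.0, reflecting the error actually experienced in the population.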

Indirect equating
For indirect equating, also known as common item equating, imagine that we have three tests (A, B, and C); one sample of persons responds to A and B, and another sample responds to A and C. Equating from B to C can be indirectly done via A, which is the 'common item' (or common scale) enabling the equating. This is the case, for instance, when we equate the MOS (B)-available only in the TONiC sample, to the PSQI (C)-available only in the PROMIS sample, via the common scale ESS (A)-available in both TONiC and PROMIS samples.
The scale A should not work differently for the two samples of persons. Therefore, Differential Item Functioning (DIF) [44] for sample was assessed in each indirect equating triplet A, B, C.
Indirect equating from B to C is a three-step procedure. In the first step, direct equating of B to A is performed. In the second step, direct equating of A to C is performed. Then, the results of the previous steps are used to establish a correspondence of scores from B to C (i.e., to perform indirect equating). For example, as shown in Fig. 2, imagine that we want to know the score on C that corresponds to a score of 6 on B. In step 1, we find that the expected A score for B = 6 is 4.5. Then, in step 2, we see that the expected C scores for A = 4 and A = 5 are 3.5 and 5.3, respectively. Hence, the expected C score lies between 3.5 and 5.3, and by interpolating we find that it is (3.5 + 5.3)/2 = 4.4, which rounds to an integer score of 4.
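The interpolation in the worked example can be sketched as follows; the score correspondence tables passed in are illustrative fragments of the B-to-A and A-to-C direct equating results, not the full tables.

```python
import math

def indirect_equate(score_b, b_to_a, a_to_c):
    """Indirect equating of a B score to a C score via A, following the
    worked example: look up the expected A score, then linearly
    interpolate between the expected C scores of the neighbouring
    integer A scores."""
    a = b_to_a[score_b]                 # step 1: expected A score
    lo = math.floor(a)
    frac = a - lo
    if frac == 0:
        c = a_to_c[lo]                  # expected A score is an integer
    else:                               # step 2: interpolate between
        c = a_to_c[lo] + frac * (a_to_c[lo + 1] - a_to_c[lo])
    return round(c)                     # step 3: rounded equated C score
```

Reproducing the example in the text: a B score of 6 maps to an expected A score of 4.5, the expected C scores at A = 4 and A = 5 are 3.5 and 5.3, and interpolation gives 4.4, which rounds to 4.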
The tests of fit (second step in section Direct equating) are not available for indirect equating, because evaluating misfit requires the contingency table shown in Fig. 1, which cannot be obtained when different sets of persons have responded to the tests. Nevertheless, in the first two steps of indirect equating from B to C via A, it is tested whether B and A measure the same construct, and whether A and C measure the same construct. If both tests accept the hypotheses, it follows logically that B and C must measure the same construct. On the other hand, the SEE of the indirect equating from B to C can be estimated by bootstrapping in exactly the same way as for direct equating. In addition, Additional file 1: Table S20 provides an example where the ESS and the MOS are equated directly and indirectly via the NSID, and the score correspondences in both cases are very similar.

Software

The equating analyses were performed with DIGRAM [45], which is free and can be downloaded from http://staff.pubhealth.ku.dk/~skm/skm/. Additional file 1 shows how to perform test equating with DIGRAM. DIF was assessed with RUMM2030 [42]. The statistical test used for detecting DIF in RUMM2030 is a two-way Analysis of Variance (ANOVA) [46] of the person-item deviation residuals with person factors (i.e., sample) and class intervals (i.e., strata along the trait) as factors.

Sample
The TONiC sample consisted of 722 multiple sclerosis patients, and the PROMIS sample of 2252 participants recruited from the internet and from clinics. Of the 1993 participants from the PROMIS internet sample, 1259 reported no sleep problems and 734 reported sleep problems. The clinical sample consisted of 259 adults from clinics at the University of Pittsburgh Medical Center. Table 3 shows the distribution of sex and age in the TONiC and PROMIS samples, as well as globally.

ICF
The 106 items of the 8 instruments were linked to the second-level ICF category b134 Sleep functions. Some were also linked to a third-level sleep category (b1340 Amount of sleep, b1341 Onset of sleep, b1342 Maintenance of sleep, b1343 Quality of sleep). The b categories in the brief ICF Core Set for sleep disorders (b134 Sleep functions, b130 Energy and drive functions, b140 Attention functions, b110 Consciousness functions, and b440 Respiration functions) were found in our linking; while b134 was the primary concept, the rest were secondary concepts. Three of the four Core Set d categories (d475 Driving, d240 Handling stress and other psychological demands, d230 Carrying out daily routine) were also found as secondary concepts. More b, d, and e categories were identified as secondary concepts, too. All these secondary categories are the contextual parameters for items in the sleep instruments. Five main sleep aspects, to which each item could belong, were derived: Sleep disturbance (b1341, b1342), Quality of sleep (b1343), Amount of sleep (b1340), Impact of sleep on daily life (b134), and Facilitators/barriers of sleep (b134). Table 4 shows the number of items per instrument belonging to each sleep aspect.

Fig. 2 Indirect equating. This figure shows the three-step procedure to equate test B to test C indirectly via test A. The direct equating of B to A and the direct equating of A to C are the two previous steps needed to conduct the indirect equating from B to C
MOS, NSIF, PSQI, and PSD mainly assessed sleep disturbance; NSIN, quality of sleep; and ESS, NSID, and PSRI, impact of sleep on daily life. ESS and NSID were the sole instruments with all items pointing to one sleep aspect. NSIF and PSRI involved two aspects; MOS, NSIN, and PSD three; and PSQI four.
The two PSQI items belonging to Facilitators/barriers of sleep (How often have you taken medicine to help you sleep (prescribed or 'over the counter')? / Do you have a bed partner or roommate?) were not considered in the summated score. They are Environmental factors in ICF nomenclature, and thus cannot be summated with the other items. The PSQI ended up with 12 items, and with a score range of 0-36.

Leunbach's model
For each pairwise direct equating, DIGRAM uses the estimates of the score parameters to calculate the expected counts under Leunbach's model and to test whether the model fits the data. Three tests of fit are available (the Likelihood Ratio Test, the Gamma coefficient, and the number and percentage of persons with significant differences between measurements). A bootstrap p-value is provided for the first and second tests, and an asymptotic chi-square p-value is obtained for the third. These p-values are presented in Table 5 (columns 2-4) for each directly equated pair, highlighting the p-values below the 0.01 significance level. The equating of ESS-PSD, ESS-PSRI, and PSD-PSRI presented a significant percentage of persons with significant differences between measurements. ESS-PSD also presented a significant Gamma coefficient, so there is evidence from two tests that ESS and PSD measure different constructs; equating these two tests or using them for indirect equating was therefore not recommended. MOS-NSIF and NSIN-NSIF presented a significant Likelihood Ratio Test.
To assess the precision of the equating results, for each equated score in each equated pair, bootstrap samples were generated in order to compute the standard deviation of the equated scores over replications, namely the SEE. The distribution of the SEE among the equated scores for each equated pair is presented in the last four columns of Table 5. The most relevant value is the weighted mean, and values above 0.91 are highlighted. The minimum SEE values were practically 0 for all the pairs, and the maximum ranged between 0.5 and 3.55. The weighted SEE mean is below 1 for all the pairs except ESS-PSD.
The indirect equated pairs (via ESS), excluding those involving PSD (which would involve ESS-PSD), were first tested for DIF by sample. The ESS showed DIF only for NSIF-PSQI, and this marginal value was not considered substantial enough to prevent the equating. Then we assessed the tests of fit in the first two direct equating steps: if these were acceptable, the fit of the indirect equating was also acceptable. The fit was acceptable for all the pairs except the ones involving ESS-PSD. Regarding the SEE, bootstrap samples were generated and evaluated. Table 6 shows the distribution of the SEE for each pairwise indirect equating, excluding PSD pairs. The SEE values were higher than for direct equating, with the maximum ranging between 0.56 and 4.99, and the highest weighted mean value was 1.4. The pairs involving PSRI presented a weighted mean above 1.

Tables 5 and 6 show that pairs belonging to the same aspect did not necessarily have better fit indices and precision than pairs from different aspects. For example, MOS-ESS (different aspects) shows better fit values than PSQI-PSD (same aspect). While MOS-PSQI (same aspect) shows better SEE values than MOS-PSRI (different aspects), NSID-PSQI (different aspects) shows better SEE values than NSID-PSRI (same aspect). Also, both tables show that the SEE is lower when we equate the large scale (in terms of scale range) to the small one than vice versa. For example, the SEE for ESS-NSID (small to large) is 0.80 while for NSID-ESS (large to small) it is 0.38.
Out of the 28 possible pairs, 23 could be equated. The exchange tables for these 23 equated pairs can be found in Additional file 2.

Discussion
In this study we described a novel methodology for equating functioning scales and applied it to a domain little explored in the field of equating: sleep functions. Leunbach's model equates the scores of two scales considering that they depend on the same person parameter. We have shown how to take into account the three tests of fit, as well as the SEE, to decide on the adequacy of the equating.
In our case in point, 23 of the 28 possible pairs among the 8 instruments could be equated according to the model. The significance, for the ESS-PSD equating, of the Gamma coefficient and of the count of persons whose two scores depart significantly from each other at a 5% critical level under the model could be due to a type 1 error. In addition, the scale range difference between ESS and PSD, 84, is the highest among all the directly equated pairs; the higher this difference, the more problematic the equating.
Issues remain for ESS-PSRI, PSD-PSRI, MOS-NSIF, and NSIN-NSIF. Their misfit may be due to local dependence between scores and/or to the latent trait, assumed by Leunbach's model to lie behind the scores, being measured on logit scales with different units [47]. While equating the ESS with the PSD should be avoided, the scores of ESS-PSRI, PSD-PSRI, MOS-NSIF, and NSIN-NSIF could be equated. The indirect equating was free of DIF by sample, with one exception showing marginal DIF that did not impede the equating.
The SEE for indirect equating was larger than for direct equating because the former uses results from two sets of direct equating estimates, both of which have error. Indirect equating is, therefore, less robust than direct. We also observed that there is less precision in terms of SEE when we equate the small scale (having a lower score range) to the large one (having a bigger score range) than vice versa. This makes sense because when going from small to large, for each score there is a wider range of options of scores to be equated.
As explained in the Methods section, when equating scales in functioning domains, linking the items to the ICF makes it possible to establish content comparability among the scales and thus satisfy the requirement of construct equivalence [1]. In our case in point, the instruments were classified into three sleep aspects: sleep disturbance, quality of sleep, and impact of sleep on daily life. Given that pairs belonging to the same aspect did not necessarily present better fit indices than pairs from different aspects, the instruments appear to map to a higher-order concept of sleep functions (b134). Moreover, as only 2 of the 8 instruments (ESS and NSID) measured a single aspect, different aspects of sleep are already considered in the existing instruments. ESS and NSID are thus more limited in scope than the remaining instruments, which have greater content validity. Hence, the linking process also helped in the interpretation of the results.
Sleep scales have been previously linked to the ICF [48], and the ICF has also been used to compare the content of health status measures in which the b134 sleep functions category appears [49-51, 52, 53]. The PSQI has also been linked to the ICF together with instruments from other health domains [54]. Problems in functioning of people with sleep disorders have also been identified via the ICF [55-57]. However, we are unaware of any study that uses the ICF beyond content comparability to formally equate sleep scales. Leunbach's model, developed by Gustav Leunbach in 1976, has rarely been applied despite its desirable properties: raw score sufficiency, a sound statistical theory of conditional tests, and its similarity to Rasch models for measurement. This similarity should not be surprising: Leunbach collaborated with Rasch for many years (Leunbach translated, or, according to Rasch, "transformed", Rasch's 1960 book [6] from Danish into English; see page ix of the book [6]), and it is not an unreasonable conjecture that the idea of using power series distributions for measurement models came from Rasch himself. The similarity between the power series distribution and both the distribution of test scores in Rasch's multiplicative Poisson model and the distribution of the raw score in the Rasch model for item analysis (see formula (5.5) in Chapter X of Rasch's book [6]) also points to the inspiration for Leunbach's model.
A limitation of this study, given the current implementation of Leunbach's model in DIGRAM, is that only the raw scores actually taken by the sample appear in the equating table. In our case in point, this affects the MOS, whose theoretical range is 0-24 but for which only the range 0-21 is equated, because the raw scores 22-24 were not observed. This problem could be solved by interpolation, which we are currently working to implement so that the next version of DIGRAM will incorporate it. Another limitation is that the ESS, the common scale used for indirect equating, assesses only one sleep aspect (impact of sleep on daily life), so the indirect equating is not optimal. Nevertheless, we have shown that it is possible to equate several sleep scales using Leunbach's model. The exchange of scores between pairs of sleep instruments provided in Additional file 2 will facilitate the comparison of clinical outcomes and research results. Any clinician or researcher can continue using the sleep scale they feel most comfortable with and look up the correspondence of each raw score on any other sleep scale.
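The interpolation fix mentioned above could look like the following sketch (a hypothetical helper, not DIGRAM code): linearly interpolate equated values for raw scores missing inside the observed range, and extend the last observed slope for scores above the observed maximum, such as the unobserved MOS raw scores 22-24.

```python
import numpy as np

def fill_conversion_table(observed, equated, full_range):
    """Fill a raw-score conversion table for scores missing from the sample.

    observed   -- raw scores actually taken (sorted ascending)
    equated    -- their equated scores on the other scale
    full_range -- (min, max) theoretical raw-score range
    """
    observed = np.asarray(observed, float)
    equated = np.asarray(equated, float)
    full = np.arange(full_range[0], full_range[1] + 1)
    # np.interp handles interior gaps but clamps outside the observed
    # range, so extend the last slope for scores above the observed max.
    filled = np.interp(full, observed, equated)
    slope = (equated[-1] - equated[-2]) / (observed[-1] - observed[-2])
    above = full > observed[-1]
    filled[above] = equated[-1] + slope * (full[above] - observed[-1])
    return dict(zip(full.tolist(), np.round(filled).astype(int).tolist()))
```

For instance, with observed raw scores 0-5 equated to 0, 2, 4, 6, 8, 10 and a theoretical range of 0-7, the unobserved scores 6 and 7 would be assigned 12 and 14.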
In this study we applied a particular test equating methodology to two specific datasets, so the results obtained are not generalizable. Although the main focus of this study was not to provide generalizable findings but to illustrate the application of a novel test equating method, it would be worthwhile in future studies to carry out simulations under different testing conditions to assess the robustness of Leunbach's model. Another line of future research could compare Leunbach's model with other equating methods. DIGRAM also provides equating results from the equipercentile method, and Additional file 1 includes the equipercentile results for the ESS and MOS equating.
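For readers unfamiliar with the equipercentile method, a textbook, discrete version (not DIGRAM's implementation) maps each raw score on one form to the score on the other form with the closest percentile rank:

```python
import numpy as np

def percentile_ranks(scores, max_score):
    """Percentile rank of each raw score 0..max_score, using the usual
    midpoint convention: P(X < x) + 0.5 * P(X = x), in percent."""
    scores = np.asarray(scores)
    n = scores.size
    return np.array([
        (np.sum(scores < x) + 0.5 * np.sum(scores == x)) / n * 100
        for x in range(max_score + 1)
    ])

def equipercentile(x, scores_a, max_a, scores_b, max_b):
    """Equate raw score x on form A to the form-B raw score whose
    percentile rank is closest (first score wins on ties)."""
    pr_a = percentile_ranks(scores_a, max_a)
    pr_b = percentile_ranks(scores_b, max_b)
    return int(np.argmin(np.abs(pr_b - pr_a[x])))
```

With two toy samples in which form B scores run exactly twice as high as form A scores, a raw score of 2 on form A equates to 4 on form B, as expected.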
In conclusion, we illustrated how to apply a novel test equating methodology implemented (partly during the current study) in the DIGRAM software, which is free and easy to use. We encourage its use in future applications.