Comparison of unweighted and item response theory-based weighted sum scoring for the Nine-Questions Depression-Rating Scale in the Northern Thai Dialect

Kawilapat, Suttipong; Maneeton, Benchalak; Maneeton, Narong; Prasitwattanaseree, Sukon; Kongsuk, Thoranin; Arunpongpaisal, Suwanna; Leejongpermpoon, Jintana; Sukhawaha, Supattra; Traisathit, Patrinee

doi:10.1186/s12874-022-01744-0

Research
Open access
Published: 12 October 2022

Suttipong Kawilapat^1,2,
Benchalak Maneeton²,
Narong Maneeton²,
Sukon Prasitwattanaseree¹,
Thoranin Kongsuk^3,4,
Suwanna Arunpongpaisal⁵,
Jintana Leejongpermpoon³,
Supattra Sukhawaha³ &
Patrinee Traisathit^1,6,7

BMC Medical Research Methodology volume 22, Article number: 268 (2022)

Abstract

Background

The Nine-Questions Depression-Rating Scale (9Q) has been developed as an alternative assessment tool for assessing the severity of depressive symptoms in Thai adults. The traditional unweighted sum scoring approach does not account for differences in the loadings of the items on the actual severity. Therefore, we developed an Item Response Theory (IRT)-based weighted sum scoring approach to provide a scoring method that is more precise than the unweighted sum score.

Methods

Secondary data from a study on the criterion-related validity of the 9Q in the northern Thai dialect were used in this study. All participants were interviewed to obtain demographic data and screened/evaluated for major depressive disorder and the severity of the associated depressive symptoms, followed by diagnosis by a psychiatrist for major depressive disorder. IRT models were used to estimate the discrimination and threshold parameters. Differential item functioning (DIF) of responses to each item between males and females was assessed using likelihood-ratio tests. The IRT-based weighted sum scores are defined as the linear combination of the individual item responses weighted with the discrimination and threshold parameters, divided by the plausible maximum score, based on the graded-response model (GRM) for the 9Q score (9Q-GRM) or the nominal-response model (NRM) for categorical combinations of the intensity and frequency of symptoms from the 9Q responses (9QSF-NRM). The performances of the two scoring procedures were compared using relative precision.

Results

Of the 1,355 participants, 1,000 and 355 were randomly selected for the developmental and validation groups for the IRT-based weighted scoring, respectively. Gender-related DIF was present for items 2 and 5 under the 9Q-GRM and for most items (all except items 3 and 6) under the 9QSF-NRM, so the parameters for these items could be estimated separately for each gender. The 9Q-GRM model accounting for DIF had 16.7% higher precision than the unweighted sum-score approach.

Discussion

Our findings suggest that weighted sum scoring with the IRT parameters can improve the scoring when using 9Q to measure the severity of the depressive symptoms in Thai adults. Accounting for DIF between the genders resulted in higher precision for IRT-based weighted scoring.

Background

Depression is a common mental disorder that is a leading cause of the global disease burden and deaths by suicide. In 2017, an estimated 264 million people (3.44%; range 2–6%) worldwide and 2.62 million people (3.09%) in Thailand experienced depression. The prevalence of depression in Thailand differs slightly between males and females (2.57% vs. 3.56%) and is around twice as high in the elderly (50 years of age or more) as in individuals aged 15–49 years old (6.02–6.29% vs. 3.37%) [1].

The measurement of psychological constructs such as depression and quality of life is complicated due to there being no way of assessing them directly. However, they can be quantified with an instrument, of which there are several for depression assessment, such as the Hamilton Rating Scale for Depression, the Beck Depression Inventory, the Montgomery-Åsberg Depression-Rating Scale, the Patient Health Questionnaire-9 (PHQ-9), the Calgary Depression Scale for Schizophrenia (CDSS), among others [2,3,4,5,6]. A Nine-Questions Depression-Rating Scale (9Q) in the northern Thai dialect is a measurement tool developed for assessing the severity of depressive symptoms in Thais in the northern region of the country since many people there do not use the formal Thai dialect in their daily lives, especially elderly people and those living in rural areas. Communication or interviewing involving technical terms in the formal Thai dialect could have led to misunderstanding. Researchers conducting a previous depression surveillance study in the northern region of Thailand using a two-question depression screening test (2Q) in the formal Thai dialect found that some participants denied the existence of symptoms related to depression due to the question not being relevant in their sociocultural context. Therefore, the 9Q in the northern Thai dialect was developed to reduce the possibility of misunderstanding due to the language barrier. It consists of nine rating scale items about the frequency and intensity of the diagnostic symptoms for major depressive disorder [7]. Scoring in the 9Q is commonly summed (ranging from 0 to 81 points) based on traditional techniques such as the Classical Test Theory (CTT). In contrast to the CTT approach, the Item Response Theory (IRT) is a technique for analyzing important aspects of measurements (e.g., item difficulty and item discrimination, as well as ordering of the response categories) and offers many advantages over CTT. 
The authors in [8] stated that an IRT model yields the estimated item and latent trait while taking variation according to the population characteristics into account, and thus can provide more comprehensive and accurate evaluations of item characteristics. Moreover, it can be applied to assess group differences for item and scale functioning and evaluate scales containing items with different response formats. In addition, it can also be helpful for developing better health outcome measures and for modeling changes over time. It has also been increasingly used as an alternative to CTT for developing and validating measures of psychiatric disorders such as depression and anxiety [8,9,10,11,12,13,14,15]. The results from previous studies suggest that the IRT approach may reveal additional information about the actual level of depression or other disorders compared to standard sum scoring [16,17,18,19].

IRT-based scoring may also increase the precision in discriminating between individual differences in items over time [17]. The results of a simulation study indicate that the bias in estimating the rate of change over time was reduced by IRT-based scoring compared to standard sum scoring [20], possibly because, unlike CTT, IRT does not assume a constant error along the continuum of the measure.

Previously, McNeish and Wolf [21] showed that factor and IRT-based scoring are optimally weighted scales in which the loading for each item can be estimated differently. In contrast, the sum-score approach is based on unit-weighting scoring that does not account for possible differences in the relationship between the latent trait score and each item, which can lead to less reliable scores when the items relate to the latent trait to different degrees. In addition, the authors compared the results of using sum-scoring, factor-scoring, and simultaneous approaches on Verbal Cognition and Speeded Cognition for school membership. Their results showed that different scoring methods can lead to different conclusions: with sum scoring, the first school membership group scored significantly higher on Verbal Cognition while the second group scored significantly higher on Speeded Cognition, which was different from the results using the factor-scoring regression and simultaneous approaches. This finding suggests that despite high correlations between the sum scores and factor scores (R² = 0.97), small unexplained variances between the methods can lead to different conclusions. However, Widaman and Revelle [22] suggested that there was variation in factor loadings and factor scoring weights across the samples. Since the IRT approach takes the variation in population characteristics into account when estimating item parameters and latent traits [8], we hypothesized that applying IRT parameters as the weights for weighted sum scoring could be beneficial for mitigating this issue.

The PHQ-9 is commonly employed as a screening tool for depression and its severity in Thailand due to its excellent sensitivity and specificity for major depressive disorder [2]. However, considering only the frequency of symptoms might not uncover the intensity of each symptom. Moreover, the standard sum score of PHQ-9 or 9Q based on CTT might lead to estimation bias between the demographics of the population and in the follow-up monitoring of people at risk over time. In addition, differences in responses between genders have rarely been taken into account when scoring depression or depressive symptom severity. Differential item functioning (DIF) analysis is an approach to examine the difference in the probability of responding to an item among groups with the same psychological construct score. Previously, several researchers have found an impact of gender on the response pattern for a depression or depressive symptoms scale. In a study in Australia, researchers found that gender-related DIF was present in three symptoms associated with depression in the World Health Organization’s Composite International Diagnostic Interview [23]. The results of a study among Chilean adolescents indicate that DIF across gender was present in 6 of 13 items of the ASEBA School-Age Form Youth Self Report (YSR) used to measure depression and anxiety levels, among other disorders. These findings suggest that items found in commonly utilized measures for anxiety and depression symptoms may not represent the true level of behavioral problems unless DIF analysis is conducted based on gender [24]. The findings from another study on response patterns of Brazilian college students using the Beck Depression Inventory-II (BDI-II) indicate that gender-related DIF was present in one item related to crying, implying that women are more likely to respond with a higher level of crying behavior than men even when they have a similar severity level of depression [25]. 
These studies reveal the importance of accounting for the difference in response patterns between genders. Therefore, our aim was to develop an IRT-based weighted sum scoring approach for a depressive symptom severity diagnosis tool that provides a more informative and precise indication of the actual levels of depressive symptoms as an alternative to the unweighted sum scoring approach by taking gender-related DIF into account. For that purpose, the performances of depressive symptom severity detection using the unweighted and IRT-based weighted scoring approaches for the 9Q were compared.

Methods

Settings and participants

We used secondary data from a study on the criterion-related validity of a revised 9Q in the northern Thai dialect comprising 1,527 individuals from the northern region of Thailand. This revised questionnaire was translated from the central Thai dialect version [7]. Participants who did not complete all items in the assessment or were under 19 years old were excluded from the study. The remaining participants were randomly stratified with proportional allocation for gender into two groups: a developmental group for IRT-based weighted scoring (n = 1,000) and a validation group for performance comparison (n = 355).
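The stratified split described above can be illustrated with a short Python sketch. This is not the authors' procedure (the analyses were run in Stata); the function name `stratified_split`, the rounding rule, and the synthetic records are assumptions made for illustration, with the gender counts taken from the Results.

```python
import random
from collections import defaultdict

def stratified_split(records, stratum_key, n_dev, seed=0):
    """Split records into developmental and validation groups, giving each
    stratum a share of the developmental group proportional to its size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[rec[stratum_key]].append(rec)
    total = len(records)
    # Proportional quota per stratum (rounded); the largest stratum absorbs
    # any rounding drift so the quotas always sum to n_dev (an assumption).
    quotas = {g: round(n_dev * len(m) / total) for g, m in strata.items()}
    largest = max(strata, key=lambda g: len(strata[g]))
    quotas[largest] += n_dev - sum(quotas.values())
    dev, val = [], []
    for g, members in strata.items():
        shuffled = members[:]
        rng.shuffle(shuffled)
        dev.extend(shuffled[:quotas[g]])
        val.extend(shuffled[quotas[g]:])
    return dev, val

# Synthetic records: 920 females and 435 males (counts from the Results)
participants = [{"id": i, "gender": "F" if i < 920 else "M"} for i in range(1355)]
dev, val = stratified_split(participants, "gender", n_dev=1000)
```

With these counts, the developmental group keeps roughly the same female proportion as the full sample, and the remaining records form the validation group.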

Assessments

The approach consisted of several parts, including demographics, screening for major depressive disorder, and diagnosis by an expert. All of the participants were first interviewed by a psychiatric nurse to obtain their demographic data and screen them for major depressive disorder using the revised two-question screening test [26]. They were then evaluated for depressive symptoms by a psychiatric nurse using the revised 9Q, which was blinded for another psychiatric nurse who evaluated them for major depressive disorder severity by using the Hamilton Rating Scale for Depression (HRSD–17). The participants were then interviewed by a psychiatrist to diagnose major depressive disorder based on the fourth edition of the American Psychiatric Association’s Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) [27] and the MINI International Neuropsychiatric Interview-Thai version [28].

The 9Q was developed to assess the severity of depressive symptoms whereas the PHQ–9 was used to screen for depression. We hypothesized that considering only frequency of symptoms might not uncover the severity of depressive symptoms, and thus both the frequency and symptom intensity were accounted for in the product score in the calculation. Development of the 9Q in the northern Thai dialect included the following processes. (1) Psychiatrists and psychiatric nurses with experience of diagnosing depression and who spoke the northern Thai dialect consulted with experts in this dialect and patients/relatives living in northern Thailand to establish pertinent words and phrases for the questions about expressing feelings and mood in the formal Thai dialect version and the DSM-IV criteria by using the Delphi technique. (2) The study team formed a focus group involving the various populations in the northern area across age groups to ensure that the language used in this scale enabled efficient communication. (3) The developed tool was evaluated for construct validity and reliability by using exploratory factor analysis and Cronbach’s alpha coefficients, respectively.

The 9Q consists of nine rating scale items: (1) depressed mood (Mood); (2) markedly diminished interest or pleasure (Interest); (3) insomnia or hypersomnia (Sleep); (4) fatigue or loss of energy (Fatigue); (5) weight loss when not dieting or weight gain (Weight); (6) feeling of worthlessness or excessive or inappropriate guilt (Guilty); (7) diminished ability to think or concentrate, or indecisiveness (Concentration); (8) Psychomotor agitation or retardation (observable by others, not merely subjective feelings of restlessness or being slowed down) (Psychomotor); and (9) recurrent thoughts of death, recurrent suicidal ideation, or a suicide attempt or a specific plan for committing suicide (Suicide). The participant scored each item according to the perceived intensity (0 = no symptoms, 1 = mild, 2 = moderate, 3 = severe) and frequency (1 = several days, 2 = more than a week, 3 = nearly every day) of major depressive disorder symptoms within the previous two weeks. The score for each item was calculated as the product of the intensity and frequency scores. There are 7 plausible points for the product score of each item (0 = no symptoms, 1 = mild symptoms for several days, 2 = moderate symptoms for several days or mild symptoms for more than a week, 3 = severe symptoms for several days or mild symptoms nearly every day, 4 = moderate symptoms for more than a week, 6 = moderate symptoms nearly every day or severe symptoms for more than a week, and 9 = severe symptoms nearly every day). The total score for the 9Q ranges from 0 to 81 points. In the IRT procedure (i.e., assumption testing and parameter estimation), the 9Q product labels were defined as 0, 1, 2, 3, 4, 5, and 6 corresponding to the traditional 9Q scores of 0, 1, 2, 3, 4, 6, and 9, respectively.
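The item scoring and recoding rules above can be sketched in Python as follows. The function names are hypothetical, and treating the frequency as irrelevant when intensity is 0 is an assumption based on the "0 = no symptoms" category.

```python
# Traditional 9Q product scores mapped to the ordinal labels used in the IRT procedure
PRODUCT_TO_LABEL = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 6: 5, 9: 6}

def item_product(intensity, frequency):
    """Product score for one item (assumption: frequency is ignored when intensity is 0)."""
    if intensity == 0:
        return 0
    if intensity not in (1, 2, 3) or frequency not in (1, 2, 3):
        raise ValueError("intensity must be 0-3 and frequency 1-3")
    return intensity * frequency

def score_9q(responses):
    """responses: nine (intensity, frequency) pairs.
    Returns the unweighted total (0-81) and the per-item IRT labels (0-6)."""
    if len(responses) != 9:
        raise ValueError("the 9Q has exactly nine items")
    products = [item_product(s, f) for s, f in responses]
    labels = [PRODUCT_TO_LABEL[p] for p in products]
    return sum(products), labels

# Mild nearly every day, severe for several days, moderate nearly every day,
# five items without symptoms, and severe nearly every day
example = [(1, 3), (3, 1), (2, 3)] + [(0, 1)] * 5 + [(3, 3)]
total, labels = score_9q(example)
```

Note that the first two items in the example receive the same product score and label even though the endorsed intensity and frequency differ.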

IRT models

This family of models can be used to measure an unobservable characteristic or a latent trait (θ) in individuals. An important difference between IRT and CTT is that the scale for the underlying latent variable that is being measured by a set of items is defined in IRT and the items are calibrated with respect to the scale. A commonly used IRT model for dichotomous items is the two-parameter logistic (2PL) model represented by two item parameters: item discrimination (a) and item difficulty (b).

Analogous to the 2PL model, IRT models for polytomous items (e.g., the Likert scale) have one discrimination parameter (a_i) and a set of between-category threshold parameters (b_ik), one fewer than the number of response categories, for each item. The discrimination parameter indicates the slope of the category response curves, with a narrow and peaked curve indicating that the response category differentiates well across latent traits. The threshold parameters represent the location of the latent-trait level at which individuals have a 50% probability of endorsing the next, adjacent response category [29, 30]. Marginal maximum likelihood estimation (MMLE) using an expectation-maximization (EM) algorithm is suggested for parameter estimation [31, 32]. The polytomous IRT models used in our study were the graded-response model (GRM) (Eq. 1) and the nominal-response model (NRM) (Eq. 2):

$${\kern 1pt} {P_{ik}}(\theta )=\frac{{\exp \left[ {{a_i}\left( {\theta - {b_{ik}}} \right)} \right]}}{{1+\exp \left[ {{a_i}\left( {\theta - {b_{ik}}} \right)} \right]}} - \frac{{\exp \left[ {{a_i}\left( {\theta - {b_{i(k+1)}}} \right)} \right]}}{{1+\exp \left[ {{a_i}\left( {\theta - {b_{i(k+1)}}} \right)} \right]}},$$

(1)

$$P_{ik}(\theta) = \frac{\exp\left[a_{ik}\theta + c_{ik}\right]}{\sum\limits_{h=0}^{m} \exp\left[a_{ih}\theta + c_{ih}\right]},$$

(2)

where $P_{ik}(\theta)$ = the probability of responding to item i in category k (k = 0, 1, …, m),

a_i = the discrimination parameter for item i,

a_ik = the category slope parameter for item i in category k,

b_ik = the threshold parameter for item i in category k, and

c_ik = the category intercept parameter for item i in category k.
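As a rough illustration of Eqs. 1 and 2, the following Python sketch computes the category probabilities for a single item. The function names are hypothetical, and the boundary convention for the GRM (cumulative probability 1 below the lowest category and 0 above the highest) is an assumption that the equation leaves implicit.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def grm_probs(theta, a, thresholds):
    """Category probabilities under the graded-response model (Eq. 1).
    thresholds: increasing b_i1..b_im; returns probabilities for k = 0..m."""
    # Cumulative curves P(response >= k); assumed boundaries: 1 for k = 0, 0 beyond m
    cum = [1.0] + [logistic(a * (theta - b)) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

def nrm_probs(theta, slopes, intercepts):
    """Category probabilities under the nominal-response model (Eq. 2)."""
    z = [a * theta + c for a, c in zip(slopes, intercepts)]
    zmax = max(z)  # subtract the maximum for numerical stability
    expz = [math.exp(v - zmax) for v in z]
    total = sum(expz)
    return [e / total for e in expz]

# Hypothetical parameter values, not estimates from the study
p_grm = grm_probs(theta=0.5, a=2.5, thresholds=[0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
p_nrm = nrm_probs(theta=0.5, slopes=[0.0, 0.8, 1.6], intercepts=[0.0, -0.5, -1.5])
```

In both models the category probabilities for an item sum to one at any trait level, which is a quick sanity check on an implementation.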

Since the score for each 9Q item was calculated by multiplying its frequency and intensity, some of its values were equal even though their endorsements can be different. For example, the 9Q score of an individual who endorsed mild symptoms nearly every day (intensity = 1 multiplied by frequency = 3) is 3 points, which is the same as another individual who endorsed severe symptoms for several days (intensity = 3 multiplied by frequency = 1). Thus, there can be difficulties when accounting for this via the traditional ordering of the 9Q scores or nominal categorization using IRT-based scoring. Therefore, we applied the NRM for the categorical combination of symptom intensity and frequency on the nominal scale without natural ordering in addition to the GRM with ordering.

Model selection

Prior to fitting the IRT model, the unidimensionality and local independence assumptions were evaluated: unidimensionality by using a confirmatory factor analysis (CFA) with a maximum likelihood estimator, and local independence by using the residual correlation matrix resulting from the single-factor CFA. Unidimensionality indices, including a comparative fit index (CFI) > 0.95, a Tucker-Lewis index (TLI) > 0.95, and a root-mean-squared error of approximation (RMSEA) < 0.06, indicate that the fit of the model is adequate [33]. A residual correlation value of > 0.20 possibly indicates local dependence [34]. The monotonicity assumption was evaluated based on Loevinger’s H coefficient values for both the total scale (H) and each item (H_i). Item coefficients (H_i) of ≥ 0.30 and a total-scale coefficient (H) of ≥ 0.50 were taken to indicate acceptable monotonicity [35, 36].

Likelihood-ratio testing was performed to compare the IRT models. The model with the lowest Akaike information criterion (AIC) and Bayesian information criterion (BIC) values was selected for model fitting [37].
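For reference, the two information criteria follow the standard definitions AIC = 2p − 2 ln L and BIC = p ln(n) − 2 ln L, where p is the number of estimated parameters and n the number of observations. The sketch below applies them to two candidate models; the log-likelihoods and parameter counts are hypothetical placeholders, not values from this study.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

def compare_models(candidates, n_obs):
    """candidates: name -> (log_likelihood, n_params).
    Returns the (AIC, BIC) table and the model preferred by each criterion."""
    table = {name: (aic(ll, k), bic(ll, k, n_obs))
             for name, (ll, k) in candidates.items()}
    best_aic = min(table, key=lambda name: table[name][0])
    best_bic = min(table, key=lambda name: table[name][1])
    return table, best_aic, best_bic

# Hypothetical fits for illustration only
candidates = {"GRM": (-5300.0, 63), "NRM": (-5290.0, 108)}
table, best_aic, best_bic = compare_models(candidates, n_obs=1000)
```

In this made-up case the more heavily parameterized model has a slightly better log-likelihood but is penalized by both criteria.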

The item-fit statistics of each item in the 9Q product and the 9QSF combination for the GRM were tested by using the chi2W method of Kondratek (2020) [38]. It is a Wald-type test statistic that compares the observed and expected item mean scores over a set of ability bins. It is available as a module in the Stata statistical software suite and can be used as an alternative method to assess the item-fit statistics for polytomous items.

Differential item functioning (DIF)

This occurs when participants from different demographic groups (e.g., gender, age) with the same underlying trait score have a different probability of responding to an item. The presence of DIF may compromise comparisons across subgroups and can lead to misleading results, and measurement invariance cannot be presumed if DIF is present [39]. It can either be non-uniform, which is due to a statistically significant interaction between the trait level and the demographic variable (effect modification), or uniform, which is the difference between the strength of the relationship between the ability and the item responses in a model with and without the demographic variable for each item (confounding) [40].

An IRT-based technique was used to detect DIF for polytomous items. A baseline IRT model was fitted for all items and then compared, item by item, with a model in which the discrimination and threshold parameters of that item were allowed to vary between the reference and focal groups. The models were compared using the likelihood-ratio test, with a significant difference (p-value < 0.05) between the two models indicating the presence of DIF between the groups [39,40,41].
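The likelihood-ratio comparison can be sketched as follows, assuming the log-likelihoods of the two fitted models are already available from the estimation software. The function names and the example values are hypothetical; the degrees of freedom are taken to be the number of extra parameters estimated in the model with group-specific parameters, which is an assumption about how the test was set up. The chi-square tail probability uses a closed form that is valid for positive integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (positive integer df)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite sum with half-integer gamma values
    tail = math.erfc(math.sqrt(half))
    for j in range(1, (df - 1) // 2 + 1):
        tail += math.exp(-half) * half ** (j - 0.5) / math.gamma(j + 0.5)
    return tail

def lr_dif_test(loglik_baseline, loglik_group_specific, df, alpha=0.05):
    """Likelihood-ratio test of a baseline model against a model whose
    parameters for one item vary between groups; flags DIF when p < alpha."""
    stat = 2.0 * (loglik_group_specific - loglik_baseline)
    p_value = chi2_sf(stat, df)
    return stat, p_value, p_value < alpha

# Hypothetical log-likelihoods; df = 7 assumes one discrimination and six
# threshold parameters are freed for the item under test
stat, p_value, has_dif = lr_dif_test(-5400.0, -5390.0, df=7)
```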

IRT-based weighted scoring

The 9Q score (the sum-score of symptom intensity multiplied by the frequency of each item on an ordinal scale) and the 9QSF (the categorical combination of symptom intensity and frequency on a nominal scale) were compared in this study. In the model selection procedure, the GRM, which attained the lowest AIC and BIC values (Table 2), was used as the baseline model for IRT parameter estimation. For the GRM, a discrimination parameter (a_i) and threshold parameters (b_ik) for k categories were estimated for each item i. However, the GRM could not be used for parameter estimation using the 9QSF due to the unordered scores of the categorical combinations. Thus, the IRT parameters for the 9Q score were estimated based on the GRM while the parameters for the 9QSF were estimated based on the NRM with 10 plausible combined categories (0 = no symptoms, 11 = mild symptoms for several days, 12 = mild symptoms for more than a week, 13 = mild symptoms nearly every day, 21 = moderate symptoms for several days, 22 = moderate symptoms for more than a week, 23 = moderate symptoms nearly every day, 31 = severe symptoms for several days, 32 = severe symptoms for more than a week, and 33 = severe symptoms nearly every day). The number of each category combination was only used to label the category and was not based on the scoring. For the latter model, the k-1 category slope or category boundary discrimination (CBD) parameters for category sf (a_i(sf)) and category intercept parameters for category sf (c_i(sf)) were estimated for each item i.

We also tested IRT models without accounting for DIF (9Q-GRM and 9QSF-NRM) along with other models accounting for DIF (9Q-GRM-DIF and 9QSF-NRM-DIF). For example, we found that gender-related DIF was present in items 2 and 5 of the 9Q score under the GRM. Therefore, the 9Q-GRM-DIF model was used to separately estimate threshold parameters for these items according to gender in the IRT-based weighted sum scoring.

For IRT-based weighted scoring, we considered that the threshold and discrimination parameters (based on the GRM) and the category slope parameters and category intercept parameters (based on the NRM) can be applied as the category weights and item weights for the weighted scoring for individual item scores. Thus, the IRT-based weighted sum score was calculated based on the weighted score for each item. The estimated values of the threshold parameters (b_ik) under GRM were considered as the category weight for item i in category k whereas the estimated discrimination parameters (a_i) were considered as the item weight for item i. The 9Q-GRM (or 9Q-GRM-DIF) score for individual j is defined as the linear combination of the product of the individual responses and the category weights weighted with item weights for all items divided by the plausible maximum of the product weighted score as follows:

$${\kern 1pt} 9{\text{Q-GR}}{{\text{M}}_j}{\text{ = }}\frac{{\sum\limits_{{i=1}}^{9} {\sum\limits_{{k=0}}^{6} {{a_i}{b_{ik}}{X_{ik}}} } }}{{\sum\limits_{{i=1}}^{9} {{a_i}{b_{i6}}{X_{i6}}} }},$$

(3)

where a_i is the discrimination parameter for item i (i = 1, 2, …, 9), b_ik is the threshold parameter for item i in category k (k = 0, 1, 2, 3, 4, 5, 6), and X_ik is the response of the individual for item i in category k (0 when category k is not endorsed or 1 when it is).

Meanwhile, under the NRM, combining the estimated category slope parameters (a_i(sf)) and estimated category intercept parameters (c_i(sf)) provides the category weights. The 9QSF-NRM (or 9QSF-NRM-DIF) score for individual j is defined as the linear combination of the individual weighted responses divided by the plausible maximum of the combined weighted score as follows:

$$9{\text{QSF-NR}}{{\text{M}}_j}{\text{ = }}\frac{{\sum\limits_{{i=1}}^{9} {\sum\limits_{{sf=0}}^{{33}} {\left( {{a_{i(sf)}}+{c_{i(sf)}}} \right){X_{i(sf)}}} } }}{{\sum\limits_{{i=1}}^{9} {{\text{MAX}}\left( {\left( {{a_{i(sf)}}+{c_{i(sf)}}} \right){X_{i(sf)}}} \right)} }},$$

(6)

where a_i(sf) is the category slope parameter for item i (i = 1, 2, …, 9) in category sf, c_i(sf) is the category intercept parameter for item i in category sf (sf = 0, 11, 12, 13, 21, 22, 23, 31, 32, or 33), and X_i(sf) is the response of the individual for item i in category sf (0 when category sf is not endorsed or 1 when it is).
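One way to read the NRM-based score is sketched below in Python. The function name and all parameter values are hypothetical, the denominator is interpreted as the sum over items of the largest category weight (a_i(sf) + c_i(sf)), and the final multiplication by 81 follows the rescaling used for comparison with the unweighted 9Q score. This is an illustrative interpretation, not the authors' implementation.

```python
CATEGORIES = (0, 11, 12, 13, 21, 22, 23, 31, 32, 33)

def nrm_weighted_score(responses, slopes, intercepts, scale=81.0):
    """responses: endorsed category label for each item.
    slopes, intercepts: one dict per item mapping category label -> parameter."""
    numerator = 0.0
    denominator = 0.0
    for sf, a_i, c_i in zip(responses, slopes, intercepts):
        weights = {cat: a_i[cat] + c_i[cat] for cat in CATEGORIES}
        numerator += weights[sf]              # weight of the endorsed category
        denominator += max(weights.values())  # assumed "plausible maximum" for the item
    return scale * numerator / denominator

# Hypothetical parameters, reused for all nine items for brevity
a = {0: 0.0, 11: 0.5, 12: 0.8, 13: 1.0, 21: 1.2, 22: 1.5, 23: 1.8, 31: 2.0, 32: 2.4, 33: 3.0}
c = {0: 0.0, 11: 0.2, 12: 0.1, 13: 0.1, 21: 0.0, 22: -0.1, 23: -0.2, 31: -0.3, 32: -0.4, 33: -0.5}
slopes, intercepts = [a] * 9, [c] * 9

score_max = nrm_weighted_score([33] * 9, slopes, intercepts)
score_min = nrm_weighted_score([0] * 9, slopes, intercepts)
score_mixed = nrm_weighted_score([13, 31, 0, 0, 0, 0, 0, 0, 0], slopes, intercepts)
```

Unlike the product score, categories 13 and 31 can receive different weights here, which is the motivation for treating the combinations as nominal.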

For example, under the GRM, assuming that the discrimination parameter of item 1 (mood) is 2.50 and the threshold parameters categorized from 0 to 6 are 0, 0.50, 1.00, 1.50, 2.00, 2.50, and 3.00, respectively, the item score is 7.50 (2.50 multiplied by 3.00) if the participant endorses a severe level for mood nearly every day. The sums of all of the item scores were calibrated on a 0–1 scale by dividing by the plausible maximum sum score, and the scale was then multiplied by 81 to enable comparison with the 9Q unweighted scores.
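The worked example can be reproduced with a short Python sketch of the GRM-based score. The function name is hypothetical, the parameters are the illustrative values from the example (reused for all nine items, which is an assumption made only for brevity), the lowest category is given a weight of 0 as in the example, and the denominator is evaluated as the score obtained when the highest category is endorsed for every item.

```python
def grm_weighted_score(labels, discriminations, thresholds, scale=81.0):
    """labels: endorsed category (0-6) for each item.
    thresholds[i][k]: category weight b_ik for item i, with b_i0 = 0."""
    numerator = sum(a * b[k] for k, a, b in zip(labels, discriminations, thresholds))
    # Plausible maximum: highest category endorsed on every item (assumption)
    maximum = sum(a * b[-1] for a, b in zip(discriminations, thresholds))
    return scale * numerator / maximum

# Illustrative values from the worked example, reused for all nine items
a_example = 2.50
b_example = [0.0, 0.50, 1.00, 1.50, 2.00, 2.50, 3.00]
discriminations = [a_example] * 9
thresholds = [b_example] * 9

item_score = a_example * b_example[6]  # severe symptoms nearly every day on one item
score_one_item = grm_weighted_score([6] + [0] * 8, discriminations, thresholds)
score_all_items = grm_weighted_score([6] * 9, discriminations, thresholds)
```

For the DIF-adjusted variant, the same function could be called with gender-specific threshold lists for the affected items.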

Statistical analysis

The demographics of the participants are reported as frequencies and percentages for categorical variables and as medians and interquartile ranges (IQRs) for continuous variables. Differences between the demographic variable values of the developmental and validation groups were tested for significance by using Fisher’s exact test and the Mann-Whitney U test for categorical and continuous variables, respectively.

Differences between the mean scores of the depressive symptom severity levels were tested for the 9Q sum score (reference) and for the 9Q frequency, 9Q-GRM, 9Q-GRM-DIF, 9QSF-NRM, and 9QSF-NRM-DIF scores by using analysis of variance (ANOVA) with Bonferroni adjustment. Pairwise comparisons between severity categories were performed using independent t-tests. The relative precision (RP) index was used to compare the performances of the two scoring procedures [42], the results of which are expressed as the ratio of the pairwise F-statistics (the IRT-based weighted score F-statistic divided by the unweighted sum-score F-statistic). This indicator is used to determine how much more or less precise the new scoring methods (9Q-GRM score, 9Q-GRM-DIF score, 9QSF-NRM score, and 9QSF-NRM-DIF score) are relative to the traditional method (9Q score) for distinguishing the severity of depressive symptoms. All analyses were performed using Stata version 17 (StataCorp, College Station, Texas 77845, USA).
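The RP index can be illustrated with a minimal Python sketch that computes a one-way ANOVA F-statistic for each scoring method and takes their ratio. The function names and the toy scores are hypothetical, and the sketch omits the Bonferroni adjustment and the pairwise t-tests.

```python
def one_way_f(groups):
    """One-way ANOVA F-statistic; groups is a list of lists of scores,
    one list per severity level."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def relative_precision(weighted_groups, unweighted_groups):
    """RP = F(IRT-based weighted score) / F(unweighted sum score)."""
    return one_way_f(weighted_groups) / one_way_f(unweighted_groups)

# Toy scores for two severity groups under each scoring method
unweighted = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
weighted = [[1.0, 2.0, 3.0], [7.0, 8.0, 9.0]]
rp = relative_precision(weighted, unweighted)
```

An RP above 1 means the weighted score separates the severity groups more sharply, relative to within-group variation, than the unweighted score does.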

Results

Of 1,527 individuals who participated in the primary study of the 9Q in the northern region of Thailand, 52 respondents (3.41%) who did not complete all of the items in the 9Q and the HRSD-17 were excluded from the analysis. Of the 1,355 participants aged 19 years old or more who were included in the study, 920 (67.90%) participants were female and the median age was 48 years old (IQR: 36–58). Most participants were married or living with a partner (64.99%). Two-hundred and fifteen participants (15.95%) were unemployed while around half of the participants (48.88%) had an income of less than 5000 baht/month. The major ethnicity and nationality of the participants were Thai (89.72% for ethnicity and 92.01% for nationality). Five hundred and twelve participants (38.18%) had at least one underlying disease (Table 1) such as hypertension, allergy, and/or diabetes mellitus. One thousand participants were randomly selected for the developmental group for the IRT-based weighted sum scoring while the remaining 355 participants were assigned to the validation group. There were no differences in the demographic characteristics between the two groups (Table 1).

Table 1 Characteristics of the participants (N = 1,355)

According to item endorsement, more than 80% of the participants had no symptoms related to depression within the previous two weeks (except for items 2, 3, and 7). Item 3 had the highest endorsement rate of having severe symptoms nearly every day. Almost all of the participants (96%) did not report thoughts of physical self-harm or suicide (item 9) (Fig. 1).

The unidimensionality, local independence, and monotonicity assumption indices for the 9Q product and 9QSF combination used on participants aged 19 years old or over produced values close to the acceptance criteria. However, the values for participants aged 13–18 years old were poor (Supplementary Table 1). Therefore, IRT parameter estimation and scoring were only conducted on the participants aged 19 years old or over to avoid critical violations of the IRT assumptions. According to the model comparison using the likelihood-ratio test, the GRM was the most appropriate model for all participants (AIC = 10710.43; BIC = 10898.05), as well as for males (AIC = 3299.51; BIC = 3442.14) and females (AIC = 7428.65; BIC = 7602.32). However, due to the unordered scores for categorical combinations, the NRM was used to estimate the IRT parameters for the 9QSF even though its AIC and BIC values were slightly higher (Table 2). According to the item-fit statistics, three of the 9Q product items were a good fit for the GRM (Interest: χ² = 1.75, p = 0.186; Guilt: χ² = 1.37, p = 0.241; and Psychomotor: χ² = 3.07, p = 0.080) whereas only one item from the 9QSF was suitable (Psychomotor: χ² = 3.21, p = 0.073) (Table 3). The results of the DIF analysis show that there were significant differences between males and females in the responses to items 2 and 5 for the 9Q score and items 1, 2, 4, 5, 7, 8, and 9 for the 9QSF combination (Table 4). Therefore, we used both the IRT models without accounting for DIF and the models accounting for DIF between males and females in this study.

Table 2 Item Response Theory model selection for the included participants aged ≥ 19 years (N = 1,355)

Table 3 Item-fit statistics for the 9Q product and 9QSF combination items’ suitability for the GRM by using chi2W item-fit statistics (adult participants; N = 1,355)

Table 4 DIF analysis between males and females (N = 1,355)

The estimated IRT parameter values based on the GRM for the 9Q score are reported in Table 5. For the GRM accounting for DIF, the threshold parameters of items 2 and 5 are reported separately for males and females. Item 1 had the highest discrimination parameter values for both models while item 3 had the lowest. The IRT-based weighted sum score for the 9Q score was calculated by using the estimated discrimination parameters and the threshold parameters for items 1 through 9 for the validation group based on Eq. 3. The estimated IRT parameter values for the 9QSF combination based on the NRM are reported in Table 6. Since endorsements for some combinations of the 9QSF were absent, we used the values from the other gender when they were absent for a particular gender or the values from the previous set of frequencies with the same intensity when they were absent for both genders. The category slope and intercept parameter values are reported separately for each category for the model without accounting for DIF and additionally separated by gender for the model accounting for DIF. The IRT-based weighted sum score of the 9QSF combination was calculated using the estimated parameters for the validation group based on Eq. 6. Examples of the raw score for each item and of the 9Q, 9Q-GRM, 9Q-GRM-DIF, 9QSF-NRM, and 9QSF-NRM-DIF scores are summarized in Supplementary Table 2.

Table 5 Estimated IRT parameter values for the 9Q score with the GRM for the developmental group

Table 6 Estimated IRT parameter values for the 9QSF combination with the NRM for the developmental group

Table 7 reports the means and standard errors of the IRT-based weighted sum scores and unweighted sum scores for the validation group (N = 355). The IRT-based weighted sum scores were rescaled to range from 0 to 81 so that they could be compared directly with the unweighted sum score (the 9Q score); after rescaling, the mean IRT-based weighted sum scores were higher than the mean 9Q unweighted score. Overall and pairwise comparisons between the means of the depressive symptom severity groups show that they were significantly different. The overall RP values show that 9Q-GRM, 9Q-GRM-DIF, and 9QSF-NRM (1.140, 1.167, and 1.045, respectively) had higher precision than the unweighted sum score. However, in the pairwise comparisons, the RP values for the IRT-GRM scores were lower than those for the 9Q score when comparing the mean values for no and severe depression. In addition, the RP of 9Q-GRM-DIF was higher than those of the other IRT models for almost all pairwise comparisons conducted in this analysis.
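The rescaling and precision comparison summarized above can be illustrated with a short sketch. It assumes that relative precision is the ratio of one-way ANOVA F-statistics across severity groups (the new score over the reference score), a definition common in comparisons of standard and IRT-based scoring; the definition used for Table 7 is not restated in this section, so treat that as an assumption. The function names and the toy data are hypothetical.

```python
from typing import List, Sequence


def rescale(scores: Sequence[float], old_max: float,
            new_max: float = 81.0) -> List[float]:
    """Linearly rescale scores from [0, old_max] to [0, new_max]."""
    return [s / old_max * new_max for s in scores]


def anova_f(groups: Sequence[Sequence[float]]) -> float:
    """One-way ANOVA F-statistic for scores split by severity group."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def relative_precision(new_groups, reference_groups) -> float:
    """RP > 1 means the new score separates the groups more sharply."""
    return anova_f(new_groups) / anova_f(reference_groups)


# Toy data only: two severity groups scored with two methods.
unweighted = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
weighted = [[1.0, 2.0, 3.0], [7.0, 8.0, 9.0]]
rp = relative_precision(weighted, unweighted)

# Rescaling is linear, so it leaves the F-statistic (and hence RP) unchanged.
weighted_rescaled = [rescale(g, old_max=9.0) for g in weighted]
```

Because the F-statistic is invariant to linear rescaling, mapping the weighted scores onto the 0 to 81 range affects the reported means but not the RP values.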

Table 7 Analysis of the depressive symptoms severity levels for the validation group (N = 355)

Discussion

We conducted an observational study to develop an IRT-based weighted scoring approach for a depressive symptom assessment tool suitable for Thai adults. Individuals aged 19 years old or more from several areas of northern Thailand were interviewed and screened, and the severity of their depressive symptoms was assessed by using the 9Q and HRSD-17, followed by a medical assessment. We found that the GRM-based weighted score for the 9Q that accounts for DIF had higher precision than the traditional unweighted sum score.

Several items with DIF attained a high discrimination parameter value for the actual depression trait. Although there are several measurement tools for depression and its severity that are suitable for different settings, ignoring differences in the discrimination parameter values of the items in a measurement tool can cause bias. Scoring with discrimination and threshold parameters estimated separately across characteristics (e.g., gender, underlying disease, etc.) based on the IRT approach might be useful for reducing bias in depression and severity measurements. According to the DIF analysis, the responses to 9Q items 2 and 5 differed between males and females. This result, which is consistent with the findings from a previous study [43], could be due to the different underlying abilities of the gender groups or else different interpretations of the item responses. In addition to gender, it has been reported that some of the items in the PHQ-9 show DIF across age and ethnic groups [44, 45]. However, DIF analysis between ethnic groups was not performed in this study due to an insufficient number of participants who were not Thai. Further study should be conducted to examine differences in responses for other characteristics of the participants not covered in this study.

Both non-uniform DIF (NUDIF) and uniform DIF (UDIF) according to gender were present in two items (item 2 “Markedly diminished interest or pleasure” and item 5 “Significant weight loss or gain”). The significant DIF could be due to differences between the genders in the perception of or concern about psychological issues, arising not only from genetic but also from social, biological, and environmental factors. In Thai culture, and especially in rural areas, women take care of the family and do housework whereas men work to earn money. However, men can relax with colleagues and/or friends more often than women. These differences in tasks, environment, and lifestyle could have led to women being more prone to diminished pleasure from life. The results from a previous study on patients undergoing treatment for painful conditions in an emergency department in the US indicate that the female patients presented higher scores for stress and anxiety than the male ones [46]. In this case, “interest” was the hallmark depressive symptom presenting a difference in responses between males and females, so evaluators need to take extra care with this item when conducting the interview to prevent misdiagnosis and misinterpretation. In addition, the outcomes from a study on the impact of stressful life events on body mass index (BMI) changes also show that stressful life events are associated with an increase in BMI in females only [47]. The difference in this relationship might be due to DIF across gender groups.

When estimating the IRT parameters based on the GRM for the 9Q score, we found that item 1 “Feeling down, depressed, or hopeless” had the highest discrimination parameter value, meaning that depressed mood is the symptom most strongly related to depression severity, a finding which is consistent with a previous study using CFA [48]. The discrimination parameter estimates show that the strength of the relationship between depression severity and the response differs from item to item. Therefore, IRT procedures that can account for the different weights applied to the items seem to be appropriate for improving the scoring method for the 9Q adapted for northern Thais.
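To make the role of the discrimination parameter concrete, the sketch below evaluates the standard graded-response model, in which the probability of responding in category k or above is a logistic function of a(θ − b_k) and category probabilities are differences of adjacent cumulative probabilities. The function name and all parameter values are hypothetical and are not taken from Table 5.

```python
import math
from typing import List, Sequence


def grm_category_probs(theta: float, a: float,
                       thresholds: Sequence[float]) -> List[float]:
    """Category probabilities under the graded-response model.

    theta      : latent severity of the respondent
    a          : item discrimination
    thresholds : ordered threshold parameters b_1 < b_2 < ...
    """
    # P(X >= k | theta) for k = 1..K, padded with 1 (k = 0) and 0 (k = K + 1).
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    return [cumulative[k] - cumulative[k + 1]
            for k in range(len(thresholds) + 1)]


# Hypothetical items sharing thresholds but differing in discrimination.
thresholds = [-0.5, 0.5, 1.5]
low_disc = grm_category_probs(theta=2.0, a=0.8, thresholds=thresholds)
high_disc = grm_category_probs(theta=2.0, a=2.5, thresholds=thresholds)
```

For a respondent whose severity lies above the highest threshold, the item with the larger discrimination assigns more probability to the top category, which is why such items carry more information about severity and receive more weight in an IRT-based score.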

Our results show that accounting for DIF in the 9Q-GRM model provided higher precision (16.7%) than the traditional unweighted sum-score approach. This finding suggests that accounting for IRT discrimination and threshold parameters, as well as the differences between responses according to gender, could provide higher precision in 9Q scoring to evaluate the severity of the depressive symptoms. However, as the results of the 9QSF-NRM-DIF indicate, replacing the missing estimated parameters with previous categorical values when there are missing or non-responses for some of the plausible combination categories seems to be inappropriate. Recruiting more participants or finding alternative approaches (e.g., simulation) to complete the sample for all of the plausible 9QSF categories might improve the scoring precision.

Previously, the findings from a study using other depression severity measurement tools (the Patient Health Questionnaire (PHQ-9) and the Hospital Anxiety and Depression Scale (HADS)) also pointed toward age-related DIF for three PHQ-9 items (“little interest or pleasure in doing things”, “feeling down, depressed or hopeless”, and “feeling tired or having little energy”), which is consistent with the 9Q items with age-related DIF in our study [44]. However, the results from a recent study on the DIF of the PHQ-9 among healthcare workers in Thailand indicate that DIF was not found in items across age, gender, education, or alcohol consumption [49]. This suggests that the absence of DIF in that study might be related to the none-to-low level of depression among the healthcare workers.

In addition, considering DIF for several factors could lead to estimating a large number of combinations of IRT parameters. The findings from a recent study on the impact of somatic symptoms on PHQ-9 scores suggest that although several items showed DIF with respect to disease-specific severity, salient DIF was present in the responses of very few patients [50]. Considering the impact of DIF on specific characteristics is worthy of further study.

There are limitations to this study. First, some categories of the 9Q items received no responses, which made it impossible to directly estimate the IRT parameters for several categories of the 9QSF combination models. Moreover, fewer participants had a moderate-to-severe level of depressive symptoms, which could have caused estimation bias and lower accuracy during parameter estimation involving these groups. In addition, we only performed the DIF analysis according to gender because there were insufficient participants to create separate groups for other variables. Indeed, the parameter estimates might have been more precise had differences in responses according to characteristics other than gender been considered. A further study with a larger sample size should be conducted to determine DIF in other variables and confirm the findings from the present study. Moreover, other approaches toward determining the DIF for polytomous items should be considered. In addition, the questionnaire used in this study was revised into the northern Thai dialect to interview only those Thais who understand it, so the IRT parameters used in this scoring approach might be different when used in other settings. Finally, so that psychiatrists, healthcare providers, and researchers do not need to compute the IRT-based weighted sum score by hand, we plan to develop a user-friendly website and/or smartphone app for practitioners to calculate the IRT-based weighted sum score automatically after inputting the raw data. However, access to IT equipment and/or the Internet could be a limitation for its practical usage. Thus, modifying the IRT-based weighted sum scoring system to make the calculation easier under these circumstances would be useful.

Conclusion

In summary, the findings for the parameters of the IRT models and scoring methods presented in this study suggest that we improved the scoring method for applying 9Q to measure the severity of depressive symptoms in Thai adults. Accounting for the DIF according to the gender of the participants resulted in higher precision both for overall and pairwise comparisons of mean depression scores using the IRT models. Our findings could improve the precision for evaluating depressive symptoms, which could lead to appropriate treatment according to the major depressive disorder severity.

Availability of data and materials

The datasets used and/or analyzed during the current study are not publicly available due to lack of previous approval to share data publicly. The datasets used and/or analyzed during the current study can be made available through a data-sharing agreement with the corresponding author on reasonable request.

References

1. Ritchie H, Roser M. Mental Health: Our World in Data; 2018. Available from: https://ourworldindata.org/mental-health.
2. Lotrakul M, Sumrithe S, Saipanish R. Reliability and validity of the Thai version of the PHQ-9. BMC Psychiatry. 2008;8:46.
3. Zimmerman M, Martinez JH, Young D, Chelminski I, Dalrymple K. Severity classification on the Hamilton Depression Rating Scale. J Affect Disord. 2013;150:384–8.
4. Suttajit S, Srisurapanont M, Pilakanta S, Charnsil C, Suttajit S. Reliability and validity of the Thai version of the Calgary Depression Scale for Schizophrenia. Neuropsychiatr Dis Treat. 2013;9:113–8.
5. Satthapisit S, Posayaanuwat N, Sasaluksananont C, Kaewpornsawan T, Singhakun S. The comparison of Montgomery and Asberg Depression Rating Scale (MADRS thai) to diagnostic and statistical manual of mental disorders (DSM) and to Hamilton Rating Scale for Depression (HRSD): validity and reliability. J Med Assoc Thai. 2007;90:524–31.
6. Jackson-Koku G. Beck Depression Inventory. Occup Med (Lond). 2016;66:174–5.
7. Kongsuk T, Arunpongpaisal S, Janthong S, Prukkanone B, Sukhawaha S, Leejongpermpoon J. Criterion-Related Validity of the 9 Questions Depression Rating Scale revised for Thai Central Dialect. J Psychiatric Association Thail. 2018;63:321–34.
8. Hays RD, Morales LS, Reise SP. Item response theory and health outcomes measurement in the 21st century. Med Care. 2000;38:II28–42.
9. Horton M, Perry AE. Screening for depression in primary care: a Rasch analysis of the PHQ-9. BJPsych Bull. 2016;40:237–43.
10. Zhong Q, Gelaye B, Fann JR, Sanchez SE, Williams MA. Cross-cultural validity of the Spanish version of PHQ-9 among pregnant Peruvian women: a Rasch item response theory analysis. J Affect Disord. 2014;158:148–53.
11. Adler M, Hetta J, Isacsson G, Brodin U. An item response theory evaluation of three depression assessment instruments in a clinical sample. BMC Med Res Methodol. 2012;12:84.
12. Fischer HF, Rose M. www.common-metrics.org: a web application to estimate scores from different patient-reported outcome measures on a common scale. BMC Med Res Methodol. 2016;16:142.
13. Haroz EE, Bolton P, Gross A, Chan KS, Michalopoulos L, Bass J. Depression symptoms across cultures: an IRT analysis of standard depression symptoms using data from eight countries. Soc Psychiatry Psychiatr Epidemiol. 2016;51:981–91.
14. Wardenaar KJ, Wanders RBK, Jeronimus BF, de Jonge P. The Psychometric Properties of an Internet-Administered Version of the Depression Anxiety and Stress Scales (DASS) in a Sample of Dutch Adults. J Psychopathol Behav Assess. 2018;40:318–33.
15. Pilkonis PA, Yu L, Dodds NE, Johnston KL, Maihoefer CC, Lawrence SM. Validation of the depression item bank from the Patient-Reported Outcomes Measurement Information System (PROMIS) in a three-month observational study. J Psychiatr Res. 2014;56:112–9.
16. Snitz BE, Yu L, Crane PK, Chang CC, Hughes TF, Ganguli M. Subjective cognitive complaints of older adults at the population level: an item response theory analysis. Alzheimer Dis Assoc Disord. 2012;26:344–51.
17. Reise SP, Haviland MG. Item response theory and the measurement of clinical change. J Pers Assess. 2005;84:228–38.
18. Gorter R, Fox JP, Twisk JW. Why item response theory should be used for longitudinal questionnaire data analysis in medical research. BMC Med Res Methodol. 2015;15:55.
19. Saracino RM, Aytürk E, Cham H, Rosenfeld B, Feuerstahler LM, Nelson CJ. Are we accurately evaluating depression in patients with cancer? Psychol Assess. 2020;32:98–107.
20. Crane PK, Narasimhalu K, Gibbons LE, Mungas DM, Haneuse S, Larson EB, et al. Item response theory facilitated cocalibrating cognitive tests and reduced bias in estimated rates of decline. J Clin Epidemiol. 2008;61:1018–27.e9.
21. McNeish D, Wolf MG. Thinking twice about sum scores. Behav Res Methods. 2020;52:2287–305.
22. Widaman KF, Revelle W. Thinking thrice about sum scores, and then some more about measurement and analysis. Behav Res Methods. 2022. https://doi.org/10.3758/s13428-022-01849-w.
23. Cavanagh A, Wilson CJ, Caputi P, Kavanagh DJ. Symptom endorsement in men versus women with a diagnosis of depression: A differential item functioning approach. Int J Soc Psychiatry. 2016;62:549–59.
24. Bares C, Andrade F, Delva J, Grogan-Kaylor A, Kamata A. Differential item functioning due to gender between depression and anxiety items among Chilean adolescents. Int J Soc Psychiatry. 2012;58:386–92.
25. de Sá Junior AR, Liebel G, de Andrade AG, Andrade LH, Gorenstein C, Wang Y-P. Can Gender and Age Impact on Response Pattern of Depressive Symptoms Among College Students? A Differential Item Functioning Analysis. Front Psychiatry. 2019;10:50.
26. Arunpongpaisal S, Kongsuk T, Maneethorn N, Maneethorn B, Wannasawek K, Leejongpermpoon J, et al. Development and validity of two-question screening test for depressive disorders in Northeastern Thai community. Asian J Psychiatr. 2009;2:149–52.
27. American Psychiatric Association, Task Force on DSM-IV. Diagnostic and statistical manual of mental disorders: DSM-IV. Washington, DC: American Psychiatric Association; 1994.
28. Kittirattanapaiboon PK M. The validity of the Mini International Neuropsychiatric Interview (M. I. N. I.)-Thai Version. Manual for MINI (Thai version). 2004:13–21.
29. Edelen MO, Reeve BB. Applying item response theory (IRT) modeling to questionnaire development, evaluation, and refinement. Qual Life Res. 2007;16(Suppl 1):5–18.
30. Nering ML, Ostini R. Handbook of Polytomous Item Response Theory Models. Routledge; 2010.
31. Baker F, Kim S. Item Response Theory: Parameter Estimation Techniques. 2nd ed. New York: Dekker; 2004.
32. Johnson MS. Marginal Maximum Likelihood Estimation of Item Response Models in R. 2007. 2007;20:24.
33. Hu L, Bentler PM. Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Struct Equation Modeling: Multidisciplinary J. 1999;6:1–55.
34. Reeve BB, Hays RD, Bjorner JB, Cook KF, Crane PK, Teresi JA, et al. Psychometric evaluation and calibration of health-related quality of life item banks: plans for the Patient-Reported Outcomes Measurement Information System (PROMIS). Med Care. 2007;45:22–31.
35. Stochl J, Jones PB, Croudace TJ. Mokken scale analysis of mental health and well-being questionnaire item responses: a non-parametric IRT method in empirical research for applied health researchers. BMC Med Res Methodol. 2012;12:74.
36. McNeish D. Thanks coefficient alpha, we’ll take it from here. Psychol Methods. 2018;23:412–33.
37. Kang T, Cohen AS, Sung H-J. Model Selection Indices for Polytomous Items. Appl Psychol Meas. 2009;33:499–518.
38. Kondratek B. UIRT: Stata module to fit unidimensional Item Response Theory models. 2022.
39. Crane PK, van Belle G, Larson EB. Test bias in a cognitive test: differential item functioning in the CASI. Stat Med. 2004;23:241–56.
40. Crane PK, Gibbons LE, Jolley L, van Belle G. Differential item functioning analysis with ordinal logistic regression techniques. DIFdetect and difwithpar. Med Care. 2006;44:115–23.
41. Raykov T, Marcoulides G. A course in item response theory and modeling with Stata. College Station, Texas: Stata Press; 2018.
42. Las Hayas C, Bilbao A, Quintana JM, Garcia S, Lafuente I. A comparison of standard scoring versus Rasch scoring of the visual function index-14 in patients with cataracts. Invest Ophthalmol Vis Sci. 2011;52:4800–7.
43. Teymoori A, Real R, Gorbunova A, Haghish EF, Andelic N, Wilson L, et al. Measurement invariance of assessments of depression (PHQ-9) and anxiety (GAD-7) across sex, strata and linguistic backgrounds in a European-wide sample of patients after Traumatic Brain Injury. J Affect Disord. 2020;262:278–85.
44. Cameron IM, Crawford JR, Lawton K, Reid IC. Differential item functioning of the HADS and PHQ-9: an investigation of age, gender and educational background in a clinical UK primary care sample. J Affect Disord. 2013;147:262–8.
45. Huang FY, Chung H, Kroenke K, Delucchi KL, Spitzer RL. Using the Patient Health Questionnaire-9 to measure depression among racially and ethnically diverse primary care patients. J Gen Intern Med. 2006;21:547–52.
46. Patel R, Biros MH, Moore J, Miner JR. Gender differences in patient-described pain, stress, and anxiety among patients undergoing treatment for painful conditions in the emergency department. Acad Emerg Med. 2014;21:1478–84.
47. Udo T, Grilo CM, McKee SA. Gender differences in the impact of stressful life events on changes in body mass index. Prev Med. 2014;69:49–53.
48. González-Blanch C, Medrano LA, Muñoz-Navarro R, Ruíz-Rodríguez P, Moriana JA, Limonero JT, et al. Factor structure and measurement invariance across various demographic groups and over time for the PHQ-9 in primary care patients in Spain. PLoS ONE. 2018;13:e0193356.
49. Jiraniramai S, Wongpakaran T, Angkurawaranon C, Jiraporncharoen W, Wongpakaran N. Construct Validity and Differential Item Functioning of the PHQ-9 Among Health Care Workers: Rasch Analysis Approach. Neuropsychiatr Dis Treat. 2021;17:1035–45.
50. Katzan IL, Lapin B, Griffith S, Jehi L, Fernandez H, Pioro E, et al. Somatic symptoms have negligible impact on Patient Health Questionnaire-9 depression scale scores in neurological patients. Eur J Neurol. 2021;28:1812–9.

Acknowledgements

We would like to thank the physicians, nurses, medical staff, and all participants who were involved in this study.

Funding

The primary study on the validity of the 9Q among the northern Thai population was supported by the Department of Mental Health, Ministry of Public Health. This study was partially supported by Chiang Mai University.

Author information

Authors and Affiliations

Department of Statistics, Faculty of Science, Chiang Mai University, 239 Huaykaew Road, Suthep, Muang, 50200, Chiang Mai, Thailand
Suttipong Kawilapat, Sukon Prasitwattanaseree & Patrinee Traisathit
Department of Psychiatry, Faculty of Medicine, Chiang Mai University, Chiang Mai, Thailand
Suttipong Kawilapat, Benchalak Maneeton & Narong Maneeton
Prasrimahabhodi Psychiatric Hospital, Ubon Ratchathani, Thailand
Thoranin Kongsuk, Jintana Leejongpermpoon & Supattra Sukhawaha
Somdet Chaopraya Institute of Psychiatry, Bangkok, Thailand
Thoranin Kongsuk
Department of Psychiatry, Faculty of Medicine, Khon Kaen University, Khon Kaen, Thailand
Suwanna Arunpongpaisal
Research Center in Bioresources for Agriculture, Industry and Medicine, Chiang Mai University, Chiang Mai, Thailand
Patrinee Traisathit
Department of Statistics, Faculty of Science, Data Science Research Center, Chiang Mai University, Chiang Mai, Thailand
Patrinee Traisathit

Contributions

SK primarily contributed to the literature search and study design, performed the data analyses, and wrote the first draft of the manuscript. BM contributed to the study design, coordinated the operations, collected the data, and contributed to the literature search and review of the manuscript. NM contributed to the study design, coordinated the operations, collected the data, and contributed to the literature search and review of the manuscript. SP contributed to the study design, literature search, and review of the manuscript. TK contributed to the study design, coordinated the operations, collected the data, and reviewed the manuscript. SA contributed to the literature search, coordinated the operations, collected the data, and reviewed the manuscript. JL contributed to the literature search, coordinated the operations, collected the data, and reviewed the manuscript. SS contributed to the literature search, coordinated the operations, collected the data, and reviewed the manuscript. PT contributed to the study design and literature search and wrote the first draft of the manuscript. All authors contributed to the final version of the manuscript. The authors read and approved the final manuscript.

Corresponding author

Correspondence to Patrinee Traisathit.

Ethics declarations

Ethics approval and consent to participate

This study used de-identified data from the primary study, which was approved by the Institutional Ethical Committee of Prasrimahabhodi Psychiatric Hospital, Ubon Ratchathani, Thailand (COA No. 002/2560). All methods were carried out in accordance with relevant guidelines and regulations. All participants provided informed consent prior to participating in the study. Participants could discontinue participation at any time without penalty or loss of benefits to which they were otherwise entitled. Participants who were diagnosed with major depressive disorder were referred for standard care at the hospitals.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1:

Supplementary Table 1. Confirmatory factor analysis indices (N = 1,475). Supplementary Table 2. Raw scores and examples of scoring by using various major depressive disorder assessment methods.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Kawilapat, S., Maneeton, B., Maneeton, N. et al. Comparison of unweighted and item response theory-based weighted sum scoring for the Nine-Questions Depression-Rating Scale in the Northern Thai Dialect. BMC Med Res Methodol 22, 268 (2022). https://doi.org/10.1186/s12874-022-01744-0

Received: 11 December 2021
Accepted: 29 September 2022
Published: 12 October 2022
DOI: https://doi.org/10.1186/s12874-022-01744-0
