Systematic review of the psychometric properties of instruments to measure sexual desire

Background Sexual desire is one of the domains of sexual function with multiple dimensions, which commonly affects men and women around the world. Its assessment has classically relied on self-report tools; however, the evidence level and validity of these questionnaires remain a concern. Therefore, a systematic review addressing the available questionnaires is relevant, since it can show their psychometric properties and evidence levels. Method A systematic review was carried out in the PubMed, EMBASE, PsycINFO, Science Direct, and Web of Science databases. The search strategy was developed according to the research question and a combination of descriptors and keywords, and included original studies with no limit of publication date, published in Portuguese, English, or Spanish. Two reviewers independently carried out the selection of articles by abstracts and full texts as well as the analysis of the studies. The methodological quality of the instruments was evaluated by the COnsensus-based Standards for the selection of health status Measurement INstruments (COSMIN) checklist. Results The search resulted in 1203 articles, of which 15 were included in the review. Ten instruments originally developed in the English language were identified. Unsatisfactory results on methodological quality were evidenced in cultural adaptation studies, which lacked a description of the steps of the process and used inadequate techniques and model adequacy parameters. Principal Component Analysis with Varimax rotation predominated in the studies. Conclusions The limitation of the techniques applied in the validation process of the reviewed instruments was evident. A limitation was also observed in the number of adaptations conducted and in the contexts to which the instruments were applied, making it impossible to reach a better understanding of the functioning of the instruments.
In future studies, the use of robust techniques can ensure the quality of the psychometric properties and the accuracy and stability of instruments. A detailed description of procedures and results in validation studies may facilitate the selection and use of instruments in the academic and/or clinical settings. Systematic review registration PROSPERO CRD42018085706. Electronic supplementary material The online version of this article (10.1186/s12874-018-0570-2) contains supplementary material, which is available to authorized users.


Background
Sexual desire has been defined as the force that stimulates or inhibits sexual behavior [1], as well as interest in sexual activity [2], and the literature also recognises it as one of the phases of the sexual response cycle [3] and one of the domains of sexual function [4]. Beyond this, sexual desire may be understood in terms of its biological, psychological and social components [1].
According to the literature, there are three self-report models for evaluating sexual behavior: evaluation by interview, questionnaires, and behavioural records filled in by the client or subject [5,6]. Self-report is the most widely applied approach for measuring sexual desire and functioning.
Systematic reviews on psychometric properties for sexual (dys)function have been carried out to identify available measurement instruments [7,8]; however, those addressing sexual desire are still limited.
Regarding sexual desire and functioning, different questionnaires have been applied to diagnose hypoactive sexual desire disorders as well as to estimate their real prevalence, associated factors, and magnitude [9,10].
Therefore, the use of instruments to evaluate sexual desire, among other dimensions of sexual life, can be helpful in the planning of multidisciplinary interventions aimed at helping individuals that present with this disorder. Thus, assessing the quality of instruments that measure sexual desire is an essential step in revealing the positive and negative aspects of measurements and providing evidence-based guidance for the selection of validated instruments for the academic and/or clinical contexts.
The COSMIN Checklist is a tool that is increasingly used in systematic analyses to evaluate the measurement properties of instruments [11,12]. This tool was chosen in the present systematic review with the objective of evaluating the methodological quality of the psychometric properties and the level of evidence of the selected instruments that measure sexual desire.

Method
A systematic review of measurement instruments was conducted according to the ten steps of the COSMIN protocol [13]: 1) formulation of the research question; 2) literature search; 3) selection criteria; 4) selection of articles by abstracts and full texts; 5) evaluation of the methodological quality of included studies; 6) extraction of the data; 7) content comparison; 8) data synthesis and evaluation of instrument quality; 9) general conclusion of the systematic review; and 10) preparation of the report on the psychometric properties of the evaluated instruments.

Search strategy
The search was carried out in the PubMed, EMBASE, PsycINFO, Science Direct, and Web of Science databases. The following search strategy was performed in PubMed using MeSH terms in combination with the following keywords: (libido) AND (psychometrics) AND (cross cultural-comparison) OR (cross-cultural AND comparison) OR (cross AND cultural AND comparison) AND (sexual desire) OR (sexual AND desire) OR (sexual AND interest) OR (sexual interest). This search strategy was adapted to the other databases. See Additional file 1 for the complete search strategy. All citations were imported into the bibliographic database of EndNote Basic.

Selection criteria
The inclusion criteria were: original studies published in Portuguese, English, or Spanish, conducted with human beings, and presenting the process of evaluating cultural validations and adaptations of sexual desire instruments, regardless of sample sex or gender. There was no limitation on the initial date of publication, and studies published until November 2017 were considered. In addition, it was determined that articles presenting the dimension of sexual desire or the condition of its decrease (hypoactive sexual desire disorder) would also be included. Articles that aimed to measure dysfunctions in other dimensions of the sexual response in men and/or women, and studies with paediatric samples, were excluded.

Selection of articles by abstracts and full texts
The selection of articles by abstracts and full texts was performed independently by two reviewers (DC and MF), according to the selection criteria. All studies retrieved were imported into the bibliographic database of EndNote Basic. Then, the references were exported to Microsoft Excel, version 2016. In case of disagreement in the selection of the studies, two other reviewers (LC and RA) were consulted.

Data extraction and synthesis
Data extraction from the potentially eligible literature was performed independently by two reviewers (DC and MF), who extracted the following data: author, year of publication, country, title of the study, source, inclusion criteria, exclusion criteria, items, average fill time, population and sample size (n), and types of psychometric properties tested. [See Additional file 2].
The reviewers identified 1190 articles. Another 13 articles were captured through a manual search of references reported in the articles identified first, totalling 1203 articles; of these, 826 were duplicates and not included in the study. The titles and abstracts of 66 studies were analysed by two independent reviewers (DC and MF); in case of disagreement in the selection of the studies, two other reviewers (LC and RA) were consulted. In the end, 66 articles were considered adequate for inclusion in the study. The inter-observer agreement was measured by the Kappa test, with a score of 0.84. Subsequently, the 66 articles were analysed in their entirety and separately by two reviewers (DC and MF). A total of 45 articles were excluded based on the following reasons: they measured sexual desire together with other dimensions of sexual function (n = 23) or measured other constructs (n = 22). Therefore, 21 articles were included in the study, with a total of 10 instruments that measured sexual desire. The search and selection process is presented in Fig. 1 using the PRISMA flowchart [14].
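As a hedged illustration of how an inter-observer agreement statistic such as the Kappa reported above can be computed, the following Python sketch implements Cohen's kappa for two raters. The function name `cohen_kappa` and the include/exclude decisions are hypothetical; they are not the screening data of this review and do not reproduce the reported value of 0.84.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who classified the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: proportion of items with identical decisions.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, computed from each rater's marginal counts.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(count_a) | set(count_b)
    p_chance = sum(count_a[c] * count_b[c] for c in categories) / n**2
    if p_chance == 1.0:
        # Both raters used a single identical category: kappa is undefined.
        return float("nan")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical screening decisions for ten abstracts (illustration only).
reviewer_1 = ["include"] * 4 + ["exclude"] * 6
reviewer_2 = ["include"] * 3 + ["exclude"] * 7
kappa = cohen_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the raters agree on nine of ten abstracts, yet kappa is lower than 0.90 because part of that agreement is expected by chance.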

Evaluation of methodological quality
Two reviewers (DC and MF) independently applied the COSMIN Checklist [12] to evaluate the methodological quality of the psychometric properties reported in the included studies. Disagreements between the two reviewers were resolved with the participation of a third reviewer who is an expert in psychometrics (FR). The COSMIN Checklist was developed through an international Delphi study [15] in order to facilitate the methodological evaluation of outcome measures for the proper choice of an instrument. This checklist includes nine evaluation parameters: internal consistency, reliability, measurement error, content validity, construct validity, hypothesis testing, cultural validity, criterion validity, and responsiveness.
The quality of psychometric properties was evaluated by a number of items, including design and preferred statistical methods requirements. A four-point rating scale (poor, fair, good, and excellent) was used for the evaluation depending on the information reported by the study authors. A total score was determined according to the lowest item ranking for each measurement property [12].
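The rule that the total score of a measurement property equals its lowest item rating can be sketched as follows. This is a minimal illustration in Python, assuming the four-point scale named above; the function name `property_score` and the example ratings are hypothetical and do not come from the included studies.

```python
# Four-point COSMIN rating scale, ordered from worst to best.
RATING_ORDER = ["poor", "fair", "good", "excellent"]


def property_score(item_ratings):
    """Return the lowest rating among the items of one measurement property."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    return min(item_ratings, key=RATING_ORDER.index)


# Hypothetical item ratings for two measurement properties of one study.
example_ratings = {
    "internal consistency": ["excellent", "good", "fair", "good"],
    "reliability": ["good", "good", "excellent"],
}
scores = {prop: property_score(r) for prop, r in example_ratings.items()}
print(scores)
```

A single "fair" item is enough to cap the whole property at "fair", which is why incomplete reporting weighs so heavily on the final rating.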

Synthesis and levels of evidence
After the evaluation by the COSMIN Checklist, the results were combined by instrument to determine the level of evidence of the analysed studies according to the methodological quality criteria of the studies [16] and classified according to the criteria proposed by the Cochrane Back Review Group [17] as: strong (consistent positive results from multiple studies with good methodological quality or one study with excellent methodological quality), moderate (consistent positive results from multiple studies with fair methodological quality or one study with good methodological quality), limited (positive results from a study with fair methodological quality), conflicting (conflicting results from individual studies), or unknown (results from studies with poor methodological quality with an unknown level of evidence).
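The classification above can be read as a small decision rule. The Python sketch below is one possible encoding, not the procedure used by the authors: the function name `evidence_level`, the "+"/"-" coding of results, the decision to ignore studies of poor quality when better ones exist, and the handling of mixed quality levels are all assumptions made for illustration. The source defines the levels for positive results only, so the sketch checks only that the results agree with each other.

```python
QUALITY_RANK = {"poor": 0, "fair": 1, "good": 2, "excellent": 3}


def evidence_level(studies):
    """Classify the level of evidence for one measurement property.

    `studies` is a list of (quality, result) tuples, where quality is a
    COSMIN rating and result is "+" or "-" (hypothetical coding).
    """
    # Assumption: studies of poor quality do not count toward the level.
    usable = [(q, r) for q, r in studies if QUALITY_RANK[q] > 0]
    if not usable:
        return "unknown"
    if len({result for _, result in usable}) > 1:
        return "conflicting"
    ranks = [QUALITY_RANK[q] for q, _ in usable]
    n_good_or_better = sum(rank >= 2 for rank in ranks)
    if max(ranks) == 3 or n_good_or_better >= 2:
        return "strong"
    if n_good_or_better == 1 or len(ranks) >= 2:
        return "moderate"
    return "limited"


# Hypothetical combinations of studies (illustration only).
print(evidence_level([("good", "+"), ("good", "+")]))
print(evidence_level([("fair", "+")]))
print(evidence_level([("good", "+"), ("fair", "-")]))
```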

Results
Out of a total of 1203 articles identified, 21 were included in the review, in which 10 instruments were identified. The search and selection processes are presented in Fig. 1, using the PRISMA flowchart [14].
All the studies included in the systematic review were documented as supplemental references and are identified in the text with the prefix 's', followed by the respective reference number. [See Additional file 3].
The characteristics of the included studies are presented below (Table 1), together with the results of the COSMIN evaluation (Table 2) and the evidence levels (Table 3) of the instruments.
The cultural adaptation process presented limitations related to the insufficient description of this process according to the COSMIN criteria [25][26][27]. Only two studies evaluated inter-rater and/or intra-rater agreement as an analytical technique for content validity [22,28].
According to the parameters in the COSMIN checklist, these limitations affected the methodological quality of the identified instruments to measure sexual desire.

Dimensions and structure
The sample size in psychometric studies is usually determined from the number of items in the instrument. A total of 10 participants per item has been considered sufficient to guarantee the quality of analysis, except for instruments with fewer than 10 items [29,30]. There is evidence that 20 or more participants per item can significantly reduce error and inaccuracies in the solution of psychometric models, as measured by indicators such as the percentage of samples with the correct factor structure, the average number of items misclassified in the wrong factor, the mean error in eigenvalues, the mean error in factor loadings, the percentage of analyses that do not converge after 250 iterations, and the percentage with Heywood cases [31].
It is likely that instruments with a good fit, but tested with small samples, show instability in measurement and lose their accuracy in other populations and scenarios, especially in studies with fewer than 300 participants [29].
A limited number of participants requires that the initial minimum parameters of adequacy, such as factor loadings, communalities, and goodness-of-fit indexes, be set higher than in studies with larger samples. This aims at providing greater confidence in the quality of the instrument [30,32], given the increased imprecision of techniques with small samples.
In 14 of the 31 analysed articles, the ratio of participants to instrument items was greater than 20:1. However, no study reported how the sample size was determined or whether this design also guided the establishment of the model's minimum parameters. This result corroborates the results of another review in which only 43% of the analyzed articles had information on the sample size of the study [33].
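The sample size heuristics discussed in this section (10 and 20 participants per item, and a floor of 300 participants) reduce to simple arithmetic. The sketch below, with the hypothetical helper `sample_adequacy` and a made-up instrument of 14 items answered by 250 participants, only shows how such a check could be reported; the thresholds are taken from the text, not from any software default.

```python
def sample_adequacy(n_participants, n_items):
    """Summarise a validation sample against the heuristics cited in the text."""
    if n_participants <= 0 or n_items <= 0:
        raise ValueError("counts must be positive")
    ratio = n_participants / n_items
    return {
        "participants_per_item": ratio,
        "meets_10_to_1": ratio >= 10,
        "meets_20_to_1": ratio >= 20,
        "at_least_300_participants": n_participants >= 300,
    }


# Hypothetical study: 14-item instrument answered by 250 participants.
report = sample_adequacy(n_participants=250, n_items=14)
for criterion, value in report.items():
    print(criterion, value)
```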

Psychometric properties of instruments
Among the instruments assessed, principal component analysis (PCA) was the dominant technique used for construct validity. Of a total of 15 studies, 8 analyzed the data using PCA. It is a data reduction method [30,31] that assumes that all items make up the model and, therefore, is not able to explore factors or produce results about the latent variable [34,35]. Thus, PCA would not represent a real factor analysis technique [36]; in addition, it overestimates the explained variance by 16.4% [31] and generates overestimated factor loadings and communalities [37,38].
Even in situations where the factors do not correlate and communalities are moderate, the component variance values tend to be high [38]. Other authors [33,39] add that studies have systematically shown that PCA is less accurate than factor analysis, especially when the factor loadings are low or close to 0.40 and there are few items per factor/dimension.
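The overestimation of loadings by PCA can be illustrated numerically. The sketch below is an assumption-laden example, not an analysis of the reviewed instruments: it builds a population correlation matrix for six hypothetical items that all load 0.50 on a single factor, then compares the first principal component loadings with loadings from iterated principal axis factoring (a common-factor method chosen here for illustration; the text refers to factor analysis only in general terms). The function names are hypothetical.

```python
import numpy as np


def first_component_loadings(corr):
    """Loadings of the first principal component of a correlation matrix."""
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    vec = eigvecs[:, -1]
    vec = vec if vec.sum() >= 0 else -vec  # fix the arbitrary sign
    return vec * np.sqrt(eigvals[-1])


def principal_axis_loadings(corr, n_iter=100):
    """One-factor loadings from iterated principal axis factoring."""
    # Initial communalities: squared multiple correlations.
    communalities = 1 - 1 / np.diag(np.linalg.inv(corr))
    for _ in range(n_iter):
        reduced = corr.copy()
        np.fill_diagonal(reduced, communalities)  # replace the unit diagonal
        eigvals, eigvecs = np.linalg.eigh(reduced)
        vec = eigvecs[:, -1]
        vec = vec if vec.sum() >= 0 else -vec
        loadings = vec * np.sqrt(max(eigvals[-1], 0.0))
        communalities = loadings**2
    return loadings


# Hypothetical population: six items, each with a true loading of 0.50.
true_loadings = np.full(6, 0.5)
corr = np.outer(true_loadings, true_loadings)
np.fill_diagonal(corr, 1.0)

pca_loadings = first_component_loadings(corr)
paf_loadings = principal_axis_loadings(corr)
print("PCA:", pca_loadings.round(3))
print("PAF:", paf_loadings.round(3))
```

Because PCA analyses the full unit diagonal, it folds unique variance into the component, so its loadings exceed the true value of 0.50 that the common-factor method recovers in this idealised case.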
PCA became common in past decades, when computers were slow and expensive, because it was a fast and cheap alternative to factor analysis [37]. Although the literature has pointed out the limitations and restrictions in the use of PCA, with or without the (orthogonal) Varimax rotation, the technique has remained dominant in validation studies [33,36,39,40]. The use of PCA with the Varimax rotation in validation processes has been considered, at the least, a contradictory combination. PCA assumes that all items make up the model without effectively testing this hypothesis. It assumes, a priori, that the items correlate because they measure the same latent variable, particularly in psychosocial models. Conversely, the Varimax rotation assumes that the items maintain independence between them, and this combination with PCA may increase imprecision in the model. Thus, non-orthogonal (oblique) rotations seem to be more adequate for latent psychosocial variables [41].
The studies that conducted exploratory factor analysis used the eigenvalue as the criterion for defining the number of factors (dimensions). This configuration is consistent with the observation [42] that PCA with the eigenvalue-above-1 criterion and the Varimax rotation became popular because it yielded significant results for several classical datasets [43].
Several of the studies showed explained variance below 60% [29,30,32], which indicates the low capacity of the instrument to measure the latent variable. This point is made even more relevant by the predominant use of PCA, which tends to overestimate indicators; even so, the levels of explained variance were not satisfactory.
None of the studies used more robust techniques such as Parallel Analysis [44,45], considered one of the most accurate and robust techniques for this purpose [33,36,39,46,47]. Its disuse may be explained by the absence of this technique in most commercial software.
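Since the absence of Parallel Analysis from commercial software is raised as a possible barrier, the following Python sketch shows that a basic version needs only a few lines. It is a simplified variant (eigenvalues of the full correlation matrix compared with the 95th percentile of eigenvalues from random normal data), not a reference implementation; the function name `parallel_analysis`, its defaults, and the simulated one-factor data set are assumptions for illustration.

```python
import numpy as np


def parallel_analysis(data, n_sims=200, percentile=95, seed=0):
    """Number of factors whose eigenvalues exceed those of random data."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    simulated = np.empty((n_sims, p))
    for i in range(n_sims):
        noise = rng.standard_normal((n, p))  # uncorrelated data, same shape
        noise_corr = np.corrcoef(noise, rowvar=False)
        simulated[i] = np.sort(np.linalg.eigvalsh(noise_corr))[::-1]
    threshold = np.percentile(simulated, percentile, axis=0)
    n_factors = 0
    for obs, thr in zip(observed, threshold):
        if obs <= thr:  # stop at the first eigenvalue not above chance
            break
        n_factors += 1
    return n_factors, observed, threshold


# Hypothetical data: 300 respondents, 8 items driven by one common factor.
rng = np.random.default_rng(42)
n, p, loading = 300, 8, 0.7
factor = rng.standard_normal((n, 1))
unique = rng.standard_normal((n, p))
data = loading * factor + np.sqrt(1 - loading**2) * unique

n_factors, observed, threshold = parallel_analysis(data)
print("factors retained:", n_factors)
```

The retention rule compares each observed eigenvalue with what chance alone would produce for the same sample size and number of items, rather than with the fixed cut-off of 1.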
Another fundamental aspect not addressed in the reviewed studies was the testing of the data distribution and its normality in order to select the most adequate statistical technique. In contemporary psychometrics, this analysis is essential for the quality of the adequacy of psychometric models. All articles used factor techniques based on Pearson's correlation, which is a parametric technique. It should be noted that the distribution of data is rarely normal in psychosocial studies. Thus, the contemporary recommendation is the adoption of the polychoric correlation when normality is violated [48,49]. Factor solutions based on the polychoric correlation have reproduced the measurement model more accurately [50,51]. All studies used Cronbach's alpha as a measure of reliability, and the obtained values were considered acceptable. This coefficient depends on the magnitude of the correlation between items and on the number of items in the instrument [52].
There is extensive literature criticizing the use of alpha without considering the nature and distribution of the data and sample size, mainly in samples with more than 1000 participants [53,54]. The study by Revelle and Zinbarg (2009) compared 13 reliability indicators and concluded that, in many cases, Cronbach's alpha was not indicated. The use of McDonald's Omega and the Greatest Lower Bound (GLB) is preferable when there is data asymmetry, even in small samples [55]. High alpha values do not necessarily mean higher reliability and quality of scales or tests, because they can result from long scales with parallel and redundant items or can narrow the construct being studied [56]; one should not seek alpha values above 0.90 [52]. Alpha has usually been used as a measure of internal consistency rather than of reliability; however, it is easy to show that alpha is not a measure of internal consistency [53]. An even more severe problem is the use of alpha to remove items, because it is not a technique developed for this purpose.
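The dependence of alpha on the number of items can be made concrete with a short sketch. The Python code below computes Cronbach's alpha from a respondents-by-items matrix and, separately, the standardized alpha implied by a given number of items and mean inter-item correlation. The function names, the small response matrix, and the mean correlation of 0.30 are hypothetical values chosen for illustration.

```python
import numpy as np


def cronbach_alpha(scores):
    """Cronbach's alpha for a (respondents x items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)


def standardized_alpha(n_items, mean_inter_item_r):
    """Standardized alpha from item count and mean inter-item correlation."""
    return n_items * mean_inter_item_r / (1 + (n_items - 1) * mean_inter_item_r)


# Hypothetical responses of five people to three Likert items.
responses = np.array([
    [4, 5, 4],
    [2, 3, 2],
    [3, 3, 4],
    [5, 4, 5],
    [1, 2, 2],
])
print("alpha (hypothetical data):", round(cronbach_alpha(responses), 3))

# Same mean inter-item correlation (0.30), increasing scale length.
for k in (5, 10, 20):
    # 5 items: about 0.68; 10 items: about 0.81; 20 items: about 0.90
    print(k, round(standardized_alpha(k, 0.30), 2))
```

With the inter-item correlation held at a modest 0.30, lengthening the scale from 5 to 20 items alone moves alpha from about 0.68 to about 0.90, which is the sense in which a high alpha may reflect scale length rather than quality.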
Reliability was evaluated through test-retest using Pearson's correlation in 11 of the 31 studies analysed. The authors of the 11 studies described the test-retest procedure in detail, reporting the sample used, the number of measurements, and the mean time of instrument use [19,25,57]. This procedure is recommended by several psychometricians [30,41,52].
However, the use of Pearson's correlation for test-retest has been questioned, because it has been deemed inadequate as it does not consider systematic differences and, therefore, the systematic error in the measurements [29,52]. Despite this, Pearson's correlation predominated in the evaluation of test-retest, without any testing of data normality.
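The insensitivity of Pearson's correlation to systematic differences can be shown with a small numerical example. The sketch below compares it with an intraclass correlation coefficient for absolute agreement (two-way model, single measurement); the ICC is not named in the text and is used here only as one example of an index that penalises systematic shifts. The function name and the scores, in which every retest value is exactly 3 points higher than the first measurement, are hypothetical.

```python
import numpy as np


def icc_absolute_agreement(measurements):
    """ICC for absolute agreement, two-way model, single measurement."""
    x = np.asarray(measurements, dtype=float)
    n, k = x.shape  # n subjects, k measurement occasions
    grand_mean = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand_mean) ** 2).sum()
    ss_cols = n * ((x.mean(axis=0) - grand_mean) ** 2).sum()
    ss_error = ((x - grand_mean) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical test-retest scores with a constant shift of 3 points.
test_scores = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
retest_scores = test_scores + 3.0

pearson_r = np.corrcoef(test_scores, retest_scores)[0, 1]
icc = icc_absolute_agreement(np.column_stack([test_scores, retest_scores]))
print(f"Pearson r = {pearson_r:.2f}, ICC = {icc:.2f}")
```

Pearson's correlation is perfect here because the ordering and spacing of scores are preserved, whereas the agreement coefficient drops to about 0.69 because every participant scored higher on the retest.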
Another relevant point is the use of test-retest before construct validation. It is probable that, after the test-retest, items are discarded when more adequate and robust techniques are applied, because they do not load on and/or do not conform to the model. Thus, one would have attested reliability and pointed to a reliable instrument before showing evidence that the instrument actually measures the latent variable that it is proposed to measure. One would be attesting to the reliability of an instrument that differs from the final version, especially considering the note by DeVellis 2017 [58] that the loss of 50% of the items is expected during the validation process of an instrument. Moreover, Berchtold 2016 [59] questions the use of the term reliability for test-retest, reinforcing that Pearson's correlation is a measure of association and not of reliability.
Another way of clarifying the reliability of an instrument and the possibility of assuring its quality in different contexts is through invariance testing. This was not evidenced in the analyzed studies.
Invariance is an important aspect in the development of a test, especially when it is used in heterogeneous populations [60]. The assumptions of invariance address the following points: a) the factor structure of the instrument is the same in different groups; b) the items that make up one factor and the instrument have the same importance for different groups; c) scores of one group can be compared to other groups; d) the items present similar measurement errors for different groups; e) the level of variance between factors differs between groups; and f) the covariance between factors is the same between groups [47]. Temporal invariance, which must be investigated with longitudinal designs, is rarely investigated [61]. It would be advisable to test other measurement properties for instrument revalidation to assess whether the original instrument construct remains adequately represented over time.
The present review identified different instruments published to measure sexual desire; however, it revealed several weaknesses in the available instruments. According to COSMIN parameters and criteria of evidence, few were submitted to validation procedures with satisfactory results.
Most of the instruments for measuring sexual desire evaluated in this review were not used in other contexts or by authors other than those of the original version. Thus, in the validation process of an instrument, it is fundamental to evaluate its reliability outside its original development context. In general, the lack of a description of the process of the cultural adaptation of instruments may hinder their evaluation and selection in future studies.
Regarding the sample size and structure of the analyzed instruments, most of the studies considered a sample based on the ratio of 20:1 and, therefore, reduced imprecision errors in the psychometric models. Testing the normality of the data distribution is fundamental for choosing between parametric and non-parametric analysis techniques. The most tested properties in the analyzed studies were: construct validity, analyzed predominantly by means of PCA; internal consistency, evaluated by Cronbach's alpha coefficient; and reliability, analyzed with test-retest through Pearson's correlation.
The availability of validated instruments is paramount, because their application can contribute to the evaluation of sexual health in the population and qualification of the care provided. Conversely, the lack of valid instruments restricts or mitigates the ability to assess sexual desire in individuals, which can result in non-ideal health care.

Limitations
The databases chosen for conducting this review are comprehensive; however, other databases and gray literature may be incorporated into future reviews.
The results of this review need to be interpreted with caution, because it cannot always be assumed that methodological quality procedures contemplated in the COSMIN checklist but not reported in a study were not performed by its authors.

Conclusion
The present systematic review evaluated the methodological quality of the psychometric properties and the level of evidence of instruments that measure sexual desire, published in current databases. A detailed analysis of each study's procedures and indicators leads us to the following conclusions.
The analysis predominantly showed a lack of detail in the methodological procedures, such as limited information on the cultural adaptation process according to the COSMIN criteria and restricted use of analysis techniques for content validation (inter-rater and intra-rater). These problems extended to cultural adaptation studies.
Limiting aspects in the validation processes of instruments were observed, which have been recurrently reported in the literature. The reasons for the sizing of study samples were rarely identified. Likewise, in construct validation, the studies did not describe testing of the normality of the data distribution; the reasons for choosing the extraction, retention, and rotation techniques; or the establishment, in the method, of the minimum indicators required for the adequacy of the model. Reliability was limited to the application of Cronbach's alpha, even though there were indications of the instability of indexes due to the number of participants, the number of items, and the distribution of the data.
Only one study applied invariance techniques to ensure that the instrument maintained its properties when used with different populations, contexts, and cultures (sex, race, educational level, and religion among others), especially when it is known that sexual desire may suffer strong interference from moral, social, and religious issues.
Considering that some of the selected instruments were developed in the 1970s and that most of the others were developed more than 10 years ago, we observed that none have been followed up with studies revisiting the psychometric properties of the original instrument in order to adapt and update the content of the instrument's items in the light of contemporary social and cultural changes. This lack of updates can generate biases in prevalence estimates and in answers, because the instrument fails to capture these sociocultural changes.
The limitations found suggest that most of the instruments analysed in these studies require the application of more robust and contemporary techniques as well as improved detailing of the steps and procedures applied, which would ensure their accuracy and stability, and consequently, their application in the academic and/or clinical settings.