In this study, we investigated how validation studies of PROs have been performed in the PC setting, particularly with regard to test-retest reliability. In general, the methodological quality of this psychometric property ranged from poor to fair; according to the COSMIN checklist, only 12.3% of the studies were considered of good quality, and none were considered of excellent quality. In addition, we highlighted the importance of verifying the clinical stability of advanced cancer patients before performing the retest. Based on our results, clinical stability is even more important for test-retest reliability than an accurate definition of the time interval at which the retest is performed.
In our review, we identified 89 validation studies that included cancer symptoms and/or HRQoL as outcome variables. Of those, only 31 (34.8%) evaluated test-retest reliability. As test-retest reliability is an essential psychometric property to be measured in validation studies, we hypothesize that researchers are not systematically measuring it because of the clinical instability of advanced cancer patients. Overall, half of the evaluated test-retest reliability scores were classified as inadequate when 0.70 was used as the threshold value. The pressure experienced by researchers to publish positive results may also explain why only 34.8% of validation studies measured test-retest reliability. Furthermore, it is possible that inadequate test-retest values were omitted from some publications.
It is essential to estimate the sample size accurately before beginning a study. An insufficient sample size may fail to detect true differences, leading to unreliable results. Conversely, an excessive sample size may produce unnecessary financial losses and ethical concerns regarding the futile exposure of study participants. With regard to test-retest reliability analysis, we observed that determining an adequate sample size is not common practice, as only 2 studies [28, 37] described performing a sample size calculation prior to the study. Overall, the median number of patients included in the test-retest analysis was 60, which represents 53.8% of the total number of included patients. One study justified its sample size by citing a rule of thumb suggesting that 50 patients would be sufficient for the analysis [43, 44].
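As an illustration of a precision-based alternative to rules of thumb, the number of subjects needed to estimate an ICC with a confidence interval of a desired width can be approximated with the formula attributed to Bonett (2002). The sketch below is only illustrative; the parameter values (an expected ICC of 0.80, two administrations, a total interval width of 0.20) are hypothetical assumptions, not values taken from the reviewed studies:

```python
import math

def icc_sample_size(expected_icc, k=2, ci_width=0.2, z=1.96):
    """Approximate number of subjects so that the confidence interval for
    the ICC has the desired total width (Bonett's approximation).

    expected_icc : anticipated ICC under the planned design (hypothetical here)
    k            : administrations per subject (2 for a single retest)
    ci_width     : desired total width of the confidence interval
    z            : standard normal quantile (1.96 for a 95% CI)
    """
    n = (8 * z**2 * (1 - expected_icc)**2 * (1 + (k - 1) * expected_icc)**2
         / (k * (k - 1) * ci_width**2)) + 1
    return math.ceil(n)

print(icc_sample_size(0.80))  # 51 subjects under these assumptions
```

Under these assumptions the result happens to land close to the 50-patient rule of thumb cited above; a lower expected ICC or a narrower interval would require substantially more patients, since the formula grows with (1 − ICC)² and with 1/width².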
A basic concept regarding test-retest reliability is the need to retest clinically stable patients. Retesting advanced cancer patients is challenging because they are in a dynamic phase of their disease in which symptoms and functionality are prone to decline quickly. Retesting a clinically unstable patient may incorrectly classify a PRO as an unreliable tool. Our results confirm the importance of verifying the clinical stability of patients before retesting. In addition, our review described the objective criteria used by some studies to define a stable condition.
The definition of an adequate time interval between assessments is of the utmost importance. An interval that is too short may allow respondents to recall their first answers, whereas one that is too long may allow a true change in the construct to occur [2, 45]. The appropriate time interval depends on the construct to be measured and the target population; however, approximately 2 weeks is often considered generally appropriate. Nevertheless, the time interval over which to retest advanced cancer patients under PC is still a matter of debate. Some authors have considered retesting advanced cancer patients at least 3 days apart a measure of responsiveness rather than a measure of test-retest reliability.
In fact, because of concerns about reassessing an unstable patient, some authors (n = 7) reapplied the questionnaires at very short intervals (i.e., less than 24 hours). Jim et al. investigated daily and intraday changes in fatigue, depression, sleep, and activity scores in a cohort of cancer patients undergoing chemotherapy and observed significant changes over time. Additionally, Dimsdale et al. investigated cancer-related fatigue every hour for 72 consecutive hours and observed a diurnal variation in fatigue. HRQoL, on the other hand, is a multidimensional construct that encompasses physical, psychological, social, and spiritual domains. In general, instruments that measure HRQoL use recall periods of 7 days. Although HRQoL is not commonly assessed on a daily basis, it is expected to remain stable over a few days, especially in the social, existential, and global domains. Consequently, we observed that multi-symptom instruments are generally retested within a shorter time frame than HRQoL instruments.
There was a trend toward shorter time intervals among scores with adequate test-retest reliability compared with scores with inadequate results (less than 0.70). One reason for the non-significant results might be the large interquartile range for some of the domains; because few studies were analyzed, there was insufficient statistical power for further conclusions. Three studies [18, 29, 40] evaluated retest reliability at 2 different time points (< 24 hours and 1 week after the first evaluation); in general, the shorter interval was associated with a better retest result. Considering the median time interval used in the studies with adequate test-retest results, together with the findings from studies that used two different retest intervals, we can recommend that patients under palliative care for advanced cancer be retested approximately 24 to 48 hours after the first administration when evaluating cancer symptoms and 2 to 7 days afterward when assessing HRQoL. However, we believe that the most important factor is not the interval itself but rather confirmation of clinical stability before retesting patients.
As mentioned previously, we concluded that the test-retest reliability analyses were of low quality according to the COSMIN checklist. Other studies using the same guidelines, albeit in different populations, have yielded similar results [50–52]. The most problematic item in our review was item 11 (“for continuous scores: was an intraclass correlation coefficient calculated?”), with 62% of studies classified as of poor to fair quality. The preferred test-retest reliability statistic depends on the type of response options. In our review, the majority of the studies evaluated continuous scores, for which the ICC is the preferred statistic [46, 47, 53]. Moreover, the use of correlation coefficients (Pearson’s and Spearman’s tests) is not adequate because they do not account for systematic error. In the present study, 18 of 29 studies evaluating continuous scores used a correlation analysis without evaluating agreement with the ICC. Six different versions of the ICC can be used depending on various assumptions, and 4 of those are subdivided into consistency or absolute agreement, yielding a total of 10 different ICC calculations. The choice of index has a substantial impact on the numerical value of the ICC. Even in the studies that correctly used the ICC, none stated the version of the ICC used.
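To illustrate why reporting the ICC version matters, the sketch below computes two common single-measure forms from the two-way ANOVA mean squares: consistency (Shrout and Fleiss ICC(3,1)) and absolute agreement (ICC(2,1)). The test-retest data are made up for illustration, not drawn from the reviewed studies; a systematic shift between test and retest leaves consistency untouched but lowers agreement:

```python
import numpy as np

def icc_consistency_agreement(scores):
    """Single-measure ICCs from an (n subjects x k occasions) score matrix.

    Returns (ICC(3,1) consistency, ICC(2,1) absolute agreement),
    computed from the two-way ANOVA mean squares (Shrout & Fleiss).
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    ss_subjects = k * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_occasions = n * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_error = ((scores - grand) ** 2).sum() - ss_subjects - ss_occasions
    msr = ss_subjects / (n - 1)            # between-subjects mean square
    msc = ss_occasions / (k - 1)           # between-occasions mean square
    mse = ss_error / ((n - 1) * (k - 1))   # residual mean square
    consistency = (msr - mse) / (msr + (k - 1) * mse)
    agreement = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    return consistency, agreement

# Hypothetical data: every patient scores exactly 5 points higher at retest.
test = np.arange(1.0, 11.0)                    # 10 patients
data = np.column_stack([test, test + 5.0])
cons, agree = icc_consistency_agreement(data)
print(round(cons, 3), round(agree, 3))  # 1.0 vs roughly 0.42
```

The same data thus yield an "adequate" coefficient under consistency and an "inadequate" one under absolute agreement against the 0.70 threshold, which is why stating the ICC version used, and whether it models consistency or absolute agreement, is essential.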
This study has some limitations. Because the studies evaluated test-retest reliability using different statistics, we could not perform a robust meta-analysis; therefore, we used 0.70 as the threshold for adequate test-retest reliability to perform a pooled data analysis. However, categorizing the test-retest results according to a predefined cut-off point may be considered an inadequate simplification. Another limitation is that we did not include in the systematic review instruments developed to assess a single symptom (e.g., fatigue or pain scales). In addition, we did not include meeting abstracts because extracting the necessary data from them would have been difficult.