Randomised controlled comparison of the Health Survey Short Form (SF-12) and the Graded Chronic Pain Scale (GCPS) in telephone interviews versus self-administered questionnaires. Are the results equivalent?



The most commonly used survey methods are self-administered questionnaires, telephone interviews, and a mixture of both. Until now, however, evidence from randomised controlled trials as to whether patient responses differ depending on the survey mode has been lacking. This study therefore assessed whether patient responses to surveys depend on the mode of survey administration. The comparison was between mailed, self-administered questionnaires and telephone interviews.


A four-armed, randomised controlled two-period change-over design. Each patient responded to the same survey twice, once in written form and once by telephone interview, separated by at least a fortnight. The study was conducted in 2003/2004 in Germany. 1087 patients taking part in the German Acupuncture Trials (GERAC cohort study), who agreed to participate in a survey after completing acupuncture treatment from an acupuncture-certified family physician for headache, were randomised. Of these, 823 (664 women) from the ages of 18 to 83 (mean 51.7) completed both parts of the study. The main outcome measure was the comparison of the scores on the 12-Item Short-Form Health Survey (SF-12) and the Graded Chronic Pain Scale (GCPS) questionnaire for the two survey modes.


Computer-aided telephone interviews (CATI) resulted in significantly fewer missing data (0.5%) than did mailed questionnaires (2.8%; p < 0.001). The analysis of equivalence revealed a difference between the survey modes only for the SF-12 mental scales. On average, reported mental status score was 3.5 score points (2.9 to 4.0) lower on the self-administered questionnaire compared to the telephone interview. The order of administration affected results. Patients who responded to the telephone interview first reported better mental health in the subsequent paper questionnaire (mean difference 2.8 score points) compared to those who responded to the paper questionnaire first (mean difference 4.1 score points).


Despite the comparatively high cost of telephone interviews, they offer clear advantages over mailed self-administered questionnaires as regards completeness of data. Only items concerning mental status were dependent on the survey mode and sequence of administration. Items on physical status were not affected. Normative data for standardized telephone questionnaires could contribute to a better comparability with the results of the corresponding standardized paper questionnaires.


The survey methods most commonly used in clinical trials are self-administered questionnaires (SAQ), telephone interviews (TI), and a mixture of both ("mixed-mode method") that consists of mailing the self-administered questionnaires and following up by telephoning non-respondents. The highest response rates are generally achieved either with telephone interviews or with the mixed-mode method, both of which tend to minimise complete drop-outs and missing values for individual items [15]. It is known that patients who do not respond to mailed questionnaires report, on average, greater dissatisfaction with treatment when contacted by telephone than do those who mail back their questionnaires [5, 6]. A recently published study comparing the telephone-administration mode of the SF-36 with the self-administered mode concluded that the telephone-administration mode is equivalent to and as valid as the self-administered mode [7].

By contrast, in designing the present study we hypothesized that patients respond differently to questions about psychological states than to those about physical symptoms. The latter will probably be answered more honestly, because physical problems are more socially accepted[8, 9]. For our study we therefore chose to compare patient responses to the 12-Item Short-Form Health Survey (SF-12) [10] and the Graded Chronic Pain Scale (GCPS) questionnaire[11] – two widely used survey instruments that collect data on both mental and physical aspects of pain disorders – in the telephone interview mode and the self-administration mode. A test-retest design was selected to examine whether the order of administration and/or the preliminary information of half the respondents had any effect on patient response behaviour in the comparison of SAQ and TI.

The test-retest design was chosen in order to test memory effects. It is conceivable, for example, that subjects would be better able to memorize their answers in one of the survey modes, thus resulting in greater similarity in responses between the first and second measurements. The point in time at which subjects are informed that they would be asked to answer a second questionnaire could affect results in a similar way. For example, subjects concerned about social acceptance and wishing to give very precise answers might, if told ahead of time that they would be asked to respond to more than one questionnaire, use memory aids such as making notes before the questionnaires were administered.



A four-armed, randomised controlled two-period change-over design (Figure 1) was used. Each patient responded to the same survey twice, once in written form (SAQ) and once by telephone interview (TI), separated by at least a fortnight. This obligatory minimum interval between administration of the two survey modes made bias due to recall of previous answers very unlikely. Patients were first randomly assigned to one of the two main groups A (TI first) or B (SAQ first), and then to one of two subgroups within each main group (A1 and A2 or B1 and B2). Groups A1 and B1 were informed ahead of time that a second survey would be administered, while groups A2 and B2 were not. All patients who participated in both TI and SAQ were included in the evaluation (Figure 1).

Figure 1. Study design.

The study was approved by the local ethics committee of the Ruhr-University Bochum.


The study was conducted in 2003/2004. Participants were drawn from our GERAC acupuncture trial and consisted of a random sample of cohort patients who had received acupuncture treatment from an acupuncture-certified physician, in many cases their family physician [12, 13]. The sample was selected based on the patient case report forms submitted by the treating physician, documenting the patient's demographic data, acupuncture course, and acupuncture indication. Primary eligibility criteria were age 18 years or older, acupuncture treatment for migraine and/or tension-type-headaches, and at least six acupuncture treatments received (most had had ten). Patients who were eligible and willing to participate were randomly assigned to the four groups (Figure 1). After randomisation patients were contacted by telephone to inform them about the study procedures, evaluate the correctness of acupuncture indication, assess whether patients had sufficient cognitive and linguistic capacities to participate, and schedule the TI or mailing of the SAQ.


The GCPS is a standard self-assessment instrument used in medical pain research and quality management that offers a means of hierarchically classifying chronic pain severity independent of the pain syndrome[11]. In this study the scores "pain intensity" and "pain-related disability" were analysed. The scores range from 0 to 100, with 100 being maximum pain intensity or disability. The SF-12 measures patients' physical and mental state of health on two separate scales[10]. SF-12 scores range from 0 to 100, with 100 being complete absence of impairment.

Data collection

All study procedures were guided by an Oracle®-based software developed especially for the GERAC cohort study. The system was used to manage interview appointments, conduct interviews using electronic case report forms (eCRFs), and transfer responses from paper questionnaires to the Oracle® data base. Data completeness was verified by a software routine that displayed warning notices for missing data.

Steps were taken to ensure that the interval between administration of the two questionnaires was at least two weeks. For groups A1 and A2 (TI first), the paper questionnaire was mailed 10 working days after the administration of the telephone interview (allowing several days for mail delivery), with instructions to complete the questionnaire immediately upon receipt. In groups B1 and B2 (SAQ first), participants received the paper questionnaire at an agreed-upon date within a six-week period. The second interview (TI) was conducted at least two weeks after the paper questionnaire had been returned by mail.

All telephone interviews including the first contact followed standardised interview guidelines, and used the same wording as the paper questionnaire items wherever possible. In the case of the SF-12, the interview version was used. Interviews were conducted on weekdays between 9:00 am and 7:00 pm. Interviewers received several days of pre-study training on study design and interview techniques and were supervised by psychologists at all times. The interviews were conducted by 20 students of Ruhr-University Bochum.


An ANOVA design was used to test for equivalence and differences between the two survey modes (TI or SAQ) in two different steps. Equivalence was examined using the confidence-interval inclusion rule [14]. Equivalence was assumed if the 90% confidence interval for the mean difference between the factor steps to be tested (the survey modes) was found to lie within 25% of one standard deviation, with interval limits based on statistical values of a standardised collective. For the SF-12, those values lie within a limit of ± 2.5 score points for both scales. Since no standardised data are available for the GCPS, we accessed a database of the German Society for the Study of Pain (DGSS) that stores data for at present 1465 patients suffering from migraine or chronic tension-type-headaches. The mean score for pain intensity in this collective was 73.70 (SD 17.42). For upper and lower limits of ± 0.25 standard deviation, the interval limit for assuming equivalence was rounded and defined as ± 5 score points. The differences for the conditions "survey-mode sequence" in the second step and "point in time when patients were informed of second survey" in the third step were tested using the F-test. Missing data were analysed by means of chi-squared tests. All analyses were performed using SPSS 12 for Windows.
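As a rough illustration of the confidence-interval inclusion rule described above, the following Python sketch derives an equivalence margin as a quarter of a reference standard deviation and checks whether the 90% confidence interval of paired mode differences lies inside that margin. The function names, the normal approximation to the interval (reasonable only for large samples), and the toy data are assumptions made for illustration; this is not the ANOVA-based analysis the study actually ran.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def equivalence_margin(reference_sd, fraction=0.25):
    """Margin as a fraction of a reference standard deviation."""
    return fraction * reference_sd


def ci_inclusion(differences, margin, level=0.90):
    """Return (lower, upper, equivalent) for paired differences.

    Uses a normal approximation for the confidence interval, which is an
    assumption made here to keep the sketch dependency-free.
    """
    n = len(differences)
    centre = mean(differences)
    std_error = stdev(differences) / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = centre - z * std_error
    upper = centre + z * std_error
    equivalent = -margin <= lower and upper <= margin
    return lower, upper, equivalent


# Margin from the DGSS pain-intensity SD quoted in the text (17.42).
gcps_margin = equivalence_margin(17.42)  # 4.355, which the study rounded to 5

# Hypothetical paired differences (SAQ minus TI), not study data.
toy_differences = [1, -1, 2, -2, 0, 1, -1, 0]
lower, upper, equivalent = ci_inclusion(toy_differences, margin=2.5)
```

Equivalence is declared only when both interval limits fall inside the margin, so an interval that merely overlaps the margin does not count as equivalent.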



Of the 2993 primarily eligible patients, 930 (31.1%) did not respond to the invitation or declined to participate without giving reasons, while 662 (22.1%) responded, but too late. Patients who declined to participate cited various personal reasons (242), health reasons (14), time reasons (31), or communication problems such as hearing loss or poor knowledge of German (15). 314 (10.5%) respondents who agreed to participate had to be excluded because they were treated for indications different from those stated on the CRF. Initially 1087 (36.3%) could be randomised (Figure 1), of whom 823 (76.2%) completed the whole study. 125 (11.5%) patients dropped out because they preferred not to continue, 122 (11.2%) were excluded because it was determined in the course of the interview that they had not in fact received acupuncture treatment for headache, and another 17 (1.6%) were excluded due to partial deafness or insufficient knowledge of German. The number of patients excluded at any time after randomisation was almost the same in all four groups [A1: 62 (18.2%), A2: 65 (16.9%), B1: 70 (19.9%), B2: 67 (18.2%), p = 0.92], and was independent of when patients were informed about the second survey (p = 0.87). The final sample consisted of 664 female and 159 male participants (Table 1), aged 18 to 83, with an average age of 51.7.

Table 1 Gender and indication of sample

Missing data

Only completely answered questionnaires were included in the final analysis. In the case of the SF-12, missing data occurred in a total of 120 data sets, resulting in exclusion from analysis [SAQ: 113 (13.7%), TI: 7 (0.9%)]. In the case of the GCPS, 25 data sets had to be excluded [SAQ: 19 (2.3%), TI: 6 (0.7%)]. Analysis of the incomplete questionnaires showed that there were more missing values in the SAQs than in the TIs [SAQ total: 414 (2.8%) vs. TI total: 67 (0.5%); p < 0.001], and a higher rate of missing items for the SF-12 than for the GCPS [SF-12 total 347 (3.5%) vs. GCPS total 134 (2.7%); p = 0.01]. The frequency of missing data was also higher for the SF-12 mental scales than for the physical scales [SF-12 mental: 211 (4.3%), SF-12 physical: 176 (3.6%); p = 0.07]. Again, missing rates were lower in the TIs [SF-12 mental: 43 (0.9%), SF-12 physical: 37 (0.7%); p = 0.5].
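The comparison of missing-item rates between the two modes can be sketched as a Pearson chi-squared test on a 2x2 table. The item denominator below (823 respondents times 18 items per mode) is an assumption inferred from the reported percentages and is not stated in the text; the function name is hypothetical, and treating items as independent observations ignores both the clustering of items within patients and the paired design, so this is only an approximation of whatever test the authors ran.

```python
from math import erfc, sqrt


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared (no continuity correction) for a 2x2 table.

    Rows are groups, columns are (missing, answered). Returns the
    statistic and the p-value for one degree of freedom.
    """
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    # For 1 degree of freedom the upper tail equals erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# ASSUMPTION: 823 respondents x 18 items per mode; not stated in the text.
items_per_mode = 823 * 18
saq_missing, ti_missing = 414, 67  # missing-item counts reported in the text

statistic, p_value = chi_squared_2x2(
    saq_missing, items_per_mode - saq_missing,
    ti_missing, items_per_mode - ti_missing,
)
```

Under these assumptions the p-value falls well below 0.001, in line with the reported result for the SAQ versus TI comparison.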

Testing for equivalence

The analysis of equivalence revealed a difference between survey modes for the SF-12 mental scales (Table 2). Patients reported poorer mental status on the SAQ than in the TI (mean difference 3.5; 90% confidence interval (CI) 2.9 to 4.0). By contrast, the 90% confidence interval for the mean difference of the SF-12 physical scale was within the limits of ± 2.5 score points (mean difference 1.8; 90% CI 1.3 to 2.3). The mean differences for the two GCPS subscales also lie within the GCPS limits of ± 5 score points (mean difference GCPS pain intensity 0.3; 90% CI -0.7 to 1.2; mean difference GCPS pain-related disability -3.2; 90% CI -4.4 to -2.0) (Table 3).
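To make the decision rule concrete, the short Python sketch below applies the inclusion check to three of the reported 90% confidence intervals. The helper name and the data layout are hypothetical; the interval limits and margins are those quoted in the text.

```python
def within_margin(lower, upper, margin):
    """True if the whole interval [lower, upper] lies inside +/- margin."""
    return -margin <= lower and upper <= margin


# (scale, reported 90% CI lower limit, upper limit, equivalence margin)
reported = [
    ("SF-12 mental", 2.9, 4.0, 2.5),
    ("SF-12 physical", 1.3, 2.3, 2.5),
    ("GCPS pain intensity", -0.7, 1.2, 5.0),
]

results = {name: within_margin(lo, hi, m) for name, lo, hi, m in reported}
```

Only the SF-12 mental scale fails the check, because its entire interval lies above the upper margin of 2.5 score points.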

Table 2 Condition-based mean values (M) and standard deviation (SD) for SF-12: mental (SF-12m) and physical (SF-12p) scales
Table 3 Condition-based mean values (M) and standard deviation (SD) for GCPS: pain intensity (PI) and pain-related disability (PD) scales

Testing for difference by survey mode sequence

Survey mode sequence affected response behaviour in the second survey for the SF-12 mental scales only. Patients who responded to the TI first reported better mental health in the following SAQ on average. The same effect was not observed in the reverse sequence. The mean difference between the two survey modes was greater for the condition "SAQ first" than for the condition "TI first" [SF-12 mental scales: mean difference: 4.1 (SAQ first) vs. 2.8 (TI first); 90% CI 0.2 to 2.4, p < 0.05]. There were no other statistically significant differences, either as regards the SF-12 physical scales or the GCPS subscales [SF-12 physical scales: mean difference 1.8 (SAQ first) vs. 1.8 (TI first); 90% CI -0.9 to 1.0, p > 0.05; GCPS pain intensity: -0.5 (SAQ first) vs. 0.03 (TI first); 90% CI -2.4 to 1.3, p > 0.05; GCPS pain-related disability: 3.4 (SAQ first) vs. 2.9 (TI first); 90% CI -1.9 to 2.9, p > 0.05] (Figure 2).

Figure 2. Mean differences between the two survey modes SAQ vs. TI, by survey-mode sequence.

The point in time when patients were informed about the second survey had no effect on the level of agreement between the two survey modes [SF-12 physical scales: mean difference -2.1 (delayed information) vs. -1.4 (information at outset); 90% CI -0.6 to 1.6, p > 0.05; SF-12 mental scales: 3.6 vs. 3.4; 90% CI -0.8 to 1.4, p > 0.05; GCPS pain intensity: 0.2 vs. -0.7; 90% CI -1.0 to 2.7, p > 0.05; GCPS pain-related disability: 3.8 vs. 2.6; 90% CI -1.2 to 3.5, p > 0.05].


The purpose of this study was to assess the agreement between results of telephone interviews and self-administered mailed questionnaires for the SF-12 and the GCPS, two important instruments in clinical research and practice, with the help of a two-period change-over design. To our knowledge no such data have ever been obtained for the GCPS. The study should provide insight as to whether patients' response behaviour is influenced by motivational aspects, information that would be useful for planning clinical and epidemiological trials.

The results of equivalency testing show that the response behaviour of chronic pain patients is subject to different motivational mechanisms, depending on whether the questions concern mental or physical health. Patients gave a more positive estimation of their mental health in telephone interviews than in the self-administered questionnaires. The same was not true for the SF-12 physical scales or the GCPS subscales. The most likely explanation is that the taboo that society still places on mental disability seems to cause patients speaking with another person in a telephone interview to minimise mental problems that accompany physical illness. This tendency is less likely to affect responses to the more anonymous self-administered questionnaires.

Another result of this study is that computer-assisted telephone interviews have clear advantages over mailed self-administered questionnaires when it comes to completeness of data. In addition to this previously recognized advantage of telephone interviews, however, we found that the number of missing responses was closely related to question content. There tended to be more missing responses for items concerning mental status than for those relating to physical condition. These findings support the hypothesis that patients suffering from chronic pain often view their illness as purely physical and therefore shy away from answering questions about their mental state. In the telephone interview, on the other hand, the trained interviewer is able to obtain substantially more complete responses.

A third result of this study is that, in contrast to previous findings [2, 15, 16], the level of agreement between SF-12 scores in TI and SAQ mode for the mental health subscale was dependent on the survey mode sequence. The differences were more marked if the patient had first given a more positive assessment of mental status in the telephone interview. These results can be explained with the help of findings from the psychology of memory (e.g. [17]): a more positive assessment of mental health status in the telephone interview is associated in the respondent's memory with more positive emotions, which facilitate retrospective recall of memory content when the patient subsequently fills out the paper questionnaire. A more negative assessment of physical health would have the reverse effect: negative emotions block recall of memory content, making it more likely that the respondent will describe his or her current physical state in the subsequent survey. The point in time when patients were informed about the second survey had no effect on response behaviour. Evidently the announcement to participants that they will be asked to complete a second survey is understood as information at a formal level only. Cognitive and emotional processing related to estimating one's own state of health is not likely to be influenced by when this information is received.

The strengths of our study are the comparison of GCPS values in the two survey modes, and the new control variable "point in time when patients were informed of second survey". To our knowledge neither of these has been done before.

One limitation of our study is that we cannot rule out a real remission of symptoms in the interval between the administration of the two surveys, which was at least 14 days, and therefore cannot rule out the possibility that response behaviours were influenced by real symptom improvements.

Another limitation is the non-response rate of close to 25%, as the response behaviour of this group could well differ from that of the rest of the study population. However, a systematic bias resulting from a high non-response rate is more likely to occur when the purpose of the survey is to measure treatment success, which was not the case in the present study. A more generous time frame for the return of questionnaires might increase the response rate. In this study, 662 patients did not return the questionnaire until after the four-week period allotted. The average time taken by those patients to return the questionnaire was seven weeks.

There are also some practical disadvantages to using the CATI system for data collection: depending on the population size and the technical equipment available, this mode is more time-consuming and more costly than mailing out paper questionnaires. Administration of surveys via the Internet might be a cost-effective alternative. A factor to be considered, however, is that proportionately fewer people have access to this means of communication than to the telephone.

Chronic pain research and therapy has traditionally been an interdisciplinary undertaking. Besides quality of life questionnaires, other important data gathering instruments are questionnaires that measure fear or depression. Using instruments such as the CES-D in telephone interviews[18] could, by analogy to results obtained for the mental health subscale of the SF-12 in our study, lead to a systematic underestimation of depression.

Since telephone interviews offer significant advantages over self-administered questionnaires, further mode-comparing studies in this area, particularly with chronic pain patients, are clearly needed.


The most commonly used method of collecting data from patients is still the self-administration of a paper questionnaire. But telephone interviews are being more widely used because of the markedly better data quality obtained by this means. Until now RCT evidence as to whether patient responses differ depending on the survey mode has been lacking. We strongly recommend that mixing of questionnaire modes should be avoided when gathering data with respect to mental health criteria. When a homogeneous questionnaire mode is used, the reliability of responses should theoretically not be affected, since deviations will always be in the same direction. However, outcomes may not be directly comparable to those of other studies if the data were gathered by means of a different mode. Normative data for standardized telephone questionnaires could contribute to a better comparability with the results of the corresponding standardized paper questionnaires.


  1. Brogger J, Bakke P, Eide GE, Gulsvik A: Comparison of Telephone and Postal Survey Modes on Respiratory Symptoms and Risk Factors. Am J Epidemiol. 2002, 155 (6): 572-576. 10.1093/aje/155.6.572.

  2. McHorney CA, Kosinski M, Ware JE: Comparisons of the costs and quality of norms for the SF-36 health survey collected by mail versus telephone interview: results from a national survey. Medical care. 1994, 32 (6): 551-567. 10.1097/00005650-199406000-00002.

  3. Aitken JF, Youl PH, Janda M, Elwood M, Ring IT, Lowe JB: Comparability of skin screening histories obtained by telephone interviews and mailed questionnaires: a randomized crossover study. Am J Epidemiol. 2004, 160 (6): 598-604. 10.1093/aje/kwh263.

  4. Noy D, Creedy D: Postdischarge surveillance of surgical site infections: a multi-method approach to data collection. American journal of infection control. 2002, 30 (7): 417-424. 10.1067/mic.2002.123393.

  5. Fowler FJ, Gallagher PM, Stringfellow VL, Zaslavsky AM, Thompson JW, Cleary PD: Using telephone interviews to reduce nonresponse bias to mail surveys of health plan members. Medical care. 2002, 40 (3): 190-200. 10.1097/00005650-200203000-00003.

  6. Ludemann R, Watson DI, Jamieson GG: Influence of follow-up methodology and completeness on apparent clinical outcome of fundoplication. Am J Surg. 2003, 186 (2): 143-147. 10.1016/S0002-9610(03)00175-2.

  7. García M, Rohlfs I, Vila J, Sala J, Pena A, Masiá R, Marrugat J, REGICOR Investigators: Comparison between telephone and self-administration of Short Form Health Survey Questionnaire (SF-36). Gac Sanit. 2005, 19 (6): 433-439.

  8. Hawthorne G: The effect of different methods of collecting data: mail, telephone and filter data collection issues in utility measurement. Qual Life Res. 2003, 12 (8): 1081-1088. 10.1023/A:1026103511161.

  9. Weinberger M, Oddone EZ, Samsa GP, Landsman PB: Are health-related quality-of-life measures affected by the mode of administration?. Journal of clinical epidemiology. 1996, 49 (2): 135-140. 10.1016/0895-4356(95)00556-0.

  10. Ware J, Kosinski M, Keller SD: A 12-Item Short-Form Health Survey: construction of scales and preliminary tests of reliability and validity. Medical care. 1996, 34 (3): 220-233. 10.1097/00005650-199603000-00003.

  11. Von Korff M, Ormel J, Keefe FJ, Dworkin SF: Grading the severity of chronic pain. Pain. 1992, 50 (2): 133-149. 10.1016/0304-3959(92)90154-4.

  12. Kukuk P, Lungenhausen M, Molsberger A, Endres HG: Long-term Improvement in Pain Coping for cLBP and Gonarthrosis Patients Following Body Needle Acupuncture: A Prospective Cohort Study. Eur J Med Res. 2005, 10 (6): 263-272.

  13. Endres HG, Molsberger A, Lungenhausen M, Trampisch HJ: An internal standard for verifying the accuracy of serious adverse event reporting: the example of an acupuncture study of 190,924 patients. Eur J Med Res. 2004, 9 (12): 545-551.

  14. Westlake WJ: Symmetrical confidence intervals for bioequivalence trials. Biometrics. 1976, 32 (4): 741-744. 10.2307/2529259.

  15. Harewood GC, Yacavone RF, Locke GR, Wiersema MJ: Prospective comparison of endoscopy patient satisfaction surveys: e-mail versus standard mail versus telephone. The American journal of gastroenterology. 2001, 96 (12): 3312-3317. 10.1111/j.1572-0241.2001.05331.x.

  16. Perkins JJ, Sanson-Fisher RW: An examination of self- and telephone-administered modes of administration for the Australian SF-36. Journal of clinical epidemiology. 1998, 51 (11): 969-973. 10.1016/S0895-4356(98)00088-2.

  17. Lewis PA, Critchley HD, Smith AP, Dolan RJ: Brain mechanisms for mood congruent memory facilitation. NeuroImage. 2005, 25 (4): 1214-1223. 10.1016/j.neuroimage.2004.11.053.

  18. Chan KS, Orlando M, Ghosh-Dastidar B, Duan N, Sherbourne CD: The interview mode effect on the Center for Epidemiological Studies Depression (CES-D) scale: an item response theory analysis. Medical care. 2004, 42 (3): 281-289. 10.1097/01.mlr.0000115632.78486.1f.


The study was funded by a program for research funding (FoRUM) at the medical faculty of the Ruhr-University Bochum. The project-executing organisation did not influence study design, conduct of the study, data collection, management, analysis, interpretation of the data, manuscript preparation, or publication decisions.

Author information


Corresponding author

Correspondence to Heinz G Endres.

Additional information

Competing interests

The author(s) declare that they have no competing interests.

Authors' contributions

All authors commented on the draft and the interpretation of the findings, read and approved the final manuscript. ML was responsible for the telephone interviews and supervision, progress of the study, analysis, interpretation and reporting of the data, and wrote the original manuscript; SL was responsible for conception and design, analysis and interpretation of data and presentation of results; CM was responsible for conception and design, interpretation of the data and expertise in pain treatment; CS was responsible for interpretation of the data, telephone interviews and expertise in evaluation of questionnaires; HJT was responsible for conception and design, statistical analysis plan, and statistical expertise; HGE was principal investigator, responsible for the study protocol, conception and design, progress of the study, analysis and interpretation of the data.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Lungenhausen, M., Lange, S., Maier, C. et al. Randomised controlled comparison of the Health Survey Short Form (SF-12) and the Graded Chronic Pain Scale (GCPS) in telephone interviews versus self-administered questionnaires. Are the results equivalent?. BMC Med Res Methodol 7, 50 (2007).
