How does correlation structure differ between real and fabricated data-sets?
© Akhtar-Danesh and Dehghan-Kooshkghazi; licensee BioMed Central Ltd. 2003
Received: 26 June 2003
Accepted: 29 September 2003
Published: 29 September 2003
Misconduct in medical research has been the subject of many papers in recent years. Among the different types of misconduct, data fabrication might be considered one of the most severe. It has been argued that correlation coefficients in fabricated data-sets are usually greater than those found in real data-sets. We aim to study the differences between real and fabricated data-sets in terms of the association between two variables.
Three examples are presented where outcomes from made up (fabricated) data-sets are compared with the results from three real data-sets and with appropriate simulated data-sets. Data-sets were made up by faculty members in three universities. The first two examples are devoted to the correlation structures between continuous variables in two different settings: first, when there is a high correlation coefficient between the variables; second, when the variables are not correlated. In the third example the differences between the real data-set and fabricated data-sets are studied using the independent t-test for comparison between two means.
In general, higher correlation coefficients are seen in made up data-sets than in the real data-sets. This occurs even when the participants are aware that the correlation coefficient for the corresponding real data-set is zero. The findings from the third example, a comparison between means in two groups, show that many people tend to make up data with smaller or no differences between groups, even when they know how, and to what extent, the groups differ.
This study indicates that a high correlation coefficient can be considered a leading sign of data fabrication: more than 40% of the participants generated variables with correlation coefficients greater than 0.70. However, when inspecting differences between means in different groups, the same rule may not be applicable, as we observed smaller differences between groups in the made up data-sets than in the real data-set. We also showed that inspecting the scatter-plot of two variables can be a useful tool for uncovering fabricated data.
Misconduct in medical research has been the subject of many papers in recent years [1–5]. The usual types of misconduct include fabrication and falsification of data, plagiarism, deceptive reporting of results, suppression of existing data, and deceptive design or analysis [4, 6]. At the same time, there has been much effort to reveal fraudulent data and to equip statisticians with techniques for detecting such data [7, 8]. In a recent paper, the roles of biostatisticians in preventing and detecting fraud in clinical trials were discussed and different methods for detecting fraudulent data were suggested. For instance, it has been shown in many articles that the standard deviation of fabricated data is less than that of the corresponding real data, and there are arguments that the correlation coefficient between two variables in a fabricated data-set is usually greater than that of the real data-set. However, we could not find any paper on the correlation structure of fabricated data in the literature.
In this article we study an extreme case of fraud, i.e. data fabrication, which could have much more effect on conclusions drawn from medical research than any other type of fraud. In particular, we emphasise some simple techniques that might be useful for detecting fabricated data. The techniques are based on the relationship between variables. These methods could be useful not just because of the ethics that must be observed in the research process but also because of the possible consequences that fabricated data could have for health care practice.
In this work three examples of real data-sets are considered. For the first two data-sets our main objective is to find out how closely the correlation structures of real data-sets could be reconstructed with fabricated data. In order to investigate the correlation structure of fabricated data, summary statistics of two real data-sets were shown to the faculty members at Shiraz Medical School and Jahrom Medical School in Iran. These faculty members were from the departments of clinical and basic sciences including Community Medicine, Microbiology, Physiology, Paediatrics, Internal Medicine, etc. Statisticians and epidemiologists were not asked to participate in this study. We met each faculty member in person and spent about 10 to 30 minutes to explain our objectives and the summary statistics of the two real data-sets. Then, they were asked to make up similar data-sets for 40 hypothetical subjects using forms provided as if they were attempting fraud by fabricating data for a real study.
The sample size of 40 hypothetical subjects is based on the following considerations. To detect a correlation coefficient greater than 0.40 with type 1 error 0.05 and power 0.80, a sample size of 37 is sufficient [9]. We felt that our colleagues would be willing to make up as many as 40 data-points, so there would be good power to detect a correlation of 0.40. The results showed that this sample size was enough for most cases.
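A sample-size calculation of this kind can be sketched with the standard approximation based on Fisher's z transformation (the tables of Machin et al. rest on closely related formulae, so the result here may differ from the tabulated value by a unit or so). A minimal sketch, using only the Python standard library:

```python
import math
from statistics import NormalDist

def n_for_correlation(r, alpha=0.05, power=0.80, two_sided=True):
    """Approximate sample size to detect correlation r against rho = 0,
    via Fisher's z: n = ((z_alpha + z_beta) / atanh(r))^2 + 3."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2 if two_sided else 1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil((z_alpha + z_beta) ** 2 / math.atanh(r) ** 2 + 3)

# A one-sided test gives a value close to the 37 quoted from the tables;
# the two-sided version requires a somewhat larger sample.
print(n_for_correlation(0.40, two_sided=False))  # 38
print(n_for_correlation(0.40))                   # 47
```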
Respondents were asked to make up their data within the same range indicated by the real data-sets. Thirty-four people returned their completed forms within one week, providing 34 data-sets for each example which are analysed in this article.
In Example 3 the mean of a continuous variable is compared between two groups. This does not deal directly with the correlation coefficient between two continuous variables, but provides a further instance of how made up data-sets can be differentiated from the corresponding real data-set. This can be regarded as an example of association between a continuous and a categorical variable. Each respondent was asked to make up data for 30 hypothetical subjects in each group. Based on the observed means and standard deviations of the real data-set, type 1 error of 0.05, and power of 0.80, a sample size of 25 in each group is sufficient. To be more conservative we chose sample size 30 in each group.
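The corresponding two-group calculation can be sketched with the usual normal approximation for comparing two means. The real data-set's means and standard deviations are not reproduced here, so the standardized difference of 0.8 below is a hypothetical value chosen only because it reproduces a requirement of about 25 per group:

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means: n = 2 * (z_alpha + z_beta)^2 / d^2,
    where d = delta / sd is the standardized difference."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    d = delta / sd
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

# Hypothetical standardized difference of 0.8 (an assumption, not the
# paper's actual figures) yields about 25 subjects per group:
print(n_per_group(delta=0.8, sd=1.0))  # 25
```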
In all examples the made up data-sets were produced "by hand".
For each example the results from the made up data-sets are compared with results of 2500 appropriate computer-generated data-sets. These simulated random samples are drawn with replacement either based on the specifications of the corresponding real data-set and the theory of normal distribution (Example 1 and Example 3) or from the real data-set (Example 2).
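The normal-theory simulation for Example 1 can be sketched as follows: repeatedly draw bivariate-normal samples of size 40 with population correlation 0.43 and record the sample Pearson correlation. This is a sketch of the general approach, not the authors' exact code:

```python
import math
import random

random.seed(1)

def simulate_r(n=40, rho=0.43, reps=2500):
    """Draw `reps` bivariate-normal samples of size n with population
    correlation rho; return the sample Pearson correlations."""
    rs = []
    for _ in range(reps):
        xs = [random.gauss(0, 1) for _ in range(n)]
        # y = rho*x + sqrt(1 - rho^2)*noise has correlation rho with x
        ys = [rho * x + math.sqrt(1 - rho ** 2) * random.gauss(0, 1)
              for x in xs]
        mx, my = sum(xs) / n, sum(ys) / n
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx = sum((x - mx) ** 2 for x in xs)
        syy = sum((y - my) ** 2 for y in ys)
        rs.append(sxy / math.sqrt(sxx * syy))
    return rs

rs = simulate_r()
print(sum(rs) / len(rs))  # centres near the population value 0.43
```

Random sampling with replacement from the real data-set (Example 2) would replace the bivariate-normal draw with `random.choices` over the observed (x, y) pairs.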
In this section the differences between the real and made up data-sets are shown in terms of the association between variables.
Table 1. Summary statistics of height and weight for 65 students. Correlation between variables: r = 0.43 (P < 0.001).
Table 2. Comparison between the significance levels of the correlations for the made up data-sets and 2500 random samples produced based on the specifications of Table 1.* Significance levels are grouped as p ≤ 0.05, 0.05 < p ≤ 0.10, and p > 0.10.
In addition, 19 (56%) of these made up data-sets had correlation coefficients greater than 0.43. The correlation coefficient in 18 (53%) data-sets was greater than 0.70, and in ten (29%) was 0.90 or higher. In comparison, only 5.5% of the simulated data-sets had correlation coefficients statistically different from 0.43. Thus, the made up (fabricated) data-sets yielded considerably higher correlation coefficients than the corresponding real or randomly generated data-sets.
Table 3. Summary statistics of gestational age (GA) and birth weight for 637 newborn boys. Correlation between variables: r = 0.031 (p > 0.20).
Table 3 was provided to the participants and they were asked to make up gestational age and birth weight for 40 hypothetical babies in the same range as shown by Table 3, as if they were fabricating the data instead of collecting the real data.
Table 4. Comparison between the made up data-sets and 2500 random samples from the real data-set of 637 newborn boys in terms of the p-values of the correlation between GA and birth weight.+ Significance levels are grouped as p ≤ 0.05, 0.05 < p ≤ 0.10, and p > 0.10; the rows give the made up data-sets and the samples from the real data-set.
Furthermore, of the 22 made up data-sets with correlation coefficients statistically different from zero, 20 (59%) had a positive correlation coefficient and only two had a negative one. Indeed, in 13 (38%) the correlation coefficient was greater than 0.70 and in 5 (15%) it exceeded 0.90.
Table 5. Summary statistics of communication skill in two groups of students (t = 3.66, p < 0.001).
As Figure 4 shows, many participants produced data-sets with smaller mean differences between groups than in the real data-set, which was surprising, as we had expected larger differences. This could have happened by chance, since the number of respondents was small (n = 17). Alternatively, it might reflect the nature of fabricated data for comparing two treatment groups; in other words, expecting large differences between groups in fabricated data might not be reasonable or justified. In any case, these are the only made up data-sets of this type that we are aware of, and much more work is needed to establish the true nature of fabricated data for comparing means between groups.
Table 6. Comparison between made up and simulated data-sets about communication skill.++ Significance levels are grouped as p ≤ 0.05, 0.05 < p ≤ 0.10, and p > 0.10.
In this research we aimed to find out similarities and differences between real and made up data-sets regarding the association between variables. Although the made up data-sets for this research are not real cases of fabricated data, participants were asked to make up data as close as possible to real data, an inclination which is prominent in data fabrication.
In the first two examples we focused on the Pearson correlation coefficient. The third example is not directly about correlation between variables; however, it concerns the association between a categorical variable (dichotomous here) and a continuous variable.
About 30 percent of participants in Example 1 produced data with correlation coefficients greater than 0.90 between height and weight, where the correlation coefficient for the real data-set was 0.43. In Example 2, 15 percent of participants produced data with correlation coefficient greater than 0.90 even though there was no correlation between birth weight and gestational age (gestational age ≥ 38 weeks). Except in longitudinal data where large correlation coefficients occur when the same variable is observed at close time-points, correlation coefficients above 0.80 are not often seen. Therefore, a high correlation coefficient could be regarded as a key point for suspicion when checking for fabricated data.
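A screen along these lines is easy to automate. The sketch below flags any data-set whose Pearson correlation exceeds a threshold rarely seen outside longitudinal data; the threshold of 0.80 follows the observation above, and the function is a heuristic illustration, not the authors' procedure:

```python
import math

def flag_suspicious(xs, ys, threshold=0.80):
    """Return (r, flag): the Pearson correlation of xs and ys, and
    whether |r| exceeds a threshold that warrants closer inspection.
    A high r is grounds for suspicion, not proof of fabrication."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    return r, abs(r) > threshold

# A nearly linear made up data-set trips the flag:
r, suspicious = flag_suspicious([1, 2, 3, 4, 5], [2.1, 4.0, 6.2, 7.9, 10.1])
print(round(r, 3), suspicious)
```

In practice such a flag would be followed by inspecting the scatter-plot, as recommended in the Conclusions.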
In Example 3, the expectation was that participants would produce data with larger mean differences between groups than in the corresponding real data, but they in fact produced data with smaller differences. This could be because of the small number of respondents (n = 17). On the other hand, the expectation of observing greater differences between groups, consistent with what we observed for correlations between continuous variables, might not be applicable here. To our knowledge, little, if any, work has been done on detecting fabricated data by comparing mean values between groups. This article may be considered a first step in this regard, and much more needs to be done.
As a final point, there was a considerable number of non-respondents in this survey. Although we can never expect a 100 percent response rate, some other factors could have affected this study. First, some people may hesitate to make up data even when they know it is used only for research purposes. Second, our request for making up data-sets for Example 3 was circulated to the faculty members at the School of Nursing at the end of the winter semester. At that time faculty members were busy with exams and marking, so they would have had little time to participate in the study.
In this survey made up data-sets were used to find out the similarities and differences between fabricated and real data-sets. The results indicate that high correlation coefficients can be considered a potential sign of data fabrication. However, for differences between mean values in different groups, the same rule may not apply. We also showed that inspecting a scatter-plot of two variables can be a useful tool for uncovering fabricated data. As Bailey concluded, statistical inference is necessary but may not be sufficient for detecting fabricated data. Sometimes inspecting appropriate graphs can be much more informative than applying statistical techniques and tests.
We are grateful to Dr Elizabeth Rideout for making available to us the data-set for Example 3. We also benefited from her invaluable comments on the first draft of this article. We are thankful to Professor Harry Shannon for his comments and reading the final draft. We are also thankful to the referees for their invaluable comments. We would like to thank our colleagues at Jahrom Medical School, Shiraz University of Medical Sciences, and School of Nursing at McMaster University for making up the data-sets. Our students at Jahrom Medical School kindly provided us their height and weight measurements for Example 1.
1. Horton R: The clinical trial: Deceitful, disputable, unbelievable, unhelpful, and shameful – What next? Controlled Clinical Trials. 2001, 22: 593-604. doi:10.1016/S0197-2456(01)00175-1.
2. Neaton JD, Bartsch GE, Broste SD, Cohen JD, Simon NM: A case of data alteration in the multiple risk factor intervention trial (MRFIT). Controlled Clinical Trials. 1991, 12: 731-740. doi:10.1016/0197-2456(91)90036-L.
3. Ranstam J, Buyse M, George SL, et al: Fraud in medical research: an international survey of biostatisticians. Controlled Clinical Trials. 2000, 21: 415-427. doi:10.1016/S0197-2456(00)00069-6.
4. Riis P: Scientific dishonesty: European reflections. Journal of Clinical Pathology. 2001, 54: 4-6. doi:10.1136/jcp.54.1.4.
5. White C: Plans for tackling research fraud may not go far enough. British Medical Journal. 2000, 321: 1487. doi:10.1136/bmj.321.7275.1487.
6. Buyse M, George SL, Evans S, et al: The role of biostatistics in the prevention, detection and treatment of fraud in clinical trials. Statistics in Medicine. 1999, 18: 3435-3451.
7. DeMets DL: Distinctions between fraud, bias, errors, misunderstanding, and incompetence. Controlled Clinical Trials. 1997, 18: 637-650. doi:10.1016/S0197-2456(97)00010-X.
8. Bailey KR: Detecting fabrication of data in a multicenter collaborative animal study. Controlled Clinical Trials. 1991, 12: 741-752. doi:10.1016/0197-2456(91)90037-M.
9. Machin D, Campbell M, Fayers P, Pinol A: Sample Size Tables for Clinical Studies. 1997, Oxford: Blackwell Science.
10. Akhtar-Danesh N: The incidence of congenital malformation in Southern Iran, 1987–1988: an epidemiological survey. MSc thesis. 1988, Shiraz University of Medical Sciences.
11. Rideout E, England-Oxford V, Brown B, et al: A comparison of problem-based and conventional curricula in nursing education. Advances in Health Sciences Education. 2002, 7: 3-17. doi:10.1023/A:1014534712178.
- The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/3/18/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.