How does correlation structure differ between real and fabricated data-sets?

Background: Misconduct in medical research has been the subject of many papers in recent years. Among the different types of misconduct, data fabrication might be considered one of the most severe. It has been argued that correlation coefficients in fabricated data-sets are usually greater than those found in real data-sets. We aim to study the differences between real and fabricated data-sets in terms of the association between two variables.
Method: Three examples are presented in which outcomes from made up (fabricated) data-sets are compared with the results from three real data-sets and with appropriate simulated data-sets. Data-sets were made up by faculty members in three universities. The first two examples are devoted to the correlation structures between continuous variables in two different settings: first, when there is a high correlation coefficient between the variables; second, when the variables are not correlated. In the third example the differences between the real data-set and the fabricated data-sets are studied using the independent t-test for comparison between two means.
Results: In general, higher correlation coefficients are seen in made up data-sets than in the real data-sets. This occurs even when the participants are aware that the correlation coefficient for the corresponding real data-set is zero. The findings from the third example, a comparison between means in two groups, show that many people tend to make up data with smaller or no differences between groups even when they know how and to what extent the groups are different.
Conclusion: This study indicates that high correlation coefficients can be considered as a leading sign of data fabrication, as more than 40% of the participants generated variables with correlation coefficients greater than 0.70. However, when inspecting the differences between means in different groups, the same rule may not be applicable, as we observed smaller differences between groups in the made up data-sets than in the real data-set. We also showed that inspecting the scatter-plot of two variables can be considered as a useful tool for uncovering fabricated data.

detecting such data [7,8]. In a recent paper, the roles of biostatisticians in preventing and detecting fraud in clinical trials have been discussed and different methods for detecting fraudulent data have been suggested [6]. For instance, it has been shown in many articles that the standard deviation for fabricated data is less than that of the corresponding real data [8] and there are arguments that the correlation coefficient between two variables in a fabricated data-set is usually greater than that of the real data-set [6]. However, we could not find any paper about the correlation structure of fabricated data in the literature.
In this article we study an extreme case of fraud, i.e. data fabrication, which could have much more effect on conclusions drawn from medical research than any other type of fraud. In particular, we emphasise some simple techniques that might be useful for detecting fabricated data. The techniques are based on the relationship between variables. These methods could be useful not just because of the ethics that must be observed in the research process but because of the possible consequences that fabricated data could have on health care practice.

Methods
In this work three examples of real data-sets are considered. For the first two data-sets our main objective is to find out how closely the correlation structures of real data-sets could be reconstructed with fabricated data. In order to investigate the correlation structure of fabricated data, summary statistics of two real data-sets were shown to the faculty members at Shiraz Medical School and Jahrom Medical School in Iran. These faculty members were from the departments of clinical and basic sciences including Community Medicine, Microbiology, Physiology, Paediatrics, Internal Medicine, etc. Statisticians and epidemiologists were not asked to participate in this study. We met each faculty member in person and spent about 10 to 30 minutes explaining our objectives and the summary statistics of the two real data-sets. Then, they were asked to make up similar data-sets for 40 hypothetical subjects using the forms provided, as if they were attempting fraud by fabricating data for a real study.
The sample size of 40 hypothetical subjects is based on the following considerations. To detect a correlation coefficient greater than 0.40 with type 1 error 0.05 and power 0.80 a sample size of 37 is sufficient [9]. We felt that our colleagues would be willing to make up as many as 40 data-points, so there would be good power to detect correlation of 0.40. The results proved that this sample size was enough for most cases.
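The sample-size figure above can be roughly reproduced with the standard Fisher z approximation. This is a minimal sketch, not necessarily the method used in the cited source: the function name `n_for_correlation` and the choice of a one-sided test (suggested by the wording "greater than 0.40") are assumptions made here for illustration.

```python
import math
from statistics import NormalDist


def n_for_correlation(rho, alpha=0.05, power=0.80, two_sided=False):
    """Approximate sample size needed to detect a correlation `rho`
    against a null of zero, via the Fisher z-transformation."""
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)   # critical value of the test
    z_beta = NormalDist().inv_cdf(power)       # quantile matching the power
    effect = math.atanh(rho)                   # Fisher z of the target correlation
    return ((z_alpha + z_beta) / effect) ** 2 + 3


# Assumed settings: rho = 0.40, alpha = 0.05, power = 0.80 (as in the text).
n_one_sided = n_for_correlation(0.40)
n_two_sided = n_for_correlation(0.40, two_sided=True)
print(round(n_one_sided, 1), round(n_two_sided, 1))
```

Under these assumptions the one-sided calculation gives a value a little above 37, in line with the figure quoted above, while a two-sided test would call for a somewhat larger sample.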
Respondents were asked to make up their data within the same range indicated by the real data-sets. Thirty-four people returned their completed forms within one week, providing 34 data-sets for each example which are analysed in this article.
In Example 3 the mean of a continuous variable is compared between two groups. This does not deal directly with the correlation coefficient between two continuous variables, but provides a further instance of how made up data-sets can be differentiated from the corresponding real data-set. This can be regarded as an example of association between a continuous and a categorical variable. Each respondent was asked to make up data for 30 hypothetical subjects in each group. Based on the observed means and standard deviations of the real data-set, type 1 error of 0.05, and power of 0.80, a sample size of 25 in each group is sufficient. To be more conservative we chose sample size 30 in each group.
In all examples the made up data-sets were produced "by hand".
For each example the results from the made up data-sets are compared with results of 2500 appropriate computer-generated data-sets. These simulated random samples are drawn with replacement either based on the specifications of the corresponding real data-set and the theory of normal distribution (Example 1 and Example 3) or from the real data-set (Example 2).

Results
In this section the differences between the real and made up data-sets are shown in terms of the association between variables.

Example 1
In this example we consider two variables which are highly correlated. Table 1 shows some descriptive statistics of the height and weight of 65 female students aged 20-22 studying at Jahrom Medical School, autumn 2001. The correlation between the two variables is highly significant (r = 0.43, P < 0.001). We gave this table to the participants and asked them to make up measurements of height and weight for 40 hypothetical female students as if they were fabricating data for a real study.
Correlation coefficients of the 34 made up data-sets ranged from -0.097 to 0.996 (mean = 0.33, SD = 0.33). Figure 1 shows the scatter-plots of these data-sets. Most participants produced data-sets with correlation coefficients greater than that of the real data-set.
To investigate whether the correlation coefficients of these made up data-sets are similar to that of the real data-set, we simulated 2500 random samples of height and weight, each with sample size 40, based on the means and standard deviations of height and weight observed in the real data-set (see Table 1) and the theory of the bivariate normal distribution. For each simulated data-set the correlation between height and weight was tested against the null hypothesis of ρ = 0.43 and the corresponding p-value was recorded. A comparison between the correlation coefficients of these simulated data-sets and the made up data-sets is shown in Table 2. Fisher's exact test indicates that the correlation structures are different between the made up data-sets and the simulated data-sets (P < 0.0001).
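A simulation of this kind can be sketched as follows. Several details are assumptions rather than the authors' procedure: the test of ρ = 0.43 uses the Fisher z-transformation, the variables are simulated in standardised form (the correlation coefficient does not depend on the means and standard deviations in Table 1), and the names `pearson_r` and `simulate_rejection_rate` and the seed are illustrative.

```python
import math
import random
from statistics import NormalDist


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_rejection_rate(rho0=0.43, n=40, n_sim=2500, alpha=0.05, seed=1):
    """Share of simulated bivariate normal samples whose correlation is
    declared different from rho0 by a two-sided Fisher z-test."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sim):
        xs, ys = [], []
        for _ in range(n):
            u, v = rng.gauss(0, 1), rng.gauss(0, 1)
            xs.append(u)
            # Mix u and v so that corr(x, y) equals rho0 in the population.
            ys.append(rho0 * u + math.sqrt(1 - rho0 ** 2) * v)
        r = pearson_r(xs, ys)
        z = (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sim


rate = simulate_rejection_rate()
print(f"share of simulated samples rejecting rho = 0.43: {rate:.3f}")
```

Because the samples are generated under the null hypothesis, the rejection share should sit near the nominal 5% level; a made up data-set's correlation can be judged against the same test.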
In addition, 19 (56%) of these made up data-sets had correlation coefficients greater than 0.43. The correlation coefficient in 18 (53%) data-sets was greater than 0.70, and in ten (29%) was 0.90 or higher. In comparison, only 5.5% of the simulated data-sets had correlation coefficients statistically different from 0.43. Thus, the made up (fabricated) data-sets yielded considerably higher correlation coefficients than the corresponding real or randomly generated data-sets.

Example 2
In this example two variables are considered which are not expected to be correlated, or are at most very modestly correlated. Figure 2 shows the scatter-plot of birth weight by gestational age (GA) in 637 newborn boys. Gestational age ranges from 38 to 44 weeks. These data were collected from the birth records of four hospitals in Shiraz, Iran [10]. It can be easily concluded from Figure 2 that these variables are not correlated (r = 0.031, p = 0.437). Table 3 presents the summary statistics of gestational age and birth weight for these babies. Table 3 was provided to the participants and they were asked to make up gestational age and birth weight for 40 hypothetical babies in the same range as shown by Table 3, as if they were fabricating the data instead of collecting the real data.
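As a quick plausibility check of the reported p-value, the usual t statistic for a correlation coefficient can be computed from r and n alone. The sketch below is an approximation under stated assumptions: with 635 degrees of freedom the t distribution is replaced by the standard normal, and the function name `approx_p_for_r` is illustrative.

```python
import math
from statistics import NormalDist


def approx_p_for_r(r, n):
    """Two-sided p-value for H0: rho = 0, using t = r*sqrt(n-2)/sqrt(1-r^2)
    and a normal approximation to the t distribution (reasonable for large n)."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Values reported in the text: r = 0.031 for 637 newborn boys.
p = approx_p_for_r(0.031, 637)
print(round(p, 2))
```

The result falls close to the reported p = 0.437; the small gap is consistent with r having been rounded to three decimals.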
The correlation coefficients between gestational age and birth weight for the 34 made up data-sets were in the range -0.36 to 0.98. Of these data-sets 22 (65%) had correlation coefficients significantly different from zero (see also Figure 3). A simulation study was carried out to determine whether these made up data-sets resemble samples from the real data-set. We drew 2500 random samples of size 40 from the real data-set, of which only 109 (4.4%) had correlation coefficients significantly different from zero. A comparison between the results of the made up data-sets and the 2500 random samples from the real data-set is given in Table 4. Fisher's exact test indicates that correlations in the fabricated data-sets are different from those in random samples from the real data-set (P < 0.0001).
Furthermore, of the 22 made up data-sets with correlation coefficients statistically different from zero, 20 (59% of all 34 data-sets) had a positive correlation coefficient and only two had a negative one. Indeed, in 13 (38%) of the 34 data-sets the correlation coefficient was greater than 0.70 and in 5 (15%) it was more than 0.90.
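The counts quoted in this example (22 of 34 made up data-sets versus 109 of 2500 random samples with a significant correlation) are enough for a two-by-two Fisher's exact test. The sketch below uses only those counts; Table 4 itself may classify the data-sets in more categories, so this is an illustration rather than a reproduction of the authors' analysis, and the function name is an assumption.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    sums the hypergeometric probabilities of all tables that are no more
    likely than the observed one."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, row1)

    def prob(x):
        # Probability that the top-left cell equals x with fixed margins.
        return comb(col1, x) * comb(total - col1, row1 - x) / denom

    lo = max(0, row1 - (total - col1))
    hi = min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-7))


# Rows: made up (34) and simulated (2500); columns: significant / not significant.
p_value = fisher_exact_two_sided(22, 12, 109, 2391)
print(p_value < 0.0001)
```

For these counts the p-value is far below 0.0001, which agrees with the conclusion reported above.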

Example 3
In a study conducted at McMaster University in Canada, 45 graduating students of nursing from a problem based learning (PBL) curriculum were compared with 31 students from a more conventional curriculum at the University of Ottawa [11]. One variable on which they were compared was the students' perceived ability to communicate with patients or so-called communication skill. The summary statistics of the communication skill for both groups are provided in Table 5.
This table was sent by e-mail to all faculty members at the School of Nursing at McMaster University and they were asked to generate data-sets for 30 hypothetical students in each group based on the information in the table, as if they were making up data instead of obtaining it from real subjects. Seventeen faculty members responded before the specified deadline. For these 17 data-sets the mean difference between groups ranged from -0.057 to 4.63. Only 9 (53%) of these differences were significantly different from zero. Figure 4 represents the box-plots for these made up data-sets along with the box-plot of the real data-set.
As Figure 4 shows, compared to the real data-set many participants produced data-sets with smaller mean differences between groups which was surprising as we were expecting larger differences between groups. It could have happened by chance as the number of respondents was small (n = 17). On the other hand, it might be the nature of the fabricated data for comparing two treatment groups. In other words, observing large differences between groups for fabricated data might not be a reasonable and justified expectation. All in all, those are the only made up data-sets of this type that we are aware of and much more needs to be done to find out the true nature of fabricated data for comparing means between groups.
Figure 1. Scatter-plot of weight and height for data-sets made up by 34 individuals
Figure 3. Scatter-plot of birth weight and gestational age for 34 made up data-sets
Figure 4. Box-plots of 17 made up and the real data-sets for the two different curriculum
Again, 2500 data-sets were generated based on the specifications of Table 5 and the theory of normal distribution. An independent t-test was used to compare the means between two groups in each data-set. Fisher's exact test indicates that the data structures between made up and simulated data-sets are different (Table 6, P = 0.0011).

Discussion
In this research we aimed to find out similarities and differences between real and made up data-sets regarding the association between variables. Although the made up data-sets for this research are not real cases of fabricated data, participants were asked to make up data as close as possible to real data, an inclination which is prominent in data fabrication.
In the first two examples we focused on the Pearson correlation coefficient. The third example is not directly about correlation between variables. However, it relates to the association between a categorical variable, dichotomous here, and a continuous variable.
About 30 percent of participants in Example 1 produced data with correlation coefficients greater than 0.90 between height and weight, where the correlation coefficient for the real data-set was 0.43. In Example 2, 15 percent of participants produced data with correlation coefficient greater than 0.90 even though there was no correlation between birth weight and gestational age (gestational age ≥ 38 weeks). Except in longitudinal data where large correlation coefficients occur when the same variable is observed at close time-points, correlation coefficients above 0.80 are not often seen. Therefore, a high correlation coefficient could be regarded as a key point for suspicion when checking for fabricated data.
In Example 3, the expectation was that participants would produce data with larger mean differences between groups than in the corresponding real data, but they produced data with smaller differences. This could be because of the small number of respondents (n = 17). On the other hand, the expectation of observing greater differences between groups, consistent with what we observed for correlations between continuous variables, might not be applicable here. To our knowledge, little, if anything, has been done on detecting fabricated data by comparing mean values between groups. This article may be considered as the first step in this regard and much more needs to be done.
As the last point, there was a considerable number of non-respondents in this survey. Although we can never expect to obtain a 100 percent response rate, there were some other factors which could have affected this study. First, some people may hesitate to make up data even when they know that it is used just for research purposes. Second, our request for making up data-sets for Example 3 was circulated to the faculty members at the School of Nursing at the end of the winter semester. At that time faculty members were busy with exams and marking, so would have had little time to participate in the study.

Conclusion
In this survey made up data-sets were used to find out the similarities and differences between fabricated and real data-sets. The results indicate that high correlation coefficients can be considered as a potential sign of data fabrication. However, for differences between mean values in different groups, the same rule may not apply. We also showed that inspecting a scatter-plot of two variables can be a useful tool for uncovering fabricated data. As Bailey [8] concluded, statistical inference is necessary but may not be sufficient for detecting fabricated data. Sometimes inspecting appropriate graphs could be much more informative than applying statistical techniques and tests.