Exploratory data analysis typically starts with data description, which entails comparisons and often generates statistical hypotheses. Ideally, all hypotheses should be formulated before the descriptive analysis (or even before data collection). In practice, however, data description leads to new questions and to the investigation of new relationships, with hypotheses formulated and then formally tested on the same data. Such descriptive analyses can substantially endanger the validity of formal statistical inference, because they destroy the probabilistic basis of inferential statistics.
Substantial statistical methodology has been developed to overcome this problem, including replication, cross-validation, control of family-wise error rates, and Bonferroni adjustment for multiple testing [14, 15]. These methods can be applied to control the overall type I error rate, but it is usually impossible to quantify prior "data dredging" [15–19].
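A minimal numeric illustration of why undocumented data dredging defeats such corrections; the number of tests and the independence assumption are our own illustrative choices, not taken from the cited references:

```python
# Bonferroni adjustment: m pre-planned tests, each at level alpha/m.
# For independent tests under the global null, the family-wise error rate
# then stays just below alpha.
m, alpha = 10, 0.05
fwer = 1 - (1 - alpha / m) ** m   # ~0.049 for m = 10
print(fwer)
# The correction presupposes that m is known and fixed in advance;
# informal exploratory analyses raise the effective m without any record,
# so no valid divisor can be chosen after the fact.
```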
Significance levels can be controlled by dividing the data set into separate parts prior to data analysis. Hastie et al. [16], for example, recommend randomly splitting the database into 1) a training set to fit the models, 2) a validation set to choose among them, and 3) a test set to evaluate the predictive accuracy of the final model. Van Houwelingen and le Cessie [20] suggest using one part of the database to select the covariates, a second to estimate the regression coefficients, and a third to assess the prediction rule.
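A minimal sketch of such a pre-specified random split; the 50/25/25 fractions and the number of patients are illustrative choices, not prescribed by the cited references:

```python
import numpy as np

def three_way_split(n_patients, fractions=(0.5, 0.25, 0.25), seed=0):
    """Randomly assign patient indices to training, validation and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_patients)
    n_train = int(fractions[0] * n_patients)
    n_val = int(fractions[1] * n_patients)
    # everything beyond the first two blocks forms the test set
    return np.split(idx, [n_train, n_train + n_val])

train_idx, val_idx, test_idx = three_way_split(1000)
```

The essential point is that the assignment is made once, at random, and before any analysis of the data.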
The SLCMSR hosts a large database on multiple sclerosis patients from clinical trials and natural history studies. Data donors do not influence the publication process, and the SLCMSR follows strict rules guaranteeing the non-identifiability of individual data sets. Anonymization and splitting of the trials make it impossible for the SLCMSR and collaborating researchers to link patients or trials to individual data donors.
Splitting the large SLC database into two parts yields one training part for hypothesis generation and a second part for validation. Only large databases are suitable for splitting, because patient numbers drop considerably in secondary analyses. The validation part of the database is reserved for the confirmation of single pre-specified hypotheses. The major finding of one otherwise finalized project could not be validated, and the publication of a false-positive finding was thereby prevented. More recent findings [21] suggest that even a consistent effect of on-study relapses on subsequent "sustained progression" cannot be interpreted as evidence for a link between relapse frequency and the accumulation of true, unremitting disability.
One should be aware, however, that in some cases the validation policy may lead to the publication of false-negative findings. This disadvantage stems from the inherent loss of power induced by splitting the database. We therefore consider our procedure particularly useful in areas where false-positive findings and a considerable degree of "optimism" are common. This is certainly the case in MS research, but it may apply to other areas of clinical research as well.
Moreover, we believe that having this validation policy generally leads to more sensible and thorough data analysis, programming and code checking, and selection of hypotheses to validate.
Simulations (N = 746) of the true significance level of global F-tests under the null hypothesis, carried out after forward and backward variable selection, showed that the significance level can easily exceed 20% even when only a small number of predictor variables is included in the model.
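A sketch of how such inflation can be reproduced; the number of candidate predictors, the number of simulation runs, and the selection rule (simple correlation screening as a stand-in for full stepwise selection) are our own illustrative choices:

```python
import numpy as np
from scipy import stats

def global_f_pvalue(y, X):
    """P-value of the global F-test for a linear model with intercept."""
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = np.sum((y - Xd @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    f = ((tss - rss) / k) / (rss / (n - k - 1))
    return stats.f.sf(f, k, n - k - 1)

def simulated_alpha(n=746, p=20, k_select=5, n_sim=1000, seed=1):
    """Fraction of 'significant' global F-tests (nominal 5%) under the
    global null when the k_select predictors most correlated with the
    outcome are selected before testing."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sim):
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)          # y is independent of all predictors
        corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
        selected = np.argsort(corr)[-k_select:]
        hits += global_f_pvalue(y, X[:, selected]) < 0.05
    return hits / n_sim

print(simulated_alpha())   # typically far above the nominal 5%
```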
The price to pay for splitting the database is a reduction in statistical power. We simulated power levels similar to those of a typical study at the SLC and demonstrated that the shift, expressed as a percentage of the standard deviation, that a one-sided Gauss test detects with 80% power needs to be nearly twice as large with result validation as without it. In other words, statistically significant findings need to be detected twice: once in the training sample and once in the validation sample. However, the cost of publishing false-positive research findings in a field with many false dawns justifies the validation effort.
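A back-of-the-envelope calculation reproduces this factor under simplifying assumptions of our own (two independent halves of equal size, significance required in both, and 80% overall power, so that each half needs power √0.80; the sample size is illustrative and the resulting ratio does not depend on it):

```python
from scipy.stats import norm

def detectable_shift(n, alpha=0.05, power=0.80):
    """Shift (in units of the standard deviation) detectable by a
    one-sided Gauss test at level alpha with the given power."""
    return (norm.ppf(1 - alpha) + norm.ppf(power)) / n ** 0.5

n = 200                                   # illustrative total sample size
without_split = detectable_shift(n)
with_split = detectable_shift(n / 2, power=0.80 ** 0.5)
print(with_split / without_split)         # about 1.65, i.e. nearly double
```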
Is the proposed method of result validation suitable for all research questions or databases? We think that properly designed randomized controlled clinical trials do not necessarily need result validation – although completely separate and independent replications should not be discouraged. Even when additional hypotheses are to be tested at the end of the trial, Bonferroni adjustments can be sufficient to control the significance level. Epidemiological studies, however, are not scientific experiments; their design is less structured than that of clinical trials and often lacks randomization.
In addition, false-positive findings from large-scale studies cannot be disproved, since other studies are typically smaller and lack the power to do so. When a large group of researchers works in a scientific field using the same database, result validation is a powerful way to reduce the probability of publishing false-positive findings.
Allison [22] states that there is an ongoing debate about whether microarray studies require validation guidelines that are fundamentally different from those for other types of study [23, 24].
We think that different ways of constructing the open and the closed part of the database – e.g. randomly selecting individual data packages (or random fractions thereof) to go either into the open or into the closed part, as sketched below – may be interesting modifications of our policy. Future work will aim to define the exact goals of such extended validation studies and to assess which validation procedures will meet those goals.
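A hypothetical sketch of the two variants mentioned above; package names, fractions, and function names are illustrative and not part of the SLCMSR policy:

```python
import numpy as np

rng = np.random.default_rng(0)

def assign_whole_packages(package_ids, open_fraction=0.5):
    """Variant 1: each data package (e.g. one donated trial) goes entirely
    to the open or to the closed part."""
    flags = rng.random(len(package_ids)) < open_fraction
    return {pid: ("open" if f else "closed") for pid, f in zip(package_ids, flags)}

def split_within_package(patient_ids, open_fraction=0.5):
    """Variant 2: a random fraction of each package's patients goes to the
    open part, the rest to the closed part."""
    order = rng.permutation(len(patient_ids))
    open_positions = set(order[:int(open_fraction * len(patient_ids))])
    return {pid: ("open" if i in open_positions else "closed")
            for i, pid in enumerate(patient_ids)}
```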
Ioannidis [1] states that there is no general "gold standard" for validation, but that the percentage of published false-positive findings can be reduced by better-powered studies (i.e. large-scale studies and low-bias meta-analyses), by registration of studies and networking of data collections – similar to the practice for randomized controlled trials – and by a split-team approach.