The aim of our matrix is to facilitate the overview of evidence in clinical intervention research. The matrix can serve as a tool for visually assessing the reliability of observations with respect to systematic error and random error (internal validity) and design error (external validity).

The matrix should not replace the thorough process of systematically reviewing evidence and evaluating data in depth, but it could be integrated within these research activities as a tool for overviewing the results. Also, this matrix is not an absolute measure of the risks of errors: the position of studies in relation to each other is relative rather than absolute.

There is a lack of awareness of the importance of the 'play of chance' for the reliability of study findings. Ordering studies by their standard errors might serve as a tool for ranking them according to the level of random error. We have used natural logarithm (ln) transformations for calculating standard errors, although the base-10 logarithm may be used without changing the conclusions, because the two scales differ only by a constant factor.
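As an illustration of ranking by standard error, the following minimal sketch assumes that each study reports a risk ratio and uses the usual large-sample standard error of ln(RR). The trial names and event counts are hypothetical and serve only to show that the ordering is the same on the ln and log10 scales.

```python
import math

def se_log_rr(events_trt, n_trt, events_ctl, n_ctl):
    """Large-sample standard error of ln(risk ratio) from 2x2 counts."""
    return math.sqrt(1 / events_trt - 1 / n_trt + 1 / events_ctl - 1 / n_ctl)

# Hypothetical studies: (events, total) in the intervention arm,
# then (events, total) in the control arm.
studies = {
    "Trial A": (12, 100, 20, 100),
    "Trial B": (150, 2000, 190, 2000),
    "Trial C": (4, 40, 6, 38),
}

# Standard errors on the natural-log scale.
se_ln = {name: se_log_rr(*counts) for name, counts in studies.items()}

# On the log10 scale every standard error is divided by ln(10),
# a constant, so the order of the studies cannot change.
se_log10 = {name: se / math.log(10) for name, se in se_ln.items()}

# Smallest standard error first, i.e. lowest risk of random error first.
ranking_ln = sorted(se_ln, key=se_ln.get)
ranking_log10 = sorted(se_log10, key=se_log10.get)

for name in ranking_ln:
    print(f"{name}: SE(ln RR) = {se_ln[name]:.3f}")
```

The ranking is relative: it orders studies of the same intervention against each other and says nothing about an absolute threshold for acceptable random error.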

As an alternative, the Bayes factor can be considered [37, 68]. The Bayes factor is a likelihood ratio comparing one hypothesis versus another and, therefore, varies with the definition of the possible alternative hypotheses. The Bayes factor is a summary measure that provides an alternative to the p-value for the ranking or the flagging of associations as 'significant' [69]. The Bayes factor:

$$\mathrm{BF} = \frac{\Pr(\text{data} \mid H_0)}{\Pr(\text{data} \mid H_1)}$$

or simple approximations can be very difficult or even impossible for the clinician to implement, since a search for the maximum of the multidimensional posterior may be required for each association [69]. This also applies to the asymptotic Bayes factor introduced by Wakefield [69]. In contrast to the Bayes factor, the standard error can readily be calculated and, when available, provides a tool for comparing the risk of random error between studies of the same intervention.
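For readers who wish to see what such an approximation involves, the sketch below assumes the commonly cited closed form of the asymptotic Bayes factor, with a normal prior of variance W on the effect under the alternative hypothesis and an estimate with variance V. The function name, the prior standard deviation, and the numerical inputs are illustrative assumptions, not values taken from the studies discussed here.

```python
import math

def asymptotic_bayes_factor(estimate, std_error, prior_sd):
    """Approximate Bayes factor for H0 (no effect) versus H1.

    Assumes the estimate (e.g. a log risk ratio) is approximately normal
    with variance V = std_error**2, and that under H1 the true effect has
    a normal prior with mean 0 and variance W = prior_sd**2 (an assumption
    the analyst must supply). Values above 1 favour H0, below 1 favour H1.
    """
    v = std_error ** 2
    w = prior_sd ** 2
    z = estimate / std_error
    shrinkage = w / (v + w)
    return math.sqrt((v + w) / v) * math.exp(-0.5 * z * z * shrinkage)

# Hypothetical example: ln(RR) = -0.25 with standard error 0.10,
# and a prior standard deviation of 0.20 on the log scale.
abf = asymptotic_bayes_factor(-0.25, 0.10, 0.20)
print(f"Approximate Bayes factor (H0 vs H1): {abf:.3f}")
```

The result depends on the chosen prior variance, which illustrates the point made above that the Bayes factor varies with the definition of the alternative hypothesis, whereas the standard error does not.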

The aim of minimising error risks according to the three dimensions actually combines the methodological efforts of falsifying any alternative hypothesis in the evaluation of an intervention. Thereby, the matrix concept visualises how far the scientific process has evolved towards fulfilling Popper's falsification criterion, which states that researchers should primarily engage in trying to falsify any relevant alternative hypothesis and not only the null hypothesis [5]. The minimisation of systematic errors and random errors, by providing ample room for the null hypothesis, as well as measuring important outcomes, is the most audacious attack on any realistic alternative hypothesis. If an array of progressively qualified attacks fails to support the null hypothesis, then we can reliably trust the intervention to be either beneficial or harmful.

The conclusion based on an assessment of the evidence using the matrix approach may be implemented into clinical practice or serve as an incentive for new research. The matrix facilitates the identification of lacunae in our knowledge and is likely to benefit the process of developing evidence-based guidelines.

### Preference for the highest evidence

One has to be aware of the multiple forms of bias potentially present in evidence below level 1 (Table 1). Several examples illustrate that large, apparently beneficial intervention effects from lower-level evidence, even from randomised trials [54, 56, 70], may eventually be reversed to harmful effects when new high-quality evidence appears [50, 71]. This is where the three dimensions of error are of central importance in providing a tool for reliability assessment.

### Limitations

Apart from the three error dimensions influencing the reliability of data, other factors play a role in incomparability and uncertainty of inferences. Many reports of studies appear incomplete, and the lack of details raises questions. Incomplete reporting limits interpretation, but more importantly, this reporting factor should be distinguished from the methodological quality of the trial [72].

Statements like CONSORT [73], PRISMA [74], and MOOSE [75] aim to improve and maximise the amount and correctness of information that can be retrieved from publications. These guidelines also create awareness among researchers about the most important issues to report, so that the quality of future research may increase. By following reporting guidelines, the yield of the research question is likely to be increased (phase 1 in Figure 1).

The standard error does not account for the testing of multiple outcomes or for repeated testing on accumulating data, both of which may introduce additional risks of random error due to multiplicity as well as correlations.

The division of all outcomes into 'primary' and 'secondary' outcome measures can be helpful, as this division sets the standards for the evaluation of interventions. However, this division is artificial, and some outcome measures sit on the border between primary and secondary outcomes. For example, one can argue that quality of life is a primary outcome rather than a secondary outcome. Further, there is also a quantitative aspect to the artificial division into primary and secondary outcomes. Small significant differences in primary outcome measures (e.g., bile duct injuries in patients undergoing cholecystectomy) may be found favouring one intervention, while large differences in secondary outcome measures (e.g., costs) may favour the comparator. Eventually, one may prefer the larger advantages in secondary outcomes to the smaller disadvantages in a primary outcome measure.

Another limitation in the outcome measure dimension is that outcome measures are often correlated, and this correlation is mostly ignored. For example, when mortality is one outcome measure and complications is another, and deaths are again counted as complications, the two outcome measures are correlated. Authors usually carry out multiple univariate analyses, ignoring correlations between outcome measures.

Step IV of the matrix includes the assessment of the size of the intervention effect, e.g., expressed in numbers-needed-to-treat to obtain benefit or to harm one patient with the intervention. This step is the last one since it is irrelevant to consider effect sizes and their directions if a study does not appear to be internally and externally valid.
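As a worked illustration of the effect size in step IV, the sketch below computes the number needed to treat from the event risks in the control and intervention arms, using the standard definition as the reciprocal of the absolute risk difference. The function name and the risks used are hypothetical.

```python
import math

def number_needed_to_treat(control_risk, intervention_risk):
    """Return (label, value) for the number needed to treat.

    The absolute risk reduction is control_risk - intervention_risk.
    A positive reduction yields a number needed to treat to benefit one
    patient ("benefit"); a negative one yields a number needed to treat
    to harm one patient ("harm"). Assumes the outcome is an unwanted event.
    """
    arr = control_risk - intervention_risk
    if arr == 0:
        return ("no difference", math.inf)
    label = "benefit" if arr > 0 else "harm"
    return (label, 1 / abs(arr))

# Hypothetical risks: 20% of control patients and 15% of treated
# patients experience the unwanted event.
label, nnt = number_needed_to_treat(0.20, 0.15)
print(f"Number needed to treat ({label}): {nnt:.1f}")
```

By convention the value is rounded up to the next whole patient when reported. As stated above, this calculation is only worth interpreting once a study appears internally and externally valid.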

Another aspect to consider is heterogeneity [76, 77]. Statistical heterogeneity reflects the between-trial variance of meta-analytic intervention effect estimates rather than the play of chance [76]. Clinical heterogeneity, however, represents differences in populations, procedures, or interventions in daily practice. All these factors of clinical heterogeneity, together with concordance of inclusion and exclusion criteria, should be considered whenever we want to implement the results of available evidence. Assessment and consideration of heterogeneity or diversity, therefore, form the final step before new evidence is implemented. Assessment of heterogeneity is not included in our matrix.
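Although it lies outside the matrix, statistical heterogeneity can be quantified alongside it. The sketch below assumes the widely used I² statistic, derived from Cochran's Q with inverse-variance weights, as one possible measure; the effect estimates and standard errors are hypothetical.

```python
def i_squared(estimates, std_errors):
    """Percentage of variability in effect estimates attributed to
    between-trial heterogeneity rather than chance (I-squared).

    Uses fixed-effect inverse-variance weights and Cochran's Q.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    if q <= df:
        # Negative values are truncated to zero; this also covers Q == 0.
        return 0.0
    return 100.0 * (q - df) / q

# Hypothetical log risk ratios and their standard errors from three trials.
log_rr = [-0.51, -0.26, -0.45]
se = [0.34, 0.10, 0.60]
print(f"I-squared: {i_squared(log_rr, se):.1f}%")
```

Such a statistic addresses statistical heterogeneity only. Clinical heterogeneity, as described above, still requires judgement about populations, procedures, and interventions.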