Semantic and cognitive tools to aid statistical science: replace confidence and significance by compatibility and surprise

Background Researchers often misinterpret and misrepresent statistical outputs. This abuse has led to a large literature on modification or replacement of testing thresholds and P-values with confidence intervals, Bayes factors, and other devices. Because the core problems appear cognitive rather than statistical, we review some simple methods to aid researchers in interpreting statistical outputs. These methods emphasize logical and information concepts over probability, and thus may be more robust to common misinterpretations than are traditional descriptions. Methods We use the Shannon transform of the P-value p, also known as the binary surprisal or S-value s = −log2(p), to provide a measure of the information supplied by the testing procedure, and to help calibrate intuitions against simple physical experiments like coin tossing. We also use tables or graphs of test statistics for alternative hypotheses, and interval estimates for different percentile levels, to thwart fallacies arising from arbitrary dichotomies. Finally, we reinterpret P-values and interval estimates in unconditional terms, which describe compatibility of data with the entire set of analysis assumptions. We illustrate these methods with a reanalysis of data from an existing record-based cohort study. Conclusions In line with other recent recommendations, we advise that teaching materials and research reports discuss P-values as measures of compatibility rather than significance, compute P-values for alternative hypotheses whenever they are computed for null hypotheses, and interpret interval estimates as showing values of high compatibility with data, rather than regions of confidence. Our recommendations emphasize cognitive devices for displaying the compatibility of the observed data with various hypotheses of interest, rather than focusing on single hypothesis tests or interval estimates. We believe these simple reforms are well worth the minor effort they require.


| BACKGROUND
An extended technical discussion of S-values and unconditional information can be found in Greenland [1] and Greenland & Rafi [2]. Here we briefly cover several technical topics mentioned in our main paper [3]: different units for (scalings of) the S-value besides base-2 logs (bits); the importance of uniformity (validity) of the P-value for interpretation of the S-value; the combination of S-values across studies; and the relation of the S-value to other measures of statistical information about a test hypothesis or model.

| UNITS FOR THE S-VALUE
Units for measuring information other than bits arise from different choices for the base of the logarithms. For example, using natural (base-e) logs, the S-value becomes $s_e = -\ln(p) = -\log_2(p)\ln(2)$, whose units are called "nats," while using common (base-10) logs the S-value becomes $s_{10} = -\log_{10}(p) = -\log_2(p)\log_{10}(2)$, whose units are called hartleys, bans, or dits (decimal digits). The ratio of one dit of information to one bit of information is $\log_2(10) = 3.32$, which is similar to the ratio of meters to feet, 3.28. Just as the choice of meters vs. feet does not affect the concepts and methods surrounding length measurement, so the choice of dits vs. bits does not affect any of the concepts or methods of information measurement. Bits are most commonly used in communications engineering because the fundamental physical components in electronic information storage are binary and thus their information capacity is one bit. Natural logs are however more mathematically convenient and thus more common in statistical theory (see below), although base-10 logs are also seen.
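The unit conversions above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `s_value` and the example P-value of 0.05 are illustrative choices, not part of the original analysis.

```python
import math

def s_value(p, base=2.0):
    """Surprisal -log_base(p) of a P-value p; base 2 gives bits, e nats, 10 hartleys."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return -math.log(p) / math.log(base)

p = 0.05                       # illustrative P-value
s_bits = s_value(p, 2)         # about 4.32 bits
s_nats = s_value(p, math.e)    # about 3.00 nats
s_dits = s_value(p, 10)        # about 1.30 hartleys (dits)

# Conversion factors implied by the change-of-base identities in the text.
nats_per_bit = math.log(2)     # s_e  = s_2 * ln(2)
dits_per_bit = math.log10(2)   # s_10 = s_2 * log10(2)
bits_per_dit = math.log2(10)   # about 3.32

print(f"{s_bits:.3f} bits = {s_nats:.3f} nats = {s_dits:.3f} dits")
```

Changing the base only rescales the same quantity, which is why the choice of unit has no bearing on the interpretation.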

| UNIFORMITY OF THE P-VALUE AND THE INFORMATION IN THE S-VALUE
The decision rule "reject H if p ≤ α" will reject H 100α% of the time under sampling from a model M obeying H (i.e., the Type-1 error rate of the test will be α) provided the random variable P corresponding to p is valid (uniform under the model M used to compute it), but not necessarily otherwise [4]. This is one reason why frequentist writers reject invalid P-values (such as posterior predictive P-values, which highly concentrate around 0.50) and devote considerable technical coverage to uniform P-values [4,5,6]. A valid P-value ("U-value") translates into an exponentially distributed S-value with a mean of 1 nat or $\log_2(e) = 1.443$ bits, where e is the base of the natural logs.
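A small simulation can illustrate these properties of a valid P-value. The sketch below draws uniform P-values directly (standing in for repeated sampling under M), so the seed, the sample size, and α = 0.05 are arbitrary assumptions made for illustration.

```python
import math
import random

rng = random.Random(12345)   # fixed seed so the run is reproducible
n = 200_000                  # number of simulated studies under M
alpha = 0.05

# 1 - random() lies in (0, 1], so the logarithm is always defined.
p_values = [1.0 - rng.random() for _ in range(n)]
s_nats = [-math.log(p) for p in p_values]

mean_nats = sum(s_nats) / n                       # close to 1 nat
mean_bits = mean_nats / math.log(2)               # close to log2(e) bits
reject_rate = sum(p <= alpha for p in p_values) / n   # close to alpha

print(f"mean S = {mean_nats:.3f} nats = {mean_bits:.3f} bits")
print(f"rejection rate at alpha = {alpha}: {reject_rate:.4f}")
```

The simulated mean surprisal should sit near 1 nat and the rejection rate near α, up to Monte Carlo error.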
Uniformity is also central to the "refutational information" interpretation of the S-value used here, for it is necessary to ensure that the P-value p from which s is derived is in fact the percentile of the observed value of the test statistic in the distribution of the statistic under M, thus making small p surprising under M and making s the corresponding degree of surprise. Because posterior predictive P-values do not translate into sampling percentiles of the statistic under the hypothesis (in fact, they are pulled toward 0.5 from the correct percentiles) [5,6], the resulting negative log does not measure surprisal at the statistic given M, and so is not a valid S-value in our terms.
For simplicity, we have assumed that at least an approximately valid P-value can be derived for testing M. This is so in typical regression analyses in health and medical sciences, but not always. To deal with exceptions, P-values are often said to be "conservatively valid" for testing M when under M they stochastically dominate a unit-uniform distribution, i.e., under M the probability that P exceeds a given p is at least 1 − p, and for some p exceeds 1 − p. Typical exact P-values from discrete data are conservatively valid but approach uniformity with increasing sample size. In cases for which we can only deduce that P is conservatively valid (as when its observed value p is an upper bound rather than a direct tail probability), we would interpret its corresponding S-value as conservatively valid in the sense of representing the minimum information against M supplied by the test.
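As a concrete case of conservative validity, consider an exact 1-sided P-value from a small binomial experiment. The sketch below is a hypothetical example (10 tosses of a fair coin, upper-tail P-value) and uses exact rational arithmetic to confirm that Pr(P ≤ α) never exceeds α under M.

```python
from fractions import Fraction
from math import comb

n = 10  # hypothetical number of tosses; M is "the coin is fair"

def upper_tail_p(x):
    """Exact 1-sided P-value Pr(X >= x) when Pr(heads) = 1/2."""
    return Fraction(sum(comb(n, j) for j in range(x, n + 1)), 2 ** n)

def prob_p_at_most(alpha):
    """Pr(P <= alpha) under M: total probability of outcomes whose P-value is <= alpha."""
    return sum(Fraction(comb(n, x), 2 ** n)
               for x in range(n + 1) if upper_tail_p(x) <= alpha)

# Check Pr(P <= alpha) <= alpha on a grid of cutoffs 0.01, 0.02, ..., 1.00.
alphas = [Fraction(k, 100) for k in range(1, 101)]
conservative = all(prob_p_at_most(a) <= a for a in alphas)

# Actual Type-1 error rate of "reject if p <= 0.05" in this discrete setting.
actual_rate_at_05 = prob_p_at_most(Fraction(5, 100))

print(conservative, actual_rate_at_05, float(actual_rate_at_05))
```

Here the nominal 5% rule rejects only when at least 9 heads occur, so its actual error rate is well below 5%; the gap shrinks as the number of tosses grows.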
The coin-toss interpretation we have used to physically gauge this information assumes that the only alternative to fairness is in the direction of loading for heads. The S-value it produces thus corresponds to a P-value for the 1-sided hypothesis $\Pr(\text{heads}) \le 1/2$; nonetheless, this interpretation applies even if the original observed P-value p was 2-sided. This translation from a 2-sided P-value to a 1-sided S-value parallels the transformation of P-values into 1-sided sigmas (Z-scores) in physics, in which for example a P-value of 0.05 from a 2-sided test would become a sigma of 1.645, the upper 1-sided 5% cutoff for a standard-normal deviate [7].
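Both translations can be illustrated in a few lines. This is a sketch that assumes the coin-toss reading in which s bits corresponds to about s consecutive heads from a fair coin; rounding s to a whole number of tosses is a simplification for display.

```python
import math
from statistics import NormalDist

p = 0.05                             # observed P-value (1- or 2-sided)

s_bits = -math.log2(p)               # about 4.32 bits
heads_in_a_row = round(s_bits)       # nearest whole number of tosses
p_of_run = 0.5 ** heads_in_a_row     # chance of that run if the coin is fair

# One-sided sigma: the standard-normal deviate whose upper-tail area equals p.
sigma = NormalDist().inv_cdf(1.0 - p)

print(f"s = {s_bits:.2f} bits, roughly {heads_in_a_row} heads in a row "
      f"(probability {p_of_run})")
print(f"one-sided sigma = {sigma:.3f}")
```

The same P-value is thus mapped to a 1-sided scale in both conventions, regardless of how many tails the original test used.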

| COMBINATION OF S-VALUES ACROSS STUDIES
Preprints of our article referred to the S-value as "additive" over independent sources, which is incorrect insofar as the combined refutational information from independent tests of the same model M is a subadditive function of the separate S-values; only the combination of the latter into a summary test statistic is additive. More precisely, suppose we have K studies, each contributing an independent valid S-value $S_k$ for M. Under M, each $S_k$ has an expected value of 1 nat, which can be viewed as the expected "noise" contribution to the S-value (Good [8,9] dealt with this factor by what in our case reduces to subtracting 1 nat from each surprisal, which however creates problematic negative values when s < 1 nat; we have instead chosen to follow the subsequent theoretical literature and not do so). The sum $S_+ = \sum_k S_k$ will thus have an expectation of K "noise" nats under M. Furthermore, the distribution of $2S_+$ will be $\chi^2$ on 2K degrees of freedom [10]; hence the summary S-value $S_\&$ derived from the sum $S_+$ will be the negative log of the P-value from comparing $2S_+$ to a 2K df $\chi^2$ distribution. Under M this summary-$\chi^2$ S-value $S_\&$ has an expectation of 1 nat and thus will on average be $K - 1$ nats smaller than $S_+$, with even larger discrepancies under violations of M that the test is sensitive to.
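The subadditivity can be seen in a short calculation. The sketch below assumes natural-log (nat) units and uses the closed form for the upper tail of a $\chi^2$ distribution with an even number of degrees of freedom; the three P-values are hypothetical.

```python
import math

def combined_s_value_nats(p_values):
    """Return (S_plus, S_amp): the sum of the S-values and the summary S-value, in nats."""
    k = len(p_values)
    s_plus = sum(-math.log(p) for p in p_values)
    # Upper tail of chi-squared on 2K df evaluated at 2 * S_plus.
    # For even df this equals exp(-S_plus) * sum_{j < K} S_plus**j / j!.
    p_combined = math.exp(-s_plus) * sum(
        s_plus ** j / math.factorial(j) for j in range(k)
    )
    return s_plus, -math.log(p_combined)

p_values = [0.05, 0.10, 0.20]        # hypothetical independent valid P-values
s_plus, s_amp = combined_s_value_nats(p_values)

print(f"S_+ = {s_plus:.3f} nats, summary S = {s_amp:.3f} nats")
```

With a single study (K = 1) the two quantities coincide; with more studies the summary S-value falls below the simple sum, reflecting the K − 1 nats of expected noise that the 2K df reference distribution removes on average.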
More generally, if the test model $M_k$ varies across studies, the S-summation test just given is valid for testing the conjunction (intersection) hypothesis $M_\& = M_1 \,\&\, \cdots \,\&\, M_K$, but is rarely optimal because it makes no use of homogeneity or other relations among the models. In particular, the test remains valid when as usual each study model $M_k$ incorporates a shared target or focal hypothesis H across studies (e.g., no association of a given treatment and disease), combined with different sets of background assumptions $A_k$ (e.g., due to differences in study designs), so that $M_k = H \,\&\, A_k$ and $M_\& = H \,\&\, A_1 \,\&\, \cdots \,\&\, A_K = H \,\&\, A_\&$. When however a homogeneity assumption is correct, this general test will have lower power (sensitivity) for violations of H given the background $A_\&$ than the usual summary tests (which use homogeneity in deriving the study-specific contributions to their summary statistics). On the other hand, those usual tests can have lower power if homogeneity is very wrong. In any case, the S-values from the two tests can differ considerably due to the additional (and possibly incorrect) homogeneity information used in the usual tests.
Some insight into these results may follow from reviewing the traditional parallel procedure for combining Z-scores $Z_k$ (e.g., standardized residuals) by squaring and summing them. The resulting sum of squares $\sum_k Z_k^2$ has a K df $\chi^2$ distribution with expectation K if no cross-k information is used to compute the $Z_k$. More generally however the df are reduced by the number of cross-study sharp constraints used. For example, if H is a hypothesis that a mean difference $\delta_k$ is 0, the test using the assumption that the $\delta_k$ are constant across studies imposes $K - 1$ constraints ($\delta_1 = \cdots = \delta_K$) to derive the $Z_k$, and so will have only 1 df instead of K df; this reduction in df produces considerably more power if the assumption is correct, but not necessarily otherwise. See Greenland and Rafi [2] for details and examples.
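A rough numeric contrast between the K df and 1 df tests is sketched below. It rests on simplifying assumptions that are not in the text: the Z-scores are hypothetical, and the homogeneity-based test is taken to be the equal-weight pooled statistic $(\sum_k Z_k)^2 / K$, which is appropriate only if all studies have the same standard error. The helper `chi2_sf` is an illustrative implementation of the $\chi^2$ upper tail via the standard recurrence for the regularized incomplete gamma function.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variate with integer df >= 1."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)               # df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))    # df = 1
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while 2 * a < df:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

z_scores = [1.2, 0.8, 1.5, 1.0]    # hypothetical study-specific Z-scores
k = len(z_scores)

# No cross-study information used: sum of squares on K df.
sum_sq = sum(z * z for z in z_scores)
p_k_df = chi2_sf(sum_sq, k)

# Homogeneity assumed (equal standard errors assumed): pooled Z, squared, on 1 df.
pooled_z = sum(z_scores) / math.sqrt(k)
p_1_df = chi2_sf(pooled_z ** 2, 1)

s_k_df = -math.log2(p_k_df)
s_1_df = -math.log2(p_1_df)
print(f"K df: p = {p_k_df:.4f}, s = {s_k_df:.2f} bits")
print(f"1 df: p = {p_1_df:.4f}, s = {s_1_df:.2f} bits")
```

When the Z-scores all point the same way, as in this made-up example, the 1 df test yields a much larger S-value; if they had mixed signs the pooled statistic could shrink toward zero and the ordering could reverse.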

| OTHER MEASURES OF STATISTICAL INFORMATION ABOUT A TEST HYPOTHESIS OR MODEL
A common measure for evaluating a hypothesis or model restriction H under background assumptions or unrestricted model A is the maximum-likelihood ratio (MLR), which is the value of the likelihood function at its maximum under A alone, divided by its (restricted) maximum when the test hypothesis H is additionally imposed [11,12]. The MLR defined this way is never below 1; it is however sometimes confused with the posterior odds against the tested value H given A, which it equals only under very special (and usually unrealistic) conditions. The MLR does however show the most extreme increase in posterior odds against H that the data could produce given A. The corresponding information measure paralleling the S-value is the deviance difference or likelihood-ratio (LR) statistic for H given A, $2\ln(\mathrm{MLR})$, which is itself a test statistic for H given A. The change in the Akaike Information Criterion (without small-sample adjustment) from adding H to the background model is $2\ln(\mathrm{MLR}) - 2d$, where d is the dimension (degrees of freedom) of H [11,13].
Now consider a sharp constraint hypothesis H with a P-value less than $1/e = 0.368$. Bayarri & Berger [14] and Sellke et al. [15] show that $b = -e \cdot p \cdot \ln(p) = e \cdot p \cdot s_e$ is a sharp lower bound on the Bayes factor for H under A, where A now includes strong restrictions on the alternatives to H. (A Bayes factor is the ratio of posterior data probabilities under H and an alternative, given A.) Thus, given A and the data, b is a lower bound on the reduction in odds for H given A in moving from a prior to a posterior, and $1/b$ is an upper bound on the increase in odds against H given A. Simple numeric examples show that the latter bound is much lower than the MLR. The strength of the restrictions added to A is indicated for example by the fact that for p = 0.05 the MLR in Table 1 of our main paper [3] is 6.83, while $1/b$ is only 2.46. Sellke et al. [15] also discuss how $1/(1 + 1/b)$ is the Type-1 error rate for a particular type of conditional decision rule.
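The two figures quoted for p = 0.05 can be reproduced with a short calculation. The sketch below assumes that the MLR comes from a 2-sided test based on a normal Z-statistic, so that MLR = exp(z²/2); that formula is an assumption about how the tabulated value was obtained, and the function name `bayes_factor_bound` is illustrative.

```python
import math
from statistics import NormalDist

def bayes_factor_bound(p):
    """Lower bound b = -e * p * ln(p) on the Bayes factor for H, defined for p < 1/e."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)

p = 0.05

# Assumption: normal Z-statistic, 2-sided P-value, MLR = exp(z**2 / 2).
z = NormalDist().inv_cdf(1.0 - p / 2.0)
mlr = math.exp(z * z / 2.0)
lr_statistic = 2.0 * math.log(mlr)    # deviance difference, equal to z**2 here

b = bayes_factor_bound(p)
max_odds_increase = 1.0 / b           # upper bound on the increase in odds against H

print(f"MLR = {mlr:.2f}, 2 ln(MLR) = {lr_statistic:.2f}")
print(f"b = {b:.3f}, 1/b = {max_odds_increase:.2f}")
```

The gap between the MLR and 1/b shows how much the restrictions on the alternatives reduce the maximum shift in odds that the same data can support.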
Grünwald et al. [16] introduce a general concept they call an S-test statistic (where "S" stands for "safe") for H given A, defined as any random variable S satisfying $E_M(S) \le 1$ under any model M obeying H and A. They also call this S an "S-value". As noted above, our binary S-value $S_2 = -\log_2(P)$ can be redefined using natural logs and thus rescaled to units of nats instead of units of bits, via $S_e = -\ln(P) = -\log_2(P)\ln(2)$. $S_e$ is then an example of their S-value, since $E_M(S_e) \le 1$ when the random P-value P is valid or conservatively valid (uniform or dominated by a uniform random variable under M); it is also an example of a betting score [17] (hence "S" can also be taken as "information score"). Grünwald et al. [16] discuss other S-values, including those based on Bayes factors.
Finally, consider a 1-dimensional continuous parameter $\mu$ with test hypotheses H of the form $\mu \le \mu_0$ and a specified alternative $\mu \ge \mu_1$ (or $\mu = \mu_0$ with alternative $\mu = \mu_1$), where $\mu_0 < \mu_1$. In this context, yet another S-word, "severity", has been used to refer to the P-value $p(\mu \ge \mu_1)$ for $\mu \ge \mu_1$ (the lower tail of the test statistic $m - \mu_1$ for a 1-sided test of $\mu = \mu_1$ when using the estimate m of $\mu$), which decreases as $\mu_1$ increases; see p. 345 and Fig. 5.5 of Mayo [18]. Since the complement $p(\mu \le \mu_1) = 1 - p(\mu \ge \mu_1)$ is the P-value for $\mu \le \mu_1$, we find that (whatever the base) the corresponding S-value function $s(\mu \le \mu_1) = -\log(p(\mu \le \mu_1))$ measuring the information against $\mu \le \mu_1$ increases as $p(\mu \ge \mu_1)$ increases; thus $p(\mu \ge \mu_1)$ varies directly with the information $s(\mu \le \mu_1)$ against $\mu \le \mu_1$ (the case with alternative $\mu_1 < \mu_0$ is handled symmetrically). This so-called "severity" of the test of the original H ($\mu \le \mu_0$) is not in fact a function of $\mu_0$ and so is identical for all $\mu_0$. Furthermore, it incorporates no information about background assumptions (e.g., whether treatment assignment was blinded) which bear heavily on practical notions of severity. We thus conclude that it is misleading to label $p(\mu \ge \mu_1)$ as a severity measure, and it instead should be recognized and treated as the P-value function it is.
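The relation between this P-value function and the corresponding S-value function can be traced numerically. The sketch below assumes a normally distributed estimate m with known standard error; the values of m, the standard error, and the grid of $\mu_1$ values are hypothetical, and the function names are illustrative.

```python
import math
from statistics import NormalDist

def p_mu_at_least(m, se, mu1):
    """P-value for the hypothesis mu >= mu1: lower tail of (m - mu1) / se."""
    return NormalDist().cdf((m - mu1) / se)

def s_mu_at_most(m, se, mu1):
    """S-value in bits against mu <= mu1, from p(mu <= mu1) = 1 - p(mu >= mu1)."""
    return -math.log2(1.0 - p_mu_at_least(m, se, mu1))

m, se = 1.0, 0.5                     # hypothetical estimate and standard error
grid = [0.0, 0.5, 1.0, 1.5, 2.0]     # candidate values of mu1

p_curve = [p_mu_at_least(m, se, mu1) for mu1 in grid]
s_curve = [s_mu_at_most(m, se, mu1) for mu1 in grid]

for mu1, p, s in zip(grid, p_curve, s_curve):
    print(f"mu1 = {mu1:.1f}: p(mu >= mu1) = {p:.4f}, s(mu <= mu1) = {s:.3f} bits")
```

Neither function takes $\mu_0$ as an argument, which mirrors the point in the text: the quantity depends only on the estimate, its standard error, and $\mu_1$.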