### Simple double counting of studies

A recent meta-analysis of the safety of anticholinergics in chronic obstructive pulmonary disease (COPD) by Singh et al [5] in *JAMA* affords an example. A problem with this meta-analysis is that studies were counted twice. For example, a publication by Brusasco et al was included [6]. However, this publication was itself a meta-analysis of two studies [7], one of which, by Donohue et al [8], was also separately included by Singh et al. Thus the Donohue et al study was included twice, which is clearly inappropriate.

### Double counting of some aspects of studies

This error is slightly more subtle. Again *JAMA* affords an example. A meta-analysis by Kozyrskyj et al compared short and long course treatment of otitis media with antibiotics [9]. An unsatisfactory feature of this overview is that arms of the same study are counted more than once [10]. A number of the trials being summarised had more than two arms. The way that the authors chose to deal with this was to enter the control arm twice. Thus (say) treatment A was compared to C and then treatment B (say) was compared to C. The net effect was that C was counted twice.

For example, a trial by Hoberman et al [11] was included twice, apparently once with 375 patients and once with 386. However, the original data refer to two long courses of antibiotics in 178 and 189 patients respectively and to one short course with 197 patients. It appears that this short course has been counted twice by Kozyrskyj et al so that we have 178+197 = 375 and 189+197 = 386. This sort of double counting seems to have occurred on at least three occasions.
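The arithmetic behind this suspicion can be checked in a few lines. The sketch below uses only the arm sizes quoted above; the variable names are my own, and the reconstruction assumes that each long-course arm was entered against the same short-course arm.

```python
# Arm sizes from the original Hoberman et al trial, as quoted in the text.
long_course_arms = [178, 189]
short_course_arm = 197

# Assumed reconstruction: each long-course arm is paired with the same short-course arm.
comparison_totals = [n_long + short_course_arm for n_long in long_course_arms]

patients_in_meta_analysis = sum(comparison_totals)
patients_in_trial = sum(long_course_arms) + short_course_arm
double_counted = patients_in_meta_analysis - patients_in_trial

print(comparison_totals)   # [375, 386]
print(patients_in_trial)   # 564
print(double_counted)      # 197
```

The two totals reproduce the 375 and 386 patients that appear in the meta-analysis, and the surplus over the number actually randomised is exactly the size of the short-course arm.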

A similar case appears in the meta-analysis by Brocklebank et al [12] in the *BMJ* comparing metered dose inhalers (MDI) and other hand held devices for delivering corticosteroids in asthma. Figure 2 of that paper includes what appear to be two studies by Vidgren et al. In fact, there is only one study, a three armed cross-over trial [13] comparing Diskhaler®, Easyhaler® and an MDI. Presumably, the data for the MDI have been included twice in the overall summary.

A slightly different form of double counting of some information from a study occurs in the paper by Singh et al [5] already cited. Two studies by Casaburi [14, 15] are included in the meta-analysis. However, one was a preliminary report on short term results and the other is the full report at conclusion, including the short term data. Thus the short term data must have been counted twice.

### Accepting implausible claims for the precision of individual studies

A meta-analysis by Hackshaw et al [16] in the *BMJ* considered passive smoking. The method involved weighting reported log-odds ratios using standard errors that were either reported or calculated from confidence intervals. However, Peter Lee, in an extremely important but sadly neglected article [17] in *Statistics in Medicine*, has pointed out that the standard error of a log-odds ratio is approximately equal to the square root of the sum of the reciprocals of the frequencies in the corresponding four-fold table, and that this fact provides various lower bounds on the standard error. Conversely, a given standard error implies a minimum sample size. In fact, for a given total sample size *N*, the split of cases and non-cases in exposed and unexposed groups that gives rise to the minimum standard error is an equal split of *N*/4 subjects in each cell. It follows, for example, that for any reported variance *V*, the total sample size *N* must satisfy *N* ≥ 16/*V*. Similar inequalities exist for the total of any two cells and for the numbers in any given cell.

As Lee showed [17], at least one of the studies [18] included by Hackshaw et al [16] in their meta-analysis has impossibly low standard errors when examined in this way: the numbers of subjects are too few in view of the precision claimed.
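A check of this kind is easy to automate. The following is a minimal sketch of the bound *N* ≥ 16/*V* only; the function names and the example figures are hypothetical and are not taken from Lee's paper or from any study in the Hackshaw et al meta-analysis.

```python
import math

def log_or_variance(a, b, c, d):
    """Approximate variance of the log-odds ratio for a four-fold table."""
    return 1 / a + 1 / b + 1 / c + 1 / d

def min_total_sample_size(standard_error):
    """Smallest total N compatible with a reported standard error (N >= 16 / V)."""
    variance = standard_error ** 2
    return math.ceil(16 / variance)

def is_precision_plausible(standard_error, total_n):
    """False if the study is too small to have produced the claimed standard error."""
    return total_n >= 16 / standard_error ** 2

# A perfectly balanced table of 256 subjects attains the bound exactly.
print(log_or_variance(64, 64, 64, 64))    # 0.0625, i.e. 16 / 256
print(min_total_sample_size(0.25))        # 256

# Hypothetical report: a standard error of 0.25 from only 150 subjects.
print(is_precision_plausible(0.25, 150))  # False
```

Passing this check does not show that a reported standard error is correct, since the bound is attained only by a perfectly balanced table, but failing it shows that something must be wrong.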

### Imputing data

The meta-analysis by Brocklebank et al [12] already cited has ten within-arm within-study standard deviations equal to 100.0. There is no explanation of this fact and it appears that these standard deviations are imputed. In fact, cross-over studies are being combined and it seems that the authors are forcing them into the parallel group framework that RevMan, the Cochrane Collaboration software, required (at least in its earlier versions). In order to do this they have invented between-patient standard deviations that are, in fact, irrelevant to judging the outcome from a cross-over trial.

This is, in my view, a bad idea. It must be granted that this is a far less serious error than some others described here: because between-patient standard deviations are used, the evidence from the cross-over studies is, if anything, likely to be understated. Nevertheless, it is an inappropriate approach that should be avoided.
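Why between-patient standard deviations tend to understate the evidence from a cross-over trial can be illustrated with a minimal sketch. It assumes a simple two-period cross-over with no carry-over or period effects, a common standard deviation `sigma` for a single measurement and a correlation `rho` between the two measurements on the same patient. The function names and all numbers are hypothetical and are not taken from Brocklebank et al.

```python
import math

def crossover_se(sigma, rho, n_patients):
    """SE of the mean within-patient difference in a two-period cross-over.

    For each patient, Var(y_A - y_B) = 2 * sigma^2 * (1 - rho).
    """
    return math.sqrt(2 * sigma**2 * (1 - rho) / n_patients)

def parallel_se(sigma, n_per_arm):
    """SE of a difference in means when the arms are treated as independent groups."""
    return math.sqrt(2 * sigma**2 / n_per_arm)

# Hypothetical values for illustration only.
sigma, rho, n = 100.0, 0.75, 20
se_within = crossover_se(sigma, rho, n)
se_between = parallel_se(sigma, n)
print(f"cross-over SE: {se_within:.1f}, parallel-group SE: {se_between:.1f}")
```

Under these assumptions the ratio of the two standard errors is the square root of 1 − `rho`, so whenever measurements on the same patient are positively correlated the parallel-group calculation makes the trial look less precise than it is.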

However, not all attempts to impute data understate the evidence. For example, Nicholson et al [19], in a meta-analysis of depression as a prognostic factor in heart disease, were able to identify 54 relevant studies. Unfortunately, six of these only recorded a lack of a significant association and did not give confidence intervals. Nicholson et al imputed an effect estimate of one to these studies and estimated the standard errors by regression on the number of patients.

This procedure cannot be endorsed. The value of unity chosen is the value that gives the least possible association, but this overstates the lack of association. For example, a study by Hallstrom [20], which enrolled 795 women for 12 years of follow up but for which only the result 'not significant' is available, is *awarded* a relative risk (RR) of 1.0 with a confidence interval of 0.6 to 1.7. However, the study by Ferketich [21], which is based on 5007 women followed for ten years, has a *reported* RR of 1.0 with a wider confidence interval of 0.5 to 2.0. It is surely not appropriate to give a smaller study for which the relevant data have had to be guesstimated more weight than a larger one for which the data are available.
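The consequence for the weights can be sketched directly from the two intervals quoted above. The calculation assumes that both are 95% intervals, symmetric on the log scale, and that studies are weighted by inverse variance; none of this is stated in the original report, so the figures are indicative only.

```python
import math

def se_from_ci(lower, upper, z=1.96):
    """Standard error of log(RR) implied by a confidence interval on the ratio scale."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

# Intervals as quoted in the text.
se_imputed = se_from_ci(0.6, 1.7)    # Hallstrom: 795 women, interval imputed
se_reported = se_from_ci(0.5, 2.0)   # Ferketich: 5007 women, interval reported

# Assumed inverse-variance weights on the log scale.
weight_imputed = 1 / se_imputed**2
weight_reported = 1 / se_reported**2

print(f"weight ratio (imputed / reported): {weight_imputed / weight_reported:.2f}")
```

On these assumptions the study with the guesstimated interval would receive about 1.8 times the weight of the study that is more than six times its size.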

It would have been better in my opinion to have excluded the six studies with insufficient detail altogether.

### Spurious precision of individual trials

An interesting paper by Peters et al [22] considered Bayesian approaches to combining epidemiological observational data on humans with experimental data in animals and illustrated this using an investigation of trihalomethane exposure as a possible cause of low birthweight. They identified five epidemiological and eight toxicological studies in animals. However, in analysing the toxicological studies they treated the pups in litters of rats as independent observations rather than treating them as repeated measures on the dams. Since the number of pups is, of course, much higher than the number of dams, this has the consequence of 'spurious precision' [23, 24]. In other words, there is an overstating of the evidence.
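The size of the effect can be gauged with the standard design-effect formula for clustered data, 1 + (*m* − 1)ρ, where *m* is the litter size and ρ the intra-litter correlation. This is a simple stand-in for a proper repeated-measures analysis rather than the method used in any of the papers cited, and the litter sizes and correlation below are hypothetical.

```python
import math

def design_effect(cluster_size, icc):
    """Variance inflation from ignoring clustering: 1 + (m - 1) * icc."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_clusters, cluster_size, icc):
    """Number of independent observations the clustered data are worth."""
    return n_clusters * cluster_size / design_effect(cluster_size, icc)

# Hypothetical experiment: 10 dams with 12 pups each, intra-litter correlation 0.5.
n_dams, pups_per_dam, icc = 10, 12, 0.5
deff = design_effect(pups_per_dam, icc)

print(deff)                                                          # 6.5
print(round(effective_sample_size(n_dams, pups_per_dam, icc), 1))    # 18.5
print(round(math.sqrt(deff), 2))   # 2.55, factor by which the naive SE is too small
```

In this illustration 120 pups carry about as much information as 18 or 19 independent animals, and a standard error calculated as if the pups were independent would be too small by a factor of about two and a half.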

### Inappropriate pooling of treatments

A very thorough and in many ways expert meta-analysis by Jüni et al in the *Lancet* looked at the risk of cardiovascular events under rofecoxib [25]. A number of different treatments, including placebo, naproxen and non-naproxen non-steroidal anti-inflammatory drugs (NSAIDs), were considered as controls. Thus the meta-analysis compares rofecoxib to a mixture of controls. This is not, in itself, illegitimate, but one has to be quite clear about the purpose of such a meta-analysis. The relevant null hypothesis is 'rofecoxib is identical to all these comparators'. If and when this null hypothesis is rejected, the alternative hypothesis that then follows is 'rofecoxib is different from at least one of these comparators'.

Jüni et al were criticised by researchers at Merck, the makers of rofecoxib, for contravening a basic principle of meta-analysis, namely to pool like with like [26]. I disagree that there is such a principle. However, I also disagree with a conclusion that Jüni et al drew from their analysis.

They implied that their meta-analysis showed that rofecoxib was different to *each* comparator, including placebo, and indeed that this was already clear from data available by 2000. However, to be able to assert *this* alternative hypothesis, it is necessary to have tested rofecoxib separately against each comparator, and for such a meta-analysis the comparators cannot be pooled. In order to justify this claim, they carried out 'a test of interaction' for treatment effect by type of comparator (placebo, naproxen or non-naproxen NSAID) and used a non-significant result to justify pooling. (See, for example, table 2 of that paper.)

However, there are a number of problems with this procedure. The first is that the term *interaction* is misleading. It is actually *main effects* (for example the difference between naproxen and placebo) that it is necessary to prove are zero. This is important, since the situation is qualitatively different from a genuine test of interaction involving trials of different type, or patients of a different sort, as a stratum where the same treatment and control are being compared [27]. Under such circumstances it is a higher order effect (the interaction) that is assumed zero until proof to the contrary is available. Here it is an effect *that is of the same order* (placebo – naproxen) as the effect being examined (rofecoxib – naproxen) that is assumed to be zero.

Secondly, absence of evidence is not evidence of absence. Had Jüni et al wished to use the extremely large amount of information comparing rofecoxib to naproxen to produce a comparison to placebo they should have used the formal method of the putative placebo [28, 29].

Thirdly, it is clear that this procedure is easily abused. Given a great deal of data showing that treatment A (say) is better than control C (say), a small trial inadequately comparing treatment A to B would fail to show a significant 'interaction' and entitle one to pool B and A and use the combined data to prove that B was better than C.

I cannot leave this example, however, without pointing out that the fact that an advantage of naproxen over rofecoxib is not proof of a disadvantage of rofecoxib compared to placebo does not, in my view, let Merck, the developers of rofecoxib, off the hook. The gastric benefits of rofecoxib compared to naproxen were clearly shown in the same study [30] in 2000 that showed the cardiovascular benefits of naproxen compared to rofecoxib. From that point onwards patients should have been informed that one net benefit was being traded against another, whatever the explanation of either.

### Numerical slips

This is a sin to which I must plead guilty myself on occasion. Indeed, it is inherent to all scientific work that mistakes are made from time to time and are likely to be perpetuated. A beautifully described example comes in Primo Levi's essay 'Chromium' in *The Periodic Table* [31] in which, in a piece of chemical and statistical detection, he becomes suspicious of an unchallenged recipe that requires the addition of 'twenty-three drops of a certain reagent'. Eventually he finds an old file card bearing 'the direction to add "2 or 3" drops and not "23"' (p131).

In a discussion of Bayesian approaches to specifying prior distributions for random effect variances, Lambert et al [32] used the data from Kozyrskyj et al [9] to illustrate the problems with random effects analyses. I presented a frequentist alternative based on using proc nlmixed® of SAS®, but what I did not realise at the time was that I had coded the main effects of the trials inappropriately. (It was my colleague Jim Weir who subsequently discovered my mistake.) Thus, where I claimed a point estimate of 0.39 with a standard error of 0.20, a corrected analysis gives 0.42 with a standard error of 0.19. The difference is small in this case but that is at least partly a matter of luck.

### Incomplete reporting

This is a rather different problem. There are a number of reported meta-analyses where it is almost impossible to check the authors' results with certainty. In particular, where the method of statistical analysis is not specified and the data from the original studies are not fully available, a great deal has to be taken on faith. The problem then becomes analogous to one of hearsay evidence in court. What is asserted may well be true but it is very difficult to call anybody to account to establish its reliability.

Consider, for example, a paper by Hróbjartsson and Gøtzsche in the *NEJM* [33], which considers the efficacy of placebos. This is an extremely interesting investigation that I have referred to elsewhere very positively [34]. It points out that to establish the efficacy of placebo to the same degree of proof we require for standard treatments, we need trials which have a control group for the placebo, that is to say no treatment. The authors perform a meta-analysis of all the three armed trials (treatment, placebo, no treatment) they can find. An appendix, available on the website, gives results, but neither it nor the main paper describes the methods in sufficient detail for the results to be reproduced.

It might be thought that detailing the method is superfluous. In fact, however, there is a bewildering array of techniques possible for conducting a meta-analysis. In my paper *The Many Modes of Meta* [35] I identified three major data types: the same outcome in all studies with raw data available; the same outcome but summary data only; and different outcomes in different studies. I also identified at least nine different philosophical approaches that could be used to analyse summary measures. Many of these nine different approaches could be implemented in different ways. For example, in deciding to analyse binary data, one has to make a choice of risk scale: risk difference, relative risk or odds ratio. A much-cited paper by Newcombe [36] compares eleven different approaches to estimating confidence intervals for a risk difference for a single trial. In other words, there are dozens of ways that binary meta-analyses alone could be performed.
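Even the first of these choices, the risk scale, changes the number that is carried forward into the meta-analysis. The following minimal sketch computes the three measures for a single two-arm trial; the function name and the trial data are hypothetical.

```python
def risk_measures(events_t, n_t, events_c, n_c):
    """Return risk difference, relative risk and odds ratio for one two-arm trial."""
    risk_t = events_t / n_t
    risk_c = events_c / n_c
    odds_t = events_t / (n_t - events_t)
    odds_c = events_c / (n_c - events_c)
    return {
        "risk_difference": risk_t - risk_c,
        "relative_risk": risk_t / risk_c,
        "odds_ratio": odds_t / odds_c,
    }

# Hypothetical trial: 10/100 events on treatment, 20/100 on control.
measures = risk_measures(10, 100, 20, 100)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Each scale then has its own choice of variance estimator, weighting scheme and fixed or random effects model, which is why a report that does not state its method cannot be reproduced with confidence.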