Statistical methods for the analysis of adverse event data in randomised controlled trials: a comprehensive review and taxonomy

Background Statistical methods for the analysis of harm outcomes in randomised controlled trials (RCTs) are rarely used, and there is a reliance on simple approaches to display information such as in frequency tables. We aimed to identify whether any statistical methods had been specifically developed to analyse prespecified secondary harm outcomes and non-specific emerging adverse events (AEs).
Methods A scoping review was undertaken to identify articles that proposed original methods or the original application of existing methods for the analysis of AEs that aimed to detect potential adverse drug reactions (ADRs) in phase II-IV parallel controlled group trials. Methods where harm outcomes were the (co)-primary outcome were excluded. Information was extracted on methodological characteristics such as: whether the method required the event to be prespecified or could be used to screen emerging events; and whether it was applied to individual events or aggregate events. Each statistical method was appraised and a taxonomy was developed for classification.
Results Forty-four eligible articles proposing 73 individual methods were included. A taxonomy was developed and articles were categorised as: visual summary methods (8 articles proposing 20 methods); hypothesis testing methods (11 articles proposing 16 methods); estimation methods (15 articles proposing 24 methods); or methods that provide decision-making probabilities (10 articles proposing 13 methods). Methods were further classified according to whether they required a prespecified event (9 articles proposing 12 methods), or could be applied to emerging events (35 articles proposing 61 methods); and if they were (group) sequential methods (10 articles proposing 12 methods) or methods to perform final/one analyses (34 articles proposing 61 methods).
Conclusions This review highlighted that a broad of exist AE Immediate of interventions.
RCTs also provide invaluable information to allow evaluation of the harm profile of interventions. The comparator arm provides an opportunity to compare rates of adverse events (AEs) which enables signals for potential adverse drug reactions (ADRs) to be identified.

extension to CONSORT; the pharmaceutical industry standard from the Safety Planning, Evaluation and Reporting Team (SPERT); the extension of PRISMA for harms reporting in systematic reviews; and the joint pharmaceutical/journal editor collaboration guidance on reporting of harm data in journal articles. [6][7][8][9] Regulators including the European Commission and the Food and Drug Administration have also issued detailed guidance on the collection and presentation of AEs/Rs arising in clinical trials. [10][11][12] Whilst these recommendations and guidelines call for better practice in collection and reporting, they are limited in recommendations for improving statistical analysis practices. Journal articles, one of the main sources of dissemination of clinical trial results, predominantly rely on simple approaches such as tables of frequencies and percentages when reporting AEs. [4,13] In view of the lack of sophisticated statistical methods used for AE analysis we performed a review to investigate which statistical methods have been proposed in order to improve awareness and facilitate their use.

Aim
To identify and classify statistical methods that have been specifically developed or adapted for use in RCTs to analyse prespecified secondary harm outcomes and non-specific emerging AEs. We undertook a scoping review to identify methods for AE analysis in RCTs whose aim was to flag signals for potential ADRs. A scoping review was conducted to uncover all proposed methodology rather than a more structured systematic review as we did not aim to perform a quantitative synthesis and did not want to limit the scope of our results. [14]

Search strategy
One reviewer (RP) screened titles and abstracts of articles identified. Full text articles were scrutinised for eligibility and all queries regarding eligibility were discussed with at least one other reviewer (VC or OS).

Selection criteria
The review included articles that proposed original methods or the original application of existing methods developed for the analysis of AEs in phase II-IV trials that aimed to identify potential ADRs in a parallel controlled group setting. Methods where harm outcomes were the primary or co-primary outcome, such as dose-finding or risk-benefit methods, were excluded. Established methods designed to monitor efficacy outcomes, which could be used to monitor prespecified harm outcomes, such as the methods of O'Brien and Fleming or Lan and DeMets, were excluded. [15,16] Foreign language articles were translated where needed. Full eligibility criteria are specified in the review protocol, which can be accessed via the PROSPERO register for systematic reviews (https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=97442).

Data extraction
Data from eligible articles were extracted using a standardised pre-piloted data extraction form (RP) (additional file 2). Information was collected on methodological characteristics including: whether the method required the event to be prespecified or could be used to screen emerging events; whether it was applied to individual events or aggregate events; the data type it was applicable to, e.g. continuous, proportion, count, time-to-event; whether any test was performed; what, if any, assumptions were made; if any prior or external information could be incorporated; and what the output included, e.g. summary statistic, test statistic, p-value, plot etc. All queries were discussed with a second reviewer (VC) and clarified with a third reviewer (OS), if necessary.

Analysis
Results are reported as per the PRISMA extension for scoping reviews. [17,18] Each statistical method was appraised in turn and a taxonomy was developed for classification. Data analysis was primarily descriptive, and methods are summarised and presented by taxonomy.

Study selection
The search identified 11,118 articles. After duplicate articles were removed, 10,773 articles were screened; ten articles were identified from the reference lists of eligible articles and two articles were identified through the search of citations of eligible articles. Review of titles and abstracts reduced the number of articles for full review to 169. Review of full text articles resulted in a further 125 exclusions. The main reasons for exclusion after full text review were: the method presented was not original or the original application of a method for the analysis of AEs (33%); there was no comparison group or comparison made (23%); articles were published conference abstracts and therefore were not peer-reviewed and/or lacked sufficient detail to undergo a full review (14%). This left 44 eligible articles for inclusion that proposed 73 individual methods (Fig. 1).

Characteristics of articles
Articles were predominantly published by authors working in industry (n = 20; 45%); eight (18%) were published by academic authors and four (9%) were published by authors from the public sector. Eight (18%) articles were from an industry/academic collaboration, two (5%) from an academic/public sector collaboration, one (2%) from an industry/public sector collaboration and one (2%) from an industry/academic/public sector collaboration.

Taxonomy of statistical methods for AE analysis
Due to the number and variety of methods identified, we developed a taxonomy to classify methods. Four groups were identified (Fig. 2).

Visual summary methods
Methods that propose graphical approaches to view single or multiple AEs as the principal analysis method.

Hypothesis testing methods
Methods under the frequentist paradigm. These methods set up a testable hypothesis and use evidence against the null hypothesis in terms of p-values based on the data observed in the current trial.

Estimation methods
Methods that quantify distributional differences in AEs between treatment groups without a formal test.

Methods that provide decision making probabilities
Statistical methods under the Bayesian paradigm. The overarching characteristic of these methods is output of (posterior) predicted probabilities regarding the chance of a predefined threshold of risk being exceeded based on the data observed in the current trial and/or any relevant prior knowledge.
All methods were further sub-divided into whether they were for use on prespecified events, which could be listed in advance as harm outcomes of interest to follow-up and may already be known or suspected to be associated with the intervention, or followed for reasons of caution; or could be applied to emerging (not prespecified) events that are reported and collected during the trial and may be unexpected. Further, we made the distinction between (group) sequential methods (methods to monitor accumulating data from ongoing studies) and methods for final/one analysis (Fig. 3).
The number of articles and methods identified by type is provided in Table 1. Articles most frequently proposed estimation methods (15 articles proposing 24 methods), followed by hypothesis testing methods (11 articles proposing 16 methods). Ten articles proposed 13 methods to provide decision-making probabilities and eight articles proposed 20 visual summaries. The majority of articles developed methods for emerging events (35 articles proposing 61 methods) and final/one analysis (34 articles proposing 61 methods). Individual article classifications and brief summaries are presented in Table 2.

Summaries of methods by taxonomy
Visual summaries - Emerging events
The review identified eight articles published between 2001 and 2018 that proposed twenty unique methods to visually summarise harm data, including binary AEs and continuous laboratory (e.g. blood tests, culture data) and vital signs (e.g. temperature, blood pressure, electrocardiograms) data (additional file 3, table 3). [19][20][21][22][23][24][25][26] The majority of the proposed plots were designed to display summary measures of harm data (n = 14) and the remaining plots displayed individual participant data (n = 6). None of the plots required the event to be prespecified. Eight of the plots were designed to display multiple binary AEs; an example of one such plot is the volcano plot (Fig. 4). [19,27] The remaining plots were proposed to focus on a single event per plot, three of which proposed time-to-event plots and nine proposed plots to analyse emerging, individual, continuous harm outcomes such as laboratory or vital signs data. These plots can aid the identification of any treatment effects and identify outlying observations for further evaluation.
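As a rough illustration of how the coordinates of a volcano-style plot for multiple binary AEs might be derived, the sketch below computes, for each AE, the risk difference between arms and a two-sided Fisher exact p-value on a -log10 scale, with the total event count available for sizing each point. The choice of Fisher's exact test, the function names and the event counts are assumptions made for this example and are not taken from the reviewed articles; the plot as originally proposed is described in [19,27].

```python
from math import comb, log10


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(q for q in map(prob, range(lo, hi + 1)) if q <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def volcano_points(events, n_treat, n_ctrl):
    """Return (name, risk difference, -log10 p, total events) for each AE.

    `events` maps an AE name to (count on treatment, count on control),
    where counts are participants with at least one such event.
    """
    points = []
    for name, (x_t, x_c) in events.items():
        rd = x_t / n_treat - x_c / n_ctrl
        p = fisher_exact_two_sided(x_t, n_treat - x_t, x_c, n_ctrl - x_c)
        points.append((name, rd, -log10(p), x_t + x_c))
    return points


# Hypothetical counts for two AEs in a trial with 100 participants per arm.
example = volcano_points({"nausea": (12, 3), "headache": (5, 5)}, 100, 100)
```

Points far to the right and high on the vertical axis would be candidates for closer review; the p-values here are unadjusted and serve only to order the events.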

Hypothesis tests - Prespecified outcomes
Five articles published between 2000 and 2012 present seven methods to analyse prespecified harm outcomes under a hypothesis-testing framework (additional file 3, table 4). [28][29][30][31][32] Six of these methods were specifically designed and promoted for sequentially monitoring prespecified harm outcomes. Two of the methods incorporated an alpha-spending function (as originally proposed for efficacy outcomes) [16], two performed likelihood ratio tests, one used conditional power to monitor the futility of establishing safety and one proposed an arbitrary reduction in the traditional significance threshold when monitoring a harm outcome. [28][29][30]32] In addition, one method proposed a non-inferiority approach for the final analysis of a prespecified harm outcome. [31]

Hypothesis tests - Emerging
Six articles published between 1990 and 2014 suggest nine methods to perform hypothesis tests to analyse emerging AE data (additional file 3, table 5). [33][34][35][36][37][38] All of the methods were designed for a final analysis with one method incorporating an alpha-spending function allowing the method to be used to monitor ongoing studies. Methods are suggested for both binary and time-to-event data with several accounting for recurrent events.
Two methods proposed a p-value adjustment to account for multiple hypothesis tests to reduce the false discovery rate (FDR). [37,38] One article proposed two likelihood ratio statistics to test for differences between treatment groups when incorporating time-to-event and recurrent event data. [36] Three articles adopted multivariate approaches to undertake global likelihood ratio tests to detect differences in the overall AE profile. [33][34][35]

Estimation - Emerging
Fifteen articles published between 1991 and 2016 proposed 24 methods for emerging events (additional file 3, table 6). [39][40][41][42][43][44][45][46][47][48][49][50][51][52][53] These estimates reflect different characteristics of harm outcomes such as point estimates for incidence or duration, measures of precision around such estimates, or estimates of the probability of occurrence of events. They rely on subjective comparisons of distributional differences to identify treatment effects.
Point estimates such as the risk difference, risk ratio and odds ratio to compare treatment groups, with corresponding confidence intervals (CIs) such as the binomial exact CI (also known as the Clopper-Pearson CI), are simple approaches for AE analysis. [4,41] Three articles proposed alternative means to estimate CIs. [40,46,48] Eight articles provided methods to calculate estimates that take into account AE characteristics, such as recurrent events, exposure-time, time-to-event information, and duration, which can help develop a profile of overall AE burden. [39, 42-44, 47, 49, 50, 52, 53] Examples include the mean cumulative function, mean cumulative duration and parametric survival models estimating hazard ratios. Several of these methods incorporated plots that can highlight when differences between treatment groups start to emerge, which would otherwise be masked by single point estimates.
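The simple point estimates and the binomial exact (Clopper-Pearson) CI mentioned at the start of this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and counts are hypothetical, the exact interval is found by bisection on the binomial tail probabilities rather than by any particular package, and the code is only intended for modest sample sizes.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); adequate for modest n."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(g, target, tol=1e-10):
    """Bisection on [0, 1] for g(p) = target, where g increases with p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Exact CI for the proportion of n participants with x events."""
    # Lower limit solves P(X >= x | p) = alpha / 2.
    lower = 0.0 if x == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2)
    # Upper limit solves P(X <= x | p) = alpha / 2.
    upper = 1.0 if x == n else _solve_increasing(
        lambda p: 1 - binom_cdf(x, n, p), 1 - alpha / 2)
    return lower, upper


def effect_estimates(x_t, n_t, x_c, n_c):
    """Risk difference, risk ratio and odds ratio (treatment vs control).

    Assumes at least one event on control and at least one event-free
    participant on treatment, so that the ratios are defined.
    """
    p_t, p_c = x_t / n_t, x_c / n_c
    return {
        "risk_difference": p_t - p_c,
        "risk_ratio": p_t / p_c,
        "odds_ratio": (x_t * (n_c - x_c)) / (x_c * (n_t - x_t)),
    }


# Hypothetical AE counts: 12/100 on treatment, 3/100 on control.
estimates = effect_estimates(12, 100, 3, 100)
ci_treatment = clopper_pearson(12, 100)
```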
A Bayesian approach was developed to estimate the probability of experiencing different severity grades of each AE, accounting for the AE structure of events within body systems. [45] One article developed a score to indicate if continuous outcomes such as laboratory values were within normal reference ranges and to flag abnormalities. [51]

Decision making probabilities - Prespecified outcomes
Four articles suggested five unique Bayesian approaches to monitor prespecified harm outcomes (additional file 3, table 7). [54][55][56][57] The first paper was published in 1989 but no further research was published in this area until 2012; the last paper was published in 2016. Each of the methods incorporates prior knowledge through a Bayesian framework, outputting posterior probabilities that can be used to guide the decision whether to continue with the study based on the harm outcome.
Each of the methods was designed for use in interim analyses to monitor ongoing studies but could be used for the final analysis without modification. They could be implemented for continuous monitoring (i.e. after each observed event) or in a group sequential manner after several events have occurred. These methods require a prespecified event, an assumption about the prior distribution of this event, a 'tolerable risk difference' and an 'upper threshold probability' to be set at the outset of the trial. [56] At each analysis, the probability that the 'tolerable risk difference' threshold is crossed is calculated and if the predetermined 'upper threshold probability' is crossed then the data indicate a predefined unacceptable harmful effect.
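The decision rule described above can be illustrated with a minimal sketch. It assumes a conjugate beta-binomial model with independent Beta priors in each arm and approximates the posterior probability by Monte Carlo sampling; this model, the function names, the prior and the thresholds are all assumptions chosen for illustration and do not reproduce any specific method from the reviewed articles.

```python
import random


def prob_rd_exceeds(x_t, n_t, x_c, n_c, tolerable_rd,
                    prior=(1.0, 1.0), draws=20000, seed=1):
    """Posterior P(risk_treatment - risk_control > tolerable_rd).

    Each arm has a Beta(a, b) prior on its event risk (assumed here);
    the posterior is Beta(a + events, b + non-events).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    a, b = prior
    exceed = 0
    for _ in range(draws):
        p_t = rng.betavariate(a + x_t, b + n_t - x_t)
        p_c = rng.betavariate(a + x_c, b + n_c - x_c)
        if p_t - p_c > tolerable_rd:
            exceed += 1
    return exceed / draws


def flag_harm(x_t, n_t, x_c, n_c, tolerable_rd, threshold_prob):
    """True if the posterior probability crosses the upper threshold."""
    return prob_rd_exceeds(x_t, n_t, x_c, n_c, tolerable_rd) > threshold_prob


# Hypothetical interim data: 30/100 events on treatment, 5/100 on control,
# a tolerable risk difference of 0.05 and an upper threshold probability of 0.9.
signal = flag_harm(30, 100, 5, 100, tolerable_rd=0.05, threshold_prob=0.9)
```

The same function could be called at each interim look, whether after every event or after groups of events; in practice the sensitivity of the result to the assumed prior would need to be examined.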
Decision making probabilities - Emerging outcomes
Six articles published between 2004 and 2013 proposed eight unique Bayesian methods to analyse the body of emerging AE data (additional file 3, table 8). [58][59][60][61][62][63] Each of the methods utilises a Bayesian framework to borrow strength from medically similar events. Berry and Berry were the first, proposing a Bayesian three-level random effects model. [58] The method allows AEs within the same body system to be more alike and information can be borrowed both within and across systems. For example, within a body system a large difference for an event amongst events with much smaller differences will be shrunk toward zero. This work was extended to incorporate person-time adjusted incidence rates using a Poisson model and to allow sequential monitoring. [59,63] Two alternative approaches were also developed following similar principles. The output from all these models is the posterior probability that the relative measure does not equal zero or that the AE rate is greater on treatment than control.

Discussion
In our previous work we found evidence for sub-optimal analysis practice for AE data in RCTs. [4] In this review, we set out to identify statistical methods that had been specifically developed or adapted for use in RCTs and had therefore given full consideration to the nuances of harm data, building on the recent work of Wang et al. and Zink et al. [26,64] The aim was to improve awareness of appropriate methods. We found that despite the lack of use, there are many suitable and differing methods to undertake more sophisticated AE analysis. Some methods have been available since 1989 but most have been published since 2004. Based on our earlier work, personal experience and low citations of these articles, the uptake of these approaches appears to be minimal.
Issues of multiple testing, insufficient power and complex data structures are sometimes used to defend the continued practice of simple analysis approaches for AE data. We believe these do not justify the prevalent use of simplistic analysis approaches for AE analysis.
Under the frequentist paradigm, performing multiple hypothesis tests increases the likelihood of incorrectly flagging an event due to a chance imbalance. However, when analysing harm outcomes multiple hypothesis tests can be considered less problematic than for efficacy outcomes, if incorrectly flagging an event simply means that it undergoes closer monitoring in ongoing or future trials. [65] This is supported by the recently updated New England Journal of Medicine statistical guidelines to authors that state, "Because information contained in the safety endpoints may signal problems within specific organ classes, the editors believe that the type I error rates larger than 0.05 are acceptable".
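Where some control of multiplicity is still wanted, two of the hypothesis testing methods summarised earlier adjust p-values to reduce the false discovery rate. [37,38] The sketch below shows the generic Benjamini-Hochberg adjustment purely as an illustration of the idea; it is not the procedure proposed in those articles, and the function name and p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Work from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical unadjusted p-values for four AEs.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
# AEs whose adjusted p-value falls below a chosen FDR level would be flagged.
flagged = [i for i, q in enumerate(adj) if q < 0.10]
```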
Multiplicity is also not typically an issue for multivariate approaches that aim to identify global differences. Whilst these methods can be used to flag signals for differences in the overall harm profile and can help identify any differences in patient burden, a global approach to harm analysis could mask important differences at the event level. Therefore, such approaches should be considered in addition to more specific event-based analysis.
Whilst failure to consider the consequences of a lack of power can lead to inappropriate conclusions that a treatment is 'safe', prespecified analysis plans for prespecified events of interest would prevent post hoc, data-driven hypothesis testing. Nevertheless, most AE analysis is undertaken without a clear objective. Well-defined objectives setting out the purpose of the AE analysis to be undertaken for both prespecified and emerging events could help improve practice.
Visual summaries, estimation and decision-making probability methods identified in this review are typically not affected by issues of power and multiplicity and provide a multitude of useful, alternative ways to analyse AE data. For example, a well-designed graphic can be an effective way to communicate complex AE data to a range of audiences and help to identify signals for potential ADRs from the body of emerging AE data. [21] Similarly, estimation methods provide a means to identify distributional differences in the AE profile between treatment groups and can incorporate information on, for example, time of occurrence or recurrent events, which is often ignored in AE analysis. However, both approaches rely on visual inspections and subjective opinions regarding a decision whether to flag a signal for potential ADRs. As such, both provide a useful means to support AE analysis, but it may be appropriate to use them in combination with more objective approaches such as statistical tests or Bayesian decision-making methods, which provide clear output for interpretation to flag differences between treatment groups.
Existing knowledge on the harm profile of a drug can be used to prespecify known harm outcomes for monitoring and using an appropriate Bayesian decision-making method allows formal incorporation of existing information. Such analyses can provide evidence to aid decisions about the conduct of ongoing trials or future trials based on the emerging harm profile. Incorporating prior and/or accumulating knowledge into ongoing analyses in this way ensures an efficient use of the existing evidence. Like the hypothesis test approaches, the output can be used to objectively make decisions about whether to flag events as potential ADRs, but it does not suffer to the same extent from issues of insufficient power or multiplicity. [66][67][68] However, such methods are reliant on the prior information incorporated, so sensitivity of the results to the prior assumptions should be explored and careful consideration of the appropriateness of the source of prior knowledge and its applicability is needed. [69]
The most appropriate method for analysis will depend on whether events have been prespecified or are emerging and the aims of the analysis. Statistical analysis strategies could be prespecified at the outset of a trial for both prespecified and emerging events as we would for efficacy outcomes. There is a multitude of specialist methods for the analysis of AEs and there is no one correct approach; rather, a combination of approaches should be considered. An unwavering reliance on tables of frequencies and percentages is not necessary given the alternative methods that exist, and we urge statisticians and trialists to explore the use of more specialist analysis methods for AE data from RCTs.
We have not examined the performance of the individual methods included in this review, so we cannot make quantitative comparisons and as such have avoided making recommendations of specific methods to use. This review builds on existing work to provide a comprehensive overview and audit of statistical methods available to analyse harm outcomes in clinical trials. [20,21]

Conclusions
There are many challenges associated with assessing and analysing AE data in clinical trials. This review revealed that there is a broad range of suitable methods available to overcome some of these challenges but evidence of application in clinical trials analysis is limited. The reasons for low uptake are unknown but warrant further investigation so that we can better understand the potential barriers to implementation and raise awareness of these and new methods where appropriate, to improve the analysis of AEs in RCTs.

Availability of data and materials
All data generated and analysed during this study are either included in this article and its supplementary information files and/or are available from the corresponding author on reasonable request.

Competing interests
All authors declare that they have no competing interests.

Funding
RP is funded by a NIHR doctoral research fellowship to undertake this work (Reference: DRF-2017-10-131). This paper presents independent research funded by the National Institute for Health Research (NIHR). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care. The funder played no role in the design of the study or collection, analysis, or interpretation of data, or in writing the manuscript.

Authors' contributions
RP & VC conceived the idea for this review. RP conducted the search, carried out data extraction, analysis, and wrote the manuscript. RP, VC and OS participated in in-depth discussions and critically appraised the results. VC undertook critical revision of the manuscript and supervised the project. OS performed critical revision of the manuscript. All authors have read and approved the manuscript.

Classification terminology