### Summary and discussion of key findings

We re-analysed 190 ITS datasets using six statistical methods and compared estimates of immediate level change, slope change, their associated standard errors, confidence intervals and *p-*values, and the estimated lag-1 autocorrelation. We found important inconsistency in these estimates across the methods, such that the interpretation of the findings in some series may differ depending on the chosen method.

On average, there were small systematic differences in estimates of level change across the statistical methods, with OLS yielding slightly larger estimates, and REML slightly smaller estimates, compared with the other methods. For slope change, all methods yielded, on average, similar estimates. For some pairwise comparisons, the limits of agreement indicated that large differences could arise; this was particularly notable in the comparisons between REML and the other methods. There were systematic differences in the standard errors between most methods, and the limits of agreement again indicated that large differences could arise. ARIMA yielded systematically larger standard errors than all other methods, although the difference relative to REML was smaller. Of note, PW yielded, on average, standard errors similar to those of OLS. This was perhaps surprising given that PW adjusts for autocorrelation (which OLS does not), and that in a numerical simulation study investigating the performance of these methods, PW was shown to perform better than OLS for data series longer than approximately 24 points [12]. The results in our empirical investigation therefore likely reflect the influence of shorter data series.

The differences in point estimates and standard errors led to differences in the confidence interval widths, *p-*values, and statistical significance. Reflecting the pattern observed with standard errors, the ARIMA confidence intervals were wider compared with the other methods. However, REML with the Satterthwaite adjustment, which adjusts the t-distribution degrees of freedom used in the calculation of the confidence interval to account for uncertainty in estimation of the standard error, yielded the widest confidence intervals.

Our results show that naively basing conclusions on statistical significance could lead to qualitatively different interpretations. There was important discordance in statistical significance (at the 5% level) across many of the pairwise method comparisons. As expected, the discordance was greatest between the methods that yielded larger standard errors or adjusted for uncertainty in the estimation of the standard error (i.e. ARIMA, and REML with the Satterthwaite adjustment, respectively) and the other methods.

For long series (≥ 100 data points), all methods yielded similar estimates of autocorrelation. With short to medium length series (i.e. < 100 data points), the methods yielded different estimates, with the ARIMA and OLS autocorrelation estimates being substantially smaller than those from REML. For any ITS, we can conceive of the data collected and analysed in the study as a subset of a much longer series. Under the assumption of a stable true underlying autocorrelation, autocorrelation estimates from different series lengths should be similar. We generally found this to be the case for REML; however, for ARIMA and PW, estimates of autocorrelation were notably smaller in short series than in long series. This suggests that ARIMA and PW may be problematic for short series. The stability of the REML estimates across series lengths suggests it is the preferable estimator, which numerical simulation studies have shown to be the case [12, 26].
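The small-sample attenuation underlying this pattern can be demonstrated with a simple simulation of our own (not taken from the paper): the moment-based lag-1 autocorrelation estimate computed from detrended residuals is biased towards zero in short series, consistent with the smaller estimates we observed for short ITS.

```python
# Sketch: average lag-1 autocorrelation estimate from detrended residuals,
# for short (n = 12) vs long (n = 200) AR(1) series with true rho = 0.4.
# Series lengths, rho, and trend values are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

def mean_lag1_estimate(n, rho, reps=500):
    t = np.arange(n)
    X = np.column_stack([np.ones(n), t])  # detrend with intercept + slope
    ests = []
    for _ in range(reps):
        e = np.zeros(n)
        e[0] = rng.normal()
        for i in range(1, n):
            e[i] = rho * e[i - 1] + rng.normal()
        y = 1.0 + 0.1 * t + e
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        r = y - X @ beta
        ests.append(np.sum(r[1:] * r[:-1]) / np.sum(r**2))
    return float(np.mean(ests))

short_mean = mean_lag1_estimate(12, 0.4)   # biased well below 0.4
long_mean = mean_lag1_estimate(200, 0.4)   # close to 0.4
```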

The magnitude of the autocorrelation estimates from these public health ITS datasets, with a median of 0.23 (IQR 0.08 to 0.57; restricted to series with ≥ 100 data points, n = 31, REML method), indicates that autocorrelation should not be ignored in the design or analysis of ITS studies. Despite this, in nearly 50% (113/230) of the series included in the review, autocorrelation was not considered, or the method used to adjust for autocorrelation could not be determined [10]. Furthermore, only 1.5% (3/200) of studies provided evidence of a sample size calculation, and only two of these considered autocorrelation. Similar findings have been observed in other systematic reviews: Jandoc et al. [8] found that only 146/220 (66.4%) ITS studies reported testing for autocorrelation, Hudson et al. [11] found that 63/115 (55%) considered autocorrelation, Ewusie et al. [9] found that only 812/1365 (59.5%) checked for autocorrelation, and Hategeka et al. [27] similarly found that 66/120 (55%) checked or adjusted for autocorrelation.
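A standard large-sample result illustrates why an autocorrelation of this magnitude matters for planning: for the mean of an AR(1) series, the variance is inflated by approximately (1 + ρ)/(1 − ρ) relative to independent data. The sketch below (our own illustration) applies this factor at the median estimate of 0.23.

```python
# Sketch: approximate variance inflation for the mean of an AR(1) series,
# Var(ybar) ~ (sigma^2 / n) * (1 + rho) / (1 - rho) for large n.
def ar1_variance_inflation(rho: float) -> float:
    """Large-sample variance inflation factor relative to independent data."""
    return (1 + rho) / (1 - rho)

inflation = ar1_variance_inflation(0.23)  # roughly 1.6
```

At ρ = 0.23 the factor is about 1.6, so ignoring autocorrelation of this size in a sample size calculation would understate the required series length by a similar margin.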

### Strengths and limitations

There are several strengths to our study. First, the repository of ITS studies was randomly sampled from PubMed, thus the findings are likely to be generalisable to ITS studies indexed in this database. Second, we used a variety of methods to obtain the time series data (primarily digital data extraction [14]) to optimise the number of datasets retrieved, which resulted in a large percentage of datasets being retrieved (190/230; 83%). Finally, we investigated a range of statistical methods, including those commonly used in practice [8,9,10,11], and compared their results using metrics of interest to researchers (point estimates, standard errors, confidence intervals, *p-*values, statistical significance) to provide a comprehensive picture of how the methods compared.

One limitation of this study is that our findings may not be generalisable to ITS studies outside public health; this would be the case if, for example, influencing characteristics (e.g. series length) of ITS studies in public health differ from those of other disciplines. Another limitation is that although the methods we included are those commonly used in practice [10], other methods are available (for example, forecast [28] or Bayesian [29] methods). We purposely excluded the Cochrane-Orcutt method (which is used in practice [10]) because the PW method is essentially the Cochrane-Orcutt method except that it retains the first observation, and so is advantageous for short time series [19]. A further limitation is that we fitted a segmented linear regression model, assuming a continuous data type with lag-1 autocorrelation, to all datasets. This model may have differed from that used in the original publication and, furthermore, may not have been the best-fitting model. However, our re-analysis was not intended to address the specific research question(s) of the original publications, but rather to serve as a means of comparing different statistical methods.

### Implications for practice

Our research has shown that, in this set of ITS studies, the choice of statistical method can importantly affect the findings. This could lead to ‘bias in the selection of the reported result’ [30], where the reported result is chosen based on its magnitude, direction of effect, or statistical significance. Publication of protocols with detailed statistical analysis plans provides a mechanism for study authors to engender trust in the reported results (i.e. when there is consistency between the planned and used analysis methods). Protocols also allow readers to assess whether there were any changes to the analysis and, if so, whether those changes were legitimate.

Protocols should include specification of the primary analysis method, and may include a set of sensitivity analyses that allow examination of the robustness of the findings (e.g. level and slope change estimates and their confidence intervals) to the chosen analysis method. The primary analysis method needs to be carefully chosen considering characteristics of the ITS. For example, Turner et al. [12] found through a numerical simulation study that the length of the series is an important factor for deciding on the statistical method. Sensitivity analyses may be particularly important for short series, where estimates from the methods are likely to be most different.

While protocols and statistical analysis plans are now common for randomised trials [31], in our review of ITS studies, none of the 200 studies reported having a published protocol. Protocols can be published in a peer-reviewed journal, published on a pre-print server (e.g. medRxiv), or registered in an online registry (e.g. the Open Science Framework).

Finally, we recommend that time series data, including the dates of the interruptions and any transition periods, be made available alongside the publication. At a minimum, any plots of ITS data should follow graphing recommendations [32] to facilitate data extraction using digitising software [14].

### Implications for future research

Future research examining factors that may modify the magnitude of autocorrelation (e.g. type of outcome) would be useful. Knowledge of these factors would facilitate informed predictions about the likely magnitude of autocorrelation for an individual ITS study with particular characteristics, which could be used to more accurately determine the required sample size. Similar research has been undertaken investigating factors that modify intra-cluster correlations (ICCs) in cluster randomised trials, which has led to generalisable ‘rules-of-thumb’ on the selection of ICCs for sample size calculations in cluster trials [33].