Comparison of six statistical methods for interrupted time series studies: empirical evaluation of 190 published series

Turner, Simon L.; Karahalios, Amalia; Forbes, Andrew B.; Taljaard, Monica; Grimshaw, Jeremy M.; McKenzie, Joanne E.

doi:10.1186/s12874-021-01306-w

Research
Open access
Published: 26 June 2021


BMC Medical Research Methodology volume 21, Article number: 134 (2021)


Abstract

Background

The Interrupted Time Series (ITS) is a quasi-experimental design commonly used in public health to evaluate the impact of interventions or exposures. Multiple statistical methods are available to analyse data from ITS studies, but no empirical investigation has examined how the different methods compare when applied to real-world datasets.

Methods

A random sample of 200 ITS studies identified in a previous methods review was included. Time series data from each of these studies were sought. Each dataset was re-analysed using six statistical methods. Point and confidence interval estimates for level and slope changes, standard errors, p-values and estimates of autocorrelation were compared between methods.

Results

From the 200 ITS studies, including 230 time series, 190 datasets were obtained. We found that the choice of statistical method can importantly affect the level and slope change point estimates, their standard errors, width of confidence intervals and p-values. Statistical significance (categorised at the 5% level) often differed across the pairwise comparisons of methods, ranging from 4 to 25% disagreement. Estimates of autocorrelation differed depending on the method used and the length of the series.

Conclusions

The choice of statistical method in ITS studies can lead to substantially different conclusions about the impact of the interruption. Pre-specification of the statistical method is encouraged, and naive conclusions based on statistical significance should be avoided.


Background

Randomised trials are the gold standard design for investigating the impact of public health interventions; however, they cannot always be used. For example, interventions that impact an entire country, or those that have occurred historically, may preclude the ability to randomise or include control groups [1]. An alternative non-randomised design that may be considered in such circumstances is an interrupted time series (ITS) [2,3,4]. In an ITS design, data are collected at multiple time points both before and after an interruption (i.e. an intervention or exposure). Modelling of the data in the pre-interruption period allows estimation of the underlying secular trend, which when modelled correctly and extrapolated into the post-interruption time period, yields a counterfactual for what would have occurred in the absence of the interruption. Differences between the counterfactual and observed data at various points post interruption can be estimated (e.g. immediate and long-term effects), having accounted for the underlying secular trend.

A characteristic of data collected over time is that the data points tend to be correlated [5]. This correlation – referred to as autocorrelation or serial correlation – can be positive (whereby data points close together in time are more similar than data points further apart) or, infrequently, negative (whereby data points close together are more dissimilar than data points further apart). Autocorrelation may be observed between consecutive data points or over longer periods of time (e.g. seasonal effects). This characteristic of the data needs to be considered when designing and analysing ITS studies. If positive autocorrelation is present, larger sample sizes are required to provide power at the desired level [6] and if autocorrelation is not accounted for in the statistical analysis, standard errors may be underestimated [7].

Segmented linear regression models are often fitted to ITS data using a range of estimation methods [8,9,10,11]. Commonly ordinary least squares (OLS) is used to estimate the model parameters [10]; however, the method does not account for autocorrelation. Other statistical methods are available that attempt to account for autocorrelation in different ways (e.g. correction of standard errors, directly modelling the errors).

Turner et al. undertook a statistical simulation study examining the performance of statistical methods for analysing ITS data, where the methods were those commonly used in practice or had shown potential to perform well [12]. This simulation study provided insight into how these statistical methods performed under different scenarios, including different level and slope changes, varying magnitudes of underlying autocorrelation and series lengths. In combination with these findings, evidence from an empirical evaluation can provide a more comprehensive understanding of how the methods operate. In particular, empirical evaluations – in which methods are applied to real-world data sets and the results are compared – allow assessment of whether the choice of method matters in practice, and the degree to which they may do so.

To our knowledge, there has been no study that has empirically compared different methods for analysing ITS data when applied to a large sample of real-world data sets. We therefore undertook such an evaluation, where we aimed to compare level and slope change estimates, their standard errors, confidence intervals and p-values, and estimates of autocorrelation, obtained from the set of statistical methods used in the Turner et al. simulation study [12].

Methods

Repository of ITS studies

A sample of 200 ITS studies identified in a previous methods review were eligible for inclusion in the current study [10]. In brief, ITS studies were identified from a search of the bibliometric database PubMed between the years 2013 and 2017. Studies were stratified by publication year, assigned random numbers, sorted (in ascending order) by these numbers, and screened until we identified 40 studies per year that met the eligibility criteria. The criteria for inclusion were: 1) studies in which there were at least two segments separated by a clearly defined interruption with at least three points in each segment; 2) observations were collected on a group of individuals at each time point; and 3) the study investigated the impact of an interruption that had public health implications.
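The sampling procedure described above can be sketched as follows. This is an illustrative reconstruction only, not the review's actual code: it assumes the target of 40 eligible studies applies within each publication year (5 years × 40 = 200), and the record layout, the `is_eligible` predicate and the function name are hypothetical.

```python
import random


def sample_by_year(studies, is_eligible, per_year=40, seed=0):
    """Stratify by year, order each stratum by a random number, and
    screen in that order until `per_year` eligible studies are found."""
    rng = random.Random(seed)
    by_year = {}
    for study in studies:
        by_year.setdefault(study["year"], []).append(study)

    selected = []
    for year in sorted(by_year):
        # Assign a random number to every study, then sort ascending.
        numbered = [(rng.random(), study) for study in by_year[year]]
        numbered.sort(key=lambda pair: pair[0])
        found = 0
        for _, study in numbered:
            if found == per_year:
                break
            if is_eligible(study):  # stands in for manual screening
                selected.append(study)
                found += 1
    return selected


# Hypothetical search results: 100 records per year, half of them eligible.
records = [{"id": i, "year": year}
           for year in range(2013, 2018) for i in range(100)]
sample = sample_by_year(records, is_eligible=lambda s: s["id"] % 2 == 0)
print(len(sample))
```

Screening in random order within each stratum means the retained studies form a random sample of the eligible studies for that year, without having to screen every search result.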

For each of the 200 studies, the first reported ITS of each outcome type (binary, continuous, count or proportion) was included, resulting in 230 ITS. Data were collected on the study characteristics and design of the ITS studies, types of outcomes, models used, statistical methods employed, effect measures reported, and the properties of included graphs. Further details of the study methods are available in the study protocol and results papers [10, 13].

Methods to obtain time series data

Time series data from the included studies were obtained using three methods. First, we collated datasets that were reported in the published paper or its supplement (e.g. time series data reported in tables, or as text files). Second, we contacted all authors for whom we were able to obtain contact details to request datasets. We requested only aggregate level data (i.e. not individual participant data) and in the circumstance where a study included multiple series, we only sought data from the first time series reported in the paper to reduce respondent burden. We sent an initial email request on 13 December 2018 and a follow-up email on 24 January 2019. Third, we digitally extracted datasets from published graphs using the software WebPlotDigitizer [14]. This graphical data extraction tool has been found to accurately estimate the position of points on a graph [15].

If multiple datasets from the above methods were available for a particular time series, we selected the dataset generated using the following hierarchy: (i) published data, (ii) contact with authors, and (iii) digitally extracted. We checked the data provided by authors against the information reported in the publication. Where there was a discrepancy, we re-contacted the authors to query the provided data.

Interrupted time series model

We fitted segmented linear regression models to each dataset using the parameterisation of Huitema and McKean [7] (Eq. 1, Fig. 1):

$${Y}_{t}={\beta }_{0}+{\beta }_{1}t+{\beta }_{2}{D}_{t}+{\beta }_{3}\left[t-{T}_{I}\right]{D}_{t}+{\varepsilon }_{t}$$

(1)

where ${Y}_{t}$ represents the outcome that is measured at time point t of N time points (1 to ${n}_{1}$ measurements during the pre-interruption stage, and ${n}_{1}+1$ to ${n}_{2}$ measurements in the post-interruption stage), with the interruption occurring at time ${T}_{I}$. ${D}_{t}$ is an indicator variable that represents the post-interruption interval: coded as 0 in the pre-interruption period, and as 1 in the post-interruption period. The model parameters (the $\beta$s) represent the baseline intercept (${\beta }_{0}$), the pre-interruption slope (${\beta }_{1}$), the change in level at the interruption (${\beta }_{2}$), and the change in slope (${\beta }_{3}$). The model can be extended to accommodate more than one interruption with the inclusion of terms representing additional segments.

The error term ${\varepsilon }_{t}$ allows for deviation from the fitted model. In a first order (lag-1) autocorrelation model, the error at time point t (${\varepsilon }_{t}$) is influenced by only the previous data point as ${\varepsilon }_{t}=\rho {\varepsilon }_{t-1}+{w}_{t}$, where $\rho$ is the magnitude of autocorrelation (ranging from -1 to 1) and ${w}_{t}$ represents normally distributed “white noise” ${w}_{t}\sim N\left(0,{\sigma }^{2}\right)$. Longer lags can be modelled or accommodated, but here we restrict our focus to lag-1.
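To illustrate the lag-1 error structure, the sketch below simulates errors following $\varepsilon_t = \rho\varepsilon_{t-1} + w_t$ and recovers $\rho$ with the simple lag-1 sample autocorrelation. It is an assumption-laden illustration rather than any of the estimators compared in this study: the function names are hypothetical, and the first error is drawn from the stationary distribution, which is a choice made here and not stated in the text.

```python
import random


def simulate_ar1_errors(n, rho, sigma, seed=0):
    """Simulate n errors with e[t] = rho * e[t-1] + w[t], w ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    # Assumption: start from the stationary distribution of the process.
    errors = [rng.gauss(0.0, sigma / (1.0 - rho ** 2) ** 0.5)]
    for _ in range(n - 1):
        errors.append(rho * errors[-1] + rng.gauss(0.0, sigma))
    return errors


def lag1_autocorrelation(values):
    """Lag-1 sample autocorrelation (moment estimator)."""
    mean = sum(values) / len(values)
    dev = [v - mean for v in values]
    numerator = sum(a * b for a, b in zip(dev[1:], dev[:-1]))
    denominator = sum(d * d for d in dev)
    return numerator / denominator


errors = simulate_ar1_errors(n=20000, rho=0.6, sigma=1.0, seed=1)
rho_hat = lag1_autocorrelation(errors)
print(round(rho_hat, 2))
```

With a long simulated series the estimate sits close to the true value. With the short series typical of ITS studies it would be much more variable, which is relevant to the later comparison of autocorrelation estimates by series length.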

Interrupted time series analysis methods

Six statistical methods were used to analyse the ITS datasets assuming first order autocorrelation (lag-1) (Table 1). The methods were chosen because they have commonly been used in practice [8,9,10,11] or because they have been shown (through numerical simulation) to have improved confidence interval coverage relative to the methods commonly used in practice [12]. The methods were:

ordinary least squares regression (OLS), which provides no adjustment for autocorrelation, and in the presence of positive autocorrelation will yield standard errors that are too small [16];
OLS with Newey-West standard errors (NW), which yield OLS estimates of the model regression parameters, but with standard errors that are adjusted for autocorrelation [17];
Prais-Winsten (PW), a generalised least squares method, which provides an extension of OLS where the assumption of independence across observations is relaxed [18, 19];
restricted maximum likelihood (REML) (with and without the small sample Satterthwaite approximation (Satt)), which addresses bias in maximum likelihood estimators of variance components by separating the log-likelihood into two terms (one of which is only dependent on variance parameters) and using the appropriate number of degrees of freedom (d.f.) [20, 21]; and,
autoregressive integrated moving average (ARIMA), which explicitly models the influence of previous time points by including regression coefficients from lagged values of the dependent variable and errors [22].
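To make the contrast between OLS and a generalised least squares approach concrete, the sketch below fits the segmented model by OLS and then refits it after applying the lag-1 Prais-Winsten transformation. It is a minimal illustration under stated assumptions, not the analysis code used in this study (which was written in Stata): $\rho$ is supplied as a fixed value, whereas PW estimates it from the residuals, and the helper names and example coefficients are hypothetical.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(X, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve_linear(xtx, xty)


def prais_winsten_transform(X, y, rho):
    """Transform the data so that lag-1 autocorrelated errors become
    uncorrelated; the first observation is rescaled, not dropped."""
    scale = math.sqrt(1.0 - rho ** 2)
    X_star = [[scale * v for v in X[0]]]
    y_star = [scale * y[0]]
    for t in range(1, len(y)):
        X_star.append([a - rho * b for a, b in zip(X[t], X[t - 1])])
        y_star.append(y[t] - rho * y[t - 1])
    return X_star, y_star


# Columns: intercept, t, D_t, (t - T_I) * D_t, with T_I = 7 and N = 12.
X = [[1.0, float(t), float(t >= 7), float((t - 7) * (t >= 7))]
     for t in range(1, 13)]
beta = [10.0, 0.5, 3.0, -0.25]  # hypothetical true coefficients
y = [sum(b * v for b, v in zip(beta, row)) for row in X]  # noise-free outcome

b_ols = ols(X, y)
X_star, y_star = prais_winsten_transform(X, y, rho=0.4)
b_pw = ols(X_star, y_star)
```

On noise-free data both fits return the same coefficients. With autocorrelated noise, the point estimates and their standard errors generally differ between the two fits, which is the kind of difference examined in this study.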

Table 1 Statistical methods, adjustments for autocorrelation and abbreviations used


Analysis of the ITS datasets

We implemented the segmented linear regression model (Eq. 1, section "Interrupted time series model") by setting up datasets for each ITS study with the following variables:

outcome variable;
time variable t, beginning at 1 and incrementing by 1 up to time point N;
an interruption time indicator ${D}_{t}$, coded 0 pre-interruption and 1 post-interruption; and,
a slope change variable $\left[t-{T}_{I}\right]{D}_{t}$, equal to zero at the time of the interruption (${T}_{I}$) and incrementing by 1 up to time point N.
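The variables listed above can be constructed as in the following sketch. It assumes that the first post-interruption observation falls at $t = T_I$, so that the slope change variable is zero at that point; the function and key names are hypothetical.

```python
def its_design(n_total, t_interrupt):
    """Build the time, indicator and slope change variables for Eq. 1."""
    rows = []
    for t in range(1, n_total + 1):
        d = 1 if t >= t_interrupt else 0  # 0 pre-interruption, 1 post
        rows.append({
            "t": t,
            "D": d,
            "slope_change": (t - t_interrupt) * d,  # 0 at T_I, then 1, 2, ...
        })
    return rows


rows = its_design(n_total=8, t_interrupt=5)
for row in rows:
    print(row)
```

Because the slope change variable is zero throughout the pre-interruption period and at $T_I$, the coefficient on $D_t$ is the change in level at the interruption itself and not at time zero.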

We used information provided in the corresponding manuscript to determine the interruption time. In studies with multiple interruptions, we only included the first interruption (and adjacent periods). In studies with a transition period, we extended the model to include an additional segment for the transition period; however, when calculating the level and slope changes, we ignored this segment (further details available in Additional file 3: Appendix 1).

We analysed each dataset using the six estimation methods described in the section "Interrupted time series analysis methods". For REML with the Satterthwaite approximation, when the computed degrees of freedom were less than two, we substituted these with the value two to avoid overly conservative confidence limits and hypothesis tests. We only included analyses for which the estimate of autocorrelation was strictly between -1 and +1. The datasets were analysed in Stata 15 [23] (see Additional file 1 for analysis code).

Comparison of results from the different ITS analysis methods

The results of interest were point estimates of the immediate level change (β₂) and slope change (β₃), their associated standard errors, confidence intervals and p-values, and the estimated lag-1 autocorrelation. Across the ITS studies, different outcomes were measured, making it necessary to standardise the estimates of slope and level change for comparison across the datasets. This was achieved for each dataset by dividing parameter estimates by the root mean square error (RMSE) estimated from a segmented linear regression model using OLS. We also standardised the direction of effect. This was achieved for each pairwise comparison of methods by multiplying both estimates by -1 if the first method’s estimate was less than zero. We also repeated these analyses standardising to the direction of the second method’s estimate.
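The two standardisation steps can be sketched as follows. The residuals and estimates are made-up values, the function names are hypothetical, and the RMSE is assumed to use the residual degrees of freedom (number of points minus the four model parameters), which the text does not state explicitly.

```python
import math


def rmse(residuals, n_params=4):
    """Root mean square error of an OLS fit.
    Assumption: divisor is the residual degrees of freedom (n - n_params)."""
    dof = len(residuals) - n_params
    return math.sqrt(sum(r * r for r in residuals) / dof)


def standardise(estimate, scale):
    """Express an estimate in units of the OLS RMSE for that dataset."""
    return estimate / scale


def align_direction(first, second):
    """Multiply both estimates by -1 if the first method's estimate is < 0."""
    if first < 0:
        return -first, -second
    return first, second


residuals = [1.0, -1.0, 2.0, -2.0, 1.0, -1.0, 0.5, -0.5]  # hypothetical
scale = rmse(residuals)
level_a = standardise(-3.0, scale)  # hypothetical estimate, first method
level_b = standardise(-2.0, scale)  # hypothetical estimate, second method
aligned = align_direction(level_a, level_b)
print(aligned)
```

Aligning on the first method keeps the sign of the difference between the two methods interpretable across datasets whose interventions act in opposite directions.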

Estimates of level and slope changes, and their standard errors

We compared the level and slope change point estimates with their standard errors using visual displays and tabulation. Specifically, we used Bland Altman scatter plots [24] to assess pairwise agreement in the results (standardised estimates of level change, slope change, and their standard errors) between the different statistical methods. For each pairwise comparison, the difference in the two estimates was plotted against the average of the two estimates (e.g. ‘difference in estimates of level change from OLS and PW’ versus ‘average of estimates of level change from OLS and PW’). In the case of the standard errors, we first log-transformed these to remove the relationship between the variability of the differences and the magnitude of the standard errors [24]. The mean difference and limits of agreement (average difference $\pm$ 1.96 $\times$ standard deviation of the differences) were calculated and overlaid on the plots. These pairwise comparisons were displayed in a matrix of plots to show comparisons of each method with all others. Plots in the top triangle of the matrix illustrate agreement between the effect estimates (either level change or slope change), and plots in the bottom triangle illustrate the agreement between the standard errors.
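The agreement statistics described above can be computed as in this sketch. It assumes the sample (n − 1) standard deviation of the differences, and that results for log-transformed standard errors are back-transformed to ratios; both are choices made here, and the function names are hypothetical.

```python
import math
import statistics


def bland_altman_points(a, b):
    """(average, difference) pairs for a Bland-Altman scatter plot."""
    return [((x + y) / 2.0, x - y) for x, y in zip(a, b)]


def limits_of_agreement(a, b):
    """Mean difference and mean difference +/- 1.96 SD of the differences."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # assumption: sample (n - 1) SD
    return mean_diff, mean_diff - 1.96 * sd, mean_diff + 1.96 * sd


def ratio_limits_of_agreement(se_a, se_b):
    """Same calculation on log standard errors, returned as ratios."""
    log_a = [math.log(v) for v in se_a]
    log_b = [math.log(v) for v in se_b]
    return tuple(math.exp(v) for v in limits_of_agreement(log_a, log_b))


# Hypothetical standardised level change estimates from two methods.
method_a = [1.20, 0.80, 2.10, 1.50, 0.40]
method_b = [1.00, 0.90, 1.80, 1.60, 0.30]
mean_diff, lower, upper = limits_of_agreement(method_a, method_b)

# Hypothetical standard errors from the same two methods.
gmr, ratio_lower, ratio_upper = ratio_limits_of_agreement(
    [0.50, 0.40, 0.90], [0.40, 0.35, 0.60])
```

Working on the log scale turns the mean difference into a geometric mean ratio, which is how the standard error comparisons are reported in the results.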

We also investigated whether series length impacted the difference in level and slope change estimates between each pair of methods. A matrix of scatterplots of the differences versus the (log) length of series (overlaid with a local regression (LOESS) smoothed curve) for each pairwise method comparison was used to visually examine this relationship.

Confidence Intervals

We visually compared the width of the confidence intervals from the different statistical methods. For each dataset and pairwise comparison, a ratio of the confidence interval widths from the two methods was calculated and then scaled so that the comparison method confidence interval spanned -0.5 to 0.5.
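One possible reading of this scaling is sketched below: the comparison method's interval is fixed at -0.5 to 0.5 (width 1), and the other method's interval is drawn symmetrically with a width equal to the ratio of the two widths. This interpretation, the centring at zero and the function name are assumptions, since the text does not give the calculation in detail.

```python
def scaled_interval(width_method, width_comparison):
    """Interval for a method after scaling so that the comparison
    method's confidence interval spans -0.5 to 0.5."""
    ratio = width_method / width_comparison
    return -0.5 * ratio, 0.5 * ratio


# Hypothetical widths: the method's interval is twice as wide.
print(scaled_interval(width_method=2.4, width_comparison=1.2))
```

Under this reading, a plotted interval extending beyond ±0.5 indicates a wider confidence interval than the comparison method, and one falling inside ±0.5 a narrower one.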

p-values

We compared the p-values of the effect estimates between the methods by categorising the p-values based on commonly used levels of statistical significance. First, we categorised the p-values at the 5% level of statistical significance (i.e. < 5%, ≥ 5%), and second, we categorised p-values using a finer gradation (i.e. p-value < 1%, 1% ≤ p-value < 5%, 5% ≤ p-value < 10%, p-value ≥ 10%). For each pairwise comparison between methods, we calculated the percentage of datasets where there was agreement in the categories of statistical significance (i.e. the percentage of datasets where the p-value for the effect estimate was < 0.05 for both methods or the p-value was ≥ 0.05 for both methods). Further, we calculated kappa statistics to assess agreement beyond chance. We use the following adjectives when describing the results: 0.41–0.6 moderate agreement, 0.61–0.8 substantial agreement, 0.81–1.0 almost perfect agreement [25].
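The categorisation and agreement calculations can be sketched as follows. The p-values are made up, the function names are hypothetical, and the kappa shown is the unweighted Cohen's kappa, which is an assumption because the text does not say whether a weighted version was used for the finer gradation.

```python
def significance_category(p, cutpoints=(0.05,)):
    """Index of the category containing p, e.g. 0 for p < 0.05, 1 otherwise."""
    return sum(p >= c for c in cutpoints)


def percent_agreement(cats_a, cats_b):
    """Percentage of datasets placed in the same category by both methods."""
    same = sum(a == b for a, b in zip(cats_a, cats_b))
    return 100.0 * same / len(cats_a)


def cohen_kappa(cats_a, cats_b):
    """Unweighted Cohen's kappa; undefined if expected agreement is 1."""
    n = len(cats_a)
    observed = sum(a == b for a, b in zip(cats_a, cats_b)) / n
    labels = set(cats_a) | set(cats_b)
    expected = sum((cats_a.count(k) / n) * (cats_b.count(k) / n)
                   for k in labels)
    return (observed - expected) / (1.0 - expected)


# Hypothetical p-values for level change from two methods on four datasets.
p_a = [0.01, 0.04, 0.20, 0.06]
p_b = [0.03, 0.07, 0.30, 0.02]
cats_a = [significance_category(p) for p in p_a]
cats_b = [significance_category(p) for p in p_b]
agreement = percent_agreement(cats_a, cats_b)
kappa = cohen_kappa(cats_a, cats_b)

# Finer gradation: < 1%, 1% to < 5%, 5% to < 10%, >= 10%.
fine_a = [significance_category(p, (0.01, 0.05, 0.10)) for p in p_a]
```

In this made-up example the methods agree on half of the datasets, which is exactly what chance would produce given the marginal totals, so kappa is zero. This shows why percentage agreement alone can overstate concordance.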

Autocorrelation coefficient estimates

We calculated and tabulated medians and interquartile ranges for estimates of lag-1 autocorrelation for the three methods that yield these estimates (ARIMA, PW, REML). The summary statistics are reported for all series as well as being restricted to series with ≥ 24 points and series with ≥ 100 points, in order to assess whether series length impacted the magnitude of the estimates. A scatterplot of autocorrelation versus (log) length of series (overlaid with a LOESS curve) was used to visually examine this relationship. A further scatter plot was generated that depicted the REML estimates of autocorrelation along with their confidence intervals.
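A sketch of this tabulation is given below, using made-up autocorrelation estimates and series lengths. The function name is hypothetical, and the quartiles come from Python's `statistics.quantiles` with its default method, which may not match the percentile definition used in the original Stata analysis.

```python
import statistics


def summarise_autocorrelation(series, min_length=1):
    """Median and quartiles of autocorrelation estimates, restricted to
    series with at least `min_length` points.
    `series` is a list of (estimate, number_of_points) pairs."""
    values = [rho for rho, n_points in series if n_points >= min_length]
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {"n": len(values),
            "median": statistics.median(values),
            "iqr": (q1, q3)}


# Hypothetical (estimate, series length) pairs for one method.
estimates = [(0.10, 12), (0.25, 30), (-0.05, 18),
             (0.40, 120), (0.30, 48), (0.55, 150)]

summary_all = summarise_autocorrelation(estimates)
summary_24 = summarise_autocorrelation(estimates, min_length=24)
summary_100 = summarise_autocorrelation(estimates, min_length=100)
```

Repeating the summary at each length threshold shows whether the typical estimate shifts as shorter series are excluded.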

Results

Time series dataset acquisition

Of the 230 ITS identified in the review [10], we obtained 10/230 (4%) datasets directly from the publication (e.g. time series data reported in tables), 50/230 (22%) through email contact with the authors, and 184/230 (80%) through digital data extraction. For some series (n = 47), multiple datasets from the different sources were available (Fig. 2). Using our hierarchy for selecting the source of the dataset when multiple datasets were available resulted in 190 unique datasets, with 8/190 (4%) sourced directly from the publication, 45/190 (24%) through email contact with authors, and 137/190 (72%) from digital data extraction. We were unable to obtain 40 of the 230 ITS included in the review because the data were not reported in the paper, could not be obtained from authors, or could not be digitally extracted. Five of the datasets obtained from the authors could not be used: three due to errors in the data and two because the data were too complex to fit a simple segmented linear regression model. Forty-six of the datasets could not be digitally extracted: 27 studies included graphs with insufficient resolution to digitally extract data; 8 studies had no graph; 8 studies had summary data only (e.g. a summary graph showing a small number of annual figures was provided when monthly data were used in the analysis); and 3 studies had graphs but did not plot data points.

Characteristics of the included ITS

The characteristics of the ITS studies with available datasets for re-analysis are compared to all 200 ITS studies in Table 2. No major differences were found. The types of study interventions were similar, as were the types of time intervals. The number of time points per series was lower in the studies with available datasets than in all ITS studies (median 41, IQR (25, 71) versus 48, IQR (30, 100)). The length of the segments used to calculate the estimates for the first interruption was slightly shorter in the series with available data than in all series (median 16, IQR (10, 28) versus 18, IQR (10, 34)).

Table 2 Characteristics of interrupted time series studies and series


Comparison of results from the different ITS analysis methods

Estimates of level and slope changes, and their standard errors

The median values of the absolute value of the standardised effect estimates for level change ranged from 1.22 to 1.49 across the statistical methods (Table 3). For slope change, the median value of the absolute value of the standardised effect estimates was 0.13 for all statistical methods (Table 3). Pairwise comparisons were limited to a minimum of 171 datasets because at least one statistical method failed to converge, failed to yield standard errors or estimated the magnitude of autocorrelation to be outside the range -1 to +1 in 19 of the datasets (Table 4).

Table 3 Effect estimate summaries


Table 4 Number of available comparisons for the statistical methods investigated (n = 190)


Pairwise comparisons of level change, slope change, and their standard errors for each of the five methods were made (Figs. 3 and 4). REML with the Satterthwaite approximation was excluded from these comparisons because it only adjusts the width of the confidence intervals, and not the standard errors. There were small systematic differences in estimates of level change in the pairwise comparisons between the methods: REML had slightly smaller and OLS slightly larger effect estimates than the other methods (Fig. 3, top triangle, and Table 5). The largest limits of agreement between all methods (REML vs OLS) were ±1.11. As expected, there was no difference in the standardised level change estimates between OLS and NW (since they use the same estimator for ${\beta }_{2}$) and a very small difference between PW and ARIMA (since their point estimation methods are almost equivalent). There were no systematic differences in slope change estimates between the methods (Fig. 4, top triangle, and Table 6). Limits of agreement for slope change were generally similar across the pairwise comparisons of methods (but again with the exceptions of the comparison between OLS and NW, and PW and ARIMA).

There were systematic differences in the estimates of standard error of level change across some pairwise comparisons of methods (Fig. 3, bottom triangle, and Table 5). Notably, the ARIMA standard errors were systematically larger compared with all other methods; however, this difference was smaller when compared with REML (geometric mean ratio of standard errors for level change of 1.15). Aside from the pairwise comparison between PW and REML, the limits of agreement between the methods showed that the methods could yield large differences in the standard errors, particularly so for ARIMA compared with the other methods. For example, the limits of agreement for ARIMA compared with NW showed that the differences in standard errors could be large, ranging from 61% smaller to 460% larger. Similar patterns were observed for slope change (Fig. 4, bottom triangle, and Table 6).

Table 5 Mean of differences in level change estimates between methods (row method − column method) (top triangle) and geometric mean ratio of standard errors for level change between methods (column method/row method) (shaded bottom triangle) with 95% limits of agreement. The NW and OLS methods use the same estimator for level and slope change, as do REML and REML-Satt (not shown), which also use the same estimator for standard errors


Table 6 Mean of differences in slope change estimates between methods (row method − column method) (top triangle) and geometric mean ratio of standard errors for slope change between methods (column method/row method) (shaded bottom triangle) with 95% limits of agreement. The NW and OLS methods use the same estimator for level and slope change, as do REML and REML-Satt (not shown), which also use the same estimator for standard errors


Our visual examination of the impact of series length on the differences in level change estimates between pairs of methods showed that series length was not associated with the differences, with the exception of comparisons with the REML method. For these comparisons, the variability of the differences decreased for longer series (Additional file 3: Appendix 2). The variability in differences in slope change estimates for all pairwise comparisons between methods (except between ARIMA and PW), tended to decrease with increasing series length.

When we repeated the analysis standardising the direction of effect to the second method’s estimate, we found the results did not importantly change (Additional file 3: Appendix 3).

Table 7 Pairwise agreement in statistical significance of estimates of level change between statistical methods. P-values associated with estimates of level change were categorised at the 5% level of statistical significance (i.e. <5%, ≥5%). Cells in the upper triangle contain the percentage of series for which the p-value for level change was < 0.05 for both methods or the p-value was ≥0.05 for both methods. Denominators are reported in Table 4. Cells in the lower triangle (shaded) contain kappa statistics. Abbreviations: ARIMA, autoregressive integrated moving average; OLS, ordinary least squares; NW, OLS with Newey-West standard error adjustments; PW, Prais-Winsten; REML, restricted maximum likelihood; REML-Satt, restricted maximum likelihood with Satterthwaite small sample adjustment
