On summary measure analysis of linear trend repeated measures data: performance comparison with two competing methods

Background: The summary measure approach (SMA) is sometimes the only applicable tool for the analysis of repeated measurements in medical research, especially when the number of measurements is relatively large. This study aimed to describe techniques based on summary measures for the analysis of linear trend repeated measures data and then to compare the performance of the SMA, the linear mixed model (LMM), and the unstructured multivariate approach (UMA).
Methods: Practical guidelines based on the least squares regression slope and the mean of response over time for each subject were provided to test time, group, and interaction effects. Through Monte Carlo simulation studies, the efficacy of the SMA vs. the LMM and the traditional UMA, under different types of covariance structures, was illustrated. All the methods were also employed to analyze two real data examples.
Results: Based on the simulation and example results, it was found that the SMA completely dominated the traditional UMA and performed convincingly close to the best-fitting LMM in testing all the effects. However, the LMM was often not robust and led to nonsensical results when the covariance structure for errors was misspecified. The results argue for discarding the UMA, which often yielded extremely conservative inferences for such data.
Conclusions: It was shown that the summary measure approach is a simple, safe and powerful approach in which the loss of efficiency compared to the best-fitting LMM was generally negligible. The SMA is recommended as the first choice to reliably analyze linear trend data with a moderate to large number of measurements and/or small to moderate sample sizes.


Background
In many fields of science, repeated measurements of a response variable are taken on each subject over time to assess the changes in response. The difficulty in analyzing such data is that the measurements within a subject are correlated over time. There are two major strategies for removing these correlations or taking them into account.
First, one can reduce the vector of responses of each subject to a single value by a descriptive statistic and apply standard univariate approaches to test the effects related to the corresponding summary measure. The use of the summary measure approach (SMA) was first suggested by Wishart [1]. Several strategies based on the least squares regression slope and the mean of response over time were recommended to evaluate the differences between the groups [2-6]. Moreover, the utility of Kendall's $\tau_b$ as a summary measure of within-subjects trend in psychiatric longitudinal studies, where the key assumptions of parametric methods do not hold, was investigated [7,8].
Second, one can use methods which take the covariances between the measurements into account. Two common and traditional approaches for normally distributed responses are repeated measures ANOVA and MANOVA. In order to avoid inflating type I error rate, the denominator degrees of freedom of the F statistics in the repeated measures ANOVA approach should be adjusted under departures from a restrictive assumption on covariance structures, namely sphericity. But there is no obvious advantage in using the adjusted F tests against the multivariate tests, and generally the adjustments should be avoided [9,10]. In contrast, the repeated measures MANOVA approach makes no assumption regarding covariance structure and hence, it is sometimes known as unstructured multivariate approach (UMA). The only key advantage of the repeated measures ANOVA approach over the UMA is that it can still be implemented in the case where the number of measurements is greater than the sample size.
The linear mixed model (LMM) is more advanced and flexible, since it can handle subjects with incomplete measurements and measurements that are unequally spaced in time. But the performance of the LMM in testing the effects is highly dependent on the choice of an appropriate covariance structure for errors [11,12]. On the other hand, the choice of a parsimonious covariance structure in a small sample design can lead to more efficient inferences concerning the fixed-effects parameters. This dependence makes the LMM inconvenient and unreliable, especially for those who are not familiar with the fundamental principles of mixed models.
Although the SMA is a simple, robust and sometimes the only applicable tool for the analysis of repeated measures studies, no clear performance comparison of the SMA with its competitors has been reported. Moreover, the application of the SMA has been mostly based on using one summary statistic to assess only the total group difference.
The present study concerns repeated measures data in which the pattern of the response profile can be described by a linear trend and the responses are measured on a continuous scale. The main objectives of this study are: a) To describe techniques to test time (within-subjects), group (between-subjects) and group × time interaction effects on the basis of two common summary measures, i.e. the least squares regression slope and the mean of response over time. b) To compare the performance of the SMA, LMM and UMA in the analysis of simulated data from an LMM framework under different types of covariance structures. The approach is also illustrated and compared with the competitors using two real data sets.
In our simulations, there is a focus on situations where the LMM may provide extremely unsatisfactory performance, such as misspecification of the covariance structure for errors, small and moderate sample sizes, and a relatively large number of measurements.

Unstructured multivariate approach (UMA)
The UMA handles the measurements within a subject as a vector of multivariate responses and treats time points as levels of a qualitative factor with no order. This approach is restricted to equally spaced time points and balanced data with complete measurements, and it also assumes homogeneity of the covariance matrices across the $k$ groups.
Let $Y_{ih} = (Y_{i1h}, \ldots, Y_{imh})^T$ denote the vector of $m$ responses from the $i$th subject in group $h$, for $i = 1, \ldots, n_h$ and $h = 1, \ldots, k$. It is assumed that the response vectors $Y_{ih}$ are independent and have a multivariate normal distribution with mean $\mu_h = (\mu_{1h}, \ldots, \mu_{mh})^T$ and common covariance matrix $\Sigma$. The total mean vector is denoted by $\mu_{.}$. One can use a profile model as

$$Y_{ih} = \mu_h + \varepsilon_{ih},$$

where $\varepsilon_{ih} = (\varepsilon_{i1h}, \ldots, \varepsilon_{imh})^T$ is the vector of errors for the $i$th subject in group $h$.
The primary hypothesis of interest in a profile analysis is the parallelism of the $k$ groups' profiles, that is, no group × time interaction effect. The hypothesis can be constructed as $H_0: C\mu_1 = \ldots = C\mu_k$ for an appropriate transformation matrix $C$ with rank $m-1$. If the test of interaction is not significant, the tests of the main effects are not confounded. In order to compute any MANOVA-type test statistic such as Wilks' lambda ($\Lambda$), the condition $N-k > m-1$ is necessary, where $N$ is the total number of subjects. Otherwise, the estimated covariance matrix of the transformed responses would be singular and hence not positive-definite. To test the time effect, one can investigate the equality of the $m$ elements of the total mean vector ($\mu_{.}$) using a one-sample Hotelling's $T^2$ test on the $m-1$ differences between adjacent measurements from each subject. Here, the same strategy as in the SMA is utilized to test the group effect, as it is often more efficient than MANOVA-type tests for comparing the groups' mean vectors.
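As an illustration of the transformation matrix $C$, the block below writes out one possible choice for $m = 4$ time points, namely the adjacent-difference contrasts. This particular $C$ is an assumption made for the example; any full-rank $(m-1) \times m$ contrast matrix leads to the same test. The `pmatrix` environment requires the amsmath package.

```latex
% One possible choice of C for m = 4: differences between adjacent time points.
C =
\begin{pmatrix}
-1 &  1 &  0 & 0 \\
 0 & -1 &  1 & 0 \\
 0 &  0 & -1 & 1
\end{pmatrix},
\qquad
C\mu_h =
\begin{pmatrix}
\mu_{2h} - \mu_{1h} \\
\mu_{3h} - \mu_{2h} \\
\mu_{4h} - \mu_{3h}
\end{pmatrix}.
% Parallel profiles: C\mu_1 = \ldots = C\mu_k, i.e. every group has the same
% increments between adjacent time points. C has rank m - 1 = 3.
```

With this choice, the condition $N-k > m-1$ reads $N-k > 3$. For a design with 27 subjects in two groups and four time points, as in Example 1, this gives $25 > 3$, so the multivariate tests can be computed.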

Linear mixed model (LMM)
Let $Y_i = (Y_{i1}, \ldots, Y_{im_i})^T$ denote the $m_i \times 1$ vector of responses from the $i$th subject for $i = 1, \ldots, N$, where $N$ is the total number of subjects. In contrast to the UMA, the subjects may have different measuring time points and be unbalanced in terms of the number of measurements. The general form of the LMM is

$$Y_i = X_i^T b + Z_i^T b_i + \varepsilon_i,$$

where $X_i^T$ is an $m_i \times p$ fixed-effects design matrix for the $i$th subject, $b$ is a $p \times 1$ vector of fixed-effects parameters for the population, $b_i$ is a $q \times 1$ vector of random effects for the $i$th subject, $Z_i^T$ is an $m_i \times q$ random-effects design matrix for the $i$th subject with $q \le p$, and $\varepsilon_i$ is an $m_i \times 1$ vector of within-subject errors. The random-effects vectors $b_i$ are assumed to be independent and to have a multivariate normal distribution with mean zero and covariance matrix $G_i$, and the error vectors $\varepsilon_i$ are assumed to be independent and to have a multivariate normal distribution with mean zero and covariance matrix $R_i$. In addition, it is assumed that $b_i$ and $\varepsilon_i$ are independent of one another. The LMM defines the covariances of the measurements within a subject through the covariances of the random effects ($G_i$) and the covariances of the errors ($R_i$). We used estimators based on the restricted maximum likelihood (REML) method to construct the F statistics of the hypotheses since, in general, REML yields less biased estimates of the variance components than the maximum likelihood (ML) approach and avoids inflating type I error rates [12,13].
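The way $G_i$ and $R_i$ combine can be made explicit with a short derivation. The block below is a sketch that follows only from the stated assumptions (normal, mutually independent $b_i$ and $\varepsilon_i$) and uses the notation of this section.

```latex
% Marginal moments of Y_i implied by the LMM assumptions
\begin{align*}
E(Y_i)   &= X_i^T b, \\
Var(Y_i) &= Var(Z_i^T b_i) + Var(\varepsilon_i)
          = Z_i^T G_i Z_i + R_i, \\
Y_i      &\sim N_{m_i}\!\left(X_i^T b,\; Z_i^T G_i Z_i + R_i\right).
\end{align*}
```

A misspecified working structure for $R_i$ therefore distorts the implied covariance of $Y_i$, and with it the standard errors that enter the F statistics for the fixed effects.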

The summary measure approach (SMA)
In this section, we describe how to apply the least squares regression slope and the mean of response over time for each subject to test the effects of time, group and group × time interaction in repeated measures studies.
The slope of the least squares regression line was applied to summarize the relationship between response and time within each subject. If the pattern of individual profiles is linear or at least monotonic, the slopes can appropriately summarize the rate of change of response over time in the subjects. For repeated measures designs, the primary hypothesis is that the pattern of change over time is the same across the $k$ groups, i.e. that there is no group × time interaction effect. Under the assumption of no interaction effect, the slopes in the $k$ groups should not be significantly different. For this purpose, once the slopes are obtained for each subject, the ordinary $k$-sample tests such as the one-way ANOVA F or Kruskal-Wallis test (for $k > 2$) and Student's t or the Wilcoxon-Mann-Whitney test (for $k = 2$) can be employed to assess the equality of the slopes in the groups. If the test of interaction is not significant, one would be interested in assessing the main effects.
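The interaction test described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation (the study itself used R): the toy data, the function names `ls_slope` and `student_t`, and the choice of the pooled-variance Student's t statistic for $k = 2$ are assumptions made for the example.

```python
from math import sqrt
from statistics import mean, variance


def ls_slope(times, values):
    """Least squares slope of one subject's responses regressed on time."""
    t_bar, y_bar = mean(times), mean(values)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, values))
    sxx = sum((t - t_bar) ** 2 for t in times)
    return sxy / sxx


def student_t(a, b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


# Hypothetical toy data: one row of responses per subject, shared time points.
times = [2, 4, 6, 8, 10]
group0 = [[2.1, 2.0, 2.3, 2.2, 2.4],
          [1.8, 2.1, 2.0, 2.3, 2.2],
          [2.2, 2.1, 2.5, 2.4, 2.6]]
group1 = [[2.0, 2.6, 3.1, 3.5, 4.2],
          [2.3, 2.7, 3.4, 3.8, 4.3],
          [1.9, 2.4, 3.0, 3.6, 4.0]]

# Step 1: reduce each subject's profile to a single slope.
slopes0 = [ls_slope(times, y) for y in group0]
slopes1 = [ls_slope(times, y) for y in group1]

# Step 2: compare the slopes between groups with an ordinary two-sample test.
t_stat, df = student_t(slopes1, slopes0)
print(f"interaction (slopes): t = {t_stat:.2f} on {df} df")
```

The statistic would then be referred to a t distribution with the printed degrees of freedom; the standard library has no t distribution, so the p-value step is left out here.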
The hypothesis of no time (within-subjects) effect states that all the $m$ elements of the total mean vector ($\mu_{.}$) are identical. Under this hypothesis, the overall mean of the slopes in the population must be zero. To test it, a one-sample t test can be applied to the sample slopes to assess the departure of the mean slope from zero.
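A minimal Python sketch of this time-effect test follows. The slope values are hypothetical, and the function name `one_sample_t` is introduced here only for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(values, mu0=0.0):
    """One-sample t statistic for H0: population mean equals mu0, with its df."""
    n = len(values)
    t = (mean(values) - mu0) / (stdev(values) / sqrt(n))
    return t, n - 1


# Hypothetical per-subject least squares slopes pooled over all groups.
slopes = [0.040, 0.050, 0.055, 0.265, 0.250, 0.265]

t_stat, df = one_sample_t(slopes)
print(f"time effect (mean slope vs. 0): t = {t_stat:.2f} on {df} df")
```

A large absolute value of the statistic relative to the t distribution with `df` degrees of freedom indicates that the mean slope differs from zero, that is, a time effect.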
For testing group (between-subjects) effect, the mean of measurements over time for each subject is used as a summary measure. By analogy with the interaction effect case, the ordinary k sample tests are applied, but this time, to assess the equality of the individual means in the groups.
A permutation procedure can also be employed to assess the interaction and group effects when the underlying assumptions of the standard tests do not hold or cannot be reasonably checked because of small sample sizes in the groups.
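The permutation idea can be sketched as follows for the group effect with $k = 2$, using each subject's mean response as the summary measure. This is an illustrative sketch under assumptions: the data are hypothetical, the test statistic is the absolute difference in group means, and all relabelings are enumerated exactly, which is only feasible for small groups.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(a, b):
    """Exact two-sided permutation p-value for a difference in group means."""
    pooled = a + b
    total = sum(pooled)
    observed = abs(mean(a) - mean(b))
    extreme = n_splits = 0
    # Enumerate every way of relabeling len(a) of the subjects as group "a".
    for idx in combinations(range(len(pooled)), len(a)):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / len(a) - (total - sum_a) / len(b))
        n_splits += 1
        if diff >= observed - 1e-12:  # small tolerance for rounding error
            extreme += 1
    return extreme / n_splits


# Hypothetical responses: one row per subject, measured at shared time points.
group0 = [[2.1, 2.0, 2.3, 2.2, 2.4], [1.8, 2.1, 2.0, 2.3, 2.2], [2.2, 2.1, 2.5, 2.4, 2.6]]
group1 = [[2.0, 2.6, 3.1, 3.5, 4.2], [2.3, 2.7, 3.4, 3.8, 4.3], [1.9, 2.4, 3.0, 3.6, 4.0]]

# Summary measure for the group effect: each subject's mean over time.
means0 = [mean(y) for y in group0]
means1 = [mean(y) for y in group1]

p_value = permutation_p_value(means0, means1)
print(f"group effect, permutation p-value = {p_value:.3f}")
```

Replacing the subject means by the subject slopes gives the corresponding permutation test of the interaction effect. Note that with three subjects per group there are only 20 relabelings, so the smallest attainable two-sided p-value is 2/20 = 0.1; very small groups limit what a permutation test can show.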

Simulation study
For the purpose of data simulation, a simple linear trend mixed model with a random coefficient only for the intercept and a two-category grouping variable was considered. The model can be expressed as

$$Y_{ij} = b_0 + b_{0i} + b_1 X_i + b_2 t_{ij} + b_3 X_i t_{ij} + \varepsilon_{ij}, \qquad (3)$$

where $Y_{ij}$ is the $j$th measurement from the $i$th subject and $X_i$ is a grouping variable with the values 0 and 1, for $i = 1, \ldots, N$ and $j = 1, \ldots, m_i$. Linear trend mixed model data were generated from model (3) with the same measuring time points $t_{ij} = t_j = 2j$ for all the subjects, $m_i = m = 5$, 10, and 20 measurements, and $b_0 = 2$; the random effects $b_{0i}$ were assumed to be independently normally distributed with mean zero and standard deviation 0.25.
Since tests of the within-subjects effects depend strongly on the number of measurements, the values of $b_2$ and $b_3$ were adjusted with respect to the values of $m$. Different combinations of $b_1$, $b_2$ and $b_3$ were constructed to compute the empirical type I error rates and powers for testing the three effects.
We considered the following three covariance structures for errors to generate artificial data and fit the LMMs:
• Simple or independent (IND): $R_i = s^2 I$, where $I$ is an $m \times m$ identity matrix.
• First-order autoregressive (AR1) with $r = 0.7$: the $(j, j')$ element of $R_i$ is $s^2 r^{|j-j'|}$.
• Unstructured (UNS): $R_i$ is a general $m \times m$ covariance matrix with no pattern imposed on the variances and covariances.
For simplicity, we defined the true structures as those which were used to generate data and the working structures as those which were used to fit the model. In all the cases, it was assumed that the errors were normally distributed with zero mean, and in the cases of IND and AR1, the error variances were fixed over time and equal to $s^2 = 0.5$.
One thousand data sets were generated for $n_1 = n_2 = n = 5$, 10, 30 and 50 subjects per group under various choices of the above circumstances.
We used the free statistical software environment R to generate the artificial data sets and to fit all of the approaches presented in the preceding sections.
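The data-generating step can be sketched as follows. The study used R; this Python version is only an illustration under stated assumptions: the function names are hypothetical, the linear predictor is taken to be $b_0 + b_{0i} + b_1 X_i + b_2 t_j + b_3 X_i t_j$ (with $b_1$ read as the group coefficient, $b_2$ as the time slope and $b_3$ as the interaction), and only the IND and AR1 error structures are covered.

```python
import random
from math import sqrt


def simulate_subject(rng, group, m, b0=2.0, b1=0.0, b2=0.0, b3=0.0,
                     sd_intercept=0.25, s2=0.5, r=0.0):
    """Simulate one subject's m responses at times t_j = 2j.

    r = 0 gives independent (IND) errors; 0 < r < 1 gives AR1 errors with
    constant variance s2 and correlation r ** |j - j'|.
    """
    b0i = rng.gauss(0.0, sd_intercept)  # random intercept for this subject
    s = sqrt(s2)
    errors = [rng.gauss(0.0, s)]        # stationary start with variance s2
    for _ in range(1, m):
        # AR(1) recursion scaled so that every error keeps variance s2.
        errors.append(r * errors[-1] + sqrt(1.0 - r * r) * rng.gauss(0.0, s))
    times = [2 * j for j in range(1, m + 1)]
    return [b0 + b0i + b1 * group + b2 * t + b3 * group * t + e
            for t, e in zip(times, errors)]


def simulate_dataset(rng, n_per_group, m, **params):
    """Return a list of (group, responses) pairs for two equal-sized groups."""
    return [(g, simulate_subject(rng, g, m, **params))
            for g in (0, 1) for _ in range(n_per_group)]


rng = random.Random(2024)  # fixed seed so the sketch is reproducible
# Hypothetical scenario: AR1 errors, a time effect, no group or interaction effect.
data = simulate_dataset(rng, n_per_group=5, m=10, b2=0.05, r=0.7)
print(len(data), "subjects with", len(data[0][1]), "measurements each")
```

Repeating `simulate_dataset` many times and recording how often a test rejects at the 5% level gives an empirical type I error rate when the tested coefficient is zero and an empirical power otherwise.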

Simulation results
Within-subject (time) and within-by-between-subjects (interaction) effects
Tables 1 and 2 display the empirical type I error rates and powers of the tests of time and interaction effects for various covariance structures, respectively. The first rows in each part, where $b_1 = b_2 = 0$ ($b_3 = 0$), display the empirical type I error rates, and the rows corresponding to $b_1 > 0$ and $b_2 > 0$ ($b_3 > 0$) show the empirical powers in testing the time (interaction) effect. Because the results of testing the time and interaction effects are similar, we present them together; the following description applies to both effects.
First, the three approaches are compared under the IND and AR1 as true structures. As illustrated in both Tables 1 and 2, the empirical type I error rates of the SMA and UMA were always close to (and often smaller than) the nominal significance level (5%). However, the LMM in testing both effects displayed notably larger values for the IND working structure under the AR1 true structure and, more generally, for the UNS working structure under the two true structures. Unfortunately, the inflation of type I error rates for the IND working structure under the AR1 true structure persisted as $n$ increased. Because the type I error rates were not preserved, the empirical powers of the LMM in these cases, which were notably greater than those obtained by the other approaches, are misleading. In summary, the empirical powers of the SMA were notably greater than those of the UMA and were often close to the corresponding values of the best-fitting LMM. It is also worth mentioning that while the powers of the SMA and LMM tend to 1 for some larger values of $n$, the powers of the UMA fall evidently short of them, for example at $n = 50$ with $m = 5$, 10, 20 and, to some extent, at $n = 30$ with $m = 10$, 20.
Next, we consider the simulation results for the UNS as true covariance structure in testing time and interaction effects. The LMMs with the IND and AR1 working structures again preserved the type I error rates. As under the IND and AR1 true structures, the empirical type I error rates of the LMM with the UNS working structure were not preserved for smaller values of $n$ and larger values of $m$. It is worthwhile to note that the empirical type I error rates of this LMM were comparable to the corresponding values of the other approaches only for larger values of $n$ accompanied by smaller values of $m$: $n = 30$ and 50 with $m = 5$ and, to some extent, $n = 50$ with $m = 10$ in both tables. However, the empirical powers of the SMA and all the LMMs were similar under such circumstances. Only in the case of $m = 5$ with $n = 30$, 50 under the UNS true structure was the UMA comparable with the SMA in testing both effects.

Between-subjects (group) effect
The empirical type I error rates and powers of the test of group effect are displayed in Table 3, in which the empirical type I error rates are the values of the first rows, where $b_1 = b_2 = 0$, and the other rows, where $b_1 > 0$ and $b_2 > 0$, display the empirical powers. Again, it should be noted that both the SMA and UMA use the same strategy to test the group effect. Hence, Table 3 reports the results only for the SMA and LMM.
First, the simulation results for testing the group effect are considered under the IND and AR1 as true covariance structures. Except for the UNS working structure, both the SMA (UMA) and the LMM often obtained the same empirical type I error rates, close to the nominal significance level. Contrary to what we obtained for the two other effects, the LMM preserved the type I error rates for the IND working structure under the AR1 true structure. The LMM with the UNS working structure tended to have obviously larger empirical type I error rates than the SMA (UMA). The empirical powers of the SMA and LMM were virtually identical under both the IND and AR1 true structures when the type I error rates were preserved by the LMM.
Now, we consider the results for the UNS as true covariance structure. The LMM with the UNS working structure preserved the type I error rates only for larger values of $n$ accompanied by smaller values of $m$, such as $n = 30$, 50 with $m = 5$, 10. However, the LMM preserved the type I error rates for the IND and AR1 working structures. In these comparable circumstances, the differences in the powers between the SMA and all the LMMs were negligible.

Illustrative examples
Example 1: Pituitary-pteryomaxillary distance data
The first example is a small data set on a facial distance previously published by Potthoff and Roy [14], from a study conducted at the University of North Carolina Dental School. The distance (mm) from the centre of the pituitary gland to the pteryomaxillary fissure was measured at ages 8, 10, 12, and 14 in two groups of children (11 girls and 16 boys). The data set has also been analyzed by several analytic methods [12,15]. Figure 1 displays the mean profiles in boys and girls and indicates a departure from the parallelism hypothesis. In general, boys tend to have larger pituitary-pteryomaxillary distances and a faster growth rate than girls. In addition, the distances increase with age in both groups of children.
Given model (3), three models were fitted with IND, AR1, and UNS covariance structures. Random intercept and slope models with the three covariance structures were also employed. Table 4 reports the results for the six LMMs, UMA and SMA, as well as Akaike's information criterion (AIC) and Bayesian information criterion (BIC) as two model selection indices for the LMMs. The model with the smallest criterion provides the best fit to data. Based on both foregoing criteria, model 1 with IND covariance structure was preferred. It is worth mentioning that a random intercept model with IND covariance structure for errors yields a compound symmetry covariance structure between the responses.
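The remark on compound symmetry can be verified directly. In the sketch below, $\sigma_0^2$ denotes the variance of the random intercept; this symbol is introduced here for illustration and is not used elsewhere in the text. $\mathbf{1}_m$ is the $m \times 1$ vector of ones.

```latex
% Random intercept with IND errors: Z_i^T = 1_m, G_i = \sigma_0^2, R_i = s^2 I_m
\begin{align*}
Var(Y_i) &= \sigma_0^2\, \mathbf{1}_m \mathbf{1}_m^T + s^2 I_m, \\
Var(Y_{ij}) &= \sigma_0^2 + s^2, \qquad
Cov(Y_{ij}, Y_{ij'}) = \sigma_0^2 \quad (j \neq j'), \\
Corr(Y_{ij}, Y_{ij'}) &= \frac{\sigma_0^2}{\sigma_0^2 + s^2} \quad (j \neq j').
\end{align*}
```

All variances are equal and all pairs of measurements share the same covariance, which is exactly the compound symmetry pattern.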
All the LMMs and the SMA showed a significant interaction effect at the 5% significance level. These results indicated that growth in boys was faster than that in girls. Although one could not reject the hypothesis of no interaction effect by the UMA, there was some evidence that the profiles in Figure 1 were not parallel. All the approaches yielded significant results for the two main effects on the facial growth measurements of children. Based on these results, we conclude that boys have larger facial distances than girls and that the facial distances increase with age in the two groups of children.

Example 2: Change in lung NO metabolites level data
The second example is an animal experimental study of the effects of hypercapnia with or without acidosis on NO production in the isolated ventilated-perfused rabbit lung, assessed through the concentration of NO metabolites (nitrite and nitrate) released into the perfusate. The study was conducted at Justus-Liebig-University, Giessen. The NO metabolites concentration (nmol/min) was measured at time points 0, 5, 10, 15, 30, 45, ..., and 180 minutes in three groups: normoxic normocapnia (NX-NC, n = 7), normoxic hypercapnia with acidosis (NX-HCA, n = 4) and normoxic hypercapnia with normal pH level (NX-HCN, n = 6). Since there was some variation between the baseline measurements, values were given as changes from the baseline. There were six samples (lungs) with incomplete measurements. Figure 2 displays the mean profiles of change in NO metabolites level over time for the three groups. The mean profiles increase over time in all of the groups. However, the patterns of change in NO metabolites level and the overall means are not expected to differ between the three conditions.
Table 3 Type I error rates and powers for testing the between-subjects (group) effect. Rows with $b_1 = 0$ and $b_2 = 0$ give the type I error rates; the other rows give powers.
In this data set, the UMA could not be conducted, because the number of measurements (m = 14) was larger than the number of samples with complete measurements. Table 5 displays the results for the LMMs with a random intercept, the LMMs with a random intercept and slope, and the SMA. Note that AIC prefers the random intercept and slope models with UNS, IND and AR1 covariance structures (models 6, 4 and 5, respectively), whereas the random intercept and slope models with IND and AR1 covariance structures (models 4 and 5, respectively) are to be preferred based on BIC. The reason is that BIC imposes a heavier penalty than AIC as the number of parameters in the model increases. Since there were a limited number of lungs and a large number of measurements, the danger of over-fitting increases. In such cases, it is more reasonable to rely on BIC to select the best parsimonious model. Note that model 6 has far more parameters to be estimated (d = 112) than model 4 (d = 8).
Based on the results of the LMMs 4 and 5, selected on the basis of BIC, and also the SMA, one can accept that the rates of change in NO metabolites do not differ between the three groups. Although this result coincides with that obtained by the most complicated model 6, the unsuitable models 1 and 3 reject the hypothesis of no interaction effect, a rejection that is not supported by Figure 2.
All the LMMs, as well as the SMA, confirmed that the mean change increased over time in all of the groups. Except for the unreasonable model 6, all the models and the SMA confirmed that the mean change profiles for the three groups were the same throughout the time points; therefore, there was no significant group effect.
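The difference between the two criteria can be written out explicitly. The block below gives the standard definitions, with $\ell$ the maximized (restricted) log-likelihood, $d$ the number of estimated parameters and $n^{*}$ the sample size used in the penalty. The exact definition of $n^{*}$ depends on the software and is treated here as an assumption. The parameter count in the last line is a plausibility check rather than a statement of how the reported values of $d$ were obtained.

```latex
\begin{align*}
\mathrm{AIC} &= -2\ell + 2d, \\
\mathrm{BIC} &= -2\ell + d \log n^{*}, \\
% BIC penalizes each parameter more heavily than AIC whenever
% \log n^{*} > 2, i.e. n^{*} > e^{2} \approx 7.4.
\text{UNS, } m = 14: \quad & \frac{m(m+1)}{2} = \frac{14 \cdot 15}{2} = 105
\ \text{covariance parameters, versus 1 for IND.}
\end{align*}
```

The difference of 104 covariance parameters between the UNS and IND error structures is consistent with the gap between $d = 112$ and $d = 8$ reported for models 6 and 4.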

Discussion
Based on the simulation and example results, it was found that obtaining accurate inferences in an LMM requires substantial statistical knowledge of the true and working covariance structures. However, owing to developments in computing, mixed models are nowadays widely used in experimental designs and clinical trial studies where the sample sizes are not sufficiently large and/or the number of measurements is sometimes large. This serious aspect has previously been reported in a simulation study by Park [12] in a somewhat different way, where there was no random effect in the data-generating process. The interested reader is referred to [16-19] for sample size and power calculations in repeated measurements analysis. Interestingly, the SMA was robust to the true covariance structures in testing main and interaction effects, even for small sample sizes and a large number of measurements. Moreover, the SMA was a powerful method for the analysis of linear trend data, with empirical powers that were, in general, convincingly close to those of the best-fitting LMM. This means that the least squares slope and the mean of response are appropriate measures to summarize the corresponding effects.
In this study, we fitted the LMMs using the nlme package in R, which follows the inner-outer approach for calculating the denominator degrees of freedom (df) of F statistics [20]. In comparison with the packages nlme and lme4 in R, the MIXED procedure in SAS also provides the Satterthwaite and Kenward-Roger approximation methods for calculating the denominator df, which can improve the resulting p-values. Although the superiority of these complex methods in terms of better preservation of type I error rates has been previously illustrated in unbalanced designs [21-23], the differences are rather negligible when LMMs are employed in the context of longitudinal analyses and there are no missing data. The R packages do have the advantage over the SAS procedure of providing the useful alternative algorithms of Monte Carlo simulation and parametric bootstrap for obtaining more sensible p-values and confidence intervals. However, these are too computationally intensive to be included in a simulation study.
The SMA clearly dominated the traditional UMA in testing time and interaction effects. The reason is that the SMA exploits the linear trend in such data by computing the least squares slopes. The UMA, in contrast, assumes a more general nonlinear model with more parameters to be estimated, and also imposes the most complex structure on the covariances of errors, which may not be necessary.
Though not reported here, some simulations based on non-normal data showed that, in general, the approaches were relatively robust to departures from multivariate normality. This robustness has been reported previously for the two-sample Hotelling's $T^2$ test [24,25] and, to some extent, for LME models [26,27].
This paper did not aim to deal with missing observations or with techniques for baseline or pre-treatment measurements. If the missing observations do not occur completely at random, they can introduce bias into parameter estimation and decision-making in statistical models. Barton and Cramer [28] and Catellier and Muller [29] have proposed several approximations to the denominator df for this issue. In this respect, the performance of the SMA is highly dependent on weighting the individuals' summary statistics [30], which may be cumbersome in practice. There are also more complex and efficient approaches to adjust for the effect of the baseline value (or values) in the SMA, such as including the baseline (or the average of baselines) or the estimated intercept as a covariate in an analysis of covariance (ANCOVA) model [4].

Conclusions
It was shown that the SMA, on the basis of the two summary measures, was a simple, safe and powerful method for testing main and interaction effects, performing nearly as well as the best-fitting LMM. However, the LMM often led to seriously inflated type I error rates, and hence nonsensical inferences, when the covariance structure for errors was misspecified. Moreover, this simple approach dominated the widely used UMA in assessing linear trend data from a mixed model framework. The SMA is recommended as the first choice to confidently analyze linear trend data with a moderate to large number of measurements and/or small to moderate sample sizes.