How many repeated measures in repeated measures designs? Statistical issues for comparative trials

Background In many randomized and non-randomized comparative trials, researchers measure a continuous endpoint repeatedly in order to decrease intra-patient variability and thus increase statistical power. There has been little guidance in the literature as to selecting the optimal number of repeated measures. Methods The degree to which adding a further measure increases statistical power can be derived from simple formulae. This "marginal benefit" can be used to inform the optimal number of repeat assessments. Results Although repeating assessments can have dramatic effects on power, marginal benefit of an additional measure rapidly decreases as the number of measures rises. There is little value in increasing the number of either baseline or post-treatment assessments beyond four, or seven where baseline assessments are taken. An exception is when correlations between measures are low, for instance, episodic conditions such as headache. Conclusions The proposed method offers a rational basis for determining the number of repeat measures in repeat measures designs.


Background
Many studies measure a continuous endpoint repeatedly over time. In some cases, this is because researchers wish to judge the time course of a symptom or to evaluate how the effect of a treatment changes over time. For example, in a study of thoracic surgery, patients were evaluated every three months after surgery to determine the incidence and duration of chronic postoperative pain. The researchers found that the incidence of pain at one year was high and only slightly lower than at three months, showing that post-thoracotomy pain is common and persistent [1]. In such studies, the number and timing of repeated measures need to be decided on a study-by-study basis depending on the scientific interests of the investigators.
Measures may also be repeated in order to obtain a more precise estimate of an endpoint. In simple terms, measure a patient once and they may be having a particularly good or bad day; measure them several times and you are more likely to get a fair picture of how they are doing in general. Repeat assessment reduces intra-patient variability and thus increases study power. This is of particular relevance to comparative studies. For instance, in a randomized trial of soy and placebo for cancer-related hot flashes, patients recorded the number of hot flashes they experienced each day during a baseline assessment period and then during treatment. In this case, the researchers were interested in the change between baseline and follow-up in each group so as to determine drug effect. The time course of symptoms was not at issue. The researchers therefore took a mean of each patient's hot flash score during the baseline period and subtracted the mean of the final four treatment weeks to create a change score. Change scores were compared between groups using a t-test [2]. In addition to using means, post-randomization measures may also be summarized by area-under-the-curve [3] or slope scores [4], which are particularly relevant if treatment effects diverge over time.
There has been little guidance in the methodologic literature as to how researchers should select the number of repeated measures for repeated measures designs. In the few papers that have discussed power and repeat measurement (for example, Frison and Pocock [5]), the number of measures is seen as a fixed design characteristic, with sample size derived accordingly. Perhaps as a corollary, randomized and other comparative trials involving repeated measures almost invariably lack a statistical rationale for the number of measures taken. Measures are most commonly taken at particular temporal "landmarks", such as the beginning of each chemotherapy cycle, or each day during treatment. Apparently little consideration is given to how increasing or reducing the number of measures affects power.
Consequently, it is not difficult to find studies that appear to have either too few or too many repeat assessments. In a trial of acupuncture for back pain, for example, pain was measured on a visual analog scale (VAS) once at baseline and once following treatment [6]. The standard deviations were very high: mean post-treatment score was 38 mm with a standard deviation of 28 mm (recalculated using raw data from the authors). Part of this variability in pain scores reflects intra-patient variability that would have been reduced had the VAS been repeated several times. This would surely have been feasible in this population. There are also numerous studies where an extremely large number of measures were taken, far beyond the point where additional measures would have improved precision to an important degree. For example, in a trial of a topical treatment for HIV-related peripheral neuropathy, patients were required to record pain four times a day for four weeks at baseline and at follow-up, a total of 224 data points [7]. No rationale was provided for such extensive data collection and there was clearly a cost: 46% of patients dropped out before the end of the trial. In the hot flashes example given above, symptoms were measured every day for four weeks at baseline and for 12 weeks following randomization, a total of 102 data points [2]. The authors do not explain why such extensive data collection was required to answer the study question.
In this paper I argue that the number of repeat measures should not be seen as a fixed design characteristic; rather, it is a design choice that can be informed by statistical considerations. I then outline a method for guiding decisions concerning the number of repeat measures and deduce several rules of thumb that can be applied in trial design.

Methods
To determine an optimal number of repeat measures, I use the premise that the ideal number from a statistical viewpoint is infinity, as this would maximally reduce intra-patient variance. However, it would be best in terms of researcher and patient time and effort if only a single assessment was made. Increasing the number of repeat assessments thus has a benefit in statistical terms that is offset by cost. Whereas cost can be estimated only in general terms by researchers (would patients put up with another questionnaire? how much time would it take for an additional range of motion assessment?) statistical efficiency benefits can be quantified. In the following, I describe the formulae for determining the relative benefit of additional repeat assessments for statistical power and deduce some general design principles.
The key question is the degree to which adding a further measure (for example, assessing pain five times rather than four times) increases statistical power. This is known as the "marginal benefit" of repeat measurement. We will start with the situation where data are recorded only after intervention. This is typical in trials of acute sequelae of a predictable event, for example, post-operative pain, chemotherapy nausea or muscle soreness following exercise. It can be shown (see Figure 1) that the required sample size of n patients per group is proportional to a function of the number of measurements (r) and the mean correlation between measurements ($\bar{\rho}$):

$$n \propto \frac{1 + (r-1)\bar{\rho}}{r}$$

Marginal change in sample size for r + 1 compared to r assessments is therefore:

$$\frac{1 + (r-1)\bar{\rho}}{r} - \frac{1 + r\bar{\rho}}{r+1} = \frac{1-\bar{\rho}}{r(r+1)}$$

This equation does not require that measurements be equally spaced or that correlations between measurements be constant.
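The marginal benefit can be tabulated directly. The following is a minimal sketch, assuming that sample size per group is proportional to (1 + (r - 1)ρ)/r, where ρ is the mean correlation between measurements, and that marginal change is expressed as a fraction of the sample size needed with a single measure. The function names are hypothetical and the correlation of 0.65 is chosen only for illustration.

```python
def relative_sample_size(r, rho):
    """Sample size per group relative to a single measure: (1 + (r - 1) * rho) / r."""
    if r < 1:
        raise ValueError("r must be at least 1")
    return (1 + (r - 1) * rho) / r


def marginal_change(r, rho):
    """Drop in relative sample size when moving from r to r + 1 measures."""
    return relative_sample_size(r, rho) - relative_sample_size(r + 1, rho)


if __name__ == "__main__":
    rho = 0.65  # illustrative mean correlation between repeat measures (assumption)
    print("r  relative n  gain from one more measure")
    for r in range(1, 8):
        print(f"{r}  {relative_sample_size(r, rho):.3f}       {marginal_change(r, rho):.3f}")
```

Under these assumptions the gain from each additional measure equals (1 - ρ)/(r(r + 1)), so it shrinks roughly with the square of the number of measures already taken.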
Figure 1.

Each observation is modeled as:

$$x_{ijk} = \mu_{ik} + e_{ijk}$$

Here i = A or B, j = 1 … $n_i$ and k = 1 … r; $\mu_{ik}$ is the true mean for treatment i at time k and $e_{ijk}$ is the error for the jth patient undergoing treatment i at time k. For each patient, a mean of all assessments is calculated as:

$$\bar{x}_{ij} = \frac{1}{r}\sum_{k=1}^{r} x_{ijk}$$

The difference between groups A and B is:

$$\frac{1}{n_A}\sum_{j=1}^{n_A}\bar{x}_{Aj} - \frac{1}{n_B}\sum_{j=1}^{n_B}\bar{x}_{Bj}$$

The sample size required for a given power and alpha is proportional to the square of the reciprocal of effect size, d. Effect size is defined as the difference between group means over pooled standard deviation. The variance of the mean of r assessments is the sum of the variance of each assessment plus twice each pairwise covariance, divided by $r^2$:

$$\operatorname{Var}(\bar{x}_{ij}) = \frac{1}{r^{2}}\left(\sum_{k=1}^{r}\sigma_{ik}^{2} + 2\sum_{k<k'}\rho_{kk'}\,\sigma_{ik}\,\sigma_{ik'}\right)$$

From the perspective of power, and without loss of generality, the standard deviation of each assessment can be standardized to one by dividing each $x_{ijk}$ by $\sigma_{ik}$. We are not interested in examining different effect sizes at different times, so we can standardize the difference between groups to one; furthermore, the pairwise correlations $\rho_{kk'}$ can be averaged to give $\bar{\rho}$. Note that there is no requirement that all $\rho_{kk'}$ be equal or that the assessment times k be equally spaced. The number of pairwise correlations between r variables is $(r^2 - r)/2$. Hence the effect size for r assessments is:

$$d = \left(\frac{r + (r^{2}-r)\bar{\rho}}{r^{2}}\right)^{-1/2} = \sqrt{\frac{r}{1+(r-1)\bar{\rho}}}$$

Sample size of n patients per group is thus:

$$n \propto \frac{1}{d^{2}} = \frac{1+(r-1)\bar{\rho}}{r}$$

Marginal change in sample size for r + 1 compared to r assessments is therefore:

$$\frac{1 + (r-1)\bar{\rho}}{r} - \frac{1 + r\bar{\rho}}{r+1} = \frac{1-\bar{\rho}}{r(r+1)}$$

It is common that trials investigate an endpoint that can be informatively measured before treatment. In trials of [...] Analysis of covariance (ANCOVA) has been repeatedly demonstrated to be the most powerful method of analysis for this type of trial [5,8,9]. The following discussion will thus refer only to ANCOVA (rather than, say, a t-test of change between baseline and follow-up).
Frison and Pocock [5] have derived a generalized sample size equation that can be used to assess power for ANCOVA where baseline measures are taken before treatment:

$$n \propto \frac{1+(r-1)\bar{\rho}_{post}}{r} - \frac{p\,\bar{\rho}_{mix}^{2}}{1+(p-1)\bar{\rho}_{pre}}$$

Here p is the number of baseline measures; the subscripts pre, post and mix refer, respectively, to the mean correlations within baseline measurements, within follow-up measurements and between baseline and follow-up measures (Figure 2).
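Two special cases help in reading this relationship. The sketch below assumes the ANCOVA sample size factor has the form shown on its first line; the symbol f(p, r) is introduced here purely for illustration and is not notation from the paper.

```latex
% Assumed form of the ANCOVA sample size factor (f is a label used only here)
f(p, r) = \frac{1 + (r-1)\bar{\rho}_{post}}{r}
          - \frac{p\,\bar{\rho}_{mix}^{2}}{1 + (p-1)\bar{\rho}_{pre}}

% No baseline measures (p = 0): the second term vanishes, leaving the
% expression for trials with post-treatment measures only
f(0, r) = \frac{1 + (r-1)\bar{\rho}_{post}}{r}

% One baseline and one follow-up measure (p = r = 1)
f(1, 1) = 1 - \bar{\rho}_{mix}^{2}
```

The first term falls as r increases and the subtracted term rises as p increases, so both kinds of repetition reduce the factor, each at a diminishing rate.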
As is the case for trials without baseline measures, there is no requirement that correlations be equal or that assessments be equally spaced.

Table 1 (partially recovered)
Endpoint [reference] | Timing of measures | Correlation
… [11] | Twice daily for five days | 0.58
Chronic neck pain [12] | Baseline and three weeks later | 0.39
Neck range of motion [13] | Before and after a single treatment | 0.88
Neck pain [13] | Before and after a single treatment | 0.9
Constant Murley score of shoulder pain and dysfunction [14] | Baseline and four weeks later | 0.57
Back pain by visual analog score [6] | Baseline and four weeks later | 0.
… [15] | Every three days | 0.89
Prostate specific antigen [16] | Four … | …

Trials without baseline measures
Table 2 shows the marginal relative decreases in sample size given various numbers of assessments and correlations. For example, if correlation between measures is 0.65, increasing the number of measures from two to three decreases sample size requirements by about 6%. As correlation is inversely related to intra-patient variability, additional measures are of greatest value when correlation is low. It is also clear that repeating measurements more than a few times has little effect on power. For example, for a correlation of 0.65, taking four repeated measures rather than three reduces the required sample size by only about 3%.

Tables 3, 4 and 5 show the effect on sample size of increasing the number of follow-up assessments and baseline assessments given different correlations for pre, post and mix. It is assumed for Tables 1 to 5 that neither the mean of the measures nor the mean correlation between measures depends on the number of measures. This will generally be the case where, for example, a decision needs to be made whether to measure the severity of a chronic condition for one or two weeks at baseline. However, care should be taken with possible exceptions. An example might be if an endpoint was measured twice a day instead of just once. In this case, correlations between measurements 12 hours apart might be higher than those taken 24 hours apart. A second possible exception is acute conditions of limited duration: measuring pain for seven days rather than four days after surgery will not improve precision if few or no patients are in pain after day four. Table 3 gives the most common situation of moderate correlation between baseline and follow-up measures and high correlation within measures. Table 4 shows moderate correlation for within and between measures, typical in an episodic condition. Table 5 shows very high correlations for studies where assessments are taken close together, or in the case of measures with low intra-patient variability such as laboratory data.

[Tables 3, 4 and 5: number of baseline measures (p) by number of follow-up measures (r).]

As an example, given the most common case of pre, post and mix at 0.7, 0.7, and 0.5, a trial with four baseline and four follow-up measurements would require 60% of the number of patients of a trial with just one baseline and one follow-up measurement; a trial with seven assessments at baseline and follow-up would require 54% as many patients.

2. Under the assumption that pre and post are similar, it is more valuable to increase the number of follow-up than baseline assessments. This makes intuitive sense: we should be more concerned about the precision of an endpoint than a covariate.

3. The marginal value of additional follow-up assessments is higher where baseline measurements are taken. Take the [...]

4. The only situation where it is worthwhile to make more than six or seven assessments is when correlation is moderate and similar between all time periods. This is most likely to be the case for episodic conditions such as headache, where scores at any one time will be poorly correlated with scores at any other time.
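The worked example and the rules of thumb can be checked numerically. The following is a minimal sketch, assuming the ANCOVA sample size factor attributed to Frison and Pocock [5] has the form (1 + (r - 1)ρ_post)/r - pρ_mix²/(1 + (p - 1)ρ_pre). The function names are hypothetical, and the default correlations are the "most common case" values of 0.7, 0.7 and 0.5 quoted in the text.

```python
def ancova_factor(p, r, rho_pre, rho_post, rho_mix):
    """Quantity assumed proportional to ANCOVA sample size per group.

    p: number of baseline measures (0 means no baseline)
    r: number of follow-up measures (at least 1)
    """
    post_term = (1 + (r - 1) * rho_post) / r
    if p == 0:
        return post_term  # no baseline covariate to adjust for
    baseline_term = p * rho_mix ** 2 / (1 + (p - 1) * rho_pre)
    return post_term - baseline_term


def relative_to_single(p, r, pre=0.7, post=0.7, mix=0.5):
    """Sample size relative to a trial with one baseline and one follow-up measure."""
    return ancova_factor(p, r, pre, post, mix) / ancova_factor(1, 1, pre, post, mix)


if __name__ == "__main__":
    # Worked example: four and seven assessments at both baseline and follow-up
    print(f"p=4, r=4: {relative_to_single(4, 4):.2f}")
    print(f"p=7, r=7: {relative_to_single(7, 7):.2f}")

    # Extra follow-up measures versus extra baseline measures
    print(f"p=1, r=4: {relative_to_single(1, 4):.2f}")
    print(f"p=4, r=1: {relative_to_single(4, 1):.2f}")

    # Proportional saving from a second follow-up measure, with and without a baseline
    for p in (0, 1):
        saving = 1 - ancova_factor(p, 2, 0.7, 0.7, 0.5) / ancova_factor(p, 1, 0.7, 0.7, 0.5)
        print(f"p={p}: second follow-up measure saves {saving:.0%}")
```

Changing the three default correlations lets the same sketch explore other structures, such as moderate and similar correlations across all time periods.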

Conclusion
Investigators may measure a continuous endpoint repeatedly because they wish to judge the time course of a symptom. In such cases, the number of repeat measures will depend upon the scientific interests of the investigators. Alternatively, investigators may use repeat measurement to increase the precision of an estimate. Though this is a particular concern for randomized or non-randomized comparative studies, it is also pertinent to a variety of other research designs: for example, epidemiologic cohort studies may take a measure such as blood pressure, prostate specific antigen or serum micronutrient levels at baseline and then determine whether this predicts development of disease; repeating baselines will improve the precision of such predictions.
Where measures are repeated to improve precision, decisions about the number of repeated measures, that is, the number of within-patient observations, mirror those of standard power calculation, which concerns observations of separate patients. In both cases, statistical concerns to minimize variance are balanced by logistical concerns to minimize number of assessments. Whilst an extensive literature has developed on various methods for selecting a particular number of patients to study, the number of assessments per patient has received little attention, perhaps because this has tended to be seen as a fixed characteristic of any particular trial design. Here I have shown that simple statistical considerations can be used to guide the number of repeated measures in repeated measures designs. Given the most common correlation structure, taking four baselines and seven follow-up measures dramatically improves power compared to a single baseline and follow-up; where no baseline is taken, four follow-up measures importantly improves power; however, the marginal value of including additional measures rapidly diminishes.