Balance algorithm for cluster randomized trials

Background: Within cluster randomized trials no algorithms exist to generate a full enumeration of a block randomization that balances covariates across treatment arms. Furthermore, for practical reasons multiple blocks are often required to fully randomize a study, and these may not be well balanced within blocks.

Results: We present a convenient and easy to use randomization tool to undertake allocation concealed block randomization. Our algorithm highlights allocations that minimize imbalance between treatment groups across multiple baseline covariates. We demonstrate the algorithm using a cluster randomized trial in primary care (the PRE-EMPT Study) and show that the software incorporates a trade-off between independent random allocations that are likely to be imbalanced and predictable deterministic approaches that would minimise imbalance. We extend the methodology of single block randomization to allocate to multiple blocks conditioning on previous allocations.

Conclusion: The algorithm is included as Additional file 1 and we advocate its use for robust randomization within cluster randomized trials.


Background
An essential part of a randomized controlled trial (RCT) is the process of allocating units to treatment or intervention groups (arms). This is defined as randomization and is carried out to ensure that systematic bias is minimized during the selection stage [1][2][3] and that randomized rather than model based inference can be used for analysis. Randomization is carried out so that any differences found between the treatment arms can be argued as a genuine effect or due to chance. The key principles of randomization are to ensure each unit is allocated randomly and that future allocations are concealed. This ensures that knowledge of previous allocations does not allow prediction of those in the future. There are many ways to undertake randomization, with the most straightforward rolling an unbiased die with the treatment groups equally represented. However, in this simple example nothing prevents a heavy imbalance in terms of absolute number or baseline covariates to one of the treatment arms.
A solution to minimize the potential imbalance between treatment groups was originally reviewed by Box [4] then extended independently by Taves [5] and Pocock and Simon [6]. These authors detailed the steps for implementing methods that dynamically allocate patients to treatment groups so as to minimize the imbalance in baseline characteristics between groups, rather than leaving balance to chance. After minimization was introduced, further methods were published but were felt to be suboptimal for the application [7,8]. Since then, the majority of views expressed have been in favour of widespread incorporation of minimization [9][10][11]. International guidance for the pharmaceutical industry has been the most notable critic of minimization and highlighted the lack of concealment [12], whereas other authors argue that in an RCT setting it is an additional administrative burden [13]. The Committee for Proprietary Medicinal Products (CPMP) and the International Conference on Harmonisation (ICH) guidance on statistical principles for clinical trials advised that deterministic designs should be avoided and a random element included [12]. Many authors who have expressed reservations about minimization acknowledge that in settings where few units need to be randomized it can offer substantial benefits. Therefore minimization is an important consideration for cluster randomized trials, where randomization occurs at the centre rather than at the subject level. However, the key to minimization being accepted is the introduction of randomness to limit the ability of those involved in a trial to predict allocations. This can be supported by masking and concealment to reduce the risk of allocations being known prior to randomization [14].
Cluster randomized trials may recruit all units prior to randomization. Baseline characteristics can then be used to calculate the imbalance between treatment arms for each allocation, allowing minimization with complete knowledge of baseline characteristics across the sample. Raab and Butcher introduced two criteria to evaluate randomization methods in cluster randomized trials across baseline covariates and in doing so described a simple imbalance measure between treatment arms [15]. Using this measure they generated allocations through randomization block designs, which have been criticized for advance sequential randomization [16][17][18]. However, delaying the allocation until all units within a block have been enrolled adequately deals with the issue of concealment [2,19,20]. Since the initial work of Raab and Butcher was reported, other studies have used the methodology [21][22][23]. In a primary care study randomised by practice, this would mean all practices being identified and enrolled prior to randomization.
A natural extension of this is to consider blocks of units as they are enrolled. For a study in primary care this may mean practices which rapidly take up the offer of taking part in a study (and for whom the local ethical and governance arrangements are completed) for a first block, then those who take longer forming a second. With blocks structured pragmatically, it is important to balance between blocks as well as within.
At present no public domain software has been made available to calculate a within and between block imbalance measure using baseline covariate information. Our software is freely available and runs within the R software environment [24].

Randomization of a single block of units
Where all units are fully identified in advance, a single block can be used for the study. The algorithm carries out a complete enumeration of all allocations in a two-treatment arm study. When the number of units within a block to be allocated is even, an equal number of units would be allocated into each of the treatment arms. For cases with an odd number of units within block a near equal allocation is generated between the two treatment arms. Once the set of possible enumerations has been generated the imbalance statistic is calculated using the baseline covariates for each allocation across the two treatment arms.
It should be remembered that allocating units within a two arm study design incurs a natural symmetry. Since the algorithm does not assign treatment arm, only 0 or 1 as a treatment arm code, these codes can be interpreted as either treatment arm. Thus, a design with the first half of units allocated to treatment arm 0 and the remainder allocated to 1 would be identical to one with the first half of units allocated to treatment arm 1 and the remainder to 0. Therefore, for single block designs the software always allocates the first unit into group 1.
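The enumeration and the symmetry argument above can be sketched in a few lines. The following is a minimal illustration in Python, not the authors' R scripts; the function name `enumerate_first_block` is hypothetical, and the handling of odd block sizes (letting arm 1 hold either the smaller or the larger share) is an assumption about what "near equal" means.

```python
from itertools import combinations


def enumerate_first_block(n_units):
    """Return every (near) equal 0/1 allocation with unit 0 fixed in arm 1.

    Fixing the first unit in arm 1 removes the mirror-image duplicates
    that arise because the codes 0 and 1 are interchangeable.
    Assumption: for odd n_units, arm 1 may hold either share.
    """
    arm_one_sizes = sorted({n_units // 2, (n_units + 1) // 2})
    other_units = range(1, n_units)
    allocations = []
    for size in arm_one_sizes:
        # Unit 0 is already in arm 1, so choose the remaining size - 1 units.
        for chosen in combinations(other_units, size - 1):
            row = [0] * n_units
            row[0] = 1
            for unit in chosen:
                row[unit] = 1
            allocations.append(tuple(row))
    return allocations
```

For an even block the count is half the number of balanced splits, for example 35 rather than 70 allocations for eight units.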
The imbalance measure as calculated by Raab and Butcher [15] was coded -1 and 1 for the two treatment groups; here it is equivalently coded within block using 0 and 1. It is calculated using $x_{ik}$, a matrix of 0s and 1s denoting allocation to treatment arm for each unit ($i$) and allocation ($k$), and $w_{ij}$, the matrix of equally weighted z-scores for the baseline factors, and is given by Equation 1, where $x_{ik}$ is the $i$th unit of the $k$th allocation, $w_{ij}$ is the $i$th unit of the $j$th baseline covariate, $n_1$ is the number of units allocated to the first block and $M$ is the number of baseline factors.
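To make the definitions concrete, the sketch below computes an imbalance statistic for one allocation. The exact form used here is an assumption chosen to be consistent with the symbols defined above and with the stated equivalence between the 0/1 and -1/1 codings; it should not be read as the published Equation 1:

$$B_k = \sum_{j=1}^{M} \left( \sum_{i=1}^{n_1} (2x_{ik} - 1)\, w_{ij} \right)^2$$

The function names `z_scores` and `imbalance` are hypothetical, and the use of the sample standard deviation for the z-scores is also an assumption.

```python
from statistics import mean, stdev


def z_scores(column):
    """Standardise one baseline covariate (assumes the sample sd is used)."""
    centre = mean(column)
    spread = stdev(column)
    return [(value - centre) / spread for value in column]


def imbalance(allocation, w):
    """Assumed imbalance statistic for a single 0/1 allocation.

    allocation: sequence of 0/1 codes, one per unit.
    w: list of rows, one per unit, each holding M z-scores.
    The 0/1 code is mapped to -1/+1 via (2x - 1), so each inner sum is
    the difference in covariate totals between the two arms.
    """
    total = 0.0
    n_covariates = len(w[0])
    for j in range(n_covariates):
        difference = sum((2 * x - 1) * row[j] for x, row in zip(allocation, w))
        total += difference ** 2
    return total
```

A value of zero means that the covariate totals match exactly between arms under this assumed measure; larger values indicate greater imbalance.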
The algorithm will provide a set of optimal allocations, the size of which depends on the number of units to be randomized. In accordance with the principles of the ICH guidance on randomness it is recommended that the final design is sampled from a set of optimal allocations [12,14]. We have offered guidance on the minimum size of the sets required in Table 1. We use the best balanced 25% of allocations for blocks of between 8 and 11 units, the best balanced 100 allocations for blocks of between 12 and 17 units, and the best balanced 1,000 allocations for blocks of more than 17 units. The size of the block affects the level of predictability and concealment, with smaller block sizes more susceptible to bias [25].
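The selection rule can be sketched as follows. This is an illustration in Python rather than the authors' R code; the names `candidate_set_size` and `pick_allocation` are hypothetical, rounding the 25% share upwards is an assumption, and blocks of fewer than eight units raise an error because the text does not state a rule for them.

```python
import math
import random


def candidate_set_size(n_units, n_allocations):
    """Number of best balanced allocations to sample from, by block size."""
    if n_units < 8:
        raise ValueError("no rule is stated in the text for blocks under 8 units")
    if n_units <= 11:
        # Best balanced 25% (rounding up is an assumption).
        return max(1, math.ceil(0.25 * n_allocations))
    if n_units <= 17:
        return min(100, n_allocations)
    return min(1000, n_allocations)


def pick_allocation(allocations, scores, n_units, rng=None):
    """Sample one allocation at random from the best balanced candidate set."""
    rng = rng or random.Random()
    # Order allocation indices from smallest to largest imbalance statistic.
    order = sorted(range(len(allocations)), key=lambda k: scores[k])
    keep = order[:candidate_set_size(n_units, len(allocations))]
    return allocations[rng.choice(keep)]
```

Sampling from a set, rather than taking the single best allocation, is what introduces the random element that the ICH guidance asks for.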
Once the final allocation has been selected, the choice of which group becomes intervention or control should be made at random.

Randomization of multiple blocks of units
As described above there are often practical reasons for using a number of blocks. Therefore, there is a desire to randomize smaller blocks as they become available. Computationally, the enumeration of 20 practices balanced into a two arm design would lead to 184,756 possible allocations. Beyond 20 units the total number of enumerated allocations quickly becomes a computationally intensive problem, where the maximum number of units able to be randomized is dependent upon the amount of available RAM (Table 2). Therefore, it may be prudent to randomize in blocks to overcome these two difficulties. Smaller block sizes have an increased chance of selection bias through inadequate concealment or inquisitive investigators [26]; however, if all units are enrolled prior to randomisation and informed of their allocation at the same time this is unlikely to be an issue.
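The growth in the number of allocations can be checked directly with a binomial coefficient. The short Python sketch below reproduces the figure of 184,756 for 20 units and tabulates larger blocks; the helper names are hypothetical, and the cell count (units multiplied by allocations) is only a rough proxy for the memory needed to hold a full enumeration, not a statement about the R implementation.

```python
from math import comb


def n_balanced_allocations(n_units):
    """Number of ways to place half of the units (rounded down) in one arm."""
    return comb(n_units, n_units // 2)


def approx_cells(n_units):
    """Rough size of a full enumeration matrix: one cell per unit per allocation."""
    return n_units * n_balanced_allocations(n_units)


# Tabulate how quickly the enumeration grows with block size.
for n_units in range(16, 31, 2):
    print(n_units, n_balanced_allocations(n_units), approx_cells(n_units))
```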
Second and subsequent blocks should be allocated using the selected design of earlier blocks. The structure of the input allocation from earlier blocks includes the same header with a single row of 0 and 1's allocating units into the two treatment arms.
For an even block size the allocation will be equally split between the two treatment arms (regardless of previous blocks). For odd block sizes the previous blocks' allocations will be considered, since the arm that receives the larger number of units depends on whether previous blocks had an equal number of units within each treatment arm. For each additional block the balance measure is conditional on the selected allocation of the first block, where $w_{ij}$ are the within block z-scores, and the balance measure becomes Equation 2, where $n_2$ is the number of units allocated to the additional block.
The output provided from multiple block designs is similar to that of a single block design. The differences in methodology between the allocation of the first and additional blocks are the following:
• if the block size is odd, the allocation of the larger number of units will depend on previous block allocations,
• the symmetry which existed in the first block no longer exists in additional blocks, since the treatment arm code has already been allocated. This changes the number of units able to be allocated within a block (Table 1).
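The two differences listed above can be sketched as follows. This Python illustration is not the authors' R code and rests on labelled assumptions: for an odd block the larger share is given to whichever arm holds fewer units so far (either arm when they are tied), and the conditional imbalance simply adds the arm differences carried over from the earlier block to those of the new block before squaring. The function names are hypothetical.

```python
from itertools import combinations


def enumerate_additional_block(n_units, prev_allocation):
    """Enumerate 0/1 allocations for a later block; no symmetry reduction applies."""
    ones_before = sum(prev_allocation)
    zeros_before = len(prev_allocation) - ones_before
    low, high = n_units // 2, (n_units + 1) // 2
    if low == high:
        arm_one_sizes = [low]          # even block: equal split
    elif ones_before < zeros_before:
        arm_one_sizes = [high]         # assumption: top up the smaller arm
    elif ones_before > zeros_before:
        arm_one_sizes = [low]
    else:
        arm_one_sizes = [low, high]    # assumption: tie, either arm may be larger
    rows = []
    for size in arm_one_sizes:
        for chosen in combinations(range(n_units), size):
            picked = set(chosen)
            rows.append(tuple(1 if i in picked else 0 for i in range(n_units)))
    return rows


def conditional_imbalance(new_allocation, w_new, prev_allocation, w_prev):
    """Assumed imbalance of a new block given the fixed earlier allocation."""
    total = 0.0
    for j in range(len(w_new[0])):
        carried = sum((2 * x - 1) * row[j] for x, row in zip(prev_allocation, w_prev))
        added = sum((2 * x - 1) * row[j] for x, row in zip(new_allocation, w_new))
        total += (carried + added) ** 2
    return total
```

Because the arm codes are already fixed by the first block, an even block of six units yields all 20 balanced splits here rather than the 10 that a first block of six would give.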

The baseline covariate data
Numerical covariate information can be used directly in 'covariate_csv'. However, categorical data should be coded as below:
• If the factor has a natural ordering and would be deemed ordinal categorical then ordered scores should be considered. For example, none, mild, moderate and severe disease levels could be coded within the data as 0,1,2,3. However, depending on the extent of the difference between the ordinal categories these might be coded alternatively on the log 2 scale i.e. 0,1,2,4. We advise you to seek statistical and clinical advice for guidance before proceeding with these [26].
• If the factor lacks a natural ordering and is considered nominal categorical then the number of levels within the factor will need to be considered. Nominal categorical factors can be coded using orthogonal dummy variables which identify individual factor levels (Table 3).
• It should be remembered that units should be equally allocated amongst each of the levels of each factor. Therefore, it is recommended to favour factors with few levels and to include few factors with more than two levels; this can often be achieved by aggregating related levels. However, a factor with three or four levels can be coded using two variables in the covariates data sheet, and a factor with 5 to 8 levels using 3 variables (see Table 3).
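The variable counts quoted above (two variables for three or four levels, three variables for five to eight levels) match a binary coding of the level index. The Python sketch below illustrates that idea only; the exact codes in Table 3 are not reproduced in this text, so the particular bit patterns assigned here, and the function names, are assumptions.

```python
def n_dummy_variables(n_levels):
    """Number of 0/1 variables needed to label n_levels distinct levels."""
    if n_levels < 2:
        raise ValueError("a factor needs at least two levels")
    # (n_levels - 1).bit_length() gives 1 for 2 levels, 2 for 3-4, 3 for 5-8.
    return (n_levels - 1).bit_length()


def encode_levels(levels):
    """Map each level to a tuple of 0/1 codes (assumed binary pattern)."""
    width = n_dummy_variables(len(levels))
    return {
        level: tuple((index >> bit) & 1 for bit in reversed(range(width)))
        for index, level in enumerate(levels)
    }
```

Each tuple position would become one column in the covariates data sheet, for example two columns for a three level factor.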

Results
The software has been used within the PRE-EMPT study [27]. This is a study to evaluate the impact of training primary care health professionals in behaviour change counselling. The study was randomized at the practice level and recruited patients prior to their appointment with a GP or nurse and followed up 3 months later. Practices within block were all enrolled into the study, then randomized and then informed of their allocation. Two blocks of 14 and 15 were used respectively, to allow for differing rates of response from practices and approvals being gained.
Here we present data from 29 general practice surgeries where the first 14 (rows) were allocated in block 1, and the remaining 15 were allocated in block 2, using 'covariate.csv' as the baseline covariates data. The baseline data included two covariates, the first being general practice list size (the number of patients registered at that practice) and the second the Townsend deprivation index aggregated to the general practice level [28][29][30]. The allocation for block two was conditional on that of block one, which has been included as 'block_one_allocation.csv'.
To implement the randomization algorithm carry out the following:
1. Check the system requirements in the availability and requirements section, ensuring that you have installed R version 2.4 or later.
2. Create a main folder and, within it, a subfolder called 'rcode'.
3. Save "Example_Allocation_Execution_Code.R" and "covariate.csv" within the main folder, then "randomisation.R" and "randomisation 2.R" within the subfolder.
4. Open R and update the location of the 'area', then highlight and submit the area and source code (Figure 1).
5. To run the algorithm on block one, update the data within the block one allocation code and submit. Likewise, to run later block allocations, update the data within the later block allocation code and submit this.
The following variables are required (Figure 1):
• folder, the location of the data, typically the same as the 'area';
• covariates, the file containing the baseline covariate information;
• unit, the number of units within the block;
• outfile, the file produced containing the optimal set of enumerated allocations ordered by the imbalance statistic;
• figure, the figure of the distribution of imbalance statistics across all enumerated allocations;
• infile, the previously allocated block data (only required for blocks two or later).
This will generate a file with the set of optimally balanced allocations ordered by the imbalance statistic called 'block_one.csv', where the rows are the allocations and the columns are the general practices as taken from the covariate file. A histogram that visualises the distribution of the imbalance statistics calculated from all allocations is shown in Figure 2.
From 'block_one.csv' one allocation was selected at random and treatment arm codes 0 and 1 were allocated to control and intervention arms (highlighted in 'block_one.csv'). Later the second block final allocation was selected and can be found in 'block_two.csv' (highlighted again). Final allocations for both blocks were randomly selected by the independent statistician on the Trial Steering Committee. Table 4 presents summary statistics for the two baseline covariates including the number randomized into each arm (n), the mean and standard deviation (sd) within block and across blocks. This shows little evidence of imbalance between the two arms, either within or between blocks.
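A balance check of the kind summarised in Table 4 can be sketched as below. This Python illustration computes n, mean and standard deviation per arm for one covariate; the function name `summarise_by_arm` is hypothetical, and any values passed to it in an example would be made-up inputs rather than PRE-EMPT data.

```python
from statistics import mean, stdev


def summarise_by_arm(allocation, values):
    """Return n, mean and sd of one covariate within each treatment arm code."""
    summary = {}
    for arm in (0, 1):
        arm_values = [v for x, v in zip(allocation, values) if x == arm]
        summary[arm] = {
            "n": len(arm_values),
            "mean": mean(arm_values),
            # Sample sd; requires at least two units in the arm.
            "sd": stdev(arm_values),
        }
    return summary
```

Running the function once per block and once on the concatenated blocks would give the within block and across block comparisons.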

Discussion
The allocation of the first block involves the full enumeration of each design. These are used to calculate an imbalance measure between treatment arms [Equation 1]. Each additional block was randomized conditional on the previously allocated design as a fixed starting point [Equation 2]. In doing so the algorithm not only provides access to software but extends the work of Raab and Butcher [15] by allowing multiple blocks to be randomized, and ensures that allocations are balanced between the two treatment arms.
A minimum number of recruited units available for randomization is crucial to maintain concealment and allocate a minimally imbalanced design. We propose that given the symmetry and importance of the first block this should be randomized with at least eight units. However, this could be reduced for subsequent blocks to six units, see Table 1 for more details of the total number of allocations that would be included in the allocation sets.
We advocate that the allocation is selected at random from the set of allocations with the smallest imbalance statistics. Typically when using larger block sizes a random sample should be made from the 1,000 optimally balanced designs. However, for smaller block sizes fewer allocations should be used to randomly select from (Table 1). The set sizes were determined pragmatically and are dependent on block size, but we recommend that further work is carried out to quantify their effect, as we recognise these will affect the degree of randomness introduced.
It is envisaged that these algorithms will be used not only by statisticians, but will also give medical researchers with minimal access to a statistician a valuable tool to help randomize their trials in a robust way. In the primary care setting these algorithms have already been used to randomize centres within cluster randomized trials. In future the algorithms could be extended to incorporate improved efficiency to deal with larger block sizes. Furthermore, we will weight the balance measure by practice recruitment for those recruited practices to deal with under recruiting practices and examine the impact on future block allocations.
For details of a wide range of alternative randomization software algorithms see Professor Martin Bland's randomization software services pages [31].

Conclusion
The software is important for multi disciplinary teams needing to address the issues surrounding randomization.
It allows the inclusion of information from baseline covariates to influence the allocation of the units to treatment groups, without disclosing or causing untoward doubt to the concealment.
It is intended these algorithms are an easy to use and convenient tool to be used by researchers who wish to minimize imbalance between treatment arms across multiple baseline stratification variables, ensuring that ICH guidance is adhered to.

Figure 1. A screen dump from 'Example_Allocation_Execution_Code.R'. To allocate, change the area and submit the R-script. Panels: area location and source code; block one allocation code; additional block allocation code.

Additional File 1
Cluster randomization allocation algorithm version 1. Algorithms scripted in R to provide robust cluster randomization.
Click here for file [http://www.biomedcentral.com/content/supplementary/1471-2288-8-65-S1.zip]

Figure 2. The distribution of imbalance statistics across each of the generated allocations for block one.