A simulation study for evaluating the performance of clustering measures in multilevel logistic regression

Background Multilevel logistic regression models are widely used in health sciences research to account for clustering in multilevel data when estimating effects on subject binary outcomes of individual-level and cluster-level covariates. Several measures for quantifying between-cluster heterogeneity have been proposed. This study compared the performance of between-cluster variance based heterogeneity measures (the Intra-class Correlation Coefficient (ICC) and the Median Odds Ratio (MOR)), and cluster-level covariate based heterogeneity measures (the 80% Interval Odds Ratio (IOR-80) and the Sorting Out Index (SOI)). Methods We used several simulation datasets of a two-level logistic regression model to assess the performance of the four clustering measures for a multilevel logistic regression model. We also empirically compared the four measures of cluster variation with an analysis of childhood anemia to investigate the importance of unexplained heterogeneity between communities and community geographic type (rural vs urban) effect in Malawi. Results Our findings showed that the estimates of SOI and ICC were generally unbiased with at least 10 clusters and a cluster size of at least 20. On the other hand, estimates of MOR and IOR-80 were less accurate with 50 or fewer clusters regardless of the cluster size. The performance of the four clustering measures improved with increased clusters and cluster size at all cluster variances. In the analysis of childhood anemia, the estimate of the between-community variance was 0.455, and the effect of community geographic type (rural vs urban) had an odds ratio (OR)=1.21 (95% CI: 0.97, 1.52). The resulting estimates of ICC, MOR, IOR-80 and SOI were 0.122 (indicative of low homogeneity of childhood anemia in the same community); 1.898 (indicative of large unexplained heterogeneity); 0.345-3.978 and 56.7% (implying that the between community heterogeneity was more significant in explaining the variations in childhood anemia than the estimated effect of community geographic type (rural vs urban)), respectively. Conclusion At least 300 clusters with sizes of at least 50 would be adequate to estimate the strength of clustering in multilevel logistic regression with negligible bias. We recommend using the SOI to assess unexplained heterogeneity between clusters when the interest also involves the effect of cluster-level covariates, otherwise, the usual intra-cluster correlation coefficient would suffice in multilevel logistic regression analyses.


Background
Biomedical research studies often collect binary outcome data that have a multilevel structure. For example, in estimating the efficacy of a new drug on the treatment of patients with a viral infection (recovered or not), patients could be nested within the admitting hospitals. Similarly, for a public health specialist assessing the effect of a feeding intervention on the growth of a child (stunted or not), children could be nested within communities which are also nested within districts. In both examples, patients or children (level-1 units) may have little variation in background characteristics within each hospital or community (level-2 units). On the other hand, hospitals or communities may have patients or children drawn from a wide range of background characteristics, which may affect the within-hospital or within-community homogeneity. A standard logistic regression model, which assumes that the individual outcomes are independent will not consider the hierarchical structure of the data within hospitals or communities, and will consequently fail to distinguish between these two manifestations in each example. Ignoring the dependence between the outcomes may result in an exaggerated number of independent patients or children within hospitals or communities. For the strong homogeneity between patients in hospitals or children in communities, the effect would be an underestimation of the standard errors for the covariate effects, which may lead to incomplete and misleading conclusions on the association between the treatment and the viral infection or the feeding intervention and child growth [1][2][3].
Multilevel logistic regression models [4], also called hierarchical logistic models [5], are appropriate statistical techniques for analyzing clustered binary outcome data. The use of multilevel models for the analysis of multilevel data structures has exponentiated since the 1990s due to the rise in computational power [2]. The analysis of multilevel data is undertaken mostly to serve two purposes: to provide correct inferences by accounting for the variation that may exist between level-2 units, such as hospitals or communities, and to assess the effects of level-2 risk factors and the unexplained between-cluster (level-2 unit) variation in explaining differences between individuals' (level-1 units) proneness to the binary outcome [1,6,7]. Several techniques in multilevel logistic regression have been proposed to explore whether the individual binary outcomes are explained by cluster-level covariates and the unexplained between-cluster heterogeneity. Because of its simplicity in interpretation, especially in linear mixed models, the Intra-class correlation coefficient (ICC) (also known as the variance partition coefficient (VPC)) is commonly used to measure the unobserved heterogeneity between clusters in multilevel models [8,9]. In multilevel logistic regression, the ICC loses its natural interpretation which is easily understood and interpreted in the linear mixed model. Thus, several approaches have been developed to provide interpretable and meaningful information regarding the between-cluster heterogeneity and the effect of cluster-level covariates. These methods include the Median Odds Ratio (MOR), the 80% Interval Odds Ratio (IOR-80) and the Sorting Out Index (SOI) [2,10,11]. Though they are rarely used, they have nonetheless offered explanations for the unobserved variations between clusters and the contribution of clusterlevel predictors on the individual binary outcomes [3,12] .
This paper assessed the performance of the ICC, MOR, IOR-80 and SOI in measuring the importance of clusterlevel covariates and unexplained between-cluster heterogeneity in the analysis of a two-level logistic regression model. We compared the performance using simulation studies as well as an empirical application to childhood anemia data from Malawi.

Two-level logistic regression model
Suppose there are H clusters and each cluster has N h individuals. Let Y hi be the binary outcome taking a value of 1 or 0 for each individual i in cluster h. Also, let us assume that the probability of Y hi = 1 is π hi , and π hi /1 − π hi is the odds of Y hi = 1, For each individual hi, there is a vector X hi containing values of p predictors, including cluster-level covariates. The equation of a two-level logistic regression model is given as Equation (1) is the conditional logistic model where π hi is the probability that each individual i in cluster h takes on the value Y hi = 1 and depends on the value of the cluster random effect, u h , which captures the effect of unmeasured cluster-level variability that affects the outcome propensity. They are usually assumed to be normally distributed with a mean of 0 and variance of σ 2 h , and are uncorrelated with the individual and cluster-level predictor variables. In this way, heterogeneity of the binary outcome (Y hi = 1) propensity between the clusters would have been accounted for. For example, increasing values of u h indicate that the associated clusters have higher propensities on the binary outcome (Y hi = 1) and greater heterogeneity between the outcomes is implied by higher values of σ 2 h .

Intra-class correlation coefficient (ICC)
In biomedical research studies, the intra-class correlation coefficient (ICC) statistic is used in multilevel analysis to measure the extent of within-cluster homogeneity, which is caused by the multilevel structure of the data [13,14]. The statistic is obtained by calculating the ratio of the amount of between-cluster variance to the sum of between-and within-cluster error variances, and it is computed as follows: where σ 2 u and σ 2 e are the variances of the between-cluster and the within-cluster errors, respectively. Since adding the two variances give the total variance of the outcome variable, ρ is often interpreted as the proportion of the variance explained by clustering, or as a measure of within-cluster homogeneity. If the value of ICC is 0 then the observed outcomes are completely independent, and if the value is 1 then the outcomes are completely dependent on the cluster they belong to. Thus, the ICC ranges from 0 to 1, and using cut-off points in Koo et al. [13] and Merlo et al. [15], values of ICC of less than 0.50, between 0.50 and 0.75, between 0.76 and 0.90 and greater than 0.90 indicate low, moderate, high and very high correlations, respectively. For the linear mixed model, both the random effect and individual error terms are on the same linear scale, which makes it easier to compute the total variance in the outcome variable. However, for the multilevel logistic regression model, the cluster-level variance is on the logistic scale which cannot be directly understood in terms of variations between clusters in the prevalence of the binary outcome. Furthermore, the individual-level variance is taken from an underlying standard logistic distribution. The value of the variance for this underlying distribution is σ 2 e = π 2 /3 = 3.29. Thus, as pointed out in Merlo et al. [12] and Sanagou et al. [3], the usefulness of ICC is somehow limited, largely due to its dependence on linear mixed model concepts regarding partitioning of the total variance, its association with the prevalence of the outcome variable, and the different scales between cluster-level and individual-level error variances. Thus, for public health and epidemiological usages, explaining the partitioning of total variance and ICC for dichotomous outcomes is most challenging.

The median odds ratio (MOR)
As an alternative to the ICC-based clustering measure in multilevel logistic regression, Larsen and Merlo [10] proposed expressing the between-cluster variance on the odds ratio (OR) scale, which could then be easily interpreted as it is on the same scale on which the effects of predictor variables are measured. For the calculation of MOR, one considers a set of all odds ratios that could be obtained by comparing two individuals with identical covariates but from two different clusters, where one has higher random effect and the other with a lower random effect. The two random effect values are ordered so that the odds ratio is always at least one. The median of these odds ratios provides the Median Odds Ratio (MOR) statis-tic and, assuming that the random effects are normally distributed, the MOR is given as [10]: where −1 (0.75) is the 75th percentile of a standard normal distribution. If σ 2 u = 0, then MOR = 1, which is indicative of the absence of between-cluster heterogeneity, while if σ 2 u is large (i.e. considerable between-cluster heterogeneity), the MOR will be much larger than 1 [3,16]. Thus, the MOR measures heterogeneity between individuals belonging to different clusters.

Interval odds ratio -80% (IOR-80)
The interval odds ratio (IOR), as developed in Larsen and Merlo [10] assesses the impact of cluster-level predictor variables on the individual binary outcomes. One starts by considering two different clusters that differ by one unit on a given cluster-level predictor variable. Two individuals with the same values for the remaining cluster-level predictor variables and all the individual-level covariates are chosen, one from each of the two clusters. Then the odds ratio (OR) for the predictor is computed, which often measures the effect of moving to a higher value of the given cluster-level predictor (for a binary cluster-level predictor variable, the OR is just the odds of it being 1 versus 0). Considering all possible pairs of individuals and their ORs, the 80% interval around the median of the distribution of the OR values provides the 80% interval odds ratio (IOR-80). Under the assumption of a normally distributed random effect, the IOR-80 has a lower limit of IOR L = exp(α + (2 × σ 2 u ) × (−1.2816)) and an upper limit of IOR U = exp(α + (2 × σ 2 u ) × (+1.2816)) [10,17], and values -1.2816 and +1.2816 represent the 10th and 90th percentiles of the standard normal distribution, respectively. Thus, the limits of the IOR-80 are given as: where α is the regression coefficient for the cluster-level predictor of interest. If the between-cluster variance is small, then the IOR-80% interval will be narrow, and if the between-cluster variance is large, then the IOR-80% interval will be wider. IOR-80% will contain a 1 if the cluster heterogeneity is large compared to the effect of the cluster-level predictor. If the interval will not contain 1, then the effect of the cluster-level predictor (positively or negatively) is larger compared to the unexplained between-cluster variation [10]. Thus, the addition of the cluster-level predictor effect helps to compare the importance of between-cluster heterogeneity and cluster-level predictors in understanding the differences in individual proneness to the outcome, Y hi = 1. However, the choice of the 80% interval width is rather arbitrary. Longer or shorter interval widths may or may not contain a 1 when the 80% width actually contains a 1.

Sorting out index (SOI)
To remedy the problem of IOR depending on the choice of the interval width, Chaix et al. [18] suggested using the percentage in the distribution of the odds ratios as described under the IOR for which the value is greater than 1. Thus, if the cluster residual variance is high, and the effect of the cluster-level predictor is not able to distinguish high-risk clusters from low-risk clusters, then the odds ratios between clusters for the cluster effect would be greater than 1 in half of the cases. On the other hand, if the cluster-level predictor has a huge positive effect compared to the between-cluster heterogeneity, then the odds ratios would be greater than 1 in almost all the pairwise cases of individuals. The percentage range of 50-100% is often used as a sorting out index (SOI) to assess and measure the extent to which a given cluster-level covariate is important compared to the between-cluster heterogeneity in explaining the individual proneness to Y hi = 1. The cluster-level covariate could be assumed to be negatively associated with the outcomes, in which case one would formulate the SOI differently. It will still be in half of the cases if it is very weak, otherwise it will be in none of the cases where the odds ratio is greater than 1; then the SOI will range between 0 to 50%. Thus, the percentage range of the SOI can be between 0% and 100% [11,19]. Assuming the random effect is normally distributed, the sorting out index (SOI) is calculated as: The value of for α √ 2σ 2 u can be evaluated from a standard normal distribution table and is expressed as a percentage. This information indicates the degree to which the cluster-level predictor is of importance as compared to the unexplained between-cluster heterogeneity. An SOI of near 0% or 100% implies an absence of betweencluster variability and a high negative or positive effect of the cluster-level predictor, respectively. The SOI of 50% implies that the effect of cluster-level predictor is low [19], as compared to the unexplained between-cluster heterogeneity.

Simulation protocol
We compared the performance of ICC, MOR, IOR-80, and SOI on simulated two-level binary datasets with individual and cluster-level covariates. We then fitted twolevel logistic regression models to the datasets and compared the resulting estimates to the true values by the root mean square errors. For the simulation, the response variable, Y hi , was distributed as a Bernoulli random variable. The following logit model on the probability, π hi = Pr(Y hi = 1), was used to simulate the binary responses: We assumed the cluster-specific random effect, u h , followed a normal distribution with a mean of 0 and a variance of σ 2 u (i.e. u h ∼ N(0, σ 2 u )). The fixed effect parameters were fixed at β 0 = 0.5, β 1 = −1.5, β 2 = 0.3, α = −0.3, and both level-1 covariates X hp ∼ N(0, 1) for p = 1, 2, and level-2 covariate X h ∼ Bernoulli(0.5). For example, to generate 10 clusters with each cluster having 5 individuals with the outcome Y hi = 1 or 0, the following steps were implemented; a) We drew 10 random effects, u h , for the 10 clusters distributed from a normal, u h ∼ N(0, σ 2 u ), for example a value of σ 2 u = 0.2. b) Take the first value of the generated random effect, say u h = 1.78.
Repeat the following steps (c-f) 5 times for cluster 1 with the random effect value of 1.78. c) Draw X hi1 from a N(0, 1) and X hi2 from a N(0, 1) and X h from Bernoulli(0.5). d) Substitute the generated values into the equation: f) We then draw Y hi from a Bernoulli with parameter π hi , which results in either 1 or 0 as a binary outcome for subject 1 in the cluster 1. g) Return to step b) and pick the next value, for example -0.568, as the random effect value for cluster 2. This is repeated 10 times to generate 10 clusters with known random effect values, each cluster having 5 individuals.
Here, we fixed the values for σ 2 u at 0.2, 0.5, 1.0 and 1.5. The resulting true values of ICC, MOR, SOI and IOR-80 are shown in Table 1.
The process was repeated for the varied number of clusters (i.e., H = 10, 50, 100, 300, 500 clusters), number of observations within each cluster (i.e., N h = 5, 20, 50, 100, 250). For each combination of the number of clusters, cluster size, and between cluster variance, 1000 datasets were simulated. The root mean square error (RMSE) , where E i was the estimated value and O i the true value as shown in Table 1, were computed to determine their accuracies. We used the open software R (R Core Team, 2020) and packages rms (regression modeling strategies) [20], lmerTest and pbkrtest in simulating and analyzing the data.

Simulation results
Tables 2 and 3 present the accuracies of estimates of the fixed effects and measures of heterogeneity for the twolevel logistic regression model for a varied number of clusters, cluster sizes and cluster variances. Tables 4 and 5 present results of estimated values of the fixed effects and measures of heterogeneity for the 2-level logistic regression model, under various numbers of clusters, cluster sizes, and cluster variances.

Accuracy of estimated parameters
We observe that the estimated parameters for the fixed effects were accurate for all cluster-levels and cluster variances, particularly for large clusters and cluster sizes. Lower root mean square errors were particularly observed for larger clusters, providing evidence of unbiasedness for the estimatesβ's andα for increased clusters and cluster sizes as shown in Tables 2 and 3. The RMSEs forβ 0 were slightly biased for 10 clusters with a cluster size of 5. However, the estimates were unbiased for at least 10 clusters with a cluster size of at least 20 for all cluster-level variances. At a cluster variance of 1.0, the estimates forβ 0 were unbiased for at least 100 clusters with a cluster size of at least 5. Lower RMSEs forβ 1 were observed for at least 10 clusters with a cluster size of at least 5 for all cluster variances. Bias forβ 1 was negligible for 100 clusters with cluster size of at least 50 at a cluster variance of 0.5. At a cluster variance of 0.2,α was slightly biased for 10 clusters with a cluster size of 5. However, we observe unbiased estimates of α for 100 clusters with cluster sizes of 50, and 500 clusters with a cluster size of at least 5. The bias ofα improved with the increase in clusters and cluster size for all cluster variances.

Intra-class correlation coefficient (ICC)
Tables 2 and 3 show that the estimates for the ICC were less biased when the cluster variance was 0.2 and the estimates of the ICC were unbiased with 100 clusters and clusters sizes of 20 or more. At the cluster variance of 0.5, the ICC was estimated with less bias for at least 100 clusters of sizes 50 or larger. At level-2 variance of 1.0, the ICC was slightly biased for 10 clusters with cluster size of 5. However, the bias for estimating the ICC decreased for at least 50 clusters with cluster size of at least 20. At a level-2 variance of 1.5, the lowest RMSEs were observed for the estimation of ICC with at least 10 clusters each of cluster size of at least 20. Increasing the number of clusters (at least 100) and cluster size (at least 20) gave negligible bias for the estimation of ICC as its RMSEs decreased to almost zero, for all level-2 variances.

Median odds ratio (MOR)
As observed from Tables 2 and 3, the estimates of the MOR were largely biased for 50 clusters with cluster size of 5, for all cluster variances. at a cluster variance of 0.2, the estimates of the MOR were less biased for at least 50 clusters with cluster sizes of at least 20. The estimates of the MOR improved when cluster sizes were 50 or more for all cluster-level variances. The lowest RMSEs were observed for at least 300 clusters and a cluster size of at least 50 in estimating the MOR.

80% interval odds ratio (IOR-80)
The estimates for the IOR-80 were less accurate for small clusters and cluster sizes when the cluster variance was large as observed from Tables 2 and 3. At a cluster variance of 0.2, we observe low RMSEs when estimating the IOR-80 lower and upper limits for at least 10 clusters with a cluster size of at least 20. At a cluster variance of 0.5, the estimates of the IOR-80 lower limit were accurate for at least 10 clusters with a cluster size of at least 20. However, the estimates for the IOR-80 upper limit were largely biased for at least 10 clusters with a cluster size of at least 5 for cluster variances of 1.0 and 1.5. On the contrary, the    estimates for the IOR-80 lower limit at these cluster variances (1.0 and 1.5) were accurate for at least 10 clusters with a cluster size of at least 20. The accuracy of the IOR-80 estimates were less biased with increased clusters and cluster sizes. At a cluster variance 0.2, unbiased estimates of the IOR-80 lower and upper limits were observed for 100 clusters with cluster size of at least 50. Overall, the bias for the estimates of the I0R-80 decreased for at least 300 clusters with a cluster size of at least 100 at all cluster variances.

Sorting out index (SOI)
As seen from Tables 2 and 5, the SOI estimates were biased for 10, 50, and 100 clusters with a cluster size of 5. However, the estimates of SOI were less biased for at least 10 clusters with cluster sizes of at least 20 at cluster variances of 0.2 and 0.5. Furthermore, the estimates for SOI were slightly biased for 10 clusters with a cluster size of 5.
The bias in estimating the SOI decreased with increased clusters (at least 300) and a cluster size (at least 50) at cluster-level variances of 0.2 and 0.5. In addition, for large cluster variances of 1.0 and 1.5, the estimates of the SOI were unbiased with increased clusters (at least 100) and cluster size (at least 50).

Application to childhood anemia in Malawi
For the application, we analyzed childhood anemia among under-five children in Malawi using the 2015-16 Malawi Demographic and Health Survey (MDHS) data by fitting a two-level logistic regression model taking the community as a level-two unit. Childhood anemia is a major public health concern in Sub-Saharan Africa (SSA), particularly in Malawi [21]. We categorized anemia among children aged 6 to 59 months who were born five years before the survey as: no anemia = 0 and had anemia = 1.  There were 4676 children aged 6 to 59 months distributed across 850 enumeration areas (EA) which we regarded as communities. The overall prevalence of anemia was 63.1% (95% CI: 61.7, 64.5), with an average prevalence of childhood anemia of 62.2%, with a range of 0.0% to 100% and mode of 100%. The median anemia prevalence across the communities was 66.7%. The average number of children per community was 5.5, with a range of 1 to 15 and mode of 6. The median number of children per community was 5.
There are several factors that have been found to be associated with childhood anemia. These include individual-level factors: age, gender of a child, mother's age, birth rank, wealth index, mother's education level, father's education level, household size, household hunger status, sanitation services, water and electricity supply, and community-level factors: community geographic type and community development index [21][22][23][24][25]. For our application, we used mother's education, mother's age, child gender, household wealth quantile, and community geographic type (urban/rural). For the unadjusted model, the estimate of the between-community variance was 0.497, translating to an ICC of 0.131 and an MOR of 1.954. Adjusting for the child-level and the communitylevel covariates had the following odds ratios on child anemia: being female (OR= 0.92, 95% CI: 0.80, 1.04), being in a household with medium (OR= 0.71, 95% CI: 0.59, 0.84) and rich (OR= 0.78, 95% CI: 0.66, 0.93) wealth, having mothers with primary (OR= 0.69, 95% CI: 0.56, 0.86) and post-secondary (OR= 0.64, 95% CI: 0.49, 0.82) education, and residing in a rural community (OR= 1.21, 95% CI: 0.97, 1.52). For the covariate-adjusted model, the estimate of the between-community variance was 0.455. This translates to estimated values of ICC and MOR to be 0.122 and 1.898, respectively. The estimate of the ICC show that the unexplained heterogeneity between communities (i.e. low homogeneity in childhood anemia in a community) was low. While that of the MOR shows a high level of unexplained community heterogeneity (MOR »1). The estimates of the SOI was 0.567 (translated as 56.7%) and of the IOR-80 was (0.345, 3.978), implying that the unexplained heterogeneity between communities was more significant in explaining a child's propensity to be anemic than the effect of the community geographic type (rural vs urban).

Discussion
Multilevel logistic regression models, which recognize the fact that level-1 units (individuals) are nested within higher-level units (clusters) when estimating the effect of covariates at different levels on individual binary outcomes, are widely implemented in most standard statistical software packages. They measure and evaluate the relative variation in the outcome measures, between individuals within and between clusters. However, measuring residual cluster-level variation and covariate effect heterogeneity or homogeneity (clustering) of binary outcomes within high-level clusters is more challenging in multilevel logistic regression than for linear mixed models. We set out to compare the performance of four measures of clustering in a two-level nested logistic regression model through a simulation study. The four measures were ICC, MOR, SOI and IOR-80. We applied the four measures to an analysis of childhood anemia in Malawi using a twolevel logistic regression model, where the level-2 units were communities.
The accuracy of the estimated coefficients for the twolevel logistic regression model improved for 100 to 500 clusters with cluster sizes of 50 to 250 for all cluster variances. These findings are similar to results from previous studies [26,27], which showed that as the number of clusters and cluster sizes increased, the accuracy of  the regression coefficients had a negligible bias. The estimates of ICC were more accurate than MOR, SOI and IOR-80 when the cluster number was small (at least 10) and when the cluster size was small (at least 20), and the bias for the ICC was negligible when the number of clusters and the cluster sizes increased. This is similar to a study by Moineddin et al. [26] which found that increasing the number of clusters and cluster sizes improved the convergence rate for the ICC. The estimates of the SOI were more accurate than the estimates for the MOR and the IOR-80 for smaller number of clusters (at least 10) and a cluster size of at least 20 regardless of the cluster variance. However, the estimates of all the measures were more accurate when the number of clusters increased (from 100 to 500) regardless of cluster size and cluster variance. The estimates for MOR and IOR-80 performed better for large clusters (at least 300) and large cluster size (at least 20) at larger cluster variance (1.0 and 1.5). Overall the ICC and SOI estimates were better than the estimates of the MOR and the IOR-80 and the estimation of the heterogeneity between clusters using ICC and SOI was more accurate. However, the ICCs usefulness is limited as it does not account for covariates at the cluster level. Hence, the SOI would be the best statistic to measure the strength of clustering as compared to the ICC, MOR and IOR-80, especially when between cluster-level covariate heterogeneity is also of interest.
In the application to childhood anemia in Malawi, accounting for the purported risk factors, the estimated values of the ICC, and MOR were 0.122 and 1.898, respectively, which were indicative of low similarity between the status of anemia in children within a community, and high variability in unexplained heterogeneity between the communities. The estimates of SOI and IOR-80 were 56.7% and (0.345, 3.978), respectively, which showed that unexplained heterogeneity between communities was of great significance in the understanding of anemia in the children than the child and community-level predictors, namely mother's education, mother's age, child sex, community geographic type and household wealth quantile, used in the model, and community-level geographic type (rural/urban). [10]. Mother's age, wealth quintile and mother's education were significantly associated with childhood anemia, similar to findings from previous studies [21,23,24]. However, the effect of community geographic type on childhood anemia was not significant. With 850 communities, an average of 5 children per community and the simulation study, we recommend using the ICC and SOI to measure the strength of heterogeneity between the communities. However, since the ICC does not take into account the community-level covariate (community geographic type), we recommend using the SOI to measure the strength of variation between communities for this application.

Conclusion
In conclusion, measures of the between-cluster heterogeneity and effects of cluster-level covariates in multilevel logistic regression models would be estimated with negligible bias when the data has at least 300 clusters with a cluster size of at least 50. The estimation of the intra-class correlation coefficient and the sorting out index are accurate with 10 clusters and a cluster size of 20. We recommend using the sorting out index to assess unexplained heterogeneity between clusters as well as the effect of cluster-level covariates. The usual intra-cluster correlation coefficient would suffice when only quantification of the between-cluster variation is the purpose of the multilevel logistic regression analysis.