### Software Strengths and Weaknesses

#### Strengths

Mplus has several strengths. First, it has tremendous flexibility and can incorporate numerous statistical models within the MLM framework, well beyond "traditional" hierarchical linear and generalized linear models. One can fit factor models, latent class models, structural equation models, mixture models, latent growth curve models, and others within the MLM framework. Second, Mplus will automatically scale the weights using each approach described here, and it also allows analysts to specify weights scaled outside of Mplus. Third, Mplus offers a wide variety of estimators and link functions. Fourth, Mplus can handle both of the currently recommended methods for analyzing subpopulations of complex survey data: the zero-weight approach and the multiple-group approach. [24]

MLwiN also has several strengths. First, MLwiN can fit models with up to five levels, making it quite useful in multistage designs. Second, MLwiN has an easy point-and-click, windows-based user interface, which makes fitting MLMs straightforward. Third, MLwiN incorporates several estimators. Fourth, like Mplus, MLwiN provides an automatic weight scaling feature and allows the user to specify weights scaled outside of MLwiN. Fifth, MLwiN has numerous features available for evaluating a model's appropriateness. And, sixth, MLwiN includes several graphical features.

Finally, GLLAMM also has some distinct advantages. First, like Mplus, GLLAMM offers an astounding array of models that it can fit within the MLM framework. [30] Second, GLLAMM allows more than two cross-sectional levels. Third, GLLAMM allows the user to specify scaled weights. Finally, GLLAMM uses full pseudo-maximum-likelihood estimation for generalized linear mixed models with *any* number of levels using adaptive quadrature,[10] which may result in more appropriate standard errors, especially when working with categorical outcomes. [23]

#### Weaknesses

Despite its strengths, Mplus has some distinct disadvantages. First, it can only fit two-level cross-sectional MLMs. Although one can fit a two-level MLM and use Mplus' complex data analysis feature to properly estimate standard errors for a third level, Mplus does not allow one to investigate what predicts variation at level-3. For multistage surveys, this may be a substantial limit. Second, relative to MLwiN, Mplus offers few analytical tools for investigating model assumptions, model fit, and model diagnostics. Third, relative to MLwiN, Mplus offers few graphical tools. Whether these limits outweigh its strengths will depend on the individual user's needs.

MLwiN also has limits. Primarily, it cannot fit the wide variety of models that Mplus and GLLAMM can (e.g., latent class models). While MLwiN can fit some models beyond hierarchical linear and generalized linear models (e.g., multilevel confirmatory factor analyses), MLwiN does not have the full flexibility that Mplus and GLLAMM do. For users seeking to fit extremely complex models, this may be a substantial drawback. Second, MLwiN will only automatically scale the weights using method B. And third, while MLwiN does offer several estimators (e.g., iterative generalized least squares (IGLS), restricted IGLS, and Markov chain Monte Carlo (MCMC)), it does not offer as large a range of estimators as Mplus. Again, whether these weaknesses outweigh its strengths will depend primarily on the type of analysis the user expects to conduct.

Finally, GLLAMM has some noteworthy disadvantages. First, GLLAMM has well-known problems with computational speed. Models that take seconds to converge in the other programs can take days (literally) to converge in GLLAMM. Aside from some minor adjustments, analysts can do little to increase GLLAMM's speed. Second, although GLLAMM has an advantage with categorical outcomes, it may be less accurate with continuous outcomes. [29, 31] Third, GLLAMM does not scale the weights for the user. Users must supply pre-scaled weights. Fourth, GLLAMM offers few automatic features (e.g., automatic grand or group mean centering) and diagnostic utilities. However, users familiar with STATA will find it easy to incorporate STATA commands, data manipulation, and diagnostic tools when using GLLAMM. Whether these limits outweigh its benefits will depend on the user's individual needs.
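Because GLLAMM expects pre-scaled weights, analysts have to compute them before fitting the model. The following is a minimal sketch of that step in Python, not code taken from any of the packages discussed. It assumes the usual definitions of the two scaling methods (method A rescales the level-1 weights within a cluster so they sum to the cluster's actual sample size; method B rescales them so they sum to the cluster's effective sample size) and assumes every weight is strictly positive. The function name `scale_weights` and its arguments are hypothetical.

```python
from collections import defaultdict


def scale_weights(weights, clusters, method="A"):
    """Rescale level-1 design weights within each cluster.

    Assumed definitions (not stated in this section):
      Method A: w* = w * n_j / sum(w)          -> sums to the cluster size n_j
      Method B: w* = w * sum(w) / sum(w ** 2)  -> sums to the effective cluster size
    Weights are assumed to be strictly positive.
    """
    if method not in ("A", "B"):
        raise ValueError("method must be 'A' or 'B'")

    # Accumulate per-cluster totals in a single pass.
    sum_w = defaultdict(float)
    sum_w2 = defaultdict(float)
    n = defaultdict(int)
    for w, c in zip(weights, clusters):
        sum_w[c] += w
        sum_w2[c] += w * w
        n[c] += 1

    # Apply the cluster-specific scaling factor to every observation.
    scaled = []
    for w, c in zip(weights, clusters):
        if method == "A":
            factor = n[c] / sum_w[c]
        else:
            factor = sum_w[c] / sum_w2[c]
        scaled.append(w * factor)
    return scaled


if __name__ == "__main__":
    raw = [1.0, 3.0, 2.0, 2.0]
    cluster_ids = ["a", "a", "b", "b"]
    print(scale_weights(raw, cluster_ids, method="A"))
    print(scale_weights(raw, cluster_ids, method="B"))
```

In this toy example, the cluster with raw weights 1 and 3 receives scaled weights of roughly 0.5 and 1.5 under method A (summing to the cluster size of 2) and roughly 0.4 and 1.2 under method B (summing to the effective size of 1.6). When the weights within a cluster are equal, the two methods coincide.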

#### Limitations

Although these analyses generally support the use of MLM in complex survey data with design weights, some issues remain unresolved. First, a best practice for scaling weights across multiple levels has yet to be advanced. Though Asparouhov[3] and Rabe-Hesketh and Skrondal[10] indicate that scaling level-2 weights has little practical effect, more work is needed to investigate the generality of that advice, particularly in surveys with 3 or more levels. Second, complex survey designs often employ unequal probability of selection at higher levels. For example, the National Epidemiologic Survey of Alcohol and Related Conditions[32] stratified the US into four regions. It then sampled counties within regions and people from households within counties. At both the county and household levels, unequal probability of selection occurred (e.g., some counties were more likely to be included than others). Survey organizations rarely make (or have) level-2 or beyond weights available. Some authors have suggested methods for estimating level-2 weights from level-1 weights,[17] yet more work is needed to investigate these methods' validity.

Third, MLMs theoretically allow investigators to examine predictors and variance across naturally occurring clusters within a complex sampling design (e.g., creating a three-level model by grouping individuals according to their county of residence using data from a two-level survey that sampled people within states). However, this flexibility may result in cross-classified data structures (e.g., hospital catchment areas overlapping states in a survey that sampled people within states). While MLM can handle cross-classified data,[16] no work has examined handling design weights in this situation.

Fourth, analysts often wish to investigate relationships within a certain subgroup. Although analysts can use interaction terms to investigate hypotheses within the specified subgroup, analysts may wish to examine a subgroup of the sample while excluding other sample members entirely. For this situation, where analysts wish to investigate hypotheses among a specific subgroup only, no established guidelines exist regarding a best practice method for estimating MLM in complex survey data with design weights. When using complex surveys, one should include the entire sample in the analyses. This leaves the sample design structure whole and leads to proper estimation of variances and standard errors. However, it presents a problem when analysts would like to select a subgroup and examine an MLM for this subgroup of individuals in a sample. Analysts should *not* simply subset the data to the desired group of interest. [33, 34] While some techniques have been suggested (e.g., zero-weighting[34] and multiple-group analyses[24]), the performance of these techniques in MLM with design weights needs further examination.
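To make the contrast between subsetting and zero-weighting concrete, the sketch below keeps every sampled record in the data set and sets the weight to zero for anyone outside the subgroup. It only illustrates the data preparation step under stated assumptions: the record layout, the `weight` key, and the function name `zero_weight_subgroup` are hypothetical, and whether a given program accepts zero weights (and how the approach performs in MLM with design weights) remains an open question, as noted above.

```python
def zero_weight_subgroup(records, in_subgroup, weight_key="weight"):
    """Return copies of all records, with weight 0 outside the subgroup.

    records     : list of dicts, one per sampled individual (hypothetical layout)
    in_subgroup : function taking a record and returning True for subgroup members
    No record is dropped, so the cluster and design structure stay intact.
    """
    result = []
    for record in records:
        updated = dict(record)  # copy so the caller's data are not modified
        if not in_subgroup(record):
            updated[weight_key] = 0.0
        result.append(updated)
    return result


if __name__ == "__main__":
    sample = [
        {"id": 1, "cluster": "a", "age": 70, "weight": 1.5},
        {"id": 2, "cluster": "a", "age": 34, "weight": 0.5},
        {"id": 3, "cluster": "b", "age": 68, "weight": 1.0},
    ]
    # Hypothetical subgroup: respondents aged 65 or older.
    analysis_data = zero_weight_subgroup(sample, lambda r: r["age"] >= 65)
    for row in analysis_data:
        print(row)
```

The design choice here is that the number of records and their cluster membership never change; only the weights of non-members do.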

Finally, little work addresses missing data's role in MLM with design weights. It remains unclear how to best handle missing data within the context of MLM, complex survey data, and design weights. Analysts might take a zero-weighting approach for missing data,[34] treating individuals with complete data as a subgroup, to address missing data. If one uses this approach, the analyst should take special care to scale the weights using the full set of weights. To evaluate the influence of missing data, analysts might conduct analyses in the full sample using selected variables for which all individuals have complete data and compare those results to identical analyses conducted on the same variable set but using the subsample of individuals with missing data on other variables of interest. Future work should explore missing data's role and develop and test solutions to handle it.

While these limits highlight an array of outstanding issues that need investigation, they do not preclude analysts from employing MLM in complex survey data with design weights. Moreover, they demonstrate the need to choose an MLM program that allows flexibility with regard to design weights. Thus, as theory advances, software will not limit analyses.

#### Applied Summary Recommendations

Given the breadth of findings discussed and presented and the various strengths and weaknesses of each approach and software program, the reader might now wonder, "what to do in practice?" In my work, I take the following standard approach. First, in terms of software, I generally use Mplus. I do this because Mplus offers the most flexibility relative to speed. I frequently fit models that MLwiN cannot estimate (e.g., MLM multiple group structural equation models) and I rarely fit models with more than two levels (which Mplus currently cannot estimate). Analysts fitting the types of models discussed in this paper will generally find MLwiN more than meets their needs. Second, in terms of scaling the weights, I always fit the models using each scaling technique (methods A and B). I do this to examine any inferential discrepancies. If I find no inferential discrepancies, I generally report the findings from method A. I do this because I frequently work with cluster sizes larger than *n* = 20 and I am interested in both point estimates and variance-covariance discussions. If I worked with smaller cluster sizes (*n* < 20) and had an interest primarily in variance-covariance estimates, I would report the results of method B. If I had an interest primarily in point estimates, I would report the results of method A. Finally, if I encountered a model I could not estimate for some reason using scaled weighted data, I would take the following approach. I would fit a simpler model using scaled weighted and unweighted data. If I did not observe a difference in the inferential conclusions across these approaches, I would then fit the more complex model using unweighted data. However, I would include a note in my reporting of the unweighted findings highlighting that I used unweighted data and that readers should interpret the results with caution.
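The reporting rule for the two scaling methods can be written down as a small decision function. This is only a sketch of the heuristic described above, not a validated procedure. The function name, the argument names, and the `focus` labels are hypothetical. Two choices are assumptions rather than statements from the text: the function returns `None` when the two methods lead to different inferences, because no rule is given for that case, and it falls back to method A for combinations the text does not address (for example, small clusters with an interest in both kinds of estimates).

```python
def scaling_method_to_report(conclusions_agree, cluster_size, focus):
    """Suggest which scaled-weight results to report after fitting both.

    conclusions_agree : True if methods A and B lead to the same inferences
    cluster_size      : typical number of level-1 units per cluster
    focus             : 'point', 'variance', or 'both' (hypothetical labels)
    Returns 'A', 'B', or None when the text offers no rule.
    """
    if focus not in ("point", "variance", "both"):
        raise ValueError("focus must be 'point', 'variance', or 'both'")

    if not conclusions_agree:
        # Assumption: the text gives no rule for discrepant results,
        # so this sketch declines to pick a method.
        return None

    if cluster_size < 20 and focus == "variance":
        # Small clusters with a primary interest in variance-covariance estimates.
        return "B"

    # Default named in the text; also used here for cases it does not cover.
    return "A"


if __name__ == "__main__":
    print(scaling_method_to_report(True, 35, "both"))
    print(scaling_method_to_report(True, 12, "variance"))
    print(scaling_method_to_report(False, 35, "point"))
```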