Many translational research projects are generating very large data sets (VLDS) which require fitting complex models to answer questions of public health interest. Datasets can be considered “very large” because of large numbers of study subjects or units of analysis and/or large numbers of variables, and both situations present challenges during the analysis phase, especially when observations are clustered at some level (eg. Longitudinal data). An example of VLDS with large number of observations is a two-year group randomized trial designed to assess the impact of a quality improvement intervention on colorectal cancer screening in primary care practices. Electronic medical record data were obtained from a sample of 68,150 patients from 32 primary care practices in 19 US states, followed monthly over a 2-year time period . Similarly, an example of VLDS with large number of variables as well as units of analysis is an functional magnetic resonance imaging study of neural changes underlying speech-perception training  in which whole brain images of 40 patients were taken to make functional inference, resulting in hundreds of time series data clustered within thousands of voxels.
Fitting complex models for these types of data sets can be difficult, requiring inordinate amounts of computer time for parameter estimation, requiring memory allocations beyond what are available or containing data structures that prevent model convergence, even within state-of-the-art computational infrastructures of medium size research facilities such as ours. For instance, fitting complicated generalized linear mixed models (GLMMs) for data from the examples above using software such as SAS 9.2.2 (Cary, NC), Stata 11 (College Station, TX) or R (R-2.11.1) may not be possible using desktop computers typically available to researchers within our institutions (64 bit server with 12GB and 667MHz dual ranked DIMMS and 48GB of RAM). Although a few methods for modeling VLDSs exist, current practice mainly involves data reduction processes, which usually result in loss of information.
Recently, we have been working on a longitudinal study of the trajectory of HbA1c control in patients with type2 diabetes treated within the Veterans Administration (VA) healthcare setting, and we have been faced with the problem of fitting GLMMs on over 890,000 patients, clustered in 23 Veterans Integrated Service Networks (VISNs) and followed over 5 years. Fitting mixed effects logistic regression model with over 30 covariates for making individual level inference resulted in an out of memory error using a 64 bit server with 12GB and 667MHz dual ranked DIMMS and 48GB of RAM.
In SAS procedures such as Proc GLIMMIX, fitting mixed effect models with the recommended standard syntax of including subject ID in a Class statement was not possible. This procedure with the standard syntax ran out of memory when we attempted to fit a model with the simplest scenario of including a random intercept. With ad-hoc modifications (see discussion section) to the standard syntax, however, we were able to fit the model despite it took longer time. Similar problems were observed in Stata’s gllamm, and R’s lme4 packages.
With the current focus on personalized medicine, patient/subject level inference is often of key interest in translational research. GLMMs are a very rich class of models that are traditionally used to make such individual-level inference by breaking down the total variation in the observed response into within-subject and between-subject variation. These models are also used to incorporate natural heterogeneity in the estimates due to unmeasured explanatory variables [3–5]. In GLMMs, the joint distribution of the vector of responses is fully specified and the within-subject association among repeated or clustered measures is induced via incorporation of one or more random effects into the model. As a result, interpretation of the regression coefficients for GLMM relies on the induced model for the covariance among the responses. When population level inference is of interest, marginal models (e.g. general linear models) are often used, and within-subject association among repeated responses is incorporated by directly making assumptions about the covariance (e.g. autoregressive, compound symmetry, etc). While such models may not be as difficult to fit with VLDSs, subject-level inference cannot be made using the marginal model framework since the mean response and covariance are modeled separately . Currently, methodology for fitting parametric mixed effect models for VLDSs is underdeveloped.
There are some recent Bayesian methods proposed for fitting parametric random effects models to VLDSs [6–8]. Owen  and Huang and Gelman  propose a computational strategy, akin to a Bayesian meta regression, based on sampling the data, computing separate posterior distributions based on each sample, and then combining these to get a consensus posterior inference. Their approach reduces the number of parameters as well as sample size for each separate model fit and can lead to efficient inference.
An alternative is a 2-stage “data squashing” method . In this method, the complete data is partitioned into compact sub-regions in the first stage. Then one generates a set of “pseudo-data” and weights within each sub-region so that the weighted moments on the squashed data match the unweighted moments on the original data. This method is less sensitive to outliers than random sampling, but it has the potential to be computationally intensive. To date, its characteristics are only known in simpler fixed effect and descriptive models. Madigan et al. [11, 12] proposed a data squashing method which first groups subjects based on their contribution to the likelihood and then fits models to the mean of each group. Although this approach may be promising for some models, it is unwieldy under the Dirichlet process prior (DPP) due the complicated structure of the likelihood . In general, the Bayesian approaches which use DPP to automatically cluster individuals into latent classes [13, 14] may not be feasible in very large data sets due to limitations in current Markov chain Monte Carlo (MCMC) algorithms [12, 15].
Motivated by the scarcity of work in this area and the challenge we faced with the analysis of our VLDS, we propose a random effects meta regression (REMR) approach in which VISN-specific estimates are combined via meta regression. We make comparisons with two other approaches, (1) average estimates from analysis of 1000 data sets obtained via simple random sampling (SRS) of the original data with simulated 95% confidence intervals (CIs), (2) weighted average estimates from analysis of 1000 data sets obtained via VISN-stratified random sampling (StRS) with simulated 95% CIs. Using simulated data, we also assess biases present within each approach, noting whether they provide equivalent inferences as would be obtained from analysis of the full data. The paper is organized as follows: section 2 presents the motivating example; section 3 describes the details of the statistical methods; section 4 presents the results of the analysis; and section 5 discusses the findings.
A national cohort of Veterans with type 2 diabetes was created by linking patient and administrative files from the Veterans Health Administration (VHA) National Patient Care and Pharmacy Benefits Management (PBM) databases. Veterans were included in the cohort if they had type 2 diabetes defined by two or more International Classification of Diseases, Ninth Revision (ICD-9) codes for diabetes (250, 357.2, 362.0, and 366.41) in the previous 24 months (2000 and 2001) and during 2002 from inpatient stays and/or outpatient visits on separate days (excluding codes from lab tests and other non-clinician visits), and prescriptions for insulin or oral hypoglycemic agents (VA classes HS501 or HS502, respectively) in 2002 . Veterans identified as having type 2 diabetes by ICD-9 codes were excluded from the cohort if they did not have prescriptions for diabetic medications (HS501 or HS502) in 2002. The datasets were linked using patient scrambled Social Security Numbers and resulted in 890, 394 Veterans, who were followed until death, loss to follow-up, or through December 2006. The study was approved by our Institutional Review Board and local VA Research Development committee.
The primary outcome was glycosylated hemoglobin (HbA1c) level. In addition, a binary outcome defined as HbA1c ≥ 8.0% was used.
Primary independent variable
For this project, the primary research question was whether HbA1c differed significantly by race/ethnicity, classified as non-Hispanic white (NHW), non-Hispanic black (NHB), Hispanic, and other/unknown/missing.
Age, gender, marital status (i.e., single or married) and percentage service-connectedness (i.e., degree of disability due to illness or injury that was aggravated by or incurred in military service) were available and treated as covariates in the model. Location of residence was defined as urban and rural/highly rural,  and hospital region was defined by the five geographic regions of the country based on VHA Veteran’s Integrated Service Networks (VISNs): Northeast (VISNs 1, 2, & 3), Mid-Atlantic (VISNs 4, 5, 6, 9, & 10), South (VISNs 7, 8, 16, & 17), Midwest (VISNs 11, 12, 15, 19, & 23), and West (VISNs 18, 20, 21, & 22) .
Variables included substance abuse, anemia, cancer, cerebrovascular disease, congestive heart failure, cardiovascular disease, depression, hypertension, hypothyroidism, liver disease, lung disease, fluid and electrolyte disorders, obesity, psychoses, peripheral vascular disease, and other (AIDS, rheumatoid arthritis, renal failure, peptic ulcer disease and bleeding, weight loss) and were defined based on ICD-9 codes at entry into the cohort. In our final models, we included a categorical summary of count of comorbidities defined as (0=none, 1=one, 2=two 3=three or more), a process which has been shown to be as or more efficient than more complicated algorithms .