Group penalized generalized estimating equation for correlated event-related potentials and biomarker selection

Background Event-related potentials (ERP) data are widely used in brain studies that measure brain responses to specific stimuli using electroencephalogram (EEG) with multiple electrodes. Previous ERP data analyses haven’t accounted for the structured correlation among observations in ERP data from multiple electrodes, and therefore ignored the electrode-specific information and variation among the electrodes on the scalp. Our objective was to evaluate the impact of early adversity on brain connectivity by identifying risk factors and early-stage biomarkers associated with the ERP responses while properly accounting for structured correlation. Methods In this study, we extend a penalized generalized estimating equation (PGEE) method to accommodate structured correlation of ERPs that accounts for electrode-specific data and to enable group selection, such that grouped covariates can be evaluated together for their association with brain development in a birth cohort of urban-dwelling Bangladeshi children. The primary ERP responses of interest in our study are N290 amplitude and the difference in N290 amplitude. Results The selected early-stage biomarkers associated with the N290 responses are representatives of enteric inflammation (days of diarrhea, MIP1b, retinol binding protein (RBP), Zinc, myeloperoxidase (MPO), calprotectin, and neopterin), systemic inflammation (IL-5, IL-10, ferritin, C Reactive Protein (CRP)), socioeconomic status (household expenditure), maternal health (mother height) and sanitation (water treatment). Conclusions Our proposed group penalized GEE estimator with structured correlation matrix can properly model the complex ERP data and simultaneously identify informative biomarkers associated with such brain connectivity. The selected early-stage biomarkers offer a potential explanation for the adversity of neurocognitive development in low-income countries and facilitate early identification of infants at risk, as well as potential pathways for intervention. Trial registration The related clinical study was retrospectively registered with https://doi.org/ClinicalTrials.gov, identifier NCT01375647, on June 3, 2011.


Background
Event-related potentials (ERPs) have been widely used in studies of perceptual and cognitive development. ERPs represent the volume-conducted electrical signals generated by large populations of synchronously activated neurons activated in response to stimuli. Specifically, with multiple electrodes on the scalp, ERPs are small parts of electroencephalogram (EEG) recording of the brain response elicited to specific stimuli such as viewing pictures or words on the computer screen [1]. As the brain response to a single stimulus is usually weak or noisy in the EEG recording of a single trial, an ERP waveform is actually generated from the aggregated EEG recordings over many trials for better brain response measuring [2,3]. In general, an ERP waveform consists of a series of positive and negative voltage deflections, characterized by the amplitudes of negative-or positive-going peaks or the latencies to these peaks in milliseconds (ms). For example, the N290 component surfaces as a negative deflection in voltage and with a peak latency between 250 and 350 ms, while the P400 component appears as a positive-going waveform that peaks between 350 and 450 ms depending on the age of the child [4][5][6]. Consequently, ERP data (amplitudes or latencies) are hierarchical in that there are multiple ERP measurements for each subject corresponding to multiple treatment or stimulus conditions and multiple channels (i.e., electrodes), while channels are further clustered in different regions of the brain. Comparisons of brain activities between different treatment conditions for different channels in different brain regions are of research interest [7].
In the previous literature, there are a few approaches to compare ERPs between different stimulus conditions from multiple channels. One approach is to compare ERPs between conditions for each channel individually, which is often subjected to multiple comparison problem. Lage -Castellanos et al. [8] applied false discovery rate method and performed a permutation test for comparisons within each channel and at each time point. Causeur et al. [9] introduced a dynamic factor model for multiple testing to account for the dependence among hypotheses. The second approach is to analyze the data from all channels simultaneously. One popular tactic is to group the channels by the brain regions such as frontal, central and parietal, and then perform Analysis of Variance (ANOVA) separately for each region, or include region as a factor in Multivariate Analysis of Variance (MANOVA) for all ERPs together [10]. Yet another approach is to average the ERPs over the multiple channels of interest and then compare conditions using one-way ANOVA. Either way, the channels within a brain region would be treated the same and the variations or the correlation structure between individual channels would not be accounted for. In fact, ERP measures do not only vary but also are highly correlated among channels. Vossen et al. [11] showed the correlated structure among ERP data and applied mixed regression approach. However, they only considered the correlation among repeated measurements from different conditions while channels are still modeled separately. To improve estimation efficiency, a model accounting for both the individual channel effects and the correlation structure is highly desired in ERP data analysis.
In addition to evaluating the effect of treatment conditions on the brain response of interest in ERP data, motivated by our clinical study, we are also interested in that whether such brain response is attributable to a set of important clinical risk factors and biomarkers. Since a large number of risk factors and biomarkers are available in the clinical study, variable selection using penalized methods would be preferred for such high-dimensional data to select the important predictors and estimate their impacts on the brain response. Many penalized methods have been developed based on different penalties for high-dimensional data, such as Least Absolute Shrinkage and Selection Operator (LASSO) [12], Smoothly Clipped Absolute Deviation (SCAD) [13], Elastic Net [14] and Adaptive LASSO [15]. Penalized methods for correlated data have also been proposed for marginal models [16] and for mixed effects models [17]. In addition, Wang et al. [18] proposed penalized generalized estimating equations (PGEE) for high-dimensional correlated data based on SCAD penalty. However, these available methods are not readily applicable to ERP data mainly due to the lack of consideration of the specific structured correlation among different channels in ERP data, especially when both conditions and channels are included. Second, the SCAD-based PGEE does not allow group variable selection, which is pivotal in the clinical studies as many risk factors or biomarkers are clustered or potentially correlated.
In this paper, we extend the PGEE method to a Group Penalized Generalized Estimating Equations (GPGEE) that can accommodate a multi-level structured correlation and achieve group-wise variable selection. Thus our proposed method can be readily applied to test the condition difference in ERP measures and simultaneously perform group variable selection to identify important predictors associated with ERP for brain response. To our knowledge, hierarchical models with complex correlation structure are rarely used for ERP response analysis in the ERP research, nor are the regularized regression methods with penalty. Our modeling development was motivated by the ERP data from a birth cohort of Bangladeshi children, the Performance of Rotavirus and Oral Polio Vaccines in Developing Countries (PROVIDE) study. A large and comprehensive set of non-invasive biomarkers were developed in the PROVIDE study from fecal and blood samples [19]. Children in low-resource communities such as those in the PROVIDE cohort are exposed to numerous adversities, including malnutrition, infectious disease exposure, and extreme poverty. In turn, exposure to early adversity can limit their cognitive developmental potentials with long lasting effects. Using EEG as a neuro-imaging tool for cognitive and neural development assessment, a subset of children in the PROVIDE birth cohort were measured at 3 years of age for ERP response. The primary objective of our clinical ERP study was to evaluate the impact of early adversity on brain connectivity and identify risk factors and biomarkers associated with the brain response. With the challenges and limitations in ERP research described earlier, the GPGEE model is developed to achieve the clinical objective.
Our method addresses the following major challenges in analyzing the ERP data from the PROVIDE study. First, due to the design of experiment, ERP data are hierarchical or multilevel by nature with multiple conditions and multiple channels for each study subject. Second, ERP data are highly correlated across channels under each condition, and across conditions for each channel. Lots of information would be lost by simply averaging ERPs over these channels to compare ERPs between conditions. Third, although variable selection methods for high-dimensional data have been intensively studied [12,13] and applied in clinical and genetic studies [19][20][21], to our knowledge, no variable selection technique has been applied to ERP data. Further, since many predictors in the high dimensional data are categorized with multiple levels or potentially correlated, group penalty needs to be imposed in the variable selection process to ensure informative predictors and groups can be correctly selected.
The rest of the paper is organized as follows. In "Methods" section, we present the models for highdimensional correlated data, propose the model specifically for the structured correlation matrix in ERP data, expand the PGEE method to allow group penalty, and derive the algorithm for solving group-penalized estimating equations. In "Simulations" section, we conduct a simulation study to compare the relative performance of our proposed GPGEE with the existing model under several scenarios, without and with group penalty, and with different correlation structures. In "Results" section, we apply our proposed method to ERP data from the PRO-VIDE study. Compared to the existing methods such as regularized regression or PGEE, our proposed method doesn't only model the ERP multi-level structure appropriately, but also promotes group-wise variable selection. The simulation results show that our proposed method outperforms the existing modeling approaches in variable selection and parameter estimation. Our work would be one of the pioneering efforts in ERP research to test the difference in ERPs between conditions while identifying important biomarkers associated with ERPs simultaneously.

Clinical data and ERP measurements
The PROVIDE (Performance of Rotavirus and Oral Polio Vaccines in Developing Countries) study was a randomized controlled clinical trial with a 2-by-2 factorial design to investigate the efficacy of Rotavirus and Oral Polio Vaccines in Bangladeshi children, conducted between May 2011 and August 2018 in Dhaka, Bangladesh. The cohort consisted of 700 children enrolled within 72 hours of birth after written parental consent and were followed through twice weekly household visits and regularly scheduled clinical visits during the first 5 years of life. Details about the study design, enrollment, surveillance and biomarker development were described previously [19,20,22,23]. Children were 36 months old at the time of neuro-imaging test for cognitive assessment. The study was approved by the Ethical Review Committee of the International Centre for Diarrhoeal Disease Research, Bangladesh (icddr,b), and the Institutional Review Board at Boston Childrens Hospital and the University of Virginia. This study is reported in line with the Consolidated Standards of Reporting Trials (CONSORT) Statement, and the CONSORT Checklist can be found in Additional file 2.
ERPs were measured in a subset of children at 36 months of age. After data processing and quality checking, 70 children out of 130 had valid data for the final ERP analysis. Each child was tested with a face oddball paradigm in which standard (70% of chance) and oddball (30% of chance) faces were presented in a random order. This paradigm has been widely employed to examine the neural correlates of social attention and recognition memory of faces in children [24][25][26]. The current study focused on one ERP component that can be elicited using this paradigm-the N290 component as the neurocognitive response. The N290 component is regarded as the precursor of the adult N170 face-sensitive component and potentially be generated by the fusiform face and occipital face areas in children [5,27,28]. The N290 amplitude in response to the two conditions (standard and oddball) in different electrode channels reflects the averaged synchronous brain activation of large number of neurons occurring around 290 ms following stimulus onset. Figure 1 shows that N290 peak amplitudes are different among 13 channels (see Additional file 1 for the 13 electrode locations) under either condition, suggesting that modelling the variations among channels would capture more accurate information than simply taking average of all channels. Also, the N290 amplitudes are highly correlated among the 13 channels ( Fig. 2 for the correlation plot). Furthermore, ERP response data are more structured with respect to multiple conditions by multiple channels for each subject, thus a structured correlation matrix will be needed to appropriately characterize the ERP data structure.
The clinical factors included maternal information (maternal height, weight, and education), socioeconomic status, and sanitation and environmental factors such as water source and water treatment. Biomarker data were obtained from the fecal and blood samples collected at early age of life to measure inflammation [19]. We hypothesized that only a small subset of these clinical factors and biomarkers are associated with the ERP response, thus the variable selection methods would be suitable in this investigation. The PGEE model proposed by Wang et al. [18] for correlated data is limited in that it can only handle a simple correlation structure. While the estimator obtained by PGEE [18] is consistent with any working correlation matrix, the efficiency of the estimator can be improved when the specified correlation matrix is closer to the true matrix. To characterize the ERP correlations between conditions and among channels, we specify the withinsubject correlation matrix as the Kronecker product of the channel correlation matrix and condition correlation matrix. In addition, to enable group variable selection of the categorized channel variable with 13 levels in our study, the penalty for individual variable selection in PGEE is adapted for group selection. Therefore, our method extends PGEE and prompts an integrated model for ERP responses such that we can evaluate the differences between conditions and identify informative clinical factors and biomarkers simultaneously, while accounting for the complex correlations among ERPs and allowing group variable selection.

Proposed model for ERPs
Suppose that there are I subjects, each subject is placed under J treatments, and K repeated measurements are recorded under each treatment. We use Y ijk to denote the kth repeated measures under the jth treatment for the ith subject. By vectorizing T , we consider group variable selection for a generalized linear model for the correlated data in Y i : where V i denotes the covariance structure and φ is an overdispersion parameter. Without loss of generality, we assume φ = 1 in the rest of the paper.

Structured correlation matrix
For correlated data, a working correlation matrix needs to be pre-specified in many estimation methods, and its appropriate specification improves the estimation efficiency considerably for regression parameters. Some commonly used correlations, such as unstructured, AR1, exchangeable, etc., are often adopted in the practice. However, none of the commonly used correlations can appropriately account for the structured correlation for ERP data. Given that how ERP data were collected, it is natural to assume that ERP measurements from different conditions are correlated, and under each of treatment conditions the correlation structures among the channels are the same. Thus for the structured covariance matrix, we adopt a separate correlation for treatment condition and channel. Letting B i be the covariance matrix for conditions and i be the covariance matrix for channels, the structured covariance matrix for each subject is the Kronecker product of the two matrices,

Group selection for GEE
Variable selection for correlated data has been studied in Wang et al. [18], where penalized generalized estimating equations is adopted for simultaneous model estimation and variable selection, the SCAD penalty is used for individual variable selection. However, for many biomarker studies, predictors are highly correlated and/or pre-classified into different groups, and variables need to be selected in groups, as shown in Yuan and Lin [29].
Here we adopt SCAD with group selection and extend the PGEE to Group Penalized Generalized Estimating Equations (GPGEE) to select variables for the ERP data. Suppose covariates X 1 , X 2 , ..., X p are classified into d groups X 11 , ..., X 1p 1 , ..., X d1 , ..., X dp d .

The corresponding coefficients are
where β G i is the coefficient vector for group i. For the group variable selection, we will either select the whole group of variables or remove the whole group from the model.
We define the estimating functions as is a vector of estimating functions defining the GEE [18],R is the estimated working correlation matrix , and q G λ (β) sign (β) denotes the component wise product with penalty vector for group i, and q λ (θ) is the derivative of the SCAD penalty function imposed on the L 1 -norm of the group vector β G i . The notation q λ (θ) is the derivative of the SCAD penalty, for θ ≥ 0 and some a > 2. As suggested in Fan and Li [13], we let a = 3.7.

Algorithm for GPGEE
Similar to the algorithm proposed in Wang et al. [18], we apply the Newton-Raphson algorithm combined with the minorization-maximization to solve the penalized estimating equations. By the minorization-maximization algorithm, for a small > 0, the penalized estimator β n approximately satisfies To solve the above equations, we apply the Newton-Raphson algorithm as follows, In practice, we set = 10 −6 and takeβ, the GEE estimator with independence working correlation matrix, as the initial value of β. The stopping criterion for the iterative algorithm is p j=1 β k+1 j −β k j < 10 −5 . In our study, we use Bayesian information criterion (BIC) developed for correlated data [30] for selecting the tuning parameter λ.

Simulations
In this section, we illustrate the numerical strength of our developed method by comparing it with existing methods through a simulation study. In our simulation study, the sample sizes are set at 50 and 100. There are 20 correlated measurements and 40 covariates for each subject. The correlated normal responses are generated from the model For the covariates, we generate x ij,1 from Bernoulli(0.5) distribution and the rest from the multivariate normal distribution with mean 0 and an AR1 covariance matrix with marginal variance 1 and auto-correlation coefficient 0.5. The covariance matrix of random errors for each subject is V i = B i ⊗ i , where B i is a 2-by-2 identity matrix and i is a 10-by-10 AR1 matrix with marginal variance 10 and auto-correlation coefficient 0.9. We compare our GPGEE model with PGEE model to illustrate the importance of incorporating group penalty and using structured correlations. Five models are evaluated for comparison in the simulations: original PGEE with AR1 working correlation (Model 1), a modified PGEE incorporated with our structured correlation (Model 2), our GPGEE with AR1 working correlation (which is unstructured correlation, Model 3), our GPGEE with structured correlation but with misspecified working correlation (Model 4) and our proposed model with both group penalty and correct structured correlation (Model 5). We assume that the true group memberships of the covariates to impose the group-SCAD penalty are known, and divide the covariates into 8 groups. For the structured correlation that is correctly specified, we use AR1 as working correlation structure for i and assume B i to be unstructured. For the misspecified structured correlation, we use CS (compound symmetry) as working correlation structure for i instead. The selection results are defined as exact-selection when the selected model is the true model, under-selection when at least one true covariate is not selected, and all other cases are defined as over-selection. We also report the mean squared error (MSE) which is defined as the average of β − β 2 2 and corresponding standard error (SE) defined as the standard deviation of β − β 2 2 from the simulated datasets.
We conduct the simulation by generating 200 datasets for each sample size, and summarize the percentages of over-selection, under-selection, exact-selection, and MSEs (SE) in Table 1, top panel for sample size 50. Our GPGEE model (Model 5) has the smallest MSE and SE among the 5 models, and selects the true model for 94.5% of the simulated datasets, compared to the existing PGEE model (Model 1) with only 2.5% exact-selection and much higher MSE. The results further show that without the pre-specified structured correlation (Model 3), the model selection is less accurate, and it is more likely to have higher under-selection and larger MSE (SE), which suggests it is crucial to incorporate the structured correlation when the data is multi-level by nature. The results also show the importance of adopting group penalty, especially to deal with under-selection problem when there are covariates with smaller coefficients. In addition, if we capture the multi-level correlation structure correctly but mis-specify the true correlation for one layer (Model 4), the method can still be helpful to select the true model (96% selection rate) though the MSE is larger due to the mis-specification. Thus, as long as we specify this multi-level correlation structure, our method is relatively robust for variable selection against misspecification of the working correlation structure. When the sample size is increased 100, the comparsion results remain to be similar, as shown in Table 1, bottom panel.

Results
In the PROVIDE cohort, there were 47 clinical factors and early-stage biomarkers available for the analysis, including children's enteric and systemic inflammatory biomarkers, nutritional measures, maternal health and socioeconomic status (SES), and sanitation conditions [19]. Of them, 14 biomarkers are categorical measures, and the rest are continuous variables. The CRP index is a cumulative number of times that children experienced elevated CRP level over the first two years (i.e., being on the top 50% at 6, 18, 40, 53 and 104 weeks), thus measuring the sustained inflammation burden. For 70 children with ERP measurements, the descriptive statistics of these clinical factors and biomarkers are summarized in Table 2.
As described earlier, in the ERP study, the children were shown with the face pictures, 70% of time for the same face (standard condition) and 30% of time with new different faces (oddball condition) over 150 trials. Brain activities were recorded for all electrodes during the observation of each picture, and ERP components were derived from multiple trials to measure the electrical activity of the brain immediately in response to a direct stimulus event [6]. In this clinical application, the mean peak amplitude of N290 component was used as a clinical example, which measures the brain response with face processing around 290 ms, obtained under each treatment condition from 13 electrodes placed on different locations of occipital region. The N290 amplitude reflects the synchronous activation of large number of neurons, and large amplitude is generally deemed to have greater underlying neuronal activity. It is hypothesized that the N290 amplitude response originates in areas of the brain dedicated to face processing, such as the occipital face areas and the inferior temporal cortex (such as the fusiform). In addition to evaluate the difference in N290 amplitude for neural activity of face processing between the two conditions, we aimed to study the association of biomarkers in infancy with the ERP response in early childhood. Ultimately, we hope to gain insights on how infant's health and nutrition markers affects the development of the brain.
For each child, there are 26 N290 amplitude responses corresponding to 13 channels under 2 conditions. Those 26 ERP responses are highly correlated with multilevel correlation structure due to the nature of this experiment, that is, N290 measurements are not only correlated across channels, but also vary under different treatment conditions. As shown in Fig. 2, the N290 measurements among 13 channels (aligned by their locations on the brain) are highly correlated, and the correlations appear to be autoregressive in that channels closer to each other in the brain yield higher correlations than that further apart. Also, the correlation patterns appear to be different between oddball and standard conditions. In addition, N290 measurements vary considerably across the 13 channels and across conditions as depicted in Fig. 1. For the special data features, our proposed GPGEE model described in "Methods" section can properly evaluate the relationship between biomarkers and N290 response while accounting for the hierarchical correlation structures and variations across channels/conditions. To apply our proposed model, the correlation matrix between conditions for the same channel was assumed to be unstructured, and that among channels for the same condition to be autoregressive with order 1 (AR1). The group penalty was applied to electrode or channel which is a multilevel categorical covariate with 13 levels. By using 12 dummy variables and grouping them together, we are able to conduct variable selection for this covariate. The N290 responses were assumed to be normally distributed with identity link. For biomarkers and clinical predictors, prescreening was performed based on their correlations, and representative predictors were selected for those with corrections > 0.7. Thus 6 biomarkers were removed, including IL-4 at week18, IL-6 at week 18, TNFa at week 18, WAZ at birth, WHZ at birth and monthly household income.
The results of variable selection with our proposed GPGEE for N290 response were presented in Table 3. A total of 10 biomarkers were selected using BIC after adjusting for condition and channel differences. Among those selected biomarkers, IL-10, RBP, Zinc, Calprotectin, Neopterin and water treatment were positively associated with N290 amplitude, while IL-5, MIP1b, MPO and maternal height have negative effects on the N290. These results provide some supporting evidence that children's health conditions in early childhood indeed are associated with brain development at 3 years of age. While N290 amplitude measures the strength of the signal of brain activity for brain connectivity, some researchers have also focused on studying the change of N290 amplitude between conditions. The differences in N290 between oddball and standard conditions reflects how the brain behaves differently when seeing a new face vs. a familiar face, and therefore measures the child's ability to discriminate between a novel and a familiar face. In particular, A differential response in these ERP components between the two experimental conditions indicates the detection or discrimination of the infrequent from the frequent faces by the brain and reflects some aspect of memory updating and the efficiency of stimulus processing [26,31,32].
When considering the difference in N290 as the outcome variable, the analysis would be performed similarly under our GPGEE framework, where the correction structure is reduced at channel level only. For the difference in N290 response, 13 biomarkers were selected (Table 4), of which 8 biomarkers have positive effects and 5 have negative effects on the N290 difference. Obviously, RBP at week 6, Zinc, mother height, and water treatment were  were only informative to the difference of N290 between the conditions, indicating that these biomarkers contributing to a stronger overall brain EEG signal don't necessarily contribute to a better EEG power, the ability to identify new faces.

Discussion
The primary objective of our clinical study was to identify biomarkers in early childhood that could affect children's brain development measured by ERP data at 3 years of age. To our best knowledge, no previous study has analyzed ERP data under correlated hierarchical data framework where the correlation structure among both channels and conditions are accounted for. Many available statistical methods couldn't be directly applied here because of the nature of ERP's correlation structure. Further, group penalty needs to be incorporated in variable selection for ERP data so the clustered clinical risk factors and biomarkers can be selected together. Therefore our proposed group penalized GEE estimator with structured correlation matrix for ERP data can properly model the complex ERP response and simultaneously identify informative biomarkers associated with ERP amplitude and ERP difference, respectively. Our proposed method outperforms the existing modeling approaches in the simulation study. Further, our work would be one of the pioneering efforts in ERP research to test the condition difference in ERPs and, simultaneously, to identify important covariates associated with ERPs.
In our study, N290 measure was analyzed in the clinical application, but the developed method can be applied to any other ERP measurements with tasks focusing on different brain functions. Our clinical findings were limited by the small sample size, missing data in biomarkers, and time lag between collection of biomarkers and ERP measurement. Nevertheless, our proposed method emphasizes on the correlation structure among channels based on their physical locations on the brain, thus improves the model estimation efficiency for ERP data analysis. For future work, if data is normally distributed with identity link function, our proposed method can be extended further with choices of penalty, such as elastic net, and computing algorithms, such as Fast Iterative Shrinkage-Thresholding Algorithm [33] or Alternating Direction Method of Multipliers [34], which would improve the computational time for larger datasets. In addition, although the ERP responses were considered as the continuous outcomes, our model is also applicable to other types of response such as categorical or count response. In addition, the systemic and enteric inflammation biomarkers identified in this study for their association with ERPs are similar and consistent with the previous findings in the cognitive development research [35].