Potential risk factors associated with human encephalitis: application of canonical correlation analysis

Background Infection of the CNS is considered to be the major cause of encephalitis, and more than 100 different pathogens have been recognized as causative agents. Despite being identified worldwide as an important public health concern, encephalitis has been the subject of very few studies, and these often focus on particular types of encephalitis defined by causative agent (e.g. West Nile, Japanese). Moreover, a number of other infectious and non-infectious conditions present with similar symptoms, and distinguishing encephalitis from conditions that mimic it continues to be a challenging task.
Methods We used canonical correlation analysis (CCA) to assess associations between a set of exposure variables and a set of symptom and diagnostic variables in human encephalitis. The data consist of 208 confirmed cases of encephalitis from a prospective multicenter study conducted in the United Kingdom. We used a covariance matrix based on Gini's measure of similarity and permutation-based approaches to test the significance of the canonical variates.
Results The results show that only weak pair-wise correlations exist between the risk factor (exposure and demographic) variables and the symptom/laboratory variables. However, the first canonical variate from CCA revealed a strong multivariate correlation (ρ = 0.71, se = 0.03, p = 0.013) between the two sets. We found a moderate correlation (ρ = 0.54, se = 0.02) between the variables in the second canonical variate; however, this value is not statistically significant (p = 0.68). Our results also show that only a very small amount of the variation in the symptom set is explained by the exposure variables. This indicates that host factors, rather than environmental factors, might be important for understanding the etiology of encephalitis and for facilitating early diagnosis and treatment of encephalitis patients.
Conclusions There is no standard laboratory diagnostic strategy for investigation of encephalitis and even experienced physicians are often uncertain about the cause, appropriate therapy and prognosis of encephalitis. Exploration of human encephalitis data using advanced multivariate statistical modelling approaches that can capture the inherent complexity in the data is, therefore, crucial in understanding the causes of human encephalitis. Moreover, application of multivariate exploratory techniques will generate clinically important hypotheses and offer useful insight into the number and nature of variables worthy of further consideration in a confirmatory statistical analysis.


Background
Encephalitis is a complex clinical syndrome of the central nervous system (CNS) associated with fatal outcome or severe permanent damage including cognitive and behavioral impairment and epileptic seizures [1][2][3][4][5]. It is often acute, and symptoms may progress rapidly, causing severe debilitation to patients including otherwise healthy children [2,3]. Lewis and Glaser define encephalitis as an acute CNS dysfunction with radiographic or laboratory evidence of brain inflammation [2]. There is no standard laboratory diagnostic strategy for investigation of encephalitis, and even experienced physicians are often uncertain about the cause, appropriate therapy and prognosis [1][2][3][6].
Despite encephalitis being identified worldwide as an important public health concern, retrospective studies on it are very few and often focus on particular types of encephalitis defined by causative agent (e.g. West Nile, Japanese). However, there are relatively more studies in the pediatric population [2,3,7,8]. Moreover, current knowledge about encephalitis is limited to descriptive statistics. As a result, a comprehensive understanding of human encephalitis, as generated through high-quality evidence-based studies and statistical analyses, is limited, and much of the current knowledge base lacks generalizability [2,9,10,11].
Encephalitis is characterized by fever, headache and altered level of consciousness, together with seizures and focal neurological findings in some cases [1,3,11]. Using data from the same prospective study presented in this paper, our group previously identified fever, personality and behavioural change, headache and lethargy as the main characteristics of human encephalitis [10,11]. It was also shown that diagnostic variables such as abnormal brain scan and cerebrospinal fluid measurements are indicators of encephalitis. Seizures, focal neurological deficits, stiff neck, urinary symptoms, respiratory symptoms and gastro-intestinal symptoms have also been previously shown to be associated with encephalitis [1,2,11]. Fowler et al., in a retrospective study of paediatric encephalitis, found that fever and encephalopathy were the main disease characteristics in a Swedish sample [3].
Encephalitis is a rare disease, with annual incidence ranging from 3.5 to 7.4 cases per 100,000 persons worldwide [1,2,12]. It affects people of all ages; however, the condition is more common in children, the elderly and persons with a weakened immune system (e.g. HIV/AIDS patients and patients undergoing cancer treatment). Encephalitis is known to affect both sexes; however, most studies have indicated a slightly higher incidence rate in males [1,13,14,15]. The epidemiology of encephalitis is difficult to summarize since few population-based studies exist, many causal pathogens are capable of inducing encephalitis-like symptoms and most cases go unreported to health authorities. Consequently, many details about its epidemiology have yet to be explained [1,2,10].
To date, infection of the CNS is considered to be the major cause of encephalitis, and more than 100 different pathogens have been recognized as causative agents [1,10]. However, an estimated 32-85% of cases have unknown disease etiology [1,16,17,18,19,20]. For instance, about 85% of the 189 cases in a study conducted in Minnesota, USA were of unknown cause [20]. In a California-based study, about 65% of the 334 cases were of unknown etiology [18]. In a study conducted in the UK, about 60% of 700 cases were of unknown etiology [16]. Among the known causes, Herpes Simplex Virus (HSV) has been recognized as the most common etiology [1,10,20]. Viruses, bacteria, fungi and parasites can all cause encephalitis [1][2][3]. Rarely, encephalitis can also be triggered by brain injury, brain tumor, drug reactions and lead poisoning. The main infectious causes of encephalitis are listed in a review paper by Granerod and Crowcroft [1].
In many parts of the world, viral infections of the central nervous system are often spread via vector-borne infection, such as mosquito bites and tick bites; however, animal-to-human interactions also can facilitate disease spread (e.g. raccoon feces, cat scratches, animal bites) and human-to-human transmission is also possible. Bacteria causing encephalitis can also spread through animal contact and water exposure. Possible risk factors associated with encephalitis and disease pathologies are provided in Lewis and Glaser [2].
A number of other infectious and non-infectious conditions present with similar symptoms and hence a challenge lies in distinguishing encephalitis from other disguising conditions [1,2,6]. Exploration of human encephalitis data using advanced multivariable statistical modelling approaches that can capture the inherent complexity in the data is, therefore, crucial for elucidating the causes of human encephalitis. Moreover, application of multivariate exploratory techniques will generate clinically important and better focused hypotheses that would benefit encephalitis researchers in reducing the number of variables to be considered for further confirmatory statistical analysis. This will ultimately lead towards better evidence-based clinical practices, including: diagnosis, prognosis discovery and development of novel therapeutic options.
In this paper, we use canonical correlation analysis (CCA) to explore the relationship between a set of exposure variables that are potential risk factors and a set of symptom and diagnostic variables in encephalitis. The symptom and diagnostic variables considered in this paper include variables that are previously identified as main indicators of encephalitis as well as those with a potential to be associated with the disease. Our data consist mostly of binary variables (presence or absence of a particular attribute) and as a result, the usual correlation matrix which is particularly designed for continuous measurements is not appropriate. We therefore propose to use a correlation matrix based on Gini's idea of variance or likeability for categorical variables.

Study population and data description
The data consist of 268 patients recruited from 24 hospitals/neurological centers in three geographical locations across England (South West, London, North West). Measurements from 16 symptom, 6 diagnostic (3 from cerebrospinal fluid, 2 from brain scans/images and 1 from electroencephalography) and 13 exposure variables were recorded. Age, gender, duration of illness and length of hospital stay were also available. Most of the variables in the study are binary, indicating presence or absence of an attribute; the others were dichotomized before performing the CCA. Age is dichotomized into one group consisting of young children (age ≤ 10 years) and another group consisting of older children and adults (age > 10 years). Duration of illness is dichotomized as short (≤ 100 days) and long (> 100 days), and length of hospital stay is dichotomized as short (≤ 50 days) and long (> 50 days). These cutoff values were determined using results from analysis of the univariate distributions. Variables included in our study are listed in Table 1. More details about the UK encephalitis study can be found in the original paper [10].
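The dichotomization rules described above can be sketched in a few lines of Python. This is only an illustration: the field names (`age_years`, `illness_days`, `stay_days`) and the 0/1 coding direction are assumptions introduced here, not names or conventions taken from the study data.

```python
# Cutoffs quoted in the text; the dictionary keys are hypothetical field names.
CUTOFFS = {"age_years": 10, "illness_days": 100, "stay_days": 50}


def dichotomize(value, cutoff):
    """Return 0 if value <= cutoff, 1 if value > cutoff, None if missing."""
    if value is None:
        return None
    return int(value > cutoff)


def code_patient(record):
    """Apply every cutoff to one patient record (a plain dict)."""
    return {name: dichotomize(record.get(name), cutoff)
            for name, cutoff in CUTOFFS.items()}


# Example: a 10-year-old with a 101-day illness and unknown length of stay.
print(code_patient({"age_years": 10, "illness_days": 101, "stay_days": None}))
```

Missing values are kept as `None` rather than imputed, so that they can later be handled on a pairwise-available basis.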

Methods
We used canonical correlation analysis (CCA) to investigate the relationship between the set of exposure and demographic variables (X) and the set of symptom, clinical and diagnostic variables (Y) in human encephalitis.

Canonical Correlation Analysis (CCA)
Consider two sets of variables $X_p = \{x_1, x_2, \ldots, x_p\}$ and $Y_q = \{y_1, y_2, \ldots, y_q\}$, measured on n individuals, where p and q represent the number of variables in each set. Canonical correlation analysis seeks to determine the optimal set of min(p, q) pairs of linear combinations (called canonical variates), $a'x = \sum_i a_i x_i$ and $b'y = \sum_j b_j y_j$, from sets $X_p$ and $Y_q$ which produce maximum correlation [21][22][23][24][25]. That is, the method finds two vectors $a = (a_1, a_2, \ldots, a_p)$ and $b = (b_1, b_2, \ldots, b_q)$ such that the following correlation is maximized:

$$\rho = \frac{a' S_{xy} b}{\sqrt{(a' S_{xx} a)\,(b' S_{yy} b)}} \qquad (1)$$

where $S_{xx}$ and $S_{yy}$ are the within-set covariance matrices for X and Y, respectively, and $S_{xy}$ is the between-set covariance matrix. The solution is obtained by solving the following two eigenvalue problems [23,24]:

$$S_{xx}^{-1} S_{xy} S_{yy}^{-1} S_{yx}\, a = \lambda a, \qquad S_{yy}^{-1} S_{yx} S_{xx}^{-1} S_{xy}\, b = \lambda b$$

where $S_{yx} = S_{xy}'$ and the eigenvalues λ, which are sometimes denoted by $r^2$, represent the squared canonical correlations. The eigenvectors (a, b) corresponding to the leading eigenvalue are the solutions to equation (1). The first canonical variate is therefore the one which explains most of the relationship. CCA has been successfully applied in medical and epidemiological research [26,27].

Covariance/Correlation matrix for categorical data
Since the data in this study consist mostly of binary variables (presence or absence of a particular attribute), the usual correlation matrix, which is designed for continuous measurements, would not be an appropriate choice. Covariance or correlation matrices for categorical data have been considered previously by many authors, and several formulations have been proposed to assess the strength of association between two categorical variables. Here we use the covariance/correlation matrix proposed by Okada et al. [29,30]. Their approach is a generalization of Gini's definition of variance or likeability for categorical data, which is also known as Gini's index [28][29][30][31][32][33].
Let $X = \{x_1, x_2, \ldots, x_p\}$, where the $x_i$ are categorical variables measured on n individuals. The ij-th element of the variance-covariance matrix V (the covariance between $x_i$ and $x_j$ when i ≠ j; $V_{ii}$ is the variance of $x_i$) is calculated as

$$V_{ij} = \frac{1}{2n^2} \sum_{a=1}^{n} \sum_{b=1}^{n} (x_{ia} - x_{ib})\, L\, (x_{ja} - x_{jb})'$$

where $x_{ia}$ denotes the value of $x_i$ for individual a and L is an orthogonal matrix (orthogonal transformation) [30]; in our case L = 1. When calculating the variance, for instance,

$$V_{ii} = \frac{1}{2n^2} \sum_{a=1}^{n} \sum_{b=1}^{n} (x_{ia} - x_{ib})^2$$

The ij-th element of the correlation matrix R can then be calculated as

$$R_{ij} = \frac{V_{ij}}{\sqrt{V_{ii} V_{jj}}}$$

Simplified formulas for two special cases (binary and trinomial variables), using 2 × 2 and 3 × 3 contingency tables, can be found in Okada et al. [30,33]. We implemented the above variance-covariance/correlation formula in the R statistical software and used it in our CCA. Pairwise available data were used when missing values occurred.
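As an illustration of the Gini-type covariance for binary variables, the following Python sketch computes the pairwise-difference form with L = 1 and the resulting correlation matrix. It is not the authors' R implementation, which is not shown in the paper. It assumes 0/1 coding, marks missing values as NaN, and computes each variance on the same pairwise-available cases as the covariance, which is one possible reading of "pairwise available data". The function names are hypothetical.

```python
import numpy as np


def gini_cov(x, y):
    """Gini-type covariance of two 0/1 vectors with L = 1 (NaN = missing).

    Computes (1 / (2 n^2)) * sum_a sum_b (x_a - x_b) * (y_a - y_b)
    over the cases where both values are available (assumes n > 0).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))
    x, y = x[keep], y[keep]
    n = x.size
    dx = x[:, None] - x[None, :]  # all pairwise differences of x
    dy = y[:, None] - y[None, :]  # all pairwise differences of y
    return float((dx * dy).sum() / (2.0 * n ** 2))


def gini_corr_matrix(data):
    """Correlation matrix R_ij = V_ij / sqrt(V_ii * V_jj), pairwise-complete.

    Assumes no column is constant on the pairwise-available cases.
    """
    data = np.asarray(data, dtype=float)
    k = data.shape[1]
    R = np.eye(k)
    for i in range(k):
        for j in range(i + 1, k):
            xi, xj = data[:, i], data[:, j]
            keep = ~(np.isnan(xi) | np.isnan(xj))
            v_ij = gini_cov(xi, xj)
            v_ii = gini_cov(xi[keep], xi[keep])
            v_jj = gini_cov(xj[keep], xj[keep])
            R[i, j] = R[j, i] = v_ij / np.sqrt(v_ii * v_jj)
    return R
```

For 0/1 data with L = 1 the double sum reduces algebraically to mean(xy) - mean(x)·mean(y), so under these assumptions the sketch reproduces the ordinary (population) covariance and the phi coefficient; the pairwise-difference form is kept only to mirror the formula.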
Statistical analysis was performed using the Canonical Correlation Analysis (CCA) and Significance Tests for Canonical Correlation Analysis (CCP) libraries in the R software package [34][35][36]. Parametric multivariate tests are not appropriate since our data consist of binary variables and hence violate the multivariate normality assumption. We therefore used a non-parametric permutation approach and calculated standard errors and p-values based on 10,000 permutations.
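The idea behind the permutation approach can be sketched in Python as follows. This is a simplified illustration rather than the procedure implemented in the CCP library: it assumes complete data, uses the ordinary sample covariance matrix, and takes the first canonical correlation itself as the test statistic. The function names and the default number of permutations are assumptions made here.

```python
import numpy as np


def canonical_correlations(X, Y):
    """Canonical correlations from the eigenvalues of Sxx^-1 Sxy Syy^-1 Syx."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    p = X.shape[1]
    S = np.cov(np.hstack([X, Y]), rowvar=False)  # joint covariance matrix
    Sxx, Sxy, Syy = S[:p, :p], S[:p, p:], S[p:, p:]
    M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    lam = np.sort(np.linalg.eigvals(M).real)[::-1]  # squared correlations
    k = min(p, Y.shape[1])
    return np.sqrt(np.clip(lam[:k], 0.0, 1.0))


def permutation_pvalue(X, Y, n_perm=1000, seed=0):
    """Permutation p-value for the first canonical correlation.

    Rows of X are shuffled to break the X-Y link while leaving the
    within-set correlation structure of each set intact.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    rng = np.random.default_rng(seed)
    observed = canonical_correlations(X, Y)[0]
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(X.shape[0])
        if canonical_correlations(X[perm], Y)[0] >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return observed, (hits + 1) / (n_perm + 1)
```

Calling `permutation_pvalue(X, Y, n_perm=10000)` would correspond to the number of permutations reported in the text. With sparse binary data the within-set covariance matrices can be singular or nearly so, in which case this sketch would need regularization or removal of redundant variables.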

Results and Discussion
Our data set consists of 268 patients (152 from North West England, 94 from London and 22 from South West England), of whom 263 met the case definition (the case definition criteria are presented in the original paper our group recently published [10]). Of these, 208 patients are confirmed encephalitis cases (40 of the 208 are meningoencephalitis patients). We focused on these 208 confirmed encephalitis patients for the CCA in this paper; however, for comparison purposes, we also performed the analysis on the 263 patients for whom the case definition was met. Summary statistics for our data on encephalitis patients are presented in Table 2.
The results in Table 2 show a slightly higher proportion of men (54%, n = 113) than women (46%, n = 95) among the encephalitis patients, suggesting that men are at a slightly higher risk. This is in agreement with previous findings [13][14][15]. Most of the encephalitis patients are children and young adults (median age = 30, IQR = 45), and a large proportion of the patients are children of age ≤ 10 (26%, n = 55), indicating that young children are at higher risk of developing encephalitis. The age distribution is quite uniform after age 10, with approximately equal proportions of patients (9.6%, n = 20) observed in each 10-year age interval. We therefore used 10 as the cutoff point when dichotomizing age for the CCA.
Our results show that the majority of encephalitis patients (69.7%, n = 145) were hospitalized for ≤ 50 days (median = 27, IQR = 43) and that the duration of illness was less than 100 days (median = 37, IQR = 46.25) for a large proportion (80%, n = 167) of the patients. Consequently, we used 50 days and 100 days as cutoffs when dichotomizing hospital stay and duration of illness, respectively, for the CCA.
Overall, the data on the encephalitis patients are sparse: a larger proportion of zeroes (absence) than ones (presence) is observed for most of the variables (Figures 1 and 2). This is particularly the case for the exposure variables (Figure 1), with the exception of animal contact (48.6% exposed), recent infection (37.5% of the patients had had a recent infection) and sick person contact (26%). For instance, the percentages of patients exposed to tick and mosquito bites are only 3.4% (n = 7) and 6.3% (n = 13), respectively. A considerable percentage of patients had water exposure (18.3%) or had experienced head trauma (11.1%).
On the other hand, symptom and diagnostic variables have relatively larger event rates (Figure 2); the variables with the smallest rates are coma and photophobia, which were observed in only 3.8% (n = 8) and 7.7% (n = 16) of the patients, respectively. Fever and abnormal white blood cell count (abnormal WCC) are indicated as the two main characteristics of encephalitis: 77.9% and 76.9% of the patients had fever and abnormal WCC, respectively (Figure 2, Table 2). The results also show that personality and behavioral change, headache, lethargy and abnormal protein are the next most frequently occurring characteristics of encephalitis. Some missing values are observed in the exposure variables (Figure 2); however, a substantial amount of missing data is observed in the diagnostic variables, where measurements from EEG and glucose were missing for 42.3% (n = 88) and 37.5% (n = 78) of the patients, respectively (Table 2, Figure 2). Consequently, abnormal EEG, although previously shown to be one of the main indicators of encephalitis, is observed in only about half of the patients (48.1%). Nevertheless, among patients with available EEG measurements (n = 120), 83.3% (n = 100) have an abnormal EEG, which is in agreement with previous findings. This is mainly because the diagnostic decision tree often leads clinicians to carry out an EEG in patients with a high likelihood of it being abnormal; seizures, for example, are one of the triggers. Patients with EEGs are therefore a particular clinical cluster of their own.
Heatmaps of within- and between-set correlations are presented in Figure 3, where dark blue and dark red colors indicate very strong correlations (a color indicator bar with ranges of correlations is presented under the heatmaps). Figure 3 indicates that, in general, weak to moderate pair-wise correlations (-0.22 to 0.63) exist both within and between the X and Y sets of variables. The largest correlations are observed between length of hospital stay and duration of illness (0.63), and between tick and insect bites (0.55). CCA produced min(p, q) = 15 canonical variates, where p = 15 is the number of variables in the X set and q = 24 is the number of variables in the Y set. However, only the first canonical variate is statistically significant at the α = 0.05 level. We will, therefore, discuss only the first canonical variate in this paper.
The cross-correlation matrix displayed in Figure 3 shows that only weak pair-wise correlations exist between the risk factor (exposure and demographic) and outcome (symptom, clinical and diagnostic) variables. However, the first canonical solution/variate from CCA revealed a strong multivariate correlation (ρ = 0.71, standard error (se) = 0.03, p-value = 0.013) between the two sets. We found a moderate correlation (ρ = 0.54, se = 0.02) between the variables in the second canonical variate; however, this value is not statistically significant (p-value = 0.68).
The first canonical solution consists of two sets of variables: the linear combination of X set variables (exposure and demographic features) and the linear combination of the Y set variables (symptom, clinical and diagnostic features). Individual canonical loadings (structural coefficients) between these two sets of variables with their corresponding canonical variates are presented in Table 3.
The top ranked variable in the exposure set is age (loading = 0.94), indicating that age contributed a large amount of the variation (88%) in the first canonical variate of the exposure set and is hence the driving variable for that canonical variate. The cross loading for age also shows that a considerable amount (45%) of the variation in the canonical variate of the symptoms is explained by age. This result is in agreement with previous findings showing that children are at an increased risk of developing encephalitis compared to adults. Sick person contact and immunization also contributed considerably towards the first canonical variate, with loadings of 0.47 and 0.27 and cross loadings of 0.34 and 0.22, respectively. The contribution of the rest of the exposure variables towards the variation in the first canonical variate is negligible. Variables that contributed the least include animal contact and sex, where only 0.25% of the variation in the first canonical variate was attributed to these variables. Variables that contribute to the first canonical variates of both sets are shown in a simple "finger plot" presented in Figure 4.
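The percentages quoted above follow from squaring the loadings, since a squared correlation gives the proportion of shared variance. The short Python sketch below reproduces that arithmetic for the loadings stated in the text; the dictionary and function names are illustrative only.

```python
# Loadings and cross loadings quoted in the text for the first canonical variate.
loadings = {"age": 0.94, "sick person contact": 0.47, "immunization": 0.27}
cross_loadings = {"sick person contact": 0.34, "immunization": 0.22}


def shared_variance_pct(loading):
    """Percentage of variance shared between a variable and a canonical variate."""
    return 100.0 * loading ** 2


for name, value in loadings.items():
    print(f"{name}: loading {value:.2f} -> {shared_variance_pct(value):.1f}%")
for name, value in cross_loadings.items():
    print(f"{name}: cross loading {value:.2f} -> {shared_variance_pct(value):.1f}%")
```

For age, 0.94 squared is about 0.88, matching the 88% reported. By the same relation, the 45% quoted for the cross loading of age would correspond to a cross loading of roughly 0.67, a value inferred here rather than stated in the text.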
Among the symptom and diagnostic variables, abnormal WCC, headache and confusion are the three top ranked variables, contributing 27%, 26% and 25% of the variation in the first canonical variate of the symptom set, respectively. The other variables with a considerable contribution towards the first canonical variate are abnormal protein, PB change, length of hospital stay and duration of illness, explaining 15%, 12%, 9% and 9% of the variation, respectively. The canonical cross loadings also indicate that the symptom variables shown in Figure 4 explain a considerable amount of the variation in the first canonical variate of the exposure set. Fever, although present in the majority of the patients (77.9%, Table 2), does not contribute much towards the first canonical variates, explaining only 0.04% and 0.16% of the variation in the symptom and exposure variates, respectively.
We also performed CCA on the 263 patients who met the case definition criteria as presented in the original paper [10]. In general, the pattern observed in the within- and between-set correlations for this data set is similar to that obtained for the 208 confirmed encephalitis cases, with weak to moderate correlations between the variables. A correlation of ρ = 0.68 (p-value = 0.007) was obtained between the sets of variables in the first canonical solution. The second canonical solution resulted in ρ = 0.54 (p-value = 0.19). Overall, the canonical loadings for X and their rankings are similar to those presented in Table 3 and Figure 4, respectively. Therefore, our analysis based on 263 patients identified the same sets of exposure variables as being strongly associated with symptom, clinical and diagnostic variables.
Redundancy coefficients indicate that only a very small amount of the variation in the original symptom variables is explained by the exposure canonical variates. Only 6% of the variation in the symptom variables is explained by the first exposure canonical variate, 5% by the second and 4% by the third. This indicates that the variation in the symptoms might be caused by host factors rather than environmental and exposure factors. The idea that characteristics of the host may be more important than the pathogen is consistent with the observation that for some causes, such as herpes simplex virus (HSV), encephalitis is a rare outcome of a common infection. Another possible hypothesis that might be drawn from our results is that exposure and symptom variables provide independent information towards understanding the etiology of encephalitis. Further case-control analysis based on exposure, symptom and host factors might shed light on the factors that could help facilitate diagnosis and treatment of encephalitis patients.
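A redundancy coefficient of this kind is commonly defined (the Stewart-Love index) as the squared canonical correlation multiplied by the average squared loading of the variables on their own canonical variate. The paper does not state which formula was used, so the Python sketch below assumes this standard definition, and the loadings in the example are hypothetical values chosen only to show the calculation.

```python
def redundancy(y_loadings, canonical_corr):
    """Stewart-Love redundancy of a variable set given the other set's variate.

    y_loadings: correlations of each Y variable with its own canonical variate.
    canonical_corr: canonical correlation for that pair of variates.
    """
    mean_squared_loading = sum(l ** 2 for l in y_loadings) / len(y_loadings)
    return canonical_corr ** 2 * mean_squared_loading


# Hypothetical loadings for four Y variables and a canonical correlation of 0.71.
example = redundancy([0.5, 0.4, 0.2, 0.1], 0.71)
print(f"redundancy = {example:.3f}")
```

Under this definition a large canonical correlation can coexist with a small redundancy coefficient whenever most variables in the set load only weakly on the canonical variate, which is the pattern described above.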

Conclusion
We performed exploratory multivariate analysis using CCA to study associations between two sets of variables in encephalitis patients. One set consists of exposure and demographic variables, including variables that were previously identified in the literature as potential risk factors. The second set includes symptom, clinical and diagnostic variables, some of which have been shown to be important clinical characteristics of encephalitis. Although pair-wise cross correlations between the two sets of variables are weak to moderate, CCA revealed a strong multivariate correlation between the two sets. Our analysis identified a set of exposure/demographic variables (age, sick person contact, immunization and water exposure) that is strongly associated with 7 symptom/diagnostic variables (abnormal WCC, headache, confusion, abnormal protein, personality and behavioral change, length of stay and duration of illness).
Our analysis also revealed that only a very small amount of the variation in the symptom set is explained by the exposure variables. This indicates that host factors, rather than environmental factors, might be important for understanding the etiology of encephalitis and for facilitating early diagnosis and treatment of encephalitis patients.
CCA is exploratory in nature and measures association rather than causation. However, our analysis identified exposure variables that might be strongly associated with encephalitis and generated important hypotheses that can be investigated further to identify risk factors that are predictive of encephalitis. A confirmatory case-control analysis involving encephalitis and non-encephalitis patients is needed to identify risk factors and important symptom variables that can be used to facilitate diagnosis. CCA results may, however, provide insight into potentially smaller sets of variables worth investigating further. Furthermore, it is important to highlight that exposures such as tick bites do not occur frequently in the UK and do not often lead to encephalitis, and so are difficult to study using conventional methods such as logistic regression analysis. CCA can, therefore, be a useful tool for identifying risk factors associated with human encephalitis and other rare and complex diseases where regression approaches may not be optimal.