BMC Medical Research Methodology

Background: Principal component analysis (PCA) and partial least squares (PLS) regression may be useful to summarize HIV genotypic information. Without pre-selection, each mutation present in at least one patient is considered, with a different weight. We compared these two strategies with the construction of a usual genotypic score.


Background
The development of HIV resistance mutations is one of the major obstacles to optimizing the treatment of HIV-infected patients. Therefore, resistance testing before starting highly active antiretroviral therapy (HAART) or before switching to a new antiretroviral component is widely recommended [1-4] and is now routinely implemented in industrialised countries. Resistance is due to mutations in the viral genome, e.g. mutations in the reverse transcriptase (RT), protease or integrase genes that cause resistance to nucleoside RT inhibitors (NRTIs) and non-nucleoside RT inhibitors (NNRTIs), protease inhibitors (PIs), or integrase inhibitors, respectively. Genotypic and phenotypic resistance testing are the two commonly used tests. The impact of genotypic mutations on virological response in patients treated with a particular drug regimen is inferred from in vitro information or from the virological response reported in patients who switched to that particular regimen. Before an optimized treatment is initiated, the genotype of the patient's major virus populations is assessed (only virus species present at >20-30% are detected and therefore analysed). Statistical analyses aim at finding the baseline genotypic mutations associated with virological response in order to predict whether a patient who will switch to a similar regimen is resistant or not. Notably, data are mostly analysed for the main drug of a given regimen only, i.e. the NNRTI and/or PI.
However, traditional statistical analyses of the association between genotypic mutations and virological response are hampered by i) the high number of potential mutations, ii) the correlations between mutations and iii) the low number of patients usually available for this type of study. Specifically, analysing the effect of a high number of mutations measured in a limited number of patients may lead to over-fitting, and the resulting inflated variances lead to non-significant associations. To circumvent these problems and to simplify interpretation, genotypic mutations are summarised in a so-called genotypic score. This score is the sum of the resistance mutations to the given drug observed at baseline in a given patient. The mutations composing the score are selected by different strategies [5,6]. The drawbacks of this analysis are that a preselection of mutations is required and that every mutation has the same weight. Alternative strategies such as principal component analysis (PCA) and partial least squares (PLS) regression have been suggested for reducing the dimension of correlated predictors [5,7-9] and may improve the description of associations between mutations. The two techniques do not select mutations but assign a different weight to each mutation present in the dataset. We aimed at comparing these two strategies with the usual construction of a genotypic score using data from an existing study evaluating the impact of protease mutations on the virological response in patients switching to a fosamprenavir/ritonavir-based HAART [10].

Data
The Zephir study was designed to investigate the impact of baseline protease genotypic mutations on virological response in HIV-1 infected, PI-experienced patients. All patients had baseline HIV-1 RNA levels >1.7 log10 copies/mL and switched to a ritonavir-boosted fosamprenavir-based HAART [10]. Patients included were followed at the Bordeaux University hospital and at four other public hospitals in Aquitaine, south-western France, all participating in the ANRS CO3 Aquitaine Cohort. We used a subset of 87 patients with a complete baseline genotype and plasma HIV-1 RNA available at baseline and at week 12. Virological failure was defined as an HIV-1 RNA ≥400 copies/mL and a decrease of <1 log10 copies/mL in HIV-1 RNA between baseline and week 12 (virological success: HIV-1 RNA <400 copies/mL or a reduction of ≥1 log10 copies/mL). A mutation was defined as a difference between the amino acid sequence of the studied virus and that of the wild-type (HXB2) virus. In total, we created 69 dummy variables (69 mutations among the 99 possible protease mutations were encountered at least once).
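The outcome definition and the coding of mutations can be sketched as follows. This is only an illustration: the function names are hypothetical, viral loads are assumed to be given in copies/mL and to be strictly positive, and sequences are assumed to be aligned amino-acid strings of equal length.

```python
import math

def virological_failure(rna_baseline, rna_week12):
    """Return True for virological failure at week 12.

    Both arguments are plasma HIV-1 RNA in copies/mL (assumed > 0).
    Failure: week-12 RNA >= 400 copies/mL AND a decrease of < 1 log10
    between baseline and week 12. Everything else counts as success.
    """
    log_drop = math.log10(rna_baseline) - math.log10(rna_week12)
    return rna_week12 >= 400 and log_drop < 1.0

def mutation_dummies(sequence, reference):
    """One 0/1 indicator per codon: 1 where the amino acid differs
    from the reference (wild-type) sequence."""
    return [int(a != b) for a, b in zip(sequence, reference)]
```

A patient going from 100,000 to 50,000 copies/mL is a failure under this definition, whereas a patient going from 100,000 to 5,000 copies/mL is a success because the decrease exceeds 1 log10.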

Statistical analysis
Construction of a genotypic score
The genotypic score was created in two steps. The first step considered mutations with prevalences ≥10% and ≤90% [5] to assess their association with virological failure. Mutations associated with a p-value < 0.01 (univariable logistic regression) were selected. Second, a backward procedure selected the combination of mutations with the strongest association with virological response [6]. The m mutations selected in the first step were used to calculate a first genotypic score for each patient. For instance, if a first set contains the six mutations V32, I47, I50, V77, I84 and L90, the score is defined as $S = I_{V32} + I_{I47} + I_{I50} + I_{V77} + I_{I84} + I_{L90}$, where each indicator equals 1 if the mutation is present and 0 otherwise (S varying from 0 to 6). During the backward selection procedure each mutation was removed in turn, so that all combinations of (m-1) mutations were investigated. The Cochran-Armitage test for linear trends in proportions was used to compare the probability of virological failure in patients having none to (m-1) mutations [11]. The combination providing the lowest p-value was kept and the procedure was repeated with all combinations of (m-2) mutations. The procedure stopped when removal of a mutation did not result in a lower p-value.
We drew 200 bootstrap samples from the original data set to analyse the variability in the selection of mutations. We assumed that the variability due to the restricted sample size would mainly affect the first selection step. Therefore, the bootstrap analysis was applied only to the first selection criteria. In each sample the prevalence of each mutation was calculated, and a univariable logistic regression was performed to determine the association of each mutation with virological failure. We then calculated how frequently each mutation was selected across the 200 bootstrap samples under the conditions mentioned above (prevalence between 10% and 90% and a p-value < 0.01 in univariable analysis).
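The score and the backward procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: function names are hypothetical, each patient is represented as a dictionary of 0/1 indicators, and the Cochran-Armitage statistic is computed with the score value as the dose and a two-sided normal approximation.

```python
import math

def genotypic_score(patient, selected):
    """Sum of the 0/1 indicators of the selected mutations for one patient."""
    return sum(patient.get(m, 0) for m in selected)

def trend_p_value(scores, failures):
    """Two-sided Cochran-Armitage trend test (normal approximation)."""
    n, r = len(scores), sum(failures)
    mean_s = sum(scores) / n
    sxx = sum((s - mean_s) ** 2 for s in scores)
    if r == 0 or r == n or sxx == 0:
        return 1.0  # no variation in outcome or score: no trend testable
    t = sum((s - mean_s) * f for s, f in zip(scores, failures))
    p_bar = r / n
    z = t / math.sqrt(p_bar * (1 - p_bar) * sxx)
    return math.erfc(abs(z) / math.sqrt(2))

def backward_selection(patients, failures, candidates):
    """Drop one mutation at a time while the trend p-value keeps decreasing."""
    def p_for(subset):
        return trend_p_value([genotypic_score(p, subset) for p in patients],
                             failures)
    selected = list(candidates)
    best_p = p_for(selected)
    while len(selected) > 1:
        # All subsets obtained by removing exactly one mutation
        trials = [[x for x in selected if x != m] for m in selected]
        subset_min = min(trials, key=p_for)
        p_min = p_for(subset_min)
        if p_min >= best_p:  # removal no longer lowers the p-value: stop
            break
        selected, best_p = subset_min, p_min
    return selected, best_p
```

For a patient carrying V32, V77 and I84 among the six example mutations, `genotypic_score` returns 3.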
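The bootstrap of the first selection step can be illustrated with the sketch below. It relies on the fact that, for a single 0/1 covariate, the maximum-likelihood slope of a logistic regression is the log odds ratio of the 2x2 table, with standard error sqrt(1/a + 1/b + 1/c + 1/d). Function names are hypothetical, and the handling of tables with an empty cell (the mutation is counted as not selected in that sample) is an assumption of this sketch, not something stated for the study.

```python
import math
import random

def wald_p_value(mutation, failure):
    """Wald p-value of a 0/1 mutation in a univariable logistic model.
    Returns None when a cell of the 2x2 table is empty (no finite estimate)."""
    pairs = list(zip(mutation, failure))
    a = sum(1 for m, f in pairs if m and f)
    b = sum(1 for m, f in pairs if m and not f)
    c = sum(1 for m, f in pairs if not m and f)
    d = sum(1 for m, f in pairs if not m and not f)
    if min(a, b, c, d) == 0:
        return None
    z = math.log(a * d / (b * c)) / math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.erfc(abs(z) / math.sqrt(2))

def selection_frequencies(data, failure, n_boot=200, p_max=0.01, seed=0):
    """data maps a mutation name to its list of 0/1 values (one per patient).
    Returns, per mutation, the fraction of bootstrap samples in which it
    meets both criteria: prevalence in [10%, 90%] and Wald p < p_max."""
    rng = random.Random(seed)
    n = len(failure)
    counts = {name: 0 for name in data}
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients
        y = [failure[i] for i in idx]
        for name, column in data.items():
            x = [column[i] for i in idx]
            if not 0.10 <= sum(x) / n <= 0.90:
                continue
            p = wald_p_value(x, y)
            if p is not None and p < p_max:
                counts[name] += 1
    return {name: k / n_boot for name, k in counts.items()}
```

A mutation selected in nearly all samples indicates a stable first-step selection, whereas intermediate frequencies point to selection that depends on the particular sample.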

Principal component analysis (PCA)
Each principal component is a linear combination of the original variables, with coefficients given by the elements of an eigenvector of the correlation or covariance matrix [7,9]. PCA determines components that represent the variability of the mutations. We only tested principal components with an eigenvalue >2, meaning that each explained about 3% or more of the variability of the mutations. The association between each of these principal components and the response variable was tested with the Wald test of the corresponding regression coefficient in a logistic regression model, and a principal component was kept when it was related to the virological response according to this test.
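The link between the eigenvalue threshold and the share of variability can be written out as follows. This is a sketch that assumes the PCA is run on the correlation matrix of the p = 69 dummy variables, so that the eigenvalues sum to p; the symbols v_{jk} (element j of eigenvector k), z_j (standardized indicator of mutation j) and λ_k (eigenvalue k) are notation introduced here.

```latex
\mathrm{PC}_k = \sum_{j=1}^{p} v_{jk}\, z_j ,
\qquad
\frac{\lambda_k}{\sum_{l=1}^{p} \lambda_l} = \frac{\lambda_k}{p} ,
\qquad
\frac{2}{69} \approx 0.029 \approx 3\% .
```

Under this assumption, a component with an eigenvalue of exactly 2 carries as much variability as two of the original standardized variables.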

Partial least squares (PLS) regression
PLS regression is a technique widely used for dealing with numerous correlated explanatory variables [8,12]. Like PCA, PLS regression identifies components that explain as much as possible of the variance of the predictor variables, but these components are constructed to be simultaneously correlated with the response variable. Over-fitting was controlled with leave-one-out cross-validation during the construction process. The number of factors chosen is usually the one that minimizes the predicted residual sum of squares (PRESS) [13].
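The leave-one-out criterion can be stated compactly. In the notation assumed here, y_i is the response of patient i, h is the number of PLS components, and the hatted term is the prediction for patient i from a model with h components fitted without that patient.

```latex
\mathrm{PRESS}(h) = \sum_{i=1}^{n} \left( y_i - \hat{y}^{(h)}_{(-i)} \right)^{2} ,
\qquad
\hat{h} = \arg\min_{h} \, \mathrm{PRESS}(h) .
```

Because each prediction is made for a patient excluded from the fit, adding components that only capture noise tends to increase PRESS rather than decrease it.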

Comparison
The probability of virological failure at week 12 was studied using a logistic regression model with either the genotypic score, the principal components or the PLS components as explanatory variables. The performance of the strategies was compared using the cross-validated AUC [7,8]. We used 5-fold cross-validation: we split the dataset into five parts of approximately equal size, so that each part served once as the 'validation set' (1/5 of the patients) while the remaining 4/5 of the patients served as the 'training set'. In the training set, we determined i) the genotypic score, ii) the principal components and iii) the PLS components. The selected mutations were then used to calculate the genotypic score for the patients included in the validation set. The weights for each mutation derived by PCA and PLS were applied to calculate the score of the principal component and of the PLS component, respectively, for the patients of the validation set. For each validation set the area under the ROC curve (AUC) was calculated by means of a logistic regression for the three different methods. Thus, we obtained 5 AUCs for each method, and the cross-validated AUC was calculated as the mean of these 5 AUCs. This approach limits over-fitting because the performance of the methods is tested in a subset of patients that was not used to determine the genotypic score or the weights of the mutations in the PCA and PLS components.
Statistical analyses were performed using SAS ® version 9.1 software (SAS Institute, Inc., Cary, NC). We used the procedures PROC PRINCOMP for principal component analysis and PROC PLS for partial least squares regression. Principal components and PLS components were determined considering all mutations present in at least one patient.
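The cross-validation scheme can be sketched as follows. This is an illustration with hypothetical function names, not the SAS code used in the study. The callback `fit` stands for any of the three strategies: it receives the patients not in the validation fold and returns a function that scores one patient. As a simplification, the AUC is computed directly on that score; since the AUC depends only on ranks, this matches the AUC of the probabilities from a logistic model with a single predictor and a positive coefficient.

```python
import random

def auc(scores, labels):
    """Probability that a randomly chosen failure (label 1) has a higher
    score than a randomly chosen success (label 0); ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both outcomes in the validation set")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def k_fold_indices(n, k=5, seed=0):
    """Shuffle patient indices and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_auc(X, y, fit, k=5, seed=0):
    """Mean of the k validation-fold AUCs."""
    aucs = []
    for fold in k_fold_indices(len(y), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        score = fit([X[i] for i in train], [y[i] for i in train])
        aucs.append(auc([score(X[i]) for i in fold], [y[i] for i in fold]))
    return sum(aucs) / k
```

With 87 patients and k = 5, the folds contain 17 or 18 patients each, so the parts are only approximately equal.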
During the backward selection procedure, the following six mutations were selected for the calculation of the genotypic score: 10, 36, 46, 62, 84 and 90. The genotypic score calculated with these six mutations was significantly associated with virological failure (OR = 4.1 for a difference of one mutation, 95% CI [2.4; 7.0]; p < 10⁻⁴; cross-validated OR = 4.9).

Principal component analysis
The first and second principal components explained 11% and 6% of the variability of the mutations, respectively. Since the principal components accounted for only a small part of the overall variability, their interpretation was difficult. The correlations of the mutations among themselves and with the principal components allowed some clusters to be identified, for example mutations 10, 46 and 90, or mutations 32 and 47, which are already known to be associated (figure 1). Figure 2 represents the relative weight of each mutation in the dataset in the calculation of the first principal component. The relative weight of each mutation in the PCA 'score' ranged between 0% (e.g. mutation at codon 22) and 4.3% [...] contributed with the smallest relative weight (0.03%) and mutation at codon 10 with the highest (4.7%). The contribution of mutations included in the IAS list was 69% (i.e. the sum of relative weights). Thus, mutations already known to be associated with virological failure were given more weight than polymorphisms (mutations that also occur occasionally, generally without association with antiretroviral treatment).

Comparison
We compared the results of PCA and PLS with the results obtained using the classical strategy to build a genotypic score. Mutations 10, 46 and 90 were among the six mutations contributing with the highest weight to the first principal component and to the first PLS component, and were also selected for the genotypic score. Major mutations 54 and 82, which were among the mutations most strongly associated with virological failure in univariable analysis, were also among the six mutations contributing with the highest weight to the first principal component and the first PLS component. In contrast, these two mutations were eliminated from the score during the backward selection procedure (figure 4). Therefore, a first advantage of the methods based on PCA and PLS is that they helped to reduce the number of predictors without neglecting mutations that could play a significant role.

[Figure 1: Mutations on the first and second principal components.]
We compared the performance of these three methods using the area under the ROC curve. The cross-validated AUCs for the PCA, PLS and genotypic score were 0.880, 0.868 and 0.863, respectively. The model with the first principal component slightly outperformed the model with one PLS component. The AUC of the genotypic score was slightly lower than those obtained for PCA and PLS but still indicated a very good predictive performance.
To compare the methods in an illustrative way, we used a patient presenting the following 21 protease gene mutations at baseline: mutations at positions 33, 54, 82 and 90, defined as major; mutations at positions 10, 13, 20, 35, 36, 43, 53, 60, 63, 64 and 74, defined as minor; and mutations at positions 14, 15, 19, 37, 67 and 98, defined as polymorphisms. Virological failure was observed for this patient. The genotypic score was $S = I_{10} + I_{36} + I_{90} = 3$ and the probability of virological failure was 77% using this score. The main difference between the genotypic score and the principal component value or the PLS component value is that the latter methods take into consideration all 21 protease gene mutations of the patient and give them different weights. For instance, the relative weights for mutations 10, 36 and 90 were 4.4%, 2.2% and 4.1% for the PCA 'score' and 4.7%, 2.4% and 4.4% for the PLS 'score', respectively (figures 2 and 3). The predicted probability of virological failure was 94% and 96% using the PC 'score' and the PLS 'score', respectively.
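The score of this illustrative patient can be reproduced with a few lines. The codon lists are taken from the text above; the variable and function names are hypothetical.

```python
# Codons retained in the final genotypic score
SCORE_MUTATIONS = {10, 36, 46, 62, 84, 90}

# Baseline protease mutations of the illustrative patient
patient_codons = {
    33, 54, 82, 90,                              # major
    10, 13, 20, 35, 36, 43, 53, 60, 63, 64, 74,  # minor
    14, 15, 19, 37, 67, 98,                      # polymorphisms
}

def genotypic_score(codons, score_mutations=SCORE_MUTATIONS):
    """Number of score mutations carried by the patient."""
    return len(codons & score_mutations)

score = genotypic_score(patient_codons)     # 3 (codons 10, 36 and 90)
ignored = len(patient_codons) - score       # 18 mutations not used by the score
```

The 18 mutations that do not enter the score, including the major mutations 33, 54 and 82, are the ones that only the weighted PCA and PLS components take into account.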

Discussion
We investigated PCA and PLS regression to analyse associations between baseline protease mutations and virological failure. PCA and PLS are easily applicable because they are implemented in standard statistical analysis programs such as SAS (SAS Institute, Inc., Cary, NC).
[Figure 2: Relative weights of each mutation to calculate the 'score' of the first principal component.]
[Figure 3: Relative weights of each mutation to calculate the 'score' of the first PLS component.]
[Figure 4: Codons of mutations taken into consideration by the presented methods to predict virological failure (codons at which polymorphisms occur are not depicted). The IAS mutation list shows all codons which have been described to be related to resistance to any of the protease inhibitors. Black boxes: codons where major mutations occur.]
We compared these two techniques with the construction of a genotypic score because they allow each mutation to be considered with a different weight. The objective of PCA is to find a set of new "latent variables" in the form of a linear transformation of the original predictors. These latent variables are uncorrelated and account for as much of the variance of the predictor variables as possible. PCA has recently been used to determine clusters of mutations in patients who were treated with at least one PI [15] and to predict the phenotypic fold change from genotypic information [16]. PLS regression also reduces a set of predictor variables to a set of uncorrelated "latent variables", the so-called PLS components. The main difference between the two techniques is that PLS also considers the strength of the effect of each mutation on the virological response to construct the components. Hence, these two methods can help to address the issues of the high number of predictors and their different effects. They may also help in describing the relationship between mutations by detecting potential groups of mutations. PLS has been mentioned as a useful strategy for analysing genotypic mutation data [5], but neither applications nor comparisons had been published so far.
In this study population, these two methods were able to identify some mutations that were expected to contribute with higher weights to virological failure (e.g. mutations at codons 10, 82 and 90, which contribute to resistance to at least 7 of the 8 currently used PIs [5]). Furthermore, known clusters of mutations could be described. Recent papers including co-variation analysis [15,17-19] found some correlated pairs and clusters which are associated with a specific treatment. Two of them used PCA to visualise correlations of mutations. We identified some clusters of mutations, e.g. mutations at codons 10, 46 and 90 and at codons 33, 46, 54 and 82, which were also found to be correlated with each other. Mutations 32 and 47 had the highest correlation coefficient (r = 0.78) in this population and are known to be key mutations for amprenavir [20] and lopinavir [14]. The cluster of mutations at positions 10, 46 and 90 [19] and a high correlation between 32 and 47 were also reported by Wu et al and Kagan et al [19,21]. The mutations 10, 33, 46, 54, 71, 82, 84 and 90 are separated from all other mutations by the PCA and contribute with the highest weights to this component. The cluster 10, 46, 54, 71, 90 was recently described [17] as appearing under lopinavir treatment, and these mutations are also related to amprenavir resistance [22]. We found that PCA had indeed detected this latter cluster in our patient population, part of which had previously been treated with lopinavir or amprenavir (25% and 32% of the patients, respectively). Furthermore, the fact that the principal component was related to virological response highlights that PCA can detect mutation clusters on the way to lopinavir and fosamprenavir resistance, although PCA did not consider the virological response in the construction of the component. As mentioned above, PLS also searches for latent variables but takes the response variable into account.
Consequently, one might expect differences in the distribution of the weights given to the mutations. In fact, the mutations contributing with the highest weights to the PLS component were almost the same. Among the six mutations contributing with the highest weight, mutations at codons 10, 46, 54, 82 and 90 were found for both the principal component and the PLS component. The mutation at codon 33 was found on the principal component, while mutation 84 was found on the PLS component. In addition, the mutations which contributed with a higher weight to the first principal component and the first PLS component were those which showed the strongest association with virological response in univariable analysis. In summary, the weightings of the mutations were consistent across these alternative strategies. A possible explanation is that the patients had mainly been pre-treated with two PIs known to induce mutation patterns similar to those induced by fosamprenavir. In other settings, PLS might outperform PCA when a drug induces completely different mutations, since the virological response is considered during the construction of the component.
The example presented above (patient presenting 21 protease gene mutations) highlights the advantage of taking into account all mutations and giving them different weights by either PCA or PLS. For this patient, this resulted in a better prediction of virological failure. After cross-validation, however, the first principal component and the first PLS component only slightly outperformed the genotypic score in predictive ability, and the cross-validated AUCs showed no clinically relevant difference. In this study population this might partly be explained by the fact that there was a distinct subset of mutations strongly associated with virological failure. This was also substantiated by the bootstrap analyses, in which four of the six mutations remaining in the final genotypic score had been selected in over 95% of the bootstrap samples. This clear separation between mutations associated with virological failure and those which are not could have facilitated the detection of a predictive subset using the classical strategy to construct a genotypic score.
One of the reasons to apply PCA and PLS analyses to this kind of data was that these approaches do not need a preselection of variables (i.e. mutations), as the variables are summarized in components used as predictors. Hence, all mutations can be considered even when they are present in a small proportion of patients. Among other aims, we wanted to study whether considering all mutations has an advantage and whether mutations known to be associated with virological failure are given higher weights.
However, the slightly better performance of the alternative approaches may simply be due to the use of a larger amount of information. This was the minimum expected gain of these approaches compared to the usual one.
Therefore, it would be very helpful to study the performance of PCA and PLS in other, potentially larger, trials considering other antiretroviral regimens and patient populations.

Conclusion
PCA and PLS regression were helpful in describing the associations between mutations and in detecting mutation clusters. PCA and PLS showed a good performance, but their predictive ability was not clinically superior to that of the genotypic score.