Polytomous diagnosis of ovarian tumors as benign, borderline, primary invasive or metastatic: development and validation of standard and kernel-based risk prediction models
 Ben Van Calster^{1, 2},
 Lil Valentin^{3},
 Caroline Van Holsbeke^{4, 5},
 Antonia C Testa^{6},
 Tom Bourne^{4, 7},
 Sabine Van Huffel^{1} and
 Dirk Timmerman^{4}
DOI: 10.1186/1471-2288-10-96
© Van Calster et al; licensee BioMed Central Ltd. 2010
Received: 12 April 2010
Accepted: 20 October 2010
Published: 20 October 2010
Abstract
Background
Hitherto, risk prediction models for preoperative ultrasound-based diagnosis of ovarian tumors were dichotomous (benign versus malignant). We develop and validate polytomous models (models that predict more than two events) to diagnose ovarian tumors as benign, borderline, primary invasive or metastatic invasive. The main focus is on how different types of models perform and compare.
Methods
A multicenter dataset containing 1066 women was used for model development and internal validation, whilst another multicenter dataset of 1938 women was used for temporal and external validation. Models were based on standard logistic regression and on penalized kernel-based algorithms (least squares support vector machines and kernel logistic regression). We used true polytomous models as well as combinations of dichotomous models based on the 'pairwise coupling' technique to produce polytomous risk estimates. Careful variable selection was performed, based largely on cross-validated c-index estimates. Model performance was assessed with the dichotomous c-index (i.e. the area under the ROC curve) and a polytomous extension, and with calibration graphs.
Results
For all models, between 9 and 11 predictors were selected. Internal validation was successful with polytomous c-indexes between 0.64 and 0.69. For the best model dichotomous c-indexes were between 0.73 (primary invasive vs metastatic) and 0.96 (borderline vs metastatic). On temporal and external validation, overall discrimination performance was good with polytomous c-indexes between 0.57 and 0.64. However, discrimination between primary and metastatic invasive tumors decreased to near random levels. Standard logistic regression performed well in comparison with advanced algorithms, and combining dichotomous models performed well in comparison with true polytomous models. The best model was a combination of dichotomous logistic regression models. This model is available online.
Conclusions
We have developed models that successfully discriminate between benign, borderline, and invasive ovarian tumors. Methodologically, the combination of dichotomous models was an interesting approach to tackle the polytomous problem. Standard logistic regression models were not outperformed by regularized kernel-based alternatives, a finding to which the careful variable selection procedure will have contributed. The random discrimination between primary and metastatic invasive tumors on temporal/external validation demonstrated once more the necessity of validation studies.
Background
Medical diagnostic studies typically involve predicting the presence or absence of a target condition. The development of prediction models then becomes a problem of binary classification, even though often multiple differential diagnoses exist. There are statistical and other mathematical techniques to simultaneously predict three or more conditions, but these are underused [1]. Possible reasons for this may be increased complexity, lack of knowledge or lack of data, or the force of habit. If we take as an example the characterization of ovarian tumors, diagnostic models have consistently focused on predicting malignancy versus benignity [2–5]. Notwithstanding the importance of the differentiation between cancerous and noncancerous tumors, this dichotomization ignores the relevant heterogeneity in malignant tumors. It is known that different types of malignant tumors may be managed differently [6–9], thereby improving the prognosis and reducing unnecessary financial cost or hospitalization. Therefore, in this work, we focus on polytomous risk prediction models to characterize ovarian masses as benign, borderline malignant, primary invasive, or metastatic invasive.
Ovarian cancer is a common and lethal cancer. The American Cancer Society reports that ovarian cancer has the fifth highest death rate of all cancers among females in the United States, with about 15,000 deaths per year [10]. When confronted with an ovarian mass, an accurate preoperative diagnosis is important to decide on the optimal treatment. For benign tumors management may involve a simple "watch and wait" strategy or minimal access surgery. Misdiagnosis of a benign mass as malignant may lead to a woman undergoing a radical surgical procedure for no reason. Hence the consequences of a misclassification may be very serious.
Primary invasive malignancies originate in the ovary, whilst metastatic invasive malignancies originate elsewhere (e.g. breast, colon, stomach, pancreas) but have spread to the adnexal structures. Borderline tumors are of 'low malignant potential', representing less aggressive tumors that are less life-threatening. From a clinical viewpoint differentiating between these different types of tumor has significant relevance. Primary invasive tumors are typically managed using invasive techniques such as laparotomy for staging, interval-debulking surgery or cytoreduction [11]. An intervention for a borderline tumor may be relatively conservative in young women where preservation of fertility is a major issue [9]. For metastatic disease the management option may be influenced by the primary site of malignancy. The differentiation between borderline and invasive tumors is one of the most pertinent clinical issues beyond the differentiation between benignity and malignancy.
In the medical literature, risk prediction models are often based on a logistic regression analysis. In this study, we applied several alternatives to the standard multinomial logistic regression model (MLR) for two reasons. Firstly, in the MLR model the selected variables are used to distinguish between all events whereas variables may be important only for a subset of them [12]. Therefore, we also performed polytomous classification by combining dichotomous logistic regression models to investigate whether this resulted in better performance. Secondly, algorithms more flexible than logistic regression are available, many of them having their origin in the machine learning area. We applied kernel-based polytomous methods based on least squares support vector machines [13] and kernel logistic regression [14]. Another important advantage of these models is that they include a regularization (or penalization) parameter in their standard model formulation to avoid overfitting, a procedure comparable to shrinkage methods for logistic regression models [15]. Because borderline and metastatic tumors had low prevalence such that there was a risk of overfitting, we aimed to compare the performance of these approaches with those based on standard logistic regression analysis (without shrinkage). All algorithms implemented in this study result in probabilities for each of the four events considered.
As explained in the next section, the prediction models were developed and tested (i.e. internally validated) on data from a large international multicenter study. Thorough validation of any developed prediction model is essential to assess the model's robustness and generalizability [16]. Therefore, a temporal and external validation was performed on a large dataset that was collected after model development.
Methods
Design and setting
This is an international multicenter cross-sectional study, involving women presenting with an adnexal mass to experienced ultrasound examiners in oncological referral centers, referral centers for ultrasonography, or regional hospitals.
Data
The International Ovarian Tumor Analysis (IOTA) group [17] collected data from 1066 nonpregnant women with at least one persistent adnexal mass (including paraovarian and tubal masses). All patients underwent an ultrasound examination by the principal investigator, a gynecologist or radiologist specialized in gynecological ultrasound. Nine clinical centers participated from Italy (4), France (2), Belgium (1), Sweden (1), and the United Kingdom (1). Only patients who were operated on within 120 days after the ultrasound examination were included. The decision whether to operate or not was made by local clinicians based on the clinical picture and local management protocols. More information on inclusion and exclusion criteria is presented in [18]. Data collection was standardized to enhance reproducibility of the measurements [17]. The primary aim of the IOTA study was the development of dichotomous prediction models contrasting benign with malignant tumors [5, 18, 19].
Data included personal and family history of ovarian and breast cancer, demographic data, grey scale and color Doppler results from ultrasound examination (i.e. more than 40 morphologic and blood flow characteristics describing the tumor), and the presence or absence of pain during examination. After checking for high intervariable dependencies, 36 variables remained. The outcome of interest was the histological diagnosis of the mass as benign, borderline, primary invasive, or metastatic. Eight hundred tumors were benign (75%), 55 were borderline (5%), 169 were primary invasive (16%), and 42 were metastatic (4%). The dataset was split up in a training set containing 754 patients (71%) for model development, and a test set containing the remaining 312 patients for the internal validation of the models. The split was stratified for center and outcome. We did not use bootstrapping for the internal validation because we did not use fully automated variable selection.
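The stratified split described above can be illustrated with a short sketch. It is not the authors' procedure: the function name `stratified_split`, the record layout, and the use of a seeded shuffle are assumptions made for illustration. Cases are grouped by a stratum key (here center and outcome together), and roughly the same fraction of each stratum is sent to the training set.

```python
import random
from collections import defaultdict


def stratified_split(records, key, train_fraction, seed=0):
    """Split records into (train, test), keeping roughly train_fraction
    of every stratum in the training set.

    `key` maps a record to its stratum, e.g. (center, outcome).
    Hypothetical helper, not taken from the paper.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)

    train, test = [], []
    # Sort strata so the result does not depend on insertion order.
    for _, members in sorted(strata.items()):
        rng.shuffle(members)
        n_train = round(train_fraction * len(members))
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return train, test


# Illustrative (made-up) records: two centers, two outcomes, ten cases each.
records = [
    {"center": c, "outcome": o, "id": i}
    for c in ("A", "B")
    for o in ("benign", "malignant")
    for i in range(10)
]
train, test = stratified_split(
    records, key=lambda r: (r["center"], r["outcome"]), train_fraction=0.7
)
```

With very small strata (such as the metastatic tumors from a single center) rounding decides where individual cases go, so the realized training fraction can differ slightly from the requested one.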
Because the number of borderline and metastatic tumors was limited, we selected 16 variables that were thought to be of potential relevance based on the literature and on subject matter knowledge from clinical experts. This improved the situation, yet the number of borderline and metastatic invasive tumors relative to the number of candidate variables remained small.
After model development, the IOTA group collected a new set of data from 19 centers [20]. Seven centers also contributed to the initial dataset, such that the 941 patients contributed by these centers were used for a temporal validation of the models' performance (654 benign, 69 borderline, 186 primary invasive, 32 metastatic). The 12 centers that did not contribute to the initial dataset were located in Italy (6), Belgium (1), Sweden (1), Poland (1), Czech Republic (1), China (1) and Canada (1). The 997 patients contributed by these 12 centers were used for the external validation of the models (742 benign, 42 borderline, 187 primary invasive, 26 metastatic).
The research protocols for the collection of the development and validation datasets were ratified by the local ethics committee at each recruitment center.
The kernel-based algorithms
Least squares support vector machines
Standard support vector machine (SVM) classifiers [21] are nonprobabilistic dichotomous models. First, the predictor space (i.e. the multidimensional scatter plot of the predictors) is mapped into a high dimensional 'feature space'. The aim is to find a feature space where an acceptable linear model can be developed in order to deal with possible nonlinearities in the original predictor space. The linear separation between the events in the feature space tries to maximize the margin between the two groups (thereby imposing regularization) while at the same time controlling the number of misclassifications. A good balance between both is desired: too much focus on margin maximization leads to an overly simplistic model, too much focus on misclassification minimization leads to an overfitted model. A regularization (penalization) parameter is included to control the trade-off. Through the use of a positive definite kernel function it is not necessary to directly work in the high dimensional feature space. The choice of kernel affects how the linear separation in the feature space relates to the predictor space. The linear kernel $x^{T}z$ (with $x$ and $z$ two vectors of predictor values representing two patients) results in linear classifiers in the predictor space whereas other kernels such as the popular Gaussian kernel, $\exp\left(-{\Vert x - z\Vert}_{2}^{2}/{\sigma}^{2}\right)$ with the kernel parameter σ that has to be tuned, result in nonlinear classifiers. Least squares SVMs (LSSVMs) are a variant of SVMs that work much faster due to small changes in the cost function [13]. However, performance of LSSVMs is similar to that of SVMs [22].
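To make the two kernels concrete, the following sketch evaluates the linear kernel and the Gaussian kernel for a pair of patients. The function names and the example predictor values are illustrative assumptions; the Gaussian kernel is written as the exponential of minus the squared Euclidean distance divided by σ squared, matching the formula in the text.

```python
import math


def linear_kernel(x, z):
    """Linear kernel: the inner product x^T z of two predictor vectors."""
    return sum(xi * zi for xi, zi in zip(x, z))


def gaussian_kernel(x, z, sigma):
    """Gaussian kernel exp(-||x - z||^2 / sigma^2); sigma must be tuned."""
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-squared_distance / sigma ** 2)


def kernel_matrix(rows, kernel):
    """Symmetric matrix of kernel values between all pairs of patients."""
    return [[kernel(a, b) for b in rows] for a in rows]


# Two hypothetical patients described by two standardized predictors.
patient_a = [0.5, -1.0]
patient_b = [1.5, 0.0]
k_lin = linear_kernel(patient_a, patient_b)
k_rbf = gaussian_kernel(patient_a, patient_b, sigma=2.0)
```

The Gaussian kernel equals 1 for identical patients and decays towards 0 as they move apart, with σ controlling how quickly; this is why σ has to be tuned alongside the regularization parameter.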
We overcame the nonprobabilistic nature of standard (LS)SVMs through the use of a Bayesian framework [23]. Using the distribution of the outcome in the development data as prior event probabilities, dichotomous event probabilities were obtained based on the LSSVM output. Hyperparameters such as the regularization and kernel parameters are automatically tuned by the Bayesian procedure.
Kernel logistic regression (KLR)
In essence, KLR only differs from SVMs with respect to the adopted loss function. However, KLR directly results in probabilistic output and is easily extended to a multinomial version (MKLR). We used an MKLR algorithm that is based on LSSVMs [14]. The basis of the algorithm is a regularized MLR model that is solved using a penalized negative log likelihood function using iteratively regularized reweighted least squares. By mapping the predictor space into a high dimensional feature space using a positive definite kernel and applying in each iteration a model with the structure of an LSSVM, a kernel version of MLR is obtained. The hyperparameters were tuned using five-fold cross-validation (CV).
True polytomous models versus the combination of dichotomous models
Variable selection
For logistic regression models, we were careful regarding variable selection due to the small number of borderline and metastatic tumors. We did not use automated procedures based on p-values to directly select a final set of predictors because we prefer variable selection based on criteria similar to the model evaluation criteria, and because automated selection is highly unstable and bound to overfit when there are small groups and a large number of candidate variables [25]. The procedure was as follows. Using stepwise, backward, and manual selection procedures, several possible variable sets were generated. The final set of predictors was selected using three criteria. Two criteria measure information content: the Akaike and the Bayesian Information Criterion (AIC, BIC) [26]. These criteria penalize a model's log likelihood for the number of predictors. AIC has the tendency to be liberal whereas BIC tends to be conservative. Therefore, we prefer models with fairly low values for both criteria. The third and most important criterion was the discrimination performance assessed by the average dichotomous c-index for each event after 20 independent runs of stratified five-fold CV. The dichotomous c-index equals the area under the ROC curve.
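Because the dichotomous c-index equals the area under the ROC curve, it can be computed directly as the proportion of (event, non-event) pairs in which the event case receives the higher predicted risk. The sketch below does this by brute force; the function name `c_index` and the convention of counting tied pairs as one half are assumptions for illustration, and the example risks are made up.

```python
def c_index(risks_event, risks_nonevent):
    """Dichotomous c-index (area under the ROC curve).

    risks_event:    predicted risks for cases that had the event
    risks_nonevent: predicted risks for cases that did not
    Ties are counted as half-concordant (a common convention, assumed here).
    """
    concordant = 0.0
    for r_event in risks_event:
        for r_nonevent in risks_nonevent:
            if r_event > r_nonevent:
                concordant += 1.0
            elif r_event == r_nonevent:
                concordant += 0.5
    return concordant / (len(risks_event) * len(risks_nonevent))


# Made-up predicted risks for three event cases and two non-event cases.
auc = c_index([0.9, 0.8, 0.4], [0.5, 0.3])
```

In a repeated cross-validation scheme such as the one described above, a function like this would be applied to the held-out predictions of each fold and the results averaged over folds and runs.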
For the kernel-based methods, a forward selection algorithm based on rank-one updates of the kernel matrix in the context of standard LSSVMs was used [27]. It is computationally intensive to select variables by repeatedly adding the variable that gives the best performance gain based on leave-one-out cross-validation (LOOCV). When using LSSVMs, this strategy can be sped up significantly for two technical reasons. Firstly, the LSSVM model structure allows for the fast computation of model performance measures based on LOOCV. Secondly, the LSSVM model can be updated using rank-one updates in the kernel matrix such that adding a variable does not require the recomputation of the model. This new method, abbreviated as R1U, is very fast, but is currently only available for linear-kernel LSSVMs. Using the training set, we observed that a linear kernel LSSVM with R1U-selected variables performed clearly better than linear or nonlinear LSSVMs using variables that were selected using an advanced nonlinear procedure [28]. In each step, we used the c-index estimated by LOOCV to determine which variable to add. We retuned the regularization parameter in each step using a grid search to find the value with maximal c-index.
Overview of methods used to diagnose ovarian tumors
The methods that were used in this study can be divided into two groups based on the variable selection method. The first group consists of two logistic regression-based methods using logistic regression-based variable selection: MLR, and pairwise coupling of 1-versus-1 logistic regression models (LRPC). The second group consists of three kernel-based methods using variables from R1U selection: MKLR, and pairwise coupling of 1-versus-1 Bayesian LSSVMs (LSSVMPC) or KLR models (KLRPC). We add one logistic regression-based method to the second group for comparison: pairwise coupling of 1-versus-1 logistic regression models (LRPC2).
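The idea of turning six 1-versus-1 probabilities into four polytomous probabilities can be sketched as follows. This is not necessarily the coupling algorithm used by the authors: the sketch swaps in a simple closed-form rule, p_i proportional to 1 / (sum over j of 1/r_ij minus (k - 2)), which recovers the true probabilities exactly when the pairwise estimates r_ij = p_i / (p_i + p_j) are mutually consistent. The function name and the example values are assumptions.

```python
def couple_pairwise(r):
    """Combine 1-versus-1 probabilities into polytomous probabilities.

    r[i][j] is the estimated probability of event i given that the case
    has event i or event j (so r[j][i] = 1 - r[i][j]); diagonal entries
    are ignored. All off-diagonal entries must be strictly positive.
    Uses a closed-form rule, assumed here purely for illustration.
    """
    k = len(r)
    raw = []
    for i in range(k):
        inverse_sum = sum(1.0 / r[i][j] for j in range(k) if j != i)
        raw.append(1.0 / (inverse_sum - (k - 2)))
    total = sum(raw)
    # Normalize so the probabilities sum to one even if the pairwise
    # estimates are not perfectly consistent with each other.
    return [p / total for p in raw]


# Made-up "true" probabilities: benign, borderline, primary invasive, metastatic.
true_p = [0.75, 0.05, 0.16, 0.04]
pairwise = [
    [0.5 if i == j else true_p[i] / (true_p[i] + true_p[j]) for j in range(4)]
    for i in range(4)
]
coupled = couple_pairwise(pairwise)  # close to true_p for consistent inputs
```

In practice the six pairwise models are fitted separately, so their outputs are not perfectly consistent, which is why the final normalization step is needed.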
Evaluation of model performance
In the polytomous c-index, $C(n_{1}, n_{2}, n_{3}, n_{4})$ denotes, for a set of four cases $n_{1}, \ldots, n_{4}$ with one case from each event, the number of events for which the event's predicted probability was largest for the case with that event. Thus C ranged between 0 and 4. The average C over all sets was divided by 4 to obtain a value between 0 and 1 (with 0.25 for random discrimination). This polytomous index can be interpreted as the probability to correctly identify a case from a randomly chosen event within a set of 4 cases.
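The definition above can be turned into a brute-force computation: enumerate every set containing one case per event, count the events whose own case has the highest predicted probability for that event, and average. The sketch below follows that reading of the text; the function name, the data layout, and the choice to split credit equally among tied cases are assumptions.

```python
from itertools import product


def polytomous_c_index(cases_by_event):
    """Polytomous c-index by exhaustive enumeration.

    cases_by_event[e] is the list of cases whose true event is e; each
    case is a sequence of k predicted probabilities (one per event).
    Ties share the credit equally (an assumption made in this sketch).
    """
    k = len(cases_by_event)
    total = 0.0
    n_sets = 0
    for case_set in product(*cases_by_event):  # one case from each event
        c = 0.0
        for e in range(k):
            best = max(case[e] for case in case_set)
            if case_set[e][e] == best:
                n_tied = sum(1 for case in case_set if case[e] == best)
                c += 1.0 / n_tied
        total += c
        n_sets += 1
    return total / (k * n_sets)


# Made-up example with one case per event and perfect predictions.
perfect = [
    [[1.0, 0.0, 0.0, 0.0]],
    [[0.0, 1.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0, 0.0]],
    [[0.0, 0.0, 0.0, 1.0]],
]
index = polytomous_c_index(perfect)
```

The number of sets is the product of the four group sizes, so exhaustive enumeration is only practical for small datasets; for larger ones a random sample of sets would be a natural alternative.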
Calibration was assessed using calibration graphs that related predicted probabilities to actual probabilities using loess smoothing [31, 32]. This resulted in four graphs, one per event.
Results
Variable selection results
Table 1: Overview of selected variables
Dichotomous 1-versus-1 models  
Variable  MLR  Ben vs Bord  Ben vs PrInv  Ben vs Meta  Bord vs PrInv  Bord vs Meta  PrInv vs Meta 
Logistic regression-based variable selection  
Ascites  ×  ×  ×  ×  ×  ×  
Maximal diameter of solid part  ×  ×  ×  ×  ×  
Age  ×  ×  ×  
Entirely solid tumor  ×  ×  ×  ×  
Irregular internal cyst walls  ×  ×  ×  
Personal history of ovarian cancer  ×  ×  
Bilateral tumors  ×  ×  
Maximal diameter of lesion  ×  ×  
Papillary structures with blood flow  ×  ×  ×  
Unilocular tumor  ×  
R1U variable selection*  
Ascites  ×  ×  ×  ×  ×  
Maximal diameter of solid part  ×  ×  ×  ×  
Age  ×  ×  ×  
Entirely solid tumor  ×  ×  ×  
Irregular internal cyst walls  ×  ×  ×  
Personal history of ovarian cancer  ×  ×  
Bilateral tumors  ×  
Maximal diameter of lesion  ×  
Papillary structures with blood flow  ×  
Number of papillations  ×  
Acoustic shadows  × 
Table 2: Descriptive statistics of selected variables for the training data set
Variables  Benign (N = 563)  Borderline (N = 40)  Primary invasive (N = 121)  Metastatic (N = 30) 
Continuous  
Age, years (median)  42  52.5  58  59 
Maximum diameter of mass, mm (median)  63  108  98  73 
Maximum diameter of solid part, mm (median)  0  22  51  54 
Ordinal  
Number of papillations (mean)#  0.35  1.70  1.43  0.93 
Binary  
Ascites (%)  3.2  12.5  50.4  40.0 
Entirely solid tumor (%)  6.6  7.5  32.2  56.7 
Irregular internal cyst walls (%)  33.6  67.5  88.4  83.3 
Personal history of ovarian cancer (%)  0.9  5.0  0.8  10.0 
Bilateral tumors (%)  17.6  12.5  41.3  33.3 
Papillary structures with blood flow (%)  6.8  47.5  43.0  23.3 
Acoustic shadows (%)  13.0  2.5  0.0  3.3 
Unilocular tumor without solid component (%)  40.3  2.5  0.0  0.0 
Internal validation (test set results)
Table 3: Validation results using a polytomous c-index
Model (# predictors)  Internal validation (n = 312): polytomous c-index (95% CI)  Internal validation: difference with best model (95% CI)  Temporal validation (n = 941): polytomous c-index (95% CI)  Temporal validation: difference with best model (95% CI)  External validation (n = 997): polytomous c-index (95% CI)  External validation: difference with best model (95% CI) 
Group 1: logistic regression models  
LRPC (10)  .67 (.58-.75)  -  .60 (.56-.65)  -  .60 (.55-.65)  - 
MLR (9)  .64 (.56-.73)  .025 (.004; .053)  .58 (.54-.62)  .020 (.000; .040)  .58 (.53-.62)  .028 (.000; .058) 
Group 2: kernel-based and logistic regression models (based on R1U variable selection)*  
LRPC2 (11)  .69 (.60-.77)  -  .59 (.55-.64)  -  .64 (.59-.68)  - 
KLRPC (11)  .67 (.59-.75)  .016 (.013; .051)  .58 (.54-.63)  .012 (.006; .027)  .61 (.57-.66)  .026 (.004; .049) 
LSSVMPC (11)  .66 (.58-.75)  .025 (.007; .060)  .58 (.54-.62)  .015 (.005; .035)  .61 (.57-.65)  .028 (.005; .052) 
MKLR (11)  .64 (.56-.73)  .046 (.003; .086)  .57 (.52-.62)  .027 (.000; .056)  .58 (.53-.62)  .060 (.033; .092) 
Table 4: Validation results using pairwise c-indexes
Model  Ben vs Bord  Ben vs PrInv  Ben vs Meta  Bord vs PrInv  Bord vs Meta  PrInv vs Meta 
Group 1: logistic regression models  
Internal: LRPC  .82  .95  .93  .88  .96  .73 
Temporal: LRPC  .88  .95  .93  .81  .83  .51 
External: LRPC  .88  .96  .93  .81  .89  .56 
Group 2: kernel-based and logistic regression models (based on R1U variable selection)*  
Internal: LRPC2  .86  .94  .92  .88  .96  .73 
Temporal: LRPC2  .90  .94  .92  .81  .83  .51 
External: LRPC2  .91  .95  .93  .81  .89  .56 
Temporal and external validation
When aiming to implement a model into clinical practice, good temporal and external validation results are essential. The results that we obtained are presented in Table 3, and show a performance decrease relative to the internal validation. On temporal validation the polytomous c-index varied between 0.57 for MKLR and 0.60 for LRPC, on external validation between 0.58 for MLR and MKLR and 0.64 for LRPC2. Similar to the internal validation, LRPC2 and LRPC were the best models. The pairwise c-indexes for these models (Table 4) show that pairwise discrimination among the three malignant events clearly dropped on temporal and external validation. Most striking is the observation that discrimination between primary invasive and metastatic invasive tumors, which was acceptable on internal validation (c 0.73), was near random on temporal (c 0.51) and external (c 0.56) validation. Discrimination between these two tumor types and borderline tumors was still good with c-indexes above 0.8. Benign tumors can be very well separated from any malignant tumor type, even from borderline tumors (c-indexes 0.9 or higher).
The temporal and external validation showed that pairwise coupling of dichotomous models resulted in models with superior discrimination compared to true polytomous models. Also, logistic regression-based models produced better results than KLR- or LSSVM-based models. Taken together, LRPC2 produced the best results. Calibration was clearly poorer on temporal and external validation than on internal validation (cf. infra).
Further results for LRPC2
We addressed two important clinical issues. The first issue is the role of tumor stage for the discrimination between different tumor types [33]. Primary invasive tumors can be well separated from benign and borderline tumors, but it is of interest to look at primary invasive stage I tumors and primary invasive stage II-IV tumors separately. We focused on the aggregated temporal and external validation data, and computed the pairwise c-indexes for the two primary invasive subgroups when compared with benign, borderline, and metastatic tumors. In the development data 32% of the primary invasive tumors were stage I, 7% stage II, 50% stage III, and 11% stage IV. In the temporal/external validation data, primary invasive tumors were 25% stage I, 9% stage II, 57% stage III, and 9% stage IV. Discrimination from benign tumors was very high for both primary invasive subgroups (c 0.92 for stage I, c 0.95 for stage II-IV). Discrimination from borderline tumors was clearly poorer for primary invasive stage I tumors (c 0.70) compared to primary invasive stage II-IV tumors (c 0.85). Discrimination from metastatic tumors was poor irrespective of stage (c 0.56 for stage I, c 0.53 for stage II-IV).
The second important clinical issue is ascites. The presence of ascites makes a diagnosis of primary invasive cancer highly likely. When the temporal and external validation data were aggregated, 71% of the patients with ascites had a primary invasive tumor, 13% had a metastatic tumor, and 16% had a benign or borderline tumor. However, it is important that prediction models work well also in patients without ascites. In this subgroup of patients, when combining the temporal and external validation datasets, pairwise c-indexes were 0.91 for discriminating benign from borderline tumors, 0.93 for discriminating benign from primary invasive tumors, 0.91 for discriminating benign from metastatic tumors, 0.77 for discriminating borderline from primary invasive tumors, 0.82 for discriminating borderline from metastatic tumors, and 0.57 for discriminating primary invasive from metastatic tumors. The polytomous c-index was 0.59. This means that LRPC2 also performed well in patients without ascites.
Finally, LRPC2 can be directly compared with existing dichotomous models using the probability of a benign tumor to discriminate between benign and malignant tumors. LRPC2 obtained c-indexes of 0.939 and 0.954 on temporal and external validation, results that are similar to the main dichotomous model from the IOTA group with c-indexes of 0.945 and 0.956 on the same datasets [20].
On http://homes.esat.kuleuven.be/~biomed/LRPC2/lrpc2.htm we have made available an Excel sheet that can be used to implement LRPC2.
Comparison of LRPC2 with a model based on automatic stepwise variable selection
Variable selection for our models was partly based on automatic procedures and partly on manual intervention. We considered this appropriate, in particular for a model such as LRPC2 where six dichotomous models with separate variable selection are combined. Therefore, it was interesting to compare LRPC and LRPC2 with a similar model based on fully automatic variable selection. We used standard forward stepwise selection for each dichotomous model with p-value criteria for variable entry and removal set at 0.05. The resulting stepLRPC model used 15 variables in total compared to 10 for LRPC and 11 for LRPC2. The polytomous c-indexes of stepLRPC were 0.64 on internal validation (versus 0.67 and 0.69 for LRPC and LRPC2), 0.60 on temporal validation (versus 0.60 and 0.59), and 0.58 on external validation (versus 0.60 and 0.64).
Discussion
Methods for polytomous classification are underused in medical applications. In this paper, we used various methods for the probabilistic diagnosis of ovarian tumors as benign, borderline, primary invasive, or metastatic invasive. To the best of our knowledge, this is the first time that prediction models for ovarian tumor diagnosis exceeded the basic differentiation between benign and malignant tumors. Methods included true polytomous (all-at-once) algorithms and algorithms that combined dichotomous (1-versus-1) models using the technique of pairwise coupling. The basic classification algorithms were based on logistic regression, LSSVMs, and kernel logistic regression. All models were internally, temporally, and externally validated. Despite the low number of borderline and metastatic tumors, interesting and consistent results were obtained.
The results showed very good separation of benign, borderline, and invasive tumors. This is an important result because the ability to differentiate between borderline and invasive tumors gives additional, highly useful information for making sensible treatment decisions. There is a clear difference in aggressiveness between borderline and invasive tumors, and they are treated differently. The use of a model such as LRPC2 in clinical practice would therefore be interesting. On http://homes.esat.kuleuven.be/~biomed/LRPC2/lrpc2.htm we have made available an Excel sheet that can be used to implement LRPC2. Unfortunately, the models were unable to reliably discriminate between primary and metastatic invasive tumors.
In the present study, the combination of 1-versus-1 models with pairwise coupling was an interesting alternative to true polytomous algorithms. The former approach allowed for more fine-tuned variable selection, and resulted in higher validation performance (as determined by the polytomous c-index) for both logistic regression-based and kernel logistic regression-based models. An advantage of 1-versus-1 models is their increased flexibility by addressing subproblems that are sometimes of particular interest to the clinician, for example when the clinician hesitates between two diagnoses only. The overall best model combined 1-versus-1 logistic regression models using pairwise coupling (LRPC2). For the discrimination between benign and malignant tumors (cf. the c-index for benign versus other tumors in Tables 3 and 4), this model performed similarly to the dichotomous models developed and validated on the same data [18, 20]. LRPC2 used 11 predictors, but not subjective variables such as the experience of abdominal or pelvic pain during the ultrasound examination or the color score of intratumoral blood flow (a subjective score between 1 and 4). These variables are used in some of the existing dichotomous models [18, 20]. None of the polytomous models used the CA125 tumor marker, because this marker was deliberately not considered as a predictor. The most important reasons are that we focused on ultrasound information, and that its use would preclude the immediate use of a model as the results of the blood test have to be awaited. In addition, the inclusion of CA125 as a variable in dichotomous models did not result in better performance [34].
A disadvantage of combining 1-versus-1 models is that the number of 1-versus-1 problems grows quadratically with the total number of events (k(k-1)/2 models for k events, i.e. six models for the four events considered here). Another decomposition of a polytomous problem that consists of a tree of nested (or sequential) dichotomous models does not suffer from this limitation [35, 36]. A sensible tree in our study would be to make a model to discriminate between benign and malignant tumors, followed by a model to discriminate between borderline and invasive tumors, and finally a model to contrast primary with metastatic invasive tumors. Polytomous probabilities can be obtained in a straightforward manner. When we applied this approach, it resulted in stronger performance degeneration on temporal and external validation than the true polytomous models or the pairwise coupling approach.
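The way polytomous probabilities follow from such a tree can be shown with a minimal sketch: each nested model supplies a conditional probability, and the four event probabilities are products along the branches. The function and argument names are hypothetical, and the input values are made up.

```python
def nested_to_polytomous(p_malignant,
                         p_invasive_given_malignant,
                         p_metastatic_given_invasive):
    """Chain three nested dichotomous models into four event probabilities.

    p_malignant:                  model 1, benign versus malignant
    p_invasive_given_malignant:   model 2, borderline versus invasive
    p_metastatic_given_invasive:  model 3, primary versus metastatic
    (Hypothetical names; the tree follows the one described in the text.)
    """
    p_invasive = p_malignant * p_invasive_given_malignant
    return {
        "benign": 1.0 - p_malignant,
        "borderline": p_malignant * (1.0 - p_invasive_given_malignant),
        "primary invasive": p_invasive * (1.0 - p_metastatic_given_invasive),
        "metastatic": p_invasive * p_metastatic_given_invasive,
    }


# Made-up outputs of the three nested models for one patient.
probabilities = nested_to_polytomous(0.4, 0.75, 0.25)
```

By construction the four probabilities sum to one, and only three models are needed for four events, compared with six for the 1-versus-1 decomposition.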
Interestingly, we found that approaches based on logistic regression performed very well when compared to the regularized kernelbased alternatives despite the fact that two events (borderline, metastatic) had very few cases. All models suffered from performance decrease on temporal and external validation, but the decrease was not more severe for the unregularized logistic regressionbased models. This might be explained by the careful variable selection strategies for which crossvalidated cindex estimates were the most important criterion. If we applied pairwise coupling of 1versus1 logistic regression models based on standard stepwise variable selection with a pvalue of 0.05 as the selection and removal threshold, we ended up with a total of 15 selected variables. The polytomous and pairwise cindexes of stepLRPC were similar to or worse than those of LRPC2 and showed stronger decrease on temporal and external validation. That being said, the use of logistic regression models in situations like the one in this study asks for a regularized fitting approach such as shrinkage, penalized maximum likelihood estimation, or the LASSO (least absolute selection and shrinkage operator) [37].
This study further demonstrated the necessity of a thorough validation of prediction models, in particular in situations with small sample sizes for some events relative to the number of variables considered as possible predictors. We observed in our study that, opposite to what the internal validation results suggested, the differentiation between primary and metastatic invasive tumors was near random on temporal and external validation. Unfortunately, many models are developed yet only a limited portion of these undergo validation in various clinical settings. This hampers the successful implementation of such models into clinical practice [38].
Even though the use of LRPC2 in clinical practice would provide useful information, future work will focus on the development of a more robust model by combining all the data used in this study to update LRPC2. Ample attention will be devoted to the selection of a limited set of predictors to boost the user-friendliness of the model for busy clinicians.
Conclusions
This study shows that polytomous discrimination of ovarian tumors can be obtained while maintaining similar performance for the traditional dichotomous diagnosis (benign vs malignant) and without the need for more predictors. Such models can provide highly useful information for clinicians who have to make sensible treatment decisions. For polytomous prediction, the combination of dichotomous 1-versus-1 models is an interesting alternative to true polytomous (all-at-once) models. Despite two events (borderline, metastatic) having relatively few cases, standard logistic regression approaches performed similarly to or better than regularized kernel-based alternatives, a finding to which the careful variable selection based on cross-validated c-index estimates will have contributed. The importance of model validation studies is clearly demonstrated, as the lack of discrimination between primary invasive and metastatic invasive tumors became clear only on temporal and external validation. Without thorough evaluation of diagnostic performance, it is unsafe to implement prediction models in clinical practice for decision support.
Abbreviations
AIC: Akaike information criterion
BIC: Schwarz' Bayesian information criterion
CV: Cross-validation
IOTA: International Ovarian Tumor Analysis
KLR: Kernel logistic regression
LOO: Leave-one-out
LR: Logistic regression
LSSVM: Least squares support vector machines
MKLR: Multiclass (polytomous) kernel logistic regression
MLR: Multinomial logistic regression
PC: Pairwise coupling
RBF: Radial basis function
ROC: Receiver operating characteristic
R1U: Variable selection using rank-one updates of the LSSVM kernel matrix
SVM: Support vector machine
Declarations
Acknowledgements
Ben Van Calster is a postdoctoral researcher funded by the Research Foundation - Flanders (FWO). We thank Astraia GmbH (Munich, Germany) for providing secure electronic data collection software. We thank the following on-site investigators for data collection: Jean-Pierre Bernard, Nicoletta Colombo, Artur Czekierdowski, Elisabeth Epstein, Enrico Ferrazzi, Daniela Fischerova, Robert Fruscio, Stefano Greggi, Stefano Guerriero, Jingzhang, Davor Jurkovic, Fabrice Lécuru, Francesco Leone, Andrea Alberto Lissoni, Angelo Maggioni, Salvatore Mancuso, Jennifer McDonald, Henry Muggah, Willem Ombelet, Dario Paladini, Alberto Rossi, Luca Savelli, Mario Sideri, and Diego Trio. Research supported by Research Council KUL: GOA-AMBioRICS, CoE EF/05/006 Optimization in Engineering (OPTEC); Flemish Government: FWO: G.0407.02 (support vector machines), G.0302.07 (SVM), G.0341.07 (Data fusion); IWT-Vlaanderen: TBM-IOTA3; Belgian Federal Science Policy Office: IUAP P6/04 (DYSCO, 'Dynamical systems, control and optimization', 2007-2011); European Union: BIOPATTERN (FP6-2002-IST 508803), ETUMOUR (FP6-2002-LIFESCIHEALTH 503094), Healthagents (IST-2004-27214); The Swedish Medical Research Council: grants nos. K2001-72X-11605-06A, K2002-72X-11605-07B, K2004-73X-11605-09A and K2006-73X-11605-11-3; funds administered by Malmö University Hospital; and two Swedish governmental grants (ALF-medel and Landstingsfinansierad Regional Forskning).
References
1. Biesheuvel CJ, Vergouwe Y, Steyerberg EW, Grobbee DE, Moons KGM: Polytomous logistic regression analysis could be applied more often in diagnostic research. J Clin Epid. 2008, 61: 125-134. 10.1016/j.jclinepi.2007.03.002.
2. Mol BWJ, Boll D, De Kanter M, Heintz APM, Sijmons EA, Oei SG, Bal H, Brölmann HAM: Distinguishing the benign and malignant adnexal mass: an external validation of prognostic models. Gynecol Oncol. 2001, 80: 162-167. 10.1006/gyno.2000.6052.
3. Geomini P, Kruitwagen R, Bremer GL, Cnossen J, Mol BWJ: The accuracy of risk scores in predicting ovarian malignancy. Obstet Gynecol. 2009, 113: 384-394.
4. Van Holsbeke C, Van Calster B, Valentin L, Testa AC, Ferrazzi E, Dimou I, Lu C, Moerman Ph, Van Huffel S, Vergote I, Timmerman D: External validation of mathematical models to distinguish between benign and malignant adnexal tumors: a multicenter study by the International Ovarian Tumor Analysis group. Clin Cancer Res. 2007, 13: 4440-4447. 10.1158/1078-0432.CCR-06-2958.
5. Van Calster B, Timmerman D, Lu C, Suykens JAK, Valentin L, Van Holsbeke C, Amant F, Vergote I, Van Huffel S: Preoperative diagnosis of ovarian tumors using Bayesian kernel-based methods. Ultrasound Obstet Gynecol. 2007, 29: 496-504. 10.1002/uog.3996.
6. Vergote I, De Brabanter J, Fyles A, Bertelsen K, Einhorn N, Sevelda P, Gore ME, Kærn J, Verrelst H, Sjövall K, Timmerman D, Vandewalle J, Van Gramberen M, Tropé CG: Prognostic importance of degree of differentiation and cyst rupture in stage I invasive epithelial ovarian carcinoma. Lancet. 2001, 357: 176-182. 10.1016/S0140-6736(00)03590-X.
7. Mizuno M, Kikkawa F, Shibata K, Kajiyama H, Suzuki T, Ino K, Kawai M, Mizutani S: Long-term prognosis of stage I ovarian carcinoma. Prognostic importance of intraoperative rupture. Oncology. 2003, 65: 29-36. 10.1159/000071202.
8. Panici PB, Muzii L, Palaia I, Manci N, Bellati F, Plotti F, Zullo M, Angioli R: Minilaparotomy versus laparoscopy in the treatment of benign adnexal cysts: a randomized clinical study. Eur J Obstet Gynecol Reprod Biol. 2007, 133: 218-222. 10.1016/j.ejogrb.2006.05.019.
9. Tinelli R, Tinelli A, Tinelli FG, Cicinelli E, Malvasi A: Conservative surgery for borderline ovarian tumors: a review. Gynecol Oncol. 2006, 100: 185-191. 10.1016/j.ygyno.2005.09.021.
10. Jemal A, Siegel R, Ward E, Hao Y, Xu J, Murray T, Thun MJ: Cancer statistics, 2009. CA Cancer J Clin. 2009, 59: 225-249. 10.3322/caac.20006.
11. Hennessy BT, Coleman RL, Markman M: Ovarian cancer. Lancet. 2009, 374: 1371-1382. 10.1016/S0140-6736(09)61338-6.
12. Bull SB, Greenwood CMT, Donner A: Efficiency of reduced logistic regression models. Can J Stat. 1994, 22: 319-334. 10.2307/3315595.
13. Suykens JAK, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J: Least squares support vector machines. 2002, Singapore, World Scientific.
14. Karsmakers P, Pelckmans K, Suykens JAK: Multiclass kernel logistic regression: a fixed size implementation. Proceedings of the 20th International Joint Conference on Neural Networks: 12-17 August; Orlando. Edited by: Si J, Sun R, Brown D, King I, Kasabov N. 2007, Los Alamitos, IEEE Press, 1756-1761.
15. Steyerberg EW, Eijkemans MJC, Harrell FE, Habbema JDF: Prognostic modeling with logistic regression analysis: in search of a sensible strategy in small data sets. Med Decis Making. 2001, 21: 45-56. 10.1177/0272989X0102100106.
16. Altman DG, Vergouwe Y, Royston P, Moons KGM: Prognosis and prognostic research: validating a prognostic model. Br Med J. 2009, 338: b605. 10.1136/bmj.b605.
17. Timmerman D, Valentin L, Bourne TH, Collins WP, Verrelst H, Vergote I: Terms, definitions and measurements to describe the sonographic features of adnexal tumors: a consensus opinion from the International Ovarian Tumor Analysis (IOTA) group. Ultrasound Obstet Gynecol. 2000, 16: 500-505. 10.1046/j.1469-0705.2000.00287.x.
18. Timmerman D, Testa AC, Bourne T, Ferrazzi E, Ameye L, Konstantinovic ML, Van Calster B, Collins WP, Vergote I, Van Huffel S, Valentin L: A logistic regression model to distinguish between the benign and malignant adnexal mass before surgery: a multicenter study by the International Ovarian Tumor Analysis (IOTA) group. J Clin Oncol. 2005, 23: 8794-8801. 10.1200/JCO.2005.01.7632.
19. Van Calster B, Timmerman D, Nabney IT, Valentin L, Testa AC, Van Holsbeke C, Vergote I, Van Huffel S: Using Bayesian neural networks with ARD input selection to detect malignant ovarian masses prior to surgery. Neural Comput Appl. 2008, 17: 489-500.
20. Timmerman D, Van Calster B, Testa AC, Guerriero S, Fischerova D, Lissoni AA, Van Holsbeke C, Fruscio R, Czekierdowski A, Jurkovic D, Savelli L, Vergote I, Bourne T, Van Huffel S, Valentin L: Ovarian cancer prediction in adnexal masses using ultrasound based logistic regression models: a temporal and external validation study by the IOTA group. Ultrasound Obstet Gynecol. 2010, 36: 226-234. 10.1002/uog.7636.
21. Vapnik V: The nature of statistical learning theory. 1995, New York, Springer.
22. Van Gestel T, Suykens JAK, Baesens B, Viaene S, Vanthienen J, Dedene G, De Moor B, Vandewalle J: Benchmarking least squares support vector machine classifiers. Mach Learn. 2004, 54: 5-32. 10.1023/B:MACH.0000008082.80494.e0.
23. Van Gestel T, Suykens JAK, Lanckriet GRG, Lambrechts A, De Moor B, Vandewalle J: Bayesian framework for least-squares support vector machine classifiers, Gaussian processes, and kernel Fisher discriminant analysis. Neural Comput. 2002, 14: 1115-1147. 10.1162/089976602753633411.
24. Wu TF, Lin CJ, Weng RC: Probability estimates for multiclass classification by pairwise coupling. J Mach Learn Res. 2004, 5: 975-1005.
25. Steyerberg EW, Eijkemans MJC, Harrell FE, Habbema JDF: Prognostic modeling with logistic regression analysis: a comparison of selection and estimation methods in small data sets. Stat Med. 2000, 19: 1059-1079. 10.1002/(SICI)1097-0258(20000430)19:8<1059::AID-SIM412>3.0.CO;2-0.
26. Burnham KP, Anderson DR: Model selection and inference: a practical information-theoretic approach. 1998, New York, Springer.
27. Ojeda F, Suykens JAK, De Moor B: Low rank updated LS-SVM classifiers for fast variable selection. Neural Netw. 2008, 21: 437-449. 10.1016/j.neunet.2007.12.053.
28. Van Calster B, Timmerman D, Testa AC, Valentin L, Van Huffel S: Multiclass classification of ovarian tumors. Proceedings of the Sixteenth European Symposium on Artificial Neural Networks: 23-25 April 2008; Bruges. Edited by: Verleyen M. 2008, Evere, d-side Publications, 65-70.
29. Mossman D: Three-way ROCs. Med Decis Making. 1999, 19: 78-89. 10.1177/0272989X9901900110.
30. Van Calster B, Van Belle V, Condous G, Bourne T, Timmerman D, Van Huffel S: Multiclass AUC metrics and weighted alternatives. Proceedings of the 21st International Joint Conference on Neural Networks: 1-6 June; Hongkong. Edited by: Liu D, Kozma R. 2008, Los Alamitos, IEEE Computer Society, 1391-1397.
31. Harrell FE: Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. 2001, New York, Springer.
32. Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, Pencina MJ, Kattan MW: Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010, 21: 128-138. 10.1097/EDE.0b013e3181c30fb2.
33. Heintz APM, Odicino F, Maisonneuve P, Quinn MA, Benedet JL, Creasman WT, Ngan HYS, Pecorelli S, Beller U: Carcinoma of the ovary. FIGO 6^{th} annual report on the results of treatment in gynecological cancer. Int J Gynaecol Obstet. 2006, 95 (Suppl 1): S161-S192. 10.1016/S0020-7292(06)60033-7.
34. Timmerman D, Van Calster B, Jurkovic D, Valentin L, Testa AC, Bernard JP, Van Holsbeke C, Van Huffel S, Vergote I, Bourne T: Inclusion of CA-125 does not improve mathematical models developed to distinguish between benign and malignant adnexal tumors. J Clin Oncol. 2007, 25: 4194-4200. 10.1200/JCO.2006.09.5943.
35. Roukema J, van Loenhout RB, Steyerberg EW, Moons KGM, Bleeker SE, Moll HA: Polytomous regression did not outperform dichotomous logistic regression in diagnosing serious bacterial infections in febrile children. J Clin Epidemiol. 2008, 61: 135-141. 10.1016/j.jclinepi.2007.07.005.
36. Lee JS, Oh IS: Binary classification trees for multiclass classification problems. Proceedings of the Seventh International Conference on Document Analysis and Recognition: 3-6 August 2003; Edinburgh. Edited by: Antonacopoulos A. 2003, Los Alamitos, IEEE Computer Society, 770-774.
37. Steyerberg EW, Eijkemans MJC, Habbema JDF: Application of shrinkage techniques in logistic regression analysis: a case study. Stat Neerl. 2001, 55: 76-88. 10.1111/1467-9574.00157.
38. Wyatt JC, Altman DG: Prognostic models: clinically useful or quickly forgotten? Br Med J. 1995, 311: 1539-1541.
Pre-publication history
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/10/96/prepub
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.