Methods for polytomous classification are underused in medical applications. In this paper, we used various methods for the probabilistic diagnosis of ovarian tumors as benign, borderline, primary invasive, or metastatic invasive. To the best of our knowledge, this is the first time that prediction models for ovarian tumor diagnosis have gone beyond the basic differentiation between benign and malignant tumors. Methods included true polytomous (all-at-once) algorithms and algorithms that combined dichotomous (1-versus-1) models using the technique of pairwise coupling. The basic classification algorithms were based on logistic regression, LS-SVMs, and kernel logistic regression. All models were internally, temporally, and externally validated. Despite the low number of borderline and metastatic tumors, consistent results were obtained across these validations.

The results showed very good separation of benign, borderline, and invasive tumors. This is an important result because the ability to differentiate between borderline and invasive tumors gives additional, highly useful information for making sensible treatment decisions. There is a clear difference in aggressiveness between borderline and invasive tumors, and they are treated differently. A model such as LR-PC2 could therefore be useful in clinical practice. An Excel sheet that can be used to implement LR-PC2 is available at http://homes.esat.kuleuven.be/~biomed/LRPC2/lrpc2.htm. Unfortunately, the models were unable to reliably discriminate between primary and metastatic invasive tumors.

In the present study, the combination of 1-versus-1 models with pairwise coupling was a competitive alternative to true polytomous algorithms. The former approach allowed for more fine-tuned variable selection and resulted in higher validation performance, as determined by the polytomous *c*-index, for both logistic regression-based and kernel logistic regression-based models. A further advantage of 1-versus-1 models is their flexibility: each one addresses a subproblem that may be of particular interest to the clinician, for example when the clinician hesitates between two diagnoses only. The overall best model combined 1-versus-1 logistic regression models using pairwise coupling (LR-PC2). For the discrimination between benign and malignant tumors (cf. the *c*-index for benign versus other tumors in Tables 3 and 4), this model performed similarly to the dichotomous models developed and validated on the same data [18, 20]. LR-PC2 used 11 predictors, none of which were subjective variables such as the experience of abdominal or pelvic pain during the ultrasound examination or the color score of intratumoral blood flow (a subjective score between 1 and 4). These variables are used in some of the existing dichotomous models [18, 20]. None of the polytomous models used the CA-125 tumor marker, because this marker was deliberately not considered as a predictor. The most important reasons are that we focused on ultrasound information, and that using CA-125 would preclude the immediate use of a model because the results of the blood test have to be awaited. In addition, the inclusion of CA-125 as a variable in dichotomous models did not result in better performance [34].
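The coupling step can be illustrated with a minimal sketch. It assumes the iterative scheme of Hastie and Tibshirani with equal weights for all pairs, which is one common way to couple pairwise estimates and is not necessarily the exact variant used for LR-PC2. The function name `couple_pairwise` and the example probabilities are hypothetical. Given estimates `r[i][j]` of the probability of class i among cases that belong to class i or j, the routine searches for a single probability vector whose implied pairwise probabilities p_i / (p_i + p_j) match those estimates.

```python
def couple_pairwise(r, max_iter=10000, tol=1e-12):
    """Combine pairwise probabilities r[i][j] ~ P(class i | class i or j)
    into one probability vector (equal weights for all pairs; an assumption)."""
    k = len(r)
    p = [1.0 / k] * k  # start from a uniform distribution
    for _ in range(max_iter):
        previous = p[:]
        for i in range(k):
            others = [j for j in range(k) if j != i]
            observed = sum(r[i][j] for j in others)
            implied = sum(p[i] / (p[i] + p[j]) for j in others)
            p[i] *= observed / implied
            total = sum(p)
            p = [value / total for value in p]  # renormalize after each update
        if max(abs(a - b) for a, b in zip(p, previous)) < tol:
            break
    return p


# Hypothetical example: pairwise estimates that are exactly consistent with
# known probabilities for benign, borderline, primary invasive, metastatic.
true_p = [0.60, 0.10, 0.25, 0.05]
r = [[0.0 if i == j else true_p[i] / (true_p[i] + true_p[j])
      for j in range(4)] for i in range(4)]
coupled = couple_pairwise(r)
```

When the pairwise estimates are mutually consistent, as in this constructed example, the coupled vector should reproduce the underlying probabilities. With real 1-versus-1 models the estimates generally conflict to some degree, and the result is a compromise. The sketch assumes every pairwise estimate lies strictly between 0 and 1.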

A disadvantage of combining 1-versus-1 models is that the number of 1-versus-1 problems grows quadratically with the total number of events: k events require k(k-1)/2 pairwise models, which amounts to six models for the four tumor types considered here. Another decomposition of a polytomous problem, which consists of a tree of nested (or sequential) dichotomous models, requires only k-1 models and does not suffer from this limitation [35, 36]. A sensible tree in our study would be to make a model to discriminate between benign and malignant tumors, followed by a model to discriminate between borderline and invasive tumors, and finally a model to contrast primary with metastatic invasive tumors. Polytomous probabilities can then be obtained in a straightforward manner by multiplying the conditional probabilities along the branches of the tree. When we applied this approach, it resulted in stronger performance degradation on temporal and external validation than the true polytomous models or the pairwise coupling approach.
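As a minimal sketch of how the nested tree yields polytomous probabilities, the function below multiplies the conditional probabilities along each branch. The function name, argument names, and example values are hypothetical and only illustrate the arithmetic; the three inputs stand for the predictions of the three nested dichotomous models described above.

```python
def nested_probabilities(p_malignant, p_invasive_given_malignant,
                         p_primary_given_invasive):
    """Multiply conditional probabilities along the tree
    benign vs malignant -> borderline vs invasive -> primary vs metastatic."""
    p_invasive = p_malignant * p_invasive_given_malignant
    return {
        "benign": 1.0 - p_malignant,
        "borderline": p_malignant * (1.0 - p_invasive_given_malignant),
        "primary invasive": p_invasive * p_primary_given_invasive,
        "metastatic invasive": p_invasive * (1.0 - p_primary_given_invasive),
    }


# Hypothetical predictions from the three nested models for one patient.
example = nested_probabilities(0.40, 0.75, 0.80)
```

For these illustrative inputs the four probabilities are 0.60, 0.10, 0.24, and 0.06, which sum to 1 by construction. An error in the first model propagates to every downstream probability, which is one plausible reason for the sensitivity of this decomposition.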

Interestingly, we found that approaches based on logistic regression performed very well when compared to the regularized kernel-based alternatives, despite the fact that two events (borderline, metastatic) had very few cases. All models suffered from a decrease in performance on temporal and external validation, but the decrease was not more severe for the unregularized logistic regression-based models. This might be explained by the careful variable selection strategies, for which cross-validated *c*-index estimates were the most important criterion. When we applied pairwise coupling of 1-versus-1 logistic regression models based on standard stepwise variable selection with a p-value of 0.05 as the selection and removal threshold (stepLR-PC), we ended up with a total of 15 selected variables. The polytomous and pairwise *c*-indexes of stepLR-PC were similar to or worse than those of LR-PC2 and showed a stronger decrease on temporal and external validation. That being said, the use of logistic regression models in situations like the one in this study calls for a regularized fitting approach such as shrinkage, penalized maximum likelihood estimation, or the LASSO (least absolute shrinkage and selection operator) [37].
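To make the idea of a regularized fit concrete, the sketch below fits a one-predictor logistic model by penalized maximum likelihood with a ridge (L2) penalty. This is an illustrative choice only: it is not the LASSO and not a method used in this study, and the function name, toy data, and penalty value are all hypothetical. The point is simply that the penalty pulls the coefficient toward zero, which tends to limit overfitting when an event has few cases.

```python
import math


def fit_penalized_logistic(x, y, penalty=0.0, learning_rate=0.1, n_iter=5000):
    """Fit intercept a and slope b of a one-predictor logistic model by
    gradient descent on mean log-loss + (penalty / 2) * b**2.
    The intercept is left unpenalized."""
    a, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        grad_a, grad_b = 0.0, 0.0
        for xi, yi in zip(x, y):
            predicted = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            grad_a += (predicted - yi) / n
            grad_b += (predicted - yi) * xi / n
        grad_b += penalty * b  # ridge (L2) shrinkage toward zero
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Hypothetical toy data: one predictor, overlapping classes.
x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, -1.0, 1.0]
y = [0, 0, 0, 1, 1, 1, 1, 0]
_, slope_unpenalized = fit_penalized_logistic(x, y, penalty=0.0)
_, slope_penalized = fit_penalized_logistic(x, y, penalty=1.0)
```

On this toy data the penalized slope is smaller in magnitude than the unpenalized one. In practice the penalty strength would be tuned, for example by cross-validation, rather than fixed in advance.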

This study further demonstrated the necessity of a thorough validation of prediction models, in particular in situations with small sample sizes for some events relative to the number of variables considered as possible predictors. We observed in our study that, contrary to what the internal validation results suggested, the differentiation between primary and metastatic invasive tumors was near random on temporal and external validation. Unfortunately, many models are developed, yet only a small proportion of them undergo validation in various clinical settings. This hampers the successful implementation of such models into clinical practice [38].

Even though the use of LR-PC2 in clinical practice would provide useful information, future work will focus on the development of a more robust model by combining all the data used in this study to update LR-PC2. Ample attention will be devoted to the selection of a limited set of predictors to boost the user-friendliness of the model for busy clinicians.