Multiclass risk models for ovarian malignancy: an illustration of prediction uncertainty due to the choice of algorithm
BMC Medical Research Methodology volume 23, Article number: 276 (2023)
Abstract
Background
Assessing malignancy risk is important to choose appropriate management of ovarian tumors. We compared six algorithms to estimate the probabilities that an ovarian tumor is benign, borderline malignant, stage I primary invasive, stage II-IV primary invasive, or secondary metastatic.
Methods
This retrospective cohort study used 5909 patients recruited from 1999 to 2012 for model development, and 3199 patients recruited from 2012 to 2015 for model validation. Patients were recruited at oncology referral or general centers and underwent an ultrasound examination and surgery ≤ 120 days later. We developed models using standard multinomial logistic regression (MLR), Ridge MLR, random forest (RF), XGBoost, neural networks (NN), and support vector machines (SVM). We used nine clinical and ultrasound predictors but developed models with or without CA125.
Results
Most tumors were benign (3980 in development and 1988 in validation data); secondary metastatic tumors were least common (246 and 172). The c-statistic (AUROC) to discriminate benign from any type of malignant tumor ranged from 0.89 to 0.92 for models with CA125, and from 0.89 to 0.91 for models without. The multiclass c-statistic ranged from 0.41 (SVM) to 0.55 (XGBoost) for models with CA125, and from 0.42 (SVM) to 0.51 (standard MLR) for models without. Multiclass calibration was best for RF and XGBoost. Estimated probabilities for a benign tumor in the same patient often differed by more than 0.2 (20 percentage points) depending on the model. Net Benefit for diagnosing malignancy was similar across algorithms at the commonly used 10% risk threshold, but was slightly higher for RF at higher thresholds. Comparing models pairwise, between 3% (XGBoost vs. NN, with CA125) and 30% (NN vs. SVM, without CA125) of patients fell on opposite sides of the 10% threshold.
Conclusion
Although several models had similarly good performance, individual probability estimates varied substantially.
Background
Patients with an ovarian tumor should be managed appropriately. There is evidence that treatment in oncology centers improves ovarian cancer prognosis [1, 2]. However, benign ovarian cysts are frequent and can be managed conservatively (i.e. non-surgically with clinical and ultrasound follow-up) or with surgery in a general hospital [3]. Risk prediction models can support optimal patient triage by estimating a patient’s risk of malignancy based on a set of predictors [4, 5]. ADNEX is a multinomial logistic regression (MLR) model that uses nine clinical and ultrasound predictors to estimate the probabilities that a tumor is benign, borderline, stage I primary invasive, stage II-IV primary invasive, or secondary metastatic [6,7,8]. ADNEX differentiates between four types of malignancies because these tumor types require different management [7, 9].
There is increasing interest in the use of flexible machine learning algorithms to develop prediction models [10,11,12]. Contrary to regression models, flexible machine learning algorithms do not require the user to specify the model structure: these algorithms automatically search for nonlinear associations and potential interactions between predictors [10]. This may result in better performing models, but poor design and methodology may yield misleading and overfitted results [10, 11]. A recent systematic review observed better performance for flexible machine learning algorithms versus logistic regression when comparisons were at high risk of bias, but not when comparisons were at low risk of bias [10]. Few of the included studies addressed the accuracy of the risk estimates (calibration), and none assessed clinical utility.
In addition, there is increased awareness of the uncertainty of predictions [13, 14]. It is known that probability estimates for individuals are unstable, in the sense that fitting the model on a different sample from the same population may lead to very different probability estimates for individual patients [15, 16]. This instability decreases with the sample size for model development, but is considerable even when models are based on currently recommended sample sizes [16, 17]. Apart from instability, ‘model uncertainty’ reflects the impact of various decisions made during model development on the estimated probabilities for individual patients. Modeling decisions may relate to issues such as the choice of predictors or the method to handle missing data [18, 19]. All other modeling decisions being equal, the choice of modeling algorithm (e.g. logistic regression versus random forest) may also play a role.
In this study, we (1) compare the performance of multiclass risk models for ovarian cancer diagnosis based on regression and flexible machine learning algorithms in terms of discrimination, calibration, and clinical utility, and (2) assess differences between the models regarding the estimated probabilities for individual patients to study model uncertainty caused by choosing a particular algorithm.
Methods
Study design, setting and participants
This is a secondary analysis of prospectively collected data from multicenter cohort studies that were conducted by the International Ovarian Tumor Analysis (IOTA) group. For model training, we used data from 5909 consecutively recruited patients at 24 centers across four consecutive cohort studies between 1999 and 2012 [6, 20,21,22,23]. All patients had at least one adnexal (ovarian, para-ovarian, or tubal) mass that was judged not to be a physiological cyst, provided consent for transvaginal ultrasound examination, were not pregnant, and underwent surgical removal of the adnexal mass within 120 days after the ultrasound examination. This dataset was also used to develop the ADNEX model [6]. For external validation, we used data from 3199 consecutively recruited patients at 25 centers between 2012 and 2015 [8]. All patients had at least one adnexal mass that was judged not to be a physiological cyst with a largest diameter below 3 cm, and provided consent for transvaginal ultrasound examination. Although this study recruited patients who subsequently underwent surgery or were managed conservatively, the current work only used data from patients who underwent surgery within 120 days after the ultrasound examination without additional preoperative ultrasound visits. The external validation dataset was therefore comparable to the training dataset.
Participating centers were ultrasound units in a gynecological oncology center (labeled oncology centers), or gynecological ultrasound units not linked to an oncology center. All mother studies received ethics approval from the Research Ethics Committee of the University Hospitals Leuven and from each local ethics committee. All participants provided informed consent. We obtained approval from the Ethics Committee in Leuven (S64709) for secondary use of the data for methodological purposes. We report this study using the TRIPOD checklist [4, 24].
Data collection
A standardized history was taken from each patient at the inclusion visit to obtain clinical information, and all patients underwent a standardized transvaginal ultrasound examination [25]. Transabdominal sonography was added if necessary, e.g. for large masses. Information on a set of predefined gray scale and color or power Doppler ultrasound variables was collected following the research protocol. When more than one mass was present, examiners included the mass with the most complex ultrasound morphology. If multiple masses with similar morphology were found, the largest mass or the mass best seen on ultrasound was included. Measurement of CA125 was neither mandatory nor standardized but was done according to local protocols regarding kits and timing.
Outcome
The outcome was the classification of the mass into one of five categories based on the histological diagnosis of the mass following laparotomy or laparoscopic surgery, with staging of malignant tumors using the classification of the International Federation of Gynecology and Obstetrics (FIGO): benign, borderline, stage I primary invasive, stage II-IV primary invasive, or secondary metastatic [26, 27]. Stage I invasive tumors have not spread outside the ovary and hence have the best prognosis of the primary invasive tumors. The histological assessment was performed without knowledge of the detailed results of the ultrasound examination, but pathologists might have received clinically relevant information as per local procedures.
Statistical analysis
Predictors and sample size
We used the following nine clinical and ultrasound predictors: type of center (oncology center vs. other), patient age (years), serum CA125 level (U/ml), proportion of solid tissue (maximum diameter of the largest solid component divided by the maximum diameter of the lesion), maximum diameter of the lesion (mm), presence of shadows (yes/no), presence of ascites (yes/no), presence of more than ten cyst locules (yes/no), and number of papillary projections (0, 1, 2, 3, > 3). These predictors were selected for the ADNEX model based on expert domain knowledge regarding likely diagnostic importance, objectivity, and measurement difficulty, and based on stability between centers (see Supplementary Material 1 for more information) [6]. We also developed models without CA125: not all centers routinely measure CA125, and including CA125 implies that predictions can only be made when the laboratory result becomes available. We discuss the adequacy of our study sample size in Supplementary Material 2.
Algorithms
We developed models using standard MLR, ridge MLR, random forest (RF), extreme gradient boosting (XGBoost), neural networks (NN), and support vector machines (SVM) [28,29,30,31]. For the MLR models, continuous variables were modeled with restricted cubic splines (using 3 knots) to allow for nonlinear associations [32]. The hyperparameters were tuned with 10-fold cross-validation on the development data (Supplementary Material 3). Using the selected hyperparameters, the full development data was used to train the model.
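As a concrete illustration of this procedure, the R sketch below tunes one of the algorithms (RF via ranger) with 10-fold cross-validation in caret on synthetic stand-in data. The data, predictor names, and grid values are hypothetical; the authors' actual code and grids are available on GitHub (see Data availability) and Supplementary Material 3.

```r
## Minimal sketch of the tuning/training procedure on synthetic stand-in data.
library(caret)

set.seed(1)
n <- 1000
dat <- data.frame(  # hypothetical data with the nine predictors
  outcome     = factor(sample(c("benign", "borderline", "stageI",
                                "stageII_IV", "metastatic"), n, replace = TRUE)),
  oncocenter  = rbinom(n, 1, 0.5),              # type of center
  age         = rnorm(n, 50, 12),               # years
  ca125       = rlnorm(n, 3, 1),                # U/ml
  prop_solid  = runif(n),                       # proportion of solid tissue
  lesion_mm   = runif(n, 10, 200),              # maximum lesion diameter
  shadows     = rbinom(n, 1, 0.2),
  ascites     = rbinom(n, 1, 0.1),
  locules10   = rbinom(n, 1, 0.1),              # > 10 cyst locules
  papillation = sample(0:4, n, replace = TRUE)  # papillary projections
)

## 10-fold cross-validation for hyperparameter tuning, here for RF (ranger)
ctrl   <- trainControl(method = "cv", number = 10, classProbs = TRUE)
rf_fit <- train(outcome ~ ., data = dat, method = "ranger", trControl = ctrl,
                tuneGrid = expand.grid(mtry = c(2, 3, 5),  # illustrative grid
                                       splitrule = "gini",
                                       min.node.size = c(5, 10)))

## After tuning, caret refits on the full data with the selected values;
## predicted probabilities for the five outcome categories:
probs <- predict(rf_fit, newdata = dat, type = "prob")

## For the MLR models, continuous predictors entered via restricted cubic
## splines with 3 knots, e.g. (sketch):
# library(nnet); library(rms)
# mlr_fit <- nnet::multinom(outcome ~ oncocenter + rcs(age, 3) + rcs(ca125, 3)
#                           + rcs(prop_solid, 3) + rcs(lesion_mm, 3) + shadows
#                           + ascites + locules10 + papillation, data = dat)
```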
Model performance on external validation data
Discrimination was assessed with the Polytomous Discrimination Index (PDI), a multiclass extension of the binary c-statistic (or area under the receiver operating characteristic curve, AUROC) [33]. In this study, PDI equals 0.2 (one divided by five outcome categories) for useless models, and 1 for perfect discrimination: PDI estimates the probability that, when presented with a set of five patients (one from each outcome category), the model correctly identifies the patient from a randomly chosen category. We also calculated pairwise c-statistics for each pair of outcome categories using the conditional risk method [34]. Finally, we calculated the binary c-statistic to discriminate benign from any type of malignant tumor. The estimated risk of any type of malignancy equals one minus the estimated probability of a benign tumor. PDI and c-statistics were analyzed through meta-analysis of center-specific results. We calculated 95% prediction intervals (PI) from the meta-analysis to indicate what performance to expect in a new center.
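To make the pairwise evaluation concrete, the following R sketch implements the conditional risk method as we read [34]: for categories i and j, the analysis is restricted to patients from those two categories, and the risk of j conditional on belonging to i or j is used as the score. The inputs (`probs`, `y`) are hypothetical.

```r
## Conditional risk method for the pairwise c-statistic (our reading of [34]).
## `probs`: n x 5 matrix of estimated probabilities with named columns;
## `y`: factor of true categories. Both hypothetical here.
pairwise_cstat <- function(probs, y, i, j) {
  keep   <- y %in% c(i, j)
  p_cond <- probs[keep, j] / (probs[keep, i] + probs[keep, j])
  case   <- y[keep] == j
  ## c-statistic: P(score of a random case > score of a random non-case),
  ## with ties counted as 1/2
  cmp <- outer(p_cond[case], p_cond[!case], ">") +
         0.5 * outer(p_cond[case], p_cond[!case], "==")
  mean(cmp)
}

## Example: benign vs. stage II-IV. The binary c-statistic for any malignancy
## instead uses 1 - probs[, "benign"] as the score.
# pairwise_cstat(probs, y, "benign", "stageII_IV")
```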
Calibration was assessed using flexible (loess-based) calibration curves per outcome; center-specific curves were averaged and weighted by the square root of sample size [35]. Calibration curves were summarized by the rescaled Estimated Calibration Index (ECI) [36]. The rescaled ECI equals 0 if the calibration curve fully coincides with the diagonal line, and 1 if the calibration curve is horizontal (i.e. the model has no predictive ability).
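The R sketch below illustrates, for a single outcome category, a loess-based calibration curve and an ECI-type summary. The rescaling shown (dividing by the value obtained for a flat curve at the prevalence) is our simplified reading of the rescaled ECI in [36], not necessarily the exact estimator used in the paper.

```r
## Loess-based calibration curve and an ECI-type index for one category.
## `p`: estimated probability of the category; `y01`: 1 if the patient has
## that category, 0 otherwise (hypothetical inputs; default loess smoothing).
calibration_eci <- function(p, y01) {
  fit <- loess(y01 ~ p)
  obs <- pmin(pmax(predict(fit), 0), 1)  # smoothed observed proportions
  eci <- mean((p - obs)^2)               # squared distance to the diagonal
  ## rescale so 0 = perfect calibration, 1 = flat curve at the prevalence
  eci / mean((p - mean(y01))^2)
}
```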
We calculated the Net Benefit to assess the utility of the model to select patients for referral to a gynecologic oncology center [37, 38]. A consensus statement suggests referring patients when the risk of malignancy is ≥ 10% [9]. We plotted Net Benefit for malignancy risk thresholds between 5% and 40% in a decision curve, but we focus on the 10% risk threshold. At each threshold, Net Benefit of the models is compared with default strategies: select everyone (‘treat all’) or select no-one (‘treat none’) for referral [37, 38]. Net Benefit was calculated using meta-analysis of center-specific results [39]. We calculated decision reversal for each pair of models by calculating the percentage of patients for which one model had an estimated risk ≥ 10% and the other < 10%.
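For clarity, Net Benefit at threshold t weighs true positives against false positives by the odds of the threshold: NB = TP/n − (FP/n) × t/(1 − t) [37, 38]. A minimal R sketch, with decision reversal added (inputs are hypothetical):

```r
## Net Benefit at threshold t: NB = TP/n - (FP/n) * t / (1 - t).
## p_mal = estimated risk of any malignancy (1 - estimated P(benign));
## y_mal = 1 for malignant, 0 for benign (hypothetical inputs).
net_benefit <- function(p_mal, y_mal, t = 0.10) {
  sel <- p_mal >= t
  tp  <- sum(sel & y_mal == 1) / length(y_mal)
  fp  <- sum(sel & y_mal == 0) / length(y_mal)
  tp - fp * t / (1 - t)
}

## Default strategies for comparison:
## 'treat all':  prev - (1 - prev) * t / (1 - t), with prev = mean(y_mal)
## 'treat none': 0

## Decision reversal: share of patients classified differently by two models
decision_reversal <- function(p1, p2, t = 0.10) mean((p1 >= t) != (p2 >= t))
```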
Missing values for CA125
CA125 was missing for 1805 (31%) patients in the development data and for 966 (30%) patients in the validation data. Patients with tumors that looked suspicious for malignancy more often had CA125 measured. We used ‘multiple imputation by chained equations’ to deal with missing CA125 values. Imputations were done separately for the development and validation data and were available from the original publications; see Supplementary Material 4 for details [6, 8].
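As an illustration only (the study reused the imputations from the original publications rather than re-imputing), multiple imputation by chained equations could look as follows with the R package mice; the settings shown are hypothetical.

```r
## Illustrative use of multiple imputation by chained equations; the study
## reused imputations from the original publications (Supplementary Material 4).
library(mice)

## Suppose `dat` contains the nine predictors, with ca125 partly missing.
imp <- mice(dat, m = 10, method = "pmm", seed = 2023)  # hypothetical settings

## Fit the model on each completed dataset, then average the predicted
## probabilities across imputations (one common way to pool predictions).
completed <- lapply(seq_len(imp$m), function(k) complete(imp, k))
```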
Modeling procedure and software
Supplementary Material 5 presents the modeling and validation procedure for models with and without CA125. The analysis was performed with R version 4.1.2, using the package nnet (MLR) and the package caret together with glmnet (ridge MLR), ranger (RF), xgboost (XGBoost), nnet (NN), and kernlab (SVM) [29]. Meta-analysis for Net Benefit was performed using WinBUGS.
Results
Descriptive statistics for the development and validation datasets are shown in Tables 1 and S1. A list of centers with the distribution of the five tumor types is shown in Table S2. The median age of the patients was 47 years (interquartile range 35–60) in the development dataset and 49 years (interquartile range 36–62) in the validation dataset. Most tumors were benign: 3980 (67%) in the development dataset and 1988 (62%) in the validation dataset. Secondary metastatic tumors were least common: 246 (4%) in the development dataset and 172 (5%) in the validation dataset.
Discrimination performance
For models with CA125, PDI ranged from 0.41 (95% CI 0.39–0.43) for SVM to 0.55 (0.51–0.60) for XGBoost (Table 2, Figure S1). In line with these results, the pairwise c-statistics were generally lower for SVM than for other models (Table S3). For the best models, pairwise c-statistics were above 0.90 for benign versus stage II-IV tumors, benign versus secondary metastatic tumors, benign versus stage I tumors, and borderline versus stage II-IV tumors. For all models, pairwise c-statistics were below 0.80 for borderline versus stage I tumors, stage I versus secondary metastatic tumors, and stage II-IV versus secondary metastatic tumors. The binary c-statistic (AUROC) for any malignancy was 0.92 for all algorithms except Ridge MLR (0.90) and SVM (0.89) (Figure S2).
For models without CA125, PDI ranged from 0.42 (95% CI 0.39–0.45) for SVM to 0.51 (0.47–0.54) for standard MLR (Table 2, Figure S3). Including CA125 mainly improved c-statistics for stage II-IV primary invasive vs. secondary metastatic tumors, and stage I vs. stage II-IV primary invasive tumors (Table S4). The binary c-statistic for any malignancy was less affected by excluding CA125, with values up to 0.91 (Figure S4).
Calibration performance
For models with CA125, the probability of a benign tumor was too high on average for all algorithms, in particular for SVM (Fig. 1). The risks of a stage I tumor and a secondary metastatic tumor were fairly well calibrated. The risk of a borderline tumor was slightly too low on average for all algorithms. The risk of a stage II-IV tumor was too low on average for standard MLR, Ridge MLR, and in particular for SVM. Based on the ECI, RF and XGBoost had the best calibration performance, SVM the worst (Table 2). Box plots of the estimated probabilities for each algorithm are presented in Figures S5-S10. For models without CA125, calibration results were roughly similar (Table 2, Figures S11-S17).
Clinical utility
All models with CA125 were superior to the default strategies (treat all, treat none) at any threshold (Figure S18). At the 10% threshold for the risk of any malignancy, all algorithms had similar Net Benefit (Table 2). At higher thresholds, RF and XGBoost had the best results, SVM the worst. For models without CA125, results were roughly similar (Table 2, Figure S19, Table S4).
Comparing estimated probabilities between algorithms
For an individual patient, the six models could generate very different probabilities. For example, depending on the model, the estimated probability of a benign tumor differed by at least 0.2 (20 percentage points) for 29% (models with CA125) and 31% (models without CA125) of the validation patients (Table 3, Figure S20). Note that these absolute differences were related to the prevalences of the outcome categories: the differences were largest for the most common category (benign) and smallest for the least common category (secondary metastatic). Scatter plots of estimated probabilities for each pair of models are provided in Figs. 2, 3, 4, 5 and 6 for models with CA125 and in Figures S21-S25 for models without. When comparing two models at the 10% threshold for the estimated risk of any malignancy, between 3% (XGBoost vs. NN, with CA125) and 30% (NN vs. SVM, without CA125) of patients fell on opposite sides of the threshold (Table S5).
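Such between-model spread can be quantified per patient as the difference between the highest and lowest estimated probability across the six models, as in this R sketch (`benign_probs` is a hypothetical n × 6 matrix of benign-tumor probabilities, one column per algorithm):

```r
## Per-patient spread of the benign-tumor probability across the six models.
spread <- apply(benign_probs, 1, function(p) max(p) - min(p))
mean(spread > 0.2)  # share of patients whose estimates differ by > 0.2
```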
Discussion
We compared six algorithms to develop multinomial risk prediction models for ovarian cancer diagnosis. There was no algorithm that clearly outperformed the others. XGBoost, RF, NN and MLR had similar performance, SVM had the worst performance. CA125 mainly increased discrimination between stage II-IV primary invasive tumors and the other two types of invasive tumors. Despite similar performance for several algorithms, the choice of algorithm had a clear impact on the estimated probabilities for individual patients. Choosing a different algorithm could lead to different clinical decisions in a substantial percentage of patients.
Strengths of the study include (1) the use of large international multicenter datasets, (2) data collection according to a standardized ultrasound examination technique, measurement technique, and terminology [25], (3) evaluation of risk calibration and clinical utility, and (4) appropriate modeling practices by addressing nonlinearity for continuous predictors in regression models and hyperparameter tuning for the machine learning algorithms. Such modeling practices are often lacking in comparative studies [10]. A limitation could be that we included only patients that received surgery, thereby excluding patients managed conservatively. This limitation affects most studies on ovarian malignancy diagnosis, because surgery allows using histopathology to determine outcome. The use of a fixed set of predictors could also be perceived as a limitation. However, these predictors were carefully selected based largely on expert domain knowledge for development of the ADNEX model, which is perhaps the best performing ultrasound-based prediction model to date [6, 8, 40]. Including more predictors, or using a data-driven selection procedure per algorithm, would likely increase the observed differences in estimated probabilities between algorithms.
Previous studies developed machine learning models using sonographic and clinical variables to estimate the risk of malignancy in adnexal masses on smaller datasets (median sample size 357, range 35–3004) [41,42,43,44,45,46,47,48,49,50,51,52]. Calibration was not assessed, and the outcome was binary (usually benign vs. malignant) in all but two studies. One study distinguished between benign, borderline, and invasive tumors [45], another study distinguished between benign, borderline, primary invasive, and secondary metastatic tumors [44]. However, sample size was small in these two studies (the smallest outcome category had 16 and 30 cases, respectively, in the development set). All studies focused exclusively on neural networks, support vector machines, or related kernel-based methods. All but one of these studies implicitly or explicitly supported the use of machine learning algorithms over logistic regression.
Our results illustrate that the probability estimates for individual patients can vary substantially by algorithm. There are different types of uncertainty of individual predictions [53]. ‘Aleatory uncertainty’ implies that two patients with the same predictor measurements (same age, same maximum lesion diameter, etcetera) may have a different outcome. ‘Epistemic uncertainty’ refers to lack of knowledge about the best final model and is divided into ‘approximation uncertainty’ and ‘model uncertainty’ [53]. ‘Approximation uncertainty’ reflects sample size: the smaller the sample size, the more uncertain the developed model. This means that very different models can be obtained when fitting the same algorithm to different training datasets of the same size, and that these differences become smaller with increasing sample size. ‘Model uncertainty’ reflects the impact of various decisions made during model development. Our study illustrates that the choice of algorithm is an important component of model uncertainty.
A first implication of our work is that there is no important advantage of using flexible machine learning over multinomial logistic regression for developing ultrasound-based risk models for ovarian cancer diagnosis to support clinical decisions. An MLR-based model is easier to implement, update, and explain than a flexible machine learning model. We would like to emphasize that the ADNEX model mentioned in the introduction, although based on MLR, includes random intercepts by center [6]. This is an advantage because it acknowledges that prevalences of the outcome categories vary between centers [54]. We did not use random intercepts in the current study, because they do not generalize directly to flexible algorithms.

A second implication is that the choice of algorithm matters for individual predictions, even when discrimination, calibration, and clinical utility are similar. Different models with equal clinical utility in the population may yield very different risk estimates for an individual patient, and this may lead to different management decisions for the same individual. Although, in our opinion, the crux of clinical risk prediction models is that their use should lead to improved clinical decisions for a specific population as a whole, differences in risk estimates for the same individual are an important finding. More research is needed to better understand uncertainty in predictions caused by the choice of algorithm, or by other decisions made by the modeler such as the predictor selection method.

The observation that different algorithms may make different predictions emphasizes the need for sufficiently large databases when developing prediction models. The recently established guidance on minimum sample size for developing a regression-based prediction model is a crucial step forward [55, 56]. However, it is based on general performance measures related to discrimination and calibration, and does not cover uncertainty of risk estimates for individual patients. Hence, if possible, the sample size should be larger than what the guidance would suggest. Flexible machine learning algorithms may require even more data than regression algorithms [57]. We should also consider providing an indication of the uncertainty around a risk estimate. Confidence intervals around the estimated probabilities may be provided, although this may be confusing for patients [58]. Moreover, standard confidence intervals do not capture all sources of uncertainty. The biostatistics and machine learning communities are currently researching methods to quantify the confidence of predictions [13, 14, 17, 59, 60]. Related options may be explored, such as models that abstain from making predictions when uncertainty is too large [14].
Conclusion
Several algorithms had similar performance and good clinical utility to estimate the probability of five tumor types in women with an adnexal (ovarian, para-ovarian, or tubal) mass treated with surgery. However, different algorithms could yield very different probabilities for individual patients.
Data availability
The analysis code and statistical analysis plan are available on GitHub (https://github.com/AshleighLedger/Paper-IOTA-ML). The datasets that we analysed during the current study are not publicly available because this was not part of the informed consent at the time (the last patient was recruited in 2015). However, the dataset may be obtained following permission from Prof. Dirk Timmerman (dirk.timmerman@uzleuven.be) and after fulfilling all requirements, such as data transfer agreements or ethics approval from the leading ethics committee during data collection (Research Ethics Committee of the University Hospitals Leuven).
Abbreviations
- MLR: Multinomial logistic regression
- RF: Random forests
- XGBoost: Extreme gradient boosting
- NN: Neural networks
- SVM: Support vector machines
- AUROC: Area under the receiver operating characteristic curve
- IOTA: International Ovarian Tumor Analysis
- FIGO: International Federation of Gynecology and Obstetrics
- PDI: Polytomous Discrimination Index
- PI: Prediction interval
- CI: Confidence interval
- ECI: Estimated Calibration Index
- BOT: Borderline tumor
References
1. Woo YL, Kyrgiou M, Bryant A, et al. Centralisation of services for gynaecological cancers – a Cochrane systematic review. Gynecol Oncol. 2012;126:286–90.
2. Vernooij F, Heintz APM, Witteveen PO, et al. Specialized care and survival of ovarian cancer patients in the Netherlands: nationwide cohort study. J Natl Cancer Inst. 2008;100:399–406.
3. Froyman W, Landolfo C, De Cock B, et al. Risk of complications in patients with conservatively managed ovarian tumours (IOTA5): a 2-year interim analysis of a multicentre, prospective, cohort study. Lancet Oncol. 2019;20:448–58.
4. Moons KGM, Altman DG, Reitsma JB, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med. 2015;162:W1–73.
5. Steyerberg EW. Clinical prediction models: a practical approach to development, validation, and updating. 2nd ed. Cham: Springer; 2019.
6. Van Calster B, Van Hoorde K, Valentin L, et al. Evaluating the risk of ovarian cancer before surgery using the ADNEX model to differentiate between benign, borderline, early and advanced stage invasive, and secondary metastatic tumours: prospective multicentre diagnostic study. BMJ. 2014;349:g5920.
7. Van Calster B, Van Hoorde K, Froyman W, et al. Practical guidance for applying the ADNEX model from the IOTA group to discriminate between different subtypes of adnexal tumors. Facts Views Vis Obgyn. 2015;7:32–41.
8. Van Calster B, Valentin L, Froyman W, et al. Validation of models to diagnose ovarian cancer in patients managed surgically or conservatively: multicentre cohort study. BMJ. 2020;370:m2614.
9. Timmerman D, Planchamp F, Bourne T, et al. ESGO/ISUOG/IOTA/ESGE Consensus Statement on pre-operative diagnosis of ovarian tumors. Ultrasound Obstet Gynecol. 2021;58:148–68.
10. Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12–22.
11. Wilkinson J, Arnold KF, Murray EJ, et al. Time to reality check the promises of machine learning-powered precision medicine. Lancet Digit Health. 2020;2:e677–80.
12. Collins GS, Dhiman P, Andaur Navarro CL, et al. Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence. BMJ Open. 2021;11:e048008.
13. Myers PD, Ng K, Severson K, et al. Identifying unreliable predictions in clinical risk models. NPJ Digit Med. 2020;3:8.
14. Kompa B, Snoek J, Beam AL. Second opinion needed: communicating uncertainty in medical machine learning. NPJ Digit Med. 2021;4:4.
15. Lemeshow S, Klar J, Teres D. Outcome prediction for individual intensive care patients: useful, misused, or abused? Intensive Care Med. 1995;21:770–6.
16. Pate A, Emsley R, Sperrin M, et al. Impact of sample size on the stability of risk scores from clinical prediction models: a case study in cardiovascular disease. Diagn Progn Res. 2020;4:14.
17. Riley RD, Collins GS. Stability of clinical prediction models developed using statistical or machine learning methods. Biom J. 2023;e2200302.
18. Pate A, Emsley R, Ashcroft DM, et al. The uncertainty with using risk prediction models for individual decision making: an exemplar cohort study examining the prediction of cardiovascular disease in English primary care. BMC Med. 2019;17:134.
19. Steyerberg EW, Eijkemans MJC, Boersma E, et al. Equally valid models gave divergent predictions for mortality in acute myocardial infarction patients in a comparison of logistic regression models. J Clin Epidemiol. 2005;58:383–90.
20. Timmerman D, Testa AC, Bourne T, et al. Logistic regression model to distinguish between the benign and malignant adnexal mass before surgery: a multicenter study by the International Ovarian Tumor Analysis Group. J Clin Oncol. 2005;23:8794–801.
21. Van Holsbeke C, Van Calster B, Testa AC, et al. Prospective internal validation of mathematical models to predict malignancy in adnexal masses: results from the International Ovarian Tumor Analysis study. Clin Cancer Res. 2009;15:684–91.
22. Timmerman D, Van Calster B, Testa AC, et al. Ovarian cancer prediction in adnexal masses using ultrasound-based logistic regression models: a temporal and external validation study by the IOTA group. Ultrasound Obstet Gynecol. 2010;36:226–34.
23. Testa A, Kaijser J, Wynants L, et al. Strategies to diagnose ovarian cancer: new evidence from phase 3 of the multicentre international IOTA study. Br J Cancer. 2014;111:680–8.
24. Debray TPA, Collins GS, Riley RD, et al. Transparent reporting of multivariable prediction models developed or validated using clustered data (TRIPOD-Cluster): explanation and elaboration. BMJ. 2023;380:e071018.
25. Timmerman D, Valentin L, Bourne TH, et al. Terms, definitions and measurements to describe the sonographic features of adnexal tumors: a consensus opinion from the International Ovarian Tumor Analysis (IOTA) group. Ultrasound Obstet Gynecol. 2000;16:500–5.
26. Heintz APM, Odicino F, Maisonneuve P, et al. Carcinoma of the ovary. FIGO 26th Annual Report on the Results of Treatment in Gynecological Cancer. Int J Gynaecol Obstet. 2006;95:S161–92.
27. Prat J. Staging classification for cancer of the ovary, fallopian tube, and peritoneum. Int J Gynaecol Obstet. 2014;124:1–5.
28. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33:1–22.
29. Kuhn M, Johnson K. Applied predictive modeling. New York: Springer; 2013.
30. Le Cessie S, van Houwelingen JC. Ridge estimators in logistic regression. Appl Statist. 1992;41:191–201.
31. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. arXiv. 2016;1603.02754v3. https://arxiv.org/abs/1603.02754.
32. Harrell FE Jr. Regression modeling strategies: with applications to linear models, logistic and ordinal regression, and survival analysis. 2nd ed. Cham: Springer; 2015.
33. Van Calster B, Van Belle V, Vergouwe Y, et al. Extending the c-statistic to nominal polytomous outcomes: the polytomous discrimination index. Stat Med. 2012;31:2610–26.
34. Van Calster B, Vergouwe Y, Looman CWN, et al. Assessing the discriminative ability of risk models for more than two outcome categories. Eur J Epidemiol. 2012;27:761–70.
35. Van Hoorde K, Vergouwe Y, Timmerman D, et al. Assessing calibration of multinomial risk prediction models. Stat Med. 2014;33:2585–96.
36. Edlinger M, van Smeden M, Alber HF, et al. Risk prediction models for discrete ordinal outcomes: calibration and the impact of the proportional odds assumption. Stat Med. 2022;41:1334–60.
37. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006;26:565–74.
38. Vickers AJ, Van Calster B, Steyerberg EW. Net benefit approaches to the evaluation of prediction models, molecular markers and diagnostic tests. BMJ. 2016;352:i6.
39. Wynants L, Riley RD, Timmerman D, et al. Random-effects meta-analysis of the clinical utility of tests and prediction models. Stat Med. 2018;37:2034–52.
40. Westwood M, Ramaekers B, Lang S, et al. Risk scores to guide referral decisions for people with suspected ovarian cancer in secondary care: a systematic review and cost-effectiveness analysis. Health Technol Assess. 2018;22:1–264.
41. Timmerman D, Verrelst H, Bourne TH, et al. Artificial neural network models for the preoperative discrimination between malignant and benign adnexal masses. Ultrasound Obstet Gynecol. 1999;13:17–25.
42. Biagiotti R, Desii C, Vanzi E, et al. Predicting ovarian malignancy: application of artificial neural networks to transvaginal and color Doppler flow US. Radiology. 1999;210:399–403.
43. Van Calster B, Timmerman D, Lu C, et al. Preoperative diagnosis of ovarian tumors using Bayesian kernel-based methods. Ultrasound Obstet Gynecol. 2007;29:496–504.
44. Van Calster B, Valentin L, Van Holsbeke C, et al. Polytomous diagnosis of ovarian tumors as benign, borderline, primary invasive or metastatic: development and validation of standard and kernel-based risk prediction models. BMC Med Res Methodol. 2010;10:96.
45. Akazawa M, Hashimoto K. Artificial intelligence in ovarian cancer diagnosis. Anticancer Res. 2020;40:4795–800.
46. Lu M, Fan Z, Xu B, et al. Using machine learning to predict ovarian cancer. Int J Med Inform. 2020;141:104195.
47. Park H, Qin L, Guerra P, et al. Decoding incidental ovarian lesions: use of texture analysis and machine learning for characterization and detection of malignancy. Abdom Radiol (NY). 2021;46:2376–83.
48. Vaes E, Manchanda R, Nir R, et al. Mathematical models to discriminate between benign and malignant adnexal masses: potential diagnostic improvement using ovarian HistoScanning. Int J Gynecol Cancer. 2011;21:35–43.
49. Clayton RD, Snowden S, Weston MJ, et al. Neural networks in the diagnosis of malignant ovarian tumours. Br J Obstet Gynaecol. 1999;106:1078–82.
50. Lu C, Van Gestel T, Suykens JAK, et al. Preoperative prediction of malignancy of ovarian tumors using least squares support vector machines. Artif Intell Med. 2003;28:281–306.
51. Moszynski R, Szpurek D, Smolen A, et al. Comparison of diagnostic usefulness of predictive models in preliminary differentiation of adnexal masses. Int J Gynecol Cancer. 2006;16:45–51.
52. Zeng Y, Nandy S, Rao B, et al. Histogram analysis of en face scattering coefficient map predicts malignancy in human ovarian tissue. J Biophotonics. 2019;12:e201900115.
53. Hüllermeier E, Waegeman W. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Mach Learn. 2021;110:457–506.
54. Wynants L, Vergouwe Y, Van Huffel S, et al. Does ignoring clustering in multicenter data influence the performance of prediction models? A simulation study. Stat Methods Med Res. 2018;27:1723–36.
55. Riley RD, Ensor J, Snell KIE, et al. Calculating the sample size required for developing a clinical prediction model. BMJ. 2020;368:m441.
56. Pate A, Riley RD, Collins GS, et al. Minimum sample size for developing a multivariable prediction model using multinomial logistic regression. Stat Methods Med Res. 2023;32:555–71.
57. van der Ploeg T, Austin PC, Steyerberg EW. Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints. BMC Med Res Methodol. 2014;14:137.
58. Bonner C, Trevena LJ, Gaissmaier W, et al. Current best practice for presenting probabilities in patient decision aids: fundamental principles. Med Decis Making. 2021;41:821–33.
59. Liu JZ, Padhy S, Ren J, et al. A simple approach to improve single-model deep uncertainty via distance-awareness. arXiv. 2022;2205.00403. https://arxiv.org/abs/2205.00403.
60. Thomassen D, le Cessie S, van Houwelingen H, Steyerberg E. Effective sample size: a measure of individual uncertainty in predictions. arXiv. 2023;2309.09824. https://arxiv.org/abs/2309.09824.
Acknowledgements
Not applicable.
Funding
The study is supported by the Research Foundation – Flanders (FWO) (projects G049312N, G0B4716N, 12F3114N, G097322N), and Internal Funds KU Leuven (projects C24/15/037, C24M/20/064). DT is a senior clinical investigator of FWO, and WF was a clinical fellow of FWO. TB is supported by the National Institute for Health Research (NIHR) Biomedical Research Centre based at Imperial College Healthcare National Health Service (NHS) Trust and Imperial College London. The views expressed are those of the authors and not necessarily those of the NHS, NIHR, or Department of Health. LV is supported by the Swedish Research Council (grant K2014-99X-22475-01-3, Dnr 2013–02282), funds administered by Malmö University Hospital and Skåne University Hospital, Allmänna Sjukhusets i Malmö Stiftelse för bekämpande av cancer (the Malmö General Hospital Foundation for fighting against cancer), and two Swedish governmental grants (Avtal om läkarutbildning och forskning (ALF)-medel and Landstingsfinansierad Regional Forskning). The funders of the study had no role in the study design, data collection, data analysis, data interpretation, writing of the report, or in the decision to submit the paper for publication.
Author information
Contributions
Contributions were based on the CRediT taxonomy. Conceptualization: AL, JC, BVC. Data curation: JC, WF. Formal analysis: AL, JC, BVC. Funding acquisition: LV, WF, DT, BVC. Investigation: LV, AT, CVH, DF, TB, WF, DT. Methodology: AL, BVC. Project administration: WF, DT, BVC. Resources: LV, AT, CVH, DF, TB, WF, DT. Software: AL, JC. Supervision: DT, BVC. Validation: BVC. Visualization: AL, BVC. Writing – original draft: AL, JC, BVC. Writing – review & editing: all authors. AL, JC, WF, DT, BVC directly accessed and verified the raw data, and no authors were precluded from accessing the data. All authors have read the manuscript, share final responsibility for the decision to submit for publication, and agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Ethics declarations
Competing interests
LV reported receiving grants from the Swedish Research Council, Malmö University Hospital and Skåne University Hospital, Allmänna Sjukhusets i Malmö Stiftelse för bekämpande av cancer (the Malmö General Hospital Foundation for Fighting Against Cancer), Avtal om läkarutbildning och forskning (ALF)–medel, and Landstingsfinansierad Regional Forskning during the conduct of the study; and teaching fees from Samsung outside the submitted work. DT and BVC reported receiving grants from the Research Foundation–Flanders (FWO) and Internal Funds KU Leuven during the conduct of the study. TB reported receiving grants from NIHR Biomedical Research Centre, speaking honoraria and departmental funding from Samsung Healthcare and grants from Roche Diagnostics, Illumina, and Abbott. No other disclosures were reported. All other authors declare no competing interests.
Ethics approval and consent to participate
All mother studies received ethics approval from the Research Ethics Committee of the University Hospitals Leuven and from each local ethics committee. All participants provided informed consent. All methods were carried out in accordance with relevant guidelines and regulations. We obtained approval from the Ethics Committee in Leuven (S64709) for secondary use of the data for methodological purposes.
Consent for publication
Not applicable.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
The electronic supplementary material comprises the following items.

- Supplementary Material 1: Predictor selection.
- Supplementary Material 2: Sample size argumentation.
- Supplementary Material 3: Hyperparameter tuning.
- Supplementary Material 4: Multiple imputation for CA125.
- Supplementary Material 5: Flowcharts for the modeling and validation procedure.
- Table S1: Descriptive statistics by reference standard (final diagnosis).
- Table S2: List of centers in the development and validation data.
- Table S3: Pairwise area under the receiver operating characteristic curve (AUROC) values (with 95% CI) for models with CA125 on external validation data.
- Table S4: Pairwise AUROC values for models without CA125.
- Table S5: Percentage of patients in the validation data falling on opposite sides of the 10% risk of malignancy threshold when comparing two models.
- Figure S1: Polytomous Discrimination Index (PDI) for models with CA125 on external validation data.
- Figure S2: AUROC for benign tumors vs. any malignancy for models with CA125.
- Figure S3: PDI for models without CA125.
- Figure S4: AUROC for benign tumors vs. any malignancy for models without CA125.
- Figures S5-S10: Box plots of estimated probabilities for standard MLR, ridge MLR, random forest, XGBoost, neural network, and support vector machine models with CA125, respectively.
- Figure S11: Flexible calibration curves for models without CA125.
- Figures S12-S17: Box plots of estimated probabilities for standard MLR, ridge MLR, random forest, XGBoost, neural network, and support vector machine models without CA125, respectively.
- Figure S18: Decision curves for models with CA125 on external validation data.
- Figure S19: Decision curves for models without CA125 on external validation data.
- Figure S20: Differences between the highest and lowest estimated probability for each outcome across the six models with CA125 (panel A) and the six models without CA125 (panel B) for patients in the external validation dataset. Each dot denotes the difference between the highest and lowest estimated probability for one patient, so each patient is shown five times per panel, once for each outcome category. For example, at the far left, the difference between the highest and lowest estimated probability for a benign tumor is shown for all 3199 patients in the dataset. The box represents the interquartile range (the middle 50% of the differences), the line inside the box indicates the median, and the whiskers correspond to the 5th and 95th percentiles.
- Figures S21-S25: Scatter plots of the estimated risk of a benign, borderline, stage I primary invasive, stage II-IV primary invasive, and secondary metastatic tumor, respectively, for each pair of models without CA125.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Ledger, A., Ceusters, J., Valentin, L. et al. Multiclass risk models for ovarian malignancy: an illustration of prediction uncertainty due to the choice of algorithm. BMC Med Res Methodol 23, 276 (2023). https://doi.org/10.1186/s12874-023-02103-3