Ordinal logistic regression model describing factors associated with extent of nodal involvement in oral cancer patients and its prospective validation.

Background Oral cancer is the most common cancer among Indian men, and has strong tendency of metastatic spread to neck lymph node which strongly influences prognosis especially 5 year survival-rate and also guides the related managements more effectively. Therefore, a reliable and accurate means of preoperative evaluation of extent of nodal involvement becomes crucial. However, earlier researchers have preferred to address mainly its dichotomous form (involved/not-involved) instead of ordinal form while dealing with epidemiology of nodal involvement. As a matter of fact, consideration of ordinal form appropriately may increase not only the efficiency of the developed model but also accuracy in the results and related implications. Hence, to develop a model describing factors associated with ordinal form of nodal involvement was major focus of this study. Methods The data for model building were taken from the Department of Surgical Oncology, Dr.BRA-IRCH, AIIMS, New Delhi, India. All the OSCC patients (duly operated including neck dissection) and confirmed histopathologically from 1995 to 2013 were included. Further, another data of 204 patients collected prospectively from 2014 to 2015 was considered for the validation of the developed model. To assess the factors associated with extent of nodal involvement, as a first attempt in the field of OSCC, stepwise multivariable regression procedure was used and results are presented as odds-ratio and corresponding 95% confidence interval (CI). For appropriate accounting of ordinal form, the ordinal models were assessed and compared. Also, performance of the developed model was validated on a prospectively collected another data. Results Under multivariable proportional odds model, pain at the time of presentation, sub mucous fibrosis, palpable neck node, oral site and degree of differentiation were found to be significantly associated factors with extent of nodal involvement. In addition, tumor size also emerged to be significant under partial-proportional odds model. Conclusions The analytical results under the present study reveal that in case of ordinal form of the outcome, appropriate ordinal regression may be a preferred choice. Present data suggest that, pain, sub mucous fibrosis, palpable neck node, oral site, degree of differentiation and tumor size are the most probable associated factors with extent of nodal involvement.


(Continued from previous page)
Conclusions: The analytical results under the present study reveal that in case of ordinal form of the outcome, appropriate ordinal regression may be a preferred choice. Present data suggest that, pain, sub mucous fibrosis, palpable neck node, oral site, degree of differentiation and tumor size are the most probable associated factors with extent of nodal involvement.

Background
In India, in terms of prevalence and incidence, oral cancer ranks first among men and overall at rank three, however regarding related mortality it is at third position in men (GLOBOCAN 2012) [1]. The prognosis of oral cancer including 5 year survival rate may get highly affected due to its metastatic spread to neck lymph node [2]. As a result, the regional control and overall survival may also get affected [3][4][5][6][7]. In other words, the outcome of oral cancer may significantly change due to the nodal involvement. Hence, appropriate management of the cervical lymph nodes is an important part of oral cancer therapy [8][9][10][11][12][13]. However, there has been controversy over the indication, timing and methods of neck dissection [12,13]. Difficulties in early diagnosing thus restrict the treatment carried out by surgeons [14]. A reliable and accurate means of preoperative evaluation of cervical lymph node metastasis is therefore crucial for the correct management of oral cancer [8,10,13,15,16]. Also, understanding of its associated factors may provide clues to the clinicians for better management.
There are negligible numbers of studies dealing with analysis of nodal involvement among oral cancer patients. Most of them have focused either on occult nodal metastasis only or a specific oral site i.e., tongue, lip or buccal mucosa or stages [2,[17][18][19][20][21][22][23][24][25][26][27]. Further, they have dealt with only presence/absence of nodal involvement, not in the ordinal form. It is well known that, ignoring the ordering has its own disadvantage mainly because it does not fully utilize the available information [28]. For example, Armstrong and Sloan [29] have reported that compared to a cumulative logit model for a five level ordered response, use of logistic model resulted into a loss of 25-50% of efficiency. To the best of our knowledge, this is the first such study dealing with the ordinal form of extent of involved nodes among oral cancer patients.
Keeping in view of the above points, objective of this study was to develop an ordinal logistic regression model to find out the factors associated with ordinal form of involved nodes and validate it on temporal data.

Data
The utilized dataset under the present study is same as that considered while assessing the factors associated with nodal involvement (yes/no) among oral squamous cell carcinoma patients [30,31]. As such, the data maintained at Department of Surgical Oncology, Institute Rotary Cancer Hospital (IRCH), AIIMS, New Delhi, India for patients with histopathologically proven oral squamous cell carcinoma who went under surgery including neck dissection was considered. Out of data of 1123 oral cancer patients available during 1995 to 2013, 945 fulfilled the inclusion criteria. Further, prospectively collected data of 204 patients from January 2014 to December 2015 was used for the temporal validation of the developed ordinal model. The number of involved nodes of each patient was collected from their histopathological reports. In view to utilize maximum available information for more accurate results, ordinal categories mainly relied on desired varying management strategies being adopted by the oncologists according to cut-off values of involved nodes. Accordingly it was considered as 0, 1, 2-4 and > 4 involved nodes in ordinal categories. The covariates in this study remain same as those under dichotomous model [30,31]. Statistical software, STATA/SE version 14.2 (StataCorp LP, College Station, TX, USA), was used for the analysis.

Statistical analysis
Categorical variables were described using absolute/relative frequency distribution and quantitative variables using measures of central tendency/location like mean (standard deviation)/median (quartile range). The association between qualitative independent variables, was assessed using Chi-square test/ Fisher's exact test. To assess association between two quantitative variables, Pearson/Spearman's correlation coefficient was used. Collinearity between the covariates was assessed by Cramer's V as 0.7 and more. To find out the factors associated with ordinal form of nodal involvement, stepwise ordinal logistic regression procedure was used. Variables which were found to be significant at the level of 25% under crude association analysis (univariable analysis) and/or on the basis of their clinical relevance were taken as a sub-set of covariates for stepwise regression. Results are presented in the form of odds ratio and corresponding 95% confidence interval (CI). Brant test was used for proportionality assumption [32]. The model performance was assessed using measure of discrimination and calibration (i.e. the accuracy of the prediction probability of nodal involvement). Discrimination performance was evaluated using Average Dichotomous C-index, Generalized C-index and Set wise C-index (ORC) [33]. Calibration of the predicted probabilities under the developed model was investigated using M.W. Fagerland and D.W. Hosmer test for goodness of fit [34] and specification error by linktest [35]. The equal spaced integer weight score was used to discriminate the individuals regarding calibration and discrimination ability of the developed model [36].

Statistical models
There are different ordinal logistic regression models to tack care of ordinal form of outcomes. These different ordinal regression models have different ways to form their logits. For example, proportional odds model (considered as cumulative higher category(s) verses cumulative remaining lower category(s)); continuation ratio model (considered as cumulative higher category(s) verses just lower category alone); and adjacent category model (between any of two consecutive categories). Accordingly every form of the logit has its own benefits or limitation, one can use the models as per specific need. To be more specific, continuation ratio model and adjacent category model do not relay on complete data set. In epidemiological and biomedical applications, proportional odd model (POM) is often used. However, sometime continuation ratio model is also used [37,38]. Further, choice between the POM and CRM models relies on the goals of the statistical analysis. As obvious, in case of nodal involvement, interpretation under POM will remain more logical and meaningful. However, if the proportionality assumption does not fulfill, partial proportional odds model may be a better choice [35,39]. Further, choice between proportional odds and partial proportional odds models was assessed using likelihood ratio test, LR, AIC and BIC.

Proportional odds model (POM)
If log odds ratio across the cut points is identical, i.e. proportional odds assumption is satisfied, the proportional odds model is used. The proportional odds model was initially proposed by Walker and Duncan [40] as cumulative logit model, but later it was named as proportional odds model by McCullagh [37]. In case of the present study, independent observations on 945 patients were available. As described earlier, observation on the nodal involvement (Y) for each patient is classified into either of four categories. Likewise, covariates (x i ) denote the p-dimensional vector of covariates (i = 1,2….p), containing the observation on the full set of p explanatory variables. Accordingly, the dependency of Y on x i may be expressed as: Where, Pr(Y ≥ y j ) is the cumulative probability of the event (Y ≥ y j ); α j are the respective intercept parameters; β is a (p by 1) vector of regression coefficients corresponding to x i covariates.

Partial proportional odds model (PPOM)
If identical log odds ratio assumption under POM is not fulfilled for some of the covariates, partial proportional odds model may be used [41]. Out of the two approaches available in this regard, unconstrained partial proportional odds model and constrained partial proportional odds model, unconstrained PPOM model was used due to unavailability of prior knowledge or beliefs regarding constraints and also availability of computational facility [42]. The PPOM permits non proportional odds for a subset of q of the p-predictors (q ≤ p). The unconstrained PPOM cumulative probability may be defined as: Where x i is a (p by 1) vector containing the values of observation i on the full set of p explanatory variables, β is a (p by 1) vector of regression coefficients associated with p variables. Further, t ′ is a (1 by q) vector of qcovariates, containing the values of observation i on that subset of the p explanatory variables for which proportionality assumption is either not fulfilled or is to be tested; and γ j is a (q by 1) vector of regression coefficients associated with the q covariates. Accordingly, t ′ γ j is the increment associated with jth cumulative logit (1 < = j < =3),where γ 1 is equal to 0. If γ j = 0 for all j, then this model reduces to proportional odds model. Accordingly, the proportional odds assumption for the q variables in t is a test of the null hypothesis that γ j = 0 for all j = 2, 3.

Results
The ordinal logistic regression model was developed using data on 945 patients, where as the data of another 204 patients was used for temporal validation of the developed model. Out of 945 patients, females were 212 (22.4%). Majority of the patients were in the age group 40 to 60 years 549 (58.1%) and in lower/lower middle socio-economic class 751 (79.5%). The distribution of number of involved nodes were as; no node involved: 569 (60.21%); 1 node involved: 149 (15.77%); 2-4 nodes involved: 162 (17.14%) and > 4 nodes involved: 65 (6.88%).
Out of 39 available variables, a set of 19 covariates were considered for stepwise regression procedure, among which 9 variables were selected on the basis of its crude association at 25% level of significance and 10 variables on its clinical relevance. Collinearity and first order effect modifier were assessed before developing the multivariable model. They were absent in the present dataset. The distribution of nodal involvement and its associations with covariates are presented in Table 1.
Under multivariable regression analysis, proportional odds assumption was found to be satisfactory for each of the considered covariates selected for developing final model except histopathological tumor size. Moreover, overall proportionality assumption was not violated (p = 0.97). Accordingly, both the models (POM and PPOM) were developed and compared.
Interestingly, the results under proportional odds multivariable regression analysis emerged to be similar to those reported under binary form of nodal involvement [30]. To be more specific, pain at presentation, SMF, palpable neck node, oral site and degree of differentiation were retained to be significantly associated factors also with higher extent of nodal involvement. The patients presenting with pain were 37% more likely to have higher order of nodal positivity as compared to lower [1.37 (1.05 to 1.78)], whereas SMF was protective and patients with SMF had 57% less chance for higher frequency of node involvement [0.43 (0.21 to 0.90)]. Patients other than well differentiated tumors were more likely for presence of positive node [1.41 (1.07 to 1.86)] ( Table 2). Like proportional odds model, under multivariable partial proportional odds model, pain at the time of presentation, SMF, palpable neck node, oral site and degree of differentiation were found to be significantly associated factors. In addition, tumor size also emerged to be significant. For patients with large tumor size, the chance of involving higher number of positive nodes went up. The patients with tumor size more than 4 cm had more than two fold chance to be involved with more than 4 numbers of positive nodes. The detailed results under multivariable regression analysis are presented in (Table 3).
The comparative appraisal of both the models ( Table 2, Table 3) revealed that partial proportional odds model stands to be preferred as indicated by LR and AIC (Table 4). Likelihood-ratio test also suggests the same (p = 0.004).

Assessment of the model
Fagerland and Hosmer test [34] for goodness of fit suggests (p = 0.88) that our developed model describes the distribution of nodal involvement satisfactorily. Further, link test suggest (p = 0.65) that there was no specification error in the developed partial proportional odds model. For discrimination ability of the developed model, Average Dichotomous C-index, Generalized Cindex and Set wise C-index (ORC) were used [33]. The Average Dichotomous C-index values of the developed model was found to be 0.67, which suggests that probability to correctly discriminate a case with a lower outcome level from a case with higher outcome level was 0.67. However, Generalized C-index value was observed as 0.635, which means that probability to discriminate two cases from different categories was 0.635. Further, set wise C-index (ORC) value of the model was 0.634. It implies that the probability of correct discrimination in a pair of cases from two randomly chosen categories was 0.634.

Validation of the developed model
As described earlier, temporal validation [43] of the developed model was evaluated on another dataset of 204 patients (validation data), collected prospectively from the same centre. The goodness of fit test on validation data also suggests (p = 0.35) that the developed model sustains its ability to describe the distribution of nodal involvement satisfactorily. In this case, the probability values of all the three indices (i.e. Average Dichotomous C-index, Generalized C-index and Set wise C-index (ORC)) providing discrimination ability of the ordinal logistic regression were 0.66, 0.63 and 0.64 respectively, which again indicates that the developed model performs equally on validation data. The analytical results amply reveal that the developed model remains to be generalizable and acceptable.

Discussion
In biomedical research, in addition to frequent emergence of ordinal categorical data, sometime ordinal categories are the result of grouping of quantitative data [44]. However, as known in case of change in origin and scale, dichotomization or ignoring the order has its own disadvantage. Armstrong and Sloan (1989) found that compared to a cumulative logit model of a five level ordered response, logistic model attains only 50-75% of efficiency [29].
In this study, ordinal categories mainly relied on desired varying management strategies being adopted by the oncologists according to cut-off values. Accordingly, number of involved cervical nodes were categorized as 0, 1, 2-4 and > 4 involved nodes. Until now, there is no study considering the ordinal form of nodal involvement while assessing its associated factors. The available studies have dealt with only binary form of nodal involvement [2,14,24,30,[45][46][47].
In the present study, POM seems to be appropriate as overall model did not violate the proportional odds assumption significantly. However, one of the covariates was found to violate this assumption. Theoretical ground or empirical tests do not provide clear guidelines about when to relax the proportional odds assumption [48]. Under exploration of this possibility, either to use POM or PPOM, log likelihood, AIC and likelihood ratio test supported the PPOM. Also, a clinically relevant covariate, tumor size emerged to be significant under PPOM.   The availability of GOLOGIT2 with AUTOFIT syntax in STATA makes the things easier to select the appropriate model between POM and PPOM [35,42].
Results under binary form of nodal involvement on the same dataset and considering same set of covariates are already reported elsewhere [30,31]. Under appropriate consideration of the nodal involvement in the ordinal form, one of the clinically more relevant covariates i.e., histopathological tumor size, was also found to be significantly associated with extent of nodal involvement, which was missed in its binary consideration. However, other significant covariates were similar and also the effect size was in same direction. Under model assessment, model for binary as well as ordinal form of nodal involvement described the distribution satisfactorily.

Conclusions
The analytical results under the present study reveal that in case of ordinal form of the outcome, ordinal regression may be a preferred choice. Further, in case of violation of proportionality assumption in any of the covariates, PPOM may be a better choice. This is likely to ensure accuracy not only in results but also in related inferences and their implications. In summary, pain at the time of presentation, sub mucous fibrosis, clinically palpable neck node, oral site, degree of differentiation and tumor size are the most probable associated factors with extent of nodal involvement in oral squamous cell carcinoma patients.
As mentioned earlier, the ordinal categories mainly relied on desired varying management strategies being adopted by the oncologists according to cut-off values of involved nodes. Otherwise, as true in case of any such study, change in cut-off values of nodal involvement is bound to change the results. Further, as true under such modeling, finite sample bias may remain a concern in these models as well.