Baseline-adjusted proportional odds models for the quantification of treatment effects in trials with ordinal sum score outcomes

Background Sum scores of ordinal outcomes are common in randomized clinical trials. The approaches routinely employed for assessing treatment effects, such as t-tests or Wilcoxon tests, are not particularly powerful in detecting changes in relevant parameters or lack the ability to incorporate baseline information. Hence, tailored statistical methods are needed for the analysis of ordinal outcomes in clinical research. Methods We propose baseline-adjusted proportional odds logistic regression models to overcome previous limitations in the analysis of ordinal outcomes in randomized clinical trials. For the validation of our method, we focus on common ordinal sum score outcomes of neurological clinical trials such as the upper extremity motor score, the spinal cord independence measure, and the self-care subscore of the latter. We compare the statistical power of our models to other conventional approaches in a large simulation study of two-arm randomized clinical trials based on data from the European Multicenter Study about Spinal Cord Injury (EMSCI, ClinicalTrials.gov Identifier: NCT01571531). We also use the new method as an alternative analysis of the historical Sygen®clinical trial. Results The simulation study of all postulated trial settings demonstrated that the statistical power of the novel method was greater than that of conventional methods. Baseline adjustments were more suited for the analysis of the upper extremity motor score compared to the spinal cord independence measure and its self-care subscore. Conclusions The proposed baseline-adjusted proportional odds models allow the global treatment effect to be directly interpreted. This clear interpretation, the superior statistical power compared to the conventional analysis approaches, and the availability of open-source software support the application of this novel method for the analysis of ordinal outcomes of future clinical trials.


Background
Statistical methods for the quantitative analysis of ordinal sum score outcomes in medical research are motivated by three distinct core ideas. There is currently no gold standard statistical method used for the analysis. Hence, the aim of this paper is to introduce an advanced tailored statistical method for the analysis of complex ordinal outcomes based on the advantages of the three existing core ideas. The novel method introduced in this paper is applicable to ordinal sum score measurements in general. However, it is here exemplified by an application based on two commonly used ordinal clinical outcome measurements from the field of spinal cord injury (SCI) research [1]. More precisely, the upper extremity motor score (UEMS) and the spinal cord independence measure (SCIM) serve as illustrative ordinal response variables to introduce and demonstrate the novel method. The UEMS and SCIM are frequently used ordinal clinical assessments to classify SCI patients.
SCIs result in life-long para-and tetraplegia, with high impact on the individual quality of life. To date, no treatment is available to regenerate the interrupted nerve fibers and repair the damaged spinal cord. Hence, patients with SCI basically only benefit from the rehabilitation programs that enable patients to compensate and adjust by maximizing their functional skills for the given neurological impairment caused by the SCI [2]. However, important progress has been made in the last 20 years in the scientific understanding of the processes regulating nerve fiber growth and regeneration.
Recent preclinical research in animal models has been extremely successful in demonstrating the efficacy of the antibody anti-Nogo-A raised against the axon growth inhibitory protein Nogo-A [3,4]. These promising results warrant its application in patients suffering from acute spinal cord injury. The Nogo Inhibition in Spinal Cord Injury trial (NISCI, www.nisci-2020.eu, "To enhance plasticity, regeneration and functional recovery after acute spinal cord injury, a multicenter European clinical proof of concept trial") will enroll tetraplegic patients with various degrees of complete to incomplete acute spinal cord injury in a double-blind, placebo-controlled trial to test the efficacy of this antibody therapy in improving motor outcome and quality of life. The newly developed statistical methodology, which quantifies treatment effects in trials with ordinal sum score outcomes, introduced in this paper also aims to improve the analysis of the NISCI trial.

Current state of research
This section contrasts the three distinct statistical core ideas that are commonly used for the quantitative analysis of ordinal outcomes. As stated above, the novel method introduced in this paper is applicable to ordinal sum score measurements in general. However, the application is exemplified based on the commonly used ordinal clinical outcome measurements from the field of SCI research [1]: • the total UEMS from 0 to 50 [5], • the SCIM total score from 0 to 100 [6,7], and • the SCIM self-care (SCIMsc) subscore from 0 to 20 [6,7].
In SCI research, it has been suggested that these sum score measurements correlate with changes in daily living activities that rely on recovery of upper extremity function [8][9][10][11]. UEMS measures the muscle contraction force for 10 key muscles on the arms and hands (5 on each body side), where each contributes a 6-point ordinal scale (0: total paralysis, through 5: active movement against full resistance). The score is not ordinal per se, because it can only be directly compared between patients with the same level of injury, that is, with the same pattern of uneffected key muscles (which, by definition, receive a 5 score). SCIM consists of sub-items assessing mobility abilities (bed, indoors and outdoor activities), respiration and sphincter management, and self-care (feeding, grooming, bathing, and dressing). Both, SCIM total score and self-care subscore, are considered ordinal and can be used to compare the independence of patients regardless of the corresponding level of injury.
The first core idea discussed is considered as the most prominent one. For this, the total sum score of the individual ordinal scores is treated as a conceptually continuous outcome measured on an interval scale. Classical statistical models or tests, such as baseline-adjusted analysis of covariance (ANCOVA) models or t-tests, are then applied to infer mean differences between treatment groups. The main advantages of this approach are ease of communication through reporting of difference in means and associated confidence intervals and the possibility of dealing with more complex trial designs, for example, through the incorporation of baseline information. However, Heller et al. [12] recommend to only apply linear regression and its variants for pragmatic analysis approaches for not highly skewed distributed outcome variables. In the case of the highly skewed UEMS and SCIM this does not seem adequate. Furthermore, differences in ordinal scores cannot directly be interpreted. For example, consider the case of two patients with very acute UEMSs, one with an UEMS of 25 and the other with an UEMS of 40. After six months, the UEMS of both patients increases by 5 points. Although the increase is the same for both patients, the relative percent recovery is much larger for the first patient from a medical point of view. As pointed out in [13], it is clearly erroneous to assume that important differences should be clinically equivalent across the whole scale [14].
This issue is dealt with by the remaining two core ideas, which respect the ordinal nature of the outcomes. One idea is to understand the contributions of each score item, i.e. SCIM item or each segment within the UEMS, as a multivariate ordinal outcome. Item-response analysis, most prominently with the Rasch model, tries to identify a latent continuous interval scale that generates the ordinal assessment. Item properties, such as unidimensionality and potential differential item functioning, have been assessed for the SCIM [15] and UEMS [16]. Once such a latent scale is identified, classical models and tests for continuous outcomes could be used to infer differences in means between treatment groups. The results, however, are difficult to interpret because treatment effects are reported on this latent scale. In SCI research, Reed et al. [17] recently introduced the spinal cord ability ruler (SCAR) as an interval construct to measure volitional performance after spinal cord injury. The SCAR is a single SCI measurement based on two existing and commonly used measurements (International Standards for Neurological Classification of SCI (ISNCSCI) and SCIM). The definition of the SCAR scores is based on a Rasch analysis and was validated by simulation with two databases. Additional statistical models which belong to the class of latent trait model for ordered polytomous responses are the Graded Response Model (GRM, [18]) as well as the Partial Credit Model (PCM, [19]).
The third idea is to define a simple treatment effect measure that can be directly interpreted on the original ordinal scale of the outcome measures. Tanadini et al. [8] suggest to compare the UEMS measurements with respect to the odds ratio (OR) of treatment versus control groups. The challenge here is to extend the classical proportional odds logistic regression Polr) model [20][21][22] for ordinal responses with a considerably high number of non-reducible outcome categories. Since the classical Polr model features one intercept parameter for each possible outcome category, this would require 50 intercept parameters for each possible stratum in a model for the UEMS sum score, which ranges between 0 and 50. Tanadini et al. [8] approach this problem by fitting one segment-wise proportional odds model. While this technique can be used for p-value-based inference of segment-wise ORs with potential baseline adjustment, it is impossible to derive a global treatment effect and a corresponding confidence interval. The latter information is, however, of utmost importance for communicating results of randomized clinical trials.
The above discussion stresses the need for an advanced tailored method for the analysis of complex ordinal outcomes that allows valid inference of a global interpretable treatment parameter and respects the original ordinal scale of the outcomes in potentially complex trial designs. We introduce a novel methodology that brings together the simplicity of a univariate sum score outcome and the interpretability of a global treatment effect. The method is based on maximum likelihood estimation and allows for baseline adjustment as well as confidence interval-based inference.

Methods
The proposed baseline-adjusted proportional odds logistic regression model is generalizable and applicable for any statistical analysis with a conseriderably high amount of possible outcome categories. However, we here strengthen our argument by applying the method to specific ordinal sum score outcome variables from the field of SCI research. Thus, this section begins with the introduction to the underlying data set with which we later evaluate our models. We then focus on the specific models and their application in terms of statistical power and trial sample size calculations with the help of a simulation study. Furthermore, we present a concrete application of the method applied as an alternative analysis approach to the Sygen ® trial data.

Data source and trial outcomes
The proposed treatment effect models were evaluated in a setting similar to the currently conducted NISCI trial. Patient data from the European Multicenter Study about Spinal Cord Injury (EMSCI, ClinicalTrials.gov Identifier: NCT01571531, www.emsci.org) were selected according to the NISCI inclusion criteria (age between 18 and 72 years; AIS of A, B, C or D; neurological level of injury between spinal segments C1 -C8 at time very acute; observed UEMS and SCIM at baseline, i.e. 2 -28 days post-injury, and follow-up, i.e. 150 -186 days post-injury). All patients gave written informed consent. In total, data on a subset of N = 350 female and male patients from EMSCI were obtained.
EMSCI tracks the functional and neurological recovery of patients during the first year after SCI. Patient data are collected in a highly standardized manner in 18 European centers during five specific time frames in reference to the day of injury: • 0 for null (very acute, baseline). 0 -15 days after the injury. • I for acute I. 16 -40 days after the injury.
• II for acute II. 70 -98 days after the injury.
• III for acute III. 150 -186 days after the injury.
• C for chronic. 300 -546 days after the injury.
The drug efficacy being tested in a clinical trial is often quantified as the change of primary outcome between baseline and follow up. For the proceeding analysis, we therefore only consider the timepoints null with superscript 0 and time point acute III with superscript III. The models were evaluated with UEMS [5] and SCIM sum (sub)scores [6,7] as outcomes. The UEMS total sum score at trial end point is abbreviated as M III . The symbols S III tot and S III sc denote the SCIM total sum and SCIM self-care subscore at time point acute III, respectively. Note that some of the subjects in the EMSCI database were assessed using SCIM-II [6], and others were assessed using SCIM-III [7]. However, for the purposes of our analysis, no distinction was made between the two versions of the measure because the self-care items of these measures are highly similar [7,23].

Enhanced proportional odds logistic regression (ePolr)
The enhanced proportional odds logistic regression (ePolr) model is an extension of the classical Polr model [20][21][22]. In the following, we elaborate the similarities of these two models as well as the enhanced properties of the ePolr model, i.e. the baseline-adjustment, stratification, and smoothing. For this, let y III be a categorical response measured on an ordinal scale with a considerably high amount of K categories: y III ∈ {1, 2, . . . , K}. Moreover, y 0 stands for the ordinal outcome variable measured at baseline, and x represents a univariate explanatory variable or, in the two-sample situation, x = 0 for the control and x = 1 for the active group. The models are formulated as models for conditional distribution functions: Both models use the inverse logit transformation / link function, also known as the expit transformation, to ensure interpretability of exp(β) as the conditional OR associated with a one unit increase in x → x + 1.
The ePolr model is part of the transformation model family which was recently introduced by Hothorn et al. in [24]. Following this, the unknowns in these models that have to be estimated from the data are the regression coefficient β and the conditional transformation function h, which incorporates baseline information y 0 . In contrast to earlier proposals of similar models for the analysis of ordinal outcomes with a predefined transformation function h [25], we estimate this function from the data.
The classical Polr model [20][21][22] is based on a discrete parametrization of h, which assigns one intercept parameter θ k , k ∈ {1, 2, . . . , K − 1}, to each possible outcome category without the possibility of stratifying with respect to baseline variables. For example in a SCI clinical trial, the number of required intercepts for the SCIM total sum score would be 100 and potentially exceeds the number of patients in such a trial. Although attempts to extend the classical model, and thus the discrete transformation function h, to many possible categories or even continuous outcomes have been reported [26], baseline adjustment is still a conceptual challenge. However, the incorporation of baseline information, such as y 0 , is very important in the context of clinical trials. To overcome this limitation, we introduce a smooth and parsimonious parametrization of the transformation h that allows for appropriate incorporation of baseline information as well as stratification by the variable strata: The estimation of such a model with a considerable high number of outcome categories K and in the presence of strata, the number of intercept parameters is equal to K − 1 times the number of strata. In such a situation, however, one does not expect abrupt jumps between intercepts of adjacent categories. Therefore, the replacement of category-specific intercepts with a smooth function h depending on a few parameters only was suggested by Parsons [13]. The latter contribution developed corresponding proportional odds models for repeated measurements using a generalised estimating equations (GEE) approach. Parsons' method features orthogonal polynomials of different degrees for h and requires to expand the data by the number of categories. Stratification, i.e. allowing different transformation functions h for different strata, would be conceptually possible if a further expansion of the data by the number of strata is feasible. Similar approaches for the estimation of conditional distribution functions by means of data expansion have been suggested in other contexts as well [27]. Here, we follow Hothorn et al. [24] and employ a model parameterization allowing model estimation and inference in the maximum likelihood framework, also in the presence of large K and potentially many strata. For this, we introduce a spline polynomial a(y) ϑ : R → R in terms of P K basis functions and parameterize the transformation function as h(y) = a( y ) ϑ, with y being the largest integer smaller or equal some real value y. A numerically attractive choice for such a spline is a Bernstein polynomial on the interval [ 1, K] [24]. Monotonicity of h can be ensured by a linear constraint on the P parameters ϑ = (ϑ 1 < · · · < ϑ P ). Stratas are incorporated by application of this basis to each stratum a(y) ⊗ (0, . . . , 1, . . . , 0), the second factor being a dummy coding of strata.
In the specific case of a stratified ePolr model, the likelihood contribution of a single observation with variable y 0 at baseline, y III at trial end and strata strata is Here, the transformation function h is stratified by the y 0 measurements taken at baseline and the stratification variable strata. For each stratum, we obtain a stratumspecific baseline log-odds function and a responsevarying effect of y 0 , the outcome at baseline. Note that all resulting strata have a direct impact on the conditional distribution and all moments, such as means, variances, skewness, and kurtosis, might vary depending on strata. Hence, the baseline-adjusted ePolr model allows for parametric prediction methods. The model parameters β and h are simultaneously estimated by maximum likelihood [24] with the help of the R package tram [28].

ePolr models for spinal cord injury clinical trials
For the evaluation of the suggested ePolr model, we specifically tailored three models M1-3 to SCI related clinical trial outcomes: • M1: UEMS sum score M III from 0 to 50, • M2: SCIM sum score S III tot from 0 to 100, • M3: SCIM self-care subscore S III sc from 0 to 20.
For the sake of simplicity, we considered a two-arm trial that compares a control (I(trt) = 0) to a novel treatment (I(trt) = 1). Each of the following models describes the conditional cumulative distribution function of the corresponding sum score as a function of baseline variables (mainly the baseline sum scores at time 0 as well as the number of segments below level of injury for the UEMS) and a global treatment effect β trt : As mentioned previously, the inverse logit transformation, also known as the expit transformation, ensures interpretability of the treatment effect exp(β trt ) as the conditional odds ratio that compares the treatment group and the control group given baseline variables. For the UEMS, this odds ratio is given by The variable #seg is defined as the number of left and right spinal segments below motor level that are at or more caudal than C5 and at or more rostral than T1. Stratification with respect to #seg is of utmost importance for UEMS, because the UEMS total score can only be considered as an ordinal variable when comparing patients with the same level of injury [8]. The parameter exp(β trt ) for the two SCIM models M2 and M3 can be interpreted in the same way. The unknowns in these models that have to be estimated from the patient trial data are the treatment effect β trt and the conditional transformation function h, which incorporates baseline information. In the specific case of model M1, the likelihood contribution of a treated patient with a UEMS total sum score m 0 at baseline, m III after six months and #seg segments below motor level is The specific choice of the variables considered for model M1 reflects that recovery in the upper extremities of patients with cervical SCI is not only dependent on the severity of the injury (as measured by the baseline UEMS) but also on the injury level [29,30]. Hence, the transformation function h in model M1 is stratified by the UEMS total sum score measurements observed at baseline, m 0 , and the number of tested segments below motor level, #seg, representing the injury level. The extracted data set from the EMSCI database has observations with four to 10 segments below motor level #seg that are at or more caudal than C5 and at or more rostral than T1. In addition, the UEMS at baseline ranges between 0 and 28. Consequently, we defined three strata ([ 0, 6], [ 7,8], [ 9,10]) for #seg and estimated a stratum-specific response-varying effect in the UEMS baseline values. More specifically, the spline basis reads a(y) 8], and a(y) × (0, 0, 1) × (1, m 0 ) for the last stratum. For each stratum, we obtain a stratum-specific baseline log-odds function and a response-varying effect of m 0 , the outcome at baseline. For our empirical experiments, we chose seven parameters for the Bernstein polynomial parametrization [24] on the interval [ 0, 50] and thus estimated a total of 7 × 3 × 2 = 42 parameters for the conditional transformation function h. Due to the monotonicity constraint, the effective degrees of freedom for h is, however, much smaller than the corresponding number of parameters.
The estimated baseline-adjusted ePolr models allow for parametric prediction methods such that a direct comparison between treated and untreated patients is straightforward. The model parameters are simultaneously estimated by maximum likelihood [24]. Concurrently, the OR exp(β trt ), along with a variability assessment and thus a confidence interval, is directly estimated by the model. Corresponding p-values for testing the two-sided null hypothesis of no treatment effect (β trt = 0) rely on the asymptotic normality of the maximum likelihood estimator. Hereafter, we will refer to the Wald test as the asymptotic ePolr test, short for asymptotic enhanced baselineadjusted proportional odds regression coefficient test.
For small samples, however, the asymptotic ePolr test for the treatment effect β trt might be too liberal because of lack of approximation accuracy. We therefore in addition applied a model-based permutation test [31,32] for β trt = 0. A general theory for such permutation tests is available in [33,34] and can be directly applied to the context studied here [35]. The null hypothesis of no treatment effect (β trt = 0) implies that the distribution of the model scores for the treatment effect β trt in models M1-3 is the same in the control and treatment group. For this, we first estimated the intercept-only models based on model M1, M2, and M3 and extracted the single score contributions S i of each observation i for each of the models. More specifically, the score S i is defined as the derivative of the log-likelihood contribution of the ith subject under the null of an absent treatment effect The resulting test statistics is the sum of the score contributions S i of the observations allocated to the treatment group. There is a strong connection to the Wilcoxon rank sum test: Without stratification and application of the transformation function . . , N, the scores S i are the Wilcoxon scores [36]. The fact that both the permutation test suggested here as well as the Wilcoxon rank sum test can be derived as score tests in a proportional odds model highlights their similarity and explains why one can expect both procedures to be especially powerful against proportional odds alternatives.
The conditional null distribution of the test statistic is approximated by 10,000 permutations that rearrange the labels for arm allocation and then are used to obtain the permutation p-value. Hereafter, we will refer to this model-based permutation test of the score contributions as permutated ePolr test.

Sample size and statistical power calculation
Based on the previously defined models M1-3, we introduce a procedure that allows for a simulation-based assessment of an appropriate number of patients to be enrolled in a trial. In addition, we used the simulations to compare the power of our model-based inference method to the power of conventional test procedures for total sum scores.
Our simulation of the placebo-controlled randomized clinical trials with two arms based on the EMSCI data set has the following specific levels of experimental conditions: The combinations of these five trial sample sizes and nine possible treatment effects result in 45 different trial scenarios. For each scenario, we sampled N EMSCI trial participants and randomly allocated them to control or treatment groups following a 1:1 allocation scheme. By restricting the treatment effect to β trt = 0 in the models M1-3, one describes the spontaneous neurological recovery for patients under standard of care, i.e. patients in the control group. In the first step, the models, and thus the stratified transformation functions h, were estimated from treatment-naive EMSCI patient data with spontaneous neurological recovery patterns under standard of care. β trt is constantly set to 0 for participants in the control group. In the second step, the outcomes m III , s III tot , s III sc for the treated patients were simulated by a standard parametric bootstrap procedure sampling new outcomes under the specific postulated treatment effects exp(β trt ) ([OR]): 1, 1.25, 1.5, 1.75, 2, . . . , 3. The baseline measurements at time point 0 (m 0 , s 0 tot , s 0 sc ) were left unchanged and hence are equivalent to the original observations from the EMSCI data.
Concurrently, a battery of five different significance tests for testing H 0 : β trt = 0 were applied to each simulated trial, and subsequently the statistical power was evaluated. The statistical power P(reject H 0 | β trt > 0) was estimated as the fraction of significant p-values for treatment effect β trt at the nominal level α = 0.05 among the 15,000 replications of the 45 different trial scenarios.
As noted above, we evaluated the new method by applying it to three different outcomes: UEMS total sum scores m III , SCIM total sum scores s III tot , and SCIM selfcare subscore s III sc . Subsequently, the treatment effects for these outcomes were tested with the following statistical tests: • t-test comparing the difference of the total sum score at time III and the total sum score at time 0 between the two treatment groups • Wilcoxon rank sum test • ANCOVA comparing the difference between the two treatment groups at time III while adjusting each patient's follow-up score for his or her baseline score at time 0 [37] • Asymptotic ePolr test based on models M1, M2, and M3 • Permutated ePolr test based on models M1, M2, and M3

Alternative analysis of the sygen ® trial
As a concrete application of model M1, we reanalyzed a data subset of the Sygen ® trial; the study design is explained in detail elsewhere [38][39][40]. Briefly, in this trial, N = 760 SCI participants from 28 centers in North America were recruited over a 5-year time period between 1992 and 1997. The primary outcome was a dichotomized variable derived from an ordinal scale representing the overall neurological status of a patient (see [39] for the exact definition). Subsequently, the primary outcome was analyzed by means of binary logistic regression. Another reanalysis of a data subsample of the study was recently published [8]. For this, the authors redefined the single-score UEMS as the primary outcome of the trial and applied the autoregressive transitional ordinal model to test for treatment effect. Following this, we also redefined the primary trial outcome. Based on model M1, we used the UEMS total sum score m III observed at time acute III as the primary trial outcome. The alternative analysis of the trial subsample data set specifically concentrates on the UEMS total sum score observations because the original data collection did not incorporate the SCIM measurements.
Following our simulation study, we extracted the patient data from the Sygen ® trial according to the previously mentioned inclusion criteria of the NISCI trial. The original Sygen ® trial had two treatment doses at the beginning; the higher dosage was abandoned during the study. Consequently, only patients treated with the lower dosage were considered here. The resulting subsample of N = 335 patients are from 26 different study centers.
This analysis is intended to give a practical example of an application of the enhanced baseline-adjusted proportional odds model and should not be considered as a definitive conclusion about the quantification of treatment effect or the outcome of the Sygen ® trial in general.

Randomized clinical trial simulation
The purpose of the simulation study was to compare our novel method with conventional analysis methods, such as the t-test, Wilcoxon rank sum test, and ANCOVA. Note that the simulation setting as well as the results reported are based on ordinal outcome measurements specifically tailored for clinical assessment and classification of SCI patients. However, the model setup is generalizable to any type of ordinal response measure with a considerably high number of outcome categories.
Tables 1, 2, and 3 report the estimated statistical power of models M1, M2, and M3, respectively. The newly introduced permutated ePolr test outperformed conventional analysis methods in every simulation setup. Especially in trial setups with relatively small sample sizes, the results of the significance test based on the baseline-adjusted proportional odds regression model had clear advantages with regards to power, e.g. in detecting significant treatment effects. In all three tables, the estimated statistical power of the asymptotic ePolr test and the permutated ePolr test is consistently higher than the estimates from conventional approaches, i.e. the t-test, Wilcoxon rank sum test, and ANCOVA. However, for extreme experimental conditions, e.g. large postulated treatment effects such as exp(β trt ) = 2.75, the differences among the approaches diminished.
When the treatment had no effect (β trt = 0), the nominal level of 0.05 was well preserved by almost all trial settings. However, we noted that the asymptotic ePolr test for all three models M1-3 was relatively liberal. In other words, more than 5% of the two-sided null hypotheses were rejected. This is due to the inadequately asymptotic normality approximation of the maximum likelihood estimator for small sample sizes. The model-based permutated ePolr test addressed exactly this challenge and successfully detected the expected effects within the simulation study, especially in trials with small sample sizes. Hence, the nominal level of 0.05 was well maintained by the permutated ePolr test. For the reporting and evaluation of the results, we therefore concentrated in particular on the comparison of the permutated ePolr test with the t-test, the Wilcoxon sum rank test, and ANCOVA.
In the following paragraphs, we will point out the most interesting findings based on the common goal of clinical trials to reach a statistical power of 80%. For the detailed simulation study results, see Tables 1, 2, and 3.

Upper extremity motor score (UEMS)
The application of the permutated ePolr test to model M1 clearly profited from the incorporation of the variable #seg and baseline-adjusted effect of the UEMS total sum score measured at time very acute. Accordingly, the estimated power of the permutated ePolr test was consistently higher than the estimated power of the t-test, Wilcoxon rank sum test, and ANCOVA. Thus, the increased power can be explained by the incorporation of baseline information and level of injury as strata in ePolr models. Given

Spinal cord Independence measure (SCIM)
Models M2 and M3 also profited from the baselineadjusted method. The estimated power also increased for the stratification based model M2 with SCIM total sum score and model M3 SCIM self-care subscore measurements. For model M2, the statistical power of 80% was first reached for the permutated ePolr test in the trial setup where the postulated treatment effect was exp(β trt ) = 2 and where there were at least N ≥ 200 recruited trial participants.
The analyses based on model M3 profited least from the new permutated ePolr test compared to the application with models M1 and M2. However, the statistical power of the simulation study in Table 3, still was slightly higher than the power of conventional analysis methods.

Alternative analysis of the sygen® trial
We reanalyzed the Sygen® trial by testing for treatment effect on the UEMS total sum score. We applied all five approaches of the above-mentioned simulation study. None of the five approaches yielded significant results at the nominal level of 5%. The detailed results of the five analysis approaches are reported here. For the asymptotic and permutated Polr tests, we also report the results from additional stratification by study center.  These results can by no means be generalized owing to the strongly selected subsample of patients considered, the different outcomes analyzed, and the different scopes of analysis. Nonetheless, the estimated ORs from the enhanced proportional odds models showed a tendency towards positive effect of treatment on the UEMS, which means that treated patients had on average a slightly better recovery than patients in the treatment-naive group. Especially for the baseline-adjusted proportional odds model (asymptotic ePolr test), the result implies that the odds of a participant in the treatment group of achieving up to a given motor score were exp(β trt ) = 1.142 times the odds of a participant with similar characteristics in the control arm. This indicates a slightly better recovery of treated patients with a p-value considerably smaller than the p-values obtained from the t-or Wilcoxon-tests, and ANCOVA. Nevertheless, we still believe that generalizations of our results to the overall validity of the Sygen ® trial and its compound cannot be drawn.

Randomized clinical trial simulation
As outlined above, conventional approaches of testing for treatment effects in two-armed randomized clinical trials with ordinal sum score outcomes face some limitations. Our novel method based on the baseline-adjusted proportional odds model estimates a global treatment effect conditioning on the total sum score, while allowing for baseline adjustment and confidence interval-based inference. We exemplified the method by applying the models to ordinal outcomes of a future neurological clinical trial. However, the model setup is generalizable to any type of ordinal response measure with a considerably high number of non-reducible outcome categories. A direct comparison of our proposed method and the routinely employed analysis methods based on statistical power showed that the permutated ePolr test has higher statistical power in every postulated trial setting. However, the differences in statistical power of the modelbased methods and conventional approaches are not as large for the SCIM total sum score model M2 and for the SCIM self-care subscore model M3 as for the UEMS total sum score model M1. Based on discussions with neurologists, this is likely explained by the influence of other factors, such as concomitant damage to other body tissues, inflammation or circulator disturbances. These factors can diminish the performance of SCIM activity items, especially at very acute assessment times. Hence, we conclude that in terms of neurological recovery, a baseline adjustment is more advantageous for the analysis of the UEMS total sum score than for the SCIM total sum score and self-care subscore.
Although the asymptotic ePolr test might be relatively liberal, in a clinical setting, it has the clear advantage of confidence interval-based inference. This significance test not only gives a p-value for the randomized clinical trial group comparison, but also quantifies the difference and provides a model-based confidence interval.
It was recently pointed out [41] that the Wilcoxon rank sum test can be understood as the permutation score test in a Polr model. Consequently, the Wilcoxon test is particularly powerful against OR alternatives. In contrast to the t-test, which is powerful against shift alternatives, the Wilcoxon rank sum test implicitly focuses on the same OR parameter exp(β trt ) that our models M1-3 are built upon. We can thus conclude that the superior power demonstrated in the simulation experiments arises from the incorporation of baseline information by stratification and not from different parametrizations or distributional assumptions. This conclusion is also supported by [42], stating that stratum variables should be considered in design and analysis when outcome categories are heavily dependent on membership of some stratum.
Instead of comparing outcome scores obtained at a specific time point, the score trajectory over time could be modelled, with a treatment effect parameter distinguishing between the two groups. The procedure described by Parsons [13] directly targets such a situation by applying a marginal proportional odds models for repeated measurements where the baseline log-odds function h is parameterised by orthogonal polynomials and parameter inference relies on GEEs. An alternative way of approaching this problem could be motivated by marginally interpretable transformation models for longitudinal observations [43], where stratum-specific baseline log-odds functions h could be parameterised as described here. The latter approach allows for simple maximum-likelihood inference and a head-to-head comparison of both approaches remains an interesting topic for future research.

Alternative analysis of the sygen® trial
The alternative analysis of a subsample of patients from the Sygen® trial underlined the straightforward implementation of our method. For this specific example, our model-based method enhanced the conventional analysis approaches of UEMS total sum scores by allowing for stratification based on the UEMS total sum score at baseline as well as baseline adjustment by #seg in terms of the patient's individual motor level. The confidence interval-based inference as part of the reported results of the asymptotic ePolr test is a clear advantage of the new method. We no longer only report a p-value as a result of the significance test but also a confidence interval of the quantified global treatment effect β trt . The treatment effect exp(β trt ) directly estimated from the model has the clear clinical interpretation as an OR that compares the treatment group with the control group.

Computational details
All computations were performed using R version 3.6.1 [44]. The tram R add-on package [28] was used to estimate the ePolr models. The underlying statistical theory is described in [24]. A blueprint for the estimation of the conditional distribution functions of the UEMS at time acute III (UEMS3) for the treatment group and control group is shown below. Data are available in a data frame SCI_OR_3 with variables UEMS at time very acute (UEMS0, numeric) and acute III (UEMS3, numeric), #seg (factor), and trtmt (binary: 0 for control group and 1 for treatment group). The factor variable #seg is defined as the number of left and right spinal segments below motor level that are at or more caudal than C5 and at or more rostral than T1 in three strata ([ 0, 6], [ 7,8], [ 9,10]).