Framework for personalized prediction of treatment response in relapsing remitting multiple sclerosis

Stühler, E.; Braune, S.; Lionetto, F.; Heer, Y.; Jules, E.; Westermann, C.; Bergmann, A.; van Hövell, P.

doi:10.1186/s12874-020-0906-6

Research article
Open access
Published: 07 February 2020

E. Stühler¹,
S. Braune²,
F. Lionetto¹,
Y. Heer¹,
E. Jules¹,
C. Westermann¹,
A. Bergmann²,
P. van Hövell¹ &
NeuroTransData Study Group

BMC Medical Research Methodology volume 20, Article number: 24 (2020)

Abstract

Background

Personalized healthcare promises to successfully advance the treatment of heterogeneous neurological disorders such as relapsing remitting multiple sclerosis by addressing the caveats of traditional healthcare. This study presents a framework for personalized prediction of treatment response based on real-world data from the NeuroTransData network.

Methods

A framework for personalized prediction of response to various treatments currently available for relapsing remitting multiple sclerosis patients was proposed. Two indicators of therapy effectiveness were used: number of relapses, and confirmed disability progression. The following steps were performed: (1) Data preprocessing and selection of predictors according to quality and inclusion criteria; (2) Implementation of hierarchical Bayesian generalized linear models for estimating treatment response; (3) Validation of the resulting predictive models based on several performance measures and routines, together with additional analyses that focus on evaluating the usability in clinical practice, such as comparing predicted treatment response with the empirically observed course of multiple sclerosis for different adherence profiles.

Results

The results revealed that the predictive models provide robust and accurate predictions and generalize to new patients and clinical sites. Three different out-of-sample validation schemes (10-fold cross-validation, leave-one-site-out cross-validation, and excluding a test set) were employed to assess generalizability based on three different statistical performance measures (mean squared error, Harrell’s concordance statistic, and negative log-likelihood). Sensitivity to different choices of the priors, to the characteristics of the underlying patient population, and to the sample size were assessed. Finally, it was shown that model predictions are clinically meaningful.

Conclusions

Applying personalized predictive models in relapsing remitting multiple sclerosis patients is still new territory that is rapidly evolving and has many challenges. The proposed framework addresses the following challenges: robustness and accuracy of the predictions, generalizability to new patients and clinical sites, and comparability of the predicted effectiveness of different therapies. The methodological and clinical soundness of the results forms the basis for future support of patients and doctors when the current treatment is not generating the desired effect and they are considering a therapy switch.

Graphical abstract

(A) The framework is developed using quality-proven real-world data of patients with relapsing remitting multiple sclerosis. Patients have heterogeneous individual characteristics and diverse disease profiles, indicated for example by variations in frequency of relapses and degree of disability. Longitudinal characteristics regarding disease history (e.g. number of previous relapses in the last 12 months) are extracted at the time of an intended therapy switch, i.e. at time point “Today” (left). All clinical parameters are captured in a standardized way (right). (B) The model predicts the course of the disease based on the observed data (panel A), and is able to account for the impact of various available therapies on chosen clinical endpoints. The resulting ranking of therapies depends on patient characteristics, illustrated here by a different highest ranked therapy depending on the number of relapses in the previous 12 months. (C) The model is evaluated for various generalization properties. Compared to performance on the training set (gray), it is able to predict for new patients not part of the training set (red). Top: Prediction for new patients. Middle: Prediction for new clinical sites. Bottom: Prediction for different time windows. (D) In order to assess the clinical impact of the model, disease activity is compared between patients treated with the highest ranked therapy and those treated with any of the other therapies. Patients adhering to the highest ranked therapy are associated with a better disease outcome when compared to those who did not.

Background

Knowledge about the effectiveness of available treatments is typically based on results from randomized controlled trials (RCTs). However, these results are derived in a controlled, constrained setting and do not necessarily reflect real-world patient populations and drug labels. In addition, results from RCTs are based on group-level differences, while treatment decisions require individual-level information to enable optimal treatment allocation for an individual patient. This can cause a ‘trial-and-error paradigm of treatment allocation’ [1], in which the individual patient undergoes several therapy switches until a suitable treatment is found. Real-world data are gaining increasing importance to fill this gap between RCTs and utilization of treatment options in daily practice. This is also why the European Medicines Agency and the Food and Drug Administration in the USA see the availability of real-world data as an emerging opportunity to improve treatment quality and allocation of resources [2,3,4,5,6]. In the field of multiple sclerosis (MS), several registries have captured qualified clinical data for more than 10 years, including MSBase (founded in 2004, multinational), OFSEP (2001, France), Swedish MS registry (2001, Sweden), MSDS3D (2010, Germany), and NeuroTransData (NTD; 2008, Germany).

MS is the most prevalent neurological auto-immune disorder of the central nervous system and affects patients in the most dynamic and productive time of their lives by causing severe physical disability and mental handicap, thus impairing abilities for social and professional participation over time in the majority of those affected. The treatment landscape of MS is continually changing: since 2000, more than 900 clinical studies have been listed in the clinical trials registry created by the National Institutes of Health alone. Following the first Interferon-β1b injectable therapy in 1995, 14 different disease modifying therapies (DMTs) based on eight approved compounds had become available in the EU by 2017 for the relapsing remitting form of multiple sclerosis (RRMS) [2].

Disease mechanisms and course of RRMS are heterogeneous and challenging to predict at a group level, and even more so in an individual patient. This heterogeneity, the impact of clinical outcomes on quality of life, and the large number of treatment options make a personalized approach to tailoring treatment desirable.

This study builds on previous research on personalized medicine [5,6,7,8,9,10]. It contributes to the ongoing advancement in the personalization of RRMS treatment allocation by proposing a framework for comparing the effectiveness of DMTs at patient level. Two indicators of treatment effectiveness are taken into account: the number of on-therapy relapses experienced by the patient, and the occurrence of a confirmed disability progression (CDP) during the therapy. The statistical approach relies on hierarchical Bayesian generalized linear models (GLMs).

This study is based on the NTD MS registry, where physicians in Germany capture quality-proven real-world data in a large number of different clinical sites and for heterogeneous patients and disease histories. This provides the basis for the generalization of the predictive models to a wide range of patients and clinical sites that were not part of the model development. In addition, the methodological and clinical soundness of the results of this framework is evaluated by comparing the predicted treatment response with the clinically observed course of MS for different adherence profiles.

Applying personalized predictive models in RRMS patients is still a new territory that is rapidly evolving and has many challenges. The objective of this study is to address some of these challenges by providing robust and reliable predictions of treatment response based on real-world data. This builds the basis for a future support of patients and doctors when the ongoing treatment fails and a therapy switch is considered.

Methods

This study proposes a framework for personalized prediction of effectiveness of various therapies currently available for RRMS patients. The framework comprises the following steps: (i) data preprocessing, selection of predictors according to quality and inclusion criteria, and definition of two indicators of therapy effectiveness (Section Methods: Data); (ii) model development using hierarchical Bayesian GLMs for estimating therapy response according to both indicators of therapy effectiveness (Section Methods: Model development); (iii) model performance assessment based on state-of-the-art performance measures and procedures, together with additional analyses that focus on applicability in clinical practice, i.e. on generalizability to new data and on comparability of predictions for different treatments (Section Methods: Model performance assessment).

Data

This study employed clinical real-world data recorded in the NTD MS registry. NTD is a Germany-wide network of physicians in the fields of neurology and psychiatry that was founded in 2008. Currently, 153 neurologists in 78 offices work in NTD practices serving about 600,000 outpatients per year. Each practice is certified according to network-specific and ISO 9001 criteria. Compliance with these criteria is audited annually by an external certified audit organization. The NTD MS registry includes about 25,000 patients with MS, which represents about 15% of all MS patients in Germany. In this database, demographic and clinical parameters are captured in real time over an average of 3.7 visits and Expanded Disability Status Scale (EDSS) assessments per year per patient. Standardized clinical assessments of functional system scores and EDSS calculation are performed by certified raters (http://www.neurostatus.net). All personnel undergo regular training to ensure quality of data in the database. This quality is monitored by the NTD data management team. Data input is also checked for inconsistencies and errors by an error analysis program. Both automatic and manually executed queries are implemented to further ensure data quality, e.g. checks for inconsistencies and requests for missing information. All data are pseudonymized and pooled to form the NTD MS database. The codes uniquely identifying patients are managed by the Institute for Medical Information Processing, Biometry and Epidemiology (Institut für medizinische Informationsverarbeitung, Biometrie und Epidemiologie, IBE) at the Ludwig Maximilian University in Munich, Germany, acting as an external trust center. The data acquisition protocol described above was approved by the ethical committee of the Bavarian Medical Board (Bayerische Landesärztekammer; June 14, 2012) and re-approved by the ethical committee of the Medical Board North-Rhine (Ärztekammer Nordrhein; April 25, 2017).

For this study, data were extracted from the NTD MS database on July 1, 2018.

Predictors

The predictors that were used to model therapy effectiveness are listed in Table 1. An overview of their distribution and discretization is provided in Additional file 1: Table S1.3. All predictors were defined and selected based on prior scientific research and clinical expertise (SB, AB) [11].

Table 1 List of model predictors, along with code names for shorter reference across the study

Data quality and inclusion criteria

The data used for model development consist of therapy cycles, i.e. each observation corresponds to a therapy cycle. Several quality and inclusion criteria were applied for data preprocessing and patient population selection, respectively.

The selected observations fulfilled quality criteria related to validity, accuracy, completeness, consistency and uniformity of the information in the database. These included the following: all predictors were available at the start of the index therapy; at least one relapse was documented before the start of the index therapy; and at least one EDSS measurement was documented before the start of the index therapy. Extreme therapy cycles with an annualized relapse rate above 12 per year were excluded from the study.

The following inclusion criteria were applied: patients were required to be at least 18 years old; EDSS before the start of the index therapy was required to be less than or equal to six; and the index therapy was required to be one of the following: Dimethyl fumarate (DMF), Fingolimod (FTY), Glatiramer acetate (GA), Interferon-β1 (IF), Natalizumab (NA) or Teriflunomide (TERI). Therapies that were prescribed within 6 months of MS diagnosis without a previous treatment failure were excluded, as they do not represent therapy switches during the course of RRMS. If more than one therapy cycle was available for a single patient, one therapy cycle was randomly selected and the others were discarded. Therapy cycles corresponding to clinical sites with only one remaining patient were also excluded.
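The selection logic can be illustrated with a minimal Python sketch. It covers only a subset of the criteria (age, EDSS, index therapy, annualized relapse rate, one cycle per patient, and sites with a single remaining patient); the `TherapyCycle` record, its field names and the function names are hypothetical and are not taken from the NTD database or the original analysis code.

```python
import random
from dataclasses import dataclass

# Therapy codes as abbreviated in the text.
ALLOWED_THERAPIES = {"DMF", "FTY", "GA", "IF", "NA", "TERI"}


@dataclass(frozen=True)
class TherapyCycle:
    # Hypothetical record layout, for illustration only.
    patient_id: str
    site_id: str
    therapy: str
    age_at_start: float
    edss_at_start: float
    annualized_relapse_rate: float


def is_eligible(cycle):
    """Apply the subset of quality and inclusion criteria sketched here."""
    return (
        cycle.age_at_start >= 18
        and cycle.edss_at_start <= 6.0
        and cycle.therapy in ALLOWED_THERAPIES
        and cycle.annualized_relapse_rate <= 12
    )


def select_cycles(cycles, seed=0):
    """Keep one random eligible cycle per patient, then drop single-patient sites."""
    rng = random.Random(seed)
    by_patient = {}
    for cycle in cycles:
        if is_eligible(cycle):
            by_patient.setdefault(cycle.patient_id, []).append(cycle)
    chosen = [rng.choice(candidates) for candidates in by_patient.values()]

    patients_per_site = {}
    for cycle in chosen:
        patients_per_site[cycle.site_id] = patients_per_site.get(cycle.site_id, 0) + 1
    return [c for c in chosen if patients_per_site[c.site_id] > 1]
```

The fixed seed makes the random choice of one cycle per patient reproducible, which is convenient when the selection has to be repeated.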

After the quality and inclusion criteria were applied, 90% of the therapy cycles (3119) were used for model development and validation, and 10% of the therapy cycles (314) were used as a test set for validation as described below in Section Methods: Model performance assessment: Model generalizability.

A detailed overview of the data selection process is shown in Additional file 1: Table S1.1. A comparison of the responses of interest and of the predictors before and after the quality and inclusion criteria were applied is presented in Additional file 1: Tables S1.2, S1.3 and Figure S1.3.

Indicators of therapy effectiveness

Number of relapses and confirmed disability progression (CDP) during the observation time of a therapy cycle are established clinical indicators for therapy effectiveness in RRMS [7, 8]. Both indicators were therefore used to measure the effectiveness of index therapies. Additionally, probabilities of being free of relapse and free of CDP, respectively, were derived for validation [9].

A CDP was defined as a worsening in EDSS of at least 1.0 point when the previous EDSS is 5.5 or lower and of at least 0.5 points otherwise; the worsening must be sustained for at least 3 months and must be confirmed by at least one other valid EDSS measurement. A more detailed definition of CDP is included in Additional file 2.
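A simplified reading of this rule is sketched below in Python. The sketch assumes that the first measurement of the series serves as the reference EDSS, that 3 months correspond to 90 days, and that "sustained" means no intermediate measurement falls below the worsening threshold. These are assumptions made for illustration; the authoritative definition is the one in Additional file 2, and the function names are hypothetical.

```python
def cdp_threshold(reference_edss):
    """Minimum EDSS increase that counts as a worsening."""
    return 1.0 if reference_edss <= 5.5 else 0.5


def has_cdp(measurements, confirm_days=90):
    """Return True if a confirmed disability progression is found.

    measurements: list of (day, edss) tuples sorted by day; the first
    entry is taken as the reference EDSS (an assumption of this sketch).
    """
    if len(measurements) < 3:
        # Reference, worsening and confirmation need three measurements.
        return False
    reference = measurements[0][1]
    required = reference + cdp_threshold(reference)
    for i in range(1, len(measurements)):
        day_i, edss_i = measurements[i]
        if edss_i < required:
            continue
        # Look for a later measurement that confirms the worsening.
        for day_j, edss_j in measurements[i + 1:]:
            if edss_j < required:
                break  # worsening was not sustained
            if day_j - day_i >= confirm_days:
                return True
    return False
```

For example, a patient with EDSS 2.0 at the reference visit who is measured at 3.0 on day 100 and again at 3.0 on day 200 would be flagged, whereas a return to 2.0 in between would prevent confirmation.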

Model development

The number of on-therapy relapses was modelled as following a negative binomial distribution whose mean and shape parameters depend on individual patient characteristics and index therapy. The occurrence of a CDP was modelled as following a binomial distribution, where the probability of observing a CDP depends on individual patient characteristics and index therapy.

In both cases, a hierarchical Bayesian GLM was employed [12]. The correlation that typically arises between measurements coming from the same clinical site was addressed by modeling a random intercept. The duration of each observed therapy cycle was incorporated in the models as an offset term, since the number of relapses and the probability of observing a CDP are expected to be larger for longer exposure, i.e. observation time of index therapy. A detailed description of the models is presented in Additional file 3.
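As an illustration only, the mean structure of the relapse model can be written as follows, assuming a log link (the usual choice for negative binomial GLMs; the exact specification is given in Additional file 3). The symbols $\mu_{ij}$, $t_{ij}$, $\mathbf{x}_{ij}$, $\boldsymbol{\beta}$, $b_j$ and $\sigma_b$ are notation introduced here and do not appear in the original text:

$$ \log \mu_{ij} = \log t_{ij} + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + b_j, \qquad b_j \sim \mathcal{N}\left(0, \sigma_b^2\right) $$

Here $\mu_{ij}$ denotes the expected number of on-therapy relapses for therapy cycle $i$ at clinical site $j$, $t_{ij}$ the duration of the cycle, $\mathbf{x}_{ij}$ the vector of predictors (patient characteristics, index therapy and their interactions), and $b_j$ the site-specific random intercept. Because $\log t_{ij}$ enters with a fixed coefficient of one, the expected count scales proportionally with the observation time.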

In this study, Bayesian estimation was used due to the advantages offered by the possibility to incorporate prior information, which also allows for regularization [13]. The specific values that were given to the parameters’ priors are summarized in Table 2. These priors are weakly informative, in line with the values proposed by [14]. The parameters are assumed to be independent.

Table 2 Default priors assigned to the relapse and CDP models’ parameters

Models were fitted with version 2.14.1 of the rstanarm package in R [13]. This implementation uses the Hamiltonian Monte Carlo approach to draw samples from the parameters’ joint posterior distribution. For each Markov chain started for this purpose, the convergence to the target distribution was assessed using the Gelman and Rubin potential scale reduction statistic $ \hat{R} $ [13]. For each of the samples and for each of the six considered index therapies (Section Methods: Data: Data quality and inclusion criteria), the number of relapses or the occurrence of a CDP was predicted for each observation by disregarding clinic-specific random effects, i.e. by setting all random intercepts to zero (in rstanarm, this is done by setting the ‘re.form’ argument of the posterior_predict function to ‘~ 0’). A new patient will thereby have consistent predictions across different clinical sites. These predictions obtained from the posterior distribution were summarized by looking (i) at their average and at the fraction of those that predict an absence of relapse (relapse model), and (ii) at the fraction of those that predict an absence of CDP (CDP model).
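The summary step can be sketched in a few lines of Python. The sketch assumes that the posterior predictive draws for one patient and one therapy are already available as a list of simulated outcomes (relapse counts, or 0/1 indicators for CDP); the function names and dictionary keys are hypothetical.

```python
from statistics import mean


def summarize_relapse_draws(draws):
    """Summarize simulated relapse counts for one patient and one therapy."""
    return {
        # Average predicted number of relapses over all posterior draws.
        "mean_relapses": mean(draws),
        # Fraction of draws that predict no relapse at all.
        "p_relapse_free": sum(d == 0 for d in draws) / len(draws),
    }


def summarize_cdp_draws(draws):
    """Summarize simulated CDP indicators (1 = CDP occurred, 0 = no CDP)."""
    return {"p_cdp_free": sum(d == 0 for d in draws) / len(draws)}
```

With the draws `[0, 0, 1, 3]`, for instance, the predicted mean is one relapse and the predicted probability of being relapse-free is 0.5.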

Model performance assessment

This section covers the following: model calibration; statistical measures of model performance; model generalizability; comparison with nested models of lower complexity; sensitivity of the models to different choices of the priors, to the characteristics of the patient population, and to the sample size; and comparison of the treatment effectiveness predicted by the models.

Model calibration

The agreement between the predicted and observed outcomes was assessed by distributing the therapy cycles into several bins of predicted outcomes. The bin size was chosen such that there are 20 equally-populated bins in total, covering the full range of the predicted outcomes. For each bin, the mean predicted outcome was compared with the mean observed outcome. If the model is well-calibrated, the two quantities are expected to be close to each other. The agreement between predictions and observations was studied for all therapy cycles and also for each DMT separately. The adoption of equally-populated bins rather than equally-sized bins has the advantage that the statistical uncertainty due to the population size of each bin is the same for all points in the calibration plot.
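The binning procedure can be sketched as follows in Python. The function name and the return layout are hypothetical; the sketch sorts the therapy cycles by predicted outcome, splits them into equally populated bins, and returns the mean predicted and mean observed outcome per bin.

```python
def calibration_bins(predicted, observed, n_bins=20):
    """Return (mean_predicted, mean_observed, bin_size) per equally populated bin."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    # Sort the cycles by predicted outcome so that each bin covers a
    # contiguous range of predictions.
    pairs = sorted(zip(predicted, observed), key=lambda po: po[0])
    n = len(pairs)
    bins = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # fewer observations than bins
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        mean_obs = sum(o for _, o in chunk) / len(chunk)
        bins.append((mean_pred, mean_obs, len(chunk)))
    return bins
```

For a well-calibrated model, plotting the second element against the first should give points close to the diagonal.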

Statistical measures of model performance

Model performance was evaluated via mean squared error (MSE), negative log-likelihood and Harrell’s concordance statistic (C-Index).

The C-Index was used to analyze the ability of the models to discriminate among different responses, in this case to discriminate between none and at least one relapse, and between the occurrence and absence of a CDP, respectively. When comparing predicted and observed indicators of therapy effectiveness, therapy cycles with roughly the same duration were matched. This is achieved by allowing for up to 6 months difference if the smaller of the two durations is less than half a year, and up to 12 months difference otherwise [15].
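A pairwise concordance computation with this duration matching can be sketched in Python as shown below. The sketch assumes a binary outcome (e.g. at least one relapse versus none), durations expressed in months, and ties in the predicted score counted as one half; the function names are hypothetical and the code is not the implementation used in the study.

```python
from itertools import combinations


def durations_match(d1, d2):
    """Two cycles are comparable if their durations (in months) are close."""
    allowed = 6.0 if min(d1, d2) < 6.0 else 12.0
    return abs(d1 - d2) <= allowed


def c_index(risk, event, duration_months):
    """Concordance between a predicted risk score and a binary outcome."""
    concordant = 0.0
    usable = 0
    for i, j in combinations(range(len(risk)), 2):
        if event[i] == event[j]:
            continue  # pairs with identical outcomes carry no information
        if not durations_match(duration_months[i], duration_months[j]):
            continue  # only compare cycles of similar duration
        usable += 1
        with_event, without_event = (i, j) if event[i] > event[j] else (j, i)
        if risk[with_event] > risk[without_event]:
            concordant += 1.0
        elif risk[with_event] == risk[without_event]:
            concordant += 0.5
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable
```

A value of 0.5 corresponds to random ordering, and a value of 1 to perfect discrimination among the usable pairs.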

The negative log-likelihood per patient was obtained following the approach in [12], page 169, and using the log_lik function of the rstanarm package [13]. The negative log-likelihood for the full patient population was obtained by summing the negative log-likelihoods per patient.
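One common way to aggregate a matrix of pointwise log-likelihood draws, such as the one returned by `log_lik`, is to take the negative logarithm of the posterior mean likelihood per patient and then sum over patients. The Python sketch below assumes this aggregation, which may differ in detail from the one used in the study; the function names are hypothetical.

```python
import math


def neg_log_lik_per_patient(log_lik_draws):
    """log_lik_draws: S rows (posterior draws), each with N pointwise log-likelihoods."""
    n_draws = len(log_lik_draws)
    n_obs = len(log_lik_draws[0])
    result = []
    for i in range(n_obs):
        column = [row[i] for row in log_lik_draws]
        # Log-sum-exp trick: log of the mean likelihood without underflow.
        largest = max(column)
        log_mean = (
            largest
            + math.log(sum(math.exp(v - largest) for v in column))
            - math.log(n_draws)
        )
        result.append(-log_mean)
    return result


def total_neg_log_lik(log_lik_draws):
    """Sum of the per-patient negative log-likelihoods."""
    return sum(neg_log_lik_per_patient(log_lik_draws))
```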

Although the models allow predictions to be made for the effectiveness of all six therapies included in this study (Section Methods: Data: Data quality and inclusion criteria), statistical measures were only evaluated where the associated indicator of therapy effectiveness could be observed, i.e. using the predictions for the observed index therapy.

Model generalizability

The generalizability of the models was assessed using three different out-of-sample validation schemes. The first validation scheme consisted in evaluating the model performance using a 10-fold cross-validation. The second validation scheme used a leave-one-site-out cross-validation with respect to the clinical site. Patients from the same clinical site were excluded from the sample that was used to fit the model and then used to test how well this model performs. The procedure was repeated for each clinical site. The third validation scheme evaluated the model performance on the test set.
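The first two validation schemes differ only in how the therapy cycles are split. A minimal Python sketch of both splits is given below; the function names are hypothetical, and the model fitting and scoring that would happen inside each split are omitted.

```python
import random


def k_fold(n, k=10, seed=0):
    """Yield (train_indices, test_indices) for a shuffled k-fold split."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


def leave_one_site_out(site_ids):
    """Yield (held_out_site, train_indices, test_indices) for each clinical site."""
    for site in sorted(set(site_ids)):
        test = [i for i, s in enumerate(site_ids) if s == site]
        train = [i for i, s in enumerate(site_ids) if s != site]
        yield site, train, test
```

In the leave-one-site-out scheme the test fold sizes follow the site sizes, so small sites yield noisier performance estimates than large ones.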

Performance measures were calculated using out-of-sample predictions as well as in-sample predictions. Each in-sample prediction was obtained from one randomly selected training fold. Therefore, exactly one out-of-sample and one in-sample prediction per therapy cycle were retained. Out-of-sample and in-sample performance measures were compared in order to identify overfitting. The robustness of the performance measures was assessed by repeating the 10-fold cross-validation 40 times, which allowed standard errors to be computed.

The modeling approach described in Section Methods: Model development leads to the generation of six predictions per patient, one for each of the six therapies under consideration. Only predictions for the observed index therapy were retained when analyzing generalizability.

Comparison with nested models

The models presented above allow comparable predictions to be made for all six therapies included in this study for each patient and each indicator of therapy effectiveness. The impact of the patient characteristics and their interactions with the index therapy on the model predictions was evaluated by comparing the model of Section Methods: Model development with two models of lower complexity. These two models were nested in the predictive model.

The first nested model, referred to as non-personalized model, does not have a dependency on the patient characteristics (Table 3). This model returns a fixed ranking of the six therapies under consideration. The model is not personalized, since two patients with a therapy cycle of the same duration but different characteristics will obtain the same predicted response, and hence will have the same comparative therapy effectiveness profile.

Table 3 Overview of the predictors used for predictive models and nested models

The second nested model, referred to as prognostic model, has a dependency on the patient characteristics but not on their interactions with the index therapy (Table 3).

This model is an extension of the non-personalized model that additionally allows for personalization, that is, for patient characteristics to have an impact on the predicted response. However, it is important to note here that patient characteristics are not allowed to interact with the index therapy, i.e. there is no personalization in the obtained ranking of the six therapies under consideration.

The predictive model in this study differs from the prognostic model by the addition that individual patient characteristics were allowed to interact with the therapy, i.e. the therapy effectiveness and corresponding ranking were allowed to differ for different patients.
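The nesting of the three models can be made explicit by listing their fixed-effect terms. The Python sketch below is purely illustrative: the covariate names are placeholders rather than the predictors of Table 1, the offset and the random intercept are left out, and `model_terms` is a hypothetical helper.

```python
def model_terms(kind, covariates):
    """Return the fixed-effect terms of one of the three model variants."""
    if kind not in ("non-personalized", "prognostic", "predictive"):
        raise ValueError(f"unknown model kind: {kind}")
    terms = ["therapy"]  # all three models depend on the index therapy
    if kind in ("prognostic", "predictive"):
        # Main effects of the patient characteristics.
        terms += list(covariates)
    if kind == "predictive":
        # Interactions allow the therapy ranking to differ between patients.
        terms += [f"therapy:{c}" for c in covariates]
    return terms
```

Each simpler model uses a subset of the terms of the next one, which is what makes the comparison a comparison of nested models.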

Sensitivity analysis

The predictions’ robustness was tested with respect to different choices of the priors, to the characteristics of the underlying patient population, and to sample size. Methods and results are presented in detail in Additional file 5, Additional file 6 and Additional file 7.

Comparison of predicted therapy response

The modeling approach presented above leads to the generation of six predictions per patient (Section Methods: Model development), one for each of the six therapies under consideration. As both predictive models allow therapy effectiveness to differ for different patient characteristics, a personalized ranking of therapies was obtained for each patient. Note that this ranking only applies with respect to the chosen indicator of therapy effectiveness, i.e. in this case either the lowest predicted number of relapses or the lowest predicted probability of observing a CDP, and does not represent an overall therapy recommendation, which would account for multiple determinants. In order to evaluate the usability of the models in clinical practice, average observed treatment responses were compared between patients who received the highest ranked therapy (denoted as DMT* in the following) and those who did not.
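The ranking step itself is straightforward and can be sketched in Python as follows. The predicted values in the usage example are invented for illustration, and the function names are hypothetical; ties are broken alphabetically, which is an arbitrary choice of this sketch.

```python
def rank_therapies(predicted):
    """Rank therapies from best to worst; lower predicted outcome is better.

    predicted: dict mapping a therapy code to the predicted number of
    relapses (or the predicted probability of a CDP) for one patient.
    """
    return sorted(predicted, key=lambda therapy: (predicted[therapy], therapy))


def received_top_ranked(predicted, index_therapy):
    """True if the patient's observed index therapy is the highest ranked one."""
    return rank_therapies(predicted)[0] == index_therapy
```

For a patient with predictions `{"DMF": 0.42, "FTY": 0.31, "GA": 0.55}`, the highest ranked therapy would be FTY, so only a patient who actually received FTY would be assigned to the DMT* group.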

To limit potential confounding of the responses, the distribution of each predictor was matched between patients who received DMT* and those who did not, using propensity-score-based weighting as implemented by the twang package [16, 17]. The propensity-score-based weighting made it possible to match the distributions of the covariates of the two groups without having to discard any data [16, 18, 19]. It was implemented based on age, relapse count (in the past 12 months), EDSS as categorized in [8], and diagnosis distance. The distributions of the covariates of each group were matched by using the population weights to estimate the average treatment effect of the population [16, 17], while all other settings were kept at the defaults of the twang package [16, 17].
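The weighting idea can be sketched in Python under the assumption that the propensity scores have already been estimated (in the study this is done by the twang package; the estimation itself is not reproduced here). For the average treatment effect of the population, each patient is weighted by the inverse probability of the group they actually belong to. The function names are hypothetical.

```python
def ate_weights(treated, propensity):
    """Inverse-probability weights for the average treatment effect.

    treated: booleans, True if the patient received DMT*.
    propensity: estimated probabilities of receiving DMT*, taken as given.
    """
    weights = []
    for is_treated, p in zip(treated, propensity):
        if not 0.0 < p < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        weights.append(1.0 / p if is_treated else 1.0 / (1.0 - p))
    return weights


def weighted_mean(values, weights):
    """Weighted mean, e.g. of a covariate within one group to check balance."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)
```

After weighting, the weighted mean of each covariate should be similar in the two groups; large remaining differences would indicate that the groups are still not comparable.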

To test the statistical significance of the group differences between patients receiving DMT* and those who did not, a weighted GLM was employed according to an analysis of outcomes approach [16]. P-values were derived from the GLM based on the estimated significance of the relevant intercept and slope coefficients. For each observation, an indicator variable was used to specify whether the patient received DMT*. A negative binomial GLM was used for the observed number of relapses [18] and a binomial GLM was employed for the observed occurrence of a CDP as follows:

$$ \mathtt{svyglm.nb}\left(\mathtt{observed.number.relapses} \sim \mathtt{took.DMT^{*}} + \mathtt{offset}\left(\log\left(\mathtt{duration.DMT^{*}}\right)\right),\ \mathtt{design} = \mathtt{design.ps}\right) $$

$$ \mathtt{svyglm}\left(\mathtt{observed.occurence.CDP} \sim \mathtt{took.DMT^{*}} + \mathtt{offset}\left(\log\left(\mathtt{duration.DMT^{*}}\right)\right),\ \mathtt{family} = \mathtt{binomial},\ \mathtt{design} = \mathtt{design.ps}\right) $$

The duration of each therapy cycle was included as offset to allow for cycles with heterogeneous observation time, i.e. to allow for the comparability of the results between patients with different index therapy durations. The procedure is illustrated in Fig. 1.

Results

The following results are presented: an overview of the patient population after the quality and inclusion criteria were applied, the importance of model coefficients, and the model performance assessment.

Patient population

After the inclusion criteria described in Section Methods: Data: Data quality and inclusion criteria were applied, the index therapy cycles were distributed as follows: Dimethyl fumarate (22%), Fingolimod (25%), Glatiramer acetate (13%), Interferon-β1 (19%), Natalizumab (9%), and Teriflunomide (13%). At least 266 therapy cycles per DMT are available for model development. A detailed description of the patient population is reported in Additional file 1.

Importance of model coefficients

The predictors associated with the eight largest fixed-effect parameter estimates (in magnitude) of the relapse and CDP models are reported in Table 4 and Table 5, respectively. The predictors are listed along with the signs and the posteriors’ median absolute deviations (MADs) of their corresponding estimates. Most of the ranked coefficients are interaction terms, two of which appear in both rankings: (i) the coefficient for the interaction between Natalizumab as index therapy and the diagnosis distance, and (ii) the coefficient for the interaction between Fingolimod as index therapy and the second-line therapy indicator. In the relapse model, the duration of the current therapy seems to have a significant impact both individually and when combined with Teriflunomide or Fingolimod. Note that the largest parameter estimate has the highest uncertainty. In the CDP model, having had a second-line DMT is particularly meaningful when the index therapy is Teriflunomide, Natalizumab or Fingolimod.

Table 4 Most important predictors in the relapse model

Table 5 Most important predictors in the CDP model

Figure 2 displays the MADs of the fixed-effects’ posterior distributions in the relapse and the CDP models. In both cases, the estimates with the highest uncertainty are those associated with the following predictors: duration of the current therapy, interaction between the current therapy and its duration, and Teriflunomide as a current therapy.
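For reference, the MAD of a set of posterior draws can be computed as in the following Python sketch. The text does not state whether the reported values are scaled, so the scaling is an explicit parameter here: R's `mad` function applies a consistency constant of 1.4826 by default, whereas `scale=1.0` gives the raw median absolute deviation. The function name and the example draws are hypothetical.

```python
from statistics import median


def mad(draws, scale=1.0):
    """Median absolute deviation of posterior draws around their median."""
    center = median(draws)
    return scale * median(abs(d - center) for d in draws)
```

Unlike the standard deviation, this measure is hardly affected by a few extreme draws: for `[1, 2, 3, 4, 100]` the raw MAD is 1.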