A data-driven pipeline to extract potential adverse drug reactions through prescription, procedures and medical diagnoses analysis: application to a cohort study of 2,010 patients taking hydroxychloroquine with an 11-year follow-up

Context Real-life data consist of exhaustive data which are not subject to selection bias. These data enable to study drug-safety profiles but are underused because of their temporality, necessitating complex models (i.e., safety depends on the dose, timing, and duration of treatment). We aimed to create a data-driven pipeline strategy that manages the complex temporality of real-life data to highlight the safety profile of a given drug. Methods We proposed to apply the weighted cumulative exposure (WCE) statistical model to all health events occurring after a drug introduction (in this paper HCQ) and performed bootstrap to select relevant diagnoses, drugs and interventions which could reflect an adverse drug reactions (ADRs). We applied this data-driven pipeline on a French national medico-administrative database to extract the safety profile of hydroxychloroquine (HCQ) from a cohort of 2,010 patients. Results The proposed method selected eight drugs (metopimazine, anethole trithione, tropicamide, alendronic acid & colecalciferol, hydrocortisone, chlormadinone, valsartan and tixocortol), twelve procedures (six ophthalmic procedures, two dental procedures, two skin lesions procedures and osteodensitometry procedure) and two medical diagnoses (systemic lupus erythematous, unspecified and discoid lupus erythematous) to be significantly associated with HCQ exposure. Conclusion We provide a method extracting the broad spectrum of diagnoses, drugs and interventions associated to any given drug, potentially highlighting ADRs. Applied to hydroxychloroquine, this method extracted among others already known ADRs. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-022-01628-3.


Introduction
Adverse drug reactions (ADRs) have been attributed to causing over 770,000 injuries and 100,000 deaths and $76.6 billion in annual costs in the US [1], responsible for 143,915 hospitalizations and are estimated to 10,000 deaths in France [2]. The rapid detection of ADRs has become a crucial public health issue. Currently, drug safety is monitored by the pharmacovigilance system, which uses spontaneous reporting systems (SRSs) to detect, collect, and analyze ADRs. However, SRSs suffer from underreporting. Indeed, less than 10% of serious ADRs are reported [3,4]. Moreover, this system is subject to biases due to selective reporting (most of the reported cases are considered as suspected ADRs) or by a reporting bias for newly marketed medicines.
The increasing availability of electronic medical records (EMR) offers major opportunities to investigate a wide spectrum of ADRs and detect drug safety signals closer to real use and time, as EMR databases record information for large populations over long follow-up periods [5]. These databases, such as electronic medical records and administrative claims databases, have been mostly used to confirm or disprove potential signals flagged by SRSs. A number of data-mining techniques have been specifically developed for the automatic detection of drug-safety signals using either SRS or EMR databases [6][7][8][9][10][11][12]. Over the last decade, several international initiatives have been developed; the Mini-Sentinel and OMOP (Observational Medical Outcomes Partnership) in the United States and the PROTECT (Pharmaco-epidemiological Research on Outcomes of Therapeutics by a European Consortium) and EU-ADR (Exploring and Understanding Adverse Drug Reactions) in Europe.
The challenge of drug-safety signal detection methods is to handle four types of difficulties. The first difficulty is the data source. The study of long-term adverse drug reactions or effects not suspected by healthcare professionals requires the use of a real-life data source, such as EMR or claims databases, which do not suffer from the known bias of underreporting and reporting selection [13,14]. The second difficulty is the consideration of a broad spectrum of potential ADRs, and not only candidate ADRs [15], to be able to highlight new signals. The third difficulty is to precisely take into account the temporal aspect. Time is important, because the type of adverse reactions caused by the medication under study may differ according to the duration of the medication prescriptions. Certain adverse effects may occur soon after the start of the medication under study, whereas others may require a prolonged period of administration to become manifest [16]. Safety depends on the dose, timing, and duration of the treatment. The last difficulty concerns distinguishing true ADRs from the natural course of the disease. Indeed, the natural course of the disease for which the prescribed drug is indicated may be associated with many other medical diagnoses, which may be indistinguishable from ADRs.
In this study, we developed a data-driven pipeline based on a cumulative exposure test to highlight relevant diagnoses, drugs and interventions which could reflect an ADR of a given drug. We used the French national medication reimbursement database to apply this pipeline to HCQ, an old drug with widely known side effects.

Description of the WCE-data-driven pipeline
The WCE-data-driven pipeline aims to overcome the four aforementioned difficulties as follows: − Data source: Our method is dedicated to exhaustive real-life data encompassing the whole medical pathway of each patient after drug intake and all along its use. − Agnostic approach: all elements of the medical pathway are considered with no expert preselection − Temporality assessment: We modeled the increase in risk related to cumulative dose using splines in Cox proportional hazards models [17,18]. Dose during a month is defined as having had at least one reimbursement for the drug of interest during this month. This model is called weighted cumulative exposure (WCE). It allows the representation of complex cumulative effects of dose, timing, and duration of interest drug [19]. − ADRs denoising: We introduced a covariate representing disease severity built from expert knowledge to automatically pinpoint drugs given for the condition for which the exposure drug has been prescribed.
We will first introduce this WCE-data-driven pipeline, followed by a presentation of a comparative method to our WCE-data-driven pipeline. Finally, we present a use case of this pipeline and compared it with a more classical approach.

WCE-data-driven pipeline
The WCE-data-driven pipeline applies to a case-only cohort, i.e. all patients in the cohort were exposed at least once during the study period to the drug of interest.
The WCE-data-driven pipeline is divided into three steps: data extraction, data preparation, and safety-profile extraction (Fig. 1).
i. Data extraction: All drug prescriptions for the drug under study are extracted, along with all health events (treatment initiation date of all drugs, procedure dates, medical diagnoses dates), and patient characteristics, such as age, sex.
ii. Data preparation: We built a data frame corresponding to the WCE package [20] data format for each possible type of event occurring after the first prescription of the drug under study. In each data frame, each row corresponds to a given time period, such as day, week, or month (to be defined). Id identifies the patients. Start and Stop identify the beginning and end of each interval (Stop in row n = Start in row n+1). The intervals are closed on the right. For each patient, the first start (start = 0) is the start date of the study period. The last start is the date the patient has the event or the date of the last patient follow-up. Event is a binary indicator for the event of interest, which has a value of 1 if the event occurred in the interval specified by Start and Stop. For a given subject, event = 1 can only occur in the last interval of his or her follow-up. The following columns represent patient covariates. iii. Safety-profile extraction: The weighted cumulative exposure (WCE) model is applied to estimate the cumulative effect of duration, dose, and prescription date for the drug of interest [19]. The WCE statistical model approach estimates the effect of past exposure using a weighted sum of all previous instances of exposure, with weights depending on the time elapsed since exposure and the dose [18,21]. This makes it possible to account for the fact that the type of adverse reactions caused by the medication under study may depend on the duration of the medication prescription (certain adverse effects may occur soon after the start of the medication under study, whereas others may require a prolonged period of administration to become manifest). WCE statistical model allow taking into account for patient's covariates as relying on Cox time-dependent in which initial covariates might be introduced.
The WCE statistical model works in two steps. First, it estimates the weighting function that assigns weights to past exposures (e.g. dose or intensity of exposure) as a function of time since exposure, using cubic regression splines (Fig. 2).
Then, a Cox proportional hazards model is fitted to compare a group exposed to the drug of interest (HCQ in this example) during the time window to a group not exposed during the same period [20].
To select health events significantly associated with exposure, confidence intervals for each hazard ratio are needed. Initially, WCE model was not designed to test association with an ADR but to assess risk function of a known association, therefore we developed an association test for WCE using bootstrapping to estimate both the confidence intervals and P-values of the exposure and of the initial covariate's coefficient (number of repeats to be defined). We constructed bootstrap replicates of the data, each of which was randomly sampled with a replacement. Confidence intervals were estimated using the percentile method: : Confidence intervals, where θ * j denotes the jth quantile (lower limit) and θ * k denotes the kth quantile All analyses were performed using R (version 4.0.3) and WCE package (version 1.0.2).

WCE competitor
As the above pipeline is not specific to the method used to assess association between the studied drug and each diagnoses, drugs and interventions which could reflect an ADR, we also implemented case-crossover [22,23].
We chose a case-crossover design for WCE competitor because it is a method adapted to a case-only cohort and this design is widely used in pharmacoepidemiology because it is not sensitive to unmeasured, time-invariant confounding factors. Case-crossover requires defining risk periods and a wash-out period. Four periods were defined for each individual, separated by a washout period: one risk period and three control periods. Each period had the same duration (Fig. 3). We performed a sensitivity study by varying the duration of these periods (risk and control) of three, six and nine months, with a washout period of one month.

Use case
In the present study, we applied our data-driven pipeline to patients newly exposed to HCQ (considered in WCE as the exposure variable). The cohort of patients receiving HCQ was extracted from the EGB (Echantillon Généraliste de Bénéficiaires), a permanent 1/97 representative sample of the Système National d'Informations Inter-régimes de l' Assurance Maladie (SNIIRAM), which includes the data for 66 million people. The EGB includes data for approximately 780,000 people [24], consisting of de-identified data on demographic characteristics (sex, year of birth, date of death), long-term diseases (ALD), and reimbursed acts (visits, medical procedures, laboratory tests, dispensed drugs, medical devices) [24][25][26]. We extracted the data of all patients with at least one reimbursement for HCQ. We only included patients who did not receive HCQ during the previous year to select only newly exposed patients (Fig. 4). For these patients, we extracted the following data: age, sex, date of all HCQ deliverances, first date of prescription for all other drugs, date of procedures and medical diagnoses, date of "chronic disease certification status", which is a marker of disease severity in the French health system.
All the parameters that were used in our case study are presented in Table 1.
We assume that a more severe disease leads to a higher use of some drugs, diagnoses or interventions. Fig. 3 Diagram of case-crossover. Schematic representation of a case-crossover analysis. The case-defining event is the prescription of a level 5 ATC class other than HCQ and the prescription is the HCQ drug. Three control periods are selected (shown as CTR) and one risk period (shown as RISK) for each individual. The duration of the periods (risk and control) is set to three months in this example. The washout period (shown as WO) has a standard duration of one month Therefore, severity of the disease might be a confusion factor for the relationship between the drug of interest and some other drugs, diagnoses and interventions as it is associated with the cumulative dose of the drug of interest and to other drugs, diagnoses and interventions related to disease severity. Therefore, when it is not included in the model, all drugs, diagnoses and interventions linked to disease severity will be highlighted using our model, resulting in many positive results that are not likely to be ADRs. To limit these signals, we have created a disease severity variable to improve the specificity of the detected signals. The disease severity variable is designed manually from expert knowledge and depends only on the drug of interest. In our HCQ use case, the disease severity variable was defined as having "chronic disease certification status" for lupus or rheumatoid arthritis. The creation of a disease severity variable is a non-mandatory step and is added in the model like other confounding factors such as age and sex. The EGB database contains only anonymized data and its access is legally authorized without having to receive authorization from the national data protection agency (CNIL). The study protocol was submitted to the appropriate INSERM and CNAMTS entities, as legally required. In addition, the study was approved by our Institutional Review Board (CER-APHP CENTRE, IRB n°20,180,603).
For this use case, we considered age, sex and disease severity as covariates. The significance level chosen was 5% because this was an exploratory use case, therefore we did not correct for multiple testing.

Population description
The HCQ cohort contains 2,010 patients (n women = 1,577, 78%), with 1,081 different ATC classes, 4,200 different diagnostics and 3,075 different procedures. The average age at the time of the first prescription of HCQ was 54.8 (sd = 16.2). Within the cohort, 1,045 (52%) patients declared their disease to National Insurance to get fullprice reimbursement for all treatments related to their chronic disease. Among them, 12% (n = 240) each had vasculitis, systemic lupus erythematous, or systemic scleroderma, 11% (n = 214) rheumatoid arthritis, and 6% (n = 122) each malignant neoplasm or malignant disease of the lymphatic or hematopoietic tissue (Table 2). There was an association between age and HCQ exposure (P < 0.001 by linear regression analysis), as well as between sex and HCQ (P < 0.001 by the Wilcoxon test). There was also a strong association between HCQ exposure and severity (P < 0.001 by the Wilcoxon test).

Application of the WCE-data-driven pipeline
The WCE-data-driven pipeline enabled the identification of eight ATC classes, twelve procedures and two medical diagnoses associated with the prescription of HCQ, of which seven were borderline significant at 5% significance level. The highest risk ratios ATC classes were obtained for hydrocortisone (   (HR = 12.44 [5.14-28.75]), manual or automated campimetry or perimetry, with specific threshold measurement programs (HR = 6.28 [3.93-9.03]), unilateral or bilateral optical coherence tomography of the eye (HR = 4.60 [3.46-6.37]). (Fig. 5 & Additional file 1 Appendix 1). In additional file 1 appendix 2, we have represented the risk functions for light flash electroretinography procedure and electroretinography with dark adaptation procedure.

Application of the case-crossover-data-driven pipeline
With risk periods set up at three months, the case-crossover competitor method identified nine ATC classes, the highest risk ratios were obtained for diclofenac (HR = 1. In total, no ATC class, no medical diagnosis and four procedures were common between WCE and case-crossover methods: unilateral or bilateral optical coherence tomography of the eye, manual or automated campimetry or perimetry, with specific threshold measurement programs, exploring the color sense by matching and light flash electroretinography, with measurement of response amplitudes and latencies.

Comparison with known side effects
Related association signals were found for all very common and common known side effects (Table 3), except headache for which indicated drugs such as acetaminophen are over-the-counter treatments and lability disorders and anorexia for which there is no indicated treatment for acute events. No related association signals were found for uncommon side effects and side effects for which prevalence is not known, except for maculopathies, for which checking procedure was found to be associated (but not treatment for the side effect himself ).

Discussion
We report a new data-driven pipeline able to pinpoint diagnoses, drugs and interventions associated to any given drug and makes it possible to highlight unexpected associations that could represent ADRs. The proposed WCE-pipeline compared to other methods allows the use of the whole prescription trajectories of the drug under study, and therefore allows highlighting both acute and cumulative effects of the drug under study, and to take into account dose-effect relationship, which is essential to infer causal association. This is not the case for methods commonly used to detect ADRs using administrative database such as case-crossover or prescription sequence symmetry analysis (PSSA) [27], making it more powerful to analyze the whole ADRs spectrum. Moreover, our data-driven pipeline is based on a self-controlled method, thus avoiding the usual biases of methods such as casecontrol [28] or propensity score [29] based approaches.
A major strength of the proposed pipeline is the ability to use a medico-administrative claims, which are not biased towards voluntary declaration, enabling to detect unexpected side effect [30]. Compared to the use of hypothesis-based pharmacovigilance strategy, this pipeline required an additional interpretation step to distinguish in associated signals, those unexpected from both known side effects and signals related to medical condition(s) for which the treatment is prescribed. Thus, this pipeline, applied to HCQ on a cohort of 2,010 patients, allowed us to identify most common and very common side effects, but also associations related to the medical conditions for which HCQ is prescribed: anethole trithione, a bile secretion-stimulating drug restores salivation and relieves the discomfort of dry mouth in Sjögren's syndrome, often associated with lupus [31]; hydrocortisone, a glucocorticoid, is often used for withdrawal from more potent corticosteroids in the treatment of lupus or rheumatoid arthritis; chlormadinone, a progestin macro-pill is recommended as a contraceptive method for women with lupus [32,33]; alendronic acid/ cholecalciferol and osteodensitometry procedure are also a co-prescription of HCQ for optimal management of patients with lupus or rheumatoid arthritis, mostly post-menopausal women, by internists or rheumatologists; this is the same for tooth restoration procedures, as osteoporosis treatment requires a dental check to avoid jawl osteonecrosis and also for skin lesions removal as dermatological consult are part from optimal care in lupus; tixocortol, a glucocorticoid, is a nasal spray treating allergic rhinitis, a common comorbidity in patients suffering from lupus [34]; valsartan, specific angiotensin II (Ang II) receptor antagonist, is used to treat hypertension, a comorbidity of post-menopausal women [35,36].
Another advantage of our method is that it is a caseonly design, therefore it does not imply to choose an unexposed group which is always a very tricky issue, and it does not make the assumption like intermittency of drug use or acute event. There are multiple reasons that can cause an adverse drug reaction, such as metabolic genetic polymorphisms, e.g. in Asian populations drug metabolism is different than in Caucasian population; gender; age; nutrition; drug interactions. Our model can take additional covariates to adjust for, as we did with gender and age. It can also incorporate interaction terms and therefore integrate drug interaction.
In our use case, we define the dose as cumulative months of exposure to the drug of interest (HCQ) as our data come from a claim database. This pipeline was however not able to highlight side effects for which no treatment is given neither hospitalization nor procedure is required, but which results in stopping the treatment. No unexpected signal was highlighted using our pipeline. Another limitation of this method is that it is not corrected on multiple tests because our approach is exploratory. Some associations may be due to type-1 error.

Localization
Adverse effect Frequency Related association signals

Gastrointestinal disorders
Abdominal pain, nausea Very common metopimazine Diarrhea, vomiting Common metopimazine

Eye disorders
Blurring of vision due to a disturbance of accommodation

Common eyes procedures
Retinopathy with changes in pigmentation and visual field defects Uncommon eyes procedures Cases of maculopathies and macular degeneration have been reported

Cardiac disorders
Cardiomyopathy which may result in cardiac failure and in some cases a fatal outcome