The re-use of medical data contained in EHRs is a critical issue for clinical research because it could have a major impact on the overall cost of future clinical studies. CTRSSs have been developed to support each step of the clinical research process. The objective of this work is to describe all the tasks necessary to re-use the EHR4CR platform at a local level to support existing academic studies. We identified specific issues that could affect the future results of the platform. We also provided metrics to report these critical aspects, which could also be used in future CTRSS assessment studies.
The normalization process, i.e. transforming a raw eligibility criterion into a computable criterion, was one of these critical aspects. Most of our final computable criteria (64.2%, 43/67) were considered to be at least satisfactory by our expert in clinical research. The choice of working with internationally recognized medical terminologies had an impact on this result. We were able to map 75% of the medical concepts identified in the original criteria to the locally used terminologies from our EHR. These important results highlight the fact that the technical specifications of both EHRs and CTRSSs are a major concern for re-using routine care data for clinical research purposes.
Translating free-text eligibility criteria into computable criteria
One of the major limiting aspects of the translation of free-text eligibility criteria into computable criteria is the choice of the terminologies. Most eligibility criteria are expressed in natural language and not as medical concepts. Several options are available. The Unified Medical Language System (UMLS®) is the most complete choice and offers the benefit of its interoperability with many medical terminologies but may be too broad in scope for clinical research. Members of the EHR4CR project decided to work with various international standardized terminologies covering various medical domains. The selected terminologies allowed us to express 81% (92/114) of the medical concepts contained in the original criteria.
Using standards has several benefits, including, but not limited to, (a) more synonyms available for semi-automated matching or concept identification, (b) reduced risk of misunderstanding, and (c) sharing across recruitment platforms, which would otherwise be difficult.
The choice of a terminology set is important and has a major impact on the number of free-text eligibility criteria that can be transformed into suitable computable criteria.
Criteria are organized into concepts such as age, gender, disease, symptoms, and medical history, and are often connected with comparators including Booleans and others such as “at least”, “more than” or “if then” conditions. This complexity can make it difficult to express eligibility criteria as computable criteria. The choice of syntax is a second key issue because it affects the ability to translate the eligibility criteria into computable criteria. Weng et al. identified several solutions including (a) re-using the syntax of existing languages such as the Structured Query Language (SQL), (b) developing a dedicated syntax, e.g. the Arden syntax or GELLO, (c) logic-based languages, or (d) ad hoc solutions [22–24]. The EHR4CR group decided to develop its own solution called ECLECTIC. We expressed all the aXa and DERENEDIAB trial computable criteria using ECLECTIC. This second issue is also crucial because it determines the ability of the platform to express “real world” eligibility criteria and therefore has a direct impact on the final performance.
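As an illustration of how concepts, comparators, and Boolean connectors combine into a computable criterion, the following is a minimal sketch in Python. It is not ECLECTIC syntax; the class names, the comparator table, and the concept names (`age`, `creatinine_clearance`, `on_dialysis`, `pregnant`) are hypothetical and chosen only for this example.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

# Hypothetical comparator table (an assumption for this sketch, not ECLECTIC).
COMPARATORS: Dict[str, Callable[[float, float], bool]] = {
    ">=": operator.ge,  # "at least"
    ">": operator.gt,   # "more than"
    "<=": operator.le,
    "<": operator.lt,
    "==": operator.eq,
}


@dataclass
class Atom:
    """One concept compared with a value, e.g. age >= 18."""
    concept: str
    comparator: str
    value: float


@dataclass
class AllOf:
    """Boolean AND over sub-criteria."""
    parts: List["Node"]


@dataclass
class AnyOf:
    """Boolean OR over sub-criteria."""
    parts: List["Node"]


@dataclass
class Not:
    """Boolean negation of a sub-criterion."""
    part: "Node"


Node = Union[Atom, AllOf, AnyOf, Not]


def evaluate(node: Node, patient: Dict[str, float]) -> bool:
    """Evaluate a criterion tree against one patient's structured data."""
    if isinstance(node, Atom):
        if node.concept not in patient:
            # Illustrative choice: a missing value never satisfies an atom.
            return False
        return COMPARATORS[node.comparator](patient[node.concept], node.value)
    if isinstance(node, AllOf):
        return all(evaluate(part, patient) for part in node.parts)
    if isinstance(node, AnyOf):
        return any(evaluate(part, patient) for part in node.parts)
    if isinstance(node, Not):
        return not evaluate(node.part, patient)
    raise TypeError(f"Unknown node type: {type(node).__name__}")


# "Age at least 18, AND (clearance more than 30 OR on dialysis), AND NOT pregnant"
criterion = AllOf([
    Atom("age", ">=", 18),
    AnyOf([
        Atom("creatinine_clearance", ">", 30),
        Atom("on_dialysis", "==", 1),
    ]),
    Not(Atom("pregnant", "==", 1)),
])

adult = {"age": 54, "creatinine_clearance": 45, "pregnant": 0}
minor = {"age": 16, "creatinine_clearance": 80, "pregnant": 0}
print(evaluate(criterion, adult), evaluate(criterion, minor))
```

Even this small tree shows why the choice of syntax matters: any construct the syntax cannot represent (for instance temporal or “if then” conditions, which are not modelled here) has to be simplified or dropped during normalization.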
One of the key issues of the normalization process is the complexity of the criteria. By analyzing 1000 eligibility criteria randomly selected from ClinicalTrials.gov, Ross J. et al. found that 85% contained semantic complexity and that 36.8% were not comprehensible or required clinical judgment or additional metadata. The complexity of our criteria (74.6%, 50/67) was comparable to that of other studies.
The pre-processing of free-text eligibility criteria for normalization is necessary for obtaining simple and useful computable criteria. Only 53.7% (36/67) of the criteria did not need any modification to be computed after normalization. This step remains difficult to automate and often requires expert input to remove semantic ambiguity, to extend free-text medical concepts, and for concept recognition. For example, the sixth aXa criterion, containing 14 medical concepts consisting of unstructured data, received a score of 5 (“translated without any loss of information”) from our clinical research expert. This surprisingly good result was explained by the association of very specific diagnostic conditions (e.g. “pulmonary embolism confirmed by a gap in a pulmonary artery” or “Lower extremity Deep vein thrombosis confirmed by the lack of compressibility of a vein segment under the ultrasound probe”) combined with information that is not useful in this context and is immediately identifiable by an expert. The EHR4CR platform queries EHR billing codes and therefore already works with confirmed diagnoses.
The translation of free-text criteria into medical concepts is a difficult and crucial task. To limit the risk of introducing bias and to better manage subjective topics, complex criteria should be reviewed (or, better, independently translated) by two or more experts. Moreover, a good understanding of the content of the data warehouse is needed to ensure quality mappings, especially to manage negation under an open world assumption, in which the absence of information is not equivalent to negation.
Identifying medical concepts using the platform terminologies
The second part of the mapping process consists of connecting the computable criteria, now expressed as biomedical concepts from standard terminologies, to local EHR terminologies. Many automated solutions have been proposed but human expertise is still needed to ensure high quality mapping. We were able to map 93.5% (86/92) of the medical concepts represented by EHR4CR terminologies to the locally used terminologies. This success is explained by the fact that many of the EHR4CR terminologies were also used in the local CDWs.
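The two mapping steps can be chained to check how the reported percentages relate to each other. The short sketch below uses only the counts quoted in this discussion (114 original concepts, 92 expressed with the EHR4CR terminologies, 86 mapped locally); the variable names are ours, and reading the overall 75% figure as 86/114 is an assumption.

```python
from fractions import Fraction

total_concepts = 114   # medical concepts in the original criteria
expressed = 92         # concepts expressed with the EHR4CR terminologies
mapped_locally = 86    # of those, concepts mapped to local terminologies

standard_coverage = Fraction(expressed, total_concepts)
local_coverage = Fraction(mapped_locally, expressed)

# Coverage of the whole pipeline is the product of the two steps.
end_to_end = standard_coverage * local_coverage

print(f"standard terminologies: {float(standard_coverage):.1%}")  # 80.7%
print(f"local mapping:          {float(local_coverage):.1%}")     # 93.5%
print(f"end to end:             {float(end_to_end):.1%}")         # 75.4%
```

Under this reading, most of the loss occurs at the first step (expressing concepts with standard terminologies) rather than at the local mapping step.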
While information in EHRs is often structured, mapping concepts of structured eligibility criteria to medical concepts collected in a care context is challenging: (a) terminologies used to translate eligibility criteria are not necessarily the same as those used for storing routine care data in EHRs and (b) there is not always a direct relationship between these different terminologies. For example, “Protein/Creatinine Ratio in urine” would have required a concomitant determination of protein and creatinine in urine and a calculation to be mapped. Similarly, one simple concept could potentially be represented in EHRs in various forms. For example, patients with “Acute Kidney Disease” could be identified in EHRs using diagnosis codes, laboratory results, or free text. Some studies have developed computer algorithms to identify specific patients with EHR data [25, 26]. The natural language processing functionalities of the EHR4CR platform were not available during this work. Therefore, we did not map the medical concepts that corresponded to unstructured data to EHR4CR terminologies. This was possible because the medical informatics experts responsible for the normalization process had a perfect knowledge of the CDW contents.
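The “Protein/Creatinine Ratio in urine” case can be sketched as a derived concept computed from two separately stored laboratory results. The following is a minimal illustration only, not the platform's implementation: the function name, the one-hour pairing window used to approximate a “concomitant” determination, and the assumption that both values are stored in compatible units are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# (time of measurement, value); units are assumed to be compatible.
Measurement = Tuple[datetime, float]


def protein_creatinine_ratios(
    protein: List[Measurement],
    creatinine: List[Measurement],
    max_gap: timedelta = timedelta(hours=1),  # hypothetical pairing window
) -> List[Measurement]:
    """Pair each urine protein value with the closest urine creatinine
    value measured within max_gap, and return (time, ratio) pairs."""
    ratios: List[Measurement] = []
    for p_time, p_value in protein:
        candidates = [
            (abs(c_time - p_time), c_value)
            for c_time, c_value in creatinine
            if abs(c_time - p_time) <= max_gap and c_value > 0
        ]
        if candidates:
            _, c_value = min(candidates, key=lambda pair: pair[0])
            ratios.append((p_time, p_value / c_value))
    return ratios


t0 = datetime(2015, 1, 1, 8, 0)
protein = [(t0, 150.0), (t0 + timedelta(days=3), 90.0)]
creatinine = [(t0 + timedelta(minutes=10), 100.0), (t0 + timedelta(hours=5), 50.0)]

# Only the first protein value has a creatinine value close enough in time.
result = protein_creatinine_ratios(protein, creatinine)
print(result)
```

The second protein measurement yields no ratio because no creatinine value was recorded close enough in time, which is exactly the situation in which such a concept cannot be mapped directly.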
Structured data completeness and missing data of EHRs to support clinical research
Structured data are the easiest to use in the implementation of a CTRSS but these data are collected for purposes other than patient recruitment into clinical trials. Nevertheless, we found that 78% (89/114) of the medical concepts present in the aXa, DERENEDIAB and EWING 2008 criteria corresponded to structured data. This is more than the 55% identified by Köpcke et al., who worked on the EHRs from five German University Hospitals. This difference may arise because the definition of medical concepts in our study was not exactly the same as in theirs. Future studies aiming to assess CTRSS performance need to specify the data completeness for each medical concept to provide comparable outcomes.
Missing data are another crucial point that must be carefully addressed during the normalization process. Missing data will influence the results of the platform, and this influence will vary greatly depending on whether the missing values concern the inclusion or the exclusion criteria. Indeed, eligibility criteria with a high prevalence of missing data decrease the sensitivity of the platform because all eligibility criteria are associated with “AND” operators in the platform. For example, whether or not contraception is used is far from being reported for all women in the HEGP-CDW. Eligibility criteria containing this concept decrease the number of identified patients. Missing data also increase the number of patients falsely identified, and therefore decrease the specificity of the platform, because all exclusion criteria are associated with “AND NOT” operators. Another important issue is that missing data are not objectively quantifiable in a CDW. Thus, local expertise from people with a complete knowledge of the queried CDW and a perfect understanding of the objectives of the future supported trials is absolutely necessary before using the platform.
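The asymmetric effect of missing data on inclusion (“AND”) and exclusion (“AND NOT”) criteria can be illustrated with a small sketch. This is not the EHR4CR query engine; the function, the concept names, and the convention that a criterion counts as met only when the warehouse holds an explicit `True` are assumptions made for the example, with `None` standing for a value that was never recorded.

```python
from typing import Dict, List, Optional

# True/False = recorded value, None = nothing recorded in the warehouse.
Patient = Dict[str, Optional[bool]]


def is_candidate(patient: Patient, inclusion: List[str], exclusion: List[str]) -> bool:
    """Inclusion criteria are combined with AND, exclusion criteria with AND NOT.
    A criterion is considered met only if it is explicitly recorded as True."""
    included = all(patient.get(concept) is True for concept in inclusion)
    excluded = any(patient.get(concept) is True for concept in exclusion)
    return included and not excluded


# Missing inclusion data: contraception use was never recorded, so the patient
# is not returned even if she would in fact be eligible (lower sensitivity).
patient_a: Patient = {"female": True, "uses_contraception": None}
missed = is_candidate(patient_a, ["female", "uses_contraception"], [])

# Missing exclusion data: renal failure was never recorded, so the patient
# is returned even if she should in fact be excluded (lower specificity).
patient_b: Patient = {"diabetes": True, "renal_failure": None}
kept = is_candidate(patient_b, ["diabetes"], ["renal_failure"])

print(missed, kept)
```

In both cases the query itself cannot tell a missing value from a true negative, which is why knowledge of how each concept is actually recorded in the CDW is needed to interpret the counts.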
Completeness was evaluated for the entire data warehouse. Such an evaluation is relevant for items such as gender. Low completeness for a given criterion may exclude participation of a medical center. However, for specific results, completeness is only relevant for a subset of patients for whom the information is needed (e.g. HbA1c for diabetic patients).
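The difference between completeness over the entire warehouse and completeness over the relevant subset can be made concrete with a short sketch. The record layout, field names, and toy data below are hypothetical and serve only to illustrate the two ways of computing the metric.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]


def completeness(
    records: List[Record],
    field: str,
    subset: Optional[Callable[[Record], bool]] = None,
) -> Optional[float]:
    """Share of records with a non-missing value for `field`, optionally
    restricted to the records for which `subset` returns True."""
    selected = [r for r in records if subset is None or subset(r)]
    if not selected:
        return None  # completeness is undefined on an empty population
    filled = sum(1 for r in selected if r.get(field) is not None)
    return filled / len(selected)


# Toy data: HbA1c is recorded only for the diabetic patients.
records: List[Record] = [
    {"gender": "F", "diabetic": True, "hba1c": 7.2},
    {"gender": "M", "diabetic": True, "hba1c": 8.1},
    {"gender": "F", "diabetic": False, "hba1c": None},
    {"gender": "M", "diabetic": False, "hba1c": None},
]

overall = completeness(records, "hba1c")
among_diabetics = completeness(records, "hba1c", subset=lambda r: r["diabetic"] is True)
print(overall, among_diabetics)
```

Here the warehouse-wide figure (0.5) would understate how usable HbA1c is for a diabetes trial, whereas the subset figure (1.0) reflects the population in which the value is actually needed.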
Another critical issue from the point of view of the CTRSS user is the question of time. Each eligibility criterion may vary over time: new biological measurements, drug therapies, diagnoses, etc. The frequency with which the database is queried will have an influence on the results and especially on the sensitivity of the CTRSS. In most cases, data contained in CDWs are not real-time data. For example, to be collected in our CDW, diagnoses must be (1) coded by the physician and (2) imported. Any real-time inclusion study (e.g. inclusion at diagnosis or immediately after) is therefore problematic, especially for acute diseases.
Feasibility of the approach
The process described in this study is time consuming. In our case, the mapping and translation of the criteria were performed successfully by an expert physician experienced in medical informatics and with a strong knowledge of the content and structure of the local clinical data warehouse. If a platform such as EHR4CR were to be developed, a specialized support team composed of medical informatics specialists and data managers would probably be needed to enable scalability.
General statements about the EHR4CR normalization (or standardization) pipeline
Semantic interoperability is one of the main challenges to address to enable the reuse of hospital EHR data to support research. Semantic interoperability within a broad international research network reusing clinical data from EHRs requires a rigorous governance process to ensure the quality of the data standardization process.
This study demonstrates good coverage of the EHR4CR central terminology used during the normalization process of the eligibility criteria of three studies. However, its scope needs to be continuously extended to address the representation of a much broader set of eligibility criteria. Updating of the EHR4CR central terminology cannot be fully automated (e.g. through automatic coding of free text clinical trial protocols). A collaborative editor is required to efficiently support the creation of new semantic resources to expand the scope to additional studies and associated eligibility criteria.
Despite recent efforts, formal representation of multimodal and multi-level data that supports data interoperability across clinical research and care domains is still challenging.
Terminology mapping at hospital sites is the major bottleneck of the data standardization pipeline. Supportive tools are still in their infancy.