With a study population of 6,387 RA patients, this is one of the largest studies of the early presentation of RA in general practice using EHRs. Our results suggest that the process of RA diagnosis takes time and that information may be available in free text before a diagnosis is recorded as a Read code. The indicator code groups under investigation (DMARD, referral to rheumatology, joint sign or symptom, synovitis, inflammatory arthritis diagnosis and rheumatoid factor test) were found in between 3% (synovitis) and 55% (rheumatoid factor test) of patients. A previous paper discussed the findings regarding indicator code groups, finding that they were widespread in RA patient records prior to the diagnostic code but were unlikely to be adequate for describing the full picture of the early presentation of RA or for making up a probabilistic case definition in the absence of an RA diagnostic code.
Findings from the current study suggest that data stored in free text can add to our understanding of the early presentation of RA. By searching for keywords, it was found that additional information was hidden in the text. For example, keywords relating to inflammatory arthritis were present in an additional 14% of patients where coded information relating to inflammatory arthritis was absent; keywords relating to synovitis were found in an additional 17% where synovitis codes were absent, and keywords for rheumatoid factor test were found in an extra 12% of cases where codes for a test were absent. The rheumatoid factor test figures are complicated by the fact that only positive results were searched for in text. The text could have reported additional tests for which no result was recorded, or which were negative, but which were not picked up in the keyword search. This extra information occurred most often close to the time of diagnosis but was present throughout the study period. Time intervals between indicator code groups and the first RA diagnostic code were similar to intervals between the keywords and the RA code, as would be expected in the recording of the same type of information.
The Read codes associated with keywords were not readily predictable. Of the top 35 codes which had keywords in the free text associated with them, only 9 were our pre-identified RA-specific indicator codes. Instead, keywords were often associated with administrative codes for referrals and letters or communications from specialists. This makes sense within the context of a disease which presents in primary care but, because of diagnostic uncertainty, generally results in a referral followed by confirmation of diagnosis and development of a management plan within secondary care. This association of text information with communication-type codes has also been found in studies of other diseases, for example ovarian cancer. Much of the free text regarding these conditions is likely to be found in letters between GPs and specialists which are appended to the record under more general codes.
Strengths and limitations of our study
This study offers one of the biggest sample sizes of RA patients in the literature and allowed a detailed look at the diagnostic process in primary care, which is missing from the literature. There are few publications, for example, on the proportion of musculoskeletal patients referred over time from primary to secondary care [9, 22]. It is also among the first to try to quantify the amount of additional relevant information available in free text. However, a major limitation of this work is that we did not look at the text directly, due to the costs of anonymisation, and therefore were not able to allow for negation or other qualifiers surrounding keywords. It is therefore feasible that some of the occurrences of the keywords record an absence, such as "no evidence of synovitis", or that the term relates to another person, for instance "mother had a polyarthropathy". We may therefore be over-estimating the extent of relevant information held in text. One study, for example, found that specificity of case finding dropped from 98.2% to 38.3% when negation terms were not included in the text search. It should be noted, however, that the presence of the keyword indicates that an inflammatory arthritis is being considered or discussed with the patient, and the clustering around the time of diagnosis suggests that many of these terms will apply to the patients. Even if only half of the keywords occurring in patients without any indicator markers were related to the actual presence of, for example, synovitis in the patient, this would still increase the prevalence of synovitis by more than 8%. Despite the lack of qualifiers and negation, automated keyword searching could also be a useful tool for selecting a smaller set of cases whose records could then be manually scrutinised for specific terms.
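To make the negation problem concrete, the following is a minimal sketch of a keyword search that checks a short window of preceding words for negation or family-member cues. It is not the method used in this study, which searched for keywords without access to the surrounding text. The keyword list, the cue lists, the window size and the function name `classify_mentions` are all assumptions chosen for illustration.

```python
import re

# Hypothetical keyword and cue lists, for illustration only.
KEYWORDS = {"synovitis", "polyarthropathy"}
NEGATION_CUES = ["no", "not", "without", "denies", "no evidence of"]
OTHER_PERSON_CUES = ["mother", "father", "sister", "brother", "family history"]


def _cue_pattern(cues):
    # Match any cue as a whole word or phrase.
    return re.compile(r"\b(?:" + "|".join(re.escape(c) for c in cues) + r")\b")


NEGATION_RE = _cue_pattern(NEGATION_CUES)
OTHER_PERSON_RE = _cue_pattern(OTHER_PERSON_CUES)


def classify_mentions(text, window=4):
    """Label each keyword mention as 'negated', 'other_person' or 'affirmed'.

    Only the `window` words immediately before the keyword are inspected.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    results = []
    for i, token in enumerate(tokens):
        if token not in KEYWORDS:
            continue
        preceding = " ".join(tokens[max(0, i - window):i])
        if NEGATION_RE.search(preceding):
            label = "negated"
        elif OTHER_PERSON_RE.search(preceding):
            label = "other_person"
        else:
            label = "affirmed"
        results.append((token, label))
    return results


print(classify_mentions("No evidence of synovitis"))
print(classify_mentions("Mother had a polyarthropathy"))
print(classify_mentions("Marked synovitis of both wrists"))
```

A fixed word window of this kind ignores sentence boundaries and cues that follow the keyword, so it would misclassify some entries; it is intended only to show the type of qualifier that a plain keyword count cannot see.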
The selection of codes for the indicators and the keywords for the searches is critical to the validity of this work. The development of the indicator markers was a rigorous process that has been described in full elsewhere. Similarly, we tried to triangulate the information we used when preparing the keyword lists in order to allow for as many alternative expressions and misspellings as possible. One possible explanation for the extra information in text is that we selected the wrong codes for the disease indicators, thereby missing important coded information. However, from the association between keywords and communication/letter codes as well as sick note codes (e.g. MED3 – doctor’s statement) it seems that information is often put in text alongside a more generic code. The process of entering communication received from hospitals is not managed in a standard way by GP practices. Sometimes letters are scanned and added to the records as a PDF file and are therefore not searchable in the database. In other cases the entire letter is entered into the free text section and can be searched. Another issue is that the transmission of free text from the practice to the GPRD can be suppressed by the GP using a double backslash at the start of the entry. This is unlikely to affect letters, but results in an unknown amount of free text relating to clinical consultations being withheld, again affecting estimates of the amount of information available. There are therefore likely to be practice-level differences in the availability of the free text, which will again lead to an under-estimation of keyword occurrence and which also have implications for technologies to increase access to textual data. It would also be worth extending the keyword list to include other indicators such as DMARDs and referrals, and further work will include these in searches of free text.
For free text information entered by GPs in the course of their consultation, there is likely to be a wide range of ways to express similar concepts and it is known that many entries have spelling errors or use abbreviations. We only picked up the most frequent misspellings and abbreviations in the keyword specifications. This would lead to an under-estimate of the occurrence of keywords in the record. A full exploration of the free text by hand is planned and will help us to understand more about how information is entered by GPs in the course of their consultations, including understanding more fully the range of abbreviations used and the different ways that signs and symptoms may be described. Qualifiers and negation will be taken into account during this process, resulting in a highly accurate estimation of the information held in free text about RA presentation and symptoms.
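One way to reduce the under-estimate caused by unlisted misspellings is approximate string matching. The sketch below uses `difflib.get_close_matches` from the Python standard library to map each word in an entry to the closest keyword above a similarity cutoff. It is an illustration only and not the approach used in this study; the keyword list, the cutoff of 0.85 and the function name `fuzzy_keyword_hits` are assumptions.

```python
import difflib
import re

# Hypothetical keyword list, for illustration only.
KEYWORDS = ["synovitis", "rheumatoid", "arthritis"]


def fuzzy_keyword_hits(text, cutoff=0.85):
    """Return (word_in_text, matched_keyword) pairs for near matches.

    `cutoff` is the minimum similarity ratio (0 to 1) required for a match.
    """
    hits = []
    for token in re.findall(r"[a-z]+", text.lower()):
        match = difflib.get_close_matches(token, KEYWORDS, n=1, cutoff=cutoff)
        if match:
            hits.append((token, match[0]))
    return hits


print(fuzzy_keyword_hits("pt has synovits and rhuematoid arthitis"))
```

Similarity matching of this kind catches transposed or dropped letters but not abbreviations, which bear little surface resemblance to the full term and would still need to be listed explicitly or learned from the data.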
A further limitation of this study is that we have not yet investigated how often these keywords occur in control data, that is, in patients with no RA diagnostic code. There is a theoretical possibility that the distribution of these keywords would be the same in control cases as it is for RA cases. Future work will address this possibility by comparing rates of indicators and keywords in control data to ascertain their predictive value for finding cases of RA.
How results fit with other literature
Other authors have also highlighted the potential deficits of coded data in epidemiological studies. Using live clinical data such as the GPRD for epidemiological studies requires mass application of case identification criteria, rather than examining each case individually. This can lead to high, or unknown, rates of misclassification of cases, which bias the outcome of studies, especially those examining rates of certain tests or treatments. Studies which define cases using only diagnostic codes may miss cases where the diagnostic information is held in free text or coded several weeks after the diagnosis has been received. A further issue is the unknown quality of consultation recording and coding, which is poorly established in the literature. It appears this may vary not only between practitioners and practices but also between diseases. GPs may regularly use the codes most readily available in the system even if they are inappropriate, and express the clinical details in free text descriptions. Free text has been used for case finding and to assess quality of care in complex conditions such as diabetes and cancer [24, 25]. Several authors have shown that including data from free text increases case ascertainment for both acute conditions such as respiratory infections and chronic diseases such as angina [26–28] as well as RA, and can enhance estimates of symptoms in cancer presentation by 40%.
Ethnographic studies have the potential to help us understand how social practices shape the records we use for research. We need field studies on the use of electronic record systems, in order to understand why coding and free text are used as they are. Records are not created by a single person but rather by collaborative work practices that are carried out for complex reasons. There is additionally a tension between the use of records by health-care providers, who value flexibility and expressivity, and their use by researchers, who value structure and categorisation. Early findings from the human-computer interaction work-strand of our project show that doctors often choose not to use specific diagnostic codes early in the disease process. Sometimes there is clinical uncertainty, but sometimes coding structures do not facilitate the recording of precise clinical findings and doctors need “exit strategies” to be able to report unexpected clinical exceptions. Doctors’ concerns are more centred on creating records that are useful to them and their team at the point of care, rather than on creating records that will be accessible for secondary uses. There are a number of influences that affect the degree of coding used and the choice of codes, and these operate at policy, local, system and individual levels.
Implications of our findings and further work
We deliberately chose a complex non-incentivised condition which posed a considerable challenge to recording in code, so our findings may not be generalisable to other more clear-cut or incentivised conditions. A systematic review of quality of coding suggested that completeness of coding may be related to distinctiveness of diagnosis. Our results suggest that cases may be missed if coded data alone are used to identify patients with possible rheumatoid arthritis before a definitive diagnosis is recorded. For epidemiological studies, an estimate of false negatives (that is, patients with the disease but not identified by the case finding algorithm) is useful to give an indication of bias within the study. Including free text in case finding algorithms may increase the potential for identifying patients without diagnostic codes in these studies, thereby reducing bias. If so, it becomes imperative that systematic ways of automatically extracting and assessing information in free text are developed.
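The false negative estimate described above can be written as the proportion of reference cases that a case finding algorithm fails to identify. The sketch below compares a code-only algorithm with one that also uses free text. The patient identifiers are invented for illustration and are not study data, and the function name `false_negative_rate` is an assumption.

```python
def false_negative_rate(reference_cases, identified_cases):
    """Proportion of reference (true) cases missed by a case finding algorithm."""
    reference = set(reference_cases)
    if not reference:
        raise ValueError("reference_cases must not be empty")
    missed = reference - set(identified_cases)
    return len(missed) / len(reference)


# Hypothetical patient identifiers, invented for illustration only.
reference = {1, 2, 3, 4, 5}   # patients known to have the disease
found_by_codes = {1, 2, 3}    # identified from coded data alone
found_by_text = {4}           # additionally identified from free text

print(false_negative_rate(reference, found_by_codes))
print(false_negative_rate(reference, found_by_codes | found_by_text))
```

In practice the reference set is itself hard to obtain, which is why the rate of missed cases in database studies is often unknown rather than merely high.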
We found no evidence of differences between men and women in the balance of coded and textual data or in the timing of recording. Hence, although data based on codes may be incomplete, in this initial investigation there was no evidence of biased recording by gender or timing. This needs to be explored for other patient characteristics. The possibility remains of systematic differences in the way information is recorded across social groups or different co-morbidities, and this would have important implications for secondary use of such clinical databases.
The greatest hurdle to the more widespread use of text is the technological challenge of automating or semi-automating its processing. We have laid the basis for methods that will allow us to further investigate extracting information concealed in free text. It is of interest that much of the keyword information was found in letters from specialists and other referral communication type text. Letters are much easier to process using computer algorithms than GPs’ clinical notes, due to fewer idiosyncrasies and abbreviations in the language used, although consultation notes will still need to be scrutinised to extract information such as presenting symptoms. In future work we will add negation detection algorithms and model the context in which the keyword occurs, as well as expanding the indicators which are searched for in free text. We have obtained promising initial results in pilot experiments into deriving abbreviations and synonyms of indicators using unsupervised machine learning techniques. Other groups have had success with various text-processing algorithms in identifying RA cases and have even found that these algorithms are portable between settings [18, 19]. We will also investigate methods to automate the process of augmenting the initial keyword list using sample data and resources like UMLS. Once full information has been extracted from the free text, we will apply statistical methods such as cluster analysis to combinations of coded and textual information to estimate which are the best to use for probabilistic case definition for RA. These search algorithms can then be tested on “control” data where no diagnostic code for RA exists, to verify their ability to find cases using contextual information. These methodologies may extend to other complex, non-incentivised diseases and may be useful for case definition in general for studies using EHRs.