
Table 4 Algorithms and features used by systems mostly based on machine learning methods.

From: Automatic de-identification of textual documents in the electronic health record: a review of recent research

De-identification system | Machine learning algorithm | Lexical/morphological features | Syntactic features | Semantic features
Aramaki | CRF | Word, surrounding words (5 words window), capitalization, word length, regular expressions (date, phone), sentence position and length | POS (word + 2 surrounding words) | Dictionary terms (names, locations)
Gardner | CRF | Word lemma, capitalization, numbers, prefixes/suffixes, 2-3 character n-grams | POS (word) | None
Guo | SVM | Word, capitalization, prefixes/suffixes, word length, numbers, regular expressions (date, ID, phone, age) | POS (word) | Entities extracted by ANNIE (doctors, hospitals, locations)
Hara | SVM | Word, lemma, capitalization, regular expressions (phone, date, ID) | POS (word) | Section headings
Szarvas | Decision Tree | Word length, capitalization, numbers, regular expressions (age, date, ID, phone), token frequency | None | Dictionary terms (first names, US locations, countries, cities, diseases, non-PHI terms), section heading
Taira | Maximum Entropy | Capitalization, punctuation, numbers, regular expressions (prefixes, physician and hospital name, syndrome/disease/procedure) | POS (word) | Semantic lexicon, dictionary terms (proper names, prefixes, drugs, devices), semantic selectional restrictions
Uzuner | SVM | Word, lexical bigrams, capitalization, punctuation, numbers, word length | POS (word + 2 surrounding words), syntactic bigrams (link grammar) | MeSH ID, dictionary terms (names, US and world locations, hospital names), section headers
Wellner | CRF | Word unigrams/bigrams, surrounding words (3 words window), prefixes/suffixes, capitalization, numbers, regular expressions (phone, ID, zip, date, locations/hospitals) | None | Dictionary terms (US states, months, general English terms)
CRF = Conditional Random Fields; SVM = Support Vector Machine; POS = Part-of-speech
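
To make the lexical/morphological and dictionary feature types listed in the table more concrete, the following is a minimal Python sketch of per-token feature extraction. It does not reproduce any of the reviewed systems: the function name `token_features`, the regular expressions, the toy `NAME_DICTIONARY`, and the window size are all assumptions chosen for illustration, since the table does not specify the actual patterns or dictionaries each system used.

```python
import re

# Illustrative patterns only (assumptions); the table does not give the
# actual regular expressions used by any of the reviewed systems.
DATE_RE = re.compile(r"^\d{1,2}[/-]\d{1,2}([/-]\d{2,4})?$")
PHONE_RE = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")

# Hypothetical toy dictionary standing in for a name/location term list.
NAME_DICTIONARY = {"smith", "boston"}


def token_features(tokens, i, window=2):
    """Build a feature dict for tokens[i] from the feature types in the table."""
    word = tokens[i]
    feats = {
        # Lexical/morphological features
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(c.isdigit() for c in word),
        "length": len(word),
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),
        "matches_date": bool(DATE_RE.match(word)),
        "matches_phone": bool(PHONE_RE.match(word)),
        # Semantic feature: dictionary lookup
        "in_dictionary": word.lower() in NAME_DICTIONARY,
    }
    # Surrounding words within a fixed window; positions past the
    # sentence boundary are filled with a padding symbol.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"word[{offset:+d}]"] = tokens[j].lower() if inside else "<PAD>"
    return feats


tokens = ["Dr.", "Smith", "called", "on", "12/03/2004"]
features = [token_features(tokens, i) for i in range(len(tokens))]
```

Feature dictionaries of this shape could then be passed to a sequence labeler such as a CRF, or vectorized for a classifier such as an SVM. Syntactic features such as POS tags would require a separate tagger and are omitted here.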