De-identification system | Machine learning algorithm | Features | ||
---|---|---|---|---|
 |  | Lexical/morphological | Syntactic | Semantic |
Aramaki | CRF | Word, surrounding words (5 words window), capitalization, word length, regular expressions (date, phone), sentence position and length. | POS (word + 2 surrounding words) | Dictionary terms (names, locations) |
Gardner | CRF | Word lemma, capitalization, numbers, prefixes/suffixes, 2-3 character n-grams | POS (word) | None |
Guo | SVM | Word, capitalization, prefixes/suffixes, word length, numbers, regular expressions (date, ID, phone, age) | POS (word) | Entities extracted by ANNIE (doctors, hospitals, locations) |
Hara | SVM | Word, lemma, capitalization, regular expressions (phone, date, ID) | POS (word) | Section headings |
Szarvas | Decision Tree | Word length, capitalization, numbers, regular expressions (age, date, ID, phone), token frequency | None | Dictionary terms (first names, US locations, countries, cities, diseases, non-PHI terms), section heading. |
Taira | Maximum Entropy | Capitalization, punctuation, numbers, regular expressions (prefixes, physician and hospital name, syndrome/disease/procedure) | POS (word) | Semantic lexicon, dictionary terms (proper names, prefixes, drugs, devices), semantic selectional restrictions |
Uzuner | SVM | Word, lexical bigrams, capitalization, punctuation, numbers, word length. | POS (word + 2 surrounding words), syntactic bigrams (link grammar) | MeSH ID, dictionary terms (names, US and world locations, hospital names), section headers. |
Wellner | CRF | Word unigrams/bigrams, surrounding words (3 words window), prefixes/suffixes, capitalization, numbers, regular expressions (phone, ID, zip, date, locations/hospitals) | None | Dictionary terms (US states, months, general English terms). |