
Table 4 Algorithms and features used by systems mostly based on machine learning methods.

From: Automatic de-identification of textual documents in the electronic health record: a review of recent research

| De-identification system | Machine learning algorithm | Lexical/morphological features | Syntactic features | Semantic features |
| --- | --- | --- | --- | --- |
| Aramaki | CRF | Word, surrounding words (5-word window), capitalization, word length, regular expressions (date, phone), sentence position and length | POS (word + 2 surrounding words) | Dictionary terms (names, locations) |
| Gardner | CRF | Word lemma, capitalization, numbers, prefixes/suffixes, 2-3 character n-grams | POS (word) | None |
| Guo | SVM | Word, capitalization, prefixes/suffixes, word length, numbers, regular expressions (date, ID, phone, age) | POS (word) | Entities extracted by ANNIE (doctors, hospitals, locations) |
| Hara | SVM | Word, lemma, capitalization, regular expressions (phone, date, ID) | POS (word) | Section headings |
| Szarvas | Decision Tree | Word length, capitalization, numbers, regular expressions (age, date, ID, phone), token frequency | None | Dictionary terms (first names, US locations, countries, cities, diseases, non-PHI terms), section heading |
| Taira | Maximum Entropy | Capitalization, punctuation, numbers, regular expressions (prefixes, physician and hospital name, syndrome/disease/procedure) | POS (word) | Semantic lexicon, dictionary terms (proper names, prefixes, drugs, devices), semantic selectional restrictions |
| Uzuner | SVM | Word, lexical bigrams, capitalization, punctuation, numbers, word length | POS (word + 2 surrounding words), syntactic bigrams (link grammar) | MeSH ID, dictionary terms (names, US and world locations, hospital names), section headers |
| Wellner | CRF | Word unigrams/bigrams, surrounding words (3-word window), prefixes/suffixes, capitalization, numbers, regular expressions (phone, ID, zip, date, locations/hospitals) | None | Dictionary terms (US states, months, general English terms) |

CRF = Conditional Random Fields; SVM = Support Vector Machine; POS = Part-of-speech.
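
The lexical/morphological features listed in the table (the word itself, capitalization, word length, numbers, prefixes/suffixes, regular-expression matches, surrounding words) are all computed per token. As a rough illustration of what that might look like, here is a minimal Python sketch. The function name `token_features`, the two regular expressions, the window size, the affix length, and the feature names are assumptions made for this example; none of them is taken from the systems in the table.

```python
import re

# Illustrative patterns only (assumption): the reviewed systems each used
# their own, more elaborate expressions.
DATE_RE = re.compile(r"^\d{1,2}[/-]\d{1,2}([/-]\d{2,4})?$")
PHONE_RE = re.compile(r"^\(?\d{3}\)?[-.]?\d{3}[-.]\d{4}$")


def token_features(tokens, i, window=2, affix_len=3):
    """Return a dict of lexical/morphological features for tokens[i].

    Hypothetical helper: feature names and defaults are chosen for
    illustration, not copied from any published system.
    """
    word = tokens[i]
    feats = {
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "length": len(word),
        "has_digit": any(ch.isdigit() for ch in word),
        "prefix": word[:affix_len].lower(),
        "suffix": word[-affix_len:].lower(),
        "looks_like_date": bool(DATE_RE.match(word)),
        "looks_like_phone": bool(PHONE_RE.match(word)),
    }
    # Surrounding words within the window; positions outside the
    # sentence are filled with a padding symbol.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        in_range = 0 <= j < len(tokens)
        feats[f"word[{offset:+d}]"] = tokens[j].lower() if in_range else "<PAD>"
    return feats


tokens = ["Seen", "by", "Dr.", "Smith", "on", "03/12/2004"]
smith = token_features(tokens, 3)
date = token_features(tokens, 5)
```

A feature dictionary of this kind could then be passed to a sequence labeler such as a CRF, or vectorized for an SVM or decision tree. Syntactic and semantic features (POS tags, dictionary lookups, section headings) would be added to the same dictionary from separate resources.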