
Table 1 Main characteristics of the de-identification tools

From: Evaluating current automatic de-identification methods with Veteran’s health administration clinical documents

  

| Characteristic | Detail | HMS Scrubber | MeDS | MIT deid | MIST | HIDE |
|---|---|---|---|---|---|---|
| Main technique | Rule-based | X | X | X | n/a | n/a |
| | ML-based | n/a | n/a | n/a | X | X |
| Programming language | | Java | Java | Perl | Python | Python |
| ML algorithm | | n/a | n/a | n/a | CRF (Carafe) | CRF (CRFsuite) |
| Input documents | | XML/txt | HL7/txt | txt | txt/XML-inline/json | XML/txt/HL7 |
| HIPAA compliant | | X | X | X | ¹ | ¹ |
| Regular Expressions (#) | | ~50 | ~40 | ~90 | ² | ² |
| PHI markers (e.g., Mr.) | | X | X | X | ³ | -- |
| Part-of-speech information | | -- | X | -- | -- | -- |
| String similarity techniques (e.g., edit distance, fuzzy matching) | | -- | X | -- | -- | -- |
| Dictionaries* (size) | Person names | ~101K | ~280K | ~96K⁴ | -- | -- |
| | Geographic places | | ~167K | ~4K | -- | -- |
| | US area code | -- | -- | ~380 | -- | -- |
| | Medical phrases | -- | ~50 | ~28 | -- | -- |
| | Medical terms | -- | ~80K | ~175K | -- | -- |
| | Companies | -- | ~200 | ~500 | -- | -- |
| | Ethnicities | -- | ~120 | ~195 | -- | -- |
| | Common words | -- | ~220K | ~50K | -- | -- |
| Machine Learning features | Contextual window | n/a | n/a | n/a | 3-words | 4-words |
| | Morphological (#) | n/a | n/a | n/a | 22 | 34 |
| | Syntactic | n/a | n/a | n/a | -- | -- |
| | Semantic | n/a | n/a | n/a | -- | -- |
| | From dictionaries | n/a | n/a | n/a | ⁵ | ⁵ |

  - *HMS Scrubber’s dictionary sources: 1990 US Census (person and place names).
  - *MeDS’ dictionary sources: Ispell, 2005 SS Death Index, Regenstrief Medical Record System, UMLS, MeSH.
  - *MIT deid’s dictionary sources: 1990 US Census, MIMIC II Database, Atkinson’s Spell Checking Oriented Word Lists, UMLS.
  - ¹ HIPAA compliance depends on the types of PHI instances used for training.
  - ² Both MIST and HIDE use regular expressions to derive the morphological features from the tokens (e.g., all_caps_token ‘^[A-Z]+$’).
  - ³ Within MIST, PHI markers are used only for detecting companies (e.g., “Ltd.”).
  - ⁴ Person names dictionaries comprise lists of first names, last names and name prefixes.
  - ⁵ These systems are designed to derive features from dictionaries, but they are not distributed with any default dictionary.
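
To make the "Machine Learning features" rows more concrete, the following is a minimal sketch of how regular-expression-based morphological features and a contextual window could be combined into a per-token feature set of the kind a CRF consumes. It is not the code of MIST or HIDE. Only the `all_caps_token` pattern comes from the table notes; the other pattern names, the function names, and the reading of a "3-word" window as three tokens on each side are assumptions made for illustration.

```python
import re

# Only all_caps_token is taken from the table notes; the other two
# patterns are hypothetical examples of morphological features.
MORPH_PATTERNS = {
    "all_caps_token": re.compile(r"^[A-Z]+$"),
    "init_cap_token": re.compile(r"^[A-Z][a-z]+$"),
    "all_digits_token": re.compile(r"^[0-9]+$"),
}


def morphological_features(token):
    """Return the names of all patterns that the token matches."""
    return [name for name, pattern in MORPH_PATTERNS.items() if pattern.match(token)]


def token_features(tokens, i, window=3):
    """Build a feature dict for tokens[i] from its neighbours.

    Assumption: `window` counts tokens on each side of position i.
    """
    features = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            # Lower-cased word identity at this relative position.
            features[f"w[{offset}]"] = tokens[j].lower()
            # Morphological flags at this relative position.
            for name in morphological_features(tokens[j]):
                features[f"{name}[{offset}]"] = True
    return features


if __name__ == "__main__":
    sentence = ["Mr.", "Smith", "was", "seen", "at", "VA"]
    print(token_features(sentence, 1, window=1))
```

One feature dict per token, in sentence order, is the usual input shape for CRF toolkits; dictionary-derived features (note ⁵) could be added in the same way by testing each token against a user-supplied word list.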