Table 1 Main characteristics of the de-identification tools

From: Evaluating current automatic de-identification methods with Veteran’s health administration clinical documents

| Category | Item | HMS Scrubber | MeDS | MIT deid | MIST | HIDE |
|---|---|---|---|---|---|---|
| Main technique | Rule-based | X | X | X | n/a | n/a |
| | ML-based | n/a | n/a | n/a | X | X |
| Programming language | | Java | Java | Perl | Python | Python |
| ML algorithm | | n/a | n/a | n/a | CRF (Carafe) | CRF (CRFsuite) |
| Input documents | | XML/txt | HL7/txt | txt | txt/XML-inline/json | XML/txt/HL7 |
| HIPAA compliant | | X | X | X | note 1 | note 1 |
| Regular expressions (#) | | ~50 | ~40 | ~90 | note 2 | note 2 |
| PHI markers (e.g., Mr.) | | X | X | X | note 3 | -- |
| Part-of-speech information | | -- | X | -- | -- | -- |
| String similarity techniques (e.g., edit distance, fuzzy matching) | | -- | X | -- | -- | -- |
| Dictionaries* (size) | Person names | ~101K | ~280K | ~96K (note 4) | -- | -- |
| | Geographic places | | ~167K | ~4K | -- | -- |
| | US area code | -- | -- | ~380 | -- | -- |
| | Medical phrases | -- | ~50 | ~28 | -- | -- |
| | Medical terms | -- | ~80K | ~175K | -- | -- |
| | Companies | -- | ~200 | ~500 | -- | -- |
| | Ethnicities | -- | ~120 | ~195 | -- | -- |
| | Common words | -- | ~220K | ~50K | -- | -- |
| Machine learning features | Contextual window | n/a | n/a | n/a | 3-words | 4-words |
| | Morphological (#) | n/a | n/a | n/a | 22 | 34 |
| | Syntactic | n/a | n/a | n/a | -- | -- |
| | Semantic | n/a | n/a | n/a | -- | -- |
| | From dictionaries | n/a | n/a | n/a | note 5 | note 5 |
  * HMS Scrubber’s dictionary sources: 1990 US Census (person and place names).
  * MeDS’ dictionary sources: Ispell, 2005 SS Death Index, Regenstrief Medical Record System, UMLS, MeSH.
  * MIT deid’s dictionary sources: 1990 US Census, MIMIC II Database, Atkinson’s Spell Checking Oriented Word Lists, UMLS.
  1. HIPAA compliance depends on the types of PHI instances used for training.
  2. Both MIST and HIDE use regular expressions to derive the morphological features from the tokens (e.g., all_caps_token ‘^[A-Z]+$’).
  3. Within MIST, PHI markers are used only for detecting companies (e.g., “Ltd.”).
  4. Person name dictionaries comprise lists of names, last names, and name prefixes.
  5. These systems are tailored to derive features from dictionaries; however, they are not distributed with any default dictionary.
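
As a rough illustration of the regular-expression-derived morphological features and the contextual window listed in the table, the sketch below tags each token with the patterns it matches and then collects those tags over neighbouring tokens. It is not taken from MIST or HIDE: only the `all_caps_token` name and its pattern come from the note above, while the other feature names, the function names, and the reading of the window as "tokens on each side" are assumptions made for this example.

```python
import re

# Only `all_caps_token` and its pattern come from the table notes.
# The other two entries are hypothetical examples of similar features.
MORPHOLOGICAL_PATTERNS = {
    "all_caps_token": re.compile(r"^[A-Z]+$"),
    "init_cap_token": re.compile(r"^[A-Z][a-z]+$"),   # hypothetical
    "all_digits_token": re.compile(r"^[0-9]+$"),      # hypothetical
}


def morphological_features(token):
    """Return the names of all patterns that the token matches."""
    return [name for name, pattern in MORPHOLOGICAL_PATTERNS.items()
            if pattern.match(token)]


def token_features(tokens, index, window=3):
    """Collect features for tokens[index] and its neighbours.

    `window` is assumed here to mean the number of tokens considered on
    each side of the current token. Feature keys are prefixed with the
    signed offset of the token they were derived from.
    """
    features = {}
    for offset in range(-window, window + 1):
        position = index + offset
        if 0 <= position < len(tokens):
            for name in morphological_features(tokens[position]):
                features[f"{offset:+d}:{name}"] = True
    return features
```

A feature dictionary of this kind is the sort of per-token input a CRF toolkit typically consumes; the actual feature sets used by MIST (22 morphological features) and HIDE (34) are not enumerated in the table.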