Unsupervised anomaly detection of implausible electronic health records: a real-world evaluation in cancer registries

Table 3 Number of selected and implausible records, overall and per tumor localization, in each sample. FindFPOF and the autoencoder achieved a higher precision than the random baseline, both overall and for each tumor localization. In the random sample, \(8\%\) of all records and \(2\%\) of the breast records were implausible; in the autoencoder sample, \(28\%\) of all records and \(10\%\) of the breast records were implausible. FindFPOF and the autoencoder selected more records from the localizations that had a higher percentage of implausible records in the random sample. In the random sample, \(18\%\) of the colorectal records were implausible, compared with only \(2\%\) of the breast records. Accordingly, approximately two-thirds of the records returned by the autoencoder and FindFPOF were colorectal, and approximately one-third of those colorectal records were implausible. In contrast, the samples returned by the autoencoder and FindFPOF contain a lower percentage of breast records (\(28\%\) and \(14\%\), respectively) than both the random sample and the full dataset (\(58\%\) and \(54\%\), respectively). For each sample, the tumor localizations with the highest number of selected and implausible records are highlighted.

Sample	Records	Measure	Tumor localization
			All	Breast	Colorectal	Prostate
Full dataset	All	n \(\left( \frac{n}{n_{all}} \right)\)	21,104 (100%)	11,573 (54%)	6995 (34%)	2536 (12%)
Random sample	Selected	n \(\left( \frac{n}{n_{all}} \right)\)	300 (100%)	172 (58%)	87 (28%)	41 (14%)
	Implausible	\(\#impl\) \(\left( \text{ precision: } \frac{\#impl}{n}\right)\)	23 (8%)	4 (2%)	16 (18%)	3 (8%)
Autoencoder sample	Selected	n \(\left( \frac{n}{n_{all}}\right)\)	300 (100%)	85 (28%)	193 (64%)	22 (8%)
	Implausible	\(\#impl\) \(\left( \text{ precision: } \frac{\#impl}{n} \right)\)	83 (28%)	9 (10%)	67 (34%)	7 (32%)
FindFPOF sample	Selected	n \(\left( \frac{n}{n_{all}}\right)\)	300 (100%)	40 (14%)	200 (66%)	60 (20%)
	Implausible	\(\#impl\) \(\left( \text{ precision: } \frac{\#impl}{n} \right)\)	83 (28%)	3 (8%)	65 (32%)	15 (24%)
All samples	Selected	Total (different)	900 (785)	297 (266)	480 (406)	123 (113)
	Implausible	Total (different)	189 (157)	16 (14)	148 (124)	25 (19)
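
The precision and localization shares reported in Table 3 follow directly from the counts. The sketch below is illustrative only: it recomputes both quantities from the counts transcribed from the table, and the names `COUNTS`, `precision`, and `share` are hypothetical helpers, not part of the study's code. The unrounded fractions may differ slightly from the percentages printed in the table.

```python
# Counts transcribed from Table 3 as (selected n, implausible #impl)
# for each sample and tumor localization.
COUNTS = {
    "Random": {
        "All": (300, 23), "Breast": (172, 4),
        "Colorectal": (87, 16), "Prostate": (41, 3),
    },
    "Autoencoder": {
        "All": (300, 83), "Breast": (85, 9),
        "Colorectal": (193, 67), "Prostate": (22, 7),
    },
    "FindFPOF": {
        "All": (300, 83), "Breast": (40, 3),
        "Colorectal": (200, 65), "Prostate": (60, 15),
    },
}


def precision(sample, localization):
    """Fraction of selected records judged implausible (#impl / n)."""
    selected, implausible = COUNTS[sample][localization]
    return implausible / selected


def share(sample, localization):
    """Fraction of a sample's records that fall in one localization (n / n_all)."""
    selected, _ = COUNTS[sample][localization]
    total, _ = COUNTS[sample]["All"]
    return selected / total


if __name__ == "__main__":
    for sample, rows in COUNTS.items():
        for loc in rows:
            print(
                f"{sample:12s} {loc:11s} "
                f"share={share(sample, loc):.3f} "
                f"precision={precision(sample, loc):.3f}"
            )
```

Comparing `precision("Autoencoder", "All")` or `precision("FindFPOF", "All")` with `precision("Random", "All")` gives the overall gain of each method over the random baseline.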

ISSN: 1471-2288