
John Snow Labs
Automated De-Identification of Clinical Text Datasets
Pages
13
Time to read
34 mins
Publication
Language
English

This technical report presents findings on the automated de-identification of clinical text datasets, based on a system that has de-identified over one billion clinical notes. It outlines the challenges of achieving high accuracy in real-world settings and describes a hybrid context-based model architecture that outperforms traditional Named Entity Recognition (NER) models. The system is certified for production use and shows a significant reduction in errors compared with major cloud services. It achieves over 98% coverage of sensitive data across multiple languages without requiring fine-tuning. The report details the architecture of the de-identification pipeline, whose stages include text pre-processing, named entity recognition, and data obfuscation. It emphasizes that automating the identification of protected health information (PHI) makes clinical text available for research while preserving patient privacy. The findings underscore the healthcare sector's need for anonymized documents that are both reliable and linked.
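The three pipeline stages named above (text pre-processing, named entity recognition, data obfuscation) can be illustrated with a minimal Python sketch. This is not the report's system: the regular expressions are a deliberately simple stand-in for its hybrid context-based model, and every name here (`PATTERNS`, `preprocess`, `detect_entities`, `surrogate`, `obfuscate`, `deidentify`, the `demo-salt` value) is hypothetical. The salted-hash surrogate is one assumed way to keep obfuscated values consistent, so that the same PHI value maps to the same placeholder wherever it appears.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int
    end: int
    label: str
    text: str


# Hypothetical patterns standing in for a trained NER model.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,}\b"),
}


def preprocess(text: str) -> str:
    """Stage 1: collapse runs of whitespace so offsets are stable."""
    return re.sub(r"\s+", " ", text).strip()


def detect_entities(text: str) -> list[Entity]:
    """Stage 2: find candidate PHI spans (regex stand-in for NER)."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append(Entity(match.start(), match.end(), label, match.group()))
    return sorted(entities, key=lambda e: e.start)


def surrogate(entity: Entity, salt: str = "demo-salt") -> str:
    """Map the same label and value to the same placeholder every time."""
    payload = (salt + entity.label + entity.text).encode("utf-8")
    return f"<{entity.label}_{hashlib.sha256(payload).hexdigest()[:8]}>"


def obfuscate(text: str, entities: list[Entity]) -> str:
    """Stage 3: replace each detected span with its surrogate."""
    pieces = []
    cursor = 0
    for entity in entities:
        if entity.start < cursor:
            continue  # skip spans that overlap an already replaced one
        pieces.append(text[cursor:entity.start])
        pieces.append(surrogate(entity))
        cursor = entity.end
    pieces.append(text[cursor:])
    return "".join(pieces)


def deidentify(text: str) -> str:
    """Run the three stages in order."""
    clean = preprocess(text)
    return obfuscate(clean, detect_entities(clean))


if __name__ == "__main__":
    note = "Seen on 03/14/2021.  Call 555-123-4567.\nMRN: 00123456. Follow-up 03/14/2021."
    print(deidentify(note))
```

In this sketch both occurrences of the date receive an identical placeholder, which is one simple reading of "linked" anonymized output. A production system would need context-aware detection and realistic surrogate values rather than hashed tags.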