John Snow Labs

Automated De-Identification of Clinical Text Datasets

Time to read: 34 mins

Publication: 12/15/23

Language: English

Summary

This technical report presents findings on the automated de-identification of clinical text datasets, based on a system that has de-identified over one billion clinical notes. It outlines the challenges of achieving high accuracy in real-world settings and describes a hybrid, context-based model architecture that outperforms traditional Named Entity Recognition (NER) models. The system is certified for production use and makes significantly fewer errors than major cloud services, covering over 98% of sensitive data across multiple languages without fine-tuning. The report details the architecture of the de-identification pipeline, whose stages include text pre-processing, named entity recognition, and data obfuscation. It emphasizes that automating the identification of protected health information (PHI) facilitates research while protecting patient privacy. The findings underscore the healthcare sector's need for anonymized documents that are both reliable and linked.
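
To make the three pipeline stages named in the summary concrete, here is a minimal Python sketch of pre-processing, entity detection, and obfuscation. It is an illustration under stated assumptions, not the system the report describes: the regular expressions in `PATTERNS` are hypothetical stand-ins for a trained NER model, the function names are invented for this example, and masking with a label placeholder is only one possible form of obfuscation.

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:
    label: str
    start: int
    end: int


# Hypothetical toy patterns standing in for a trained NER model (assumption).
PATTERNS = {
    "NAME": re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\. [A-Z][a-z]+\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def preprocess(text: str) -> str:
    """Stage 1: collapse runs of whitespace so offsets are stable."""
    return re.sub(r"\s+", " ", text).strip()


def find_entities(text: str) -> list[Entity]:
    """Stage 2: locate candidate PHI spans, sorted by start offset."""
    entities = [
        Entity(label, m.start(), m.end())
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(entities, key=lambda e: e.start)


def obfuscate(text: str, entities: list[Entity]) -> str:
    """Stage 3: replace each detected span with a <LABEL> placeholder."""
    pieces = []
    cursor = 0
    for entity in entities:
        if entity.start < cursor:
            continue  # skip spans that overlap one already masked
        pieces.append(text[cursor:entity.start])
        pieces.append(f"<{entity.label}>")
        cursor = entity.end
    pieces.append(text[cursor:])
    return "".join(pieces)


def deidentify(text: str) -> str:
    """Run the three stages in order."""
    clean = preprocess(text)
    return obfuscate(clean, find_entities(clean))


if __name__ == "__main__":
    note = "Dr. Smith saw the patient on 12/15/23.\nCall  555-123-4567."
    print(deidentify(note))
```

Keeping detection and obfuscation as separate functions means the toy patterns could be swapped for a statistical model without touching the masking step, which is the kind of staged design the summary describes.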
