The 50% Miss Rate Problem
A 2025 survey (arXiv:2509.14464) tested LLM tools on clinical records. The results were poor: the tools missed more than 50% of clinical PHI in multilingual documents. The cause is simple. LLMs are built to generate text. They are not built for the high-recall detection task that HIPAA demands.
HIPAA Safe Harbor lists 18 protected identifier types, including names, dates, phone numbers, SSNs, MRNs, health plan IDs, device IDs, and IP addresses. Each needs its own detection logic.
Clinical notes make this harder. Take this example: "Pt. John D., DOB 4/12/67, MRN 1234567, admitted 03/15/24, Dr. Smith ordered ECG." One sentence. Five protected identifiers: two names, a birth date, an MRN, and an admission date. Most use short forms. A model built for clinical meaning often fails the detection task.
What LLMs Miss and Why
LLM tools fail on clinical records in predictable ways.
Short-form identifiers: Clinical notes use shorthand. DOB, MRN, and Pt. are common forms. A model tuned for clinical meaning may not flag "Pt. John D." as a name. Detecting sensitive data requires a different objective: high recall on identifiers, not understanding of the note.
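One way to picture shorthand-aware detection is a pattern that treats the abbreviation itself as the signal. The sketch below is a minimal illustration, not a production detector: the trigger list and the name shape (a capitalized word, optionally followed by a second word or an initial) are assumptions made for this example.

```python
import re

# Hypothetical trigger list: shorthand that usually precedes a person's name.
NAME_TRIGGERS = r"(?:Pt|Patient|Dr|Mr|Mrs|Ms)\.?"

# A trigger, then a capitalized word, optionally followed by a second
# capitalized word or a bare initial such as "D.".
SHORT_FORM_NAME = re.compile(
    rf"\b{NAME_TRIGGERS}\s+([A-Z][a-z]+(?:\s+[A-Z](?:[a-z]+|\.))?)"
)

def find_short_form_names(text: str) -> list[str]:
    """Return the name strings that follow a clinical shorthand trigger."""
    return [m.group(1) for m in SHORT_FORM_NAME.finditer(text)]

note = ("Pt. John D., DOB 4/12/67, MRN 1234567, "
        "admitted 03/15/24, Dr. Smith ordered ECG.")
print(find_short_form_names(note))
```

On the example note this picks up both "John D." and "Smith". A rule like this will also over-match in places, which is why it would normally sit beside a trained model rather than replace one.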
Context-dependent dates: Not all dates pose the same risk. "Age 67" is a soft marker. "DOB 4/12/67" is a direct protected identifier. "03/15/24" as an admit date is protected too. Pattern matching alone is not enough.
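The idea that a date's meaning depends on nearby words can be sketched with a small context window. This is an illustrative assumption about how such a check might look, not the method of any specific tool: the cue words, the label names, and the 20-character window are all hypothetical.

```python
import re

DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")

# Hypothetical cue words; a real deployment would need a far longer list.
CONTEXT_CUES = {
    "birth_date": ("dob", "born", "date of birth"),
    "admission_date": ("admitted", "admit", "admission"),
    "discharge_date": ("discharged", "discharge"),
}

def classify_dates(text: str, window: int = 20) -> list[tuple[str, str]]:
    """Label each date by the closest cue word in the preceding `window` characters."""
    results = []
    for m in DATE.finditer(text):
        before = text[max(0, m.start() - window):m.start()].lower()
        label, best_pos = "unlabeled_date", -1
        for name, cues in CONTEXT_CUES.items():
            for cue in cues:
                pos = before.rfind(cue)
                if pos > best_pos:  # keep the cue nearest to the date
                    label, best_pos = name, pos
        results.append((m.group(), label))
    return results

note = ("Pt. John D., DOB 4/12/67, MRN 1234567, "
        "admitted 03/15/24, Dr. Smith ordered ECG.")
print(classify_dates(note))
```

The pattern finds the date. The surrounding words decide what kind of date it is. A date with no cue still comes back as "unlabeled_date", so nothing is silently dropped.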
Non-US formats: Cyberhaven (Q4 2025) found that 34.8% of all ChatGPT inputs contain sensitive data, including multilingual PII. In healthcare, this means non-US record IDs, regional date formats, and local health ID types. US-trained tools miss these consistently.
Custom hospital identifiers: Hospitals use their own MRN formats, staff IDs, and site codes. These are not in standard NER training data. A tool with no custom entity support will not find them.
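Custom entity support can be as simple as a registry of site-specific patterns. The sketch below assumes three invented formats (an "NGH-" prefixed MRN, an "EMP" staff ID, and a "WARD-" facility code). None of them come from a real hospital; they only show the shape of the idea.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class CustomEntity:
    label: str
    pattern: re.Pattern

# Hypothetical site-specific formats, invented for illustration only.
SITE_ENTITIES = [
    CustomEntity("SITE_MRN", re.compile(r"\bNGH-\d{7}\b")),
    CustomEntity("STAFF_ID", re.compile(r"\bEMP\d{5}\b")),
    CustomEntity("FACILITY_CODE", re.compile(r"\bWARD-[A-Z]{2}\d{2}\b")),
]

def find_custom_entities(text, entities=SITE_ENTITIES):
    """Return (label, matched_text, start, end) for each site-specific hit, in text order."""
    hits = []
    for ent in entities:
        for m in ent.pattern.finditer(text):
            hits.append((ent.label, m.group(), m.start(), m.end()))
    return sorted(hits, key=lambda h: h[2])

sample = "MRN NGH-0012345 seen by EMP40321 on WARD-IC02."
for hit in find_custom_entities(sample):
    print(hit)
```

Keeping the patterns in data rather than in code means each site can add its own formats without retraining a model.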
The Research Dataset Risk
A hospital building a research dataset from 500,000 notes faces a real compliance problem. HIPAA's Expert Determination method requires that the risk of re-identification be "very small". A tool missing half of all protected identifiers cannot meet that bar.
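A rough calculation shows how a per-identifier miss rate compounds at dataset scale. The sketch rests on two loud assumptions that are not from the survey: every note carries five identifiers (borrowed from the example sentence earlier), and each miss is independent of the others.

```python
def p_note_fully_clean(miss_rate: float, identifiers_per_note: int) -> float:
    """Probability that every identifier in one note is caught (independent misses assumed)."""
    return (1.0 - miss_rate) ** identifiers_per_note

def expected_leaky_notes(total_notes: int, miss_rate: float, identifiers_per_note: int) -> float:
    """Expected number of notes that still contain at least one identifier."""
    return total_notes * (1.0 - p_note_fully_clean(miss_rate, identifiers_per_note))

# Assumed: 5 identifiers per note, as in the example sentence.
for rate in (0.50, 0.05):
    leaky = expected_leaky_notes(500_000, rate, 5)
    print(f"miss rate {rate:.0%}: about {leaky:,.0f} of 500,000 notes keep an identifier")
```

Under these assumptions, a 50% miss rate leaves roughly 484,000 of the 500,000 notes with at least one identifier still in them. A 5% miss rate still leaves roughly 113,000. Real notes vary, but the compounding effect is the point.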
Research archives are not clean data. Notes span many departments, time periods, and sometimes languages. A tool that works on billing data may fail on narrative notes. Sensitive data in free text has no field label.
IRB approval adds more demands. Institutions must show the method used, the identifier types removed, and the checks done. A tool missing half of all protected identifiers cannot meet those demands.
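One check an institution could document is the miss rate per identifier type, measured against a hand-annotated sample. The sketch below is one possible way to compute it. The span format and the rule that a gold span counts as found only when a predicted span fully covers it are assumptions chosen for this example.

```python
from collections import Counter

Span = tuple[int, int, str]  # (start, end, identifier_type)

def miss_rate_by_type(gold: list[Span], predicted: list[Span]) -> dict[str, float]:
    """Share of gold spans, per type, that no predicted span fully covers."""
    total, missed = Counter(), Counter()
    for g_start, g_end, g_type in gold:
        total[g_type] += 1
        covered = any(p_start <= g_start and p_end >= g_end
                      for p_start, p_end, _ in predicted)
        if not covered:
            missed[g_type] += 1
    return {t: missed[t] / total[t] for t in total}

# Character offsets into the example note:
# "Pt. John D., DOB 4/12/67, MRN 1234567, admitted 03/15/24, Dr. Smith ordered ECG."
gold = [(4, 11, "NAME"), (17, 24, "DATE"), (30, 37, "MRN"),
        (48, 56, "DATE"), (62, 67, "NAME")]
predicted = [(4, 11, "NAME"), (30, 37, "MRN"), (48, 56, "DATE"), (62, 67, "NAME")]
print(miss_rate_by_type(gold, predicted))
```

Here the hypothetical tool misses the birth date, so DATE shows a 50% miss rate while NAME and MRN show zero. A per-type breakdown like this maps directly onto the "identifier types removed" and "checks done" that reviewers ask about.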
See our compliance overview and security practices for how anonym.legal supports HIPAA work.
The Three-Layer Fix
The 2025 survey found one clear pattern. The tools with the lowest miss rates used three detection layers.
Layer one — regex: Finds structured identifiers. SSNs, MRNs, phone numbers, health plan IDs. Reliable on fixed formats.
Layer two — NER: Uses transformer models. Finds names, dates, and sensitive data in narrative text. Works where regex cannot.
Layer three — custom entities: Handles site-specific forms. Proprietary MRN patterns, staff IDs, facility codes. No standard model covers these.
Pure ML tools degrade on short forms and non-English text. Pure regex tools miss sensitive data with no field label. Neither alone is enough.
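The three layers can be sketched as independent detectors whose spans are merged before redaction. This is a minimal illustration under stated assumptions, not the design of any tool in the survey. The regex patterns are simplified, the NER layer is a stub standing in for a transformer model, and the staff ID format is invented.

```python
import re
from typing import Callable

Span = tuple[int, int, str]  # (start, end, label)

# Layer one: fixed-format identifiers. Patterns are simplified for illustration.
REGEX_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def regex_layer(text: str) -> list[Span]:
    return [(m.start(), m.end(), label)
            for label, pat in REGEX_PATTERNS.items()
            for m in pat.finditer(text)]

def stub_ner_layer(text: str) -> list[Span]:
    """Stand-in for a transformer NER model; a real model's spans would go here."""
    pat = re.compile(r"\b(?:Pt|Dr)\.\s+([A-Z][a-z]+(?:\s+[A-Z]\.)?)")
    return [(m.start(1), m.end(1), "NAME") for m in pat.finditer(text)]

def custom_layer(text: str) -> list[Span]:
    """Layer three: a hypothetical site-specific staff ID format."""
    return [(m.start(), m.end(), "STAFF_ID")
            for m in re.finditer(r"\bEMP\d{5}\b", text)]

def deidentify(text: str, layers: list[Callable[[str], list[Span]]]) -> str:
    """Run every layer, merge overlapping spans, and replace each with a [LABEL] tag."""
    spans = sorted(s for layer in layers for s in layer(text))
    merged: list[Span] = []
    for start, end, label in spans:
        if merged and start < merged[-1][1]:
            # Overlap: extend the earlier span and keep its label.
            prev_start, prev_end, prev_label = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), prev_label)
        else:
            merged.append((start, end, label))
    out, cursor = [], 0
    for start, end, label in merged:
        out.append(text[cursor:start])
        out.append(f"[{label}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

note = ("Pt. John D., DOB 4/12/67, MRN 1234567, "
        "admitted 03/15/24, Dr. Smith ordered ECG.")
print(deidentify(note, [regex_layer, stub_ner_layer, custom_layer]))
```

Each layer only returns spans, so a layer can be swapped or extended without touching the others. Merging before redaction keeps two layers from tagging the same text twice.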
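Only the three-layer design reached sub-5% miss rates in the survey. Safe Harbor itself sets no numeric threshold, because it requires removing all 18 identifier types. A miss rate that low is the practical bar for Safe Harbor work.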
See our guide on HIPAA Safe Harbor de-identification for research for next steps.