Updated for 2026
Not All De-Identification Tools Are Equal
Accuracy is the only metric that matters for PHI de-identification. A 4% error rate looks small. On one million records, it means about 40,000 records with exposed PHI.
ECIR 2025 benchmarks show wide accuracy gaps across leading tools. These results should shape every healthcare buying decision.
ECIR 2025 Benchmark Results
<!-- VERIFIED-EXTERNAL: John Snow Labs ECIR 2025 Text2Story Workshop paper -->
| Tool | F1-Score | Precision | Recall |
|---|---|---|---|
| John Snow Labs | 96% | 95% | 97% |
| Azure AI | 91% | 90% | 92% |
| AWS Comprehend Medical | 83% | 81% | 85% |
| GPT-4o | 79% | 82% | 76% |
F1-score is the harmonic mean of two measures. Precision: the share of flagged items that were real PHI. Recall: the share of real PHI items that were found.
- Low precision means over-redaction and lost context.
- Low recall means missed PHI — a breach.
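As a quick check, the F1 column can be recomputed from the precision and recall columns in the benchmark table. This is a minimal sketch: the function and variable names are illustrative, and the inputs are the rounded percentages from the table, so the results are only as precise as those published figures.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both between 0 and 1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the benchmark table above.
benchmark = {
    "John Snow Labs": (0.95, 0.97),
    "Azure AI": (0.90, 0.92),
    "AWS Comprehend Medical": (0.81, 0.85),
    "GPT-4o": (0.82, 0.76),
}

f1_by_tool = {tool: f1_score(p, r) for tool, (p, r) in benchmark.items()}

for tool, f1 in f1_by_tool.items():
    print(f"{tool}: F1 = {f1:.0%}")
```

Because the harmonic mean is pulled toward the lower of the two inputs, a tool cannot hide weak recall behind strong precision.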
Why the Gap Exists
Training Data Matters
John Snow Labs trains on clinical notes. These notes are messy and full of abbreviations. GPT-4o trains on a broad mix of text. It was not built for clinical data.
| Tool | Training Focus |
|---|---|
| John Snow Labs | Healthcare-specific, clinical notes |
| Azure AI | General medical + clinical |
| AWS Comprehend Medical | General medical entities |
| GPT-4o | Broad training, not healthcare-specific |
Entity Coverage Varies
Not every tool finds the same PHI types.
| Entity | John Snow Labs | Azure AI | AWS Comprehend Medical | GPT-4o |
|---|---|---|---|---|
| Patient names | Yes | Yes | Yes | Yes |
| Medical record numbers | Yes | Yes | Limited | Limited |
| Medication dosages | Yes | Yes | Yes | Partial |
| Procedure codes | Yes | Yes | Limited | No |
| Clinical abbreviations | Yes | Partial | No | Partial |
| Family member names | Yes | Yes | Partial | Partial |
Context Is Hard to Get Right
Take this clinical note:
"Patient reports taking Smith's medication. Dr. Johnson recommends increasing the dose."
A good PHI tool must do three things here:
- Read "Smith" as a brand name, not a patient.
- Flag "Dr. Johnson" as a provider name to redact.
- Know "Patient" is a role label, not a name.
GPT-4o misses cases like these. Missed provider names are one reason its recall sits at 76%.
The Cost of Low Accuracy
Going from 79% to 96% cuts exposure by 170,000 records per million processed.
<!-- VERIFIED: arithmetic derived from ECIR 2025 benchmark figures -->
| Accuracy | Records | PHI Exposure |
|---|---|---|
| 96% | 1,000,000 | 40,000 |
| 91% | 1,000,000 | 90,000 |
| 83% | 1,000,000 | 170,000 |
| 79% | 1,000,000 | 210,000 |
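The table's arithmetic can be reproduced in a few lines. The sketch below makes the same simplifying assumption as the table: the accuracy figure is read as the share of records handled correctly, so each percentage point of error exposes 1% of records. The function name is illustrative.

```python
RECORDS = 1_000_000


def exposed_records(accuracy_pct: int, records: int = RECORDS) -> int:
    """Records left with exposed PHI, assuming each miss exposes one record."""
    return records * (100 - accuracy_pct) // 100


exposure = {pct: exposed_records(pct) for pct in (96, 91, 83, 79)}
extra_exposed = exposure[79] - exposure[96]

for pct, count in exposure.items():
    print(f"{pct}% accuracy: {count:,} exposed records")
print(f"79% vs 96%: {extra_exposed:,} extra exposed records")
```

Real exposure depends on how many PHI items each record holds, so treat these counts as an order-of-magnitude estimate.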
HIPAA Penalties Scale With Exposure
<!-- VERIFIED-EXTERNAL: HIPAA Journal penalty tiers / 45 CFR 160.404 -->
| Tier | Cause | Penalty Per Violation |
|---|---|---|
| 1 | Unaware | $100–$50,000 |
| 2 | Reasonable cause | $1,000–$50,000 |
| 3 | Willful neglect, corrected | $10,000–$50,000 |
| 4 | Willful neglect, uncorrected | $50,000+ |
Picking a 79% tool when 96% tools exist may be willful neglect under HHS rules. The gap is known. A better tool is on the market.
How a Hybrid Pipeline Raises Accuracy
No single method finds all PHI types. A hybrid pipeline stacks methods. Each one fills the gaps the others leave.
Input Text
↓
[Regex Patterns] — Structured data: SSN, MRN, dates
↓
[spaCy NER] — Names, locations, organizations
↓
[Transformer Models] — Context-dependent entities
↓
[Medical Dictionaries] — Healthcare-specific terms
↓
Merged Results (highest confidence wins)
| Method | Strengths | Weaknesses |
|---|---|---|
| Regex | Precise on fixed-format data | No context handling |
| spaCy | Fast, common entities | Limited medical vocab |
| Transformers | Context-aware, high recall | Slower |
| Dictionaries | Full medical terms | Static, needs updates |
Each method catches what the others miss. See how this works in the security compliance page and legal conformance docs.
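The merge step in the diagram above can be sketched with the standard library alone. Everything here is an assumption made for illustration, not a description of any vendor's implementation: the `Span` type, the regex patterns, the confidence values, and `title_name_detector`, which stands in for the spaCy and transformer stages so that the example runs without external models.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str
    confidence: float
    source: str


# Hypothetical patterns; real MRN and date formats vary by institution.
REGEX_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def regex_detector(text: str) -> list[Span]:
    """Structured identifiers: high confidence, no context handling."""
    return [
        Span(m.start(), m.end(), label, 0.99, "regex")
        for label, pattern in REGEX_PATTERNS.items()
        for m in pattern.finditer(text)
    ]


def title_name_detector(text: str) -> list[Span]:
    """Stand-in for an NER or transformer stage: flags 'Dr. <Name>'."""
    pattern = re.compile(r"\bDr\.\s+[A-Z][a-z]+")
    return [
        Span(m.start(), m.end(), "PROVIDER_NAME", 0.85, "ner_stub")
        for m in pattern.finditer(text)
    ]


def merge_spans(spans: list[Span]) -> list[Span]:
    """Where candidates overlap, keep only the highest-confidence span."""
    accepted: list[Span] = []
    for span in sorted(spans, key=lambda s: (-s.confidence, s.start)):
        if all(span.end <= a.start or span.start >= a.end for a in accepted):
            accepted.append(span)
    return sorted(accepted, key=lambda s: s.start)


def redact(text: str, spans: list[Span]) -> str:
    """Replace each accepted span with its bracketed label."""
    parts, cursor = [], 0
    for span in spans:
        parts.append(text[cursor:span.start])
        parts.append(f"[{span.label}]")
        cursor = span.end
    parts.append(text[cursor:])
    return "".join(parts)


note = "Dr. Johnson saw the patient on 3/14/2024. MRN: 00123456, SSN 123-45-6789."
detectors = [regex_detector, title_name_detector]
candidates = [span for detect in detectors for span in detect(note)]
redacted = redact(note, merge_spans(candidates))
print(redacted)
```

The greedy merge is the simplest way to express "highest confidence wins". A production pipeline would also need calibrated confidence scores, since raw scores from different methods are not directly comparable.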
Questions to Ask Any Vendor
Before you sign, ask five things:
- What F1-score on clinical notes? Get third-party data. Reject vague claims.
- Which entity types? All 18 HIPAA Safe Harbor identifiers must be covered.
- How do you handle abbreviations? "Pt," "Dx," and "Hx" need correct resolution.
- Do you catch family member PHI? "Mother has diabetes" is PHI. Many tools miss it.
- Do you support all note formats? Progress notes, discharge summaries, and radiology reports differ a lot.
Red flags to watch for:
- No specific accuracy numbers
- Testing only on clean, structured data
- No healthcare training data
- Few entity types
- No HIPAA Safe Harbor validation
Testing Tools Yourself
Run your own test in four steps.
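Step 1 — Build a dataset. Use clinical notes from many specialties that still contain PHI to find. Synthetic or surrogate identifiers work and keep real patients out of the test. Cover all 18 HIPAA types plus edge cases like abbreviations and family member names.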
Step 2 — Set a gold standard. Experts mark every PHI item with type and exact span.
Step 3 — Run each tool. Compare output to the gold standard. Score precision, recall, and F1.
Step 4 — Break down failures. Group misses by type, context, and format. This shows where each tool fails.
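Steps 3 and 4 can be sketched as a small scoring script. It assumes exact-match scoring, where a prediction counts only if its start offset, end offset, and entity type all match the gold annotation. The annotations below are made up for illustration, and partial-overlap scoring is a common alternative that this sketch does not cover.

```python
from collections import Counter

# Each annotation is (start, end, entity_type). Values are illustrative only.
gold = {(0, 11, "NAME"), (31, 40, "DATE"), (47, 55, "MRN"), (61, 72, "SSN")}
predicted = {(0, 11, "NAME"), (31, 40, "DATE"), (61, 72, "SSN"), (80, 85, "NAME")}


def score(gold: set, predicted: set) -> tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over annotation sets."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


precision, recall, f1 = score(gold, predicted)

# Step 4: group errors by entity type to see where the tool fails.
missed_by_type = Counter(etype for _, _, etype in gold - predicted)
false_alarms_by_type = Counter(etype for _, _, etype in predicted - gold)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print("missed:", dict(missed_by_type))
print("false alarms:", dict(false_alarms_by_type))
```

Running the same script per note format or specialty gives the breakdown by context and format that Step 4 calls for.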
Conclusion
ECIR 2025 data is clear. A 17-point gap — 96% versus 79% — means 170,000 extra exposed records per million. Tool choice is the biggest risk variable at scale.
When you pick a PHI detection tool:
- Require specific accuracy data on clinical text
- Confirm full HIPAA Safe Harbor coverage
- Test on your own document formats
- Choose hybrid pipelines over single-method tools
Read how tokenization works in the token system docs. Common questions are in the FAQ.
anonym.legal replaces PHI with tokens before documents reach any AI tool. Names, dates, and record numbers are swapped on your side. Results come back with real details restored — only for you. Explore pricing.