Presidio's 22.7% Precision Problem
False positives in PII detection cause real damage. When 77.3% of what your tool flags as "person names" are not real names, you are not protecting privacy. You are wrecking data.
A 2024 benchmark tested Microsoft Presidio's default NER model on business documents. The test covered financial reports, customer letters, product docs, and support tickets. The result: 22.7% precision for name detection.
That number is striking. For every 100 items flagged, about 23 are real individual names. The other 77 are false positives: product labels, brand terms, or city labels.
More than three out of four detections are wrong. That is not a minor calibration issue. That is a broken tool for business document work.
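For readers who want to check a figure like this on their own data, precision is the share of flagged spans that match a hand-labeled answer key. The sketch below is only an illustration of that arithmetic: the function name and the sample spans are made up for this example and are not taken from the benchmark.

```python
def precision_recall(gold, predicted):
    """Return (precision, recall) for two collections of (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    # Precision: of everything flagged, how much was right?
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: of everything that should have been flagged, how much was found?
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Illustrative data: one real name, three detections, only one of them correct.
gold_spans = {(0, 10, "PERSON")}
predicted_spans = {(0, 10, "PERSON"), (25, 30, "PERSON"), (41, 48, "PERSON")}

precision, recall = precision_recall(gold_spans, predicted_spans)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this toy case recall is perfect and precision is still poor, which is the same shape of failure the benchmark describes.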
Why This Happens
Presidio uses spaCy's en_core_web_lg model by default. That model was trained on OntoNotes 5, a corpus made up largely of news, broadcast, and web text. In that kind of text, most proper nouns are real people or places.
Business documents are different.
Product labels that look like individual names. "Apple iPhone 15 Pro shipment records" gets flagged as PERSON. So does "Samsung Galaxy Tab" and "Cisco Meraki deployment."
Company terms with name-like parts. In "Johnson Controls results," the word "Johnson" is flagged as PERSON. "Goldman Sachs portfolio" triggers the same error.
Location labels that trigger person detection. "Victoria Harbour project" flags "Victoria" as PERSON. "Santiago hub" flags "Santiago" the same way.
The model lacks the context to tell "Apple" (company) from "Apple Smith" (a person). That gap is the root of most false positives. News text taught it to treat proper nouns as people or places. Business text breaks that rule all the time.
The Downstream Effect
A data firm used Presidio to clean customer surveys before sharing them. An audit found four problems. First, 40% of surveys had product labels wrongly removed. Second, city labels were stripped from every response. Third, brand mentions were wiped from the analysis set. Fourth, sentiment about specific products could not be read.
The analysis team received redacted text with all product references removed. The original responses had named the iPhone Pro and the Apple charger. After redaction, that meaning was gone.
The firm was not protecting privacy better. It was breaking data without gaining compliance. Presidio was replaced after the audit.
See our compliance overview for how detection quality affects your regulatory standing.
A Better Approach: Hybrid Detection
The problem is not unique to Presidio. Token-level NER without context will always have this issue. The fix is context-aware detection.
Why transformers help: A model like XLM-RoBERTa reads the full sentence. "Apple announced its earnings" → Apple is a firm. "Apple Smith joined the team" → Apple is a first name. The context tells you which is which.
This improves precision while keeping recall high. See the comparison below.
| Approach | Precision | Recall |
|---|---|---|
| Presidio default NER | 22.7% | ~85% |
| Regex-only | ~95% | ~40% |
| Hybrid (Regex + NLP + Transformer) | ~85% | ~80% |
The hybrid approach reaches about 85% precision. That means roughly 15% of flagged items are false positives, compared with 77.3% for the Presidio default. For business docs, this gap matters.
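One way to read the table is to derive two numbers per row: the share of detections that are wrong (1 minus precision) and a single combined score. The sketch below uses the F1 score for the latter. F1 does not appear in the original comparison; it is an assumption about how one might summarize the trade-off, and the inputs are the approximate values from the table.

```python
# Precision and recall as listed in the comparison table (approximate values).
approaches = {
    "Presidio default NER": (0.227, 0.85),
    "Regex-only": (0.95, 0.40),
    "Hybrid": (0.85, 0.80),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def wrong_share(precision):
    """Share of flagged items that are false positives."""
    return 1.0 - precision


# Map each approach to (F1, share of wrong detections), rounded for display.
summary = {
    name: (round(f1(p, r), 3), round(wrong_share(p), 3))
    for name, (p, r) in approaches.items()
}

for name, (score, wrong) in summary.items():
    print(f"{name}: F1={score:.3f}, wrong detections={wrong:.1%}")
```

Regex-only has the fewest wrong detections but misses most entities, so a combined score ranks it well below the hybrid row.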
The hybrid stack has four steps:
- Regex layer: Finds structured IDs such as emails, phone numbers, SSNs, and IBANs. Formats are fixed, so false positives are rare. This runs first.
- NLP layer (spaCy): Standard NER for people, firms, and places. High recall, lower precision.
- Transformer layer (XLM-RoBERTa): Re-scores each NLP result using full sentence context. "Apple" in a product context loses its entity score. "John" in a complaint text gains it.
- Confidence threshold: Only hits above a set score pass to the output. Raise the threshold for analytics use cases. Lower it for HIPAA de-identification.
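The four steps above can be sketched as a small pipeline. This is a toy illustration of the control flow only, not a real detector: the NLP layer is replaced by a capitalized-word heuristic standing in for spaCy, the re-scoring layer is a keyword lookup standing in for a transformer such as XLM-RoBERTa, and the email pattern, cue words, scores, and the 0.7 threshold are all hypothetical values chosen for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Hit:
    text: str
    start: int
    end: int
    label: str
    score: float


# Simplified email pattern (assumption; production patterns are stricter).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

# Hypothetical cue words standing in for a transformer's reading of context.
PRODUCT_CUES = {"shipment", "deployment", "charger", "earnings", "portfolio"}
PERSON_CUES = {"joined", "said", "complained", "called", "mr", "ms"}


def regex_layer(text):
    """Step 1: structured identifiers with fixed formats get full confidence."""
    return [Hit(m.group(), m.start(), m.end(), "EMAIL", 1.0)
            for m in EMAIL_RE.finditer(text)]


def nlp_layer(text):
    """Step 2 (stand-in): treat every capitalized word as a PERSON candidate.

    A real stack would call an NER model here; this heuristic only mimics
    the high-recall, low-precision behavior.
    """
    return [Hit(m.group(), m.start(), m.end(), "PERSON", 0.5)
            for m in re.finditer(r"\b[A-Z][a-z]+\b", text)]


def rescore_layer(text, hits):
    """Step 3 (stand-in): shift each candidate's score using sentence context.

    The whole input is treated as one sentence.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    rescored = []
    for hit in hits:
        score = hit.score
        if words & PERSON_CUES:
            score += 0.4
        if words & PRODUCT_CUES:
            score -= 0.4
        score = max(0.0, min(1.0, score))
        rescored.append(Hit(hit.text, hit.start, hit.end, hit.label, score))
    return rescored


def detect(text, threshold=0.7):
    """Step 4: keep only hits at or above the confidence threshold."""
    hits = regex_layer(text) + rescore_layer(text, nlp_layer(text))
    return [h for h in hits if h.score >= threshold]


for sample in ("Apple Smith joined the team", "Apple announced its earnings"):
    print(sample, "->", [(h.label, h.text) for h in detect(sample)])
```

The point of the structure is that the cheap, high-recall layer proposes candidates and a later layer is allowed to veto them, so the threshold becomes the single knob that trades precision against recall.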
Results After Switching
The analytics firm switched to hybrid detection. The gains were clear. Product label false positives dropped from 40% to 3%. City label false positives fell to near zero. Recall on real identities came in at ~82%, slightly down from ~85%, while precision improved substantially.
Surveys became usable again. "iPhone," "Apple," "Samsung," and "Chicago" stayed in the text. Customer names in complaint contexts were correctly removed.
Hybrid detection takes more compute. For large jobs, run times are a bit longer. For most business use cases, the accuracy gain is worth it. The firm could run analysis again. That was the whole point of the survey data.
Read about our detection approach in the security overview.
When High False Positive Rates Are Acceptable
Some cases favor recall over precision.
HIPAA Safe Harbor: Missing a true positive is a violation. A 10% false positive rate is fine if real PHI is never missed. Over-removal is safer than under-removal.
Legal review: Missing a privileged contact may waive privilege. False positives need review but do not create liability.
Business analytics: Over-removal breaks data without a compliance gain. Precision matters more here. Use a hybrid approach with a high confidence threshold. This keeps brand labels and city terms in the output. Only actual person names get removed.
The right balance depends on your use case. Tools that let you set the threshold give you control. No single default works for every context.
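As a rough illustration of what a per-use-case threshold could look like in practice, the sketch below filters the same scored detections under different settings. The use-case names, threshold values, and sample scores are hypothetical; real values would need to be tuned against labeled data for each context.

```python
# Hypothetical per-use-case thresholds; tune them against your own labeled data.
THRESHOLDS = {
    "hipaa_deidentification": 0.30,  # favor recall: over-removal is safer
    "legal_review": 0.40,            # favor recall: flagged items get human review
    "business_analytics": 0.85,      # favor precision: keep brands and cities
}


def filter_hits(hits, use_case):
    """Keep (text, score) pairs whose score meets the threshold for the use case."""
    threshold = THRESHOLDS[use_case]
    return [(text, score) for text, score in hits if score >= threshold]


# Illustrative detector output: one clear name and two ambiguous terms.
scored = [("Maria Lopez", 0.92), ("Victoria", 0.55), ("Galaxy", 0.35)]

print(filter_hits(scored, "business_analytics"))
print(filter_hits(scored, "hipaa_deidentification"))
```

With these sample values, the analytics setting removes only the clear name, while the HIPAA setting removes all three candidates.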
See our FAQ for common questions about thresholds and detection modes.
Conclusion
A 22.7% precision rate means more than 3 out of 4 detections are wrong. For business documents, that makes output unusable for analysis. It also gives false confidence about compliance.
Hybrid detection fixes this. It combines regex, NLP, and transformer scoring. Data stays useful after anonymization. Real person names get removed. Brand labels, city terms, and product identifiers stay in.
If you left Presidio due to false positive issues, this is the path forward. Not a new config of the same model. A different architecture built for business document contexts.
Sources
Priva PII Benchmark 2024: Presidio Precision Evaluation.
Microsoft Presidio: Supported Entities and Model Architecture.
spaCy: en_core_web_lg Training Data and Limitations.