Presidio's 22.7% Precision Problem

False positives in PII detection cause real damage. When 77.3% of what your tool flags as "person names" are not real names, you are not protecting privacy. You are wrecking data.

A 2024 benchmark tested Microsoft Presidio's default NER model on business documents. The test covered financial reports, customer letters, product docs, and support tickets. The result: 22.7% precision for name detection.

That number is striking. For every 100 items flagged as names, about 23 are real personal names. The other 77 are false positives: product labels, brand terms, or city labels.

Roughly three out of four detections are wrong. That is not a minor calibration issue. That is a broken tool for business document work.

Why This Happens

Presidio uses spaCy's en_core_web_lg model by default. This model learned from news text. In news, most proper nouns are real people or places.

Business documents are different.

Product labels that look like personal names: "Apple iPhone 15 Pro shipment records" gets flagged as PERSON. So do "Samsung Galaxy Tab" and "Cisco Meraki deployment."

Company terms with name-like parts: In "Johnson Controls results," the word "Johnson" is flagged as PERSON. "Goldman Sachs portfolio" triggers the same error.

Location labels that trigger person detection: "Victoria Harbour project" flags "Victoria" as PERSON. "Santiago hub" flags "Santiago" the same way.

The model lacks the context to tell "Apple" (company) from "Apple Smith" (a person). That gap is the root of most false positives. News text taught it to treat proper nouns as people or places. Business text breaks that rule all the time.
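
The failure pattern can be shown with a deliberately simplified sketch. The code below is a toy stand-in, not spaCy's model: a real NER model is statistical rather than a lookup, and the name list and function name here are hypothetical. It only illustrates what happens when a token is judged without its neighbours.

```python
# Toy stand-in for context-free tagging. This is NOT how spaCy works
# internally; the name list below is a hypothetical illustration.
NAME_LIKE_TOKENS = {"apple", "johnson", "victoria", "santiago", "goldman", "smith"}


def flag_person_tokens(text: str) -> list[str]:
    """Flag every token found in the name list, ignoring surrounding words."""
    tokens = [t.strip(".,") for t in text.split()]
    return [t for t in tokens if t.lower() in NAME_LIKE_TOKENS]


examples = [
    "Johnson Controls results",
    "Victoria Harbour project",
    "Apple announced its earnings",
    "Apple Smith joined the team",
]
for sentence in examples:
    print(sentence, "->", flag_person_tokens(sentence))
```

The lookup flags "Apple" in both the earnings sentence and the "Apple Smith" sentence, because nothing in it reads the rest of the sentence.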

The Downstream Effect

A data firm used Presidio to clean customer surveys before sharing them. An audit found four problems. First, 40% of surveys had product labels wrongly removed. Second, city labels were stripped from every response. Third, brand mentions were wiped from the analysis set. Fourth, sentiment about specific products could not be read.

The analysis team received redacted text with all product references removed. Responses that had originally named the iPhone Pro and the Apple charger no longer said which product they were about.

The firm was not protecting privacy better. It was breaking data without gaining compliance. Presidio was replaced after the audit.

See our compliance overview for how detection quality affects your regulatory standing.

A Better Approach: Hybrid Detection

The problem is not unique to Presidio. Token-level NER without context will always have this issue. The fix is context-aware detection.

Why transformers help: A model like XLM-RoBERTa reads the full sentence. "Apple announced its earnings" → Apple is a firm. "Apple Smith joined the team" → Apple is a first name. The context tells you which is which.

This improves precision while keeping recall high. See the comparison below.

Approach	Precision	Recall
Presidio default NER	22.7%	~85%
Regex-only	~95%	~40%
Hybrid (Regex + NLP + Transformer)	~85%	~80%

The hybrid approach reaches about 85% precision. That means roughly 15% of flagged items are false positives, compared with 77.3% for the default model. For business docs, this gap matters.
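
To make the comparison table concrete, its percentages can be turned into counts. The sketch below assumes a hypothetical corpus containing 1,000 real person names; the corpus size and the function name are illustrative, and only the precision and recall figures come from the table.

```python
def expected_counts(true_names: int, precision: float, recall: float) -> dict:
    """Translate precision and recall into counts for a corpus of real names."""
    true_positives = true_names * recall      # real names that get flagged
    flagged = true_positives / precision      # everything the tool flags
    return {
        "flagged": round(flagged),
        "false_positives": round(flagged - true_positives),
        "missed_names": round(true_names - true_positives),
    }


# (precision, recall) taken from the comparison table.
approaches = {
    "Presidio default NER": (0.227, 0.85),
    "Regex-only": (0.95, 0.40),
    "Hybrid": (0.85, 0.80),
}
for name, (precision, recall) in approaches.items():
    print(name, expected_counts(1000, precision, recall))
```

Under these assumptions the default model produces about 2,900 false positives while missing 150 names, and the hybrid stack produces about 140 false positives while missing 200. Regex-only is the cleanest on false positives but misses 600 names.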

The hybrid stack has four steps:

Regex layer: Finds structured IDs — emails, phone numbers, SSNs, IBANs. Formats are fixed, so false positives are rare. This runs first.
NLP layer (spaCy): Standard NER for people, firms, and places. High recall, lower precision.
Transformer layer (XLM-RoBERTa): Re-scores each NLP result using full sentence context. "Apple" in a product context loses its entity score. "John" in a complaint text gains it.
Confidence threshold: Only hits above a set score pass to the output. Raise the threshold for analytics use cases. Lower it for HIPAA de-identification.
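
The four steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the production stack: the NER layer and the re-scoring layer are stubs standing in for spaCy and XLM-RoBERTa, and the cue words, base scores, and default threshold are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Hit:
    text: str
    label: str
    score: float


# Layer 1: regex for structured identifiers (simplified email pattern only).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def regex_layer(text: str) -> list[Hit]:
    return [Hit(m.group(), "EMAIL", 1.0) for m in EMAIL_RE.finditer(text)]


# Layer 2: stub for the NER model. A real pipeline would call spaCy here.
# Assumption: every capitalised word becomes a PERSON candidate at 0.6.
def ner_layer(text: str) -> list[Hit]:
    return [Hit(w, "PERSON", 0.6) for w in re.findall(r"\b[A-Z][a-z]+\b", text)]


# Layer 3: stub for transformer re-scoring. Keyword cues stand in for the
# sentence-level context a model such as XLM-RoBERTa would provide.
PRODUCT_CUES = {"shipment", "earnings", "deployment", "charger"}
PERSON_CUES = {"joined", "complained", "said", "wrote"}


def rescore(hit: Hit, text: str) -> Hit:
    words = set(re.findall(r"[a-z]+", text.lower()))
    score = hit.score
    if words & PRODUCT_CUES:
        score -= 0.4  # product context lowers the entity score
    if words & PERSON_CUES:
        score += 0.3  # person context raises it
    return Hit(hit.text, hit.label, max(0.0, min(1.0, score)))


# Layer 4: only hits at or above the threshold reach the output.
def detect(text: str, threshold: float = 0.7) -> list[Hit]:
    hits = regex_layer(text)  # regex hits are not re-scored
    hits += [rescore(h, text) for h in ner_layer(text)]
    return [h for h in hits if h.score >= threshold]


print(detect("Apple announced its earnings"))
print(detect("Apple Smith joined the team"))
```

With these stub rules, "Apple" is dropped in the earnings sentence and kept, together with "Smith", in the sentence about joining the team. The point of the design is the ordering: cheap, reliable patterns first, broad candidates second, context last.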

Results After Switching

The analytics firm switched to hybrid detection. The gains were clear. Product label false positives dropped from 40% to 3%. City label false positives fell to near zero. Real identity recall held at ~82%, down slightly from ~85%, while precision improved substantially.

Surveys became usable again. "iPhone," "Apple," "Samsung," and "Chicago" stayed in the text. Customer names in complaint contexts were correctly removed.

Hybrid detection takes more compute. For large jobs, run times are a bit longer. For most business use cases, the accuracy gain is worth it. The firm could run analysis again. That was the whole point of the survey data.

Read about our detection approach in the security overview.

When High False Positive Rates Are Acceptable

Some cases favor recall over precision.

HIPAA Safe Harbor: Missing a true positive is a violation. A 10% false positive rate is fine if real PHI is never missed. Over-removal is safer than under-removal.

Legal review: Missing a privileged contact may waive privilege. False positives need review but do not create liability.

Business analytics: Over-removal breaks data without a compliance gain. Precision matters more here. Use a hybrid approach with a high confidence threshold. This keeps brand labels and city terms in the output. Only actual person names get removed.

The right balance depends on your use case. Tools that let you set the threshold give you control. No single default works for every context.
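
One way to picture a configurable threshold is a per-use-case lookup. The sketch below is illustrative only: the use-case names, threshold values, and scores are hypothetical and would need tuning on your own data.

```python
# Hypothetical thresholds; higher values favour precision, lower favour recall.
THRESHOLDS = {
    "business_analytics": 0.85,      # keep brand and city terms in the output
    "legal_review": 0.40,            # false positives get reviewed anyway
    "hipaa_deidentification": 0.30,  # over-removal is safer than a miss
}


def filter_hits(scored_hits, use_case):
    """Keep the text of (text, score) pairs at or above the use-case threshold."""
    cutoff = THRESHOLDS[use_case]
    return [text for text, score in scored_hits if score >= cutoff]


# Hypothetical detector output: (candidate, confidence score).
scored = [("John", 0.92), ("Victoria", 0.45), ("Apple", 0.20)]
for use_case in THRESHOLDS:
    print(use_case, filter_hits(scored, use_case))
```

With these example scores, the analytics setting removes only "John", while the two recall-oriented settings also remove the ambiguous "Victoria".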

See our FAQ for common questions about thresholds and detection modes.

Conclusion

A 22.7% precision rate means roughly 3 out of 4 detections are wrong. For business documents, that makes output unusable for analysis. It also gives false confidence about compliance.

Hybrid detection fixes this. It combines regex, NLP, and transformer scoring. Data stays useful after anonymization. Real person names get removed. Brand labels, city terms, and product identifiers stay in.

If you left Presidio due to false positive issues, this is the path forward. Not a new config of the same model. A different architecture built for business document contexts.

When This Approach Has Limits

Context-aware hybrid detection fixes the precision problem that token-level NER creates. Moving from 22.7 percent to about 85 percent precision is a real architectural gain, but the limits are worth stating plainly.

Higher precision trades against recall, and recall is what protects privacy. The comparison table is honest about this: hybrid detection lands at roughly 85 percent precision but around 80 percent recall, with real identity recall measured at about 82 percent after switching. That residual false-negative rate means real names slip through. For business analytics that tradeoff is fine, but for HIPAA de-identification a missed true positive is a violation. Precision improvements make output more usable; they do not lower the floor on what you must not miss. Choose the threshold for the consequence, not the convenience.

Quasi-identifiers survive even perfect name detection. Removing every actual person name still leaves date of birth, ZIP code, job title, and rare attribute combinations that can re-identify an individual when joined against other data. A precision improvement is entirely about whether the system labels a token correctly; it says nothing about whether the surviving fields are collectively anonymizing. Output that detects names at 99 percent precision can still be pseudonymized rather than anonymized, with the legal scope that distinction carries. Treat name accuracy and re-identification risk as separate problems.
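
A quick way to see this risk is to count how many records share each combination of surviving fields. The sketch below uses hypothetical records and field names; it is a rough screening check, not a full re-identification risk assessment.

```python
from collections import Counter


def unique_combinations(records, quasi_identifiers):
    """Return records whose quasi-identifier combination appears only once."""
    def key(record):
        return tuple(record[field] for field in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] == 1]


# Hypothetical records with every person name already removed.
records = [
    {"date_of_birth": "1984-03-12", "zip": "60614", "job_title": "Nurse"},
    {"date_of_birth": "1984-03-12", "zip": "60614", "job_title": "Nurse"},
    {"date_of_birth": "1971-11-02", "zip": "60614", "job_title": "Chief Actuary"},
]
at_risk = unique_combinations(records, ["date_of_birth", "zip", "job_title"])
print(at_risk)
```

The third record contains no name, yet its combination of fields is unique in the set, so it could point to one individual when joined against other data.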

A benchmark number is not your number. The 22.7 percent figure comes from one 2024 evaluation on a specific document mix; the hybrid results come from one analytics firm. Your documents, languages, and entity types will produce different precision and recall. Custom and legacy formats in particular need configuration and held-out testing before you trust any quoted figure. Measure on your own held-out set rather than adopting published rates as a guarantee.
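
Measuring on your own data can start small. The sketch below assumes you have hand-labeled gold spans and detector output for a held-out set, each represented as a hypothetical (document id, start, end) tuple, and that only exact span matches count as correct.

```python
def precision_recall(gold: set, predicted: set) -> tuple[float, float]:
    """Exact-match precision and recall over sets of labeled spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical spans: (doc_id, start_offset, end_offset).
gold = {("doc1", 0, 10), ("doc1", 25, 36), ("doc2", 4, 15)}
predicted = {("doc1", 0, 10), ("doc2", 4, 15), ("doc2", 30, 36), ("doc2", 50, 57)}

precision, recall = precision_recall(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching is the strictest choice; a partial-overlap rule would score differently, so state which rule you used when comparing your numbers with published ones.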

Sources

Priva PII Benchmark 2024: Presidio Precision Evaluation.

Microsoft Presidio: Supported Entities and Model Architecture.

spaCy: en_core_web_lg Training Data and Limitations.

Ready to protect your data?

Start anonymizing PII with 267+ entity types across 48 languages.
