Last updated 2026-05-29

Presidio's 22.7% Precision Problem

A 2024 benchmark found Presidio's person name recognizer achieves 22.7% precision in business documents — meaning 77.3% of detections are false positives.

May 29, 2026 · 7 minute read
Presidio precision, false positives, NER accuracy, PII detection quality, hybrid recognizer

False positives in PII detection cause real damage. When 77.3% of what your tool flags as "person names" are not real names, you are not protecting privacy. You are wrecking data.

A 2024 benchmark tested Microsoft Presidio's default NER model on business documents. The test covered financial reports, customer letters, product docs, and support tickets. The result: 22.7% precision for name detection.

That number is striking. For every 100 items flagged, about 23 are real individual names. The other 77 are false positives: product labels, brand terms, or city labels.

More than three out of four detections are wrong. That is not a minor calibration issue. That is a broken tool for business document work.
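The arithmetic behind these figures can be sketched in a few lines. The counts below are hypothetical, chosen only so that they reproduce the reported 22.7%; the benchmark's raw counts are not given here.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of flagged items that really are person names."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no detections to score")
    return true_positives / flagged

# Hypothetical counts: 227 real names among 1,000 flagged spans.
p = precision(true_positives=227, false_positives=773)
wrong_share = 1 - p  # share of detections that are not real names

print(f"precision={p:.1%}, wrong detections={wrong_share:.1%}")
# prints: precision=22.7%, wrong detections=77.3%
```

Precision says nothing about missed names. That is recall, which is measured separately.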

Why This Happens

Presidio uses spaCy's en_core_web_lg model by default. This model was trained on OntoNotes 5, a corpus drawn mostly from news, broadcast, conversational, and web sources. In news, most proper nouns are real people or places.

Business documents are different.

Product labels that look like individual names. "Apple iPhone 15 Pro shipment records" gets flagged as PERSON. So does "Samsung Galaxy Tab" and "Cisco Meraki deployment."

Company terms with name-like parts. In "Johnson Controls results," the word "Johnson" is flagged as PERSON. "Goldman Sachs portfolio" triggers the same error.

Location labels that trigger person detection. "Victoria Harbour project" flags "Victoria" as PERSON. "Santiago hub" flags "Santiago" the same way.

The model lacks the context to tell "Apple" (company) from "Apple Smith" (a person). That gap is the root of most false positives. News text taught it to treat proper nouns as people or places. Business text breaks that rule all the time.

The Downstream Effect

A data firm used Presidio to clean customer surveys before sharing them. An audit found four problems. First, 40% of surveys had product labels wrongly removed. Second, city labels were stripped from every response. Third, brand mentions were wiped from the analysis set. Fourth, sentiment about specific products could not be read.

The analysis team received redacted text with all product references removed. The survey had originally named iPhone Pro and the Apple charger. That meaning was gone.

The firm was not protecting privacy better. It was breaking data without gaining compliance. Presidio was replaced after the audit.

See our compliance overview for how detection quality affects your regulatory standing.

A Better Approach: Hybrid Detection

The problem is not unique to Presidio. Token-level NER without context will always have this issue. The fix is context-aware detection.

Why transformers help: A model like XLM-RoBERTa reads the full sentence. "Apple announced its earnings" → Apple is a firm. "Apple Smith joined the team" → Apple is a first name. The context tells you which is which.
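To make the idea concrete, here is a toy sketch of context-based scoring. It is not a transformer and not how XLM-RoBERTa works internally: it replaces learned context with two small, hypothetical cue-word lists, purely to show how the surrounding words can flip the decision for the same token.

```python
import re

# Hypothetical cue lists; a real model learns such signals from data.
PERSON_CUES = {"joined", "said", "wrote", "mr", "ms", "dr", "complained"}
ORG_CUES = {"announced", "earnings", "shipment", "iphone", "portfolio", "results"}

def person_score(sentence: str, candidate: str) -> float:
    """Return a score in [0, 1]; higher means more likely a person."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    words.discard(candidate.lower())
    person_hits = len(words & PERSON_CUES)
    org_hits = len(words & ORG_CUES)
    if person_hits + org_hits == 0:
        return 0.5  # no evidence either way
    return person_hits / (person_hits + org_hits)

print(person_score("Apple announced its earnings", "Apple"))  # 0.0
print(person_score("Apple Smith joined the team", "Apple"))   # 1.0
```

A keyword list like this breaks down quickly on real text. The point is only that the score depends on the sentence, not on the token alone.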

This improves precision while keeping recall high. See the comparison below.

Approach                            | Precision | Recall
Presidio default NER                | 22.7%     | ~85%
Regex-only                          | ~95%      | ~40%
Hybrid (Regex + NLP + Transformer)  | ~85%      | ~80%

The hybrid approach reaches about 85% precision. That means roughly 15% of its detections are false positives, compared with 77.3% for the default NER. For business docs, this gap matters.
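One way to compare the three rows on a single number is the F1 score, the harmonic mean of precision and recall. The benchmark itself does not report F1, so the sketch below is our own arithmetic on the table's figures, and the "~" values are approximate.

```python
# Figures copied from the comparison table; "~" values are approximate.
approaches = {
    "Presidio default NER": (0.227, 0.85),
    "Regex-only": (0.95, 0.40),
    "Hybrid": (0.85, 0.80),
}

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

for name, (p, r) in approaches.items():
    print(f"{name}: wrong detections={1 - p:.1%}, F1={f1(p, r):.2f}")
# Presidio default NER: wrong detections=77.3%, F1=0.36
# Regex-only: wrong detections=5.0%, F1=0.56
# Hybrid: wrong detections=15.0%, F1=0.82
```

Regex-only has the fewest wrong detections but misses most names, which is why its F1 stays low.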

The hybrid stack has four steps:

  1. Regex layer: Finds structured IDs — emails, phone numbers, SSNs, IBANs. Formats are fixed, so false positives are rare. This runs first.

  2. NLP layer (spaCy): Standard NER for people, firms, and places. High recall, lower precision.

  3. Transformer layer (XLM-RoBERTa): Re-scores each NLP result using full sentence context. "Apple" in a product context loses its entity score. "John" in a complaint text gains it.

  4. Confidence threshold: Only hits above a set score pass to the output. Raise the threshold for analytics use cases. Lower it for HIPAA de-identification.
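The four steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not a production design: `toy_ner` and `toy_rescore` are hypothetical stand-ins for the spaCy and transformer layers, and the brand list, scores, and email pattern are made up for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Hit:
    text: str
    label: str
    score: float

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def regex_layer(text: str) -> List[Hit]:
    # Step 1: structured identifiers get a high fixed score.
    return [Hit(m.group(), "EMAIL", 0.99) for m in EMAIL_RE.finditer(text)]

def detect(text: str,
           ner: Callable[[str], List[Hit]],
           rescore: Callable[[str, Hit], float],
           threshold: float) -> List[Hit]:
    hits = regex_layer(text)
    for hit in ner(text):               # Step 2: high-recall NER candidates
        hit.score = rescore(text, hit)  # Step 3: context-aware re-scoring
        hits.append(hit)
    return [h for h in hits if h.score >= threshold]  # Step 4: threshold

def toy_ner(text: str) -> List[Hit]:
    # Stand-in for spaCy: flags every capitalised word as PERSON.
    return [Hit(w, "PERSON", 0.6) for w in re.findall(r"\b[A-Z][a-z]+\b", text)]

def toy_rescore(text: str, hit: Hit) -> float:
    # Stand-in for a transformer: a hypothetical brand list lowers the score.
    return 0.1 if hit.text in {"Apple", "Samsung"} else 0.9

sample = "Apple shipped a charger to John, reach him at john@example.com"
result = detect(sample, toy_ner, toy_rescore, threshold=0.5)
print([(h.label, h.text) for h in result])
# [('EMAIL', 'john@example.com'), ('PERSON', 'John')]
```

Passing the NER and re-scoring layers in as callables keeps the orchestration separate from the models, so either layer can be swapped without touching the threshold logic.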

Results After Switching

The analytics firm switched to hybrid detection. The gains were clear. The share of surveys with wrongly removed product labels dropped from 40% to 3%. City label false positives fell to near zero. Real identity recall stayed at ~82%, slightly down from 85%, but precision improved a lot.

Surveys became usable again. "iPhone," "Apple," "Samsung," and "Chicago" stayed in the text. Customer names in complaint contexts were correctly removed.

Hybrid detection takes more compute. For large jobs, run times are a bit longer. For most business use cases, the accuracy gain is worth it. The firm could run analysis again. That was the whole point of the survey data.

Read about our detection approach in the security overview.

When High False Positive Rates Are Acceptable

Some cases favor recall over precision.

HIPAA Safe Harbor: Missing a true positive is a violation. A 10% false positive rate is fine if real PHI is never missed. Over-removal is safer than under-removal.

Legal review: Missing a privileged contact may waive privilege. False positives need review but do not create liability.

Business analytics: Over-removal breaks data without a compliance gain. Precision matters more here. Use a hybrid approach with a high confidence threshold. This keeps brand labels and city terms in the output. Only actual person names get removed.

The right balance depends on your use case. Tools that let you set the threshold give you control. No single default works for every context.
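The trade-off can be seen by sweeping the threshold over a small labelled set. The scores and labels below are hypothetical, invented only to show the direction of the effect; real numbers depend on your detector and your documents.

```python
# Hypothetical detections: (model score, is it really a person name?)
scored = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.40, True), (0.30, False), (0.20, False),
]

def precision_recall(scored, threshold):
    """Precision and recall when only hits at or above threshold are kept."""
    kept = [is_name for score, is_name in scored if score >= threshold]
    total_names = sum(is_name for _, is_name in scored)
    true_pos = sum(kept)
    precision = true_pos / len(kept) if kept else 1.0
    recall = true_pos / total_names
    return precision, recall

for t in (0.25, 0.50, 0.85):
    p, r = precision_recall(scored, t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
# threshold=0.25 precision=0.57 recall=1.00
# threshold=0.50 precision=0.60 recall=0.75
# threshold=0.85 precision=1.00 recall=0.50
```

On this toy data, the low threshold suits a recall-first case such as de-identification, while the high threshold suits analytics, where wrongly removed terms are the main cost.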

See our FAQ for common questions about thresholds and detection modes.

Conclusion

A 22.7% precision rate means more than 3 out of 4 detections are wrong. For business documents, that makes output unusable for analysis. It also gives false confidence about compliance.

Hybrid detection fixes this. It combines regex, NLP, and transformer scoring. Data stays useful after anonymization. Real person names get removed. Brand labels, city terms, and product identifiers stay in.

If you left Presidio due to false positive issues, this is the path forward. Not a new config of the same model. A different architecture built for business document contexts.

Sources

Priva PII Benchmark 2024: Presidio Precision Evaluation.

Microsoft Presidio: Supported Entities and Model Architecture.

spaCy: en_core_web_lg Training Data and Limitations.
