Last updated 2026-04-03


The False Positive Tax on PII Tools

Presidio GitHub issue #1071 documents systematic false positives. A 2024 study found 22.7% precision in mixed-language enterprise datasets.

April 3, 2026 · 8 minute read
Tags: false positive rate, Presidio precision, PII detection accuracy, score threshold configuration, hybrid detection

The False Positive Tax on PII Detection Tools

Updated for 2026

Most PII tools are judged on recall. Recall measures what share of real PII the tool finds. But precision matters just as much. Precision measures what share of the tool's alerts are real PII.

Low precision is expensive. A system with 95% recall and 22.7% precision catches most PII. Yet for every real PII entity it flags, it also raises about 3.4 wrong alerts. In a dataset with 10,000 real PII entities, that system finds about 9,500 of them and fires roughly 41,900 alerts. About 32,400 of them are wrong. Each one costs time to review or causes over-redaction.
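
The arithmetic follows from two definitions: true positives equal real entities times recall, and total alerts equal true positives divided by precision. A minimal Python sketch makes the inputs explicit (the function name alert_volume is ours, for illustration only). The totals depend on recall: with recall set to 1.0 the same dataset produces about 44,000 alerts, and at 95% recall about 41,900. The ratio of wrong alerts to real ones depends on precision alone.

```python
def alert_volume(real_entities: int, recall: float, precision: float) -> dict:
    """Estimate alert counts for a detector with the given recall and precision."""
    true_positives = real_entities * recall          # real PII the tool finds
    total_alerts = true_positives / precision        # precision = TP / all alerts
    false_positives = total_alerts - true_positives  # alerts that are not PII
    return {
        "true_positives": round(true_positives),
        "total_alerts": round(total_alerts),
        "false_positives": round(false_positives),
        "false_per_true": round(false_positives / true_positives, 1),
    }


if __name__ == "__main__":
    # Figures from the example in the text: 10,000 entities, 95% recall, 22.7% precision.
    print(alert_volume(10_000, recall=0.95, precision=0.227))
```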

This is the false positive tax. It is the overhead any team pays when running a high-recall, low-precision PII system at scale. The direct cost is reviewer time. The indirect cost is worse: over-redacted documents hide useful data, slow work, and erode trust in the tool.

What Presidio Issue #1071 Shows

Microsoft Presidio GitHub discussion #1071 (2024) records a specific pattern. The TFN (Tax File Number) and PCI recognizers use checksum validation. Numbers that pass the checksum receive a score of 1.0 — maximum confidence. No PII context is required.

The root cause: context word checking runs after the checksum step, not before. A number that passes the checksum gets a top score regardless of surrounding text. In financial spreadsheets, scientific datasets, or log files, this floods the output with wrong alerts. Score threshold filtering cannot fix it. The scores are already at maximum.

A second pattern appears in Presidio issue #999. German word segmentation breaks down for compound nouns. Words like Bundesbehörde (federal authority) can be split incorrectly and tagged as personal names. This adds noise in any German-language document.

The 22.7% Precision Problem

Alvaro et al. (2024) tested Presidio on mixed-language enterprise datasets. They found 22.7% precision. In those datasets, fewer than one in four Presidio alerts was a real PII entity. This matches what practitioners report: a tool tuned for recall alone produces too much noise for production use.

A 2024 DICOM study showed that raising score_threshold to 0.7 still left wrong alerts in 38 of 39 medical images. A threshold that clears noise in one document type creates missed detections in another.

This is not a Presidio-only problem. Any fixed threshold forces a trade-off. A high threshold cuts noise but raises misses. A low threshold raises recall but inflates the alert count.
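
The trade-off can be seen with a small threshold sweep. The sketch below uses made-up detections, not data from any study: each entry is a (score, is_real_pii) pair, and two of the false alerts carry a score of 1.0, as a checksum-only recognizer would assign. The function name sweep and the thresholds are illustrative assumptions.

```python
def sweep(detections, thresholds=(0.5, 0.7, 0.9)):
    """Count true alerts, false alerts, and filtered-out real PII per threshold.

    detections: list of (score, is_real_pii) pairs from a labelled sample.
    """
    total_real = sum(1 for _, is_real in detections if is_real)
    rows = []
    for threshold in thresholds:
        kept = [(s, is_real) for s, is_real in detections if s >= threshold]
        true_alerts = sum(1 for _, is_real in kept if is_real)
        rows.append({
            "threshold": threshold,
            "true_alerts": true_alerts,
            "false_alerts": len(kept) - true_alerts,
            "missed": total_real - true_alerts,  # real PII dropped by the filter
        })
    return rows


# Hypothetical sample: two false alerts score 1.0 because they passed a checksum.
SAMPLE = [(1.0, False), (1.0, False), (1.0, True),
          (0.85, True), (0.6, True), (0.55, False)]

if __name__ == "__main__":
    for row in sweep(SAMPLE):
        print(row)
```

In this toy sample, raising the threshold removes only the low-score false alert. The two false alerts at 1.0 survive every threshold, while the count of missed real entities climbs.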

Context-Aware Scoring

The fix is context-aware confidence scoring. Instead of scoring based on the pattern match alone, the system boosts confidence when context words appear near the match. It also lowers the score when context is absent.

For TFN detection: words like "tax file number," "TFN," or "Australian tax" near a number boost its score. A number that passes the checksum but has no nearby context words scores below the review threshold. The spurious alert is suppressed.
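
A minimal sketch of this idea in plain Python follows. It is not Presidio's implementation. The base score, boost, review threshold, and window size are assumed values chosen for illustration, and the checksum uses the commonly published weighted mod-11 check for 9-digit TFNs. All function names are hypothetical.

```python
import re

# Context phrases taken from the examples in the text.
CONTEXT_WORDS = ("tax file number", "tfn", "australian tax")
# Commonly published TFN weights; the weighted sum must be divisible by 11.
TFN_WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)

BASE_SCORE = 0.4        # assumed: checksum-valid number with no context
CONTEXT_BOOST = 0.5     # assumed: added when a context phrase is nearby
REVIEW_THRESHOLD = 0.6  # assumed: alerts below this are suppressed
WINDOW_CHARS = 40       # assumed: characters inspected on each side


def passes_tfn_checksum(digits: str) -> bool:
    """Return True if a 9-digit string passes the weighted mod-11 check."""
    total = sum(int(d) * w for d, w in zip(digits, TFN_WEIGHTS))
    return len(digits) == 9 and total % 11 == 0


def has_context(text: str, start: int, end: int) -> bool:
    """Return True if any context phrase appears near text[start:end]."""
    window = text[max(0, start - WINDOW_CHARS): end + WINDOW_CHARS].lower()
    return any(re.search(r"\b" + re.escape(w) + r"\b", window)
               for w in CONTEXT_WORDS)


def tfn_alerts(text: str) -> list:
    """Return (number, score) pairs that reach the review threshold."""
    alerts = []
    for match in re.finditer(r"\b\d{9}\b", text):
        if not passes_tfn_checksum(match.group()):
            continue
        score = BASE_SCORE
        if has_context(text, match.start(), match.end()):
            score = min(1.0, score + CONTEXT_BOOST)
        if score >= REVIEW_THRESHOLD:
            alerts.append((match.group(), score))
    return alerts
```

The design choice here is that the checksum only qualifies a candidate. It does not set the final score, so a checksum-valid number in a spreadsheet column stays below the threshold unless nearby text supports it.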

For cross-language noise: entity types tied to specific countries can be scoped to documents in the matching language. Scoping a TFN detector to English text, including Australian English, avoids this noise. Running it unscoped on German content is what produces the false alerts.
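
Language scoping can be expressed as a small registry lookup. The sketch below assumes the document language has already been detected upstream. The Recognizer class, the entity labels, and the language sets are hypothetical and only illustrate the shape of such a configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recognizer:
    entity_type: str
    languages: frozenset  # language codes this recognizer is allowed to run on


# Hypothetical registry: the TFN recognizer is scoped to English only.
RECOGNIZERS = [
    Recognizer("AU_TFN", frozenset({"en"})),
    Recognizer("PERSON", frozenset({"en", "de"})),
]


def recognizers_for(language: str) -> list:
    """Return the entity types that should run on a document in this language."""
    return [r.entity_type for r in RECOGNIZERS if language in r.languages]


if __name__ == "__main__":
    print(recognizers_for("en"))
    print(recognizers_for("de"))
```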

The third layer in a hybrid system is a transformer model. It reads the full context window around each candidate. It distinguishes "John Smith, Patient ID 12345" from a product code that merely matches a name pattern. Context resolves the ambiguity that regex and checksums cannot.

See how the three-tier detection engine handles precision at scale. The multilingual PII detection guide covers how cross-language noise affects GDPR compliance.

Practical Steps

Before deploying any PII tool, measure its precision — not just recall.

Run the tool on a document set with known PII and known non-PII. Count alerts in both groups. Calculate true_positives / (true_positives + false_positives). This number reveals the review burden before you commit to a rollout.
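
The calculation can be sketched in a few lines of Python. The sketch assumes both the tool's alerts and the ground-truth labels are available as sets of (document id, start, end, entity type) tuples and that only exact matches count as correct. That matching rule is a simplifying assumption, and the function name is ours.

```python
def precision_recall(predicted: set, actual: set) -> tuple:
    """Compute precision and recall from predicted and labelled entity spans."""
    true_positives = len(predicted & actual)   # alerts that match a label
    false_positives = len(predicted - actual)  # alerts with no matching label
    false_negatives = len(actual - predicted)  # labelled PII the tool missed
    precision = (true_positives / (true_positives + false_positives)
                 if predicted else 0.0)
    recall = (true_positives / (true_positives + false_negatives)
              if actual else 0.0)
    return precision, recall


if __name__ == "__main__":
    # Toy example: four alerts, one of which matches a labelled entity.
    predicted = {("d1", 0, 9, "TFN"), ("d1", 20, 29, "TFN"),
                 ("d2", 5, 14, "TFN"), ("d2", 30, 39, "TFN")}
    actual = {("d1", 0, 9, "TFN"), ("d3", 0, 9, "TFN")}
    print(precision_recall(predicted, actual))
```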

For teams already using Presidio, score distribution analysis is a fast path. Export a sample of detections with their confidence scores. Count how many score below 0.6, 0.7, and 0.8. A large share of high-score alerts in clean text signals a context gap, not a threshold problem. The security compliance overview explains how to document this in a DPIA.
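
A score distribution check might look like the sketch below. It assumes the confidence scores have already been exported as plain floats. The cut-offs come from the paragraph above, while the function name and the sample scores are illustrative only.

```python
THRESHOLDS = (0.6, 0.7, 0.8)  # cut-offs named in the text


def score_distribution(scores: list) -> dict:
    """Count detections below each cut-off and at or above the highest one."""
    counts = {f"below_{t}": sum(1 for s in scores if s < t) for t in THRESHOLDS}
    top = max(THRESHOLDS)
    counts[f"at_or_above_{top}"] = sum(1 for s in scores if s >= top)
    counts["total"] = len(scores)
    return counts


if __name__ == "__main__":
    # Hypothetical scores exported from a sample of text known to contain no PII.
    print(score_distribution([1.0, 1.0, 0.85, 0.65, 0.4]))
```

If most alerts from PII-free text land in the top bucket, no threshold below 1.0 will remove them, which points to missing context checks.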
