The False Positive Tax on PII Detection Tools

Updated for 2026

Most PII tools are judged on recall. Recall measures what share of real PII the tool finds. But precision matters just as much. Precision measures what share of the tool's alerts are real PII.

Low precision is expensive. A system with 95% recall and 22.7% precision catches most PII. Yet for every real PII entity it flags, it also raises about 3.4 wrong alerts. In a dataset with 10,000 real PII entities, that system finds 9,500 of them and fires roughly 42,000 alerts. About 32,000 of them are wrong. Each one costs time to review or causes over-redaction.
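Alert volume follows directly from recall and precision, so the review burden can be estimated before any tool is run. The sketch below is a minimal illustration; the function name `alert_load` is made up for this example, and the inputs are the recall, precision, and dataset size quoted above.

```python
def alert_load(real_entities: int, recall: float, precision: float) -> dict:
    """Derive alert counts from recall and precision."""
    # Recall fixes how many real entities are found.
    true_positives = real_entities * recall
    # Precision = TP / total alerts, so total alerts = TP / precision.
    total_alerts = true_positives / precision
    false_positives = total_alerts - true_positives
    return {
        "true_positives": round(true_positives),
        "total_alerts": round(total_alerts),
        "false_positives": round(false_positives),
        "wrong_per_real": round(false_positives / true_positives, 1),
    }


load = alert_load(real_entities=10_000, recall=0.95, precision=0.227)
print(load)
```

Note that the ratio of wrong alerts to real ones depends only on precision: it equals (1 - precision) / precision, whatever the recall.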

This is the false positive tax. It is the overhead any team pays when running a high-recall, low-precision PII system at scale. The direct cost is reviewer time. The indirect cost is worse: over-redacted documents hide useful data, slow work, and erode trust in the tool.

What Presidio Issue #1071 Shows

Microsoft Presidio GitHub discussion #1071 (2024) records a specific pattern. The TFN (Tax File Number) and PCI recognizers use checksum validation. Numbers that pass the checksum receive a score of 1.0 — maximum confidence. No PII context is required.

The root cause: context word checking runs after the checksum step, not before. A number that passes the checksum gets a top score regardless of surrounding text. In financial spreadsheets, scientific datasets, or log files, this floods the output with wrong alerts. Score threshold filtering cannot fix it. The scores are already at maximum.
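To see why a checksum alone is weak evidence, it helps to count how often arbitrary numbers pass one. The sketch below assumes the commonly documented nine-digit TFN weighting (weights 1, 4, 3, 7, 5, 8, 6, 9, 10, with the weighted sum divisible by 11). It is an illustration under that assumption, not Presidio's recognizer code.

```python
# Assumed weights for the nine-digit TFN check; not taken from Presidio's source.
TFN_WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)


def passes_tfn_checksum(number: str) -> bool:
    """Return True if the digits in `number` satisfy the weighted mod-11 check."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) != 9:
        return False
    weighted_sum = sum(d * w for d, w in zip(digits, TFN_WEIGHTS))
    return weighted_sum % 11 == 0


# Run the check over a block of consecutive nine-digit numbers that have
# nothing to do with tax records (think row IDs or sensor readings).
sample = range(100_000_000, 100_100_000)
passing = sum(passes_tfn_checksum(str(n)) for n in sample)
pass_rate = passing / len(sample)
print(f"{passing} of {len(sample)} arbitrary numbers pass ({pass_rate:.1%})")
```

Under this weighting, about 9 to 10 percent of arbitrary nine-digit numbers pass. In a spreadsheet full of numeric IDs, a recognizer that gives every passing number a top score will flag a steady stream of values that were never tax identifiers.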

A second pattern appears in Presidio issue #999. German word segmentation breaks down for compound nouns. Words like Bundesbehörde (federal authority) can be split incorrectly and tagged as personal names. This adds noise in any German-language document.

The 22.7% Precision Problem

Alvaro et al. (2024) tested Presidio on mixed-language enterprise datasets. They found 22.7% precision. In real documents, fewer than one in four Presidio alerts is a real PII entity. This matches what practitioners report. A tool tuned for recall alone produces too much noise for production use.

A 2024 DICOM study showed that raising score_threshold to 0.7 still left wrong alerts in 38 of 39 medical images. A threshold that clears noise in one document type creates missed detections in another.

This is not a Presidio-only problem. Any fixed threshold forces a trade-off. A high threshold cuts noise but raises misses. A low threshold raises recall but inflates the alert count.
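The trade-off is easy to reproduce on a toy example. The scores and labels below are invented for illustration, not measured data, and `precision_recall_at` is a hypothetical helper.

```python
# Illustrative (score, is_real_pii) pairs. These are NOT measured results.
detections = [
    (1.00, True), (1.00, False), (0.95, True), (0.85, False),
    (0.80, True), (0.65, True), (0.60, False), (0.45, True),
    (0.40, False), (0.30, False),
]
total_real = sum(is_real for _, is_real in detections)


def precision_recall_at(threshold: float) -> tuple:
    """Precision and recall when only alerts scoring >= threshold are kept."""
    kept = [is_real for score, is_real in detections if score >= threshold]
    true_positives = sum(kept)
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / total_real
    return precision, recall


for threshold in (0.3, 0.5, 0.7, 0.9):
    precision, recall = precision_recall_at(threshold)
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy set, raising the threshold from 0.3 to 0.9 lifts precision only modestly while recall falls from 1.0 to 0.4. The wrong alert that scored 1.0 survives every threshold.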

Context-Aware Scoring

The fix is context-aware confidence scoring. Instead of scoring based on the pattern match alone, the system boosts confidence when context words appear near the match. It also lowers the score when context is absent.

For TFN detection: words like "tax file number," "TFN," or "Australian tax" near a number boost its score. A number that passes the checksum but has no nearby context words scores below the review threshold. The spurious alert is suppressed.
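A minimal sketch of this idea follows. All constants (base score, boost, window size, review threshold) are hypothetical values chosen for illustration, and the regex stands in for a candidate that is assumed to have passed checksum validation already. It shows the scoring logic only and does not reproduce any particular tool's implementation.

```python
import re

CONTEXT_WORDS = ("tax file number", "tfn", "australian tax")
BASE_SCORE = 0.4        # hypothetical: checksum-valid match with no context
CONTEXT_BOOST = 0.5     # hypothetical: added when a context word is nearby
REVIEW_THRESHOLD = 0.6  # hypothetical: alerts below this are suppressed
WINDOW = 40             # characters inspected on each side of the match

CANDIDATE = re.compile(r"\b\d{3} ?\d{3} ?\d{3}\b")


def score_candidate(text: str, start: int, end: int) -> float:
    """Score a match from its surrounding text rather than the pattern alone."""
    window = text[max(0, start - WINDOW): end + WINDOW].lower()
    has_context = any(word in window for word in CONTEXT_WORDS)
    return min(1.0, BASE_SCORE + (CONTEXT_BOOST if has_context else 0.0))


def flagged(text: str) -> list:
    """Return (match, score) pairs that reach the review threshold."""
    results = []
    for match in CANDIDATE.finditer(text):
        score = score_candidate(text, match.start(), match.end())
        if score >= REVIEW_THRESHOLD:
            results.append((match.group(), score))
    return results


print(flagged("Employee TFN: 123 456 782"))     # context present, alert kept
print(flagged("batch_id=123456782 status=ok"))  # no context, alert suppressed
```

The key design choice is that the no-context score starts below the review threshold, so context has to earn the alert. Plain substring matching on context words is crude; a production system would likely tokenize or lemmatize first.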

For cross-language noise: entity types tied to specific countries can be scoped to documents in the matching language. Scoping a TFN detector to English-language text, including Australian English, removes that noise. Running it unscoped on German content is what produces the problem.

The third layer in a hybrid system is a transformer model. It reads the full context window around each candidate. It distinguishes "John Smith, Patient ID 12345" from a product code that happens to match a name pattern. Context resolves the ambiguity that regex and checksums cannot.
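Language scoping can be as simple as a lookup consulted before any recognizer runs. The mapping and entity labels below are hypothetical examples, and the document language is assumed to be known or detected upstream.

```python
# Hypothetical mapping of entity types to the document languages they apply to.
ENTITY_LANGUAGES = {
    "AU_TFN": {"en"},
    "DE_TAX_ID": {"de"},
    "EMAIL_ADDRESS": None,  # None means language-independent
}


def recognizers_for(document_language: str) -> list:
    """Return the entity types that should run on a document in this language."""
    return sorted(
        entity
        for entity, languages in ENTITY_LANGUAGES.items()
        if languages is None or document_language in languages
    )


print(recognizers_for("de"))  # the TFN recognizer is not in this list
print(recognizers_for("en"))
```

Skipping a recognizer entirely removes its false positives at no scoring cost. The price is that a genuine TFN quoted inside a German document would be missed, so the scope should reflect the data actually expected.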

See how the three-tier detection engine handles precision at scale. The multilingual PII detection guide covers how cross-language noise affects GDPR compliance.

Practical Steps

Before deploying any PII tool, measure its precision — not just recall.

Run the tool on a document set with known PII and known non-PII. Count alerts in both groups. Calculate true_positives / (true_positives + false_positives). This number reveals the review burden before you commit to a rollout.
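The calculation can be sketched as follows, assuming detections and labels are both represented as (document, start, end, entity type) tuples and that only exact span matches count as true positives. Both the representation and the sample spans are illustrative assumptions.

```python
def precision_recall(predicted: set, actual: set) -> tuple:
    """Compare predicted spans against labeled spans using exact matching."""
    true_positives = len(predicted & actual)
    false_positives = len(predicted - actual)
    false_negatives = len(actual - predicted)
    precision = (
        true_positives / (true_positives + false_positives) if predicted else 0.0
    )
    recall = true_positives / (true_positives + false_negatives) if actual else 0.0
    return precision, recall


# Illustrative spans: (doc_id, start, end, entity_type)
actual = {
    ("doc1", 10, 21, "AU_TFN"),
    ("doc2", 0, 10, "PERSON"),
}
predicted = {
    ("doc1", 10, 21, "AU_TFN"),  # correct
    ("doc1", 40, 49, "AU_TFN"),  # wrong alert in a document that has PII
    ("doc3", 5, 14, "AU_TFN"),   # wrong alert in a PII-free document
}

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact span matching is strict. Teams that accept partial overlaps should say so in their report, because the choice shifts both numbers.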

For teams already using Presidio, score distribution analysis is a fast path. Export a sample of detections with their confidence scores. Count how many fall below 0.6, 0.7, and 0.8, and how many sit at or above 0.8. A large share of high-score alerts in clean text signals a context gap, not a threshold problem. The security compliance overview explains how to document this in a DPIA.
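A score distribution check needs only the exported confidence values. The sketch below assumes the export is a plain list of scores from text known to contain no PII; the sample values are invented for illustration.

```python
from collections import Counter

BANDS = (0.6, 0.7, 0.8)  # band edges named in the text


def band_for(score: float) -> str:
    """Label a score with the first band edge it falls below."""
    for upper in BANDS:
        if score < upper:
            return f"<{upper}"
    return f">={BANDS[-1]}"


# Hypothetical export: scores of alerts raised on PII-free text.
clean_text_scores = [1.0, 1.0, 0.85, 1.0, 0.65, 0.4, 1.0, 0.95]

histogram = Counter(band_for(score) for score in clean_text_scores)
high_share = histogram[">=0.8"] / len(clean_text_scores)

for label in ("<0.6", "<0.7", "<0.8", ">=0.8"):
    print(f"{label:>6}: {histogram[label]}")
print(f"share of wrong alerts at or above 0.8: {high_share:.0%}")
```

If most wrong alerts sit in the top band, as in this invented sample, no threshold setting will remove them without also removing real detections.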

When This Approach Has Limits

Measuring precision before rollout is the right discipline, and it exposes the false-positive tax that recall-only benchmarks hide. But optimizing precision has its own boundaries.

Precision and recall trade against each other. Driving down false positives — raising thresholds, demanding context — also raises the chance of missing a genuine identifier. In a privacy context the missed identifier is the more serious failure, so a precision-focused configuration must be checked for the recall it gives up. The point is to measure both and choose the balance deliberately, not to chase precision until detection quietly leaks PII.

Measured precision is only as representative as the test set. Calculating true positives over total positives on a labeled sample tells you about that sample. Production data with different languages, formats, or OCR quality can behave very differently, and the 22.7% figure that motivates this article came from one mixed-language dataset, not a universal constant. Re-measure on your own data and re-measure again as document types drift.

Better metrics inform a decision; they do not make it compliant. A precision number documented in a DPIA is evidence of diligence, but the DPIA still requires a controller's judgment on necessity, proportionality, and residual risk. Lowering the review burden is an operational win; deciding whether the remaining error rate is acceptable for your regulatory context is a human call that no precision score makes for you.

Sources

Microsoft Presidio GitHub Discussion #1071: systematic false positives.
Microsoft Presidio GitHub Issue #999: German language false positive patterns.
Alvaro et al. (2024): Presidio precision on mixed-language enterprise datasets.
Microsoft Presidio community (2024): DICOM score threshold analysis.
