The False Positive Tax on PII Detection Tools
Updated for 2026
Most PII tools are judged on recall. Recall measures what share of real PII the tool finds. But precision matters just as much. Precision measures what share of the tool's alerts are real PII.
Low precision is expensive. A system with 95% recall and 22.7% precision catches most PII. Yet for every real PII entity it flags, it also raises about 3.4 wrong alerts. In a dataset with 10,000 real PII entities, that system finds 9,500 of them and fires roughly 41,900 alerts in total. About 32,400 of them are wrong. Each one costs time to review or causes over-redaction.
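The alert volume follows directly from the definitions of recall and precision. A minimal sketch of that arithmetic, assuming true positives equal real entities times recall and total alerts equal true positives divided by precision (the function name `alert_volume` is illustrative, not from any library):

```python
def alert_volume(real_entities: int, recall: float, precision: float) -> dict:
    """Estimate alert counts from recall and precision.

    recall    = true_positives / real_entities
    precision = true_positives / total_alerts
    """
    true_positives = real_entities * recall
    total_alerts = true_positives / precision
    false_positives = total_alerts - true_positives
    return {
        "true_positives": round(true_positives),
        "total_alerts": round(total_alerts),
        "false_positives": round(false_positives),
        # Wrong alerts raised per correct alert.
        "false_per_true": round(false_positives / true_positives, 1),
    }


volume = alert_volume(10_000, recall=0.95, precision=0.227)
print(volume)
```

Changing `precision` while holding recall fixed shows how quickly the review queue grows as precision drops.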
This is the false positive tax. It is the overhead any team pays when running a high-recall, low-precision PII system at scale. The direct cost is reviewer time. The indirect cost is worse: over-redacted documents hide useful data, slow work, and erode trust in the tool.
What Presidio Discussion #1071 Shows
Microsoft Presidio GitHub discussion #1071 (2024) records a specific pattern. The TFN (Tax File Number) and PCI recognizers use checksum validation. Numbers that pass the checksum receive a score of 1.0, the maximum confidence. No PII context is required.
The root cause: context word checking runs after the checksum step, not before. A number that passes the checksum gets a top score regardless of surrounding text. In financial spreadsheets, scientific datasets, or log files, this floods the output with wrong alerts. Score threshold filtering cannot fix it. The scores are already at maximum.
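To see why a threshold cannot help, consider a simplified stand-in for checksum-only scoring. This is a sketch in plain Python, not Presidio's code: `checksum_only_score` and `detect` are hypothetical helpers, and the check shown is the publicly documented 9-digit TFN rule (weighted digit sum divisible by 11).

```python
import re

# Weights for the 9-digit TFN check: the weighted digit sum must be divisible by 11.
TFN_WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)


def passes_tfn_checksum(digits: str) -> bool:
    if re.fullmatch(r"[0-9]{9}", digits) is None:
        return False
    weighted_sum = sum(int(d) * w for d, w in zip(digits, TFN_WEIGHTS))
    return weighted_sum % 11 == 0


def checksum_only_score(candidate: str) -> float:
    # Hypothetical rule mirroring the behaviour described above:
    # a checksum pass alone yields maximum confidence.
    return 1.0 if passes_tfn_checksum(candidate) else 0.0


def detect(text: str, score_threshold: float) -> list[tuple[str, float]]:
    hits = []
    for match in re.finditer(r"\b[0-9]{9}\b", text):
        score = checksum_only_score(match.group())
        if score >= score_threshold:
            hits.append((match.group(), score))
    return hits


# A log line with no PII context; the number happens to pass the checksum.
log_line = "batch 123456782 processed in 40 ms"
for threshold in (0.5, 0.7, 0.99):
    print(threshold, detect(log_line, threshold))
```

Every threshold up to 1.0 keeps the alert, because the score carries no information about context.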
A second pattern appears in Presidio issue #999. German word segmentation breaks down for compound nouns. Words like Bundesbehörde (federal authority) can be split incorrectly and tagged as personal names. This adds noise in any German-language document.
The 22.7% Precision Problem
Alvaro et al. (2024) tested Presidio on mixed-language enterprise datasets. They found 22.7% precision. On those datasets, fewer than one in four Presidio alerts was a real PII entity. This matches what practitioners report. A tool tuned for recall alone produces too much noise for production use.
A 2024 DICOM study showed that raising score_threshold to 0.7 still left wrong alerts in 38 of 39 medical images. A threshold that clears noise in one document type creates missed detections in another.
This is not a Presidio-only problem. Any fixed threshold forces a trade-off. A high threshold cuts noise but raises misses. A low threshold raises recall but inflates the alert count.
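The trade-off can be made concrete by sweeping a threshold over scored detections with known labels. The sketch below uses a small invented sample (the scores and labels are illustrative only, not measurements from any tool), and `precision_recall_at` is a hypothetical helper name.

```python
from typing import Optional


def precision_recall_at(
    scored: list[tuple[float, bool]], threshold: float
) -> tuple[Optional[float], float]:
    """Precision and recall when only detections with score >= threshold are kept.

    Each item is (confidence score, whether the detection is real PII).
    """
    kept = [is_pii for score, is_pii in scored if score >= threshold]
    true_positives = sum(kept)
    total_real = sum(is_pii for _, is_pii in scored)
    precision = true_positives / len(kept) if kept else None  # undefined with no alerts
    recall = true_positives / total_real if total_real else 0.0
    return precision, recall


# Invented sample for illustration only.
sample = [
    (0.95, True), (0.90, False), (0.85, True), (0.65, True),
    (0.60, False), (0.50, False), (0.45, True), (0.30, False),
]

for threshold in (0.4, 0.6, 0.8):
    precision, recall = precision_recall_at(sample, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy sample, raising the threshold improves precision and lowers recall, which is the pattern the paragraph above describes.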
Context-Aware Scoring
The fix is context-aware confidence scoring. Instead of scoring based on the pattern match alone, the system boosts confidence when context words appear near the match. It also lowers the score when context is absent.
For TFN detection: words like "tax file number," "TFN," or "Australian tax" near a number boost its score. A number that passes the checksum but has no nearby context words scores below the review threshold. The spurious alert is suppressed.
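A minimal sketch of this idea for TFN candidates follows. The window size, base score, boosted score, and review threshold are all assumptions chosen for illustration, and `score_candidates` is a hypothetical helper, not an API of any existing tool.

```python
import re

TFN_WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)
CONTEXT_TERMS = ("tax file number", "tfn", "australian tax")

WINDOW = 40             # characters inspected on each side (assumed value)
BASE_SCORE = 0.4        # checksum pass, no context (assumed value)
BOOSTED_SCORE = 0.95    # checksum pass plus nearby context (assumed value)
REVIEW_THRESHOLD = 0.6  # alerts below this are suppressed (assumed value)


def passes_tfn_checksum(digits: str) -> bool:
    weighted_sum = sum(int(d) * w for d, w in zip(digits, TFN_WEIGHTS))
    return weighted_sum % 11 == 0


def score_candidates(text: str) -> list[tuple[str, float]]:
    results = []
    for match in re.finditer(r"\b[0-9]{9}\b", text):
        if not passes_tfn_checksum(match.group()):
            continue
        # Look for context terms in a window of text around the match.
        window = text[max(0, match.start() - WINDOW): match.end() + WINDOW].lower()
        has_context = any(term in window for term in CONTEXT_TERMS)
        results.append((match.group(), BOOSTED_SCORE if has_context else BASE_SCORE))
    return results


def flagged(text: str) -> list[str]:
    return [value for value, score in score_candidates(text) if score >= REVIEW_THRESHOLD]


print(flagged("Employee TFN: 123456782"))
print(flagged("sensor reading 123456782 at 12:00"))
```

Simple substring matching is used here for brevity; a production system would more likely match on tokens or lemmas so that short terms such as "tfn" do not match inside unrelated words.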
For cross-language noise: entity types tied to specific countries can be scoped to documents in the matching language. Scoping a TFN detector to English-language text, including Australian English, removes this noise. Running it unscoped on German content is what produces the spurious matches.
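Scoping can be as simple as a lookup from entity type to the languages it applies to. The sketch below assumes the document language has already been detected upstream; the entity names and the `ENTITY_LANGUAGES` mapping are illustrative placeholders.

```python
from typing import Optional

# Illustrative mapping: None means the entity type applies to every language.
ENTITY_LANGUAGES: dict[str, Optional[frozenset[str]]] = {
    "AU_TFN": frozenset({"en"}),
    "DE_TAX_ID": frozenset({"de"}),
    "EMAIL_ADDRESS": None,
}


def active_entities(document_language: str) -> list[str]:
    """Return the entity types that should run for a document in this language."""
    return sorted(
        entity
        for entity, languages in ENTITY_LANGUAGES.items()
        if languages is None or document_language in languages
    )


print(active_entities("en"))
print(active_entities("de"))
```

With this scoping in place, a German document never reaches the TFN detector, so checksum coincidences in that text cannot raise TFN alerts.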
The third layer in a hybrid system is a transformer model. It reads the full context window around each candidate. It distinguishes "John Smith, Patient ID 12345" from a product code that happens to match a name pattern. Context resolves the ambiguity that regex and checksums cannot.
See how the three-tier detection engine handles precision at scale. The multilingual PII detection guide covers how cross-language noise affects GDPR compliance.
Practical Steps
Before deploying any PII tool, measure its precision, not just its recall.
Run the tool on a document set with known PII and known non-PII. Count alerts in both groups. Calculate true_positives / (true_positives + false_positives). This number reveals the review burden before you commit to a rollout.
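One way to run this measurement is to compare the tool's detections against a hand-labeled gold set. The sketch below assumes each detection is represented as a `(doc_id, start, end, entity_type)` tuple and counts only exact span matches as correct; both choices are assumptions, and looser overlap matching is also common.

```python
from typing import Optional

Span = tuple[str, int, int, str]  # (doc_id, start, end, entity_type)


def precision_recall(
    predicted: set[Span], gold: set[Span]
) -> tuple[Optional[float], Optional[float]]:
    true_positives = len(predicted & gold)
    false_positives = len(predicted - gold)
    false_negatives = len(gold - predicted)
    precision = (
        true_positives / (true_positives + false_positives) if predicted else None
    )
    recall = true_positives / (true_positives + false_negatives) if gold else None
    return precision, recall


# Tiny illustrative example: doc1 contains known PII, doc2 contains none.
gold = {("doc1", 0, 10, "PERSON"), ("doc1", 25, 34, "AU_TFN")}
predicted = {
    ("doc1", 0, 10, "PERSON"),
    ("doc2", 5, 14, "AU_TFN"),
    ("doc2", 40, 49, "AU_TFN"),
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Including documents with no PII at all, such as `doc2` here, matters: every alert raised on them is a false positive that a PII-only test set would never reveal.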
For teams already using Presidio, score distribution analysis is a fast path. Export a sample of detections with their confidence scores. Count how many score below 0.6, 0.7, and 0.8. A large share of high-score alerts in text known to contain no PII signals a context gap, not a threshold problem. The security compliance overview explains how to document this in a DPIA.
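The counting step can be sketched in a few lines, assuming the exported detections have been reduced to a plain list of confidence scores. The sample scores below are invented for illustration, and `score_distribution` is a hypothetical helper name.

```python
def score_distribution(
    scores: list[float], cutoffs: tuple[float, ...] = (0.6, 0.7, 0.8)
) -> dict:
    """Count detections below each cutoff, plus those at or above the highest one."""
    below = {cutoff: sum(score < cutoff for score in scores) for cutoff in cutoffs}
    at_or_above_top = sum(score >= max(cutoffs) for score in scores)
    return {"below": below, "at_or_above_top": at_or_above_top, "total": len(scores)}


# Invented scores from detections on text known to contain no PII.
clean_text_scores = [1.0, 1.0, 0.85, 0.65, 0.4]

distribution = score_distribution(clean_text_scores)
print(distribution)
```

If most false alerts sit at or above the top cutoff, as in this toy sample, no threshold setting will remove them and the context logic is the place to look.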
Sources
- Microsoft Presidio GitHub Discussion #1071: systematic false positives.
- Microsoft Presidio GitHub Issue #999: German language false positive patterns.
- Alvaro et al. (2024): Presidio precision on mixed-language enterprise datasets.
- DICOM score threshold analysis (2024): Microsoft Presidio community.