Updated for 2026

GDPR audits often turn up the same hidden risk: old image-based PDF archives.

Law firms hold 20 years of scanned client files. Hospitals keep decades of patient forms. Government bodies store scanned records. Banks have imaged loan files.

These archives share one trait. The files are raster images — scanned PDFs, TIFF, or JPEG. There is no text layer. Standard PII tools cannot read them. To most anonymization tools, these files do not exist.

A common belief: "These are image files — GDPR doesn't apply."

GDPR Article 17(1) gives data subjects the right to erasure. Recital 26 places anonymized data outside the regulation's scope. Neither carves out an exception for image formats. A law firm that cannot fulfill an erasure request for a 15-year-old client file has a compliance gap, not an exemption.

See our compliance overview and security practices for how we support GDPR.

How the Detection Pipeline Works

The process runs in three stages.

Stage 1 — OCR

The OCR engine reads the image and extracts text. It records the position of each word. Output is machine-readable text with coordinates. Accuracy drops when handwriting, faded ink, or old typefaces are present.
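
As an illustration of what Stage 1 hands to the later stages, the sketch below models OCR output as words with positions and confidences. The `OcrWord` fields and the `page_text` and `mean_confidence` helpers are hypothetical names chosen for this example, not the output format of any particular OCR engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrWord:
    text: str
    left: int          # pixel coordinates of the word's bounding box
    top: int
    width: int
    height: int
    confidence: float  # 0.0 to 1.0, as reported by the (assumed) OCR engine


def page_text(words):
    """Join words into one string and record each word's start offset."""
    parts, offsets, cursor = [], [], 0
    for word in words:
        offsets.append(cursor)
        parts.append(word.text)
        cursor += len(word.text) + 1  # +1 for the joining space
    return " ".join(parts), offsets


def mean_confidence(words):
    """Average word confidence for a page; 0.0 for an empty page."""
    return sum(w.confidence for w in words) / len(words) if words else 0.0


words = [
    OcrWord("Jane", 100, 50, 60, 20, 0.99),
    OcrWord("Doe", 170, 50, 45, 20, 0.97),
    OcrWord("signed", 225, 50, 80, 20, 0.92),
]
text, offsets = page_text(words)
```

Keeping the character offsets next to the coordinates lets a later stage map a detected entity back to its place on the page.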

Stage 2 — NLP Entity Detection

Named Entity Recognition (NER) scans the OCR text. It finds person names, organizations, and locations. Pattern matching adds SSNs, phone numbers, and account numbers. Each hit gets a confidence score.
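
The pattern-matching half of Stage 2 can be sketched with regular expressions. The two patterns and their fixed confidence scores below are illustrative assumptions, not a production rule set. The NER model that finds names, organizations, and locations is not shown.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    entity_type: str
    start: int
    end: int
    text: str
    confidence: float


# Illustrative patterns and scores only; real rule sets are broader and tuned.
PATTERNS = {
    "US_SSN": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 0.85),
    "PHONE": (re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"), 0.60),
}


def detect_patterns(text):
    """Return every pattern hit in the text, ordered by position."""
    hits = []
    for entity_type, (pattern, score) in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append(
                Detection(entity_type, match.start(), match.end(), match.group(), score)
            )
    return sorted(hits, key=lambda d: d.start)


sample = "SSN 123-45-6789, call 555-867-5309."
hits = detect_patterns(sample)
```

Each hit carries its character span, so Stage 3 can replace it without searching the text again.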

Stage 3 — Anonymization

Detected entities are replaced in the text output. The original image is not changed. Changing the image requires separate redaction tooling. The anonymized text supports erasure requests, DSAR responses, and compliance records.
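
A minimal sketch of the replacement step, assuming detections arrive as non-overlapping `(start, end, entity_type)` spans. The `<ENTITY_TYPE>` placeholder format is an assumption made for this example. Only the text is changed; no image is touched.

```python
def anonymize(text, detections):
    """Replace each (start, end, entity_type) span with a placeholder token."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, entity_type in sorted(detections, reverse=True):
        out = out[:start] + f"<{entity_type}>" + out[end:]
    return out


text = "Jane Doe, SSN 123-45-6789"
detections = [(0, 8, "PERSON"), (14, 25, "US_SSN")]
anonymized = anonymize(text, detections)
```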

Modern OCR engines reach 98–99% character accuracy on clean printed pages. Handwriting or degraded scans drop to 85–92%. Entity-level accuracy tends to be higher than character-level accuracy. A name can be identified even when a few letters are wrong.
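
One way a name can survive a wrong letter is approximate string matching against a known name, such as the name on an erasure request. The sketch below uses Python's standard `difflib`. The `matches_name` helper and the 0.8 threshold are assumptions for illustration, not a description of how any specific detection engine scores entities.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def matches_name(ocr_token, known_name, threshold=0.8):
    """Treat an OCR token as the known name if it is similar enough."""
    return similarity(ocr_token, known_name) >= threshold


# "i" misread as "l" by OCR: still close enough to match.
misread_hit = matches_name("Katherlne", "Katherine")
# A different word that shares a prefix: not a match.
unrelated_hit = matches_name("Kathmandu", "Katherine")
```

A lower threshold catches more damaged names but also produces more false matches, so the value needs tuning against sample pages.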

The practical upshot: OCR accuracy affects how many entities you catch. It does not determine whether the method works. Even at 90% accuracy, you find most names and numbers. Quality tiers are still needed. The method itself is sound.

Processing a Large Archive

Large legacy archives follow a four-phase workflow.

Phase 1 — Inventory: List all image-based archives. Note source system and date range. Put high-erasure-risk records first. Client-facing files come before internal ones.

Phase 2 — Batch processing: Run OCR and PII detection in batches. Five to ten thousand files per batch is a common size. Processing runs overnight. Output is a PII report and an anonymized text extract for each file.

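Phase 3 — Erasure fulfillment: The data subject sends a request with their name and the relevant period. Search the per-file PII reports, or an entity index built from them, for that name; the anonymized extracts no longer contain it. Find the matching files. Redact them. Log the action.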

Phase 4 — Ongoing compliance: Put new scanned files through the same pipeline before you archive them. Keep PII reports as Article 30 Records of Processing Activities evidence.
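
The ordering in Phase 1 and the batching in Phase 2 can be sketched as follows. The `client_facing` flag and the record layout are hypothetical. The default batch size of 5,000 is taken from the range given above.

```python
from itertools import islice


def prioritize(inventory):
    """Order records so client-facing archives are processed first."""
    # False sorts before True, and sorted() is stable,
    # so the original order is kept within each group.
    return sorted(inventory, key=lambda rec: not rec["client_facing"])


def batches(items, size=5000):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


inventory = [
    {"file_id": "hr-001", "client_facing": False},
    {"file_id": "contract-001", "client_facing": True},
    {"file_id": "contract-002", "client_facing": True},
]
ordered = [rec["file_id"] for rec in prioritize(inventory)]
batch_sizes = [len(batch) for batch in batches(range(12_000))]
```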

Case Study: Law Firm Archive

A law firm audit found 80,000 image-based PDF client contracts scanned from 1998 to 2010. Standard PII tools reported zero detections. Without a text layer, the files were invisible to them.

Fifteen former clients had submitted erasure requests in the prior 12 months. The firm said: "We cannot confirm your records have been erased." That answer does not meet GDPR Article 17.

What the firm did:

Ran OCR and PII detection on all 80,000 files in batches of 5,000
Processing took about three weeks
Result: 80,000 anonymized text extracts with per-file reports
Built a searchable index linking entities to file IDs
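
A searchable index of this kind can be as simple as a mapping from each detected entity to the files that contain it. The sketch below is one possible shape, not the firm's implementation. The report layout, file IDs, and function names are hypothetical.

```python
from collections import defaultdict


def build_index(reports):
    """reports: mapping of file ID -> list of (entity_type, value) detections."""
    index = defaultdict(set)
    for file_id, entities in reports.items():
        for entity_type, value in entities:
            # casefold() makes lookups case-insensitive.
            index[(entity_type, value.casefold())].add(file_id)
    return index


def find_files(index, entity_type, value):
    """Return the sorted file IDs that contain the given entity."""
    return sorted(index.get((entity_type, value.casefold()), set()))


reports = {
    "file-0001": [("PERSON", "Jane Doe"), ("US_SSN", "123-45-6789")],
    "file-0002": [("PERSON", "JANE DOE")],
    "file-0003": [("PERSON", "John Roe")],
}
index = build_index(reports)
jane_files = find_files(index, "PERSON", "jane doe")
```

An index like this holds personal data itself, so it needs the same access controls as the archive it describes.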

After processing:

Finding files for one subject: 4 minutes on average
Files per request: 6–8 on average
Redaction time per request: 20–30 minutes

All 15 outstanding requests were resolved within 30 days.

The key point: the compliance obligation existed before the processing. The firm just lacked the tools to meet it. OCR-based processing did not create a new duty. It made an existing duty possible to fulfill.

OCR Limits and Quality Tiers

Handwriting has lower OCR accuracy. For handwritten content, lower the confidence threshold for entity detection before processing, so that uncertain matches are flagged for review rather than dropped.

Poor scan quality reduces scores. Contrast enhancement and de-skewing help before OCR runs.

Unusual layouts — multi-column pages, old legal typefaces — may also score lower.

Set quality tiers for compliance work:

Above 95% page accuracy: run automated processing
80–95%: run automated processing, then human review for flagged entities
Below 80%: send to manual review
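
The tiers above reduce to a small routing function. This sketch assumes page accuracy is available as a fraction between 0 and 1, for example estimated from OCR confidence scores. The tier names are hypothetical labels.

```python
def quality_tier(page_accuracy):
    """Route a page by estimated accuracy (a fraction between 0 and 1)."""
    if page_accuracy > 0.95:
        return "automated"
    if page_accuracy >= 0.80:
        return "automated_plus_review"
    return "manual_review"


tiers = [quality_tier(a) for a in (0.97, 0.95, 0.80, 0.79)]
```

A page at exactly 95% falls into the review tier, matching the "above 95%" wording of the first tier.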

A tiered approach gives regulators a clear answer about how you assessed reliability. Automated processing handles the high-confidence files, and a manual queue handles the rest. Throughput stays high without lowering compliance quality.

Our FAQ covers common questions about OCR-based processing and audit trail requirements.

When This Approach Has Limits

OCR plus NLP brings image-only archives within reach of erasure workflows, and GDPR offers no exemption for image formats. Three limits still apply.

OCR accuracy bounds entity recall on exactly the worst files. The method works at 90% character accuracy because names survive a wrong letter. Degraded scans, cursive marginalia, faded carbon copies, and unusual old typefaces push accuracy lower, though, and in a short identifier such as an account number a single missed character can mean a missed entity. The residual false-negative rate concentrates in the hardest documents, which are often the oldest and most likely to hold sensitive records. The quality tiers in this article exist precisely because automated processing alone leaves a tail of misses that only human review of low-confidence files can close.

Anonymizing the text extract does not anonymize the source image. This pipeline replaces entities in the machine-readable output while leaving the original raster file unchanged, by design. An erasure request answered only against the extracts can still leave the data subject's name visible in the stored TIFF or scanned PDF. Detection tells you which files contain whose data and where; redacting the image itself requires separate tooling and a separate step. Confirm your erasure workflow acts on the source, not just the searchable index built from it.
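
To act on the source image, the detected text span has to be mapped back to page coordinates. The sketch below shows that mapping only, assuming the OCR stage kept a character offset and bounding box for each word. `PlacedWord` and both helpers are hypothetical names. Drawing the redaction onto the TIFF or PDF is left to separate tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlacedWord:
    start: int   # character offset of the word in the extracted text
    text: str
    box: tuple   # (left, top, right, bottom) in image pixels


def boxes_for_span(words, span_start, span_end):
    """Return the boxes of every word that overlaps [span_start, span_end)."""
    return [
        w.box
        for w in words
        if w.start < span_end and w.start + len(w.text) > span_start
    ]


def merge(boxes):
    """Smallest rectangle covering all boxes (assumes they sit on one line)."""
    if not boxes:
        return None
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))


words = [
    PlacedWord(0, "Jane", (100, 50, 160, 70)),
    PlacedWord(5, "Doe", (170, 50, 215, 70)),
    PlacedWord(9, "signed", (225, 50, 305, 70)),
]
# Detected PERSON entity "Jane Doe" covers characters 0 to 8.
redaction_box = merge(boxes_for_span(words, 0, 8))
```

An entity that wraps across lines should be redacted as one rectangle per line rather than a single merged box.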

The pipeline enables an erasure obligation; it does not discharge it. Finding files and logging actions supports an Article 17 response, but whether erasure is complete and adequate is a legal judgment, not a tool output. Retention rules, litigation holds, and overlapping copies in backups all bear on what "erased" actually means for a given file. The engine makes a previously impossible duty achievable; a human still has to decide that the duty has in fact been met.


