GDPR and Legacy Scanned Files: OCR for PII
Updated for 2026
GDPR audits often turn up the same hidden risk: old image-based PDF archives.
Law firms hold 20 years of scanned client files. Hospitals keep decades of patient forms. Government bodies store scanned records. Banks have imaged loan files.
These archives share one trait. The files are raster images — scanned PDFs, TIFF, or JPEG. There is no text layer. Standard PII tools cannot read them. To most anonymization tools, these files do not exist.
A common belief: "These are image files — GDPR doesn't apply."
That belief is wrong. GDPR Article 17(1) gives people the right to erasure. Recital 26 places anonymous information outside the regulation's scope. Neither carves out an exception for image formats. A law firm that cannot fulfill an erasure request for a 15-year-old client file has a compliance gap. It does not have an exemption.
See our compliance overview and security practices for how we support GDPR.
How the Detection Pipeline Works
The process runs in three stages.
Stage 1 — OCR
The OCR engine reads the image and extracts text. It records the position of each word. Output is machine-readable text with coordinates. Accuracy drops when handwriting, faded ink, or old typefaces are present.
Stage 2 — NLP Entity Detection
Named Entity Recognition (NER) scans the OCR text. It finds person names, organizations, and locations. Pattern matching adds SSNs, phone numbers, and account numbers. Each hit gets a confidence score.
Stage 3 — Anonymization
Detected entities are replaced in the text output. The original image is not changed. Changing the image requires separate redaction tooling. The anonymized text supports erasure requests, DSAR responses, and compliance records.
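The hand-off between the stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any particular product: the `Word` and `Hit` types, the `PATTERNS` table, the US-style number formats, and the fixed confidence of 1.0 for pattern hits are all hypothetical. The OCR step is represented by a ready-made list of words with bounding boxes, and the NER step is omitted so that the example needs only the standard library.

```python
import re
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    box: tuple  # (x, y, width, height) as reported by the OCR stage (assumed shape)

@dataclass
class Hit:
    label: str
    start: int         # character offset in the joined text
    end: int
    confidence: float

# Hypothetical patterns for illustration; real deployments need locale-specific rules.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def join_words(words):
    """Stage 1 output: flatten OCR words into one searchable string."""
    return " ".join(w.text for w in words)

def detect(text):
    """Stage 2 (pattern matching only): return hits sorted by position."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            # Assumption: exact pattern matches get full confidence.
            hits.append(Hit(label, m.start(), m.end(), 1.0))
    return sorted(hits, key=lambda h: h.start)

def anonymize(text, hits):
    """Stage 3: replace each hit in the text output; the image is untouched."""
    parts, cursor = [], 0
    for h in hits:
        parts.append(text[cursor:h.start])
        parts.append(f"[{h.label}]")
        cursor = h.end
    parts.append(text[cursor:])
    return "".join(parts)

words = [
    Word("SSN", (10, 10, 40, 12)),
    Word("123-45-6789", (55, 10, 90, 12)),
    Word("call", (150, 10, 30, 12)),
    Word("555-123-4567", (185, 10, 95, 12)),
]
text = join_words(words)
print(anonymize(text, detect(text)))
```

The bounding boxes are carried along but not used here. They matter only when the detected spans are passed to separate redaction tooling that edits the image itself.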
Modern OCR engines reach 98–99% character accuracy on clean printed pages. On handwriting or degraded scans, accuracy drops to 85–92%. Entity-level accuracy tends to be higher than character-level accuracy, because a name can be identified even when a few letters are wrong.
The practical upshot: OCR accuracy affects how many entities you catch. It does not determine whether the method works. Even at 90% character accuracy, you find most names and numbers. Lower accuracy is a reason to use quality tiers, not a reason to doubt the method.
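One way to see why a name survives a few misread letters is approximate string matching. The sketch below uses `difflib.SequenceMatcher` from the Python standard library. The function names, the sample name, and the 0.8 threshold are assumptions chosen for illustration, not values taken from any specific engine.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_name(known_name, ocr_tokens, threshold=0.8):
    """Slide a window over OCR tokens and keep windows that resemble the name.

    The threshold is a hypothetical starting point and should be tuned
    against a reviewed sample of your own archive.
    """
    n = len(known_name.split())
    matches = []
    for i in range(len(ocr_tokens) - n + 1):
        window = " ".join(ocr_tokens[i:i + n])
        score = similarity(known_name, window)
        if score >= threshold:
            matches.append((window, score))
    return matches

# OCR misread "e" as "c" and "l" as "1", yet the name is still found.
tokens = "Client Margarct Hol1oway signed".split()
for window, score in find_name("Margaret Holloway", tokens):
    print(window, round(score, 2))
```

Two wrong characters out of seventeen leave the similarity well above the threshold, while the neighbouring windows fall far below it.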
Processing a Large Archive
Large legacy archives follow a four-phase workflow.
Phase 1 — Inventory: List all image-based archives. Note source system and date range. Put high-erasure-risk records first. Client-facing files come before internal ones.
Phase 2 — Batch processing: Run OCR and PII detection in batches. Five to ten thousand files per batch is a common size. Batches typically run overnight. The output for each file is a PII report and an anonymized text extract.
Phase 3 — Erasure fulfillment: The subject sends a request with their name and the relevant period. Search the per-file PII reports for the subject's name and identifiers. The anonymized extracts no longer contain them. Find the matching files. Redact them. Log the action.
Phase 4 — Ongoing compliance: Put new scanned files through the same pipeline before you archive them. Keep PII reports as Article 30 Records of Processing Activities evidence.
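The ordering and batching steps in the workflow above can be sketched as follows. This is a minimal example under assumptions: the record layout with a `client_facing` flag and the helper names `prioritize` and `batches` are hypothetical, and the default batch size of 5,000 is simply one value from the range mentioned above.

```python
from itertools import islice

def prioritize(records):
    """Phase 1: put client-facing records ahead of internal ones.

    sorted() is stable, so the original order is kept within each group.
    """
    return sorted(records, key=lambda r: not r["client_facing"])

def batches(items, batch_size=5000):
    """Phase 2: yield successive lists of at most batch_size items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk

records = [
    {"file_id": "int-001", "client_facing": False},
    {"file_id": "cli-001", "client_facing": True},
    {"file_id": "cli-002", "client_facing": True},
]
for batch in batches(prioritize(records), batch_size=2):
    print([r["file_id"] for r in batch])
```

Because `batches` consumes an iterator, it works equally well on a generator that walks a file share, so the full inventory never has to sit in memory at once.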
Case Study: Law Firm Archive
A law firm audit found 80,000 image-based PDF client contracts scanned from 1998 to 2010. Standard PII tools showed zero detections. The image-only files were invisible to them.
Fifteen former clients had submitted erasure requests in the prior 12 months. The firm said: "We cannot confirm your records have been erased." That answer does not meet GDPR Article 17.
What the firm did:
- Ran OCR and PII detection on all 80,000 files in batches of 5,000
- Processing took about three weeks
- Result: 80,000 anonymized text extracts with per-file reports
- Built a searchable index linking entities to file IDs
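An index of this kind can be as simple as a mapping from each normalized entity to the set of file IDs where it was detected. The sketch below is an assumption about how such an index might look, not a description of what the firm actually built. The report format, the file IDs, and the function names are hypothetical.

```python
from collections import defaultdict

def normalize(entity):
    """Lowercase and collapse whitespace so spelling variants share a key."""
    return " ".join(entity.lower().split())

def build_index(reports):
    """Map each detected entity to the set of file IDs that contain it.

    `reports` is assumed to be {file_id: [entity strings from the PII report]}.
    """
    index = defaultdict(set)
    for file_id, entities in reports.items():
        for entity in entities:
            index[normalize(entity)].add(file_id)
    return index

def files_for_subject(index, name):
    """Return the sorted file IDs in which the subject's name was detected."""
    return sorted(index.get(normalize(name), set()))

reports = {
    "f001": ["Jane Doe", "Acme Ltd"],
    "f002": ["jane  doe"],
    "f003": ["John Roe"],
}
index = build_index(reports)
print(files_for_subject(index, "Jane Doe"))
```

Exact-key lookup like this misses OCR misspellings, so in practice it would likely be combined with approximate matching and a human check before anything is redacted.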
After processing:
- Finding files for one subject: 4 minutes on average
- Files per request: 6–8 on average
- Redaction time per request: 20–30 minutes
All 15 outstanding requests were resolved within 30 days.
The key point: the compliance obligation existed before the processing. The firm just lacked the tools to meet it. OCR-based processing did not create a new duty. It made an existing duty possible to fulfill.
OCR Limits and Quality Tiers
Handwriting yields lower OCR accuracy. Lower the detection confidence threshold before processing handwritten content, so that borderline entities are flagged for review rather than missed.
Poor scan quality reduces confidence scores. Applying contrast enhancement and de-skewing before OCR runs helps.
Unusual layouts and typography, such as multi-column pages and old legal typefaces, may also score lower.
Set quality tiers for compliance work:
- Above 95% page accuracy: run automated processing
- 80–95%: run automated processing, then human review for flagged entities
- Below 80%: send to manual review
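The tiers above translate directly into a routing function. This is a minimal sketch: the function name and the tier labels are hypothetical, and it assumes page accuracy is supplied as a fraction between 0 and 1. Exactly 95% is treated as part of the middle tier, which is one reasonable reading of the boundaries.

```python
def route_page(page_accuracy):
    """Assign a page to a processing tier based on estimated OCR accuracy.

    page_accuracy is a fraction in [0.0, 1.0] (assumed input convention).
    """
    if not 0.0 <= page_accuracy <= 1.0:
        raise ValueError("page_accuracy must be between 0.0 and 1.0")
    if page_accuracy > 0.95:
        return "automated"
    if page_accuracy >= 0.80:
        return "automated_with_review"
    return "manual_review"

for accuracy in (0.99, 0.95, 0.80, 0.62):
    print(accuracy, route_page(accuracy))
```

Recording the returned tier alongside each file's PII report gives a per-file record of how reliability was assessed.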
A tiered approach gives regulators a clear answer about how you assessed reliability. Automated tools handle the high-confidence files, which are usually the majority. A manual queue handles the rest. Throughput stays high, and so does compliance quality.
Our FAQ covers common questions about OCR-based processing and audit trail requirements.