Last updated 2026-05-29


GDPR Legacy Scanned Documents: OCR + PII

GDPR's right to erasure applies to personal data 'regardless of format.' Image-based PDFs from paper archives are not exempt.

May 29, 2026 · 7 minute read
Tags: legacy documents, OCR PII detection, GDPR erasure, scanned documents, document archive

GDPR and Legacy Scanned Files: OCR for PII

Updated for 2026

GDPR audits often turn up the same hidden risk: old image-based PDF archives.

Law firms hold 20 years of scanned client files. Hospitals keep decades of patient forms. Government bodies store scanned records. Banks have imaged loan files.

These archives share one trait. The files are raster images — scanned PDFs, TIFF, or JPEG. There is no text layer. Standard PII tools cannot read them. To most anonymization tools, these files do not exist.

A common belief: "These are image files — GDPR doesn't apply."

GDPR Article 17(1) gives data subjects the right to erasure. Recital 26 says the regulation does not apply to information that has been anonymized so that the person is no longer identifiable. Neither carves out an exception for image formats. A law firm that cannot fulfill an erasure request for a 15-year-old client file has a compliance gap. It does not have an exemption.

See our compliance overview and security practices for how we support GDPR.

How the Detection Pipeline Works

The process runs in three stages.

Stage 1 — OCR

The OCR engine reads the image and extracts text. It records the position of each word. Output is machine-readable text with coordinates. Accuracy drops when handwriting, faded ink, or old typefaces are present.
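
A minimal sketch of what "text with coordinates" can look like in practice. The `OcrWord` fields and the helper names are assumptions made for illustration, not the output format of any particular OCR engine. The idea is to keep each word's bounding box next to its character span, so a later detection in the text can be traced back to a position on the page.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OcrWord:
    # Hypothetical shape of one OCR result: the word, its box, and a score.
    text: str
    x: int
    y: int
    width: int
    height: int
    confidence: float  # 0.0 to 1.0

def page_text_with_spans(words):
    """Join words with single spaces and record each word's character span."""
    parts, spans, pos = [], [], 0
    for word in words:
        if parts:
            pos += 1  # account for the joining space
        spans.append((pos, pos + len(word.text), word))
        parts.append(word.text)
        pos += len(word.text)
    return " ".join(parts), spans

def mean_confidence(words):
    """Average word confidence, a rough proxy for page quality."""
    return sum(w.confidence for w in words) / len(words) if words else 0.0

words = [
    OcrWord("John", 10, 20, 40, 12, 0.98),
    OcrWord("Smith", 55, 20, 50, 12, 0.90),
]
text, spans = page_text_with_spans(words)
```

Here `spans[1]` covers the characters of "Smith" in `text` and still carries the box at x=55, y=20.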

Stage 2 — NLP Entity Detection

Named Entity Recognition (NER) scans the OCR text. It finds person names, organizations, and locations. Pattern matching adds SSNs, phone numbers, and account numbers. Each hit gets a confidence score.
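
The pattern-matching half of this stage can be sketched with regular expressions. The patterns, entity labels, and fixed confidence scores below are illustrative assumptions, not the rule set of any real product. Production rules usually add validation and context checks, and NER for names needs a trained model that is out of scope here.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Detection:
    entity_type: str
    text: str
    start: int
    end: int
    confidence: float

# Hypothetical patterns with hand-assigned scores (assumptions for this sketch).
PATTERNS = {
    "US_SSN": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 0.85),
    "PHONE": (re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"), 0.60),
}

def detect_patterns(text):
    """Return every pattern hit in the text, ordered by position."""
    hits = []
    for entity_type, (pattern, score) in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append(
                Detection(entity_type, match.group(), match.start(), match.end(), score)
            )
    return sorted(hits, key=lambda d: d.start)

sample = "SSN 123-45-6789, phone 555-867-5309."
hits = detect_patterns(sample)
```

Each hit keeps its character offsets, which is what the anonymization stage needs to replace the right span.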

Stage 3 — Anonymization

Detected entities are replaced in the text output. The original image is not changed. Changing the image requires separate redaction tooling. The anonymized text supports erasure requests, DSAR responses, and compliance records.
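
As a rough illustration of the replacement step, the sketch below swaps each detected span for a placeholder tag. The `(start, end, entity_type)` tuple shape and the `<TYPE>` placeholder style are assumptions. Working from the last span to the first keeps earlier offsets valid while the string changes length.

```python
def anonymize(text, detections):
    """Replace each (start, end, entity_type) span with a <TYPE> placeholder.

    Assumes the spans do not overlap. Spans are processed from right to
    left so that earlier offsets stay valid as the text changes length.
    """
    result = text
    for start, end, entity_type in sorted(detections, reverse=True):
        result = result[:start] + f"<{entity_type}>" + result[end:]
    return result

original = "Contract between John Smith and Acme Ltd."
person_start = original.index("John Smith")
org_start = original.index("Acme Ltd")
detections = [
    (person_start, person_start + len("John Smith"), "PERSON"),
    (org_start, org_start + len("Acme Ltd"), "ORG"),
]
anonymized = anonymize(original, detections)
# anonymized == "Contract between <PERSON> and <ORG>."
```

The function returns a new string and leaves its input alone, which mirrors the point above: the text output changes, the source image does not.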

Modern OCR engines reach 98–99% character accuracy on clean printed pages. Handwriting or degraded scans drop to 85–92%. Entity-level accuracy tends to be higher than character-level accuracy. A name can be identified even when a few letters are wrong.
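
One way to see why a name can survive a few wrong letters is approximate string matching. The sketch below uses `difflib.SequenceMatcher` from the Python standard library. The 0.8 threshold and the function names are assumptions chosen for illustration, and this is not necessarily how any given NER engine tolerates OCR noise.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def matches_name(ocr_token, known_name, threshold=0.8):
    """True if the OCR token is close enough to the known name."""
    return similarity(ocr_token, known_name) >= threshold

# "J0hnson" has a zero where the letter "o" should be, a typical OCR slip.
noisy_match = matches_name("J0hnson", "Johnson")
different_name = matches_name("Jackson", "Johnson")
```

With this threshold the single-character OCR error still matches, while a genuinely different surname does not. A lower threshold catches more damaged names at the cost of more false matches.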

The practical upshot: OCR accuracy affects how many entities you catch. It does not determine whether the method works. Even at 90% accuracy, you find most names and numbers. Quality tiers are still needed. The method itself is sound.

Processing a Large Archive

Large legacy archives follow a four-phase workflow.

Phase 1 — Inventory: List all image-based archives. Note source system and date range. Put high-erasure-risk records first. Client-facing files come before internal ones.

Phase 2 — Batch processing: Run OCR and PII detection in batches. Five to ten thousand files per batch is a common size. Processing runs overnight. Output is a PII report and an anonymized text extract for each file.

Phase 3 — Erasure fulfillment: The subject sends a request with their name and the relevant period. Search the per-file PII reports for the subject's name and identifiers. Find the matching files. Redact them. Log the action.

Phase 4 — Ongoing compliance: Put new scanned files through the same pipeline before you archive them. Keep PII reports as Article 30 Records of Processing Activities evidence.
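
The ordering and batching in Phases 1 and 2 can be sketched in a few lines. The record shape (`id`, `client_facing`) and the function names are hypothetical. The batch size of 5,000 sits at the low end of the range mentioned above.

```python
def prioritize(files):
    """Put client-facing files first; sorted() is stable, so order within
    each group is preserved. False sorts before True."""
    return sorted(files, key=lambda f: not f["client_facing"])

def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical inventory: 80,000 records, half of them client-facing.
inventory = [{"id": n, "client_facing": n % 2 == 0} for n in range(80_000)]
queue = prioritize(inventory)
nightly_runs = list(batches(queue, 5_000))
```

An archive of 80,000 files splits into 16 batches of 5,000, with every client-facing file scheduled before any internal one.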

Case Study: Law Firm Archive

A law firm audit found 80,000 image-based PDF client contracts scanned from 1998 to 2010. Standard PII tools showed zero detections. With no text layer to read, the files were invisible to those tools.

Fifteen former clients had submitted erasure requests in the prior 12 months. The firm said: "We cannot confirm your records have been erased." That answer does not meet GDPR Article 17.

What the firm did:

  • Ran OCR and PII detection on all 80,000 files in batches of 5,000
  • Processing took about three weeks
  • Result: 80,000 anonymized text extracts with per-file reports
  • Built a searchable index linking entities to file IDs
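
A searchable index of this kind can be as simple as an inverted map from entity text to file IDs. The sketch below is an assumption about the general shape, not the firm's actual system. The report format, the file IDs, and the names are invented for illustration.

```python
from collections import defaultdict

def normalize(value):
    """Lowercase and collapse whitespace so lookups ignore formatting."""
    return " ".join(value.lower().split())

def build_index(reports):
    """Map each normalized entity string to the set of file IDs containing it.

    `reports` is a hypothetical dict: file ID -> list of detected entity strings.
    """
    index = defaultdict(set)
    for file_id, entities in reports.items():
        for entity in entities:
            index[normalize(entity)].add(file_id)
    return index

def files_for_subject(index, name):
    """Return the sorted file IDs whose reports mention the subject."""
    return sorted(index.get(normalize(name), set()))

reports = {
    "contract-0001": ["Jane Doe", "Acme Ltd"],
    "contract-0002": ["John Roe"],
    "contract-0003": ["jane  doe", "John Roe"],
}
index = build_index(reports)
matches = files_for_subject(index, "Jane Doe")
```

Exact lookup like this misses names that OCR has damaged, so a real index would likely pair it with approximate matching or a manual check.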

After processing:

  • Finding files for one subject: 4 minutes on average
  • Files per request: 6–8 on average
  • Redaction time per request: 20–30 minutes

All 15 outstanding requests were resolved within 30 days.

The key point: the compliance obligation existed before the processing. The firm just lacked the tools to meet it. OCR-based processing did not create a new duty. It made an existing duty possible to fulfill.

OCR Limits and Quality Tiers

Handwriting has lower OCR accuracy. For handwritten content, set a lower entity-confidence threshold before processing, so that borderline detections are flagged for review rather than dropped.

Poor scan quality reduces scores. Contrast enhancement and de-skewing help before OCR runs.

Unusual layouts — multi-column pages, old legal typefaces — may also score lower.

Set quality tiers for compliance work:

  • Above 95% page accuracy: run automated processing
  • 80–95%: run automated processing, then human review for flagged entities
  • Below 80%: send to manual review
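
The three tiers above map directly onto a small routing function. The function and label names are hypothetical, and the sketch assumes a per-page accuracy estimate between 0.0 and 1.0 is already available, for example from the OCR engine's own confidence scores.

```python
def route_page(accuracy):
    """Pick a processing route from an estimated page accuracy (0.0 to 1.0)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0.0 and 1.0")
    if accuracy > 0.95:
        return "automated"
    if accuracy >= 0.80:
        return "automated_with_review"
    return "manual_review"

routes = {score: route_page(score) for score in (0.99, 0.95, 0.80, 0.79)}
```

A page at exactly 95% falls into the middle tier, since the top tier is "above 95%". Writing the boundaries down in code gives an auditor one place to check how borderline pages were handled.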

A tiered approach gives regulators a clear answer about how you assessed reliability. Automated tools handle the high-confidence files. A manual queue handles the rest. Throughput stays high, and so does compliance quality.

Our FAQ covers common questions about OCR-based processing and audit trail requirements.

