By · Last updated 2026-05-29


Document Format Fragmentation in PII Tools

A single DSAR response may span Word contracts, PDF invoices, Excel customer lists, and CSV exports. Using different tools for each format creates compliance gaps.

May 29, 2026 · 7 minute read
document formats, PDF anonymization, Excel GDPR, batch processing, DSAR compliance

The Multi-Format Problem in PII Compliance

Updated for 2026

Ask a compliance officer which formats they anonymize for DSAR responses. The list is always the same: Word contracts, PDF invoices, Excel customer data, CSV exports, and JSON logs.

Then ask which tools they use. The answer is usually three to five. Each tool has different entity coverage. Each has different settings. Each produces a different audit log.

This is format fragmentation. It creates real compliance gaps.

Why Fragmentation Happens

No single tool has handled every production format at the same quality. Specialized tools emerged for each format. One for PDFs. One for spreadsheets. A macro for CSV. Each has its own entity list. None share an audit trail.

The result is predictable. A DSAR response spans multiple file types. Multiple tools process it. Each tool uses different standards. Entity X is caught in the PDF but missed in the Excel file. DPA audits expose this inconsistency.

Format-Specific Technical Challenges

Each format creates its own detection problems.

PDF

PDFs come in two types: native text and image-based scans. Scanned PDFs need OCR first. OCR introduces errors. Native PDFs often store each word as a separate text object. This breaks entity detection across word boundaries. Multi-column layouts need reading-order reconstruction before analysis can start.
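The word-boundary problem can be shown with a small sketch. It assumes a PDF extractor has already produced one (x, y, text) tuple per word object. The tuple layout, the `rebuild_lines` helper, the tolerance value, and the phone pattern are all hypothetical and do not correspond to any particular library's API.

```python
import re
from collections import defaultdict

# Hypothetical extractor output: one (x, y, text) tuple per word object.
words = [
    (72.0, 700.0, "Contact:"),
    (130.0, 700.0, "John"),
    (160.0, 700.0, "Smith"),
    (72.0, 686.0, "Phone:"),
    (120.0, 686.0, "+49"),
    (145.0, 686.0, "30"),
    (160.0, 686.0, "1234567"),
]

def rebuild_lines(words, y_tolerance=2.0):
    """Group word objects into lines by y position, then order each line by x."""
    buckets = defaultdict(list)
    for x, y, text in words:
        # Simplification: words whose y values round to the same bucket share a line.
        buckets[round(y / y_tolerance)].append((x, text))
    lines = []
    # PDF y coordinates grow upward, so larger keys are higher on the page.
    for key in sorted(buckets, reverse=True):
        lines.append(" ".join(text for _, text in sorted(buckets[key])))
    return lines

# Illustrative pattern: a country code followed by space-separated digit groups.
PHONE = re.compile(r"\+\d{1,3}(?: \d+)+")

lines = rebuild_lines(words)
phones = [m.group(0) for line in lines for m in PHONE.finditer(line)]
print(lines)
print(phones)
```

No single word object matches the pattern on its own. The phone number only becomes detectable once the line is rebuilt. Multi-column pages would need a further step to split each line by column before joining.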

Word (DOCX)

DOCX files hold text in XML, and not only in the main body. Headers, footers, comments, tracked changes, and text boxes carry text too. A letterhead address in the page header is PII. Most tools miss it. Tracked changes can hold deleted PII. That text is invisible in the rendered view but present in the file.
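A rough sketch of what reading every text-bearing part can look like, using only the Python standard library. A DOCX file is a ZIP package. Visible text sits in `w:t` elements and text deleted under track changes sits in `w:delText`. The `extract_docx_text` helper and the list of parts it scans are assumptions for illustration, and the demo package at the end is a stripped-down stand-in, not a complete DOCX file.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
# Package parts that can carry text, including parts outside the main body.
PART_PATTERN = re.compile(
    r"word/(document|header\d*|footer\d*|comments|footnotes|endnotes)\.xml"
)

def extract_docx_text(docx_bytes):
    """Return {part_name: [text runs]}, including deleted tracked-change text."""
    found = {}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        for name in package.namelist():
            if not PART_PATTERN.fullmatch(name):
                continue
            root = ET.fromstring(package.read(name))
            # w:t is visible text; w:delText is text deleted under track changes.
            runs = [
                el.text for el in root.iter()
                if el.tag in (W + "t", W + "delText") and el.text
            ]
            if runs:
                found[name] = runs
    return found

def _part(body):
    """Wrap a body fragment in a namespaced root element for the demo."""
    ns = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    return f'<w:root xmlns:w="{ns}">{body}</w:root>'

# Demo package with a body, a deleted tracked change, and a page header.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("word/document.xml", _part(
        "<w:p><w:r><w:t>Dear customer</w:t></w:r>"
        "<w:del><w:r><w:delText>Jane Doe</w:delText></w:r></w:del></w:p>"
    ))
    z.writestr("word/header1.xml", _part(
        "<w:p><w:r><w:t>12 Example Street</w:t></w:r></w:p>"
    ))

texts = extract_docx_text(buffer.getvalue())
print(texts)
```

Because the walk visits every descendant element, text inside text boxes in the body is picked up as well. Each run can then go through the same detection step as body text.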

Excel (XLSX)

Excel stores PII across any cell in hundreds of columns and thousands of rows. Column headers like "SSN" or "Email" give context that NER models miss from raw text. Dates and SSNs are often stored as numbers. Free-text fields like "manager notes" hold unstructured PII. Column-based tools skip those fields.
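One way to combine header context with content analysis is sketched below. It works on rows that have already been read into Python values, so no spreadsheet library is involved. The `HEADER_HINTS` table, the entity labels, and the email pattern are hypothetical placeholders, not a real preset.

```python
import re

# Hypothetical header hints; a real preset would carry a longer list.
HEADER_HINTS = {"ssn": "US_SSN", "email": "EMAIL"}
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def scan_sheet(header, rows):
    """Return a list of (row_index, column_name, entity_type, value) hits."""
    hits = []
    for r, row in enumerate(rows):
        for name, cell in zip(header, row):
            hint = HEADER_HINTS.get(name.strip().lower())
            if hint == "US_SSN" and isinstance(cell, (int, float)):
                # Numeric storage drops dashes and leading zeros; restore both.
                digits = f"{int(cell):09d}"
                hits.append((r, name, hint, f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"))
            elif hint and cell not in (None, ""):
                # The header alone tells us what this column holds.
                hits.append((r, name, hint, str(cell)))
            elif isinstance(cell, str):
                # No header hint: fall back to content analysis of free text.
                for m in EMAIL.finditer(cell):
                    hits.append((r, name, "EMAIL", m.group(0)))
    return hits

header = ["Name", "SSN", "Manager notes"]
rows = [["A. Example", 12345678, "Reach her at jane.doe@example.com after 5pm"]]
hits = scan_sheet(header, rows)
print(hits)
```

The SSN here was stored as the number 12345678, so it only looks like an SSN once the header hint triggers zero-padding. The email in the notes column is found by content alone, which is the case column-based tools skip.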

CSV

CSV lacks Excel's typed cells and formatting metadata. Free-text fields in "notes" columns mix PII with other content. Encoding mismatches (UTF-8 versus Latin-1) garble non-ASCII characters in European names and addresses, and a garbled name is harder to detect.
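A minimal sketch of defensive decoding, assuming the only two candidate encodings are UTF-8 and Latin-1. The `read_csv_bytes` helper is a hypothetical name. A production pipeline would likely need a wider set of encodings and some form of detection.

```python
import csv
import io

def read_csv_bytes(raw):
    """Decode CSV bytes as UTF-8, falling back to Latin-1, and report which was used."""
    try:
        # utf-8-sig also strips a leading byte order mark if one is present.
        text, encoding = raw.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises,
        # but it can silently mis-decode other legacy encodings.
        text, encoding = raw.decode("latin-1"), "latin-1"
    rows = list(csv.reader(io.StringIO(text, newline="")))
    return rows, encoding

sample = "name,city\nJürgen Müller,Zürich\n"
rows_utf8, enc_utf8 = read_csv_bytes(sample.encode("utf-8"))
rows_latin, enc_latin = read_csv_bytes(sample.encode("latin-1"))
print(enc_utf8, rows_utf8)
print(enc_latin, rows_latin)
```

Trying strict UTF-8 first matters. Decoding UTF-8 bytes as Latin-1 never fails, so the reverse order would produce mojibake without any error to catch.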

JSON

Nested JSON buries PII deep: user.address.street.line1. Arrays need iteration. The same field name can hold different data types in different objects. Good detection needs schema awareness and content analysis together.
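A sketch of a recursive walk that records a path for every leaf and combines a key-name check with content analysis. The `SENSITIVE_KEYS` set and the email pattern are illustrative assumptions. A schema-aware tool would derive the key list from the actual schema.

```python
import json
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Hypothetical key names that signal PII regardless of the value's shape.
SENSITIVE_KEYS = {"line1", "email", "phone"}

def walk(node, path=""):
    """Yield (path, value) for every scalar leaf, iterating dicts and lists."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from walk(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from walk(value, f"{path}[{i}]")
    else:
        yield path, node

def find_pii(document):
    """Return (path, reason, value) for string leaves flagged by key name or content."""
    hits = []
    for path, value in walk(document):
        if not isinstance(value, str):
            continue  # the same field name may hold a number or object elsewhere
        last_key = re.sub(r"(\[\d+\])+$", "", path).rsplit(".", 1)[-1]
        if last_key in SENSITIVE_KEYS:
            hits.append((path, "key", value))
        elif EMAIL.search(value):
            hits.append((path, "content", value))
    return hits

doc = json.loads("""
{
  "user": {
    "address": {"street": {"line1": "12 Example Street"}},
    "contact": "jane@example.com"
  },
  "events": [
    {"contact": {"email": "jane@example.com"}},
    {"contact": 42}
  ]
}
""")
hits = find_pii(doc)
for hit in hits:
    print(hit)
```

The `contact` field is a string in one object, a nested object in another, and a number in a third. The key check catches the nested email and the content check catches the bare string, which neither check would do alone.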

Here is a concrete GDPR DSAR scenario.

A data subject requests all personal data held about them. The compliance team finds these files:

  • 3 Word documents (contracts, correspondence).
  • 2 PDF documents (invoices, support transcripts).
  • 1 Excel spreadsheet (customer account data).
  • 1 CSV export (system access logs).

They use Tool A for PDFs. Tool B for Word. A macro for XLSX. Manual review for CSV. Each tool has different entity coverage.

The data subject gets the anonymized package. The Excel "manager notes" column was not processed. The Word letterhead address was missed. Both contain PII the data subject asked to have anonymized.

Under GDPR Article 15 (right of access) or Article 17 (right to erasure), this is an incomplete DSAR response. If the data subject or a regulator finds the gap, the inconsistent tooling is a documented contributing factor.

The Case for a Consistent Standard

Strong DSAR compliance does not just list which PII types to anonymize. It requires the same standard across every format in the response set.

That means:

  • Same entity types checked in Word, PDF, Excel, CSV, and JSON.
  • Same confidence thresholds applied to all files.
  • Same replacement tokens used. If "John Smith" appears in three documents, one token replaces the name in all three.
  • One audit trail covering all formats.

A single-platform solution makes this possible through presets. One "DSAR EU Individuals" preset checks the same 32 entity types. It runs on a PDF contract, an Excel record, and a CSV log. The same engine processes all three.
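As a sketch of the idea, a preset can be modeled as one immutable object that every format-specific parser hands its detections to. The `Preset` class, the three entity types, and the 0.6 threshold are illustrative assumptions only. The actual 32 entity types and thresholds are not listed in this article.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Preset:
    name: str
    entity_types: frozenset
    min_confidence: float

# Illustrative values only, not the real preset's contents.
DSAR_PRESET = Preset(
    name="DSAR EU Individuals",
    entity_types=frozenset({"PERSON", "EMAIL", "IBAN"}),
    min_confidence=0.6,
)

def apply_preset(preset, detections):
    """Keep detections the preset covers, regardless of source format."""
    return [
        d for d in detections
        if d["type"] in preset.entity_types
        and d["confidence"] >= preset.min_confidence
    ]

# Hypothetical detections from three different parsers.
detections = [
    {"file": "contract.pdf", "type": "PERSON", "confidence": 0.92},
    {"file": "accounts.xlsx", "type": "PERSON", "confidence": 0.55},
    {"file": "access.csv", "type": "IP_ADDRESS", "confidence": 0.99},
    {"file": "accounts.xlsx", "type": "IBAN", "confidence": 0.6},
]
kept = apply_preset(DSAR_PRESET, detections)
print([d["file"] for d in kept])
```

Because the preset is frozen and shared, no parser can quietly run with a different entity list or threshold.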

For more on how presets work across batch jobs, see our guide to GDPR DSAR batch processing at scale.

Batch Processing Mixed-Format Sets

DSAR compliance at scale means processing mixed-format folders as a unit.

Input: A folder with 15 files — PDFs, DOCX, XLSX, CSV — representing all data held for one data subject.
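File extensions in a collected folder can be wrong or missing, so format detection is often done on content. The sketch below checks leading bytes and, for ZIP-based Office files, the package entries. The `sniff_format` helper and its return labels are hypothetical, and the JSON and CSV branches are deliberately crude.

```python
import io
import zipfile

def sniff_format(raw):
    """Guess a file's format from its leading bytes and, for ZIP packages, its entries."""
    if raw.startswith(b"%PDF-"):
        return "pdf"
    if raw.startswith(b"PK\x03\x04"):
        # DOCX and XLSX are both ZIP packages; the entry names tell them apart.
        with zipfile.ZipFile(io.BytesIO(raw)) as package:
            names = package.namelist()
        if any(n.startswith("word/") for n in names):
            return "docx"
        if any(n.startswith("xl/") for n in names):
            return "xlsx"
        return "zip"
    if raw.lstrip()[:1] in (b"{", b"["):
        return "json"
    # Anything else is treated as delimited or plain text in this sketch.
    return "csv-or-text"

# Demo inputs: a minimal ZIP with a word/ entry stands in for a DOCX file.
docx_like = io.BytesIO()
with zipfile.ZipFile(docx_like, "w") as z:
    z.writestr("word/document.xml", "<x/>")

guesses = [
    sniff_format(b"%PDF-1.7\n"),
    sniff_format(docx_like.getvalue()),
    sniff_format(b'{"user": 1}'),
    sniff_format(b"name,city\nA,B\n"),
]
print(guesses)
```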

Processing steps:

  • Detect the format of each file.
  • Apply the right parser. PDF text extraction. DOCX XML parsing. XLSX cell iteration. CSV field parsing.
  • Run the same NLP pipeline on extracted text from all files.
  • Apply the same preset to every file in the batch.
  • Use a shared token pool. The same name gets the same replacement token across all 15 files.
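The shared token pool from the last step can be sketched as a small class. This is a minimal illustration under stated assumptions: the `TokenPool` class and the token format are hypothetical, text extraction is taken as already done, and the list of names stands in for the output of the NLP pipeline.

```python
import re
from collections import defaultdict

class TokenPool:
    """Maps each distinct entity value to one stable replacement token."""

    def __init__(self):
        self._tokens = {}
        self._counters = defaultdict(int)

    def token_for(self, entity_type, value):
        # Normalize whitespace and case so "JOHN SMITH" and "John Smith" match.
        key = (entity_type, " ".join(value.split()).casefold())
        if key not in self._tokens:
            self._counters[entity_type] += 1
            self._tokens[key] = f"<{entity_type}_{self._counters[entity_type]}>"
        return self._tokens[key]

def anonymize(text, names, pool):
    """Replace each known name in text with its pooled token."""
    for name in names:
        pattern = re.compile(re.escape(name), re.IGNORECASE)
        text = pattern.sub(lambda m: pool.token_for("PERSON", m.group(0)), text)
    return text

# Extracted text per file; in a real batch each parser would supply this.
batch = {
    "contract.docx": "Signed by John Smith.",
    "invoice.pdf": "Bill to: JOHN SMITH",
    "access.csv": "jane doe,login\njohn smith,logout",
}
detected_names = ["John Smith", "Jane Doe"]  # stand-in for NLP output

pool = TokenPool()
outputs = {name: anonymize(text, detected_names, pool) for name, text in batch.items()}
for name, text in outputs.items():
    print(name, repr(text))
```

One pool instance lives for the whole batch. Creating a new pool per file would bring back the inconsistency the shared pool is meant to remove.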

Output:

  • Anonymized versions of all 15 files in their original formats.
  • One cross-format audit report. It shows every detected entity, its source document, its confidence score, and the action taken.

That audit report is the compliance document. It proves all 15 files were processed with the same standard. For a DPA audit, this is far stronger than piecemeal tooling.
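A cross-format audit report can be as simple as one table with a row per detected entity. The sketch below is an assumption about shape, not the platform's actual report format. The `AuditEntry` fields follow the description above, and the `token` field is an added assumption that records the replacement token instead of the raw value so the report itself holds no PII.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields

@dataclass
class AuditEntry:
    source_document: str
    entity_type: str
    confidence: float
    action: str
    token: str  # assumption: store the replacement token, never the raw value

def write_audit_report(entries):
    """Serialize audit entries as CSV text, one row per detected entity."""
    out = io.StringIO(newline="")
    writer = csv.DictWriter(out, fieldnames=[f.name for f in fields(AuditEntry)])
    writer.writeheader()
    # Sort so the report is stable regardless of processing order.
    for entry in sorted(entries, key=lambda e: (e.source_document, e.entity_type)):
        writer.writerow(asdict(entry))
    return out.getvalue()

entries = [
    AuditEntry("invoice.pdf", "PERSON", 0.92, "replace", "<PERSON_1>"),
    AuditEntry("accounts.xlsx", "EMAIL", 0.88, "replace", "<EMAIL_1>"),
]
report = write_audit_report(entries)
print(report)
```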

Related: real-time PII prevention for AI data leaks.

Known Limits of Unified Pipelines

Format unification solves fragmentation. But it introduces its own constraints.

Conversion fidelity: Converting DOCX to a processing format and back can lose track-changes history or corrupt embedded objects. Legal documents need extra validation after processing.

Per-format maintenance: Entity recognizers for CSV differ from those for scanned forms. A "unified" pipeline still needs per-format preprocessing. That preprocessing needs updates as formats evolve.

Accuracy on uncommon formats: Most NLP models train on web text and common office documents. Legacy formats — old EDI files, custom XML schemas, CAD metadata — often produce worse accuracy than benchmarks suggest.

Non-reconstructable formats: Some PDF types and image-only files cannot be anonymized in place. They need visual redaction. Visual redaction destroys machine-readable structure. If you need post-anonymization search or indexing, this may fall short.

Practical DSAR Workflow

For compliance teams with regular DSAR volumes:

  1. Collect all documents for the data subject
  2. Create a DSAR batch — drag all files in, regardless of format
  3. Select the "DSAR EU Individuals" preset
  4. Run the batch
  5. Download anonymized outputs and the consolidated audit report
  6. Spot-check two or three documents from the output
  7. Package the anonymized documents for the data subject response
  8. Attach the audit report to the DSAR case record

Step 1 (manual collection) is still the main time cost. Steps 2 through 8 take under 10 minutes for a typical batch. The audit report from step 5 helps demonstrate compliance under the GDPR accountability principle (Article 5(2)).


anonym.legal handles DOCX, PDF, XLSX, CSV, and JSON. Every file uses the same preset. One audit report covers the batch.


