The Multi-Format Problem in PII Compliance

Updated for 2026

Ask a compliance officer which formats they anonymize for DSAR responses. The list is always the same: Word contracts, PDF invoices, Excel customer data, CSV exports, and JSON logs.

Then ask which tools they use. The answer is usually three to five. Each tool has different entity coverage. Each has different settings. Each produces a different audit log.

This is format fragmentation. It creates real compliance gaps.

Why Fragmentation Happens

No single tool has handled every production format at the same quality. Specialized tools emerged for each format. One for PDFs. One for spreadsheets. A macro for CSV. Each has its own entity list. None share an audit trail.

The result is predictable. A DSAR response spans multiple file types. Multiple tools process it. Each tool uses different standards. Entity X is caught in the PDF but missed in the Excel file. DPA audits expose this inconsistency.

Format-Specific Technical Challenges

Each format creates its own detection problems.

PDF

PDFs come in two types: native text and image-based scans. Scanned PDFs need OCR first. OCR introduces errors. Native PDFs often store each word as a separate text object. This breaks entity detection across word boundaries. Multi-column layouts need reading-order reconstruction before analysis can start.
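
As a minimal sketch of working around per-word text objects, the words can be joined into one string while each word's character span is recorded, so a match found in the joined text maps back to the objects that need redacting. The `PHONE` pattern and the sample words are illustrative assumptions, not output from any particular PDF library.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]

def join_words(words: List[str]) -> Tuple[str, List[Span]]:
    """Join word fragments with single spaces and record each word's span."""
    spans: List[Span] = []
    pos = 0
    for word in words:
        spans.append((pos, pos + len(word)))
        pos += len(word) + 1  # +1 accounts for the joining space
    return " ".join(words), spans

def words_hit(match_span: Span, spans: List[Span]) -> List[int]:
    """Return the indexes of the words that overlap a match span."""
    start, end = match_span
    return [i for i, (s, e) in enumerate(spans) if s < end and e > start]

# Hypothetical pattern for a spaced phone number, used only for illustration.
PHONE = re.compile(r"\+?\d{2,3} \d{3} \d{4}")

words = ["Call", "+49", "555", "0199", "today"]  # one entry per PDF text object
text, spans = join_words(words)
match = PHONE.search(text)
hit = words_hit(match.span(), spans) if match else []
```

Here the phone number is spread over three text objects; scanning each word alone would not match the pattern, while the joined text does and `hit` names the three objects to redact.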

Word (DOCX)

DOCX files store body text in XML, but text also sits in headers, footers, comments, tracked changes, and text boxes. A letterhead address in the page header is PII. Tools that parse only the document body miss it. Tracked changes can hold deleted PII: that text is invisible in the rendered view but still present in the file.
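
The sketch below shows one way to reach that hidden text, assuming the standard DOCX layout (a ZIP archive with XML parts under `word/`). It collects visible runs (`w:t`) and runs deleted under tracked changes (`w:delText`) from every part, so headers, footers, and comments are covered along with the body. The function name `docx_text_by_part` is hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import Dict, List

# WordprocessingML namespace used by w:t and w:delText elements.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_text_by_part(data: bytes) -> Dict[str, List[str]]:
    """Map each XML part under word/ to the text runs it contains.

    Includes w:delText, which holds text deleted under tracked changes.
    """
    out: Dict[str, List[str]] = {}
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            if not (name.startswith("word/") and name.endswith(".xml")):
                continue
            root = ET.fromstring(zf.read(name))
            texts = [
                el.text
                for el in root.iter()
                if el.tag in (W + "t", W + "delText") and el.text
            ]
            if texts:
                out[name] = texts
    return out
```

Called with the bytes of a DOCX file, this returns entries keyed by part name, such as `word/document.xml` and `word/header1.xml`, which makes it easy to confirm that header and deleted text actually reached the detector.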

Excel (XLSX)

Excel stores PII across any cell in hundreds of columns and thousands of rows. Column headers like "SSN" or "Email" give context that NER models miss from raw text. Dates and SSNs are often stored as numbers: a date becomes a serial number and an SSN loses its separators and leading zeros, so pattern-based detection on the raw cell value can fail. Free-text fields like "manager notes" hold unstructured PII. Column-based tools skip those fields.
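
A minimal sketch of combining both signals follows, assuming the sheet has already been read into a header row and a list of value rows. `HEADER_HINTS`, `EMAIL_RE`, and `scan_rows` are hypothetical names for illustration; a real preset would cover far more entity types.

```python
import re
from typing import Any, List, Sequence, Tuple

# Hypothetical header-to-entity hints; a real preset would carry many more.
HEADER_HINTS = {"ssn": "SSN", "email": "EMAIL"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def cell_text(value: Any) -> str:
    """Render a cell as text; whole-number floats drop the trailing '.0'."""
    if value is None:
        return ""
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)

def scan_rows(
    header: Sequence[str], rows: Sequence[Sequence[Any]]
) -> List[Tuple[int, int, str, str]]:
    """Return (row, column, entity_type, reason) for each flagged cell."""
    findings = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            text = cell_text(value)
            if not text:
                continue
            hint = HEADER_HINTS.get(header[c].strip().lower())
            if hint:
                # The header says what the cell holds, whatever its stored type.
                findings.append((r, c, hint, "header"))
            elif EMAIL_RE.search(text):
                # No header hint: fall back to content analysis of free text.
                findings.append((r, c, "EMAIL", "content"))
    return findings

found = scan_rows(
    ["SSN", "Manager notes"],
    [[123456789.0, "Reach her at a.b@example.com"], [None, "No issues"]],
)
```

The numeric SSN is caught through its header even though the value carries no SSN formatting, and the email in the notes column is caught through content analysis.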

CSV

CSV lacks Excel's structure: there are no cell types or formatting, and every field is plain text. Free-text fields in "notes" columns mix PII with other content. Encoding mismatches, such as a Latin-1 file read as UTF-8, cause failures for non-ASCII characters in European names and addresses.
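
One hedged way to handle the encoding problem is to try strict UTF-8 first and fall back to Latin-1 only when decoding fails. This is a sketch, not real encoding detection: Latin-1 decoding never raises, so the fallback is a guess that should be recorded in the audit trail. The function name `read_csv_bytes` is hypothetical.

```python
import csv
import io
from typing import List, Tuple

def read_csv_bytes(
    data: bytes, encodings: Tuple[str, ...] = ("utf-8-sig", "latin-1")
) -> Tuple[str, List[List[str]]]:
    """Decode CSV bytes with the first encoding that works and parse the rows.

    'utf-8-sig' also strips a leading byte-order mark if one is present.
    """
    for enc in encodings:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue
        rows = list(csv.reader(io.StringIO(text, newline="")))
        return enc, rows
    raise ValueError("none of the candidate encodings could decode the data")

enc_a, rows_a = read_csv_bytes("name,city\nMüller,Köln\n".encode("utf-8"))
enc_b, rows_b = read_csv_bytes("name,city\nMüller,Köln\n".encode("latin-1"))
```

Both calls return the same rows; only the reported encoding differs, which is the detail worth logging per file.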

JSON

Nested JSON buries PII deep: user.address.street.line1. Arrays need iteration. The same field name can hold different data types in different objects. Good detection needs schema awareness and content analysis together.
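
A short sketch of the traversal part: a recursive walk that yields every leaf value together with its path, so a detector can use both the field name (schema hint) and the value (content). The `walk` helper and the sample document are illustrative assumptions.

```python
import json
from typing import Any, Iterator, Tuple

def walk(node: Any, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) for every leaf in a parsed JSON document."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from walk(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from walk(value, f"{path}[{index}]")
    else:
        yield path, node

doc = json.loads(
    '{"user": {"address": {"street": {"line1": "1 Main St"}},'
    ' "phones": ["555 0199", 42]}}'
)
leaves = list(walk(doc))
```

Each leaf arrives as a pair such as `("user.address.street.line1", "1 Main St")`. The `phones` array shows the mixed-type case: the same path prefix holds a string and a number, so the detector has to check the value and not only the field name.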

Inconsistency Is a Legal Risk

Here is a concrete GDPR DSAR scenario.

A data subject requests all personal data held about them. The compliance team finds these files:

3 Word documents (contracts, correspondence).
2 PDF documents (invoices, support transcripts).
1 Excel spreadsheet (customer account data).
1 CSV export (system access logs).

They use Tool A for PDFs. Tool B for Word. A macro for XLSX. Manual review for CSV. Each tool has different entity coverage.

The data subject gets the anonymized package. The Excel "manager notes" column was not processed. The Word letterhead address was missed. Both contain PII the data subject asked to have anonymized.

Under GDPR Article 15 (right of access) or Article 17 (right to erasure), this is an incomplete DSAR response. If the data subject or a regulator finds the gap, the inconsistent tooling is a documented contributing factor.

The Case for a Consistent Standard

Strong DSAR compliance does not just list which PII types to anonymize. It requires the same standard across every format in the response set.

That means:

Same entity types checked in Word, PDF, Excel, CSV, and JSON.
Same confidence thresholds applied to all files.
Same replacement tokens used. If "John Smith" appears in three documents, one token replaces the name in all three.
One audit trail covering all formats.
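
The shared-token requirement can be sketched as a small pool that hands out one token per distinct entity value and reuses it everywhere. The `TokenPool` class, the token format, and the normalization rule (case-insensitive, whitespace-collapsed) are assumptions for illustration, not a description of any specific product.

```python
from collections import defaultdict
from typing import Dict, Tuple

class TokenPool:
    """Hand out one stable replacement token per (entity type, value)."""

    def __init__(self) -> None:
        self._tokens: Dict[Tuple[str, str], str] = {}
        self._counters: Dict[str, int] = defaultdict(int)

    def token_for(self, entity_type: str, value: str) -> str:
        # Normalize so "John  Smith" and "john smith" share one token.
        key = (entity_type, " ".join(value.split()).casefold())
        if key not in self._tokens:
            self._counters[entity_type] += 1
            self._tokens[key] = f"<{entity_type}_{self._counters[entity_type]}>"
        return self._tokens[key]

pool = TokenPool()
from_contract = pool.token_for("PERSON", "John Smith")
from_invoice = pool.token_for("PERSON", "john  smith")
other_person = pool.token_for("PERSON", "Jane Roe")
```

Because one pool instance serves the whole batch, the name resolves to the same token whether it was found in a contract, an invoice, or a spreadsheet.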

A single-platform solution makes this possible through presets. One "DSAR EU Individuals" preset checks the same 32 entity types. It runs on a PDF contract, an Excel record, and a CSV log. The same engine processes all three.

For more on how presets work across batch jobs, see our guide to GDPR DSAR batch processing at scale.

Batch Processing Mixed-Format Sets

DSAR compliance at scale means processing mixed-format folders as a unit.

Input: A folder with 15 files — PDFs, DOCX, XLSX, CSV — representing all data held for one data subject.

Processing steps:

Detect the format of each file.
Apply the right parser. PDF text extraction. DOCX XML parsing. XLSX cell iteration. CSV field parsing.
Run the same NLP pipeline on extracted text from all files.
Apply the same preset to every file in the batch.
Use a shared token pool. The same name gets the same replacement token across all 15 files.
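
The first three steps can be sketched as a parser registry keyed by file extension, with one detector applied to whatever text each parser extracts. Everything here is a stand-in: `parse_plain` splits lines where a real pipeline would use a format-specific parser, and `detect_emails` represents the shared NLP pipeline.

```python
import re
from pathlib import PurePath
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, str]                 # (entity_type, matched value)
Parser = Callable[[bytes], List[str]]  # raw bytes -> extracted text chunks

def parse_plain(data: bytes) -> List[str]:
    """Stand-in parser: one chunk per line. Real parsers differ per format."""
    return data.decode("utf-8").splitlines()

# Hypothetical registry; PDF, DOCX, and XLSX parsers would be added here.
PARSERS: Dict[str, Parser] = {".csv": parse_plain, ".json": parse_plain}

def detect_emails(text: str) -> List[Hit]:
    """Stand-in for the shared detection pipeline."""
    return [("EMAIL", m) for m in re.findall(r"[\w.+-]+@[\w-]+\.\w+", text)]

def run_batch(
    files: Dict[str, bytes], detect: Callable[[str], List[Hit]]
) -> Tuple[Dict[str, List[Hit]], List[str]]:
    """Apply one detector to every file; report files with no parser."""
    results: Dict[str, List[Hit]] = {}
    skipped: List[str] = []
    for name, data in files.items():
        parser = PARSERS.get(PurePath(name).suffix.lower())
        if parser is None:
            skipped.append(name)  # surface unsupported files, never drop silently
            continue
        results[name] = [hit for chunk in parser(data) for hit in detect(chunk)]
    return results, skipped

results, skipped = run_batch(
    {
        "access.CSV": b"id,email\n1,x@example.com",
        "profile.json": b'{"mail": "x@example.com"}',
        "drawing.xyz": b"binary",
    },
    detect_emails,
)
```

Returning the skipped files explicitly is a deliberate choice in this sketch: an unsupported format in a DSAR folder is a gap that a reviewer needs to see.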

Output:

Anonymized versions of all 15 files in their original formats.
One cross-format audit report. It shows every detected entity, its source document, its confidence score, and the action taken.

That audit report is the compliance document. It proves all 15 files were processed with the same standard. For a DPA audit, this is far stronger than piecemeal tooling.
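
As an illustration of what one row of such a report might hold, the sketch below writes audit records to CSV. The field names are assumptions. Storing the replacement token in place of the raw value is a design choice made here so that the report does not itself become a copy of the PII.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Iterable

@dataclass
class AuditRow:
    source_file: str
    entity_type: str
    token: str         # replacement token, not the original value
    confidence: float
    action: str

def write_audit(rows: Iterable[AuditRow]) -> str:
    """Serialize audit rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=[f.name for f in fields(AuditRow)],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buf.getvalue()

report = write_audit(
    [
        AuditRow("contract.docx", "PERSON", "<PERSON_1>", 0.93, "replaced"),
        AuditRow("accounts.xlsx", "PERSON", "<PERSON_1>", 0.88, "replaced"),
    ]
)
```

The same token appearing against two different source files is what lets a reviewer confirm cross-format consistency from the report alone.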

Known Limits of Unified Pipelines

Format unification solves fragmentation. But it introduces its own constraints.

Conversion fidelity: Converting DOCX to a processing format and back can lose track-changes history or corrupt embedded objects. Legal documents need extra validation after processing.

Per-format maintenance: Entity recognizers for CSV differ from those for scanned forms. A "unified" pipeline still needs per-format preprocessing. That preprocessing needs updates as formats evolve.

Accuracy on uncommon formats: Most NLP models train on web text and common office documents. Legacy formats — old EDI files, custom XML schemas, CAD metadata — often produce worse accuracy than benchmarks suggest.

Non-reconstructable formats: Some PDF types and image-only files cannot be anonymized in place. They need visual redaction. Visual redaction destroys machine-readable structure. If you need post-anonymization search or indexing, this may fall short.

Practical DSAR Workflow

For compliance teams with regular DSAR volumes:

1. Collect all documents for the data subject
2. Create a DSAR batch: drag all files in, regardless of format
3. Select the "DSAR EU Individuals" preset
4. Run the batch
5. Download anonymized outputs and the consolidated audit report
6. Spot-check two or three documents from the output
7. Package the anonymized documents for the data subject response
8. Attach the audit report to the DSAR case record

Step 1 (manual collection) is still the main time cost. Steps 2 through 8 take under 10 minutes for a typical batch. The audit report from step 5 supports the GDPR accountability principle (Article 5(2)) by documenting how each file was processed.

anonym.legal handles DOCX, PDF, XLSX, CSV, and JSON. Every file uses the same preset. One audit report covers the batch.

When This Approach Has Limits

Applying one consistent standard across every format in a DSAR set is the right answer to fragmentation — the same entity types, thresholds, tokens, and audit trail close the gaps that mismatched tools create. But three limits apply.

One standard does not equal uniform accuracy. Running the same preset everywhere makes coverage consistent on paper, but real recall still varies by format. The article's own breakdown shows why: scanned PDFs depend on OCR quality, DOCX hides text in headers and tracked changes, Excel free-text columns resist column-based logic, and CSV encoding issues break on non-ASCII names. A shared 32-entity preset is only as good as the weakest per-format extractor behind it. A residual false-negative rate persists, so spot-check output from each format rather than assuming the preset performed equally well across all of them.

Consistent tokens can still leave a dataset re-identifiable. Replacing direct identifiers uniformly across files is necessary, but quasi-identifiers — a date, a job title, a location spread across a contract, an invoice, and a spreadsheet — can re-identify someone in combination even after names and IDs are gone. Consistent tokenization makes the result pseudonymized, not necessarily anonymized, with the legal consequences that distinction carries under Articles 15 and 17. Decide deliberately whether the response needs true anonymization, and assess linkage risk across the combined set, not file by file.
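
One simple way to probe linkage risk, offered here as an illustrative technique and not something the workflow above prescribes, is to count how many records share each combination of quasi-identifiers across the combined set. Combinations that occur fewer than `k` times point to individuals who may stand out. The field names and the threshold are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Sequence, Tuple

def rare_combinations(
    records: Iterable[Mapping[str, str]],
    quasi_ids: Sequence[str],
    k: int = 2,
) -> Dict[Tuple, int]:
    """Return quasi-identifier combinations shared by fewer than k records."""
    combos = Counter(tuple(rec.get(q) for q in quasi_ids) for rec in records)
    return {combo: n for combo, n in combos.items() if n < k}

# Records merged from several anonymized files; names are already tokenized.
records = [
    {"name": "<PERSON_1>", "title": "CFO", "city": "Bonn"},
    {"name": "<PERSON_2>", "title": "Clerk", "city": "Bonn"},
    {"name": "<PERSON_3>", "title": "Clerk", "city": "Bonn"},
]
risky = rare_combinations(records, ["title", "city"])
```

The single CFO in Bonn is flagged even though the name is tokenized, which is the pseudonymized-but-not-anonymized case described above.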

Batch scale magnifies a systematic miss. Processing a mixed-format folder as a unit is efficient, but a single blind spot — an uncovered DOCX header field, an unparsed Excel notes column — then repeats across every file and every subject in the run. The consolidated audit report documents what the pipeline did; it cannot flag what the pipeline never detected, so an incomplete DSAR can look complete on the report. A human still owns the legal sufficiency of the response. Validate the pipeline against held-out samples before scaling, and treat the audit log as evidence of process, not proof of completeness.
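
Validation against held-out samples can be sketched as a per-format recall calculation over hand-labeled files. The input shape (format, expected entities, detected entities) is an assumption made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Sample = Tuple[str, Set[str], Set[str]]  # (format, expected, detected)

def recall_by_format(samples: Iterable[Sample]) -> Dict[str, float]:
    """Share of hand-labeled entities the pipeline found, per file format."""
    found: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for fmt, expected, detected in samples:
        total[fmt] += len(expected)
        found[fmt] += len(expected & detected)
    # Formats with no labeled entities are left out to avoid dividing by zero.
    return {fmt: found[fmt] / total[fmt] for fmt in total if total[fmt]}

scores = recall_by_format(
    [
        ("pdf", {"a", "b"}, {"a"}),
        ("pdf", {"c", "d"}, {"c", "d"}),
        ("xlsx", {"e"}, set()),
    ]
)
```

A result like this one, with recall of 0.75 on PDFs and 0.0 on spreadsheets, is the kind of per-format gap that a consolidated audit report would not reveal on its own.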
