The Multi-Format Problem in PII Compliance
Updated for 2026
Ask a compliance officer which formats they anonymize for DSAR responses. The list is always the same: Word contracts, PDF invoices, Excel customer data, CSV exports, and JSON logs.
Then ask which tools they use. The answer is usually three to five. Each tool has different entity coverage. Each has different settings. Each produces a different audit log.
This is format fragmentation. It creates real compliance gaps.
Why Fragmentation Happens
No single tool has handled every production format at the same quality. Specialized tools emerged for each format. One for PDFs. One for spreadsheets. A macro for CSV. Each has its own entity list. None share an audit trail.
The result is predictable. A DSAR response spans multiple file types. Multiple tools process it. Each tool uses different standards. Entity X is caught in the PDF but missed in the Excel file. DPA audits expose this inconsistency.
Format-Specific Technical Challenges
Each format creates its own detection problems.
PDF
PDFs come in two types: native text and image-based scans. Scanned PDFs need OCR first. OCR introduces errors. Native PDFs often store each word as a separate text object. This breaks entity detection across word boundaries. Multi-column layouts need reading-order reconstruction before analysis can start.
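The word-boundary problem can be illustrated with a minimal sketch. It assumes a PDF extractor has already produced one text object per word with page coordinates. The `Word` type, the `rebuild_lines` helper, the tolerance value, and the name pattern are all hypothetical and not taken from any particular library.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge on the page (assumed unit: points)
    y: float  # baseline position on the page


def rebuild_lines(words, y_tolerance=2.0):
    """Group word objects into lines by baseline, then order each line left to right."""
    lines = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if lines and abs(lines[-1][-1].y - word.y) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Re-sort inside each line because small baseline jitter can disturb the x order.
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines]


# Illustrative pattern only: two capitalized words after "Name:".
NAME_PATTERN = re.compile(r"Name:\s+([A-Z][a-z]+ [A-Z][a-z]+)")

words = [
    Word("Smith", 140.0, 100.3),
    Word("Name:", 50.0, 100.0),
    Word("John", 100.0, 100.1),
    Word("Invoice", 50.0, 80.0),
    Word("4711", 110.0, 80.0),
]
lines = rebuild_lines(words)
matches = [m.group(1) for line in lines for m in NAME_PATTERN.finditer(line)]
```

Run against the individual word objects, the pattern would never see "John" and "Smith" together. After line reconstruction it can. A real pipeline would also need column detection, which this sketch leaves out.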
Word (DOCX)
DOCX files hold text in XML, and not only in the main body. Headers, footers, comments, tracked changes, and text boxes carry text too. A letterhead address in the page header is PII. Tools that parse only the main body miss it. Tracked changes can hold deleted PII. That text is invisible in the rendered view but present in the file.
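As a rough illustration, a DOCX file is a ZIP archive whose `word/` folder holds separate XML parts for the body, headers, footers, and comments. The sketch below uses only the Python standard library to collect visible text (`w:t`) and tracked-change deletions (`w:delText`) from every part. The `extract_docx_text` helper is hypothetical, and the in-memory archive is a simplified stand-in rather than a complete, valid DOCX.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + NS + "}"


def extract_docx_text(data: bytes) -> dict:
    """Return {part name: [text runs]} for every XML part under word/."""
    found = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            if not (name.startswith("word/") and name.endswith(".xml")):
                continue
            root = ET.fromstring(archive.read(name))
            # w:t holds visible text; w:delText holds text deleted under tracked changes.
            runs = [
                el.text
                for el in root.iter()
                if el.tag in (W + "t", W + "delText") and el.text
            ]
            if runs:
                found[name] = runs
    return found


# Simplified demo archive: a body with one tracked deletion, plus a page header.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "word/document.xml",
        f'<w:document xmlns:w="{NS}"><w:body><w:p>'
        "<w:r><w:t>Contract for services</w:t></w:r>"
        "<w:del><w:r><w:delText>Jane Doe</w:delText></w:r></w:del>"
        "</w:p></w:body></w:document>",
    )
    archive.writestr(
        "word/header1.xml",
        f'<w:hdr xmlns:w="{NS}"><w:p><w:r><w:t>12 Example Street</w:t></w:r></w:p></w:hdr>',
    )

parts = extract_docx_text(buffer.getvalue())
```

Here the deleted name and the header address both surface, even though neither would appear in a body-only extraction of the accepted text.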
Excel (XLSX)
In Excel, PII can sit in any cell across hundreds of columns and thousands of rows. Column headers like "SSN" or "Email" give context that NER models cannot infer from the raw cell text. Dates and SSNs are often stored as numbers, not strings. Free-text fields like "manager notes" hold unstructured PII. Column-based tools skip those fields.
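One way to combine header context with content analysis is sketched below. It assumes the sheet has already been loaded into a list of rows with the header row first. The `HEADER_HINTS` mapping, the entity labels, and the email regex are illustrative assumptions. In practice the free-text branch would call an NER model, not a single regex.

```python
import re

# Hypothetical mapping from normalized column headers to entity types.
HEADER_HINTS = {"ssn": "SSN", "email": "EMAIL", "date of birth": "DATE_OF_BIRTH"}
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def scan_sheet(rows):
    """rows[0] is the header row. Returns (row index, column, entity type, value) tuples."""
    headers = [str(h).strip() for h in rows[0]]
    findings = []
    for r, row in enumerate(rows[1:], start=1):
        for header, cell in zip(headers, row):
            if cell is None or cell == "":
                continue
            hint = HEADER_HINTS.get(header.lower())
            if hint:
                # The header says what the column holds, even if the cell is numeric.
                findings.append((r, header, hint, str(cell)))
            elif isinstance(cell, str):
                # No header hint: fall back to content analysis of the free text.
                for match in EMAIL.finditer(cell):
                    findings.append((r, header, "EMAIL", match.group()))
    return findings


rows = [
    ["Customer ID", "SSN", "Manager notes"],
    [1001, 123456789, "Call back via j.smith@example.com next week"],
    [1002, 987654321, ""],
]
findings = scan_sheet(rows)
```

The numeric SSN cells are caught through the header, and the email address is caught inside the notes column that a purely column-based tool would skip.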
CSV
CSV lacks Excel's structure. Free-text fields in "notes" columns mix PII with other content. Encoding mismatches (UTF-8 versus Latin-1) garble non-ASCII characters in European names and addresses, and garbled names are not detected.
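A minimal sketch of one common mitigation: try UTF-8 first and fall back to Latin-1 only when decoding fails. The `read_csv_bytes` helper is hypothetical. Note that Latin-1 accepts any byte sequence, so the fallback never raises and can silently mis-decode files that are actually in another legacy encoding such as Windows-1252.

```python
import csv
import io


def read_csv_bytes(data: bytes):
    """Decode as UTF-8 (with optional BOM) first; fall back to Latin-1 on failure."""
    try:
        text, encoding = data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        text, encoding = data.decode("latin-1"), "latin-1"
    rows = list(csv.DictReader(io.StringIO(text, newline="")))
    return rows, encoding


sample = "name,notes\nJürgen Müller,Rückruf erbeten\n"

latin1_rows, latin1_encoding = read_csv_bytes(sample.encode("latin-1"))
utf8_rows, utf8_encoding = read_csv_bytes(sample.encode("utf-8"))
```

Both inputs yield the same decoded name, so the detection step downstream sees "Jürgen Müller" rather than a mangled byte sequence.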
JSON
Nested JSON buries PII deep: user.address.street.line1. Arrays need iteration. The same field name can hold different data types in different objects. Good detection needs schema awareness and content analysis together.
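To make the traversal concrete, here is a small sketch that flattens a nested document into path and value pairs. The `walk` function and the sample document are illustrative assumptions. The paths give the schema context, and the string values are what a content detector would then analyze.

```python
import json


def walk(node, path=""):
    """Yield (path, value) for every leaf in a nested JSON structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from walk(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from walk(value, f"{path}[{index}]")
    else:
        yield path, node


document = json.loads(
    '{"user": {"address": {"street": {"line1": "12 Example Street"}},'
    ' "contacts": [{"value": "j.smith@example.com"}, {"value": 42}]}}'
)
leaves = dict(walk(document))

# Only string leaves go to text analysis; the same field name ("value") holds a number elsewhere.
text_leaves = {path: value for path, value in leaves.items() if isinstance(value, str)}
```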
Inconsistency Is a Legal Risk
Here is a concrete GDPR DSAR scenario.
A data subject requests all personal data held about them. The compliance team finds these files:
- 3 Word documents (contracts, correspondence).
- 2 PDF documents (invoices, support transcripts).
- 1 Excel spreadsheet (customer account data).
- 1 CSV export (system access logs).
They use Tool A for PDFs. Tool B for Word. A macro for XLSX. Manual review for CSV. Each tool has different entity coverage.
The data subject gets the anonymized package. The Excel "manager notes" column was not processed. The Word letterhead address was missed. Both contain PII that should have been anonymized before release.
Under GDPR Article 15 (right of access) or Article 17 (right to erasure), this is an incomplete DSAR response. If the data subject or a regulator finds the gap, the inconsistent tooling is a documented contributing factor.
The Case for a Consistent Standard
Strong DSAR compliance does not just list which PII types to anonymize. It requires the same standard across every format in the response set.
That means:
- Same entity types checked in Word, PDF, Excel, CSV, and JSON.
- Same confidence thresholds applied to all files.
- Same replacement tokens used. If "John Smith" appears in three documents, one token replaces the name in all three.
- One audit trail covering all formats.
A single-platform solution makes this possible through presets. One "DSAR EU Individuals" preset checks the same 32 entity types. It runs on a PDF contract, an Excel record, and a CSV log. The same engine processes all three.
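Conceptually, a preset is one immutable configuration object that every format's detections pass through. The sketch below is an assumption about how that could look, not a description of any product's internals. The `Preset` class, the `apply_preset` helper, the three entity types, and the threshold are hypothetical stand-ins for the full entity list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    name: str
    entity_types: frozenset
    min_confidence: float


def apply_preset(preset, detections):
    """Keep only detections the preset covers, at or above its confidence threshold."""
    return [
        d
        for d in detections
        if d["type"] in preset.entity_types and d["confidence"] >= preset.min_confidence
    ]


DSAR_EU = Preset(
    name="DSAR EU Individuals",
    entity_types=frozenset({"PERSON", "EMAIL", "IBAN"}),  # hypothetical subset
    min_confidence=0.80,  # hypothetical threshold
)

# Detections as they might come out of format-specific parsers plus a shared detector.
detections_by_file = {
    "contract.pdf": [
        {"type": "PERSON", "confidence": 0.95},
        {"type": "ORG", "confidence": 0.99},
    ],
    "accounts.xlsx": [
        {"type": "EMAIL", "confidence": 0.91},
        {"type": "PERSON", "confidence": 0.55},
    ],
}
kept = {name: apply_preset(DSAR_EU, found) for name, found in detections_by_file.items()}
```

Because the same object filters every file, entity coverage and thresholds cannot drift between formats.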
For more on how presets work across batch jobs, see our guide to GDPR DSAR batch processing at scale.
Batch Processing Mixed-Format Sets
DSAR compliance at scale means processing mixed-format folders as a unit.
Input: A folder with 15 files — PDFs, DOCX, XLSX, CSV — representing all data held for one data subject.
Processing steps:
- Detect the format of each file.
- Apply the right parser. PDF text extraction. DOCX XML parsing. XLSX cell iteration. CSV field parsing.
- Run the same NLP pipeline on extracted text from all files.
- Apply the same preset to every file in the batch.
- Use a shared token pool. The same name gets the same replacement token across all 15 files.
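The dispatch and shared-token steps above can be sketched as follows. This is a simplified illustration under stated assumptions: format detection uses the file extension only (a production system would likely inspect file contents as well), the parser names are placeholder strings, and the `TokenPool` class and its token format are hypothetical.

```python
from pathlib import Path

# Placeholder parser names keyed by extension; real parsers would be callables.
PARSERS = {
    ".pdf": "pdf_text_extraction",
    ".docx": "docx_xml_parsing",
    ".xlsx": "xlsx_cell_iteration",
    ".csv": "csv_field_parsing",
}


def parser_for(filename):
    """Pick a parser by file extension, case-insensitively."""
    suffix = Path(filename).suffix.lower()
    if suffix not in PARSERS:
        raise ValueError(f"unsupported format: {suffix or filename}")
    return PARSERS[suffix]


class TokenPool:
    """Maps each distinct entity value to one stable replacement token for the whole batch."""

    def __init__(self):
        self._tokens = {}
        self._counters = {}

    def token_for(self, entity_type, value):
        # Normalize whitespace and case so trivial variants share a token.
        key = (entity_type, " ".join(value.split()).casefold())
        if key not in self._tokens:
            self._counters[entity_type] = self._counters.get(entity_type, 0) + 1
            self._tokens[key] = f"[{entity_type}_{self._counters[entity_type]}]"
        return self._tokens[key]


pool = TokenPool()
from_pdf = pool.token_for("PERSON", "John Smith")
from_xlsx = pool.token_for("PERSON", "john  smith")
other = pool.token_for("PERSON", "Jane Doe")
chosen = [parser_for(name) for name in ["Contract.PDF", "accounts.xlsx", "access.csv"]]
```

One pool instance is created per batch and passed to every file's processing, which is what keeps the replacement for a given name identical across documents.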
Output:
- Anonymized versions of all 15 files in their original formats.
- One cross-format audit report. It shows every detected entity, its source document, its confidence score, and the action taken.
That audit report is the compliance document. It proves all 15 files were processed with the same standard. For a DPA audit, this is far stronger than piecemeal tooling.
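As an illustration of what a cross-format report could contain, the sketch below writes one row per detected entity to CSV. The `AuditEntry` fields and the `write_report` helper are assumptions made for this example. It records the replacement token, not the raw value, so that the report itself does not hold the PII; that is a design choice of this sketch, not a stated feature of any tool.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class AuditEntry:
    source_document: str
    entity_type: str
    token: str
    confidence: float
    action: str


def write_report(entries):
    """Serialize audit entries from all files in a batch into one CSV document."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=[f.name for f in fields(AuditEntry)], lineterminator="\n"
    )
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))
    return out.getvalue()


entries = [
    AuditEntry("contract.pdf", "PERSON", "[PERSON_1]", 0.97, "replaced"),
    AuditEntry("accounts.xlsx", "PERSON", "[PERSON_1]", 0.91, "replaced"),
]
report = write_report(entries)
report_lines = report.splitlines()
```

The shared token in both rows shows at a glance that the same person was handled consistently in the PDF and the spreadsheet.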
Known Limits of Unified Pipelines
Format unification solves fragmentation. But it introduces its own constraints.
Conversion fidelity: Converting DOCX to a processing format and back can lose track-changes history or corrupt embedded objects. Legal documents need extra validation after processing.
Per-format maintenance: Entity recognizers for CSV differ from those for scanned forms. A "unified" pipeline still needs per-format preprocessing. That preprocessing needs updates as formats evolve.
Accuracy on uncommon formats: Most NLP models train on web text and common office documents. Legacy formats — old EDI files, custom XML schemas, CAD metadata — often produce worse accuracy than benchmarks suggest.
Non-reconstructable formats: Some PDF types and image-only files cannot be anonymized in place. They need visual redaction. Visual redaction destroys machine-readable structure. If you need post-anonymization search or indexing, this may fall short.
Practical DSAR Workflow
For compliance teams with regular DSAR volumes:
1. Collect all documents for the data subject
2. Create a DSAR batch: drag all files in, regardless of format
3. Select the "DSAR EU Individuals" preset
4. Run the batch
5. Download anonymized outputs and the consolidated audit report
6. Spot-check two or three documents from the output
7. Package the anonymized documents for the data subject response
8. Attach the audit report to the DSAR case record
Step 1 (manual collection) is still the main time cost. Steps 2 through 8 take under 10 minutes for a typical batch. The audit report from step 5 helps demonstrate compliance with the GDPR accountability principle (Article 5(2)).
anonym.legal handles DOCX, PDF, XLSX, CSV, and JSON. Every file uses the same preset. One audit report covers the batch.