Updated for 2026

One Fix, Two New Risks

Many firms now block AI leaks by stripping out names and IDs before text reaches an AI provider. One-way hashing, hard redaction, or full removal all seem safe. The AI gets clean text. Sensitive details stay in-house.

The logic holds on the security side. Cyberhaven's Q4 2025 study found that 34.8% of content sent to ChatGPT holds sensitive data. Ponemon's 2024 report put the average AI breach cost at $2.1 million. The risk is real and the cost is high.

But full removal trades one risk for another: spoliation of evidence.

For firms subject to lawsuits or audits, destroying the ability to restore raw records can count as spoliation under federal and state rules.

The AI Sharing Scale

Research from eSecurity Planet and Cyberhaven found that 77% of staff share sensitive data with AI tools each week. This spans legal, healthcare, finance, and tech.

Shared content often includes:

Client letters and case notes
Draft contracts and deal terms
Internal plans and business records
Financial models and projections
Legal memos and case notes
Patient records and clinical notes
HR files and staff messages

When full removal is the AI control, every document that passes through it may lose its legal value. If those documents surface in a lawsuit — very likely over any multi-year period for firms in regulated fields — the firm has potentially lost evidence.

See our legal alignment overview for how anonym.legal meets discovery duties. You can also review the token system guide to see how the masking pipeline works in practice.

GDPR: Reversibility Is Required

GDPR Article 4(5) defines pseudonymization as processing personal data in a way that means it "can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately" and is protected by technical and organisational measures.

The key point: the definition presupposes that the additional information enabling re-linking still exists and is kept separately. Records that can be re-linked via stored keys count as pseudonymized under GDPR.

Records that cannot be re-linked at all are not pseudonymized. They are anonymized. The gap matters:

Token-masked records keep some GDPR duties but can be restored for legal use.
Fully wiped records may fall outside GDPR scope but cannot be restored at all.

The European Data Protection Board's Guidelines 01/2025 on Pseudonymisation confirm that reversibility is a core part of the definition. Firms using one-way removal are not doing GDPR pseudonymization. They are cutting the ability to recover records.

Learn more at our conformance hub and protection overview.

Federal Rules: The Spoliation Test

Under the Federal Rules of Civil Procedure, parties must preserve records that may be relevant to expected legal action. This duty starts when a lawsuit is reasonably foreseeable — not when it is filed.

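Rule 37(e) lets courts impose measures when a party fails to take reasonable steps to preserve electronically stored information (ESI) and that information cannot be restored or replaced through other discovery. The harshest sanctions require a finding that the party acted with intent to deprive the other side of the information. Penalties can include: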

Adverse inference instructions
Evidence preclusion
Case-ending sanctions in serious cases

Here is how this plays out. A firm uses AI workflows that fully remove sensitive content in the normal course of business. Those records later become relevant to a lawsuit. The firm has altered them so the raw text cannot be restored. If that occurred after the duty to preserve attached, spoliation exposure follows.

This is not a fringe case. Firms in regulated fields with recurring legal exposure face a near-constant prospect of foreseeable litigation across broad document types. Deploying full removal across all workflows, without carve-outs for at-risk records, creates large spoliation risk.

Reversible vs. Irreversible: Key Difference

The difference between reversible and one-way masking is in the design.

One-Way: no way back

SHA-256 hashing of a name produces a fixed hash. The name cannot be derived from it. Hard redaction removes text so the raw content is gone.
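
To make the one-way property concrete, here is a minimal Python sketch using the standard library's `hashlib`. The function name `hash_name` and the sample names are illustrative assumptions, not part of any product described here.

```python
import hashlib

def hash_name(name: str) -> str:
    """Return the SHA-256 hex digest of a name (hypothetical helper)."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()

digest = hash_name("Jane Doe")

# The digest is deterministic and always 64 hex characters long.
# There is no inverse function: without a stored mapping, the only way
# back to "Jane Doe" is to guess candidate inputs and hash each one.
same_again = hash_name("Jane Doe")
different = hash_name("Jane Roe")
```

Because nothing is stored alongside the digest, a firm that keeps only the hashed copy has no record from which to restore the original text.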

Reversible: recovery is possible

Token substitution with key retention and AES-256-GCM encryption both transform records in ways that can be undone. A name replaced with a token can be restored via a lookup table. AES-256-GCM content can be decrypted with the right key. The raw text stays reachable.
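
The lookup-table approach can be sketched in a few lines of Python. This is an illustration under stated assumptions: `tokenize`, `restore`, the token format, and the hand-supplied list of sensitive terms are all hypothetical, and AES-256-GCM encryption is not shown because it would need a third-party cryptography library.

```python
import secrets

def tokenize(text: str, sensitive_terms: list[str], token_map: dict[str, str]) -> str:
    """Swap each sensitive term for a random token and record the mapping."""
    for term in sensitive_terms:
        if term in text:
            token = f"[TOKEN_{secrets.token_hex(4).upper()}]"
            token_map[token] = term          # the map is what makes this reversible
            text = text.replace(term, token)
    return text

def restore(text: str, token_map: dict[str, str]) -> str:
    """Undo tokenize() by replacing every token with its original value."""
    for token, original in token_map.items():
        text = text.replace(token, original)
    return text

token_map: dict[str, str] = {}
raw = "Jane Doe signed the agreement with Acme Corp."
masked = tokenize(raw, ["Jane Doe", "Acme Corp"], token_map)
recovered = restore(masked, token_map)
```

The masked string is what would leave the firm. The `token_map` stays behind, and it is the only thing that turns the masked copy back into the raw text.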

For AI protection, one-way and reversible methods work the same way. The AI processes masked text and never sees the real records.

For legal duty, only reversible token masking works. One-way methods cut off recovery and create the spoliation risk noted above.

Read how our token system handles this end to end. For deeper context, see the glossary and FAQ.

The Dual-Compliant Design

A design that meets both AI security and legal disclosure duties uses reversible AES-256-GCM token masking:

Records are processed before they reach any AI tool.
Sensitive items — names, IDs, PHI, privileged content — are swapped for structured tokens.
The token map is kept in a separate store with access controls that match the data type.
AI processing runs on the token copy. The AI never sees the real records.
Results are restored using the token map for normal business use.
The token map is placed under legal hold when discovery duties attach.
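
The steps above can be sketched end to end in Python. Everything named here is an assumption made for illustration: `TokenStore` stands in for the separate access-controlled store, the detector is a toy regex for US-style Social Security numbers, and `fake_ai` is a placeholder for the external AI call.

```python
import re
from dataclasses import dataclass, field

@dataclass
class TokenStore:
    """Hypothetical stand-in for the separate token map store."""
    mapping: dict[str, str] = field(default_factory=dict)
    legal_hold: bool = False

    def token_for(self, value: str) -> str:
        # Reuse an existing token so one value always maps to one token.
        for token, original in self.mapping.items():
            if original == value:
                return token
        token = f"<ENT_{len(self.mapping) + 1:04d}>"
        self.mapping[token] = value
        return token

    def purge(self) -> None:
        # Deletion is refused while a legal hold is in force.
        if self.legal_hold:
            raise PermissionError("token map is under legal hold")
        self.mapping.clear()

# Assumed detector: matches only SSNs written as 123-45-6789.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask(text: str, store: TokenStore) -> str:
    return SSN.sub(lambda m: store.token_for(m.group()), text)

def unmask(text: str, store: TokenStore) -> str:
    for token, original in store.mapping.items():
        text = text.replace(token, original)
    return text

def fake_ai(prompt: str) -> str:
    """Placeholder for the AI provider; it only ever receives masked text."""
    return f"Summary: {prompt}"

store = TokenStore()
raw = "Employee 123-45-6789 filed a claim."
masked = mask(raw, store)                  # mask before any AI tool
result = unmask(fake_ai(masked), store)    # restore for business use
store.legal_hold = True                    # discovery duties attach
```

In a real deployment the store would be a separate system with its own access controls. The in-memory dataclass only shows where the hold check belongs.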

Under this design, no raw content is ever lost. The AI provider never sees it in usable form. The token map keeps recovery possible when the law requires it. As long as the token map itself is preserved, no records are destroyed; they are only masked in a way that can be undone.

GDPR Article 4(5) is met: the extra key (token map) is kept apart with the right technical and process safeguards. The Federal Rules preservation duty is met: raw records can be restored when a legal hold applies.

Explore our entity detection approach, protection overview, and plans and rates for full details.

The Binary Choice

Firms face a clear fork:

Permanently remove data — solve the AI leak problem but create legal risk.
Use reversible token masking — meet both protection and conformance needs at once.

The $2.1 million average AI breach cost drives the security decision. But spoliation sanctions are not cheap either. In cases with large monetary stakes, costs can reach the same order of magnitude. Both risks deserve a place in the decision.

A sound AI policy covers both ends. It blocks sensitive records from leaving the firm in usable form. And it keeps those same records reachable when a court or regulator asks for them. Reversible token masking is the only method that does both at once.

For more background, see our founder statement and case studies.

When This Approach Has Limits

Reversible token masking resolves the spoliation-versus-leak dilemma, but it is not a closed solution. Three caveats deserve their place in the decision.

The token map becomes the spoliation surface. Once you commit to reversibility to satisfy preservation duties, the lookup table is the record that must survive. If it is deleted, corrupted, or rotated without retaining the prior mapping while a legal hold is in force, you have destroyed the ability to restore the original — which is the same spoliation outcome you were trying to avoid, just relocated. The compliance burden shifts from "do not delete the documents" to "do not lose the key," and that obligation needs its own retention policy, backups, and hold discipline.
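
One way to picture the "do not lose the key" obligation is a versioned map that keeps prior mappings across rotation and refuses to delete any version under hold. The Python sketch below is an assumption-laden illustration: the class `VersionedTokenMaps` and its methods are hypothetical, and real retention would also need backups and audit logging.

```python
from dataclasses import dataclass, field

@dataclass
class VersionedTokenMaps:
    """Hypothetical store that retains earlier token maps after rotation."""
    versions: dict[int, dict[str, str]] = field(default_factory=lambda: {1: {}})
    current: int = 1
    held: set[int] = field(default_factory=set)

    def rotate(self) -> int:
        # Start a fresh mapping; earlier versions stay available for restoration.
        self.current += 1
        self.versions[self.current] = {}
        return self.current

    def place_hold(self, version: int) -> None:
        self.held.add(version)

    def delete_version(self, version: int) -> None:
        if version in self.held:
            raise PermissionError(f"version {version} is under legal hold")
        if version == self.current:
            raise ValueError("cannot delete the active version")
        del self.versions[version]

    def restore(self, text: str, version: int) -> str:
        for token, original in self.versions[version].items():
            text = text.replace(token, original)
        return text

maps = VersionedTokenMaps()
maps.versions[1]["<T1>"] = "Jane Doe"
maps.place_hold(1)      # a legal hold attaches to documents masked with version 1
maps.rotate()           # routine rotation must not discard version 1

try:
    maps.delete_version(1)
    deleted = True
except PermissionError:
    deleted = False

restored = maps.restore("<T1> approved the draft.", version=1)
```

The design choice is that rotation only ever adds a version. Removal is a separate, guarded operation, so a routine key change cannot silently destroy a mapping that a hold depends on.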

Detection gaps mean some sensitive data is never tokenized at all. This design only protects what the upstream detector identifies as sensitive. A privileged passage written in unusual phrasing, an identifier in a format the system does not recognize, or PHI buried in a scanned image with poor OCR can pass through untokenized and reach the AI provider in usable form. The dual-compliant guarantee holds for detected entities; it does not certify that every sensitive item was detected.
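
A small Python example shows how a format the detector does not recognize slips through. The patterns below are assumptions chosen for illustration: a strict regex plays the role of the upstream detector, and a looser regex plays the role of an after-the-fact audit check.

```python
import re

# Assumed detector: matches only SSNs written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Assumed audit pattern: also accepts spaces, dots, or no separator.
LOOSE_PATTERN = re.compile(r"\b\d{3}[\s.-]?\d{2}[\s.-]?\d{4}\b")

def mask_ssns(text: str) -> str:
    """Replace every SSN the strict detector recognizes."""
    return SSN_PATTERN.sub("<SSN>", text)

standard = mask_ssns("SSN: 123-45-6789")
unusual = mask_ssns("SSN: 123 45 6789")   # spaces instead of hyphens

# Auditing the masked output with the looser pattern reveals what leaked.
leaked_standard = LOOSE_PATTERN.findall(standard)
leaked_unusual = LOOSE_PATTERN.findall(unusual)
```

The second input leaves the masking step unchanged, so it would reach the AI provider in usable form. A broader audit pass can flag such residue, but it has blind spots of its own.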

Pseudonymized data is still personal data. GDPR Article 4(5) treats reversible tokenization as pseudonymization, not anonymization. While the token map exists, the tokenized output remains personal data subject to the full regime — lawful basis, data-subject rights, breach notification. Teams sometimes treat tokenized text as if it were fully anonymized and outside scope. It is not, and a regulator will assess the mapping's existence, not the appearance of the output.

Sources

Cyberhaven Q4 2025: Data Exposure in AI Tools
IBM / Ponemon Institute: Cost of a Data Breach Report 2024
EDPB Guidelines 01/2025 on Pseudonymisation
Federal Rules of Civil Procedure, Rule 37(e)
E-Discovery LLC: Relevance Redactions and Legal Standards
