One Script Is Not Enough

Every data science team has written something like this:

import re

def anonymize_email(text):
    # Replace anything that looks like an email address with a placeholder.
    return re.sub(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', '[EMAIL]', text)

This replaces email addresses. That is all it does. The dataset still holds names, phone numbers, and medical IDs. It will still fail a GDPR audit.
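
To make the gap concrete, here is a small sketch that runs the same function on a made-up record. The sample text and every value in it are invented for illustration.

```python
import re

def anonymize_email(text):
    # Same pattern as the script above: it matches email addresses and nothing else.
    return re.sub(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', '[EMAIL]', text)

# Invented sample record; none of these values refer to a real person.
record = "Patient Jane Doe (MRN 00123456), phone +49 30 1234567, email jane.doe@example.org"

cleaned = anonymize_email(record)
print(cleaned)

# Check which other identifiers survived the "anonymization".
leftovers = [item for item in ("Jane Doe", "00123456", "+49 30 1234567") if item in cleaned]
print(leftovers)
```

The name, the record number, and the phone number all pass through untouched.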

The gap between "I anonymized the emails" and "this dataset is GDPR-compliant" is large. Teams underestimate it all the time.

Why GDPR Limits ML Training Use

GDPR Article 5(1)(b) is the key rule. It is called the purpose limitation principle. Personal data collected for one purpose may not be further processed in a way that is incompatible with that purpose.

Customer orders were collected for order fulfillment. Not for training a recommendation model. Health records were collected for treatment. Not for training a readmission model. Survey answers were collected for product feedback. Not for training a sentiment classifier.

To use those records for ML training, a team needs one of three things:

Explicit consent from each person for the ML purpose — hard to get, often impossible retroactively
A legitimate interest assessment showing the ML use is compatible — legally uncertain, DPA-dependent
Anonymization — replacing or removing personal details so the dataset is no longer personal under GDPR

Proper anonymization gives the most legal certainty. The challenge is doing it right every time.

The Problem With One-Off Scripts

Teams that write a new Python script for each dataset create compounding issues.

Incomplete coverage. A script built for one schema misses new fields. A clinical notes column added six months ago? Not in the regex. A middle name field? The script only handles first and last name patterns.
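
A minimal sketch of a guard against this failure mode, assuming the script keeps an explicit list of the columns it was written for. The column names and the `unhandled_columns` helper are hypothetical.

```python
# Columns the anonymization script was written for (hypothetical schema).
HANDLED_COLUMNS = {"first_name", "last_name", "email", "phone"}

def unhandled_columns(header, handled=HANDLED_COLUMNS):
    """Return columns present in the dataset that the script has no rule for."""
    return [col for col in header if col not in handled]

# Header of a newer export: two fields were added after the script was written.
current_header = ["first_name", "middle_name", "last_name", "email", "phone", "clinical_notes"]

missing = unhandled_columns(current_header)
if missing:
    print("Columns with no anonymization rule:", missing)
```

Failing loudly on unknown columns is safer than silently passing them through, but it only detects the gap. Someone still has to write the missing rule.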

No consistency. Dataset A was processed with script_v1. Dataset B used script_v3. Dataset C was processed by a different team member. The merged training set has three different methods applied. A DPO cannot certify it.

No audit trail. The script ran. What did it change? Which entities were found? Without processing records, compliance cannot be demonstrated. When a DPA auditor asks "how do you know this training set is clean?", the answer "we ran a Python script" is not enough.
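
As a rough illustration of what a processing record could look like, the sketch below counts what each detector replaced and emits one JSON line per record. The two detectors, the function name, and the audit fields are assumptions chosen for brevity, not a description of any particular tool.

```python
import json
import re
from collections import Counter

# Illustrative detectors only; a real pipeline would cover far more entity types.
DETECTORS = {
    "EMAIL_ADDRESS": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def anonymize_with_audit(record_id, text):
    """Replace detected entities and return (clean_text, audit_entry)."""
    counts = Counter()
    for entity_type, pattern in DETECTORS.items():
        # subn returns the new string and the number of substitutions made.
        text, n = pattern.subn(f"[{entity_type}]", text)
        counts[entity_type] = n
    audit_entry = {"record_id": record_id, "method": "redact", "entities_found": dict(counts)}
    return text, audit_entry

clean, entry = anonymize_with_audit("rec-001", "Contact a.b@example.com, SSN 123-45-6789")
print(clean)
print(json.dumps(entry))
```

Writing one such entry per record gives a reviewer something to inspect beyond "the script ran".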

Pattern drift. Regex patterns that worked in 2023 miss new identifier formats from 2024. Scripts do not update themselves.
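
A small regression check makes this kind of drift visible. The sketch below assumes a phone pattern written for a single format and a hand-maintained list of sample formats. The pattern and the sample numbers are invented.

```python
import re

# A phone pattern written for one format only: (555) 123-4567
PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

# Formats the pattern is expected to catch (invented numbers).
samples = ["(555) 123-4567", "+1 555 123 4567", "555.123.4567"]

# Any sample the pattern fails to match is a coverage gap.
missed = [s for s in samples if not PHONE_PATTERN.search(s)]
print(missed)
```

Running a check like this whenever a new source appears turns silent misses into a visible list.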

A Batch Processing Walkthrough

A healthcare AI team needs to anonymize 8,000 patient records. The records sit in an EU office, and the US team needs access to them. Schrems II applies: EU-origin records cannot go to US infrastructure without proper safeguards.

Traditional path: A data engineer writes a custom script. Two to three days of development. One to two days of DPO review. One day of iteration. Total: four to six days. The ML project slips.

Batch processing path:

1. Export the 8,000 records as CSV
2. Upload to batch processing
3. Set entity types: PERSON, EMAIL_ADDRESS, PHONE_NUMBER, US_SSN, MEDICAL_RECORD, DATE_OF_BIRTH, LOCATION
4. Choose method: Replace (substitutes realistic synthetic values to preserve structure)
5. Process: 45 minutes for 8,000 records
6. Download the clean CSV
7. DPO reviews processing metadata (entities found per record, methods applied): 2 hours
8. DPO approves. Transfer proceeds.

Total time: 45 minutes of processing plus 2 hours of DPO review, instead of four to six days.
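
The configuration in the steps above can be written down as data before anything is uploaded. The sketch below is hypothetical: `BatchJobConfig` and its fields are illustrative names, not the schema of any vendor API, and a two-row in-memory CSV stands in for the 8,000-record export.

```python
import csv
import io
from dataclasses import dataclass, field

@dataclass
class BatchJobConfig:
    """Hypothetical job description; field names are illustrative, not a vendor schema."""
    entity_types: list = field(default_factory=list)
    method: str = "replace"

    def validate(self):
        if self.method not in {"replace", "redact"}:
            raise ValueError(f"unknown method: {self.method}")
        if not self.entity_types:
            raise ValueError("at least one entity type is required")

config = BatchJobConfig(
    entity_types=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN",
                  "MEDICAL_RECORD", "DATE_OF_BIRTH", "LOCATION"],
    method="replace",
)
config.validate()

# Stand-in for the exported CSV (two invented rows instead of 8,000).
export = io.StringIO("record_id,notes\n1,Seen by Dr. Example\n2,Follow-up booked\n")
rows = list(csv.DictReader(export))
print(len(rows), "records,", len(config.entity_types), "entity types, method:", config.method)
```

Keeping the configuration in version control means every dataset can be shown to have been processed with the same settings.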

See the EU AI Act training guide for how these same steps satisfy Article 10 obligations.

Replace vs. Redact for ML Use

The anonymization method matters for model quality.

Redact replaces PII with a token like [REDACTED]. This works for PII detection models. For other tasks — sentiment, classification, recommendation — it hurts. The model learns that [REDACTED] is a special token. It cannot learn from the natural distribution of names and values.

Replace swaps "John Smith" for "David Chen." It swaps "jsmith@company.com" for "dchen@synthetic.com." The structure stays intact. Entity placement, co-occurrence patterns, sentence flow — all preserved. The model learns from realistic context.

For most ML training sets, Replace is the better choice. The synthetic values say nothing about real people. What the model learns from is the pattern around them, and that is what matters.
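
The difference between the two methods is easy to see on a single sentence. This sketch assumes detection has already happened and the entities are known. The surrogate mapping reuses the example values above, and both function names are hypothetical.

```python
# Assumed detector output: each found entity mapped to a synthetic stand-in.
SURROGATES = {"John Smith": "David Chen", "jsmith@company.com": "dchen@synthetic.com"}

def redact(text, entities):
    """Replace every detected entity with the same placeholder token."""
    for entity in entities:
        text = text.replace(entity, "[REDACTED]")
    return text

def replace_with_surrogates(text, surrogates):
    """Swap each detected entity for a realistic synthetic value."""
    for original, synthetic in surrogates.items():
        text = text.replace(original, synthetic)
    return text

sentence = "John Smith wrote from jsmith@company.com to complain about shipping."
redacted = redact(sentence, SURROGATES)
replaced = replace_with_surrogates(sentence, SURROGATES)
print(redacted)
print(replaced)
```

The redacted sentence contains two identical tokens where a name and an address used to be. The replaced sentence still reads like ordinary text.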

Schrems II and Cross-Border Transfers

The Schrems II ruling (CJEU, 2020) invalidated the EU-US Privacy Shield. EU-origin records cannot go to US ML infrastructure — AWS US-East, GCP US-Central — without proper transfer safeguards.

The three main options are:

Standard Contractual Clauses with a Transfer Impact Assessment
Binding Corporate Rules for transfers within a company group
Anonymization: properly anonymized records are no longer personal data under GDPR, so the transfer rules do not apply to them

For teams using US infrastructure with EU-origin sets, proper anonymization removes the Schrems II problem. The clean dataset is not personal. It can move freely.

This is one of the strongest practical benefits of batch anonymization. It does more than satisfy GDPR. When the output is truly anonymous, the cross-border transfer restriction no longer applies.

For more on transfer restrictions, see the GDPR purpose limitation guide.

What to Give the DPO

When submitting a clean training set for DPO approval, include these five items:

1. Source description. What was the original dataset? What was the collection purpose? What personal categories did it contain?
2. Anonymization config. Which entity types were detected and replaced? What method was applied?
3. Processing metadata. Entity counts per record, confidence scores, total records processed.
4. Residual risk assessment. What is the chance any individual could be re-identified? For Replace-method anonymization with 267+ entity types on structured text, this probability is typically low, but it still has to be assessed for the specific dataset.
5. Intended use. What model will be trained? What is the training purpose?

Batch processing provides items 2 and 3 automatically. Items 1, 4, and 5 come from the data scientist.
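
One way to keep the five items together is a small record type that can report what is still missing. The sketch below is an assumption about how a team might structure this. `DpoSubmission` and its field names are hypothetical.

```python
from dataclasses import dataclass, asdict

@dataclass
class DpoSubmission:
    source_description: str        # item 1: written by the data scientist
    anonymization_config: dict     # item 2: taken from the batch job
    processing_metadata: dict      # item 3: taken from the batch job
    residual_risk_assessment: str  # item 4: written by the data scientist
    intended_use: str              # item 5: written by the data scientist

    def missing_items(self):
        """Names of items that are still empty."""
        return [name for name, value in asdict(self).items() if not value]

submission = DpoSubmission(
    source_description="Patient records collected for treatment",
    anonymization_config={"entity_types": ["PERSON", "EMAIL_ADDRESS"], "method": "replace"},
    processing_metadata={"records_processed": 8000},
    residual_risk_assessment="",
    intended_use="Readmission-risk model training",
)
print(submission.missing_items())
```

Here the package is flagged as incomplete because the risk assessment has not been written yet.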

See the anonym.legal batch API for how processing metadata is returned with each job.

What You Gain

GDPR-compliant ML sets are achievable without custom scripts, without multi-day delays, and without losing model quality.

The Replace method keeps the natural language properties that matter for NLP training. It removes the personal details that create GDPR risk.

45 minutes of batch processing is the difference between a delayed compliance review and a straightforward DPO sign-off.

When This Approach Has Limits

Replacing one-off scripts with a consistent, logged batch step is genuinely better engineering and better compliance — the core argument holds. But three limits apply.

Anonymized in law is a higher bar than Replace clears by default. The exemption from transfer rules only applies to data that is truly anonymous, meaning no longer attributable to a person by any means reasonably likely to be used. Swapping John Smith for David Chen removes direct identifiers, but if free-text fields, rare diagnoses, or distinctive event sequences remain, the record may still single someone out. That makes the output pseudonymized, not anonymized, and pseudonymized EU-origin data is still in scope for cross-border transfer rules. Treating Replace as automatically clearing Schrems II can reintroduce the exact transfer problem this article describes solving.
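
A simple uniqueness count over quasi-identifiers shows how singling out can survive name replacement. The records, the choice of quasi-identifiers, and the helper below are invented for illustration. This is a rough screening step, not a legal test for anonymity.

```python
from collections import Counter

# Invented records after Replace: names are synthetic, quasi-identifiers are untouched.
records = [
    {"name": "David Chen", "postcode": "10115", "birth_year": 1980, "diagnosis": "asthma"},
    {"name": "Maria Lopez", "postcode": "10115", "birth_year": 1980, "diagnosis": "asthma"},
    {"name": "Alex Kim", "postcode": "80331", "birth_year": 1952, "diagnosis": "rare condition X"},
]

QUASI_IDENTIFIERS = ("postcode", "birth_year", "diagnosis")

def unique_combinations(rows, keys):
    """Return rows whose quasi-identifier combination appears exactly once."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if counts[tuple(row[k] for k in keys)] == 1]

at_risk = unique_combinations(records, QUASI_IDENTIFIERS)
print(len(at_risk), "of", len(records), "records are unique on", QUASI_IDENTIFIERS)
```

The third record has a synthetic name, yet its combination of postcode, birth year, and diagnosis still points to one person.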

Detection accuracy sets a floor on residual personal data. The model only replaces what it recognizes. Clinical notes, mixed-language text, and schema fields added after the configuration was written are detected less reliably than canonical emails or phone numbers. On an 8,000-record set even a low miss rate leaves real names in the training data, and the larger the corpus the more a systematic gap compounds. Validate against held-out samples and review per-record entity counts rather than trusting that every field was caught, especially when a new source column appears.
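
The arithmetic behind "even a low miss rate" is worth spelling out. The sketch below uses invented numbers: a hand-labeled sample of 200 names of which the detector found 198, projected onto 8,000 records under the simplifying assumptions of one name per record and a uniform miss rate.

```python
def miss_rate(found, labeled):
    """Share of hand-labeled entities the detector failed to find."""
    if labeled == 0:
        return 0.0
    return 1 - found / labeled

# Invented validation result on a held-out, hand-labeled sample.
labeled_names = 200
found_names = 198

rate = miss_rate(found_names, labeled_names)

# Rough projection, assuming one name per record and the same miss rate everywhere.
records = 8000
expected_missed = round(records * rate)
print(f"miss rate: {rate:.1%}, expected real names left in {records} records: {expected_missed}")
```

A detector that catches 99 percent of names in the sample would, under these assumptions, still leave about 80 real names in the training set.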

The residual-risk call is human, not a tool output. Item 4 of the DPO package is a re-identification-risk assessment, and that is a legal and statistical judgment, not a confidence score. Processing metadata supports the determination; it does not make it. The data scientist and DPO must weigh quasi-identifiers, dataset size, and likely linkage attacks, then sign. The tool's job is to execute the configuration consistently and document what it did. Deciding whether the result is lawful to use, and on what basis, stays firmly with the people accountable for it.
