One Script Is Not Enough
Every data science team has written something like this:
import re

def anonymize_email(text):
    # Replace anything that looks like an email address with a placeholder.
    return re.sub(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', '[EMAIL]', text)
This replaces email addresses. That is all it does. The dataset still holds names, phone numbers, and medical IDs. It will still fail a GDPR audit.
The gap between "I anonymized the emails" and "this dataset is GDPR-compliant" is large. Teams underestimate it all the time.
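A quick check makes the gap concrete. The sketch below runs the same regex over a made-up record; the name, record number, and phone number are invented for illustration.

```python
import re

EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def anonymize_email(text):
    """Replace email addresses only, as in the script above."""
    return EMAIL_RE.sub('[EMAIL]', text)

# Invented sample record, for illustration only.
sample = ("Patient Jane Doe (MRN 00123456) can be reached at "
          "jane.doe@example.com or 555-010-1234.")

cleaned = anonymize_email(sample)
print(cleaned)
# The email is gone, but the name, record number, and phone number remain.
```

Everything except the email address survives the pass untouched.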
Why GDPR Limits ML Training Use
GDPR Article 5(1)(b) is the key rule. It is called the purpose limitation principle. Personal records must be collected for specified, explicit, and legitimate purposes, and may not be further processed in a way that is incompatible with those purposes.
Customer orders were collected for order fulfillment. Not for training a recommendation model. Health records were collected for treatment. Not for training a readmission model. Survey answers were collected for product feedback. Not for training a sentiment classifier.
To use those records for ML training, a team needs one of three things:
- Explicit consent from each person for the ML purpose — hard to get, often impossible retroactively
- A legitimate interest assessment showing the ML use is compatible — legally uncertain, DPA-dependent
- Anonymization — replacing or removing personal details so the dataset is no longer personal under GDPR
Proper anonymization gives the most legal certainty. The challenge is doing it right every time.
The Problem With One-Off Scripts
Teams that write a new Python script for each dataset create compounding issues.
Incomplete coverage. A script built for one schema misses new fields. A clinical notes column added six months ago? Not in the regex. A middle name field? The script only handles first and last name patterns.
No consistency. Dataset A was processed with script_v1. Dataset B used script_v3. Dataset C was processed by a different team member. The merged training set has three different methods applied. A DPO cannot certify it.
No audit trail. The script ran. What did it change? Which entities were found? Without processing records, compliance is impossible. When a DPA auditor asks "how do you know this training set is clean?", the answer "we ran a Python script" is not enough.
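As a rough illustration of what a processing record could look like, the sketch below wraps a substitution pass so that every run also returns an audit entry. The detector set, the `anonymize_with_audit` name, and the audit fields are assumptions made for this example, not a prescribed format.

```python
import hashlib
import json
import re
from collections import Counter
from datetime import datetime, timezone

# Hypothetical detectors; a real pipeline would cover far more entity types.
DETECTORS = {
    "EMAIL_ADDRESS": re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'),
    "PHONE_NUMBER": re.compile(r'\b\d{3}-\d{3}-\d{4}\b'),
}

def anonymize_with_audit(records, config_version):
    """Return cleaned records plus an audit entry describing what changed."""
    cleaned, per_record = [], []
    for index, text in enumerate(records):
        counts = Counter()
        for entity_type, pattern in DETECTORS.items():
            # subn returns the new text and the number of substitutions made.
            text, found = pattern.subn(f"[{entity_type}]", text)
            counts[entity_type] += found
        cleaned.append(text)
        per_record.append({"record": index, "entities": dict(counts)})
    audit = {
        "config_version": config_version,
        "run_at": datetime.now(timezone.utc).isoformat(),
        "records_processed": len(records),
        "per_record": per_record,
        # Hash of the output so the certified file can be matched to this run.
        "output_sha256": hashlib.sha256("\n".join(cleaned).encode("utf-8")).hexdigest(),
    }
    return cleaned, audit

records = ["Contact a.b@example.com or 555-010-1234", "No identifiers here"]
cleaned, audit = anonymize_with_audit(records, config_version="v1")
print(json.dumps(audit, indent=2))
```

With a record like this stored next to each output file, "what did it change?" has an answer that can be handed to an auditor.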
Pattern drift. Regex patterns that worked in 2023 miss new identifier formats introduced in 2024. Scripts do not update themselves.
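One way to notice this kind of staleness is a small regression check that runs each pattern against a list of identifier formats the team expects to see. The sketch below is a minimal version under that assumption; the sample formats and the `missed_formats` helper are hypothetical.

```python
import re

# A phone pattern written with only US dashed numbers in mind.
PHONE_RE = re.compile(r'\b\d{3}-\d{3}-\d{4}\b')

# Illustrative formats the dataset is expected to contain; values are made up.
KNOWN_FORMATS = {
    "us_dashed": "555-010-1234",
    "uk_international": "+44 20 7946 0958",
}

def missed_formats(pattern, samples):
    """Return the names of sample formats the pattern fails to detect."""
    return [name for name, value in samples.items() if not pattern.search(value)]

missed = missed_formats(PHONE_RE, KNOWN_FORMATS)
print(missed)
```

Adding a new entry to the sample list whenever a new format shows up turns a silent miss into a failing check.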
A Batch Processing Walkthrough
A healthcare AI team needs to anonymize 8,000 patient records. The records are held by an EU office, and the US team needs access to them. Schrems II applies: EU-origin records cannot go to US infrastructure without proper safeguards.
Traditional path: A data engineer writes a custom script. Two to three days of development. One to two days of DPO review. One day of iteration. Total: four to six days. The ML project slips.
Batch processing path:
- Export the 8,000 records as CSV
- Upload to batch processing
- Set entity types: PERSON, EMAIL_ADDRESS, PHONE_NUMBER, US_SSN, MEDICAL_RECORD, DATE_OF_BIRTH, LOCATION
- Choose method: Replace (substitutes realistic synthetic values to preserve structure)
- Process: 45 minutes for 8,000 records
- Download the clean CSV
- DPO reviews processing metadata — entities found per record, methods applied: 2 hours
- DPO approves. Transfer proceeds.
Total time: 45 minutes of processing plus 2 hours of DPO review, instead of four to six days.
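The steps above can be sketched locally to show the shape of the job: a configuration naming entity types and a method, a pass over the CSV, and per-record metadata for the DPO. This is not the API of any particular product. `JOB_CONFIG`, `run_batch`, and `toy_anonymizer` are hypothetical names, and the toy anonymizer handles only one entity type.

```python
import csv
import io
import re

# Hypothetical job description mirroring the walkthrough; field names are
# illustrative and do not correspond to a specific product's API.
JOB_CONFIG = {
    "entity_types": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN",
                     "MEDICAL_RECORD", "DATE_OF_BIRTH", "LOCATION"],
    "method": "replace",
}

def run_batch(csv_text, text_column, anonymize, config):
    """Apply `anonymize` to one column of every row and collect metadata.

    `anonymize(text, config)` must return (clean_text, {entity_type: count}).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    metadata = []
    for row_number, row in enumerate(reader, start=1):
        clean, counts = anonymize(row[text_column], config)
        row[text_column] = clean
        writer.writerow(row)
        metadata.append({"row": row_number, "entities": counts,
                         "method": config["method"]})
    return out.getvalue(), metadata

SSN_RE = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

def toy_anonymizer(text, config):
    """Toy stand-in: swaps US_SSN values for a fixed placeholder number."""
    if "US_SSN" not in config["entity_types"]:
        return text, {}
    clean, found = SSN_RE.subn("000-00-0000", text)
    return clean, {"US_SSN": found}

csv_text = "id,note\n1,SSN 123-45-6789 on file\n2,Nothing sensitive\n"
clean_csv, metadata = run_batch(csv_text, "note", toy_anonymizer, JOB_CONFIG)
print(clean_csv)
print(metadata)
```

Keeping the anonymizer as a pluggable function means the same loop and metadata format apply to every dataset, which is the consistency the one-off scripts lack.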
See the EU AI Act training guide for how these same steps satisfy Article 10 obligations.
Replace vs. Redact for ML Use
The anonymization method matters for model quality.
Redact replaces PII with a token like [REDACTED]. This works for PII detection models. For other tasks — sentiment, classification, recommendation — it hurts. The model learns that [REDACTED] is a special token. It cannot learn from the natural distribution of names and values.
Replace swaps "John Smith" for "David Chen." It swaps "jsmith@company.com" for "dchen@synthetic.com." The structure stays intact. Entity placement, co-occurrence patterns, sentence flow — all preserved. The model learns from realistic context.
For ML training sets, Replace is the right choice. The model does not learn the fake values. It learns the patterns around them. That is what matters.
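The difference is easy to see side by side. In the sketch below, a fixed list of known names stands in for a real entity detector, and the pool of synthetic names is invented; both are assumptions made to keep the example short.

```python
import re

# Toy detector: a fixed list of known names stands in for a real NER step.
KNOWN_NAMES = ["John Smith", "Maria Garcia"]
# Hypothetical pool of synthetic replacement names.
SYNTHETIC_NAMES = ["David Chen", "Laura Novak"]

NAME_RE = re.compile("|".join(re.escape(name) for name in KNOWN_NAMES))

def redact(text):
    """Redact: every detected name becomes the same placeholder token."""
    return NAME_RE.sub("[REDACTED]", text)

def replace(text):
    """Replace: each distinct name maps to one synthetic name within this text."""
    mapping = {}

    def substitute(match):
        original = match.group(0)
        if original not in mapping:
            mapping[original] = SYNTHETIC_NAMES[len(mapping) % len(SYNTHETIC_NAMES)]
        return mapping[original]

    return NAME_RE.sub(substitute, text)

review = ("John Smith said the courier was late, but Maria Garcia fixed it. "
          "Thanks, John Smith!")
print(redact(review))
print(replace(review))
```

The replaced version still reads like natural text with two distinct people in it, while the redacted version collapses both into one token. The mapping here is rebuilt on every call, so the same real name does not map to the same synthetic name across records.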
Schrems II and Cross-Border Transfers
The Schrems II ruling (CJEU, 2020) invalidated the EU-US Privacy Shield. EU-origin records cannot go to US ML infrastructure — AWS US-East, GCP US-Central — without proper transfer safeguards.
The three main routes are:
- Standard Contractual Clauses with a Transfer Impact Assessment
- Binding Corporate Rules for transfers within a company group
- Anonymization: properly anonymized files are no longer personal under GDPR (Recital 26), so the transfer rules do not apply to them
For teams using US infrastructure with EU-origin sets, proper anonymization removes the Schrems II problem. The clean dataset is not personal. It can move freely.
This is one of the strongest practical benefits of batch anonymization. It does more than satisfy GDPR. It removes cross-border friction entirely.
For more on transfer restrictions, see the GDPR purpose limitation guide.
What to Give the DPO
When submitting a clean training set for DPO approval, include these five items:
- Source description. What was the original dataset? What was the collection purpose? What personal categories did it contain?
- Anonymization config. Which entity types were detected and replaced? What method was applied?
- Processing metadata. Entity counts per record, confidence scores, total records processed.
- Residual risk assessment. What is the chance any individual could be re-identified? For Replace-method anonymization with 285+ entity types on structured text, this probability is very low.
- Intended use. What model will be trained? What is the training purpose?
Batch processing provides the anonymization config and the processing metadata automatically. The source description, the residual risk assessment, and the intended use come from the data scientist.
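A submission like this can be kept as one structured object so that nothing is forgotten. The sketch below is one possible shape; the `DpoSubmission` class, its field names, and the example values are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class DpoSubmission:
    """The five items from the list above, in the same order."""
    source_description: str
    anonymization_config: dict
    processing_metadata: dict
    residual_risk_assessment: str
    intended_use: str

    def missing_items(self):
        """Return the names of items that are still empty."""
        return [name for name, value in asdict(self).items() if not value]

# Example values are invented; the risk assessment is left blank on purpose.
submission = DpoSubmission(
    source_description="Patient records collected for treatment",
    anonymization_config={"entity_types": ["PERSON", "EMAIL_ADDRESS"],
                          "method": "replace"},
    processing_metadata={"records_processed": 8000},
    residual_risk_assessment="",
    intended_use="Readmission prediction model",
)

print(submission.missing_items())
print(json.dumps(asdict(submission), indent=2))
```

Checking `missing_items()` before sending the package avoids a review round spent on an incomplete submission.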
See the anonym.legal batch API for how processing metadata is returned with each job.
What You Gain
GDPR-compliant ML sets are achievable without custom scripts, without multi-day delays, and without losing model quality.
The Replace method keeps the natural language properties that matter for NLP training. It removes the personal details that create GDPR risk.
45 minutes of batch processing is the difference between a delayed compliance review and a straightforward DPO sign-off.