Reproducible Privacy: Why ML Teams Need Presets, Not Just Docs
The DPO approved the anonymization plan. It covers four items: names, emails, phone numbers, and dates of birth. The method is Replace. The plan is four pages and lives in the compliance wiki.
Twelve data scientists read it at kickoff. Each one sets up the tool on their own. Some add national IDs. Some add IP addresses. Some switch to Redact. Three months later, the sets are not consistent.
The CNIL checked several AI firms in 2024. The issue: improper use of personal details in model sets. They did not just ask whether anonymization happened. They asked how consistently it was applied.
Docs are needed. They are not enough. The fix is the preset.
Why ML Model Sets Need Their Own Config
Building model sets has unique needs. General document anonymization does not share them.
Replace, not Redact. Models trained on text where names become [REDACTED] learn that token as a name-position marker. This hurts the model. Replace swaps "John Smith" for "David Chen." The model sees real name patterns. It does not see a mask token.
Same process for all records. A set where 70% of names are replaced and 30% are [REDACTED] sends a mixed signal. Each record must go through the same steps.
Same entity list. If the set holds health details, removing names but leaving dates of birth in some records creates gaps. All twelve data scientists must remove the same types.
No over-removal. Taking out dates that are timestamps — not dates of birth — reduces set quality with no compliance gain. The approved preset says exactly which items to remove.
Repeatable output. If a set must be run again — say, after a missed entity type is found — the preset gives the same result each time. Ad-hoc configs do not.
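The needs above can be sketched in a few lines. This is a minimal illustration, not the API of any real tool: the `Preset` type, the tiny `SURROGATE_NAMES` pool, and the helper names are all assumptions made for the example. It shows a locked config and a replacement that comes out the same on every run.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical surrogate pool; a real tool would use a much larger one.
SURROGATE_NAMES = ("David Chen", "Maria Lopez", "Anna Kowalska", "Omar Haddad")


@dataclass(frozen=True)  # frozen: the preset cannot be mutated after creation
class Preset:
    name: str
    version: str
    entity_types: tuple  # e.g. ("PERSON", "EMAIL_ADDRESS")
    method: str          # "replace" or "redact"


def surrogate_for(original: str, preset: Preset) -> str:
    """Pick a replacement deterministically from the original value and preset."""
    key = f"{preset.name}:{preset.version}:{original}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(SURROGATE_NAMES)
    return SURROGATE_NAMES[index]


def anonymize_names(text: str, detected_names: list, preset: Preset) -> str:
    """Apply the preset's method to names that a detector already found."""
    if "PERSON" not in preset.entity_types:
        return text
    for name in detected_names:
        if preset.method == "replace":
            text = text.replace(name, surrogate_for(name, preset))
        else:
            text = text.replace(name, "[REDACTED]")
    return text


preset = Preset("ML Dev", "1.0", ("PERSON",), "replace")
first = anonymize_names("John Smith called support.", ["John Smith"], preset)
second = anonymize_names("John Smith called support.", ["John Smith"], preset)
print(first == second)  # the same preset and input give the same output
```

The sketch uses SHA-256 rather than Python's built-in `hash()`, because `hash()` on strings is randomized per process and would break repeatability across runs.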
The Twelve Data Scientist Problem
A fintech ML team in Europe builds its sets from customer logs. The DPO approved the purpose (fraud detection) with one rule: all customer names, emails, phone numbers, and payment IDs must be replaced before model work starts.
Without presets:
- Person 1 removes names, emails, and phone numbers — but misses payment IDs
- Person 2 includes payment IDs but uses Redact, not Replace
- Person 3 follows the plan document exactly
- Persons 4–12 each differ in entity list, method, or both
The merged set is partly non-compliant and partly over-processed. A DPO cannot certify it.
With a DPO-approved preset:
- The DPO creates "ML Dev — Fraud Detection" with exact entity types and the Replace method
- The preset goes to all twelve people with one rule: use this for all set work
- No one can change the preset without DPO sign-off
Every person now produces the same output. The merged set is consistent. The yearly AI audit passes with zero findings. The prior year had three findings from inconsistent set work.
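The gap between the two scenarios can be made visible with a simple check. The sketch below is an assumption-heavy illustration: the entity labels (including `PAYMENT_ID`), the `deviations` helper, and the three sample configs are hypothetical stand-ins for the fintech example. It compares each person's config against the approved one and lists what differs.

```python
# Approved configuration from the scenario above (entity labels are assumed).
APPROVED_TYPES = frozenset({"PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "PAYMENT_ID"})
APPROVED_METHOD = "replace"


def deviations(entity_types, method):
    """Return a list of human-readable differences from the approved preset."""
    found = []
    missing = APPROVED_TYPES - set(entity_types)
    extra = set(entity_types) - APPROVED_TYPES
    if missing:
        found.append("missing types: " + ", ".join(sorted(missing)))
    if extra:
        found.append("unapproved types: " + ", ".join(sorted(extra)))
    if method != APPROVED_METHOD:
        found.append(f"method is {method!r}, expected {APPROVED_METHOD!r}")
    return found


# Hypothetical ad-hoc configs mirroring Persons 1 to 3.
team_configs = {
    "person_1": ({"PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"}, "replace"),
    "person_2": ({"PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "PAYMENT_ID"}, "redact"),
    "person_3": ({"PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "PAYMENT_ID"}, "replace"),
}

report = {who: deviations(types, method) for who, (types, method) in team_configs.items()}
for who, problems in report.items():
    print(who, problems or "matches approved preset")
```

With a shared preset, this check becomes trivial: every config is the approved one, so every report is empty.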
GDPR and the AI Act
Updated for 2026
The EU AI Act entered into force in August 2024, with its obligations phasing in over the following years. It adds rules for AI systems that use personal details for model work. High-risk AI systems must document their sets, including what anonymization was applied.
GDPR Article 5(1)(b), the purpose limitation principle, requires that personal details collected for one purpose not be further processed in a way incompatible with that purpose. The CNIL's 2024 cases focused on this gap: details collected for one service used for model work with no valid basis or anonymization.
Presets help satisfy both sets of rules:
- Preset name and config: the documented method
- Processing logs: proof the method was applied
- DPO approval: a recorded sign-off on the config
This creates the audit trail both laws require. For Article 10 obligations in detail, see the EU AI Act training data guide.
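One way to tie the three pieces together is a log record that carries a fingerprint of the config. The sketch below is only an illustration: the field names, the `ml-dev-fraud-detection` slug, the `dpo` approver value, and the dataset ID are all hypothetical, and no specific tool's log format is implied.

```python
import hashlib
import json
from datetime import datetime, timezone


def preset_fingerprint(preset: dict) -> str:
    """Hash a canonical JSON form so any config change alters the fingerprint."""
    canonical = json.dumps(preset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def log_entry(preset: dict, approved_by: str, dataset_id: str, records: int) -> dict:
    """Build one processing-log record tying a run to the approved config."""
    return {
        "preset_name": preset["name"],
        "preset_fingerprint": preset_fingerprint(preset),
        "approved_by": approved_by,
        "dataset_id": dataset_id,
        "records_processed": records,
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }


preset = {
    "name": "ml-dev-fraud-detection",  # hypothetical slug for the approved preset
    "method": "replace",
    # Kept sorted so list order does not change the fingerprint by accident.
    "entity_types": ["EMAIL_ADDRESS", "PAYMENT_ID", "PERSON", "PHONE_NUMBER"],
}
entry = log_entry(preset, approved_by="dpo", dataset_id="logs-batch-01", records=1000)
print(json.dumps(entry, indent=2))
```

An auditor can then compare the fingerprint in each log entry with the fingerprint of the config the DPO signed off on. A mismatch means the run used something other than the approved preset.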
Preset Config for NLP Model Sets
Types to include in most NLP model sets:
- PERSON — Replace with similar names
- EMAIL_ADDRESS — Replace with synthetic addresses
- PHONE_NUMBER — Replace with synthetic numbers
- CREDIT_CARD / IBAN — Replace or Redact
- LOCATION — Replace with similar places if location matters; Redact if not
- DATE_OF_BIRTH — Redact, or generalize to an age group when the model needs age information
Types often left out:
- General dates — timestamps help temporal models
- Org names — help named-entity models
- URLs — help link and reference models
The ML lead and DPO set these rules in the approved preset. Team members apply it. They do not make config choices.
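The lists above could be written down as a small config plus a sanity check. This is a sketch under stated assumptions: the dict layout, the `validate` helper, and the labels `DATE_TIME`, `ORGANIZATION`, and `URL` are hypothetical, and where the list allows either method the value shown is just one possible choice.

```python
ALLOWED_METHODS = {"replace", "redact"}

# Illustrative choices only; the ML lead and DPO decide the real values.
NLP_PRESET = {
    "include": {
        "PERSON": "replace",
        "EMAIL_ADDRESS": "replace",
        "PHONE_NUMBER": "replace",
        "CREDIT_CARD": "replace",
        "IBAN": "replace",
        "LOCATION": "replace",
        "DATE_OF_BIRTH": "redact",
    },
    # Types deliberately left in place. These labels are assumed names.
    "exclude": ["DATE_TIME", "ORGANIZATION", "URL"],
}


def validate(preset):
    """Return a list of config errors; an empty list means the preset is sane."""
    errors = []
    for entity, method in preset["include"].items():
        if method not in ALLOWED_METHODS:
            errors.append(f"{entity}: unknown method {method!r}")
    overlap = set(preset["include"]) & set(preset["exclude"])
    for entity in sorted(overlap):
        errors.append(f"{entity}: listed as both included and excluded")
    return errors


print(validate(NLP_PRESET))
```

Listing the excluded types explicitly, rather than leaving them implied, records that keeping timestamps or org names was a decision and not an oversight.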
Presets as Institutional Memory
Before presets. The right entity config lived in the heads of three data scientists. They had worked through the compliance review. Two left in Q3. The knowledge went with them.
After presets. The config lives in "ML Dev — Customer Records v2.1." The version log shows when it was made, who approved it, and what changed from v2.0. New team members use the preset and get all the knowledge built into it.
Version 2.1 added IBAN detection after a review found it missing. Version 2.0 was approved in February 2025. The log is complete.
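A version log like this is easy to generate when presets are plain data. The sketch below is hypothetical: the source only says that v2.1 added IBAN detection, so the rest of each version's entity list is assumed for the example, as is the `diff_presets` helper.

```python
def diff_presets(old, new):
    """Summarize what changed between two preset versions."""
    old_types, new_types = set(old["entity_types"]), set(new["entity_types"])
    return {
        "added": sorted(new_types - old_types),
        "removed": sorted(old_types - new_types),
        "method_changed": old["method"] != new["method"],
    }


# Entity lists are assumed; only the IBAN addition comes from the text above.
V2_0 = {
    "version": "2.0",
    "method": "replace",
    "entity_types": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
}
V2_1 = {
    "version": "2.1",
    "method": "replace",
    "entity_types": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "IBAN"],
}

changes = diff_presets(V2_0, V2_1)
print(changes)
```

Storing the diff next to the approval record gives a new team member the "what changed and why" without needing to ask anyone who was there.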
For how processing logs and DPO review flows work, see the GDPR ML training anonymization guide.
Presets vs. the CNIL Pattern
The CNIL's 2024 AI cases set a clear pattern. They ask not just what was removed but how it was governed. A shared preset with a DPO approval record and processing logs answers this directly.
An ad-hoc config does not. The same gap exists in other EU DPA cases that follow CNIL logic. For more on the CNIL AI approach, see the CNIL GDPR AI compliance guide.
Conclusion
Docs tell team members what to do. Presets make it easy — and enforceable — to do it the same way each time.
For ML model sets, consistency is both a legal need and a technical one. The preset meets both at once.
DPAs looking at AI practices want evidence of uniform anonymization. A preset applied the same way across all set work is the clearest proof you can give them.