Reproducible Privacy: Why ML Teams Need Presets, Not Just Docs

The DPO approved the anonymization plan. It covers four items: names, emails, phone numbers, and dates of birth. The method is Replace. The plan is four pages and lives in the compliance wiki.

Twelve data scientists read it at kickoff. Each one configures the anonymization tool on their own. Some add national IDs. Some add IP addresses. Some switch to Redact. Three months later, their training sets are not consistent with one another.

The CNIL examined several AI firms in 2024. The issue: improper use of personal data in training sets. The regulator did not just ask whether anonymization happened. It asked how consistently it was applied.

Docs are needed. They are not enough. The fix is the preset.

Why ML Model Sets Need Their Own Config

Building training sets has needs that general document anonymization does not share.

Replace, not Redact. Models trained on text where names become [REDACTED] learn that token as a name-position marker. This hurts the model: it learns the mask, not how real names appear in text. Replace swaps "John Smith" for "David Chen." The model sees realistic name patterns. It does not see a mask token.
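A minimal sketch of the difference, assuming a detector has already returned character spans. The `spans` list, the surrogate name, and both helper functions are hypothetical illustrations, not the API of any particular tool.

```python
# Hypothetical detector output: (start, end, entity_type) character spans.
text = "John Smith called support about his invoice."
spans = [(0, 10, "PERSON")]


def redact(text, spans, mask="[REDACTED]"):
    """Overwrite each detected span with a fixed mask token."""
    # Work right to left so earlier offsets stay valid after each edit.
    for start, end, _ in sorted(spans, reverse=True):
        text = text[:start] + mask + text[end:]
    return text


def replace(text, spans, surrogates):
    """Swap each detected span for a realistic surrogate of the same type."""
    for start, end, entity_type in sorted(spans, reverse=True):
        text = text[:start] + surrogates[entity_type] + text[end:]
    return text


redacted = redact(text, spans)
replaced = replace(text, spans, {"PERSON": "David Chen"})
print(redacted)  # [REDACTED] called support about his invoice.
print(replaced)  # David Chen called support about his invoice.
```

The replaced sentence still reads like natural text, which is the property the training set needs.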

Same process for all records. A set where 70% of names are replaced and 30% are [REDACTED] sends a mixed signal. Each record must go through the same steps.

Same entity list. If the set holds health details, removing names but leaving dates of birth in some records creates gaps. All twelve data scientists must remove the same entity types.

No over-removal. Taking out dates that are timestamps — not dates of birth — reduces set quality with no compliance gain. The approved preset says exactly which items to remove.

Repeatable output. If a set must be run again — say, after a missed entity type is found — the preset gives the same result each time. Ad-hoc configs do not.
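Repeatable output with Replace depends on how surrogates are chosen: a random pick would differ on every run. One possible approach, sketched here under assumptions, is to derive the surrogate from a keyed hash of the original value. The surrogate pool, the key, and `surrogate_for` are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical surrogate pool and key. A real preset would carry a far larger
# pool and keep the key in a secrets store, not in source code.
SURROGATE_NAMES = ["David Chen", "Maria Rossi", "Amina Diallo", "Lukas Weber"]
PRESET_KEY = b"ml-dev-fraud-detection-key"


def surrogate_for(original, pool=SURROGATE_NAMES, key=PRESET_KEY):
    """Pick a surrogate deterministically from the original value and the key."""
    digest = hmac.new(key, original.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]


names = ["John Smith", "Jane Doe", "John Smith"]
first_run = [surrogate_for(n) for n in names]
second_run = [surrogate_for(n) for n in names]
print(first_run == second_run)  # True: same input, same key, same output
```

A stable mapping like this keeps reruns identical, but anyone holding the key can test candidate names against the output, so the key needs the same protection as the source data.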

The Twelve Data Scientist Problem

A fintech ML team in Europe builds training sets from customer logs. The DPO approved the purpose, fraud detection, with one rule: all customer names, emails, phone numbers, and payment IDs must be replaced before model work starts.

Without presets:

Person 1 removes names, emails, and phone numbers — but misses payment IDs
Person 2 includes payment IDs but uses Redact, not Replace
Person 3 follows the plan document exactly
Persons 4–12 vary

The merged set is partly non-compliant and partly over-processed. A DPO cannot certify it.
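The drift described above can be made visible with a simple comparison against the approved plan. This is an illustrative sketch: the entity labels (including `PAYMENT_ID`), the config layout, and the `deviations` helper are assumptions, and only the first three people are modeled.

```python
# The approved plan from the scenario; type names are illustrative labels.
APPROVED = {
    "method": "replace",
    "entities": {"PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "PAYMENT_ID"},
}

# Hypothetical per-person configs mirroring the first three cases.
configs = {
    "person_1": {"method": "replace",
                 "entities": {"PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"}},
    "person_2": {"method": "redact",
                 "entities": {"PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "PAYMENT_ID"}},
    "person_3": {"method": "replace",
                 "entities": {"PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "PAYMENT_ID"}},
}


def deviations(config, approved=APPROVED):
    """List how one config departs from the approved plan."""
    problems = []
    if config["method"] != approved["method"]:
        problems.append(f"method is {config['method']}, expected {approved['method']}")
    missing = approved["entities"] - config["entities"]
    extra = config["entities"] - approved["entities"]
    if missing:
        problems.append(f"missing: {sorted(missing)}")
    if extra:
        problems.append(f"not in plan: {sorted(extra)}")
    return problems


report = {name: deviations(cfg) for name, cfg in configs.items()}
for name, problems in report.items():
    print(name, "OK" if not problems else problems)
```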

With a DPO-approved preset:

The DPO creates "ML Dev — Fraud Detection" with exact entity types and the Replace method
The preset goes to all twelve people with one rule: use this for all set work
No one can change the preset without DPO sign-off

Every person now produces the same output. The merged set is consistent. The yearly AI audit passes with zero findings. The prior year had three findings from inconsistent set work.
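One way to make "no change without DPO sign-off" checkable is to fingerprint the approved preset and refuse to run anything that does not match. The sketch below assumes a simple immutable dataclass; the `Preset` class, the `PAYMENT_ID` label, and the idea of recording a fingerprint at sign-off are hypothetical, not a description of a specific product.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Preset:
    name: str
    method: str
    entities: tuple  # a tuple, so the field itself cannot be mutated in place


def fingerprint(preset):
    """Stable SHA-256 over the preset's canonical JSON form."""
    canonical = json.dumps(asdict(preset), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


approved = Preset(
    name="ML Dev — Fraud Detection",
    method="replace",
    entities=("PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "PAYMENT_ID"),
)
# Recorded once, at DPO sign-off (hypothetical step).
APPROVED_FINGERPRINT = fingerprint(approved)


def check_before_run(preset):
    """Refuse to process data with a preset that differs from the signed-off one."""
    if fingerprint(preset) != APPROVED_FINGERPRINT:
        raise ValueError(f"preset {preset.name!r} does not match the approved fingerprint")


check_before_run(approved)  # passes silently

tampered = Preset(approved.name, "redact", approved.entities)
try:
    check_before_run(tampered)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```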

GDPR and the AI Act (Updated for 2026)

The EU AI Act entered into force in August 2024, with its obligations phasing in over the following years. It adds rules for AI systems that use personal data for model training. High-risk AI systems must document their training sets, including what anonymization was applied.

GDPR Article 5(1)(b), the purpose limitation principle, bars further processing of personal data in a way that is incompatible with the purpose it was collected for. The CNIL's 2024 cases focused on this gap: data collected for one service used for model training with no valid basis or anonymization.

Presets help satisfy both sets of rules:

Preset name and config: the documented method
Processing logs: proof the method was applied
DPO approval: a recorded sign-off on the config

Together these form the audit trail both laws call for. For Article 10 obligations in detail, see the EU AI Act training data guide.
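As a rough illustration, the three pieces of evidence can be tied together in one log record per run. The field names, the version string, and the approver address are hypothetical; an actual log format would follow whatever the DPO and auditors agree on.

```python
import json
from datetime import datetime, timezone


def processing_log_entry(preset_name, preset_version, approved_by,
                         records_in, records_out):
    """Build one audit record tying a run to the preset and its sign-off."""
    return {
        "preset": preset_name,
        "preset_version": preset_version,
        "approved_by": approved_by,
        "records_in": records_in,
        "records_out": records_out,
        "run_at": datetime.now(timezone.utc).isoformat(),
    }


entry = processing_log_entry(
    preset_name="ML Dev — Fraud Detection",
    preset_version="1.0",            # hypothetical version label
    approved_by="dpo@example.com",   # hypothetical approver
    records_in=1000,
    records_out=1000,
)
# One JSON line per run, suitable for an append-only log file.
line = json.dumps(entry, sort_keys=True)
print(line)
```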

Preset Config for NLP Model Sets

Types to include in most NLP model sets:

PERSON: Replace with similar names
EMAIL_ADDRESS: Replace with synthetic addresses
PHONE_NUMBER: Replace with synthetic numbers
CREDIT_CARD / IBAN: Replace or Redact
LOCATION: Replace with similar places if location matters; Redact if not
DATE_OF_BIRTH: Redact; generalize to an age group when the model needs age

Types often left out:

General dates: timestamps help temporal models
Org names: help named-entity models
URLs: help link and reference models

The ML lead and DPO set these rules in the approved preset. Team members apply it. They do not make config choices.
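The lists above could be written down as a mapping from entity type to method, with anything outside the mapping passed through untouched. This is a sketch only: the dict layout, the `DATE_TIME` and `ORGANIZATION` labels, the detector output, and the choice of Replace for `CREDIT_CARD`, `IBAN`, and `LOCATION` are assumptions made for the example.

```python
# Entity types and methods taken from the lists above; layout is illustrative.
NLP_PRESET = {
    "PERSON": "replace",
    "EMAIL_ADDRESS": "replace",
    "PHONE_NUMBER": "replace",
    "CREDIT_CARD": "replace",   # the list allows Replace or Redact
    "IBAN": "replace",          # the list allows Replace or Redact
    "LOCATION": "replace",      # assumes location matters for this model
    "DATE_OF_BIRTH": "redact",
}


def plan_actions(detections, preset=NLP_PRESET):
    """Keep only detections whose type is in the preset, paired with its method."""
    return [
        (d["type"], d["text"], preset[d["type"]])
        for d in detections
        if d["type"] in preset
    ]


# Hypothetical detector output for one log line.
detections = [
    {"type": "PERSON", "text": "John Smith"},
    {"type": "DATE_TIME", "text": "2025-02-14T09:30:00Z"},
    {"type": "ORGANIZATION", "text": "Acme Bank"},
    {"type": "DATE_OF_BIRTH", "text": "1984-06-02"},
]
actions = plan_actions(detections)
print(actions)
# [('PERSON', 'John Smith', 'replace'), ('DATE_OF_BIRTH', '1984-06-02', 'redact')]
```

The timestamp and the organization name survive because the preset never lists them, which is how the config avoids over-removal without any per-person decision.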

Presets as Institutional Memory

Before presets. The right entity config lived in the heads of three data scientists. They had worked through the compliance review. Two left in Q3. The knowledge went with them.

After presets. The config lives in "ML Dev — Customer Records v2.1." The version log shows when it was made, who approved it, and what changed from v2.0. New team members use the preset and get all the knowledge built into it.

Version 2.1 added IBAN detection after a review found it missing. Version 2.0 was approved in February 2025. The log is complete.
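A version log like this can be as simple as the entity set per version plus a diff. In the sketch below, the v2.0 approval month and the IBAN addition come from the text above; the rest of each entity list and the `changelog` helper are hypothetical.

```python
# Hypothetical entity lists; only the IBAN addition in 2.1 and the
# February 2025 approval of 2.0 are taken from the article.
PRESET_VERSIONS = {
    "2.0": {
        "approved": "2025-02",
        "entities": {"PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "CREDIT_CARD"},
    },
    "2.1": {
        "entities": {"PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "CREDIT_CARD", "IBAN"},
    },
}


def changelog(old, new, versions=PRESET_VERSIONS):
    """Report which entity types were added or removed between two versions."""
    before = versions[old]["entities"]
    after = versions[new]["entities"]
    return {"added": sorted(after - before), "removed": sorted(before - after)}


diff = changelog("2.0", "2.1")
print(diff)  # {'added': ['IBAN'], 'removed': []}
```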

For how processing logs and DPO review flows work, see the GDPR ML training anonymization guide.

Presets vs. the CNIL Pattern

The CNIL's 2024 AI cases set a clear pattern. They ask not just what was removed but how it was governed. A shared preset with a DPO approval record and processing logs answers this directly.

An ad-hoc config does not. The same gap exists in other EU DPA cases that follow CNIL logic. For more on the CNIL AI approach, see the CNIL GDPR AI compliance guide.

Conclusion

Docs tell team members what to do. Presets make it easy — and enforceable — to do it the same way each time.

For ML model sets, consistency is both a legal need and a technical one. The preset meets both at once.

DPAs looking at AI practices want evidence of uniform anonymization. A preset applied the same way across all set work is the clearest proof you can give them.

When This Approach Has Limits

A DPO-approved, version-controlled preset gives ML teams the reproducibility and audit trail this article argues for — but three limits apply.

Reproducibility does not equal anonymity. Running every record through the same preset guarantees a consistent result; it does not guarantee that result is anonymous. Replace swaps names for synthetic ones, but the surrounding text — locations, employer, transaction patterns, rare attribute combinations — can still single out an individual, and timestamps deliberately kept for temporal modeling are themselves quasi-identifiers. A dataset uniformly processed this way is typically pseudonymized, which keeps it in GDPR scope and within the AI Act's training-data obligations. The preset makes the processing defensible and repeatable; a privacy review still has to judge residual re-identification risk.
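A privacy review might start with a check like the following, which counts quasi-identifier combinations that appear only once in the processed set. It is a simplified illustration, not a full re-identification assessment; the records, the field names, and the choice of quasi-identifiers are all hypothetical.

```python
from collections import Counter

# Hypothetical records after the preset has replaced direct identifiers.
records = [
    {"city": "Lyon", "employer": "Acme Bank", "hour": 9},
    {"city": "Lyon", "employer": "Acme Bank", "hour": 9},
    {"city": "Brest", "employer": "Harbor Co", "hour": 3},
]
QUASI_IDENTIFIERS = ("city", "employer", "hour")


def unique_combinations(records, keys=QUASI_IDENTIFIERS):
    """Return quasi-identifier combinations that occur in exactly one record."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return [combo for combo, n in counts.items() if n == 1]


singled_out = unique_combinations(records)
print(singled_out)  # [('Brest', 'Harbor Co', 3)]
```

Every combination in that list points at a single record, even though no name, email, or phone number remains in it.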

Detection accuracy caps every record equally. Replace only substitutes what the detector flagged. Names in unusual forms, payment IDs in a custom shape, or identifiers buried in free-text logs are caught less reliably than structured fields, so a uniformly applied preset can leave uniform gaps across the whole corpus. At training-set scale that is the dangerous case: one systematic miss repeats across millions of rows rather than once. Validate the preset against a held-out sample and measure the residual miss rate before certifying a set, rather than inferring coverage from the preset name.
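Measuring the residual miss rate can be sketched as a comparison between hand-labeled spans and what the detector flagged on the same held-out records. The span format `(record_id, start, end, type)`, the exact-match rule, and the sample data are assumptions for illustration.

```python
from collections import defaultdict


def miss_rate_by_type(gold, detected):
    """Share of hand-labeled spans the detector missed, per entity type."""
    detected = set(detected)
    totals = defaultdict(int)
    misses = defaultdict(int)
    for span in gold:
        entity_type = span[3]
        totals[entity_type] += 1
        if span not in detected:  # exact span match; a stricter or looser rule is possible
            misses[entity_type] += 1
    return {t: misses[t] / totals[t] for t in totals}


# Hypothetical held-out sample: (record_id, start, end, entity_type).
gold = [
    (1, 0, 10, "PERSON"),
    (1, 25, 45, "PAYMENT_ID"),
    (2, 5, 15, "PERSON"),
    (2, 30, 50, "PAYMENT_ID"),
]
detected = [
    (1, 0, 10, "PERSON"),
    (2, 5, 15, "PERSON"),
    (2, 30, 50, "PAYMENT_ID"),
]
rates = miss_rate_by_type(gold, detected)
print(rates)  # {'PERSON': 0.0, 'PAYMENT_ID': 0.5}
```

A per-type breakdown matters here because an acceptable overall rate can hide one entity type, such as a custom payment ID format, that is missed systematically.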

The preset is evidence of method, not a legal basis. A version log and DPO sign-off document how anonymization was applied; they do not establish the purpose-limitation basis under GDPR Article 5(1)(b) that the CNIL cases turned on, nor satisfy AI Act Article 10 governance on their own. Those are determinations qualified people make. Reusing data collected for one service to train a model still needs a valid basis the preset cannot supply.
