Last updated 2026-05-29

Reproducible Privacy: ML Presets

ML training data anonymization must be consistent and reproducible. If data scientists A and B apply different entity types, training datasets are inconsistent.

May 29, 2026 · 6 minute read
ML training data · reproducible privacy · GDPR AI Act · CNIL enforcement · data science compliance

Reproducible Privacy: Why ML Teams Need Presets, Not Just Docs

The DPO approved the anonymization plan. It covers four items: names, emails, phone numbers, and dates of birth. The method is Replace. The plan is four pages and lives in the compliance wiki.

Twelve data scientists read it at kickoff. Each one sets up the tool on their own. Some add national IDs. Some add IP addresses. Some switch to Redact. Three months later, the sets are not consistent.

The CNIL checked several AI firms in 2024. The issue: improper use of personal details in model sets. They did not just ask whether anonymization happened. They asked how consistently it was applied.

Docs are needed. They are not enough. The fix is the preset.

Why ML Model Sets Need Their Own Config

Building model sets has unique needs. General document anonymization does not share them.

Replace, not Redact. Models trained on text where names become [REDACTED] learn that token as a name-position marker. This hurts the model. Replace swaps "John Smith" for "David Chen." The model sees real name patterns. It does not see a mask token.

Same process for all records. A set where 70% of names are replaced and 30% are [REDACTED] sends a mixed signal. Each record must go through the same steps.

Same entity list. If the set holds health details, removing names but leaving dates of birth in some records creates gaps. All twelve data scientists must remove the same types.

No over-removal. Taking out dates that are timestamps — not dates of birth — reduces set quality with no compliance gain. The approved preset says exactly which items to remove.

Repeatable output. If a set must be run again — say, after a missed entity type is found — the preset gives the same result each time. Ad-hoc configs do not.
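Here is a minimal sketch of the Replace versus Redact difference and of repeatable output. It is not the product's implementation. It assumes the PERSON spans are already detected, and the surrogate pool, the `preset_seed` value, and both function names are hypothetical.

```python
import hashlib

# Hypothetical surrogate pool; a real tool would draw from a much larger list.
SURROGATE_NAMES = ["David Chen", "Maria Lopez", "Amir Haddad", "Sofia Rossi"]


def surrogate_for(value: str, preset_seed: str) -> str:
    """Pick a replacement name deterministically from the value and the preset seed."""
    digest = hashlib.sha256(f"{preset_seed}:{value}".encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(SURROGATE_NAMES)
    return SURROGATE_NAMES[index]


def apply_method(text: str, spans: list[tuple[int, int]], method: str, preset_seed: str) -> str:
    """Apply Replace or Redact to already-detected PERSON spans given as (start, end)."""
    out = text
    # Work right to left so earlier offsets stay valid after each substitution.
    for start, end in sorted(spans, reverse=True):
        original = out[start:end]
        if method == "replace":
            new = surrogate_for(original, preset_seed)
        elif method == "redact":
            new = "[REDACTED]"
        else:
            raise ValueError(f"unknown method: {method}")
        out = out[:start] + new + out[end:]
    return out


record = "John Smith called about the invoice."
spans = [(0, 10)]  # assumed detector output for "John Smith"

replaced_once = apply_method(record, spans, "replace", "ml-dev-v1")
replaced_again = apply_method(record, spans, "replace", "ml-dev-v1")
redacted = apply_method(record, spans, "redact", "ml-dev-v1")
```

Because the surrogate is derived from the input and the seed, a second run with the same preset gives the same text. The Redact output carries a mask token where the model should see a name.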

The Twelve Data Scientist Problem

A fintech ML team in Europe uses sets from customer logs. The DPO approved the purpose — fraud detection — with one rule: all customer names, emails, phone numbers, and payment IDs must be replaced before model work starts.

Without presets:

  • Person 1 removes names, emails, and phone numbers — but misses payment IDs
  • Person 2 includes payment IDs but uses Redact, not Replace
  • Person 3 follows the plan document exactly
  • Persons 4–12 vary

The merged set is partly non-compliant and partly over-processed. A DPO cannot certify it.
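The drift in this scenario can be made visible with a small check. This sketch compares each person's config against the approved plan. The `PAYMENT_ID` label and the `drift` helper are hypothetical, and the configs mirror the three people described above.

```python
# The DPO's rule from the scenario, written as entity type -> method.
APPROVED = {
    "PERSON": "replace",
    "EMAIL_ADDRESS": "replace",
    "PHONE_NUMBER": "replace",
    "PAYMENT_ID": "replace",  # hypothetical label for payment IDs
}


def drift(config: dict, approved: dict) -> list:
    """List every way a personal config deviates from the approved plan."""
    issues = []
    for entity, method in approved.items():
        if entity not in config:
            issues.append(f"missing {entity}")
        elif config[entity] != method:
            issues.append(f"{entity}: {config[entity]} instead of {method}")
    for entity in config:
        if entity not in approved:
            issues.append(f"unapproved extra {entity}")
    return issues


person_1 = {"PERSON": "replace", "EMAIL_ADDRESS": "replace", "PHONE_NUMBER": "replace"}
person_2 = {entity: "redact" for entity in APPROVED}  # right entities, wrong method
person_3 = dict(APPROVED)                             # follows the plan exactly

for label, config in [("Person 1", person_1), ("Person 2", person_2), ("Person 3", person_3)]:
    print(label, drift(config, APPROVED) or "matches the plan")
```

A check like this finds the problem after the fact. It does not prevent it, which is the case for a shared preset.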

With a DPO-approved preset:

  • The DPO creates "ML Dev — Fraud Detection" with exact entity types and the Replace method
  • The preset goes to all twelve people with one rule: use this for all set work
  • No one can change the preset without DPO sign-off

Every person now produces the same output. The merged set is consistent. The yearly AI audit passes with zero findings. The prior year had three findings from inconsistent set work.
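One way to enforce "no change without sign-off" is to make the preset immutable and pin its fingerprint at approval time. The sketch below assumes that approach. The `Preset` class, the `require_approved` guard, and the preset slug are hypothetical, not the product's API.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    """An immutable preset; changing it means creating and approving a new one."""
    name: str
    version: str
    rules: tuple  # pairs of (entity_type, method)

    def fingerprint(self) -> str:
        # Canonical JSON so the same rules always hash to the same value.
        payload = json.dumps(
            {"name": self.name, "version": self.version, "rules": sorted(self.rules)},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def require_approved(preset: Preset, approved_fingerprint: str) -> None:
    """Refuse to run with any preset that differs from the signed-off one."""
    if preset.fingerprint() != approved_fingerprint:
        raise PermissionError(f"{preset.name} {preset.version} is not DPO-approved")


approved = Preset(
    name="ml-dev-fraud-detection",  # hypothetical slug for the scenario's preset
    version="1.0",
    rules=(
        ("PERSON", "replace"),
        ("EMAIL_ADDRESS", "replace"),
        ("PHONE_NUMBER", "replace"),
        ("PAYMENT_ID", "replace"),
    ),
)
APPROVED_FINGERPRINT = approved.fingerprint()  # recorded at sign-off

# A local edit (Redact instead of Replace for names) yields a different fingerprint.
edited = Preset(approved.name, approved.version, (("PERSON", "redact"),) + approved.rules[1:])
```

Calling `require_approved(edited, APPROVED_FINGERPRINT)` raises `PermissionError`, so a locally changed preset cannot be used for set work until the DPO signs off on a new version.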

GDPR and the AI Act

Updated for 2026

The EU AI Act entered into force in August 2024, and its obligations phase in over the following years. It adds rules for AI systems that use personal details for model work. High-risk AI systems must document their sets, including what anonymization was applied.

GDPR Article 5(1)(b), the purpose limitation principle, bars further use of personal details in ways incompatible with the purpose they were collected for. The CNIL's 2024 cases focused on this gap: details collected for one service were used for model work with no valid basis or anonymization.

Presets help satisfy both sets of rules:

  • Preset name and config: the documented method
  • Processing logs: proof the method was applied
  • DPO approval: a recorded sign-off on the config

This creates the audit trail both laws require. For Article 10 obligations in detail, see the EU AI Act training data guide.
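As an illustration, the three pieces of evidence can sit together in one log record per processing run. The field names, the dataset id, and the approver label below are assumptions for this sketch, not a format either law prescribes.

```python
import json
from datetime import datetime, timezone


def audit_entry(preset_name, config, approved_by, dataset_id, entity_counts):
    """Build one processing-log record that ties a run to its approved preset."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "preset": preset_name,           # the documented method
        "config": config,                # exact entity types and methods used
        "approved_by": approved_by,      # the recorded sign-off
        "dataset": dataset_id,
        "entity_counts": entity_counts,  # proof the method was applied
    }


entry = audit_entry(
    preset_name="ml-dev-fraud-detection",  # hypothetical slug
    config={"PERSON": "replace", "EMAIL_ADDRESS": "replace"},
    approved_by="DPO",
    dataset_id="customer-logs-batch-01",   # hypothetical dataset id
    entity_counts={"PERSON": 1250, "EMAIL_ADDRESS": 980},  # illustrative numbers
)

# One JSON object per line is easy to append to a log and to query later.
line = json.dumps(entry, sort_keys=True)
print(line)
```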

Preset Config for NLP Model Sets

Types to include in most NLP model sets:

  • PERSON — Replace with similar names
  • EMAIL_ADDRESS — Replace with synthetic addresses
  • PHONE_NUMBER — Replace with synthetic numbers
  • CREDIT_CARD / IBAN — Replace or Redact
  • LOCATION — Replace with similar places if location matters; Redact if not
  • DATE_OF_BIRTH — Redact; age grouping is often needed

Types often left out:

  • General dates — timestamps help temporal models
  • Org names — help named-entity models
  • URLs — help link and reference models

The ML lead and DPO set these rules in the approved preset. Team members apply it. They do not make config choices.
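The lists above could be written down as data, roughly like this. The sketch makes assumptions: `DATE_TIME`, `ORGANIZATION`, and `URL` are hypothetical labels for the left-out types, Replace is picked for card and IBAN values, and LOCATION is treated as relevant to the model.

```python
NLP_PRESET = {
    "include": {
        "PERSON": "replace",
        "EMAIL_ADDRESS": "replace",
        "PHONE_NUMBER": "replace",
        "CREDIT_CARD": "replace",   # the list allows Replace or Redact; assumed Replace
        "IBAN": "replace",
        "LOCATION": "replace",      # assumes location matters for this model
        "DATE_OF_BIRTH": "redact",
    },
    # Types the preset leaves in the text on purpose (hypothetical labels).
    "exclude": ["DATE_TIME", "ORGANIZATION", "URL"],
}


def plan_actions(detected_types, preset):
    """Map detected entity types to methods; unknown types are an error, not a choice."""
    actions = {}
    for entity in detected_types:
        if entity in preset["include"]:
            actions[entity] = preset["include"][entity]
        elif entity in preset["exclude"]:
            continue  # deliberately left in the text
        else:
            raise KeyError(f"{entity} is not covered by the preset")
    return actions


print(plan_actions(["PERSON", "URL", "DATE_TIME", "IBAN"], NLP_PRESET))
```

Raising on an uncovered type sends the question back to the ML lead and DPO. It keeps the team member from deciding alone.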

Presets as Institutional Memory

Before presets. The right entity config lived in the heads of three data scientists. They had worked through the compliance review. Two left in Q3. The knowledge went with them.

After presets. The config lives in "ML Dev — Customer Records v2.1." The version log shows when it was made, who approved it, and what changed from v2.0. New team members use the preset and get all the knowledge built into it.

Version 2.1 added IBAN detection after a review found it missing. Version 2.0 was approved in February 2025. The log is complete.
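A version log entry can be derived from the presets themselves. In this sketch only the IBAN addition comes from the text above. The rest of the v2.0 entity list and the `diff_presets` helper are hypothetical.

```python
def diff_presets(old: dict, new: dict) -> dict:
    """Summarize what changed between two preset versions."""
    return {
        "added": {e: m for e, m in new.items() if e not in old},
        "removed": {e: m for e, m in old.items() if e not in new},
        "changed": {e: (old[e], new[e]) for e in old if e in new and old[e] != new[e]},
    }


# Hypothetical v2.0 contents; the article does not list them.
v2_0 = {"PERSON": "replace", "EMAIL_ADDRESS": "replace", "PHONE_NUMBER": "replace"}
# v2.1 adds IBAN detection, as described in the version log.
v2_1 = {**v2_0, "IBAN": "replace"}

print(diff_presets(v2_0, v2_1))
```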

For how processing logs and DPO review flows work, see the GDPR ML training anonymization guide.

Presets vs. the CNIL Pattern

The CNIL's 2024 AI cases set a clear pattern. They ask not just what was removed but how it was governed. A shared preset with a DPO approval record and processing logs answers this directly.

An ad-hoc config does not. The same gap exists in other EU DPA cases that follow CNIL logic. For more on the CNIL AI approach, see the CNIL GDPR AI compliance guide.

Conclusion

Docs tell team members what to do. Presets make it easy — and enforceable — to do it the same way each time.

For ML model sets, consistency is both a legal need and a technical one. The preset meets both at once.

DPAs looking at AI practices want evidence of uniform anonymization. A preset applied the same way across all set work is the clearest proof you can give them.
