EU AI Act August 2026: Anonymizing Training Data to Meet Article 10

The Countdown Has Started

Updated for 2026

The EU AI Act deadline is real. Article 10 rules apply from August 2, 2026. If your team builds or runs a high-risk AI system, act now. Time is short.

The top fines go higher than GDPR. The maximum is €35 million or 7% of global annual turnover, and it applies to prohibited practices. GDPR caps at €20 million or 4%.

Which AI Systems Are High-Risk?

The AI Act sorts systems by risk. High-risk systems (Annex III) cover AI used in:

Education — school access or student scoring
Jobs — CV screening, interview scoring, worker monitoring
Key services — credit scoring, insurance pricing, emergency dispatch
Law enforcement — crime prediction, biometric ID
Healthcare — medical device software, patient triage
Infrastructure — energy, water, or transport management
Justice — legal research tools, sentence tools

Work in any of these? Article 10 applies to you.

Article 10: Four Key Rules

Article 10 sets rules for datasets used by high-risk AI systems. Here are the four main ones.

1. Written Governance

Datasets must follow "appropriate data governance and management practices." You need written steps for collection, quality checks, and ongoing review.

2. Bias Testing

Records must be checked for "possible biases" that could cause unfair outputs. Active testing is required. Avoiding intentional bias is not enough.

3. Accuracy and Coverage

Datasets must be "relevant, sufficiently representative, and to the best extent possible, free of errors and complete in view of the intended purpose." Web crawls that miss certain groups may fail this test.
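
One way to make "sufficiently representative" testable is to compare each group's share in the dataset with its share in a reference population. The sketch below is illustrative only: the group labels, the reference shares, and the 0.5 ratio threshold are assumptions, not values taken from the Act or from any regulator.

```python
from collections import Counter

def underrepresented(labels, reference_shares, min_ratio=0.5):
    """Return groups whose dataset share is below min_ratio times the reference share."""
    counts = Counter(labels)
    total = len(labels)
    flagged = {}
    for group, ref_share in reference_shares.items():
        share = counts.get(group, 0) / total if total else 0.0
        if share < min_ratio * ref_share:
            flagged[group] = round(share, 3)
    return flagged

# Hypothetical group labels and reference shares
labels = ["a"] * 90 + ["b"] * 9 + ["c"] * 1
reference = {"a": 0.6, "b": 0.3, "c": 0.1}
print(underrepresented(labels, reference))  # {'b': 0.09, 'c': 0.01}
```

A flagged group is a prompt to collect more data or document the gap, not a legal finding.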

4. Special Record Types

Article 10(5) is the most direct rule. When a high-risk system uses special category records — health, race, religion, politics, biometrics — you may only process them when "strictly necessary" for bias checks. You must also apply "appropriate safeguards." Data scrubbing is one of the strongest safeguards you can use.

The bottom line: most AI model datasets hold personal records. Article 10 says use the minimum needed, with strong technical safeguards.

See our legal compliance page and security overview for details.

Penalty Tiers

The EU AI Act has three fine tiers. Only the top tier exceeds GDPR's maximum:

Regulation	Max Fine	Turnover Cap
GDPR	€20 million	4% global turnover
EU AI Act (prohibited practices)	€35 million	7% global turnover
EU AI Act (high-risk obligations)	€15 million	3% global turnover
EU AI Act (incorrect information to authorities)	€7.5 million	1% global turnover

Dataset breaches fall in the high-risk tier (€15M / 3%). If a regulator finds that using personal records without safeguards is a prohibited act, the top tier applies.

Worked examples: a company with €500M turnover faces up to €15M at 3%. A company with €5B turnover faces up to €150M at 3%. These are maximums, not predictions, but they show the scale.
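
The arithmetic above can be sketched in a few lines. The sketch assumes that, for a company, the cap is the higher of the fixed amount and the turnover percentage, which is how the Act's penalty article is worded for undertakings. The function and tier names are illustrative, and the results are upper bounds rather than predicted fines.

```python
# Upper bound on an EU AI Act fine for a given tier and turnover.
# Tier names are illustrative labels, not terms from the Act.
TIERS = {
    "prohibited": (35_000_000, 7),  # (fixed cap in EUR, percent of turnover)
    "high_risk": (15_000_000, 3),
}

def max_fine_eur(tier: str, global_turnover_eur: int) -> int:
    fixed_cap, percent = TIERS[tier]
    # Assumption: the higher of the two amounts applies to a company.
    return max(fixed_cap, global_turnover_eur * percent // 100)

print(max_fine_eur("high_risk", 500_000_000))    # 15000000
print(max_fine_eur("high_risk", 5_000_000_000))  # 150000000
```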

Why Data Scrubbing Solves This

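Records that are truly anonymized fall outside GDPR scope. That removes much of the personal-data burden tied to Article 10.

The rules that hinge on personal data (special category handling and data subject rights) only apply when a dataset holds personal records. Remove those records first and that burden mostly goes away. Governance, representativeness, and bias examination still apply to every high-risk dataset.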

The CNIL (French data authority) made this clear in early 2026. Its AI guidance says this: data scrubbing of personal records not needed for model performance is the primary technical measure for Article 10.

This is not a fringe view. It is the stated position of one of the EU's most influential data protection authorities.

What Data Scrubbing Means in Practice

Scrubbing AI model datasets is not the same as scrubbing live production records. Model datasets can hold:

Documents with PII — contracts, emails, reports, support tickets
Structured records — customer tables used to build predictive models
Labeled content — images or text with notes that include personal data
Synthetic records — where generation may still preserve personal patterns

You must detect PII in all of these formats. Missing one type exposes the whole dataset. A contract with names removed but full addresses still intact will teach a model to link location to demographic patterns.

The anonym.legal API handles batch processing for large AI datasets. It detects 267+ entity types across 48 languages. For European AI companies with multilingual datasets, cross-language coverage is critical. A gap in one language creates EU AI Act risk across the whole system.

For more on entity detection, see the token system guide and entity types reference.

Practical Steps: Scrubbing Your Dataset

Step 1: Audit first

Run a detection pass before you scrub anything. This tells you what PII is present:

# Build the JSON body with jq so quotes and newlines in the file are escaped
jq -Rs '{text: ., language: "en"}' document.txt |
  curl -X POST https://anonym.legal/api/presidio/analyze \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d @-

The response lists every detected entity with its type, position, and score. Run this across all your files to see the full scope before you begin.
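
Once the detection pass has run, a per-type tally shows the scope at a glance. The sketch below assumes each detected entity is a dictionary with `entity_type`, `start`, `end`, and `score` keys, which is the shape the open-source Presidio analyzer returns. The anonym.legal response may differ, so check the field names against a real response first. The sample detections and the 0.5 score cutoff are made up for illustration.

```python
from collections import Counter

def summarize_entities(entities, min_score=0.5):
    """Count detected entities per type, ignoring low-confidence hits."""
    return Counter(
        e["entity_type"] for e in entities if e["score"] >= min_score
    )

# Hypothetical detections for one document
sample = [
    {"entity_type": "PERSON", "start": 0, "end": 10, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 24, "end": 44, "score": 1.0},
    {"entity_type": "PERSON", "start": 60, "end": 71, "score": 0.4},
]
for entity_type, count in sorted(summarize_entities(sample).items()):
    print(entity_type, count)
```

Summing these counters across files gives the dataset-wide picture before any scrubbing starts.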

Step 2: Batch scrub

For large datasets, use the batch endpoint to process many files at once:

import os
from pathlib import Path

import requests

API_URL = "https://anonym.legal/api/presidio/anonymize-batch"

def scrub_batch(documents: list[dict]) -> list[dict]:
    """Send one batch of {"id", "text"} items and return the scrubbed results."""
    response = requests.post(
        API_URL,
        json={"items": documents, "language": "en"},
        headers={"Authorization": f"Bearer {os.environ['ANONYM_API_KEY']}"},
        timeout=60,
    )
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error body
    return response.json()["results"]

source_dir = Path("./dataset")
clean_dir = source_dir / "clean"
clean_dir.mkdir(exist_ok=True)  # the output folder must exist before writing

docs = [
    {"id": f.name, "text": f.read_text(encoding="utf-8")}
    for f in sorted(source_dir.glob("*.txt"))
]

batch_size = 50
for i in range(0, len(docs), batch_size):
    results = scrub_batch(docs[i:i + batch_size])
    for result in results:
        (clean_dir / result["id"]).write_text(result["text"], encoding="utf-8")
        print(f"Done: {result['id']}: {len(result['items'])} entities removed")

Step 3: Keep records

Article 10 requires written records of what you did. For each dataset, keep:

The detection model and version used
Which entity types were found and how each was replaced
Entity counts removed per dataset
The date of scrubbing and the dataset version used

This supports the "data governance and management practices" requirement in Article 10(2).
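
The items above map naturally onto one record per scrubbing run. A minimal sketch follows. The field names, the example values, and the JSON layout are assumptions for illustration, not a format prescribed by the Act or by the API.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date

@dataclass
class ScrubRecord:
    dataset_version: str
    detection_model: str  # model name and version used
    scrubbed_on: str      # ISO date of the run
    replacements: dict = field(default_factory=dict)   # entity type -> replacement method
    entity_counts: dict = field(default_factory=dict)  # entity type -> number removed

record = ScrubRecord(
    dataset_version="v3",
    detection_model="example-detector 1.0",  # placeholder value
    scrubbed_on=date(2026, 3, 1).isoformat(),
    replacements={"PERSON": "placeholder [NAME]"},
    entity_counts={"PERSON": 1204},
)
line = json.dumps(asdict(record), sort_keys=True)
print(line)  # one JSON line per run, suitable for an append-only log
```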

Common Questions

Does scrubbing break model quality?

In most cases, no. The model learns patterns from text structure, not personal details. Names, phone numbers, and addresses can be replaced with placeholders like [NAME] or [PHONE] and the model still learns the same patterns. Many research teams have found that scrubbed datasets produce models of equal quality. The key is to use consistent placeholders so the model sees a clear pattern.
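
Consistency here can mean that the same entity type always maps to the same token. The sketch below applies detected spans to a text from right to left so earlier offsets stay valid. The span format (`entity_type`, `start`, `end`) and the token table are assumptions, spans are assumed not to overlap, and the API's own anonymize endpoint may already do this for you.

```python
PLACEHOLDERS = {"PERSON": "[NAME]", "PHONE_NUMBER": "[PHONE]"}

def apply_placeholders(text, spans):
    """Replace each detected span with the fixed token for its type."""
    # Work from the end of the text so earlier offsets are not shifted.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        token = PLACEHOLDERS.get(span["entity_type"], "[REDACTED]")
        text = text[:span["start"]] + token + text[span["end"]:]
    return text

text = "Call Ana Ruiz on 555-0100."
spans = [
    {"entity_type": "PERSON", "start": 5, "end": 13},
    {"entity_type": "PHONE_NUMBER", "start": 17, "end": 25},
]
print(apply_placeholders(text, spans))  # Call [NAME] on [PHONE].
```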

What if my dataset is very large?

Use the batch API. It handles large volumes in parallel. The pricing page shows plans for high-volume use cases. Many teams process millions of records per month.

What about non-English datasets?

The API supports 48 languages. Each language uses a detection model trained on that language. This means German, French, Spanish, Japanese, and others are all covered. See the FAQ for a full language list. Mixed-language datasets are also supported — you can specify the language per document in the batch request.
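
For a mixed-language corpus, one conservative approach is to group documents by language and send one request per group, reusing the request-level `language` field from the batch example. The sketch below covers the grouping only. The `lang` key on each document is a hypothetical field you would fill from your own metadata or a language detector, and the exact per-document syntax the API accepts should be taken from its documentation.

```python
from collections import defaultdict

def group_by_language(docs, default="en"):
    """Bucket documents by their language code."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.get("lang", default)].append(
            {"id": doc["id"], "text": doc["text"]}
        )
    return dict(groups)

docs = [
    {"id": "a.txt", "text": "Hello", "lang": "en"},
    {"id": "b.txt", "text": "Guten Tag", "lang": "de"},
    {"id": "c.txt", "text": "Bonjour", "lang": "fr"},
    {"id": "d.txt", "text": "Hallo", "lang": "de"},
]
for lang, items in group_by_language(docs).items():
    # Each group would become one request body: {"items": items, "language": lang}
    print(lang, [d["id"] for d in items])
```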

Colorado AI Act: Two Deadlines

Colorado's AI Act takes effect on June 30, 2026 — five weeks before the EU deadline. It sets similar rules for "high-risk AI systems" under state law. The main focus is bias and discrimination.

Teams in both the EU and Colorado face two deadlines at once. Scrubbing your datasets helps meet both laws: Article 10 (EU) and Colorado's anti-bias rules. The technical steps are the same.

Act Now

Five months is enough time — if you start today. It is not enough if you wait until June.

A practical timeline:

Weeks 1–2: Audit your datasets — find out what personal records are present
Weeks 3–6: Build and test your scrubbing pipeline
Weeks 7–10: Write up your governance records; get legal review
Weeks 11–16: Validate — confirm scrubbed datasets meet Article 10 quality rules
August 2: Enforcement date — compliant practices in place
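
To sanity-check the plan, the latest start date that still fits all 16 weeks before the deadline can be computed directly. This is plain date arithmetic on the figures above, and the function name is illustrative.

```python
from datetime import date, timedelta

DEADLINE = date(2026, 8, 2)
PLAN_WEEKS = 16  # weeks 1 to 16 in the timeline above

def latest_start(deadline=DEADLINE, weeks=PLAN_WEEKS):
    """Last day on which the plan can begin and still finish by the deadline."""
    return deadline - timedelta(weeks=weeks)

print(latest_start())  # 2026-04-12
```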

The anonym.legal API plugs into your current pipeline without big changes. Check pricing for volume plans. The FAQ covers common Article 10 questions.

Use the GDPR compliance checklist for records that overlap between GDPR and Article 10.

The EU AI Act is ready to enforce. Will your organization be ready by August 2?


Limits and Open Questions

Data scrubbing for AI Act rules is still evolving. Here are the key gaps.

Thresholds are not defined. The EU AI Act does not say what level of scrubbing is "sufficient." Until the European AI Office issues guidance, you face legal risk. You may not know if your method will satisfy regulators.

Re-identification risk remains. Research shows large language models can memorize and replay content from their datasets. Records that passed scrubbing standards before model development may still be extractable. Scrubbing before development does not fully solve this.

Synthetic records have limits. Synthetic generation keeps statistical patterns but can add subtle biases or miss rare edge cases. Models built only on synthetic content may perform poorly on real inputs.

Article 10 is still being interpreted. The phrase "appropriate technical measures" needs interpretation. Early DPA work across EU member states has not settled on clear standards. Watch EDPB guidance and member state decisions throughout 2026.

When This Approach Has Limits

The CNIL position treats dataset scrubbing as the primary technical measure for Article 10. That is a sound strategy, because records outside GDPR scope shed most of the article's burden. The limits are still worth stating plainly.

Detection accuracy bounds the result, and missing one format exposes the whole dataset. The article makes this point and it deserves restating: a contract with names removed but full addresses intact teaches a model to link location to demographics. The engine covers 267+ entity types across 48 languages, yet a residual false-negative rate remains for unusual formats, labeled image notes, and free text. Across millions of records a small miss rate still leaves thousands of unscrubbed identifiers, and a single overlooked entity type, structured or labeled, can put an entire training corpus back inside personal-data scope.
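
The scale effect is easy to quantify. The sketch below estimates how many identifiers survive a pass for a given false-negative rate. The corpus size, the identifiers per record, and the 0.5% miss rate are made-up inputs for illustration, not measured figures for any tool.

```python
def expected_misses(records, identifiers_per_record, miss_rate):
    """Expected number of identifiers left in the corpus after scrubbing."""
    return round(records * identifiers_per_record * miss_rate)

# Hypothetical corpus: 2 million records, 3 identifiers each, 0.5% missed
print(expected_misses(2_000_000, 3, 0.005))  # 30000
```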

Scrubbed is not always anonymized, especially in combination. Replacing names with placeholders removes direct identifiers, but quasi-identifiers persist. Postcode, age band, and occupation together can re-identify a person even when names are gone, which means the dataset may be pseudonymized rather than anonymized under GDPR. That distinction carries the legal consequences this whole strategy aims to avoid, since pseudonymized data stays in scope. The article's own open-questions section notes that models can memorize and replay records that passed scrubbing, so the standard for "reasonably possible re-identification" is higher than placeholder substitution alone reaches.
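
A quick way to see whether quasi-identifiers still single people out is to count how many records share each combination, in the spirit of a k-anonymity check. The sketch below is illustrative: the column names, the sample rows, and the threshold k=2 are assumptions, and passing such a check does not by itself establish anonymization under GDPR.

```python
from collections import Counter

def risky_combinations(rows, quasi_identifiers, k=2):
    """Return quasi-identifier combinations shared by fewer than k rows."""
    combos = Counter(
        tuple(row[col] for col in quasi_identifiers) for row in rows
    )
    return [combo for combo, n in combos.items() if n < k]

# Hypothetical scrubbed records with names already removed
rows = [
    {"postcode": "75011", "age_band": "30-39", "occupation": "nurse"},
    {"postcode": "75011", "age_band": "30-39", "occupation": "nurse"},
    {"postcode": "69003", "age_band": "60-69", "occupation": "judge"},
]
print(risky_combinations(rows, ["postcode", "age_band", "occupation"]))
# [('69003', '60-69', 'judge')]
```

Any combination returned here describes a single record, which is the situation where re-identification is most plausible.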

The tool supports Article 10 compliance but does not constitute it. Running the batch endpoint and keeping detection-model records meets part of the data-governance requirement, yet the article is clear that thresholds for "sufficient" scrubbing are undefined and "appropriate technical measures" remains unsettled. A regulator judges the outcome, the bias testing, the representativeness, and the documentation as a whole. The detection pass is one human-supervised step. Someone with legal accountability still has to validate that scrubbed datasets meet the quality rules and decide whether the program clears the August 2 standard.

Sources

EU AI Act, Regulation (EU) 2024/1689, Articles 9–17 (high-risk AI obligations), OJ L 2024/1689
EU AI Act, Article 10 — Data and data governance
CNIL AI dataset guidance, January 2026
Colorado AI Act, SB 24-205, effective June 30, 2026
EU AI Act timeline: prohibited practices February 2, 2025; high-risk systems August 2, 2026
