The Countdown Has Started
Updated for 2026
The EU AI Act deadline is real. Article 10 rules apply from August 2, 2026. If your team builds or runs a high-risk AI system, act now. Time is short.
Fines can go higher than under GDPR. The maximum is €35 million or 7% of global annual turnover, whichever is higher. GDPR caps at €20 million or 4%. Few, if any, AI laws carry higher fines.
Which AI Systems Are High-Risk?
The AI Act sorts systems by risk. High-risk systems (Annex III) cover AI used in:
- Education — school access or student scoring
- Jobs — CV screening, interview scoring, worker monitoring
- Key services — credit scoring, insurance pricing, emergency dispatch
- Law enforcement — crime prediction, biometric ID
- Healthcare — medical device software, patient triage
- Infrastructure — energy, water, or transport management
- Justice — legal research tools, sentence tools
Work in any of these? Article 10 likely applies to you.
Article 10: Four Key Rules
Article 10 sets rules for datasets used by high-risk AI systems. Here are the four main ones.
1. Written Governance
Datasets must be subject to "data governance and management practices appropriate for the intended purpose" of the system. In practice, you need written steps for collection, quality checks, and ongoing review.
2. Bias Testing
Records must be checked for "possible biases" that could cause unfair outputs. Active testing is required. Avoiding intentional bias is not enough.
3. Accuracy and Coverage
Datasets must be "relevant, sufficiently representative, and to the best extent possible, free of errors and complete in view of the intended purpose." Web crawls that miss certain groups may fail this test.
4. Special Record Types
Article 10(5) is the most direct rule. When a high-risk system uses special category records — health, race, religion, politics, biometrics — you may only process them when "strictly necessary" for bias checks. You must also apply "appropriate safeguards." Data scrubbing is one of the strongest safeguards you can use.
The bottom line: most AI model datasets hold personal records. Article 10 says use the minimum needed, with strong technical safeguards.
See our legal compliance page and security overview for details.
Penalty Tiers
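The EU AI Act has three fine tiers. Only the top tier, for prohibited practices, exceeds the GDPR maximum. The lowest tier (up to €7.5 million or 1%, for supplying incorrect information to authorities) is not shown. The table compares the other two with GDPR: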
| Regulation | Max Fine | Turnover Cap |
|---|---|---|
| GDPR | €20 million | 4% global turnover |
| EU AI Act (high-risk) | €15 million | 3% global turnover |
| EU AI Act (prohibited) | €35 million | 7% global turnover |
Dataset breaches fall in the high-risk tier (€15M / 3%). If a regulator finds that using personal records without safeguards is a prohibited act, the top tier applies.
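Worked examples at the high-risk tier, where the ceiling is the higher of €15 million or 3% of turnover: €500M turnover at 3% = €15M. €5B turnover at 3% = €150M. These are statutory ceilings, not predictions, but they show the scale.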
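The arithmetic behind these ceilings can be sketched in a few lines. This is a minimal illustration that assumes the ceiling is the higher of the fixed amount and the turnover share; the function and constant names are made up for this example, and the result is an upper bound, not an estimate of what a regulator would actually impose.

```python
def fine_ceiling_eur(turnover_eur: int, fixed_cap_eur: int, turnover_pct: int) -> int:
    """Return the fine ceiling: the higher of the fixed cap and the turnover share.

    Integer arithmetic (whole euros, whole percent) avoids float rounding.
    """
    turnover_share = turnover_eur * turnover_pct // 100
    return max(fixed_cap_eur, turnover_share)


# Tier values taken from the table above: (fixed cap in EUR, percent of turnover).
HIGH_RISK_TIER = (15_000_000, 3)
PROHIBITED_TIER = (35_000_000, 7)

for turnover in (500_000_000, 5_000_000_000):
    ceiling = fine_ceiling_eur(turnover, *HIGH_RISK_TIER)
    print(f"Turnover EUR {turnover:,}: high-risk ceiling EUR {ceiling:,}")
```

For a company with turnover below €500M, the fixed €15M amount is the larger of the two figures in this sketch.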
Why Data Scrubbing Solves This
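Properly scrubbed records, meaning records that can no longer be linked to a person, fall outside GDPR scope. That removes much of the burden Article 10 places on personal records.
The rules tied to personal records, such as special category handling and data subject rights, apply only when a dataset holds such records. Remove those records first and that burden mostly goes away. Governance, bias examination, and representativeness duties still apply to whatever remains.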
The CNIL, France's data protection authority, addressed this in early 2026. Its AI guidance describes scrubbing personal records that are not needed for model performance as the primary technical measure for Article 10.
This is not a fringe view. It is the published position of a national data protection authority, not of an EU-level AI regulator, but it is a strong signal of how regulators are thinking.
What Data Scrubbing Means in Practice
Scrubbing AI model datasets is not the same as scrubbing live production records. Model datasets can hold:
- Documents with PII — contracts, emails, reports, support tickets
- Structured records — customer tables used to build predictive models
- Labeled content — images or text with notes that include personal data
- Synthetic records — where generation may still preserve personal patterns
You must detect PII in all of these formats. Missing one type can expose the whole dataset. For example, a contract with names removed but full addresses intact can still let a model link locations to individuals or demographic patterns.
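One way to avoid skipping a format is to walk every item recursively and collect each string it contains, so annotation notes and nested table fields are scanned along with the main text. The sketch below is a hedged illustration: the function name and the sample record layout are hypothetical, and it only gathers text for a later detection pass rather than detecting PII itself.

```python
from typing import Any, Iterator


def iter_text_fields(value: Any, path: str = "") -> Iterator[tuple[str, str]]:
    """Yield (location, text) for every non-empty string nested inside value."""
    if isinstance(value, str):
        if value.strip():
            yield (path or "<root>", value)
    elif isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else str(key)
            yield from iter_text_fields(child, child_path)
    elif isinstance(value, (list, tuple)):
        for index, child in enumerate(value):
            yield from iter_text_fields(child, f"{path}[{index}]")
    # Numbers, booleans, and None carry no free text and are skipped.


# Hypothetical labeled-content item: the personal data sits in an annotation note.
labeled_item = {
    "image": "scan_001.png",
    "annotations": [{"label": "signature", "note": "Signed by Jane Doe"}],
}

for location, text in iter_text_fields(labeled_item):
    print(location, "->", text)
```

Keeping the location alongside each string makes it possible to write the scrubbed value back to the same field afterwards.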
The anonym.legal API handles batch processing for large AI datasets. It detects 285+ entity types across 48 languages. For European AI companies with multilingual datasets, cross-language coverage is critical. A gap in one language creates EU AI Act risk across the whole system.
For more on entity detection, see the token system guide and entity types reference.
Practical Steps: Scrubbing Your Dataset
Step 1: Audit first
Run a detection pass before you scrub anything. This tells you what PII is present:

    # Build the JSON body with jq so quotes and newlines in the file are escaped
    jq -Rs '{text: ., language: "en"}' document.txt |
      curl -X POST https://anonym.legal/api/presidio/analyze \
        -H "Authorization: Bearer YOUR_API_KEY" \
        -H "Content-Type: application/json" \
        --data-binary @-

The response lists every detected entity with its type, position, and score. Run this across all your files to see the full scope before you begin.
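To turn per-file responses into an audit summary, you can count detections per entity type. The sketch below assumes each detected entity is a dictionary with `entity_type`, `start`, `end`, and `score` keys; those field names are an assumption for illustration, so check them against the actual response before relying on this. The `min_score` threshold is likewise a hypothetical parameter.

```python
from collections import Counter


def summarize_findings(findings: list[dict], min_score: float = 0.5) -> Counter:
    """Count detected entities per type, ignoring low-confidence detections."""
    counts: Counter = Counter()
    for finding in findings:
        if finding.get("score", 0.0) >= min_score:
            counts[finding.get("entity_type", "UNKNOWN")] += 1
    return counts


# Hand-written sample in the assumed response shape, not real API output.
sample_findings = [
    {"entity_type": "PERSON", "start": 0, "end": 8, "score": 0.85},
    {"entity_type": "PERSON", "start": 40, "end": 51, "score": 0.91},
    {"entity_type": "EMAIL_ADDRESS", "start": 60, "end": 80, "score": 1.0},
    {"entity_type": "LOCATION", "start": 90, "end": 96, "score": 0.2},
]

summary = summarize_findings(sample_findings)
for entity_type, count in summary.most_common():
    print(f"{entity_type}: {count}")
```

Summing these counters across all files gives the per-dataset totals you will want in your records later.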
Step 2: Batch scrub
For large datasets, use the batch endpoint to process many files at once:
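
    import os
    from pathlib import Path

    import requests

    API_URL = "https://anonym.legal/api/presidio/anonymize-batch"


    def scrub_batch(documents: list[dict]) -> list[dict]:
        """Send one batch of {"id", "text"} items and return the scrubbed results."""
        response = requests.post(
            API_URL,
            json={"items": documents, "language": "en"},
            headers={"Authorization": f"Bearer {os.environ['ANONYM_API_KEY']}"},
            timeout=60,
        )
        response.raise_for_status()  # fail loudly instead of parsing an error body
        return response.json()["results"]


    source_dir = Path("./dataset")
    clean_dir = source_dir / "clean"
    clean_dir.mkdir(exist_ok=True)  # write_text fails if the folder is missing

    docs = [
        {"id": f.name, "text": f.read_text(encoding="utf-8")}
        for f in sorted(source_dir.glob("*.txt"))
    ]

    # Send the documents in fixed-size batches and write each result to clean/
    batch_size = 50
    for i in range(0, len(docs), batch_size):
        for result in scrub_batch(docs[i:i + batch_size]):
            (clean_dir / result["id"]).write_text(result["text"], encoding="utf-8")
            print(f"Done: {result['id']}: {len(result['items'])} entities removed")
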
Step 3: Keep records
Article 10 requires written records of what you did. For each dataset, keep:
- The detection model and version used
- Which entity types were found and how each was replaced
- Entity counts removed per dataset
- The date of scrubbing and the dataset version used
This supports the "data governance and management practices" requirement in Article 10(2).
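The items listed above map naturally onto one small record per dataset. The sketch below shows one possible shape, serialized as JSON so it can be stored next to the dataset. The class name, field names, and sample values (including the detector name and version) are hypothetical; neither the Act nor the API prescribes this format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ScrubRecord:
    """One audit record per scrubbed dataset version."""
    dataset_name: str
    dataset_version: str
    detection_model: str
    detection_model_version: str
    scrubbed_on: str                 # ISO 8601 date, e.g. "2026-03-15"
    replacements: dict[str, str]     # entity type -> how it was replaced
    entity_counts: dict[str, int]    # entity type -> number removed


def to_json(record: ScrubRecord) -> str:
    """Serialize with sorted keys so successive records diff cleanly."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Illustrative values only.
record = ScrubRecord(
    dataset_name="support-tickets",
    dataset_version="v3",
    detection_model="example-detector",
    detection_model_version="1.0.0",
    scrubbed_on="2026-03-15",
    replacements={"PERSON": "[NAME]", "PHONE_NUMBER": "[PHONE]"},
    entity_counts={"PERSON": 1204, "PHONE_NUMBER": 318},
)

print(to_json(record))
```

Committing these records alongside the pipeline code gives reviewers a dated history of what was removed and how.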
Common Questions
Does scrubbing break model quality?
In most cases, no. For most tasks the model learns patterns from text structure, not personal details. Names, phone numbers, and addresses can be replaced with placeholders like [NAME] or [PHONE] and the model still learns the same patterns. Teams often report that scrubbed datasets produce models of comparable quality, though this depends on the task and is worth measuring on your own evaluation set. The key is to use consistent placeholders so the model sees a clear pattern.
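Consistent placeholders are easy to enforce if every replacement goes through one fixed mapping from entity type to label. The sketch below is a minimal illustration, not the API's own replacement logic: the entity type names and span format are assumptions, and it expects spans that do not overlap.

```python
# Assumed entity type names; adjust to whatever your detector returns.
PLACEHOLDERS = {"PERSON": "[NAME]", "PHONE_NUMBER": "[PHONE]"}


def apply_placeholders(text: str, spans: list[dict]) -> str:
    """Replace each detected span with one fixed placeholder per entity type."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        label = PLACEHOLDERS.get(span["entity_type"], f"[{span['entity_type']}]")
        text = text[: span["start"]] + label + text[span["end"]:]
    return text


text = "Call Jane Doe at 555-0100."
name_at = text.index("Jane Doe")
phone_at = text.index("555-0100")
spans = [
    {"entity_type": "PERSON", "start": name_at, "end": name_at + len("Jane Doe")},
    {"entity_type": "PHONE_NUMBER", "start": phone_at, "end": phone_at + len("555-0100")},
]

print(apply_placeholders(text, spans))  # prints: Call [NAME] at [PHONE].
```

Entity types missing from the mapping fall back to their own name in brackets, so an unexpected type is still replaced rather than left in the text.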
What if my dataset is very large?
Use the batch API. It handles large volumes in parallel. The pricing page shows plans for high-volume use cases. Many teams process millions of records per month.
What about non-English datasets?
The API supports 48 languages. Each language uses a detection model trained on that language. This means German, French, Spanish, Japanese, and others are all covered. See the FAQ for a full language list. Mixed-language datasets are also supported — you can specify the language per document in the batch request.
Colorado AI Act: Two Deadlines
Colorado's AI Act takes effect on June 30, 2026 — five weeks before the EU deadline. It sets similar rules for "high-risk AI systems" under state law. The main focus is bias and discrimination.
Teams in both the EU and Colorado face two deadlines at once. Scrubbing your datasets helps meet both laws: Article 10 (EU) and Colorado's anti-bias rules. The technical steps are the same.
Act Now
Five months is enough time — if you start today. It is not enough if you wait until June.
A practical timeline:
- Weeks 1–2: Audit your datasets — find out what personal records are present
- Weeks 3–6: Build and test your scrubbing pipeline
- Weeks 7–10: Write up your governance records; get legal review
- Weeks 11–16: Validate — confirm scrubbed datasets meet Article 10 quality rules
- August 2: Enforcement date — compliant practices in place
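Working backwards from the enforcement date shows the latest point at which this 16-week plan can start. The sketch below uses the phase lengths from the timeline above; the phase labels and function name are illustrative, and the durations are planning assumptions you should adjust to your own team.

```python
from datetime import date, timedelta

DEADLINE = date(2026, 8, 2)

# (phase name, duration in weeks), taken from the timeline above
PHASES = [
    ("Audit datasets", 2),
    ("Build and test scrubbing pipeline", 4),
    ("Governance records and legal review", 4),
    ("Validate scrubbed datasets", 6),
]


def schedule(deadline: date, phases: list[tuple[str, int]]) -> list[tuple[str, date, date]]:
    """Work backwards from the deadline; return (name, start, end) per phase."""
    total_weeks = sum(weeks for _, weeks in phases)
    start = deadline - timedelta(weeks=total_weeks)
    plan = []
    for name, weeks in phases:
        end = start + timedelta(weeks=weeks)
        plan.append((name, start, end))
        start = end
    return plan


for name, start, end in schedule(DEADLINE, PHASES):
    print(f"{start.isoformat()} to {end.isoformat()}: {name}")
```

With these durations the latest possible start is April 12, 2026, which leaves no slack for delays.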
The anonym.legal API plugs into your current pipeline without big changes. Check pricing for volume plans. The FAQ covers common Article 10 questions.
Use the GDPR compliance checklist for records that overlap between GDPR and Article 10.
The EU AI Act is about to be enforced. Will your organization be ready by August 2?
Limits and Open Questions
Data scrubbing for AI Act rules is still evolving. Here are the key gaps.
Thresholds are not defined. The EU AI Act does not say what level of scrubbing is "sufficient." Until the European AI Office issues guidance, you face legal risk. You may not know if your method will satisfy regulators.
Re-identification risk remains. Research shows large language models can memorize and replay content from their datasets. Records that passed scrubbing standards before model development may still be extractable. Scrubbing before development does not fully solve this.
Synthetic records have limits. Synthetic generation keeps statistical patterns but can add subtle biases or miss rare edge cases. Models built only on synthetic content may perform poorly on real inputs.
Article 10 is still being interpreted. The phrase "appropriate technical measures" needs interpretation. Early DPA work across EU member states has not settled on clear standards. Watch EDPB guidance and member state decisions throughout 2026.
Sources
- EU AI Act, Regulation (EU) 2024/1689, Articles 9–17 (high-risk AI obligations), OJ L 2024/1689
- EU AI Act, Article 10 — Data and data governance
- CNIL AI dataset guidance, January 2026
- Colorado AI Act, SB 24-205, effective June 30, 2026
- EU AI Act timeline: prohibited practices February 2, 2025; high-risk systems August 2, 2026