Multilingual PII Detection for GDPR
Updated for 2026
The Hidden GDPR Gap
GDPR has no language preference. Article 4(1) defines "personal data" without naming the language it appears in. A German Steuer-ID is as protected as a US Social Security Number. A French NIR is as regulated as a UK National Insurance number.
Most PII detection tools were built for English only.
Research from ACL 2024 found that hybrid NLP tools reach F1 scores of 0.60–0.83 for European locales, while English-only tools score near zero on non-English national ID formats. The gap is stark. A tool may catch 95% of English PII yet miss 40–60% of German, French, Polish, or Dutch PII in the same file. That leaves companies exposed.
This is a real GDPR gap. It affects nearly every global firm using English-centric redaction tools. See our GDPR guide for more.
Why PII Is Locale-Specific
PII detection has two parts.
The first is pattern-based scanning. This covers structured IDs like tax numbers and phone formats.
The second is NER-based scanning. This covers contextual entities like names and addresses.
Both parts depend on locale.
Structured IDs Differ by Country
| Country | National or tax ID | Format | Validation |
|---|---|---|---|
| Germany | Steuer-ID | 11 digits | ISO 7064 MOD 11,10 |
| France | NIR | 13 digits + 2-digit key | Modulo-97 key |
| Sweden | Personnummer | 10 digits | Luhn |
| Poland | PESEL | 11 digits | Modulo-10 |
| Netherlands | BSN | 9 digits | Elfproef |
| Spain | DNI/NIE | 8 digits + letter | Modulo-23 |
| Italy | Codice Fiscale | 16 chars | Custom checksum |
A US-centric regex for SSNs (NNN-NN-NNNN) matches none of these formats. Each needs its own pattern and its own checksum logic.
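As a rough illustration of what per-country checksum logic looks like, the sketch below validates three of the formats from the table: PESEL, BSN, and DNI. The function names are hypothetical, and the checks are deliberately simplified (for example, no date-of-birth plausibility check for PESEL and no NIE prefix handling for Spain).

```python
import re

def valid_pesel(value: str) -> bool:
    """Polish PESEL: 11 digits, weighted modulo-10 check digit."""
    if not re.fullmatch(r"\d{11}", value):
        return False
    weights = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)
    total = sum(w * int(d) for w, d in zip(weights, value))
    return (10 - total % 10) % 10 == int(value[10])

def valid_bsn(value: str) -> bool:
    """Dutch BSN: 9 digits, elfproef (weighted sum divisible by 11)."""
    if not re.fullmatch(r"\d{9}", value):
        return False
    weights = (9, 8, 7, 6, 5, 4, 3, 2, -1)
    total = sum(w * int(d) for w, d in zip(weights, value))
    return total % 11 == 0

def valid_dni(value: str) -> bool:
    """Spanish DNI: 8 digits plus a control letter picked by number mod 23."""
    match = re.fullmatch(r"(\d{8})([A-Z])", value.upper())
    if not match:
        return False
    letters = "TRWAGMYFPDXBNJZSQVHLCKE"
    return letters[int(match.group(1)) % 23] == match.group(2)
```

A US SSN string such as "123-45-6789" fails all three checks at the format stage, which is the point of the table: neither the pattern nor the checksum carries over between countries.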
NER Needs Native Models
German names differ from English ones. "Hans-Dieter Müller" is clear to a native German model. An English-trained model often misses such names.
False positives are also a problem. The Microsoft Presidio issue tracker shows German words being misclassified as PII by English-trained models. The word "Null" (German for "zero") is one example: it triggers false name hits. In production use, error rates can inflate to 3 false positives per real entity (Alvaro et al., 2024).
Regulatory Risk
EU data bodies are aware of this problem. Several national DPAs have issued guidance.
German BfDI: GDPR Article 5(1)(f) applies to all records. It covers non-English data processed by third-party tools.
French CNIL: The 2024 CNIL Annual Report raised concerns. It flagged AI tools that handle French records without French-locale PII scanning.
EU DPAs broadly: GDPR Article 25 (data protection by design and by default) requires safeguards suited to the actual records being processed. This includes non-English PII in global deployments.
The risk is clear. A firm may show 95% PII detection on English content in a GDPR audit. But if it also handles German, French, and Polish records with the same tool, gaps will appear. Auditors notice. Fines can follow. See our safeguards page for how we address this.
Three-Tier Design
Research and production use agree on a three-tier hybrid design as the best approach.
Tier 1: Native spaCy Models
spaCy provides trained models for 25 locales. These include German, French, Spanish, Portuguese, Italian, Dutch, Russian, Chinese, Japanese, Korean, and Polish. Each model is trained on native text, so it learns the syntax and entity patterns of its locale. Native training means better recall and fewer false positives.
For German: de_core_news_lg handles compound nouns and German name patterns.
For French: fr_core_news_lg handles French entities, titles, place names, and organizations.
Native models beat cross-lingual models for name scanning on high-resource locales.
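A minimal sketch of Tier 1 in practice, assuming spaCy and the two models named above are installed. The helper names `load_native_model` and `person_spans` are hypothetical; the `PER` label is the person tag used by spaCy's German and French news models.

```python
# Model names come from the text above; the mapping itself is illustrative.
MODEL_BY_LOCALE = {
    "de": "de_core_news_lg",
    "fr": "fr_core_news_lg",
}

def load_native_model(locale: str):
    """Load the native spaCy pipeline for a locale (raises KeyError if unmapped)."""
    import spacy  # imported lazily so the mapping is usable without spaCy installed
    return spacy.load(MODEL_BY_LOCALE[locale])

def person_spans(doc):
    """Return (start, end, text) for every entity the model tagged as a person."""
    return [
        (ent.start_char, ent.end_char, ent.text)
        for ent in doc.ents
        if ent.label_ == "PER"
    ]

# Example usage (requires the German model to be downloaded):
# nlp = load_native_model("de")
# print(person_spans(nlp("Hans-Dieter Müller wohnt in Köln.")))
```

Returning character offsets rather than only the text makes it straightforward to redact the matched spans in the original string afterwards.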
Tier 2: Stanza for More Locales
Stanford's Stanza library covers locales outside the spaCy tier, such as Croatian, Slovenian, and Ukrainian. This adds reach for EU speaker groups that the first tier does not serve. Stanza is free and open source. It integrates well with the rest of the stack.
Tier 3: XLM-RoBERTa for Broad Reach
For locales where spaCy and Stanza lack NER models, XLM-RoBERTa fills the gap. It was pretrained on Common Crawl text across 100 locales. It achieves 91.4% cross-lingual F1 for PII detection (HuggingFace 2024). It also handles code-switching well, which matters when one document holds text in several locales at once.
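The tier selection described above can be sketched as a simple fallback chain. This is an illustrative router only: the locale sets are taken from the examples named in this article, not from a full inventory of what each library supports, and `pick_tier` is a hypothetical name.

```python
# Locales named in the Tier 1 paragraph (ISO 639-1 language codes).
SPACY_LOCALES = {"de", "fr", "es", "pt", "it", "nl", "ru", "zh", "ja", "ko", "pl"}
# Locales named in the Tier 2 paragraph: Croatian, Slovenian, Ukrainian.
STANZA_LOCALES = {"hr", "sl", "uk"}

def pick_tier(locale: str) -> str:
    """Choose the NER tier for a locale tag such as 'de', 'de-CH', or 'fr_FR'."""
    # Keep only the language part of the tag.
    code = locale.lower().replace("_", "-").split("-")[0]
    if code in SPACY_LOCALES:
        return "spacy"
    if code in STANZA_LOCALES:
        return "stanza"
    # Anything else falls through to the cross-lingual model.
    return "xlm-roberta"

print(pick_tier("de-CH"))
print(pick_tier("uk"))
print(pick_tier("ar"))
```

Ordering the checks from most to least specialized keeps the native models in charge wherever they exist and reserves the cross-lingual model for the long tail.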
Visit our token system docs to see how API calls scale with multilingual volume.
Locale-Specific Entity Types
Models alone are not enough. GDPR alignment also requires entity types that cover country-specific IDs.
EU National IDs by country:
- DE: Steuer-ID, Sozialversicherungsnummer, Personalausweisnummer
- FR: NIR, SIREN, SIRET
- PL: PESEL, NIP, REGON
- NL: BSN
- SE: Personnummer, Samordningsnummer
- ES: DNI, NIE, NIF, CIF
- IT: Codice Fiscale, Partita IVA
Phone formats: Each EU country has unique prefix structures. +49, +33, and +48 each need their own validation logic.
Address formats: Postal codes vary widely. German PLZ uses 5 digits. French codes use 5 digits (first two digits in the 01–99 range). UK postcodes are alphanumeric. Spanish codes use 5 digits (01000–52999).
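To make the phone and postal code points concrete, here is a small sketch with per-country patterns. The patterns are simplified assumptions for illustration, not authoritative numbering plans: the German phone length bound in particular is loose, and the UK postcode regex covers only the common shapes.

```python
import re

# International-format phone patterns, applied after separators are stripped.
PHONE_PATTERNS = {
    "DE": re.compile(r"\+49\d{5,12}"),      # variable length; loose bound (assumption)
    "FR": re.compile(r"\+33[1-9]\d{8}"),    # 9 digits after the prefix
    "PL": re.compile(r"\+48\d{9}"),         # 9 digits after the prefix
}

# Postal code patterns following the ranges given in the text.
POSTAL_PATTERNS = {
    "DE": re.compile(r"\d{5}"),
    "FR": re.compile(r"(0[1-9]|[1-9]\d)\d{3}"),
    "ES": re.compile(r"(0[1-9]|[1-4]\d|5[0-2])\d{3}"),
    "UK": re.compile(r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}"),
}

def phone_country(raw: str):
    """Return the country whose pattern matches the number, or None."""
    compact = re.sub(r"[\s().\-/]", "", raw)
    for country, pattern in PHONE_PATTERNS.items():
        if pattern.fullmatch(compact):
            return country
    return None

def postal_candidates(code: str):
    """Return every country whose postal pattern the code could belong to."""
    code = code.strip().upper()
    return [c for c, p in POSTAL_PATTERNS.items() if p.fullmatch(code)]

print(phone_country("+33 1 23 45 67 89"))
print(postal_candidates("10115"))
print(postal_candidates("SW1A 1AA"))
```

Note that a bare 5-digit code is often valid in several countries at once, so the pattern alone cannot assign it. The surrounding locale has to break the tie, which is one more reason detection needs to be locale-aware.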
Real-World Case: Swiss Pharma
A Swiss firm processes employment contracts. Each contract mixes German, French, and English text. Switzerland has four official languages, but the firm's tool was set up for German only. It missed all French-section PII.
A contract for a Geneva-based employee included an AVS number (the 13-digit Swiss social insurance number, labeled in French), a Swiss bank IBAN, and a name in French format. The German-only tool missed the French-format name. It failed to find the AVS number. It only partly detected the IBAN.
The three-tier approach processes the whole document. It detects locale per text segment. It applies the right NER model for each part. It validates each national ID with the correct country logic.
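For the two structured identifiers in this case, validation does not depend on the language of the surrounding text. The sketch below, with hypothetical function names, checks an AVS number using the EAN-13 check digit and an IBAN using the standard mod-97 test. It does not verify country-specific IBAN lengths.

```python
import re

def valid_avs(value: str) -> bool:
    """Swiss AVS/AHV number: 13 digits starting with 756, EAN-13 check digit."""
    digits = re.sub(r"[.\s]", "", value)
    if not re.fullmatch(r"756\d{10}", digits):
        return False
    # Weights alternate 1, 3, 1, 3, ... over the first 12 digits.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == int(digits[12])

def valid_iban(value: str) -> bool:
    """IBAN mod-97 check: move the first 4 chars to the end, map A=10..Z=35."""
    iban = re.sub(r"\s", "", value).upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", iban):
        return False
    rearranged = iban[4:] + iban[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1

# Commonly published example values, not real personal data.
print(valid_avs("756.9217.0769.85"))
print(valid_iban("CH93 0076 2011 6238 5295 7"))
```

Because both checks strip separators first, they accept the dotted and space-grouped forms that typically appear in contracts.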
Mixed-Locale Documents
The hardest case is intra-document locale mixing. Examples:
- A German firm's English contract with German employee records (names, tax IDs)
- A French GDPR consent form with an English privacy excerpt
- A chat where the agent replies in English and the customer writes in Arabic
XLM-RoBERTa handles this natively. It needs no explicit locale flags. It processes mixed-locale text without prior segmentation. This saves time. It also avoids errors from faulty splits.
For production use, combining auto locale detection (at the sentence level) with XLM-RoBERTa inference gives robust handling of mixed-locale documents.
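One way to sketch the sentence-level detection step: split the text into sentences, tag each with a locale, and merge consecutive sentences that share one. The `detect` argument is an assumption here. It stands for any language identifier that maps a string to a locale code (the `detect` function from the `langdetect` package has that shape), and the regex sentence splitter is a crude stand-in for a proper one.

```python
import re
from itertools import groupby
from typing import Callable, List, Tuple

def segment_by_locale(text: str, detect: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Split text into runs of consecutive sentences that share a locale."""
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tagged = [(detect(s), s) for s in sentences]
    return [
        (locale, " ".join(sentence for _, sentence in group))
        for locale, group in groupby(tagged, key=lambda pair: pair[0])
    ]

# Toy detector for demonstration only: flags a sentence as German if it contains " ist ".
def toy_detect(sentence: str) -> str:
    return "de" if " ist " in sentence else "en"

sample = ("The contract is signed. Der Mitarbeiter ist Hans Müller. "
          "Seine Steuer-ID ist vorhanden. Thank you.")
for locale, segment in segment_by_locale(sample, toy_detect):
    print(locale, "|", segment)
```

Each segment can then be passed to a native model where one exists, with the cross-lingual model covering segments whose locale is unknown or unsupported.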
Practical Steps
Audit your tool's reach. Ask your redaction vendor for F1 scores for your specific locales. "Supports 20 languages" often means the tool routes text through machine translation first. That is not native scanning.
Map your records to locales. Do a records inventory that includes locale distribution. A global firm with 70% English, 20% German, and 10% French records faces different risks than one with 95% English.
Test with national ID samples. Build a test set with 10 examples of each national ID type in your operations (Steuer-ID, NIR, PESEL, BSN, and others). Verify detection rates. This is faster than a full F1 test.
Review your DPIAs. Check if locale scope is included. An incomplete DPIA assuming English-only records may need an update. Act now. Do not wait for an audit to find the gap.
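The national ID test from the steps above can be run with a few lines of code. In this sketch, `detect` stands for whatever call your redaction tool exposes. The assumption here is that it takes a text and returns the set of strings it flagged. The SSN-only detector is a toy stand-in for an English-centric tool, and the sample values are synthetic.

```python
import re
from collections import defaultdict

def detection_rates(detect, samples):
    """Return the share of expected IDs that the detector found, per ID type."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for id_type, text, expected in samples:
        totals[id_type] += 1
        if expected in detect(text):
            hits[id_type] += 1
    return {id_type: hits[id_type] / totals[id_type] for id_type in totals}

# Toy stand-in for an English-centric tool: it only knows the US SSN pattern.
def ssn_only_detector(text):
    return set(re.findall(r"\b\d{3}-\d{2}-\d{4}\b", text))

samples = [
    ("SSN", "SSN on file: 123-45-6789", "123-45-6789"),
    ("PESEL", "PESEL: 44051401359", "44051401359"),
    ("BSN", "BSN 111222333 verstrekt", "111222333"),
]
print(detection_rates(ssn_only_detector, samples))
```

Reporting the rate per ID type, rather than one overall number, shows exactly which locales the tool fails on.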
For full entity type definitions, see the entities reference and the FAQ. For plans and API call rates, visit pricing.
anonym.legal's PII detection engine uses a three-tier multilingual approach. It covers 25 high-resource locales via native spaCy models. Stanza adds extra locale reach. XLM-RoBERTa cross-lingual transformers extend scope to 48 locales. Country-specific entity types for all EU member states are included.