Updated for 2026

GDPR has no language preference. Article 4(1) defines "personal data" without naming the language it appears in. A German Steuer-ID is as protected as a US Social Security Number. A French NIR is as regulated as a UK National Insurance number.

Most PII detection tools were built for English only.

Research from ACL 2024 found that hybrid NLP tools reach F1 scores of 0.60–0.83 for European locales. English-only tools score near zero for non-English national ID formats. The gap is stark. A tool may catch 95% of English PII yet miss 40–60% of German, French, Polish, or Dutch PII in the same file. That leaves companies exposed.

This is a real GDPR gap. It affects nearly every global firm using English-centric redaction tools. See our GDPR guide for more.

Why PII Is Locale-Specific

PII detection has two parts.

The first is pattern-based scanning. This covers structured IDs like tax numbers and phone formats.

The second is NER-based scanning. This covers contextual entities like names and addresses.

Both parts depend on locale.

Structured IDs Differ by Country

Country	Identifier	Format	Validation
Germany	Steuer-ID	11 digits	ISO 7064 MOD 11,10 check digit
France	NIR	13 digits + 2-digit key	Key = 97 minus (number mod 97)
Sweden	Personnummer	10 digits	Luhn
Poland	PESEL	11 digits	Weighted modulo-10
Netherlands	BSN	9 digits	Elfproef (weighted modulo-11)
Spain	DNI/NIE	DNI: 8 digits + letter; NIE: X/Y/Z + 7 digits + letter	Modulo-23 control letter
Italy	Codice Fiscale	16 chars	Custom checksum

A regex written for US SSNs (NNN-NN-NNNN) matches none of these formats. Each identifier needs its own pattern. Each also needs its own checksum logic.
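To make the point concrete, here is a minimal sketch of four of the checks named in the table: Luhn for the Swedish personnummer, the weighted modulo-10 check for PESEL, the elfproef for BSN, and the modulo-23 control letter for the Spanish DNI. The function names are illustrative, not part of any library, and the sketch only checks shape and checksum. It does not verify embedded dates or whether a number was ever issued.

```python
import re

def luhn_ok(digits: str) -> bool:
    """Luhn check, as used for the 10-digit Swedish personnummer."""
    if not digits.isdigit():
        return False
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:          # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def pesel_ok(pesel: str) -> bool:
    """Polish PESEL: 11 digits, last digit is a weighted modulo-10 check."""
    if not re.fullmatch(r"\d{11}", pesel):
        return False
    weights = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)
    total = sum(w * int(d) for w, d in zip(weights, pesel))
    return (10 - total % 10) % 10 == int(pesel[10])

def bsn_ok(bsn: str) -> bool:
    """Dutch BSN: 9 digits, elfproef with weights 9..2 and -1 on the last digit."""
    if not re.fullmatch(r"\d{9}", bsn):
        return False
    weights = (9, 8, 7, 6, 5, 4, 3, 2, -1)
    total = sum(w * int(d) for w, d in zip(weights, bsn))
    return total % 11 == 0

def dni_ok(dni: str) -> bool:
    """Spanish DNI: 8 digits plus a control letter selected by number mod 23."""
    m = re.fullmatch(r"(\d{8})([A-Z])", dni.strip().upper())
    if not m:
        return False
    return "TRWAGMYFPDXBNJZSQVHLCKE"[int(m.group(1)) % 23] == m.group(2)
```

Pairing a regex with its checksum is what keeps precision up: an arbitrary 11-digit order number matches the PESEL pattern, but it passes the check digit only about one time in ten.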

NER Needs Native Models

German names differ from English ones. "Hans-Dieter Müller" is clear to a native German model. An English-trained model often misses such names.

False positives are also a problem. The Microsoft Presidio issue tracker shows German words being misclassified as English PII. The word "Null" (German for "zero") is one example. It triggers false name hits in English-trained models. In production use, error rates inflate to 3 false positives per real entity (Alvaro et al., 2024).

Regulatory Risk

EU data bodies are aware of this problem. Several national DPAs have issued guidance.

German BfDI: GDPR Article 5(1)(f) applies to all records. It covers non-English data processed by third-party tools.

French CNIL: The 2024 CNIL Annual Report raised concerns. It flagged AI tools that handle French records without French-locale PII scanning.

EU DPAs broadly: GDPR Article 25 (data protection by design and by default) requires safeguards suited to the actual records being processed. This includes non-English PII in global deployments.

The risk is clear. A firm may show 95% PII detection on English content in a GDPR audit. But if it also handles German, French, and Polish records with the same tool, gaps will appear. Auditors notice. Fines can follow. See our safeguards page for how we address this.

Three-Tier Design

Research and production use agree on a three-tier hybrid design as the best approach.

Tier 1: Native spaCy Models

spaCy provides trained models for 25 locales. These include German, French, Spanish, Portuguese, Italian, Dutch, Russian, Chinese, Japanese, Korean, and Polish. Each model is trained on native text, so it learns the syntax and entity patterns of its locale. This matters. Native training means better recall and fewer false positives.

For German: de_core_news_lg handles compound nouns and German name patterns. For French: fr_core_news_lg handles French entities, titles, place names, and organizations.

Native models beat cross-lingual models for name scanning on high-resource locales.

Tier 2: Stanza for More Locales

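Stanford's Stanza library extends reach to locales where spaCy has no trained pipeline or only limited coverage. Croatian, Slovenian, and Ukrainian are the examples here, though coverage in both libraries changes between releases. This adds reach for EU speaker groups that spaCy serves less well. Stanza is free and open source. It integrates well with the rest of the stack.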

Tier 3: XLM-RoBERTa for Broad Reach

For locales where spaCy and Stanza lack NER models, XLM-RoBERTa fills the gap. It is pretrained on Common Crawl text in 100 languages. It achieves a reported 91.4% cross-lingual F1 for PII detection (HuggingFace 2024). It also handles code-switching well. That matters when one document holds text in several locales at once.
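The three tiers amount to a fallback order, which can be sketched as a small routing function. This is an illustration under stated assumptions, not the production router: `SPACY_MODELS`, `STANZA_LOCALES`, `Route`, and `pick_backend` are hypothetical names, the registries hold only the examples named in this article, and the Tier 3 model string is a placeholder rather than a real checkpoint ID.

```python
from dataclasses import dataclass

# Illustrative registries only. The entries are the examples named in the
# article, not a complete list of what each library ships.
SPACY_MODELS = {
    "de": "de_core_news_lg",
    "fr": "fr_core_news_lg",
}
STANZA_LOCALES = {"hr", "sl", "uk"}

@dataclass(frozen=True)
class Route:
    tier: int
    backend: str
    model: str

def pick_backend(locale: str) -> Route:
    """Return the first tier that covers the locale's base language code."""
    code = locale.lower().split("-")[0].split("_")[0]   # "de-CH" -> "de"
    if code in SPACY_MODELS:
        return Route(1, "spacy", SPACY_MODELS[code])
    if code in STANZA_LOCALES:
        return Route(2, "stanza", code)
    # Placeholder: substitute the ID of a checkpoint fine-tuned for NER.
    return Route(3, "xlm-roberta", "<fine-tuned XLM-R NER checkpoint>")
```

Keeping the routing table as plain data makes it easy to audit which locales fall through to the cross-lingual tier, where accuracy should be measured separately.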

Visit our token system docs to see how API calls scale with multilingual volume.

Locale-Specific Entity Types

Models alone are not enough. GDPR alignment also requires entity type scope for country-specific IDs.

EU National IDs by country:

DE: Steuer-ID, Sozialversicherungsnummer, Personalausweisnummer
FR: NIR, SIREN, SIRET
PL: PESEL, NIP, REGON
NL: BSN
SE: Personnummer, Samordningsnummer
ES: DNI, NIE, NIF, CIF
IT: Codice Fiscale, Partita IVA

Phone formats: Each EU country has unique prefix structures. +49, +33, and +48 each need their own validation logic.

Address formats: Postal codes vary widely. German PLZ uses 5 digits. French codes use 5 digits, with the first two identifying the department. UK postcodes are alphanumeric. Spanish codes use 5 digits (01000–52999).
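A short sketch of per-country postal code checks shows why the country has to be known before a pattern is applied. The patterns below are assumptions drawn from the descriptions above. The UK pattern is a simplified shape check, not the full postcode specification, and `POSTAL_PATTERNS` and `postal_code_ok` are hypothetical names.

```python
import re

# Shape checks only; none of these confirm that a code is actually assigned.
POSTAL_PATTERNS = {
    "DE": re.compile(r"\d{5}"),
    "FR": re.compile(r"\d{5}"),
    # Spanish codes run from 01000 to 52999 (province prefix 01-52).
    "ES": re.compile(r"(?:0[1-9]|[1-4]\d|5[0-2])\d{3}"),
    # Simplified UK shape: outward code, optional space, inward code.
    "GB": re.compile(r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}"),
}

def postal_code_ok(country: str, value: str) -> bool:
    """Check a postal code against the pattern registered for its country."""
    pattern = POSTAL_PATTERNS.get(country.upper())
    if pattern is None:
        return False            # unknown country: no rule, so no match
    return pattern.fullmatch(value.strip().upper()) is not None
```

German, French, and Spanish codes share the same five-digit shape, so the pattern alone cannot tell them apart. The surrounding locale, or the rest of the address, has to supply the country.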

Real-World Case: Swiss Pharma

A Swiss firm processes employment contracts. Switzerland has four official languages, and each contract mixes German, French, and English text. The firm's tool was set up for German only. It missed all French-section PII.

A contract for a Geneva-based employee included an AVS number (the 13-digit Swiss social insurance number, labeled "AVS" in French and "AHV" in German), a Swiss bank IBAN, and a name in French format. The German-only tool missed the French-format name. It failed to find the AVS number. It only partly detected the IBAN.

The three-tier approach processes the whole document. It detects locale per text segment. It applies the right NER model for each part. It validates each national ID with the correct country logic.
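For the two structured identifiers in this case, "the correct country logic" could look like the sketch below. It assumes the AVS number follows the EAN-13 check-digit scheme with the Swiss prefix 756 and that the IBAN uses the standard mod-97 check. The function names are hypothetical, and the IBAN check does not enforce country-specific lengths.

```python
import re

def avs_ok(value: str) -> bool:
    """Swiss AVS/AHV number: 13 digits, prefix 756, EAN-13 check digit."""
    digits = re.sub(r"[.\s]", "", value)            # accept 756.xxxx.xxxx.xx
    if not re.fullmatch(r"756\d{10}", digits):
        return False
    # EAN-13 weights alternate 1, 3, 1, 3, ... over the first 12 digits.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == int(digits[12])

def iban_ok(value: str) -> bool:
    """Generic IBAN mod-97 check (country-specific length is not verified)."""
    iban = re.sub(r"\s", "", value).upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", iban):
        return False
    rearranged = iban[4:] + iban[:4]                # move country code + check digits to the end
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)   # A=10 ... Z=35
    return int(numeric) % 97 == 1
```

Both checks are language-independent, which is the point: once the pattern is in the rule set, the identifier is found regardless of which language the surrounding sentence is written in.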

Mixed-Locale Documents

The hardest case is intra-document locale mixing. Examples:

A German firm's English contract with German employee records (names, tax IDs)
A French GDPR consent form with an English privacy excerpt
A chat where the agent replies in English and the customer writes in Arabic

XLM-RoBERTa handles this natively. It needs no explicit locale flags. It processes mixed-locale text without prior segmentation. This saves time. It also avoids errors from faulty splits.

For production use, combining auto locale detection (at the sentence level) with XLM-RoBERTa inference gives robust handling of mixed-locale documents.
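A rough sketch of the sentence-level step follows. The stopword lists are a toy stand-in for a trained language identifier and are purely illustrative, as are the names `STOPWORDS`, `detect_locale`, and `segment_by_locale`. The idea is only to show the shape of the pipeline: split, label each sentence, then hand each labeled segment to the matching model.

```python
import re

# Toy stopword lists for illustration; a real deployment would use a trained
# language identification model instead.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "this", "with"},
    "de": {"der", "die", "das", "und", "ist", "mit", "nicht"},
    "fr": {"le", "la", "les", "et", "est", "avec", "pour"},
}

def detect_locale(sentence: str, default: str = "und") -> str:
    """Pick the locale whose stopwords occur most often; 'und' if none match."""
    words = re.findall(r"[^\W\d_]+", sentence.lower())
    scores = {loc: sum(w in sw for w in words) for loc, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

def segment_by_locale(text: str) -> list[tuple[str, str]]:
    """Split on sentence-ending punctuation and label each sentence."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [(detect_locale(s), s) for s in sentences]
```

Segments that come back as "und" (undetermined) are the natural candidates for the cross-lingual model, since no locale-specific model can be chosen for them with confidence.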

Practical Steps

Audit your tool's reach. Ask your redaction vendor for F1 scores for your specific locales. "Supports 20 languages" often means the tool routes text through machine translation first. That is not native scanning.

Map your records to locales. Do a records inventory that includes locale distribution. A global firm with 70% English, 20% German, and 10% French records faces different risks than one with 95% English.

Test with national ID samples. Build a test set with about 10 examples of each national ID type in your operations, such as Steuer-ID, NIR, PESEL, and BSN. Verify detection rates per type. This is faster than a full F1 test.
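The sample test can be as small as the sketch below, which reports the share of samples detected per ID type. `detection_rates` and `ssn_only_detector` are hypothetical names. The detector is a stand-in for whatever tool is under test and is assumed to return the set of ID types it found in a text.

```python
import re
from collections import defaultdict
from typing import Callable, Iterable

def detection_rates(
    samples: Iterable[tuple[str, str]],
    detector: Callable[[str], set[str]],
) -> dict[str, float]:
    """samples: (expected_id_type, text) pairs. Returns detected share per type."""
    found: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for id_type, text in samples:
        total[id_type] += 1
        if id_type in detector(text):
            found[id_type] += 1
    return {t: found[t] / total[t] for t in total}

def ssn_only_detector(text: str) -> set[str]:
    """Stand-in for an English-centric tool: knows the US SSN shape only."""
    return {"US_SSN"} if re.search(r"\b\d{3}-\d{2}-\d{4}\b", text) else set()

samples = [
    ("US_SSN", "SSN: 078-05-1120"),
    ("PESEL", "PESEL: 44051401359"),
    ("BSN", "BSN: 111222333"),
]
rates = detection_rates(samples, ssn_only_detector)
```

With this stand-in detector the SSN sample is found and the PESEL and BSN samples are not, which is the English-centric gap in miniature. Swap in a call to the real tool to get numbers for your own locales.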

Review your DPIAs. Check if locale scope is included. An incomplete DPIA assuming English-only records may need an update. Act now. Do not wait for an audit to find the gap.

For full entity type definitions, see the entities reference and the FAQ. For plans and API call rates, visit pricing.

anonym.legal's PII detection engine uses a three-tier multilingual approach. It covers 25 high-resource locales via native spaCy models. Stanza adds extra locale reach. XLM-RoBERTa cross-lingual transformers extend scope to 48 locales. Country-specific entity types for all EU member states are included.

When This Approach Has Limits

A multilingual, locale-aware pipeline closes the English-centric gap that trips up most tools — but coverage across 48 languages is not uniform, and it is worth being precise about where it weakens:

Accuracy varies sharply by language resource level. The same ACL 2024 research that motivates this design reports European F1 in the 0.60–0.83 range, not the high-90s seen for English. Low-resource languages, transliterated text, and dialects have thinner training data and lower recall. "Supported" does not mean "equivalent" — validate the specific languages in your records.
National-ID detection is only as current as its rules. Structured IDs rely on per-country regex and checksum logic. When a government changes a format or introduces a new identifier, detection lags until the rule is updated. A novel or recently-changed ID can pass through undetected.
Mixed-language and code-switched documents remain the hard case. Cross-lingual models handle them far better than English-only tools, but segment-level language detection can still mis-route a short or ambiguous passage, and a misrouted segment is a missed-entity risk. The harder the language mix, the more a review step earns its place.
Detection is not a DPIA. Catching German, French, and Polish PII supports GDPR Article 25, but it does not by itself prove compliance. Confirming your records inventory, locale distribution, and DPIA scope actually match what the tool processes is a controller responsibility the tooling cannot discharge for you.

Use a multilingual pipeline to remove the single biggest blind spot in English-first redaction — then measure per-language accuracy on your own data and keep human review on the languages and document types that carry the most risk.

Multilingual PII Detection for GDPR