Last updated 2026-03-03


Multilingual PII Detection for GDPR

A German Steuer-ID, French NIR, and Swedish Personnummer all require different detection logic.

March 3, 2026 · 10 minute read
Tags: multilingual, GDPR, NLP, PII detection, European compliance, spaCy, XLM-RoBERTa


Updated for 2026

The Hidden GDPR Gap

GDPR has no language preference. Article 4(1) defines "personal data" without naming the language it appears in. A German Steuer-ID is as protected as a US Social Security Number. A French NIR is as regulated as a UK National Insurance number.

Most PII detection tools were built for English only.

Research from ACL 2024 found that hybrid NLP tools reach F1 scores of 0.60–0.83 for European locales, while English-only tools score near zero on non-English national ID formats. The gap is stark. A tool may catch 95% of English PII yet miss 40–60% of the German, French, Polish, or Dutch PII in the same file. That leaves companies exposed.

This is a real GDPR gap. It affects nearly every global firm using English-centric redaction tools. See our GDPR guide for more.

Why PII Is Locale-Specific

PII detection has two parts.

The first is pattern-based scanning. This covers structured IDs like tax numbers and phone formats.

The second is NER-based scanning. This covers contextual entities like names and addresses.

Both parts depend on locale.

Structured IDs Differ by Country

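| Country | Tax ID | Format | Validation |
| --- | --- | --- | --- |
| Germany | Steuer-ID | 11 digits | ISO 7064 Mod 11,10 check digit |
| France | NIR | 13 digits + 2-digit key | Modulo-97 key (INSEE) |
| Sweden | Personnummer | 10 digits | Luhn |
| Poland | PESEL | 11 digits | Weighted modulo-10 |
| Netherlands | BSN | 9 digits | Elfproef (11-test) |
| Spain | DNI/NIE | 8 digits + letter | Modulo-23 |
| Italy | Codice Fiscale | 16 chars | Custom checksum |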

A regex written for US SSNs (NNN-NN-NNNN) will not match any of these formats. Each ID needs its own regex and its own checksum logic.
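To make the checksum point concrete, here is a minimal sketch of five of the validators listed above. The function names are hypothetical and the code is not taken from any particular product. It checks only length and check digits, so it skips further structural rules such as date plausibility in PESEL or the digit-repetition rules of the Steuer-ID.

```python
import re


def _digits(value: str) -> list:
    """Drop separators and return the remaining digits as integers."""
    return [int(ch) for ch in re.sub(r"\D", "", value)]


def valid_pesel(value: str) -> bool:
    """Poland: weighted sum of the first 10 digits, modulo 10."""
    d = _digits(value)
    if len(d) != 11:
        return False
    weights = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)
    total = sum(w * x for w, x in zip(weights, d))
    return (10 - total % 10) % 10 == d[10]


def valid_bsn(value: str) -> bool:
    """Netherlands: Elfproef, weighted sum must be divisible by 11."""
    d = _digits(value)
    if len(d) != 9 or not any(d):
        return False
    weights = (9, 8, 7, 6, 5, 4, 3, 2, -1)
    return sum(w * x for w, x in zip(weights, d)) % 11 == 0


def valid_personnummer(value: str) -> bool:
    """Sweden: Luhn over the 10-digit form (century digits are dropped)."""
    d = _digits(value)
    if len(d) == 12:
        d = d[2:]
    if len(d) != 10:
        return False
    total = 0
    for i, x in enumerate(d):
        if i % 2 == 0:  # every other digit, starting with the first, is doubled
            x *= 2
            if x > 9:
                x -= 9
        total += x
    return total % 10 == 0


def valid_nir(value: str) -> bool:
    """France: key = 97 - (first 13 digits mod 97)."""
    s = re.sub(r"[\s.]", "", value).upper()
    # Corsican departement codes 2A and 2B are mapped to 19 and 18 first.
    s = s.replace("2A", "19").replace("2B", "18")
    if not re.fullmatch(r"\d{15}", s):
        return False
    return 97 - int(s[:13]) % 97 == int(s[13:])


def valid_steuer_id(value: str) -> bool:
    """Germany: ISO 7064 Mod 11,10 check digit (checksum only)."""
    d = _digits(value)
    if len(d) != 11 or d[0] == 0:
        return False
    product = 10
    for x in d[:10]:
        total = (x + product) % 10 or 10
        product = (total * 2) % 11
    return (11 - product) % 10 == d[10]
```

With illustrative test values, `valid_pesel("44051401359")`, `valid_bsn("111222333")`, `valid_personnummer("811228-9874")`, `valid_nir("2 69 05 49 588 157 80")`, and `valid_steuer_id("86095742719")` all pass, and changing the last digit of any of them makes the check fail. Running the checksum after the regex match is what keeps arbitrary 9- or 11-digit numbers from being flagged as IDs.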

NER Needs Native Models

German names differ from English ones. "Hans-Dieter Müller" is clear to a native German model. An English-trained model often misses such names.

False positives are also a problem. The Microsoft Presidio issue tracker shows German words being misclassified as PII by English-trained models. One example is "Null" (German for "zero"), which triggers false name hits. In production use, the rate can reach 3 false positives per real entity (Alvaro et al., 2024).
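To put the false-positive figure in perspective, the small calculation below converts it into precision. It assumes, for illustration only, that every real entity is also detected. The 0.95 recall value in the usage note is likewise an assumption, not a measured result.

```python
def precision(true_positives: float, false_positives: float) -> float:
    """Share of reported hits that are real entities."""
    return true_positives / (true_positives + false_positives)


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# 3 false positives for each real entity found
p = precision(true_positives=1, false_positives=3)
```

Here `p` is 0.25, so three out of four redactions hit text that was not PII. Even with a recall of 0.95, `f1_score(p, 0.95)` stays below 0.40.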

Regulatory Risk

EU data bodies are aware of this problem. Several national DPAs have issued guidance.

German BfDI: GDPR Article 5(1)(f) applies to all records. It covers non-English data processed by third-party tools.

French CNIL: The 2024 CNIL Annual Report raised concerns. It flagged AI tools that handle French records without French-locale PII scanning.

EU DPAs broadly: GDPR Article 25 (data protection by design and by default) requires safeguards suited to the actual records being processed. This includes non-English PII in global deployments.

The risk is clear. A firm may show 95% PII detection on English content in a GDPR audit. But if it also handles German, French, and Polish records with the same tool, gaps will appear. Auditors notice. Fines can follow. See our safeguards page for how we address this.

Three-Tier Design

Research and production use agree on a three-tier hybrid design as the best approach.

Tier 1: Native spaCy Models

spaCy provides trained models for 25 locales. These include German, French, Spanish, Portuguese, Italian, Dutch, Russian, Chinese, Japanese, Korean, and Polish. Each model is trained on native text, so it learns the syntax and entity patterns of its locale. Native training means better recall and fewer false positives.

For German: de_core_news_lg handles compound nouns and German name patterns. For French: fr_core_news_lg handles French entities, titles, place names, and organizations.

Native models beat cross-lingual models for name scanning on high-resource locales.
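A minimal sketch of Tier 1 routing might look like the following. It assumes spaCy and the two models named above are installed. The helper names (`model_for`, `load_pipeline`, `find_entities`) are hypothetical, and the label filter assumes the PER, LOC, and ORG labels that the German and French news models emit.

```python
from functools import lru_cache
from typing import List, Optional, Tuple

# Locale code -> native spaCy model (only the two models named in the text).
NATIVE_MODELS = {"de": "de_core_news_lg", "fr": "fr_core_news_lg"}

# Entity labels treated as potential PII in this sketch.
PII_LABELS = {"PER", "LOC", "ORG"}


def model_for(locale: str) -> Optional[str]:
    """Return the native model name for a locale, or None if there is none."""
    return NATIVE_MODELS.get(locale.lower())


@lru_cache(maxsize=None)
def load_pipeline(model_name: str):
    """Load a spaCy pipeline once and reuse it on later calls."""
    import spacy  # imported lazily so the mapping works without spaCy installed

    return spacy.load(model_name)


def find_entities(text: str, locale: str) -> List[Tuple[str, str, int, int]]:
    """Run the native model and return (text, label, start, end) tuples."""
    name = model_for(locale)
    if name is None:
        raise LookupError(f"no native model configured for locale {locale!r}")
    doc = load_pipeline(name)(text)
    return [
        (ent.text, ent.label_, ent.start_char, ent.end_char)
        for ent in doc.ents
        if ent.label_ in PII_LABELS
    ]
```

Calling `find_entities("Hans-Dieter Müller wohnt in Köln.", "de")` would run the German model on the sentence. A locale without a native model raises `LookupError`, which is the signal to fall through to the next tier.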

Tier 2: Stanza for More Locales

Stanford's Stanza library extends reach to locales that the Tier 1 models do not cover. The examples here are Croatian, Slovenian, and Ukrainian. This adds coverage for European speaker groups beyond the core set. Stanza is free and open source, and it integrates well with the rest of the stack.

Tier 3: XLM-RoBERTa for Broad Reach

For locales where spaCy and Stanza lack NER models, XLM-RoBERTa fills the gap. It was pretrained on Common Crawl text across 100 locales. It achieves 91.4% cross-lingual F1 for PII detection (HuggingFace 2024). It also handles code-switching well, which matters when one document holds text in several locales at once.
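The fallback order across the three tiers can be sketched as a simple lookup. The locale sets below only mirror the languages named in this article, not the full coverage of any library, and `select_tier` is a hypothetical helper name.

```python
# Tier 1: locales the article lists for native spaCy models.
TIER1_SPACY = frozenset(
    {"de", "fr", "es", "pt", "it", "nl", "ru", "zh", "ja", "ko", "pl"}
)

# Tier 2: locales the article assigns to Stanza.
TIER2_STANZA = frozenset({"hr", "sl", "uk"})


def select_tier(locale: str) -> str:
    """Pick the NER backend for a locale tag such as 'de', 'de-DE' or 'fr_CH'."""
    code = locale.lower().replace("_", "-").split("-")[0]
    if code in TIER1_SPACY:
        return "spacy"
    if code in TIER2_STANZA:
        return "stanza"
    # Everything else falls through to the cross-lingual transformer.
    return "xlm-roberta"
```

For example, `select_tier("de-DE")` returns `"spacy"`, `select_tier("uk")` returns `"stanza"`, and `select_tier("lv")` returns `"xlm-roberta"`. Keeping the decision in one function makes it easy to audit which backend handled a given record.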

Visit our token system docs to see how API calls scale with multilingual volume.

Locale-Specific Entity Types

Models alone are not enough. GDPR alignment also requires entity types that cover country-specific IDs.

EU National IDs by country:

  • DE: Steuer-ID, Sozialversicherungsnummer, Personalausweisnummer
  • FR: NIR, SIREN, SIRET
  • PL: PESEL, NIP, REGON
  • NL: BSN
  • SE: Personnummer, Samordningsnummer
  • ES: DNI, NIE, NIF, CIF
  • IT: Codice Fiscale, Partita IVA

Phone formats: Each EU country has unique prefix structures. +49, +33, and +48 each need their own validation logic.

Address formats: Postal codes vary widely. German PLZ codes use 5 digits. French codes use 5 digits, with the first two in the 01–99 range. UK postcodes are alphanumeric. Spanish codes use 5 digits (01000–52999).
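As a rough illustration of the phone and postal rules above, the sketch below uses deliberately simplified patterns. The UK regex covers common postcode shapes only, the German and French patterns accept any five digits, and the helper names are hypothetical.

```python
import re
from typing import Optional

# Simplified postal code patterns per country (illustrative, not exhaustive).
POSTAL_PATTERNS = {
    "DE": re.compile(r"\d{5}"),
    "FR": re.compile(r"\d{5}"),
    "ES": re.compile(r"(?:0[1-9]|[1-4]\d|5[0-2])\d{3}"),  # 01000-52999
    "GB": re.compile(r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}"),
}

# Country calling codes mentioned in the text.
CALLING_CODES = {"+49": "DE", "+33": "FR", "+48": "PL"}


def valid_postal_code(code: str, country: str) -> bool:
    """Check a postal code against the pattern for the given country."""
    pattern = POSTAL_PATTERNS.get(country.upper())
    return bool(pattern and pattern.fullmatch(code.strip().upper()))


def country_from_phone(number: str) -> Optional[str]:
    """Map an international phone number to a country via its prefix."""
    compact = re.sub(r"[\s()./-]", "", number)
    if compact.startswith("00"):  # 0049... is the same as +49...
        compact = "+" + compact[2:]
    for prefix, country in CALLING_CODES.items():
        if compact.startswith(prefix):
            return country
    return None
```

A bare 5-digit string such as `28013` matches the German, French, and Spanish patterns alike. The pattern alone cannot say which country it belongs to, so the locale of the surrounding text has to drive the choice of validator.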

Real-World Case: Swiss Pharma

A Swiss firm processes employment contracts. Each contract mixes German, French, and English text. Switzerland has four official languages. Their tool was set up for German only. It missed all French-section PII.

A contract for a Geneva-based employee included a Swiss AVS number (13 digits, labeled in French), a Swiss bank IBAN, and a name in French format. The German-only tool missed the French-format name. It failed to find the AVS number. It only partly detected the IBAN.

The three-tier approach processes the whole document. It detects locale per text segment. It applies the right NER model for each part. It validates each national ID with the correct country logic.

Mixed-Locale Documents

The hardest case is intra-document locale mixing. Examples:

  • A German firm's English contract with German employee records (names, tax IDs)
  • A French GDPR consent form with an English privacy excerpt
  • A chat where the agent replies in English and the customer writes in Arabic

XLM-RoBERTa handles this natively. It needs no explicit locale flags. It processes mixed-locale text without prior segmentation. This saves time. It also avoids errors from faulty splits.

For production use, combining auto locale detection (at the sentence level) with XLM-RoBERTa inference gives robust handling of mixed-locale documents.
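Sentence-level locale detection can be sketched as follows. The stopword scorer is a toy stand-in for a real language identification library, the word lists are tiny and hypothetical, and the sentence splitter is naive about abbreviations.

```python
import re
from typing import List, Tuple

# Tiny stopword lists, for illustration only.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "for", "with", "this", "that"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "mit", "für", "den", "von"},
    "fr": {"le", "la", "les", "et", "est", "des", "une", "pour", "dans", "du"},
}


def split_sentences(text: str) -> List[str]:
    """Split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def guess_locale(sentence: str, default: str = "und") -> str:
    """Return the locale whose stopwords occur most often in the sentence."""
    words = re.findall(r"\w+", sentence.lower())
    scores = {loc: sum(w in sw for w in words) for loc, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def segment_by_locale(text: str) -> List[Tuple[str, str]]:
    """Group consecutive sentences that share a locale into segments."""
    segments: List[Tuple[str, str]] = []
    for sentence in split_sentences(text):
        locale = guess_locale(sentence)
        if segments and segments[-1][0] == locale:
            segments[-1] = (locale, segments[-1][1] + " " + sentence)
        else:
            segments.append((locale, sentence))
    return segments


contract = (
    "The employee is based in Berlin. "
    "Die Steuer-ID ist nicht mit dem Vertrag verbunden. "
    "Le contrat est signé pour une durée de deux ans."
)
segments = segment_by_locale(contract)
```

For this sample, `segments` holds three entries tagged `en`, `de`, and `fr` in that order. Each segment can then be passed to the model for its locale. Sentences with no recognizable stopwords get the tag `und` and would go to the cross-lingual model.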

Practical Steps

Audit your tool's reach. Ask your redaction vendor for F1 scores for your specific locales. "Supports 20 languages" often means the tool routes text through machine translation first. That is not native scanning.

Map your records to locales. Do a records inventory that includes locale distribution. A global firm with 70% English, 20% German, and 10% French records faces different risks than one with 95% English.

Test with national ID samples. Build a test set with 10 examples of the national IDs in your operations (Steuer-ID, NIR, PESEL, BSN, and others). Verify detection rates. This is faster than a full F1 test.
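A quick detection-rate check along these lines could look like the sketch below. The `detector` argument stands for whatever callable wraps your redaction tool and returns the strings it flagged. The 11-digit regex detector and the three samples are hypothetical stand-ins using synthetic test values.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, Set, Tuple

Sample = Tuple[str, str, str]  # (id_type, text, expected_id)


def detection_rates(
    detector: Callable[[str], Set[str]], samples: Iterable[Sample]
) -> Dict[str, float]:
    """Return the share of samples per ID type whose expected ID was found."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for id_type, text, expected in samples:
        totals[id_type] += 1
        if expected in detector(text):
            hits[id_type] += 1
    return {id_type: hits[id_type] / totals[id_type] for id_type in totals}


def eleven_digit_detector(text: str) -> Set[str]:
    """Toy detector that only knows 11-digit numbers."""
    return set(re.findall(r"\b\d{11}\b", text))


SAMPLES = [
    ("PESEL", "PESEL pracownika: 44051401359", "44051401359"),
    ("Steuer-ID", "Steuer-ID: 86095742719", "86095742719"),
    ("BSN", "BSN van de werknemer: 111222333", "111222333"),
]

rates = detection_rates(eleven_digit_detector, SAMPLES)
```

The toy detector finds the PESEL and the Steuer-ID but misses the 9-digit BSN, so `rates["BSN"]` is 0.0. A per-type breakdown like this shows which locales a tool does not cover, where a single overall score would hide the gap.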

Review your DPIAs. Check if locale scope is included. An incomplete DPIA assuming English-only records may need an update. Act now. Do not wait for an audit to find the gap.

For full entity type definitions, see the entities reference and the FAQ. For plans and API call rates, visit pricing.


anonym.legal's PII detection engine uses a three-tier multilingual approach. It covers 25 high-resource locales via native spaCy models. Stanza adds extra locale reach. XLM-RoBERTa cross-lingual transformers extend scope to 48 locales. Country-specific entity types for all EU member states are included.


