Japan My Number: APPI and the Check Digit
Japan's Personal Information Protection Commission (PPC) issued 45 enforcement decisions in 2024. It also published Japan's first AI privacy guidance. A PPC study found that 63% of generic NLP tools fail to detect My Number (マイナンバー) in Japanese files. If your team handles data of Japanese residents, that gap means direct APPI risk.
What My Number Is
Japan gives every resident a unique 12-digit identifier. This is My Number, part of the Individual Number System (マイナンバー制度). It covers tax, social security (including pension and health insurance), and disaster response. It is personal information under APPI, and the My Number Act further restricts collection, use, and sharing to purposes set out in law. You need a legal reason to collect or share it.
The Check Digit Problem
The last of My Number's 12 digits is a check digit. It is not a Verhoeff digit (Verhoeff is the scheme India's Aadhaar uses). Under the ministerial ordinance that defines the Individual Number, the check digit comes from a weighted modulo-11 sum of the first 11 digits. Read left to right, the weights are 6, 5, 4, 3, 2, 7, 6, 5, 4, 3, 2. Take the weighted sum modulo 11. If the remainder is 0 or 1, the check digit is 0. Otherwise it is 11 minus the remainder.
This matters for two reasons. First, a bare 12-digit string looks like many other codes. Invoice references, document IDs, and timestamps such as YYYYMMDDhhmm can all have the same shape. Without the check digit test, a tool will flag the wrong values. Second, most tools do not implement this scheme. They ship Luhn, Verhoeff, a plain modulo-10 check, or a modulo-11 check with different weights. Those do not work here.
The PPC study found that 63% of tools either skip the check or apply one that does not match. Both problems then occur at once: false positives from unvalidated 12-digit strings, and false negatives when real numbers fail the wrong check.
The Luhn algorithm, used for credit cards, is a different scheme. My Number does not use Luhn. Tools built for Luhn will not work.
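For reference, the check digit that the ministerial ordinance defines for My Number is a weighted modulo-11 sum, not Luhn and not Verhoeff. The sketch below shows one way to test a candidate string. The function names are illustrative, and the separator handling (spaces, hyphens, full-width digits) is an assumption about how numbers appear in real files. The sample value 123456789018 is a dummy number that happens to satisfy the rule.

```python
import re
import unicodedata

# Weights for the first eleven digits, read left to right.
WEIGHTS = (6, 5, 4, 3, 2, 7, 6, 5, 4, 3, 2)


def my_number_check_digit(first_eleven: str) -> int:
    """Return the expected twelfth digit for the given eleven digits."""
    if not re.fullmatch(r"[0-9]{11}", first_eleven):
        raise ValueError("expected exactly 11 ASCII digits")
    remainder = sum(int(d) * w for d, w in zip(first_eleven, WEIGHTS)) % 11
    # Remainders 0 and 1 map to check digit 0; otherwise 11 - remainder.
    return 0 if remainder <= 1 else 11 - remainder


def is_valid_my_number(text: str) -> bool:
    """Check a candidate after folding full-width digits and dropping separators."""
    # NFKC turns full-width digits and hyphens into their ASCII forms.
    digits = re.sub(r"[\s-]", "", unicodedata.normalize("NFKC", text))
    if not re.fullmatch(r"[0-9]{12}", digits):
        return False
    return my_number_check_digit(digits[:11]) == int(digits[11])


if __name__ == "__main__":
    print(is_valid_my_number("1234 5678 9018"))  # dummy value, passes the check
    print(is_valid_my_number("213456789018"))    # first two digits swapped
```

A passing check digit only says the string is well formed. Roughly one in ten random 12-digit strings will pass, so surrounding context (for example the label マイナンバー or 個人番号 nearby) is still needed before treating a hit as a real My Number.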
Three Scripts, One Name
Japanese text uses three writing systems at once. A tool must handle all three.
Hiragana (ひらがな): Used for grammar and native words. 46 base characters.
Katakana (カタカナ): Used for foreign words and names. 46 base characters. Foreign names in Japan appear in this script.
Kanji (漢字): Symbols for nouns and names. About 2,000 are in common use.
One person's name can appear in four forms: Kanji (田中太郎), Hiragana (たなかたろう), Katakana (タナカ タロウ), and Romaji (Tanaka Taro). A tool must match all four. If it misses one form, it misses every record that stores the name in that form.
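A minimal sketch of cross-script matching follows. Hiragana and Katakana can be folded together with the standard library alone, because the Katakana block sits a fixed 0x60 code points above Hiragana. Kanji and Romaji cannot be converted this way: they need a reading dictionary or a transliteration library. The `READINGS` table below is a hypothetical stand-in for that lookup, and the function names are illustrative.

```python
import unicodedata

# Katakana U+30A1..U+30F6 map to Hiragana by subtracting 0x60.
_KATA_TO_HIRA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}

# Hypothetical reading table; a real system would back this with a
# name-reading dictionary or a transliteration library.
READINGS = {
    "田中太郎": "たなかたろう",
    "tanakataro": "たなかたろう",
}


def kana_key(name: str) -> str:
    """Fold width variants, Katakana, case, and spacing into one key."""
    # NFKC converts half-width Katakana and full-width Latin to standard forms.
    folded = unicodedata.normalize("NFKC", name).casefold()
    folded = folded.translate(_KATA_TO_HIRA)
    return "".join(folded.split())  # drop ASCII and ideographic spaces


def match_key(name: str, readings: dict = READINGS) -> str:
    """Return a Hiragana key, using the reading table for Kanji and Romaji."""
    key = kana_key(name)
    return readings.get(key, key)


if __name__ == "__main__":
    forms = ["田中太郎", "たなかたろう", "タナカ タロウ", "Tanaka Taro", "ﾀﾅｶ ﾀﾛｳ"]
    print({match_key(f) for f in forms})  # one shared key if all forms match
```

Keying on the Hiragana reading is a design choice, not a requirement. Different Kanji names can share one reading, so a production matcher would normally combine this key with another attribute such as date of birth.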
Other Japanese IDs to Detect
Driver's license (運転免許証番号): 12 digits. The first two digits identify the prefectural public safety commission that first issued the license. Tokyo is 30. Osaka is 62. This lets a tool reject values whose prefix is not an assigned code.
Passport (旅券番号): Two letters plus seven digits. ICAO format. Japan uses specific letter pairs.
Health insurance card (健康保険証記号番号): A symbol plus a number. The format depends on the insurer. National Health Insurance (国民健康保険) and the Japan Health Insurance Association (協会けんぽ) use different formats.
Residence card (在留カード番号): For foreign residents. Two letters, eight digits, two letters. The Ministry of Justice issues this card.
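The fixed-format IDs above can be sketched as simple patterns. This is an illustration under stated assumptions, not a complete detector: the prefix table is deliberately partial (it lists only Tokyo as 30 and Osaka as 62), the patterns assume uppercase ASCII input, and the names are hypothetical. Health insurance numbers are left out because their format varies by insurer.

```python
import re

# Partial table of driver's license prefixes (public safety commission codes).
# Only two entries are shown; a real deployment needs the complete list.
LICENSE_PREFIXES = {"30": "Tokyo", "62": "Osaka"}

LICENSE_RE = re.compile(r"[0-9]{12}")                   # 12 digits
PASSPORT_RE = re.compile(r"[A-Z]{2}[0-9]{7}")           # 2 letters + 7 digits
RESIDENCE_CARD_RE = re.compile(r"[A-Z]{2}[0-9]{8}[A-Z]{2}")  # 2 + 8 + 2


def license_region(value: str):
    """Return the region for a license number, or None if the prefix is unknown."""
    if not LICENSE_RE.fullmatch(value):
        return None
    return LICENSE_PREFIXES.get(value[:2])


def looks_like_passport(value: str) -> bool:
    return PASSPORT_RE.fullmatch(value) is not None


def looks_like_residence_card(value: str) -> bool:
    return RESIDENCE_CARD_RE.fullmatch(value) is not None


if __name__ == "__main__":
    print(license_region("301234567890"))
    print(looks_like_passport("TK1234567"))
    print(looks_like_residence_card("AB12345678CD"))
```

A 12-digit driver's license number has the same shape as a My Number, so a detector has to run both the prefix test and the My Number check digit test and then use context to decide between them.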
APPI's Anonymization Rule
APPI defines a strict standard for anonymized data, called anonymously processed information (匿名加工情報). It goes further than GDPR in one key area. Anonymization must be third-party verifiable and technically irreversible.
To comply, an organization must:
- Remove all direct identifiers, including My Number.
- Handle all quasi-identifier combinations.
- Use k-anonymity or a similar method.
- Publish a general description of the steps taken.
- Never try to re-identify the data.
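The k-anonymity step in the list above can be checked mechanically. The sketch below groups records by their quasi-identifier values and reports the smallest group. The column names and the sample rows are made up for illustration, and the choice of k is a policy decision that the code does not make for you.

```python
from collections import Counter


def smallest_group(rows, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values()) if counts else 0


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination appears at least k times."""
    return smallest_group(rows, quasi_identifiers) >= k


if __name__ == "__main__":
    # Illustrative records with generalized age bands and prefectures.
    rows = [
        {"age_band": "30-39", "prefecture": "Tokyo", "diagnosis": "A"},
        {"age_band": "30-39", "prefecture": "Tokyo", "diagnosis": "B"},
        {"age_band": "40-49", "prefecture": "Osaka", "diagnosis": "A"},
    ]
    print(smallest_group(rows, ["age_band", "prefecture"]))
    print(is_k_anonymous(rows, ["age_band", "prefecture"], k=2))
```

In this sample the single Osaka record forms a group of one, so the data is not 2-anonymous until that row is generalized further or suppressed.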
The PPC's 2024 AI guidance adds a specific rule. If you train an AI on anonymized data, you cannot use that model to re-identify people. This is a direct ban on model inversion attacks against APPI training sets.
To meet PPC standards, you need four things. First, check digit validation for My Number detection. Second, Japanese NER using a spaCy ja_core_news model (for example ja_core_news_sm) with proper tokenization. Third, name matching across Kanji, Kana, and Romaji. Fourth, prefecture code checks for driver's licenses.
India's Aadhaar, another 12-digit identifier, requires Verhoeff validation. The India DPDPA technical compliance guide covers that in detail. For multi-country identifier detection, see EU national tax ID detection under GDPR.