Last updated 2026-05-25


HIPAA Safe Harbor De-ID at Scale

HIPAA Safe Harbor requires removing 18 specific PHI identifier categories. Academic medical centers need de-identification at scale, but existing tools are priced for hospital systems, not research grants.

May 25, 2026 · 9 minute read
Tags: HIPAA Safe Harbor, de-identification, healthcare research, PHI removal, academic medical center

HIPAA Safe Harbor De-Identification at Scale: A Guide for Healthcare Researchers

An academic medical center needs to scrub 200,000 discharge records. The goal: build a readmission prediction model. The existing tool costs $120,000 per year. The grant budget for data work: $5,000.

This gap is common. Healthcare research needs large datasets. Those datasets hold protected health information (PHI). PHI includes names, dates, addresses, and other personal details. Removing PHI lets researchers use the data legally. But the tools are priced for hospital systems, not research grants.

HIPAA Safe Harbor: The 18 Identifiers

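HIPAA's Safe Harbor method (45 CFR §164.514(b)(2)) lists 18 identifier types. All must be removed, and the covered entity must have no actual knowledge that the remaining data could identify a patient. Only then does health data lose its "protected" status. After that, research can proceed without individual patient authorization.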

Here are all 18 types:

  1. Names
  2. Geographic data smaller than state (the first 3 digits of a zip code may stay if that 3-digit area holds more than 20,000 people; otherwise they become 000)
  3. All dates except year (admission, discharge, birth, death, and other dates), plus all ages over 89
  4. Phone numbers
  5. Fax numbers
  6. Email addresses
  7. Social security numbers
  8. Medical record numbers
  9. Health plan beneficiary numbers
  10. Account numbers
  11. Certificate and license numbers
  12. Vehicle identifiers and serial numbers
  13. Device identifiers and serial numbers
  14. Web URLs
  15. IP addresses
  16. Biometric identifiers (fingerprints, voice prints)
  17. Full-face photos and similar images
  18. Any other unique identifying number, characteristic, or code

The first five appear in nearly every discharge record. All must be removed or changed.
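The zip code rule in item 2 can be sketched in a few lines. This is a minimal illustration, not a vetted implementation. The function name and `RESTRICTED_PREFIXES` are assumptions made for this example, and the two prefixes shown are placeholders: the real set of 3-digit areas with 20,000 or fewer people has to be derived from current Census Bureau data.

```python
import re

# Placeholder values for illustration only. Derive the real set from
# current Census data (3-digit zip areas with 20,000 or fewer people).
RESTRICTED_PREFIXES = {"036", "059"}


def generalize_zip(zip_code: str, restricted=RESTRICTED_PREFIXES) -> str:
    """Reduce a US zip or zip+4 code to its Safe Harbor form."""
    digits = re.sub(r"\D", "", zip_code)  # drop hyphens and spaces
    if len(digits) < 5:
        return "000"  # malformed input: suppress rather than guess
    prefix = digits[:3]
    # Sparsely populated 3-digit areas are replaced with 000.
    return "000" if prefix in restricted else prefix
```

Malformed input is suppressed to "000" on purpose. Dropping too much is the safer failure mode here.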

Dates need special care. Every patient date may keep the year but must lose the day and month. "March 15, 2023" becomes "2023." You can keep a duration, such as length of stay, as a derived field. Compute it first, then remove the source dates.
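For dates inside free text, a sketch of the year-only reduction might look like this. It covers only three common formats and is meant as an illustration. The pattern list and function name are assumptions for this example, and real clinical notes contain many more date styles than these.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|August|"
    "September|October|November|December"
)

# Each pattern captures the four-digit year in a group named "year".
DATE_PATTERNS = [
    # March 15, 2023
    re.compile(rf"\b(?:{MONTHS})\s+\d{{1,2}},\s*(?P<year>\d{{4}})\b"),
    # 03/15/2023
    re.compile(r"\b\d{1,2}/\d{1,2}/(?P<year>\d{4})\b"),
    # 2023-03-15
    re.compile(r"\b(?P<year>\d{4})-\d{2}-\d{2}\b"),
]


def reduce_dates_to_year(text: str) -> str:
    """Replace each recognized date with its year only."""
    for pattern in DATE_PATTERNS:
        text = pattern.sub(lambda m: m.group("year"), text)
    return text
```

Abbreviated months, two-digit years, and relative phrases such as "three days ago" are not handled and would still need review.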

The Scale Problem

Useful healthcare datasets are large:

  • Readmission prediction: 50,000–500,000 encounters
  • Treatment outcome work: 10,000–100,000 patients per condition
  • Drug efficacy: 5,000–50,000 records
  • Population health: 100,000+ encounters

Manual review at this scale does not work. At 5 minutes per record, 100,000 records take about 8,300 hours, or roughly 1,040 eight-hour working days. Human error rates run 1–5%. Even a small miss rate creates HIPAA risk. Two reviewers treating dates differently can break Safe Harbor status. That is an easy mistake to make on a large dataset.
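The arithmetic behind the manual review estimate is easy to reproduce. The sketch below assumes a single reviewer and an eight-hour working day. Both function names are made up for this example.

```python
def manual_review_days(records: int, minutes_per_record: float = 5.0,
                       hours_per_day: float = 8.0) -> float:
    """Working days needed for one reviewer to read every record."""
    return records * minutes_per_record / 60.0 / hours_per_day


def expected_misses(records: int, error_rate: float) -> float:
    """Expected number of records with an overlooked identifier."""
    return records * error_rate


days = manual_review_days(100_000)       # about 1,042 days
low = expected_misses(100_000, 0.01)     # about 1,000 records
high = expected_misses(100_000, 0.05)    # about 5,000 records
```

Under those assumptions, 100,000 records at 5 minutes each come to about 1,042 working days, and a 1–5% error rate leaves roughly 1,000 to 5,000 records with a missed identifier.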

Automated scrubbing is the only real option. It must catch all 18 types across the varied formats found in clinical notes.

The Tool Pricing Gap

Enterprise tools target hospital systems:

  • Datavant: $100,000+/year
  • Veradigm (Allscripts): similar prices
  • Clinithink CLiX: contact sales only
  • Syntegra (synthetic data): enterprise pricing

These vendors sell to large organizations with legal and compliance teams. Research grants are not their market.

Free and open-source tools exist but take expertise:

  • MITRE MIST: free, but needs heavy setup and has limited language support
  • Stanford NLP DEID: research-grade, needs Java and coding skills
  • i2b2 NLP tools: clinical NLP, setup required

Most researchers need reliable PHI removal with simple setup. Open-source tools need coding and linguistics skills to run. They also need validation work. Enterprise tools cost more than most grants allow. The gap is real and it blocks research.

Five-Step Batch Process

For 200,000 discharge records, a sequential batch approach works well.

Step 1: Export from the EHR. Pull structured and unstructured fields as text or PDF files per encounter. Epic, Cerner, and Meditech all support this. They export CSV or HL7 files with clinical note fields included.

Step 2: Run batches of 5,000. Batches this size are fast and small enough for review at each stage.
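Splitting the export into fixed-size batches takes only a small helper. This is a generic sketch, not part of any specific tool. The name `batched` is chosen for this example.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(records: Iterable[T], size: int = 5000) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

With 200,000 records and the default size, this yields 40 batches. Because it reads from an iterator, the full export never has to sit in memory at once.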

Set entity types for Safe Harbor:

  • PERSON (patient names, family members in notes)
  • US_SSN
  • US_MEDICAL_RECORD_NUMBER
  • PHONE_NUMBER
  • EMAIL_ADDRESS
  • URL
  • IP_ADDRESS
  • LOCATION (addresses, zip codes, cities — anything below state level)
  • DATE (all clinical dates; ages over 89 become "> 89")
  • HEALTHCARE_ID (insurance numbers, beneficiary numbers)
  • ACCOUNT_NUMBER
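It helps to check an entity list against the 18 Safe Harbor categories before running anything. The mapping below is an assumption made for illustration. It is not a documented property of any tool, and the guess that PHONE_NUMBER also catches fax numbers should be verified on real data.

```python
# Assumed mapping from the entity types listed above to Safe Harbor
# identifier numbers (1-18). Illustrative only.
ENTITY_TO_CATEGORY = {
    "PERSON": {1},
    "LOCATION": {2},
    "DATE": {3},
    "PHONE_NUMBER": {4, 5},  # assumes fax numbers match the phone pattern
    "EMAIL_ADDRESS": {6},
    "US_SSN": {7},
    "US_MEDICAL_RECORD_NUMBER": {8},
    "HEALTHCARE_ID": {9},
    "ACCOUNT_NUMBER": {10},
    "URL": {14},
    "IP_ADDRESS": {15},
}


def uncovered_categories(enabled_entities):
    """Return the Safe Harbor category numbers no enabled entity covers."""
    covered = set()
    for entity in enabled_entities:
        covered |= ENTITY_TO_CATEGORY.get(entity, set())
    return sorted(set(range(1, 19)) - covered)


gaps = uncovered_categories(ENTITY_TO_CATEGORY)  # [11, 12, 13, 16, 17, 18]
```

Under this assumed mapping, categories 11, 12, 13, 16, 17, and 18 have no dedicated entity type. Those would need custom patterns or extra attention during sample review.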

For more on batch PHI scrubbing for clinical notes, see batch processing clinical notes with local HIPAA tools. That guide covers file formats and entity tuning in depth.

Step 3: Handle dates as a separate step. Keep the year. Remove the month and day. Replace any age over 89 with "> 89." Rare age-disease pairs can re-identify patients. Compute duration fields first — length of stay, days to readmission. Then delete the source dates.
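For structured fields, the order of operations in Step 3 can be sketched as follows. The field names (`admit_date`, `discharge_date`, `readmit_date`, `birth_date`) are hypothetical and will differ per EHR export. The function returns a new record that holds only derived values, so the source dates never reach the output.

```python
from datetime import date


def deidentify_encounter(rec: dict) -> dict:
    """Derive durations and a capped age, then drop the source dates.

    Expects datetime.date values under hypothetical keys; readmit_date
    may be missing or None.
    """
    admit = rec["admit_date"]
    discharge = rec["discharge_date"]
    readmit = rec.get("readmit_date")
    birth = rec["birth_date"]

    # Age in whole years at admission; subtract one if the birthday
    # has not yet occurred in the admission year.
    age = admit.year - birth.year - (
        (admit.month, admit.day) < (birth.month, birth.day)
    )

    return {
        "admit_year": admit.year,
        "length_of_stay_days": (discharge - admit).days,
        "days_to_readmission": (readmit - discharge).days if readmit else None,
        "age": "> 89" if age > 89 else age,
    }
```

Birth year is deliberately left out of the result, since for patients over 89 it would reveal the age that the cap is meant to hide.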

Step 4: Sample and review each batch. After each 5,000-record batch, pull 50 records for human review. Check all 18 types. Look for context items like researcher names in notes or referring physician details. Confirm date handling matches Safe Harbor rules. Fix any gaps before moving on.
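Drawing the review sample should be random and repeatable. A minimal sketch, assuming each batch is a list of records and that a per-batch seed is an acceptable way to make the draw reproducible for later audit:

```python
import random


def sample_for_review(batch, k=50, seed=None):
    """Pick up to k records from a batch without replacement.

    Passing the same seed (for example the batch number) reproduces
    the same sample later.
    """
    rng = random.Random(seed)
    items = list(batch)
    return rng.sample(items, min(k, len(items)))
```

Using a dedicated `random.Random` instance keeps the draw independent of any other random state in the pipeline.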

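Step 5: Document and certify. Safe Harbor does not require a statistical expert. That requirement belongs to the Expert Determination method. Under Safe Harbor, the covered entity must remove all 18 identifier types and have no actual knowledge that the remaining data could identify a patient. The team doing the removal makes that call. Write up your entity config and sampling results. Keep them for IRB records.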
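One way to keep that write-up consistent is a small record per batch. The structure below is a sketch with made-up field names, not a format that HIPAA or any IRB prescribes. The hash makes it easy to show that every batch ran with the same entity config.

```python
import hashlib
import json


def build_batch_record(batch_id, entity_types, sample_size, missed_items):
    """Summarize one batch: config used, sample size, and review findings."""
    entities = sorted(entity_types)
    config_json = json.dumps(entities)
    return {
        "batch_id": batch_id,
        "entity_types": entities,
        # Fingerprint of the entity config, for comparing batches.
        "config_sha256": hashlib.sha256(config_json.encode("utf-8")).hexdigest(),
        "sample_size": sample_size,
        "missed_items": list(missed_items),
        "passed": len(missed_items) == 0,
    }


record = build_batch_record(1, ["US_SSN", "PERSON"], 50, [])
line = json.dumps(record)  # one JSON line per batch, ready to append to a log
```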

Need an audit trail for each removal? Explainable redaction with HIPAA audit trail covers logging in detail.

Cost Comparison

Enterprise tool: $120,000/year. Covers setup, training, unlimited processing, and compliance support.

Batch processing:

  • 200,000 records × 300 words average = 60,000,000 words, counted here as one token per word
  • At €0.0001/token: €6,000 in processing
  • Pro plan (€180/year) or Business plan (€348/year) for the project
  • Researcher review time: 20–40 hours
  • Total: roughly €7,000–8,000

Savings versus the enterprise tool: roughly 112,000–113,000 if euros and dollars are treated as roughly equal. Research that stalled at $120,000 becomes feasible at about €7,000–8,000.
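The processing estimate can be reproduced with a short calculation. It assumes one token per word, which is a simplification, and uses the per-token price and plan fees quoted above. The function name is chosen for this example.

```python
def batch_cost_eur(records=200_000, words_per_record=300,
                   price_per_token=0.0001, plan_fee=348.0):
    """Processing cost plus plan fee, in euros."""
    tokens = records * words_per_record  # assumes one token per word
    return tokens * price_per_token + plan_fee


business = batch_cost_eur()              # about 6,348
pro = batch_cost_eur(plan_fee=180.0)     # about 6,180
```

Processing plus plan comes to about €6,180–6,348. The rest of the €7,000–8,000 total is presumably reviewer time, which depends on local hourly rates.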

Key Limits

Text only. This approach handles text-based PHI. Images, audio, and biometric data (Safe Harbor categories 16 and 17) need other tools.

Validation is required. Automated tools miss some items. A 0.1% miss rate on 200,000 records leaves 200 records with live PHI. That is a real HIPAA risk. Do not skip validation.
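Sampling also puts a number on how confident you can be. The sketch below uses the standard one-sided binomial bound for the case where every sampled record was clean. It assumes the samples were drawn at random and that reviewers catch every miss in the records they read. Both function names are made up for this example.

```python
def expected_leaks(records: int, miss_rate: float) -> float:
    """Expected number of records that still contain PHI."""
    return records * miss_rate


def miss_rate_upper_bound(clean_samples: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-record miss rate when all of
    `clean_samples` randomly drawn records were found clean.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    if clean_samples < 1:
        raise ValueError("need at least one reviewed record")
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_samples)


# 40 batches x 50 reviewed records = 2,000 clean samples
bound = miss_rate_upper_bound(2000)   # about 0.0015, i.e. 0.15%
```

So 2,000 clean samples support a miss rate below roughly 0.15% at 95% confidence. On 200,000 records that still allows up to about 300 affected records, which is why any miss found in review should trigger a fix and a re-run.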

Check with your privacy office. IRB approval for the study does not cover the scrubbing method. Most centers review PHI removal approaches separately. This guide adds to that review — it does not replace it.

Expert Determination is an option. HIPAA also allows scrubbing via "Expert Determination" (45 CFR §164.514(b)(1)). A statistics expert certifies the re-identification risk is very small. This path fits unusual datasets. It works well when removing all dates would break time-series analysis.

For a side-by-side look at automated PHI tools, see PHI detection accuracy comparison.

Conclusion

Healthcare research that could help patients is stuck behind PHI removal costs. Manual review does not scale. Enterprise tools cost more than most grants allow. Datasets stay locked or improperly scrubbed.

Token-based batch processing makes large-scale research feasible. With sampling and validation in place, academic centers and independent researchers can aim for the same accuracy as large hospital systems on a standard grant budget.
