Last updated 2026-05-29


Presidio: 3-Week Setup vs Managed PII

Microsoft Presidio has thousands of GitHub stars and hundreds of open issues. This post covers its setup complexity, PySpark integration overhead, and Python dependency problems.

May 29, 2026 · 6 minute read
Presidio setup · PySpark integration · managed Presidio · Python dependencies · PII setup complexity

Presidio: Powerful Tool, Long Setup

Updated for 2026.

Microsoft Presidio is a solid tool for PII detection and de-identification. But it is a big engineering project. Running it in production takes real effort. The community agrees on this.

GitHub Issue #237 is a good example. Even skilled developers hit environment conflicts, model load failures, and API errors. Days of debugging can pass before the first working run.

What the Community Data Shows

The Presidio GitHub repo has thousands of stars. That shows strong interest. But the open issues list tells a different story.

Environment problems: Python version conflicts are common. So are spaCy model mismatches and ONNX runtime errors. These issues hit developers who follow the docs exactly.

Model load failures: spaCy models download fine but fail to load in some setups. Containers and low-memory configs are common trouble spots. Fixing them needs deep knowledge of spaCy internals.
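A preflight check can surface some of these problems before the first run. The sketch below is a minimal illustration, not part of Presidio: the package list, the model name `en_core_web_lg`, and the minimum Python version are assumptions to adjust for your own deployment. It only confirms that the pieces are installed. It does not prove that a model will load under memory pressure.

```python
import importlib.util
import sys

# Assumed requirements for illustration; adjust to your own deployment.
REQUIRED_PACKAGES = ("presidio_analyzer", "presidio_anonymizer", "spacy")
ASSUMED_MODEL = "en_core_web_lg"  # spaCy models install as importable packages
MIN_PYTHON = (3, 9)               # hypothetical floor; check the Presidio docs


def preflight(packages=REQUIRED_PACKAGES, model=ASSUMED_MODEL, min_python=MIN_PYTHON):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    current = sys.version_info[:2]
    if current < tuple(min_python):
        problems.append(f"Python {current[0]}.{current[1]} is below the required minimum")
    # find_spec locates a top-level package without importing it.
    for name in (*packages, model):
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing package: {name}")
    return problems


if __name__ == "__main__":
    for line in preflight() or ["environment looks complete"]:
        print(line)
```

Running this as the first step of a container start-up script turns a confusing stack trace into a short list of missing parts.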

Production API failures: The analyzer works fine in dev. It breaks under production load. Threading issues and memory pressure from NLP models are the main causes.
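One common way to limit memory pressure is to load the NLP model once per process and share it across request threads. The sketch below shows that pattern with a generic factory. The class name `SharedEngine` is hypothetical, and in a real service the factory would build the Presidio analyzer. Whether the engine itself is safe to call from many threads at once is a separate question to verify for your version.

```python
import threading


class SharedEngine:
    """Build an expensive engine once per process and reuse it everywhere."""

    def __init__(self, factory):
        self._factory = factory  # zero-argument callable that builds the engine
        self._lock = threading.Lock()
        self._engine = None

    def get(self):
        # Double-checked locking: only the first caller pays the load cost.
        if self._engine is None:
            with self._lock:
                if self._engine is None:
                    self._engine = self._factory()
        return self._engine


if __name__ == "__main__":
    builds = []
    shared = SharedEngine(lambda: builds.append(1) or object())
    workers = [threading.Thread(target=shared.get) for _ in range(8)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    print("engine builds:", len(builds))
```

The lock guarantees a single build even when several requests arrive before the model has finished loading.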

Integration overhead: The Ploomber blog on this framework covers the full picture. Presidio is split into multiple services: the analyzer, the anonymizer, and an optional image redactor. Linking them adds work. Data transfer between services adds more.
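The hand-off between the two text services can be sketched as two HTTP calls: the analyzer returns findings, and the anonymizer needs both the original text and those findings. This is a minimal sketch using only the standard library. The host and ports are assumptions that depend on how the containers are mapped, and the `post` parameter is a hypothetical hook added here so the flow can be tested without running services.

```python
import json
import urllib.request

# Assumed local endpoints; host and ports depend on your container mapping.
ANALYZER_URL = "http://localhost:5002/analyze"
ANONYMIZER_URL = "http://localhost:5001/anonymize"


def post_json(url, payload):
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def deidentify(text, language="en", post=post_json):
    """Analyze first, then send the same text plus the findings to the anonymizer."""
    findings = post(ANALYZER_URL, {"text": text, "language": language})
    result = post(ANONYMIZER_URL, {"text": text, "analyzer_results": findings})
    return result["text"]
```

Note that the full text crosses the network twice per document, which is the data transfer cost described above.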

The Microsoft Fabric Case

Microsoft Fabric's own docs show the gap between "available" and "working."

A Fabric blog post on PySpark states this directly: the setup "requires managing external dependencies and custom logic." Fabric users chose a managed cloud platform to skip that kind of work. But adding external tools brings the complexity back.

The steps for PySpark setup are:

  1. Install presidio-analyzer and presidio-anonymizer in Fabric notebooks.
  2. Download spaCy models in the Fabric environment.
  3. Write PySpark UDF wrappers for the analyzer and anonymizer.
  4. Handle spaCy model packaging for use across Spark workers.
  5. Set up language detection for multi-language datasets.
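Steps 3 and 4 above can be sketched as a per-partition wrapper that builds the engines on the worker, once per partition, instead of shipping a loaded model with every row. This is one possible shape, not the approach from the Fabric post. The helper names `make_partition_redactor` and `build_presidio_engines` are hypothetical, and the sketch assumes the Presidio packages and the spaCy model are already installed on every worker.

```python
def make_partition_redactor(build_engines):
    """Return a function suitable for rdd.mapPartitions.

    build_engines is a zero-argument callable that returns two callables:
    analyze(text) and anonymize(text, findings). It runs on the worker,
    once per partition, so the heavy model is not rebuilt for every row.
    """
    def redact_partition(rows):
        analyze, anonymize = build_engines()
        for text in rows:
            if text is None:
                yield None  # pass missing values through unchanged
                continue
            yield anonymize(text, analyze(text))
    return redact_partition


def build_presidio_engines():
    # Imported inside the function so the import happens on the Spark worker.
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine

    analyzer, anonymizer = AnalyzerEngine(), AnonymizerEngine()

    def analyze(text):
        return analyzer.analyze(text=text, language="en")

    def anonymize(text, findings):
        return anonymizer.anonymize(text=text, analyzer_results=findings).text

    return analyze, anonymize


# In a notebook with a DataFrame `df` holding a string column "text" (both hypothetical):
# redacted = df.rdd.map(lambda row: row["text"]).mapPartitions(
#     make_partition_redactor(build_presidio_engines))
```

Keeping the engine construction behind a callable also makes the wrapper testable with stand-in functions before any Spark cluster is involved.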

Every step has known failure modes. Teams on this path often spend one to two weeks before they process their first document.

Two Paths: Self-Hosted vs. Managed

The managed approach flips the setup challenge.

Self-hosted path:

  1. Install Docker.
  2. Set up docker-compose.yml.
  3. Download spaCy models.
  4. Debug container networking.
  5. Set up API endpoints.
  6. Test entity detection.
  7. Fix false positives and negatives.
  8. Build custom recognizers for non-standard entity types.
  9. Add audit logging.
  10. Tune for production load.

Time to first de-identified document: three to twenty-one days.
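Step 8 is often where the remaining days go. As a rough illustration, a pattern-based recognizer for an internal product code might look like the sketch below. The code format `PRD-12345`, the entity name `PRODUCT_CODE`, and the score are invented for this example. The plain-regex helper lets you test the pattern before wiring it into Presidio.

```python
import re

# Hypothetical internal product code format, for example "PRD-12345".
PRODUCT_CODE_REGEX = r"\bPRD-\d{5}\b"


def find_product_codes(text):
    """Plain-regex check: return (start, end) offsets of every match."""
    return [(m.start(), m.end()) for m in re.finditer(PRODUCT_CODE_REGEX, text)]


def build_product_code_recognizer():
    """Wrap the same regex in a Presidio recognizer (requires presidio-analyzer)."""
    from presidio_analyzer import Pattern, PatternRecognizer

    pattern = Pattern(name="product_code", regex=PRODUCT_CODE_REGEX, score=0.8)
    return PatternRecognizer(supported_entity="PRODUCT_CODE", patterns=[pattern])


if __name__ == "__main__":
    print(find_product_codes("Ship PRD-12345 today"))
```

With Presidio installed, the recognizer can be registered on an existing analyzer with `analyzer.registry.add_recognizer(build_product_code_recognizer())`.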

Managed service path:

  1. Create an account.
  2. Upload a document or call the API.

Time to first de-identified document: twelve minutes.

Both paths use the same detection approach. The managed path runs on hardware someone else maintains.

When Self-Hosting Makes More Sense

The managed service does not fit every case.

Custom model training: Some cases need new NER models. Proprietary drug names or internal product codes are examples. Self-hosting gives you the training tools.

Spark-native processing: Some pipelines need PII detection inside the Spark executor. An external API call adds latency that breaks that pattern. Self-hosting is the only fit here.

Full control: Some security policies block all external API calls in a data pipeline. The anonym.legal Desktop App runs fully offline, but inside a pipeline, self-hosting is the fully isolated option.

For most cases, such as document processing, API workflows, and compliance tooling, the managed service removes the infrastructure project entirely.

Running Both Paths at Once

The free tier gives you 200 credits per month. That is enough to test real documents. No credit card. No commitment.

Here is a simple parallel approach.

Week 1: Set up the self-hosted analyzer in dev. See how complex production config will be.

Day 1, in parallel: Create a managed service account. Run the same test documents through the managed API. Compare the results.

Key questions:

  • Does the managed service detect the types you need? It covers 285+ entity types. The open-source build covers around 40 by default.
  • Is the accuracy good enough?
  • Does the API fit your pattern?
  • Do the plans match your volume and budget?

If yes on all: the managed service removes the infrastructure project. If no: the gaps you find are real reasons to stay self-hosted.
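To make "good enough" concrete, both paths can be scored against the same hand-labelled test documents. The sketch below computes exact-match precision and recall over `(start, end, entity_type)` tuples. The tuple format and the sample values are assumptions for illustration, not output from either system.

```python
def precision_recall(predicted, expected):
    """Exact-match precision and recall over (start, end, entity_type) tuples."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical labels and detections for one test document.
    expected = [(0, 3, "PERSON"), (10, 22, "EMAIL_ADDRESS")]
    detected = [(0, 3, "PERSON"), (30, 34, "LOCATION")]
    print(precision_recall(detected, expected))
```

Running the same function over the self-hosted and managed outputs gives two comparable pairs of numbers per document. Exact matching is strict, so partial overlaps count as misses.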

See how other teams made this call in our case studies. Check safeguards and protection details on our security and conformance page. Find answers to common questions in our FAQ.

In Short

A three-week setup is not a failure of the docs or the framework. It shows what production-grade NLP infrastructure needs. The challenges are real. They take time and skill to solve.

For many teams, PII de-identification is a compliance requirement, not a core engineering task. The managed service delivers the same detection without the infrastructure project. Twelve minutes from signup to a first de-identified document keeps the evaluation cost very low.


