Presidio: Powerful Tool, Long Setup
Updated for 2026.
Microsoft Presidio is a solid tool for PII detection and de-identification. Running it in production, however, is a substantial engineering project, and reports from the community bear that out.
GitHub Issue #237 is a good example. Even skilled developers hit environment conflicts, model load failures, and API errors. Days of debugging can pass before the first working run.
What the Community Data Shows
The Presidio GitHub repo has thousands of stars. That shows strong interest. But the open issues list tells a different story.
Environment problems: Python version conflicts are common. So are spaCy model mismatches and ONNX runtime errors. These issues hit developers who follow the docs exactly.
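A quick preflight check can surface these environment problems before the first analyzer call. The following is a minimal sketch using only the standard library. The minimum Python version, the package list, and the model name en_core_web_lg are assumptions for illustration, not requirements taken from the Presidio docs, so adjust them to your install.

```python
import importlib.util
import sys
from importlib import metadata

# Assumed values for illustration; adjust to the versions your install targets.
MIN_PYTHON = (3, 9)
REQUIRED_PACKAGES = ["presidio-analyzer", "presidio-anonymizer", "spacy"]
SPACY_MODEL = "en_core_web_lg"  # importable module name of the spaCy model package


def preflight(min_python=MIN_PYTHON, packages=REQUIRED_PACKAGES,
              model=SPACY_MODEL, current_python=None):
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    current = tuple(current_python or sys.version_info[:2])
    if current < tuple(min_python):
        problems.append(
            f"Python {current[0]}.{current[1]} is older than "
            f"required {min_python[0]}.{min_python[1]}"
        )
    for name in packages:
        try:
            metadata.version(name)  # raises if the distribution is not installed
        except metadata.PackageNotFoundError:
            problems.append(f"package not installed: {name}")
    # A spaCy model installs as a regular importable package.
    if importlib.util.find_spec(model) is None:
        problems.append(f"spaCy model not installed: {model}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("PROBLEM:", problem)
```

The check only confirms that packages are present. It does not prove that the installed versions are compatible with each other.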
Model load failures: spaCy models download fine but fail to load in some setups. Containers and low-memory configs are common trouble spots. Fixing them needs deep knowledge of spaCy internals.
Production API failures: The analyzer works fine in dev. It breaks under production load. Threading issues and memory pressure from NLP models are the main causes.
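One common way to limit memory pressure is to build the NLP-backed engine once per process and share it, rather than once per request. The sketch below shows that pattern with a lock-protected lazy singleton. The factory argument is a placeholder for whatever builds the engine (for example presidio_analyzer.AnalyzerEngine). Whether a single shared instance is safe to call from many threads at once is an assumption you should verify for your Presidio and spaCy versions.

```python
import threading

_engine = None
_engine_lock = threading.Lock()


def get_engine(factory):
    """Return the process-wide engine, building it at most once.

    `factory` is any zero-argument callable that builds the expensive object.
    """
    global _engine
    if _engine is None:              # fast path: no lock once the engine exists
        with _engine_lock:
            if _engine is None:      # re-check: another thread may have built it
                _engine = factory()
    return _engine
```

With this in place, request handlers call get_engine(...) instead of constructing a new engine, so the model is loaded once per worker process.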
Integration overhead: The Ploomber blog on Presidio covers the full picture. Presidio is split into multiple services: the analyzer, the anonymizer, and an optional image redactor. Linking them adds work, and data transfer between services adds more.
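The handoff between the analyzer and the anonymizer looks roughly like this when both run in one Python process instead of as separate services. This is a minimal sketch, assuming the presidio-analyzer and presidio-anonymizer packages and a default English spaCy model are installed. The helper names build_engines and deidentify are illustrative and not part of Presidio.

```python
def build_engines():
    """Create the two Presidio engines. Imported lazily because loading is slow."""
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine
    return AnalyzerEngine(), AnonymizerEngine()


def deidentify(text, analyzer, anonymizer, language="en"):
    """Run detection, then replace the detected spans; returns the redacted text."""
    # Step 1: the analyzer returns a list of findings (entity type, offsets, score).
    findings = analyzer.analyze(text=text, language=language)
    # Step 2: the anonymizer needs the original text plus those findings.
    result = anonymizer.anonymize(text=text, analyzer_results=findings)
    return result.text


if __name__ == "__main__":
    analyzer, anonymizer = build_engines()
    print(deidentify("My name is Jane Doe.", analyzer, anonymizer))
```

In a multi-service deployment the same two steps happen over HTTP, which is where the serialization and transfer cost comes from.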
The Microsoft Fabric Case
Microsoft Fabric's own docs show the gap between "available" and "working."
A Fabric blog post on PySpark states this directly: the setup "requires managing external dependencies and custom logic." Fabric users chose a managed cloud platform to skip that kind of work. But adding external tools brings the complexity back.
The steps for PySpark setup are:
- Install presidio-analyzer and presidio-anonymizer in Fabric notebooks.
- Download spaCy models in the Fabric environment.
- Write PySpark UDF wrappers for the analyzer and anonymizer.
- Handle spaCy model packaging for use across Spark workers.
- Set up language detection for multi-language datasets.
Every step has known failure modes. Teams on this path often spend one to two weeks before they process their first document.
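The UDF wrapper step can be sketched as follows. This is an illustration under stated assumptions, not the approach from the Fabric post: it assumes the Presidio packages and a spaCy model are already installed on every worker, and it builds the engines lazily once per Python worker process instead of shipping them from the driver. The names anonymize_text and make_anonymize_udf are hypothetical.

```python
_ENGINES = None  # per-worker cache; each Spark Python worker builds its own pair


def _get_engines():
    """Build the Presidio engines on first use inside the worker process."""
    global _ENGINES
    if _ENGINES is None:
        from presidio_analyzer import AnalyzerEngine
        from presidio_anonymizer import AnonymizerEngine
        _ENGINES = (AnalyzerEngine(), AnonymizerEngine())
    return _ENGINES


def anonymize_text(text, engines=None, language="en"):
    """Plain-Python function the UDF wraps; nulls and empty strings pass through."""
    if not text:
        return text
    analyzer, anonymizer = engines or _get_engines()
    findings = analyzer.analyze(text=text, language=language)
    return anonymizer.anonymize(text=text, analyzer_results=findings).text


def make_anonymize_udf():
    """Build the Spark UDF; pyspark is imported here so the module loads without it."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    return udf(anonymize_text, StringType())


# Usage in a notebook, with hypothetical column names:
#   clean_df = df.withColumn("clean_text", make_anonymize_udf()(df["raw_text"]))
```

Avoid calling anonymize_text without explicit engines on the driver before creating the UDF. Otherwise the cached engines may be serialized together with the function, which is exactly the model-shipping problem this pattern tries to sidestep.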
Two Paths: Self-Hosted vs. Managed
A managed service moves most of the setup work from your team to the provider.
Self-hosted path:
- Install Docker.
- Set up docker-compose.yml.
- Download spaCy models.
- Debug container networking.
- Set up API endpoints.
- Test entity detection.
- Fix false positives and negatives.
- Build custom recognizers for non-standard entity types.
- Add audit logging.
- Tune for production load.
Time to first de-identified document: three to twenty-one days.
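Of the self-hosted steps above, the custom recognizer is the one most easily shown in code. The sketch below registers a regex-based recognizer through Presidio's PatternRecognizer. The entity name PRODUCT_CODE, the "PRD-" plus six digits format, and the 0.8 score are invented for illustration and should be replaced with your own entity definition.

```python
import re

# Hypothetical format for an internal product code, e.g. "PRD-123456".
PRODUCT_CODE_REGEX = r"\bPRD-\d{6}\b"


def quick_check(text):
    """Try the regex on its own, without loading Presidio or a spaCy model."""
    return re.findall(PRODUCT_CODE_REGEX, text)


def build_product_code_recognizer():
    """Wrap the regex in a Presidio PatternRecognizer (imported lazily)."""
    from presidio_analyzer import Pattern, PatternRecognizer
    pattern = Pattern(name="product_code", regex=PRODUCT_CODE_REGEX, score=0.8)
    return PatternRecognizer(supported_entity="PRODUCT_CODE", patterns=[pattern])


def register(analyzer):
    """Add the recognizer to an existing AnalyzerEngine's registry."""
    analyzer.registry.add_recognizer(build_product_code_recognizer())
```

Testing the regex separately with quick_check keeps the feedback loop short while you tune for false positives and negatives.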
Managed service path:
- Create an account.
- Upload a document or call the API.
Time to first de-identified document: twelve minutes.
Both paths use the same detection approach. The managed path runs on hardware someone else maintains.
When Self-Hosting Makes More Sense
The managed service does not fit every case.
Custom model training: Some cases need new NER models. Proprietary drug names or internal product codes are examples. Self-hosting gives you the training tools.
Spark-native processing: Some pipelines need PII detection inside the Spark executor. An external API call adds latency that breaks that pattern. Self-hosting is the only fit here.
Full control: Some security policies block all external API calls in a data pipeline. The anonym.legal Desktop App runs fully offline. Self-hosted is the fully isolated option.
For most cases, including document processing, API workflows, and compliance tooling, the managed service removes the infrastructure project entirely.
Running Both Paths at Once
The free tier gives you 200 credits per month. That is enough to test real documents. No credit card. No commitment.
Here is a simple parallel approach.
Week 1: Set up the self-hosted analyzer in dev. See how complex production config will be.
Day 1, in parallel: Create a managed service account. Run the same test documents through the managed API. Compare the results.
Key questions:
- Does the managed service detect the types you need? It covers 285+ entity types. The open-source build covers around 40 by default.
- Is the accuracy good enough?
- Does the API fit your pattern?
- Do the plans match your volume and budget?
If yes on all: the managed service removes the infrastructure project. If no: the gaps you find are real reasons to stay self-hosted.
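To make the comparison concrete, it helps to reduce each service's output to the same shape and diff the two. The sketch below assumes you have already converted both result sets into (entity_type, start, end) tuples for the same document. That conversion and any mapping between differing entity type names are left out because they depend on each service's response format.

```python
from collections import Counter


def compare_findings(self_hosted, managed):
    """Compare two collections of (entity_type, start, end) tuples for one document."""
    a, b = set(self_hosted), set(managed)
    return {
        "both": sorted(a & b),                # found by both paths
        "only_self_hosted": sorted(a - b),    # possible misses on the managed side
        "only_managed": sorted(b - a),        # possible misses on the self-hosted side
        "by_type_self_hosted": dict(Counter(t for t, _, _ in a)),
        "by_type_managed": dict(Counter(t for t, _, _ in b)),
    }


if __name__ == "__main__":
    report = compare_findings(
        [("PERSON", 0, 3), ("EMAIL_ADDRESS", 10, 25)],
        [("PERSON", 0, 3), ("PHONE_NUMBER", 30, 42)],
    )
    for key, value in report.items():
        print(key, value)
```

Exact span matching is strict. Two services can detect the same entity with slightly different offsets, so review the "only" lists by hand before treating them as misses.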
See how other teams made this call in our case studies. Check safeguards and protection details on our security and compliance page. Find answers to common questions in our FAQ.
In Short
A three-week setup is not a failure of the docs or the framework. It shows what production-grade NLP infrastructure needs. The challenges are real. They take time and skill to solve.
For many teams, PII de-identification is a compliance requirement, not a core engineering task. The managed service delivers the same detection without the infrastructure project. Twelve minutes from signup to a first de-identified document keeps the evaluation cost very low.
Sources
- Microsoft Presidio GitHub: Open Issues
- Ploomber: Presidio in Production
- Microsoft Fabric: PII Detection with PySpark