From Six Weeks of DevOps Pain to a 3-Day Integration
Updated for 2026.
Six weeks. Two engineers. Four failed deployment attempts. One healthcare SaaS team spent all of this on a self-hosted Presidio setup. Then they switched to a managed API. The switch took 3 days.
The "free" label on open-source software is tempting. So is the promise of full control. But the real cost shows up in engineering hours, not license fees.
What Presidio Docs Don't Cover
Presidio's docs handle local setup well. Run two Docker containers. Point the anonymizer at the analyzer. It works on your laptop.
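The local flow amounts to two HTTP calls. The following is a minimal sketch, not an official Presidio client. The port mappings (5002 for the analyzer, 5001 for the anonymizer) and the JSON field names are assumptions based on Presidio's published Docker samples, so check them against the version you run. The `post` parameter is a hypothetical hook added here so the flow can be exercised without live containers.

```python
import json
import urllib.request

# Assumed local port mappings; adjust to match your own `docker run -p` flags.
ANALYZER_URL = "http://localhost:5002/analyze"
ANONYMIZER_URL = "http://localhost:5001/anonymize"


def post_json(url, payload, timeout=10.0):
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def anonymize(text, language="en", post=post_json):
    """Detect PII with the analyzer, then pass the findings to the anonymizer."""
    # Step 1: the analyzer returns a list of findings (entity type, span, score).
    findings = post(ANALYZER_URL, {"text": text, "language": language})
    # Step 2: the anonymizer rewrites the text using those findings.
    result = post(ANONYMIZER_URL, {"text": text, "analyzer_results": findings})
    return result["text"]
```

With both containers running, `anonymize("John called")` would return the redacted text. Nothing in this sketch handles timeouts, retries, or failures, which is exactly the gap that production exposes.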
Production is a different story.
Scaling: Local Presidio runs as a single instance. Production needs multiple instances behind a load balancer, health checks, and graceful failure. Presidio docs give no guidance on this. Each team solves it alone.
Memory use: spaCy models load into RAM per instance. The en_core_web_lg model alone is 741 MB. Under memory pressure, performance drops. Then the process crashes with an out-of-memory error. Presidio's docs give no guidance on sizing memory for this.
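A rough capacity estimate makes the per-instance cost concrete. The sketch below assumes each worker process holds its own copy of the model. The 741 MB figure comes from the text above. The per-worker overhead and headroom values are placeholder assumptions, and actual resident memory may differ, so replace them with measurements from your own workload.

```python
import math


def estimate_pod_memory_mb(workers, model_mb=741,
                           per_worker_overhead_mb=300, headroom=0.25):
    """Estimate the memory limit (in MB) for a pod running `workers` processes.

    per_worker_overhead_mb and headroom are illustrative guesses, not
    measured values.
    """
    base_mb = workers * (model_mb + per_worker_overhead_mb)
    # Add headroom so a spike on a large document does not hit the limit.
    return math.ceil(base_mb * (1 + headroom))


if __name__ == "__main__":
    for workers in (1, 2, 4):
        print(workers, "workers ->", estimate_pod_memory_mb(workers), "MB")
```

Under these assumptions the requirement grows linearly with worker count, which is why adding workers to one pod can trigger out-of-memory kills rather than improve throughput.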
Timeouts: Large documents take longer. Production code needs configurable timeouts, safe timeout responses, and retry logic. None of this is documented in Presidio.
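One way to combine those three pieces is a small wrapper around whatever function performs the anonymization. This is a sketch under stated assumptions: `anonymize_fn` is a hypothetical callable that takes text and returns anonymized text, and the default timeout, retry count, and backoff are illustrative. On timeout the wrapper fails closed: it returns no text at all rather than the unredacted input.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def anonymize_with_timeout(anonymize_fn, text, timeout_s=5.0,
                           retries=2, backoff_s=0.5):
    """Call anonymize_fn(text) with a per-attempt timeout and bounded retries."""
    # One worker per attempt, so a stuck call cannot block the next attempt.
    executor = ThreadPoolExecutor(max_workers=retries + 1)
    try:
        for attempt in range(retries + 1):
            future = executor.submit(anonymize_fn, text)
            try:
                return {"status": "ok", "text": future.result(timeout=timeout_s)}
            except FutureTimeout:
                # cancel() only helps if the call has not started yet; a call
                # that is already running keeps its thread until it finishes.
                future.cancel()
                if attempt < retries:
                    time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
        # Safe timeout response: never fall back to the raw, unredacted text.
        return {"status": "timeout", "text": None}
    finally:
        # Do not wait for stuck calls before returning to the caller.
        executor.shutdown(wait=False)
```

Threads cannot be killed from outside, so a timed-out call still consumes resources until it returns. A production design might isolate the work in a separate process or rely on HTTP client timeouts instead.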
Model load failures: Under high concurrency, multiple workers try to load the same spaCy model at once. This is a race condition. The result is random 500 errors that are hard to reproduce. Presidio GitHub issues document this. The main docs do not.
Audit logs: GDPR and HIPAA require audit trails for PII processing. Presidio has no built-in audit logging. Each team must write their own middleware.
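A sketch of what such middleware might record follows. The field names, the `pii.audit` logger name, and the shape of `findings` (a list of dicts with an `entity_type` key) are assumptions for illustration, not a compliance-approved schema. The key design choice is to log what was processed and by whom, never the raw text or the matched values.

```python
import json
import logging
from collections import Counter
from datetime import datetime, timezone

audit_logger = logging.getLogger("pii.audit")  # hypothetical logger name


def build_audit_record(request_id, actor, text, findings):
    """Build an audit entry that describes the request without storing PII."""
    entity_counts = Counter(f["entity_type"] for f in findings)
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "actor": actor,                       # service or user that made the call
        "text_length": len(text),             # size only, never the content
        "entity_counts": dict(entity_counts), # e.g. how many PERSON entities
    }


def log_audit(record):
    """Emit the record as one JSON line for downstream log collection."""
    audit_logger.info(json.dumps(record, sort_keys=True))
```

Retention periods, tamper resistance, and access controls for these logs still need to be designed separately with your compliance team.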
API versioning: Presidio's API has changed between versions. Code built for Presidio 2.0 may need updates for 2.2 and above. Version pinning helps. But it adds its own maintenance burden.
A Healthcare SaaS Team's Six Weeks
This team built PHI anonymization into a research data export pipeline.
Week 1: They followed the Presidio docs. Local dev worked. The Kubernetes deployment failed. Pod initialization threw model loading errors. The team chased Kubernetes config issues.
Week 2: Kubernetes config was fixed. Model loading worked sometimes. Under load testing, about 15% of requests failed with model loading timeouts. They added retry logic.
Week 3: Retry logic hid the root issue, but the service now passed load tests. A compliance review asked for audit logs. The team wrote custom logging middleware.
Week 4: Healthcare entity types — medical record numbers, health plan IDs — were not covered by Presidio defaults. The team wrote two custom recognizers.
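The team's recognizers are not shown in the source, so the following is only an illustration of the idea. It uses plain regular expressions from the standard library. In Presidio, a pattern like this would typically be wrapped in a `PatternRecognizer`. The MRN format (a label followed by 6 to 10 digits), the entity name, and the score are all hypothetical; real formats vary by institution.

```python
import re

# Hypothetical format: "MRN" or "Medical Record Number", an optional ":" or "#",
# then 6 to 10 digits. Requiring the label reduces false positives on other IDs.
MRN_PATTERN = re.compile(
    r"\b(?:MRN|medical record number)\s*[:#]?\s*(\d{6,10})\b",
    re.IGNORECASE,
)


def find_mrns(text):
    """Return findings for medical record numbers found in `text`."""
    return [
        {
            "entity_type": "MEDICAL_RECORD_NUMBER",  # illustrative entity name
            "start": match.start(1),  # span covers the digits only
            "end": match.end(1),
            "score": 0.85,            # placeholder confidence
        }
        for match in MRN_PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    sample = "Patient MRN: 00123456 seen today."
    for hit in find_mrns(sample):
        print(hit["entity_type"], sample[hit["start"]:hit["end"]])
```

Each custom recognizer also needs test data, tuning against false positives, and upkeep as record formats change, which is part of the ongoing maintenance cost.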
Week 5: They pushed to production. A memory leak appeared. spaCy model objects built up across requests. The team added a daily pod restart as a workaround.
Week 6: Production failed under real traffic. The daily restart caused service gaps. The conclusion was clear: fixing the memory leak needed either a major app redesign or a different tool.
The review: The engineering manager ran the numbers. Six weeks times two engineers equals 12 engineering weeks. The deployment was live but unstable. Ongoing maintenance was estimated at 5 to 10 hours per week.
The switch: The team tested the anonym.legal API. PHI entity coverage worked out of the box. No custom recognizers needed. SLA-backed uptime. Audit logging included. Integration took 3 days using their existing API client code.
The cost comparison:
- 12 engineering weeks at US market rates: $48,000 to $72,000
- Estimated annual maintenance for self-hosted: $25,000 to $40,000
- anonym.legal Business plan: €348 per year (roughly $385)
The managed API costs less in its first week than the self-hosted build cost in its first hour.
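The arithmetic behind that comparison can be checked directly. The dollar figures below come from the list above. The 40-hour work week and the choice to add build cost and first-year maintenance together are assumptions made for this sketch.

```python
ENGINEERS = 2
WEEKS = 6
BUILD_COST_USD = (48_000, 72_000)          # low and high estimates from the text
ANNUAL_MAINTENANCE_USD = (25_000, 40_000)  # low and high estimates from the text
MANAGED_PLAN_USD = 385                     # EUR 348 per year, converted roughly
HOURS_PER_WEEK = 40                        # assumption

engineer_weeks = ENGINEERS * WEEKS
# Implied cost of one engineer for one week, then for one hour.
weekly_rate = tuple(cost / engineer_weeks for cost in BUILD_COST_USD)
hourly_rate = tuple(rate / HOURS_PER_WEEK for rate in weekly_rate)

# First-year self-hosted total: build plus one year of maintenance (assumption).
first_year_self_hosted = tuple(
    build + upkeep for build, upkeep in zip(BUILD_COST_USD, ANNUAL_MAINTENANCE_USD)
)
managed_per_week = MANAGED_PLAN_USD / 52

if __name__ == "__main__":
    print("engineer-weeks:", engineer_weeks)
    print("implied weekly rate (USD):", weekly_rate)
    print("implied hourly rate (USD):", hourly_rate)
    print("first-year self-hosted (USD):", first_year_self_hosted)
    print("managed plan per week (USD):", round(managed_per_week, 2))
```

At these rates, one week of the managed plan costs well under a single engineer-hour, even at the low estimate.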
When Data Cannot Leave Your Network
Some healthcare teams cannot send data to any external service. Air-gap rules or data sovereignty policies block it.
For these cases, the Desktop Application (anonym.plus) offers the same engine in a local install:
- Same detection engine: Presidio plus XLM-RoBERTa
- No calls to external services
- Batch processing for clinical notes and research datasets
- No setup beyond installation
- Automatic model management
This removes the main objection to managed SaaS: "our data can't leave." It still keeps the simplicity that makes managed tools worthwhile.
Build vs. Buy: A Simple Framework
Choose a managed API when:
- Your team has no dedicated infrastructure engineers
- You need to ship in days, not weeks
- SLA-backed uptime is a requirement
- The managed service covers your entity types
- You need audit logs and compliance records included
Choose self-hosted when:
- Regulations block data from leaving your network (check the Desktop App first)
- Your processing volume makes self-hosted cheaper at scale
- You need deep customization the API cannot support
- You have a platform team that treats this as one of many managed services
Choose the Desktop Application when:
- Offline processing is required
- Medical research data cannot leave a clinical environment
- Financial data has geographic processing limits
Conclusion
Six weeks of engineering time is not a Presidio flaw. It is the expected cost of running any production-grade NLP service on your own. Scaling, memory issues, model load failures, audit logs, and custom entity work all add up fast.
Managed APIs absorb that cost. For PII anonymization — a compliance need, not a product feature — the managed route almost always wins on total cost of ownership.
Read how the anonym.legal API handles PHI detection. See full compliance details in our security overview. Compare plans on our pricing page.
Sources
- Ploomber: Presidio Production Deployment Deep Dive — ploomber.io.
- Microsoft Fabric Community: Presidio with PySpark — blog.fabric.microsoft.com.
- Presidio GitHub: Production Deployment Issues — github.com/microsoft/presidio/issues.