The Real Cost of "Free" PII Detection

"It's free" is not a cost analysis. It is a license price — one factor among many.

Microsoft Presidio is open source and costs €0 to download. But running it at the insurance company profiled below costs over €13,000 in the first year. That gap is engineering time.

What a Production Deployment Needs

Getting the tool ready for production takes 40–80 hours. Here is where that time goes.

Docker setup: 4–8 hours. The tool runs as several containers: an analyzer service, an anonymizer service, and an optional image redactor. Getting them to communicate is hard, and GitHub issues show it is a common failure point.

Python setup: 2–4 hours. The libraries have strict version requirements, and conflicts are common, especially between spaCy model versions and Python 3.8, 3.9, and 3.10. GitHub shows hundreds of open issues on this topic.

Language model downloads: 2–4 hours. spaCy models range from 300 MB to 1.4 GB each. A five-language setup needs 1.5–7 GB of storage. Model loading failures are among the most common support issues.

Custom recognizers: 8–16 hours. The default set covers about 40 entity types. Most are US identifiers. EU deployments need European national IDs. Healthcare teams need medical record formats. Each type needs Python code, YAML setup, and testing.
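As a rough illustration of the "Python code plus testing" part of a single recognizer, here is a minimal sketch of a UK National Insurance Number pattern using only Python's standard re module. It assumes the commonly documented format (two prefix letters, six digits, a suffix letter A to D, with a handful of prefixes never issued). NINO_PATTERN and find_ninos are hypothetical names, and wiring the pattern into the tool's recognizer registry and YAML configuration is not shown.

```python
import re

# Assumed format: two prefix letters, six digits, suffix letter A-D.
# Assumption: D, F, I, Q, U, V are not valid in either prefix position,
# O is not valid in the second, and the prefixes below are never issued.
NINO_PATTERN = re.compile(
    r"\b(?!BG|GB|KN|NK|NT|TN|ZZ)"
    r"[A-CEGHJ-PR-TW-Z][A-CEGHJ-NPR-TW-Z]"
    r"\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b"
)


def find_ninos(text):
    """Return every substring that matches the assumed NINO format."""
    return [match.group(0) for match in NINO_PATTERN.finditer(text)]


if __name__ == "__main__":
    print(find_ninos("Claimant NI number: AB 12 34 56 C."))
    print(find_ninos("Reference GB123456A is not a valid number."))
```

A pattern like this matches anything with the right shape, including internal reference codes that happen to look similar, so it still has to be checked against real documents before it is trusted.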

API setup: 4–8 hours. Production config includes timeouts, auth, rate limits, and logging. The official docs are thin. Most teams find answers in GitHub issue threads.

Audit logging: 4–8 hours. GDPR requires records of data processing. The tool has no audit log by default. Teams must write it as custom code.
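To give a sense of what that custom code might involve, here is a minimal sketch of an audit record writer using only the Python standard library. The field names, the function names build_audit_record and append_audit_record, and the JSON-lines file format are all assumptions for illustration; which fields your compliance obligations actually require is something to confirm with your own data protection officer.

```python
import hashlib
import json
from collections import Counter
from datetime import datetime, timezone


def build_audit_record(document_bytes, entity_types, actor):
    """Describe one anonymization run without storing the document text."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        # A hash identifies the document without copying its contents.
        "document_sha256": hashlib.sha256(document_bytes).hexdigest(),
        # Store counts per entity type, never the detected values themselves.
        "entity_counts": dict(Counter(entity_types)),
    }


def append_audit_record(path, record):
    """Append the record as one JSON line to a log file."""
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(record, sort_keys=True) + "\n")
```

The sketch deliberately logs a hash and per-type counts rather than the detected values, so the audit trail does not become a second copy of the personal data.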

Team docs: 4–8 hours.

Itemized total: 28–56 hours at €100/hour = €2,800–5,600.

Annual Maintenance Costs

The tool ships updates 2–4 times per year. Major releases have introduced breaking API changes. Keeping up means tracking changes, testing in staging, and deploying.

spaCy model updates add work too. New model versions need re-downloading and accuracy checks before going live.

Python dependency conflicts keep coming. A clean setup today may break when a security patch ships next month.

Monitoring is ongoing as well. Container health, memory leaks, and restart steps all need regular attention. spaCy models are memory-heavy.
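The container-health part of that monitoring could start with something like the following sketch, which polls each service over HTTP using only the Python standard library. The hosts, ports, and the /health path in SERVICES are assumptions to verify against your own deployment, and is_healthy and check_services are hypothetical helper names.

```python
import urllib.request

# Hypothetical local endpoints; adjust hosts, ports, and paths to your setup.
SERVICES = {
    "analyzer": "http://localhost:5002/health",
    "anonymizer": "http://localhost:5001/health",
}


def is_healthy(url, timeout=2.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error responses,
        # since urllib's URLError and HTTPError are subclasses of OSError.
        return False


def check_services(services, **kwargs):
    """Map each service name to its current health status."""
    return {name: is_healthy(url, **kwargs) for name, url in services.items()}


if __name__ == "__main__":
    for name, healthy in check_services(SERVICES).items():
        print(f"{name}: {'up' if healthy else 'DOWN'}")
```

A check like this only shows that a container responds. It does not catch slow memory growth, which needs separate resource monitoring.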

Total annual maintenance: 60–120 hours at €100/hour = €6,000–12,000.

A Real-World Case Study

A compliance team at an insurance firm set out to process claims documents. They had two junior data engineers and no DevOps support.

Week 1. The two main containers could not talk to each other. Three days to fix with help from GitHub.

Week 2. Models failed to load in production. Memory config was different from the dev setup. Two days to diagnose, one more to fix.

Week 3. A custom UK National Insurance Number rule worked in tests but hit false positives on real documents. Two more days of tuning.

Week 4. The project was escalated. Three engineering weeks spent. Still not in production.

The team then tried anonym.legal. First document processed: 12 minutes after signup. UK National Insurance Number detection was already built in. No setup needed.

They moved to anonym.legal Pro at €180/year.

Year-one TCO:

Self-hosted path — 40–80 more hours to finish, then €6,000–12,000/year to maintain. Total: €10,000–20,000.
anonym.legal Pro — €180/year. Deploy time: ~12 minutes.
Engineering hours saved: ~132/year at €100/hour = €13,200.

That is roughly a 70x cost gap in year one: €13,200 in engineering time against a €180 subscription.
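The year-one arithmetic can be reproduced with a short sketch like the one below, so that you can substitute your own rates and hours. The constants come from the figures in this article; the function and variable names are hypothetical, and reading the headline multiple as saved engineering cost divided by the subscription price is an assumption about how it was derived.

```python
HOURLY_RATE_EUR = 100   # engineering rate used in this article
SAAS_ANNUAL_EUR = 180   # anonym.legal Pro price quoted above


def self_hosted_year_one(setup_hours, maintenance_hours, rate=HOURLY_RATE_EUR):
    """Year-one cost of the self-hosted path, counting engineering time only."""
    return (setup_hours + maintenance_hours) * rate


low = self_hosted_year_one(40, 60)     # 10000
high = self_hosted_year_one(80, 120)   # 20000

# Assumption: the headline multiple compares saved engineering cost
# (~132 hours per year) with the annual subscription price.
savings = 132 * HOURLY_RATE_EUR        # 13200
ratio = savings / SAAS_ANNUAL_EUR

if __name__ == "__main__":
    print(f"Self-hosted year one: EUR {low:,} to {high:,}")
    print(f"Savings vs subscription: {ratio:.0f}x")
```

Changing HOURLY_RATE_EUR or the hour inputs shows how sensitive the multiple is to the assumptions behind it.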

For teams also facing false positive issues, see our post on Presidio's precision problem.

When Self-Hosting Makes Sense

Managed SaaS wins for most teams. But self-hosting fits some cases.

Data sovereignty. Some regulations or contracts prohibit sending data outside the organization. Our Desktop App (anonym.plus) runs fully offline, so no data leaves the machine. Same accuracy, no server needed.

Very high volume. Millions of API calls per day can push per-call pricing above server costs. At that scale, owning the stack makes sense.

Product integration. If you are building PII detection into your own product and need full control, custom open-source work is a valid choice.

Existing DevOps. Teams with a platform team already running many services face lower added cost. Infrastructure is a sunk cost for them.

For everyone else — compliance teams, startups, teams with no DevOps — managed SaaS is the clear choice. See our security compliance overview for how hosted processing meets enterprise needs.

Conclusion

Open-source tools have costs that do not show up in the license. For this type of tool, the biggest cost is engineering time: 40–80 hours of setup and 60–120 hours of annual upkeep. At the €100/hour rate used here, the self-hosted path costs 20–75x more than a managed service.

The right question is not "what does the software cost?" It is "what does running it cost?" For most teams, that answer points to managed SaaS.

When This Approach Has Limits

Counting engineering hours rather than license fees is the right way to compare total cost of ownership, and the self-hosted hour estimates are realistic — but three limits apply to the conclusion.

The hour ranges are estimates that swing on team and context. Forty to eighty setup hours and sixty to one hundred twenty annual maintenance hours assume a team without prior Presidio experience and without an existing platform team. A team that already runs many containerized NLP services absorbs much of this as marginal cost, as the article itself concedes. The 70x year-one gap depends on the high end of the self-hosted estimate and the low end of usage; run the numbers with your own rates and your own existing infrastructure rather than adopting the headline multiple.

A lower cost says nothing about detection quality. The comparison is entirely about who maintains the stack, not about whether either path correctly finds your PII. A managed service that deploys in twelve minutes can still miss a custom or legacy identifier format, and a missed entity is a compliance gap regardless of how cheaply you reached it. Custody and convenience are not detection accuracy. Validate that the chosen tool actually detects your entity types on held-out documents before treating the cost decision as settled.

Sending data to a managed service relocates rather than removes obligations. Choosing hosted processing puts a processor relationship, a data-transfer assessment, and vendor due diligence on your plate that self-hosting avoids. The article fairly notes data-sovereignty cases where the offline Desktop App fits instead. The TCO math captures engineering time; it does not price the legal and procurement work of approving an external processor. Factor that into the comparison, because it is real cost the hour counts omit.

Sources

Microsoft Presidio GitHub: Issues and Setup Documentation.

Ploomber: Presidio Production Deployment Guide.

GDPR Article 32: Technical measures for appropriate security.

