Last updated 2026-05-29


AI Coding Assistants Leak Production PII

Unit test fixtures with real customer records. Log files with production data for debugging. GitHub found 39 million secrets leaked in 2024.

May 29, 2026 · 8 minute read
AI coding assistant · production PII · developer security · MCP Server · GitHub Copilot

Why AI Coding Tools Leak Real Customer Records

Most PII leaks from dev teams are not breaches. They are side effects of daily work.

Production data enters test environments. From there, it reaches AI coding tools — and the vendors running them.

GitHub's 2025 reporting shows the scale of the problem. Its secret scanning detected more than 39 million leaked secrets across GitHub during 2024, API keys among them. Test fixtures and debug logs are common places for such values, and for personal details, to end up. See our security safeguards overview to learn how teams address this risk.

Updated for 2026: AI coding tool adoption has grown fast. So has the exposure surface.

How Real Records Enter Dev Environments

The routes are common and predictable.

Test fixture files: Unit tests need realistic inputs. The fastest path is copying rows from production. The developer plans to replace them "later." Later rarely comes. Real emails and account IDs stay through dozens of commits.

Debug logs: A bug cannot be reproduced locally. A developer pulls a log from the live system. That log has customer emails, IP addresses, and session tokens. The file lands in the project root and gets committed.

Migration scripts: Schema changes include sample rows for test environments. A DBA copies real rows as samples. The script — with genuine customer entries — enters version control.

Docs and README files: Usage examples use "realistic" inputs. Realistic often means copied from real users. The README ends up with real order IDs and account addresses.

Config files: Dev configs carry staging keys that reach real customer data. These files get committed with secrets inside.

What AI Assistants Actually Receive

When developers use AI coding tools, multiple channels send private information out.

Whole-file context: The tool may receive entire files. That includes test fixtures with real entries, log excerpts, or config files with live keys.

Clipboard pastes: Developers paste code into chat for review. The surrounding context often has customer details in it.

IDE indexing: Cursor and GitHub Copilot index local files for context. Any project file with real rows becomes part of that index.

Error messages: Developers paste stack traces into AI chat when debugging. Stack traces can carry customer IDs.

Each channel sends private information to the AI vendor's API. This creates GDPR and HIPAA risk. See our conformance overview for how these rules apply to dev tools.

GDPR and HIPAA: Key Facts for Dev Teams

These rules apply to AI coding tool usage.

GDPR Article 28 — Processor: Sending personal information to an AI vendor makes that vendor a data processor. A Data Processing Agreement is required. Most vendors offer DPAs. Developers who adopt AI tools outside formal procurement may lack a signed DPA.

GDPR Article 6 — Lawful Basis: Dev testing requires a lawful basis for processing personal information. Legitimate interest may apply — but it needs a balancing test. Using real customer rows when fake ones would work fails that test.

HIPAA — BAA: Healthcare developers must have a Business Associate Agreement with the AI vendor before sending it protected health information. OpenAI, Anthropic, and GitHub Copilot are described as offering BAAs for enterprise users; confirm current terms with each vendor. Individual usage outside an enterprise plan may not be covered.

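Data minimization (GDPR Article 5(1)(c)): Real customer entries in test fixtures conflict with the data minimization principle, which limits processing to what the purpose requires. Fake rows serve the same purpose without the privacy cost.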

Our FAQ covers common questions on these rules.

Practical Steps for Dev Teams

Start with a quick audit. Most teams find issues within the first hour.

Immediate actions:

  1. Audit test fixtures — search for email, phone, and ID patterns.
  2. Check project dirs for production log files that contain customer IDs.
  3. Update .gitignore to exclude log files and env-specific data files.
  4. Replace real entries with synthetic generators like Faker or Mimesis.
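
The first audit step can be sketched in a few lines of Python. This is a minimal example, not a full PII detector: the two patterns, the function names `scan_text` and `scan_tree`, and the list of file suffixes are all assumptions chosen for illustration. The patterns are loose on purpose, so expect false positives that need a human look.

```python
import re
from pathlib import Path

# Illustrative patterns only. They favour recall over precision.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d ()-]{7,}\d"),
}


def scan_text(text):
    """Return (line_number, kind, match) for every pattern hit in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                hits.append((lineno, kind, match.group()))
    return hits


def scan_tree(root, suffixes=(".json", ".csv", ".sql", ".yaml", ".yml", ".log")):
    """Scan files under root with a listed suffix; return {path: hits}."""
    results = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            hits = scan_text(path.read_text(errors="ignore"))
            if hits:
                results[str(path)] = hits
    return results
```

Calling `scan_tree("tests")` from the project root returns a dictionary keyed by file path, which is enough to build a first list of fixtures to review.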

The audit alone often surfaces years of accumulated exposure. One team found real customer emails in 14 test files created by six different developers over three years. None of the developers had intended to leave them there.

Before any AI assistant session:

  • Run PII detection on files before sharing them.
  • For IDE tools like Cursor: exclude test dirs from indexing.
  • For chat-based tools: review pasted code for personal information.
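
For pasted snippets and stack traces, a small redaction pass can run before anything reaches a chat window. The sketch below is an assumption-laden example: the rule list, the placeholder format, and the `redact` function are hypothetical, and two regex rules will not catch names, account IDs, or session tokens. It shows the shape of the step, not a complete detector.

```python
import re

# Each rule replaces its matches with a typed placeholder such as <EMAIL>.
RULES = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("IPV4", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
]


def redact(text):
    """Return (clean_text, counts) where counts maps rule name to hits."""
    counts = {}
    for kind, pattern in RULES:
        text, n = pattern.subn(f"<{kind}>", text)
        if n:
            counts[kind] = n
    return text, counts
```

Returning the counts alongside the cleaned text lets the developer see at a glance whether the snippet held anything sensitive before pasting it.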

MCP Server add-on:

The anonym.legal MCP Server brings PII detection into Claude Desktop and Cursor. The steps are simple:

  1. Open a file in the editor.
  2. Call the MCP Server: detect PII in the file.
  3. Review flagged items.
  4. Redact in place.
  5. Share the clean file with the AI tool.

This adds under 30 seconds per file. It removes the manual "check for PII" burden. See our pricing plans to add MCP Server access to your team.

Synthetic inputs — the lasting fix:

Never use real rows in test fixtures. Synthetic libraries produce realistic inputs without exposing real users. Faker (Python/Node.js), Factory Boy (Python), and Bogus (.NET) generate valid inputs for any schema. Each library lets you seed a locale and output realistic names, emails, and phone numbers — all fake.
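
To show what seeded synthetic rows look like, here is a standard-library stand-in for what those libraries do. It is a sketch under stated assumptions: the name lists, the `fake_customers` function, and the row schema are invented for illustration, and a library such as Faker offers far richer, locale-aware output. Emails use the reserved example.com domain and phone numbers use the 555-01XX range set aside for fictional use, so no generated value can belong to a real person.

```python
import random

# Hypothetical name pools, kept tiny for the example.
FIRST = ["Alex", "Sam", "Priya", "Chen", "Maria", "Omar"]
LAST = ["Rivera", "Okafor", "Novak", "Tanaka", "Silva", "Haddad"]


def fake_customers(count, seed=0):
    """Build `count` synthetic customer rows, reproducible for a given seed."""
    rng = random.Random(seed)  # private generator: same seed, same rows
    rows = []
    for i in range(1, count + 1):
        first, last = rng.choice(FIRST), rng.choice(LAST)
        rows.append({
            "id": i,
            "name": f"{first} {last}",
            # The row id keeps emails unique even when names repeat.
            "email": f"{first}.{last}.{i}@example.com".lower(),
            "phone": f"+1-202-555-01{rng.randint(0, 99):02d}",
        })
    return rows
```

A fixed seed keeps fixtures stable between test runs, which removes the usual argument for copying "known good" rows from production.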

Case Study: SaaS Team Finds Real Entries in Cursor

The find came during a GDPR audit. A SaaS team using Cursor found real customer emails in unit test fixtures. A developer had copied 50 customer rows from production 18 months earlier. Those rows had been committed to version control and indexed by Cursor.

Over 18 months, Cursor accessed the fixture files roughly 11,000 times across 8 developer IDE sessions. Each session may have sent fixture content to the Cursor API.

What the team did:

  1. Replaced all 50 real rows with Faker-generated fake inputs.
  2. Updated .gitignore to exclude log files.
  3. Added MCP Server for on-demand PII detection before sharing code.
  4. Set a norm: no production entries in any committed file.
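
A norm like the last one is easier to keep when a check enforces it. The sketch below is one hypothetical way to do that, not what this team ran: it flags any email address whose domain is not reserved for documentation or testing. The function name `suspicious_emails` and the allowlist contents are assumptions, and the check covers emails only.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")

# Domains and TLDs reserved for examples and testing.
RESERVED_DOMAINS = {"example.com", "example.org", "example.net"}
RESERVED_TLDS = (".test", ".example", ".invalid")


def suspicious_emails(text):
    """Return emails in text whose domain is not on the reserved allowlist."""
    found = []
    for match in EMAIL.finditer(text):
        domain = match.group(1).lower()
        if domain in RESERVED_DOMAINS or domain.endswith(RESERVED_TLDS):
            continue
        found.append(match.group(0))
    return found
```

Run over staged files in a pre-commit hook or CI job, a non-empty result can fail the build, so a copied production row is caught before it is committed.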

The MCP Server was the key change. Developers now run detection before Cursor sessions on customer-facing code, with no extra effort beyond the MCP call.

Read more in our case studies section.

Sources

GitHub Security Research 2024.

GDPR Article 28.

HIPAA BAA Guidance.
