Last updated 2026-05-29

GDPR Pipeline: Anonymize Before Storage

dbt column tags are not GDPR compliance. Raw customer data hits your Snowflake warehouse unmasked before tag-based policies apply.

May 29, 2026 · 8 minute read
Tags: data pipeline, dbt, Snowflake, data warehouse, ELT anonymization, GDPR engineering

You tagged your PII columns in dbt. You set up dynamic masking in Snowflake. You feel GDPR-compliant.

Your source content still lands in the warehouse unmasked. Masking runs at query time, so the unmasked values sit in your raw schema, and anyone with raw schema access can read them. Tables ingested before the masking policies existed were never masked.

The gap between "we have masking policies" and "our pipeline is safe" is where GDPR violations happen.

See our compliance overview for how anonym.legal supports GDPR.

How ELT Pipelines Expose PII

The Extract-Load-Transform (ELT) pattern is now the norm. It loads source data into the warehouse first. Transforms come later. The steps look like this:

  1. Extract: Source systems export all fields. Salesforce CRM, Stripe payments, Intercom support — everything goes out.
  2. Load: Source data lands in the warehouse ingestion schema. Snowflake, BigQuery, and Redshift all work the same way. Every PII field is included.
  3. Transform: dbt models clean and join the data for analytics.

The ingestion layer holds full personal information: names, email addresses, phone numbers, payment details, and support ticket text. In many teams, engineers and analysts have raw schema access. They can query these tables at any time.

Tag-based masking in Snowflake helps at query time, but only for columns that carry a tag with a masking policy attached. Raw tables ingested before the tags existed stay unmasked unless you tag them too, and direct queries against them return plain values. Every model that exposes PII must be tagged. That burden grows as the schema grows.

Anonymize Before Load

Anonymizing PII at the pipeline level removes raw-layer risk. Do it before content lands in the warehouse.

ETL approach (pre-load anonymization):

  1. Extract from source systems
  2. Run through an anonymization step
  3. Load clean output into the warehouse

The warehouse never receives unmasked PII. The ingestion schema holds only clean content. Downstream models, dashboards, and direct queries all work with clean output.
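The three steps above can be sketched in plain Python. This is a minimal illustration, not a production detector: the regular expressions are a crude stand-in for a real anonymization step, and the sample rows, function names, and the list that plays the warehouse are all assumptions made for the example.

```python
import re

# Crude illustrative patterns; a real detector covers far more cases.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def extract() -> list[dict]:
    """Stand-in for a source-system export (hypothetical sample rows)."""
    return [
        {"ticket_id": 1, "body": "Reach me at jane@example.com or +49 30 1234567."},
        {"ticket_id": 2, "body": "Invoice question, no contact details."},
    ]


def anonymize(record: dict) -> dict:
    """Replace emails and phone numbers in the free-text field with placeholders."""
    body = EMAIL_RE.sub("<EMAIL_ADDRESS>", record["body"])
    body = PHONE_RE.sub("<PHONE_NUMBER>", body)
    return {**record, "body": body}


def load(records: list[dict], warehouse: list[dict]) -> None:
    """Stand-in for the warehouse load; it only ever receives clean records."""
    warehouse.extend(records)


warehouse: list[dict] = []
load([anonymize(record) for record in extract()], warehouse)
print(warehouse[0]["body"])
```

The point of the ordering is that `load` has no code path that sees a raw record. Anything downstream inherits that guarantee.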

You have two main paths.

Option 1 — API integration:

For systems with webhooks or streaming exports, route entries through the anonym.legal API first. Support tickets leaving Intercom go through the API before the warehouse. Stripe exports do the same.

POST /api/anonymize
{
  "text": "Customer John Smith (john@example.com) reported...",
  "entities": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
  "method": "replace"
}
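
A hedged sketch of building that request from Python with the standard library. The path and body fields come from the example above. The host in `BASE_URL` is a placeholder, authentication is left out because it is not described here, and the response format is not assumed, so the sketch stops before sending.

```python
import json
import urllib.request

# Assumption: placeholder host. Replace with the base URL from your account.
BASE_URL = "https://api.example.invalid"


def build_anonymize_request(
    text: str, entities: list[str], method: str = "replace"
) -> urllib.request.Request:
    """Build (but do not send) a POST request matching the body shown above."""
    payload = {"text": text, "entities": entities, "method": method}
    return urllib.request.Request(
        url=f"{BASE_URL}/api/anonymize",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_anonymize_request(
    "Customer John Smith (john@example.com) reported...",
    ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
)
# Sending would be urllib.request.urlopen(request, timeout=10), once the real
# host and credentials are filled in.
```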

Option 2 — Batch preprocessing:

For daily or weekly CSV/JSON file exports, run files through batch processing before loading.

Airflow DAG structure:

extract_task >> anonymize_batch_task >> load_to_warehouse_task

The anonymize task uploads files and gets back clean versions. The load task handles the rest.
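
The same ordering can be tried without Airflow. In this sketch the three functions reuse the task names from the DAG line, but their bodies are assumptions: the export is a hard-coded CSV string, and a local regex stands in for the upload-and-return step. In Airflow, each function would become one task chained with `>>`.

```python
import csv
import io
import re

# Crude illustrative pattern standing in for real PII detection.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_task() -> str:
    """Stand-in for a daily CSV export (hypothetical rows)."""
    return (
        "customer_id,email,note\n"
        "1,jane@example.com,Call back Monday\n"
        "2,sam@example.org,Refund sent to sam@example.org\n"
    )


def anonymize_batch_task(raw_csv: str) -> str:
    """Local stub for the anonymization step: scrub every cell, keep the layout."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(
            {key: EMAIL_RE.sub("<EMAIL_ADDRESS>", value) for key, value in row.items()}
        )
    return out.getvalue()


def load_to_warehouse_task(clean_csv: str) -> list[dict]:
    """Stand-in for the load: parses only the cleaned file."""
    return list(csv.DictReader(io.StringIO(clean_csv)))


rows = load_to_warehouse_task(anonymize_batch_task(extract_task()))
```

Scrubbing every cell, not just the `email` column, catches addresses that leak into free-text fields such as `note`.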

See our security practices page for sub-processor and data flow details.

What dbt Column Tags Do and Don't Do

dbt lets you tag PII columns:

models:
  - name: stg_customers
    columns:
      - name: email
        tags: ['pii', 'email']
      - name: full_name
        tags: ['pii', 'personal_data']

Tags let you:

  • Document where PII lives
  • Trigger downstream masking policies (requires warehouse-level setup)
  • Track lineage with tools like Secoda

Tags do not:

  • Mask ingested tables in the raw schema
  • Block direct table queries
  • Anonymize data at load time
  • Retroactively mask old data

dbt column tags are a governance tool. They show you where PII is. They do not apply the "appropriate technical and organisational measures" that GDPR Article 32 requires.

The Snowflake Masking Gap

Snowflake's dynamic masking hides column content from users at query time. It is a strong control for production use. But it has clear limits.

Key limits:

  • Every new column needs an explicit policy
  • Schema changes can leave new columns unmasked until you update policies
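  • Masking applies to privileged roles too, but any role allowed to alter or unset a policy (ACCOUNTADMIN by default) can lift it, and many policies exempt admin roles by design
  • Load jobs write raw values to storage, and masking never changes what is written
  • Old data loaded before policies were set is stored in plain form. Policies run at read time, not write time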

Masking at query time is not enough. Data must be clean before it is stored.

Compliance Documentation

GDPR's accountability principle (Article 5(2)) requires that you can demonstrate compliance. Words are not enough. For engineering teams this means written records.

Records of Processing Activities (ROPA): Document that customer information is anonymized before it loads to the analytics warehouse. The anonymization step is a processing activity under GDPR.

Technical safeguard notes: Write down which entity types your pipeline targets. Note the anonymization method used. Batch run logs give you this for free.

Data lineage: Secoda or dbt's built-in lineage can show that source tables flow through an anonymization step before reaching analytics models. This is your audit trail.

Vendor register: The anonymization service acts as a processor, or as a sub-processor if you handle the data on behalf of your own customers. Its DPA and privacy policy must be in your vendor register.

Implementation Steps

For a dbt and Snowflake pipeline:

Step 1: Audit your raw layer

Find which tables hold personal information. Query your dbt column tags or your catalog for PII-tagged tables.
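
A sketch of the tag audit, assuming your dbt manifest (`target/manifest.json`) lists columns and their tags under each entry in `nodes` and `sources`. Check that layout against your dbt version. The manifest below is a trimmed, hand-written stand-in, and the project name `shop` is hypothetical.

```python
def pii_columns(manifest: dict, tag: str = "pii") -> dict[str, list[str]]:
    """Map each model or source to its columns that carry the given tag."""
    found: dict[str, list[str]] = {}
    for section in ("sources", "nodes"):
        for unique_id, node in manifest.get(section, {}).items():
            columns = sorted(
                name
                for name, column in node.get("columns", {}).items()
                if tag in column.get("tags", [])
            )
            if columns:
                found[unique_id] = columns
    return found


# Hand-written stand-in; in practice, json.load() the real manifest file.
manifest = {
    "sources": {
        "source.shop.raw.customers": {
            "columns": {"email": {"tags": ["pii"]}, "id": {"tags": []}},
        },
    },
    "nodes": {
        "model.shop.stg_customers": {
            "columns": {
                "email": {"tags": ["pii", "email"]},
                "full_name": {"tags": ["pii", "personal_data"]},
                "signup_date": {"tags": []},
            },
        },
        "model.shop.stg_orders": {"columns": {"order_id": {"tags": []}}},
    },
}

print(pii_columns(manifest))
```

This only finds columns someone already tagged. Untagged PII in the raw layer still needs a manual or catalog-based review.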

Step 2: Set the anonymization scope

For each source table, decide which columns hold PII. Then decide which need anonymization and which need pseudonymization. Support ticket body: anonymize. Order ID: pseudonymize to keep join keys intact. Timestamp: keep as-is for time-series analysis.
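
The scope can live in code so that new columns cannot slip through undecided. This is a minimal sketch. The table and column names are illustrative, and the three actions mirror the choices named above.

```python
from enum import Enum


class Action(Enum):
    ANONYMIZE = "anonymize"
    PSEUDONYMIZE = "pseudonymize"
    KEEP = "keep"


# Hypothetical scope for one source table.
SCOPE: dict[str, dict[str, Action]] = {
    "support_tickets": {
        "body": Action.ANONYMIZE,         # free text, no join value
        "order_id": Action.PSEUDONYMIZE,  # join key, must stay consistent
        "created_at": Action.KEEP,        # needed for time-series analysis
    },
}


def undecided_columns(table: str, columns: list[str]) -> list[str]:
    """Return columns present in the source that have no scope decision yet."""
    decided = SCOPE.get(table, {})
    return [column for column in columns if column not in decided]


# A new column appeared upstream; the check flags it before load.
missing = undecided_columns(
    "support_tickets", ["body", "order_id", "created_at", "customer_email"]
)
print(missing)
```

Failing the pipeline when `missing` is non-empty turns a schema change into a visible decision instead of a silent leak.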

Step 3: Pick an implementation path

Small team with batch exports: use batch file processing before load. Engineering team available: build API integration in Airflow or Prefect.

Step 4: Test and validate

Run anonymization on a sample before going live. Check that dbt models still work. Some models join on email. Those need consistent replacement values. Pseudonymization keeps join keys. Redaction breaks them.
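
A sketch of why consistent replacement keeps joins working. It uses a keyed hash (HMAC-SHA256) as one possible pseudonymization method, which is an assumption for illustration and not necessarily how any particular service does it. The key, the `pseudo_` prefix, and the sample rows are hypothetical.

```python
import hashlib
import hmac

# Hypothetical key. Keep the real one outside the warehouse and rotate it.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(value: str) -> str:
    """Map the same input to the same token, so equal values still join."""
    normalized = value.strip().lower()
    digest = hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256)
    return f"pseudo_{digest.hexdigest()[:16]}"


customers = [{"email": "Jane@Example.com", "plan": "pro"}]
tickets = [{"email": "jane@example.com", "ticket_id": 7}]

clean_customers = [{**row, "email": pseudonymize(row["email"])} for row in customers]
clean_tickets = [{**row, "email": pseudonymize(row["email"])} for row in tickets]

# The join on email still matches after pseudonymization.
joined = [
    (customer["plan"], ticket["ticket_id"])
    for customer in clean_customers
    for ticket in clean_tickets
    if customer["email"] == ticket["email"]
]
print(joined)
```

With redaction, every email becomes the same placeholder, so the join would match every customer to every ticket. Normalizing case and whitespace before hashing avoids the opposite failure, where one person gets two tokens.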

Step 5: Handle old raw tables

Content loaded before anonymization was in place needs retroactive processing. Export, anonymize, reload. This is a one-time task per table.
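
The export, anonymize, reload loop can be rehearsed on a local database first. This sketch uses an in-memory SQLite database as a stand-in for the warehouse. The table name, the columns, and the regex scrubber are assumptions for the example, and a real warehouse would use its own bulk export and load tooling.

```python
import re
import sqlite3

# Crude illustrative pattern standing in for real PII detection.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def scrub(text: str) -> str:
    return EMAIL_RE.sub("<EMAIL_ADDRESS>", text)


def backfill_tickets(conn: sqlite3.Connection) -> int:
    """Export rows, anonymize them, and swap in a clean copy of the table."""
    rows = conn.execute("SELECT id, body FROM raw_tickets").fetchall()  # export
    cleaned = [(row_id, scrub(body)) for row_id, body in rows]          # anonymize
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE raw_tickets_clean (id INTEGER PRIMARY KEY, body TEXT)"
        )
        conn.executemany("INSERT INTO raw_tickets_clean VALUES (?, ?)", cleaned)
        conn.execute("DROP TABLE raw_tickets")                          # reload
        conn.execute("ALTER TABLE raw_tickets_clean RENAME TO raw_tickets")
    return len(cleaned)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_tickets (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO raw_tickets VALUES (?, ?)",
    [(1, "Contact jane@example.com about refund"), (2, "No PII here")],
)
conn.commit()

processed = backfill_tickets(conn)
```

Returning the row count gives you a number to compare against the source table before you sign off on the backfill.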

Conclusion

Column tags show you where PII lives, and query-time masking hides it only where a policy is attached. Neither stops users with raw schema access from reading it. For real GDPR compliance, PII must be anonymized before it reaches the warehouse. That makes the ingestion layer as safe as the production layer.

This is harder than column tagging. But it is what Article 32's "appropriate technical and organisational measures" means in practice.
