Updated for 2026

You tagged your PII columns in dbt. You set up dynamic masking in Snowflake. You feel GDPR-compliant.

Your source content still lands in the warehouse unmasked. Masking runs at query time. The unmasked content sits in your raw schema. Anyone with raw schema access can read it. Your dbt models ran before masking policies existed. Old ingested tables were never masked.

The gap between "we have masking policies" and "our pipeline is safe" is where GDPR violations happen.

See our compliance overview for how anonym.legal supports GDPR.

How ELT Pipelines Expose PII

The Extract-Load-Transform (ELT) pattern is now the norm. It loads source data into the warehouse first. Transforms come later. The steps look like this:

Extract: Source systems export all fields. Salesforce CRM, Stripe payments, Intercom support — everything goes out.
Load: Source data lands in the warehouse ingestion schema. Snowflake, BigQuery, Redshift all work the same way. Every PII field is included.
Transform: dbt models clean and join the data for analytics.

The ingestion layer holds full personal information. Names, email addresses, phone numbers, payment details, support ticket text. In many teams, engineers and analysts have raw schema access. They can query these tables at any time.

Tag-based masking in Snowflake helps at query time, but only on columns where a tag and a policy are actually in place. It never changes the stored values. Raw tables that were never tagged stay readable through direct schema queries. Every table, model, and dashboard source must be covered. That burden grows as the schema grows.

Anonymize Before Load

Anonymizing PII at the pipeline level removes raw-layer risk. Do it before content lands in the warehouse.

ETL approach (pre-load anonymization):

Extract from source systems
Run through an anonymization step
Load clean output into the warehouse

The warehouse never receives unmasked PII. The ingestion schema holds only clean content. Downstream models, dashboards, and direct queries all work with clean output.

You have two main paths.

Option 1 — API integration:

For systems with webhooks or streaming exports, route entries through the anonym.legal API first. Support tickets leaving Intercom go through the API before the warehouse. Stripe exports do the same.

POST /api/anonymize
{
  "text": "Customer John Smith (john@example.com) reported...",
  "entities": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
  "method": "replace"
}
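
A minimal sketch of how a pipeline step might build that request in Python, using only the standard library. The host name is a placeholder, since only the path appears above. Authentication and the response format are not described here, so the sketch stops at constructing the request.

```python
import json
import urllib.request

# Placeholder host: only the path /api/anonymize is given in this article.
BASE_URL = "https://api.example.invalid"


def build_anonymize_request(text, entities, method="replace"):
    """Build (but do not send) a POST request matching the payload above."""
    payload = {"text": text, "entities": list(entities), "method": method}
    return urllib.request.Request(
        url=BASE_URL + "/api/anonymize",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_anonymize_request(
    "Customer John Smith (john@example.com) reported...",
    ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
)
# Sending would be urllib.request.urlopen(request); authentication headers
# are left out because they are not described in this article.
```

Keeping request construction separate from sending makes the payload easy to unit test without touching the network.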

Option 2 — Batch preprocessing:

For daily or weekly CSV/JSON file exports, run files through batch processing before loading.

Airflow DAG structure:

extract_task >> anonymize_batch_task >> load_to_warehouse_task

The anonymize task uploads files and gets back clean versions. The load task handles the rest.
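
The three steps can be sketched as plain Python functions. This is an illustration under assumptions, not a finished DAG: the sample data, the column names, and placeholder_anonymize are all hypothetical, and in Airflow each function would be wrapped in its own task.

```python
import csv
import io


def extract_task():
    """Stand-in for a source export: returns CSV text with a PII column."""
    return "ticket_id,email,body\n1,john@example.com,Please call me back\n"


def anonymize_batch_task(csv_text, pii_columns, anonymize):
    """Apply the `anonymize` callable to every value in the PII columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for column in pii_columns:
            row[column] = anonymize(row[column])
        writer.writerow(row)
    return out.getvalue()


def load_to_warehouse_task(csv_text, warehouse):
    """Stand-in for the warehouse load: appends rows to an in-memory list."""
    warehouse.extend(csv.DictReader(io.StringIO(csv_text)))


def placeholder_anonymize(value):
    # Hypothetical stand-in for the real anonymization call.
    return "<REDACTED>"


raw_schema = []
clean = anonymize_batch_task(extract_task(), ["email"], placeholder_anonymize)
load_to_warehouse_task(clean, raw_schema)
```

The point of the ordering is that load_to_warehouse_task only ever sees the output of the anonymize step, never the original export.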

See our security practices page for sub-processor and data flow details.

What dbt Column Tags Do and Don't Do

dbt lets you tag PII columns:

models:
  - name: stg_customers
    columns:
      - name: email
        tags: ['pii', 'email']
      - name: full_name
        tags: ['pii', 'personal_data']

Tags let you:

Document where PII lives
Trigger downstream masking policies (requires warehouse-level setup)
Track lineage with tools like Secoda

Tags do not:

Mask ingested tables in the raw schema
Block direct table queries
Anonymize data at load time
Retroactively mask old data

dbt column tags are a governance tool. They show you where PII is. On their own they are not the "appropriate technical and organisational measures" that GDPR Article 32 requires.

The Snowflake Masking Gap

Snowflake's dynamic masking hides column content from users at query time. It is a strong control for production use. But it has clear limits.

Key limits:

Every new column needs an explicit policy
Schema changes can leave new columns unmasked until you update policies
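Any role the policy exempts, or that can alter or unset the policy, can still read plain values. In many accounts that includes SYSADMIN or ACCOUNTADMIN
Import jobs often run under high-privilege roles that masking does not restrict
Old data loaded before policies were set is stored in plain form, because policies run at read time, not write time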

Masking at query time is not enough. Data must be clean before it is stored.
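
One way to catch the schema-change gap is a recurring coverage check. The sketch below assumes you can already produce two inventories from your catalog or warehouse metadata: the columns you consider PII and the columns that currently have a masking policy attached. The table and column names are hypothetical.

```python
def unmasked_pii_columns(pii_columns, masked_columns):
    """Return PII columns that have no masking policy attached, sorted."""
    return sorted(set(pii_columns) - set(masked_columns))


# Hypothetical inventory: (table, column) pairs.
pii_columns = [
    ("raw.customers", "email"),
    ("raw.customers", "full_name"),
    ("raw.customers", "phone"),  # added by a recent schema change
]
masked_columns = [
    ("raw.customers", "email"),
    ("raw.customers", "full_name"),
]

gaps = unmasked_pii_columns(pii_columns, masked_columns)
```

A check like this only reports the gap. It does not change the fact that the values are stored in plain form.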

Compliance Documentation

GDPR's accountability principle (Article 5(2)) requires you to demonstrate compliance, not just assert it. For engineering teams this means written records.

Records of Processing Activities (ROPA): Document that customer information is anonymized before it loads to the analytics warehouse. The anonymization step is a processing activity under GDPR.

Technical safeguard notes: Write down which entity types your pipeline targets. Note the anonymization method used. Batch run logs give you this for free.

Data lineage: Secoda or dbt's built-in lineage can show that source tables flow through an anonymization step before reaching analytics models. This is your audit trail.

Vendor register: The anonymization service acts as a processor for you, or as a sub-processor if you process data on behalf of your own customers. Their DPA and privacy policy must be in your vendor register.

Implementation Steps

For a dbt and Snowflake pipeline:

Step 1: Audit your raw layer

Find which tables hold personal information. Query your dbt column tags or your catalog for PII-tagged tables.
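
A minimal sketch of the tag query, assuming the column tags end up in each column's tags list inside dbt's manifest.json. Verify that layout against your dbt version. The manifest below is a trimmed, hypothetical example; in practice you would load target/manifest.json with json.load.

```python
def pii_columns_from_manifest(manifest, pii_tag="pii"):
    """List (model name, column name) pairs whose column tags include pii_tag."""
    found = []
    for node in manifest.get("nodes", {}).values():
        for column in node.get("columns", {}).values():
            if pii_tag in column.get("tags", []):
                found.append((node["name"], column["name"]))
    return sorted(found)


# Trimmed-down, hypothetical example in the shape of manifest.json.
manifest = {
    "nodes": {
        "model.shop.stg_customers": {
            "name": "stg_customers",
            "columns": {
                "email": {"name": "email", "tags": ["pii", "email"]},
                "full_name": {"name": "full_name", "tags": ["pii", "personal_data"]},
                "created_at": {"name": "created_at", "tags": []},
            },
        }
    }
}

pii_columns = pii_columns_from_manifest(manifest)
```

This only finds columns someone has already tagged, so treat the result as a starting list rather than a complete audit.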

Step 2: Set the anonymization scope

For each source table, decide which columns hold PII. Then decide which need anonymization and which need pseudonymization. Support ticket body: anonymize. Order ID: pseudonymize to keep join keys intact. Timestamp: keep as-is for time-series analysis.
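
The scope decisions can be recorded as data so that the pipeline and the documentation share one source. The sketch below is one possible shape, with hypothetical table and column names that follow the examples above.

```python
from enum import Enum


class Action(Enum):
    ANONYMIZE = "anonymize"
    PSEUDONYMIZE = "pseudonymize"
    KEEP = "keep"


# Hypothetical table and column names following the examples above.
SCOPE = {
    "support_tickets": {
        "body": Action.ANONYMIZE,
        "order_id": Action.PSEUDONYMIZE,
        "created_at": Action.KEEP,
    },
}


def action_for(table, column):
    """Look up the decided action; undecided columns fail loudly."""
    try:
        return SCOPE[table][column]
    except KeyError:
        raise KeyError(f"No anonymization decision recorded for {table}.{column}")
```

Raising on an unknown column is a deliberate choice here: a new source column should stop the load until someone decides how to treat it, rather than pass through by default.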

Step 3: Pick an implementation path

Small team with batch exports: use batch file processing before load. Engineering team available: build API integration in Airflow or Prefect.

Step 4: Test and validate

Run anonymization on a sample before going live. Check that dbt models still work. Some models join on email. Those need consistent replacement values. Pseudonymization keeps join keys. Redaction breaks them.
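
To see why consistency matters for joins, here is a small sketch that uses a keyed hash (HMAC-SHA256) as the pseudonymization method. That choice is an assumption for illustration, not a description of how any particular service generates replacement values, and the key shown is a dummy.

```python
import hashlib
import hmac


def pseudonymize(value, key):
    """Keyed hash: the same input and key always give the same token."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


def redact(value):
    """Every input collapses to one constant, so joins no longer match rows."""
    return "<EMAIL>"


# Illustrative key only; a real key belongs in a secrets manager.
key = b"example-key-not-for-production"

customers_key = pseudonymize("john@example.com", key)
orders_key = pseudonymize("John@Example.com ", key)
other_key = pseudonymize("jane@example.com", key)
```

Here customers_key and orders_key are equal, so a join on the pseudonymized column still lines up, while redact gives every row the same value. Anyone holding the key can recompute the token for a known email, which is why the output remains personal data.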

Step 5: Handle old raw tables

Content loaded before anonymization was in place needs retroactive processing. Export, anonymize, reload. This is a one-time task per table.

Conclusion

Column tags show you where PII lives, and tag-based masking hides it at query time. Neither changes what is stored, and users with access to unmasked raw tables can still read it. For real GDPR compliance, PII must be clean before it reaches the warehouse. That makes the ingestion layer as safe as the production layer.

This is harder than column tagging. But it is what "appropriate technical measures" actually means.

When This Approach Has Limits

Anonymizing before load is the right architectural move: clean data at the ingestion layer beats masking that only runs at read time. The limits are still worth stating plainly.

Pseudonymized join keys stay in legal scope. Step 2 of the implementation keeps order IDs and other keys intact through reversible pseudonymization so dbt models can still join. That is useful, but pseudonymized data is not anonymized data under GDPR. It remains personal data, fully in scope for Article 32, data subject access rights, and breach notification rules. You have not removed the obligation. You have relocated it onto the mapping that links pseudonyms back to people. Whoever holds that key holds re-identifiable data, and it must be guarded, access-logged, and retained no longer than the analytic purpose requires.

Detection accuracy bounds what reaches the clean schema. The pre-load step only protects fields the model recognized as sensitive. Free-text support ticket bodies, notes pasted into CRM fields, and source-specific formats are detected less reliably than structured email or phone columns. A name buried in an Intercom transcript that the model misses lands in the warehouse in plain form, where it sits indefinitely. Test your entity configuration against representative exports from each source system, and review held-out samples rather than assuming every channel is covered before you wire the DAG into production.
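
A rough way to run that test is to hand-label a sample per source and measure recall, the share of known entities the detector actually found. The sample data below is hypothetical, and exact string matching is a simplification of how entity spans would be compared in practice.

```python
def recall(expected, detected):
    """Share of expected entities that the detector found (1.0 if none expected)."""
    expected_set, detected_set = set(expected), set(detected)
    if not expected_set:
        return 1.0
    return len(expected_set & detected_set) / len(expected_set)


# Hypothetical hand-labelled samples per source: (expected, detected) entities.
samples = {
    "stripe_export": (["john@example.com"], ["john@example.com"]),
    "intercom_transcript": (["John Smith", "john@example.com"], ["john@example.com"]),
}

recall_by_source = {
    source: recall(expected, detected)
    for source, (expected, detected) in samples.items()
}
```

Reporting recall per source rather than as one overall number makes a weak channel, such as free-text transcripts, visible instead of averaged away.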

Quasi-identifiers survive direct-identifier removal. Stripping names and emails does not make a warehouse table anonymous if the remaining columns still single people out. Postal code plus birth date plus a rare product combination can re-identify an individual even with every direct identifier gone. Analytics tables are built precisely to preserve these correlations, so a dataset you treat as clean may still be pseudonymous in law. Decide deliberately which quasi-identifiers to generalize or drop, and document that judgment — the tool runs the masking, but the re-identification-risk assessment is yours to make and defend.
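
One common starting point for that assessment is a group-size check in the style of k-anonymity: count how many rows share each combination of quasi-identifiers and flag combinations shared by fewer than k rows. This technique is not prescribed above. It is a sketch with hypothetical column names and data, and passing it does not by itself make a table anonymous.

```python
from collections import Counter


def small_groups(rows, quasi_identifiers, k):
    """Return quasi-identifier combinations shared by fewer than k rows."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return {combo: n for combo, n in counts.items() if n < k}


rows = [
    {"postal_code": "10115", "birth_year": 1980, "product": "basic"},
    {"postal_code": "10115", "birth_year": 1980, "product": "basic"},
    {"postal_code": "10115", "birth_year": 1980, "product": "basic"},
    {"postal_code": "80331", "birth_year": 1975, "product": "rare-addon"},
]

risky = small_groups(rows, ["postal_code", "birth_year", "product"], k=3)
```

The single row with the rare product is flagged because nobody else shares its combination. Generalizing a column, for example truncating the postal code, is the usual way to merge such rows into larger groups.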
