GDPR-Safe Pipeline: Anonymize PII Before Storage
Updated for 2026
You tagged your PII columns in dbt. You set up dynamic masking in Snowflake. You feel GDPR-compliant.
Your source content still lands in the warehouse unmasked. Masking runs at query time. The unmasked content sits in your raw schema. Anyone with raw schema access can read it. Your dbt models ran before masking policies existed. Old ingested tables were never masked.
The gap between "we have masking policies" and "our pipeline is safe" is where GDPR violations happen.
See our compliance overview for how anonym.legal supports GDPR.
How ELT Pipelines Expose PII
The Extract-Load-Transform (ELT) pattern is now the norm. It loads source data into the warehouse first. Transforms come later. The steps look like this:
- Extract: Source systems export all fields. Salesforce CRM, Stripe payments, Intercom support — everything goes out.
- Load: Source data lands in the warehouse ingestion schema. Snowflake, BigQuery, and Redshift all work the same way. Every PII field is included.
- Transform: dbt models clean and join the data for analytics.
The ingestion layer holds full personal information. Names, email addresses, phone numbers, payment details, support ticket text. In many teams, engineers and analysts have raw schema access. They can query these tables at any time.
Tag-based masking in Snowflake helps at query time. But only where tags and policies are properly set up, which usually means downstream models. It does not change what is stored in old ingested tables. It does not block direct queries against raw tables that were never tagged. Every new model and column must be tagged. That burden grows as the schema grows.
Anonymize Before Load
Anonymizing PII at the pipeline level removes raw-layer risk. Do it before content lands in the warehouse.
ETL approach (pre-load anonymization):
- Extract from source systems
- Run through an anonymization step
- Load clean output into the warehouse
The warehouse never receives unmasked PII. The ingestion schema holds only clean content. Downstream models, dashboards, and direct queries all work with clean output.
You have two main paths.
Option 1 — API integration:
For systems with webhooks or streaming exports, route entries through the anonym.legal API first. Support tickets leaving Intercom go through the API before the warehouse. Stripe exports do the same.
POST /api/anonymize
{
  "text": "Customer John Smith (john@example.com) reported...",
  "entities": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
  "method": "replace"
}
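Here is a minimal Python sketch of that call, using only the standard library. The `/api/anonymize` path and the three body fields come from the request above. Everything else is an assumption for illustration: the `BASE_URL` placeholder, the bearer-token header, and the shape of the response, which is returned unparsed beyond JSON decoding. Check the real API documentation before relying on any of it.

```python
import json
import urllib.request

# Assumption: placeholder host. Only the /api/anonymize path and the
# text/entities/method fields are taken from the request shown above.
BASE_URL = "https://anonymizer.example.invalid"


def build_anonymize_request(text, entities, method="replace", api_key=None):
    """Build (but do not send) a POST request for the anonymize endpoint."""
    body = json.dumps({"text": text, "entities": entities, "method": method})
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: bearer-token auth. The real scheme may differ.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        BASE_URL + "/api/anonymize",
        data=body.encode("utf-8"),
        headers=headers,
        method="POST",
    )


def anonymize(text, entities, **kwargs):
    """Send the request and return the decoded JSON response as-is."""
    request = build_anonymize_request(text, entities, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Splitting the build step from the send step lets you unit-test the payload without touching the network.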
Option 2 — Batch preprocessing:
For daily or weekly CSV/JSON file exports, run files through batch processing before loading.
Airflow DAG structure:
extract_task >> anonymize_batch_task >> load_to_warehouse_task
The anonymize task uploads files and gets back clean versions. The load task handles the rest.
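The middle task can be sketched in plain Python. This is not the real service call. `mask_emails` is a stand-in that only handles email addresses, so the example runs on its own. In practice you would pass a function that calls your anonymization service. The function and column names here are hypothetical.

```python
import csv
import io
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text):
    """Stand-in anonymizer: replaces email addresses only."""
    return EMAIL_RE.sub("<EMAIL_ADDRESS>", text)


def anonymize_csv(raw_csv, pii_columns, anonymize=mask_emails):
    """Return a copy of raw_csv with the named columns passed through anonymize."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for column in pii_columns:
            row[column] = anonymize(row[column])
        writer.writerow(row)
    return out.getvalue()


def run_pipeline(extract, load, pii_columns):
    """Mirror extract >> anonymize >> load: load only ever sees clean output."""
    load(anonymize_csv(extract(), pii_columns))
```

The point of the shape is that `load` never receives the raw export. In Airflow, each of the three steps would become its own task.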
See our security practices page for sub-processor and data flow details.
What dbt Column Tags Do and Don't Do
dbt lets you tag PII columns:
models:
  - name: stg_customers
    columns:
      - name: email
        tags: ['pii', 'email']
      - name: full_name
        tags: ['pii', 'personal_data']
Tags let you:
- Document where PII lives
- Trigger downstream masking policies (requires warehouse-level setup)
- Track lineage with tools like Secoda
Tags do not:
- Mask ingested tables in the raw schema
- Block direct table queries
- Anonymize data at load time
- Retroactively mask old data
dbt column tags are a governance tool. They show you where PII is. On their own they are not the "appropriate technical and organisational measures" that GDPR Article 32 requires.
The Snowflake Masking Gap
Snowflake's dynamic masking hides column content from users at query time. It is a strong control for production use. But it has clear limits.
Key limits:
- Every new column needs an explicit policy
- Schema changes can leave new columns unmasked until you update policies
- A policy can exempt privileged roles such as SYSADMIN or ACCOUNTADMIN, and a role with enough privileges can alter or unset the policy
- Ingestion jobs often run under high-privilege roles that policies may exempt
- Old data loaded before policies were set is stored in plain form. Policies run at read time, not write time
Masking at query time is not enough. Data must be clean before it is stored.
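The schema-drift limit is easy to check for. Below is a small sketch that compares a PII inventory against the columns that currently have a policy attached. Both inputs are assumptions: the inventory might come from your dbt tags, and the masked list from your warehouse's policy metadata. The function name is hypothetical.

```python
def find_unmasked(pii_by_table, masked_by_table):
    """Return PII columns that have no masking policy, grouped by table."""
    gaps = {}
    for table, columns in pii_by_table.items():
        masked = set(masked_by_table.get(table, ()))
        missing = sorted(set(columns) - masked)
        if missing:
            gaps[table] = missing
    return gaps
```

Run on a schedule, a check like this catches new columns before anyone queries them. It still says nothing about what is already stored in plain form.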
Compliance Documentation
GDPR's accountability principle (Article 5(2)) requires you to demonstrate compliance. Words are not enough. For engineering teams this means written records.
Records of Processing Activities (ROPA): Document that customer information is anonymized before it loads to the analytics warehouse. The anonymization step is a processing activity under GDPR.
Technical safeguard notes: Write down which entity types your pipeline targets. Note the anonymization method used. Batch run logs give you this for free.
Data lineage: Secoda or dbt's built-in lineage can show that source tables flow through an anonymization step before reaching analytics models. This is your audit trail.
Vendor register: The anonymization service processes personal data on your behalf, so it is a processor (or a sub-processor, if you act as a processor yourself). Keep its DPA and privacy policy in your vendor register.
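For the technical safeguard notes, a run record can be as simple as one JSON line per batch. The sketch below is one possible shape. The field names are assumptions, not a required format.

```python
import json
from datetime import datetime, timezone


def run_record(source, entities, method, rows_processed, now=None):
    """Serialize one anonymization run as a JSON line for the audit log."""
    now = now or datetime.now(timezone.utc)
    return json.dumps(
        {
            "run_at": now.isoformat(),
            "source": source,
            "entities": sorted(entities),
            "method": method,
            "rows_processed": rows_processed,
        },
        sort_keys=True,
    )
```

Log counts and entity types only. The record itself must never contain the values that were removed.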
Implementation Steps
For a dbt and Snowflake pipeline:
Step 1: Audit your raw layer
Find which tables hold personal information. Query your dbt column tags or your catalog for PII-tagged tables.
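One way to query the tags is to read dbt's `manifest.json` artifact. The sketch below assumes the manifest layout in which each entry under `nodes` and `sources` has a `columns` mapping whose values carry a `tags` list. Verify that against the manifest your dbt version writes. The `target/manifest.json` path is dbt's usual default, and the function names are hypothetical.

```python
import json


def pii_columns(manifest, tag="pii"):
    """Map each dbt node or source to the columns carrying the given tag."""
    found = {}
    for section in ("nodes", "sources"):
        for unique_id, node in manifest.get(section, {}).items():
            tagged = sorted(
                name
                for name, column in node.get("columns", {}).items()
                if tag in column.get("tags", [])
            )
            if tagged:
                found[unique_id] = tagged
    return found


def load_manifest(path="target/manifest.json"):
    """Load the manifest from disk (assumed default dbt target path)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

This only finds columns someone already tagged. Untagged raw tables still need a manual review.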
Step 2: Set the anonymization scope
For each source table, decide which columns hold PII. Then decide which need anonymization and which need pseudonymization. Support ticket body: anonymize. Order ID: pseudonymize to keep join keys intact. Timestamp: keep as-is for time-series analysis.
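The scope can live in code as a per-column map. The sketch below uses the three examples from this step. The column names, the `SCOPE` dict, and the HMAC-SHA256 pseudonymization are illustrative assumptions, not a prescribed method. The `anonymize` argument stands in for whatever anonymization call you use.

```python
import hashlib
import hmac

# Hypothetical scope for one source table, following the examples in the text.
SCOPE = {
    "ticket_body": "anonymize",
    "order_id": "pseudonymize",
    "created_at": "keep",
}


def pseudonymize(value, key):
    """Keyed hash: the same input and key always give the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def apply_scope(row, scope, key, anonymize):
    """Apply the per-column decision to one row."""
    out = {}
    for column, value in row.items():
        action = scope[column]  # KeyError means the column was never reviewed
        if action == "keep":
            out[column] = value
        elif action == "pseudonymize":
            out[column] = pseudonymize(value, key)
        elif action == "anonymize":
            out[column] = anonymize(value)
        else:
            raise ValueError(f"unknown action {action!r} for {column!r}")
    return out
```

Failing on unknown columns is deliberate. A new source field should stop the load until someone decides how to treat it.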
Step 3: Pick an implementation path
Small team with batch exports: use batch file processing before load. Engineering team available: build API integration in Airflow or Prefect.
Step 4: Test and validate
Run anonymization on a sample before going live. Check that dbt models still work. Some models join on email. Those need consistent replacement values. Pseudonymization keeps join keys. Redaction breaks them.
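A quick way to validate joins is to count matches before and after. The sketch below assumes HMAC-SHA256 as the pseudonymization function. That is an illustrative choice, and the helper names are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value, key):
    """Keyed hash: equal inputs map to equal tokens under the same key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def transform(rows, column, fn):
    """Return copies of rows with fn applied to one column."""
    return [{**row, column: fn(row[column])} for row in rows]


def join_count(left, right, column):
    """Count left rows that have at least one match in right on column."""
    right_keys = {row[column] for row in right}
    return sum(1 for row in left if row[column] in right_keys)
```

Run `join_count` on the raw sample, then on the transformed sample. With a consistent keyed hash the counts should be equal. With a constant placeholder such as `<EMAIL_ADDRESS>`, every row looks like a match, so the count changes and the join is no longer meaningful.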
Step 5: Handle old raw tables
Content loaded before anonymization was in place needs retroactive processing. Export, anonymize, reload. This is a one-time task per table.
Conclusion
Column tags show you where PII lives. Query-time masking hides it in the models it covers. Neither stops users with raw schema access from reading what is stored. For real GDPR compliance, PII must be clean before it reaches the warehouse. That makes the ingestion layer as safe as the production layer.
This is harder than column tagging. But it is what "appropriate technical measures" actually means.