CSV Free-Text PII: Beyond Column Deletion

The Gap That Column Deletion Misses

Updated for 2026

Research datasets move between universities as CSV files. When teams prep a CSV for sharing, the work is column-based. Find the personal info. Delete or replace it.

That method works for fixed fields. A column named "email" holds email addresses — delete it. A column named "phone" holds phone numbers — delete it. A column named "participant_name" holds names — swap it for a code.

But free-text response columns are a blind spot. Removing labeled columns does not touch them.

A survey with 5,000 rows might have five structured PII columns and fifteen open-text response columns. The structured ones hold names, emails, phone numbers, IDs, and birth years. The open-text ones hold comments, notes, and suggestions.

The structured columns get cleaned. The open-text columns stay raw. But people write things like these three examples.

First: "My doctor at Boston Medical Center, Dr. Maria Santos, said the treatment was new." Second: "I've been dealing with this since my 2019 accident." Third: "You can reach my caregiver at margaret.wells@gmail.com for details."

The first and third entries name a real person, and the third adds contact info. The first and second include health facts. None of this appears in a column header. None of it is caught by column deletion.
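The gap can be shown in a few lines. The sketch below is a minimal illustration, not a production check: the column names, the two sample rows, and the simple email pattern are all assumptions made for the example. It drops the labeled PII columns, then looks for an email address that is still sitting in the comment column.

```python
import csv
import io
import re

# Hypothetical two-row survey export; column names and rows are assumptions.
RAW_CSV = """participant_name,email,comment
Ana Ruiz,ana@example.org,"You can reach my caregiver at margaret.wells@gmail.com for details."
Ben Cole,ben@example.org,"I've been dealing with this since my 2019 accident."
"""

PII_COLUMNS = {"participant_name", "email"}
# Deliberately simple email pattern, used only to demonstrate the leak.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def drop_columns(csv_text, columns):
    """Return rows as dicts with the named columns removed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{k: v for k, v in row.items() if k not in columns} for row in reader]


def rows_with_email(rows, column):
    """Return indexes of rows whose free-text column still contains an email."""
    return [i for i, row in enumerate(rows) if EMAIL_RE.search(row[column])]


cleaned = drop_columns(RAW_CSV, PII_COLUMNS)
leaks = rows_with_email(cleaned, "comment")
print(list(cleaned[0].keys()))  # only the free-text column is left
print(leaks)                    # rows that still carry an email address
```

The labeled columns are gone, yet the caregiver's address in the first comment survives untouched. The second comment carries a health fact that no email pattern will ever flag.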

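Why This Fails the GDPR Standard

GDPR Recital 26 treats information as anonymous only when it no longer relates to an identified or identifiable person. The bar is high. The test takes into account all the means reasonably likely to be used to identify someone, so records are only truly anonymous when re-identification is not reasonably possible.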

A CSV with clean fixed columns but named people in open-text does not pass that test. Those names are identifiable. The dataset is still personal data, so the GDPR still applies, including the Article 89 safeguards. Three risks follow.

Article 89 research exemption: Article 89 allows derogations from some data-subject rights when personal data is processed for scientific research. But only where "appropriate safeguards" exist. Sharing a file with open-text PII while claiming Article 89 cover undermines that claim.

Ethics approval: Most IRBs and ethics boards require full anonymization for shared datasets. Partial work — fixed columns cleaned, open-text left raw — typically fails. The board can reject the submission.

Data sharing agreements: DSAs between institutions set the required anonymization level. Partial work that fails GDPR Recital 26 may breach the DSA. See our Legal Compliance overview for how this fits a wider program.

Why Open-Text Is So Hard to Clean

Free-text survey answers are among the hardest PII targets. Here is why.

Names in context: "Dr. Maria Santos at Boston Medical Center" requires named entity recognition (NER) to flag a person and an org. Keyword lists cannot find this.

Names in stories: "John Henderson's car hit mine" puts a real name inside a story. It is a person named in passing. Only NER catches it.

Non-standard formats: Contact info may read "reach me at margaret dot wells at gmail." Simple regex tools miss these.

Research-specific terms: Clinical surveys often contain hospital IDs, site codes, and place names. These can identify a person even when they look generic.

So pattern matching alone is not enough. NLP-based tools are needed for real survey anonymization. See Security & Compliance for technical options.
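A short sketch makes the limits of pattern matching concrete. The two patterns below are illustrative assumptions, not a recommended rule set: one is a conventional email regex, the other a hand-written heuristic for one spelled-out variant ("name dot name at provider"). The sample strings are taken from the examples in this article.

```python
import re

# Conventional email pattern (simplified for illustration).
STRICT_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Heuristic for ONE spelled-out variant: "<word> dot <word> ... at <word>".
SPELLED_OUT = re.compile(r"\b\w+(?:\s+dot\s+\w+)+\s+at\s+\w+\b", re.IGNORECASE)


def find_contacts(text):
    """Return (pattern_name, matched_text) pairs found by either pattern."""
    hits = [("strict", m.group()) for m in STRICT_EMAIL.finditer(text)]
    hits += [("spelled_out", m.group()) for m in SPELLED_OUT.finditer(text)]
    return hits


samples = [
    "You can reach my caregiver at margaret.wells@gmail.com for details.",
    "reach me at margaret dot wells at gmail",
    "John Henderson's car hit mine",
]
results = [find_contacts(s) for s in samples]
for sample, hits in zip(samples, results):
    print(hits, "<-", sample)
```

The strict pattern misses the spelled-out address, and the extra heuristic only covers the one variant it was written for. Neither pattern finds the name in the third sample, which is the case that needs NER.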

A Real Example From Three Universities

A research team at three European universities ran a patient experience survey. The dataset had 5,000 respondents, 3 fixed PII columns, and 8 open-text columns. The plan was to share the file across sites under a DSA and GDPR Article 89.

With column deletion only:

Fixed PII columns: removed
Open-text columns: left raw
Claim: "PII columns deleted"
PII left behind: 47 named people, 23 email addresses in comments, 18 place names that could identify respondents

With NLP-based detection:

Fixed PII columns: replaced with consistent tokens
Open-text columns: 47 names replaced, 23 emails masked, 18 place names made generic ("Boston Medical Center" → "[Healthcare Institution]")
Result: a file that passes GDPR Recital 26
Ethics board approved the method
DPO confirmed DSA compliance

The gap is real. The first output looks clean. The second output is clean.
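The "consistent tokens" used for the fixed PII columns can be sketched with a keyed hash. This is one possible approach under stated assumptions, not the method the research team used: the function name `make_tokenizer`, the `P-` prefix, the 12-character truncation, and the sample rows are all hypothetical. The same input always yields the same token, so records can still be linked across files.

```python
import hashlib
import hmac


def make_tokenizer(secret_key: bytes, prefix: str = "P"):
    """Return a function that maps the same input value to the same token."""
    def tokenize(value: str) -> str:
        # Normalize so trivial differences in case or spacing do not split one person.
        normalized = value.strip().lower()
        digest = hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256)
        return f"{prefix}-{digest.hexdigest()[:12]}"
    return tokenize


# Placeholder key: in practice it would be stored apart from the shared file.
tokenize = make_tokenizer(b"replace-with-a-project-secret")

rows = [
    {"participant_name": "Maria Santos", "site": "A"},
    {"participant_name": "maria santos ", "site": "B"},
    {"participant_name": "John Henderson", "site": "A"},
]
for row in rows:
    # Swap the name column for a stable code.
    row["participant_id"] = tokenize(row.pop("participant_name"))

print(rows)
```

Anyone holding the key can regenerate the mapping, so tokens of this kind are pseudonyms rather than anonymous values. That is one reason the key and any code list should be kept separate from the shared dataset.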

A Five-Step Pre-Sharing Protocol

Use these steps before sharing any survey or interview file.

Step 1: Label each column. Mark every column as fixed PII, fixed non-PII, or open-text. Write it down.

Step 2: Handle fixed PII. Delete entries not needed for analysis. Replace entries needed for linking records. Record the codes used.

Step 3: Scan open-text columns. Run NLP detection on all open-text columns. Review each result. Confirm which ones are real PII.

Step 4: Apply replacements. Replace confirmed PII in the open-text output. Use clear labels like [PERSON], [EMAIL], or [LOCATION].

Step 5: Verify and document. Sample 50–100 rows from the output. Check the open-text entries by hand. Write a short summary: tools used, entity types found, columns processed. Share it with the file for ethics review.

This turns "we deleted the name column" into a clear, documented process. It supports a GDPR Article 89 safeguards claim and gives ethics boards the anonymization evidence they typically ask for. Visit our docs hub for related guides.
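Steps 3 to 5 can be sketched as a small pipeline. Everything here is an assumption made for illustration: the `Span` shape, the function names, and above all `stand_in_detector`, which only finds plain email addresses and stands in for whatever NLP detector is actually used. The sketch also assumes the confirmed spans do not overlap.

```python
import random
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def stand_in_detector(text: str) -> List[Span]:
    """Placeholder for an NLP detector: finds plain email addresses only."""
    pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    return [(m.start(), m.end(), "EMAIL") for m in pattern.finditer(text)]


def apply_replacements(text: str, spans: List[Span]) -> str:
    """Replace each span with its bracketed label, working right to left
    so earlier offsets stay valid. Assumes non-overlapping spans."""
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


def anonymize_column(values: List[str], detect: Callable[[str], List[Span]]) -> List[str]:
    """Run detection and replacement over every value in one open-text column."""
    return [apply_replacements(v, detect(v)) for v in values]


def sample_for_review(values: List[str], k: int, seed: int = 0) -> List[Tuple[int, str]]:
    """Pick up to k (row index, text) pairs for the manual check in Step 5."""
    rng = random.Random(seed)
    indexes = sorted(rng.sample(range(len(values)), min(k, len(values))))
    return [(i, values[i]) for i in indexes]


comments = [
    "You can reach my caregiver at margaret.wells@gmail.com for details.",
    "I've been dealing with this since my 2019 accident.",
]
cleaned = anonymize_column(comments, stand_in_detector)
review = sample_for_review(cleaned, k=50)
print(cleaned[0])
```

Keeping the detector as a plain function argument means the replacement and sampling code does not change when the detector does. The human review in Step 3 would sit between detection and replacement; it is omitted here for brevity.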

When This Approach Has Limits

Treating free-text response columns as a real PII surface and applying NLP rather than column deletion is the correct insight — names in narrative answers genuinely escape keyword and column-based methods. But three limits apply.

NLP recall is not complete on messy survey prose. Named entity recognition catches "Dr. Maria Santos at Boston Medical Center" far better than keyword lists, but survey free-text is some of the hardest input there is: misspellings, code-switching, obfuscated contact details like "margaret dot wells at gmail," and culturally varied names all lower accuracy. A residual false-negative rate means some named people survive the scan. The three-university result — 47 names, 23 emails, 18 places handled — is one dataset, not a guaranteed rate. The protocol's manual sample of 50 to 100 rows is essential precisely because the model will miss cases; keep it.

Removing names may not clear the GDPR Recital 26 bar. Masking direct identifiers in open text is necessary, but Recital 26 asks whether re-identification is reasonably possible at all. Survey responses often carry quasi-identifiers — a rare condition, a specific clinic, an exact date, an unusual event described in a comment — that can single out a respondent even after every name and email is gone. Generalizing "Boston Medical Center" to a healthcare-institution label helps, but the residual linkage risk is a judgment about the whole record, not a per-entity fix. The output may be pseudonymized rather than truly anonymous.
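One rough way to see residual linkage risk is to count how many rows share each combination of quasi-identifiers. The sketch below is a simplified illustration, not a full re-identification assessment: the field names and rows are invented, and it assumes the quasi-identifiers have already been pulled into structured fields, which free-text answers do not provide on their own.

```python
from collections import Counter


def singled_out_rows(rows, quasi_identifiers):
    """Return indexes of rows whose quasi-identifier combination is unique."""
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] == 1]


# Hypothetical rows after names and emails have been masked.
rows = [
    {"clinic": "[Healthcare Institution]", "condition": "asthma", "year": "2019"},
    {"clinic": "[Healthcare Institution]", "condition": "asthma", "year": "2019"},
    {"clinic": "[Healthcare Institution]", "condition": "rare condition X", "year": "2019"},
]
unique = singled_out_rows(rows, ["clinic", "condition", "year"])
print(unique)  # rows that stand alone on these fields
```

A row that is unique on these fields is not automatically re-identifiable, and a row that is not unique is not automatically safe. The count is an input to the reviewer's judgment about the whole record.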

Anonymity for research is an expert judgment, not a tool output. Article 89 protection and DSA compliance depend on "appropriate safeguards," and an IRB, DPO, or DSA reviewer decides whether the dataset clears that standard. A detection pass does not. The tool supports that decision by surfacing and replacing PII and by producing the documentation reviewers expect; it does not make the determination. In the three-university example, the ethics board approved the method and the DPO confirmed DSA compliance after the processing was complete. Keep that human sign-off, and treat the NLP step as evidence feeding a legal judgment rather than the judgment itself.

Sources

GDPR Article 89: Safeguards for Scientific Research
GDPR Recital 26: Anonymisation Principle
ICO: Anonymisation and Data Protection Risk
