Search for GDPR erasure on Delta Lake and almost everything you find assumes you are running Databricks: a workspace, a Spark cluster, a Delta Live Tables pipeline that re-processes the data, a scheduled job that calls VACUUM from a notebook. Databricks documents that pattern well for its own runtime, and the delta-rs project covers the same operations through Python and Rust APIs. Both are fine answers if those stacks are what you operate. But a lot of teams hold Delta tables in an object store and want to handle a deletion request the way they would on a database: open a SQL session, run the statements, capture the evidence, close the ticket.
That is the gap this article fills. DeltaForge is a commercial, customer-installed SQL engine that reads and writes Delta Lake natively, so the entire Article 17 lifecycle below is SQL statements against your own storage, executed inside your own network. Every snippet is taken from the delta-gdpr-data-erasure demo that ships in the DeltaForge demo library, so you can run the whole flow yourself against a throwaway table before you touch production.
Does DELETE actually remove data from a Delta table?
No, and this is the single most important fact in the whole topic. Delta Lake never modifies a Parquet file in place. When you run DELETE or UPDATE, the engine identifies the data files that contain matching rows, writes new files that exclude (or rewrite) those rows, and records the swap in the transaction log: remove actions for the old files, add actions for the new ones. The old files are not touched. They stay in storage precisely so that time travel works, which means a query with VERSION AS OF can read your "deleted" personal data for as long as those files survive.
So a Delta erasure has two distinct halves, and a regulator-grade process needs both:
- Logical erasure
- The UPDATE or DELETE commit. Atomic, auditable, immediately effective for every query against the current version. The PII is gone from the live table.
- Physical erasure
- VACUUM removing the superseded Parquet files after the retention window. Only at this point does the PII stop existing in storage, and only at this point does time travel stop being able to resurrect it.
Everything between those two events is a residual-risk window. The walkthrough below makes that window visible, on purpose, because being able to demonstrate it (and demonstrate its closure) is what turns an ad hoc delete into a defensible compliance procedure.
How do you implement the right to be forgotten on Delta Lake?
The lifecycle is four statements long: one to erase, one to audit, one to prove the exposure, one to purge. The demo models a bank holding 30 customer accounts with three PII columns (ssn, phone, mailing_address) alongside the operational data the bank is legally entitled to keep. Customers 1 through 10 have invoked their right to erasure.
First, the table. The demo creates a Delta table directly at an object-store location; the engine writes the _delta_log and Parquet files itself:
CREATE ZONE IF NOT EXISTS external TYPE EXTERNAL
COMMENT 'External and Delta tables for demo datasets';
CREATE SCHEMA IF NOT EXISTS external.delta_demos
COMMENT 'Delta table management tutorial demos';
CREATE DELTA TABLE IF NOT EXISTS external.delta_demos.gdpr_customer_accounts (
id INT,
account_holder VARCHAR,
email VARCHAR,
ssn VARCHAR,
phone VARCHAR,
mailing_address VARCHAR,
account_type VARCHAR,
branch_city VARCHAR,
country VARCHAR,
opened_date VARCHAR,
balance DECIMAL(12,2)
) LOCATION 's3://acme-lake/demo/delta-gdpr-data-erasure/gdpr_customer_accounts';
INSERT INTO external.delta_demos.gdpr_customer_accounts VALUES
(1, 'Alice Monroe', 'alice@bankmail.com', '123-45-6789', '+1-555-0201', '100 Wall St, NY', 'checking', 'New York', 'US', '2022-01-10', 15420.50),
(2, 'Bob Chen', 'bob@bankmail.com', '234-56-7890', '+1-555-0202', '200 State St, Chicago', 'savings', 'Chicago', 'US', '2022-02-15', 82300.00),
(11, 'Karen Liu', 'karen@bankmail.com', '111-22-3333', '+86-21-1111', '28 Nanjing Rd', 'savings', 'Shanghai', 'CN', '2022-11-15', 156000.00);
Now the erasure itself. The demo nulls the PII columns rather than deleting the rows, because the bank must retain the accounts for financial-record obligations; GDPR erasure targets the personal data, not necessarily the record. If your retention analysis says the whole row goes, substitute DELETE FROM ... WHERE and the rest of the lifecycle is identical, because both commands produce the same copy-on-write commit shape.
UPDATE external.delta_demos.gdpr_customer_accounts
SET ssn = NULL, phone = NULL, mailing_address = NULL
WHERE id BETWEEN 1 AND 10;
Verify the logical state. The live table should show exactly ten erased accounts and twenty untouched ones:
SELECT
COUNT(*) AS total_accounts,
COUNT(*) FILTER (WHERE ssn IS NULL) AS erased_ssn,
COUNT(*) FILTER (WHERE phone IS NULL) AS erased_phone,
COUNT(*) FILTER (WHERE mailing_address IS NULL) AS erased_address
FROM external.delta_demos.gdpr_customer_accounts;
The audit trail comes free
Every Delta commit is recorded in the transaction log with its operation, timestamp, and predicate. DESCRIBE HISTORY reads that log back as a result set, which is exactly what a compliance officer needs to evidence when the erasure happened and what it touched:
DESCRIBE HISTORY external.delta_demos.gdpr_customer_accounts;
Prove the exposure before you close it
Here is the part most runbooks skip. After the UPDATE, version 2 of the table still contains every SSN, and any reader with time-travel access can see them. The demo asserts this deliberately, using DeltaForge's ASSERT command (an inline expectation that fails the script if the result diverges):
ASSERT ROW_COUNT = 1
ASSERT VALUE ssn = '123-45-6789' WHERE id = 1
SELECT id, account_holder, ssn, phone
FROM external.delta_demos.gdpr_customer_accounts VERSION AS OF 2
WHERE id = 1;
SELECT COUNT(*) AS exposed_records
FROM external.delta_demos.gdpr_customer_accounts VERSION AS OF 2
WHERE ssn IS NOT NULL;
This is the risk your deletion procedure exists to close, and being able to show it (PII recoverable at T1, unrecoverable at T2) is far stronger evidence than a screenshot of a successful UPDATE.
VACUUM completes the physical deletion
VACUUM lists the table's storage directory, identifies files no longer referenced by the current version that are older than the retention threshold, and deletes them. It returns a small result set with files_deleted and bytes_freed, which belongs in the erasure ticket:
VACUUM external.delta_demos.gdpr_customer_accounts RETAIN 0 HOURS;
After this statement, the VERSION AS OF 2 queries above fail with file-not-found, because the Parquet files holding the PII no longer exist. That failure is the success condition. A final grouped check confirms the live table is in the intended end state:
SELECT
CASE WHEN id BETWEEN 1 AND 10 THEN 'Erased (ids 1-10)' ELSE 'Intact (ids 11-30)' END AS account_group,
COUNT(*) AS accounts,
COUNT(ssn) AS has_ssn,
COUNT(phone) AS has_phone,
COUNT(mailing_address) AS has_address
FROM external.delta_demos.gdpr_customer_accounts
GROUP BY CASE WHEN id BETWEEN 1 AND 10 THEN 'Erased (ids 1-10)' ELSE 'Intact (ids 11-30)' END
ORDER BY account_group;
How do deletion vectors and VACUUM affect GDPR erasure?
Deletion vectors change the physical story in a way that matters specifically for erasure. With delta.enableDeletionVectors set to true (it defaults to false in DeltaForge), a DELETE no longer rewrites the affected Parquet file. Instead it writes a per-file RoaringBitmap recording which row positions are logically gone, and readers mask those rows out after decoding. That is a large performance win when an operation touches a tiny fraction of a big file, and a single-subject GDPR delete is exactly that shape.
But note what it means for compliance: the deleted rows are still physically present, byte for byte, inside the current data file. The bitmap hides them from queries; it does not remove them from storage. So on a deletion-vector table the lifecycle gains a step:
-
Delete logically. The DELETE commits a deletion vector against the affected files. Fast, atomic, auditable, and the rows vanish from every query against the current version.
-
Rewrite physically with OPTIMIZE. A compaction rewrite applies the vectors: the new files simply do not contain the masked rows, and they carry no vector.
OPTIMIZE external.delta_demos.gdpr_customer_accounts; -
Purge with VACUUM. The pre-OPTIMIZE files (which still contain the subject's rows) and the superseded vector files both become removable once the retention window passes. DeltaForge's VACUUM treats vector files and data files uniformly.
On retention: VACUUM's default window is 168 hours, and that default exists to protect concurrent readers and in-flight time travel. RETAIN 0 HOURS is appropriate in the demo and in single-writer maintenance windows, but on a shared production table the safer pattern is to schedule the erasure, let the standard retention elapse, and then VACUUM, treating the documented retention period as part of your published erasure SLA. You can preview exactly what a vacuum will remove without removing it:
VACUUM external.delta_demos.gdpr_customer_accounts RETAIN 168 HOURS DRY RUN;
The change data feed keeps its own copy
If the table has change data feed enabled, there is a third place the PII lives: the _change_data directory, where each commit's row-level changes are materialized for downstream consumers. An erasure UPDATE writes the pre-image of every changed row into that feed, SSNs included. CDF files deliberately follow a longer retention than data files (by default they track the log retention duration, 30 days, because streaming consumers need to catch up), so a VACUUM that purges your old data files can leave pre-erasure values sitting in the change feed.
DeltaForge extends VACUUM with an independent CDC_RETAIN clause so you can bound both windows explicitly in one statement:
VACUUM external.delta_demos.gdpr_customer_accounts RETAIN 168 HOURS CDC_RETAIN 168 HOURS;
When you set your erasure SLA, take the maximum of the two windows, and confirm that every downstream consumer of the change feed applies the deletes it reads rather than archiving them.
Is pseudonymisation an alternative to physical deletion?
Legally, no. GDPR Article 4(5) defines pseudonymisation as processing that prevents attribution to a data subject without additional information, and the regulation is explicit that pseudonymised data is still personal data. A keyed hash of a surname is reversible by anyone holding the key and a candidate list. So when a data subject invokes Article 17, replacing their PII with pseudonyms does not, on its own, discharge the request; the lifecycle above does.
Where pseudonymisation earns its place is everywhere outside the erasure ticket: it shrinks the surface of PII your analysts, dashboards, and downstream extracts ever see, which reduces both your breach exposure and the number of places an erasure has to chase copies. DeltaForge ships this as a first-class SQL object. A pseudonymisation rule attaches to a table column and transforms values at query time; nothing on disk changes, and aggregations stay correct because they compute over the stored values:
CREATE PSEUDONYMISATION RULE ON external.pseudonymisation_demos.bank_customers (ssn)
TRANSFORM redact
PARAMS (mask = '***-**-****');
CREATE PSEUDONYMISATION RULE ON external.pseudonymisation_demos.bank_customers (phone)
TRANSFORM mask
PARAMS (show = 5);
CREATE PSEUDONYMISATION RULE ON external.pseudonymisation_demos.bank_customers (last_name)
TRANSFORM keyed_hash
SCOPE person
PARAMS (salt = 'bank_name_salt_2024');
CREATE PSEUDONYMISATION RULE ON external.pseudonymisation_demos.bank_customers (date_of_birth)
TRANSFORM generalize
SCOPE relationship
PARAMS (range = 10);
With the rules active, an ordinary SELECT returns the protected shapes, and SHOW PSEUDONYMISATION RULES FOR the table enumerates what is in force for an auditor:
SHOW PSEUDONYMISATION RULES FOR external.pseudonymisation_demos.bank_customers;
SELECT
customer_id,
first_name,
last_name AS last_name_hashed,
date_of_birth AS dob_generalized,
phone AS phone_masked,
ssn AS ssn_redacted,
account_tier,
balance
FROM external.pseudonymisation_demos.bank_customers;
The clean mental model: pseudonymisation governs who sees what today; erasure governs what exists at all. A mature GDPR posture on a data lake uses both, and neither substitutes for the other.
Run it yourself, and what this does not cover
Both demos used here, delta-gdpr-data-erasure and pseudonymisation-quickstart, ship in the DeltaForge demo library with setup, assertion-checked queries, and cleanup scripts, so the full lifecycle is reproducible end to end on your own object store. The SQL runs in the DeltaForge SQL engine, a commercial, customer-installed product: the engine, the Delta log, and the PII all stay inside your network, which is itself a meaningful property when the subject matter is personal data.
Honest scope notes. First, this article covers the table-level mechanics; a real Article 17 process also has to chase derived copies (extracts, ML feature stores, backups), and no table format solves that for you, though Delta's audit log tells you which tables changed and when. Second, object-store versioning or snapshot policies outside Delta's control can retain deleted objects after VACUUM; align your bucket lifecycle rules with your erasure SLA. Third, deleted data remains in _delta_log only as metadata (file paths and statistics, which can include min/max values for PII columns); if your columns are sensitive enough that log statistics matter, factor log retention into the same review. None of these caveats is unique to DeltaForge; they are the parts of the problem that live above the storage format, and pretending otherwise is how erasure procedures fail audits.