GDPR Right to Be Forgotten on Delta Lake, in SQL

Q: How do you implement the right to be forgotten on Delta Lake?

Run an UPDATE or DELETE that removes the data subject's PII, audit the commit with DESCRIBE HISTORY, then run VACUUM after the retention window so the pre-erasure Parquet files are physically removed from storage. Until VACUUM completes, time travel can still read the old versions, so the erasure is not finished at the UPDATE. In DeltaForge the whole lifecycle is plain SQL against a customer-installed engine; no Spark cluster or Databricks job is required.

Q: Does DELETE actually remove data from a Delta table?

Not immediately. Delta Lake is copy-on-write: DELETE and UPDATE write new Parquet files without the affected rows and record remove and add actions in the transaction log. The old files stay on storage to support time travel, so deleted personal data remains readable through VERSION AS OF until VACUUM physically purges files older than the retention period. A GDPR erasure is only complete after VACUUM has removed the superseded files.

Q: How do deletion vectors and VACUUM affect GDPR erasure?

Deletion vectors make DELETE faster by masking rows with a bitmap instead of rewriting files, but the masked rows remain physically present inside the current Parquet file. For erasure you must run OPTIMIZE to rewrite the file without those rows, then VACUUM to remove the superseded file and the old vector. VACUUM's RETAIN clause bounds data files; DeltaForge's CDC_RETAIN clause separately bounds change data feed files, which also carry pre-erasure values.

Q: Is pseudonymisation an alternative to physical deletion?

No. Under GDPR Article 4(5), pseudonymised data is still personal data because it can be re-attributed using additional information, so it does not satisfy an Article 17 erasure request on its own. It is a complementary safeguard: DeltaForge's CREATE PSEUDONYMISATION RULE redacts, masks, hashes, or generalizes PII at query time without changing what is stored, which reduces day-to-day exposure while physical erasure remains the answer to a deletion request.

Search for GDPR erasure on Delta Lake and almost everything you find assumes you are running Databricks: a workspace, a Spark cluster, a Delta Live Tables pipeline that re-processes the data, a scheduled job that calls VACUUM from a notebook. Databricks documents that pattern well for its own runtime, and the delta-rs project covers the same operations through Python and Rust APIs. Both are fine answers if those stacks are what you operate. But a lot of teams hold Delta tables in an object store and want to handle a deletion request the way they would on a database: open a SQL session, run the statements, capture the evidence, close the ticket.

That is the gap this article fills. DeltaForge is a commercial, customer-installed SQL engine that reads and writes Delta Lake natively, so the entire Article 17 lifecycle below is SQL statements against your own storage, executed inside your own network. Every snippet is taken from the delta-gdpr-data-erasure demo that ships in the DeltaForge demo library, so you can run the whole flow yourself against a throwaway table before you touch production.

Does DELETE actually remove data from a Delta table?

No, and this is the single most important fact in the whole topic. Delta Lake never modifies a Parquet file in place. When you run DELETE or UPDATE, the engine identifies the data files that contain matching rows, writes new files that exclude (or rewrite) those rows, and records the swap in the transaction log: remove actions for the old files, add actions for the new ones. The old files are not touched. They stay in storage precisely so that time travel works, which means a query with VERSION AS OF can read your "deleted" personal data for as long as those files survive.

So a Delta erasure has two distinct halves, and a regulator-grade process needs both:

Logical erasure: The UPDATE or DELETE commit. Atomic, auditable, immediately effective for every query against the current version. The PII is gone from the live table.
Physical erasure: VACUUM removing the superseded Parquet files after the retention window. Only at this point does the PII stop existing in storage, and only at this point does time travel stop being able to resurrect it.

Everything between those two events is a residual-risk window. The walkthrough below makes that window visible, on purpose, because being able to demonstrate it (and demonstrate its closure) is what turns an ad hoc delete into a defensible compliance procedure.

How do you implement the right to be forgotten on Delta Lake?

The lifecycle is four statements long: one to erase, one to audit, one to prove the exposure, one to purge. The demo models a bank holding 30 customer accounts with three PII columns (ssn, phone, mailing_address) alongside the operational data the bank is legally entitled to keep. Customers 1 through 10 have invoked their right to erasure.

First, the table. The demo creates a Delta table directly at an object-store location; the engine writes the _delta_log and Parquet files itself:

CREATE ZONE IF NOT EXISTS external TYPE EXTERNAL
    COMMENT 'External and Delta tables for demo datasets';

CREATE SCHEMA IF NOT EXISTS external.delta_demos
    COMMENT 'Delta table management tutorial demos';

CREATE DELTA TABLE IF NOT EXISTS external.delta_demos.gdpr_customer_accounts (
    id               INT,
    account_holder   VARCHAR,
    email            VARCHAR,
    ssn              VARCHAR,
    phone            VARCHAR,
    mailing_address  VARCHAR,
    account_type     VARCHAR,
    branch_city      VARCHAR,
    country          VARCHAR,
    opened_date      VARCHAR,
    balance          DECIMAL(12,2)
) LOCATION 's3://acme-lake/demo/delta-gdpr-data-erasure/gdpr_customer_accounts';

INSERT INTO external.delta_demos.gdpr_customer_accounts VALUES
    (1,  'Alice Monroe', 'alice@bankmail.com', '123-45-6789', '+1-555-0201', '100 Wall St, NY',       'checking', 'New York', 'US', '2022-01-10', 15420.50),
    (2,  'Bob Chen',     'bob@bankmail.com',   '234-56-7890', '+1-555-0202', '200 State St, Chicago', 'savings',  'Chicago',  'US', '2022-02-15', 82300.00),
    (11, 'Karen Liu',    'karen@bankmail.com', '111-22-3333', '+86-21-1111', '28 Nanjing Rd',         'savings',  'Shanghai', 'CN', '2022-11-15', 156000.00);

Setup from the demo, abridged to 3 of the 30 seed rows. The full insert script ships with the demo.

Now the erasure itself. The demo nulls the PII columns rather than deleting the rows, because the bank must retain the accounts for financial-record obligations; GDPR erasure targets the personal data, not necessarily the record. If your retention analysis says the whole row goes, substitute DELETE FROM ... WHERE and the rest of the lifecycle is identical, because both commands produce the same copy-on-write commit shape.

UPDATE external.delta_demos.gdpr_customer_accounts
SET ssn = NULL, phone = NULL, mailing_address = NULL
WHERE id BETWEEN 1 AND 10;

One atomic commit erases all three PII columns for the ten requesting customers. The table moves to a new version; the old files are now unreferenced but still on disk.

Verify the logical state. The live table should show exactly ten erased accounts and twenty untouched ones:

SELECT
    COUNT(*) AS total_accounts,
    COUNT(*) FILTER (WHERE ssn IS NULL) AS erased_ssn,
    COUNT(*) FILTER (WHERE phone IS NULL) AS erased_phone,
    COUNT(*) FILTER (WHERE mailing_address IS NULL) AS erased_address
FROM external.delta_demos.gdpr_customer_accounts;

Returns one row: total_accounts 30, erased_ssn 10, erased_phone 10, erased_address 10.

The audit trail comes free

Every Delta commit is recorded in the transaction log with its operation, timestamp, and predicate. DESCRIBE HISTORY reads that log back as a result set, which is exactly what a compliance officer needs to evidence when the erasure happened and what it touched:

DESCRIBE HISTORY external.delta_demos.gdpr_customer_accounts;

In the demo this returns four versions: 0 (table create), 1 and 2 (the seed inserts), and 3 (the erasure UPDATE with its predicate). Version numbers in your run depend on how you loaded the data.

Prove the exposure before you close it

Here is the part most runbooks skip. After the UPDATE, version 2 of the table still contains every SSN, and any reader with time-travel access can see them. The demo asserts this deliberately, using DeltaForge's ASSERT command (an inline expectation that fails the script if the result diverges):

ASSERT ROW_COUNT = 1
ASSERT VALUE ssn = '123-45-6789' WHERE id = 1
SELECT id, account_holder, ssn, phone
FROM external.delta_demos.gdpr_customer_accounts VERSION AS OF 2
WHERE id = 1;

SELECT COUNT(*) AS exposed_records
FROM external.delta_demos.gdpr_customer_accounts VERSION AS OF 2
WHERE ssn IS NOT NULL;

The first query recovers Alice Monroe's full SSN from the pre-erasure version. The second returns exposed_records 30: every account is still readable through time travel.

This is the risk your deletion procedure exists to close, and being able to show it (PII recoverable at T1, unrecoverable at T2) is far stronger evidence than a screenshot of a successful UPDATE.

VACUUM completes the physical deletion

VACUUM lists the table's storage directory, identifies files no longer referenced by the current version that are older than the retention threshold, and deletes them. It returns a small result set with files_deleted and bytes_freed, which belongs in the erasure ticket:

VACUUM external.delta_demos.gdpr_customer_accounts RETAIN 0 HOURS;

The demo uses RETAIN 0 HOURS so the purge is immediate and the lifecycle is observable in one sitting. Production guidance on the retention value follows below.

After this statement, the VERSION AS OF 2 queries above fail with file-not-found, because the Parquet files holding the PII no longer exist. That failure is the success condition. A final grouped check confirms the live table is in the intended end state:

SELECT
    CASE WHEN id BETWEEN 1 AND 10 THEN 'Erased (ids 1-10)' ELSE 'Intact (ids 11-30)' END AS account_group,
    COUNT(*) AS accounts,
    COUNT(ssn) AS has_ssn,
    COUNT(phone) AS has_phone,
    COUNT(mailing_address) AS has_address
FROM external.delta_demos.gdpr_customer_accounts
GROUP BY CASE WHEN id BETWEEN 1 AND 10 THEN 'Erased (ids 1-10)' ELSE 'Intact (ids 11-30)' END
ORDER BY account_group;

Two rows: the erased group shows zero across all three PII columns, the intact group shows 20 everywhere. Non-PII data (names retained under financial-records grounds, balances, analytics) is unchanged throughout.

How do deletion vectors and VACUUM affect GDPR erasure?

Deletion vectors change the physical story in a way that matters specifically for erasure. With delta.enableDeletionVectors set to true (it defaults to false in DeltaForge), a DELETE no longer rewrites the affected Parquet file. Instead it writes a per-file RoaringBitmap recording which row positions are logically gone, and readers mask those rows out after decoding. That is a large performance win when an operation touches a tiny fraction of a big file, and a single-subject GDPR delete is exactly that shape.

But note what it means for compliance: the deleted rows are still physically present, byte for byte, inside the current data file. The bitmap hides them from queries; it does not remove them from storage. So on a deletion-vector table the lifecycle gains a step:

Delete logically. The DELETE commits a deletion vector against the affected files. Fast, atomic, auditable, and the rows vanish from every query against the current version.
Rewrite physically with OPTIMIZE. A compaction rewrite applies the vectors: the new files simply do not contain the masked rows, and they carry no vector.
```
OPTIMIZE external.delta_demos.gdpr_customer_accounts;
```
Purge with VACUUM. The pre-OPTIMIZE files (which still contain the subject's rows) and the superseded vector files both become removable once the retention window passes. DeltaForge's VACUUM treats vector files and data files uniformly.

On retention: VACUUM's default window is 168 hours, and that default exists to protect concurrent readers and in-flight time travel. RETAIN 0 HOURS is appropriate in the demo and in single-writer maintenance windows, but on a shared production table the safer pattern is to schedule the erasure, let the standard retention elapse, and then VACUUM, treating the documented retention period as part of your published erasure SLA. You can preview exactly what a vacuum will remove without removing it:

VACUUM external.delta_demos.gdpr_customer_accounts RETAIN 168 HOURS DRY RUN;

DRY RUN lists the candidate files and deletes nothing, which is also useful evidence for the change record before the destructive step.

The change data feed keeps its own copy

If the table has change data feed enabled, there is a third place the PII lives: the _change_data directory, where each commit's row-level changes are materialized for downstream consumers. An erasure UPDATE writes the pre-image of every changed row into that feed, SSNs included. CDF files deliberately follow a longer retention than data files (by default they track the log retention duration, 30 days, because streaming consumers need to catch up), so a VACUUM that purges your old data files can leave pre-erasure values sitting in the change feed.

DeltaForge extends VACUUM with an independent CDC_RETAIN clause so you can bound both windows explicitly in one statement:

VACUUM external.delta_demos.gdpr_customer_accounts RETAIN 168 HOURS CDC_RETAIN 168 HOURS;

Data files and change data feed files are purged on the same seven-day clock. Without CDC_RETAIN, CDF files would persist on the 30-day log retention default.

When you set your erasure SLA, take the maximum of the two windows, and confirm that every downstream consumer of the change feed applies the deletes it reads rather than archiving them.

Is pseudonymisation an alternative to physical deletion?

Legally, no. GDPR Article 4(5) defines pseudonymisation as processing that prevents attribution to a data subject without additional information, and the regulation is explicit that pseudonymised data is still personal data. A keyed hash of a surname is reversible by anyone holding the key and a candidate list. So when a data subject invokes Article 17, replacing their PII with pseudonyms does not, on its own, discharge the request; the lifecycle above does.

Where pseudonymisation earns its place is everywhere outside the erasure ticket: it shrinks the surface of PII your analysts, dashboards, and downstream extracts ever see, which reduces both your breach exposure and the number of places an erasure has to chase copies. DeltaForge ships this as a first-class SQL object. A pseudonymisation rule attaches to a table column and transforms values at query time; nothing on disk changes, and aggregations stay correct because they compute over the stored values:

CREATE PSEUDONYMISATION RULE ON external.pseudonymisation_demos.bank_customers (ssn)
    TRANSFORM redact
    PARAMS (mask = '***-**-****');

CREATE PSEUDONYMISATION RULE ON external.pseudonymisation_demos.bank_customers (phone)
    TRANSFORM mask
    PARAMS (show = 5);

CREATE PSEUDONYMISATION RULE ON external.pseudonymisation_demos.bank_customers (last_name)
    TRANSFORM keyed_hash
    SCOPE person
    PARAMS (salt = 'bank_name_salt_2024');

CREATE PSEUDONYMISATION RULE ON external.pseudonymisation_demos.bank_customers (date_of_birth)
    TRANSFORM generalize
    SCOPE relationship
    PARAMS (range = 10);

Four transform types from the pseudonymisation-quickstart demo: full redaction, partial masking, deterministic keyed hashing (so joins on the pseudonym still link), and generalization of dates of birth to the decade.

With the rules active, an ordinary SELECT returns the protected shapes, and SHOW PSEUDONYMISATION RULES FOR the table enumerates what is in force for an auditor:

SHOW PSEUDONYMISATION RULES FOR external.pseudonymisation_demos.bank_customers;

SELECT
    customer_id,
    first_name,
    last_name      AS last_name_hashed,
    date_of_birth  AS dob_generalized,
    phone          AS phone_masked,
    ssn            AS ssn_redacted,
    account_tier,
    balance
FROM external.pseudonymisation_demos.bank_customers;

SSNs come back as ***-**-****, phones show only their last five characters, surnames are deterministic hashes, birth dates are decades. COUNT, SUM, and AVG over balance remain exact.

The clean mental model: pseudonymisation governs who sees what today; erasure governs what exists at all. A mature GDPR posture on a data lake uses both, and neither substitutes for the other.

Run it yourself, and what this does not cover

Both demos used here, delta-gdpr-data-erasure and pseudonymisation-quickstart, ship in the DeltaForge demo library with setup, assertion-checked queries, and cleanup scripts, so the full lifecycle is reproducible end to end on your own object store. The SQL runs in the DeltaForge SQL engine, a commercial, customer-installed product: the engine, the Delta log, and the PII all stay inside your network, which is itself a meaningful property when the subject matter is personal data.

Honest scope notes. First, this article covers the table-level mechanics; a real Article 17 process also has to chase derived copies (extracts, ML feature stores, backups), and no table format solves that for you, though Delta's audit log tells you which tables changed and when. Second, object-store versioning or snapshot policies outside Delta's control can retain deleted objects after VACUUM; align your bucket lifecycle rules with your erasure SLA. Third, deleted data remains in _delta_log only as metadata (file paths and statistics, which can include min/max values for PII columns); if your columns are sensitive enough that log statistics matter, factor log retention into the same review. None of these caveats is unique to DeltaForge; they are the parts of the problem that live above the storage format, and pretending otherwise is how erasure procedures fail audits.