Delta Lake Change Data Feed in SQL: table_changes() End to End

Q: How do you enable change data feed on a Delta table?

Set the table property delta.enableChangeDataFeed to true, either in the TBLPROPERTIES clause of CREATE DELTA TABLE or later with ALTER TABLE SET TBLPROPERTIES. From that commit forward, every INSERT, UPDATE, DELETE, and MERGE records row-level change information alongside the table's data files. Enabling CDF is not retroactive: commits made before the property was set have no change records, so plan to enable it at creation time on tables that downstream consumers will track.

Q: How do you query table_changes() with SQL?

Call table_changes() in the FROM clause of an ordinary SELECT. The first argument is the table name as a string, the second is the starting commit version, and the optional third is the ending version. Each returned row carries the table's columns plus three metadata columns: _change_type (insert, update_preimage, update_postimage, or delete), _commit_version, and _commit_timestamp. You can filter, aggregate, join, and ORDER BY these columns like any other relation.

Q: Can you consume Delta change data feed without Spark?

Yes. The CDF reference material from Databricks and the open-source Delta Lake project documents Spark-based readers, and libraries such as delta-rs expose CDF programmatically, but none of that gives you a SQL statement you can hand to an analyst or an ODBC client. DeltaForge, a commercial engine you install in your own environment, implements table_changes() directly in its PostgreSQL-flavored SQL dialect, so any SQL editor, ODBC, or ADBC connection can read the change feed.

Q: How do you build an incremental pipeline from CDF?

Track the last commit version you processed, then read table_changes() from that version forward and apply the rows to the target with MERGE. Filter to insert, update_postimage, and delete rows, deduplicate keys that changed more than once in the window with ROW_NUMBER over _commit_version descending, then route delete rows to WHEN MATCHED THEN DELETE, updates to UPDATE, and new keys to INSERT. Persist the highest version you applied so the next run resumes exactly where this one stopped.

Every insert, update, and delete captured and replayed downstream: the workflow from this tutorial, shown end to end.

If you search for a Delta Lake change data feed tutorial today, you mostly find reference pages. The table_changes function is documented as a lookup entry, the CDF concept page shows Spark DataFrame readers, and community forums on other lakehouse platforms contain threads asking for the function by name because their SQL surface does not expose it. What is missing is a walkthrough you can actually run: a table, a sequence of changes, the queries that read them back, and an honest account of what the feature does and does not capture. That is this article.

Everything below comes from the delta-change-data-feed demo that ships in the DeltaForge demo library, so you can replay the whole flow against your own storage. DeltaForge is a commercial engine you install in your own environment; it reads and writes the open Delta Lake format directly, which is why a plain SQL statement can do what the reference docs route through a Spark session.

What CDF writes at the storage layer

A Delta table is a directory of Parquet data files plus a transaction log under _delta_log/. The log records which files each commit added and removed, which is enough to reconstruct any version of the table but not enough to tell you what happened to an individual row: a rewritten file could contain one changed row and ten thousand untouched ones.

Change data feed closes that gap. With the delta.enableChangeDataFeed property set, operations that rewrite rows (UPDATE, DELETE, MERGE) write dedicated row-level change records into a _change_data/ directory alongside the data files. Pure appends need no extra files, because every row in a newly added data file is by definition an insert; the feed serves those straight from the data files. Each change row carries three metadata columns on top of the table's own schema:

_change_type: One of insert, update_preimage (the row before an update), update_postimage (the row after), or delete (the final state of a removed row).
_commit_version: The table version that produced the change. This is the column incremental consumers checkpoint on.
_commit_timestamp: When that commit landed, useful for time-based windows and audit reporting.

The detail that surprises most people: an UPDATE emits two rows per changed record, the preimage and the postimage. That doubling is the point. A consumer can see not just that a row changed but exactly which fields moved from what to what, which is what makes CDF usable for audit trails and field-level diffs, not only for replication.

How do you enable change data feed on a Delta table?

Set one table property. The cleanest place is the TBLPROPERTIES clause of CREATE DELTA TABLE, so the feed exists from the first commit:

CREATE ZONE IF NOT EXISTS external TYPE EXTERNAL
    COMMENT 'External and Delta tables';

CREATE SCHEMA IF NOT EXISTS external.delta_demos
    COMMENT 'Delta table management tutorial demos';

CREATE DELTA TABLE IF NOT EXISTS external.delta_demos.cdf_customer_accounts (
    id              INT,
    name            VARCHAR,
    email           VARCHAR,
    tier            VARCHAR,
    balance         DOUBLE,
    status          VARCHAR,
    created_date    VARCHAR
) LOCATION 's3://acme-lake/demo/cdf_customer_accounts'
TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');

Creating a CDF-enabled Delta table. The zone maps a storage root; the LOCATION resolves inside it.

For a table that already exists, flip the property on with ALTER TABLE:

ALTER TABLE external.delta_demos.cdf_customer_accounts
SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');

Enabling CDF on an existing table takes effect from the next commit forward.

One thing to internalize early: enabling CDF is not retroactive. Commits made before the property was set have no change records, and a table_changes() read that reaches back into that range fails rather than silently returning partial history. If a downstream system will ever need the feed, enable it at creation time.

The full demo flow, five versions of row-level change

The delta-change-data-feed demo models a small fintech book of business: 40 customer accounts seeded at baseline, then four rounds of ordinary DML that together produce every change type CDF can emit. The seed insert looks like this (abbreviated here; the demo seeds all 40 rows):

INSERT INTO external.delta_demos.cdf_customer_accounts VALUES
    (1,  'Alice Morgan',   'alice.morgan@mail.com',   'silver', 5200.00, 'active', '2023-01-05'),
    (2,  'Bob Fischer',    'bob.fischer@mail.com',    'silver', 3100.00, 'active', '2023-01-10'),
    (8,  'Henry Kowalski', 'henry.kowalski@mail.com', 'silver', 8100.00, 'active', '2023-03-10'),
    (31, 'Erik Lindgren',  'erik.lindgren@mail.com',  'silver', 8500.00, 'active', '2023-11-01');

The baseline seed. In the change feed these rows surface as _change_type = 'insert'.

Then the changes, each an ordinary statement, each one commit:

Promote the top accounts. Ten customers move to a gold tier. CDF records 10 preimage rows (tier still silver) and 10 postimage rows (tier now gold) at this version:
```
UPDATE external.delta_demos.cdf_customer_accounts
SET tier = 'gold'
WHERE id IN (31, 8, 23, 12, 33, 39, 6, 29, 17, 27);
```

New customers join. Plain inserts, the simplest change type, one feed row per new record:

INSERT INTO external.delta_demos.cdf_customer_accounts
SELECT * FROM (VALUES
    (41, 'Oscar Fernandez', 'oscar.fernandez@mail.com', 'bronze', 2500.00, 'active', '2024-03-01'),
    (42, 'Priya Sharma',    'priya.sharma@mail.com',    'silver', 4800.00, 'active', '2024-03-05')
) AS t(id, name, email, tier, balance, status, created_date);

Close accounts in two steps. A status flip followed by a delete. This is deliberate: the feed shows four logical events per closed account (preimage active, postimage closed, then the delete), which is exactly what a compliance audit wants to see:
```
UPDATE external.delta_demos.cdf_customer_accounts
SET status = 'closed'
WHERE id IN (16, 28, 45);

DELETE FROM external.delta_demos.cdf_customer_accounts
WHERE status = 'closed';
```
Adjust balances. A 20 percent bump for five premium accounts. The preimage/postimage pair captures the old and new balance for each row, a ready-made financial audit record:
```
UPDATE external.delta_demos.cdf_customer_accounts
SET balance = ROUND(balance * 1.20, 2)
WHERE id IN (31, 8, 23, 12, 33);
```

To see which version each statement produced, ask the log directly:

DESCRIBE HISTORY external.delta_demos.cdf_customer_accounts;

On a fresh run the versions line up as: 0 baseline seed, 1 tier upgrade, 2 inserts, 3 status update, 4 delete, 5 balance adjustment.

How do you query table_changes() with SQL?

table_changes() sits in the FROM clause like any other relation. The first argument is the table name as a string, the second the starting version, and the optional third the ending version (inclusive on both ends). To inspect the tier upgrade, read exactly version 1:

SELECT id, name, tier, _change_type, _commit_version, _commit_timestamp
FROM table_changes('external.delta_demos.cdf_customer_accounts', 1, 1)
ORDER BY id, _change_type;

A bounded read of a single version: a table_changes SQL example you can adapt to any range.

The result has 20 rows, two per updated account, and the shape makes the preimage/postimage mechanics concrete:

 id | name           | tier   | _change_type     | _commit_version
----+----------------+--------+------------------+----------------
  6 | Frank Dubois   | gold   | update_postimage | 1
  6 | Frank Dubois   | silver | update_preimage  | 1
  8 | Henry Kowalski | gold   | update_postimage | 1
  8 | Henry Kowalski | silver | update_preimage  | 1
 .. | ...            | ...    | ...              | ...

Each updated row appears twice: once as it was, once as it became.

Because the feed is just a relation, aggregation works the way you would hope. A change-type census across the table's whole history:

SELECT _change_type, COUNT(*) AS change_count
FROM table_changes('external.delta_demos.cdf_customer_accounts', 0, 5)
GROUP BY _change_type
ORDER BY _change_type;

On the full demo data: 3 delete, 48 insert, 18 update_postimage, 18 update_preimage.

Omitting the ending version gives an open-ended read from a starting version to the current head, which is the form incremental consumers use:

SELECT id, name, balance, _change_type, _commit_version
FROM table_changes('external.delta_demos.cdf_customer_accounts', 3)
ORDER BY _commit_version, id;

Everything since version 3: the closure updates, the deletes, and the balance adjustments. 19 rows on the demo data.

Can you consume Delta change data feed without Spark?

The documented landscape is narrow. Databricks documents table_changes as a function of its runtime and SQL warehouses, and the open-source Delta Lake documentation shows CDF reads through Spark's DataFrame reader with the readChangeFeed option. Outside the Spark world, the delta-rs project exposes CDF reads programmatically, which works if you are writing Rust or Python but does not give an analyst, a BI tool, or an ODBC client a SQL statement to run. The gap is visible in public forums, where the function gets requested by name on platforms whose SQL endpoints do not provide it.

DeltaForge's answer is to implement table_changes() natively in the engine's PostgreSQL-flavored SQL dialect. The engine reads the Delta transaction log and the _change_data/ directory directly, so every query in this article runs from the DeltaForge SQL editor, over ODBC or ADBC from tools like Power BI, or inside a scheduled pipeline, with no JVM and no cluster anywhere in the path. The same engine writes CDF too: the UPDATE, DELETE, and MERGE statements above produce change records that Spark-based readers can consume, because the on-disk format is the open Delta protocol, not a private side channel.

How do you build an incremental pipeline from CDF?

The reason CDF exists is incremental ETL. Without it, a downstream copy of a 100-million-row table has to re-read all 100 million rows to discover that fifty changed. With it, the consumer reads fifty change rows. The consumer pattern has three parts: a checkpoint, a deduplicated read, and a MERGE.

First, the target. Any Delta table with the same business columns works:

CREATE DELTA TABLE IF NOT EXISTS external.delta_demos.cdf_accounts_mirror (
    id              INT,
    name            VARCHAR,
    email           VARCHAR,
    tier            VARCHAR,
    balance         DOUBLE,
    status          VARCHAR,
    created_date    VARCHAR
) LOCATION 's3://acme-lake/demo/cdf_accounts_mirror';

The downstream mirror. It does not need CDF enabled itself.

Then the apply step. Two details matter. Preimage rows are excluded, because the mirror only needs final states. And if a key changed more than once inside the version window (account 31 was both promoted and balance-adjusted in our history), only the latest event should win, which ROW_NUMBER() over _commit_version descending handles:

MERGE INTO external.delta_demos.cdf_accounts_mirror AS m
USING (
    WITH ranked AS (
        SELECT id, name, email, tier, balance, status, created_date,
               _change_type, _commit_version,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY _commit_version DESC) AS rn
        FROM table_changes('external.delta_demos.cdf_customer_accounts', 1, 5)
        WHERE _change_type IN ('insert', 'update_postimage', 'delete')
    )
    SELECT id, name, email, tier, balance, status, created_date, _change_type
    FROM ranked
    WHERE rn = 1
) AS c
ON m.id = c.id
WHEN MATCHED AND c._change_type = 'delete' THEN DELETE
WHEN MATCHED THEN UPDATE SET
    name = c.name, email = c.email, tier = c.tier,
    balance = c.balance, status = c.status, created_date = c.created_date
WHEN NOT MATCHED AND c._change_type <> 'delete' THEN INSERT
    (id, name, email, tier, balance, status, created_date)
    VALUES (c.id, c.name, c.email, c.tier, c.balance, c.status, c.created_date);

The CDF consumer pattern: dedupe to the latest event per key, then route deletes, updates, and inserts through one MERGE.

The checkpoint is the part the engine cannot do for you. After a successful apply, persist the highest _commit_version you processed (a one-row control table is enough), and start the next run from that version plus one. Reading a fixed, bounded range is deterministic, so a failed run can simply be retried with the same bounds. Verify the mirror the same way the demo verifies its own state, with a count and a spot check:

SELECT COUNT(*) AS total_count FROM external.delta_demos.cdf_customer_accounts;

On the full demo data the source lands at 45 rows: 40 seeded, plus 8 inserted, minus 3 deleted. The mirror should match.

Retention: how long change records live

Change records are files, and files are subject to VACUUM. By default DeltaForge retains CDF files on the log retention schedule rather than the shorter data-file retention, because change consumers routinely need a longer window than time-travel readers. When a consumer can fall behind (a weekly batch reading a daily-updated table, say), set the CDF window explicitly with the CDC_RETAIN clause, a DeltaForge extension to VACUUM:

VACUUM external.delta_demos.cdf_customer_accounts RETAIN 168 HOURS CDC_RETAIN 720 HOURS;

Data files kept 7 days, change records kept 30. A consumer that lags past retention cannot catch up; it must re-baseline.

Scope and limits

CDF is a sharp tool with real edges. These are the ones that bite in practice:

Not retroactive: Only commits made after the property was enabled have change records. There is no backfill.
Write amplification on updates: Every updated row produces both a preimage and a postimage. An UPDATE touching a million rows writes two million change records; budget storage on high-churn tables accordingly.
Schema changes are invisible to the feed: ALTER TABLE ADD COLUMN appears in the transaction log's metadata, not in _change_data/. Schema-aware consumers must watch DESCRIBE HISTORY separately.
Rewrites look like delete plus insert: INSERT OVERWRITE and REPLACE WHERE emit delete and insert pairs, not update images. A consumer that diffs fields via preimage/postimage will see full row replacement instead.
MERGE commits mix change types: A single MERGE version can contain inserts, updates, and deletes together. Always branch on _change_type, never on the version alone.

None of these are DeltaForge quirks; they follow from the Delta protocol's design. The practical takeaway is to enable CDF early, checkpoint on _commit_version, branch on _change_type, and size retention to your slowest consumer. With those four habits, a Delta table becomes its own change stream, queryable by anything that can speak SQL to the engine.