If you search for a Delta Lake change data feed tutorial today, you mostly find reference pages. The table_changes function is documented as a lookup entry, the CDF concept page shows Spark DataFrame readers, and community forums on other lakehouse platforms contain threads asking for the function by name because their SQL surface does not expose it. What is missing is a walkthrough you can actually run: a table, a sequence of changes, the queries that read them back, and an honest account of what the feature does and does not capture. That is this article.
Everything below comes from the delta-change-data-feed demo that ships in the DeltaForge demo library, so you can replay the whole flow against your own storage. DeltaForge is a commercial engine you install in your own environment; it reads and writes the open Delta Lake format directly, which is why a plain SQL statement can do what the reference docs route through a Spark session.
What CDF writes at the storage layer
A Delta table is a directory of Parquet data files plus a transaction log under _delta_log/. The log records which files each commit added and removed, which is enough to reconstruct any version of the table but not enough to tell you what happened to an individual row: a rewritten file could contain one changed row and ten thousand untouched ones.
Change data feed closes that gap. With the delta.enableChangeDataFeed property set, operations that rewrite rows (UPDATE, DELETE, MERGE) write dedicated row-level change records into a _change_data/ directory alongside the data files. Pure appends need no extra files, because every row in a newly added data file is by definition an insert; the feed serves those straight from the data files. Each change row carries three metadata columns on top of the table's own schema:
_change_type- One of
insert,update_preimage(the row before an update),update_postimage(the row after), ordelete(the final state of a removed row). _commit_version- The table version that produced the change. This is the column incremental consumers checkpoint on.
_commit_timestamp- When that commit landed, useful for time-based windows and audit reporting.
The detail that surprises most people: an UPDATE emits two rows per changed record, the preimage and the postimage. That doubling is the point. A consumer can see not just that a row changed but exactly which fields moved from what to what, which is what makes CDF usable for audit trails and field-level diffs, not only for replication.
How do you enable change data feed on a Delta table?
Set one table property. The cleanest place is the TBLPROPERTIES clause of CREATE DELTA TABLE, so the feed exists from the first commit:
CREATE ZONE IF NOT EXISTS external TYPE EXTERNAL
COMMENT 'External and Delta tables';
CREATE SCHEMA IF NOT EXISTS external.delta_demos
COMMENT 'Delta table management tutorial demos';
CREATE DELTA TABLE IF NOT EXISTS external.delta_demos.cdf_customer_accounts (
id INT,
name VARCHAR,
email VARCHAR,
tier VARCHAR,
balance DOUBLE,
status VARCHAR,
created_date VARCHAR
) LOCATION 's3://acme-lake/demo/cdf_customer_accounts'
TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');
For a table that already exists, flip the property on with ALTER TABLE:
ALTER TABLE external.delta_demos.cdf_customer_accounts
SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');
One thing to internalize early: enabling CDF is not retroactive. Commits made before the property was set have no change records, and a table_changes() read that reaches back into that range fails rather than silently returning partial history. If a downstream system will ever need the feed, enable it at creation time.
The full demo flow, five versions of row-level change
The delta-change-data-feed demo models a small fintech book of business: 40 customer accounts seeded at baseline, then four rounds of ordinary DML that together produce every change type CDF can emit. The seed insert looks like this (abbreviated here; the demo seeds all 40 rows):
INSERT INTO external.delta_demos.cdf_customer_accounts VALUES
(1, 'Alice Morgan', 'alice.morgan@mail.com', 'silver', 5200.00, 'active', '2023-01-05'),
(2, 'Bob Fischer', 'bob.fischer@mail.com', 'silver', 3100.00, 'active', '2023-01-10'),
(8, 'Henry Kowalski', 'henry.kowalski@mail.com', 'silver', 8100.00, 'active', '2023-03-10'),
(31, 'Erik Lindgren', 'erik.lindgren@mail.com', 'silver', 8500.00, 'active', '2023-11-01');
_change_type = 'insert'.Then the changes, each an ordinary statement, each one commit:
-
Promote the top accounts. Ten customers move to a gold tier. CDF records 10 preimage rows (tier still silver) and 10 postimage rows (tier now gold) at this version:
UPDATE external.delta_demos.cdf_customer_accounts SET tier = 'gold' WHERE id IN (31, 8, 23, 12, 33, 39, 6, 29, 17, 27); -
New customers join. Plain inserts, the simplest change type, one feed row per new record:
INSERT INTO external.delta_demos.cdf_customer_accounts SELECT * FROM (VALUES (41, 'Oscar Fernandez', 'oscar.fernandez@mail.com', 'bronze', 2500.00, 'active', '2024-03-01'), (42, 'Priya Sharma', 'priya.sharma@mail.com', 'silver', 4800.00, 'active', '2024-03-05') ) AS t(id, name, email, tier, balance, status, created_date); -
Close accounts in two steps. A status flip followed by a delete. This is deliberate: the feed shows four logical events per closed account (preimage active, postimage closed, then the delete), which is exactly what a compliance audit wants to see:
UPDATE external.delta_demos.cdf_customer_accounts SET status = 'closed' WHERE id IN (16, 28, 45); DELETE FROM external.delta_demos.cdf_customer_accounts WHERE status = 'closed'; -
Adjust balances. A 20 percent bump for five premium accounts. The preimage/postimage pair captures the old and new balance for each row, a ready-made financial audit record:
UPDATE external.delta_demos.cdf_customer_accounts SET balance = ROUND(balance * 1.20, 2) WHERE id IN (31, 8, 23, 12, 33);
To see which version each statement produced, ask the log directly:
DESCRIBE HISTORY external.delta_demos.cdf_customer_accounts;
How do you query table_changes() with SQL?
table_changes() sits in the FROM clause like any other relation. The first argument is the table name as a string, the second the starting version, and the optional third the ending version (inclusive on both ends). To inspect the tier upgrade, read exactly version 1:
SELECT id, name, tier, _change_type, _commit_version, _commit_timestamp
FROM table_changes('external.delta_demos.cdf_customer_accounts', 1, 1)
ORDER BY id, _change_type;
The result has 20 rows, two per updated account, and the shape makes the preimage/postimage mechanics concrete:
id | name | tier | _change_type | _commit_version
----+----------------+--------+------------------+----------------
6 | Frank Dubois | gold | update_postimage | 1
6 | Frank Dubois | silver | update_preimage | 1
8 | Henry Kowalski | gold | update_postimage | 1
8 | Henry Kowalski | silver | update_preimage | 1
.. | ... | ... | ... | ...
Because the feed is just a relation, aggregation works the way you would hope. A change-type census across the table's whole history:
SELECT _change_type, COUNT(*) AS change_count
FROM table_changes('external.delta_demos.cdf_customer_accounts', 0, 5)
GROUP BY _change_type
ORDER BY _change_type;
Omitting the ending version gives an open-ended read from a starting version to the current head, which is the form incremental consumers use:
SELECT id, name, balance, _change_type, _commit_version
FROM table_changes('external.delta_demos.cdf_customer_accounts', 3)
ORDER BY _commit_version, id;
Can you consume Delta change data feed without Spark?
The documented landscape is narrow. Databricks documents table_changes as a function of its runtime and SQL warehouses, and the open-source Delta Lake documentation shows CDF reads through Spark's DataFrame reader with the readChangeFeed option. Outside the Spark world, the delta-rs project exposes CDF reads programmatically, which works if you are writing Rust or Python but does not give an analyst, a BI tool, or an ODBC client a SQL statement to run. The gap is visible in public forums, where the function gets requested by name on platforms whose SQL endpoints do not provide it.
DeltaForge's answer is to implement table_changes() natively in the engine's PostgreSQL-flavored SQL dialect. The engine reads the Delta transaction log and the _change_data/ directory directly, so every query in this article runs from the DeltaForge SQL editor, over ODBC or ADBC from tools like Power BI, or inside a scheduled pipeline, with no JVM and no cluster anywhere in the path. The same engine writes CDF too: the UPDATE, DELETE, and MERGE statements above produce change records that Spark-based readers can consume, because the on-disk format is the open Delta protocol, not a private side channel.
How do you build an incremental pipeline from CDF?
The reason CDF exists is incremental ETL. Without it, a downstream copy of a 100-million-row table has to re-read all 100 million rows to discover that fifty changed. With it, the consumer reads fifty change rows. The consumer pattern has three parts: a checkpoint, a deduplicated read, and a MERGE.
First, the target. Any Delta table with the same business columns works:
CREATE DELTA TABLE IF NOT EXISTS external.delta_demos.cdf_accounts_mirror (
id INT,
name VARCHAR,
email VARCHAR,
tier VARCHAR,
balance DOUBLE,
status VARCHAR,
created_date VARCHAR
) LOCATION 's3://acme-lake/demo/cdf_accounts_mirror';
Then the apply step. Two details matter. Preimage rows are excluded, because the mirror only needs final states. And if a key changed more than once inside the version window (account 31 was both promoted and balance-adjusted in our history), only the latest event should win, which ROW_NUMBER() over _commit_version descending handles:
MERGE INTO external.delta_demos.cdf_accounts_mirror AS m
USING (
WITH ranked AS (
SELECT id, name, email, tier, balance, status, created_date,
_change_type, _commit_version,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY _commit_version DESC) AS rn
FROM table_changes('external.delta_demos.cdf_customer_accounts', 1, 5)
WHERE _change_type IN ('insert', 'update_postimage', 'delete')
)
SELECT id, name, email, tier, balance, status, created_date, _change_type
FROM ranked
WHERE rn = 1
) AS c
ON m.id = c.id
WHEN MATCHED AND c._change_type = 'delete' THEN DELETE
WHEN MATCHED THEN UPDATE SET
name = c.name, email = c.email, tier = c.tier,
balance = c.balance, status = c.status, created_date = c.created_date
WHEN NOT MATCHED AND c._change_type <> 'delete' THEN INSERT
(id, name, email, tier, balance, status, created_date)
VALUES (c.id, c.name, c.email, c.tier, c.balance, c.status, c.created_date);
The checkpoint is the part the engine cannot do for you. After a successful apply, persist the highest _commit_version you processed (a one-row control table is enough), and start the next run from that version plus one. Reading a fixed, bounded range is deterministic, so a failed run can simply be retried with the same bounds. Verify the mirror the same way the demo verifies its own state, with a count and a spot check:
SELECT COUNT(*) AS total_count FROM external.delta_demos.cdf_customer_accounts;
Retention: how long change records live
Change records are files, and files are subject to VACUUM. By default DeltaForge retains CDF files on the log retention schedule rather than the shorter data-file retention, because change consumers routinely need a longer window than time-travel readers. When a consumer can fall behind (a weekly batch reading a daily-updated table, say), set the CDF window explicitly with the CDC_RETAIN clause, a DeltaForge extension to VACUUM:
VACUUM external.delta_demos.cdf_customer_accounts RETAIN 168 HOURS CDC_RETAIN 720 HOURS;
Scope and limits
CDF is a sharp tool with real edges. These are the ones that bite in practice:
- Not retroactive
- Only commits made after the property was enabled have change records. There is no backfill.
- Write amplification on updates
- Every updated row produces both a preimage and a postimage. An UPDATE touching a million rows writes two million change records; budget storage on high-churn tables accordingly.
- Schema changes are invisible to the feed
- ALTER TABLE ADD COLUMN appears in the transaction log's metadata, not in
_change_data/. Schema-aware consumers must watchDESCRIBE HISTORYseparately. - Rewrites look like delete plus insert
- INSERT OVERWRITE and REPLACE WHERE emit delete and insert pairs, not update images. A consumer that diffs fields via preimage/postimage will see full row replacement instead.
- MERGE commits mix change types
- A single MERGE version can contain inserts, updates, and deletes together. Always branch on
_change_type, never on the version alone.
None of these are DeltaForge quirks; they follow from the Delta protocol's design. The practical takeaway is to enable CDF early, checkpoint on _commit_version, branch on _change_type, and size retention to your slowest consumer. With those four habits, a Delta table becomes its own change stream, queryable by anything that can speak SQL to the engine.