Slowly Changing Dimension Type 2 is the standard answer to a question every warehouse eventually faces: what did this record look like on a given date? Instead of overwriting a dimension row when an attribute changes, you expire the old row and insert a new version, so the table accumulates a complete, queryable history. Insurance, finance, and other regulated industries do not treat this as optional; an auditor asking which coverage a policy carried last March expects an answer from the data, not from a backup.
The frustrating part is finding a worked example that is just SQL. Most published walkthroughs of slowly changing dimensions on Delta Lake are Spark notebooks: the Delta Lake project documents its MERGE-based change-data pattern through Spark APIs, and the popular tutorials follow suit with DataFrames and DeltaTable.forName calls. If your team works in SQL, you end up translating Python line by line and hoping the semantics survive. This article skips the translation. Every statement below is plain SQL, copied from a validated DeltaForge demo, and the whole flow runs against ordinary Delta tables in object storage.
How do you implement SCD Type 2 in Delta Lake with SQL?
The dimension table needs four bookkeeping columns on top of its business attributes: a surrogate_key that uniquely identifies each row version, valid_from and valid_to dates that bound the version's effective range, and an is_current flag that marks the active version of each business key. Open-ended rows carry the sentinel date 9999-12-31 in valid_to.
With that shape in place, applying a batch of changes is a two-pass operation:
- Expire. A
MERGEjoins the changes to the dimension on the business key, restricted tois_current = 1. Every matched row getsvalid_toset to the day before the change takes effect andis_currentset to 0. The historical record is preserved exactly as it was; only the two bookkeeping columns move. - Insert. An
INSERT ... SELECTreads the same changes table and adds one new row per change: a fresh surrogate key, the updated attributes,valid_fromequal to the effective date, the sentinelvalid_to, andis_current = 1.
After both passes, each changed key has two rows (one expired, one current), unchanged keys still have exactly one current row, and a point-in-time query is a simple range predicate on valid_from and valid_to.
Set up the dimension and the changes table
The scenario is an insurance policy dimension: 15 active policies, and a batch of 8 modifications (coverage upgrades, a region change, premium adjustments) arriving with an effective date of 2025-01-15. First the namespace. DeltaForge uses a three-level zone.schema.table hierarchy, and the zone's STORAGE_ROOT anchors where table folders land in object storage; the example uses s3://acme-lake/demo throughout.
CREATE ZONE IF NOT EXISTS external TYPE EXTERNAL
STORAGE_ROOT = 's3://acme-lake/demo'
COMMENT 'External and Delta tables for demo datasets';
CREATE SCHEMA IF NOT EXISTS external.delta_demos
COMMENT 'Delta table management tutorial demos';
LOCATION below a relative path under s3://acme-lake/demo.Next the dimension itself, seeded with 15 current rows. All of them start with valid_from = '2024-01-01', the open sentinel in valid_to, and is_current = 1:
CREATE DELTA TABLE IF NOT EXISTS external.delta_demos.policy_dim (
surrogate_key INT,
policy_id VARCHAR,
holder_name VARCHAR,
coverage_type VARCHAR,
annual_premium DOUBLE,
region VARCHAR,
risk_score INT,
valid_from VARCHAR,
valid_to VARCHAR,
is_current INT
) LOCATION 'delta-merge-scd2/policy_dim';
INSERT INTO external.delta_demos.policy_dim VALUES
(1, 'POL-1001', 'Alice Johnson', 'standard', 2400.00, 'northeast', 45, '2024-01-01', '9999-12-31', 1),
(2, 'POL-1002', 'Bob Martinez', 'basic', 1200.00, 'southeast', 30, '2024-01-01', '9999-12-31', 1),
(3, 'POL-1003', 'Carol Chen', 'premium', 4800.00, 'west', 65, '2024-01-01', '9999-12-31', 1),
(4, 'POL-1004', 'David Kim', 'standard', 2600.00, 'midwest', 50, '2024-01-01', '9999-12-31', 1),
(5, 'POL-1005', 'Elena Rodriguez', 'basic', 1100.00, 'pacific', 25, '2024-01-01', '9999-12-31', 1),
(6, 'POL-1006', 'Frank O''Brien', 'platinum', 7200.00, 'northeast', 80, '2024-01-01', '9999-12-31', 1),
(7, 'POL-1007', 'Grace Patel', 'standard', 2200.00, 'southeast', 40, '2024-01-01', '9999-12-31', 1),
(8, 'POL-1008', 'Henry Nakamura', 'premium', 5100.00, 'west', 70, '2024-01-01', '9999-12-31', 1),
(9, 'POL-1009', 'Irene Fischer', 'basic', 1300.00, 'midwest', 28, '2024-01-01', '9999-12-31', 1),
(10, 'POL-1010', 'James Cooper', 'standard', 2500.00, 'pacific', 48, '2024-01-01', '9999-12-31', 1),
(11, 'POL-1011', 'Karen Liu', 'premium', 4600.00, 'northeast', 62, '2024-01-01', '9999-12-31', 1),
(12, 'POL-1012', 'Leo Washington', 'basic', 1400.00, 'southeast', 32, '2024-01-01', '9999-12-31', 1),
(13, 'POL-1013', 'Maria Gonzalez', 'standard', 2300.00, 'west', 42, '2024-01-01', '9999-12-31', 1),
(14, 'POL-1014', 'Nathan Brooks', 'platinum', 6800.00, 'midwest', 75, '2024-01-01', '9999-12-31', 1),
(15, 'POL-1015', 'Olivia Thompson', 'premium', 5400.00, 'pacific', 68, '2024-01-01', '9999-12-31', 1);
surrogate_key, valid_from, valid_to and is_current.The changes table is the staging area your upstream feed lands in. It carries the business key, the new attribute values, and the date the change takes effect; it does not carry any of the SCD bookkeeping columns, because those are derived during the apply:
CREATE DELTA TABLE IF NOT EXISTS external.delta_demos.policy_changes (
policy_id VARCHAR,
holder_name VARCHAR,
coverage_type VARCHAR,
annual_premium DOUBLE,
region VARCHAR,
risk_score INT,
effective_date VARCHAR
) LOCATION 'delta-merge-scd2/policy_changes';
INSERT INTO external.delta_demos.policy_changes VALUES
('POL-1001', 'Alice Johnson', 'premium', 4200.00, 'northeast', 52, '2025-01-15'),
('POL-1003', 'Carol Chen', 'platinum', 7500.00, 'west', 68, '2025-01-15'),
('POL-1005', 'Elena Rodriguez', 'standard', 2100.00, 'pacific', 35, '2025-01-15'),
('POL-1007', 'Grace Patel', 'standard', 2800.00, 'midwest', 44, '2025-01-15'),
('POL-1009', 'Irene Fischer', 'basic', 1500.00, 'midwest', 30, '2025-01-15'),
('POL-1010', 'James Cooper', 'premium', 4400.00, 'pacific', 55, '2025-01-15'),
('POL-1012', 'Leo Washington', 'standard', 2000.00, 'southeast', 38, '2025-01-15'),
('POL-1014', 'Nathan Brooks', 'platinum', 7100.00, 'midwest', 72, '2025-01-15');
What does an SCD2 MERGE statement look like?
Here is the expire pass, the heart of the pattern. Note the second predicate in the ON condition: matching on policy_id alone would also join the changes against already-expired historical rows once the table has accumulated history, and a MERGE source row that matches multiple target rows applies its action to every one of them. Pinning the join to is_current = 1 guarantees at most one match per change and protects the historical rows forever:
MERGE INTO external.delta_demos.policy_dim AS target
USING external.delta_demos.policy_changes AS source
ON target.policy_id = source.policy_id AND target.is_current = 1
WHEN MATCHED THEN
UPDATE SET
valid_to = '2025-01-14',
is_current = 0;
The second pass inserts the replacement versions. Surrogate keys continue from the seed range (the dimension holds keys 1 through 15, so the new rows take 16 through 23), valid_from picks up the batch's effective date, and the sentinel reopens valid_to:
INSERT INTO external.delta_demos.policy_dim
SELECT 15 + ROW_NUMBER() OVER (ORDER BY policy_id),
policy_id, holder_name, coverage_type, annual_premium, region, risk_score,
effective_date, '9999-12-31', 1
FROM external.delta_demos.policy_changes;
MAX(surrogate_key) or use an identity column.Why two statements instead of one MERGE? Because of how match semantics work. The new version of POL-1001 has the same business key as the row being expired, so within a single MERGE it is a matched row, and a WHEN NOT MATCHED THEN INSERT clause never fires for it. The classic single-statement workaround is to stage a union of the changes with themselves, using a null join key to force the insert branch. It works, but it doubles the source, makes the script harder to review, and buys you little: the two-pass version is what most production SCD2 pipelines actually run, and it is what the DeltaForge demo ships.
Where WHEN MATCHED and WHEN NOT MATCHED fit
The clauses themselves are worth seeing side by side, because they cover the simpler patterns you will also need. A MERGE that pairs WHEN MATCHED THEN UPDATE with WHEN NOT MATCHED THEN INSERT is the standard upsert, which in dimension terms is SCD Type 1: new keys are inserted, existing keys are overwritten in place, and no history survives:
MERGE INTO warehouse.sales.customers AS target
USING warehouse.staging.new_customers AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
UPDATE SET name = source.name,
email = source.email,
updated_at = source.updated_at
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, created_at, updated_at)
VALUES (source.customer_id, source.name, source.email,
source.created_at, source.updated_at);
In the SCD2 flow, WHEN NOT MATCHED still has a natural role if your change feed can contain brand-new business keys: those rows match nothing and can be inserted directly as first versions inside the expire MERGE. DeltaForge's MERGE also supports WHEN NOT MATCHED BY SOURCE for acting on target rows absent from the feed, conditional WHEN MATCHED AND ... clauses evaluated top to bottom, and MERGE WITH SCHEMA EVOLUTION for sources that grow new columns. None of those are needed here, which is rather the point: the core SCD2 pattern uses the plainest possible MERGE.
How do you handle is_active, start_date and end_date in a MERGE?
Column naming varies by shop: is_active/start_date/end_date, is_current/valid_from/valid_to, or row_effective_date/row_expiration_date in Kimball's own vocabulary. The mechanics never change, and they reduce to three rules the MERGE and INSERT enforce together:
- Open rows carry a sentinel
- The current version of every key has
end_date = '9999-12-31'(or NULL, if you prefer; the sentinel keeps range predicates simple) and the active flag set to 1. Exactly one row per key may be in this state. - Expiring closes the range
- The MERGE's
WHEN MATCHEDclause setsend_dateto the day before the change's effective date and flips the flag to 0. Adjacent versions therefore tile the timeline with no gaps and no overlaps: one version ends 2025-01-14, the next begins 2025-01-15. - Inserting reopens it
- The new version takes
start_datefrom the change's effective date, the sentinelend_date, and the flag at 1. A point-in-time lookup is thenWHERE d >= start_date AND d <= end_date, and the current view isWHERE is_active = 1.
The one decision worth making deliberately is the expiry boundary. This tutorial closes the old row at effective date minus one day, which suits daily-grain batch feeds. If your changes carry timestamps, close at the exact effective instant and treat the range as half-open instead, so two versions can share a calendar date without overlapping.
Verify the result and read it back
After both passes the dimension holds 23 rows: 15 originals (7 of them untouched and still current, 8 expired) plus 8 new current versions. The split is one query away:
SELECT CASE WHEN is_current = 1 THEN 'current' ELSE 'expired' END AS record_status,
COUNT(*) AS record_count
FROM external.delta_demos.policy_dim
GROUP BY is_current
ORDER BY is_current DESC;
-- record_status | record_count
-- current | 15
-- expired | 8
The history trail for a single policy shows the pattern doing its job. Alice Johnson upgraded from standard to premium coverage, so POL-1001 now has two tiled versions:
SELECT surrogate_key, policy_id, holder_name, coverage_type,
annual_premium, valid_from, valid_to, is_current
FROM external.delta_demos.policy_dim
WHERE policy_id = 'POL-1001'
ORDER BY valid_from;
-- surrogate_key | coverage_type | annual_premium | valid_from | valid_to | is_current
-- 1 | standard | 2400.0 | 2024-01-01 | 2025-01-14 | 0
-- 16 | premium | 4200.0 | 2025-01-15 | 9999-12-31 | 1
For a pipeline you want these checks to be enforced, not eyeballed. DeltaForge's ASSERT statement attaches an expectation to the query that follows it and fails the run if the expectation does not hold, which turns the SCD2 invariants into executable guards:
ASSERT VALUE cnt = 8
SELECT COUNT(*) AS cnt
FROM external.delta_demos.policy_dim
WHERE is_current = 0 AND valid_to = '2025-01-14';
ASSERT VALUE cnt = 0
SELECT COUNT(*) AS cnt
FROM external.delta_demos.policy_dim
WHERE is_current = 1 AND valid_to != '9999-12-31';
Reporting views sit on the is_current = 1 slice. Summarizing the live portfolio by coverage type returns four tiers, with the upgrades visible in the counts (5 premium and 3 platinum policies, where the seed had 4 and 2):
SELECT coverage_type,
COUNT(*) AS policy_count,
ROUND(SUM(annual_premium), 2) AS total_premium,
ROUND(AVG(annual_premium), 2) AS avg_premium
FROM external.delta_demos.policy_dim
WHERE is_current = 1
GROUP BY coverage_type
ORDER BY avg_premium;
is_current = 1 and aggregate as if history did not exist.What happens at the storage layer
Delta MERGE is copy-on-write. The engine joins the source against the target, identifies which Parquet files contain matched rows, rewrites those files with the updated values, and records the swap in the table's _delta_log as remove and add actions inside a single atomic commit. Files holding only unmatched rows are never touched, which is why the is_current = 1 predicate matters for large dimensions: combined with file-level statistics it lets the engine skip the bulk of the historical data entirely during the match.
Because the pattern runs two statements, the log gains two versions: one commit for the MERGE, one for the INSERT. Each is individually atomic, so a concurrent reader sees either the table before the MERGE, between the passes, or after the INSERT, and never a torn state inside either statement. The in-between version is the pattern's one visibility quirk, covered under limits below. The upside of the versioned log is that the whole batch is inspectable after the fact: the expired rows, the new versions, and the exact commit boundary between them are all preserved in table history.
Run the full demo end to end
Everything above ships as the delta-merge-scd2 demo in the DeltaForge demo library, with the SQL split into setup, queries, and cleanup scripts plus assertion checks on every step (23 total rows, 8 expired, 15 current, key ranges, per-policy trails). To run it yourself:
- Run the setup script: it creates the zone, schema, and both Delta tables, then seeds the 15 policies and 8 changes shown above into
s3://acme-lake/demo(or whatever storage root you point the zone at; local paths and Azure work the same way). - Run the queries script: it previews both tables, executes the expire MERGE and the insert pass, and walks the verification queries with
ASSERTguards attached, so a wrong row count fails loudly instead of scrolling past. - Run the cleanup script when you are done. Delta tables own physical files, so dropping them is explicit about removing data:
DROP DELTA TABLE IF EXISTS external.delta_demos.policy_changes WITH FILES;
DROP DELTA TABLE IF EXISTS external.delta_demos.policy_dim WITH FILES;
WITH FILES deletes the Parquet data and the transaction log along with the catalog entry.Can you do SCD Type 2 without PySpark?
Yes, and it is worth being precise about why. SCD Type 2 is a property of the SQL pattern and of the Delta Lake table format, not of any particular engine. The transaction log does not know or care whether a Spark cluster or something else wrote its commits; any writer that produces valid Delta commits can run this pattern. Spark's dominance in the tutorial space reflects history, not a technical requirement: the Delta Lake project grew up inside the Spark ecosystem, so its documented MERGE examples are presented through Spark APIs, and the delta-rs project offers a Python and Rust path for teams who want to script it without a JVM.
DeltaForge takes the SQL route the whole way. It is a commercial engine you install on your own infrastructure, it speaks PostgreSQL-flavored SQL, and the statements in this article are the entire interface: no cluster to size, no notebook runtime, no Python environment to maintain. The same script runs from the built-in SQL editor, over ODBC or ADBC from your BI tools, or on a schedule as a pipeline, and the SQL engine executes MERGE, time travel, OPTIMIZE and VACUUM against Delta tables wherever they live: S3, Azure, or local disk.
Scope and limits
An honest accounting of where this pattern needs care:
- The between-passes window. Two statements means two commits, and a reader between them sees the changed keys with no current row. For nightly batch dimensions this is rarely observable in practice; if it matters, run both statements as one scheduled pipeline step and point consumers at views, or gate reads on the batch's completion.
- Surrogate key generation. The demo derives keys with
ROW_NUMBER()over a known offset, which is transparent for a tutorial but assumes single-writer batches. Production pipelines should derive the offset from the table's current maximum or use an identity mechanism. - One change per key per batch. The pattern applies one version transition per business key per run. If a key can change twice inside a single batch, deduplicate the feed to the latest change first, or process intra-batch history with effective timestamps before merging.
- The ON predicate is the safety rail. Dropping
is_current = 1from the join does not fail; it silently expires historical rows too once the same key changes again. Keep the assertion queries from above in the pipeline so a regression surfaces as a failed run, not as quiet history corruption.
None of these are Delta-specific or DeltaForge-specific; they are the same trade-offs the pattern has on any warehouse. What the Delta format adds is the atomic commit per statement and the versioned log underneath, and what pure SQL adds is that the whole thing fits in one reviewable script instead of a notebook.