SCD Type 2 on Delta Lake in Pure SQL: MERGE Tutorial

Q: How do you implement SCD Type 2 in Delta Lake with SQL?

Give the dimension table a surrogate key plus valid_from, valid_to and is_current columns, then run a two-pass pattern. Pass one is a MERGE that joins the incoming changes on the business key with is_current = 1 in the ON condition, and its WHEN MATCHED clause sets valid_to to the day before the effective date and is_current to 0. Pass two is an INSERT ... SELECT that adds one new current row per change with valid_to held open at 9999-12-31. Each statement commits atomically to the Delta transaction log.

Q: What does an SCD2 MERGE statement look like?

The expire pass reads: MERGE INTO policy_dim AS target USING policy_changes AS source ON target.policy_id = source.policy_id AND target.is_current = 1 WHEN MATCHED THEN UPDATE SET valid_to = '2025-01-14', is_current = 0. The is_current = 1 predicate in the ON condition guarantees the MERGE only touches the active version of each key and never rewrites historical rows. The replacement versions are added by a separate INSERT, because inside a single MERGE the new version of an existing key counts as matched, not as not matched.

Q: How do you handle is_active, start_date and end_date in a MERGE?

Whether your columns are named is_active, start_date and end_date or is_current, valid_from and valid_to, the mechanics are identical. Open rows carry a sentinel end date such as 9999-12-31 and the active flag set to 1. The MERGE's WHEN MATCHED clause closes a row by setting end_date to the day before the change takes effect and the flag to 0, and the follow-up INSERT opens the replacement row with start_date equal to the effective date, the sentinel end date, and the flag back to 1.

Q: Can you do SCD Type 2 without PySpark?

Yes. SCD Type 2 is a property of the SQL pattern and the Delta transaction log, not of Spark. Any engine that writes valid Delta commits can run it. DeltaForge, a commercial engine you install on your own infrastructure, executes the whole pattern in PostgreSQL-flavored SQL: CREATE DELTA TABLE, MERGE, INSERT and the verification queries all run from a SQL editor, an ODBC or ADBC connection, or a scheduled pipeline, with no cluster, notebook, or Python environment involved.

Slowly Changing Dimension Type 2 is the standard answer to a question every warehouse eventually faces: what did this record look like on a given date? Instead of overwriting a dimension row when an attribute changes, you expire the old row and insert a new version, so the table accumulates a complete, queryable history. Insurance, finance, and other regulated industries do not treat this as optional; an auditor asking which coverage a policy carried last March expects an answer from the data, not from a backup.

The frustrating part is finding a worked example that is just SQL. Most published walkthroughs of slowly changing dimensions on Delta Lake are Spark notebooks: the Delta Lake project documents its MERGE-based change-data pattern through Spark APIs, and the popular tutorials follow suit with DataFrames and DeltaTable.forName calls. If your team works in SQL, you end up translating Python line by line and hoping the semantics survive. This article skips the translation. Every statement below is plain SQL, copied from a validated DeltaForge demo, and the whole flow runs against ordinary Delta tables in object storage.

How do you implement SCD Type 2 in Delta Lake with SQL?

The dimension table needs four bookkeeping columns on top of its business attributes: a surrogate_key that uniquely identifies each row version, valid_from and valid_to dates that bound the version's effective range, and an is_current flag that marks the active version of each business key. Open-ended rows carry the sentinel date 9999-12-31 in valid_to.

With that shape in place, applying a batch of changes is a two-pass operation:

Expire. A MERGE joins the changes to the dimension on the business key, restricted to is_current = 1. Every matched row gets valid_to set to the day before the change takes effect and is_current set to 0. The historical record is preserved exactly as it was; only the two bookkeeping columns move.
Insert. An INSERT ... SELECT reads the same changes table and adds one new row per change: a fresh surrogate key, the updated attributes, valid_from equal to the effective date, the sentinel valid_to, and is_current = 1.

After both passes, each changed key has two rows (one expired, one current), unchanged keys still have exactly one current row, and a point-in-time query is a simple range predicate on valid_from and valid_to.

Set up the dimension and the changes table

The scenario is an insurance policy dimension: 15 active policies, and a batch of 8 modifications (coverage upgrades, a region change, premium adjustments) arriving with an effective date of 2025-01-15. First the namespace. DeltaForge uses a three-level zone.schema.table hierarchy, and the zone's STORAGE_ROOT anchors where table folders land in object storage; the example uses s3://acme-lake/demo throughout.

CREATE ZONE IF NOT EXISTS external TYPE EXTERNAL
    STORAGE_ROOT = 's3://acme-lake/demo'
    COMMENT 'External and Delta tables for demo datasets';

CREATE SCHEMA IF NOT EXISTS external.delta_demos
    COMMENT 'Delta table management tutorial demos';

The zone's storage root makes every table LOCATION below a relative path under s3://acme-lake/demo.

Next the dimension itself, seeded with 15 current rows. All of them start with valid_from = '2024-01-01', the open sentinel in valid_to, and is_current = 1:

CREATE DELTA TABLE IF NOT EXISTS external.delta_demos.policy_dim (
    surrogate_key  INT,
    policy_id      VARCHAR,
    holder_name    VARCHAR,
    coverage_type  VARCHAR,
    annual_premium DOUBLE,
    region         VARCHAR,
    risk_score     INT,
    valid_from     VARCHAR,
    valid_to       VARCHAR,
    is_current     INT
) LOCATION 'delta-merge-scd2/policy_dim';

INSERT INTO external.delta_demos.policy_dim VALUES
    (1,  'POL-1001', 'Alice Johnson',   'standard', 2400.00, 'northeast', 45, '2024-01-01', '9999-12-31', 1),
    (2,  'POL-1002', 'Bob Martinez',    'basic',    1200.00, 'southeast', 30, '2024-01-01', '9999-12-31', 1),
    (3,  'POL-1003', 'Carol Chen',      'premium',  4800.00, 'west',      65, '2024-01-01', '9999-12-31', 1),
    (4,  'POL-1004', 'David Kim',       'standard', 2600.00, 'midwest',   50, '2024-01-01', '9999-12-31', 1),
    (5,  'POL-1005', 'Elena Rodriguez', 'basic',    1100.00, 'pacific',   25, '2024-01-01', '9999-12-31', 1),
    (6,  'POL-1006', 'Frank O''Brien',  'platinum', 7200.00, 'northeast', 80, '2024-01-01', '9999-12-31', 1),
    (7,  'POL-1007', 'Grace Patel',     'standard', 2200.00, 'southeast', 40, '2024-01-01', '9999-12-31', 1),
    (8,  'POL-1008', 'Henry Nakamura',  'premium',  5100.00, 'west',      70, '2024-01-01', '9999-12-31', 1),
    (9,  'POL-1009', 'Irene Fischer',   'basic',    1300.00, 'midwest',   28, '2024-01-01', '9999-12-31', 1),
    (10, 'POL-1010', 'James Cooper',    'standard', 2500.00, 'pacific',   48, '2024-01-01', '9999-12-31', 1),
    (11, 'POL-1011', 'Karen Liu',       'premium',  4600.00, 'northeast', 62, '2024-01-01', '9999-12-31', 1),
    (12, 'POL-1012', 'Leo Washington',  'basic',    1400.00, 'southeast', 32, '2024-01-01', '9999-12-31', 1),
    (13, 'POL-1013', 'Maria Gonzalez',  'standard', 2300.00, 'west',      42, '2024-01-01', '9999-12-31', 1),
    (14, 'POL-1014', 'Nathan Brooks',   'platinum', 6800.00, 'midwest',   75, '2024-01-01', '9999-12-31', 1),
    (15, 'POL-1015', 'Olivia Thompson', 'premium',  5400.00, 'pacific',   68, '2024-01-01', '9999-12-31', 1);

The SCD2 dimension: business attributes plus surrogate_key, valid_from, valid_to and is_current.

The changes table is the staging area your upstream feed lands in. It carries the business key, the new attribute values, and the date the change takes effect; it does not carry any of the SCD bookkeeping columns, because those are derived during the apply:

CREATE DELTA TABLE IF NOT EXISTS external.delta_demos.policy_changes (
    policy_id      VARCHAR,
    holder_name    VARCHAR,
    coverage_type  VARCHAR,
    annual_premium DOUBLE,
    region         VARCHAR,
    risk_score     INT,
    effective_date VARCHAR
) LOCATION 'delta-merge-scd2/policy_changes';

INSERT INTO external.delta_demos.policy_changes VALUES
    ('POL-1001', 'Alice Johnson',   'premium',  4200.00, 'northeast', 52, '2025-01-15'),
    ('POL-1003', 'Carol Chen',      'platinum', 7500.00, 'west',      68, '2025-01-15'),
    ('POL-1005', 'Elena Rodriguez', 'standard', 2100.00, 'pacific',   35, '2025-01-15'),
    ('POL-1007', 'Grace Patel',     'standard', 2800.00, 'midwest',   44, '2025-01-15'),
    ('POL-1009', 'Irene Fischer',   'basic',    1500.00, 'midwest',   30, '2025-01-15'),
    ('POL-1010', 'James Cooper',    'premium',  4400.00, 'pacific',   55, '2025-01-15'),
    ('POL-1012', 'Leo Washington',  'standard', 2000.00, 'southeast', 38, '2025-01-15'),
    ('POL-1014', 'Nathan Brooks',   'platinum', 7100.00, 'midwest',   72, '2025-01-15');

Eight incoming changes effective 2025-01-15. The demo keeps dates as VARCHAR for portability; DATE works the same way.

What does an SCD2 MERGE statement look like?

Here is the expire pass, the heart of the pattern. Note the second predicate in the ON condition: matching on policy_id alone would also join the changes against already-expired historical rows once the table has accumulated history, and a MERGE source row that matches multiple target rows applies its action to every one of them. Pinning the join to is_current = 1 guarantees at most one match per change and protects the historical rows forever:

MERGE INTO external.delta_demos.policy_dim AS target
USING external.delta_demos.policy_changes AS source
ON target.policy_id = source.policy_id AND target.is_current = 1
WHEN MATCHED THEN
    UPDATE SET
        valid_to   = '2025-01-14',
        is_current = 0;

Pass 1: expire the active version of each changed policy. Eight rows are updated; nothing is deleted.

The second pass inserts the replacement versions. Surrogate keys continue from the seed range (the dimension holds keys 1 through 15, so the new rows take 16 through 23), valid_from picks up the batch's effective date, and the sentinel reopens valid_to:

INSERT INTO external.delta_demos.policy_dim
SELECT 15 + ROW_NUMBER() OVER (ORDER BY policy_id),
       policy_id, holder_name, coverage_type, annual_premium, region, risk_score,
       effective_date, '9999-12-31', 1
FROM external.delta_demos.policy_changes;

Pass 2: one new current row per change. In production you would typically derive the key offset from MAX(surrogate_key) or use an identity column.

Why two statements instead of one MERGE? Because of how match semantics work. The new version of POL-1001 has the same business key as the row being expired, so within a single MERGE it is a matched row, and a WHEN NOT MATCHED THEN INSERT clause never fires for it. The classic single-statement workaround is to stage a union of the changes with themselves, using a null join key to force the insert branch. It works, but it doubles the source, makes the script harder to review, and buys you little: the two-pass version is what most production SCD2 pipelines actually run, and it is what the DeltaForge demo ships.

Where WHEN MATCHED and WHEN NOT MATCHED fit

The clauses themselves are worth seeing side by side, because they cover the simpler patterns you will also need. A MERGE that pairs WHEN MATCHED THEN UPDATE with WHEN NOT MATCHED THEN INSERT is the standard upsert, which in dimension terms is SCD Type 1: new keys are inserted, existing keys are overwritten in place, and no history survives:

MERGE INTO warehouse.sales.customers AS target
USING warehouse.staging.new_customers AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
    UPDATE SET name = source.name,
               email = source.email,
               updated_at = source.updated_at
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, created_at, updated_at)
    VALUES (source.customer_id, source.name, source.email,
            source.created_at, source.updated_at);

The general WHEN MATCHED / WHEN NOT MATCHED shape: an SCD Type 1 upsert that keeps no history.

In the SCD2 flow, WHEN NOT MATCHED still has a natural role if your change feed can contain brand-new business keys: those rows match nothing and can be inserted directly as first versions inside the expire MERGE. DeltaForge's MERGE also supports WHEN NOT MATCHED BY SOURCE for acting on target rows absent from the feed, conditional WHEN MATCHED AND ... clauses evaluated top to bottom, and MERGE WITH SCHEMA EVOLUTION for sources that grow new columns. None of those are needed here, which is rather the point: the core SCD2 pattern uses the plainest possible MERGE.

How do you handle is_active, start_date and end_date in a MERGE?

Column naming varies by shop: is_active/start_date/end_date, is_current/valid_from/valid_to, or row_effective_date/row_expiration_date in Kimball's own vocabulary. The mechanics never change, and they reduce to three rules the MERGE and INSERT enforce together:

Open rows carry a sentinel: The current version of every key has end_date = '9999-12-31' (or NULL, if you prefer; the sentinel keeps range predicates simple) and the active flag set to 1. Exactly one row per key may be in this state.
Expiring closes the range: The MERGE's WHEN MATCHED clause sets end_date to the day before the change's effective date and flips the flag to 0. Adjacent versions therefore tile the timeline with no gaps and no overlaps: one version ends 2025-01-14, the next begins 2025-01-15.
Inserting reopens it: The new version takes start_date from the change's effective date, the sentinel end_date, and the flag at 1. A point-in-time lookup is then WHERE d >= start_date AND d <= end_date, and the current view is WHERE is_active = 1.

The one decision worth making deliberately is the expiry boundary. This tutorial closes the old row at effective date minus one day, which suits daily-grain batch feeds. If your changes carry timestamps, close at the exact effective instant and treat the range as half-open instead, so two versions can share a calendar date without overlapping.

Verify the result and read it back

After both passes the dimension holds 23 rows: 15 originals (7 of them untouched and still current, 8 expired) plus 8 new current versions. The split is one query away:

SELECT CASE WHEN is_current = 1 THEN 'current' ELSE 'expired' END AS record_status,
       COUNT(*) AS record_count
FROM external.delta_demos.policy_dim
GROUP BY is_current
ORDER BY is_current DESC;

-- record_status | record_count
-- current       | 15
-- expired       | 8

Fifteen current rows (7 untouched + 8 new versions) and 8 expired historical rows.

The history trail for a single policy shows the pattern doing its job. Alice Johnson upgraded from standard to premium coverage, so POL-1001 now has two tiled versions:

SELECT surrogate_key, policy_id, holder_name, coverage_type,
       annual_premium, valid_from, valid_to, is_current
FROM external.delta_demos.policy_dim
WHERE policy_id = 'POL-1001'
ORDER BY valid_from;

-- surrogate_key | coverage_type | annual_premium | valid_from | valid_to   | is_current
-- 1             | standard      | 2400.0         | 2024-01-01 | 2025-01-14 | 0
-- 16            | premium       | 4200.0         | 2025-01-15 | 9999-12-31 | 1

The audit answer in two rows: what the policy was, what it became, and exactly when.

For a pipeline you want these checks to be enforced, not eyeballed. DeltaForge's ASSERT statement attaches an expectation to the query that follows it and fails the run if the expectation does not hold, which turns the SCD2 invariants into executable guards:

ASSERT VALUE cnt = 8
SELECT COUNT(*) AS cnt
FROM external.delta_demos.policy_dim
WHERE is_current = 0 AND valid_to = '2025-01-14';

ASSERT VALUE cnt = 0
SELECT COUNT(*) AS cnt
FROM external.delta_demos.policy_dim
WHERE is_current = 1 AND valid_to != '9999-12-31';

Executable invariants: every expired row closed on the right date, and no current row with a closed range.

Reporting views sit on the is_current = 1 slice. Summarizing the live portfolio by coverage type returns four tiers, with the upgrades visible in the counts (5 premium and 3 platinum policies, where the seed had 4 and 2):

SELECT coverage_type,
       COUNT(*) AS policy_count,
       ROUND(SUM(annual_premium), 2) AS total_premium,
       ROUND(AVG(annual_premium), 2) AS avg_premium
FROM external.delta_demos.policy_dim
WHERE is_current = 1
GROUP BY coverage_type
ORDER BY avg_premium;

The current-state reporting view: filter on is_current = 1 and aggregate as if history did not exist.

What happens at the storage layer

Delta MERGE is copy-on-write. The engine joins the source against the target, identifies which Parquet files contain matched rows, rewrites those files with the updated values, and records the swap in the table's _delta_log as remove and add actions inside a single atomic commit. Files holding only unmatched rows are never touched, which is why the is_current = 1 predicate matters for large dimensions: combined with file-level statistics it lets the engine skip the bulk of the historical data entirely during the match.

Because the pattern runs two statements, the log gains two versions: one commit for the MERGE, one for the INSERT. Each is individually atomic, so a concurrent reader sees either the table before the MERGE, between the passes, or after the INSERT, and never a torn state inside either statement. The in-between version is the pattern's one visibility quirk, covered under limits below. The upside of the versioned log is that the whole batch is inspectable after the fact: the expired rows, the new versions, and the exact commit boundary between them are all preserved in table history.

Run the full demo end to end

Everything above ships as the delta-merge-scd2 demo in the DeltaForge demo library, with the SQL split into setup, queries, and cleanup scripts plus assertion checks on every step (23 total rows, 8 expired, 15 current, key ranges, per-policy trails). To run it yourself:

Run the setup script: it creates the zone, schema, and both Delta tables, then seeds the 15 policies and 8 changes shown above into s3://acme-lake/demo (or whatever storage root you point the zone at; local paths and Azure work the same way).
Run the queries script: it previews both tables, executes the expire MERGE and the insert pass, and walks the verification queries with ASSERT guards attached, so a wrong row count fails loudly instead of scrolling past.
Run the cleanup script when you are done. Delta tables own physical files, so dropping them is explicit about removing data:

DROP DELTA TABLE IF EXISTS external.delta_demos.policy_changes WITH FILES;
DROP DELTA TABLE IF EXISTS external.delta_demos.policy_dim WITH FILES;

Cleanup: WITH FILES deletes the Parquet data and the transaction log along with the catalog entry.

Can you do SCD Type 2 without PySpark?

Yes, and it is worth being precise about why. SCD Type 2 is a property of the SQL pattern and of the Delta Lake table format, not of any particular engine. The transaction log does not know or care whether a Spark cluster or something else wrote its commits; any writer that produces valid Delta commits can run this pattern. Spark's dominance in the tutorial space reflects history, not a technical requirement: the Delta Lake project grew up inside the Spark ecosystem, so its documented MERGE examples are presented through Spark APIs, and the delta-rs project offers a Python and Rust path for teams who want to script it without a JVM.

DeltaForge takes the SQL route the whole way. It is a commercial engine you install on your own infrastructure, it speaks PostgreSQL-flavored SQL, and the statements in this article are the entire interface: no cluster to size, no notebook runtime, no Python environment to maintain. The same script runs from the built-in SQL editor, over ODBC or ADBC from your BI tools, or on a schedule as a pipeline, and the SQL engine executes MERGE, time travel, OPTIMIZE and VACUUM against Delta tables wherever they live: S3, Azure, or local disk.

Scope and limits

An honest accounting of where this pattern needs care:

The between-passes window. Two statements means two commits, and a reader between them sees the changed keys with no current row. For nightly batch dimensions this is rarely observable in practice; if it matters, run both statements as one scheduled pipeline step and point consumers at views, or gate reads on the batch's completion.
Surrogate key generation. The demo derives keys with ROW_NUMBER() over a known offset, which is transparent for a tutorial but assumes single-writer batches. Production pipelines should derive the offset from the table's current maximum or use an identity mechanism.
One change per key per batch. The pattern applies one version transition per business key per run. If a key can change twice inside a single batch, deduplicate the feed to the latest change first, or process intra-batch history with effective timestamps before merging.
The ON predicate is the safety rail. Dropping is_current = 1 from the join does not fail; it silently expires historical rows too once the same key changes again. Keep the assertion queries from above in the pipeline so a regression surfaces as a failed run, not as quiet history corruption.

None of these are Delta-specific or DeltaForge-specific; they are the same trade-offs the pattern has on any warehouse. What the Delta format adds is the atomic commit per statement and the versioned log underneath, and what pure SQL adds is that the whole thing fits in one reviewable script instead of a notebook.