Do these indexes change the on-disk Delta table format?

No. The parent data stays in standard Delta format. The index is a separate child Delta table stored alongside it.

Which index structures are available?

Three: PGM (default), B+ tree, and bloom filter. PGM suits clustered or monotonic keys; B+ tree handles any key distribution; bloom filter does per-file probabilistic pruning.

When does an index actually help?

An index helps with selective reads on a hot column, with UPDATE / DELETE / MERGE statements that target a specific row or small set, and with repeated equality joins on the same key.

What happens to a stale index?

It is silently ignored. The planner falls back to the standard scan path, which always returns correct results.

Indexes on Delta Lake Tables: PGM, B+ Tree, Bloom

What does an index cost?

Storage: a small fraction of the parent table size. Write overhead: every parent write also updates the index when auto-update is on. Build time: a one-time scan of the parent at index creation.

What It Is

A managed companion to a Delta table

The Index

For each indexed column, the index records where the matching rows live so the engine can read just those rows rather than the whole table. The index is itself a child Delta table.

Optional. Tables work without indexes.
Same answers either way, just less data read.
A stale index is silently ignored.
Parent data stays in standard Delta format.

Reader Scope

Indexes are consumed by the DeltaForge query planner. Other engines reading the parent Delta table will not pick them up; they fall back to the standard scan path and return the same results.

DeltaForge planner uses the index.
External readers ignore it.
Correctness preserved across both paths.
No fork, no proprietary table format.

When To Reach For One

Indexes complement built-in data skipping; they don't replace it

Helps

Selective reads on a hot column: point lookups, narrow ranges, IN lists, prefix matches.
UPDATE, DELETE, MERGE on a specific row or small set; the locate step becomes a direct read.
Repeated equality joins on the same key when one side carries an index on that column.

Does Not Help

The predicate matches most of the table anyway.
The engine already prunes the column well from its built-in min/max statistics.
The table is small enough that scanning everything is already cheap.
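To make the second point concrete, here is a minimal sketch of file-level min/max pruning. The file names and value ranges are made up for illustration, and the function is a stand-in for whatever the engine does internally. When data is clustered on the column, the statistics alone narrow a lookup to one file, so an index adds little; when values are scattered across files, every file's range overlaps the predicate and the statistics prune nothing.

```python
def files_after_minmax_pruning(file_stats, value):
    """Keep files whose [min, max] range could contain the value."""
    return [name for name, (lo, hi) in file_stats.items() if lo <= value <= hi]


# Hypothetical per-file (min, max) statistics for one column.
clustered = {"f0": (0, 99), "f1": (100, 199), "f2": (200, 299)}
scattered = {"f0": (0, 297), "f1": (1, 298), "f2": (2, 299)}

print(files_after_minmax_pruning(clustered, 150))  # statistics already isolate one file
print(files_after_minmax_pruning(scattered, 150))  # every file survives pruning
```

The scattered case is where a row-level or file-level index is more likely to pay for itself.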

Cost

An index is a running expense, not a free upgrade

Storage: A small fraction of the parent table's size

Write overhead: Every parent write also updates the index when auto-update is on

Build time: One-time scan of the parent at index creation

Index Structures

Pick by access pattern

PGM (default)

A learned index built from a piecewise geometric model over the key distribution. Compact on disk; suited to clustered or monotonic keys typical of analytical workloads.
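The idea behind a learned index can be sketched in a few lines. The code below is a simplified, single-level illustration and not DeltaForge's implementation: it fits linear segments with a greedy "shrinking cone" pass, whereas a production PGM index uses a more careful fitting algorithm and stacks levels recursively. It assumes sorted, distinct integer keys; `Segment`, `build_segments`, and `lookup` are hypothetical names. Each segment predicts a key's position to within `epsilon`, so a lookup only searches a small window around the prediction.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Segment:
    first_key: int   # smallest key covered by this segment
    first_pos: int   # position of first_key in the sorted key array
    slope: float     # predicted positions per unit of key


def build_segments(keys, epsilon):
    """Greedily fit linear segments whose predictions stay within epsilon."""
    inf = float("inf")
    segments = []
    start, lo, hi = 0, 0.0, inf
    for i in range(1, len(keys)):
        dx, dy = keys[i] - keys[start], i - start
        new_lo = max(lo, (dy - epsilon) / dx)
        new_hi = min(hi, (dy + epsilon) / dx)
        if new_lo <= new_hi:
            lo, hi = new_lo, new_hi  # key i still fits the current segment
            continue
        # No single slope covers key i too: close the segment, start a new one.
        segments.append(Segment(keys[start], start, (lo + hi) / 2))
        start, lo, hi = i, 0.0, inf
    if keys:
        slope = 0.0 if hi == inf else (lo + hi) / 2
        segments.append(Segment(keys[start], start, slope))
    return segments


def lookup(keys, segments, epsilon, key):
    """Return the position of key in keys, or None if it is absent."""
    # Rebuilding this list per call keeps the sketch short; cache it in practice.
    s = bisect.bisect_right([seg.first_key for seg in segments], key) - 1
    if s < 0:
        return None
    seg = segments[s]
    pred = int(seg.first_pos + seg.slope * (key - seg.first_key))
    lo = min(len(keys), max(0, pred - epsilon - 1))
    hi = min(len(keys), pred + epsilon + 2)
    i = bisect.bisect_left(keys, key, lo, hi)  # search only the small window
    return i if i < len(keys) and keys[i] == key else None
```

Evenly spaced keys collapse into a single segment, which is why this family of structures stays compact on clustered or monotonic data; erratic key distributions need many more segments.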

B+ Tree

Classic balanced tree with predictable behavior across any key distribution. Select it with USING btree when the data is unsorted or highly random.

Bloom Filter

File-level probabilistic test. Each Parquet file carries a bloom filter for the indexed columns; the planner skips files whose filter rejects the predicate. The false-positive probability (fpp) and expected item count (num_items) are tunable.
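As a rough illustration of how fpp and num_items interact, the sketch below sizes a bloom filter with the standard textbook formulas and uses one filter per file to decide which files to read. The class, the hashing scheme, and the file names are assumptions made for the example; the source does not describe how DeltaForge builds or stores its filters.

```python
import hashlib
import math


class BloomFilter:
    """Minimal bloom filter sized from an expected item count and target fpp."""

    def __init__(self, num_items: int, fpp: float) -> None:
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-num_items * math.log(fpp) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / num_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, value: str):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


def files_to_scan(filters: dict, value: str) -> list:
    """Keep only the files whose filter cannot rule out the value."""
    return [name for name, bf in filters.items() if bf.might_contain(value)]


# Hypothetical files, each holding 1,000 distinct keys.
filters = {"part-0": BloomFilter(1000, 0.01), "part-1": BloomFilter(1000, 0.01)}
for i in range(1000):
    filters["part-0"].add(f"user-{i}")
    filters["part-1"].add(f"user-{i + 1000}")
```

A lower fpp buys fewer wasted file reads at the cost of a larger filter. A filter can return a false positive but never a false negative, so skipping a rejected file does not change query results.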

Frequently Asked Questions

Quick answers on indexing Delta Lake tables

Can you create an index on a Delta Lake table?

Yes. DeltaForge creates PGM learned, B+ tree, and bloom filter indexes on Delta Lake tables. The index lives as a child Delta table next to the parent, and the parent stays in standard Delta format.

Do indexes speed up MERGE, UPDATE, and DELETE?

When the statement targets a specific row or a small set, yes. The locate step becomes a direct read instead of a scan, which is where slow MERGE workloads spend most of their time.

What happens when an index goes stale?

The planner silently ignores it and falls back to the standard scan path, so results stay correct. Optional auto-update keeps the index in sync on parent commits.
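The fallback can be pictured with a small sketch. How DeltaForge actually detects staleness is not stated here; the version comparison below, along with the `IndexState` type and its fields, is an assumption for illustration only. The point is that both paths return the same rows, and the index only changes how many rows are touched.

```python
from dataclasses import dataclass


@dataclass
class IndexState:
    built_at_version: int  # parent table version the index reflects (assumed field)
    locations: dict        # key -> list of row positions in the parent


def lookup(rows, parent_version, index, key):
    """Return (matching rows, name of the path used)."""
    if index is not None and index.built_at_version == parent_version:
        # Fresh index: read only the recorded positions.
        return [rows[p] for p in index.locations.get(key, [])], "index"
    # Stale or missing index: ignore it and scan every row.
    return [r for r in rows if r["id"] == key], "scan"


rows = [{"id": 7, "v": "a"}, {"id": 9, "v": "b"}, {"id": 7, "v": "c"}]
index = IndexState(built_at_version=3, locations={7: [0, 2], 9: [1]})

fresh_hits, fresh_path = lookup(rows, 3, index, 7)  # index matches the parent version
stale_hits, stale_path = lookup(rows, 4, index, 7)  # parent moved on, index is ignored
```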
