The comparison point: DuckDB

DuckDB is the obvious reference. It is excellent for single-machine analytics, and for datasets that fit on one laptop or server it is often faster to set up than anything else. But DuckDB has a hard architectural limit: it is a single-process engine. When your table grows past the machine's usable disk, or when you need concurrent writers, transaction history, and ongoing MERGE operations, DuckDB is not the right tool. It also has no native Delta Lake or Iceberg write support.

DeltaForge operates differently. Compute nodes stream Parquet row groups from object storage in parallel. There is no in-memory table to fill up. A 10 TB table on S3 is read the same way a 10 GB table is: row groups decoded in parallel, aggregated in flight, never fully loaded. Adding a compute node increases throughput; the data itself stays in your object storage.
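To make the streaming model concrete, here is a minimal Python sketch of aggregating in flight. It is not DeltaForge code: `read_row_group` fakes a row group with generated integers, and every function name here is hypothetical. The point it illustrates is that each worker decodes one row group, reduces it to a small partial result, and discards the rows, so memory use tracks the number of workers, not the size of the table.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from typing import List, Tuple


def read_row_group(index: int, rows_per_group: int) -> List[int]:
    """Stand-in for fetching and decoding one Parquet row group (simulated data)."""
    start = index * rows_per_group
    return list(range(start, start + rows_per_group))


def partial_aggregate(index: int, rows_per_group: int) -> Tuple[int, int]:
    """Decode one row group and reduce it to a small (sum, count) pair."""
    rows = read_row_group(index, rows_per_group)
    return sum(rows), len(rows)


def streaming_total(n_groups: int, rows_per_group: int, workers: int = 4) -> Tuple[int, int]:
    """Combine per-group partials; at most `workers` row groups are decoded at once."""
    total, count = 0, 0
    work = partial(partial_aggregate, rows_per_group=rows_per_group)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Only lightweight group indexes are queued; rows live inside the worker.
        for group_sum, group_count in pool.map(work, range(n_groups)):
            total += group_sum
            count += group_count
    return total, count


if __name__ == "__main__":
    total, count = streaming_total(n_groups=100, rows_per_group=1_000)
    print(f"sum={total} rows={count}")
```

Raising `workers` in this sketch plays the role of adding compute: more row groups are reduced concurrently, while the merged result stays the same size.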

What actually limits performance at extreme scale

The ceiling is not memory. The two things that create problems on very large, actively written tables are both properties of Delta table maintenance, not the engine.

File count

Every write creates at least one Parquet file. A table that receives thousands of small appends per day without compaction accumulates hundreds of thousands of files. Before execution starts, the engine must build the file list and load per-file statistics for pruning. With 500,000 small files, that planning step is slow regardless of how fast the engine reads data. Run OPTIMIZE regularly to compact small files into target-sized files (128 MB default), and most of this planning cost goes away.

-- Compact and co-locate by your most-filtered columns
OPTIMIZE my_catalog.events ZORDER BY (event_date, customer_id);
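
As a rough illustration of why compaction matters for planning, the sketch below estimates how many files remain once small files are rewritten at the target size. The 2 MiB average file size is an assumption chosen for the example, and `files_after_compaction` is a hypothetical helper, not a DeltaForge function. The result is a lower bound: partitioning usually leaves somewhat more files than this.

```python
import math

MIB = 1024 * 1024


def files_after_compaction(
    file_count: int,
    avg_file_bytes: int,
    target_bytes: int = 128 * MIB,
) -> int:
    """Lower bound on the file count after rewriting small files at the target size."""
    if file_count < 0 or avg_file_bytes < 0 or target_bytes <= 0:
        raise ValueError("counts and sizes must be non-negative, target must be positive")
    total_bytes = file_count * avg_file_bytes
    return math.ceil(total_bytes / target_bytes)


if __name__ == "__main__":
    before = 500_000  # small files, as in the example above
    after = files_after_compaction(before, avg_file_bytes=2 * MIB)  # assumed average size
    print(f"{before} files -> at least {after} files after compaction")
```

Under these assumptions, 500,000 files shrink to roughly 7,800, which is the number of per-file statistics entries the planner has to load instead.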

Delta log depth

Every commit appends a JSON entry to _delta_log. Delta Lake's checkpoint mechanism writes a checkpoint that consolidates the log every 10 commits by default, so a 5-year-old table with millions of commits still replays only one checkpoint file plus up to 9 JSON entries. The risk is a high-frequency write table where the checkpoint interval is too wide, or where checkpoints have never run. DeltaForge also caches the table snapshot in memory across queries within the same session, so the first query pays the replay cost and subsequent ones do not.

-- Tighten the checkpoint interval on a high-write table where it was set too wide
ALTER TABLE my_catalog.events
SET TBLPROPERTIES ('delta.checkpointInterval' = '10');
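
The replay cost described above comes down to simple arithmetic, sketched here in Python. The helper `json_commits_to_replay` is hypothetical, and it assumes a checkpoint is written at every version that is a multiple of the interval, which is a simplification of how checkpoints are actually scheduled.

```python
def json_commits_to_replay(
    latest_version: int,
    checkpoint_interval: int,
    checkpoints_exist: bool = True,
) -> int:
    """JSON commit files a reader must replay to reconstruct the latest snapshot.

    Assumes a checkpoint exists at every version divisible by the interval.
    """
    if latest_version < 0 or checkpoint_interval <= 0:
        raise ValueError("version must be >= 0 and interval must be > 0")
    if not checkpoints_exist or latest_version < checkpoint_interval:
        # No checkpoint to start from: replay versions 0..latest_version.
        return latest_version + 1
    last_checkpoint = (latest_version // checkpoint_interval) * checkpoint_interval
    return latest_version - last_checkpoint


if __name__ == "__main__":
    version = 1_234_567  # illustrative commit count
    for interval in (10, 1_000):
        n = json_commits_to_replay(version, interval)
        print(f"interval={interval}: {n} JSON commits after the checkpoint")
    never = json_commits_to_replay(version, 10, checkpoints_exist=False)
    print(f"no checkpoints: {never} JSON commits")
```

With an interval of 10 the reader never replays more than 9 JSON entries, however old the table is. With a wide interval or no checkpoints at all, the replay length grows with the interval or with the full commit history.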

The honest comparison

| | DuckDB | DeltaForge |
| --- | --- | --- |
| Memory ceiling | Single machine RAM / disk | None: streams from object storage |
| Scale-out | Single process | Add compute nodes |
| Delta Lake writes | Read-only (via extension) | Full ACID writes, MERGE, time travel |
| Main risk at scale | Table outgrows the machine | Unmanaged file count or log depth |
| Remedy | Upgrade hardware or export to larger system | OPTIMIZE + VACUUM on a regular cadence |

Related