How do you run OPTIMIZE and VACUUM without Databricks?
If you search for Delta maintenance, nearly everything you find assumes one of two stacks. The Delta Lake documentation for Spark shows OPTIMIZE and VACUUM as Spark SQL run inside a Spark session, and the Databricks documentation layers its own scheduling and managed features on top. The main alternative, the delta-rs project, documents compaction, Z-ordering, and vacuum as methods on a Python DeltaTable object, which is a good fit if your maintenance jobs are Python programs, and less of one if your team operates in SQL.
DeltaForge takes the third path: it implements the Delta Lake protocol natively inside its own engine, so the maintenance commands are first-class SQL statements. OPTIMIZE my_table; bin-packs small Parquet files into larger ones and commits the result as a new table version. VACUUM my_table; lists the storage directory, cross-checks it against the transaction log, and deletes files no longer referenced by the current version. There is no JVM to provision and no cluster to size; the engine runs wherever you installed it and works directly against the table's _delta_log.
Two things matter for trust here. First, both commands are atomic log operations: OPTIMIZE records its rewrite as AddFile and RemoveFile actions in a single commit, so concurrent readers see either the old layout or the new one, never a mix. Second, both are physical-only: the rows your queries return are identical before and after, and this runbook verifies that explicitly rather than asking you to take it on faith.
What the small files problem does at the storage layer
A Delta table is a directory of immutable Parquet files plus a transaction log that says which of those files make up the current version. Two write patterns degrade that layout over time.
The first is ingestion fragmentation. Every INSERT batch writes at least one new Parquet file, so a pipeline that loads five micro-batches a day creates five files a day, each carrying its own footer, metadata, and per-file open cost at read time. A query that should scan one well-sized file ends up opening dozens of fragments, and the overhead is paid on every read, forever, until someone compacts.
The second is copy-on-write churn. Parquet files are immutable, so an UPDATE or DELETE does not modify a file in place; it rewrites every affected file without the changed rows and marks the old file as removed in the log. The old file stays on disk to serve time travel queries against earlier versions. Run a few status updates and a cancellation sweep over a week of micro-batches and the storage directory holds several complete generations of the same rows, of which exactly one is live.
These are two separate bills. Fragmentation costs you query latency; orphaned generations cost you storage. OPTIMIZE pays the first and VACUUM pays the second, which is why they form a sequence rather than a choice.
The runbook, end to end
The walkthrough below is the delta-vacuum-optimize-maintenance demo, which ships in the DeltaForge demo library alongside delta-optimize-compaction and delta-vacuum-cleanup, so you can run every statement yourself. It simulates an order pipeline with daily micro-batches, then mutations, then the maintenance pass.
-
Create the zone, schema, and Delta table. The
LOCATIONis any object-store path the engine can reach; the table's transaction log lives under it.CREATE ZONE IF NOT EXISTS external TYPE EXTERNAL COMMENT 'External and Delta tables for demo datasets'; CREATE SCHEMA IF NOT EXISTS external.delta_demos COMMENT 'Delta table maintenance demos'; CREATE DELTA TABLE IF NOT EXISTS external.delta_demos.order_pipeline ( id INT, order_ref VARCHAR, category VARCHAR, product VARCHAR, price DOUBLE, status VARCHAR, order_date VARCHAR ) LOCATION 's3://acme-lake/demo/order_pipeline';One table, one path. Schema discovery and registration happen at creation; there is no separate catalog ceremony. -
Load and mutate. Five daily batches of orders land as five separate file sets, then a shipping update and a cancellation delete rewrite some of them via copy-on-write.
INSERT INTO external.delta_demos.order_pipeline VALUES (1, 'ORD-1001', 'Electronics', 'Wireless Headphones', 79.99, 'pending', '2025-03-03'), (2, 'ORD-1002', 'Electronics', 'USB-C Hub', 34.99, 'pending', '2025-03-03'), (3, 'ORD-1003', 'Clothing', 'Running Shoes', 129.99, 'pending', '2025-03-03'); -- ...four more daily batches land the same way, 40 orders in total... UPDATE external.delta_demos.order_pipeline SET status = 'shipped' WHERE id BETWEEN 1 AND 8; DELETE FROM external.delta_demos.order_pipeline WHERE id IN (11, 22, 37);Eight table versions later: 37 live rows, many small files, several orphaned generations. -
Inspect the physical state.
DESCRIBE DETAILreports the file count and size of the current version, which is your before picture.DESCRIBE DETAIL external.delta_demos.order_pipeline;Expect a double-digit file count for a table this small: that is the small files problem in miniature. -
Compact.
OPTIMIZEselects files below the target size, bin-packs them, rewrites each bin as one larger file, and commits. It returns three metrics:files_removed,files_added, and the newversion.OPTIMIZE external.delta_demos.order_pipeline;Re-running it is safe: if no small files are found, nothing is rewritten and no new version is created. -
Purge the orphans.
VACUUMdeletes every file on storage that the current version no longer references and that has aged past the retention window. The demo usesRETAIN 0 HOURSso the effect is visible immediately; the next section covers why production should not.VACUUM external.delta_demos.order_pipeline RETAIN 0 HOURS;VACUUM returns files_deleted and bytes_freed. It does not create a new table version; it is purely physical. -
Verify. The demo asserts that every business aggregate matches the pre-maintenance baseline, which is the proof that both operations are lossless. The
ASSERTlines are DeltaForge's built-in test harness and run as part of the script.ASSERT VALUE total_orders = 37 ASSERT VALUE total_revenue = 2143.77 ASSERT VALUE categories = 4 SELECT COUNT(*) AS total_orders, ROUND(SUM(price), 2) AS total_revenue, COUNT(DISTINCT category) AS categories FROM external.delta_demos.order_pipeline;Same row count, same revenue, same categories. Only the file layout changed.
Should you run OPTIMIZE before or after VACUUM?
OPTIMIZE first, then VACUUM. The reasoning falls straight out of what each command does. OPTIMIZE rewrites small files into large ones, and the small originals become unreferenced the moment the compaction commit lands; until something deletes them, the table occupies storage for both generations. VACUUM is the only operation that deletes unreferenced files. So the compaction itself manufactures exactly the kind of garbage that VACUUM exists to collect.
If you run the sequence the other way around, nothing breaks, but the vacuum pass only collects the orphans your DML already produced, and the fresh orphans from the subsequent OPTIMIZE wait on disk until the next cycle. With OPTIMIZE first, one VACUUM collects both populations: the copy-on-write leftovers from updates and deletes, and the pre-compaction fragments. One sequence, one storage bill.
One honest caveat on cadence interaction: because VACUUM honors a retention window, files orphaned by an OPTIMIZE that ran moments earlier are younger than the default 168-hour threshold and will survive a default-retention vacuum. That is by design, not a bug; they will be collected once they age past the window on a later run. Lowering retention to harvest them immediately trades away the safety margin discussed next.
How often should you vacuum a Delta table?
There is no single number, but there is a defensible default: weekly, with the default retention. The cadence question is really a question about two windows you must not cut into.
- Default retention
- 168 hours (7 days). Files must be both unreferenced and older than this before VACUUM will delete them.
- What the window protects
- Time travel queries against recent versions, and long-running concurrent readers that pinned an older snapshot. Deleting a file one of them still needs produces file-not-found failures mid-query.
- Change Data Feed files
- CDF files under
_change_data/follow the log retention default (30 days) rather than the data retention window, because downstream CDC consumers typically read further back.CDC_RETAINoverrides this independently. - Preview first
DRY RUNlists what would be deleted without deleting it. Note it takes no lock, so the real run may differ slightly if writes land in between.
In practice: high-churn tables (frequent MERGE, UPDATE, DELETE) accumulate orphans fastest and justify a tighter cycle, even daily, still with a safe retention value. Batch-only tables with no time-travel consumers can run shorter retention safely. What you should never do is set RETAIN below your longest expected time-travel lookback; once VACUUM deletes a version's files, queries against that version fail permanently. The syntax covers all of these cases:
-- Weekly default: deletes unreferenced files older than 7 days
VACUUM external.delta_demos.hr_employees;
-- Preview which files would be deleted, without deleting them
VACUUM external.delta_demos.hr_employees DRY RUN;
-- Tighter window for a high-churn table with no long lookbacks
VACUUM external.delta_demos.hr_employees RETAIN 72 HOURS;
-- Short data retention, but keep 30 days of Change Data Feed history
VACUUM external.delta_demos.hr_employees RETAIN 24 HOURS CDC_RETAIN 720 HOURS;
Does Z-ORDER still matter outside Databricks?
Yes, and the reason is worth stating plainly. Databricks now recommends liquid clustering over Z-ordering for tables on its platform, and the part of liquid clustering that makes it attractive, the automatic background re-clustering, is managed by Databricks itself. If your Delta tables live outside that platform, you do not get that managed loop. What you do get, portably, is OPTIMIZE ... ZORDER BY, which any protocol-conformant engine can apply and any reader benefits from.
The mechanism is layout, not indexing. Without it, rows sit in insertion order: a March batch of European temperature readings here, a mixed batch there. A query filtering on region and sensor_type must scan every file, because every file's min/max statistics span all regions and all types. ZORDER BY rewrites the data along a Z-curve (Morton code) that interleaves the bits of the chosen columns, so rows with similar values across all of those columns land in the same files. Each file's statistics then cover a tight range, and the engine skips files that cannot contain matches. The delta-zorder-deep demo makes this concrete with IoT telemetry spread across three insert batches:
OPTIMIZE external.delta_demos.zorder_sensor_telemetry
ZORDER BY (region, sensor_type, recorded_date);
The query that previously had to touch every file now prunes against tight statistics, and returns exactly the same rows it did before the rewrite:
SELECT id, device_id, reading, unit, quality_score, recorded_date
FROM external.delta_demos.zorder_sensor_telemetry
WHERE region = 'eu-west'
AND sensor_type = 'temperature'
AND recorded_date BETWEEN '2025-03-01' AND '2025-03-05'
ORDER BY recorded_date;
DESCRIBE HISTORY external.delta_demos.zorder_sensor_telemetry;
Two scoping notes, because Z-ORDER is routinely oversold. It is a multi-dimensional compromise: no single column ends up perfectly sorted, so if your queries only ever filter one column, a plain sort serves that column better. And the locality budget dilutes quickly; past three or four columns, each addition weakens the skipping benefit for all the others. Pick the two or three columns your dashboards actually filter together and stop there.
Should I partition my Delta table?
Less often than instinct suggests. Partitioning splits the table into physical directories keyed on a column's values, and it earns its keep only when both of these hold: the column is low-cardinality (dates, regions, tenants in the tens or hundreds), and nearly every query filters on it. When those hold, partition pruning eliminates whole directories before a single file is opened, and OPTIMIZE can be scoped with a WHERE clause to compact only recent partitions instead of rewriting history. Note that the WHERE clause filters on partition columns only; a predicate on a non-partition column is silently ignored, not applied.
When those conditions do not hold, partitioning actively hurts, and it hurts through the exact mechanism this runbook exists to fix: every partition accumulates its own small files, so a high-cardinality partition column multiplies the fragment count by the partition count. A table partitioned by customer_id with fifty thousand customers is fifty thousand small-files problems wearing one table name. For most analytical tables the better layout is no partitions at all, with ZORDER BY on the common filter columns; file-level statistics give the engine the pruning, and the maintenance surface stays a single directory.
A workable decision rule: partition on at most one column, only if it is low-cardinality and ubiquitous in filters, and reach for Z-ORDER for every dimension after that.
Run it yourself
Everything above is executable. Four demos in the DeltaForge demo library cover the cycle, each with setup data, the maintenance commands, and ASSERT-verified integrity checks: delta-optimize-compaction (compaction and target sizing), delta-vacuum-cleanup (orphan cleanup after DML churn), delta-zorder-deep (multi-column layout and data skipping), and delta-vacuum-optimize-maintenance (the full OPTIMIZE-then-VACUUM playbook walked through here). They run on the DeltaForge SQL engine against any object store you point them at, and the same statements drop into a scheduled pipeline once you are satisfied with the behavior on a test table.