Why file count affects query startup

Before reading rows, a Delta engine must determine the current snapshot and which files or row groups can be skipped. With a long log and hundreds or thousands of cloud files, that planning work can require many object-store requests before the data scan begins.
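
As a rough illustration of that cost, the sketch below counts the object-store requests a planner might issue when no mirror is available. The function name and the cost model are hypothetical: it assumes one listing of the log directory, one read of the latest checkpoint, one read per JSON commit written after it, and one ranged read per Parquet footer. It is not a measurement of any particular engine.

```python
def planning_requests(commits_since_checkpoint: int,
                      active_files: int,
                      read_footers: bool = True) -> int:
    """Rough count of object-store requests issued before the scan starts.

    Hypothetical cost model (an assumption, not engine behavior):
      1 LIST of the log directory
      1 GET for the latest checkpoint
      1 GET per JSON commit written after that checkpoint
      1 ranged GET per Parquet footer, if row-group statistics are needed
    """
    if commits_since_checkpoint < 0 or active_files < 0:
        raise ValueError("counts must be non-negative")
    log_requests = 1 + 1 + commits_since_checkpoint
    footer_requests = active_files if read_footers else 0
    return log_requests + footer_requests


if __name__ == "__main__":
    # 8 commits after the last checkpoint, 2,000 active files
    print(planning_requests(8, 2000))  # prints 2010
```

Under this model the footer reads dominate, which is why the active file count matters more than the log length once a table grows.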

Enable Meta Store for a table

Opt the table into automatic Meta Store use, then register the mirror:

ALTER TABLE analytics.sales.orders
SET TBLPROPERTIES (
    'delta.forge.metastore.enabled' = 'true'
);

CREATE METASTORE 'analytics.sales.orders';

The initial hydration reads the Delta history and active Parquet footers. The table property keeps the accelerated path aligned with table changes.

What changes at query time

The planner can fetch the active file set, row groups, deletion vectors and column statistics as an Arrow payload from the control plane. It prunes the plan from that metadata, then reads only the selected Parquet data from cloud storage.
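
To make the pruning step concrete, here is a minimal sketch of statistics-based file skipping. It assumes each file carries a minimum and maximum for one integer column; the `FileStats` type, the file names and the `order_id` predicate are hypothetical and say nothing about how Meta Store lays out its metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-file min/max for a single column (illustrative only)."""
    path: str
    min_value: int
    max_value: int


def prune(files, lower, upper):
    """Keep files whose [min, max] range overlaps the closed range [lower, upper]."""
    return [f for f in files
            if f.max_value >= lower and f.min_value <= upper]


files = [
    FileStats("part-000.parquet", 1, 100),
    FileStats("part-001.parquet", 101, 200),
    FileStats("part-002.parquet", 201, 300),
]

# Hypothetical predicate: order_id BETWEEN 150 AND 220
selected = [f.path for f in prune(files, 150, 220)]
print(selected)  # prints ['part-001.parquet', 'part-002.parquet']
```

Only the selected files would then be opened in cloud storage; the first file is excluded without a single request against it.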

Refresh or rebuild when needed

An explicit incremental refresh catches up from the last mirrored version:

ALTER METASTORE 'analytics.sales.orders'
REFRESH MODE = INCREMENTAL;

Use a full refresh to rebuild the mirror from the authoritative Delta log:

ALTER METASTORE 'analytics.sales.orders'
REFRESH MODE = FULL;
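
The difference between the two modes can be sketched as a replay of Delta `add` and `remove` actions. This is a conceptual model only, assuming a toy in-memory log where index i holds the actions of version i; the function names are hypothetical and do not describe how Meta Store implements a refresh.

```python
def apply_commits(active, commits):
    """Apply Delta-style add/remove actions to a set of active file paths."""
    result = set(active)
    for actions in commits:
        for action in actions:
            if "add" in action:
                result.add(action["add"]["path"])
            elif "remove" in action:
                result.discard(action["remove"]["path"])
    return result


def full_refresh(log):
    """Rebuild the active file set by replaying every version from scratch."""
    return apply_commits(set(), log)


def incremental_refresh(mirror_files, mirrored_version, log):
    """Replay only the versions written after the last mirrored one."""
    return apply_commits(mirror_files, log[mirrored_version + 1:])


# Hypothetical log: three versions, the last one rewrites a.parquet into c.parquet.
log = [
    [{"add": {"path": "a.parquet"}}],
    [{"add": {"path": "b.parquet"}}],
    [{"remove": {"path": "a.parquet"}}, {"add": {"path": "c.parquet"}}],
]

mirror = full_refresh(log[:2])                   # mirror built at version 1
caught_up = incremental_refresh(mirror, 1, log)  # replay version 2 only
print(sorted(caught_up))  # prints ['b.parquet', 'c.parquet']
```

Both paths converge on the same file set when the mirror is intact; a full rebuild is the fallback when the mirrored state itself is in doubt.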

Inspect the mirror

SHOW METASTORE 'analytics.sales.orders';

SHOW METASTORE STATS 'analytics.sales.orders';

SHOW METASTORE FILES 'analytics.sales.orders';

Check the cached version, file count, total size and whether the current session has activated its plan cache.

When it is useful

  • Delta tables on S3, ADLS Gen2 or GCS.
  • Tables with many active files or row groups.
  • Long-running tables with frequent commits.
  • Interactive BI workloads and multiple compute nodes.

Small tables on local storage may not benefit enough to justify the extra database state and refresh operation.

The result

Meta Store does not replace Delta Lake. The Delta log remains the source of truth while repeated planning work is served from an indexed database representation shared by the compute nodes.