
Read Conformance Test Plan

Apache Spark 4.0 writes a Delta Lake or native Iceberg table. DeltaForge reads it back. Each script asserts row count, specific cell values, and aggregates computed outside the engine before the test runs.

  • Spark writes Delta and native Iceberg
  • DeltaForge reads through its own engine
  • 7,500+ scripts, value-level assertions
1. Apache Spark 4.0 writes the table
   df.write.format("delta") / df.writeTo("ns.tbl").using("iceberg")
   spark-writes/ + spark-writes-iceberg/

2. Bytes on storage
   Delta Lake: _delta_log/ + part-*.parquet, deletion_vector_*.bin, checkpoints
   Native Iceberg: metadata/v1.metadata.json, metadata/snap-*.avro, data/*.parquet

3. DeltaForge reads and asserts
   SELECT ...; ASSERT ROW_COUNT = N; ASSERT VALUE ...
   df-reads-spark/ + df-reads-spark-iceberg/
   row count + value asserts + aggregate asserts

Planned coverage

7,528 scripts across Delta Lake and native Iceberg reads. Pass/fail counts appear after the first run.

File and metadata layouts

Parquet variants (Snappy, Zstd, Gzip), Delta log V1/V2, multipart checkpoints, Iceberg manifest formats V1/V2/V3.

Type system

All numeric types including high-precision decimal, temporal types (date, timestamp, timestamp-ntz, INT96 legacy, nanosecond V3), strings, binary, complex types to any nesting depth.

Spark-specific features

Deletion vectors (Roaring bitmap), column mapping (name and id mode), type widening, generated columns, identity columns, Iceberg position deletes, equality deletes, partition transforms.

DML and time travel

Reads after INSERT, UPDATE, DELETE, and MERGE. VERSION AS OF and TIMESTAMP AS OF. Iceberg snapshot-id reads. Reads after RESTORE.

Why some scripts are skipped. A subset of read scripts cannot run because Spark OSS does not implement the matching write-path feature. Examples include row tracking, identity columns, and in-commit timestamps. If the writer cannot produce the table, there is nothing for DeltaForge to read. These scripts are tagged skip_cause: "spark_oss_limitation" and are excluded from the executable pass-rate denominator.
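The effect of the skip tag on the pass rate can be sketched in a few lines of Python. This is an illustration only: the ScriptResult record, the executable_pass_rate helper, and every script name except 01_basic_data_files are hypothetical and are not part of DeltaForge. Only the skip_cause value comes from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScriptResult:
    """Hypothetical result record for one read script."""
    name: str
    passed: Optional[bool]            # None when the script did not run
    skip_cause: Optional[str] = None  # e.g. "spark_oss_limitation"


def executable_pass_rate(results: List[ScriptResult]) -> Optional[float]:
    """Pass rate over scripts that could run; skipped scripts leave the denominator."""
    executable = [r for r in results if r.skip_cause is None]
    if not executable:
        return None  # nothing executable, so no rate to report
    passed = sum(1 for r in executable if r.passed)
    return passed / len(executable)


# Script names other than 01_basic_data_files are made up for this sketch.
results = [
    ScriptResult("01_basic_data_files", True),
    ScriptResult("02_deletion_vectors", False),
    ScriptResult("03_row_tracking", None, skip_cause="spark_oss_limitation"),
    ScriptResult("04_type_widening", True),
]

rate = executable_pass_rate(results)
print(f"{rate:.3f}")  # 2 passed out of 3 executable scripts
```

Under this convention a skipped script neither raises nor lowers the rate. A failure still counts against it.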

What "pass" means

Each verification script is hand-written with explicit expected values, not derived from engine output.

  • ROW_COUNT. Every script asserts the exact row count after reading.
  • VALUE. Specific cells: "the value of order_number for id = 1 must be ORD10001". Catches per-cell drift.
  • Aggregates. MIN, MAX, SUM, COUNT DISTINCT against known totals. Catches statistical drift that single-cell checks would miss.
  • Schema shape. Column names, types, nullability, and field IDs verified on every script.
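To see why the assertion kinds complement each other, here is a minimal Python sketch over a hypothetical four-row table. The rows, the run_checks helper, and the simulated drift are assumptions made for illustration. Only the ORD10001 value for id = 1 is taken from the text above, and real scripts use DeltaForge's own ASSERT syntax rather than Python.

```python
# Hypothetical in-memory stand-in for the rows a reader returns.
EXPECTED_ROWS = [
    {"id": 1, "order_number": "ORD10001"},
    {"id": 2, "order_number": "ORD10002"},
    {"id": 3, "order_number": "ORD10003"},
    {"id": 4, "order_number": "ORD10004"},
]


def run_checks(rows):
    """Return one boolean per assertion kind for the four-row table."""
    id_1 = [r for r in rows if r["id"] == 1]
    return {
        # ROW_COUNT: exact number of rows read
        "row_count": len(rows) == 4,
        # VALUE: one specific cell, selected by key
        "value": len(id_1) == 1 and id_1[0]["order_number"] == "ORD10001",
        # Aggregates: totals that depend on every row
        "max_id": max(r["id"] for r in rows) == 4,
        "count_distinct_id": len({r["id"] for r in rows}) == 4,
    }


# Simulate drift in a row that no VALUE assertion targets.
drifted = [dict(r) for r in EXPECTED_ROWS]
drifted[-1]["id"] = 3

clean_result = run_checks(EXPECTED_ROWS)
drift_result = run_checks(drifted)
print(clean_result)
print(drift_result)
```

In the drifted case the row count and the single-cell check still pass. Only the aggregate checks fail, which is the gap the aggregate assertions are meant to close.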

Expected values are derived from the deterministic generator formulas, so the test catches drift from any direction: reader bug, type-coercion bug, file-skipping bug, or generator regression.
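As a rough illustration of deriving expected values from generator formulas rather than from engine output, the Python sketch below computes them in closed form. The order_number formula and the assumption that ids run from 1 to N without gaps are hypothetical. They are chosen only to agree with the values asserted in the example script (372 rows, ORD10001 for id = 1, maximum id 372).

```python
N_ROWS = 372  # row count asserted in the example script


def order_number(row_id: int) -> str:
    """Hypothetical generator formula: id 1 maps to 'ORD10001'."""
    return f"ORD{10000 + row_id}"


def expected_values(n: int) -> dict:
    """Closed-form expectations, assuming ids run 1..n with no gaps."""
    return {
        "row_count": n,
        "min_id": 1,
        "max_id": n,
        "sum_id": n * (n + 1) // 2,  # arithmetic series 1 + 2 + ... + n
        "count_distinct_order_number": n,
        "order_number_for_id_1": order_number(1),
    }


expected = expected_values(N_ROWS)

# Cross-check the closed forms against a brute-force pass over the ids.
ids = range(1, N_ROWS + 1)
brute_force_sum = sum(ids)
brute_force_distinct = len({order_number(i) for i in ids})

print(expected)
print(brute_force_sum == expected["sum_id"])
print(brute_force_distinct == expected["count_distinct_order_number"])
```

Because nothing here touches a reader or a writer, a mismatch between these numbers and what DeltaForge returns points at the read path, the written bytes, or the generator itself.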

# df-reads-spark/01_basic_data_files.sql
CREATE DELTA TABLE basic_data_files (
    id BIGINT, order_number STRING, ...
) LOCATION '${TABLE_PATH}';

ASSERT ROW_COUNT = 372
SELECT * FROM basic_data_files;

ASSERT VALUE order_number = 'ORD10001' WHERE id = 1
SELECT id, order_number FROM basic_data_files;

ASSERT VALUE max_id = 372
SELECT MAX(id) AS max_id FROM basic_data_files;

If Spark wrote it, DeltaForge can read it

Or it shows up as a failure on the conformance dashboard. For executable scripts, there is no third outcome.