
Chaos Engineering Test Plan

Round-trip parity proves the engines agree under happy conditions. This plan proves they agree under failure: process kills mid-commit, network partitions, concurrent writers, clock skew. The data must remain correct or roll back cleanly.

Atomic commit verified under failure injection
Concurrent writer conflict resolution
Both engines must read the same post-chaos state
Roadmap: this page is the test plan, not yet a results page. The categories below define what will run; results appear once runs begin.
[Diagram: Writer A (DeltaForge, commit v=42), Writer B (Apache Spark, commit v=42), and Writer C (DeltaForge, vacuum + optimize) write to shared storage under chaos (SIGKILL, net-split, disk-full). Writers A and B both issue PUT v=42. Storage holds part-001.parquet (ok), part-002.parquet (torn), and _delta_log/00043.json (uncommitted). Verification: correct or rolled back cleanly. DeltaForge and Spark both read the table and must agree on the same state.]

Three invariants under test

The Delta Lake and Iceberg specs describe atomic commits, optimistic concurrency, and checkpointing. This plan tests whether the implementation holds when things go wrong.

Atomicity

A commit is fully visible or fully invisible. No half-states survive process kill, network drop, or disk-full.

Isolation

Concurrent writers either both succeed in serial order, or the loser retries cleanly against the updated state.

Durability

A successful commit survives any failure of any other component after the commit completes.

Planned test categories

Each category is a family of failure scenarios. All are planned; results will appear once runs begin.

Process-Kill During Commit

planned

SIGKILL at each phase: after data files written but before log entry; after log temp file, before atomic rename; after rename, before checkpoint.

  • Kill before data flush
  • Kill between data and log
  • Kill mid-checkpoint
  • Kill mid-vacuum
Status: roadmap
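
The kill phases above can be sketched on a local filesystem. This is a minimal illustration, not the DeltaForge or Spark commit path: the three-phase `commit` function, the file naming, and the `SimulatedKill` exception (which stands in for a real SIGKILL of a child process) are all assumptions made for the example. It shows the property under test: a reader that trusts only renamed log entries sees the commit entirely or not at all.

```python
import json
import os
import tempfile


class SimulatedKill(Exception):
    """Stands in for SIGKILL; a real harness would kill a child process."""


def commit(table_dir, version, rows, kill_at=None):
    """Hypothetical three-phase commit: data file, temp log file, atomic rename."""
    log_dir = os.path.join(table_dir, "_delta_log")
    os.makedirs(log_dir, exist_ok=True)

    # Phase 1: write the data file. It is invisible until a log entry names it.
    data_name = f"part-{version:05d}.json"
    with open(os.path.join(table_dir, data_name), "w") as f:
        json.dump(rows, f)
    if kill_at == "after_data":
        raise SimulatedKill(kill_at)

    # Phase 2: write the log entry under a temporary name readers ignore.
    tmp_path = os.path.join(log_dir, f".{version:020d}.json.tmp")
    with open(tmp_path, "w") as f:
        json.dump({"add": data_name}, f)
    if kill_at == "after_temp_log":
        raise SimulatedKill(kill_at)

    # Phase 3: the rename is the commit point.
    os.rename(tmp_path, os.path.join(log_dir, f"{version:020d}.json"))
    if kill_at == "after_rename":
        raise SimulatedKill(kill_at)


def read_table(table_dir):
    """Return (latest_version, rows) using only fully renamed log entries."""
    log_dir = os.path.join(table_dir, "_delta_log")
    latest, rows = -1, []
    for name in sorted(os.listdir(log_dir)):
        if name.startswith(".") or not name.endswith(".json"):
            continue  # leftover temp file from a killed writer
        with open(os.path.join(log_dir, name)) as f:
            entry = json.load(f)
        with open(os.path.join(table_dir, entry["add"])) as f:
            rows.extend(json.load(f))
        latest = int(name.split(".")[0])
    return latest, rows


def run_scenario(kill_at):
    """Commit a baseline, kill the next commit at the given phase, then read."""
    with tempfile.TemporaryDirectory() as table_dir:
        commit(table_dir, 0, [1, 2])
        try:
            commit(table_dir, 1, [3, 4], kill_at=kill_at)
        except SimulatedKill:
            pass
        return read_table(table_dir)


phases = ("after_data", "after_temp_log", "after_rename")
results = {phase: run_scenario(phase) for phase in phases}
```

In this sketch a kill before the rename leaves the table at version 0 with the baseline rows, and a kill after the rename leaves it at version 1 with both commits applied. The orphaned data file and temp log file from the killed writer are exactly the leftovers a real run would need to account for.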

Concurrent Writers

planned

Two or more writers race for the same commit version. Exactly one must win; the loser retries cleanly without corrupting the table.

  • Two DeltaForge writers
  • DeltaForge plus Spark, same table
  • Conflict against schema change
  • Long-running merge vs short append
Status: roadmap
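
The race for a single commit version can be illustrated with exclusive file creation, which acts as a put-if-absent primitive on a local filesystem. This is a sketch under stated assumptions: the helper names `try_commit` and `commit_with_retry` are hypothetical, and the retry simply moves to the next version, whereas a real writer would re-read the log and re-check for logical conflicts before retrying.

```python
import os
import tempfile
import threading


def try_commit(log_dir, version, writer):
    """Create the log entry only if it does not exist yet (mode 'x' is exclusive)."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        with open(path, "x") as f:
            f.write(writer)
        return True
    except FileExistsError:
        return False


def commit_with_retry(log_dir, start_version, writer, outcome):
    """Retry at the next version until the exclusive create succeeds."""
    version = start_version
    while not try_commit(log_dir, version, writer):
        # A real writer would reload table state and check conflicts here.
        version += 1
    outcome[writer] = version


with tempfile.TemporaryDirectory() as log_dir:
    outcome = {}
    barrier = threading.Barrier(2)  # release both writers at the same moment

    def race(writer):
        barrier.wait()
        commit_with_retry(log_dir, 42, writer, outcome)

    threads = [threading.Thread(target=race, args=(w,)) for w in ("A", "B")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    committed = sorted(os.listdir(log_dir))
```

Whichever writer wins, exactly one of them lands on version 42 and the other on version 43, and the log holds two entries. Which writer wins is not deterministic, so a test should assert on the set of versions rather than on the winner.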

Network and Storage Failure

planned

Object storage returns 5xx, drops the connection mid-PUT, or hangs indefinitely. Disk fills mid-write. In every case, no half-committed state may be left behind.

  • S3 / GCS / Azure 5xx storms
  • TCP-reset mid upload
  • Disk-full ENOSPC during commit
  • Slow / hanging metadata reads
Status: roadmap
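
One way to inject storage failures deterministically is to wrap the store in a test double that fails on a fixed schedule. The sketch below is an in-memory illustration only: `FlakyStore`, `TransientStoreError`, and `put_with_retry` are hypothetical names, not part of any real object-store client, and the sketch omits the backoff a real retry loop would use.

```python
class TransientStoreError(Exception):
    """Stands in for an HTTP 5xx response or a connection reset."""


class FlakyStore:
    """In-memory stand-in for an object store; each PUT is all-or-nothing."""

    def __init__(self, failures):
        self.objects = {}
        self.failures = list(failures)  # per-attempt schedule: True means fail

    def put(self, key, data):
        if self.failures and self.failures.pop(0):
            # The failure fires before anything becomes visible to readers.
            raise TransientStoreError(f"injected failure on PUT {key}")
        self.objects[key] = bytes(data)


def put_with_retry(store, key, data, max_attempts=5):
    """Retry transient failures; return the attempt number that succeeded."""
    for attempt in range(1, max_attempts + 1):
        try:
            store.put(key, data)
            return attempt
        except TransientStoreError:
            if attempt == max_attempts:
                raise
    raise ValueError("max_attempts must be at least 1")


# Scenario 1: two failures, then success.
recovering = FlakyStore(failures=[True, True, False])
attempts = put_with_retry(recovering, "part-001.parquet", b"rows")

# Scenario 2: the store never recovers within the retry budget.
dead = FlakyStore(failures=[True] * 5)
try:
    put_with_retry(dead, "part-002.parquet", b"rows")
    gave_up = False
except TransientStoreError:
    gave_up = True
```

The first scenario succeeds on the third attempt and stores the object once. The second exhausts its retries and leaves the store empty, which is the "no half-committed state" outcome this category checks for.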

Read During Maintenance

planned

A long-running read is in flight while VACUUM, OPTIMIZE, or RESTORE runs concurrently. Snapshot isolation must hold.

  • Read while VACUUM removes old files
  • Read while OPTIMIZE compacts
  • Read across a RESTORE boundary
  • Snapshot expiration during long scan
Status: roadmap
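
The read-versus-VACUUM scenarios come down to whether a file pinned by an in-flight scan can be deleted before the scan finishes. The following is a simplified model, assuming a retention rule where a file becomes deletable once it has been logically removed for at least the retention period; the function names and the abstract time units are hypothetical.

```python
def vacuum(tombstones, now, retention):
    """Return the paths a vacuum at time `now` would physically delete.

    tombstones maps path -> time the file was logically removed from the table.
    """
    return {
        path
        for path, removed_at in tombstones.items()
        if now - removed_at >= retention
    }


def scan_is_safe(pinned_files, tombstones, vacuum_time, retention):
    """A scan is safe if vacuum deletes none of the files its snapshot pinned."""
    return not (set(pinned_files) & vacuum(tombstones, vacuum_time, retention))


# The reader pinned a snapshot that still references old.parquet,
# which a later OPTIMIZE logically removed at time 100.
tombstones = {"old.parquet": 100}
pinned = ["old.parquet", "keep.parquet"]

safe_within_retention = scan_is_safe(pinned, tombstones, vacuum_time=150, retention=100)
unsafe_after_retention = scan_is_safe(pinned, tombstones, vacuum_time=250, retention=100)
```

In this model the scan survives a vacuum at time 150, because the file has been removed for only 50 units, and breaks under a vacuum at time 250. A chaos run would aim scans at exactly that boundary and check that the reader either completes on its snapshot or fails with a clear error.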

Clock Skew and Time Travel

planned

Wall clock goes backward, jumps forward, or differs across writers. ICT-tagged (in-commit timestamp) commits and TIMESTAMP AS OF queries must resolve sensibly.

  • Backward clock jump between commits
  • Skewed clocks across writers
  • TIMESTAMP AS OF on the boundary
  • NTP step during a long transaction
Status: roadmap
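
A backward clock jump can be made concrete with a small model. The sketch assumes one common rule for commit timestamps: each commit takes the later of the wall clock and the previous commit's timestamp plus one, so timestamps stay strictly increasing. It also assumes TIMESTAMP AS OF resolves to the latest version at or before the requested time. Both rules and both function names are assumptions for illustration, not confirmed engine behavior.

```python
import bisect


def assign_commit_timestamp(wall_clock_ms, previous_ms):
    """Keep commit timestamps strictly increasing despite clock skew."""
    if previous_ms is None:
        return wall_clock_ms
    return max(wall_clock_ms, previous_ms + 1)


def version_as_of(commit_timestamps, query_ms):
    """Latest version with timestamp <= query_ms, or None if none qualifies.

    commit_timestamps[i] is the timestamp of version i and must be sorted.
    """
    idx = bisect.bisect_right(commit_timestamps, query_ms)
    return idx - 1 if idx > 0 else None


# The wall clock steps backward after the second commit, then recovers.
wall_clock = [1000, 2000, 1500, 1400, 5000]
timestamps = []
for now in wall_clock:
    previous = timestamps[-1] if timestamps else None
    timestamps.append(assign_commit_timestamp(now, previous))

on_boundary = version_as_of(timestamps, 2000)
just_after = version_as_of(timestamps, 2001)
before_table = version_as_of(timestamps, 999)
```

Under these assumptions the two commits made during the backward jump are stamped 2001 and 2002 instead of 1500 and 1400, so a query at exactly 2000 resolves to version 1 and a query at 2001 resolves to version 2. The boundary scenarios in this category probe precisely these off-by-one cases.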

Checkpoint and Log Corruption

planned

Truncated checkpoint, missing log entry, partially-written multipart checkpoint. Reader must reconstruct correctly or fail loudly, never silently produce wrong rows.

  • Truncated _delta_log JSON
  • Missing multipart checkpoint part
  • Missing V2 sidecar
  • Iceberg manifest list with broken pointer
Status: roadmap
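
The "fail loudly" requirement can be sketched as a log replayer that refuses to guess. The example assumes a simplified log where each version is newline-delimited JSON holding `add` and `remove` actions with a `path` field; `replay_log` and `CorruptLogError` are hypothetical names, and real readers handle checkpoints, sidecars, and many more action types.

```python
import json


class CorruptLogError(Exception):
    """Raised instead of returning a possibly wrong file list."""


def replay_log(entries):
    """Replay log entries and return the set of live data files.

    entries maps version -> raw text of newline-delimited JSON actions.
    """
    if not entries:
        return set()
    versions = sorted(entries)
    expected = list(range(versions[0], versions[-1] + 1))
    if versions != expected:
        missing = sorted(set(expected) - set(versions))
        raise CorruptLogError(f"missing log entries: {missing}")

    live = set()
    for version in versions:
        for line in entries[version].splitlines():
            if not line.strip():
                continue
            try:
                action = json.loads(line)
            except json.JSONDecodeError as exc:
                raise CorruptLogError(
                    f"version {version}: truncated or invalid JSON"
                ) from exc
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


good = {
    0: '{"add": {"path": "a.parquet"}}',
    1: '{"remove": {"path": "a.parquet"}}\n{"add": {"path": "b.parquet"}}',
}
truncated = {0: good[0], 1: '{"add": {"path": "c.parq'}
gap = {0: good[0], 2: '{"add": {"path": "d.parquet"}}'}


def fails_loudly(entries):
    try:
        replay_log(entries)
        return False
    except CorruptLogError:
        return True


live_files = replay_log(good)
```

The healthy log replays to a single live file, while the truncated entry and the version gap both raise. Note the limit of this sketch: a truncation that happens to fall exactly on a line boundary still parses cleanly, so detecting it needs an extra signal such as a recorded size or checksum.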

Verification methodology

Every planned chaos test uses the same independent-engine principle as the round-trip checks: both DeltaForge and Spark read the result and must agree.

Step 1

Setup

  • Spin up a clean Delta or Iceberg table
  • Pre-populate with a known baseline state
  • Record baseline content hash and schema
Step 2

Inject

  • Start the operation under test
  • Inject the failure at a deterministic point
  • Capture every file the writer touched
Step 3

Recover

  • Restart the writer cleanly
  • Allow normal recovery and retry logic to run
  • Capture the final on-disk state
Step 4

Verify

  • Both DeltaForge and Spark read the table
  • Either both see commit N or both see commit N-1
  • Content hash must match one of the two clean states
  • No orphaned data, no duplicate commits, no torn rows
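
The verify step can be sketched as a comparison of order-independent content hashes. This is an illustration under assumptions: rows are modeled as JSON-serializable dicts, the hash is SHA-256 over sorted canonical rows, and `content_hash` and `verify` are hypothetical helpers rather than the plan's actual tooling.

```python
import hashlib
import json


def content_hash(rows):
    """Order-independent SHA-256 over canonically serialized rows."""
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def verify(engine_a_rows, engine_b_rows, state_before, state_after):
    """Pass only if both engines agree and match one of the two clean states."""
    hash_a = content_hash(engine_a_rows)
    hash_b = content_hash(engine_b_rows)
    clean = {
        content_hash(state_before): "N-1",
        content_hash(state_after): "N",
    }
    if hash_a != hash_b:
        return "FAIL: engines disagree"
    if hash_a not in clean:
        return "FAIL: torn or unknown state"
    return f"PASS: both engines see commit {clean[hash_a]}"


before = [{"id": 1, "v": "a"}]
after = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
torn = [{"id": 1, "v": "a"}, {"id": 2, "v": None}]

committed = verify(list(reversed(after)), after, before, after)
rolled_back = verify(before, before, before, after)
disagree = verify(before, after, before, after)
torn_state = verify(torn, torn, before, after)
```

Sorting the serialized rows makes the hash insensitive to scan order, which matters because two engines rarely return rows in the same order. The torn case shows why agreement alone is not enough: both engines can agree on a state that matches neither clean snapshot.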

Correctness under failure is not optional

This page will show each category, scenario, and release result once runs begin.