What failure categories are planned?

Process-kill during commit; concurrent writers; network and storage failure (5xx storms, TCP reset, disk-full); read during maintenance (VACUUM, OPTIMIZE, RESTORE); clock skew and time travel; and checkpoint and log corruption.

How is correctness verified after chaos?

Both DeltaForge and Apache Spark read the table after recovery. Either both see commit N or both see commit N-1, and the content hash must match one of those two clean states. Orphaned data, duplicate commits, and torn rows all count as failures.

Why use two independent engines for chaos verification?

A single engine can mask its own recovery bugs by being consistent with itself. Reading the post-chaos table with both DeltaForge and Apache Spark eliminates that blind spot.

How are concurrent writes to Delta Lake tested?

Multiple writers, including mixed-engine pairs, commit against the same table at once. The plan requires that conflicting commits are detected, the loser retries or fails cleanly, and the surviving history is a valid serial order; silent lost updates are a hard failure.

Concurrent Delta and Iceberg Chaos Test Plan

What invariants does the chaos plan test?

Three: atomicity (a commit is fully visible or fully invisible), isolation (concurrent writers either both succeed in serial order or the loser retries cleanly), and durability (a successful commit survives any later failure of any other component).

Three invariants under test

The Delta Lake and Iceberg specs describe atomic commits, optimistic concurrency, and checkpointing. This plan tests whether the implementation holds when things go wrong.

Atomicity

A commit is fully visible or fully invisible. No half-state may survive a process kill, a network drop, or a full disk.

Isolation

Concurrent writers either both succeed in serial order, or the loser retries cleanly against the updated state.

Durability

A successful commit survives any failure of any other component after the commit completes.

Planned test categories

Each category is a family of failure scenarios. All are planned; results will appear once runs begin.

Process-Kill During Commit

planned

SIGKILL at each phase: after the data files are written but before the log entry; after the log temp file is written but before the atomic rename; after the rename but before the checkpoint.

Kill before data flush
Kill between data and log
Kill mid-checkpoint
Kill mid-vacuum

Status: roadmap
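As an illustration of the "kill between data and log" scenario, here is a minimal sketch that hard-kills a toy writer after it has written a data file but before it publishes a log entry. The child script, the file names, and the `kill_between_data_and_log` helper are all hypothetical stand-ins; a real harness would drive DeltaForge or Spark instead of this toy writer.

```python
import os
import subprocess
import sys
import tempfile

# Toy writer (hypothetical): phase 1 writes data, phase 2 would publish the log entry.
CHILD = r"""
import os, sys
table = sys.argv[1]
with open(os.path.join(table, "part-0000.parquet"), "wb") as f:
    f.write(b"rows")
print("data_written", flush=True)
sys.stdin.read()  # Block so the parent can kill us before phase 2.
os.makedirs(os.path.join(table, "_delta_log"), exist_ok=True)
open(os.path.join(table, "_delta_log", "00000000000000000000.json"), "w").close()
"""


def kill_between_data_and_log(table_dir):
    """Kill the toy writer after the data file exists but before any log entry."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, table_dir],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    # Wait for the child to announce that phase 1 is done.
    assert proc.stdout.readline().strip() == "data_written"
    proc.kill()  # SIGKILL on POSIX; the child gets no chance to clean up.
    proc.wait()
    proc.stdin.close()
    proc.stdout.close()

    log_dir = os.path.join(table_dir, "_delta_log")
    committed = sorted(os.listdir(log_dir)) if os.path.isdir(log_dir) else []
    unreferenced = sorted(f for f in os.listdir(table_dir) if f.startswith("part-"))
    return proc.returncode, committed, unreferenced
```

In this sketch the expected outcome is an empty log and one data file that no commit references; the later verification step would then decide whether that leftover file is acceptable.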

Concurrent Writers

planned

Two or more writers race for the same commit version. Exactly one must win; the loser retries cleanly without corrupting the table.

Two DeltaForge writers
DeltaForge plus Spark, same table
Conflict against schema change
Long-running merge vs short append

Status: roadmap
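The "exactly one must win" rule can be sketched with a put-if-absent commit on a local directory. This is an assumption-laden toy: `try_commit`, `commit_with_retry`, and `race` are hypothetical names, the log is a plain temp directory, and the retry does not re-check for logical conflicts the way a real engine would.

```python
import os
import tempfile
import threading


def try_commit(log_dir, version, payload):
    """Put-if-absent: create the version file only if no one else has."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # Another writer already owns this version.
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return True


def latest_version(log_dir):
    versions = [int(name.split(".")[0]) for name in os.listdir(log_dir)]
    return max(versions, default=-1)


def commit_with_retry(log_dir, payload, max_attempts=100):
    """Loser path: re-read the latest version and try the next slot."""
    for _ in range(max_attempts):
        version = latest_version(log_dir) + 1
        if try_commit(log_dir, version, payload):
            return version
    raise RuntimeError("gave up after repeated conflicts")


def race(n_writers):
    """Start n writers at the same instant and report who got which version."""
    with tempfile.TemporaryDirectory() as log_dir:
        won = []
        barrier = threading.Barrier(n_writers)

        def writer(i):
            barrier.wait()  # Line everyone up so they contend for the same slot.
            won.append(commit_with_retry(log_dir, f"writer-{i}"))

        threads = [threading.Thread(target=writer, args=(i,)) for i in range(n_writers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return sorted(won), sorted(os.listdir(log_dir))
```

With four writers, the expected history is versions 0 through 3 with no gaps and no duplicates. Note that this toy creates the file before writing its payload, so a reader could briefly observe an empty entry; that window is exactly the kind of torn state a chaos test would probe.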

Network and Storage Failure

planned

Object storage returns 5xx, drops the connection mid-PUT, or hangs indefinitely. The disk fills mid-write. In every case, no half-committed state may be left behind.

S3 / GCS / Azure 5xx storms
TCP reset mid-upload
Disk-full ENOSPC during commit
Slow / hanging metadata reads

Status: roadmap
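One way to simulate a 5xx storm without a real object store is a wrapper that fails a configurable number of PUTs. The sketch below is illustrative only: `FlakyStore`, `TransientStorageError`, and `put_with_retry` are hypothetical, and backoff delays are omitted to keep it short.

```python
class TransientStorageError(Exception):
    """Stands in for a 5xx response or a connection reset (hypothetical)."""


class FlakyStore:
    """In-memory object store that fails the first N PUT calls."""

    def __init__(self, failures_before_success):
        self.objects = {}
        self.remaining_failures = failures_before_success
        self.put_attempts = 0

    def put(self, key, body):
        self.put_attempts += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise TransientStorageError(f"simulated 503 on PUT {key}")
        # A whole-object PUT either lands completely or not at all.
        self.objects[key] = body


def put_with_retry(store, key, body, max_attempts=3):
    """Return True once the PUT lands, False if every attempt failed."""
    for _ in range(max_attempts):
        try:
            store.put(key, body)
            return True
        except TransientStorageError:
            continue  # A real writer would back off here.
    return False
```

The property worth asserting is that a failed write leaves `store.objects` untouched, so a reader never sees a partial commit.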

Read During Maintenance

planned

A long-running read is in flight while VACUUM, OPTIMIZE, or RESTORE runs concurrently. Snapshot isolation must hold.

Read while VACUUM removes old files
Read while OPTIMIZE compacts
Read across a RESTORE boundary
Snapshot expiration during long scan

Status: roadmap
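A snapshot-isolation check for these scenarios can be reduced to set arithmetic: the files a pinned reader needs must not intersect the files maintenance deleted. The sketch below assumes a simplified log where each commit is a dict with optional `add` and `remove` lists; both function names are hypothetical.

```python
def files_at(log, version):
    """Replay add/remove actions up to and including `version`."""
    live = set()
    for commit in log[: version + 1]:
        live -= set(commit.get("remove", []))
        live |= set(commit.get("add", []))
    return live


def snapshot_violations(log, pinned_version, deleted_files):
    """Files the pinned reader still needs that maintenance already deleted."""
    return files_at(log, pinned_version) & set(deleted_files)
```

For example, if version 0 adds files `a` and `b` and an OPTIMIZE-style version 1 rewrites them into `c`, a reader pinned at version 0 still needs `a` and `b`; a VACUUM that deletes `a` during that scan shows up as a violation.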

Clock Skew and Time Travel

planned

The wall clock goes backward, jumps forward, or differs across writers. Commits tagged with in-commit timestamps (ICT) and TIMESTAMP AS OF queries must still resolve to a single, well-defined version.

Backward clock jump between commits
Skewed clocks across writers
TIMESTAMP AS OF on the boundary
NTP step during a long transaction

Status: roadmap
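To make "resolve to a well-defined version" concrete, here is a small sketch of one possible policy: force commit timestamps to be strictly increasing, then resolve a time-travel query to the latest commit at or before the requested time. This is an assumed policy for illustration, not a statement of how DeltaForge or Spark behaves, and both function names are hypothetical.

```python
from bisect import bisect_right


def monotonic_timestamps(wall_clock_ms):
    """Bump any timestamp that does not exceed its predecessor by 1 ms."""
    adjusted = []
    for ts in wall_clock_ms:
        if adjusted and ts <= adjusted[-1]:
            ts = adjusted[-1] + 1
        adjusted.append(ts)
    return adjusted


def version_as_of(adjusted, query_ms):
    """Latest version whose timestamp is <= query_ms, or None if none exists."""
    idx = bisect_right(adjusted, query_ms) - 1
    return idx if idx >= 0 else None
```

With wall-clock readings of 100, 90, and 200 ms (a backward jump before the second commit), the adjusted sequence is 100, 101, 200, so a query at exactly 100 resolves to version 0 and a query at 150 resolves to version 1.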

Checkpoint and Log Corruption

planned

Truncated checkpoint, missing log entry, partially written multipart checkpoint. The reader must reconstruct the table correctly or fail loudly; it must never silently produce wrong rows.

Truncated _delta_log JSON
Missing multipart checkpoint part
Missing V2 sidecar
Iceberg manifest list with broken pointer

Status: roadmap
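The "fail loudly" requirement can be sketched for the simplest case, a truncated newline-delimited JSON commit file and a gap in the version sequence. `CorruptLogError`, `read_commit`, and `missing_versions` are hypothetical names used only for this illustration.

```python
import json


class CorruptLogError(Exception):
    """Raised instead of returning a partial list of actions."""


def read_commit(text):
    """Parse one commit file; any unparseable line aborts the whole read."""
    actions = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            actions.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise CorruptLogError(f"line {line_no}: {exc}") from exc
    return actions


def missing_versions(versions):
    """Versions absent between the lowest and highest version present."""
    present = set(versions)
    if not present:
        return []
    return [v for v in range(min(present), max(present) + 1) if v not in present]
```

The design choice here is that a truncated line poisons the entire commit rather than being skipped, since skipping it would silently drop an add or remove action.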

Verification methodology

Every planned chaos test uses the same independent-engine principle as the round-trip checks: both DeltaForge and Spark read the result and must agree.

Step 1

Setup

Spin up a clean Delta or Iceberg table
Pre-populate with a known baseline state
Record baseline content hash and schema
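A baseline content hash needs to be independent of row order and file layout, since two engines may return the same rows in different orders. A minimal sketch, assuming rows are JSON-serializable dicts and using a hypothetical `content_hash` helper:

```python
import hashlib
import json


def content_hash(rows):
    """SHA-256 over canonically serialized rows, sorted so order does not matter."""
    canonical = sorted(
        json.dumps(row, sort_keys=True, separators=(",", ":")) for row in rows
    )
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")  # Delimiter so adjacent rows cannot blur together.
    return digest.hexdigest()
```

Because rows are sorted rather than deduplicated, a duplicated row changes the hash, which is what a duplicate-commit check would rely on.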

Step 2

Inject

Start the operation under test
Inject the failure at a deterministic point
Capture every file the writer touched
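For failures that do not require killing the process, a deterministic injection point can be sketched as a named hook that the operation calls between phases. Everything here is hypothetical: the `failpoint` callback, the phase names, and the toy `sample_append` writer are illustrations, not part of any engine's API.

```python
import os
import tempfile


class InjectedFailure(Exception):
    """Raised by the harness at the chosen failpoint."""


def run_with_failpoint(operation, table_dir, fail_at):
    """Run `operation`, abort it at the named failpoint, and list new files."""
    before = set(os.listdir(table_dir))

    def failpoint(name):
        if name == fail_at:
            raise InjectedFailure(name)

    try:
        operation(table_dir, failpoint)
        failed = False
    except InjectedFailure:
        failed = True
    touched = sorted(set(os.listdir(table_dir)) - before)
    return failed, touched


def sample_append(table_dir, failpoint):
    """Toy writer: data file first, then a commit marker."""
    open(os.path.join(table_dir, "part-0000.parquet"), "wb").close()
    failpoint("after_data")
    open(os.path.join(table_dir, "00000000000000000001.json"), "w").close()
    failpoint("after_log")
```

Calling `run_with_failpoint(sample_append, table_dir, "after_data")` stops the toy writer after the data file and reports that file as the only one touched, which makes the run repeatable in a way that timing-based injection is not.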

Step 3

Recover

Restart the writer cleanly
Allow normal recovery and retry logic to run
Capture the final on-disk state

Step 4

Verify

Both DeltaForge and Spark read the table
Either both see commit N or both see commit N-1
Content hash must match one of the two clean states
No orphaned data, no duplicate commits, no torn rows
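The agreement rule in this step can be sketched as a small comparison function. It assumes each engine's read has already been reduced to a `(version, content_hash)` pair and that the two clean states are known in advance; the `verify` name and the tuple shape are illustrative choices.

```python
def verify(engine_a, engine_b, clean_states):
    """Return a list of problems; an empty list means the check passed.

    engine_a, engine_b: (version, content_hash) as observed by each engine.
    clean_states: dict mapping the two acceptable versions (N-1 and N)
                  to their expected content hashes.
    """
    problems = []
    if engine_a != engine_b:
        problems.append("engines disagree")
    for name, (version, digest) in (("a", engine_a), ("b", engine_b)):
        if version not in clean_states or clean_states[version] != digest:
            problems.append(f"engine {name} sees a state that matches no clean state")
    return problems
```

Two engines that agree on a torn state still fail this check, because agreement alone is not enough: the observed hash also has to match commit N or commit N-1.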

Correctness under failure is not optional