Compute Engine

Distributed Vectorized Execution

Zero-copy Arrow-native compute engine with SIMD acceleration, parallel task scheduling, and memory-efficient streaming. Designed from the ground up for analytical workloads on modern hardware.

SIMD-accelerated batch processing
Zero-copy Arrow columnar format
Parallel multi-core scheduling
[Diagram: columnar batches of (id, price, qty) are vectorized across SIMD execution lanes (e.g. lane 0 computes 29.99 × 3 = 89.97), partitions are fanned out to parallel workers, and per-worker results are merged into a columnar result (id, total).]

Vectorized Execution Model

Process thousands of values per operation using columnar batches

Query Plan
Logical Plan Physical Plan Execution DAG
Task Scheduler
Partition Assignment Pipeline Stages Work Stealing Priority Queue
Execution Runtime
Thread Pool Memory Arena Batch Pipeline Spill Manager
Vectorized Operators
SIMD Kernels Arrow Arrays Null Handling Dictionary Encoding

SIMD-Accelerated Operations

Hand-tuned vectorized kernels for maximum throughput

Comparison Operations

  • AVX-512 8-way parallel comparison
  • Null-aware comparison semantics
  • String comparison with SSE4.2
  • Vectorized LIKE pattern matching
  • IN list checking with Bloom filters
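Hardware details aside, the contract of a vectorized, null-aware comparison can be sketched in a few lines. This is an illustrative model, not the engine's kernel: it evaluates a whole column batch in one call and returns a selection vector of passing row indices, with `None` standing in for SQL NULL (which never passes a comparison).

```python
def vectorized_lt(values, threshold):
    """Compare an entire column batch against a scalar, returning a
    selection vector of passing row indices. None models a SQL NULL,
    which never satisfies a comparison (null-aware semantics)."""
    return [i for i, v in enumerate(values)
            if v is not None and v < threshold]

batch = [29.99, 14.50, None, 7.25, 45.00]
sel = vectorized_lt(batch, 20.0)  # rows 1 and 3 pass
```

A SIMD kernel performs the same per-row comparisons eight or sixteen lanes at a time and emits the selection vector as a bitmask, but the input/output contract is the same.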

Arithmetic Operations

  • Fused multiply-add (FMA)
  • Overflow-checked arithmetic
  • Decimal multiplication with carry
  • Vectorized division
  • Modular arithmetic

Aggregation

  • Horizontal SIMD sum/min/max
  • Parallel hash aggregation
  • Two-pass variance calculation
  • Approximate distinct count (HLL)
  • Vectorized group-by
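The two-pass variance listed above is a standard numerically stable formulation: one pass computes the mean, a second pass sums squared deviations from it. A minimal scalar sketch (the vectorized kernel applies the same algebra lane-wise):

```python
def two_pass_variance(xs):
    """Population variance in two passes: pass 1 computes the mean,
    pass 2 accumulates squared deviations from that mean. More stable
    than the naive one-pass sum-of-squares formula."""
    n = len(xs)
    mean = sum(xs) / n                       # pass 1
    ss = sum((x - mean) ** 2 for x in xs)    # pass 2
    return ss / n

two_pass_variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # 4.0
```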

String Operations

  • SIMD string search (memmem)
  • Vectorized UTF-8 validation
  • Parallel string concatenation
  • SSE4.2 substring search
  • Batch regex matching

8× faster with AVX-512 vs. scalar
4.2 GB/s filter throughput per core
12M rows/sec aggregation

Memory Management

Precise memory accounting with graceful spilling

Memory Pools

  • Per-query memory limits
  • Hierarchical memory accounting
  • Reservation and tracking
  • Memory pressure callbacks
  • Pool isolation between queries
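The reservation-and-tracking model above reduces to a simple contract: a query reserves bytes against a hard limit before allocating, and a failed reservation is the signal to spill rather than crash. A minimal sketch (class and error-handling choices here are illustrative, not the engine's API):

```python
class MemoryPool:
    """Per-query memory pool: reservations are tracked against a hard
    limit, and exceeding the limit raises instead of over-allocating."""

    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.reserved = 0

    def reserve(self, nbytes):
        if self.reserved + nbytes > self.limit:
            raise MemoryError("pool limit exceeded")  # memory pressure signal
        self.reserved += nbytes

    def release(self, nbytes):
        self.reserved -= nbytes

pool = MemoryPool(1024)       # 1 KiB query budget
pool.reserve(800)             # ok
try:
    pool.reserve(512)         # would exceed the limit
except MemoryError:
    pool.release(800)         # operator spills, then frees its reservation
```

Hierarchical accounting nests these pools (operator pools inside a query pool inside a global pool), so a reservation propagates up until some level either admits or rejects it.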

Buffer Management

  • Zero-copy buffer sharing
  • Reference-counted buffers
  • Aligned allocation (64-byte)
  • Large page support (hugepages)
  • Buffer pool recycling

Spill-to-Disk

  • Graceful memory overflow handling
  • Sort spill with external merge
  • Hash table partitioned spill
  • Async spill writes
  • Compressed spill format
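Sort spill with external merge works by sorting memory-budget-sized runs, writing each run to disk, and then streaming a k-way merge of the runs. A toy sketch of that shape (using pickle and temp files purely for illustration; the real spill format is compressed and columnar):

```python
import heapq
import pickle
import tempfile

def external_sort(rows, budget=4):
    """Sort more rows than fit in the memory budget: sort fixed-size
    runs, spill each run to a temp file, then k-way merge the runs."""
    runs = []
    for i in range(0, len(rows), budget):
        spill = tempfile.TemporaryFile()            # one spill file per run
        pickle.dump(sorted(rows[i:i + budget]), spill)
        spill.seek(0)
        runs.append(iter(pickle.load(spill)))       # stream the run back
    return list(heapq.merge(*runs))                 # k-way heap merge

external_sort([5, 3, 8, 1, 9, 2, 7, 4], budget=3)  # [1, 2, 3, 4, 5, 7, 8, 9]
```

Only one batch per run needs to be resident during the merge, which is what keeps peak memory bounded by the budget rather than by the input size.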

Cache Management

  • LRU page cache
  • Metadata cache
  • Statistics cache
  • Compiled expression cache
  • Schema cache

Parallel Execution Framework

Scale linearly across all available CPU cores

Task Scheduling

  • Adaptive task scheduler
  • Hardware-aware task placement
  • Priority-based scheduling
  • Cooperative multitasking
  • Yield points for cancellation
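Cooperative multitasking with priority scheduling and yield points can be modeled with generators: each task runs until its next yield point (where cancellation can be checked), then returns to a priority queue. A simplified sketch, with task names and step counts invented for the example:

```python
import heapq
from itertools import count

def run(tasks):
    """Cooperative priority scheduler: tasks are generators that yield
    at cancellation-safe points; a lower priority value runs first."""
    tie = count()  # FIFO tie-break among equal priorities
    ready = [(prio, next(tie), task) for prio, task in tasks]
    heapq.heapify(ready)
    trace = []
    while ready:
        prio, _, task = heapq.heappop(ready)
        try:
            trace.append(next(task))                  # run to next yield point
            heapq.heappush(ready, (prio, next(tie), task))  # requeue
        except StopIteration:
            pass                                      # task finished
    return trace

def work(name, steps):
    for i in range(steps):
        yield f"{name}:{i}"

run([(0, work("hi", 2)), (1, work("lo", 2))])
# high-priority task completes before the low-priority one starts
```

Because tasks only lose the core at yield points, cancellation and priority changes take effect at well-defined boundaries instead of interrupting a batch mid-flight.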

Partition Parallelism

  • Automatic partition detection
  • Dynamic partition splitting
  • Load balancing across workers
  • Partition coalescing
  • Skew handling

Pipeline Parallelism

  • Morsel-driven parallelism
  • Exchange operators
  • Parallel hash build
  • Parallel sort
  • Concurrent aggregation
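Morsel-driven parallelism means workers repeatedly pull small row ranges ("morsels") from a shared queue instead of being assigned fixed halves of the input, so a fast core naturally takes more morsels and load stays balanced. A minimal thread-pool sketch of that pull loop (function and parameter names are illustrative):

```python
import queue
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, workers=4, morsel=1024):
    """Sum a column morsel-by-morsel: the shared queue hands out small
    row ranges, and each worker loops until the queue is drained."""
    tasks = queue.Queue()
    for lo in range(0, len(data), morsel):
        tasks.put((lo, min(lo + morsel, len(data))))

    def worker():
        total = 0
        while True:
            try:
                lo, hi = tasks.get_nowait()   # pull the next morsel
            except queue.Empty:
                return total                  # queue drained: worker done
            total += sum(data[lo:hi])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(worker) for _ in range(workers)]
        return sum(f.result() for f in futures)

parallel_sum(list(range(10_000)))  # 49995000
```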

I/O Parallelism

  • Async I/O with kernel-level acceleration
  • Prefetch scheduling
  • Parallel file reads
  • Batched network I/O
  • Streaming result delivery

Seven Histogram Algorithms

Workload-matched cardinality estimation for query optimization

📊

Equi-Width

Fixed-width buckets for uniform distributions. Fast construction, simple storage.

Best for: Uniformly distributed numeric data
📈

Equi-Depth

Equal-frequency buckets adapting to data skew. Standard choice for most workloads.

Best for: General-purpose, skewed distributions
🎯

Singleton

Individual buckets for high-frequency values. Perfect for low-cardinality columns.

Best for: Categorical data, enum columns
🔀

Hybrid

Singletons for frequent values, equi-depth for the rest. Handles mixed distributions.

Best for: Real-world data with hot values
📉

Compressed

Run-length encoded for repeated sequences. Memory-efficient storage.

Best for: Data with long runs of equal values

Streaming

Online construction without full data pass. Count-Min Sketch based.

Best for: Large datasets, streaming updates
🧮

Wavelet

Multi-resolution representation using wavelet transform. Excellent range query estimation.

Best for: Range queries, time-series data
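To make the equi-depth idea concrete: bucket boundaries are chosen so each bucket holds roughly the same number of rows, which is why the histogram adapts to skew. A minimal construction sketch (the boundary-picking rule here is one simple choice among several):

```python
def equi_depth_bounds(values, buckets):
    """Pick equi-depth bucket boundaries: sort the column, then take
    the value at every n/buckets-th position, so each bucket covers
    roughly the same number of rows regardless of skew."""
    xs = sorted(values)
    n = len(xs)
    return [xs[b * n // buckets] for b in range(1, buckets)]

# Uniform data gives evenly spaced bounds; skewed data would not.
equi_depth_bounds(list(range(100)), 4)  # [25, 50, 75]
```

A value `v` falls into the bucket numbered by how many bounds are ≤ `v`, and the optimizer estimates a predicate's selectivity from the fraction of buckets (and partial buckets) it covers.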

Physical Operators

Comprehensive operator library for any query pattern

Scan Operators

  • Columnar file scan with projection and filter pushdown
  • Delta Lake scan with time travel and deletion vectors
  • In-memory and inline data scans

Join Operators

  • Hash join with parallel build/probe and spill support
  • Sort-merge join for pre-sorted inputs
  • Nested loop join for non-equi joins
  • Optimized semi/anti joins for EXISTS and IN queries
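The hash join's two phases are worth spelling out: build a hash table on one side (ideally the smaller), then stream the other side against it. A single-threaded sketch with dict rows standing in for columnar batches; the engine's version builds and probes in parallel and can spill partitions:

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Inner hash join: hash the build side once, then probe each
    streamed row against the table and emit merged matches."""
    table = defaultdict(list)
    for row in build_rows:                      # build phase
        table[row[build_key]].append(row)
    out = []
    for row in probe_rows:                      # probe phase
        for match in table.get(row[probe_key], ()):
            out.append({**match, **row})
    return out

customers = [{"cust": "a", "name": "Ann"}, {"cust": "b", "name": "Bob"}]
orders = [{"cust": "a", "amt": 5}, {"cust": "c", "amt": 7}]
hash_join(customers, orders, "cust", "cust")
# one match: Ann's customer row joined with her order
```

Semi/anti joins follow the same build/probe shape but emit (or suppress) the probe row on first match instead of producing merged output, which is why EXISTS and IN queries get a cheaper specialized operator.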

Aggregation Operators

  • Parallel hash-based grouping
  • Pre-sorted aggregation
  • Window function evaluation
  • Single-group aggregation

Sort & Limit

  • External merge sort with spill-to-disk
  • Heap-based top-K selection
  • Row count limiting and offset
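Heap-based top-K avoids sorting the whole input for an `ORDER BY ... LIMIT K`: a size-K min-heap keeps only the current best K rows, so memory stays O(K) however many rows stream through. A compact sketch:

```python
import heapq

def top_k(values, k):
    """Largest k values via a size-k min-heap: a new value only enters
    if it beats the smallest of the current top k."""
    heap = []
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)   # evict current minimum
    return sorted(heap, reverse=True)

top_k([5, 1, 9, 3, 7, 2, 8], 3)  # [9, 8, 7]
```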

Set Operators

  • Union and union all
  • Intersect and except

Exchange Operators

  • Hash and range repartitioning
  • Partition coalescing and broadcast

Expression Evaluation Engine

JIT-style compiled expressions for maximum performance

Expression Compilation

  • Type-specialized evaluation
  • Null handling optimization
  • Constant folding
  • Common subexpression elimination
  • Short-circuit evaluation
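Constant folding means subexpressions with no column references are evaluated once at plan time instead of once per row, so `price * (2 + 3)` is compiled as `price * 5`. A small illustration using Python's `ast` module as a stand-in expression tree (the engine's IR differs, but the bottom-up rewrite is the same idea):

```python
import ast
import operator

# Fold only operators we explicitly understand.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

class FoldConstants(ast.NodeTransformer):
    """Rewrite constant binary subtrees into a single constant node."""
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold children first (bottom-up)
        if (isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)
                and type(node.op) in OPS):
            value = OPS[type(node.op)](node.left.value, node.right.value)
            return ast.copy_location(ast.Constant(value), node)
        return node

def fold(expr):
    return ast.unparse(FoldConstants().visit(ast.parse(expr, mode="eval")))

fold("price * (2 + 3)")  # 'price * 5'
```

Common subexpression elimination is the complementary rewrite: where folding collapses constant subtrees, CSE computes a repeated non-constant subtree once and reuses the result.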

Batch Evaluation

  • Evaluate entire column at once
  • Amortize function call overhead
  • SIMD-friendly memory layout
  • Null bitmap propagation
  • Selection vector support

Type Coercion

  • Implicit type promotion
  • Explicit CAST operations
  • Safe cast with null on error
  • TRY_CAST semantics
  • Format string parsing
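TRY_CAST semantics, and "safe cast with null on error", mean a row that fails to convert becomes NULL instead of failing the whole query. A batch-shaped sketch with `None` modeling NULL:

```python
def try_cast(values, target):
    """TRY_CAST over a column batch: unconvertible rows become None
    (NULL) rather than raising; NULL inputs stay NULL."""
    out = []
    for v in values:
        if v is None:
            out.append(None)            # NULL in, NULL out
            continue
        try:
            out.append(target(v))
        except (TypeError, ValueError):
            out.append(None)            # conversion failure -> NULL
    return out

try_cast(["1", "2.5x", None, "3"], float)  # [1.0, None, None, 3.0]
```

A plain CAST uses the same conversion path but surfaces the error instead of substituting NULL.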

UDF Integration

  • Scalar UDF evaluation
  • Aggregate UDF support
  • Window UDF support
  • Table-valued functions
  • Async UDF execution

Transparent Compute Metering

Pay only for actual query execution. No clusters, no idle costs.

DFCU - Delta Forge Compute Unit

  • Transparent formula: DFCU = (wall_clock_s × cores) / 3600
  • 1 DFCU = 1 core-hour of actual compute
  • Billed per query, no cluster spin-up or idle time
  • No minimum runtime windows or shutdown delays
  • Auditable and predictable cost model
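The DFCU formula above is simple enough to check by hand, which is the point of calling it auditable. For example, a 90-second query running on 8 cores consumes 720 core-seconds, or 0.2 DFCU:

```python
def dfcu(wall_clock_s, cores):
    """DFCU = (wall_clock_s × cores) / 3600, i.e. core-hours of
    actual compute, per the published formula."""
    return wall_clock_s * cores / 3600

dfcu(90, 8)    # 720 core-seconds / 3600 = 0.2 DFCU
dfcu(3600, 1)  # one core for one hour = exactly 1 DFCU
```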

Real-Time Usage Dashboard

  • Total DFCU consumed and estimated cost
  • Active compute nodes and core count
  • Queries executed and data processed (read + written)
  • Daily trend charts for capacity planning
  • Filter by user, pipeline, schedule, or node

0s cluster spin-up time
0s idle time billed
100% cost visibility

See Pricing & DFCU Details

Harness the full power of modern hardware

SIMD-accelerated, zero-copy, production-ready compute engine.