Connectors

Connect to Any Data Source

Read from common file formats, relational databases (PostgreSQL, MySQL, SQL Server, Oracle), and cloud services. Write to Delta Lake tables with full schema evolution support.

Predicate pushdown to source
Automatic schema inference
Parallel ingestion at scale
Visual Flattener for nested formats
[Architecture diagram] The Delta Forge Unified Query Engine reads from Amazon S3 (s3://bucket/, Parquet, CSV), Azure Blob / ADLS Gen2 (Delta Tables), PostgreSQL, SQL Server, Apache Kafka (streaming), Google Cloud (GCS + BigQuery), and file formats including Parquet, CSV, JSON, and 15+ more, producing unified, versioned, ACID-compliant Delta Lake output.

Visual Flattener

Turn any nested format into a SQL table — visually configure, automatically flatten

Delta Forge includes a visual schema discovery and configuration tool that transforms complex, nested data formats into flat, queryable SQL tables. One unified experience across six formats: JSON, XML, EDI, HL7, FHIR, and Protobuf.

How It Works

1. Discover

Scan files and automatically discover all nested paths, types, and sample values

2. Configure

Use an interactive tree view to select which fields to include, exclude, explode, or keep as JSON

3. Query

Flattened data appears as a standard SQL table. Missing paths become NULL. Configuration persists across queries.

Five Selection Modes Per Field

INCLUDE

Whitelist specific paths into the output

EXCLUDE

Remove entirely from output

EXPLODE

Create one row per array element (like SQL UNNEST)

JSON

Keep subtree as a JSON string column instead of flattening

Default

Automatic flattening behavior — all paths included with standard naming
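
EXPLODE is the one mode that changes row counts, so a concrete sketch helps (the table and column names here are illustrative, not from the product):

```sql
-- Input row: { "order_id": 1, "items": ["widget", "gadget"] }
-- With the 'items' array set to EXPLODE, the flattened table
-- yields one output row per array element, like SQL UNNEST:

SELECT order_id, items
FROM orders;

-- order_id | items
-- 1        | widget
-- 1        | gadget
```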

Six Formats, One Visual Experience

JSON

JSONPath, SIMD-accelerated parsing

XML

XPath-like expressions, attribute handling, namespace support

EDI

Segment-based flattening, composite elements

HL7

Component flattening, friendly aliases

FHIR

Resource type discovery, bundle unbundling

Protobuf

Enum decoding, repeated field handling

Schema Evolution Built In

Automatic Structure Merging

  • Files with different structures handled automatically
  • Path aliases map multiple source paths to one output column
  • Missing paths become NULL — consistent schema across all files
-- Input (nested JSON):
-- {
--   "user": { "name": "Alice", "contact": { "email": "alice@example.com" } },
--   "tags": ["vip", "active"],
--   "metadata": { "source": "api", "raw": {...} }
-- }

-- Output (flattened SQL table):
-- user_name | user_contact_email  | tags             | metadata
-- Alice     | alice@example.com   | ["vip","active"] | (kept as JSON)

SELECT user_name, user_contact_email, tags, metadata
FROM customers;  -- flattened table, ready to query
One visual experience

All 6 formats share the same tree view — no format-specific tooling needed

Persistent configuration

Configuration persists to the table — query results are always consistent

SIMD-accelerated

500MB/s+ throughput for JSON processing

No code required

Point, click, query — flatten nested data without writing any transformation code

Database Connectors

Connect to relational and NoSQL databases with predicate pushdown. All connection credentials are stored securely in OS Keychain or Azure Key Vault, never in config files.

PostgreSQL

Full-featured connectivity with SSL, connection pooling, and predicate pushdown

MySQL / MariaDB

MySQL 5.7+ and MariaDB support with binary protocol

SQL Server

Microsoft SQL Server with Windows and Azure AD authentication

Oracle Database

Oracle 12c+ with TNS and Easy Connect naming

MongoDB

Document database with aggregation pipeline pushdown

Redis

Key-value store with cluster and sentinel support

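
What predicate pushdown means in practice: the WHERE clause is evaluated inside the source database, so only matching rows cross the network. A minimal sketch; the `postgres_scan` table function name is an assumption for illustration, not confirmed syntax:

```sql
-- Hypothetical function name, shown only to illustrate pushdown.
SELECT customer_id, order_total
FROM postgres_scan('host=db.example.com dbname=sales', 'public', 'orders')
WHERE order_date >= DATE '2024-01-01';  -- filter runs on the PostgreSQL server
```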
Cloud Object Storage

Native integration with all major cloud providers

Amazon Web Services

  • Amazon S3 (all storage classes)
  • S3 Express One Zone
  • AWS Glue Catalog integration
  • IAM roles & STS credentials
  • Cross-account access

Microsoft Azure

  • Azure Blob Storage
  • Data Lake Storage Gen2
  • Azure Active Directory auth
  • Managed identity support
  • SAS token authentication

Google Cloud Platform

  • Google Cloud Storage
  • BigQuery external tables
  • Service account auth
  • Workload identity federation
  • Multi-regional buckets
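
Object-store data is typically queried directly by URI. A hedged sketch (the bucket name and glob path are illustrative):

```sql
-- Read Parquet files straight from S3; credentials come from the
-- configured IAM role or STS session, not from the query text.
SELECT event_type, count(*) AS events
FROM 's3://my-bucket/events/year=2024/*.parquet'
GROUP BY event_type;
```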

File Format Support

Native support for all major data formats with optimized readers

Columnar Formats

  • Parquet: Column pruning, predicate pushdown
  • ORC: Hive-compatible, ACID support
  • Arrow IPC: Zero-copy reads
  • Avro: Schema evolution

Text, Semi-Structured & Binary

  • CSV: Auto-dialect detection
  • JSON: NDJSON, subtree capture
  • XML: XPath, subtree capture
  • Excel: Multi-sheet, XLSX/XLS/ODS
  • Protobuf: Proto3 binary parsing

Protocol Buffers

Query Proto3 binary data with SQL — a capability most engines simply don't have

Schema-Driven Parsing

  • Read Proto3 binary files directly with a .proto descriptor
  • Specify the message type to decode from the schema
  • Glob patterns for multi-file ingestion
  • Streaming reads for large binary datasets

Nested Messages & Repeated Fields

  • Nested messages flattened into dot-notation columns
  • Repeated fields mapped to Arrow list arrays
  • Map fields decoded as key-value struct arrays
  • Oneof fields with automatic null-filling

Enum Decoding

  • Enum values decoded to human-readable string names
  • Unknown enum values preserved as integer fallbacks
  • Optional raw integer mode for performance

Well-Known Types

  • google.protobuf.Timestamp → Arrow TIMESTAMP
  • google.protobuf.Duration → Arrow INTERVAL
  • google.protobuf.StringValue & wrapper types
  • google.protobuf.Struct as JSON columns
-- Read IoT sensor data from Proto3 binary files
SELECT device_id, temperature, humidity, recorded_at
FROM read_protobuf(
    'sensors/*.pb',
    'sensor.proto',
    'SensorReading'
)
WHERE temperature > 35.0
ORDER BY recorded_at DESC;

Apache ORC

Production-grade ORC reading for Hive data warehouses — battle-tested across 6 industry demos

Hive-Compatible Reading

  • Read ORC files from Hive-managed and external tables
  • Full ACID transaction support (insert, update, delete)
  • Partition pruning with Hive-style directory layouts
  • Proven across banking, clinical trials, energy, insurance, server logs, and warehouse demos

Complex Types

  • STRUCT fields mapped to nested Arrow structs
  • MAP fields as key-value list arrays
  • ARRAY fields as Arrow list columns
  • Deeply nested combinations of all three

Stripe-Level Statistics

  • Min/max statistics per stripe for predicate pushdown
  • Bloom filters for high-cardinality column filtering
  • Row-group-level skipping for large files
  • Column-level statistics for query optimization

Compression Codecs

  • ZLIB — maximum compression ratio
  • Snappy — balanced speed and size
  • LZ4 — fastest decompression
  • ZSTD — best overall compression
  • Automatic codec detection per file
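
A sketch of querying partitioned ORC data; the `read_orc` function name follows the pattern of `read_json` and `read_protobuf` used elsewhere on this page and is an assumption:

```sql
-- Partition pruning skips year directories outside the filter;
-- stripe min/max statistics skip stripes that cannot match the predicate.
SELECT account_id, balance
FROM read_orc('warehouse/accounts/year=2024/*.orc')
WHERE balance > 100000;
```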

Apache Avro

Schema evolution across files with automatic type promotion and null-filling

Schema Evolution

  • Read files written with different schema versions together
  • New columns in newer files automatically NULL-filled for older rows
  • Removed columns gracefully excluded from the merged schema
  • Type promotion: int → long, float → double

Logical Types

  • date → Arrow DATE32
  • timestamp-millis / timestamp-micros → Arrow TIMESTAMP
  • decimal with precision and scale preserved
  • uuid, time-millis, time-micros

Compression Codec Mixing

  • Each Avro file can use a different codec
  • Snappy, Deflate, ZSTD, Bzip2 detected per-file
  • Transparent decompression during query execution
  • No configuration needed — codecs detected automatically

Nested Records

  • Avro records mapped to Arrow struct columns
  • Arrays mapped to Arrow list columns
  • Maps mapped to key-value struct arrays
  • Unions decoded with type-tag discrimination
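
Schema evolution in practice means files written under different schema versions are readable in one query. A sketch assuming a `read_avro` table function (the name mirrors `read_json`/`read_protobuf` and is not confirmed here); `loyalty_tier` stands in for a column added in a newer schema version:

```sql
-- Rows from files predating 'loyalty_tier' come back NULL for it;
-- an int column widened to long in newer files reads as BIGINT throughout.
SELECT event_id, user_id, loyalty_tier
FROM read_avro('events/**/*.avro');
```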

JSON & NDJSON

Flexible JSON reading with subtree capture for semi-structured analytics

Subtree Capture with json_paths

  • Preserve nested objects as queryable JSON blob columns
  • Extract flat fields while keeping complex structures intact
  • Ideal for semi-structured data with variable nesting
  • JSON blob columns queryable with json_extract functions

Format Variants

  • NDJSON (newline-delimited) for streaming workloads
  • JSON arrays for bulk exports
  • Mixed-type arrays with automatic type widening
  • Deeply nested objects with configurable flatten depth
-- Keep nested 'address' as a JSON blob, extract flat fields normally
SELECT name, email, address
FROM read_json('customers/*.json',
    json_paths := '{address}'
);

-- Result: 'address' column contains full JSON objects
-- {"street": "123 Main St", "city": "Denver", "state": "CO", "zip": "80202"}

-- Then query into the captured subtree
SELECT name, json_extract(address, '$.city') AS city
FROM read_json('customers/*.json',
    json_paths := '{address}'
);

XML

Structured XML reading with subtree capture and schema evolution

Subtree Capture

  • Preserve nested XML elements as string columns
  • Extract parent-level attributes while keeping child trees intact
  • XPath-based element selection for targeted reading
  • Mixed content handling with text and element children

Schema Evolution

  • Merge schemas across XML files with different structures
  • New elements in newer files NULL-filled for older rows
  • Attribute and element unification in the output schema
  • Namespace-aware parsing for enterprise XML formats

RSS & Feed Parsing

  • RSS 2.0 and Atom feed ingestion as relational tables
  • Channel metadata extracted alongside item rows
  • Enclosure and media elements captured
  • Date normalization across feed date formats
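
Subtree capture for XML mirrors the JSON example shown earlier. A sketch assuming a `read_xml` table function (name not confirmed by this page):

```sql
-- Keep the nested <shipping> element as a string column while
-- flat fields and attributes are extracted normally.
SELECT order_id, status, shipping
FROM read_xml('orders/*.xml');
```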

Excel Workbooks

Multi-sheet reading with intelligent header detection and per-sheet type inference

Multi-Sheet Reading

  • Read specific sheets by name or index
  • Read all sheets at once into separate tables
  • Sheet name available as a metadata column
  • Support for XLSX, XLS (legacy), and ODS formats

Header Row Detection

  • Automatic header row identification
  • Skip leading blank rows and title rows
  • Configurable header row offset for non-standard layouts
  • Column name sanitization and deduplication

Type Inference Per Sheet

  • Independent type inference for each sheet
  • Excel date serial numbers decoded to proper dates
  • Currency and percentage formatting preserved
  • Formula cells read as computed values
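
A sketch of sheet selection; `read_excel` and its `sheet` parameter are assumptions, styled after the `json_paths :=` named-argument syntax shown earlier:

```sql
-- Read one sheet by name; the header row is detected automatically
-- and Excel date serial numbers are decoded to proper DATE values.
SELECT region, q1_sales, q2_sales
FROM read_excel('reports/sales.xlsx', sheet := 'FY2024');
```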

Streaming Connectors

Real-time data ingestion from event streams

Apache Kafka

High-throughput distributed event streaming with consumer groups

Amazon Kinesis

AWS managed streaming with automatic scaling

Azure Event Hubs

Azure-native event ingestion at scale

Google Pub/Sub

GCP messaging with exactly-once delivery
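
Streaming sources typically surface as tables as well. Purely illustrative; the `read_kafka` function, its arguments, and the metadata column names are all assumptions:

```sql
-- Consume a topic as a table; broker address and topic are placeholders.
SELECT payload, kafka_partition, kafka_offset
FROM read_kafka('broker-1:9092', 'sensor-events');
```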

Intelligent Schema Inference

Automatic type detection across 40+ locales with auto-generated transform views — no manual schema definitions

Culture-Aware Parsing

  • German dates: DD.MM.YYYY, US dates: MM/DD/YYYY
  • French decimals: 1 234 567,89
  • German grouping: 1.234.567,89
  • Spanish month names: Enero, Febrero, Marzo...
  • AM/PM designators and negative number formats across locales

12 Detected Types

  • Boolean, SmallInt, Int, BigInt, Decimal, Float
  • Date, Time, DateTime, UUID, Varchar
  • Configurable confidence thresholds (default 80%)
  • Automatic fallback to VARCHAR when confidence is low
  • SQL cast expression generation for each column
  • Auto-generated transform views from inferred types

Schema Merging & Evolution

  • Three modes: Merge (union), Strict, Intersection
  • Type widening: int → bigint, float → double
  • Null-filling for columns missing in older files
  • Column order preservation from first schema
  • Force-nullable mode for safe dynamic evolution

Parallel Processing

  • Rayon-based parallel inference across all CPU cores
  • Configurable sample sizes (1K fast to 100K+ thorough)
  • Compiled regex patterns cached for zero re-compilation
  • Schema fingerprinting for O(1) change detection
  • Automatic catalog sync without manual "Scan Files"
-- Same column, different locales — Delta Forge infers correctly

-- German (de-DE): period groups, comma decimal
order_total:  1.234.567,89  →  DECIMAL
order_date:   15.03.2024    →  DATE

-- US English (en-US): comma groups, period decimal
order_total:  1,234,567.89  →  DECIMAL
order_date:   03/15/2024    →  DATE

-- French (fr-FR): space groups, comma decimal
order_total:  1 234 567,89  →  DECIMAL

-- Auto-generated transform view based on inference
CREATE VIEW v_orders AS
SELECT
    CAST(order_total AS DECIMAL(12,2)) AS order_total,
    CAST(order_date AS DATE) AS order_date,
    customer_name
FROM raw_orders;

Connect all your data sources

Unify your data from any source into Delta Lake tables.