Yes, You Can Query EDI Files with SQL: Here Is How

Q: Can you query EDI files directly with SQL?

Yes. DeltaForge registers a directory of raw X12, EDIFACT, or TRADACOMS files as an external SQL table. Each transaction set or message becomes one row, with envelope headers (ISA, GS, ST for X12; UNB, UNH for EDIFACT; STX, MHD for TRADACOMS) exposed as typed columns and the complete parsed transaction available as a JSON column. The files stay in object storage unchanged, and SELECT, GROUP BY, joins, and aggregates run against them like any other table.

Q: How do you convert X12 EDI to SQL tables?

You do not run a conversion job. CREATE EXTERNAL TABLE ... USING EDI with an edi_config of ediFormat x12 maps the files in place: envelope fields arrive as columns automatically, and the materialized_paths option promotes business fields such as the purchase order number (BEG-03) or invoice number (BIG-02) to first-class columns. If you want a managed copy in Delta Lake, select from the external table into a Delta table with ordinary SQL.

Q: Do you need an EDI parser to analyze EDI data?

Not for analytics. Parser libraries and translators solve transformation and exchange, and they leave the database load to you: you still write loader code before the first query runs. DeltaForge embeds X12, EDIFACT, and TRADACOMS parsing inside the SQL engine, so parsing happens at scan time. Every segment remains reachable through the df_transaction_json column when you need an element you did not promote to its own column.

Q: How do you query EDIFACT or TRADACOMS files?

With the same statement and a different ediFormat value. Setting it to edifact yields UNB and UNH header columns plus full message JSON, one row per message, including multi-message interchanges; EANCOM, the GS1 retail subset, parses through the same path. Setting it to tradacoms yields STX and MHD columns, one row per MHD message, and materialized_paths can promote order line fields such as the EAN-13 product code and unit price.

Can you query EDI files directly with SQL?

Yes. This is the entire program:

SELECT
    df_file_name,
    isa_6 AS sender_id,
    isa_8 AS receiver_id,
    st_1  AS txn_type,
    gs_8  AS x12_version
FROM external.edi_demos.supply_chain_messages
ORDER BY df_file_name;

One row per X12 transaction set, headers as columns, no preprocessing step anywhere in the pipeline.

Run against a folder of fourteen real X12 files spanning eight transaction types and two interchange versions, it returns rows like these:

df_file_name                            sender_id        receiver_id  txn_type  x12_version
x12_810_invoice_a.edi                   ABCMUSICSUPPLY   ARIBAEDI     810       004010
x12_850_purchase_order.edi              000123456        PARTNERID    850       004010
x12_997_functional_acknowledgment.edi   TO               FROM         997       004010
...                                     ...              ...          ...       ...

Output shape: the interchange envelope of every transaction, queryable the moment the table exists.

The table behind that query is an external table over the raw .edi files. Nothing was translated, exported, or loaded. The values above are not illustrative; they are the assertion targets the DeltaForge demo harness checks on every run of the edi-supply-chain-x12 demo, which ships in the DeltaForge demo library so you can reproduce them yourself.

What happens at the storage layer

An EDI external table is a schema-on-read mapping, and understanding the mapping explains everything else in this article. When you create the table, DeltaForge records the location glob, the format, and the parse configuration in its catalog. When you query it, the engine lists the matching files in object storage, parses each interchange at scan time, and emits one row per X12 transaction set (or per EDIFACT message, or per TRADACOMS MHD message). The files themselves are never rewritten, moved, or copied; they remain the system of record exactly as your trading partners delivered them.

Each row carries three kinds of columns. Envelope headers come first: for X12 that is isa_1 through isa_16, gs_1 through gs_8, plus st_1 and st_2, so sender, receiver, dates, control numbers, and transaction type are ordinary columns on every row. Second, df_transaction_json holds the complete parsed transaction, every segment and element, as a JSON document, with df_transaction_id as a unique hash. Third, optional file metadata columns such as df_file_name and df_row_number record provenance, which matters when a feed mixes files from several partners. Here is the table definition that produced the query above:

CREATE ZONE IF NOT EXISTS external
    TYPE EXTERNAL
    COMMENT 'External tables backed by files in object storage';

CREATE SCHEMA IF NOT EXISTS external.edi_demos
    COMMENT 'EDI transaction-backed external tables';

CREATE EXTERNAL TABLE IF NOT EXISTS external.edi_demos.supply_chain_messages
USING EDI
LOCATION 's3://acme-lake/demo/edi-supply-chain-x12/*.edi'
OPTIONS (
    edi_config = '{"ediFormat": "x12"}',
    file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);

The whole ingestion layer. The glob can match thousands of interchanges across partners and X12 versions.

Because parsing is part of the scan, heterogeneity stops being a schema problem. The demo feed mixes X12 versions 00204 and 00401 from four different sender IDs in a single table, and a plain GROUP BY answers questions that would otherwise need partner-specific parser configuration:

SELECT
    st_1 AS txn_type,
    COUNT(*) AS txn_count
FROM external.edi_demos.supply_chain_messages
GROUP BY st_1
ORDER BY txn_count DESC, st_1;

Transaction mix across the feed: 810 invoices, 850 purchase orders, 855, 856, 857, 861, 997, and 824 advice documents, one aggregate away.

How do you convert X12 EDI to SQL tables?

The honest answer is that you stop thinking of it as a conversion. The external table already is the SQL table. What you tune is how much of the transaction body gets promoted from JSON into first-class columns, and the instrument for that is materialized_paths in the edi_config. Listing a segment-element path such as beg_3 (the purchase order number in an 850) makes it a real column with NULLs on transaction types where the segment does not occur:

CREATE EXTERNAL TABLE IF NOT EXISTS external.edi_demos.supply_chain_materialized
USING EDI
LOCATION 's3://acme-lake/demo/edi-supply-chain-x12/*.edi'
OPTIONS (
    edi_config = '{
        "ediFormat": "x12",
        "materialized_paths": [
            "beg_1", "beg_3", "beg_5",
            "big_1", "big_2",
            "bsn_2", "bsn_3",
            "n1_1", "n1_2",
            "ctt_1"
        ]
    }',
    file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);

Same files, second view: PO fields (BEG), invoice fields (BIG), shipment fields (BSN), party names (N1), and line counts (CTT) as columns.

With those paths in place, purchase order analytics is unremarkable SQL:

SELECT
    df_file_name,
    st_1  AS txn_type,
    beg_3 AS po_number,
    beg_5 AS po_date,
    n1_2  AS party_name,
    ctt_1 AS line_items
FROM external.edi_demos.supply_chain_materialized
WHERE beg_3 IS NOT NULL
ORDER BY df_file_name;

Returns the three 850s in the feed: PO 1000012 shipping to John Doe, PO 4600000406 via Transplace Laredo, and PO XX-1234 for ABC AEROSPACE.

If you do want a persisted, managed copy (for downstream consumers that should not depend on the raw feed, or to get Delta Lake time travel over a snapshot), you select from the external table into a Delta table with ordinary SQL. The point is that materialization becomes a choice about lifecycle, not a prerequisite for the first query. The same pattern is covered for flat files in CSV to Delta Lake without Spark.

Do you need an EDI parser to analyze EDI data?

This is the search that usually precedes this article: someone types "x12 parser python" or "x12 parser alternative" because they have a bucket of interchanges and a reporting question. The tools that search surfaces are good at what they are documented to do, and what they are documented to do is not analytics. Open-source translators such as pyx12 validate X12 and translate it to XML. Component libraries for .NET and Java deserialize transactions into typed objects inside your application code. Full translators map EDI into ERP-ready formats as part of an exchange workflow. In every one of those designs, getting data into a database where SQL can reach it is left as your integration work: you write the loader, you define the relational schema, you maintain both as partners and versions change.

If the goal is analytics rather than exchange, you can skip that search entirely. The parse already lives inside the DeltaForge SQL engine, and the escape hatch for fields you did not promote is always present on every row:

SELECT
    df_file_name,
    st_1 AS txn_type,
    df_transaction_json
FROM external.edi_demos.supply_chain_messages
ORDER BY df_file_name
LIMIT 3;

df_transaction_json carries every segment: PO1 line items, SAC charges, TXI tax detail, HL hierarchies. JSON functions reach anything materialized_paths did not.

That two-tier model (promoted columns for the fields you query constantly, JSON for the long tail) is what replaces the parser-plus-loader stack. There is no relational schema for you to design and no version drift to maintain, because the envelope columns and the JSON document are derived from the standard itself at scan time.

How do you query EDIFACT or TRADACOMS files?

With the same statement and one changed value. For UN/EDIFACT, set ediFormat to edifact and the header columns become the UNB interchange envelope and UNH message header:

CREATE EXTERNAL TABLE IF NOT EXISTS external.edi_demos.edifact_messages
USING EDI
LOCATION 's3://acme-lake/demo/edi-edifact-international-trade/*.edi'
OPTIONS (
    edi_config = '{"ediFormat": "edifact"}',
    file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);

SELECT
    df_file_name,
    unb_1 AS syntax_id,
    unh_2 AS msg_type,
    unh_1 AS msg_ref,
    unb_2 AS sender
FROM external.edi_demos.edifact_messages
ORDER BY df_file_name;

From the edi-edifact-international-trade demo: 22 files, more than a dozen message types (ORDERS, INVOIC, CUSCAR, BAPLIE, PAXLST, CONTRL), six syntax identifiers, one table.

Two details matter in practice. Files containing multiple messages produce multiple rows, so a CONTRL acknowledgment file with two messages yields two rows, and the demo's row count exceeds its file count by design. And EANCOM, the GS1 retail subset of EDIFACT, parses through the same path; the demo mixes six EANCOM files (DESADV, IFTSTA, INVOIC, ORDRSP, PRICAT, IFTMIN) with pure EDIFACT in one table and classifies them with a CASE expression.

TRADACOMS, the UK retail standard, follows the same pattern with ediFormat set to tradacoms. The engine emits one row per MHD message, exposing the STX transmission envelope and MHD header, and materialized_paths promotes order line detail fields exactly as it does for X12:

CREATE EXTERNAL TABLE IF NOT EXISTS external.edi_demos.tradacoms_order_lines
USING EDI
LOCATION 's3://acme-lake/demo/edi-tradacoms-purchase-orders/*.edi'
OPTIONS (
    edi_config = '{
        "ediFormat": "tradacoms",
        "materialized_paths": [
            "old_1", "old_2", "old_3", "old_5", "old_6", "old_10",
            "ord_1", "clo_1",
            "din_1", "din_4",
            "otr_1", "otr_2",
            "typ_1", "typ_2",
            "sdt_2", "cdt_2"
        ]
    }',
    file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);

Order line detail (OLD), order references (ORD), delivery instructions (DIN), trailers (OTR), and trading partner names (SDT, CDT) as columns.

TRADACOMS fields are strings on the wire, and CAST-based arithmetic works on them directly, so order valuation is a one-statement job:

SELECT
    ord_1 AS order_ref,
    SUM(CAST(old_5 AS INTEGER) * CAST(old_6 AS INTEGER)) AS order_total
FROM external.edi_demos.tradacoms_order_lines
WHERE mhd_2 = 'ORDERS:9' AND old_1 IS NOT NULL
GROUP BY ord_1
ORDER BY ord_1;

From the edi-tradacoms-purchase-orders demo: order 981 totals 120, order 982 totals 48, validated by the demo's assertions.

One honest limitation to plan around: materialized_paths extracts the last occurrence of a repeating segment per message. In the TRADACOMS demo, an ORDERS message with three OLD lines surfaces only the final line in the materialized columns. When the analysis needs every repetition (full line-item detail, for example), read the repeating segments out of df_transaction_json, where all of them are preserved.

Run the full flow yourself

The edi-supply-chain-x12 demo in the demo library packages the complete flow with fourteen real-world files and assertion checks. End to end it is three statements:

Create the external tables. Run the zone, schema, and the two CREATE EXTERNAL TABLE statements shown above: supply_chain_messages for the compact envelope view and supply_chain_materialized for the business-field view. Both point at the same fourteen files; neither copies a byte.
Query. The header overview, transaction mix, and purchase order queries above run as written. The demo's own query set goes further: trading partner pairs from ISA sender and receiver IDs, invoice detail from BIG segments, and functional group analysis from GS codes.
Verify. Confirm the feed parsed completely by checking the version split. The demo asserts seven transactions in each interchange version:
```
SELECT
    isa_12 AS isa_version,
    COUNT(*) AS txn_count
FROM external.edi_demos.supply_chain_messages
GROUP BY isa_12
ORDER BY isa_12;
```
Expected output: 00204 with 7 transactions and 00401 with 7, alongside the demo's summary checks that all fourteen files parsed, every row carries JSON, and the materialized columns populated.

The edi-edifact-international-trade and edi-tradacoms-purchase-orders demos package the EDIFACT and TRADACOMS variants the same way, each with its own assertion suite.

Scope: what this is and what it is not

DeltaForge sits at the analytics layer of an EDI estate, and it is worth being precise about that boundary.

It is: A commercial, customer-installed SQL engine with PostgreSQL-flavored SQL that reads EDI files where they land. Your archive bucket becomes a queryable history of every interchange, suitable for reconciliation, partner scorecards, order lifecycle tracking, and feeding BI tools through the engine's ODBC and ADBC connectors.
It is not: A VAN, a translator, or an integration broker. It does not transmit interchanges, generate 997 or CONTRL acknowledgments back to partners, or manage trading partner mailboxes. Your exchange infrastructure keeps doing exchange; DeltaForge makes what flowed through it analyzable.
Granularity: One row per transaction set or message. Repeating detail segments live in df_transaction_json; materialized_paths surfaces the last occurrence per message, so line-grain analysis reads the JSON.

That division of labor is the practical answer to the decade of forum threads. They were right that EDI needs parsing; they were wrong that the parsing has to happen outside the database. Once the parser is part of the scan, "how do I query EDI files with SQL" stops being an architecture project and becomes a SELECT statement.