Can you query EDI files directly with SQL?
Yes. This is the entire program:
SELECT
df_file_name,
isa_6 AS sender_id,
isa_8 AS receiver_id,
st_1 AS txn_type,
gs_8 AS x12_version
FROM external.edi_demos.supply_chain_messages
ORDER BY df_file_name;
Run against a folder of fourteen real X12 files spanning eight transaction types and two interchange versions, it returns rows like these:
df_file_name sender_id receiver_id txn_type x12_version
x12_810_invoice_a.edi ABCMUSICSUPPLY ARIBAEDI 810 004010
x12_850_purchase_order.edi 000123456 PARTNERID 850 004010
x12_997_functional_acknowledgment.edi TO FROM 997 004010
... ... ... ... ...
The table behind that query is an external table over the raw .edi files. Nothing was translated, exported, or loaded. The values above are not illustrative; they are the assertion targets the DeltaForge demo harness checks on every run of the edi-supply-chain-x12 demo, which ships in the DeltaForge demo library so you can reproduce them yourself.
What happens at the storage layer
An EDI external table is a schema-on-read mapping, and understanding the mapping explains everything else in this article. When you create the table, DeltaForge records the location glob, the format, and the parse configuration in its catalog. When you query it, the engine lists the matching files in object storage, parses each interchange at scan time, and emits one row per X12 transaction set (or per EDIFACT message, or per TRADACOMS MHD message). The files themselves are never rewritten, moved, or copied; they remain the system of record exactly as your trading partners delivered them.
Each row carries three kinds of columns. Envelope headers come first: for X12 that is isa_1 through isa_16, gs_1 through gs_8, plus st_1 and st_2, so sender, receiver, dates, control numbers, and transaction type are ordinary columns on every row. Second, df_transaction_json holds the complete parsed transaction, every segment and element, as a JSON document, with df_transaction_id as a unique hash. Third, optional file metadata columns such as df_file_name and df_row_number record provenance, which matters when a feed mixes files from several partners. Here is the table definition that produced the query above:
CREATE ZONE IF NOT EXISTS external
TYPE EXTERNAL
COMMENT 'External tables backed by files in object storage';
CREATE SCHEMA IF NOT EXISTS external.edi_demos
COMMENT 'EDI transaction-backed external tables';
CREATE EXTERNAL TABLE IF NOT EXISTS external.edi_demos.supply_chain_messages
USING EDI
LOCATION 's3://acme-lake/demo/edi-supply-chain-x12/*.edi'
OPTIONS (
edi_config = '{"ediFormat": "x12"}',
file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);
Because parsing is part of the scan, heterogeneity stops being a schema problem. The demo feed mixes X12 versions 00204 and 00401 from four different sender IDs in a single table, and a plain GROUP BY answers questions that would otherwise need partner-specific parser configuration:
SELECT
st_1 AS txn_type,
COUNT(*) AS txn_count
FROM external.edi_demos.supply_chain_messages
GROUP BY st_1
ORDER BY txn_count DESC, st_1;
How do you convert X12 EDI to SQL tables?
The honest answer is that you stop thinking of it as a conversion. The external table already is the SQL table. What you tune is how much of the transaction body gets promoted from JSON into first-class columns, and the instrument for that is materialized_paths in the edi_config. Listing a segment-element path such as beg_3 (the purchase order number in an 850) makes it a real column with NULLs on transaction types where the segment does not occur:
CREATE EXTERNAL TABLE IF NOT EXISTS external.edi_demos.supply_chain_materialized
USING EDI
LOCATION 's3://acme-lake/demo/edi-supply-chain-x12/*.edi'
OPTIONS (
edi_config = '{
"ediFormat": "x12",
"materialized_paths": [
"beg_1", "beg_3", "beg_5",
"big_1", "big_2",
"bsn_2", "bsn_3",
"n1_1", "n1_2",
"ctt_1"
]
}',
file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);
With those paths in place, purchase order analytics is unremarkable SQL:
SELECT
df_file_name,
st_1 AS txn_type,
beg_3 AS po_number,
beg_5 AS po_date,
n1_2 AS party_name,
ctt_1 AS line_items
FROM external.edi_demos.supply_chain_materialized
WHERE beg_3 IS NOT NULL
ORDER BY df_file_name;
If you do want a persisted, managed copy (for downstream consumers that should not depend on the raw feed, or to get Delta Lake time travel over a snapshot), you select from the external table into a Delta table with ordinary SQL. The point is that materialization becomes a choice about lifecycle, not a prerequisite for the first query. The same pattern is covered for flat files in CSV to Delta Lake without Spark.
Do you need an EDI parser to analyze EDI data?
This is the search that usually precedes this article: someone types "x12 parser python" or "x12 parser alternative" because they have a bucket of interchanges and a reporting question. The tools that search surfaces are good at what they are documented to do, and what they are documented to do is not analytics. Open-source translators such as pyx12 validate X12 and translate it to XML. Component libraries for .NET and Java deserialize transactions into typed objects inside your application code. Full translators map EDI into ERP-ready formats as part of an exchange workflow. In every one of those designs, getting data into a database where SQL can reach it is left as your integration work: you write the loader, you define the relational schema, you maintain both as partners and versions change.
If the goal is analytics rather than exchange, you can skip that search entirely. The parse already lives inside the DeltaForge SQL engine, and the escape hatch for fields you did not promote is always present on every row:
SELECT
df_file_name,
st_1 AS txn_type,
df_transaction_json
FROM external.edi_demos.supply_chain_messages
ORDER BY df_file_name
LIMIT 3;
That two-tier model (promoted columns for the fields you query constantly, JSON for the long tail) is what replaces the parser-plus-loader stack. There is no relational schema for you to design and no version drift to maintain, because the envelope columns and the JSON document are derived from the standard itself at scan time.
How do you query EDIFACT or TRADACOMS files?
With the same statement and one changed value. For UN/EDIFACT, set ediFormat to edifact and the header columns become the UNB interchange envelope and UNH message header:
CREATE EXTERNAL TABLE IF NOT EXISTS external.edi_demos.edifact_messages
USING EDI
LOCATION 's3://acme-lake/demo/edi-edifact-international-trade/*.edi'
OPTIONS (
edi_config = '{"ediFormat": "edifact"}',
file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);
SELECT
df_file_name,
unb_1 AS syntax_id,
unh_2 AS msg_type,
unh_1 AS msg_ref,
unb_2 AS sender
FROM external.edi_demos.edifact_messages
ORDER BY df_file_name;
Two details matter in practice. Files containing multiple messages produce multiple rows, so a CONTRL acknowledgment file with two messages yields two rows, and the demo's row count exceeds its file count by design. And EANCOM, the GS1 retail subset of EDIFACT, parses through the same path; the demo mixes six EANCOM files (DESADV, IFTSTA, INVOIC, ORDRSP, PRICAT, IFTMIN) with pure EDIFACT in one table and classifies them with a CASE expression.
TRADACOMS, the UK retail standard, follows the same pattern with ediFormat set to tradacoms. The engine emits one row per MHD message, exposing the STX transmission envelope and MHD header, and materialized_paths promotes order line detail fields exactly as it does for X12:
CREATE EXTERNAL TABLE IF NOT EXISTS external.edi_demos.tradacoms_order_lines
USING EDI
LOCATION 's3://acme-lake/demo/edi-tradacoms-purchase-orders/*.edi'
OPTIONS (
edi_config = '{
"ediFormat": "tradacoms",
"materialized_paths": [
"old_1", "old_2", "old_3", "old_5", "old_6", "old_10",
"ord_1", "clo_1",
"din_1", "din_4",
"otr_1", "otr_2",
"typ_1", "typ_2",
"sdt_2", "cdt_2"
]
}',
file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);
TRADACOMS fields are strings on the wire, and CAST-based arithmetic works on them directly, so order valuation is a one-statement job:
SELECT
ord_1 AS order_ref,
SUM(CAST(old_5 AS INTEGER) * CAST(old_6 AS INTEGER)) AS order_total
FROM external.edi_demos.tradacoms_order_lines
WHERE mhd_2 = 'ORDERS:9' AND old_1 IS NOT NULL
GROUP BY ord_1
ORDER BY ord_1;
One honest limitation to plan around: materialized_paths extracts the last occurrence of a repeating segment per message. In the TRADACOMS demo, an ORDERS message with three OLD lines surfaces only the final line in the materialized columns. When the analysis needs every repetition (full line-item detail, for example), read the repeating segments out of df_transaction_json, where all of them are preserved.
Run the full flow yourself
The edi-supply-chain-x12 demo in the demo library packages the complete flow with fourteen real-world files and assertion checks. End to end it is three statements:
-
Create the external tables. Run the zone, schema, and the two CREATE EXTERNAL TABLE statements shown above:
supply_chain_messagesfor the compact envelope view andsupply_chain_materializedfor the business-field view. Both point at the same fourteen files; neither copies a byte. -
Query. The header overview, transaction mix, and purchase order queries above run as written. The demo's own query set goes further: trading partner pairs from ISA sender and receiver IDs, invoice detail from BIG segments, and functional group analysis from GS codes.
-
Verify. Confirm the feed parsed completely by checking the version split. The demo asserts seven transactions in each interchange version:
SELECT isa_12 AS isa_version, COUNT(*) AS txn_count FROM external.edi_demos.supply_chain_messages GROUP BY isa_12 ORDER BY isa_12;Expected output:
00204with 7 transactions and00401with 7, alongside the demo's summary checks that all fourteen files parsed, every row carries JSON, and the materialized columns populated.
The edi-edifact-international-trade and edi-tradacoms-purchase-orders demos package the EDIFACT and TRADACOMS variants the same way, each with its own assertion suite.
Scope: what this is and what it is not
DeltaForge sits at the analytics layer of an EDI estate, and it is worth being precise about that boundary.
- It is
- A commercial, customer-installed SQL engine with PostgreSQL-flavored SQL that reads EDI files where they land. Your archive bucket becomes a queryable history of every interchange, suitable for reconciliation, partner scorecards, order lifecycle tracking, and feeding BI tools through the engine's ODBC and ADBC connectors.
- It is not
- A VAN, a translator, or an integration broker. It does not transmit interchanges, generate 997 or CONTRL acknowledgments back to partners, or manage trading partner mailboxes. Your exchange infrastructure keeps doing exchange; DeltaForge makes what flowed through it analyzable.
- Granularity
- One row per transaction set or message. Repeating detail segments live in df_transaction_json; materialized_paths surfaces the last occurrence per message, so line-grain analysis reads the JSON.
That division of labor is the practical answer to the decade of forum threads. They were right that EDI needs parsing; they were wrong that the parsing has to happen outside the database. Once the parser is part of the scan, "how do I query EDI files with SQL" stops being an architecture project and becomes a SELECT statement.