If you searched for how to query an Excel file with SQL, you have probably already met the two standard answers, and neither is good. The first is the SQL Server lineage: OPENROWSET or a linked server through the Microsoft Access Database Engine (ACE) OLE DB provider. That route requires installing the provider on the database server, enabling ad hoc distributed queries, matching the provider's bitness to the server process, and placing the file somewhere the server can reach. It was designed for a file on a Windows share, not for a bucket of monthly exports in S3. The second is the lakehouse lineage: Microsoft Fabric's own documentation for getting Excel data into a lakehouse Delta table walks you through a Spark notebook that reads the workbook with pandas and writes it back out. Both answers turn a spreadsheet into an engineering task.
DeltaForge treats Excel as a first-class table format. The engine reads xlsx natively, so the path from workbook to query result is one DDL statement, and the path from workbook to Delta table is one more. The SQL below is taken from two demos in the DeltaForge demo library, excel/sales-analytics and excel/multi-sheet-reporting, whose row counts and aggregates are asserted by the demo harness, so every number quoted here is verified output rather than marketing arithmetic.
How do you query an Excel file with SQL?
You register the file, or the folder containing it, as an external table. The statement below comes from the excel/sales-analytics demo, which ships four Superstore workbooks (one per year, 2014 through 2017) totalling 9,994 orders. The LOCATION points at the folder, and the options pick the sheet, declare the header row, and set the type-inference sample size:
CREATE ZONE IF NOT EXISTS external TYPE EXTERNAL
COMMENT 'External tables for file-backed demo data';
CREATE SCHEMA IF NOT EXISTS external.excel_demos
COMMENT 'Excel-backed external tables';
CREATE EXTERNAL TABLE IF NOT EXISTS external.excel_demos.all_orders
USING EXCEL
LOCATION 's3://acme-lake/demo/sales-xlsx'
OPTIONS (
sheet_name = 'Orders',
has_header = 'true',
infer_schema_rows = '1000',
file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);
It is worth being precise about what just happened at the storage layer: nothing. No data moved, no staging copy was written, no import job ran. The external table is catalog metadata that records the location, the format, and the options. At query time the engine lists the xlsx files under the prefix, opens the requested sheet in each, and decodes rows straight into Arrow batches. Schema discovery runs automatically when the table is created, sampling up to the first 1,000 data rows (the infer_schema_rows setting) to type each column: Excel date cells become DATE, numeric cells become DOUBLE or BIGINT, and text becomes VARCHAR. Column names are normalized to lowercase, so the workbook's "Sub-Category" header is queried as sub_category.
The file_metadata option adds two system columns, df_file_name and df_row_number, to every row. With four files behind one table, that is how you keep lineage to the source workbook:
SELECT df_file_name, COUNT(*) AS row_count
FROM external.excel_demos.all_orders
GROUP BY df_file_name
ORDER BY df_file_name;
From here the spreadsheet is simply a table. Aggregations, window functions, and joins against other tables behave exactly as they would over Parquet. The demo's regional rollup is ordinary SQL:
SELECT region,
COUNT(*) AS orders,
ROUND(SUM(CAST(sales AS DOUBLE)), 2) AS total_sales,
ROUND(SUM(CAST(profit AS DOUBLE)), 2) AS total_profit,
ROUND(AVG(CAST(discount AS DOUBLE)), 3) AS avg_discount
FROM external.excel_demos.all_orders
GROUP BY region
ORDER BY total_sales DESC;
Walking the full demo: create, query, verify
The excel/sales-analytics demo ships in the DeltaForge demo library with the four workbooks included, so you can reproduce everything in this section against the same data. The flow is three steps.
-
Create the tables. Run the
all_ordersDDL above, then add a second table that isolates a single file from the same folder with afile_filterglob:CREATE EXTERNAL TABLE IF NOT EXISTS external.excel_demos.orders_2017 USING EXCEL LOCATION 's3://acme-lake/demo/sales-xlsx' OPTIONS ( sheet_name = 'Orders', has_header = 'true', file_filter = 'sales-data-2017*', file_metadata = '{"columns":["df_file_name","df_row_number"]}' ); -
Query. Run the per-file breakdown and the regional rollup shown earlier, or any ad-hoc SELECT. Nothing about the table is precomputed; each query reads the workbooks as they exist at that moment, so replacing a file in the folder is immediately visible to the next query.
-
Verify. The demo uses DeltaForge's
ASSERTcommand to pin expected results, which is also a convenient pattern for your own ingestion checks:ASSERT ROW_COUNT = 9994 SELECT * FROM external.excel_demos.all_orders; ASSERT ROW_COUNT = 3312 SELECT * FROM external.excel_demos.orders_2017;
Beyond the options used here, the Excel reader also supports cell ranges (range = 'A1:K500' style addressing), skip_rows and max_rows, headerless files with generated column names, whitespace trimming, and custom null markers such as "N/A". Those are exercised in the same demo family if you need them for messier workbooks.
One workbook, three sheets, one JOIN
Sheets are the part of Excel no other file format has, and the part most tooling flattens away. The excel/multi-sheet-reporting demo models a common real situation: each regional office submits one workbook containing a Sales sheet, a Returns sheet, and a Staff sheet. Because sheet_name is a table option, each sheet becomes its own external table over the same files:
CREATE EXTERNAL TABLE IF NOT EXISTS external.excel_demos.all_sales
USING EXCEL
LOCATION 's3://acme-lake/demo/regional-xlsx'
OPTIONS (
sheet_name = 'Sales',
has_header = 'true',
infer_schema_rows = '100',
file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);
CREATE EXTERNAL TABLE IF NOT EXISTS external.excel_demos.all_returns
USING EXCEL
LOCATION 's3://acme-lake/demo/regional-xlsx'
OPTIONS (
sheet_name = 'Returns',
has_header = 'true',
infer_schema_rows = '100',
file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);
Returns reference order IDs from the Sales sheet, so a cross-sheet JOIN computes net revenue per region, something that would otherwise mean VLOOKUP gymnastics inside the workbook or a script outside it:
SELECT s.df_file_name AS region,
ROUND(SUM(CAST(s.total_amount AS DOUBLE)), 2) AS gross_sales,
ROUND(COALESCE(SUM(CAST(r.refund_amount AS DOUBLE)), 0), 2) AS total_refunds,
ROUND(SUM(CAST(s.total_amount AS DOUBLE)) - COALESCE(SUM(CAST(r.refund_amount AS DOUBLE)), 0), 2) AS net_revenue
FROM external.excel_demos.all_sales s
LEFT JOIN external.excel_demos.all_returns r
ON s.order_id = r.order_id
GROUP BY s.df_file_name
ORDER BY s.df_file_name;
Can you load Excel into Delta Lake without a notebook?
Yes, and this is where the external table pays off twice. Querying in place is right for exploration, but a Delta table is the right destination once the data feeds dashboards or downstream jobs. With the external table already defined, the load is one statement:
CREATE DELTA TABLE warehouse.sales.orders
LOCATION 's3://acme-lake/warehouse/sales/orders'
AS SELECT order_id, order_date, ship_date, customer_name, segment,
region, category, sub_category, sales, quantity, discount, profit
FROM external.excel_demos.all_orders;
ASSERT VALUE order_count = 9994
SELECT COUNT(*) AS order_count
FROM warehouse.sales.orders;
At the storage layer this writes Parquet data files plus a _delta_log/ transaction log under the target location, committing the schema and the initial files as version 0. The writer streams batches from the Excel reader directly into the Delta write path and targets file sizes of roughly 128 to 153 MB by default, so a large load does not produce a small-file mess. The result is a standard Delta table readable by any Delta-compatible engine, not a DeltaForge-private format.
Compare that with the documented path on Microsoft Fabric, where loading an Excel file into a lakehouse Delta table is shown as a PySpark notebook that reads the workbook via pandas and writes it back as Delta. That works, and for teams already living in notebooks it may be fine. But it puts a Spark runtime, a Python environment, and notebook plumbing between an analyst and a spreadsheet. The SQL path above needs none of those, and because it is plain SQL it slots into a scheduled DeltaForge pipeline for recurring drops without any extra machinery.
How do you query Excel files stored in S3 or ADLS?
Exactly as shown: the LOCATION string is the only thing that changes. The examples in this article use an S3 URI (s3://acme-lake/demo/sales-xlsx); an ADLS Gen2 path or a local directory works the same way. The engine's object-store layer handles listing the prefix, fetching the workbooks, and applying the file_filter glob, so the pattern of "every office drops a workbook in the bucket" becomes one queryable table with per-file lineage through df_file_name. There is no requirement to download files to the database host first, which is precisely the constraint that made the OPENROWSET approach awkward for cloud storage.
Do you still need OPENROWSET to query xlsx files?
Not with an engine that reads the format natively. To be fair to the old path: OPENROWSET with the ACE provider does work for a file on the SQL Server host, and it is still the documented mechanism there. The friction is operational, and Microsoft's own documentation spells it out: the Access Database Engine redistributable must be installed on the server, the provider's bitness must match the SQL Server process, ad hoc distributed queries must be enabled, and the workbook must sit on a path the service account can read. Every one of those is a server-administration step for what the analyst experiences as "I just want to SELECT from this spreadsheet." The external-table approach removes the provider, the server-local file requirement, and the configuration flags; the trade is that you are running a different engine for the query.
Scope and limits, honestly
A few boundaries are worth knowing before you lean on this in production.
- xlsx is not a columnar format
- An external table reads the workbook on every query, and xlsx offers no column pruning or predicate pushdown the way Parquet does. For files in the megabyte range this is irrelevant; for hot data queried constantly, do the CTAS into Delta and query that. In-place querying and loading are complementary, not competing.
- Values, not formulas
- The reader consumes cell values from the sheet. Treat the workbook as data, not as a live calculation engine: if a sheet's numbers depend on macros or volatile functions, export the computed values before relying on them.
- Type inference is sampled
infer_schema_rowscontrols how many rows the discovery pass examines. A column that is numeric for the first thousand rows and text afterwards needs a larger sample or an explicit cast in your queries, the same caveat that applies to CSV inference anywhere.- Schema follows the first sheet layout
- Multi-file tables assume the files share a layout for the selected sheet. The
file_filteroption is the escape hatch when a folder mixes layouts: carve each layout into its own table. - Commercial software
- DeltaForge is commercial, customer-installed software that runs in your own environment, on-prem or in your cloud account. The demo library, including both demos used here, ships with the product so you can verify all of this against your own files.
Run it yourself
Both demos used in this article, excel/sales-analytics and excel/multi-sheet-reporting, ship in the DeltaForge demo library with their data files, setup scripts, verification queries, and cleanup scripts. The verification queries assert every row count and aggregate quoted above, so a green run is proof the engine on your machine reproduces these results. Browse the library at /demos, and see the DeltaForge SQL engine page for the full set of file formats the same external-table pattern covers.