The status quo is a Python script
If you have a folder of CSV exports and want them in Delta Lake, the documented routes mostly assume code. The Delta Lake documentation's CSV guidance is written for the deltalake Python package and pandas; the Spark route assumes a running Spark session and a spark.read.csv followed by write.format("delta"); Polars and DuckDB users stitch together a reader and a writer library. All of these work, and none of them is wrong. But each one turns a data movement task into a small software project: an environment, a dependency set, a script someone now owns, and a scheduler to run it.
If your team already lives in SQL, that detour is unnecessary. DeltaForge is a customer-installed analytics engine with a native Delta Lake writer built in, which means the entire conversion is expressible as SQL statements. The engine handles file listing, schema inference, type handling, Parquet encoding, and the Delta transaction log commit. The statements below are not pseudocode: they are taken from demos that ship with the product and run under its test harness.
How do you convert a CSV file to a Delta table without Spark?
Two statements. The first registers the CSV files where they sit and infers their schema; the second reads them and writes a Delta table.
CREATE EXTERNAL TABLE IF NOT EXISTS external.csv_demos.sales
USING CSV
LOCATION 's3://acme-lake/demo/sales-quickstart/sales*.csv'
OPTIONS (
header = 'true'
);
CREATE DELTA TABLE external.csv_demos.sales_delta
LOCATION 's3://acme-lake/demo/delta/sales'
AS SELECT * FROM external.csv_demos.sales;
No data is copied by the first statement. An external table is a metadata registration: queries against it read the CSV files directly from the source location, with row-level filter pushdown applied. The second statement is where the physical conversion happens. The CTAS streams the query result into Parquet files at the target LOCATION and commits a _delta_log/ directory alongside them, so the result is a standard Delta Lake table readable by any Delta-compatible engine, not a proprietary copy.
That is the honest scope of "two statements": it assumes the catalog namespace (a zone and a schema) already exists. The full walkthrough below includes that one-time setup, which is two more DDL statements you run once per environment, not per load.
Can CREATE EXTERNAL TABLE auto-detect a CSV schema?
Yes, and this is the part that usually pushes people toward a programming language. When you omit the column list, the engine reads a sample of the files at the LOCATION and infers the schema: column names from the header row when header = 'true', auto-generated names (column_0, column_1, ...) otherwise, and types derived from the sampled data. The number of rows sampled is configurable with infer_schema_rows, and an explicit column list overrides inference entirely when files lack headers or when you want to pin types yourself.
Real CSV is rarely the comma-separated ideal. The Northwind demo ships eleven semicolon-delimited files, and the only change is one option:
CREATE EXTERNAL TABLE IF NOT EXISTS external.csv_demos.nw_orders
USING CSV
LOCATION 's3://acme-lake/demo/northwind-database/orders.csv'
OPTIONS (
header = 'true',
delimiter = ';'
);
Inference also copes with schema drift across files. The sales-quickstart demo deliberately spans five quarterly files whose columns change over time: sales_rep appears in Q2, discount_pct in Q3, region is retired in Q4 in favor of territory, and channel arrives in 2025. One external table unifies all five files; columns missing from older files simply surface as NULL. That matters for the conversion because the CTAS inherits the unified schema, so a year of drifting exports lands in one coherent Delta table.
Walkthrough: sales-quickstart end to end
The csv/sales-quickstart demo ships in the DeltaForge demo library, so you can run every step yourself, and the csv/northwind-database demo extends the same pattern to an eleven-table relational schema. Here is the full flow with the one-time setup included.
-
Create the namespace once. DeltaForge uses three-part names (
zone.schema.table), with the zone as the governance boundary:CREATE ZONE IF NOT EXISTS external TYPE EXTERNAL COMMENT 'External tables for file-backed data'; CREATE SCHEMA IF NOT EXISTS external.csv_demos COMMENT 'CSV-backed external tables'; -
Register the files. The
file_metadataoption injects system columns so every row can be traced back to its source file and line, which is worth having before a conversion, not after:CREATE EXTERNAL TABLE IF NOT EXISTS external.csv_demos.sales USING CSV LOCATION 's3://acme-lake/demo/sales-quickstart/sales*.csv' OPTIONS ( header = 'true', file_metadata = '{"columns":["df_file_name","df_row_number"]}' ); -
Verify the raw data before writing anything. This aggregate runs directly against the CSV files; in the demo it returns six products with a revenue total of 5,339.35 across all fifteen rows:
SELECT product_name, SUM(CAST(quantity AS INT)) AS total_quantity, ROUND(SUM(CAST(quantity AS INT) * CAST(unit_price AS DOUBLE)), 2) AS total_revenue FROM external.csv_demos.sales GROUP BY product_name ORDER BY total_revenue DESC; -
Convert. Casting in the SELECT is deliberate: it locks the Delta table's schema to the types you intend rather than whatever inference sampled, and it is exactly how the demo's own queries treat numeric columns:
CREATE DELTA TABLE external.csv_demos.sales_delta LOCATION 's3://acme-lake/demo/delta/sales' AS SELECT id, product_name, CAST(quantity AS INT) AS quantity, CAST(unit_price AS DOUBLE) AS unit_price, CAST(sale_date AS DATE) AS sale_date, region, sales_rep, territory, channel FROM external.csv_demos.sales; -
Verify the Delta side. The count should match the external table, and the table's history shows the CTAS as the initial commit:
SELECT COUNT(*) AS row_count FROM external.csv_demos.sales_delta; DESCRIBE HISTORY external.csv_demos.sales_delta;
From here the table behaves like any managed Delta table: it accepts INSERT, UPDATE, DELETE, and MERGE, supports time travel and schema evolution, and is readable over the engine's ODBC and ADBC drivers by Power BI and other tools.
What happens at the storage layer
The CTAS is not a thin wrapper over a file copy. Understanding what it writes explains why the result interoperates cleanly.
- Transaction log
- The table's schema, properties, and initial file list are committed as version 0 in
_delta_log/. The log, not any catalog, is the source of truth for what the table contains; any Delta-compatible reader replays it. - File sizing
- CTAS targets Parquet files of 128 to 153 MB by default, specifically to avoid the small-file problem that a naive per-CSV conversion creates. Batch and file size are overridable per statement.
- Write path
- Data streams from the CSV scan into the Parquet writer; local destinations use a direct file writer and cloud destinations use a streaming writer, so the conversion does not materialize the dataset in memory.
- Name normalization
- Column names are normalized to lowercase on write, which prevents case-sensitivity mismatches between SQL identifiers and Parquet column names later.
Do you need Python to load CSV into Delta Lake?
No, though it is worth being precise about why the Python answer dominates search results. The open-source Delta Lake ecosystem exposes its writers as libraries: the deltalake package (delta-rs) documents write_deltalake from pandas or Arrow data, and Spark documents the DataFrame writer. A library needs a host language, so every tutorial is necessarily a program. Those are reasonable tools, and if you already operate Spark or a Python data platform, they may be the right ones.
The trade-off changes when the writer lives inside a SQL engine. On DeltaForge the load is a statement, which means it can be typed into the SQL editor, checked into a scheduled pipeline, executed through the SQL engine API, or triggered from any tool that speaks the engine's ODBC or ADBC interface. There is no environment to build and no script to maintain, and access control applies to the statement the same way it applies to every other query. To be equally clear about what DeltaForge is not: it is commercial, customer-installed software, not an open-source library, and the SQL dialect is PostgreSQL-flavored SQL rather than a PostgreSQL wire-protocol implementation.
How do you convert CSV to an Iceberg table with SQL?
The same two-statement pattern covers Apache Iceberg, because DeltaForge's writer supports UniForm: dual-format metadata emission. Add the UniForm table properties to the CTAS and every write to the table also produces an Iceberg metadata.json plus manifests that describe the same physical Parquet files.
CREATE DELTA TABLE external.csv_demos.sales_iceberg
LOCATION 's3://acme-lake/demo/delta/sales_iceberg'
TBLPROPERTIES (
'delta.universalFormat.enabledFormats' = 'iceberg',
'delta.enableIcebergCompatV2' = 'true',
'delta.columnMapping.mode' = 'name',
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '7'
)
AS SELECT * FROM external.csv_demos.sales;
The data is written once; only the metadata is doubled. Iceberg format versions 1, 2, and 3 are supported, and column mapping supplies the Iceberg field IDs. The reverse direction works too: DeltaForge reads native Iceberg tables in place, including time travel via snapshot selection, with the same external-table statement you used for CSV:
CREATE EXTERNAL TABLE IF NOT EXISTS external.iceberg_demos.inventory
USING ICEBERG
LOCATION 's3://acme-lake/demo/iceberg/inventory/';
Scope and limits
A few honest boundaries, because they shape how you use the pattern.
- External tables are read-only. They have no transactions, no time travel, and no schema evolution of their own; those properties arrive with the Delta table the CTAS creates. If you only ever need to query the CSVs, you can stop after statement one and never convert.
- Inference reads the source at CREATE time. The LOCATION must exist and be readable when the statement runs, and sampling a large directory on slow storage adds latency to the CREATE. An explicit column list skips the sampling but not the file listing.
- Ambiguous types deserve explicit casts. Inference is sample-based, so numeric-looking text columns can land with types you did not intend. Casting in the CTAS SELECT, as the walkthrough does, makes the Delta schema deliberate instead of accidental.
- Format defaults to CSV. If you omit USING and the file extensions are ambiguous, the engine assumes CSV. Pointing that default at JSON or binary files misreads them, so state the format when extensions are unreliable.
- The conversion is a snapshot, not a sync. CTAS captures the files as they exist at execution. New CSVs appearing later are visible through the external table immediately, but reach the Delta table only when you append or merge them, typically via a scheduled pipeline running the same SQL.
None of these limits push you back to a cluster. They are the same considerations a Spark or Python pipeline would face, expressed as properties of two SQL statements instead of a codebase.