CSV to Delta Lake Without Spark: Two SQL Statements

Q: How do you convert a CSV file to a Delta table without Spark?

On DeltaForge it takes two SQL statements. CREATE EXTERNAL TABLE registers the CSV files in place and infers the schema from a sample, so the raw data is immediately queryable. CREATE DELTA TABLE ... AS SELECT then reads that external table and writes a full Delta Lake table: Parquet data files plus a _delta_log transaction log, committed as version 0. No Spark cluster, no JVM, and no Python script is involved; the engine runs the whole pipeline natively.

Q: Can CREATE EXTERNAL TABLE auto-detect a CSV schema?

Yes. When you omit the column list, DeltaForge reads a sample of the files at the LOCATION and infers column names and types automatically. Column names come from the header row when header = 'true', or are auto-generated otherwise. Options cover delimiters, quoting, encoding, skipped rows, and the number of rows sampled for inference. You can also pass an explicit column list to override inference entirely, which is useful for headerless files or ambiguous types.

Q: Do you need Python to load CSV into Delta Lake?

No. Most tutorials reach for Python because the common open-source writers (the deltalake package, pandas, or PySpark) are libraries, so you write and operate a script. DeltaForge is a commercial, customer-installed SQL engine with a native Delta Lake writer, so the load is two SQL statements you can run from its SQL editor, schedule as a pipeline, or call over ODBC and ADBC. Python remains a fine choice; it simply is not required.

Q: How do you convert CSV to an Iceberg table with SQL?

Use the same two-statement pattern and add UniForm table properties to the CTAS. With delta.universalFormat.enabledFormats = 'iceberg' and delta.enableIcebergCompatV2 = 'true', every write to the Delta table also emits an Iceberg metadata.json and manifests describing the same Parquet files, so Iceberg-compatible engines read the result directly. DeltaForge also reads native Iceberg tables through CREATE EXTERNAL TABLE ... USING ICEBERG, with time travel via snapshot selection.

The classic Northwind exports, from a pile of CSV files to live answers, with data quality checks along the way.

The status quo is a Python script

If you have a folder of CSV exports and want them in Delta Lake, the documented routes mostly assume code. The Delta Lake documentation's CSV guidance is written for the deltalake Python package and pandas; the Spark route assumes a running Spark session and a spark.read.csv followed by write.format("delta"); Polars and DuckDB users stitch together a reader and a writer library. All of these work, and none of them is wrong. But each one turns a data movement task into a small software project: an environment, a dependency set, a script someone now owns, and a scheduler to run it.

If your team already lives in SQL, that detour is unnecessary. DeltaForge is a customer-installed analytics engine with a native Delta Lake writer built in, which means the entire conversion is expressible as SQL statements. The engine handles file listing, schema inference, type handling, Parquet encoding, and the Delta transaction log commit. The statements below are not pseudocode: they are taken from demos that ship with the product and run under its test harness.

How do you convert a CSV file to a Delta table without Spark?

Two statements. The first registers the CSV files where they sit and infers their schema; the second reads them and writes a Delta table.

CREATE EXTERNAL TABLE IF NOT EXISTS external.csv_demos.sales
USING CSV
LOCATION 's3://acme-lake/demo/sales-quickstart/sales*.csv'
OPTIONS (
    header = 'true'
);

CREATE DELTA TABLE external.csv_demos.sales_delta
LOCATION 's3://acme-lake/demo/delta/sales'
AS SELECT * FROM external.csv_demos.sales;

The whole conversion. The glob in LOCATION picks up every matching file; the CTAS writes Parquet plus a Delta transaction log.

No data is copied by the first statement. An external table is a metadata registration: queries against it read the CSV files directly from the source location, with row-level filter pushdown applied. The second statement is where the physical conversion happens. The CTAS streams the query result into Parquet files at the target LOCATION and commits a _delta_log/ directory alongside them, so the result is a standard Delta Lake table readable by any Delta-compatible engine, not a proprietary copy.

That is the honest scope of "two statements": it assumes the catalog namespace (a zone and a schema) already exists. The full walkthrough below includes that one-time setup, which is two more DDL statements you run once per environment, not per load.

Can CREATE EXTERNAL TABLE auto-detect a CSV schema?

Yes, and this is the part that usually pushes people toward a programming language. When you omit the column list, the engine reads a sample of the files at the LOCATION and infers the schema: column names from the header row when header = 'true', auto-generated names (column_0, column_1, ...) otherwise, and types derived from the sampled data. The number of rows sampled is configurable with infer_schema_rows, and an explicit column list overrides inference entirely when files lack headers or when you want to pin types yourself.

Real CSV is rarely the comma-separated ideal. The Northwind demo ships eleven semicolon-delimited files, and the only change is one option:

CREATE EXTERNAL TABLE IF NOT EXISTS external.csv_demos.nw_orders
USING CSV
LOCATION 's3://acme-lake/demo/northwind-database/orders.csv'
OPTIONS (
    header = 'true',
    delimiter = ';'
);

From the northwind-database demo. Quote and escape characters, comment prefixes, null markers, skipped rows, whitespace trimming, and encoding are all OPTIONS as well.

Inference also copes with schema drift across files. The sales-quickstart demo deliberately spans five quarterly files whose columns change over time: sales_rep appears in Q2, discount_pct in Q3, region is retired in Q4 in favor of territory, and channel arrives in 2025. One external table unifies all five files; columns missing from older files simply surface as NULL. That matters for the conversion because the CTAS inherits the unified schema, so a year of drifting exports lands in one coherent Delta table.

Walkthrough: sales-quickstart end to end

The csv/sales-quickstart demo ships in the DeltaForge demo library, so you can run every step yourself, and the csv/northwind-database demo extends the same pattern to an eleven-table relational schema. Here is the full flow with the one-time setup included.

Create the namespace once. DeltaForge uses three-part names (zone.schema.table), with the zone as the governance boundary:

CREATE ZONE IF NOT EXISTS external
    TYPE EXTERNAL
    COMMENT 'External tables for file-backed data';

CREATE SCHEMA IF NOT EXISTS external.csv_demos
    COMMENT 'CSV-backed external tables';

Register the files. The file_metadata option injects system columns so every row can be traced back to its source file and line, which is worth having before a conversion, not after:

CREATE EXTERNAL TABLE IF NOT EXISTS external.csv_demos.sales
USING CSV
LOCATION 's3://acme-lake/demo/sales-quickstart/sales*.csv'
OPTIONS (
    header = 'true',
    file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);

Verify the raw data before writing anything. This aggregate runs directly against the CSV files; in the demo it returns six products with a revenue total of 5,339.35 across all fifteen rows:

SELECT
    product_name,
    SUM(CAST(quantity AS INT)) AS total_quantity,
    ROUND(SUM(CAST(quantity AS INT) * CAST(unit_price AS DOUBLE)), 2) AS total_revenue
FROM external.csv_demos.sales
GROUP BY product_name
ORDER BY total_revenue DESC;

Convert. Casting in the SELECT is deliberate: it locks the Delta table's schema to the types you intend rather than whatever inference sampled, and it is exactly how the demo's own queries treat numeric columns:

CREATE DELTA TABLE external.csv_demos.sales_delta
LOCATION 's3://acme-lake/demo/delta/sales'
AS SELECT
    id,
    product_name,
    CAST(quantity AS INT) AS quantity,
    CAST(unit_price AS DOUBLE) AS unit_price,
    CAST(sale_date AS DATE) AS sale_date,
    region,
    sales_rep,
    territory,
    channel
FROM external.csv_demos.sales;

Verify the Delta side. The count should match the external table, and the table's history shows the CTAS as the initial commit:
```
SELECT COUNT(*) AS row_count FROM external.csv_demos.sales_delta;

DESCRIBE HISTORY external.csv_demos.sales_delta;
```

From here the table behaves like any managed Delta table: it accepts INSERT, UPDATE, DELETE, and MERGE, supports time travel and schema evolution, and is readable over the engine's ODBC and ADBC drivers by Power BI and other tools.

What happens at the storage layer

The CTAS is not a thin wrapper over a file copy. Understanding what it writes explains why the result interoperates cleanly.

Transaction log: The table's schema, properties, and initial file list are committed as version 0 in _delta_log/. The log, not any catalog, is the source of truth for what the table contains; any Delta-compatible reader replays it.
File sizing: CTAS targets Parquet files of 128 to 153 MB by default, specifically to avoid the small-file problem that a naive per-CSV conversion creates. Batch and file size are overridable per statement.
Write path: Data streams from the CSV scan into the Parquet writer; local destinations use a direct file writer and cloud destinations use a streaming writer, so the conversion does not materialize the dataset in memory.
Name normalization: Column names are normalized to lowercase on write, which prevents case-sensitivity mismatches between SQL identifiers and Parquet column names later.

Do you need Python to load CSV into Delta Lake?

No, though it is worth being precise about why the Python answer dominates search results. The open-source Delta Lake ecosystem exposes its writers as libraries: the deltalake package (delta-rs) documents write_deltalake from pandas or Arrow data, and Spark documents the DataFrame writer. A library needs a host language, so every tutorial is necessarily a program. Those are reasonable tools, and if you already operate Spark or a Python data platform, they may be the right ones.

The trade-off changes when the writer lives inside a SQL engine. On DeltaForge the load is a statement, which means it can be typed into the SQL editor, checked into a scheduled pipeline, executed through the SQL engine API, or triggered from any tool that speaks the engine's ODBC or ADBC interface. There is no environment to build and no script to maintain, and access control applies to the statement the same way it applies to every other query. To be equally clear about what DeltaForge is not: it is commercial, customer-installed software, not an open-source library, and the SQL dialect is PostgreSQL-flavored SQL rather than a PostgreSQL wire-protocol implementation.

How do you convert CSV to an Iceberg table with SQL?

The same two-statement pattern covers Apache Iceberg, because DeltaForge's writer supports UniForm: dual-format metadata emission. Add the UniForm table properties to the CTAS and every write to the table also produces an Iceberg metadata.json plus manifests that describe the same physical Parquet files.

CREATE DELTA TABLE external.csv_demos.sales_iceberg
LOCATION 's3://acme-lake/demo/delta/sales_iceberg'
TBLPROPERTIES (
    'delta.universalFormat.enabledFormats' = 'iceberg',
    'delta.enableIcebergCompatV2' = 'true',
    'delta.columnMapping.mode' = 'name',
    'delta.minReaderVersion' = '2',
    'delta.minWriterVersion' = '7'
)
AS SELECT * FROM external.csv_demos.sales;

One CTAS, two table formats over one copy of the data. Iceberg-compatible engines read the metadata.json; Delta engines read the _delta_log.

The data is written once; only the metadata is doubled. Iceberg format versions 1, 2, and 3 are supported, and column mapping supplies the Iceberg field IDs. The reverse direction works too: DeltaForge reads native Iceberg tables in place, including time travel via snapshot selection, with the same external-table statement you used for CSV:

CREATE EXTERNAL TABLE IF NOT EXISTS external.iceberg_demos.inventory
USING ICEBERG
LOCATION 's3://acme-lake/demo/iceberg/inventory/';

Reading an existing Iceberg table requires no conversion at all; the engine resolves the metadata chain and manifests directly.

Scope and limits

A few honest boundaries, because they shape how you use the pattern.

External tables are read-only. They have no transactions, no time travel, and no schema evolution of their own; those properties arrive with the Delta table the CTAS creates. If you only ever need to query the CSVs, you can stop after statement one and never convert.
Inference reads the source at CREATE time. The LOCATION must exist and be readable when the statement runs, and sampling a large directory on slow storage adds latency to the CREATE. An explicit column list skips the sampling but not the file listing.
Ambiguous types deserve explicit casts. Inference is sample-based, so numeric-looking text columns can land with types you did not intend. Casting in the CTAS SELECT, as the walkthrough does, makes the Delta schema deliberate instead of accidental.
Format defaults to CSV. If you omit USING and the file extensions are ambiguous, the engine assumes CSV. Pointing that default at JSON or binary files misreads them, so state the format when extensions are unreliable.
The conversion is a snapshot, not a sync. CTAS captures the files as they exist at execution. New CSVs appearing later are visible through the external table immediately, but reach the Delta table only when you append or merge them, typically via a scheduled pipeline running the same SQL.

None of these limits push you back to a cluster. They are the same considerations a Spark or Python pipeline would face, expressed as properties of two SQL statements instead of a codebase.