Can you analyze FHIR data without a Spark cluster?

Yes, and for most provider organizations it is the more proportionate answer. The situation is familiar: you ran a bulk $export against your FHIR server, or your interoperability gateway drops one transaction Bundle per encounter into a landing folder. You now have files, and you want counts, trends, and joins, not a platform project.

The established open tooling all reaches for a cluster runtime. Pathling, from CSIRO, is a server and set of libraries for FHIR analytics built on Apache Spark. Google's FHIR Data Pipes converts FHIR to Parquet through Apache Beam pipelines designed for distributed runners. dbignite, from Databricks Labs, is a PySpark library that presumes a Databricks or Spark environment. These are sound architectures for national registries and research consortia processing billions of resources. But a mid-size hospital's daily export is gigabytes, not terabytes, and standing up cluster infrastructure (plus the operational surface that comes with it) to count glucose readings is the tail wagging the dog.

DeltaForge takes the other path: it is a commercial analytical engine you install on your own infrastructure, and it treats FHIR files as something SQL can read in place. You point an external table at the export directory, the engine discovers the schema and flattens the nested resources, and from then on it is PostgreSQL-flavored SQL. Nothing in this article requires a second machine.

How do you query FHIR bulk export NDJSON with SQL?

The bulk export format is NDJSON: one complete FHIR resource per line. The fhir-clinical-observations demo ships a 100-row export of heart rate Observations (LOINC 8867-4), each linking to a Patient through its subject reference. One statement makes the file queryable:

CREATE ZONE IF NOT EXISTS external TYPE EXTERNAL
    COMMENT 'External tables for file-backed demo data';

CREATE SCHEMA IF NOT EXISTS external.fhir_demos
    COMMENT 'FHIR R5 resources as JSON';

CREATE EXTERNAL TABLE IF NOT EXISTS external.fhir_demos.observations_bulk
USING JSON
LOCATION 's3://acme-lake/demo/fhir-observations'
OPTIONS (
    file_filter = '*.ndjson',
    json_flatten_config = '{
        "root_path": "$",
        "include_paths": [
            "$.resourceType",
            "$.id",
            "$.status",
            "$.code",
            "$.subject",
            "$.effectiveDateTime",
            "$.valueQuantity"
        ],
        "json_paths": ["$.code", "$.subject", "$.valueQuantity"],
        "column_mappings": {
            "$.id": "observation_id",
            "$.effectiveDateTime": "effective_date",
            "$.valueQuantity": "value_quantity"
        },
        "max_depth": 4,
        "separator": "_",
        "default_array_handling": "to_json",
        "infer_types": true
    }',
    file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);
An external table over a FHIR bulk export. The engine lists the directory, applies the filter, and parses each NDJSON line as one Observation; the files are never copied.

The interesting work happens in json_flatten_config. The include_paths list selects which FHIR elements to surface, json_paths keeps deeply structured elements like code (a CodeableConcept) and valueQuantity intact as JSON columns, and column_mappings renames camelCase FHIR names to the snake_case an analyst expects, so effectiveDateTime becomes effective_date. The file_metadata option adds df_file_name and df_row_number to every row, which is your provenance trail back to the source file: in a clinical pipeline that lineage matters for audit, not just debugging.

At the storage layer, nothing moved. CREATE EXTERNAL TABLE registers the location and configuration in the catalog and runs schema discovery; queries parse the JSON at scan time against whatever files currently match the filter. From here the export is just a table:

SELECT SUBSTRING(effective_date, 1, 10) AS observation_date,
       COUNT(*) AS readings
FROM external.fhir_demos.observations_bulk
WHERE effective_date IS NOT NULL
GROUP BY SUBSTRING(effective_date, 1, 10)
ORDER BY observation_date;
Temporal distribution of heart rate readings. Against the demo export this returns 25 distinct days, with 5 readings on 2024-01-01 and 9 on 2024-01-14.

A plain SELECT observation_id, status, subject, effective_date, value_quantity over the same table shows the flattened shape: subject holds the reference JSON ({"reference":"Patient/bulk-patient-41"}), and value_quantity holds the measurement with its unit. Grouping by subject already gives per-patient observation counts from a single file, no preprocessing step in sight.

A dedicated FHIR provider for transaction Bundles

Bulk exports are the easy case because the resources arrive pre-sorted by type. The harder, more common hospital pattern is the transaction Bundle: one JSON document per encounter containing a Patient, an Encounter, several Observations, and MedicationRequests, all nested inside Bundle.entry[]. Generic JSON flattening fights that shape. DeltaForge has a dedicated FHIR provider for it, exercised by the hospital-bundle-ingest demo:

CREATE SCHEMA IF NOT EXISTS external.fhir_bronze
    COMMENT 'One row per FHIR resource per Bundle';

CREATE EXTERNAL TABLE IF NOT EXISTS external.fhir_bronze.bundle_resources
USING fhir
LOCATION 's3://acme-lake/demo/fhir-bundles'
OPTIONS (
    unbundle = 'true',
    include_bundle_metadata = 'true',
    resource_types = 'Patient,Encounter,Observation,MedicationRequest',
    path_separator = '_',
    fhir_version = 'r4',
    materialized_paths = 'status,gender,birthDate,active,subject_reference,code_coding_code,code_coding_display,code_coding_system,effectiveDateTime,valueQuantity_value,valueQuantity_unit,authoredOn,intent,medicationCodeableConcept_coding_code,medicationCodeableConcept_coding_display,class_code,class_display,period_start,period_end',
    file_metadata = '{"columns":["df_file_name","df_row_number"]}'
);

DETECT SCHEMA FOR TABLE external.fhir_bronze.bundle_resources;
The USING fhir provider: unbundle explodes each Bundle.entry into its own row, resource_types filters the landing, and materialized_paths projects nested FHIR paths into flat columns.

Four options do the heavy lifting. unbundle turns each Bundle.entry into one row, so a 6-entry admission Bundle becomes 6 rows. include_bundle_metadata stamps every row with bundle_id, bundle_type, bundle_timestamp, and bundle_total, which is the audit answer to "which transmission contained this resource". materialized_paths reaches into nested structures, so Observation.code.coding[].code surfaces as the flat column code_coding_code (names are lowercased to match unquoted-identifier rules). And every row carries df_full_url, the Bundle.entry.fullUrl, which becomes important in a moment. The demo's four encounter Bundles land as exactly 24 rows:

SELECT resourcetype, COUNT(*) AS row_count
FROM external.fhir_bronze.bundle_resources
GROUP BY resourcetype
ORDER BY resourcetype;
Output shape: Encounter 4, MedicationRequest 5, Observation 11, Patient 4. The demo asserts these counts on every run.

How do you join Patient and Observation resources in SQL?

This is the question that decides whether the stack is real or a toy. Inside a transaction Bundle, an Observation points at its Patient through subject.reference, and by FHIR's resolution rules that reference value matches the Patient entry's fullUrl. Because the provider preserved fullUrl as df_full_url and materialized subject_reference as a column, the join is a self-join on the bronze table:

SELECT
    obs.id AS observation_id,
    obs.code_coding_display AS observation_label,
    pat.id AS patient_id,
    pat.gender AS patient_gender,
    obs.bundle_id
FROM external.fhir_bronze.bundle_resources obs
JOIN external.fhir_bronze.bundle_resources pat
  ON obs.subject_reference = pat.df_full_url
 AND pat.resourcetype = 'Patient'
WHERE obs.resourcetype = 'Observation'
ORDER BY obs.id;
In-Bundle reference resolution. All 11 Observations resolve to their Patient rows: obs-1001-glu joins to pat-1001, obs-1002-temp to pat-1002, and so on across all 4 Bundles.

No master patient index, no reference-resolution microservice, no Python glue. The same pattern extends to MedicationRequest.subject or Encounter.subject; anything that references by fullUrl joins the same way. Be clear about what this is and is not: it resolves references within the corpus you landed. Matching the same human across two source systems is probabilistic record linkage, a different problem this join does not claim to solve.

FHIR NDJSON to Delta Lake: promoting bronze to silver

External tables are the bronze layer: cheap, schema-on-read, always reflecting the files. For dashboards you want a typed, joined, stable table, and that is where Delta Lake comes in. The demo promotes the Observation-to-Patient join into a silver Delta table with proper types:

CREATE ZONE IF NOT EXISTS clinical_silver TYPE DELTA
    COMMENT 'Delta tables for the clinical silver layer';

CREATE SCHEMA IF NOT EXISTS clinical_silver.fhir_silver
    COMMENT 'Resource-typed silver tables built from the bronze unbundled view';

CREATE DELTA TABLE IF NOT EXISTS clinical_silver.fhir_silver.encounter_observations (
    bundle_id STRING,
    bundle_timestamp STRING,
    observation_full_url STRING,
    patient_full_url STRING,
    patient_id STRING,
    patient_gender STRING,
    loinc_code STRING,
    loinc_display STRING,
    observation_value DOUBLE,
    observation_unit STRING,
    effective_at STRING,
    source_file STRING
) LOCATION 's3://acme-lake/demo/silver/encounter_observations';

INSERT INTO clinical_silver.fhir_silver.encounter_observations
SELECT
    obs.bundle_id,
    obs.bundle_timestamp,
    obs.df_full_url AS observation_full_url,
    pat.df_full_url AS patient_full_url,
    pat.id AS patient_id,
    pat.gender AS patient_gender,
    obs.code_coding_code AS loinc_code,
    obs.code_coding_display AS loinc_display,
    CAST(obs.valuequantity_value AS DOUBLE) AS observation_value,
    obs.valuequantity_unit AS observation_unit,
    obs.effectivedatetime AS effective_at,
    obs.df_file_name AS source_file
FROM external.fhir_bronze.bundle_resources obs
JOIN external.fhir_bronze.bundle_resources pat
  ON obs.subject_reference = pat.df_full_url
 AND pat.resourcetype = 'Patient'
WHERE obs.resourcetype = 'Observation';
Bronze to silver in two statements. The CAST turns the materialized string value into a DOUBLE, so the silver table is numerically typed for aggregation.

At the storage layer this is a real Delta Lake table: Parquet data files plus a _delta_log transaction log at the table location, written atomically. The transaction log is the source of truth for the table's state, which means any Delta-capable reader sees a consistent snapshot, the table supports time travel and ACID appends for tomorrow's Bundles, and BI tools connect to it through DeltaForge's ODBC and ADBC drivers without touching the raw FHIR at all. That table, fed nightly from exports, is a working FHIR data warehouse in miniature: bronze external tables over the raw resources, silver Delta tables holding the curated clinical slice.

Run both demos end to end

Everything above comes from two demos in the DeltaForge demo library; the fhir-clinical-observations and hospital-bundle-ingest demos ship with their data files and assertion suites, so you can replay the whole flow at /demos. The sequence is short:

  1. Run each demo's setup script against your engine, pointing LOCATION at the directory holding the demo data (or your own export). The bulk table, the bronze Bundle table, and the silver Delta table are created exactly as shown above.

  2. Run the query suites. Each query carries ASSERT blocks that pin the expected results: 100 rows in the bulk table, 24 unbundled resources in bronze, 11 rows in silver, df_full_url populated on every bronze row. A representative check is the LOINC lookup:

    SELECT id, bundle_id, code_coding_code, code_coding_display,
           valuequantity_value, valuequantity_unit, effectivedatetime
    FROM external.fhir_bronze.bundle_resources
    WHERE resourcetype = 'Observation'
      AND code_coding_code = '2339-0'
    ORDER BY id;

    Three glucose readings come back, one each from bundles enc-1001, enc-1003, and enc-1004, with values 165, 110, and 96 mg/dL.

  3. Verify the silver promotion, which is the table a dashboard would actually read:

    SELECT bundle_id, patient_id, loinc_code, loinc_display,
           observation_value, observation_unit, effective_at
    FROM clinical_silver.fhir_silver.encounter_observations
    ORDER BY patient_id, loinc_code;

    Eleven rows, one per Observation, each carrying its patient, LOINC code, typed numeric value, and source file for lineage.

What is a lightweight alternative to Pathling or dbignite?

If your constraint is "no cluster", the comparison comes down to what each tool assumes about your runtime. None of this is a criticism of the Spark-based tools; it is a statement about fit.

Pathling
A FHIR analytics server and set of libraries built on Apache Spark, with first-class FHIRPath support. Documented deployment targets are Spark environments; strong fit for research-scale cohort work.
FHIR Data Pipes
Google Open Health Stack pipelines built on Apache Beam that transform FHIR into Parquet for downstream SQL. Designed around distributed pipeline runners and a separate query engine on the output.
dbignite
A Databricks Labs PySpark library for FHIR interoperability on the lakehouse. Assumes a Databricks or Spark runtime and a Python-first workflow.
DeltaForge
A commercial, customer-installed SQL engine. FHIR NDJSON and Bundles become external tables directly; transformation, joins, and Delta Lake writes happen in one process with PostgreSQL-flavored SQL. No JVM, no cluster, no separate pipeline tool.

Where this stack stops

Honesty about scope is part of the engineering. DeltaForge is the analytics layer in this picture, and only that. It is not a FHIR server: there is no REST endpoint, no $export operation, no resource validation against profiles; you bring files that something else exported. It does not implement the SQL on FHIR ViewDefinition format; the flattening you saw is configured through provider options rather than portable ViewDefinition resources, so if cross-engine view portability is a hard requirement, that spec is not what you are getting here. There is no terminology service: LOINC and SNOMED codes land as plain strings, and code-system lookups mean joining your own reference table, which is itself one CREATE EXTERNAL TABLE over a CSV away. And deeply repeated structures have a floor: at the demo's flatten depth, arrays like blood pressure's component[] land as JSON columns rather than fully exploded rows, so systolic and diastolic need a further extraction step. The demos show exactly where that boundary sits rather than hiding it.

Within that scope, the trade is straightforward: the Spark-based stacks buy horizontal scale you may never need at the cost of infrastructure you definitely have to run. A single engine that reads the export where it lies, joins resources with one SQL statement, and writes an auditable Delta table covers the analytics that small and mid-size provider organizations actually run, and it does so with a footprint one person can own.