The standard advice for graph analysis on lake data goes like this: export your Parquet to CSV or wire up a Spark connector, load it into Neo4j with LOAD CSV or neo4j-admin database import, model the schema again on the other side, and keep the two copies in sync forever. That workflow exists because dedicated graph databases own their storage format. The data has to move before the first MATCH can run.
This article shows the in-place alternative. Everything below is taken from two validated demos in the DeltaForge demo library, so you can run the whole flow yourself. DeltaForge is commercial, customer-installed software: the engine runs on your infrastructure, next to your lake, and speaks PostgreSQL-flavored SQL alongside openCypher.
Can you run Cypher queries without Neo4j?
Yes. Cypher outgrew its origins some time ago: the openCypher project published the language specification, and multiple engines now implement it independently. DeltaForge is one of them, with a specific twist that matters for lakehouse teams: it does not bring its own graph storage. The graph is a declared mapping over tables you already have, which means the question "how do I get my data into the graph database" simply disappears. There is no graph database to get data into.
Concretely, DeltaForge gives you three layers of graph capability, all reading the same Delta tables: openCypher pattern matching (MATCH, WHERE NOT negative patterns, multi-hop traversals, Cypher aggregation), a suite of graph algorithms callable as CALL algo.* (PageRank, Louvain, betweenness, connected components, shortest paths, and more, with CPU implementations cross-checked against NetworkX output), and full SQL interoperability through the cypher() table function. The last one is the part no dedicated graph database can offer, because a dedicated graph database does not also run your warehouse SQL.
How do you query parquet files as a graph?
The storage layer is the right place to start, because it explains why no import is needed. A Delta Lake table is a directory of plain Parquet data files plus a transaction log. When DeltaForge builds a graph over Delta tables, the bytes it reads are Parquet bytes; the graph definition itself stores nothing. It is a logical overlay that records which table holds vertices, which holds edges, and which columns carry the IDs.
The model is the one most teams already have: an entity table and a relationship table. The graph-sql-cypher-mix demo uses 40 enterprise customers and 96 directed referral edges between them:
CREATE ZONE IF NOT EXISTS external TYPE EXTERNAL
COMMENT 'External and Delta tables for demo datasets';
CREATE SCHEMA IF NOT EXISTS external.customer_network
COMMENT 'Customer referral network shared by SQL and Cypher';
CREATE DELTA TABLE IF NOT EXISTS external.customer_network.customers (
id BIGINT,
name STRING,
region STRING,
industry STRING,
tier STRING,
annual_contract INT
) LOCATION 's3://acme-lake/demo/sales/customers';
CREATE DELTA TABLE IF NOT EXISTS external.customer_network.referrals (
id BIGINT,
src BIGINT,
dst BIGINT,
weight DOUBLE,
referral_type STRING,
year_established INT
) LOCATION 's3://acme-lake/demo/sales/referrals';
INSERT INTO ... VALUES statements.One statement turns those two tables into a queryable property graph. Every column on the vertex table becomes a node property; every column on the edge table becomes a relationship property:
CREATE GRAPH IF NOT EXISTS external.customer_network.customer_network
VERTEX TABLE external.customer_network.customers ID COLUMN id NODE TYPE COLUMN region NODE NAME COLUMN name
EDGE TABLE external.customer_network.referrals SOURCE COLUMN src TARGET COLUMN dst
WEIGHT COLUMN weight
EDGE TYPE COLUMN referral_type
DIRECTED;
CREATE GRAPHCSR external.customer_network.customer_network;
CREATE GRAPH registers the mapping; CREATE GRAPHCSR pre-builds the adjacency structure so the first query does not pay the build cost.Here is what actually happens underneath. The first time a Cypher query touches the graph, the engine compiles the edge table into a Compressed Sparse Row (CSR) adjacency structure, the same representation high-performance graph libraries use, which gives constant-time neighbor lookups per vertex. The CSR is cached in memory and persisted to disk as a .dcsr sidecar file next to the edge table, version-stamped against the Delta log. On later sessions it loads from that sidecar in roughly 200 milliseconds instead of rebuilding from the tables. Your Parquet files are never rewritten and never copied into a proprietary store; the only artifact the graph layer adds is that cache file, and dropping the graph definition leaves the tables untouched.
Starting from raw Parquet files
If your edges currently sit as bare Parquet files with no Delta log, register them as an external table and copy them into a Delta table once. External Parquet tables are queryable but read-only, and a graph whose edges you intend to mutate should sit on Delta, so the copy buys you ACID writes, time travel, and DML on the graph:
CREATE EXTERNAL TABLE IF NOT EXISTS external.customer_network.raw_referrals
USING PARQUET
LOCATION 's3://acme-lake/exports/referrals/';
INSERT INTO external.customer_network.referrals
SELECT id, src, dst, weight, referral_type, year_established
FROM external.customer_network.raw_referrals;
The full flow, from tables to PageRank
The graph-sql-cypher-mix demo ships in the DeltaForge demo library and runs this entire sequence with assertion checks at every step, so the numbers below are verified, not illustrative. The flow has four moves.
-
Create and load the tables. The zone, schema, and the two Delta tables from the snippet above, seeded with 40 customers, 96 referral edges, plus 120 orders and 8 sales reps that stay purely relational. The demo also runs
OPTIMIZE ... ZORDER BY (src, dst)on the edge table so the CSR build reads sequentially. -
Define the graph and warm the cache.
CREATE GRAPHplusCREATE GRAPHCSR, exactly as shown above. -
Query it with pure Cypher. The
USEclause selects the graph; the pattern reads like any openCypher engine:USE external.customer_network.customer_network MATCH (a)-[r]->(b) RETURN r.id AS edge_id, a.name AS src_name, b.name AS dst_name, r.weight AS weight, r.referral_type AS referral_type ORDER BY r.id;This returns 96 rows, one per referral edge, with node names resolved from the vertex table and
weightandreferral_typepulled straight from the edge table's Parquet columns. The first row isBolt_IncreferringForge_Inc, a partner edge with weight 0.6. -
Run an algorithm inside SQL. This is where the architecture pays off. PageRank executes over the CSR, and its output joins the customers Delta table in a single statement:
SELECT c.id, c.name, c.region, c.industry, pr.score AS influence_score FROM cypher('external.customer_network.customer_network', $$ CALL algo.pageRank({dampingFactor: 0.85, iterations: 20}) YIELD node_id, score RETURN node_id AS customer_id, score $$) AS (customer_id BIGINT, score DOUBLE) pr JOIN external.customer_network.customers c ON pr.customer_id = c.id ORDER BY pr.score DESC;Forty rows come back: every customer with its region and industry from the relational side and its influence score from the graph side. The demo asserts the row count and that every score is positive; it deliberately does not assert exact score values, because PageRank magnitudes vary with floating-point iteration order.
Can Cypher and SQL run in the same engine?
The previous query already answered this, but the interoperability goes further than embedding Cypher in a FROM clause. Because algorithm output is just a table, you can persist it. The demo writes PageRank results into a Delta table with a plain INSERT:
CREATE DELTA TABLE IF NOT EXISTS external.customer_network.influence_scores (
customer_id BIGINT,
influence_score DOUBLE,
influence_rank BIGINT
) LOCATION 's3://acme-lake/demo/sales/influence_scores';
INSERT INTO external.customer_network.influence_scores
SELECT * FROM cypher('external.customer_network.customer_network', $$
CALL algo.pageRank({dampingFactor: 0.85, iterations: 20})
YIELD node_id, score, rank
RETURN node_id AS customer_id, score AS influence_score, rank AS influence_rank
$$) AS (customer_id BIGINT, influence_score DOUBLE, influence_rank BIGINT);
And because CTEs compose, a single query can hold a graph-powered CTE and a relational CTE side by side. This one computes degree centrality from the referral graph and total revenue from the orders table, then joins them per customer:
WITH hub_scores AS (
SELECT * FROM cypher('external.customer_network.customer_network', $$
CALL algo.degree()
YIELD node_id, total_degree
RETURN node_id AS customer_id, total_degree
$$) AS (customer_id BIGINT, total_degree BIGINT)
),
revenue AS (
SELECT customer_id, SUM(amount) AS total_revenue
FROM external.customer_network.orders
GROUP BY customer_id
)
SELECT c.id, c.name, c.region, h.total_degree, r.total_revenue
FROM hub_scores h
JOIN external.customer_network.customers c ON h.customer_id = c.id
JOIN revenue r ON c.id = r.customer_id
ORDER BY h.total_degree DESC, r.total_revenue DESC;
Doing the equivalent with a separate graph database means running PageRank there, exporting scores, loading them back into the warehouse, and scheduling that round trip. Here it is one query plan in one engine, and the result of the graph step is immediately joinable, immediately persistable, and immediately visible to anything that speaks SQL.
Do you have to import data into a graph database to use Cypher?
With a dedicated graph database, yes, and that is by design rather than a flaw: systems like Neo4j optimize a native store for transactional graph workloads, and their documented ingestion paths (LOAD CSV, neo4j-admin database import, the Spark and Kafka connectors) all populate that store before queries run. If your workload is an operational application built around the graph, that trade can be the right one.
For analytics on lake data, the trade inverts. The import pipeline becomes a permanent piece of infrastructure: a second copy of the data, a second security model, a second backup story, and a freshness lag equal to however often the sync runs. Every schema change on the lake side has to be re-modeled on the graph side. Teams routinely build and maintain all of this to run what amounts to a handful of centrality and community queries per week.
The in-place model removes the pipeline rather than optimizing it. Mutating the graph is mutating the tables: INSERT a row into the edge table and the relationship exists; UPDATE a vertex row and the node property changes; MERGE works too, since the backing store is Delta. After bulk write activity you refresh the topology cache with CREATE GRAPHCSR, and the graph reflects the tables again. Time travel comes along free: the graph at table version 12 is a real, reconstructible thing, because the graph is the tables.
Advanced Cypher, not just algorithms
A fair objection: maybe the lakehouse engine runs PageRank, but real Cypher work needs negative patterns, multi-hop traversals, and aggregation. The second source demo, graph-advanced-cypher, exists to test exactly that surface. It builds a university collaboration network (40 researchers, 170 directed edges across four relationship types) and exercises the patterns that separate a Cypher implementation from a shortest-path library:
USE external.research_network.research_network
MATCH (n)
WHERE NOT (n)-->()
RETURN n.name AS name, n.department AS department, n.rank AS rank
ORDER BY n.name;
USE external.research_network.research_network
MATCH (a)-[:co-author]->(b)-[:co-author]->(c)-[:co-author]->(a)
RETURN a.name AS researcher_a, b.name AS researcher_b, c.name AS researcher_c
ORDER BY a.name, b.name;
USE external.research_network.research_network
MATCH (n)
RETURN n.department AS dept,
count(n) AS num_researchers,
avg(n.h_index) AS avg_h,
min(n.h_index) AS min_h,
max(n.h_index) AS max_h
ORDER BY avg_h DESC;
Edge type filtering ([:co-author], [:advisor]) resolves against the EDGE TYPE COLUMN declared in the graph definition, so a single edge table can carry heterogeneous relationship types without splitting into one table per type. The same demo runs betweenness centrality, connected components, and Louvain over the network, each with asserted expectations.
Scope and limits
An honest accounting of where the edges are, because the in-place model has real constraints and you should know them before committing.
- Vertex IDs must be BIGINT
- The CSR build requires a BIGINT ID column on the vertex table. Other integer types fail at build time, and string keys need a surrogate-key step first.
- Orphan edges are dropped silently
- Edge rows whose source or target does not exist in the vertex table are excluded from the topology without error. Validate referential integrity with a SQL anti-join before defining the graph if your edge data is untrusted.
- CSR refresh is explicit by default
- The topology cache is a full rebuild, not incremental. After bulk DML on the backing tables, re-run
CREATE GRAPHCSRto refresh it; per-graph auto-refresh is available but opt-in, because rebuild amplification on large graphs is a real cost. - External Parquet is read-only
- You can query raw Parquet through an external table, but a graph you intend to mutate should sit on Delta tables, which support full DML. The one-time copy shown above is the boundary crossing.
- Some algorithm output is non-deterministic
- PageRank score magnitudes and Louvain partitions can vary across runs (floating-point ordering, modularity tie-breaking). The demos assert invariants such as row counts, score positivity, and rank bounds instead of exact values, and that is the right pattern for your own tests too.
- Commercial software
- DeltaForge is a commercial product you install on your own infrastructure. The demos and SQL in this article run on any licensed deployment.
If your workload is a high-write operational graph application with millisecond point lookups as the product, a dedicated graph database remains a reasonable architecture. If your graph questions are analytical, the data already lives in Parquet or Delta, and the import pipeline is the thing you resent maintaining, run the graph-sql-cypher-mix demo end to end and see whether you still need it.