Can you run community detection in SQL?
Yes, and without leaving your lakehouse. DeltaForge is a commercial, customer-installed analytics engine that speaks PostgreSQL-flavored SQL and embeds a graph runtime over the same Delta Lake tables it queries. You declare a graph once with CREATE GRAPH, pointing it at an existing vertex table and edge table. After that, community detection is a single statement: CALL algo.louvain yields one row per node with its community assignment, and you aggregate or join that output exactly as you would any other result set.
The usual alternatives all involve moving the data. Neo4j's Graph Data Science library documents Louvain and Leiden over in-memory graph projections inside a Neo4j database, which means an export pipeline from your warehouse and a second copy of the data to keep synchronized. NetworkX and igraph are excellent Python libraries, but they document single-machine, in-memory workflows that start with loading an edge list out of your storage into a script. Neither path is wrong; both add a system and a data hop between your tables and the answer. The point of running graph clustering on the lakehouse itself is that there is nothing to move.
Do you need a graph database for community detection?
No. A graph database earns its keep on transactional, deep-traversal workloads: thousands of concurrent point lookups walking five hops from a known node. Community detection is the opposite shape. It is an analytical batch computation that touches every edge, and its natural input is an edge list, which is to say a two-column table. If your relationships already live in warehouse tables (transactions, co-purchases, calls, follows, shared attributes), the graph database adds an ETL pipeline whose only job is to feed an algorithm your engine could have run in place.
DeltaForge keeps the table as the source of truth. The graph is a logical declaration over the tables, plus a cached topology structure the engine builds from them. Update the edge table with ordinary DML and the graph follows; there is no separate store that can drift.
How do you run the Louvain algorithm on warehouse tables?
The walkthrough below is the graph-polbooks demo, which ships in the DeltaForge demo library. It loads the classic political books co-purchasing network: 105 books sold on Amazon, 441 undirected co-purchase edges stored as 882 directed rows, and a known ground truth of three communities (liberal, neutral, conservative). That ground truth is the reason this dataset is the standard benchmark: you can check whether the algorithm recovers structure that is objectively there.
-
Stage the raw files and materialize typed Delta tables. The external tables read pipe-delimited CSV in place; the
CREATE DELTA TABLE ... AS SELECTstatements write properly typed Delta tables next to them:CREATE ZONE IF NOT EXISTS external TYPE EXTERNAL COMMENT 'External and Delta tables for demo datasets'; CREATE SCHEMA IF NOT EXISTS external.polbooks_raw COMMENT 'Political Books external CSV staging tables'; CREATE SCHEMA IF NOT EXISTS external.political_books COMMENT 'Political Books Delta tables and co-purchasing graph'; CREATE EXTERNAL TABLE IF NOT EXISTS external.polbooks_raw.polbooks_edges USING CSV LOCATION 's3://acme-lake/demo/graph-polbooks/edges.csv' OPTIONS (header = 'true', delimiter = '|'); CREATE EXTERNAL TABLE IF NOT EXISTS external.polbooks_raw.polbooks_vertices USING CSV LOCATION 's3://acme-lake/demo/graph-polbooks/vertices.csv' OPTIONS (header = 'true', delimiter = '|'); CREATE DELTA TABLE IF NOT EXISTS external.political_books.edges LOCATION 's3://acme-lake/demo/graph-polbooks/delta/edges' AS SELECT CAST(src AS BIGINT) AS src, CAST(dst AS BIGINT) AS dst, CAST(weight AS DOUBLE) AS weight, CAST(edge_type AS VARCHAR) AS edge_type FROM external.polbooks_raw.polbooks_edges; CREATE DELTA TABLE IF NOT EXISTS external.political_books.vertices LOCATION 's3://acme-lake/demo/graph-polbooks/delta/vertices' AS SELECT CAST(vertex_id AS BIGINT) AS vertex_id, CAST(name AS VARCHAR) AS name, CAST(category AS VARCHAR) AS leaning FROM external.polbooks_raw.polbooks_vertices;From CSV on object storage to typed Delta tables. The schema is the standard graph shape: a vertex table keyed by ID and an edge table of (src, dst, weight) rows. -
Optimize the physical layout and declare the graph.
OPTIMIZE ... ZORDER BYco-locates rows on the hot columns so the topology build reads sequentially;CREATE GRAPHbinds the two tables into a named graph;CREATE GRAPHCSRpre-builds and persists the topology cache:OPTIMIZE external.political_books.vertices ZORDER BY (vertex_id, leaning); OPTIMIZE external.political_books.edges ZORDER BY (src, dst); CREATE GRAPH IF NOT EXISTS external.political_books.political_books VERTEX TABLE external.political_books.vertices ID COLUMN vertex_id NODE NAME COLUMN name NODE TYPE COLUMN leaning EDGE TABLE external.political_books.edges SOURCE COLUMN src TARGET COLUMN dst WEIGHT COLUMN weight EDGE TYPE COLUMN edge_type DIRECTED; CREATE GRAPHCSR external.political_books.political_books;The graph is metadata over the tables, not a copy of them. CREATE GRAPHCSR materializes a compressed sparse row file so the first algorithm call loads a prebuilt topology instead of rebuilding it from the Delta tables. -
Verify the load before trusting any algorithm output. DeltaForge's
ASSERTcommand makes the checks self-enforcing:ASSERT VALUE row_count = 105 SELECT COUNT(*) AS row_count FROM external.political_books.vertices; ASSERT VALUE row_count = 882 SELECT COUNT(*) AS row_count FROM external.political_books.edges; ASSERT VALUE orphan_edges = 0 SELECT COUNT(*) AS orphan_edges FROM external.political_books.edges e WHERE NOT EXISTS (SELECT 1 FROM external.political_books.vertices v WHERE v.vertex_id = e.src) OR NOT EXISTS (SELECT 1 FROM external.political_books.vertices v WHERE v.vertex_id = e.dst);Counts and referential integrity. An edge pointing at a missing vertex will distort every community boundary downstream, so check it here. -
Run Louvain. The
USEclause selects the graph; everything after it is the algorithm call and ordinary result shaping:USE external.political_books.political_books CALL algo.louvain({resolution: 1.0}) YIELD node_id, community_id RETURN community_id, count(*) AS members ORDER BY members DESC;One statement from edge table to community sizes. The ground truth here is three communities, and Louvain at resolution 1.0 recovers a partition close to that split.
Reading the community assignments
The YIELD clause is the algorithm's output contract: algo.louvain produces a stream of (node_id, community_id) pairs, one per vertex. The query above aggregates them into community sizes, but the node-level view is just as direct:
USE external.political_books.political_books
CALL algo.louvain({resolution: 1.0})
YIELD node_id, community_id
RETURN node_id, community_id
ORDER BY community_id, node_id;
- Output shape
- One row per vertex:
node_id BIGINT,community_id BIGINT. Community IDs are opaque labels, not ordinals; only membership matters. - Determinism
- Louvain uses random tie-breaking, so the exact community count varies between runs. On this dataset the partition lands close to the three ground-truth groups, and the demo asserts the invariant (at least two communities, all 105 nodes assigned) rather than a fixed count.
- Tuning
- The
resolutionparameter controls granularity. Values above 1.0 favor more, smaller communities; below 1.0 favors fewer, larger ones.
That non-determinism is worth being honest about because every Louvain implementation shares it; the algorithm optimizes modularity greedily and ties break by node ordering. If your downstream process needs run-to-run stability, persist one accepted partition as a table and treat re-running the algorithm as a deliberate refresh, not an idempotent view.
How do you find fraud rings in transaction data with SQL?
The polbooks network is co-purchasing data, and a fraud graph has exactly the same shape: entities as vertices, interactions as edges. Map accounts to a vertex table and transfers to an edge table, and the declaration is the same statement with your column names:
CREATE GRAPH IF NOT EXISTS external.fraud.transfers_graph
VERTEX TABLE external.fraud.accounts ID COLUMN account_id NODE NAME COLUMN account_name NODE TYPE COLUMN account_type
EDGE TABLE external.fraud.transfers SOURCE COLUMN sender_id TARGET COLUMN receiver_id
WEIGHT COLUMN amount
EDGE TYPE COLUMN transfer_type
DIRECTED;
CREATE GRAPHCSR external.fraud.transfers_graph;
USE external.fraud.transfers_graph
CALL algo.louvain({resolution: 1.0})
YIELD node_id, community_id
RETURN community_id, count(*) AS members
ORDER BY members DESC;
Why this works: collusion clusters are, structurally, communities. A ring of accounts cycling funds among themselves has far more internal edges than edges to the rest of the network, and that is precisely the signal modularity optimization maximizes. The polbooks demo demonstrates the same mechanism on labeled data: purchasing behavior alone, with no political metadata in the edges, recovers the ideological groupings. Substitute transfers for co-purchases and the inference carries over.
A practical refinement: run algo.connectedComponents() first. It is deterministic and cheap, and it splits the network into fully disconnected subgraphs, which catches the crude rings outright and gives Louvain cleaner inputs for the subtler ones. The demo includes it as a connectivity check:
USE external.political_books.political_books
CALL algo.connectedComponents()
YIELD node_id, component_id
RETURN component_id, count(*) AS members
ORDER BY members DESC;
Label propagation and Leiden on the same graph
Louvain is the default, not the only option. The second assigned demo, graph-karate-club (Zachary's karate club, 34 members, 78 friendship edges, the most studied graph in network science), exercises the full community-detection family against published reference values from Neo4j GDS 2.6.9 and NetworkX. Label propagation is the fast, simple alternative:
USE external.karate_club.karate_club
CALL algo.labelpropagation({maxIterations: 10})
YIELD node_id, community_id
RETURN community_id, count(*) AS members
ORDER BY members DESC;
Leiden is the refinement of Louvain that guarantees well-connected communities and converges more stably. On the karate graph it consistently produces four communities at gamma 1.0, matching the Neo4j GDS reference across seeds:
USE external.karate_club.karate_club
CALL algo.leiden({gamma: 1.0, maxLevels: 10})
YIELD node_id, community_id, level
RETURN community_id, count(*) AS members
ORDER BY members DESC;
Scoring a partition with modularity
Community detection without a quality score is guesswork, so the engine exposes modularity as a callable metric. Given any assignment of nodes to communities, it returns the modularity Q of that partition over the graph, which is deterministic. The karate demo scores Zachary's published two-faction split, which lands at roughly 0.179:
USE external.karate_club.karate_club
CALL algo.modularity({
communityProperty: [
0,0,0,0,0,0,0,0,0,1,
0,0,0,0,1,1,0,0,1,0,
1,0,1,1,1,1,1,1,1,1,
1,1,1,1
]
})
YIELD modularity
RETURN modularity;
What happens at the storage layer
Nothing about the graph leaves Delta Lake. The vertex and edge tables are ordinary Delta tables: Parquet data files plus a transaction log, readable by anything that reads Delta. CREATE GRAPH writes only catalog metadata binding the two tables and their column roles. The one derived artifact is the compressed sparse row cache that CREATE GRAPHCSR persists beside the tables: a binary adjacency structure the algorithms traverse, so a warmed graph loads its topology in about 200 ms instead of rebuilding from the edge table on first query. It is a cache in the strict sense; rerunning CREATE GRAPHCSR after a bulk edge load refreshes it from the tables, which remain the source of truth.
The OPTIMIZE ... ZORDER BY step in setup is not ceremony. Z-ordering the edge table on (src, dst) makes the CSR build a sequential scan, and z-ordering vertices on the ID plus the hot filter column tightens Parquet min/max statistics so selective lookups skip entire files. Graph workloads reward physical layout exactly the way analytical scans do, because under the hood they are analytical scans.
Limits worth knowing
Scope honesty, so you can decide with full information. Louvain is non-deterministic run to run; use Leiden or persist an accepted partition when stability matters. The CALL algo.* procedures operate on the full declared graph; if you want communities over a filtered subgraph, materialize the filtered edge set as its own table and declare a graph over it. And DeltaForge's graph layer is built for analytics over lakehouse tables, not for serving thousands of concurrent transactional traversals; if that is your workload, a dedicated graph database remains the right tool. For segmentation, fraud-ring triage, and network analysis over data you already store in Delta Lake, the SQL path removes an entire system from the architecture.
Both demos referenced here, graph-polbooks and graph-karate-club, ship in the DeltaForge demo library with their data files, setup scripts, and assertion suites, so you can reproduce every result in this article against the published reference values.