Community Detection in SQL: Louvain on Delta Lake

Q: Can you run community detection in SQL?

Yes. DeltaForge runs Louvain, Leiden, and label propagation directly against Delta Lake tables. You define a named graph over an existing vertex table and edge table with CREATE GRAPH, then call the algorithm from a query: CALL algo.louvain returns one row per node with its community assignment, which you aggregate, join, and filter like any other result set. There is no export step and no separate graph runtime; the tables on object storage remain the single source of truth.

Q: How do you run the Louvain algorithm on warehouse tables?

Load your nodes and relationships into Delta tables (a CTAS from an external CSV table is enough), declare a graph with CREATE GRAPH naming the vertex table, edge table, ID, source, target, and weight columns, then run CALL algo.louvain with a resolution parameter. The engine builds a compressed sparse row structure from the edge table, runs modularity optimization, and yields node_id and community_id rows you can aggregate or join back to your dimension tables.

Q: How do you find fraud rings in transaction data with SQL?

Model accounts as vertices and transfers as edges, declare a directed graph over those two Delta tables, and run Louvain. Accounts that transact with each other far more than with the rest of the network land in the same community, which is exactly the shape of a fraud ring or collusion cluster. Small, dense communities with high internal edge weight are the review queue. Connected components is a useful first pass to isolate fully disconnected subnetworks before Louvain refines them.

Q: Do you need a graph database for community detection?

No. A dedicated graph database earns its place for deep transactional traversal workloads, but community detection is an analytical batch computation over an edge list, and your edge list already lives in warehouse tables. DeltaForge runs the same algorithm family (Louvain, Leiden, label propagation, connected components) directly on Delta Lake storage, so you skip the export pipeline, the second copy of the data, and the synchronization job that keeps the two systems honest.

Can you run community detection in SQL?

Yes, and without leaving your lakehouse. DeltaForge is a commercial, customer-installed analytics engine that speaks PostgreSQL-flavored SQL and embeds a graph runtime over the same Delta Lake tables it queries. You declare a graph once with CREATE GRAPH, pointing it at an existing vertex table and edge table. After that, community detection is a single statement: CALL algo.louvain yields one row per node with its community assignment, and you aggregate or join that output exactly as you would any other result set.

The usual alternatives all involve moving the data. Neo4j's Graph Data Science library documents Louvain and Leiden over in-memory graph projections inside a Neo4j database, which means an export pipeline from your warehouse and a second copy of the data to keep synchronized. NetworkX and igraph are excellent Python libraries, but they document single-machine, in-memory workflows that start with loading an edge list out of your storage into a script. Neither path is wrong; both add a system and a data hop between your tables and the answer. The point of running graph clustering on the lakehouse itself is that there is nothing to move.

Do you need a graph database for community detection?

No. A graph database earns its keep on transactional, deep-traversal workloads: thousands of concurrent point lookups walking five hops from a known node. Community detection is the opposite shape. It is an analytical batch computation that touches every edge, and its natural input is an edge list, which is to say a two-column table. If your relationships already live in warehouse tables (transactions, co-purchases, calls, follows, shared attributes), the graph database adds an ETL pipeline whose only job is to feed an algorithm your engine could have run in place.

DeltaForge keeps the table as the source of truth. The graph is a logical declaration over the tables, plus a cached topology structure the engine builds from them. Update the edge table with ordinary DML and the graph follows; there is no separate store that can drift.

How do you run the Louvain algorithm on warehouse tables?

The walkthrough below is the graph-polbooks demo, which ships in the DeltaForge demo library. It loads the classic political books co-purchasing network: 105 books sold on Amazon, 441 undirected co-purchase edges stored as 882 directed rows, and a known ground truth of three communities (liberal, neutral, conservative). That ground truth is the reason this dataset is the standard benchmark: you can check whether the algorithm recovers structure that is objectively there.

Stage the raw files and materialize typed Delta tables. The external tables read pipe-delimited CSV in place; the CREATE DELTA TABLE ... AS SELECT statements write properly typed Delta tables next to them:

CREATE ZONE IF NOT EXISTS external TYPE EXTERNAL
    COMMENT 'External and Delta tables for demo datasets';

CREATE SCHEMA IF NOT EXISTS external.polbooks_raw
    COMMENT 'Political Books external CSV staging tables';

CREATE SCHEMA IF NOT EXISTS external.political_books
    COMMENT 'Political Books Delta tables and co-purchasing graph';

CREATE EXTERNAL TABLE IF NOT EXISTS external.polbooks_raw.polbooks_edges
USING CSV LOCATION 's3://acme-lake/demo/graph-polbooks/edges.csv'
OPTIONS (header = 'true', delimiter = '|');

CREATE EXTERNAL TABLE IF NOT EXISTS external.polbooks_raw.polbooks_vertices
USING CSV LOCATION 's3://acme-lake/demo/graph-polbooks/vertices.csv'
OPTIONS (header = 'true', delimiter = '|');

CREATE DELTA TABLE IF NOT EXISTS external.political_books.edges
LOCATION 's3://acme-lake/demo/graph-polbooks/delta/edges'
AS SELECT
    CAST(src AS BIGINT) AS src,
    CAST(dst AS BIGINT) AS dst,
    CAST(weight AS DOUBLE) AS weight,
    CAST(edge_type AS VARCHAR) AS edge_type
FROM external.polbooks_raw.polbooks_edges;

CREATE DELTA TABLE IF NOT EXISTS external.political_books.vertices
LOCATION 's3://acme-lake/demo/graph-polbooks/delta/vertices'
AS SELECT
    CAST(vertex_id AS BIGINT) AS vertex_id,
    CAST(name AS VARCHAR) AS name,
    CAST(category AS VARCHAR) AS leaning
FROM external.polbooks_raw.polbooks_vertices;

From CSV on object storage to typed Delta tables. The schema is the standard graph shape: a vertex table keyed by ID and an edge table of (src, dst, weight) rows.

Optimize the physical layout and declare the graph. OPTIMIZE ... ZORDER BY co-locates rows on the hot columns so the topology build reads sequentially; CREATE GRAPH binds the two tables into a named graph; CREATE GRAPHCSR pre-builds and persists the topology cache:

OPTIMIZE external.political_books.vertices
    ZORDER BY (vertex_id, leaning);

OPTIMIZE external.political_books.edges
    ZORDER BY (src, dst);

CREATE GRAPH IF NOT EXISTS external.political_books.political_books
    VERTEX TABLE external.political_books.vertices ID COLUMN vertex_id NODE NAME COLUMN name NODE TYPE COLUMN leaning
    EDGE TABLE external.political_books.edges SOURCE COLUMN src TARGET COLUMN dst
    WEIGHT COLUMN weight
    EDGE TYPE COLUMN edge_type
    DIRECTED;

CREATE GRAPHCSR external.political_books.political_books;

The graph is metadata over the tables, not a copy of them. CREATE GRAPHCSR materializes a compressed sparse row file so the first algorithm call loads a prebuilt topology instead of rebuilding it from the Delta tables.

Verify the load before trusting any algorithm output. DeltaForge's ASSERT command makes the checks self-enforcing:

ASSERT VALUE row_count = 105
SELECT COUNT(*) AS row_count FROM external.political_books.vertices;

ASSERT VALUE row_count = 882
SELECT COUNT(*) AS row_count FROM external.political_books.edges;

ASSERT VALUE orphan_edges = 0
SELECT COUNT(*) AS orphan_edges
FROM external.political_books.edges e
WHERE NOT EXISTS (SELECT 1 FROM external.political_books.vertices v WHERE v.vertex_id = e.src)
   OR NOT EXISTS (SELECT 1 FROM external.political_books.vertices v WHERE v.vertex_id = e.dst);

Counts and referential integrity. An edge pointing at a missing vertex will distort every community boundary downstream, so check it here.

Run Louvain. The USE clause selects the graph; everything after it is the algorithm call and ordinary result shaping:
```
USE external.political_books.political_books
CALL algo.louvain({resolution: 1.0})
YIELD node_id, community_id
RETURN community_id, count(*) AS members
ORDER BY members DESC;
```
One statement from edge table to community sizes. The ground truth here is three communities, and Louvain at resolution 1.0 recovers a partition close to that split.

Reading the community assignments

The YIELD clause is the algorithm's output contract: algo.louvain produces a stream of (node_id, community_id) pairs, one per vertex. The query above aggregates them into community sizes, but the node-level view is just as direct:

USE external.political_books.political_books
CALL algo.louvain({resolution: 1.0})
YIELD node_id, community_id
RETURN node_id, community_id
ORDER BY community_id, node_id;

Per-node assignments, 105 rows. This is the table you join back to your dimension data to label each cluster.

Output shape: One row per vertex: node_id BIGINT, community_id BIGINT. Community IDs are opaque labels, not ordinals; only membership matters.
Determinism: Louvain uses random tie-breaking, so the exact community count varies between runs. On this dataset the partition lands close to the three ground-truth groups, and the demo asserts the invariant (at least two communities, all 105 nodes assigned) rather than a fixed count.
Tuning: The resolution parameter controls granularity. Values above 1.0 favor more, smaller communities; below 1.0 favors fewer, larger ones.

That non-determinism is worth being honest about because every Louvain implementation shares it; the algorithm optimizes modularity greedily and ties break by node ordering. If your downstream process needs run-to-run stability, persist one accepted partition as a table and treat re-running the algorithm as a deliberate refresh, not an idempotent view.

How do you find fraud rings in transaction data with SQL?

The polbooks network is co-purchasing data, and a fraud graph has exactly the same shape: entities as vertices, interactions as edges. Map accounts to a vertex table and transfers to an edge table, and the declaration is the same statement with your column names:

CREATE GRAPH IF NOT EXISTS external.fraud.transfers_graph
    VERTEX TABLE external.fraud.accounts ID COLUMN account_id NODE NAME COLUMN account_name NODE TYPE COLUMN account_type
    EDGE TABLE external.fraud.transfers SOURCE COLUMN sender_id TARGET COLUMN receiver_id
    WEIGHT COLUMN amount
    EDGE TYPE COLUMN transfer_type
    DIRECTED;

CREATE GRAPHCSR external.fraud.transfers_graph;

USE external.fraud.transfers_graph
CALL algo.louvain({resolution: 1.0})
YIELD node_id, community_id
RETURN community_id, count(*) AS members
ORDER BY members DESC;

The same pattern on a transfers table. Small, dense communities whose members transact mostly with each other are the candidates a fraud analyst reviews first.

Why this works: collusion clusters are, structurally, communities. A ring of accounts cycling funds among themselves has far more internal edges than edges to the rest of the network, and that is precisely the signal modularity optimization maximizes. The polbooks demo demonstrates the same mechanism on labeled data: purchasing behavior alone, with no political metadata in the edges, recovers the ideological groupings. Substitute transfers for co-purchases and the inference carries over.

A practical refinement: run algo.connectedComponents() first. It is deterministic and cheap, and it splits the network into fully disconnected subgraphs, which catches the crude rings outright and gives Louvain cleaner inputs for the subtler ones. The demo includes it as a connectivity check:

USE external.political_books.political_books
CALL algo.connectedComponents()
YIELD node_id, component_id
RETURN component_id, count(*) AS members
ORDER BY members DESC;

One component on polbooks, confirming the whole network is reachable. On a transfers graph, multiple components are themselves a finding.

Label propagation and Leiden on the same graph

Louvain is the default, not the only option. The second assigned demo, graph-karate-club (Zachary's karate club, 34 members, 78 friendship edges, the most studied graph in network science), exercises the full community-detection family against published reference values from Neo4j GDS 2.6.9 and NetworkX. Label propagation is the fast, simple alternative:

USE external.karate_club.karate_club
CALL algo.labelpropagation({maxIterations: 10})
YIELD node_id, community_id
RETURN community_id, count(*) AS members
ORDER BY members DESC;

Label propagation in SQL. Honesty note from the demo's own assertions: on the dense karate graph, LPA's local-majority rule collapses to a single community, and the GDS reference implementation does the same. LPA shines on larger, sparser graphs.

Leiden is the refinement of Louvain that guarantees well-connected communities and converges more stably. On the karate graph it consistently produces four communities at gamma 1.0, matching the Neo4j GDS reference across seeds:

USE external.karate_club.karate_club
CALL algo.leiden({gamma: 1.0, maxLevels: 10})
YIELD node_id, community_id, level
RETURN community_id, count(*) AS members
ORDER BY members DESC;

Leiden adds a level column for the hierarchy. Prefer it over Louvain when you need reproducible community counts.

Scoring a partition with modularity

Community detection without a quality score is guesswork, so the engine exposes modularity as a callable metric. Given any assignment of nodes to communities, it returns the modularity Q of that partition over the graph, which is deterministic. The karate demo scores Zachary's published two-faction split, which lands at roughly 0.179:

USE external.karate_club.karate_club
CALL algo.modularity({
    communityProperty: [
        0,0,0,0,0,0,0,0,0,1,
        0,0,0,0,1,1,0,0,1,0,
        1,0,1,1,1,1,1,1,1,1,
        1,1,1,1
    ]
})
YIELD modularity
RETURN modularity;

Modularity of the ground-truth split. Use this to compare candidate partitions: a Louvain run that scores well above an existing segmentation is evidence, not opinion.

What happens at the storage layer

Nothing about the graph leaves Delta Lake. The vertex and edge tables are ordinary Delta tables: Parquet data files plus a transaction log, readable by anything that reads Delta. CREATE GRAPH writes only catalog metadata binding the two tables and their column roles. The one derived artifact is the compressed sparse row cache that CREATE GRAPHCSR persists beside the tables: a binary adjacency structure the algorithms traverse, so a warmed graph loads its topology in about 200 ms instead of rebuilding from the edge table on first query. It is a cache in the strict sense; rerunning CREATE GRAPHCSR after a bulk edge load refreshes it from the tables, which remain the source of truth.

The OPTIMIZE ... ZORDER BY step in setup is not ceremony. Z-ordering the edge table on (src, dst) makes the CSR build a sequential scan, and z-ordering vertices on the ID plus the hot filter column tightens Parquet min/max statistics so selective lookups skip entire files. Graph workloads reward physical layout exactly the way analytical scans do, because under the hood they are analytical scans.

Limits worth knowing

Scope honesty, so you can decide with full information. Louvain is non-deterministic run to run; use Leiden or persist an accepted partition when stability matters. The CALL algo.* procedures operate on the full declared graph; if you want communities over a filtered subgraph, materialize the filtered edge set as its own table and declare a graph over it. And DeltaForge's graph layer is built for analytics over lakehouse tables, not for serving thousands of concurrent transactional traversals; if that is your workload, a dedicated graph database remains the right tool. For segmentation, fraud-ring triage, and network analysis over data you already store in Delta Lake, the SQL path removes an entire system from the architecture.

Both demos referenced here, graph-polbooks and graph-karate-club, ship in the DeltaForge demo library with their data files, setup scripts, and assertion suites, so you can reproduce every result in this article against the published reference values.