The triangle-counting kernel in EdgeExpand::tc previously used a vertex_array_t sized to |V(d0_nbr_label)| to mark first-hop matches. This bitset was reset (re-init or per-element clear) for every input vertex, which has two costs on large graphs:
-
Memory: O(|V(d0_nbr_label)|) bits resident per worker thread, even when the actual matched-set is sparse (typical LDBC SNB fanouts emit only a few hundred to a few thousand matches per source).
-
Cache: every membership test (d0_set[u]) and clear-loop touches bytes scattered across a multi-MB region, missing L1/L2 on virtually every probe and polluting the cache lines used by the CSR traversal in the same inner loop.
The triangle-counting kernel in EdgeExpand::tc previously used a vertex_array_t sized to |V(d0_nbr_label)| to mark first-hop matches. This bitset was reset (re-init or per-element clear) for every input vertex, which has two costs on large graphs:
Memory: O(|V(d0_nbr_label)|) bits resident per worker thread, even when the actual matched-set is sparse (typical LDBC SNB fanouts emit only a few hundred to a few thousand matches per source).
Cache: every membership test (d0_set[u]) and clear-loop touches bytes scattered across a multi-MB region, missing L1/L2 on virtually every probe and polluting the cache lines used by the CSR traversal in the same inner loop.