Intermittent hangs and crashes during parallel BDD operations — possible GC synchronization issues
Summary
I'm experiencing intermittent hangs and crashes (SIGSEGV/SIGBUS) when running parallel BDD operations under high contention on Sylvan v1.9.4 + Lace v1.5.4. The crashes appear non-deterministic and are more frequent when the node table approaches capacity, suggesting a relationship with GC triggering.
I've done a detailed code audit of the GC synchronization path and wanted to share several findings — some may explain the instability, others are precautionary observations.
Environment
- Sylvan v1.9.4, Lace v1.5.4
- Multi-core Linux (x86_64)
- Observed with both GCC and Clang
Analysis (by Claude Code ^_^)
1. cache_clear() uses munmap + mmap (not MAP_FIXED)
In sylvan_cache.c, cache_clear() calls cache_free() (which munmaps the old cache_table and cache_status) and then cache_create() (which mmaps new memory at a potentially different address):
```c
void cache_clear() {
    cache_free();                        // munmap — old pointers become dangling immediately
    cache_create(cache_size, cache_max); // mmap — new allocation, updates global pointers
}
```
This is more dangerous than the MAP_FIXED approach used in clear_aligned() (which at least preserves virtual addresses). After munmap, any stale reference to the old cache_table would cause SIGSEGV.
Furthermore, cache_table and cache_status are non-atomic static globals:

```c
static cache_entry_t* cache_table;
static uint32_t* cache_status;
```
After GC completes and workers resume BDD operations, the visibility of the new pointer values to other workers relies on implicit acquire semantics from the barrier spin-loop exit (while (wait == lace_bar.wait) {} reading an atomic_int). This works in practice but is fragile — there is no explicit atomic_store/atomic_load or fence guarding the cache_table pointer update.
2. lace_barrier() uses memory_order_relaxed throughout
All atomic operations inside lace_barrier() use memory_order_relaxed:
```c
void lace_barrier() {
    int wait = lace_bar.wait;
    if ((int)n_workers == 1 + atomic_fetch_add_explicit(&lace_bar.count, 1, memory_order_relaxed)) {
        atomic_store_explicit(&lace_bar.count, 0, memory_order_relaxed);
        atomic_store_explicit(&lace_bar.leaving, n_workers, memory_order_relaxed);
        atomic_store_explicit(&lace_bar.wait, 1 - wait, memory_order_relaxed);
    } else {
        while (wait == lace_bar.wait) {}
    }
    lace_bar.leaving -= 1;
}
```
The correctness of GC synchronization depends on the atomic_thread_fence(memory_order_seq_cst) inside lace_exec_in_new_frame(), which sits between Barrier 1 (in lace_yield) and Barrier 2 (in lace_exec_in_new_frame). This fence provides the actual memory ordering guarantee that all workers' prior stores (table/cache accesses) are visible before the GC task runs.
This design is correct but tightly coupled — the barrier itself provides no ordering, and the correctness relies entirely on the fence placement in lace_exec_in_new_frame. If any code path calls lace_barrier() without the subsequent seq_cst fence (e.g., in a future refactor), the GC safety guarantee silently breaks.
3. Potential hang: cooperative NEWFRAME relies on all workers reaching sylvan_gc_test()
The NEWFRAME mechanism is fully cooperative. Workers can only yield at sylvan_gc_test() / YIELD_NEWFRAME() checkpoints. If any worker is stuck in a long-running non-TASK C function (e.g., a pathological hash probe sequence in llmsset_lookup2 under extreme table load), the GC barrier cannot complete and all other workers spin indefinitely.
In sylvan_table.c, llmsset_lookup2() contains a linear probe loop:
```c
while (1) {
    // ... probe dbs->table[idx] ...
    if (idx >= dbs->table_size) idx -= dbs->table_size;
    // No YIELD_NEWFRAME checkpoint in this loop
}
```
When the table is near capacity (which is exactly when GC is triggered), probe sequences can become very long. Meanwhile, other workers that have already triggered GC are spinning at lace_barrier(), waiting for this worker. This creates a livelock-like situation: the table-full condition triggers GC, but GC can't start because a worker is still probing the nearly-full table.
This could explain the hangs I'm observing.
4. lace_bar.leaving reuse between consecutive barriers
Inside lace_exec_in_new_frame, two barriers bracket the GC task:

```c
lace_barrier();  // Barrier 2
root->f(...);    // GC task (MAP_FIXED, munmap, etc.)
lace_barrier();  // Barrier 3
```
After Barrier 2 releases, all workers execute lace_bar.leaving -= 1 asynchronously. When Barrier 3's last-arriving worker stores leaving = n_workers, there is no explicit fence ensuring that all workers have completed their leaving -= 1 from Barrier 2. The correctness relies on the assumption that the GC task (root->f(...)) takes long enough for all workers to complete the decrement — a timing assumption, not a formal guarantee.
Questions
- Has there been any testing under extreme table-fill conditions (>95% occupancy) with many workers? The hang scenario in point 3 seems most likely to reproduce under these conditions.
- Would it be feasible to add a YIELD_NEWFRAME checkpoint inside the probe loop of llmsset_lookup2, or would this break the lookup invariants?
- Should cache_table / cache_status be declared _Atomic to make the post-GC visibility guarantee explicit rather than relying on implicit barrier semantics?
Reproduction
The issue is non-deterministic. It reproduces more frequently with:
- High worker count (matching or exceeding physical cores), though the issue has also occurred with a single worker
- Small initial table sizes (forcing frequent GC)
- Workloads that create many unique nodes rapidly