Intermittent hangs and crashes during parallel BDD operations — possible GC synchronization issues
Summary
I'm experiencing intermittent hangs and crashes (SIGSEGV/SIGBUS) when running parallel BDD operations under high contention on Sylvan v1.9.4 + Lace v1.5.4. The crashes appear non-deterministic and are more frequent when the node table approaches capacity, suggesting a relationship with GC triggering.
I've done a detailed code audit of the GC synchronization path and wanted to share several findings — some may explain the instability, others are precautionary observations.
Environment
- Sylvan v1.9.4, Lace v1.5.4
- Multi-core Linux (x86_64)
- Observed with both GCC and Clang
Analysis (by Claude Code ^_^)
1. cache_clear() uses munmap + mmap (not MAP_FIXED)
In sylvan_cache.c, cache_clear() calls cache_free() (which munmaps the old cache_table and cache_status) and then cache_create() (which mmaps new memory at a potentially different address):
```c
void cache_clear() {
    cache_free();                        // munmap — old pointers become dangling immediately
    cache_create(cache_size, cache_max); // mmap — new allocation, updates global pointers
}
```
This is more dangerous than the MAP_FIXED approach used in clear_aligned() (which at least preserves virtual addresses). After munmap, any stale reference to the old cache_table would cause SIGSEGV.
Furthermore, cache_table and cache_status are non-atomic static globals:

```c
static cache_entry_t* cache_table;
static uint32_t* cache_status;
```
After GC completes and workers resume BDD operations, the visibility of the new pointer values to other workers relies on implicit acquire semantics from the barrier spin-loop exit (while (wait == lace_bar.wait) {} reading an atomic_int). This works in practice but is fragile — there is no explicit atomic_store/atomic_load or fence guarding the cache_table pointer update.
2. lace_barrier() uses memory_order_relaxed throughout
All atomic operations inside lace_barrier() use memory_order_relaxed:
```c
void lace_barrier() {
    int wait = lace_bar.wait;
    if ((int)n_workers == 1 + atomic_fetch_add_explicit(&lace_bar.count, 1, memory_order_relaxed)) {
        atomic_store_explicit(&lace_bar.count, 0, memory_order_relaxed);
        atomic_store_explicit(&lace_bar.leaving, n_workers, memory_order_relaxed);
        atomic_store_explicit(&lace_bar.wait, 1 - wait, memory_order_relaxed);
    } else {
        while (wait == lace_bar.wait) {}
    }
    lace_bar.leaving -= 1;
}
```
The correctness of GC synchronization depends on the atomic_thread_fence(memory_order_seq_cst) inside lace_exec_in_new_frame(), which sits between Barrier 1 (in lace_yield) and Barrier 2 (in lace_exec_in_new_frame). This fence provides the actual memory ordering guarantee that all workers' prior stores (table/cache accesses) are visible before the GC task runs.
This design is correct but tightly coupled — the barrier itself provides no ordering, and the correctness relies entirely on the fence placement in lace_exec_in_new_frame. If any code path calls lace_barrier() without the subsequent seq_cst fence (e.g., in a future refactor), the GC safety guarantee silently breaks.
3. Potential hang: cooperative NEWFRAME relies on all workers reaching sylvan_gc_test()
The NEWFRAME mechanism is fully cooperative. Workers can only yield at sylvan_gc_test() / YIELD_NEWFRAME() checkpoints. If any worker is stuck in a long-running non-TASK C function (e.g., a pathological hash probe sequence in llmsset_lookup2 under extreme table load), the GC barrier cannot complete and all other workers spin indefinitely.
In sylvan_table.c, llmsset_lookup2() contains a linear probe loop:
```c
while (1) {
    // ... probe dbs->table[idx] ...
    if (idx >= dbs->table_size) idx -= dbs->table_size;
    // No YIELD_NEWFRAME checkpoint in this loop
}
```
When the table is near capacity (which is exactly when GC is triggered), probe sequences can become very long. Meanwhile, other workers that have already triggered GC are spinning at lace_barrier(), waiting for this worker. This creates a livelock-like situation: the table-full condition triggers GC, but GC can't start because a worker is still probing the nearly-full table.
This could explain the hangs I'm observing.
4. lace_bar.leaving reuse between consecutive barriers
Inside lace_exec_in_new_frame, two barriers bracket the GC task:

```c
lace_barrier();  // Barrier 2
root->f(...);    // GC task (MAP_FIXED, munmap, etc.)
lace_barrier();  // Barrier 3
```
After Barrier 2 releases, all workers execute lace_bar.leaving -= 1 asynchronously. When Barrier 3's last-arriving worker stores leaving = n_workers, there is no explicit fence ensuring that all workers have completed their leaving -= 1 from Barrier 2. The correctness relies on the assumption that the GC task (root->f(...)) takes long enough for all workers to complete the decrement — a timing assumption, not a formal guarantee.
Questions
- Has there been any testing under extreme table-fill conditions (>95% occupancy) with many workers? The hang scenario in point 3 seems most likely to reproduce under these conditions.
- Would it be feasible to add a YIELD_NEWFRAME checkpoint inside the probe loop of llmsset_lookup2, or would this break the lookup invariants?
- Should cache_table / cache_status be declared _Atomic to make the post-GC visibility guarantee explicit rather than relying on implicit barrier semantics?
Reproduction
The issue is non-deterministic. It reproduces more frequently with:
- High worker count (matching or exceeding physical cores), though the issue has also occurred with a single worker
- Small initial table sizes (forcing frequent GC)
- Workloads that create many unique nodes rapidly