Prime-mode scaling: zygote + startup / backward-fusion / DB / allocator latency wins (byte-identical) by wanweilin · Pull Request #4 · thibautbar/fire6

wanweilin · 2026-06-12T17:17:55Z

Prime-mode (FIRE6p, modular arithmetic) single-integral latency work. This is a self-contained superset of #3 — it carries #3's two commits (in-process FLINT backend fixes, the zygote fork-server, warm parse, #fthreads=1 default, the two deadlock fixes, and the FIRE6/scaling/ docs + fuel patch) and adds four more hot-path / startup wins below. If you prefer to take this one, #3 can be closed.

All of the additions are output-neutral: with identical inputs the patched binary produces reduction tables that are SHA-256-identical to the stock binary. Re-verified this round on grav2l (9D 2-loop), tri2l (7D 2-loop) and p3lLA (lbases-symmetric), in the sequential scored mode and with FIRE_BWD_FUSED=1.

What the two new commits add

1. Startup latency + backward-wave fusion

Remove the write-only needed_for[] array (common.h). It was a static set<sector_count_t>[MAX_SECTORS+1] in a header with a single insert and no reader, so every TU built and tore down its own MAX_SECTORS empty std::sets at process start and exit — a large slice of the fixed exec→main + main→exit cost on small reductions.
Opt-in startup/exit gates: FIRE_SKIP_VERSION (skip the git log popen), FIRE_FAST_EXIT (_exit() after tables/DBs are durable, reaping the zygote first so children-rusage stays intact), FIRE_SKIP_CLEAN (leave per-run tmpfs DBs for an external harness).
FIRE_BWD_FUSED=1 (opt-in): replaces the per-(level, half-wave) full barrier in backward substitution with per-dependency waits. A sector waits only for the specific lowers whose backward task is in flight; planned-but-undispatched lowers are same/future-wave and the unfused path reads their forward-final tables anyway, so the result is byte-identical by construction.
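
The per-dependency wait described above can be sketched as follows. This is a minimal illustration, not the PR's code: `BwdTracker`, the `Bwd` state names and `demo_fused_wait` are hypothetical, and only the value 2 for an in-flight backward task comes from the commit text. A sector snapshots which of its lowers are in flight and blocks on those alone, rather than on a barrier covering the whole wave.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical state names; the PR text only says an in-flight
// backward task is "state 2".
enum class Bwd { Planned = 0, InFlight = 2, Done = 3 };

class BwdTracker {
public:
    explicit BwdTracker(std::size_t n) : state_(n, Bwd::Planned) {}

    void mark(std::size_t sector, Bwd s) {
        {
            std::lock_guard<std::mutex> lk(m_);
            state_[sector] = s;
        }
        cv_.notify_all();
    }

    // Snapshot the lowers whose backward task is running right now.
    std::vector<std::size_t> in_flight(const std::vector<std::size_t>& lowers) {
        std::lock_guard<std::mutex> lk(m_);
        std::vector<std::size_t> out;
        for (std::size_t s : lowers)
            if (state_[s] == Bwd::InFlight) out.push_back(s);
        return out;
    }

    // Block until every sector in `deps` is Done; no wave-wide barrier.
    void wait_done(const std::vector<std::size_t>& deps) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] {
            for (std::size_t s : deps)
                if (state_[s] != Bwd::Done) return false;
            return true;
        });
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::vector<Bwd> state_;
};

// Lower 0 is in flight, lower 1 is only planned, lower 2 is finished:
// the caller waits on exactly one dependency.
std::size_t demo_fused_wait() {
    BwdTracker t(3);
    t.mark(0, Bwd::InFlight);
    t.mark(2, Bwd::Done);
    const std::vector<std::size_t> deps = t.in_flight({0, 1, 2});
    std::thread lower([&] { t.mark(0, Bwd::Done); });
    t.wait_done(deps);
    lower.join();
    return deps.size();
}
```

In this sketch the planned-only lower is skipped entirely, which mirrors the argument in the text that such lowers are read from their forward-final tables in the unfused path as well.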

2. DB slot count, LRU-rotation off, term-list allocator pool

CacheDB SLOTNUM 16 → 64 (kccachedb.h.in + .threadsafe.in): more slot tables cut intra-slot contention / rehash cost on the per-sector prime-mode DBs.
Disable LRU rotation on the in-memory sector DBs (switch_rotation(false)): capacity is unlimited (nothing is evicted), so the per-read splice to the LRU tail is pure overhead — and it makes every get() a write. Off → reads are read-only.
Term-list node allocator → FSBAllocator2 pool (equation.h ALLOCATOR2 + FSBAllocator.hh): the std::list<pair<point,COEFF>> behind add_to/descent/split was on the default std::allocator (a malloc per node) while the analogous map ALLOCATOR1 already used the pool. std::list needs an equality-comparable allocator (added operator==/!=), and the pool statics are made thread_local (per-thread free list — no lock/atomic; each FLAME worker is single-threaded, so it is a pure malloc→pool speedup).
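
To make the allocator change concrete, here is a minimal sketch of a pooled node allocator for `std::list` with the two properties the text mentions: allocator equality operators and a `thread_local` free list. It is not FSBAllocator2. `PoolAlloc`, `FreeList`, `g_pool_hits` and the `Term` stand-in for `pair<point,COEFF>` are all hypothetical, and over-aligned types are ignored.

```cpp
#include <cstddef>
#include <list>
#include <new>
#include <utility>

// Allocations served from a free list instead of the heap (per thread).
thread_local std::size_t g_pool_hits = 0;

template <std::size_t BlockSize>
struct FreeList {
    struct Node { Node* next; };
    Node* head = nullptr;

    void* take() {
        if (head) {
            Node* n = head;
            head = n->next;
            ++g_pool_hits;
            return n;
        }
        return ::operator new(BlockSize);
    }
    void give(void* p) { head = new (p) Node{head}; }
    ~FreeList() {
        while (head) {
            Node* n = head;
            head = n->next;
            ::operator delete(n);
        }
    }
};

template <class T>
struct PoolAlloc {
    using value_type = T;
    static constexpr std::size_t kBlock =
        sizeof(T) < sizeof(void*) ? sizeof(void*) : sizeof(T);

    PoolAlloc() noexcept = default;
    template <class U>
    PoolAlloc(const PoolAlloc<U>&) noexcept {}

    // One free list per thread and block size: no lock, no atomic.
    static FreeList<kBlock>& pool() {
        thread_local FreeList<kBlock> p;
        return p;
    }
    T* allocate(std::size_t n) {
        if (n == 1) return static_cast<T*>(pool().take());
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t n) noexcept {
        if (n == 1) pool().give(p);
        else ::operator delete(p);
    }
};

// The allocator is stateless, so any two instances compare equal.
template <class T, class U>
bool operator==(const PoolAlloc<T>&, const PoolAlloc<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const PoolAlloc<T>&, const PoolAlloc<U>&) noexcept { return false; }

// Fill, clear, refill: the second fill should reuse the freed nodes.
std::size_t demo_pool_reuse() {
    using Term = std::pair<int, long>;  // stand-in for pair<point, COEFF>
    g_pool_hits = 0;
    std::list<Term, PoolAlloc<Term>> terms;
    for (int i = 0; i < 4; ++i) terms.emplace_back(i, i * 10L);
    terms.clear();
    for (int i = 0; i < 4; ++i) terms.emplace_back(i, i * 10L);
    return g_pool_hits;
}
```

Because the free list is per-thread, a node must be released on the thread that allocated it. That matches the single-threaded-worker assumption stated in the PR, but it would not hold for containers handed between threads.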

Scope / honesty

Targets prime mode on Linux; non-prime binaries build and run but weren't the optimization target.
The FIRE_* gates and FIRE_BWD_FUSED are opt-in (default off); SLOTNUM, rotation-off and the allocator pool are always on and output-neutral.
Deliberately not included here: an experimental intra-sector elimination thread-pool and a zero-copy get() path — they're entangled with profiling/slot-lock scaffolding and aren't a safe default win at benchmark sizes, so they're left for a separate, cleanly-isolated change.

🤖 Generated with Claude Code

… deadlock fixes, profiling

End-to-end effect together with the companion fuel patch (#calc flint) and a tmpfs #database: heaviest benchmark (9-propagator 2-loop, ~120k-step Laporta) goes 8.98 s -> 0.91 s at 1 core and 4.93 s -> 0.38 s at 8 cores vs stock 6.5.2 (fermat backend), with byte-identical reduction tables. Details: documentation/scaling/ (next commit).

- prime-mode default #fthreads=1 (#ifdef PRIME): the evaluator pool is unused in prime mode (fuel_time == 0); defaulting it to #threads forks N idle fer64 processes per invocation and dominates short reductions. An explicit #fthreads is still honored.
- zygote fork-server (functions.cpp): fork a single-threaded zygote at main entry; per-sector workers fork from it and run flame_main() in-process (standalone --thread -1), skipping execv, ~8 ms of C++ static init, and config re-parse. SOCK_SEQPACKET + SCM_RIGHTS done-pipe protocol; explicit reaping keeps children-max-RSS accounting truthful. Revert at runtime: FIRE_NO_ZYGOTE=1.
- warm zygote: the zygote pre-parses the sector-independent IBP templates once; workers skip parse_config entirely and load only their own sector's #lbases (loading other sectors' rule bases CHANGES the reduction - see TECHNICAL.md). Revert: FIRE_NO_WARM_ZYGOTE=1.
- two latent deadlock fixes (independently useful on stock FIRE): (1) child-side fthreads /= threads can yield 0 evaluator workers and hang standalone children; (2) lost-wakeup race in f-worker teardown - f_stop was written without holding f_submit_mutex; fix = atomic flag + empty-critical-section fence before notify_all (3 sites).
- FIRE_PROFILE=1 instrumentation: per-sector phase timings (sort/apply/fwd/split/bksub, point-table get/add), sector dependency edges, parent barrier/serial breakdown.
- opt-in in-memory sector table (FIRE_MEMTABLE=1): measured NEUTRAL (kyotocabinet CacheDB is already in-memory); kept for experiments.
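
The second deadlock fix (atomic flag plus an empty critical section before `notify_all`) follows a standard pattern, sketched below. This is an illustration under assumptions, not the code at the three patched sites: `FWorkerPool`, `worker_loop` and `shutdown` are hypothetical names, with `submit_mutex` and `stop` playing the roles of `f_submit_mutex` and `f_stop`.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

struct FWorkerPool {
    std::mutex submit_mutex;        // role of f_submit_mutex
    std::condition_variable cv;
    std::atomic<bool> stop{false};  // role of f_stop

    // Worker side: the predicate is checked while holding submit_mutex.
    void worker_loop() {
        std::unique_lock<std::mutex> lk(submit_mutex);
        cv.wait(lk, [&] { return stop.load(); });
    }

    // Teardown side. Without the fence, a worker could test `stop`
    // (false), get preempted before blocking, and miss the notify.
    void shutdown() {
        stop.store(true);
        {
            // Empty critical section: once we get the mutex, every
            // worker has either not yet tested the predicate (it will
            // see stop == true) or is already blocked in wait().
            std::lock_guard<std::mutex> fence(submit_mutex);
        }
        cv.notify_all();
    }
};

// Returns true once the worker has observed the stop flag and exited.
bool demo_shutdown() {
    FWorkerPool pool;
    std::thread worker([&] { pool.worker_loop(); });
    pool.shutdown();
    worker.join();
    return pool.stop.load();
}
```

Holding the mutex across the store would also close the race. The fence variant keeps the flag write itself lock-free, which is presumably why the commit chose it.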
FLAME's entry logic moved from thread.cpp to flame_main() in functions.cpp so zygote workers can run it in-process; thread.cpp keeps a thin wrapper.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
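
The core idea behind running the worker entry point in-process can be sketched with plain `fork()` and `waitpid()`. This is a deliberately simplified illustration: the parent forks directly instead of going through a separate zygote process, and the SOCK_SEQPACKET + SCM_RIGHTS protocol is omitted. `warm_parse`, `worker_main`, `run_sector_in_fork` and `g_template_count` are hypothetical stand-ins. The point is that a forked child inherits already-parsed state and calls a function rather than `execv`-ing a fresh binary.

```cpp
#include <cerrno>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for state parsed once before any fork.
static int g_template_count = 0;

void warm_parse() { g_template_count = 42; }

// Hypothetical stand-in for the in-process worker entry point.
int worker_main(int sector) { return g_template_count + sector; }

// Fork a worker, run it in-process, reap it explicitly.
// Returns the child's exit code, or -1 on failure.
int run_sector_in_fork(int sector) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        // Child: parsed state is inherited copy-on-write; no execv,
        // no static re-initialisation, no config re-parse.
        _exit(worker_main(sector));
    }
    int status = 0;
    pid_t r;
    do {
        r = waitpid(pid, &status, 0);  // explicit reaping
    } while (r < 0 && errno == EINTR);
    if (r != pid || !WIFEXITED(status)) return -1;
    return WEXITSTATUS(status);
}
```

For example, after `warm_parse()`, `run_sector_in_fork(3)` reports exit code 45, which shows the child saw the parent's parsed state without parsing anything itself.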

- FIRE6/scaling/{README,TECHNICAL,USAGE}.md: mechanisms, measurements, negative results, build & config guide
- FIRE6/scaling/reports/: self-contained HTML benchmark reports
- FIRE6/extra/fuel-flint-prime-fixes.patch: required companion fix for the fuel submodule (#calc flint correctness in prime mode); apply with git -C FIRE6/extra/fuel apply ../fuel-flint-prime-fixes.patch before 'make dep'

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Startup/structure wins on top of the Day-1 zygote work, all output-neutral (reduction tables SHA-256-identical, re-verified on grav2l / tri2l / p3lLA):

- Remove the write-only `needed_for[]` header array. It was a `static set<sector_count_t>[MAX_SECTORS+1]` in common.h with a single insert and no reader anywhere, so every translation unit constructed and destructed its own ~MAX_SECTORS empty std::sets at process startup AND exit -- a large share of the fixed exec->main + main->exit cost on small reductions.
- Startup / exit gates (env, opt-in, default off): FIRE_SKIP_VERSION skips the `git log` popen; FIRE_FAST_EXIT _exit()s once the tables and databases are durable (reaping the zygote first so the children-rusage accounting stays intact); FIRE_SKIP_CLEAN leaves the per-run tmpfs databases for an external harness to remove. zygote_shutdown() is exposed (functions.h) for FAST_EXIT.
- Backward-substitution wave fusion (FIRE_BWD_FUSED=1, opt-in): replaces the per-(level, half-wave) full barrier with per-dependency waits -- a sector's point transfer waits only for the specific lower sectors whose backward task is in flight (state 2). Planned-but-undispatched lowers are same-/future-wave and the unfused path reads their forward-final tables anyway, so the fused result is byte-identical by construction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
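
The cost of a `static` array defined in a header can be illustrated in a single file by emulating two translation units with two namespaces. This is a sketch under assumptions: `kMaxSectors`, `CountingSet`, `tu_a` and `tu_b` are hypothetical, and the real `MAX_SECTORS` value is not given in the PR text. Each "TU" gets its own private copy of the array, and every element is constructed before `main` and destroyed at exit even though nothing reads it.

```cpp
#include <cstddef>
#include <set>

constexpr std::size_t kMaxSectors = 1000;  // hypothetical value

// Counter kept in a function-local static so it is ready before any
// namespace-scope object below is constructed.
std::size_t& sets_constructed() {
    static std::size_t n = 0;
    return n;
}

// A std::set wrapper that records each construction.
struct CountingSet {
    std::set<int> s;
    CountingSet() { ++sets_constructed(); }
};

// In a real header, `static` at namespace scope gives every including
// translation unit its own copy. Two namespaces stand in for two TUs.
namespace tu_a { static CountingSet needed_for[kMaxSectors + 1]; }
namespace tu_b { static CountingSet needed_for[kMaxSectors + 1]; }
```

By the time `main` starts, `sets_constructed()` equals `2 * (kMaxSectors + 1)` in this sketch. With more translation units including the header the total grows linearly, which is the fixed startup and exit overhead the commit removes.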

Three reduction hot-path wins, all output-neutral (tables SHA-256-identical):

- CacheDB SLOTNUM 16 -> 64 (kccachedb.h.in + kccachedb.h.threadsafe.in): with the per-sector prime-mode databases, more slot tables cut intra-slot contention and rehash cost on the descent-path get()s.
- Disable LRU rotation in the in-memory sector databases (common.cpp, switch_rotation(false)): capacity is unlimited here (nothing is ever evicted), so the per-read record splice to the LRU tail is pure overhead -- and it turns every get() into a write. With rotation off a get() is purely read-only.
- Term-list node allocator -> FSBAllocator2 pool (equation.h ALLOCATOR2 + FSBAllocator.hh): the std::list<pair<point,COEFF>> behind add_to / descent / split was using the default std::allocator (a malloc per node), while the analogous map type ALLOCATOR1 already used the pooled FSBAllocator2. Two small header fixes: std::list (unlike std::map) needs an equality-comparable allocator (added operator==/!=), and the pool statics are made thread_local so the free list is per-thread -- no lock, no atomic on the hot path. Each FLAME worker is single-threaded, so this is a pure malloc->pool speedup.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
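
Why LRU rotation turns every read into a write can be seen in a toy cache. The class below is not kyotocabinet's CacheDB; `ToyCache` and `count_splices` are hypothetical, and only the method name `switch_rotation` is borrowed from the call mentioned in the commit. With rotation on, each `get()` splices the record to the tail of the recency list and so mutates shared structure. With rotation off, the lookup touches nothing.

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

class ToyCache {
    using Record = std::pair<int, std::string>;
    std::list<Record> lru_;  // front = least recent, back = most recent
    std::unordered_map<int, std::list<Record>::iterator> index_;
    bool rotate_ = true;
    std::size_t splices_ = 0;

public:
    void switch_rotation(bool on) { rotate_ = on; }

    void set(int key, std::string value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = std::move(value);
            return;
        }
        lru_.emplace_back(key, std::move(value));
        index_[key] = std::prev(lru_.end());
    }

    const std::string* get(int key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        if (rotate_) {
            // Move the record to the tail: a structural write per read.
            lru_.splice(lru_.end(), lru_, it->second);
            ++splices_;
        }
        return &it->second->second;
    }

    std::size_t splices() const { return splices_; }
};

// Count how many list mutations a run of reads causes.
std::size_t count_splices(bool rotation, int reads) {
    ToyCache c;
    c.switch_rotation(rotation);
    c.set(1, "a");
    c.set(2, "b");
    for (int i = 0; i < reads; ++i) c.get(1 + i % 2);
    return c.splices();
}
```

Here ten reads cause ten list mutations with rotation on and none with it off. When nothing is ever evicted, the recency order is never consulted, so those mutations buy nothing.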

wanweilin and others added 4 commits June 10, 2026 13:37

wanweilin wants to merge 4 commits into thibautbar:master from wanweilin:claude/prime-scaling-perf
