Prime-mode scaling: zygote + startup / backward-fusion / DB / allocator latency wins (byte-identical) #4
Open
wanweilin wants to merge 4 commits into
… deadlock fixes, profiling

End-to-end effect together with the companion fuel patch (#calc flint) and a tmpfs #database: the heaviest benchmark (9-propagator 2-loop, ~120k-step Laporta) goes 8.98 s -> 0.91 s at 1 core and 4.93 s -> 0.38 s at 8 cores vs stock 6.5.2 (fermat backend), with byte-identical reduction tables. Details: documentation/scaling/ (next commit).

- prime-mode default #fthreads=1 (#ifdef PRIME): the evaluator pool is unused in prime mode (fuel_time == 0); defaulting it to #threads forks N idle fer64 processes per invocation, which dominates short reductions. An explicit #fthreads is still honored.
- zygote fork-server (functions.cpp): fork a single-threaded zygote at main entry; per-sector workers fork from it and run flame_main() in-process (standalone --thread -1), skipping execv, ~8 ms of C++ static init, and config re-parse. SOCK_SEQPACKET + SCM_RIGHTS done-pipe protocol; explicit reaping keeps children-max-RSS accounting truthful. Revert at runtime: FIRE_NO_ZYGOTE=1.
- warm zygote: the zygote pre-parses the sector-independent IBP templates once; workers skip parse_config entirely and load only their own sector's #lbases (loading other sectors' rule bases CHANGES the reduction, see TECHNICAL.md). Revert: FIRE_NO_WARM_ZYGOTE=1.
- two latent deadlock fixes (independently useful on stock FIRE):
  (1) child-side fthreads /= threads can yield 0 evaluator workers and hang standalone children;
  (2) lost-wakeup race in f-worker teardown: f_stop was written without holding f_submit_mutex; fix = atomic flag + empty-critical-section fence before notify_all (3 sites).
- FIRE_PROFILE=1 instrumentation: per-sector phase timings (sort/apply/fwd/split/bksub, point-table get/add), sector dependency edges, parent barrier/serial breakdown.
- opt-in in-memory sector table (FIRE_MEMTABLE=1): measured NEUTRAL (kyotocabinet CacheDB is already in-memory); kept for experiments.
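The lost-wakeup fix in item (2) can be illustrated with a minimal sketch. This is not the FIRE code: `WorkerPool`, `f_cv`, `exited` and `run_teardown` are hypothetical names introduced here, and only `f_stop` and `f_submit_mutex` come from the commit message. The point it shows is why taking and releasing the mutex between the flag store and `notify_all` closes the window in which a worker has tested the flag but not yet gone to sleep.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the f-worker pool; only f_stop and
// f_submit_mutex are names taken from the commit message.
struct WorkerPool {
    std::mutex f_submit_mutex;
    std::condition_variable f_cv;
    std::atomic<bool> f_stop{false};
    std::atomic<int> exited{0};

    void worker() {
        std::unique_lock<std::mutex> lock(f_submit_mutex);
        // The predicate is always evaluated with f_submit_mutex held.
        f_cv.wait(lock, [this] { return f_stop.load(); });
        ++exited;
    }

    void stop() {
        f_stop.store(true);
        // Empty critical section used as a fence: any worker that already
        // tested f_stop == false still holds the mutex until it is parked
        // inside wait(), so once we get the lock that worker is guaranteed
        // to receive the notification below.
        { std::lock_guard<std::mutex> fence(f_submit_mutex); }
        f_cv.notify_all();
    }
};

// Starts n_workers waiting threads, tears them down, and returns how many
// of them observed the stop request and exited.
int run_teardown(int n_workers) {
    assert(n_workers >= 0);
    WorkerPool pool;
    std::vector<std::thread> threads;
    for (int i = 0; i < n_workers; ++i)
        threads.emplace_back(&WorkerPool::worker, &pool);
    pool.stop();
    for (auto& t : threads) t.join();
    return pool.exited.load();
}
```

Without the fence, a worker could read `f_stop == false`, be preempted before blocking, miss the `notify_all`, and then sleep forever. Calling `run_teardown(4)` should return 4 in every interleaving.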

FLAME's entry logic moved from thread.cpp to flame_main() in functions.cpp so zygote workers can run it in-process; thread.cpp keeps a thin wrapper.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
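To make the fork-server idea concrete, here is a heavily simplified sketch, assuming Linux (AF_UNIX `SOCK_SEQPACKET`). It is not the protocol in functions.cpp: there is no `SCM_RIGHTS` done-pipe passing, workers are run one at a time, and `worker_main`, `Zygote`, `zygote_start`, `zygote_run` and `zygote_stop` are hypothetical names. It only shows the core shape: fork one helper early, then ask it to fork workers that run the entry function in-process instead of going through execv.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cassert>

// Hypothetical worker body standing in for flame_main(); the low 8 bits of
// the sector id are used as the exit code so the caller can check them.
static int worker_main(int sector) { return sector & 0xff; }

// Zygote loop: receive a sector id, fork a worker that runs in-process,
// reap it explicitly, and reply with its exit code.
static void zygote_loop(int fd) {
    int sector = 0;
    while (recv(fd, &sector, sizeof sector, 0) ==
           static_cast<ssize_t>(sizeof sector)) {
        int code = -1;
        pid_t pid = fork();
        if (pid == 0) _exit(worker_main(sector));   // worker: no execv
        int status = 0;
        if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
            code = WEXITSTATUS(status);
        send(fd, &code, sizeof code, 0);
    }
    _exit(0);  // peer closed the socket: shut down
}

struct Zygote { pid_t pid; int fd; };

// Fork the zygote while the parent is still small and single-threaded.
Zygote zygote_start() {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) != 0) return {-1, -1};
    pid_t pid = fork();
    if (pid == 0) { close(sv[0]); zygote_loop(sv[1]); }
    close(sv[1]);
    if (pid < 0) { close(sv[0]); return {-1, -1}; }
    return {pid, sv[0]};
}

// Ask the zygote to run one worker; returns its exit code or -1 on error.
int zygote_run(const Zygote& z, int sector) {
    assert(z.fd >= 0);
    int code = -1;
    if (send(z.fd, &sector, sizeof sector, 0) !=
        static_cast<ssize_t>(sizeof sector)) return -1;
    if (recv(z.fd, &code, sizeof code, 0) !=
        static_cast<ssize_t>(sizeof code)) return -1;
    return code;
}

// Close the channel and reap the zygote so child accounting stays accurate.
void zygote_stop(Zygote& z) {
    close(z.fd);
    int status = 0;
    waitpid(z.pid, &status, 0);
    z.fd = -1;
}
```

A caller would do `Zygote z = zygote_start();`, issue `zygote_run(z, sector)` per sector, and finish with `zygote_stop(z)`. The real implementation described above lets workers run concurrently and reports completion through passed file descriptors, which this sketch omits for brevity.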
- FIRE6/scaling/{README,TECHNICAL,USAGE}.md: mechanisms, measurements,
negative results, build & config guide
- FIRE6/scaling/reports/: self-contained HTML benchmark reports
- FIRE6/extra/fuel-flint-prime-fixes.patch: required companion fix for
the fuel submodule (#calc flint correctness in prime mode); apply with
git -C FIRE6/extra/fuel apply ../fuel-flint-prime-fixes.patch
before 'make dep'
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Startup/structure wins on top of the Day-1 zygote work, all output-neutral (reduction tables SHA-256-identical, re-verified on grav2l / tri2l / p3lLA):

- Remove the write-only `needed_for[]` header array. It was a `static set<sector_count_t>[MAX_SECTORS+1]` in common.h with a single insert and no reader anywhere, so every translation unit constructed and destructed its own ~MAX_SECTORS empty std::sets at process startup AND exit, a large share of the fixed exec->main + main->exit cost on small reductions.
- Startup / exit gates (env, opt-in, default off): FIRE_SKIP_VERSION skips the `git log` popen; FIRE_FAST_EXIT _exit()s once the tables and databases are durable (reaping the zygote first so the children-rusage accounting stays intact); FIRE_SKIP_CLEAN leaves the per-run tmpfs databases for an external harness to remove. zygote_shutdown() is exposed (functions.h) for FAST_EXIT.
- Backward-substitution wave fusion (FIRE_BWD_FUSED=1, opt-in): replaces the per-(level, half-wave) full barrier with per-dependency waits. A sector's point transfer waits only for the specific lower sectors whose backward task is in flight (state 2). Planned-but-undispatched lowers are same-/future-wave and the unfused path reads their forward-final tables anyway, so the fused result is byte-identical by construction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
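The difference between a full barrier and per-dependency waits can be sketched as follows. This is a toy model, not FIRE's scheduler: `fused_backward` and its `lowers` adjacency list are hypothetical, every listed lower is treated as already in flight, and the "work" is just taking a timestamp. Each task blocks only on the completion of the sectors it actually depends on, rather than on everything in the previous wave.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical model: sector s depends on the sectors listed in lowers[s],
// all of which must have smaller indices. Returns, for each sector, a
// logical timestamp taken when its (simulated) backward task finished.
std::vector<int> fused_backward(const std::vector<std::vector<int>>& lowers) {
    const std::size_t n = lowers.size();
    std::vector<std::promise<void>> done(n);
    std::vector<std::shared_future<void>> ready;
    for (auto& p : done) ready.push_back(p.get_future().share());

    std::atomic<int> clock{0};
    std::vector<int> finished_at(n, -1);
    std::vector<std::future<void>> tasks;
    for (std::size_t s = 0; s < n; ++s) {
        tasks.push_back(std::async(std::launch::async, [&, s] {
            for (int low : lowers[s]) {
                // Dependencies must point at earlier sectors, else deadlock.
                assert(low >= 0 && static_cast<std::size_t>(low) < s);
                ready[low].wait();          // wait for this one lower only
            }
            finished_at[s] = clock.fetch_add(1);   // stand-in for the work
            done[s].set_value();            // release our own dependents
        }));
    }
    for (auto& t : tasks) t.get();          // join before locals go away
    return finished_at;
}
```

With `lowers = {{}, {}, {0}, {0, 1}, {2}}`, sector 2 can start as soon as sector 0 is done even if sector 1 is still running, which a wave-wide barrier would not allow. The only ordering guaranteed is that each sector's timestamp is larger than those of its own lowers.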
Three reduction hot-path wins, all output-neutral (tables SHA-256-identical):

- CacheDB SLOTNUM 16 -> 64 (kccachedb.h.in + kccachedb.h.threadsafe.in): with the per-sector prime-mode databases, more slot tables cut intra-slot contention and rehash cost on the descent-path get()s.
- Disable LRU rotation in the in-memory sector databases (common.cpp, switch_rotation(false)): capacity is unlimited here (nothing is ever evicted), so the per-read record splice to the LRU tail is pure overhead, and it turns every get() into a write. With rotation off a get() is purely read-only.
- Term-list node allocator -> FSBAllocator2 pool (equation.h ALLOCATOR2 + FSBAllocator.hh): the std::list<pair<point,COEFF>> behind add_to / descent / split was using the default std::allocator (a malloc per node), while the analogous map type ALLOCATOR1 already used the pooled FSBAllocator2. Two small header fixes: std::list (unlike std::map) needs an equality-comparable allocator (added operator==/!=), and the pool statics are made thread_local so the free list is per-thread: no lock, no atomic on the hot path. Each FLAME worker is single-threaded, so this is a pure malloc->pool speedup.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
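The two allocator header fixes can be illustrated with a minimal pooled allocator. This is not FSBAllocator2: `PoolAllocator` and `pooled_list_demo` are hypothetical, the pool is a bare intrusive free list that never returns memory to the system, and over-aligned types are not handled. It shows the two properties the commit message names: a `thread_local` free list (no lock on the hot path) and `operator==`/`operator!=` so `std::list` can use it.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <new>
#include <utility>

// Hypothetical minimal pool allocator; single-object allocations are
// recycled through a per-thread free list.
template <typename T>
struct PoolAllocator {
    using value_type = T;
    struct Node { Node* next; };

    PoolAllocator() noexcept = default;
    template <typename U>
    PoolAllocator(const PoolAllocator<U>&) noexcept {}

    // Each slot must be able to hold either a T or a free-list link.
    static constexpr std::size_t slot_size() {
        return sizeof(T) < sizeof(Node) ? sizeof(Node) : sizeof(T);
    }

    T* allocate(std::size_t n) {
        if (n == 1 && free_list) {              // fast path: reuse a slot
            Node* node = free_list;
            free_list = node->next;
            return reinterpret_cast<T*>(node);
        }
        return static_cast<T*>(::operator new(n * slot_size()));
    }

    void deallocate(T* p, std::size_t n) noexcept {
        assert(p != nullptr);
        if (n == 1) {                           // push the slot back
            Node* node = reinterpret_cast<Node*>(p);
            node->next = free_list;
            free_list = node;
        } else {
            ::operator delete(p);
        }
    }

    // One free list per thread and per T: no lock, no atomic.
    static thread_local Node* free_list;
};

template <typename T>
thread_local typename PoolAllocator<T>::Node* PoolAllocator<T>::free_list = nullptr;

// The allocator is stateless, so any two instances are interchangeable.
template <typename T, typename U>
bool operator==(const PoolAllocator<T>&, const PoolAllocator<U>&) noexcept { return true; }
template <typename T, typename U>
bool operator!=(const PoolAllocator<T>&, const PoolAllocator<U>&) noexcept { return false; }

// Small demo: a term list of (point, coefficient) pairs using the pool.
std::size_t pooled_list_demo() {
    using Term = std::pair<int, long>;
    std::list<Term, PoolAllocator<Term>> terms;
    terms.emplace_back(1, 2L);
    terms.emplace_back(3, 4L);
    terms.pop_front();            // node goes to the free list
    terms.emplace_back(5, 6L);    // node is reused from the free list
    return terms.size();
}
```

Because the free list is `thread_local`, memory freed on one thread must not be handed to another thread's list. That matches the single-threaded-worker assumption stated in the commit message, and would need rethinking if nodes ever crossed threads.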
Prime-mode (`FIRE6p`, modular arithmetic) single-integral latency work. This is a self-contained superset of #3: it carries #3's two commits (in-process FLINT backend fixes, the zygote fork-server, warm parse, `#fthreads=1` default, the two deadlock fixes, and the `FIRE6/scaling/` docs + fuel patch) and adds four more hot-path / startup wins below. If you prefer to take this one, #3 can be closed.

All of the additions are output-neutral: with identical inputs the patched binary produces reduction tables that are SHA-256-identical to those of the stock binary. Re-verified this round on `grav2l` (9D 2-loop), `tri2l` (7D 2-loop) and `p3lLA` (lbases-symmetric), in the sequential scored mode and with `FIRE_BWD_FUSED=1`.

What the two new commits add

1. Startup latency + backward-wave fusion

- Remove the write-only `needed_for[]` array (common.h). It was a `static set<sector_count_t>[MAX_SECTORS+1]` in a header with a single insert and no reader, so every TU built and tore down its own MAX_SECTORS empty `std::set`s at process start and exit, a large slice of the fixed `exec→main` + `main→exit` cost on small reductions.
- Startup / exit gates (opt-in, default off): `FIRE_SKIP_VERSION` (skip the `git log` popen), `FIRE_FAST_EXIT` (`_exit()` after tables/DBs are durable, reaping the zygote first so children-rusage stays intact), `FIRE_SKIP_CLEAN` (leave per-run tmpfs DBs for an external harness).
- `FIRE_BWD_FUSED=1` (opt-in): replaces the per-(level, half-wave) full barrier in backward substitution with per-dependency waits. A sector waits only for the specific lowers whose backward task is in flight; planned-but-undispatched lowers are same/future-wave and the unfused path reads their forward-final tables anyway, so the result is byte-identical by construction.

2. DB slot count, LRU-rotation off, term-list allocator pool

- `CacheDB` `SLOTNUM` 16 → 64 (`kccachedb.h.in` + `.threadsafe.in`): more slot tables cut intra-slot contention / rehash cost on the per-sector prime-mode DBs.
- LRU rotation disabled in the in-memory sector databases (common.cpp, `switch_rotation(false)`): capacity is unlimited (nothing is evicted), so the per-read splice to the LRU tail is pure overhead, and it makes every `get()` a write. Off → reads are read-only.
- Term-list node allocator → `FSBAllocator2` pool (`equation.h` `ALLOCATOR2` + `FSBAllocator.hh`): the `std::list<pair<point,COEFF>>` behind `add_to` / `descent` / `split` was on the default `std::allocator` (a `malloc` per node) while the analogous map `ALLOCATOR1` already used the pool. `std::list` needs an equality-comparable allocator (added `operator==`/`!=`), and the pool statics are made `thread_local` (per-thread free list, no lock/atomic; each FLAME worker is single-threaded, so it is a pure `malloc→pool` speedup).

Scope / honesty

- The `FIRE_*` gates and `FIRE_BWD_FUSED` are opt-in (default off); `SLOTNUM`, rotation-off and the allocator pool are always on and output-neutral.
- … `get()` path: they're entangled with profiling/slot-lock scaffolding and aren't a safe default win at benchmark sizes, so they're left for a separate, cleanly-isolated change.

🤖 Generated with Claude Code