Prime-mode scaling: FLINT backend fixes + zygote fork-server (~10-16x single-integral latency, byte-identical tables) by wanweilin · Pull Request #3 · thibautbar/fire6

wanweilin · 2026-06-10T13:41:22Z

Performance work on prime mode (FIRE6p), measured against stock 6.5.2 (this repo's master) on a 13-topology suite (1- and 2-loop families, 2–9 propagators) on a dual-socket Xeon Platinum 8468. Reduction tables are byte-identical (SHA-256) to stock on every benchmark, including with every runtime switch toggled.

metric (heaviest case: 9-propagator 2-loop, ~120k-step Laporta)	stock	this PR	factor
single integral, 1 core	8.98 s	0.91 s	9.9×
single integral, 8 cores	4.93 s	0.38 s	13.0×
self-speedup 1→8 cores	1.8× (regresses beyond 8 cores)	2.4× (81–96 % of the 2.95× sector-width ceiling)	n/a
concurrent independent integrals, 96 cores	~1.8 /s (fer64 fork storm)	53.8 /s	~29×
same-topology batch (2000 targets, one run)	163 /s	1754 /s	10.8×
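For reference, the factor column is stock ÷ this PR for the latency rows and this PR ÷ stock for the throughput rows. For the heaviest case the arithmetic works out as follows, and the 81 % in the self-speedup row is the measured 2.4× divided by the 2.95× sector-width ceiling:

```latex
\begin{aligned}
\text{1 core:}\quad & 8.98\,\mathrm{s} \,/\, 0.91\,\mathrm{s} \approx 9.9\times \\
\text{8 cores:}\quad & 4.93\,\mathrm{s} \,/\, 0.38\,\mathrm{s} \approx 13.0\times \\
\text{self-speedup, stock:}\quad & 8.98 \,/\, 4.93 \approx 1.8\times \\
\text{self-speedup, this PR:}\quad & 0.91 \,/\, 0.38 \approx 2.4\times, \qquad 2.4 \,/\, 2.95 \approx 0.81 \\
\text{batch throughput:}\quad & 1754 \,/\, 163 \approx 10.8\times
\end{aligned}
```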

Full curves and plots: FIRE6/scaling/reports/ (self-contained HTML; download and open locally). Mechanisms and measurements: FIRE6/scaling/TECHNICAL.md. Operational guide: FIRE6/scaling/USAGE.md.

How to read the diff

Commit 1 = all source changes; commit 2 = docs + the fuel patch file + reports.

file	what changed
`sources/functions.cpp`	the bulk: zygote fork-server (one clean single-threaded process forked at startup; sector workers fork from it and run FLAME in-process, which eliminates the per-sector `execv`, ~8 ms of C++ static init, and the config re-parse), `flame_main()` moved here from `thread.cpp`, `build_flame_args()` (single source of the FLAME argv), `worker_fixup()` (per-sector fixup for warm-zygote workers), `FIRE_PROFILE` phase counters, opt-in mem-table hooks
`sources/thread.cpp`	FLAME `main` reduced to a thin wrapper around `flame_main()` (−84 lines)
`sources/parser.cpp`	prime-mode default `#fthreads = 1` (`#ifdef PRIME`; an explicit value is honored), fix for the child-side deadlock when `fthreads / threads` truncates to 0, warm-zygote parse gates, parse-stage profile timers
`sources/equation.h/.cpp/.inl`	`f_stop` made `std::atomic` + lost-wakeup fix (empty-critical-section fence before `notify_all`); opt-in in-memory sector table (`FIRE_MEMTABLE=1`, measured neutral)
`sources/main.cpp`	zygote spawn after `parseArgcArgv`; the same lost-wakeup fence at both f-thread teardown sites
`sources/common.h/.cpp`	three globals for the warm zygote
`extra/fuel-flint-prime-fixes.patch`	required companion patch for the fuel submodule (it cannot be part of a regular diff here): ① prime mode must force the modular FLINT path regardless of the caller's `modular` flag; ② the generic parser dropped juxtaposition products (`(7115)(1)`, `-2(1)`), producing wrong masters. With these two fixes `#calc flint` becomes correct in prime mode and replaces every external `fer64` fork with in-process `nmod` arithmetic. This is the single biggest win and the reason concurrent instances stop interfering with each other
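The fuel patch itself is not reproduced here. To illustrate what fix ② means, here is a minimal sketch assuming a small recursive-descent grammar that is not fuel's actual parser: an evaluator over integers modulo a prime p < 2^32 in which a parenthesized factor placed directly after another factor is multiplied in, so `(7115)(1)` and `-2(1)` evaluate as products instead of losing a factor. `ModParser`, `eval_mod`, and `use_modular_path` (a one-line reading of fix ①) are hypothetical names, and plain `%` arithmetic stands in for FLINT's `nmod` routines.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical evaluator (not fuel's parser): integer expressions modulo a prime
// p < 2^32, so the product of two residues always fits in 64 bits.
class ModParser {
public:
    ModParser(std::string s, std::uint64_t p) : s_(std::move(s)), p_(p) {}

    std::uint64_t parse() {
        std::uint64_t v = expr();
        if (peek() != '\0') throw std::runtime_error("trailing input");
        return v;
    }

private:
    std::string s_;
    std::uint64_t p_;
    std::size_t pos_ = 0;

    // Skip whitespace and return the next character ('\0' at end of input).
    char peek() {
        while (pos_ < s_.size() && std::isspace(static_cast<unsigned char>(s_[pos_]))) ++pos_;
        return pos_ < s_.size() ? s_[pos_] : '\0';
    }

    // expr := term (('+' | '-') term)*
    std::uint64_t expr() {
        std::uint64_t v = term();
        for (char c = peek(); c == '+' || c == '-'; c = peek()) {
            ++pos_;
            std::uint64_t rhs = term();
            v = (c == '+') ? (v + rhs) % p_ : (v + p_ - rhs) % p_;
        }
        return v;
    }

    // term := unary ('*' unary | '(' expr ')')*
    // The second alternative is the juxtaposition rule: "(7115)(1)" and "-2(1)"
    // multiply adjacent factors instead of silently dropping one of them.
    std::uint64_t term() {
        std::uint64_t v = unary();
        for (;;) {
            char c = peek();
            if (c == '*') { ++pos_; v = v * unary() % p_; }
            else if (c == '(') { v = v * primary() % p_; }
            else return v;
        }
    }

    // unary := '-' unary | primary   (negation maps v to p - v, and 0 to 0)
    std::uint64_t unary() {
        if (peek() == '-') { ++pos_; return (p_ - unary()) % p_; }
        return primary();
    }

    // primary := digits | '(' expr ')'
    std::uint64_t primary() {
        char c = peek();
        if (c == '(') {
            ++pos_;
            std::uint64_t v = expr();
            if (peek() != ')') throw std::runtime_error("expected ')'");
            ++pos_;
            return v;
        }
        if (!std::isdigit(static_cast<unsigned char>(c))) throw std::runtime_error("expected a number");
        std::uint64_t v = 0;
        while (pos_ < s_.size() && std::isdigit(static_cast<unsigned char>(s_[pos_])))
            v = (v * 10 + static_cast<std::uint64_t>(s_[pos_++] - '0')) % p_;
        return v;
    }
};

std::uint64_t eval_mod(const std::string& s, std::uint64_t p) { return ModParser(s, p).parse(); }

// Fix ① as a one-liner (hypothetical helper): prime mode always takes the modular path.
bool use_modular_path(bool prime_mode, bool caller_modular) { return prime_mode || caller_modular; }
```

For example, with p = 2^31 − 1, `eval_mod("-2(1)", p)` returns p − 2, the residue of −2, rather than a value with the `(1)` factor discarded.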

Build

```sh
git submodule update --init FIRE6/extra/fuel
git -C FIRE6/extra/fuel apply ../fuel-flint-prime-fixes.patch   # BEFORE make dep
cd FIRE6
./configure --enable-zstd --enable-debug --enable-flint
make dep && make
```

Recommended config changes: `#calc flint`; `#database /dev/shm/<dir>` (tmpfs; the parent's serial DB shuttle was ~26 % at 8 cores); leave `#fthreads` unset (prime builds then default it to 1).
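The two `#fthreads` rules from the diff table fit in a few lines. This is a hypothetical sketch that assumes the child-side fix is a clamp to at least one evaluator worker (the PR only states that the division could yield 0 and hang standalone children); `resolve_fthreads` and `child_fthreads` are illustrative names, not functions in `parser.cpp`.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>

// Parent side (hypothetical): resolve #fthreads from the config.
int resolve_fthreads(std::optional<int> configured, int threads, bool prime_build) {
    if (configured) return *configured;  // an explicit #fthreads is always honored
    // Prime mode does not use the evaluator pool, so defaulting to #threads would
    // only start idle evaluator processes; keep a single one instead.
    return prime_build ? 1 : threads;
}

// Child side (hypothetical): split the evaluator pool across `threads` workers.
// Plain integer division gives 0 whenever fthreads < threads (e.g. 1 / 8), which
// leaves a standalone child with no evaluator worker; clamp to at least one.
int child_fthreads(int fthreads, int threads) {
    return std::max(1, fthreads / std::max(1, threads));
}
```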

Runtime switches

env	effect
`FIRE_NO_ZYGOTE=1`	legacy per-sector `fork+execv`
`FIRE_NO_WARM_ZYGOTE=1`	zygote serves forks but workers parse their own config
`FIRE_PROFILE=1`	per-sector phase timings, sector dependency edges, parent barrier/serial breakdown
`FIRE_MEMTABLE=1`	opt-in in-memory sector table (neutral)

Tables are byte-identical with the switches on or off.
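A minimal sketch of reading these switches once at startup, assuming a switch counts as on only when the variable is exactly `1` (the table shows only the `=1` form). `ScalingSwitches`, `env_flag`, and `read_switches` are hypothetical names, not code from this PR.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical snapshot of the runtime switches listed above.
struct ScalingSwitches {
    bool no_zygote;       // FIRE_NO_ZYGOTE=1: legacy per-sector fork+execv
    bool no_warm_zygote;  // FIRE_NO_WARM_ZYGOTE=1: workers parse their own config
    bool profile;         // FIRE_PROFILE=1: phase timings, dependency edges, parent breakdown
    bool memtable;        // FIRE_MEMTABLE=1: in-memory sector table
};

// Assumption: only the exact value "1" enables a switch; unset or any other value disables it.
static bool env_flag(const char* name) {
    const char* v = std::getenv(name);
    return v != nullptr && std::strcmp(v, "1") == 0;
}

ScalingSwitches read_switches() {
    return {env_flag("FIRE_NO_ZYGOTE"), env_flag("FIRE_NO_WARM_ZYGOTE"),
            env_flag("FIRE_PROFILE"), env_flag("FIRE_MEMTABLE")};
}
```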

Validation

- byte-identical tables vs. stock (same configs, SHA-256) across the full suite, confirmed at ≥15 % target coverage; master decompositions additionally validated coefficient-by-coefficient against independent ground truth (2000/2000 on the batch test)
- a fresh end-to-end build from a clean clone reproduces the shipped binaries' outputs bit-for-bit (including both FIRE_NO_* reverts)
- the lost-wakeup fix survived a 240/240 stress run; note that an atomic flag alone does not close that race: the fence is the fix
- one semantic trap worth knowing even if you reject the rest of this PR: pre-loading other sectors' lbases/dbases/ibases into a worker changes the reduction (different pivots, different step count). This was caught by the byte-identical gate; workers here load only their own sector's bases. Details and negative results (DAG scheduling ≈2 %, mem-table neutral) are in TECHNICAL.md §5
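To make the lost-wakeup point concrete, here is a sketch of the pattern under stated assumptions: a hypothetical `EvaluatorPool` in which only `f_stop` and `f_submit_mutex` borrow names from the PR. Workers check their wait predicate while holding `f_submit_mutex`; teardown stores the atomic flag, passes through an empty critical section, and only then calls `notify_all`.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical evaluator pool; only f_stop and f_submit_mutex echo names from the PR.
struct EvaluatorPool {
    std::mutex f_submit_mutex;
    std::condition_variable f_submit_cv;
    std::deque<int> queue;              // pending jobs, guarded by f_submit_mutex
    std::atomic<bool> f_stop{false};
    std::atomic<int> processed{0};
    std::vector<std::thread> workers;

    explicit EvaluatorPool(int n) {
        for (int i = 0; i < n; ++i) workers.emplace_back([this] { run(); });
    }
    ~EvaluatorPool() { if (!f_stop.load()) stop(); }

    void submit(int job) {
        { std::lock_guard<std::mutex> lk(f_submit_mutex); queue.push_back(job); }
        f_submit_cv.notify_one();
    }

    void run() {
        std::unique_lock<std::mutex> lk(f_submit_mutex);
        for (;;) {
            // The predicate is checked with the mutex held; wait() releases the
            // mutex only once this thread is actually blocked.
            f_submit_cv.wait(lk, [this] { return f_stop.load() || !queue.empty(); });
            if (queue.empty()) return;  // stop requested and the queue is drained
            queue.pop_front();
            lk.unlock();
            processed.fetch_add(1);     // stand-in for evaluating the job
            lk.lock();
        }
    }

    // Teardown. An atomic store alone is not enough: a worker that has just seen
    // the predicate as false but has not blocked yet would miss a notify_all sent
    // in that window. Any such worker holds f_submit_mutex until it blocks, so the
    // empty critical section below cannot complete before that worker is inside
    // wait(), where notify_all reaches it; later predicate checks see f_stop.
    void stop() {
        f_stop.store(true);
        { std::lock_guard<std::mutex> fence(f_submit_mutex); }
        f_submit_cv.notify_all();
        for (auto& t : workers) t.join();
    }
};
```

Dropping the empty `lock_guard` line reintroduces the window: the store and the `notify_all` can both land between a worker's predicate check and its block, and that worker never wakes.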

Scope

Linux only (relies on fork, SOCK_SEQPACKET, and SCM_RIGHTS). Optimized and validated for prime mode; non-prime FIRE6 still builds and runs (the zygote serves it too) but was not the optimization target. MPI is untouched.
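The zygote in commit 1 is built on these primitives. A minimal sketch of the general pattern follows, assuming a one-`int` request format and a one-byte completion signal; every name in it (`start_zygote`, `submit`, `run_task`, and so on) is hypothetical, and the PR's real protocol may differ. The zygote is forked before any threads exist; each request carries, via `SCM_RIGHTS`, the write end of a per-task done pipe; the zygote forks a worker that runs the task in-process (no `execv`) and reports on that pipe, and the zygote reaps its workers itself.

```cpp
#include <cassert>
#include <cstring>
#include <stdexcept>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical fork-server sketch (Linux); not the PR's actual protocol.

// Send an int task id plus one fd (via SCM_RIGHTS) as a single SEQPACKET message.
static void send_task(int sock, int task, int fd) {
    iovec iov{&task, sizeof task};
    alignas(cmsghdr) char ctrl[CMSG_SPACE(sizeof(int))] = {};
    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof ctrl;
    cmsghdr* c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;  // the kernel installs a duplicate of fd in the receiver
    c->cmsg_len = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(c), &fd, sizeof(int));
    if (sendmsg(sock, &msg, 0) < 0) throw std::runtime_error("sendmsg failed");
}

// Receive one task; returns false on EOF (the parent closed its end).
static bool recv_task(int sock, int& task, int& fd) {
    iovec iov{&task, sizeof task};
    alignas(cmsghdr) char ctrl[CMSG_SPACE(sizeof(int))] = {};
    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof ctrl;
    if (recvmsg(sock, &msg, 0) <= 0) return false;
    cmsghdr* c = CMSG_FIRSTHDR(&msg);
    if (c == nullptr || c->cmsg_type != SCM_RIGHTS) return false;
    std::memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return true;
}

// Stand-in for the per-sector work a worker would run in-process.
static unsigned char run_task(int task) { return static_cast<unsigned char>(task * 2); }

// Zygote loop: one fork per request; no exec, so workers inherit the zygote's state.
[[noreturn]] static void zygote_loop(int sock) {
    int task = 0, done_fd = -1;
    while (recv_task(sock, task, done_fd)) {
        if (fork() == 0) {
            unsigned char result = run_task(task);
            ssize_t n = write(done_fd, &result, 1);   // completion signal on the passed pipe
            _exit(n == 1 ? 0 : 1);
        }
        close(done_fd);                               // if fork failed, the parent just sees EOF
        while (waitpid(-1, nullptr, WNOHANG) > 0) {}  // reap finished workers without blocking
    }
    while (waitpid(-1, nullptr, 0) > 0) {}            // shutdown: reap every remaining worker
    _exit(0);
}

struct Zygote { pid_t pid; int sock; };

// Fork the zygote early, while the process is still single-threaded.
static Zygote start_zygote() {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) < 0) throw std::runtime_error("socketpair failed");
    pid_t pid = fork();
    if (pid < 0) throw std::runtime_error("fork failed");
    if (pid == 0) { close(sv[0]); zygote_loop(sv[1]); }
    close(sv[1]);
    return {pid, sv[0]};
}

// Ask for one worker; returns the read end of its done pipe.
static int submit(const Zygote& z, int task) {
    int p[2];
    if (pipe(p) < 0) throw std::runtime_error("pipe failed");
    send_task(z.sock, task, p[1]);
    close(p[1]);  // the parent keeps only the read end
    return p[0];
}

// Closing the socket ends the zygote loop; then wait for the zygote itself.
static void stop_zygote(const Zygote& z) {
    close(z.sock);
    waitpid(z.pid, nullptr, 0);
}
```

Because the parent waits for the zygote and the zygote waits for its workers, the workers' usage ends up in the parent's `getrusage(RUSAGE_CHILDREN)`; this appears to be what commit 1 means by explicit reaping keeping children-max-RSS accounting truthful.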

🤖 Generated with Claude Code

… deadlock fixes, profiling

End-to-end effect together with the companion fuel patch (#calc flint) and a tmpfs #database: heaviest benchmark (9-propagator 2-loop, ~120k-step Laporta) goes 8.98 s -> 0.91 s at 1 core and 4.93 s -> 0.38 s at 8 cores vs stock 6.5.2 (fermat backend), with byte-identical reduction tables. Details: documentation/scaling/ (next commit).

- prime-mode default #fthreads=1 (#ifdef PRIME): the evaluator pool is unused in prime mode (fuel_time == 0); defaulting it to #threads forks N idle fer64 processes per invocation and dominates short reductions. An explicit #fthreads is still honored.
- zygote fork-server (functions.cpp): fork a single-threaded zygote at main entry; per-sector workers fork from it and run flame_main() in-process (standalone --thread -1), skipping execv, ~8 ms of C++ static init, and config re-parse. SOCK_SEQPACKET + SCM_RIGHTS done-pipe protocol; explicit reaping keeps children-max-RSS accounting truthful. Revert at runtime: FIRE_NO_ZYGOTE=1.
- warm zygote: the zygote pre-parses the sector-independent IBP templates once; workers skip parse_config entirely and load only their own sector's #lbases (loading other sectors' rule bases CHANGES the reduction; see TECHNICAL.md). Revert: FIRE_NO_WARM_ZYGOTE=1.
- two latent deadlock fixes (independently useful on stock FIRE): (1) child-side fthreads /= threads can yield 0 evaluator workers and hang standalone children; (2) lost-wakeup race in f-worker teardown: f_stop was written without holding f_submit_mutex; fix = atomic flag + empty-critical-section fence before notify_all (3 sites).
- FIRE_PROFILE=1 instrumentation: per-sector phase timings (sort/apply/fwd/split/bksub, point-table get/add), sector dependency edges, parent barrier/serial breakdown.
- opt-in in-memory sector table (FIRE_MEMTABLE=1): measured NEUTRAL (kyotocabinet CacheDB is already in-memory); kept for experiments.

FLAME's entry logic moved from thread.cpp to flame_main() in functions.cpp so zygote workers can run it in-process; thread.cpp keeps a thin wrapper.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
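For readers unfamiliar with this kind of instrumentation, here is a minimal sketch of per-phase counters in the spirit of `FIRE_PROFILE=1`: an RAII timer adds the elapsed wall time of a scope to one phase. The phase names are taken from the commit message above; `PhaseCounters`, `PhaseTimer`, `make_counters`, `report`, and the output format are hypothetical and do not describe the PR's actual code.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical FIRE_PROFILE-style counters; only the phase names come from the PR.
enum class Phase : std::size_t { Sort, Apply, Fwd, Split, Bksub, Count };
constexpr std::size_t kPhases = static_cast<std::size_t>(Phase::Count);

struct PhaseCounters {
    std::array<double, kPhases> seconds{};  // accumulated wall time per phase
    std::array<long, kPhases> calls{};      // number of timed scopes per phase
    bool enabled = false;
};

// Assumed semantics: profiling is on only when FIRE_PROFILE is exactly "1".
PhaseCounters make_counters() {
    PhaseCounters c;
    const char* v = std::getenv("FIRE_PROFILE");
    c.enabled = v != nullptr && std::strcmp(v, "1") == 0;
    return c;
}

// RAII timer: on scope exit, adds the elapsed time to one phase (if enabled).
class PhaseTimer {
    using Clock = std::chrono::steady_clock;
    PhaseCounters& c_;
    Phase p_;
    Clock::time_point start_;

public:
    PhaseTimer(PhaseCounters& c, Phase p)
        : c_(c), p_(p), start_(c.enabled ? Clock::now() : Clock::time_point{}) {}
    PhaseTimer(const PhaseTimer&) = delete;
    PhaseTimer& operator=(const PhaseTimer&) = delete;
    ~PhaseTimer() {
        if (!c_.enabled) return;
        auto i = static_cast<std::size_t>(p_);
        c_.seconds[i] += std::chrono::duration<double>(Clock::now() - start_).count();
        ++c_.calls[i];
    }
};

// Print one line per phase to stderr, e.g. at the end of a sector.
void report(const PhaseCounters& c, int sector) {
    static const char* const names[kPhases] = {"sort", "apply", "fwd", "split", "bksub"};
    for (std::size_t i = 0; i < kPhases; ++i)
        std::fprintf(stderr, "sector %d %-5s %9.4f s %6ld calls\n",
                     sector, names[i], c.seconds[i], c.calls[i]);
}
```

A reduction step would then be wrapped as `{ PhaseTimer t(counters, Phase::Apply); /* apply rules */ }`, and the timer costs only a flag check when profiling is off.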

- FIRE6/scaling/{README,TECHNICAL,USAGE}.md: mechanisms, measurements, negative results, build & config guide
- FIRE6/scaling/reports/: self-contained HTML benchmark reports
- FIRE6/extra/fuel-flint-prime-fixes.patch: required companion fix for the fuel submodule (#calc flint correctness in prime mode); apply with git -C FIRE6/extra/fuel apply ../fuel-flint-prime-fixes.patch before 'make dep'

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

wanweilin and others added 2 commits June 10, 2026 13:37

wanweilin mentioned this pull request Jun 12, 2026 in #4, "Prime-mode scaling: zygote + startup / backward-fusion / DB / allocator latency wins (byte-identical)" (Open)

wanweilin wants to merge 2 commits into thibautbar:master from wanweilin:claude/prime-mode-scaling

wanweilin commented Jun 10, 2026
