Prime-mode scaling: FLINT backend fixes + zygote fork-server (~10-16x single-integral latency, byte-identical tables) #3
Open
wanweilin wants to merge 2 commits into
… deadlock fixes, profiling

End-to-end effect together with the companion fuel patch (#calc flint) and a tmpfs #database: heaviest benchmark (9-propagator 2-loop, ~120k-step Laporta) goes 8.98 s -> 0.91 s at 1 core and 4.93 s -> 0.38 s at 8 cores vs stock 6.5.2 (fermat backend), with byte-identical reduction tables. Details: documentation/scaling/ (next commit).

- prime-mode default #fthreads=1 (#ifdef PRIME): the evaluator pool is
  unused in prime mode (fuel_time == 0); defaulting it to #threads forks
  N idle fer64 processes per invocation and dominates short reductions.
  An explicit #fthreads is still honored.
- zygote fork-server (functions.cpp): fork a single-threaded zygote at
  main entry; per-sector workers fork from it and run flame_main()
  in-process (standalone --thread -1), skipping execv, ~8 ms of C++
  static init, and config re-parse. SOCK_SEQPACKET + SCM_RIGHTS
  done-pipe protocol; explicit reaping keeps children-max-RSS accounting
  truthful. Revert at runtime: FIRE_NO_ZYGOTE=1.
- warm zygote: the zygote pre-parses the sector-independent IBP
  templates once; workers skip parse_config entirely and load only their
  own sector's #lbases (loading other sectors' rule bases CHANGES the
  reduction; see TECHNICAL.md). Revert: FIRE_NO_WARM_ZYGOTE=1.
- two latent deadlock fixes (independently useful on stock FIRE):
  (1) child-side fthreads /= threads can yield 0 evaluator workers and
  hang standalone children; (2) lost-wakeup race in f-worker teardown:
  f_stop was written without holding f_submit_mutex; fix = atomic flag +
  empty-critical-section fence before notify_all (3 sites).
- FIRE_PROFILE=1 instrumentation: per-sector phase timings
  (sort/apply/fwd/split/bksub, point-table get/add), sector dependency
  edges, parent barrier/serial breakdown.
- opt-in in-memory sector table (FIRE_MEMTABLE=1): measured NEUTRAL
  (kyotocabinet CacheDB is already in-memory); kept for experiments.

FLAME's entry logic moved from thread.cpp to flame_main() in functions.cpp so zygote workers can run it in-process; thread.cpp keeps a thin wrapper.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

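
The lost-wakeup fix described in the commit message (atomic flag plus an empty critical section before notify_all) can be illustrated with a minimal sketch. This is not the FIRE code: `WorkerPool`, `submit_mutex`, `submit_cv` and `run_teardown_demo` are hypothetical names chosen for illustration, standing in for the f_stop / f_submit_mutex pair mentioned above.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the evaluator pool; names are illustrative only.
struct WorkerPool {
    std::mutex submit_mutex;
    std::condition_variable submit_cv;
    std::atomic<bool> stop{false};
    std::atomic<int> exited{0};

    // Worker: sleeps until the stop flag is observed under the mutex.
    void worker() {
        std::unique_lock<std::mutex> lk(submit_mutex);
        submit_cv.wait(lk, [this] { return stop.load(); });
        ++exited;
    }

    // Teardown: the flag is written WITHOUT holding the mutex, so a worker
    // could test the predicate (false) and then miss the notification.
    // Locking and unlocking the mutex before notify_all closes that window:
    // a worker that already tested the predicate holds the mutex until
    // wait() releases it, so it is asleep (and notifiable) once we pass the
    // fence; a worker that locks later sees stop == true.
    void shutdown() {
        stop.store(true);
        { std::lock_guard<std::mutex> fence(submit_mutex); }  // empty critical section
        submit_cv.notify_all();
    }
};

// Starts n_workers waiting threads, tears them down, returns how many exited.
int run_teardown_demo(int n_workers) {
    assert(n_workers >= 0);
    WorkerPool pool;
    std::vector<std::thread> threads;
    for (int i = 0; i < n_workers; ++i)
        threads.emplace_back(&WorkerPool::worker, &pool);
    pool.shutdown();
    for (auto& t : threads) t.join();
    return pool.exited.load();
}
```

Without the fence, the teardown can hang intermittently when the store and notify_all both land between a worker's predicate check and its wait; with it, `run_teardown_demo` returns once every worker has exited.
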
- FIRE6/scaling/{README,TECHNICAL,USAGE}.md: mechanisms, measurements,
negative results, build & config guide
- FIRE6/scaling/reports/: self-contained HTML benchmark reports
- FIRE6/extra/fuel-flint-prime-fixes.patch: required companion fix for
the fuel submodule (#calc flint correctness in prime mode); apply with
git -C FIRE6/extra/fuel apply ../fuel-flint-prime-fixes.patch
before 'make dep'
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Performance work on prime mode (FIRE6p), measured against stock 6.5.2 (this repo's master) on a 13-topology suite (1- and 2-loop families, 2–9 propagators), dual-socket Xeon Platinum 8468. Reduction tables are byte-identical (SHA-256) to stock on every benchmark, including with every runtime switch toggled.

Full curves/plots: scaling/reports/ (self-contained HTML, download & open). Mechanisms and measurements: FIRE6/scaling/TECHNICAL.md. Operational guide: USAGE.md.

How to read the diff

Commit 1 = all source changes; commit 2 = docs + the fuel patch file + reports.

| File | What changed |
| --- | --- |
| sources/functions.cpp | zygote fork-server (skips execv + ~8 ms C++ static-init + config re-parse), flame_main() moved here from thread.cpp, build_flame_args() (single source of FLAME argv), worker_fixup() (warm-zygote per-sector fixup), FIRE_PROFILE phase counters, opt-in mem-table hooks |
| sources/thread.cpp | main reduced to a thin wrapper around flame_main() (−84 lines) |
| sources/parser.cpp | prime-mode default #fthreads = 1 (#ifdef PRIME; explicit value honored), child-side fthreads/threads → 0 deadlock fix, warm-zygote parse gates, parse-stage profile timers |
| sources/equation.h/.cpp/.inl | f_stop made std::atomic + lost-wakeup fix (empty-critical-section fence before notify_all); opt-in in-memory sector table (FIRE_MEMTABLE=1, measured neutral) |
| sources/main.cpp | … parseArgcArgv; the same lost-wakeup fence at both f-thread teardown sites |
| sources/common.h/.cpp | |
| extra/fuel-flint-prime-fixes.patch | ① … modular flag, ② the generic parser dropped juxtaposition products ((7115)(1), -2(1) → wrong masters). With these two fixes #calc flint becomes correct in prime mode and replaces every external fer64 fork with in-process nmod arithmetic; this is the single biggest win and the reason concurrent instances stop destroying each other |

Build

Recommended config deltas: #calc flint, #database /dev/shm/<dir> (tmpfs; the parent's serial DB-shuttle was ~26 % at 8 cores), leave #fthreads unset.

Runtime switches

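| Switch | Effect |
| --- | --- |
| FIRE_NO_ZYGOTE=1 | disables the zygote fork-server; workers are started via fork+execv |
| FIRE_NO_WARM_ZYGOTE=1 | disables the warm zygote; workers parse the config themselves |
| FIRE_PROFILE=1 | enables the profiling instrumentation (per-sector phase timings, sector dependency edges, parent barrier/serial breakdown) |
| FIRE_MEMTABLE=1 | opt-in in-memory sector table (measured neutral) |

Tables are byte-identical with the switches on or off.
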
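
As a rough illustration of how such environment switches are typically read once at startup, here is a minimal sketch. Only the four variable names come from this PR; `env_switch_on`, `RuntimeSwitches` and `read_runtime_switches` are hypothetical, and both the "exactly 1 means on" rule and the "warm zygote requires the zygote" dependency are assumptions rather than confirmed FIRE behaviour.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: a switch counts as "on" only when set to exactly "1".
// (Assumption: the PR documents the "=1" form but not other values.)
bool env_switch_on(const char* name) {
    assert(name != nullptr);
    const char* v = std::getenv(name);
    return v != nullptr && std::strcmp(v, "1") == 0;
}

// Hypothetical snapshot of the four switches named in the table above.
struct RuntimeSwitches {
    bool use_zygote;   // false when FIRE_NO_ZYGOTE=1
    bool warm_zygote;  // false when FIRE_NO_WARM_ZYGOTE=1 or the zygote is off
    bool profile;      // true when FIRE_PROFILE=1
    bool memtable;     // true when FIRE_MEMTABLE=1
};

// Reads the environment once; the two FIRE_NO_* variables are opt-outs,
// the other two are opt-ins.
RuntimeSwitches read_runtime_switches() {
    RuntimeSwitches s;
    s.use_zygote = !env_switch_on("FIRE_NO_ZYGOTE");
    // Assumption: a warm zygote is meaningless without a zygote.
    s.warm_zygote = s.use_zygote && !env_switch_on("FIRE_NO_WARM_ZYGOTE");
    s.profile = env_switch_on("FIRE_PROFILE");
    s.memtable = env_switch_on("FIRE_MEMTABLE");
    return s;
}
```

Reading the switches into one struct at entry keeps the opt-out/opt-in polarity in a single place, which makes a "toggle every switch and compare tables" validation loop straightforward to drive from the environment.
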
Validation
… FIRE_NO_* reverts)

Loading other sectors' lbases/dbases/ibases into a worker changes the reduction (different pivots, different step count). Caught by the byte-identical gate; workers here load only their own sector's bases. Details + negative results (DAG scheduling ≈2 %, mem-table neutral) in TECHNICAL.md §5.

Scope

Linux only (fork, SOCK_SEQPACKET, SCM_RIGHTS). Optimized & validated for prime mode; non-prime FIRE6 builds and runs (the zygote serves it too) but wasn't the optimization target. MPI untouched.

🤖 Generated with Claude Code
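
For readers unfamiliar with the Linux primitives listed under Scope, the following is a minimal sketch of a fork-server that receives a request plus a "done" pipe over a SOCK_SEQPACKET socket using SCM_RIGHTS. It is not the protocol implemented in functions.cpp: all function names are hypothetical, the per-sector work is replaced by a placeholder (`sector * 2`), and the zygote here handles one worker at a time for brevity.

```cpp
#include <cassert>
#include <cstring>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

// Send one int payload plus one file descriptor as a single SEQPACKET message.
static bool send_request(int sock, int payload, int fd_to_pass) {
    assert(fd_to_pass >= 0);
    char ctrl[CMSG_SPACE(sizeof(int))];
    std::memset(ctrl, 0, sizeof(ctrl));
    iovec iov{&payload, sizeof(payload)};
    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);
    cmsghdr* c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;  // ancillary data carries a descriptor
    c->cmsg_len = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(c), &fd_to_pass, sizeof(int));
    return sendmsg(sock, &msg, 0) == (ssize_t)sizeof(payload);
}

// Receive one request; returns false on EOF, error, or a missing descriptor.
static bool recv_request(int sock, int* payload, int* fd_out) {
    char ctrl[CMSG_SPACE(sizeof(int))];
    iovec iov{payload, sizeof(*payload)};
    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);
    if (recvmsg(sock, &msg, 0) != (ssize_t)sizeof(*payload)) return false;
    cmsghdr* c = CMSG_FIRSTHDR(&msg);
    if (!c || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS) return false;
    std::memcpy(fd_out, CMSG_DATA(c), sizeof(int));
    return true;
}

// Zygote: forks one worker per request; the worker reports on the passed pipe.
static void zygote_loop(int sock) {
    int sector = 0, done_fd = -1;
    while (recv_request(sock, &sector, &done_fd)) {
        pid_t w = fork();
        if (w == 0) {
            int result = sector * 2;  // placeholder for the real per-sector work
            ssize_t n = write(done_fd, &result, sizeof(result));
            _exit(n == (ssize_t)sizeof(result) ? 0 : 1);
        }
        close(done_fd);                     // zygote keeps no copy of the write end
        if (w > 0) waitpid(w, nullptr, 0);  // explicit reaping (serial in this sketch)
    }
    _exit(0);  // parent closed its socket end
}

// Parent side: start a zygote, submit one request, wait for the done-pipe reply.
int run_sector_via_zygote(int sector) {
    int sp[2];
    if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sp) != 0) return -1;
    pid_t z = fork();
    if (z < 0) return -1;
    if (z == 0) { close(sp[0]); zygote_loop(sp[1]); }
    close(sp[1]);
    int done[2];
    int result = -1;
    if (pipe(done) == 0) {
        bool sent = send_request(sp[0], sector, done[1]);
        close(done[1]);  // parent keeps only the read end
        if (sent && read(done[0], &result, sizeof(result)) != (ssize_t)sizeof(result))
            result = -1;
        close(done[0]);
    }
    close(sp[0]);  // EOF tells the zygote to exit
    waitpid(z, nullptr, 0);
    return result;
}
```

Because the worker is forked from the already-initialized zygote rather than exec'd, it inherits the zygote's address space as-is; the done pipe arrives by descriptor passing, so the zygote never needs to have been forked after the pipe was created.
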