Prime-mode scaling: FLINT backend fixes + zygote fork-server (~10-16x single-integral latency, byte-identical tables) #3

Open
wanweilin wants to merge 2 commits into thibautbar:master from wanweilin:claude/prime-mode-scaling

@wanweilin

Performance work on prime mode (FIRE6p), measured against stock 6.5.2 (this repo's master) on a 13-topology suite (1- and 2-loop families, 2–9 propagators), dual-socket Xeon Platinum 8468. Reduction tables are byte-identical (SHA-256) to stock on every benchmark, including with every runtime switch toggled.

| metric (heaviest case: 9-propagator 2-loop, ~120k-step Laporta) | stock | this PR | factor |
| --- | --- | --- | --- |
| single integral, 1 core | 8.98 s | 0.91 s | 9.9× |
| single integral, 8 cores | 4.93 s | 0.38 s | 13.0× |
| self-speedup 1→8 cores | 1.8× (regresses >8c) | 2.4× (81–96 % of the 2.95× sector-width ceiling) | |
| concurrent independent integrals, 96 cores | ~1.8 /s (fer64 fork storm) | 53.8 /s | ~29× |
| same-topology batch (2000 targets, one run) | 163 /s | 1754 /s | 10.8× |

Full curves/plots: scaling/reports/ (self-contained HTML, download & open). Mechanisms and measurements: FIRE6/scaling/TECHNICAL.md. Operational guide: USAGE.md.

How to read the diff

Commit 1 = all source changes; commit 2 = docs + the fuel patch file + reports.

| file | what changed |
| --- | --- |
| sources/functions.cpp | the bulk: zygote fork-server (one clean single-threaded process forked at startup; sector workers fork from it and run FLAME in-process, which removes the per-sector execv, ~8 ms of C++ static init, and the config re-parse), flame_main() moved here from thread.cpp, build_flame_args() (single source of FLAME argv), worker_fixup() (warm-zygote per-sector fixup), FIRE_PROFILE phase counters, opt-in mem-table hooks |
| sources/thread.cpp | FLAME main reduced to a thin wrapper around flame_main() (−84 lines) |
| sources/parser.cpp | prime-mode default #fthreads = 1 (#ifdef PRIME; explicit value honored), fix for the child-side fthreads /= threads → 0 deadlock, warm-zygote parse gates, parse-stage profile timers |
| sources/equation.h/.cpp/.inl | f_stop made std::atomic + lost-wakeup fix (empty-critical-section fence before notify_all); opt-in in-memory sector table (FIRE_MEMTABLE=1, measured neutral) |
| sources/main.cpp | zygote spawn after parseArgcArgv; the same lost-wakeup fence at both f-thread teardown sites |
| sources/common.h/.cpp | three globals for the warm zygote |
| extra/fuel-flint-prime-fixes.patch | required companion patch for the fuel submodule (can't be a regular diff here): ① prime mode must force the modular FLINT path regardless of the caller's modular flag, ② the generic parser dropped juxtaposition products ((7115)(1), -2(1) → wrong masters). With these two fixes #calc flint becomes correct in prime mode and replaces every external fer64 fork with in-process nmod arithmetic. This is the single biggest win and the reason concurrent instances stop destroying each other |
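
For readers unfamiliar with the zygote pattern, here is a minimal sketch of the idea, not the code in this PR. It assumes a single SOCK_SEQPACKET socket pair, serial worker reaping, and no SCM_RIGHTS fd passing. `run_sector`, `spawn_zygote`, `reduce_sector` and `stop_zygote` are hypothetical names, with `run_sector` standing in for the in-process FLAME call.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdlib>

// Reply a worker sends straight back to the parent.
struct Reply { int sector; int result; };

// Hypothetical stand-in for the per-sector reduction (flame_main() in the PR).
static int run_sector(int sector) { return sector * sector; }

struct Zygote { pid_t pid; int fd; };

// Fork the zygote early, while the process is still single-threaded.
Zygote spawn_zygote() {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) != 0) std::abort();
    pid_t pid = fork();
    if (pid < 0) std::abort();
    if (pid == 0) {                       // zygote process
        close(sv[0]);
        int sector = 0;
        // One packet per request; recv returns 0 once the parent closes its end.
        while (recv(sv[1], &sector, sizeof sector, 0) ==
               static_cast<ssize_t>(sizeof sector)) {
            pid_t worker = fork();        // plain fork: no execv, no static re-init
            if (worker == 0) {            // worker: run in-process, reply, exit
                Reply r{sector, run_sector(sector)};
                send(sv[1], &r, sizeof r, 0);
                _exit(0);
            }
            if (worker > 0) {
                waitpid(worker, nullptr, 0);   // explicit reaping (serial here)
            } else {
                Reply failed{sector, -1};      // fork failed: report it
                send(sv[1], &failed, sizeof failed, 0);
            }
        }
        _exit(0);
    }
    close(sv[1]);
    return Zygote{pid, sv[0]};
}

// Ask the zygote to fork a worker for one sector and wait for its reply.
int reduce_sector(const Zygote& z, int sector) {
    if (send(z.fd, &sector, sizeof sector, 0) !=
        static_cast<ssize_t>(sizeof sector)) return -1;
    Reply r{};
    if (recv(z.fd, &r, sizeof r, 0) != static_cast<ssize_t>(sizeof r)) return -1;
    return r.result;
}

// Closing the parent's end makes the zygote's recv loop terminate.
void stop_zygote(const Zygote& z) {
    close(z.fd);
    waitpid(z.pid, nullptr, 0);
}
```

The worker inherits whatever state the zygote had at fork time, which is what makes a pre-parsed ("warm") zygote worthwhile. A real fork-server would run workers concurrently rather than waiting on each one in turn.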

Build

git submodule update --init FIRE6/extra/fuel
git -C FIRE6/extra/fuel apply ../fuel-flint-prime-fixes.patch   # BEFORE make dep
cd FIRE6
./configure --enable-zstd --enable-debug --enable-flint
make dep && make

Recommended config deltas: #calc flint, #database /dev/shm/<dir> (tmpfs; the parent's serial DB-shuttle was ~26 % at 8 cores), leave #fthreads unset.

Runtime switches

| env | effect |
| --- | --- |
| FIRE_NO_ZYGOTE=1 | legacy per-sector fork+execv |
| FIRE_NO_WARM_ZYGOTE=1 | zygote serves forks but workers parse their own config |
| FIRE_PROFILE=1 | per-sector phase timings, sector dependency edges, parent barrier/serial breakdown |
| FIRE_MEMTABLE=1 | opt-in in-memory sector table (neutral) |

Tables are byte-identical with the switches on or off.
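
As an illustration of how the two revert switches could gate the spawn path, here is a small sketch. It assumes a switch is on only when set to exactly "1" and that FIRE_NO_ZYGOTE takes precedence. The actual parsing and precedence in this PR may differ, and `env_switch_on`, `SpawnMode` and `pick_spawn_mode` are hypothetical names.

```cpp
#include <cstdlib>
#include <cstring>

// Assumption: a switch counts as "on" only when set to exactly "1".
bool env_switch_on(const char* name) {
    const char* value = std::getenv(name);
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// The three worker-spawn paths described in the table above.
enum class SpawnMode {
    LegacyExec,   // per-sector fork+execv
    ColdZygote,   // zygote serves forks, workers parse their own config
    WarmZygote    // default: workers reuse the zygote's pre-parsed state
};

// Assumption: FIRE_NO_ZYGOTE wins when both switches are set.
SpawnMode pick_spawn_mode() {
    if (env_switch_on("FIRE_NO_ZYGOTE")) return SpawnMode::LegacyExec;
    if (env_switch_on("FIRE_NO_WARM_ZYGOTE")) return SpawnMode::ColdZygote;
    return SpawnMode::WarmZygote;
}
```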

Validation

  • byte-identical tables vs. stock (same configs, SHA-256) across the full suite, confirmed at ≥15 % target coverage; master decompositions additionally validated coefficient-by-coefficient against independent ground truth (2000/2000 on the batch test)
  • a fresh end-to-end build from a clean clone reproduces the shipped binaries' outputs bit-for-bit (including both FIRE_NO_* reverts)
  • the lost-wakeup fix survived a 240/240 stress run; note an atomic flag alone does not close that race — the fence is the fix
  • one semantic trap worth knowing even if you reject the rest of this PR: pre-loading other sectors' lbases/dbases/ibases into a worker changes the reduction (different pivots, different step count). Caught by the byte-identical gate; workers here load only their own sector's bases. Details + negative results (DAG scheduling ≈2 %, mem-table neutral) in TECHNICAL.md §5
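
The lost-wakeup point in the list above can be sketched in isolation. This is a minimal illustration of the pattern, not the FIRE code: `WorkerPool` and its members are hypothetical, with `stop` and `submit_mutex` playing the roles of f_stop and f_submit_mutex.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

struct WorkerPool {
    std::mutex submit_mutex;
    std::condition_variable cv;
    std::atomic<bool> stop{false};
    std::vector<std::thread> workers;

    void start(int n) {
        for (int i = 0; i < n; ++i) {
            workers.emplace_back([this] {
                std::unique_lock<std::mutex> lock(submit_mutex);
                // The predicate is checked while holding submit_mutex.
                cv.wait(lock, [this] { return stop.load(); });
            });
        }
    }

    void shutdown() {
        // Written without holding the mutex, as in the original race.
        stop.store(true);
        // Empty critical section: a waiter that already saw stop == false still
        // holds the mutex until wait() has parked it, so once we get the lock
        // that waiter is blocked and the notify below reaches it. A waiter that
        // locks later sees stop == true and never blocks.
        { std::lock_guard<std::mutex> fence(submit_mutex); }
        cv.notify_all();
        for (auto& t : workers) t.join();
        workers.clear();
    }

    ~WorkerPool() { shutdown(); }
};
```

Making the flag atomic only removes the data race on the flag itself. Without the fence, the store and the notify can both land between a waiter's predicate check and the moment it blocks, and that waiter then sleeps forever.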

Scope

Linux only (fork, SOCK_SEQPACKET, SCM_RIGHTS). Optimized & validated for prime mode; non-prime FIRE6 builds and runs (the zygote serves it too) but wasn't the optimization target. MPI untouched.


🤖 Generated with Claude Code

wanweilin and others added 2 commits June 10, 2026 13:37
… deadlock fixes, profiling

End-to-end effect together with the companion fuel patch (#calc flint)
and a tmpfs #database: heaviest benchmark (9-propagator 2-loop,
~120k-step Laporta) goes 8.98 s -> 0.91 s at 1 core and 4.93 s -> 0.38 s
at 8 cores vs stock 6.5.2 (fermat backend), with byte-identical
reduction tables. Details: documentation/scaling/ (next commit).

- prime-mode default #fthreads=1 (#ifdef PRIME): the evaluator pool is
  unused in prime mode (fuel_time == 0); defaulting it to #threads forks
  N idle fer64 processes per invocation and dominates short reductions.
  An explicit #fthreads is still honored.
- zygote fork-server (functions.cpp): fork a single-threaded zygote at
  main entry; per-sector workers fork from it and run flame_main()
  in-process (standalone --thread -1), skipping execv, ~8 ms of C++
  static init, and config re-parse. SOCK_SEQPACKET + SCM_RIGHTS
  done-pipe protocol; explicit reaping keeps children-max-RSS
  accounting truthful. Revert at runtime: FIRE_NO_ZYGOTE=1.
- warm zygote: the zygote pre-parses the sector-independent IBP
  templates once; workers skip parse_config entirely and load only
  their own sector's #lbases (loading other sectors' rule bases CHANGES
  the reduction - see TECHNICAL.md). Revert: FIRE_NO_WARM_ZYGOTE=1.
- two latent deadlock fixes (independently useful on stock FIRE):
  (1) child-side fthreads /= threads can yield 0 evaluator workers and
  hang standalone children; (2) lost-wakeup race in f-worker teardown -
  f_stop was written without holding f_submit_mutex; fix = atomic flag
  + empty-critical-section fence before notify_all (3 sites).
- FIRE_PROFILE=1 instrumentation: per-sector phase timings (sort/
  apply/fwd/split/bksub, point-table get/add), sector dependency edges,
  parent barrier/serial breakdown.
- opt-in in-memory sector table (FIRE_MEMTABLE=1): measured NEUTRAL
  (kyotocabinet CacheDB is already in-memory); kept for experiments.
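
Deadlock fix (1) above comes down to integer division. The commit message does not say how the value is corrected, so the sketch below assumes a floor of one evaluator worker. `naive_child_fthreads` and `child_fthreads` are hypothetical names.

```cpp
#include <algorithm>

// The division as described: with fthreads < threads it truncates to 0,
// leaving a standalone child with no evaluator workers to wait on.
int naive_child_fthreads(int fthreads, int threads) {
    return fthreads / threads;
}

// Assumed fix: never hand a child fewer than one evaluator worker.
int child_fthreads(int fthreads, int threads) {
    if (threads <= 0) return std::max(1, fthreads);  // guard against divide-by-zero
    return std::max(1, fthreads / threads);
}
```

With the new prime-mode default of #fthreads = 1 and #threads = 8, the naive division gives 0 while the clamped version gives 1.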

FLAME's entry logic moved from thread.cpp to flame_main() in
functions.cpp so zygote workers can run it in-process; thread.cpp keeps
a thin wrapper.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- FIRE6/scaling/{README,TECHNICAL,USAGE}.md: mechanisms, measurements,
  negative results, build & config guide
- FIRE6/scaling/reports/: self-contained HTML benchmark reports
- FIRE6/extra/fuel-flint-prime-fixes.patch: required companion fix for
  the fuel submodule (#calc flint correctness in prime mode); apply with
  git -C FIRE6/extra/fuel apply ../fuel-flint-prime-fixes.patch
  before 'make dep'

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>