Prime-mode scaling: zygote + startup / backward-fusion / DB / allocator latency wins (byte-identical) #4

Open
wanweilin wants to merge 4 commits into thibautbar:master from wanweilin:claude/prime-scaling-perf
@wanweilin

Prime-mode (FIRE6p, modular arithmetic) single-integral latency work. This is a self-contained superset of #3 — it carries #3's two commits (in-process FLINT backend fixes, the zygote fork-server, warm parse, #fthreads=1 default, the two deadlock fixes, and the FIRE6/scaling/ docs + fuel patch) and adds four more hot-path / startup wins below. If you prefer to take this one, #3 can be closed.

All of the additions are output-neutral: with identical inputs the patched binary produces reduction tables that are SHA-256-identical to the stock binary. Re-verified this round on grav2l (9D 2-loop), tri2l (7D 2-loop) and p3lLA (lbases-symmetric), in the sequential scored mode and with FIRE_BWD_FUSED=1.

What the two new commits add

1. Startup latency + backward-wave fusion

  • Remove the write-only needed_for[] array (common.h). It was a static set<sector_count_t>[MAX_SECTORS+1] in a header with a single insert and no reader, so every TU built and tore down its own MAX_SECTORS empty std::sets at process start and exit — a large slice of the fixed exec→main + main→exit cost on small reductions.
  • Opt-in startup/exit gates: FIRE_SKIP_VERSION (skip the git log popen), FIRE_FAST_EXIT (_exit() after tables/DBs are durable, reaping the zygote first so children-rusage stays intact), FIRE_SKIP_CLEAN (leave per-run tmpfs DBs for an external harness).
  • FIRE_BWD_FUSED=1 (opt-in): replaces the per-(level, half-wave) full barrier in backward substitution with per-dependency waits. A sector waits only for the specific lowers whose backward task is in flight; planned-but-undispatched lowers are same/future-wave and the unfused path reads their forward-final tables anyway, so the result is byte-identical by construction.
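
To make the fused wait rule concrete, here is a minimal sketch of how the wait set could be chosen. It is not the code in this PR: `BwdState`, `fused_wait_set` and the integer sector indices are hypothetical names, and the real scheduler tracks more state than three values.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-sector backward state (illustration only).
enum class BwdState { Planned, InFlight, Done };

// Fused policy: of the lower sectors this sector depends on, wait only for
// those whose backward task is currently running. Planned lowers are read
// from their forward-final tables, and Done lowers need no wait at all.
std::vector<int> fused_wait_set(const std::vector<int>& lowers,
                                const std::vector<BwdState>& state) {
    std::vector<int> wait_for;
    for (int sector : lowers) {
        if (state[sector] == BwdState::InFlight) wait_for.push_back(sector);
    }
    return wait_for;
}
```

Under a full barrier the same sector would wait for every task of the previous half-wave, whether or not it depends on it; here the wait set is never larger than the sector's own in-flight lowers.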

2. DB slot count, LRU-rotation off, term-list allocator pool

  • CacheDB SLOTNUM 16 → 64 (kccachedb.h.in + .threadsafe.in): more slot tables cut intra-slot contention / rehash cost on the per-sector prime-mode DBs.
  • Disable LRU rotation on the in-memory sector DBs (switch_rotation(false)): capacity is unlimited (nothing is evicted), so the per-read splice to the LRU tail is pure overhead — and it makes every get() a write. Off → reads are read-only.
  • Term-list node allocator → FSBAllocator2 pool (equation.h ALLOCATOR2 + FSBAllocator.hh): the std::list<pair<point,COEFF>> behind add_to/descent/split was on the default std::allocator (a malloc per node) while the analogous map ALLOCATOR1 already used the pool. std::list needs an equality-comparable allocator (added operator==/!=), and the pool statics are made thread_local (per-thread free list — no lock/atomic; each FLAME worker is single-threaded, so it is a pure malloc→pool speedup).
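
For readers unfamiliar with the allocator requirements mentioned in the last item, the sketch below shows the general shape: a stateless allocator with a per-thread free list and the `operator==`/`operator!=` pair that `std::list` needs. It is an illustration under assumptions, not `FSBAllocator2` itself; `PoolAllocator`, `pool_reuse_count` and `reuse_after_refill` are made-up names, and the sketch never returns pooled memory to the system.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <new>
#include <utility>

// Hypothetical counter, per thread, of allocations served from a free list.
std::size_t& pool_reuse_count() {
    thread_local std::size_t count = 0;
    return count;
}

template <class T>
struct PoolAllocator {
    using value_type = T;

    PoolAllocator() noexcept = default;
    template <class U>
    PoolAllocator(const PoolAllocator<U>&) noexcept {}

    // A slot is either a live T or a link in the free list.
    union Slot {
        Slot* next;
        alignas(T) unsigned char bytes[sizeof(T)];
    };

    // One free list per thread and per T: no lock, no atomic.
    static Slot*& free_head() {
        thread_local Slot* head = nullptr;
        return head;
    }

    T* allocate(std::size_t n) {
        if (n != 1) return static_cast<T*>(::operator new(n * sizeof(T)));
        if (Slot* slot = free_head()) {
            free_head() = slot->next;
            ++pool_reuse_count();
            return reinterpret_cast<T*>(slot);
        }
        return static_cast<T*>(::operator new(sizeof(Slot)));
    }

    void deallocate(T* p, std::size_t n) noexcept {
        if (n != 1) { ::operator delete(p); return; }
        Slot* slot = reinterpret_cast<Slot*>(p);
        slot->next = free_head();   // push the node back for reuse
        free_head() = slot;
    }
};

// Stateless allocator: every instance can free what any other allocated.
template <class T, class U>
bool operator==(const PoolAllocator<T>&, const PoolAllocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const PoolAllocator<T>&, const PoolAllocator<U>&) noexcept { return false; }

using TermList = std::list<std::pair<int, long>, PoolAllocator<std::pair<int, long>>>;

// Fills a list, clears it, refills it, and reports how many nodes of the
// second fill came from the free list instead of operator new.
std::size_t reuse_after_refill(int n) {
    TermList terms;
    for (int i = 0; i < n; ++i) terms.emplace_back(i, i * 2L);
    terms.clear();
    const std::size_t before = pool_reuse_count();
    for (int i = 0; i < n; ++i) terms.emplace_back(i, i * 2L);
    return pool_reuse_count() - before;
}
```

Calling `reuse_after_refill(100)` on a single thread should report that all 100 nodes of the second fill were recycled, which is the malloc-per-node cost the pool avoids.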

Scope / honesty

  • Targets prime mode on Linux; non-prime binaries build and run but weren't the optimization target.
  • The FIRE_* gates and FIRE_BWD_FUSED are opt-in (default off); SLOTNUM, rotation-off and the allocator pool are always on and output-neutral.
  • Deliberately not included here: an experimental intra-sector elimination thread-pool and a zero-copy get() path — they're entangled with profiling/slot-lock scaffolding and aren't a safe default win at benchmark sizes, so they're left for a separate, cleanly-isolated change.

🤖 Generated with Claude Code

wanweilin and others added 4 commits June 10, 2026 13:37
… deadlock fixes, profiling

End-to-end effect together with the companion fuel patch (#calc flint)
and a tmpfs #database: heaviest benchmark (9-propagator 2-loop,
~120k-step Laporta) goes 8.98 s -> 0.91 s at 1 core and 4.93 s -> 0.38 s
at 8 cores vs stock 6.5.2 (fermat backend), with byte-identical
reduction tables. Details: documentation/scaling/ (next commit).

- prime-mode default #fthreads=1 (#ifdef PRIME): the evaluator pool is
  unused in prime mode (fuel_time == 0); defaulting it to #threads forks
  N idle fer64 processes per invocation and dominates short reductions.
  An explicit #fthreads is still honored.
- zygote fork-server (functions.cpp): fork a single-threaded zygote at
  main entry; per-sector workers fork from it and run flame_main()
  in-process (standalone --thread -1), skipping execv, ~8 ms of C++
  static init, and config re-parse. SOCK_SEQPACKET + SCM_RIGHTS
  done-pipe protocol; explicit reaping keeps children-max-RSS
  accounting truthful. Revert at runtime: FIRE_NO_ZYGOTE=1.
- warm zygote: the zygote pre-parses the sector-independent IBP
  templates once; workers skip parse_config entirely and load only
  their own sector's #lbases (loading other sectors' rule bases CHANGES
  the reduction - see TECHNICAL.md). Revert: FIRE_NO_WARM_ZYGOTE=1.
- two latent deadlock fixes (independently useful on stock FIRE):
  (1) child-side fthreads /= threads can yield 0 evaluator workers and
  hang standalone children; (2) lost-wakeup race in f-worker teardown -
  f_stop was written without holding f_submit_mutex; fix = atomic flag
  + empty-critical-section fence before notify_all (3 sites).
- FIRE_PROFILE=1 instrumentation: per-sector phase timings (sort/
  apply/fwd/split/bksub, point-table get/add), sector dependency edges,
  parent barrier/serial breakdown.
- opt-in in-memory sector table (FIRE_MEMTABLE=1): measured NEUTRAL
  (kyotocabinet CacheDB is already in-memory); kept for experiments.
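
The lost-wakeup fix in (2) follows a common pattern, sketched here under
assumptions. `WorkerPool`, `wait_for_work`, `request_stop` and
`run_and_stop` are hypothetical names, not the ones in functions.cpp;
only the atomic flag plus empty critical section before notify_all is
taken from the description above.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the f-worker state.
struct WorkerPool {
    std::mutex submit_mutex;
    std::condition_variable cv;
    std::atomic<bool> stop{false};
    int pending = 0;  // guarded by submit_mutex

    // Worker side: returns true when woken for work, false when told to stop.
    bool wait_for_work() {
        std::unique_lock<std::mutex> lock(submit_mutex);
        cv.wait(lock, [this] { return pending > 0 || stop.load(); });
        if (pending > 0) { --pending; return true; }
        return false;
    }

    // Teardown side: set the flag, pass through an empty critical section,
    // then notify. A worker that already saw stop == false still holds
    // submit_mutex until it blocks inside wait(), so taking the mutex here
    // guarantees the notify arrives after that worker is waiting.
    void request_stop() {
        stop.store(true);
        { std::lock_guard<std::mutex> fence(submit_mutex); }
        cv.notify_all();
    }
};

// Starts n idle workers, stops them, and returns how many exited.
int run_and_stop(int n) {
    WorkerPool pool;
    std::atomic<int> exited{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i) {
        threads.emplace_back([&] {
            while (pool.wait_for_work()) {}
            ++exited;
        });
    }
    pool.request_stop();
    for (auto& t : threads) t.join();
    return exited.load();
}
```

Without the fence, a notify_all issued between a worker's predicate check
and its block in wait() would be missed and join() could hang.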

FLAME's entry logic moved from thread.cpp to flame_main() in
functions.cpp so zygote workers can run it in-process; thread.cpp keeps
a thin wrapper.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- FIRE6/scaling/{README,TECHNICAL,USAGE}.md: mechanisms, measurements,
  negative results, build & config guide
- FIRE6/scaling/reports/: self-contained HTML benchmark reports
- FIRE6/extra/fuel-flint-prime-fixes.patch: required companion fix for
  the fuel submodule (#calc flint correctness in prime mode); apply with
  git -C FIRE6/extra/fuel apply ../fuel-flint-prime-fixes.patch
  before 'make dep'

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Startup/structure wins on top of the Day-1 zygote work, all output-neutral
(reduction tables SHA-256-identical, re-verified on grav2l / tri2l / p3lLA):

- Remove the write-only `needed_for[]` header array. It was a
  `static set<sector_count_t>[MAX_SECTORS+1]` in common.h with a single insert
  and no reader anywhere, so every translation unit constructed and destructed
  its own ~MAX_SECTORS empty std::sets at process startup AND exit -- a large
  share of the fixed exec->main + main->exit cost on small reductions.

- Startup / exit gates (env, opt-in, default off): FIRE_SKIP_VERSION skips the
  `git log` popen; FIRE_FAST_EXIT _exit()s once the tables and databases are
  durable (reaping the zygote first so the children-rusage accounting stays
  intact); FIRE_SKIP_CLEAN leaves the per-run tmpfs databases for an external
  harness to remove. zygote_shutdown() is exposed (functions.h) for FAST_EXIT.

- Backward-substitution wave fusion (FIRE_BWD_FUSED=1, opt-in): replaces the
  per-(level, half-wave) full barrier with per-dependency waits -- a sector's
  point transfer waits only for the specific lower sectors whose backward task
  is in flight (state 2). Planned-but-undispatched lowers are same-/future-wave
  and the unfused path reads their forward-final tables anyway, so the fused
  result is byte-identical by construction.
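
The startup/exit gates above amount to plain environment checks. A minimal
sketch of how such gates could be read follows; `env_gate`, `ExitPlan` and
`read_exit_plan` are hypothetical names, and the rule that an empty value
or "0" counts as off is an assumption, not something this commit states.

```cpp
#include <cassert>
#include <stdlib.h>
#include <string.h>

// Assumption: a gate is on when the variable is set to anything other
// than the empty string or "0".
bool env_gate(const char* name) {
    const char* value = getenv(name);
    return value != nullptr && value[0] != '\0' && strcmp(value, "0") != 0;
}

// Hypothetical bundle of the three opt-in gates, all off by default.
struct ExitPlan {
    bool skip_version;  // skip the `git log` popen at startup
    bool fast_exit;     // _exit() once tables and databases are durable
    bool skip_clean;    // leave per-run tmpfs databases for the harness
};

ExitPlan read_exit_plan() {
    ExitPlan plan;
    plan.skip_version = env_gate("FIRE_SKIP_VERSION");
    plan.fast_exit = env_gate("FIRE_FAST_EXIT");
    plan.skip_clean = env_gate("FIRE_SKIP_CLEAN");
    return plan;
}
```

With fast_exit set, the order described above still applies: reap the
zygote first, then _exit().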

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Three reduction hot-path wins, all output-neutral (tables SHA-256-identical):

- CacheDB SLOTNUM 16 -> 64 (kccachedb.h.in + kccachedb.h.threadsafe.in): with
  the per-sector prime-mode databases, more slot tables cut intra-slot
  contention and rehash cost on the descent-path get()s.

- Disable LRU rotation in the in-memory sector databases (common.cpp,
  switch_rotation(false)): capacity is unlimited here (nothing is ever evicted),
  so the per-read record splice to the LRU tail is pure overhead -- and it turns
  every get() into a write. With rotation off a get() is purely read-only.

- Term-list node allocator -> FSBAllocator2 pool (equation.h ALLOCATOR2 +
  FSBAllocator.hh): the std::list<pair<point,COEFF>> behind add_to / descent /
  split was using the default std::allocator (a malloc per node), while the
  analogous map type ALLOCATOR1 already used the pooled FSBAllocator2. Two small
  header fixes: std::list (unlike std::map) needs an equality-comparable
  allocator (added operator==/!=), and the pool statics are made thread_local so
  the free list is per-thread -- no lock, no atomic on the hot path. Each FLAME
  worker is single-threaded, so this is a pure malloc->pool speedup.
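
Why rotation turns a read into a write can be seen in a toy cache. This is
an illustration only, not Kyoto Cabinet's implementation; `TinyCache` and
`reads_that_write` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Toy cache: records live in a list ordered from oldest to newest.
class TinyCache {
public:
    explicit TinyCache(bool rotate) : rotate_(rotate) {}

    void set(const std::string& key, int value) {
        auto it = index_.find(key);
        if (it != index_.end()) { it->second->second = value; return; }
        order_.emplace_back(key, value);
        index_[key] = std::prev(order_.end());
    }

    // Returns true on a hit. With rotation on, every hit splices the record
    // to the tail, so the read mutates the shared list.
    bool get(const std::string& key, int& out) {
        auto it = index_.find(key);
        if (it == index_.end()) return false;
        out = it->second->second;
        if (rotate_) {
            order_.splice(order_.end(), order_, it->second);
            ++writes_on_read_;
        }
        return true;
    }

    std::size_t writes_on_read() const { return writes_on_read_; }
    const std::string& oldest() const { return order_.front().first; }

private:
    using Record = std::pair<std::string, int>;
    bool rotate_;
    std::size_t writes_on_read_ = 0;
    std::list<Record> order_;
    std::unordered_map<std::string, std::list<Record>::iterator> index_;
};

// Performs `reads` lookups of one key and counts the list mutations caused.
std::size_t reads_that_write(bool rotate, int reads) {
    TinyCache cache(rotate);
    cache.set("a", 1);
    cache.set("b", 2);
    int value = 0;
    for (int i = 0; i < reads; ++i) cache.get("a", value);
    return cache.writes_on_read();
}
```

With unlimited capacity the recency order is never consulted for eviction,
so the splices buy nothing.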

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>