feat(planmemory): first-fit-decreasing buffer ordering (opt-in) #1
Closed
tonibohnlein wants to merge 1 commit into
The local-memory allocator processes buffers in a DMA-first order (VEC only) and otherwise in generation order. For the heterogeneous buffer sizes real kernels produce, first-fit-decreasing (largest-first) order packs tighter -- it is the ordering XLA, TVM and SOMAS all use. Add a default-off `order-by-size` pass option (CLI flag --plan-memory-order-by-size). When enabled, the reuse path sorts buffers largest-first across every memory space, keeping ping-pong pairs contiguous. Default behavior is unchanged and uniform-size instances are untouched (stable sort), so existing tests are unaffected. Measured over the TileLang ST suite + JIT kernels (213 files): one space-peak reduction (-32KB), zero regressions. New lit test exercises the reuse path: the largest tile is placed at offset 0 only with the flag.
Force-pushed from 0ec4b6a to f8d878c.
Superseded by the upstream PR hw-native-sys#885 (same branch, rebased onto latest main).
Summary
Adds an opt-in `order-by-size` option to PlanMemory's local allocator that
processes buffers largest-first (first-fit-decreasing) instead of the current
DMA-first / generation order. Default behavior is unchanged.

- ptoas CLI flag: `--plan-memory-order-by-size`
- Pass option: `order-by-size` (default `false`)

Why: theoretical argument
On-chip buffer allocation is Dynamic Storage Allocation (DSA): pack buffers,
each with a fixed live interval and a size, into a fixed-capacity strip while
minimizing the peak (lower bound = `LOAD`, the max simultaneously-live size).

With uniform sizes this is interval-graph colouring (greedy is optimal); with
the heterogeneous sizes real kernels produce (mixed dtypes, reductions,
asymmetric tiles) DSA is NP-hard, and the quality of a first-fit allocator
depends strongly on the order buffers are placed:

- Largest-first (first-fit-decreasing) has the better worst-case
  bound (bin packing: `FFD ≤ 11/9·OPT` vs first-fit's `17/10·OPT`, both
  asymptotic ratios) and is the ordering used by XLA (best-fit-decreasing
  heap simulation), TVM USMP (greedy-by-size), and MindSpore SOMAS.
  PlanMemory was the outlier: it ordered by DMA-touch (VEC only) and
  otherwise by generation order.
This change brings PlanMemory's ordering in line with the established baselines,
at zero risk to default behaviour.
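
To make the order dependence concrete, here is a toy first-fit allocator over half-open live intervals. It is a sketch for illustration only, not PlanMemory's implementation: `Buf`, `first_fit`, `load`, and the three example buffers are all made up for this note.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Buf:
    name: str
    size: int   # illustrative units (e.g. KB)
    start: int  # first step at which the buffer is live (inclusive)
    end: int    # first step at which it is dead (exclusive)


def overlaps(a, b):
    """True if the two half-open live intervals intersect."""
    return a.start < b.end and b.start < a.end


def first_fit(bufs):
    """Place buffers in the given order, each at the lowest offset that does
    not collide with an already-placed buffer whose lifetime overlaps.
    Returns (offsets by name, peak)."""
    placed = []  # (buffer, offset) pairs, in placement order
    for b in bufs:
        # Address ranges that are off-limits for b, sorted by start offset.
        blocked = sorted((off, off + p.size) for p, off in placed if overlaps(p, b))
        off = 0
        for lo, hi in blocked:
            if off + b.size <= lo:  # the gap below this range is big enough
                break
            off = max(off, hi)      # otherwise skip past the blocked range
        placed.append((b, off))
    offsets = {b.name: off for b, off in placed}
    peak = max(off + b.size for b, off in placed)
    return offsets, peak


def load(bufs):
    """Max total size live at one time; checking interval starts suffices."""
    return max(sum(x.size for x in bufs if x.start <= b.start < x.end) for b in bufs)


# Hypothetical instance: two small tiles followed by one large tile.
generation_order = [Buf("a", 16, 0, 2), Buf("b", 16, 1, 4), Buf("c", 32, 3, 6)]
largest_first = sorted(generation_order, key=lambda b: b.size, reverse=True)

_, peak_generation = first_fit(generation_order)
_, peak_largest = first_fit(largest_first)
```

On this instance `load(generation_order)` is 48. Generation order stacks `b` on top of `a`, so `c` no longer fits below `b` and the peak is 64; largest-first places `c` at offset 0 and reaches the lower bound of 48.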
What changed
- `GetSizeOrderedRootStorageEntry`: when enabled, stable-sort buffers by
  decreasing size across all memory spaces (the existing DMA-first reorder is
  VEC-only), keeping ping-pong (double-buffer) pairs contiguous.
- A default-off pass option (`Passes.td`) + a ptoas CLI flag.
- Takes effect only on the reuse path (Σ buffer sizes ≥ capacity); a clique
  that fits still takes the sequential fast path, where there is no peak to
  save, so the option is a no-op exactly where it cannot help.
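
The ordering and its gating can be sketched as follows. This is a hedged Python illustration, not the C++ in the PR: the `(name, size, pair_id)` tuples, `order_by_size`, and `placement_order` are hypothetical, the sort key for a pair (its larger member) is an assumption, and the default path here is plain generation order (the VEC-only DMA-first reorder is omitted).

```python
def order_by_size(entries):
    """Stable-sort entries by decreasing size, keeping ping-pong pairs adjacent.

    entries: list of (name, size, pair_id) in generation order, where pair_id
    is None for an unpaired buffer (hypothetical representation).
    """
    units = []         # each unit is a list of entries that must stay adjacent
    unit_of_pair = {}  # pair_id -> the unit holding that pair
    for entry in entries:
        _name, _size, pair = entry
        if pair is not None and pair in unit_of_pair:
            unit_of_pair[pair].append(entry)
            continue
        unit = [entry]
        units.append(unit)
        if pair is not None:
            unit_of_pair[pair] = unit
    # list.sort is stable (also with reverse=True), so equal-size units keep
    # their generation order and uniform-size inputs come back unchanged.
    units.sort(key=lambda unit: max(size for _, size, _ in unit), reverse=True)
    return [entry for unit in units for entry in unit]


def placement_order(entries, capacity, order_by_size_enabled=False):
    """Reorder only when the flag is on and the buffers cannot all coexist."""
    total = sum(size for _, size, _ in entries)
    if not order_by_size_enabled or total < capacity:
        return list(entries)  # flag off, or everything fits: order untouched
    return order_by_size(entries)


# Hypothetical buffers: total size is 112.
entries = [
    ("t0", 8, None),
    ("ping", 16, "pp"),
    ("pong", 16, "pp"),
    ("t1", 64, None),
    ("t2", 8, None),
]
reordered = placement_order(entries, capacity=96, order_by_size_enabled=True)
fits = placement_order(entries, capacity=256, order_by_size_enabled=True)
```

With a capacity of 96 the buffers cannot all coexist, so `reordered` starts with `t1` followed by the `ping`/`pong` pair; with a capacity of 256 everything fits and `fits` is the original generation order.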
Results
Measured over the TileLang ST suite + JIT kernels (213 files):
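
- One space-peak reduction (-32KB, `tsort` kernel), zero regressions.
- Default behaviour is unchanged and uniform-size instances are untouched
  (stable sort), so all 16 existing `plan_memory_*` lit tests pass unchanged.
- New lit test `plan_memory_order_by_size_reuse.pto` exercises the reuse path:
  the largest tile is placed at offset 0 only with the flag.
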
The on-corpus win is small because the available test corpus is dominated by
fitting cliques and tiny per-op kernels; the benefit appears only under forced
reuse + heterogeneity, which is where larger real kernels live.
Scope / follow-ups
- … impact: tighter packing increases buffer reuse, which can add
  synchronization (the memory↔sync coupling).
- … aliasing/donation, and InEx (half-open) lifetime semantics.