feat(planmemory): first-fit-decreasing buffer ordering (opt-in)#885
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces an option to order buffers by size (largest-first) during local memory planning, which helps pack heterogeneous-size buffers tighter. The review feedback correctly identifies a potential issue when reordering the root storage entry: the old root's children are not cleared, which could lead to cycles or stale references, and the memscope2rootStorageEntry map is not updated to reflect the new root. Implementing the suggested fix will prevent downstream analysis errors.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| StorageEntry *reorderedRootStorageEntry = entries[0]; | ||
| reorderedRootStorageEntry->mergedChildren.clear(); | ||
| for (size_t j = 1; j < entries.size(); ++j) { | ||
| reorderedRootStorageEntry->mergedChildren.push_back(entries[j]); | ||
| } | ||
| return reorderedRootStorageEntry; |
There was a problem hiding this comment.
When the root storage entry of a scope changes after sorting, the old root becomes a child of the new root. If we do not clear the mergedChildren of the old root, it will still contain references to the other entries (including the new root), creating potential cycles or stale child references in the tree. Clearing mergedChildren for all entries in entries before rebuilding the new root's children prevents this.
Additionally, the memscope2rootStorageEntry map is used by other methods (such as RecordOverflowIfAny and PrintSuccessfulAllocatedMaxBits) to find the root storage entry of a scope. If we do not update this map when the root changes, those methods will continue to operate on the old root, leading to incorrect or incomplete analysis.
for (auto *entry : entries) {
entry->mergedChildren.clear();
}
StorageEntry *reorderedRootStorageEntry = entries[0];
for (size_t j = 1; j < entries.size(); ++j) {
reorderedRootStorageEntry->mergedChildren.push_back(entries[j]);
}
memscope2rootStorageEntry[reorderedRootStorageEntry->bufInfo->bufferScope] = reorderedRootStorageEntry;
return reorderedRootStorageEntry;There was a problem hiding this comment.
Thanks — applied. The new root now clears every entry's mergedChildren before rebuilding the flat child list, and updates memscope2rootStorageEntry to point at the new root.
Worth noting the two changes must go together: clearing the old root's children without updating the map would leave RecordOverflowIfAny reading the stale root with an empty child list (under-reporting the peak). With both applied, the scope's root + children stay consistent. (In the prior version the old root kept its complete child list, so overflow detection was still correct — but the tree had a stale back-reference; this makes it well-formed.)
The local-memory allocator processes buffers in a DMA-first order (VEC only) and otherwise in generation order. For the heterogeneous buffer sizes real kernels produce, first-fit-decreasing (largest-first) order packs tighter -- it is the ordering XLA, TVM and SOMAS all use. Add a default-off `order-by-size` pass option (CLI flag --plan-memory-order-by-size). When enabled, the reuse path sorts buffers largest-first across every memory space, keeping ping-pong pairs contiguous. Default behavior is unchanged and uniform-size instances are untouched (stable sort), so existing tests are unaffected. Measured over the TileLang ST suite + JIT kernels (213 files): one space-peak reduction (-32KB), zero regressions. New lit test exercises the reuse path: the largest tile is placed at offset 0 only with the flag.
f8d878c to
bf04064
Compare
Codex Review该评论由 review 机器人自动更新。
SummaryReview failed at stage Findings未生成结构化 findings,因为 review 过程提前失败。 Log Tail |
PlanMemAddressOfWholeLocalBuffer() exits through IsEnoughForBuffersNoReuse() for any scope that fits without reuse, and that fast path (PlanBuffersWithoutReuse) never reordered buffers. As a result --plan-memory-order-by-size was silently a no-op for kernels that fit in local memory, contradicting the option's largest-first placement contract (Codex review P3). Reorder largest-first on the no-reuse path as well, so the option means the same thing regardless of whether reuse kicks in. This is a pure offset reassignment: the required-size sum is order-independent so the fit check is unaffected, and the stable sort keeps uniform-size scopes byte-identical to the default. Peak is unchanged on this path (there is no peak to save when a scope fits) -- the value is a deterministic, consistent layout. Add plan_memory_order_by_size_noreuse.pto covering the fast path: three heterogeneous UB tiles (48KB total, well under the 192KB budget). Default leaves offset 0 to the small first-generated input tile; order-by-size places the largest tile at offset 0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Re: P3 —
|
zhangstevenunity
left a comment
There was a problem hiding this comment.
Review: opt-in first-fit-decreasing buffer ordering
Verdict: no blocker. The mechanism is sound and the default path is provably untouched. I traced the full local-planning flow (GetReorderRootStorageEntry / IsEnoughForBuffersNoReuse / PlanReusableLocalBuffer / PlanMemAddressForLevel0 -> GetSizeOrderedRootStorageEntry, plus the RecordOverflowIfAny / PrintSuccessfulAllocatedMaxBits / UpdateBuffer2Offsets consumers) and checked both new lit tests against the a3 memory spec.
What I confirmed:
- Default behavior is genuinely unchanged. Every new code path is gated on
orderBySize, which defaults false; GetSizeOrderedRootStorageEntry is only reachable with the flag set. Uniform-size scopes stay byte-identical (stable sort). - The reuse test really forces the reuse path on a3. src(1x8192xf32)+idx(1x8192xui32)+dst(1x32768xf32) = 262144+262144+1048576 = 1572864 bits, which is exactly kA3.ubSpaceSize (1572864). IsEnoughForBuffersNoReuse uses
required < capacity, so1572864 < 1572864is false -> reuse path. The no-reuse test (48KB << 192KB) takes the fast path. Good separation, and the reuse case fits UB exactly. - Both IRs verify. tsort32 imposes no dst/src size-ratio constraint (only same elem type f16/f32 + idx i32/u32), and the shorthand
tile_buf<vec, NxM...>form parses. The BYSIZE / DEFAULT-NOT patterns match the realpto.pointer_cast(%cN_i64) : memref<...>output. - Wiring is consistent. Only one MemPlan construction site exists; the CLI flag -> pass option -> ctor arg is threaded correctly.
Two non-blocking notes (one inline on the map write).
The FFD guarantee does not transfer to DSA
The description justifies the ordering with the bin-packing bound (FFD <= 11/9 OPT). That bound is for classic bin packing, where items have only a size. The problem here is Dynamic Storage Allocation: buffers carry a size AND a live interval. For DSA, decreasing-size-first is a good heuristic but has no constant-factor guarantee and can produce a HIGHER peak than the default order on some instances. So enabling the flag can regress the space peak (or even overflow a scope that fit by default) on kernels outside the 213-file corpus. The opt-in / default-off framing and the "measure sync impact before flipping" caveat keep this safe in practice; I would just soften the theoretical claim in the description so nobody flips the default assuming a proven bound.
Pre-existing (optional follow-up)
The default VEC reuse path (GetReorderRootStorageEntry with the flag off) swaps the root but never updates memscope2rootStorageEntry, so RecordOverflowIfAny / PrintSuccessfulAllocatedMaxBits read a root whose alignedConstBits omits its now-nonzero bitsOffset -- a minor peak under-count. This PR fixes that for the order-by-size path (the map write + clear-all-children) but leaves it on the default path. Not introduced here; worth a follow-up if that accounting matters.
| // PrintSuccessfulAllocatedMaxBits) read the new root and its full child list. | ||
| // This must accompany the clear above: clearing children without updating the | ||
| // map would leave the stale root pointing at an empty child list. | ||
| memscope2rootStorageEntry[reorderedRootStorageEntry->bufInfo->bufferScope] = |
There was a problem hiding this comment.
Non-blocking (latent fragility): this writes memscope2rootStorageEntry[scope] while the same map is being range-iterated in PlanMemAddressOfWholeLocalBuffer (for (auto &it : memscope2rootStorageEntry)), reached through both IsEnoughForBuffersNoReuse and PlanReusableLocalBuffer / PlanMemAddressForLevel0.
It is safe today only because scope is always the key already being visited (every entry in a scope shares that scope, and the key was inserted in MergeSameScopeSE), so DenseMap::operator[] hits an existing bucket and never inserts or reallocates -- the active iterator stays valid. That is an implicit invariant. If a future change ever routes a scope not yet in the map through this path, operator[] would insert, possibly rehash, and invalidate the in-flight range-for iterator (UB). Cheap hardening: assign through the iterator the caller already holds, or assert memscope2rootStorageEntry.count(scope) here before writing.
A3 板测完成(有跳过)
|
Summary
Adds an opt-in
order-by-sizeoption to PlanMemory's local allocator thatprocesses buffers largest-first (first-fit-decreasing) instead of the current
DMA-first / generation order. Default behavior is unchanged.
--plan-memory-order-by-sizeorder-by-size(defaultfalse)Why — theoretical argument
On-chip buffer allocation is Dynamic Storage Allocation (DSA): pack buffers,
each with a fixed live interval and a size, into a fixed-capacity strip while
minimizing the peak (lower bound =
LOAD, the max simultaneously-live size).With uniform sizes this is interval-graph colouring (greedy is optimal); with
the heterogeneous sizes real kernels produce (mixed dtypes, reductions,
asymmetric tiles) DSA is NP-hard, and the quality of a first-fit allocator
depends strongly on the order buffers are placed:
bound (bin packing:
FFD ≤ 11/9·OPTvs first-fit's17/10·OPT) and is theordering used by XLA (best-fit-decreasing heap simulation), TVM USMP
(greedy-by-size), and MindSpore SOMAS. PlanMemory was the outlier — it
ordered by DMA-touch (VEC only) and otherwise by generation order.
This change brings PlanMemory's ordering in line with the established baselines,
at zero risk to default behaviour.
What changed
GetSizeOrderedRootStorageEntry: when enabled, stable-sort buffers bydecreasing size across all memory spaces (the existing DMA-first reorder is
VEC-only), keeping ping-pong (double-buffer) pairs contiguous.
Passes.td) + a ptoas CLI flag.thing regardless of whether reuse kicks in:
Σ buffer sizes ≥ capacity): decreasing-sizeorder can lower the space peak — this is where the option pays off.
deterministic largest-first layout. The peak is unchanged here (a fitting
clique has no peak to save), but the placement now honors the largest-first
contract instead of falling back to generation order.
Results
Measured over the TileLang ST suite + JIT kernels (213 files):
tsortkernel)so all 16 existing
plan_memory_*lit tests pass unchanged.plan_memory_order_by_size_reuse.ptoexercises the reuse pathand
plan_memory_order_by_size_noreuse.ptoexercises the no-reuse fast path;in both, the largest tile is placed at offset 0 only with the flag.
The on-corpus win is small because the available test corpus is dominated by
fitting cliques and tiny per-op kernels; the benefit appears only under forced
reuse + heterogeneity, which is where larger real kernels live.
Scope / follow-ups
impact — tighter packing increases buffer reuse, which can add synchronization
(the memory↔sync coupling).
aliasing/donation, and InEx (half-open) lifetime semantics.