feat(planmemory): first-fit-decreasing buffer ordering (opt-in) #885

Merged
zhangstevenunity merged 2 commits into hw-native-sys:main from tonibohnlein:planmem-order-by-size on Jul 2, 2026

@tonibohnlein tonibohnlein commented Jun 30, 2026

Contributor

Summary

Adds an opt-in order-by-size option to PlanMemory's local allocator that
processes buffers largest-first (first-fit-decreasing) instead of the current
DMA-first / generation order. Default behavior is unchanged.

  • CLI: --plan-memory-order-by-size
  • Pass option: order-by-size (default false)

Why — theoretical argument

On-chip buffer allocation is Dynamic Storage Allocation (DSA): pack buffers,
each with a fixed live interval and a size, into a fixed-capacity strip while
minimizing the peak (lower bound = LOAD, the max simultaneously-live size).
With uniform sizes this is interval-graph colouring (greedy is optimal); with
the heterogeneous sizes real kernels produce (mixed dtypes, reductions,
asymmetric tiles) DSA is NP-hard, and the quality of a first-fit allocator
depends strongly on the order buffers are placed:

  • Arbitrary / generation order has no constant-factor guarantee.
  • Decreasing-size order is the basis of the classic first-fit-decreasing
    bound (bin packing, asymptotically: FFD ≤ 11/9·OPT vs first-fit's 17/10·OPT) and is the
    ordering used by XLA (best-fit-decreasing heap simulation), TVM USMP
    (greedy-by-size), and MindSpore SOMAS. PlanMemory was the outlier — it
    ordered by DMA-touch (VEC only) and otherwise by generation order.

This change brings PlanMemory's ordering in line with the established baselines,
at zero risk to default behaviour.
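
To make the LOAD lower bound above concrete, here is a minimal sketch that computes it for a set of buffers. It is not PlanMemory code: `LiveBuf` and `computeLoad` are hypothetical names, and lifetimes are assumed to be closed integer intervals.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical buffer model: a size and a closed live interval [start, end].
struct LiveBuf {
  uint64_t size;
  int start;
  int end;
};

// LOAD = max over time points of the total size of buffers live at that point.
// Only start points need checking, because the live total can only grow when
// some buffer starts.
uint64_t computeLoad(const std::vector<LiveBuf> &bufs) {
  uint64_t load = 0;
  for (const LiveBuf &probe : bufs) {
    assert(probe.start <= probe.end && "live interval must be non-empty");
    uint64_t liveAtStart = 0;
    for (const LiveBuf &b : bufs)
      if (b.start <= probe.start && probe.start <= b.end)
        liveAtStart += b.size;
    load = std::max(load, liveAtStart);
  }
  return load;
}
```

For four buffers of sizes 4, 3, 3, 4 with lifetimes [0,1], [1,2], [2,3], [3,4], `computeLoad` returns 7: no allocator can do better than 7 on that instance, whatever order it uses.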

What changed

  • GetSizeOrderedRootStorageEntry: when enabled, stable-sort buffers by
    decreasing size across all memory spaces (the existing DMA-first reorder is
    VEC-only), keeping ping-pong (double-buffer) pairs contiguous.
  • Wired as a default-off pass option (Passes.td) + a ptoas CLI flag.
  • Ordering is applied on both planning paths, so the option means the same
    thing regardless of whether reuse kicks in:
    • Reuse path (entered when Σ buffer sizes ≥ capacity): decreasing-size
      order can lower the space peak — this is where the option pays off.
    • No-reuse fast path (a clique that fits): decreasing-size order gives a
      deterministic largest-first layout. The peak is unchanged here (a fitting
      clique has no peak to save), but the placement now honors the largest-first
      contract instead of falling back to generation order.
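
One way to picture "stable-sort by decreasing size, keeping ping-pong pairs contiguous" is to sort units rather than single buffers. The sketch below is an illustration only, not the code in GetSizeOrderedRootStorageEntry: the `Entry` struct, the `pairId` field, and the choice to rank a pair by its larger half are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical entry: pairId >= 0 marks the two halves of a ping-pong pair,
// pairId < 0 marks an unpaired buffer. Not the real StorageEntry layout.
struct Entry {
  std::string name;
  uint64_t bits;
  int pairId;
};

std::vector<Entry> orderBySizeKeepingPairs(const std::vector<Entry> &entries) {
  // Step 1: group entries into units; both halves of a pair share one unit.
  std::vector<std::vector<Entry>> units;
  std::map<int, std::size_t> unitOfPair;
  for (const Entry &e : entries) {
    if (e.pairId >= 0) {
      auto it = unitOfPair.find(e.pairId);
      if (it != unitOfPair.end()) {
        assert(units[it->second].size() < 2 && "a pair has two halves");
        units[it->second].push_back(e);
        continue;
      }
      unitOfPair[e.pairId] = units.size();
    }
    units.push_back(std::vector<Entry>{e});
  }
  // Step 2: stable-sort units by their largest member, largest first.
  // Stability keeps equal-size units in generation order.
  auto unitBits = [](const std::vector<Entry> &unit) {
    uint64_t bits = 0;
    for (const Entry &e : unit)
      bits = std::max(bits, e.bits);
    return bits;
  };
  std::stable_sort(
      units.begin(), units.end(),
      [&](const std::vector<Entry> &a, const std::vector<Entry> &b) {
        return unitBits(a) > unitBits(b);
      });
  // Step 3: flatten the units back into one placement order.
  std::vector<Entry> ordered;
  for (const std::vector<Entry> &unit : units)
    ordered.insert(ordered.end(), unit.begin(), unit.end());
  return ordered;
}
```

With input order in0(64), ping(128, pair 0), big(256), pong(128, pair 0), small(32), this yields big, ping, pong, in0, small. When every entry has the same size the output equals the input, which is the property the "byte-identical" claim relies on.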

Results

Measured over the TileLang ST suite + JIT kernels (213 files):

metric                   | order-by-size (vs. default)
space-peak regressions   | 0
space-peak improvements  | 1 (−32 KB, −16.7% on a heterogeneous tsort kernel)

  • No degradation. Uniform-size instances are byte-identical (stable sort),
    so all 16 existing plan_memory_* lit tests pass unchanged.
  • Two lit tests: plan_memory_order_by_size_reuse.pto exercises the reuse path
    and plan_memory_order_by_size_noreuse.pto exercises the no-reuse fast path;
    in both, the largest tile is placed at offset 0 only with the flag.

The on-corpus win is small because the available test corpus is dominated by
fitting cliques and tiny per-op kernels; the benefit appears only under forced
reuse + heterogeneity, which is where larger real kernels live.

Scope / follow-ups

  • Default stays off. Flipping it should first measure downstream sync
    impact — tighter packing increases buffer reuse, which can add synchronization
    (the memory↔sync coupling).
  • Complementary next steps from the same baselines: best-fit placement, in-place
    aliasing/donation, and InEx (half-open) lifetime semantics.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces an option to order buffers by size (largest-first) during local memory planning, which helps pack heterogeneous-size buffers tighter. The review feedback correctly identifies a potential issue when reordering the root storage entry: the old root's children are not cleared, which could lead to cycles or stale references, and the memscope2rootStorageEntry map is not updated to reflect the new root. Implementing the suggested fix will prevent downstream analysis errors.

Comment on lines +1645 to +1650
  StorageEntry *reorderedRootStorageEntry = entries[0];
  reorderedRootStorageEntry->mergedChildren.clear();
  for (size_t j = 1; j < entries.size(); ++j) {
    reorderedRootStorageEntry->mergedChildren.push_back(entries[j]);
  }
  return reorderedRootStorageEntry;

Contributor

medium

When the root storage entry of a scope changes after sorting, the old root becomes a child of the new root. If we do not clear the mergedChildren of the old root, it will still contain references to the other entries (including the new root), creating potential cycles or stale child references in the tree. Clearing mergedChildren for all entries in entries before rebuilding the new root's children prevents this.

Additionally, the memscope2rootStorageEntry map is used by other methods (such as RecordOverflowIfAny and PrintSuccessfulAllocatedMaxBits) to find the root storage entry of a scope. If we do not update this map when the root changes, those methods will continue to operate on the old root, leading to incorrect or incomplete analysis.

  for (auto *entry : entries) {
    entry->mergedChildren.clear();
  }
  StorageEntry *reorderedRootStorageEntry = entries[0];
  for (size_t j = 1; j < entries.size(); ++j) {
    reorderedRootStorageEntry->mergedChildren.push_back(entries[j]);
  }
  memscope2rootStorageEntry[reorderedRootStorageEntry->bufInfo->bufferScope] = reorderedRootStorageEntry;
  return reorderedRootStorageEntry;

Contributor Author

Thanks — applied. The new root now clears every entry's mergedChildren before rebuilding the flat child list, and updates memscope2rootStorageEntry to point at the new root.

Worth noting the two changes must go together: clearing the old root's children without updating the map would leave RecordOverflowIfAny reading the stale root with an empty child list (under-reporting the peak). With both applied, the scope's root + children stay consistent. (In the prior version the old root kept its complete child list, so overflow detection was still correct — but the tree had a stale back-reference; this makes it well-formed.)

The local-memory allocator processes buffers in a DMA-first order (VEC
only) and otherwise in generation order. For the heterogeneous buffer
sizes real kernels produce, first-fit-decreasing (largest-first) order
packs tighter -- it is the ordering XLA, TVM and SOMAS all use.

Add a default-off `order-by-size` pass option (CLI flag
--plan-memory-order-by-size). When enabled, the reuse path sorts buffers
largest-first across every memory space, keeping ping-pong pairs
contiguous. Default behavior is unchanged and uniform-size instances are
untouched (stable sort), so existing tests are unaffected.

Measured over the TileLang ST suite + JIT kernels (213 files): one
space-peak reduction (-32KB), zero regressions. New lit test exercises
the reuse path: the largest tile is placed at offset 0 only with the flag.
@tonibohnlein tonibohnlein force-pushed the planmem-order-by-size branch from f8d878c to bf04064 Compare June 30, 2026 10:17
@reedhecre

reedhecre commented Jun 30, 2026

Codex Review

This comment is updated automatically by the review bot.

  • PR: #885 feat(planmemory): first-fit-decreasing buffer ordering (opt-in)
  • Author: tonibohnlein
  • Base/Head: main / planmem-order-by-size
  • Head SHA: 466da8d0c3eb
  • Trigger: new commits pushed to the PR
  • Generated At: 2026-07-01T10:00:52Z
  • Previous Head SHA: bf0406457753
  • Status: failed at codex-review (exit=1)

Summary

Review failed at stage codex-review: exit=1

Findings

No structured findings were generated because the review process failed early.

Log Tail

git clone --branch 'main' --depth 50 'https://github.com/hw-native-sys/PTOAS.git' '/tmp/ptoas-pr-review-monitor/runs/20260701_180035_pr885/repo'
cd '/tmp/ptoas-pr-review-monitor/runs/20260701_180035_pr885/repo'
git fetch origin 'refs/pull/885/head:pr-885' --depth 50
git fetch origin 'main' --depth 50 || true
git checkout -f 'pr-885'
git rev-parse HEAD
git diff --stat 'origin/main...HEAD' || true
Cloning into '/tmp/ptoas-pr-review-monitor/runs/20260701_180035_pr885/repo'...
From https://github.com/hw-native-sys/PTOAS
 * [new ref]           refs/pull/885/head -> pr-885
From https://github.com/hw-native-sys/PTOAS
 * branch              main       -> FETCH_HEAD
Switched to branch 'pr-885'
466da8d0c3eb47e734ede51d2124531d9cae623a
 include/PTO/Transforms/Passes.td                   |  6 ++
 lib/PTO/Transforms/PTOPlanMemory.cpp               | 55 ++++++++++++++++-
 lib/PTO/Transforms/PTOPlanMemory.h                 | 12 +++-
 test/lit/pto/plan_memory_order_by_size_noreuse.pto | 69 ++++++++++++++++++++++
 test/lit/pto/plan_memory_order_by_size_reuse.pto   | 66 +++++++++++++++++++++
 tools/ptoas/ptoas.cpp                              |  8 +++
 6 files changed, 213 insertions(+), 3 deletions(-)
===== END STAGE clone rc=0 @ 2026-07-01 18:00:43 =====

===== STAGE codex-review @ 2026-07-01 18:00:43 =====
set -euo pipefail
cd '/tmp/ptoas-pr-review-monitor/runs/20260701_180035_pr885/repo'
'codex' exec -C '/tmp/ptoas-pr-review-monitor/runs/20260701_180035_pr885/repo' -s read-only -c 'model_provider="codereview"' -c 'model="gpt-5.4"' -c 'model_reasoning_effort="xhigh"' --output-schema '/tmp/ptoas-pr-review-monitor/runs/20260701_180035_pr885/review_schema.json' -o '/tmp/ptoas-pr-review-monitor/runs/20260701_180035_pr885/codex_last_message.json' --color never - < '/tmp/ptoas-pr-review-monitor/runs/20260701_180035_pr885/review_prompt.txt'
[monitor] stage timeout: 1800s
OpenAI Codex v0.115.0 (research preview)
--------
workdir: /tmp/ptoas-pr-review-monitor/runs/20260701_180035_pr885/repo
model: gpt-5.4
provider: codereview
approval: never
sandbox: read-only
reasoning effort: xhigh
reasoning summaries: none
session id: 019f1d1f-b022-74e2-ab2e-5c03af4919cf
--------
user
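You are now reviewing a GitHub PR.

Repository: hw-native-sys/PTOAS
PR: #885 feat(planmemory): first-fit-decreasing buffer ordering (opt-in)
Author: tonibohnlein
base branch: origin/main
head branch: HEAD (currently checked out at the PR head)

Requirements:
1. Review only this PR's changes relative to origin/main; look at surrounding files for context where necessary.
2. Focus on real correctness / regression / contract mismatch / CI / runtime / compatibility issues.
3. Do not raise purely stylistic suggestions or low-value speculation.
4. Report strictly by priority:
   - P1: highly likely to cause wrong results, compile/runtime failures, severe regressions, or a release blocker
   - P2: significant defects, behavior regressions, missing validation/tests, or major compatibility issues
   - P3: minor but clearly fixable issues
5. If there are no issues, write the summary as: no issues found in PR #885, and return findings=[].
6. If there are issues, keep the summary concise, and give for each finding:
   - severity
   - title
   - body (explain why it is a problem, as specifically as possible)
   - file (a relative path where possible)
   - line (an integer if it can be determined, otherwise null)

Suggested first steps:
- git status --short
- git diff --stat origin/main...HEAD
- git diff --unified=80 origin/main...HEAD

The final output must strictly match the JSON schema.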

mcp startup: no servers
Reconnecting... 1/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a14490bb4d505e27-LAX, request id: 8c39018c-b13d-4abc-9244-fccb5b5fb738)
Reconnecting... 2/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a14490c02eb9792b-LAX, request id: 56f40c08-edc1-4bf6-bffb-02e7fb4e1d8d)
Reconnecting... 3/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a14490c4f8936d31-LAX, request id: 9c09571a-8497-47d2-8a92-5f1240912899)
Reconnecting... 4/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a14490cbd87719ff-LAX, request id: 168a64d7-d60f-40c1-be2a-cccce7695806)
Reconnecting... 5/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a14490d84895d7a4-LAX, request id: db058c50-263e-4cf0-bd72-b5703fd79e3f)
ERROR: unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, cf-ray: a14490ec7a95191b-LAX, request id: 0458d9a6-2ab1-46fa-97d0-7cd221c1cfd4
Warning: no last agent message; wrote empty content to /tmp/ptoas-pr-review-monitor/runs/20260701_180035_pr885/codex_last_message.json
===== END STAGE codex-review rc=1 @ 2026-07-01 18:00:52 =====

PlanMemAddressOfWholeLocalBuffer() exits through IsEnoughForBuffersNoReuse()
for any scope that fits without reuse, and that fast path
(PlanBuffersWithoutReuse) never reordered buffers. As a result
--plan-memory-order-by-size was silently a no-op for kernels that fit in
local memory, contradicting the option's largest-first placement contract
(Codex review P3).

Reorder largest-first on the no-reuse path as well, so the option means the
same thing regardless of whether reuse kicks in. This is a pure offset
reassignment: the required-size sum is order-independent so the fit check is
unaffected, and the stable sort keeps uniform-size scopes byte-identical to
the default. Peak is unchanged on this path (there is no peak to save when a
scope fits) -- the value is a deterministic, consistent layout.

Add plan_memory_order_by_size_noreuse.pto covering the fast path: three
heterogeneous UB tiles (48KB total, well under the 192KB budget). Default
leaves offset 0 to the small first-generated input tile; order-by-size places
the largest tile at offset 0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@tonibohnlein

Contributor Author

Re: P3 — order-by-size had no effect unless the planner already needs reuse

Fixed in 466da8d.

PlanMemAddressOfWholeLocalBuffer() exited through IsEnoughForBuffersNoReuse() for any scope that fits without reuse, and that fast path (PlanBuffersWithoutReuse) never reordered — so the flag was silently a no-op for kernels that fit, which is exactly the contract mismatch you flagged.

IsEnoughForBuffersNoReuse() now applies GetSizeOrderedRootStorageEntry() (guarded by orderBySize) before laying buffers out sequentially, so the option means the same thing on both paths — deterministic largest-first placement regardless of whether reuse kicks in. On the fast path the peak is unchanged (a fitting clique has no peak to save), but the layout now honors the largest-first contract instead of falling back to generation order.

Notes:

  • Default behavior unchanged. The reorder is a pure offset reassignment: the required-size sum is order-independent so the fit check is unaffected, and the stable sort keeps uniform-size scopes byte-identical to the default. All 16 existing plan_memory_* lit tests still pass.
  • New regression test plan_memory_order_by_size_noreuse.pto covers the fast path: three heterogeneous UB tiles (48 KB total, well under the 192 KB budget). Default leaves offset 0 to the small first-generated input tile; --plan-memory-order-by-size places the largest tile (1x8192xf32) at offset 0.
  • PR description updated to reflect that ordering now applies on both paths.
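
The "pure offset reassignment" point can be checked with a few lines. This is a sketch under assumptions, not PlanBuffersWithoutReuse: it ignores alignment, the function names are hypothetical, and the 8 KB + 32 KB + 8 KB split of the 48 KB total is an illustrative guess (only the 32 KB largest tile and the 48 KB total come from the test description).

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Lay buffers out back to back in the given placement order (sizes in bytes).
// Alignment is ignored in this sketch.
std::vector<uint64_t> sequentialOffsets(const std::vector<uint64_t> &sizes) {
  std::vector<uint64_t> offsets;
  uint64_t next = 0;
  for (uint64_t size : sizes) {
    assert(next + size >= next && "offset overflow");
    offsets.push_back(next);
    next += size;
  }
  return offsets;
}

// The fit check only needs the sum, which does not depend on the order.
uint64_t requiredSize(const std::vector<uint64_t> &sizes) {
  return std::accumulate(sizes.begin(), sizes.end(), uint64_t{0});
}
```

In generation order {8192, 32768, 8192} the offsets are 0, 8192, 40960; in largest-first order {32768, 8192, 8192} they are 0, 32768, 40960. `requiredSize` is 49152 in both cases, so only the layout changes and the fit decision does not.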

@zhangstevenunity zhangstevenunity left a comment

Collaborator

Review: opt-in first-fit-decreasing buffer ordering

Verdict: no blocker. The mechanism is sound and the default path is provably untouched. I traced the full local-planning flow (GetReorderRootStorageEntry / IsEnoughForBuffersNoReuse / PlanReusableLocalBuffer / PlanMemAddressForLevel0 -> GetSizeOrderedRootStorageEntry, plus the RecordOverflowIfAny / PrintSuccessfulAllocatedMaxBits / UpdateBuffer2Offsets consumers) and checked both new lit tests against the a3 memory spec.

What I confirmed:

  • Default behavior is genuinely unchanged. Every new code path is gated on orderBySize, which defaults false; GetSizeOrderedRootStorageEntry is only reachable with the flag set. Uniform-size scopes stay byte-identical (stable sort).
  • The reuse test really forces the reuse path on a3. src(1x8192xf32)+idx(1x8192xui32)+dst(1x32768xf32) = 262144+262144+1048576 = 1572864 bits, which is exactly kA3.ubSpaceSize (1572864). IsEnoughForBuffersNoReuse uses required < capacity, so 1572864 < 1572864 is false -> reuse path. The no-reuse test (48KB << 192KB) takes the fast path. Good separation, and the reuse case fits UB exactly.
  • Both IRs verify. tsort32 imposes no dst/src size-ratio constraint (only same elem type f16/f32 + idx i32/u32), and the shorthand tile_buf<vec, NxM...> form parses. The BYSIZE / DEFAULT-NOT patterns match the real pto.pointer_cast(%cN_i64) : memref<...> output.
  • Wiring is consistent. Only one MemPlan construction site exists; the CLI flag -> pass option -> ctor arg is threaded correctly.
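
The size arithmetic in the reuse-test bullet can be re-derived mechanically. The snippet is a sketch: `tileBits`, `fitsWithoutReuse`, and the constant names are hypothetical, while the tile shapes, the 1572864-bit capacity, and the strict `required < capacity` comparison are taken from the bullet above.

```cpp
#include <cassert>
#include <cstdint>

// Bits occupied by a rows x cols tile with elemBits-wide elements.
constexpr uint64_t tileBits(uint64_t rows, uint64_t cols, uint64_t elemBits) {
  return rows * cols * elemBits;
}

// UB capacity quoted in the review for a3 (192 KB expressed in bits).
constexpr uint64_t kUbCapacityBits = 192ULL * 1024 * 8;

// Strict comparison, as described in the review: an exact fit does NOT
// qualify for the no-reuse fast path.
constexpr bool fitsWithoutReuse(uint64_t requiredBits, uint64_t capacityBits) {
  return requiredBits < capacityBits;
}

// src (1x8192xf32) + idx (1x8192xui32) + dst (1x32768xf32).
constexpr uint64_t kReuseTestBits =
    tileBits(1, 8192, 32) + tileBits(1, 8192, 32) + tileBits(1, 32768, 32);

static_assert(kUbCapacityBits == 1572864, "192 KB is 1572864 bits");
static_assert(kReuseTestBits == kUbCapacityBits, "test fills UB exactly");
static_assert(!fitsWithoutReuse(kReuseTestBits, kUbCapacityBits),
              "exact fit takes the reuse path");
```

The static assertions compile, which confirms the numbers: the three tiles sum to exactly the capacity, and the strict comparison sends that case down the reuse path.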

Two non-blocking notes (one inline on the map write).

The FFD guarantee does not transfer to DSA

The description justifies the ordering with the bin-packing bound (FFD <= 11/9 OPT). That bound is for classic bin packing, where items have only a size. The problem here is Dynamic Storage Allocation: buffers carry a size AND a live interval. For DSA, decreasing-size-first is a good heuristic but has no constant-factor guarantee and can produce a HIGHER peak than the default order on some instances. So enabling the flag can regress the space peak (or even overflow a scope that fit by default) on kernels outside the 213-file corpus. The opt-in / default-off framing and the "measure sync impact before flipping" caveat keep this safe in practice; I would just soften the theoretical claim in the description so nobody flips the default assuming a proven bound.
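
A small instance makes this concrete. The sketch below is a toy first-fit allocator over closed integer lifetimes, not PlanMemory's placement logic; `Buf`, `firstFitPeak`, and `firstFitDecreasingPeak` are hypothetical names, and the two instances were constructed by hand for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical buffer: a size and a closed live interval [start, end].
struct Buf {
  uint64_t size;
  int start;
  int end;
};

static bool liveTogether(const Buf &a, const Buf &b) {
  return a.start <= b.end && b.start <= a.end;
}

// Place buffers in the given order. Each buffer takes the lowest offset that
// does not collide with an already placed buffer that is live at the same
// time. Returns the resulting space peak.
uint64_t firstFitPeak(const std::vector<Buf> &bufs) {
  std::vector<uint64_t> offset(bufs.size(), 0);
  uint64_t peak = 0;
  for (std::size_t i = 0; i < bufs.size(); ++i) {
    assert(bufs[i].start <= bufs[i].end);
    // Address ranges held by conflicting buffers, sorted by start offset.
    std::vector<std::pair<uint64_t, uint64_t>> busy;
    for (std::size_t j = 0; j < i; ++j)
      if (liveTogether(bufs[i], bufs[j]))
        busy.emplace_back(offset[j], offset[j] + bufs[j].size);
    std::sort(busy.begin(), busy.end());
    uint64_t off = 0;
    for (const auto &range : busy) {
      if (off + bufs[i].size <= range.first)
        break; // the gap below this range is large enough
      off = std::max(off, range.second);
    }
    offset[i] = off;
    peak = std::max(peak, off + bufs[i].size);
  }
  return peak;
}

// Same allocator, but buffers are first stable-sorted largest-first.
uint64_t firstFitDecreasingPeak(std::vector<Buf> bufs) {
  std::stable_sort(bufs.begin(), bufs.end(),
                   [](const Buf &a, const Buf &b) { return a.size > b.size; });
  return firstFitPeak(bufs);
}
```

For sizes 4, 3, 3, 4 with lifetimes [0,1], [1,2], [2,3], [3,4], generation order reaches a peak of 7 (equal to the max simultaneously-live size), while largest-first reaches 10. For sizes 1, 1, 2 with lifetimes [0,0], [0,1], [1,1] the result flips: generation order gives 4 and largest-first gives 3. Both directions are possible, which is why the ordering is better described as a heuristic than as a bound.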

Pre-existing (optional follow-up)

The default VEC reuse path (GetReorderRootStorageEntry with the flag off) swaps the root but never updates memscope2rootStorageEntry, so RecordOverflowIfAny / PrintSuccessfulAllocatedMaxBits read a root whose alignedConstBits omits its now-nonzero bitsOffset -- a minor peak under-count. This PR fixes that for the order-by-size path (the map write + clear-all-children) but leaves it on the default path. Not introduced here; worth a follow-up if that accounting matters.

// PrintSuccessfulAllocatedMaxBits) read the new root and its full child list.
// This must accompany the clear above: clearing children without updating the
// map would leave the stale root pointing at an empty child list.
memscope2rootStorageEntry[reorderedRootStorageEntry->bufInfo->bufferScope] =

Collaborator

Non-blocking (latent fragility): this writes memscope2rootStorageEntry[scope] while the same map is being range-iterated in PlanMemAddressOfWholeLocalBuffer (for (auto &it : memscope2rootStorageEntry)), reached through both IsEnoughForBuffersNoReuse and PlanReusableLocalBuffer / PlanMemAddressForLevel0.

It is safe today only because scope is always the key already being visited (every entry in a scope shares that scope, and the key was inserted in MergeSameScopeSE), so DenseMap::operator[] hits an existing bucket and never inserts or reallocates -- the active iterator stays valid. That is an implicit invariant. If a future change ever routes a scope not yet in the map through this path, operator[] would insert, possibly rehash, and invalidate the in-flight range-for iterator (UB). Cheap hardening: assign through the iterator the caller already holds, or assert memscope2rootStorageEntry.count(scope) here before writing.
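
A sketch of the suggested hardening, for illustration only. It uses `std::unordered_map` as a stand-in for the DenseMap (both can rehash on insertion), a string as a stand-in for the scope key, and a hypothetical helper name `replaceExistingRoot`; none of this is the actual PTOPlanMemory code.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Stand-ins for the real types; the field and key types are assumptions.
struct StorageEntry {
  int id;
};
using ScopeMap = std::unordered_map<std::string, StorageEntry *>;

// Overwrite the root of a scope that must already be present.
// find() never inserts, so it cannot rehash the map or invalidate an
// iterator that a caller is holding, unlike operator[] on a missing key.
void replaceExistingRoot(ScopeMap &roots, const std::string &scope,
                         StorageEntry *newRoot) {
  auto it = roots.find(scope);
  assert(it != roots.end() && "scope must already be registered");
  it->second = newRoot;
}
```

Called from inside a range-for over the same map, this leaves the element count unchanged, so the loop's iterator stays valid. A scope that was never registered trips the assertion in debug builds instead of silently inserting.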

@zhangstevenunity zhangstevenunity merged commit d06e3be into hw-native-sys:main Jul 2, 2026
@reedhecre

A3 board test completed (with skips)

  • Trigger: merged
  • Source commit: d06e3beb54d7
  • Result summary: OK 220 / FAIL 0 / SKIP 2
  • Log: /home/zhongxuan/ptoas-board-monitor/runtime/logs/20260702_202039_merged_pr885.log
  • LLVM cache: /home/zhongxuan/ptoas-board-monitor/cache/llvm-project-vpto-llvm21/build-shared
  • Results TSV: /home/zhongxuan/ptoas-board-monitor/runtime/logs/20260702_202039_merged_pr885.tsv
