feat(planmemory): first-fit-decreasing buffer ordering (opt-in)#1

Closed
tonibohnlein wants to merge 1 commit into main from planmem-order-by-size

@tonibohnlein

Owner

Summary

Adds an opt-in order-by-size option to PlanMemory's local allocator that
processes buffers largest-first (first-fit-decreasing) instead of the current
DMA-first / generation order. Default behavior is unchanged.

  • CLI: --plan-memory-order-by-size
  • Pass option: order-by-size (default false)

Why — theoretical argument

On-chip buffer allocation is Dynamic Storage Allocation (DSA): pack buffers,
each with a fixed live interval and a size, into a fixed-capacity strip while
minimizing the peak (lower bound = LOAD, the max simultaneously-live size).
With uniform sizes this is interval-graph colouring (greedy is optimal); with
the heterogeneous sizes real kernels produce (mixed dtypes, reductions,
asymmetric tiles) DSA is NP-hard, and the quality of a first-fit allocator
depends strongly on the order buffers are placed:

  • Arbitrary / generation order has no constant-factor guarantee.
  • Decreasing-size order is the basis of the classic first-fit-decreasing
    bound (bin packing, asymptotically: FFD ≤ 11/9·OPT vs first-fit's
    17/10·OPT) and is the ordering used by XLA (best-fit-decreasing heap
    simulation), TVM USMP (greedy-by-size), and MindSpore SOMAS. PlanMemory
    was the outlier: it ordered by DMA-touch (VEC only) and otherwise by
    generation order.

This change brings PlanMemory's ordering in line with the established baselines,
at zero risk to default behaviour.
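
To make the ordering argument concrete, here is a minimal Python sketch of first-fit placement on a three-buffer instance. It is a toy model, not the PlanMemory implementation: the `Buf` type, the inclusive lifetime convention, and the helper names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Buf:
    name: str
    size: int
    start: int  # first step at which the buffer is live (inclusive, assumed)
    end: int    # last step at which the buffer is live (inclusive, assumed)


def live_together(a, b):
    """True when the two closed live intervals share at least one step."""
    return a.start <= b.end and b.start <= a.end


def first_fit(buffers):
    """Place buffers in the given order, each at the lowest non-conflicting offset."""
    placed = {}  # Buf -> offset
    for buf in buffers:
        # Address ranges already taken by buffers that are live at the same time.
        conflicts = sorted(
            (off, other.size) for other, off in placed.items() if live_together(buf, other)
        )
        offset = 0
        for off, size in conflicts:
            if offset + buf.size <= off:
                break  # the gap below this conflict is large enough
            offset = max(offset, off + size)
        placed[buf] = offset
    return {b.name: off for b, off in placed.items()}


def peak(buffers, offsets):
    """Highest address used by the placement."""
    return max(offsets[b.name] + b.size for b in buffers)


def load(buffers):
    """LOAD lower bound: the maximum total size live at any single step."""
    steps = range(min(b.start for b in buffers), max(b.end for b in buffers) + 1)
    return max(sum(b.size for b in buffers if b.start <= t <= b.end) for t in steps)


# Generation order: two small buffers, then a larger one that overlaps only the second.
generation = [Buf("a", 1, 0, 1), Buf("b", 1, 1, 2), Buf("c", 2, 2, 3)]
by_size = sorted(generation, key=lambda b: -b.size)  # stable: ties keep generation order

peak_generation = peak(generation, first_fit(generation))
peak_by_size = peak(by_size, first_fit(by_size))
lower_bound = load(generation)
```

On this instance generation order places `c` above `b` and peaks at 4, while largest-first places `c` at offset 0 and reaches the LOAD lower bound of 3. It is a single hand-built example, not evidence about any real kernel.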

What changed

  • GetSizeOrderedRootStorageEntry: when enabled, stable-sort buffers by
    decreasing size across all memory spaces (the existing DMA-first reorder is
    VEC-only), keeping ping-pong (double-buffer) pairs contiguous.
  • Wired as a default-off pass option (Passes.td) + a ptoas CLI flag.
  • Ordering only affects the reuse path (entered when Σ buffer sizes ≥
    capacity); a clique that fits still takes the sequential fast path, where
    there is no peak to save, so the option is a no-op exactly where it cannot
    help.
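
The ordering rule and its gating can be sketched in Python as follows. This is not the code of `GetSizeOrderedRootStorageEntry`: the `Buffer` type, the `pair_id` tag, keying a ping-pong pair by its larger member, and both function names are assumptions for illustration. The default branch returns generation order and does not model the existing DMA-first reorder.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Buffer:
    name: str
    size: int
    pair_id: Optional[int] = None  # hypothetical tag shared by both halves of a ping-pong pair


def order_by_size(buffers: List[Buffer]) -> List[Buffer]:
    """Stable largest-first order that keeps ping-pong pairs adjacent."""
    groups: List[List[Buffer]] = []
    group_of_pair = {}
    for buf in buffers:
        if buf.pair_id is not None and buf.pair_id in group_of_pair:
            # Second half of a pair: attach it to its partner's group.
            groups[group_of_pair[buf.pair_id]].append(buf)
            continue
        if buf.pair_id is not None:
            group_of_pair[buf.pair_id] = len(groups)
        groups.append([buf])
    # list.sort is stable, so groups of equal size keep their original order.
    groups.sort(key=lambda g: -max(b.size for b in g))
    return [buf for group in groups for buf in group]


def placement_order(buffers, capacity, order_by_size_enabled=False):
    """Reorder only on the reuse path, i.e. when the sizes do not all fit at once."""
    takes_reuse_path = sum(b.size for b in buffers) >= capacity
    if order_by_size_enabled and takes_reuse_path:
        return order_by_size(buffers)
    return list(buffers)  # default order (the DMA-first reorder is not modelled here)


bufs = [
    Buffer("a", 1),
    Buffer("ping", 2, pair_id=0),
    Buffer("b", 4),
    Buffer("pong", 2, pair_id=0),
    Buffer("c", 2),
]
reordered = [b.name for b in placement_order(bufs, capacity=8, order_by_size_enabled=True)]
fits = [b.name for b in placement_order(bufs, capacity=16, order_by_size_enabled=True)]
```

With a capacity of 8 the total size (11) forces the reuse path and the order becomes `b, ping, pong, c, a`; with a capacity of 16 everything fits and the input order is returned untouched, mirroring the no-op behaviour described above.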

Results

Measured over the TileLang ST suite + JIT kernels (213 files):

| metric | order-by-size (relative to default) |
| --- | --- |
| space-peak regressions | 0 |
| space-peak improvements | 1 (−32 KB, −16.7% on a heterogeneous tsort kernel) |

  • No degradation. Uniform-size instances are byte-identical (stable sort),
    so all 16 existing plan_memory_* lit tests pass unchanged.
  • New lit test plan_memory_order_by_size_reuse.pto exercises the reuse path:
    the largest tile is placed at offset 0 only with the flag.

The on-corpus win is small because the available test corpus is dominated by
fitting cliques and tiny per-op kernels; the benefit appears only under forced
reuse + heterogeneity, which is where larger real kernels live.

Scope / follow-ups

  • Default stays off. Before flipping it, measure the downstream sync
    impact: tighter packing increases buffer reuse, which can add
    synchronization (the memory↔sync coupling).
  • Complementary next steps from the same baselines: best-fit placement, in-place
    aliasing/donation, and InEx (half-open) lifetime semantics.
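
As a rough illustration of why half-open lifetimes matter for reuse, the sketch below compares the two overlap tests on a producer/consumer pair that meet at a single step. The tuple representation and function names are assumptions, and whether this matches the exact InEx semantics intended here is also an assumption.

```python
def overlaps_closed(a, b):
    """Closed intervals [start, end]: sharing an endpoint counts as overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlaps_half_open(a, b):
    """Half-open intervals [start, end): an interval ends just before `end`."""
    return a[0] < b[1] and b[0] < a[1]


# One buffer dies at step 1, the next is born at step 1.
dying = (0, 1)
born = (1, 2)

conflict_closed = overlaps_closed(dying, born)
conflict_half_open = overlaps_half_open(dying, born)
```

Under the closed test the two buffers conflict and need distinct addresses; under the half-open test they do not, so the second could reuse the first's offset.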

The local-memory allocator processes buffers in a DMA-first order (VEC
only) and otherwise in generation order. For the heterogeneous buffer
sizes real kernels produce, first-fit-decreasing (largest-first) order
packs tighter -- it is the ordering XLA, TVM and SOMAS all use.

Add a default-off `order-by-size` pass option (CLI flag
--plan-memory-order-by-size). When enabled, the reuse path sorts buffers
largest-first across every memory space, keeping ping-pong pairs
contiguous. Default behavior is unchanged and uniform-size instances are
untouched (stable sort), so existing tests are unaffected.

Measured over the TileLang ST suite + JIT kernels (213 files): one
space-peak reduction (-32KB), zero regressions. New lit test exercises
the reuse path: the largest tile is placed at offset 0 only with the flag.

@tonibohnlein force-pushed the planmem-order-by-size branch from 0ec4b6a to f8d878c on June 30, 2026 10:09

@tonibohnlein

Owner Author

Superseded by the upstream PR hw-native-sys#885 (same branch, rebased onto latest main).
