feat(dsv4/v3_2): mark L2 orch outputs with correct direction (pl.Out / pl.InOut)#659

Open
YunjiQin wants to merge 1 commit into
hw-native-sys:mainfrom
YunjiQin:feat/l2-orch-inout-annotations
Conversation

@YunjiQin YunjiQin commented Jul 1, 2026

Contributor

What

pypto requires orchestration-entry outputs to declare a direction (hw-native-sys/pypto#1901). An output left as a plain pl.Tensor is treated as In, so its device→host copy-back is skipped and the tensor silently reads back as all-zeros on the host, failing golden.
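The failure mode described above can be illustrated with a toy host/device round trip. This is only a sketch under stated assumptions: `run_entry`, the string direction tags, and the copy-back rule are hypothetical stand-ins for illustration and are not pypto's runtime or API.

```python
def run_entry(host, directions, kernel):
    """Toy host/device round trip; NOT pypto's actual runtime."""
    # Every buffer gets a device-side copy carrying its host contents.
    device = {name: list(buf) for name, buf in host.items()}
    kernel(device)
    # Assumed rule (taken from the description above): only buffers
    # declared Out or InOut are copied back from device to host.
    for name, direction in directions.items():
        if direction in ("Out", "InOut"):
            host[name] = list(device[name])
    return host


def kernel(dev):
    # Hypothetical kernel: writes twice the input into x_out on the device.
    dev["x_out"] = [v * 2 for v in dev["x"]]


def fresh_host():
    # The host-side output buffer starts zero-filled.
    return {"x": [1, 2, 3], "x_out": [0, 0, 0]}


untagged = run_entry(fresh_host(), {"x": "In", "x_out": "In"}, kernel)
tagged = run_entry(fresh_host(), {"x": "In", "x_out": "Out"}, kernel)

print(untagged["x_out"])  # [0, 0, 0]  (copy-back skipped, host still sees zeros)
print(tagged["x_out"])    # [2, 4, 6]
```

In this toy model the kernel computes the right values in both runs; only the missing direction tag keeps them from reaching the host.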

This PR annotates every golden-compared output on its orchestration entry with the correct direction, decided by the golden TensorSpec:

| spec | direction | annotation |
|---|---|---|
| is_output=True + init_value | inout (read-modify-write) | pl.InOut |
| is_output=True, no init_value | pure output | pl.Out |
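The mapping above can be read as a small decision rule. The sketch below is illustrative only: the `TensorSpec` dataclass here is a hypothetical stand-in with just the two fields the rule needs (not the real golden `TensorSpec`), and `direction_for` is a made-up helper name that returns the annotation as a string.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TensorSpec:
    """Hypothetical stand-in for the golden TensorSpec (illustration only)."""
    name: str
    is_output: bool = False
    init_value: Optional[Any] = None


def direction_for(spec: TensorSpec) -> str:
    """Return the annotation name implied by a spec, following the mapping above."""
    if not spec.is_output:
        return "pl.Tensor"  # plain input, no direction wrapper
    if spec.init_value is not None:
        return "pl.InOut"   # output with initial contents: read-modify-write
    return "pl.Out"         # pure output


specs = [
    TensorSpec("x"),
    TensorSpec("kv_cache", is_output=True, init_value=0.0),
    TensorSpec("x_out", is_output=True),
]
annotations = {s.name: direction_for(s) for s in specs}
print(annotations)
```

Note that the check is `init_value is not None` rather than truthiness, so an initial value of `0.0` still counts as an inout buffer in this sketch.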

pl.InOut round-trips correctly as of hw-native-sys/pypto#1918 (the specializer previously dropped the wrapper → "missing type annotation"). pypto-lib CI builds against pypto main, which now includes that fix.

Scope / rules

  • Direction lives on the orchestration entry only — the @pl.jit entry, its @pl.jit.host L3 driver, or the @pl.function(type=Opaque) method. @pl.jit.inline sub-kernels are left as upstream (their direction tag is stripped at splice time).
  • Normalizes pre-existing pl.Out-on-inout entries (prefill_attention_*, decode_compressor_*, decode_indexer*, prefill_indexer*) to pl.InOut so the whole codebase is consistent.

Changes (21 files, +36/−36 — 33 pl.InOut, 3 pl.Out)

  • inout → pl.InOut: decode/prefill attention kv_cache; compressor kv_state/score_state/compress_state/cmp_kv/cmp_kv_cache; indexer idx_kv_cache; decode_layer / decode_fwd / prefill_fwd kv_cache (entry + L3 host); v3_2 decode_front kv_cache/pe_cache/k_cache_idx/dispatch_buf.
  • pure output → pl.Out: v3_2 back out; prefill_front_draft dispatch_buf (x_out / x_next / logits already pl.Out).

Verification

  • AST audit: every is_output spec (incl. @pl.jit.host params and conditional is_output=name==… specs) maps to a correctly-directioned entry param; no @pl.jit.inline param carries pl.InOut.
  • Device (a2a3) golden PASS: decode/prefill attention (csa/hca/swa), compressor ratio4/128, indexer(_compressor), decode_compressor, decode_indexer, decode_layer — all compile and pass (kv_cache / kv_state / score_state / cmp_kv / idx_kv_cache / x_out).
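An AST audit of this kind could look roughly like the following. This is a minimal sketch using only the standard `ast` module, not the script actually used for this PR: the helper names (`dotted`, `audit`, `inline_violations`) are hypothetical, and the sample `SOURCE` is parsed but never executed, so it needs no pypto import.

```python
import ast

# Sample source to audit; parsed only, never executed. The second function
# deliberately violates the "no pl.InOut on @pl.jit.inline" rule.
SOURCE = '''
@pl.jit
def entry(x: pl.Tensor[...], kv_cache: pl.InOut[pl.Tensor[...]], out: pl.Out[pl.Tensor[...]]):
    pass

@pl.jit.inline
def sub(kv_cache: pl.InOut[pl.Tensor[...]]):
    pass
'''


def dotted(node):
    """Return 'a.b.c' for a Name/Attribute chain, or '' for anything else."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = dotted(node.value)
        return f"{base}.{node.attr}" if base else ""
    return ""


def decorator_name(dec):
    """Name of a decorator, ignoring call arguments such as type=Opaque."""
    return dotted(dec.func if isinstance(dec, ast.Call) else dec)


def param_direction(arg):
    """'pl.Out' or 'pl.InOut' if the annotation is wrapped in one, else 'In'."""
    ann = arg.annotation
    if isinstance(ann, ast.Subscript) and dotted(ann.value) in ("pl.Out", "pl.InOut"):
        return dotted(ann.value)
    return "In"


def audit(source):
    """Map each function name to (decorator names, {param: direction})."""
    result = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            decs = [decorator_name(d) for d in node.decorator_list]
            dirs = {a.arg: param_direction(a) for a in node.args.args}
            result[node.name] = (decs, dirs)
    return result


def inline_violations(source):
    """(function, param) pairs where a @pl.jit.inline param carries pl.InOut."""
    return sorted(
        (fn, param)
        for fn, (decs, dirs) in audit(source).items()
        if "pl.jit.inline" in decs
        for param, direction in dirs.items()
        if direction == "pl.InOut"
    )


print(audit(SOURCE)["entry"][1])
print(inline_violations(SOURCE))  # [('sub', 'kv_cache')]
```

A full audit would additionally cross-check each entry's parameter directions against the golden specs; that step is omitted here because it depends on how the specs are declared in each file.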

Note: some cases fail their second output (x_out/x_next) on the simulator (a2a3sim/a5sim) — a pre-existing numerical-precision limitation, tracked as daily-CI known failures, independent of this annotation change (fails identically on upstream regardless of direction).

🤖 Generated with Claude Code

@gemini-code-assist

Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Walkthrough

This PR updates tensor parameter type annotations across DeepSeek v3.2 and v4 model files, changing cache and dispatch buffer parameters (kv_cache, pe_cache, k_cache_idx, dispatch_buf, kv_state, score_state, out) from plain pl.Tensor[...] to pl.Out[pl.Tensor[...]], marking them as output/writable buffers in each function's signature.

Changes

pl.Out Annotation Updates

| Layer / File(s) | Summary |
|---|---|
| DeepSeek v3.2 decode/prefill signatures: models/deepseek/v3_2/deepseek_v3_2_decode_back.py, models/deepseek/v3_2/deepseek_v3_2_decode_front.py, models/deepseek/v3_2/deepseek_v3_2_prefill_back.py, models/deepseek/v3_2/deepseek_v3_2_prefill_front_draft.py | The out, kv_cache, pe_cache, k_cache_idx, and dispatch_buf parameters are retyped from pl.Tensor[...] to pl.Out[pl.Tensor[...]] in the decode-back, decode-front, prefill-back, and prefill-front-draft layer functions. |
| DeepSeek v4 decode attention kv_cache signatures: models/deepseek/v4/decode_attention_csa.py, models/deepseek/v4/decode_attention_hca.py, models/deepseek/v4/decode_attention_swa.py | attention_csa_test, attention_hca_test, and attention_swa_test retype kv_cache from input tensor to pl.Out[pl.Tensor[...]]. |
| DeepSeek v4 decode_fwd/decode_layer kv_cache signatures: models/deepseek/v4/decode_fwd.py, models/deepseek/v4/decode_layer.py | decode_fwd, l3_decode_fwd, decode_layer, and l3_decode_layer retype kv_cache from input tensor to pl.Out[pl.Tensor[...]]. |
| DeepSeek v4 prefill kv_cache/kv_state/score_state signatures: models/deepseek/v4/prefill_attention_hca.py, models/deepseek/v4/prefill_fwd.py, models/deepseek/v4/prefill_compressor_ratio128.py, models/deepseek/v4/prefill_compressor_ratio4.py, models/deepseek/v4/prefill_indexer_compressor.py | prefill_attention_hca_test, prefill_fwd, l3_prefill_fwd, prefill_compressor_ratio128_test, prefill_compressor_ratio4_test, and prefill_indexer_compressor_test retype kv_cache, kv_state, and score_state from input tensors to pl.Out[pl.Tensor[...]]. |

Estimated code review effort: 2 (Simple) | ~10 minutes

Possibly related PRs

  • hw-native-sys/pypto-lib#74: Both PRs mark decode-layer out tensor parameters as pl.Out[pl.Tensor[...]] rather than plain pl.Tensor.
  • hw-native-sys/pypto-lib#476: Both PRs touch the same v4 decode-attention kv_cache interface contracts in decode_attention_{csa,hca,swa}.py.
  • hw-native-sys/pypto-lib#459: Both PRs modify the DeepSeek-V4 HCA prefill path's KV-cache tensor interface to treat it as a writable output buffer.

Suggested labels: enhancement

Poem

A rabbit hops through tensors bright,
Turning inputs into outputs right,
kv_cache, dispatch, state—all marked anew,
With pl.Out wrapped snug around each cue,
Hop, hop, the contracts now align! 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title matches the main change: annotating DeepSeek orchestration outputs with pl.Out/pl.InOut. |
| Description check | ✅ Passed | The description accurately describes the direction-annotation changes and their motivation. |

@YunjiQin YunjiQin force-pushed the feat/l2-orch-inout-annotations branch from 0645675 to fc669c3 Compare July 1, 2026 07:11
@YunjiQin YunjiQin changed the title feat(dsv4/v3_2): annotate L2 orch outputs with pl.Out / pl.InOut feat(dsv4/v3_2): mark L2 orch outputs with pl.Out for golden copy-back Jul 1, 2026

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
models/deepseek/v4/prefill_attention_hca.py (1)

87-108: 🗄️ Data Integrity & Integration | 🔵 Trivial | 💤 Low value

Align kv_cache with the other cache-writing inline kernels. prefill_attention_hca mutates kv_cache, and the matching inline kernels in this tree annotate the same buffer as pl.Out[...]; keeping this one as plain pl.Tensor[...] makes the nested-inline contract inconsistent.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/prefill_attention_hca.py` around lines 87 - 108, The
prefill_attention_hca signature treats kv_cache as an input tensor even though
the kernel writes to it, which is inconsistent with the other cache-writing
inline kernels. Update the kv_cache parameter annotation in
prefill_attention_hca to use the output form used elsewhere in this module (the
pl.Out-style buffer annotation) so the nested-inline contract matches the actual
mutation behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c19e3c06-0a43-4638-977b-6384147a7fdb

📥 Commits

Reviewing files that changed from the base of the PR and between b92fe93 and fc669c3.

📒 Files selected for processing (14)
  • models/deepseek/v3_2/deepseek_v3_2_decode_back.py
  • models/deepseek/v3_2/deepseek_v3_2_decode_front.py
  • models/deepseek/v3_2/deepseek_v3_2_prefill_back.py
  • models/deepseek/v3_2/deepseek_v3_2_prefill_front_draft.py
  • models/deepseek/v4/decode_attention_csa.py
  • models/deepseek/v4/decode_attention_hca.py
  • models/deepseek/v4/decode_attention_swa.py
  • models/deepseek/v4/decode_fwd.py
  • models/deepseek/v4/decode_layer.py
  • models/deepseek/v4/prefill_attention_hca.py
  • models/deepseek/v4/prefill_compressor_ratio128.py
  • models/deepseek/v4/prefill_compressor_ratio4.py
  • models/deepseek/v4/prefill_fwd.py
  • models/deepseek/v4/prefill_indexer_compressor.py

…/ pl.InOut)

pypto requires orchestration-entry outputs to declare a direction
(hw-native-sys/pypto#1901). An output left as a plain pl.Tensor is treated
as In, so its device->host copy-back is skipped and the tensor silently
reads back as all-zeros on the host, failing golden.

Annotate every golden-compared output on its orchestration entry (the
@pl.jit entry, its @pl.jit.host L3 driver, or the @pl.function Opaque
method) by direction, decided by the golden TensorSpec:
- is_output + init_value  -> inout  -> pl.InOut  (in-place caches / state:
  kv_cache, kv_state, score_state, compress_state, cmp_kv, cmp_kv_cache,
  idx_kv_cache, and v3_2 decode_front kv/pe/k_cache_idx/dispatch_buf)
- is_output, no init_value -> pure output -> pl.Out  (v3_2 back `out`,
  prefill_front_draft `dispatch_buf`, x_out / x_next / logits)

pl.InOut round-trips correctly as of hw-native-sys/pypto#1918 (the
specializer previously dropped the wrapper). Direction lives on the
orchestration entry only; @pl.jit.inline sub-kernels are left as upstream
(their direction tag is stripped at splice time).

Also normalizes pre-existing pl.Out-on-inout entries (prefill_attention_*,
decode_compressor_*, decode_indexer*, prefill_indexer*) to pl.InOut so the
whole codebase is consistent.

Verified on device (a2a3): decode/prefill attention (csa/hca/swa),
compressor ratio4/128, indexer(_compressor), decode_layer, decode_fwd all
compile and pass golden.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@YunjiQin YunjiQin force-pushed the feat/l2-orch-inout-annotations branch from fc669c3 to b9aab0c Compare July 1, 2026 09:17
@YunjiQin YunjiQin changed the title feat(dsv4/v3_2): mark L2 orch outputs with pl.Out for golden copy-back feat(dsv4/v3_2): mark L2 orch outputs with correct direction (pl.Out / pl.InOut) Jul 1, 2026