feat(dsv4/v3_2): mark L2 orch outputs with correct direction (pl.Out / pl.InOut) by YunjiQin · Pull Request #659 · hw-native-sys/pypto-lib

YunjiQin · 2026-07-01T06:22:04Z

What

pypto requires orchestration-entry outputs to declare a direction (hw-native-sys/pypto#1901). An output left as a plain pl.Tensor is treated as In, so its device→host copy-back is skipped and the tensor silently reads back as all-zeros on the host, failing golden.
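The failure mode can be pictured with a toy model. This is only a sketch of the behaviour described above, not pypto's runtime: `Direction`, `launch`, and `fill_ones` are hypothetical names, and the buffers are plain Python lists standing in for host and device memory.

```python
from enum import Enum


class Direction(Enum):
    IN = "In"        # how a plain pl.Tensor is treated, per the description above
    OUT = "Out"
    INOUT = "InOut"


def launch(host_buf, direction, kernel):
    """Toy launch: upload, run `kernel` on the device copy, copy back if declared."""
    device_buf = list(host_buf)          # host -> device upload (toy: always done)
    kernel(device_buf)                   # the kernel writes on the device side only
    if direction in (Direction.OUT, Direction.INOUT):
        host_buf[:] = device_buf         # device -> host copy-back
    return host_buf


def fill_ones(buf):
    """Hypothetical kernel that overwrites every element."""
    for i in range(len(buf)):
        buf[i] = 1.0


print(launch([0.0] * 4, Direction.IN, fill_ones))   # host copy is never updated
print(launch([0.0] * 4, Direction.OUT, fill_ones))  # host sees the kernel's writes
```

With `Direction.IN` the zero-initialised host buffer is returned untouched, which is the silent all-zeros readback that makes the golden comparison fail.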

This PR annotates every golden-compared output on its orchestration entry with the correct direction, decided by the golden TensorSpec:

spec	direction	annotation
`is_output=True` + `init_value`	inout (read-modify-write)	`pl.InOut`
`is_output=True`, no `init_value`	pure output	`pl.Out`
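The table can be read as a small decision function. The sketch below assumes a stand-in `Spec` dataclass with only the two fields the rule looks at; it is not the real golden TensorSpec class, and `direction_annotation` is a hypothetical helper that returns the annotation name as a string.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Spec:
    """Hypothetical stand-in for a golden TensorSpec (only the fields the rule reads)."""
    name: str
    is_output: bool = False
    init_value: Optional[Any] = None


def direction_annotation(spec: Spec) -> str:
    """Map a spec to the annotation its orchestration-entry parameter should carry."""
    if not spec.is_output:
        return "pl.Tensor"   # not golden-compared: stays a plain input
    if spec.init_value is not None:
        return "pl.InOut"    # read-modify-write: initial contents matter
    return "pl.Out"          # pure output: written without being read first


print(direction_annotation(Spec("kv_cache", is_output=True, init_value=0.0)))
print(direction_annotation(Spec("x_out", is_output=True)))
```

Testing `init_value is not None` rather than truthiness keeps an explicit initial value of `0` on the inout side of the rule.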

pl.InOut round-trips correctly as of hw-native-sys/pypto#1918 (the specializer previously dropped the wrapper → "missing type annotation"). pypto-lib CI builds against pypto main, which now includes that fix.

Scope / rules

Direction lives on the orchestration entry only — the @pl.jit entry, its @pl.jit.host L3 driver, or the @pl.function(type=Opaque) method. @pl.jit.inline sub-kernels are left as upstream (their direction tag is stripped at splice time).
Normalizes pre-existing pl.Out-on-inout entries (prefill_attention_*, decode_compressor_*, decode_indexer*, prefill_indexer*) to pl.InOut so the whole codebase is consistent.

Changes (21 files, +36/−36 — 33 `pl.InOut`, 3 `pl.Out`)

inout → pl.InOut: decode/prefill attention kv_cache; compressor kv_state/score_state/compress_state/cmp_kv/cmp_kv_cache; indexer idx_kv_cache; decode_layer / decode_fwd / prefill_fwd kv_cache (entry + L3 host); v3_2 decode_front kv_cache/pe_cache/k_cache_idx/dispatch_buf.
pure output → pl.Out: v3_2 back out; prefill_front_draft dispatch_buf (x_out / x_next / logits already pl.Out).

Verification

AST audit: every is_output spec (incl. @pl.jit.host params and conditional is_output=name==… specs) maps to a correctly-directioned entry param; no @pl.jit.inline param carries pl.InOut.
Device (a2a3) golden PASS: decode/prefill attention (csa/hca/swa), compressor ratio4/128, indexer(_compressor), decode_compressor, decode_indexer, decode_layer — all compile and pass (kv_cache / kv_state / score_state / cmp_kv / idx_kv_cache / x_out).
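One half of the AST audit (no @pl.jit.inline param carries pl.InOut) could look roughly like the following. This is a sketch built on Python's standard `ast` module, not the script actually used for this PR; `dotted`, `inline_params_with_direction`, and the `SAMPLE` source (including its shape and dtype arguments) are made up for illustration.

```python
import ast


def dotted(node):
    """Return 'a.b.c' for a Name/Attribute chain, unwrapping one call or subscript."""
    if isinstance(node, ast.Call):
        node = node.func
    if isinstance(node, ast.Subscript):
        node = node.value
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
    return ".".join(reversed(parts))


def inline_params_with_direction(source, banned=("pl.InOut",)):
    """List (function, param) pairs where a @pl.jit.inline function uses a banned wrapper."""
    hits = []
    for fn in ast.walk(ast.parse(source)):
        if not isinstance(fn, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if "pl.jit.inline" not in {dotted(d) for d in fn.decorator_list}:
            continue
        for arg in fn.args.posonlyargs + fn.args.args + fn.args.kwonlyargs:
            if arg.annotation is not None and dotted(arg.annotation) in banned:
                hits.append((fn.name, arg.arg))
    return hits


# Hypothetical source text; it is only parsed, never executed.
SAMPLE = '''
@pl.jit
def entry(x: pl.Tensor[[8], pl.FP32], kv_cache: pl.InOut[pl.Tensor[[8], pl.FP32]]):
    pass

@pl.jit.inline
def helper(kv_cache: pl.InOut[pl.Tensor[[8], pl.FP32]]):
    pass
'''

print(inline_params_with_direction(SAMPLE))
```

Because the source is only parsed, the check runs without pypto installed. The other half of the audit (matching each is_output spec to an entry parameter) would need the golden specs as input and is not covered here.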

Note: some cases fail their second output (x_out/x_next) on the simulator (a2a3sim/a5sim) — a pre-existing numerical-precision limitation, tracked as daily-CI known failures, independent of this annotation change (fails identically on upstream regardless of direction).

🤖 Generated with Claude Code

coderabbitai · 2026-07-01T06:22:13Z

📝 Walkthrough

This PR updates tensor parameter type annotations across DeepSeek v3.2 and v4 model files, changing cache and dispatch buffer parameters (kv_cache, pe_cache, k_cache_idx, dispatch_buf, kv_state, score_state, out) from plain pl.Tensor[...] to pl.Out[pl.Tensor[...]], marking them as output/writable buffers in each function's signature.

Changes

pl.Out Annotation Updates

Layer / File(s)	Summary
DeepSeek v3.2 decode/prefill signatures `models/deepseek/v3_2/deepseek_v3_2_decode_back.py`, `models/deepseek/v3_2/deepseek_v3_2_decode_front.py`, `models/deepseek/v3_2/deepseek_v3_2_prefill_back.py`, `models/deepseek/v3_2/deepseek_v3_2_prefill_front_draft.py`	The `out`, `kv_cache`, `pe_cache`, `k_cache_idx`, and `dispatch_buf` parameters are retyped from `pl.Tensor[...]` to `pl.Out[pl.Tensor[...]]` in the decode-back, decode-front, prefill-back, and prefill-front-draft layer functions.
DeepSeek v4 decode attention kv_cache signatures `models/deepseek/v4/decode_attention_csa.py`, `models/deepseek/v4/decode_attention_hca.py`, `models/deepseek/v4/decode_attention_swa.py`	`attention_csa_test`, `attention_hca_test`, and `attention_swa_test` retype `kv_cache` from input tensor to `pl.Out[pl.Tensor[...]]`.
DeepSeek v4 decode_fwd/decode_layer kv_cache signatures `models/deepseek/v4/decode_fwd.py`, `models/deepseek/v4/decode_layer.py`	`decode_fwd`, `l3_decode_fwd`, `decode_layer`, and `l3_decode_layer` retype `kv_cache` from input tensor to `pl.Out[pl.Tensor[...]]`.
DeepSeek v4 prefill kv_cache/kv_state/score_state signatures `models/deepseek/v4/prefill_attention_hca.py`, `models/deepseek/v4/prefill_fwd.py`, `models/deepseek/v4/prefill_compressor_ratio128.py`, `models/deepseek/v4/prefill_compressor_ratio4.py`, `models/deepseek/v4/prefill_indexer_compressor.py`	`prefill_attention_hca_test`, `prefill_fwd`, `l3_prefill_fwd`, `prefill_compressor_ratio128_test`, `prefill_compressor_ratio4_test`, and `prefill_indexer_compressor_test` retype `kv_cache`, `kv_state`, and `score_state` from input tensors to `pl.Out[pl.Tensor[...]]`.

Estimated code review effort: 2 (Simple) | ~10 minutes

Possibly related PRs

hw-native-sys/pypto-lib#74: Both PRs mark decode-layer out tensor parameters as pl.Out[pl.Tensor[...]] rather than plain pl.Tensor.
hw-native-sys/pypto-lib#476: Both PRs touch the same v4 decode-attention kv_cache interface contracts in decode_attention_{csa,hca,swa}.py.
hw-native-sys/pypto-lib#459: Both PRs modify the DeepSeek-V4 HCA prefill path's KV-cache tensor interface to treat it as a writable output buffer.

Suggested labels: enhancement

Poem

A rabbit hops through tensors bright,
Turning inputs into outputs right,
kv_cache, dispatch, state—all marked anew,
With pl.Out wrapped snug around each cue,
Hop, hop, the contracts now align! 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title matches the main change: annotating DeepSeek orchestration outputs with pl.Out/pl.InOut.
Description check	✅ Passed	The description accurately describes the direction-annotation changes and their motivation.

coderabbitai

🧹 Nitpick comments (1)

models/deepseek/v4/prefill_attention_hca.py (1)
87-108: 🗄️ Data Integrity & Integration | 🔵 Trivial | 💤 Low value

Align kv_cache with the other cache-writing inline kernels. prefill_attention_hca mutates kv_cache, and the matching inline kernels in this tree annotate the same buffer as pl.Out[...]; keeping this one as plain pl.Tensor[...] makes the nested-inline contract inconsistent.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/prefill_attention_hca.py` around lines 87 - 108, The
prefill_attention_hca signature treats kv_cache as an input tensor even though
the kernel writes to it, which is inconsistent with the other cache-writing
inline kernels. Update the kv_cache parameter annotation in
prefill_attention_hca to use the output form used elsewhere in this module (the
pl.Out-style buffer annotation) so the nested-inline contract matches the actual
mutation behavior.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c19e3c06-0a43-4638-977b-6384147a7fdb

📥 Commits

Reviewing files that changed from the base of the PR and between b92fe93 and fc669c3.

📒 Files selected for processing (14)

models/deepseek/v3_2/deepseek_v3_2_decode_back.py
models/deepseek/v3_2/deepseek_v3_2_decode_front.py
models/deepseek/v3_2/deepseek_v3_2_prefill_back.py
models/deepseek/v3_2/deepseek_v3_2_prefill_front_draft.py
models/deepseek/v4/decode_attention_csa.py
models/deepseek/v4/decode_attention_hca.py
models/deepseek/v4/decode_attention_swa.py
models/deepseek/v4/decode_fwd.py
models/deepseek/v4/decode_layer.py
models/deepseek/v4/prefill_attention_hca.py
models/deepseek/v4/prefill_compressor_ratio128.py
models/deepseek/v4/prefill_compressor_ratio4.py
models/deepseek/v4/prefill_fwd.py
models/deepseek/v4/prefill_indexer_compressor.py

…/ pl.InOut)

pypto requires orchestration-entry outputs to declare a direction (hw-native-sys/pypto#1901). An output left as a plain pl.Tensor is treated as In, so its device->host copy-back is skipped and the tensor silently reads back as all-zeros on the host, failing golden.

Annotate every golden-compared output on its orchestration entry (the @pl.jit entry, its @pl.jit.host L3 driver, or the @pl.function Opaque method) by direction, decided by the golden TensorSpec:

- is_output + init_value -> inout -> pl.InOut (in-place caches / state: kv_cache, kv_state, score_state, compress_state, cmp_kv, cmp_kv_cache, idx_kv_cache, and v3_2 decode_front kv/pe/k_cache_idx/dispatch_buf)
- is_output, no init_value -> pure output -> pl.Out (v3_2 back `out`, prefill_front_draft `dispatch_buf`, x_out / x_next / logits)

pl.InOut round-trips correctly as of hw-native-sys/pypto#1918 (the specializer previously dropped the wrapper).

Direction lives on the orchestration entry only; @pl.jit.inline sub-kernels are left as upstream (their direction tag is stripped at splice time).

Also normalizes pre-existing pl.Out-on-inout entries (prefill_attention_*, decode_compressor_*, decode_indexer*, prefill_indexer*) to pl.InOut so the whole codebase is consistent.

Verified on device (a2a3): decode/prefill attention (csa/hca/swa), compressor ratio4/128, indexer(_compressor), decode_layer, decode_fwd all compile and pass golden.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

YunjiQin force-pushed the feat/l2-orch-inout-annotations branch from 0645675 to fc669c3 Compare July 1, 2026 07:11

YunjiQin changed the title ~~feat(dsv4/v3_2): annotate L2 orch outputs with pl.Out / pl.InOut~~ feat(dsv4/v3_2): mark L2 orch outputs with pl.Out for golden copy-back Jul 1, 2026

coderabbitai Bot reviewed Jul 1, 2026

YunjiQin force-pushed the feat/l2-orch-inout-annotations branch from fc669c3 to b9aab0c Compare July 1, 2026 09:17

YunjiQin changed the title ~~feat(dsv4/v3_2): mark L2 orch outputs with pl.Out for golden copy-back~~ feat(dsv4/v3_2): mark L2 orch outputs with correct direction (pl.Out / pl.InOut) Jul 1, 2026

YunjiQin wants to merge 1 commit into hw-native-sys:main from YunjiQin:feat/l2-orch-inout-annotations
