Date: 2026-05-24
This is a detailed handoff of the current RK3588 NPU CONV work. It is written to be usable without relying on chat history.
Current working objective:
Create/promote examples/conv_tiles.py as the top-level downstream RKNPU version
of the kernel_6_18 CONV work, using decoded registers rather than opaque blobs,
and continue the direct-spatial/RKNN-captured sparse task path until the next
real blocker is understood.
Relevant project rules:
- Do not offload op work to CPU/GPU unless explicitly requested.
- Do not edit files in examples/kernel_6_18/ unless explicitly requested.
- Do not kill long-running NPU processes.
- Be very careful around npu_submit, task_count, regcmd_addr, regcfg_amount, and enable_mask.
- For all ops_rknn work, source is under ~/npu/ops_rknn.
- Use python3 examples/simple_add.py as the quick NPU health/recovery check.
MMIO read is not needed to compute the conv output. It was requested as evidence for the current C64/H56 blocker.
The raw C64/H56 sparse path returns a correct conv result, but the next PC job
(simple_add.py) times out. Userspace-visible RKNN metadata already matches:
task bytes, live regcmd window, active offset, allocation sizes, ACTION sequence,
submit metadata, and MEM_SYNC sequence. Because those all match, the remaining
suspect is hidden hardware/driver state after completion: PC/DPU/interrupt/task
status that is not visible from normal userspace ioctls.
The specific state we want to compare is:
- after an RKNN C64/H56 run that does not poison the next job;
- after the raw conv_tiles.py C64/H56 run that computes correctly but poisons the next job;
- before submitting the next simple_add.py.
Non-root sysfs is not enough for that. It only shows runtime/devfreq state, not
per-core PC/DPU/IRQ status. /dev/mem MMIO access is currently blocked:
MMIO_SNAPSHOT baseline_after_recovery
MMIO_UNAVAILABLE device=/dev/mem errno=13 error=Permission denied
So MMIO is a diagnostic path, not part of the conv implementation.
The blocker is not that one conv shape timed out during its own run. The C64/H56 conv itself completed and produced correct output.
The blocker is:
C64/H56 raw sparse CONV succeeds, but it leaves the NPU/driver state polluted so
the next PC job times out until the downstream RKNPU driver timeout path performs
a soft reset.
Observed next-job timeout:
post simple_add TimeoutError: [Errno 110] Connection timed out
RKNPU: job: ..., mask: 0x1, ... timeout: 6000000us
RKNPU: failed to wait job, task counter: 0, flags: 0x5
RKNPU: job timeout, flags: 0x0:
RKNPU: core 0 irq status: 0x0, raw status: 0xc0000000, require mask: 0x300, task counter: 0x0
RKNPU: soft reset, num: 6
After that timeout/soft-reset, python3 examples/simple_add.py passes again.
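The per-core status row in the timeout log can be decoded mechanically. The sketch below parses that row and checks whether any bit of the require mask shows up in the raw status. The function name is hypothetical, and reading "require mask" as "the bits the driver is waiting for" is an assumption based on the field names, not confirmed driver semantics.

```python
import re

# Hex fields may be printed as 0x... or as a bare 0.
HEX = r"(0x[0-9a-fA-F]+|0)"
CORE_ROW = re.compile(
    r"core (\d+) irq status: " + HEX
    + r", raw status: " + HEX
    + r", require mask: " + HEX
    + r", task counter: " + HEX
)

def decode_core_row(line):
    """Return the parsed fields of a timeout status row, or None if absent."""
    m = CORE_ROW.search(line)
    if m is None:
        return None
    core, irq, raw, require, counter = m.groups()
    fields = {
        "core": int(core),
        "irq_status": int(irq, 16),
        "raw_status": int(raw, 16),
        "require_mask": int(require, 16),
        "task_counter": int(counter, 16),
    }
    # Assumption: completion needs at least one require_mask bit in raw_status.
    fields["required_bits_seen"] = fields["raw_status"] & fields["require_mask"]
    return fields

row = decode_core_row(
    "RKNPU: core 0 irq status: 0x0, raw status: 0xc0000000, "
    "require mask: 0x300, task counter: 0x0"
)
print(row)
```

For the logged row, none of the 0x300 bits appear in raw status 0xc0000000, and the task counter is 0. Under the assumption above, that is consistent with a next job that never signaled completion rather than one that completed and was missed.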
The RKNN/RKNPU captures were re-run. The refreshed capture set includes:
experimental/rknn/capture_rknpu_ioctl_sweep_action_readonly_20260524_135914.log
experimental/rknn/capture_rknpu_ioctl_c64_h56_readonly_20260524_133736.log
experimental/rknn/capture_rknpu_submit_dump_gems_c64_h56_20260524_133709.log
experimental/rknn/capture_rknpu_submit_dump_gems_c64_h56_live_regcmd_1168_20260524_133725.log
The C64/H56 RKNN capture showed the schedule-specific runtime profile now
modeled in conv_tiles.py:
- task BO size: 560
- regcmd BO size: 28352
- active regcmd offset: 0x4800
- active regcmd bytes: 9344
- weight/internal BO size: 1204224
- input BO size: 401408
- output BO size: 802816
- task object: 14 records
- live active regcmd window: 1168 qwords
- one extra input MEM_SYNC FROM_DEVICE before submit
- four output MEM_SYNC FROM_DEVICE calls after submit
The offline verifier confirms this capture is modeled:
offline verifier PASS
PASS C64/H56 RKNN-mode runtime profile
PASS C64/H56 RKNN-mode sync profile
PASS unpromoted C64/H56 live regcmd evidence
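The captured C64/H56 numbers are internally consistent, which can be checked with plain arithmetic. The sketch below is not the offline verifier. It only restates the profile values and tests a few relations between them. The 2-byte element size is an assumption (the capture does not state the tensor dtype), and all constant names are hypothetical.

```python
# Captured C64/H56 profile values, copied from the RKNN capture summary.
TASK_BO_SIZE = 560
TASK_RECORDS = 14
REGCMD_BO_SIZE = 28352
ACTIVE_OFFSET = 0x4800
ACTIVE_BYTES = 9344
ACTIVE_QWORDS = 1168
INPUT_BO_SIZE = 401408
OUTPUT_BO_SIZE = 802816

QWORD_BYTES = 8
ELEM_BYTES = 2  # assumption: 16-bit tensor elements

def tensor_bytes(channels, height, width, elem_bytes=ELEM_BYTES):
    """Dense byte size of a batch-1 tensor."""
    return channels * height * width * elem_bytes

checks = {
    "window qwords match window bytes": ACTIVE_QWORDS * QWORD_BYTES == ACTIVE_BYTES,
    "active window fits in regcmd BO": ACTIVE_OFFSET + ACTIVE_BYTES <= REGCMD_BO_SIZE,
    "task BO divides evenly by record count": TASK_BO_SIZE % TASK_RECORDS == 0,
    "input BO matches c64 h56 w56": tensor_bytes(64, 56, 56) == INPUT_BO_SIZE,
    "output BO matches oc128 h56 w56": tensor_bytes(128, 56, 56) == OUTPUT_BO_SIZE,
}
bytes_per_record = TASK_BO_SIZE // TASK_RECORDS

for name, ok in checks.items():
    print("PASS" if ok else "FAIL", name)
print("bytes per task record:", bytes_per_record)
```

All five relations hold for the captured values, and the task BO works out to 40 bytes per record.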
The tiling logic is mixed: part captured, part formula-derived.
The direct-spatial schedules are not yet a complete formulaic tiler. They are RKNN-captured templates plus decoded register math.
Captured/template-driven pieces:
- record amounts;
- record count/topology;
- PC links;
- PC root/separator layout;
- per-record PC core assignment for promoted special cases;
- selected CBUF/grain overrides;
- active regcmd offsets/windows;
- RKNN-mode BO sizes for captured shape profiles;
- RKNN-mode MEM_SYNC sequence variants.
Formula/math-derived pieces:
- many register field encodings;
- DMA addresses and offsets from BO base plus tensor layout;
- input/output tensor placement;
- NC1HWC2/flat/direct-spatial packing logic;
- output offset computation;
- weight/input/output buffer population;
- register qword generation from descriptor objects;
- normalization/comparison of live RKNN regcmd windows against generated qwords.
So the honest status is:
conv_tiles.py is not yet a general RKNN tiling compiler. Promoted shapes are
template schedules captured from RKNN, with register bodies and buffer layout
computed by decoded math. A shape is promoted only after byte-level offline
equality against RKNN capture plus one gated hardware probe bracketed by
simple_add before and after.
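The promotion rule can be written down as a small predicate. This is a sketch of the rule as stated here, not code from conv_tiles.py. The class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ProbeEvidence:
    offline_equal: bool    # byte-level offline equality against RKNN capture
    pre_simple_add: bool   # simple_add passed before the probe
    probe_ok: bool         # the gated hardware probe produced correct output
    post_simple_add: bool  # simple_add passed after the probe

def is_promotable(ev):
    """A shape is promoted only if every piece of evidence holds."""
    return all((ev.offline_equal, ev.pre_simple_add, ev.probe_ok, ev.post_simple_add))

# H40 passed all four checks; C64/H56 computed correctly but the
# post-probe simple_add timed out.
h40 = ProbeEvidence(True, True, True, True)
c64_h56 = ProbeEvidence(True, True, True, False)

print("H40 promotable:", is_promotable(h40))
print("C64/H56 promotable:", is_promotable(c64_h56))
```

Under this rule a correct conv output alone is not enough, which is why C64/H56 stays unpromoted.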
examples/conv_tiles.py is the top-level downstream RKNPU CONV experiment.
Current behavior:
- Based on examples/kernel_6_18/conv.py, but adapted to the downstream Rockchip rknpu ABI.
- Uses /dev/dri/card1.
- Uses DRM_IOCTL_RKNPU_*.
- Defines downstream structs:
  - rknpu_mem_create
  - rknpu_mem_map
  - rknpu_mem_sync
  - rknpu_submit
  - rknpu_subcore_task
  - struct_rknpu_task
- Uses downstream op_idx=1.
- Normal register-list task descriptors use downstream RKNPU PC-tail semantics:
npu_tasks[idx].regcfg_amount = len(regs)
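A minimal ctypes sketch of how a task record might be filled is shown below. The field order is an assumption based on a common reading of the downstream rknpu UAPI header and must be verified against the header in the driver tree before any use. The helper name fill_task and the placeholder register list are hypothetical. The resulting 40-byte record size is at least consistent with the captured 560-byte task BO holding 14 records.

```python
import ctypes

class struct_rknpu_task(ctypes.Structure):
    # Assumed layout: eight 32-bit fields followed by one 64-bit address.
    # Natural alignment already yields 40 bytes with no padding.
    _fields_ = [
        ("flags", ctypes.c_uint32),
        ("op_idx", ctypes.c_uint32),
        ("enable_mask", ctypes.c_uint32),
        ("int_mask", ctypes.c_uint32),
        ("int_clear", ctypes.c_uint32),
        ("int_status", ctypes.c_uint32),
        ("regcfg_amount", ctypes.c_uint32),
        ("regcfg_offset", ctypes.c_uint32),
        ("regcmd_addr", ctypes.c_uint64),
    ]

def fill_task(task, regs, regcmd_addr, op_idx=1):
    """Fill one record; regcfg_amount is the number of register qwords."""
    task.op_idx = op_idx
    task.regcfg_amount = len(regs)
    task.regcmd_addr = regcmd_addr
    return task

npu_tasks = (struct_rknpu_task * 14)()
regs = [0] * 8  # placeholder qwords, not real register commands
fill_task(npu_tasks[0], regs, regcmd_addr=0x1000)  # made-up address

print(ctypes.sizeof(struct_rknpu_task), ctypes.sizeof(npu_tasks))
```

This only builds records in host memory. It does not open the device or submit anything.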
Important runtime gates:
RK3588_CONV_NO_DEVICE=1
RK3588_CONV_DIRECT_SPATIAL=1
RK3588_CONV_DIRECT_SPATIAL_UNSAFE=1
RK3588_CONV_DIRECT_SPATIAL_TASKS=rknpu_sparse_task_gem
RK3588_CONV_RKNN_MEM_SYNC=1
RK3588_CONV_RKNN_SKIP_RESET=1
RK3588_CONV_RKNN_INIT_ACTIONS=1
RK3588_CONV_C64_H56_SPARSE_UNSAFE=1
Important safety detail:
- RK3588_CONV_C64_H56_SPARSE_UNSAFE=1 is required for C64/H56 sparse hardware submit.
- Without it, C64/H56 sparse submit is guarded and raises before hardware work.
- This is intentional because C64/H56 currently poisons the next PC job.
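The guard described above amounts to an environment check that runs before any device work. The sketch below shows one way such a gate could look. It is not the code in conv_tiles.py, and the function name and error text are hypothetical.

```python
import os

C64_H56_SHAPE = "b1_c64_h56_w56_oc128_wic64_k1x1_g1"
GATE = "RK3588_CONV_C64_H56_SPARSE_UNSAFE"

def check_sparse_submit_allowed(shape, env=None):
    """Raise before any hardware work if the C64/H56 gate is not set."""
    env = os.environ if env is None else env
    if shape == C64_H56_SHAPE and env.get(GATE) != "1":
        raise RuntimeError(
            f"{shape} sparse submit is guarded: set {GATE}=1 "
            "only for a bracketed state-capture run"
        )

# Other shapes are not affected by this particular gate.
check_sparse_submit_allowed("b1_c160_h40_w40_oc320_wic160_k3x3_g1_s1_pvalid", env={})
```

Passing env explicitly keeps the check testable without touching the real process environment.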
Rejected/crashy policy:
RK3588_CONV_DIRECT_SPATIAL_TASKS=rknpu_sparse_task_gem_tail
That tail-index hypothesis previously crashed/rebooted the board. Do not rerun it.
experimental/rknn/compare_conv_tiles_offline.py is the offline verifier. It imports examples/conv_tiles.py with:
RK3588_CONV_NO_DEVICE=1
It does not submit NPU work.
Coverage currently includes:
- downstream RKNPU ABI layout checks;
- captured RKNN task object comparison;
- live RKNN regcmd window comparison;
- H40 sparse task/regcmd comparison;
- six-desc H7, H14, PW C256/H14, and C32-H14 comparisons;
- C64/H56 live regcmd comparison;
- C64/H56 RKNN-mode runtime BO profile comparison;
- C64/H56 RKNN-mode sync profile comparison;
- RKNN ioctl sequence comparison;
- RKNN allocation metadata comparison;
- DMA binding checks;
- RKNPU driver commit candidate model;
- IRQ completion source model;
- timeout source model;
- sysfs snapshot parsing;
- MMIO snapshot parsing, including MMIO_UNAVAILABLE skip behavior;
- PC commit trace patch/log parser checks.
Representative command:
python3 experimental/rknn/compare_conv_tiles_offline.py \
--ioctl-log experimental/rknn/capture_rknpu_ioctl_sweep_action_readonly_20260524_135914.log \
--mmio-log experimental/rknn/mmio_snapshot_baseline_after_recovery_20260524.log
Latest relevant result:
offline verifier PASS
SKIP rknpu MMIO snapshot comparison
MMIO unavailable: /dev/mem errno=13 error=Permission denied
experimental/rknn/capture_c64_state.py is the guarded state-capture harness for the current C64/H56 blocker.
Default is dry-run. It prints commands and does not submit work.
Dry-run examples:
python3 experimental/rknn/capture_c64_state.py --mode rknn --mmio --sysfs --tag dryrun
python3 experimental/rknn/capture_c64_state.py --mode raw --tag dryrun
Raw execution requires both:
--execute --allow-raw-c64
Safety check result:
python3 experimental/rknn/capture_c64_state.py --mode raw --execute --tag should_refuse
--execute --mode raw requires --allow-raw-c64
Purpose:
- RKNN mode can run the existing ACTION-aware GDB capture with optional sysfs and MMIO snapshots.
- Raw mode brackets raw C64/H56 with before/after sysfs/MMIO snapshots.
- It intentionally does not blindly continue into next-job checks because that next job is known to time out while the pollution remains.
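The dry-run default and the double opt-in for raw execution can be sketched with argparse. This is an illustration of the guard pattern, not the actual capture_c64_state.py source. The flag names match the commands above, while the function name and defaults are assumptions.

```python
import argparse

def parse_args(argv):
    """Parse harness flags; refuse raw execution without the explicit allow flag."""
    parser = argparse.ArgumentParser(description="guarded capture sketch")
    parser.add_argument("--mode", choices=("rknn", "raw"), required=True)
    parser.add_argument("--execute", action="store_true")  # default: dry-run
    parser.add_argument("--allow-raw-c64", action="store_true")
    parser.add_argument("--mmio", action="store_true")
    parser.add_argument("--sysfs", action="store_true")
    parser.add_argument("--tag", default="dryrun")
    args = parser.parse_args(argv)
    if args.execute and args.mode == "raw" and not args.allow_raw_c64:
        # parser.error prints the message and exits with status 2.
        parser.error("--execute --mode raw requires --allow-raw-c64")
    return args

dry = parse_args(["--mode", "raw", "--tag", "dryrun"])
print("execute:", dry.execute, "mode:", dry.mode)
```

Keeping the refusal inside argument parsing means the harness exits before it builds any command, so a forgotten flag cannot reach the hardware path.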
experimental/rknn/snapshot_rknpu_mmio_readonly.py is the read-only /dev/mem MMIO snapshot helper for selected RKNPU core state.
It now reports permission failure cleanly instead of throwing a traceback:
MMIO_UNAVAILABLE device=/dev/mem errno=13 error=Permission denied
This is currently blocked in the non-root shell.
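A sketch of the clean-failure behavior: catch the OSError from the open call and emit a single MMIO_UNAVAILABLE line that a verifier can parse and skip on. The helper names are hypothetical and the real snapshot script may be structured differently.

```python
def mmio_unavailable_line(device, exc):
    """Format an OSError as the one-line MMIO_UNAVAILABLE marker."""
    return f"MMIO_UNAVAILABLE device={device} errno={exc.errno} error={exc.strerror}"

def parse_mmio_unavailable(line):
    """Parse the marker back into fields; return None for other lines."""
    prefix = "MMIO_UNAVAILABLE device="
    if not line.startswith(prefix):
        return None
    device, rest = line[len(prefix):].split(" errno=", 1)
    errno_text, error = rest.split(" error=", 1)  # error text may contain spaces
    return {"device": device, "errno": int(errno_text), "error": error}

def try_open_readonly(device="/dev/mem"):
    """Return (file, None) on success or (None, marker_line) on failure."""
    try:
        return open(device, "rb", buffering=0), None
    except OSError as exc:
        return None, mmio_unavailable_line(device, exc)

# Demonstrate with a constructed error instead of touching /dev/mem.
denied = PermissionError(13, "Permission denied")
line = mmio_unavailable_line("/dev/mem", denied)
print(line)
print(parse_mmio_unavailable(line))
```

A verifier can then treat a parsed marker as SKIP rather than FAIL, which matches the skip result reported for the non-root shell.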
experimental/rknn/snapshot_rknpu_sysfs.py is the non-privileged sysfs snapshot helper.
Latest baseline after recovery:
runtime_status = suspended
runtime_usage = 0
cur_freq = 1000000000
target_freq = 1000000000
governor = rknpu_ondemand
This confirms the runtime/devfreq state after recovery, but it does not expose the PC/DPU/IRQ state needed for the C64 pollution diagnosis.
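Snapshot rows in this "key = value" form are easy to turn into a dict for comparison between runs. The sketch below assumes the log uses exactly that row format. The function names and the idea of an "idle baseline" predicate are hypothetical.

```python
def parse_snapshot(text):
    """Parse 'key = value' rows into a dict of strings; ignore other rows."""
    state = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            state[key.strip()] = value.strip()
    return state

def is_idle_baseline(state):
    """True when the device is suspended, unused, and at its target frequency."""
    return (
        state.get("runtime_status") == "suspended"
        and state.get("runtime_usage") == "0"
        and state.get("cur_freq") == state.get("target_freq")
    )

baseline = parse_snapshot(
    "runtime_status = suspended\n"
    "runtime_usage = 0\n"
    "cur_freq = 1000000000\n"
    "target_freq = 1000000000\n"
    "governor = rknpu_ondemand\n"
)
print(baseline)
print("idle baseline:", is_idle_baseline(baseline))
```

Two such dicts can be compared key by key, but as noted, none of these keys carries per-core PC/DPU/IRQ information.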
experimental/rknn/rknpu_pc_commit_trace.patch is a patch for the downstream RKNPU driver that logs PC commit/IRQ/timeout state.
This is the fallback if root MMIO snapshots are not available.
It logs:
- pc commit prelude
- pc commit trace
- pc commit regs
- pc commit armed
- valid irq trace
- invalid irq pc regs
- timeout pc regs
Offline parser command:
python3 experimental/rknn/compare_conv_tiles_offline.py \
--pc-trace-log experimental/rknn/rknpu_pc_commit_trace.log
Current parser status:
- pc commit prelude is parsed and compared.
- valid irq trace is parsed and compared.
- invalid irq pc regs is parsed and compared.
- Trace regexes accept both 0x... and bare 0, matching Linux %#x output where zero prints as 0.
- A synthetic trace log with prelude, commit, armed, and valid IRQ rows passes compare_pc_trace_log().
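A token pattern that accepts both prefixed hex and a bare zero can be sketched as follows. This is an illustration of the idea only. The actual regexes in the verifier are not reproduced here, and the names and the sample row are hypothetical.

```python
import re

# Try the 0x form first so that "0x300" is never split into "0" + "x300".
HEX_TOKEN = re.compile(r"\b(?:0x[0-9a-fA-F]+|0)\b")

def parse_hex_tokens(text):
    """Return every 0x... or bare-0 token in text as an int."""
    return [int(tok, 16) for tok in HEX_TOKEN.findall(text)]

sample = "irq status: 0, raw status: 0xc0000000, require mask: 0x300"
print(parse_hex_tokens(sample))
```

Other decimal numbers are deliberately not matched, so a row such as "timeout: 6000000us" contributes no tokens.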
Additional teardown evidence:
- The offline verifier now parses RKNN debug free memory rows.
- H40 and C64/H56 captures both free BO debug rows in this order: internal, weight, task, input, output.
- Each free row is checked against the corresponding allocation row for handle, virtual address, object address, DMA address, aligned size, flags, and IOMMU domain.
- The existing captures do not show an extra RKNPU destroy ioctl; the teardown evidence visible in the current logs is an unsupported POWER_OFF action followed by RKNN debug free rows.
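The teardown check pairs each free row with its allocation row and compares a fixed set of fields. The sketch below assumes rows have already been parsed into dicts with a "name" key. That row format, the field names, and all values are hypothetical and synthetic, not taken from the captures.

```python
FIELDS = ("handle", "virt_addr", "obj_addr", "dma_addr", "size", "flags", "iommu_domain")
EXPECTED_FREE_ORDER = ["internal", "weight", "task", "input", "output"]

def check_teardown(alloc_rows, free_rows):
    """Return a list of problems; an empty list means the teardown matches."""
    problems = []
    order = [row["name"] for row in free_rows]
    if order != EXPECTED_FREE_ORDER:
        problems.append(f"free order {order} != {EXPECTED_FREE_ORDER}")
    allocs = {row["name"]: row for row in alloc_rows}
    for row in free_rows:
        alloc = allocs.get(row["name"])
        if alloc is None:
            problems.append(f"{row['name']}: freed but never allocated")
            continue
        for field in FIELDS:
            if row.get(field) != alloc.get(field):
                problems.append(f"{row['name']}: {field} differs")
    return problems

def make_row(name, handle):
    # Synthetic row with made-up values, for illustration only.
    return {"name": name, "handle": handle, "virt_addr": 0x1000 * handle,
            "obj_addr": 0x2000 * handle, "dma_addr": 0x3000 * handle,
            "size": 4096, "flags": 0, "iommu_domain": 0}

alloc_rows = [make_row(n, i + 1) for i, n in enumerate(EXPECTED_FREE_ORDER)]
free_rows = [dict(row) for row in alloc_rows]
print(check_teardown(alloc_rows, free_rows))
```

Returning a list of problems rather than a bare boolean keeps the report useful when more than one field drifts.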
Command:
python3 examples/simple_add.py
Known latest pass:
SUBMIT ret=0
ADD NPU=[8 8 8 8 8 8 8 8] expected=[8 8 8 8 8 8 8 8] PASS
Use this before and after every risky NPU probe.
Shape:
b1_c160_h40_w40_oc320_wic160_k3x3_g1_s1_pvalid
Controlled command:
python3 examples/simple_add.py
RK3588_CONV_DIRECT_SPATIAL=1 \
RK3588_CONV_DIRECT_SPATIAL_UNSAFE=1 \
RK3588_CONV_DIRECT_SPATIAL_TASKS=rknpu_sparse_task_gem \
RK3588_CONV_RKNN_MEM_SYNC=1 \
RK3588_CONV_RKNN_SKIP_RESET=1 \
RK3588_CONV_RKNN_INIT_ACTIONS=1 \
python3 examples/conv_tiles.py b1_c160_h40_w40_oc320_wic160_k3x3_g1_s1_pvalid
python3 examples/simple_add.py
Result:
pre simple_add PASS
conv_tiles H40 sparse rknpu PASS (max_diff=0.0625)
post simple_add PASS
Conclusion: H40 sparse RKNPU path is promoted under the RKNN-mode runtime conditions above.
Shape:
b1_c160_h7_w7_oc320_wic160_k3x3_g1_s1_pvalid
Known result:
pre-probe simple_add PASS
b1_c160_h7_w7_oc320_wic160_k3x3_g1_s1_pvalid PASS (max_diff=0.0621)
post-probe simple_add PASS
Shape:
b1_c160_h14_w14_oc320_wic160_k3x3_g1_s1_pvalid
Known result:
pre-probe simple_add PASS
b1_c160_h14_w14_oc320_wic160_k3x3_g1_s1_pvalid PASS (max_diff=0.0625)
post-probe simple_add PASS
Shape:
b1_c32_h14_w14_oc128_wic32_k3x3_g1_s1_pvalid
Known result:
pre-probe simple_add PASS
b1_c32_h14_w14_oc128_wic32_k3x3_g1_s1_pvalid PASS (max_diff=0.0304)
post-probe simple_add PASS
Shape:
b1_c256_h14_w14_oc512_wic256_k1x1_g1
Known result:
pre-probe simple_add PASS
b1_c256_h14_w14_oc512_wic256_k1x1_g1 PASS (max_diff=0.0311)
post-probe simple_add PASS
Shape:
b1_c64_h56_w56_oc128_wic64_k1x1_g1
RKNN model path:
/home/orangepi/npu/ops_rknn/models/conv2d_cc_b1_c64_h56_w56_oc128_wic64_k1x1_g1.rknn
Current modeled evidence:
- 14-record sparse task stream;
- active regcmd offset 0x4800;
- active regcmd bytes 9344;
- live active regcmd window 1168 qwords;
- task BO size 560;
- regcmd BO size 28352;
- weight/internal BO size 1204224;
- input BO size 401408;
- output BO size 802816;
- captured C64/H56 MEM_SYNC sequence;
- captured ACTION/ioctl sequence;
- captured submit metadata.
Offline status:
PASS unpromoted C64/H56 live regcmd evidence
C64/H56 14-record sparse stream matches live RKNN GEM2 window after DMA/PC base normalization
PASS C64/H56 RKNN-mode runtime profile
PASS C64/H56 RKNN-mode sync profile
Hardware probe command used after adding captured C64 sync profile:
python3 examples/simple_add.py
RK3588_CONV_DIRECT_SPATIAL=1 \
RK3588_CONV_DIRECT_SPATIAL_UNSAFE=1 \
RK3588_CONV_C64_H56_SPARSE_UNSAFE=1 \
RK3588_CONV_DIRECT_SPATIAL_TASKS=rknpu_sparse_task_gem \
RK3588_CONV_RKNN_MEM_SYNC=1 \
RK3588_CONV_RKNN_SKIP_RESET=1 \
RK3588_CONV_RKNN_INIT_ACTIONS=1 \
python3 examples/conv_tiles.py b1_c64_h56_w56_oc128_wic64_k1x1_g1
python3 examples/simple_add.py
Result:
pre simple_add PASS
C64/H56 sparse conv PASS (max_diff=0.0156)
post simple_add TimeoutError: [Errno 110] Connection timed out
recovery simple_add PASS after driver timeout soft reset
Conclusion:
The missing C64/H56 cleanup/finalization semantic is not the captured MEM_SYNC
profile. The raw C64/H56 sparse path matches the known RKNN userspace-visible
runtime behavior and computes the correct output, but still leaves hidden
PC/DPU/IRQ/core state in a bad condition for the next PC job.
Do not rerun C64/H56 sparse hardware unchanged unless the run is explicitly for state capture.
Always use simple_add.py before and after intentional hardware probes:
python3 examples/simple_add.py
# exactly one probe
python3 examples/simple_add.py
Do not run C64/H56 raw sparse hardware casually:
RK3588_CONV_C64_H56_SPARSE_UNSAFE=1
Only use that flag for a deliberately bracketed state-capture run.
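The bracketing discipline can be captured in a small wrapper so that a probe is never run without a health check on either side. This is a sketch under assumptions: it treats a zero exit status from simple_add.py as PASS, which this document does not confirm, and the function names are hypothetical. The runner is injectable so the logic can be exercised without hardware.

```python
import subprocess

HEALTH_CMD = ["python3", "examples/simple_add.py"]

def run_cmd(cmd):
    """Run a command; assumption: exit status 0 means PASS."""
    return subprocess.run(cmd).returncode == 0

def bracketed_probe(probe_cmd, run=run_cmd):
    """Health check, exactly one probe, health check. Returns a summary line."""
    if not run(HEALTH_CMD):
        # Never probe an NPU that is already unhealthy.
        return "pre simple_add FAIL: probe not run"
    probe_ok = run(probe_cmd)
    post_ok = run(HEALTH_CMD)
    return (
        f"pre PASS, probe {'PASS' if probe_ok else 'FAIL'}, "
        f"post {'PASS' if post_ok else 'FAIL'}"
    )
```

Calling bracketed_probe with the default runner executes real commands, so it should only be used for an intentional probe. A "post FAIL" with "probe PASS" is the C64/H56 pollution signature described in this handoff.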
Do not run the old tail-layout policy:
RK3588_CONV_DIRECT_SPATIAL_TASKS=rknpu_sparse_task_gem_tail
It previously crashed/rebooted the board.
Do not change submit plumbing without rechecking every field:
- task_count
- task_obj_addr
- regcmd_addr
- regcfg_amount
- int_mask
- enable_mask
- subcore_task
- flags
- core_mask
- PC-chain tail qwords
Run RKNN C64 state capture:
sudo -E python3 experimental/rknn/capture_c64_state.py \
--mode rknn --execute --mmio --sysfs --tag rknn_c64_mmio
Run raw C64 state capture:
sudo -E python3 experimental/rknn/capture_c64_state.py \
--mode raw --execute --allow-raw-c64 --tag raw_c64_mmio
Compare:
- PC data/address registers after RKNN C64 vs raw C64;
- DPU/PPU/RDMA completion or pending state;
- interrupt status/raw status/mask state;
- task counter state;
- whether RKNN leaves any cleanup state that raw path misses.
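Once both snapshots exist, the comparison is a keyed diff. The sketch below assumes each snapshot has been parsed into a dict mapping a register name to an integer. The register names and values shown are hypothetical placeholders, not real RKNPU registers or captured data.

```python
def diff_snapshots(rknn, raw):
    """Return {name: (rknn_value, raw_value)} for every differing register."""
    diffs = {}
    for name in sorted(set(rknn) | set(raw)):
        a, b = rknn.get(name), raw.get(name)
        if a != b:
            diffs[name] = (a, b)  # None marks a register missing on one side
    return diffs

# Placeholder snapshots with made-up names and values.
after_rknn = {"pc_task_status": 0x0, "int_raw_status": 0x0, "int_mask": 0x300}
after_raw = {"pc_task_status": 0x0, "int_raw_status": 0xC0000000, "int_mask": 0x300}

for name, (a, b) in diff_snapshots(after_rknn, after_raw).items():
    print(f"{name}: rknn={a:#x} raw={b:#x}")
```

Any register that differs between the two post-run snapshots is a candidate for the cleanup state that the raw path misses.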
Use the driver trace patch:
experimental/rknn/rknpu_pc_commit_trace.patch
Collect dmesg for:
- RKNN C64 run;
- raw C64 run;
- next simple_add.py timeout.
Save as:
experimental/rknn/rknpu_pc_commit_trace.log
Then run:
python3 experimental/rknn/compare_conv_tiles_offline.py \
--pc-trace-log experimental/rknn/rknpu_pc_commit_trace.log
This should show whether the raw C64 run leaves PC commit/IRQ/task state different from RKNN even though the userspace task/regcmd/sync metadata matches.
Current direct-spatial schedules are still capture-template based. A future cleanup can replace the captured tables with formulas for:
- y-window splitting;
- k/channel splitting;
- setup/k-half/k-tile/y-tile descriptor ordering;
- record amount generation;
- PC link generation;
- PC core assignment;
- CBUF/grain selection;
- active regcmd offset/window sizing;
- RKNN-mode BO sizing;
- RKNN-mode sync profile selection.
This should be treated as a separate compiler/tiler task. It should not be mixed with the C64 pollution diagnosis unless it directly explains the hidden hardware state difference.
After Python edits:
python3 -m py_compile \
examples/conv_tiles.py \
experimental/rknn/compare_conv_tiles_offline.py \
experimental/rknn/capture_c64_state.py \
experimental/rknn/snapshot_rknpu_mmio_readonly.py \
experimental/rknn/snapshot_rknpu_sysfs.py
Offline verifier:
python3 experimental/rknn/compare_conv_tiles_offline.py \
--ioctl-log experimental/rknn/capture_rknpu_ioctl_sweep_action_readonly_20260524_135914.log \
--mmio-log experimental/rknn/mmio_snapshot_baseline_after_recovery_20260524.log
NPU health:
python3 examples/simple_add.py
The worktree is dirty and contains many untracked capture logs. Do not clean or revert unrelated files.
Important modified/untracked files observed:
examples/conv_tiles.py
experimental/rknn/compare_conv_tiles_offline.py
experimental/rknn/capture_c64_state.py
experimental/rknn/snapshot_rknpu_mmio_readonly.py
experimental/rknn/snapshot_rknpu_sysfs.py
experimental/rknn/mmio_snapshot_baseline_after_recovery_20260524.log
experimental/rknn/sysfs_snapshot_baseline_after_recovery_20260524.log
experimental/rknn/rknpu_pc_commit_trace.patch
experimental/rknn/capture_rknpu_ioctl_sweep_action_readonly_20260524_135914.log
experimental/rknn/capture_rknpu_ioctl_c64_h56_readonly_20260524_133736.log
experimental/rknn/capture_rknpu_submit_dump_gems_c64_h56_live_regcmd_1168_20260524_133725.log
plan.md
progress.md
The capture logs under experimental/rknn/ are evidence. Do not delete them as
noise.
Completed:
- Created top-level examples/conv_tiles.py.
- Converted the top-level CONV experiment from Rocket-style plumbing to downstream RKNPU ABI plumbing.
- Added RKNN-style cacheable/IOMMU BO allocation and MEM_SYNC path.
- Added RKNN init ACTION modeling.
- Proved H40 sparse direct-spatial hardware path.
- Proved H7/H14/C32-H14/PW-C256-H14 promoted paths.
- Re-ran/captured RKNN/RKNPU ioctl and live regcmd evidence.
- Modeled C64/H56 task/regcmd/runtime/sync profile against RKNN captures.
- Ran C64/H56 raw sparse hardware once with captured sync profile and confirmed correct output plus next-job pollution.
- Added guarded C64 state-capture harness.
- Added/updated MMIO/sysfs snapshot handling and offline verifier skip behavior for non-root /dev/mem denial.
- Tightened the PC commit trace parser/verifier so instrumented driver logs can compare prelude, valid IRQ, invalid IRQ, and timeout PC state against the modeled commit state.
- Added RKNN debug free-row teardown comparison for H40 and C64/H56 captures.
Not complete:
- C64/H56 sparse hardware promotion.
- Root MMIO or driver-trace evidence explaining the post-C64 next-job pollution.
- General formulaic tiler to replace RKNN-captured schedule templates.