We now have a few pointwise and depthwise MobileNet layer examples. Check them and try a quick fix first to see whether it works before dumping ops_rknn.
Row 0 of output is correct (max_diff < 0.001), rows 1-3 are garbage (max_diff=inf, values like 704, -40448, inf).
Build and run:
cd ~/npu/ops_reg
gcc -o main main.c -ldrm -lm -I../include
./main conv2d 1 2 4 4 2 2 1 1 1
This calls test_conv2d() at main.c:6273 with positional args:
<batch> <in_c> <in_h> <in_w> <out_c> <weight_in_c> <kh> <kw> [groups]
- PASS → bug is in conv.py's register generation or data pack/unpack
- FAIL → the conv2d runtime logic in ops_reg also has the same bug
Remaining register differences after _feature_grains fix (min not max):
| Register | conv.py | RKNN (working) |
|---|---|---|
| CNA_CONV_CON2 FEATURE_GRAINS | 5 (was 1024 before fix ✓) | 5 |
| CNA_CBUF_CON0 DATA_BANK | 1 (✓) | 1 |
| CNA_CBUF_CON1 DATA_ENTRIES | 16 | 1 |
| CNA_CVT_CON5 (0x1180) | 0x07 (input_pack_c2 - 1 = 7) | not written |
| _unpack_flat_1x1_output | contiguous read | N/A (RKNN uses OutputOperator on CPU) |
CNA_CVT_CON5: conv.py writes input_pack_c2 - 1 = 7 to PER_CHANNEL_CVT_EN, enabling per-channel conversion on 3 pseudo-channels. RKNN doesn't write this register. Even with CVT_BYPASS=1, enabling per-channel mode may alter internal data routing.
Fix: Comment out or conditionalize for NHWC pack only.
_unpack_flat_1x1_output: reads 128 contiguous FP16 elements (c1 * out_width_stride * c2 = 1 * 16 * 8). The DPU writes with DST_SURF_STRIDE=256 bytes between rows. If the NPU writes output with a row stride, the contiguous read captures only row 0 plus padding.
Fix: Verify raw output buffer layout. If strided, read row-by-row with stride.
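If the raw buffer does turn out to be strided, a row-by-row read could look like the sketch below. The function name, its parameters, and the toy buffer are hypothetical and are not taken from conv.py; the sketch only illustrates reading FP16 rows at a fixed byte stride instead of as one contiguous block.

```python
import struct

def read_rows_strided(raw, rows, row_elems, row_stride_bytes):
    """Read `rows` rows of little-endian FP16 values, one row every `row_stride_bytes`."""
    out = []
    for r in range(rows):
        offset = r * row_stride_bytes
        out.append(list(struct.unpack_from(f"<{row_elems}e", raw, offset)))
    return out

# Toy buffer (hypothetical layout): 4 rows of 4 valid values,
# each row padded to 8 FP16 slots, i.e. a 16-byte row stride.
rows, valid, slots = 4, 4, 8
raw = bytearray()
for r in range(rows):
    row = [float(r * valid + i) for i in range(valid)] + [0.0] * (slots - valid)
    raw += struct.pack(f"<{slots}e", *row)

strided = read_rows_strided(raw, rows, valid, slots * 2)
contiguous = list(struct.unpack_from(f"<{rows * valid}e", raw, 0))
```

In this toy case the contiguous read returns row 0, then padding, then row 1, which is the "row 0 correct, later rows wrong" pattern a stride mismatch would produce. Comparing both reads against the reference output is a quick way to confirm or rule out the hypothesis.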
CNA_CBUF_CON1 DATA_ENTRIES: RKNN uses 1 entry (128 bytes); conv.py allocates 16 entries (2048 bytes). This is unlikely to cause wrong computation, but it is worth checking whether the CBUF allocation conflicts with the weights.
Already done — captured RKNN compiler registers to /tmp/ops_rknn_emit.txt.
To re-dump:
cd ~/npu/ops_rknn
echo 'set pagination off
break rknn_destroy
commands
shell python3 dump.py 2 2>&1 | grep EMIT | sed "s/\\x1B\\[[0-9;]*[a-zA-Z]//g" | grep -v "0x00000000" > /tmp/ops_rknn_emit.txt
continue
end
run --case known_2x2_1x1_4x4 1 2 4 4 2 1 1 1
quit' > /tmp/dump_rknn.gdb
gdb -batch -x /tmp/dump_rknn.gdb ./conv2d_multi
- _feature_grains: change max to min (already applied)
- CNA_CVT_CON5: tested removing for non-NHWC; rejected because it regressed conv2d_16x16_1x1_8x8
- Output unpack: checked against experimental/conv_mesa.py; current flat 1x1 unpack matches that reference
- Root cause: 2-channel small 1x1 input must use NHWC/pixel input path, like the existing 1- and 3-channel small cases. Fix should_use_nhwc_pack() to include channels == 2 and c2 // channels == 4.
python examples/conv.py now passes the full sweep, including:
known_2x2_1x1_4x4 2x4x4 -> 2x4x4 kh=1 kw=1 g=1 PASS (max_diff=0.0017)
All previously commented known_issue_shapes were enabled one-by-one and now pass in examples/conv.py:
known_2x2_1x1_4x4
known_8x8_1x1_5x5
known_10x20_3x3_9x9
known_16x16_3x3_9x9
known_2x4_3x3_6x6
known_2x4_2x2_5x5
known_1x32_5x5_10x10
known_8x4_4x4_10x10
For known_10x20_3x3_9x9, ops_reg also failed, so the working RKNN compiler path was dumped from ~/npu/ops_rknn using conv2d_fail_10x20_3x3_9x9.rknn. RKNN showed:
Native input: NC1HWC2, dims 1 2 9 9 8
Native output: NC1HWC2, dims 1 4 1 52 8
CNA_WEIGHT_SIZE2 kernels=20
DPU_DST_SURF_STRIDE=52
DPU_SURFACE_ADD=104
Fixes derived from that dump:
- Use input C2=8 for non-depthwise/grouped inputs with in_c > 4.
- Pack KH-major spatial weights in output-channel blocks of 16.
- Read non-grouped spatial output as flattened NC1HWC2 (height=1, width=out_h*out_w, stride=out_width_stride).
- Read aligned output planes using align_out_c / 8, not only ceil(out_c / 8).
- Program DPU_SURFACE_ADD with 16-channel surface grouping (max(2, effective_align_out // 16)).
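The output-layout arithmetic behind these fixes can be checked with a small sketch. The helper names are hypothetical. Three things are assumptions inferred from the dumped values rather than documented facts: the flattened width aligns up to a multiple of 4, output channels align up to a multiple of 16, and DPU_SURFACE_ADD equals the surface stride times the 16-channel grouping factor.

```python
def align_up(value, multiple):
    """Round `value` up to the next multiple of `multiple`."""
    return (value + multiple - 1) // multiple * multiple

def spatial_output_layout(out_c, out_h, out_w, c2=8, width_align=4, chan_align=16):
    """Flattened NC1HWC2 output layout for a non-grouped spatial conv (sketch)."""
    stride = align_up(out_h * out_w, width_align)   # assumed width alignment
    align_out_c = align_up(out_c, chan_align)       # assumed channel alignment
    return {
        "dims": (1, align_out_c // c2, 1, stride, c2),  # N, C1, H=1, W, C2
        "ceil_planes": -(-out_c // c2),                 # ceil(out_c / 8), too few planes
        "surf_stride": stride,
        "surface_add": stride * max(2, align_out_c // 16),
    }

layout_10x20 = spatial_output_layout(out_c=20, out_h=7, out_w=7)
```

For known_10x20_3x3_9x9 (20 output channels, 7x7 output) this gives dims 1 4 1 52 8, a surface stride of 52, and a surface add of 104, matching the RKNN dump, whereas ceil(20 / 8) would read only 3 planes.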
Added and fixed:
conv2d_cc_b1_c3_h224_w224_oc32_wic3_k3x3_g1
input: 1x3x224x224
weight: 32x3x3x3
output: 1x32x222x222
~/npu/ops_reg is expected to fail this because it does not support the RKNN pc-chain schedule for the large convolution. RKNN was dumped from ~/npu/ops_rknn using:
python3 gen_conv2d_models.py --custom --batch 1 --in-ch 3 --height 224 --width 224 --out-ch 32 --k-h 3 --k-w 3 --groups 1 --name conv2d_cc_b1_c3_h224_w224_oc32_wic3_k3x3_g1 --force --out-dir models
gdb -q ./conv2d_multi -ex "set pagination off" -ex "break rknn_outputs_get" -ex "run --case conv2d_cc_b1_c3_h224_w224_oc32_wic3_k3x3_g1 1 3 224 224 32 3 3 1 conv2d_cc_b1_c3_h224_w224_oc32_wic3_k3x3_g1" -ex "shell python3 dump.py 1 2 3 4 5 > dump/conv2d_cc_dump.txt" -ex "continue" -ex "quit"
RKNN showed five height tiles: four 50-input-row tiles producing 48 output rows each, followed by one 32-input-row tile producing 30 output rows. Important register values from the dump:
CNA_CONV_CON2 FEATURE_GRAINS=50 for 50-row tiles
CNA_CBUF_CON0 DATA_BANK=11, WEIGHT_BANK=1
DPU_DST_SURF_STRIDE=49284 (full output surface stride)
DPU_SURFACE_ADD=98568
Fixes:
- Large non-grouped spatial convs now split into 48-output-row tiles and submit those tiles as one PC chain.
- Each chained tile gets its own input DMA offset and an output DMA base offset for the tile row.
- make_conv2d_regs() accepts an output stride override so tiled DPU writes still use the full output surface stride.
- NHWC spatial convs now use FEATURE_GRAINS=in_h and all 11 data CBUF banks, matching RKNN and avoiding partial-row corruption.
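The height split can be reproduced with a few lines of arithmetic. This is a minimal sketch assuming stride 1 and no padding; the function name is hypothetical and the real planner in examples/conv.py may differ.

```python
def height_tiles(in_h, kh, tile_out_h):
    """Return (output_row_start, tile_input_height) pairs for a stride-1, unpadded conv."""
    out_h = in_h - kh + 1
    tiles = []
    for row_start in range(0, out_h, tile_out_h):
        rows_out = min(tile_out_h, out_h - row_start)
        # Each tile needs kh - 1 extra input rows below its last output row.
        tiles.append((row_start, rows_out + kh - 1))
    return tiles

tiles_224 = height_tiles(in_h=224, kh=3, tile_out_h=48)
```

For the 224x224 3x3 case this yields four tiles with 50 input rows and a tail with 32 input rows, which is the schedule RKNN showed. With stride 1, the output row start is also the input row at which the tile's input DMA begins.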
Result:
conv2d_cc_b1_c3_h224_w224_oc32_wic3_k3x3_g1 3x224x224 -> 32x222x222 kh=3 kw=3 g=1 PASS (max_diff=0.0151)
The new cc shapes listed at the top of this file were added to known_issue_shapes in examples/conv.py and are being debugged one by one.
conv2d_cc_b1_c32_h112_w112_oc32_wic1_k3x3_g32 32x112x112 -> 32x110x110 kh=3 kw=3 g=32 PASS (max_diff=0.0075)
conv2d_cc_b1_c32_h112_w112_oc64_wic32_k1x1_g1 32x112x112 -> 64x112x112 kh=1 kw=1 g=1 PASS (max_diff=0.0146)
conv2d_cc_b1_c64_h56_w56_oc128_wic64_k1x1_g1 64x56x56 -> 128x56x56 kh=1 kw=1 g=1 PASS (max_diff=0.0156)
conv2d_cc_b1_c128_h56_w56_oc128_wic1_k3x3_g128 128x56x56 -> 128x54x54 kh=3 kw=3 g=128 PASS (max_diff=0.0078)
conv2d_cc_b1_c128_h56_w56_oc128_wic128_k1x1_g1 128x56x56 -> 128x56x56 kh=1 kw=1 g=1 PASS (max_diff=0.0312)
conv2d_cc_b1_c128_h28_w28_oc256_wic128_k1x1_g1 128x28x28 -> 256x28x28 kh=1 kw=1 g=1 PASS (max_diff=0.0156)
conv2d_cc_b1_c256_h28_w28_oc256_wic1_k3x3_g256 256x28x28 -> 256x26x26 kh=3 kw=3 g=256 PASS (max_diff=0.0076)
conv2d_cc_b1_c256_h28_w28_oc256_wic256_k1x1_g1 256x28x28 -> 256x28x28 kh=1 kw=1 g=1 PASS (max_diff=0.0312)
conv2d_cc_b1_c256_h14_w14_oc512_wic256_k1x1_g1 256x14x14 -> 512x14x14 kh=1 kw=1 g=1 PASS (max_diff=0.0311)
conv2d_cc_b1_c512_h14_w14_oc512_wic1_k3x3_g512 512x14x14 -> 512x12x12 kh=3 kw=3 g=512 PASS (max_diff=0.0076)
conv2d_cc_b1_c512_h14_w14_oc512_wic512_k1x1_g1 512x14x14 -> 512x14x14 kh=1 kw=1 g=1 PASS (max_diff=0.0312)
conv2d_cc_b1_c512_h7_w7_oc1024_wic512_k1x1_g1 512x7x7 -> 1024x7x7 kh=1 kw=1 g=1 PASS (max_diff=0.0312)
conv2d_cc_b1_c1024_h7_w7_oc1024_wic1_k3x3_g1024 1024x7x7 -> 1024x5x5 kh=3 kw=3 g=1024 PASS (max_diff=0.0064)
conv2d_cc_b1_c1024_h7_w7_oc1024_wic1024_k1x1_g1 1024x7x7 -> 1024x7x7 kh=1 kw=1 g=1 PASS (max_diff=0.0606)
conv2d_cc_b1_c1024_h7_w7_oc1024_wic1_k7x7_g1024 1024x7x7 -> 1024x1x1 kh=7 kw=7 g=1024 PASS (max_diff=0.0078)
conv2d_cc_b1_c1024_h1_w1_oc1001_wic1024_k1x1_g1 1024x1x1 -> 1001x1x1 kh=1 kw=1 g=1 PASS (max_diff=0.0314)
conv1d_* known_issue_shapes expressed as conv2d with H=1/KH=1 PASS
~/npu/ops_rknn/dump/conv2d_cc_b1_c32_h112_w112_oc32_wic1_k3x3_g32_dump.txt
~/npu/ops_rknn/dump/conv2d_cc_b1_c32_h112_w112_oc64_wic32_k1x1_g1_dump.txt
~/npu/ops_rknn/dump/conv2d_cc_b1_c64_h56_w56_oc128_wic64_k1x1_g1_dump.txt
~/npu/ops_rknn/dump/conv2d_cc_b1_c128_h56_w56_oc128_wic128_k1x1_g1_dump.txt
~/npu/ops_rknn/dump/conv2d_cc_b1_c128_h28_w28_oc256_wic128_k1x1_g1_dump.txt
~/npu/ops_rknn/dump/conv2d_cc_b1_c256_h28_w28_oc256_wic256_k1x1_g1_dump.txt
~/npu/ops_rknn/dump/conv2d_cc_b1_c256_h14_w14_oc512_wic256_k1x1_g1_dump.txt
/tmp/ops_rknn_emit_1024_1001.txt
RKNN schedules conv2d_cc_b1_c32_h112_w112_oc32_wic1_k3x3_g32 as a pc-chain, not as a single full-height task. Key register values:
CNA_CONV_CON2 FEATURE_GRAINS=15
CNA_CBUF_CON0 DATA_BANK=11, WEIGHT_BANK=1
CNA_CBUF_CON1 DATA_ENTRIES=112
CNA_DATA_SIZE0 height=50 for 48-output-row tiles
DPU_DST_SURF_STRIDE=12100
DPU_SURFACE_ADD=48400
Fixes applied:
- Depthwise spatial convs with out_h > 48 now enter the shared pc-chain height tiling path.
- _feature_grains() returns 15 for depthwise spatial convs, matching RKNN.
- _data_bank() returns all 11 data banks for depthwise spatial convs.
- The 224 spatial pc-chain shape was moved to the end of the sweep because running it before depthwise pc-chain shapes poisons later chained submissions even after explicit resets.
RKNN schedules conv2d_cc_b1_c32_h112_w112_oc64_wic32_k1x1_g1 as height-tiled pc-chain tasks. Important findings:
CNA_CONV_CON2 FEATURE_GRAINS=10
CNA_CBUF_CON0 DATA_BANK=11
CNA_CBUF_CON1 DATA_ENTRIES=112
CNA_FC_DATA_SIZE1 DMA_CHANNEL=32
The raw path also needed output-channel sub-tiling: without it, channels 0..15 were correct and channels 16+ were corrupt.
Fixes applied:
- Large non-spatial pointwise convs with out_h > 50 now use pc-chain height tiling.
- FEATURE_GRAINS, DATA_BANK, CNA_CBUF_CON1, and CNA_FC_DATA_SIZE1 now use the aligned real input channel count for wide inputs, not just the CBUF lane width.
- The 32->64 pointwise shape uses 16-output-channel block tasks with weight and output-surface offsets.
- Those output-channel block tasks are submitted as one PC chain, then the whole chain is submitted twice. Submitting each 16-channel tile as an independent job still left intermittent stale-state failures across repeated python examples/conv.py runs.
conv2d_cc_b1_c64_h56_w56_oc128_wic64_k1x1_g1 failed with max_diff around 74 after enabling the next known-issue shape. The first 16 output channels in each wide tile were close, but later channels were corrupt.
Root cause: the default 1x1 weight pack serialized each output channel with all 64 input channels contiguous. That happens to match the 16x32 hardware order when in_c == 32, but not when in_c >= 64. The wide pointwise path needs the same 16-output-channel by 32-input-channel chunk order used by the GEMM references:
weight.reshape(out_c // 16, 16, in_c // 32, 32).transpose(0, 2, 1, 3)
Fix applied:
- Added _pack_pointwise_wide() for non-grouped 1x1 convolutions with in_c >= 64.
- Kept the proven 16-output-channel task tiling; switching to 64-output-channel tiles made only the first 16 channels of each tile correct.
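The chunk order can be illustrated without NumPy. The sketch below is a pure-Python stand-in for the reshape/transpose, not the actual _pack_pointwise_wide(); the function name is hypothetical and it only handles channel counts that are exact multiples of the block sizes.

```python
def pack_pointwise_wide_sketch(weight, out_c, in_c, oc_block=16, ic_block=32):
    """Reorder a flat [out_c][in_c] weight list into 16-output x 32-input chunks."""
    if out_c % oc_block or in_c % ic_block:
        raise ValueError("sketch only handles exact multiples of the block sizes")
    packed = []
    for ob in range(out_c // oc_block):          # block of 16 output channels
        for ib in range(in_c // ic_block):       # block of 32 input channels
            for o in range(oc_block):            # output channel inside the block
                start = (ob * oc_block + o) * in_c + ib * ic_block
                packed.extend(weight[start:start + ic_block])
    return packed

narrow = list(range(16 * 32))
wide = list(range(16 * 64))
packed_narrow = pack_pointwise_wide_sketch(narrow, out_c=16, in_c=32)
packed_wide = pack_pointwise_wide_sketch(wide, out_c=16, in_c=64)
```

With in_c == 32 the packed stream equals the plain per-output-channel order, which is why the default pack happened to work for the 32->64 shape. With in_c == 64 the second 32 values in the stream come from output channel 1 rather than from the rest of output channel 0.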
Verified:
conv2d_cc_b1_c64_h56_w56_oc128_wic64_k1x1_g1 PASS (max_diff=0.0156)
python examples/conv.py PASS with this known-issue shape enabled
conv2d_cc_b1_c128_h56_w56_oc128_wic128_k1x1_g1 initially failed after converting the active known_issue_shapes entry from the pasted positional list into the runner's dict format:
FAIL (max_diff=109.5923)
RKNN dump for the exact shape showed that the 128-input pointwise schedule uses smaller height tiles than the 64-input case:
CNA_DATA_SIZE0 height=25 for main tiles, height=6 for tail
CNA_CONV_CON2 FEATURE_GRAINS=9 for 25-row tiles, 6 for tail
CNA_DATA_SIZE1 DATAIN_CHANNEL_REAL=63, DATAIN_CHANNEL=128
CNA_WEIGHT_SIZE0=0x8000, CNA_WEIGHT_SIZE1=256, CNA_WEIGHT_SIZE2 kernels=128
CNA_CBUF_CON1 DATA_ENTRIES=224
DPU_DST_SURF_STRIDE=3136, DPU_SURFACE_ADD=6272
Experiments:
- Switching from 16-output-channel subtasks to 32-output-channel subtasks did not improve the failure.
- Running full 128-output-channel tasks without output-channel subtasks still failed.
- Matching RKNN's 25-row height split without output-channel subtasks made only channels 0..15 correct.
- The passing raw schedule is 25-row height tiles plus the existing 16-output-channel pointwise subtasks and wide 16x32 weight packing.
Fix applied:
- For large non-spatial pointwise convs with in_c >= 128 and out_c >= 128, use tile_out_h=25 instead of 50.
- Keep the existing output-channel tile size at 16 and keep _pack_pointwise_wide().
Verified:
conv2d_cc_b1_c128_h56_w56_oc128_wic128_k1x1_g1 PASS (max_diff=0.0312)
python examples/conv.py PASS, then five repeated full sweeps PASS
128-depthwise -> 128-pointwise targeted transition PASS, 10 consecutive runs
conv2d_cc_b1_c128_h28_w28_oc256_wic128_k1x1_g1 initially failed after converting the pasted positional list into a normal dict entry:
FAIL (max_diff=104.1654)
Important RKNN differences for this exact shape:
CNA_DATA_SIZE0 height=28, width=28 (no height split)
CNA_CONV_CON2 FEATURE_GRAINS=10
CNA_DATA_SIZE1 DATAIN_CHANNEL_REAL=63, DATAIN_CHANNEL=128
CNA_WEIGHT_SIZE0=0x10000, CNA_WEIGHT_SIZE1=256, CNA_WEIGHT_SIZE2 kernels=256
CNA_CBUF_CON0 DATA_BANK=7, WEIGHT_BANK=5
CNA_CBUF_CON1 DATA_ENTRIES=112
DPU_DST_SURF_STRIDE=784, DPU_SURFACE_ADD=1568
Experiments tried before final fix:
- Enabling the 25-row pointwise PC split for this 28-row shape still failed.
- A 13-row split also failed.
- A full 256-output-channel task with RKNN's DATA_BANK=7 made only channels 0..15 correct.
- A broad implementation of full-height PC-chain execution, 16-output-channel subtasks, wide 16x32 weight packing, and DATA_BANK=7 made the target shape pass in isolation/targeted runs, but regressed the full sweep.
Final fix applied:
- Keep make_conv2d_regs() unchanged to avoid broad register-generation behavior changes.
- Add an exact pointwise_128_256 path for in_c=128, out_c=256, out_h=out_w=28 so only this shape enters PC-chain output-channel tiling when out_h <= 50.
- Keep full-height execution for this shape (tile_out_h=out_h), matching RKNN's no-height-split schedule.
- Patch only this shape's generated CNA_CBUF_CON0 entries with _with_cbuf_data_bank(regs, 7), producing RKNN's DATA_BANK=7, WEIGHT_BANK=5 split without affecting 32->64 or 128->128 pointwise paths.
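Every DATA_BANK/WEIGHT_BANK pair in these dumps sums to 12, so the split can be sketched as simple arithmetic. The total of 12 banks is an assumption inferred from those pairs, and the helper name is hypothetical. The actual _with_cbuf_data_bank() patches register words, which is not modeled here.

```python
CBUF_TOTAL_BANKS = 12  # assumption: inferred from the DATA_BANK + WEIGHT_BANK pairs in the dumps

def cbuf_bank_split(data_bank, total_banks=CBUF_TOTAL_BANKS):
    """Return the data/weight bank counts for a chosen number of data banks."""
    if not 1 <= data_bank < total_banks:
        raise ValueError("data_bank must leave at least one weight bank")
    return {"DATA_BANK": data_bank, "WEIGHT_BANK": total_banks - data_bank}

split_128_256 = cbuf_bank_split(7)
```

Under that assumption, asking for 7 data banks leaves 5 weight banks, the split RKNN used for this shape.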
Rejected broad attempt:
python examples/conv.py regressed before reaching the new shape:
conv2d_cc_b1_c32_h112_w112_oc64_wic32_k1x1_g1 FAIL (max_diff=79.9122)
Verified final exact-match fix:
conv2d_cc_b1_c128_h28_w28_oc256_wic128_k1x1_g1 PASS (max_diff=0.0156)
python examples/conv.py PASS, including 32->64 pointwise PASS (max_diff=0.0146)
python examples/conv.py PASS, five repeated full sweeps
Notes:
- Standalone targeted sequences that start with 32->64 can still expose old pointwise stale-state behavior. The full sweep order is the regression guard for this file.
- Do not reintroduce the broad data_bank_override API/change; the safe fix is exact-shape scoped.
conv2d_cc_b1_c256_h28_w28_oc256_wic1_k3x3_g256 was added after the passing 128->256 pointwise case.
Initial issue:
TypeError: list indices must be integers or slices, not str
Root cause: the active known_issue_shapes entry was pasted as a positional string list, but the runner expects dict entries.
Fix applied:
- Convert the entry to the normal dict shape format.
- No register-generation change was needed. The existing depthwise spatial path already tiles channels in 32-channel blocks, uses 13-output-row height tiles for out_h > 13, and submits each block twice.
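The pasted-list failure is easy to reproduce and guard against. In the sketch below the field order follows the ops_reg positional arguments, while the dict keys and the helper name are hypothetical; the real runner's dict format in examples/conv.py may use different keys.

```python
FIELDS = ("batch", "in_c", "in_h", "in_w", "out_c", "weight_in_c", "kh", "kw", "groups")

def shape_entry_to_dict(entry):
    """Accept a dict entry as-is, or convert [name, *positional_strings] to a dict."""
    if isinstance(entry, dict):
        return entry
    name, *values = entry
    numbers = [int(v) for v in values]
    if len(numbers) == len(FIELDS) - 1:
        numbers.append(1)  # groups is optional and defaults to 1 in this sketch
    if len(numbers) != len(FIELDS):
        raise ValueError(f"{name}: expected {len(FIELDS)} values, got {len(numbers)}")
    return {"name": name, **dict(zip(FIELDS, numbers))}

pasted = ["conv2d_cc_b1_c256_h28_w28_oc256_wic1_k3x3_g256",
          "1", "256", "28", "28", "256", "1", "3", "3", "256"]
entry = shape_entry_to_dict(pasted)
```

Indexing the pasted list with a string key such as pasted["name"] raises the TypeError quoted above. Normalizing entries before the sweep loop would surface that mistake before any NPU work starts.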
Verified:
conv2d_cc_b1_c256_h28_w28_oc256_wic1_k3x3_g256 PASS (max_diff=0.0076)
python examples/conv.py PASS, five repeated full sweeps
single-shape targeted run PASS, ten consecutive runs
conv2d_cc_b1_c256_h28_w28_oc256_wic256_k1x1_g1 initially had the same pasted-list issue as other active follow-up shapes:
TypeError: list indices must be integers or slices, not str
After converting the entry to the normal dict format, the raw path reached the NPU and failed:
FAIL (max_diff=129.0518)
Rejected experiments:
- Simply entering the existing full-height pointwise PC-chain path still failed (max_diff=113.5718).
- Applying the 128->256 DATA_BANK=7 patch to the full-height path made it worse (max_diff=154.8297).
- Matching RKNN's apparent 128-output-channel tile size was rejected because the current raw register generator programs a different DPU_SURFACE_ADD for 128-channel subtasks than RKNN does.
RKNN dump for the exact shape showed this is not the 128->256 schedule:
CNA_DATA_SIZE0 height=18 for the first tile, height=10 for the tail
CNA_CONV_CON2 FEATURE_GRAINS=9
CNA_DATA_SIZE1 DATAIN_CHANNEL_REAL=63, DATAIN_CHANNEL=256
CNA_WEIGHT_SIZE1_WEIGHT_BYTES_PER_KERNEL=512
CNA_CBUF_CON0 DATA_BANK=8, WEIGHT_BANK=4
CNA_CBUF_CON1 DATA_ENTRIES=224
DPU_DST_SURF_STRIDE=784
DPU_SURFACE_ADD=1568
Final raw fix applied:
- Add an exact pointwise_256_256 path for in_c=256, out_c=256, out_h=out_w=28.
- Use RKNN's 18-row height tile, producing 18 + 10 rows.
- Keep the proven raw 16-output-channel subtasks so DPU_SURFACE_ADD stays at the passing 1568 spacing.
- Patch only this exact shape's generated CNA_CBUF_CON0 entries with _with_cbuf_data_bank(regs, 8), producing DATA_BANK=8, WEIGHT_BANK=4.
Verified:
conv2d_cc_b1_c256_h28_w28_oc256_wic256_k1x1_g1 PASS (max_diff=0.0312)
single-shape targeted run PASS, ten consecutive runs
python examples/conv.py PASS, three serial full sweeps
conv2d_cc_b1_c256_h14_w14_oc512_wic256_k1x1_g1 initially failed after converting the pasted positional list into a normal dict entry:
FAIL (max_diff=107.6679)
RKNN dump for the exact shape showed:
CNA_CONV_CON2 FEATURE_GRAINS=10
CNA_DATA_SIZE1 DATAIN_CHANNEL_REAL=63, DATAIN_CHANNEL=256
CNA_WEIGHT_SIZE2 kernels=512
CNA_CBUF_CON0 DATA_BANK=8, WEIGHT_BANK=4
CNA_CBUF_CON1 DATA_ENTRIES=112
DPU_SURFACE_ADD=392
Rejected attempts:
- A single full pointwise task with DATA_BANK=8 still failed.
- The same shape under the old full-task path stayed at the same max_diff and did not match RKNN.
Final fix applied:
- Add an exact pointwise_256_512 PC-chain path for in_c=256, out_c=512, out_h=out_w=14.
- Keep the 16-output-channel tile path so the generated subtasks reuse the correct DPU_SURFACE_ADD=392 spacing.
- Patch this shape's CNA_CBUF_CON0 to _with_cbuf_data_bank(regs, 8) so the bank split matches RKNN.
Verified:
conv2d_cc_b1_c256_h14_w14_oc512_wic256_k1x1_g1 PASS (max_diff=0.0311)
python examples/conv.py PASS, three serial full sweeps
conv2d_cc_b1_c512_h14_w14_oc512_wic1_k3x3_g512 was the next follow-up shape.
Quick fix result:
- Converting the pasted list entry to the runner's dict format was enough to reach the NPU path.
- No register or packing change was needed.
- The existing depthwise path already covers this shape.
Verified:
conv2d_cc_b1_c512_h14_w14_oc512_wic1_k3x3_g512 PASS (max_diff=0.0076)
python examples/conv.py PASS, three serial full sweeps
conv2d_cc_b1_c512_h14_w14_oc512_wic512_k1x1_g1 initially failed with:
FAIL (max_diff=167.5826)
Quick fix applied:
- Add an exact pointwise_512_512 path for in_c=512, out_c=512, out_h=out_w=14.
- Use the same 16-output-channel PC-chain tile path as the other wide pointwise shapes.
- Patch this shape's CNA_CBUF_CON0 with _with_cbuf_data_bank(regs, 8).
Verified:
conv2d_cc_b1_c512_h14_w14_oc512_wic512_k1x1_g1 PASS (max_diff=0.0312)
python examples/conv.py PASS, three serial full sweeps
Remaining pasted known_issue_shapes entries were converted from positional string lists to normal dict entries:
conv2d_cc_b1_c512_h7_w7_oc1024_wic512_k1x1_g1
conv2d_cc_b1_c1024_h7_w7_oc1024_wic1_k3x3_g1024
conv2d_cc_b1_c1024_h7_w7_oc1024_wic1024_k1x1_g1
conv2d_cc_b1_c1024_h7_w7_oc1024_wic1_k7x7_g1024
conv2d_cc_b1_c1024_h1_w1_oc1001_wic1024_k1x1_g1
Quick fixes before dumping RKNN:
- 512->1024 @ 7x7 passed after adding an exact pointwise_512_1024 PC-chain path, full-height single tile, 16-output-channel subtasks, and _with_cbuf_data_bank(regs, 8).
- 1024->1024 @ 7x7 passed after adding the same exact-shape PC-chain path as pointwise_1024_1024, also using _with_cbuf_data_bank(regs, 8).
- 1024x7x7 depthwise 3x3 and 7x7 passed without new register changes; the existing large depthwise channel/block path and spatial depthwise pack covered them.
The classifier head 1024->1001 @ 1x1 needed an RKNN dump after quick PC-chain/bank attempts failed. RKNN showed:
CNA_CBUF_CON0 DATA_BANK=1, WEIGHT_BANK=11
CNA_CONV_CON2 FEATURE_GRAINS=2
CNA_DMA_CON2 SURF_STRIDE=0x0ffffffd
CNA_WEIGHT_SIZE2 kernels=1001 for the full task
DPU_SURFACE_ADD=2
PC-chain second task split: 512 + 489 output kernels
Final raw fix:
- Do not output-channel-pad _pack_pointwise_wide() for partial final 16-channel blocks; this keeps the classifier weight stream at exactly 1001 kernels while preserving multiple-of-16 shapes.
- Add exact pointwise_1024_1001 PC-chain path with 512-output-channel tasks, matching RKNN's 512 + 489 split.
- Patch that shape's generated regs to DATA_BANK=1, CNA_DMA_CON2=(width_stride * (height - 4)) & 0x0fffffff, omit CNA_CVT_CON5, and use DPU_SURFACE_ADD=2.
- Add a scoped normal 1x1 NPU warm-up before this classifier head. The classifier passes as the first job in a fresh process but fails after any prior large PC/depthwise task without this warm-up.
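Two of these values can be verified with plain arithmetic. The sketch below uses hypothetical helper names and assumes width_stride=1 and height=1 for the 1x1 classifier input; those inputs are inferred from the dumped value, not read from the source.

```python
def dma_con2_surf_stride(width_stride, height):
    """CNA_DMA_CON2 SURF_STRIDE as a 28-bit wrapped value."""
    return (width_stride * (height - 4)) & 0x0FFFFFFF

def split_out_channels(out_c, block):
    """Return (start, count) output-channel ranges of at most `block` kernels."""
    return [(start, min(block, out_c - start)) for start in range(0, out_c, block)]

surf_stride = dma_con2_surf_stride(width_stride=1, height=1)
tasks = split_out_channels(1001, 512)
```

With a 1x1 input the product is -3, which wraps to 0x0ffffffd under the 28-bit mask, the value seen in the RKNN dump. Splitting 1001 kernels into 512-kernel tasks leaves a 489-kernel tail.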
Verified:
conv2d_cc_b1_c512_h7_w7_oc1024_wic512_k1x1_g1 PASS (max_diff=0.0312)
conv2d_cc_b1_c1024_h7_w7_oc1024_wic1_k3x3_g1024 PASS (max_diff=0.0064)
conv2d_cc_b1_c1024_h7_w7_oc1024_wic1024_k1x1_g1 PASS (max_diff=0.0606)
conv2d_cc_b1_c1024_h7_w7_oc1024_wic1_k7x7_g1024 PASS (max_diff=0.0078)
conv2d_cc_b1_c1024_h1_w1_oc1001_wic1024_k1x1_g1 PASS (max_diff=0.0314)
targeted final-shape transition sequence PASS
python examples/conv.py PASS, three serial full sweeps
The newly added conv1d_* known-issue entries were names only, so they could not run directly in examples/conv.py. Their source configs are in experimental/main.c and ~/npu/ops_reg/main.c as:
name, batch, in_channels, input_size, out_channels, weight_in_channels, kernel_size, groups
A DeepWiki check against the Rocket/RKNPU register model did not show a distinct conv1d register mode. The hardware exposes 2D convolution dimensions (DATAIN_HEIGHT, DATAIN_WIDTH, kernel height/width), so the quick fix was to express each conv1d as conv2d:
input: N x C x 1 x input_size
weight: OC x weight_in_channels x 1 x kernel_size
output: N x OC x 1 x (input_size - kernel_size + 1)
groups: inferred from ops_reg config when groups=0, otherwise explicit
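The mapping can be written as a small helper. This is a sketch with a hypothetical function name. The rule used when groups=0 (in_channels // weight_in_channels) is an assumption about how the group count is inferred, and the example values are illustrative rather than taken from main.c.

```python
def conv1d_as_conv2d(batch, in_channels, input_size, out_channels,
                     weight_in_channels, kernel_size, groups=0):
    """Express a stride-1, unpadded conv1d config as conv2d shapes with H=1 and KH=1."""
    if groups == 0:
        # Assumption: infer groups from the channel ratio when not given explicitly.
        groups = in_channels // weight_in_channels
    return {
        "input": (batch, in_channels, 1, input_size),
        "weight": (out_channels, weight_in_channels, 1, kernel_size),
        "output": (batch, out_channels, 1, input_size - kernel_size + 1),
        "groups": groups,
    }

# Illustrative values only.
mapped = conv1d_as_conv2d(batch=1, in_channels=3, input_size=11,
                          out_channels=6, weight_in_channels=1, kernel_size=5)
```

The resulting shapes go through the ordinary conv2d runner, so no conv1d-specific register path is involved.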
No ops_reg register copy or RKNN dump was needed because all mapped cases passed on the current path:
conv1d_bs1_as_conv2d PASS (max_diff=0.0009)
conv1d_bs8_as_conv2d PASS (max_diff=0.0010)
conv1d_bs1_612_as_conv2d PASS (max_diff=0.0018)
conv1d_bs1_615_as_conv2d PASS (max_diff=0.0037)
conv1d_bs1_1311_631_as_conv2d PASS (max_diff=0.0036)
conv1d_bs1_1311_632_as_conv2d PASS (max_diff=0.0019)
conv1d_bs1_1311_635_as_conv2d PASS (max_diff=0.0037)
conv1d_bs1_1311_615_g3_as_conv2d PASS (max_diff=0.0021)
conv1d_bs8_8111_611_as_conv2d PASS (max_diff=0.0010)
conv1d_bs8_8111_612_a_as_conv2d PASS (max_diff=0.0019)
conv1d_bs8_8111_612_b_as_conv2d PASS (max_diff=0.0019)
conv1d_bs8_8111_615_as_conv2d PASS (max_diff=0.0031)
conv1d_bs8_8311_631_as_conv2d PASS (max_diff=0.0019)
conv1d_bs8_8311_632_as_conv2d PASS (max_diff=0.0029)
conv1d_bs8_8311_635_a_as_conv2d PASS (max_diff=0.0039)
conv1d_bs8_8311_635_g3_as_conv2d PASS (max_diff=0.0020)
Verified:
python examples/conv.py PASS
python examples/conv.py PASS, three serial full sweeps
Intermittent follow-up after enabling shapes_sweep:
- A later user run hit conv1d_bs8_8311_632_as_conv2d, and repeated local sweeps also exposed stale-state failures after the conv1d block in large small-channel 1x1 sweep cases such as conv2d_1x3_290x290_k1 / conv2d_1x3_338x338_k1.
- The conv1d mapping itself was still correct; the failing shapes passed in isolation and the failure moved with accumulated NPU state.
- Scoped fix: mapped batch-8 conv1d jobs now submit each per-batch non-PC task twice.
- Scoped fix: small-channel non-PC 1x1 jobs that are internally height-tiled (len(tiles) > 1) also submit each tile twice.
- The existing 1024->1001 classifier-head warm-up was strengthened to submit the warm-up task twice, because repeated full-process runs could still expose the old classifier stale-state signature before reaching conv1d.
Verified after the intermittent fix:
targeted conv1d block -> conv2d_1x3_338x338_k1 transition PASS, 10 consecutive runs
python examples/conv.py PASS, five serial full sweeps with shapes_sweep enabled
python examples/conv.py passed the active sweep for this debugging session.
Additional depthwise fixes applied:
- Depthwise spatial weight packing now handles all spatial kernels, not just 3x3. This fixed conv2d_cc_b1_c1024_h7_w7_oc1024_wic1_k7x7_g1024.
- Large depthwise spatial shapes use 32-channel block execution with per-block input and weight repacking.
- The 28x28 depthwise case is split into two 13-output-row tiles because a single 26-output-row tile is unstable.
- Depthwise spatial FEATURE_GRAINS is capped at 13, which avoids the unstable 28x28 grain settings while preserving the passing larger/smaller cases.
conv2d_cc_b1_c128_h56_w56_oc128_wic1_k3x3_g128 follow-up:
- The active known_issue_shapes entry was initially added as a positional string list, but the runner expects dict entries; this caused TypeError: list indices must be integers or slices, not str before the NPU path ran.
- Converting the entry to the normal dict shape format was enough to exercise the existing depthwise channel-tiling path.
- No register-generation change was needed for this shape; the existing 32-channel depthwise block execution and 13-row height tiling cover 128 channels at 56x56.
- Verified python examples/conv.py once, then five repeated full sweeps, then ten targeted single-shape runs.
Known remaining work:
- Remaining wider non-grouped 1x1 pointwise cc shapes beyond the active 64->128 case still need to be enabled and validated one by one.
- The empirical double-submit state warm-ups should be reduced once a register-level reset/isolation sequence is identified.
Representative final output:
conv2d_cc_b1_c32_h112_w112_oc64_wic32_k1x1_g1 PASS (max_diff=0.0146)
conv2d_cc_b1_c32_h112_w112_oc32_wic1_k3x3_g32 PASS (max_diff=0.0075)
conv2d_cc_b1_c64_h112_w112_oc64_wic1_k3x3_g64 PASS (max_diff=0.0075)
conv2d_cc_b1_c3_h224_w224_oc32_wic3_k3x3_g1 PASS (max_diff=0.0151)
conv2d_cc_b1_c64_h56_w56_oc128_wic64_k1x1_g1 PASS (max_diff=0.0156)
conv2d_cc_b1_c128_h56_w56_oc128_wic1_k3x3_g128 PASS (max_diff=0.0078)
conv2d_cc_b1_c128_h56_w56_oc128_wic128_k1x1_g1 PASS (max_diff=0.0312)
conv2d_cc_b1_c128_h28_w28_oc256_wic128_k1x1_g1 PASS (max_diff=0.0156)
conv2d_cc_b1_c256_h28_w28_oc256_wic1_k3x3_g256 PASS (max_diff=0.0076)
conv2d_cc_b1_c256_h28_w28_oc256_wic256_k1x1_g1 PASS (max_diff=0.0312)
conv2d_cc_b1_c256_h14_w14_oc512_wic256_k1x1_g1 PASS (max_diff=0.0311)
conv2d_cc_b1_c512_h14_w14_oc512_wic1_k3x3_g512 PASS (max_diff=0.0076)
conv2d_cc_b1_c512_h14_w14_oc512_wic512_k1x1_g1 PASS (max_diff=0.0312)
conv2d_cc_b1_c512_h7_w7_oc1024_wic512_k1x1_g1 PASS (max_diff=0.0312)
conv2d_cc_b1_c1024_h7_w7_oc1024_wic1_k3x3_g1024 PASS (max_diff=0.0064)
conv2d_cc_b1_c1024_h7_w7_oc1024_wic1024_k1x1_g1 PASS (max_diff=0.0606)
conv2d_cc_b1_c1024_h7_w7_oc1024_wic1_k7x7_g1024 PASS (max_diff=0.0078)
conv2d_cc_b1_c1024_h1_w1_oc1001_wic1024_k1x1_g1 PASS (max_diff=0.0314)
conv1d_* known_issue_shapes as H=1/KH=1 conv2d PASS
Next real NPU work:
- Continue enabling the remaining wider pointwise 1x1 cases and validate whether the 25-row pointwise height tile generalizes for in_c >= 128.
- Keep state-transition regressions covered in examples/conv.py while reducing empirical double-submit workarounds where possible.
Do not globally remove or zero CNA_CVT_CON5: removing it previously regressed conv2d_16x16_1x1_8x8. A later test of zeroing it for wider pointwise shapes did not fix 64->128 and was reverted.
conv2d_cc_b1_c32_h112_w112_oc64_wic32_k1x1_g1 could fail intermittently, including cold first-run failures in isolation (max_diff around 48) and failures after large depthwise work in the full sweep (max_diff around 32). The issue was stale task/ping-pong state, not CNA_CVT_CON5 or weight packing.
Fixes applied in examples/conv.py:
- submit_conv_tasks() now uses RKNPU_JOB_PC | RKNPU_JOB_BLOCK, matching the Mesa-style reference submit path, instead of also setting RKNPU_JOB_PINGPONG for every raw conv task.
- Depthwise channel-tiled pc-chain blocks are submitted twice, keeping the existing stale-state warm-up strategy local to the depthwise block path.
- The 32->64 and 64->128 pointwise cases exercise the NPU path; remaining wider 1x1 pointwise cases still need separate validation.
Verified:
python examples/conv.py
The current sweep passes, including the active cc transition coverage. Representative cc results:
conv2d_cc_b1_c32_h112_w112_oc64_wic32_k1x1_g1 PASS (max_diff=0.0146)
conv2d_cc_b1_c32_h112_w112_oc32_wic1_k3x3_g32 PASS (max_diff=0.0075)
conv2d_cc_b1_c3_h224_w224_oc32_wic3_k3x3_g1 PASS (max_diff=0.0151)
conv2d_cc_b1_c64_h112_w112_oc64_wic1_k3x3_g64 PASS (max_diff=0.0075)
conv2d_cc_b1_c64_h56_w56_oc128_wic64_k1x1_g1 PASS (max_diff=0.0156)
conv2d_cc_b1_c3_h224_w224_oc32_wic3_k3x3_g1 and conv2d_cc_b1_c32_h112_w112_oc32_wic1_k3x3_g32 both passed independently, but running the 112-depthwise case after the 224 spatial pc-chain failed (max_diff around 14.67). This reproduced as a persistent stale PC-chain state failure: subsequent 112-depthwise retries in the same process also failed.
Root cause: the last task in write_regs_to_npu_task() only emitted PC_OPERATION_ENABLE. RKNN dumps show chain boundaries clearing REG_PC_REGISTER_AMOUNTS to 0, and stale nonzero PC next-task state can survive into the next raw conv despite per-submit reset.
Fix applied:
- Final PC tails now write PC_BASE_ADDRESS=0 and PC_REGISTER_AMOUNTS=0 before PC_OPERATION_ENABLE.
- The main sweep includes conv2d_cc_b1_c3_h224_w224_oc32_wic3_k3x3_g1 immediately before conv2d_cc_b1_c32_h112_w112_oc32_wic1_k3x3_g32 so the poisoning order is covered by python examples/conv.py.
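The tail rule can be sketched symbolically. The helper below is hypothetical: it models register writes as (name, value) pairs, uses a placeholder enable value, and assumes that non-final tasks link to the next task through the same two registers. None of that encoding is taken from write_regs_to_npu_task().

```python
def pc_tail(next_task, enable_value):
    """Return the symbolic register writes that end one task in a PC chain.

    next_task is None for the final task, otherwise (next_regcmd_addr, next_reg_amount).
    """
    if next_task is None:
        base, amount = 0, 0  # final task: clear the chain link explicitly
    else:
        base, amount = next_task
    return [
        ("PC_BASE_ADDRESS", base),
        ("PC_REGISTER_AMOUNTS", amount),
        ("PC_OPERATION_ENABLE", enable_value),
    ]

# Placeholder address, amount, and enable values for illustration only.
middle = pc_tail(next_task=(0x1000, 64), enable_value=1)
final = pc_tail(next_task=None, enable_value=1)
```

The point of the fix is that the final tail always carries explicit zero writes instead of only the enable, so a nonzero next-task link from an earlier chain cannot survive into the next raw conv.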
Verified:
conv2d_cc_b1_c3_h224_w224_oc32_wic3_k3x3_g1 PASS (max_diff=0.0151)
conv2d_cc_b1_c32_h112_w112_oc32_wic1_k3x3_g32 PASS (max_diff=0.0075)
conv2d_cc_b1_c32_h112_w112_oc64_wic32_k1x1_g1 still showed intermittent first-run/full-sweep failures (max_diff around 49) after the PC tail clear fix. The failing path is the 16-output-channel independent tile loop used by the 32->64 pointwise NPU implementation.
Fix applied:
- Each independent pointwise output-channel tile is submitted twice before moving to the next output-channel tile.
- This was re-tested after final PC tail clearing, so the earlier pointwise-to-depthwise poisoning did not recur.
Follow-up fix:
- The per-output-channel independent submits were still intermittently stale across repeated full-process sweeps (max_diff around 44-52).
- The pointwise output-channel tile regs are now accumulated into one PC chain and the complete chain is submitted twice.
- Testing PC_BASE_ADDRESS=1 as a regcmd preamble regressed the 224 spatial conv, and post-submit reset did not fix the instability, so both experiments were rejected.
Verified:
conv2d_cc_b1_c32_h112_w112_oc64_wic32_k1x1_g1 PASS (max_diff=0.0146)
conv2d_cc_b1_c64_h112_w112_oc64_wic1_k3x3_g64 PASS (max_diff=0.0075)
Also verified python examples/conv.py plus three additional full sweeps in a row.
Latest verification:
python examples/conv.py PASS, 10 consecutive full-process sweeps
targeted cc transition stress sequence PASS, 10 consecutive runs
The conv-cleanup worktree removes the local pointwise_128_256,
pointwise_256_256, pointwise_256_512, pointwise_512_512,
pointwise_512_1024, pointwise_1024_1024, pointwise_1024_1001, and
is_conv1d_mapped booleans from run_conv2d(). The old code made the correct
hardware path hard to see because tile height, output-channel tiling, CBUF bank
selection, compact classifier-head handling, and repeat-submit state workarounds
were all encoded as exact shapes.
The first cleanup pass split those rules into several pointwise helpers. The current cleanup collapses that further into the normal scheduling path:
- _conv_tiles(p, is_spatial, grouped_spatial) returns the PC-chain flag and (row_start, tile_input_height) list for 1x1, spatial, and large-depthwise convolutions. Pointwise is no longer selected by exact layer names; it falls out of the same tile planner:
  - small-channel flat output keeps the RK_MAX_CONV_FLAT_STRIDE row limit;
  - tall wide pointwise keeps the proven 50 or 25 row split (25 when both input and output channels are at least 128);
  - shorter output-channel-tiled pointwise uses a CBUF-limited row count from row bytes and eight data banks;
  - spatial and depthwise paths keep their existing 48-row and 13-row splits.
- _tile_data_bank(p, tile_in_h) computes the output-channel-tile CBUF data-bank override from the tile input footprint and input-channel floor, capped to the empirically passing [1, 8] range. Tall pointwise tiles keep the all-data-bank path.
- The classifier-head tail is now a local condition in run_conv2d(): non-multiple-of-16 output channels with one output atom. That still applies the scoped tail fixes: DATA_BANK=1, patched CNA_DMA_CON2, omit CNA_CVT_CON5, and DPU_SURFACE_ADD=2.
- _direct_submit_repeat(p, batch, is_spatial, tile_count) centralizes the stale-state workarounds: depthwise spatial direct jobs repeat, small-channel tiled direct jobs repeat, and batch>1 single-row NHWC conv1d-mapped jobs repeat.
DeepWiki MCP was installed and queried during this cleanup. The useful answer was
from allbilly/mesa: Rocket's driver model computes entries_per_slice, turns
input/weight footprints into CBUF banks, chooses whether weights fit and can be
reused, then derives task slices from available input banks. That supports moving
the Python code toward bank/footprint formulas. DeepWiki could not fully derive
the RK3588-specific stale-state double-submit behavior; those repeats remain
empirical and should be reduced only after a register-level reset/isolation
sequence is identified.
Verification on the rebased cleanup worktree:
python -m py_compile examples/conv.py PASS
targeted conv1d/small-tiled/pointwise/classifier cases PASS
python examples/conv.py PASS
| File | Location |
|---|---|
| Register programming | ~/rk3588/examples/conv.py |
| ops_reg test runner | ~/npu/ops_reg/main.c (test_conv2d at line 6273) |
| RKNN reference dump | /tmp/ops_rknn_emit.txt |
| GDB dump script | ~/npu/ops_rknn/rknn.gdb |
| Register decoder | ~/npu/ops_rknn/dump.py |
| Register definitions | ~/npu/ops_rknn/registers.xml |
| RKNN test model | ~/npu/ops_rknn/ops_rknn/models/known_2x2_1x1_4x4.rknn |