docs(cube): Add multi-chain concurrency and fine-grained dependency notes to the detail docs by fengzhazha · Pull Request #41 · LinxISA/LinxCore

fengzhazha · 2026-06-09T11:23:08Z

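Overview

This PR updates the CUBE architecture detail docs, adding detailed descriptions of the multi-chain concurrency mechanism and fine-grained dependency management.

Main changes

1. Architecture diagrams and index

README.md: update the document status table; add CUBE_SPEC.md and WORK_IN_PROGRESS.md
diagrams/cube_architecture.md: add the Tile Command Buffer (4-deep) to the top-level block diagram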

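2. Implementation detail docs (5 files)

datapath.md

Remove the duplicated parameter table
Correct the BufferC description: results are written to ACC only after all of K has completed (the K_chunk concept is removed)
Confirm the Prefetch Buffer at 32 entries and the TileStore Queue at 8 entries

isq.md

Add §13, ISQ multi-chain concurrency organization
- Compare three schemes in detail: globally shared, statically partitioned, and dynamically partitioned
- Recommend globally shared plus a fairness guarantee (FAIR_THRESHOLD=20)
- Add the cross-chain issue arbitration policy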

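l0cache.md

Add §6.1, TMU Ring interface and flit matching
- Detail that one 512B L0 entry = 2×256B TMU flits
- Add the address alignment requirement (512B-aligned)
- Extend the MSHR entry structure (flit_mask[2])
Update the Prefetch Buffer depth to 32 entries
Update the prefetch issue and fill flow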
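
As a rough illustration of the 512B alignment requirement and the 2×256B flit split listed for l0cache.md, the address arithmetic might look like the sketch below. The function names are hypothetical and do not come from the CUBE docs; only the 512B and 256B sizes are taken from this PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define L0_ENTRY_BYTES 512u /* from the PR: one L0 entry is 512B */
#define TMU_FLIT_BYTES 256u /* from the PR: one TMU flit is 256B */

/* True when addr satisfies the 512B alignment requirement. */
static bool l0_addr_is_aligned(uint64_t addr) {
    return (addr & (L0_ENTRY_BYTES - 1u)) == 0;
}

/* Address covered by flit flit_seq (0 or 1) of the entry at entry_addr.
 * Hypothetical helper: the docs only state the sizes, not this mapping. */
static uint64_t l0_flit_addr(uint64_t entry_addr, unsigned flit_seq) {
    assert(l0_addr_is_aligned(entry_addr));
    assert(flit_seq < L0_ENTRY_BYTES / TMU_FLIT_BYTES);
    return entry_addr + (uint64_t)flit_seq * TMU_FLIT_BYTES;
}
```

Under these assumptions an entry at 0x1000 is covered by flits at 0x1000 and 0x1100, while 0x1100 itself is not a valid entry address.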

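accumulator.md

Add a last_uop_id[128] field to the ACC mapping table
- Each slice records the uop_id of the last uop that writes it
- Supports fine-grained, per-output-position dependencies
Add an explanation of the advantages of pipelined parallelism for tmatmul.acc chains

acccvt.md

Rewrite §3.2 as fine-grained dependency setup
- acccvtuop[i] depends only on the last matmul uop of its corresponding slice
- Remove the old K_chunk early-Wakeup concept
Add per-slice dependency and pipelined-parallelism examples
Update the Wakeup mechanism to a fine-grained broadcast example

3. Progress tracking

WORK_IN_PROGRESS.md: update the document revision status; mark completed items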

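Technical highlights

Fine-grained dependencies

Dependencies in a tmatmul.acc chain are optimized from "wait for the entire tileop to complete" to "per-output-position dependencies"
uop[i,j,0] of the second tileop only has to wait for uop[i,j,K-1] of the first tileop to complete
This allows two tileops to execute in a pipelined, overlapping fashion

Multi-chain concurrency

The ISQ recommendation is 32 globally shared entries, with a single chain holding at most 20 entries (to prevent starvation)
uops from different chains are fully independent and can execute concurrently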

TMU Ring protocol

L0 Cache entry (512B) = 2 TMU flits (256B each)
The MSHR tracks flit_mask and marks the entry ready only after both flits have arrived
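
A minimal sketch of the flit_mask bookkeeping described above, assuming a 2-bit mask with one bit per flit and that flits may arrive in either order. The struct layout and function name are hypothetical; only flit_mask is named in the PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FLITS_PER_ENTRY 2u
#define ALL_FLITS_MASK ((1u << FLITS_PER_ENTRY) - 1u) /* both bits set */

/* Hypothetical MSHR entry; the real entry has more fields. */
typedef struct {
    uint64_t entry_addr; /* 512B-aligned L0 entry address */
    uint8_t flit_mask;   /* bit n is set once flit n has arrived */
} mshr_entry_t;

/* Record an arriving flit. Returns true once both flits are present,
 * which is when the L0 entry may be marked ready. */
static bool mshr_on_flit(mshr_entry_t *e, unsigned flit_seq) {
    assert(flit_seq < FLITS_PER_ENTRY);
    e->flit_mask |= (uint8_t)(1u << flit_seq);
    return e->flit_mask == ALL_FLITS_MASK;
}
```

The first arrival leaves the entry not ready regardless of which flit it is; only the second arrival completes the mask.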

Scope of impact

Documentation-only update; no code changes.

All modification dates are updated to 2026-06-09.

🤖 Generated with Claude Code

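- Add CUBE_SPEC.md and WORK_IN_PROGRESS.md to the document status table
- Update the last-modified date to 2026-06-09

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Add the Tile Cmd Buffer (depth=4) to the top-level module block diagram
- Update the command receive path in the dataflow diagram
- Update the last-modified date to 2026-06-09

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Add update records for README.md and diagrams/cube_architecture.md
- Mark the completed documentation update items
- State the remaining modifications still to be done

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>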

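Updated 5 CUBE implementation detail docs:

1. datapath.md:
- Remove the duplicated parameter table
- Correct the K_chunk description to "after all of K has completed"
- Update the date to 2026-06-09

2. isq.md:
- Add §13, ISQ multi-chain concurrency organization
- Describe the three schemes in detail: globally shared, statically partitioned, dynamically partitioned
- Recommend the globally shared plus fairness guarantee scheme
- Add the cross-chain issue arbitration policy
- Update the date to 2026-06-09

3. l0cache.md:
- Add §6.1, TMU Ring interface and flit matching
- Detail that a 512B entry needs 2 256B TMU flits
- Add the address alignment requirement and the MSHR flit_mask mechanism
- Update the Prefetch Buffer depth to 32 entries
- Update the prefetch issue and fill flow
- Update the date to 2026-06-09

4. accumulator.md:
- Add a last_uop_id[128] field to the ACC mapping table
- Add a description of fine-grained, per-output-position dependency management
- Add the advantages of pipelined parallelism for tmatmul.acc chains
- Update the date to 2026-06-09

5. acccvt.md:
- Rewrite §3.2 as fine-grained dependency setup
- Remove the old K_chunk early-Wakeup concept
- Add per-slice dependency and pipelined-parallelism examples
- Update the Wakeup mechanism to a fine-grained example
- Update the date to 2026-06-09

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>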

- Mark datapath.md, isq.md, l0cache.md, accumulator.md, and acccvt.md as completed
- List the completed items for each document in detail
- Update the completion date to 2026-06-09

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request updates the Janus CUBE architecture documentation, introducing fine-grained dependency management for the accumulator and acccvt modules, detailing the ISQ multi-chain concurrent organization options, and defining the TMU Ring interface, flit matching, and prefetch mechanisms in the L0 cache. The review feedback identifies a logical contradiction in the ISQ global shared allocation strategy, an inconsistency in the Prefetch Buffer capacity and flit launch logic, and an ambiguity regarding the update timing of the last uop ID in the accumulator mapping table.

gemini-code-assist · 2026-06-09T11:25:10Z

+    if (chain_count[req_chain] >= FAIR_THRESHOLD) {
+        // This chain already holds too many entries; lower its priority
+        allow = (other_chains_have_space == 0);
+    } else {
+        allow = (free_entries > 0);
+    }


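The globally shared allocation policy contains a logical contradiction:
When the current chain's occupancy reaches FAIR_THRESHOLD (20), the code sets allow = (other_chains_have_space == 0).
That means that if the other chains are completely idle (other_chains_have_space is true/nonzero), allow becomes 0 (false), which blocks the current chain from allocating further.
This directly conflicts with the advantage stated earlier, "a single chain can use all 32 entries (when the other chains are idle)". When the other chains have no active requests, the current chain's allocation should not be restricted.

Suggest introducing a check for whether other chains have active requests (for example, other_chains_active_requests), and restricting the current chain only when other chains have real allocation demand and the current chain exceeds the threshold.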

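Suggested change

-if (chain_count[req_chain] >= FAIR_THRESHOLD) {
-    // This chain already holds too many entries; lower its priority
-    allow = (other_chains_have_space == 0);
-} else {
-    allow = (free_entries > 0);
-}
+if (chain_count[req_chain] >= FAIR_THRESHOLD) {
+    // This chain already holds too many entries; restrict it only when other chains have active requests and need space
+    // If other chains have no active requests, let the current chain keep allocating to reach maximum utilization
+    allow = (free_entries > 0) && (other_chains_active_requests == 0);
+} else {
+    allow = (free_entries > 0);
+}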
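
To make the behavior of the suggested condition easier to check, here is a self-contained sketch of the same decision as a function. It is only an illustration: the struct and function names are hypothetical, and how other_chains_active_requests would be computed in the real ISQ is not specified in this thread.

```c
#include <assert.h>
#include <stdbool.h>

#define ISQ_ENTRIES    32 /* from the PR description */
#define FAIR_THRESHOLD 20 /* from the PR description */

/* Hypothetical snapshot of the inputs to one allocation decision. */
typedef struct {
    int chain_count;                  /* entries held by the requesting chain */
    int free_entries;                 /* unallocated ISQ entries */
    int other_chains_active_requests; /* pending requests from other chains */
} isq_alloc_req_t;

/* Mirrors the suggested condition: a chain at or above the threshold
 * is held back only while other chains are actively requesting entries. */
static bool isq_allow_alloc(const isq_alloc_req_t *r) {
    assert(r->chain_count >= 0 && r->chain_count <= ISQ_ENTRIES);
    if (r->free_entries <= 0) {
        return false; /* nothing left to hand out */
    }
    if (r->chain_count >= FAIR_THRESHOLD) {
        return r->other_chains_active_requests == 0;
    }
    return true;
}
```

With 7 free entries, a chain holding 25 entries is allowed when the other chains are idle and refused when they have pending requests, while a chain holding 10 entries is allowed in both cases.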

gemini-code-assist · 2026-06-09T11:25:10Z

-**Buffer capacity**: 8-16 entries
+**Buffer capacity**: 32 entries (can buffer requests for 16 L0 entries)


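The Prefetch Buffer capacity stated here is inconsistent with the issue logic described later:

This line declares 32 entries (able to buffer requests for 16 L0 entries), which implies that each 512B L0 entry occupies 2 prefetch buffer entries (one per 256B flit).

However, the issue condition in §6.4 says that once a single prefetch buffer request is selected, it sends the first flit (flit_seq=0) and the second flit (flit_seq=1) over two consecutive cycles. That implies one prefetch buffer entry represents a complete 512B L0 entry request (in which case 32 entries should buffer 32 L0 entry requests).

If each prefetch buffer entry really represents a single 256B flit, the issue logic should not have one entry automatically issue two flits back to back; the two flits should be scheduled and issued separately as independent entries. Please clarify and correct the definition here.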
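
The two readings contrasted in this comment can be made concrete with a short sketch. It does not decide which reading is intended; the entry layout and all names are hypothetical, and only the 32-entry depth and the two-flit issue sequence come from the text under review.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PREFETCH_ENTRIES 32u /* from the updated l0cache.md text */
#define FLITS_PER_L0     2u  /* 512B L0 entry / 256B TMU flit */

/* Reading A: one buffer entry per 256B flit request. */
static unsigned l0_requests_if_entry_is_flit(void) {
    return PREFETCH_ENTRIES / FLITS_PER_L0;
}

/* Reading B: one buffer entry per whole 512B L0 request. */
static unsigned l0_requests_if_entry_is_request(void) {
    return PREFETCH_ENTRIES;
}

/* Hypothetical entry layout under reading B. */
typedef struct {
    uint64_t entry_addr;    /* 512B-aligned L0 entry address */
    unsigned next_flit_seq; /* next flit to send: 0, then 1 */
    bool valid;             /* cleared once both flits are sent */
} prefetch_entry_t;

/* One issue cycle under reading B: sends one flit and returns its
 * flit_seq; the entry is released after the second flit. */
static unsigned prefetch_issue_cycle(prefetch_entry_t *e) {
    assert(e->valid && e->next_flit_seq < FLITS_PER_L0);
    unsigned seq = e->next_flit_seq++;
    if (e->next_flit_seq == FLITS_PER_L0) {
        e->valid = false;
    }
    return seq;
}
```

Reading A yields 16 buffered L0 requests and reading B yields 32, which is the gap between the capacity line and the §6.4 issue condition.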

gemini-code-assist · 2026-06-09T11:25:10Z

+First tmatmul:
+  uop[i,j,k] → ACC[slice_ij]
+  last_uop_id[slice_ij] = uop[i,j,K_tiles-1].uop_id  // last uop along K
+
+Second tmatmul.acc:
+  uop'[i,j,0].deps_uop_id = last_uop_id[slice_ij]
+  // Depends only on the last uop for this output position; no need to wait for other positions


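The description of when last_uop_id is updated is not precise enough here and could cause ambiguity in implementation:
The document says:

First tmatmul: uop[i,j,k] → ACC[slice_ij] last_uop_id[slice_ij] = uop[i,j,K_tiles-1].uop_id

This is easily read as updating last_uop_id only when the uop executes or writes back to ACC.
However, to support pipelined parallelism (the second tmatmul.acc being decoded, split, and entered into the ISQ while the first tmatmul is still executing), last_uop_id must be updated by the FSM immediately at the decode-split/dispatch (Dispatch/Split) stage. Only then can the second tmatmul.acc fetch and bind its deps_uop_id correctly when it is enqueued into the ISQ. If the update waited until the execute stage, pipelined overlap across tileops would be impossible.

Suggest stating explicitly in the document that last_uop_id is updated and bound at the FSM decode-split stage (Dispatch Time).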
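
A simplified sketch of the dispatch-time update this comment asks for, under several assumptions that are not in the docs: uop ids are handed out sequentially, a plain tmatmul carries no dependency on earlier writers, and NO_DEP is a made-up sentinel. It models only the table bookkeeping, not the ISQ or execution.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SLICES 128        /* last_uop_id[128] in the PR */
#define NO_DEP     UINT32_MAX /* hypothetical "no dependency" marker */

static uint32_t last_uop_id[NUM_SLICES];
static uint32_t next_uop_id; /* hypothetical sequential id allocator */

static void acc_map_reset(void) {
    for (int s = 0; s < NUM_SLICES; s++) {
        last_uop_id[s] = NO_DEP;
    }
    next_uop_id = 0;
}

/* Split one tileop's work on a single slice into k_tiles uops.
 * Returns the deps_uop_id bound to the k = 0 uop of this tileop. */
static uint32_t dispatch_slice(int slice, int k_tiles, int is_acc) {
    assert(slice >= 0 && slice < NUM_SLICES && k_tiles > 0);
    /* Bind the dependency from the table as it stands at dispatch. */
    uint32_t dep = is_acc ? last_uop_id[slice] : NO_DEP;
    uint32_t first = next_uop_id;
    next_uop_id += (uint32_t)k_tiles;
    /* Updated here, at split time, not when the uop writes back. */
    last_uop_id[slice] = first + (uint32_t)k_tiles - 1u;
    return dep;
}
```

Because the table is written during the split, a tmatmul.acc dispatched immediately after a tmatmul on the same slice already sees the id of that tmatmul's last K uop, even though none of its uops has executed yet.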

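Fixes made in response to the PR #41 review comments:

1. isq.md - fix the logical contradiction in the allocation policy:
- Correct the fairness guarantee so that a chain is restricted only when other chains have active requests
- Allow a single chain to use all 32 entries when the other chains are idle
- Change the condition to: allow = (free_entries > 0) && (other_chains_active_requests == 0)

2. l0cache.md - clarify the Prefetch Buffer capacity definition:
- State that each prefetch buffer entry represents one complete 512B L0 entry request
- 32 entries can buffer 32 L0 entry requests (not 16)
- Update the entry structure: flit0_issued and flit1_issued track the two flits separately
- Update the issue policy: send flit0 and flit1 over two consecutive cycles, then release the buffer entry

3. accumulator.md - make the last_uop_id update timing explicit:
- Emphasize that last_uop_id is updated at the FSM decode-split stage (Dispatch Time)
- Add the key timing, showing that the second tileop can finish splitting while the first is still executing
- Support pipelined parallelism across tileops (decode, issue, and execute overlap)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>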

- Add microarchitecture detail diagrams to architecture.md:
  * ISQ 32-entry structure with ready logic and selection
  * L0 Cache 4-way set-associative organization with PLRU
  * ACC Pool 8-bank organization with slice allocator
  * MAC 16x16 systolic array with PE structure
  * FixPipe 11-stage pipeline with nz to nd conversion
  * Fractal split visualization for 64x64 matrix
- Add microarchitecture diagrams to CUBE_SPEC.md:
  * ISQ, L0, ACC, FixPipe overview diagrams
  * Fractal split example with dependency chains
  * Single uop and full tmatmul timing diagrams
- Add timing diagrams to datapath.md:
  * TMU Ring read timing (tile prefetch)
  * End-to-end datapath timing with stage breakdown
- Add detailed FixPipe diagram to acccvt.md:
  * 11-stage pipeline with format conversion details
  * nz to nd transformation, quantization, rowmax
- Create new diagrams/microarchitecture_details.md:
  * Comprehensive microarchitecture diagram collection
  * 8 detailed diagrams covering all key modules
- Update README.md with new documentation structure

Total: +479 lines of diagrams across 5 modified files, 1 new diagram doc

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Shulin Feng and others added 5 commits June 9, 2026 17:09

docs(cube): Update the README document status table

d80a4cb

docs(cube): Update the architecture diagram to add the Tile Command Buffer

7bd7574

docs(cube): Update the document revision status in WORK_IN_PROGRESS

ad9e163

Shulin Feng and others added 2 commits June 9, 2026 19:47

fengzhazha wants to merge 7 commits into main from cube-docs-update-20260609
