Issue 12: Unify the semantics and computation sources of scheduler and queue telemetry fields #1496

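**Files involved:** `contracts/runtime_telemetry_contract.py`, `node_runtime.py`, `runtime.py`

**Problem description:**

The implementation of `_normalize_queue_summary` in `runtime_telemetry_contract.py` is very complex. It combines four data sources (`stream_tracker`, `backends`, `workload_lanes`, `legacy_payload`), derives fields such as `depth`, `pending`, `running`, and `inflight` through multiple layers of `max()`, and outputs the largest value reported by any source. The design is defensive coding, intended to tolerate incomplete data returned by different versions and different nodes, but it causes the following problems: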

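- When `pending = 3` (from `stream_tracker.active_requests`) and `running = 5` (from `lane_running`) come from different data sources, the two numbers may not be semantically addable or comparable. The final `depth = pending + running` computation merges them anyway, and the result may exceed the real pipeline load.
- The `fallback_rate` field comes from aggregated counters in `_scheduler_observability`. These counters are cumulative values rather than rates, so a caller must divide by a time window to obtain a true rate, but the contract does not expose any time window information.
- The definition of `inflight` differs across data paths: sometimes it equals `running` (lane level), sometimes `backend_running` (backend level), and sometimes `max(running, backend_running)`.
- Telemetry fields are computed redundantly in both `summarize_runtime_scheduler_observability` and `normalize_runtime_scheduler_telemetry`. Both functions are exposed externally, so callers may get different values for the same field.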
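
The first bullet can be illustrated with a minimal sketch. The function below is a hypothetical stand-in, not the actual `_normalize_queue_summary`; the payload shapes and the keys `lane_pending` and `backend_running` as dictionary entries are assumptions made for illustration. It assumes, for the sake of the example, that the 3 active streams are among the 5 running requests, so the real load is 5.

```python
def mixed_source_depth(stream_tracker: dict, workload_lanes: dict, backends: dict) -> dict:
    """Hypothetical stand-in for a max()-based multi-source derivation."""
    # Each field independently takes the largest value any source reports.
    pending = max(
        stream_tracker.get("active_requests", 0),
        workload_lanes.get("lane_pending", 0),  # assumed key
    )
    running = max(
        workload_lanes.get("lane_running", 0),
        backends.get("backend_running", 0),  # assumed key
    )
    # pending and running may now come from different sources.
    return {"pending": pending, "running": running, "depth": pending + running}


# Assumed scenario: 5 requests are running, 3 of them are active streams.
summary = mixed_source_depth(
    stream_tracker={"active_requests": 3},
    workload_lanes={"lane_running": 5, "lane_pending": 0},
    backends={"backend_running": 5},
)
print(summary)  # {'pending': 3, 'running': 5, 'depth': 8}
```

Under these assumptions the reported `depth` of 8 overstates the load of 5, because the same requests are counted once as `pending` (via the stream view) and once as `running` (via the lane view).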

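Work required:

- Write an explicit semantic definition for each field (`depth`, `pending`, `running`, `inflight`, `fallback`, `spillover`), stating what it represents, which data source it comes from, and how it is computed.
- Identify the fields whose multi-source fallback derivation is semantically inconsistent, choose one authoritative source for each, and delete the redundant paths.
- Add time window information to cumulative counters such as `fallback_count`, so that external consumers can compute a true rate.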
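
One possible shape for a counter that carries its time window is sketched below. The `WindowedCounter` class and its field names are hypothetical and do not exist in the contract today; they only show what a consumer would need in order to turn a cumulative count into a rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowedCounter:
    """Cumulative counter plus the window it covers (hypothetical shape)."""

    count: int
    window_start_s: float  # seconds at which counting began
    window_end_s: float    # seconds at which the snapshot was taken

    def rate_per_second(self) -> float:
        duration = self.window_end_s - self.window_start_s
        # An empty or inverted window has no meaningful rate.
        return self.count / duration if duration > 0 else 0.0


# 12 fallbacks observed over a 60-second window.
fallback = WindowedCounter(count=12, window_start_s=100.0, window_end_s=160.0)
print(fallback.rate_per_second())  # 0.2
```

Exposing the window boundaries rather than a precomputed rate keeps the raw count available and lets each consumer decide how to aggregate across snapshots.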

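**Goal:** Give SAGE's scheduler and queue telemetry fields stable, clear meanings, with a single semantic source for each field, so that observability data does not yield contradictory results across call paths. This provides a trustworthy metric foundation for the later explainable scheduling work (Issue 7) and for performance debugging.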
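
As a rough illustration of what "a single semantic source per field" could look like, the sketch below derives every queue field from one snapshot of one source. Choosing the lane view as the authoritative source, defining `inflight` as equal to `running`, and the key `lane_pending` are all assumptions made for this example; the issue leaves those decisions open.

```python
from typing import Mapping


def single_source_queue_summary(lanes: Mapping[str, int]) -> dict:
    """Derive all queue fields from one snapshot of one source (illustrative)."""
    pending = int(lanes.get("lane_pending", 0))  # assumed key
    running = int(lanes.get("lane_running", 0))
    return {
        "pending": pending,          # accepted but not yet started
        "running": running,          # currently executing
        "inflight": running,         # assumed definition: same as running
        "depth": pending + running,  # addable because both share one snapshot
    }


print(single_source_queue_summary({"lane_pending": 2, "lane_running": 5}))
# {'pending': 2, 'running': 5, 'inflight': 5, 'depth': 7}
```

Because `pending` and `running` are read from the same snapshot, `depth` cannot double-count a request the way a cross-source sum can.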

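**Acceptance criteria:** Each of the six fields `depth`, `pending`, `running`, `inflight`, `fallback`, and `spillover` has an explicit semantic definition and a single computation source; cumulative counters such as `fallback_count` carry time window information; and `summarize_runtime_scheduler_observability` and `normalize_runtime_scheduler_telemetry` output the same value for the same field.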
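
The last criterion could be checked with a small comparison helper along these lines. This is only a sketch: `find_mismatches` is a hypothetical name, and it assumes both functions return flat mappings keyed by the field names, which the issue does not confirm. The two dictionaries stand in for the real outputs.

```python
FIELDS = ("depth", "pending", "running", "inflight", "fallback", "spillover")


def find_mismatches(summary: dict, normalized: dict, fields=FIELDS) -> dict:
    """Return {field: (summary_value, normalized_value)} for disagreeing fields."""
    return {
        name: (summary.get(name), normalized.get(name))
        for name in fields
        if summary.get(name) != normalized.get(name)
    }


# Stand-ins for the outputs of the two public functions on the same input.
a = {"depth": 8, "pending": 3, "running": 5, "inflight": 5, "fallback": 1, "spillover": 0}
b = {"depth": 5, "pending": 0, "running": 5, "inflight": 5, "fallback": 1, "spillover": 0}
print(find_mismatches(a, b))  # {'depth': (8, 5), 'pending': (3, 0)}
```

A regression test could assert that this helper returns an empty dictionary for every recorded payload.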

---
Parent issue: #1484
