Feature: Expose Prometheus metrics and add oplog fetch/apply delay monitoring, while staying compatible with the existing REST monitoring

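### Background

MongoShake's public documentation still presents the RESTful API as the primary monitoring method. In the full sync stage, progress is read mainly from `/progress`. In the incremental stage, runtime status is read from `/worker`, `/sentinel`, `/repl`, `/queue`, `/executor`, `/persist`, `/conf`, and similar endpoints. `mongoshake-stat` is also built on these endpoints.

The current branch already contains a first Prometheus integration: the collector exposes `/metrics` on a separate port and exports a set of Prometheus metrics. This lays the groundwork for Prometheus/Grafana integration, but it still lacks unified on/off semantics and a more direct way to observe delay.

This issue asks whether MongoShake can support Prometheus metrics natively.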

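### Problems

- The existing full/incr REST monitoring must remain compatible and must not break existing scripts, dashboards, or `mongoshake-stat`.
- The three monitoring outputs have no unified on/off semantics. Prometheus listening can be controlled through `prom.http_port`, but the full/incr HTTP ports are still filled in with a default value or a random port, so they cannot be explicitly disabled.
- There is no direct way to observe oplog fetch delay or apply (consumption) delay. The existing `lsn` / `lsn_ack` / `lsn_checkpoint` values help judge progress, but they cannot directly answer "how far is the fetch side lagging?" or "how far behind is the consumer side?".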


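### Goals

- Stay compatible with the existing full/incr REST monitoring endpoints.
- Add a standard Prometheus `/metrics` endpoint for easy Prometheus/Grafana integration.
- Allow the three monitoring methods behind `full_sync.http_port`, `incr_sync.http_port`, and `prom.http_port` to be switched on and off independently.
- Provide ready-to-use metrics for oplog fetch delay and apply delay.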
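
The independent switches could be checked at startup roughly as follows. This is a minimal Go sketch, not MongoShake's actual code: it assumes the convention that a non-positive port disables an endpoint, and the `MonitorPorts` type, the `enabled` helper, and the port values are hypothetical names chosen for illustration. Only the three option names come from this issue.

```go
package main

import "fmt"

// MonitorPorts is a hypothetical holder for the three monitoring ports.
type MonitorPorts struct {
	FullSync int // full_sync.http_port
	IncrSync int // incr_sync.http_port
	Prom     int // prom.http_port
}

// enabled reports whether a port value turns its endpoint on.
// Assumed convention: any value <= 0 means "disabled".
func enabled(port int) bool { return port > 0 }

// Validate rejects out-of-range ports and any port shared by two
// enabled endpoints. Disabled endpoints are skipped entirely.
func (p MonitorPorts) Validate() error {
	entries := []struct {
		name string
		port int
	}{
		{"full_sync.http_port", p.FullSync},
		{"incr_sync.http_port", p.IncrSync},
		{"prom.http_port", p.Prom},
	}
	seen := map[int]string{} // port -> option name that claimed it first
	for _, e := range entries {
		if !enabled(e.port) {
			continue
		}
		if e.port > 65535 {
			return fmt.Errorf("%s=%d is out of range", e.name, e.port)
		}
		if other, ok := seen[e.port]; ok {
			return fmt.Errorf("%s and %s both use port %d", other, e.name, e.port)
		}
		seen[e.port] = e.name
	}
	return nil
}

func main() {
	// Illustrative values only: full REST monitoring off, the other two on.
	cfg := MonitorPorts{FullSync: 0, IncrSync: 9100, Prom: 9200}
	if err := cfg.Validate(); err != nil {
		fmt.Println("invalid monitoring config:", err)
		return
	}
	fmt.Println("full REST enabled:", enabled(cfg.FullSync))
	fmt.Println("incr REST enabled:", enabled(cfg.IncrSync))
	fmt.Println("prometheus enabled:", enabled(cfg.Prom))
}
```

Treating every endpoint the same way keeps the conflict check in one place, so the Prometheus port is covered without a special case.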

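### Proposed implementation

- Unify the on/off semantics of the three monitoring ports:
  - `full_sync.http_port <= 0`: disable full sync REST monitoring
  - `incr_sync.http_port <= 0`: disable incremental sync REST monitoring
  - `prom.http_port <= 0`: disable Prometheus monitoring
- Add port conflict validation, and include the Prometheus port in that check.
- Keep the existing REST API paths and response structures unchanged, so that `mongoshake-stat` keeps working without modification.
- Keep the Prometheus metric names already used in the current branch, to avoid extra migration cost.
- Give the delay metrics explicit semantics:
  - `oplog_get_delay`: the difference between the current time and the source oplog `wall` / `clusterTime`, representing lag on the fetch side
  - `oplog_put_delay`: the interval from the source oplog `wall` / `clusterTime` to a successful apply / ack, representing the true lag on the consumer side
- In multi-replica-set / sharded deployments, delay metrics should consistently carry `name` and `stage` labels, so that more than an aggregate value is visible.
- Update `conf/collector.conf` and the README/Wiki with configuration notes and a Prometheus scrape example.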

