Publish sdiag to perfpipe_fair_sdiag_v2 via graph_api (#141)#141
Conversation
CI CommandsThe following CI workflows run automatically on every push and pull request:
The following commands can be used by maintainers to trigger additional tests that require access to secrets:
|
luccabb
left a comment
There was a problem hiding this comment.
add some docs: https://facebookresearch.github.io/gcm/docs/GCM_Monitoring/collectors/
if you could add e2e tests too, mainly to verify that the final scuba table is reasonable
6299623 to
01d5138
Compare
| result = Sdiag(**data) | ||
|
|
||
| if reset: | ||
| self._reset_sdiag_counters() | ||
|
|
||
| return Sdiag(**data) | ||
| return result |
There was a problem hiding this comment.
| result = Sdiag(**data) | |
| if reset: | |
| self._reset_sdiag_counters() | |
| return Sdiag(**data) | |
| return result | |
| if reset: | |
| self._reset_sdiag_counters() | |
| return Sdiag(**data) |
| "DBD Agent queue size:": "dbd_agent_queue_size", | ||
| } | ||
| data: dict[str, Optional[int]] = { | ||
| data: dict[str, Any] = { |
There was a problem hiding this comment.
can we be more specific than Any?
| # fields so any sdiag stat we have not broken into its own column is | ||
| # still queryable from Scuba via JSON_EXTRACT and so future Slurm | ||
| # additions land in the table without a schema change. |
There was a problem hiding this comment.
| # fields so any sdiag stat we have not broken into its own column is | |
| # still queryable from Scuba via JSON_EXTRACT and so future Slurm | |
| # additions land in the table without a schema change. | |
| # fields so any sdiag stat we have not broken into its own column stays | |
| # queryable downstream via JSON path expressions, and new fields land | |
| # without requiring a schema change. |
| # Don't reset here: the existing slurm_monitor ODS collector also calls | ||
| # sdiag_structured() and owns the reset cadence. Resetting in both | ||
| # collectors would double-reset the counters and lose data. |
There was a problem hiding this comment.
maybe expose both as parameters and let upstream callers decide what/when to reset
|
@luccabb — pushed a structural rewrite (force-pushed after rebasing on Architecture change. Dropped the standalone Comments addressed:
Bonus fix while in |
01d5138 to
b8a82d9
Compare
|
v12: fixed a bug I introduced in the previous force-push. With Fix in
Tests added:
E2E re-verified on fair-aws-rc-1: one |
b8a82d9 to
0bba3e7
Compare
|
v13: fixes OSS CI that v12 broke.
All 3 OSS checks pass locally: |
0bba3e7 to
9abe246
Compare
767f1fe to
6b9f082
Compare
| # only explicit `self.otel_logger.info("", extra=...)` emits in the | ||
| # otel pipeline. | ||
| self.otel_logger = logging.getLogger("_gcm_otel_emit") | ||
| self.otel_logger.propagate = False |
There was a problem hiding this comment.
polluting the target Scuba table with
# null-data log rows.
the row gets published to scuba but with empty fields? do you have a sample?
There was a problem hiding this comment.
There was a problem hiding this comment.
Yes — concrete sample. Two tables affected, the second one is the dramatic case.
fair_sdiag sample row (one per dataclass_utils.warning("Missing field_name=...") per slurm_monitor cycle):
time=2026-05-17 16:14:41 PDT cluster=null derived_cluster=null
schedule_cycle_last=null bf_active=null ...all 70 cols null...
code.file.path=gcm/monitoring/dataclass_utils.py
code.function.name=instantiate_dataclass
~80 such null rows in last 24h. Scuba: https://fburl.com/scuba/fair_sdiag/oj8u9g16
fair_sacct_running is the dramatic case: 637,929 of 713,988 rows over the last 24h (89.3%) are null-data leakage. Most of those (574,000) attribute to gcm/exporters/otel.py:_write_log itself — the handler was recursing on log records emitted by its own emit path (_write_log logs at gcm.exporters.otel → propagates up to "gcm" parent → re-fires the handler → emits another row → which logs again → snowball). sacct_running runs hourly with large batches so each invocation generates 26-38k junk rows.
Full counts after running this query across all otel-targeted Scuba tables:
| Dataset | Total 24h | Null-cluster leaked | % |
|---|---|---|---|
| fair_sacct_running | 713,988 | 637,929 | 89.3% |
| fair_sdiag | 200 | 80 | 40% |
| fair_sacct | 10,360,194 | 0 | 0% |
| fair_scontrol_data | 74 | 0 | 0% |
| fair_scontrol_config | 68 | 0 | 0% |
| fair_sacctmgr_qos | 639 | 0 | 0% |
| fair_sacctmgr_user | 517,754 | 0 | 0% |
Only the first two trigger gcm-level logging during their otel write path; the other four don't, so the bug never fires on them. The fix (logger = "_gcm_otel_emit", propagate=False) removes the entire class of self-amplification.
There was a problem hiding this comment.
that query is Cluster vs cluster naming diff. there's a few nulls on both but with other fields populated, I'm not confident this fixes the root of this issue: https://fburl.com/scuba/fair_sacct_running/28owvuw2
fair_sdiag
there's a few rows here where cluster=null and Body gets populated, I think your change fixes this: https://fburl.com/scuba/fair_sdiag/q9a0ou8f
add some tests:
def test_unrelated_gcm_logger_does_not_emit_to_otel():
sink = OtelSink(...)
other = logging.getLogger("gcm.dataclass_utils")
other.warning("Missing field_name=foo")
assert sink._scuba_writes == []
def test_gcm_logger_emits_to_otel():
sink = OtelSink(...)
sink.write(<data> ...)
assert sink._scuba_writes == [<data> ...]
6b9f082 to
d473e23
Compare
…ch#141) Summary: Routes Slurm `sdiag` scrapes into a Scuba-backed perfpipe category (`perfpipe_fair_sdiag_v2`, namespace `ai_infra`, oncall `fair_efficiency`) via the existing `graph_api` exporter. Same shape as every other `perfpipe_fair_*` category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter. Three code changes: 1. **`Sdiag` dataclass + `SlurmCliClient.sdiag_structured()` extended** to capture the ~50 fields exposed by `sdiag --all --json` on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (`req_time`, `gettimeofday_latency`, `job_states_ts`, `parts_packed`), RPC blobs (`rpcs_by_message_type_json`, `rpcs_by_user_json`) as flat JSON strings. `bf_active` is `Optional[bool]` so the JSON serializer emits real `true`/`false`. 2. **`sdiag_scribe_category` kwarg threaded through `GraphAPI.__init__`** (alongside the existing `node_scribe_category`, `job_scribe_category`, `statvfs_scribe_category`, `pure_scribe_category`). `DataIdentifier.SDIAG` is registered in `monitoring/sink/protocol.py`; `CliObject.last_sdiag` cache + `collect_sdiag` generator added in `monitoring/cli/slurm_monitor.py`. A second tuple in `data_collection_tasks` schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). `last_sdiag` is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when `last_sdiag is None`. 3. **`meta_utils/scribe.py::ScribeErrorWithAcks` surfaces raw reject codes.** Before: `"Failed to write N/N messages"`. After: `"Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']"`. Three lines, no behavior change on success; applies to every graph_api LOG path. Ride-along (`exporters/otel.py`): OTLP handler moved from the `"gcm"` parent logger to a child `"_gcm_otel_emit"` with `propagate=False`, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as `body="Missing field_name=..."`. Standalone hygiene fix. Docs: `website/docs/GCM_Monitoring/collectors/slurm_monitor.md` documents the dual-publish (METRIC->ODS + LOG->scribe via `DataIdentifier.SDIAG`) and the `last_sdiag` single-scrape caching pattern. `exporters/graph_api.md` gains a per-`DataIdentifier` routing table covering all six `*_scribe_category` kwargs plus the unset-category assertion behavior. Reviewed By: yonglimeta Differential Revision: D95971502
d473e23 to
ff069c9
Compare
…ch#141) Summary: Routes Slurm `sdiag` scrapes into a Scuba-backed perfpipe category (`perfpipe_fair_sdiag_v2`, namespace `ai_infra`, oncall `fair_efficiency`) via the existing `graph_api` exporter. Same shape as every other `perfpipe_fair_*` category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter. Three code changes: 1. **`Sdiag` dataclass + `SlurmCliClient.sdiag_structured()` extended** to capture the ~50 fields exposed by `sdiag --all --json` on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (`req_time`, `gettimeofday_latency`, `job_states_ts`, `parts_packed`), RPC blobs (`rpcs_by_message_type_json`, `rpcs_by_user_json`) as flat JSON strings. `bf_active` is `Optional[bool]` so the JSON serializer emits real `true`/`false`. 2. **`sdiag_scribe_category` kwarg threaded through `GraphAPI.__init__`** (alongside the existing `node_scribe_category`, `job_scribe_category`, `statvfs_scribe_category`, `pure_scribe_category`). `DataIdentifier.SDIAG` is registered in `monitoring/sink/protocol.py`; `CliObject.last_sdiag` cache + `collect_sdiag` generator added in `monitoring/cli/slurm_monitor.py`. A second tuple in `data_collection_tasks` schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). `last_sdiag` is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when `last_sdiag is None`. 3. **`meta_utils/scribe.py::ScribeErrorWithAcks` surfaces raw reject codes.** Before: `"Failed to write N/N messages"`. After: `"Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']"`. Three lines, no behavior change on success; applies to every graph_api LOG path. Ride-along (`exporters/otel.py`): OTLP handler moved from the `"gcm"` parent logger to a child `"_gcm_otel_emit"` with `propagate=False`, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as `body="Missing field_name=..."`. Standalone hygiene fix. Docs: `website/docs/GCM_Monitoring/collectors/slurm_monitor.md` documents the dual-publish (METRIC->ODS + LOG->scribe via `DataIdentifier.SDIAG`) and the `last_sdiag` single-scrape caching pattern. `exporters/graph_api.md` gains a per-`DataIdentifier` routing table covering all six `*_scribe_category` kwargs plus the unset-category assertion behavior. Reviewed By: yonglimeta Differential Revision: D95971502
ff069c9 to
c668cd8
Compare
| @dataclass(kw_only=True) | ||
| class SdiagLog(Sdiag, DerivedCluster): | ||
| cluster: str |
There was a problem hiding this comment.
can we add DerivedCluster and cluster to the base Sdiag class?
There was a problem hiding this comment.
Done in v25.
Added cluster: Optional[str] = None to the base Sdiag dataclass (populated at the collect-sdiag site, not by the JSON parser — keeps the parser path untouched).
Dropped derived_cluster and deleted SdiagLog entirely. Rationale: sdiag is a cluster-wide slurmctld stat, so tagging it with derived_cluster would generate misleading duplicate rows for the same scrape (fair-sc, fair-sc.h200, fair-sc.h100 would all carry identical sdiag values). One row per cluster per cycle is the correct shape, even with --heterogeneous-cluster-v1. collect_sdiag now yields Sdiag directly via replace(sdiag, cluster=cluster).
| def collect_sdiag( | ||
| obj: CliObject, | ||
| cluster: str, | ||
| logger: logging.Logger, | ||
| ) -> Generator[SdiagLog, None, None]: | ||
| """Project the most recent sdiag scrape (cached on the CliObject by | ||
| `collect_slurm`) to an SdiagLog row for the Scuba `perfpipe_fair_sdiag_v2` dataset. | ||
|
|
||
| Single sdiag scrape per cycle is shared with `collect_slurm` -- no | ||
| double-scrape, no reset-counter race. sdiag is a cluster-wide slurmctld | ||
| stat so we yield exactly one SdiagLog per cycle (not one per partition). | ||
| """ | ||
| sdiag = obj.last_sdiag | ||
| if sdiag is None: | ||
| # collect_slurm hasn't populated the cache yet (first cycle race or | ||
| # collect_slurm raised). Skip rather than crash the loop -- the next | ||
| # cycle will recover. | ||
| logger.debug("collect_sdiag: no cached sdiag scrape available, skipping") | ||
| return | ||
| yield SdiagLog(cluster=cluster, derived_cluster=cluster, **asdict(sdiag)) |
There was a problem hiding this comment.
I think just fold this into collect_slurm -> collect_slurm_and_sdiag. the current state creates an implicit requirement that collect_slurm must always run before collect_sdiag
There was a problem hiding this comment.
i think another way is to write a class that does the querying
class CollectionContext:
def sdiag(): ...
def sinfo(): ...
then you init that class inside run_data_collection_loop at every round of collection, call its methods, and cache results.
def callable(collection_context):
return collection_context.sinfo
...
...
while True:
collection_context = CollectionContext(...)
for callable, ...:
callable(collection_context, ...)
either way, this would require changes on run_data_collection_loop, which has a decent blast radius. Wouldn't be opposed to run_data_collection_loop_v2 that handles this
If we need the data asap, maybe just say explicitly in the docstring that collect_slurm must be called before. Also add a comment here that not respecting task ordering may break collection.
There was a problem hiding this comment.
Took the short path in v25.
Added an explicit Prerequisite block to the collect_sdiag docstring (must run after collect_slurm in the same cycle; no-ops cleanly if last_sdiag is None), plus a shared-cache ordering comment at the top of the data_collection_tasks loop in monitoring/utils/monitor.py.
The CollectionContext + run_data_collection_loop_v2 refactor is the right long-term shape, but the blast radius (every gcm collector) is wider than this diff. Filing as a follow-up once this lands — rc1 burn-in is queued behind it and the docstring + loop comment is enough to prevent the ordering footgun for now.
c668cd8 to
8791db4
Compare
|
v25 addresses all three review comments. Comment on Added Dropped Comments on Took the short path you outlined: explicit docstring on The longer-term refactor you sketched ( |
…ch#141) Summary: Pull Request resolved: facebookresearch#141 Routes Slurm `sdiag` scrapes into a Scuba-backed perfpipe category (`perfpipe_fair_sdiag_v2`, namespace `ai_infra`, oncall `fair_efficiency`) via the existing `graph_api` exporter. Same shape as every other `perfpipe_fair_*` category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter. Three code changes: 1. **`Sdiag` dataclass + `SlurmCliClient.sdiag_structured()` extended** to capture the ~50 fields exposed by `sdiag --all --json` on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (`req_time`, `gettimeofday_latency`, `job_states_ts`, `parts_packed`), RPC blobs (`rpcs_by_message_type_json`, `rpcs_by_user_json`) as flat JSON strings. `bf_active` is `Optional[bool]` so the JSON serializer emits real `true`/`false`. `cluster: Optional[str] = None` is also on the base dataclass (populated at the collect-sdiag site, not by the parser); `derived_cluster` is intentionally omitted because sdiag is a cluster-wide slurmctld stat and tagging it with a partition-split identifier would generate misleading duplicate rows. 2. **`sdiag_scribe_category` kwarg threaded through `GraphAPI.__init__`** (alongside the existing `node_scribe_category`, `job_scribe_category`, `statvfs_scribe_category`, `pure_scribe_category`). `DataIdentifier.SDIAG` is registered in `monitoring/sink/protocol.py`; `CliObject.last_sdiag` cache + `collect_sdiag` generator added in `monitoring/cli/slurm_monitor.py`. A second tuple in `data_collection_tasks` schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). `last_sdiag` is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when `last_sdiag is None`. `collect_sdiag` documents the cache-ordering prerequisite (must run after `collect_slurm` in the same cycle); the loop site at `monitoring/utils/monitor.py` carries a matching comment about shared-cache patterns being sensitive to task ordering. 3. **`meta_utils/scribe.py::ScribeErrorWithAcks` surfaces raw reject codes.** Before: `"Failed to write N/N messages"`. After: `"Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']"`. Three lines, no behavior change on success; applies to every graph_api LOG path. Ride-along (`exporters/otel.py`): OTLP handler moved from the `"gcm"` parent logger to a child `"_gcm_otel_emit"` with `propagate=False`, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as `body="Missing field_name=..."`. Standalone hygiene fix. Docs: `website/docs/GCM_Monitoring/collectors/slurm_monitor.md` documents the dual-publish (METRIC->ODS + LOG->scribe via `DataIdentifier.SDIAG`), the `last_sdiag` single-scrape caching pattern, and the ordering prerequisite. `exporters/graph_api.md` gains a per-`DataIdentifier` routing table covering all six `*_scribe_category` kwargs plus the unset-category assertion behavior, and an "Adding a new scribe category (Meta-internal)" section linking the Scribe wiki for DMV/PrivacyLib/ScribeShell unblock steps. Reviewed By: yonglimeta Differential Revision: D95971502
8791db4 to
f5fcba3
Compare
…ch#141) Summary: Pull Request resolved: facebookresearch#141 Routes Slurm `sdiag` scrapes into a Scuba-backed perfpipe category (`perfpipe_fair_sdiag_v2`, namespace `ai_infra`, oncall `fair_efficiency`) via the existing `graph_api` exporter. Same shape as every other `perfpipe_fair_*` category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter. Three code changes: 1. **`Sdiag` dataclass + `SlurmCliClient.sdiag_structured()` extended** to capture the ~50 fields exposed by `sdiag --all --json` on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (`req_time`, `gettimeofday_latency`, `job_states_ts`, `parts_packed`), RPC blobs (`rpcs_by_message_type_json`, `rpcs_by_user_json`) as flat JSON strings. `bf_active` is `Optional[bool]` so the JSON serializer emits real `true`/`false`. `cluster: Optional[str] = None` is also on the base dataclass (populated at the collect-sdiag site, not by the parser); `derived_cluster` is intentionally omitted because sdiag is a cluster-wide slurmctld stat and tagging it with a partition-split identifier would generate misleading duplicate rows. 2. **`sdiag_scribe_category` kwarg threaded through `GraphAPI.__init__`** (alongside the existing `node_scribe_category`, `job_scribe_category`, `statvfs_scribe_category`, `pure_scribe_category`). `DataIdentifier.SDIAG` is registered in `monitoring/sink/protocol.py`; `CliObject.last_sdiag` cache + `collect_sdiag` generator added in `monitoring/cli/slurm_monitor.py`. A second tuple in `data_collection_tasks` schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). `last_sdiag` is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when `last_sdiag is None`. `collect_sdiag` documents the cache-ordering prerequisite (must run after `collect_slurm` in the same cycle); the loop site at `monitoring/utils/monitor.py` carries a matching comment about shared-cache patterns being sensitive to task ordering. 3. **`meta_utils/scribe.py::ScribeErrorWithAcks` surfaces raw reject codes.** Before: `"Failed to write N/N messages"`. After: `"Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']"`. Three lines, no behavior change on success; applies to every graph_api LOG path. Ride-along (`exporters/otel.py`): OTLP handler moved from the `"gcm"` parent logger to a child `"_gcm_otel_emit"` with `propagate=False`, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as `body="Missing field_name=..."`. Standalone hygiene fix. Docs: `website/docs/GCM_Monitoring/collectors/slurm_monitor.md` documents the dual-publish (METRIC->ODS + LOG->scribe via `DataIdentifier.SDIAG`), the `last_sdiag` single-scrape caching pattern, and the ordering prerequisite. `exporters/graph_api.md` gains a per-`DataIdentifier` routing table covering all six `*_scribe_category` kwargs plus the unset-category assertion behavior, and an "Adding a new scribe category (Meta-internal)" section linking the Scribe wiki for DMV/PrivacyLib/ScribeShell unblock steps. Reviewed By: yonglimeta Differential Revision: D95971502
f5fcba3 to
4e83a6c
Compare
…ch#141) Summary: Routes Slurm `sdiag` scrapes into a Scuba-backed perfpipe category (`perfpipe_fair_sdiag_v2`, namespace `ai_infra`, oncall `fair_efficiency`) via the existing `graph_api` exporter. Same shape as every other `perfpipe_fair_*` category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter. Three code changes: 1. **`Sdiag` dataclass + `SlurmCliClient.sdiag_structured()` extended** to capture the ~50 fields exposed by `sdiag --all --json` on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (`req_time`, `gettimeofday_latency`, `job_states_ts`, `parts_packed`), RPC blobs (`rpcs_by_message_type_json`, `rpcs_by_user_json`) as flat JSON strings. `bf_active` is `Optional[bool]` so the JSON serializer emits real `true`/`false`. `cluster: Optional[str] = None` is also on the base dataclass (populated at the collect-sdiag site, not by the parser); `derived_cluster` is intentionally omitted because sdiag is a cluster-wide slurmctld stat and tagging it with a partition-split identifier would generate misleading duplicate rows. 2. **`sdiag_scribe_category` kwarg threaded through `GraphAPI.__init__`** (alongside the existing `node_scribe_category`, `job_scribe_category`, `statvfs_scribe_category`, `pure_scribe_category`). `DataIdentifier.SDIAG` is registered in `monitoring/sink/protocol.py`; `CliObject.last_sdiag` cache + `collect_sdiag` generator added in `monitoring/cli/slurm_monitor.py`. A second tuple in `data_collection_tasks` schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). `last_sdiag` is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when `last_sdiag is None`. `collect_sdiag` documents the cache-ordering prerequisite (must run after `collect_slurm` in the same cycle); the loop site at `monitoring/utils/monitor.py` carries a matching comment about shared-cache patterns being sensitive to task ordering. 3. **`meta_utils/scribe.py::ScribeErrorWithAcks` surfaces raw reject codes.** Before: `"Failed to write N/N messages"`. After: `"Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']"`. Three lines, no behavior change on success; applies to every graph_api LOG path. Ride-along (`exporters/otel.py`): OTLP handler moved from the `"gcm"` parent logger to a child `"_gcm_otel_emit"` with `propagate=False`, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as `body="Missing field_name=..."`. Standalone hygiene fix. Docs: `website/docs/GCM_Monitoring/collectors/slurm_monitor.md` documents the dual-publish (METRIC->ODS + LOG->scribe via `DataIdentifier.SDIAG`), the `last_sdiag` single-scrape caching pattern, and the ordering prerequisite. `exporters/graph_api.md` gains a per-`DataIdentifier` routing table covering all six `*_scribe_category` kwargs plus the unset-category assertion behavior, and an "Adding a new scribe category (Meta-internal)" section linking the Scribe wiki for DMV/PrivacyLib/ScribeShell unblock steps. Reviewed By: yonglimeta Differential Revision: D95971502
4e83a6c to
79c1a7f
Compare
| # Task ordering matters when callables share an instance cache (e.g. | ||
| # `collect_slurm` populates `CliObject.last_sdiag`, consumed by | ||
| # `collect_sdiag`). Reordering tasks in `data_collection_tasks` can | ||
| # break collection for shared-cache consumers. |
There was a problem hiding this comment.
| # Task ordering matters when callables share an instance cache (e.g. | |
| # `collect_slurm` populates `CliObject.last_sdiag`, consumed by | |
| # `collect_sdiag`). Reordering tasks in `data_collection_tasks` can | |
| # break collection for shared-cache consumers. | |
| # Task ordering matters for some collections. If you wish to update this assumption please create a new `run_data_collection_loop` function |
| # Task B: Sdiag -> LOG -> scribe (perfpipe_fair_sdiag_v2) -> | ||
| # Scuba (perfpipe_fair_sdiag_v2) via graph_api._write_log. Reads cached | ||
| # sdiag from Task A. Mirrors the slurm_job_monitor pattern of | ||
| # per-task DataIdentifier-driven scribe routing. |
There was a problem hiding this comment.
| # Task B: Sdiag -> LOG -> scribe (perfpipe_fair_sdiag_v2) -> | |
| # Scuba (perfpipe_fair_sdiag_v2) via graph_api._write_log. Reads cached | |
| # sdiag from Task A. Mirrors the slurm_job_monitor pattern of | |
| # per-task DataIdentifier-driven scribe routing. |
| # Task A: SLURMLog -> METRIC -> ODS via graph_api._write_metric. | ||
| # Populates obj.last_sdiag for Task B. |
There was a problem hiding this comment.
| # Task A: SLURMLog -> METRIC -> ODS via graph_api._write_metric. | |
| # Populates obj.last_sdiag for Task B. |
| import logging as _logging # noqa: E402 | ||
|
|
||
| from gcm.monitoring.cli.slurm_monitor import ( # noqa: E402 | ||
| CliObjectImpl as _CliObjectImpl, | ||
| collect_sdiag as _collect_sdiag, | ||
| ) | ||
| from gcm.schemas.slurm.sdiag import Sdiag as _Sdiag # noqa: E402 |
| Performs comprehensive analytics on SLURM cluster state by combining data from multiple sources (`sinfo`, `sdiag`, `sacct`) and computing aggregated metrics. Provides deep insights into cluster health, resource utilization, job analytics, and user activity. | ||
|
|
||
| **Data Type**: `DataType.METRIC`, **Schemas**: `SLURMLog`, `SLURMLogAccountMetrics` | ||
| **Data Type**: `DataType.METRIC` (`SLURMLog` → ODS) + `DataType.LOG` with `DataIdentifier.SDIAG` (`Sdiag` → scribe → Scuba `perfpipe_fair_sdiag_v2`). |
There was a problem hiding this comment.
| **Data Type**: `DataType.METRIC` (`SLURMLog` → ODS) + `DataType.LOG` with `DataIdentifier.SDIAG` (`Sdiag` → scribe → Scuba `perfpipe_fair_sdiag_v2`). | |
| **Data Type**: `DataType.METRIC` + `DataType.LOG` with `DataIdentifier.SDIAG`. |
…ch#141) Summary: Routes Slurm `sdiag` scrapes into a Scuba-backed perfpipe category (`perfpipe_fair_sdiag_v2`, namespace `ai_infra`, oncall `fair_efficiency`) via the existing `graph_api` exporter. Same shape as every other `perfpipe_fair_*` category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter. Three code changes: 1. **`Sdiag` dataclass + `SlurmCliClient.sdiag_structured()` extended** to capture the ~50 fields exposed by `sdiag --all --json` on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (`req_time`, `gettimeofday_latency`, `job_states_ts`, `parts_packed`), RPC blobs (`rpcs_by_message_type_json`, `rpcs_by_user_json`) as flat JSON strings. `bf_active` is `Optional[bool]` so the JSON serializer emits real `true`/`false`. `cluster: Optional[str] = None` is also on the base dataclass (populated at the collect-sdiag site, not by the parser); `derived_cluster` is intentionally omitted because sdiag is a cluster-wide slurmctld stat and tagging it with a partition-split identifier would generate misleading duplicate rows. 2. **`sdiag_scribe_category` kwarg threaded through `GraphAPI.__init__`** (alongside the existing `node_scribe_category`, `job_scribe_category`, `statvfs_scribe_category`, `pure_scribe_category`). `DataIdentifier.SDIAG` is registered in `monitoring/sink/protocol.py`; `CliObject.last_sdiag` cache + `collect_sdiag` generator added in `monitoring/cli/slurm_monitor.py`. A second tuple in `data_collection_tasks` schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). `last_sdiag` is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when `last_sdiag is None`. `collect_sdiag` documents the cache-ordering prerequisite (must run after `collect_slurm` in the same cycle); the loop site at `monitoring/utils/monitor.py` carries a matching comment about shared-cache patterns being sensitive to task ordering. 3. **`meta_utils/scribe.py::ScribeErrorWithAcks` surfaces raw reject codes.** Before: `"Failed to write N/N messages"`. After: `"Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']"`. Three lines, no behavior change on success; applies to every graph_api LOG path. Ride-along (`exporters/otel.py`): OTLP handler moved from the `"gcm"` parent logger to a child `"_gcm_otel_emit"` with `propagate=False`, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as `body="Missing field_name=..."`. Standalone hygiene fix. Docs: `website/docs/GCM_Monitoring/collectors/slurm_monitor.md` documents the dual-publish (METRIC->ODS + LOG->scribe via `DataIdentifier.SDIAG`), the `last_sdiag` single-scrape caching pattern, and the ordering prerequisite. `exporters/graph_api.md` gains a per-`DataIdentifier` routing table covering all six `*_scribe_category` kwargs plus the unset-category assertion behavior, and an "Adding a new scribe category (Meta-internal)" section linking the Scribe wiki for DMV/PrivacyLib/ScribeShell unblock steps. Reviewed By: yonglimeta Differential Revision: D95971502
79c1a7f to
11b2b73
Compare
…ch#141) Summary: Routes Slurm `sdiag` scrapes into a Scuba-backed perfpipe category (`perfpipe_fair_sdiag_v2`, namespace `ai_infra`, oncall `fair_efficiency`) via the existing `graph_api` exporter. Same shape as every other `perfpipe_fair_*` category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter. Three code changes: 1. **`Sdiag` dataclass + `SlurmCliClient.sdiag_structured()` extended** to capture the ~50 fields exposed by `sdiag --all --json` on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (`req_time`, `gettimeofday_latency`, `job_states_ts`, `parts_packed`), RPC blobs (`rpcs_by_message_type_json`, `rpcs_by_user_json`) as flat JSON strings. `bf_active` is `Optional[bool]` so the JSON serializer emits real `true`/`false`. `cluster: Optional[str] = None` is also on the base dataclass (populated at the collect-sdiag site, not by the parser); `derived_cluster` is intentionally omitted because sdiag is a cluster-wide slurmctld stat and tagging it with a partition-split identifier would generate misleading duplicate rows. 2. **`sdiag_scribe_category` kwarg threaded through `GraphAPI.__init__`** (alongside the existing `node_scribe_category`, `job_scribe_category`, `statvfs_scribe_category`, `pure_scribe_category`). `DataIdentifier.SDIAG` is registered in `monitoring/sink/protocol.py`; `CliObject.last_sdiag` cache + `collect_sdiag` generator added in `monitoring/cli/slurm_monitor.py`. A second tuple in `data_collection_tasks` schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). `last_sdiag` is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when `last_sdiag is None`. `collect_sdiag` documents the cache-ordering prerequisite (must run after `collect_slurm` in the same cycle); the loop site at `monitoring/utils/monitor.py` carries a matching comment about shared-cache patterns being sensitive to task ordering. 3. **`meta_utils/scribe.py::ScribeErrorWithAcks` surfaces raw reject codes.** Before: `"Failed to write N/N messages"`. After: `"Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']"`. Three lines, no behavior change on success; applies to every graph_api LOG path. Ride-along (`exporters/otel.py`): OTLP handler moved from the `"gcm"` parent logger to a child `"_gcm_otel_emit"` with `propagate=False`, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as `body="Missing field_name=..."`. Standalone hygiene fix. Docs: `website/docs/GCM_Monitoring/collectors/slurm_monitor.md` documents the dual-publish (METRIC->ODS + LOG->scribe via `DataIdentifier.SDIAG`), the `last_sdiag` single-scrape caching pattern, and the ordering prerequisite. `exporters/graph_api.md` gains a per-`DataIdentifier` routing table covering all six `*_scribe_category` kwargs plus the unset-category assertion behavior, and an "Adding a new scribe category (Meta-internal)" section linking the Scribe wiki for DMV/PrivacyLib/ScribeShell unblock steps. Reviewed By: yonglimeta Differential Revision: D95971502
11b2b73 to
c41c8cb
Compare
…ch#141) Summary: Routes Slurm `sdiag` scrapes into a Scuba-backed perfpipe category (`perfpipe_fair_sdiag_v2`, namespace `ai_infra`, oncall `fair_efficiency`) via the existing `graph_api` exporter. Same shape as every other `perfpipe_fair_*` category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter. Three code changes: 1. **`Sdiag` dataclass + `SlurmCliClient.sdiag_structured()` extended** to capture the ~50 fields exposed by `sdiag --all --json` on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (`req_time`, `gettimeofday_latency`, `job_states_ts`, `parts_packed`), RPC blobs (`rpcs_by_message_type_json`, `rpcs_by_user_json`) as flat JSON strings. `bf_active` is `Optional[bool]` so the JSON serializer emits real `true`/`false`. `cluster: Optional[str] = None` is also on the base dataclass (populated at the collect-sdiag site, not by the parser); `derived_cluster` is intentionally omitted because sdiag is a cluster-wide slurmctld stat and tagging it with a partition-split identifier would generate misleading duplicate rows. 2. **`sdiag_scribe_category` kwarg threaded through `GraphAPI.__init__`** (alongside the existing `node_scribe_category`, `job_scribe_category`, `statvfs_scribe_category`, `pure_scribe_category`). `DataIdentifier.SDIAG` is registered in `monitoring/sink/protocol.py`; `CliObject.last_sdiag` cache + `collect_sdiag` generator added in `monitoring/cli/slurm_monitor.py`. A second tuple in `data_collection_tasks` schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). `last_sdiag` is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when `last_sdiag is None`. `collect_sdiag` documents the cache-ordering prerequisite (must run after `collect_slurm` in the same cycle); the loop site at `monitoring/utils/monitor.py` carries a matching comment about shared-cache patterns being sensitive to task ordering. 3. **`meta_utils/scribe.py::ScribeErrorWithAcks` surfaces raw reject codes.** Before: `"Failed to write N/N messages"`. After: `"Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']"`. Three lines, no behavior change on success; applies to every graph_api LOG path. Ride-along (`exporters/otel.py`): OTLP handler moved from the `"gcm"` parent logger to a child `"_gcm_otel_emit"` with `propagate=False`, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as `body="Missing field_name=..."`. Standalone hygiene fix. Docs: `website/docs/GCM_Monitoring/collectors/slurm_monitor.md` documents the dual-publish (METRIC->ODS + LOG->scribe via `DataIdentifier.SDIAG`), the `last_sdiag` single-scrape caching pattern, and the ordering prerequisite. `exporters/graph_api.md` gains a per-`DataIdentifier` routing table covering all six `*_scribe_category` kwargs plus the unset-category assertion behavior, and an "Adding a new scribe category (Meta-internal)" section linking the Scribe wiki for DMV/PrivacyLib/ScribeShell unblock steps. Reviewed By: yonglimeta Differential Revision: D95971502
c41c8cb to
7c29686
Compare
|
1 failed test seems to be because of fbcode <--> github cyclic dependency. Fbcode test failure is because of github dependency is failing. Github test failures point to fbcode diff failure. Force pushing to resolve this. Already tried rebasing multiple times. |
Summary:
Routes Slurm
sdiagscrapes into a Scuba-backed perfpipe category (perfpipe_fair_sdiag_v2, namespaceai_infra, oncallfair_efficiency) via the existinggraph_apiexporter. Same shape as every otherperfpipe_fair_*category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter.Three code changes:
Sdiagdataclass +SlurmCliClient.sdiag_structured()extended to capture the ~50 fields exposed bysdiag --all --jsonon Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (req_time,gettimeofday_latency,job_states_ts,parts_packed), RPC blobs (rpcs_by_message_type_json,rpcs_by_user_json) as flat JSON strings.bf_activeisOptional[bool]so the JSON serializer emits realtrue/false.cluster: Optional[str] = Noneis also on the base dataclass (populated at the collect-sdiag site, not by the parser);derived_clusteris intentionally omitted because sdiag is a cluster-wide slurmctld stat and tagging it with a partition-split identifier would generate misleading duplicate rows.sdiag_scribe_categorykwarg threaded throughGraphAPI.__init__(alongside the existingnode_scribe_category,job_scribe_category,statvfs_scribe_category,pure_scribe_category).DataIdentifier.SDIAGis registered inmonitoring/sink/protocol.py;CliObject.last_sdiagcache +collect_sdiaggenerator added inmonitoring/cli/slurm_monitor.py. A second tuple indata_collection_tasksschedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load).last_sdiagis cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op whenlast_sdiag is None.collect_sdiagdocuments the cache-ordering prerequisite (must run aftercollect_slurmin the same cycle); the loop site atmonitoring/utils/monitor.pycarries a matching comment about shared-cache patterns being sensitive to task ordering.meta_utils/scribe.py::ScribeErrorWithAckssurfaces raw reject codes. Before:"Failed to write N/N messages". After:"Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']". Three lines, no behavior change on success; applies to every graph_api LOG path.Ride-along (
exporters/otel.py): OTLP handler moved from the"gcm"parent logger to a child"_gcm_otel_emit"withpropagate=False, so OTLP-emit validation errors no longer mirror into Scuba LOG rows asbody="Missing field_name=...". Standalone hygiene fix.Docs:
website/docs/GCM_Monitoring/collectors/slurm_monitor.mddocuments the dual-publish (METRIC->ODS + LOG->scribe viaDataIdentifier.SDIAG), thelast_sdiagsingle-scrape caching pattern, and the ordering prerequisite.exporters/graph_api.mdgains a per-DataIdentifierrouting table covering all six*_scribe_categorykwargs plus the unset-category assertion behavior, and an "Adding a new scribe category (Meta-internal)" section linking the Scribe wiki for DMV/PrivacyLib/ScribeShell unblock steps.Reviewed By: yonglimeta
Differential Revision: D95971502