Publish sdiag to perfpipe_fair_sdiag_v2 via graph_api (#141) by Yash0270 · Pull Request #141 · facebookresearch/gcm

Yash0270 · 2026-05-07T18:43:14Z

Summary:

Routes Slurm sdiag scrapes into a Scuba-backed perfpipe category (perfpipe_fair_sdiag_v2, namespace ai_infra, oncall fair_efficiency) via the existing graph_api exporter. Same shape as every other perfpipe_fair_* category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter.

Three code changes:

Sdiag dataclass + SlurmCliClient.sdiag_structured() extended to capture the ~50 fields exposed by sdiag --all --json on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (req_time, gettimeofday_latency, job_states_ts, parts_packed), RPC blobs (rpcs_by_message_type_json, rpcs_by_user_json) as flat JSON strings. bf_active is Optional[bool] so the JSON serializer emits real true/false. cluster: Optional[str] = None is also on the base dataclass (populated at the collect-sdiag site, not by the parser); derived_cluster is intentionally omitted because sdiag is a cluster-wide slurmctld stat and tagging it with a partition-split identifier would generate misleading duplicate rows.
sdiag_scribe_category kwarg threaded through GraphAPI.__init__ (alongside the existing node_scribe_category, job_scribe_category, statvfs_scribe_category, pure_scribe_category). DataIdentifier.SDIAG is registered in monitoring/sink/protocol.py; CliObject.last_sdiag cache + collect_sdiag generator added in monitoring/cli/slurm_monitor.py. A second tuple in data_collection_tasks schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). last_sdiag is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when last_sdiag is None. collect_sdiag documents the cache-ordering prerequisite (must run after collect_slurm in the same cycle); the loop site at monitoring/utils/monitor.py carries a matching comment about shared-cache patterns being sensitive to task ordering.
meta_utils/scribe.py::ScribeErrorWithAcks surfaces raw reject codes. Before: "Failed to write N/N messages". After: "Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']". Three lines, no behavior change on success; applies to every graph_api LOG path.

Ride-along (exporters/otel.py): OTLP handler moved from the "gcm" parent logger to a child "_gcm_otel_emit" with propagate=False, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as body="Missing field_name=...". Standalone hygiene fix.

Docs: website/docs/GCM_Monitoring/collectors/slurm_monitor.md documents the dual-publish (METRIC->ODS + LOG->scribe via DataIdentifier.SDIAG), the last_sdiag single-scrape caching pattern, and the ordering prerequisite. exporters/graph_api.md gains a per-DataIdentifier routing table covering all six *_scribe_category kwargs plus the unset-category assertion behavior, and an "Adding a new scribe category (Meta-internal)" section linking the Scribe wiki for DMV/PrivacyLib/ScribeShell unblock steps.

Reviewed By: yonglimeta

Differential Revision: D95971502

github-actions · 2026-05-07T18:43:24Z

CI Commands

The following CI workflows run automatically on every push and pull request:

Workflow	What it runs
GPU Cluster Monitoring Python CI	lint, tests, typecheck, format, deb build, pyoxidizer builds
Go packages CI	shelper tests, format, lint

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

Command	Description	Requires approval?
`/metaci tests`	Runs Meta internal integration tests (pytest)	Yes — a maintainer must trigger the command and approve the deployment request
`/metaci integration tests`	Same as above (alias)	Yes

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

meta-codesync · 2026-05-07T18:43:24Z

@Yash0270 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95971502.

luccabb

add some docs: https://facebookresearch.github.io/gcm/docs/GCM_Monitoring/collectors/

if you could add e2e tests too, mainly to verify that the final scuba table is reasonable

luccabb · 2026-05-12T21:58:10Z

+        result = Sdiag(**data)
+
+        if reset:
+            self._reset_sdiag_counters()

-        return Sdiag(**data)
+        return result


Suggested change

result = Sdiag(**data)

if reset:

self._reset_sdiag_counters()

return Sdiag(**data)

return result

if reset:

self._reset_sdiag_counters()

return Sdiag(**data)

luccabb · 2026-05-12T22:03:52Z

            "DBD Agent queue size:": "dbd_agent_queue_size",
        }
-        data: dict[str, Optional[int]] = {
+        data: dict[str, Any] = {


can we be more specific than Any?

luccabb · 2026-05-12T22:06:15Z

+    # fields so any sdiag stat we have not broken into its own column is
+    # still queryable from Scuba via JSON_EXTRACT and so future Slurm
+    # additions land in the table without a schema change.


Suggested change

# fields so any sdiag stat we have not broken into its own column is

# still queryable from Scuba via JSON_EXTRACT and so future Slurm

# additions land in the table without a schema change.

# fields so any sdiag stat we have not broken into its own column stays

# queryable downstream via JSON path expressions, and new fields land

# without requiring a schema change.

luccabb · 2026-05-12T22:10:37Z

+    # Don't reset here: the existing slurm_monitor ODS collector also calls
+    # sdiag_structured() and owns the reset cadence. Resetting in both
+    # collectors would double-reset the counters and lose data.


maybe expose both as parameters and let upstream callers decide what/when to reset

Yash0270 · 2026-05-18T00:12:10Z

@luccabb — pushed a structural rewrite (force-pushed after rebasing on main, so the previous inline comment anchors are gone; summary below covers each).

Architecture change. Dropped the standalone gcm sdiag CLI + SdiagPayload dataclass + meta_utils/scuba helper. Instead, the existing slurm_monitor sub-command now uses a new ods_scuba_tee exporter that wraps graph_api (ODS, METRIC) and otel (Scuba fair_sdiag, LOG). Single sdiag --all --json scrape per 5-min cycle, two writes. Avoids the read/reset race that would have existed between two independent collectors hitting the same in-controller sdiag counters, and removes the need for a reset parameter on sdiag_structured().

Comments addressed:

Old comment	Status in this revision
`client.py` `data: dict[str, Any]`	Narrowed to `dict[str, Optional[int]]` (legacy text-parse path only emits int/None).
`client.py` `reset` parameter suggestion	Architecture moot — `slurm_monitor` is the only resetter, no second collector.
`sdiag_payload.py` docstring wording	Obsolete — `SdiagPayload` removed; replaced by `SdiagLog` (no payload-specific docstring).
`cli/sdiag.py` reset on CLI	Obsolete — sub-command removed.
Top-level: add docs + e2e tests	Added `website/docs/GCM_Monitoring/exporters/ods_scuba_tee.md` (108 lines) + dual-publish note in `slurm_monitor.md`. Added `test_ods_scuba_tee.py` with 10 unit tests (routing, projection, filtering, failure isolation, shutdown, logger-isolation regression). E2E validated by sideloading a standalone PAR into the gcm-slurm-monitor container on a FAIR HPC AWS cluster — confirmed rows land in Scuba `fair_sdiag` with cluster identity preserved.

Bonus fix while in otel.py: the handler was attached to the "gcm" parent logger, so every gcm.* log record (e.g. dataclass_utils.warning("Missing field_name=...")) propagated up and leaked into whatever Scuba dataset the otel sink was targeting. Moved emit to a dedicated leaf logger _gcm_otel_emit with propagate=False. Net positive for all otel-using sub-commands (scontrol, sacct_publish, etc.) — likely also cleans up similar null-row noise in fair_sacct, fair_scontrol_data.

Yash0270 · 2026-05-18T03:20:15Z

v12: fixed a bug I introduced in the previous force-push.

With --heterogeneous-cluster-v1, slurm_monitor yields one SLURMLog per (cluster, partition), but all sdiag fields are identical across those rows (sdiag is a cluster-wide slurmctld stat, not partition-scoped). The previous version of the tee sink wrote one Scuba row per partition with identical sdiag values — 4× storage and misleading derived_cluster tags suggesting per-partition data.

Fix in ods_scuba_tee.py:

Dedupe by cluster before projecting to SdiagLog
Set derived_cluster=cluster (not the partition-suffixed value)
ODS path unchanged — graph_api still receives all N partition rows for per-partition entity suffixes

Tests added:

test_write_dedupes_partition_split_to_one_row_per_cluster
test_write_dedupes_multiple_clusters_independently

E2E re-verified on fair-aws-rc-1: one --once invocation now lands exactly 1 row in fair_sdiag (was 4), cluster=fair-aws-rc-1, derived_cluster=fair-aws-rc-1.

Yash0270 · 2026-05-18T04:07:04Z

v13: fixes OSS CI that v12 broke.

ufmt: ods_scuba_tee.py was not ufmt-clean (missing blank line after module docstring + import order).
mypy:
- client.py was narrowed per your dict[str, Any] comment, but the narrow type confused mypy on the Sdiag(**data) splat (per-key check against Sdiag's mixed int|str|None fields). Kept the narrow type for clarity, added cast(dict[str, Any], data) at the splat site with a comment explaining why.
- test_slurm.py: json.loads(result.rpcs_by_message_type_json) needed an explicit is not None check (the field is Optional[str]).
- test_ods_scuba_tee.py: refactored the test fixture so MagicMock instances are exposed via a typed _Fixture dataclass instead of accessed through sink._graph_api (which is typed as the Sink protocol).

All 3 OSS checks pass locally: flake8 ✓ | ufmt ✓ | mypy ✓ — All OSS checks passed!
15 pytest tests pass (was 12; +2 partition-dedup tests, +1 logger-isolation regression).

luccabb · 2026-05-18T18:57:47Z

+        # only explicit `self.otel_logger.info("", extra=...)` emits in the
+        # otel pipeline.
+        self.otel_logger = logging.getLogger("_gcm_otel_emit")
+        self.otel_logger.propagate = False


polluting the target Scuba table with
# null-data log rows.

the row gets published to scuba but with empty fields? do you have a sample?

Here: https://fburl.com/scuba/fair_sacct_running/eqt1fnfy

Yes — concrete sample. Two tables affected, the second one is the dramatic case.

fair_sdiag sample row (one per dataclass_utils.warning("Missing field_name=...") per slurm_monitor cycle):

time=2026-05-17 16:14:41 PDT cluster=null derived_cluster=null schedule_cycle_last=null bf_active=null ...all 70 cols null... code.file.path=gcm/monitoring/dataclass_utils.py code.function.name=instantiate_dataclass

~80 such null rows in last 24h. Scuba: https://fburl.com/scuba/fair_sdiag/oj8u9g16

fair_sacct_running is the dramatic case: 637,929 of 713,988 rows over the last 24h (89.3%) are null-data leakage. Most of those (574,000) attribute to gcm/exporters/otel.py:_write_log itself — the handler was recursing on log records emitted by its own emit path (_write_log logs at gcm.exporters.otel → propagates up to "gcm" parent → re-fires the handler → emits another row → which logs again → snowball). sacct_running runs hourly with large batches so each invocation generates 26-38k junk rows.

Full counts after running this query across all otel-targeted Scuba tables:

Dataset Total 24h Null-cluster leaked %

fair_sacct_running 713,988 637,929 89.3%

fair_sdiag 200 80 40%

fair_sacct 10,360,194 0 0%

fair_scontrol_data 74 0 0%

fair_scontrol_config 68 0 0%

fair_sacctmgr_qos 639 0 0%

fair_sacctmgr_user 517,754 0 0%

Only the first two trigger gcm-level logging during their otel write path; the other four don't, so the bug never fires on them. The fix (logger = "_gcm_otel_emit", propagate=False) removes the entire class of self-amplification.

Here: https://fburl.com/scuba/fair_sacct_running/eqt1fnfy

that query is Cluster vs cluster naming diff. there's a few nulls on both but with other fields populated, I'm not confident this fixes the root of this issue: https://fburl.com/scuba/fair_sacct_running/28owvuw2

fair_sdiag

there's a few rows here where cluster=null and Body gets populated, I think your change fixes this: https://fburl.com/scuba/fair_sdiag/q9a0ou8f

add some tests:

def test_unrelated_gcm_logger_does_not_emit_to_otel(): sink = OtelSink(...) other = logging.getLogger("gcm.dataclass_utils") other.warning("Missing field_name=foo") assert sink._scuba_writes == [] def test_gcm_logger_emits_to_otel(): sink = OtelSink(...) sink.write(<data> ...) assert sink._scuba_writes == [<data> ...]

…ch#141) Summary: Routes Slurm `sdiag` scrapes into a Scuba-backed perfpipe category (`perfpipe_fair_sdiag_v2`, namespace `ai_infra`, oncall `fair_efficiency`) via the existing `graph_api` exporter. Same shape as every other `perfpipe_fair_*` category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter. Three code changes: 1. **`Sdiag` dataclass + `SlurmCliClient.sdiag_structured()` extended** to capture the ~50 fields exposed by `sdiag --all --json` on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (`req_time`, `gettimeofday_latency`, `job_states_ts`, `parts_packed`), RPC blobs (`rpcs_by_message_type_json`, `rpcs_by_user_json`) as flat JSON strings. `bf_active` is `Optional[bool]` so the JSON serializer emits real `true`/`false`. 2. **`sdiag_scribe_category` kwarg threaded through `GraphAPI.__init__`** (alongside the existing `node_scribe_category`, `job_scribe_category`, `statvfs_scribe_category`, `pure_scribe_category`). `DataIdentifier.SDIAG` is registered in `monitoring/sink/protocol.py`; `CliObject.last_sdiag` cache + `collect_sdiag` generator added in `monitoring/cli/slurm_monitor.py`. A second tuple in `data_collection_tasks` schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). `last_sdiag` is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when `last_sdiag is None`. 3. **`meta_utils/scribe.py::ScribeErrorWithAcks` surfaces raw reject codes.** Before: `"Failed to write N/N messages"`. After: `"Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']"`. Three lines, no behavior change on success; applies to every graph_api LOG path. Ride-along (`exporters/otel.py`): OTLP handler moved from the `"gcm"` parent logger to a child `"_gcm_otel_emit"` with `propagate=False`, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as `body="Missing field_name=..."`. Standalone hygiene fix. Docs: `website/docs/GCM_Monitoring/collectors/slurm_monitor.md` documents the dual-publish (METRIC->ODS + LOG->scribe via `DataIdentifier.SDIAG`) and the `last_sdiag` single-scrape caching pattern. `exporters/graph_api.md` gains a per-`DataIdentifier` routing table covering all six `*_scribe_category` kwargs plus the unset-category assertion behavior. Reviewed By: yonglimeta Differential Revision: D95971502

luccabb · 2026-05-21T08:02:10Z

+@dataclass(kw_only=True)
+class SdiagLog(Sdiag, DerivedCluster):
+    cluster: str


can we add DerivedCluster and cluster to the base Sdiag class?

Done in v25.

Added cluster: Optional[str] = None to the base Sdiag dataclass (populated at the collect-sdiag site, not by the JSON parser — keeps the parser path untouched).

Dropped derived_cluster and deleted SdiagLog entirely. Rationale: sdiag is a cluster-wide slurmctld stat, so tagging it with derived_cluster would generate misleading duplicate rows for the same scrape (fair-sc, fair-sc.h200, fair-sc.h100 would all carry identical sdiag values). One row per cluster per cycle is the correct shape, even with --heterogeneous-cluster-v1. collect_sdiag now yields Sdiag directly via replace(sdiag, cluster=cluster).

luccabb · 2026-05-21T08:13:33Z

+def collect_sdiag(
+    obj: CliObject,
+    cluster: str,
+    logger: logging.Logger,
+) -> Generator[SdiagLog, None, None]:
+    """Project the most recent sdiag scrape (cached on the CliObject by
+    `collect_slurm`) to an SdiagLog row for the Scuba `perfpipe_fair_sdiag_v2` dataset.
+
+    Single sdiag scrape per cycle is shared with `collect_slurm` -- no
+    double-scrape, no reset-counter race. sdiag is a cluster-wide slurmctld
+    stat so we yield exactly one SdiagLog per cycle (not one per partition).
+    """
+    sdiag = obj.last_sdiag
+    if sdiag is None:
+        # collect_slurm hasn't populated the cache yet (first cycle race or
+        # collect_slurm raised). Skip rather than crash the loop -- the next
+        # cycle will recover.
+        logger.debug("collect_sdiag: no cached sdiag scrape available, skipping")
+        return
+    yield SdiagLog(cluster=cluster, derived_cluster=cluster, **asdict(sdiag))


I think just fold this into collect_slurm -> collect_slurm_and_sdiag. the current state creates an implicit requirement that collect_slurm must always run before collect_sdiag

i think another way is to write a class that does the querying

class CollectionContext: def sdiag(): ... def sinfo(): ...

then you init that class inside run_data_collection_loop at every round of collection, call its methods, and cache results.

def callable(collection_context): return collection_context.sinfo ... ... while True: collection_context = CollectionContext(...) for callable, ...: callable(collection_context, ...)

either way, this would require changes on run_data_collection_loop, which has a decent blast radius. Wouldn't be opposed to run_data_collection_loop_v2 that handles this

If we need the data asap, maybe just say explicitly in the docstring that collect_slurm must be called before. Also add a comment here that not respecting task ordering may break collection.

Took the short path in v25.

Added an explicit Prerequisite block to the collect_sdiag docstring (must run after collect_slurm in the same cycle; no-ops cleanly if last_sdiag is None), plus a shared-cache ordering comment at the top of the data_collection_tasks loop in monitoring/utils/monitor.py.

The CollectionContext + run_data_collection_loop_v2 refactor is the right long-term shape, but the blast radius (every gcm collector) is wider than this diff. Filing as a follow-up once this lands — rc1 burn-in is queued behind it and the docstring + loop comment is enough to prevent the ordering footgun for now.

Yash0270 · 2026-05-21T19:34:59Z

v25 addresses all three review comments.

Comment on sdiag_log.py:11 (cluster/derived_cluster):

Added cluster: Optional[str] = None to the base Sdiag dataclass (populated at the collect-sdiag site, not by the JSON parser — keeps the parser path untouched).

Dropped derived_cluster and deleted SdiagLog entirely. Rationale: sdiag is a cluster-wide slurmctld stat. Tagging it with derived_cluster would generate misleading duplicate rows for the same scrape (fair-sc, fair-sc.h200, fair-sc.h100 would all carry identical sdiag values). One row per cluster per cycle is the correct shape, even with --heterogeneous-cluster-v1. collect_sdiag now yields Sdiag directly via replace(sdiag, cluster=cluster).

Comments on slurm_monitor.py:356 (ordering coupling between collect_slurm and collect_sdiag):

Took the short path you outlined: explicit docstring on collect_sdiag documenting the prerequisite (must run after collect_slurm in the same cycle; no-ops cleanly if last_sdiag is None), plus a shared-cache ordering comment at the top of the data_collection_tasks loop in monitoring/utils/monitor.py.

The longer-term refactor you sketched (CollectionContext class + run_data_collection_loop_v2) is the right shape but the blast radius (every gcm collector) is wider than this diff. Filing as a follow-up once this lands — rc1 burn-in is queued behind it and the docstring + loop comment is enough to prevent the ordering footgun for now.

…ch#141) Summary: Pull Request resolved: facebookresearch#141 Routes Slurm `sdiag` scrapes into a Scuba-backed perfpipe category (`perfpipe_fair_sdiag_v2`, namespace `ai_infra`, oncall `fair_efficiency`) via the existing `graph_api` exporter. Same shape as every other `perfpipe_fair_*` category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter. Three code changes: 1. **`Sdiag` dataclass + `SlurmCliClient.sdiag_structured()` extended** to capture the ~50 fields exposed by `sdiag --all --json` on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (`req_time`, `gettimeofday_latency`, `job_states_ts`, `parts_packed`), RPC blobs (`rpcs_by_message_type_json`, `rpcs_by_user_json`) as flat JSON strings. `bf_active` is `Optional[bool]` so the JSON serializer emits real `true`/`false`. `cluster: Optional[str] = None` is also on the base dataclass (populated at the collect-sdiag site, not by the parser); `derived_cluster` is intentionally omitted because sdiag is a cluster-wide slurmctld stat and tagging it with a partition-split identifier would generate misleading duplicate rows. 2. **`sdiag_scribe_category` kwarg threaded through `GraphAPI.__init__`** (alongside the existing `node_scribe_category`, `job_scribe_category`, `statvfs_scribe_category`, `pure_scribe_category`). `DataIdentifier.SDIAG` is registered in `monitoring/sink/protocol.py`; `CliObject.last_sdiag` cache + `collect_sdiag` generator added in `monitoring/cli/slurm_monitor.py`. A second tuple in `data_collection_tasks` schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). `last_sdiag` is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when `last_sdiag is None`. `collect_sdiag` documents the cache-ordering prerequisite (must run after `collect_slurm` in the same cycle); the loop site at `monitoring/utils/monitor.py` carries a matching comment about shared-cache patterns being sensitive to task ordering. 3. **`meta_utils/scribe.py::ScribeErrorWithAcks` surfaces raw reject codes.** Before: `"Failed to write N/N messages"`. After: `"Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']"`. Three lines, no behavior change on success; applies to every graph_api LOG path. Ride-along (`exporters/otel.py`): OTLP handler moved from the `"gcm"` parent logger to a child `"_gcm_otel_emit"` with `propagate=False`, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as `body="Missing field_name=..."`. Standalone hygiene fix. Docs: `website/docs/GCM_Monitoring/collectors/slurm_monitor.md` documents the dual-publish (METRIC->ODS + LOG->scribe via `DataIdentifier.SDIAG`), the `last_sdiag` single-scrape caching pattern, and the ordering prerequisite. `exporters/graph_api.md` gains a per-`DataIdentifier` routing table covering all six `*_scribe_category` kwargs plus the unset-category assertion behavior, and an "Adding a new scribe category (Meta-internal)" section linking the Scribe wiki for DMV/PrivacyLib/ScribeShell unblock steps. Reviewed By: yonglimeta Differential Revision: D95971502

…ch#141) Summary: Routes Slurm `sdiag` scrapes into a Scuba-backed perfpipe category (`perfpipe_fair_sdiag_v2`, namespace `ai_infra`, oncall `fair_efficiency`) via the existing `graph_api` exporter. Same shape as every other `perfpipe_fair_*` category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter. Three code changes: 1. **`Sdiag` dataclass + `SlurmCliClient.sdiag_structured()` extended** to capture the ~50 fields exposed by `sdiag --all --json` on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (`req_time`, `gettimeofday_latency`, `job_states_ts`, `parts_packed`), RPC blobs (`rpcs_by_message_type_json`, `rpcs_by_user_json`) as flat JSON strings. `bf_active` is `Optional[bool]` so the JSON serializer emits real `true`/`false`. `cluster: Optional[str] = None` is also on the base dataclass (populated at the collect-sdiag site, not by the parser); `derived_cluster` is intentionally omitted because sdiag is a cluster-wide slurmctld stat and tagging it with a partition-split identifier would generate misleading duplicate rows. 2. **`sdiag_scribe_category` kwarg threaded through `GraphAPI.__init__`** (alongside the existing `node_scribe_category`, `job_scribe_category`, `statvfs_scribe_category`, `pure_scribe_category`). `DataIdentifier.SDIAG` is registered in `monitoring/sink/protocol.py`; `CliObject.last_sdiag` cache + `collect_sdiag` generator added in `monitoring/cli/slurm_monitor.py`. A second tuple in `data_collection_tasks` schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). `last_sdiag` is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when `last_sdiag is None`. `collect_sdiag` documents the cache-ordering prerequisite (must run after `collect_slurm` in the same cycle); the loop site at `monitoring/utils/monitor.py` carries a matching comment about shared-cache patterns being sensitive to task ordering. 3. **`meta_utils/scribe.py::ScribeErrorWithAcks` surfaces raw reject codes.** Before: `"Failed to write N/N messages"`. After: `"Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']"`. Three lines, no behavior change on success; applies to every graph_api LOG path. Ride-along (`exporters/otel.py`): OTLP handler moved from the `"gcm"` parent logger to a child `"_gcm_otel_emit"` with `propagate=False`, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as `body="Missing field_name=..."`. Standalone hygiene fix. Docs: `website/docs/GCM_Monitoring/collectors/slurm_monitor.md` documents the dual-publish (METRIC->ODS + LOG->scribe via `DataIdentifier.SDIAG`), the `last_sdiag` single-scrape caching pattern, and the ordering prerequisite. `exporters/graph_api.md` gains a per-`DataIdentifier` routing table covering all six `*_scribe_category` kwargs plus the unset-category assertion behavior, and an "Adding a new scribe category (Meta-internal)" section linking the Scribe wiki for DMV/PrivacyLib/ScribeShell unblock steps. Reviewed By: yonglimeta Differential Revision: D95971502

luccabb · 2026-05-22T18:30:23Z

+        # Task ordering matters when callables share an instance cache (e.g.
+        # `collect_slurm` populates `CliObject.last_sdiag`, consumed by
+        # `collect_sdiag`). Reordering tasks in `data_collection_tasks` can
+        # break collection for shared-cache consumers.


Suggested change

# Task ordering matters when callables share an instance cache (e.g.

# `collect_slurm` populates `CliObject.last_sdiag`, consumed by

# `collect_sdiag`). Reordering tasks in `data_collection_tasks` can

# break collection for shared-cache consumers.

# Task ordering matters for some collections. If you wish to update this assumption please create a new `run_data_collection_loop` function

luccabb · 2026-05-23T01:28:25Z

+            # Task B: Sdiag -> LOG -> scribe (perfpipe_fair_sdiag_v2) ->
+            # Scuba (perfpipe_fair_sdiag_v2) via graph_api._write_log. Reads cached
+            # sdiag from Task A. Mirrors the slurm_job_monitor pattern of
+            # per-task DataIdentifier-driven scribe routing.


Suggested change

# Task B: Sdiag -> LOG -> scribe (perfpipe_fair_sdiag_v2) ->

# Scuba (perfpipe_fair_sdiag_v2) via graph_api._write_log. Reads cached

# sdiag from Task A. Mirrors the slurm_job_monitor pattern of

# per-task DataIdentifier-driven scribe routing.

luccabb · 2026-05-23T01:28:37Z

+            # Task A: SLURMLog -> METRIC -> ODS via graph_api._write_metric.
+            # Populates obj.last_sdiag for Task B.


Suggested change

# Task A: SLURMLog -> METRIC -> ODS via graph_api._write_metric.

# Populates obj.last_sdiag for Task B.

luccabb · 2026-05-23T01:43:21Z

+import logging as _logging  # noqa: E402
+
+from gcm.monitoring.cli.slurm_monitor import (  # noqa: E402
+    CliObjectImpl as _CliObjectImpl,
+    collect_sdiag as _collect_sdiag,
+)
+from gcm.schemas.slurm.sdiag import Sdiag as _Sdiag  # noqa: E402


move these to the top

luccabb · 2026-05-23T01:49:51Z

 Performs comprehensive analytics on SLURM cluster state by combining data from multiple sources (`sinfo`, `sdiag`, `sacct`) and computing aggregated metrics. Provides deep insights into cluster health, resource utilization, job analytics, and user activity.

-**Data Type**: `DataType.METRIC`, **Schemas**: `SLURMLog`, `SLURMLogAccountMetrics`
+**Data Type**: `DataType.METRIC` (`SLURMLog` → ODS) + `DataType.LOG` with `DataIdentifier.SDIAG` (`Sdiag` → scribe → Scuba `perfpipe_fair_sdiag_v2`).


Suggested change

**Data Type**: `DataType.METRIC` (`SLURMLog` → ODS) + `DataType.LOG` with `DataIdentifier.SDIAG` (`Sdiag` → scribe → Scuba `perfpipe_fair_sdiag_v2`).

**Data Type**: `DataType.METRIC` + `DataType.LOG` with `DataIdentifier.SDIAG`.

…ch#141) Summary: Routes Slurm `sdiag` scrapes into a Scuba-backed perfpipe category (`perfpipe_fair_sdiag_v2`, namespace `ai_infra`, oncall `fair_efficiency`) via the existing `graph_api` exporter. Same shape as every other `perfpipe_fair_*` category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter. Three code changes: 1. **`Sdiag` dataclass + `SlurmCliClient.sdiag_structured()` extended** to capture the ~50 fields exposed by `sdiag --all --json` on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (`req_time`, `gettimeofday_latency`, `job_states_ts`, `parts_packed`), RPC blobs (`rpcs_by_message_type_json`, `rpcs_by_user_json`) as flat JSON strings. `bf_active` is `Optional[bool]` so the JSON serializer emits real `true`/`false`. `cluster: Optional[str] = None` is also on the base dataclass (populated at the collect-sdiag site, not by the parser); `derived_cluster` is intentionally omitted because sdiag is a cluster-wide slurmctld stat and tagging it with a partition-split identifier would generate misleading duplicate rows. 2. **`sdiag_scribe_category` kwarg threaded through `GraphAPI.__init__`** (alongside the existing `node_scribe_category`, `job_scribe_category`, `statvfs_scribe_category`, `pure_scribe_category`). `DataIdentifier.SDIAG` is registered in `monitoring/sink/protocol.py`; `CliObject.last_sdiag` cache + `collect_sdiag` generator added in `monitoring/cli/slurm_monitor.py`. A second tuple in `data_collection_tasks` schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). `last_sdiag` is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when `last_sdiag is None`. `collect_sdiag` documents the cache-ordering prerequisite (must run after `collect_slurm` in the same cycle); the loop site at `monitoring/utils/monitor.py` carries a matching comment about shared-cache patterns being sensitive to task ordering. 3. **`meta_utils/scribe.py::ScribeErrorWithAcks` surfaces raw reject codes.** Before: `"Failed to write N/N messages"`. After: `"Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']"`. Three lines, no behavior change on success; applies to every graph_api LOG path. Ride-along (`exporters/otel.py`): OTLP handler moved from the `"gcm"` parent logger to a child `"_gcm_otel_emit"` with `propagate=False`, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as `body="Missing field_name=..."`. Standalone hygiene fix. Docs: `website/docs/GCM_Monitoring/collectors/slurm_monitor.md` documents the dual-publish (METRIC->ODS + LOG->scribe via `DataIdentifier.SDIAG`), the `last_sdiag` single-scrape caching pattern, and the ordering prerequisite. `exporters/graph_api.md` gains a per-`DataIdentifier` routing table covering all six `*_scribe_category` kwargs plus the unset-category assertion behavior, and an "Adding a new scribe category (Meta-internal)" section linking the Scribe wiki for DMV/PrivacyLib/ScribeShell unblock steps. Reviewed By: yonglimeta Differential Revision: D95971502

Yash0270 · 2026-05-29T16:14:58Z

1 failed test seems to be because of fbcode <--> github cyclic dependency. Fbcode test failure is because of github dependency is failing. Github test failures point to fbcode diff failure.

Force pushing to resolve this. Already tried rebasing multiple times.

Yash0270 requested review from calebho, giongto35, jj10306 and luccabb as code owners May 7, 2026 18:43

meta-cla Bot added the cla signed label May 7, 2026

meta-codesync Bot added fb-exported meta-exported labels May 7, 2026

luccabb suggested changes May 8, 2026

View reviewed changes

Yash0270 force-pushed the export-D95971502 branch 2 times, most recently from 6299623 to 01d5138 Compare May 12, 2026 18:43

luccabb suggested changes May 12, 2026

View reviewed changes

Yash0270 force-pushed the export-D95971502 branch from 01d5138 to b8a82d9 Compare May 18, 2026 00:12

Yash0270 force-pushed the export-D95971502 branch from b8a82d9 to 0bba3e7 Compare May 18, 2026 03:20

Yash0270 force-pushed the export-D95971502 branch from 0bba3e7 to 9abe246 Compare May 18, 2026 04:07

meta-codesync Bot changed the title ~~Add standalone sdiag CLI command for Scuba export~~ Expose additional sdiag fields + dual-publish ODS/Scuba via ods_scuba_tee sink May 18, 2026

Yash0270 force-pushed the export-D95971502 branch 2 times, most recently from 767f1fe to 6b9f082 Compare May 18, 2026 17:21

Yash0270 requested a review from luccabb May 18, 2026 18:18

luccabb reviewed May 18, 2026

View reviewed changes

meta-codesync Bot changed the title ~~Expose additional sdiag fields + dual-publish ODS/Scuba via ods_scuba_tee sink~~ Publish sdiag to perfpipe_fair_sdiag_v2 via graph_api May 21, 2026

Yash0270 force-pushed the export-D95971502 branch from 6b9f082 to d473e23 Compare May 21, 2026 02:37

meta-codesync Bot changed the title ~~Publish sdiag to perfpipe_fair_sdiag_v2 via graph_api~~ Publish sdiag to perfpipe_fair_sdiag_v2 via graph_api (#141) May 21, 2026

Yash0270 force-pushed the export-D95971502 branch from d473e23 to ff069c9 Compare May 21, 2026 02:38

Yash0270 force-pushed the export-D95971502 branch from ff069c9 to c668cd8 Compare May 21, 2026 05:00

luccabb reviewed May 21, 2026

View reviewed changes

meta-codesync Bot changed the title ~~Publish sdiag to perfpipe_fair_sdiag_v2 via graph_api (#141)~~ Publish sdiag to perfpipe_fair_sdiag_v2 via graph_api May 21, 2026

Yash0270 force-pushed the export-D95971502 branch from c668cd8 to 8791db4 Compare May 21, 2026 19:03

Yash0270 force-pushed the export-D95971502 branch from 8791db4 to f5fcba3 Compare May 21, 2026 19:47

meta-codesync Bot changed the title ~~Publish sdiag to perfpipe_fair_sdiag_v2 via graph_api~~ Publish sdiag to perfpipe_fair_sdiag_v2 via graph_api (#141) May 21, 2026

Yash0270 force-pushed the export-D95971502 branch from f5fcba3 to 4e83a6c Compare May 21, 2026 21:29

Yash0270 force-pushed the export-D95971502 branch from 4e83a6c to 79c1a7f Compare May 21, 2026 22:29

luccabb approved these changes May 23, 2026

View reviewed changes

Yash0270 force-pushed the export-D95971502 branch from 79c1a7f to 11b2b73 Compare May 28, 2026 18:12

Yash0270 force-pushed the export-D95971502 branch 2 times, most recently from 11b2b73 to c41c8cb Compare May 28, 2026 18:44

Yash0270 force-pushed the export-D95971502 branch from c41c8cb to 7c29686 Compare May 28, 2026 20:46

Yash0270 requested a review from yonglimeta May 28, 2026 22:14

yonglimeta approved these changes May 28, 2026

View reviewed changes

Yash0270 merged commit 27b18d2 into facebookresearch:main May 29, 2026
38 of 39 checks passed

Yash0270 deleted the export-D95971502 branch May 29, 2026 16:13

Dataset	Total 24h	Null-cluster leaked	%
fair_sacct_running	713,988	637,929	89.3%
fair_sdiag	200	80	40%
fair_sacct	10,360,194	0	0%
fair_scontrol_data	74	0	0%
fair_scontrol_config	68	0	0%
fair_sacctmgr_qos	639	0	0%
fair_sacctmgr_user	517,754	0	0%

		# Task A: SLURMLog -> METRIC -> ODS via graph_api._write_metric.
		# Populates obj.last_sdiag for Task B.

	Data Type: `DataType.METRIC` (`SLURMLog` → ODS) + `DataType.LOG` with `DataIdentifier.SDIAG` (`Sdiag` → scribe → Scuba `perfpipe_fair_sdiag_v2`).
	Data Type: `DataType.METRIC` + `DataType.LOG` with `DataIdentifier.SDIAG`.

Conversation

Yash0270 commented May 7, 2026 • edited by meta-codesync Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 7, 2026

CI Commands

Uh oh!

meta-codesync Bot commented May 7, 2026

Uh oh!

luccabb left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Yash0270 commented May 18, 2026

Uh oh!

Yash0270 commented May 18, 2026

Uh oh!

Yash0270 commented May 18, 2026

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Yash0270 commented May 21, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Yash0270 commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Yash0270 commented May 7, 2026 •

edited by meta-codesync Bot

Loading