Skip to content

Publish sdiag to perfpipe_fair_sdiag_v2 via graph_api (#141)#141

Merged
Yash0270 merged 1 commit into
facebookresearch:mainfrom
Yash0270:export-D95971502
May 29, 2026
Merged

Publish sdiag to perfpipe_fair_sdiag_v2 via graph_api (#141)#141
Yash0270 merged 1 commit into
facebookresearch:mainfrom
Yash0270:export-D95971502

Conversation

@Yash0270

@Yash0270 Yash0270 commented May 7, 2026

Copy link
Copy Markdown
Contributor

Summary:

Routes Slurm sdiag scrapes into a Scuba-backed perfpipe category (perfpipe_fair_sdiag_v2, namespace ai_infra, oncall fair_efficiency) via the existing graph_api exporter. Same shape as every other perfpipe_fair_* category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter.

Three code changes:

  1. Sdiag dataclass + SlurmCliClient.sdiag_structured() extended to capture the ~50 fields exposed by sdiag --all --json on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (req_time, gettimeofday_latency, job_states_ts, parts_packed), RPC blobs (rpcs_by_message_type_json, rpcs_by_user_json) as flat JSON strings. bf_active is Optional[bool] so the JSON serializer emits real true/false. cluster: Optional[str] = None is also on the base dataclass (populated at the collect-sdiag site, not by the parser); derived_cluster is intentionally omitted because sdiag is a cluster-wide slurmctld stat and tagging it with a partition-split identifier would generate misleading duplicate rows.

  2. sdiag_scribe_category kwarg threaded through GraphAPI.__init__ (alongside the existing node_scribe_category, job_scribe_category, statvfs_scribe_category, pure_scribe_category). DataIdentifier.SDIAG is registered in monitoring/sink/protocol.py; CliObject.last_sdiag cache + collect_sdiag generator added in monitoring/cli/slurm_monitor.py. A second tuple in data_collection_tasks schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). last_sdiag is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when last_sdiag is None. collect_sdiag documents the cache-ordering prerequisite (must run after collect_slurm in the same cycle); the loop site at monitoring/utils/monitor.py carries a matching comment about shared-cache patterns being sensitive to task ordering.

  3. meta_utils/scribe.py::ScribeErrorWithAcks surfaces raw reject codes. Before: "Failed to write N/N messages". After: "Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']". Three lines, no behavior change on success; applies to every graph_api LOG path.

Ride-along (exporters/otel.py): OTLP handler moved from the "gcm" parent logger to a child "_gcm_otel_emit" with propagate=False, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as body="Missing field_name=...". Standalone hygiene fix.

Docs: website/docs/GCM_Monitoring/collectors/slurm_monitor.md documents the dual-publish (METRIC->ODS + LOG->scribe via DataIdentifier.SDIAG), the last_sdiag single-scrape caching pattern, and the ordering prerequisite. exporters/graph_api.md gains a per-DataIdentifier routing table covering all six *_scribe_category kwargs plus the unset-category assertion behavior, and an "Adding a new scribe category (Meta-internal)" section linking the Scribe wiki for DMV/PrivacyLib/ScribeShell unblock steps.

Reviewed By: yonglimeta

Differential Revision: D95971502

@github-actions

github-actions Bot commented May 7, 2026

Copy link
Copy Markdown

CI Commands

The following CI workflows run automatically on every push and pull request:

Workflow What it runs
GPU Cluster Monitoring Python CI lint, tests, typecheck, format, deb build, pyoxidizer builds
Go packages CI shelper tests, format, lint

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

Command Description Requires approval?
/metaci tests Runs Meta internal integration tests (pytest) Yes — a maintainer must trigger the command and approve the deployment request
/metaci integration tests Same as above (alias) Yes

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

@meta-codesync

meta-codesync Bot commented May 7, 2026

Copy link
Copy Markdown

@Yash0270 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95971502.

@luccabb luccabb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add some docs: https://facebookresearch.github.io/gcm/docs/GCM_Monitoring/collectors/

if you could add e2e tests too, mainly to verify that the final scuba table is reasonable

@Yash0270 Yash0270 force-pushed the export-D95971502 branch 2 times, most recently from 6299623 to 01d5138 Compare May 12, 2026 18:43
Comment thread gcm/monitoring/slurm/client.py Outdated
Comment on lines +398 to +403
result = Sdiag(**data)

if reset:
self._reset_sdiag_counters()

return Sdiag(**data)
return result

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
result = Sdiag(**data)
if reset:
self._reset_sdiag_counters()
return Sdiag(**data)
return result
if reset:
self._reset_sdiag_counters()
return Sdiag(**data)

Comment thread gcm/monitoring/slurm/client.py Outdated
"DBD Agent queue size:": "dbd_agent_queue_size",
}
data: dict[str, Optional[int]] = {
data: dict[str, Any] = {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we be more specific than Any?

Comment thread gcm/schemas/slurm/sdiag_payload.py Outdated
Comment on lines +17 to +19
# fields so any sdiag stat we have not broken into its own column is
# still queryable from Scuba via JSON_EXTRACT and so future Slurm
# additions land in the table without a schema change.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# fields so any sdiag stat we have not broken into its own column is
# still queryable from Scuba via JSON_EXTRACT and so future Slurm
# additions land in the table without a schema change.
# fields so any sdiag stat we have not broken into its own column stays
# queryable downstream via JSON path expressions, and new fields land
# without requiring a schema change.

Comment thread gcm/monitoring/cli/sdiag.py Outdated
Comment on lines +55 to +57
# Don't reset here: the existing slurm_monitor ODS collector also calls
# sdiag_structured() and owns the reset cadence. Resetting in both
# collectors would double-reset the counters and lose data.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe expose both as parameters and let upstream callers decide what/when to reset

@Yash0270

Copy link
Copy Markdown
Contributor Author

@luccabb — pushed a structural rewrite (force-pushed after rebasing on main, so the previous inline comment anchors are gone; summary below covers each).

Architecture change. Dropped the standalone gcm sdiag CLI + SdiagPayload dataclass + meta_utils/scuba helper. Instead, the existing slurm_monitor sub-command now uses a new ods_scuba_tee exporter that wraps graph_api (ODS, METRIC) and otel (Scuba fair_sdiag, LOG). Single sdiag --all --json scrape per 5-min cycle, two writes. Avoids the read/reset race that would have existed between two independent collectors hitting the same in-controller sdiag counters, and removes the need for a reset parameter on sdiag_structured().

Comments addressed:

Old comment Status in this revision
client.py data: dict[str, Any] Narrowed to dict[str, Optional[int]] (legacy text-parse path only emits int/None).
client.py reset parameter suggestion Architecture moot — slurm_monitor is the only resetter, no second collector.
sdiag_payload.py docstring wording Obsolete — SdiagPayload removed; replaced by SdiagLog (no payload-specific docstring).
cli/sdiag.py reset on CLI Obsolete — sub-command removed.
Top-level: add docs + e2e tests Added website/docs/GCM_Monitoring/exporters/ods_scuba_tee.md (108 lines) + dual-publish note in slurm_monitor.md. Added test_ods_scuba_tee.py with 10 unit tests (routing, projection, filtering, failure isolation, shutdown, logger-isolation regression). E2E validated by sideloading a standalone PAR into the gcm-slurm-monitor container on a FAIR HPC AWS cluster — confirmed rows land in Scuba fair_sdiag with cluster identity preserved.

Bonus fix while in otel.py: the handler was attached to the "gcm" parent logger, so every gcm.* log record (e.g. dataclass_utils.warning("Missing field_name=...")) propagated up and leaked into whatever Scuba dataset the otel sink was targeting. Moved emit to a dedicated leaf logger _gcm_otel_emit with propagate=False. Net positive for all otel-using sub-commands (scontrol, sacct_publish, etc.) — likely also cleans up similar null-row noise in fair_sacct, fair_scontrol_data.

@Yash0270 Yash0270 force-pushed the export-D95971502 branch from 01d5138 to b8a82d9 Compare May 18, 2026 00:12
@Yash0270

Copy link
Copy Markdown
Contributor Author

v12: fixed a bug I introduced in the previous force-push.

With --heterogeneous-cluster-v1, slurm_monitor yields one SLURMLog per (cluster, partition), but all sdiag fields are identical across those rows (sdiag is a cluster-wide slurmctld stat, not partition-scoped). The previous version of the tee sink wrote one Scuba row per partition with identical sdiag values — 4× storage and misleading derived_cluster tags suggesting per-partition data.

Fix in ods_scuba_tee.py:

  • Dedupe by cluster before projecting to SdiagLog
  • Set derived_cluster=cluster (not the partition-suffixed value)
  • ODS path unchanged — graph_api still receives all N partition rows for per-partition entity suffixes

Tests added:

  • test_write_dedupes_partition_split_to_one_row_per_cluster
  • test_write_dedupes_multiple_clusters_independently

E2E re-verified on fair-aws-rc-1: one --once invocation now lands exactly 1 row in fair_sdiag (was 4), cluster=fair-aws-rc-1, derived_cluster=fair-aws-rc-1.

@Yash0270 Yash0270 force-pushed the export-D95971502 branch from b8a82d9 to 0bba3e7 Compare May 18, 2026 03:20
@Yash0270

Copy link
Copy Markdown
Contributor Author

v13: fixes OSS CI that v12 broke.

  • ufmt: ods_scuba_tee.py was not ufmt-clean (missing blank line after module docstring + import order).
  • mypy:
    • client.py was narrowed per your dict[str, Any] comment, but the narrow type confused mypy on the Sdiag(**data) splat (per-key check against Sdiag's mixed int|str|None fields). Kept the narrow type for clarity, added cast(dict[str, Any], data) at the splat site with a comment explaining why.
    • test_slurm.py: json.loads(result.rpcs_by_message_type_json) needed an explicit is not None check (the field is Optional[str]).
    • test_ods_scuba_tee.py: refactored the test fixture so MagicMock instances are exposed via a typed _Fixture dataclass instead of accessed through sink._graph_api (which is typed as the Sink protocol).

All 3 OSS checks pass locally: flake8 ✓ | ufmt ✓ | mypy ✓ — All OSS checks passed!
15 pytest tests pass (was 12; +2 partition-dedup tests, +1 logger-isolation regression).

@Yash0270 Yash0270 force-pushed the export-D95971502 branch from 0bba3e7 to 9abe246 Compare May 18, 2026 04:07
@meta-codesync meta-codesync Bot changed the title Add standalone sdiag CLI command for Scuba export Expose additional sdiag fields + dual-publish ODS/Scuba via ods_scuba_tee sink May 18, 2026
@Yash0270 Yash0270 force-pushed the export-D95971502 branch 2 times, most recently from 767f1fe to 6b9f082 Compare May 18, 2026 17:21
@Yash0270 Yash0270 requested a review from luccabb May 18, 2026 18:18
Comment thread website/docs/GCM_Monitoring/collectors/slurm_monitor.md
Comment thread gcm/exporters/otel.py
# only explicit `self.otel_logger.info("", extra=...)` emits in the
# otel pipeline.
self.otel_logger = logging.getLogger("_gcm_otel_emit")
self.otel_logger.propagate = False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

polluting the target Scuba table with
# null-data log rows.

the row gets published to scuba but with empty fields? do you have a sample?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes — concrete sample. Two tables affected, the second one is the dramatic case.

fair_sdiag sample row (one per dataclass_utils.warning("Missing field_name=...") per slurm_monitor cycle):

time=2026-05-17 16:14:41 PDT  cluster=null  derived_cluster=null
schedule_cycle_last=null  bf_active=null  ...all 70 cols null...
code.file.path=gcm/monitoring/dataclass_utils.py
code.function.name=instantiate_dataclass

~80 such null rows in last 24h. Scuba: https://fburl.com/scuba/fair_sdiag/oj8u9g16

fair_sacct_running is the dramatic case: 637,929 of 713,988 rows over the last 24h (89.3%) are null-data leakage. Most of those (574,000) attribute to gcm/exporters/otel.py:_write_log itself — the handler was recursing on log records emitted by its own emit path (_write_log logs at gcm.exporters.otel → propagates up to "gcm" parent → re-fires the handler → emits another row → which logs again → snowball). sacct_running runs hourly with large batches so each invocation generates 26-38k junk rows.

Full counts after running this query across all otel-targeted Scuba tables:

Dataset Total 24h Null-cluster leaked %
fair_sacct_running 713,988 637,929 89.3%
fair_sdiag 200 80 40%
fair_sacct 10,360,194 0 0%
fair_scontrol_data 74 0 0%
fair_scontrol_config 68 0 0%
fair_sacctmgr_qos 639 0 0%
fair_sacctmgr_user 517,754 0 0%

Only the first two trigger gcm-level logging during their otel write path; the other four don't, so the bug never fires on them. The fix (logger = "_gcm_otel_emit", propagate=False) removes the entire class of self-amplification.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here: https://fburl.com/scuba/fair_sacct_running/eqt1fnfy

that query is Cluster vs cluster naming diff. there's a few nulls on both but with other fields populated, I'm not confident this fixes the root of this issue: https://fburl.com/scuba/fair_sacct_running/28owvuw2

fair_sdiag

there's a few rows here where cluster=null and Body gets populated, I think your change fixes this: https://fburl.com/scuba/fair_sdiag/q9a0ou8f

add some tests:

def test_unrelated_gcm_logger_does_not_emit_to_otel():
    sink = OtelSink(...)
    other = logging.getLogger("gcm.dataclass_utils")
    other.warning("Missing field_name=foo")
    assert sink._scuba_writes == []
    
def test_gcm_logger_emits_to_otel():
    sink = OtelSink(...)
    sink.write(<data> ...)
    assert sink._scuba_writes == [<data> ...]

Comment thread website/docs/GCM_Monitoring/exporters/ods_scuba_tee.md Outdated
@meta-codesync meta-codesync Bot changed the title Expose additional sdiag fields + dual-publish ODS/Scuba via ods_scuba_tee sink Publish sdiag to perfpipe_fair_sdiag_v2 via graph_api May 21, 2026
@Yash0270 Yash0270 force-pushed the export-D95971502 branch from 6b9f082 to d473e23 Compare May 21, 2026 02:37
@meta-codesync meta-codesync Bot changed the title Publish sdiag to perfpipe_fair_sdiag_v2 via graph_api Publish sdiag to perfpipe_fair_sdiag_v2 via graph_api (#141) May 21, 2026
Yash0270 added a commit to Yash0270/gcm that referenced this pull request May 21, 2026
…ch#141)

Summary:

Routes Slurm `sdiag` scrapes into a Scuba-backed perfpipe category (`perfpipe_fair_sdiag_v2`, namespace `ai_infra`, oncall `fair_efficiency`) via the existing `graph_api` exporter. Same shape as every other `perfpipe_fair_*` category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter.

Three code changes:

1. **`Sdiag` dataclass + `SlurmCliClient.sdiag_structured()` extended** to capture the ~50 fields exposed by `sdiag --all --json` on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (`req_time`, `gettimeofday_latency`, `job_states_ts`, `parts_packed`), RPC blobs (`rpcs_by_message_type_json`, `rpcs_by_user_json`) as flat JSON strings. `bf_active` is `Optional[bool]` so the JSON serializer emits real `true`/`false`.

2. **`sdiag_scribe_category` kwarg threaded through `GraphAPI.__init__`** (alongside the existing `node_scribe_category`, `job_scribe_category`, `statvfs_scribe_category`, `pure_scribe_category`). `DataIdentifier.SDIAG` is registered in `monitoring/sink/protocol.py`; `CliObject.last_sdiag` cache + `collect_sdiag` generator added in `monitoring/cli/slurm_monitor.py`. A second tuple in `data_collection_tasks` schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). `last_sdiag` is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when `last_sdiag is None`.

3. **`meta_utils/scribe.py::ScribeErrorWithAcks` surfaces raw reject codes.** Before: `"Failed to write N/N messages"`. After: `"Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']"`. Three lines, no behavior change on success; applies to every graph_api LOG path.

Ride-along (`exporters/otel.py`): OTLP handler moved from the `"gcm"` parent logger to a child `"_gcm_otel_emit"` with `propagate=False`, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as `body="Missing field_name=..."`. Standalone hygiene fix.

Docs: `website/docs/GCM_Monitoring/collectors/slurm_monitor.md` documents the dual-publish (METRIC->ODS + LOG->scribe via `DataIdentifier.SDIAG`) and the `last_sdiag` single-scrape caching pattern. `exporters/graph_api.md` gains a per-`DataIdentifier` routing table covering all six `*_scribe_category` kwargs plus the unset-category assertion behavior.

Reviewed By: yonglimeta

Differential Revision: D95971502
@Yash0270 Yash0270 force-pushed the export-D95971502 branch from d473e23 to ff069c9 Compare May 21, 2026 02:38
Yash0270 added a commit to Yash0270/gcm that referenced this pull request May 21, 2026
…ch#141)

Summary:

Routes Slurm `sdiag` scrapes into a Scuba-backed perfpipe category (`perfpipe_fair_sdiag_v2`, namespace `ai_infra`, oncall `fair_efficiency`) via the existing `graph_api` exporter. Same shape as every other `perfpipe_fair_*` category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter.

Three code changes:

1. **`Sdiag` dataclass + `SlurmCliClient.sdiag_structured()` extended** to capture the ~50 fields exposed by `sdiag --all --json` on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (`req_time`, `gettimeofday_latency`, `job_states_ts`, `parts_packed`), RPC blobs (`rpcs_by_message_type_json`, `rpcs_by_user_json`) as flat JSON strings. `bf_active` is `Optional[bool]` so the JSON serializer emits real `true`/`false`.

2. **`sdiag_scribe_category` kwarg threaded through `GraphAPI.__init__`** (alongside the existing `node_scribe_category`, `job_scribe_category`, `statvfs_scribe_category`, `pure_scribe_category`). `DataIdentifier.SDIAG` is registered in `monitoring/sink/protocol.py`; `CliObject.last_sdiag` cache + `collect_sdiag` generator added in `monitoring/cli/slurm_monitor.py`. A second tuple in `data_collection_tasks` schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). `last_sdiag` is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when `last_sdiag is None`.

3. **`meta_utils/scribe.py::ScribeErrorWithAcks` surfaces raw reject codes.** Before: `"Failed to write N/N messages"`. After: `"Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']"`. Three lines, no behavior change on success; applies to every graph_api LOG path.

Ride-along (`exporters/otel.py`): OTLP handler moved from the `"gcm"` parent logger to a child `"_gcm_otel_emit"` with `propagate=False`, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as `body="Missing field_name=..."`. Standalone hygiene fix.

Docs: `website/docs/GCM_Monitoring/collectors/slurm_monitor.md` documents the dual-publish (METRIC->ODS + LOG->scribe via `DataIdentifier.SDIAG`) and the `last_sdiag` single-scrape caching pattern. `exporters/graph_api.md` gains a per-`DataIdentifier` routing table covering all six `*_scribe_category` kwargs plus the unset-category assertion behavior.

Reviewed By: yonglimeta

Differential Revision: D95971502
@Yash0270 Yash0270 force-pushed the export-D95971502 branch from ff069c9 to c668cd8 Compare May 21, 2026 05:00
Comment thread gcm/schemas/slurm/sdiag_log.py Outdated
Comment on lines +9 to +11
@dataclass(kw_only=True)
class SdiagLog(Sdiag, DerivedCluster):
cluster: str

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add DerivedCluster and cluster to the base Sdiag class?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in v25.

Added cluster: Optional[str] = None to the base Sdiag dataclass (populated at the collect-sdiag site, not by the JSON parser — keeps the parser path untouched).

Dropped derived_cluster and deleted SdiagLog entirely. Rationale: sdiag is a cluster-wide slurmctld stat, so tagging it with derived_cluster would generate misleading duplicate rows for the same scrape (fair-sc, fair-sc.h200, fair-sc.h100 would all carry identical sdiag values). One row per cluster per cycle is the correct shape, even with --heterogeneous-cluster-v1. collect_sdiag now yields Sdiag directly via replace(sdiag, cluster=cluster).

Comment thread gcm/monitoring/cli/slurm_monitor.py Outdated
Comment on lines +337 to +356
def collect_sdiag(
obj: CliObject,
cluster: str,
logger: logging.Logger,
) -> Generator[SdiagLog, None, None]:
"""Project the most recent sdiag scrape (cached on the CliObject by
`collect_slurm`) to an SdiagLog row for the Scuba `perfpipe_fair_sdiag_v2` dataset.

Single sdiag scrape per cycle is shared with `collect_slurm` -- no
double-scrape, no reset-counter race. sdiag is a cluster-wide slurmctld
stat so we yield exactly one SdiagLog per cycle (not one per partition).
"""
sdiag = obj.last_sdiag
if sdiag is None:
# collect_slurm hasn't populated the cache yet (first cycle race or
# collect_slurm raised). Skip rather than crash the loop -- the next
# cycle will recover.
logger.debug("collect_sdiag: no cached sdiag scrape available, skipping")
return
yield SdiagLog(cluster=cluster, derived_cluster=cluster, **asdict(sdiag))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think just fold this into collect_slurm -> collect_slurm_and_sdiag. the current state creates an implicit requirement that collect_slurm must always run before collect_sdiag

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think another way is to write a class that does the querying

class CollectionContext:

    def sdiag(): ...

    def sinfo(): ...

then you init that class inside run_data_collection_loop at every round of collection, call its methods, and cache results.

def callable(collection_context):
    return collection_context.sinfo

...

...
while True:
    collection_context = CollectionContext(...)
    for callable, ...:
        callable(collection_context, ...)

either way, this would require changes on run_data_collection_loop, which has a decent blast radius. Wouldn't be opposed to run_data_collection_loop_v2 that handles this

If we need the data asap, maybe just say explicitly in the docstring that collect_slurm must be called before. Also add a comment here that not respecting task ordering may break collection.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took the short path in v25.

Added an explicit Prerequisite block to the collect_sdiag docstring (must run after collect_slurm in the same cycle; no-ops cleanly if last_sdiag is None), plus a shared-cache ordering comment at the top of the data_collection_tasks loop in monitoring/utils/monitor.py.

The CollectionContext + run_data_collection_loop_v2 refactor is the right long-term shape, but the blast radius (every gcm collector) is wider than this diff. Filing as a follow-up once this lands — rc1 burn-in is queued behind it and the docstring + loop comment is enough to prevent the ordering footgun for now.

@meta-codesync meta-codesync Bot changed the title Publish sdiag to perfpipe_fair_sdiag_v2 via graph_api (#141) Publish sdiag to perfpipe_fair_sdiag_v2 via graph_api May 21, 2026
@Yash0270 Yash0270 force-pushed the export-D95971502 branch from c668cd8 to 8791db4 Compare May 21, 2026 19:03
@Yash0270

Copy link
Copy Markdown
Contributor Author

v25 addresses all three review comments.

Comment on sdiag_log.py:11 (cluster/derived_cluster):

Added cluster: Optional[str] = None to the base Sdiag dataclass (populated at the collect-sdiag site, not by the JSON parser — keeps the parser path untouched).

Dropped derived_cluster and deleted SdiagLog entirely. Rationale: sdiag is a cluster-wide slurmctld stat. Tagging it with derived_cluster would generate misleading duplicate rows for the same scrape (fair-sc, fair-sc.h200, fair-sc.h100 would all carry identical sdiag values). One row per cluster per cycle is the correct shape, even with --heterogeneous-cluster-v1. collect_sdiag now yields Sdiag directly via replace(sdiag, cluster=cluster).

Comments on slurm_monitor.py:356 (ordering coupling between collect_slurm and collect_sdiag):

Took the short path you outlined: explicit docstring on collect_sdiag documenting the prerequisite (must run after collect_slurm in the same cycle; no-ops cleanly if last_sdiag is None), plus a shared-cache ordering comment at the top of the data_collection_tasks loop in monitoring/utils/monitor.py.

The longer-term refactor you sketched (CollectionContext class + run_data_collection_loop_v2) is the right shape but the blast radius (every gcm collector) is wider than this diff. Filing as a follow-up once this lands — rc1 burn-in is queued behind it and the docstring + loop comment is enough to prevent the ordering footgun for now.

Yash0270 added a commit to Yash0270/gcm that referenced this pull request May 21, 2026
…ch#141)

Summary:
Pull Request resolved: facebookresearch#141

Routes Slurm `sdiag` scrapes into a Scuba-backed perfpipe category (`perfpipe_fair_sdiag_v2`, namespace `ai_infra`, oncall `fair_efficiency`) via the existing `graph_api` exporter. Same shape as every other `perfpipe_fair_*` category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter.

Three code changes:

1. **`Sdiag` dataclass + `SlurmCliClient.sdiag_structured()` extended** to capture the ~50 fields exposed by `sdiag --all --json` on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (`req_time`, `gettimeofday_latency`, `job_states_ts`, `parts_packed`), RPC blobs (`rpcs_by_message_type_json`, `rpcs_by_user_json`) as flat JSON strings. `bf_active` is `Optional[bool]` so the JSON serializer emits real `true`/`false`. `cluster: Optional[str] = None` is also on the base dataclass (populated at the collect-sdiag site, not by the parser); `derived_cluster` is intentionally omitted because sdiag is a cluster-wide slurmctld stat and tagging it with a partition-split identifier would generate misleading duplicate rows.

2. **`sdiag_scribe_category` kwarg threaded through `GraphAPI.__init__`** (alongside the existing `node_scribe_category`, `job_scribe_category`, `statvfs_scribe_category`, `pure_scribe_category`). `DataIdentifier.SDIAG` is registered in `monitoring/sink/protocol.py`; `CliObject.last_sdiag` cache + `collect_sdiag` generator added in `monitoring/cli/slurm_monitor.py`. A second tuple in `data_collection_tasks` schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). `last_sdiag` is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when `last_sdiag is None`. `collect_sdiag` documents the cache-ordering prerequisite (must run after `collect_slurm` in the same cycle); the loop site at `monitoring/utils/monitor.py` carries a matching comment about shared-cache patterns being sensitive to task ordering.

3. **`meta_utils/scribe.py::ScribeErrorWithAcks` surfaces raw reject codes.** Before: `"Failed to write N/N messages"`. After: `"Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']"`. Three lines, no behavior change on success; applies to every graph_api LOG path.

Ride-along (`exporters/otel.py`): OTLP handler moved from the `"gcm"` parent logger to a child `"_gcm_otel_emit"` with `propagate=False`, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as `body="Missing field_name=..."`. Standalone hygiene fix.

Docs: `website/docs/GCM_Monitoring/collectors/slurm_monitor.md` documents the dual-publish (METRIC->ODS + LOG->scribe via `DataIdentifier.SDIAG`), the `last_sdiag` single-scrape caching pattern, and the ordering prerequisite. `exporters/graph_api.md` gains a per-`DataIdentifier` routing table covering all six `*_scribe_category` kwargs plus the unset-category assertion behavior, and an "Adding a new scribe category (Meta-internal)" section linking the Scribe wiki for DMV/PrivacyLib/ScribeShell unblock steps.

Reviewed By: yonglimeta

Differential Revision: D95971502
@Yash0270 Yash0270 force-pushed the export-D95971502 branch from 8791db4 to f5fcba3 Compare May 21, 2026 19:47
@meta-codesync meta-codesync Bot changed the title Publish sdiag to perfpipe_fair_sdiag_v2 via graph_api Publish sdiag to perfpipe_fair_sdiag_v2 via graph_api (#141) May 21, 2026
Yash0270 added a commit to Yash0270/gcm that referenced this pull request May 21, 2026
…ch#141)

Summary:
Pull Request resolved: facebookresearch#141

Routes Slurm `sdiag` scrapes into a Scuba-backed perfpipe category (`perfpipe_fair_sdiag_v2`, namespace `ai_infra`, oncall `fair_efficiency`) via the existing `graph_api` exporter. Same shape as every other `perfpipe_fair_*` category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter.

Three code changes:

1. **`Sdiag` dataclass + `SlurmCliClient.sdiag_structured()` extended** to capture the ~50 fields exposed by `sdiag --all --json` on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (`req_time`, `gettimeofday_latency`, `job_states_ts`, `parts_packed`), RPC blobs (`rpcs_by_message_type_json`, `rpcs_by_user_json`) as flat JSON strings. `bf_active` is `Optional[bool]` so the JSON serializer emits real `true`/`false`. `cluster: Optional[str] = None` is also on the base dataclass (populated at the collect-sdiag site, not by the parser); `derived_cluster` is intentionally omitted because sdiag is a cluster-wide slurmctld stat and tagging it with a partition-split identifier would generate misleading duplicate rows.

2. **`sdiag_scribe_category` kwarg threaded through `GraphAPI.__init__`** (alongside the existing `node_scribe_category`, `job_scribe_category`, `statvfs_scribe_category`, `pure_scribe_category`). `DataIdentifier.SDIAG` is registered in `monitoring/sink/protocol.py`; `CliObject.last_sdiag` cache + `collect_sdiag` generator added in `monitoring/cli/slurm_monitor.py`. A second tuple in `data_collection_tasks` schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). `last_sdiag` is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when `last_sdiag is None`. `collect_sdiag` documents the cache-ordering prerequisite (must run after `collect_slurm` in the same cycle); the loop site at `monitoring/utils/monitor.py` carries a matching comment about shared-cache patterns being sensitive to task ordering.

3. **`meta_utils/scribe.py::ScribeErrorWithAcks` surfaces raw reject codes.** Before: `"Failed to write N/N messages"`. After: `"Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']"`. Three lines, no behavior change on success; applies to every graph_api LOG path.

Ride-along (`exporters/otel.py`): OTLP handler moved from the `"gcm"` parent logger to a child `"_gcm_otel_emit"` with `propagate=False`, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as `body="Missing field_name=..."`. Standalone hygiene fix.

Docs: `website/docs/GCM_Monitoring/collectors/slurm_monitor.md` documents the dual-publish (METRIC->ODS + LOG->scribe via `DataIdentifier.SDIAG`), the `last_sdiag` single-scrape caching pattern, and the ordering prerequisite. `exporters/graph_api.md` gains a per-`DataIdentifier` routing table covering all six `*_scribe_category` kwargs plus the unset-category assertion behavior, and an "Adding a new scribe category (Meta-internal)" section linking the Scribe wiki for DMV/PrivacyLib/ScribeShell unblock steps.

Reviewed By: yonglimeta

Differential Revision: D95971502
@Yash0270 Yash0270 force-pushed the export-D95971502 branch from f5fcba3 to 4e83a6c Compare May 21, 2026 21:29
Yash0270 added a commit to Yash0270/gcm that referenced this pull request May 21, 2026
…ch#141)

Summary:

Routes Slurm `sdiag` scrapes into a Scuba-backed perfpipe category (`perfpipe_fair_sdiag_v2`, namespace `ai_infra`, oncall `fair_efficiency`) via the existing `graph_api` exporter. Same shape as every other `perfpipe_fair_*` category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter.

Three code changes:

1. **`Sdiag` dataclass + `SlurmCliClient.sdiag_structured()` extended** to capture the ~50 fields exposed by `sdiag --all --json` on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (`req_time`, `gettimeofday_latency`, `job_states_ts`, `parts_packed`), RPC blobs (`rpcs_by_message_type_json`, `rpcs_by_user_json`) as flat JSON strings. `bf_active` is `Optional[bool]` so the JSON serializer emits real `true`/`false`. `cluster: Optional[str] = None` is also on the base dataclass (populated at the collect-sdiag site, not by the parser); `derived_cluster` is intentionally omitted because sdiag is a cluster-wide slurmctld stat and tagging it with a partition-split identifier would generate misleading duplicate rows.

2. **`sdiag_scribe_category` kwarg threaded through `GraphAPI.__init__`** (alongside the existing `node_scribe_category`, `job_scribe_category`, `statvfs_scribe_category`, `pure_scribe_category`). `DataIdentifier.SDIAG` is registered in `monitoring/sink/protocol.py`; `CliObject.last_sdiag` cache + `collect_sdiag` generator added in `monitoring/cli/slurm_monitor.py`. A second tuple in `data_collection_tasks` schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). `last_sdiag` is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when `last_sdiag is None`. `collect_sdiag` documents the cache-ordering prerequisite (must run after `collect_slurm` in the same cycle); the loop site at `monitoring/utils/monitor.py` carries a matching comment about shared-cache patterns being sensitive to task ordering.

3. **`meta_utils/scribe.py::ScribeErrorWithAcks` surfaces raw reject codes.** Before: `"Failed to write N/N messages"`. After: `"Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']"`. Three lines, no behavior change on success; applies to every graph_api LOG path.

Ride-along (`exporters/otel.py`): OTLP handler moved from the `"gcm"` parent logger to a child `"_gcm_otel_emit"` with `propagate=False`, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as `body="Missing field_name=..."`. Standalone hygiene fix.

Docs: `website/docs/GCM_Monitoring/collectors/slurm_monitor.md` documents the dual-publish (METRIC->ODS + LOG->scribe via `DataIdentifier.SDIAG`), the `last_sdiag` single-scrape caching pattern, and the ordering prerequisite. `exporters/graph_api.md` gains a per-`DataIdentifier` routing table covering all six `*_scribe_category` kwargs plus the unset-category assertion behavior, and an "Adding a new scribe category (Meta-internal)" section linking the Scribe wiki for DMV/PrivacyLib/ScribeShell unblock steps.

Reviewed By: yonglimeta

Differential Revision: D95971502
@Yash0270 Yash0270 force-pushed the export-D95971502 branch from 4e83a6c to 79c1a7f Compare May 21, 2026 22:29
Comment thread gcm/monitoring/utils/monitor.py Outdated
Comment on lines +146 to +149
# Task ordering matters when callables share an instance cache (e.g.
# `collect_slurm` populates `CliObject.last_sdiag`, consumed by
# `collect_sdiag`). Reordering tasks in `data_collection_tasks` can
# break collection for shared-cache consumers.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Task ordering matters when callables share an instance cache (e.g.
# `collect_slurm` populates `CliObject.last_sdiag`, consumed by
# `collect_sdiag`). Reordering tasks in `data_collection_tasks` can
# break collection for shared-cache consumers.
# Task ordering matters for some collections. If you wish to update this assumption please create a new `run_data_collection_loop` function

Comment thread gcm/monitoring/cli/slurm_monitor.py Outdated
Comment on lines +436 to +439
# Task B: Sdiag -> LOG -> scribe (perfpipe_fair_sdiag_v2) ->
# Scuba (perfpipe_fair_sdiag_v2) via graph_api._write_log. Reads cached
# sdiag from Task A. Mirrors the slurm_job_monitor pattern of
# per-task DataIdentifier-driven scribe routing.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Task B: Sdiag -> LOG -> scribe (perfpipe_fair_sdiag_v2) ->
# Scuba (perfpipe_fair_sdiag_v2) via graph_api._write_log. Reads cached
# sdiag from Task A. Mirrors the slurm_job_monitor pattern of
# per-task DataIdentifier-driven scribe routing.

Comment thread gcm/monitoring/cli/slurm_monitor.py Outdated
Comment on lines +427 to +428
# Task A: SLURMLog -> METRIC -> ODS via graph_api._write_metric.
# Populates obj.last_sdiag for Task B.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Task A: SLURMLog -> METRIC -> ODS via graph_api._write_metric.
# Populates obj.last_sdiag for Task B.

Comment thread gcm/tests/test_slurm_monitor.py Outdated
Comment on lines +5456 to +5462
import logging as _logging # noqa: E402

from gcm.monitoring.cli.slurm_monitor import ( # noqa: E402
CliObjectImpl as _CliObjectImpl,
collect_sdiag as _collect_sdiag,
)
from gcm.schemas.slurm.sdiag import Sdiag as _Sdiag # noqa: E402

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move these to the top

Performs comprehensive analytics on SLURM cluster state by combining data from multiple sources (`sinfo`, `sdiag`, `sacct`) and computing aggregated metrics. Provides deep insights into cluster health, resource utilization, job analytics, and user activity.

**Data Type**: `DataType.METRIC`, **Schemas**: `SLURMLog`, `SLURMLogAccountMetrics`
**Data Type**: `DataType.METRIC` (`SLURMLog` → ODS) + `DataType.LOG` with `DataIdentifier.SDIAG` (`Sdiag` → scribe → Scuba `perfpipe_fair_sdiag_v2`).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
**Data Type**: `DataType.METRIC` (`SLURMLog` → ODS) + `DataType.LOG` with `DataIdentifier.SDIAG` (`Sdiag` → scribe → Scuba `perfpipe_fair_sdiag_v2`).
**Data Type**: `DataType.METRIC` + `DataType.LOG` with `DataIdentifier.SDIAG`.

Yash0270 added a commit to Yash0270/gcm that referenced this pull request May 28, 2026
…ch#141)

Summary:

Routes Slurm `sdiag` scrapes into a Scuba-backed perfpipe category (`perfpipe_fair_sdiag_v2`, namespace `ai_infra`, oncall `fair_efficiency`) via the existing `graph_api` exporter. Same shape as every other `perfpipe_fair_*` category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter.

Three code changes:

1. **`Sdiag` dataclass + `SlurmCliClient.sdiag_structured()` extended** to capture the ~50 fields exposed by `sdiag --all --json` on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (`req_time`, `gettimeofday_latency`, `job_states_ts`, `parts_packed`), RPC blobs (`rpcs_by_message_type_json`, `rpcs_by_user_json`) as flat JSON strings. `bf_active` is `Optional[bool]` so the JSON serializer emits real `true`/`false`. `cluster: Optional[str] = None` is also on the base dataclass (populated at the collect-sdiag site, not by the parser); `derived_cluster` is intentionally omitted because sdiag is a cluster-wide slurmctld stat and tagging it with a partition-split identifier would generate misleading duplicate rows.

2. **`sdiag_scribe_category` kwarg threaded through `GraphAPI.__init__`** (alongside the existing `node_scribe_category`, `job_scribe_category`, `statvfs_scribe_category`, `pure_scribe_category`). `DataIdentifier.SDIAG` is registered in `monitoring/sink/protocol.py`; `CliObject.last_sdiag` cache + `collect_sdiag` generator added in `monitoring/cli/slurm_monitor.py`. A second tuple in `data_collection_tasks` schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). `last_sdiag` is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when `last_sdiag is None`. `collect_sdiag` documents the cache-ordering prerequisite (must run after `collect_slurm` in the same cycle); the loop site at `monitoring/utils/monitor.py` carries a matching comment about shared-cache patterns being sensitive to task ordering.

3. **`meta_utils/scribe.py::ScribeErrorWithAcks` surfaces raw reject codes.** Before: `"Failed to write N/N messages"`. After: `"Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']"`. Three lines, no behavior change on success; applies to every graph_api LOG path.

Ride-along (`exporters/otel.py`): OTLP handler moved from the `"gcm"` parent logger to a child `"_gcm_otel_emit"` with `propagate=False`, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as `body="Missing field_name=..."`. Standalone hygiene fix.

Docs: `website/docs/GCM_Monitoring/collectors/slurm_monitor.md` documents the dual-publish (METRIC->ODS + LOG->scribe via `DataIdentifier.SDIAG`), the `last_sdiag` single-scrape caching pattern, and the ordering prerequisite. `exporters/graph_api.md` gains a per-`DataIdentifier` routing table covering all six `*_scribe_category` kwargs plus the unset-category assertion behavior, and an "Adding a new scribe category (Meta-internal)" section linking the Scribe wiki for DMV/PrivacyLib/ScribeShell unblock steps.

Reviewed By: yonglimeta

Differential Revision: D95971502
@Yash0270 Yash0270 force-pushed the export-D95971502 branch from 79c1a7f to 11b2b73 Compare May 28, 2026 18:12
Yash0270 added a commit to Yash0270/gcm that referenced this pull request May 28, 2026
…ch#141)

Summary:

Routes Slurm `sdiag` scrapes into a Scuba-backed perfpipe category (`perfpipe_fair_sdiag_v2`, namespace `ai_infra`, oncall `fair_efficiency`) via the existing `graph_api` exporter. Same shape as every other `perfpipe_fair_*` category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter.

Three code changes:

1. **`Sdiag` dataclass + `SlurmCliClient.sdiag_structured()` extended** to capture the ~50 fields exposed by `sdiag --all --json` on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (`req_time`, `gettimeofday_latency`, `job_states_ts`, `parts_packed`), RPC blobs (`rpcs_by_message_type_json`, `rpcs_by_user_json`) as flat JSON strings. `bf_active` is `Optional[bool]` so the JSON serializer emits real `true`/`false`. `cluster: Optional[str] = None` is also on the base dataclass (populated at the collect-sdiag site, not by the parser); `derived_cluster` is intentionally omitted because sdiag is a cluster-wide slurmctld stat and tagging it with a partition-split identifier would generate misleading duplicate rows.

2. **`sdiag_scribe_category` kwarg threaded through `GraphAPI.__init__`** (alongside the existing `node_scribe_category`, `job_scribe_category`, `statvfs_scribe_category`, `pure_scribe_category`). `DataIdentifier.SDIAG` is registered in `monitoring/sink/protocol.py`; `CliObject.last_sdiag` cache + `collect_sdiag` generator added in `monitoring/cli/slurm_monitor.py`. A second tuple in `data_collection_tasks` schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). `last_sdiag` is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when `last_sdiag is None`. `collect_sdiag` documents the cache-ordering prerequisite (must run after `collect_slurm` in the same cycle); the loop site at `monitoring/utils/monitor.py` carries a matching comment about shared-cache patterns being sensitive to task ordering.

3. **`meta_utils/scribe.py::ScribeErrorWithAcks` surfaces raw reject codes.** Before: `"Failed to write N/N messages"`. After: `"Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']"`. Three lines, no behavior change on success; applies to every graph_api LOG path.

Ride-along (`exporters/otel.py`): OTLP handler moved from the `"gcm"` parent logger to a child `"_gcm_otel_emit"` with `propagate=False`, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as `body="Missing field_name=..."`. Standalone hygiene fix.

Docs: `website/docs/GCM_Monitoring/collectors/slurm_monitor.md` documents the dual-publish (METRIC->ODS + LOG->scribe via `DataIdentifier.SDIAG`), the `last_sdiag` single-scrape caching pattern, and the ordering prerequisite. `exporters/graph_api.md` gains a per-`DataIdentifier` routing table covering all six `*_scribe_category` kwargs plus the unset-category assertion behavior, and an "Adding a new scribe category (Meta-internal)" section linking the Scribe wiki for DMV/PrivacyLib/ScribeShell unblock steps.

Reviewed By: yonglimeta

Differential Revision: D95971502
@Yash0270 Yash0270 force-pushed the export-D95971502 branch 2 times, most recently from 11b2b73 to c41c8cb Compare May 28, 2026 18:44
…ch#141)

Summary:

Routes Slurm `sdiag` scrapes into a Scuba-backed perfpipe category (`perfpipe_fair_sdiag_v2`, namespace `ai_infra`, oncall `fair_efficiency`) via the existing `graph_api` exporter. Same shape as every other `perfpipe_fair_*` category (node_data, job_data, sacct_running, sacct, scontrol_data): graph_api writes ODS metrics and scribe_logs; scribe routes per-category to a Scuba dataset of the same name. No new exporter.

Three code changes:

1. **`Sdiag` dataclass + `SlurmCliClient.sdiag_structured()` extended** to capture the ~50 fields exposed by `sdiag --all --json` on Slurm 23.2+ that were previously dropped: schedule cycle depth/last, backfill cycle/depth/queue-length/table-size, timing metadata (`req_time`, `gettimeofday_latency`, `job_states_ts`, `parts_packed`), RPC blobs (`rpcs_by_message_type_json`, `rpcs_by_user_json`) as flat JSON strings. `bf_active` is `Optional[bool]` so the JSON serializer emits real `true`/`false`. `cluster: Optional[str] = None` is also on the base dataclass (populated at the collect-sdiag site, not by the parser); `derived_cluster` is intentionally omitted because sdiag is a cluster-wide slurmctld stat and tagging it with a partition-split identifier would generate misleading duplicate rows.

2. **`sdiag_scribe_category` kwarg threaded through `GraphAPI.__init__`** (alongside the existing `node_scribe_category`, `job_scribe_category`, `statvfs_scribe_category`, `pure_scribe_category`). `DataIdentifier.SDIAG` is registered in `monitoring/sink/protocol.py`; `CliObject.last_sdiag` cache + `collect_sdiag` generator added in `monitoring/cli/slurm_monitor.py`. A second tuple in `data_collection_tasks` schedules sdiag LOG writes on the same 5-min cycle as the existing METRIC tasks, reusing the per-cycle scrape (no extra slurmctld load). `last_sdiag` is cleared before each scrape so a raised scrape cannot republish stale data; the dual-task is no-op when `last_sdiag is None`. `collect_sdiag` documents the cache-ordering prerequisite (must run after `collect_slurm` in the same cycle); the loop site at `monitoring/utils/monitor.py` carries a matching comment about shared-cache patterns being sensitive to task ordering.

3. **`meta_utils/scribe.py::ScribeErrorWithAcks` surfaces raw reject codes.** Before: `"Failed to write N/N messages"`. After: `"Failed to write N/N messages; reject codes: ['INVALID_CATEGORY']"`. Three lines, no behavior change on success; applies to every graph_api LOG path.

Ride-along (`exporters/otel.py`): OTLP handler moved from the `"gcm"` parent logger to a child `"_gcm_otel_emit"` with `propagate=False`, so OTLP-emit validation errors no longer mirror into Scuba LOG rows as `body="Missing field_name=..."`. Standalone hygiene fix.

Docs: `website/docs/GCM_Monitoring/collectors/slurm_monitor.md` documents the dual-publish (METRIC->ODS + LOG->scribe via `DataIdentifier.SDIAG`), the `last_sdiag` single-scrape caching pattern, and the ordering prerequisite. `exporters/graph_api.md` gains a per-`DataIdentifier` routing table covering all six `*_scribe_category` kwargs plus the unset-category assertion behavior, and an "Adding a new scribe category (Meta-internal)" section linking the Scribe wiki for DMV/PrivacyLib/ScribeShell unblock steps.

Reviewed By: yonglimeta

Differential Revision: D95971502
@Yash0270 Yash0270 force-pushed the export-D95971502 branch from c41c8cb to 7c29686 Compare May 28, 2026 20:46
@Yash0270 Yash0270 requested a review from yonglimeta May 28, 2026 22:14
@Yash0270 Yash0270 merged commit 27b18d2 into facebookresearch:main May 29, 2026
38 of 39 checks passed
@Yash0270 Yash0270 deleted the export-D95971502 branch May 29, 2026 16:13
@Yash0270

Copy link
Copy Markdown
Contributor Author

1 failed test seems to be because of fbcode <--> github cyclic dependency. Fbcode test failure is because of github dependency is failing. Github test failures point to fbcode diff failure.

Force pushing to resolve this. Already tried rebasing multiple times.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants