
swordfish: add seamless kernel profiling workflow #6

Open
chokevin wants to merge 2 commits into main from chokevin/seamless-kernel-profiling-20260506

chokevin (Owner) commented May 6, 2026

What

Add the seamless kernel-profiling workflow from the recent airun investigation: A100 NCU safety checks, DCGM profiling-window helpers, stable trace bundling, vector-sum benchmark plumbing, and FSDP overlap knobs.

Why

A100 Nsight Compute profiling needed guardrails after the Rune renderer dropped profile security context and DCGM exporter interference made profiling brittle. This also standardizes trace handoff under runs/traces/<bundle>.tar.gz so profiles can be shipped to Hermes or inspected locally.
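The bundling step could look roughly like the sketch below. It assumes "stable" means both a predictable output path (`runs/traces/<bundle>.tar.gz`) and byte-reproducible contents; the function name `bundle_traces` and its parameters are hypothetical and are not taken from this PR's diff.

```python
import gzip
import tarfile
from pathlib import Path


def bundle_traces(src_dir, bundle, out_root="runs/traces"):
    """Pack every file under src_dir into <out_root>/<bundle>.tar.gz.

    Hypothetical helper: the real bundler in this PR may differ.
    """
    src = Path(src_dir)
    out_path = Path(out_root) / f"{bundle}.tar.gz"
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # Sorted order keeps the archive layout independent of directory walk order.
    files = sorted(p for p in src.rglob("*") if p.is_file())

    with open(out_path, "wb") as raw:
        # mtime=0 and an empty filename keep the gzip header byte-stable.
        with gzip.GzipFile(filename="", fileobj=raw, mode="wb", mtime=0) as gz:
            with tarfile.open(fileobj=gz, mode="w") as tar:
                for path in files:
                    arcname = f"{bundle}/{path.relative_to(src).as_posix()}"
                    info = tar.gettarinfo(str(path), arcname=arcname)
                    # Normalize metadata that would otherwise vary per run or host.
                    info.mtime = 0
                    info.uid = info.gid = 0
                    info.uname = info.gname = ""
                    with open(path, "rb") as fh:
                        tar.addfile(info, fh)
    return out_path
```

Under these assumptions, bundling the same trace directory twice yields identical bytes, which makes it easy to confirm that a bundle shipped elsewhere matches the local copy.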

Non-goals

  • Does not vendor the upstream Rune renderer fix; that lives in aks-ai-runtime.
  • Does not rebuild or push the benchmark container image.
  • Does not claim NCU-wrapped timings as clean latency results.

Testing

  • git diff --check
  • make test
  • A100 NCU smoke previously completed as sf-ncu-smoke-001148-a100 and produced a trace bundle.

Risk

Moderate operational surface area: new helpers patch/restore the GPU operator DCGM exporter DaemonSet during A100 NCU windows. The helper includes status/restore flows, and the submit preflight prevents burning A100 NCU jobs when SYS_ADMIN is missing.
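The two guardrails described above can be sketched as follows. This is an illustration only: it assumes the preflight inspects the rendered pod spec for `SYS_ADMIN` under `securityContext.capabilities.add` (or `privileged: true`), and that the DCGM helper is driven through caller-supplied pause/restore callables. The names `containers_missing_sys_admin` and `profiling_window` are hypothetical and do not come from the PR.

```python
from contextlib import contextmanager


def containers_missing_sys_admin(pod_spec):
    """Return names of containers that would lack SYS_ADMIN at runtime.

    Assumes pod_spec is the already-rendered Kubernetes pod spec as a dict.
    """
    missing = []
    for container in pod_spec.get("containers", []):
        sec_ctx = container.get("securityContext") or {}
        added = (sec_ctx.get("capabilities") or {}).get("add") or []
        if not sec_ctx.get("privileged") and "SYS_ADMIN" not in added:
            missing.append(container.get("name", "<unnamed>"))
    return missing


@contextmanager
def profiling_window(pause_exporter, restore_exporter):
    """Pause the exporter for the duration of the block, then always restore it.

    pause_exporter and restore_exporter are hypothetical callables, for
    example thin wrappers around whatever patches the DaemonSet.
    """
    pause_exporter()
    try:
        yield
    finally:
        # Runs even if the profiling job raises, so the exporter is not left patched.
        restore_exporter()
```

A submit path could call `containers_missing_sys_admin` on the rendered spec and refuse to submit if the profiled container appears in the result, then wrap the job in `profiling_window` so the restore step still runs when the job fails.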

Add A100 NCU safety checks, DCGM profiling-window helpers, stable trace bundling, vector-sum benchmark plumbing, and FSDP overlap knobs from the kernel investigation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
chokevin (Owner, Author) commented May 6, 2026

CI note: local validation passed (git diff --check, make test) and ci/lint is green. build-swordfish-image/build is red because the workflow pulls voiceagentcr.azurecr.io/airun/autoresearch-pytorch-ray:dev anonymously and ACR returns 401; recent main runs for this workflow show the same baseline failure since the private ACR base switch.

The benchmark image builds from a private voiceagentcr ACR base image, which pull_request runners cannot fetch anonymously. Keep image publication on main and workflow_dispatch while relying on ci/lint plus local tests for PR validation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>