swordfish: add seamless kernel profiling workflow #6
Open
chokevin wants to merge 2 commits into
Add A100 NCU safety checks, DCGM profiling-window helpers, stable trace bundling, vector-sum benchmark plumbing, and FSDP overlap knobs from the kernel investigation.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CI note: local validation passed ( |
The benchmark image builds from a private voiceagentcr ACR base image, which pull_request runners cannot fetch anonymously. Keep image publication on main and workflow_dispatch while relying on ci/lint plus local tests for PR validation.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
What
Add the seamless kernel-profiling workflow from the recent airun investigation: A100 NCU safety checks, DCGM profiling-window helpers, stable trace bundling, vector-sum benchmark plumbing, and FSDP overlap knobs.
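For reviewers who want a concrete picture of the trace bundling, here is a minimal sketch, assuming a bundle is a gzip-compressed tarball named after the bundle and written under runs/traces/. The function name `bundle_traces`, its parameters, and the sorted-entry layout are hypothetical illustrations and are not taken from this PR's code.

```python
import tarfile
from pathlib import Path


def bundle_traces(trace_dir: Path, bundle: str,
                  out_root: Path = Path("runs/traces")) -> Path:
    """Pack every file under trace_dir into <out_root>/<bundle>.tar.gz.

    Hypothetical helper: the name, signature, and archive layout are
    assumptions made for illustration only.
    """
    out_root.mkdir(parents=True, exist_ok=True)
    out_path = out_root / f"{bundle}.tar.gz"
    with tarfile.open(out_path, "w:gz") as tar:
        # Sort paths so the entry order does not depend on filesystem order.
        for path in sorted(trace_dir.rglob("*")):
            if path.is_file():
                # Store entries under a top-level directory named after the bundle.
                arcname = (Path(bundle) / path.relative_to(trace_dir)).as_posix()
                tar.add(path, arcname=arcname)
    return out_path
```

Keeping a single top-level directory inside the archive means that extracting several bundles side by side does not mix their files.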
Why
A100 Nsight Compute profiling needed guardrails after the Rune renderer dropped profile security context and DCGM exporter interference made profiling brittle. This also standardizes trace handoff under runs/traces/<bundle>.tar.gz so profiles can be shipped to Hermes or inspected locally.
Non-goals
aks-ai-runtime.
Testing
git diff --check
make test
sf-ncu-smoke-001148-a100 and produced a trace bundle.
Risk
Moderate operational surface area: new helpers patch/restore the GPU operator DCGM exporter DaemonSet during A100 NCU windows. The helper includes status/restore flows, and the submit preflight prevents burning A100 NCU jobs when SYS_ADMIN is missing.
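The submit preflight described above can be pictured roughly as follows. This is a minimal sketch, assuming the rendered job is available as a Kubernetes pod spec dictionary and that a privileged container counts as having the capability. The function names `has_sys_admin` and `preflight_ncu` are hypothetical and do not come from this PR.

```python
from typing import Any, Mapping


def has_sys_admin(pod_spec: Mapping[str, Any]) -> bool:
    """Return True if any container is privileged or adds SYS_ADMIN.

    Reads the standard pod spec fields
    containers[].securityContext.privileged and
    containers[].securityContext.capabilities.add.
    """
    for container in pod_spec.get("containers", []):
        security_context = container.get("securityContext") or {}
        if security_context.get("privileged"):
            return True
        added = (security_context.get("capabilities") or {}).get("add") or []
        if "SYS_ADMIN" in added:
            return True
    return False


def preflight_ncu(pod_spec: Mapping[str, Any]) -> None:
    """Fail before submission when the rendered spec lacks SYS_ADMIN.

    Hypothetical guard: raising here is an assumed way to stop the submit
    so that no A100 NCU job is spent on a run that cannot profile.
    """
    if not has_sys_admin(pod_spec):
        raise RuntimeError(
            "NCU preflight failed: no container adds SYS_ADMIN; "
            "check that the rendered security context was not dropped"
        )
```

Checking the rendered spec, rather than the template, is what would catch a renderer that silently drops the security context.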