Tuning agent, memory based benchmarking support plus fixes #755
Open
araina-amd wants to merge 13 commits into
- Fix projection accuracy; rename Megatron ILP -> SeaAILab ILP; add scheduler comparison
- Collective model: additive A2A correction, proportional num_experts reduction, A2A mesh/remote contention derates, P2P-based inter-node BW, hierarchical AllReduce with pipelining + RCCL overhead + NIC RDMA warmup, and P2P/PP-aware SendRecv
- FSDP per-layer compute/comm overlap model; loss-fusion + SyncFree overlap; hybrid sourcing with auto bg=1 compute baseline; auto-disable turbo-deepep when TP*EP=1
- Config validation; training_config + projection CLI updates
- Add bench-anchored memory mode with shared bench artifact
- memory_capture, benchmark, extrapolation, and reports modules
- Rename memory_projection/projection.py -> simulate.py; enable layer enumeration
- primus/agents/tuning_agent: agent, evaluator, plan, legality, scratchpad, tools, workload, history, plotting, config, cli
- unit tests: evaluator benchmark + recompute schedule legality
- megatron TE patches: deep-probe primus_turbo before applying TE patches
- torchtitan qwen3 model-config tweaks; bump third_party/torchtitan pointer
- drop torchtitan upstream backend-gap report/summary docs
…c provider patch
The squash/refactor dropped the _use_legacy_grouped_gemm helper while leaving two call sites, raising NameError whenever the te_spec_provider turbo patch activates. Define the helper and remove the now-unused turbo-enable probes.
Co-authored-by: Cursor <cursoragent@cursor.com>
Restore docs/backend-gap/reports/torchtitan/upstream-main and the primus/configs/models/torchtitan qwen3 configs to match main; these were unintended local-dev changes that should not be part of this branch.
Co-authored-by: Cursor <cursoragent@cursor.com>
Replace silent pass-only handlers with explanatory comments, explicit fallbacks, debug logging, or lightweight session-log error events so best-effort paths remain non-fatal but diagnosable.
Co-authored-by: Cursor <cursoragent@cursor.com>
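As a minimal sketch of the pattern this commit describes (not the actual code from the branch), a best-effort handler can keep its fallback while logging why the fallback was taken. The function name `read_int_or_default` and the logger name are hypothetical and chosen only for illustration.

```python
import logging

# Hypothetical logger name; the real module may use a different one.
logger = logging.getLogger("tuning_agent.sketch")


def read_int_or_default(path, default=0):
    """Best-effort read: never raises, but records why the fallback was used."""
    try:
        with open(path, encoding="utf-8") as fh:
            return int(fh.read().strip())
    except (OSError, ValueError) as exc:
        # A silent `pass` here would hide the failure entirely. Logging at
        # debug level keeps the path non-fatal while leaving a trace.
        logger.debug("could not read %s (%s); using default %r", path, exc, default)
        return default
```

Catching only `OSError` and `ValueError` (rather than a bare `except`) is a deliberate choice in this sketch: unexpected error types still surface instead of being absorbed by the fallback.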
Remove unused imports in the tuning agent and tests, and make routing-patch restore cleanup in utils.py non-fatal but visible via a warning message.
Co-authored-by: Cursor <cursoragent@cursor.com>
# Per-trial YAML generation
# ---------------------------------------------------------------------------

_PARALLEL_OVERRIDES_KEYS = {
Run pre-commit hooks across the branch so CI formatting checks pass. Also restore the general_gemm workspace patch import logic that autoflake had incorrectly stripped.
Co-authored-by: Cursor <cursoragent@cursor.com>
Comment on lines +42 to +49
from primus.agents.tuning_agent.evaluator import (  # noqa: E402
    EvalResult,
    Evaluator,
    _build_env,
    _build_memory_cmd,
    _build_perf_cmd,
    _parse_metrics,
)
        f"(bench/target)"
    )
else:
    loaded_payload = None
    loaded_metadata = {}
Enables the squash-branch projection benchmarks (gpt_oss, qwen, mixtral) to run on the rocm/primus v26.2/v26.3 containers, plus small tuning-agent fixes.
- primus_turbo: guard QuantizedTensor/QuantizedTensorPair imports and alias ScalingRecipe -> MXScalingRecipe so the wrapper imports against primus_turbo 0.2.0 (shipped in v26.2/v26.3, predating PR #735). These symbols are only used on FP8/FP4 weight-quant paths, so BF16 turbo attention/DeepEP now work.
- trainer: adapt the track_config_flags call to the pinned Megatron-LM signature (accepts 6 vs 8 positional args across commits).
- tuning_agent/evaluator: pass --memory-mode explicitly so the simulate/memory-only path does not fall back to the CLI's benchmark default.
- tuning_agent/workload: coerce ${VAR:default}-templated virtual-pipeline values to int (or None) before use.
- examples/agents: add mi355x 4-node target cluster config.
Co-authored-by: Cursor <cursoragent@cursor.com>
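The `${VAR:default}` coercion mentioned for tuning_agent/workload could look roughly like the sketch below. This is an illustration under stated assumptions, not the code in the PR: the helper name `coerce_optional_int`, the set of strings treated as "no value", and the choice to resolve unresolved templates from an `env` mapping are all hypothetical.

```python
import re

# Assumed template syntax, taken from the commit text: "${VAR}" or "${VAR:default}".
_TEMPLATE_RE = re.compile(
    r"^\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::(?P<default>[^}]*))?\}$"
)
# Assumption: these spellings all mean "unset" for an optional integer field.
_NULL_STRINGS = {"", "none", "null", "~"}


def coerce_optional_int(value, env=None):
    """Return an int, or None when the value is unset; raise on anything else."""
    env = {} if env is None else env
    if value is None:
        return None
    if isinstance(value, bool):
        # bool is a subclass of int, so reject it explicitly.
        raise TypeError("expected an integer-like value, got bool")
    if isinstance(value, int):
        return value
    text = str(value).strip()
    match = _TEMPLATE_RE.match(text)
    if match:
        # Unresolved template: prefer the environment, then the inline default.
        default = match.group("default")
        text = str(env.get(match.group("name"), default if default is not None else "")).strip()
    if text.lower() in _NULL_STRINGS:
        return None
    return int(text)  # raises ValueError for non-numeric text
```

For example, `coerce_optional_int("${VPP:2}")` yields `2`, and `coerce_optional_int("${VPP:null}")` yields `None`; `VPP` is a placeholder variable name used only for this illustration.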
The agent previously required a running LiteLLM sidecar process and AMD-internal gateway routing (LITELLM_BASE_URL, OCP_APIM_SUBSCRIPTION_KEY, _amd_onprem_llm_headers). DSPy 3.x bundles LiteLLM natively, so no proxy is needed for any provider.
- agent.py: remove _amd_onprem_llm_headers and AMD-gateway routing from configure_dspy(); pass model/api_key/base_url directly to dspy.LM() so any LiteLLM-supported provider works out of the box
- config.py: replace LITELLM_* env vars with standard provider vars (OPENAI_API_KEY, ANTHROPIC_API_KEY, LLM_API_KEY, OPENAI_API_BASE, LLM_MODEL); remove AMD-private .env search path; drop provider field from LLMConfig; change DEFAULT_MODEL to openai/gpt-4o
- requirements.txt: drop litellm as a direct dep (dspy brings it)
- README.md: replace proxy setup with generic dspy/litellm instructions covering OpenAI, Anthropic, Ollama, and custom OpenAI-compatible endpoints
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
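A rough sketch of what proxy-free configuration can look like follows. It is not the PR's config.py or agent.py: the `LLMSettings` dataclass, the `resolve_llm_settings` helper, and the key precedence (LLM_API_KEY, then OPENAI_API_KEY, then ANTHROPIC_API_KEY) are assumptions made for illustration. The environment variable names and the `openai/gpt-4o` default come from the commit message. The sketch passes the endpoint as `api_base`, which is forwarded to LiteLLM through `dspy.LM`.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMSettings:
    """Hypothetical container; the PR's LLMConfig may have different fields."""
    model: str
    api_key: Optional[str]
    api_base: Optional[str]


def resolve_llm_settings(env: Mapping[str, str]) -> LLMSettings:
    # Assumed precedence: a generic key wins over provider-specific keys.
    api_key = (
        env.get("LLM_API_KEY")
        or env.get("OPENAI_API_KEY")
        or env.get("ANTHROPIC_API_KEY")
    )
    return LLMSettings(
        model=env.get("LLM_MODEL", "openai/gpt-4o"),
        api_key=api_key,
        api_base=env.get("OPENAI_API_BASE"),
    )


def configure_dspy(settings: LLMSettings) -> None:
    # Imported lazily so the resolver above works without dspy installed.
    import dspy

    kwargs = {}
    if settings.api_key:
        kwargs["api_key"] = settings.api_key
    if settings.api_base:
        kwargs["api_base"] = settings.api_base
    # No sidecar proxy: dspy.LM hands the model string and kwargs to LiteLLM.
    dspy.configure(lm=dspy.LM(settings.model, **kwargs))
```

Typical use would be `configure_dspy(resolve_llm_settings(os.environ))` after `import os`. Keeping the resolver separate from the dspy call makes the environment handling testable without network access or API keys.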
…ports
Co-authored-by: Cursor <cursoragent@cursor.com>
    tools=tools,
    max_iterations=budget.max_rlm_iterations,
)
result = rlm(
Cover the deterministic/analytical helpers (no GPU required) for the tuning agent (workload resolver, legality, history) and both projection paths (config guards, memory extrapolation, performance placement/timing helpers).
Co-authored-by: Cursor <cursoragent@cursor.com>
Adds an LLM-driven tuning agent for parallelism/config search and substantially
improves the performance and memory projection stack (accuracy, collective/comm
modeling, overlap models, bench-anchored memory). Also includes supporting
backend/config cleanups and a turbo-patch crash fix.