Tuning agent, memory based benchmarking support plus fixes by araina-amd · Pull Request #755 · AMD-AGI/Primus

araina-amd · 2026-06-08T16:48:07Z

Adds an LLM-driven tuning agent for parallelism/config search and substantially
improves the performance & memory projection stack (accuracy, collective/comm
modeling, overlap models, bench-anchored memory). Also includes supporting backend/
config cleanups and a turbo-patch crash fix.

- Fix projection accuracy; rename Megatron ILP -> SeaAILab ILP; add scheduler comparison
- Collective model: additive A2A correction, proportional num_experts reduction, A2A mesh/remote contention derates, P2P-based inter-node BW, hierarchical AllReduce with pipelining + RCCL overhead + NIC RDMA warmup, and P2P/PP-aware SendRecv
- FSDP per-layer compute/comm overlap model; loss-fusion + SyncFree overlap; hybrid sourcing with auto bg=1 compute baseline; auto-disable turbo-deepep when TP*EP=1
- Config validation; training_config + projection CLI updates

- Add bench-anchored memory mode with shared bench artifact
- memory_capture, benchmark, extrapolation, and reports modules
- Rename memory_projection/projection.py -> simulate.py; enable layer enumeration

- primus/agents/tuning_agent: agent, evaluator, plan, legality, scratchpad, tools, workload, history, plotting, config, cli
- unit tests: evaluator benchmark + recompute schedule legality

- megatron TE patches: deep-probe primus_turbo before applying TE patches
- torchtitan qwen3 model-config tweaks; bump third_party/torchtitan pointer
- drop torchtitan upstream backend-gap report/summary docs

…c provider patch

The squash/refactor dropped the _use_legacy_grouped_gemm helper while leaving two call sites, raising NameError whenever the te_spec_provider turbo patch activates. Define the helper and remove the now-unused turbo-enable probes.

Co-authored-by: Cursor <cursoragent@cursor.com>

Restore docs/backend-gap/reports/torchtitan/upstream-main and the primus/configs/models/torchtitan qwen3 configs to match main; these were unintended local-dev changes that should not be part of this branch. Co-authored-by: Cursor <cursoragent@cursor.com>

Replace silent pass-only handlers with explanatory comments, explicit fallbacks, debug logging, or lightweight session-log error events so best-effort paths remain non-fatal but diagnosable. Co-authored-by: Cursor <cursoragent@cursor.com>
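The pattern this commit describes, keeping best-effort paths non-fatal while leaving a trace, might look roughly like the sketch below. The helper name `best_effort`, its parameters, and the logger name are hypothetical and are not taken from the Primus code; the sketch only shows the general shape of replacing a bare `except: pass` with debug logging and an explicit fallback.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Hypothetical logger name, chosen for illustration only.
logger = logging.getLogger("tuning_agent.best_effort")


def best_effort(action: Callable[[], T], description: str,
                fallback: Optional[T] = None) -> Optional[T]:
    """Run an optional step; on failure, log it and return an explicit fallback.

    This replaces the silent form:
        try: action()
        except Exception: pass
    """
    try:
        return action()
    except Exception as exc:  # deliberately broad: the step is optional
        # The failure stays non-fatal but is now visible at debug level.
        logger.debug("best-effort step %r failed: %s", description, exc)
        return fallback


# Usage: a failing optional step yields the fallback instead of raising.
value = best_effort(lambda: 1 / 0, "optional metric", fallback=-1)
```

The broad `except Exception` is kept on purpose, since the point of a best-effort path is that it never aborts the caller; the change is only that the failure is recorded.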

Remove unused imports in the tuning agent and tests, and make routing-patch restore cleanup in utils.py non-fatal but visible via a warning message. Co-authored-by: Cursor <cursoragent@cursor.com>

+# Per-trial YAML generation
+# ---------------------------------------------------------------------------
+
+_PARALLEL_OVERRIDES_KEYS = {


Run pre-commit hooks across the branch so CI formatting checks pass. Also restore the general_gemm workspace patch import logic that autoflake had incorrectly stripped. Co-authored-by: Cursor <cursoragent@cursor.com>

+from primus.agents.tuning_agent.evaluator import (  # noqa: E402
+    EvalResult,
+    Evaluator,
+    _build_env,
+    _build_memory_cmd,
+    _build_perf_cmd,
+    _parse_metrics,
+)


+                f"(bench/target)"
+            )
+    else:
+        loaded_payload = None


+            )
+    else:
+        loaded_payload = None
+        loaded_metadata = {}


Enables the squash-branch projection benchmarks (gpt_oss, qwen, mixtral) to run on the rocm/primus v26.2/v26.3 containers, plus small tuning-agent fixes.

- primus_turbo: guard QuantizedTensor/QuantizedTensorPair imports and alias ScalingRecipe -> MXScalingRecipe so the wrapper imports against primus_turbo 0.2.0 (shipped in v26.2/v26.3, predating PR #735). These symbols are only used on FP8/FP4 weight-quant paths, so BF16 turbo attention/DeepEP now work.
- trainer: adapt the track_config_flags call to the pinned Megatron-LM signature (accepts 6 vs 8 positional args across commits).
- tuning_agent/evaluator: pass --memory-mode explicitly so the simulate/memory-only path does not fall back to the CLI's benchmark default.
- tuning_agent/workload: coerce ${VAR:default}-templated virtual-pipeline values to int (or None) before use.
- examples/agents: add mi355x 4-node target cluster config.

Co-authored-by: Cursor <cursoragent@cursor.com>
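As an illustration of the tuning_agent/workload fix above, coercing a `${VAR:default}`-templated value to `int` or `None` could be sketched as follows. The function name `coerce_optional_int`, the regex, and the set of strings treated as "no value" are assumptions made for this example; the actual resolver in the PR may differ.

```python
import re
from typing import Mapping, Optional, Union

# Matches a whole-string template such as "${VPP:2}" or "${VPP}".
_TEMPLATE = re.compile(
    r"^\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::(?P<default>[^}]*))?\}$"
)
# Strings treated as "unset" (an assumption for this sketch).
_NULLISH = {"", "none", "null", "~"}


def coerce_optional_int(value: Union[None, int, str],
                        env: Mapping[str, str]) -> Optional[int]:
    """Resolve a possibly templated value and return an int or None."""
    if value is None:
        return None
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    text = str(value).strip()
    match = _TEMPLATE.match(text)
    if match:
        # Prefer the environment value; otherwise use the inline default.
        default = match.group("default") or ""
        text = env.get(match.group("name"), default).strip()
    if text.lower() in _NULLISH:
        return None
    return int(text)  # raises ValueError on non-integer text


# Usage: the environment overrides the inline default.
vpp = coerce_optional_int("${VPP:2}", {"VPP": "4"})
```

Raising `ValueError` on non-integer text, rather than silently returning `None`, keeps a malformed config from being mistaken for "virtual pipeline disabled".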

The agent previously required a running LiteLLM sidecar process and AMD-internal gateway routing (LITELLM_BASE_URL, OCP_APIM_SUBSCRIPTION_KEY, _amd_onprem_llm_headers). DSPy 3.x bundles LiteLLM natively, so no proxy is needed for any provider.

- agent.py: remove _amd_onprem_llm_headers and AMD-gateway routing from configure_dspy(); pass model/api_key/base_url directly to dspy.LM() so any LiteLLM-supported provider works out of the box
- config.py: replace LITELLM_* env vars with standard provider vars (OPENAI_API_KEY, ANTHROPIC_API_KEY, LLM_API_KEY, OPENAI_API_BASE, LLM_MODEL); remove AMD-private .env search path; drop provider field from LLMConfig; change DEFAULT_MODEL to openai/gpt-4o
- requirements.txt: drop litellm as a direct dep (dspy brings it)
- README.md: replace proxy setup with generic dspy/litellm instructions covering OpenAI, Anthropic, Ollama, and custom OpenAI-compatible endpoints

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
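A minimal sketch of how the environment variables named above might be resolved into the model/api_key/base_url triple that is handed to `dspy.LM()`. The `LLMSettings` dataclass, the `resolve_llm_settings` function, and the precedence order (generic `LLM_API_KEY` first, then the provider-specific key) are assumptions for illustration; only the variable names and the `openai/gpt-4o` default come from the commit message.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_MODEL = "openai/gpt-4o"  # default stated in the commit message

# Provider prefix -> provider-specific key variable (assumed mapping).
_PROVIDER_KEY_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


@dataclass(frozen=True)
class LLMSettings:  # hypothetical container, not the PR's LLMConfig
    model: str
    api_key: Optional[str]
    base_url: Optional[str]


def resolve_llm_settings(env: Mapping[str, str]) -> LLMSettings:
    """Pick model, key, and endpoint from standard provider variables."""
    model = env.get("LLM_MODEL") or DEFAULT_MODEL
    provider = model.split("/", 1)[0]
    key_var = _PROVIDER_KEY_VARS.get(provider)
    # Assumed precedence: generic key wins, then the provider's own key.
    api_key = env.get("LLM_API_KEY") or (env.get(key_var) if key_var else None)
    base_url = env.get("OPENAI_API_BASE") or None
    return LLMSettings(model=model, api_key=api_key, base_url=base_url)


settings = resolve_llm_settings({"OPENAI_API_KEY": "sk-example"})
# Hand-off (not executed here; assumes dspy is installed and that extra
# keyword arguments are forwarded to LiteLLM):
#   import dspy
#   dspy.configure(lm=dspy.LM(settings.model, api_key=settings.api_key,
#                             api_base=settings.base_url))
```

For a local provider such as Ollama, no key variable applies in this sketch, so `api_key` resolves to `None` and only the model string and optional base URL are passed on.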

…ports Co-authored-by: Cursor <cursoragent@cursor.com>

+                    tools=tools,
+                    max_iterations=budget.max_rlm_iterations,
+                )
+                result = rlm(


Cover the deterministic/analytical helpers (no GPU required) for the tuning agent (workload resolver, legality, history) and both projection paths (config guards, memory extrapolation, performance placement/timing helpers). Co-authored-by: Cursor <cursoragent@cursor.com>

+from primus.core.projection.performance_projection import (  # noqa: E402
+    projection as proj,
+)


araina-amd and others added 5 commits June 6, 2026 00:23

projection(memory): bench-anchored memory mode + extrapolation/reports

a9c7687


agents: add tuning agent + tests

708af34


projection: supporting backend/config/doc cleanups

6e5c48a


araina-amd requested review from Xiaoming-AMD, limou102 and wenxie-amd as code owners June 8, 2026 16:48

github-code-quality Bot found potential problems Jun 8, 2026


fix: resolve remaining CodeQL unused-import and empty-except findings

c19890e


github-code-quality Bot found potential problems Jun 8, 2026


style: apply pre-commit isort/autoflake/black formatting

1f7d66f


github-code-quality Bot found potential problems Jun 8, 2026


araina-amd and others added 3 commits June 10, 2026 15:52

style: satisfy pre-commit isort blank-line spacing in primus_turbo im…

615cc69


github-code-quality Bot found potential problems Jun 10, 2026


Comment thread primus/agents/tuning_agent/agent.py


github-code-quality Bot found potential problems Jun 10, 2026


Comment thread tests/unit_tests/core/projection/test_performance_projection.py

Comment on lines +36 to +38


Conversation

araina-amd commented Jun 8, 2026



2 participants