Tuning agent, memory based benchmarking support plus fixes #755

Open
araina-amd wants to merge 13 commits into main from araina/dev/tuning_agent_plus_bench_memory_plus_fixes

@araina-amd

Collaborator

Adds an LLM-driven tuning agent for parallelism/config search and substantially
improves the performance & memory projection stack (accuracy, collective/comm
modeling, overlap models, bench-anchored memory). Also includes supporting backend/
config cleanups and a turbo-patch crash fix.

araina-amd and others added 5 commits June 6, 2026 00:23
- Fix projection accuracy; rename Megatron ILP -> SeaAILab ILP; add scheduler comparison
- Collective model: additive A2A correction, proportional num_experts reduction, A2A
  mesh/remote contention derates, P2P-based inter-node BW, hierarchical AllReduce with
  pipelining + RCCL overhead + NIC RDMA warmup, and P2P/PP-aware SendRecv
- FSDP per-layer compute/comm overlap model; loss-fusion + SyncFree overlap; hybrid
  sourcing with auto bg=1 compute baseline; auto-disable turbo-deepep when TP*EP=1
- Config validation; training_config + projection CLI updates
- Add bench-anchored memory mode with shared bench artifact
- memory_capture, benchmark, extrapolation, and reports modules
- Rename memory_projection/projection.py -> simulate.py; enable layer enumeration
- primus/agents/tuning_agent: agent, evaluator, plan, legality, scratchpad, tools,
  workload, history, plotting, config, cli
- unit tests: evaluator benchmark + recompute schedule legality
- megatron TE patches: deep-probe primus_turbo before applying TE patches
- torchtitan qwen3 model-config tweaks; bump third_party/torchtitan pointer
- drop torchtitan upstream backend-gap report/summary docs
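
For readers new to the collective-model items above, a textbook alpha-beta estimate of a hierarchical AllReduce can be sketched as follows. This is not the PR's model: it leaves out pipelining, RCCL overhead, NIC RDMA warmup, and the contention derates, and every function name and parameter here is an assumption made for illustration.

```python
def ring_allreduce_time(nbytes, ranks, bw_bytes_per_s, latency_s=0.0):
    """Textbook ring AllReduce: 2*(p-1) steps, each moving nbytes/p."""
    if ranks <= 1:
        return 0.0
    steps = 2 * (ranks - 1)
    return steps * (latency_s + (nbytes / ranks) / bw_bytes_per_s)


def hierarchical_allreduce_time(nbytes, gpus_per_node, nodes,
                                intra_bw, inter_bw,
                                intra_lat=0.0, inter_lat=0.0):
    """Intra-node reduce-scatter, inter-node AllReduce, intra-node all-gather."""
    g = gpus_per_node
    # Reduce-scatter and all-gather each take (g-1) ring steps of nbytes/g.
    if g > 1:
        intra_phase = (g - 1) * (intra_lat + (nbytes / g) / intra_bw)
    else:
        intra_phase = 0.0
    # After the reduce-scatter each GPU owns nbytes/g and all-reduces that
    # shard with its counterparts on the other nodes.
    inter_phase = ring_allreduce_time(nbytes / g, nodes, inter_bw, inter_lat)
    return 2 * intra_phase + inter_phase
```

With a single node the hierarchical estimate collapses to the plain ring estimate, which is a quick sanity check on any refinement layered on top.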
…c provider patch

The squash/refactor dropped the _use_legacy_grouped_gemm helper while leaving
two call sites, raising NameError whenever the te_spec_provider turbo patch
activates. Define the helper and remove the now-unused turbo-enable probes.

Co-authored-by: Cursor <cursoragent@cursor.com>
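
The failure mode in this commit is easy to reproduce in isolation: Python resolves a function name only when the call executes, so a module with a dangling call site imports cleanly and fails later. A minimal sketch, in which the call sites, the config key, and the helper body are all hypothetical stand-ins for the real `_use_legacy_grouped_gemm` logic:

```python
def select_gemm_broken(config):
    # The helper called here is never defined, mirroring the dropped helper:
    # the module still imports, and the error appears only at call time.
    if _missing_helper(config):  # noqa: F821
        return "legacy"
    return "turbo"


def _use_legacy_grouped_gemm(config):
    # Hypothetical body; the real condition lives in the PR.
    return bool(config.get("use_legacy_grouped_gemm", False))


def select_gemm_fixed(config):
    # Same call site, now backed by a defined helper.
    return "legacy" if _use_legacy_grouped_gemm(config) else "turbo"


try:
    select_gemm_broken({})
except NameError as exc:
    failure = str(exc)
```

Because the error surfaces only when the patched path activates, a unit test that exercises each call site is the usual guard against this kind of refactor regression.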
Restore docs/backend-gap/reports/torchtitan/upstream-main and the
primus/configs/models/torchtitan qwen3 configs to match main; these were
unintended local-dev changes that should not be part of this branch.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace silent pass-only handlers with explanatory comments, explicit
fallbacks, debug logging, or lightweight session-log error events so
best-effort paths remain non-fatal but diagnosable.

Co-authored-by: Cursor <cursoragent@cursor.com>
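
The pattern this commit describes can be sketched as follows: a best-effort path stays non-fatal, but the handler names the exceptions it expects, logs the cause, and returns an explicit fallback instead of a bare `pass`. The function and logger names are hypothetical and not taken from the PR.

```python
import logging

logger = logging.getLogger("tuning_agent.sketch")


def read_cached_metric(cache, key, default=None):
    """Best-effort lookup: never raises, but records why it fell back."""
    try:
        return float(cache[key])
    except (KeyError, TypeError, ValueError) as exc:
        # A silent `pass` here would hide the cause; the fallback is now
        # explicit and the reason is visible at debug level.
        logger.debug("metric %r unavailable (%s); using %r", key, exc, default)
        return default
```

Catching a narrow tuple of exceptions rather than `Exception` keeps genuinely unexpected errors loud while the anticipated ones stay diagnosable.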
Remove unused imports in the tuning agent and tests, and make routing-patch
restore cleanup in utils.py non-fatal but visible via a warning message.

Co-authored-by: Cursor <cursoragent@cursor.com>
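
A non-fatal but visible restore step, as described for the routing-patch cleanup, might look like the sketch below. The context manager and its `restore` hook are hypothetical; the actual code in utils.py may be structured differently.

```python
import contextlib
import warnings


@contextlib.contextmanager
def patched_attr(obj, name, replacement, restore=setattr):
    """Temporarily replace obj.name; a failed restore warns instead of raising."""
    original = getattr(obj, name)
    setattr(obj, name, replacement)
    try:
        yield
    finally:
        try:
            restore(obj, name, original)
        except Exception as exc:  # cleanup is best-effort by design
            warnings.warn(f"could not restore {name!r}: {exc}", RuntimeWarning)
```

The `restore` parameter exists only so the failure path can be exercised in a test; in normal use the default `setattr` puts the original attribute back.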
# Per-trial YAML generation
# ---------------------------------------------------------------------------

_PARALLEL_OVERRIDES_KEYS = {
Run pre-commit hooks across the branch so CI formatting checks pass.
Also restore the general_gemm workspace patch import logic that autoflake
had incorrectly stripped.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment on lines +42 to +49
from primus.agents.tuning_agent.evaluator import (  # noqa: E402
    EvalResult,
    Evaluator,
    _build_env,
    _build_memory_cmd,
    _build_perf_cmd,
    _parse_metrics,
)
Comment thread primus/agents/tuning_agent/agent.py Fixed
f"(bench/target)"
)
else:
loaded_payload = None
loaded_metadata = {}
araina-amd and others added 3 commits June 10, 2026 15:52
Enables the squash-branch projection benchmarks (gpt_oss, qwen, mixtral) to
run on the rocm/primus v26.2/v26.3 containers, plus small tuning-agent fixes.

- primus_turbo: guard QuantizedTensor/QuantizedTensorPair imports and alias
  ScalingRecipe -> MXScalingRecipe so the wrapper imports against primus_turbo
  0.2.0 (shipped in v26.2/v26.3, predating PR #735). These symbols are only
  used on FP8/FP4 weight-quant paths, so BF16 turbo attention/DeepEP now work.
- trainer: adapt the track_config_flags call to the pinned Megatron-LM
  signature (accepts 6 vs 8 positional args across commits).
- tuning_agent/evaluator: pass --memory-mode explicitly so the simulate/
  memory-only path does not fall back to the CLI's benchmark default.
- tuning_agent/workload: coerce ${VAR:default}-templated virtual-pipeline
  values to int (or None) before use.
- examples/agents: add mi355x 4-node target cluster config.

Co-authored-by: Cursor <cursoragent@cursor.com>
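
The workload fix in this commit, coercing `${VAR:default}`-templated values to int or None, can be sketched as below. The resolution rules are assumptions rather than the PR's actual behaviour: the environment wins over the template default, and empty, `null`, or unresolved values become None. `coerce_optional_int` and the `VPP` variable are hypothetical names.

```python
import os
import re

# Matches a whole-string template such as ${VPP:2} or ${VPP}.
_TEMPLATE = re.compile(
    r"^\$\{(?P<var>[A-Za-z_][A-Za-z0-9_]*)(?::(?P<default>[^}]*))?\}$"
)


def coerce_optional_int(value, env=None):
    """Return an int, or None when the value is empty, null, or unresolved."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        match = _TEMPLATE.match(value.strip())
        if match:
            # Environment value first, then the template's own default.
            value = env.get(match.group("var"), match.group("default"))
    if value is None:
        return None
    text = str(value).strip()
    if text == "" or text.lower() in {"null", "none"}:
        return None
    return int(text)
```

For example, `coerce_optional_int("${VPP:2}", env={})` yields `2`, while `coerce_optional_int("${VPP:null}", env={})` yields `None`, so downstream arithmetic never sees the raw template string.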
The agent previously required a running LiteLLM sidecar process and
AMD-internal gateway routing (LITELLM_BASE_URL, OCP_APIM_SUBSCRIPTION_KEY,
_amd_onprem_llm_headers). DSPy 3.x bundles LiteLLM natively, so no proxy
is needed for any provider.

- agent.py: remove _amd_onprem_llm_headers and AMD-gateway routing from
  configure_dspy(); pass model/api_key/base_url directly to dspy.LM() so
  any LiteLLM-supported provider works out of the box
- config.py: replace LITELLM_* env vars with standard provider vars
  (OPENAI_API_KEY, ANTHROPIC_API_KEY, LLM_API_KEY, OPENAI_API_BASE,
  LLM_MODEL); remove AMD-private .env search path; drop provider field
  from LLMConfig; change DEFAULT_MODEL to openai/gpt-4o
- requirements.txt: drop litellm as a direct dep (dspy brings it)
- README.md: replace proxy setup with generic dspy/litellm instructions
  covering OpenAI, Anthropic, Ollama, and custom OpenAI-compatible endpoints

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
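
A minimal sketch of resolving the provider variables listed in this commit into one settings object. The precedence (generic `LLM_API_KEY` first, then the provider-specific key chosen from the model prefix) and the `LLMSettings` / `load_llm_settings` names are assumptions; the PR's config.py may resolve them differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_MODEL = "openai/gpt-4o"  # default named in the commit message


@dataclass(frozen=True)
class LLMSettings:
    model: str
    api_key: Optional[str]
    base_url: Optional[str]


def load_llm_settings(env: Optional[Mapping[str, str]] = None) -> LLMSettings:
    env = os.environ if env is None else env
    model = env.get("LLM_MODEL", DEFAULT_MODEL)
    # Assumed precedence: generic key first, then the provider-specific key.
    if model.startswith("anthropic/"):
        provider_key = "ANTHROPIC_API_KEY"
    else:
        provider_key = "OPENAI_API_KEY"
    api_key = env.get("LLM_API_KEY") or env.get(provider_key)
    return LLMSettings(model=model, api_key=api_key,
                       base_url=env.get("OPENAI_API_BASE"))


# With DSPy installed, the settings would be handed over roughly as:
#   lm = dspy.LM(settings.model, api_key=settings.api_key,
#                api_base=settings.base_url)
#   dspy.configure(lm=lm)
```

Passing the environment as a mapping keeps the resolver testable without touching the real process environment.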
…ports

Co-authored-by: Cursor <cursoragent@cursor.com>
tools=tools,
max_iterations=budget.max_rlm_iterations,
)
result = rlm(
Cover the deterministic/analytical helpers (no GPU required) for the tuning
agent (workload resolver, legality, history) and both projection paths
(config guards, memory extrapolation, performance placement/timing helpers).

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment on lines +36 to +38
from primus.core.projection.performance_projection import (  # noqa: E402
    projection as proj,
)