
Playbook rc1 #198

Open
rkalaniNV wants to merge 59 commits into NVIDIA-NeMo:romeyn/agentic from rkalaniNV:playbook_rc1

Conversation

@rkalaniNV

@rkalaniNV rkalaniNV commented May 11, 2026

This PR adds support for customization workflows and Build-Your-Own-Benchmark (BYOB) to the Nemotron repo. Agentic runs are supported across the customization stages: data curation, model training, and evaluation.

hvnguyenNV and others added 30 commits May 11, 2026 09:58
- Add deploy-scoped airgap tooling for Nemotron Customizer steps under
  src/nemotron/steps.
- Build a portable submitter image plus deduplicated task images for selected
  workflow targets.
- Expand step dependencies and map selected steps to task image families through
  a single airgap.yaml.
- Discover small task-image Python dependency gaps and bake pinned repo overlays
  required by step configs.
- Keep models, datasets, checkpoints, and customer data in external persistent
  storage managed by the user.
- Add resumable build state, image manifests with checksums, Dockerfiles, SFT
  overlay configs, README guidance, and focused tests.

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
- Rename airgap artifacts to use launcher and execution image terminology
- Update runner stages, manifests, README, and config keys to match the new naming
- Keep execution image generation scoped to selected Nemotron Customizer steps
- Preserve external handling for models, datasets, checkpoints, and customer storage paths
- Refresh SFT Megatron Bridge airgap overlay configs
- Update tests for launcher/execution image behavior and staged runner flow

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
- Install git and CA certificates in the launcher image before uv sync
- Capture only docker inspect stdout while suppressing stderr during platform checks
- Keep the airgap runner platform probe compatible with subprocess stderr handling

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
- Move generic step commands and backends from commands/step to commands/steps
- Register only `nemotron steps`; remove the singular `nemotron step` alias
- Expose `steps list`, `steps show`, `steps run`, and `steps translation`
- Update imports, tests, docs, skills, and config examples to the plural CLI
- Add coverage for plural command registration and singular alias rejection

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
Add airgap packaging for Nemotron Customizer
… across documentation and codebase, reflecting the removal of the planned GRPO step. Clean up related files and ensure consistency in the Nano3 recipes and references, and synth cleanup

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
…eration for training, post-training, and training-data preparation workflows. Remove non-training workflow guidance while preserving SFT, PEFT/LoRA, RL alignment, continued pretraining, and post-training coverage.
Update RL step references from `rl/nemo_rl_grpo` to `rl/nemo_rl/rlvr` and fix the synth rename
- Add BYOB `stage=all` dispatch and config-driven family/stage defaults
- Mount pinned Curator at `/opt/Curator` for BYOB and translate configs
- Keep BYOB tiny smoke self-contained and avoid heavy semantic imports unless enabled
- Simplify curate step for JSONL smoke runs with optional filters and Ray CPU env override
- Add curate tiny config/data and refresh curate docs/metadata
- Add Lepton/Slurm env profiles for BYOB, translate, curate, and SDG Data Designer
- Prefer `tiny` configs in generated env examples where available
- Fix env TOML rendering for empty lists

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
- Refactor step IDs, docs, configs, tests, and downstream references to use data_prep consistently across the Nemotron step library.
- Addressed review comments
- Removed benchmark folder

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
Add lightweight Curator-backed step profiles and BYOB all-stage lepton support changes
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
Update and expand env TOML creation to cover all steps
Refocus Nemotron customize skill on repo-native training configs
Signed-off-by: rkalani <rkalani@nvidia.com>
rkalaniNV and others added 29 commits May 13, 2026 16:10
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
pass generation for faith config in translation
Every step now lives at `<category>/<implementation>/step.toml` and carries a
two-segment `<category>/<implementation>` id. Two outliers were fixed:

- `byob` -> `byob/mcq` (was a flat manifest directly under `byob/`, breaking
  the convention and leaving no room for future families). Only the manifest,
  entry script, and `config/` move into `byob/mcq/`; the `nemotron.steps.byob.*`
  Python package keeps its current import paths.
- `translate/translation` -> `translate/curator`, naming the step after its
  backing engine to match `curate/nemo_curator` and to leave room for
  `translate/nmt` / `translate/google` / `translate/aws` siblings.

Also:

- Add the missing `env` category title in `CATEGORY_TITLES` (and drop the
  unused `benchmark` title that had no folder).
- Regenerate `src/nemotron/steps/STEPS.md`.
- Register one-release legacy aliases (`byob`, `translate/translation`) in
  `cli/commands/steps/_resolve.py` with a deprecation note, and start
  suggesting close matches on unknown ids.
- Update every doc, skill pack, pattern manifest, tier-2 plan-graph case, and
  use-case example that referenced the old step ids or config paths.
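
The layout rule above can be sketched as a small validator. This is a minimal illustration, not the repo's code: `STEP_ID_RE`, `is_valid_step_id`, and `manifest_path` are hypothetical names, the allowed character set is an assumption, and the check follows the strict two-segment rule exactly as this commit message states it.

```python
import re
from pathlib import PurePosixPath

# Hypothetical pattern: exactly two segments, `<category>/<implementation>`.
# The allowed characters (lowercase, digits, underscore) are an assumption.
STEP_ID_RE = re.compile(r"[a-z0-9_]+/[a-z0-9_]+")


def is_valid_step_id(step_id: str) -> bool:
    """Return True when the id has the two-segment shape described above."""
    return STEP_ID_RE.fullmatch(step_id) is not None


def manifest_path(step_id: str, root: str = "src/nemotron/steps") -> PurePosixPath:
    """Map a step id to `<root>/<category>/<implementation>/step.toml`."""
    if not is_valid_step_id(step_id):
        raise ValueError(f"expected <category>/<implementation>, got {step_id!r}")
    return PurePosixPath(root) / step_id / "step.toml"
```

Under these assumptions, `manifest_path("byob/mcq")` points at `src/nemotron/steps/byob/mcq/step.toml`, while the old flat id `byob` is rejected.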
Both commands re-implemented config loading on top of the generic step
dispatcher, forced ``mode == "local"``, rejected passthrough, and skipped the
executor/backend selection that every other step uses. They existed solely
because the underlying steps (`translate/translation`, `byob`) had an irregular
layout, which the previous commit fixed.

Now every step is run through the same surface:

  nemotron steps run byob/mcq       -c default -o stage=generate -o family=mcq
  nemotron steps run translate/curator -c default -o input_path=...

This removes ~150 lines of duplicate config plumbing, restores one CLI
contract for users and agents, and makes batch executors (Lepton, Slurm,
DGXCloud) immediately available for translation and BYOB. The BYOB step's
``--list-families`` is still reachable through
``python -m nemotron.steps.byob.scripts.run --list-families`` and the choices
are listed in ``nemotron steps show byob/mcq``.

Updates docs, SKILL.md files, the use-case BYOB notebook, the QA scope
checklist, and the translation CLI tests to assert the bespoke commands stay
removed.
Keep the override flow as bare key=value positionals — no -o flag. Verbose
"-o key=value -o key=value" forms inflate every translation/BYOB example without
adding meaningful safety over what split_unknown_args already does, so this
commit drops the experimental -o/--override flag and rewrites the docs back to
the concise positional form everyone was already using.

Net additions on the catalog side:

- `nemotron steps list --tag <tag>` filters by manifest tag, which makes the
  byob/translation overlap discoverable (`--tag translation`, `--tag mcq`, …).
- `nemotron steps list --tree` groups discovered steps under their category
  title (driven by CATEGORY_TITLES) for a human-friendly view.
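
A rough sketch of what tag filtering and category grouping could look like. The catalog contents, tags, and titles below are made up for illustration, and `filter_by_tag` and `group_by_category` are hypothetical helpers, not the functions behind `nemotron steps list`.

```python
from collections import defaultdict

# Illustrative data only; the real catalog is read from step.toml manifests.
STEPS = {
    "byob/mcq": {"translation", "mcq"},
    "translate/curator": {"translation"},
    "curate/nemo_curator": {"curation"},
}
# Hypothetical titles; the real mapping lives in CATEGORY_TITLES.
CATEGORY_TITLES = {
    "byob": "Build Your Own Benchmark",
    "translate": "Translation",
    "curate": "Curation",
}


def filter_by_tag(steps: dict, tag: str) -> list:
    """Return the sorted step ids whose manifest tags include `tag`."""
    return sorted(step_id for step_id, tags in steps.items() if tag in tags)


def group_by_category(step_ids, titles: dict) -> dict:
    """Group step ids under their category title, falling back to the raw category."""
    tree = defaultdict(list)
    for step_id in sorted(step_ids):
        category = step_id.split("/", 1)[0]
        tree[titles.get(category, category)].append(step_id)
    return dict(tree)
```

With this sample data, filtering on `translation` returns both `byob/mcq` and `translate/curator`, which is the overlap the `--tag` option is meant to surface.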

No change to `steps run`'s public surface beyond what was already there: bare
`key=value` positionals at the end of the command remain the override path.
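
For illustration, bare `key=value` positionals can be turned into an override mapping along these lines. `parse_overrides` is a hypothetical helper written for this note; it is not `split_unknown_args`, whose signature is not shown in this PR.

```python
from __future__ import annotations


def parse_overrides(args: list[str]) -> dict[str, str]:
    """Parse trailing `key=value` positionals into a dict of overrides."""
    overrides: dict[str, str] = {}
    for arg in args:
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {arg!r}")
        overrides[key] = value
    return overrides
```

Under this sketch, `parse_overrides(["stage=generate", "family=mcq"])` yields `{"stage": "generate", "family": "mcq"}`, and an argument without `=` raises a `ValueError`.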
Removes the temporary `byob` and `translate/translation` -> canonical-id
redirects that the layout-normalisation commit had kept around for one
release. With the rest of the repo (docs, skills, patterns, tier-2 plan-graph
cases, use-case notebook, QA scope) already updated to the new ids in the
earlier commits, the redirects only delayed the cutover and gave agents two
valid spellings for the same step.

Now:

- `nemotron steps run byob` -> exit 1 with `Did you mean: byob/mcq?`.
- `nemotron steps run translate/translation` -> exit 1 with
  `Did you mean: translate/curator?`.

The fuzzy-match suggestion in `_resolve.py` still points users at the right
id, so the legacy ids degrade gracefully instead of silently routing into a
deprecated path.
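
A minimal sketch of close-match suggestions on unknown ids, assuming the standard library's `difflib` is an acceptable stand-in. The actual matching logic in `_resolve.py` is not shown in this PR, and `KNOWN_STEP_IDS`, `suggest`, and `resolve` are hypothetical names.

```python
import difflib

# Illustrative subset of canonical ids mentioned in this PR.
KNOWN_STEP_IDS = ["byob/mcq", "translate/curator", "curate/nemo_curator"]


def suggest(step_id, known=KNOWN_STEP_IDS):
    """Return the closest known id, or None when nothing is similar enough."""
    matches = difflib.get_close_matches(step_id, known, n=1, cutoff=0.6)
    return matches[0] if matches else None


def resolve(step_id, known=KNOWN_STEP_IDS):
    """Return a known id unchanged; otherwise exit with a suggestion."""
    if step_id in known:
        return step_id
    message = f"Unknown step id: {step_id}"
    hint = suggest(step_id, known)
    if hint:
        message += f"\nDid you mean: {hint}?"
    # SystemExit with a string message exits with status 1 when uncaught.
    raise SystemExit(message)
```

With this sample list, `suggest("byob")` returns `byob/mcq` and `suggest("translate/translation")` returns `translate/curator`.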

Tests swap the two `*_resolves_legacy_*_alias` cases for explicit rejection
cases and drop a stray `(legacy alias: byob)` mention from the
byob-benchmark-curator-translation context pack.
- Generate BYOB/translate/curate runtime requirements from pyproject/uv.lock at submission time.
- Pass runtime metadata to remote jobs via env payloads instead of committed requirement files.
- Run Curator-backed profiles through a shared curator_runtime entrypoint.
- Update Lepton, Slurm, and DGX Cloud env templates for BYOB, translate, and curate.
- Fix Slurm Curator jobs to use CodePackager and set PYTHONPATH for remote imports.
- Add focused tests for runtime payloads, preflight behavior, BYOB config, and Slurm run_command handling.

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
- Corrected references from `curator` to `nemo_curator` in the documentation.
- Ensured consistency in the workflow instructions and load more section for clarity.

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
…zations

- Expanded the list of excluded suffixes in data mover to include common data artifacts.
- Introduced functions for tarball size warnings and formatted byte output.
- Updated execution scripts to generate unique cloud config paths based on content digest.
- Refactored symlink command for improved clarity and functionality.
- Adjusted tests to validate new features and ensure proper functionality.

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
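
The byte formatting, size warning, and digest-based config paths mentioned in the commit above could look roughly like this. It is a sketch under assumptions: the function names, the 500 MiB threshold, the `cloud_config` prefix, and the 12-character digest length are all hypothetical and are not taken from the repo.

```python
import hashlib


def format_bytes(num_bytes: int) -> str:
    """Render a byte count with binary units, e.g. 1536 -> '1.5 KiB'."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            break
        size /= 1024
    return f"{size:.0f} {unit}" if unit == "B" else f"{size:.1f} {unit}"


def tarball_warning(num_bytes: int, limit: int = 500 * 1024**2):
    """Return a warning string when the tarball exceeds `limit` (assumed value)."""
    if num_bytes <= limit:
        return None
    return f"tarball is {format_bytes(num_bytes)}, above {format_bytes(limit)}"


def config_path_for(content: bytes, prefix: str = "cloud_config") -> str:
    """Derive a file name from the content digest so identical configs share a path."""
    digest = hashlib.sha256(content).hexdigest()[:12]
    return f"{prefix}_{digest}.yaml"
```

Keying the path on a content digest means two submissions with the same config reuse one file, while any change to the config produces a new path.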
- Changed default configuration file from `nano3.yaml` to `default.yaml` in both the step implementation and documentation.
- Updated recovery instructions in `step.toml` to reflect new dataset path specifications.
- Removed obsolete `nano3.yaml` configuration file.

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
CLI Fixes
- Add shared convert runner for HF->Megatron, Megatron->HF, and LoRA merge
- Add default configs and metadata for convert steps
- Add convert profiles to Lepton, Slurm, and DGX Cloud env configs
- Add focused convert runner tests

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
- Updated the conversion logic to prefer `torch_dtype` over the deprecated `dtype` alias.
- Modified related documentation and configuration files to reflect the change from `dtype` to `torch_dtype`.
- Added tests to ensure the new preference is correctly implemented and deprecated alias is handled.

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
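
As an illustration of the preference described in the commit above, a config reader might resolve the two keys as follows. `resolve_torch_dtype` and the `bfloat16` default are assumptions made for this sketch, not the convert runner's actual code.

```python
import warnings


def resolve_torch_dtype(config: dict, default: str = "bfloat16") -> str:
    """Prefer `torch_dtype`; accept the deprecated `dtype` alias with a warning."""
    if "torch_dtype" in config:
        return config["torch_dtype"]
    if "dtype" in config:
        warnings.warn(
            "`dtype` is deprecated; use `torch_dtype` instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return config["dtype"]
    # Hypothetical fallback when neither key is set.
    return default
```

When both keys are present, `torch_dtype` wins silently; a config that only sets `dtype` still works but emits a `DeprecationWarning`.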
Add HF/Megatron conversion and LoRA merge steps
Signed-off-by: rkalani <rkalani@nvidia.com>
Resolve high severity dependency CVEs
@rkalaniNV rkalaniNV marked this pull request as ready for review May 21, 2026 09:33

4 participants