Playbook rc1 by rkalaniNV · Pull Request #198 · NVIDIA-NeMo/Nemotron

rkalaniNV · 2026-05-11T14:38:55Z

This PR adds support for customization workflows and Build-your-own-benchmark (BYOB) to the Nemotron repo. Agentic runs are supported for each stage of customization, from data curation through model training to evaluation.

- Add deploy-scoped airgap tooling for Nemotron Customizer steps under src/nemotron/steps.
- Build a portable submitter image plus deduplicated task images for selected workflow targets.
- Expand step dependencies and map selected steps to task image families through a single airgap.yaml.
- Discover small task-image Python dependency gaps and bake pinned repo overlays required by step configs.
- Keep models, datasets, checkpoints, and customer data in external persistent storage managed by the user.
- Add resumable build state, image manifests with checksums, Dockerfiles, SFT overlay configs, README guidance, and focused tests.

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
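The image manifests with checksums mentioned in the list above could look roughly like the following. This is a minimal sketch under stated assumptions, not the PR's implementation: the function names and the manifest keys (`file`, `sha256`, `bytes`) are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large image tarballs are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(image_files: list[Path]) -> dict:
    """Return a manifest with one entry per image file (hypothetical schema)."""
    entries = [
        {"file": path.name, "sha256": sha256_file(path), "bytes": path.stat().st_size}
        for path in sorted(image_files)
    ]
    return {"images": entries}
```

On the air-gapped side, recomputing `sha256_file` for each transferred file and comparing it with the manifest entry would detect a truncated or corrupted copy before any image is loaded.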

- Rename airgap artifacts to use launcher and execution image terminology
- Update runner stages, manifests, README, and config keys to match the new naming
- Keep execution image generation scoped to selected Nemotron Customizer steps
- Preserve external handling for models, datasets, checkpoints, and customer storage paths
- Refresh SFT Megatron Bridge airgap overlay configs
- Update tests for launcher/execution image behavior and staged runner flow

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>

- Install git and CA certificates in the launcher image before uv sync
- Capture only docker inspect stdout while suppressing stderr during platform checks
- Keep the airgap runner platform probe compatible with subprocess stderr handling

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
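The stdout-only capture described above can be sketched with the standard `subprocess` module. This is an illustration rather than the runner's actual code, and the helper name `probe_stdout` is hypothetical. One relevant detail of `subprocess.run` is that `capture_output=True` cannot be combined with an explicit `stdout=` or `stderr=` argument (it raises `ValueError`), so the sketch sets both streams explicitly.

```python
from __future__ import annotations

import subprocess


def probe_stdout(cmd: list[str]) -> str | None:
    """Run cmd and return its stripped stdout; stderr is discarded.

    Returns None when the binary is missing or the command exits non-zero.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,      # keep only stdout
            stderr=subprocess.DEVNULL,   # drop warnings and daemon noise
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()
```

A platform check might then call something like `probe_stdout(["docker", "image", "inspect", "--format", "{{.Os}}/{{.Architecture}}", image_name])`, where the exact format string the runner uses is an assumption here.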

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>

- Move generic step commands and backends from commands/step to commands/steps
- Register only `nemotron steps`; remove the singular `nemotron step` alias
- Expose `steps list`, `steps show`, `steps run`, and `steps translation`
- Update imports, tests, docs, skills, and config examples to the plural CLI
- Add coverage for plural command registration and singular alias rejection

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>

environment bugs resolve

Add airgap packaging for Nemotron Customizer

SDG data designer providers addition

… across documentation and codebase, reflecting the removal of the planned GRPO step. Clean up related files and ensure consistency in the Nano3 recipes and references, and synth cleanup Signed-off-by: Rakesh Paul <rapaul@nvidia.com>

…_skill_validation

…eration for training, post-training, and training-data preparation workflows. Remove non-training workflow guidance while preserving SFT, PEFT/LoRA, RL alignment, continued pretraining, and post-training coverage.

Update RL step references from `rl/nemo_rl_grpo` to `rl/nemo_rl/rlvr` and fix the synth rename

- Add BYOB `stage=all` dispatch and config-driven family/stage defaults
- Mount pinned Curator at `/opt/Curator` for BYOB and translate configs
- Keep BYOB tiny smoke self-contained and avoid heavy semantic imports unless enabled
- Simplify curate step for JSONL smoke runs with optional filters and Ray CPU env override
- Add curate tiny config/data and refresh curate docs/metadata
- Add Lepton/Slurm env profiles for BYOB, translate, curate, and SDG Data Designer
- Prefer `tiny` configs in generated env examples where available
- Fix env TOML rendering for empty lists

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>

- Refactor step IDs, docs, configs, tests, and downstream references to use data_prep consistently across the Nemotron step library.
- Addressed review comments
- Removed benchmark folder

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>

Add lightweight Curator-backed step profiles and BYOB all-stage lepton support changes

Signed-off-by: rkalani <rkalani@nvidia.com>

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>

Update and expand the env toml creation to all the steps

Refocus Nemotron customize skill on repo-native training configs

Signed-off-by: rkalani <rkalani@nvidia.com>

Refactor skills

pass generation for faith config in translation

Every step now lives at `<category>/<implementation>/step.toml` and carries a two-segment `<category>/<implementation>` id. Two outliers were fixed:

- `byob` -> `byob/mcq` (was a flat manifest directly under `byob/`, breaking the convention and leaving no room for future families). Only the manifest, entry script, and `config/` move into `byob/mcq/`; the `nemotron.steps.byob.*` Python package keeps its current import paths.
- `translate/translation` -> `translate/curator`, naming the step after its backing engine to match `curate/nemo_curator` and to leave room for `translate/nmt` / `translate/google` / `translate/aws` siblings.

Also:

- Add the missing `env` category title in `CATEGORY_TITLES` (and drop the unused `benchmark` title that had no folder).
- Regenerate `src/nemotron/steps/STEPS.md`.
- Register one-release legacy aliases (`byob`, `translate/translation`) in `cli/commands/steps/_resolve.py` with a deprecation note, and start suggesting close matches on unknown ids.
- Update every doc, skill pack, pattern manifest, tier-2 plan-graph case, and use-case example that referenced the old step ids or config paths.
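The layout convention above means a step id can be derived from the manifest path alone. The following is a minimal sketch of that idea, assuming discovery is a plain directory walk; the function name `discover_step_ids` is hypothetical, and the repo's real discovery code may instead read the id from inside `step.toml`.

```python
from __future__ import annotations

from pathlib import Path


def discover_step_ids(steps_root: Path) -> list[str]:
    """Return sorted two-segment ids for every <category>/<implementation>/step.toml."""
    ids = []
    # Exactly two directory levels: a flat <category>/step.toml is not matched.
    for manifest in steps_root.glob("*/*/step.toml"):
        implementation = manifest.parent
        category = implementation.parent
        ids.append(f"{category.name}/{implementation.name}")
    return sorted(ids)
```

With this rule, a flat manifest such as the old `byob/step.toml` would simply not be discovered, which is one way to see why the two outliers had to move.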

Both commands re-implemented config loading on top of the generic step dispatcher, forced ``mode == "local"``, rejected passthrough, and skipped the executor/backend selection that every other step uses. They existed solely because the underlying steps (`translate/translation`, `byob`) had an irregular layout, which the previous commit fixed.

Now every step is run through the same surface:

    nemotron steps run byob/mcq -c default -o stage=generate -o family=mcq
    nemotron steps run translate/curator -c default -o input_path=...

This removes ~150 lines of duplicate config plumbing, restores one CLI contract for users and agents, and makes batch executors (Lepton, Slurm, DGXCloud) immediately available for translation and BYOB.

The BYOB step's ``--list-families`` is still reachable through ``python -m nemotron.steps.byob.scripts.run --list-families`` and the choices are listed in ``nemotron steps show byob/mcq``.

Updates docs, SKILL.md files, the use-case BYOB notebook, the QA scope checklist, and the translation CLI tests to assert the bespoke commands stay removed.

Keep the override flow as bare key=value positionals, with no -o flag. Verbose "-o key=value -o key=value" forms inflate every translation/BYOB example without adding meaningful safety over what split_unknown_args already does, so this commit drops the experimental -o/--override flag and rewrites the docs back to the concise positional form everyone was already using.

Net additions on the catalog side:

- `nemotron steps list --tag <tag>` filters by manifest tag, which makes the byob/translation overlap discoverable (`--tag translation`, `--tag mcq`, …).
- `nemotron steps list --tree` groups discovered steps under their category title (driven by CATEGORY_TITLES) for a human-friendly view.

No change to `steps run`'s public surface beyond what was already there: bare `key=value` positionals at the end of the command remain the override path.
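To make the positional override form concrete, here is a small sketch of how trailing `key=value` tokens could be separated from other arguments. It does not reproduce `split_unknown_args`, whose behavior is not shown in this PR text; the function `split_overrides` and its rules (split on the first `=`, treat tokens starting with `-` as passthrough) are assumptions for illustration.

```python
from __future__ import annotations


def split_overrides(args: list[str]) -> tuple[dict[str, str], list[str]]:
    """Separate bare key=value positionals from everything else."""
    overrides: dict[str, str] = {}
    passthrough: list[str] = []
    for arg in args:
        # partition splits on the first "=", so values may themselves contain "=".
        key, sep, value = arg.partition("=")
        if sep and key and not key.startswith("-"):
            overrides[key] = value
        else:
            passthrough.append(arg)
    return overrides, passthrough
```

Under these assumptions, the arguments after `nemotron steps run byob/mcq -c default` in a call ending with `stage=generate family=mcq` would yield the overrides `{"stage": "generate", "family": "mcq"}`.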

Removes the temporary `byob` and `translate/translation` -> canonical-id redirects that the layout-normalisation commit had kept around for one release. With the rest of the repo (docs, skills, patterns, tier-2 plan-graph cases, use-case notebook, QA scope) already updated to the new ids in the earlier commits, the redirects only delayed the cutover and gave agents two valid spellings for the same step.

Now:

- `nemotron steps run byob` -> exit 1 with `Did you mean: byob/mcq?`.
- `nemotron steps run translate/translation` -> exit 1 with `Did you mean: translate/curator?`.

The fuzzy-match suggestion in `_resolve.py` still points users at the right id, so the legacy ids degrade gracefully instead of silently routing into a deprecated path. Tests swap the two `*_resolves_legacy_*_alias` cases for explicit rejection cases and drop a stray `(legacy alias: byob)` mention from the byob-benchmark-curator-translation context pack.
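The reject-and-suggest behavior described above can be approximated with the standard library's `difflib`. This is a sketch, not the contents of `_resolve.py`: the exception type, the similarity cutoff, and the hard-coded id list are assumptions.

```python
from __future__ import annotations

import difflib

# Illustrative subset of step ids named in this PR.
KNOWN_STEP_IDS = ["byob/mcq", "curate/nemo_curator", "translate/curator"]


class UnknownStepError(ValueError):
    """Raised when a step id does not match any discovered step (hypothetical type)."""


def resolve_step_id(step_id: str, known: list[str] = KNOWN_STEP_IDS) -> str:
    """Return step_id if it is known; otherwise raise with close-match suggestions."""
    if step_id in known:
        return step_id
    matches = difflib.get_close_matches(step_id, known, n=3, cutoff=0.5)
    hint = f" Did you mean: {', '.join(matches)}?" if matches else ""
    raise UnknownStepError(f"Unknown step id '{step_id}'.{hint}")
```

A CLI wrapper would catch `UnknownStepError`, print the message, and exit with status 1, so a legacy id fails loudly while still pointing at its replacement.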

- Generate BYOB/translate/curate runtime requirements from pyproject/uv.lock at submission time.
- Pass runtime metadata to remote jobs via env payloads instead of committed requirement files.
- Run Curator-backed profiles through a shared curator_runtime entrypoint.
- Update Lepton, Slurm, and DGX Cloud env templates for BYOB, translate, and curate.
- Fix Slurm Curator jobs to use CodePackager and set PYTHONPATH for remote imports.
- Add focused tests for runtime payloads, preflight behavior, BYOB config, and Slurm run_command handling.

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>

Signed-off-by: rkalani <rkalani@nvidia.com>

- Corrected references from `curator` to `nemo_curator` in the documentation.
- Ensured consistency in the workflow instructions and load more section for clarity.

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>

…zations

- Expanded the list of excluded suffixes in data mover to include common data artifacts.
- Introduced functions for tarball size warnings and formatted byte output.
- Updated execution scripts to generate unique cloud config paths based on content digest.
- Refactored symlink command for improved clarity and functionality.
- Adjusted tests to validate new features and ensure proper functionality.

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
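Two of the items above, formatted byte output and digest-based config paths, lend themselves to a short sketch. The names `format_bytes` and `digest_config_path`, the binary units, the 12-character digest prefix, and the `.yaml` suffix are all assumptions for illustration and may differ from the PR's actual helpers.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def format_bytes(num_bytes: int) -> str:
    """Render a byte count with binary units, e.g. for a tarball size warning."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    size = float(num_bytes)
    index = 0
    while size >= 1024 and index < len(units) - 1:
        size /= 1024
        index += 1
    if index == 0:
        return f"{num_bytes} B"
    return f"{size:.1f} {units[index]}"


def digest_config_path(content: str, directory: Path) -> Path:
    """Derive a path from the content hash so identical configs map to the same file."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]
    return directory / f"cloud-config-{digest}.yaml"
```

Because the path depends only on the content, two submissions with different configs cannot overwrite each other's file, while resubmitting the same config reuses the same path.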

- Changed default configuration file from `nano3.yaml` to `default.yaml` in both the step implementation and documentation.
- Updated recovery instructions in `step.toml` to reflect new dataset path specifications.
- Removed obsolete `nano3.yaml` configuration file.

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>

CLI Fixes

- Add shared convert runner for HF->Megatron, Megatron->HF, and LoRA merge
- Add default configs and metadata for convert steps
- Add convert profiles to Lepton, Slurm, and DGX Cloud env configs
- Add focused convert runner tests

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>

- Updated the conversion logic to prefer `torch_dtype` over the deprecated `dtype` alias.
- Modified related documentation and configuration files to reflect the change from `dtype` to `torch_dtype`.
- Added tests to ensure the new preference is correctly implemented and deprecated alias is handled.

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
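The key preference described above can be sketched as a small lookup with a deprecation warning. This is an illustration only: the function name `resolve_torch_dtype`, the plain-dict config, and the `"bfloat16"` fallback are assumptions, not the convert runner's actual code.

```python
from __future__ import annotations

import warnings


def resolve_torch_dtype(config: dict, default: str = "bfloat16") -> str:
    """Prefer `torch_dtype`; accept the deprecated `dtype` alias with a warning."""
    if "torch_dtype" in config:
        return config["torch_dtype"]
    if "dtype" in config:
        warnings.warn(
            "`dtype` is deprecated; use `torch_dtype` instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return config["dtype"]
    return default
```

When both keys are present, `torch_dtype` wins silently, which keeps existing configs working while steering new ones to the preferred key.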

Add HF/Megatron conversion and LoRA merge steps

Signed-off-by: rkalani <rkalani@nvidia.com>

Resolve high severity dependency CVEs

hvnguyenNV and others added 30 commits May 11, 2026 09:58

remove personal paths

57edf64

add instruction & example sections

870fe77

Add configurable Data Designer providers

5807439

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>

Add Data Designer custom provider example

7d9244b

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>

Move Data Designer provider example into config comments

321f71c

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>

Airgap SKILL addition

6332e3b

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>

bugs env resolve

6341061

Merge pull request #6 from rkalaniNV/env_fixes

312c025

environment bugs resolve

Merge pull request #4 from rkalaniNV/rapaul/airgap-support

970b456

Add airgap packaging for Nemotron Customizer

Merge pull request #5 from rkalaniNV/rapaul/sdg-data-designer-providers

bdb74dc

SDG data designer providers addition

Merge remote-tracking branch 'origin/rapaul/airgap-support' into byom…

2b5c5ff

…_skill_validation

Merge pull request #7 from rkalaniNV/rapaul/pre_rc1_fixes

8ba5622

Update RL step references from `rl/nemo_rl_grpo` to `rl/nemo_rl/rlvr` and fix the synth rename

Rename "prep" steps category to "data_prep"

a648505

- Refactor step IDs, docs, configs, tests, and downstream references to use data_prep consistently across the Nemotron step library.
- Addressed review comments
- Removed benchmark folder

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>

Merge pull request #10 from rkalaniNV/rapaul/pre_rc1_fixes

f8993d9

Add lightweight Curator-backed step profiles and BYOB all-stage lepton support changes

Update translation QA runbook and dependencies

614f4b0

Signed-off-by: rkalani <rkalani@nvidia.com>

Restructure QA runbook into test case flow

86e4010

Signed-off-by: rkalani <rkalani@nvidia.com>

Merge branch 'playbook_rc1' into byom_skill_validation

5682c15

Update and expand the env toml creation to all the steps

ad1b75a

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>

Updated SKILL.md

fb39af8

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>

Merge pull request #11 from rkalaniNV/rapaul/pre_rc1_fixes

de1b48d

Update and expand the env toml creation to all the steps

Merge pull request #9 from rkalaniNV/byom_skill_validation

f085e5b

Refocus Nemotron customize skill on repo-native training configs

Update QA customization runbook

253a018

Signed-off-by: rkalani <rkalani@nvidia.com>

Simplify Lepton training QA runbook

2420deb

rkalaniNV and others added 29 commits May 13, 2026 16:10

Update QA runbook for Lepton execution

f395cf9

Signed-off-by: rkalani <rkalani@nvidia.com>

Update QA data prep validation

a4dcf71

Signed-off-by: rkalani <rkalani@nvidia.com>

Add translation skill guidance

c3c572a

Signed-off-by: rkalani <rkalani@nvidia.com>

Improve Nemotron customize skill guidance

b1c6b9c

Signed-off-by: rkalani <rkalani@nvidia.com>

Incorporate translation skill review guidance

afc3e89

Signed-off-by: rkalani <rkalani@nvidia.com>

Add evaluator launcher step guidance

254a257

Signed-off-by: rkalani <rkalani@nvidia.com>

faith generation config fix

a116251

remove test step

c473b52

test scope doc update

f75c3b6

Merge pull request #17 from rkalaniNV/rkalani/customize-skill-evals

7c1c739

Refactor skills

undo guide changes

56c114e

Temp fix for faith generation params

94dbc40

pass generation for faith config in translation

Fix curator runtime profiles and evaluator hosting config

cb76fa4

Remove QA runbook from CLI fixes branch

b80fe3b

Sync CLI fixes with latest step layout

441bce3

Signed-off-by: rkalani <rkalani@nvidia.com>

Update SKILL.md to reflect new paths for nemo_curator step

cfccadc

- Corrected references from `curator` to `nemo_curator` in the documentation.
- Ensured consistency in the workflow instructions and load more section for clarity.

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>

CLI fixes

1d78170

CLI Fixes

Add checkpoint conversion

04ef1ef

Add HF/Megatron conversion and LoRA merge steps

Resolve high severity dependency CVEs

6bdce86

Signed-off-by: rkalani <rkalani@nvidia.com>

Merge pull request #27 from rkalaniNV/cve-resolution

b3a483c

Resolve high severity dependency CVEs

rkalaniNV marked this pull request as ready for review May 21, 2026 09:33

rkalaniNV wants to merge 59 commits into NVIDIA-NeMo:romeyn/agentic from rkalaniNV:playbook_rc1


4 participants
