Playbook rc1 #198
Open
rkalaniNV wants to merge 59 commits into
- Add deploy-scoped airgap tooling for Nemotron Customizer steps under src/nemotron/steps.
- Build a portable submitter image plus deduplicated task images for selected workflow targets.
- Expand step dependencies and map selected steps to task image families through a single airgap.yaml.
- Discover small task-image Python dependency gaps and bake pinned repo overlays required by step configs.
- Keep models, datasets, checkpoints, and customer data in external persistent storage managed by the user.
- Add resumable build state, image manifests with checksums, Dockerfiles, SFT overlay configs, README guidance, and focused tests.
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
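For reviewers, the "image manifests with checksums" item can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: the manifest layout (`images`, `file`, `bytes`, `sha256`) and the function names are assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large image archives are never fully loaded."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(archives: List[Path], manifest_path: Path) -> Dict:
    """Record name, size, and SHA-256 of each archive (hypothetical layout)."""
    manifest = {
        "images": [
            {"file": p.name, "bytes": p.stat().st_size, "sha256": sha256_of_file(p)}
            for p in sorted(archives)
        ]
    }
    manifest_path.write_text(json.dumps(manifest, indent=2) + "\n")
    return manifest


def verify_manifest(manifest_path: Path, directory: Path) -> List[str]:
    """Return the files that are missing or whose checksum no longer matches."""
    manifest = json.loads(manifest_path.read_text())
    mismatched = []
    for entry in manifest["images"]:
        candidate = directory / entry["file"]
        if not candidate.is_file() or sha256_of_file(candidate) != entry["sha256"]:
            mismatched.append(entry["file"])
    return mismatched
```

A check like `verify_manifest` on the airgapped side would let a transfer be validated before any image is loaded, and the same recorded checksums could serve as a skip condition for a resumable build.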
- Rename airgap artifacts to use launcher and execution image terminology
- Update runner stages, manifests, README, and config keys to match the new naming
- Keep execution image generation scoped to selected Nemotron Customizer steps
- Preserve external handling for models, datasets, checkpoints, and customer storage paths
- Refresh SFT Megatron Bridge airgap overlay configs
- Update tests for launcher/execution image behavior and staged runner flow
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
- Install git and CA certificates in the launcher image before uv sync
- Capture only docker inspect stdout while suppressing stderr during platform checks
- Keep the airgap runner platform probe compatible with subprocess stderr handling
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
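The stdout-only capture described in this commit can be sketched as below. This is an illustration under assumptions, not the runner's actual probe: `capture_stdout_only` and `image_platform` are hypothetical names, and the Go template assumes the `Os` and `Architecture` fields that `docker image inspect` reports.

```python
import subprocess
from typing import Optional, Sequence


def capture_stdout_only(cmd: Sequence[str]) -> Optional[str]:
    """Run a command, keep stdout, discard stderr; return None on any failure."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,      # capture stdout explicitly
            stderr=subprocess.DEVNULL,   # drop stderr instead of mixing it in
            text=True,
            check=False,
        )
    except OSError:
        # The binary is missing or not executable.
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()


def image_platform(image: str) -> Optional[str]:
    """Hypothetical platform probe, e.g. 'linux/amd64', or None if unavailable."""
    return capture_stdout_only(
        ["docker", "image", "inspect", "--format", "{{.Os}}/{{.Architecture}}", image]
    )
```

Passing `stdout=` and `stderr=` explicitly, rather than `capture_output=True`, matters here because `subprocess.run` raises `ValueError` when `capture_output` is combined with either of those arguments.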
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
- Move generic step commands and backends from commands/step to commands/steps
- Register only `nemotron steps`; remove the singular `nemotron step` alias
- Expose `steps list`, `steps show`, `steps run`, and `steps translation`
- Update imports, tests, docs, skills, and config examples to the plural CLI
- Add coverage for plural command registration and singular alias rejection
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
Resolve environment bugs
Add airgap packaging for Nemotron Customizer
Add SDG Data Designer providers
… across documentation and codebase, reflecting the removal of the planned GRPO step. Clean up related files, ensure consistency in the Nano3 recipes and references, and clean up synth.
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
…_skill_validation
…eration for training, post-training, and training-data preparation workflows. Remove non-training workflow guidance while preserving SFT, PEFT/LoRA, RL alignment, continued pretraining, and post-training coverage.
Update RL step references from `rl/nemo_rl_grpo` to `rl/nemo_rl/rlvr` and fix the synth rename
- Add BYOB `stage=all` dispatch and config-driven family/stage defaults
- Mount pinned Curator at `/opt/Curator` for BYOB and translate configs
- Keep BYOB tiny smoke self-contained and avoid heavy semantic imports unless enabled
- Simplify curate step for JSONL smoke runs with optional filters and Ray CPU env override
- Add curate tiny config/data and refresh curate docs/metadata
- Add Lepton/Slurm env profiles for BYOB, translate, curate, and SDG Data Designer
- Prefer `tiny` configs in generated env examples where available
- Fix env TOML rendering for empty lists
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
- Refactor step IDs, docs, configs, tests, and downstream references to use data_prep consistently across the Nemotron step library.
- Address review comments
- Remove benchmark folder
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
Add lightweight Curator-backed step profiles and BYOB all-stage Lepton support
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
Expand env TOML creation to cover all steps
Refocus Nemotron customize skill on repo-native training configs
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
Signed-off-by: rkalani <rkalani@nvidia.com>
Refactor skills
pass generation for faith config in translation
Every step now lives at `<category>/<implementation>/step.toml` and carries a two-segment `<category>/<implementation>` id.

Two outliers were fixed:
- `byob` -> `byob/mcq` (was a flat manifest directly under `byob/`, breaking the convention and leaving no room for future families). Only the manifest, entry script, and `config/` move into `byob/mcq/`; the `nemotron.steps.byob.*` Python package keeps its current import paths.
- `translate/translation` -> `translate/curator`, naming the step after its backing engine to match `curate/nemo_curator` and to leave room for `translate/nmt` / `translate/google` / `translate/aws` siblings.

Also:
- Add the missing `env` category title in `CATEGORY_TITLES` (and drop the unused `benchmark` title that had no folder).
- Regenerate `src/nemotron/steps/STEPS.md`.
- Register one-release legacy aliases (`byob`, `translate/translation`) in `cli/commands/steps/_resolve.py` with a deprecation note, and start suggesting close matches on unknown ids.
- Update every doc, skill pack, pattern manifest, tier-2 plan-graph case, and use-case example that referenced the old step ids or config paths.
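To make the layout rule concrete, here is a rough sketch of how ids could be derived from manifest locations. It is not the repo's discovery code: `discover_step_ids`, `is_canonical_step_id`, and the allowed id characters are assumptions for illustration.

```python
import re
from pathlib import Path
from typing import List

# Assumed id shape: exactly two lowercase segments separated by one slash.
STEP_ID_PATTERN = re.compile(r"[a-z0-9_]+/[a-z0-9_]+")


def is_canonical_step_id(step_id: str) -> bool:
    """True only for two-segment `<category>/<implementation>` ids."""
    return STEP_ID_PATTERN.fullmatch(step_id) is not None


def discover_step_ids(steps_root: Path) -> List[str]:
    """Derive ids from `<category>/<implementation>/step.toml` manifests."""
    ids = []
    for manifest in sorted(steps_root.glob("*/*/step.toml")):
        # The id is the manifest's directory relative to the steps root.
        ids.append(manifest.parent.relative_to(steps_root).as_posix())
    return ids
```

Under this sketch a flat manifest such as the old `byob/step.toml` is simply not discovered, which is one way to see why the outlier needed to move into `byob/mcq/`.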
Both commands re-implemented config loading on top of the generic step dispatcher, forced ``mode == "local"``, rejected passthrough, and skipped the executor/backend selection that every other step uses. They existed solely because the underlying steps (`translate/translation`, `byob`) had an irregular layout, which the previous commit fixed.

Now every step is run through the same surface:

    nemotron steps run byob/mcq -c default -o stage=generate -o family=mcq
    nemotron steps run translate/curator -c default -o input_path=...

This removes ~150 lines of duplicate config plumbing, restores one CLI contract for users and agents, and makes batch executors (Lepton, Slurm, DGXCloud) immediately available for translation and BYOB.

The BYOB step's ``--list-families`` is still reachable through ``python -m nemotron.steps.byob.scripts.run --list-families`` and the choices are listed in ``nemotron steps show byob/mcq``.

Updates docs, SKILL.md files, the use-case BYOB notebook, the QA scope checklist, and the translation CLI tests to assert the bespoke commands stay removed.
Keep the override flow as bare key=value positionals, with no -o flag. Verbose "-o key=value -o key=value" forms inflate every translation/BYOB example without adding meaningful safety over what split_unknown_args already does, so this commit drops the experimental -o/--override flag and rewrites the docs back to the concise positional form everyone was already using.

Net additions on the catalog side:
- `nemotron steps list --tag <tag>` filters by manifest tag, which makes the byob/translation overlap discoverable (`--tag translation`, `--tag mcq`, …).
- `nemotron steps list --tree` groups discovered steps under their category title (driven by CATEGORY_TITLES) for a human-friendly view.

No change to `steps run`'s public surface beyond what was already there: bare `key=value` positionals at the end of the command remain the override path.
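The positional override form can be illustrated with a small sketch. This is not the repo's `split_unknown_args`; `split_overrides` is a hypothetical helper that assumes any token without a leading dash and with a non-empty key before the first `=` is an override.

```python
from typing import Dict, List, Tuple


def split_overrides(args: List[str]) -> Tuple[List[str], Dict[str, str]]:
    """Separate bare key=value tokens from flags and other arguments."""
    passthrough: List[str] = []
    overrides: Dict[str, str] = {}
    for token in args:
        key, sep, value = token.partition("=")
        # `--flag=value` stays a flag; only bare `key=value` is an override.
        if sep and key and not token.startswith("-"):
            overrides[key] = value
        else:
            passthrough.append(token)
    return passthrough, overrides
```

With this sketch, the arguments of `nemotron steps run byob/mcq -c default stage=generate family=mcq` would split into `-c default` on one side and `{"stage": "generate", "family": "mcq"}` on the other. Splitting on the first `=` only means values that themselves contain `=` survive intact.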
Removes the temporary `byob` and `translate/translation` -> canonical-id redirects that the layout-normalisation commit had kept around for one release. With the rest of the repo (docs, skills, patterns, tier-2 plan-graph cases, use-case notebook, QA scope) already updated to the new ids in the earlier commits, the redirects only delayed the cutover and gave agents two valid spellings for the same step.

Now:
- `nemotron steps run byob` -> exit 1 with `Did you mean: byob/mcq?`.
- `nemotron steps run translate/translation` -> exit 1 with `Did you mean: translate/curator?`.

The fuzzy-match suggestion in `_resolve.py` still points users at the right id, so the legacy ids degrade gracefully instead of silently routing into a deprecated path.

Tests swap the two `*_resolves_legacy_*_alias` cases for explicit rejection cases and drop a stray `(legacy alias: byob)` mention from the byob-benchmark-curator-translation context pack.
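The reject-with-suggestion behaviour described above could look roughly like this. It is a sketch under assumptions, not the contents of `_resolve.py`: the function names, the same-category preference, the 0.6 cutoff, and the "Unknown step id" wording are all illustrative choices.

```python
import difflib
from typing import Iterable, List, Tuple


def suggest_step_ids(unknown: str, known: Iterable[str], limit: int = 3) -> List[str]:
    """Suggest ids in the same category first, then fuzzy matches."""
    candidates = sorted(known)
    category = unknown.split("/", 1)[0]
    same_category = [c for c in candidates if c.split("/", 1)[0] == category]
    fuzzy = difflib.get_close_matches(unknown, candidates, n=limit, cutoff=0.6)
    merged: List[str] = []
    for candidate in same_category + fuzzy:
        if candidate not in merged:
            merged.append(candidate)
    return merged[:limit]


def resolve_step(step_id: str, known: Iterable[str]) -> Tuple[int, str]:
    """Return (exit_code, resolved id or error message); no alias redirects."""
    candidates = list(known)
    if step_id in candidates:
        return 0, step_id
    message = f"Unknown step id: {step_id}."
    suggestions = suggest_step_ids(step_id, candidates)
    if suggestions:
        message += f" Did you mean: {', '.join(suggestions)}?"
    return 1, message
```

The design point is that an unknown id never resolves to anything: the caller gets a non-zero exit code plus a hint, so there is exactly one valid spelling per step.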
- Generate BYOB/translate/curate runtime requirements from pyproject/uv.lock at submission time.
- Pass runtime metadata to remote jobs via env payloads instead of committed requirement files.
- Run Curator-backed profiles through a shared curator_runtime entrypoint.
- Update Lepton, Slurm, and DGX Cloud env templates for BYOB, translate, and curate.
- Fix Slurm Curator jobs to use CodePackager and set PYTHONPATH for remote imports.
- Add focused tests for runtime payloads, preflight behavior, BYOB config, and Slurm run_command handling.
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
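One way to picture "runtime metadata via env payloads" is a JSON document encoded into a single environment variable. The sketch below is an assumption-laden illustration, not this PR's payload format: the variable name `STEP_RUNTIME_PAYLOAD`, the payload keys, and both function names are hypothetical.

```python
import base64
import json
from typing import Dict, List, Mapping

# Hypothetical variable name; the PR does not state the one it uses.
PAYLOAD_ENV_VAR = "STEP_RUNTIME_PAYLOAD"


def encode_runtime_payload(requirements: List[str], pythonpath: List[str]) -> Dict[str, str]:
    """Pack pinned requirements and import paths into one env-safe string."""
    body = json.dumps(
        {"requirements": sorted(requirements), "pythonpath": list(pythonpath)},
        sort_keys=True,
    )
    encoded = base64.urlsafe_b64encode(body.encode("utf-8")).decode("ascii")
    return {PAYLOAD_ENV_VAR: encoded}


def decode_runtime_payload(environ: Mapping[str, str]) -> Dict[str, List[str]]:
    """Recover the payload on the remote side; empty defaults if it is absent."""
    raw = environ.get(PAYLOAD_ENV_VAR)
    if raw is None:
        return {"requirements": [], "pythonpath": []}
    return json.loads(base64.urlsafe_b64decode(raw.encode("ascii")).decode("utf-8"))
```

Base64 keeps the value free of quoting problems when it passes through shell-based launchers, and sorting the pins makes the payload deterministic for a given lock state.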
Signed-off-by: rkalani <rkalani@nvidia.com>
- Corrected references from `curator` to `nemo_curator` in the documentation.
- Ensured consistency in the workflow instructions and load more section for clarity.
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
…zations
- Expanded the list of excluded suffixes in data mover to include common data artifacts.
- Introduced functions for tarball size warnings and formatted byte output.
- Updated execution scripts to generate unique cloud config paths based on content digest.
- Refactored symlink command for improved clarity and functionality.
- Adjusted tests to validate new features and ensure proper functionality.
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
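The size-warning and digest-path items can be sketched as follows. These helpers are illustrative only: the function names, the 512 MiB threshold, the binary units, and the `cloud_config_<digest>.yaml` naming are assumptions rather than what the data mover or execution scripts actually use.

```python
import hashlib
from pathlib import Path
from typing import Optional


def format_bytes(num_bytes: int) -> str:
    """Render a byte count with binary units, e.g. 1536 -> '1.5 KiB'."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    size = float(num_bytes)
    index = 0
    while size >= 1024 and index < len(units) - 1:
        size /= 1024
        index += 1
    if index == 0:
        return f"{num_bytes} B"
    return f"{size:.1f} {units[index]}"


def tarball_size_warning(num_bytes: int, limit_bytes: int = 512 * 1024 * 1024) -> Optional[str]:
    """Return a warning string when the tarball exceeds the (assumed) limit."""
    if num_bytes <= limit_bytes:
        return None
    return (
        f"Code tarball is {format_bytes(num_bytes)}, above the "
        f"{format_bytes(limit_bytes)} threshold; check for data artifacts to exclude."
    )


def digest_config_path(directory: Path, content: str, length: int = 12) -> Path:
    """Name a config file after a prefix of the SHA-256 of its content."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:length]
    return directory / f"cloud_config_{digest}.yaml"
```

Deriving the path from the content means identical configs map to the same file while different configs do not overwrite each other when jobs are submitted concurrently.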
- Changed default configuration file from `nano3.yaml` to `default.yaml` in both the step implementation and documentation.
- Updated recovery instructions in `step.toml` to reflect new dataset path specifications.
- Removed obsolete `nano3.yaml` configuration file.
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
- Add shared convert runner for HF->Megatron, Megatron->HF, and LoRA merge
- Add default configs and metadata for convert steps
- Add convert profiles to Lepton, Slurm, and DGX Cloud env configs
- Add focused convert runner tests
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
- Updated the conversion logic to prefer `torch_dtype` over the deprecated `dtype` alias.
- Modified related documentation and configuration files to reflect the change from `dtype` to `torch_dtype`.
- Added tests to ensure the new preference is correctly implemented and deprecated alias is handled.
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
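The key preference described in this commit might be implemented along these lines. This is a hedged sketch, not the convert runner's code: `resolve_torch_dtype` is a hypothetical helper, and the choice to emit a `DeprecationWarning` for the alias is an assumption.

```python
import warnings
from typing import Any, Mapping, Optional


def resolve_torch_dtype(config: Mapping[str, Any], default: Optional[str] = None) -> Optional[str]:
    """Prefer `torch_dtype`; fall back to the deprecated `dtype` alias."""
    if config.get("torch_dtype") is not None:
        # The preferred key wins even when the alias is also present.
        return config["torch_dtype"]
    if config.get("dtype") is not None:
        warnings.warn(
            "`dtype` is deprecated; use `torch_dtype` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return config["dtype"]
    return default
```

Keeping the alias readable with a warning lets existing configs continue to work while signalling that `torch_dtype` is the key to write going forward.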
Add HF/Megatron conversion and LoRA merge steps
Signed-off-by: rkalani <rkalani@nvidia.com>
Resolve high severity dependency CVEs
This PR adds support for customization workflows and Build-Your-Own-Benchmark (BYOB) to the Nemotron repo. Agentic runs are supported across the stages of customization, from data curation through model training to evaluation.