59 commits
57edf64
remove personal paths
hvnguyenNV May 11, 2026
870fe77
add instruction & example sections
hvnguyenNV May 11, 2026
aa2c430
Add airgap packaging flow for Nemotron Customizer
rapaul-nv May 8, 2026
f4b8f50
Refine Nemotron Customizer airgap image flow
rapaul-nv May 8, 2026
4c02460
Fix launcher image setup and Docker platform inspection
rapaul-nv May 8, 2026
5807439
Add configurable Data Designer providers
rapaul-nv May 11, 2026
7d9244b
Add Data Designer custom provider example
rapaul-nv May 11, 2026
321f71c
Move Data Designer provider example into config comments
rapaul-nv May 11, 2026
2097108
Normalize step CLI under nemotron steps
rapaul-nv May 11, 2026
6332e3b
Airgap SKILL addition
rapaul-nv May 11, 2026
6341061
bugs env resolve
anushaknvidia May 11, 2026
312c025
Merge pull request #6 from rkalaniNV/env_fixes
rapaul-nv May 11, 2026
970b456
Merge pull request #4 from rkalaniNV/rapaul/airgap-support
rapaul-nv May 11, 2026
bdb74dc
Merge pull request #5 from rkalaniNV/rapaul/sdg-data-designer-providers
rapaul-nv May 11, 2026
2f188e7
Update RL step references from `rl/nemo_rl_grpo` to `rl/nemo_rl/rlvr`…
rapaul-nv May 11, 2026
2b5c5ff
Merge remote-tracking branch 'origin/rapaul/airgap-support' into byom…
hvnguyenNV May 12, 2026
041a242
update skill: Prioritize existing repo code with YAML-only config gen…
hvnguyenNV May 12, 2026
8ba5622
Merge pull request #7 from rkalaniNV/rapaul/pre_rc1_fixes
rapaul-nv May 12, 2026
664f9c9
Add lightweight Curator-backed step profiles and BYOB all-stage support
rapaul-nv May 12, 2026
a648505
Rename "prep" steps category to "data_prep"
rapaul-nv May 12, 2026
f8993d9
Merge pull request #10 from rkalaniNV/rapaul/pre_rc1_fixes
rkalaniNV May 12, 2026
614f4b0
Update translation QA runbook and dependencies
rkalaniNV May 12, 2026
86e4010
Restructure QA runbook into test case flow
rkalaniNV May 12, 2026
5682c15
Merge branch 'playbook_rc1' into byom_skill_validation
hvnguyenNV May 12, 2026
ad1b75a
Update and expand the env toml creation to all the steps
rapaul-nv May 12, 2026
fb39af8
Updated SKILL.md
rapaul-nv May 12, 2026
de1b48d
Merge pull request #11 from rkalaniNV/rapaul/pre_rc1_fixes
rkalaniNV May 12, 2026
f085e5b
Merge pull request #9 from rkalaniNV/byom_skill_validation
rkalaniNV May 13, 2026
253a018
Update QA customization runbook
rkalaniNV May 13, 2026
2420deb
Simplify Lepton training QA runbook
rkalaniNV May 13, 2026
f395cf9
Update QA runbook for Lepton execution
rkalaniNV May 13, 2026
a4dcf71
Update QA data prep validation
rkalaniNV May 13, 2026
c3c572a
Add translation skill guidance
rkalaniNV May 15, 2026
b1c6b9c
Improve Nemotron customize skill guidance
rkalaniNV May 15, 2026
afc3e89
Incorporate translation skill review guidance
rkalaniNV May 15, 2026
254a257
Add evaluator launcher step guidance
rkalaniNV May 16, 2026
a116251
faith generation config fix
anushaknvidia May 18, 2026
c473b52
remove test step
anushaknvidia May 18, 2026
f75c3b6
test scope doc update
anushaknvidia May 18, 2026
7c1c739
Merge pull request #17 from rkalaniNV/rkalani/customize-skill-evals
rkalaniNV May 18, 2026
56c114e
undo guide changes
anushaknvidia May 18, 2026
94dbc40
Temp fix for faith generation params
rkalaniNV May 18, 2026
dae0077
steps: normalise byob and translate step layout
rapaul-nv May 18, 2026
d442c1d
steps: drop bespoke `steps translation` and top-level `nemotron byob`
rapaul-nv May 18, 2026
87ccae2
steps: list ergonomics (tag filter, tree view)
rapaul-nv May 18, 2026
854e467
steps: drop legacy id aliases for byob and translate
rapaul-nv May 18, 2026
da0363b
Add generic Curator runtime bootstrap
rapaul-nv May 13, 2026
cb76fa4
Fix curator runtime profiles and evaluator hosting config
rkalaniNV May 18, 2026
b80fe3b
Remove QA runbook from CLI fixes branch
rkalaniNV May 18, 2026
441bce3
Sync CLI fixes with latest step layout
rkalaniNV May 18, 2026
cfccadc
Update SKILL.md to reflect new paths for nemo_curator step
rapaul-nv May 18, 2026
2fc8d90
Enhance data mover and execution scripts with new features and optimi…
rapaul-nv May 18, 2026
60373b0
Update Megatron-Bridge configuration and documentation
rapaul-nv May 18, 2026
1d78170
CLI fixes
rkalaniNV May 19, 2026
e0fd17c
Add HF/Megatron conversion and LoRA merge steps
rapaul-nv May 19, 2026
f22e946
Refactor dtype handling in HF to Megatron conversion
rapaul-nv May 19, 2026
04ef1ef
Add checkpoint conversion
rkalaniNV May 19, 2026
6bdce86
Resolve high severity dependency CVEs
rkalaniNV May 21, 2026
b3a483c
Merge pull request #27 from rkalaniNV/cve-resolution
rkalaniNV May 21, 2026
28 changes: 14 additions & 14 deletions .claude-plugin/marketplace.json
@@ -4,24 +4,24 @@
     "name": "NVIDIA Nemotron Team"
   },
   "metadata": {
-    "description": "NVIDIA Nemotron AI stack plugins — pipeline builder, model knowledge bases, and contributor tools"
+    "description": "NVIDIA Nemotron AI stack plugins"
   },
   "plugins": [
     {
-      "name": "nemotron",
-      "source": "./plugins/nemotron",
-      "description": "NVIDIA Nemotron AI stack — pipeline builder and model knowledge bases",
-      "version": "0.3.0",
+      "name": "nemotron-customize",
+      "source": "./skills/nemotron-customize",
+      "description": "Compose runnable Nemotron model-customization pipelines from repo steps.",
+      "version": "0.1.0",
       "category": "ml-pipelines",
-      "keywords": ["nvidia", "nemotron", "training", "sft", "rl", "megatron", "models"]
-    },
-    {
-      "name": "nemotron-dev",
-      "source": "./plugins/nemotron-dev",
-      "description": "Internal: contributor tools for Nemotron repo developers",
-      "version": "0.3.0",
-      "category": "developer-tools",
-      "keywords": ["nvidia", "nemotron", "internal", "contributing", "dev"]
+      "keywords": [
+        "nvidia",
+        "nemotron",
+        "training",
+        "sft",
+        "rl",
+        "megatron",
+        "customization"
+      ]
     }
   ]
 }
11 changes: 0 additions & 11 deletions .claude-plugin/plugin.json

This file was deleted.

1 change: 1 addition & 0 deletions .gitignore
@@ -105,6 +105,7 @@ CLAUDE.md
 # Compiled config
 config.yaml
 main.py
+src/nemotron/steps/_bootstrap/runtime/
 
 # Documentation build
 docs/_build/
30 changes: 30 additions & 0 deletions README.md
@@ -36,6 +36,36 @@

---

## Use from Claude Code

This repo ships a Claude Code plugin called **`nemotron-customize`** that turns the step catalog under [`src/nemotron/steps/`](./src/nemotron/steps/) into a guided, repo-native pipeline builder.

Install once:

```text
/plugin marketplace add NVIDIA/Nemotron
/plugin install nemotron-customize@nvidia-nemotron
```

Then, **start Claude Code from the repo root** and invoke the skill:

```bash
cd /path/to/Nemotron # repo root: must contain pyproject.toml and src/nemotron/steps/
claude
```

```text
/nemotron-customize
```

The skill resolves all file paths against your current working directory, so it must be invoked from the Nemotron checkout root. Running it from a subdirectory will cause file reads to fail.

The skill plans the step DAG, validates artifact wiring, and emits the YAML configs needed to run the requested pipeline. See [`skills/nemotron-customize/SKILL.md`](./skills/nemotron-customize/SKILL.md) for the full contract.

> The marketplace installs **only** `nemotron-customize`. The other folders under [`skills/`](./skills/) (model knowledge bases, contributor add-`*` skills) stay on disk for repo browsing but are not loaded as plugins.

---

## Repository Overview

```
7 changes: 7 additions & 0 deletions deploy/nemotron-customizer/airgap/.gitignore
@@ -0,0 +1,7 @@
# Generated by airgap runner.
out/
airgap-bundle/
archives/
__pycache__/
*.lock.yaml
*.tar
52 changes: 52 additions & 0 deletions deploy/nemotron-customizer/airgap/Dockerfile.execution
@@ -0,0 +1,52 @@
# Derivative execution image for Nemotron Customizer airgap.
# Built from the real training/runtime image and only adds small missing
# wrapper packages.

ARG BASE_IMAGE
FROM ${BASE_IMAGE}

ARG EXECUTION_REQUIREMENTS
ARG REPO_OVERLAYS
ARG REPO_OVERLAYS_DIR
ARG PYTHON_BIN=python
ARG PIP_NO_DEPS=true

ENV HF_HUB_OFFLINE=1
ENV TRANSFORMERS_OFFLINE=1
ENV HF_DATASETS_OFFLINE=1
ENV WANDB_MODE=offline

COPY ${EXECUTION_REQUIREMENTS} /opt/nemotron-airgap/execution-requirements.txt
COPY ${REPO_OVERLAYS} /opt/nemotron-airgap/repo-overlays.json
COPY ${REPO_OVERLAYS_DIR}/ /opt/nemotron-airgap/repo-overlays/

# Build-time installs keep --no-cache-dir so derivative image layers stay small.
RUN if [ -s /opt/nemotron-airgap/execution-requirements.txt ]; then \
      if [ "${PIP_NO_DEPS}" = "true" ]; then \
        ${PYTHON_BIN} -m pip install --no-cache-dir --no-deps -r /opt/nemotron-airgap/execution-requirements.txt; \
      else \
        ${PYTHON_BIN} -m pip install --no-cache-dir -r /opt/nemotron-airgap/execution-requirements.txt; \
      fi; \
    fi && \
    ${PYTHON_BIN} - <<'PY'
import json
import pathlib
import shutil

root = pathlib.Path("/opt/nemotron-airgap/repo-overlays")
items = json.loads(pathlib.Path("/opt/nemotron-airgap/repo-overlays.json").read_text())
for item in items:
    repo = item["repo"]
    source = item.get("source", repo)
    target = pathlib.Path(item["target"])
    src = root / source
    if not src.exists():
        raise SystemExit(f"missing baked repo overlay: {src}")
    if target.exists() or target.is_symlink():
        if target.is_dir() and not target.is_symlink():
            shutil.rmtree(target)
        else:
            target.unlink()
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, target)
PY
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
**

!deploy
!deploy/nemotron-customizer
!deploy/nemotron-customizer/airgap
!deploy/nemotron-customizer/airgap/out
!deploy/nemotron-customizer/airgap/out/execution-context
!deploy/nemotron-customizer/airgap/out/execution-context/**
!deploy/nemotron-customizer/airgap/out/repo-overlays
!deploy/nemotron-customizer/airgap/out/repo-overlays/**

**/.git
**/__pycache__
**/*.pyc
30 changes: 30 additions & 0 deletions deploy/nemotron-customizer/airgap/Dockerfile.launcher
@@ -0,0 +1,30 @@
# Launcher image for Nemotron Customizer airgap.
# It contains the repo and a uv-synced environment. It does not run training.

ARG BASE_IMAGE=python:3.12-slim
FROM ${BASE_IMAGE}

ARG UV_VERSION=0.11.1

WORKDIR /workspace/Nemotron

ENV UV_LINK_MODE=copy
ENV UV_PYTHON_DOWNLOADS=never
ENV HF_HUB_OFFLINE=1
ENV TRANSFORMERS_OFFLINE=1
ENV HF_DATASETS_OFFLINE=1
ENV WANDB_MODE=offline
ENV PYTHONPATH=/workspace/Nemotron/src
ENV PATH=/workspace/Nemotron/.venv/bin:$PATH

RUN apt-get update && \
    apt-get install -y --no-install-recommends git ca-certificates && \
    rm -rf /var/lib/apt/lists/*

RUN python -m pip install --no-cache-dir "uv==${UV_VERSION}"

COPY . .

RUN uv sync --frozen --no-dev

CMD ["bash"]
21 changes: 21 additions & 0 deletions deploy/nemotron-customizer/airgap/Dockerfile.launcher.dockerignore
@@ -0,0 +1,21 @@
.git
.venv
.ruff_cache
.pytest_cache
**/__pycache__
**/*.pyc

/.nemo_run
/outputs
/output
/logs
/checkpoints
/wandb
/data
/downloads

deploy/nemotron-customizer/airgap/out
deploy/nemotron-customizer/airgap/airgap-bundle
deploy/nemotron-customizer/airgap/archives
deploy/nemotron-customizer/airgap/*.tar
deploy/nemotron-customizer/airgap/*.lock.yaml
135 changes: 135 additions & 0 deletions deploy/nemotron-customizer/airgap/README.md
@@ -0,0 +1,135 @@
# Nemotron Customizer Airgap

This folder is scoped only to Nemotron Customizer steps under
`src/nemotron/steps/`.

The flow is intentionally small:

1. Build one **launcher image** with this repo and `uv.lock`.
2. Build one or more **execution images** by grouping selected workflow stages by base image.
3. Save those images as tarballs for the airgapped side.
4. Keep models, datasets, checkpoints, and customer files on persistent storage.

Edit `airgap.yaml` first:

- `workflow.stages`: the Nemotron Customizer steps the customer wants to run
- `dependencies`: central step dependency map, for example SFT training needs SFT packing
- `step_execution_images`: which execution image each step should use
- `execution_images`: the base image, output tag, and known/import-probed Python requirements

Only steps reached from `workflow.stages` are built. Steps are grouped by
`base_image + repo_overlays`; each group gets one derivative image with the
union of its small missing packages. If two selected step families share the
same base image and repo overlays, the runner emits one combined execution image for
both.
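
The grouping rule above can be sketched in a few lines of Python. This is an illustrative sketch only, not the runner's actual code: `StepImage`, `group_steps`, the field names, and the image and package names are all hypothetical, chosen to mirror the `base_image + repo_overlays` key described here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StepImage:
    """Hypothetical view of what one selected step needs from its image."""
    step: str
    base_image: str
    repo_overlays: tuple = ()
    requirements: tuple = ()


def group_steps(steps):
    """Group steps by (base_image, repo_overlays) and union their requirements."""
    groups = defaultdict(lambda: {"steps": [], "requirements": set()})
    for s in steps:
        # Sort overlays so that ordering alone never splits a group.
        key = (s.base_image, tuple(sorted(s.repo_overlays)))
        groups[key]["steps"].append(s.step)
        groups[key]["requirements"].update(s.requirements)
    return dict(groups)


# Hypothetical image and package names, for illustration only.
selected = [
    StepImage("sft/megatron_bridge", "example/train:1", ("Megatron-LM",), ("pkg-a",)),
    StepImage("data_prep/sft_packing", "example/train:1", ("Megatron-LM",), ("pkg-b",)),
    StepImage("sdg/data_designer", "example/sdg:1", (), ("pkg-c",)),
]
plan = group_steps(selected)
for (base, overlays), group in plan.items():
    print(base, overlays, group["steps"], sorted(group["requirements"]))
```

In this sketch the first two steps share a base image and overlay set, so they collapse into one group whose requirements are the union of both; the third step gets its own group.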

Run from the repo root:

```bash
uv run python deploy/nemotron-customizer/airgap/runner.py \
  --config deploy/nemotron-customizer/airgap/airgap.yaml
```

That prints the plan. To actually pull/build/save images on the connected
machine:

```bash
uv run python deploy/nemotron-customizer/airgap/runner.py \
  --config deploy/nemotron-customizer/airgap/airgap.yaml \
  --execute
```

To run only a few stages:

```bash
uv run python deploy/nemotron-customizer/airgap/runner.py \
  --config deploy/nemotron-customizer/airgap/airgap.yaml \
  --stage validate \
  --stage discover-execution-deps
```

To override the workflow without editing YAML, pass one or more selected
Nemotron step targets. Dependencies are still expanded from `dependencies`.
For example, SDG plus SFT also adds `data_prep/sft_packing` because SFT needs packed
data:

```bash
uv run python deploy/nemotron-customizer/airgap/runner.py \
  --config deploy/nemotron-customizer/airgap/airgap.yaml \
  --target sdg/data_designer:tiny \
  --target sft/megatron_bridge:tiny
```
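
The dependency expansion described above amounts to a transitive closure over the `dependencies` map. The following is a minimal sketch under stated assumptions: `expand_targets` is a hypothetical helper, the map shape (step name to list of step names) is assumed, and profile suffixes such as `:tiny` are left out for brevity. It is not the runner's implementation.

```python
def expand_targets(targets, dependencies):
    """Return the targets plus all transitive dependencies, dependencies first."""
    ordered, seen = [], set()

    def visit(step, stack=()):
        if step in stack:
            raise ValueError(f"dependency cycle: {' -> '.join(stack + (step,))}")
        if step in seen:
            return
        for dep in dependencies.get(step, ()):
            visit(dep, stack + (step,))
        seen.add(step)
        ordered.append(step)

    for target in targets:
        visit(target)
    return ordered


# Hypothetical map; the real one lives under `dependencies` in airgap.yaml.
dependencies = {"sft/megatron_bridge": ["data_prep/sft_packing"]}
expanded = expand_targets(["sdg/data_designer", "sft/megatron_bridge"], dependencies)
print(expanded)
```

Requesting SDG and SFT here also yields `data_prep/sft_packing`, placed ahead of the SFT step that needs it.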

Outputs are written under `deploy/nemotron-customizer/airgap/out/` by default:

- `airgap-manifest.yaml`: what was validated and built
- `airgap-build-state.yaml`: incomplete execute run state used for resume
- `airgap-build-complete.yaml`: final execute run state after success
- `requirements-<execution-group>.txt`: small missing packages per execution image
- `repo-overlays-<execution-group>.json`: git auto-mounts discovered from selected step configs
- `launcher-image.tar`
- `execution-*.tar`
- SHA256 checksums for saved image tarballs in `airgap-manifest.yaml`
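
After copying the tarballs to the airgapped side, the recorded checksums can be checked before loading any image. The sketch below uses only the Python standard library; `sha256_of` and `verify` are hypothetical helper names, and it assumes the expected digest has been read by hand from `airgap-manifest.yaml`, since the manifest layout is not described here.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Stream a file through SHA-256 so large image tarballs are never held in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True when the file digest matches the expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `verify("launcher-image.tar", "<digest from the manifest>")` would return `False` if the tarball was truncated or altered in transit.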

If an execute run fails midway, leave `airgap-build-state.yaml` in place and rerun
the same command. Completed expensive actions are reused when their artifacts
still exist. If you intentionally change the workflow or image plan before
finishing, move or remove `airgap-build-state.yaml` first; the runner will not
silently overwrite incomplete state from a different plan.

Runtime dependency probes use Docker volumes named
`nemotron-airgap-pip-cache-<platform>` to avoid downloading the same wheels on
every probe loop. To reset them, run `docker volume ls | grep
nemotron-airgap-pip-cache` and remove the relevant volume with
`docker volume rm`.

Large assets are not baked into images. The customer should stage them on
executor-visible persistent storage and reference them through config overrides
and `run.env.mounts`.

During dependency discovery, the runner mounts the connected-machine checkout
into each execution image only to probe imports. The final execution image deliberately
does not bake this repo; the launcher image and the normal nemo-run/nemo-runspec
code transport provide the repo to the remote job at submission time.

Repo logistics stay outside `airgap.yaml`. If a selected step config contains
`${auto_mount:git+...}`, the runner treats it as a connected-machine build input:
it fetches that pinned repo and bakes it into the derivative execution image at the
requested target path. Runtime jobs then use the baked image and do not clone
from GitHub. Site-specific data/model mounts remain in env profiles or step
overrides.
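
To see which step configs would trigger this behavior, one can scan a loaded config for the `${auto_mount:git+...}` pattern. This is a rough sketch, not the runner's discovery logic: `find_git_auto_mounts` is a hypothetical helper, the regular expression assumes the spec runs up to the closing brace, and the example URL is a placeholder.

```python
import re

# Assumption: the spec is everything between "git+" and the closing brace.
AUTO_MOUNT = re.compile(r"\$\{auto_mount:(git\+[^}]+)\}")


def find_git_auto_mounts(node):
    """Recursively collect raw `git+...` specs from a nested config structure."""
    found = []
    if isinstance(node, dict):
        for value in node.values():
            found.extend(find_git_auto_mounts(value))
    elif isinstance(node, (list, tuple)):
        for value in node:
            found.extend(find_git_auto_mounts(value))
    elif isinstance(node, str):
        found.extend(AUTO_MOUNT.findall(node))
    return found


# Placeholder config; real step configs will differ.
config = {
    "run": {
        "env": {
            "mounts": [
                "${auto_mount:git+https://example.invalid/org/repo.git}",
                "/data:/data",
            ]
        }
    }
}
specs = find_git_auto_mounts(config)
print(specs)
```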

If the connected machine is not the same architecture as the target cluster,
set `platform: linux/amd64` on the `launcher_image` or execution image entry in
`airgap.yaml`. If you need to minimize transfer size for several images that
share layers, `docker save -o all-images.tar tag1 tag2 ...` can be used after
the runner builds the images; a single tar deduplicates shared layers better
than one tar per image.

The Dockerfiles expect the chosen base images to have Python and `pip` available
for bootstrapping small offline additions. The runtime defaults bake
`HF_HUB_OFFLINE=1`, `TRANSFORMERS_OFFLINE=1`, `HF_DATASETS_OFFLINE=1`, and
`WANDB_MODE=offline`; customers with an internal mirror can override those at
submission time through their env profile or `run.env.env_vars`.

For SFT Megatron-Bridge, build with the normal config so the runner can discover
the pinned Megatron-LM and Megatron-Bridge auto-mounts:

```yaml
workflow:
  stages:
    - sft/megatron_bridge:tiny
```

When submitting inside the airgap, use the deploy overlay config so those git
auto-mounts are cleared at runtime while persistent storage mounts from the env
profile still apply. Use the image printed by the runner under
`selected execution images`, or read it from `out/airgap-manifest.yaml` under
`step_execution_images`.

```bash
uv run nemotron steps run sft/megatron_bridge \
  -c deploy/nemotron-customizer/airgap/configs/sft_megatron_bridge_tiny.yaml \
  -b <your-airgap-profile> \
  run.env.container_image=<image-printed-for-sft/megatron_bridge>
```