NVIDIA-NeMo · oliverholworthy · May 21, 2026
diff --git a/.agents/skills/retriever-finetune-recipe/PITFALLS.md b/.agents/skills/retriever-finetune-recipe/PITFALLS.md
@@ -0,0 +1,60 @@
+# Retriever Recipe Pitfalls
+
+Load this file when a recipe command fails, metrics look wrong, or the user asks for debugging help.
+
+## Setup And CLI
+
+- CLI help or dry-run fails before reaching `embed` or `rerank` with a missing optional dependency: run the repo's documented sync path, usually `uv sync --all-extras`. If the error names Data Designer, the smaller recovery may be `uv sync --extra data-sdg`.
+- `uv run` rebuilds or installs packages unexpectedly: report that the environment is being prepared, then continue with help/dry-run before launching work.
+- CUDA symbol, `nvJitLink`, or library mismatch errors: clear inherited CUDA library paths with `LD_LIBRARY_PATH=""` for the command, then rerun the cheapest failing validation.
+- Unknown override field: inspect the stage config model or `uv run nemotron <family> <stage> --help`; Pydantic configs usually reject extra fields.
+- Hugging Face `429 Too Many Requests` or gated-model access errors: set `HF_TOKEN`, run `huggingface-cli login`, or reduce parallel work before retrying.
+
+## Stage 0 SDG
+
+- Missing `NVIDIA_API_KEY`: Stage 0 requires it. Ask the user to configure the environment, but do not ask them to paste the key.
+- API rate limits or flaky generation: reduce `max_parallel_requests_for_gen`, lower `batch_size`, or run a smaller pilot with fewer files.
+- Missing or low-quality generated QA: inspect a sample of generated JSON before lowering `quality_threshold`; improve corpus quality, chunking, or SDG model settings first.
+- Large corpus takes too long: use `num_files`, batch index ranges, or a representative pilot corpus before full generation.
+
+## Stage 1 Prep
+
+- GPU OOM during hard-negative mining: reduce `mining_batch_size`, sequence lengths, or the workload placed on the visible GPUs.
+- Few valid training rows: check Stage 0 quality scores and Stage 1 `quality_threshold`; confirm that the SDG output path points to the intended family output directory.
+- Train/eval comparisons shift unexpectedly: preserve the same `eval_beir/` split across runs.
+- Hard negatives are insufficient: ensure `hard_negatives_to_mine >= train_n_passages - 1`.
+
+## Stage 2 Finetune
+
+- OOM: reduce `local_batch_size`, `global_batch_size`, sequence length, or `train_n_passages`.
+- NaN or unstable loss: reduce learning rate, inspect corrupted data, and check positives/negatives in the unrolled training file.
+- Loss not decreasing: try a lower learning rate, inspect data quality, and confirm positives and hard negatives are sensible.
+- Overfitting: start real corpora at 1-2 epochs; the default 3 epochs is mainly for small example datasets.
+- Small datasets may trigger training-code auto-scaling of batch size or checkpoint/validation frequency. Preserve those log messages when reporting what happened.
+- Checkpoint expectations are wrong: Stage 3 and Stage 4 default to `checkpoints/LATEST/model/consolidated`; pass explicit paths when using older or custom checkpoints.
+- Rerank optimizer confusion: `optimizer_backend=auto` should use Transformer Engine FusedAdam in the container and FlashAdamW otherwise.
+
+## Stage 3 Eval
+
+- Fine-tuned model looks worse: confirm eval data, prefixes, sequence lengths, prompt template, pooling/normalization, and checkpoint path match training.
+- Reranker cannot improve recall: a reranker only reorders retrieved candidates. If relevant documents are missing from `top_k`, tune the embedder or retrieval index.
+- Metrics look noisy: increase held-out eval queries where possible and compare on a fixed `eval_beir/` split.
+- NIM eval mismatch: compare checkpoint vs ONNX vs TensorRT, then inspect quantization, pooling/normalization, prefixes, prompt template, and sequence lengths.
+
+## Stage 4 Export
+
+- ONNX export fails with attention kernels: keep `attn_implementation=eager` for export.
+- TensorRT export fails: first validate ONNX-only export with `export_to_trt=false`, then check the NeMo Export-Deploy container and TensorRT profile settings.
+- Rerank TensorRT instability: keep the layernorm FP32 overrides unless there is a tested reason to change them.
+
+## Stage 5 Deploy
+
+- Docker or NGC errors: confirm Docker runtime, GPU access, NGC login/access, and `NGC_API_KEY`.
+- Port conflicts: override `host_port` or stop the existing container.
+- Service starts but eval fails: run the family-specific smoke test from the reference, then run Stage 3 NIM eval with `eval_nim=true eval_base=false`.
+
+## Artifact Hygiene
+
+- Before rerunning stages, inspect the family output directory: `output/embed/` or `output/rerank/`.
+- Do not delete generated data, cached embeddings, checkpoints, exports, or running containers unless the user explicitly asks.
+- If stale artifacts may be causing shape or resume problems, explain the specific path and ask before cleanup.
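The hard-negative rule from the Stage 1 Prep section (`hard_negatives_to_mine >= train_n_passages - 1`) can be checked before launching prep. This is a minimal sketch, assuming each training example consumes one positive plus `train_n_passages - 1` negatives, which is an interpretation of the rule and not confirmed by the recipe code. `check_hard_negative_budget` is a hypothetical helper and not part of the `nemotron` CLI.

```python
def check_hard_negative_budget(hard_negatives_to_mine: int, train_n_passages: int) -> int:
    """Return how many mined negatives are spare, or raise if too few are mined.

    Assumption: one of the `train_n_passages` slots holds the positive passage,
    so Stage 2 needs `train_n_passages - 1` negatives per query.
    """
    if train_n_passages < 1:
        raise ValueError("train_n_passages must be at least 1")
    needed = train_n_passages - 1
    if hard_negatives_to_mine < needed:
        raise ValueError(
            f"hard_negatives_to_mine={hard_negatives_to_mine} is below the "
            f"{needed} negatives implied by train_n_passages={train_n_passages}"
        )
    return hard_negatives_to_mine - needed


if __name__ == "__main__":
    # Embed defaults quoted in references/embed.md: 5 mined, 5 passages per example.
    print(check_hard_negative_budget(hard_negatives_to_mine=5, train_n_passages=5))
```

With the documented embed defaults the check passes with one spare negative. Raising `train_n_passages` above 6 without also raising `hard_negatives_to_mine` would trip it.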
diff --git a/.agents/skills/retriever-finetune-recipe/SKILL.md b/.agents/skills/retriever-finetune-recipe/SKILL.md
@@ -0,0 +1,58 @@
+---
+name: retriever-finetune-recipe
+description: Operate Nemotron retriever fine-tuning recipes for embedding and reranking models. Use when Codex needs to plan, run, debug, tune, evaluate, export, deploy, document, or modify `nemotron embed ...` or `nemotron rerank ...` workflows; interpret BEIR, nDCG, Recall, hard-negative mining, Automodel training, ONNX/TensorRT export, or NIM deployment results; or choose between embedder and reranker personalization.
+---
+
+# Retriever Fine-Tune Recipe
+
+Use this skill to work with Nemotron embedding and reranking fine-tuning recipes in a source checkout or installed package. Prefer the current checkout over memory, because the recipe CLI, configs, containers, and output paths are actively changing.
+
+## First Decisions
+
+1. Identify the recipe family.
+   - Use `references/embed.md` for embedding, embed, bi-encoder, vector search, first-stage retrieval, low Recall@k, missing relevant documents, NIM embeddings, or `nemotron embed`.
+   - Use `references/rerank.md` for rerank, reranker, cross-encoder, second-stage retrieval, acceptable recall but poor top-rank ordering, low nDCG with good Recall, or `nemotron rerank`.
+   - Use both references only when the user asks about both families or asks which family to choose.
+2. Choose the model to tune from the retrieval failure mode.
+   - Prefer embedding fine-tuning when relevant documents are absent from the candidate set.
+   - Prefer reranker fine-tuning when relevant documents are retrieved but ordered poorly near the top.
+   - For production retrieval stacks, remember that these are complementary: embed first, rerank candidates second.
+3. Identify the intent: plan a run, execute a stage, debug a failure, tune hyperparameters, interpret metrics, export/deploy a model, or modify recipe code/configs.
+4. Inspect the current public surface before acting:
+   - Recipe files: `src/nemotron/recipes/<embed|rerank>/`
+   - CLI files: `src/nemotron/cli/commands/<embed|rerank>/`
+   - Default configs: `src/nemotron/recipes/<family>/stage*/config/default.yaml`
+   - Help and dry runs: `uv run nemotron <family> --help`, `uv run nemotron <family> <stage> -c default -d`
+
+## Safe Workflow
+
+1. Gather only task-relevant context: corpus path, existing SDG/training/eval data, target stage range, output directory, checkpoint path, execution mode, GPU IDs, and whether required secrets are configured. Never ask users to paste secret values.
+2. Start with cheap checks before expensive work:
+   - `uv run nemotron <family> --help`
+   - `uv run nemotron <family> <stage> --help`
+   - `uv run nemotron <family> <stage> -c default -d`
+   - `uv run nemotron <family> run -c default -d --from <stage> --to <stage>`
+3. Check prerequisites for the requested stage:
+   - Repo environment: `uv sync --all-extras` or the smallest relevant extra if documented by the repo.
+   - Stage 0 SDG: `NVIDIA_API_KEY`.
+   - Stage 1-4 GPU work: CUDA/NVIDIA driver availability and enough VRAM.
+   - Stage 4 export: the NeMo Export-Deploy container when using TensorRT.
+   - Stage 5 deploy: Docker, NGC access, and `NGC_API_KEY`.
+   - Remote execution: root `env.toml` profile for `--run` or `--batch`.
+4. Use dotlist overrides instead of editing defaults unless the user asks for reusable config changes. Keep sequence length, prefixes, pooling/normalization, prompt templates, and hard-negative counts consistent across stages.
+5. Avoid launching API, GPU, Docker, Slurm, NIM, or long-running jobs unless the user explicitly asked to run them. Offer or run dry-runs, config review, and small pilots first.
+6. If the user specifies GPU IDs, scope every stage command with `CUDA_VISIBLE_DEVICES=<ids>`.
+7. For multi-stage local runs, prefer `uv run nemotron <family> run -c default --from <stage> --to <stage>`. The default `run` target stops at `eval`; `export` and `deploy` are opt-in.
+8. For long-running SDG, prep, finetune, or eval work, start the process in a session-safe way and poll at human-scale intervals: roughly 60 seconds for small pilots and 120-300 seconds for larger runs.
+9. For failures, load `PITFALLS.md`, localize the failing stage, then inspect the stage config, expected inputs, output directory, and corresponding CLI wrapper or `run_uv.py`.
+
+## References
+
+- `references/embed.md`: embedding recipe stages, commands, defaults, output paths, and operating patterns.
+- `references/rerank.md`: rerank recipe stages, commands, defaults, output paths, and operating patterns.
+- `references/evaluation.md`: metric interpretation, comparison hygiene, and deployment readiness checks.
+- `PITFALLS.md`: common failures and recovery moves for SDG, prep, training, eval, export, deploy, and CLI setup.
+
+## Output Style
+
+Give concrete commands and file paths. State assumptions, expected inputs, expected outputs, and the cheapest validation step that proves the next action is ready. For long-running stages, separate preview commands from execution commands so the user can choose deliberately.
diff --git a/.agents/skills/retriever-finetune-recipe/agents/openai.yaml b/.agents/skills/retriever-finetune-recipe/agents/openai.yaml
@@ -0,0 +1,4 @@
+interface:
+  display_name: "Retriever Fine-Tune Recipe"
+  short_description: "Run and debug retrieval fine-tuning"
+  default_prompt: "Use $retriever-finetune-recipe to plan, run, debug, and validate Nemotron embedding or reranking recipe stages."
diff --git a/.agents/skills/retriever-finetune-recipe/references/embed.md b/.agents/skills/retriever-finetune-recipe/references/embed.md
@@ -0,0 +1,142 @@
+# Embedding Recipe Reference
+
+Load this reference for `nemotron embed ...` work or for questions about first-stage retrieval, bi-encoder training, low Recall@k, missing relevant documents, embedding NIMs, or re-indexing after model changes.
+
+## Contents
+
+- Grounding Paths
+- When To Use Embed
+- Commands
+- Stage Map
+- Important Defaults
+- Operating Patterns
+- NIM Smoke Test
+- Tests And Checks
+
+## Grounding Paths
+
+- Recipe README: `src/nemotron/recipes/embed/README.md`
+- CLI group: `src/nemotron/cli/commands/embed/_typer_group.py`
+- Pipeline command: `src/nemotron/cli/commands/embed/run.py`
+- Stage configs: `src/nemotron/recipes/embed/stage*/config/default.yaml`
+- Main outputs: `output/embed/`
+
+## When To Use Embed
+
+Use embedding fine-tuning when relevant documents are not retrieved into the candidate set, Recall@k is low, domain terms are poorly matched, or the user needs a better first-stage retrieval model. Embedding changes usually require re-embedding and re-indexing the deployment corpus.
+
+## Commands
+
+Use `uv run` when `nemotron` is not already available.
+
+```bash
+uv run nemotron embed info
+uv run nemotron embed --help
+uv run nemotron embed run -c default -d --from prep --to eval
+```
+
+Stage commands:
+
+```bash
+uv run nemotron embed sdg -c default corpus_dir=/path/to/docs
+uv run nemotron embed prep -c default
+uv run nemotron embed finetune -c default
+uv run nemotron embed eval -c default
+uv run nemotron embed export -c default
+uv run nemotron embed deploy -c default
+```
+
+Remote execution uses root `env.toml` profiles:
+
+```bash
+uv run nemotron embed finetune -c default --run my-cluster
+uv run nemotron embed finetune -c default --batch my-cluster
+```
+
+## Stage Map
+
+| Stage | Command | Input | Output | Notes |
+| --- | --- | --- | --- | --- |
+| 0 SDG | `embed sdg` | Text corpus or HF URI | `output/embed/stage0_sdg` | Requires `NVIDIA_API_KEY`; generates synthetic retrieval QA data. |
+| 1 prep | `embed prep` | Stage 0 output or existing QA data | `output/embed/stage1_data_prep` | Converts to train/eval data, mines hard negatives, creates BEIR eval data. |
+| 2 finetune | `embed finetune` | `train_mined.automodel_unrolled.json` | `output/embed/stage2_finetune/checkpoints` | Automodel contrastive training. |
+| 3 eval | `embed eval` | BEIR eval data and checkpoint | `output/embed/stage3_eval/eval_results.json` | Compares base vs fine-tuned on nDCG, Recall, Precision, and MAP. |
+| 4 export | `embed export` | Fine-tuned HF checkpoint | `output/embed/stage4_export` | Default config exports ONNX only; set `export_to_trt=true` for TensorRT. |
+| 5 deploy | `embed deploy` | ONNX/TensorRT model dir | NIM on `host_port` | Requires Docker/NGC setup and `NGC_API_KEY`. |
+
+The pipeline order is `sdg`, `prep`, `finetune`, `eval`, `export`, `deploy`; `embed run` defaults to `--to eval`.
+
+## Important Defaults
+
+Stage 0:
+
+- Sample corpus: `hf://nvidia/Retrieval-Synthetic-NVDocs-v1@1c0d1856f3fb595b2dda98d4b61061fa6d782d51/sample_corpus/nv_pp_random`
+- Output: `./output/embed/stage0_sdg`
+- Generation model: `nvidia/nemotron-3-nano-30b-a3b`
+- SDG embedding model: `nvidia/llama-3.2-nv-embedqa-1b-v2`
+- Useful overrides: `corpus_dir`, `num_pairs`, `sentences_per_chunk`, `file_extensions`, `max_parallel_requests_for_gen`, `preview=true`
+
+Stage 1:
+
+- Input: `./output/embed/stage0_sdg`
+- Output: `./output/embed/stage1_data_prep`
+- Base model for mining: `nvidia/llama-nemotron-embed-1b-v2`
+- Quality threshold: `7.0`
+- Split: `train_ratio=0.8`, `val_ratio=0`, `test_ratio=0.2`
+- Hard negatives: `hard_negatives_to_mine=5`, `hard_neg_margin=0.95`, `mining_batch_size=128`
+
+Stage 2:
+
+- Base model: `nvidia/llama-nemotron-embed-1b-v2`
+- Train data: `./output/embed/stage1_data_prep/train_mined.automodel_unrolled.json`
+- Checkpoints: `./output/embed/stage2_finetune/checkpoints`
+- Defaults: `num_epochs=3`, `global_batch_size=128`, `local_batch_size=4`, `learning_rate=1.0e-5`, `temperature=0.02`, `train_n_passages=5`
+- Prefixes: `query_prefix="query:"`, `passage_prefix="passage:"`
+- For real corpora, start with 1-2 epochs unless Stage 3 metrics still improve; the 3-epoch default is for small examples.
+
+Stage 3:
+
+- Eval data: `./output/embed/stage1_data_prep/eval_beir`
+- Fine-tuned model: `./output/embed/stage2_finetune/checkpoints/LATEST/model/consolidated`
+- Metrics: `k_values=[1,5,10,100]`
+- Modes: `eval_base=true`, `eval_finetuned=true`, `eval_nim=false`
+- NIM verification: `uv run nemotron embed eval -c default eval_nim=true eval_base=false`
+
+Stage 4:
+
+- Model path: `./output/embed/stage2_finetune/checkpoints/LATEST/model/consolidated`
+- ONNX output: `./output/embed/stage4_export/onnx`
+- TensorRT output: `./output/embed/stage4_export/tensorrt`
+- `attn_implementation=eager` is the export-safe default.
+
+Stage 5:
+
+- NIM image: `nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.10.1`
+- Container: `nemotron-embed-nim`
+- Default API: `http://localhost:8000/v1/embeddings`
+
+## Operating Patterns
+
+- Skip SDG when the user already has generated QA pairs or wants NVIDIA's pre-generated dataset; start Stage 1 with `sdg_input_path`.
+- For production-like chunks, align `sentences_per_chunk`, `passage_max_length`, and eval `max_length` with expected retrieval chunks.
+- When increasing sequence length, reduce batch sizes up front rather than waiting to recover from an OOM.
+- Mine at least as many hard negatives as Stage 2 will consume: `hard_negatives_to_mine >= train_n_passages - 1`.
+- Preserve `output/embed/stage1_data_prep/eval_beir/` across comparisons so metrics are not shifted by new splits.
+- Use `val_ratio=0` only for small datasets where preserving test size matters; use a validation split for larger datasets.
+- Inspect existing `output/embed/` artifacts before rerunning a stage. Ask before deleting checkpoints, cached embeddings, or generated data.
+
+## NIM Smoke Test
+
+```bash
+curl -X POST http://localhost:8000/v1/embeddings \
+  -H 'Content-Type: application/json' \
+  -d '{"input": ["hello"], "model": "nvidia/llama-3.2-nv-embedqa-1b-v2", "input_type": "query"}'
+```
+
+## Tests And Checks
+
+```bash
+uv run nemotron embed --help
+uv run nemotron embed finetune -c default -d
+uv run pytest tests/recipes/embed tests/nemo_runspec/test_execution_uv_spec.py -q
+```
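The stage order and the `--to eval` default described under Stage Map can be sketched as a small range resolver. This is an illustration of the documented behavior, not the implementation in `src/nemotron/cli/commands/embed/run.py`. `stages_in_range` is a hypothetical name, and `start` is left as a required argument because the default for `--from` is not stated in this reference.

```python
PIPELINE = ("sdg", "prep", "finetune", "eval", "export", "deploy")


def stages_in_range(start: str, stop: str = "eval") -> list[str]:
    """Return the stages from `start` through `stop`, inclusive, in pipeline order."""
    for name in (start, stop):
        if name not in PIPELINE:
            raise ValueError(f"unknown stage: {name!r}")
    first, last = PIPELINE.index(start), PIPELINE.index(stop)
    if first > last:
        raise ValueError(f"--from {start} comes after --to {stop}")
    return list(PIPELINE[first:last + 1])


if __name__ == "__main__":
    # Mirrors `embed run --from prep` with the documented `--to eval` default.
    print(stages_in_range("prep"))
    # `export` and `deploy` only run when requested explicitly.
    print(stages_in_range("eval", "deploy"))
```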
diff --git a/.agents/skills/retriever-finetune-recipe/references/evaluation.md b/.agents/skills/retriever-finetune-recipe/references/evaluation.md
@@ -0,0 +1,30 @@
+# Evaluation Practices
+
+Use Stage 3 metrics as the source of truth for recipe quality. Training loss is useful for diagnosing learning dynamics, but it is not retrieval accuracy.
+
+## Minimum Practice
+
+- Compare base vs fine-tuned on the same held-out eval set.
+- Keep the Stage 1 `eval_beir/` split fixed across hyperparameter, SDG, and data-volume comparisons.
+- Inspect `output/embed/stage3_eval/eval_results.json` or `output/rerank/stage3_eval/eval_results.json`.
+- Prioritize nDCG@10 and Recall@10, then check the rest of the k values for consistency.
+- Use at least 100 eval queries when possible; 200-500 is better for detecting small changes.
+- Treat less than roughly 5 absolute points of nDCG@10 improvement as a reason to inspect data quality, SDG coverage, hard negatives, and hyperparameters before deployment.
+- For rerank, treat high Recall with low nDCG as a ranking problem; treat low Recall as a first-stage retrieval or embedding problem.
+- Public benchmarks can be useful for broad sanity checks, but recipe personalization should be judged on the recipe's domain-specific held-out eval split.
+
+## Experiment Hygiene
+
+- Save the exact command, dotlist overrides, git commit, config files, and output directory for each run.
+- Change one major variable at a time.
+- Start embedding LR sweeps near `5e-6`, `1e-5`, and `2e-5`.
+- Start rerank LR sweeps near `1e-6`, `3e-6`, and `1e-5`.
+- Start real datasets at 1-2 epochs unless validation and Stage 3 metrics continue improving.
+- Evaluate data saturation by running 25%, 50%, and 100% corpus sizes with the same held-out eval set.
+
+## Deployment Checks
+
+- Evaluate the exported or served model against the same eval set.
+- For embedding NIM, use `uv run nemotron embed eval -c default eval_nim=true eval_base=false`.
+- For rerank NIM, use `uv run nemotron rerank eval -c default eval_nim=true eval_base=false`.
+- If metrics drift after export or deploy, check ONNX vs TensorRT, quantization, pooling, normalization, prefixes, prompt templates, and sequence length.
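To make the Minimum Practice guidance concrete, the sketch below computes Recall@k and nDCG@k for a single query with binary relevance, then flags a gain below the roughly 5-point threshold. It is a textbook formulation, and the recipe's own evaluator may differ in tie handling or graded relevance. The function names are hypothetical, and metric values are assumed to be fractions in [0, 1], so 5 points corresponds to 0.05.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(position + 2)  # position is 0-based, so rank 1 -> log2(2)
        for position, doc_id in enumerate(ranked[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


def gain_is_marginal(base: float, tuned: float, min_points: float = 5.0) -> bool:
    """True when the absolute improvement is below `min_points` (on a 0-100 scale)."""
    return (tuned - base) * 100.0 < min_points


if __name__ == "__main__":
    ranked = ["d3", "d1", "d7"]   # retrieval order for one query (illustrative IDs)
    relevant = {"d1", "d2"}       # ground-truth relevant documents
    print(recall_at_k(ranked, relevant, k=2))
    print(round(ndcg_at_k(ranked, relevant, k=2), 4))
    print(gain_is_marginal(base=0.40, tuned=0.42))
```

In this example only one of the two relevant documents is retrieved, so Recall@2 is 0.5. That is the pattern the rerank guidance above attributes to first-stage retrieval, since reordering cannot recover `d2`.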