60 changes: 60 additions & 0 deletions .agents/skills/retriever-finetune-recipe/PITFALLS.md
@@ -0,0 +1,60 @@
# Retriever Recipe Pitfalls

Load this file when a recipe command fails, metrics look wrong, or the user asks for debugging help.

## Setup And CLI

- CLI help or dry-run fails before reaching `embed` or `rerank` with a missing optional dependency: run the repo's documented sync path, usually `uv sync --all-extras`. If the error names Data Designer, the smaller recovery may be `uv sync --extra data-sdg`.
- `uv run` rebuilds or installs packages unexpectedly: report that the environment is being prepared, then continue with help/dry-run before launching work.
- CUDA symbol, `nvJitLink`, or library mismatch errors: clear inherited CUDA library paths with `LD_LIBRARY_PATH=""` for the command, then rerun the cheapest failing validation.
- Unknown override field: inspect the stage config model or `uv run nemotron <family> <stage> --help`; Pydantic configs usually reject extra fields.
- Hugging Face `429 Too Many Requests` or gated-model access errors: set `HF_TOKEN`, run `huggingface-cli login`, or reduce parallel work before retrying.
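
As a minimal sketch of the `LD_LIBRARY_PATH=""` recovery above, the helper below builds a per-command environment instead of changing the shell session. The helper name `scoped_env` is hypothetical and not part of the recipe; the dry-run command is the one documented for the finetune stage.

```python
import os


def scoped_env(base=None, clear_cuda_library_path=False):
    """Return a copy of the environment with per-command overrides applied.

    Hypothetical helper for illustration; it is not part of the nemotron CLI.
    """
    env = dict(os.environ if base is None else base)
    if clear_cuda_library_path:
        # Same effect as prefixing one command with LD_LIBRARY_PATH="".
        env["LD_LIBRARY_PATH"] = ""
    return env


# Cheapest validation to rerun after the fix: a dry run of the failing stage.
DRY_RUN = ["uv", "run", "nemotron", "embed", "finetune", "-c", "default", "-d"]

# Example (not executed here):
#   import subprocess
#   subprocess.run(DRY_RUN, env=scoped_env(clear_cuda_library_path=True))
```

Because the override lives in a copied dictionary, the inherited library path stays untouched for every other command in the session.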

## Stage 0 SDG

- Missing `NVIDIA_API_KEY`: Stage 0 requires it. Ask the user to configure the environment, but do not ask them to paste the key.
- API rate limits or flaky generation: reduce `max_parallel_requests_for_gen`, lower `batch_size`, or run a smaller pilot with fewer files.
- No or low-quality generated QA: inspect a sample of generated JSON before lowering `quality_threshold`; improve corpus quality, chunking, or SDG model settings first.
- Large corpus takes too long: use `num_files`, batch index ranges, or a representative pilot corpus before full generation.
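
One way to confirm that a key is configured without ever seeing its value is to test for presence only. The sketch below assumes the secrets are supplied as environment variables; the function name and the stage-to-secret mapping are illustrative, with the variable names taken from this skill.

```python
import os

# Secrets named in this skill, keyed by the stage that needs them.
REQUIRED_SECRETS = {
    "sdg": ["NVIDIA_API_KEY"],
    "deploy": ["NGC_API_KEY"],
}


def missing_secrets(stage, env=None):
    """List required variables that are unset or blank for a stage.

    Only variable names are returned; values are never read out or printed.
    """
    env = os.environ if env is None else env
    names = REQUIRED_SECRETS.get(stage, [])
    return [name for name in names if not env.get(name, "").strip()]
```

If the returned list is non-empty, ask the user to configure those variables in their own environment.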

## Stage 1 Prep

- GPU OOM during hard-negative mining: reduce `mining_batch_size`, shorten sequence lengths, or lower the workload on the visible GPUs.
- Few valid training rows: check Stage 0 quality scores and Stage 1 `quality_threshold`; confirm SDG output path points to the intended family output directory.
- Train/eval comparisons shift unexpectedly: preserve the same `eval_beir/` split across runs.
- Hard negatives are insufficient: ensure `hard_negatives_to_mine >= train_n_passages - 1`.
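
The hard-negative constraint can be checked before Stage 1 runs. This is a small sketch of that inequality only; the function name is hypothetical, and the reading that one of the `train_n_passages` slots holds the positive passage is an assumption inferred from the `- 1` in the rule.

```python
def check_hard_negatives(hard_negatives_to_mine, train_n_passages):
    """Return (ok, needed) for hard_negatives_to_mine >= train_n_passages - 1."""
    # Assumption: one passage slot per query is the positive, the rest are negatives.
    needed = train_n_passages - 1
    return hard_negatives_to_mine >= needed, needed


# Embed defaults listed in this skill: hard_negatives_to_mine=5, train_n_passages=5.
ok, needed = check_hard_negatives(5, 5)
```

With the embed defaults the check passes, since 5 mined negatives cover the 4 that training consumes.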

## Stage 2 Finetune

- OOM: reduce `local_batch_size`, `global_batch_size`, sequence length, or `train_n_passages`.
- NaN or unstable loss: reduce learning rate, inspect corrupted data, and check positives/negatives in the unrolled training file.
- Loss not decreasing: try a lower learning rate, inspect data quality, and confirm positives and hard negatives are sensible.
- Overfitting: start real corpora at 1-2 epochs; the default of 3 epochs is mainly for small example datasets.
- Small datasets may trigger training-code auto-scaling of batch size or checkpoint/validation frequency. Preserve those log messages when reporting what happened.
- Checkpoint expectations are wrong: Stage 3 and Stage 4 default to `checkpoints/LATEST/model/consolidated`; pass explicit paths when using older or custom checkpoints.
- Rerank optimizer confusion: `optimizer_backend=auto` should use Transformer Engine FusedAdam in the container and FlashAdamW otherwise.

## Stage 3 Eval

- Fine-tuned model looks worse: confirm eval data, prefixes, sequence lengths, prompt template, pooling/normalization, and checkpoint path match training.
- Reranker cannot improve recall: a reranker only reorders retrieved candidates. If relevant documents are missing from `top_k`, tune the embedder or retrieval index.
- Metrics look noisy: increase held-out eval queries where possible and compare on a fixed `eval_beir/` split.
- NIM eval mismatch: compare checkpoint vs ONNX vs TensorRT, then inspect quantization, pooling/normalization, prefixes, prompt template, and sequence lengths.
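
To make the reranker point concrete, here is a minimal sketch with binary relevance and made-up document IDs. It is not the recipe's evaluation code. Reordering a fixed candidate list changes nDCG but leaves Recall at that depth unchanged, because the missing relevant document (`d9` here) was never retrieved.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents that appear in the top k."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG@k with a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc in enumerate(ranked[:k])
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg


relevant = {"d1", "d2", "d9"}          # d9 is never retrieved
retrieved = ["d3", "d1", "d7", "d2"]   # first-stage order
reranked = ["d1", "d2", "d3", "d7"]    # same candidates, better order

same_recall = recall_at_k(retrieved, relevant, 4) == recall_at_k(reranked, relevant, 4)
ndcg_improved = ndcg_at_k(reranked, relevant, 4) > ndcg_at_k(retrieved, relevant, 4)
```

Raising Recall in this example requires getting `d9` into the candidate set, which is the embedder's or the index's job.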

## Stage 4 Export

- ONNX export fails with attention kernels: keep `attn_implementation=eager` for export.
- TensorRT export fails: first validate ONNX-only export with `export_to_trt=false`, then check the NeMo Export-Deploy container and TensorRT profile settings.
- Rerank TensorRT instability: keep the layernorm FP32 overrides unless there is a tested reason to change them.

## Stage 5 Deploy

- Docker or NGC errors: confirm Docker runtime, GPU access, NGC login/access, and `NGC_API_KEY`.
- Port conflicts: override `host_port` or stop the existing container.
- Service starts but eval fails: run the family-specific smoke test from the reference, then run Stage 3 NIM eval with `eval_nim=true eval_base=false`.
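
Before overriding `host_port`, it can help to confirm that something is really listening on the port. The sketch below uses only the standard library; the function name is hypothetical, and port 8000 is taken from the embed default API URL rather than from the deploy config itself.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


# Example (not executed here): port_in_use(8000)
```

A `True` result only says the port is occupied, not by what, so identify the owning container or process before stopping anything.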

## Artifact Hygiene

- Before rerunning stages, inspect the family output directory: `output/embed/` or `output/rerank/`.
- Do not delete generated data, cached embeddings, checkpoints, exports, or running containers unless the user explicitly asks.
- If stale artifacts may be causing shape or resume problems, explain the specific path and ask before cleanup.
58 changes: 58 additions & 0 deletions .agents/skills/retriever-finetune-recipe/SKILL.md
@@ -0,0 +1,58 @@
---
name: retriever-finetune-recipe
description: Operate Nemotron retriever fine-tuning recipes for embedding and reranking models. Use when Codex needs to plan, run, debug, tune, evaluate, export, deploy, document, or modify `nemotron embed ...` or `nemotron rerank ...` workflows; interpret BEIR, nDCG, Recall, hard-negative mining, Automodel training, ONNX/TensorRT export, or NIM deployment results; or choose between embedder and reranker personalization.
---

# Retriever Fine-Tune Recipe

Use this skill to work with Nemotron embedding and reranking fine-tuning recipes in a source checkout or installed package. Prefer the current checkout over memory, because the recipe CLI, configs, containers, and output paths are actively changing.

## First Decisions

1. Identify the recipe family.
   - Use `references/embed.md` for embedding, embed, bi-encoder, vector search, first-stage retrieval, low Recall@k, missing relevant documents, NIM embeddings, or `nemotron embed`.
   - Use `references/rerank.md` for rerank, reranker, cross-encoder, second-stage retrieval, acceptable recall but poor top-rank ordering, low nDCG with good Recall, or `nemotron rerank`.
   - Use both references only when the user asks about both families or asks which family to choose.
2. Choose the model to tune from the retrieval failure mode.
   - Prefer embedding fine-tuning when relevant documents are absent from the candidate set.
   - Prefer reranker fine-tuning when relevant documents are retrieved but ordered poorly near the top.
   - For production retrieval stacks, remember that these are complementary: embed first, rerank candidates second.
3. Identify the intent: plan a run, execute a stage, debug a failure, tune hyperparameters, interpret metrics, export/deploy a model, or modify recipe code/configs.
4. Inspect the current public surface before acting:
   - Recipe files: `src/nemotron/recipes/<embed|rerank>/`
   - CLI files: `src/nemotron/cli/commands/<embed|rerank>/`
   - Default configs: `src/nemotron/recipes/<family>/stage*/config/default.yaml`
   - Help and dry runs: `uv run nemotron <family> --help`, `uv run nemotron <family> <stage> -c default -d`
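
The family choice in step 2 can be sketched as a simple rule over two first-look metrics. The function name and both thresholds below are placeholders invented for illustration, not values from the recipe; replace them with numbers that fit the target domain.

```python
def suggest_family(recall_at_k, ndcg_at_k, recall_floor=0.8, ndcg_floor=0.6):
    """Map a retrieval failure mode to the recipe family to tune first.

    recall_floor and ndcg_floor are hypothetical cut-offs, not recipe defaults.
    """
    if recall_at_k < recall_floor:
        # Relevant documents are missing from the candidate set.
        return "embed"
    if ndcg_at_k < ndcg_floor:
        # Candidates are present but ordered poorly near the top.
        return "rerank"
    return "neither"
```

Recall is checked first because a reranker cannot recover documents that first-stage retrieval never returned.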

## Safe Workflow

1. Gather only task-relevant context: corpus path, existing SDG/training/eval data, target stage range, output directory, checkpoint path, execution mode, GPU IDs, and whether required secrets are configured. Never ask users to paste secret values.
2. Start with cheap checks before expensive work:
   - `uv run nemotron <family> --help`
   - `uv run nemotron <family> <stage> --help`
   - `uv run nemotron <family> <stage> -c default -d`
   - `uv run nemotron <family> run -c default -d --from <stage> --to <stage>`
3. Check prerequisites for the requested stage:
   - Repo environment: `uv sync --all-extras` or the smallest relevant extra if documented by the repo.
   - Stage 0 SDG: `NVIDIA_API_KEY`.
   - Stage 1-4 GPU work: CUDA/NVIDIA driver availability and enough VRAM.
   - Stage 4 export: the NeMo Export-Deploy container when using TensorRT.
   - Stage 5 deploy: Docker, NGC access, and `NGC_API_KEY`.
   - Remote execution: root `env.toml` profile for `--run` or `--batch`.
4. Use dotlist overrides instead of editing defaults unless the user asks for reusable config changes. Keep sequence length, prefixes, pooling/normalization, prompt templates, and hard-negative counts consistent across stages.
5. Avoid launching API, GPU, Docker, Slurm, NIM, or long-running jobs unless the user explicitly asked to run them. Offer or run dry-runs, config review, and small pilots first.
6. If the user specifies GPU IDs, scope every stage command with `CUDA_VISIBLE_DEVICES=<ids>`.
7. For multi-stage local runs, prefer `uv run nemotron <family> run -c default --from <stage> --to <stage>`. The default `run` target stops at `eval`; `export` and `deploy` are opt-in.
8. For long-running SDG, prep, finetune, or eval work, start the process in a session-safe way and poll at human-scale intervals: roughly 60 seconds for small pilots and 120-300 seconds for larger runs.
9. For failures, load `PITFALLS.md`, localize the failing stage, then inspect the stage config, expected inputs, output directory, and corresponding CLI wrapper or `run_uv.py`.
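
Steps 2, 6, and 7 can be combined into a small command builder that defaults to a dry run. This is a sketch under stated assumptions: the helper names are hypothetical, and the stage names are the embed pipeline order documented in `references/embed.md`, assumed here to apply to rerank as well.

```python
import os

# Pipeline order as documented for the embed recipe (assumed shared by rerank).
STAGES = ["sdg", "prep", "finetune", "eval", "export", "deploy"]


def build_run_command(family, start, stop="eval", dry_run=True):
    """Build the argv for `uv run nemotron <family> run ...`."""
    if family not in ("embed", "rerank"):
        raise ValueError(f"unknown family: {family}")
    # list.index raises ValueError for an unknown stage name.
    if STAGES.index(start) > STAGES.index(stop):
        raise ValueError(f"--from {start} comes after --to {stop}")
    cmd = ["uv", "run", "nemotron", family, "run", "-c", "default"]
    if dry_run:
        cmd.append("-d")
    return cmd + ["--from", start, "--to", stop]


def gpu_env(gpu_ids, base=None):
    """Copy the environment and scope it to the requested GPU IDs."""
    env = dict(os.environ if base is None else base)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env
```

Keeping `dry_run=True` as the default means an expensive launch takes a deliberate `dry_run=False`, which matches the preview-then-execute split this skill asks for.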

## References

- `references/embed.md`: embedding recipe stages, commands, defaults, output paths, and operating patterns.
- `references/rerank.md`: rerank recipe stages, commands, defaults, output paths, and operating patterns.
- `references/evaluation.md`: metric interpretation, comparison hygiene, and deployment readiness checks.
- `PITFALLS.md`: common failures and recovery moves for SDG, prep, training, eval, export, deploy, and CLI setup.

## Output Style

Give concrete commands and file paths. State assumptions, expected inputs, expected outputs, and the cheapest validation step that proves the next action is ready. For long-running stages, separate preview commands from execution commands so the user can choose deliberately.
4 changes: 4 additions & 0 deletions .agents/skills/retriever-finetune-recipe/agents/openai.yaml
@@ -0,0 +1,4 @@
interface:
  display_name: "Retriever Fine-Tune Recipe"
  short_description: "Run and debug retrieval fine-tuning"
  default_prompt: "Use $retriever-finetune-recipe to plan, run, debug, and validate Nemotron embedding or reranking recipe stages."
142 changes: 142 additions & 0 deletions .agents/skills/retriever-finetune-recipe/references/embed.md
@@ -0,0 +1,142 @@
# Embedding Recipe Reference

Load this reference for `nemotron embed ...` work or for questions about first-stage retrieval, bi-encoder training, low Recall@k, missing relevant documents, embedding NIMs, or re-indexing after model changes.

## Contents

- Grounding Paths
- When To Use Embed
- Commands
- Stage Map
- Important Defaults
- Operating Patterns
- NIM Smoke Test
- Tests And Checks

## Grounding Paths

- Recipe README: `src/nemotron/recipes/embed/README.md`
- CLI group: `src/nemotron/cli/commands/embed/_typer_group.py`
- Pipeline command: `src/nemotron/cli/commands/embed/run.py`
- Stage configs: `src/nemotron/recipes/embed/stage*/config/default.yaml`
- Main outputs: `output/embed/`

## When To Use Embed

Use embedding fine-tuning when relevant documents are not retrieved into the candidate set, Recall@k is low, domain terms are poorly matched, or the user needs a better first-stage retrieval model. Embedding changes usually require re-embedding and re-indexing the deployment corpus.

## Commands

Use `uv run` when `nemotron` is not already available.

```bash
uv run nemotron embed info
uv run nemotron embed --help
uv run nemotron embed run -c default -d --from prep --to eval
```

Stage commands:

```bash
uv run nemotron embed sdg -c default corpus_dir=/path/to/docs
uv run nemotron embed prep -c default
uv run nemotron embed finetune -c default
uv run nemotron embed eval -c default
uv run nemotron embed export -c default
uv run nemotron embed deploy -c default
```

Remote execution uses root `env.toml` profiles:

```bash
uv run nemotron embed finetune -c default --run my-cluster
uv run nemotron embed finetune -c default --batch my-cluster
```

## Stage Map

| Stage | Command | Input | Output | Notes |
| --- | --- | --- | --- | --- |
| 0 SDG | `embed sdg` | Text corpus or HF URI | `output/embed/stage0_sdg` | Requires `NVIDIA_API_KEY`; generates synthetic retrieval QA data. |
| 1 prep | `embed prep` | Stage 0 output or existing QA data | `output/embed/stage1_data_prep` | Converts to train/eval data, mines hard negatives, creates BEIR eval data. |
| 2 finetune | `embed finetune` | `train_mined.automodel_unrolled.json` | `output/embed/stage2_finetune/checkpoints` | Automodel contrastive training. |
| 3 eval | `embed eval` | BEIR eval data and checkpoint | `output/embed/stage3_eval/eval_results.json` | Compare base vs fine-tuned on nDCG, Recall, Precision, and MAP. |
| 4 export | `embed export` | Fine-tuned HF checkpoint | `output/embed/stage4_export` | Default config exports ONNX only; set `export_to_trt=true` for TensorRT. |
| 5 deploy | `embed deploy` | ONNX/TensorRT model dir | NIM on `host_port` | Requires Docker/NGC setup and `NGC_API_KEY`. |

The pipeline order is `sdg`, `prep`, `finetune`, `eval`, `export`, `deploy`; `embed run` defaults to `--to eval`.

## Important Defaults

Stage 0:

- Sample corpus: `hf://nvidia/Retrieval-Synthetic-NVDocs-v1@1c0d1856f3fb595b2dda98d4b61061fa6d782d51/sample_corpus/nv_pp_random`
- Output: `./output/embed/stage0_sdg`
- Generation model: `nvidia/nemotron-3-nano-30b-a3b`
- SDG embedding model: `nvidia/llama-3.2-nv-embedqa-1b-v2`
- Useful overrides: `corpus_dir`, `num_pairs`, `sentences_per_chunk`, `file_extensions`, `max_parallel_requests_for_gen`, `preview=true`

Stage 1:

- Input: `./output/embed/stage0_sdg`
- Output: `./output/embed/stage1_data_prep`
- Base model for mining: `nvidia/llama-nemotron-embed-1b-v2`
- Quality threshold: `7.0`
- Split: `train_ratio=0.8`, `val_ratio=0`, `test_ratio=0.2`
- Hard negatives: `hard_negatives_to_mine=5`, `hard_neg_margin=0.95`, `mining_batch_size=128`

Stage 2:

- Base model: `nvidia/llama-nemotron-embed-1b-v2`
- Train data: `./output/embed/stage1_data_prep/train_mined.automodel_unrolled.json`
- Checkpoints: `./output/embed/stage2_finetune/checkpoints`
- Defaults: `num_epochs=3`, `global_batch_size=128`, `local_batch_size=4`, `learning_rate=1.0e-5`, `temperature=0.02`, `train_n_passages=5`
- Prefixes: `query_prefix="query:"`, `passage_prefix="passage:"`
- For real corpora, start with 1-2 epochs unless Stage 3 metrics still improve; the 3 epoch default is for small examples.

Stage 3:

- Eval data: `./output/embed/stage1_data_prep/eval_beir`
- Fine-tuned model: `./output/embed/stage2_finetune/checkpoints/LATEST/model/consolidated`
- Metrics: `k_values=[1,5,10,100]`
- Modes: `eval_base=true`, `eval_finetuned=true`, `eval_nim=false`
- NIM verification: `uv run nemotron embed eval -c default eval_nim=true eval_base=false`

Stage 4:

- Model path: `./output/embed/stage2_finetune/checkpoints/LATEST/model/consolidated`
- ONNX output: `./output/embed/stage4_export/onnx`
- TensorRT output: `./output/embed/stage4_export/tensorrt`
- `attn_implementation=eager` is the export-safe default.

Stage 5:

- NIM image: `nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.10.1`
- Container: `nemotron-embed-nim`
- Default API: `http://localhost:8000/v1/embeddings`

## Operating Patterns

- Skip SDG when the user already has generated QA pairs or wants NVIDIA's pre-generated dataset; start Stage 1 with `sdg_input_path`.
- For production-like chunks, align `sentences_per_chunk`, `passage_max_length`, and eval `max_length` with expected retrieval chunks.
- When increasing sequence length, reduce batch sizes up front instead of waiting to recover from an OOM.
- Mine at least as many hard negatives as Stage 2 will consume: `hard_negatives_to_mine >= train_n_passages - 1`.
- Preserve `output/embed/stage1_data_prep/eval_beir/` across comparisons so metrics are not shifted by new splits.
- Use `val_ratio=0` only for small datasets where preserving test size matters; use a validation split for larger datasets.
- Inspect existing `output/embed/` artifacts before rerunning a stage. Ask before deleting checkpoints, cached embeddings, or generated data.
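
One way to confirm that `eval_beir/` really is unchanged between runs is to fingerprint the directory and store the digest alongside each run. The sketch below uses only the standard library; the function name is hypothetical, and the recipe itself is not known to provide such a check.

```python
import hashlib
from pathlib import Path


def fingerprint_dir(root):
    """Return a SHA-256 digest over relative file paths and file contents."""
    root = Path(root)
    digest = hashlib.sha256()
    # Sort so the digest does not depend on filesystem listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        data = path.read_bytes()
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        # Length prefix keeps adjacent files from blending into each other.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


# Example (not executed here):
#   fingerprint_dir("output/embed/stage1_data_prep/eval_beir")
```

If two runs report different digests, their metrics were computed on different eval data and should not be compared directly.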

## NIM Smoke Test

```bash
curl -X POST http://localhost:8000/v1/embeddings \
-H 'Content-Type: application/json' \
-d '{"input": ["hello"], "model": "nvidia/llama-3.2-nv-embedqa-1b-v2", "input_type": "query"}'
```
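
The same request can be issued from Python with only the standard library. This is a sketch that mirrors the `curl` payload above; the function names are hypothetical, and nothing is sent until `send` is called against a running NIM.

```python
import json
import urllib.request

NIM_URL = "http://localhost:8000/v1/embeddings"
MODEL = "nvidia/llama-3.2-nv-embedqa-1b-v2"


def build_embedding_request(texts, input_type="query", url=NIM_URL, model=MODEL):
    """Build (but do not send) the POST request used by the smoke test."""
    payload = {"input": list(texts), "model": model, "input_type": input_type}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires the NIM container to be up):
#   result = send(build_embedding_request(["hello"]))
```

Separating request construction from sending keeps the payload inspectable without a live service.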

## Tests And Checks

```bash
uv run nemotron embed --help
uv run nemotron embed finetune -c default -d
uv run pytest tests/recipes/embed tests/nemo_runspec/test_execution_uv_spec.py -q
```
30 changes: 30 additions & 0 deletions .agents/skills/retriever-finetune-recipe/references/evaluation.md
@@ -0,0 +1,30 @@
# Evaluation Practices

Use Stage 3 metrics as the source of truth for recipe quality. Training loss is useful for diagnosing learning dynamics, but it is not retrieval accuracy.

## Minimum Practice

- Compare base vs fine-tuned on the same held-out eval set.
- Keep the Stage 1 `eval_beir/` split fixed across hyperparameter, SDG, and data-volume comparisons.
- Inspect `output/embed/stage3_eval/eval_results.json` or `output/rerank/stage3_eval/eval_results.json`.
- Prioritize nDCG@10 and Recall@10, then check the rest of the k values for consistency.
- Use at least 100 eval queries when possible; 200-500 is better for detecting small changes.
- If nDCG@10 improves by less than roughly 5 absolute points, inspect data quality, SDG coverage, hard negatives, and hyperparameters before deployment.
- For rerank, treat high Recall with low nDCG as a ranking problem; treat low Recall as a first-stage retrieval or embedding problem.
- Public benchmarks can be useful for broad sanity checks, but recipe personalization should be judged on the recipe's domain-specific held-out eval split.
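
The 5-point guideline can be applied mechanically once both scores are in hand. The sketch below assumes nDCG@10 is reported on a 0-1 scale, so one absolute point is 0.01; the function names are hypothetical, and reading the scores out of `eval_results.json` is left out because its schema is not described here.

```python
def ndcg_gain_points(base_ndcg10, finetuned_ndcg10):
    """Absolute nDCG@10 gain in points, assuming scores on a 0-1 scale."""
    return (finetuned_ndcg10 - base_ndcg10) * 100.0


def needs_review(base_ndcg10, finetuned_ndcg10, min_gain_points=5.0):
    """True when the gain is below the rough 5-point guideline."""
    return ndcg_gain_points(base_ndcg10, finetuned_ndcg10) < min_gain_points
```

For example, moving from 0.50 to 0.53 is a 3-point gain and would call for a closer look, while 0.50 to 0.58 would not.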

## Experiment Hygiene

- Save the exact command, dotlist overrides, git commit, config files, and output directory for each run.
- Change one major variable at a time.
- Start embedding LR sweeps near `5e-6`, `1e-5`, and `2e-5`.
- Start rerank LR sweeps near `1e-6`, `3e-6`, and `1e-5`.
- Start real datasets at 1-2 epochs unless validation and Stage 3 metrics continue improving.
- Evaluate data saturation by running 25%, 50%, and 100% corpus sizes with the same held-out eval set.
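
A run record covering the first bullet can be written with a few lines of standard-library Python. This is a sketch, not a recipe feature: the file name `run_record.json` and both function names are hypothetical, and copying the config files themselves is left to the caller.

```python
import datetime
import json
import subprocess
from pathlib import Path


def current_git_commit(repo_dir="."):
    """Return the HEAD commit hash, or None when git is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def write_run_record(output_dir, command, overrides, repo_dir="."):
    """Write command, dotlist overrides, commit, and output dir as JSON."""
    record = {
        "command": list(command),
        "overrides": dict(overrides),
        "git_commit": current_git_commit(repo_dir),
        "output_dir": str(output_dir),
        "recorded_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    path = Path(output_dir) / "run_record.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, indent=2))
    return path
```

Writing the record into the run's own output directory keeps the command and its results together when several sweeps share one checkout.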

## Deployment Checks

- Evaluate the exported or served model against the same eval set.
- For embedding NIM, use `uv run nemotron embed eval -c default eval_nim=true eval_base=false`.
- For rerank NIM, use `uv run nemotron rerank eval -c default eval_nim=true eval_base=false`.
- If metrics drift after export or deploy, check ONNX vs TensorRT, quantization, pooling, normalization, prefixes, prompt templates, and sequence length.