From 58d50a5efcf3d4aa476042df477dd69edade2b92 Mon Sep 17 00:00:00 2001 From: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com> Date: Thu, 21 May 2026 22:26:20 +0100 Subject: [PATCH] Add retriever fine-tune recipe skill Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com> --- .../retriever-finetune-recipe/PITFALLS.md | 60 +++++++ .../skills/retriever-finetune-recipe/SKILL.md | 58 +++++++ .../agents/openai.yaml | 4 + .../references/embed.md | 142 +++++++++++++++++ .../references/evaluation.md | 30 ++++ .../references/rerank.md | 146 ++++++++++++++++++ .claude/skills/retriever-finetune-recipe | 1 + .gitignore | 5 + 8 files changed, 446 insertions(+) create mode 100644 .agents/skills/retriever-finetune-recipe/PITFALLS.md create mode 100644 .agents/skills/retriever-finetune-recipe/SKILL.md create mode 100644 .agents/skills/retriever-finetune-recipe/agents/openai.yaml create mode 100644 .agents/skills/retriever-finetune-recipe/references/embed.md create mode 100644 .agents/skills/retriever-finetune-recipe/references/evaluation.md create mode 100644 .agents/skills/retriever-finetune-recipe/references/rerank.md create mode 120000 .claude/skills/retriever-finetune-recipe diff --git a/.agents/skills/retriever-finetune-recipe/PITFALLS.md b/.agents/skills/retriever-finetune-recipe/PITFALLS.md new file mode 100644 index 000000000..52c3f8f66 --- /dev/null +++ b/.agents/skills/retriever-finetune-recipe/PITFALLS.md @@ -0,0 +1,60 @@ +# Retriever Recipe Pitfalls + +Load this file when a recipe command fails, metrics look wrong, or the user asks for debugging help. + +## Setup And CLI + +- CLI help or dry-run fails before reaching `embed` or `rerank` with a missing optional dependency: run the repo's documented sync path, usually `uv sync --all-extras`. If the error names Data Designer, the smaller recovery may be `uv sync --extra data-sdg`. 
+- `uv run` rebuilds or installs packages unexpectedly: report that the environment is being prepared, then continue with help/dry-run before launching work. +- CUDA symbol, `nvJitLink`, or library mismatch errors: clear inherited CUDA library paths with `LD_LIBRARY_PATH=""` for the command, then rerun the cheapest failing validation. +- Unknown override field: inspect the stage config model or `uv run nemotron --help`; Pydantic configs usually reject extra fields. +- Hugging Face `429 Too Many Requests` or gated-model access errors: set `HF_TOKEN`, run `huggingface-cli login`, or reduce parallel work before retrying. + +## Stage 0 SDG + +- Missing `NVIDIA_API_KEY`: Stage 0 requires it. Ask the user to configure the environment, but do not ask them to paste the key. +- API rate limits or flaky generation: reduce `max_parallel_requests_for_gen`, lower `batch_size`, or run a smaller pilot with fewer files. +- No or low-quality generated QA: inspect a sample of generated JSON before lowering `quality_threshold`; improve corpus quality, chunking, or SDG model settings first. +- Large corpus takes too long: use `num_files`, batch index ranges, or a representative pilot corpus before full generation. + +## Stage 1 Prep + +- GPU OOM during hard-negative mining: reduce `mining_batch_size`, sequence lengths, or the workload on the visible GPUs. +- Few valid training rows: check Stage 0 quality scores and Stage 1 `quality_threshold`; confirm the SDG output path points to the intended family output directory. +- Train/eval comparisons shift unexpectedly: preserve the same `eval_beir/` split across runs. +- Hard negatives are insufficient: ensure `hard_negatives_to_mine >= train_n_passages - 1`. + +## Stage 2 Finetune + +- OOM: reduce `local_batch_size`, `global_batch_size`, sequence length, or `train_n_passages`. +- NaN or unstable loss: reduce the learning rate, inspect the data for corruption, and check positives/negatives in the unrolled training file. 
+- Loss not decreasing: try a lower learning rate, inspect data quality, and confirm positives and hard negatives are sensible. +- Overfitting: start real corpora at 1-2 epochs; the default 3 epochs is mainly for small example datasets. +- Small datasets may trigger training-code auto-scaling of batch size or checkpoint/validation frequency. Preserve those log messages when reporting what happened. +- Checkpoint expectations are wrong: Stage 3 and Stage 4 default to `checkpoints/LATEST/model/consolidated`; pass explicit paths when using older or custom checkpoints. +- Rerank optimizer confusion: `optimizer_backend=auto` should use Transformer Engine FusedAdam in the container and FlashAdamW otherwise. + +## Stage 3 Eval + +- Fine-tuned model looks worse: confirm eval data, prefixes, sequence lengths, prompt template, pooling/normalization, and checkpoint path match training. +- Reranker cannot improve recall: a reranker only reorders retrieved candidates. If relevant documents are missing from `top_k`, tune the embedder or retrieval index. +- Metrics look noisy: increase held-out eval queries where possible and compare on a fixed `eval_beir/` split. +- NIM eval mismatch: compare checkpoint vs ONNX vs TensorRT, then inspect quantization, pooling/normalization, prefixes, prompt template, and sequence lengths. + +## Stage 4 Export + +- ONNX export fails with attention kernels: keep `attn_implementation=eager` for export. +- TensorRT export fails: first validate ONNX-only export with `export_to_trt=false`, then check the NeMo Export-Deploy container and TensorRT profile settings. +- Rerank TensorRT instability: keep the layernorm FP32 overrides unless there is a tested reason to change them. + +## Stage 5 Deploy + +- Docker or NGC errors: confirm Docker runtime, GPU access, NGC login/access, and `NGC_API_KEY`. +- Port conflicts: override `host_port` or stop the existing container. 
+- Service starts but eval fails: run the family-specific smoke test from the reference, then run Stage 3 NIM eval with `eval_nim=true eval_base=false`. + +## Artifact Hygiene + +- Before rerunning stages, inspect the family output directory: `output/embed/` or `output/rerank/`. +- Do not delete generated data, cached embeddings, checkpoints, exports, or running containers unless the user explicitly asks. +- If stale artifacts may be causing shape or resume problems, explain the specific path and ask before cleanup. diff --git a/.agents/skills/retriever-finetune-recipe/SKILL.md b/.agents/skills/retriever-finetune-recipe/SKILL.md new file mode 100644 index 000000000..ec5b417d3 --- /dev/null +++ b/.agents/skills/retriever-finetune-recipe/SKILL.md @@ -0,0 +1,58 @@ +--- +name: retriever-finetune-recipe +description: Operate Nemotron retriever fine-tuning recipes for embedding and reranking models. Use when Codex needs to plan, run, debug, tune, evaluate, export, deploy, document, or modify `nemotron embed ...` or `nemotron rerank ...` workflows; interpret BEIR, nDCG, Recall, hard-negative mining, Automodel training, ONNX/TensorRT export, or NIM deployment results; or choose between embedder and reranker personalization. +--- + +# Retriever Fine-Tune Recipe + +Use this skill to work with Nemotron embedding and reranking fine-tuning recipes in a source checkout or installed package. Prefer the current checkout over memory, because the recipe CLI, configs, containers, and output paths are actively changing. + +## First Decisions + +1. Identify the recipe family. + - Use `references/embed.md` for embedding, embed, bi-encoder, vector search, first-stage retrieval, low Recall@k, missing relevant documents, NIM embeddings, or `nemotron embed`. + - Use `references/rerank.md` for rerank, reranker, cross-encoder, second-stage retrieval, acceptable recall but poor top-rank ordering, low nDCG with good Recall, or `nemotron rerank`. 
+ - Use both references only when the user asks about both families or asks which family to choose. +2. Choose the model to tune from the retrieval failure mode. + - Prefer embedding fine-tuning when relevant documents are absent from the candidate set. + - Prefer reranker fine-tuning when relevant documents are retrieved but ordered poorly near the top. + - For production retrieval stacks, remember that these are complementary: embed first, rerank candidates second. +3. Identify the intent: plan a run, execute a stage, debug a failure, tune hyperparameters, interpret metrics, export/deploy a model, or modify recipe code/configs. +4. Inspect the current public surface before acting: + - Recipe files: `src/nemotron/recipes//` + - CLI files: `src/nemotron/cli/commands//` + - Default configs: `src/nemotron/recipes//stage*/config/default.yaml` + - Help and dry runs: `uv run nemotron --help`, `uv run nemotron -c default -d` + +## Safe Workflow + +1. Gather only task-relevant context: corpus path, existing SDG/training/eval data, target stage range, output directory, checkpoint path, execution mode, GPU IDs, and whether required secrets are configured. Never ask users to paste secret values. +2. Start with cheap checks before expensive work: + - `uv run nemotron --help` + - `uv run nemotron --help` + - `uv run nemotron -c default -d` + - `uv run nemotron run -c default -d --from --to ` +3. Check prerequisites for the requested stage: + - Repo environment: `uv sync --all-extras` or the smallest relevant extra if documented by the repo. + - Stage 0 SDG: `NVIDIA_API_KEY`. + - Stage 1-4 GPU work: CUDA/NVIDIA driver availability and enough VRAM. + - Stage 4 export: the NeMo Export-Deploy container when using TensorRT. + - Stage 5 deploy: Docker, NGC access, and `NGC_API_KEY`. + - Remote execution: root `env.toml` profile for `--run` or `--batch`. +4. Use dotlist overrides instead of editing defaults unless the user asks for reusable config changes. 
Keep sequence length, prefixes, pooling/normalization, prompt templates, and hard-negative counts consistent across stages. +5. Avoid launching API, GPU, Docker, Slurm, NIM, or long-running jobs unless the user explicitly asked to run them. Offer or run dry-runs, config review, and small pilots first. +6. If the user specifies GPU IDs, scope every stage command with `CUDA_VISIBLE_DEVICES=`. +7. For multi-stage local runs, prefer `uv run nemotron run -c default --from --to `. The default `run` target stops at `eval`; `export` and `deploy` are opt-in. +8. For long-running SDG, prep, finetune, or eval work, start the process in a session-safe way and poll at human-scale intervals: roughly 60 seconds for small pilots and 120-300 seconds for larger runs. +9. For failures, load `PITFALLS.md`, localize the failing stage, then inspect the stage config, expected inputs, output directory, and corresponding CLI wrapper or `run_uv.py`. + +## References + +- `references/embed.md`: embedding recipe stages, commands, defaults, output paths, and operating patterns. +- `references/rerank.md`: rerank recipe stages, commands, defaults, output paths, and operating patterns. +- `references/evaluation.md`: metric interpretation, comparison hygiene, and deployment readiness checks. +- `PITFALLS.md`: common failures and recovery moves for SDG, prep, training, eval, export, deploy, and CLI setup. + +## Output Style + +Give concrete commands and file paths. State assumptions, expected inputs, expected outputs, and the cheapest validation step that proves the next action is ready. For long-running stages, separate preview commands from execution commands so the user can choose deliberately. 
diff --git a/.agents/skills/retriever-finetune-recipe/agents/openai.yaml b/.agents/skills/retriever-finetune-recipe/agents/openai.yaml new file mode 100644 index 000000000..aba1ff717 --- /dev/null +++ b/.agents/skills/retriever-finetune-recipe/agents/openai.yaml @@ -0,0 +1,4 @@ +interface: + display_name: "Retriever Fine-Tune Recipe" + short_description: "Run and debug retrieval fine-tuning" + default_prompt: "Use $retriever-finetune-recipe to plan, run, debug, and validate Nemotron embedding or reranking recipe stages." diff --git a/.agents/skills/retriever-finetune-recipe/references/embed.md b/.agents/skills/retriever-finetune-recipe/references/embed.md new file mode 100644 index 000000000..5212bb884 --- /dev/null +++ b/.agents/skills/retriever-finetune-recipe/references/embed.md @@ -0,0 +1,142 @@ +# Embedding Recipe Reference + +Load this reference for `nemotron embed ...` work or for questions about first-stage retrieval, bi-encoder training, low Recall@k, missing relevant documents, embedding NIMs, or re-indexing after model changes. + +## Contents + +- Grounding Paths +- When To Use Embed +- Commands +- Stage Map +- Important Defaults +- Operating Patterns +- NIM Smoke Test +- Tests And Checks + +## Grounding Paths + +- Recipe README: `src/nemotron/recipes/embed/README.md` +- CLI group: `src/nemotron/cli/commands/embed/_typer_group.py` +- Pipeline command: `src/nemotron/cli/commands/embed/run.py` +- Stage configs: `src/nemotron/recipes/embed/stage*/config/default.yaml` +- Main outputs: `output/embed/` + +## When To Use Embed + +Use embedding fine-tuning when relevant documents are not retrieved into the candidate set, Recall@k is low, domain terms are poorly matched, or the user needs a better first-stage retrieval model. Embedding changes usually require re-embedding and re-indexing the deployment corpus. + +## Commands + +Use `uv run` when `nemotron` is not already available. 
+ +```bash +uv run nemotron embed info +uv run nemotron embed --help +uv run nemotron embed run -c default -d --from prep --to eval +``` + +Stage commands: + +```bash +uv run nemotron embed sdg -c default corpus_dir=/path/to/docs +uv run nemotron embed prep -c default +uv run nemotron embed finetune -c default +uv run nemotron embed eval -c default +uv run nemotron embed export -c default +uv run nemotron embed deploy -c default +``` + +Remote execution uses root `env.toml` profiles: + +```bash +uv run nemotron embed finetune -c default --run my-cluster +uv run nemotron embed finetune -c default --batch my-cluster +``` + +## Stage Map + +| Stage | Command | Input | Output | Notes | +| --- | --- | --- | --- | --- | +| 0 SDG | `embed sdg` | Text corpus or HF URI | `output/embed/stage0_sdg` | Requires `NVIDIA_API_KEY`; generates synthetic retrieval QA data. | +| 1 prep | `embed prep` | Stage 0 output or existing QA data | `output/embed/stage1_data_prep` | Converts to train/eval data, mines hard negatives, creates BEIR eval data. | +| 2 finetune | `embed finetune` | `train_mined.automodel_unrolled.json` | `output/embed/stage2_finetune/checkpoints` | Automodel contrastive training. | +| 3 eval | `embed eval` | BEIR eval data and checkpoint | `output/embed/stage3_eval/eval_results.json` | Compare base vs fine-tuned on nDCG, Recall, Precision, and MAP. | +| 4 export | `embed export` | Fine-tuned HF checkpoint | `output/embed/stage4_export` | Default config exports ONNX only; set `export_to_trt=true` for TensorRT. | +| 5 deploy | `embed deploy` | ONNX/TensorRT model dir | NIM on `host_port` | Requires Docker/NGC setup and `NGC_API_KEY`. | + +The pipeline order is `sdg`, `prep`, `finetune`, `eval`, `export`, `deploy`; `embed run` defaults to `--to eval`. 
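The stage ordering and the `--from`/`--to` range described above can be pictured with a small sketch. It is not the recipe's implementation: `select_stages` is a hypothetical helper, and it assumes both endpoints are inclusive and that an omitted `--from` means the first stage. The CLI help and a dry run remain the authority on actual behavior.

```python
# Illustrative only: `select_stages` is a hypothetical helper, not part of the nemotron CLI.
STAGES = ["sdg", "prep", "finetune", "eval", "export", "deploy"]


def select_stages(from_stage: str = "sdg", to_stage: str = "eval") -> list[str]:
    """Return the stages in pipeline order between two endpoints.

    Assumes both endpoints are inclusive and that the range defaults
    to starting at the first stage and stopping at `eval`.
    """
    start = STAGES.index(from_stage)
    end = STAGES.index(to_stage)
    if start > end:
        raise ValueError(f"--from {from_stage} comes after --to {to_stage}")
    return STAGES[start : end + 1]


print(select_stages("prep", "eval"))    # ['prep', 'finetune', 'eval']
print(select_stages("prep", "deploy"))  # export and deploy appear only when requested
```

Under these assumptions, `export` and `deploy` never run unless `--to` names them explicitly, which matches the opt-in behavior the reference describes.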
+ +## Important Defaults + +Stage 0: + +- Sample corpus: `hf://nvidia/Retrieval-Synthetic-NVDocs-v1@1c0d1856f3fb595b2dda98d4b61061fa6d782d51/sample_corpus/nv_pp_random` +- Output: `./output/embed/stage0_sdg` +- Generation model: `nvidia/nemotron-3-nano-30b-a3b` +- SDG embedding model: `nvidia/llama-3.2-nv-embedqa-1b-v2` +- Useful overrides: `corpus_dir`, `num_pairs`, `sentences_per_chunk`, `file_extensions`, `max_parallel_requests_for_gen`, `preview=true` + +Stage 1: + +- Input: `./output/embed/stage0_sdg` +- Output: `./output/embed/stage1_data_prep` +- Base model for mining: `nvidia/llama-nemotron-embed-1b-v2` +- Quality threshold: `7.0` +- Split: `train_ratio=0.8`, `val_ratio=0`, `test_ratio=0.2` +- Hard negatives: `hard_negatives_to_mine=5`, `hard_neg_margin=0.95`, `mining_batch_size=128` + +Stage 2: + +- Base model: `nvidia/llama-nemotron-embed-1b-v2` +- Train data: `./output/embed/stage1_data_prep/train_mined.automodel_unrolled.json` +- Checkpoints: `./output/embed/stage2_finetune/checkpoints` +- Defaults: `num_epochs=3`, `global_batch_size=128`, `local_batch_size=4`, `learning_rate=1.0e-5`, `temperature=0.02`, `train_n_passages=5` +- Prefixes: `query_prefix="query:"`, `passage_prefix="passage:"` +- For real corpora, start with 1-2 epochs unless Stage 3 metrics still improve; the 3 epoch default is for small examples. + +Stage 3: + +- Eval data: `./output/embed/stage1_data_prep/eval_beir` +- Fine-tuned model: `./output/embed/stage2_finetune/checkpoints/LATEST/model/consolidated` +- Metrics: `k_values=[1,5,10,100]` +- Modes: `eval_base=true`, `eval_finetuned=true`, `eval_nim=false` +- NIM verification: `uv run nemotron embed eval -c default eval_nim=true eval_base=false` + +Stage 4: + +- Model path: `./output/embed/stage2_finetune/checkpoints/LATEST/model/consolidated` +- ONNX output: `./output/embed/stage4_export/onnx` +- TensorRT output: `./output/embed/stage4_export/tensorrt` +- `attn_implementation=eager` is the export-safe default. 
+ +Stage 5: + +- NIM image: `nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.10.1` +- Container: `nemotron-embed-nim` +- Default API: `http://localhost:8000/v1/embeddings` + +## Operating Patterns + +- Skip SDG when the user already has generated QA pairs or wants NVIDIA's pre-generated dataset; start Stage 1 with `sdg_input_path`. +- For production-like chunks, align `sentences_per_chunk`, `passage_max_length`, and eval `max_length` with expected retrieval chunks. +- If increasing sequence length, reduce batch sizes before attempting to recover from OOM. +- Mine at least as many hard negatives as Stage 2 will consume: `hard_negatives_to_mine >= train_n_passages - 1`. +- Preserve `output/embed/stage1_data_prep/eval_beir/` across comparisons so metrics are not shifted by new splits. +- Use `val_ratio=0` only for small datasets where preserving test size matters; use a validation split for larger datasets. +- Inspect existing `output/embed/` artifacts before rerunning a stage. Ask before deleting checkpoints, cached embeddings, or generated data. + +## NIM Smoke Test + +```bash +curl -X POST http://localhost:8000/v1/embeddings \ + -H 'Content-Type: application/json' \ + -d '{"input": ["hello"], "model": "nvidia/llama-3.2-nv-embedqa-1b-v2", "input_type": "query"}' +``` + +## Tests And Checks + +```bash +uv run nemotron embed --help +uv run nemotron embed finetune -c default -d +uv run pytest tests/recipes/embed tests/nemo_runspec/test_execution_uv_spec.py -q +``` diff --git a/.agents/skills/retriever-finetune-recipe/references/evaluation.md b/.agents/skills/retriever-finetune-recipe/references/evaluation.md new file mode 100644 index 000000000..8ffd03504 --- /dev/null +++ b/.agents/skills/retriever-finetune-recipe/references/evaluation.md @@ -0,0 +1,30 @@ +# Evaluation Practices + +Use Stage 3 metrics as the source of truth for recipe quality. Training loss is useful for diagnosing learning dynamics, but it is not retrieval accuracy. 
+ +## Minimum Practice + +- Compare base vs fine-tuned on the same held-out eval set. +- Keep the Stage 1 `eval_beir/` split fixed across hyperparameter, SDG, and data-volume comparisons. +- Inspect `output/embed/stage3_eval/eval_results.json` or `output/rerank/stage3_eval/eval_results.json`. +- Prioritize nDCG@10 and Recall@10, then check the rest of the k values for consistency. +- Use at least 100 eval queries when possible; 200-500 is better for detecting small changes. +- Treat less than roughly 5 absolute points of nDCG@10 improvement as a reason to inspect data quality, SDG coverage, hard negatives, and hyperparameters before deployment. +- For rerank, treat high Recall with low nDCG as a ranking problem; treat low Recall as a first-stage retrieval or embedding problem. +- Public benchmarks can be useful for broad sanity checks, but recipe personalization should be judged on the recipe's domain-specific held-out eval split. + +## Experiment Hygiene + +- Save the exact command, dotlist overrides, git commit, config files, and output directory for each run. +- Change one major variable at a time. +- Start embedding LR sweeps near `5e-6`, `1e-5`, and `2e-5`. +- Start rerank LR sweeps near `1e-6`, `3e-6`, and `1e-5`. +- Start real datasets at 1-2 epochs unless validation and Stage 3 metrics continue improving. +- Evaluate data saturation by running 25%, 50%, and 100% corpus sizes with the same held-out eval set. + +## Deployment Checks + +- Evaluate the exported or served model against the same eval set. +- For embedding NIM, use `uv run nemotron embed eval -c default eval_nim=true eval_base=false`. +- For rerank NIM, use `uv run nemotron rerank eval -c default eval_nim=true eval_base=false`. +- If metrics drift after export or deploy, check ONNX vs TensorRT, quantization, pooling, normalization, prefixes, prompt templates, and sequence length. 
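For readers interpreting the numbers, here is a minimal sketch of how nDCG@k and Recall@k are commonly computed for a single query with binary relevance. The function names and toy data are illustrative assumptions; Stage 3 may use a different implementation (for example graded relevance), so treat this as an aid to intuition rather than the recipe's scoring code.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 uses log2(2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query: two relevant documents, retrieved at positions 2 and 4.
ranked = ["d7", "d1", "d9", "d3", "d5"]
relevant = {"d1", "d3"}

print(recall_at_k(ranked, relevant, 10))  # 1.0: both relevant docs were retrieved
print(ndcg_at_k(ranked, relevant, 10))    # below 1.0: they are not ranked first
```

High Recall with a noticeably lower nDCG, as in this toy case, is the pattern the guidance above treats as a ranking problem rather than a first-stage retrieval problem.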
diff --git a/.agents/skills/retriever-finetune-recipe/references/rerank.md b/.agents/skills/retriever-finetune-recipe/references/rerank.md new file mode 100644 index 000000000..5acda74ac --- /dev/null +++ b/.agents/skills/retriever-finetune-recipe/references/rerank.md @@ -0,0 +1,146 @@ +# Rerank Recipe Reference + +Load this reference for `nemotron rerank ...` work or for questions about cross-encoder reranking, second-stage retrieval, top-rank precision, low nDCG with acceptable Recall, ranking NIMs, or reranking evaluation. + +## Contents + +- Grounding Paths +- When To Use Rerank +- Commands +- Stage Map +- Important Defaults +- Operating Patterns +- NIM Smoke Test +- Tests And Checks + +## Grounding Paths + +- Recipe README: `src/nemotron/recipes/rerank/README.md` +- CLI group: `src/nemotron/cli/commands/rerank/_typer_group.py` +- Pipeline command: `src/nemotron/cli/commands/rerank/run.py` +- Stage configs: `src/nemotron/recipes/rerank/stage*/config/default.yaml` +- Main outputs: `output/rerank/` + +## When To Use Rerank + +Use reranker fine-tuning when relevant documents are already in the candidate set but the top ranks are wrong, nDCG@k is low while Recall@k is acceptable, or users say the right answer appears below worse answers. A reranker re-scores query-document pairs; it cannot recover documents that first-stage retrieval did not return. + +## Commands + +Use `uv run` when `nemotron` is not already available. 
+ +```bash +uv run nemotron rerank info +uv run nemotron rerank --help +uv run nemotron rerank run -c default -d --from prep --to eval +``` + +Stage commands: + +```bash +uv run nemotron rerank sdg -c default corpus_dir=/path/to/docs +uv run nemotron rerank prep -c default +uv run nemotron rerank finetune -c default +uv run nemotron rerank eval -c default +uv run nemotron rerank export -c default +uv run nemotron rerank deploy -c default +``` + +Remote execution uses root `env.toml` profiles: + +```bash +uv run nemotron rerank finetune -c default --run my-cluster +uv run nemotron rerank finetune -c default --batch my-cluster +``` + +## Stage Map + +| Stage | Command | Input | Output | Notes | +| --- | --- | --- | --- | --- | +| 0 SDG | `rerank sdg` | Text corpus or HF URI | `output/rerank/stage0_sdg` | Requires `NVIDIA_API_KEY`; uses the same SDG pipeline shape as embed. | +| 1 prep | `rerank prep` | Stage 0 output or existing QA data | `output/rerank/stage1_prep` | Converts to train/eval data, mines hard negatives, creates BEIR eval data. | +| 2 finetune | `rerank finetune` | `train_mined.automodel_unrolled.json` | `output/rerank/stage2_finetune/checkpoints` | Automodel cross-encoder classification training. | +| 3 eval | `rerank eval` | BEIR eval data and checkpoint | `output/rerank/stage3_eval/eval_results.json` | Dense retrieval, rerank top candidates, compare base vs fine-tuned nDCG. | +| 4 export | `rerank export` | Fine-tuned HF checkpoint | `output/rerank/stage4_export` | Default config exports ONNX only; set `export_to_trt=true` for TensorRT. | +| 5 deploy | `rerank deploy` | ONNX/TensorRT model dir | NIM on `host_port` | Requires Docker/NGC setup and `NGC_API_KEY`. | + +The pipeline order is `sdg`, `prep`, `finetune`, `eval`, `export`, `deploy`; `rerank run` defaults to `--to eval`. 
+ +## Important Defaults + +Stage 0: + +- Sample corpus: `hf://nvidia/Retrieval-Synthetic-NVDocs-v1@1c0d1856f3fb595b2dda98d4b61061fa6d782d51/sample_corpus/nv_pp_random` +- Output: `./output/rerank/stage0_sdg` +- Generation model: `nvidia/nemotron-3-nano-30b-a3b` +- SDG embedding model: `nvidia/llama-3.2-nv-embedqa-1b-v2` +- Useful overrides: `corpus_dir`, `num_pairs`, `sentences_per_chunk`, `file_extensions`, `max_parallel_requests_for_gen`, `preview=true` + +Stage 1: + +- Input: `./output/rerank/stage0_sdg` +- Output: `./output/rerank/stage1_prep` +- Base model for hard-negative mining: `nvidia/llama-nemotron-embed-1b-v2` +- Quality threshold: `7.0` +- Split: `train_ratio=0.8`, `val_ratio=0`, `test_ratio=0.2` +- Hard negatives: `hard_negatives_to_mine=5`, `hard_neg_margin=0.95`, `mining_batch_size=128` + +Stage 2: + +- Base model: `nvidia/llama-nemotron-rerank-1b-v2` +- Train data: `./output/rerank/stage1_prep/train_mined.automodel_unrolled.json` +- Checkpoints: `./output/rerank/stage2_finetune/checkpoints` +- Defaults: `num_epochs=3`, `global_batch_size=128`, `local_batch_size=4`, `learning_rate=3.0e-6`, `train_n_passages=5` +- Optimizer backend: `auto`, using Transformer Engine FusedAdam when available and FlashAdamW otherwise. +- Tokenization: `rerank_max_length=512`, `prompt_template="question:{query} \n \n passage:{passage}"` +- For real corpora, start with 1-2 epochs unless Stage 3 metrics still improve; the 3 epoch default is for small examples. 
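The Stage 1 and Stage 2 defaults above can be checked against the rule `hard_negatives_to_mine >= train_n_passages - 1` with a few lines of arithmetic. This sketch assumes `train_n_passages` counts one positive plus the negatives for each query, which is what the `- 1` in the rule suggests; `enough_hard_negatives` is a hypothetical helper, not a recipe function.

```python
def enough_hard_negatives(hard_negatives_to_mine: int, train_n_passages: int) -> bool:
    """Check that Stage 1 mines enough negatives for Stage 2 to consume.

    Assumes each training example uses 1 positive and
    (train_n_passages - 1) negatives.
    """
    negatives_needed = train_n_passages - 1
    return hard_negatives_to_mine >= negatives_needed


# Defaults from the lists above: 5 mined negatives, 5 passages per example.
print(enough_hard_negatives(hard_negatives_to_mine=5, train_n_passages=5))  # True (5 >= 4)

# Raising train_n_passages without mining more negatives breaks the rule.
print(enough_hard_negatives(hard_negatives_to_mine=5, train_n_passages=8))  # False (5 < 7)
```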
+ +Stage 3: + +- Eval data: `./output/rerank/stage1_prep/eval_beir` +- Fine-tuned model: `./output/rerank/stage2_finetune/checkpoints/LATEST/model/consolidated` +- First-stage retrieval model: `nvidia/llama-nemotron-embed-1b-v2` +- Candidate depth: `top_k=100` +- Metrics: `k_values=[1,5,10,100]` +- Modes: `eval_base=true`, `eval_finetuned=true`, `eval_nim=false` +- NIM verification: `uv run nemotron rerank eval -c default eval_nim=true eval_base=false` + +Stage 4: + +- Model path: `./output/rerank/stage2_finetune/checkpoints/LATEST/model/consolidated` +- ONNX output: `./output/rerank/stage4_export/onnx` +- TensorRT output: `./output/rerank/stage4_export/tensorrt` +- `attn_implementation=eager` is the export-safe default. +- TensorRT sequence profile defaults: min 3, opt 256, max 512. + +Stage 5: + +- NIM image: `nvcr.io/nim/nvidia/llama-nemotron-rerank-1b-v2:1.10.0` +- Container: `nemotron-rerank-nim` +- Default API: `http://localhost:8000/v1/ranking` + +## Operating Patterns + +- Keep Stage 3's first-stage retrieval model and `top_k` fixed across base vs fine-tuned comparisons. +- Track candidate depth carefully. If Recall is low before reranking, tune the embedder or retrieval index first. +- Mine at least as many hard negatives as Stage 2 will consume: `hard_negatives_to_mine >= train_n_passages - 1`. +- Hold the Stage 1 `eval_beir/` split fixed across sweeps so metric changes are not caused by new splits. +- Start learning-rate sweeps near `1e-6`, `3e-6`, and `1e-5`. +- Keep the Stage 2 `prompt_template` and Stage 3 eval `prompt_template` identical. +- Inspect existing `output/rerank/` artifacts before rerunning a stage. Ask before deleting checkpoints, cached embeddings, or generated data. 
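Why candidate depth matters can be shown with a toy example: reranking only reorders the first-stage candidates, so a relevant document outside the top `top_k` can never be recovered. The scores, document IDs, and the `rerank` helper below are made up for illustration and are not the recipe's evaluation code.

```python
def rerank(candidates, scores):
    """Reorder first-stage candidates by a (hypothetical) reranker score."""
    return sorted(candidates, key=lambda doc_id: scores.get(doc_id, 0.0), reverse=True)


# First-stage retrieval order over a tiny corpus; "d4" and "d6" are relevant.
first_stage = ["d1", "d2", "d4", "d3", "d5", "d6"]
relevant = {"d4", "d6"}

# Even an ideal reranker, which scores the relevant documents highest...
reranker_scores = {"d4": 0.9, "d6": 0.8, "d1": 0.3, "d2": 0.2, "d3": 0.1, "d5": 0.05}

top_k = 4  # candidate depth passed to the reranker
candidates = first_stage[:top_k]  # "d6" is already lost at this point
reranked = rerank(candidates, reranker_scores)

found_before = relevant & set(candidates)
found_after = relevant & set(reranked)

print(reranked)                     # ['d4', 'd1', 'd2', 'd3']
print(found_before == found_after)  # True: recall within top_k is unchanged
```

The reranker moves `d4` to the top, which improves nDCG, but `d6` stays missing. That gap is what embedder or retrieval-index tuning has to close.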
+ +## NIM Smoke Test + +```bash +curl -X POST http://localhost:8000/v1/ranking \ + -H 'Content-Type: application/json' \ + -d '{"model": "nvidia/llama-nemotron-rerank-1b-v2", "query": {"text": "what is AI?"}, "passages": [{"text": "AI is artificial intelligence"}]}' +``` + +## Tests And Checks + +```bash +uv run nemotron rerank --help +uv run nemotron rerank finetune -c default -d +uv run pytest src/nemotron/recipes/rerank/stage2_finetune/tests tests/nemo_runspec/test_execution_uv_spec.py -q +``` diff --git a/.claude/skills/retriever-finetune-recipe b/.claude/skills/retriever-finetune-recipe new file mode 120000 index 000000000..a0294ed56 --- /dev/null +++ b/.claude/skills/retriever-finetune-recipe @@ -0,0 +1 @@ +../../.agents/skills/retriever-finetune-recipe \ No newline at end of file diff --git a/.gitignore b/.gitignore index 92231acc6..d12b8424a 100644 --- a/.gitignore +++ b/.gitignore @@ -98,6 +98,11 @@ wandb/ # Claude Code CLAUDE.md .claude/ +!.claude/ +.claude/* +!.claude/skills/ +.claude/skills/* +!.claude/skills/retriever-finetune-recipe # Compiled config config.yaml