fix(rerank) Fix rerank finetune optimizer precision by shan-nvidia · Pull Request #208 · NVIDIA-NeMo/Nemotron

shan-nvidia · 2026-05-15T20:39:37Z

Summary

Fix the reranker fine-tune recipe so local torch AdamW training preserves fp32 optimizer-state behavior and can resume without regressing optimizer metadata to bf16.

The investigation found that the recipe path could create Adam moment state from bf16 model parameters, while the reference POC behavior kept optimizer state in fp32. This PR makes that behavior explicit for the local recipe path and adds regression coverage for resume.

What Changed

- Keep local fine-tune parameters and torch AdamW optimizer state fp32 when force_fp32_parameters=true.
- Use native torch.optim.AdamW(fused=True) for local mode instead of requiring Transformer Engine.
- Load the reranker with torch_dtype: float32 and configure an fp32 FSDP MixedPrecisionPolicy.
- Add explicit resume handling so restore loads model state, casts model parts to fp32, then loads optimizer/scheduler state.
- Assert after explicit resume that optimizer checkpoint metadata still records fp32 Adam state.
- Pin Stage 3 eval tokenizer EOS behavior to match training (add_eos_token=false).
- Document the fp32 optimizer-state behavior and train/eval tokenizer invariant in the rerank README.
- Keep checkpoint frequency at the recipe default of 100 steps.

Root Cause

AdamW optimizer state (the exp_avg and exp_avg_sq moment tensors, plus step) followed the dtype of the parameters used to construct/load optimizer state. With bf16 parameters, the optimizer checkpoint metadata recorded bf16 moment state, which changed training behavior relative to the fp32-state POC.
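One way to see why the moment dtype can matter is to simulate bfloat16 rounding on Adam's second-moment update. This is an illustration only, not code from this PR, and the PR does not claim this exact mechanism: `to_fp32` and `to_bf16` are hypothetical helpers that round a Python float to the nearest float32 or bfloat16 value, and `beta2 = 0.999` is taken from the recipe's `betas` setting. With zero gradients, an fp32 `exp_avg_sq` decays as expected, while a bf16 value of 1.0 stays put because `0.999` rounds back to `1.0`.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round a Python float to the nearest bfloat16 value (ties to even).

    bfloat16 keeps the top 16 bits of the float32 encoding, so we round the
    float32 bit pattern and clear the low 16 bits.
    Sketch only: NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def decay_second_moment(v: float, steps: int, beta2: float, cast) -> float:
    """Apply v <- cast(beta2 * v), i.e. Adam's exp_avg_sq update with zero gradients."""
    for _ in range(steps):
        v = cast(beta2 * v)
    return v


v_fp32 = decay_second_moment(1.0, steps=1000, beta2=0.999, cast=to_fp32)
v_bf16 = decay_second_moment(1.0, steps=1000, beta2=0.999, cast=to_bf16)
print(f"fp32 exp_avg_sq after 1000 steps: {v_fp32:.4f}")
print(f"bf16 exp_avg_sq after 1000 steps: {v_bf16:.4f}")
```

In the simulation, the 0.1% per-step decay is smaller than half the spacing between adjacent bfloat16 values just below 1.0, so the bf16 state keeps rounding back to where it started.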

Resume had a second failure mode: casting parameters after the normal Automodel restore path was too late, because optimizer state had already been loaded. The new restore hook makes the load order explicit.
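The load-order problem can be reproduced on a toy model. The sketch below is not the recipe's restore hook and does not use Automodel: it assumes PyTorch is installed, uses a small `nn.Linear` as a stand-in for the reranker, and `adam_state_dtypes` is a hypothetical helper. It relies on `torch.optim.Optimizer.load_state_dict` casting floating-point state to the dtype of the parameter it belongs to.

```python
import torch
from torch import nn


def adam_state_dtypes(optimizer):
    """Collect the dtypes of the Adam moment tensors held by an optimizer."""
    return {
        state[key].dtype
        for state in optimizer.state.values()
        for key in ("exp_avg", "exp_avg_sq")
    }


# Produce a checkpoint from an fp32 model after one optimizer step.
torch.manual_seed(0)
model = nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
model(torch.randn(8, 4)).sum().backward()
optimizer.step()
checkpoint = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}

# Late cast: optimizer state is loaded while the parameters are still bf16.
late = nn.Linear(4, 2).to(torch.bfloat16)
late_opt = torch.optim.AdamW(late.parameters(), lr=1e-3)
late.load_state_dict(checkpoint["model"])
late_opt.load_state_dict(checkpoint["optimizer"])
late.float()  # too late: the moments were already cast to bf16
late_dtypes = adam_state_dtypes(late_opt)

# Explicit order: load model state, cast to fp32, then load optimizer state.
early = nn.Linear(4, 2).to(torch.bfloat16)
early.load_state_dict(checkpoint["model"])
early.float()
early_opt = torch.optim.AdamW(early.parameters(), lr=1e-3)
early_opt.load_state_dict(checkpoint["optimizer"])
early_dtypes = adam_state_dtypes(early_opt)

print("late cast :", late_dtypes)
print("early cast:", early_dtypes)
```

The same check on `exp_avg` and `exp_avg_sq` dtypes is one way to express a post-resume assertion in a regression test.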

Validation

Manual fine-tune/eval runs confirmed that the fixed train curve and eval results match the reference POC behavior.

Add a new `nemotron rerank` recipe for fine-tuning cross-encoder reranking models, following the same stage-based pattern as the embed recipe.

Stages:
- finetune: Fine-tune using TrainCrossEncoderRecipe from nemo-automodel
- eval: Evaluate reranking quality via BEIR (first-stage retrieval + re-rank)
- export: Export to ONNX/TensorRT using nemo-export reranker adapter
- deploy: Deploy NIM reranker container

The recipe consumes training data from the embed prep stage directly: the same {query, pos_doc[], neg_doc[]} format works for both biencoder and cross-encoder training via nemo-automodel's model_type parameter.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>

- Monkeypatch create_bidirectional_mask during ONNX export to avoid transformers masking_utils tracing incompatibility
- Cap lr_warmup_steps to total_steps-1 for small datasets
- Add eval_nim code path with NIM reranker /v1/ranking API support
- Use SentenceTransformer for first-stage retrieval (trust_remote_code)
- Pass trust_remote_code=True to CrossEncoder in eval
- Rename model from llama-3.2-nv-rerankqa-1b-v2 to llama-nemotron-rerank-1b-v2 across all configs and code
- Pin NIM image to nvcr.io/nim/nvidia/llama-nemotron-rerank-1b-v2:1.10.0
- Pin onnx<1.20 and add ml_dtypes compat, add UV environments constraint
- Set trust_remote_code in crossencoder_base.yaml for model and tokenizer
- Pin transformers>=5.3.0,<5.4.0 for nemo-automodel compat in finetune

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
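The warmup cap mentioned in this commit message can be sketched as a one-line clamp. The function name `cap_warmup_steps` is hypothetical, and the lower bound of 0 (for a run with at most one step) is an assumption that the commit message does not spell out.

```python
def cap_warmup_steps(lr_warmup_steps: int, total_steps: int) -> int:
    """Clamp warmup so it ends at least one step before training does.

    Assumption: warmup is never allowed to go below 0, which only matters
    when total_steps is 0 or 1.
    """
    return max(0, min(lr_warmup_steps, total_steps - 1))


# A small dataset that only yields 10 optimizer steps:
print(cap_warmup_steps(lr_warmup_steps=100, total_steps=10))
# A larger run where the configured warmup already fits:
print(cap_warmup_steps(lr_warmup_steps=100, total_steps=5000))
```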

Add sdg and prep subcommands to the rerank recipe that delegate to the embed implementations. This lets users run the full pipeline end-to-end with `nemotron rerank run` without needing to know about the embed recipe. The pipeline now runs: sdg → prep → finetune → eval (→ export → deploy). Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>

- Add sdg and prep subcommands delegating to embed implementations, so users can run the full pipeline with `nemotron rerank run`
- Pipeline now runs: sdg → prep → finetune → eval (→ export → deploy)
- Update nemo-automodel to 897ebedf
- Use transformers 5.5.x via uv override-dependencies

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>

- Renumber rerank stages to match 6-stage layout: stage0_sdg, stage1_prep, stage2_finetune, stage3_eval, stage4_export, stage5_deploy
- Add rerank-specific config for sdg and prep stages that write to output/rerank/ instead of output/embed/
- Create proper CLI commands for sdg/prep that use embed scripts with rerank config directories
- Fix retriever-sdg deduplication to use data-designer 0.5.3+ API (generate_text_embeddings instead of removed _router.embedding)
- Bump data-designer minimum to >=0.5.3
- Update all output paths to use new stage numbering

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>

Add NIM vs finetuned metrics comparison in eval stage output, with accuracy threshold checks matching embed recipe pattern. Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>

Local evaluation was feeding raw (query, passage) pairs to the cross-encoder without the "question:{query} \n \n passage:{passage}" template used during training and by NIM internally. This caused local scores to underreport by ~10% NDCG, making it impossible to compare fine-tuned checkpoints against NIM baselines. Replace sentence_transformers CrossEncoder with direct AutoModelForSequenceClassification + AutoTokenizer so we control input formatting and apply the prompt template consistently. Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
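A minimal sketch of applying that template before tokenization, assuming the `\n` in the commit message denotes a newline character. `PROMPT_TEMPLATE`, `format_pair`, and `format_pairs` are hypothetical names; the actual eval code in the recipe may structure this differently, and the tokenizer and model calls are left out.

```python
from typing import Iterable, List, Tuple

# Template text copied from the commit message above.
PROMPT_TEMPLATE = "question:{query} \n \n passage:{passage}"


def format_pair(query: str, passage: str) -> str:
    """Render one (query, passage) pair with the training-time prompt template."""
    return PROMPT_TEMPLATE.format(query=query, passage=passage)


def format_pairs(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Render a batch of pairs; the result is what would be passed to the tokenizer."""
    return [format_pair(query, passage) for query, passage in pairs]


batch = format_pairs([
    ("what is bf16?", "bfloat16 is a 16-bit floating-point format."),
    ("what is bf16?", "AdamW is an optimizer."),
])
for text in batch:
    print(repr(text))
```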

Use torch.distributed.run with --nproc_per_node=gpu so training automatically uses all available GPUs (works correctly with 1 GPU too). Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>

Detect whether the input file is a JSON array or JSONL (one object per line) by peeking at the first character, so both formats are handled. Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
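The first-character check can be sketched as follows. `load_records` is a hypothetical name rather than the function in `convert_to_retriever_data`, and skipping leading whitespace before the peek is an assumption; the record fields follow the {query, pos_doc[], neg_doc[]} format described earlier.

```python
import json
import os
import tempfile


def load_records(path: str) -> list:
    """Load a list of records from either a JSON array file or a JSONL file."""
    with open(path, encoding="utf-8") as f:
        # Peek at the first non-whitespace character, then rewind.
        first = f.read(1)
        while first and first.isspace():
            first = f.read(1)
        f.seek(0)
        if first == "[":
            return json.load(f)  # whole file is one JSON array
        # Otherwise treat each non-empty line as one JSON object.
        return [json.loads(line) for line in f if line.strip()]


records = [
    {"query": "q1", "pos_doc": ["d1"], "neg_doc": ["d2"]},
    {"query": "q2", "pos_doc": ["d3"], "neg_doc": ["d4"]},
]
with tempfile.TemporaryDirectory() as tmp:
    array_path = os.path.join(tmp, "data.json")
    jsonl_path = os.path.join(tmp, "data.jsonl")
    with open(array_path, "w", encoding="utf-8") as f:
        json.dump(records, f)
    with open(jsonl_path, "w", encoding="utf-8") as f:
        f.write("\n".join(json.dumps(r) for r in records) + "\n")
    from_array = load_records(array_path)
    from_jsonl = load_records(jsonl_path)

print(from_array == from_jsonl)
```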

Add retrieval_batch_size (default 32) for first-stage retrieval encoding, keeping batch_size (128) for reranker scoring. The embedding model needs a smaller batch size due to longer sequence processing. Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>

Torch was being locked to cu130 wheels in uv.lock, causing GPU to not be used on CUDA 12.x systems. Instead, exclude torch from dependency resolution and supply it explicitly via `--with torch` in the CLI commands, with UV_TORCH_BACKEND=auto to resolve the correct CUDA variant. Applies to rerank (finetune, eval, export, prep) and embed (prep) stages. Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>

Use SentenceTransformer multi-process pool for first-stage retrieval encoding and DataParallel for cross-encoder reranking when multiple GPUs are available. Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>

Show scoring progress with batch count and size during evaluation. Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>

Reduce top_k from 100 to 10 and k_values from [1,5,10,100] to [1,5,10]. This gives ~10x speedup on the cross-encoder scoring step since fewer candidates are re-ranked per query. Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>

Load cross-encoder with torch_dtype=bfloat16 and padding_side="left" to align with nemo-retriever-research evaluation defaults. Reduces memory usage and matches the reference implementation's tokenization. Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>

sentence-transformers now handles multi-GPU automatically in encode(). The explicit multi-process pool and encode_multi_process calls are deprecated and no longer needed. Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>

…ntainers When torch is already importable (e.g., inside an NVIDIA container), create a venv with --system-site-packages and exclude torch from UV resolution. This avoids the CUDA version mismatch where UV's torch-backend=auto detects the kernel driver CUDA version (via nvidia-smi) but the container's libcuda.so is a different version. When torch is NOT importable (bare machine), fall back to the existing uv run --with torch approach with UV_TORCH_BACKEND=auto. Consolidates duplicated _execute_uv_local logic from 10 CLI commands into nemotron.kit.uv_local.execute_uv_local. Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
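The branching described in this commit message can be sketched as a small decision function. This does not reproduce `nemotron.kit.uv_local.execute_uv_local`: `choose_torch_strategy` and the returned dictionary keys are hypothetical, and `importlib.util.find_spec` only checks that a `torch` package can be located, which is a weaker test than actually importing it.

```python
import importlib.util
from typing import Optional


def choose_torch_strategy(torch_available: Optional[bool] = None) -> dict:
    """Decide how torch should be supplied to a local uv environment.

    torch_available can be passed explicitly (useful in tests); when None,
    it is detected from the current interpreter.
    """
    if torch_available is None:
        torch_available = importlib.util.find_spec("torch") is not None

    if torch_available:
        # e.g. inside an NVIDIA container: reuse the preinstalled torch.
        return {
            "venv_args": ["--system-site-packages"],
            "with_torch": False,
            "env": {},
        }
    # Bare machine: let uv install torch and pick the CUDA variant.
    return {
        "venv_args": [],
        "with_torch": True,
        "env": {"UV_TORCH_BACKEND": "auto"},
    }


print(choose_torch_strategy(torch_available=True))
print(choose_torch_strategy(torch_available=False))
```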

* add rerank recipe readme Signed-off-by: Steve Han <sthan@nvidia.com> * fix wrong repo url Signed-off-by: Steve Han <sthan@nvidia.com> --------- Signed-off-by: Steve Han <sthan@nvidia.com>

Signed-off-by: Steve Han <sthan@nvidia.com>

rnyak · 2026-05-18T13:35:43Z

@@ -71,6 +72,7 @@ optimizer:
  weight_decay: 0.01
  betas: [0.9, 0.999]
  eps: 1.0e-8
+  fused: true



@shan-nvidia @oliverholworthy we should set "master_weights: true" here. That will give you a proper loss curve. If it is false by default, you might get lower accuracy.

"master_weights: true" tells Transformer Engine’s FusedAdam optimizer to keep a separate higher-precision copy of the model weights, usually FP32, inside the optimizer.

I see you also changed the target from transformer_engine.pytorch.optimizers.fused_adam.FusedAdam to torch.optim.AdamW. In that case we would not need master_weights: true, but was that change really necessary?

Please see the example config: https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/retrieval/cross_encoder/llama3_2_1b.yaml#L115C3-L115C17

Thanks @rnyak. Automodel uses TE Adam whereas the recipe uses torch Adam. I think torch Adam doesn't have a master_weights flag, so the fix more or less mimics master_weights by keeping parameters and optimizer state in FP32.

I tried switching the recipe to TE Adam but hit a dependency version issue in local mode (on the rapids lab server). Here are some details:
transformer-engine[pytorch]<=2.11.0 (added on this branch) cannot install in your local uv environment because no prebuilt TE wheel exists for torch 2.10. I verified the actual assets on the TE GitHub release pages:

| TE version | Available CUDA-12 wheels (cp312, x86_64) |
| --- | --- |
| v2.11 | +cu12torch2.8.0+cu129 only |
| v2.12 | +cu12torch2.8.0+cu129 only |
| v2.13 | +cu12torch2.8.0+cu129 only |
| v2.14 / v2.14.1 | +cu12torch2.8.0+cu129 only |
| v2.15 | +cu12torch2.8.0+cu129 only |

What's happening at install time:

1. _execute_with_uv_torch in src/nemotron/kit/uv_local.py runs uv run --with torch ... with UV_TORCH_BACKEND=auto. Your driver advertises CUDA 12.9, so uv resolves torch to 2.10.0+cu129.
2. TE 2.11's setup.py then guesses the wheel URL transformer_engine_torch-2.11.0+cu12torch2.10.0+cu128cxx11abiTRUE-...whl; that file doesn't exist (correct asset is +cu12torch2.8.0+cu129).
3. It falls back to source build, which immediately fails on cudnn.h: No such file or directory because cuDNN dev headers aren't on your bare host (they live in the nvcr.io/nvidia/pytorch:25.12-py3 container the recipe declares).

The mismatch isn't really about cuDNN: TE 2.11 only ships a wheel for torch 2.8.0, while uv's auto backend pulled torch 2.10.0. The cudnn error is just the second symptom after the wheel lookup misses.
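The wheel-lookup miss can be expressed as a small compatibility check. The sketch below hard-codes the wheel tags from the table in this comment rather than querying the TE release pages, and `PREBUILT_WHEELS`, `wheel_torch_version`, and `has_prebuilt_wheel` are hypothetical names.

```python
import re
from typing import Optional

# Local version tags copied from the table above (cp312, x86_64, CUDA 12).
PREBUILT_WHEELS = {
    "2.11": "+cu12torch2.8.0+cu129",
    "2.12": "+cu12torch2.8.0+cu129",
    "2.13": "+cu12torch2.8.0+cu129",
    "2.14": "+cu12torch2.8.0+cu129",
    "2.14.1": "+cu12torch2.8.0+cu129",
    "2.15": "+cu12torch2.8.0+cu129",
}


def wheel_torch_version(local_tag: str) -> Optional[str]:
    """Extract the torch version a wheel was built against from its local tag."""
    match = re.search(r"torch(\d+\.\d+\.\d+)", local_tag)
    return match.group(1) if match else None


def has_prebuilt_wheel(te_version: str, torch_version: str) -> bool:
    """Return True if the listed wheel for te_version targets torch_version."""
    tag = PREBUILT_WHEELS.get(te_version)
    return tag is not None and wheel_torch_version(tag) == torch_version


# torch 2.10.0 is what uv resolved on the bare host; no listed wheel matches it.
print(has_prebuilt_wheel("2.11", "2.10.0"))
print(has_prebuilt_wheel("2.11", "2.8.0"))
```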

nemo-automodel @ 897ebedf declares torch>=2.6.0,<=2.10.0 and transformer-engine[pytorch]<=2.11.0, so it has the same shape; nemo-automodel relies on the NGC container to either ship TE prebuilt or to provide cudnn for source builds.

@oliverholworthy let me know if there is a better fix.

oliverholworthy and others added 20 commits April 10, 2026 17:35

fix(rerank): use port 8000 default, consistent with embed

e8e844d

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>

feat(rerank): add NIM vs finetuned comparison output in eval

bc23241


feat(rerank): launch finetune with torchrun for multi-GPU support

756e4f2


fix(embed): support JSONL format in convert_to_retriever_data

45ea89c


Merge branch 'main' into oholworthy/rerank-recipe-v1

b01c0a0

feat(rerank): add tqdm progress bar for reranker scoring in eval

c30729d


Add rerank recipe readme (#151)

e6e8a32


shan-nvidia changed the title ~~[codex] Fix rerank finetune optimizer precision~~ fix(rerank) Fix rerank finetune optimizer precision May 15, 2026

Fix rerank finetune optimizer precision

cda44d5

Signed-off-by: Steve Han <sthan@nvidia.com>

shan-nvidia force-pushed the sthan/rerank-fusedadam-optimizer branch from a2b6436 to cda44d5 on May 15, 2026 20:50

shan-nvidia marked this pull request as ready for review May 15, 2026 20:52

shan-nvidia requested a review from oliverholworthy May 15, 2026 20:52

oliverholworthy force-pushed the oholworthy/rerank-recipe-v1 branch from e6e8a32 to 15f8009 on May 18, 2026 10:16

rnyak reviewed May 18, 2026


oliverholworthy force-pushed the oholworthy/rerank-recipe-v1 branch 3 times, most recently from 420c0ac to d4696b9 on May 18, 2026 16:45

shan-nvidia wants to merge 21 commits into oholworthy/rerank-recipe-v1 from sthan/rerank-fusedadam-optimizer

3 participants
