From 2f0e2ed0f3d06331baf5f06b787d0aafdb7b501f Mon Sep 17 00:00:00 2001
From: Marc Romeyn <marcromeyn@gmail.com>
Date: Mon, 12 Jan 2026 14:51:41 +0100
Subject: [PATCH 1/2] Updated pretrain and sft docs

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
---
 docs/train/nano3/pretrain.md | 306 +++++++++++++++++++++++++----------
 docs/train/nano3/sft.md      | 277 +++++++++++++++++++++++--------
 2 files changed, 435 insertions(+), 148 deletions(-)

diff --git a/docs/train/nano3/pretrain.md b/docs/train/nano3/pretrain.md
index b134949da..22dbde817 100644
--- a/docs/train/nano3/pretrain.md
+++ b/docs/train/nano3/pretrain.md
@@ -2,9 +2,162 @@
 
 This stage trains the base Nemotron 3 Nano model from scratch on 25 trillion tokens using [Megatron-Bridge](../nvidia-stack.md#megatron-bridge).
 
+Nemotron 3 Nano is a **hybrid Mamba-Transformer-MoE** model with 52 layers, combining state-space models for efficiency, attention for global context, and mixture-of-experts for capacity. Key innovations include aux-loss-free MoE balancing and a two-phase data curriculum.
+
 > **Open-Source Data Only**: This recipe uses exclusively open-sourced training data from the [Nemotron Pre-training Datasets](https://huggingface.co/collections/nvidia/nemotron-pre-training-datasets) collection, which is a subset of the full data used to train the released model. The recipe includes datasets from Nemotron-CC-Math-v1, Nemotron-CC-v2, Nemotron-CC-v2.1, and Nemotron-Pretraining-Specialized-v1. Results will differ from the benchmarks in the [tech report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf). Use this recipe as a reference implementation to apply the methodology with your own data.
 
-## Quick Start
+---
+
+## Training Methodology
+
+> **Training Framework**: Pretraining is implemented using [Megatron-Bridge](https://docs.nvidia.com/nemo/megatron-bridge/latest/), which provides the training loop, distributed training primitives, and checkpoint management. See [Training Entry Points](https://docs.nvidia.com/nemo/megatron-bridge/latest/training/entry-points.html) for details on how `pretrain()` works.
+>
+> For complete methodology, see [Tech Report Section 2](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
+
+### Model Architecture
+
+Nemotron 3 Nano uses a **hybrid Mamba-Transformer-MoE** architecture with 52 layers:
+
+| Layer Type | Count | Role |
+|------------|-------|------|
+| Mamba-2 | 23 | Efficient sequence modeling via state space |
+| Attention | 6 | Global context at key positions |
+| MoE | 23 | Sparse computation with 8 experts per layer |
+
+The hybrid pattern interleaves these layer types to balance efficiency and capability:
+
+```mermaid
+%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
+flowchart LR
+    subgraph layers["52 Layers"]
+        direction LR
+        m1["Mamba-2"] --> m2["Mamba-2"] --> a1["Attention"]
+        a1 --> moe1["MoE"] --> m3["Mamba-2"] --> m4["..."]
+    end
+
+    style m1 fill:#e8f5e9,stroke:#4caf50
+    style m2 fill:#e8f5e9,stroke:#4caf50
+    style m3 fill:#e8f5e9,stroke:#4caf50
+    style a1 fill:#e3f2fd,stroke:#2196f3
+    style moe1 fill:#fff3e0,stroke:#ff9800
+```
+
+**Key design choices:**
+
+- **Mamba-2 layers** provide linear-time sequence processing, enabling efficient inference on long contexts
+- **Attention layers** are placed at strategic intervals (every ~8 layers) for global information mixing
+- **MoE layers** use 8 experts with top-2 routing, keeping active parameters at ~4B while total parameters reach ~9B
+
+> For architecture rationale, see [Tech Report Section 2.1](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
+>
+> For implementation details, see [Megatron-Bridge Nemotron 3](https://docs.nvidia.com/nemo/megatron-bridge/latest/models/llm/nemotron3.html).
+
+### Pretraining Data
+
+The pretraining corpus comprises four main dataset families:
+
+| Dataset Family | Description |
+|----------------|-------------|
+| **Nemotron-CC-Code-v1** | High-quality code from Common Crawl |
+| **Nemotron-Pretraining-Code-v2** | GitHub code with student-teacher generation |
+| **Nemotron-CC-v2.1** | General English web crawl with synthetic rephrasing |
+| **Nemotron-Pretraining-Specialized-v1** | Synthetic STEM, math textbooks, scientific coding |
+
+Data spans 15 categories including web crawl (various quality tiers), code, math, academic, and multilingual content.
+
+> For dataset details, see [Tech Report Section 2.2](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
+
+### Data Mixture
+
+Training follows a two-phase curriculum that transitions from broad coverage to focused quality:
+
+| Phase | Tokens | Focus | Strategy |
+|-------|--------|-------|----------|
+| Phase 1 | 23.5T | Diversity | Broad coverage across all data sources |
+| Phase 2 | 1.5T | Quality | Increased weight on high-quality and STEM data |
+
+**Phase 1: Foundation Building**
+
+- Uses all dataset families with balanced weights
+- Emphasizes diversity: web (multiple quality tiers), code, math, multilingual
+- Builds broad knowledge base and language understanding
+
+**Phase 2: Quality Refinement**
+
+- Increases sampling from high-quality sources:
+  - `High-Quality` and `High-Quality-Synthetic` subsets
+  - Nemotron-Pretraining-Specialized-v1 (STEM, math textbooks, scientific coding)
+- Reduces low-quality web content
+- Sharpens model capabilities on curated data
+
+> For mixture strategy details, see [Tech Report Section 2.3](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
+
+### Hyperparameters
+
+| Parameter | Value |
+|-----------|-------|
+| **Total Tokens** | 25 trillion |
+| **Batch Size** | 8192 sequences |
+| **Sequence Length** | 4096 tokens |
+| **Peak Learning Rate** | 1e-3 |
+| **Minimum Learning Rate** | 1e-5 |
+| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
+| **Weight Decay** | 0.1 |
+| **MoE Load Balancing** | DeepSeek aux-loss-free strategy |
+
+**Learning Rate Schedule:**
+
+| Phase | Tokens | LR |
+|-------|--------|-----|
+| Warmup | 8.4B | 0 → 1e-3 |
+| Stable | 20T (80%) | 1e-3 |
+| Decay | 5T (20%) | 1e-3 → 1e-5 |
+
+The warmup is token-based (8.4B tokens), not percentage-based. The stable phase maintains peak LR for 80% of training before cosine decay.
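+
+The schedule above can be sketched as a function of tokens seen. This is a minimal illustration, not the Megatron-Bridge scheduler: the name `lr_at` is hypothetical, and it assumes linear warmup, that the warmup tokens count toward the first 20T, and a cosine decay over the final 5T.
+
+```python
+import math
+
+PEAK_LR = 1e-3
+MIN_LR = 1e-5
+WARMUP_TOKENS = 8.4e9    # token-based warmup
+STABLE_END = 20e12       # peak LR is held until 80% of training
+TOTAL_TOKENS = 25e12
+
+
+def lr_at(tokens: float) -> float:
+    """Learning rate after `tokens` training tokens (warmup, stable, decay)."""
+    if tokens < WARMUP_TOKENS:
+        # Assumed linear ramp from 0 to the peak learning rate.
+        return PEAK_LR * tokens / WARMUP_TOKENS
+    if tokens < STABLE_END:
+        return PEAK_LR
+    # Cosine decay from the peak to the minimum over the last 20% of tokens.
+    progress = min((tokens - STABLE_END) / (TOTAL_TOKENS - STABLE_END), 1.0)
+    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
+
+
+for t in (0.0, 4.2e9, 10e12, 22.5e12, 25e12):
+    print(f"{t:.3e} tokens -> lr {lr_at(t):.3e}")
+```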
+
+> For hyperparameter rationale, see [Tech Report Section 2.4](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
+
+### MoE Load Balancing
+
+Nemotron 3 Nano uses the **aux-loss-free load balancing** strategy from DeepSeek, avoiding the auxiliary losses traditionally used to balance expert utilization.
+
+**Why aux-loss-free?**
+
+Traditional MoE training adds an auxiliary loss term to encourage balanced routing. However, this:
+- Adds a hyperparameter (aux loss weight) that's hard to tune
+- Can conflict with the main training objective
+- May hurt model quality at scale
+
+**How it works:**
+
+Instead of auxiliary losses, the router uses **bias terms** that are adjusted dynamically:
+- Track expert utilization over a sliding window
+- Increase bias for underutilized experts (more tokens routed to them)
+- Decrease bias for overloaded experts
+- No gradient flows through the bias adjustment
+
+This achieves balanced expert utilization without interfering with the main loss function.
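+
+The bias mechanism can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the Megatron-Bridge router: the names `route_top_k` and `update_bias` are hypothetical, load is counted per batch rather than over a sliding window, and the bias moves by a fixed step in the direction of the load error.
+
+```python
+def route_top_k(scores, bias, k):
+    """Return the k expert indices with the highest biased score for one token.
+
+    The bias only influences which experts are selected; it is not a learned
+    parameter and receives no gradient.
+    """
+    biased = [s + b for s, b in zip(scores, bias)]
+    return sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)[:k]
+
+
+def update_bias(bias, load, step=0.01):
+    """Nudge each expert's bias toward balanced load (assumed fixed step size)."""
+    mean_load = sum(load) / len(load)
+    new_bias = []
+    for b, n in zip(bias, load):
+        if n < mean_load:
+            new_bias.append(b + step)   # underutilized: attract more tokens
+        elif n > mean_load:
+            new_bias.append(b - step)   # overloaded: push tokens away
+        else:
+            new_bias.append(b)
+    return new_bias
+
+
+# Toy batch: every token prefers experts 0 and 1.
+num_experts, k = 4, 2
+bias = [0.0] * num_experts
+batch_scores = [[0.9, 0.5, 0.4, 0.1]] * 3
+
+load = [0] * num_experts
+for scores in batch_scores:
+    for expert in route_top_k(scores, bias, k):
+        load[expert] += 1
+
+bias = update_bias(bias, load)
+print(load, bias)
+```
+
+Repeating the route-then-update loop over many batches gradually shifts tokens toward the underused experts without adding any term to the training loss.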
+
+> For details, see the [Auxiliary-Loss-Free Load Balancing paper](https://arxiv.org/abs/2408.15664).
+
+### Long-Context Extension
+
+The LC-Phase extends context to 1M tokens after main pretraining:
+
+| Parameter | Value |
+|-----------|-------|
+| **Duration** | 121 billion tokens |
+| **Learning Rate** | 1e-5 (constant) |
+| **Global Batch Size** | 48 |
+| **Parallelism** | 8-way context/tensor/expert, 4-way pipeline |
+
+> For long-context methodology, see [Tech Report Section 2.5](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
+
+---
+
+## Recipe Execution
+
+### Quick Start
 
 <div class="termy">
 
@@ -20,7 +173,7 @@ $ uv run nemotron nano3 pretrain --run YOUR-CLUSTER
 
 > **Note**: The `--run YOUR-CLUSTER` flag submits jobs via [NeMo-Run](../nemo-run.md). See [Execution through NeMo-Run](../nemo-run.md) for setup.
 
-### Direct Script Execution
+#### Direct Script Execution
 
 Inside a container on a compute node:
 
@@ -35,7 +188,7 @@ uv run python train.py --config config/default.yaml
 uv run torchrun --nproc_per_node=8 train.py --config config/default.yaml
 ```
 
-## Configuration
+### Configuration
 
 | File | Purpose |
 |------|---------|
@@ -43,11 +196,26 @@ uv run torchrun --nproc_per_node=8 train.py --config config/default.yaml
 | `config/data_prep.yaml` | Data preparation settings |
 | `config/data_blend_raw.json` | Dataset blend definition |
 
-## Data Preparation
+**Blend Configuration**
+
+Data blends are defined in `config/data_prep/data_blend_raw.json`. Each entry specifies:
+
+```json
+{
+  "name": "dataset-name",
+  "path": "hf://nvidia/...",
+  "subset": "subset-name",
+  "weight": 1.0
+}
+```
+
+Weights control sampling probability during data preparation. Phase transitions are implemented by using different blend configurations.
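+
+As a rough illustration of how weights relate to sampling probability, the sketch below normalizes the `weight` fields of a blend. It assumes probabilities are simply proportional to weights; the function name and the dataset names are hypothetical and not part of the recipe.
+
+```python
+def sampling_probabilities(blend):
+    """Map each blend entry name to weight / sum(weights)."""
+    total = sum(entry["weight"] for entry in blend)
+    if total <= 0:
+        raise ValueError("blend weights must sum to a positive value")
+    return {entry["name"]: entry["weight"] / total for entry in blend}
+
+
+# Hypothetical two-entry blend using the same fields as the JSON above.
+blend = [
+    {"name": "web", "path": "hf://nvidia/...", "subset": "subset-a", "weight": 3.0},
+    {"name": "math", "path": "hf://nvidia/...", "subset": "subset-b", "weight": 1.0},
+]
+print(sampling_probabilities(blend))  # {'web': 0.75, 'math': 0.25}
+```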
+
+### Data Preparation
 
 The `data_prep.py` script tokenizes raw text datasets into Megatron's binary format. See [Data Preparation Module](../data-prep.md) for detailed documentation.
 
-### CLI Command
+#### CLI Command
 
 ```bash
 uv run nemotron nano3 data prep pretrain [options]
@@ -59,7 +227,7 @@ uv run nemotron nano3 data prep pretrain [options]
 | `--sample N` | Limit rows per dataset (for testing) |
 | `--force` | Force re-run, ignoring cache |
 
-### Output
+#### Output
 
 ```
 output/nano3/stage0_pretrain/
@@ -74,9 +242,9 @@ output/nano3/stage0_pretrain/
 
 The output is registered as a [W&B Artifact](../artifacts.md) (`DataBlendsArtifact-pretrain`) for lineage tracking.
 
-## Training
+### Training
 
-### CLI Command
+#### CLI Command
 
 ```bash
 uv run nemotron nano3 pretrain [options] [overrides...]
@@ -89,7 +257,7 @@ uv run nemotron nano3 pretrain [options] [overrides...]
 | `--dry-run` | Preview execution plan |
 | `key=value` | Override config values ([CLI Framework](../cli.md#dotlist-overrides)) |
 
-### Override Examples
+#### Override Examples
 
 ```bash
 # More training iterations
@@ -102,7 +270,7 @@ uv run nemotron nano3 pretrain train.global_batch_size=64
 uv run nemotron nano3 pretrain checkpoint.save=/path/to/checkpoints
 ```
 
-## Running with NeMo-Run
+### Running with NeMo-Run
 
 Configure execution profiles in `env.toml`:
 
@@ -123,7 +291,31 @@ mounts = ["/lustre:/lustre"]
 
 See [Execution through NeMo-Run](../nemo-run.md) for complete configuration options.
 
-## Artifact Lineage
+### Checkpoint & Resume
+
+Training automatically saves checkpoints at regular intervals. To resume from a checkpoint:
+
+```bash
+# Resume from a specific checkpoint
+uv run nemotron nano3 pretrain checkpoint.load=/path/to/checkpoint
+
+# Resume from latest checkpoint in a directory
+uv run nemotron nano3 pretrain checkpoint.load=/path/to/checkpoints/
+```
+
+**Checkpoint Configuration:**
+
+| Option | Description |
+|--------|-------------|
+| `checkpoint.save` | Directory for saving checkpoints |
+| `checkpoint.load` | Path to checkpoint for resuming |
+| `checkpoint.save_interval` | Steps between saves (default: 1000) |
+
+Checkpoints use Megatron's distributed format, which handles model parallelism automatically. Each checkpoint contains model weights, optimizer state, and training progress.
+
+> For checkpoint format and advanced options, see [Megatron-Bridge Checkpointing](https://docs.nvidia.com/nemo/megatron-bridge/latest/training/checkpointing.html).
+
+### Artifact Lineage
 
 ```mermaid
 %%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
@@ -142,69 +334,9 @@ flowchart TB
     style next fill:#f3e5f5,stroke:#9c27b0
 ```
 
-## Methodology
-
-> For complete methodology, see [Tech Report Section 2](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
-
-### Pretraining Data
+---
 
-The pretraining corpus comprises four main dataset families:
-
-| Dataset Family | Description |
-|----------------|-------------|
-| **Nemotron-CC-Code-v1** | High-quality code from Common Crawl |
-| **Nemotron-Pretraining-Code-v2** | GitHub code with student-teacher generation |
-| **Nemotron-CC-v2.1** | General English web crawl with synthetic rephrasing |
-| **Nemotron-Pretrain-Specialized-v1** | Synthetic STEM, math textbooks, scientific coding |
-
-Data spans 15 categories including web crawl (various quality tiers), code, math, academic, and multilingual content.
-
-> For dataset details, see [Tech Report Section 2.2](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
-
-### Data Mixture
-
-Two-phase curriculum approach:
-
-| Phase | Tokens | Focus |
-|-------|--------|-------|
-| Phase 1 | 23.5T | High diversity across web, code, math, multilingual |
-| Phase 2 | 1.5T | High-quality data with curated sources |
-
-> For mixture strategy, see [Tech Report Section 2.3](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
-
-### Hyperparameters
-
-| Parameter | Value |
-|-----------|-------|
-| **Total Tokens** | 25 trillion |
-| **Batch Size** | 8192 sequences |
-| **Sequence Length** | 4096 tokens |
-| **Learning Rate** | 1e-4 (stable) → 1e-5 (decay) |
-| **Warmup** | 80% of training (20T tokens) |
-| **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
-| **Weight Decay** | 0.1 |
-| **MoE Load Balancing** | DeepSeek aux-loss-free strategy |
-
-> For hyperparameter rationale, see [Tech Report Section 2.4](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
-
-### Long-Context Extension
-
-The LC-Phase extends context to 1M tokens after main pretraining:
-
-| Parameter | Value |
-|-----------|-------|
-| **Duration** | 121 billion tokens |
-| **Learning Rate** | 1e-5 (constant) |
-| **Global Batch Size** | 48 |
-| **Parallelism** | 8-way context/tensor/expert, 4-way pipeline |
-
-> For long-context methodology, see [Tech Report Section 2.5](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
-
-## Open-Source Data
-
-> **Note**: This recipe trains exclusively on the open-sourced subset of pretraining data. Results will differ from the tech report benchmarks, which used additional proprietary data.
-
-## NVIDIA AI Stack
+## Infrastructure
 
 This stage uses the following components from the [NVIDIA AI Stack](../nvidia-stack.md):
 
@@ -215,16 +347,24 @@ This stage uses the following components from the [NVIDIA AI Stack](../nvidia-st
 
 ### Parallelism Configuration
 
-Pretraining uses multiple parallelism strategies for efficient scaling:
+Pretraining uses multiple parallelism strategies for efficient scaling. The specific values differ between main pretraining and long-context extension:
 
-| Parallelism | Config Key | Description |
-|-------------|------------|-------------|
-| Tensor (TP) | `model.tensor_model_parallel_size` | Split weight matrices across GPUs |
-| Pipeline (PP) | `model.pipeline_model_parallel_size` | Split layers into pipeline stages |
-| Data (DP) | Automatic | Replicate model, distribute batches |
-| Expert (EP) | `model.expert_model_parallel_size` | Distribute MoE experts across GPUs |
-| Context (CP) | `model.context_parallel_size` | Distribute long sequences |
-| Sequence (SP) | `model.sequence_parallel` | Distribute LayerNorm/Dropout activations |
+| Parallelism | Main Pretraining | Long-Context (LC) | Config Key |
+|-------------|------------------|-------------------|------------|
+| Tensor (TP) | 8 | 8 | `model.tensor_model_parallel_size` |
+| Pipeline (PP) | 1 | 4 | `model.pipeline_model_parallel_size` |
+| Expert (EP) | 8 | 8 | `model.expert_model_parallel_size` |
+| Context (CP) | 1 | 8 | `model.context_parallel_size` |
+| Sequence (SP) | Yes | Yes | `model.sequence_parallel` |
+| Data (DP) | Auto | Auto | Computed from world size |
+
+**Why the difference?**
+
+- **Main pretraining** uses 4K sequences, so context parallelism is not needed (CP=1)
+- **Long-context extension** handles sequences of up to 1M tokens, requiring CP=8 to distribute each sequence across GPUs
+- **Pipeline parallelism** increases in the LC phase (PP=4) to handle the larger activation memory
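+
+The "Auto" data-parallel size in the table can be estimated as follows. This is a sketch under the common Megatron convention that world size = TP × PP × CP × DP, with expert parallelism carved out of the remaining ranks rather than multiplying the total. Treat that convention, the helper name, and the 512-GPU world size as assumptions.
+
+```python
+def data_parallel_size(world_size, tp, pp, cp):
+    """DP is what remains after TP, PP, and CP have claimed their GPUs."""
+    model_parallel = tp * pp * cp
+    if world_size % model_parallel != 0:
+        raise ValueError(
+            f"world size {world_size} is not divisible by TP*PP*CP = {model_parallel}"
+        )
+    return world_size // model_parallel
+
+
+WORLD_SIZE = 512  # hypothetical cluster size, not taken from the recipe
+
+main_dp = data_parallel_size(WORLD_SIZE, tp=8, pp=1, cp=1)
+lc_dp = data_parallel_size(WORLD_SIZE, tp=8, pp=4, cp=8)
+print(main_dp, lc_dp)  # 64 2
+```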
+
+> For parallelism concepts, see [NVIDIA AI Stack: Parallelism](../nvidia-stack.md#parallelism-strategies).
 
 ### Container
 
@@ -232,6 +372,8 @@ Pretraining uses multiple parallelism strategies for efficient scaling:
 nvcr.io/nvidia/nemo:25.11.nemotron_3_nano
 ```
 
+---
+
 ## Next Steps
 
 After pretraining completes, proceed to [Stage 1: SFT](./sft.md) for instruction tuning.
diff --git a/docs/train/nano3/sft.md b/docs/train/nano3/sft.md
index 9f87bef19..a348cbcf7 100644
--- a/docs/train/nano3/sft.md
+++ b/docs/train/nano3/sft.md
@@ -4,7 +4,203 @@ This stage fine-tunes the pretrained model for instruction following using [Mega
 
 > **Open-Source Data Only**: This recipe uses exclusively open-sourced SFT data from the [Nemotron Post-training Datasets](https://huggingface.co/collections/nvidia/nemotron-post-training-v3) collection, which is a subset of the full data used to train the released model. The recipe includes datasets from Nemotron-Science-v1, Nemotron-Instruction-Following-Chat-v1, Nemotron-Math-Proofs-v1, Nemotron-SWE-v1, Nemotron-Agentic-v1, and Nemotron-Competitive-Programming-v1. Results will differ from the benchmarks in the [tech report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf). Use this recipe as a reference implementation to apply the methodology with your own data.
 
-## Quick Start
+---
+
+## Training Methodology
+
+> **Training Framework**: SFT is implemented using [Megatron-Bridge](https://docs.nvidia.com/nemo/megatron-bridge/latest/)'s `finetune()` entry point, which loads a pretrained checkpoint and handles the training loop with role-based loss masking. See [Training Entry Points](https://docs.nvidia.com/nemo/megatron-bridge/latest/training/entry-points.html) for implementation details.
+>
+> For complete methodology, see [Tech Report Section 3.1](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
+
+### Data Preparation Pipeline
+
+Before training, chat conversations are transformed into training-ready sequences through several stages:
+
+```mermaid
+%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
+flowchart LR
+    subgraph prep["Data Preparation"]
+        direction LR
+        chat["OpenAI Chat<br/>Format"] --> template["Chat<br/>Template"]
+        template --> chunks["Role-Labeled<br/>Chunks"]
+        chunks --> tok["Tokenization"]
+        tok --> mask["Loss Mask<br/>(role-based)"]
+        mask --> pack["Packing"]
+        pack --> roll["Mask Rolling"]
+    end
+    roll --> npy[".npy Output"]
+
+    style chat fill:#e3f2fd,stroke:#2196f3
+    style template fill:#e3f2fd,stroke:#2196f3
+    style chunks fill:#e3f2fd,stroke:#2196f3
+    style tok fill:#f3e5f5,stroke:#9c27b0
+    style mask fill:#f3e5f5,stroke:#9c27b0
+    style pack fill:#fff3e0,stroke:#ff9800
+    style roll fill:#fff3e0,stroke:#ff9800
+    style npy fill:#e8f5e9,stroke:#4caf50
+```
+
+| Stage | What Happens |
+|-------|--------------|
+| **OpenAI Chat Format** | Input messages with `role` (system/user/assistant) and `content` fields |
+| **Chat Template** | Renders messages using Nano3 Jinja template with special tokens (`<\|im_start\|>`, `<\|im_end\|>`) |
+| **Role-Labeled Chunks** | Splits rendered text back into chunks, each tagged with its source role |
+| **Tokenization** | Converts text chunks to token IDs |
+| **Loss Mask** | Builds mask: `1` for assistant tokens, `0` for system/user tokens |
+| **Packing** | Multiple sequences packed into fixed-length bins (4096 tokens) |
+| **Mask Rolling** | Shifts mask by 1 position for next-token prediction alignment |
+
+**Multi-turn splitting**: For conversations with reasoning content (`reasoning_content` field), the pipeline creates separate training sequences at each user turn. Reasoning from previous turns is dropped when a new user message appears—this matches inference behavior where users don't see intermediate reasoning.
+
+> For the data preparation implementation, see the recipe source: `src/nemotron/recipes/nano3/stage1_sft/data_prep.py`
+
+### Loss Masking
+
+Loss masking determines which tokens contribute to the training loss. In SFT, we only want the model to learn to generate responses—not to predict prompts or system instructions.
+
+**Why mask non-assistant tokens?**
+
+The model should learn to *respond*, not to *prompt*. If we computed loss on user messages, the model would be optimized to predict "What is 2+2?" given prior context—which isn't useful for an assistant. By masking user and system tokens (setting their loss weight to 0), gradients only flow from assistant responses, teaching the model what to generate without wasting capacity on predicting inputs.
+
+| Role | Loss Mask | Training Signal |
+|------|-----------|-----------------|
+| `system` | 0 | Ignored (instructions) |
+| `user` | 0 | Ignored (prompts) |
+| `assistant` | 1 | Learned (responses) |
+
+**Why roll the mask by 1?**
+
+In next-token prediction, the model predicts `token[i+1]` given `tokens[0..i]`. The loss compares the prediction against the *label*, which is the input sequence shifted by one position:
+
+```
+Position:     0    1    2    3    4
+Input:       [A]  [B]  [C]  [D]  [E]
+Label:       [B]  [C]  [D]  [E]  [_]   <- shifted by 1
+```
+
+If assistant content starts at position 2 (`[C]`), we want loss on predicting `[C]`, `[D]`, and `[E]`. But `[C]` is the label at position 1, not position 2, so the mask must be shifted one position to the left to align with the labels:
+
+```
+Original mask:  [0]  [0]  [1]  [1]  [1]   <- "assistant starts at C"
+Rolled mask:    [0]  [1]  [1]  [1]  [0]   <- aligns with labels C, D, E
+```
+
+The pipeline rolls the loss mask by 1 position so it correctly masks the *predictions* (labels) rather than the *inputs*.
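+
+A minimal sketch of role-based masking followed by label alignment is shown below. It assumes labels are the inputs shifted left by one position; the helper names are hypothetical and this is not the code in `chat_sft_processor.py`.
+
+```python
+def build_role_mask(chunks):
+    """Flatten (role, tokens) chunks into tokens plus a 0/1 mask (1 = assistant)."""
+    tokens, mask = [], []
+    for role, chunk_tokens in chunks:
+        tokens.extend(chunk_tokens)
+        mask.extend([1 if role == "assistant" else 0] * len(chunk_tokens))
+    return tokens, mask
+
+
+def roll_mask_left(mask):
+    """Shift the mask so mask[i] refers to the label predicted at position i."""
+    return mask[1:] + [0]
+
+
+chunks = [("system", ["A"]), ("user", ["B"]), ("assistant", ["C", "D", "E"])]
+tokens, role_mask = build_role_mask(chunks)
+
+labels = tokens[1:] + ["_"]          # label at position i is token i+1
+rolled = roll_mask_left(role_mask)
+supervised = [label for label, m in zip(labels, rolled) if m == 1]
+
+print(role_mask)   # [0, 0, 1, 1, 1]
+print(rolled)      # [0, 1, 1, 1, 0]
+print(supervised)  # ['C', 'D', 'E']
+```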
+
+**Truncation behavior (`max_doc_tokens`):**
+
+- **Default (null)**: No truncation—full sequences are preserved
+- **When set**: Sequences exceeding the limit are truncated from the end, with the loss mask adjusted accordingly
+
+> For implementation details, see `src/nemotron/data_prep/chat_sft_processor.py`
+
+### Packed Sequences
+
+**Why pack sequences?**
+
+Individual chat conversations vary in length—some are 50 tokens, others 3000. Without packing, each training sample would require padding to the maximum sequence length, wasting compute on empty tokens. Packing concatenates multiple conversations into a single fixed-length sequence (default 4096 tokens), maximizing GPU utilization.
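+
+To make the padding savings concrete, here is a small first-fit packing sketch. First-fit is used only for illustration and is an assumption: the recipe's packer in `packing/builder.py` may use a different strategy, and the function name is hypothetical.
+
+```python
+def pack_first_fit(lengths, pack_size=4096):
+    """Assign each sequence index to the first bin that still has room."""
+    bins = []   # each bin is a list of sequence indices
+    used = []   # tokens already placed in each bin
+    for idx, n in enumerate(lengths):
+        if n > pack_size:
+            raise ValueError(f"sequence {idx} ({n} tokens) exceeds pack_size")
+        for b in range(len(bins)):
+            if used[b] + n <= pack_size:
+                bins[b].append(idx)
+                used[b] += n
+                break
+        else:
+            # No existing bin has room, so open a new one.
+            bins.append([idx])
+            used.append(n)
+    return bins
+
+
+lengths = [3000, 50, 2000, 1000]
+bins = pack_first_fit(lengths)
+
+padding_unpacked = len(lengths) * 4096 - sum(lengths)
+padding_packed = len(bins) * 4096 - sum(lengths)
+print(bins)                              # [[0, 1, 3], [2]]
+print(padding_unpacked, padding_packed)  # 10334 2142
+```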
+
+The packed sequence format stores everything Megatron-Bridge needs for training:
+
+| Field | Description |
+|-------|-------------|
+| `input_ids` | Concatenated token IDs from multiple conversations |
+| `loss_mask` | Rolled mask indicating which positions contribute to loss (see [Loss Masking](#loss-masking)) |
+| `seq_start_id` | Boundary indices marking where each original conversation starts within the pack |
+
+**How `seq_start_id` works:**
+
+When multiple conversations are packed together, the model needs to know where one ends and another begins—otherwise attention could "leak" between unrelated conversations. The `seq_start_id` array marks these boundaries:
+
+```
+Pack: [Conv A tokens] [Conv B tokens] [Conv C tokens]
+       ^              ^              ^
+seq_start_id: [0,    128,           384]
+```
+
+Megatron-Bridge uses these boundaries for:
+- **Variable-length attention**: Attention is masked so tokens from Conv A can't attend to Conv B
+- **FlashAttention optimization**: Boundaries map to `cu_seqlens` parameter for efficient packed attention
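+
+The relationship between `seq_start_id` and `cu_seqlens` can be sketched as follows. The helper names are hypothetical, and the example assumes a pack that holds exactly 512 tokens with no padding. `cu_seqlens` is the list of cumulative sequence boundaries, starting at 0 and ending at the total token count.
+
+```python
+from bisect import bisect_right
+
+
+def to_cu_seqlens(seq_start_id, total_tokens):
+    """Append the end of the pack to the start offsets to get cumulative boundaries."""
+    return list(seq_start_id) + [total_tokens]
+
+
+def may_attend(cu_seqlens, query_pos, key_pos):
+    """Causal attention restricted to tokens from the same packed conversation."""
+    same_conversation = bisect_right(cu_seqlens, query_pos) == bisect_right(
+        cu_seqlens, key_pos
+    )
+    return same_conversation and key_pos <= query_pos
+
+
+cu_seqlens = to_cu_seqlens([0, 128, 384], total_tokens=512)
+seq_lengths = [end - start for start, end in zip(cu_seqlens, cu_seqlens[1:])]
+
+print(cu_seqlens)                        # [0, 128, 384, 512]
+print(seq_lengths)                       # [128, 256, 128]
+print(may_attend(cu_seqlens, 127, 0))    # True: both positions are in Conv A
+print(may_attend(cu_seqlens, 128, 127))  # False: Conv B cannot see Conv A
+```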
+
+> For packing implementation, see `src/nemotron/data_prep/packing/builder.py`
+
+### Chat Template
+
+Nemotron 3 Nano supports both reasoning and non-reasoning modes:
+
+- **Multi-Step**: Existing reasoning tokens are preserved for reuse in subsequent steps
+- **Multi-Turn**: Reasoning from previous turns is dropped when a new user message is introduced
+- **Tool Calling**: Uses XML-style special tags to reduce character escaping
+
+### SFT Data Domains
+
+| Domain | Description |
+|--------|-------------|
+| **Competition Math** | Tool-integrated reasoning with GPT-OSS teachers |
+| **Competition Code** | OpenCodeReasoning solutions with obfuscation/complication |
+| **InfinityByte** | Cross-domain code synthesis at model capability boundaries |
+| **STEM Reasoning (RQA)** | Reasoning Q&A from undergraduate/graduate STEM content |
+| **Conversational Tool Use** | Multi-turn trajectories with simulated tool execution |
+| **Long Context** | 128k mean token length, 256k hard limit |
+| **Formal Proofs** | Lean theorem proving with 300k examples |
+| **Multilingual** | French, Spanish, Italian, German, Japanese |
+| **Terminal Use** | Terminal operations from Terminal Bench |
+| **General Chat** | Multi-turn responses from LMSYS and WildChat |
+| **Instruction Following** | Tülu 3 methodology with verifier filtering |
+| **Safety** | Refusal behaviors from safety datasets |
+| **Software Engineering** | GitHub issue resolution trajectories |
+| **Science** | Physics, chemistry, biology via NeMo Data Designer |
+
+> For detailed data generation pipelines, see [Tech Report Section 3.1](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
+
+### Data Filtering
+
+The pipeline applies:
+- **Structural checks**: Discard malformed examples
+- **Pathological repetition filtering**: Remove repeated n-grams
+- **Consistency filtering**: Judge-based action consistency verification
+- **Narrative filtering**: Remove political/nationalistic narratives
+
+### Troubleshooting
+
+Common data preparation errors and solutions:
+
+| Error | Cause | Solution |
+|-------|-------|----------|
+| "# Tools missing" validation failure | Messages contain `<tool_call>` but system prompt lacks `# Tools` header | Add a `# Tools` section in the system prompt before tool definitions |
+| Empty sequences after processing | All tokens masked (no assistant content in conversation) | Verify input data contains assistant responses with actual content |
+| Template rendering mismatch | Tokenizer BPE splits differ from template expectations | Ensure tokenizer model matches the one used during template creation |
+| Sequences truncated excessively | Many conversations exceed `max_doc_tokens` | Consider increasing `max_doc_tokens` or `pack_size`, or chunking long conversations |
+
+**Debugging tips:**
+
+- Use `--sample 100` to test data preparation on a small subset
+- Check `metadata.json` output for statistics on filtered/truncated sequences
+- Review W&B artifacts for lineage tracking and validation metrics
+
+### Hyperparameters
+
+| Parameter | Value |
+|-----------|-------|
+| **Learning Rate** | 1e-5 |
+| **Sequence Length** | 4096 tokens (pack_size) |
+| **Loss Masking** | Role-based (assistant tokens only) |
+| **Loss Normalization** | Per-token (`calculate_per_token_loss: true`) |
+| **Optimizer** | AdamW |
+| **Total Samples** | 18M+ |
+
+**`calculate_per_token_loss` explained:**
+
+- **True (default)**: Loss is normalized by the number of tokens with `loss_mask=1` across the batch. Each token contributes equally regardless of which sequence it belongs to.
+- **False**: Loss is normalized by the number of sequences. Longer sequences (more assistant tokens) contribute more to the gradient.
+
+Per-token normalization is preferred for SFT because it ensures consistent learning signal regardless of conversation length.
+
+---
+
+## Recipe Execution
+
+### Quick Start
 
 <div class="termy">
 
@@ -20,7 +216,7 @@ $ uv run nemotron nano3 sft --run YOUR-CLUSTER
 
 > **Note**: The `--run YOUR-CLUSTER` flag submits jobs via [NeMo-Run](../nemo-run.md). See [Execution through NeMo-Run](../nemo-run.md) for setup.
 
-### Direct Script Execution
+#### Direct Script Execution
 
 Inside a container on a compute node:
 
@@ -35,7 +231,7 @@ uv run python train.py --config config/default.yaml
 uv run torchrun --nproc_per_node=8 train.py --config config/default.yaml
 ```
 
-## Configuration
+### Configuration
 
 | File | Purpose |
 |------|---------|
@@ -43,11 +239,11 @@ uv run torchrun --nproc_per_node=8 train.py --config config/default.yaml
 | `config/data_prep.yaml` | Data preparation settings |
 | `config/data_blend_raw.json` | Dataset blend definition |
 
-## Data Preparation
+### Data Preparation
 
 The `data_prep.py` script processes OpenAI-format chat data into packed sequences with role-based loss masking. See [Data Preparation Module](../data-prep.md) for detailed documentation.
 
-### CLI Command
+#### CLI Command
 
 ```bash
 uv run nemotron nano3 data prep sft [options]
@@ -59,7 +255,7 @@ uv run nemotron nano3 data prep sft [options]
 | `--sample N` | Limit rows per dataset (for testing) |
 | `--force` | Force re-run, ignoring cache |
 
-### Output
+#### Output
 
 ```
 output/stage1_sft/
@@ -71,9 +267,9 @@ output/stage1_sft/
 
 The output is registered as a [W&B Artifact](../artifacts.md) (`DataBlendsArtifact-sft`) for lineage tracking.
 
-## Training
+### Training
 
-### CLI Command
+#### CLI Command
 
 ```bash
 uv run nemotron nano3 sft [options] [overrides...]
@@ -86,7 +282,7 @@ uv run nemotron nano3 sft [options] [overrides...]
 | `--dry-run` | Preview execution plan |
 | `key=value` | Override config values ([CLI Framework](../cli.md#dotlist-overrides)) |
 
-### Override Examples
+#### Override Examples
 
 ```bash
 # More training iterations
@@ -99,7 +295,7 @@ uv run nemotron nano3 sft optimizer.lr=1e-5
 uv run nemotron nano3 sft checkpoint.load=/path/to/pretrain/checkpoint
 ```
 
-## Running with NeMo-Run
+### Running with NeMo-Run
 
 Configure execution profiles in `env.toml`:
 
@@ -120,7 +316,7 @@ mounts = ["/lustre:/lustre"]
 
 See [Execution through NeMo-Run](../nemo-run.md) for complete configuration options.
 
-## Artifact Lineage
+### Artifact Lineage
 
 ```mermaid
 %%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
@@ -141,62 +337,9 @@ flowchart TB
     style next fill:#e8f5e9,stroke:#4caf50
 ```
 
-## Methodology
+---
 
-> For complete methodology, see [Tech Report Section 3.1](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
-
-### Chat Template
-
-Nemotron 3 Nano supports both reasoning and non-reasoning modes:
-
-- **Multi-Step**: Existing reasoning tokens preserved for reuse in subsequent steps
-- **Multi-Turn**: Reasoning from previous turns dropped when user message introduced
-- **Tool Calling**: Uses XML-style special tags to reduce character escaping
-
-### SFT Data Domains
-
-| Domain | Description |
-|--------|-------------|
-| **Competition Math** | Tool-integrated reasoning with GPT-OSS teachers |
-| **Competition Code** | OpenCodeReasoning solutions with obfuscation/complication |
-| **InfinityByte** | Cross-domain code synthesis at model capability boundaries |
-| **STEM Reasoning (RQA)** | Reasoning Q&A from undergraduate/graduate STEM content |
-| **Conversational Tool Use** | Multi-turn trajectories with simulated tool execution |
-| **Long Context** | 128k mean token length, 256k hard limit |
-| **Formal Proofs** | Lean theorem proving with 300k examples |
-| **Multilingual** | French, Spanish, Italian, German, Japanese |
-| **Terminal Use** | Terminal operations from Terminal Bench |
-| **General Chat** | Multi-turn responses from LMSYS and WildChat |
-| **Instruction Following** | Tülu 3 methodology with verifier filtering |
-| **Safety** | Refusal behaviors from safety datasets |
-| **Software Engineering** | GitHub issue resolution trajectories |
-| **Science** | Physics, chemistry, biology via NeMo Data Designer |
-
-> For detailed data generation pipelines, see [Tech Report Section 3.1](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
-
-### Data Filtering
-
-The pipeline applies:
-- **Structural checks**: Discard malformed examples
-- **Pathological repetition filtering**: Remove repeated n-grams
-- **Consistency filtering**: Judge-based action consistency verification
-- **Narrative filtering**: Remove political/nationalistic narratives
-
-### Hyperparameters
-
-| Parameter | Value |
-|-----------|-------|
-| **Learning Rate** | 1e-5 |
-| **Sequence Length** | 4096 tokens (pack_size) |
-| **Loss Masking** | Role-based (assistant tokens only) |
-| **Optimizer** | AdamW |
-| **Total Samples** | 18M+ |
-
-## Open-Source Data
-
-> **Note**: This recipe trains exclusively on the open-sourced subset of SFT data. Results will differ from the tech report benchmarks, which used additional proprietary data.
-
-## NVIDIA AI Stack
+## Infrastructure
 
 This stage uses the following components from the [NVIDIA AI Stack](../nvidia-stack.md):
 
@@ -220,6 +363,8 @@ This stage uses the following components from the [NVIDIA AI Stack](../nvidia-st
 nvcr.io/nvidia/nemo:25.11.nemotron_3_nano
 ```
 
+---
+
 ## Next Steps
 
 After SFT completes, proceed to [Stage 2: RL](./rl.md) for alignment training.

From 70b14874925720b783c4e8590da1877dd410abe2 Mon Sep 17 00:00:00 2001
From: Marc Romeyn <marcromeyn@gmail.com>
Date: Mon, 12 Jan 2026 15:24:56 +0100
Subject: [PATCH 2/2] Some fixes and add update rl.md

Signed-off-by: Marc Romeyn <marcromeyn@gmail.com>
---
 docs/train/nano3/pretrain.md |  35 ++-
 docs/train/nano3/rl.md       | 448 +++++++++++++++++++++++++++--------
 docs/train/nano3/sft.md      |  29 ++-
 3 files changed, 386 insertions(+), 126 deletions(-)

diff --git a/docs/train/nano3/pretrain.md b/docs/train/nano3/pretrain.md
index 22dbde817..73d64fad1 100644
--- a/docs/train/nano3/pretrain.md
+++ b/docs/train/nano3/pretrain.md
@@ -46,7 +46,7 @@ flowchart LR
 
 - **Mamba-2 layers** provide linear-time sequence processing, enabling efficient inference on long contexts
 - **Attention layers** are placed at strategic intervals (every ~8 layers) for global information mixing
-- **MoE layers** use 8 experts with top-2 routing, keeping active parameters at ~4B while total parameters reach ~9B
+- **MoE layers** use 128 routed experts plus 1 shared expert, with 6 experts activated per token, keeping active parameters at ~3.5B while total parameters reach ~31.6B
 
 > For architecture rationale, see [Tech Report Section 2.1](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
 >
@@ -97,8 +97,8 @@ Training follows a two-phase curriculum that transitions from broad coverage to
 | Parameter | Value |
 |-----------|-------|
 | **Total Tokens** | 25 trillion |
-| **Batch Size** | 8192 sequences |
-| **Sequence Length** | 4096 tokens |
+| **Batch Size** | 3,072 sequences |
+| **Sequence Length** | 8,192 tokens |
 | **Peak Learning Rate** | 1e-3 |
 | **Minimum Learning Rate** | 1e-5 |
 | **Optimizer** | AdamW (β₁=0.9, β₂=0.95) |
@@ -173,21 +173,30 @@ $ uv run nemotron nano3 pretrain --run YOUR-CLUSTER
 
 > **Note**: The `--run YOUR-CLUSTER` flag submits jobs via [NeMo-Run](../nemo-run.md). See [Execution through NeMo-Run](../nemo-run.md) for setup.
 
-#### Direct Script Execution
+#### Direct Script Execution (Megatron-Bridge)
 
-Inside a container on a compute node:
+For direct execution outside this CLI, use the scripts in the [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) repository:
 
 ```bash
-# Data preparation
-uv run python data_prep.py --config config/data_prep.yaml
-
-# Training (single node)
-uv run python train.py --config config/default.yaml
-
-# Training (distributed)
-uv run torchrun --nproc_per_node=8 train.py --config config/default.yaml
+# Clone the repository and check out the nano-v3 branch
+git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
+cd Megatron-Bridge
+git checkout nano-v3
+
+# Run pretraining (inside container on compute node)
+python examples/recipes/nemotron_3/pretrain_nemotron_3_nano.py \
+    --per-split-data-args-path /path/to/data_args.json \
+    --tokenizer-model /path/to/tokenizer.model
+
+# With config file overrides
+python examples/recipes/nemotron_3/pretrain_nemotron_3_nano.py \
+    --config-file /path/to/overrides.yaml \
+    --per-split-data-args-path /path/to/data_args.json \
+    --tokenizer-model /path/to/tokenizer.model
 ```
 
+See the [Megatron-Bridge Nemotron 3 documentation](https://docs.nvidia.com/nemo/megatron-bridge/latest/models/llm/nemotron3.html) for detailed configuration options.
+
 ### Configuration
 
 | File | Purpose |
diff --git a/docs/train/nano3/rl.md b/docs/train/nano3/rl.md
index 1a92ef046..0b96fb07d 100644
--- a/docs/train/nano3/rl.md
+++ b/docs/train/nano3/rl.md
@@ -4,7 +4,201 @@ This stage aligns the instruction-tuned model using GRPO (Group Relative Policy
 
 > **Open-Source Data Only**: This recipe uses exclusively open-sourced RL data from the [Nemotron Post-training Datasets](https://huggingface.co/collections/nvidia/nemotron-post-training-v3) collection, which is a subset of the full data used to train the released model. The recipe uses the [Nemotron-3-Nano-RL-Training-Blend](https://huggingface.co/datasets/nvidia/Nemotron-3-Nano-RL-Training-Blend) dataset. Results will differ from the benchmarks in the [tech report](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf). Use this recipe as a reference implementation to apply the methodology with your own data.
 
-## Quick Start
+---
+
+## Training Methodology
+
+> **Training Framework**: RL alignment is implemented using [NeMo-RL](https://docs.nvidia.com/nemo/rl/latest/) with Ray for distributed actor coordination and vLLM for fast rollout generation. The Megatron backend handles distributed policy training with tensor, pipeline, context, and expert parallelism. See [NeMo-RL Documentation](https://docs.nvidia.com/nemo/rl/latest/) for implementation details.
+>
+> For complete methodology, see [Tech Report Section 3.2](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
+
+### RL Pipeline Overview
+
+The RL pipeline consists of three components:
+1. **RLVR** — Multi-environment training with verifiable rewards
+2. **RLHF with GenRM** — Generative reward model-based alignment
+3. **DPO** — Preference learning to reduce tool hallucination
+
+### Data Preparation Pipeline
+
+Before training, the RL dataset is transformed into JSONL format compatible with NeMo-Gym:
+
+```mermaid
+%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
+flowchart LR
+    subgraph prep["Data Preparation"]
+        direction LR
+        hf["HuggingFace<br/>Dataset"] --> resolve["Placeholder<br/>Resolution"]
+        resolve --> jsonl["JSONL<br/>Format"]
+        jsonl --> split["Train/Val/Test<br/>Split"]
+    end
+    split --> gym["NeMo-Gym<br/>Environment"]
+    gym --> reward["Reward<br/>Computation"]
+
+    style hf fill:#e1f5fe,stroke:#2196f3
+    style resolve fill:#e1f5fe,stroke:#2196f3
+    style jsonl fill:#f3e5f5,stroke:#9c27b0
+    style split fill:#f3e5f5,stroke:#9c27b0
+    style gym fill:#e8f5e9,stroke:#4caf50
+    style reward fill:#e8f5e9,stroke:#4caf50
+```
+
+| Stage | What Happens |
+|-------|--------------|
+| **HuggingFace Dataset** | Load [Nemotron-3-Nano-RL-Training-Blend](https://huggingface.co/datasets/nvidia/Nemotron-3-Nano-RL-Training-Blend) from HuggingFace Hub |
+| **Placeholder Resolution** | Resolve `_hf_placeholder` records by fetching from external datasets (DAPO, Skywork) and applying template restoration |
+| **JSONL Format** | Convert to JSONL with `question`, `expected_answer`, and `responses_create_params` fields |
+| **Train/Val/Test Split** | Split into training (98%), validation (1%), and test (1%) sets |
+| **NeMo-Gym Environment** | Route samples to appropriate reward environments based on task type |
+| **Reward Computation** | Compute verifiable rewards (math correctness, code execution, schema adherence) |
+
+**Placeholder Resolution:**
+
+The [Nemotron-3-Nano-RL-Training-Blend](https://huggingface.co/datasets/nvidia/Nemotron-3-Nano-RL-Training-Blend) dataset contains placeholder records that reference external HuggingFace datasets. The `data_prep.py` script resolves these by:
+
+1. Detecting placeholder records by the presence of the `_hf_placeholder` field
+2. Fetching actual data from external HF datasets:
+   - [ByteDance-Seed/DAPO-Math-17k](https://huggingface.co/datasets/ByteDance-Seed/DAPO-Math-17k) — Math reasoning problems
+   - [Skywork/Skywork-OR1-RL-Data](https://huggingface.co/datasets/Skywork/Skywork-OR1-RL-Data) — Open reasoning data
+3. Applying template restoration (DAPO prefix/suffix, Skywork `{question}` replacement)
+
+> For the data preparation implementation, see the recipe source: `src/nemotron/recipes/nano3/stage2_rl/data_prep.py`
+
+### GRPO Algorithm
+
+GRPO (Group Relative Policy Optimization) optimizes the policy using group-relative advantages:
+
+1. **Generate responses** from the current policy using vLLM
+2. **Evaluate** responses using NeMo-Gym reward environments
+3. **Compute group-relative advantages** across response groups per prompt
+4. **Update the policy** to favor higher-reward responses with clipped gradients
+
+**Loss Function:**
+
+The GRPO loss uses clipped policy gradients with KL regularization:
+
+$$
+L(\theta) = E_{x \sim \pi_{\theta_{\text{old}}}} \Big[ \min \Big(\frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}A_t, \text{clip} \big( \frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}, 1 - \varepsilon, 1 + \varepsilon \big) A_t \Big) \Big] - \beta D_{\text{KL}} (\pi_\theta \| \pi_\text{ref})
+$$
+
+Where:
+- $\pi_\theta$ is the policy being optimized
+- $\pi_{\theta_{\text{old}}}$ is the policy from the beginning of this step
+- $A_t$ is the advantage estimate (group-relative)
+- $\varepsilon$ is the clipping margin (0.2 on the lower side and 0.28 on the upper side in this recipe)
+- $\beta$ is the KL penalty coefficient
+- $\pi_{\text{ref}}$ is the reference policy (frozen SFT checkpoint)
+
+**Stability Improvements:**
+
+| Improvement | Description |
+|-------------|-------------|
+| **On-Policy KL Approximation** | Uses importance weights to correct for off-policy samples, providing an unbiased and guaranteed-positive KL estimator |
+| **Importance Sampling Correction** | Corrects for discrepancies between inference (vLLM) and training (Megatron) token probabilities |
+| **Overlong Filtering** | Excludes sequences that hit max length without EOS from loss computation, reducing noise from truncated generations |
+| **Asymmetric Clipping** | Uses `ratio_clip_min=0.2` and `ratio_clip_max=0.28` for asymmetric policy update bounds |
+
+> For detailed loss function derivations, see the [NeMo-RL GRPO Guide](https://docs.nvidia.com/nemo/rl/latest/guides/grpo.html#loss).
+
+### Multi-Environment RLVR
+
+Training uses 6 reward environments through NeMo-Gym:
+
+| Environment | Description | Reward Type |
+|-------------|-------------|-------------|
+| **math_with_judge** | Mathematical reasoning (DAPO, Skywork math) | Answer correctness verification |
+| **code_gen** | Code correctness with test case execution | Unit test pass rate |
+| **mcqa** | STEM multiple choice questions | Answer matching |
+| **instruction_following** | IFEval, Multi-Challenge compliance | Constraint satisfaction |
+| **workplace_assistant** | Agentic tool use, multi-turn interactions | Task completion |
+| **structured_outputs_json** | JSON schema adherence | Schema validation |
+
+Training on all environments simultaneously provides stable gains without co-reward degradation.
+
+> For environment implementation details, see [NeMo-RL Environments Guide](https://docs.nvidia.com/nemo/rl/latest/guides/environments.html).
+
+### GenRM (RLHF)
+
+Generative reward models use a circular comparison strategy (N comparisons instead of O(N²)) with length-normalized reward adjustment:
+
+| Parameter | Value |
+|-----------|-------|
+| **Prompts per batch** | 128 |
+| **Responses per prompt** | 16 |
+| **Comparison strategy** | Circular |
+| **Length bonus α** | 0.5 |
+
+> For GenRM training details, see [Tech Report Section 3.2](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
+
+### DPO for Tool Hallucination
+
+DPO reduces hallucinated tool usage with minimal computational overhead:
+
+| Metric | Before DPO | After DPO |
+|--------|------------|-----------|
+| **AIME25 Accuracy** | 80.88% | 84.58% |
+| **Hallucination Rate** | 8.33% | 0.7% |
+
+> For DPO methodology, see [Tech Report Appendix C](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf) and [NeMo-RL DPO Guide](https://docs.nvidia.com/nemo/rl/latest/guides/dpo.html).
+
+### Reasoning Control
+
+The model supports two reasoning controls, taught by modifying a fraction of the training samples:
+- **Reasoning on/off control** — Strip reasoning from 10% of samples
+- **Token budget control** — Truncate 3% of reasoning traces to different budgets
+
+### Hyperparameters
+
+**GRPO Settings:**
+
+| Parameter | Value | Description |
+|-----------|-------|-------------|
+| `num_prompts_per_step` | 128 | Prompts sampled per training step |
+| `num_generations_per_prompt` | 16 | Rollouts generated per prompt |
+| `max_total_sequence_length` | 49152 | Maximum sequence length (~49K tokens) |
+| `normalize_rewards` | true | Normalize rewards across batch |
+| `use_leave_one_out_baseline` | true | Variance reduction for advantage estimation |
+| `val_period` | 5 | Validation every N steps |
+| `max_num_epochs` | 1 | Single epoch over data |
+| `seed` | 42 | Random seed for reproducibility |
+
+**Loss Function:**
+
+| Parameter | Value | Description |
+|-----------|-------|-------------|
+| `ratio_clip_min` | 0.2 | Lower clipping margin: the importance ratio is clipped below at 1 − 0.2 = 0.8 |
+| `ratio_clip_max` | 0.28 | Upper clipping margin: the importance ratio is clipped above at 1 + 0.28 = 1.28 |
+| `use_on_policy_kl_approximation` | true | Use unbiased on-policy KL estimator |
+| `use_importance_sampling_correction` | true | Correct for inference/training mismatch |
+| `token_level_loss` | true | Per-token loss normalization |
+| `reference_policy_kl_penalty` | 0 | KL regularization weight (disabled) |
+
+**Optimizer:**
+
+| Parameter | Value |
+|-----------|-------|
+| `optimizer` | AdamW |
+| `lr` | 3e-6 |
+| `min_lr` | 3e-6 |
+| `weight_decay` | 0.0 |
+| `adam_beta1` | 0.9 |
+| `adam_beta2` | 0.999 |
+| `adam_eps` | 1e-8 |
+| `clip_grad` | 1.0 |
+
+**Sequence Packing:**
+
+| Parameter | Value |
+|-----------|-------|
+| `enabled` | true |
+| `algorithm` | modified_first_fit_decreasing |
+| `sequence_length_round` | 64 |
+
+---
+
+## Recipe Execution
+
+### Quick Start
 
 <div class="termy">
 
@@ -20,31 +214,53 @@ $ uv run nemotron nano3 rl --run YOUR-CLUSTER
 
 > **Note**: The `--run YOUR-CLUSTER` flag submits jobs via [NeMo-Run](../nemo-run.md). See [Execution through NeMo-Run](../nemo-run.md) for setup.
 
-### Direct Script Execution
+### Running in NeMo-RL Repository
 
-Inside a container on a compute node (requires [NeMo-RL](../nvidia-stack.md#nemo-rl) and Ray):
+For direct execution using NeMo-RL (without the nemotron CLI wrapper), follow the [NeMo-RL Nemotron 3 Nano Guide](https://docs.nvidia.com/nemo/rl/latest/guides/nemotron-3-nano.html):
+
+**1. Download and prepare the dataset:**
 
 ```bash
-# Data preparation
-uv run python data_prep.py --config config/data_prep.yaml
+# Download the RL blend dataset
+huggingface-cli download nvidia/Nemotron-3-Nano-RL-Training-Blend \
+    --repo-type dataset \
+    --local-dir /path/to/rl-blend
+
+# Fill in placeholder entries (resolves DAPO, Skywork references)
+python /path/to/rl-blend/create_nanov3_jsonl.py /path/to/rl-blend/data/train.jsonl
 
-# Training (Ray initialized internally)
-uv run python train.py --config config/grpo_nanov3.yaml
+# Split into train/validation
+head -n -1000 /path/to/rl-blend/data/train.jsonl > /path/to/train.jsonl
+tail -n 1000 /path/to/rl-blend/data/train.jsonl > /path/to/validation.jsonl
 ```
 
-## Configuration
+**2. Run GRPO training:**
+
+```bash
+# From NeMo-RL repository root
+uv run python examples/nemo_gym/run_grpo_nemo_gym.py \
+    --config examples/nemo_gym/grpo_nanov3.yaml \
+    data.train_jsonl_fpath=/path/to/train.jsonl \
+    data.validation_jsonl_fpath=/path/to/validation.jsonl \
+    policy.model_name=/path/to/sft/checkpoint \
+    logger.wandb_enabled=True
+```
+
+> **Note**: The default recipe requires 32 nodes with 8 GPUs each. See the [NeMo-RL cluster documentation](https://docs.nvidia.com/nemo/rl/latest/cluster.html) for Slurm configuration.
+
+### Configuration
 
 | File | Purpose |
 |------|---------|
-| `config/grpo_nanov3.yaml` | Production GRPO configuration |
-| `config/data_prep.yaml` | Data preparation settings |
-| `config/data_blend_raw.json` | RL dataset blend |
+| `config/default.yaml` | Production GRPO configuration |
+| `config/data_prep/default.yaml` | Data preparation settings |
+| `config/data_prep/data_blend_raw.json` | RL dataset blend |
 
-## Data Preparation
+### Data Preparation
 
 The `data_prep.py` script converts datasets to JSONL format compatible with [NeMo-RL](../nvidia-stack.md#nemo-rl)'s NeMo-Gym interface. See [Data Preparation Module](../data-prep.md) for detailed documentation.
 
-### CLI Command
+#### CLI Command
 
 ```bash
 uv run nemotron nano3 data prep rl [options]
@@ -56,7 +272,7 @@ uv run nemotron nano3 data prep rl [options]
 | `--sample N` | Limit rows per dataset (for testing) |
 | `--force` | Force re-run, ignoring cache |
 
-### Output
+#### Output
 
 ```
 output/nano3/stage2_rl/
@@ -71,9 +287,9 @@ output/nano3/stage2_rl/
 
 The output is registered as a [W&B Artifact](../artifacts.md) (`DataBlendsArtifact-rl`) for lineage tracking.
 
-## Training
+### Training
 
-### CLI Command
+#### CLI Command
 
 ```bash
 uv run nemotron nano3 rl [options] [overrides...]
@@ -86,20 +302,23 @@ uv run nemotron nano3 rl [options] [overrides...]
 | `--dry-run` | Preview execution plan |
 | `key=value` | Override config values ([CLI Framework](../cli.md#dotlist-overrides)) |
 
-### Override Examples
+#### Override Examples
 
 ```bash
-# More iterations
-uv run nemotron nano3 rl grpo.num_iterations=200
+# More training steps
+uv run nemotron nano3 rl grpo.max_num_steps=200000
 
-# Different temperature
+# Different temperature for generation
 uv run nemotron nano3 rl policy.generation.temperature=0.8
 
 # Different learning rate
-uv run nemotron nano3 rl grpo.learning_rate=5e-7
+uv run nemotron nano3 rl policy.megatron_cfg.optimizer.lr=5e-7
+
+# Disable sequence packing
+uv run nemotron nano3 rl policy.sequence_packing.enabled=false
 ```
 
-## Running with NeMo-Run
+### Running with NeMo-Run
 
 Configure execution profiles in `env.toml`:
 
@@ -112,7 +331,7 @@ entity = "YOUR-TEAM"
 executor = "slurm"
 account = "YOUR-ACCOUNT"
 partition = "batch"
-nodes = 2
+nodes = 32
 ntasks_per_node = 8
 gpus_per_node = 8
 mem = "0"
@@ -122,13 +341,54 @@ mounts = ["/lustre:/lustre"]
 
 See [Execution through NeMo-Run](../nemo-run.md) for complete configuration options.
 
+### Checkpoint & Resume
+
+Training saves a checkpoint every `save_period` steps and ranks checkpoints by validation reward. To resume from a checkpoint:
+
+```bash
+# Resume from a specific checkpoint
+uv run nemotron nano3 rl policy.model_name=/path/to/checkpoint
+
+# Resume from latest checkpoint in results directory
+uv run nemotron nano3 rl checkpointing.checkpoint_dir=/path/to/results
+```
+
+**Checkpoint Configuration:**
+
+| Option | Value | Description |
+|--------|-------|-------------|
+| `save_period` | 10 | Steps between checkpoint saves |
+| `metric_name` | val:total_reward/mean | Metric for best checkpoint selection |
+| `higher_is_better` | true | Higher reward = better checkpoint |
+| `keep_top_k` | 1000000 | Maximum number of checkpoints to retain (effectively unlimited) |
+
+### Troubleshooting
+
+Common errors and solutions:
+
+| Error | Cause | Solution |
+|-------|-------|----------|
+| High `token_mult_prob_error` | Mismatch between vLLM and Megatron probabilities | Check weight refitting; ensure vLLM compilation settings match |
+| KL divergence spikes | Single-token probability errors in MoE layers | Monitor the `gen_kl_error` metric; values above 1e-3 indicate issues |
+| OOM during generation | vLLM memory allocation too high | Reduce `gpu_memory_utilization` (default 0.5) |
+| Slow convergence | Learning rate too low or too high | Adjust `policy.megatron_cfg.optimizer.lr` |
+
+**Debugging tips:**
+
+- Monitor `token_mult_prob_error` for inference/training consistency (should stay below ~2%)
+- Watch `sampling_importance_ratio` (should hover around 1.0)
+- Check `approx_entropy` for entropy collapse during training
+- Use `--sample N` in data prep for quick iteration
+
+---
+
 ## Artifact Lineage
 
 ```mermaid
 %%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
 flowchart TB
     prev["ModelArtifact-sft<br/>(from Stage 1)"] --> train
-    rl["RL Datasets<br/>(preference/reward data)"] --> dp["data_prep.py"]
+    rl["RL Datasets<br/>(HuggingFace)"] --> dp["data_prep.py"]
     dp --> data["DataBlendsArtifact-rl<br/>(JSONL files)"]
     data --> train["train.py<br/>(GRPO with NeMo-RL)"]
     train --> model["ModelArtifact-rl<br/>(final aligned model)"]
@@ -141,96 +401,74 @@ flowchart TB
     style model fill:#e8f5e9,stroke:#4caf50
 ```
 
-## Methodology
-
-> For complete methodology, see [Tech Report Section 3.2](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
-
-The RL pipeline consists of three components:
-1. **RLVR** — Multi-environment training with verifiable rewards
-2. **RLHF with GenRM** — Generative reward model-based alignment
-3. **DPO** — Preference learning to reduce tool hallucination
-
-### Multi-Environment RLVR
-
-Training uses 7 reward environments through NeMo-Gym:
+---
 
-| Environment | Description |
-|-------------|-------------|
-| **Competition Math** | Mathematical reasoning (DAPO, SkyWorks math) |
-| **Competition Coding** | Code correctness with test case execution |
-| **Question Answering** | STEM multiple choice verification |
-| **Structured Outputs** | JSON schema adherence |
-| **Instruction Following** | IFEval, Multi-Challenge compliance |
-| **Long Context** | 256k token multi-document synthesis |
-| **Agentic Tool Use** | Workplace Assistant, Multi-Turn Agent |
-
-Training on all environments simultaneously provides stable gains without co-reward degradation.
+## Infrastructure
 
-> For GRPO algorithm details, see [Tech Report Section 3.2](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
-
-### GenRM (RLHF)
-
-Generative reward models use circular comparison strategy (N comparisons instead of O(N²)) with length-normalized reward adjustment.
+This stage uses the following components from the [NVIDIA AI Stack](../nvidia-stack.md):
 
-| Parameter | Value |
-|-----------|-------|
-| **Prompts per batch** | 128 |
-| **Responses per prompt** | 16 |
-| **Comparison strategy** | Circular |
-| **Length bonus α** | 0.5 |
+| Component | Role | Documentation |
+|-----------|------|---------------|
+| [NeMo-RL](../nvidia-stack.md#nemo-rl) | GRPO algorithm, policy training, reward computation | [Docs](https://docs.nvidia.com/nemo/rl/latest/) |
+| [Megatron-Core](../nvidia-stack.md#megatron-core) | Distributed training primitives (TP, PP, CP, EP) | [GitHub](https://github.com/NVIDIA/Megatron-LM) |
+| [Ray](https://ray.io/) | Distributed actor coordination | [Docs](https://docs.ray.io/) |
+| vLLM | Fast rollout generation | [GitHub](https://github.com/vllm-project/vllm) |
 
-> For GenRM training details, see [Tech Report Section 3.2](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
+### Parallelism Configuration
 
-### DPO for Tool Hallucination
+Training uses multiple parallelism strategies for efficient scaling:
 
-DPO reduces hallucinated tool usage with minimal computational overhead:
+| Parallelism | Value | Config Key |
+|-------------|-------|------------|
+| Tensor (TP) | 2 | `policy.megatron_cfg.tensor_model_parallel_size` |
+| Pipeline (PP) | 2 | `policy.megatron_cfg.pipeline_model_parallel_size` |
+| Context (CP) | 4 | `policy.megatron_cfg.context_parallel_size` |
+| Expert (EP) | 8 | `policy.megatron_cfg.expert_model_parallel_size` |
+| Sequence (SP) | Yes | `policy.megatron_cfg.sequence_parallel` |
 
-| Metric | Before DPO | After DPO |
-|--------|------------|-----------|
-| **AIME25 Accuracy** | 80.88% | 84.58% |
-| **Hallucination Rate** | 8.33% | 0.7% |
+**Generation (vLLM):**
 
-> For DPO methodology, see [Tech Report Appendix C](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf).
+| Parameter | Value | Description |
+|-----------|-------|-------------|
+| `tensor_parallel_size` | 4 | TP for vLLM generation |
+| `gpu_memory_utilization` | 0.5 | Fraction of GPU memory vLLM may use (weights, activations, and KV cache) |
+| `colocated` | true | Share GPUs with training |
+| `enforce_eager` | false | Do not force eager mode; vLLM may use CUDA graphs and torch.compile |
 
-### GRPO Hyperparameters
+**Cluster:**
 
 | Parameter | Value |
 |-----------|-------|
-| **Prompts per step** | 128 |
-| **Generations per prompt** | 16 |
-| **Max Generation Length** | 49K tokens |
-| **Epsilon Filtering** | Cosine annealing with 4% limit |
-| **MoE Load Balancing** | DeepSeek aux-loss-free strategy |
-
-### Reasoning Control
-
-The model supports:
-- **Reasoning on/off control** — Strip reasoning from 10% of samples
-- **Token budget control** — Truncate 3% of reasoning traces to different budgets
-
-## Requirements
-
-- **GPU nodes**: Recommended 8 GPUs per node (H100)
-- **Ray cluster**: Automatically initialized for distributed execution
-
-## NVIDIA AI Stack
-
-This stage uses the following components from the [NVIDIA AI Stack](../nvidia-stack.md):
-
-| Component | Role | Documentation |
-|-----------|------|---------------|
-| [NeMo-RL](../nvidia-stack.md#nemo-rl) | GRPO algorithm, policy training, reward computation | [Docs](https://docs.nvidia.com/nemo/rl/latest/) |
-| [Ray](https://ray.io/) | Distributed actor coordination | [Docs](https://docs.ray.io/) |
-| vLLM | Fast rollout generation | [GitHub](https://github.com/vllm-project/vllm) |
+| `num_nodes` | 32 |
+| `gpus_per_node` | 8 |
 
 ### Key Features Used
 
 | Feature | Purpose |
 |---------|---------|
 | GRPO algorithm | Group Relative Policy Optimization with clipped gradients |
-| Multi-environment training | Simultaneous training across 7 reward environments |
-| NeMo-Gym | Reward environments (math, code, tool-use) |
-| DTensor backend | FSDP2-based distributed training |
+| Megatron backend | Distributed training with TP/PP/CP/EP parallelism |
+| Sequence Packing | Efficient batch utilization for variable-length generations |
+| vLLM Generation | Fast rollout with tensor parallelism |
+| MoE Router Bias | Aux-loss-free load balancing (`freeze_moe_router=true`) |
+| Per-token Loss | Consistent gradient signal (`calculate_per_token_loss=true`) |
+
+### NeMo-Gym Environments
+
+The training configuration loads one NeMo-Gym config for the vLLM policy model and one for each of the six reward environments:
+
+```yaml
+env:
+  nemo_gym:
+    config_paths:
+      - responses_api_models/vllm_model/configs/vllm_model_for_training.yaml
+      - resources_servers/math_with_judge/configs/math_with_judge.yaml
+      - resources_servers/code_gen/configs/code_gen.yaml
+      - resources_servers/workplace_assistant/configs/workplace_assistant.yaml
+      - resources_servers/mcqa/configs/mcqa.yaml
+      - resources_servers/instruction_following/configs/instruction_following.yaml
+      - resources_servers/structured_outputs/configs/structured_outputs_json.yaml
+```
 
 ### Architecture
 
@@ -238,10 +476,10 @@ NeMo-RL uses a Ray-based actor model:
 
 | Actor | Function |
 |-------|----------|
-| Policy Model | Trainable policy weights |
-| Generator | vLLM-backed rollout generation |
-| Reward Model | Environment-specific reward computation |
-| Reference Model | KL divergence regularization |
+| **Policy Model** | Trainable policy weights (Megatron backend) |
+| **Generator** | vLLM-backed rollout generation (colocated) |
+| **Reward Environments** | NeMo-Gym environments for reward computation |
+| **Reference Model** | Frozen SFT checkpoint for KL divergence |
 
 ### Container
 
@@ -249,14 +487,18 @@ NeMo-RL uses a Ray-based actor model:
 nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano
 ```
 
-## Open-Source Data
+---
+
+## Next Steps
 
-> **Note**: This recipe trains exclusively on the open-sourced subset of RL data. Results will differ from the tech report benchmarks, which used additional proprietary data.
+After RL completes, the final aligned model (`ModelArtifact-rl`) is ready for evaluation and deployment.
 
 ## Reference
 
 - [Tech Report Section 3.2](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf) — RL methodology
-- [NVIDIA AI Stack](../nvidia-stack.md) — NeMo-RL documentation
+- [NeMo-RL Documentation](https://docs.nvidia.com/nemo/rl/latest/) — GRPO, DPO, environments
+- [NeMo-RL Nemotron 3 Nano Guide](https://docs.nvidia.com/nemo/rl/latest/guides/nemotron-3-nano.html) — Upstream training guide
+- [NVIDIA AI Stack](../nvidia-stack.md) — NeMo-RL, Megatron-Core documentation
 - [Artifact Lineage](../artifacts.md) — W&B artifact system
 - [Stage 0: Pretraining](./pretrain.md) — Pretrain the base model
 - [Stage 1: SFT](./sft.md) — Instruction tuning
diff --git a/docs/train/nano3/sft.md b/docs/train/nano3/sft.md
index a348cbcf7..12666c6ba 100644
--- a/docs/train/nano3/sft.md
+++ b/docs/train/nano3/sft.md
@@ -216,21 +216,30 @@ $ uv run nemotron nano3 sft --run YOUR-CLUSTER
 
 > **Note**: The `--run YOUR-CLUSTER` flag submits jobs via [NeMo-Run](../nemo-run.md). See [Execution through NeMo-Run](../nemo-run.md) for setup.
 
-#### Direct Script Execution
+#### Direct Script Execution (Megatron-Bridge)
 
-Inside a container on a compute node:
+For direct execution outside this CLI, use the scripts in the [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) repository:
 
 ```bash
-# Data preparation
-uv run python data_prep.py --config config/data_prep.yaml
-
-# Training (single node)
-uv run python train.py --config config/default.yaml
-
-# Training (distributed)
-uv run torchrun --nproc_per_node=8 train.py --config config/default.yaml
+# Clone the repository and check out the nano-v3 branch
+git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
+cd Megatron-Bridge
+git checkout nano-v3
+
+# Run fine-tuning (inside container on compute node)
+python examples/recipes/nemotron_3/finetune_nemotron_3_nano.py \
+    --per-split-data-args-path /path/to/data_args.json \
+    --tokenizer-model /path/to/tokenizer.model
+
+# With config file overrides
+python examples/recipes/nemotron_3/finetune_nemotron_3_nano.py \
+    --config-file /path/to/overrides.yaml \
+    --per-split-data-args-path /path/to/data_args.json \
+    --tokenizer-model /path/to/tokenizer.model
 ```
 
+See the [Megatron-Bridge Nemotron 3 documentation](https://docs.nvidia.com/nemo/megatron-bridge/latest/models/llm/nemotron3.html) for detailed configuration options.
+
 ### Configuration
 
 | File | Purpose |