NVIDIA-NeMo · ayushdg · May 19, 2026 · May 13, 2026 · May 18, 2026 · May 18, 2026
diff --git a/docs/nemotron/data/curation/nemotron-cc.md b/docs/nemotron/data/curation/nemotron-cc.md
@@ -19,7 +19,7 @@ Common Crawl → Extract & Clean → Deduplicate → Quality Classify → Synthe
 | 2b | `step_2b-fuzzy_dedup.py` | MinHash + LSH fuzzy deduplication | GPU (identify), CPU (remove) |
 | 2c | `step_2c-substring_dedup/` | Exact substring deduplication using suffix arrays | CPU-only |
 | 3 | `step_3-quality_classification.py` | Ensemble quality scoring into 20 buckets | GPU (classify), CPU (ensemble) |
-| 4 | `step_4-sdg.py` | LLM-based synthetic data generation on top-quality data | CPU + LLM endpoint |
+| 4 | `step_4-sdg.py` | LLM-based synthetic data generation on top-quality data | GPU (local inference server, default) or CPU + external LLM endpoint (with `--no-serve-model`) |
 
 Steps 1–3 progressively filter and annotate the data. Step 4 generates synthetic training data (diverse QA, distillation, knowledge extraction, knowledge lists) from the highest-quality documents (buckets 18–19).
 
@@ -35,9 +35,9 @@ See the recipe README at `src/nemotron/recipes/data/curation/nemotron-cc/README.
 
 ## Prerequisites
 
-- [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator) installed with Ray support
+- [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator) 1.2.0 (26.04 release) or newer, installed with Ray support
 - GPU(s) for steps 2a, 2b, and 3 (deduplication and classification)
-- Access to an OpenAI-compatible LLM endpoint for step 4 (NVIDIA NIM, vLLM, or cloud API)
+- For step 4, one of: GPU(s) to host a local inference server (default), an OpenAI-compatible endpoint (self-hosted vLLM/NIM or cloud, via `--no-serve-model`), or an [NVIDIA Build](https://build.nvidia.com/) API key
 
 ## After Curation
 

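Reviewer sketch for step 3 ("Ensemble quality scoring into 20 buckets"). The recipe README describes the ensemble phase as computing token-weighted percentile thresholds, mapping scores to integer bins 0-19, and taking the per-document max across classifiers. The following is a minimal, stdlib-only illustration of that idea and is not the recipe's implementation. The function names, the tie handling at bucket boundaries, and the absence of sampling are all assumptions made for the example.

```python
from bisect import bisect_left

NUM_BUCKETS = 20  # buckets 0 (lowest) through 19 (highest)


def token_weighted_thresholds(scores, token_counts, num_buckets=NUM_BUCKETS):
    """Return up to num_buckets - 1 cut points so each bucket holds a similar token mass."""
    pairs = sorted(zip(scores, token_counts))
    total_tokens = sum(token_counts)
    thresholds = []
    cumulative = 0
    next_cut = 1
    for score, tokens in pairs:
        cumulative += tokens
        # Record a cut each time the cumulative token mass crosses k / num_buckets.
        while next_cut < num_buckets and cumulative >= total_tokens * next_cut / num_buckets:
            thresholds.append(score)
            next_cut += 1
    return thresholds


def to_bucket(score, thresholds):
    """Map a float score to an integer bucket: the count of thresholds strictly below it."""
    # Assumption: a score exactly on a cut point stays in the lower bucket.
    return bisect_left(thresholds, score)


def ensemble_bucket(scores_per_classifier, thresholds_per_classifier):
    """Ensemble score for one document: the max bucket across classifiers."""
    return max(
        to_bucket(score, thresholds)
        for score, thresholds in zip(scores_per_classifier, thresholds_per_classifier)
    )


# Toy data: 20 documents with one token each and evenly spaced scores.
toy_scores = [i / 20 for i in range(20)]
toy_tokens = [1] * 20
toy_thresholds = token_weighted_thresholds(toy_scores, toy_tokens)
toy_buckets = [to_bucket(s, toy_thresholds) for s in toy_scores]
```

Because the final score is a max, a document that only one classifier rates highly can still land in buckets 18-19, the range step 4 consumes.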
diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/README.md b/src/nemotron/recipes/data/curation/nemotron-cc/README.md
@@ -4,70 +4,89 @@ This directory contains the recipe for curating datasets similar to the [Nemotro
 
 ### Requirements
 
-- [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator) > 1.1.0 ([install from main](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html))
+- [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator) 1.2.0 (26.04 release) or newer ([install instructions](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html))
 - GPU(s) for steps 2a, 2b, and 3 (deduplication and classification)
 - [Cargo/Rust](https://doc.rust-lang.org/cargo/getting-started/installation.html) for step 2c (building `deduplicate-text-datasets`)
-- Access to an OpenAI-compatible LLM endpoint for step 4 (NVIDIA NIM, vLLM, or cloud API)
+- For step 4, one of: GPU(s) to host a local inference server (default), an OpenAI-compatible endpoint (self-hosted vLLM/NIM or cloud), or an [NVIDIA Build](https://build.nvidia.com/) API key
 
 ### Pipeline Overview
 
-#### Step 1: Download, Extract, and Clean (`step_1-download_extract.py`)
+| # | Script | Compute | Output |
+|---|--------|---------|--------|
+| 1 | [`step_1-download_extract.py`](#step-1-download-extract-and-clean) | CPU | `data/cleaned_extracted/` |
+| 2a | [`step_2a-exact_dedup.py`](#step-2a-exact-deduplication) | GPU + CPU | `data/exact_deduplicated/` |
+| 2b | [`step_2b-fuzzy_dedup.py`](#step-2b-fuzzy-deduplication) | GPU + CPU | `data/fuzzy_deduplicated/` |
+| 2c | [`step_2c-substring_dedup/`](#step-2c-substring-deduplication) | CPU | `data/substring_deduped/` |
+| 3 | [`step_3-quality_classification.py`](#step-3-quality-classification) | GPU + CPU | `data/quality_labeling/bucketed_results/` |
+| 4 | [`step_4-sdg.py`](#step-4-synthetic-data-generation) | GPU or external API | `data/sdg_output/` |
+
+---
+
+#### Step 1: Download, Extract, and Clean
 
 A CPU-only pipeline that produces clean text from raw web data:
 
 - Downloads Common Crawl snapshots (WARC files) and extracts text using JusText.
 - Annotates each document with a language using a FastText language identification model.
 - Fixes mojibake (encoding issues) via Unicode reformatting.
-- **Output:** `data/cleaned_extracted/`
-- **Resources:** CPU-only. We recommend each worker has at least 2GB of RAM to prevent OOM errors.
 
-#### Step 2a: Exact Deduplication (`step_2a-exact_dedup.py`)
+**Resources:** CPU-only. We recommend at least 2GB of RAM per worker to prevent OOM errors.
+
+---
+
+#### Step 2a: Exact Deduplication
 
-Exact deduplication using document hashing:
+Exact deduplication via document hashing. Run `--identify` then `--remove`.
 
-- **Phase 1 (`--identify`):** Hashes every document and identifies exact duplicates.
-  - **Resources:** Requires GPU(s) for accelerated hashing. For a single snapshot (~4-10TB) extracted we tested with 8 H100 GPUs. For all of Common Crawl we recommend ~128 GPUs with 80GB VRAM per GPU.
-- **Phase 2 (`--remove`):** Removes duplicate documents, keeping one copy.
-  - **Resources:** CPU-only. Reads duplicate IDs from the cache directory and filters the original dataset. We recommend each worker has at-least 6GB of RAM to prevent OOM errors.
-- **Output:** `data/exact_deduplicated/`.
+| Phase | Compute | Scale tested / notes |
+|-------|---------|----------------------|
+| `--identify` | GPU | 8× H100 for a single snapshot (~4-10TB). ~128× 80GB GPUs recommended for full Common Crawl. |
+| `--remove` | CPU, ≥6GB RAM/worker | Reads cached duplicate IDs and filters the original dataset. |
 
-#### Step 2b: Fuzzy Deduplication (`step_2b-fuzzy_dedup.py`)
+---
 
-Fuzzy deduplication using MinHash + LSH:
+#### Step 2b: Fuzzy Deduplication
 
-- **Phase 1 (`--identify`):** Identify near duplicate docs using MinHash-LSH based duplicate identification.
-  - **Resources:** Requires GPU(s). For a single snapshot (~1-8TB) exact deduplicated we tested with 8 H100 GPUs.
-- **Phase 2 (`--remove`):** Removes fuzzy duplicates based on connected components.
-  - **Resources:** CPU-only. Reads duplicate IDs and filters the original dataset. We recommend each worker has at-least 6GB of RAM to prevent OOM errors.
-- **Output:** `data/fuzzy_deduplicated/`.
+Fuzzy deduplication using MinHash + LSH. Run `--identify` then `--remove`.
 
-#### Step 2c: Substring Deduplication (`step_2c-substring_dedup/`)
+| Phase | Compute | Scale tested / notes |
+|-------|---------|----------------------|
+| `--identify` | GPU | 8× H100 for a single snapshot (~1-8TB exact-deduped). |
+| `--remove` | CPU, ≥6GB RAM/worker | Filters using connected-components results. |
+
+---
+
+#### Step 2c: Substring Deduplication
 
 CPU-only exact substring deduplication using [Google Research's deduplicate-text-datasets](https://github.com/google-research/deduplicate-text-datasets) ([paper](https://arxiv.org/abs/2107.06499)). Removes duplicate substrings within and across documents using suffix arrays.
 
-- **Resources:** CPU-only. Requires 2-3x the input dataset size in RAM and 10-15x in disk space. We recommend splitting data into 100GB chunks.
-- **Output:** `data/substring_deduped/`.
+**Resources:** CPU-only. Requires 2-3× the input dataset size in RAM and 10-15× on disk. We recommend splitting data into 100GB chunks.
 
 See the [step_2c README](./step_2c-substring_dedup/README.md) for detailed instructions and debugging tips.
 
-#### Step 3: Quality Classification (`step_3-quality_classification.py`)
+---
+
+#### Step 3: Quality Classification
 
-Ensemble quality scoring and bucketing into 20 quality tiers:
+Ensemble quality scoring and bucketing into 20 quality tiers. Run `--classify` then `--ensemble`.
 
-- **Phase 1 (`--classify`):** Filters to English, then runs three quality classifiers in parallel:
-  - [FineWebNemotronEduClassifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier)
-  - [FineWebMixtralEduClassifier](https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier)
-  - [FastText quality filter (`fasttext-oh-eli5`)](https://huggingface.co/mlfoundations/fasttext-oh-eli5)
-  - **Resources:** Requires GPU(s) for the neural classifiers. For a single snapshot we tested with 64 H100 GPUs. This scale is embarrassingly parallel so use fewer/more GPUs as needed with at least 80GB VRAM per GPU.
-- **Phase 2 (`--ensemble`):** Computes token-weighted percentile thresholds from sampled classification scores, maps float scores to integer bins (0-19), takes the per-document max across classifiers as the ensemble score.
-  - **Resources:** CPU-only. Reads classification results and computes thresholds and bucketing. Tested with max `fraction=0.1` on a machine with 200GB ram. For OOM errors would recommend reducing the sampling fraction.
-- **Output:** `data/quality_labeling/bucketed_results/ensemble-max-int={0-19}/` partitioned by quality bucket (0 = lowest, 19 = highest).
+| Phase | Compute | Scale tested / notes |
+|-------|---------|----------------------|
+| `--classify` | GPU, ≥80GB VRAM | Filters to English, runs three classifiers in parallel. Tested at 64× H100 per snapshot; embarrassingly parallel — scale up/down freely. |
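+| `--ensemble` | CPU | Computes token-weighted percentile thresholds from sampled classification scores, maps float scores to integer bins (0-19), and takes the per-document max across classifiers as the ensemble score. Tested with `fraction=0.1` on a machine with 200GB RAM; reduce the sampling fraction if you hit OOM errors. |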
 
-#### Step 4: Synthetic Data Generation (`step_4-sdg.py`)
+Classifiers used:
+- [FineWebNemotronEduClassifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier)
+- [FineWebMixtralEduClassifier](https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier)
+- [FastText quality filter (`fasttext-oh-eli5`)](https://huggingface.co/mlfoundations/fasttext-oh-eli5)
 
-LLM-based synthetic data generation on the highest-quality documents (buckets 18 and 19). This is a CPU-only pipeline — LLM inference happens via API calls to an external endpoint (NVIDIA Integrate, or a self-hosted OAI endpoint compatible server).
+**Output layout:** `data/quality_labeling/bucketed_results/ensemble-max-int={0-19}/` partitioned by bucket (0 = lowest, 19 = highest).
 
-Four generation tasks:
+---
+
+#### Step 4: Synthetic Data Generation
+
+LLM-based synthetic data generation on the highest-quality documents (buckets 18 and 19). Four generation tasks:
 
 | Task | Description | Max Input / Output Tokens |
 |------|-------------|---------------------------|
@@ -76,8 +95,19 @@ Four generation tasks:
 | `extract_knowledge` | Rewrites text as textbook/Wikipedia-style passages focused on factual content | 1400 / 1400 |
 | `knowledge_list` | Extracts organized bulleted lists of key facts, concepts, and statistics | 1000 / 600 |
 
-Each task runs as an independent pipeline (preprocessing, LLM generation, postprocessing, write). When `--task all` is used, the four tasks run sequentially. They can also be run as separate processes in parallel.
+Each task is an independent pipeline (preprocess → LLM generate → postprocess → write). `--task all` runs the four sequentially; they can also be launched as parallel processes.
+
+**LLM backends** — pick one:
+
+| Backend | How to select | Notes |
+|---------|---------------|-------|
+| Local inference server (default) | (default) | Spins up a Ray Serve + vLLM deployment of `--model-name` on the local GPU cluster. No API key. Tune with `--tensor-parallel-size`, `--min-replicas`, `--max-replicas`; bump `--max-concurrent-requests` (try 256–512) if GPU utilization is low. |
+| Existing OpenAI-compatible endpoint | `--no-serve-model --base-url <url>` | Self-hosted vLLM/TRT-LLM/NIM or any OpenAI-compatible cloud provider. `--api-key` forwarded if set. |
+| [NVIDIA Build](https://build.nvidia.com/) | `--no-serve-model` | Uses the default `--base-url`. Requires `--api-key` (or `NVIDIA_API_KEY`). The default `--model-name` is not available on NVIDIA Build, so set `--model-name` (and `--tokenizer`) to a model that is. |
+
+> **Note on `--tokenizer`:** The tokenizer is loaded via Hugging Face `AutoTokenizer`, so `--tokenizer` must be a Hugging Face repo id (or local path to HF tokenizer files), regardless of which backend you pick. If `--tokenizer` is not set, it defaults to `--model-name`, which in some cases is not a valid HF tokenizer path — for example `--model-name meta/llama-3.3-70b-instruct` needs `--tokenizer meta-llama/Llama-3.3-70B-Instruct` set explicitly.
+
+**Defaults:**
 
-- **Default model:** [`Qwen/Qwen3-30B-A3B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507). This model is not available on [NVIDIA Build](https://build.nvidia.com/), so you'll need to provide a `--base-url` pointing to an endpoint serving it (self-hosted via vLLM/NIM, or any OpenAI-compatible cloud provider). Alternatively, you can use any model available on NVIDIA Build by setting `--model-name` and `--tokenizer` accordingly.
-- **Output:** `data/sdg_output/<task_name>/`.
-- **Resources:** CPU-only for the script itself. Requires access to an LLM endpoint.
+- **Default model:** [`Qwen/Qwen3-30B-A3B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507).
+- **Resources:** With the default local inference server (`--serve-model`), GPU(s) for vLLM. With `--no-serve-model`, the script is CPU-only and just needs network access to the chosen endpoint.
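Reviewer sketch of the step 4 backend selection and tokenizer fallback described in the README changes above. This is a hypothetical illustration of the documented rules, not code from `step_4-sdg.py`. `Backend`, `resolve_backend`, and `DEFAULT_BASE_URL` are names invented for the example, and the script's real default `--base-url` is not shown in this diff, so a placeholder stands in for it.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder only: the script's actual default --base-url does not appear in this diff.
DEFAULT_BASE_URL = "<default-base-url>"
DEFAULT_MODEL_NAME = "Qwen/Qwen3-30B-A3B-Instruct-2507"


@dataclass
class Backend:
    kind: str                # "local", "openai_compatible", or "nvidia_build"
    base_url: Optional[str]  # None when the model is served locally
    tokenizer: str           # must be a Hugging Face repo id or local path


def resolve_backend(
    serve_model: bool = True,
    base_url: Optional[str] = None,
    api_key: Optional[str] = None,
    model_name: str = DEFAULT_MODEL_NAME,
    tokenizer: Optional[str] = None,
) -> Backend:
    """Mirror the README table: local server by default, otherwise an endpoint."""
    # Per the README note, --tokenizer falls back to --model-name when unset.
    resolved_tokenizer = tokenizer or model_name
    if serve_model:
        return Backend("local", None, resolved_tokenizer)
    if base_url is not None:
        # --no-serve-model --base-url <url>: any OpenAI-compatible endpoint.
        return Backend("openai_compatible", base_url, resolved_tokenizer)
    if not api_key:
        # A real script could also read NVIDIA_API_KEY from the environment here.
        raise ValueError("NVIDIA Build needs --api-key or NVIDIA_API_KEY")
    return Backend("nvidia_build", DEFAULT_BASE_URL, resolved_tokenizer)


local = resolve_backend()
self_hosted = resolve_backend(serve_model=False, base_url="http://localhost:8000/v1")
build = resolve_backend(
    serve_model=False,
    api_key="example-key",
    model_name="meta/llama-3.3-70b-instruct",
    tokenizer="meta-llama/Llama-3.3-70B-Instruct",
)
```

The last call reflects the README's own example: the endpoint model id `meta/llama-3.3-70b-instruct` is not a Hugging Face repo id, so the tokenizer has to be passed explicitly.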
diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_1-download_extract.py b/src/nemotron/recipes/data/curation/nemotron-cc/step_1-download_extract.py
@@ -29,7 +29,7 @@
 from fsspec.core import url_to_fs
 from loguru import logger
 
-from nemo_curator.backends.experimental.ray_data import RayDataExecutor
+from nemo_curator.backends.ray_data import RayDataExecutor
 from nemo_curator.core.client import RayClient
 from nemo_curator.pipeline import Pipeline
 from nemo_curator.stages.base import ProcessingStage

diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_2a-exact_dedup.py b/src/nemotron/recipes/data/curation/nemotron-cc/step_2a-exact_dedup.py
@@ -37,7 +37,7 @@
 
 from nemo_curator.stages.file_partitioning import FilePartitioningStage
 from nemo_curator.tasks import EmptyTask
-from nemo_curator.backends.experimental.ray_data import RayDataExecutor
+from nemo_curator.backends.ray_data import RayDataExecutor
 from nemo_curator.core.client import RayClient
 from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow
 from nemo_curator.stages.deduplication.id_generator import CURATOR_DEDUP_ID_STR

diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_2b-fuzzy_dedup.py b/src/nemotron/recipes/data/curation/nemotron-cc/step_2b-fuzzy_dedup.py
@@ -36,7 +36,7 @@
 
 from loguru import logger
 
-from nemo_curator.backends.experimental.ray_data import RayDataExecutor
+from nemo_curator.backends.ray_data import RayDataExecutor
 from nemo_curator.core.client import RayClient
 from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
 from nemo_curator.stages.deduplication.id_generator import CURATOR_DEDUP_ID_STR

diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_2c-substring_dedup/README.md b/src/nemotron/recipes/data/curation/nemotron-cc/step_2c-substring_dedup/README.md
@@ -13,7 +13,7 @@ If you are adapting it for Slurm:
 3. The `remove_duplicates` step must run on a single exclusive node.
 
 Dependencies:
-- `nemo_curator>=1.1.0`
+- `nemo_curator>=1.2.0` (26.04 release)
 - `cargo` (for building `deduplicate-text-datasets`). See https://doc.rust-lang.org/cargo/getting-started/installation.html
 
 We recommend splitting up the dataset into 100 GB chunks or less, and executing `exact_substring_dedup.sh` on each 100 GB chunk.

diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_2c-substring_dedup/prepare_dataset.py b/src/nemotron/recipes/data/curation/nemotron-cc/step_2c-substring_dedup/prepare_dataset.py
@@ -21,7 +21,7 @@
 import tiktoken
 
 from nemo_curator.backends.base import WorkerMetadata
-from nemo_curator.backends.experimental.ray_data import RayDataExecutor
+from nemo_curator.backends.ray_data import RayDataExecutor
 from nemo_curator.core.client import RayClient
 from nemo_curator.pipeline import Pipeline
 from nemo_curator.stages.base import ProcessingStage
@@ -91,6 +91,7 @@ def write_id_to_filename(input_file_paths: list[str], output_path: str) -> dict[
     input_file_list = [os.path.basename(filename) for filename in input_file_paths]
     id_to_filename = {str(i): filename for i, filename in enumerate(input_file_list)}
 
+    os.makedirs(output_path, exist_ok=True)
     with open(f"{output_path}/id_to_filename.json", "w") as fp:
         json.dump(id_to_filename, fp)
 

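Reviewer note on the `os.makedirs(output_path, exist_ok=True)` addition in `prepare_dataset.py`: `open(..., "w")` creates the file but not its missing parent directories, so the write fails when `output_path` does not exist yet. The snippet below is a reduced, standalone reproduction. `write_mapping_sketch` is a hypothetical stand-in for the function in the diff, and its return value is chosen for the example only.

```python
import json
import os
import tempfile


def write_mapping_sketch(input_file_paths, output_path):
    """Reduced stand-in for the patched function: write an id -> filename map."""
    input_file_list = [os.path.basename(filename) for filename in input_file_paths]
    id_to_filename = {str(i): filename for i, filename in enumerate(input_file_list)}

    # The fix from the diff: create the output directory (and parents) if needed.
    os.makedirs(output_path, exist_ok=True)
    with open(os.path.join(output_path, "id_to_filename.json"), "w") as fp:
        json.dump(id_to_filename, fp)
    return id_to_filename


with tempfile.TemporaryDirectory() as tmp:
    missing_dir = os.path.join(tmp, "not", "yet", "created")

    # Without makedirs, opening a file under a missing directory raises.
    try:
        with open(os.path.join(missing_dir, "id_to_filename.json"), "w"):
            pass
        failed_without_makedirs = False
    except FileNotFoundError:
        failed_without_makedirs = True

    # With makedirs, the same target path is written successfully.
    mapping = write_mapping_sketch(["/data/a.jsonl", "/data/b.jsonl"], missing_dir)
    with open(os.path.join(missing_dir, "id_to_filename.json")) as fp:
        round_trip = json.load(fp)
```

`exist_ok=True` keeps the call idempotent, so rerunning the step against an existing output directory does not raise.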
diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_3-quality_classification.py b/src/nemotron/recipes/data/curation/nemotron-cc/step_3-quality_classification.py
@@ -41,7 +41,7 @@
 import pandas as pd
 from loguru import logger
 
-from nemo_curator.backends.experimental.ray_data import RayDataExecutor
+from nemo_curator.backends.ray_data import RayDataExecutor
 from nemo_curator.core.client import RayClient
 from nemo_curator.pipeline import Pipeline
 from nemo_curator.stages.base import ProcessingStage
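Reviewer note on the repeated import change from `nemo_curator.backends.experimental.ray_data` to `nemo_curator.backends.ray_data`: the PR pins the recipe to the new path, which matches the bumped NeMo Curator requirement. For downstream code that still has to run against both layouts, one option is a small fallback helper like the sketch below. `import_first_available` is a hypothetical name, and the PR itself does not include such a shim.

```python
import importlib


def import_first_available(attr_name, module_paths):
    """Return attr_name from the first module in module_paths that provides it."""
    last_error = None
    for path in module_paths:
        try:
            module = importlib.import_module(path)
            return getattr(module, attr_name)
        except (ImportError, AttributeError) as err:
            # Module missing, or present without the attribute: try the next path.
            last_error = err
    raise ImportError(f"{attr_name} not found in any of {module_paths}") from last_error


# Module paths taken from the diff, newest first.
RAY_DATA_EXECUTOR_PATHS = [
    "nemo_curator.backends.ray_data",               # path this PR switches to
    "nemo_curator.backends.experimental.ray_data",  # path this PR removes
]

# Usage (requires NeMo Curator to be installed, so it is left commented out):
# RayDataExecutor = import_first_available("RayDataExecutor", RAY_DATA_EXECUTOR_PATHS)
```

Pinning a single path, as the PR does, is simpler and is the better choice when the minimum version is enforced. The shim only helps code that must run against both layouts.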